edev: E2 index (idea) by getha

This is my first venture into the edev realm...
This idea is a ramble in progress, meaning that it's not all that
clear in my head yet. Subject to change and comments are
very much appreciated...

edev: Everything₂ indexing - a higher level power?

A couple of times I have run into noders lamenting the absence of the ability to search on wu text itself. I have - on occasion - also lamented that same. The problem is of course obvious: huge load on the server/database. The implementation of a text search on wu text will probably never see the light. And rightly so, because it would kill E2 dead in its tracks.

I have lately been playing with an idea, though, that could possibly add greatly to the searching capabilities, while still keeping server and database load at an acceptable level (I think¹).

This idea came forth when researching for a regular node. A book I was using for the research covered
a very much larger topic than what I wanted to node about and on top of that covered its ground in large
portions of uninterrupted text. Very hard to find parts relevant to my subject of interest. At first I
thought the index at the back would be of help, but that was so poorly executed that I soon gave up on
using that book at all (and - temporarily - on the node).

The idea behind an index is very smart, and - if executed correctly - can greatly enhance the usability of a reference work. This possibly also goes for E2.

Assumptions...

If I understand correctly, a search on E2 is done by doing an SQL query on node titles. In fact, it seems the engine does two if a node by the exact search string does not exist (the second doing a 'fuzzy' search to find closely matching node titles²).

(As an aside: I can't find the dbtable where the actual wu texts would reside. None of the dbtables, like 'e2node', 'writeup' or even 'node' have an entry where it could reside, I think, like 'heaven' does with the field 'data' that has type 'text'. Or am I looking in the wrong place?)

The reason that searching on node title is acceptable and searching on wu text is not, would presumably be simple logistics: the title is limited to a maximum of 240 chars, but wu text can be much longer. Besides, a node can contain multiple writeups, so this adds an extra factor to a search on wu text.

An index

First, let me explain what I understand to be an index in the context of this wu. An index in this sense is a list of certain words in a reference work (usually a book) that, either:

records each instance (per page, or even down to the paragraph or line number) of these words in the work
records each instance of a section (chapter, paragraph) of the work that deals - more or less - with the word in question

The purpose of this is of course ease of use. If one wants to find information about a certain topic in a large reference work, one could look up words that are connected to the topic in the index and find all the pages that contain these word or all sections that in some way deal with these words, depending on the implementation used.

Something similar could be a great asset to E2, also.

Acceptable load

But now we come to the tricky part. If we want this in E2, we want this to be implemented in a way that does not add too much strain on the server and the database. It will add to the load. No matter what. That should not be mistaken.

The question therefore is this:

is this at all possible to implement in such a way that the extra load remains within acceptable limits?

I'm not sure. But I think it possibly is.

Implementation

What would the end user see? There are a number of possibilities.

One would be to create a separate search mode just for searching the index. The results then appear on a new 'index search results'-superdoc.

Another would be to add an extra checkbox for including the index in the search. In this case the results would be appended to the usual 'search results'-superdoc.

The results themselves would be presented more or less like they would appear in a normal hardcopy reference work. Which means a list of linked node titles for each search term.

A new dbtable

On a database level this would be implemented by adding an extra dbtable to the database, somewhat similar to a simplified 'links' dbtable, I would imagine. It looks something like this:

index_term - of type char(240)
to_node - of type int(11)
possibly some extra maintenance info

Workings

This would work something like this:

user creates a wu in normal way
before submitting user sets index terms (see below for mechanism and extra comments on this)
on submit for each index term a record is added to dbtable 'e2index'
- index_term = index term
- to-node = node_id of current node (or wu? not sure)
gods and editors can add to or delete from this dbtable directly
if user edits the wu: on re-submit the old index records with its node_id are deleted and the records are added as if new, thus ensuring consistency
if the wu gets killed, so do all the index records with its node_id

Restrictions on searching on index

To minimize the extra load, searching on index terms should only be done on complete index terms, so that the query can be of the form ".. WHERE index_term = 'search string'". This means that index terms need to be chosen intelligently, which ties in with the 'Divide and conquer' heading below.

The mechanism for adding index terms

Easiest to implement would probably be an extra number of inputboxes for adding the index terms. The number would tie in with the allowed number of index terms for the user's level, of course (see below).

Another way is to extend the hardlinking code to handle index terms. Something of the form:

[ flag that this is an index term | index term to use | visible text ] This could then also function as a regular hardlink. This implementation is harder I think, also because the limit on number of index terms has to be checked and caught if not adhered to.

Update: Just remembered. The big drawback (possibly) to the use of seperate inputboxes is that this lends itself to possible abuse (for instance, adding index terms that are unrelated but thought to be often searched for - somewhat like the small sized list of terms on those pr0n sites)

Divide and conquer

This extra feature lends itself for beefing up level powers, as the searching capabilities and index term creation can be graded. For example, the following setup could be used:

Level 2:
- Gain the power to search on index terms using a separate search interface (search is done only on the complete search string)
Level 3 (or some higher level):
- Gain the power to search on index terms using an extra checkbox (only on the complete search string)
- Gain the power to add a restricted number of index terms to each wu
Level (some higher level):
- Gain the power to search on index terms using an extra checkbox (where the search is done using all the separate words in the search string³ in addition to a search on the complete search string)
- Gain the power to add more index terms to each wu

Update September 6, 2001:

I want information about the person Wilhelm Maybach, so I do a regular search. Result: no hits that have anything to do with the man
So then I turn to an index search on the same term. Now imagine that I had been a noder with the powers to add index terms to my wu's. In that case I would certainly have added them for 'Wilhelm Maybach' in both my DMG and Mercedes-Benz wu's. Result: at least two that are very close to the mark

Update October 19, 2001:

JayBonci

¹ That's what this doc's for... Tell me if I'm right or wrong
² Which I think creates a much larger load on the server, too, because this involves using a ".. WHERE title LIKE 'search string'"-type structure in the query instead of the much faster ".. WHERE title = 'search string'"
³ The string is chomped into separate words and the query is done for every (excluding common words like 'and', 'I', 'the' and others) word separately

September 6, 2001

robots.txt	What is edev?	Firm Link	The Eleven Satanic Rules of the Earth
pet bee	The principles of nuclear weapon safety and meeting girls are remarkably similar	Soft link	Everything2 Hubs
ecore	Hard link	Google	Scottish Parliament Elections 2007
Curtiss P-40 Warhawk	Wilhelm Maybach	firmlink	edev: Writeuptype bugs, or how nate got stuck between two parentheses
edev: long node title in Random Nodes fix	The Everything Library	printable version	outsourcing
Table Of Contents	User-Agent	Star Ocean: The Second Story	Nodelet

edev: E2 index (idea)

edev: Everything₂ indexing - a higher level power?

Assumptions...

An index

Acceptable load

Implementation

A new dbtable

Workings

Restrictions on searching on index

The mechanism for adding index terms

Divide and conquer

Update September 6, 2001:

Update October 19, 2001:

Recommended Reading

About Everything2

User Picks

Editor Picks

New Writeups

Login
Password

edev: E2 index (idea)

edev: Everything2 indexing - a higher level power?

Assumptions...

An index

Acceptable load

Implementation

A new dbtable

Workings

Restrictions on searching on index

The mechanism for adding index terms

Divide and conquer

Update September 6, 2001:

Update October 19, 2001:

Sign In

Recommended Reading

About Everything2

User Picks

Editor Picks

New Writeups

edev: Everything₂ indexing - a higher level power?