e2link is a small (well, not so small) utility written in Perl that is useful to automatically create hard links from word lists. I originally created it quite a while ago in order to help me node How the Internet Came to Be; then I left it for a while in my home node and now I decided to release it.

The idea behind e2link is quite easy: it will parse a word list and create everything2 hard links (i.e. add brackets) in a text of yours when a match is found. It is smart enough to correct the capitalization of the word in your text file with the correct one taken from the word list and is able to handle pipe links, i.e. links that don't link their own literal string. In addiction to this, it won't put links where you already put them by hand.

I think it can be useful to node long texts full of technical jargon or group of nodes having the same subject, therefore making more hard links and with better quality. To use it, first you have to create a text file to submit to e2 (we'll call it txtfile.txt) and a word list file (we'll call it wordfile.txt). A word list file has a single word (or phrase) per line, you can add comments starting the line with '#' and leave white lines to better suit your needs. In the case you want to create a pipe link, you just enter the pipe link in the file, e.g you separate two phrases with the pipe symbol.

An example word list would look like this:

    # This is a sample word list
    everything|Everything2
    E2|Everything2
    
    pancakes|My secret pancake recipe
    
    John Doe
    Doe
    
Note how the longest match must go first, as in the case with "John Doe" vs. "Doe". If your original text file was something like
    Mr John Doe likes e2 very much. 
    His wife Jenna Doe likes pancakes better.
This would be the result:
    Mr [John Doe] likes [Everything2|E2] very much. 
    His wife Jenna [Doe] likes [My secret pancake recipe|pancakes] better.
To invoke the script, just type
    perl e2link.pl txtfile.txt wordfile.txt > e2file.txt
    
In addition, if you want to, you can control the maximum number of times a word is linked by adding an optional third parameter stating the maximum number of istances of a word that will be linked.
    perl e2link.pl txtfile.txt wordfile.txt 3 > e2file.txt
    
Means that any word present in wordfile will not be linked more than 3 times.

If you already linked that word, manual links will not be counted. Instances are substituted starting from the beginning of the file (thanks to KissThis and liveforever for the idea and support).

Have phun, and let me know of any bugs you should encounter.


Prerequisites
  • Perl 5 installed on your machine. If you are on a Linux/Unix box, it's probably already installed. If you are on a Windows machine, you can download a Perl version for free from http://www.activestate.com
  • A shell installed. bash via Telnet or xterm will do on a Unix machine; if you run Windows you can get a shell by typing COMMAND or CMD after selecting Start/Run...
History:
  • version 1.1, as of 2003-10-16: added max number of substitutions for words
  • version 1.0, as of 2003-10-15: initali public release.
To Do:
  • Sort word list so the longest matching string goes first (isogolem)

Perl source code follows.
use strict;
#
# e2link.pl
#
#  - version 1.1 - 20031016
#
# this program nodes an input text given a keyword
# and substritution file
#
# there is a shadow file, parallel to the one we're
# editing, that is used as a lowercase search pool
# and to protect already substituted text areas 
# (they appear as a string of '-'s in $shadow)
#

# use:
#  perl e2link.pl txtfile.txt wordfile.txt > e2file.txt
#
# or:
#  perl e2link.pl txtfile.txt wordfile.txt 2 > e2file.txt
# so that no word will be substituted more than two times
#

# this script is free software. originally by [lenz]

# docs @
# http://www.everything2.com/index.pl?node=
#        e2link:+Create+everything2+links+from+word+lists


my $txtfile = $ARGV[0];
my $wordfile= $ARGV[1];
my $maxsub  = $ARGV[2] || 9999999;
my $txt;
my $shadow;
my $r;


# loads textfile into $txt and initializes
# $shadow with 
open F, $txtfile or die "$! opening text file '$txtfile'\n";
while (<F>) { $txt .= $_; }
close F;
$shadow = lc $txt;


# do the subst game
# while reading word file
open F, $wordfile or die "$! opening word file '$wordfile'\n";
while ( $r = <F> ) {
    my $nodename; my $nodedest;
    next if $r =~ /^[\s*$|#]/;
        
    chomp $r;
    
    if ( $r =~ /^(.+?)\x7C(.+?)\s*$/ ) {
        ($nodename, $nodedest) = ($1, $2);
    } elsif ( $r =~ /^\s*(.+?)\s*$/ ) {
        ($nodename, $nodedest) = ($1, "");
    };
        
    subst( $nodename, $nodedest, $maxsub );

}
close F;

print "\n$txt\n";
exit;


sub subst() {
    my ($nodename, $nodedest, $maxsubst) = @_;
    
    #
    # This function substitutes $nodename (even if in [brackets])
    # with a suitable combination of $nodename and $nodedest
    #
    
    my $ln = length( $nodename );
    my $char_pre; my $char_post;
    my $alphalist = "abcdefrghijklmnopqrstuvwxyz01234567890";
    my $lcnodename = lc $nodename;
    my $donesubst = 1;    # number of substitutions 
                          # done for this word
    
    if ( $ln == 0 ) { return; };
    
    my $pos = 0;
    
    
    while ( (($pos = index( $shadow, $lcnodename, $pos )) > 0) 
            && ($donesubst <= $maxsubst) ){
        
        $ln = length( $nodename );
        
        # check boundary chars: is this match a part of a bigger word?
        $char_pre = substr( $shadow, $pos - 1, 1 );
        $char_post= substr( $shadow, $pos + $ln, 1 );
        
        if ( ( index( $alphalist, $char_pre) > -1 ) ||
             ( index( $alphalist, $char_post) > -1 ) ) {
             	$pos++;
             	next;
        }
        
        # check if already enclosed in square brackets
        
        if ( ( $char_pre  eq '[' ) &&
             ( $char_post eq ']' ) ) {
            $pos -= 1;
            $ln  += 2;
        }
        
            
        if ( length( $nodedest ) > 1 ) {
            repl( "[$nodedest|$nodename]", $pos, $ln );
        } else {
            repl( "[$nodename]", $pos, $ln );
        }
        $donesubst++;
    }
}


sub repl() {
    my ( $node, $start, $len ) = @_;
    #
    # this function replaces $len chars starting at $start with 
    # $node in the global $txt and with a sequence of ---'s
    # having the same length as $node in $shadow
    #
    
    my $it = substr( $txt, 0, $start );
    my $ft = substr( $txt, $start + $len  );
    my $is = substr( $shadow, 0, $start);
    my $fs = substr( $shadow, $start + $len  );
    
    $txt    = $it . $node . $ft;
    $shadow = $is . ( '-' x length($node) ) . $fs;
}

Log in or register to write something here or to contact authors.