README for Multi-Stage Matcher
Version 0.5 (July 27, 2004)

Satanjeev Banerjee (satanjeev AT cmu.edu)
Alon Lavie (alavie AT cs.cmu.edu)


1. Introduction
===============

This software takes two strings of space-separated words as input and
aligns matching words between the two strings. Alignment is done over
several stages, where each stage uses a different criterion to find
candidate matching tokens from the two strings to align. Supported
criteria are "exact", "porter_stem" and "wn_stem" (details below).


2. Code Organization
====================

The software is organized in modules. The overall algorithm is divided
into two main parts: the matching algorithm, which returns candidate
token matches between the two strings, and the aligning algorithm.

There are several implemented matching algorithms, each in a Perl
module of its own:

exact.pm:
    Returns tokens from the two strings that are exact matches of each
    other.

porter_stem.pm:
    Returns tokens from the two strings that match each other after
    being stemmed using the Porter stemmer.

wn_stem.pm:
    Same as porter_stem, but stemming is done using WordNet.

wn_synonymy.pm:
    For each token in the second string, returns the first token (if
    any), going left to right in the first string, with which it shares
    at least one synset in WordNet.

Given candidate matches between tokens in the two strings, the
algorithm that actually constructs an alignment between the two strings
is implemented in the Perl module mStageMatcher.pm.

The program standAloneMatcher.pl includes mStageMatcher.pm and uses it
to match and align two sentences contained in an input text file. This
program shows how to use mStageMatcher.pm from inside a program.


3. How to Run standAloneMatcher.pl
==================================

One or more of the matching modules may be used in any order to run the
program.

To run the program with only exact match, run it like so:

    perl standAloneMatcher.pl input.txt exact

The input file (input.txt in the above example) should have the two
strings of words, each on a line of its own - the second string will be
aligned to the first one.

The output format is as follows:

Line 1: (# of stages)
Line 2: (# of matched words in stage 1) (# of flips needed in aligning words in stage 1)
Line 3: (# of matched words in stage 2) (# of flips needed in aligning words in stage 2)
.
.
.
Line n: (# of chunks) (average chunk length)

To run the program with only porter stemming, run it like so:

    perl standAloneMatcher.pl input.txt porter_stem

To run it with first the exact and then the wn_stem, run it like so:

    perl standAloneMatcher.pl input.txt exact wn_stem

Use the --details flag to get the actual final alignment:

    perl standAloneMatcher.pl --details input.txt

[Note: The WordNet loading module takes a once-per-program loading time
of about 3 seconds on a 2.4 GHz 1GB RAM machine].
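
For scripts that consume the program's output, the format described in
Section 3 could be read back along the following lines. This is a
minimal sketch in Python, assuming that each output line holds only the
bare numbers separated by whitespace, with no labels. That detail is an
assumption, not something this README specifies. The function name
parse_matcher_output and the sample text are hypothetical.

```python
def parse_matcher_output(text):
    """Parse output in the format described in Section 3.

    Assumes bare whitespace-separated numbers on each line:
    line 1 is the stage count, then one "matched flips" line per
    stage, then a final "chunks average_chunk_length" line.
    """
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    num_stages = int(rows[0][0])

    stages = []
    for fields in rows[1:1 + num_stages]:
        stages.append({"matched": int(fields[0]),
                       "flips": int(fields[1])})

    chunks, avg_len = rows[1 + num_stages][:2]
    return {
        "num_stages": num_stages,
        "stages": stages,
        "chunks": int(chunks),
        "avg_chunk_length": float(avg_len),
    }


if __name__ == "__main__":
    # Made-up sample text, not real program output.
    sample = "2\n5 1\n1 0\n3 2.0\n"
    print(parse_matcher_output(sample))
```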