README for mStageMatcher.pl
Version 0.4 (July 16, 2004)

Satanjeev Banerjee (satanjeev AT cmu.edu)
Alon Lavie (alavie AT cs.cmu.edu)

This program takes two strings of space-separated words as input and
aligns matching words between the two strings. Alignment is done over
several stages, where each stage uses a different criterion to find
candidate matching tokens from the two strings to align. Supported
criteria are "exact", "porter_stem", "wn_stem" and "wn_synonymy"
(details below).

The program is organized in modules. The main program mStageMatcher.pl
contains only the code that aligns matched words. The code that finds
the candidate matched words from the two strings is in four separate
modules:

  exact.pm        finds exact matches between the two strings.

  porter_stem.pm  finds matches between the two strings after the
                  tokens are stemmed using the Porter stemmer.

  wn_stem.pm      finds matches between the two strings after the
                  tokens are stemmed using WordNet's validForms
                  function.

  wn_synonymy.pm  finds matches between the two strings based on
                  synonymy between the tokens.

One or more of these matching modules may be used, in any order, to
run the program. To run the program with only exact match, run it like
so:

  perl mStageMatcher.pl input.txt exact

The input file (input.txt in the above example) should contain the two
strings of words, each on a line of its own; the second string is
aligned to the first one.

The output format is as follows:

  Line 1: (# of stages)
  Line 2: (# of matched words in stage 1) (# of flips needed in aligning words in stage 1)
  Line 3: (# of matched words in stage 2) (# of flips needed in aligning words in stage 2)
  ...
  Line n: (# of chunks) (average chunk length)

To run the program with only Porter stemming, run it like so:

  perl mStageMatcher.pl input.txt porter_stem

To run it with first the exact and then the wn_stem module, run it
like so:

  perl mStageMatcher.pl input.txt exact wn_stem

Use the --details flag to get the actual final alignment:

  perl mStageMatcher.pl --details input.txt

[Note: The WordNet loading module takes a once-per-program loading
time of about 3 seconds on a 2.4 GHz machine with 1 GB of RAM.]

For really long inputs, use the --maxComputations switch to specify
how many computations to perform before aborting. For example:

  perl mStageMatcher.pl --maxComputations 1000 input.txt exact

Here, the program performs 1000 computations and then outputs the best
alignment found so far. If it has not found any alignment by that
point, it continues searching and, as soon as one is found, aborts and
returns that alignment.
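
The output format described above (a stage count, one line per stage, then a chunk line) is easy to consume from another script. The following is a minimal sketch in Python, not part of the mStageMatcher distribution. It assumes that the numbers on each line are separated by whitespace, that the matched-word and flip counts are integers, and that the average chunk length is a decimal number; the names StageResult, MatcherSummary and parse_summary are hypothetical, and the sample text is made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageResult:
    matched: int  # words matched in this stage
    flips: int    # flips needed to align them


@dataclass
class MatcherSummary:
    stages: List[StageResult]
    chunks: int
    avg_chunk_length: float


def parse_summary(text: str) -> MatcherSummary:
    """Parse the default (non --details) output layout from the README.

    Assumption: fields on a line are whitespace-separated numbers.
    """
    # Split every non-blank line into its whitespace-separated fields.
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty output")

    # Line 1 holds the number of stages.
    n_stages = int(lines[0][0])
    if len(lines) < n_stages + 2:
        raise ValueError("output is shorter than the stage count implies")

    # The next n_stages lines hold "(matched words) (flips)" per stage.
    stages = [
        StageResult(matched=int(fields[0]), flips=int(fields[1]))
        for fields in lines[1:1 + n_stages]
    ]

    # The line after the stages holds "(chunks) (average chunk length)".
    chunk_fields = lines[1 + n_stages]
    return MatcherSummary(
        stages=stages,
        chunks=int(chunk_fields[0]),
        avg_chunk_length=float(chunk_fields[1]),
    )


if __name__ == "__main__":
    # Made-up sample for illustration only; not real program output.
    sample = "2\n4 1\n1 0\n3 1.67\n"
    summary = parse_summary(sample)
    print(len(summary.stages), summary.stages[0].matched, summary.chunks)
    # prints: 2 4 3
```

In practice the text would come from running the program, for example by capturing the standard output of "perl mStageMatcher.pl input.txt exact wn_stem" with Python's subprocess.run. If the real output carries extra text on these lines, the field indexing above would need adjusting.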