README file for align.C

Adam Berger 
February, 2000

(See align.C for copyright notice)


*** INTRODUCTION ***

_Align_ is a C++ package for aligning, at the sentence level, a pair
of text files which are translations of one another.  Each file may
contain "spurious" (extra) sentences that do not appear in the other
file, and the translations may be impressionistic rather than literal.
Relying on dynamic programming and a user-provided routine for
calculating the probability of a word-to-word translation between the
two languages, _Align_ will (ideally, anyway) weave an optimal
sentence-to-sentence alignment between the two files.

_Align_ takes as input a pair of ASCII files to be aligned.  Each file
contains one "sentence" per line, the words of which are
space-delimited. That is, newlines delimit sentences, and spaces
delimit words. I put the word "sentence" in quotes because _Align_
doesn't actually care what syntactic units appear on each line;
however, the output of _Align_ will be an alignment between lines of
the input files. (If you so desire, you may put paragraphs or just
phrases on each line, to align at a coarser or finer level of
granularity.)
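
As an illustration of the input format only, here is a minimal sketch
of reading such a file into memory. The names Sentence and
readSentences are hypothetical and do not come from align.C; the
sketch also treats tabs like spaces, which is an assumption.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One "sentence" is simply the list of words found on one input line.
using Sentence = std::vector<std::string>;

// Hypothetical helper (not part of align.C): reads one sentence per line
// and splits each line into whitespace-delimited words.
std::vector<Sentence> readSentences(std::istream& in) {
    std::vector<Sentence> sentences;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        Sentence sentence;
        std::string word;
        while (words >> word) {
            sentence.push_back(word);
        }
        // A blank line is kept as an empty sentence so that line numbers
        // in the output still correspond to lines of the input file.
        sentences.push_back(sentence);
    }
    return sentences;
}
```

Passing a std::ifstream opened on each of the two input files would
yield the two sentence lists to be aligned.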

*** CODE-RELATED NOTES ***

There should be nothing you, the user of this code, need to modify in
any part of the code except User.[CH], where you *will* be required to
fill in some empty functions. The most important component there is a
scoring routine which gives the probability that a "French" word is
the proper translation of an "English" word. The alignment program
relies on this scoring function to guide its alignment: two sentences
containing words that are likely translations of one another are
probably themselves translations of one another.
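
To make that idea concrete, here is one possible way to turn word
scores into a sentence-pair score. It is a sketch under stated
assumptions, not the scoring used inside align.C: the names WordProb
and sentencePairLogScore are hypothetical, and averaging over the
English words is just one simple choice.

```cpp
#include <cmath>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical signature for a word scorer: the probability that `french`
// is the proper translation of `english`.
using WordProb =
    std::function<double(const std::string& french, const std::string& english)>;

// Hypothetical sentence-pair score: for each French word, average its
// translation probability over the English words, then sum the logs.
// Higher (closer to zero) means the sentences look more like translations.
double sentencePairLogScore(const std::vector<std::string>& french,
                            const std::vector<std::string>& english,
                            const WordProb& wordProb) {
    if (french.empty() || english.empty()) {
        return -std::numeric_limits<double>::infinity();
    }
    double logScore = 0.0;
    for (const std::string& f : french) {
        double sum = 0.0;
        for (const std::string& e : english) {
            sum += wordProb(f, e);
        }
        // If no English word explains f at all, the log is -infinity.
        logScore += std::log(sum / static_cast<double>(english.size()));
    }
    return logScore;
}
```

Working in log space keeps long sentences from underflowing to zero.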

One can think of this probabilistic word-to-word translation model as
an N by M matrix, where N is the number of recognized French words and
M is the number of recognized English words. The model doesn't have to
be highly accurate, but the better the probabilities are, the better
the alignments are likely to be.
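
Since most entries of such a matrix would be close to zero, one way to
hold it is a sparse table. The class below is a hypothetical sketch;
neither the name TranslationTable nor the floor value for unseen pairs
comes from User.[CH].

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical sparse stand-in for the N by M matrix of word-to-word
// translation probabilities.
class TranslationTable {
public:
    // `floor` is the probability returned for pairs that were never set
    // (an assumption of this sketch, so unseen pairs are not impossible).
    explicit TranslationTable(double floor = 1e-6) : floor_(floor) {}

    void set(const std::string& french, const std::string& english, double p) {
        table_[std::make_pair(french, english)] = p;
    }

    // Probability that `french` is the proper translation of `english`.
    double prob(const std::string& french, const std::string& english) const {
        auto it = table_.find(std::make_pair(french, english));
        return it == table_.end() ? floor_ : it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, double> table_;
    double floor_;
};
```

A table like this could back whatever scoring function you fill in,
for example after calling set("maison", "house", 0.8).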

This code attempts to align the input text on a sentence-by-sentence
level. Of course, some bilingual corpora are actually aligned at a
much finer grain: at the phrase level, say. This program guarantees
only that the resulting alignment will be the optimal alignment
(relative to the user-provided translation probabilities and
user-provided thresholds) at the sentence level.
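
For readers unfamiliar with dynamic-programming alignment, the general
technique can be sketched as follows. This is not the routine in
align.C: the function name alignSentences, the skipPenalty parameter,
and the restriction to one-to-one matches and skipped (spurious) lines
are all assumptions made for the illustration.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical sketch. matchScore[i][j] is the (log) score for pairing
// English line i with French line j; skipPenalty is charged for each line
// left unaligned. Returns the matched (English, French) index pairs.
std::vector<std::pair<int, int>> alignSentences(
        const std::vector<std::vector<double>>& matchScore,
        double skipPenalty) {
    const int n = static_cast<int>(matchScore.size());
    const int m = n > 0 ? static_cast<int>(matchScore[0].size()) : 0;

    // best[i][j]: best score for the first i English and first j French lines.
    std::vector<std::vector<double>> best(n + 1, std::vector<double>(m + 1, 0.0));
    // step[i][j]: 'M' = match, 'E' = skip English line, 'F' = skip French line.
    std::vector<std::vector<char>> step(n + 1, std::vector<char>(m + 1, 'E'));

    for (int i = 1; i <= n; ++i) best[i][0] = best[i - 1][0] - skipPenalty;
    for (int j = 1; j <= m; ++j) best[0][j] = best[0][j - 1] - skipPenalty;

    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            const double match = best[i - 1][j - 1] + matchScore[i - 1][j - 1];
            const double skipE = best[i - 1][j] - skipPenalty;
            const double skipF = best[i][j - 1] - skipPenalty;
            if (match >= skipE && match >= skipF) {
                best[i][j] = match; step[i][j] = 'M';
            } else if (skipE >= skipF) {
                best[i][j] = skipE; step[i][j] = 'E';
            } else {
                best[i][j] = skipF; step[i][j] = 'F';
            }
        }
    }

    // Walk back from the final cell, collecting the matched pairs.
    std::vector<std::pair<int, int>> pairs;
    int i = n, j = m;
    while (i > 0 && j > 0) {
        if (step[i][j] == 'M') { pairs.emplace_back(i - 1, j - 1); --i; --j; }
        else if (step[i][j] == 'E') { --i; }
        else { --j; }
    }
    std::reverse(pairs.begin(), pairs.end());
    return pairs;
}
```

The result is optimal only relative to the scores and penalty supplied,
which mirrors the caveat above about user-provided probabilities and
thresholds.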

This code was originally intended for use in aligning the "Hansards":
the proceedings of the Canadian Parliament. That explains the use of
"French" and "English" in the code.  Despite this notation, the code
makes no explicit assumptions about the identity of the underlying
languages.

*** A WORD ABOUT ANCHORS ***

_Align_ looks for (but does not insist on) special "anchors" in the
files. These are lines of the form

 =t= [some anchor label] =t=

The program guarantees that, in the resulting alignment, anchors with
the same label in the two files are aligned with each other. The
actual spelling of the '=t=' anchor symbol is a run-time parameter.
(The intelligent thing to do, of course, is to use a symbol which is
not a word in either language.)
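
As a rough illustration of the anchor format, a line could be tested
for an anchor like this. The function parseAnchor is hypothetical, and
the assumption that the anchor symbol must appear as a separate
space-delimited word at both ends of the line may not match exactly
what align.C accepts.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of align.C). Returns true if `line` has the
// form "<symbol> some label <symbol>" and stores the label in `label`.
bool parseAnchor(const std::string& line, const std::string& symbol,
                 std::string& label) {
    std::istringstream in(line);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    // An anchor needs the symbol as both its first and its last word.
    if (words.size() < 2 || words.front() != symbol || words.back() != symbol) {
        return false;
    }
    // The label is everything between the two symbols, joined by spaces.
    label.clear();
    for (std::size_t i = 1; i + 1 < words.size(); ++i) {
        if (!label.empty()) {
            label += ' ';
        }
        label += words[i];
    }
    return true;
}
```

Two anchors would then be considered the same when their labels
compare equal.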


*** RUNNING THE PROGRAM ***

The program is meant to be compiled within a Unix-type environment.  A
makefile is provided. Invoke align with no arguments to get the proper
usage.


*** COPYRIGHT NOTICE ***

Copyright (C) 2000, Carnegie Mellon University and Adam Berger
All rights reserved.

This software package, including the documentation and makefile,
is made available for research purposes only.  It may be
redistributed freely for this purpose, in full or in part, provided
that this entire copyright notice is included on any copies of this
software and applications and derivations thereof.

This software is provided on an "as is" basis, without warranty of any
kind, either expressed or implied, as to any matter including, but not
limited to warranty of fitness of purpose, or merchantability, or
results obtained from use of this software.

You are welcome to send email to me, the developer, at
aberger@cs.cmu.edu, with bug reports or feature requests. I can't 
promise to address your concern or even reply to your email, but I
will try. This was not originally intended for public consumption, 
but I decided to make it available to the research community after I 
received a request for the code.

