
Description of File Format for Elicitation Interface
Erik Peterson
May 15, 2002


   The Elicitation Interface provides a simple, graphical tool for
eliciting information about a language.  The user is presented with a
sentence in one language (possibly English or Spanish) and is then
asked to translate the sentence into another language.  Next the user
indicates the words in the source sentence that best correspond to the
words in the target sentence.  Word alignments are stored as a
sequence of number pairs.  The interface can also display the context
of the sentence (e.g. the gender of the participants or the time it
occurs) and allow the user to type in their own separate comments.

   Before this elicitation can occur, the elicitation corpus must be
prepared as a plain text file in the format described here.  Apart
from the bare tags "newpair" and "alternate", each line starts with a
tag, followed by a colon ":", a space and then the content (all on
one line).  Blank lines are ignored.  The file starts with an optional
line indicating the character encoding used by the file.  This can be
any of the encodings supported by Java.  A list of these encodings can
be found
at http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html.  If
the encoding line is not included, the file is assumed to be in the
default encoding for the operating system.  If used, the encoding line
must be the first line of the file.  For example, for a file in Chinese,
the encoding line is:

encoding: GB2312
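
The rule above can be sketched in a few lines of Python.  This is only
an illustration, not the interface's own code: the function name
detect_encoding is hypothetical, the sketch assumes the encoding line
itself is written in ASCII-compatible bytes, and Java encoding names
do not always match Python codec names (GB2312 happens to work in
both).

```python
import locale


def detect_encoding(raw: bytes) -> str:
    """Return the encoding named on the first line, or the system default."""
    # Assumption: the encoding line is readable as ASCII before decoding.
    first_line = raw.split(b"\n", 1)[0].decode("ascii", errors="replace").strip()
    tag = "encoding:"
    if first_line.startswith(tag):
        return first_line[len(tag):].strip()
    # No encoding line: fall back to the operating system default.
    return locale.getpreferredencoding(False)


sample = "encoding: GB2312\nsrclang: English\ntgtlang: Mandarin\n".encode("gb2312")
print(detect_encoding(sample))
```

The returned name can then be used to decode the rest of the file.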

Next in the header are two lines indicating the source and target
languages.  For example, if the source language were English and the
user were to translate into the target language Mandarin, the header
would be:

srclang: English
tgtlang: Mandarin

"English" and "Mandarin" would be replaced with the names of whatever
languages were actually used.
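
As a rough sketch of how a program might read these header lines, the
Python below collects every tagged line that appears before the first
"newpair" line.  The function name read_header is hypothetical, and
treating the first "newpair" as the end of the header is an assumption
based on the layout described in this document.

```python
def read_header(lines):
    """Collect header tags that appear before the first "newpair" line."""
    header = {}
    for line in lines:
        if not line.strip():
            continue  # blank lines are ignored
        if line.strip() == "newpair":
            break  # assumption: the header ends at the first sentence group
        tag, _, content = line.partition(":")
        header[tag.strip()] = content.strip()
    return header


sample = ["encoding: GB2312", "", "srclang: English", "tgtlang: Mandarin", "newpair"]
header = read_header(sample)
print(header["srclang"], "->", header["tgtlang"])
```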


   After the header come one or more sentence groups, one for each
elicitation sentence.  Each sentence group starts with the tag
"newpair".  The other tags in a group are:

1. "srcsent: " followed by the source language sentence.
2. "tgtsent: " followed by the target language sentence, or followed
by nothing if the informant is to provide the translation.
3. "aligned: " followed by the indices of aligned word pairs.
Initially, this should just be followed by a pair of parentheses "()".
4. "context: " followed by any additional contextual information you
want the user to know about the situation the source sentence occurs
in.
5. "comment: " followed by any informant-supplied comment on the
translation or alignments.
6. "alternate" (optional) indicates that this translation is an
alternate translation of the previous sentence pair.

Note the single space after the colon for each tag.  Even tags with
nothing after them (except "newpair" and "alternate", which take no
colon) should still have this space.


Some sample initial sentence groups are:

newpair
srcsent: This is the first sentence in English.
tgtsent: 
aligned: ()
context: Here is some context on the sentence.
comment: 

newpair
srcsent: This is the second sentence in English.
tgtsent: 
aligned: ()
context: Here is some more context.
comment: 
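
A minimal Python sketch of reading such groups follows.  It is not the
interface's own parser: the function name parse_groups is
hypothetical, and the sketch assumes that "alternate", when present,
sits on its own line inside the group it marks.

```python
def parse_groups(lines):
    """Return one dict per "newpair" group, in file order."""
    groups = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines are ignored
        if line.strip() == "newpair":
            current = {"alternate": False}
            groups.append(current)
        elif current is None:
            continue  # header line before the first group
        elif line.strip() == "alternate":
            # Assumption: "alternate" is a bare line inside its group.
            current["alternate"] = True
        else:
            tag, _, content = line.partition(":")
            # Drop the single space that follows the colon.
            current[tag] = content[1:] if content.startswith(" ") else content
    return groups


sample = [
    "srclang: English",
    "tgtlang: Mandarin",
    "",
    "newpair",
    "srcsent: This is the first sentence in English.",
    "tgtsent: ",
    "aligned: ()",
    "context: Here is some context on the sentence.",
    "comment: ",
]
groups = parse_groups(sample)
print(groups[0]["srcsent"])
```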


    Words in the source and target sentences are white-space separated
groups of non-white space characters.  Some preprocessing may be
necessary on the source text to add in spaces so that all words can be
identified correctly.
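
The word definition above corresponds to a plain whitespace split, as
this Python sketch shows.  The space_punctuation helper is a
hypothetical example of one kind of preprocessing (separating
punctuation from the neighbouring word); whether a given corpus needs
it, or needs something different such as segmenting text written
without spaces, is a choice left to whoever prepares the file.

```python
import re


def words(sentence):
    """Split on runs of whitespace; everything else belongs to a word."""
    return sentence.split()


def space_punctuation(sentence):
    """Hypothetical preprocessing: surround common punctuation with spaces."""
    return re.sub(r"([.,!?;:])", r" \1 ", sentence)


raw = "This is the first sentence in English."
print(words(raw))                     # "English." stays one word
print(words(space_punctuation(raw)))  # "." becomes its own word
```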

    The "tgtsent" line is left blank until filled in by the informant.
If translations already exist, they can be included, for example when
only word alignment is needed.  The "comment" and "context" lines can
remain empty.

    The rest of the elicitation corpus file includes a newpair group
for each elicitation sentence.  No special commands are needed at the
end of the corpus.
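
Putting the pieces together, an initial corpus file could be generated
along the following lines.  This Python sketch is only an
illustration: the function name initial_corpus and its parameters are
hypothetical, and the blank line between groups is optional since
blank lines are ignored.

```python
def initial_corpus(src_lang, tgt_lang, sentences, encoding=None):
    """Build the text of an initial corpus file.

    sentences is a list of (source sentence, context) pairs.
    """
    lines = []
    if encoding:
        lines.append(f"encoding: {encoding}")  # must be the first line
    lines.append(f"srclang: {src_lang}")
    lines.append(f"tgtlang: {tgt_lang}")
    for src, context in sentences:
        lines.append("")             # blank separator, ignored on reading
        lines.append("newpair")
        lines.append(f"srcsent: {src}")
        lines.append("tgtsent: ")    # empty, but the space is kept
        lines.append("aligned: ()")
        lines.append(f"context: {context}")
        lines.append("comment: ")
    return "\n".join(lines) + "\n"


text = initial_corpus(
    "English",
    "Mandarin",
    [("This is the first sentence in English.",
      "Here is some context on the sentence.")],
)
print(text)
```

When writing the result to disk, the file should be encoded with the
same encoding named on the encoding line, if one is given.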

