TRAINING
---------

Train phoneme models using all the data in wav/train.

The list of files to be used for training is found in etc/an4_train.fileids.

The transcriptions for these files are in etc/an4_train.transcription. The
transcription file has lines of the type:

ENTER SIX TWO FOUR (an87-fbbh-b)

This indicates that the file an87-fbbh-b (which is actually in the directory
train/fbbh) contains the word sequence "ENTER SIX TWO FOUR". To train models for
the phonemes, you can read the pronunciations for these words from the
dictionary and expand them into the following phoneme sequence:

SIL EH N T ER  S IH K S T UW F AO R SIL

Note the silence padding on both sides. Every recording has some silence at
the beginning and end; the silence models have been inserted to capture it.
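As a hedged sketch, the expansion step might look like the following (the function name is illustrative; the pronunciations are the ones quoted in this file):

```python
# Illustrative sketch of transcription expansion; pron_dict holds the
# pronunciations quoted in this file.

def expand_transcription(line, pron_dict):
    """Turn 'ENTER SIX TWO FOUR (an87-fbbh-b)' into a SIL-padded phoneme list."""
    words = line.split()
    if words and words[-1].startswith("("):    # drop the trailing file id
        words = words[:-1]
    phonemes = ["SIL"]
    for word in words:
        phonemes.extend(pron_dict[word])
    phonemes.append("SIL")
    return phonemes

pron_dict = {
    "ENTER": ["EH", "N", "T", "ER"],
    "SIX":   ["S", "IH", "K", "S"],
    "TWO":   ["T", "UW"],
    "FOUR":  ["F", "AO", "R"],
}
seq = expand_transcription("ENTER SIX TWO FOUR (an87-fbbh-b)", pron_dict)
# " ".join(seq) -> "SIL EH N T ER S IH K S T UW F AO R SIL"
```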


TESTING
-------

Recognize the 130 files listed in etc/an4_test.fileids using the language 
model in etc/an4.trigramlm. Use an approximate decoding strategy for this.
You may use either the single lextree or the flat bigram structure. For the
former, you will have to represent all the words in the dictionary as a
lextree. For the latter, you must represent all of them as a single bigram
graph.  Use the trigram LM provided to assign probabilities during
recognition.

Given below are several important notes on 
1) the dictionary, 
2) the format of the LM,
3) how to store the LM,
4) how to read the LM, and
5) the language weight.

Please read all notes. Notes 1-4 explain the dictionary and LM formats and
how to use them for recognition. Note 5 explains a key heuristic
without which N-gram based recognition simply will not work.

----------------------------------------------------------------------

1. SOME NOTES ON THE DICTIONARY
-------------------------------

1.1:
-------

In the dictionary you will see entries such as the following:

ENTER    EH N T ER
ENTER(2)  EH N ER

This means that the dictionary lists two ways of pronouncing the word "ENTER".
In general dictionaries may list any number of ways of pronouncing words as
"alternate pronunciations". Some words may have 10 or more ways of being 
pronounced (although in the given dictionary we have not listed more than
2 ways of pronouncing any word).

If you are using the single lextree structure, you must include *both*
EH N T ER and EH N ER in the lextree. Any partial path exiting either of these
must be given the probability P(ENTER | previous_words).
"previous_words" would be "<s>" if ENTER is the first word in the hypothesis 
or the previous two words (one of which may be <s>) for instances of "ENTER"
that occur after the first position in the sentence.

If you are using a flat bigram structure, you must include *both* EH N T ER
and EH N ER in the bigram structure. Any partial path entering either of these
pronunciations will carry the probability P(ENTER | previous_words).
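For reference, a minimal sketch of reading such entries into a table of alternate pronunciations (the "(2)" suffix handling is an assumption based on the entry format shown above):

```python
# Sketch of dictionary parsing; alternates like "ENTER(2)" are folded back
# under their base word.

def parse_dictionary(lines):
    """Map each base word to a list of pronunciations (phoneme lists)."""
    prons = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        word = fields[0].split("(")[0]    # "ENTER(2)" -> "ENTER"
        prons.setdefault(word, []).append(fields[1:])
    return prons

prons = parse_dictionary(["ENTER    EH N T ER", "ENTER(2)  EH N ER"])
# prons["ENTER"] now holds both pronunciations under the single word "ENTER"
```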


1.2:
------

In the dictionary both the words "O" and "OH" have the same pronunciation:

O   OW
..
OH  OW

In general, you will frequently find words with identical pronunciations
in any dictionary.

If you're using a single lextree structure, keep the final phoneme of any
word in the lextree as a distinct entry. E.g., if we had words RED and READY

READY R EH D IY
RED  R EH D

the portion of the lextree corresponding to this would be:

R-EH-D
   |
   -D-IY

Note that we are maintaining TWO copies of "D", so that we can distinctly
associate the final phoneme at a leaf in the lextree with "RED".

Now, if we also had "READ" with pronunciation "R EH D", i.e., if the 
dictionary had entries:

READ  R EH D
READY R EH D IY
RED   R EH D

The lextree would look like

R-EH-D
   |
   -D
   |
   -D-IY

Note that the final phonemes of RED and READ are separate although both
words have the same pronunciation. This way, when a path exits the final
phoneme of a word, you will know exactly which word it was.


If you're using a flat bigram structure the problem is simpler -- each word
has its own model, even if the models are composed identically. So, in the
above example, we'd have a model for READ and another for RED in the flat
bigram structure. 
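The bookkeeping above can be sketched as follows (class and field names are illustrative assumptions, not a prescribed implementation):

```python
# Minimal lextree sketch: interior phonemes are shared, but the final phoneme
# of every pronunciation gets its own leaf node, so the word a path exits
# from is always unambiguous.

class Node:
    def __init__(self, phoneme):
        self.phoneme = phoneme
        self.children = {}    # shared interior nodes, keyed by phoneme
        self.leaves = []      # distinct final-phoneme nodes
        self.word = None      # set on leaf nodes only

def add_word(root, word, phonemes):
    node = root
    for p in phonemes[:-1]:            # share all but the last phoneme
        node = node.children.setdefault(p, Node(p))
    leaf = Node(phonemes[-1])          # never shared: one leaf per word
    leaf.word = word
    node.leaves.append(leaf)

root = Node(None)
add_word(root, "READ",  ["R", "EH", "D"])
add_word(root, "READY", ["R", "EH", "D", "IY"])
add_word(root, "RED",   ["R", "EH", "D"])
# The node reached via R -> EH now has two separate "D" leaves (READ and RED)
# plus a shared interior "D" leading to the "IY" leaf for READY, matching the
# picture above.
```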

----------------------------------------------------------------------
2. THE LANGUAGE MODEL FORMAT
-----------------------------

The most complex portion of the assignment is loading up the language model.
The LM is in the ARPA BO format.  The first 4 lines of the LM file are:

\data\
ngram 1=103
ngram 2=1256
ngram 3=786

The \data\ is just a markup indicating that the definition of the LM follows.
The "ngram 1=103" line indicates that there are 103 unigrams, i.e. a
vocabulary of 103 words (<s> and </s> are counted as words, so there are
101 words besides <s> and </s>). The second ngram line indicates that there
are 1256 bigrams in the language model. The third line states that there
are 786 trigrams.

The next section of the LM begins as follows:

\1-grams:

-0.8287605      </s>    0
-99     <s>     -0.8484463
-1.48335        A       -0.6743829

The \1-grams: line is a markup indicating that the unigram section of the
LM has begun.  Each line contains entries pertaining to a single word.
E.g. the line corresponding to "A" above states that

logbase10(P(A)) =  -1.48335
logbase10(backoff(A)) = -0.6743829

Note that P(A) is a UNIGRAM probability. But Backoff(A) is a backoff
weight required to compute BIGRAM probabilities for unseen (not explicitly
listed in the LM file) bigrams of the kind P(word | A).

There are 103 "unigram" lines corresponding to the 103 unigrams.
Following the unigram section the file has the bigram section which
begins as follows:

\2-grams:

-1.800717       <s> A   -0.1356221
-3.076865       <s> APRIL       0
-3.076865       <s> AREA        0
-2.374748       <s> AUGUST      -0.1798202


The \2-grams: is a markup indicating that the bigram section has begun.
Each subsequent line represents a bigram. The interpretation of the lines
is as follows:

-2.374748       <s> AUGUST      -0.1798202

indicates that

logbase10(P(AUGUST | <s>)) = -2.374748
logbase10(BACKOFF(<s> AUGUST)) = -0.1798202

There are 1256 bigram lines. The term P(AUGUST | <s>) is a bigram probability,
but BACKOFF(<s> AUGUST) is required to compute unseen trigrams!

The bigram section is followed by:


\3-grams:

-1.160794       <s> A G
-0.8648967      <s> A L


\3-grams: is a markup indicating that the trigram section has begun.
The line 

-1.160794       <s> A G

indicates that

logbase10(P(G | <s> A)) = -1.160794

Note that it is the *third* word in the sequence whose probability is given,
conditioned on the first two words. Note also that no backoff score is
given. We do not need word-triplet backoffs since we will not be computing
4-gram probabilities.

----------------------------------------------------------------------
3. LOADING THE LM
----------------

The speed of recognition will be heavily dependent on how efficiently you
store the LM and how fast you can retrieve a trigram probability from it.

Since the LM for this homework is small, we can use a simple structure for
it, but more generally, complex efficient structures are required. 

Below I list two formats. The first is a simple format that should be
sufficient for this homework, but will not be useful for vocabularies of more
than a couple of hundred words. The second is a more efficient structure
that could be used for vocabularies of several hundred thousand words.
For even larger LMs, distributed structures are required.

3.1 HASH:
--------

The LM in the homework has a vocabulary of only 103 words and a total of fewer
than 2000 N-grams, all inclusive. To store this you can simply use a hash.
The key into the hash will be the word sequence. E.g. for unigrams, the keys
will be entries such as "NOVEMBER". For bigrams, the keys would be
"<s> NOVEMBER" (note the space). For trigrams, they would have the format
"<s> A G".

The data fields for each hash entry will include two numbers:
i. The log probability of the Ngram.
ii. The log backoff score.

So, during recognition, if you wanted the trigram probability
P(G | <s> A), you would compose the key "<s> A G" and check whether a hash
entry exists for it. If it does, you can simply read the probability from
it directly. If it doesn't, you will have to back off (the back off procedure
is explained in section 4 below).
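As a sketch of this hash-based storage, an ARPA loader along the lines of sections 2 and 3.1 might look like this (the function name and parsing details are my own assumptions, sufficient for the format shown above):

```python
# Hedged sketch: load an ARPA BO language model into a single hash (dict)
# mapping "w1 w2 ... wk" -> (logprob, logbackoff).

def load_arpa_lm(lines):
    """Parse ARPA BO lines into a dict keyed by space-joined word sequences."""
    lm = {}
    order = 0                        # current n-gram order; 0 = outside a section
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith("-grams:"):     # section markers like "\2-grams:"
            order = int(line[1])
            continue
        if line.startswith("\\"):        # "\data\" or "\end\"
            order = 0
            continue
        if order == 0:                   # e.g. the "ngram 1=103" count lines
            continue
        fields = line.split()
        key = " ".join(fields[1:1 + order])
        # A field beyond the words is the backoff weight; the highest-order
        # section omits it, so default to 0.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        lm[key] = (float(fields[0]), backoff)
    return lm

# Tiny excerpt of the format from section 2:
sample = [
    "\\data\\",
    "ngram 1=2",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-99     <s>     -0.8484463",
    "-1.48335        A       -0.6743829",
    "",
    "\\2-grams:",
    "-1.800717       <s> A   -0.1356221",
    "",
    "\\end\\",
]
lm = load_arpa_lm(sample)            # lm["<s> A"] -> (-1.800717, -0.1356221)
```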

3.2 TRIE:
---------

A more efficient structure to store the LM is a TRIE. A good description of
the TRIE structure can be found in the following paper:
http://kefallonia.telecom.tuc.gr/conferences/icassp/2003/pdfs/01-00388.pdf
(only the first page of the paper is relevant).
 
The TRIE uses the following nested layout to represent LMs:

WORD ID OF FIRST WORD
Unigram log Probability of word
log backoff score for the word
    WORD ID OF THE FIRST BIGRAM OF THE WORD  (i.e. the first entry P(X | WORD))
    Bigram log probability of word pair
    Backoff log score of word pair
        Word ID of first trigram of the word pair
        Trigram probability of word triplet.

        Word ID of second trigram of the word pair
        Trigram probability
        ..
        ..
    WORD ID OF THE SECOND BIGRAM OF THE WORD
    ..
    ..
WORD ID OF SECOND WORD
..

The advantage of the above structure is that the process of reading trigram
probabilities from the LM can be very fast. Also, the amount of memory
required to store the LM can be minimal.
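To make the nesting concrete, here is a toy illustration using nested dicts and the example N-grams from section 2 (real TRIE implementations pack each level into flat sorted arrays rather than dicts):

```python
# Toy nesting of the layout above: each unigram entry holds its probability,
# backoff, and its bigram successors, which in turn hold trigram successors.

lm_trie = {
    "<s>": {
        "logprob": -99.0, "backoff": -0.8484463,
        "bigrams": {
            "A": {
                "logprob": -1.800717, "backoff": -0.1356221,
                "trigrams": {"G": -1.160794},
            },
        },
    },
}

# Reading log P(G | <s> A) is then just two hops down the structure:
logp = lm_trie["<s>"]["bigrams"]["A"]["trigrams"]["G"]    # -> -1.160794
```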


-------------------------------------------------------------------------
4. READING THE LANGUAGE MODEL
-----------------------------

During recognition you will require Bigram probabilities of the kind
P(word | <s>) and trigram probabilities such as P(word | word1 word2).

The LM is small and will not contain many of these probabilities.
These must be computed by backing off. The backed-off terms too may not
be present in the LM, so backoff will be recursive. The overall
algorithm for retrieving LM probabilities will be as given in the following
pseudo code.

For this code I'm assuming you will be using the hash structure (from 3.1).
You can retrieve both log bigram probabilities (P(word | <s>)) and log
trigram probabilities following the pseudo code given below.

# Function to compute log trigram prob P(word3 | word1 word2)
float compute_log_trigram_probability (word1,word2,word3):
    key = "word1 word2 word3"
    hashentry = retrieve_from_lmhash( key )
    if exists hashentry:
        return hashentry.logprob
    else: # Backoff to BG
        log_backoff_score = compute_log_backoff (word1, word2)
        log_bigram_prob = compute_log_bigram_probability (word2, word3)
        return log_backoff_score + log_bigram_prob


# Function to compute log backoff score BACKOFF(word1 word2)
float compute_log_backoff (word1, word2):
    key = "word1 word2"
    hashentry = retrieve_from_lmhash( key )
    if exists hashentry:
        return hashentry.logbackoff
    else:
        return 0


# Function to compute log bigram prob P(word2 | word1)
float compute_log_bigram_probability (word1, word2):
    key = "word1 word2"
    hashentry = retrieve_from_lmhash( key )
    if exists hashentry:
        return hashentry.logprob
    else: # Backoff to UG
        hashentry = retrieve_from_lmhash(word1)
        log_backoff_score = hashentry.logbackoff

        hashentry = retrieve_from_lmhash(word2)
        log_unigram_score = hashentry.logprob

        return log_backoff_score + log_unigram_score


Note that the above functions return log probabilities in base 10. You may
have to convert them to your own log base (which may be the natural log, for
instance) to perform recognition.
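For instance, converting a base-10 score to the natural log is a single multiplication by ln(10):

```python
import math

# A base-10 score from the LM is converted to the natural log by multiplying
# by ln(10), since ln(x) = log10(x) * ln(10).
log10_prob = -1.48335                  # unigram score for "A" from section 2
ln_prob = log10_prob * math.log(10.0)
```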

-----------------------------------------------------------------------
5. THE LANGUAGE WEIGHT
----------------------

The language model only gives you log P(word|word word). To be effective,
these must be multiplied by a "Language weight".

In other words, instead of simply using  logP(word|word word) in the trellis,
what you will actually use is   langwt*logP(word|word word).

The "language weight" is a heuristic that has no statistical explanation.
However, if you do not employ it, recognition will not work.

A good value of the language weight is between 5 and 20 if all computation is
done in natural logs.

You must explore the range of language weights to identify the optimal 
language weight.


In addition to the language weight there is an insertion penalty. This is a
penalty that is applied any time a new word is hypothesized. By increasing or
decreasing it, you can hypothesize more or fewer words.

To incorporate this, the manner in which language probabilities are employed
must be further modified. The ACTUAL score you will impose on cross-word
transitions in your trellis will therefore be:

langwt * log(P(word | word word))  +  insertionpenalty.

A good value of insertion penalty is in the range log(1.0) -- log(0.000001).
You may have to determine the optimal insertion penalty as well.
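Putting the two heuristics together, the cross-word transition score might be computed as follows (the default values are illustrative starting points, not tuned settings):

```python
import math

# Sketch of the combined cross-word transition score, assuming natural-log
# LM scores; langwt and insertion_penalty are tuning parameters.

def transition_score(log_lm_prob, langwt=10.0, insertion_penalty=math.log(0.01)):
    """langwt * ln P(word | history) + insertion penalty."""
    return langwt * log_lm_prob + insertion_penalty

score = transition_score(-3.4)         # e.g. a natural-log trigram score
```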
