POS Sequence Model

The POS sequence model is trained by locating every juncture of each type in the training data and counting how often each distinct sequence of POS tags occurs around it. In general, the POS sequence is a window of L tags around a juncture j_i: M tags preceding j_i and L-M tags following it. In our standard system there are 2 tags before and 1 after the juncture (L=3, M=2). Each count is converted into a probability by dividing it by the total number of occurrences of that juncture type in the data, which gives an estimate of the probability of a POS sequence given a juncture type.
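The window described above can be illustrated with a minimal Python sketch. It is not the system's actual code: the function name `pos_window`, the padding tag `<pad>`, the example tags, and the convention that juncture i sits between `tags[i - 1]` and `tags[i]` are all assumptions made for illustration, since the text does not say how sentence edges are handled or fix an indexing convention.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding tag for positions outside the sentence


def pos_window(tags: List[str], i: int, L: int = 3, M: int = 2) -> Tuple[str, ...]:
    """Return the L tags around juncture i: M before it and L - M after it.

    Assumption: juncture i sits between tags[i - 1] and tags[i].
    """
    def tag_at(k: int) -> str:
        # Positions outside the sentence are padded (an assumption).
        return tags[k] if 0 <= k < len(tags) else PAD

    before = [tag_at(k) for k in range(i - M, i)]      # M tags preceding the juncture
    after = [tag_at(k) for k in range(i, i + L - M)]   # L - M tags following it
    return tuple(before + after)


# Illustrative tag sequence; the tag names are placeholders, not the paper's tagset.
tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
print(pos_window(tags, 2))  # ('DT', 'NN', 'VBZ')
```

With the defaults L=3 and M=2 this reproduces the standard configuration of 2 tags before and 1 tag after the juncture.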

Let us denote a POS sequence c_{i-M}, ..., c_i, ..., c_{i+L-M} as C, and the number of times it occurs around a juncture of type j in the training set as count(C|j). The number of times the juncture type itself occurs is count(j). An estimate of the probability is then given by:

$\begin{displaymath} \tilde{P}(C \vert j) = \frac{count(C\vert j)}{count(j)} \end{displaymath}$

(1)

which in expanded form is:

$\begin{displaymath} \tilde{P}(c_{i-M}, \ldots, c_{i-1}, c_{i}, c_{i+1}, \ldots, c_{i+L-M} \vert j_{i}) = \frac{count(c_{i-M}, \ldots, c_{i-1}, c_{i}, c_{i+1}, \ldots, c_{i+L-M} \vert j_{i})}{count(j_{i})} \end{displaymath}$

(2)
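The relative-frequency estimate in equation (1) can be sketched in a few lines of Python. This is an illustrative sketch only, not the original implementation: the function name `estimate_pos_sequence_model`, the juncture labels `"break"` and `"non-break"`, and the toy observations are all hypothetical, and the input is assumed to be a list of (POS window, juncture type) pairs already extracted from the training data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Window = Tuple[str, ...]


def estimate_pos_sequence_model(
    observations: Iterable[Tuple[Window, str]],
) -> Dict[Tuple[Window, str], float]:
    """Estimate P(C | j) as count(C | j) / count(j)."""
    seq_counts: Counter = Counter()       # count(C | j)
    juncture_counts: Counter = Counter()  # count(j)
    for window, juncture in observations:
        seq_counts[(window, juncture)] += 1
        juncture_counts[juncture] += 1
    return {
        (window, juncture): n / juncture_counts[juncture]
        for (window, juncture), n in seq_counts.items()
    }


# Toy data, invented for illustration (L=3, M=2 windows).
observations = [
    (("DT", "NN", "VBZ"), "non-break"),
    (("DT", "NN", "VBZ"), "non-break"),
    (("JJ", "NN", "IN"), "non-break"),
    (("NN", "VBZ", "DT"), "non-break"),
    (("JJ", "NN", "IN"), "break"),
    (("NN", "NN", "CC"), "break"),
]
model = estimate_pos_sequence_model(observations)
print(model[(("DT", "NN", "VBZ"), "non-break")])  # 2 / 4 = 0.5
```

Sequences never seen with a given juncture type simply have no entry here; how such unseen sequences should be treated is not covered by this sketch.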

Alan W Black
1999-03-20