We formally define the problem as follows. Between every pair of words
is a *juncture*, which can take one of a number of *juncture
types*. In the simplest case the set of juncture types consists of
break and non-break (the only case discussed here), but in principle
any number is possible. The task of the algorithm is to decide the
best sequence of juncture types for each sentence.

Our algorithm uses a Markov model where each state represents one of the juncture types and emits probabilities of part-of-speech (POS) sequences occurring. We usually take the POS tags of the two words before and the one word after the juncture in question to represent the POS sequence:

*C* = *p*₋₂ *p*₋₁ *j* *p*₊₁

where *j* is the juncture in question and *p*₋₁ is the POS
tag immediately before it.
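As an illustrative sketch (not the authors' code; the tag names and the sentence-boundary padding symbols are assumptions), the POS context around a juncture could be extracted like this:

```python
def pos_context(pos_tags, i):
    """Return the POS context C for the juncture between word i and word i+1:
    the tags of the two words before the juncture and the one word after it.
    Positions beyond the sentence edge are padded with hypothetical
    "<s>"/"</s>" boundary tags (an assumption, not from the paper).
    """
    before2 = pos_tags[i - 1] if i >= 1 else "<s>"
    before1 = pos_tags[i]
    after1 = pos_tags[i + 1] if i + 1 < len(pos_tags) else "</s>"
    return (before2, before1, after1)

tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]
print(pos_context(tags, 1))  # context of the juncture after the second word
```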

In the simplest case there are only two states in the model, one for
break and the other for non-break. The transition probabilities are
determined by the prior probability of each juncture type
occurring. Bayes' rule is used to combine the emission probability of
each state (*P*(*C*|*j*)) with the transition probability (*P*(*j*)) to
find the probability we are interested in, *P*(*j*|*C*). This case
represents local information only, i.e. the basic probability of each
juncture type at each point. To take more context into account we use
an n-gram of juncture sequences. The n-gram gives the probability of a
juncture type given the previous *N*−1 junctures, i.e.

*P*(*j*ᵢ | *j*ᵢ₋₁, *j*ᵢ₋₂, ..., *j*ᵢ₋ₙ₊₁)
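Such n-gram probabilities can be estimated by relative-frequency counting, as in this sketch (the start-padding symbol and data layout are assumptions):

```python
from collections import Counter

def ngram_probs(juncture_seqs, n):
    """Estimate P(j_i | previous n-1 junctures) by relative frequency.
    Each sequence is padded at the start with a hypothetical "<s>" symbol
    so junctures near the sentence start still have a full history.
    """
    ctx_counts = Counter()   # counts of each length-(n-1) history
    full_counts = Counter()  # counts of history + current juncture type
    for seq in juncture_seqs:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            ctx = tuple(padded[i - n + 1:i])
            full_counts[ctx + (padded[i],)] += 1
            ctx_counts[ctx] += 1
    return {k: v / ctx_counts[k[:-1]] for k, v in full_counts.items()}

# "B" = break, "N" = non-break; two toy juncture sequences
probs = ngram_probs([["B", "N", "N", "B"], ["N", "N", "B"]], 2)
```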

For an n-gram of size *N*, a Markov model is built with one node for
each unique sequence of *N*−1 juncture types (2^(*N*−1) nodes in the
break/non-break case). The transition probabilities are taken from the
n-gram and give the likelihood of particular sequences occurring. All
states of a particular juncture type are tied, i.e. the emission
probabilities of all break states are equal regardless of position in
the network.
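Under the break/non-break case, enumerating the states and their emission tying might look like the following sketch (names are illustrative):

```python
from itertools import product

def build_states(juncture_types, n):
    """Enumerate the states of the n-gram Markov model: one state per
    unique sequence of n-1 juncture types. Emissions are tied by the
    most recent juncture type in each state's history, so all states
    ending in "break" share one emission distribution.
    """
    states = list(product(juncture_types, repeat=n - 1))
    tying = {s: s[-1] for s in states}  # state -> emission class
    return states, tying

states, tying = build_states(["break", "non-break"], 3)
print(len(states))  # 2^(3-1) = 4 states for a trigram with two juncture types
```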

Training the network is straightforward: the emission probabilities
are calculated by collecting all the examples of breaks in the
training data. For all the possible unique POS sequences, *C*, counts
are made from the data of how many times each occurs. These are
converted to the probability *P*(*C*|*break*) by dividing by the total
number of breaks. *P*(*C*|*non*-*break*) is calculated in the same way. The
n-grams are estimated by counting the number of times each unique
juncture sequence of length *N* occurs in the training data.
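The relative-frequency estimation of the emission probabilities described above could be sketched as follows (the data layout is an assumption):

```python
from collections import Counter

def train_emissions(examples):
    """Estimate P(C | juncture type) by relative frequency.
    `examples` is a list of (context, juncture_type) pairs, where each
    context is a POS tuple as described earlier. Dividing each count by
    the total for its juncture type yields the emission probability.
    """
    counts = {}
    totals = Counter()
    for ctx, jtype in examples:
        counts.setdefault(jtype, Counter())[ctx] += 1
        totals[jtype] += 1
    return {j: {c: n / totals[j] for c, n in cs.items()}
            for j, cs in counts.items()}

data = [(("DT", "NN", "VBD"), "break"),
        (("DT", "NN", "VBD"), "non-break"),
        (("NN", "VBD", "IN"), "break")]
em = train_emissions(data)
```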

At run time, this network is efficiently searched using the Viterbi algorithm to find the most likely sequence of junctures given the input POS tags for a sentence.
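A minimal Viterbi sketch over such a network, shown here for the bigram case with tied emissions (all names and the probability floor for unseen events are assumptions, not from the paper):

```python
import math

def viterbi(contexts, types, trans, emit, prior):
    """Find the most likely juncture-type sequence for one sentence.
    `contexts`: the POS context at each juncture; `trans[(a, b)]`: the
    bigram probability P(b | previous juncture a); `emit[t][c]`: P(c | t);
    `prior[t]`: P(t). A small floor stands in for unseen events.
    """
    floor = 1e-12
    # initialise with prior * emission for the first juncture
    best = {t: (math.log(prior.get(t, floor) * emit[t].get(contexts[0], floor)), [t])
            for t in types}
    for c in contexts[1:]:
        new = {}
        for t in types:
            # best predecessor maximises path score * transition * emission
            score, path = max(
                (best[p][0] + math.log(trans.get((p, t), floor) *
                                       emit[t].get(c, floor)),
                 best[p][1])
                for p in types)
            new[t] = (score, path + [t])
        best = new
    return max(best.values())[1]
```

A toy usage: with two junctures whose contexts strongly favour break then non-break, the decoder returns that sequence.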

Tue Jul 1 17:09:00 BST 1997