We formally define the problem as follows. Between every pair of words is a juncture, which can take one of a number of juncture types. In the simplest case the set of juncture types consists of break and non-break (the only case discussed here), but in principle any number is possible. The task of the algorithm is to decide the best sequence of juncture types for each sentence.
Our algorithm uses a Markov model in which each state represents one of the juncture types and emits the probabilities of part-of-speech (POS) sequences. We usually take the two words before and the one word after the juncture in question to represent the POS sequence:

C_j = (p_{j-2}, p_{j-1}, p_{j+1})

where j is the juncture in question and p_{j-1} is the POS tag immediately before it; p_{j-2} is the tag before that and p_{j+1} the tag immediately after.
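The context extraction above can be sketched as follows; `juncture_contexts` is a hypothetical helper, and the representation of a sentence as (word, POS tag) pairs and the `<S>` padding sentinel are assumptions, not part of the original method.

```python
def juncture_contexts(tagged_sentence, pad="<S>"):
    """Yield (index, context) for each juncture between adjacent words.

    The context is (tag two before, tag immediately before, tag after),
    padded with a sentinel at the start of the sentence.
    """
    tags = [pad] + [tag for _, tag in tagged_sentence]
    # Juncture i sits between word i and word i+1; tags[i] and tags[i+1]
    # are the two tags before it, tags[i+2] is the tag after it.
    for i in range(len(tagged_sentence) - 1):
        yield i, (tags[i], tags[i + 1], tags[i + 2])

sentence = [("the", "DT"), ("cat", "NN"), ("sat", "VB"), ("down", "RB")]
contexts = dict(juncture_contexts(sentence))
```

For the toy sentence there are three junctures, and the first one (between "the" and "cat") receives the padded context ("<S>", "DT", "NN").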
In the simplest case there are only two states in the model, one for break and the other for non-break. The transition probabilities are determined by the prior probability of each juncture type occurring. Bayes' rule is used to combine the emission probability of each state, P(C|j), with the transition probability, P(j), to find the probability we are interested in, P(j|C). This case uses local information only, i.e. the basic probability of each juncture type at each point. To take more context into account we use an n-gram of juncture sequences. The n-gram gives the probability of a juncture type given the previous N junctures, i.e.

P(j_i | j_{i-1}, j_{i-2}, ..., j_{i-N})
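The local, context-only decision via Bayes' rule can be sketched as below. The probabilities are toy numbers chosen for illustration only, not trained values.

```python
def posterior(p_context_given_j, prior_j):
    """Combine P(C|j) with the prior P(j) via Bayes' rule.

    P(j|C) is proportional to P(C|j) * P(j); we normalise over the
    juncture types so the posteriors sum to one.
    """
    scores = {j: p_context_given_j[j] * prior_j[j] for j in prior_j}
    z = sum(scores.values())
    return {j: s / z for j, s in scores.items()}

# Hypothetical emission and prior probabilities for one context C:
p_c_given_j = {"break": 0.02, "non-break": 0.001}
prior = {"break": 0.2, "non-break": 0.8}
post = posterior(p_c_given_j, prior)
```

Note that although the context is far more likely as a whole to be a non-break (prior 0.8), this particular context is evidence for a break, and the posterior reflects that.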
For an n-gram of size N, a Markov model is built with one node for each unique sequence of N juncture types (2^N nodes in the two-type case). The transition probabilities are taken from the n-gram and give the likelihood of particular sequences occurring. All states of a particular juncture type are tied, i.e. the emission probabilities of all break states are equal regardless of position in the network.
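The state space and the tying described above can be sketched as follows; `build_states` and the "B"/"N" labels are illustrative assumptions.

```python
from itertools import product

def build_states(N, types=("B", "N")):
    """One state per unique sequence of N juncture types.

    With two juncture types ("B" = break, "N" = non-break) this gives
    2**N states.
    """
    return list(product(types, repeat=N))

states = build_states(3)
# Tying: every state whose most recent juncture type is "B" shares the
# single break emission table, regardless of its position in the network.
break_states = [s for s in states if s[-1] == "B"]
```

Tying the emission tables means only two distributions, P(C|break) and P(C|non-break), ever need to be estimated, however large N grows.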
Training the network is straightforward: the emission probabilities are calculated by collecting all the examples of breaks in the training data. For all the possible unique POS sequences, C, counts are made of how many times each occurs in the data. These are converted to the probability P(C|break) by dividing by the total number of breaks. P(C|non-break) is calculated the same way. The n-grams are calculated by counting the number of times each unique sequence of length N occurs in the training data.
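The relative-frequency estimation of the emission tables can be sketched as below (the n-gram counts are gathered analogously). The corpus format, a list of (context, juncture type) pairs, is an assumption for illustration.

```python
from collections import Counter

def train_emissions(examples):
    """Estimate P(C|j) by relative frequency.

    Counts each POS context per juncture type, then divides by the total
    number of examples of that type.
    """
    counts = {"break": Counter(), "non-break": Counter()}
    for context, j in examples:
        counts[j][context] += 1
    return {j: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for j, cnt in counts.items()}

# Tiny hypothetical training set of (POS context, juncture type) pairs:
examples = [(("DT", "NN", "VB"), "non-break"),
            (("NN", "VB", "RB"), "break"),
            (("NN", "VB", "RB"), "non-break")]
emissions = train_emissions(examples)
```

In practice unseen contexts would need smoothing or a floor probability; raw relative frequencies assign them zero mass.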
At run time, this network is efficiently searched using the Viterbi algorithm to find the most likely sequence of junctures given the input POS tags for a sentence.
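A minimal Viterbi decoder over the two tied states is sketched below, using bigram transitions between juncture types; every probability here is an illustrative assumption, and unseen contexts are given a small floor probability.

```python
import math

def viterbi(contexts, emissions, transitions, prior, floor=1e-8):
    """Return the most likely sequence of juncture types.

    contexts: one POS context per juncture; emissions[s][c] = P(c|s);
    transitions[(p, s)] = P(s|p); prior[s] = P(s) for the first juncture.
    """
    states = list(prior)
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: math.log(prior[s]) +
                math.log(emissions[s].get(contexts[0], floor))
             for s in states}
    back = []
    for c in contexts[1:]:
        back.append({})
        new = {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] +
                       math.log(transitions[(p, s)]))
            back[-1][s] = prev
            new[s] = (delta[prev] + math.log(transitions[(prev, s)]) +
                      math.log(emissions[s].get(c, floor)))
        delta = new
    # Trace back the best path from the most likely final state.
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy model: three junctures with contexts "c1".."c3".
prior = {"break": 0.2, "non-break": 0.8}
transitions = {("break", "break"): 0.1, ("break", "non-break"): 0.9,
               ("non-break", "break"): 0.25, ("non-break", "non-break"): 0.75}
emissions = {"break": {"c1": 0.01, "c2": 0.3, "c3": 0.05},
             "non-break": {"c1": 0.2, "c2": 0.02, "c3": 0.1}}
path = viterbi(["c1", "c2", "c3"], emissions, transitions, prior)
```

Working in log space avoids underflow on long sentences; the search is linear in sentence length and quadratic in the number of states.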