As with all models, there are trade-offs between complexity, both in time and space, and ease of implementation. The table below gives results from some simple algorithms tested on our data. The first deterministically inserts a phrase break after every punctuation mark, while the second inserts a phrase break after every content word that is followed by a function word (e.g. as suggested by ).
We can see that the punctuation model is conservative: the breaks it assigns are almost always correct, but it misses many others. The content/function model gets many more breaks correct, but at the cost of massive over-insertion.
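The two deterministic baselines can be sketched as follows. This is a minimal illustration rather than the implementation used for the results above: the input format (a list of word/tag pairs with punctuation as separate tokens), the function names, and the `PUNCTUATION` and `FUNCTION_TAGS` inventories are all assumptions, with the function-word tags loosely modelled on Penn Treebank labels.

```python
# Hypothetical inventories; the source does not specify which marks or tags were used.
PUNCTUATION = {".", ",", ";", ":", "!", "?"}
FUNCTION_TAGS = {"DT", "IN", "CC", "TO", "MD", "PRP", "PRP$", "WDT", "WP"}


def is_punct(word):
    return word in PUNCTUATION


def is_function(word, tag):
    return not is_punct(word) and tag in FUNCTION_TAGS


def is_content(word, tag):
    # Anything that is neither punctuation nor a function word counts as content.
    return not is_punct(word) and tag not in FUNCTION_TAGS


def punctuation_breaks(tagged):
    """Indices of tokens after which a break is placed: every punctuation token."""
    return [i for i, (word, _) in enumerate(tagged) if is_punct(word)]


def content_function_breaks(tagged):
    """Indices of content words that are immediately followed by a function word."""
    breaks = []
    for i in range(len(tagged) - 1):
        word, tag = tagged[i]
        next_word, next_tag = tagged[i + 1]
        if is_content(word, tag) and is_function(next_word, next_tag):
            breaks.append(i)
    return breaks


sentence = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
    ("the", "DT"), ("mat", "NN"), (",", ","), ("and", "CC"),
    ("the", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", "."),
]

print(punctuation_breaks(sentence))       # breaks after "," and "."
print(content_function_breaks(sentence))  # break after "sat"
```

On this toy sentence the punctuation rule places breaks only at the comma and the full stop, while the content/function rule places one between "sat" and "on", a position where a break would often be unwanted. This mirrors, on a very small scale, the conservative versus over-inserting behaviour described above.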
Within our basic model there are a number of variables to investigate, including the size of the POS tagset, the size of the POS window for the POS sequence model, and the size of the n-gram for the phrase break model.