Segmenting an arbitrarily concatenated phrase

As of today, Aug 15, 2008, search "blueipodblackberrywheelcellphone" in google and you get a zero document match with no spelling suggestion. Search that same phrase in nextag and you get a did-you-mean spelling suggestion of "blue ipod blackberry wheel cell phone", correctly segmented.

The segmentation algorithm runs in time quadratic in the length of the input you type in, independent of the dictionary size. I designed and imlemented this algorithm in 2006, when I was at Nextag. I'm adding two features to the segmenter and will host the demo here on this page. Visit back soon!

  • Feature 1: Some robustness against misspelling of words in the arbitrarily concatenated phrase.
  • Feature 2: Reporting the dictionary size in number of words, and the actual segmenting time in ms.

Applications

This segmenter has many areas of application. For example, have you ever wanted to be able to type a Chinese sentence comletely in pinyin, and let the computer automatically convert it into the Chinese characters that you intended? To achieve high accuracy, we need to employ machine learning techniques and look at the entire sentence as the context.

Indeed, this is exactly the problem of segmentation, as what is more formally depicted below.

Intelligent pinyin-to-character conversion

Let's say C = {c1, c2, ...} is the set of all characters, and P = {p1, p2, ...} is the set of all pinyins, where |P| << |C|. The main difficulty in pinyin-to-Chinese conversion is that Chinese is highly homonymous: the surjection f: C &rarr P typically maps many distinct characters to the same pinyin. This ambiguity is somewhat resolved with word/phrase-level conversion.

For example, say |{c : f(c) = p1}| = 20 and |{c : f(c) = p2}| = 40. Fortunatelly, typically the number |{(ci, cj) : "ci cj" is a word/phrase and f(ci) = p1 and f(cj) = p2}| of words/phrases is much less than min(20, 40) = 20.

Nevertheless, that number is not 1. What's more, given a stream of pinyins we need to be able to segment it first.

For example, say I type a stream of pinyins as a sentence: p1 p2 p3 p4.

There may be two ways to segment this sentence, with words/phrases separated by commas.
(1) p1, p2 p3, p4
(2) p1 p2, p3 p4

Each of these two would correspond to a different set of characters after conversion. (Note the positions of the commas.)
(1) c17, c2 c9, c8
(2) c23 c18, c9 c6

This is analogous to splitting an arbitrarily concatenated English sentence as posed on the top, except now we have an additional step of converting everything back to characters after the segmentation.