Learning a Semantic Coherence Function

Perhaps the most salient deficiency of conventional language models is their complete failure at modeling semantic coherence. These models capture short-distance correlations among words in a sentence fairly well, yet they cannot distinguish meaningful sentences, whose content words come from the same semantic domain, from 'fake' sentences whose content words are drawn at random. As a result, in many language technology applications such as speech recognition, errors that are obvious to a human observer (e.g., a noun replaced by an acoustically similar but semantically different noun) cannot be corrected by the model.

The whole-sentence exponential language model developed recently by our group is naturally suited to modeling whole-sentence phenomena such as semantic coherence. In previous work we have shown that more benefit can be expected from a handful of features that are frequently active than from many features that are rarely active. Ideally, we would like to derive a single computational feature that captures the notion of semantic coherence in a sentence or document.

Building on previous work by Can Cai, we discovered significant differences in the distribution of content words between real text and model-generated ('fake') text. Specifically, for each sentence, all content words were identified. For each pair of words in that set, we estimated a measure of association called Q (similar to a correlation coefficient) from the 2x2 contingency table of training-data co-occurrences of the two words. Thus each sentence, true or fake, can be represented by a variable-length list of Q values. By defining four features of these lists (their min, max, median and mean), Can showed that the 'true' Q lists have a different distribution from the 'fake' Q lists, and by plugging these very simple features into the exponential model she achieved a performance improvement.
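
As an illustration of the representation described above, the following Python sketch builds a Q list for one sentence and computes the four summary features. It assumes that Q is Yule's Q, (ad - bc)/(ad + bc), which the text does not state explicitly, and that co-occurrence is counted per training sentence. The function names, the corpus format (a list of word sets), and the convention of returning 0 when Q is undefined are hypothetical choices made for this example only.

```python
from itertools import combinations
from statistics import mean, median


def yules_q(n11, n10, n01, n00):
    """Yule's Q for a 2x2 table (assumed form of the Q measure).

    n11: sentences containing both words, n10: only the first word,
    n01: only the second word, n00: neither word.
    """
    num = n11 * n00 - n10 * n01
    den = n11 * n00 + n10 * n01
    # Assumption: when Q is undefined (zero denominator), report 0.0.
    return num / den if den else 0.0


def q_list(content_words, training_sentences):
    """Return one Q value per pair of distinct content words.

    training_sentences is a list of sets of words (hypothetical format).
    """
    qs = []
    for w1, w2 in combinations(sorted(set(content_words)), 2):
        n11 = n10 = n01 = n00 = 0
        for sent in training_sentences:
            has1, has2 = w1 in sent, w2 in sent
            if has1 and has2:
                n11 += 1
            elif has1:
                n10 += 1
            elif has2:
                n01 += 1
            else:
                n00 += 1
        qs.append(yules_q(n11, n10, n01, n00))
    return qs


def summary_features(qs):
    """The four list features named in the text: min, max, median, mean."""
    if not qs:  # fewer than two content words: no pairs, no features
        return None
    return {"min": min(qs), "max": max(qs),
            "median": median(qs), "mean": mean(qs)}
```

Calling summary_features(q_list(words, corpus)) on a sentence's content words yields the fixed-size description of its Q list. A sentence with fewer than two content words has an empty Q list, which this sketch maps to None.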

However, we believe that much more improvement is possible, and can be realized by learning a much more powerful feature from the Q list. The goal of this project is to automatically learn such a feature from data.

The training data for this task is a very large set of Q lists, each labeled as 'true' (1) or 'fake' (0). The goal is to find a single function that takes a Q list as input and produces a single number between 0 and 1 as output, predicting whether the list came from a 'true' or a 'fake' sentence. One of the main ML challenges here is dealing with variable-length inputs.
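
To make the task concrete, here is a minimal baseline sketch in Python. It is one possible way to handle variable-length input, not the method proposed by the project: each Q list is reduced to a fixed-length histogram of bin proportions over [-1, 1] (assuming Q lies in that range), and a logistic regression trained by plain gradient descent maps the histogram to a number between 0 and 1. All names, the bin count, the learning rate, and the number of epochs are hypothetical.

```python
import math


def histogram_features(qs, n_bins=10):
    """Map a variable-length Q list to n_bins bin proportions over [-1, 1]."""
    counts = [0] * n_bins
    for q in qs:
        idx = int((q + 1.0) / 2.0 * n_bins)
        # Clamp so that q == 1.0 falls into the last bin.
        counts[min(max(idx, 0), n_bins - 1)] += 1
    total = len(qs)
    return [c / total for c in counts] if total else [0.0] * n_bins


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(q_lists, labels, n_bins=10, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent on the mean log loss."""
    X = [histogram_features(q, n_bins) for q in q_lists]
    w, b, n = [0.0] * n_bins, 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_bins, 0.0
        for x, y in zip(X, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_bins):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def coherence_score(qs, w, b):
    """Score in (0, 1); higher means more likely a 'true' sentence."""
    x = histogram_features(qs, len(w))
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

The histogram discards how many Q values a list contains and treats them as an unordered bag, which is a deliberate simplification. A learned feature of the kind the project is after would presumably use more of the information in the list.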