Tomokiyo et al, ACL 2003
A Language Model Approach to Keyphrase Extraction
This paper presents an unsupervised, domain-independent method for extracting and evaluating phrases. The closest prior work is pointwise mutual information (PMI), which is based on collocation; the authors also cite approaches based on frequency or on a combination of both.
Acknowledging that these terms are defined subjectively, they use the following definitions:
- Phraseness: collocation and cohesion of consecutive words
- Informativeness: the amount of new knowledge content
(Later in the paper, they show that phraseness and informativeness are only weakly correlated.)
- Foreground corpus: the document set that the phrases are extracted from (e.g., website of a certain company)
- Background corpus: the document set that the phrases are compared to (e.g., the entire Internet)
Their baseline is the binomial log-likelihood ratio test (BLRT), which treats a word sequence as a series of repeated binary trials and computes the log-likelihood ratio between the hypothesis that two samples come from the same distribution and the hypothesis that they come from different distributions:
- Phraseness = BLRT between two consecutive words
- Informativeness = BLRT of a phrase between the background corpus and the foreground corpus
They combine the two scores with a logistic function. Its parameters are chosen from user feedback, which is a drawback of this method.
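The BLRT baseline can be sketched as follows. This is a minimal illustration of the standard binomial log-likelihood ratio statistic, not the authors' code: the function names are hypothetical, and the way bigram counts are mapped onto the two binomial samples for phraseness follows the usual collocation setup, which is an assumption here. The logistic combination is omitted because its parameters come from user feedback.

```python
import math


def _log_l(p, k, n):
    """Binomial log-likelihood of k successes in n trials (constant term dropped)."""
    # Treat 0 * log(0) as 0 so that p == 0 or p == 1 does not raise.
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def blrt(k1, n1, k2, n2):
    """-2 log(lambda) for 'same distribution' versus 'different distributions'."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled estimate under the 'same' hypothesis
    return 2.0 * (_log_l(p1, k1, n1) + _log_l(p2, k2, n2)
                  - _log_l(p, k1, n1) - _log_l(p, k2, n2))


def phraseness_blrt(c_xy, c_x, c_y, n_tokens):
    """Assumed setup: does y follow x more often than it follows other words?"""
    # Sample 1: positions after x. Sample 2: positions after any other word.
    return blrt(c_xy, c_x, c_y - c_xy, n_tokens - c_x)


def informativeness_blrt(c_fg, n_fg, c_bg, n_bg):
    """Compare a phrase's rate in the foreground with its rate in the background."""
    return blrt(c_fg, n_fg, c_bg, n_bg)
```

Note that the statistic is symmetric in its two samples, so on its own it does not say in which corpus the phrase is over-represented.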
Their proposed method instead uses smoothed language models (LMs) and measures loss with KL divergence (the optimal N for an N-gram LM is the one that minimizes perplexity, or equivalently cross entropy):
- Phraseness = KL(LM with optimal N in foreground || Unigram in foreground)
- Informativeness = KL(LM with optimal N in foreground || LM with optimal N in background)
They combine the two either by using KL(LM with optimal N in foreground || Unigram in background) or by simply adding them.
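The three KL-based scores can be illustrated with a small sketch. Several things here are assumptions rather than details from the summary: phrases are restricted to bigrams (N = 2), smoothing is plain add-alpha, each phrase is scored by its own term of the KL sum, p(w) log(p(w)/q(w)), and the class name, function names, and toy corpora are all hypothetical.

```python
import math
from collections import Counter


class BigramLM:
    """Tiny add-alpha smoothed unigram/bigram model over a shared vocabulary."""

    def __init__(self, tokens, vocab, alpha=0.1):
        self.alpha = alpha
        self.vocab_size = len(vocab)
        self.total = len(tokens)
        self.uni = Counter(tokens)
        self.ctx = Counter(tokens[:-1])  # words that have a successor
        self.bi = Counter(zip(tokens, tokens[1:]))

    def p_unigram(self, w):
        return (self.uni[w] + self.alpha) / (self.total + self.alpha * self.vocab_size)

    def p_cond(self, y, x):
        # P(y | x) with add-alpha smoothing.
        return (self.bi[(x, y)] + self.alpha) / (self.ctx[x] + self.alpha * self.vocab_size)

    def p_phrase_unigram(self, x, y):
        # Probability of the pair if the two words were independent.
        return self.p_unigram(x) * self.p_unigram(y)

    def p_phrase_bigram(self, x, y):
        # Probability of the pair under the bigram model.
        return self.p_unigram(x) * self.p_cond(y, x)


def pointwise_kl(p, q):
    """Contribution of a single phrase to KL(p || q)."""
    return p * math.log(p / q)


def score(x, y, fg, bg):
    """Return (phraseness, informativeness, combined) for the bigram (x, y)."""
    p_fg_n = fg.p_phrase_bigram(x, y)
    phraseness = pointwise_kl(p_fg_n, fg.p_phrase_unigram(x, y))
    informativeness = pointwise_kl(p_fg_n, bg.p_phrase_bigram(x, y))
    combined = pointwise_kl(p_fg_n, bg.p_phrase_unigram(x, y))
    return phraseness, informativeness, combined


# Toy corpora, made up for illustration only.
fg_tokens = ("the language model assigns probability "
             "the language model is smoothed the model").split()
bg_tokens = ("the cat sat on the mat the dog is on the mat "
             "the language is old").split()
vocab = set(fg_tokens) | set(bg_tokens)
fg = BigramLM(fg_tokens, vocab)
bg = BigramLM(bg_tokens, vocab)
```

Calling `score("language", "model", fg, bg)` returns the three scores for one candidate; ranking all observed foreground bigrams by the combined score would give a keyphrase list.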
The experiments use the 20 Newsgroups data set, but there is no formal evaluation. For bigrams, the method matches the accuracy of BLRT. For phrases of length n, they use the Apriori algorithm to extend the bigrams found and then apply some heuristic filters.
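An Apriori-style extension from bigrams to longer phrases might look like the sketch below. It only illustrates the general level-wise idea (an n-gram is kept only if both of its (n-1)-gram sub-sequences were kept). The function name, the `min_count` threshold, and `max_len` are hypothetical, and the paper's heuristic filters are not reproduced.

```python
from collections import Counter


def extend_phrases(tokens, seed_bigrams, max_len=4, min_count=2):
    """Grow phrases level by level from a set of accepted bigrams."""
    levels = {2: set(seed_bigrams)}
    for n in range(3, max_len + 1):
        prev = levels[n - 1]
        # Count every n-gram that occurs in the token stream.
        counts = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        # Apriori pruning: both the prefix and the suffix must already be phrases.
        current = {
            gram for gram, c in counts.items()
            if c >= min_count and gram[:-1] in prev and gram[1:] in prev
        }
        if not current:
            break
        levels[n] = current
    return levels


tokens = "support vector machine works and support vector machine scales".split()
seeds = {("support", "vector"), ("vector", "machine")}
phrases = extend_phrases(tokens, seeds)
```

Here `phrases[3]` holds the trigrams built from the seed bigrams; any further scoring or filtering of the extended phrases would be applied afterwards.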
Compared with the relative frequency ratio, KL divergence is more robust to sparse data.
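One plausible reading of this claim can be shown numerically. The sketch below assumes the per-phrase KL term p log(p/q) and a simple add-alpha estimate for phrase probabilities; the counts are made up for illustration and are not results from the paper. A very rare foreground phrase that is absent from the background gets a huge frequency ratio, while the KL term weights the log ratio by the foreground probability and so damps it.

```python
import math


def smoothed(count, total, alpha=0.5):
    """Add-alpha estimate of a phrase probability (phrase vs. not-phrase)."""
    return (count + alpha) / (total + 2 * alpha)


def relative_frequency_ratio(c_fg, n_fg, c_bg, n_bg):
    return smoothed(c_fg, n_fg) / smoothed(c_bg, n_bg)


def pointwise_kl_score(c_fg, n_fg, c_bg, n_bg):
    p, q = smoothed(c_fg, n_fg), smoothed(c_bg, n_bg)
    return p * math.log(p / q)


# Hypothetical counts: (foreground count, foreground size, background count, background size)
rare = (2, 10_000, 0, 1_000_000)         # seen twice, never in the background
common = (200, 10_000, 2_000, 1_000_000)  # frequent and about 10x over-represented

ratio_prefers_rare = relative_frequency_ratio(*rare) > relative_frequency_ratio(*common)
kl_prefers_common = pointwise_kl_score(*common) > pointwise_kl_score(*rare)
```

With these counts the ratio ranks the rare phrase first, whereas the KL-based score ranks the well-attested phrase first.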
Annotated by Mehrbod