
Non-Negative Sparse Embedding

An interpretable and effective general-purpose semantic space


This is the webpage for NNSE (Non-Negative Sparse Embedding), a semantic representation scheme that derives interpretable and cognitively plausible word representations from massive web corpora. The model matrices and papers can be downloaded below.

The main features of this vector space model (also known as a word embedding or distributional semantics model) are described below.

In brief, the model is derived in an unsupervised way from ~10 million documents and ~15 billion words of web text (from the ClueWeb collection). MALT dependency co-occurrences (target word - dependency - head/dependent) are collated (applying a frequency cutoff), adjusted with positive pointwise mutual information (PPMI) to normalise for word and feature frequencies, and reduced in dimensionality with sparse SVD methods. In parallel, document co-occurrence counts (LSA/LDA style) are similarly collated, PPMI-adjusted, and reduced with sparse SVD. The union of these inputs is factorised again using Non-Negative Sparse Embedding, a variation on Non-Negative Sparse Coding.
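
As a rough illustration of the PPMI weighting step mentioned above, the sketch below applies positive pointwise mutual information to a toy co-occurrence count matrix (the function and variable names are illustrative only, not the pipeline's actual code):

    import numpy as np

    def ppmi(counts):
        """Positive pointwise mutual information weighting of a
        word-by-context co-occurrence count matrix (toy sketch)."""
        total = counts.sum()
        p_wc = counts / total                      # joint probabilities
        p_w = p_wc.sum(axis=1, keepdims=True)      # word marginals
        p_c = p_wc.sum(axis=0, keepdims=True)      # context marginals
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0               # zero counts contribute nothing
        return np.maximum(pmi, 0.0)                # keep only positive associations

    # toy word-by-context counts
    counts = np.array([[10.0, 0.0, 2.0],
                       [ 1.0, 8.0, 0.0]])
    print(ppmi(counts))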

The result is a space in which a relatively compact set of feature dimensions (typically in the hundreds) can be used to describe all the words in a typical adult-scale vocabulary (here approximated with a list of ~35,000 frequent words of American English). The representation of a single word is sparse and disjoint: for example, a typical concrete noun in the 300-dimension model might use only 30 of the features, and those features would be largely disjoint from the ones used by other word types (e.g. abstract nouns, verbs, function words). Within the space, words should have both taxonomic neighbours (e.g. judge is near to referee) and topical neighbours (e.g. judge is near to prison).
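
The sparse, non-negative word codes described above are produced by the NNSE factorisation step. The released models were built with the authors' own solver, but the general idea - factorising a dense word-by-feature matrix X into sparse, non-negative codes A and a dictionary D so that X is approximated by the product AD - can be sketched with off-the-shelf dictionary learning; the sketch below uses random data and illustrative parameter settings, and is not the authors' implementation:

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    # X stands in for the dense word-by-corpus-feature matrix (the PPMI/SVD
    # output); random non-negative data keeps the sketch self-contained
    rng = np.random.RandomState(0)
    X = np.abs(rng.randn(200, 50))

    model = MiniBatchDictionaryLearning(
        n_components=20,                 # number of latent dimensions (illustrative)
        alpha=1.0,                       # sparsity penalty on the codes
        positive_code=True,              # word codes constrained to be >= 0
        fit_algorithm="cd",              # coordinate descent supports the positivity constraint
        transform_algorithm="lasso_cd",
        random_state=0,
    )
    A = model.fit_transform(X)           # words x dimensions: sparse, non-negative
    D = model.components_                # dimensions x corpus features
    print(A.shape, D.shape, "non-zero code entries:", (A > 0).mean())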

The features can also be interpreted, and often encode prominent aspects of meaning, such as taxonomic categories, topical associations and word senses/usages. Here are a couple of examples, giving the most prominent semantic dimensions for a word, and characterising each of those dimensions in turn by its most prominent word-members.

Representation for apple

Weight   Top Words (per weighted dimension)
0.40     raspberry, peach, pear, mango, melon
0.26     ripper, aac, converter, vcd, rm
0.14     cpu, intel, mips, pentium, risc
0.13     motorola, lg, samsung, vodafone, alcatel
0.11     peaches, apricots, pears, cherries, blueberries

Representation for motorbike

Weight   Top Words (per weighted dimension)
0.69     bike, mtb, bikes, harley, motorcycle
0.35     canoe, raft, scooter, kayak, skateboard
0.15     sedan, dealership, dealerships, dealer, convertible
0.10     attorney, malpractice, lawyer, attorneys, lawyers
0.08     earnhardt, speedway, irl, indy, racing
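
The tables above can be produced mechanically: take a word's largest weights, and characterise each of those dimensions by the vocabulary items that load most heavily on it. Here is a minimal sketch, assuming a numpy matrix X (words by dimensions) and a parallel word list words have already been loaded (both names are placeholders; see the loading example in the Download section):

    import numpy as np

    def describe(word, words, X, n_dims=5, n_members=5):
        """Print a word's top-weighted dimensions and, for each,
        the words that load most heavily on that dimension."""
        vec = X[words.index(word)]
        for d in np.argsort(-vec)[:n_dims]:
            if vec[d] <= 0:
                break
            members = [words[j] for j in np.argsort(-X[:, d])[:n_members]]
            print(f"{vec[d]:.2f}\t{', '.join(members)}")

    # usage, once `words` and `X` are loaded:
    # describe("apple", words, X)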

Download   

Several of the models used in the paper can be downloaded below as zipped plain-text files. Each line is a tab-delimited word entry: the first field is the word token, and the following fields are the values of a fixed number of semantic feature dimensions. Comment lines starting with '#' can be ignored.

Full document and dependency model, NNSE reduced [number of output dimensions: 50 | 300 | 1000 | 2500]

Dependency model (taxonomic relatedness), NNSE reduced [number of output dimensions: 300]

Document model (topical relatedness), NNSE reduced [number of output dimensions: 300]
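
The sketch below reads one of these files into memory and queries nearest neighbours by cosine similarity; the file name is a placeholder for whichever model is downloaded and unzipped, and the format follows the description above:

    import numpy as np

    def load_nnse(path):
        """Load a tab-delimited NNSE file: word token, then a fixed
        number of feature values; '#' lines are comments."""
        words, vectors = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue
                fields = line.rstrip("\n").split("\t")
                words.append(fields[0])
                vectors.append([float(v) for v in fields[1:]])
        return words, np.array(vectors)

    def nearest(word, words, X, k=10):
        """Return the k nearest neighbours of `word` by cosine similarity."""
        i = words.index(word)
        norms = np.linalg.norm(X, axis=1) + 1e-12
        sims = (X @ X[i]) / (norms * norms[i])
        return [(words[j], float(sims[j])) for j in np.argsort(-sims) if j != i][:k]

    # usage (placeholder file name for an unzipped 300-dimension model):
    # words, X = load_nnse("nnse_300d.txt")
    # print(nearest("judge", words, X))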


References

More details of the scheme are given in this paper:
    Brian Murphy, Partha Talukdar and Tom Mitchell, 2012: Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding, International Conference on Computational Linguistics (COLING 2012), Mumbai, India. [Paper]
... and there is further background in:
    Brian Murphy, Partha Talukdar and Tom Mitchell, 2012: Selecting Corpus-Semantic Models for Neurolinguistic Decoding. Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, pages 114-123. [Paper]


Feel free to e-mail with comments or questions: brianmurphy@cmu.edu.