Blitzer et al, ACL 2007
From ScribbleWiki: Analysis of Social Media
This page maintained by: Mahesh Joshi
Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification
John Blitzer, Mark Dredze, Fernando Pereira
This paper deals with the important problem of domain adaptation in supervised learning, specifically for the task of sentiment classification. Domain adaptation is the problem of applying a classification model learned from labeled data in one domain to prediction in another domain, assuming the task itself stays the same (for example, classifying book reviews using a model trained on movie reviews).
The authors address two key questions:
- How can one cope with training and test data distributions that differ substantially, specifically in terms of which features are predictive in each domain?
- How should the initial domain for labeling be chosen, so that the resulting labeled data adapts best to new domains later?
For the first question, the key idea in this paper is a better selection criterion for the pivot features used in the structural correspondence learning (SCL) approach to domain adaptation, which the authors proposed earlier. Pivot features are features that are common and important in both domains, and hence can be used to map other features across domains: for example, non-pivot features from different domains that are positively correlated with the same pivot feature are likely good candidates for cross-domain feature alignment. In the original SCL paper, pivot features were selected purely by frequency; here, the authors instead select pivots by the mutual information between each feature and the labels available in the source domain, yielding a much better set of pivot features.
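To make the mutual-information criterion concrete, here is a small self-contained sketch (not the authors' code; the function names, the bag-of-words representation, and the toy thresholds are illustrative assumptions). It ranks features that occur in both domains by their mutual information with the source-domain labels:

```python
import math
from collections import Counter

def mutual_information(feature_presence, labels):
    """MI (in nats) between a binary feature indicator and binary labels."""
    n = len(labels)
    joint = Counter(zip(feature_presence, labels))
    f_marg = Counter(feature_presence)
    y_marg = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        p_f = f_marg[f] / n
        p_y = y_marg[y] / n
        mi += p_joint * math.log(p_joint / (p_f * p_y))
    return mi

def select_pivots(docs_by_domain, source_labels, min_count=2, k=3):
    """Pick features frequent in BOTH domains, ranked by MI with source labels.

    docs_by_domain: {"source": [set of tokens per doc], "target": [...]}
    source_labels: sentiment labels aligned with docs_by_domain["source"].
    """
    counts = {d: Counter(tok for doc in docs for tok in doc)
              for d, docs in docs_by_domain.items()}
    # Frequency filter: a pivot must be common in both domains.
    common = [w for w in counts["source"]
              if counts["source"][w] >= min_count
              and counts["target"][w] >= min_count]
    # Rank by mutual information with the source labels.
    def mi_with_label(w):
        presence = [int(w in doc) for doc in docs_by_domain["source"]]
        return mutual_information(presence, source_labels)
    return sorted(common, key=mi_with_label, reverse=True)[:k]
```

On toy review data this prefers words like "excellent" or "awful" (frequent in both domains and strongly predictive of the source label) over words like "the" (frequent everywhere but with zero mutual information with the label), which is exactly the improvement over frequency-only pivot selection.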
Evaluation of this new version of SCL (termed SCL-MI) on four product review domains from Amazon (books, DVDs, electronics and kitchen appliances) shows accuracy improvements over the original frequency-based SCL in most source-target pairs.
Further, the authors show that when a small amount of labeled data from the target domain is available, erroneous cross-domain feature alignments produced by SCL-MI can be corrected, increasing performance further.
For the second question, the authors make use of the previously proposed concept of A-distance, which takes into consideration only those differences between domains that affect classification accuracy. For a given set of domains, the A-distance (actually a proxy for it, the average per-instance Huber loss of a linear classifier trained to distinguish the two domains) is computed for every pair of domains. This proxy was found to correlate well with the domain adaptation loss: the larger the distance between two domains, the larger the loss incurred when adapting across them. It thus provides an elegant criterion for selecting the domains for which labeled data should be created: choose them so that the A-distance to the domains that will remain unlabeled is minimized.
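A rough sketch of the proxy computation, assuming the classification-error form of the proxy A-distance from Ben-David et al. (the paper itself uses a Huber-loss variant and a proper learner rather than this toy perceptron; function names and hyperparameters here are illustrative):

```python
import random

def proxy_a_distance(xs_a, xs_b, epochs=20, lr=0.1, seed=0):
    """Proxy A-distance between two unlabeled feature samples.

    Trains a perceptron-style linear separator to tell domain A from
    domain B, then reports d_A = 2 * (1 - 2 * err), where err is the
    separator's error on the pooled data. Hard-to-separate domains
    give err near 0.5 and a proxy near 0; easily separated domains
    give err near 0 and a proxy near 2.
    """
    data = [(x, 1) for x in xs_a] + [(x, -1) for x in xs_b]
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * (dim + 1)  # last weight is the bias term
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if y * score <= 0:  # misclassified: perceptron update
                for i, xi in enumerate(x):
                    w[i] += lr * y * xi
                w[-1] += lr * y
    err = sum(
        1 for x, y in data
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + w[-1]) <= 0
    ) / len(data)
    return 2.0 * (1.0 - 2.0 * err)
```

Computing this proxy for every pair of candidate domains and then labeling the domains whose average proxy distance to the remaining (unlabeled) domains is smallest mirrors the selection strategy described above.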
Overall this paper represents a significant step in the area of domain adaptation.