Understanding How Multiple Classes of Individuals Differ

Data that describe multiple classes of individuals can be exploited to enhance decision making. One way is to automatically generate a predictive scheme that, given a new individual, will predict the class of the individual (i.e., classify it), or predict the value of some feature. Such approaches have been successful in a variety of areas, such as determining the creditworthiness of a credit card applicant, where only two classes are involved: creditworthy or not.

However, sometimes an automated predictive scheme is not the goal. Rather, one seeks a general understanding of how the various classes differ, and tries to express these differences very concisely and intelligibly. The new understanding can aid further decision making or data gathering, by pinpointing the key features and signalling how the classes differ with respect to these features.

For this problem, methods intended for prediction might not be the solution, because they emphasize predictive accuracy and are not designed around conciseness (minimizing the use of features) or ease of human understanding.

Our work with scientific and engineering data has motivated the design of new algorithms that address this very problem: expressing concisely how multiple classes differ, as reflected in the data. The problem is similar to descriptive discriminant analysis, but our approach is more general, optimizes for conciseness, leads to intelligible answers, and handles equally well data containing mixtures of numeric, yes/no, and nominal features.
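To make the problem statement concrete, here is a minimal sketch in Python. It is not the algorithm used in our software; it is an exhaustive search written only for illustration, and it assumes absolute (not statistical) differences. The function names, the toy data, and the rule that two classes differ on a numeric feature when their value ranges do not overlap (or, for yes/no and nominal features, when they share no value) are all assumptions made for this example.

```python
from itertools import combinations

def separates(values_a, values_b):
    """True if a feature contrasts two classes absolutely (illustrative rule)."""
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in values_a + values_b)
    if numeric:
        # Numeric feature: the two value ranges must not overlap.
        return max(values_a) < min(values_b) or max(values_b) < min(values_a)
    # Yes/no or nominal feature: the classes must share no value.
    return set(values_a).isdisjoint(values_b)

def smallest_discriminating_features(data):
    """data maps class name -> list of individuals (dicts of feature -> value).

    Returns a smallest feature subset in which every pair of classes is
    contrasted by at least one feature, plus the per-feature coverage.
    """
    classes = sorted(data)
    features = sorted(next(iter(data.values()))[0])
    pairs = list(combinations(classes, 2))
    # For each feature, record which class pairs it contrasts.
    covers = {
        f: {(a, b) for a, b in pairs
            if separates([x[f] for x in data[a]], [x[f] for x in data[b]])}
        for f in features
    }
    # Try subsets in order of increasing size, so the first hit is minimal.
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            if set().union(*(covers[f] for f in subset)) == set(pairs):
                return subset, covers
    return None, covers

# Hypothetical toy data mixing numeric, yes/no, and nominal features.
data = {
    "A": [{"length": 1.0, "winged": True, "color": "red"},
          {"length": 1.5, "winged": True, "color": "blue"}],
    "B": [{"length": 1.2, "winged": False, "color": "red"},
          {"length": 1.4, "winged": False, "color": "blue"}],
    "C": [{"length": 3.0, "winged": True, "color": "blue"},
          {"length": 3.5, "winged": True, "color": "green"}],
}

subset, covers = smallest_discriminating_features(data)
print(subset)
for feature in subset:
    print(feature, "contrasts", sorted(covers[feature]))
```

On this toy data no single feature contrasts all three pairs of classes, so the search settles on two features, length and winged, and color is left out because it contrasts no pair. Exhaustive search is practical only for small feature sets; it is used here because it makes the conciseness objective explicit.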

Ongoing scientific applications include: (1) in biology, understanding how different proteins are expressed in cells, in terms of features calculated from images (currently half a dozen proteins, about 100 examples of each protein, and 62 numeric image features); and (2) in psycholinguistics, understanding how various language disorders affect speech patterns contained in a large database of language transcripts. Other applications have involved several dozen classes.

Our next steps are to improve how the software handles statistical (not absolute) differences among the classes, and to extend its current capability for inventing derived features from the data, which is proving quite useful. In particular, we will apply the software collaboratively to new problems where the implications of the data need to be understood by people.

Ref: Maximally Parsimonious Discrimination: A Generic Task from Linguistic Discovery