Support Distribution Machines

Let's say we have a bunch of labeled data. Given new, unlabeled data, we want to be able to predict its labels. This is the standard machine learning problem setting of classification (if the labels are distinct classes) or regression (if the labels are continuous values).

Most algorithms to solve these problems operate by extracting feature vectors from the underlying data, and then finding regions in feature space that correspond to the classes, or smooth-ish functions in that space that match the regression problem.

For example, if you're classifying emails as spam or not spam, you might choose one feature dimension that corresponds to the number of times that the word "account" is used, another for the number of times "pharmacy" appears, another for whether the sender is in your address book, and so on.
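As a purely hypothetical sketch, a feature extractor along those lines might look like the following; the field names and word list are made up for illustration:

```python
# Hypothetical spam features as a fixed-length vector; the email fields
# and the chosen words are illustrative only.
def email_features(email, address_book):
    body = email["body"].lower()
    return [
        body.count("account"),     # how often "account" appears
        body.count("pharmacy"),    # how often "pharmacy" appears
        1.0 if email["sender"] in address_book else 0.0,  # known sender?
    ]
```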

These choices are very important: even the fanciest learning algorithm can't do anything with the wrong features, and even the simplest can often do very well with the right ones.

But sometimes it's not easy to summarize your data in a single, fixed-length vector. One case where it's particularly difficult: when your data really consists of a set of points.

In this case, the obvious way to summarize the set (the mean of the vectors) often isn't enough for classification. You could try coming up with other kinds of summaries, but our method instead directly compares the amount by which the sets overlap. Technically, we do this by treating the points as samples from some unknown probability distribution, and then statistically estimating a distance between those distributions: the KL divergence, the closely related Rényi divergence, the L2 distance, or other similar distances.
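As a concrete illustration, here is a minimal sketch of a k-nearest-neighbor KL divergence estimator in the style of Wang, Kulkarni and Verdú (2009), one member of this family of estimators; the function name and defaults are mine, not the API of the packages mentioned below.

```python
# Sketch of a k-nearest-neighbor estimator of KL(P || Q), in the style
# of Wang, Kulkarni & Verdu (2009). Names here are illustrative, not
# the py-sdm / skl-groups API.
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(x, y, k=3):
    """Estimate KL(P || Q) from samples x ~ P, shape (n, d), and y ~ Q, shape (m, d)."""
    n, d = x.shape
    m = y.shape[0]
    # rho: distance from each x_i to its k-th nearest neighbor within x,
    # excluding x_i itself (hence k + 1: each point is its own 0-distance
    # neighbor). Assumes no duplicate points, so rho > 0.
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # nu: distance from each x_i to its k-th nearest neighbor in y.
    nu = cKDTree(y).query(x, k=k)[0]
    if nu.ndim == 2:  # query returns a 2-D array when k > 1
        nu = nu[:, -1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

# Sanity check: for x ~ N(0, I_2) and y ~ N((0.5, 0.5), I_2) the true
# KL divergence is 0.25, and the estimate should land near it:
# rng = np.random.default_rng(0)
# print(knn_kl_divergence(rng.normal(0, 1, (500, 2)),
#                         rng.normal(0.5, 1, (600, 2))))
```

The intuition: wherever the x points sit in regions that y leaves sparse, nu is large relative to rho, and the estimated divergence grows.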

Some real-world examples where your data comes in sets:

More details on the method are available in some papers:

The skl-groups and py-sdm packages implement this in Python. (skl-groups is newer, better documented, and all-around cooler, but not yet feature-complete.)
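To make the whole pipeline concrete, here is a rough sketch of what a support distribution machine computes, reusing the knn_kl_divergence helper from the sketch above. It mirrors the idea rather than the actual skl-groups or py-sdm interfaces: estimate all pairwise divergences between the sets, symmetrize the estimates, turn them into RBF-style similarities, and project the result onto the positive semidefinite cone so it is a valid kernel for an SVM.

```python
# Rough sketch of a support distribution machine, reusing the
# knn_kl_divergence helper defined above. This mirrors the idea, not
# the packages' actual APIs.
import numpy as np
from sklearn.svm import SVC

def sdm_kernel(sets, k=3, gamma=1.0):
    """Map a list of (n_i, d) point sets to a PSD kernel matrix."""
    divs = np.array([[knn_kl_divergence(a, b, k=k) for b in sets]
                     for a in sets])
    divs = np.maximum((divs + divs.T) / 2, 0)  # symmetrize; clip negative estimates
    kern = np.exp(-gamma * divs)               # turn distances into similarities
    # exp(-gamma * KL) is not positive semidefinite in general, so zero
    # out negative eigenvalues to get a kernel the SVM solver can use.
    w, v = np.linalg.eigh(kern)
    return (v * np.maximum(w, 0)) @ v.T

# Toy demo: ten sets from each of two Gaussians, classified with an SVM
# on the precomputed kernel.
rng = np.random.default_rng(0)
sets = ([rng.normal(0, 1, (80, 2)) for _ in range(10)]
        + [rng.normal(1, 1, (80, 2)) for _ in range(10)])
labels = [0] * 10 + [1] * 10
svm = SVC(kernel="precomputed").fit(sdm_kernel(sets), labels)
```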
