IBM Almaden

Download slides here (gzipped tar archive of PostScript files).

The talk generally addresses the largely-neglected theoretical issues raised by common marketing questions. The segmentation part is based on the STOC '98 paper ``Segmentation problems'', by Jon Kleinberg, Christos Papadimitriou, and Prabhakar Raghavan. The collaborative filtering work is based on joint work with Ravi Kumar, Sridhar Rajagopalan, and Andrew Tomkins.

The first category of marketing questions is
*customer segmentation*. This is a general class of problems
involving finding a selection of items that maximizes the sum over
customers of the cost of the best item for the customer.

A specific, applied example is the problem of *catalog
segmentation*. Suppose that we
have a set of items `I`, and we know that each particular
customer `x` would like to see in a catalog a subset
`S`_{x} of `I`. We can only
publish and distribute two catalogs `C`_{0}
and `C`_{1} (subsets of `I` with fixed
cardinality), but we would like to maximize the
happiness of our customers.
The happiness `H`(`x`) of customer `x` is
the maximum of the size of
`C`_{0} intersect `S`_{x}
and the size of `C`_{1} intersect
`S`_{x} (since we will mail the maximizing
catalog to `x`). We want to maximize the sum over `x`
of `H`(`x`).

The catalog segmentation problem, like almost any other segmentation
problem thus far considered, is MAXSNP-complete. There is a simple
2-approximation:
Publish only the best catalog for everybody (which can be determined
using the greedy method). Nothing better is known for the general case,
but it is natural to hope for something better than the naive
2-approximation.
(For the dense case, where every customer likes
at least a constant fraction of the products, there is a FPTAS for
a fixed number of catalogs.)
More generally, one might publish
`k` catalogs (the one-catalog algorithm gives a
`k`-approximation), or one might give a cost for publishing
each additional catalog.

Another segmentation problem is *maximum spanning tree
segmentation*. As a question from the audience pointed out, this
is not necessarily related to any serious marketing problem, but it
illustrates the breadth of segmentation problems that have yet to be
considered. Here customers have their own sets of edge weights, and
we would like to choose two spanning trees so that the sum over
customers of the weight of the better of the two spanning trees for
the customer's weights.

The other category of problems inspired by marketing questions is
*collaborative filtering*. Here we would like to recommend
items based on what customers have purchased. For each customer we have
a sample of items that customer frequently buys (milk, orange juice,
toast, bananas), and given this data from all customers we would like
to recommend to each customer something new to consider (maybe oats for
our breakfast-eater).

Even the proper model for collaborative filtering is not obvious, but a natural model is that the items are grouped into several clusters. Each customer is a distribution on clusters, and we want to identify which clusters a customer prefers based on the history.

These marketing-inspired problems - customer segmentation and collaborative filtering - represent important and very open territory for additional algorithmic research.