Efficient Data Mining

Anderson / Deng / Lee / Moore / Schneider

Multiresolution kd-tree representations of data clouds

In instance-based learning (aka memory-based or non-parametric learning), predictions are made by querying high-dimensional numerical data clouds, and making inferences from local clusters, locally polynomial patches, and local kernel density estimates. When we make a prediction we must gather up the effects induced by each datapoint and combine them statistically. This is so slow as to render instance-based learning impractical for datasets larger than around 10,000 to 100,000 records. Previous researchers had addressed this by either throwing away data (known as "editing the dataset") or by performing range searches over kd-trees. Range searching cannot help, however, in problems where all datapoints have non-zero effect (such as kernel density estimation), or where the ranges cover a large fraction of the data.

We have been working on new algorithms that avoid these problems. It uses a new kind of kd-tree in which statistics about regions of space are maintained at many resolutions simultaneously. For fixed kernel width or range size, the cost of the query increases sub-logarithmically in the number of datapoints. And, unlike "editing", once built, the same tree can be used for arbitrary and changing distance metrics and kernels. As a result, massive model selection (autonomously picking the best out of thousands of computationally expensive learning algorithms) becomes tractable, and we know of no other instance-based systems that are able to do nearly as much model selection search as ours.

Paper on Multiresolution Kernel Methods

More recent paper on Multiresolution methods for locally weighted polynomial regression

Elementary tutorial on kd-trees

Slides from talk on Multiresolution kdtrees

Constant-time "counting": ADtree representations of cached sufficient statistics

The ADtree (all-dimensions tree) is a new data structure that can empirically give a 50-fold to 1000-fold computational speedup for machine learning methods such as Bayes net structure finders, rule learners and feature selectors operating on large datasets. Many machine learning algorithms need to do vast numbers of counting queries (e.g. "How many records have Color=Blue, Nationality=British, Smoker=False, Status=Married, Nose=Big?"). Similarly, many algorithms need to build and test many contingency tables. We are researching methods for which, subject to certain assumptions, the costs of these kinds of operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table.

The ADtree is a very sparse data structure to minimize memory use. We can provide analytical worst-case memory and construction-cost bounds for this structure for several models of data distribution. We have empirically demonstrated that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We are interested in investigating the possible uses of ADTrees in other machine learning methods, and looking at the merits of ADTrees in comparison with alternative representations such as kd-trees, R-trees and frequent sets.

Slides from talk on ADtrees

The ADTree paper

Probabilistic reasoning and belief networks

We show how the ADTree can be used to accelerate Bayes net structure finding algorithms, and we are currently exploiting this technology (in which structure searches can now happen at interactive speeds) in a number of new projects.

Slides on introduction to Bayes Nets

Fast Association Rule Learning

We have also applied ADTrees to learning association rules quickly.

A new paper on ADTrees applied to association rule algorithms such as CN2.

Auton Research Index	Efficient Data Mining		Locally weighted Learning	Auton: The Big Picture
	Active Learning and Exploration		Decision and Reinforcement Learning	Auton Home Page