Machine Learning Preprocessing Assistant

CONTENTS

The Project
Plan
----- Datasets
----- Transforms
----- Preprocessors
----- PState
----- Primary Operations
Timetable
----- By July 1st
----- By September 1st
----- November 1st
----- Jan 1st, 1997
----- Jan 1st - March 1st, 1997
Deliverables

The Project

This project is funded by 3M corporation and has two parts:

-- Systems development of a potentially very useful machine learning tool for preprocessing datasets.

-- Research into autonomous systems for preprocessing data.

The project involves tools to help the user find useful transforms of the data (including various principal components methods, feature selection methods, useful non-linear transforms of data, possibly time-series data transforms, possibly outlier removal, and user-specified transforms). More interestingly, it will guide the user in searching the enormous combinatorial space of possible transforms. The scientific question is: to what extent can automated search procedures do the job of a human at the preprocessing stage of machine learning? Can AI enhance a statistician's job? Can AI save a statistician time? Can AI enhance a naive user's job?

Plan

The software is written in C, and uses the standard Auton libraries.

The main objects in the system are DATASETs, TRANSFORMs, PREPROCESSORs and PSTATEs. The preprocessing algorithms themselves take a DATASET and a PREPROCESSOR specification, and produce a TRANSFORM.

Datasets

A DATASET is a list of columns. Each column has a name and a vector of real values. Every column's value vector has the same length. A DATASET in which each column's value vector has length N is said to have N ROWS. The i'th row is formed by taking the i'th value of each column in turn.

Example:

      Temp   Pressure  Speed . . .  Quality
      97      4300      19   . . .    8
      98      4100      35   . . .    6
      56      6752      32   . . .    9
       .       .        .    . . .    .
       .       .        .    . . .    .
       .       .        .    . . .    .

(Future inclusions: Provision for Missing Data to be represented, a special TIMESTAMP column)
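
The description above could be laid out in C roughly as follows. This is only a sketch: the type names (column, dataset) and the functions dataset_get_row and dataset_find_column are hypothetical and are not taken from the Auton libraries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: one named column of real values. */
typedef struct {
    const char *name;
    const double *values;   /* num_rows entries, owned by the caller */
} column;

/* A dataset is a list of columns that all share the same length. */
typedef struct {
    size_t num_cols;
    size_t num_rows;
    const column *cols;
} dataset;

/* Copy row i into out[0..num_cols-1]: the i'th value of each column in turn. */
void dataset_get_row(const dataset *ds, size_t i, double *out)
{
    size_t c;
    assert(ds != NULL && out != NULL);
    assert(i < ds->num_rows);
    for (c = 0; c < ds->num_cols; c++)
        out[c] = ds->cols[c].values[i];
}

/* Return the index of the named column, or -1 if there is no such column. */
int dataset_find_column(const dataset *ds, const char *name)
{
    size_t c;
    assert(ds != NULL && name != NULL);
    for (c = 0; c < ds->num_cols; c++)
        if (strcmp(ds->cols[c].name, name) == 0)
            return (int)c;
    return -1;
}
```

Storing the data column by column matches the definition of a DATASET as a list of columns; a row is then assembled on demand rather than stored.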

Transforms

A TRANSFORM is a specification of how one dataset should be transformed into another. It is a list of INCLUDEs followed by a list of REMOVEs. An INCLUDE consists of a new column name, followed by a formula specifying how elements in the new column should be computed from elements in other columns. A REMOVE consists of the name of a column already in the dataset. It denotes the removal of the named column from the new dataset. Any column in the original dataset that does not appear in a REMOVE instruction will appear unchanged in the new dataset.

Example:

      include princomp1 = 0.976 * (Temp - 96.4) + 0.04 * (Pressure - 4900) + ...
      include log_quality = log(Quality)
      .
      .
      remove Pressure
      remove Temp
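
One way the INCLUDE/REMOVE semantics might be applied to a single row is sketched below. Everything here is an assumption made for illustration: the types include_op and transform, the function transform_apply_row, the choice to represent a formula as a C function pointer, the ordering of the output (kept original columns first, then included columns), and the example formula speed_squared, which does not appear in the plan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical formula type: computes one new value from an input row. */
typedef double (*formula_fn)(const double *row);

typedef struct {
    const char *new_name;   /* name of the column to add */
    formula_fn formula;     /* how to compute it from the old row */
} include_op;

typedef struct {
    size_t num_includes;
    const include_op *includes;
    size_t num_removes;
    const char *const *removes;   /* names of columns to drop */
} transform;

static int is_removed(const transform *t, const char *name)
{
    size_t r;
    for (r = 0; r < t->num_removes; r++)
        if (strcmp(t->removes[r], name) == 0)
            return 1;
    return 0;
}

/* Apply t to one row. Columns not named in a REMOVE are copied unchanged,
   then one value is appended per INCLUDE (this ordering is an assumption).
   Returns the number of values written to out. */
size_t transform_apply_row(const transform *t,
                           const char *const *names, size_t num_cols,
                           const double *in, double *out)
{
    size_t c, k, n = 0;
    assert(t != NULL && names != NULL && in != NULL && out != NULL);
    for (c = 0; c < num_cols; c++)
        if (!is_removed(t, names[c]))
            out[n++] = in[c];
    for (k = 0; k < t->num_includes; k++)
        out[n++] = t->includes[k].formula(in);
    return n;
}

/* Illustrative formula only: assumes Speed is column 2 of the input row. */
double speed_squared(const double *row)
{
    return row[2] * row[2];
}
```

Applying the same routine to every row of a dataset would give the dataset-level operation. Column names are compared exactly here, so a REMOVE must spell the name as it appears in the dataset.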

Preprocessors

A preprocessor is a simple data structure specifying a desired preprocessing algorithm and optional arguments to it.

Examples:

      princomps maxcomps 8 threshold 0.05
meaning "Perform principal components, and include 8 or fewer components, ignoring any that explain less than 5% of the variance".

and

      featureselect regression predicting log_quality maxfeats 4
meaning "Find up to four of the most important features for predicting log_quality, assuming you must use linear regression".
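
The "maxcomps" and "threshold" arguments in the first example could be interpreted as in the sketch below. It assumes the per-component variances are already available and sorted in decreasing order, and it reads "explain less than 5% of the variance" as a fraction of the total variance. The function name select_components is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Given per-component variances sorted in decreasing order, return how many
   components to keep: at most maxcomps, and only those whose share of the
   total variance is at least threshold (e.g. 0.05 for 5%). */
size_t select_components(const double *variances, size_t n,
                         size_t maxcomps, double threshold)
{
    double total = 0.0;
    size_t i, kept = 0;
    assert(variances != NULL);
    assert(threshold >= 0.0);
    for (i = 0; i < n; i++)
        total += variances[i];
    if (total <= 0.0)
        return 0;               /* no variance to explain */
    for (i = 0; i < n && kept < maxcomps; i++) {
        if (variances[i] / total < threshold)
            break;              /* sorted input: the rest are smaller still */
        kept++;
    }
    return kept;
}
```

For variances 6.0, 3.0, 0.6 and 0.4 with maxcomps 8 and threshold 0.05, the shares are 60%, 30%, 6% and 4%, so three components are kept.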

The intended preprocessors include (in order of complexity and likelihood of accomplishment during the project):

PCA

Non-linear PCA

PCA-for-prediction

Feature selection with polynomial regression

Feature selection with fast decision trees

Feature selection with GMBL

Nonlinear transforms

Removal of redundant (mutually predicting) input-pairs

Rank transforms

Timeseries time-weighted averaging

Timeseries moving time-window

Hidden feature imputation (Kan's thesis work)

The intention is that addition of new preprocessors is easy.

PState

The user and search algorithms will interact with the preprocessors in a search tree with datasets as states, transforms as operators, preprocessors as generators of transforms. This will be encapsulated in an (as yet only sketched) Pstate structure.
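
Since the Pstate structure is only sketched at this stage, the following is purely a guess at what a search-tree node might hold: a link to the parent state and a description of the transform that led from the parent. The type pstate, its fields, and the functions pstate_depth and pstate_path are all hypothetical, and strings stand in for the real dataset and transform objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical search-tree node: a state is a dataset, reached from its
   parent by applying one transform. */
typedef struct pstate {
    const struct pstate *parent;   /* NULL at the root (the raw dataset) */
    const char *transform_desc;    /* operator applied to the parent */
    const char *dataset_desc;      /* stand-in for the dataset itself */
} pstate;

/* Number of transforms applied between the root and s. */
size_t pstate_depth(const pstate *s)
{
    size_t d = 0;
    assert(s != NULL);
    while (s->parent != NULL) {
        s = s->parent;
        d++;
    }
    return d;
}

/* Fill path[0..depth-1] with the transforms from the root down to s.
   Returns the depth; cap is the capacity of path. */
size_t pstate_path(const pstate *s, const char **path, size_t cap)
{
    size_t depth = pstate_depth(s);
    size_t i = depth;
    assert(path != NULL && depth <= cap);
    while (s->parent != NULL) {
        path[--i] = s->transform_desc;
        s = s->parent;
    }
    return depth;
}
```

Keeping a parent pointer lets either the user or a search algorithm recover the sequence of transforms that produced any dataset in the tree.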

Primary Operations

      Command Line ---> Preprocessor
      Preprocessor ---> Explain
      Preprocessor , Dataset ---> Transform
      Preprocessor , Dataset ---> Explain
      Transform ---> Explain
      Row , Transform ---> Row
      Dataset , Transform ---> Dataset
      Dataset <---> ASCII file
      Transform <---> ASCII file
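
As an illustration of the "Command Line ---> Preprocessor" operation, a specification such as "princomps maxcomps 8 threshold 0.05" might be split into an algorithm name and a list of argument tokens. This is a minimal sketch only: the preprocessor struct, the MAX_ARGS limit, and the functions preprocessor_parse and preprocessor_arg are assumptions, not the project's actual interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 16   /* arbitrary limit chosen for this sketch */

typedef struct {
    const char *algorithm;        /* first token, e.g. "princomps" */
    size_t num_args;
    const char *args[MAX_ARGS];   /* remaining tokens, in order */
} preprocessor;

/* Split line in place on whitespace. Returns 1 on success, 0 if the line
   is empty or has more than MAX_ARGS argument tokens. The tokens point
   into line, so line must outlive p. */
int preprocessor_parse(char *line, preprocessor *p)
{
    char *tok;
    assert(line != NULL && p != NULL);
    p->num_args = 0;
    p->algorithm = strtok(line, " \t\n");
    if (p->algorithm == NULL)
        return 0;
    while ((tok = strtok(NULL, " \t\n")) != NULL) {
        if (p->num_args == MAX_ARGS)
            return 0;
        p->args[p->num_args++] = tok;
    }
    return 1;
}

/* Return the token that follows key, or NULL if key is absent or last. */
const char *preprocessor_arg(const preprocessor *p, const char *key)
{
    size_t i;
    assert(p != NULL && key != NULL);
    for (i = 0; i + 1 < p->num_args; i++)
        if (strcmp(p->args[i], key) == 0)
            return p->args[i + 1];
    return NULL;
}
```

Looking arguments up by keyword rather than by position also copes with specifications such as "featureselect regression predicting log_quality maxfeats 4", where not every token is part of a key/value pair.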

"Suggester" operation: Dataset ---> Transform

Timetable

By July 1st

Implement the skeleton transform, dataset, preprocessor structures

Implement PCA preprocessor

Implement regression-based feature selector

By September 1st

Implement GMDH-based composite feature finder

Implement clever feature selectors

Implement Box-Cox non-linear transforms

November 1st

Timeseries transforms (with Belinda?)

Hidden value imputation (with Kan)

Jan 1st, 1997

Extra PCA preprocessor features

Deliver source code and documentation to 3M

Jan 1st - March 1st, 1997

During this period, provide support for 3M-side integration of software into PKB or other 3M software systems.

Additionally, deliver a stand-alone, type-at-console-based, command-line user interface.

Research into integrated Human/Search Engine preprocessor system

Deliverables

Source code written in standard ANSI C. It should be fully compatible with (at the very least) compilation under any standard UNIX compiler on UNIX or any standard C compiler on Windows (95, NT).