CONTENTS
The Project
Plan
----- Datasets
----- Transforms
----- Preprocessors
----- PState
----- Primary Operations
Timetable
----- By July 1st
----- By September 1st
----- November 1st
----- Jan 1st, 1997
----- Jan 1st - March 1st, 1997
Deliverables
This project is funded by 3M corporation and has two parts:
-- Systems development of a potentially very useful machine learning tool for preprocessing datasets.
-- Research into autonomous systems for preprocessing data.
The project involves tools to help the user find useful transforms of the data (including various principal components methods, feature selection methods, useful non-linear transforms of data, possibly time-series data transforms, possibly outlier removal, and user-specified transforms). More interestingly, it will guide the user in searching the enormous combinatorial space of possible transforms. The scientific question is: to what extent can automated search procedures do the job of a human at the preprocessing stage of Machine Learning? Can AI enhance a statistician's job? Can AI save a statistician time? Can AI enhance a naive user's job?
The software is written in C, and uses the standard Auton libraries.
The main objects in the system are DATASETs, TRANSFORMs, PREPROCESSORs, and the PSTATE. The preprocessing algorithms themselves take a DATASET and a PREPROCESSOR specification and produce a TRANSFORM.
A DATASET is a list of columns. Each column has a name and a vector of real values. All columns' value vectors have the same length. A DATASET in which each column's value vector has length N is said to have N ROWS. The i'th row is formed by taking the i'th value of each column in turn.
Example:
Temp  Pressure  Speed  . . .  Quality
97    4300      19     . . .  8
98    4100      35     . . .  6
56    6752      32     . . .  9
.     .         .      . . .  .
.     .         .      . . .  .
.     .         .      . . .  .
(Future inclusions: Provision for Missing Data to be represented, a special TIMESTAMP column)
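The definition above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the type and function names (column, dataset, mk_dataset, dataset_row, free_dataset) are hypothetical and are not taken from the Auton libraries, and rows are numbered from zero here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types: the real Auton library structures are not shown here. */
typedef struct {
    char   *name;    /* column name, e.g. "Temp" */
    double *values;  /* one real value per row */
} column;

typedef struct {
    int     num_cols;
    int     num_rows;  /* every column holds exactly this many values */
    column *cols;
} dataset;

/* Allocate or abort; keeps the sketch short. */
static void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL)
        abort();
    return p;
}

/* Make a dataset with the given column names and num_rows zero-filled rows. */
dataset *mk_dataset(int num_cols, const char **names, int num_rows)
{
    dataset *ds;
    int c, r;

    assert(num_cols > 0 && num_rows > 0);
    ds = xmalloc(sizeof *ds);
    ds->num_cols = num_cols;
    ds->num_rows = num_rows;
    ds->cols = xmalloc(num_cols * sizeof *ds->cols);
    for (c = 0; c < num_cols; c++) {
        ds->cols[c].name = xmalloc(strlen(names[c]) + 1);
        strcpy(ds->cols[c].name, names[c]);
        ds->cols[c].values = xmalloc(num_rows * sizeof(double));
        for (r = 0; r < num_rows; r++)
            ds->cols[c].values[r] = 0.0;
    }
    return ds;
}

/* Row i (zero-based) is the i'th value of each column in turn.
   out must have room for ds->num_cols values. */
void dataset_row(const dataset *ds, int i, double *out)
{
    int c;

    assert(i >= 0 && i < ds->num_rows);
    for (c = 0; c < ds->num_cols; c++)
        out[c] = ds->cols[c].values[i];
}

/* Release everything allocated by mk_dataset. */
void free_dataset(dataset *ds)
{
    int c;

    for (c = 0; c < ds->num_cols; c++) {
        free(ds->cols[c].name);
        free(ds->cols[c].values);
    }
    free(ds->cols);
    free(ds);
}
```

Storing the data column by column mirrors the definition directly. A row is then a derived view that is assembled on demand rather than stored.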
A TRANSFORM is a specification of how one dataset should be transformed into another. It is a list of INCLUDEs followed by a list of REMOVEs. An INCLUDE consists of a new column name, followed by a formula specifying how elements in the new column should be computed from elements in other columns. A REMOVE consists of the name of a column already in the dataset. It denotes the removal of the named column from the new dataset. Any column in the original dataset that does not appear in a REMOVE instruction will appear unchanged in the new dataset.
Example:
include princomp1 = 0.976 * (Temp - 96.4) + 0.04 * (Pressure - 4900) + ...
include log_quality = log(Quality)
.
.
remove pressure
remove temp
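A minimal sketch of how a transform of this kind might be applied to a single row is given below. It is illustrative only and makes several assumptions: all names (formula_fn, include_op, transform, transform_row, centered_sum) are hypothetical, a formula is represented as a C function pointer rather than parsed text, column names are matched exactly (case-sensitively), and included columns are appended after the surviving original columns. The example formula uses only the two terms printed in the example above; the elided terms are left out.

```c
#include <assert.h>
#include <string.h>

/* All names in this sketch are hypothetical. */

/* A formula computes one new value from the values of an original row. */
typedef double (*formula_fn)(const double *row);

typedef struct {
    const char *name;     /* name of the new column */
    formula_fn  formula;  /* how to compute it from the original row */
} include_op;

typedef struct {
    int               num_includes;
    const include_op *includes;
    int               num_removes;
    const char      **removes;  /* names of original columns to drop */
} transform;

/* Return 1 if name appears in the REMOVE list (exact match assumed). */
static int is_removed(const transform *t, const char *name)
{
    int i;

    for (i = 0; i < t->num_removes; i++)
        if (strcmp(t->removes[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Row , Transform ---> Row.
 * Copies every original column that is not removed, then appends the
 * included columns, each computed from the original row.
 * out_names and out_row need room for n_in + t->num_includes entries.
 * Returns the number of columns in the output row.
 */
int transform_row(const transform *t, int n_in,
                  const char **in_names, const double *in_row,
                  const char **out_names, double *out_row)
{
    int i, n_out = 0;

    assert(n_in > 0);
    for (i = 0; i < n_in; i++) {
        if (!is_removed(t, in_names[i])) {
            out_names[n_out] = in_names[i];
            out_row[n_out++] = in_row[i];
        }
    }
    for (i = 0; i < t->num_includes; i++) {
        out_names[n_out] = t->includes[i].name;
        out_row[n_out++] = t->includes[i].formula(in_row);
    }
    return n_out;
}

/* Example formula for columns ordered Temp, Pressure, Speed, Quality.
   Only the first two terms of the princomp1 example are used. */
double centered_sum(const double *row)
{
    return 0.976 * (row[0] - 96.4) + 0.04 * (row[1] - 4900.0);
}
```

Applying the same function to every row of a dataset in turn would give the dataset-level operation.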
A preprocessor is a simple data structure specifying a desired preprocessing algorithm and optional arguments to it.
Examples:
princomps maxcomps 8 threshold 0.05
meaning "Perform principal components, and include 8 or fewer components,
ignoring any that explain less than 5% of the variance".
and
featureselect regression predicting log_quality maxfeats 4
meaning "Find up to four of the most important features for predicting
log_quality, assuming you must use linear regression".
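One way such a specification line might be held in memory is sketched below. The grammar of the specification is not defined in this plan, so the sketch assumes only that the first word names the algorithm and that a keyword's value, when it has one, is the word immediately following it. The names preprocessor, parse_preprocessor, preprocessor_arg and the size limits are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_SPEC_LEN 256  /* assumed limit on the length of a specification */
#define MAX_ARGS     16   /* assumed limit on the number of argument words */

/* Hypothetical structure. The algorithm and args pointers point into buf,
   so a preprocessor should be passed by pointer, not copied by value. */
typedef struct {
    char        buf[MAX_SPEC_LEN];  /* private copy of the text, split in place */
    const char *algorithm;          /* first word, e.g. "princomps" */
    int         num_args;
    const char *args[MAX_ARGS];     /* remaining words, in order */
} preprocessor;

/* Split spec into an algorithm name and argument words.
   Returns 1 on success, 0 if spec is empty or exceeds the limits. */
int parse_preprocessor(const char *spec, preprocessor *p)
{
    char *tok;

    assert(spec != NULL && p != NULL);
    if (strlen(spec) >= MAX_SPEC_LEN)
        return 0;
    strcpy(p->buf, spec);
    p->num_args = 0;

    tok = strtok(p->buf, " \t\n");
    if (tok == NULL)
        return 0;
    p->algorithm = tok;

    while ((tok = strtok(NULL, " \t\n")) != NULL) {
        if (p->num_args == MAX_ARGS)
            return 0;
        p->args[p->num_args++] = tok;
    }
    return 1;
}

/* Return the word that follows keyword, or NULL if keyword is absent
   or is the last word. */
const char *preprocessor_arg(const preprocessor *p, const char *keyword)
{
    int i;

    for (i = 0; i + 1 < p->num_args; i++)
        if (strcmp(p->args[i], keyword) == 0)
            return p->args[i + 1];
    return NULL;
}
```

For the first example, preprocessor_arg(&p, "maxcomps") would return the text "8". Converting it to a number and checking it for validity would be left to the individual preprocessor.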
The intended preprocessors include (in order of increasing complexity and decreasing likelihood of accomplishment during the project):
PCA
Non-linear PCA
PCA-for-prediction
Feature selection with polynomial regression
Feature selection with fast decision trees
Feature selection with GMBL
Nonlinear transforms
Removal of redundant (mutually predicting) input-pairs
Rank transforms
Timeseries Time-weighted averaging
Timeseries moving time-window
Hidden feature imputation (Kan's thesis work)
The intention is that addition of new preprocessors is easy.
The user and search algorithms will interact with the preprocessors in a search tree with datasets as states, transforms as operators, and preprocessors as generators of transforms. This will be encapsulated in an (as yet only sketched) PSTATE structure.
Command Line ---> Preprocessor
Preprocessor ---> Explain
Preprocessor , Dataset ---> Transform
Preprocessor , Dataset ---> Explain
Transform ---> Explain
Row , Transform ---> Row
Dataset , Transform ---> Dataset
Dataset <---> ASCII file
Transform <---> ASCII file
"Suggester" operation: Dataset ---> Transform
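As an illustration of the "Dataset <---> ASCII file" operation in the list above, the sketch below writes a small table as whitespace-separated text (one header line of column names, then one line per row) and reads it back. The file layout is an assumption modelled on the dataset example earlier in this plan, not a specified format. The table type, the function names, and the size limits are hypothetical, and the fixed-capacity table merely stands in for the real DATASET structure.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_COLS 8   /* assumed limits, chosen only to keep the sketch small */
#define MAX_ROWS 64
#define MAX_NAME 32

/* Hypothetical fixed-capacity stand-in for a DATASET. */
typedef struct {
    int    num_cols;
    int    num_rows;
    char   names[MAX_COLS][MAX_NAME];
    double values[MAX_ROWS][MAX_COLS];  /* values[row][col] */
} table;

/* Write a header line of column names, then one line per row. */
void write_table(FILE *f, const table *t)
{
    int r, c;

    assert(t->num_cols > 0);
    for (c = 0; c < t->num_cols; c++)
        fprintf(f, "%s%c", t->names[c], c + 1 < t->num_cols ? ' ' : '\n');
    for (r = 0; r < t->num_rows; r++)
        for (c = 0; c < t->num_cols; c++)
            /* %.17g prints enough digits to read a double back exactly */
            fprintf(f, "%.17g%c", t->values[r][c],
                    c + 1 < t->num_cols ? ' ' : '\n');
}

/* Read a table written by write_table.
   Values are read as a whitespace-separated stream, so line breaks
   between rows are not checked. Returns 1 on success, 0 on failure. */
int read_table(FILE *f, table *t)
{
    char   line[1024];
    char  *tok;
    double v;
    int    got, n = 0;

    t->num_cols = 0;
    t->num_rows = 0;
    if (fgets(line, sizeof line, f) == NULL)
        return 0;
    for (tok = strtok(line, " \t\r\n"); tok != NULL;
         tok = strtok(NULL, " \t\r\n")) {
        if (t->num_cols == MAX_COLS || strlen(tok) >= MAX_NAME)
            return 0;
        strcpy(t->names[t->num_cols++], tok);
    }
    if (t->num_cols == 0)
        return 0;

    while ((got = fscanf(f, "%lf", &v)) == 1) {
        if (n / t->num_cols >= MAX_ROWS)
            return 0;
        t->values[n / t->num_cols][n % t->num_cols] = v;
        n++;
    }
    if (got != EOF || n % t->num_cols != 0)
        return 0;  /* non-numeric text, or an incomplete final row */
    t->num_rows = n / t->num_cols;
    return 1;
}
```

A transform could be saved and restored in a similar spirit, one include or remove instruction per line.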
Implement the skeleton transform, dataset, and preprocessor structures. Implement the PCA preprocessor. Implement the regression-based feature selector.
Implement the GMDH-based composite feature finder. Implement clever feature selectors. Implement Box-Cox non-linear transforms.
Timeseries transforms (with Belinda?). Hidden value imputation (with Kan).
Extra PCA preprocessor features. Deliver source code and documentation to 3M.
During this period, provide support for 3M-side integration of software into PKB or other 3M software systems.
Additionally, deliver a stand-alone, type-at-console-based, command-line user interface.
Research into integrated Human/Search Engine preprocessor system
Source code written in standard ANSI C. It should compile, at the very least, under any standard UNIX compiler on UNIX and under any standard C compiler on Windows (95, NT).