Machine Learning Preprocessing Assistant

CONTENTS

   [1]The Project
   [2]Plan
        [3]Datasets
        [4]Transforms
        [5]Preprocessors
        [6]PState
        [7]Primary Operations
   [8]Timetable
        [9]By July 1st
        [10]By September 1st
        [11]November 1st
        [12]Jan 1st, 1997
        [13]Jan 1st - March 1st, 1997
   [14]Deliverables

The Project

This project is funded by 3M Corporation and has two parts:

   -- Systems development of a potentially very useful machine learning tool for preprocessing datasets.
   -- Research into autonomous systems for preprocessing data.

The project involves tools to help the user find useful transforms of the data (including various principal components methods, feature selection methods, useful non-linear transforms of data, possibly time-series data transforms, possibly outlier removal, and user-specified transforms). More interestingly, it will guide the user in searching the enormous combinatorial space of possible transforms.

The scientific question is: to what extent can automated search procedures do the job of a human at the preprocessing stage of Machine Learning? Can AI enhance a statistician's job? Can AI save a statistician time? Can AI enhance a naive user's job?

Plan

The software is written in C and uses the standard Auton libraries. The main objects in the system are DATASETS, TRANSFORMS, PREPROCESSORS, and PSTATES. The preprocessing algorithms themselves take a DATASET and a PREPROCESSOR specification and produce a TRANSFORM.

Datasets

A DATASET is a list of columns. Each column has a name and a vector of real values. All columns' value vectors have the same length. A DATASET in which each column's value vector has length N is said to have N ROWS. The i'th row is formed by taking the i'th value of each column in turn.

Example:

   Temp   Pressure   Speed   . . .   Quality
   97     4300       19      . . .   8
   98     4100       35      . . .   6
   56     6752       32      . . .   9
   .      .          .       . . .   .
   .      .          .       . . .   .
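As an illustration of the DATASET description above, here is a minimal sketch in C. The type and function names (Column, Dataset, dataset_find_column, dataset_get_row) are hypothetical and chosen only for this example; they are not taken from the project or from the Auton libraries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout, for illustration only. */
typedef struct {
    const char *name;    /* column name, e.g. "Temp" */
    double     *values;  /* one real value per row */
} Column;

typedef struct {
    size_t  num_cols;
    size_t  num_rows;    /* every column holds exactly this many values */
    Column *cols;
} Dataset;

/* Return the index of the column called `name`, or -1 if there is none. */
int dataset_find_column(const Dataset *ds, const char *name)
{
    size_t c;
    for (c = 0; c < ds->num_cols; c++) {
        if (strcmp(ds->cols[c].name, name) == 0) {
            return (int)c;
        }
    }
    return -1;
}

/* Copy the i'th row into `out` by taking the i'th value of each column
   in turn. `out` must have room for ds->num_cols values. */
void dataset_get_row(const Dataset *ds, size_t i, double *out)
{
    size_t c;
    assert(i < ds->num_rows);
    for (c = 0; c < ds->num_cols; c++) {
        out[c] = ds->cols[c].values[i];
    }
}
```

Storing the data column by column mirrors the definition of a DATASET as a list of columns; a row is then assembled on demand rather than stored.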
(Future inclusions: provision for missing data to be represented, and a special TIMESTAMP column.)

Transforms

A TRANSFORM is a specification of how one dataset should be transformed into another. It is a list of INCLUDEs followed by a list of REMOVEs.

An INCLUDE consists of a new column name, followed by a formula specifying how elements in the new column should be computed from elements in other columns.

A REMOVE consists of the name of a column already in the dataset. It denotes the removal of the named column from the new dataset. Any column in the original dataset that does not appear in a REMOVE instruction will appear unchanged in the new dataset.

Example:

   include princomp1 = 0.976 * (Temp - 96.4) + 0.04 * (Pressure - 4900) + ...
   .
   include log_quality = log(Quality)
   .
   .
   remove Pressure
   remove Temp

Preprocessors

A PREPROCESSOR is a simple data structure specifying a desired preprocessing algorithm and optional arguments to it. Examples:

   princomps maxcomps 8 threshold 0.05

meaning "Perform principal components analysis, and include 8 or fewer components, ignoring any that explain less than 5% of the variance", and

   featureselect regression predicting log_quality maxfeats 4

meaning "Find up to four of the most important features for predicting log_quality, assuming that linear regression must be used".

The intended preprocessors include (in order of complexity and likelihood of accomplishment during the project):

   -- PCA
   -- Non-linear PCA
   -- PCA-for-prediction
   -- Feature selection with polynomial regression
   -- Feature selection with fast decision trees
   -- Feature selection with GMBL
   -- Nonlinear transforms
   -- Removal of redundant (mutually predicting) input-pairs
   -- Rank transforms
   -- Timeseries time-weighted averaging
   -- Timeseries moving time-window
   -- Hidden feature imputation (Kan's thesis work)

The intention is that adding new preprocessors should be easy.
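To make the PREPROCESSOR specifications above concrete, the following is a rough sketch in C of how a line such as "princomps maxcomps 8 threshold 0.05" might be held and queried. Everything here is an assumption made for illustration: the Preprocessor struct, the function names, the fixed size limits, and the convention that an argument's value is simply the token following its keyword. The actual grammar of a specification is not defined in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TOKENS 16   /* illustrative limit, not from the project */
#define MAX_TOKEN  32   /* token buffer size including the terminator */

/* Hypothetical structure: tokens[0] names the algorithm and the
   remaining tokens are its optional arguments. */
typedef struct {
    int  num_tokens;
    char tokens[MAX_TOKENS][MAX_TOKEN];
} Preprocessor;

/* Split `spec` on whitespace into `out`. Returns 0 on success, or -1 if
   the specification is empty or has too many tokens. Tokens longer than
   MAX_TOKEN - 1 characters are split, which a real parser should reject. */
int preprocessor_parse(const char *spec, Preprocessor *out)
{
    char token[MAX_TOKEN];
    int  consumed = 0;

    assert(spec != NULL && out != NULL);
    out->num_tokens = 0;
    while (sscanf(spec, "%31s%n", token, &consumed) == 1) {
        if (out->num_tokens == MAX_TOKENS) {
            return -1;
        }
        strcpy(out->tokens[out->num_tokens], token);
        out->num_tokens++;
        spec += consumed;
    }
    return out->num_tokens > 0 ? 0 : -1;
}

/* Return the token that follows `keyword`, or NULL if the keyword is
   absent or is the last token. */
const char *preprocessor_arg(const Preprocessor *p, const char *keyword)
{
    int i;
    for (i = 1; i + 1 < p->num_tokens; i++) {
        if (strcmp(p->tokens[i], keyword) == 0) {
            return p->tokens[i + 1];
        }
    }
    return NULL;
}

/* Numeric convenience wrapper: `fallback` is returned when the keyword
   is not present. */
double preprocessor_arg_double(const Preprocessor *p, const char *keyword,
                               double fallback)
{
    const char *text = preprocessor_arg(p, keyword);
    return text != NULL ? strtod(text, NULL) : fallback;
}
```

Under these assumptions, preprocessor_arg applied to the second example with the keyword "predicting" would return "log_quality". Keeping the specification as a flat token list, rather than a per-algorithm struct, is one way a new preprocessor could be added without changing the parser.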

PState

The user and search algorithms will interact with the preprocessors in a search tree with datasets as states, transforms as operators, and preprocessors as generators of transforms. This will be encapsulated in an (as yet only sketched) PSTATE structure.

Primary Operations

   Command Line            --->   Preprocessor
   Preprocessor            --->   Explain
   Preprocessor, Dataset   --->   Transform
   Preprocessor, Dataset   --->   Explain
   Transform               --->   Explain
   Row, Transform          --->   Row
   Dataset, Transform      --->   Dataset
   Dataset                 <--->  ASCII file
   Transform               <--->  ASCII file

   "Suggester" operation:
   Dataset                 --->   Transform

Timetable

By July 1st

   -- Implement the skeleton transform, dataset, and preprocessor structures
   -- Implement PCA preprocessor
   -- Implement regression-based feature selector

By September 1st

   -- Implement GMDH-based composite feature finder
   -- Implement clever feature selectors
   -- Implement Box-Cox non-linear transforms

November 1st

   -- Timeseries transforms (with Belinda?)
   -- Hidden value imputation (with Kan)

Jan 1st, 1997

   -- Extra PCA preprocessor features
   -- Deliver source code and documentation to 3M

Jan 1st - March 1st, 1997

   -- During this period, provide support for 3M-side integration of the software into PKB or other 3M software systems.
   -- Additionally, deliver a stand-alone, type-at-console, command-line user interface.
   -- Research into an integrated Human/Search Engine preprocessor system

Deliverables

Source code written in standard ANSI C. It should, at the very least, compile under any standard UNIX compiler on UNIX and any standard C compiler on Windows (95, NT).

References

   1. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#0
   2. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#1
   3. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#2
   4. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#3
   5. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#4
   6. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#5
   7. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#6
   8. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#7
   9. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#8
  10. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#9
  11. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#10
  12. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#11
  13. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#12
  14. file://localhost/afs/cs.cmu.edu/project/learn/group/doc/mlpa.html#13