General Memory Based Learning

A Brief Explanation of the concept, the software and the C interface

Andrew W. Moore
19th December 1993
Updated 4th September 1994
awm@cs.cmu.edu

School of Computer Science and Robotics Institute
Carnegie-Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
FAX: 412-268-5571

This document has the following contents:

%%%%% WHAT IS MEMORY-BASED LEARNING?
%%%%% WHAT IS GENERAL MEMORY-BASED LEARNING?
%%%%% COMPILING GMBL
%%%%% ON-LINE HELP
%%%%% GM-STRINGS
%%%%% DATAFILES
%%%%% CROSS VALIDATION
%%%%% GMBL SEARCH: SEARCH SPECIFICATION SYNTAX
%%%%% PROGRAMMER'S GUIDE TO WRITING C APPLICATIONS LINKED TO GMBL
%%%%% THE ABSTRACT DATATYPES OF EZGM
%%%%% EZGM C FUNCTIONS: INDEX
%%%%% EZGM C FUNCTIONS: REFERENCE
%%%%% GRESP: GMBL RESPONSE SURFACE METHODOLOGY

The subsections of the document are:

%%%%% WHAT IS MEMORY-BASED LEARNING?
   %%% Introduction to Memory based learning
   %%% Nearest neighbor prediction
   %%% Kernel regression
   %%% Local weighted regression
%%%%% WHAT IS GENERAL MEMORY-BASED LEARNING?
%%%%% COMPILING GMBL
   %%% Compiling GMBL under unix
   %%% Compiling GMBL elsewhere
%%%%% ON-LINE HELP
%%%%% GM-STRINGS
   %%% Gmstring syntax
   %%% Gmstring Meaning: Reg-Degree
   %%% Gmstring Meaning: Smooth-Code
   %%% Gmstring Meaning: Neighbors
   %%% Gmstring Meaning: Attribute-Scales
   %%% Gmstring Examples
%%%%% DATAFILES
   %%% Datafile format
   %%% Comments in datafiles
%%%%% CROSS VALIDATION
%%%%% GMBL SEARCH: SEARCH SPECIFICATION SYNTAX
%%%%% PROGRAMMER'S GUIDE TO WRITING C APPLICATIONS LINKED TO GMBL
%%%%% THE ABSTRACT DATATYPES OF EZGM
   %%% Simple datatypes: Inputs, Outputs and Gmstrings
   %%% The options Datatype
   %%% The dset Datatype
%%%%% EZGM C FUNCTIONS: INDEX
   %%% Creating a dataset and adding points to it
   %%% Loading a dataset from a file
   %%% Accessing the size and shape and basic statistics of a dataset
   %%% Predicting outputs
   %%% Finding the leave-one-out error of a single point
   %%% Searching for good gmstrings
   %%% Changing and examining values stored in the "options" datatype
   %%% Finding nearest neighbors
   %%% Using kdtrees
   %%% Doing Classification instead of regression
   %%% Local gradients ("sensitivity") and local confidence
   %%% Doing graphics
%%%%% EZGM C FUNCTIONS: REFERENCE
%%%%% GRESP: GMBL RESPONSE SURFACE METHODOLOGY

%%%%% WHAT IS MEMORY-BASED LEARNING?

%%% Introduction to Memory based learning

Here, we merely summarize the essential aspects of memory-based
function approximation. A more thorough review may be found in "An
Investigation of Memory-based Function Approximators for Learning
Control" by Andrew Moore and Chris Atkeson (1992) (contact Andrew
Moore if you want a copy).

In memory-based learning, all experiences are explicitly remembered
in a large memory of input-output vectors

   { (x_1,y_1) , (x_2,y_2) ... (x_n,y_n) }

where x_1, x_2, ... are input vectors and y_1, y_2, ... are the
corresponding output vectors. When a prediction for the output of a
novel input x_query is required, the memory is searched to obtain
experiences with inputs close to x_query. These neighbors are used
to determine a locally consistent output for the query.

%%% Nearest neighbor prediction

A simple memory-based method is nearest neighbor. Given x_query,
nearest neighbor searches the memory to obtain an input-output pair
(x_i,y_i) which minimizes Dist(x_i,x_query). The prediction is then
y_i.
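To make this concrete, here is a small, purely illustrative sketch of
one-nearest-neighbor prediction in C. It is not GMBL source code and
does not use the EZGM API; every function and variable name in it is
a placeholder. The distance it computes is the scaled Euclidean
metric, whose scaling parameters S_j are defined precisely in the
next few paragraphs.

   /* Illustrative sketch only: one-nearest-neighbor lookup.
      This is NOT part of GMBL / EZGM; all names are hypothetical. */
   #include <math.h>

   /* Scaled Euclidean distance between two input vectors of length
      nin, using the scaling parameters s[0..nin-1] described below. */
   double scaled_dist(const double *x, const double *xq,
                      const double *s, int nin)
   {
     double sum = 0.0;
     int j;
     for ( j = 0 ; j < nin ; j++ )
     {
       double d = s[j] * (x[j] - xq[j]);
       sum += d * d;
     }
     return sqrt(sum);
   }

   /* Return the index of the stored input closest to the query,
      where xs[i*nin .. i*nin + nin-1] holds the i'th remembered
      input vector and npt >= 1. The prediction is then simply the
      i'th remembered output. */
   int nearest_neighbor(const double *xs, int npt, int nin,
                        const double *xq, const double *s)
   {
     int i, best = 0;
     double best_d = scaled_dist(xs, xq, s, nin);
     for ( i = 1 ; i < npt ; i++ )
     {
       double d = scaled_dist(xs + i*nin, xq, s, nin);
       if ( d < best_d ) { best_d = d; best = i; }
     }
     return best;
   }

A linear scan like this is the simplest possible lookup; the kd-tree
facilities mentioned later in this document exist to make such
lookups faster.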
A common distance metric, and the one used in GMBL, is the SCALED
EUCLIDEAN METRIC:

   Dist(x,x') = sqrt( SUM_{j=0..Nin-1}  S_j^2 * ( x[j] - x'[j] )^2 )

where Nin is the number of input variables, x[j] is the j'th
component of the vector x (indexed by j=0, j=1, ... j=Nin-1), and
S_0, S_1, ... S_{Nin-1} are SCALING PARAMETERS which greatly affect
the prediction by defining how much influence each input variable
has on it. Such scaling can play a crucial role in reducing the
prediction error in cases where one input dimension has a greater
effect on the output than another.

A common extension of nearest neighbor is to use the mean output
value of a small number k > 1 of the nearest neighbors. The
advantage is protection against noisy data, but a potential
disadvantage is a slower learning rate: bias is traded against
variance.

%%% Kernel regression

Another extension, called KERNEL REGRESSION, finds the weighted
average of the experiences in the memory:

   y_predict = ( SUM_{i=0..Npt-1} w_i * y_i ) / ( SUM_{i=0..Npt-1} w_i )

where w_i is the WEIGHT of the i'th experience. This weight depends
on the experience's distance to the query point, and is constructed
so that the closest points have the greatest effect. GMBL uses a
gaussian weighting function by default:

   w_i = exp( - d_i^2 / ( 2 * K_w^2 ) )   where   d_i = Dist(x_i,x_query)

The weighting function is parameterized by a SMOOTHING KERNEL WIDTH,
K_w, which determines the distance, in the Euclidean metric, over
which the smoothing function has a significant weight. (A short,
purely illustrative C sketch of this computation is given later, in
the WHAT IS GENERAL MEMORY-BASED LEARNING? section below.)

%%% Local weighted regression

Instead of using local averages, LOCAL REGRESSION can be applied. In
the simplest case, each time a query arrives, the k nearest
neighbors are obtained, where k is a system parameter. A
least-squares linear fit is then performed on these neighbors---this
is done by direct solution of a Nin * Nin system of linear equations
(where Nin is the number of input variables). The resulting linear
map is then applied to the query point.

LOCAL WEIGHTED REGRESSION (LWR) uses the same kind of weighting
function as kernel regression. Instead of using ordinary least
squares regression, it finds the linear map

   y = A^T x + c

(where A is a vector defining a local linear map) which minimizes
the sum of locally weighted squares of residuals:

   SUM_i  w_i^2 * ( A^T x_i + c - y_i )^2

This minimization can also be solved by means of a Nin * Nin system
of linear equations. When the kernel width (described in the "%%%
Kernel regression" section) is very large, local weighted regression
behaves in the same manner as ordinary global linear regression. So
global linear regression is a member of the GMBL family of function
approximators.

%%%%% WHAT IS GENERAL MEMORY-BASED LEARNING?

General Memory Based Learning is the product of an ongoing research
project which is investigating memory-based learning and developing
efficient algorithms to make it as computationally quick and as
autonomous as possible. It contains implementations of the methods
described in the previous sections and some variants of them. Its
goal is to be given an arbitrary dataset and to choose automatically
the best kind of memory-based learner for the data, how big the
smoothing kernel width should be, how big the "k" in
"k-nearest-neighbor" should be, how the Euclidean distance metric
should be scaled, and whether some of the inputs should be ignored.
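Before describing the GMBL software further, here is the short
sketch promised in the kernel regression section above. It is purely
illustrative: it is not GMBL source code and does not use the EZGM
API, every name in it is a placeholder, and it assumes a single
scalar output per experience.

   /* Illustrative sketch only: Gaussian-kernel regression for a
      scalar output. This is NOT part of GMBL / EZGM. */
   #include <math.h>

   double kernel_regress(const double *xs, /* npt inputs, each of length nin */
                         const double *ys, /* npt scalar outputs y_i */
                         int npt, int nin,
                         const double *xq, /* the query input x_query */
                         const double *s,  /* scaling parameters S_j */
                         double kw)        /* smoothing kernel width K_w */
   {
     double wsum = 0.0, wysum = 0.0;
     int i;
     for ( i = 0 ; i < npt ; i++ )
     {
       const double *xi = xs + i*nin;
       double dsq = 0.0, w;
       int j;
       for ( j = 0 ; j < nin ; j++ )  /* squared scaled Euclidean distance */
       {
         double d = s[j] * (xi[j] - xq[j]);
         dsq += d * d;
       }
       w = exp( -dsq / (2.0 * kw * kw) );  /* Gaussian weight w_i */
       wsum  += w;
       wysum += w * ys[i];
     }
     /* y_predict = ( SUM w_i * y_i ) / ( SUM w_i ). A real
        implementation would guard against wsum being (near) zero. */
     return wysum / wsum;
   }

Local weighted regression replaces the weighted average computed
here by a weighted linear fit using the same weights w_i, which is
why a very large K_w recovers ordinary global linear regression.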
GMBL has a straightforward, but not especially clever, user
interface for running itself from the command line, and there is a
documented C library of functions for programmers wishing to use
GMBL as part of an application program. GMBL is not being
distributed freely at this time.

%%%%% COMPILING GMBL

[This section has changed in the Sept 94 release]

The directory that this file (explain.txt) lives in is the GMBL root
directory. If you look at the contents of this directory you should
see four subdirectories:

   gmblsrc/   contains all the source and header files for gmbl,
              except for the main() function

   gmbl/      contains the main.c which has the main() function for
              compiling the gmbl application

   grespsrc/  contains all the source and header files for the gmbl
              response surface methods programs

   gresp/     contains the main.c for the gmbl response surface
              application

If you are not reading this file in the root directory of gmbl you
will need to obtain a copy of the gmbl distribution.

%%% Compiling GMBL under unix

   unix> cd <the gmbl root directory>
   unix> compile-all -O

Important note: THE FINAL CHARACTER IS A CAPITAL LETTER "Oh". NOT
LOWER CASE, AND NOT A ZERO! The -O (which may be left out if
desired) uses the optimizing compiler, which can produce code which
runs between 50% and 100% faster.

This will compile gmbl and leave the executable in
<the gmbl root directory>/gmbl/gmbl. It will also compile the gresp
program (see the section "%%%%% GRESP: GMBL RESPONSE SURFACE
METHODOLOGY" for more details).

It is IMPORTANT that all files are compiled with the same compiler.
If some are compiled with gcc and others are compiled with cc then
there are incompatibilities between the code. Notice that the
September release of GMBL is all ANSI C.

If you wish to link your own application programs to GMBL, you
should compile them (taking care to let your compiler know which
directory the gmbl header files live in; it is gmblsrc/) and link
them to gmblsrc/*.o, and also to the math and X-windows libraries,
thus:

   Compiling: gcc -c <your source files> -I<the gmbl root directory>/gmblsrc
   Linking:   gcc -o <your program> <your object files> <the gmbl root directory>/gmblsrc/*.o -lm -lX11

Please let Andrew Moore know (awm@cs.cmu.edu, fax 412-268-5571) if
you have problems.

%%% Compiling GMBL elsewhere

You need an ANSI-C compiler (this is a change from the pre-July-94
versions, which were written in old-style C). The source files are
all those in the gmblsrc directory, plus the gmbl/main.c file. The
headers are all in gmblsrc. If you are linking to your own program,
you should link to all the object files produced from the sources
EXCEPT for main.c, which contains the standard call to
main(argc,argv).

Please let Andrew Moore know (awm@cs.cmu.edu, fax 412-268-5571) if
you have problems.

%%%%% ON-LINE HELP

Once gmbl has compiled, type

   gmbl help

for a list of the available commands. For more details about any
given command, type gmbl <command-name> help, for example:

   unix> gmbl predict help

to give you extra information on the predict command.

Most commands allow you to use standard options on the command line.
Type

   unix> gmbl option list

to see a list of these. To get more information about any of these,
type gmbl option
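As a rough, purely illustrative sketch of the kind of user program
the compile-and-link instructions in the COMPILING GMBL section
above refer to, here is a skeletal application file. It is
hypothetical: the file name and the comment placeholders are not
real parts of GMBL, and the actual EZGM datatypes and function names
are documented in the "THE ABSTRACT DATATYPES OF EZGM" and "EZGM C
FUNCTIONS" sections later in this document.

   /* myapp.c -- skeletal GMBL client program (illustrative only).
      Nothing here is real EZGM API; replace the comments with calls
      to the functions documented in the EZGM reference sections,
      and include whichever headers from gmblsrc/ those functions
      need. */
   #include <stdio.h>

   int main(int argc, char *argv[])
   {
     /* 1. Create a dset, or load one from a datafile.             */
     /* 2. Optionally search for a good gmstring for that dataset. */
     /* 3. Ask for predictions at query inputs and use the results.*/
     printf("GMBL-linked application skeleton\n");
     return 0;
   }

Following the pattern given above, such a file would be compiled
with -I pointing at gmblsrc/ and linked against gmblsrc/*.o together
with -lm and -lX11.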