The Machine Learning Architecture for SPR's as-yet-unnamed ML "product".

CONTENTS

  The AVSET ADT
    Functions On Avsets
  The FORMAT ADT
  Params
  Hypo
  The functions that a LEARNER may implement
    Training: Making a hypo from an examples structure
    Predicting: Filling an output avset from a hypo and an input avset
    Details: Asking other questions of the learning algorithm
    Provide a description of this learning algorithm
    Explain (maybe diagram) the result of training
    Let the user tweak what was learned
    Make default parameters
    Make params compatible
    Make a processable form of the parameters
    Process user/file requests to update parameters
    Load/Save/Check compatibility of hypos
    Freeing stuff
  The LEARNER ADT
    Learner Creation
    Learners in the examples environment
      Learner Nouns
      Learner Verbs
      Learner-supporting-functions


The AVSET ADT

An avset is a sparse representation of attribute-value pairs, suitable for
specifying a query for a prediction, and also for returning the result of a
prediction. (See the examples documentation for information on attributes.)
Because it may appear inside some efficiency-critical inner loops, some of the
operations on avsets are less simple and explicit than the simplest possible
programmer interface. Some may be implemented as macros in AMFAST mode.

Each attribute is represented by a unique integer. Typically an avset will
contain only a few of the available attributes. Suppose that a particular
avset, av, contains the attributes represented by 2, 4, and 9. Then:

  avset_attnum(av, 0) = 2
  avset_attnum(av, 1) = 4
  avset_attnum(av, 2) = 9

and in general the ith attribute held in an avset, av, is represented by the
integer avset_attnum(av, i). If you wish to iterate through all the attributes
defined in an avset, do

  for ( avset_index = 0 ; avset_index < avset_size(av) ; avset_index++ )
  {
    int att_num = avset_attnum(av,avset_index);
    ...
  }

Even if an avset contains very few attributes, one of its fields,
attnum_to_avindex, will still be large. This is because we want a fast way to
go from any attribute to its index (if any) in the avset. The other three
fields only need to contain as many elements as there are attributes in this
particular avset (rather than all the attributes in the environment).
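The exact struct layout is defined in the avset implementation, not in this
document. Purely to illustrate the description above, a layout consistent with
it might look like the following sketch. Every name and type here other than
attnum_to_avindex is invented for illustration (dyv is assumed to be a
vector-of-doubles type in the surrounding library):

  /* Hypothetical layout sketch only; the real avset struct may differ. */
  typedef struct avset_struct
  {
    ivec *attnum_to_avindex; /* one slot per attribute in the environment:
                                the avset_index of that attribute, or -1 if
                                the attribute is not defined in this avset */
    ivec *attnums;           /* hypothetical: attnums[i] = attnum of the i'th
                                defined attribute-value pair */
    ivec *symbol_values;     /* hypothetical: value of the i'th pair, if symbolic */
    dyv  *real_values;       /* hypothetical: value of the i'th pair, if real */
  } avset;

Only attnum_to_avindex grows with the total number of attributes in the
environment; the other three fields grow only with the number of pairs
actually defined, which is what makes the representation sparse while keeping
the lookup operations constant-time.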
Functions On Avsets

avset *mk_empty_avset(examples *exams)
  Makes an avset in which no attributes are assigned any values. May take up
  to O(number of attributes in exams) time and space.

void clear_avset(avset *av)
  Resets the avset so that no attribute-value pairs are defined. Takes no more
  than O(number of attributes defined in the original avset) time.

void free_avset(avset *av)
  Frees an avset.

avset *mk_copy_avset(avset *av)
  Copies an avset and returns the copy.

bool avset_attnum_symbolicp(avset *av, int att_num)
  Returns TRUE iff att_num is a symbolic attribute.

int avset_size(avset *av)
  Returns the number of attributes defined in the avset.

int avset_attnum(avset *av, int avset_index)
  Returns the attribute number of the avset_index'th attribute-value pair in
  the avset. Takes a constant (short) time.

int avset_symbol_ref(examples *exams, avset *av, int att_num)
  Returns the symbolic value associated with att_num. Signals an error if
  att_num is undefined in this avset, or if att_num is a real attribute. Takes
  a constant (short) time.

double avset_real_ref(examples *exams, avset *av, int att_num)
  Returns the real value associated with att_num. Signals an error if att_num
  is undefined in this avset, or if att_num is a symbolic attribute. Takes a
  constant (short) time.

void avset_symbol_set(examples *exams, avset *av, int att_num, int symbol_value)
  Sets the symbolic value associated with att_num in this avset. It does not
  matter whether att_num was previously defined in the avset. att_num must
  refer to a symbolic attribute, and symbol_value must be one of its legal
  symbolic values. Takes a constant (short) time.

void avset_real_set(examples *exams, avset *av, int att_num, double real_value)
  Sets the real value associated with att_num in this avset. It does not
  matter whether att_num was previously defined in the avset. att_num must
  refer to a real attribute. Takes a constant (short) time.

expo *mk_expo_from_avset(examples *exams, avset *av)
  Gives a nicely formatted description of the attribute-value pairs in the
  avset, using printed names rather than numbers where possible. Descriptions
  of the attribute-value pairs are printed in order. Currently they are shown
  in the order in which they were added to the avset. (One alternative would
  be to show them in order of increasing attnum.)

avset *mk_user_edit_avset(examples *exams, avset *initial_av)
  Lets the user change the values of attributes already defined in the avset.
  The user is prevented from choosing illegal values. If the user cancels,
  NULL is returned; otherwise a copy of the avset is returned. If initial_av
  is NULL on entry, a default avset will be created as a starting point.
  NOTE: since this function uses aforms to do the editing, it assumes that any
  string values contain NO spaces and NO double quotes.

void default_avset(examples *exams, ivec *att_nums, ivec *row_nums, avset *av)
  Fills in the avset to match the values in the first row of the examples.
  Only fills in the attributes mentioned in att_nums.

void avset_from_row(examples *exams, ivec *att_nums, int row_num, avset *av)
  Fills in the avset to match the values in the row_num'th row of the
  examples. Only fills in the attributes mentioned in att_nums. (It clears any
  existing attribute-value pairs in the avset before filling them in to match
  the specified row.)

avset *mk_avset_using_attnums(examples *exams, ivec *attnums)
  Returns a newly made avset that contains the attributes in attnums and no
  others. There is no guarantee what the initial values of these attributes
  will be.
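As a quick illustration of how these accessors fit together, here is a minimal
sketch of a helper that prints every attribute-value pair defined in an avset.
The helper name fprintf_avset_pairs is invented for this example; every call
it makes is one of the functions documented above. (In real use,
mk_expo_from_avset is preferable, since it also prints attribute names.)

  #include <stdio.h>

  /* Print each defined attribute-value pair of av to stream s. */
  void fprintf_avset_pairs(FILE *s, examples *exams, avset *av)
  {
    int avset_index;
    for ( avset_index = 0 ; avset_index < avset_size(av) ; avset_index++ )
    {
      int att_num = avset_attnum(av,avset_index);
      if ( avset_attnum_symbolicp(av,att_num) )
        fprintf(s,"attribute %d has symbolic value %d\n",
                att_num,avset_symbol_ref(exams,av,att_num));
      else
        fprintf(s,"attribute %d has real value %g\n",
                att_num,avset_real_ref(exams,av,att_num));
    }
  }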
The FORMAT ADT

A format denotes requests to use given subsets of the attributes as inputs and
other given subsets as outputs in a prediction. It also allows the user to
specify a set of conjunctive constraints on which rows the learner may
consider. Finally, it contains information used when generating
test/training/holdout sets: the random seed used in such operations, a
holdout_set_fraction, and a test_set_fraction.

format *mk_copy_format(format *f)
  Returns a copy of the format.

format *mk_empty_format()
  Returns an empty format: no inputs, no outputs, no constraints.

format *mk_default_format(examples *exams)
  Creates a default format in which the rightmost column is regarded as an
  output, all other columns are inputs, and there are no constraints.

void free_format(format *f)
  Frees a format, freeing its subcomponents as well.

ivec *format_input_att_nums(format *f)
  Returns the ivec of the input attributes. The caller is permitted to add to
  and remove from this ivec.

ivec *format_output_att_nums(format *f)
  Returns the ivec of the output attributes. The caller is permitted to add to
  and remove from this ivec.

int format_output_att_num(format *f)
  Assumes there is exactly one output attribute and returns its attribute
  number. Calls my_error if this assumption is incorrect. This function is
  retained for backwards compatibility.

cset *format_constraints(format *f)
  Returns the constraints.

int format_seed(format *f)
  Returns the random seed.

double format_holdout_set_fraction(format *f)
  Returns the proportion of the dataset to be assigned to a holdout set.

double format_test_set_fraction(format *f)
  Returns the proportion of the dataset to be assigned to a test set.

ivec *mk_format_row_nums(examples *exams, format *f)
  Returns an ivec whose elements are the row-nums that obey the constraints.

void fprintf_format(FILE *s, char *m1, format *x, char *m2)
  Prints a description of the format. Since it does not refer to examples,
  attribute numbers are printed rather than attribute names.

void pformat(format *f)
  Prints a description of the format to stdout.

expo *mk_expo_from_ivec_of_attnums(examples *exams, ivec *attnums)
  Produces an expo consisting of a list of the names of the attributes whose
  attribute numbers are in attnums.

expo *mk_expo_from_format(examples *exams, format *f)
  Explains the format in a nice, user-understandable way.

format *mk_format_from_file(examples *exams, char *fname, expo **r_error_expo)
  Tries to load a format from a file. The first line should contain the input
  attribute numbers. The second line should contain the output attribute
  numbers. Each further line (if any) gives one of the constraints. If it
  loads the format successfully, it returns the format and sets *r_error_expo
  to NULL. If the load fails, it returns NULL and sets *r_error_expo to a
  non-NULL message explaining the problem.

void save_format(examples *exams, format *f, char *fname, expo **r_error_expo)
  Tries to save the format to the file named fname. If it succeeds, it sets
  *r_error_expo to NULL. If it fails, *r_error_expo is set to a non-NULL
  message explaining the problem.

format *mk_user_edit_format(examples *exams, format *oldf)
  Lets the user edit a format. If oldf is NULL, mk_default_format will be used
  to create the initial format. Returns NULL if the user cancels, and
  otherwise returns the new format.


Params

A params structure represents internal learning algorithm parameters. (A
separate discussion of "stdparams" was planned but does not appear in this
document.) From the view at this level, a params is represented as a pointer
to a block of memory:

  char *params;

The individual learning algorithm will cast this to its own internal
representation of its parameters. Some algorithms may need no internal
parameters, in which case params will be represented as NULL.


Hypo

A hypo structure represents the result of training with a learning algorithm.
It might be a set of neural net weights, a decision tree, a rule set, a
polynomial regression hypo, cached nearest-neighbor-lookup trees, etc. From
the view at this level, a hypo is represented as a pointer to a block of
memory:

  char *hypo;

The individual learning algorithm will cast this to its own internal
representation of its hypo. Some algorithms may need no internal hypo (e.g. an
inefficient implementation of nearest neighbor), in which case hypo will be
represented as NULL.
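To make the opaque-pointer convention concrete, here is a hedged sketch of how
a hypothetical learner "xxx" might hide its internal parameter struct behind
the char * that the rest of the architecture sees. The struct name and its
field are invented; plain malloc/free are used only to keep the sketch
self-contained (the real code presumably uses the library's own allocator, as
hypos are described below as being AM_MALLOCKED). The functions
mk_xxx_default_params and free_xxx_params themselves are described in the
learner-function list below.

  #include <stdlib.h>

  /* Hypothetical internal representation of xxx's parameters. */
  typedef struct xxx_params_struct
  {
    double smoothing;   /* invented example parameter */
  } xxx_params;

  char *mk_xxx_default_params(examples *exams, ivec *in_att_nums,
                              ivec *out_att_nums)
  {
    xxx_params *p = (xxx_params *) malloc(sizeof(xxx_params));
    p->smoothing = 1.0;
    return (char *) p;             /* handed back as an opaque block of memory */
  }

  void free_xxx_params(char *params)
  {
    xxx_params *p = (xxx_params *) params;  /* cast back to the internal form */
    free(p);
  }

The same casting pattern applies to hypos: the learner defines its own hypo
struct, returns it as a char * from mk_xxx_hypo, and casts it back inside
xxx_predict and free_xxx_hypo.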
The functions that a LEARNER may implement

When someone implements a learning algorithm, they implement some subset of
the following functions. None, except predict, are compulsory.


Training: Making a hypo from an examples structure

As mentioned above, some learners (e.g. naive nearest neighbor) may require no
training. But any learner that does require training must implement two
functions: mk_xxx_hypo and free_xxx_hypo, where xxx is the name of the
learner.

  char *mk_xxx_hypo(examples *exams, ivec *in_att_nums, ivec *out_att_nums,
                    char *params, ivec *train_rows, ivec *holdout_rows,
                    expo **r_error_expo);

mk_xxx_hypo trains the learner on the data in exams, considering just the
specified input and output attributes (the data may have additional attributes
not included in either in_att_nums or out_att_nums). If xxx is a learner that
needs an internal holdout set, this holdout set is defined by holdout_rows.
The rows that may be used for training are specified by train_rows. If the
learner uses internal parameters, these will be specified by the params
argument.

mk_xxx_hypo should return a pointer to the result of training (called a
"hypo"). This lump of memory will have been AM_MALLOCKED, and will eventually
be freed with free_xxx_hypo. If the learner is unable to learn using the given
set of att_nums, params, and rows, it returns NULL and sets *r_error_expo to a
non-NULL error message explaining the problem. If the learner is able to
learn, *r_error_expo should be returned as NULL.

The second function that must be present if a learner needs training is:

  void free_xxx_hypo(char *hypo)

This frees the hypo.


Predicting: Filling an output avset from a hypo and an input avset

Every learner must implement the predict function, which is called xxx_predict
where xxx is the name of the learner.

  void xxx_predict(examples *exams, ivec *in_att_nums, ivec *out_att_nums,
                   char *params, char *hypo, avset *in, avset *out)

Given the hypo, the parameters, and the input avset, this fills in the
predicted values in the output avset. The predicted values are based only on
the attributes in in_att_nums, and predictions are only calculated for the
attributes in out_att_nums (if other att_nums in "out" are defined, they are
unaffected by this operation).


Details: Asking other questions of the learning algorithm

For future expansion. A "detail" is similar to the GMBL "answer" data
structure, and specifies extra information we may wish to ask the learner
regarding gradients, confidence intervals, confusion matrices and more. Ignore
this until the "detail" module is defined.

  detail *mk_xxx_detail(examples *exams, ivec *in_att_nums, ivec *out_att_nums,
                        char *params, char *hypo, avset *in, int detail_type,
                        double confidence)

This function will become increasingly important (it may become one of the SPR
competitive advantages: "We don't just predict, we give all kinds of other
information needed for decision making"), but will not be included in any
initial learners.


Provide a description of this learning algorithm

  expo *mk_xxx_expo_from_learner()

An optional function that gives an explanation of the learning algorithm.


Explain (maybe diagram) the result of training

  report *mk_xxx_report_from_hypo(examples *exams, avset *query,
                                  ivec *in_att_nums, ivec *out_att_nums,
                                  char *params, char *hypo)

An optional function that gives a report (which may include graphics and/or
text) on the contents of the hypo.


Let the user tweak what was learned

  char *mk_xxx_user_edit_hypo(examples *exams, ivec *in_att_nums,
                              ivec *out_att_nums, char *params, char *hypo)

Allows the user to change some of the hypo to "see what happens".
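Pulling the training and prediction entry points together, here is a hedged
sketch of a deliberately trivial hypothetical learner, "constpred", which
predicts the training-set mean of a single real-valued output attribute.
Everything prefixed constpred_ is invented for illustration; ivec_size and
ivec_ref are assumed accessors on the ivec type used throughout this document,
and plain malloc/free stand in for the AM_MALLOC convention mentioned above.
Error reporting through *r_error_expo is omitted because expo construction is
not described in this document.

  #include <stdlib.h>

  /* Hypothetical internal hypo for the "constpred" learner. */
  typedef struct constpred_hypo_struct
  {
    int out_att_num;  /* the single (real) output attribute being predicted */
    double mean;      /* its mean over the training rows */
  } constpred_hypo;

  /* Training sketch: assumes exactly one real-valued output attribute and at
     least one training row; a production learner would check this and report
     failure through *r_error_expo as described above. */
  char *mk_constpred_hypo(examples *exams, ivec *in_att_nums,
                          ivec *out_att_nums, char *params, ivec *train_rows,
                          ivec *holdout_rows, expo **r_error_expo)
  {
    constpred_hypo *h = (constpred_hypo *) malloc(sizeof(constpred_hypo));
    avset *av = mk_empty_avset(exams);
    double sum = 0.0;
    int i;

    h->out_att_num = ivec_ref(out_att_nums,0);
    for ( i = 0 ; i < ivec_size(train_rows) ; i++ )
    {
      avset_from_row(exams,out_att_nums,ivec_ref(train_rows,i),av);
      sum += avset_real_ref(exams,av,h->out_att_num);
    }
    h->mean = sum / ivec_size(train_rows);

    free_avset(av);
    *r_error_expo = NULL;
    return (char *) h;
  }

  /* Prediction sketch: ignores the inputs entirely and writes the stored
     mean into the output avset. */
  void constpred_predict(examples *exams, ivec *in_att_nums,
                         ivec *out_att_nums, char *params, char *hypo,
                         avset *in, avset *out)
  {
    constpred_hypo *h = (constpred_hypo *) hypo;
    avset_real_set(exams,out,h->out_att_num,h->mean);
  }

  void free_constpred_hypo(char *hypo)
  {
    free(hypo);
  }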
Make default parameters

  char *mk_xxx_default_params(examples *exams, ivec *in_att_nums,
                              ivec *out_att_nums)

If the learner uses parameters, it must implement both this function and
free_xxx_params.


Make params compatible

  char *mk_xxx_compatible_params(examples *exams, ivec *in_att_nums,
                                 ivec *out_att_nums, char *params)

An optional function. Suppose the user changes the set of examples to be used,
or the set of input attributes, or the set of output attributes. This may mean
the parameters are now invalid. mk_xxx_compatible_params checks whether the
parameters are still compatible: if so, it returns NULL; if they are
incompatible, it returns a new set of suitable parameters. If this function is
unimplemented, the high-level code will assume that any change in the examples
or the att_nums renders the parameters invalid, and will revert to default
parameters.


Make a processable form of the parameters

  aform *mk_aform_from_xxx_params(examples *exams, ivec *in_att_nums,
                                  ivec *out_att_nums, char *params)

An optional but recommended function for any learner that uses parameters. See
the aform documentation in xgui.doc. Once you have a function to make an
aform, you also have a function to save the params to a file and to display
them to the user.


Process user/file requests to update parameters

  expo *update_xxx_params_from_keyval(examples *exams, char *params,
                                      char *key, dtvalue *dt)

An optional function. If both mk_aform_from_xxx_params and
update_xxx_params_from_keyval are defined, then the user will be able to edit
the parameters and to load parameters from a file.


Load/Save/Check compatibility of hypos

Some learning algorithms take a long time to run, so it might be worth saving
their results. The following two optional functions follow our standard
r_error_expo convention for reporting loading/saving problems: *r_error_expo
is set to NULL if the operation succeeds, and is otherwise set to a message
explaining the problem.

  char *mk_load_xxx_hypo(char *filename, expo **r_error_expo)
  void save_xxx_hypo(char *hypo, char *filename, expo **r_error_expo)

A freshly loaded hypo might be incompatible with the current dataset or the
current attributes of interest. The next (optional) function checks that:

  bool xxx_hypo_compatiblep(examples *exams, ivec *in_att_nums,
                            ivec *out_att_nums, char *hypo,
                            expo **r_error_expo)

If the hypo is compatible, this function sets *r_error_expo to NULL. If the
hypo is incompatible, it sets *r_error_expo to an explanation of the problem.
If this function is not provided, unfortunate and potentially catastrophic
errors may occur if the hypo is incompatible.


Freeing stuff

  void free_xxx_params(char *params)

Always necessary if the learner uses parameters.

  void free_xxx_hypo(char *hypo)

Always necessary if the learner uses a hypo.


The LEARNER ADT

A learner is simply a structure containing a whole set of the above function
pointers, together with a small amount of supplementary information. The
operations on learners are:

  learner *mk_empty_learner(char *learner_name)
  void free_learner(learner *le)
  expo *mk_expo_from_learner(learner *le)

mk_expo_from_learner explains what the learner can and can't do.

The programmer may also call any non-NULL functions in the learner, but must
always check that those functions exist before calling them.
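For example, the optional report function from the previous section would
typically be called through the learner structure along these lines. This is
only a sketch: the helper name and the field name mk_report_from_hypo are
assumptions, since the learner struct's actual field names are defined in
examples/learner.cpp rather than in this document.

  /* Return a report if this learner implements reporting, otherwise NULL. */
  report *mk_report_from_learner_if_possible(learner *le, examples *exams,
                                             avset *query, ivec *in_att_nums,
                                             ivec *out_att_nums,
                                             char *params, char *hypo)
  {
    if ( le->mk_report_from_hypo == NULL )  /* hypothetical field name */
      return NULL;                          /* learner does not provide reports */
    return le->mk_report_from_hypo(exams,query,in_att_nums,out_att_nums,
                                   params,hypo);
  }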
Additional learner fields (they may be accessed by other functions, and they
may be set by the creator of a learner):

  bool le->uses_holdout_set
  double le->holdout_set_fraction
  paramdesc *le->paramdesc

A paramdesc is a simple data structure that allows a higher level of the
program to know which are the important parameters of the learning algorithm
to tune, what values they may take while being tuned, and whether they are
relevant given other values of the current parameters (e.g. "there's no point
tuning the momentum rate variable if the use_momentum flag is switched off").
paramdescs come later, so this will initially be unimplemented.


Learner Creation

The implementor of a learning algorithm will deliver their software by means
of a

  learner *mk_xxx_learner()

function. See examples/linearreg.cpp for an example of this and of the general
process of implementing the functions for a new learner.

When creating a learner in mk_xxx_learner, the implementor should set the
fields to reflect all the functions they have provided. Fields representing
any unimplemented functions should be NULL. Remember that the xxx_predict
function MUST be implemented for all learners. If the learner uses training,
then mk_xxx_hypo and free_xxx_hypo must be provided. If the learner uses
parameters, then mk_xxx_default_params and free_xxx_params must be provided.

Having written mk_xxx_learner and the functions for that learner, the
implementor of a new algorithm has two more things to do: the definitions of
the following two functions, given in the file examples/learner.cpp, must both
be modified to add in the new learner.

  learner *mk_learner_from_string(char *learner_name)
  string_array *mk_learner_names()

The string used to name the learner must be identical in both cases.
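Continuing the hypothetical "constpred" learner sketched earlier, its creation
function might look roughly as follows. The function-pointer field names
(predict, mk_hypo, free_hypo) and the FALSE constant are assumptions made for
illustration; the real field names are those defined for the learner struct in
examples/learner.cpp.

  learner *mk_constpred_learner()
  {
    learner *le = mk_empty_learner("constpred");

    le->predict = constpred_predict;      /* compulsory for every learner */
    le->mk_hypo = mk_constpred_hypo;      /* this learner requires training... */
    le->free_hypo = free_constpred_hypo;  /* ...so both of these must be set */

    /* constpred uses no parameters, so the params-related fields stay NULL,
       as do all other optional function pointers. */

    le->uses_holdout_set = FALSE;         /* supplementary field described above */
    return le;
  }

The same string, "constpred", would then be added to both
mk_learner_from_string and mk_learner_names in examples/learner.cpp.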
Learners in the examples environment

There are several learner-specific nouns and verbs.

Learner Nouns

  LEARNER   type learner *
  PARAMS    type char *
  HYPO      type char *
  QUERY     type avset *

Learner Verbs

  CHOOSELEARNER
  TRAIN
  PREDICT

Learner-supporting-functions

The following functions are used to hook up the learner to a user interface.

void env_free_learner(env *e, nstate *ns, int nn)
  Also frees the params and hypo if they are defined.

void env_free_params(env *e, nstate *ns, int nn)
  Also frees the hypo if it is defined.

void env_free_hypo(env *e, nstate *ns, int nn)

void env_free_query(env *e, nstate *ns, int nn)

int env_choose_learner(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)
  The choose verb pops up a list of the names of available learners, from
  which the user must choose, then sets up the current learner (freeing any
  earlier one if necessary) accordingly. This is achieved using
  mk_learner_from_string. This function also creates default parameters, but
  no default hypo.

int env_inspect_learner(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_train_learner(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)
  Needs the format to be defined. From the format it constructs the row_nums
  for training and the holdout set, using the random seed in the format.

int env_predict_learner(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)
  Needs the query to be defined.

int env_load_params(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)
  Makes the parameters compatible after loading, and complains if they cannot
  be made compatible. If no compatibility-making function is defined,
  complains and fails.

int env_save_params(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_edit_params(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_inspect_params(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_load_hypo(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)
  Checks compatibility, and complains (and has no effect) if the hypo is not
  compatible.

int env_save_hypo(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_edit_hypo(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_inspect_hypo(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_edit_query(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

int env_inspect_query(env *e, nstate *ns, int vb, int nn, char *arg, report **r_rep)

Also, extra functionality has been added to the env_load_format and
env_edit_format commands. Because loading or changing the format changes the
learning task, these functions free the current hypo. They also attempt to
make the params compatible (as with mk_xxx_compatible_params); if that is not
possible, they create default params.