THE EXAMPLES MODULE

CONTENTS

INTRODUCTION
USING THE EXAMPLES MODULE
----- FILES
----- TERMINOLOGY
----- LOADING DATA
ACCESSOR FUNCTIONS
The Format data structure
Attribute Scoring Utilities
----- In the works...

INTRODUCTION

      The examples module is used to load and access data in memory.

USING THE EXAMPLES MODULE

This module reads in a data file and an optional associated attribute (.typ) file. There are two results:

      1. Data is stored in memory

      2. A .typ file with the same prefix as the data file is created if one is not present
         or specified.

FILES

1. Data File

      a) The data file must be a two-dimensional array stored in comma 
          separated value format.  

      b) There first line is ignored since it is assumed to be a title.

      c) The second row is assumed to be the names of the columns.

      d) Empty values can be specified by the strings "*NOVALUE*", "?" 
         or also the empty string (i.e. two commas: "...entry,,entry,...").

      e) The data file extension is recommended to be ".csv".

2. Attribute File

      a) The attribute file is used to specify the types and values of the attributes in the data file.
      	When data is being loaded from the data file, this file is read and the specified types
      	are assigned to each attribute.

      b) The typ file may be specified in one of two ways:

      	1)  The typ filename can be specified next to the title on the first line of the data file 
               as follows:

      		title #typ filename

      	2) If a file with the same prefix as the data file and 
      	   with extension ".typ" is present in the working directory, it will be read and
      	   those specified types will be assigned to each attribute.

      	If both a typ file is present in the working directory and one is specified in the
      	data file, the filename specified in the data file will take precedence.

      c) The typ file must be in the following form beginning on line 1:

         column name, column type <value1 value2 value3 ...>

         column name, column type

      	.

      	.

      	.

      d) The two legal column types are "symbol" and "real"

      e) If the typ file is not present or specified, one will be created according to the 
          following rules:

      	1) a column is of type SYMBOL if all it's values 
      	   are not numbers.  In addition, if entire column consists 
      	   of "0"'s and "1"'s , it will also be assigned type SYMBOL.

      	2) Else, it is of type REAL.

TERMINOLOGY

Type and Terminology:

The EXAMPLES denotes a 2d array. Rows are examples, columns are attribute values

Terminology by example:

an EXAMPLE is description of a day

an ATTRIBUTE is MONTH

a VALUE is April

The ARITY (number of values) of Month is 12

The n possible values of a symbolic attribute with arity N are represented by {0, 1, .. N-1}.

All ATTRIBUTES will be of type REAL or SYMBOL. (real includes integers)

For example,

attribute TEMPERATURE (e.g., 98.6) would have type REAL

attribute NUMBER-OF-XRAYS (e.g., 0,1,2,..) is type REAL (even though it has integer values)

attribute GENDER (e.g., Male,Female) is SYMBOL (and is represented in C by the integers 0 and 1)

the difference between NUMBER-OF-XRAYS vs. GENDER is that while both have a finite set of possible values, the values of GENDER have no particular order, whereas the integer values of NUMBER-OF-XRAYS have an obvious ordering

LOADING DATA

      #include "example.h"

      examples *mk_load_examples(char *filename);

This function loads data from a properly formatted data file and returns it in an examples data set. Returns NULL if it fails. In addition, if a .typ file is not present one will be created. If one is present, it is also read and the attributes are assigned the specified types.

ACCESSOR FUNCTIONS

The 'examples" data set can be thought of as a 2-D array where the "rows" are instances of an example and the "columns" are attributes. Specific entries are referenced by their coordinates, where example_num and att_num are row and column, respectively.

      #include "example.h"

      examples *mk_empty_examples(int num_atts)

Returns an empty examples data set of size num_atts.

      void free_examples(examples *ex);

Frees the entire data set.

      void fprintf_examples(FILE *s,char *m1, examples *ex, char *m2);

Prints the data set.

      void put_symbol_examples_value(examples *ex, int example_num, int att_num, int value);

Set (example_num, att_num) to value where att_num refers to an attribute of type SYMBOL. Signals an error if an attribute of the wrong type is used.

      void put_real_examples_value(examples *ex, int example_num, int att_num, double value);

Set (example_num,att_num) to value, where att_num refers to an attribute of type REAL Signals an error if an attribute of the wrong type is used.

      int get_symbol_examples_value(examples *ex, int example_num, int att_num);

Returns the value at (example_num, att_num), where att_num refers to an attribute of type SYMBOL Signals an error if an attribute of the wrong type is used.

      double get_real_examples_value(examples *ex, int example_num, int att_num);

Returns the value at (example_num, att_num), where att_num refers to an attribute of type REAL. Signals an error if an attribute of the wrong type is used.

      bool attribute_symbolicp(examples *ex, int att_num);

Returns TRUE if attribute is of type SYMBOL

      bool attribute_realp(examples *ex, int att_num);

Returns TRUE if attribute is of type REAL

      int attribute_arity(examples *ex, int att_num);

Returns the number of possible values that the given attribute can have

      char *attribute_name(examples *ex, int att_num);

Returns the name of the attribute as a pointer to a string. The string MUST NOT be updated or freed.

      int att_num_from_string(examples *ex, char *name);

Returns the number of the attribute name.

      char *examples_value_name(examples *ex, int att_num, int example_num);

Returns the symbolic value at (example_num, att_num) as a pointer to a string. The string MUST NOT be updated or freed. Attribute MUST be of type SYMBOL. Signals an error if an attribute of the wrong type is used.

      int examples_name_value(examples *ex, int att_num, char *name);

Returns the internal value of the given symbol. Attribute MUST be of type SYMBOL. Signals an error if an attribute of the wrong type is used.

      char *attribute_value_name(examples *ex, int att_num, int value);

Returns the symbolic name of internal value as a pointer to a string. Attribute MUST be of type SYMBOL. Signals an error if an attribute of the wrong type is used.

      int examples_num_attributes(examples *ex);

Returns the number of attributes (columns) in the examples data set

      int examples_size(examples *ex);

Returns the number of examples (rows) in the examples data set

Other Utilities:

      double real_attribute_min(examples *, int att_num);
      double real_attribute_max(examples *, int att_num);
      int real_attribute_argmin(examples *, int att_num);
      int real_attribute_argmax(examples *, int att_num);
      double real_attribute_mean(examples *, int att_num);
      int real_attribute_value_frequency(examples *, int att_num, int value);
      double real_attribute_sdev(examples *, int att_num);

The Format data structure

Many learning algorithms will need to know what they are trying to predict, and which attributes they are allowed to use for making that prediction. This is conveniently stored in the format structure. Later, extra information (perhaps which rows are tests and which are training, etc) might also be stored inside the format structure.

Here are the publicly visible operations on formats:

      format *mk_default_format(examples *exams);
The default format has all but the rightmost attribute as inputs, and the rightmost attribute as the output.

      format *mk_copy_format(format *f);
      void free_format(format *f);
      ivec *format_input_att_nums(format *f);
Returns a pointer to an ivec (note, the caller must NOT change this ivec or free it) in which the i'th element of the ivec is an att_num that may be used by a learning algorithm to predict the output.

      int format_output_att_num(format *f);
This is the attribute that it is desired that we predict.

      void fprintf_format(FILE *s,char *m1,format *x,char *m2);
      void pformat(format *f);
      expo *mk_expo_from_format(examples *exams,format *f);
makes a user-understandable formatted description of the format.

      format *mk_format_from_file(examples *exams,char *fname,expo **r_error_expo);
Attempts to load a format from a file. If loads okay returns non-null result, and set *r_error_expo = NULL If load fails returns NULL and sets *r_error_expo to a non-null message explaining the problem.

The syntax of a format file is trivial. It is a list of attribute names, space separated and all on one line. The rightmost is the output attribute, all others are legal input attributes.

      void save_format(examples *exams,format *f,char *fname,expo **r_error_expo);
If it is saved successfully, it sets *r_error_expo = NULL If fails, *r_error_expo will be a non-null message explaining the problem.

      format *mk_user_edit_format(examples *exams,format *oldf);
Returns a NEW format (and, of course, does not free the old one) that the user has been allowed to edit (i.e. add or remove attributes). The edit loop is pretty hideous. It is fine for oldf to be NULL: this gives the user the opportunity to create a new format. If the user cancels during the edit cycle, NULL is returned.

Attribute Scoring Utilities

This section documents the infogain.c and infogain.h utilities. As of this writing (Jun 1st) they are under construction, with only symbol --> symbol info gain implemented. The plan is to have

      double info_gain(column *ins,column *outs,ivec *row_nums);
PRECONDITION: outs must be a column of symbolic attributes. If ins is also symbolic, uses the standard information gain formula (see Tom's book chapter on Decision Trees). Running time is O(N + AI * AO) where there are N rows in "row_nums", Attribute ins has arity AI and outs has arity AO.

If ins is numeric, the best single threshold split is found, and the resulting info gain is returned. There should be a visible sub-function so that the caller can find out what that threshold is. Running time is O(N log N + N AO).

      double fraction_variance_explained(column *ins,column *outs,ivec *row_nums);
Pre: Outs is real-valued. If ins is symbolic, computes the fraction of variance eliminated if, instead of simply predicting the output as the average of all outputs, one first looks at the value of ins, and then predicts the avarge value of all previous outputs given the same input value. A perfect predictor would give 1. A Crappy predictor will give 0. Note that if the arity of ins is very large, the percent variance explained will be deceptively high. Cleverer statistics would avoid that.

If ins is numeric, finds the best threshold, a la CART.

      double ins_usefulness_score(column *ins,column *outs,ivec *row_nums);
No restriction on column types. Automatically decides which of the above two functions to call.

      expo *mk_expo_explaining_usefulness(examples *exams,int att_num_in,
                                         int att_num_out);
A decent brief user report. You can see the output by running the score verb in the current application.

In the works...

Most of the above isn't implemented yet, and "not_implemented" run-time error messages abound. When it is, the next stage will be to have functions so that given a format (see prev section) the computer can report the most promising of the permissible input attributes. It can suggest which pairwise relations might be most interesting to graph.

Then, independent of the output, it might interest many users to get a general report on pair of attributes that seem to have high relations to each other. This can also serve as a warning of redundant or unvarying inputs.

Then, this forms the backbone of a simple decision tree implementation. One thing that in turn might be used for is cross-validation searches for useful families of attributes.