SIMPLE INTRO TO THE PRIMARY DATA TYPES AND OPERATIONS IN AMBL

CONTENTS

BASICS:
The Major AMBL datatypes
----- USIN
----- USOUT
----- REGDEGS
----- FACODES
----- EXTENT
----- ATTNAMES
----- SIDAT
----- ATREE
----- TERMS and COEFFS
----- APRIOR
----- AREQUEST
----- BREQUEST
----- LOC (Locators)

BASICS:

All data structures have a "make a copy of me and recursively copy all my contents" function. And all data structures have a "free me and recursively free all my contents" function.

If the data structure is called "plop", then the two functions above will be called

         plop *mk_copy_plop(plop *p)

and

         free_plop(plop *p)

Furthermore, any function that returns a plop and has mk_ in its name, e.g.

           plop *mk_plop_from_quibble_and_squibble(quibble *q,squibble *sq)

is guaranteed to produce a newly allocated plop in which all subfields are also newly allocated (if necessary by including copies of q and sq), so that after its creation nothing that happens to q or sq affects the plop, and nothing that happens to the plop affects q or sq.

Any plop that you create by calling a mk_ function, you must also eventually free with free_plop(p).
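
The ownership convention described above can be sketched in plain C as follows. This is only an illustration: "plop" is the placeholder name used in this section, and its fields and the mk_plop_from_array() constructor are hypothetical, not part of AMBL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type, for illustration only. It is not a real AMBL type. */
typedef struct plop {
  int size;        /* number of entries in values */
  double *values;  /* owned by this plop */
} plop;

/* "mk_" in the name: the result is newly allocated and owns a private
   copy of arr, so later changes to arr do not affect the plop. */
plop *mk_plop_from_array(const double *arr, int size)
{
  plop *p = malloc(sizeof *p);
  assert(p != NULL);
  p->size = size;
  /* Allocate at least one element so the pointer is never NULL. */
  p->values = malloc((size_t)(size > 0 ? size : 1) * sizeof *p->values);
  assert(p->values != NULL);
  if (size > 0)
    memcpy(p->values, arr, (size_t)size * sizeof *p->values);
  return p;
}

/* Recursive copy: the copy shares no memory with the original. */
plop *mk_copy_plop(plop *p)
{
  return mk_plop_from_array(p->values, p->size);
}

/* Recursive free: releases the contents, then the plop itself. */
void free_plop(plop *p)
{
  if (p == NULL)
    return;
  free(p->values);
  free(p);
}
```

With these definitions, a plop made from an array and a copy made with mk_copy_plop() can each be modified and freed without affecting the other or the source array, and each needs exactly one call to free_plop().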

The Major AMBL datatypes

USIN

A usin is represented as a dyv (dynamic vector; amdm.h). It is an unscaled (i.e. raw) point in input space.

USOUT

A usout is represented as a dyv (dynamic vector; amdm.h). It is an unscaled (i.e. raw) point in output space.

REGDEGS

A regdeg describes which terms are to be used in a polynomial.

FACODES

A facode describes pretty much everything you need to know about a given choice of function approximator. Gmstrings can be turned into facodes, and facodes can be turned into gmstrings.

/* usin_size says how many inputs there are. Calls my_error() and prints an explanation of the problem if the string is illegal. */

      facode *mk_facode_from_string(char *string,int usin_size);

/* Makes a gmstring. Must be freed with free_string() */

      char *mk_string_from_facode(facode *fc);

fc->rd : This field of a facode gives the regdeg associated with the facode.

EXTENT

An extent denotes the rough, rounded minimum and maximum values of the input and output features in a dataset.

ATTNAMES

An attnames stores the names of the input and output columns.

SIDAT

A sidat denotes a dataset of numeric inputs and outputs, a set of attribute names, and a rough sketch of the minimum and maximum ranges of the inputs and outputs.

      si -> ext      is the extent of the sidat.
      si -> ans      is the attribute names of the sidat.
      si -> usins    is the matrix of unscaled input vectors;
                     dym_ref(si->usins,i,j) is the j'th component of the i'th input datapoint.
      si -> usouts   is the matrix of unscaled output vectors;
                     dym_ref(si->usouts,i,j) is the j'th component of the i'th output datapoint.

      sidat *mk_sidat_from_filename_simple(char *fname);

Loads a sidat from the file fname. Calls my_error() if there is a problem.

ATREE

An atree is a kd-tree that allows fast predictions. It contains a dataset in which all points are scaled and stored for efficient access.

      atree *mk_atree_from_sidat_and_facode(sidat *si,facode *fc);

      dyv *mk_predict_from_atree(atree *at, facode *fc, dyv *query_usin);

TERMS and COEFFS

A term is a dyv representing the terms in a multivariate polynomial. For example, if the input space is 2-d with inputs x1 and x2, linearly scale x1 to z1 so that z1 lies between 0 and 1, and linearly scale x2 to z2 so that z2 lies between 0 and 1. Then term = (1,z1,z2,z1*z1,z1*z2,z2*z2).

      dyv *mk_term_from_usin(dyv *usin, extent *ext, regdeg *rd);
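
As a rough illustration of the example above, the following sketch builds the term vector for a full quadratic using plain C arrays. It is not the AMBL implementation: the function names are hypothetical, the lo/hi arrays stand in for the input ranges held in an extent, the regdeg is fixed to "all terms up to degree 2", and the ordering of terms for more than two inputs is an assumption extrapolated from the 2-d example.

```c
#include <assert.h>

/* Linearly map x from the range [lo, hi] onto [0, 1]. */
double scale_to_unit(double x, double lo, double hi)
{
  assert(hi > lo);
  return (x - lo) / (hi - lo);
}

/* Number of terms in a full quadratic over n inputs:
   1 constant + n linear + n(n+1)/2 second-order terms. */
int quadratic_term_count(int n)
{
  return 1 + n + n * (n + 1) / 2;
}

/* Fill term[] (length quadratic_term_count(n)) from an unscaled input.
   Order: 1, z_1..z_n, then z_i*z_j for i <= j, which for n = 2 gives
   (1, z1, z2, z1*z1, z1*z2, z2*z2) as in the text above. */
void fill_quadratic_terms(const double *usin, const double *lo,
                          const double *hi, int n, double *term)
{
  int i, j, k = 0;
  term[k++] = 1.0;
  for (i = 0; i < n; i++)
    term[k++] = scale_to_unit(usin[i], lo[i], hi[i]);
  /* The scaled inputs now live in term[1..n]; reuse them for products. */
  for (i = 0; i < n; i++)
    for (j = i; j < n; j++)
      term[k++] = term[1 + i] * term[1 + j];
}
```

For instance, with inputs (2.5, 30) and ranges [0,10] and [20,40], the scaled inputs are z1 = 0.25 and z2 = 0.5, so the term vector is (1, 0.25, 0.5, 0.0625, 0.125, 0.25).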

A coeffs holds the coefficients of a multi-input and possibly multi-output linear map in which

scaled predicted output = coeffs^T term

This can be implemented by

          dyv *sout = mk_dym_transpose_times_dyv(coeffs,term);
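
To make the product concrete, here is a minimal sketch of coeffs^T term on plain arrays. It does not use the dym/dyv types; the function name and the row-major layout are assumptions made for illustration. The shape follows from the formula: for coeffs^T term to yield one value per output, coeffs must have one row per term and one column per output.

```c
#include <assert.h>
#include <stddef.h>

/* Compute sout = coeffs^T term.
   coeffs is row-major with n_terms rows and n_outputs columns, so
   sout[c] = sum over r of coeffs[r][c] * term[r]. */
void transpose_times_vector(const double *coeffs, int n_terms, int n_outputs,
                            const double *term, double *sout)
{
  int r, c;
  assert(coeffs != NULL && term != NULL && sout != NULL);
  for (c = 0; c < n_outputs; c++) {
    sout[c] = 0.0;
    for (r = 0; r < n_terms; r++)
      sout[c] += coeffs[(size_t)r * (size_t)n_outputs + (size_t)c] * term[r];
  }
}
```

With three terms and two outputs, coeffs has six entries and the result has two, one scaled prediction per output.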

The unscaled output (or "usout") is computed from the sout as follows:

        dyv *usout = mk_usout_from_sout(ext,sout);

(See mk_predict_from_atree in atree.c for example)
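
The unscaling step might look like the sketch below. This is an assumption, not a description of mk_usout_from_sout: it supposes that outputs are scaled the same way as the inputs above (linearly, so that the output range recorded in the extent maps onto 0..1) and simply inverts that map. The function name and the out_lo/out_hi arrays are hypothetical stand-ins for the extent.

```c
#include <assert.h>

/* ASSUMPTION: scaled outputs relate to raw outputs by the linear map
   sout = (usout - lo) / (hi - lo), so the inverse is
   usout = lo + sout * (hi - lo). AMBL's actual scaling may differ. */
void unscale_outputs(const double *sout, const double *out_lo,
                     const double *out_hi, int n_outputs, double *usout)
{
  int i;
  for (i = 0; i < n_outputs; i++) {
    assert(out_hi[i] > out_lo[i]);
    usout[i] = out_lo[i] + sout[i] * (out_hi[i] - out_lo[i]);
  }
}
```

Under this assumption, a scaled output of 0.25 on an output range of [10,20] corresponds to a raw output of 12.5.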

The following data structures are probably not necessary for you to know about:

APRIOR

An aprior represents a prior for the Bayesian regression.

      aprior *mk_aprior_from_facode(facode *fc);

AREQUEST

An arequest holds about half the information in a facode. Specifically, it is the information needed to build an atree (see above), including distance metric info and regdeg info, but not kernel width info or number of neighbors.

      arequest *mk_arequest_from_facode(facode *fc);

BREQUEST

A brequest holds the other information in a facode. Specifically, it is everything except the information needed to build an atree (see above), so it includes kernel width info, number of neighbors, and weight function info.

      brequest *mk_brequest_from_facode(facode *fc);

LOC (Locators)

A locator represents a point in a distance metric space. Jeff, please document.