Schenley Park Research: Available Commands

about, acors, add, aggregate, autocor, batch, batchrename, brief, bsave, cartesian, change_scorer_matrix, cleartree, clt, comment, compat, concatenate, copy, cor, correlate, cors, define, delete, delete_value, dlist, dlist_anomaly_hunt, dlisttest, dtree, features, frequent, ignore, info, inputs, join, left-click, load, loadtest, log, ls, maketest, maketree, marginals, mark, matrix, nearest, net, new, new_value, newdlist, newdlist_anomaly_hunt, newdlisttest, newra, nomissing, output, pca, print_scorer_matrix, quit, realfeatures, realize, regress, removeduplicates, removerows, rename, rename_value, report, right-click, rules, save, scoring, see, select, shear, show, showmarked, shuffle, sort, sortmissing, sortrows, speed, submark, swap, symb2real, symbolize, symbolize_missing, table, togglemarks, transpose, treetest, unmark



about

about : Displays copyright information




acors

acors : <attname> [modifiers]


Computes the autocorrelation function for <attname>.
<attname> must represent a real-valued attribute.
Legal modifiers: "rank", "sig", "brief", "marked", "rand [samplesize <n>]"

add

add : <attributes> : Add to the set of inputs


This allows you to add one or more attributes to the set of inputs. See
the inputs command for more details, and remember you can never add
the current output attribute.


aggregate

aggregate : <parameters>: Create aggregate real-variable


Type the following (all on one line):
aggregate <writeattname> = <agstat> of <readattname> using [marked] records 
                      [matching <attnum1> <attnum2> .. <attnumN>]
Where...
<writeattname>         is the name of a real-valued attribute you are 
                       defining.
<agstat>               is one of mean variance sdev min max first.
<readattname>          is the attribute of which you are taking aggregate 
                       statistics.
using marked records   means only use statistics of those records currently 
                       marked (see the mark and unmark commands)
using records          means use statistics from all records
matching <attributes>  means when defining the statistic for the i'th record,
                       use that statistic over all the records which match
                       the same values as the i'th record in each of the
                       attributes in <atributes>.

Example

INCOME TOWN CARMAKE HOUSEVALUE
30000  york ford    100000
20000  bath volvo    60000
50000  york gm      170000
10000  bath ford     80000
60000  bath ford     90000

You might want a new attribute that specifies for each record what is
the mean income in the town associated with that record. You'd do that
with:

  new TOWNINCOME real
  aggregate TOWNINCOME = mean of INCOME using records matching TOWN

Or you might want an attribute that specifies for each record what is
the mean income of ford-owners in the town associated with that record.
You'd do that with

  new TFOINCOME real
  unmark all
  mark CARMAKE == ford
  aggregate TFOINCOME = mean of INCOME using marked records matching TOWN

Note that the new attribute <writeattname> and the attribute you're
collecting statistics for <readattname> must both be real-valued. The
matching attributes must be symbolic (sorry).


autocor

autocor : <lag> <attname> [modifiers]


Computes the autocorrelation coefficient of <attname> for the specified <lag>.
<attname> must represent a real-valued attribute.
<lag> must be an integer larger than 0 (and reasonably smaller than the number of datapoints).
Legal modifiers: "rank", "sig", "brief", "marked", "rand [samplesize <n>]"

batch

batch : <filename>: execute the series of commands in <filename>


The commands are executed in just the same way as if you
were to type them into the command line manually here. Note
that all output is sent to this display as usual. Note too
that the execution of the commands continues blindly even
if there are errors along the way.

The batch file should simply have one command on each line.
You can leave blank lines if you like. Lines beginning
with a # character will be treated as comments and ignored.
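
For example, a hypothetical batch file (say, session.txt) might contain:

  # load the data and summarize it
  load mydata.csv
  brief
  info all

and would be run with:  batch session.txt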


batchrename

batchrename : <filename> : Rename attributes from a file.


<filename> must be a textfile containing exactly the same number of
tokens as there are attributes in the dataset. The tokens must be separated
by spaces and/or on separate lines. The i'th token in the file 
becomes the name of the i'th attribute in the dataset.


brief

brief : Give a brief listing of details of each attribute.


  You are simply told the names and whether the attributes are symbolic
  or real. Each attribute is described on one line. For more details on
  attributes, use the info command.


bsave

bsave : {marked|unmarked} <filename>.


  This command saves in a simple platform-dependent binary format.
  See the save command for other formats.

  The optional keyword marked or unmarked may be supplied.
  If marked is supplied, saves ONLY those records that have been
  marked (see the mark, unmark and togglemarks commands).
  If unmarked is supplied, saves ONLY those records that have been
  unmarked. If neither keyword is supplied, saves ALL the records.

  To load from MATLAB on the same platform, use:
    fid=fopen('file','r');           % open the binary file
    dims=fread(fid,2,'int');         % read the two matrix dimensions
    data=fread(fid,dims,'double');   % read the data values themselves
    fclose(fid);


cartesian

cartesian : NEEDS DOCUMENTING



change_scorer_matrix

change_scorer_matrix :


Change the reward matrix used in matrix scorer rule learning. Call the matrix M.
Then Mij gives the reward for a rule that predicts a record will have value i,
but that actually has value j.
Initially we use the identity matrix for reward.


cleartree

cleartree : : erase the existing adtree.


Use this command to erase an existing adtree, so you can build a new
one with different parameters.


clt

clt : <type> [modifiers]


Generates Chow-Liu dependency tree(s) for current inputs
using correlation method specified by <type> with optional settings [modifiers].
Legal types: "linear","quad","rank", and  "quadrank"
Legal modifiers: "marked", "threshold <n>"

comment

comment : <words> : Add a comment to a demo session.


This command is almost certainly useless to you unless you
are planning on saving the text of your session to a file, and wish
to add a comment to the reader of the session.

comment on <sessionname>: This will switch on some internal flags that will
            add special annotations to all output so that it can easily
            be turned into a tutorial. Graphics will be saved to files
            with names like andrew12.ps if andrew was given as the
            session name.

comment off: Switch off the above.

comment <anything else>: Will allow the user to type in commentary about
                         what's going on to be used by the comment command.
         Important: TO STOP ENTERING COMMENT, PUT A SINGLE . ON A LINE.


compat

compat : <file1> <file2> : Use two compatible files together.


Enables two compatible files to be used together as training and
test/validation data.


concatenate

concatenate : <file(s)> [rename] : Add records from other file(s)


    Examples:
      1.  concatenate population.fds
      2.  concatenate population.fds latitude.fds
      3.  concatenate population.fds rename
Suppose the current dataset in memory is

city state crimerate homestate
pit  pa    100       pa
ny   ny    230       pa
phil pa    120       oh
new  nj    150       pa

And suppose population.fds is

city state crimerate homestate
new  pa     40       pa
gulf al     30       fl

Then after example 1 the dataset in memory will be

city state crimerate homestate
pit  pa    100       pa
ny   ny    230       pa
phil pa    120       oh
new  nj    150       pa
new  pa     40       pa
gulf al     30       fl

All that happens is that the records from population.fds are appended
at the end of the current dataset.

Example 2 allows you to merge with more than one file at once. It's
exactly equivalent to running the following commands in succession:
   concatenate population.fds
   concatenate latitude.fds

Example 3 shows the rename option. If the columns in the population.fds
had different names than those in the current dataset, then the names
in population.fds will be ignored. (By default, if you don't include
rename on the command, the system will moan at you if it finds two
attributes with names that disagree).


copy

copy : <attname> <newattname> : Make a copy of an attribute




cor

cor : <type> <input attname> <output attname> [modifiers]


Computes correlation coefficient of <input attname> w.r.t. the real valued <output attname>
using the method specified by <type> with optional settings [modifiers].
Legal types: "linear","quad","rank" and "all"
Legal modifiers: "sig", "brief", "marked", "rand [samplesize <n>]"

correlate

correlate : <att1> <att2> : Find the correlation between two real-valued attributes



cors

cors : <type> [modifiers]


Computes correlation coefficients of all current inputs w.r.t. the current output
using the method specified by <type> with optional settings [modifiers].
Legal types: "linear","quad" and "rank"
Legal modifiers: "sig", "brief", "marked", "rand [samplesize <n>]"

define

define : <attribute> <expression> : (Re)define an attribute's values


	or (equivalent syntax):
<attribute> = <expression>
<expression> may be composed of attribute names, numeric/symbolic values
parentheses and operators
Available operators (ordered according to decreasing priority):
abs	(unary) calculates absolute value of its argument,
	applicable to real valued attributes only;
sqrt	(unary) square root of a real, non-negative argument;
sign	(unary) returns -1.0, 0.0 or +1.0 if its real argument
	is respectively negative, zero, or positive;
exp	(unary) applicable to real valued arguments;
log	(unary) natural logarithm, applicable to positive
	real arguments;
log10	(unary) decimal logarithm, applicable to positive
	real arguments;
pre	(unary) applicable to real or symbolic data attributes
	(not to individual values), it shifts the argument
	values back, so that if k is the data row index,
	then resultant[k] = argument[k-1] for k>0
	and resultant[0] = argument[0] otherwise;
suc	(unary) applicable to real or symbolic data attributes
	(not to individual values), it shifts the argument
	values forward, so that if k is the data row index,
	 then resultant[k] = argument[k+1] for k<size-1
	and resultant[size-1] = argument[size-1] otherwise,
	(size is the number of data rows);
deriv	(binary), applicable to real attributes only,
	"deriv y x" computes the first derivative dy/dx;
ranks	(unary) computes ranks of its argument,
	intended for use with real attributes;
sranks	(binary) "sranks s r" computes ranks of symbolic
	attribute s with regard to the real attribute r,
	using a specific permutation of the values of s,
	which provides a presumably monotonic behavior of s vs. r.
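
For example, assuming a real-valued attribute INCOME (a hypothetical
name), you might create a log-transformed copy with:

  new LOGINCOME real
  LOGINCOME = log INCOME

or, using the equivalent define syntax:

  define LOGINCOME log INCOME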


delete

delete : <attributes> : Remove one or more attributes.


  Gets rid of the given attribute (and, of course, all its values)
  from the dataset. There's no undo!


delete_value

delete_value : <attribute> <value_name> : delete value.


<attribute> must be a symbolic attribute. This removes the printed
name of one of its symbolic values. All records with the named value
revert to having a missing value for that attribute.


dlist

dlist : <val> [rmin <n>] [maxmegs <n>] [numatts n] [support n]: Decision List Finder.


Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
    <output attribute> = <value>
	 where value is either a specific value or the literal any.
The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).

rmin <n> - This parameter, described in the ADTREE paper, defines the
           maximum leaf-list length. Usually (though not always) the larger
           rmin, the smaller the amount of memory used by the adtree, but
           the slower the counting queries. The speed of the counting 
           queries is not substantially affected for rmin values 
           below approximately 200.

maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
              predicted in advance before you try building it. Use this
              parameter to prevent the ADTREE from growing too big...if the
              number of megabytes it uses grows above <n> during construction
              the ADTREE frees itself, and harmlessly returns with a set of
              suggestions about what you can do before trying the maketree
              command again.
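
For example, with output attribute CARMAKE (a hypothetical name), a
session might look like:

  output CARMAKE
  dlist ford numatts 2 support 10

which searches for rules of at most 2 attribute-value pairs, each
matching at least 10 records, predicting CARMAKE = ford.
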
Notes about ROC graphics:
We plot 
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.


dlist_anomaly_hunt

dlist_anomaly_hunt : [rmin <n>] [maxmegs <n>] [numatts n] [support n]: Decision List Finder.


Runs a decision list anomaly finder.
A new attribute called STRANGENESS is added to the dataset.
Things with high STRANGENESS stand out as difficult to distinguish from
random noise using a dlist.

The rules considered for use in the dlist contain up to 
<numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).

rmin <n> - This parameter, described in the ADTREE paper, defines the
           maximum leaf-list length. Usually (though not always) the larger
           rmin, the smaller the amount of memory used by the adtree, but
           the slower the counting queries. The speed of the counting 
           queries is not substantially affected for rmin values
           below approximately 200.

maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
              predicted in advance before you try building it. Use this
              parameter to prevent the ADTREE from growing too big...if the
              number of megabytes it uses grows above <n> during construction
              the ADTREE frees itself, and harmlessly returns with a set of
              suggestions about what you can do before trying the maketree
              command again.
Notes about ROC graphics:
We plot 
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.


dlisttest

dlisttest : <val> [rmin <n>] [maxmegs <n>] [numatts n] [support n]: Decision List Finder.


1) Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
    <output attribute> = <value>
	 where value is either a specific value or the literal any.
2) Tests the rules on the test data set (enabling cross validation).

The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).

rmin <n> - This parameter, described in the ADTREE paper, defines the
           maximum leaf-list length. Usually (though not always) the larger
           rmin, the smaller the amount of memory used by the adtree, but
           the slower the counting queries. The speed of the counting 
           queries is not substantially affected for rmin values 
           below approximately 200.

maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
              predicted in advance before you try building it. Use this
              parameter to prevent the ADTREE from growing too big...if the
              number of megabytes it uses grows above <n> during construction
              the ADTREE frees itself, and harmlessly returns with a set of
              suggestions about what you can do before trying the maketree
              command again.
Notes about graphics:
We plot 
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.


dtree

dtree : size <n> iters <m>: build a decision tree.


   size n : The tree returned should have no more than
            this number of nodes (including leaf nodes).

  iters n : If n==1 do regular decision tree. If n > 1, run
            n iterations of a randomized wrapper feature
            selection first.
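
For example, a hypothetical invocation:

  dtree size 15 iters 1

builds a single regular decision tree with at most 15 nodes, skipping
the randomized feature-selection step.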


features

features : [numatts <n>] : Find best set of <n> features.


This set of <n> features will be a subset of the current
set of inputs. It will be the best such set (in terms of
infogain) for predicting the current output. See the
matrix command for the definition of what this all means.
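
For example, a hypothetical invocation:

  features numatts 3

searches the current inputs for the subset of 3 attributes that best
predicts the current output.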


frequent

frequent : <freq_atts> <match_atts> <setsize> <mincount> <filename>: make frequent dataset


  freq_atts  - the attributes of which you want to find frequent tuples
  match_atts - the attributes that indicate which records should be considered
               together in finding frequent attribute tuples
  setsize    - the size of the tuple to be found (e.g. 2 indicates find all
               frequent pairs)
  mincount   - the minimum number of occurrences before a tuple will be included
               in the final result
  filename   - file in which to save the resulting frequent dataset 


ignore

ignore : <attributes> : Remove from the set of inputs


This allows you to remove one or more attributes from the set of inputs. See
the inputs command for more details.


info

info : <attributes> : Give detailed attribute information.


  <attributes> may be "all" or a list of attribute names or numbers
  For the symbolic attributes, you are told what the symbolic values
  are, and how many times each value occurs in the dataset. For the real-valued
  attributes, you are told their means, variances, mins and maxes. In 
  each case you are also shown the first eight values. You are also told
  how many values are missing.


inputs

inputs : <attributes> : Define which attributes are to be used as inputs


Several ADTREE operations require a set of input attributes to search
over. Here's the command you can use to define that set. Just list the
attributes you're interested in, or type inputs all to select all
attributes as inputs. Note that the current output attribute (see the
output command) will never be added to the set of inputs, even if you
request that it should be.


join

join : <file(s)> key(s) <keycode(s)> : Add attributes from other file(s)


    Examples:
      1.  join population.fds key state
      2.  join population.fds latitude.fds key state
      3.  join population.fds key homestate/state
      4.  join age.fds keys firstname lastname
Suppose the current dataset in memory is

city state crimerate homestate
pit  pa    100       pa
ny   ny    230       pa
phil pa    120       oh
new  nj    150       pa

And suppose population.fds is

state population
ha     3
nj    10
ny    25
oh    18
pa    15

Then after example 1 the dataset in memory will be

city state crimerate homestate population
pit  pa    100       pa        15
ny   ny    230       pa        25
phil pa    120       oh        15
new  nj    150       pa        10

The newly added column gives you the population of the state mentioned
as the 2nd attribute in the memory dataset. Where does the population value
come from? From the file population.fds

Example 2 allows you to merge with more than one file at once. It's
exactly equivalent to running the following commands in succession:
   join population.fds key state
   join latitude.fds key state

Example 3 shows what we'd have to do if we wanted the population column to
be associated with the homestate in the in-memory file even though the
matching record we're using in population.fds is called state.

Sometimes one key is not enough to disambiguate which record in
the new file contains the values you want. For example, 4 would be
useful in this case:

ORIGINAL DATASET....                               AGE.FDS...
firstname lastname  height weight year              firstname lastname age
andrew    moore     75     135    1985              andrew    moore     34
jeff      lee       65     235    1999              andrew    lee       33
andrew    moore     75     155    1998              jeff      lee       32
andrew    lee       35      55    1998       

when the result would be
firstname lastname  height weight year age
andrew    moore     75     135    1985 34
jeff      lee       65     235    1999 32
andrew    moore     75     155    1998 34
andrew    lee       35      55    1998 33

Notes: * In all cases, all attributes from <filename> are included into the 
         in-memory dataset. Delete those you don't want. 
       * All key attributes must be symbolic
       * Key attributes must match (have the same values) in memory & in
         loaded datasets


left-click

left-click : (Un)Maximize a bar graph in a report




load

load : <filename> [options] : Load new dataset from CSV or FDS file.


   <filename> must be the name of a datafile in one of the formats
              described below.
   [options] are
      default_real {true|false} (default value is FALSE)
         If you specify default_real true on the command line then an
         attribute with only a small number of values, each an integer,
         will be treated as a real-valued attribute. See the realize 
         command for more discussion.

      ignore_missing {true|false} (default value is FALSE)
         If there are any missing values in the dataset, and if set to
         TRUE, the loader will simply ignore any records containing missing
         values.

If <filename> contains *, then multiple matching files will be loaded into
one dataset.

If the new dataset is loaded successfully, then this tool forgets
the previous dataset.

File formats. 
If the filename ends with .fds, then this tool will assume that the
file was created in standard Schenley Park Research compact .fds format,
and will load accordingly. For large files, this is up to hundreds
of times faster for loading, and so is heavily recommended. 
For more discussion, see the save command.

All other file suffixes are loaded with an adaptive loader that can
attempt to load with comma separated (CSV) format and/or space-separated
format. It ignores all blank lines and lines beginning with #. The
first non-ignored line is initially assumed to consist of attribute names.
If there's at least one number amongst them, however, loading assumes
that this file was created without attribute names. In that case it
treats the first line as the first record, and generates its own 
(unimaginative) attribute names. It automatically decides whether each
attribute is real or symbolic. If it sees symbolic values with spaces
in the symbolic name, it replaces the spaces with hyphens.
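
For example, a hypothetical invocation:

  load mydata.csv default_real true ignore_missing true

loads mydata.csv, treating small-integer-valued attributes as real and
skipping any records that contain missing values.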


loadtest

loadtest : <filename> [options] : Load test dataset from CSV or FDS file.


Same loading conventions as the regular load command.


log

log : <filename/off> : Enable or disable logging of all text output to a file


This command takes all text output and appends it into a file.
(Note that appending allows the user to add to a log file at any time.)
Typing 'log off' at any time will disable logging.

ls

ls : lists the current working directory




maketest

maketest : [rmin <n>] [maxmegs <n>] : Build test adtree.


You must run maketest before you can run any of the decision tree
code.
Uses the rmin and maxmegs options described in the maketree command.


maketree

maketree : [rmin <n>] [maxmegs <n>] [lazy y/n]


You must run maketree before you can run any of the ADTREE datamining
commands. The ADTREE is the thing that cleverly caches sufficient
statistics to allow counting queries to be answered very quickly.

All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).

rmin <n> - This parameter, described in the ADTREE paper, defines the
           maximum leaf-list length. Usually (though not always) the larger
           rmin, the smaller the amount of memory used by the adtree, but
           the slower the counting queries. The speed of the counting 
           queries is not substantially affected for rmin values
           below approximately 200.

maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
              predicted in advance before you try building it. Use this
              parameter to prevent the ADTREE from growing too big...if the
              number of megabytes it uses grows above <n> during construction
              the ADTREE frees itself, and harmlessly returns with a set of
              suggestions about what you can do before trying the maketree
              command again.

lazy y/n  - You can choose whether to use regular 'classic' static trees
            or new clever lazy trees. Lazy trees can use much less memory
            with some (possibly severe) slowdowns.
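
For example, a hypothetical invocation:

  maketree rmin 50 maxmegs 200 lazy n

builds a classic (non-lazy) adtree with maximum leaf-list length 50,
abandoning construction if memory use exceeds 200 megabytes.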


marginals

marginals : contextsize <n> predictor <x> predictee <y>


  where <n> is 1 or greater and <x> is "all" or an attribute name

    A very fast and high-dimensional correlation finder.

    This command will operate on the attributes mentioned in the
    current set of inputs and defined as the current output.

    You can ask to see which of these attributes are good for 
    predicting which other attributes in what contexts.

    Suppose, for example, that X Y and Z are all binary (True/False)
    attributes and Z is True if and only if X and Y have the same
    value. Suppose that in the dataset X and Y have their values
    randomly and independently chosen to be True/False with 50-50 
    probability. Then notice, importantly, that the pairwise correlation
    between X and Z will be zero (and so will X's RIG for predicting
    Z). Ditto for Y and Z. And yet X and Y together are a perfect
    predictor of Z (RIG = 100%).

                (Side note: For a definition of RIG,
                see the documentation for the matrix
                command).

    This command can avoid the error that normal correlation would
    make, of deducing no relation between X and Z. It does so by
    searching for the best CONTEXT for X to predict Z, where by
    CONTEXT we mean we allow X to appeal to other attributes to
    help it predict Z.

    Suppose we want to know how well some attribute called the
    PREDICTOR predicts some other attribute (called the PREDICTEE).

    We use the important notion of RIGI (Relative Information Gain
    Increase) to score the predictor's performance in a given context. 
    We compute
         RIG_with    = RIG of predictor and context for predicting predictee
         RIG_without = RIG of context alone for predicting predictee
    then
         RIGI = RIG_with - RIG_without

    For each requested PREDICTOR/PREDICTEE combination, this 
    command finds the best context (the one with the highest RIGI)
    and reports it.

    For each PREDICTEE the command gives a ranked list of which
    are the best PREDICTORS and their associated contexts.

    Then for each PREDICTOR the command gives a ranked list of which
    are the PREDICTEES for which it does best and their associated contexts.

Parameters:

    contextsize: What is the maximal allowed size of the context
                 set of variables? The number of high dimensional 
                 correlations considered increases VERY rapidly if
                 you make this large.

    predictor: Can be "all", meaning use all the current inputs and the 
               current output as predictors. Else you may specify a
               single predictor.

    predictee: Can be "all", meaning use all the current inputs and the 
               current output as predictees. Else you may specify a
               single predictee.

    It is fine for an attribute to be used both as a predictor and a
    predictee.
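
For example, a hypothetical invocation:

  marginals contextsize 2 predictor all predictee all

considers every current input (and the output) as both predictor and
predictee, searching contexts of up to two attributes.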


mark

mark : <constraint> : Set all rows matching constraint to be marked.


When the program starts, or when a new dataset is loaded, all rows are 
initially unmarked.
You can mark subsets of the rows at any time. And then, if you wish,
you can unmark subsets at any time. You can then save only the marked, or
only the unmarked rows if you wish (see the save command).
To mark all the rows matching some constraint, just type mark <constraint>.
Legal constraints are...
    all                - matches every single row
    missing            - matches only rows with missing value(s)
<attribute> == <value> - matches only rows in which given SYMBOLIC attribute
                         has the given SYMBOLIC value.
<attribute> < <number> - matches only rows in which given REAL attribute
                         has a numeric value strictly below threshold.
<attribute> > <number> - matches only rows in which given REAL attribute
                         has a numeric value strictly above threshold.
row < <number>         - matches only the rows with row_numbers strictly
                         below <number>. The first (top) row in the dataset
                         has row_number 0 etc.
row > <number>         - matches only the rows with row_numbers strictly
                         greater than <number>.
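
For example, assuming a symbolic attribute CARMAKE and a real
attribute INCOME (hypothetical names), a marking session might be:

  unmark all
  mark CARMAKE == ford
  submark INCOME > 30000

leaving marked exactly those rows where CARMAKE is ford and INCOME
exceeds 30000 (see the submark command).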


matrix

matrix : : Display joint contingency table between inputs and output.


Draws what Schenley Park Research calls a w_matrix, where each row of the w_matrix
corresponds to a different assignment of sets-of-values to the inputs, and where
each column corresponds to a different value of the output attribute.

   A w_matrix is a two-dimensional table counting how frequently
   each value of the output attribute co-occurs with values of
   input attributes. For example, suppose we had a dataset with
   four symbolic attributes thus:

           Weight   which can have values   light, medium, heavy
           Healthy  which can have values   true, false
           Wealthy  which can have values   true, false
           Wise     which can have values   true, false

   Then here is an example of a w_matrix for inputs Weight and Wise and
   output healthy...

        Weight   Wise   |  Healthy=False   Healthy=True   
        -----------------------------------------------
        light    false  |   64               32
        light    true   |   16               0
        medium   true   |   20               20
        heavy    false  |   10               2

    The w_matrix is the matrix of numbers in the bottom right quadrant,
    and tells us, for example, that amongst people who are light and unwise,
    64 are also unhealthy and 32 are healthy.

  IMPORTANT: a w_matrix never bothers to include rows for which all
             counts are zero. So for example this w_matrix shows us that
             there are no weight=medium wise=false people in the database.

The entropy of the output variable is shown. This is a measure of
how uniformly distributed the output is...a high entropy output is one
with a uniform distribution, and is thus harder to predict. Formally,
the entropy is the number of binary bits of information needed to encode
an output value selected according to the empirical distribution of values
of the output attribute.

The information gain (IG) of using the inputs to predict the output is
also shown. The higher this number the more predictive. Formally, it is
the number of bits that are, on average, saved if you had to encode the
output optimally and were lucky enough to be told, in advance, for free,
the values of the inputs. A set of inputs that are uncorrelated with the output
will have a relatively small infogain. A perfectly correlated set will have an
infogain equal to the output's entropy.

The Relative Information Gain (RIG) is merely the ratio of the information
gain to the entropy, expressed as a fraction. RIG=100 is the best possible
(perfect) correlation. RIG=0 is perfect independence (this set of inputs by
itself has no discernible predictive power for the output). As a simple
rule of thumb, any RIG less than about 5 percent is quite unsatisfactory.
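
As a small worked example (with made-up counts): if the output takes
the value True in 75 of 100 records and False in the other 25, its
entropy is -(0.75 log2 0.75 + 0.25 log2 0.25), approximately 0.81
bits. If a set of inputs then achieved an information gain of 0.40
bits, the RIG would be 0.40 / 0.81, or about 49 percent.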


nearest

nearest : <inattributes> <outattribute>: Do inputs predict output well?


Example:
  nearest height weight age
The above command would compute, via leave-one-out, 1-nearest-neighbor
cross-validation, how well height and weight predict age.


net

net : [iters <n>] [ordered {true|false}] : Learn a Bayes Net.




new

new : <attname> [real|symbolic] [symbolic values] : Make new attribute




new_value

new_value : <attname> [symbolic values]: Add symbolic values to attribute




newdlist

newdlist : <val> [rmin <n>] [maxmegs <n>] [numatts n] [support n]: Decision List Finder.


Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
    <output attribute> = <value>
	 where value is either a specific value or the literal any.
The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).

rmin <n> - This parameter, described in the ADTREE paper, defines the
           maximum leaf-list length. Usually (though not always) the larger
           rmin, the smaller the amount of memory used by the adtree, but
           the slower the counting queries. The speed of the counting 
           queries is not substantially affected for rmin values
           below approximately 200.

maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
              predicted in advance before you try building it. Use this
              parameter to prevent the ADTREE from growing too big...if the
              number of megabytes it uses grows above <n> during construction
              the ADTREE frees itself, and harmlessly returns with a set of
              suggestions about what you can do before trying the maketree
              command again.


newdlist_anomaly_hunt

newdlist_anomaly_hunt : [rmin <n>] [maxmegs <n>] [numatts n] [support n]: Decision List Finder.


Runs a decision list anomaly finder.
A new attribute called STRANGENESS is added to the dataset.
Things with high STRANGENESS stand out as difficult to distinguish from
random noise using a newdlist.

The rules considered for use in the newdlist contain up to 
<numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).

rmin <n> - This parameter, described in the ADTREE paper, defines the
           maximum leaf-list length. Usually (though not always) the larger
           rmin, the smaller the amount of memory used by the adtree, but
           the slower the counting queries. The speed of the counting 
           queries is not substantially affected for rmin values
           below approximately 200.

maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
              predicted in advance before you try building it. Use this
              parameter to prevent the ADTREE from growing too big...if the
              number of megabytes it uses grows above <n> during construction
              the ADTREE frees itself, and harmlessly returns with a set of
              suggestions about what you can do before trying the maketree
              command again.
Notes about ROC graphics:
We plot 
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.


newdlisttest

newdlisttest : <val> [rmin <n>] [maxmegs <n>] [numatts n] [support n]: Decision List Finder.


1) Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
    <output attribute> = <value>
	 where value is either a specific value or the literal any.
2) Tests the rules on the test data set (enabling cross validation).
The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).

rmin <n> - This parameter, described in the ADTREE paper, defines the
           maximum leaf-list length. Usually (though not always) the larger
           rmin, the smaller the amount of memory used by the adtree, but
           the slower the counting queries. The speed of the counting
           queries is not substantially affected for rmin values 
           below approximately 200.

maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
              predicted in advance before you try building it. Use this
              parameter to prevent the ADTREE from growing too big...if the
              number of megabytes it uses grows above <n> during construction
              the ADTREE frees itself, and harmlessly returns with a set of
              suggestions about what you can do before trying the maketree
              command again.
Notes about graphics:
We plot 
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.


newra

newra : <sourceatt> <destatt> : Update <destatt> from <sourceatt>




nomissing

nomissing : NEEDS DOCUMENTING



output

output : <attribute> : Define which attribute is to be used as output


Several ADTREE operations require an output attribute, which is usually
the attribute you are trying to predict. Use this command to change the
output. If the attribute you choose is currently an input, it will be
removed from the set of inputs automatically (see inputs command).


pca

pca : NEEDS DOCUMENTING



print_scorer_matrix

print_scorer_matrix :


Print the reward matrix used in matrix scorer rule learning. Call the matrix M.
Then Mij gives the reward for a rule that predicts a record will have value i,
but that actually has value j.


quit

quit : end program


Warning---please make sure you've saved any results you
need before you type quit.


realfeatures

realfeatures : [maxatts <number>] : Feature selection with real attributes.




realize

realize : <attributes> : Convert certain symbolic attributes to real.


  The only attributes this will affect are ones that are
  initially symbolic, but all their symbolic values can
  be parsed (interpreted) as numbers. So a symbolic attribute 
  with values Andrew, Fred, 3, and 7 would be unchanged. But a symbolic attribute with
  values 0 1 92 17 would be changed into a real attribute in which, for
  example, those records with value==symbolic 92 previously would now
  have the real value 92.000.

  You may wonder why the dataset reader does not simply automatically convert
  attributes into real attributes if all the symbolic value names can be
  parsed as numbers. The reason is that if there are only a small number
  of distinct numbers, the memory savings of representing as symbols is
  colossal. To force the dataset loader to automatically convert such
  attributes as reals, include default_real true on the command line.


regress

regress : [marked] or regress [marked] <input> <output>




removeduplicates

removeduplicates : <attributes> : Remove duplicate rows.


    The simplest use of this command is: removeduplicates all.
        In this case we simply search for pairs of records that are
        identical. Whenever we find such a pair we remove the later
        of the records (needless to say, the implementation is much more
        efficient than this description makes it sound).

    A slightly more complex use of the command is when you specify a
    set of attributes other than "all". In this case it considers
    two records identical if they match on all the attributes you mentioned
    even if they don't match on the attributes you don't mention.


removerows

removerows : [marked|unmarked] : Remove some rows.


    removerows marked: Deletes all the marked records.
    removerows unmarked: Deletes all the unmarked records.
  For help on marking and unmarking records, see the mark command


rename

rename : <attribute> <newname> : Rename.


  Simply changes the name of the given attribute. The new name must not
  duplicate another name in the dataset, nor may it be a number.


rename_value

rename_value : <attribute> <oldvalname> <newname> : Rename value.


<attribute> must be a symbolic attribute. This changes the printed
name of one of its symbolic values. <oldvalname> must be one of
<attribute>'s values, and <newname> is the name it's given.


report

report : <output(s)> [<dep_exp>] [<ind_exp(s)>] ... : Report data in a table to a file.


The <output> can be any of 'stdout' to report in text to the screen,
'graph' to report graphically to the screen, or a filename to report
to.  If the filename ends with '.csv' then it will output the data in
a comma-separated format which can then be read in directly by Excel or
by Miner.  Otherwise, it will output the data as it would to the text
window.  <output(s)> can also be any combination of these, separated by
commas (no spaces), such as: 'report stdout,a.csv,b.txt'.  Finally, a
handy shortcut for 'stdout,graph' is to simply enter 'both'.

<dep_exp> is a mathematical expression formulated from the attributes
with real values.  Immediately before each attribute name in the
expression may appear a '+' or a '|' to indicate whether it should
be reporting the total or the mean of these values.  In addition, the
expression can use any of the binary operators +, -, *, /, or |
(x|y = mean of x and y), or any of the unary operators +, *, or |
which order it to take the cumulative sum, product, or mean of the
value respectively.
Lastly, if <dep_exp> is '#', or is simply left empty, then it will
report a count of records rather than an actual value.
Example: x reports the total values of x (same as +x).
         |x reports the average values of x.
         *|x reports the cumulative product of the average values of x.
         ++x reports the cumulative sum of the total values of x.
         x+y reports the sum between the total values of x and y.
         x|++y reports the average between x and the cumulative sums
               of the total values of y.

<ind_exp(s)> is a list of any number of individual independent
expressions, each separated by a space.  These represent what values
the table is to report over.  Each independent expression consists of
one symbolic attribute.  If it is to only take on certain values,
these values can be specified using any of the relative operators
=, != (not equal to), >, <, or : (x:3,5 means that x can be 3 or 5).
Like the dependent expression, if +, *, or | precedes the
attribute name, it indicates that the values of the dependent
expression should be cumulatively summed, multiplied, or averaged
over the values of the independent attribute.  Finally, an operator
(+, -, *, /, or |) can also be placed immediately after the attribute
name to indicate that the value of the dependent expression should be
non-cumulatively summed (or any other operation) over the values of
the independent attribute (causing the independent attribute to not
be necessary in the table).
Examples: 
report graph rain day weather
  reports the total amount of rain on each day in each type of weather
report graph |rain day=Monday weather
  reports the average amount of rain on Mondays in each type of weather
report graph rain +day weather
  reports the cumulative amount of rain throughout each day of the week
  in each type of weather
report graph |rain day-:Saturday,Sunday weather
  reports the difference in average rainfall between Saturday and Sunday
  for each type of weather
report graph # day:Saturday,Sunday weather
  simply reports the number of records that have Saturday or Sunday for
  their day, and have each one of the types of weather.


right-click

right-click : Hide a strip in a report




rules

rules : <val> [numatts n] [numrules n] [support n]: Rule Finder.


Runs an exhaustive rule finder. Searches for rules predictive of
    <output attribute> = <value>
The rules it will consider contain up to <numatts> attribute-value pairs.
It prints the <numrules> best rules found 
and saves them as the rulequeue object.
It only considers rules that match at least <support> records.


save

save : {marked|unmarked} <filename>.


  The optional keyword marked or unmarked may be supplied.
  If marked is supplied, saves ONLY those records that have been
  marked (see the mark, unmark and togglemarks commands).
  If unmarked is supplied, saves ONLY those records that have been
  unmarked. If neither keyword is supplied, saves ALL the records.

  IMPORTANT: .fds files.
  There are currently two formats to save. 

.csv format

   The default format is simple comma separated values (CSV). In SPR or
   the Auton lab, we usually give these files suffixes of .csv or .ds. The
   format of these files is that the first non-blank line of the file is
   a list of attribute-names separated by commas, e.g.

     Age,Weight,HomeTown

   and then all other non-blank-lines contain the same number (in the above
   example, 3) of tokens on each line, with the tokens separated by commas.
   The k'th such line contains the values of the k'th row (i.e. the
   k'th record) in the dataset. This is the same file format you get if
   you save an Excel(TM) spreadsheet in CSV format. For more details, see
   the load command.

.fds format

   This format is usually more compact (especially if there are many
   symbolic attributes) and always much much much faster to load and save.
   The downside is that it is an SPR/Auton-only format, so you can't export
   or import files in this format between non-SPR/Auton applications. You CAN
   exchange them between different SPR/Auton applications, even ones which run
   on different machines. But you can always convert between .fds and
   .csv formats with this tool.
  
   To save a file in .fds format, all you need do is name the file with
   a .fds suffix.


scoring

scoring : [accuracy|cat_inventory] :


accuracy - score rules based on prediction accuracy
cat_inventory - special purpose scoring for Caterpillar inventory


see

see : [marked/all] <att1> <att2> <att3> : Graphical view


Each of <att1> and <att2> must be an attribute name, or one of the keywords marks or row.


select

select : <attributes> : delete all but the selected attributes.


  Gets rid of every single attribute in the dataset EXCEPT for those
  specified in the list of attributes. There's no UNDO!


shear

shear : rowatt <att> colatt <att> cellatt <att>


rowatt must be a symbolic attribute
colatt must be a symbolic attribute
cellatt must be a real attribute

Replaces the current dataset with a new one in which the number of rows
equals the original arity of rowatt and the number of columns
equals one plus the arity of colatt. The i+1'th attribute in the new
dataset is given the same name as the i'th value of colatt in the
original dataset. The value in the r'th row of the new dataset
in the first (leftmost) column is symbolic: it's the symbolic value
of the r'th value of rowatt. The value in the i+1'th column and
the r'th row is the value that cellatt had in the original
dataset in the first record in which rowatt==r and colatt==i.
    
EXAMPLE: Old dataset:

GENDER STATE EYES  AGE HEIGHT
M      PA    Blue  17  6
M      CA    Brown 39  5
F      MA    Blue  54  6
M      MA    Grey  16  6
F      PA    Grey  17  4

If you run the command:
shear rowatt GENDER colatt EYES cellatt AGE
the dataset would become:

ROWNAME Blue Brown Grey
M       17   39    16
F       54   ?     17

...where ? denotes "missing value"



show

show : {rules {scores|info}}: Display inputs, outputs, and speed, OR rules.


Without arguments: the inputs are the set of attributes the datamining algorithms will be
allowed to consider for making predictions. The output is the thing
being predicted. And the speed is whether or not you use the adtree
for counting. See the inputs command for more details.
With the rules argument: displays the rules.
  with the second argument, scores: displays the prediction scores of the rules on the dataset.
  For each rule a cumulative score is shown for using that rule and all
  its predecessors as a set.


showmarked

showmarked : <number> <attributes> : Show a subset of the data.


This function allows you to see a subset of the rows and attributes
of the dataset. It displays the first <number> marked rows in the
dataset, one per line of output but the only attribute values that
appear on each line of output are those specified in <attributes>.
If you wanted to see the values of attributes height and age for records
numbered 40 through 49 you could do
   mark all
   unmark row < 40
   showmarked 10 height age
See the mark command for more details about marking.


shuffle

shuffle : : Randomly re-order the dataset rows.


The row_numbers of all the records in the dataset change randomly, and when
saved, will be saved in this different order. This is especially useful
for creating training and test sets. For example, suppose you want to
save a test-set of a random subset of rows, and you want that test-set to
be of size 1000 rows, and you want the remainder of the records to be
saved to a training set, you could do the following...

  shuffle
  mark row < 1000
  save marked test.csv
  save unmarked train.csv


sort

sort : [arity|entropy] [down|up] : Sort Attributes.


  Reorders the attributes (columns) of the dataset, but doesn't change
  their values. Both keywords must be supplied. They have the following
  meanings:
  arity = measure an attribute by how many symbolic values it has
  entropy = measure an attribute by its entropy (which is a measure
  of how uniform the distribution of values is...the more uniform, the
  higher the entropy).
  up = leftmost attribute has the lowest measure. Rightmost has highest.
  down = leftmost attribute has the highest measure. Rightmost has lowest.


sortmissing

sortmissing : : Display attributes sorted by number of missing values.


A trivial utility that prints out the attribute names in order,
so that those with the most missing values are printed first.


sortrows

sortrows : <attributes> : Re-order the dataset records


Takes one or more attributes as an argument. Sorts using
the first attribute as the primary key, the next as the secondary
key and so on.
Good news for sort-algorithm aficionados: The sort algorithm is
careful not to unnecessarily reorder rows that are equal according
to the specified attributes.


speed

speed : {fast|slow} : fast => use adtree for counting. slow => don't.


You must either type speed fast or speed slow. The default is fast, and the
only reason for running in slow mode is to do speed comparisons.


submark

submark : <constraint> : Leave only marked records matching constraint marked.


Exactly the same syntax and usage as the mark command, except that it
looks only at records that are already marked. Records that meet the
given constraint stay marked; those that don't are unmarked.

swap

swap : <attnum1> <attnum2> : Reposition attributes by swapping columns.


  Simply swaps the attnum (attribute number) of the two attributes. That
  will affect the order in which attributes are listed in commands such
  as brief.


symb2real

symb2real : <attributes> : Replace symbols with numbers.



   This function will only change an attribute if it is symbolic and
   has no cached real-values.

   Then it simply sets the ith real value to be a small integer
   representation of the value. For example if you had a dataset with
   one attribute called fruit and it originally was

   fruit
   -----
   apple
   orange
   apple
   banana
   banana

   then the set of values for the fruit attribute is { apple , banana , orange }
   and the small-integer representation of each value is simply 0, 1 and 2
   respectively (note how the values are sorted alphabetically).

   The result dataset column is:

   fruit
   -----
   0
   2
   0
   1
   1



symbolize

symbolize : <style> [<num_levels>] <attributes> : make numeric atts symbolic.


<style> must be equal, uniform, manual, or integer. 

If <style> is equal, uniform or manual, <num_levels> MUST be supplied.
If <style> is integer, <num_levels> MUST NOT be supplied.

    equal      breaks up the attribute into buckets of equal width. 

    uniform    can create buckets of non-equal width, but makes sure that
               the number of records in the various buckets are not wildly 
               skewed.

    manual     reads in numbers from the user on where to put the breaks.

    integer    rounds all the real numbers to their nearest integer value
               and makes one symbolic value per integer.

 <num_levels> must be a number between 2 and 100
 <attributes> may be "all" or a list of attribute names or numbers

This takes every numeric attribute in the list and breaks
it up into bucketed symbols. How many buckets? That's
defined by <num_levels>. With the equal style, the buckets are
equal sized, with boundaries at round numbers. The bucket
corresponding to the lowest value has the name v0:XXX- where XXX is
the boundary between the smallest and second smallest bucket.
The integer style simply uses the integer value as the name.
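
For example, assuming a real-valued attribute INCOME (a hypothetical
name):

  symbolize uniform 5 INCOME

replaces INCOME's numeric values with 5 symbolic buckets whose
populations are roughly balanced.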


symbolize_missing

symbolize_missing : <attributes> : make missing values a new symbol.


This command only works on symbolic attributes.
In order to use it on a real column, you would have to first symbolize that column.


table

table : : Display the contingency table for the current set of inputs.


This command pays attention to the current set of inputs but 
ignores the current output. For each possible joint assignment of
values given to the set of input attributes, computes the number of
records that match the given assignment. Prints out this table
(it doesn't bother to print out counts for attribute-value-pair-sets
with a count of zero). WARNING! THESE TABLES CAN BE MONSTROUS IF
THE CURRENT SET OF INPUTS CONTAINS MORE THAN A FEW ATTRIBUTES.


togglemarks

togglemarks : : Make marked rows unmarked and vice versa.


See the mark command for more detail about marks.


transpose

transpose : NEEDS DOCUMENTING



treetest

treetest : size <maxtreesize> iters <iterations>: build & test decision trees.


   size n : The tree returned should have no more than
            this number of nodes (including leaf nodes).

  iters n : If n==1 do regular decision tree. If n > 1, run
            n iterations of a randomized wrapper feature
            selection first.


unmark

unmark : <constraint> : Set all rows matching constraint to be unmarked.


Unmarks a row. See the mark command for more details.