about, acors, add, aggregate, autocor, batch, batchrename, brief, bsave, cartesian, change_scorer_matrix, cleartree, clt, comment, compat, concatenate, copy, cor, correlate, cors, define, delete, delete_value, dlist, dlist_anomaly_hunt, dlisttest, dtree, features, frequent, ignore, info, inputs, join, left-click, load, loadtest, log, ls, maketest, maketree, marginals, mark, matrix, nearest, net, new, new_value, newdlist, newdlist_anomaly_hunt, newdlisttest, newra, nomissing, output, pca, print_scorer_matrix, quit, realfeatures, realize, regress, removeduplicates, removerows, rename, rename_value, report, right-click, rules, save, scoring, see, select, shear, show, showmarked, shuffle, sort, sortmissing, sortrows, speed, submark, swap, symb2real, symbolize, symbolize_missing, table, togglemarks, transpose, treetest, unmark
Computes the autocorrelation function for <attname>. <attname> must represent a real-valued attribute. Legal modifiers: "rank", "sig", "brief", "marked", "rand [samplesize <n>]"
This allows you to add one or more attributes to the set of inputs. See the inputs command for more details, and remember you can never add the current output attribute.
Type the following (all on one line):
aggregate <writeattname> = <agstat> of <readattname> using [marked] records
[matching <attnum1> <attnum2> .. <attnumN>]
Where...
<writeattname> is the name of a real-valued attribute you are
defining.
<agstat> is one of mean variance sdev min max first.
<readattname> is the attribute of which you are taking aggregate
statistics.
using marked records means only use statistics of those records currently
marked (see the mark and unmark commands)
using records means use statistics from all records
matching <attributes> means when defining the statistic for the i'th record,
use that statistic over all the records which match
the same values as the i'th record in each of the
attributes in <attributes>.
Example
INCOME TOWN CARMAKE HOUSEVALUE
30000 york ford 100000
20000 bath volvo 60000
50000 york gm 170000
10000 bath ford 80000
60000 bath ford 90000
You might want a new attribute that specifies for each record what is
the mean income in the town associated with that record. You'd do that
with:
new TOWNINCOME real
aggregate TOWNINCOME = mean of INCOME using records matching TOWN
Or you might want an attribute that specifies for each record what is
the mean income of ford-owners in the town associated with that record.
You'd do that with
new TFOINCOME real
unmark all
mark CARMAKE == ford
aggregate TFOINCOME = mean of INCOME using marked records matching TOWN
Note that the new attribute <writeattname> and the attribute you're
collecting statistics for <readattname> must both be real-valued. The
matching attributes must be symbolic (sorry).
Computes the autocorrelation coefficient of <attname> for the specified <lag>. <attname> must represent a real-valued attribute; <lag> must be an integer larger than 0 (and reasonably smaller than the number of datapoints). Legal modifiers: "rank", "sig", "brief", "marked", "rand [samplesize <n>]"
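Example (a sketch only; TEMPERATURE is a hypothetical real-valued attribute):
autocor TEMPERATURE 5 brief
would report the lag-5 autocorrelation coefficient of TEMPERATURE with brief output.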
The commands are executed in just the same way as if you were to type them into the command line manually here. Note that all output is sent to this display as usual. Note too that the execution of the commands continues blindly even if there are errors along the way. The batch file should simply have one command on each line. You can leave blank lines if you like. Lines beginning with a # character will be treated as comments and ignored.
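Here is a sketch of a batch file (the filenames are hypothetical):
# load the data, summarize it, and resave it in the fast binary format
load mydata.csv
brief
save mydata.fds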
<filename> must be a textfile containing exactly the same number of tokens as there are attributes in the dataset. The tokens must be separated by spaces and/or on separate lines. The i'th token in the file becomes the name of the i'th attribute in the dataset.
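For example, for a dataset with three attributes the file might contain (with hypothetical names):
Age Weight HomeTown
after which the first attribute is named Age, the second Weight, and the third HomeTown.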
You are simply told the names and whether the attributes are symbolic or real. Each attribute is described on one line. For more details on attributes, use the info command.
This command saves in a simple platform-dependent binary format.
See the save command for other formats.
The optional keyword marked or unmarked may be supplied.
If marked is supplied, saves ONLY those records that have been
marked (see the mark, unmark and togglemarks commands).
If unmarked is supplied, saves ONLY those records that have been
unmarked. If neither keyword is supplied, saves ALL the records.
To load from MATLAB on the same platform, use:
fid=fopen('file','r');           % open the binary file written by bsave
dims=fread(fid,2,'int');         % read the two stored dimensions
data=fread(fid,dims,'double');   % read the dims(1)-by-dims(2) block of doubles
fclose(fid);                     % close the file
Change the reward matrix used in matrix scorer rule learning. Call the matrix M. Then Mij gives the reward for a rule that predicts a record will have value i, but that actually has value j. Initially we use the identity matrix for reward.
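For example, if the output has two values v1 and v2, the initial (identity) reward matrix is
            actual=v1   actual=v2
predict v1      1           0
predict v2      0           1
so a rule earns reward 1 for each correct prediction and 0 for each mistake.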
Use this command to erase an existing adtree, so you can build a new one with different parameters.
Generates Chow-Liu dependency tree(s) for the current inputs using the correlation method specified by <type> with optional settings [modifiers]. Legal types: "linear", "quad", "rank", and "quadrank". Legal modifiers: "marked", "threshold <n>"
This command is almost certainly useless to you unless you
are planning on saving the text of your session to a file, and wish
to add a comment to the reader of the session.
comment on <sessionname>: This will switch on some internal flags that will
add special annotations to all output so that it can easily
be turned into a tutorial. Graphics will be saved to files
with names like andrew12.ps if andrew was given as the
session name.
comment off: Switch off the above.
comment <anything else>: Will allow the user to type in commentary about
what's going on to be used by the comment command.
Important: TO STOP ENTERING COMMENT, PUT A SINGLE . ON A LINE.
Appends the records of one or more files onto the end of the current dataset. This is useful, for example, for putting training and validation data together.
Examples:
1. concatenate population.fds
2. concatenate population.fds latitude.fds
3. concatenate population.fds rename
Suppose the current dataset in memory is
city state crimerate homestate
pit pa 100 pa
ny ny 230 pa
phil pa 120 oh
new nj 150 pa
And suppose population.fds is
city state crimerate homestate
new pa 40 pa
gulf al 30 fl
Then after example 1 the dataset in memory will be
city state crimerate homestate
pit pa 100 pa
ny ny 230 pa
phil pa 120 oh
new nj 150 pa
new pa 40 pa
gulf al 30 fl
All that happens is that the records from population.fds are appended
at the end of the current dataset.
Example 2 allows you to merge with more than one file at once. It's
exactly equivalent to running the following commands in succession:
concatenate population.fds
concatenate latitude.fds
Example 3 shows the rename option. If the columns in the population.fds
had different names than those in the current dataset, then the names
in population.fds will be ignored. (By default, if you don't include
rename on the command, the system will moan at you if it finds two
attributes with names that disagree).
Computes the correlation coefficient of <input attname> w.r.t. the real-valued <output attname> using the method specified by <type> with optional settings [modifiers]. Legal types: "linear", "quad", "rank" and "all". Legal modifiers: "sig", "brief", "marked", "rand [samplesize <n>]"
Computes the correlation coefficients of all current inputs w.r.t. the current output using the method specified by <type> with optional settings [modifiers]. Legal types: "linear", "quad" and "rank". Legal modifiers: "sig", "brief", "marked", "rand [samplesize <n>]"
or (equivalent syntax): <attribute> = <expression>
<expression> may be composed of attribute names, numeric/symbolic values,
parentheses and operators.
Available operators (ordered according to decreasing priority):
abs (unary) - calculates the absolute value of its argument; applicable to
real-valued attributes only.
sqrt (unary) - square root of a real, non-negative argument.
sign (unary) - returns -1.0, 0.0 or +1.0 if its real argument is
respectively negative, zero, or positive.
exp (unary) - applicable to real-valued arguments.
log (unary) - natural logarithm; applicable to positive real arguments.
log10 (unary) - decimal logarithm; applicable to positive real arguments.
pre (unary) - applicable to real or symbolic data attributes (not to
individual values); shifts the argument values back, so that if k is
the data row index, then resultant[k] = argument[k-1] for k>0 and
resultant[0] = argument[0] otherwise.
suc (unary) - applicable to real or symbolic data attributes (not to
individual values); shifts the argument values forward, so that
resultant[k] = argument[k+1] for k<size-1 and
resultant[size-1] = argument[size-1] otherwise (size is the number of
data rows).
deriv (binary) - applicable to real attributes only; "deriv y x" computes
the first derivative dy/dx.
ranks (unary) - computes ranks of its argument; intended for use with real
attributes.
sranks (binary) - "sranks s r" computes ranks of symbolic attribute s with
regard to the real attribute r, using a specific permutation of the
values of s which provides a presumably monotonic behavior of s vs. r.
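Example, using the equivalent syntax above (INCOME is a hypothetical real-valued attribute, and we assume the target attribute is first created with the new command, as in the aggregate examples):
new LOGINC real
LOGINC = log INCOME
This fills LOGINC with the natural logarithm of each record's INCOME.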
Gets rid of the given attribute (and, of course, all its values) from the dataset. There's no undo!
<attribute> must be a symbolic attribute. This removes the printed name of one of its symbolic values. All records with the named value revert to having a missing value for that attribute.
Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
<output attribute> = <value>
where value is either a specific value or the literal any.
The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).
rmin <n> - This parameter, described in the ADTREE paper, defines the
maximum leaf-list length. Usually (though not always) the larger
rmin, the smaller the amount of memory used by the adtree, but
the slower the counting queries. The speed of the counting
queries is not substantially affected for rmin values
below approximately 200.
maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
predicted in advance before you try building it. Use this
parameter to prevent the ADTREE from growing too big...if the
number of megabytes it uses grows above <n> during construction
the ADTREE frees itself, and harmlessly returns with a set of
suggestions about what you can do before trying the maketree
command again.
Notes about ROC graphics:
We plot
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.
Runs a decision list anomaly finder.
A new attribute called STRANGENESS is added to the dataset.
Things with high STRANGENESS stand out as difficult to distinguish from
random noise using a dlist.
The rules considered for use in the dlist contain up to
<numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).
rmin <n> - This parameter, described in the ADTREE paper, defines the
maximum leaf-list length. Usually (though not always) the larger
rmin, the smaller the amount of memory used by the adtree, but
the slower the counting queries. The speed of the counting
queries is not substantially affected for rmin values
below approximately 200.
maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
predicted in advance before you try building it. Use this
parameter to prevent the ADTREE from growing too big...if the
number of megabytes it uses grows above <n> during construction
the ADTREE frees itself, and harmlessly returns with a set of
suggestions about what you can do before trying the maketree
command again.
Notes about ROC graphics:
We plot
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.
1) Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
<output attribute> = <value>
where value is either a specific value or the literal any.
2) Tests the rules on the test data set (enabling cross validation).
The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).
rmin <n> - This parameter, described in the ADTREE paper, defines the
maximum leaf-list length. Usually (though not always) the larger
rmin, the smaller the amount of memory used by the adtree, but
the slower the counting queries. The speed of the counting
queries is not substantially affected for rmin values
below approximately 200.
maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
predicted in advance before you try building it. Use this
parameter to prevent the ADTREE from growing too big...if the
number of megabytes it uses grows above <n> during construction
the ADTREE frees itself, and harmlessly returns with a set of
suggestions about what you can do before trying the maketree
command again.
Notes about graphics:
We plot
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.
size n : The tree returned should have no more than
this number of nodes (including leaf nodes).
iters n : If n==1 do regular decision tree. If n > 1, run
n iterations of a randomized wrapper feature
selection first.
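A hypothetical invocation (assuming keyword-style options like those of maketree):
dtree size 20 iters 1
would build a regular decision tree with at most 20 nodes and no randomized feature selection.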
This set of <n> features will be a subset of the current set of inputs. It will be the best such set (in terms of infogain) for predicting the current output. See the matrix command for the definition of what this all means.
freq_atts - the attributes of which you want to find frequent tuples
match_atts - the attributes that indicate which records should be considered
together in finding frequent attribute tuples
setsize - the size of the tuple to be found (e.g. 2 indicates find all
frequent pairs)
mincount - the minimum number of occurrences before a tuple will be included
in the final result
filename - file in which to save the resulting frequent dataset
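A hypothetical invocation (the exact argument syntax may differ; the attribute and file names are made up):
frequent freq_atts CARMAKE TOWN setsize 2 mincount 10 filename pairs.fds
would search for pairs of CARMAKE/TOWN values occurring together in at least 10 records and save the resulting frequent dataset to pairs.fds.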
This allows you to remove one or more attributes from the set of inputs. See the inputs command for more details.
<attributes> may be "all" or a list of attribute names or numbers For the symbolic attributes, you are told what the symbolic values are, and how many times each value occurs in the dataset. For the real-valued attributes, you are told their means, variances, mins and maxes. In each case you are also shown the first eight values. You are also told how many values are missing.
Several ADTREE operations require a set of input attributes to search over. Here's the command you can use to define that set. Just list the attributes you're interested in, or type inputs all to select all attributes as inputs. Note that the current output attribute (see the output command) will never be added to the set of inputs, even if you request that it should be.
Examples:
1. join population.fds key state
2. join population.fds latitude.fds key state
3. join population.fds key homestate/state
4. join age.fds keys firstname lastname
Suppose the current dataset in memory is
city state crimerate homestate
pit pa 100 pa
ny ny 230 pa
phil pa 120 oh
new nj 150 pa
And suppose population.fds is
state population
ha 3
nj 10
ny 25
oh 18
pa 15
Then after example 1 the dataset in memory will be
city state crimerate homestate population
pit pa 100 pa 15
ny ny 230 pa 25
phil pa 120 oh 15
new nj 150 pa 10
The newly added column gives you the population of the state mentioned
as the 2nd attribute in the memory dataset. Where does the population value
come from? From the file population.fds
Example 2 allows you to merge with more than one file at once. It's
exactly equivalent to running the following commands in succession:
join population.fds key state
join latitude.fds key state
Example 3 shows what we'd have to do if we wanted the population column to
be associated with the homestate in the in-memory file even though the
matching record we're using in population.fds is called state.
Sometimes one key is not enough to disambiguate which record in
the new file contains the values you want. For example, 4 would be
useful in this case:
ORIGINAL DATASET.... AGE.FDS...
firstname lastname height weight year firstname lastname age
andrew moore 75 135 1985 andrew moore 34
jeff lee 65 235 1999 andrew lee 33
andrew moore 75 155 1998 jeff lee 32
andrew lee 35 55 1998
when the result would be
firstname lastname height weight year age
andrew moore 75 135 1985 34
jeff lee 65 235 1999 32
andrew moore 75 155 1998 34
andrew lee 35 55 1998 33
Notes: * In all cases, all attributes from <filename> are included in the
in-memory dataset. Delete those you don't want.
* All key attributes must be symbolic
* Key attributes must match (have the same values) in memory & in
loaded datasets
<filename> must be the name of a datafile in one of the formats
described below.
[options] are
default_real {true|false} (default value is FALSE)
If you specify default_real true on the command line then an
attribute with only a small number of values, each an integer,
will be treated as a real-valued attribute. See the realize
command for more discussion.
ignore_missing {true|false} (default value is FALSE)
If there are any missing values in the dataset, and if set to
TRUE, the loader will simply ignore any records containing missing
values.
If <filename> contains *, then multiple files will be loaded into one
dataset. If the new dataset is loaded successfully, then this tool forgets
the previous dataset.
File formats.
If the filename ends with .fds, then this tool will assume that the
file was created in standard Schenley Park Research compact .fds format,
and will load accordingly. For large files, this is up to hundreds
of times faster for loading, and so is heavily recommended.
For more discussion, see the save command.
All other file suffixes are loaded with an adaptive loader that can
attempt to load with comma separated (CSV) format and/or space-separated
format. It ignores all blank lines and lines beginning with #. The
first non-ignored line is initially assumed to consist of attribute names.
If there's at least one number amongst them, however, loading assumes
that this file was created without attribute names. In that case it
treats the first line as the first record, and generates its own
(unimaginative) attribute names. It automatically decides whether each
attribute is real or symbolic. If it sees symbolic values with spaces
in the symbolic name, it replaces the spaces with hyphens.
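Example (mydata.csv is a hypothetical file):
load mydata.csv default_real true ignore_missing true
loads mydata.csv, treating small-integer symbolic attributes as real-valued and skipping any records containing missing values.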
Same loading conventions as the regular load command.
This command takes all text output and appends it to a file. (Note that appending allows the user to add to a log file at any time.) Typing 'log off' at any time will disable logging.
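For example (assuming the log file name is given as the argument; session.txt is hypothetical):
log session.txt
starts appending all text output to session.txt, and
log off
disables logging again.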
You must run maketest before you can run any of the decision tree code. Uses the rmin and maxmegs options described in the maketree command.
You must run maketree before you can run any of the ADTREE datamining
commands. The ADTREE is the thing that cleverly caches sufficient
statistics to allow counting queries to be answered very quickly.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).
rmin <n> - This parameter, described in the ADTREE paper, defines the
maximum leaf-list length. Usually (though not always) the larger
rmin, the smaller the amount of memory used by the adtree, but
the slower the counting queries. The speed of the counting
queries is not substantially affected for rmin values
below approximately 200.
maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
predicted in advance before you try building it. Use this
parameter to prevent the ADTREE from growing too big...if the
number of megabytes it uses grows above <n> during construction
the ADTREE frees itself, and harmlessly returns with a set of
suggestions about what you can do before trying the maketree
command again.
lazy y/n - You can choose whether to use regular 'classic' static trees
or new clever lazy trees. Lazy trees can use much less memory
with some (possibly severe) slowdowns.
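A sketch of a typical invocation (the parameter values are illustrative only):
maketree rmin 50 maxmegs 200 lazy n
builds a classic (non-lazy) ADTREE with a maximum leaf-list length of 50, giving up harmlessly if construction exceeds 200 megabytes.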
where <n> is 1 or greater and <x> is "all" or an attribute name
A very fast and high-dimensional correlation finder.
This command will operate on the attributes mentioned in the
current set of inputs and defined as the current output.
You can ask to see which of these attributes are good for
predicting which other attributes in what contexts.
Suppose, for example, that X Y and Z are all binary (True/False)
attributes and Z is True if and only if X and Y have the same
value. Suppose that in the dataset X and Y have their values
randomly and independently chosen to be True/False with 50-50
probability. Then notice, importantly, that the pairwise correlation
between X and Z will be zero (and so will X's RIG for predicting
Z). Ditto for Y and Z. And yet X and Y together are a perfect
predictor of Z (RIG = 100%).
(Side note: For a definition of RIG,
see the documentation for the matrix
command).
This command can avoid the error that normal correlation would
make, of deducing no relation between X and Z. It does so by
searching for the best CONTEXT for X to predict Z, where by
CONTEXT we mean we allow X to appeal to other attributes to
help it predict Z.
Suppose we want to know how well some attribute called the
PREDICTOR predicts some other attribute (called the PREDICTEE).
We use the important notion of RIGI (Relative Information Gain
Increase) to score the predictor's performance in a given context.
We compute
RIG_with = RIG of predictor and context for predicting predictee
RIG_without = RIG of context alone for predicting predictee
then
RIGI = RIG_with - RIG_without
For each requested PREDICTOR/PREDICTEE combination, this
command finds the best context (the one with the highest RIGI)
and reports it.
For each PREDICTEE the command gives a ranked list of which
are the best PREDICTORS and their associated contexts.
Then for each PREDICTOR the command gives a ranked list of which
are the PREDICTEES for which it does best and their associated contexts.
Parameters:
contextsize: What is the maximal allowed size of the context
set of variables? The number of high dimensional
correlations considered increases VERY rapidly if
you make this large.
predictor: Can be "all", meaning use all the current inputs and the
current output as predictors. Else you may specify a
single predictor.
predictee: Can be "all", meaning use all the current inputs and the
current output as predictees. Else you may specify a
single predictee.
It is fine for an attribute to be used both as a predictor and a
predictee.
When the program starts, or when a new dataset is loaded, all rows are
initially unmarked.
You can mark subsets of the rows at any time. And then, if you wish,
you can unmark subsets at any time. You can then save only the marked, or
only the unmarked rows if you wish (see the save command).
To mark all the rows matching some constraint, just type mark <constraint>.
Legal constraints are...
all - matches every single row
missing - matches only rows with missing value(s)
<attribute> == <value> - matches only rows in which given SYMBOLIC attribute
has the given SYMBOLIC value.
<attribute> < <number> - matches only rows in which given REAL attribute
has a numeric value strictly below threshold.
<attribute> > <number> - matches only rows in which given REAL attribute
has a numeric value strictly above threshold.
row < <number> - matches only the rows with row_numbers strictly
below <number>. The first (top) row in the dataset
has row_number 0 etc.
row > <number> - matches only the rows with row_numbers strictly
greater than <number>
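Examples (INCOME and CARMAKE are hypothetical attributes):
mark CARMAKE == ford
mark INCOME > 40000
unmark row < 100
The first marks every row in which CARMAKE is ford; the second marks every row in which INCOME exceeds 40000; the third unmarks the rows numbered 0 through 99.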
Draws what Schenley Park Research calls a w_matrix, where each row of the w_matrix
corresponds to a different assignment of sets-of-values to the inputs, and where
each column corresponds to a different value of the output attribute.
A w_matrix is a two-dimensional table counting how frequently
each value of the output attribute co-occurs with values of
input attributes. For example, suppose we had a dataset with
four symbolic attributes thus:
Weight which can have values light, medium, heavy
Healthy which can have values true, false
Wealthy which can have values true, false
Wise which can have values true, false
Then here is an example of a w_matrix for inputs Weight and Wise and
output healthy...
Weight Wise | Healthy=False Healthy=True
-----------------------------------------------
light false | 64 32
light true | 16 0
medium true | 20 20
heavy false | 10 2
The w_matrix is the matrix of numbers in the bottom right quadrant,
and tells us, for example, that amongst people who are light and unwise,
64 are also unhealthy and 32 are healthy.
IMPORTANT: a w_matrix never bothers to include rows for which all
counts are zero. So for example this w_matrix shows us that
there are no weight=medium wise=false people in the database.
The entropy of the output variable is shown. This is a measure of
how uniformly distributed the output is...a high entropy output is one
with a uniform distribution, and is thus harder to predict. Formally,
the entropy is the number of binary bits of information needed to encode
an output value selected according to the empirical distribution of values
of the output attribute.
The information gain (IG) of using the inputs to predict the output is
also shown. The higher this number the more predictive. Formally, it is
the number of bits that are, on average, saved if you had to encode the
output optimally and were lucky enough to be told, in advance, for free,
the values of the inputs. A set of inputs that are uncorrelated with the output
will have a relatively small infogain. A perfectly correlated set will have an
infogain equal to the output's entropy.
The Relative Information Gain (RIG) is merely the ratio of the information
gain to the entropy, expressed as a fraction. RIG=100 is the best possible
(perfect) correlation. RIG=0 is perfect independence (this set of inputs by
itself has no discernible predictive power for the output). As a simple
rule of thumb, any RIG less than about 5 percent is quite unsatisfactory.
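As a worked example using the w_matrix above: the output counts are Healthy=False 110 and Healthy=True 54 out of 164 records, so the entropy is
-(110/164)*log2(110/164) - (54/164)*log2(54/164), approximately 0.914 bits.
The count-weighted average of the per-row entropies comes to roughly 0.829 bits, so IG is approximately 0.914 - 0.829 = 0.085 bits, and RIG is approximately 0.085/0.914, i.e. about 9 percent.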
Example:
nearest height weight age
The above command would compute, via leave-one-out 1-nearest-neighbor
cross-validation, how well height and weight predict age.
Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
<output attribute> = <value>
where value is either a specific value or the literal any.
The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).
rmin <n> - This parameter, described in the ADTREE paper, defines the
maximum leaf-list length. Usually (though not always) the larger
rmin, the smaller the amount of memory used by the adtree, but
the slower the counting queries. The speed of the counting
queries is not substantially affected for rmin values
below approximately 200.
maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
predicted in advance before you try building it. Use this
parameter to prevent the ADTREE from growing too big...if the
number of megabytes it uses grows above <n> during construction
the ADTREE frees itself, and harmlessly returns with a set of
suggestions about what you can do before trying the maketree
command again.
Runs a decision list anomaly finder.
A new attribute called STRANGENESS is added to the dataset.
Things with high STRANGENESS stand out as difficult to distinguish from
random noise using a newdlist.
The rules considered for use in the newdlist contain up to
<numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).
rmin <n> - This parameter, described in the ADTREE paper, defines the
maximum leaf-list length. Usually (though not always) the larger
rmin, the smaller the amount of memory used by the adtree, but
the slower the counting queries. The speed of the counting
queries is not substantially affected for rmin values
below approximately 200.
maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
predicted in advance before you try building it. Use this
parameter to prevent the ADTREE from growing too big...if the
number of megabytes it uses grows above <n> during construction
the ADTREE frees itself, and harmlessly returns with a set of
suggestions about what you can do before trying the maketree
command again.
Notes about ROC graphics:
We plot
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.
1) Runs a decision list finder. Searches for a set of if-then-else
rules predictive of
<output attribute> = <value>
where value is either a specific value or the literal any.
2) Tests the rules on the test data set (enabling cross validation).
The rules it will consider contain up to <numatts> attribute-value pairs.
It only considers rules that match at least <support> records.
All the attributes must be symbolic. Delete or Symbolize any real
attributes (see the delete and symbolize commands).
rmin <n> - This parameter, described in the ADTREE paper, defines the
maximum leaf-list length. Usually (though not always) the larger
rmin, the smaller the amount of memory used by the adtree, but
the slower the counting queries. The speed of the counting
queries is not substantially affected for rmin values
below approximately 200.
maxmegs <n> - Unfortunately, the size of the ADTREE cannot be easily
predicted in advance before you try building it. Use this
parameter to prevent the ADTREE from growing too big...if the
number of megabytes it uses grows above <n> during construction
the ADTREE frees itself, and harmlessly returns with a set of
suggestions about what you can do before trying the maketree
command again.
Notes about graphics:
We plot
Fraction of records for which a prediction is made.
vs
Fraction of desired records correctly classified.
Several ADTREE operations require an output attribute, which is usually the attribute you are trying to predict. Use this command to change the output. If the attribute you choose is currently an input, it will be removed from the set of inputs automatically (see inputs command).
Print the reward matrix used in matrix scorer rule learning. Call the matrix M. Then Mij gives the reward for a rule that predicts a record will have value i, but that actually has value j.
Warning---please make sure you've saved any results you need before you type quit
The only attributes this will affect are ones that are initially symbolic, but all their symbolic values can be parsed (interpreted) as numbers. So a symbolic attribute with values Andrew, Fred, 3, and 7 would be unchanged. But a symbolic attribute with values 0 1 92 17 would be changed into a real attribute in which, for example, those records with value==symbolic 92 previously would now have the real value 92.000. You may wonder why the dataset reader does not simply automatically convert attributes into real attributes if all the symbolic value names can be parsed as numbers. The reason is that if there are only a small number of distinct numbers, the memory savings of representing as symbols is colossal. To force the dataset loader to automatically convert such attributes as reals, include default_real true on the command line.
The simplest use of this command is: removeduplicates all.
In this case we simply search for pairs of records that are
identical. Whenever we find such a pair we remove the later
of the records (needless to say, the implementation is much more
efficient than this description makes it sound).
A slightly more complex use of the command is when you specify a
set of attributes other than "all". In this case it considers
two records identical if they match on all the attributes you mentioned
even if they don't match on the attributes you don't mention.
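Example (firstname and lastname are hypothetical attributes):
removeduplicates firstname lastname
removes every record that matches an earlier record on both firstname and lastname, even if the two records differ on other attributes.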
removerows marked: Deletes all the marked records.
removerows unmarked: Deletes all the unmarked records.
For help on marking and unmarking records, see the mark command
Simply changes the name of the given attribute. The new name must not duplicate another name in the dataset, nor may it be a number.
<attribute> must be a symbolic attribute. This changes the printed name of one of its symbolic values. <oldvalname> must be one of <attribute>'s values, and <newname> is the name it's given.
The <output> can be any of 'stdout' to report in text to the screen,
'graph' to report graphically to the screen, or a filename to report
to. If the filename ends with '.csv' then it will output the data in
a comma-separated format which can then be read in directly by Excel or
by Miner. Otherwise, it will output the data as it would to the text
window. <output(s)> can also be any combination of these, separated by
commas (no spaces), such as: 'report stdout,a.csv,b.txt'. Finally, a
handy shortcut for 'stdout,graph' is to simply enter 'both'.
<dep_exp> is a mathematical expression formulated from the attributes
with real values. Immediately before each attribute name in the
expression may appear a '+' or a '|' to indicate whether it should
be reporting the total or the mean of these values. In addition, the
expression can use any of the binary operators +, -, *, /, or |
(x|y = mean of x and y), or any of the unary operators +, *, or |,
which direct it to take the cumulative sum, product, or mean of the
values respectively.
Lastly, if <dep_exp> is '#', or is simply left empty, then it will
report a count of records rather than an actual value.
Example: x reports the total values of x (same as +x).
|x reports the average values of x.
*|x reports the cumulative product of the average values of x.
++x reports the cumulative sum of the total values of x.
x+y reports the sum of the total values of x and y.
x|++y reports the average of x and the cumulative sums
of the total values of y.
<ind_exp(s)> is a list of any number of individual independent
expressions, each separated by a space. These represent what values
the table is to report over. Each independent expression consists of
one symbolic attribute. If it is to only take on certain values,
these values can be specified using any of the relative operators
=, != (not equal to), >, <, or : (x:3,5 means that x can be 3 or 5).
Like the dependent expression, if +, *, or | precedes the
attribute name, it indicates that the values of the dependent
expression should be cumulatively summed, multiplied, or averaged
over the values of the independent attribute. Finally, an operator
(+, -, *, /, or |) can also be placed immediately after the attribute
name to indicate that the value of the dependent expression should be
non-cumulatively summed (or any other operation) over the values of
the independent attribute (causing the independent attribute to not
be necessary in the table).
Examples:
report graph rain day weather
reports the total amount of rain on each day in each type of weather
report graph |rain day=Monday weather
reports the average amount of rain on Mondays in each type of weather
report graph rain +day weather
reports the cumulative amount of rain throughout each day of the week
in each type of weather
report graph |rain day-:Saturday,Sunday weather
reports the difference in average rainfall between Saturday and Sunday
for each type of weather
report graph # day:Saturday,Sunday weather
simply reports the number of records that have Saturday or Sunday for
their day, and have each one of the types of weather.
Runs an exhaustive rule finder. Searches for rules predictive of
<output attribute> = <value>
The rules it will consider contain up to <numatts> attribute-value pairs.
It prints the <numrules> best rules found
and saves them as the rulequeue object.
It only considers rules that match at least <support> records.
The optional keyword marked or unmarked may be supplied.
If marked is supplied, saves ONLY those records that have been
marked (see the mark, unmark and togglemarks commands).
If unmarked is supplied, saves ONLY those records that have been
unmarked. If neither keyword is supplied, saves ALL the records.
There are currently two formats to save. IMPORTANT: for large files,
the .fds format is heavily recommended (see below).
.csv format
The default format is simple comma separated values (CSV). In SPR or
the Auton lab, we usually give these files suffixes of .csv or .ds. The
format of these files is that the first non-blank line of the file is
a list of attribute names separated by commas, e.g.
Age,Weight,HomeTown
and then all other non-blank lines contain the same number (in the above
example, 3) of tokens on each line, with the tokens separated by commas.
The k'th such line contains the values of the k'th row (i.e. the
k'th record) in the dataset. This is the same file format you get if
you save an Excel(TM) spreadsheet in CSV format. For more details, see
the load command.
.fds format
This format is usually more compact (especially if there are many
symbolic attributes) and always much much much faster to load and save.
The downside is that it is an SPR/Auton-only format, so you can't export
or import files in this format between non-SPR/Auton applications. You CAN
export them between different SPR/Auton applications, even ones which run
on different machines. But you can always convert between .fds and
.csv formats with this tool.
To save a file in .fds format, all you need do is name the file with
a .fds suffix.
accuracy - score rules based on prediction accuracy
cat_inventory - special-purpose scoring for Caterpillar inventory
Each of att1 and att2 must be one of an attribute name, or marks or row.
Gets rid of every single attribute in the dataset EXCEPT for those specified in the list of attributes. There's no UNDO!
rowatt must be a symbolic attribute
colatt must be a symbolic attribute
cellatt must be a real attribute
Replaces the current dataset by one in which the number of rows
equals the original arity of rowatt and the number of columns
equals one plus the arity of colatt. The i+1'th attribute in the new
dataset is given the same name as the i'th value of colatt in the
original dataset. The value in the r'th row of the new dataset
in the first (leftmost) column is symbolic: it's the symbolic value
of the r'th value of rowatt. The value in the i+1'th column and
the r'th row is the value that cellatt had in the original
dataset in the first record in which rowatt==r and colatt==i.
EXAMPLE: Old dataset:
GENDER STATE EYES AGE HEIGHT
M PA Blue 17 6
M CA Brown 39 5
F MA Blue 54 6
M MA Grey 16 6
F PA Grey 17 4
If you run the command:
shear rowatt GENDER colatt EYES cellatt AGE
the dataset would become:
ROWNAME Blue Brown Grey
M 17 39 16
F 54 ? 17
...where ? denotes "missing value"
Without arguments: the inputs are the set of attributes the datamining
algorithms will be allowed to consider for making predictions. The output
is the thing being predicted. And the speed is whether or not you use the
adtree for counting. See the inputs command for more details.
With the rules argument: displays the rules.
With the second argument, scores: displays the prediction scores of the
rules on the dataset. For each rule a cumulative score is shown for using
that rule and all its predecessors as a set.
This function allows you to see a subset of the rows and attributes of the
dataset. It displays the first <number> marked rows in the dataset, one per
line of output, but the only attribute values that appear on each line of
output are those specified in <attributes>. If you wanted to see the values
of attributes height and age for records numbered 40 through 49, you could do
mark all
unmark row < 40
show 10 height age
See the mark command for more details about marking.
The row_numbers of all the records in the dataset change randomly, and when
saved, will be saved in this different order. This is especially useful for
creating training and test sets. For example, suppose you want to save a
test set consisting of a random subset of 1000 rows, and save the remainder
of the records to a training set. You could do the following:
shuffle
mark row < 1000
save marked test.csv
save unmarked train.csv
Reorders the attributes (columns) of the dataset, but doesn't change their
values. Both keywords must be supplied. They have the following meanings:
arity = measure an attribute by how many symbolic values it has
entropy = measure an attribute by its entropy (which is a measure of how
uniform the distribution of values is...the more uniform, the
higher the entropy)
up = leftmost attribute has the lowest measure. Rightmost has the highest.
down = leftmost attribute has the highest measure. Rightmost has the lowest.
A trivial utility that prints out the attribute names in order, so that those with the most missing values are printed first.
Takes one or more attributes as arguments. Sorts using the first attribute as the primary key, the next as the secondary key, and so on. Good news for sorting-algorithm aficionados: the sort algorithm is careful not to unnecessarily reorder rows that are equal according to the specified attributes.
You must either type speed fast or speed slow. The default is fast, and the only reason for running in slow mode is to do speed comparisons.
Exactly the same syntax and usage as the mark command, except that it looks only at records that are already marked and leaves them marked if they meet the given condition, and unmarks them if they don't.
Simply swaps the attnums (attribute numbers) of the two attributes. That will affect the order in which attributes are listed in commands such as brief.
This function will only change an attribute if it is symbolic and
has no cached real-values.
Then it simply sets the ith real value to be a small integer
representation of the value. For example, if you had a dataset with
one attribute called fruit and it originally was
fruit
-----
apple
orange
apple
banana
banana
then the set of values for the fruit attribute is { apple , banana , orange }
and the small-integer values are simply 0, 1 and 2
respectively (note how the values are sorted alphabetically).
The result dataset column is:
fruit
-----
0
2
0
1
1
<style> must be equal, uniform, manual, or integer.
If <style> is equal, uniform or manual, <num_levels> MUST be supplied.
If <style> is integer, <num_levels> MUST NOT be supplied.
equal breaks up the attribute into buckets of equal width.
uniform can create buckets of non-equal width, but makes sure that
the number of records in the various buckets are not wildly
skewed.
manual reads in numbers from the user on where to put the breaks.
integer rounds all the real numbers to their nearest integer value
and makes one symbolic value per integer.
<num_levels> must be a number between 2 and 100
<attributes> may be "all" or a list of attribute names or numbers
This takes every numeric attribute in the list and breaks
it up into bucketed symbols. How many buckets? That's
defined by <num_levels>. The buckets are equal sized, with
boundaries at round numbers. The bucket corresponding to the
lowest value has the name v0:XXX- where XXX is the boundary
between the smallest and second smallest bucket.
The integer style simply creates the name to be the integer value.
This command only works on symbolic attributes. In order to use it on a real column, you would have to first symbolize that column.
This command pays attention to the current set of inputs but ignores the current output. For each possible joint assignment of values given to the set of input attributes, computes the number of records that match the given assignment. Prints out this table (it doesn't bother to print out counts for attribute-value-pair sets with a count of zero). WARNING! THESE TABLES CAN BE MONSTROUS IF THE CURRENT SET OF INPUTS CONTAINS MORE THAN A FEW ATTRIBUTES.
See the mark command for more detail about marks.
size n : The tree returned should have no more than
this number of nodes (including leaf nodes).
iters n : If n==1 do regular decision tree. If n > 1, run
n iterations of a randomized wrapper feature
selection first.
Unmarks a row. See the mark command for more details.