.po 1i
.ll 6.5i
.ce 10
\fB\s16Homework 1: Version Space Algorithm\fP\s0
.sp
.sz 14
CS 395T: Machine Learning
.sp .5
Due: Tuesday, February 14
.sp
.ce 0
.pp
A Common Lisp implementation of the version space algorithm is in the file
VERSION-SPACE.  The top-level function is called VERSION-SPACE and takes a
list of examples.  A very simple data file for this system is FIGURE-DATA,
which contains several examples that were discussed in class.  The code and
data are commented and provide the information needed to use them.
This assignment has two parts.  In the first part, you will run the system on
your personal location concept; in the second part, you will extend the
system to support linear features and test it on soybean data.
.uh "Part 1: Personal Location Concept"
.pp
Run the system on your personal location dataset.  Use the function VS-TEST to
run and test on various subsets of the entire dataset.  Turning on the flag
*print-with-feature-names* should help you interpret the resulting
generalizations.  Try to find a relatively large subset of the
examples to train on for which VS produces a non-null result.
Hand in a dribble file for the output of a train and test for such a
"successful" run (leave *trace-vs* off for this run).  Include a brief comment
on your evaluation of the results.
.uh "Part 2: Adding Linear Features"
.pp
The current system only supports simple nominal feature vectors. Add the
ability to support linear features (while maintaining all of the current
abilities).  You will need to edit and redefine the following functions:
MATCH, MORE-GENERAL?, INITIALIZE-G, GENERALIZATIONS-TO, and
SPECIALIZATIONS-AGAINST.  Additional functions may also be necessary. The
entry in *domains* for a linear feature should be of the form: (LINEAR
lower-bound upper-bound).  The value of the feature is assumed to be a number
between the lower and upper bounds, inclusive.  A bound may be NIL,
indicating that there is no such bound on the value.  In the feature vector for a
generalization, the value of a linear variable can be any of the following
forms where N is always a number: N, ?, (> N), (>= N), (< N), (<= N).  The
forms N and ? mean the same as for nominal features, the form (> N) matches a
value V of an instance iff V > N, and similarly for the other forms.  Any
generalization formed should take into account the upper and lower bounds (if
any) for the feature.  For example, if U is the upper bound, then (<= U)
should simply be ?, and (> U) is not a possible value for a generalization.
Formulate the generalization and specialization functions carefully: unlike
in the existing language, S may come to contain more than one
generalization.  Test the system on the provided examples in
the file FIGURE-LINEAR-DATA and hand in your commented code and a dribble file
for the examples.  For your dribble file, make sure you set the variable
*TRACE-VS* to T so that the current S and G sets are printed out after
processing each example.
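.pp
As an illustration of the matching semantics only, the sketch below shows one
way the linear value forms could be tested against an instance value and
simplified against the domain bounds.  The names LINEAR-VALUE-MATCHES-P and
NORMALIZE-LINEAR-VALUE, their argument lists, and the :IMPOSSIBLE marker are
assumptions made for this example; they do not appear in VERSION-SPACE, and
the sketch does not replace the required changes to MATCH and the other
functions.
.(l
;;; Hypothetical helpers: names and argument lists are illustrative only
;;; and are not taken from the VERSION-SPACE file.

(defun linear-value-matches-p (pattern value)
  "True if the number VALUE satisfies PATTERN, which is one of
N, ?, (> N), (>= N), (< N), or (<= N)."
  (cond ((eq pattern '?) t)
        ((numberp pattern) (= pattern value))
        ((consp pattern)
         (let ((n (second pattern)))
           (ecase (first pattern)
             (>  (> value n))
             (>= (>= value n))
             (<  (< value n))
             (<= (<= value n)))))
        (t nil)))

(defun normalize-linear-value (pattern lower upper)
  "Simplify PATTERN against the bounds LOWER and UPPER (numbers or NIL).
Returns ? if PATTERN covers the whole domain, :IMPOSSIBLE if it can
match no legal value, and PATTERN itself otherwise."
  (if (atom pattern)
      pattern                           ; N and ? are left unchanged
      (destructuring-bind (op n) pattern
        (let ((all  (ecase op
                      (<= (and upper (>= n upper)))
                      (<  (and upper (>  n upper)))
                      (>= (and lower (<= n lower)))
                      (>  (and lower (<  n lower)))))
              (none (ecase op
                      (<= (and lower (<  n lower)))
                      (<  (and lower (<= n lower)))
                      (>= (and upper (>  n upper)))
                      (>  (and upper (>= n upper))))))
          (cond (all '?)
                (none :impossible)
                (t pattern))))))
.)l
.pp
Under these assumptions, with bounds 0 and 10 the form (<= 10) simplifies to
?, the form (> 10) is reported as :IMPOSSIBLE, and (> 3) is returned
unchanged.  Keeping the bound checks in one helper is only one possible
design; the same checks could instead be made directly inside
GENERALIZATIONS-TO and SPECIALIZATIONS-AGAINST.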
.pp
Also test your linear-feature code on the soybean disease data.  The complete
soybean disease dataset (17 diseases, 50 features, 289 examples) is in the
file SOYBEAN-DATA and is too large to run on this system.  The file
SOYBEAN-RDATA is a dataset containing descriptions of 17 examples for each of
four soybean diseases using only 32 features (the "R" is for "reduced").  The
function VS-CATEGORIES can be used to learn concepts for multiple categories
and the variable SOYBEAN-CATEGORIES is a list of the four soybean categories.
Just call (VS-CATEGORIES SOYBEAN-CATEGORIES) to make the system learn
descriptions for each of these categories.  The instances for each category
are initially divided into 8 instances for learning and 9 instances for
testing. The function SEPARATE-INSTANCES can be used to change the number of
learning and testing instances and the function TEST-CATEGORIES can be used to
test the performance of the learned descriptions on the test instances.  The
features (TIME-OF-OCCURRENCE PRECIPITATION TEMPERATURE CROPPING-HISTORY
SEVERITY PLANT-HEIGHT LEAF-SPOT-SIZE) (feature #'s 1, 2, 3, 4, 6, 11) are
actually discrete linear features but are currently treated as
nominal.  Change them to linear features, rerun the system with your new
code, and compare its efficiency and accuracy with the run in which they were
treated as nominal.  Hand in a dribble (trace flag off) of a run and a
test with linear features and include any brief explanations you have for the
results.


