.po 1i
.ll 6.5i
.ce 10
\fB\s16Homework 4: UNIMEM\fP\s0
.sp
.sz 14
CS 395T: Machine Learning
.sp .5
Due: Thursday, March 30
.pp
An unfinished Common Lisp implementation of the UNIMEM system for incremental
concept formation is in the file UNIMEM. It can be run on the simple data sets
in UNIMEM-TEST-DATA, on the four-disease soybean data in SOYBEAN-UNIMEM-DATA,
or on the class location data in LOCATION-UNIMEM-DATA. The soybean data was
originally used to test the CLUSTER/2 system, as discussed in class. The code
and data are commented and provide the information needed to use them.
.sh 1 "System Modifications"
.pp
Hand in a commented version of the code you write to make the following
changes to UNIMEM.
.sh 2 "Evaluating Generalizations"
.pp
As discussed in class, Lebowitz's UNIMEM system uses statistics to decide
which features of a generated concept are spurious and which are significant.
Add this capability to the provided UNIMEM system.  First complete the
function EVALUATE-CONCEPT-FEATURES. When the trace flag is set, your code
should report when a feature is discarded and when one is frozen. Use the
relevant global variables defined in the code to decide when to make these
changes. Next, complete the function REMOVE-CONCEPT?
to delete the entire concept when the fraction of remaining features
falls below *keep-concept-threshold*. Concept deletion should also be
reported.  In your "dribble" output, demonstrate the correct operation of your
enhancements on one of the simple examples in DISCOVERY-DATA.  You may alter
the appropriate global variables as necessary to force deletions.
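.pp
The decisions described above can be pictured with the small sketch below. It
is only an illustration and does not reflect the data structures in the
provided code: it assumes each feature carries a numeric confidence score, and
the names CLASSIFY-FEATURE, REMOVE-CONCEPT-P, *discard-threshold*, and
*freeze-threshold* are hypothetical. Only *keep-concept-threshold* is named in
this handout, and the value given to it here is made up.
.(l
```lisp
;; Sketch only. Every name except *keep-concept-threshold* is hypothetical,
;; and all threshold values are assumptions chosen for illustration.
(defvar *discard-threshold* -2)        ; assumed: at or below this, discard
(defvar *freeze-threshold* 3)          ; assumed: at or above this, freeze
(defvar *keep-concept-threshold* 0.5)  ; DEFVAR leaves an existing value alone

(defun classify-feature (confidence)
  "Return :DISCARD, :FREEZE, or :KEEP for a feature confidence score."
  (cond ((<= confidence *discard-threshold*) :discard)
        ((>= confidence *freeze-threshold*) :freeze)
        (t :keep)))

(defun remove-concept-p (original-count remaining-count)
  "True when the fraction of remaining features is below the threshold.
ORIGINAL-COUNT is assumed to be positive."
  (< (/ remaining-count original-count) *keep-concept-threshold*))
```
.)l
.pp
Under these assumed settings, (classify-feature 3) returns :FREEZE, and
(remove-concept-p 4 1) is true because 1/4 is below 0.5, while
(remove-concept-p 4 2) is false because 2/4 is not below 0.5.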
.sh 2 "Structured Features"
.pp
The code for handling structured features is missing.  Complete the function
PARTIALLY-SCORE-STRUCTURED-FEATURE in order to allow "pragmatic" matching of
structured features.  When partial matches are allowed, the construction of
the features of a new concept is somewhat more complicated.  When two
feature/value pairs only partially match, an "average" must be used in the
concept definition.  Complete the function COMBINE-FEATURE-VALUES so that it
handles structured variables. Use the data in the file UNIMEM-STRUCTURED-DATA
to test this addition and hand in a traced dribble of this example with
parameters set to get a good clustering.
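.pp
As a rough illustration of partial matching and "averaging", the sketch below
assumes that a structured value is a path through a value hierarchy, written
as a list from the most general term to the most specific one. That
representation is an assumption made for this example; check
UNIMEM-STRUCTURED-DATA for the format actually used. The functions
COMMON-PREFIX and PARTIAL-SCORE are hypothetical helpers, not part of the
provided code. Under this assumption, the score is the shared fraction of the
longer path, and the "average" of two values is their common prefix.
.(l
```lisp
;; Sketch only: assumes structured values are paths such as
;; (shape polygon square). Both function names are hypothetical.
(defun common-prefix (a b)
  "Longest common leading sublist of the lists A and B."
  (loop for x in a
        for y in b
        while (eql x y)
        collect x))

(defun partial-score (a b)
  "Fraction of the longer path that A and B share as a prefix, from 0 to 1."
  (let ((longest (max (length a) (length b))))
    (if (zerop longest)
        1
        (/ (length (common-prefix a b)) longest))))
```
.)l
.pp
For the values (shape polygon square) and (shape polygon triangle), the common
prefix is (shape polygon) and the score is 2/3.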
.sh 1 Experiments
.pp
Try the following tests of the system and hand in some of your best results.
.sh 2 "Soybean Data"
.pp
Use the resulting system to cluster the soybean data in SOYBEAN-UNIMEM-DATA.
Observe the trace and adjust parameters to improve the quality of the
clustering (the degree to which it separates the instances of the four
diseases into disjoint categories). The function HIERARCHY-INSTANCES is useful
for judging the quality of the results: it forms a nested list structure
that shows how the instances are grouped. The best clustering I
obtained had a lone instance stored at the root and four top-level branches
that divided the instances into the four diseases. Try to beat that!
Remember that UNIMEM is order dependent and randomizes instance ordering
before each run. Hand in a printout of the HIERARCHY-INSTANCES result for your
best run along with the parameter settings used. Also include comments
that may explain your results.
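.pp
One possible way to put a number on the quality of a clustering is sketched
below. It assumes that the nested list holds instances as atoms and that you
supply a function mapping each instance to its disease; the actual output
format of HIERARCHY-INSTANCES should be checked before relying on this. The
functions LEAVES and PURITY are hypothetical helpers, not part of the provided
code.
.(l
```lisp
;; Sketch only: assumes instances appear as atoms in a proper nested list.
(defun leaves (tree)
  "All atoms in the nested list TREE, left to right."
  (cond ((null tree) nil)
        ((atom tree) (list tree))
        (t (mapcan #'leaves tree))))

(defun purity (branch label-fn)
  "Fraction of instances in BRANCH that carry its most common label.
LABEL-FN maps an instance to its class label. Returns 0 for an empty branch."
  (let* ((tags (mapcar label-fn (leaves branch)))
         (counts (mapcar (lambda (tag) (count tag tags))
                         (remove-duplicates tags))))
    (if tags
        (/ (reduce #'max counts) (length tags))
        0)))
```
.)l
.pp
A branch whose purity is 1 contains instances of a single disease. Applying
PURITY to each top-level branch gives a quick summary to report alongside the
parameter settings.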
.sh 2 "Location Data"
.pp
Use the resulting system to cluster the locations in LOCATION-UNIMEM-DATA.
Adjust parameters to get some reasonable clustering of cities.  Hand in a
printout of the HIERARCHY-INSTANCES result for your best run along with the
parameter settings used. Also include comments that may explain your
results.

