.po 1i
.ll 6.5i
.nr ps 11
.nr pp 11
.ce 10
\fB\s16Homework 4: UNIMEM\fP\s0
.sp
.sz 14
CS 395T: Machine Learning
.sp .5
Due: Thursday, March 30
.sp
.ce 0
.uh "Part 1: UNIMEM"
.pp
An unfinished Common Lisp implementation of the UNIMEM system for
incremental concept formation is in the file UNIMEM. It can be run
on the simple data sets in DISCOVERY-DATA or on the four-disease soybean
data in UNIMEM-SOYBEAN-DATA. These data were originally used to test
the CLUSTER/2 system, as discussed in class. The code and data are
commented and provide the information needed to use them.
.uh "Part 1a: Evaluating Generalizations in UNIMEM"
.pp
As discussed in class, Lebowitz's UNIMEM system uses statistics to decide
which features of a generated concept are spurious and which are significant.
Add this capability to the provided UNIMEM system.  First complete the
function EVALUATE-CONCEPT-FEATURES. When the trace flag is set, your code
should report when a feature is discarded and when one is frozen. Use the
relevant global variables defined in the code to decide when to make these
changes. Next, complete the function REMOVE-CONCEPT?
to delete the entire concept when the fraction of remaining features
falls below *keep-concept-threshold*. Concept deletion should also be
reported.  In your "dribble" output, demonstrate the correct operation of your
enhancements on one of the simple examples in DISCOVERY-DATA.  You may alter
the appropriate global variables as necessary to force deletions.
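.pp
The following is only a rough sketch of the bookkeeping described above, written
in Python rather than Common Lisp so that it is not mistaken for the expected
solution. It assumes that each feature carries a running confidence score that
is raised by matching instances and lowered by contradicting ones; the names
DISCARD_THRESHOLD, FREEZE_THRESHOLD, and the Concept class are hypothetical
stand-ins, not the variables or structures actually defined in the UNIMEM file.
.(l

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the global variables defined in the UNIMEM file.
DISCARD_THRESHOLD = -2        # assumed: drop a feature at or below this score
FREEZE_THRESHOLD = 3          # assumed: freeze a feature at or above this score
KEEP_CONCEPT_THRESHOLD = 0.5  # plays the role of *keep-concept-threshold*


@dataclass
class Concept:
    scores: dict          # feature -> running confidence score
    original_count: int   # number of features when the concept was created
    frozen: set = field(default_factory=set)


def evaluate_concept_features(concept, trace=False):
    """Discard low-scoring features and freeze high-scoring ones."""
    for feature, score in list(concept.scores.items()):
        if feature in concept.frozen:
            continue  # frozen features are never re-evaluated
        if score <= DISCARD_THRESHOLD:
            del concept.scores[feature]
            if trace:
                print(f"Discarding feature {feature} (score {score})")
        elif score >= FREEZE_THRESHOLD:
            concept.frozen.add(feature)
            if trace:
                print(f"Freezing feature {feature} (score {score})")


def remove_concept(concept, trace=False):
    """Return True when too few of the original features remain."""
    fraction = len(concept.scores) / concept.original_count
    doomed = fraction < KEEP_CONCEPT_THRESHOLD
    if doomed and trace:
        print(f"Deleting concept ({fraction:.2f} of its features remain)")
    return doomed
```

.)l
.pp
Lowering the keep threshold or raising the discard threshold in such a sketch
forces deletions quickly, which is the kind of adjustment the dribble
demonstration may need.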
.uh "Part 1b: Clustering the Soybean Data"
.pp
Use the resulting system to cluster the soybean data in UNIMEM-SOYBEAN-DATA.
Observe the trace and adjust parameters to improve the quality of the
clustering (the degree to which it separates the instances of the four
diseases into disjoint categories). The function HIERARCHY-INSTANCES is useful
for judging the quality of the results: it builds a nested list structure
that shows how the instances are grouped. The best clustering I
obtained had a lone instance stored at the root and four top-level branches
that divided the instances into the four diseases. Try to beat that!
Remember that UNIMEM is order dependent and randomizes instance ordering
before each run. Hand in a printout of the HIERARCHY-INSTANCES result for your
best run along with the parameter settings used. Also include comments
that might explain the results you got.
.uh "Part 2: EXPLORER (a minimal AM)"
.pp
A simple system for exploring concepts using general-to-specific agenda-based
best-first search guided by interestingness is in the file EXPLORER. It uses
an attribute-value representation and can be run on the examples in
DISCOVERY-DATA, though the results are not very interesting.  It currently just looks for
concepts (conjunctions of features) that most evenly divide the instances
into two groups (i.e., such concepts are judged to be more "interesting").  The
code and data are commented and provide the information needed to use them.
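.pp
As an illustration of what "most evenly divide" could mean, the Python sketch
below scores a concept by how close its coverage is to half of the instances.
The formula and the function names are assumptions made for this example;
the measure actually used in EXPLORER may differ.
.(l

```python
def matches(concept, instance):
    """A concept is a conjunction of attribute-value pairs held in a dict."""
    return all(instance.get(attr) == val for attr, val in concept.items())


def interestingness(concept, instances):
    """Score in [0, 1]: 1.0 for an even split, 0.0 when all or none match."""
    if not instances:
        return 0.0
    covered = sum(1 for inst in instances if matches(concept, inst))
    p = covered / len(instances)
    return 1.0 - abs(2.0 * p - 1.0)
```

.)l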
.pp
Your task is to add the ability to make empirical conjectures that a
particular concept is either equivalent to, a generalization of, or a
specialization of another explored concept, based on set-theoretic
relationships found between the examples of these concepts.  This should be done
by completing the function GENERATE-CONJECTURES.  Always report each
conjecture that is made and store it under both
of the concepts involved. The simple geometric data in POLYGON-DATA should be
used to test the making of these conjectures.  Your system should be able to
"discover" that the sum of all angles is always 180 for a triangle and 360 for
a square, among many other relationships.  Be careful to avoid conjecturing
obvious relationships that must hold because of concept definitions, for
example that the concept RED & SQUARE is a specialization of the concept RED.
In addition to your commented code, hand in a dribble showing a number of
conjectures of each type that your system makes when run on the POLYGON-DATA.
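.pp
The set-theoretic comparison described above can be sketched as follows, in
Python rather than Common Lisp. This is a minimal illustration under stated
assumptions: concepts are dicts of attribute-value pairs, a relationship counts
as definitional when one concept's features are a subset of the other's, and
concepts with no examples are skipped. The function names and the sample
attributes are hypothetical and are not taken from EXPLORER or POLYGON-DATA.
.(l

```python
def examples_of(concept, instances):
    """Indices of the instances that satisfy every feature of the concept."""
    return frozenset(
        i for i, inst in enumerate(instances)
        if all(inst.get(attr) == val for attr, val in concept.items())
    )


def definitional(a, b):
    """True when one concept's features are a subset of the other's."""
    return a.items() <= b.items() or b.items() <= a.items()


def conjecture(a, b, instances):
    """Describe how concept a relates to concept b, or return None."""
    if definitional(a, b):
        return None  # e.g. RED & SQUARE versus RED: true by definition
    ex_a, ex_b = examples_of(a, instances), examples_of(b, instances)
    if not ex_a or not ex_b:
        return None  # assumed policy: no conjectures from empty example sets
    if ex_a == ex_b:
        return "equivalent"
    if ex_a < ex_b:
        return "specialization"
    if ex_a > ex_b:
        return "generalization"
    return None  # overlapping or disjoint example sets: nothing to report
```

.)l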
