.po 1i
.ll 6.5i
.nr ps 11
.nr pp 11
.ce 10
\fB\s16Homework 2: Multiple Concept ID3\fP\s0
.sp
.sz 14
CS 395T: Machine Learning
.sp .5
Due: Thursday, February 21
.sp
.ce 0
.pp
A version of the ID3 decision tree learning system is in the file ID3.  It can
be tested on the examples in FIGURE-DATA or the weather example from Quinlan's
article (in the file WEATHER-DATA).  The current system can build a decision
tree that distinguishes only two categories.  To discriminate among multiple
categories, a separate decision tree must currently be built to distinguish
the examples of each category from the examples of every other category, using
the function TRAIN-MULTI-ID3 (which resolves multiple matching diagnoses by
picking the matched category with the most examples).  Your assignment is to
change the system so that it can learn a \fIsingle\fP decision tree for
distinguishing instances of multiple categories.
.pp
Write a function TRAIN-SINGLE-ID3 which directly accepts a list of
multi-category examples of the form (category-name (value1 value2 ...)).
Assume the information needed to distinguish N categories is:
.EQ
I(k sub 1 , k sub 2 , ... , k sub N ) ~=~ - sum from j=1 to N
{ {k sub j} over S } ~ log sub 2 left ( {k sub j} over S right )
.EN
where k\*<j\*> is the number of examples in the jth category and S is the
total number of examples, i.e.,
.EQ
S = sum from j=1 to N k sub j
.EN
Analogously, assume the expected information required for the tree with A
as root is:
.EQ
E(A) ~=~ sum from i=1 to V { { sum from j=1 to N k sub ji } over S } ~
I ( k sub 1i , k sub 2i , ... , k sub Ni )
.EN
where k\*<ji\*> is the number of examples in the jth class with the ith
value for attribute A.  The functions EXPECTED-INFO and INFO will have to
be changed to handle multiple categories.  A couple of small additional
functions may also be needed.
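.pp
To make the arithmetic concrete, here is a sketch in Python (not the course's
Lisp code) of the computations the modified INFO and EXPECTED-INFO must
perform; the names \fCinfo\fP and \fCexpected_info\fP are illustrative:

```python
import math

def info(counts):
    """I(k1, ..., kN): bits needed to identify the category of one
    example, given the list of per-category example counts."""
    s = sum(counts)
    # Categories with zero examples contribute nothing (0 log 0 = 0).
    return -sum((k / s) * math.log2(k / s) for k in counts if k > 0)

def expected_info(partition):
    """E(A): the info() of each subset produced by splitting on
    attribute A, weighted by the fraction of examples in that subset.
    `partition` maps each value i of A to its per-category counts
    [k_1i, ..., k_Ni]."""
    s = sum(sum(counts) for counts in partition.values())
    return sum((sum(counts) / s) * info(counts)
               for counts in partition.values())
```

With two categories this reduces to the familiar case: for Quinlan's weather
data, \fCinfo([9, 5])\fP is about 0.940 bits, and splitting on outlook gives
\fCexpected_info({'sunny': [2, 3], 'overcast': [4, 0], 'rain': [3, 2]})\fP of
about 0.694 bits.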
.pp
The last paragraph in Quinlan's article makes an unsubstantiated claim that
learning multiple decision trees (i.e. one for each category) may be better
than learning one decision tree for all of the categories.  You will test this
claim empirically on the full soybean data set in the file SOYBEAN-DATA.
Use TRAIN-AND-TEST to compare MULTI-ID3 and SINGLE-ID3 for different numbers
of training instances (e.g. 10, 30, 60, 150, and 200).  Run several splits
for each training-set size so you can be sure of your results.
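.pp
The experimental protocol amounts to averaging test accuracy over several
random splits at each training-set size.  A Python sketch (again, not the
course's Lisp code; \fClearning_curve\fP and its \fCtrain_and_test\fP
parameter are illustrative stand-ins for TRAIN-AND-TEST):

```python
import random

def learning_curve(examples, train_sizes, trials, train_and_test):
    """For each training-set size n, average test accuracy over
    several random train/test splits.  `train_and_test(train, test)`
    must train on `train`, test on `test`, and return an accuracy
    in [0, 1]."""
    curve = {}
    for n in train_sizes:
        accuracies = []
        for _ in range(trials):
            shuffled = examples[:]
            random.shuffle(shuffled)
            # First n examples train the tree; the rest test it.
            accuracies.append(train_and_test(shuffled[:n], shuffled[n:]))
        curve[n] = sum(accuracies) / trials
    return curve
```

Plotting \fCcurve[n]\fP against n for each of the two systems yields the
learning curves to hand in.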
.pp
Hand in your commented code, graphs of your learning curves, and a dribbled run
of your system on the full soybean data set.  This time, do not turn on the
*TRACE-ID3* flag for your dribble file; just make sure to include a printout of
the constructed tree and a test run of this tree.  Comment on the difference in
accuracy and training time between the two systems.  If you have an explanation
for the results, please include it; this time, I don't really have a good
explanation.

