.po 1i
.ll 6.5i
.ce 10
\fB\s16Homework 2: ID3\fP\s0
.sp
.sz 14
CS 395T: Machine Learning
.sp .5
Due: Tuesday, February 28
.sp
.ce 0
.pp
A version of the ID3 decision tree learning system is in the file ID3.
It can be tested on the examples in FIGURE-DATA or the weather examples from
Quinlan's article (in the file WEATHER-DATA).  The function ID3-TEST is
analogous to VS-TEST and can be used to train on a subset of examples and test
on the rest. The current system can build a decision tree that distinguishes
only two categories. To learn to discriminate among multiple categories, a
separate decision tree must be built for each category, distinguishing its
examples from the examples of every other category, using the function
ID3-CATEGORIES (analogous to VS-CATEGORIES).
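The one-tree-per-category scheme can be sketched as follows (in Python rather
than the Lisp used by the actual system; the function and parameter names here
are illustrative, not the system's own):

```python
def one_vs_rest_trees(examples_by_category, learn_tree):
    """Sketch of the one-vs-rest idea behind ID3-CATEGORIES: for each
    category, learn a tree separating its examples (positives) from the
    pooled examples of every other category (negatives).
    `learn_tree(positives, negatives)` stands in for the two-class learner."""
    trees = {}
    for cat, positives in examples_by_category.items():
        negatives = [ex for other, exs in examples_by_category.items()
                     if other != cat
                     for ex in exs]
        trees[cat] = learn_tree(positives, negatives)
    return trees
```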
.sh 1 "System Modifications"
.pp
Hand in a commented version of the code you write to make the following
changes to ID3. For parts 1.1 and 1.2 also hand in a dribble of the new traces
for the examples in FIGURE-DATA (with *trace-id3* set).
.sh 2 "Eliminating NULL Leaves"
.pp
First, modify the system to eliminate "null" leaves using the method Quinlan
suggests: label a leaf that receives an empty set of examples with the most
common class among the examples that reach its parent.
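The intended behavior can be sketched as follows (a Python illustration with
hypothetical helper names; your change goes in the system's Lisp code):

```python
from collections import Counter

def majority_class(examples):
    """Most common class label among a list of (features, label) examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def leaf_label(examples, parent_examples):
    """Choose the label for a leaf.  A "null" leaf -- one that no training
    examples reach -- takes the most common class among the examples at its
    parent, as Quinlan suggests; a non-empty leaf takes its own majority."""
    if not examples:
        return majority_class(parent_examples)
    return majority_class(examples)
```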
.sh 2 "Using the Gain Ratio"
.pp
Change the selection criterion for the splitting feature from the current
"gain" criterion to the "gain ratio" criterion described in section 7 of the
ID3 paper.  Be sure to print out the appropriate values for gain 
and gain ratio if the trace flag is set.
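As a reference for checking your traced values, the two criteria can be
computed as in this sketch (Python rather than the system's Lisp; the example
representation as a dictionary of features is an assumption for illustration):

```python
import math

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain_and_gain_ratio(examples, feature):
    """examples: list of (feature_dict, label) pairs.  Returns
    (gain, gain ratio) for splitting on `feature`, following Quinlan:
    gain ratio = gain / split information, where split information is the
    entropy of the partition sizes induced by the feature's values."""
    labels = [label for _, label in examples]
    n = len(examples)
    # Partition the example labels by the value of the splitting feature.
    partitions = {}
    for feats, label in examples:
        partitions.setdefault(feats[feature], []).append(label)
    # Gain: entropy before the split minus the weighted entropy after it.
    remainder = sum(len(part) / n * entropy(part)
                    for part in partitions.values())
    gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = -sum((len(part) / n) * math.log2(len(part) / n)
                      for part in partitions.values())
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio
```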
.sh 2 "Taking a Better Guess with Multiple Categories"
.pp
When learning a separate decision tree for each category in multi-category
data (like soybean), the learned concepts are overlapping and a new case can
be classified as belonging to a number of categories or to no category at all.
The current grading used in TEST-CATEGORIES counts the system as wrong when
its set of possible diagnoses does not exactly match the single correct
diagnosis.  Assuming we know that each test instance is in exactly one
category, change the testing so that it at least takes a well-informed guess
as to the single category of an example when the learned decision trees assign
it to multiple categories or to no category at all.  Assume (as in the soybean
data) that no category is \fIa priori\fP more likely than any other.
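One possible shape for such a guessing rule is sketched below (a Python
illustration, not the system's code; the assumption that each tree can
report a confidence score -- e.g., the class purity of the leaf the instance
reaches -- is one design choice among several you might make):

```python
def best_guess(instance, trees, categories):
    """trees: dict mapping category -> classifier returning a pair
    (decision, confidence), where decision is True/False and confidence is a
    score in [0, 1] for that decision (hypothetical interface).  With uniform
    priors, always return exactly one category."""
    results = {cat: trees[cat](instance) for cat in categories}
    positives = [cat for cat, (dec, _) in results.items() if dec]
    if len(positives) == 1:
        return positives[0]          # unambiguous: exactly one tree says yes
    if positives:
        # Several trees say yes: guess the most confident positive.
        return max(positives, key=lambda c: results[c][1])
    # No tree says yes: guess the category whose tree was least
    # confident in its negative decision.
    return min(categories, key=lambda c: results[c][1])
```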
.sh 1 "Experiments"
.pp
Test the effect of your changes on the system's performance by running the
modified systems on both the full soybean data (default of 8 training
instances per category) and your personal location concept (training on,
say, 40 of the 66 instances).  Run tests for each of the following cases on
both data sets (of course, the changes in 1.3 have no effect on the
single-concept location data and need not be run there).
.(l
1.  The original ID3 system
2.  With only the changes in 1.1
3.  With only the changes in 1.2
4.  With only the changes in 1.3
5.  With both the changes in 1.1 and 1.3
6.  With all changes (1.1, 1.2 and 1.3)
.)l
Hand in a table showing the correctness results for each case on each
dataset.  Also hand in a dribble (with the trace flag off) of case 6 for
each dataset.  Include a brief comment giving your analysis of the results
and of the quality of the trees learned for your personal concept.