Assignment 3: Decision Tree for Equine Colic Diagnosis
Instructor: Tuomas Sandholm
TAs: Kate Larson and Pat Riley
If you have questions about the organization of the assignment, email Kate.
Due: November 26, 2002 in class
Maximum points: 100
Jack and Spot
Equine colic is the leading cause of death in adult horses. However,
if diagnosed early enough, it is usually surgically curable.
Using the language of your choice, write a decision tree algorithm that will
learn to diagnose whether a patient is healthy or has colic. Use the
horse.train file to train the decision tree. Each
training instance has 16 numeric attributes (features) and a classification,
all separated by commas. The attributes correspond to the following measurements
made from each patient at admission to the clinic.
- K
- Na
- Cl
- HCO3
- Endotoxin
- Aniongap
- PLA2
- SDH
- GLDH
- TPP
- Breath rate
- PCV
- Pulse rate
- Fibrinogen
- Dimer
- FibPerDim
In the decision tree, use only binary tests, i.e. each node should test whether
a particular attribute has a value greater or smaller than a threshold. In
deciding which attribute to test at any point, use the information gain metric
(see Russell and Norvig, section 18.4).
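As a rough illustration of the metric, the sketch below computes entropy and the information gain of one binary split. The function names and the use of plain Python lists of labels are assumptions made for this example, not part of the assignment.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, left_labels, right_labels):
    """Parent entropy minus the size-weighted entropy of the two children.

    left_labels and right_labels are the labels of the instances that fall
    on each side of a binary test (hypothetical names for this sketch).
    """
    total = len(labels)
    remainder = ((len(left_labels) / total) * entropy(left_labels)
                 + (len(right_labels) / total) * entropy(right_labels))
    return entropy(labels) - remainder
```

For instance, a test that separates two healthy horses from two colic horses perfectly has a gain of 1 bit, while a test that leaves both sides evenly mixed has a gain of 0.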
Set the node test threshold for each candidate attribute using this same metric.
That is, at each node, collect all the values of a particular attribute that
occur in the remaining instances, sort them, and try threshold values that lie
halfway between adjacent values. Use the threshold value that gives the
highest information gain.
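The threshold search for a single attribute could be sketched as follows. The name best_threshold is hypothetical, and sending values equal to the threshold to the "smaller" side is an assumption of this example; with midpoint thresholds no training value ever equals the threshold, so the choice only matters on unseen data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values()) if total else 0.0

def best_threshold(values, labels):
    """Return (threshold, gain) with the highest information gain.

    values holds one attribute's value per instance, labels the matching
    classes. Returns (None, 0.0) when no threshold has positive gain,
    for example when all values are identical.
    """
    base = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds lie halfway between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best
```

With values 1, 2, 3, 4 and labels healthy, healthy, colic, colic, the candidates are 1.5, 2.5 and 3.5, and 2.5 wins because it separates the two classes completely.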
Allow the same attribute to be tested again later in the tree (with a different
threshold). This means that along a path from
the root to a leaf, the same attribute might be tested multiple times.
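One possible shape for the recursive learner is sketched below. Because every attribute is reconsidered at every node, the same attribute can reappear further down a path with a different threshold. The nested-dict tree layout, the function names, and the stopping rule (stop when the node is pure or no test has positive gain, then predict the majority class) are all assumptions of this sketch; the assignment does not prescribe them.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values()) if total else 0.0

def best_split(rows, labels):
    """Return (gain, attribute_index, threshold) of the best test, or None."""
    base, best = entropy(labels), None
    for attr in range(len(rows[0])):
        distinct = sorted({row[attr] for row in rows})
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2.0
            left = [y for row, y in zip(rows, labels) if row[attr] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[attr] > threshold]
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(labels)
            if best is None or gain > best[0]:
                best = (gain, attr, threshold)
    return best

def build_tree(rows, labels):
    """Grow a tree of binary threshold tests; leaves hold the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels)
    # Assumed stopping rule: pure node, or no test with positive gain.
    if len(set(labels)) == 1 or split is None or split[0] <= 0.0:
        return {"leaf": majority}
    _, attr, threshold = split
    left = [i for i, row in enumerate(rows) if row[attr] <= threshold]
    right = [i for i, row in enumerate(rows) if row[attr] > threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left]),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right]),
    }
```

On a single-attribute toy set with values 1, 2, 3, 4 labelled healthy, colic, colic, healthy, no single threshold separates the classes, so the resulting tree tests attribute 0 twice along one path with two different thresholds.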
After learning the decision tree, use the horse.test
file to test the generalization accuracy of the tree.
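Counting correct classifications might look like the sketch below. It assumes a hypothetical tree stored as nested dicts, where an internal node has the keys "attr", "threshold", "left" and "right" and a leaf has the key "leaf"; your own representation may differ.

```python
def classify(tree, row):
    """Follow the binary threshold tests from the root down to a leaf."""
    while "leaf" not in tree:
        # Assumed convention: values <= threshold go left, larger values right.
        branch = "left" if row[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

def count_correct(tree, rows, labels):
    """Number of instances whose predicted class matches the true class."""
    return sum(classify(tree, row) == label for row, label in zip(rows, labels))

# Hand-built one-node tree, used only to illustrate the calls.
example_tree = {
    "attr": 0,
    "threshold": 2.5,
    "left": {"leaf": "healthy"},
    "right": {"leaf": "colic"},
}
```

Running count_correct on the training instances and then on the test instances gives the two counts asked for in the deliverables; dividing by the number of instances gives the accuracy.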
Deliverables
- Printout of the code
- Output of the algorithm -- annotated if it is not in an easy-to-understand
form
- Picture of the decision tree. Hand drawn is fine.
- How many of the training instances does the tree classify correctly?
- How many of the test instances does the tree classify correctly?
- Description of how you use the information gain metric.
Format of data files:
We are providing you with data from an actual vet clinic. Each line
in a file is one instance. The first 16 numbers are the values for the
16 attributes listed above. The last entry on a line is whether the
horse was healthy or not. You are allowed to reformat the data files
in whatever way makes them easier for your program to read.
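A minimal reader for this format might look like the following sketch. The function name parse_instances is hypothetical, and because the exact text of the class field is not described here, the sketch keeps it as an unmodified string. Silently skipping lines that do not have 17 comma-separated fields is also an assumption; you may prefer to raise an error instead.

```python
def parse_instances(lines, n_attributes=16):
    """Split comma-separated lines into attribute rows and class labels.

    lines is any iterable of strings, such as an open file. Returns
    (rows, labels), where each row is a list of n_attributes floats and
    each label is the last field of the line, kept as a string.
    """
    rows, labels = [], []
    for line in lines:
        fields = [field.strip() for field in line.strip().split(",")]
        if len(fields) != n_attributes + 1:
            continue  # assumed behavior: skip blank or malformed lines
        rows.append([float(field) for field in fields[:n_attributes]])
        labels.append(fields[-1])
    return rows, labels
```

Because an open file is an iterable of lines, the training data could be read by opening horse.train and passing the file object to parse_instances, and likewise for horse.test.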