Assignment 3: Decision Tree for Equine Colic Diagnosis

Instructor: Tuomas Sandholm

TAs: Kate Larson and Pat Riley
If you have any questions about the organization of the assignment, email Kate.

Due: November 26, 2002 in class

Maximum points: 100

Jack and Spot


Equine colic is the leading cause of death in adult horses.  However, if diagnosed early enough, it is usually surgically curable.

Using the language of your choice, write a decision tree algorithm that will learn to diagnose whether a patient is healthy or has colic.  Use the horse.train file to train the decision tree.  Each training instance has 16 numeric attributes (features) and a classification, all separated by commas. The attributes correspond to the following measurements made from each patient at admission to the clinic.
  1. K
  2. Na
  3. Cl
  4. HCO3
  5. Endotoxin
  6. Aniongap
  7. PLA2
  8. SDH
  9. GLDH
  10. TPP
  11. Breath rate
  12. PCV
  13. Pulse rate
  14. Fibrinogen
  15. Dimer
  16. FibPerDim

In the decision tree, use only binary tests, i.e. each node should test whether a particular attribute has a value greater than or less than a threshold. In deciding which attribute to test at any point, use the information gain metric (see Russell and Norvig, section 18.4).
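
As an illustration of the metric, here is a minimal Python sketch of the information gain of one binary threshold test. The function names, the list-of-lists data layout, and the choice to send values equal to the threshold to the "less than" branch are assumptions made for this example, not requirements of the assignment.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attr, threshold):
    """Entropy before the test minus the weighted entropy after it.

    rows is a list of attribute-value lists and attr is a 0-based index.
    Assumption: values equal to the threshold go to the left branch.
    """
    left = [y for x, y in zip(rows, labels) if x[attr] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[attr] > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder
```

A test that separates the two classes perfectly recovers the full entropy of the parent node, while a test that leaves both branches as mixed as the parent has a gain of zero.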

Set the node test threshold for each potential attribute using this same metric.  That is, at each node, collect all the values of a particular attribute that occur in the remaining instances, sort them, and try threshold values halfway between each pair of adjacent values.  Use the threshold value that gives the highest information gain.
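
The threshold search described above could be sketched in Python as follows. This is only one possible reading: the function name `best_threshold`, the removal of duplicate values before taking midpoints, and the tie-breaking rule (the first threshold with the highest gain wins) are assumptions of this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, gain) for one attribute, or (None, 0.0).

    values holds that attribute's value for each remaining instance.
    Candidate thresholds are the midpoints between adjacent distinct values.
    """
    base = entropy(labels)
    n = len(labels)
    distinct = sorted(set(values))
    best_t, best_gain = None, 0.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:  # strict comparison keeps the first best threshold
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Running this for every attribute and keeping the attribute with the largest returned gain selects both the attribute and its threshold in one pass.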

Allow the same attribute to be tested again later in the tree (with a different threshold).  This means that along a path from the root to a leaf, the same attribute might be tested multiple times.
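
One way to permit repeated tests is simply never to remove an attribute from the candidate set during recursion, as in the Python sketch below. The dictionary node layout, the function names, and the stopping rule (return a majority-class leaf when no split has positive gain) are assumptions of this example; the assignment does not prescribe them.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_split(rows, labels):
    """Try every attribute and midpoint threshold; return (gain, attr, threshold)."""
    base, n = entropy(labels), len(labels)
    best = (0.0, None, None)
    for attr in range(len(rows[0])):
        distinct = sorted({r[attr] for r in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[attr] <= t]
            right = [y for r, y in zip(rows, labels) if r[attr] > t]
            gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if gain > best[0]:
                best = (gain, attr, t)
    return best


def build_tree(rows, labels):
    """Return a class label (leaf) or a dict describing an internal node."""
    gain, attr, t = best_split(rows, labels)
    if attr is None:
        # Assumed stopping rule: pure node, or no test improves purity.
        return Counter(labels).most_common(1)[0][0]
    left = [(r, y) for r, y in zip(rows, labels) if r[attr] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[attr] > t]
    # Every attribute stays available to both subtrees, so it may be retested.
    return {
        "attr": attr,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }
```

With a single attribute whose class changes twice along its range, the resulting tree tests that attribute at the root and again in a subtree with a different threshold.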

After learning the decision tree, use the horse.test file to test the generalization accuracy of the tree.
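
Counting correct classifications, on either the training or the test instances, might look like the Python sketch below. It assumes a hypothetical tree representation in which a leaf is a bare class label and an internal node is a dictionary with the keys `attr`, `threshold`, `left`, and `right`; your own representation may well differ.

```python
def classify(tree, row):
    """Walk from the root to a leaf and return the leaf's class label.

    Assumption: values less than or equal to the threshold go left.
    """
    while isinstance(tree, dict):
        if row[tree["attr"]] <= tree["threshold"]:
            tree = tree["left"]
        else:
            tree = tree["right"]
    return tree


def count_correct(tree, rows, labels):
    """Number of instances whose predicted label matches the true label."""
    return sum(1 for r, y in zip(rows, labels) if classify(tree, r) == y)


def accuracy(tree, rows, labels):
    """Fraction of instances classified correctly."""
    return count_correct(tree, rows, labels) / len(labels)
```

Calling `count_correct` once with the training instances and once with the test instances gives the two counts requested in the deliverables.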

Deliverables
  1. Printout of the code
  2. Output of the algorithm -- annotated if it is not in an easy to understand form
  3. Picture of the decision tree.  Hand drawn is fine.
  4. How many of the training instances does the tree classify correctly?
  5. How many of the test instances does the tree classify correctly?
  6. Description of how you use the information gain metric.

Format of data files:

We are providing you with data from an actual vet clinic.  Each line in a file is one instance.  The first 16 numbers are the values for the 16 attributes listed above.  The last entry on a line indicates whether the horse was healthy or not.  You are allowed to reformat the data files in whatever way you want to make them easier for your program to read.
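
If you keep the files in their original comma-separated form, reading them could be sketched in Python as below. The function names are hypothetical, and because the exact encoding of the final entry is not described here, this sketch keeps it as an uninterpreted string label; check the actual files before relying on that.

```python
def parse_line(line, n_attributes=16):
    """Split one instance into (list of float attributes, label string)."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != n_attributes + 1:
        raise ValueError(
            "expected %d fields, found %d" % (n_attributes + 1, len(fields))
        )
    # Assumption: the label is kept exactly as written in the file.
    return [float(f) for f in fields[:-1]], fields[-1]


def load_instances(path, n_attributes=16):
    """Read a data file and return (rows, labels), skipping blank lines."""
    rows, labels = [], []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            features, label = parse_line(line, n_attributes)
            rows.append(features)
            labels.append(label)
    return rows, labels
```

For example, `rows, labels = load_instances("horse.train")` would give one list of 16 floats and one label per training instance.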