Overview: This assignment gives you the opportunity to experiment with a decision tree learning program. You will work with three data sets: the simple buy_stock set, a set of random data, and a set of records describing loan applications.
Files: All of the files are located here. Files with a .m suffix are matlab code; files with a .dat suffix are data files. Copy all of these files to a working folder (either copy them to andrew and ftp them, or simply load them into netscape and save them), and then add that folder to the path when you start matlab, so that matlab can see the files.
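For example, assuming you saved the files into a folder named decision_trees (the folder name here is just an illustration; use whatever folder you actually created), you could add it to the path at the matlab prompt like this:

    % Make the course .m and .dat files visible to matlab.
    % 'decision_trees' is a hypothetical folder name.
    addpath('decision_trees');
    which id3      % quick check that matlab can now find the code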
Documentation: There is a document describing how to use the files, along with some basic matlab information, here. Read it now or print it out; you will need to refer to it during the assignment.
Does it produce the decision tree shown on page 26 in chapter 3 of the class notes? (Hint: to look at the tree, try disp_tree(tree), using the tree produced by id3).
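For example (a minimal sketch only; the exact arguments for the buy_stock data are an assumption here, following the five-argument calling convention used in the later questions, so check the documentation for the actual usage):

    % Hedged sketch: learn a tree from buy_stock.dat and draw it.
    % The 100/0/0 split and the seed value 1 are illustrative assumptions.
    tree = id3('buy_stock.dat', 100, 0, 0, 1);
    disp_tree(tree);   % draws the learned tree in a figure window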
3. Can you remove a single example from the training set to cause id3 to produce a tree that depends on price_ratio, even though the target concept is independent of price_ratio? If so, write down which example and turn in a printout of the corresponding tree. If not, explain why not. Speculate on what this means in terms of overfitting. Which tree would you prefer?
4. Suppose we want to learn the target concept (price_ratio = low and outlook = strong) or (price_ratio = high and market = bull) or (price_ratio = low and earnings = low).
First, guess a dataset that you think will cause id3 to learn the target concept correctly. Write it down, then create the corresponding datafile and print out the tree it produces. If the tree is not correct, find a dataset that does work and turn it in along with a printout of the correct tree. What was wrong with your first dataset?
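It can help to check your candidate examples against the target concept mechanically. The sketch below is only an illustration: the variables stand in for one row of your datafile, and the attribute value strings are hypothetical encodings (see the documentation for the actual .dat format):

    % Hedged sketch: evaluate the target concept for one hypothetical example.
    price_ratio = 'low'; outlook = 'strong'; market = 'bear'; earnings = 'high';
    positive = (strcmp(price_ratio,'low')  && strcmp(outlook,'strong')) || ...
               (strcmp(price_ratio,'high') && strcmp(market,'bull'))    || ...
               (strcmp(price_ratio,'low')  && strcmp(earnings,'low'));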
6. Run id3 on the credit data set five times, using 50% of the examples for training and 50% for testing, with a different random seed each time (use [tree,p_t,train_acc,test_acc,pruned_train_acc,pruned_test_acc] = id3('credit.dat',50,50,0,r_seed)). Examine the trees. How similar are they? How accurate do the trees seem to be? Turn in the accuracy plots and the random seed for each of the five attempts.
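One way to organize the five runs is a simple loop over seeds (a sketch only; the particular seed values are just an illustration):

    % Hedged sketch: five runs with 50% training / 50% testing and no pruning set.
    seeds = [1 2 3 4 5];          % any five distinct seeds will do
    for i = 1:length(seeds)
        [tree, p_t, train_acc, test_acc, pruned_train_acc, pruned_test_acc] = ...
            id3('credit.dat', 50, 50, 0, seeds(i));
        disp_tree(tree);          % inspect each tree as it is produced
    end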
7. Run id3 on the credit data set five times, using 50% of the examples for training, 25% for testing, and 25% for pruning, with a different random seed each time (use [tree,p_t,train_acc,test_acc,pruned_train_acc,pruned_test_acc] = id3('credit.dat',50,25,25,r_seed)). Examine the trees. How similar are they? Turn in one plot of both the training accuracy and the testing accuracy, and another plot of the training and testing accuracy on the pruned tree. Was overfitting a problem? Did pruning help?
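If the accuracy outputs are vectors (which the requested accuracy plots suggest; this is an assumption about the course code), plots of the kind asked for can be produced along these lines:

    % Hedged sketch: one 50/25/25 run and the two requested plots.
    [tree, p_t, train_acc, test_acc, pruned_train_acc, pruned_test_acc] = ...
        id3('credit.dat', 50, 25, 25, 1);
    figure; plot(train_acc, '-'); hold on; plot(test_acc, '--');
    legend('training accuracy', 'testing accuracy');
    figure; plot(pruned_train_acc, '-'); hold on; plot(pruned_test_acc, '--');
    legend('pruned training accuracy', 'pruned testing accuracy');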
8. Run id3 on the random data set with 50% training and 50% pruning five times with different random seeds (use [tree, pruned_tree] = id3('rand_data.dat',50,0,50,r_seed)). For each case, examine the regular tree and the pruned tree. What do you notice, compared with the trees generated for the credit example? (HINT: if disp_tree gives you a blank graph, the tree is empty.) Why do you think this is happening, considering the random nature of the data?
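A loop like the one suggested for question 6 works here too (a sketch only; seeds 1 through 5 are just an illustration):

    % Hedged sketch: five runs on the random data, 50% training / 50% pruning.
    for r_seed = 1:5
        [tree, pruned_tree] = id3('rand_data.dat', 50, 0, 50, r_seed);
        disp_tree(tree);          % the full tree
        disp_tree(pruned_tree);   % the pruned tree -- a blank graph means it is empty
    end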
9. Run id3 on the random data set ('rand_data.dat') with a variety of parameters (using both test sets and pruning sets). Examine the plots of training accuracy and test accuracy on both the regular and the pruned trees. How could you tell that this data is meaningless, in contrast to the credit data? Turn in at least one appropriate accuracy plot with an explanation.
Build a regression model for the credit task. You may use any regression method and any other details you wish.
Use an experimental setup similar to that of question 7; namely, 25% of the data should be used for testing, and 75% (50% + 25%) should be used for training and for any kind of tuning/validation that the regression method may require. Note that pruning of decision trees is viewed here as a form of training; the only important division is thus the one between training (in its various forms) and testing.
As in question 7, repeat the experiment multiple times (with a different training/testing division each time) and measure the average performance (accuracy). Compare it to that of the decision tree. What did you expect? Were you correct?
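A minimal sketch of one possible setup, using plain least-squares regression on 0/1 labels; it assumes credit.dat loads as a numeric matrix with the class label in the last column, which is an assumption about the file format, so adapt the loading step (and the regression method) as needed:

    % Hedged sketch: 75%/25% split, linear regression, threshold at 0.5.
    data = load('credit.dat');               % assumes a purely numeric file
    n = size(data, 1);
    idx = randperm(n);                       % random training/testing division
    ntrain = round(0.75 * n);
    train = data(idx(1:ntrain), :);  test = data(idx(ntrain+1:end), :);
    Xtr = [ones(ntrain,1) train(:,1:end-1)];  ytr = train(:,end);
    w = Xtr \ ytr;                           % least-squares fit
    Xte = [ones(n-ntrain,1) test(:,1:end-1)];
    pred = (Xte * w) > 0.5;                  % threshold to get class predictions
    accuracy = mean(pred == (test(:,end) > 0.5))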
Extra extra credit: build a regression model (as above) that outperforms the (pruned) decision tree model.