15-681 Assignment updates

Assignment 3

Question: what does the X-axis on the gnuplots mean? Answer: it depends on which .gnu file you're using for the plot. If you're using "train.gnu" or "re.both.gnu," the X-axis scale is "number of nodes in the decision tree." If you're using "re.prune.gnu," on the other hand, the X-axis is "number of nodes pruned." When pruning, the raw data generated is actually based on the number of nodes pruned, but then test_script turns this around and creates a different data file in which the X axis is the number of nodes *left*. The latter file is the one used by re.both.gnu. (In all plots, the Y-axis is the percentage of correctly classified instances.)

IMPORTANT: it has recently been pointed out that the random number generator used to split the data set into training, testing and pruning sets is screwing things up rather badly. (The training data you're given alternates strictly between YES and NO, and the crappy number generator is splitting the examples rather unevenly as a result.) If you have the time and motivation, you should probably change all calls in args.c from srand() and rand() to srandom() and random(); you'll wind up getting a more accurate picture of how well the algorithm performs on this dataset. (It's not quite as bad as we've been making it out to be; the accuracies look quite a bit better with the better random number generator, and pruning actually looks useful.) You should make sure that any comparisons you make between algorithms are based on the same random number generator, however.
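If you're unsure what the suggested change looks like, here is a minimal sketch. The calls shown are illustrative, not copied from args.c; srandom() and random() are the BSD/POSIX replacements for srand() and rand(), and random() gives much better low-order bits, which matters when you use the remainder to split examples into sets.

	/* sketch.c -- illustrative only; names do not come from args.c.
	 * The change is mechanical: srand() -> srandom(), rand() -> random(). */
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
	    /* was: srand(42); */
	    srandom(42);

	    /* was: int r = rand(); */
	    long r = random();

	    /* random() returns a value in [0, 2^31 - 1], so a split like
	     * (random() % 100) < 70 gives a much more even 70/30 division
	     * than the same expression built on rand() on some systems. */
	    if (r < 0 || r > 2147483647L)
	        return 1;
	    printf("%ld\n", r);
	    return 0;
	}

(On some systems you may need to declare srandom()/random() yourself or define a feature-test macro; check your man pages.)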

The assignment is pretty vague about what you're supposed to hand in. In addition to answering the questions asked in the assignment, you should probably turn in a few of your gnuplots, along with printouts of any code you create or modify for Part 2. The English exposition of what you do for Part 2 probably doesn't need to be more than 2 pages long, but you should include a discussion and some plots comparing your algorithms against the ones supplied. (Try to be reasonably non-handwavy with your comparisons.)

A number of people have been concerned because their pruning set curves are nonexistent for part 1.3 -- the part where the training set is included in the pruning set. When you don't get any curves, it's probably because no pruning is being done. Why isn't the algorithm pruning any nodes in this case? Think about how reduced error pruning works.

To get a hardcopy of your gnuplots: After starting up gnuplot, but before loading "re.both.gnu" or whatever .gnu file you wish to use, do this:

	set terminal postscript
	set output "myfilename.ps"
(substituting whatever filename you wish for "myfilename.ps"). This will generate a postscript file. For more information, type "help terminal postscript" from within gnuplot.

As of 6 pm, Thursday September 19th, numerous memory allocation bugs have been discovered in the original code and subsequently stomped. (The code was core dumping on certain architectures, which led to a search for such bugs.) If you copied the source directory before then, you may want to re-copy it.

The file referred to in the documentation as "tree_script" was named "test_script" in the class directory. To avoid any further confusion, a copy named "tree_script" has been created.

Some people have tried and failed to compile the supplied code: the linker complains that it cannot find -liostream. The makefile has been changed so that it no longer tries to link against this library; this fix seems to work on most systems. If you get error messages of the form "..... not defined" when your code links, you may want to try inserting "-liostream" after "-lm" in Makefile.
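For concreteness, the insertion would look something like the fragment below. This is only a sketch -- the variable names and layout of the actual Makefile in the class directory may differ -- but the idea is just to add -liostream to the end of the link line, after -lm:

	# Illustrative Makefile fragment, not copied from the class directory.
	# If linking fails with "..... not defined", append -liostream:
	LIBS = -lm -liostream

	prog: $(OBJS)
		$(CC) -o prog $(OBJS) $(LIBS)

Library order matters to some linkers, so keep -liostream after the object files and after -lm.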

The old version of the assignment handout stated that the due date was September 24. The due date has been extended to September 26.