46838-s99 Assignment 3 Documentation

Matlab

Files ending in .m are functions. So long as they are on the path, they are invoked as follows:

>> [output] = fx(parameter);

Here parameter may be a comma-separated list of parameters, and output may be a list of variables. Once the command has run, the output variables are in memory and can be manipulated further.

Typing

>> output

will simply display the contents of the variable 'output'.
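As a small illustration of this calling convention, the sketch below uses the built-in function max, which is not one of the assignment files and is chosen here only because it returns two outputs:

```matlab
% max is a built-in function, used only to illustrate the
% [outputs] = fx(parameters) calling convention.
scores = [3 7 5];
[largest, position] = max(scores);   % two outputs from one call

% Both variables now sit in memory and can be inspected or reused.
disp(largest)      % the largest element of scores
disp(position)     % the index at which it occurs
doubled = 2 * largest;
```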

How to use the .m files

id3.m

id3.m is used as follows. At the matlab prompt, type

>> [tree, pruned_tree, train_acc, test_acc, pruned_train_acc, pruned_test_acc] = id3(data_file, train_size, test_size, prune_size, r_seed)

The variables in [] will contain the results of the decision tree; the variables in () are the parameters you give it. What each one means is explained below:

results:
tree = decision tree
pruned_tree = decision tree after pruning
train_acc = training accuracy
test_acc = testing accuracy
pruned_train_acc = training set accuracy on pruned tree
pruned_test_acc = test set accuracy on pruned tree

The decision trees are stored in an internal format that is not meant to be read directly; use disp_tree.m (described below) to view them. The accuracy variables list the accuracy, as a fraction, at each number of nodes in the decision tree (so test_acc = [.5 .6 .7 .8] means an accuracy of 50% for a tree with one node, 60% with two nodes, and so on).
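To make the layout of the accuracy variables concrete, here is a minimal sketch that reads off a few quantities from the example vector given above. The variable names other than test_acc are illustrative.

```matlab
% Example vector taken from the text: entry k is the accuracy of the
% tree with k nodes.
test_acc = [.5 .6 .7 .8];

n_sizes = length(test_acc);              % number of tree sizes recorded
[best_acc, best_size] = max(test_acc);   % highest accuracy and where it occurs
final_acc = test_acc(end);               % accuracy of the largest tree

fprintf('best accuracy %.0f%% at %d nodes\n', 100 * best_acc, best_size);
```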

parameters:

data_file = the file containing the data. It will be something like 'credit.dat' -- you must use single quotes when referring to a file.

train_size = % of examples used for training
test_size = % of examples used for testing
prune_size = % of examples used for pruning

r_seed = seed for the random number generator; it should be an integer between 1 and 9999. This 'seeds' the pseudo-random number generator: given the same seed, the computer will produce the same sequence of 'random' numbers. So it's useful to repeat the same experiment with a couple of different seeds, to make sure that the result isn't merely an artifact of a particular sequence of pseudo-random numbers.
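The effect of a seed can be seen directly with the built-in rand function. This is only a sketch of the general idea: how id3.m uses r_seed internally is not documented here, and rand('seed', s) is the older seeding syntax (newer Matlab releases recommend rng(s) instead).

```matlab
% Seeding the generator twice with the same value replays the same sequence.
rand('seed', 10);
first_run = rand(1, 5);

rand('seed', 10);
second_run = rand(1, 5);    % same seed, so the same five numbers

rand('seed', 11);
other_run = rand(1, 5);     % different seed, so a different sequence

disp(isequal(first_run, second_run))
disp(isequal(first_run, other_run))
```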

So, for example, on the first problem you might want to do something like this:

>> [tree, pruned_tree, train_acc, test_acc] = id3('buy_stock.dat', 100, 0, 0, 10)

where train_size = 100
test_size = prune_size = 0
r_seed = 10.

Note also that you can omit variables; if you type

>> [tree] = id3('buy_stock.dat')

the program will automatically use default values for the parameters you leave out, and will not return values for the output variables you leave out.
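Repeating an experiment over several seeds might look like the sketch below. It assumes id3.m and buy_stock.dat are on the path and skips the runs otherwise; the 60/20/20 split and the seed values are placeholders, not values the assignment prescribes.

```matlab
% Placeholder seeds; any integers between 1 and 9999 would do.
seeds = [10 20 30];
final_test_acc = zeros(size(seeds));

if exist('id3', 'file') && exist('buy_stock.dat', 'file')
    for k = 1:length(seeds)
        % 60/20/20 is an illustrative train/test/prune split.
        [tree, pruned_tree, train_acc, test_acc] = ...
            id3('buy_stock.dat', 60, 20, 20, seeds(k));
        final_test_acc(k) = test_acc(end);   % accuracy of the largest tree
    end
    disp(final_test_acc)
else
    disp('id3.m or buy_stock.dat not found on the path')
end
```

If the final accuracies differ widely between seeds, a result from any single run should be treated with caution.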

disp_tree.m

At the matlab prompt, type

>> disp_tree(tree)

to view a tree. Note -- the nodes are labelled by numbers, which correspond to the attribute used for the split; the arcs are also labelled with numbers, which refer to the value of the attribute. The listings of which number means which attribute and value are contained at the end of this file.

One final matlab note. Running the programs may give you a slew of warnings, like 'warning: divide by zero'. You can safely ignore them.

Plotting

You will have to plot accuracy results. Type:

>> plot(test_acc)

to plot them. You should also type 'help plot' to find out more about the plot command. And please label the plots when you turn them in. You can do it in matlab, or just write labels on the plots after you print them out.

Note: every time you type plot, matlab erases the old plot. So to get two plots on one graph, type 'hold on' after the first plot command; this will prevent matlab from erasing the first plot. Typing 'hold off' makes matlab go back to its default mode.
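Putting the plotting commands together, a labelled two-curve plot could be produced roughly as follows. The numbers are illustrative only: test_acc reuses the example vector from the id3.m section and train_acc is a made-up placeholder; in practice both would come from a call to id3.

```matlab
% Illustrative accuracy vectors (not real results).
train_acc = [.6 .75 .9 1];
test_acc  = [.5 .6 .7 .8];

plot(train_acc, '-')        % first curve
hold on                     % keep the first curve when plotting again
plot(test_acc, '--')        % second curve on the same axes
hold off                    % return to the default (erase) behavior

% Label the plot before printing it.
xlabel('number of nodes')
ylabel('accuracy')
title('Training and test accuracy')
legend('training', 'test')
```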

Data Files

There are three data files: buy_stock.dat, rand_data.dat and credit.dat.

They are all in matrix format. Each row represents a single instance and each column an attribute; the value in a column is that attribute's value for the instance. So, a sample 'buy_stock' instance might be

3 3 1 2 0

which means outlook = weak, price_ratio = low, market = bear, earnings = low, buy_stock = no

The only file you will need to worry about is buy_stock.dat; for section 2 you will have to generate your own dataset.

How to modify the buy_stock data set: Copy buy_stock.dat into another file. You can then edit this file, deleting instances by deleting rows, and adding new instances by creating a new row and inserting the appropriate values. Spacing is important!
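One way to check that an edited file is still a well-formed matrix is to load it and look at its size. The sketch below writes a small file from within matlab and reads it back; the file name my_stock.dat is hypothetical and the two rows are illustrative instances, not a copy of buy_stock.dat.

```matlab
% Two illustrative instances, five columns each.
rows = [3 3 1 2 0;
        1 1 2 1 1];

% Write one instance per line, values separated by single spaces.
fid = fopen('my_stock.dat', 'w');
fprintf(fid, '%d %d %d %d %d\n', rows');
fclose(fid);

% Read the file back; load fails if the rows do not all have the
% same number of columns.
data = load('-ascii', 'my_stock.dat');
disp(size(data))            % number of rows, number of columns
```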

Here are the appropriate mappings from the numbers in the matrices.

buy_stock.dat

column 1, outlook
1 = strong
2 = moderate
3 = weak

column 2, price_ratio
1 = high
2 = medium
3 = low

column 3, market
1 = bear
2 = bull

column 4, earnings
1 = high
2 = low

column 5, buy stock?
0 = no
1 = yes
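The mappings above can be turned into lookup tables for decoding a row by hand. This is a minimal sketch; the cell-array names simply mirror the column names and are not part of the assignment files.

```matlab
% Lookup tables copied from the buy_stock.dat mappings.
outlook     = {'strong', 'moderate', 'weak'};
price_ratio = {'high', 'medium', 'low'};
market      = {'bear', 'bull'};
earnings    = {'high', 'low'};
buy_stock   = {'no', 'yes'};     % class values are 0/1, so index with +1

row = [3 3 1 2 0];               % the sample instance from the text

fprintf('outlook = %s, price_ratio = %s, market = %s, earnings = %s, buy_stock = %s\n', ...
    outlook{row(1)}, price_ratio{row(2)}, market{row(3)}, ...
    earnings{row(4)}, buy_stock{row(5) + 1});
```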

rand_data.dat

200 randomly selected examples, each with 10 binary characteristics.

credit.dat

1: checking account
1 = none
2 = overdraw
3 = 0-to-200
4 = x>200

2: duration (in months)
1 = 12
2 = 24
3 = 36
4 = 48

3: credit history
1 = paid-back-all/no history
2 = paid-back-here
3 = paid-back-previously
4 = delays
5 = critical

4: purpose
1 = new-car
2 = used-car
3 = furniture
4 = tv
5 = appliance
6 = repairs
7 = education
8 = retraining
9 = other
10 = business

5: credit amount
1 = 0-to-2000
2 = 2000-to-4000
3 = 4000-to-6000
4 = 6000-to-8000
5 = 8000-to-10000
6 = 10000-to-12000

6: savings account
1 = none
2 = below-100
3 = 100-to-500
4 = 500-to-1000
5 = x>1000

7: employment
1 = unemployed
2 = 1-to-4yr
3 = 4-to-7yr
4 = x>7yr
5 = less-than-1yr

8: installment rate
1 = 1
2 = 2
3 = 3
4 = 4

9: marital status
1 = divorced-male
2 = married-or-divorced-female
3 = single-male
4 = married-male

10: other guarantors
1 = none
2 = co-applicant
3 = guarantor

11: present residence since (1= shorter, 4 = longer)
1 = 1
2 = 2
3 = 3
4 = 4

12: property
1 = real-estate
2 = life-insurance
3 = car
4 = unknown

13: age
1 = 0-to-25
2 = 26-to-50
3 = 51-to-75

14: installment plans
1 = none
2 = bank
3 = stores

15: housing
1 = rent
2 = own-home
3 = for-free

16: existing credits
1 = 0-to-25
2 = 26-to-50
3 = 51-to-75

17: job
1 = unskilled-resident
2 = skilled-employee
3 = highly-skilled-employee

18: number of people liable to provide maintenance for

19: telephone
1 = none
2 = yes

20: foreign
1 = no
2 = yes

21: acceptable?
0 = no
1 = yes

Rosie Jones

Last modified: Mon Mar 29 09:39:07 EST 1999