- Problem 9.3 from the textbook
- Problem 9.4 from the textbook
- The next problems involve applying genetic algorithms and
hillclimbing to currency data. Five files are necessary:
- yen_returns.dat
- yen_series.dat
- dm_returns.dat
- dm_series.dat
- ga.m
which you should download from here.
Put these files in a directory and start up matlab. First, load in
the .dat files (type 'load yen_returns.dat' etc. at the matlab prompt). These
files contain 4 variables:
- 'dm_series', which contains mark/dollar exchange rate
daily closes and precomputed moving averages
- 'dm_returns', which contains daily returns for holding
marks (the change in the exchange rate plus the interest rate
differential).
- 'yen_series': like dm_series, but for yen/dollar
- 'yen_returns': like dm_returns, but for yen/dollar
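For example, the whole loading step might look like this at the
matlab prompt (a minimal sketch; matlab names each loaded variable
after its file):

    load yen_returns.dat
    load yen_series.dat
    load dm_returns.dat
    load dm_series.dat
    % Sanity check: list the loaded variables and their sizes.
    whos dm_series dm_returns yen_series yen_returns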
ga.m is the genetic algorithm program, and is run by the
following command at the matlab prompt:
[train_returns, test_returns] = ga(series, returns, training_proportion,
tree_size, population_size, generations)
Where:
- train_returns: array of values containing returns over
training set, one for each generation
- test_returns: array of values containing returns over
testing set, one for each generation
- series: either dm_series, or yen_series
- returns: either dm_returns, or yen_returns
- training_proportion: number in [0,1] giving the
proportion of the data to use as the training set
- tree_size: integer (for this assignment, either 1 or
2), giving the max depth of the tree
- population_size: integer giving the population size
(if you use a population size of 1, the program runs a
hillclimbing algorithm instead)
- generations: number of generations for ga to run
When run, ga plots train_returns in blue and
test_returns in red.
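For instance, a single run and a look at its final test fitness might
look like this (a sketch; ga.m produces the plot itself):

    % One GA run: mark/dollar data, half the data for training,
    % max tree depth 2, population 10, 50 generations.
    [train_returns, test_returns] = ga(dm_series, dm_returns, .5, 2, 10, 50);
    % The last entry is the test-set return after the final generation.
    final_test_fitness = test_returns(end)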
Questions:
- Run ga.m ten times for each of the following parameter sets:
- [train_returns, test_returns] = ga(dm_series,
dm_returns, .5, 2, 10, 50)
- [train_returns, test_returns] = ga(dm_series, dm_returns, .5,
2, 1, 500)
(Note that matlab plots will be generated automatically as the
code executes.)
The first is a genetic algorithm with a max tree size of
2, population of 10, and 50 generations. The second is a
hillclimbing algorithm run for 500 generations.
- Using 50 generations for the GA and 500 generations for
the hillclimbing algorithm makes this a fair comparison. Why?
- Record the mean and variance of the final test fitness
produced by each approach over the ten trials (see the
sketch after this question for one way to collect these
values). Can you explain the difference in variance? Which
do you prefer?
- (Optional) Do the results lead you to suggest a
modification to the hillclimbing algorithm? If so, what?
- Would there be an advantage to having a population size of
100? A disadvantage? Run the algorithm as above with a
population size of 100, and see if your predictions are correct.
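One way to collect the ten trials for the mean/variance question
above (a sketch, assuming the final test fitness is the last entry of
test_returns):

    % Ten GA trials; record the final test fitness from each run.
    finals = zeros(1, 10);
    for trial = 1:10
      [train_returns, test_returns] = ga(dm_series, dm_returns, .5, 2, 10, 50);
      finals(trial) = test_returns(end);
    end
    % Mean and variance over the ten trials.
    mean(finals)
    var(finals)

For the hillclimbing condition, swap in the
ga(dm_series, dm_returns, .5, 2, 1, 500) call inside the loop.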
- Run ga.m ten times with the following parameter set:
- [train_returns, test_returns] = ga(dm_series,
dm_returns, .1, 2, 10, 50)
This is the same as the genetic algorithm of the previous
question, except the proportion of data used for training is
only 10%.
- Record the mean and variance of the final
test fitness produced over the ten trials. Compare to the
results from the GA in the previous question, and propose an
explanation for what is causing the differences in terms of
overfitting. (hint: what would happen if you set
training_proportion to .01?)
- There is a potential confound in comparing results of
the algorithm run on different chunks of data -- what is it?
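One way to look for this confound is to compare simple statistics of
the chunks that serve as test sets under the two training proportions
(a sketch, assuming dm_returns is a vector of daily returns and that
ga.m uses the first training_proportion of the data for training and
the rest for testing):

    n = length(dm_returns);
    % Test chunks under training_proportion = .5 and .1 respectively.
    test_half   = dm_returns(floor(.5*n)+1:n);
    test_ninety = dm_returns(floor(.1*n)+1:n);
    % If these statistics differ much, the chunks themselves differ.
    [mean(test_half) var(test_half); mean(test_ninety) var(test_ninety)]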
- Run ga.m ten times for each of the following parameter
sets (note the change from dm to yen):
- [train_returns, test_returns] = ga(yen_series,
yen_returns, .1, 1, 1, 200)
- [train_returns, test_returns] = ga(yen_series,
yen_returns, .1, 2, 1, 200)
Examine the test_returns curves as a function of the number of
generations. Do you notice a qualitative difference between
the curves from the algorithm with tree_size 2 as opposed to
the algorithm with tree_size 1? (hint: for each trial compare
the maximum test fitness values with the final test fitness
value). Print out one representative plot for each condition
that demonstrates this, and speculate about how the difference
in tree size causes the differences in curve shapes (hint:
think about pruning in decision trees).
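To make the hinted comparison concrete, for each trial you might
compare the best and final test fitness values like this (a sketch
for a single tree_size 2 run):

    % Hillclimbing run on the yen data with max tree depth 2.
    [train_returns, test_returns] = ga(yen_series, yen_returns, .1, 2, 1, 200);
    % Best test fitness seen during the run versus the final value;
    % a large gap suggests later generations overfit the training set.
    [best, gen] = max(test_returns);
    fprintf('best %g at generation %d, final %g\n', best, gen, test_returns(end));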
Created by James Thomas, maintained by Rosie Jones
Last modified: Mon Apr 12 10:34:54 EDT 1999