Excerpts from a draft for Byte Magazine  "Some Assembly Required", 
February 1994
By Andy Singleton, Copyright 1993

*****************************************************************************
These excerpts contain some technical information that might be useful for 
interpreting the code in GPQUICK.  The real intro stuff is all cut out.

I have included a FASTEVAL in GPQUICK which is different from the eval 
described here.

I encourage comments on the "Problem" class, with the ultimate goal of 
designing encapsulated problems which would be source code compatible with a 
variety of GP systems.
*****************************************************************************

...
Programming with S-Expressions

The form of GP described by Koza and used in GPQUICK generates S-Expressions, 
more commonly known as LISP expressions.  An S-Expression consists of a 
function, followed by zero or more arguments.  Each argument is in turn also an 
S-Expression.  We can represent "2+2" as (ADD 2 2), and "2+3*5/2" as (ADD 2 
(MULTIPLY  3 (DIVIDE 5 2))).

This is called "prefix notation" because we write the function first.  If we 
flip the function around and put it after the arguments, we end up with a 
"postfix" stack language like FORTH or Postscript, which would also be a 
candidate  for GP.

A function in an expression is called a "node".  Functions with no arguments, 
such as numeric constants, are called "terminals".

All functions return some value.  Functions can also perform actions, known as 
"side effects" to the math geeks.  We could have functions like (TURN_RIGHT),  
(TURN_RIGHT number_of_degrees), etc. which give S_Expressions  procedural 
abilities.  The PROG function can be used to string actions together into a 
procedural program, such as (PROG GO_FORWARD GO_FORWARD TURN_RIGHT).

If we add conditionals, such as (IF condition dothis otherwise_do_that), 
looping constructs such as (WHILE condition dothis) or (FOR count dothis), and 
a memory array with (SET element value) and (GET element), we have a Turing 
complete programming language which can represent any structured program.

Achieving Closure

To do GP, we will chop a piece off of one program and stick it together with a 
piece of another program, on the off chance that we might get something good.  
If you did this with one of your C++ programs, you would have approximately a 
0% chance of creating a working program.  With S-Expressions we can rig the 
game so that we have a much higher probability of success.  We design our 
functions so that they all return the same argument type, usually a floating 
point number, and they all accept arguments of this type.  This situation is 
called "closure".  Now we can use any expression as an argument for any other 
expression.  We can swap a sub-expression from one program for a sub-expression 
in another program and we will always end up with a syntactically valid 
program.  It might be completely useless, but it will never send smoke 
billowing out from under the CPU.

Crossover

GPQUICK uses "subtree" crossover.  First we select two parents.  Then we pick a  
node, any node, from the first parent.  This node is by definition the 
beginning of some complete sub-expression.  If we wrote the expression out as a 
parse tree, the sub-expression would be a branch or subtree.  We extract this, 
leaving a hole in the program.  We then pick a node in the second parent, 
extract the following sub-expression, and swap it with the first sub-
expression.

For example, we could cross (ADD 2 (MULTIPLY  3 (DIVIDE 5 2)))  with (ADD 
(MULTIPLY 7 11) 23).  In the first parent, we pick the fourth node, a "3".  In 
the second parent, we pick the second node, a "(MULTIPLY 7 11)".  Swapping, we 
end up with (ADD 2 (MULTIPLY (MULTIPLY 7 11) (DIVIDE 5 2))).

Mutation

Any random change to a program qualifies as a mutation.  There are many types 
of  changes that we can make to an S-Expression and  still end up with a 
syntactically valid program.  For simplicity I  have included in GPQUICK one 
type of mutation that picks a random function and changes it to a different  
function with the same number of arguments.

Applying this mutation to (ADD 2 (MULTIPLY 3 (DIVIDE 5 2))), we select the 
third node, a MULTIPLY, for mutation.  We can change it to any two argument 
function.  Randomly, we select ADD, ending up with (ADD 2 (ADD 3 (DIVIDE 5 
2))).  To mutate a terminal, we would substitute another terminal.

The GP code and incredible shrinking interpreter

GPQUICK uses a representation for the GP code called "Linear prefix jump 
table", which was suggested to me by Mike Keith.  Programs coded this way are 
small and evaluate quickly.  When we evaluate populations of several thousand 
programs, both size and speed will be very important.

The GP code is an array of two byte structures of type "node", defined by
typedef struct {unsigned char op; unsigned char idx;} node;  // node type

Member op (for opcode) is a function number.  idx is an immediate operand which 
can represent a table or constant index.

The GP code is stored in an object called a Chrome, short for chromosome.

Functions are represented in GPQUICK by objects of class Function.  All of the 
functions are in stored in the array "funcarray" and given an array index.  We 
could install the functions NUMBER (to return numeric constants), ADD, 
SUBTRACT, MULTIPLY and DIVIDE as functions 0,1,2,3 and 4.  Function NUMBER uses 
the idx to determine which number to return.  Other functions do not use idx 
and leave a zero value.

The expression (ADD 2 (MULTIPLY 3 (DIVIDE 5 2))) would be represented in our GP 
machine language as
{(1,0) (0,2) (3,0) (0,3) (4,0) (0,5) (0,2)}
Where the first element (1,0) is an ADD (opcode 1) node, and the second element 
(0,2) is a NUMBER node (opcode 0) returning the value 2.

Our Chrome method to evaluate the function at position ip in the code stack 
looks like this:
	ip++;return funclist[expr[ip].op]->eval(this);

This calls the appropriate Function eval method.  The Function eval method 
calls the Chrome eval method recursively to get its arguments.  The eval method  
for function ADD is:
	return chrome->eval() + chrome->eval();
The eval method for function NUMBER is:
	return chrome->expr[ip].idx;

That is all there is to the interpreter.  Add functions to taste.

GPQUICK Architecture

The object oriented architecture of GPQUICK is divided into three pieces: The 
Chrome and Function classes, which are the underlying program representation, 
the Problem class, which holds the objectives, functions and data necessary to 
define and solve a particular GP task, and the Pop class, which holds a 
population of Chromes and runs a GA on them, calling the fitness function in 
the Problem to do an evaluation.  This three part architecture has worked well 
for me in a variety of applications.  Obviously, there is a tremendous variety 
of potential "Problems".   There are also a multitude of different kinds of 
genetic algorithm that might be associated with subclasses of Pop, each with 
different methods of selection, a different mix of genetic operations, and a 
different topology for the population (single processor, parallel or 
distributed, for example).  As long as they share the same Chrome 
representation, most Problems can run with most Pops and GAs.

Chrome and Function classes

The Chrome (short for Chromosome) class is the carrier of a genetic program.  
Important data members include:
	int ip;                                 // Instruction pointer
	node *expr;             // actual GP code
	float lastfitness;              // fitness on last eval
	    // array of pointers to  Functions from the current Problem
	Function** funclist;
	ChromeParams* params;   // Initialization and evaluation parameters

Important methods include:
	// Initialize a new chrome with random "ramped half and half" expression
	Chrome(ChromeParams* p,CFitness* cf,Function** f,retval* c); 
	 // Eval the expression at ip
	retval eval() {ip++;return funclist[FUNCNUM(expr[ip])]->eval(this);} ;
	retval evalAll() ; /    // eval the whole expression
		// Cross with mate and return a new chrome
	Chrome* CrossTree(Chrome* mate);
	void Mutate();                          // mutate self
		// load and parse and expression.  Return success
	virtual int Load(istream& istr,int issource);
		 // write the expression to a stream
	void write(int pretty,ostream& ofile = cout);

A Function object carries information needed by the Chrome for initializing, 
displaying and evaluating expressions that contain that function.  Many 
functions, such as the arithmetic functions, are problem independent, but some 
functions, such as data terminals, require the data structures and evaluation 
environment of a particular problem.  Important data members include:
	char name[30];          // written name
	    // number of arguments.  Used to initialize and traverse arguments.
	int argnum;
	    //number of variables in variable table of opcode. To init node.idx
	int varnum;
	    // selection frequency relative to other functions.   Used in 
initialization, mutation.
	int weight;
	       // pointer to evaluation code.  Not a virtual for speed reasons
	EVALFUNC evalfunc;

The important method is:
	retval eval(class Chrome* st) {return (evalfunc)(st);};

Pop class

The Pop (short for Population) carries an array of Chromes, evaluates them with 
the fitness function provided in the current Problem, and performs a genetic 
algorithm to generate new and better Chromes.  Important data members include:
	Chrome** pop;           // allocated array holding the actual chromes
	Problem* problem;       // The current Problem
	UINT popsize;           // Number of Chromes
	long gencount;          // number of individuals generated so far
	int   BestMember;       // Index of current best Chrome

Important methods include:
		// set up a population for a particular problem
	Pop(Problem* prob,ChromeParams* par, UINT size );
		// generate a new chrome.  Return its fitness
	virtual Chrome* generate();
		// generate until reaching time max_time, evaluating maxevals 
individuals, or reaching maxfitness
	virtual Chrome* go_until(time_t max_time,long maxevals,float maxfitness);

Pop uses a "steady state" genetic algorithm, generating one new Chrome and 
replacing one old Chrome at a time, as opposed to making a whole new batch as a 
"generation".  The generate method is the agent which selects one or more 
parents and applies mutation or crossover.  This virtual method could be 
replaced by code with different strategies for selection, reproduction, or 
interaction with parallel populations.

Selection strategy is important.  We want to choose parents randomly enough so 
that we maintain a healthy diversity in our population.  If only the best 
individual reproduces, the GA will run out of raw material.  On the other hand, 
progress requires a lot more reproductions of the good individuals than the bad 
ones.  The default generate uses tournament selection of size six, which means 
that it looks at a set of six candidates selected at random and chooses the 
best one in the set.  We could make this "greedier", or more likely to pick the 
very best individuals, by increasing the size of the set.  The default generate 
also uses a kill-tournament to select an individual for replacement, replacing 
the worst of two randomly selected individuals.

The default generate does crossover 70% of the time, mutation 20% of the time, 
and a straight copy 10% of the time.  You can change these ratios in 
ChromeParams.

Problem class

In order to do GP, you must have a fitness function, which defines the problem, 
and a set of primitive operations sufficient for solving it.  Any 
implementation of GPQUICK must have a subclass of Problem which includes the 
appropriate fitness function and primitive functions.  The important data 
member is:
		// array of primitive functions
Important methods include:
	Function** funclist;
		// install the primitives and initialize the evaluation environment
	Problem();
	void AddF(Function* f);         // add a Function to the funclist
		// Pure virtual fitness function. 
		// It should evaluate the chrome in an appropriate environment,
		// and return a fitness value
	virtual float fitness(Chrome* chrome);

Most programs need to be evaluated in a particular context.  If you are trying 
to evolve a program that can navigate a maze, you need to define the maze.  
Setting up the environment should be done in the Problem constructor and at the 
beginning of the fitness function.


****************************************************************************
The included problem...

SSRProb - Simple Symbolic Regression

The sample problem provided with GPQUICK performs symbolic regression, chosen 
for simplicity and speed.  In symbolic regression we attempt to evolve an 
arithmetic expression that matches a set of output variables, or an unknown 
function.  SSRProb creates an expression using the input variables John, Paul, 
George, Ringo and Elvis.  The correct expression is John*(George - Paul) + 
2.5*Ringo.  Elvis is a noise variable.

The function set for SSRProb includes the terminals NUMBER (for numeric 
constants), John, Paul, George, Ringo, and Elvis.  It also includes the 
arithmetic functions ADD, SUB, MUL, DIV and SINE.

The fitness function for SSRProb is the sum squared error of the expression for 
20 randomly chosen samples of the input variables.

Running SSRProb

As GPQUICK runs, it periodically lists out the number of Chromes generated, the 
best fitness, and the best expression.  We can learn some important things 
about artificial evolution by watching this output.  At first, the search for a 
better expression wanders aimlessly, and total error hovers near 10^11.  
Sometimes fitness does not improve, and the run fails.  At some point, QPQUICK 
may discover a variant of the expression (MUL John (SUB George Paul), and total 
error drops suddenly to 10^7.  This illustrates the phenomenon of punctuated 
equilibrium.  Change tends to be sudden, unpredictable, and surrounded by 
periods of stability.  Now GPQUICK is looking for 2.5 Ringos.  It eventually 
finds one Ringo, followed very quickly by a second Ringo.  This demonstrates 
how a genetic algorithm uses crossover to build a solution piece by piece.  The 
expression resembles (ADD (ADD Ringo Ringo) (MUL John (SUB George Paul))) and 
the total error is near 300,000.  The expression begins to grow bigger, adding 
various functions in search of that extra half Ringo.  It may eventually build 
an sizable expression that simplifies to exactly John*(George - Paul)+2.5*Ringo 
and quit, before it times out after generating 80,000 chromes.  This result 
shows  the tendency of GP to generate messy, obscure, but effective solutions.

If you program your Pop to output population statistics, you will see that the 
variance in fitness (error) decreases just when the expressions start to grow 
in size.  The bigger expressions have a lot of redundancy, and they are likely 
to have a pretty good fitness after crossover.  I have dubbed this phenomenon 
"defense against crossover".   Chopping a piece out of one program and sticking 
it randomly into another program is a pretty radical operation which would 
almost surely kill a manually written program, but genetic programs learn 
tolerate and even thrive on it.  This is a key which explains why genetic 
programming works at all.  It evolves evolvability.

You can change the difficulty of the symbolic regression problem by changing 
the formula.  Longer polynomials are harder match.  Changing the parameters in 
the ChromeParams constructor will allow you to change the size of the 
expressions, the mix of mutation and crossover, and "greediness" of the 
selection tournament.  By removing the comment on the SINE function, you can 
increase the size of the search in "program space" and noticeably reduce the 
probability of success.  This dependence on the function set is a weakness of 
GP.  It is also an opportunity, since by improving your function set you can 
make a difficult problem easier and get it solved.

Using symbolic regression to discover a function that you already know is not 
useful, but you can modify SSRProb to learn prediction and classification from 
real world examples.  This is genetic induction, and it has several advantages 
when compared with competing statistical or neural techniques.
...
*****************************************************************************
This problem was included for its simplicity.  More Problems should be 
available later.

