This file is a collection of terms and definitions that are used to
describe benchmark problems and results in this benchmark collection.  By
defining these terms here, in a central file, we hope to encourage precise
and uniform reporting of results without having to define a lot of terms
within every benchmark description file.

This glossary file is NOT meant to be a comprehensive survey of how these
terms are used in the literature.  Where different researchers use a term
in different ways, we must choose -- perhaps arbitrarily -- which
definition to use here.

This file was created and is currently maintained by Scott E. Fahlman.
It is not copyrighted because I want researchers to have unrestricted
access to everything in this benchmark collection.  However, I would ask
anyone reprinting a substantial portion of this file to acknowledge the
source, as a matter of simple courtesy.
---------------------------------------------------------------------------
        
ACTIVATION FUNCTION: In many learning architectures, the output of each
  unit is some function (usually nonlinear) of the weighted inputs to that
  unit.  This function is called the "activation function" for that unit.
  Among the popular activation functions are threshold, sigmoid, symmetric
  sigmoid, and Gaussian.

BACKPROP: A colloquial expression for the BACK-PROPAGATION learning
  algorithm.

BACK-PROPAGATION LEARNING: A supervised learning method for networks
  without any directed loops.  See [1] for a detailed description of the
  algorithm and parameters.

BIAS UNIT: Many learning network architectures use something like the
  following formula to describe the behavior of each unit:

  output = f(sum-of-weighted-inputs + bias)

  The function f is some nonlinear activation function.  Both the weights
  and the bias for each unit must be adjusted by the learning machinery.
  The learning algorithm is simplified by replacing the bias unit with a
  connection to a "bias unit" whose value is held constant and positive.

BIAS CONNECTION: The connection a unit receives from the bias unit.

CROSS ENTROPY ERROR FUNCTION: An alternative to the mean-squared error
  function.  In general, the values of d, the desired output for a given
  training example, and y, the observed output for that case, will be real
  numbers.  We can interpret d as the desired probability that a
  binary-valued output will assume a value of 1 in this case, and y as the
  observed probability of seeing a 1 in that case.

  The cross-entropy error, C is then expressed as

  C =  - sum [all cases and outputs] ( d log(y) + (1-d) log(1-y) )

  The derivative of this error function for a given output and training
  example (this is the value we actually back-propagate) is

  dC/dy =  - d/y + (1-d)(1-y)

  Note that this back-propagated derivative goes to infinity as the
  difference between y and d goes to +1 or -1.  This can counteract the
  tendency of the network to get stuck in regions where the derivative of
  the sigmoid function approaches zero.

COMPLETE CONNECTIVITY: In most feed-forward networks the units are
  arranged in regular layers.  Layer L1 is said to be "completely
  connected" to layer L2 if every unit in L2 receives a connection from
  every unit in L1.  A network is said to be "completely connected" if each
  of its layers is completely connected to the next one, and if every unit
  in the first hidden layer receives a connection from each of the
  network's external inputs.

EPOCH: An epoch is defined as a single presentation of each of the
  examples in the training set, either in fixed or random order.  This
  concept is only well-defined for problems with a fixed, finite set of
  training examples.  Modification of weights may occur after every epoch,
  after every pattern presentation, or on some other schedule.

  It is often convenient to measure learning time in epochs for problems
  and algorithms in which the concept of epoch is well-defined; for other
  problems it may be necessary to measure learning time in terms of
  individual pattern presentations or in terms of the number of arithmetic
  operations required.

ERROR FUNCTION: The task of supervised learning is generally to minimize
  some function of the difference between the observed and the desired
  outputs, taken over all the outputs and over some set of input/output
  pairs.  This function is called the "error function".  The most common
  error function is "mean squared error": the square of the difference
  between desired and observed outputs is averaged over all outputs and
  over the entire set of trials.  The derivative of this measure with
  respect to each output value is just the difference itself, so it is this
  difference that produces the error signal that is back-propagated through
  the network.

40-20-40 CORRECTNESS CRITERION: For problems in which the output units
  are supposed to assume binary values -- logical zero and logical one --
  there is a question of what range of output values are to be considered
  "close enough" to these ideal targets.  The 40-20-40 criterion says that
  values in the lowest 40% of the output range are treated as logical zero,
  values in the highest 40% are treated as logical one, and values in the
  middle 20% are treated as indeterminate and therefore not correct.  This
  criterion is widely used because it does not require extreme accuracy in
  the outputs but does require that the output values be distinct enough
  for some amount of noise immunity.  Do not confuse this measure with the
  TARGET VALUES for the output units.

HYPERBOLIC ARCTANGENT ERROR FUNCTION: An error function under which the
  derivative error value back-propagated into the network is the hyperbolic
  arctangent of the difference between the desired and observed values,
  rather than the difference itself.  See [2] for details.

HYPERGEOMETRIC MEAN: The hypergeometric mean H of a set of N values {v1, v2,
  ... vN} can be computed by the formula

  H = N / (1/v1 + 1/v2 ... + 1/vN)

  This measure has been used by some researchers as a way of reporting
  "average" learning times for problems in which a normal arithmetic mean
  is unusable because some of the learning trials do not converge at all.
  These trials go to zero in the summation above, and so they cause no
  special problems.  Note, however, that this method gives a result that is
  strictly less than the arithmetic mean of the same value set because fast
  learning trials are weighted more heavily than slow ones.  This measure
  is also known as the "harmonic mean".

MEAN SQUARED ERROR: See ERROR FUNCTION.

PATTERN PRESENTATION: See PRESENTATION.

PRESENTATION: Most supervised learning algorithms operate by presenting
  training examples or training patterns to the network, one at a time.
  Each such example consists of an input and a desired output.  Learning
  time is often measured by the total number of presentations required for
  training the network to some specified level of proficiency.

QUICKPROP: A variation on the BACK-PROPAGATION learning algorithm that
  uses a second-order weight-update function, based on measurement of the
  error gradient at two successive points, to speed up convergence over
  simple first-order gradient descent.  See reference [2] for details.

REINFORCEMENT LEARNING: A learning scheme in which the network is trained
  by giving it inputs, observing the outputs, and indicating how good those
  outputs were according to some criterion known to the trainer.  The
  desired outputs are not presented to the network in explicit form.

RESTART REPORTING METHOD: This is a method for reporting "average"
  learning times in situations where some trials do not converge or take an
  unusually long time.  The experimenter picks some number of epochs as a
  "restart" value.  If a learning trial has not been completed by the time
  this value is reached, the trial is terminated and started over with a
  new set of random initial values.  When the problem ultimately is solved,
  the time reported includes the time invested before the restart as well
  as the time after.

SHORTCUT CONNECTIONS: In most feed-forward networks the units are
  arranged in regular layers.  Units in one layer receive incoming
  connections only from units in the previous layer; units in the first
  layer receive connections only from the external inputs.  A "shortcut
  connection" skips over layers in the network.  In a network with
  "complete shortcut connectivity", each unit receives direct connections
  from all units in ALL previous layers and also from all the external
  inputs.

SIGMOID ACTIVATION FUNCTION: A monotonic, continuously differentiable
  activation function often used in back-propagation networks.  Defined as
  output = 1 / ( 1 + exp (-x/T) )
  where x is the weighted sum of inputs and T is a scaling parameter
  controlling the steepness of the slope.  This is equivalent to
  output = ( tanh (x/T) + 1 ) / 2
  The range of this function is from 0 to +1.

SQUARED ERROR: See ERROR FUNCTION.

SUPERVISED LEARNING: A learning scheme in which the desired output for each
  training input is presented to the network explicitly.

SYMMETRIC SIGMOID ACTIVATION FUNCTION: A version of the sigmoid function
  shifted to that its range is symmetric around 0.  Some researchers use a
  range of -1 to +1, while others use -1/2 to +1/2.  It is best to specify
  which convention is in use.

TARGET VALUE: This is the value presented as the desired output during
  training.  For problems with binary outputs, the choice of target values
  is up to the algorithm designer.  Many researchers use .1 and .9 rather
  than the extreme values of 0.0 and 1.0, which are unreachable with
  sigmoid output units.

THRESHOLD ACTIVATION FUNCTION: An activation function that is 0 for all
  input sums below 0 or some other threshold value, and 1 for all input
  sums above this threshold.  This is the limiting case of the sigmoid
  activation function as T goes to 0.  Used in the Perceptron, but
  unsuitable for back-propagation, which requires a continuous,
  differentiable activation function.

TWICE-MEDIAN REPORTING METHOD: This is an alternative to the restart
  reporting method or the use of the hypergeometric mean for reporting
  learning-time experiments in which some trials do not terminate.  We
  record the time required by all trials, with very long trials being
  terminated and given a value of infinity.  We compute the median of all
  trials, and classify any trial that took longer than twice the median
  time to be "unsuccessful".  We report the median value, the average of
  all successful trials, and the number of unsuccessful trials.  The
  advantage of this method over the restart method is that it avoids the
  guesswork involved in choosing a good restart value.  Of course, it is
  only applicable if at least half of the trials converge.

UNSUPERVISED LEARNING: A learning scheme in which the desired outputs are
  not presented to the network; only the inputs are seen.  The task of the
  learning algorithm is generally to learn to reproduce the input
  distribution, to complete partial inputs, or to classify the inputs
  according to some observed regularities.

---------------------------------------------------------------------------
REFERENCES:

1. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal
   Representations by Error Propagation", in Parallel Distributed
   Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland (eds.), MIT
   Press, 1986.

2. Scott E. Fahlman, "Faster-Learning Variations on Back-Propagation: An
   Empirical Study", in Proceedings of the 1988 Connectionist Models Summer
   School, Morgan Kaufmann, 1988.
