Newsgroups: comp.ai.neural-nets,comp.answers,news.answers
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!godot.cc.duq.edu!newsgate.duke.edu!zombie.ncsc.mil!news.mathworks.com!newsfeed.internetmci.com!in1.uu.net!news.interpath.net!sas!newshost.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn2.posting_833338813@hotellng.unx.sas.com>
Supersedes: <nn2.posting_830746813@hotellng.unx.sas.com>
Approved: news-answers-request@MIT.EDU
Date: Wed, 29 May 1996 03:00:15 GMT
Expires: Wed, 3 Jul 1996 03:00:13 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: frequently asked questions, answers
Followup-To: comp.ai.neural-nets
Lines: 1388
Xref: glinda.oz.cs.cmu.edu comp.ai.neural-nets:31741 comp.answers:18895 news.answers:72925

Archive-name: ai-faq/neural-nets/part2
Last-modified: 1996-05-27
URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

This is part 2 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See part 1 of this posting for full information
about what the FAQ covers.

========== Questions ========== 
********************************

Part 1: Introduction
Part 2: Learning

   How many learning methods for NNs exist? Which?
   What is backprop?
   What are conjugate gradients, Levenberg-Marquardt, etc.?
   How should categories be coded?
   Why use a bias input?
   Why use activation functions?
   What is a softmax activation function?
   How do MLPs compare with RBFs?
   Should I normalize/standardize/rescale the data?
   What is ART?
   What is PNN?
   What is GRNN?
   What about Genetic Algorithms and Evolutionary Computation?
   What about Fuzzy Logic?

Part 3: Generalization
Part 4: Books, data, etc.
Part 5: Free software
Part 6: Commercial software
Part 7: Hardware

------------------------------------------------------------------------

Subject: How many learning methods for NNs exist?
=================================================
Which?
======

There are a great many learning methods for NNs by now; nobody knows
exactly how many. New ones (or at least variations of existing ones) are
invented every week. Below is a collection of some of the best known
methods, not claiming to be complete.

The main categorization of these methods is the distinction between
supervised and unsupervised learning: 

 o In supervised learning, there is a "teacher" who in the learning phase
   "tells" the net how well it performs ("reinforcement learning") or what
   the correct behavior would have been ("fully supervised learning"). 
 o In unsupervised learning the net is autonomous: it just looks at the data
   it is presented with, finds out about some of the properties of the data
   set, and learns to reflect these properties in its output. Exactly which
   properties the network can learn to recognise depends on the particular
   network model and learning method. Usually, the net learns some
   compressed representation of the data. 

Many of these learning methods are closely connected with a certain (class
of) network topology.

Now here is the list, just giving some names:

1. UNSUPERVISED LEARNING (i.e. without a "teacher"):
     1). Feedback Nets:
        a). Additive Grossberg (AG)
        b). Shunting Grossberg (SG)
        c). Binary Adaptive Resonance Theory (ART1)
        d). Analog Adaptive Resonance Theory (ART2, ART2a)
        e). Discrete Hopfield (DH)
        f). Continuous Hopfield (CH)
        g). Discrete Bidirectional Associative Memory (BAM)
        h). Temporal Associative Memory (TAM)
        i). Adaptive Bidirectional Associative Memory (ABAM)
        j). Kohonen Self-organizing Map/Topology-preserving map (SOM/TPM)
        k). Competitive learning
     2). Feedforward-only Nets:
        a). Learning Matrix (LM)
        b). Driver-Reinforcement Learning (DR)
        c). Linear Associative Memory (LAM)
        d). Optimal Linear Associative Memory (OLAM)
        e). Sparse Distributed Associative Memory (SDM)
        f). Fuzzy Associative Memory (FAM)
        g). Counterpropagation (CPN)

2. SUPERVISED LEARNING (i.e. with a "teacher"):
     1). Feedback Nets:
        a). Brain-State-in-a-Box (BSB)
        b). Fuzzy Cognitive Map (FCM)
        c). Boltzmann Machine (BM)
        d). Mean Field Annealing (MFT)
        e). Recurrent Cascade Correlation (RCC)
        f). Backpropagation through time (BPTT)
        g). Real-time recurrent learning (RTRL)
        h). Recurrent Extended Kalman Filter (EKF)
     2). Feedforward-only Nets:
        a). Perceptron
        b). Adaline, Madaline
        c). Backpropagation (BP)
        d). Cauchy Machine (CM)
        e). Adaptive Heuristic Critic (AHC)
        f). Time Delay Neural Network (TDNN)
        g). Associative Reward Penalty (ARP)
        h). Avalanche Matched Filter (AMF)
        i). Backpercolation (Perc)
        j). Artmap
        k). Adaptive Logic Network (ALN)
        l). Cascade Correlation (CasCor)
        m). Extended Kalman Filter (EKF)
        n). Learning Vector Quantization (LVQ)
        o). Probabilistic Neural Network (PNN)
        p). General Regression Neural Network (GRNN) 

------------------------------------------------------------------------

Subject: What is backprop? 
===========================

Backprop is short for backpropagation of error. The term backpropagation
causes much confusion. Strictly speaking, backpropagation refers to the
method for computing the error gradient for a feedforward network, a
straightforward but elegant application of the chain rule of elementary
calculus (Werbos 1994). By extension, backpropagation or backprop refers
to a training method that uses backpropagation to compute the gradient. By
further extension, a backprop network is a feedforward network trained by
backpropagation. 

Standard backprop is a euphemism for the generalized delta rule, the
training algorithm that was popularized by Rumelhart, Hinton, and Williams
in chapter 8 of Rumelhart and McClelland (1986), which remains the most
widely used supervised training method for neural nets. The generalized
delta rule (including momentum) is called the heavy ball method in the
numerical analysis literature (Poljak 1964; Bertsekas 1995, 78-79). 

Standard backprop can be used for on-line training (in which the weights are
updated after processing each case) but it does not converge. To obtain
convergence, the learning rate must be slowly reduced. This methodology is
called stochastic approximation. 

For batch processing, there is no reason to suffer through the slow
convergence and the tedious tuning of learning rates and momenta of standard
backprop. Much of the NN research literature is devoted to attempts to speed
up backprop. Most of these methods are inconsequential; two that are
effective are Quickprop (Fahlman 1989) and RPROP (Riedmiller and Braun
1993). But conventional methods for nonlinear optimization are usually
faster and more reliable than any of the "props". See "What are conjugate
gradients, Levenberg-Marquardt, etc.?". 
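
For concreteness, here is a minimal sketch of standard backprop (the
generalized delta rule with momentum) for one hidden layer of tanh units and
a linear output unit. All names, constants, and the toy training set are
illustrative, not taken from any particular package:

```python
# Minimal sketch of the generalized delta rule with momentum.
# Illustrative names and constants; not from any library.
import math, random

random.seed(0)
N_IN, N_HID = 2, 2
LRATE, MOMENTUM = 0.05, 0.5

# each hidden unit has N_IN input weights plus a bias weight (last slot)
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(N_HID + 1)]
dw_hid = [[0.0] * (N_IN + 1) for _ in range(N_HID)]   # previous weight steps
dw_out = [0.0] * (N_HID + 1)

def forward(x):
    h = [math.tanh(w[N_IN] + sum(w[i] * x[i] for i in range(N_IN)))
         for w in w_hid]
    y = w_out[N_HID] + sum(w_out[j] * h[j] for j in range(N_HID))
    return h, y

def train_case(x, target):
    """One on-line update: gradient by the chain rule, then a step of
    size LRATE plus MOMENTUM times the previous step (heavy ball)."""
    h, y = forward(x)
    err = target - y
    for j in range(N_HID):
        # chain rule: linear output (derivative 1), then tanh (1 - h^2)
        delta = err * w_out[j] * (1.0 - h[j] ** 2)
        for i in range(N_IN + 1):
            x_i = x[i] if i < N_IN else 1.0   # bias input fixed at 1
            dw_hid[j][i] = LRATE * delta * x_i + MOMENTUM * dw_hid[j][i]
            w_hid[j][i] += dw_hid[j][i]
    for j in range(N_HID + 1):
        h_j = h[j] if j < N_HID else 1.0      # bias input fixed at 1
        dw_out[j] = LRATE * err * h_j + MOMENTUM * dw_out[j]
        w_out[j] += dw_out[j]

data = [((0, 0), 0.0), ((0, 1), 0.5), ((1, 0), 0.5), ((1, 1), 1.0)]
initial_sse = sum((t - forward(x)[1]) ** 2 for x, t in data)
for epoch in range(500):
    for x, t in data:
        train_case(x, t)
final_sse = sum((t - forward(x)[1]) ** 2 for x, t in data)
```

Note the tedious tuning referred to above: LRATE and MOMENTUM were simply
picked small enough to behave on this toy problem.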

References on backprop: 

   Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
   Scientific, ISBN 1-886529-14-0. 

   Poljak, B.T. (1964), "Some methods of speeding up the convergence of
   iteration methods," Z. Vycisl. Mat. i Mat. Fiz., 4, 1-17. 

   Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986), "Learning
   internal representations by error propagation", in Rumelhart, D.E. and
   McClelland, J. L., eds. (1986), Parallel Distributed Processing:
   Explorations in the Microstructure of Cognition, Volume 1, 318-362,
   Cambridge, MA: The MIT Press. 

   Werbos, P.J. (1994), The Roots of Backpropagation, NY: John Wiley &
   Sons. 

References on stochastic approximation: 

   Robbins, H. & Monro, S. (1951), "A Stochastic Approximation Method",
   Annals of Mathematical Statistics, 22, 400-407. 

   Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for
   Constrained and Unconstrained Systems, Springer-Verlag. 

   White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical Assoc.,
   84, 1008-1013. 

References on better props: 

   Fahlman, S.E. (1989), "Faster-Learning Variations on Back-Propagation: An
   Empirical Study", in Touretzky, D., Hinton, G, and Sejnowski, T., eds., 
   Proceedings of the 1988 Connectionist Models Summer School, Morgan
   Kaufmann, 38-51. 

   Riedmiller, M. and Braun, H. (1993), "A Direct Adaptive Method for Faster
   Backpropagation Learning: The RPROP Algorithm", Proceedings of the IEEE
   International Conference on Neural Networks 1993, San Francisco: IEEE. 

------------------------------------------------------------------------

Subject: What are conjugate gradients,
======================================
Levenberg-Marquardt, etc.? 
===========================

Training a neural network is, in most cases, an exercise in numerical
optimization of a usually nonlinear function. Methods of nonlinear
optimization have been studied for hundreds of years, and there is a huge
literature on the subject in fields such as numerical analysis, operations
research, and statistical computing, e.g., Bertsekas 1995, Gill, Murray, and
Wright 1981. There is no single best method for nonlinear optimization. You
need to choose a method based on the characteristics of the problem to be
solved. For functions with continuous second derivatives (which would
include feedforward nets with the most popular differentiable activation
functions and error functions), three general types of algorithms have been
found to be effective for most practical purposes: 

 o For a small number of weights, stabilized Newton and Gauss-Newton
   algorithms, including various Levenberg-Marquardt and trust-region
   algorithms are efficient. 
 o For a moderate number of weights, various quasi-Newton algorithms are
   efficient. 
 o For a large number of weights, various conjugate-gradient algorithms are
   efficient. 

All of the above methods find local optima. For global optimization, there
are a variety of approaches. You can simply run any of the local
optimization methods from numerous random starting points. Or you can try
more complicated methods designed for global optimization such as simulated
annealing or genetic algorithms (see Reeves 1993 and "What about Genetic
Algorithms and Evolutionary Computation?"). 

For a survey of optimization software, see Moré and Wright (1993). For
more on-line information on numerical optimization see: 

 o The kangaroos, a nontechnical description of various optimization
   methods, at ftp://ftp.sas.com/pub/neural/kangaroos. 
 o John Gregory's nonlinear programming FAQ at 
   http://www.skypoint.com/subscribers/ashbury/nonlinear-programming-faq.html.
 o Arnold Neumaier's page on global optimization at 
   http://solon.cma.univie.ac.at/~neum/glopt.html. 

References: 

   Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
   Scientific, ISBN 1-886529-14-0. 

   Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization,
   Academic Press: London. 

   Levenberg, K. (1944) "A method for the solution of certain problems in
   least squares," Quart. Appl. Math., 2, 164-168. 

   Marquardt, D. (1963) "An algorithm for least-squares estimation of
   nonlinear parameters," SIAM J. Appl. Math., 11, 431-441. 

   Moré, J.J. (1977) "The Levenberg-Marquardt algorithm: implementation
   and theory," in Watson, G.A., ed., _Numerical Analysis_, Lecture Notes in
   Mathematics 630, Springer-Verlag, Heidelberg, 105-116. 

   Moré, J.J. and Wright, S.J. (1993), Optimization Software Guide,
   Philadelphia: SIAM, ISBN 0-89871-322-6. 

   Reeves, C.R., ed. (1993) Modern Heuristic Techniques for Combinatorial
   Problems, NY: Wiley. 

   Rinnooy Kan, A.H.G., and Timmer, G.T., (1989) Global Optimization: A
   Survey, International Series of Numerical Mathematics, vol. 87, Basel:
   Birkhauser Verlag. 

------------------------------------------------------------------------

Subject: How should categories be coded? 
=========================================

First, consider unordered categories. If you want to classify cases into one
of C categories (i.e. you have a categorical target variable), use 1-of-C
coding. That means that you code C binary (0/1) target variables
corresponding to the C categories. Statisticians call these "dummy"
variables. Each dummy variable is given the value zero except for the one
corresponding to the correct category, which is given the value one. Then
use a softmax output activation function (see "What is a softmax activation
function?") so that the net, if properly trained, will produce valid
posterior probability estimates. If the categories are Red, Green, and Blue,
then the data would look like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0   0
    Green      0   1   0
    Blue       0   0   1

When there are only two categories, it is simpler to use just one dummy
variable with a logistic output activation function; this is equivalent to
using softmax with two dummy variables. 
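
The coding schemes discussed in this answer are easy to implement directly.
A small sketch (category names and function names are illustrative):

```python
# 1-of-C ("dummy") coding for an unordered categorical variable.
CATEGORIES = ["Red", "Green", "Blue"]

def one_of_c(category):
    """C binary target variables: 1 for the case's category, 0 elsewhere."""
    return [1 if c == category else 0 for c in CATEGORIES]

def one_of_c_minus_1(category):
    """1-of-(C-1) coding for inputs: omit one dummy variable (here, the
    last one), as discussed later in this answer."""
    return one_of_c(category)[:-1]

# each 1-of-C row sums to one, matching the softmax outputs the net
# should produce for a categorical target
assert one_of_c("Green") == [0, 1, 0]
assert one_of_c_minus_1("Blue") == [0, 0]   # omitted category: all zeros
```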

The common practice of using target values of .1 and .9 instead of 0 and 1
prevents the outputs of the network from being directly interpretable as
posterior probabilities. 

Another common practice is to use a logistic activation function for each
output. Thus, the outputs are not constrained to sum to one, so they are not
valid posterior probability estimates. The usual justification advanced for
this procedure is that if a test case is not similar to any of the training
cases, all of the outputs will be small, indicating that the case cannot be
classified reliably. This claim is incorrect, since a test case that is not
similar to any of the training cases will require the net to extrapolate,
and extrapolation is thoroughly unreliable; such a test case may produce all
small outputs, all large outputs, or any combination of large and small
outputs. If you want a classification method that detects novel cases for
which the classification may not be reliable, you need a method based on
probability density estimation. For example, see "What is PNN?". 

It is very important not to use a single variable for an unordered
categorical target. Suppose you used a single variable with values 1, 2, and
3 for red, green, and blue, and the training data with two inputs looked
like this: 

      |    1    1
      |   1   1
      |       1   1
      |     1   1
      | 
      |      X
      | 
      |    3   3           2   2
      |     3     3      2
      |  3   3            2    2
      |     3   3       2    2
      | 
      +----------------------------

Consider a test point located at the X. The correct output would be that X
has about a 50-50 chance of being a 1 or a 3. But if you train with a single
target variable with values of 1, 2, and 3, the output for X will be the
average of 1 and 3, so the net will say that X is definitely a 2! 

For an input with categorical values, you can use 1-of-(C-1) coding if the
network has a bias unit. This is just like 1-of-C coding, except that you
omit one of the dummy variables (doesn't much matter which one). Using all C
of the dummy variables creates a linear dependency on the bias unit, which
is not advisable unless you are using weight decay or Bayesian estimation or
some such thing that requires all C weights to be treated on an equal basis.
1-of-(C-1) coding looks like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue       0   0

Another possible coding is called "effects" coding or "deviations from
means" coding in statistics. It is like 1-of-(C-1) coding, except that when
a case belongs to the category for the omitted dummy variable, all of the
dummy variables are set to -1, like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue      -1  -1

As long as a bias unit is used, any network with effects coding can be
transformed into an equivalent network with 1-of-(C-1) coding by a linear
transformation of the weights. So the only advantage of effects coding is
that the dummy variables require no standardizing (see "Should I
normalize/standardize/rescale the data?"). 

If you are using weight decay, you want to make sure that shrinking the
weights toward zero biases ('bias' in the statistical sense) the net in a
sensible, usually smooth, way. If you use 1-of-(C-1) coding for an input,
weight decay biases the output for the C-1 categories towards the output for
the one omitted category, which is probably not what you want, although there
might be special cases where it would make sense. If you use 1-of-C coding
for an input, weight decay biases the output for all C categories roughly
towards the mean output for all the categories, which is smoother and
usually a reasonable thing to do. 

Now consider ordered categories. For inputs, some people recommend a
"thermometer code" like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   1   1
    Green      0   1   1
    Blue       0   0   1

However, thermometer coding is equivalent to 1-of-C coding, in that for any
network using 1-of-C coding, there exists a network with thermometer coding
that produces identical outputs; the weights in the thermometer-coded
network are just the differences of successive weights in the 1-of-C-coded
network. To get a genuinely ordinal representation, you must constrain the
weights connecting the dummy variables to the hidden units to be nonnegative
(except for the first dummy variable). 

It is often effective to represent an ordinal input as a single variable
like this: 

   Category  Input
   --------  -----
    Red        1
    Green      2
    Blue       3

Although this representation involves only a single quantitative input,
given enough hidden units, the net is capable of computing nonlinear
transformations of that input that will produce results equivalent to any of
the dummy coding schemes. But using a single quantitative input makes it
easier for the net to use the order of the categories to generalize when
that is appropriate. 

B-splines provide a way of coding ordinal inputs into fewer than C variables
while retaining information about the order of the categories. See Gifi
(1990, 365-370). 

Target variables with ordered categories require thermometer coding. The
outputs are thus cumulative probabilities, so to obtain the posterior
probability of any category except the first, you must take the difference
between successive outputs. It is often useful to use a proportional-odds
model, which ensures that these differences are positive. For more details
on ordered categorical targets, see McCullagh and Nelder (1989, chapter 5). 

References: 

   Gifi, A. (1990), Nonlinear Multivariate Analysis, NY: John Wiley & Sons,
   ISBN 0-471-92620-5. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

------------------------------------------------------------------------

Subject: Why use a bias input? 
===============================

Consider a multilayer perceptron with any of the usual sigmoid activation
functions. Choose any hidden unit or output unit. Let's say there are N
inputs to that unit, which define an N-dimensional space. The given unit
draws a hyperplane through that space, producing an "on" output on one side
and an "off" output on the other. (With sigmoid units the plane will not be
sharp -- there will be some gray area of intermediate values near the
separating plane -- but ignore this for now.) 

The weights determine where this hyperplane lies in the input space. Without
a bias input, this separating hyperplane is constrained to pass through the
origin of the space defined by the inputs. For some problems that's OK, but
in many problems the hyperplane would be much more useful somewhere else. If
you have many units in a layer, they share the same input space and without
bias would ALL be constrained to pass through the origin. 
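
This constraint is easy to verify numerically. With no bias, the net input
at the origin is zero no matter what the weights are, so a logistic unit
outputs exactly 0.5 there; a bias moves the separating hyperplane away from
the origin. A small sketch with illustrative weights:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, bias=0.0):
    # net input = bias + inner product of weights and inputs
    return logistic(bias + sum(w * x for w, x in zip(weights, inputs)))

# Without a bias, the output at the origin is 0.5 for ANY weights:
# the separating hyperplane is forced through the origin.
origin = [0.0, 0.0]
for weights in ([1.0, -2.0], [5.0, 0.3], [-0.1, 0.1]):
    assert unit_output(weights, origin) == 0.5

# A bias shifts the hyperplane, changing the output at the origin.
assert unit_output([1.0, -2.0], origin, bias=3.0) > 0.5
```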

The "universal approximation" property of multilayer perceptrons with most
commonly-used hidden-layer activation functions does not hold if you omit
the bias units. But Hornik (1993) shows that a sufficient condition for the
universal approximation property without biases is that no derivative of the
activation function vanishes at the origin, which implies that with the
usual sigmoid activation functions, a fixed nonzero bias can be used. 

Reference: 

   Hornik, K. (1993), "Some new results on neural network approximation,"
   Neural Networks, 6, 1069-1072. 

------------------------------------------------------------------------

Subject: Why use activation functions? 
=======================================

Activation functions for the hidden units are needed to introduce
nonlinearity into the network. Without nonlinearity, hidden units would not
make nets more powerful than just plain perceptrons (which do not have any
hidden units, just input and output units). The reason is that a composition
of linear functions is again a linear function. However, it is the
nonlinearity (i.e., the capability to represent nonlinear functions) that
makes multilayer networks so powerful. Almost any nonlinear function does
the job, although for backpropagation learning it must be differentiable and
it helps if the function is bounded; the sigmoidal functions such as
logistic and tanh and the Gaussian function are the most common choices. 
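
The collapse of composed linear functions can be shown in a few lines. In
this sketch (all weights illustrative), a two-layer "network" with no hidden
activation function computes exactly the same outputs as a single linear
layer whose weight matrix is the product of the two:

```python
# Two linear layers, no activation function: 2 inputs -> 2 hidden -> 1 output.
W1 = [[1.0, 2.0], [3.0, -1.0]]   # input-to-hidden weights
W2 = [0.5, -2.0]                 # hidden-to-output weights

def two_linear_layers(x):
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    return sum(w * hi for w, hi in zip(W2, h))

# The equivalent single layer is just the matrix product of W2 and W1,
# so the hidden layer adds no representational power at all.
W_combined = [sum(W2[j] * W1[j][i] for j in range(2)) for i in range(2)]

def one_linear_layer(x):
    return sum(w * xi for w, xi in zip(W_combined, x))

for x in ([1.0, 0.0], [0.0, 1.0], [2.5, -3.0]):
    assert abs(two_linear_layers(x) - one_linear_layer(x)) < 1e-12
```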

For the output units, you should choose an activation function suited to the
distribution of the target values. Bounded activation functions such as the
logistic are particularly useful when the target values have a bounded
range. But if the target values have no known bounded range, it is better to
use an unbounded activation function, most often the identity function
(which amounts to no activation function). There are certain natural
associations between output activation functions and various noise
distributions which have been studied by statisticians in the context of
generalized linear models. The output activation function is the inverse of
what statisticians call the "link function". See: 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Jordan, M.I. (1995), "Why the logistic function? A tutorial discussion on
   probabilities and neural networks", 
   ftp://psyche.mit.edu/pub/jordan/uai.ps.Z. 

For more information on activation functions, see Donald Tveter's 
Backpropagator's Review. 

------------------------------------------------------------------------

Subject: What is a softmax activation function? 
================================================

The purpose of the softmax activation function is to make the sum of the
outputs equal to one, so that the outputs are interpretable as posterior
probabilities. Let the net input to each output unit be q_i, i=1,...,c where
c is the number of categories. Then the softmax output p_i is: 

           exp(q_i)
   p_i = ------------
          c
         sum exp(q_j)
         j=1

Unless you are using weight decay or Bayesian estimation or some such thing
that requires the weights to be treated on an equal basis, you can choose
any one of the output units and leave it completely unconnected--just set
the net input to 0. Connecting all of the output units will just give you
redundant weights and will slow down training. To see this, add an arbitrary
constant z to each net input and you get: 

           exp(q_i+z)       exp(q_i) exp(z)       exp(q_i)    
   p_i = ------------   = ------------------- = ------------   
          c                c                     c            
         sum exp(q_j+z)   sum exp(q_j) exp(z)   sum exp(q_j)  
         j=1              j=1                   j=1

so nothing changes. Hence you can always pick one of the output units, and
add an appropriate constant to each net input to produce any desired net
input for the selected output unit, which you can choose to be zero or
whatever is convenient. You can use the same trick to make sure that none of
the exponentials overflows. 
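
A small sketch of softmax incorporating that trick (subtracting the maximum
net input, which by the identity above changes nothing but keeps exp from
overflowing):

```python
import math

def softmax(q):
    """Softmax of the net inputs q_1,...,q_c, shifted by max(q) so that
    no exponential can overflow; the shift cancels in the ratio."""
    z = max(q)
    e = [math.exp(qi - z) for qi in q]
    total = sum(e)
    return [ei / total for ei in e]

p = softmax([1.0, 2.0, 3.0])
assert abs(sum(p) - 1.0) < 1e-12           # outputs sum to one

# adding a constant to every net input leaves the outputs unchanged
p_shifted = softmax([1001.0, 1002.0, 1003.0])
assert all(abs(a - b) < 1e-12 for a, b in zip(p, p_shifted))

# with two categories and q_2 set to 0, softmax reduces to the logistic
q1 = 0.7
assert abs(softmax([q1, 0.0])[0] - 1.0 / (1.0 + math.exp(-q1))) < 1e-12
```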

Statisticians usually call softmax a "multiple logistic" function. It
reduces to the simple logistic function when there are only two categories.
Suppose you choose to set q_2 to 0. Then 

           exp(q_1)         exp(q_1)              1
   p_1 = ------------ = ----------------- = -------------
          c             exp(q_1) + exp(0)   1 + exp(-q_1)
         sum exp(q_j)
         j=1

and p_2, of course, is 1-p_1. 

The softmax function derives naturally from log-linear models and leads to
convenient interpretations of the weights in terms of odds ratios. You
could, however, use a variety of other nonnegative functions on the real
line in place of the exp function. Or you could constrain the net inputs to
the output units to be nonnegative, and just divide by the sum--that's
called the Bradley-Terry-Luce model. 

References: 

   Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward
   Classification Network Outputs, with Relationships to Statistical Pattern
   Recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing:
   Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp.
   227-236. 

   Bridle, J.S. (1990b). Training Stochastic Model Recognition Algorithms as
   Networks can lead to Maximum Mutual Information Estimation of Parameters.
   In: D.S.Touretzky (ed.), Advances in Neural Information Processing
   Systems 2, San Mateo: Morgan Kaufmann, pp. 211-217. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. See Chapter 5. 

------------------------------------------------------------------------

Subject: How do MLPs compare with RBFs? 
========================================

Notation:

      a_j     is the altitude of the jth hidden unit
      b_j     is the bias of the jth hidden unit
      f       is the fan-in of the jth hidden unit 
      h_j     is the activation of the jth hidden unit 
      s       is a common width shared by all hidden units in the layer
      s_j     is the width of the jth hidden unit
      w_ij    is the weight connecting the ith input to
                the jth hidden unit
      w_i     is the common weight for the ith input shared by
                all hidden units in the layer
      x_i     is the ith input

The inputs to each hidden or output unit must be combined with the weights
to yield a single value called the net input. There does not seem to be a
standard term for the function that combines the inputs and weights; I will
use the term "combination function". 

The multilayer perceptron (MLP) has one or more hidden layers for which the
combination function is the inner product of the inputs and weights, plus a
bias. The activation function is typically a logistic or tanh function.
Hence the formula for the activation is typically: 

h_j = tanh( b_j + sum[w_ij*x_i] )

The MLP architecture is the most popular one in practical applications. Each
layer uses a linear combination function. The inputs are fully connected to
the first hidden layer, each hidden layer is fully connected to the next,
and the last hidden layer is fully connected to the outputs. You can also
have "skip-layer" connections; direct connections from inputs to outputs are
especially useful. 

Consider the multidimensional space of inputs to a given hidden unit. Since
an MLP uses linear combination functions, the set of all points in the space
having a given value of the activation function is a hyperplane. The
hyperplanes corresponding to different activation levels are parallel to
each other (the hyperplanes for different units are not parallel in
general). These parallel hyperplanes are the isoactivation contours of the
hidden unit. 
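
The hyperplane geometry is easy to check numerically: moving the input along
any direction orthogonal to the weight vector leaves the net input, and
hence the activation, unchanged. A sketch with illustrative weights:

```python
import math

# One MLP hidden unit: linear combination function plus bias, tanh activation.
w = [2.0, -1.0]   # weights for two inputs
b = 0.5           # bias

def h(x):
    return math.tanh(b + sum(wi * xi for wi, xi in zip(w, x)))

# Any direction orthogonal to w stays on the same isoactivation contour,
# so the contours are parallel (hyper)planes.
x0 = [1.0, 3.0]
ortho = [1.0, 2.0]   # w . ortho = 2*1 + (-1)*2 = 0
for t in (0.5, 1.0, 10.0):
    xt = [x0[i] + t * ortho[i] for i in range(2)]
    assert abs(h(xt) - h(x0)) < 1e-12
```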

To illustrate various architectures, an example with two inputs and one
output will be used so that the results can be shown graphically. The
function being learned resembles a landscape with a Gaussian hill and a
logistic plateau as shown in:
54K ftp://ftp.sas.com/pub/neural/tnnex_hillplat.ps
1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat.sas
In the examples, files with extensions of ".ps" are postscript files
containing graphics, whereas files with extensions of ".sas" and ".bls" are
plain text files. 

An example using an MLP with one hidden layer and an identity output
activation function is shown in these files:
357K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_mlp.bls
885K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_mlp.ps
1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_mlp.sas

Radial basis function (RBF) networks usually have only one hidden layer for
which the combination function is the Euclidean distance between the input
vector and the weight vector, divided by the squared width. There may also
be another term added to the combination function, which determines what I
will call the "altitude" of the unit. 

There are two distinct types of Gaussian RBF architectures. The first type
uses the exp activation function, so the activation of the unit is a
Gaussian "bump" as a function of the inputs. There seems to be no specific
term for this type of Gaussian RBF network; I will use the term "ordinary
RBF", or ORBF, network. 

The second type of Gaussian RBF architecture uses the softmax activation
function, so the activations of all the hidden units are normalized to sum
to one. This type of network is often called a "normalized RBF", or NRBF,
network. 

While the distinction between these two types of Gaussian RBF architectures
is sometimes mentioned in the NN literature, its importance has rarely been
appreciated except by Tao (1993). 

There are several subtypes of both ORBF and NRBF architectures.
Descriptions, formulas, and examples are as follows: 

ORBFUN 
   Ordinary radial basis function (RBF) network with unequal widths
   h_j = exp( f*log(a_j) - s_j^-2 * sum[ (w_ij-x_i)^2 ] )
   177K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_orbfeh.bls
   681K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_orbfeh.ps
   1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_orbfeh.sas
ORBFEQ 
   Ordinary radial basis function (RBF) network with equal widths
   h_j = exp( - s^-2 * sum[ (w_ij-x_i)^2 ] )
   159K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_orbfeq.bls
   715K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_orbfeq.ps
   1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_orbfeq.sas
NRBFUN 
   Normalized RBF network with unequal widths and heights
   h_j = softmax( f*log(a_j) - s_j^-2 * sum[ (w_ij-x_i)^2 ] )
   35K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfun.bls
   221K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfun.ps
   1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfun.sas
NRBFEV 
   Normalized RBF network with equal volumes
   h_j = softmax( f*log(b_j) - s_j^-2 * sum[ (w_ij-x_i)^2 ] )
   39K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfev.bls
   221K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfev.ps
   1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfev.sas
NRBFEH 
   Normalized RBF network with equal heights (and unequal widths)
   h_j = softmax( - s_j^-2 * sum[ (w_ij-x_i)^2 ] )
   35K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfeh.bls
   220K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfeh.ps
   1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfeh.sas
NRBFEW 
   Normalized RBF network with equal widths (and unequal heights)
   h_j = softmax( f*log(a_j) - s^-2 * sum[ (w_ij-x_i)^2 ] )
   170K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfew.bls
   658K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfew.ps
   1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfew.sas
NRBFEQ 
   Normalized RBF network with equal widths and heights
   h_j = softmax( - s^-2 * [(w_ij-x_i)^2] )
   152K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfeq.bls
   659K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfeq.ps
   1K ftp://ftp.sas.com/pub/neural/tnnex_hillplat_nrbfeq.sas
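As a concrete illustration of the difference between the ORBFEQ and NRBFEQ
formulas above, here is a minimal numerical sketch (in Python with NumPy,
which is not part of the FAQ's example software; the centers and inputs are
invented):

```python
import numpy as np

def orbfeq(x, centers, s):
    # ORBFEQ: h_j = exp( - s^-2 * sum_i (w_ij - x_i)^2 )
    d2 = ((centers - x) ** 2).sum(axis=1)   # squared Euclidean distances
    return np.exp(-d2 / s**2)

def nrbfeq(x, centers, s):
    # NRBFEQ: softmax of the same radial combination function
    a = -((centers - x) ** 2).sum(axis=1) / s**2
    e = np.exp(a - a.max())                 # subtract max for stability
    return e / e.sum()                      # activations sum to one

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
h_o = orbfeq(np.array([0.5, 0.5]), centers, s=1.0)
h_n = nrbfeq(np.array([0.5, 0.5]), centers, s=1.0)
```

The normalized activations always sum to one, which is what makes each NRBF
output a weighted average of the hidden-to-output weights.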

The ORBF architectures use radial combination functions and the exp
activation function. Only two of the radial combination functions are useful
with ORBF architectures. For radial combination functions including an
altitude, the altitude would be redundant with the hidden-to-output weights.

Radial combination functions are based on the Euclidean distance between the
vector of inputs to the unit and the vector of corresponding weights. Thus,
the isoactivation contours for ORBF networks are concentric hyperspheres. A
variety of activation functions can be used with the radial combination
function, but the exp activation function, yielding a Gaussian surface, is
the most useful. Radial networks typically have only one hidden layer, but
it can be useful to include a linear layer for dimensionality reduction or
oblique rotation before the RBF layer. 

The output of an ORBF network consists of a number of superimposed bumps,
hence the output is quite bumpy unless many hidden units are used. Thus an
ORBF network with only a few hidden units is incapable of fitting a wide
variety of simple, smooth functions, and should rarely be used. 

The NRBF architectures also use radial combination functions but the
activation function is softmax, which forces the sum of the activations for
the hidden layer to equal one. Thus, each output unit computes a weighted
average of the hidden-to-output weights, and the output values must lie
within the range of the hidden-to-output weights. 
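A two-line illustration of this point (the activations and weights here are
hypothetical numbers, not taken from any network in the FAQ):

```python
# Hypothetical normalized hidden activations (softmax output: sums to one)
h = [0.2, 0.5, 0.3]
w_out = [-1.0, 2.0, 0.5]     # hypothetical hidden-to-output weights

# The output is a weighted average of w_out, so it must lie
# between min(w_out) and max(w_out).
y = sum(hj * wj for hj, wj in zip(h, w_out))
```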

Radial combination functions incorporating altitudes are useful with NRBF
architectures. The NRBF architectures combine some of the virtues of both
the RBF and MLP architectures, as explained below. However, the
isoactivation contours are considerably more complicated than for ORBF
architectures. 

Consider the case of an NRBF network with only two hidden units. If the
hidden units have equal widths, the isoactivation contours are parallel
hyperplanes; in fact, this network is equivalent to an MLP with one logistic
hidden unit. If the hidden units have unequal widths, the isoactivation
contours are concentric hyperspheres; such a network is almost equivalent to
an ORBF network with one Gaussian hidden unit. 

If there are more than two hidden units in an NRBF network, the
isoactivation contours have no such simple characterization. If the RBF
widths are very small, the isoactivation contours are approximately
piecewise linear for RBF units with equal widths, and approximately
piecewise spherical for RBF units with unequal widths. The larger the
widths, the smoother the isoactivation contours where the pieces join. 

The NRBFEQ architecture is a smoothed variant of the learning vector
quantization (Kohonen 1988; Ripley 1996) and counterpropagation
(Hecht-Nielsen 1990) architectures. In LVQ and counterprop, the hidden
units are often called codebook vectors. LVQ amounts to nearest-neighbor
classification on the codebook vectors, while counterprop is
nearest-neighbor regression on the codebook vectors. The NRBFEQ architecture
uses not just the single nearest neighbor, but a weighted average of near
neighbors. As the width of the NRBFEQ functions approaches zero, the weights
approach one for the nearest neighbor and zero for all other codebook
vectors. LVQ and counterprop use ad hoc algorithms of uncertain reliability,
but standard numerical optimization algorithms (not to mention backprop) can
be applied with the NRBFEQ architecture. 

In an NRBFEQ architecture, if each observation is taken as an RBF center, and
if the weights are taken to be the target values, the outputs are simply
weighted averages of the target values, and the network is identical to the
well-known Nadaraya-Watson kernel regression estimator, which has been
reinvented at least twice in the neural net literature (see "What is
GRNN?"). A similar NRBFEQ network used for classification is equivalent to
kernel discriminant analysis (see "What is PNN?"). 

Kernels with variable widths are also used for regression in the statistical
literature. Such kernel estimators correspond to the NRBFEV
architecture, in which the kernel functions have equal volumes but different
altitudes. In the neural net literature, variable-width kernels appear
always to be of the NRBFEH variety, with equal altitudes but unequal
volumes. The analogy with kernel regression would make the NRBFEV
architecture the obvious choice, but which of the two architectures works
better in practice is an open question. 

Hybrid training and the curse of dimensionality
+++++++++++++++++++++++++++++++++++++++++++++++

A comparison of the various architectures must separate training issues from
architectural issues to avoid common sources of confusion. RBF networks are
often trained by hybrid methods, in which the hidden weights (centers) are
first obtained by unsupervised learning, after which the output weights are
obtained by supervised learning. Unsupervised methods for choosing the
centers include: 

1. Distribute the centers in a regular grid over the input space. 
2. Choose a random subset of the training cases to serve as centers. 
3. Cluster the training cases based on the input variables, and use the mean
   of each cluster as a center. 

Various heuristic methods are also available for choosing the RBF widths.
Once the centers and widths are fixed, the output weights can be learned
very efficiently, since the computation reduces to a linear or generalized
linear model. The hybrid training approach can thus be much faster than the
nonlinear optimization that would be required for supervised training of all
of the weights in the network. 
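The hybrid scheme can be sketched in a few lines (Python/NumPy, not from the
FAQ's software; the data, the random-subset method (2) for the centers, and
the hand-picked width are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])                  # target function to be learned

# Unsupervised step, method (2): a random subset of cases as centers,
# plus a heuristic common width.
centers = X[rng.choice(len(X), size=20, replace=False)]
width = 0.3

def design(X, centers, s):
    # Gaussian RBF activations for every (case, center) pair
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / s**2)

# Supervised step: with centers and width fixed, the output weights are
# just a linear least-squares problem H w = y.
H = design(X, centers, width)
w, *_ = np.linalg.lstsq(H, y, rcond=None)
pred = H @ w
```

Because the supervised step reduces to a linear model, it is fast; the price,
as discussed below, is that more hidden units are usually needed than with
fully supervised training.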

Hybrid training is not often applied to MLPs because no effective methods
are known for unsupervised training of the hidden units (except when there
is only one input). 

Hybrid training will usually require more hidden units than supervised
training. Since supervised training optimizes the locations of the centers,
while hybrid training does not, supervised training will provide a better
approximation to the function to be learned for a given number of hidden
units. Thus, the better fit provided by supervised training will often let
you use fewer hidden units for a given accuracy of approximation than you
would need with hybrid training. And if the hidden-to-output weights are
learned by linear least-squares, the fact that hybrid training requires more
hidden units implies that hybrid training will also require more training
cases for the same accuracy of generalization (Tarassenko and Roberts 1994).

The number of hidden units required by hybrid methods becomes an
increasingly serious problem as the number of inputs increases. In fact, the
required number of hidden units tends to increase exponentially with the
number of inputs. This drawback of hybrid methods is discussed by Minsky and
Papert (1969). For example, with method (1) for RBF networks, you would need
at least five elements in the grid along each dimension to detect a moderate
degree of nonlinearity; so if you have Nx inputs, you would need at least 
5^Nx hidden units. For methods (2) and (3), the number of hidden units
increases exponentially with the effective dimensionality of the input
distribution. If the inputs are linearly related, the effective
dimensionality is the number of nonnegligible (a deliberately vague term)
eigenvalues of the covariance matrix, so the inputs must be highly
correlated if the effective dimensionality is to be much less than the
number of inputs. 
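The grid arithmetic in the example above grows quickly (trivial sketch):

```python
# At least 5 grid points along each of Nx input dimensions
# gives 5^Nx hidden units.
units = {nx: 5 ** nx for nx in (1, 2, 4, 8)}
```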

The exponential increase in the number of hidden units required for hybrid
learning is one aspect of the curse of dimensionality (Bellman 1961; Scott
1992). The number of training cases required also increases exponentially in
general. No neural network architecture--in fact no method of learning or
statistical estimation--can escape the curse of dimensionality in general,
hence there is no practical method of learning general functions in more
than a few dimensions. 

Fortunately, in many practical applications of neural networks with a large
number of inputs, most of those inputs are additive, redundant, or
irrelevant, and some architectures can take advantage of these properties to
yield useful results. But escape from the curse of dimensionality requires
fully supervised learning. 

Additive inputs
+++++++++++++++

An additive model is one in which the output is a sum of linear or nonlinear
transformations of the inputs. If an additive model is appropriate, the
number of weights increases linearly with the number of inputs, so high
dimensionality is not a curse. Various methods of training additive models
are available in the statistical literature (e.g. Hastie and Tibshirani
1990). You can also create a feedforward neural network, called a 
generalized additive network (GAN), to fit additive models (Sarle 1994).
Additive models have been proposed in the neural net literature under the
name "topologically distributed encoding" (Geiger 1990). 

Projection pursuit regression (PPR) provides both universal approximation
and the ability to avoid the curse of dimensionality for certain common
types of target functions (Friedman and Stuetzle 1981). Like MLPs, PPR
computes the output as a sum of nonlinear transformations of linear
combinations of the inputs. Each term in the sum is analogous to a hidden
unit in an MLP. But unlike MLPs, PPR allows general, smooth nonlinear
transformations rather than a specific nonlinear activation function, and
allows a different transformation for each term. The nonlinear
transformations in PPR are usually estimated by nonparametric regression,
but you can set up a projection pursuit network (PPN), in which each
nonlinear transformation is performed by a subnetwork. If a PPN provides an
adequate fit with few terms, then the curse of dimensionality can be
avoided, and the results may even be interpretable. 

If the target function can be accurately approximated by projection pursuit,
then it can also be accurately approximated by an MLP with a single hidden
layer. The disadvantage of the MLP is that there is little hope of
interpretability. An MLP with two or more hidden layers can provide a
parsimonious fit to a wider variety of target functions than can projection
pursuit, but no simple characterization of these functions is known. 

Redundant inputs
++++++++++++++++

With proper training, all of the RBF architectures listed above, as well as
MLPs, can process redundant inputs effectively. When there are redundant
inputs, the training cases lie close to some (possibly nonlinear) subspace.
If the same degree of redundancy applies to the test cases, the network need
produce accurate outputs only near the subspace occupied by the data. Adding
redundant inputs has little effect on the effective dimensionality of the
data; hence the curse of dimensionality does not apply. However, if the test
cases do not follow the same pattern of redundancy as the training cases,
generalization will require extrapolation and will rarely work. 

Irrelevant inputs
+++++++++++++++++

MLP architectures are good at ignoring irrelevant inputs. MLPs can also
select linear subspaces of reduced dimensionality. Since the first hidden
layer forms linear combinations of the inputs, it confines the network's
attention to the linear subspace spanned by the weight vectors. Hence,
adding irrelevant inputs to the training data does not increase the number
of hidden units required, although it increases the amount of training data
required. 

ORBF architectures are not good at ignoring irrelevant inputs. The number of
hidden units required grows exponentially with the number of inputs,
regardless of how many inputs are relevant. This exponential growth is
related to the fact that RBFs and ERBFs have local receptive fields, meaning
that changing the hidden-to-output weights of a given unit will affect the
output of the network only in a neighborhood of the center of the hidden
unit, where the size of the neighborhood is determined by the width of the
hidden unit. (Of course, if the width of the unit is learned, the receptive
field could grow to cover the entire training set.) 

Local receptive fields are often an advantage compared to the distributed
architecture of MLPs, since local units can adapt to local patterns in the
data without having unwanted side effects in other regions. In a distributed
architecture such as an MLP, adapting the network to fit a local pattern in
the data can cause spurious side effects in other parts of the input space. 

However, ORBF architectures often must be used with relatively small
neighborhoods, so that several hidden units are required to cover the range
of an input. When there are many nonredundant inputs, the hidden units must
cover the entire input space, and the number of units required is
essentially the same as in the hybrid case (1) where the centers are in a
regular grid; hence the exponential growth in the number of hidden units
with the number of inputs, regardless of whether the inputs are relevant. 

You can enable an ORBF architecture to ignore irrelevant inputs by using an
extra, linear hidden layer before the radial hidden layer. This type of
network is sometimes called an elliptical basis function network. If the
number of units in the linear hidden layer equals the number of inputs, the
linear hidden layer performs an oblique rotation of the input space that can
suppress irrelevant directions and differentially weight relevant directions
according to their importance. If you think that the presence of irrelevant
inputs is highly likely, you can force a reduction of dimensionality by
using fewer units in the linear hidden layer than the number of inputs. 

Note that the linear and radial hidden layers must be connected in series,
not in parallel, to ignore irrelevant inputs. In some applications it is
useful to have linear and radial hidden layers connected in parallel, but in
such cases the radial hidden layer will be sensitive to all inputs. 

For even greater flexibility (at the cost of more weights to be learned),
you can have a separate linear hidden layer for each RBF unit, allowing a
different oblique rotation for each RBF unit. 

NRBF architectures with equal widths (NRBFEW and NRBFEQ) combine the
advantage of local receptive fields with the ability to ignore irrelevant
inputs. The receptive field of one hidden unit extends from the center in
all directions until it encounters the receptive field of another hidden
unit. It is convenient to think of a "boundary" between the two receptive
fields, defined as the hyperplane where the two units have equal
activations, even though the effect of each unit will extend somewhat beyond
the boundary. The location of the boundary depends on the heights of the
hidden units. If the two units have equal heights, the boundary lies midway
between the two centers. If the units have unequal heights, the boundary is
farther from the higher unit. 

If a hidden unit is surrounded by other hidden units, its receptive field is
indeed local, curtailed by the field boundaries with other units. But if a
hidden unit is not completely surrounded, its receptive field can extend
infinitely in certain directions. If there are irrelevant inputs, or more
generally, irrelevant directions that are linear combinations of the inputs,
the centers need only be distributed in a subspace orthogonal to the
irrelevant directions. In this case, the hidden units can have local
receptive fields in relevant directions but infinite receptive fields in
irrelevant directions. 

For NRBF architectures allowing unequal widths (NRBFUN, NRBFEV, and NRBFEH),
the boundaries between receptive fields are generally hyperspheres rather
than hyperplanes. In order to ignore irrelevant inputs, such networks must
be trained to have equal widths. Hence, if you think there is a strong
possibility that some of the inputs are irrelevant, it is usually better to
use an architecture with equal widths. 

References: 

   Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton
   University Press. 

   Friedman, J.H. and Stuetzle, W. (1981), "Projection pursuit regression,"
   J. of the American Statistical Association, 76, 817-823. 

   Geiger, H. (1990), "Storing and Processing Information in Connectionist
   Systems," in Eckmiller, R., ed., Advanced Neural Computers, 271-277,
   Amsterdam: North-Holland. 

   Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models,
   London: Chapman & Hall. 

   Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley. 

   Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1
   (suppl 1), 303. 

   Minsky, M.L. and Papert, S.A. (1969), Perceptrons, Cambridge, MA: MIT
   Press. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
   Cambridge: Cambridge University Press. 

   Sarle, W.S. (1994), "Neural Networks and Statistical Models," in SAS
   Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group
   International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550, 
   ftp://ftp.sas.com/pub/neural/neural1.ps. 

   Scott, D.W. (1992), Multivariate Density Estimation, NY: Wiley. 

   Tao, K.M. (1993), "A closer look at the radial basis function (RBF)
   networks," Conference Record of The Twenty-Seventh Asilomar
   Conference on Signals, Systems and Computers (Singh, A., ed.), vol 1,
   401-405, Los Alamitos, CA: IEEE Comput. Soc. Press. 

   Tarassenko, L. and Roberts, S. (1994), "Supervised and unsupervised
   learning in radial basis function classifiers," IEE Proceedings-- Vis.
   Image Signal Processing, 141, 210-216. 

------------------------------------------------------------------------

Subject: Should I normalize/standardize/rescale the data?
=========================================================

First, some definitions. "Rescaling" a vector means to add or subtract a
constant and then multiply or divide by a constant, as you would do to
change the units of measurement of the data, for example, to convert a
temperature from Celsius to Fahrenheit. 

"Normalizing" a vector most often means dividing by a norm of the vector,
for example, to make the Euclidean length of the vector equal to one. In the
NN literature, "normalizing" also often refers to rescaling by the minimum
and range of the vector, to make all the elements lie between 0 and 1. 

"Standardizing" a vector most often means subtracting a measure of location
and dividing by a measure of scale. For example, if the vector contains
random values with a Gaussian distribution, you might subtract the mean and
divide by the standard deviation, thereby obtaining a "standard normal"
random variable with mean 0 and standard deviation 1. 

However, all of the above terms are used more or less interchangeably
depending on the customs within various fields. Since the FAQ maintainer is
a statistician, he is going to use the term "standardize" because that is
what he is accustomed to. 
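The operations defined above can be written out for a single vector as
follows (a Python/NumPy sketch; the vector and the Fahrenheit-to-Celsius
example are arbitrary):

```python
import numpy as np

v = np.array([10.0, 20.0, 30.0, 40.0])

# Rescaling: add/subtract and multiply/divide by constants,
# e.g. Fahrenheit -> Celsius.
rescaled = (v - 32.0) * 5.0 / 9.0

# "Normalizing", first sense: divide by a norm (Euclidean length 1).
unit_length = v / np.linalg.norm(v)

# "Normalizing", second sense: rescale by minimum and range to [0,1].
zero_one = (v - v.min()) / (v.max() - v.min())

# Standardizing: subtract a location measure, divide by a scale measure.
standardized = (v - v.mean()) / v.std()
```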

Now the question is, should you do any of these things to your data? The
answer is, it depends. 

There is a common misconception that the inputs to a multilayer perceptron
must be in the interval [0,1]. There is in fact no such requirement,
although there often are benefits to standardizing the inputs as discussed
below. 

If your output activation function has a range of [0,1], then obviously you
must ensure that the target values lie within that range. But it is
generally better to choose an output activation function suited to the
distribution of the targets than to force your data to conform to the output
activation function. See "Why use activation functions?" 

When using an output activation with a range of [0,1], some people prefer to
rescale the targets to a range of [.1,.9]. I suspect that the popularity of
this gimmick is due to the slowness of standard backprop. But using a target
range of [.1,.9] for a classification task gives you incorrect posterior
probability estimates, and it is quite unnecessary if you use an efficient
training algorithm (see "What are conjugate gradients, Levenberg-Marquardt,
etc.?") 

Now for some of the gory details: note that the training data form a matrix.
You could set up this matrix so that each case forms a row, and the inputs
and target values form columns. You could conceivably standardize the rows
or the columns or both or various other things, and these different ways of
choosing vectors to standardize will have quite different effects on
training. 

Subquestion: Should I standardize the input column vectors?
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

That depends primarily on how the network combines inputs to compute the net
input to the next (hidden or output) layer. If the inputs are combined
linearly, as in a multilayer perceptron, then it is rarely strictly
necessary to standardize the inputs, at least in theory. The reason is that
any rescaling of an input vector can be effectively undone by changing the
corresponding weights and biases, leaving you with the exact same outputs as
you had before. However, there are a variety of practical reasons why
standardizing the inputs can make training faster and reduce the chances of
getting stuck in local optima. 
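The "rescaling can be undone by the weights" point can be checked directly
(a toy single-input unit; the rescaling constants and weights are invented):

```python
# Rescale the input by x -> a*x + b, then adjust the weight and bias
# so the net input to the unit is unchanged.
a, b = 9.0 / 5.0, 32.0            # e.g. Celsius -> Fahrenheit
w, bias = 0.7, -0.3               # original input weight and bias

x = 25.0
net_original = w * x + bias

x_rescaled = a * x + b
net_adjusted = (w / a) * x_rescaled + (bias - w * b / a)
# net_adjusted equals net_original, so the output is unchanged
```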

If the inputs are combined via some distance function, such as Euclidean
distance as in a radial basis-function network, standardizing inputs can be
crucial. Rescaling an input cannot be undone by adjusting the weights. The
contribution of an input will depend heavily on its variability relative to
other inputs. If one input has a range of 0 to 1, while another input has a
range of 0 to 1,000,000, then the contribution of the first input to the
distance will be swamped by the second input. So it is essential to rescale
the inputs so that their variability reflects their importance, or at least
is not in inverse relation to their importance. For lack of better prior
information, it is common to standardize each input to the same range or the
same standard deviation. 
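The swamping effect is easy to see numerically (invented toy values, one
input with range [0,1] and one with range [0,1,000,000]):

```python
import numpy as np

a = np.array([0.2, 100000.0])
b = np.array([0.9, 100200.0])

# Raw Euclidean distance: dominated entirely by the second input.
raw_dist = np.linalg.norm(a - b)

# Rescale each input by its range before computing the distance.
scale = np.array([1.0, 1e6])
std_dist = np.linalg.norm((a - b) / scale)
```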

For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 

Subquestion: Should I standardize the target column vectors?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Standardizing targets is typically more a convenience for getting good
initial weights than a necessity. However, if you have two or more target
variables and your error function is scale-sensitive like the usual least
(mean) squares error function, then the variability of each target relative
to the others can affect how well the net learns that target. If one target
has a range of 0 to 1, while another target has a range of 0 to 1,000,000,
the net will expend most of its effort learning the second target to the
possible exclusion of the first. So it is essential to rescale the targets
so that their variability reflects their importance, or at least is not in
inverse relation to their importance. If the targets are of equal
importance, they should typically be standardized to the same range or the
same standard deviation. 

For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 

Subquestion: Should I standardize the input cases (row vectors)?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Whereas standardizing variables is usually beneficial, the effect of
standardizing cases (row vectors) depends on the particular data. Cases are
typically standardized only across the input variables, since including the
target variable(s) in the standardization would make prediction impossible. 

There are some kinds of networks, such as simple Kohonen nets, where it is
necessary to standardize the input cases to a common Euclidean length; this
is a side effect of the use of the inner product as a similarity measure. If
the network is modified to operate on Euclidean distances instead of inner
products, it is no longer necessary to standardize the input cases. 
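Standardizing cases to a common Euclidean length is a one-liner (sketch with
invented rows):

```python
import numpy as np

X = np.array([[3.0, 4.0], [1.0, 1.0]])

# Divide each row by its Euclidean norm, so every case has length 1
# and inner products behave like cosines of angles.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
```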

Standardization of cases should be approached with caution because it
discards information. If that information is irrelevant, then standardizing
cases can be quite helpful. If that information is important, then
standardizing cases can be disastrous. Issues regarding the standardization
of cases must be carefully evaluated in every application. There are no
rules of thumb that apply to all applications. 

For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 

------------------------------------------------------------------------

Subject: What is ART?
=====================

ART stands for "Adaptive Resonance Theory", invented by Stephen Grossberg in
1976. ART encompasses a wide variety of neural networks based explicitly on
neurophysiology. ART networks are defined algorithmically in terms of
detailed differential equations intended as plausible models of biological
neurons. In practice, ART networks are implemented using analytical
solutions or approximations to these differential equations. 

ART comes in several flavors, both supervised and unsupervised. As discussed
by Moore (1988), the unsupervised ARTs are basically similar to many
iterative clustering algorithms in which each case is processed by: 

1. finding the "nearest" cluster seed (AKA prototype or template) to that
   case 
2. updating that cluster seed to be "closer" to the case 

where "nearest" and "closer" can be defined in hundreds of different ways.
In ART, the framework is modified slightly by introducing the concept of
"resonance" so that each case is processed by: 

1. finding the "nearest" cluster seed that "resonates" with the case 
2. updating that cluster seed to be "closer" to the case 

"Resonance" is just a matter of being within a certain threshold of a second
similarity measure. A crucial feature of ART is that if no seed resonates
with the case, a new cluster is created as in Hartigan's (1975) leader
algorithm. This feature is said to solve the stability/plasticity dilemma. 

ART has its own jargon. For example, data are called an arbitrary sequence
of input patterns. The current training case is stored in short term memory
and cluster seeds are long term memory. A cluster is a maximally compressed
pattern recognition code. The two stages of finding the nearest seed to the
input are performed by an Attentional Subsystem and an Orienting Subsystem,
the latter of which performs hypothesis testing, which simply refers to the
comparison with the vigilance threshold, not to hypothesis testing in the
statistical sense. Stable learning means that the algorithm converges. So
the oft-repeated claim that ART algorithms are "capable of rapid stable
learning of recognition codes in response to arbitrary sequences of input
patterns" merely means that ART algorithms are clustering algorithms that
converge; it does not mean, as one might naively assume, that the clusters
are insensitive to the sequence in which the training patterns are
presented--quite the opposite is true. 

There are various supervised ART algorithms that are named with the suffix
"MAP", as in Fuzzy ARTMAP. These algorithms cluster both the inputs and
targets and associate the two sets of clusters. The effect is somewhat
similar to counterpropagation. The main disadvantage of the ARTMAP
algorithms is that they have no mechanism to avoid overfitting and hence
should not be used with noisy data. 

For more information, see the ART FAQ at http://www.wi.leidenuniv.nl/art/
and the "ART Headquarters" at Boston University, http://cns-web.bu.edu/. For
a different view of ART, see Sarle, W.S. (1995), "Why Statisticians Should
Not FART," ftp://ftp.sas.com/pub/neural/fart.doc. 

References: 

   Carpenter, G.A., Grossberg, S. (1996), "Learning, Categorization, Rule
   Formation, and Prediction by Fuzzy Neural Networks," in Chen, C.H., ed.
   (1996) Fuzzy Logic and Neural Network Handbook, NY: McGraw-Hill, pp.
   1.3-1.45. 

   Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley. 

   Kasuba, T. (1993), "Simplified Fuzzy ARTMAP," AI Expert, 8, 18-25. 

   Moore, B. (1988), "ART 1 and Pattern Clustering," in Touretzky, D.,
   Hinton, G. and Sejnowski, T., eds., Proceedings of the 1988
   Connectionist Models Summer School, 174-185, San Mateo, CA: Morgan
   Kaufmann. 

------------------------------------------------------------------------

Subject: What is PNN?
=====================

PNN or "Probabilistic Neural Network" is Donald Specht's term for kernel
discriminant analysis. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are usually 1 or 0; for
each hidden unit, a weight of 1 is used for the connection going to the
output that the case belongs to, while all other connections are given
weights of 0. Alternatively, you can adjust these weights for the prior
probabilities of each class. So the only weights that need to be learned are
the widths of the RBF units. These widths (often a single width is used) are
called "smoothing parameters" or "bandwidths" and are usually chosen by
cross-validation or by more esoteric methods that are not well-known in the
neural net literature; gradient descent is not used. 
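The description above can be sketched in a few lines (Python/NumPy, not
Specht's implementation; the Gaussian kernel, the hand-picked bandwidth, and
the data are for illustration only, and the class priors are taken as equal):

```python
import numpy as np

def pnn_classify(x, X_train, y_train, bandwidth):
    # One Gaussian kernel per training case, summed within each class;
    # the predicted class is the one with the largest kernel sum.
    k = np.exp(-((X_train - x) ** 2).sum(axis=1) / (2 * bandwidth**2))
    classes = np.unique(y_train)
    scores = np.array([k[y_train == c].sum() for c in classes])
    return classes[scores.argmax()]

X = np.array([[0.0], [0.2], [1.0], [1.2]])
y = np.array([0, 0, 1, 1])
label = pnn_classify(np.array([0.1]), X, y, bandwidth=0.5)
```

In practice the bandwidth would be chosen by cross-validation rather than by
hand, as noted above.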

Specht's claim that a PNN trains 100,000 times faster than backprop is at
best misleading. While they are not iterative in the same sense as backprop,
kernel methods require that you estimate the kernel bandwidth, and this
requires accessing the data many times. Furthermore, computing a single
output value with kernel methods requires either accessing the entire
training data or clever programming, and either way is much slower than
computing an output with a feedforward net. And there are a variety of
methods for training feedforward nets that are much faster than standard
backprop. So depending on what you are doing and how you do it, PNN may be
either faster or slower than a feedforward net. 

PNN is a universal approximator for smooth class-conditional densities, so
it should be able to solve any smooth classification problem given enough
data. The main drawback of PNN is that, like kernel methods in general, it
suffers badly from the curse of dimensionality. PNN cannot ignore irrelevant
inputs without major modifications to the basic algorithm. So PNN is not
likely to be the top choice if you have more than 5 or 6 nonredundant
inputs. 

But if all your inputs are relevant, PNN has the very useful ability to tell
you whether a test case is similar (i.e. has a high density) to any of the
training data; if not, you are extrapolating and should view the output
classification with skepticism. This ability is of limited use when you have
irrelevant inputs, since the similarity is measured with respect to all of
the inputs, not just the relevant ones. 

References: 

   Hand, D.J. (1982) Kernel Discriminant Analysis, Research Studies Press. 

   McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern
   Recognition, Wiley. 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine
   Learning, Neural and Statistical Classification, Ellis Horwood. 

   Scott, D.W. (1992) Multivariate Density Estimation, Wiley. 

   Specht, D.F. (1990) "Probabilistic neural networks," Neural Networks, 3,
   110-118. 

------------------------------------------------------------------------

Subject: What is GRNN?
======================

GRNN or "General Regression Neural Network" is Donald Specht's term for
Nadaraya-Watson kernel regression, also reinvented in the NN literature by
Schiøler and Hartmann. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are just the target
values, so the output is simply a weighted average of the target values of
training cases close to the given input case. The only weights that need to
be learned are the widths of the RBF units. These widths (often a single
width is used) are called "smoothing parameters" or "bandwidths" and are
usually chosen by cross-validation or by more esoteric methods that are not
well-known in the neural net literature; gradient descent is not used. 
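
The weighted-average computation just described can be sketched in a few
lines of Python (the function and variable names here are invented for the
example; a Gaussian kernel and a single bandwidth are assumed):

```python
import math

def grnn_predict(x, train_inputs, train_targets, bandwidth=1.0):
    """Nadaraya-Watson kernel regression (GRNN) sketch.

    The output is a weighted average of the training targets, where each
    weight is a Gaussian kernel of the distance from x to that training
    input: cases close to x dominate the average.
    """
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi))
                        / (2.0 * bandwidth ** 2))
               for xi in train_inputs]
    return (sum(w * t for w, t in zip(weights, train_targets))
            / sum(weights))

# Noisy samples of y = x^2; the prediction at 1.5 is pulled toward the
# targets of the nearby training cases at 1.0 and 2.0.
inputs = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
targets = [4.1, 0.9, 0.0, 1.1, 3.9]
print(grnn_predict((1.5,), inputs, targets, bandwidth=0.5))
```

Note that nothing is "trained" here except the bandwidth, which is exactly
the situation described above: the training cases themselves are the network.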

GRNN is a universal approximator for smooth functions, so it should be able
to solve any smooth function-approximation problem given enough data. The
main drawback of GRNN is that, like kernel methods in general, it suffers
badly from the curse of dimensionality. GRNN cannot ignore irrelevant inputs
without major modifications to the basic algorithm. So GRNN is not likely to
be the top choice if you have more than 5 or 6 nonredundant inputs. 

References: 

   Caudill, M. (1993), "GRNN and Bear It," AI Expert, Vol. 8, No. 5 (May),
   28-33. 

   Haerdle, W. (1990), Applied Nonparametric Regression, Cambridge Univ.
   Press. 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Nadaraya, E.A. (1964) "On estimating regression", Theory Probab. Applic.
   10, 186-90. 

   Schiøler, H. and Hartmann, U. (1992) "Mapping Neural Network Derived
   from the Parzen Window Estimator", Neural Networks, 5, 903-909. 

   Specht, D.F. (1968) "A practical technique for estimating general
   regression surfaces," Lockheed report LMSC 6-79-68-6, Defense Technical
   Information Center AD-672505. 

   Specht, D.F. (1991) "A Generalized Regression Neural Network", IEEE
   Transactions on Neural Networks, 2, Nov. 1991, 568-576. 

   Watson, G.S. (1964) "Smooth regression analysis", Sankhyā, Series A,
   26, 359-72. 

------------------------------------------------------------------------

Subject: What about Genetic Algorithms?
=======================================

There are a number of definitions of GA (Genetic Algorithm). A possible one
is

  A GA is an optimization program
  that starts with
  a population of encoded procedures,       (Creation of Life :-> )
  mutates them stochastically,              (Get cancer or so :-> )
  and uses a selection process              (Darwinism)
  to prefer the mutants with high fitness
  and perhaps a recombination process       (Make babies :-> )
  to combine properties of (preferably) the successful mutants.
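
The definition above can be sketched in a few lines of Python on a toy
"count the 1-bits" fitness function (all names and parameter values here
are made up for the example; real GAs use far more elaborate encodings,
selection schemes, and operators):

```python
import random

random.seed(0)

def fitness(bits):
    # Toy "OneMax" fitness: number of 1-bits; the optimum is all ones.
    return sum(bits)

def evolve(pop_size=20, length=16, generations=60, mutation_rate=0.05):
    # Population of encoded procedures (here, plain bit strings).
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]              # selection (Darwinism)
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)     # recombination
            cut = random.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            child = [b ^ (random.random() < mutation_rate)  # mutation
                     for b in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

Even this crude sketch reliably drives the population toward the all-ones
optimum, which is the whole point: no gradient information is used, only
mutation, recombination, and selection by fitness.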

Genetic Algorithms are just a special case of the more general idea of
``evolutionary computation''. There is a newsgroup that is dedicated to the
field of evolutionary computation called comp.ai.genetic. It has a detailed
FAQ posting which, for instance, explains the terms "Genetic Algorithm",
"Evolutionary Programming", "Evolution Strategy", "Classifier System", and
"Genetic Programming". That FAQ also contains pointers to relevant
literature, software, and other sources of information. Please see the
comp.ai.genetic FAQ for further details. 

------------------------------------------------------------------------

Subject: What about Fuzzy Logic?
================================

Fuzzy Logic is an area of research based on the work of L.A. Zadeh. It is a
departure from classical two-valued sets and logic: it uses "soft"
linguistic system variables (e.g. large, hot, tall) and a continuous range
of truth values in the interval [0,1], rather than strict binary (True or
False) decisions and assignments.

Fuzzy logic is used where a system is difficult to model exactly (but an
inexact model is available), is controlled by a human operator or expert, or
where ambiguity or vagueness is common. A typical fuzzy system consists of a
rule base, membership functions, and an inference procedure.
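
As a small illustration of membership functions and inference (in Python;
the linguistic variables and their breakpoints are invented for this
example), a triangular membership function maps a crisp input to a degree
of truth in [0,1], and fuzzy AND/OR are commonly evaluated with min/max:

```python
def triangular(x, a, b, c):
    """Triangular membership function: degree of membership rises
    linearly from a to a peak at b, then falls back to zero at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical linguistic variables for temperature in degrees C.
def warm(t): return triangular(t, 15.0, 22.0, 30.0)
def hot(t):  return triangular(t, 25.0, 35.0, 45.0)

t = 27.0
# A rule like "IF warm AND hot THEN ..." fires to the degree
# min(warm(t), hot(t)); OR would use max.  Note that t = 27 is
# simultaneously somewhat warm AND somewhat hot -- the soft
# assignments the text describes, rather than a binary decision.
print(warm(t), hot(t), min(warm(t), hot(t)))
```

A full fuzzy controller would combine many such rule activations and
defuzzify the result (e.g. by a centroid calculation) into a crisp output.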

Most Fuzzy Logic discussion takes place in the newsgroup comp.ai.fuzzy, but
there is also some work (and discussion) about combining fuzzy logic with
Neural Network approaches in comp.ai.neural-nets.

References: 

   Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY:
   McGraw-Hill, ISBN 0-07-011189-8. 

   Klir, G.J. and Folger, T.A.(1988), Fuzzy Sets, Uncertainty, and
   Information, Englewood Cliffs, N.J.: Prentice-Hall. 

   Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
   N.J.: Prentice-Hall. 

------------------------------------------------------------------------

Next part is part 3 (of 7). Previous part is part 1. 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
