Newsgroups: comp.ai.neural-nets,comp.answers,news.answers
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!oitnews.harvard.edu!news.sesqui.net!uuneo.neosoft.com!news.blkbox.COM!academ!bcm.tmc.edu!news.msfc.nasa.gov!newsfeed.internetmci.com!news.mathworks.com!uhog.mit.edu!news.mtholyoke.edu!world!mv!barney.gvi.net!redstone.interpath.net!sas!mozart.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn2.posting_823026416@hotellng.unx.sas.com>
Supersedes: <nn2.posting_820193057@hotellng.unx.sas.com>
Approved: news-answers-request@MIT.EDU
Date: Tue, 30 Jan 1996 18:26:58 GMT
Expires: Tue, 5 Mar 1996 18:26:56 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: frequently asked questions, answers
Followup-To: comp.ai.neural-nets
Lines: 433
Xref: glinda.oz.cs.cmu.edu comp.ai.neural-nets:29577 comp.answers:16668 news.answers:63360


Archive-name: ai-faq/neural-nets/part2
Last-modified: 1996-01-27
URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

This is part 2 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See part 1 of this posting for full information
about what it covers.

========== Questions ========== 
********************************

Part 1: Introduction

   What is this newsgroup for? How shall it be used?
   What is a neural network (NN)?
   What can you do with a Neural Network and what not?
   Who is concerned with Neural Networks?

Part 2: Learning

   What does 'backprop' mean? What is 'overfitting'?
   Why use a bias input? Why activation functions?
   How many hidden units should I use?
   How many learning methods for NNs exist? Which?
   What about Genetic Algorithms and Evolutionary Computation?
   What about Fuzzy Logic?
   How are NNs related to statistical methods?

Part 3: Information resources

   Good introductory literature about Neural Networks?
   Any journals and magazines about Neural Networks?
   The most important conferences concerned with Neural Networks?
   Neural Network Associations?
   Other sources of information about NNs?

Part 4: Datasets

   Databases for experimentation with NNs?

Part 5: Free software

   Freely available software packages for NN simulation?

Part 6: Commercial software

   Commercial software packages for NN simulation?

Part 7: Hardware

   Neural Network hardware?

------------------------------------------------------------------------

Subject: What does 'backprop' mean? What is 'overfitting'?
==========================================================

Backprop is short for backpropagation of error. The term backpropagation
causes much confusion. Strictly speaking, backpropagation refers to the
method for computing the error gradient for a feedforward network, a
straightforward but elegant application of the chain rule of elementary
calculus. By extension, backpropagation or backprop refers to a training
method that uses backpropagation to compute the gradient. By further
extension, a backprop network is a feedforward network trained by
backpropagation. Standard backprop is a euphemism for the generalized delta
rule, the training algorithm that was popularized by Rumelhart, Hinton, and
Williams in chapter 8 of Rumelhart and McClelland (1986) and that remains
the most widely used supervised training method for neural nets. 

Literature:
   Rumelhart, D. E. and McClelland, J. L. (1986): Parallel Distributed
   Processing: Explorations in the Microstructure of Cognition (volume 1, pp
   318-362). The MIT Press. 

(this is the classic one) or one of the dozens of other books or articles on
backpropagation (see also question 'literature').
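To make the chain-rule computation concrete, here is a minimal
illustrative sketch (not from the reference above; all names are made
up) of one backprop pass for a one-hidden-layer net with sigmoid hidden
units, a linear output, and squared error, in Python with NumPy:

```python
import numpy as np

def forward_backward(x, t, W1, W2):
    """One backprop pass: forward computation, then chain-rule gradients
    of the squared error E = 0.5*sum((y - t)^2) w.r.t. both weight layers."""
    h = 1.0 / (1.0 + np.exp(-W1 @ x))        # sigmoid hidden activations
    y = W2 @ h                               # linear output
    err = y - t                              # dE/dy
    dW2 = np.outer(err, h)                   # dE/dW2
    dh = W2.T @ err                          # error propagated back to hidden layer
    dW1 = np.outer(dh * h * (1.0 - h), x)    # sigmoid derivative is h*(1-h)
    return y, dW1, dW2

rng = np.random.default_rng(0)
x = rng.normal(size=3)
t = np.array([1.0])
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

y, dW1, dW2 = forward_backward(x, t, W1, W2)

# Verify the analytic gradient against a finite difference.
def loss(a, b):
    y = forward_backward(x, t, a, b)[0]
    return 0.5 * float(np.sum((y - t) ** 2))

eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (loss(W1p, W2) - loss(W1, W2)) / eps
print(abs(numeric - dW1[0, 0]) < 1e-4)  # gradients agree
```

The finite-difference check at the end is the standard way to verify
that a backprop implementation really computes the gradient.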

'Overfitting' (often also called 'overtraining' or 'overlearning') is the
phenomenon that, beyond a certain point during training, a network usually
gets worse instead of better when it is trained to as low an error as
possible. Such long training may make the network 'memorize' the training
patterns, including all of their peculiarities. However, one is usually
interested in the generalization of the network, i.e., the error it
exhibits on cases NOT seen during training. Learning the peculiarities of
the training set makes generalization worse: the network should learn only
the general structure of the training cases.

There are various methods to fight overfitting. The two most important
classes of such methods are regularization methods (such as weight decay)
and early stopping. Regularization methods try to limit the complexity of
the network such that it is unable to learn peculiarities. Early stopping
aims at stopping the training at the point of optimal generalization by
dividing the available data into training and validation sets. A description
of the early stopping method can for instance be found in section 3.3 of 
/pub/papers/techreports/1994-21.ps.gz on ftp.ira.uka.de (anonymous ftp). 
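The early stopping procedure can be sketched as follows (a generic
illustration with hypothetical function names, not the exact algorithm
from the report cited above):

```python
def train_with_early_stopping(fit_step, val_error, max_epochs=1000, patience=20):
    """Generic early-stopping loop: keep the best weights seen so far and
    stop once the validation error has not improved for `patience` epochs.
    `fit_step()` runs one epoch of training and returns the current weights;
    `val_error(w)` measures error on the held-out validation set."""
    best_err, best_w, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        w = fit_step()
        err = val_error(w)
        if err < best_err:
            best_err, best_w, bad_epochs = err, w, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_w, best_err

# Toy stand-in: a validation-error curve that falls, then rises (overfitting).
curve = [5.0, 4.0, 3.0, 2.5, 2.4, 2.5, 2.6, 2.8, 3.0, 3.5] + [4.0] * 50
state = {"epoch": -1}
def fit_step():
    state["epoch"] += 1
    return state["epoch"]   # "weights" are just the epoch index here

w, e = train_with_early_stopping(fit_step, lambda w: curve[w], patience=5)
print(w, e)  # stops at epoch 4, the minimum of the validation curve
```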

------------------------------------------------------------------------

Subject: Why use a bias input? Why activation functions?
========================================================

One way of looking at the need for bias inputs is that the inputs to each
unit in the net define an N-dimensional space, and the unit draws a
hyperplane through that space, producing an "on" output on one side and an
"off" output on the other. (With sigmoid units the plane will not be sharp
-- there will be some gray area of intermediate values near the separating
plane -- but ignore this for now.)
The weights determine where this hyperplane is in the input space. Without a
bias input, this separating plane is constrained to pass through the origin
of the hyperspace defined by the inputs. For some problems that's OK, but in
many problems the plane would be much more useful somewhere else. If you
have many units in a layer, they share the same input space and without bias
would ALL be constrained to pass through the origin. 
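A small illustration of this point (hypothetical code, not from any
particular simulator): without a bias, a sigmoid unit's output at the
origin is pinned to 0.5 no matter what the weights are, because the
separating hyperplane must contain the origin; a bias shifts it away.

```python
import numpy as np

def unit_output(w, x, bias=0.0):
    """A single sigmoid unit; its decision surface is w.x + bias = 0."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + bias)))

w = np.array([2.0, -1.0])
origin = np.zeros(2)

# Without a bias the hyperplane w.x = 0 always contains the origin,
# so the output there is exactly 0.5 for ANY choice of weights.
print(unit_output(w, origin))            # 0.5
# A bias term moves the hyperplane somewhere more useful.
print(unit_output(w, origin, bias=3.0))  # ~0.95
```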

Activation functions are needed to introduce nonlinearity into the network.
Without nonlinearity, hidden units would not make nets more powerful than
just plain perceptrons (which do not have any hidden units, just input and
output units). The reason is that a composition of linear functions is again
a linear function. However, it is precisely the nonlinearity (i.e., the capability
to represent nonlinear functions) that makes multilayer networks so
powerful. Almost any nonlinear function does the job, although for
backpropagation learning it must be differentiable and it helps if the
function is bounded; the popular sigmoidal functions and Gaussian functions
are the most common choices.
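The "composition of linear functions" argument can be verified directly;
this toy NumPy sketch (illustrative names only) shows two linear layers
collapsing into a single equivalent one:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))   # "hidden" layer weights
W2 = rng.normal(size=(2, 5))   # output layer weights
x = rng.normal(size=3)

# With linear (identity) activations the two layers collapse into one:
two_layer = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x      # a single equivalent weight matrix
print(np.allclose(two_layer, one_layer))  # True: no gain from the hidden layer
```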

------------------------------------------------------------------------

Subject: How many hidden units should I use? 
=============================================

Some books and articles offer "rules of thumb" for choosing a topology --
Ninputs plus Noutputs divided by two, maybe with a square root in there
somewhere -- but such rules are total garbage. There is no way to determine
a good network topology just from the number of inputs and outputs. It
depends critically on the number of training cases, the amount of noise, and
the complexity of the function or classification you are trying to learn.
There are problems with one input and one output that require thousands of
hidden units, and problems with a thousand inputs and a thousand outputs
that require only one hidden unit, or none at all.

Other rules relate to the number of cases available: for example, use only
as many hidden units as will keep the number of weights in the network
below one tenth of the number of cases. Such rules are only concerned with
overfitting and
are unreliable as well. All one can say is that if the number of training
cases is much larger (but no one knows exactly how much larger) than the
number of weights, you are unlikely to get overfitting, but you may suffer
from underfitting.

An intelligent choice of the number of hidden units depends on whether you
are using early stopping or some other form of regularization. If not, you
must simply try many networks with different numbers of hidden units,
estimate the generalization error for each one, and choose the network with
the minimum estimated generalization error. However, there is little point
in trying a network with more weights than training cases, since such a
large network is almost sure to overfit.
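The try-many-and-pick procedure can be sketched as follows; the error
estimates here are a made-up stand-in for real holdout or
cross-validation estimates:

```python
def pick_network(candidate_sizes, estimate_gen_error):
    """Try several hidden-layer sizes and keep the one with the lowest
    estimated generalization error (e.g. from a held-out test set or
    cross-validation). `estimate_gen_error` is supplied by the user."""
    best = min(candidate_sizes, key=estimate_gen_error)
    return best, estimate_gen_error(best)

# Toy stand-in for real error estimates: small nets underfit, large nets
# overfit, so the estimated error is roughly U-shaped in the size.
toy_errors = {1: 0.40, 2: 0.22, 4: 0.15, 8: 0.18, 16: 0.31}
size, err = pick_network(list(toy_errors), toy_errors.get)
print(size, err)  # 4 0.15
```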

If you are using early stopping, it is essential to use lots of hidden units
to avoid bad local optima. There seems to be no upper limit on the number of
hidden units, other than that imposed by computer time and memory
requirements. But there also seems to be no advantage to using more hidden
units than you have training cases, since bad local minima do not occur with
so many hidden units.

If you are using weight decay or Bayesian estimation, you can also use lots
of hidden units. However, it is not strictly necessary to do so, because
other methods are available to avoid local minima, such as multiple random
starts and simulated annealing (such methods are not safe to use with early
stopping). You can use one network with lots of hidden units, or you can try
different networks with different numbers of hidden units, and choose on the
basis of estimated generalization error. With weight decay or MAP Bayesian
estimation, it is prudent to keep the number of weights less than half the
number of training cases. 

------------------------------------------------------------------------

Subject: How many learning methods for NNs exist? Which?
========================================================

There are many, many learning methods for NNs by now. Nobody knows exactly
how many. New ones (or at least variations of existing ones) are invented
every week. Below is a collection of some of the best-known methods, with
no claim to completeness.

The main categorization of these methods is the distinction of supervised
from unsupervised learning:

In supervised learning, there is a "teacher" who in the learning phase
"tells" the net how well it performs ("reinforcement learning") or what the
correct behavior would have been ("fully supervised learning").

In unsupervised learning the net is autonomous: it just looks at the data it
is presented with, finds out about some of the properties of the data set,
and learns to reflect these properties in its output. Exactly which
properties the network can learn to recognise depends on the particular
network model and learning method.

Many of these learning methods are closely connected with a certain (class
of) network topology.

Now here is the list, just giving some names:

1. UNSUPERVISED LEARNING (i.e. without a "teacher"):
     1). Feedback Nets:
        a). Additive Grossberg (AG)
        b). Shunting Grossberg (SG)
        c). Binary Adaptive Resonance Theory (ART1)
        d). Analog Adaptive Resonance Theory (ART2, ART2a)
        e). Discrete Hopfield (DH)
        f). Continuous Hopfield (CH)
        g). Discrete Bidirectional Associative Memory (BAM)
        h). Temporal Associative Memory (TAM)
        i). Adaptive Bidirectional Associative Memory (ABAM)
        j). Kohonen Self-organizing Map/Topology-preserving map (SOM/TPM)
        k). Competitive learning
     2). Feedforward-only Nets:
        a). Learning Matrix (LM)
        b). Driver-Reinforcement Learning (DR)
        c). Linear Associative Memory (LAM)
        d). Optimal Linear Associative Memory (OLAM)
        e). Sparse Distributed Associative Memory (SDM)
        f). Fuzzy Associative Memory (FAM)
        g). Counterpropagation (CPN)

2. SUPERVISED LEARNING (i.e. with a "teacher"):
     1). Feedback Nets:
        a). Brain-State-in-a-Box (BSB)
        b). Fuzzy Cognitive Map (FCM)
        c). Boltzmann Machine (BM)
        d). Mean Field Annealing (MFT)
        e). Recurrent Cascade Correlation (RCC)
        f). Learning Vector Quantization (LVQ)
        g). Backpropagation through time (BPTT)
        h). Real-time recurrent learning (RTRL)
        i). Recurrent Extended Kalman Filter (EKF)
     2). Feedforward-only Nets:
        a). Perceptron
        b). Adaline, Madaline
        c). Backpropagation (BP)
        d). Cauchy Machine (CM)
        e). Adaptive Heuristic Critic (AHC)
        f). Time Delay Neural Network (TDNN)
        g). Associative Reward Penalty (ARP)
        h). Avalanche Matched Filter (AMF)
        i). Backpercolation (Perc)
        j). Artmap
        k). Adaptive Logic Network (ALN)
        l). Cascade Correlation (CasCor)
        m). Extended Kalman Filter (EKF)

------------------------------------------------------------------------

Subject: What about Genetic Algorithms?
=======================================

There are a number of definitions of GA (Genetic Algorithm). A possible one
is

  A GA is an optimization program
  that starts with
  a population of encoded procedures,       (Creation of Life :-> )
  mutates them stochastically,              (Get cancer or so :-> )
  and uses a selection process              (Darwinism)
  to prefer the mutants with high fitness
  and perhaps a recombination process       (Make babies :-> )
  to combine properties of (preferably) the successful mutants.
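The definition above can be turned into a few lines of code. This is a
deliberately minimal illustration (all parameter choices are arbitrary)
that evolves bit strings to maximize the number of 1 bits (the "OneMax"
toy problem):

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                      p_mut=0.02, seed=0):
    """Minimal GA following the definition above: a population of
    bit-string 'procedures', stochastic mutation, fitness-based
    selection, and one-point recombination."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        def select():
            # Selection: tournament of two, the fitter survives (Darwinism).
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_bits)                       # recombination
            child = p1[:cut] + p2[cut:]
            child = [b ^ (rng.random() < p_mut) for b in child]  # mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# OneMax: the fitness of a bit string is simply its number of 1 bits.
best = genetic_algorithm(fitness=sum)
print(sum(best))  # close to 20, the optimum
```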

Genetic Algorithms are just a special case of the more general idea of
``evolutionary computation''. There is a newsgroup that is dedicated to the
field of evolutionary computation called comp.ai.genetic. It has a detailed
FAQ posting which, for instance, explains the terms "Genetic Algorithm",
"Evolutionary Programming", "Evolution Strategy", "Classifier System", and
"Genetic Programming". That FAQ also contains lots of pointers to relevant
literature, software, other sources of information, et cetera et cetera.
Please see the comp.ai.genetic FAQ for further information. 

------------------------------------------------------------------------

Subject: What about Fuzzy Logic?
================================

Fuzzy Logic is an area of research based on the work of L.A. Zadeh. It is a
departure from classical two-valued sets and logic, that uses "soft"
linguistic (e.g. large, hot, tall) system variables and a continuous range
of truth values in the interval [0,1], rather than strict binary (True or
False) decisions and assignments.

Fuzzy logic is used where a system is difficult to model exactly (but an
inexact model is available), is controlled by a human operator or expert, or
where ambiguity or vagueness is common. A typical fuzzy system consists of a
rule base, membership functions, and an inference procedure.
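As a small illustration of membership functions (a made-up example, not
from the references below): a triangular membership function assigns each
input a degree of membership between 0 and 1 rather than a binary
decision.

```python
def triangular(a, b, c):
    """Triangular membership function: the degree of membership rises
    linearly from 0 at a to 1 at the peak b, then falls back to 0 at c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# A 'soft' linguistic variable: how strongly is a temperature "hot"?
hot = triangular(25.0, 35.0, 45.0)
print(hot(20.0))  # 0.0 -> definitely not hot
print(hot(30.0))  # 0.5 -> partially hot
print(hot(35.0))  # 1.0 -> fully hot
```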

Most Fuzzy Logic discussion takes place in the newsgroup comp.ai.fuzzy, but
there is also some work (and discussion) about combining fuzzy logic with
Neural Network approaches in comp.ai.neural-nets.

For more details see (for example): 

Klir, G.J. and Folger, T.A.: Fuzzy Sets, Uncertainty, and Information
Prentice-Hall, Englewood Cliffs, N.J., 1988. 
Kosko, B.: Neural Networks and Fuzzy Systems Prentice Hall, Englewood
Cliffs, NJ, 1992. 

------------------------------------------------------------------------

Subject: How are NNs related to statistical methods? 
=====================================================

There is considerable overlap between the fields of neural networks and
statistics.
Statistics is concerned with data analysis. In neural network terminology,
statistical inference means learning to generalize from noisy data. Some
neural networks are not concerned with data analysis (e.g., those intended
to model biological systems) and therefore have little to do with
statistics. Some neural networks do not learn (e.g., Hopfield nets) and
therefore have little to do with statistics. Some neural networks can learn
successfully only from noise-free data (e.g., ART or the perceptron rule)
and therefore would not be considered statistical methods. But most neural
networks that can learn to generalize effectively from noisy data are
similar or identical to statistical methods. For example: 

 o Feedforward nets with no hidden layer (including functional-link neural
   nets and higher-order neural nets) are basically generalized linear
   models. 
 o Feedforward nets with one hidden layer are closely related to projection
   pursuit regression. 
 o Probabilistic neural nets are identical to kernel discriminant analysis. 
 o Kohonen nets for adaptive vector quantization are very similar to k-means
   cluster analysis. 
 o Hebbian learning is closely related to principal component analysis. 
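The first bullet above can be illustrated directly: a feedforward net
with no hidden layer and a logistic output unit computes exactly the same
function as logistic regression, a generalized linear model with the
logit link. (The numbers below are toy values, not from the FAQ's
references.)

```python
import numpy as np

def no_hidden_net(w, b, x):
    """A feedforward net with no hidden layer and a logistic output unit:
    sigmoid(w.x + b)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

w, b = np.array([1.5, -2.0]), 0.5
x = np.array([1.0, 1.0])
net_out = no_hidden_net(w, b, x)

# The same prediction in GLM notation: the inverse logit of the
# linear predictor eta = x.w + b.
eta = x @ w + b
glm_out = np.exp(eta) / (1.0 + np.exp(eta))
print(np.isclose(net_out, glm_out))  # True: the two are the same model
```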

Some neural network areas that appear to have no close relatives in the
existing statistical literature are: 

 o Kohonen's self-organizing maps. 
 o Reinforcement learning (although this is treated in the operations
   research literature as Markov decision processes). 
 o Stopped training (the purpose and effect of stopped training are similar
   to shrinkage estimation, but the method is quite different). 

Feedforward nets are a subset of the class of nonlinear regression and
discrimination models. Statisticians have studied the properties of this
general class but had not considered the specific case of feedforward neural
nets before such networks were popularized in the neural network field.
Still, many results from the statistical theory of nonlinear models apply
directly to feedforward nets, and the methods that are commonly used for
fitting nonlinear models, such as various Levenberg-Marquardt and conjugate
gradient algorithms, can be used to train feedforward nets. 

While neural nets are often defined in terms of their algorithms or
implementations, statistical methods are usually defined in terms of their
results. The arithmetic mean, for example, can be computed by a (very
simple) backprop net, by applying the usual formula SUM(x_i)/n, or by
various other methods. What you get is still an arithmetic mean regardless
of how you compute it. So a statistician would consider standard backprop,
Quickprop, and Levenberg-Marquardt as different algorithms for implementing
the same statistical model, such as a feedforward net. On the other hand,
different training criteria, such as least squares and cross entropy, are
viewed by statisticians as fundamentally different estimation methods with
different statistical properties. 

It is sometimes claimed that neural networks, unlike statistical models,
require no distributional assumptions. In fact, neural networks involve
exactly the same sort of distributional assumptions as statistical models,
but statisticians study the consequences and importance of these assumptions
while most neural networkers ignore them. For example, least-squares
training methods are widely used by statisticians and neural networkers.
Statisticians realize that least-squares training involves implicit
distributional assumptions in that least-squares estimates have certain
optimality properties for noise that is normally distributed with equal
variance for all training cases and that is independent between different
cases. These optimality properties are consequences of the fact that
least-squares estimation is maximum likelihood under those conditions.
Similarly, cross-entropy is maximum likelihood for noise with a Bernoulli
distribution. If you study the distributional assumptions, then you can
recognize and deal with violations of the assumptions. For example, if you
have normally distributed noise but some training cases have greater noise
variance than others, then you may be able to use weighted least squares
instead of ordinary least squares to obtain more efficient estimates. 
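The weighted least squares point can be illustrated on made-up data with
heteroscedastic noise (nothing below comes from the references; it is
only a sketch of the standard WLS computation):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])        # design matrix with intercept
true_beta = np.array([1.0, 2.0])

# Heteroscedastic noise: cases with x >= 5 are much noisier.
sigma = np.where(x < 5, 0.5, 5.0)
y = X @ true_beta + rng.normal(0, sigma)

# Ordinary least squares treats every case alike.
ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Weighted least squares downweights the noisy cases by 1/variance,
# solving the normal equations (X' W X) beta = X' W y.
w = 1.0 / sigma**2
Xw = X * w[:, None]
wls = np.linalg.solve(X.T @ Xw, Xw.T @ y)
print(ols, wls)  # both near [1, 2]; WLS is typically the more efficient
```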

Communication between statisticians and neural net researchers is often
hindered by the different terminology used in the two fields. There is a
comparison of neural net and statistical jargon in 
ftp://ftp.sas.com/pub/neural/jargon 

Here are a few references: 

Bishop, C.M. (1995), _Neural Networks for Pattern Recognition_, Oxford:
Oxford University Press. 

Chatfield, C. (1993), "Neural networks: Forecasting breakthrough or passing
fad", International Journal of Forecasting, 9, 1-3. 

Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review from a
Statistical Perspective", Statistical Science, 9, 2-54. 

Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the
Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

Kushner, H. & Clark, D. (1978), _Stochastic Approximation Methods for
Constrained and Unconstrained Systems_, Springer-Verlag. 

Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994), _Machine Learning,
Neural and Statistical Classification_, Ellis Horwood. 

Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E.
Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., _Networks and Chaos:
Statistical and Probabilistic Aspects_, Chapman & Hall. ISBN 0 412 46530 2. 

Sarle, W.S. (1994), "Neural Networks and Statistical Models," Proceedings of
the Nineteenth Annual SAS Users Group International Conference, Cary, NC:
SAS Institute, pp 1538-1550. ( ftp://ftp.sas.com/pub/neural/neural1.ps)

White, H. (1989), "Learning in Artificial Neural Networks: A Statistical
Perspective," Neural Computation, 1, 425-464. 

White, H. (1992), _Artificial Neural Networks: Approximation and Learning
Theory_, Blackwell. 

------------------------------------------------------------------------

Next part is part 3 (of 7). Previous part is part 1. 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
