Newsgroups: comp.ai.neural-nets,comp.answers,news.answers
Path: cantaloupe.srv.cs.cmu.edu!nntp.club.cc.cmu.edu!goldenapple.srv.cs.cmu.edu!das-news2.harvard.edu!oitnews.harvard.edu!news.sesqui.net!news.blkbox.COM!academ!insync!news.maxwell.syr.edu!news.mathworks.com!newsgate.duke.edu!interpath!news.interpath.net!news.interpath.net!sas!newshost.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: comp.ai.neural-nets FAQ, Part 1 of 7: Introduction
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn1.posting_859608014@hotellng.unx.sas.com>
Supersedes: <nn1.posting_857188812@hotellng.unx.sas.com>
Approved: news-answers-request@MIT.EDU
Date: Sat, 29 Mar 1997 04:00:15 GMT
Expires: Sat, 3 May 1997 04:00:14 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: frequently asked questions, answers
Followup-To: comp.ai.neural-nets
Lines: 814
Xref: glinda.oz.cs.cmu.edu comp.ai.neural-nets:36938 comp.answers:25092 news.answers:10716

Archive-name: ai-faq/neural-nets/part1
Last-modified: 1997-03-28
URL: ftp://ftp.sas.com/pub/neural/FAQ.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

  ---------------------------------------------------------------
    Additions, corrections, or improvements are always welcome.
    If you are willing to contribute any information, please
    email me; if it is relevant, I will incorporate it.

    The monthly posting goes out on the 28th of every month.
  ---------------------------------------------------------------

This is the first of seven parts of a monthly posting to the Usenet
newsgroup comp.ai.neural-nets (as well as comp.answers and news.answers,
where it should be findable at any time). Its purpose is to provide basic
information for individuals who are new to the field of neural networks or
who are just beginning to read this group. It will help to avoid lengthy
discussion of questions that often arise for beginners. 

   SO, PLEASE, SEARCH THIS POSTING FIRST IF YOU HAVE A QUESTION
                           and
   DON'T POST ANSWERS TO FAQs: POINT THE ASKER TO THIS POSTING

The latest version of the FAQ is available as a hypertext document, readable
by any WWW (World Wide Web) browser such as Mosaic, under the URL: 
"ftp://ftp.sas.com/pub/neural/FAQ.html".

These postings are archived in the periodic posting archive on host
rtfm.mit.edu (and on some other hosts as well). Look in the anonymous ftp
directory "/pub/usenet/news.answers/ai-faq/neural-nets" under the file names
"part1", "part2", ... "part7". If you do not have anonymous ftp access, you
can access the archives by mail server as well. Send an E-mail message to
mail-server@rtfm.mit.edu with "help" and "index" in the body on separate
lines for more information.

For those of you who read this FAQ anywhere other than in Usenet: To read
comp.ai.neural-nets (or post articles to it) you need Usenet News access.
Try the commands, 'xrn', 'rn', 'nn', or 'trn' on your Unix machine, 'news'
on your VMS machine, or ask a local guru. WWW browsers are often set up for
Usenet access, too--try the URL news:comp.ai.neural-nets. 

The FAQ posting is sent to comp.ai.neural-nets on the 28th of every month.
It is also sent to the groups comp.answers and news.answers where it should
be available at any time (ask your news manager). The FAQ posting, like any
other posting, may take a few days to find its way over Usenet to your
site. Such delays are especially common outside of North America. 

This FAQ is not meant to discuss any topic exhaustively.

Disclaimer: 

   This posting is provided 'as is'. No warranty whatsoever is expressed or
   implied, in particular, no warranty that the information contained herein
   is correct or useful in any way, although both are intended. 

To find the answer to question "x", search for the string "Subject: x"

========== Questions ========== 
********************************

Part 1: Introduction

   What is this newsgroup for? How shall it be used?
   Where is comp.ai.neural-nets archived?
   May I copy this FAQ?
   What is a neural network (NN)?
   What can you do with an NN and what not?
   Who is concerned with NNs?
   How are layers counted?
   What are cases and variables?
   What are the population, sample, training set, design set, validation
   set, and test set?
   How are NNs related to statistical methods?

Part 2: Learning

   How many learning methods for NNs exist? Which?
   What is backprop?
   What are conjugate gradients, Levenberg-Marquardt, etc.?
   How should categories be coded?
   Why use a bias input?
   Why use activation functions?
   What is a softmax activation function?
   What is the curse of dimensionality?
   How do MLPs compare with RBFs?
   What are OLS and subset regression?
   Should I normalize/standardize/rescale the data?
   Should I nonlinearly transform the data?
   How to measure importance of inputs?
   What is ART?
   What is PNN?
   What is GRNN?
   What does unsupervised learning learn?
   What about Genetic Algorithms and Evolutionary Computation?
   What about Fuzzy Logic?

Part 3: Generalization

   How is generalization possible?
   How does noise affect generalization?
   What is overfitting and how can I avoid it?
   What is jitter? (Training with noise)
   What is early stopping?
   What is weight decay?
   What is Bayesian learning?
   How many hidden layers should I use?
   How many hidden units should I use?
   How can generalization error be estimated?
   What are cross-validation and bootstrapping?

Part 4: Books, data, etc.

   Books and articles about Neural Networks?
   Journals and magazines about Neural Networks?
   The most important conferences concerned with Neural Networks?
   Neural Network Associations?
   Other sources of information about NNs?
   Databases for experimentation with NNs?

Part 5: Free software

   Freeware and shareware packages for NN simulation?

Part 6: Commercial software

   Commercial software packages for NN simulation?

Part 7: Hardware, etc.

   Neural Network hardware?
   Unanswered FAQs

------------------------------------------------------------------------

Subject: What is this newsgroup for? How shall it be
====================================================
used?
=====

The newsgroup comp.ai.neural-nets is intended as a forum for people who want
to use or explore the capabilities of Artificial Neural Networks or
Neural-Network-like structures.

There should be the following types of articles in this newsgroup:

1. Requests
+++++++++++

   Requests are articles of the form "I am looking for X", where X is
   something public like a book, an article, or a piece of software. The
   most important thing about such a request is to be as specific as
   possible!

   If multiple different answers can be expected, the person making the
   request should be prepared to summarize the answers he/she receives and
   should announce the intention to do so with a phrase like "Please reply
   by email, I'll summarize to the group" at the end of the posting.

   The Subject line of the posting should then be something like 
   "Request: X" 

2. Questions
++++++++++++

   As opposed to requests, questions ask for a larger piece of information
   or a more or less detailed explanation of something. To avoid lots of
   redundant traffic it is important that the poster provide, along with
   the question, all information s/he already has about the subject, and
   state the actual question as precisely and narrowly as possible. The
   poster should be prepared to summarize the answers s/he receives and
   should announce the intention to do so with a phrase like "Please reply
   by email, I'll summarize to the group" at the end of the posting.

   The Subject line of the posting should be something like "Question:
   this-and-that" or have the form of a question (i.e., end with a
   question mark) 

   Students: please do not ask comp.ai.neural-nets readers to do your
   homework or take-home exams for you. 

3. Answers
++++++++++

   These are reactions to questions or requests. As a rule of thumb, articles
   of type "answer" should be rare. Ideally, in most cases either the answer
   is too specific to be of general interest (and should thus be e-mailed to
   the poster) or a summary was announced with the question or request (and
   answers should thus be e-mailed to the poster).

   Most news-reader software automatically provides a subject line beginning
   with "Re:" followed by the subject of the article which is being
   followed-up. Note that sometimes longer threads of discussion evolve from
   an answer to a question or request. In this case posters should change
   the subject line suitably as soon as the topic goes too far away from the
   one announced in the original subject line. You can still carry along the
   old subject in parentheses in the form "Subject: new subject
   (was: old subject)" 

4. Summaries
++++++++++++

   Whenever the answers to a request or question can be assumed to be of
   some general interest, the poster of the request or question should
   summarize the answers he/she receives. Such a summary should be
   announced in the original posting of the question or request with a
   phrase like "Please answer by email, I'll summarize"

   In such a case, people who answer a question should NOT post their
   answers to the newsgroup but instead mail them to the poster of the
   question, who collects and reviews them. About 5 to 20 days after the
   original posting, its poster should summarize the answers and post the
   summary to the newsgroup.

   Some care should be invested in a summary: 
    o simple concatenation of all the answers is not enough; instead,
      redundancies, irrelevancies, verbosities, and errors should be
      filtered out (as far as possible) 
    o the answers should be separated clearly 
    o the contributors of the individual answers should be identifiable
      (unless they requested to remain anonymous [yes, that happens]) 
    o the summary should start with the "quintessence" of the answers, as
      seen by the original poster 
    o A summary should, when posted, clearly be indicated to be one by
      giving it a Subject line starting with "SUMMARY:" 
   Note that a good summary is pure gold for the rest of the newsgroup
   community, so summary work will be most appreciated by all of us. Good
   summaries are more valuable than any moderator! :-) 

5. Announcements
++++++++++++++++

   Some articles never need any public reaction. These are called
   announcements (for instance, of a workshop or conference, or of the
   availability of some technical report or software system).

   Announcements should be clearly indicated to be such by giving them a
   subject line of the form "Announcement: this-and-that" 

6. Reports
++++++++++

   Sometimes people spontaneously want to report something to the newsgroup.
   This might be special experiences with some software, results of their
   own experiments or conceptual work, or especially interesting information
   from somewhere else.

   Reports should be clearly indicated to be such by giving them a subject
   line of the form "Report: this-and-that" 

7. Discussions
++++++++++++++

   An especially valuable feature of Usenet is, of course, the possibility
   of discussing a certain topic with hundreds of potential participants.
   All traffic in the newsgroup that cannot be subsumed under one of the
   above categories should belong to a discussion.

   If somebody explicitly wants to start a discussion, he/she can do so by
   giving the posting a subject line of the form "Discussion:
   this-and-that"

   It is quite difficult to keep a discussion from drifting into chaos, but,
   unfortunately, as many other newsgroups show, there seems to be no
   secure way to avoid this. On the other hand, comp.ai.neural-nets has not
   had many problems with this effect in the past, so let's just go and
   hope... 

------------------------------------------------------------------------

Subject: Where is comp.ai.neural-nets archived? 
================================================

The following archives are available for comp.ai.neural-nets: 

 o Deja News at http://xp8.dejanews.com/ 
 o ftp://ftp.cs.cmu.edu/user/ai/pubs/news/comp.ai.neural-nets 
 o http://asknpac.npac.syr.edu 

   According to Gang Cheng, gcheng@npac.syr.edu, the Northeast Parallel
   Architecture Center (NPAC), Syracuse University, maintains an archive
   system for searching/reading USENET newsgroups and mailing lists. Two
   search/navigation interfaces accessible by any WWW browser are provided:
   one is an advanced search interface allowing queries with various options
   such as query by mail header, by date, by subject (keywords), by sender.
   The other is a Hypermail-like navigation interface for users familiar
   with Hypermail. 

For more information on newsgroup archives, see 
http://starbase.neosoft.com/~claird/news.lists/newsgroup_archives.html 

------------------------------------------------------------------------

Subject: May I copy this FAQ?
=============================

The intent in providing a FAQ is to make the information freely available to
whoever needs it. You may copy all or part of the FAQ, but please be sure to
include a reference to the URL of the master copy,
ftp://ftp.sas.com/pub/neural/FAQ.html, and do not sell copies of the FAQ. If
you want to include information from the FAQ in your own web site, it is
better to include links to the master copy rather than to copy text from the
FAQ to your web pages, because various answers in the FAQ are updated at
unpredictable times. 

------------------------------------------------------------------------

Subject: What is a neural network (NN)?
=======================================

First of all, when we are talking about a neural network, we should more
properly say "artificial neural network" (ANN), because that is what we mean
most of the time in comp.ai.neural-nets. Biological neural networks are much
more complicated than the mathematical models we use for ANNs. But it is
customary to be lazy and drop the "A" or the "artificial". 

There is no universally accepted definition of an NN. But perhaps most
people in the field would agree that an NN is a network of many simple
processors ("units"), each possibly having a small amount of local memory.
The units are connected by communication channels ("connections") which
usually carry numeric (as opposed to symbolic) data, encoded by any of
various means. The units operate only on their local data and on the inputs
they receive via the connections. The restriction to local operations is
often relaxed during training. 

Some NNs are models of biological neural networks and some are not, but
historically, much of the inspiration for the field of NNs came from the
desire to produce artificial systems capable of sophisticated, perhaps
"intelligent", computations similar to those that the human brain routinely
performs, and thereby possibly to enhance our understanding of the human
brain. 

Most NNs have some sort of "training" rule whereby the weights of
connections are adjusted on the basis of data. In other words, NNs "learn"
from examples (as children learn to recognize dogs from examples of dogs)
and exhibit some capability for generalization beyond the training data. 

NNs normally have great potential for parallelism, since the computations of
the components are largely independent of each other. Some people regard
massive parallelism and high connectivity to be defining characteristics of
NNs, but such requirements rule out various simple models, such as simple
linear regression (a minimal feedforward net with only two units plus bias),
which are usefully regarded as special cases of NNs. 
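
To make the last point concrete, here is a minimal sketch (in Python; the
code and data are illustrative only, not taken from any particular package)
of simple linear regression written as a feedforward net: one input unit
connected to one linear output unit with a bias, with the two weights
adjusted by gradient descent on squared error.

   # One input, one linear output unit plus bias: simple linear
   # regression, fitted iteratively rather than by the usual formula.
   def train_linear_net(xs, ys, rate=0.01, epochs=1000):
       w, b = 0.0, 0.0                  # connection weight and bias
       for _ in range(epochs):
           for x, y in zip(xs, ys):
               out = w * x + b          # forward pass (identity activation)
               err = out - y
               w -= rate * err * x      # gradient of squared error w.r.t. w
               b -= rate * err          # gradient of squared error w.r.t. b
       return w, b

   xs = [1.0, 2.0, 3.0, 4.0]
   ys = [2.1, 3.9, 6.2, 7.8]            # roughly y = 2x
   print(train_linear_net(xs, ys))      # approximately the least-squares fit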

Here is a sampling of definitions from the books on the FAQ maintainer's
shelf. None will please everyone. Perhaps for that reason many NN textbooks
do not explicitly define neural networks. 

According to the DARPA Neural Network Study (1988, AFCEA International
Press, p. 60): 

   ... a neural network is a system composed of many simple processing
   elements operating in parallel whose function is determined by
   network structure, connection strengths, and the processing performed
   at computing elements or nodes. 

According to Haykin, S. (1994), Neural Networks: A Comprehensive
Foundation, NY: Macmillan, p. 2: 

   A neural network is a massively parallel distributed processor that
   has a natural propensity for storing experiential knowledge and
   making it available for use. It resembles the brain in two respects: 

   1. Knowledge is acquired by the network through a learning process. 
   2. Interneuron connection strengths known as synaptic weights are
      used to store the knowledge. 

According to Nigrin, A. (1993), Neural Networks for Pattern Recognition,
Cambridge, MA: The MIT Press, p. 11: 

   A neural network is a circuit composed of a very large number of
   simple processing elements that are neurally based. Each element
   operates only on local information. Furthermore each element operates
   asynchronously; thus there is no overall system clock. 

According to Zurada, J.M. (1992), Introduction To Artificial Neural Systems,
Boston: PWS Publishing Company, p. xv: 

   Artificial neural systems, or neural networks, are physical cellular
   systems which can acquire, store, and utilize experiential knowledge.

For more information on "What is a neural network?", with examples and
diagrams, see Leslie S. Smith's on-line introduction at: 
http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html. 

------------------------------------------------------------------------

Subject: What can you do with an NN and what not?
=================================================

In principle, NNs can compute any computable function, i.e., they can do
everything a normal digital computer can do. 

In practice, NNs are especially useful for classification and function
approximation/mapping problems which are tolerant of some imprecision, which
have lots of training data available, but to which hard and fast rules (such
as those that might be used in an expert system) cannot easily be applied.
Almost any mapping between vector spaces can be approximated to arbitrary
precision by feedforward NNs (which are the type most often used in
practical applications) if you have enough data and enough computing
resources. 
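
As a toy illustration of function approximation, here is a sketch (in
Python, using only the standard library; the network size, learning rate,
and iteration count are arbitrary choices, not recommendations) of a
one-hidden-layer net with tanh units, trained by on-line gradient descent
to approximate y = sin(x) on [0, pi]:

   import math, random

   random.seed(0)
   H = 8                                # number of hidden units
   w1 = [random.uniform(-1, 1) for _ in range(H)]  # input -> hidden weights
   b1 = [0.0] * H                                  # hidden biases
   w2 = [random.uniform(-1, 1) for _ in range(H)]  # hidden -> output weights
   b2 = 0.0

   def forward(x):
       h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
       return sum(w2[j] * h[j] for j in range(H)) + b2, h

   data = [(i * math.pi / 20, math.sin(i * math.pi / 20)) for i in range(21)]

   rate = 0.05
   for _ in range(20000):
       x, y = random.choice(data)       # one training case at a time
       out, h = forward(x)
       err = out - y
       for j in range(H):
           grad_h = err * w2[j] * (1.0 - h[j] ** 2)  # backprop through tanh
           w2[j] -= rate * err * h[j]
           w1[j] -= rate * grad_h * x
           b1[j] -= rate * grad_h
       b2 -= rate * err

   print(forward(math.pi / 2)[0])       # should be close to sin(pi/2) = 1.0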

NNs are, at least today, difficult to apply successfully to problems that
concern manipulation of symbols and memory. And there are no methods for
training NNs that can magically create information that is not contained in
the training data. 

As for simulating human consciousness and emotion, that's still in the realm
of science fiction. 

For examples of NN applications, see: 

 o The Pacific Northwest National Laboratory web pages at 
   http://www.emsl.pnl.gov:2080/docs/cie/neural/neural.research.html and 
   http://www.emsl.pnl.gov:2080/docs/cie/neural/products/ 
 o The Stimulation Initiative for European Neural Applications web page at 
   http://www.mbfys.kun.nl/snn/siena/cases/ 
 o The DTI NeuroComputing Web's Applications Portfolio at 
   http://www.globalweb.co.uk/nctt/portfolo/ 
 o The Applications Corner, provided by NeuroDimension, Inc., at 
   http://www.nd.com/appcornr/purpose.htm 
 o The BioComp Systems, Inc. Solutions page at http://www.bio-comp.com 
 o Athanasios Episcopos's web page with References on Neural Net
   Applications to Finance and Economics at 
   http://phoenix.som.clarkson.edu/~episcopo/neurofin.html 
 o Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY:
   McGraw-Hill, ISBN 0-07-011189-8. 
 o Trippi, R.R. & Turban, E. (1993), Neural Networks in Finance and
   Investing, Chicago: Probus. 
 o The series Advances in Neural Information Processing Systems containing
   proceedings of the conference of the same name, published yearly by
   Morgan Kaufmann starting in 1989. 

There is an on-line application of a Kohonen network with a 2-dimensional
output layer for prediction of protein secondary structure percentages from
UV circular dichroism spectra. According to J.J. Merelo: 

   You only need to submit 41 CD values ranging from 200 nm to 240 nm
   (given in deg cm^2 dmol^-1 multiplied by 0.001) and the k2d server
   gives back the estimated percentages of helix, beta and rest of
   secondary structure of your protein plus an estimation of the
   accuracy of the prediction. 

The address of the k2d server is http://kal-el.ugr.es/k2d/spectra.html. The
home page of the k2d program is at http://kal-el.ugr.es/k2d/k2d.html or 
http://www.embl-heidelberg.de/~andrade/k2d.html. 

------------------------------------------------------------------------

Subject: Who is concerned with NNs?
===================================

Neural Networks are of interest to quite a lot of very different people: 

 o Computer scientists want to find out about the properties of non-symbolic
   information processing with neural nets and about learning systems in
   general. 
 o Statisticians use neural nets as flexible, nonlinear regression and
   classification models. 
 o Engineers of many kinds exploit the capabilities of neural networks in
   many areas, such as signal processing and automatic control. 
 o Cognitive scientists view neural networks as a possible apparatus to
   describe models of thinking and consciousness (high-level brain
   function). 
 o Neuro-physiologists use neural networks to describe and explore
   medium-level brain function (e.g. memory, sensory system, motorics). 
 o Physicists use neural networks to model phenomena in statistical
   mechanics and for a lot of other tasks. 
 o Biologists use Neural Networks to interpret nucleotide sequences. 
 o Philosophers and some other people may also be interested in Neural
   Networks for various reasons. 

For world-wide lists of groups doing research on NNs, see the Foundation for
Neural Networks (SNN) page at 
http://www.mbfys.kun.nl/snn/pointers/groups.html and see Neural Networks
Research on the IEEE Neural Network Council's homepage 
http://www.ieee.org/nnc. 

------------------------------------------------------------------------

Subject: How are layers counted? 
=================================

This is a matter of considerable dispute. 

 o Some people count layers of units. But of these people, some count the
   input layer and some don't. 

 o Some people count layers of weights. But I have no idea how they count
   skip-layer connections. 

To avoid ambiguity, you should speak of a 2-hidden-layer network, not a
4-layer network (as some would call it) or 3-layer network (as others would
call it). And if the connections follow any pattern other than fully
connecting each layer to the next and to no others, you should carefully
specify the connections. 
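
To see how the conventions diverge, consider a hypothetical net with 5
inputs, two hidden layers of 4 and 3 units, and 1 output, each layer fully
connected to the next. A trivial Python sketch, just to pin down the counts:

   # Hypothetical layout 5-4-3-1, full connections between
   # consecutive layers only.
   units_per_layer = [5, 4, 3, 1]

   layers_of_units = len(units_per_layer)           # 4 (counting the inputs)
   layers_of_units_no_input = layers_of_units - 1   # 3 (not counting them)
   layers_of_weights = len(units_per_layer) - 1     # 3
   hidden_layers = len(units_per_layer) - 2         # 2

   print(layers_of_units, layers_of_units_no_input,
         layers_of_weights, hidden_layers)          # prints: 4 3 3 2

Only the hidden-layer count comes out the same under every convention,
which is why it is the recommended way to describe a network.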

------------------------------------------------------------------------

Subject: What are cases and variables?
======================================

A vector of values presented at one time to all the input units of a neural
network is called a "case", "example", "pattern", "sample", etc. The term
"case" will be used in this FAQ because it is widely recognized,
unambiguous, and requires less typing than the other terms. A case may
include not only input values, but also target values and possibly other
information. 

A vector of values presented at different times to a single input unit is
often called an "input variable" or "feature". To a statistician, it is a
"predictor", "regressor", "covariate", "independent variable", "explanatory
variable", etc. A vector of target values associated with a given output
unit of the network during training will be called a "target variable" in
this FAQ. To a statistician, it is usually a "response" or "dependent
variable". 

A "data set" is a matrix containing one or (usually) more cases. In this
FAQ, it will be assumed that cases are rows of the matrix, while variables
are columns. 

Note that the often-used term "input vector" is ambiguous; it can mean
either an input case or an input variable. 
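
Here is a trivial sketch (in Python; the variable names are hypothetical) of
the conventions used in this FAQ, with cases as rows and variables as
columns:

   # A small data set: rows are cases, columns are variables.
   variables = ["sepal_length", "sepal_width", "target"]   # column labels
   data_set = [
       [5.1, 3.5, 0.0],     # case 1
       [4.9, 3.0, 0.0],     # case 2
       [6.4, 3.2, 1.0],     # case 3
   ]

   case_2 = data_set[1]                        # one case: a row
   sepal_width = [row[1] for row in data_set]  # one variable: a column
   print(case_2, sepal_width)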

------------------------------------------------------------------------

Subject: What are the population, sample, training set,
=======================================================
design set, validation set, and test set?
=========================================

There seems to be no term in the NN literature for the set of all cases that
you want to be able to generalize to. Statisticians call this set the
"population". Neither is there a consistent term in the NN literature for
the set of cases that are available for training and evaluating an NN.
Statisticians call this set the "sample". The sample is usually a subset of
the population. 

In NN methodology, the sample is often subdivided into "training",
"validation", and "test" sets. The distinctions among these subsets are
crucial, but the terms "validation" and "test" sets are often confused.
There is no book in the NN literature more authoritative than Ripley (1996),
from which the following definitions are taken (p. 354): 

Training set: 
   A set of examples used for learning, that is to fit the parameters
   [weights] of the classifier. 
Validation set: 
   A set of examples used to tune the parameters of a classifier, for
   example to choose the number of hidden units in a neural network. 
Test set: 
   A set of examples used only to assess the performance [generalization] of
   a fully-specified classifier. 

Bishop (1995), another indispensable reference on neural networks, provides
the following explanation (p. 372): 

   Since our goal is to find the network having the best performance on
   new data, the simplest approach to the comparison of different
   networks is to evaluate the error function using data which is
   independent of that used for training. Various networks are trained
   by minimization of an appropriate error function defined with respect
   to a training data set. The performance of the networks is then
   compared by evaluating the error function using an independent 
   validation set, and the network having the smallest error with
   respect to the validation set is selected. This approach is called
   the hold out method. Since this procedure can itself lead to some
   overfitting to the validation set, the performance of the selected
   network should be confirmed by measuring its performance on a third
   independent set of data called a test set. 

The crucial point is that a test set, by definition, is never used to choose
among two or more networks, so that the error on the test set provides an
unbiased estimate of the generalization error (assuming that the test set is
representative of the population, etc.). Any data set that is used to choose the
best of two or more networks is, by definition, a validation set, and the error of
the chosen network on the validation set is optimistically biased. 

There is a problem with the usual distinction between training and validation
sets. Some training approaches, such as early stopping, require a validation
set, so in a sense, the validation set is used for training. Other approaches,
such as maximum likelihood, do not inherently require a validation set. So the
"training" set for maximum likelihood might encompass both the "training" and
"validation" sets for early stopping. Greg Heath has suggested the term
"design" set be used for cases that are used solely to adjust the weights in a
network, while "training" set be used to encompass both design and validation
sets. There is considerable merit to this suggestion, but it has not yet been
widely adopted. 

But things can get more complicated. Suppose you want to train nets with 5,
10, and 20 hidden units using maximum likelihood, and you want to train nets
with 20 and 50 hidden units using early stopping. You also want to use a
validation set to choose the best of these various networks. Should you use
the same validation set for early stopping that you use for the final network
choice, or should you use two separate validation sets? That is, you could
divide the sample into 3 subsets, say A, B, and C, and proceed as follows (a
code sketch of this first approach appears after the list): 

 o Do maximum likelihood using A. 
 o Do early stopping with A to adjust the weights and B to decide when to stop
   (this makes B a validation set). 
 o Choose among all 3 nets trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on B (the validation
   set). 
 o Estimate the generalization error of the chosen network using C (the test
   set). 
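
Here is the promised code sketch of this first approach (in Python;
train_ml, train_es, and error are hypothetical stand-ins for your actual
training routines and error function, since only the data-splitting and
selection logic is the point):

   import random

   def split_sample(sample, fractions=(0.5, 0.25, 0.25)):
       """Shuffle the sample and split it into subsets A, B, and C."""
       sample = sample[:]               # leave the caller's list alone
       random.shuffle(sample)
       n = len(sample)
       n_a = int(fractions[0] * n)
       n_b = int(fractions[1] * n)
       return sample[:n_a], sample[n_a:n_a + n_b], sample[n_a + n_b:]

   def select_network(sample, train_ml, train_es, error):
       A, B, C = split_sample(sample)
       candidates = []
       for h in (5, 10, 20):            # maximum likelihood on A
           candidates.append(train_ml(A, hidden_units=h))
       for h in (20, 50):               # early stopping: fit weights on A,
           candidates.append(train_es(A, B, hidden_units=h))  # stop on B
       best = min(candidates, key=lambda net: error(net, B))  # B: validation
       return best, error(best, C)      # C: test, an unbiased estimate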

Or you could divide the sample into 4 subsets, say A, B, C, and D and proceed
as follows: 

 o Do maximum likelihood using A and B combined. 
 o Do early stopping with A to adjust the weights and B to decide when to stop
   (this makes B a validation set with respect to early stopping). 
 o Choose among all 3 nets trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on C (this makes C a
   second validation set). 
 o Estimate the generalization error of the chosen network using D (the test
   set). 

Or, with the same 4 subsets, you could take a third approach: 

 o Do maximum likelihood using A. 
 o Choose among the 3 nets trained by maximum likelihood based on the error
   computed on B (the first validation set) 
 o Do early stopping with A to adjust the weights and B (the first validation
   set) to decide when to stop. 
 o Choose among the best net trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on C (the second
   validation set). 
 o Estimate the generalization error of the chosen network using D (the test
   set). 

You could argue that the first approach is biased towards choosing a net
trained by early stopping. Early stopping involves a choice among a potentially
large number of networks, and therefore provides more opportunity for
overfitting the validation set than does the choice among only 3 networks
trained by maximum likelihood. Hence if you make the final choice of networks
using the same validation set (B) that was used for early stopping, you give an
unfair advantage to early stopping. If you are writing an article to compare
various training methods, this bias could be a serious flaw. But if you are using
NNs for some practical application, this bias might not matter at all, since you
obtain an honest estimate of generalization error using C. 

You could also argue that the second and third approaches are too wasteful in
their use of data. This objection could be important if your sample contains 100
cases, but will probably be of little concern if your sample contains
100,000,000 cases. For small samples, there are other methods that make more
efficient use of data; see "What are cross-validation and bootstrapping?" 

References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

------------------------------------------------------------------------

Subject: How are NNs related to statistical methods? 
=====================================================

There is considerable overlap between the fields of neural networks and
statistics. Statistics is concerned with data analysis. In neural network
terminology, statistical inference means learning to generalize from noisy
data. Some neural networks are not concerned with data analysis (e.g., those
intended to model biological systems) and therefore have little to do with
statistics. Some neural networks do not learn (e.g., Hopfield nets) and
therefore have little to do with statistics. Some neural networks can learn
successfully only from noise-free data (e.g., ART or the perceptron rule) and
therefore would not be considered statistical methods. But most neural
networks that can learn to generalize effectively from noisy data are similar or
identical to statistical methods. For example: 

 o Feedforward nets with no hidden layer (including functional-link neural
   nets and higher-order neural nets) are basically generalized linear
   models (see the sketch after this list). 
 o Feedforward nets with one hidden layer are closely related to projection
   pursuit regression. 
 o Probabilistic neural nets are identical to kernel discriminant analysis. 
 o Kohonen nets for adaptive vector quantization are very similar to k-means
   cluster analysis. 
 o Hebbian learning is closely related to principal component analysis. 
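
To make the first equivalence in the list concrete, here is a toy sketch (in
Python, with made-up data): a feedforward net with no hidden layer and a
single logistic output unit, trained by gradient descent on the
cross-entropy criterion, computes exactly a logistic regression, which is a
generalized linear model.

   import math

   def train_logistic_net(cases, targets, rate=0.1, epochs=2000):
       n_inputs = len(cases[0])
       w = [0.0] * n_inputs             # input -> output weights
       b = 0.0                          # bias
       for _ in range(epochs):
           for x, t in zip(cases, targets):
               net = sum(wi * xi for wi, xi in zip(w, x)) + b
               p = 1.0 / (1.0 + math.exp(-net))  # logistic activation
               err = p - t              # gradient of cross entropy w.r.t. net
               for i in range(n_inputs):
                   w[i] -= rate * err * x[i]
               b -= rate * err
       return w, b

   cases = [[0.0], [1.0], [2.0], [3.0]]
   targets = [0, 0, 1, 1]
   print(train_logistic_net(cases, targets))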

Some neural network areas that appear to have no close relatives in the
existing statistical literature are: 

 o Kohonen's self-organizing maps. 
 o Reinforcement learning (although this is treated in the operations
   research literature on Markov decision processes). 
 o Stopped training (the purpose and effect of stopped training are similar to
   shrinkage estimation, but the method is quite different). 

Feedforward nets are a subset of the class of nonlinear regression and
discrimination models. Statisticians have studied the properties of this general
class but had not considered the specific case of feedforward neural nets
before such networks were popularized in the neural network field. Still, many
results from the statistical theory of nonlinear models apply directly to
feedforward nets, and the methods that are commonly used for fitting
nonlinear models, such as various Levenberg-Marquardt and conjugate
gradient algorithms, can be used to train feedforward nets. 

While neural nets are often defined in terms of their algorithms or
implementations, statistical methods are usually defined in terms of their
results. The arithmetic mean, for example, can be computed by a (very simple)
backprop net, by applying the usual formula SUM(x_i)/n, or by various other
methods. What you get is still an arithmetic mean regardless of how you
compute it. So a statistician would consider standard backprop, Quickprop,
and Levenberg-Marquardt to be different algorithms for implementing the same
statistical model, such as a feedforward net. On the other hand, different
training criteria, such as least squares and cross entropy, are viewed by
statisticians as fundamentally different estimation methods with different
statistical properties. 
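
Here is that point in miniature (a Python sketch): a "network" consisting of
a single bias unit, trained by gradient descent on the least-squares
criterion, converges to the same arithmetic mean that the usual formula
computes.

   data = [2.0, 4.0, 9.0]

   mean_by_formula = sum(data) / len(data)          # the usual formula

   b = 0.0                                          # the lone weight (bias)
   for _ in range(1000):
       grad = sum(b - y for y in data) / len(data)  # gradient of the
       b -= 0.1 * grad                              # mean squared error

   print(mean_by_formula, round(b, 4))              # same mean, two algorithms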

It is sometimes claimed that neural networks, unlike statistical models,
require no distributional assumptions. In fact, neural networks involve exactly
the same sort of distributional assumptions as statistical models, but
statisticians study the consequences and importance of these assumptions
while most neural networkers ignore them. For example, least-squares
training methods are widely used by statisticians and neural networkers.
Statisticians realize that least-squares training involves implicit distributional
assumptions in that least-squares estimates have certain optimality
properties for noise that is normally distributed with equal variance for all
training cases and that is independent between different cases. These
optimality properties are consequences of the fact that least-squares
estimation is maximum likelihood under those conditions. Similarly,
cross-entropy is maximum likelihood for noise with a Bernoulli distribution. If
you study the distributional assumptions, then you can recognize and deal with
violations of the assumptions. For example, if you have normally distributed
noise but some training cases have greater noise variance than others, then you
may be able to use weighted least squares instead of ordinary least squares to
obtain more efficient estimates. 
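
For instance (a sketch with made-up numbers, assuming the noise variances
are known): when estimating a single location parameter, ordinary least
squares weights all cases equally, while weighted least squares weights each
case by the reciprocal of its noise variance.

   values    = [5.1, 4.8, 5.3, 9.0]     # last case is from a noisy source
   variances = [1.0, 1.0, 1.0, 25.0]    # assumed known noise variances

   ols = sum(values) / len(values)      # ordinary least squares: plain mean

   weights = [1.0 / v for v in variances]
   wls = sum(w * y for w, y in zip(weights, values)) / sum(weights)

   print(round(ols, 3), round(wls, 3))  # WLS is pulled less by the noisy case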

Hundreds, perhaps thousands of people have run comparisons of neural nets
with "traditional statistics" (whatever that means). Most such studies involve
one or two data sets, and are of little use to anyone else unless they happen to
be analyzing the same kind of data. But there is an impressive comparative
study of supervised classification by Michie, Spiegelhalter, and Taylor (1994),
and an excellent comparison of unsupervised Kohonen networks and k-means
clustering by Balakrishnan, Cooper, Jacob, and Lewis (1994). 

Communication between statisticians and neural net researchers is often
hindered by the different terminology used in the two fields. There is a
comparison of neural net and statistical jargon in 
ftp://ftp.sas.com/pub/neural/jargon 

References: 

   Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
   study of the classification capabilities of neural networks using
   unsupervised learning: A comparison with k-means clustering",
   Psychometrika, 59, 509-525. 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Chatfield, C. (1993), "Neural networks: Forecasting breakthrough or
   passing fad?", International Journal of Forecasting, 9, 1-3. 

   Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review from
   a Statistical Perspective", Statistical Science, 9, 2-54. 

   Cherkassky, V., Friedman, J.H., and Wechsler, H., eds. (1994), From
   Statistics to Neural Networks: Theory and Pattern Recognition
   Applications, Berlin: Springer-Verlag. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Kuan, C.-M. and White, H. (1994), "Artificial Neural Networks: An
   Econometric Perspective", Econometric Reviews, 13, 1-91. 

   Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for
   Constrained and Unconstrained Systems, Springer-Verlag. 

   Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994), Machine Learning,
   Neural and Statistical Classification, Ellis Horwood. 

   Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E.
   Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., Networks and
   Chaos: Statistical and Probabilistic Aspects, Chapman & Hall. ISBN 0 412
   46530 2. 

   Ripley, B.D. (1994), "Neural Networks and Related Methods for
   Classification," Journal of the Royal Statistical Society, Series B, 56,
   409-456. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Sarle, W.S. (1994), "Neural Networks and Statistical Models," Proceedings
   of the Nineteenth Annual SAS Users Group International Conference,
   Cary, NC: SAS Institute, pp. 1538-1550.
   (ftp://ftp.sas.com/pub/neural/neural1.ps) 

   White, H. (1989), "Learning in Artificial Neural Networks: A Statistical
   Perspective," Neural Computation, 1, 425-464. 

   White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical Assoc.,
   84, 1008-1013. 

   White, H. (1992), Artificial Neural Networks: Approximation and Learning
   Theory, Blackwell. 

------------------------------------------------------------------------

Next part is part 2 (of 7). 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
 *** Do not send me unsolicited commercial or political email! ***

