Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!howland.reston.ans.net!news.sprintlink.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: FAQ: NN vs. statistics
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <D3E3tH.L2C@unx.sas.com>
Date: Thu, 2 Feb 1995 20:34:29 GMT
Nntp-Posting-Host: hotellng.unx.sas.com
Organization: SAS Institute Inc.
Lines: 104


Here's a proposal for an addition to the FAQ. It's rather long (but
could have been a lot longer), so I have arranged it with the most
important material at the beginning and less important at the end
to make it easier for people to suggest how much to cut off. I don't
know much about recurrent nets, Markov random fields, etc., so maybe
somebody else could contribute material on that sort of thing. I will
add references later. Opinions?
......................................................................
Q: How are neural networks related to statistical methods?

A: There is considerable overlap between the fields of neural
networks and statistics.

Statistics is concerned with data analysis. In neural network
terminology, statistical inference means learning to generalize from
noisy data. Some neural networks are not concerned with data analysis
(e.g., those intended to model biological systems), and some do not
learn at all (e.g., Hopfield nets); these have little to do with
statistics. Some
neural networks can learn successfully only from noise-free data (e.g.,
ART or the perceptron rule) and therefore would not be considered
statistical methods. But most neural networks that can learn to
generalize effectively from noisy data are similar or identical to
statistical methods. For example:

 * Feedforward nets with no hidden layer (including functional-link
   neural nets and higher-order neural nets) are basically
   generalized linear models.

 * Feedforward nets with one hidden layer are closely related
   to projection pursuit regression.

 * Probabilistic neural nets are identical to kernel discriminant
   analysis.

 * General regression neural nets are identical to Nadaraya-Watson
   kernel regression.

 * Kohonen nets for adaptive vector quantization are very similar
   to k-means cluster analysis.

 * Hebbian learning is closely related to principal component
   analysis.
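To make one of these equivalences concrete: a general regression neural
net outputs a kernel-weighted average of the training targets, which is
exactly the Nadaraya-Watson estimator. A minimal sketch in Python
(Gaussian kernel; the function name, bandwidth, and test data are
arbitrary choices for illustration):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.5):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    A general regression neural net computes this same weighted
    average: each training case contributes its target y_i, weighted
    by a kernel centered at its input x_i.
    """
    # Pairwise squared distances between query and training inputs.
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # kernel weights
    return (w @ y_train) / w.sum(axis=1)       # weighted average of targets

# Noisy samples of y = sin(x), for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

y_hat = nadaraya_watson(x, y, x, bandwidth=0.3)
print(np.max(np.abs(y_hat - np.sin(x))))   # smoothed fit tracks sin(x)
```

The bandwidth plays the same role as the "smoothing parameter" (sigma)
in Specht's GRNN formulation; choosing it well is the hard part in both
literatures.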

Some neural network areas that appear to have no close relatives in the
existing statistical literature are:

 * Kohonen's self-organizing maps.

 * Reinforcement learning.

 * Stopped training (its purpose and effect are similar to shrinkage
   estimation, but the method is quite different).
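The stopped-training idea can be sketched as: fit by gradient descent,
monitor error on held-out cases, and keep the weights from the step with
the lowest held-out error. In the sketch below the model is deliberately
a plain linear one, and the data, learning rate, and split are arbitrary
illustrations:

```python
import numpy as np

# Hypothetical sketch of stopped training on an over-parameterized
# linear least-squares fit (many weights, few cases).
rng = np.random.default_rng(7)
n, p = 40, 20
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = 1.0                       # only 3 weights truly nonzero
y = X @ true_w + rng.normal(scale=0.5, size=n)

X_tr, y_tr = X[:30], y[:30]            # training cases
X_va, y_va = X[30:], y[30:]            # held-out validation cases

w = np.zeros(p)
best_w, best_err = w.copy(), np.inf
for step in range(5000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.01 * grad
    err = np.mean((X_va @ w - y_va) ** 2)
    if err < best_err:                 # remember the best-so-far weights
        best_err, best_w = err, w.copy()

print(best_err, np.mean((X_va @ w - y_va) ** 2))
```

Like shrinkage, stopping early keeps the weights closer to their
starting value of zero than the fully converged least-squares solution
would be.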

Feedforward nets are a subset of the class of nonlinear regression and
discrimination models. Statisticians have studied the properties of this
general class but did not consider the specific case of feedforward
neural nets until such networks were popularized in the neural network
field. Still, many results from the statistical theory of nonlinear
models apply directly to feedforward nets, and the methods that are
commonly used for fitting nonlinear models, such as various
Levenberg-Marquardt and conjugate gradient algorithms, can be used to
train feedforward nets.
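As an illustration of that last point, a tiny feedforward net can be
trained with a general-purpose conjugate gradient optimizer by treating
training as ordinary nonlinear least squares. A sketch using
scipy.optimize.minimize (the architecture, data, and initialization
below are arbitrary choices for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Noise-free samples of y = x**2, for illustration only.
x = np.linspace(-1, 1, 21)
y = x ** 2

def unpack(p):
    # p = [hidden weights (3), hidden biases (3), output weights (3), output bias]
    return p[0:3], p[3:6], p[6:9], p[9]

def net(p, x):
    w1, b1, w2, b2 = unpack(p)
    h = np.tanh(np.outer(x, w1) + b1)   # 1-input, 3-hidden-unit tanh layer
    return h @ w2 + b2                  # linear output unit

def sse(p):
    # Least-squares training criterion.
    return np.sum((net(p, x) - y) ** 2)

rng = np.random.default_rng(1)
p0 = rng.normal(scale=0.5, size=10)     # random small initial weights
res = minimize(sse, p0, method='CG')    # conjugate gradient "training"
print(res.fun)                          # final sum of squared errors
```

Nothing here is specific to neural nets: the same call would fit any
nonlinear regression model for which sse can be evaluated.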

While neural nets are often defined in terms of their algorithms or
implementations, statistical methods are usually defined in terms of
their results. The arithmetic mean, for example, can be computed by a
(very simple) backprop net, by applying the usual formula SUM(x_i)/n, or
by various other methods. What you get is still an arithmetic mean
regardless of how you compute it. So a statistician would consider
standard backprop, Quickprop, and Levenberg-Marquardt to be different
algorithms for fitting the same statistical model, such as a
feedforward net. On the other hand, different training criteria, such as
least squares and cross entropy, are viewed by statisticians as
fundamentally different estimation methods with different statistical
properties.
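The arithmetic-mean example can be carried out literally: a "net"
consisting of a single bias weight, trained by gradient descent on
squared error, converges to SUM(x_i)/n. A minimal sketch (data,
learning rate, and iteration count chosen arbitrarily):

```python
import numpy as np

# The simplest possible backprop "net": one bias weight b, no inputs.
# Its least-squares solution is the arithmetic mean, however computed.
x = np.array([2.0, 4.0, 9.0, 1.0])

b, lr = 0.0, 0.1
for _ in range(200):
    grad = 2.0 * np.mean(b - x)   # d/db of the mean squared error
    b -= lr * grad

print(b, x.mean())                # b has converged to x.mean() == 4.0
```

Whether b is found by this iteration or by the closed-form formula, the
result is the same estimator with the same statistical properties.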

It is sometimes claimed that neural networks, unlike statistical models,
require no distributional assumptions. In fact, neural networks involve
exactly the same sort of distributional assumptions as statistical
models, but statisticians study the consequences and importance of these
assumptions while most neural networkers ignore them. For example,
least-squares training methods are widely used by statisticians and
neural networkers. Statisticians realize that least-squares training
involves implicit distributional assumptions in that least-squares
estimates have certain optimality properties for noise that is normally
distributed with equal variance for all training cases and that is
independent between different cases. These optimality properties are
consequences of the fact that least-squares estimation is maximum
likelihood under those conditions. Similarly, cross-entropy is maximum
likelihood for noise with a Bernoulli distribution. If you study the
distributional assumptions, then you can recognize and deal with
violations of the assumptions. For example, if you have normally
distributed noise but some training cases have greater noise variance
than others, then you may be able to use weighted least squares instead
of ordinary least squares to obtain more efficient estimates.
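A numerical sketch of that last point: fit a straight line when half the
cases are ten times noisier, once by ordinary least squares and once
weighting each case by its inverse noise variance (the data-generating
model and noise levels below are made up for illustration):

```python
import numpy as np

# Linear model y = 2x + noise; the second half of the cases
# have a noise standard deviation ten times larger.
rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(-1, 1, n)
sigma = np.where(np.arange(n) < n // 2, 0.1, 1.0)   # unequal noise levels
y = 2.0 * x + rng.normal(scale=sigma)

X = x[:, None]

# Ordinary least squares: every case weighted equally.
ols = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Weighted least squares: weight each case by 1/variance,
# implemented by rescaling rows before an ordinary fit.
s = 1.0 / sigma                                     # sqrt of the weights
wls = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)[0][0]

print(ols, wls)   # both near the true slope 2; WLS is more efficient
```

Both estimators are unbiased here; the gain from weighting shows up as
a smaller sampling variance of the WLS slope over repeated samples.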


-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
