Newsgroups: comp.ai.neural-nets,comp.answers,news.answers
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!nntp.sei.cmu.edu!news.psc.edu!scramble.lm.com!news.math.psu.edu!news.cse.psu.edu!uwm.edu!cs.utexas.edu!howland.reston.ans.net!gatech!news.mathworks.com!zombie.ncsc.mil!newsgate.duke.edu!interpath!news.interpath.net!sas!newshost.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn3.posting_836017220@hotellng.unx.sas.com>
Supersedes: <nn3.posting_833338818@hotellng.unx.sas.com>
Approved: news-answers-request@MIT.EDU
Date: Sat, 29 Jun 1996 03:00:21 GMT
Expires: Sat, 3 Aug 1996 03:00:20 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: frequently asked questions, answers
Followup-To: comp.ai.neural-nets
Lines: 1069
Xref: glinda.oz.cs.cmu.edu comp.ai.neural-nets:32247 comp.answers:19514 news.answers:75436

Archive-name: ai-faq/neural-nets/part3
Last-modified: 1996-06-25
URL: ftp://ftp.sas.com/pub/neural/FAQ3.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

This is part 3 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See part 1 of this posting for full information on
what it is all about.

========== Questions ========== 
********************************

Part 1: Introduction
Part 2: Learning
Part 3: Generalization

   How is generalization possible?
   How does noise affect generalization?
   What is overfitting and how can I avoid it?
   What is jitter? (Training with noise)
   What is early stopping?
   What is weight decay?
   What is Bayesian estimation?
   How many hidden layers should I use?
   How many hidden units should I use?
   How can generalization error be estimated?
   What are cross-validation and bootstrapping?

Part 4: Books, data, etc.
Part 5: Free software
Part 6: Commercial software
Part 7: Hardware

------------------------------------------------------------------------

Subject: How is generalization possible? 
=========================================

During learning, the outputs of a supervised neural net come to approximate
the target values given the inputs in the training set. This ability may be
useful in itself, but more often the purpose of using a neural net is to
generalize--i.e., to have the outputs of the net approximate target values
given inputs that are not in the training set. Generalization is not always
possible. There are two conditions that are typically necessary (although
not sufficient) for good generalization. 

The first necessary condition is that the function you are trying to learn
(that relates inputs to correct outputs) be, in some sense, smooth. In other
words, a small change in the inputs should, most of the time, produce a
small change in the outputs. For continuous inputs and targets, smoothness
of the function implies continuity and restrictions on the first derivative
over most of the input space. Some neural nets can learn discontinuities as
long as the function consists of a finite number of continuous pieces. Very
nonsmooth functions such as those produced by pseudo-random number
generators and encryption algorithms cannot be generalized by neural nets.
Often a nonlinear transformation of the input space can increase the
smoothness of the function and improve generalization. 

For Boolean functions, the concept of smoothness is more elusive. It seems
intuitively clear that a Boolean network with a small number of hidden units
and small weights will compute a "smoother" input-output function than a
network with many hidden units and large weights. If you know a good
reference characterizing Boolean functions for which good generalization is
possible, please inform the FAQ maintainer (saswss@unx.sas.com). 

The second necessary condition for good generalization is that the training
cases be a sufficiently large and representative subset ("sample" in
statistical terminology) of the set of all cases that you want to generalize
to (the "population" in statistical terminology). The importance of this
condition is related to the fact that there are, loosely speaking, two
different types of generalization: interpolation and extrapolation.
Interpolation applies to cases that are more or less surrounded by nearby
training cases; everything else is extrapolation. In particular, cases that
are outside the range of the training data require extrapolation. Cases
inside large "holes" in the training data may also effectively require
extrapolation. Interpolation can often be done reliably, but extrapolation
is notoriously unreliable. Hence it is important to have sufficient training
data to avoid the need for extrapolation. Methods for selecting good
training sets are discussed in numerous statistical textbooks on sample
surveys and experimental design. 

Thus, for an input-output function that is smooth, if you have a test case
that is close to some training cases, the correct output for the test case
will be close to the correct outputs for those training cases. If you have
an adequate sample for your training set, every case in the population will
be close to a sufficient number of training cases. Hence, under these
conditions and with proper training, a neural net will be able to generalize
reliably to the population. 

If you have more information about the function, e.g. that the outputs
should be linearly related to the inputs, you can often take advantage of
this information by placing constraints on the network or by fitting a more
specific model, such as a linear model, to improve generalization.
Extrapolation is much more reliable in linear models than in flexible
nonlinear models, although still not nearly as safe as interpolation. You
can also use such information to choose the training cases more efficiently.
For example, with a linear model, you should choose training cases at the
outer limits of the input space instead of distributing them evenly
throughout the input space. 
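The contrast between interpolation and extrapolation is easy to demonstrate
numerically. Here is a minimal numpy sketch with made-up data, using a
high-degree polynomial fit as a hypothetical stand-in for a flexible neural
net: a model that interpolates accurately can still extrapolate wildly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth target function with mild noise, sampled on [-1, 1].
x_train = np.linspace(-1.0, 1.0, 21)
y_train = np.sin(np.pi * x_train) + rng.normal(0.0, 0.05, x_train.shape)

# Fit a flexible model (degree-9 polynomial standing in for a neural net).
coefs = np.polyfit(x_train, y_train, 9)

# Interpolation: test points surrounded by training data.
x_interp = np.array([0.05, -0.55])
err_interp = np.abs(np.polyval(coefs, x_interp) - np.sin(np.pi * x_interp))

# Extrapolation: test points outside the training range.
x_extrap = np.array([1.5, 2.0])
err_extrap = np.abs(np.polyval(coefs, x_extrap) - np.sin(np.pi * x_extrap))

print(err_interp.max())   # small: interpolation is reliable here
print(err_extrap.max())   # typically orders of magnitude larger
```

The same qualitative behavior occurs with MLPs and other flexible models:
accuracy inside the convex hull of the training data says little about
accuracy outside it.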

------------------------------------------------------------------------

Subject: How does noise affect generalization? 
===============================================

Noise in the actual data is never a good thing, since it limits the accuracy
of generalization that can be achieved no matter how extensive the training
set is. On the other hand, injecting artificial noise (jitter) into the
inputs during training is one of several ways to improve generalization for
smooth functions when you have a small training set. 

Certain assumptions about noise are necessary for theoretical results.
Usually, the noise distribution is assumed to have zero mean and finite
variance. The noise in different cases is usually assumed to be independent
or to follow some known stochastic model, such as an autoregressive process.
The more you know about the noise distribution, the more effectively you can
train the network (e.g., McCullagh and Nelder 1989). 

If you have noise in the target values, the mean squared generalization
error can never be less than the variance of the noise, no matter how much
training data you have. But you can estimate the mean of the target
values, conditional on a given set of inputs, to any desired degree of
accuracy by obtaining a sufficiently large and representative training set,
assuming that the function you are trying to learn is one that can indeed be
learned by the type of net you are using. 
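The noise-variance floor is easy to see in a small simulation with made-up
data: even the ideal predictor, the true conditional mean, cannot push the
mean squared error below the variance of the target noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Targets are a smooth function of the input plus noise of variance 0.25.
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(x) + rng.normal(0.0, 0.5, n)

# Even the ideal predictor -- the true conditional mean sin(x) --
# cannot achieve mean squared error below the noise variance.
mse_ideal = np.mean((y - np.sin(x)) ** 2)
print(mse_ideal)   # close to 0.25
```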

Noise in the target values is exacerbated by overfitting (Moody 1992). 

Noise in the inputs also limits the accuracy of generalization, but in a
more complicated way than does noise in the targets. In a region of the
input space where the function being learned is fairly flat, input noise
will have little effect. In regions where that function is steep, input
noise can degrade generalization severely. 

Furthermore, if the target function is Y=f(X), but you observe noisy inputs
X+D, you cannot obtain an arbitrarily accurate estimate of f(X) given X+D no
matter how large a training set you use. The net will not learn f(X), but
will instead learn a convolution of f(X) with the distribution of the noise
D (see "What is jitter?").
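A small simulation (hypothetical data) shows the effect: with a steep target
f and noisy inputs, the best attainable prediction at a given observed input
value is a smoothed (convolved) version of f, not f itself.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

x = rng.uniform(-2.0, 2.0, n)     # true inputs (never observed)
d = rng.normal(0.0, 0.5, n)       # input noise D
y = np.sign(x)                    # steep (discontinuous) target f(X)
x_obs = x + d                     # what the net actually sees

# The best prediction from observed inputs near 0.3 is the conditional
# mean of y, i.e. f smoothed by the noise distribution -- not f(0.3) = 1.
near = np.abs(x_obs - 0.3) < 0.05
print(y[near].mean())   # noticeably below 1
```

No amount of training data changes this: the conditional mean given the
noisy input is the best any estimator can do.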

For more details, see one of the statistically-oriented references on neural
nets such as: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press, especially section 6.4. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
   Generalization and Regularization in Nonlinear Learning Systems", NIPS 4,
   847-854. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

------------------------------------------------------------------------

Subject: What is overfitting and how can I avoid it? 
=====================================================

The critical issue in developing a neural network is generalization: how
well will the network make predictions for cases that are not in the
training set? NNs, like other flexible nonlinear estimation methods such as
kernel regression and smoothing splines, can suffer from either underfitting
or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting.
A network that is too complex may fit the noise, not just the signal,
leading to overfitting. Overfitting is especially dangerous because it can
easily lead to predictions that are far beyond the range of the training
data with many of the common types of NNs. But underfitting can also produce
wild predictions in multilayer perceptrons, even with noise-free data. 

For an elementary discussion of overfitting, see Smith (1993). For a more
rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992)
on the bias/variance trade-off (it's not really a dilemma). We are talking
statistical bias here: the difference between the average value of an
estimator and the correct value. Underfitting produces excessive bias in the
outputs, whereas overfitting produces excessive variance. There are
graphical examples of overfitting and underfitting in Sarle (1995). 

The best way to avoid overfitting is to use lots of training data. If you
have at least 30 times as many training cases as there are weights in the
network, you are unlikely to suffer from overfitting. But you can't
arbitrarily reduce the number of weights, because doing so risks underfitting. 

Given a fixed amount of training data, there are at least five effective
approaches to avoiding underfitting and overfitting, and hence getting good
generalization: 

 o Model selection 
 o Jittering 
 o Weight decay 
 o Early stopping 
 o Bayesian estimation 

These approaches are discussed in more detail under subsequent questions. 

The complexity of a network is related to both the number of weights and the
size of the weights. Model selection is concerned with the number of
weights, and hence the number of hidden units and layers. The more weights
there are, relative to the number of training cases, the more overfitting
amplifies noise in the targets (Moody 1992). The other approaches listed
above are concerned, directly or indirectly, with the size of the weights.
Reducing the size of the weights reduces the "effective" number of
weights--see Moody (1992) regarding weight decay and Weigend (1994)
regarding early stopping. 

References: 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
   Generalization and Regularization in Nonlinear Learning Systems", NIPS 4,
   847-854. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

   Smith, M. (1993), Neural Networks for Statistical Modeling, NY: Van
   Nostrand Reinhold. 

   Weigend, A. (1994), "On overfitting and the effective number of hidden
   units," Proceedings of the 1993 Connectionist Models Summer School,
   335-342. 

------------------------------------------------------------------------

Subject: What is jitter? (Training with noise) 
===============================================

Jitter is artificial noise deliberately added to the inputs during training.
Training with jitter is a form of smoothing related to kernel regression
(see "What is GRNN?"). It is also closely related to regularization methods
such as weight decay and ridge regression. 

Training with jitter works because the functions that we want NNs to learn
are mostly smooth. NNs can learn functions with discontinuities, but the
functions must be piecewise continuous in a finite number of regions if our
network is restricted to a finite number of hidden units. 

In other words, if we have two cases with similar inputs, the desired
outputs will usually be similar. That means we can take any training case
and generate new training cases by adding small amounts of jitter to the
inputs. As long as the amount of jitter is sufficiently small, we can assume
that the desired output will not change enough to be of any consequence, so
we can just use the same target value. The more training cases, the merrier,
so this looks like a convenient way to improve training. But too much jitter
will obviously produce garbage, while too little jitter will have little
effect (Koistinen and Holmstro\"m 1992). 
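In practice, generating the jittered training set is a one-liner. A sketch
with hypothetical data, making c noisy copies of each case with Gaussian
jitter of standard deviation s and leaving the targets unchanged:

```python
import numpy as np

rng = np.random.default_rng(4)

X = rng.uniform(-1.0, 1.0, size=(50, 3))   # 50 original cases, 3 inputs
Y = X.sum(axis=1, keepdims=True)           # stand-in target values

c, s = 10, 0.1   # copies per case and jitter std dev (both need tuning)

# Each original case yields c jittered cases: noisy inputs, same targets.
X_jit = np.repeat(X, c, axis=0) + rng.normal(0.0, s, size=(50 * c, 3))
Y_jit = np.repeat(Y, c, axis=0)
print(X_jit.shape, Y_jit.shape)   # (500, 3) (500, 1)
```

In practice, fresh jitter is often drawn on every training epoch rather
than generating one fixed enlarged set.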

Consider any point in the input space, not necessarily one of the original
training cases. That point could possibly arise as a jittered input as a
result of jittering any of several of the original neighboring training
cases. The average target value at the given input point will be a weighted
average of the target values of the original training cases. For an infinite
number of jittered cases, the weights will be proportional to the
probability densities of the jitter distribution, located at the original
training cases and evaluated at the given input point. Thus the average
target values given an infinite number of jittered cases will, by
definition, be the Nadaraya-Watson kernel regression estimator using the
jitter density as the kernel. Hence, training with jitter is an
approximation to training with the kernel regression estimator as target.
Choosing the amount (variance) of jitter is equivalent to choosing the
bandwidth of the kernel regression estimator (Scott 1992). 

When studying nonlinear models such as feedforward NNs, it is often helpful
first to consider what happens in linear models, and then to see what
difference the nonlinearity makes. So let's consider training with jitter in
a linear model. Notation: 

   x_ij is the value of the jth input (j=1, ..., p) for the
        ith training case (i=1, ..., n).
   X={x_ij} is an n by p matrix.
   y_i is the target value for the ith training case.
   Y={y_i} is a column vector.

Without jitter, the least-squares weights are B = inv(X'X)X'Y, where
"inv" indicates a matrix inverse and "'" indicates transposition. Note that
if we replicate each training case c times, or equivalently stack c copies
of the X and Y matrices on top of each other, the least-squares weights are
inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before. 

With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ...,
c, where z_ijk is produced by some random number generator, usually with
a normal distribution with mean 0 and standard deviation s, and the 
z_ijk's are all independent. In place of the n by p matrix X, this
gives us a big matrix, say Q, with cn rows and p columns. To compute the
least-squares weights, we need Q'Q. Let's consider the jth diagonal
element of Q'Q, which is 

                   2           2       2
   sum (x_ij+z_ijk) = sum (x_ij + z_ijk + 2 x_ij z_ijk)
   i,k                i,k

which is approximately, for c large, 

             2     2
   c(sum x_ij  + ns ) 
      i

which is c times the corresponding diagonal element of X'X plus ns^2.
Now consider the u,vth off-diagonal element of Q'Q, which is 

   sum (x_iu+z_iuk)(x_iv+z_ivk)
   i,k

which is approximately, for c large, 

   c(sum x_iu x_iv)
      i

which is just c times the corresponding element of X'X. Thus, Q'Q equals
c(X'X+ns^2I), where I is an identity matrix of appropriate size.
Similar computations show that the crossproduct of Q with the target values
is cX'Y. Hence the least-squares weights with jitter of variance s^2 are
given by 

       2                2                    2
   B(ns ) = inv(c(X'X+ns I))cX'Y = inv(X'X+ns I)X'Y

In the statistics literature, B(ns^2) is called a ridge regression
estimator with ridge value ns^2. 
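The derivation can be checked numerically: for large c, ordinary least
squares on the jittered data comes out close to the ridge estimator with
ridge value ns^2. A numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, c, s = 30, 4, 5000, 0.3

X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, 1)) + rng.normal(0.0, 0.1, (n, 1))

# c jittered copies of each input row; targets are simply replicated.
Q = np.repeat(X, c, axis=0) + rng.normal(0.0, s, (n * c, p))
Yc = np.repeat(Y, c, axis=0)

B_jitter = np.linalg.solve(Q.T @ Q, Q.T @ Yc)

# Ridge regression with ridge value n*s^2, per the derivation above.
B_ridge = np.linalg.solve(X.T @ X + n * s**2 * np.eye(p), X.T @ Y)

print(np.abs(B_jitter - B_ridge).max())   # small for large c
```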

If we were to add jitter to the target values Y, the cross-product X'Y
would not be affected for large c for the same reason that the off-diagonal
elements of X'X are not affected by jitter. Hence, adding jitter to the
targets will not change the optimal weights; it will just slow down
training. 

The ordinary least squares training criterion is (Y-XB)'(Y-XB).
Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B,
where d is the decay rate. Weight decay can also be implemented by
inventing artificial training cases. Augment the training data with p new
training cases containing the matrix dI for the inputs and a zero vector
for the targets. To put this in a formula, let's use A;B to indicate the
matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D.
Thus the augmented inputs are X;dI and the augmented targets are Y;0,
where 0 indicates the zero vector of the appropriate size. The squared error
for the augmented training data is: 

   (Y;0-(X;dI)B)'(Y;0-(X;dI)B)
   = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
   = Y'Y - 2Y'XB + B'(X'X+d^2I)B
   = Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
   = (Y-XB)'(Y-XB)+d^2B'B

which is the weight-decay training criterion. Thus the weight-decay
estimator is: 

    inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y

which is the same as the jitter estimator B(d^2), i.e. jitter with
variance d^2/n. The equivalence between the weight-decay estimator and
the jitter estimator does not hold for nonlinear models unless the jitter
variance is small relative to the curvature of the nonlinear function.
However, the equivalence of the two estimators for linear models suggests
that they will often produce similar results even for nonlinear models. 
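Unlike the jitter equivalence, the augmented-data construction is exact
algebra, and easy to verify with numpy on made-up data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, d = 40, 5, 0.7

X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, 1))

# Augment: p artificial cases with inputs d*I and zero targets.
X_aug = np.vstack([X, d * np.eye(p)])
Y_aug = np.vstack([Y, np.zeros((p, 1))])

B_aug = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]
B_ridge = np.linalg.solve(X.T @ X + d**2 * np.eye(p), X.T @ Y)

print(np.abs(B_aug - B_ridge).max())   # essentially zero: the two coincide
```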

B(0) is obviously the ordinary least-squares estimator. It can be shown
that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in
other words, adding jitter causes the weights to shrink. It can also be
shown that under the usual statistical assumptions, there always exists some
value of ns^2 > 0 such that B(ns^2) provides better expected
generalization than B(0). Unfortunately, there is no way to calculate a
value of ns^2 from the training data that is guaranteed to improve
generalization. There are other types of shrinkage estimators called Stein
estimators that do guarantee better generalization than B(0), but I'm not
aware of a nonlinear generalization of Stein estimators applicable to neural
networks. 

The statistics literature describes numerous methods for choosing the ridge
value. The most obvious way is to estimate the generalization error by
cross-validation, generalized cross-validation, or bootstrapping, and to
choose the ridge value that yields the smallest such estimate. There are
also quicker methods based on empirical Bayes estimation, one of which
yields the following formula, useful as a first guess: 

    2    p(Y-XB(0))'(Y-XB(0))
   s   = --------------------
    1      n(n-p)B(0)'B(0)

You can iterate this a few times: 

    2      p(Y-XB(0))'(Y-XB(0))
   s     = --------------------
    l+1              2     2
            n(n-p)B(s )'B(s )
                     l     l

Note that the more training cases you have, the less noise you need. 
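The first guess and the fixed-point iteration above translate directly into
code. A sketch on made-up data (the ridge value passed to the estimator is
n*s^2; note the numerator always uses the least-squares residuals, and only
B(s^2) is updated):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 4

X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, 1)) + rng.normal(0.0, 0.5, (n, 1))

def ridge(X, Y, ridge_value):
    return np.linalg.solve(X.T @ X + ridge_value * np.eye(X.shape[1]), X.T @ Y)

B0 = ridge(X, Y, 0.0)               # ordinary least squares B(0)
rss = np.sum((Y - X @ B0) ** 2)     # (Y-XB(0))'(Y-XB(0)), held fixed

# First guess, then a few fixed-point iterations.
s2 = p * rss / (n * (n - p) * np.sum(B0 ** 2))
for _ in range(5):
    B = ridge(X, Y, n * s2)
    s2 = p * rss / (n * (n - p) * np.sum(B ** 2))

print(s2)   # estimated jitter variance for B(ns^2)
```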

References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Holmstro\"m, L. and Koistinen, P. (1992) "Using additive noise in
   back-propagation training", IEEE Transactions on Neural Networks, 3,
   24-38. 

   Koistinen, P. and Holmstro\"m, L. (1992) "Kernel regression and
   backpropagation training with noise," NIPS 4, 1033-1039. 

   Scott, D.W. (1992) Multivariate Density Estimation, Wiley. 

   Vinod, H.D. and Ullah, A. (1981) Recent Advances in Regression Methods,
   NY: Marcel-Dekker. 

------------------------------------------------------------------------

Subject: What is early stopping? 
=================================

NN practitioners often use nets with many times as many parameters as
training cases. E.g., Nelson and Illingworth (1991, p. 165) discuss training
a network with 16,219 parameters on only 50 training cases! The method
used is called early stopping or stopped training and proceeds as follows: 

1. Divide the available data into training and validation sets. 
2. Use a large number of hidden units. 
3. Use very small random initial values. 
4. Use a slow learning rate. 
5. Compute the validation error rate periodically during training. 
6. Stop training when the validation error rate "starts to go up". 
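The six steps can be sketched on a toy linear problem (made-up data, with
plain gradient descent standing in for backprop). Rather than literally
stopping at the first upturn, this version trains for a fixed number of
iterations and keeps the weights with the lowest validation error, the
safest variant discussed below:

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical noisy linear problem, split into training and validation sets.
X = rng.normal(size=(80, 10))
Y = X @ rng.normal(size=(10, 1)) + rng.normal(0.0, 1.0, (80, 1))
Xt, Yt, Xv, Yv = X[:60], Y[:60], X[60:], Y[60:]

W = rng.normal(0.0, 0.01, (10, 1))   # step 3: very small random initial weights
lr = 0.001                           # step 4: a slow learning rate

best_err, best_W = np.inf, W.copy()
for epoch in range(2000):
    grad = 2.0 * Xt.T @ (Xt @ W - Yt) / len(Xt)
    W -= lr * grad
    if epoch % 10 == 0:              # step 5: check validation error periodically
        val_err = np.mean((Xv @ W - Yv) ** 2)
        if val_err < best_err:       # step 6: remember the best weights so far
            best_err, best_W = val_err, W.copy()

print(best_err)   # validation error at the chosen stopping point
```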

It is crucial to realize that the validation error is not a good estimate
of the generalization error. One method for getting an unbiased estimate of
the generalization error is to run the net on a third set of data, the test
set, that is not used at all during the training process. For other methods,
see "How can generalization error be estimated?" 

Early stopping has several advantages: 

 o It is fast. 
 o It can be applied successfully to networks in which the number of weights
   far exceeds the sample size. 
 o It requires only one major decision by the user: what proportion of
   validation cases to use. 

But there are several unresolved practical issues in early stopping: 

 o How many cases do you assign to the training and validation sets? Rules
   of thumb abound, but appear to be no more than folklore. The only
   systematic results known to the FAQ maintainer are in Sarle (1995), which
   deals only with the case of a single input. Amari et al. (1995) attempts
   a theoretical approach but contains serious errors that completely
   invalidate the results, especially the incorrect assumption that the
   direction of approach to the optimum is distributed isotropically. 
 o Do you split the data into training and validation sets randomly or by
   some systematic algorithm? 
 o How do you tell when the validation error rate "starts to go up"? It may
   go up and down numerous times during training. The safest approach is to
   train to convergence, then go back and see which iteration had the lowest
   validation error. For more elaborate algorithms, see section 3.3 of 
   ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.gz. 

Statisticians tend to be skeptical of stopped training because it appears to
be statistically inefficient due to the use of the split-sample technique;
i.e., neither training nor validation makes use of the entire sample, and
because the usual statistical theory does not apply. However, there has been
recent progress addressing both of the above concerns (Wang 1994). 

Early stopping is closely related to ridge regression. If the learning rate
is sufficiently small, the sequence of weight vectors on each iteration will
approximate the path of continuous steepest descent down the error function.
Early stopping chooses a point along this path that optimizes an estimate of
the generalization error computed from the validation set. Ridge regression
also defines a path of weight vectors by varying the ridge value. The ridge
value is often chosen by optimizing an estimate of the generalization error
computed by cross-validation, generalized cross-validation, or bootstrapping
(see "What are cross-validation and bootstrapping?") There always exists a
positive ridge value that will improve the expected generalization error in
a linear model. A similar result has been obtained for early stopping in
linear models (Wang, Venkatesh, and Judd 1994). In linear models, the ridge
path lies close to, but does not coincide with, the path of continuous
steepest descent; in nonlinear models, the two paths can diverge widely. The
relationship is explored in more detail by Sjo\"berg and Ljung (1992). 

References: 

   Amari, S., Murata, N., Muller, K.-R., Finke, M., and Yang, H. (1995),
   Asymptotic Statistical Theory of Overtraining and Cross-Validation,
   METR 95-06, Department of Mathematical Engineering and Information
   Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan. 

   Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model
   selection by nonconvergent methods," Neural Networks, 6, 771-783. 

   Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural
   Nets, Reading, MA: Addison-Wesley. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

   Sjo\"berg, J. and Ljung, L. (1992), "Overtraining, Regularization, and
   Searching for Minimum in Neural Networks," Technical Report
   LiTH-ISY-I-1297, Department of Electrical Engineering, Linkoping
   University, S-581 83 Linkoping, Sweden, http://www.control.isy.liu.se . 

   Wang, C. (1994), A Theory of Generalisation in Learning Machines with
   Neural Network Application, Ph.D. thesis, University of Pennsylvania. 

   Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and
   Effective Machine Complexity in Learning," NIPS 6, 303-310. 

   Weigend, A. (1994), "On overfitting and the effective number of hidden
   units," Proceedings of the 1993 Connectionist Models Summer School,
   335-342. 

------------------------------------------------------------------------

Subject: What is weight decay? 
===============================

Weight decay adds a penalty term to the error function. The usual penalty is
the sum of squared weights times a decay constant. In a linear model, this
form of weight decay is equivalent to ridge regression. See "What is
jitter?" for more explanation of ridge regression. 

Weight decay is a subset of regularization methods. The penalty term in
weight decay, by definition, penalizes large weights. Other regularization
methods may involve not only the weights but various derivatives of the
output function (Bishop 1995). 

The weight decay penalty term causes the weights to converge to smaller
absolute values than they otherwise would. Large weights can hurt
generalization in two different ways. Excessively large weights leading to
hidden units can cause the output function to be too rough, possibly with
near discontinuities. Excessively large weights leading to output units can
cause wild outputs far beyond the range of the data if the output activation
function is not bounded to the same range as the data. To put it another
way, large weights can cause excessive variance of the output (Geman,
Bienenstock, and Doursat 1992). 

Other penalty terms besides the sum of squared weights are sometimes used. 
Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses: 

          (w_i)^2
   sum -------------
    i  (w_i)^2 + c^2

where w_i is the ith weight and c is a user-specified constant. Whereas
decay using the sum of squared weights tends to shrink the large
coefficients more than the small ones, weight elimination tends to shrink
the small coefficients more, and is therefore more useful for suggesting
subset models (pruning). 
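A quick numeric look at the penalty (toy weights, c = 1) makes the shrinkage
behavior concrete: each term saturates near 1 for weights much larger than
c, so the gradient, and hence the pull toward zero, on a large weight is
far weaker than on a small one.

```python
import numpy as np

def weight_elimination(w, c):
    # Each term is ~(w_i/c)^2 for |w_i| << c and saturates at 1 for |w_i| >> c.
    return np.sum(w**2 / (w**2 + c**2))

w = np.array([5.0, 0.1])   # one large weight, one small weight
c = 1.0
print(weight_elimination(w, c))   # about 0.97

# Gradient of each term: 2*w_i*c^2 / (w_i^2 + c^2)^2.
# Compare with sum-of-squares decay, whose gradient is simply 2*w_i.
grad = 2 * w * c**2 / (w**2 + c**2) ** 2
print(grad)   # the large weight feels far less pull toward zero
```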

The generalization ability of the network can depend crucially on the decay
constant, especially with small training sets. One approach to choosing the
decay constant is to train several networks with different amounts of decay
and estimate the generalization error for each; then choose the decay
constant that minimizes the estimated generalization error. Weigend,
Rumelhart, and Huberman (1991) iteratively update the decay constant during
training. 

There are other important considerations for getting good results from
weight decay. You must either standardize the inputs and targets, or adjust
the penalty term for the standard deviations of all the inputs and targets.
It is usually a good idea to omit the biases from the penalty term. 

A fundamental problem with weight decay is that different types of weights
in the network will usually require different decay constants for good
generalization. At the very least, you need three different decay constants
for input-to-hidden, hidden-to-hidden, and hidden-to-output weights.
Adjusting all these decay constants to produce the best estimated
generalization error often requires vast amounts of computation. 

Fortunately, there is a superior alternative to weight decay: hierarchical 
Bayesian estimation, which makes it possible to estimate numerous decay
constants efficiently. 

References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
   Generalization by weight-elimination with application to forecasting. In:
   R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.), Advances in Neural
   Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann. 

------------------------------------------------------------------------

Subject: What is Bayesian estimation? 
======================================

I haven't written an answer for this yet, but here are some references: 

   Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
   (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V.
   (North-Holland). 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995), Bayesian
   Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5. 

   MacKay, D.J.C. (1992), "A practical Bayesian framework for
   backpropagation networks," Neural Computation, 4, 448-472. 

   MacKay, D.J.C. (199?), "Probable networks and plausible predictions--a
   review of practical Bayesian methods for supervised neural networks," 
   ftp://mraos.ra.phy.cam.ac.uk/pub/mackay/network.ps.Z. 

   Neal, R.M. (1995), Bayesian Learning for Neural Networks, Ph.D. thesis,
   University of Toronto, ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z. 

   O'Hagan, A. (1985), "Shoulders in hierarchical models," in Bernardo et
   al. (1985), 697-710. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

------------------------------------------------------------------------

Subject: How many hidden layers should I use? 
==============================================

You may not need any hidden layers at all. Linear and generalized linear
models are useful in a wide variety of applications (McCullagh and Nelder
1989). And even if the function you want to learn is mildly nonlinear, you
may get better generalization with a simple linear model than with a
complicated nonlinear model if there is too little data or too much noise to
estimate the nonlinearities accurately. 

In MLPs with step/threshold/Heaviside activation functions, you need two
hidden layers for full generality (Sontag 1992). For further discussion, see
Bishop (1995, 121-126). 

In MLPs with any of a wide variety of continuous nonlinear hidden-layer
activation functions, one hidden layer with an arbitrarily large number of
units suffices for the "universal approximation" property (e.g., Hornik,
Stinchcombe and White 1989; Hornik 1993; for more references, see Bishop
1995, 130). But there is no theory yet to tell you how many hidden units are
needed to approximate any given function. 
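
As a minimal sketch of such a network (Python/NumPy; the layer sizes and
random weights here are arbitrary illustrations, not values from this FAQ),
a one-hidden-layer MLP with tanh hidden units and an identity output
computes: 

```python
import numpy as np

def mlp_one_hidden(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP:
    tanh hidden units, identity output activation."""
    h = np.tanh(x @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # identity output

# Example: 2 inputs, 4 hidden units, 1 output (sizes are arbitrary)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
y = mlp_one_hidden(np.array([[0.5, -0.5]]), W1, b1, W2, b2)
```

The universal approximation results say only that *some* such weights, with
a sufficiently wide hidden layer, can approximate the target function; they
do not say how wide that layer must be. 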

If you have only one input, there seems to be no advantage to using more
than one hidden layer. But things get much more complicated when there are
two or more inputs. To illustrate, examples with two inputs and one output
will be used so that the results can be shown graphically. In each example
there are 441 training cases on a regular 21-by-21 grid. The test sets have
1681 cases on a regular 41-by-41 grid over the same domain as the training
set. If you are reading the HTML version of this document via a web browser,
you can see surface plots based on the test set by clicking on the file
names mentioned in the following text. Each plot is a gif file, approximately
9K in size. 

Consider a target function of two inputs, consisting of a Gaussian hill in
the middle of a plane (hill.gif). An MLP with an identity output activation
function can easily fit the hill by surrounding it with a few sigmoid
(logistic, tanh, arctan, etc.) hidden units, but there will be spurious
ridges and valleys where the plane should be flat (h_mlp_6.gif). It takes
dozens of hidden units to flatten out the plane accurately (h_mlp_30.gif). 

Now suppose you use a logistic output activation function. As the input to a
logistic function goes to negative infinity, the output approaches zero. The
plane in the Gaussian target function also has a value of zero. If the
weights and bias for the output layer yield large negative values outside
the base of the hill, the logistic function will flatten out any spurious
ridges and valleys. So fitting the flat part of the target function is easy 
(h_mlpt_3_unsq.gif and h_mlpt_3.gif). But the logistic function also tends
to lower the top of the hill. 

If, instead of a rounded hill, the target function were a mesa with a large,
flat top at a value of one, the logistic output activation function would be
able to smooth out the top of the mesa just as it smooths out the plane
below. Target functions like this, with large flat areas at values of either
zero or one, are just what you have in many noise-free classification
problems. In such cases, a single hidden layer is likely to work well. 

When using a logistic output activation function, it is common practice to
scale the target values to a range of .1 to .9. Such scaling is bad in a
noise-free classification problem, because it prevents the logistic function
from smoothing out the flat areas (h_mlpt1-9_3.gif). 
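
The common scaling practice can be sketched as follows (a hypothetical
helper for illustration, not part of any particular package): 

```python
import numpy as np

def scale_targets(t, lo=0.1, hi=0.9):
    """Linearly rescale targets into [lo, hi]; the .1/.9
    defaults mirror the common practice discussed above."""
    tmin, tmax = t.min(), t.max()
    return lo + (hi - lo) * (t - tmin) / (tmax - tmin)

t = np.array([0.0, 0.25, 1.0])
scaled = scale_targets(t)   # 0 maps to .1, 1 maps to .9
```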

For the Gaussian target function, [.1,.9] scaling would make it easier to
fit the top of the hill, but would reintroduce undulations in the plane. It
would be better for the Gaussian target function to scale the target values
to a range of 0 to .9. But for a more realistic and complicated target
function, how would you know the best way to scale the target values? 

By introducing a second hidden layer with a single sigmoid unit and
returning to an identity output activation function, you can let the net
figure out the best scaling (h_mlp1_3.gif). Strictly speaking, the bias and
weight for the output layer scale the output rather than the target values,
so you can use whatever range of target values is convenient. 

For more complicated target functions, especially those with several hills
or valleys, it is useful to have several units in the second hidden layer.
Each unit in the second hidden layer enables the net to fit a separate hill
or valley. So an MLP with two hidden layers can often yield an accurate
approximation with fewer weights than an MLP with one hidden layer (Chester
1990). 

To illustrate the use of multiple units in the second hidden layer, the next
example resembles a landscape with a Gaussian hill and a Gaussian valley,
both elliptical (hillanvale.gif). The table below gives the RMSE (root mean
squared error) for the test set with various architectures. If you are
reading the HTML version of this document via a web browser, click on any
number in the table to see a surface plot of the corresponding network
output. 

The MLP networks in the table have one or two hidden layers with a tanh
activation function. The output activation function is the identity. Using a
squashing function on the output layer is of no benefit for this function,
since the only flat area in the function has a target value near the middle
of the target range. 

          Hill and Valley Data: RMSE for the Test Set
              (Number of weights in parentheses)
                         MLP Networks

HUs in                  HUs in Second Layer
First  ----------------------------------------------------------
Layer    0           1           2           3           4
 1     0.204(  5)  0.204(  7)  0.189( 10)  0.187( 13)  0.185( 16)
 2     0.183(  9)  0.163( 11)  0.147( 15)  0.094( 19)  0.096( 23)
 3     0.159( 13)  0.095( 15)  0.054( 20)  0.033( 25)  0.045( 30)
 4     0.137( 17)  0.093( 19)  0.009( 25)  0.021( 31)  0.016( 37)
 5     0.121( 21)  0.092( 23)              0.010( 37)  0.011( 44)
 6     0.100( 25)  0.092( 27)              0.007( 43)  0.005( 51)
 7     0.086( 29)  0.077( 31)
 8     0.079( 33)  0.062( 35)
 9     0.072( 37)  0.055( 39)
10     0.059( 41)  0.047( 43)
12     0.047( 49)  0.042( 51)
15     0.039( 61)  0.032( 63)
20     0.025( 81)  0.018( 83)  
25     0.021(101)  0.016(103)  
30     0.018(121)  0.015(123)  
40     0.012(161)  0.015(163)  
50     0.008(201)  0.014(203)  

For an MLP with only one hidden layer (column 0), Gaussian hills and valleys
require a large number of hidden units to approximate well. When there is
one unit in the second hidden layer, the network can represent one hill or
valley easily, which is what happens with three to six units in the first
hidden layer. But having only one unit in the second hidden layer is of
little benefit for learning two hills or valleys. Using two units in the
second hidden layer enables the network to approximate two hills or valleys
easily; in this example, only four units are required in the first hidden
layer to get an excellent fit. Each additional unit in the second hidden
layer enables the network to learn another hill or valley with a relatively
small number of units in the first hidden layer, as explained by Chester
(1990). In this example, having three or four units in the second hidden
layer helps little, and actually produces a worse approximation when there
are four units in the first hidden layer due to problems with local minima. 
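
For reference, the RMSE figures in the table above are root mean squared
errors over the 1681-case test grid; as a sketch: 

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, as reported in the table above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# e.g. rmse(test_targets, network_outputs) over the 41-by-41 grid
err = rmse([0.0, 1.0], [0.0, 0.0])
```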

Unfortunately, using two hidden layers exacerbates the problem of local
minima, and it is important to use lots of random initializations or other
methods for global optimization. Local minima with two hidden layers can
have extreme spikes or blades even when the number of weights is much
smaller than the number of training cases. One of the few advantages of 
standard backprop is that it is so slow that spikes and blades will not
become very sharp for practical training times. 

More than two hidden layers can be useful in certain architectures such as
cascade correlation (Fahlman and Lebiere 1990) and in special applications,
such as the two-spirals problem (Lang and Witbrock 1988) and ZIP code
recognition (Le Cun et al. 1989). 

RBF networks are most often used with a single hidden layer. But an extra,
linear hidden layer before the radial hidden layer enables the network to
ignore irrelevant inputs (see "How do MLPs compare with RBFs?"). The linear
hidden layer allows the RBFs to take elliptical, rather than radial
(circular), shapes in the space of the inputs. Hence the linear layer gives
you an elliptical basis function (EBF) network. In the hill and valley
example, an ORBFUN network requires nine hidden units (37 weights) to get
the test RMSE below .01, but by adding a linear hidden layer, you can get an
essentially perfect fit with three linear units followed by two radial units
(20 weights). 

References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Chester, D.L. (1990), "Why Two Hidden Layers are Better than One,"
   IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268. 

   Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation Learning
   Architecture," NIPS2, 524-532, 
   ftp://archive.cis.ohio-state.edu/pub/neuroprose/fahlman.cascor-tr.ps.Z. 

   Hornik, K., Stinchcombe, M. and White, H. (1989), "Multilayer feedforward
   networks are universal approximators," Neural Networks, 2, 359-366. 

   Hornik, K. (1993), "Some new results on neural network approximation,"
   Neural Networks, 6, 1069-1072. 

   Lang, K.J. and Witbrock, M.J. (1988), "Learning to tell two spirals
   apart," in Touretzky, D., Hinton, G., and Sejnowski, T., eds., 
   Proceedings of the 1988 Connectionist Models Summer School, San Mateo,
   CA: Morgan Kaufmann. 

   Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E.,
   Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to
   handwritten ZIP code recognition", Neural Computation, 1, 541-551. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Sontag, E.D. (1992), "Feedback stabilization using two-hidden-layer
   nets", IEEE Transactions on Neural Networks, 3, 981-990. 

------------------------------------------------------------------------

Subject: How many hidden units should I use? 
=============================================

Some books and articles offer "rules of thumb" for choosing a topology --
Ninputs plus Noutputs divided by two, maybe with a square root in there
somewhere -- but such rules are total garbage. There is no way to determine
a good network topology just from the number of inputs and outputs. It
depends critically on the number of training cases, the amount of noise, and
the complexity of the function or classification you are trying to learn.
There are problems with one input and one output that require thousands of
hidden units, and problems with a thousand inputs and a thousand outputs
that require only one hidden unit, or none at all. 

Other rules of thumb relate to the number of training cases available: for
example, use only as many hidden units as will keep the number of weights in
the network times 10 smaller than the number of cases. Such rules are
concerned only with overfitting and are unreliable as well. All one can say
is that if the number of training
cases is much larger (but no one knows exactly how much larger) than the
number of weights, you are unlikely to get overfitting, but you may suffer
from underfitting. Geman, Bienenstock, and Doursat (1992) discuss how the
number of hidden units affects the bias/variance trade-off. 
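
The weight counts for fully connected MLPs with biases (the parenthesized
numbers in the hill-and-valley table earlier in this posting follow this
formula) can be computed as: 

```python
def n_weights(layer_sizes):
    """Number of weights, including biases, in a fully connected
    MLP.  layer_sizes = [n_inputs, hidden sizes..., n_outputs].
    Each layer contributes (fan_in + 1) * fan_out weights."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

w = n_weights([2, 4, 1])   # a 2-4-1 net: (2+1)*4 + (4+1)*1 = 17
```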

An intelligent choice of the number of hidden units depends on whether you
are using early stopping or some other form of regularization. If not, you
must simply try many networks with different numbers of hidden units, 
estimate the generalization error for each one, and choose the network with
the minimum estimated generalization error. 

Using conventional optimization algorithms (see "What are conjugate
gradients, Levenberg-Marquardt, etc.?"), there is little point in trying a
network with more weights than training cases, since such a large network is
likely to overfit. But Lawrence, Giles, and Tsoi (1996) have shown that
standard online backprop can have considerable difficulty reducing training
error to a level near the globally optimal value, hence using "oversize"
networks can reduce both training error and generalization error. 

If you are using early stopping, it is essential to use lots of hidden units
to avoid bad local optima (Sarle 1995). There seems to be no upper limit on
the number of hidden units, other than that imposed by computer time and
memory requirements. Weigend (1994) makes this assertion, but provides only
one example as evidence. Tetko, Livingstone, and Luik (1995) provide
simulation studies that are more convincing. The FAQ maintainer obtained
similar results in conjunction with the simulations in Sarle (1995), but
those results are not reported in the paper for lack of space. On the other
hand, there seems to be no advantage to using more hidden units than you
have training cases, since bad local minima do not occur with so many hidden
units. 

If you are using weight decay or Bayesian estimation, you can also use lots
of hidden units (Neal 1995). However, it is not strictly necessary to do so,
because other methods are available to avoid local minima, such as multiple
random starts and simulated annealing (such methods are not safe to use with
early stopping). You can use one network with lots of hidden units, or you
can try different networks with different numbers of hidden units, and
choose on the basis of estimated generalization error. With weight decay or
MAP Bayesian estimation, it is prudent to keep the number of weights less
than half the number of training cases. 

Bear in mind that with two or more inputs, an MLP with one hidden layer
containing just a few units can fit only a limited variety of target
functions. Even simple, smooth surfaces such as a Gaussian bump in two
dimensions may require 20 to 50 hidden units for a close approximation.
Networks with a smaller number of hidden units often produce spurious ridges
and valleys in the output surface (see Chester 1990 and the very large
(885K) example in ftp://ftp.sas.com/pub/neural/tnnex_hillplat_mlp.ps and,
for more explanation, "How do MLPs compare with RBFs?"). Training a network
with 20 hidden units will typically require anywhere from 150 to 2500
training cases if you do not use early stopping or regularization. Hence, if
you have a smaller training set than that, it is usually advisable to use
early stopping or regularization rather than to restrict the net to a small
number of hidden units. 

References: 

   Chester, D.L. (1990), "Why Two Hidden Layers are Better than One,"
   IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Lawrence, S., Giles, C.L., and Tsoi, A.C. (1996), "What size neural
   network gives optimal generalization? Convergence properties of
   backpropagation," Technical Report UMIACS-TR-96-22 and CS-TR-3617,
   Institute for Advanced Computer Studies, University of Maryland, College
   Park, MD 20742,
   http://www.neci.nj.nec.com/homepages/lawrence/papers/minima-tr96/minima-tr96.html

   Neal, R.M. (1995), Bayesian Learning for Neural Networks, Ph.D. thesis,
   University of Toronto, ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

   Tetko, I.V., Livingstone, D.J., and Luik, A.I. (1995), "Neural Network
   Studies. 1. Comparison of Overfitting and Overtraining," J. Chem. Info.
   Comp. Sci., 35, 826-833. 

   Weigend, A. (1994), "On overfitting and the effective number of hidden
   units," Proceedings of the 1993 Connectionist Models Summer School,
   335-342. 

------------------------------------------------------------------------

Subject: How can generalization error be estimated? 
====================================================

There are many methods for estimating generalization error. 

Single-sample statistics: AIC, SBC, FPE, Mallows' C_p, etc. 
   In linear models, statistical theory provides several simple estimators
   of the generalization error under various sampling assumptions
   (Darlington 1968, Efron and Tibshirani 1993). These estimators adjust the
   training error for the number of weights being estimated, and in some
   cases for the noise variance if that is known. See 
   ftp://ftp.sas.com/pub/neural/tnn3.html for some formulas. These
   statistics can also be used as crude estimates of the generalization
   error in nonlinear models when you have a "large" training set.
   Correcting these statistics for nonlinearity requires substantially more
   computation (Moody 1992), and the theory does not always hold for neural
   networks due to violations of the regularity conditions. 
Split-sample validation. 
   The most commonly used method for estimating generalization error in
   neural networks is to reserve part of the data as a test set, which must
   not be used in any way during training. The test set must be a
   representative sample of the cases that you want to generalize to. After
   training, run the network on the test set, and the error on the test set
   provides an unbiased estimate of the generalization error, provided that
   the test set was chosen randomly. The disadvantage of split-sample
   validation is that it reduces the amount of data available for both
   training and validation. See Weiss and Kulikowski (1991). 
Cross-validation (e.g., leave one out). 
   Cross-validation is an improvement on split-sample validation that allows
   you to use all of the data for training. The disadvantage of
   cross-validation is that you have to retrain the net many times. See 
   "What are cross-validation and bootstrapping?". 
Bootstrapping. 
   Bootstrapping is an improvement on cross-validation that often provides
   better estimates of generalization error at the cost of even more
   computing time. See "What are cross-validation and bootstrapping?". 
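
For linear models, common textbook forms of the single-sample statistics are
easy to compute. The sketch below uses those textbook forms; see the FTP
link above for the formulas this FAQ actually recommends, which may differ
in detail: 

```python
import numpy as np

def aic(sse, n, p):
    """Akaike's information criterion for a linear model with n
    cases, p estimated weights, and training SSE (common form)."""
    return n * np.log(sse / n) + 2 * p

def sbc(sse, n, p):
    """Schwarz's Bayesian criterion (also called BIC)."""
    return n * np.log(sse / n) + p * np.log(n)

def fpe(sse, n, p):
    """Akaike's final prediction error."""
    return (sse / n) * (n + p) / (n - p)
```

All three inflate the training error as the number of weights p grows; SBC
penalizes extra weights more heavily than AIC once n exceeds about 7. 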

If you use any of the above methods to choose which of several different
networks to use for prediction purposes, the estimate of the generalization
error of the best network will be optimistic. For example, if you train
several networks using one data set, and use a second (validation set) data
set to decide which network is best, you must use a third (test set) data
set to obtain an unbiased estimate of the generalization error of the chosen
network. Hjorth (1994) explains how this principle extends to
cross-validation and bootstrapping. 
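
The three-way split described above can be sketched as follows (the
50/25/25 fractions are arbitrary illustrations): 

```python
import numpy as np

def three_way_split(n_cases, f_train=0.5, f_val=0.25, seed=0):
    """Randomly split case indices into train/validation/test.
    The validation set picks among networks; the untouched test
    set then gives an unbiased estimate of generalization error."""
    idx = np.random.default_rng(seed).permutation(n_cases)
    n_tr = int(f_train * n_cases)
    n_va = int(f_val * n_cases)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

train, val, test = three_way_split(441)
```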

References: 

   Darlington, R.B. (1968), "Multiple Regression in Psychological Research
   and Practice," Psychological Bulletin, 69, 161-182. 

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
   London: Chapman & Hall. 

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods:
   Validation, Model Selection, and Bootstrap, London: Chapman & Hall. 

   Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
   Generalization and Regularization in Nonlinear Learning Systems", NIPS 4,
   847-854. 

   Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
   Morgan Kaufmann. 

------------------------------------------------------------------------

Subject: What are cross-validation and bootstrapping? 
======================================================

Cross-validation and bootstrapping are both methods for estimating
generalization error based on "resampling". In k-fold cross-validation, you
divide the data into k subsets of equal size. You train the net k times,
each time leaving out one of the subsets from training, but using only the
omitted subset to compute whatever error criterion interests you. If k
equals the sample size, this is called leave-one-out cross-validation. A
more elaborate and expensive version of cross-validation involves leaving
out all possible subsets of a given size. 
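
The k-fold partitioning described above can be sketched as: 

```python
import numpy as np

def kfold_indices(n_cases, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold
    cross-validation: each fold is held out once while the
    remaining k-1 folds are used for training."""
    idx = np.random.default_rng(seed).permutation(n_cases)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, test

splits = list(kfold_indices(20, k=5))
```

With k equal to n_cases, this reduces to leave-one-out cross-validation. 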

Note that cross-validation is quite different from the "split-sample" or
"hold-out" method that is commonly used for early stopping in neural nets.
In the split-sample method, only a single subset (the validation set) is
used to estimate the error function, instead of k different subsets; i.e.,
there is no "crossing". While various people have suggested that
cross-validation be applied to early stopping, the proper way of doing that
is not obvious. 

Cross-validation is also easily confused with jackknifing. Both involve
omitting each training case in turn and retraining the network on the
remaining subset. But cross-validation is used to estimate generalization
error, while the jackknife is used to estimate the bias of a statistic. In
the jackknife, you compute some statistic of interest in each subset of the
data. The average of these subset statistics is compared with the
corresponding statistic computed from the entire sample in order to estimate
the bias of the latter. You can also get a jackknife estimate of the
standard error of a statistic. 
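
The jackknife bias estimate described above can be sketched as follows;
applied to the plug-in (maximum likelihood) variance of a small sample, the
bias-corrected value (statistic minus estimated bias) recovers the familiar
unbiased variance: 

```python
import numpy as np

def jackknife_bias(data, statistic):
    """Jackknife bias estimate of `statistic`:
    (n - 1) * (mean of leave-one-out values - full-sample value)."""
    data = np.asarray(data)
    n = len(data)
    full = statistic(data)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return (n - 1) * (loo.mean() - full)

# np.var is the plug-in variance (divides by n, hence biased)
bias = jackknife_bias([1.0, 2.0, 3.0, 4.0], lambda x: x.var())
```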

Leave-one-out cross-validation often works well for continuous error
functions such as the mean squared error, but it may perform poorly for
discontinuous error functions such as the number of misclassified cases. In
the latter case, k-fold cross-validation is preferred. But if k gets too
small, the error estimate is pessimistically biased because of the
difference in sample size between the full-sample analysis and the
cross-validation analyses. A value of 10 for k is popular. 

Bootstrapping seems to work better than cross-validation in many cases. In
the simplest form of bootstrapping, instead of repeatedly analyzing subsets
of the data, you repeatedly analyze subsamples of the data. Each subsample
is a random sample with replacement from the full sample. Depending on what
you want to do, anywhere from 200 to 2000 subsamples might be used. There
are many more sophisticated bootstrap methods that can be used not only for
estimating generalization error but also for estimating confidence bounds
for network outputs. 
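
The simplest bootstrap loop described above can be sketched as: 

```python
import numpy as np

def bootstrap_estimates(data, statistic, n_boot=200, seed=0):
    """Simplest bootstrap: draw random samples with replacement
    from the full sample and recompute the statistic each time
    (200 to 2000 resamples are typical)."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    return np.array([statistic(rng.choice(data, size=len(data),
                                          replace=True))
                     for _ in range(n_boot)])

ests = bootstrap_estimates(np.arange(10.0), np.mean, n_boot=200)
```

The spread of the resampled estimates is what the more sophisticated
bootstrap methods exploit to form confidence bounds. 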

References: 

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
   London: Chapman & Hall. 

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods:
   Validation, Model Selection, and Bootstrap, London: Chapman & Hall. 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
   Morgan Kaufmann. 

------------------------------------------------------------------------

Next part is part 4 (of 7). Previous part is part 2. 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
