Newsgroups: comp.ai.neural-nets,sci.stat.math
Path: cantaloupe.srv.cs.cmu.edu!nntp.club.cc.cmu.edu!miner.usbm.gov!news.er.usgs.gov!stc06.ctd.ornl.gov!cs.utk.edu!gatech!newsfeed.internetmci.com!in2.uu.net!news.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: weight decay
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DowzJC.5xB@unx.sas.com>
Date: Wed, 27 Mar 1996 07:02:00 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <4j4oe4$75l@ccshst05.cs.uoguelph.ca> <4j53ub$h4u@delphi.cs.ucla.edu> <4j5i74$sta@dfw-ixnews4.ix.netcom.com>
Organization: SAS Institute Inc.
Lines: 132
Xref: glinda.oz.cs.cmu.edu comp.ai.neural-nets:30728 sci.stat.math:9887


In article <4j5i74$sta@dfw-ixnews4.ix.netcom.com>, jdadson@ix.netcom.com (Jive Dadson) writes:
|> 
|> I've added weight decay to my NN software. Now I need to learn how
|> to use it. :-) I'm not even sure I've done it right. Should the output
|> layer have the same weight penalties as hidden layers? Does the type
|> of layer matter? For example, should a softmax output layer have
|> a different weight penalty than hidden tanh layers in front of it?

I'm working on weight decay for the comp.ai.neural-nets FAQ. Here's 
what I've got so far.

Weight decay adds a penalty term to the error function. The usual
penalty is the sum of squared weights times a decay constant.  In a
linear model, this form of weight decay is equivalent to ridge
regression. See "What is jitter?" for more explanation of ridge
regression.
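
As a minimal numeric sketch of the usual penalty (the names
`penalized_error` and `penalty_gradient` are mine, not standard):

```python
import numpy as np

def penalized_error(sse, weights, decay):
    # Usual weight-decay penalty: the decay constant times the
    # sum of squared weights, added to the sum-of-squares error.
    return sse + decay * np.sum(weights ** 2)

def penalty_gradient(weights, decay):
    # The penalty's gradient pulls each weight toward zero in
    # proportion to its size -- hence the name "decay".
    return 2.0 * decay * weights
```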
 
The penalty term causes the weights to converge to smaller absolute
values than they otherwise would.  Large weights can hurt generalization
in two different ways.  Excessively large weights leading to hidden
units can cause the output function to be too rough, possibly with near
discontinuities.  Excessively large weights leading to output units can
cause wild outputs far beyond the range of the data if the output
activation function is not bounded to the same range as the data.
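
A tiny numeric illustration of the first point (the weight values here
are arbitrary): a hidden unit tanh(w*x) with a huge incoming weight
switches almost discontinuously as its input crosses zero.

```python
import numpy as np

x = np.array([-0.01, 0.01])   # two inputs just either side of zero
small = np.tanh(1.0 * x)      # modest weight: smooth, gentle slope
large = np.tanh(1000.0 * x)   # huge weight: output jumps from near -1 to near +1
```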

Other penalty terms besides the sum of squared weights are sometimes
used. Weight elimination (Weigend, Rumelhart, and Huberman 1991)
uses:

          (w_i)^2
   sum -------------
       (w_i)^2 + c^2

where w_i is the ith weight and c is a user-specified constant.  Whereas
decay using the sum of squared weights tends to shrink the large
coefficients more than the small ones, weight elimination tends to
shrink the small coefficients more, and is therefore more useful for
suggesting subset models (pruning).
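
In code (a sketch; c is the user-specified constant from the formula
above):

```python
import numpy as np

def weight_elimination_penalty(w, c):
    # Weigend et al. (1991): each weight contributes w^2 / (w^2 + c^2).
    # The term saturates near 1 for |w| >> c, so weights that are
    # already large feel little further pressure, while weights near
    # zero are shrunk toward zero.
    w = np.asarray(w, dtype=float)
    return np.sum(w ** 2 / (w ** 2 + c ** 2))
```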

The generalization ability of the network can depend crucially on the
decay constant, especially with small training sets.  One approach to
choosing the decay constant is to train several networks with different
amounts of decay and estimate the generalization error for each; then
choose the decay constant that minimizes the estimated generalization
error.  Weigend, Rumelhart, and Huberman (1991) iteratively update the
decay constant during training.
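
The first approach can be sketched in the linear case, where (as noted
above) weight decay is exactly ridge regression; `choose_decay` and
the candidate grid are my own illustrative names:

```python
import numpy as np

def ridge_weights(X, y, decay):
    # Linear-model weight decay = ridge regression:
    # minimize ||y - Xw||^2 + decay * ||w||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + decay * np.eye(d), X.T @ y)

def choose_decay(X_tr, y_tr, X_val, y_val, candidates):
    # Fit one model per candidate decay constant and keep the one
    # with the smallest validation (estimated generalization) error.
    errs = []
    for decay in candidates:
        w = ridge_weights(X_tr, y_tr, decay)
        errs.append(np.mean((y_val - X_val @ w) ** 2))
    return candidates[int(np.argmin(errs))]
```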

There are other important considerations for getting good results from
weight decay. You must either standardize the inputs and targets, or
adjust the penalty term for the standard deviations of all the inputs
and targets. It is usually a good idea to omit the biases from the
penalty term.
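
A sketch of both points (the function names are mine):

```python
import numpy as np

def standardize(X):
    # Rescale each input column to mean 0, standard deviation 1, so a
    # single decay constant penalizes all inputs on a comparable scale.
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sd, mu, sd

def penalty(weights, biases, decay):
    # Penalize the weights only; the biases are deliberately omitted.
    return decay * np.sum(weights ** 2)
```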

A fundamental problem with weight decay is that different types of
weights in the network will usually require different decay constants
for good generalization. At the very least, you need three different
decay constants for input-to-hidden, hidden-to-hidden, and
hidden-to-output weights. Adjusting all these decay constants to produce
the best estimated generalization error often requires vast amounts of
computation.
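
As a sketch, a penalty with one decay constant per weight type (the
group names are mine):

```python
import numpy as np

def grouped_penalty(weight_groups, decays):
    # One decay constant per type of weight, e.g. input-to-hidden,
    # hidden-to-hidden, and hidden-to-output, matched up by name.
    return sum(decays[name] * np.sum(w ** 2)
               for name, w in weight_groups.items())
```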

Fortunately, there is a superior alternative to weight decay:
hierarchical Bayesian estimation. Bayesian estimation makes it
possible to estimate numerous decay constants efficiently.  See "What
is Bayesian estimation?" [Unfortunately, I haven't written the answer
to that question yet].

References:

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition,
   Oxford: Oxford University Press. 

   Ripley, B.D. (1996) Pattern Recognition and Neural
   Networks, Cambridge: Cambridge University Press.

   Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
   Generalization by weight-elimination with application to forecasting.
   In: R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.),
   Advances in Neural Information Processing Systems 3,
   San Mateo, CA: Morgan Kaufmann.

|> The books I have don't have sufficient explanations of the theoretical
|> motivation or the practical application. Can anyone recommend one?
|> 
|> I haven't really come to grips with error criteria in general. I want
|> to get thoroughly familiar with the Bayesian significance of the
|> error penalty as it applies both to weights and to the difference
|> between estimated values and the corresponding training values.

Bishop (1995) and Ripley (1996) are, of course, excellent sources on
weight decay and Bayesian issues. The best textbook I've seen on
Bayesian inference is Gelman, Carlin, Stern, and Rubin (1995). O'Hagan
(1985) is an excellent explanation of some of the odd things that can
happen with MAP estimation. MacKay and Neal have done the most work on
Bayesian methods for neural nets. I have had trouble getting some of
MacKay's methods to work; my own efforts are described too briefly
(there was a 10 page limit!) in Sarle (1995).

   Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
   (1985), Bayesian Statistics 2, Amsterdam: Elsevier
   Science Publishers B.V. (North-Holland).

   Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995),
   Bayesian Data Analysis, London: Chapman & Hall,
   ISBN 0-412-03991-5.

   MacKay, D.J.C. (1992), "A practical Bayesian framework for 
   backpropagation networks," Neural Computation, 4, 448-472.

   MacKay, D.J.C. (199?), "Probable networks and plausible
   predictions--a review of practical Bayesian methods for supervised
   neural networks," ftp://mraos.ra.phy.cam.ac.uk/pub/mackay/network.ps.Z.

   Neal, R.M. (1995), Bayesian Learning for Neural Networks,
   Ph.D. thesis, University of Toronto,
   ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z.

   O'Hagan, A. (1985), "Shoulders in hierarchical models,"
   in Bernardo et al. (1985), 697-710.

   Sarle, W.S. (1995), "Stopped Training and Other
   Remedies for Overfitting," to appear in Proceedings of
   the 27th Symposium on the Interface,
   ftp://ftp.sas.com/pub/neural/inter95.ps.Z 
   (this is a very large compressed postscript file, 747K, 10 pages)

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
