Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!news.duke.edu!concert!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Measuring generalization
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <Cy348D.40C@unx.sas.com>
Date: Sat, 22 Oct 1994 17:21:49 GMT
References: <370lit$bgi@nz12.rz.uni-karlsruhe.de> <782750417snz@ecowar.demon.co.uk>
Nntp-Posting-Host: hotellng.unx.sas.com
Organization: SAS Institute Inc.
Lines: 139


In article <782750417snz@ecowar.demon.co.uk>, jimmy@ecowar.demon.co.uk (Jimmy Shadbolt) writes:
|> ...
|> That's right - is there other ways to assess the generalisation properties
|> for non-categorical data apart from regularisation theory?

Some excerpts from the documentation for my TNN macro:

The critical issue in developing a neural network is generalization:
how well will the network make predictions for cases that are not in the
training set?  It is essential to choose a network of appropriate
complexity to get good generalization.

One aspect of complexity is the number of weights to be estimated in the
network, which is related to the number of hidden units.  If you have
too few hidden units, the network will not be flexible enough to fit
complicated nonlinear functions. This phenomenon is called
_underfitting_ and produces both a bad fit to the training data and bad
generalization. If you have too many hidden units and train the network
long enough, it will fit the noise in the data instead of just fitting
the signal. This phenomenon is called _overfitting_ and produces _too_
good a fit to the training data and bad generalization. One way to
obtain a network of appropriate complexity is to choose an appropriate
number of hidden units to avoid both underfitting and overfitting.
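A minimal numpy sketch of the same phenomenon, using polynomial degree
in place of hidden-unit count (the data and degrees are my own
illustration, not from %TNN): too low a degree underfits, too high a
degree fits the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth target; model complexity is varied via
# polynomial degree, standing in for the number of hidden units.
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, x.size)
x_new = np.linspace(-1, 1, 200)       # fresh inputs within the training range
y_new = np.sin(np.pi * x_new)         # noise-free targets for generalization

results = {}
for degree in (1, 3, 15):
    coef = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coef, x) - y) ** 2)
    new_err = np.mean((np.polyval(coef, x_new) - y_new) ** 2)
    results[degree] = (train_err, new_err)
    print(degree, round(train_err, 3), round(new_err, 3))
```

Training error can only fall as complexity grows, but the error on new
cases is smallest at an intermediate complexity.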

Another aspect of complexity is the size of the weights. It is well
known from shrinkage estimation and ridge regression in linear models
that generalization can be improved by reducing the size of the weights
from the estimates that give best fit in the sample. The optimal amount
of shrinkage must be estimated, and decreases to zero as the sample size
goes to infinity. In the neural net literature, shrinkage and related
methods are called _regularization_.
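Ridge regression is the simplest setting in which to see shrinkage at
work; a numpy sketch under my own simulated data (the sample sizes and
penalty values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small-sample regression with correlated inputs, where shrinking the
# weights toward zero trades a little bias for a large drop in variance.
n, p = 30, 20
A = rng.normal(size=(p, p))                    # induces correlated columns
X = rng.normal(size=(n, p)) @ A / np.sqrt(p)
w_true = rng.normal(size=p)
y = X @ w_true + rng.normal(0, 1.0, n)

X_test = rng.normal(size=(500, p)) @ A / np.sqrt(p)
y_test = X_test @ w_true                       # noise-free targets

def ridge(lam):
    # Closed-form ridge solution (X'X + lam*I)^-1 X'y; lam = 0 is OLS.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

err = {lam: np.mean((X_test @ ridge(lam) - y_test) ** 2)
       for lam in (0.0, 1.0, 10.0)}
print(err)
```

The larger the penalty, the smaller the weight vector; a moderate
penalty typically beats the unshrunken least-squares fit on new data.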

The size of the weights has an additional effect in neural networks with
sigmoid activation functions. If the weights feeding in to a sigmoid
activation function are sufficiently small, then the activation of the
unit for all training cases will lie in the central, almost linear
region of the curve, and the unit will be effectively linear. If the
number of linear hidden units in a layer exceeds the number of inputs or
outputs, the excess units are redundant. Hence, if a network contains
many small weights, the effective number of hidden units may be much
less than the actual number. Thus another way to obtain a network of
appropriate complexity is to control the size of the weights.
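The near-linearity of a sigmoid for small net inputs is easy to check
numerically; a short sketch (the input ranges are my own choice):

```python
import numpy as np

# For small net inputs the logistic sigmoid is nearly its tangent line
# at the origin, 1/(1+exp(-v)) ~ 0.5 + v/4, so the unit acts linearly.
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def tangent(v):
    return 0.5 + v / 4.0

v_small = np.linspace(-0.1, 0.1, 101)   # small weights -> small net input
v_large = np.linspace(-5.0, 5.0, 101)   # large weights -> saturated input

dev_small = np.max(np.abs(sigmoid(v_small) - tangent(v_small)))
dev_large = np.max(np.abs(sigmoid(v_large) - tangent(v_large)))
print(dev_small, dev_large)
```

Over the small-input range the maximum deviation from the line is
negligible; over the large range the curve saturates and the linear
approximation fails badly.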

Whether you control the number of hidden units or the size of the
weights, you must be able to estimate the generalization error to choose
the appropriate complexity.  There are many methods for estimating
generalization error:

 * Single-sample statistics: AIC, SBC, FPE, Mallows' C_p, etc.--fast
   but require "large" samples

 * Split-sample validation--fast but statistically inefficient

 * Cross-validation (e.g., leave one out)--slow and erratic

 * Bootstrapping--very slow

The single-sample fit statistics that %TNN computes include the
following:

   Notation
   --------
   
   _ indicates subscript, ^ indicates superscript
   
   n        = the number of observations
   
   p        = the number of weights
   
   SSE      = the error sum of squares
   
   Criteria for adequacy of the estimated model in the sample
   ----------------------------------------------------------
   
   ASE      = SSE/n, the average squared error
   
   RASE     = sqrt(ASE), root average squared error
   
   Criteria for adequacy of the true model in the population
   ---------------------------------------------------------
   
   MSE      = SSE/(n-p), the mean square error (Darlington 1968)
   
   RMSE     = sqrt(MSE), root mean square error
   
   Criteria for adequacy of the estimated model in the population
   --------------------------------------------------------------
   
   FPE      = (n+p)MSE/n, final prediction error (Akaike 1969; Judge
              et al. 1980), the estimated mean square error of
              prediction assuming that the values of the regressors
              are fixed and that the model is correct
   
   RFPE     = sqrt(FPE), root final prediction error
   
   GCV      = SSE*n / (n-p)^2 for OLS, the generalized cross-
              validation statistic (Golub, Heath & Wahba 1979; Wahba
              1990)
   
   RGCV     = sqrt(GCV), root generalized cross-validation statistic
   
   AIC      = (n)ln(SSE/n)+2p, Akaike's information criterion (Akaike
              1969; Judge et al. 1980)
   
   SBC      = (n)ln(SSE/n)+(p)ln(n), Schwarz's Bayesian criterion
              (Schwarz 1978; Judge et al. 1980)
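These statistics are direct functions of n, p, and SSE, so they can be
computed in a few lines; a Python transcription of the formulas above
(the example values of n, p, and SSE are my own):

```python
import math

# Single-sample fit statistics as defined above, given n (observations),
# p (weights), and SSE (error sum of squares).
def fit_stats(n, p, sse):
    ase = sse / n                         # average squared error
    mse = sse / (n - p)                   # mean square error
    fpe = (n + p) * mse / n               # final prediction error
    gcv = sse * n / (n - p) ** 2          # generalized cross-validation
    aic = n * math.log(sse / n) + 2 * p   # Akaike's information criterion
    sbc = n * math.log(sse / n) + p * math.log(n)  # Schwarz's criterion
    return {"ASE": ase, "RASE": math.sqrt(ase),
            "MSE": mse, "RMSE": math.sqrt(mse),
            "FPE": fpe, "RFPE": math.sqrt(fpe),
            "GCV": gcv, "RGCV": math.sqrt(gcv),
            "AIC": aic, "SBC": sbc}

stats = fit_stats(n=100, p=10, sse=50.0)
print(stats)
```

Note that for n > e^2 (about 7.4) the SBC penalizes each weight more
heavily than the AIC, so it favors smaller networks.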

References
----------

  Akaike, H. (1969), "Fitting Autoregressive Models for Prediction,"
  Annals of the Institute of Statistical Mathematics, 21, 243-247.

  Akaike, H. (1974), "A New Look at the Statistical Model
  Identification," IEEE Transactions on Automatic Control, AC-19,
  716-723.

  Darlington, R.B. (1968), "Multiple Regression in Psychological Research
  and Practice," Psychological Bulletin, 69, 161-182.

  Golub, G.H., Heath, M., and Wahba, G. (1979), "Generalized
  Cross-Validation as a Method for Choosing a Good Ridge Parameter,"
  Technometrics, 21, 215-223.

  Judge, G.G., Griffiths, W.E., Hill, R.C., and Lee, T. (1980), The Theory
  and Practice of Econometrics, New York: John Wiley & Sons. There is a
  more recent edition of this with a slightly different list of authors.

  Schwarz, G. (1978), "Estimating the Dimension of a Model," Annals of
  Statistics, 6, 461-464.

  Wahba, G. (1990), Spline Models for Observational Data, Philadelphia:
  SIAM.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
