Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!solaris.cc.vt.edu!swiss.ans.net!howland.reston.ans.net!swrinde!emory!pirates!news-feed-1.peachnet.edu!concert!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Stopped Training
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <Cw6xJA.1Gr@unx.sas.com>
Date: Thu, 15 Sep 1994 21:40:22 GMT
Nntp-Posting-Host: hotellng.unx.sas.com
Organization: SAS Institute Inc.
Lines: 186


There are many methods for estimating generalization error:
 * AIC, SBC, FPE, Mallows' C_p, etc.--fast but require "large" sample
 * Split-sample validation--fast but statistically inefficient
 * Cross-validation (leave one out)--slow and erratic
 * Bootstrapping--very slow
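To make the cross-validation entry concrete, here is a minimal sketch of leave-one-out cross-validation. It is illustrative only (the post prescribes no code): a quadratic polynomial fit stands in for the model, and all names and data are made up.

```python
# Leave-one-out cross-validation sketch (illustrative, not from the post).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = x**2 + rng.normal(scale=0.1, size=x.size)   # noisy quadratic data

errors = []
for i in range(x.size):
    mask = np.arange(x.size) != i                 # hold out one case
    coefs = np.polyfit(x[mask], y[mask], 2)       # fit on the remaining cases
    pred = np.polyval(coefs, x[i])                # predict the held-out case
    errors.append((y[i] - pred) ** 2)

loo_mse = float(np.mean(errors))                  # estimate of generalization error
print(loo_mse)
```

Each of the n fits uses n-1 cases, which is why the method is slow, and the estimate varies noticeably with the sample, which is the "erratic" part.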

Most statistical theory for nonlinear models is based on asymptotic
results. In particular, the sample size must be substantially greater
than the number of parameters for the asymptotic theory to apply.

However, NN practitioners often use nets with many times as many
parameters as training cases. E.g., Nelson and Illingworth (1991, p.
165) discuss training a network with 16,219 parameters with only 50
training cases!  The method they use for doing this is called "stopped
training":
 1. Divide the available data into training and test sets.
 2. Use a large number of hidden units (hu).
 3. Use very small random initial values.
 4. Use a slow learning rate.
 5. Compute the test error rate periodically during training.
 6. Stop training when the test error rate "starts to go up".
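The recipe above can be sketched as follows. This is a hedged illustration using plain gradient descent on a linear model rather than a network; the data, learning rate, and names are all invented for the example, and step 6 is implemented as "keep the weights at the minimum test error seen so far" rather than literally detecting an upturn.

```python
# Sketch of the stopped-training recipe (illustrative linear model).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=60)

X_tr, y_tr = X[:40], y[:40]              # step 1: split into training and test sets
X_te, y_te = X[40:], y[40:]

w = rng.normal(scale=0.01, size=5)       # step 3: very small random initial values
lr = 0.01                                 # step 4: slow learning rate
best_w, best_err = w.copy(), np.inf

for epoch in range(2000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad
    if epoch % 10 == 0:                   # step 5: compute test error periodically
        test_err = float(np.mean((X_te @ w - y_te) ** 2))
        if test_err < best_err:           # step 6: remember the best weights
            best_w, best_err = w.copy(), test_err

print(best_err)
```

Returning `best_w` rather than the final weights is what makes the procedure a form of early stopping.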

Statisticians are disturbed by stopped training because:
 * The usual statistical theory does not apply.
 * It's statistically inefficient like split-sample.
 * It's erratic like cross-validation, only more so.

On the other hand, stopped training is kind of like shrinkage
estimation, so maybe there's something to it.

Despite the widespread use of stopped training, there has been
very little research on the subject. Most published studies
are seriously flawed.

Morgan & Bourlard (1990) "Generalization and parameter estimation
in feedforward nets: Some experiments," NIPS 2, 630-637, generate
artificial data sets in such a way that the training set and test
set are correlated, thus invalidating the results.

Finnoff, Hergert & Zimmermann (1993) "Improving model selection by
nonconvergent methods," Neural Networks, 6, 771-783, generate
artificial data with uniformly distributed noise, then train the
networks by least absolute values, a bizarre combination (see
Weigend 1994).

According to Weigend (1994), "On overfitting and the effective number
of hidden units," Proceedings of the 1993 Connectionist Models Summer
School, 335-342:

 1. "... fairly small [4hu] networks (that never reach good performance)
    also, already, show overfitting."

 2. "... large [15hu] networks generalize better than small ones"

These conclusions are based on a single data set with 540 training
cases, 540 test cases, 160 inputs, and a 6-category output, trained
using the cross-entropy loss function, initial values uniformly
distributed between -.01 and +.01, and a learning rate of .01 with
no momentum. A 4hu network has 674 weights, which is greater than
the number of training cases. Hence a 4hu network is hardly "fairly
small" relative to the training set size, and (1) is not supported
by Weigend's results since he didn't report any results for small
networks.

(2) is clearly supported by Weigend's results, but generalizing
from a single case is risky!

Here is a small experiment designed to see if Weigend's results will
generalize. The design is therefore very different from Weigend's.
Artificial data sets were generated with:
   40, 100, 200, or 1000 cases
   5 inputs: 3 relevant, 2 irrelevant
   1 continuous target
   Gaussian noise with a standard deviation of 0.1

Here's the code for 1000 cases:

   data sample;
      do n=1 to 1000;
         x1=rannor(0); x2=rannor(0); x3=rannor(0); * relevant;
         x4=rannor(0); x5=rannor(0);               * irrelevant;
         m=1/(1+exp(2*x1));
         y=  m  /(1+exp(x2)) +
           (1-m)/(1+exp(x3)) +
           rannor(0)/10;
         output;
      end;
   run;

Another data set was generated like the above but without noise to
be used to measure generalization error.
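For readers without SAS, here is a Python transliteration of the data step above (illustrative only; the original experiment used SAS, and the function name and seeding are my own):

```python
# Python equivalent of the SAS data step above.
import numpy as np

def make_sample(n, noise_sd=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 5))          # columns 1-3 relevant, 4-5 irrelevant
    x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]
    m = 1 / (1 + np.exp(2 * x1))
    y = m / (1 + np.exp(x2)) + (1 - m) / (1 + np.exp(x3))
    y = y + rng.normal(scale=noise_sd, size=n)
    return x, y

X, y = make_sample(1000)                 # noise_sd=0 gives the noise-free set
```

Setting `noise_sd=0` reproduces the noise-free data set used to measure generalization error.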

A sample size of 1000 is large enough that overfitting is not a 
concern, so the results for 1000 training cases will provide a
standard of comparison for generalization results from small sample
sizes. Here is the generalization error for 1000 training cases:

                             RMS Error for
                    Hidden     Predicting
                     Units   Mean Response
                       2         0.0674
                       3         0.0235
                       5         0.0136
                      10         0.0046
                      20         0.0061

Ten hidden units appear to be about right for most practical
purposes. Experiments with smaller sample sizes will therefore
be run with 5, 10, and 20 hidden units to cover the most interesting
range.

For sample sizes of 40, 100, and 200 cases, from 10 to 40 samples were
generated and used to train NNs by both stopped training and convergent
training.  Results in the tables below are averages over the 10 to 40
samples. The standard errors for most of these averages are roughly
.005 to .02.

Each sample was randomly divided into training and test sets, with the
test set size ranging from 10% to 90% (not all combinations were run due
to lack of time). Networks with 5, 10, or 20 hidden units were trained
by a conjugate gradient algorithm with a small step size. The weights
corresponding to the minimum test set error were chosen to compute
generalization error (this avoids the question of when the test set
error "starts to go up").  Each network was also trained to convergence
using the entire sample, with results shown in the table under 0% test
size.

                     RMS Error for Predicting Mean Response

   Sample Hidden                 Test Set Percentage
    Size   Units      0      10      25      50      75      90
   
      40     5    0.282   0.135   0.127   0.138   0.148   0.165
            10    0.241   0.138   0.128   0.120   0.137   0.173
            20    0.240   0.144   0.139   0.120   0.143   0.161
   
   
     100     5    0.123    .      0.092   0.105   0.113    .
            10    0.141    .      0.092   0.096   0.115    .
            20    0.180    .      0.097   0.107   0.114    .
   
   
     200     5    0.073    .      0.076   0.085    .       .
            10    0.079    .      0.073   0.083    .       .
            20    0.095    .      0.065   0.090    .       .
   
Similar experiments were run using FPE (final prediction error),
GCV (generalized cross-validation), and SBC (Schwarz's Bayesian
criterion) to select the number of hidden units. The column in
the table below labeled POP uses the actual generalization error
to select the number of hidden units; this provides a lower bound
on the generalization error obtainable by any criterion for model
selection. The column labeled Stopped shows the best performance
from the stopped training results in the table above.
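The post does not give formulas for these criteria, so the following uses their standard textbook forms, with n the number of training cases, p the number of weights, and SSE the training sum of squared errors (a sketch under those assumptions, not the code actually used):

```python
# Standard forms of the model-selection criteria (assumed, not from the post).
import math

def fpe(sse, n, p):
    # Akaike's final prediction error: MSE inflated by (n+p)/(n-p).
    return (sse / n) * (n + p) / (n - p)

def gcv(sse, n, p):
    # Generalized cross-validation: MSE / (1 - p/n)^2.
    return (sse / n) / (1 - p / n) ** 2

def sbc(sse, n, p):
    # Schwarz's Bayesian criterion (Gaussian form): n*ln(MSE) + p*ln(n).
    return n * math.log(sse / n) + p * math.log(n)

# Pick the candidate network size with the smallest criterion value, e.g.:
print(fpe(2.0, 200, 31), gcv(2.0, 200, 31), sbc(2.0, 200, 31))
```

All three penalize training error by the number of weights; SBC's ln(n) penalty grows with the sample, which is consistent with its stronger showing at n=200 below.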

                RMS Error for Predicting Mean Response
  
        Sample
         Size     FPE    GCV    SBC    POP   Stopped
  
          40     .248   .179   .154   .124     .120
  
         100     .157   .152   .107   .082     .092
  
         200     .133   .117   .050   .048     .065

Conclusions: for data sets of the type used in this experiment,
 * SBC works better than GCV, much better than FPE
 * Stopped training works better than SBC in small samples, but
      SBC may work better in large samples
 * The test set for stopped training should be 25% to 50% of the data
 * Generalization is insensitive to number of hidden units over
      the range of 5 to 20

Questions for further research:
 * Do these results generalize?
 * What about cross-validation and bootstrapping?
 * What about shrinkage estimators?
 * What about data sets with local minima?

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
