Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!godot.cc.duq.edu!newsgate.duke.edu!news.mathworks.com!newsfeed.internetmci.com!in2.uu.net!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Backprop vs conventional optimization (Was Re: TR available ...)
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <Dsv2n8.IKq@unx.sas.com>
Date: Wed, 12 Jun 1996 00:28:20 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <317F9481.59E2@research.nj.nec.com> <4ovqnm$sdg@llnews.ll.mit.edu>
Organization: SAS Institute Inc.
Lines: 148


Steve Lawrence and Greg Heath have been having a protracted argument
over the technical report:

   Lawrence, S., Giles, C.L., and Tsoi, A.C. (1996), "What size
   neural network gives optimal generalization? Convergence properties
   of backpropagation," Technical Report UMIACS-TR-96-22 and CS-TR-3617,
   Institute for Advanced Computer Studies, University of Maryland,
   College Park, MD 20742,
   http://www.neci.nj.nec.com/homepages/lawrence,
   http://www.elec.uq.edu.au/~lawrence,
   http://www.neci.nj.nec.com/homepages/giles.html

I do not want to become ensnared in the Lawrence/Heath discussion, but
will instead follow up on my previous comments:
> Note that these conclusions apply only to inept training methods
> such as standard backprop. If you use more sophisticated training
> methods such as Levenberg-Marquardt, quasi-Newton, or conjugate
> gradients, the results can be quite different.

Lawrence et al. (LGT) generated artificial data sets by simulating MLPs
with randomly generated weights. In most cases, the MLPs had 20 inputs,
10 hidden units (HUs), and 1 output. The weights were drawn from a
uniform distribution on [-K, K]. The larger the value of K, the more
complex (or the less smooth) the output. LGT then trained networks using
the artificial data with varying numbers of HUs, values of K, and sizes
of the training set. Performance was evaluated on test sets of 5000
cases.
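LGT's data-generation scheme is easy to reproduce. The sketch below
builds a random "teacher" MLP and samples data from it; the function
name, the tanh hidden activation, the linear output, and the input
distribution are my assumptions, since the report's exact choices are
not restated here.

```python
import numpy as np

def make_teacher_data(n_cases, n_inputs=20, n_hidden=10, K=1.0, seed=0):
    """Generate data from a random 'teacher' MLP, loosely following
    LGT's setup. Activation and input range are assumptions."""
    rng = np.random.default_rng(seed)
    # Teacher weights drawn uniformly on [-K, K]; larger K gives a
    # more complex (less smooth) target function.
    W1 = rng.uniform(-K, K, size=(n_inputs, n_hidden))
    b1 = rng.uniform(-K, K, size=n_hidden)
    W2 = rng.uniform(-K, K, size=n_hidden)
    b2 = rng.uniform(-K, K)
    # Inputs sampled uniformly; tanh hidden layer, linear output.
    X = rng.uniform(-1.0, 1.0, size=(n_cases, n_inputs))
    y = np.tanh(X @ W1 + b1) @ W2 + b2
    return X, y

X, y = make_teacher_data(200, K=1.0)
```

With this setup, a student net with the teacher's architecture can in
principle fit the training data exactly.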

A net with 10 HUs should theoretically be able to obtain a training
error of zero on these artificial training sets. Since LGT used online
backprop training, a nonconvergent algorithm, they were unable to get
the normalized mean squared error (NMSE) much below 1e-4, judging from
their plots. This performance seems respectable for online backprop.
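For concreteness, NMSE here is read as the mean squared error divided
by the variance of the targets, so a model that always predicts the
target mean scores 1 and a perfect fit scores 0 (the exact
normalization is my reading of LGT, not stated in this post):

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE divided by the variance
    of the targets. Predicting the mean gives NMSE = 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```

This normalization makes errors comparable across data sets generated
with different values of K.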

A more interesting finding was that the training error for nets with 10
HUs was considerably worse than for larger nets. With K=1, the NMSE for
training averaged over five data sets was (approximately, reading LGT's
plots):

   Training      Hidden Units
    Cases      10     20     40
     200     .0006  .0002  .0002
    2000     .0012  .0009  .0008
   20000     .0030  .0010  .0008

Furthermore, the larger nets also yielded better generalization error
than did the 10-HU nets.

Whether this comparison of nets of different sizes is fair has been one
of the main subjects in the Lawrence/Heath discussion. I am not terribly
concerned about the fairness of the comparison, since I never use a
fixed number of training updates; I either train to convergence or use
early stopping. And I think the poor performance of the 10-HU nets is
most likely due to their inability to find a global optimum, a
difficulty that I have also observed in backprop training.
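The early-stopping alternative mentioned above can be sketched as a
generic loop: run one training epoch at a time and stop when the
validation error has not improved for a fixed number of epochs. The
function and parameter names are my own, not any package's API.

```python
import numpy as np

def train_with_early_stopping(step, val_nmse, max_epochs=1000, patience=20):
    """Generic early-stopping loop. `step()` performs one training
    epoch; `val_nmse()` returns the current validation NMSE. Training
    stops when validation error has not improved for `patience`
    epochs, and the best validation NMSE seen is returned."""
    best, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        step()
        err = val_nmse()
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best
```

In practice one would also save the weights at the best epoch and
restore them after stopping.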

For larger values of K in 10-HU nets, the training NMSE gets much worse:
e.g., about .0050 for K=5 and .0600 for K=10. These findings are
consistent with the hypothesis that local optima are the main cause of
the poor performance of 10-HU nets, since more complex functions are
likely to have more bad local optima.

My contention was that LGT's findings are not characteristic of
conventional optimization algorithms, but only of inept training methods
such as standard backprop. I have reproduced some of LGT's experiments
using Levenberg-Marquardt, quasi-Newton, and conjugate gradient
algorithms as implemented in the NLP procedure of the SAS/OR product.
Various network failures, disk crashes, and other computer problems have
prevented me from repeating all of LGT's experiments, but I think I have
enough results to make my points.

First, it is quite possible to find global optima with 10-HU nets using
conventional optimization algorithms. You can drive the NMSE down to
1e-12 with conjugate gradients, or even 1e-24 with Levenberg-Marquardt,
but that is pretty much a waste of computer time. I decided that a NMSE
of 1e-6 was sufficiently close to zero for most practical purposes, and
therefore terminated training when that NMSE was reached or when the
usual convergence criteria were satisfied. When the NMSE criterion was
satisfied, I counted that as a global optimum; otherwise, I considered
the result a bad local optimum (even though some of my local optima were
better than any results reported by LGT!).
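The train-to-threshold protocol can be sketched with an off-the-shelf
quasi-Newton optimizer. The code below uses SciPy's BFGS in place of
the SAS/OR NLP procedure, on a deliberately tiny net so it runs
quickly; the architecture, initialization range, and tolerances are
illustrative assumptions, not Sarle's or LGT's actual settings.

```python
import numpy as np
from scipy.optimize import minimize

def fit_mlp_restarts(X, y, n_hidden, n_restarts=3, tol=1e-6, seed=0):
    """Train a one-hidden-layer tanh MLP to convergence with BFGS
    (a quasi-Newton method) from several random starts. A run whose
    final training NMSE falls below `tol` is counted as a 'global
    optimum'; the count of such runs is returned."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    var_y = np.var(y)

    def unpack(w):
        i = n_in * n_hidden
        W1 = w[:i].reshape(n_in, n_hidden)
        b1 = w[i:i + n_hidden]
        W2 = w[i + n_hidden:i + 2 * n_hidden]
        b2 = w[i + 2 * n_hidden]
        return W1, b1, W2, b2

    def loss(w):  # training NMSE as the objective
        W1, b1, W2, b2 = unpack(w)
        pred = np.tanh(X @ W1 + b1) @ W2 + b2
        return np.mean((y - pred) ** 2) / var_y

    n_w = n_in * n_hidden + 2 * n_hidden + 1
    hits = 0
    for _ in range(n_restarts):
        w0 = rng.uniform(-0.5, 0.5, size=n_w)
        res = minimize(loss, w0, method="BFGS",
                       options={"gtol": 1e-12, "maxiter": 500})
        if res.fun < tol:
            hits += 1
    return hits
```

The same loop works with conjugate gradients (method="CG") or, via
scipy.optimize.least_squares, with Levenberg-Marquardt.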

I wanted to be able to estimate the probability of finding a global
optimum from random initial weights, so I trained 10 nets on each of 4
data sets with 200 training cases. This proved to be rather time-
consuming, and when I ran the data sets with 2000 cases, I trained only
3 nets each. In the table below, the results are given as
number-of-global-optima-found/number-of-networks-trained; each row
represents one data set:

   ntrain=200 k=1
   ----------10 HUs----------   ----------20 HUs----------
   levmar    quanew    congra   levmar    quanew    congra
     9/10     10/10     10/10    10/10     10/10     10/10  
    10/10     10/10     10/10    10/10     10/10     10/10  
    10/10     10/10     10/10    10/10     10/10     10/10  
    10/10     10/10     10/10    10/10     10/10     10/10  
   
   
   ntrain=200 k=5
   ----------10 HUs----------   ----------20 HUs----------
   levmar    quanew    congra   levmar    quanew    congra
     9/10      7/10      9/10    10/10     10/10     10/10  
     8/10      8/10      9/10    10/10     10/10     10/10  
     6/10      7/10      9/10    10/10     10/10     10/10  
     9/10      6/10     10/10    10/10     10/10     10/10  
   
   
   ntrain=200 k=10
   ----------10 HUs----------   ----------20 HUs----------
   levmar    quanew    congra   levmar    quanew    congra
     8/10      4/10      7/10    10/10     10/10     10/10  
     7/10      3/10      9/10    10/10     10/10     10/10  
     7/10      1/10      8/10    10/10     10/10     10/10  
     6/10      1/10      6/10    10/10     10/10     10/10  
   
   
   ntrain=2000 k=1
   ----------10 HUs----------   ----------20 HUs----------
   levmar    quanew    congra   levmar    quanew    congra
     1/3       3/3       1/3      3/3       3/3       3/3   
     2/3       1/3       3/3      3/3       3/3       3/3   
     3/3       1/3       3/3      3/3       3/3       3/3   
     1/3       3/3       2/3      3/3       3/3       3/3   
   
The results on the test sets were predictable. For 200 training cases,
the test NMSE was around 1, due to severe overfitting. For 2000 training
cases, the test NMSE was around 1e-6, just slightly larger than the
training NMSE, whenever a global optimum was found for 10, 20, or 40
HUs. When a bad local optimum was found, the test error was
correspondingly bad. I ran out of memory trying to run 80 HUs, so to
induce overfitting I reduced the training set from 2000 to 1000 cases
and ran 50 HUs, obtaining a test NMSE of about .01. I was surprised it
was that small, but I'm sure I could make it worse by jacking up the
number of HUs further.

Conclusion: with conventional optimization algorithms, oversize
networks do not yield better training or generalization error, except
insofar as it is easier to find the global optimum with a larger
network than with a smaller one. On the contrary, oversize networks
produce overfitting, as predicted by statistical theory.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
