Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!howland.reston.ans.net!ix.netcom.com!netcom.com!park
From: park@netcom.com (Bill Park)
Subject: Re: local minima
Message-ID: <parkCznJE9.1ID@netcom.com>
Followup-To: comp.ai.neural-nets
Cc: william@kub.nl
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
References: <3aqb7l$civ@kubds1.kub.nl>
Date: Tue, 22 Nov 1994 04:34:57 GMT
Lines: 139

In article <3aqb7l$civ@kubds1.kub.nl> william@kub.nl (W. Verkooijen)
writes:

> THE QUESTION:
> 
> Can anybody reveal some, or better ALL, of the mystery about why
> backpropagation with one-case-at-a-time weight updating ends in
> "better" local minima than a "batch" version or Newton version of
> backpropagation does?
> 
> Thanks.
> William Verkooijen
> E-mail: william@kub.nl

(Please put a carriage return character into your articles every 60
characters or so; some mail software may truncate long lines or
display them in a way that is difficult to read.  I've reformatted
your text above).

I think the reason one-case-at-a-time training produces better results
(if in fact it does) might be because it is doing something similar to
simulated annealing.

Take as an example an untrained 3-layer perceptron with one output and
randomized weights. We'll include biases in the form of weights
connected to a unit input.

Now consider the hypersurface representing the net's squared output
error for a single example as a function of all the weights.  There
will be some subspaces of the weight space over which the net will
give the correct answer (where the surface is at height zero).
They'll be subspaces, not points, because a single example places far
fewer constraints on the weights than there are weights.  There will
be other subspaces over which the net's error is
only locally minimal. If we do gradient search on the squared-error
surface, the "weight point" representing the weight values will be
attracted towards one of these subspaces -- the closest one or deepest
one, depending on which one has the most influence on the gradient at
the weight point.
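To make this concrete, here is a purely illustrative sketch of one
gradient step on the squared error of a SINGLE example, for a tiny
2-2-1 sigmoid net with biases folded in as weights on a constant unit
input, as above.  All the names and sizes are invented for the
example; this is not the code of any particular package.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, x):
    h = sigmoid(W1 @ np.append(x, 1.0))   # hidden layer; bias via unit input
    y = sigmoid(W2 @ np.append(h, 1.0))   # single output; bias via unit input
    return h, y

def single_example_step(W1, W2, x, target, lr=0.5):
    """One gradient-descent step on the squared error of ONE example."""
    h, y = forward(W1, W2, x)
    dy = (y - target) * y * (1.0 - y)              # output delta
    W2_new = W2 - lr * dy * np.append(h, 1.0)
    dh = dy * W2[:-1] * h * (1.0 - h)              # hidden deltas
    W1_new = W1 - lr * np.outer(dh, np.append(x, 1.0))
    return W1_new, W2_new

W1 = rng.normal(scale=0.5, size=(2, 3))   # 2 inputs + bias -> 2 hidden
W2 = rng.normal(scale=0.5, size=3)        # 2 hidden + bias -> 1 output
x, t = np.array([1.0, 0.0]), 1.0
e_before = (forward(W1, W2, x)[1] - t) ** 2
W1, W2 = single_example_step(W1, W2, x, t)
e_after = (forward(W1, W2, x)[1] - t) ** 2
# The step moves the weight point downhill on THIS example's error
# surface, i.e. toward one of that example's attracting subspaces.
```

Each such step sees only the one example's error surface; the
surfaces of the other examples don't enter into it at all.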

When the next example is presented, the shape of the squared-error
surface will change.  The output error will be zero or locally minimal
over a different set of subspaces of the weight space.  Some of these
subspaces may intersect the minimal-error subspaces of the preceding
example in still lower-dimensional subspaces.  For this second
example, gradient search will send the weight point towards one of
these new subspaces.  This is unlikely to be the same direction as the
weight point was drawn by the first example, so some "forgetting"
takes place.  Equivalently, from the point of view of the first
example, the second example exerts some "noise disturbance" in the
solution to its subproblem.

Note that because we are doing one-case-at-a-time updating, the
number of constraints on the weights is small, so the subspaces are
large, and a
random starting point is likely to be closer to one of them than to a
globally optimum point in weight space.  So we are, in effect, solving
training subproblems by updating weights in this way, rather than
trying to solve the whole problem at once by finding the right
gradient that will lead us directly to the global optimum of the
training problem in one shot.  Seems like a more robust strategy for
that reason alone.
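The contrast between the two regimes is easy to sketch.  In the
following (invented) schematic, `grad` stands for any function
returning one example's squared-error gradient with respect to the
weights w:

```python
def online_epoch(w, examples, grad, lr):
    """One-case-at-a-time: nudge the weight point after EVERY example,
    so each example's minimal-error subspaces attract it in turn."""
    for x, t in examples:
        w = w - lr * grad(w, x, t)
    return w

def batch_epoch(w, examples, grad, lr):
    """Batch: sum all per-example gradients first, then take ONE step
    per epoch in a compromise direction for the whole problem."""
    total = sum(grad(w, x, t) for x, t in examples)
    return w - lr * total
```

On a simple surface the two behave much alike, but on a rugged one
the per-example steps of `online_epoch` supply exactly the "noise
disturbance" described above, while `batch_epoch` averages it away.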

If we assume for the sake of exposition that the training set contains
only these two examples, then when we present the first example again,
it draws the weight point back towards one of the first set of zero-error
subspaces again, probably the same one as the first time.

As training proceeds, the weight point is dragged alternately towards
one subspace, then the other.  After a while, it may finally reach one
of the attracting subspaces -- let's say for example #1. After that,
it will tend to stay within that subspace and slide along it as
succeeding presentations of example #2 drag it towards one of its
minimal-squared-error subspaces.  If these two subspaces intersect,
the weight point will stop moving when it reaches that intersection in
weight space. If both subspaces happen to represent zero-error outputs
for their respective examples, the net will have learned both examples
perfectly. 

If the subspaces intersect, but one or both represent only a local
minimum of squared error, the net will be improperly trained, even
though it has stopped learning.  What we need is some sort of
disturbance to temporarily drive the weight point away from any local
minimum subspace so that it has a chance of falling into the attraction
well of a global, zero-error minimum subspace.

In practice, perhaps it is the other training examples that provide
this needed disturbance.  If so, this might explain why nets seem to
be able to find much better optima than one would expect.

In practice, the training set will actually contain hundreds or
thousands of different examples.  Let's say 100.  Then for every time
the weight point is pulled towards a given subspace by one training
example, it will be pulled in essentially a random direction 99 times
by the other examples.  If the gradients of the squared-error function
are steeper around deeper minima, then these disturbances will be less
likely to jostle the weight point out of an attractive well if it is a
deeper well.
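One way to see this claim is a toy simulation (again, entirely made
up for illustration): do gradient descent in a quadratic well of
curvature k while injecting random noise standing in for the other 99
examples, and compare a gentle well with a steep one.

```python
import random

random.seed(0)

def jitter_spread(k, lr=0.1, sigma=0.3, steps=2000):
    """Noisy gradient descent in a quadratic well of curvature k.
    The additive noise stands in for the 'disturbance' from the other
    training examples.  Returns the root-mean-square distance of the
    weight point from the well bottom."""
    w, acc = 0.0, 0.0
    for _ in range(steps):
        w = w - lr * k * w + random.gauss(0.0, sigma)
        acc += w * w
    return (acc / steps) ** 0.5

shallow = jitter_spread(k=0.5)   # gentle gradient around the minimum
deep = jitter_spread(k=4.0)      # steep gradient around the minimum
# The steeper well holds the jostled weight point in a tighter
# neighborhood of its bottom, so escape is less likely.
```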

This sounds remarkably like what happens in simulated annealing.
(NOTE: The following is only a rough description of the actual
algorithm!)  You start with a high value of a simulated "temperature"
that determines how much the point representing the solution is
"thermally agitated."  At high temperatures, it quickly leaves shallow
optima but spends more time in deeper optima.  As you very slowly
lower the simulated temperature, the solution point is agitated less
and less vigorously and so could get trapped in a shallow well, except
that by then it is *probably* already in a deep well.  Applying this
argument repeatedly, it is most likely to end up in one of the deepest
wells.  -- I.e., the solution is most likely to be in the neighborhood
of a true global optimum, and close to that optimum as well, as a
result of gradient search.

In the case of training a neural net, there is an obvious candidate
for an effect of one-case-at-a-time updating that corresponds to the
decreasing temperature of simulated annealing: Initially, the weight
point is likely to be close to some of the subspaces that are
attracting it, so it is on a steep gradient of many attractive wells
(each for a different example) around it.  So they disturb the
position of the weight point a relatively large amount in different
directions over a training epoch. As the weight point reaches the
bottoms of these wells, their gradients flatten out, so the
disturbances they exert on the weight point's position over an epoch
decrease over many epochs.  At the same time, the weight point will be
slowly attracted to other subspaces from which it started out at a
large distance.  As it enters their wells of attraction, they begin to
introduce increasing disturbances.  These "late" suboptimizations tend
to prolong the "cooling" process.  But ultimately as the weight point
approaches the subspaces for almost all the examples simultaneously,
the net disturbance to its position per training epoch will get
smaller and smaller -- just as in annealing when the temperature
approaches zero and the solution "freezes" near a global optimum.
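That shrinking per-epoch disturbance is easy to exhibit in a toy
version of the two-example picture from earlier.  Let example #1's
zero-error subspace be the plane w0 = 1 and example #2's be w1 = 2
(an invented, trivially consistent pair, intersecting at (1, 2)), and
measure how far the weight point moves in each epoch:

```python
def per_epoch_movement(epochs=20, lr=0.3):
    """Online updates on two toy 'examples' whose zero-error subspaces
    are the planes w0 = 1 and w1 = 2.  Returns the total distance the
    weight point moves in each training epoch."""
    w = [5.0, -5.0]
    moves = []
    for _ in range(epochs):
        start = list(w)
        w[0] -= lr * (w[0] - 1.0)     # present example #1
        w[1] -= lr * (w[1] - 2.0)     # present example #2
        moves.append(abs(w[0] - start[0]) + abs(w[1] - start[1]))
    return moves

moves = per_epoch_movement()
# The movement per epoch shrinks steadily toward zero -- the
# "cooling" of the annealing analogy.
```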

Please send comments to comp.ai.neural-nets so we can all chew these
suggestions over.

Bill Park
=========
-- 
Grandpaw Bill's High Technology Consulting & Live Bait, Inc.
