Newsgroups: sci.math.num-analysis,comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!gatech!swrinde!cs.utexas.edu!news.sprintlink.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Question: gradient descent
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <D812BB.A7s@unx.sas.com>
Date: Thu, 4 May 1995 00:21:11 GMT
Distribution: inet
X-Nntp-Posting-Host: hotellng.unx.sas.com
References:  <3o6ck7$6fk@rover.ucs.ualberta.ca>
Organization: SAS Institute Inc.
Lines: 40
Xref: glinda.oz.cs.cmu.edu sci.math.num-analysis:20581 comp.ai.neural-nets:23844


In article <3o6ck7$6fk@rover.ucs.ualberta.ca>, fwang@ucs.ualberta.ca (Feng Wang) writes:
|> I have a question about gradient descent (GD) optimization. In some books, the
|> algorithm is:
|>   Wi(t+1) = Wi(t) + learning_rate * dY/dWi.

With this formula, the step size is proportional to the norm of the
gradient. This works nicely with a quadratic error surface, since
you take big steps when you are far from the minimum and the gradient
is steep, and the steps get smaller and smaller as you approach the
minimum and the gradient flattens out. With batch training and a
suitable learning rate, this method converges to the minimum. With a
non-quadratic error surface, this rationale breaks down, which is
why standard backprop is often so ridiculously slow.
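To make this concrete, here is a minimal NumPy sketch (mine, using the
minus sign conventional for *minimizing* an error E) of plain gradient
descent on a quadratic bowl, where the shrinking steps do lead to the
minimum:

```python
import numpy as np

# Gradient descent on a quadratic error E(w) = 0.5 * w.T @ A @ w
# (minimum at w = 0).  The step length is learning_rate * ||gradient||,
# so steps are big far from the minimum and shrink as it is approached.
A = np.diag([1.0, 4.0])      # positive definite -> quadratic bowl
w = np.array([3.0, 2.0])
learning_rate = 0.2          # stable here: learning_rate < 2/lambda_max = 0.5
for _ in range(200):
    grad = A @ w             # dE/dw
    w = w - learning_rate * grad
print(np.linalg.norm(w))     # ~0: converged to the minimum
```

With a learning rate above 2/lambda_max the same loop diverges, which is
the usual batch-backprop tuning headache.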

|> But as I understand, it should:
|>   Wi(t+1) = Wi(t) + learning_rate * (dY/dWi)/SUM_j((dY/dWj)^2)

Now the step size is inversely proportional to the norm of the
gradient. If you ever get close to the minimum, you will take big
steps and move away from the minimum. So this method will not
converge.
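You can see the blow-up with one step of a NumPy sketch (again mine,
with the minimizing sign): on E(w) = 0.5*||w||^2 the gradient is w
itself, so starting next to the minimum produces a huge jump away
from it.

```python
import numpy as np

# The update w -= lr * grad / SUM(grad**2) has step length lr / ||grad||,
# so the closer you are to the minimum, the bigger the step.
lr = 0.1
w = np.array([1e-3, 0.0])    # very close to the minimum at 0
grad = w                     # dE/dw for E(w) = 0.5 * ||w||**2
w_new = w - lr * grad / np.sum(grad**2)
print(np.linalg.norm(w), np.linalg.norm(w_new))   # ~0.001 vs ~100
```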

You could also consider:
   Wi(t+1) = Wi(t) + learning_rate * (dY/dWi)/sqrt(SUM_j((dY/dWj)^2))

which would give you a constant step size. With batch training and a
suitable learning rate, this method would approach the minimum and then
wander around in the vicinity of the minimum without converging.
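The same quadratic sketch shows the wandering: every step is exactly
learning_rate long, so the iterate gets within about one step of the
minimum and then hops back and forth forever instead of settling down.

```python
import numpy as np

# Normalized-gradient descent on E(w) = 0.5 * ||w||**2: constant step
# length lr, so the distance to the minimum stops shrinking once it
# falls below lr and just oscillates.
lr = 0.05
w = np.array([3.0, 2.0])
dists = []
for _ in range(500):
    grad = w                                   # dE/dw
    w = w - lr * grad / np.linalg.norm(grad)   # unit-length direction
    dists.append(np.linalg.norm(w))
print(min(dists[-100:]), max(dists[-100:]))    # both stuck near lr, not -> 0
```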

The moral of the story is that you need an adaptive learning rate,
like Quickprop, RPROP, steepest descent with line-search, or any of
the other more sophisticated methods in the numerical optimization
literature.
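For instance, here is a sketch of steepest descent with a simple
backtracking line search (one adaptive scheme among the ones named
above; the Armijo constant 1e-4 and the halving factor are my choices,
not anything canonical): each iteration tries a generous step and
halves it until the error actually decreases enough.

```python
import numpy as np

def E(w, A):
    return 0.5 * w @ A @ w   # quadratic error surface

def dE(w, A):
    return A @ w             # its gradient

A = np.diag([1.0, 10.0])     # moderately ill-conditioned quadratic
w = np.array([1.0, 1.0])
for _ in range(200):
    g = dE(w, A)
    step = 1.0               # optimistic trial step each iteration
    # Backtrack (halve) until the error decreases enough (Armijo rule).
    while E(w - step * g, A) > E(w, A) - 1e-4 * step * (g @ g):
        step *= 0.5
    w = w - step * g
print(np.linalg.norm(w))     # ~0: converges with no hand-tuned rate
```

The step size adapts itself each iteration, which is exactly what the
fixed learning rate in the formulas above cannot do.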

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
