Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!cornellcs!newsstand.cit.cornell.edu!news.acsu.buffalo.edu!news.uoregon.edu!hammer.uoregon.edu!hunter.premier.net!feed1.news.erols.com!howland.erols.net!news.sprintlink.net!news-peer.sprintlink.net!interpath!news.interpath.net!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Scaling Question
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <E3Lqp5.Mq3@unx.sas.com>
Date: Mon, 6 Jan 1997 19:41:29 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <32AA4E4C.7ABB@netreach.net> <32AC2994.68D1@ais.net> <E2xM0n.2r1@unx.sas.com> <Pine.SOL.3.91.961231170451.6371F-100000@miles>
Organization: SAS Institute Inc.
Lines: 48


In article <Pine.SOL.3.91.961231170451.6371F-100000@miles>, Greg Heath <heath@ll.mit.edu> writes:
|> On Tue, 24 Dec 1996, Warren Sarle wrote:
|> 
|> > For a typical standard backprop net, scaling inputs to [-1,1] will
|> > give you a better chance of finding a global optimum and faster
|> > learning than scaling to [0,1]. 
|> 
|> Why is this true if 
|> 1. The activation functions are tanh and linear for the hidden and 
|>    output layers, respectively. 
|> 2. The output layer bias vector, b2, is initialized to the target vector 
|>    expected value <o>.
|> 3. The hidden layer bias vector, b1, hidden and output weight dyadics, w1 and 
|>    w2 are initialized to sufficiently small bipolar random values.
|> 
|> The output and hidden layer vectors are given by o = b2 + w2*h and 
|> h = tanh(b1 + w1*i), respectively. I don't see the advantage of the input 
|> i being bipolar as long as w1 has a bipolar initialization.

The answer to that is deep inside "Should I normalize/standardize/rescale
the data?" in part 2 of the FAQ:

But standardizing input variables can have far more important
effects on initialization of the weights than simply avoiding
saturation.  Assume we have an MLP with one hidden layer applied to
a classification problem and are therefore interested in the
hyperplanes defined by each hidden unit. Each hyperplane is the
locus of points where the net-input to the hidden unit is zero and
is thus the classification boundary generated by that hidden unit
considered in isolation. The connection weights from the inputs to a
hidden unit determine the orientation of the hyperplane.  The bias
determines the distance of the hyperplane from the origin. If the
bias terms are all small random numbers, then all the hyperplanes
will pass close to the origin.  Hence, if the data are not centered
at the origin, the hyperplane may fail to pass through the data
cloud. If all the inputs have a small coefficient of variation, it
is quite possible that all the initial hyperplanes will miss the
data entirely.  With such a poor initialization, local minima are
very likely to occur.  It is therefore important to center the
inputs to get good random initializations. 
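A quick numerical sketch of that argument (my own illustration, not from
the FAQ; the data cloud, network size, and initialization ranges below
are hypothetical):

```python
# Sketch: why small random biases + uncentered inputs make the initial
# hidden-unit hyperplanes miss the data cloud, and why centering fixes it.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical uncentered data cloud: inputs near 100 with small spread
# (small coefficient of variation), far from the origin.
X = 100.0 + rng.normal(scale=0.5, size=(1000, 2))

# Typical small random initialization for input-to-hidden weights/biases.
n_hidden = 10
W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, 2))
b1 = rng.uniform(-0.5, 0.5, size=n_hidden)

# A hidden unit's hyperplane is {x : w.x + b = 0}; it passes through the
# data cloud only if its net input changes sign somewhere in the data.
net = X @ W1.T + b1                       # shape (1000, n_hidden)
cuts = np.sum((net.min(axis=0) < 0) & (net.max(axis=0) > 0))
print("hyperplanes crossing the raw data:     ", cuts)

# Centering/standardizing puts the cloud at the origin, where all the
# near-origin hyperplanes live, so most of them now cut the data.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
net_c = Xc @ W1.T + b1
cuts_c = np.sum((net_c.min(axis=0) < 0) & (net_c.max(axis=0) > 0))
print("hyperplanes crossing the centered data:", cuts_c)
```

With the raw inputs, each unit's net input sits far from zero over the
whole cloud, so essentially no initial boundary touches the data; after
centering, most of them do.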
-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
 *** Do not send me unsolicited commercial or political email! ***

