Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel-eecis!gatech!arclight.uoregon.edu!news.sprintlink.net!news-peer.sprintlink.net!cs.utexas.edu!newshost.convex.com!newsgate.duke.edu!interpath!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: MLP Initialization values
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <Dy0BID.LEB@unx.sas.com>
Date: Fri, 20 Sep 1996 01:15:49 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <322C9E07.1F3D@iol.it> <512f54$dpc@nuacht.iol.ie> <51lsq7$rpe@llnews.ll.mit.edu>
Organization: SAS Institute Inc.
Lines: 67


In article <51lsq7$rpe@llnews.ll.mit.edu>, heath@ll.mit.edu (Greg Heath) writes:
|> On 10 Sep 1996 in <512f54$dpc@nuacht.iol.ie>, michael mc ardle
|> <mmcardle@iol.ie> wrote:
|> 
|> |> Hi Mark,
|> |> I'm no NN expert, but I don't think the start values are very
|> |> important so long as they are small and random; at least this is 
|> |> the way NEURAL_WORKS does it.
|> |>
|> |> Yours Sincerely, M J Mcardle.
|> 
|> Small, random, and *bipolar*... but be careful! 
|> 
|> Obviously, if initial weights are too large, training will 
|> take forever because of tanh/sigmoid activation function saturation. 
|> However, if initial weights are too small, training will also take 
|> forever to reach operational values.

Very true of standard backprop, but less of an issue for training
methods that adjust the learning rate in a sensible way.

|>  1. Generalizing Bishop's argument (page #s?):
|> ...  
|>       stdev{z} ~ 1   ==>   stdev{w} ~ 1/(sqrt(d) * stdev{x}) 
|> 
|> 2. Haykin, p156-7, 160-2: 
|> ...
|>   b. weights uniformly distributed in [-2.4/d, 2.4/d] 
|> 
|> Weird. I need a little more time to think about this one. 

If the inputs are uncorrelated, you should divide by sqrt(d) rather
than by d (the number of inputs), as Bishop indicates. You would need to
divide by d only to guard against a case where the inputs are highly
correlated and one of the initial weight vectors happens to point along
the direction of the first principal component. But highly
correlated inputs will slow down most training algorithms, so it
would usually be preferable to transform the inputs to reduce the
correlation (e.g., by principal components).
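As a rough NumPy sketch of the sqrt(d) rule (my own illustration, not
Bishop's exact recipe): give the weights for input j a standard
deviation of 1/(sqrt(d)*stdev(x_j)), so that for roughly uncorrelated
inputs the net input to each hidden unit has standard deviation near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_input_weights(X, n_hidden):
    # A sketch of the sqrt(d) scaling discussed above: weights for
    # input j get standard deviation 1/(sqrt(d) * stdev(x_j)), so
    # that, for roughly uncorrelated inputs, the net input to each
    # hidden unit has standard deviation near 1.
    d = X.shape[1]                       # number of inputs
    s = X.std(axis=0)                    # per-input standard deviations
    W = rng.normal(0.0, 1.0, size=(d, n_hidden))
    return W / (np.sqrt(d) * s[:, None])
```

With highly correlated inputs this scaling can still land an initial
weight vector near the first principal component, which is why
decorrelating the inputs first is usually the better cure.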

Also note that you should divide the weights for each input by the
standard deviation of that individual input (I think that Greg's ascii
transliteration of Bishop's algebra was using the sqrt of the trace of
the input covariance matrix, but I may have been misreading it). I
prefer just to standardize the inputs, which accomplishes the same
effect regarding the initial weights and also improves the condition of
the optimization problem.
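Standardizing the inputs is a one-liner in practice; a minimal sketch:

```python
import numpy as np

def standardize(X):
    # Rescale each input to zero mean and unit standard deviation.
    # This makes the per-input scaling of the initial weights
    # unnecessary and improves the conditioning of the optimization.
    return (X - X.mean(axis=0)) / X.std(axis=0)
```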

If you are using early stopping, divide the initial weights by another
factor of 10 or 100 to make sure that the initial outputs are nice
and smooth.

As for hidden-to-output weights, you can set them initially to zero, and
set the output bias to make the initial outputs equal to the mean of the
target values.  If you want to be fancy, you might be able to save an
iteration by initializing the hidden-to-output weights by linear least
squares as Masters (1994), Practical Neural Network Recipes in C++,
suggests, but I have not found this to be worth the bother.
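The zero-weights-plus-mean-bias initialization of the output layer can
be sketched as follows (my own illustration of the idea above):

```python
import numpy as np

def init_output_layer(y, n_hidden):
    # Zero hidden-to-output weights plus an output bias equal to the
    # mean target: the network starts out predicting the target mean
    # for every input, whatever the hidden-layer weights are.
    w_out = np.zeros(n_hidden)
    b_out = y.mean()
    return w_out, b_out
```

The fancier alternative, fitting the hidden-to-output weights by linear
least squares on the initial hidden-unit outputs, amounts to one call to
a least-squares solver instead of the zeros above.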



-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
