Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!bloom-beacon.mit.edu!news.kei.com!news.mathworks.com!udel!gatech!concert!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Range for Input-/outputdata?
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <Cyyn48.LsL@unx.sas.com>
Date: Tue, 8 Nov 1994 17:55:20 GMT
References:  <39ltom$l7o@fuhainf.fernuni-hagen.de>
Nntp-Posting-Host: hotellng.unx.sas.com
Organization: SAS Institute Inc.
Lines: 133


In article <39ltom$l7o@fuhainf.fernuni-hagen.de>, annette.lunau@fernuni-hagen.de (Annette Lunau) writes:
|> I'm doing a paper on stock prediction (no, not again...) with
|> backprop networks.  All the books I've read say that you have to normalize
|> the input data in order to get good training results.  Can anybody explain to
|> me why?  Doesn't the range for the input data depend on the
|> transformation/output function?  How do I find out which would be the best
|> range to transform my data to?

Scott's reply covered the basics pretty well. For those of you who
want the gory details, here is an excerpt from the documentation of
my TNN macro:

Standardization tends to make the training process better behaved by
improving the numerical condition of the optimization problem and
ensuring that certain default values involved in initialization and
termination are appropriate. For linear models fitted by least-squares,
the results are invariant or equivariant under addition or
multiplication of the variables by a constant; hence there is no
disadvantage to standardization. Invariance or equivariance holds
similarly for standardization of the inputs in a feedforward NN.
Invariance or equivariance fails for standardization of the outputs if:

 * the output activation function is nonlinear

 * the loss function assumes that the target variables are counts
   (BERNOULLI, BINOMIAL, MULTINOMIAL, POISSON) and hence on an
   absolute scale of measurement

 * the loss function assumes that the target variables are nonnegative
   values on a ratio scale of measurement (GAMMA)

 * there are two or more targets with loss functions that optionally
   involve a scale parameter (e.g., NORMAL, GAMMA, and the M
   estimators) and you do not explicitly estimate the scale
   parameters.

Thus there are two considerations regarding standardizing targets:
the output activation function and the loss function.

If the output activation function is bounded (typically between 0 and
1), the targets obviously should fall in the same range. It is often
recommended in the NN literature to scale targets between .1 and .9,
since traditional training methods can be too slow to get the weights
large enough to give good predictions of target values of 0 and 1.
However, this makes it impossible to get a good fit if there are many
target values at each end of the range, and PROC NLP has no difficulty
making the weights large enough to fit target values of 0 and 1.
Rather than standardizing the targets with STD=RANGE, it may be more
convenient to adjust the range of the output activation function to
correspond to the range of the targets by using the ADD= and MULT=
arguments to %OUT or %TNN. If the targets are not bounded, it usually
is not advisable to use a bounded output activation function.
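As a sketch of the idea in Python rather than the TNN macro itself
(the function name scaled_logistic and the interval [20, 80] are my
own illustration, not anything from the macro), stretching a logistic
output activation to the range of the targets looks like this:

```python
import numpy as np

def scaled_logistic(x, lo, hi):
    """Logistic activation rescaled to [lo, hi] -- the analogue of
    adjusting the output activation with ADD= and MULT= instead of
    standardizing the targets with STD=RANGE."""
    return lo + (hi - lo) / (1.0 + np.exp(-x))

# Targets that range over [20, 80]: stretch the activation to match,
# leaving the targets in their original units of measurement.
net_input = np.array([-5.0, 0.0, 5.0])
print(scaled_logistic(net_input, 20.0, 80.0))
```

The predictions then come out directly in the original units, with no
back-transformation step after training.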

For a loss function that assumes an absolute or ratio scale of
measurement for the target variables, standardizing targets is not
_permissible_ in the technical sense. In most other cases, standardizing
targets is useful. However, when you have two or more targets with loss
functions that optionally involve a scale parameter (e.g., NORMAL,
GAMMA, and the M estimators) and you do not explicitly estimate the
scale parameters, the results are sensitive to the relative scaling of
the targets. A target variable with twice the variance of another target
variable has roughly twice the influence on training.  Hence it is often
advisable to scale such targets to equal variances; it is imperative to
scale such targets if they are measured in noncomparable units such as
miles and centimeters (different units for a comparable attribute) or
miles and seconds (different attributes).  However, you can scale the
targets to different variances if you explicitly want to give one target
more weight than the other (you can also do that with a WEIGHT= variable
in the %OUT macro).
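A minimal Python sketch of equal-variance scaling for two targets in
noncomparable units (the variable names and the simulated data are
invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two targets measured in noncomparable units, e.g. miles and
# centimeters: very different variances.
y_miles = rng.normal(100.0, 5.0, size=200)
y_cm    = rng.normal(3.0, 0.1, size=200)

# Scale each target to unit variance so neither dominates the
# combined loss (center as well if the loss calls for it).
targets = np.column_stack([y_miles / y_miles.std(ddof=1),
                           y_cm    / y_cm.std(ddof=1)])
print(targets.std(axis=0, ddof=1))   # both exactly 1
```

To deliberately give one target twice the influence, you would scale
it to twice the standard deviation instead (or use a WEIGHT= variable
as noted above).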

If you do explicitly estimate scale parameters, standardizing targets
causes the scale estimates to apply to the standardized values, not to
the original values. Hence this combination of options may not be
appropriate if you want to interpret the scale estimates in the original
units of measurement.

Standardizing inputs has a more subtle effect than standardizing
outputs. Since each input is multiplied by estimated weights, a change
in scale of an input can be absorbed by the reciprocal change in the
corresponding weights and biases.  In this sense, the network should be
invariant under change of scale of the inputs.  In practice, scaling the
inputs affects initialization and some training methods.

It is customary to initialize the weights, including biases, to small
pseudorandom values. Assume we have an MLP with one hidden layer applied
to a classification problem and are therefore interested in the
hyperplanes defined by each hidden unit. Each hyperplane is the locus of
points where the net-input to the hidden unit is zero and is thus the
classification boundary generated by that hidden unit considered in
isolation. The connection weights from the inputs to a hidden unit
determine the orientation of the hyperplane. The bias determines the
distance of the hyperplane from the origin. If the bias terms are all
small random numbers, then all the hyperplanes will pass close to the
origin. Hence, if the data are not centered at the origin, the
hyperplane may fail to pass through the data cloud. If all the inputs
have a small coefficient of variation, it is quite possible that all the
initial hyperplanes will miss the data entirely. With such a poor
initialization, local minima are very likely to occur. It is therefore
important to center the inputs to get good initializations.
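This is easy to see numerically. The following Python sketch (the
particular weights, bias, and input distribution are my own choices,
not from the macro) places a data cloud at mean (100, 100) with unit
spread; a hyperplane w.x + b = 0 with small w and b lies only
|b|/||w|| units from the origin, while the cloud sits about 141 units
away, so every point lands on the same side of the initial boundary:

```python
import numpy as np

rng = np.random.default_rng(1)
# Inputs far from the origin with a small coefficient of variation:
# mean 100, standard deviation 1.
X = rng.normal(100.0, 1.0, size=(500, 2))

w = np.array([0.05, -0.03])   # small initial weights
b = 0.02                      # small initial bias

# Distance of the hyperplane from the origin, and the fraction of
# points on the positive side (1.0 means the boundary misses the
# data entirely).
net = X @ w + b
print(abs(b) / np.linalg.norm(w), (net > 0).mean())

# Centering the inputs moves the cloud to the origin, where
# small-bias hyperplanes actually cut through it.
net_c = (X - X.mean(axis=0)) @ w + b
print((net_c > 0).mean())
```

With uncentered inputs the fraction is 1.0 (the boundary misses the
cloud); after centering, points fall on both sides.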

The main emphasis in the NN literature on initial values has been on the
avoidance of saturation, hence the desire to use small random values.
How small these random values should be depends on the scale of the
inputs as well as the number of inputs and their correlations.
Standardizing inputs removes the problem of scale dependence of the
initial weights.

Steepest descent is very sensitive to scaling. The more ill-conditioned
the Hessian is, the slower the convergence. Hence, scaling is an
important consideration for gradient descent methods such as standard
backprop. Momentum, if properly chosen, alleviates bad scaling to some
extent.
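The conditioning effect is easy to demonstrate on a diagonal quadratic
(a toy Python sketch, not standard backprop on a real network): the
step size must stay below 2 divided by the largest curvature, so a
badly scaled direction forces tiny steps everywhere else.

```python
import numpy as np

def descend(hessian_diag, lr, steps=200):
    """Plain gradient descent on f(x) = 0.5 * sum(h_i * x_i**2),
    started from x = (1, 1)."""
    x = np.ones_like(hessian_diag)
    for _ in range(steps):
        x = x - lr * hessian_diag * x
    return np.linalg.norm(x)

# Well-conditioned: both curvatures equal; 200 steps nearly reach 0.
well = descend(np.array([1.0, 1.0]), lr=0.9)
# Ill-conditioned (condition number 100): stability requires
# lr < 2/100, so the flat direction barely moves in 200 steps.
ill = descend(np.array([1.0, 100.0]), lr=0.009)
print(well, ill)
```

The well-conditioned run shrinks the error to essentially zero while
the ill-conditioned run is still far from the minimum, exactly the
slowdown that poor input scaling inflicts on standard backprop.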

Quasi-Newton and conjugate gradient methods begin with a steepest
descent step and therefore are scale sensitive. However, they accumulate
second-order information as training proceeds and hence are less scale
sensitive than pure gradient descent.

Newton-Raphson and Gauss-Newton, if implemented correctly, are
theoretically invariant under scale changes as long as none of the
scaling is so extreme as to produce underflow or overflow.
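The invariance can be checked directly on a quadratic (again a Python
sketch with made-up numbers): rescaling a variable rescales the
Hessian and gradient so that the Newton step lands on the same point
in the original coordinates.

```python
import numpy as np

# Quadratic f(x) = 0.5 x'Hx - g'x with minimizer solve(H, g).
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -2.0])

def newton_step(H, g, x0):
    """One Newton-Raphson step; exact for a quadratic."""
    return x0 - np.linalg.solve(H, H @ x0 - g)

x_star = newton_step(H, g, np.zeros(2))

# Rescale the first variable by 1000 (say, meters -> millimeters):
# in the scaled coordinates the Hessian becomes D H D and the
# gradient term becomes D g.
D = np.diag([1000.0, 1.0])
x_s = newton_step(D @ H @ D, D @ g, np.zeros(2))

# Undoing the scaling recovers the same minimizer.
print(np.allclose(D @ x_s, x_star))   # True
```

Gradient descent applied to the same two problems would behave very
differently, which is the contrast drawn above.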

Levenberg-Marquardt is scale invariant as long as no ridging is
required. There are several different ways to implement ridging; some
are scale invariant and some are not. The Moré method used in PROC NLP
is scale sensitive, so extremely bad scaling should be avoided.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
