Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!news.mathworks.com!news.kei.com!simtel!news.sprintlink.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Scaling the rows or the columns? With which method?
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DFHIqs.MHH@unx.sas.com>
Date: Mon, 25 Sep 1995 23:32:52 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References:  <436auo$ekd@nx2.hrz.uni-dortmund.de>
Organization: SAS Institute Inc.
Keywords: normalization, scaling, standardization,
Lines: 234


I assume we are talking about typical feedforward neural nets (NNs) 
such as multilayer perceptrons.

In article <436auo$ekd@nx2.hrz.uni-dortmund.de>, udra@flw.maschinenbau.uni-dortmund.de (Udo Raetzer (Dipl mawo)) writes:
|> Several kinds of prepreparing the input data are known,
|> e.g. normalization, scaling, ...
|>
|> In order to let each input vector have the same influence
|> on the training, row-prepreparation is necessary.

Not exactly. In linear models, there is a property of input cases called
"leverage" which measures the potential influence of the case.  Leverage
is related to Mahalanobis distance of the case from the mean of the
training set. Row-normalization of the inputs could equalize the
leverages of the training cases if done appropriately. However, this is
no reason to perform any kind of row-normalization for a NN, because
in a flexible nonlinear model such as a NN, there is no fixed leverage
for each case, and you can't compute the leverage until after you have
trained the net.
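To make "leverage" concrete for the linear case (a NumPy sketch of the linear-model property, not anything about NNs): the leverage of case i is the i-th diagonal element of the hat matrix, and with an intercept it relates to the squared Mahalanobis distance from the input mean by h_i = 1/n + d_i^2/(n-1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 training cases, 3 inputs
Xi = np.column_stack([np.ones(20), X])  # add an intercept column

# Leverage = diagonal of the hat matrix H = X (X'X)^{-1} X'
H = Xi @ np.linalg.inv(Xi.T @ Xi) @ Xi.T
leverage = np.diag(H)

# Squared Mahalanobis distance of each case from the input mean
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)             # sample covariance (ddof=1)
d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu)

# With an intercept, leverage = 1/n + d^2/(n-1)
n = X.shape[0]
print(np.allclose(leverage, 1/n + d2/(n - 1)))  # True
```

Cases far from the mean of the training set thus have large leverage; in a fitted linear model, leverage is fixed before training, which is exactly what fails to hold for a flexible nonlinear model.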

Furthermore, row-normalization throws away information (in fact, that is
its purpose). Before you throw away any information, make sure it's not
_useful_ information!  The only reason to do row-normalization is to
throw away _useless_ information.
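A two-line NumPy sketch of exactly what gets thrown away: two cases that differ only in over-all magnitude become indistinguishable after row-normalization.

```python
import numpy as np

# Two input cases that differ only in over-all magnitude
a = np.array([1.0, 2.0, 2.0])
b = 10 * a

# Row-normalization (dividing each case by its Euclidean length)
# maps both onto the same unit vector: the size information is gone.
na = a / np.linalg.norm(a)
nb = b / np.linalg.norm(b)
print(np.allclose(na, nb))   # True
```

If magnitude carries class information in your problem, this is _useful_ information you have just destroyed.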

|> On the other hand, column-prepreparation of the input data
|> is necessary if each component of the input vectors should
|> have equal influence on the training.

This is true for the targets, not the inputs, and is only relevant
if you have two or more target variables (I will assume least-squares
training so I can omit several other "if"s).

There are reasons for standardizing input variables, but they pertain
only to numerical efficiency and are discussed below.

|>    I know
|>    a) normalization (dividing each vector by its length => length = 1)
|>    b) scaling (projection of the range onto the interval [0,1] or [A,B])
|>    c) standardization (unit-variance, zero-mean)

These terms are not always used with such precise meanings.

|>    Are all three methods applicable for row- as well as for column-
|>    prepreparation of the input?

Yes.

|>    I assume that e.g. normalization is only appropriate for row-prepreparation,
|>    or am I mistaken?

That would be unusual but not unheard of. However, for input variables,
it's important to have them centered around zero, so (c) would usually
be preferable to (a) or (b).
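For concreteness, here are the three methods in NumPy, applied column-wise to an input matrix (a sketch; any statistics package does the same):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(5, 15, size=(100, 4))   # raw inputs (cases x variables)

# (a) normalization: divide each row vector by its Euclidean length
Xa = X / np.linalg.norm(X, axis=1, keepdims=True)

# (b) scaling: map each column's range onto [0, 1]
Xb = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# (c) standardization: zero mean, unit variance for each column
Xc = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.linalg.norm(Xa, axis=1)[:3])   # row lengths: all 1
print(Xb.min(axis=0), Xb.max(axis=0))   # column ranges: [0, 1]
print(Xc.mean(axis=0), Xc.std(axis=0))  # ~0 means, unit stds
```

Note that only (c) centers each variable at zero, which is the property that matters for initialization, as discussed below.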

And now for the gory details, here is an excerpt from the documentation
for my %TNN macro:

Standardizing either inputs or targets tends to make the training
process better behaved by improving the numerical condition of the
optimization problem and ensuring that certain default values involved
in initialization and termination are appropriate. Standardizing targets
can also affect the objective function.  In some applications you may
need to standardize each observation (row-wise standardization rather
than column-wise), which you can do via a DATA step or view.

Standardizing inputs affects the weights but not the outputs. Each
connection leading from an input node has an estimated weight. If you
multiply the input by a constant, you can divide each associated weight
by the same constant, and the outputs will not change.  If you add a
constant to an input, you can make a corresponding adjustment to each
associated bias term, and again the outputs will not change.
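The compensation is easy to verify numerically for a single connection (a sketch; w and b are arbitrary illustrative values): standardize the input, absorb the scale into the weight and the shift into the bias, and the net input is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)  # one input variable
w, b = 0.7, -0.3                              # weight and bias into a unit

net = w * x + b                  # original net input

# Standardize the input, then compensate in the weight and bias:
m, s = x.mean(), x.std()
xs = (x - m) / s
w2 = w * s                       # absorb the scale into the weight
b2 = b + w * m                   # absorb the shift into the bias

net2 = w2 * xs + b2
print(np.allclose(net, net2))    # True: net inputs, hence outputs, unchanged
```

Algebraically: w*s*((x-m)/s) + (b + w*m) = w*x + b.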

Output nodes also have associated weights and bias terms.  In many
cases, standardization of targets can be absorbed by corresponding
adjustments to the output weights and biases.  However, standardizing
targets will affect outputs and/or the objective function in the
following cases:

 * the output activation function is nonlinear

 * the loss function assumes that the target variables are counts
   (BERNOULLI, BINOMIAL, MULTINOMIAL, POISSON) and hence on an
   absolute scale of measurement

 * the loss function assumes that the target variables are nonnegative
   values on a ratio scale of measurement (GAMMA)

 * there are two or more targets with loss functions that optionally
   involve a scale parameter (e.g., NORMAL, GAMMA, and the M
   estimators) and you do not explicitly estimate the scale
   parameters.

Thus there are two considerations regarding standardization of targets:
the output activation function and the loss function.

If the targets are not bounded, it usually is not advisable to use a
bounded output activation function.

If the output activation function is bounded (typically between 0 and
1), the targets obviously should fall in the same range. It is often
recommended in the NN literature to scale targets between .1 and .9,
since traditional training methods can be too slow to get the weights
large enough to give good predictions of target values of 0 and 1.
However, this makes it impossible to get a good fit if there are many
target values at each end of the range. Unlike traditional training
algorithms, the algorithms in PROC NLP have no difficulty making the
weights large enough to fit target values of 0 and 1. Hence, there is
no reason to use a reduced range such as [.1,.9] with the %TNN macro.
Rather than standardizing the targets with STD=RANGE, it may be more
convenient to adjust the range of the output activation function to
correspond to the range of the targets by using the ADD= and MULT=
arguments to %OUT or %TNN.
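The ADD=/MULT= idea in Python terms (an illustration of the arithmetic, not the macro syntax): add + mult * logistic(z) has range (add, add + mult), so choose those two constants from the target range instead of rescaling the targets.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Suppose the targets lie in [-5, 20]. Rather than rescaling them
# into [0, 1], stretch the activation to cover the target range:
add, mult = -5.0, 25.0           # range becomes (-5, -5 + 25) = (-5, 20)

def output_act(z):
    return add + mult * logistic(z)

print(output_act(0.0))           # midpoint of the target range
```

The same weights that would drive a plain logistic toward 0 and 1 now drive this activation toward -5 and 20.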

For a loss function that assumes an absolute or ratio scale of
measurement for the target variables, standardizing targets is not
_permissible_ in the technical sense. In most other cases, standardizing
targets is useful. However, when you have two or more targets with loss
functions that optionally involve a scale parameter (e.g., NORMAL,
GAMMA, and the M estimators) and you do not explicitly estimate the
scale parameters, the results are sensitive to the relative scaling of
the targets. A target variable with twice the variance of another target
variable has roughly twice the influence on training.  Hence it is often
advisable to scale such targets to equal variances; it is imperative to
scale such targets if they are measured in noncomparable units such as
miles and centimeters (different units for a comparable attribute) or
miles and seconds (different attributes).  However, you can scale the
targets to different variances if you explicitly want to give one target
more weight than the other (you can also do that with a WEIGHT= variable
in the %OUT macro).

If you do explicitly estimate scale parameters, standardizing targets
causes the scale estimates to apply to the standardized values, not to
the original values. Hence this combination of options may not be
appropriate if you want to interpret the scale estimates in the original
units of measurement.

Standardizing inputs has a more subtle effect than standardizing
targets. Standardizing inputs affects initialization and some training
methods.

It is customary to initialize the weights, including biases, to small
pseudorandom values. Assume we have an MLP with one hidden layer applied
to a classification problem and are therefore interested in the
hyperplanes defined by each hidden unit. Each hyperplane is the locus of
points where the net-input to the hidden unit is zero and is thus the
classification boundary generated by that hidden unit considered in
isolation. The connection weights from the inputs to a hidden unit
determine the orientation of the hyperplane. The bias determines the
distance of the hyperplane from the origin. If the bias terms are all
small random numbers, then all the hyperplanes will pass close to the
origin. Hence, if the data are not centered at the origin, the
hyperplane may fail to pass through the data cloud. If all the inputs
have a small coefficient of variation, it is quite possible that all the
initial hyperplanes will miss the data entirely. With such a poor
initialization, local minima are very likely to occur. It is therefore
important to center the inputs to get good random initializations.
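This is easy to demonstrate numerically (a sketch with arbitrary illustrative numbers): put a tight data cloud far from the origin, draw small random weights and biases, and count how many of the resulting hyperplanes actually pass through the cloud, i.e. how many hidden units see a sign change in their net input.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=100.0, scale=1.0, size=(500, 2))  # cloud far from origin
Xcen = X - X.mean(axis=0)                            # centered copy

def frac_hyperplanes_cutting(data, n_units=1000):
    # Small random initial weights and biases, as is customary
    W = rng.normal(scale=0.1, size=(n_units, 2))
    b = rng.normal(scale=0.1, size=n_units)
    net = data @ W.T + b         # net inputs, cases x hidden units
    # A hyperplane cuts the cloud iff the net input changes sign over it
    return np.mean((net.min(axis=0) < 0) & (net.max(axis=0) > 0))

f_raw = frac_hyperplanes_cutting(X)
f_cen = frac_hyperplanes_cutting(Xcen)
print(f_raw, f_cen)              # near 0 uncentered, near 1 centered
```

With the uncentered data, almost every initial hyperplane misses the cloud entirely; after centering, almost every one cuts through it.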

The main emphasis in the NN literature on initial values has been on the
avoidance of saturation, hence the desire to use small random values.
How small these random values should be depends on the scale of the
inputs as well as the number of inputs and their correlations.
Standardizing inputs removes the problem of scale dependence of the
initial weights.

Standardizing inputs has different effects on different algorithms:

 * Steepest descent is very sensitive to scaling. The more
   ill-conditioned the Hessian is, the slower the convergence. Hence,
   scaling is an important consideration for gradient descent methods
   such as standard backprop. Momentum, if properly chosen, alleviates
   bad scaling to some extent.

 * Quasi-Newton and conjugate gradient methods begin with a steepest
   descent step and therefore are scale sensitive. However, they
   accumulate second-order information as training proceeds and hence
   are less scale sensitive than pure gradient descent.

 * Newton-Raphson and Gauss-Newton, if implemented correctly, are
   theoretically invariant under scale changes as long as none of the
   scaling is so extreme as to produce underflow or overflow.

 * Levenberg-Marquardt is scale invariant as long as no ridging is
   required. There are several different ways to implement ridging;
   some are scale invariant and some are not. The Moré method used in
   PROC NLP is scale sensitive, so extremely bad scaling should be
   avoided.
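The first point is easy to see on a toy quadratic (a sketch; the Hessian eigenvalues are chosen arbitrarily): steepest descent on a well-conditioned quadratic converges almost immediately, while a condition number of 100 costs on the order of a thousand iterations.

```python
import numpy as np

def steepest_descent_iters(hess_diag, tol=1e-6, max_iter=100000):
    """Minimize f(x) = 0.5 * x'Hx for diagonal H by gradient descent."""
    h = np.asarray(hess_diag, dtype=float)
    x = np.ones_like(h)
    lr = 1.0 / h.max()           # step size set by the largest eigenvalue
    for i in range(max_iter):
        g = h * x                # gradient of the quadratic
        x = x - lr * g
        if np.abs(x).max() < tol:
            return i + 1
    return max_iter

it_good = steepest_descent_iters([1.0, 1.0])    # condition number 1
it_bad = steepest_descent_iters([1.0, 100.0])   # condition number 100
print(it_good, it_bad)
```

The slow direction shrinks only by a factor of (1 - 1/100) per step, which is the ill-conditioning penalty that standardizing inputs helps avoid.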

You may want to standardize each observation if there is extraneous
variability between observations. For example, suppose you want to
classify plant specimens according to species but the specimens are at
different stages of growth. You have measurements such as stem length,
leaf length, and leaf width. However, the over-all size of the specimen
is determined by age or growing conditions, not by species. Given
sufficient data, a NN could learn to ignore the size of the specimens
and classify them by shape instead.  However, training will be easier
and generalization better if you can remove the extraneous size
information before training the network.

If the input data are measured on an interval scale, you can control for
size by subtracting a measure of the over-all size of each observation
from each datum. For example, if no other direct measure of size is
available, you could subtract the mean of each row of the input matrix,
producing a row-centered input matrix.

If the data are measured on a ratio scale, you can control for size by
dividing each datum by a measure of over-all size; in this case, the
geometric mean is a more natural measure of size than the arithmetic
mean. However, it is often more meaningful to analyze the logarithms of
ratio-scaled data, in which case you can subtract the arithmetic mean
after taking logarithms. You must also consider the dimensions of
measurement.  For example, if you have measures of both length and
weight, you may need to cube the measures of length or take the cube
root of the weights. In NN applications with ratio-level data, it is
common to divide by the Euclidean length of each row, which projects the
data points onto the surface of a unit hypersphere.
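The equivalences in this paragraph can be checked in a few lines (a sketch with made-up measurements; the second specimen is the first at five times the size):

```python
import numpy as np

# Measurements of plant parts on a ratio scale (all positive)
X = np.array([[10.0, 4.0, 2.0],     # young specimen
              [50.0, 20.0, 10.0]])  # older specimen, same shape, 5x size

# Dividing each row by its geometric mean removes over-all size...
gm = np.exp(np.log(X).mean(axis=1, keepdims=True))
print(np.allclose(X[0] / gm[0], X[1] / gm[1]))   # True

# ...equivalently, take logs and subtract each row's arithmetic mean
L = np.log(X)
Lc = L - L.mean(axis=1, keepdims=True)

# Common NN alternative: divide each row by its Euclidean length,
# projecting the cases onto the surface of the unit hypersphere
U = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.allclose(U[0], U[1]))                   # True: size gone, shape kept
```

Either way, the two specimens now present the same shape information to the network.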

Issues of size and shape are pertinent to many areas besides biology.
Suppose you have data consisting of subjective ratings made by several
different raters.  Some raters may tend to give higher over-all ratings
than other raters.  Some raters may also tend to spread out their
ratings over more of the scale than do other raters. If it is impossible
to directly adjust for rater differences, then you can standardize the
observations to control for both differences in size and variability.
For example, if the data are considered to be measured on an interval
scale, you can subtract the mean of each observation and divide by the
standard deviation, producing a row-standardized data matrix.
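A sketch of row-standardization on made-up rating data: the second rater below gives the first rater's ratings compressed and shifted, and row-standardizing makes their profiles identical.

```python
import numpy as np

# Ratings of 5 items by 3 raters (rows = raters, columns = items)
R = np.array([[7.0, 8.0, 6.0, 9.0, 5.0],   # generous, spread-out rater
              [3.0, 3.5, 2.5, 4.0, 2.0],   # harsh, compressed rater
              [5.0, 6.0, 4.0, 7.0, 3.0]])

# Row-standardize: subtract each rater's mean, divide by their std.
# This removes differences in both over-all level and variability.
Rs = (R - R.mean(axis=1, keepdims=True)) / R.std(axis=1, keepdims=True)

print(Rs.mean(axis=1))            # ~0 for every rater
print(Rs.std(axis=1))             # 1 for every rater
print(np.allclose(Rs[0], Rs[1]))  # True: raters 1 and 2 agree on shape
```

What survives is each rater's relative ordering and spacing of the items, which is the information a rating is supposed to carry.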

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
