Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!howland.reston.ans.net!news.sprintlink.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: How do YOU apply noise for training?
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <D7rwCt.LBB@unx.sas.com>
Date: Sat, 29 Apr 1995 01:34:05 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References:  <3nlu89$r8i@uuneo.neosoft.com>
Organization: SAS Institute Inc.
Lines: 167


In article <3nlu89$r8i@uuneo.neosoft.com>, hav@neosoft.com writes:
|> I was wondering how folks are applying noise for training.
|>
|> I have tried using noise on inputs and on outputs and find
|> similar results in both cases.  Has anyone studied the
|> difference in generalization (if any) between perturbation
|> of inputs vs outputs for training data?

Training with noise is a form of regularization related to weight decay
and ridge regression. It is also a form of smoothing related to kernel
regression (aka a generalized regression neural network).

It is necessary to distinguish noise that is deliberately added to the
training data from noise that unavoidably contaminates the target
values. I will refer to the former as "jitter" and the latter as
"noise".

Training with jitter works because the functions that we want NNs to
learn are mostly smooth. NNs can learn functions with discontinuities,
but the discontinuities must be restricted to sets of measure zero if
our network is restricted to a finite number of hidden units.

In other words, if we have two cases with similar inputs, the desired
outputs will usually be similar.  That means we can take any training
case and generate new training cases by adding small amounts of jitter
to the inputs.  As long as the amount of jitter is sufficiently small,
we can assume that the desired output will not change enough to be of
any consequence, so we can just use the same target value.  The more
training cases, the merrier, so this looks like a convenient way to
improve training. But too much jitter will obviously produce garbage,
while too little jitter will have little effect.
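
To make this concrete, here is a minimal sketch of input jittering in
Python with numpy (the function name and defaults are mine, purely for
illustration):

    import numpy as np

    def jitter_inputs(X, Y, c, s, rng=None):
        """Replace each of the n training cases by c jittered copies.

        X is the n by p input matrix, Y the n-vector of targets.
        The targets are simply repeated, on the assumption that
        jitter with standard deviation s is too small to change
        the desired outputs appreciably.
        """
        if rng is None:
            rng = np.random.default_rng(0)
        n, p = X.shape
        Q = np.repeat(X, c, axis=0) + rng.normal(0.0, s, size=(c * n, p))
        T = np.repeat(Y, c)
        return Q, T

The hard part, of course, is choosing s, which is what the rest of
this post is about.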

When studying nonlinear models such as feedforward NNs, it is often
helpful first to consider what happens in linear models, and then
to see what difference the nonlinearity makes. So let's consider
training with jitter in a linear model. Notation:
   x_ij is the value of the jth input (j=1, ..., p) for the
        ith training case (i=1, ..., n).
   X={x_ij} is an n by p matrix.
   y_i is the target value for the ith training case.
   Y={y_i} is a column vector.

Without jitter, the least-squares weights are B = inv(X'X)X'Y, where
"inv" indicates a matrix inverse and "'" indicates transposition.  Note
that if we replicate each training case c times, or equivalently stack c
copies of the X and Y matrices on top of each other, the least-squares
weights are inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before.
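
A quick numerical check of the replication argument (Python/numpy,
arbitrary data):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, c = 50, 3, 10
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=n)

    B = np.linalg.solve(X.T @ X, X.T @ Y)       # ordinary least squares
    Xc = np.tile(X, (c, 1))                     # c stacked copies of X
    Yc = np.tile(Y, c)                          # and of Y
    Bc = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)
    print(np.allclose(B, Bc))                   # True: replication changes nothing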

With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ..., c, where
z_ijk is produced by some random number generator, usually with a normal
distribution with mean 0 and standard deviation s, and the z_ijk's are
all independent. In place of the n by p matrix X, this gives us a big
matrix, say Q, with cn rows and p columns. To compute the least-squares
weights, we need Q'Q. Let's consider the jth diagonal element of Q'Q,
which is
   sum_{i,k} (x_ij+z_ijk)^2 = sum_{i,k} (x_ij^2 + z_ijk^2 + 2 x_ij z_ijk)

which is approximately, for c large (the average of the z_ijk^2 terms
approaches s^2, while the cross products x_ij z_ijk average to zero),

   c(sum_i x_ij^2 + ns^2)

which is c times (the corresponding diagonal element of X'X, plus ns^2).
Now consider the u,vth off-diagonal element of Q'Q, which is

   sum_{i,k} (x_iu+z_iuk)(x_iv+z_ivk)

which is approximately, for c large (every term involving a z averages
to zero, since the z's are independent of each other and of the x's),

   c sum_i x_iu x_iv

which is just c times the corresponding element of X'X. Thus, for
large c, Q'Q is approximately c(X'X+ns^2I), where I is an identity
matrix of appropriate size.
Similar computations show that the crossproduct of Q with the target
values is cX'Y. Hence the least-squares weights with jitter of variance
s^2 are given by B(ns^2) = inv(c(X'X+ns^2I))cX'Y = inv(X'X+ns^2I)X'Y.
In the statistics literature, B(ns^2) is called a ridge regression
estimator with ridge value ns^2.
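
This equivalence is easy to check numerically. A sketch, with
arbitrary constants:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, c, s = 30, 4, 2000, 0.5
    X = rng.normal(size=(n, p))
    Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

    # least squares on c jittered copies of each training case
    Q = np.repeat(X, c, axis=0) + rng.normal(0.0, s, size=(c * n, p))
    T = np.repeat(Y, c)
    B_jitter = np.linalg.solve(Q.T @ Q, Q.T @ T)

    # ridge regression with ridge value ns^2
    B_ridge = np.linalg.solve(X.T @ X + n * s * s * np.eye(p), X.T @ Y)

    print(B_jitter)    # approaches B_ridge as c grows
    print(B_ridge)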

If we were to add jitter to the target values Y, the crossproduct of
the inputs with the targets would still be approximately cX'Y for
large c, for the same reason that the off-diagonal elements of Q'Q are
not affected by jitter: every term involving jitter averages to zero.
Hence, adding jitter to the targets will not change the optimal
weights; it will just slow down training.
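
The same kind of check shows that jittering the targets leaves the
least-squares weights approximately unchanged:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, c, s = 30, 4, 2000, 0.5
    X = rng.normal(size=(n, p))
    Y = X @ rng.normal(size=p) + rng.normal(size=n)

    B = np.linalg.solve(X.T @ X, X.T @ Y)            # no jitter
    Xc = np.repeat(X, c, axis=0)                     # c copies of each case
    T = np.repeat(Y, c) + rng.normal(0.0, s, size=c * n)  # jittered targets
    B_t = np.linalg.solve(Xc.T @ Xc, Xc.T @ T)
    print(np.max(np.abs(B - B_t)))                   # small; shrinks as c grows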

The ordinary least squares training criterion is (Y-XB)'(Y-XB).  Weight
decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B, where d is the
decay rate. Weight decay can also be implemented by inventing artificial
training cases. Augment the training data with p new training cases
containing the matrix dI for the inputs and a zero vector for the
targets. To put this in a formula, let's use A;B to indicate the matrix
A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D.  Thus the
augmented inputs are X;dI and the augmented targets are Y;0, where 0
indicates the zero vector of the appropriate size. The squared error for
the augmented training data is:

   ((Y;0)-(X;dI)B)'((Y;0)-(X;dI)B)
   = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
   = Y'Y - 2Y'XB + B'(X'X+d^2I)B
   = Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
   = (Y-XB)'(Y-XB)+d^2B'B

which is the weight-decay training criterion. Thus the weight-decay
estimator is:

    inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y

which is the same as the jitter estimator B(d^2), i.e. jitter with
variance d^2/n. The equivalence between the weight-decay estimator
and the jitter estimator does _not_ hold for nonlinear models.
However, the equivalence of the two estimators for linear models
suggests that they will often produce similar results even for
nonlinear models.
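
Here is the augmented-data trick in the same numerical sketch form
(d is the decay rate; the identity is exact, not just a large-c
approximation):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, d = 30, 4, 0.7
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=n)

    # augmented data: inputs X;dI, targets Y;0
    Xa = np.vstack([X, d * np.eye(p)])
    Ya = np.concatenate([Y, np.zeros(p)])
    B_aug = np.linalg.solve(Xa.T @ Xa, Xa.T @ Ya)

    # weight-decay (ridge) estimator inv(X'X+d^2I)X'Y
    B_wd = np.linalg.solve(X.T @ X + d * d * np.eye(p), X.T @ Y)
    print(np.allclose(B_aug, B_wd))    # True: the identity is exact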

B(0) is obviously the ordinary least-squares estimator. It can be shown
that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in other
words, adding jitter causes the weights to shrink. It can also be shown
that under the usual statistical assumptions, there always exists some
value of ns^2 > 0 such that B(ns^2) provides better expected
generalization than B(0). Unfortunately, there is no way to calculate a
value of ns^2 from the training data that is guaranteed to improve
generalization.  There are other types of shrinkage estimators called
Stein estimators that _do_ guarantee better generalization than B(0),
but I'm not aware of a nonlinear generalization of Stein estimators
applicable to neural networks.
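
The shrinkage is easy to see by computing the norm of the estimator
over a range of ridge values (sketch, arbitrary data):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 30, 4
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=n)

    for ridge in [0.0, 0.1, 1.0, 10.0, 100.0]:
        B = np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ Y)
        print(ridge, np.linalg.norm(B))   # the norm decreases monotonically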

The statistics literature describes numerous methods for choosing the
ridge value. The most obvious way is to estimate the generalization
error by cross-validation, generalized cross-validation, or
bootstrapping, and to choose the ridge value that yields the smallest
such estimate. There are also quicker methods, one of which yields the
following formula, useful as a first guess:

   s_1^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p) B(0)'B(0)]

You can iterate this a few times:

   s_{l+1}^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p) B(ns_l^2)'B(ns_l^2)]

It would be an interesting research project to see how well this works
with nonlinear neural nets.
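
For anyone who wants to try: a sketch of the first guess and the
iteration (I read the B(.) in the denominator as the jitter estimator
with ridge value ns_l^2, consistent with the notation above):

    import numpy as np

    def ridge_weights(X, Y, ridge):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ Y)

    def guess_jitter_variance(X, Y, n_iter=5):
        n, p = X.shape
        B0 = ridge_weights(X, Y, 0.0)             # ordinary least squares
        rss = (Y - X @ B0) @ (Y - X @ B0)         # residual sum of squares
        s2 = p * rss / (n * (n - p) * (B0 @ B0))  # first guess s_1^2
        for _ in range(n_iter):
            B = ridge_weights(X, Y, n * s2)       # weights with ridge value ns^2
            s2 = p * rss / (n * (n - p) * (B @ B))
        return s2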

I am getting too hungry to explain how jitter relates to kernel
regression tonight.

Reference:

Vinod, H.D., and Ullah, A. (1981), _Recent Advances in Regression
Methods_, New York: Marcel Dekker.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
