Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!udel!gatech!howland.reston.ans.net!news.sprintlink.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Outliers revisited
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <D6uFLJ.IHw@unx.sas.com>
Date: Mon, 10 Apr 1995 23:51:19 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Organization: SAS Institute Inc.
Lines: 162


A bug in my neural net macro caused me to mistake a local optimum for
a global one in my post from 30 Mar 95 on "Re: Robustness of MLP's to
outliers ???" Here is a corrected and slightly expanded version.

In article <3l6kgj$rs7@ionews.io.org>, Byron Bodo <bodo@io.org> writes:
|> Am about to take a stab at fitting an NN (simple MLP) for
|> a nonlinear hydrochem relation of complex functional form. I
|> know from robust regression fitting that both predictors x
|> and response y contain erroneous measurements.
|>
|> In my studies thus far, I've seen no mention of NN sensitivity
|> to outliers nor any discussion of robustness issues.
|>
|> How sensitive are MLP's and what strategies exist for
|> minimizing distortions from outliers in predictors & response
|> variables ? Any references would be much appreciated ?

First some terminology: an outlier is an extreme value of a target
(response) variable--extreme in the sense of being beyond the usual
range of noise.  For example, consider a situation where the noise has a
standard deviation of 10. If a target value differed from the correct
output value by 15 or 20, that would be well within the usual range of 2
or 3 standard deviations and would not be an outlier. If a target value
differed from the correct output value by 50 or 100, that would be well
beyond the usual range of noise and would definitely be an outlier.
There is no sharp cutoff for defining outliers. Outliers may be the
result of measurement error or clerical error, or they may just be weird
cases.
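The rule of thumb above can be stated in a few lines of code. This is only an illustration of the "2 or 3 standard deviations" idea, with made-up residuals; the post itself prescribes no particular cutoff.

```python
import numpy as np

# Illustration only: with noise sd = 10, residuals of 15-20 (1.5-2 sd)
# are ordinary noise, while residuals of 50-100 (5-10 sd) are outliers.
sigma = 10.0
residuals = np.array([15.0, -20.0, 50.0, -100.0, 3.0])

# A common rough rule: flag cases beyond 3 standard deviations.
is_outlier = np.abs(residuals) > 3 * sigma
print(is_outlier)
```

Remember that any such cutoff is arbitrary; as the post says, there is no sharp boundary between noise and outliers.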

An extreme value of an input (predictor) variable is usually called a
high-leverage point rather than an outlier. Leverage is a measure of the
potential influence of a training case on the estimates as a function of
the input values, not the target value. In linear models, leverage can
be computed directly; it is related to a type of distance between the
given input values and the mean of the inputs for the entire training
set (the Mahalanobis distance for those of you who have studied
multivariate statistics). In nonlinear models, leverage is more
complicated, as are most things in nonlinear models, and it depends on
the target values as well as the inputs.
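For the linear case the paragraph describes, leverage is just the diagonal of the "hat" matrix, and each diagonal element is a monotone function of the Mahalanobis distance of that case's inputs from the input mean. A small sketch with made-up data (the extreme input value 5.0 is chosen to stand out):

```python
import numpy as np

# Leverage in a linear model: diagonal of the hat matrix
# H = X (X'X)^{-1} X'.  h_ii grows with the Mahalanobis distance of
# case i's inputs from the mean of the inputs.  Data are made up.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 5.0])   # extreme input: 5.0
X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
print(leverage.round(3))
# The case with x = 5.0 gets by far the largest leverage.
```

The diagonal elements always sum to the number of estimated parameters (here 2), so a leverage near 1 means one case is dominating the fit at its own input value.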

An outlier will not cause much trouble if it is at a low-leverage point.
A high-leverage point will not cause much trouble if the target is not
an outlier. But an outlier at a high-leverage point can seriously
distort the estimates (weights) if you are using the usual least-squares
training criterion. There are various other training criteria that are
less sensitive to outliers and are said to be "robust" or "resistant"
(I am not going to give precise definitions of those terms!).  Robust
training criteria can be used fairly easily in NNs. For algorithms, see
Peter J. Huber (1981), _Robust Statistics_, NY: Wiley.
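Huber's criterion, the best-known robust training criterion, is quadratic for small residuals and linear beyond a tuning constant k, so the influence of a large residual is bounded. A minimal sketch (the default k = 1.345, calibrated for unit noise sd, is a standard choice from the robustness literature, not from the post):

```python
import numpy as np

# Huber's loss: quadratic inside [-k, k], linear outside, so gross
# errors contribute bounded gradient ("influence") instead of growing
# without limit as in least squares.
def huber_loss(residual, k=1.345):
    r = np.abs(residual)
    return np.where(r <= k, 0.5 * r**2, k * (r - 0.5 * k))

# Its derivative (the influence function) is simply the clipped residual:
def huber_grad(residual, k=1.345):
    return np.clip(residual, -k, k)
```

To use this in NN training, replace the squared-error term per case with huber_loss and backpropagate huber_grad instead of the raw residual.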

There are also training methods for linear models that are less
sensitive to high-leverage points. These are called bounded-influence
estimators. I don't know how to apply bounded-influence methods to NNs.

Since MLPs are nonlinear regression models, the issues of outliers and
robustness for MLPs are essentially the same as for other types of
nonlinear models. However, the flexibility of MLPs and especially their
ability to approximate discontinuities gives _all_ training cases the
potential to have high leverage!

Consider the data sets shown in the following two plots:

Figure 1: Training Data

   2 +            *                      2 +            *
     |                                     |
  Y1 |                                  Y2 |
     |                                     |
     |                                     |
   1 +  *                   *            1 +  *                   *
     |   *                 *               |   *                 *
     |    *               *                |    *               *
     |     **           **                 |     **           **
     |       **       **                   |       **       **
   0 +         *** ***                   0 +         *******
     ---+---------+---------+--            ---+---------+---------+--
       -1         0         1                -1         0         1

                  X                                     X

X is the input, Y1 and Y2 are targets. The two data sets are the same
except that the right one has an extra training case at (0,0).  Both
have outliers at (0,2). But since the left data set has only one case
with X=0, a neural net can fit that outlier exactly.  With the right
data set, it is impossible to get a perfect fit because of the two
different target values for X=0.
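The two data sets can be reconstructed approximately as Y = |X| on a grid of inputs, plus the outlier at (0, 2); the exact grid in the figures is not stated, so the spacing below is a guess:

```python
import numpy as np

# Approximate reconstruction of Figure 1: V-shaped data Y = |X|,
# plus an outlier at (0, 2).  The left set (x1, y1) has only the
# outlier at X = 0; the right set (x2, y2) adds a case at (0, 0).
grid = np.arange(-10, 11) / 10.0
grid = grid[grid != 0.0]                 # V-shaped cases, X != 0
x1 = np.append(grid, 0.0)                # left data: one case at X = 0
y1 = np.append(np.abs(grid), 2.0)        # ... and it is the outlier
x2 = np.append(x1, 0.0)                  # right data: extra case ...
y2 = np.append(y1, 0.0)                  # ... at (0, 0)
```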

If we train MLPs with 5 hidden units on each of the two data sets using
least squares, we get the following outputs:

Figure 2: Least Squares Estimates;  * is target, - is output

  Y1 |                                  Y2 |
   2 +            -                      2 +            *
     |                                     |
     |                                     |
     |                                     |
   1 +  -                   -            1 +  -         -         -
     |   --               --               |   --               --
     |     -             -                 |     -             -
     |      ---       ---                  |      ---       ---
   0 +         --- ---                   0 +         ---*---
     ---+---------+---------+--            ---+---------+---------+--
       -1         0         1                -1         0         1

                  X                                     X

For the left data set, the outputs coincide with the target values, so
all you see are dashes. For the right data set, the two target values
for X=0 are visible since the output passes halfway between them. But in
both cases, the outlier has a large effect on the output values for X=0.
Thus the outlier appears to be a high-leverage point even though it is
in the middle of the distribution of input values; this impression is
confirmed by computing the leverage of the outlier from the Jacobian
matrix.

If we train the MLPs via Huber's robust criterion we will probably get
the following:

Figure 3: Huber Estimates;  * is target, - is output

  Y1 |                                  Y2 |
   2 +            *                      2 +            *
     |                                     |
     |                                     |
     |                                     |
   1 +  -                   -            1 +  -                   -
     |   --               --               |   --               --
     |     -             -                 |     -             -
     |      ---       ---                  |      ---       ---
   0 +         -------                   0 +         -------
     ---+---------+---------+--            ---+---------+---------+--
       -1         0         1                -1         0         1

                  X                                     X

For both data sets, the outlier is ignored.  However, these are local
optima. The global optima are the same as for least-squares estimation.
The global optima are difficult to find from random initial values,
especially for the left data, but in this case the local optima are
probably what most people would prefer!  It is a characteristic of M
estimators in general that global optima are not necessarily preferable
to local optima, even in theory, since the estimators are defined as
stationary points of the loss function rather than as optima. Of course,
some stationary points are clearly not desirable, such as the saddle
point at the origin for MLPs.  Having a prior estimate of the noise
variance reduces the confusion about which stationary point to choose,
and it is generally sensible to prefer the global optimum when the noise
variance is fixed during training rather than estimated.
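Training with Huber's criterion amounts to nothing more than clipping the residual before backpropagating it. The following is a minimal sketch, not the author's code: a 1-5-1 MLP trained by plain gradient descent on data shaped like the left panel of Figure 1. The Huber constant k, learning rate, and number of steps are arbitrary choices, and which stationary point you reach depends on the random start, as the paragraph above explains.

```python
import numpy as np

# Sketch: MLP with 5 tanh hidden units trained on Huber's criterion.
# Data: Y = |X| on a grid, plus an outlier at (0, 2) (left set, Fig. 1).
rng = np.random.default_rng(0)

grid = np.arange(-10, 11) / 10.0
grid = grid[grid != 0.0]
X = np.append(grid, 0.0)[:, None]           # one case at X = 0 ...
Y = np.append(np.abs(grid), 2.0)[:, None]   # ... and it is the outlier

W1 = rng.normal(0.0, 1.0, (1, 5)); b1 = np.zeros(5)
W2 = rng.normal(0.0, 1.0, (5, 1)); b2 = np.zeros(1)
k, lr = 0.5, 0.05                           # arbitrary tuning choices

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

def huber_mean(r):
    a = np.abs(r)
    return float(np.where(a <= k, 0.5 * a**2, k * (a - 0.5 * k)).mean())

loss_start = huber_mean(forward(X)[1] - Y)
for step in range(5000):
    H, out = forward(X)
    dL = np.clip(out - Y, -k, k) / len(X)   # clipped influence function
    gW2 = H.T @ dL; gb2 = dL.sum(0)
    dH = (dL @ W2.T) * (1.0 - H**2)         # backprop through tanh
    gW1 = X.T @ dH; gb1 = dH.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
loss_end = huber_mean(forward(X)[1] - Y)

# At a local optimum like Figure 3, the fit at X = 0 stays near 0
# and the outlier is ignored; a different start may find the global
# optimum, which fits the outlier as in Figure 2.
print(loss_start, loss_end, float(forward(X)[1][-1]))
```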

There is another class of M estimators, called redescending estimators,
for which the plots in Figure 3 are in fact global optima. However, the
plots in Figure 2 are also global optima. Redescending estimators are
tricky because of this propensity for multiple global optima, but they
can be useful if you want outliers to be ignored completely. 
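The standard example of a redescending estimator is Tukey's biweight (bisquare): its influence function returns to exactly zero beyond a tuning constant c, which is why gross outliers can be ignored completely. A sketch, with the conventional default c = 4.685 (calibrated for unit noise sd; again from the robustness literature, not the post):

```python
import numpy as np

# Tukey's biweight (bisquare) loss and influence function.  Unlike
# Huber's clipped influence, this one redescends to exactly zero for
# |r| > c, so a gross outlier contributes no gradient at all.
def biweight_rho(r, c=4.685):
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c,
                    (c**2 / 6) * (1 - (1 - (r / c)**2)**3),
                    c**2 / 6)               # loss is flat beyond c

def biweight_psi(r, c=4.685):               # influence function
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r * (1 - (r / c)**2)**2, 0.0)
```

Because the loss is flat beyond c, the criterion is nonconvex even for linear models, which is exactly the source of the multiple global optima described above; a common practice is to start from a Huber (or least-squares) fit before switching to the redescending criterion.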

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
