Newsgroups: comp.ai.neural-nets
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Robustness of MLP's to outliers ???
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <D6A916.HvD@unx.sas.com>
Date: Fri, 31 Mar 1995 02:17:30 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References:  <3l6kgj$rs7@ionews.io.org>
Organization: SAS Institute Inc.
Lines: 140


In article <3l6kgj$rs7@ionews.io.org>, Byron Bodo <bodo@io.org> writes:
|> Am about to take a stab at fitting an NN (simple MLP) for
|> a nonlinear hydrochem relation of complex functional form. I
|> know from robust regression fitting that both predictors x
|> and response y contain erroneous measurements.
|>
|> In my studies thus far, I've seen no mention of NN sensitivity
|> to outliers nor any discussion of robustness issues.
|>
|> How sensitive are MLPs, and what strategies exist for
|> minimizing distortions from outliers in predictor & response
|> variables? Any references would be much appreciated.

First some terminology: an outlier is an extreme value of a target
(response) variable--extreme in the sense of being beyond the usual
range of noise.  For example, consider a situation where the noise has a
standard deviation of 10. If a target value differed from the correct
output value by 15 or 20, that would be well within the usual range of 2
or 3 standard deviations and would not be an outlier. If a target value
differed from the correct output value by 50 or 100, that would be well
beyond the usual range of noise and would definitely be an outlier.
There is no sharp cutoff for defining outliers. Outliers may be the
result of measurement error or clerical error, or they may just be weird
cases.
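The numbers above suggest a rough rule of thumb: flag residuals beyond,
say, 3 standard deviations. A minimal sketch (the residual values and the
3-sd cutoff are illustrative; as noted above, there is no sharp cutoff):

```python
import numpy as np

# Hypothetical residuals (target minus correct output) with
# noise sd = 10, echoing the example above: 15 and -20 are
# within the usual range of noise; 50 and -100 are not.
residuals = np.array([15.0, -20.0, 50.0, -100.0])
sd = 10.0

# A common (rough) convention: flag cases beyond 3 standard deviations.
is_outlier = np.abs(residuals) > 3 * sd
print(is_outlier)  # [False False  True  True]
```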

An extreme value of an input (predictor) variable is usually called a
high-leverage point rather than an outlier. Leverage is a measure of the
potential influence of a training case on the estimates as a function of
the input values, not the target value. In linear models, leverage can
be computed directly; it is related to a type of distance between the
given input values and the mean of the inputs for the entire training
set (the Mahalanobis distance for those of you who have studied
multivariate statistics). In nonlinear models, leverage is more
complicated, as are most things in nonlinear models.
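For a linear model, the leverage of each case can be read directly off the
diagonal of the "hat" matrix, the matrix that maps targets to fitted
values. A minimal NumPy sketch on made-up data (the data and the
intercept-plus-slope design are illustrative, not from any real problem):

```python
import numpy as np

# Hypothetical toy data: one input, ten training cases.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 1))

# Design matrix with an intercept column.
A = np.hstack([np.ones((10, 1)), X])

# Leverage of case i is the i-th diagonal element of the hat
# matrix H = A (A'A)^{-1} A', which maps targets to fitted values.
H = A @ np.linalg.inv(A.T @ A) @ A.T
leverage = np.diag(H)

# Leverages lie between 1/n and 1 and sum to the number of
# parameters; cases whose inputs are far from the mean input
# (large Mahalanobis distance) get the largest values.
print(leverage.sum())  # approximately 2 (intercept + slope)
```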

An outlier will not cause much trouble if it is at a low-leverage point.
A high-leverage point will not cause much trouble if the target is not
an outlier. But an outlier at a high-leverage point can seriously
distort the estimates (weights) if you are using the usual least-squares
training criterion. There are various other training criteria that
are less sensitive to outliers and are said to be "robust" or
"resistant" (I am not going to give precise definitions of those terms!).
Robust training criteria can be used fairly easily in NNs. For
algorithms, see Peter J. Huber (1981), _Robust Statistics_, NY: Wiley.
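Huber's criterion is simple to state: quadratic for small residuals,
linear beyond a cutoff, so large residuals are penalized far less than
under least squares. A minimal sketch (the function name and the cutoff
delta=1 are illustrative choices, not prescribed values):

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Huber's criterion: quadratic for |residual| <= delta,
    linear beyond, so outliers contribute less than under
    least squares. delta sets the crossover point."""
    r = np.abs(residual)
    quadratic = 0.5 * r**2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

# A residual of 0.5 is squared as usual; a residual of 10 grows
# only linearly: losses 0.125 and 9.5 (least squares would give 50).
print(huber_loss(np.array([0.5, 10.0])))
```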

There are also training methods for linear models that are less
sensitive to high-leverage points. These are called bounded-influence
estimators. I don't know how to apply bounded-influence methods to NNs.

Since MLPs are nonlinear regression models, the issues of outliers
and robustness for MLPs are essentially the same as for other types
of nonlinear models. However, the flexibility of MLPs and especially
their ability to approximate discontinuities gives _all_ training
cases the potential to have high leverage!

Consider the data sets shown in the following two plots:

   2 +            *                      2 +            *
     |                                     |
  Y1 |                                  Y2 |
     |                                     |
     |                                     |
   1 +  *                   *            1 +  *                   *
     |   *                 *               |   *                 *
     |    *               *                |    *               *
     |     **           **                 |     **           **
     |       **       **                   |       **       **
   0 +         *** ***                   0 +         *******
     ---+---------+---------+--            ---+---------+---------+--
       -1         0         1                -1         0         1
 
                  X                                     X

X is the input, Y1 and Y2 are targets. The two data sets are the same
except that the right one has an extra training case at (0,0).  Both
have outliers at (0,2). But since the left data set has only one case
with X=0, a neural net can fit that outlier exactly.  With the right
data set, it is impossible to get a perfect fit because of the two
different target values for X=0.
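With two different target values at X=0, no single output value can fit
both; under least squares the best compromise at that input is the mean of
the targets. A small numeric check (the grid search is purely
illustrative; the minimizer can of course be found analytically):

```python
import numpy as np

# Two training cases share X = 0 but disagree on the target:
# the outlier at (0, 2) and the extra case at (0, 0).
targets = np.array([2.0, 0.0])

# For a fixed input, the least-squares-optimal output minimizes
# the sum of squared residuals over the cases at that input.
candidates = np.linspace(-1.0, 3.0, 401)
sse = [((t - targets) ** 2).sum() for t in candidates]
best = candidates[int(np.argmin(sse))]
print(best)  # approximately 1.0, midway between the two targets
```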

If we train MLPs with 5 hidden units on each of the two data sets
using least-squares, we get the following outputs:

           Least Squares Estimates:  * is target, - is output

  Y1 |                                  Y2 |
   2 +            -                      2 +            *
     |                                     |
     |                                     |
     |                                     |
   1 +  -                   -            1 +  -         -         -
     |   --               --               |   --               --
     |     -             -                 |     -             -
     |      ---       ---                  |      ---       ---
   0 +         --- ---                   0 +         ---*---
     ---+---------+---------+--            ---+---------+---------+--
       -1         0         1                -1         0         1
 
                  X                                     X

For the left data set, the outputs coincide with the target values,
so all you see are dashes. For the right data set, the two target
values for X=0 are visible since the output passes halfway between
them. But in both cases, the outlier has a large effect on the
output values for X=0.

If we train the MLPs via Huber's robust criterion we might get the
following:

              Huber Estimates:  * is target, - is output

  Y1 |                                  Y2 |
   2 +            *                      2 +            *
     |                                     |
     |                                     |
     |                                     |
   1 +  -                   -            1 +  -                   -
     |   --               --               |   --               --
     |     -             -                 |     -             -
     |      ---       ---                  |      ---       ---
   0 +         -------                   0 +         -------
     ---+---------+---------+--            ---+---------+---------+--
       -1         0         1                -1         0         1
 
                  X                                     X

For both data sets, the outlier is ignored. The result for the right
data set is obtained reliably. However, for the left data set, there
are two distinct global optima: one is as shown; the other has the
same outputs as with the least-squares estimates! If we reduced the
number of hidden units to 2, the latter global optimum would be
eliminated, but the fit would be worse for the remaining global
optimum. I'm not sure what would happen with 3 hidden units.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
