Newsgroups: comp.ai.neural-nets
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: How to measure importance of inputs?
Message-ID: <DxpArM.4LI@unx.sas.com>
Date: Sat, 14 Sep 1996 02:26:10 GMT
Organization: SAS Institute Inc.


A month ago, this was a FAQ. Now that I've gotten the first draft
of an answer ready, nobody seems interested anymore. Oh, well, the
answer needs more work anyhow--suggestions are welcome, especially
references to the NN literature.


Measures of importance of inputs
================================

There is no single measure of the importance of each input that is
sufficient by itself to understand the workings of a nonlinear model
such as a feedforward neural network. Even in a linear model, it is
not generally possible to come up with a single number for the
importance of each input.


Linear models
-------------

A linear model is a feedforward NN with no hidden layer and an
identity output activation function. If there is one output Y and
three inputs X1, X2, and X3, the model is:

   Y = b + w1*X1 + w2*X2 + w3*X3 + noise

where b is the bias and w1, w2, and w3 are the connection weights.

For example, with these training data:

    Y   X1  X2   X3  
   ----------------
    7    1   2  500  
    3    2   1  100  
    6    3   4  200  
    9    4   3  600  
   12    5   6  300  
   15    6   5  700  
   18    7   8  800  
   14    8   7  400  

The weights learned by least squares are:

   Input  Weight
   -----  ------
   X1     0.506250
   X2     1.006250
   X3     0.008750
   bias  -0.243750
  
The mean squared training error is 0.6125.
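These weights can be checked with a short least-squares fit. The
following sketch (numpy, not part of the original analysis) reproduces
the table and the training MSE:

```python
import numpy as np

# Training data from the table above.
Y  = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
X3 = np.array([500, 100, 200, 600, 300, 700, 800, 400], dtype=float)

# Design matrix with a leading column of ones for the bias term.
X = np.column_stack([np.ones_like(Y), X1, X2, X3])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
bias, w1, w2, w3 = coef
mse = np.mean((Y - X @ coef) ** 2)   # mean squared training error
```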


Weights in linear models
------------------------

In linear models, the weights have a simple interpretation:  each
weight is the change in the output associated with a unit change in
the corresponding input, assuming all other inputs are held fixed.
Whether this interpretation is useful largely depends on whether one
input can in fact change independently of the other inputs. For
example, if the data are from an industrial process in which all the
inputs are controlled by an operator, and the operator can change
inputs independently of each other, then the interpretation of the
weights is directly applicable.  But if the inputs include
characteristics of raw materials that the operator cannot control and
that are not independent, the interpretation of the weights is of
questionable relevance.


Why comparing weights in linear models can be misleading
--------------------------------------------------------

Consider the linear model:

   Y = b + w1*X1 + w2*X2 + w3*X3 + noise

Suppose X1 is measured in meters, but you want to convert it to
millimeters. Since the conversion multiplies X1 by 1000, you have to
divide w1 by 1000. Similarly, if you want to convert X1 to kilometers,
you have to divide X1 by 1000 and multiply w1 by 1000. Thus the size
of w1 depends entirely on the units of measurement of X1.  Likewise,
the size of w2 depends entirely on the units of measurement of X2. So
unless X1 and X2 are measured in comparable units, the comparison of
w1 and w2 is meaningless.

For the data in the linear model example above, X3 has by far the
smallest weight. But X3 has much larger values, and a larger range of
values, than the other inputs. In this example, X1 and X2 were
measured in meters, while X3 was measured in centimeters.  If you
convert X3 to meters, the weights become:

   Input  Weights based on common units
   -----  -----------------------------
   X1     0.506250
   X2     1.006250
   X3     0.875000
   bias  -0.243750

Now X1 has the smallest weight.
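The effect of the unit change can be verified by refitting with the
rescaled input; dividing X3 by 100 multiplies its weight by 100 while
leaving the predictions and MSE unchanged. A sketch:

```python
import numpy as np

Y  = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
X3 = np.array([500, 100, 200, 600, 300, 700, 800, 400], dtype=float)

def fit(*cols):
    """Least-squares coefficients (bias first) for the given inputs."""
    X = np.column_stack([np.ones_like(Y)] + list(cols))
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef

_, _, _, w3_cm = fit(X1, X2, X3)        # X3 in centimeters
_, _, _, w3_m  = fit(X1, X2, X3 / 100)  # X3 converted to meters
```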


Why comparing standardized weights in linear models can be misleading
---------------------------------------------------------------------

If you want to try to interpret weights when the inputs are not
measured in comparable units, one thing you can do is standardize the
inputs, i.e., divide each input by its standard deviation.
Standardization also involves subtracting the mean, but that has no
effect on weights other than biases. When the inputs are standardized,
each input is measured in units of standard deviations.

You can train the network using standardized inputs (often a good
idea), or you can train the network on the raw inputs and then
multiply each input weight by the standard deviation of the input.
Either way, you get the same standardized weights (barring local
minima, convergence problems, etc.).  In linear models, it is also
customary to standardize targets, but that does not affect comparisons
of input weights--for the purposes of this discussion, the crucial
thing is to standardize the inputs.
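For the numeric example above, standardized weights can be obtained by
multiplying each raw weight by the standard deviation of its input.
A sketch (using the population standard deviation; a divisor of n-1
changes only the common scale, not the comparison):

```python
import numpy as np

X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
X3 = np.array([500, 100, 200, 600, 300, 700, 800, 400], dtype=float)

# Raw weights from the linear model example above.
raw = {"X1": 0.50625, "X2": 1.00625, "X3": 0.00875}
sd  = {"X1": X1.std(), "X2": X2.std(), "X3": X3.std()}

standardized = {name: raw[name] * sd[name] for name in raw}
```

For these data the standardized weights rank X2 highest, illustrating
that raw, unit-converted, and standardized weights can each rank the
inputs differently.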

Standardized weights can be compared meaningfully if the standard
deviations are meaningful. For the standard deviations to be
meaningful, it is usually necessary for the input cases to be a
representative sample from the set of all cases you want to be able to
generalize to.

For example, suppose you want to use a linear model to predict job
performance ratings for people who apply for jobs at your company.
There are two inputs:

 * score on a job-skills test,
 * grade point average (GPA) in school.

If you hire everybody who applies for a job and use a representative
sample of these people for the training cases, then the standard
deviations of the two inputs are meaningful descriptors of the pool of
job applicants. Thus it is legitimate to compare standardized input
weights. But if you hire only people whose test score exceeds some
cutoff score, while ignoring GPA, the standard deviation of test
scores in your training set will be artificially reduced. The higher
the cutoff score, the smaller the standard deviation of the test
scores in the training set. The smaller the standard deviation, the
smaller the standardized weight for test score. Which input has the
higher standardized weight may depend more on your choice of cutoff
score than it does on the importance of the inputs.

If job performance really is linearly related to test score and GPA,
changing the cutoff score will not affect the true raw
(nonstandardized) weights. In fact, the true raw weights will not be
affected by any method of selecting cases based solely on the inputs,
as long as the distribution of the inputs is nonsingular.


Why comparing changes in the error function in linear models can be misleading
------------------------------------------------------------------------------

Another way to measure the importance of an input is to omit it from
the model, retrain the model, and see how much the error function
increases.  The change in the error function is a direct measure of
the usefulness of an input in making predictions, but this measure can
be misleading when the inputs are correlated.  For the data in the
linear model example above, the change in MSE produced by omitting
each input is shown in the following table:

   Omitted  Change 
   Input    in MSE
   -----    -------
   X1       0.24051
   X2       0.95018
   X3       3.06250

X1 and X2 appear much less important than X3 because X1 and X2 are
strongly correlated with each other, as shown in the following
correlation matrix:

            X1       X2       X3
   X1  1.00000  0.90476  0.47619
   X2  0.90476  1.00000  0.47619
   X3  0.47619  0.47619  1.00000

If we omit X1 from the model and retrain, X2's weight increases to
compensate. If we omit X2 from the model and retrain, X1's weight
increases to compensate.  If we omit X3 from the model, there is no
other highly correlated input to compensate. Thus, X3 is more
important for prediction than either X1 or X2 considered individually.

However, it would be incorrect to conclude that X1 and X2 are jointly
unimportant.  If both X1 and X2 are omitted from the model, the MSE
increases by 8.77738, which is much greater than the sum of the
increases (1.19069) resulting from omitting each input individually.
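All of these changes in MSE, including the joint omission, can be
reproduced by refitting the model without the relevant inputs; a
numpy sketch:

```python
import numpy as np

Y  = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
X3 = np.array([500, 100, 200, 600, 300, 700, 800, 400], dtype=float)

def mse(*cols):
    """Least-squares MSE of a model using the given input columns."""
    X = np.column_stack([np.ones_like(Y)] + list(cols))
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.mean((Y - X @ coef) ** 2)

full = mse(X1, X2, X3)
d1   = mse(X2, X3) - full   # omit X1 and retrain
d2   = mse(X1, X3) - full   # omit X2 and retrain
d3   = mse(X1, X2) - full   # omit X3 and retrain
d12  = mse(X3) - full       # omit X1 and X2 jointly
```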

If the inputs are uncorrelated, the change in MSE produced by omitting
an input is proportional to the square of that input's standardized
weight, regardless of which inputs are included in the model. It is
only with uncorrelated inputs that the change in MSE is an unambiguous
measure of importance.


MLP example: An additive function
---------------------------------

Neural networks such as MLPs are capable of fitting complicated
nonlinear functions. However, many of the issues involving importance
of inputs can be illustrated with relatively simple functions. This
section will describe a simple, nonlinear, noise-free, additive
function of three inputs to be used as a running example.  Unless
otherwise noted, the inputs will be assumed to be statistically
independent to further simplify the measurement of importance.

Consider an MLP with output Y, three inputs (X1, X2, X3), and a single
hidden layer with five tanh units (H1 through H5), with weights as
given in the following table:

              To:  H1   H2     H3   H4    H5     Y
   From: Bias      25   25    150  150     0   -0.1
         X1       100 -100      0    0     0  
         X2         0    0    100 -100     0  
         X3         0    0      0    0     1  
         H1                                     0.1
         H2                                     0.1
         H3                                     0.1
         H4                                     0.1
         H5                                     1.0

The output function can be written as follows:

   Y = .1*tanh(100*(X1+.25))-.1*tanh(100*(X1-.25))-.1
     + .1*tanh(100*(X2+1.5))-.1*tanh(100*(X2-1.5))   
     +    tanh(X3)                                   

The output Y is the sum of three functions written on the three lines
above. Each of these three functions depends on only one of the
inputs. A model such as this in which the output is the sum of
univariate (nonlinear) transformations of the inputs is called an
"additive" model. It is easier to assess the importance of inputs in
an additive model than in the general case because additivity implies
that the effect of one input does not depend on the values of the
other inputs. Thus, to understand the properties of the output
function, we can consider the inputs one at a time, instead of having
to visualize a 3-D nonlinear manifold in a 4-D space. Assuming each
input is distributed over the interval [-3,3], the three additive
functions appear as in the following plot (to the limited resolution
of a plain-text file):

 1.0 +                                                    3333333333333
     |                                              333333
     |                                           333
     |                                        333
     |                                       33
 0.5 +                                     33
     |                                    33
     |                                   33
     |                 2222222222222222222222222222222 
     |                2             11111             2
 0.0 +22222222222222222            1  3  1            22222222222222222
     |111111111111111111111111111111 3   111111111111111111111111111111
     |                             33
     |                            33
     |                           33
-0.5 +                          33
     |                        33
     |                      333
     |                   333
     |             333333
-1.0 +3333333333333
     |
     -+-------+-------+-------+-------+-------+-------+-------+-------+-
    -3.00   -2.25   -1.50   -0.75   0.00    0.75    1.50    2.25    3.00

The output has two abrupt changes in response to X1, corresponding to
two large weights. The positions of these abrupt changes are close
together, so except for a narrow interval containing these two abrupt
changes, the output does not depend on X1 at all.  The output also has
two abrupt changes in response to X2, again corresponding to two large
weights. The positions of these abrupt changes for X2 are farther
apart than for X1, so it is clear that X1 and X2 have different
effects on the output, and for most practical purposes, X2 would be
considered more important than X1. It is also obvious that X3 is
associated with much larger changes in the output than either X1 or
X2, and for most practical purposes, X3 would be considered the most
important input.
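The additive example function is easy to evaluate directly; a sketch:

```python
import numpy as np

def additive_f(x1, x2, x3):
    """The additive example function defined by the MLP weights above."""
    return (0.1 * np.tanh(100 * (x1 + 0.25))
            - 0.1 * np.tanh(100 * (x1 - 0.25)) - 0.1
            + 0.1 * np.tanh(100 * (x2 + 1.5))
            - 0.1 * np.tanh(100 * (x2 - 1.5))
            + np.tanh(x3))
```

Note that at the centroid (0,0,0) the output is 0.3, since the X1 and
X2 components are both at the tops of their bumps there.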


Why comparing weights in MLPs can be misleading
-----------------------------------------------

In MLPs, raw input-to-hidden weights depend on the units of
measurement of the inputs, just as in linear models. And standardized
input-to-hidden weights depend on the selection of training cases,
just as in linear models. But comparing weights in MLPs is even more
problematic than comparing weights in linear models. This difficulty
arises because the simple interpretation of weights for linear
models does not apply to MLPs due to the hidden layer(s).

A huge input-to-hidden weight does not necessarily mean that the input
has a huge effect on the output, since the "squashing" functions of
the hidden units limit that effect. Huge input-to-hidden weights
usually indicate abrupt changes in the output, as would occur if the
network were trying to approximate a discontinuity. But the size of
the weight is related primarily to the abruptness of the change, not
to the size of the change.

A tiny input-to-hidden weight does not necessarily mean that the input
has a tiny effect on the output, since that effect can be amplified by
the hidden-to-output weights. In fact it is quite common to have tiny
input-to-hidden weights and huge hidden-to-output weights; some
reasons for this are explained by Cardell, Joerding, and Li (1994).

The main advantage of raw weights over standardized weights in linear
models is that the true raw weights (i.e. those that give the best
possible generalization) do not depend on what region of the input
space you want to generalize to, as long as that region is
nonsingular. This invariance results from the fact that every point on
a plane has the same slope and therefore the same weights apply. But
when you are fitting an MLP to a nonlinear surface, different hidden
units may be important in different regions of the input space.
Consider a surface produced by the formula:

   Y = tanh(X1) + tanh(X2)

If you consider only cases with X1 > 3, an MLP with one hidden unit
depending only on X2 will generalize very well.  If you consider only
cases with X2 > 3, an MLP with one hidden unit depending only on X1
will generalize very well. The weights for these two MLPs are
completely different.  For MLPs, the weights that give the best
generalization can depend on what region of the input space you want
to generalize to.
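The region dependence is easy to check numerically: for X1 > 3,
tanh(X1) is within 0.005 of 1, so over that region the surface is
essentially a function of X2 alone. A sketch:

```python
import numpy as np

def f(x1, x2):
    return np.tanh(x1) + np.tanh(x2)

# Over the region X1 > 3, compare f with the single-input model
# 1 + tanh(X2), which depends only on X2.
g1, g2 = np.meshgrid(np.linspace(3, 10, 100), np.linspace(-3, 3, 100))
max_dev = np.max(np.abs(f(g1, g2) - (1 + np.tanh(g2))))
```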

For the additive function example, the sum of the absolute input
weights (squared weights would produce similar results) is shown for
each input:

   Input   Sum of absolute input weights    
   -----   -----------------------------
   X1      200
   X2      200
   X3        1 

Thus, the weights suggest two incorrect conclusions:
 * X1 and X2 are equally important.
 * X3 is much less important than X1 or X2.


Why partial derivatives are more interpretable than weights
-----------------------------------------------------------

If the weights in an MLP cannot be interpreted like the weights in a
linear model, is there something that can be so interpreted?  To some
degree, yes: the gradient of the output with respect to the inputs.
Note that this gradient is not the gradient that is used for training
(that gradient is taken with respect to the weights), but it can be
computed in a manner similar to the usual backpropagation algorithm.

The gradient is a vector of partial derivatives. Each partial
derivative, by definition, gives the local rate of change of the
output with respect to the corresponding input, holding the other
inputs fixed.  Thus a partial derivative has the same interpretation
as a weight in a linear model, except for one crucial difference: a
weight in a linear model applies to the entire input space, but a
partial derivative applies only to a small neighborhood of the input
point at which it is computed.


Why partial derivatives at a few points can be misleading
---------------------------------------------------------

It is tempting to compute the partial derivatives at one "typical"
point in the input space, such as the centroid, and assume that those
derivatives are typical of the entire input space, but this assumption
is dangerously false. If the partial derivatives are constant over the
input space, then the output function is linear. If you are using a
nonlinear neural network, presumably you think it is possible for the
output function to have important nonlinearities. If the output
function has important nonlinearities, then there will be important
variation of the partial derivatives over the input space.

It can also be dangerously misleading to look at the partial
derivatives at only a few points in the input space. One method that
has been proposed is to vary each input in turn while all the other
inputs are fixed at their mean values.  But this method can overlook
important variation of the partial derivatives. For example, consider
a continuous version of the XOR data:

  Y = X1 + X2 - 2*X1*X2 

where X1 and X2 vary uniformly over [0,1]. Then the mean of each input
is .5, and if you fix one input to .5, you will find that the output
is a constant regardless of the value of the other input.  In other
words, if either input is fixed at its mean value, the partial
derivative with respect to the other input is zero.
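This is easy to verify for the continuous XOR function; freezing X1 at
0.5 makes the output identically 0.5, so the partial derivative with
respect to X2 vanishes everywhere on that slice (sketch):

```python
import numpy as np

def xor_f(x1, x2):
    return x1 + x2 - 2 * x1 * x2

x = np.linspace(0, 1, 101)
out = xor_f(0.5, x)                       # X1 frozen at its mean
h = 1e-6                                  # central-difference step
deriv = (xor_f(0.5, x + h) - xor_f(0.5, x - h)) / (2 * h)
```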

For the additive function example, the partial derivative with respect
to each input is shown at the mean of the inputs:

   Input  Partial derivative at mean
   -----  --------------------------
   X1     0
   X2     0
   X3     1

Thus, the partial derivatives at the mean suggest the incorrect
conclusion that X1 and X2 are completely unimportant.
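The table above can be reproduced with a central-difference gradient
with respect to the inputs (a sketch; the function definition repeats
the additive example):

```python
import numpy as np

def additive_f(x1, x2, x3):
    return (0.1 * np.tanh(100 * (x1 + 0.25))
            - 0.1 * np.tanh(100 * (x1 - 0.25)) - 0.1
            + 0.1 * np.tanh(100 * (x2 + 1.5))
            - 0.1 * np.tanh(100 * (x2 - 1.5))
            + np.tanh(x3))

def input_gradient(f, point, h=1e-3):
    """Central-difference estimate of the gradient of f at `point`."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (f(*(point + step)) - f(*(point - step))) / (2 * h)
    return grad

grad_at_mean = input_gradient(additive_f, [0.0, 0.0, 0.0])
```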


Why average partial derivatives over the input space can be misleading
----------------------------------------------------------------------

Partial derivatives are interpretable, but you have to evaluate them
at a large, representative sample of points from the input space.  The
next question is how to reduce this large collection of numbers to a
single measure of importance for each input. One obvious way to
summarize the partial derivatives is to report an average value (mean,
median, etc.). But the partial derivatives for a given input may take
both large positive and large negative values, producing an average
near zero. So average partial derivatives are useful but not
sufficient for measuring importance of inputs.

For the additive function example, the average partial derivatives with
respect to each input are shown:

   Input  Average partial derivative
   -----  --------------------------
   X1     0
   X2     0
   X3     0.33

Thus, the average partial derivatives suggest the incorrect conclusion
that X1 and X2 are completely unimportant.
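For the additive example, these averages can be computed from the
analytic partial derivatives on a fine grid. Since the derivative of
.1*tanh(100*u) is 10*(1 - tanh(100*u)^2), the positive and negative
spikes for X1 and X2 cancel exactly, while the X3 average comes out
near 0.33 (sketch):

```python
import numpy as np

x = np.linspace(-3, 3, 600001)      # fine grid over the input range

def dtanh(u):
    """Derivative of tanh(u)."""
    return 1 - np.tanh(u) ** 2

# Partial derivatives of the three additive components.
d1 = 10 * dtanh(100 * (x + 0.25)) - 10 * dtanh(100 * (x - 0.25))
d2 = 10 * dtanh(100 * (x + 1.5)) - 10 * dtanh(100 * (x - 1.5))
d3 = dtanh(x)

averages = (d1.mean(), d2.mean(), d3.mean())
```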


Why the average absolute (or squared) partial derivative can be misleading
--------------------------------------------------------------------------

To allow for both positive and negative partial derivatives, you can
compute the average of the absolute values or squares. This gives you
a better measure of importance than the average of the signed values.
But the importance of an input depends not only on the size of the
partial derivatives, but on the location of points in the input space
with large partial derivatives. In fact, it is sometimes impossible to
tell which of two inputs is more important even by looking at the
complete frequency distribution of the partial derivatives, as is
shown by the following example.

In the additive function example, X1 and X2 have the same mean partial
derivative. They also have the same mean absolute derivative and the
same mean squared derivative. In fact, the partial derivatives for X1
and X2 have exactly the same frequency distribution, so it is
impossible to tell which one is more important based on the partial
derivatives alone.  The average absolute partial derivatives with
respect to each input are shown:

   Input  Average absolute partial derivative
   -----  -----------------------------------
   X1     0.5
   X2     0.5
   X3     0.33

Thus, the average absolute partial derivatives suggest two incorrect
conclusions:
 * X1 and X2 are equally important
 * X1 and X2 are more important than X3
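The claim that X1 and X2 have identical derivative distributions can
be checked by sorting the derivative values on a grid aligned with
both pairs of step locations (a sketch; a spacing of 0.001 divides
both 0.25 and 1.5 evenly, so the two pairs of derivative bumps are
sampled identically):

```python
import numpy as np

x = np.linspace(-3, 3, 6001)    # spacing 0.001 hits both +-0.25 and +-1.5

def dtanh(u):
    """Derivative of tanh(u)."""
    return 1 - np.tanh(u) ** 2

d1 = 10 * dtanh(100 * (x + 0.25)) - 10 * dtanh(100 * (x - 0.25))
d2 = 10 * dtanh(100 * (x + 1.5)) - 10 * dtanh(100 * (x - 1.5))

# The sorted values agree, so the two inputs have the same frequency
# distribution of partial derivatives on this grid.
same_distribution = np.allclose(np.sort(d1), np.sort(d2), atol=1e-6)
```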


Why differences can be more informative than derivatives
--------------------------------------------------------

Since the partial derivative of the output with respect to each input
provides only local information, it might be better to look at the
change in the output over an interval. For example, to assess the
importance of X1 given an output function Y = f( X1, X2, X3), you
could compute:

   D1 = f( X1+h, X2, X3) - f( X1, X2, X3)

for a large, representative sample of input points, and then take the
average absolute value or square of D1. But how do you choose h?  If
the output function is periodic, such as Y = sin(X1) + sin(2*X2) +
sin(3*X3), D1 will be zero when h is a multiple of the period, but
large when h is an odd multiple of half the period. So the safest
thing to do is to look at a range of h values. The following plot
shows the mean absolute difference in the output as a function of
h for the three inputs in the additive function used in the example
above:

  2.0 +                                                    3    3    3
      |                                          3    3
      |
      |                                     3
      |
  1.5 +                                3
      |
      |
      |                           3
      |
  1.0 +
      |                      3
      |
      |                 3
      |
  0.5 +
      |            3
      |
      |       3                   2    2    2
      |            2    2    2    1              2
  0.0 +  1    1    1    1    1         1    1    1    1    1    1    1
      |
      ---+----+----+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0

The order of importance of the inputs is evident in this plot, but in
general there remains the problem of how to reduce the information to a
single value for each input. Presumably this would involve averaging
over h, but there are many different ways to do the averaging.  When the
inputs have different units of measurement and different distributions,
it is not obvious how to select appropriate values for h. Suppose X1 is
uniformly distributed on [0,1] and X2 is a binary variable with values
in {0,1} with equal probability. For X2, the only sensible value of h is
1, but for X1 you would want to use various values of h intermediate
between 0 and 1. How do you average over h in a way that fairly
represents X1 and X2?
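A plot like the one above can be approximated by averaging |D1| over a
grid, with both x and x+h restricted to the input range. The sketch
below asserts only the ordering of the three inputs, since the exact
heights depend on the sampling scheme:

```python
import numpy as np

# The three additive components of the example function.
def f1(x):
    return 0.1 * np.tanh(100 * (x + 0.25)) - 0.1 * np.tanh(100 * (x - 0.25)) - 0.1

def f2(x):
    return 0.1 * np.tanh(100 * (x + 1.5)) - 0.1 * np.tanh(100 * (x - 1.5))

def f3(x):
    return np.tanh(x)

def mean_abs_diff(f, h, lo=-3.0, hi=3.0, n=100001):
    """Mean |f(x+h) - f(x)| over x with both x and x+h in [lo, hi]."""
    x = np.linspace(lo, hi - h, n)
    return np.mean(np.abs(f(x + h) - f(x)))

curves = {h: tuple(mean_abs_diff(f, h) for f in (f1, f2, f3))
          for h in (1.0, 2.0, 3.0)}
```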

For the additive function example, if absolute differences are averaged
over all pairs of input values, the following results are obtained:

   Input  Average absolute difference
   -----  ---------------------------
   X1     .030 
   X2     .099 
   X3     .916

Thus, this type of average absolute difference correctly indicates
the order of importance of the inputs.


Change in the error function when an input is removed
-----------------------------------------------------

Another measure of the importance of an input is the change in the
error function when the input is removed from the network.  It is
important to retrain the network after removing the input.

Simply deleting an input unit from the network without retraining is
equivalent to freezing the value of that input to zero for all cases.
If zero is not a reasonable value for that input, the network outputs
are likely to be nonsense.

Instead of freezing the input to a constant value of zero, it would be
better to freeze it to a typical value such as the mean of that input.
But you may not get a typical output value from a typical input value.
In the example above with an additive output function, the mean value
of X1 produces unusually high values of the output, and freezing X1 to
its mean value increases the RMSE by 0.19. The mean value of X2 also
produces high outputs, but not as unusually high as for X1, so
freezing X2 to its mean value increases the RMSE by only 0.16. Thus X1
appears more important than X2.

For the additive function example, the changes in RMSE, with and
without retraining, produced by omitting each input, are as
follows:

           ........ Change in RMSE ........
   Input   No retraining    With retraining
   -----   -------------    ---------------
   X1      .190             .057 
   X2      .140             .100 
   X3      .820             .808

Thus, the change in RMSE with no retraining incorrectly suggests that
X1 is more important than X2. But the change in RMSE with retraining
yields the correct order of importance.


Dependent inputs
----------------

In the nonlinear examples considered up to this point, the inputs have
been statistically independent by construction. If the inputs are
statistically dependent, it is even more difficult to measure the
importance of inputs, because the effects of different inputs cannot
generally be separated. The problem is essentially the same as the
problem with correlated inputs in linear models, except that linear
correlation is not an adequate indicator of statistical dependence of
the inputs for nonlinear models.

In the additive function example, if the values of X1 and X3 are
restricted to differ by no more than 0.2 (causing those inputs to be
highly correlated), the changes in RMSE produced by omitting each
input are as follows:

           ........ Change in RMSE ........
   Input   No retraining    With retraining
   -----   -------------    ---------------
   X1      .191             .028 
   X2      .143             .100 
   X3      .799             .061 

The results with no retraining are essentially the same as with
independent inputs. But with retraining, both X1 and X3 appear
less important, especially X3, which by this measure is now
less important than X2.


Noisy data
----------

For noisy data, all of the measures of importance of inputs are
subject to sampling variation. Except for raw weights in linear
models, it is difficult to estimate the amount of sampling variation
(e.g., the standard errors of the importance measures).  One possible
way to assess the variability of importance measures is bootstrapping
as illustrated by Baxt and White (1995).
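For illustration, here is a minimal sketch of the bootstrap idea
applied to the small linear example from the beginning of this post.
Baxt and White apply it to a trained network; refitting a linear model
keeps the sketch fast, but the idea is the same--resample cases with
replacement, refit, and use the spread of the refitted importance
measures (here, the weights) to gauge sampling variability:

```python
import numpy as np

rng = np.random.default_rng(0)

Y = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X = np.column_stack([
    np.ones(8),
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 1, 4, 3, 6, 5, 8, 7],
    [500, 100, 200, 600, 300, 700, 800, 400],
])

boot = []
for _ in range(1000):
    idx = rng.integers(0, len(Y), size=len(Y))   # resample cases with replacement
    coef, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    boot.append(coef)
boot = np.asarray(boot)

# Percentile intervals for (bias, w1, w2, w3) describe the sampling
# variability of each weight.
lo, hi = np.percentile(boot, [5, 95], axis=0)
```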


References
----------

Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals
for clinical input variable effects in a network trained to identify
the presence of acute myocardial infarction", Neural Computation, 7,
624-638.

Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why Some Feedforward
Networks Cannot Learn Some Polynomials," Neural Computation, 6,
761-766.


-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
