Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!godot.cc.duq.edu!newsgate.duke.edu!news.mathworks.com!newsfeed.internetmci.com!in1.uu.net!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Covariances in Backprop networks
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <Ds14sn.1Gp@unx.sas.com>
Date: Sun, 26 May 1996 20:26:47 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <4o28oi$erm@newsbf02.news.aol.com> <4o2ifk$cba@delphi.cs.ucla.edu>
Organization: SAS Institute Inc.
Lines: 90


In article <4o2ifk$cba@delphi.cs.ucla.edu>, edwin@cs.ucla.edu (E. Robert Tisdale) writes:
|> rafizaman@aol.com (RafiZaman) writes:
|> 
|> >I am estimating a backprop multi-layer perceptron model using 15 input nodes,
|> >5 nodes in hidden layer and one output node.
|> 
|> >Does the notion of covariance hold for neural nets? 
|> >If so is there a way to extract this 'covariance'
|> >between the weights or parameters of the neural net connections?
|> 
|> The biases and connection weights in a multi-layer perceptron
|> are supposed to be *constant* parameters.
|> They don't vary so the idea of 'covariance' doesn't make any sense.

If you assume there exists some unknown "true" or "optimal" network,
then the weights in that network could be considered constant. But I
take it that RafiZaman is referring to the learned weights, which do
indeed vary, in several distinct senses.

In frequentist statistical theory, parameters (the "true" or "optimal"
weights) are taken to be constant, but the estimates (the learned
weights) vary depending on the training set. Statistical theory is
largely concerned with the way in which these estimates vary over all
possible training sets. The mechanism for selecting a training set
(usually some form of random sampling) generates a probability
distribution over the possible training sets, from which one can (in
principle if not always in practice) compute a probability distribution
for the learned weights. This latter distribution is called the
_sampling distribution_ of the weights. The sampling distribution
provides the foundation for frequentist statistical inference. The
covariance matrix of the sampling distribution is one of the most
important tools for describing sampling distributions, since under
rather broad regularity conditions, sampling distributions approach
multivariate normality as the amount of training data goes to infinity.
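A toy illustration (mine, not part of the original argument): for a
simple linear model one can simulate the sampling distribution directly
by drawing many training sets from the same data-generating process,
re-estimating the weights on each, and taking the empirical covariance
of the estimates. The names and settings below are arbitrary choices
for the sketch.

```python
# Sketch: approximating the sampling distribution of learned weights
# for a toy linear model by refitting on many simulated training sets.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])       # the constant "true" parameters
n, n_sets = 50, 2000                 # training-set size, number of sets

estimates = np.empty((n_sets, 2))
for i in range(n_sets):
    X = rng.normal(size=(n, 2))      # a fresh random training set
    y = X @ true_w + rng.normal(scale=0.5, size=n)
    estimates[i], *_ = np.linalg.lstsq(X, y, rcond=None)

# Empirical covariance matrix of the sampling distribution.
cov = np.cov(estimates, rowvar=False)
print(cov)
```

The learned weights scatter around the true values, and `cov` estimates
the covariance matrix of their sampling distribution; for a nonlinear
network the same idea applies in principle, though the distribution
need not be anywhere near normal.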

Several problems arise in applying the concept of sampling distribution
to neural nets. Most problematic is the fact that the estimates of the
weights often are not uniquely determined because of mathematical
indeterminacies arising from symmetry or singularities. These
indeterminacies exist regardless of the training algorithm, although
certain methods such as weight decay may partially alleviate them. In
addition, particular training methods may not yield unique values for
the learned weights. For example, the results of stopped training may
depend on a random division of the data into training and validation
sets and on the random initial weights. Furthermore, estimates of the
learned weights may be unbounded (as in logistic regression on linearly
separable data). So not only do the usual regularity conditions fail for
many neural networks, but the whole concept of sampling distribution may
be ill-defined.
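The symmetry indeterminacy is easy to demonstrate (this sketch is mine,
using the 15-5-1 architecture mentioned above): permuting the hidden
units of an MLP changes the weight vector but leaves the network
function untouched, so the "learned weights" are not unique even for a
perfectly trained network.

```python
# Sketch: hidden-unit permutation symmetry makes MLP weights non-unique.
# Swapping two hidden units (rows of W1, entries of b1, columns of W2)
# yields different weights but an identical input-output mapping.
import numpy as np

def mlp(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)          # hidden layer (5 units)
    return W2 @ h + b2                # linear output

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 15)); b1 = rng.normal(size=5)
W2 = rng.normal(size=(1, 5));  b2 = rng.normal(size=1)
x  = rng.normal(size=15)

perm = [1, 0, 2, 3, 4]                # swap hidden units 0 and 1
out1 = mlp(x, W1, b1, W2, b2)
out2 = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)

print(np.allclose(out1, out2))        # same function, different weights
```

With 5 hidden units there are 5! = 120 such permutations (and sign
flips of tanh units double that again), each a distinct point in weight
space computing the identical function.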

In practice, the actual weights may be of little interest; it is usually
the predicted values (outputs) that are most important. Many of the
indeterminacies that apply to weights disappear when you look at
predictions instead. Still, the regularity conditions for multivariate
normality do not generally hold, and if the sampling distribution is not
at least approximately normal, the covariance matrix of the sampling
distribution is not terribly useful.

Any textbook on nonlinear regression tells how to estimate the
covariance matrix of the sampling distribution when the regularity
conditions hold, for example:

   Bates, D.M. and Watts, D.G. (1988), Nonlinear Regression Analysis
   and Its Applications, New York: Wiley.

   Gallant, A.R. (1987), Nonlinear Statistical Models, New York: Wiley.

   Seber, G.A.F. and Wild, C.J. (1989), Nonlinear Regression, New
   York: Wiley.

For situations where the regularity conditions do not hold,
bootstrapping is probably the best bet. See, for example:

   Dixon, P.M. (1993), "The bootstrap and the jackknife:
   Describing the precision of ecological indices," in Scheiner,
   S.M. and Gurevitch, J., eds., Design and Analysis of Ecological
   Experiments, New York: Chapman & Hall, pp. 290-318.
   
   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the
   Bootstrap, New York: Chapman & Hall.
   
   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods,
   London: Chapman & Hall.
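As a concrete sketch of the bootstrap idea (mine, with a simple linear
model standing in for the network to keep it short): refit on resampled
training data and collect the predictions at fixed test inputs. The
empirical covariance of those predictions estimates their sampling
covariance without any normality assumption.

```python
# Sketch: bootstrapping the covariance of predicted values.  Rows of
# the training data are resampled with replacement, the model is refit,
# and predictions at fixed test points are collected across replicates.
import numpy as np

rng = np.random.default_rng(2)
n = 80
X = rng.normal(size=(n, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=n)
X_test = rng.normal(size=(4, 3))      # fixed inputs to predict at

n_boot = 1000
preds = np.empty((n_boot, 4))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds[b] = X_test @ w

# Bootstrap estimate of the covariance of the predictions.
cov_pred = np.cov(preds, rowvar=False)
print(cov_pred.shape)
```

For a network, the refit inside the loop would be a full training run
per bootstrap sample, which is computationally expensive but sidesteps
the failed regularity conditions discussed above.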


-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
