Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!cornellcs!newsstand.cit.cornell.edu!portc01.blue.aol.com!chi-news.cic.net!cs.utexas.edu!news.sprintlink.net!news-stk-200.sprintlink.net!news.sprintlink.net!news-stk-11.sprintlink.net!interpath!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: [Q] Projection-Method like kohonen-net
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DvFB2v.M90@unx.sas.com>
Date: Wed, 31 Jul 1996 19:49:43 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <837711958.10296.0@chmqst.demon.co.uk> <4seg2h$4dp@llnews.ll.mit.edu> <31EFEBCE@dibe.unige.it> <4tjt4i$sim@llnews.ll.mit.edu>
Organization: SAS Institute Inc.
Lines: 100


In article <4tjt4i$sim@llnews.ll.mit.edu>, heath@ll.mit.edu (Greg Heath) writes:
|> In article <837711958.10296.0@chmqst.demon.co.uk>, David Livingstone
|> ...
|> |>                   ... We have used PCA plots and non-linear maps in
|> |> drug design and sometimes one "works" better than another (in terms
|> |> of separating activity classes), sometimes they both work well and
|> |> sometimes neither - presumably because we haven't calculated the
|> |> "right" chemical properties.
|> 
|> Or perhaps you haven't found the right projection. The bottleneck
|> maximizes variance, not separability.

You can use bottlenecks in heteroassociative networks (targets different
from inputs) as well as in autoassociative networks (targets same as
inputs), thereby maximizing separability.

Suppose you have a data matrix X with p inputs, and you want a
d-dimensional representation, i.e. d units in the bottleneck. The first
d principal components provide a least-squares fit to a linear network
with p inputs, d hidden units, and p outputs, where both inputs and
targets are X. You can put this in matrix notation as X ~= XVW, where X
is n by p, V is p by d, and W is d by p; V and W are the weights in the
network, "~=" means "make the left-hand side and right-hand side
approximately equal by minimizing the sum of squared differences".
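A minimal numpy sketch of this equivalence (variable names and the toy data
are mine, purely for illustration): the best rank-d linear reconstruction
X ~= XVW is obtained by taking V and W from the first d right singular
vectors of X, and its error equals the sum of the discarded squared
singular values (Eckart-Young).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 200, 5, 2
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = X - X.mean(axis=0)                      # center, as in PCA

# Principal components via the SVD: X = U S Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:d].T                                # p x d "encoder" weights
W = Vt[:d]                                  # d x p "decoder" weights

# Reconstruction from the d-unit bottleneck
X_hat = X @ V @ W

# The sum of squared errors equals the sum of the discarded squared
# singular values -- the best any rank-d linear network can achieve.
sse = np.sum((X - X_hat) ** 2)
assert np.isclose(sse, np.sum(S[d:] ** 2))
```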

Let's say X contains c classes. Code c dummy (0/1) variables for the
classes and store the dummy variables in a matrix Y. The first d
maximum-redundancy components provide a least-squares fit to a linear
network with p inputs, d hidden units, and c outputs, where X provides
inputs and Y provides targets. In matrix notation, Y ~= XVW, where Y is
n by c, V is p by d, and W is d by c. This maximum redundancy analysis
gives a d-dimensional representation of the data that maximizes
separability of the classes in the least-squares sense. Rao (1964)
called this "principal components of instrumental variables", but 
the name never caught on. See also Fortier (1966) and van den
Wollenberg (1977).
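Here is a numpy sketch of the heteroassociative case, Y ~= XVW (the data
and names are illustrative, not from the post). The rank-d least-squares
solution can be computed as a reduced-rank regression: fit the full
least-squares coefficients, then truncate the SVD of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, c, d = 300, 6, 3, 2
X = rng.standard_normal((n, p))
labels = rng.integers(0, c, n)
Y = np.eye(c)[labels]                       # n x c dummy (0/1) class matrix

# Full-rank least squares: B minimizes ||Y - X B||^2
B = np.linalg.lstsq(X, Y, rcond=None)[0]    # p x c
Y_fit = X @ B

# Rank-d (maximum redundancy) solution: truncate the SVD of the
# fitted values (Eckart-Young applied in the column space of X).
U, S, Vt = np.linalg.svd(Y_fit, full_matrices=False)
V = B @ Vt[:d].T                            # p x d encoder weights
W = Vt[:d]                                  # d x c decoder weights
Y_hat = X @ V @ W                           # rank-d fit of Y

# Check: X V W reproduces the rank-d truncation of the full-rank fit
assert np.allclose(Y_hat, U[:, :d] * S[:d] @ Vt[:d])
```

The columns of XV are then the d-dimensional representation that best
predicts the class dummies in the least-squares sense.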

You can produce a nonlinear generalization of principal components by
introducing an additional, nonlinear hidden layer between the inputs and
the bottleneck layer. For full generality, you also need a nonlinear
hidden layer between the bottleneck and outputs.

You can produce a nonlinear generalization of maximum redundancy
analysis by introducing an additional, nonlinear hidden layer between
the inputs and the bottleneck layer. For classification problems, there
is no need for a nonlinear hidden layer between the bottleneck and
outputs.
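To make the two layer arrangements concrete, here is an untrained forward
pass in numpy (random weights, shapes only; training would proceed by the
usual backpropagation, which is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, h, d, c = 100, 5, 8, 2, 3
X = rng.standard_normal((n, p))

def layer(A, n_out, nonlinear, rng):
    """One fully connected layer, with an optional tanh activation."""
    Wt = rng.standard_normal((A.shape[1], n_out)) * 0.1
    Z = A @ Wt
    return np.tanh(Z) if nonlinear else Z

# Nonlinear PCA: nonlinear layers on BOTH sides of the bottleneck
H1 = layer(X, h, True, rng)        # nonlinear hidden layer
Z  = layer(H1, d, False, rng)      # d-dimensional linear bottleneck
H2 = layer(Z, h, True, rng)        # nonlinear layer before the outputs
X_out = layer(H2, p, False, rng)   # reconstruct the p inputs

# Nonlinear MRA: same encoder, linear output straight to the c classes
H1b = layer(X, h, True, rng)
Zb  = layer(H1b, d, False, rng)
Y_out = layer(Zb, c, False, rng)   # no nonlinear layer needed here

assert X_out.shape == (n, p) and Y_out.shape == (n, c)
```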

In PCA and MRA, you can apply any full rank linear transformation to the
d-dimensional representation without changing the fit of the model (this
is called "rotation"). This nonuniqueness introduces some trickiness
into interpretations of the dimensions. In the nonlinear NN
generalizations of PCA and MRA, you can apply a wide variety of
continuous one-to-one nonlinear transformations to the d-dimensional
representation without changing the fit of the model. This makes visual
interpretation very problematic. An important research problem is
to find some way of constraining or regularizing the d-dimensional 
representation to make it interpretable.
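The linear nonuniqueness is easy to demonstrate numerically: absorb any
full-rank d-by-d matrix T into V and its inverse into W, and the model's
output is unchanged even though the bottleneck representation is not
(a toy sketch; the matrices here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 50, 4, 2
X = rng.standard_normal((n, p))
V = rng.standard_normal((p, d))
W = rng.standard_normal((d, p))

T = rng.standard_normal((d, d))             # any full-rank d x d matrix
assert abs(np.linalg.det(T)) > 1e-8

V2 = V @ T                                  # "rotated" encoder
W2 = np.linalg.inv(T) @ W                   # compensating decoder

assert np.allclose(X @ V @ W, X @ V2 @ W2)  # identical fit
assert not np.allclose(X @ V, X @ V2)       # different d-dim "view"
```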

The usual method used by statisticians to produce a d-dimensional
representation that maximizes separability is canonical discriminant
analysis. This amounts to a model of the form YW' ~= XV, except that
"~=" means something more complicated. Canonical discriminant analysis
is also equivalent to a principal component analysis of the class means
in the metric of the inverse within-class covariance matrix.  In other
words, this maximizes separability of the classes relative to
within-class variability. The d dimensions are sometimes called
"canonical variates" (Marriott 1974), sometimes CRIMCOORDS (yuck)
(Gnanadesikan 1977), and sometimes other confusing things.  Putting
canonical discriminant analysis into a neural net framework would be an
interesting exercise.
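For reference, the classical (non-neural) computation of canonical
variates goes through the eigenvectors of the inverse within-class
covariance matrix times the between-class scatter of the means. A small
numpy sketch on synthetic two-class data (the data and tolerances are
mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two Gaussian classes, separated along the first coordinate
n, p = 200, 4
X0 = rng.standard_normal((n, p))
X1 = rng.standard_normal((n, p)) + np.array([2.0, 0.0, 0.0, 0.0])
X = np.vstack([X0, X1])
g = np.array([0] * n + [1] * n)

means = np.array([X[g == k].mean(axis=0) for k in (0, 1)])
grand = X.mean(axis=0)

# Pooled within-class covariance and between-class scatter of the means
Sw = sum(np.cov(X[g == k].T, bias=True) for k in (0, 1)) / 2
Sb = sum(np.outer(m - grand, m - grand) for m in means) / 2

# Canonical variates: eigenvectors of Sw^{-1} Sb
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(evals.real)[::-1]
a = evecs[:, order[0]].real                 # first canonical direction

# Projections of the two classes onto a should be well separated
z0, z1 = X0 @ a, X1 @ a
gap = abs(z0.mean() - z1.mean()) / np.sqrt((z0.var() + z1.var()) / 2)
assert gap > 1.5                            # roughly 2 within-class s.d.
```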

   Fortier, J.J. (1966), "Simultaneous Linear Prediction,"
   Psychometrika, 31, 369-381.

   Gnanadesikan, R. (1977), Methods for Statistical Data Analysis
   of Multivariate Observations, New York: John Wiley & Sons.

   Marriott, F.H.C. (1974) The Interpretation of Multiple Observations,
   NY: Academic Press.

   Rao, C.R. (1964), "The Use and Interpretation of Principal
   Component Analysis in Applied Research," Sankhya A, 26, 329-358.

   van den Wollenberg, A.L. (1977), "Redundancy Analysis--An
   Alternative to Canonical Correlation Analysis," Psychometrika,
   42, 207-219.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
