Newsgroups: comp.ai.neural-nets
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Comparison of BP vs. RBF
Message-ID: <Dp9vou.3nK@unx.sas.com>
Date: Wed, 3 Apr 1996 06:07:42 GMT
References:  <3160ABA9.7445@ecis.com>
Organization: SAS Institute Inc.


In article <3160ABA9.7445@ecis.com>, Tim Lyons <lyons@ecis.com> writes:
|> ...
|> I have used BP before with success, but am not sure where to begin with 
|> RBFs.  Can somebody tell me the pluses and minuses of each?

I presume the question is intended to be MLP vs. RBF. Both MLPs and RBF
networks can be trained by backprop or any of numerous more efficient
methods commonly used for numerical optimization. I have been working on
an answer to this FAQ, but it has turned out to be a much more
complicated issue than the literature suggests. Here is an incomplete
draft on the subject:

Notation:

      a_j     is the altitude of the jth hidden unit
      b_j     is the bias of the jth hidden unit
      f       is the fan-in of the jth hidden unit, i.e., the number
                of inputs it receives
      h_j     is the activation of the jth hidden unit 
      s       is a common width shared by all hidden units in the layer
      s_j     is the width of the jth hidden unit
      w_ij    is the weight connecting the ith input to
                the jth hidden unit
      w_i     is the common weight for the ith input shared by
                all hidden units in the layer
      x_i     is the ith input

MLPs, Gaussian RBF networks, LVQ, TDE, etc.
*******************************************

The inputs to each hidden or output unit must be combined with the weights
to yield a single value called the net input. There does not seem to be a
standard term for the function that combines the inputs and weights; I will
use the term "combination function". 

The multilayer perceptron (MLP) has one or more hidden layers for which the
combination function is the inner product of the inputs and weights, plus a
bias. The activation function is typically a logistic or tanh function.
Hence the formula for the activation is typically: 

   h_j = tanh( b_j + sum[w_ij*x_i] )

The MLP architecture is the most popular one in practical applications. Each
layer uses a linear combination function. The inputs are fully connected to
the first hidden layer, each hidden layer is fully connected to the next,
and the last hidden layer is fully connected to the outputs. You can also
have "skip-layer" connections; direct connections from inputs to outputs are
especially useful. 
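
For concreteness, here is a minimal NumPy sketch of one such hidden layer
(the names x, W, and b are mine, not from any particular package):

   import numpy as np

   def mlp_hidden(x, W, b):
       """x: inputs (n_in,); W: weights (n_in, n_hid); b: biases (n_hid,).
       Combination function: inner product plus bias.  Activation: tanh."""
       return np.tanh(b + x @ W)

   rng = np.random.default_rng(0)
   x = rng.normal(size=3)              # 3 inputs
   W = rng.normal(size=(3, 4))         # 4 hidden units
   b = rng.normal(size=4)
   print(mlp_hidden(x, W, b))          # 4 activations, each in (-1, 1)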

Consider the multidimensional space of inputs to a given hidden unit. Since
an MLP uses linear combination functions, the set of all points in the space
having a given value of the activation function is a hyperplane. The
hyperplanes corresponding to different activation levels are parallel to
each other (the hyperplanes for different units are not parallel in
general). These parallel hyperplanes are the isoactivation contours of the
hidden unit. 

Radial basis function (RBF) networks usually have only one hidden layer, for
which the combination function is the squared Euclidean distance between the
input vector and the weight vector, divided by the squared width of the
unit. There may also be another term added to the combination function,
which determines what I will call the "altitude" of the unit.

There are two distinct types of Gaussian RBF architectures. The first type
uses the exp activation function, so the activation of the unit is a
Gaussian "bump" as a function of the inputs. There seems to be no specific
term for this type of Gaussian RBF network; I will use the term "ordinary
RBF", or ORBF, network. 

The second type of Gaussian RBF architecture uses the softmax activation
function, so the activations of all the hidden units are normalized to sum
to one. This type of network is often called a "normalized RBF", or NRBF,
network. 

While the distinction between these two types of Gaussian RBF architectures
is sometimes mentioned in the NN literature, its importance has rarely been
appreciated, except by Tao (1993) and the "partition of unity" literature
[reference???].

There are several subtypes of both ORBF and NRBF architectures: 

ORBFUN 
   Ordinary radial basis function (RBF) network with unequal widths
   h_j = exp( - s_j^-2 * sum[ (w_ij - x_i)^2 ] )

ORBFEQ 
   Ordinary radial basis function (RBF) network with equal widths
   h_j = exp( - s^-2 * sum[ (w_ij - x_i)^2 ] )

NRBFUN 
   Normalized RBF network with unequal widths and heights
   h_j = softmax( f*log(a_j) - s_j^-2 * sum[ (w_ij - x_i)^2 ] )

NRBFEV 
   Normalized RBF network with equal volumes
   h_j = softmax( f*log(s_j) - s_j^-2 * sum[ (w_ij - x_i)^2 ] )

NRBFEH 
   Normalized RBF network with equal heights (and unequal widths)
   h_j = softmax( - s_j^-2 * sum[ (w_ij - x_i)^2 ] )

NRBFEW 
   Normalized RBF network with equal widths (and unequal heights)
   h_j = softmax( f*log(a_j) - s^-2 * sum[ (w_ij - x_i)^2 ] )

NRBFEQ 
   Normalized RBF network with equal widths and heights
   h_j = softmax( - s^-2 * sum[ (w_ij - x_i)^2 ] )
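
Here is a minimal NumPy sketch of these activations (the function and
array names are mine); a scalar s gives the equal-width variants and a
vector s the unequal-width ones:

   import numpy as np

   def net_input(x, W, s, a=None, f=None):
       """Radial combination function: minus the squared distance divided
       by the squared width, plus the optional altitude term f*log(a_j)."""
       d2 = ((W - x) ** 2).sum(axis=1)    # squared Euclidean distances
       net = -d2 / s ** 2
       if a is not None:
           net = net + f * np.log(a)
       return net

   def orbf(x, W, s):                     # ORBFEQ/ORBFUN: exp activation
       return np.exp(net_input(x, W, s))

   def nrbf(x, W, s, a=None):             # NRBF*: softmax activation
       net = net_input(x, W, s, a, f=W.shape[1])
       e = np.exp(net - net.max())        # numerically stable softmax
       return e / e.sum()                 # activations sum to one

   rng = np.random.default_rng(0)
   x, W = rng.normal(size=2), rng.normal(size=(4, 2))
   print(orbf(x, W, s=1.0))                         # ORBFEQ
   print(nrbf(x, W, s=rng.uniform(0.5, 2.0, 4),     # NRBFUN
              a=rng.uniform(0.5, 2.0, 4)))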

The ORBF architectures use radial combination functions and the exp
activation function. Radial combination functions are based on the Euclidean
distance between the vector of inputs to the unit and the vector of
corresponding weights. Thus, the isoactivation contours for ORBF networks
are concentric hyperspheres. A variety of activation functions can be used
with the radial combination function, but the exp activation function,
yielding a Gaussian surface, is the most useful. Radial networks typically
have only one hidden layer, but it can be useful to include a linear layer
for dimensionality reduction or oblique rotation before the RBF layer. 

Only two of the radial combination functions are useful with ORBF
architectures. For radial combination functions including an altitude, the
altitude would be redundant with the hidden-to-output weights. 

The NRBF architectures also use radial combination functions but the
activation function is softmax, which forces the sum of the activations for
the hidden layer to equal one. Thus, each output unit computes a weighted
average of the hidden-to-output weights, and the output values must lie
within the range of the hidden-to-output weights. 

Radial combination functions incorporating altitudes are useful with NRBF
architectures. The NRBF architectures combine some of the virtues of both
the RBF and MLP architectures, as explained below. However, the
isoactivation contours are considerably more complicated than for ORBF
architectures. 

Consider the case of an NRBF network with only two hidden units. If the
hidden units have equal widths, the isoactivation contours are parallel
hyperplanes; in fact, this network is equivalent to an MLP with one logistic
hidden unit, because the quadratic terms in the two squared distances cancel
in the softmax, leaving a linear function of the inputs. If the hidden units
have unequal widths, the isoactivation contours are concentric hyperspheres;
such a network is almost equivalent to an ORBF network with one Gaussian
hidden unit.
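
A quick numeric check of this equivalence (my own illustration, with a
common width s):

   import numpy as np

   rng = np.random.default_rng(1)
   w1, w2, s = rng.normal(size=2), rng.normal(size=2), 0.7
   x = rng.normal(size=2)

   n1 = -((x - w1) ** 2).sum() / s ** 2   # net inputs of the two units
   n2 = -((x - w2) ** 2).sum() / s ** 2
   h1 = np.exp(n1) / (np.exp(n1) + np.exp(n2))   # softmax, unit 1

   # Equivalent logistic unit: the x'x terms cancel in n1 - n2.
   slope = 2 * (w1 - w2) / s ** 2
   bias = (w2 @ w2 - w1 @ w1) / s ** 2
   print(h1, 1 / (1 + np.exp(-(bias + slope @ x))))   # identical values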

If there are more than two hidden units in an NRBF network, the
isoactivation contours have no such simple characterization. If the RBF
widths are very small, the isoactivation contours are approximately
piecewise linear for RBF units with equal widths, and approximately
piecewise spherical for RBF units with unequal widths. The larger the
widths, the smoother the isoactivation contours where the pieces join.

The NRBFEQ architecture is a smoothed variant of the learning vector
quantization (Kohonen 1988; Ripley 1996) and counterpropagation
(Hecht-Nielsen 1990) architectures. In LVQ and counterprop, the hidden units
are often called codebook vectors. LVQ amounts to nearest-neighbor
classification on the codebook vectors, while counterprop is
nearest-neighbor regression on the codebook vectors. The NRBFEQ architecture
uses not just the single nearest neighbor, but a weighted average of near
neighbors. As the width of the NRBFEQ functions approaches zero, the weights
approach one for the nearest neighbor and zero for all other codebook
vectors. LVQ and counterprop use ad hoc algorithms of uncertain reliability,
but standard numerical optimization algorithms (not to mention backprop) can
be applied with the NRBFEQ architecture. 
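
A small illustration of this limit (the codebook vectors, test point,
and widths are arbitrary):

   import numpy as np

   def nrbfeq(x, W, s):
       net = -((W - x) ** 2).sum(axis=1) / s ** 2
       e = np.exp(net - net.max())
       return e / e.sum()

   W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # codebook vectors
   x = np.array([0.2, 0.1])                            # nearest to W[0]
   for s in (1.0, 0.3, 0.05):
       print(s, nrbfeq(x, W, s).round(3))
   # the weights concentrate on the nearest codebook vector as s -> 0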

Hybrid training and the curse of dimensionality
===============================================

A comparison of the various architectures must separate training issues from
architectural issues to avoid common sources of confusion. RBF networks are
often trained by hybrid methods, in which the hidden weights (centers) are
first obtained by unsupervised learning, after which the output weights are
obtained by supervised learning. Unsupervised methods for choosing the
centers include: 

1. Distribute the centers in a regular grid over the input space. 
2. Choose a random subset of the training cases to serve as centers. 
3. Cluster the training cases based on the input variables, and use the mean
   of each cluster as a center. 

Various heuristic methods are also available for choosing the RBF widths.
Once the centers and widths are fixed, the output weights can be learned
very efficiently, since the computation reduces to a linear or generalized
linear model. The hybrid training approach can thus be much faster than the
nonlinear optimization that would be required for supervised training of all
of the weights in the network. 
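
Here is a sketch of hybrid training using method (2) above, with the
output weights obtained by ordinary least squares; the data, the number
of centers, and the width are arbitrary choices for illustration:

   import numpy as np

   rng = np.random.default_rng(2)
   X = rng.uniform(-1, 1, size=(200, 2))      # training inputs
   y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2     # training targets

   centers = X[rng.choice(len(X), 20, replace=False)]  # method (2)
   s = 0.5                                    # heuristic common width

   # Design matrix of Gaussian activations, one column per hidden unit,
   # plus a column of ones for the output bias.
   D = np.exp(-((X[:, None, :] - centers) ** 2).sum(-1) / s ** 2)
   D = np.column_stack([D, np.ones(len(X))])
   w, *_ = np.linalg.lstsq(D, y, rcond=None)  # linear least squares
   print(((D @ w - y) ** 2).mean())           # training MSE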

Hybrid training is not often applied to MLPs because no effective methods
are known for unsupervised training of the hidden units (except when there
is only one input). 

The trouble with hybrid methods is that the required number of hidden units
tends to increase exponentially with the number of inputs. This drawback of
hybrid methods is discussed by Minsky and Papert (1969). For example, with
method (1) for RBF networks, you would need at least five elements in the
grid along each dimension to detect a moderate degree of nonlinearity; so if
you have Nx inputs, you would need at least 5**Nx hidden units. For
methods (2) and (3), the number of hidden units increases exponentially with
the effective dimensionality of the input distribution. If the inputs are
linearly related, the effective dimensionality is the number of
nonnegligible (a deliberately vague term) eigenvalues of the covariance
matrix, so the inputs must be highly correlated if the effective
dimensionality is to be much less than the number of inputs. 

The exponential increase in the number of hidden units required for hybrid
learning is one aspect of the curse of dimensionality (Bellman 1961; Scott
1992). The number of training cases required also increases exponentially in
general. No neural network architecture--in fact no method of learning or
statistical estimation--can escape the curse of dimensionality in general,
hence there is no practical method of learning general functions in more
than a few dimensions. 

Fortunately, in many practical applications of neural networks with a large
number of inputs, most of those inputs are additive, redundant, or
irrelevant, and some architectures can take advantage of these properties to
yield useful results. Escape from the curse of dimensionality requires fully
supervised learning. 

Additive inputs
===============

An additive model is one in which the output is a sum of linear or nonlinear
transformations of the inputs. If an additive model is appropriate, the
number of weights increases linearly with the number of inputs, so high
dimensionality is not a curse. Various methods of training additive models
are available in the statistical literature (e.g. Hastie and Tibshirani
1990). You can also create a feedforward neural network, called a 
generalized additive network (GAN), to fit additive models (Sarle 1994).
Additive models have been proposed in the neural net literature under the
name "topologically distributed encoding" (Geiger 1990). 

Projection pursuit regression (PPR) provides both universal approximation
and the ability to avoid the curse of dimensionality for certain common
types of target functions (Friedman and Stuetzle 1981). Like MLPs, PPR
computes the output as a sum of nonlinear transformations of linear
combinations of the inputs. Each term in the sum is analogous to a hidden
unit in an MLP. But unlike MLPs, PPR allows general, smooth nonlinear
transformations rather than a specific nonlinear activation function, and
allows a different transformation for each term. The nonlinear
transformations in PPR are usually estimated by nonparametric regression,
but you can set up a projection pursuit network (PPN), in which each
nonlinear transformation is performed by a subnetwork. If a PPN provides an
adequate fit with few terms, then the curse of dimensionality can be
avoided, and the results may even be interpretable. 
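
The projection pursuit form itself is only a few lines; in real PPR the
one-dimensional functions g_k are estimated nonparametrically, whereas
this sketch simply fixes them for illustration:

   import numpy as np

   def ppr_output(x, terms):
       """terms: list of (a_k, g_k); a_k a direction, g_k a 1-d function."""
       return sum(g(a @ x) for a, g in terms)

   terms = [(np.array([1.0, 0.0, 0.0]), np.sin),
            (np.array([0.5, 0.5, 0.0]), np.tanh)]
   print(ppr_output(np.array([0.3, -0.2, 0.9]), terms))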

If the target function can be accurately approximated by projection pursuit,
then it can also be accurately approximated by an MLP with a single hidden
layer. The disadvantage of the MLP is that there is little hope of
interpretability. An MLP with two or more hidden layers can provide a
parsimonious fit to a wider variety of target functions than can projection
pursuit, but no simple characterization of these functions is known. 

Redundant inputs
================

With proper training, all of the RBF architectures listed above, as well as
MLPs, can process redundant inputs effectively. When there are redundant
inputs, the training cases lie close to some (possibly nonlinear) subspace.
If the same degree of redundancy applies to the test cases, the network need
produce accurate outputs only near the subspace occupied by the data. Adding
redundant inputs has little effect on the effective dimensionality of the
data; hence the curse of dimensionality does not apply. However, if the test
cases do not follow the same pattern of redundancy as the training cases,
generalization will require extrapolation and will rarely work. 

Irrelevant inputs
=================

MLP architectures are good at ignoring irrelevant inputs. MLPs can also
select linear subspaces of reduced dimensionality. Since a hidden layer
forms linear combinations of the previous layer, each hidden layer confines
the network's attention to the linear subspace spanned by the weight vectors.
Hence, adding irrelevant inputs to the training data does not increase the
number of hidden units required, although it increases the amount of
training data required. 

ORBF architectures are not good at ignoring irrelevant inputs. The number of
hidden units required grows exponentially with the number of inputs,
regardless of how many inputs are relevant. This exponential growth is
related to the fact that ORBFs and ERBFs (elliptical basis function
networks, discussed below) have local receptive fields, meaning
that changing the hidden-to-output weights of a given unit will affect the
output of the network only in a neighborhood of the center of the hidden
unit, where the size of the neighborhood is determined by the width of the
hidden unit. (Of course, if the width of the unit is learned, the receptive
field could grow to cover the entire training set.) 

Local receptive fields are often an advantage compared to the distributed
architecture of MLPs, since local units can adapt to local patterns in the
data without having unwanted side effects in other regions. In a distributed
architecture such as an MLP, adapting the network to fit a local pattern in
the data can cause spurious side effects in other parts of the input space. 

However, ORBF architectures often must be used with relatively small
neighborhoods, so that several hidden units are required to cover the range
of an input. When there are many nonredundant inputs, the hidden units must
cover the entire input space, and the number of units required is
essentially the same as in the hybrid case (1) where the centers are in a
regular grid; hence the exponential growth in the number of hidden units
with the number of inputs, regardless of whether the inputs are relevant. 

You can enable an ORBF architecture to ignore irrelevant inputs by using an
extra, linear hidden layer before the radial hidden layer. This type of
network is sometimes called an elliptical basis function network. If the
number of units in the linear hidden layer equals the number of inputs, the
linear hidden layer performs an oblique rotation of the input space that can
suppress irrelevant directions and differentially weight relevant directions
according to their importance. If you think that the presence of irrelevant
inputs is highly likely, you can force a reduction of dimensionality by
using fewer units in the linear hidden layer than the number of inputs. 
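
A minimal sketch of such a network, with a fixed projection matrix A
for illustration (in practice A would be learned along with the other
weights):

   import numpy as np

   def ebf_hidden(x, A, centers, s):
       """A: (n_proj, n_in) linear layer; centers: (n_hid, n_proj)."""
       z = A @ x                           # oblique rotation / reduction
       d2 = ((centers - z) ** 2).sum(axis=1)
       return np.exp(-d2 / s ** 2)         # ordinary RBF layer on z

   rng = np.random.default_rng(4)
   A = np.array([[1.0, 0.0, 0.0],          # keeps inputs 1 and 2,
                 [0.0, 1.0, 0.0]])         # ignores the 3rd input
   print(ebf_hidden(rng.normal(size=3), A, rng.normal(size=(5, 2)), 1.0))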

Note that the linear and radial hidden layers must be connected in series,
not in parallel, to ignore irrelevant inputs. In some applications it is
useful to have linear and radial hidden layers connected in parallel, but in
such cases the radial hidden layer will be sensitive to all inputs. 

For even greater flexibility (at the cost of more weights to be learned),
you can have a separate linear hidden layer for each RBF unit, allowing a
different oblique rotation for each RBF unit. 

NRBF architectures with equal widths (NRBFEW and NRBFEQ) combine the
advantage of local receptive fields with the ability to ignore irrelevant
inputs. The receptive field of one hidden unit extends from the center in
all directions until it encounters the receptive field of another hidden
unit. It is convenient to think of a "boundary" between the two receptive
fields, defined as the hyperplane where the two units have equal
activations, even though the effect of each unit will extend somewhat beyond
the boundary. The location of the boundary depends on the heights of the
hidden units. If the two units have equal heights, the boundary lies midway
between the two centers. If the units have unequal heights, the boundary is
farther from the higher unit. 

If a hidden unit is surrounded by other hidden units, its receptive field is
indeed local, curtailed by the field boundaries with other units. But if a
hidden unit is not completely surrounded, its receptive field can extend
infinitely in certain directions. If there are irrelevant inputs, or more
generally, irrelevant directions that are linear combinations of the inputs,
the centers need only be distributed in a subspace orthogonal to the
irrelevant directions. In this case, the hidden units can have local
receptive fields in relevant directions but infinite receptive fields in
irrelevant directions. 

For NRBF architectures allowing unequal widths (NRBFUN, NRBFEV, and NRBFEH),
the boundaries between receptive fields are generally hyperspheres rather
than hyperplanes. In order to ignore irrelevant inputs, such networks must be
trained to have equal widths. Hence, if you think there is a strong
possibility that some of the inputs are irrelevant, it is usually better to
use an architecture with equal widths. 

References
**********

Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton
University Press. 

Friedman, J.H. and Stuetzle, W. (1981), "Projection pursuit regression,"
Journal of the American Statistical Association, 76, 817-823. 

Geiger, H. (1990), "Storing and Processing Information in Connectionist
Systems," in Eckmiller, R., ed., Advanced Neural Computers, 271-277,
Amsterdam: North-Holland. 

Hastie, T.J. and Tibshirani, R.J. (1990), Generalized Additive Models,
London: Chapman & Hall. 

Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley. 

Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1
(suppl 1), 303. 

Minsky, M.L. and Papert, S.A. (1969), Perceptrons, Cambridge, MA: MIT
Press. 

Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
Cambridge University Press.

Sarle, W.S. (1994), "Neural Networks and Statistical Models," in SAS
Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group
International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550, 
ftp://ftp.sas.com/pub/neural/neural1.ps. 

Scott, D.W. (1992), Multivariate Density Estimation, NY: Wiley. 

Tao, K.M. (1993), "A closer look at the radial basis function (RBF)
networks," Conference Record of The Twenty-Seventh Asilomar Conference
on Signals, Systems and Computers (Singh, A., ed.), vol 1, 401-405, Los
Alamitos, CA: IEEE Comput. Soc. Press. 


-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
