This page last updated 3 June 1998.
This is a partial list of the papers that were presented at NIPS*97. They are listed in (approximately) alphabetical order according to the last name of the first author.
These papers should be cited as:
Advances in Neural Information Processing Systems 10,
M. I. Jordan, M. J. Kearns, S. A. Solla, eds., MIT Press, 1998.
You can also jump to the names starting with
ABCDEFGHIJKLMNOPRSTVWXYZ


 Generalized Prioritized Sweeping  David Andre, Nir Friedman, Ronald Parr
 Prioritized sweeping is a modelbased reinforcement learning method
that attempts to focus an agent's limited computational resources to
achieve a good estimate of the value of environment states. The
classic account of prioritized sweeping uses an explicit, statebased
representation of the value, reward, and model parameters. Such a
representation is unwieldy for dealing with complex environments and
there is growing interest in learning with more compact
representations. We claim that classic prioritized sweeping is
illsuited for learning with such representations. To overcome this
deficiency, we introduce generalized prioritized sweeping, a
principled method for generating representationspecific algorithms
for modelbased reinforcement learning. We then apply this method for
several representations, including statebased models and generalized
model approximators (such as Bayesian networks). We describe
preliminary experiments that compare our approach with classical
prioritized sweeping.
 Nonparametric ModelBased Reinforcement Learning  Christopher G. Atkeson
 This paper describes some of the interactions of model learning algorithms and
planning algorithms we have found in exploring
modelbased reinforcement learning.
The paper focuses on how local trajectory optimizers can be used effectively
with learned nonparametric models.
We find that trajectory planners that are fully consistent with the learned
model often have difficulty finding reasonable plans in the early stages
of learning.
Trajectory planners that balance obeying the learned model
with minimizing cost often do better, even if the plan is not fully consistent
with the learned model.
 Coding of Naturalistic Stimuli by Auditory Midbrain Neurons  Hagai Attias, Christoph E. Schreiner
 It is known that humans can make finer discriminations between familiar sounds
(e.g. syllables) than between unfamiliar ones (e.g. different noise segments).
Here we show that a corresponding enhancement is present in early auditory
processing stages. Based on previous work
which demonstrated that natural sounds had robust statistical
properties that could be quantified, we hypothesize that the auditory
system exploits those properties to construct efficient neural codes.
To test this hypothesis, we measure the information rate carried by
auditory spike trains on narrowband stimuli whose amplitude modulation
has naturalistic characteristics, and compare it to the information rate
on stimuli with nonnaturalistic modulation. We find that
naturalistic inputs significantly enhance the rate of transmitted
information, indicating that auditiory neural responses are matched to
characteristics of natural auditory scenes.
 Synchronized Auditory and Cognitive 40 Hz Attentional Streams, and the Impact of Rhythmic Expectation on Auditory Scene Analysis  Bill Baird
 We have developed a neural network architecture that implements a
theory of attention, learning, and transcortical communication based
on adaptive synchronization of 515 Hz and 3080 Hz oscillations
between cortical areas. Here we present a specific higher order
cortical model of attentional networks, rhythmic expectancy, and the
interaction of higherorder and primary cortical levels of processing.
It accounts for the ``mismatch negativity'' of the auditory ERP and the
results of psychological experiments of Jones showing that auditory
stream segregation can depend on the rhythmic structure of inputs. The
timing mechanisms of the model allow us to explain how relative timing
information such as the relative order of events between streams is
lost when streams are formed. The model suggests how the theories of
auditory perception and attention of Jones and Bregman may be
reconciled.
 Using Expectation to Guide Processing: A Study of Three RealWorld Applications  Shumeet Baluja
 In many real world tasks, only a small fraction of the available inputs are
important at any particular time. This paper presents a method for ascertaining
the relevance of inputs by exploiting temporal coherence and
predictability. The method proposed in this paper dynamically allocates
relevance to inputs by using expectations of their future values. As a model of
the task is learned, the model is simultaneously extended to create
taskspecific predictions of the future values of inputs. Inputs which are
either not relevant, and therefore not accounted for in the model, or those
which contain noise, will not be predicted accurately. These inputs can be
deemphasized, and, in turn, a new, improved, model of the task created. The
techniques presented in this paper have yielded significant improvements for the
visionbased autonomous control of a land vehicle, visionbased hand tracking
in cluttered scenes, and the detection of faults in the etching of
semiconductor wafers.
 Ensemble Learning for MultiLayer Networks  David Barber, Christopher M. Bishop
 Bayesian treatments of learning in neural networks are typically based
either on local Gaussian approximations to a mode of the posterior
weight distribution, or on Markov chain Monte Carlo simulations. A
third approach, called ensemble learning, was introduced by
Hinton and van Camp (1993). It aims to approximate the posterior
distribution by minimizing the KullbackLeibler divergence between the
true posterior and a parametric approximating distribution. However,
the derivation of a deterministic algorithm relied on the use of a
Gaussian approximating distribution with a diagonal covariance
matrix and so was unable to capture the posterior correlations between
parameters. In this paper, we show how the ensemble learning approach
can be extended to fullcovariance Gaussian distributions while
remaining computationally tractable. We also extend the framework to
deal with hyperparameters, leading to a simple reestimation
procedure. Initial results from a standard benchmark problem are
encouraging.
 Radial Basis Functions: A Bayesian Treatment  David Barber, Bernhard Schottky
 Bayesian methods have been successfully applied to regression and
classification problems in multilayer perceptrons. We present a novel
application of Bayesian techniques to Radial Basis Function networks
by developing a Gaussian approximation to the posterior distribution
which, for fixed basis function widths, is analytic in the
parameters. The setting of regularization constants by
crossvalidation is wasteful as only a single optimal parameter
estimate is retained. We treat this issue in a principled manner by
assigning prior distributions to these constants, which are then
adapted in light of the data under a simple reestimation formula.
 The Canonical Distortion Measure in Feature Space and 1NN Classification  Jonathan Baxter, Peter Bartlett
 We prove that the Canonical Distortion Measure (CDM) is
the optimal distance measure to use for 1 nearestneighbour (1NN)
classification, and show that it reduces to squared Euclidean distance
in feature space for function classes that can be expressed as linear
combinations of a fixed set of features. PAClike bounds are given on
the samplecomplexity required to learn the CDM. An experiment is
presented in which a neural network CDM was learnt for a Japanese OCR
environment and then used to do 1NN classification.
 Shared Context Probabilistic Transducers  Yoshua Bengio, Samy Bengio, JeanFran\c{c}ois Isabelle, Yoram Singer
 Recently, a model for supervised learning of probabilistic transducers
represented by suffix trees was introduced. However, this algorithm
tends to build very large trees, requiring very large amounts of
computer memory. In this paper, we propose a new, more compact,
transducer model in which one shares the parameters of distributions
associated to contexts yielding similar conditional output
distributions. We illustrate the
advantages of the proposed algorithm with comparative experiments on
inducing a noun phrase recognizer.
 Refractoriness and Neural Precision  Michael J. Berry II, Markus Meister
 The relationship between a neuron's refractory period and the precision
of its response to identical stimuli was investigated. The light
response of retinal ganglion cells was modeled as probabilistic firing
combined with a refractory period: the instantaneous firing rate is the
product of a ``free firing rate'', which depends only on the stimulus, and
a ``recovery function'', which depends only on the time since the last
spike. In simulations, longer refractory periods were found to make the
response more reproducible, eventually matching the precision of
measured ganglion cell spike trains. The underlying free firing rate
derived in this way often exceeded the observed firing rate by an order
of magnitude, and was found to convey information about the stimulus
over a much wider dynamic range. Thus the free firing rate may be the
preferred variable for describing a spiking neuron's response.
 Approximating Posterior Distributions in Belief Networks Using Mixtures  Christopher M. Bishop, Neil Lawrence, Tommi Jaakkola, Michael I. Jordan
 Exact inference in densely connected Bayesian networks is
computationally intractable, and so there is considerable interest in
developing effective approximation schemes. One approach which has
been adopted is to bound the log likelihood using a meanfield
approximating distribution. While this leads to a tractable algorithm,
the mean field distribution is assumed to be factorial and hence
unimodal. In this paper we demonstrate the feasibility of using a
richer class of approximating distributions based on mixtures
of mean field distributions. We derive an
efficient algorithm for updating the mixture parameters and apply it
to the problem of learning in sigmoid belief networks. Our results
demonstrate a systematic improvement over simple mean
field theory as the number of mixture components is increased.
 Receptive Field Formation in Natural Scene Environments: Comparison of Single Cell Learning Rules  Brian S. Blais, Nathan Intrator, Harel Shouval, Leon N. Cooper
 We study several statistically and biologically motivated learning rules using
the same visual environment and neuronal architecture. This allows us to
concentrate on the feature extraction and neuronal coding properties of these
rules. We find that the quadratic form of the BCM rule behaves in a manner
similar to a kurtosis maximization rule when the distribution contains
kurtotic directions, although the BCM modification equations are
computationally simpler.
 Multiple Threshold Neural Logic  Vasken Bohossian, Jehoshua Bruck
 We introduce a new Boolean computing element related to
the Linear Threshold element, which is the Boolean version
of the neuron. Instead of the sign function, it computes an
arbitrary (with polynomialy many transitions) Boolean function
of the weighted sum of its inputs. We call the new computing
element an LTM element, which stands for Linear Threshold with
Multiple transitions.
 An Annealed SelfOrganizing Map for Source Channel Coding  Matthias Burger, Thore Graepel, Klaus Obermayer
 We derive and analyse robust optimization schemes for noisy vector quantization
on the basis of deterministic annealing. Starting from a cost function for
central clustering that incorporates distortions from channel noise we develop
a soft topographic vector quantization algorithm (STVQ) which is based on the
maximum entropy principle and which performs a maximumlikelihood estimate in
an expectationmaximization (EM) fashion. Annealing in the temperature
parameter $\beta$ leads to phase transitions in the existing code vector
representation during the cooling process for which we calculate critical
temperatures and modes as a function of eigenvectors and eigenvalues of the
covariance matrix of the data and the transition matrix of the channel noise.
A whole family of vector quantization algorithms is derived from STVQ, among
them a deterministic annealing scheme for Kohonen's selforganizing map (SOM).
This algorithm, which we call SSOM, is then applied to vector quantization of
image data to be sent via a noisy binary symmetric channel. The algorithm's
performance is compared to those of LBG and STVQ. While it is naturally
superior to LBG, which does not take into account channel noise, its results
compare very well to those of STVQ, which is computationally much more
demanding.
 Incorporating Test Inputs into Learning  Zehra Cataltepe, Malik MagdonIsmail
 In many applications, such as
credit default prediction
and medical image recognition,
test inputs are available
in addition to the labeled training examples.
We propose a method to incorporate the
test inputs into learning.
Our method results in solutions having
smaller test errors than that of simple training solution,
especially for noisy problems or small training sets.
 On Efficient Heuristic Ranking of Hypotheses  Steve Chien, Andre Stechert, Darren Mutz
 This paper considers the problem of learning the ranking of a set of
alternatives based upon incomplete information (e.g., a limited number
of observations). We describe two algorithms for hypothesis ranking
and their application for probably approximately correct (PAC) and
expected loss (EL) learning criteria. Empirical results are
provided to demonstrate the effectiveness of these ranking
procedures on both synthetic datasets and realworld data from a
spacecraft design optimization problem.
 On Parallel versus Serial Processing: A Computational Study of Visual Search  Eyal Cohen, Eytan Ruppin
 A novel neural network model of preattention processing in visualsearch
tasks is presented. Using displays of line orientations taken
from Wolfe's experiments [1992], we study the hypothesis that the
distinction between parallel versus serial processes arises from the
availability of global information in the internal representations of the
visual scene. The model operates in two phases.
First, the visual displays are compressed via
principalcomponentanalysis. Second, the compressed
data is processed by a target detector module in order to identify
the existence of a target in the display. Our main finding
is that targets in displays which were found experimentally to be
processed in parallel can be detected by the system,
while targets in experimentallyserial displays cannot.
This fundamental difference is
explained via variance analysis of the compressed
representations, providing a numerical criterion distinguishing
parallel from serial displays. Our model yields a mapping of
responsetime slopes that is similar to Duncan and Humphreys's
``search surface'' [1989], providing an explicit formulation of
their intuitive notion of feature similarity. It presents a neural
realization of the processing that may underlie the classical metaphorical
explanations of visual search.
 Learning to Order Things  William W. Cohen, Robert E. Schapire, Yoram Singer
 There are many applications in which it is desirable to order rather
than classify instances. Here we consider the problem of learning how
to order, given feedback in the form of preference judgments,
i.e., statements to the effect that one instance should be ranked
ahead of another. We outline a twostage approach in which one first
learns by conventional means a {\em preference function}, of the
form ``$\mbox{prefer}(u,v)$'', which indicates whether it is advisable to rank
$u$ before $v$. New instances are then ordered so as to maximize
agreements with the learned preference function. We show that the
problem of finding the ordering that agrees best with a preference
function is NPcomplete, even under very restrictive assumptions.
Nevertheless, we describe a simple greedy algorithm that is guaranteed
to find a good approximation. We then discuss an online
learning algorithm, based on the ``$\mbox{Hedge}$'' algorithm,
for finding a good linear combination of ranking ``experts.'' We use
the ordering algorithm combined with the online learning algorithm
to find a combination of ``search experts,'' each of which is a
domainspecific query expansion strategy for a WWW search engine,
and present experimental results that demonstrate the merits of
our approach.
 Recovering Perspective Pose with a Dual Step EM Algorithm  Andrew D. J. Cross, Edwin R. Hancock
 This paper describes a new approach to extracting 3D perspective
structure from 2D pointsets. The novel feature
is to unify the tasks of estimating transformation geometry
and identifying pointcorrespondence matches.
Unification is realised by constructing a mixture model over the bipartite
graph representing the correspondence match and by
effecting optimisation using the EM algorithm.
According to our EM framework
the probabilities of structural correspondence
gate contributions to the expected likelihood function
used to estimate maximum likelihood perspective pose
parameters. This provides a means of rejecting structural outliers.
We evaluate the technique
on the matching of different perspective views
of 3.5in floppy discs.
We provide a sensitivity
study based on
synthetic data.
 Task and Spatial Frequency Effects on Face Specialization  Matthew N. Dailey, Garrison W. Cottrell
 The double dissociation between prosopagnosia, a face recognition
deficit occurring after brain damage, and visual object agnosia,
difficulty recognizing other kinds of complex objects, indicates that
face and nonface object recognition may be served by partially
independent mechanisms in the brain. Such a dissociation could be
the result of a competitive learning mechanism that, during
development, devotes neural resources to the tasks they are best at
performing.
Studies of normal adult performance on face and object recognition
tasks seem to indicate that face recognition is primarily configural,
involving the low spatial frequency information present in a stimulus
over relatively large distances, whereas object recognition is
primarily
featural, involving analysis of the object's parts using local, high
spatial frequency information.
In a feedforward computational model of visual processing, two
modules compete to classify input stimuli; when one
module receives low spatial frequency information and the other
receives high spatial frequency information, the lowfrequency module
shows a strong specialization for face recognition in a combined
face identification/object classification task. The
series of experiments shows that the fine discrimination necessary for
distinguishing members of a visually homogeneous class such as faces
relies heavily on the low spatial frequencies present in a stimulus.
 Statistical Models of Conditioning  Peter Dayan, Theresa Long
 Recent evidence suggests that dopaminergic neurons in vertebrates
report prediction errors during reward based classical and
instrumental conditioning. We consider more complicated conditioning
paradigms which involve combining predictions from multiple
predictive stimuli. We show that our existing model fails to act in
accordance with the learning data, and suggest an alternative in
which there is attentional selection between different available
stimuli. The new model is a form of mixture of experts (Jacobs,
Jordan \& Barto, 1991) and is statistically wellfounded.
 Structure Driven Image Database Retrieval  Jeremy S. De Bonet, Paul A. Viola
 A new algorithm is presented which approximates the perceived visual
similarity between images. The images are initially transformed into
a feature space which captures visual structure, texture and color
using a tree of filters. Similarity is the inverse of the distance in
this perceptual feature space. Using this algorithm we have
constructed an image database system which can perform example based
retrieval on large image databases. Using carefully constructed
target sets, which limit variation to only a single visual
characteristic, retrieval rates are quantitatively compared to those
of standard methods.
 A NonParametric MultiScale Statistical Model for Natural Images  Jeremy S. De Bonet, Paul A. Viola
 The observed distribution of visual images is far from uniform. On
the contrary, images have complex and important structure that can be
used for image processing, recognition and analysis. There have been
many proposed approaches to the principled statistical modeling of
images, but each has been limited in either the complexity of the
models or the complexity of the images. We present a nonparametric
multiscale statistical model for images that can be used for
recognition, image denoising, and in a ``generative mode'' to
synthesize high quality textures.
 Characterizing Neurons in the Primary Auditory Cortex of the Awake Primate Using Reverse Correlation  R. Christopher deCharms, Michael M. Merzenich
 While the understanding of the functional role of different classes of
neurons in the awake primary visual cortex has been extensively studied
since the time of Hubel and Weisel, our understanding of the basic
response selectivity and functional role of neurons in the primary
auditory cortex is much farther from complete. In particular, while
moving bars are now wellrecognized as optimal for many neurons in the
primary visual cortex, for an auditory cortical neuron it is not clear
how to select the most appropriate stimulus. In this study, we recorded
from neurons in the auditory cortex of the awake primate, and used a
novel reverse correlation technique to compute the receptive fields (or
preferred stimuli) of auditory cortical neurons, encompassing both
multiple frequency components and ongoing time. These spectrotemporal
receptive fields make clear that neurons in the primary auditory cortex,
as in the primary visual cortex, typically show considerable structure,
sometimes including multiple excitatory and inhibitory regions in their
receptive fields. These neurons can be sensitive to stimulus edges in
frequency composition or in time, sensitive to stimulus transitions such
as changes in frequency, and sensitive to combinations of stimulus
features.
 Using Helmholtz Machines to Analyze Multichannel Neuronal Recordings  Virginia R. de Sa, R. Christopher deCharms, Michael M. Merzenich
 We believe that understanding neural processing will ultimately
require observing the response patterns and interaction of large
populations of neurons. To that end, we have developed a chronic
multielectrode implant for long term simultaneous recording from
multiple neurons. The task of analyzing firing patterns, across time
and between different units, arising from this electrode array is a
critical and technically challenging task. We discuss a greedy,
incremental algorithm from the ``Helmholtz machine'' family that we are
using for automated discovery of ensemble neuronal events. We
construct a generative model that attempts to maximize a bound on the
likelihood of observed firing patterns. The model is constructed
incrementally; each added unit in the model attempts to greedily
maximize the likelihood. While not globally optimal, this strategy is
computationally tractable. We show encouraging benchmark data on
artificially constructed spike trains and promising early results on
some real data.
 Neural Basis of ObjectCentered Representations  Sophie Deneve, Alexandre Pouget
 We present a neural model that can perform eye movements to a
particular side of an object regardless of the position and
orientation of the object in space, a generalization of a task which
has been recently used by Olson and Gettner to
investigate the neural structure of objectcentered
representations. Our model uses an intermediate representation in
which units have oculocentric receptive fields  just like collicular
neurons  whose gain is modulated by the side of the object to which
the movement is directed, as well as the orientation of the object. We
show that these gain modulations are consistent with Olson and Gettner
single cell recordings in the supplementary eye field. This
demonstrates that it is possible to perform an objectcentered task
without a representation involving an objectcentered map, viz.,
without neurons whose receptive fields are defined in
objectcentered coordinates. We also show that the same approach can
account for objectcentered neglect, a situation in which patients
with a right parietal lesion neglect the left side of objects
regardless of the orientation of the objects.
 Instabilities in Eye Movement Control: A Model of Periodic  Ernst R. Dow, Thomas J. Anastasio
 Nystagmus is a pattern of eye movement characterized by a smooth rotations
of the eye in one direction and rapid rotations in the opposite direction
that reset eye position. Periodic alternating nystagmus (PAN) is a form of
uncontrollable nystagmus that has been described as an unstable but
amplitudelimited oscillation. PAN has been observed previously only in
subjects with vestibulocerebellar damage. We describe results in which
PAN can be produced in normal subjects by prolonged rotation in darkness.
We propose a new model in which the neural circuits that control eye movement
are inherently unstable, but this instability is kept in check under normal
circumstances by the cerebellum. Circumstances which alter this cerebellar
restraint, such as vestibulocerebellar damage or plasticity due to rotation
in darkness can lead to PAN.
 Agnostic Classification of Markovian Sequences  Ran ElYaniv, Shai Fine, Naftali Tishby
 Classification of finite sequences without explicit knowledge of their
statistical origins is of great importance to various applications, such
as text categorization and biological sequence analysis. We propose a new
information theoretic algorithm for this problem based on two important
ingredients. The first, motivated by the ``two sample problem'', provides
an (almost) optimal model independent criterion for comparing two
statistical sources. The second ingredient is a novel idea for agnostic
``statistical merging'' of sequences in an asymptotically optimal way.
 A General Purpose Image Processing Chip: Orientation Detection  Ralph EtienneCummings, Donghui Cai
 A 80x78 pixel general purpose vision chip for spatial focal plane processing
is presented. The size and configuration of the processing receptive field
are programmable. The chip's architecture allows the photoreceptor cells
to be small and densely packed by performing all computation on the readout,
away from the array. In addition to the raw intensity image, the chip outputs
four processed images in parallel. Also presented is an application of the
chip to line segment orientation detection, as found in the retinal receptive
fields of toads.
 Ensemble and Modular Approaches for Face Detection: A Comparison  Rapha\"el Feraud, Olivier Bernier
 A new learning model based on autoassociative neural networks is developed
and applied to face detection. To extend the detection ability in orientation
and to decrease the number of false alarms, different combinations of networks
are tested: ensemble, conditional ensemble and conditional mixture of networks.
The use of a conditional mixture of networks allows to obtain state of the art
results on different benchmark face databases.
 Hippocampal Model of Rat Spatial Abilities Using Temporal Difference Learning  David J. Foster, Richard G. M. Morris, Peter Dayan
 We present a model of the hippocampus dependent performance of rats in
a delayed matchtoplace spatial watermaze learning task. Real rats
show slow learning in the first few days of training; but ultimately
exhibit onetrial learning. We model the early performance using a
temporal difference (TD) reinforcement learning architecture, using
place cells in the hippocampus as the representation of space. We
model late performance using a coordinate learning system which also
employs TD learning and place cells. We show a simple integration of
these two navigation models.
 Bayesian Model of Surface Perception  William T. Freeman, Paul A. Viola
 Image intensity variations can result from several different object
surface effects, including shading from 3dimensional relief of the
object, or paint on the surface itself. An essential problem in
vision, which people solve naturally, is to attribute the proper
physical cause, e.g. surface relief or paint, to an observed image.
We addressed this problem with an approach combining psychophysical
and Bayesian computational methods.
 Regularisation in Sequential Learning Algorithms  Jo\~ao F. G. de Freitas, Mahesan Niranjan, Andrew H. Gee
 In this paper we discuss regularisation in online/sequential
learning algorithms. In environments where data arrives sequentially,
techniques such as crossvalidation to achieve regularisation or
model selection are not possible. Further, bootstrapping to determine a
confidence level is not practical. To surmount these problems, a minimum
variance estimation approach that makes
use of the extended Kalman algorithm for training multilayer
perceptrons is employed. The novel contribution
of this paper is to show the theoretical links between extended
Kalman filtering, Sutton's variable learning rate algorithms and Mackay's
Bayesian estimation framework. In doing so, we propose algorithms
to overcome the need for heuristic choices of the initial conditions and
noise covariance matrices in the Kalman approach.
 A Revolution: Belief Propagation in Graphs with Cycles  Brendan J. Frey, David J. C. MacKay
 Until recently, artificial intelligence researchers have frowned upon
the application of probability propagation in Bayesian belief networks
that have cycles. The probability propagation algorithm is only exact
in networks that are cyclefree. However, it has recently been discovered
that the two best practical erorcorrecting decoding algorithms are
actually performing probability propagation in belief networks with
cycles.
 Features as Sufficient Statistics  Davi Geiger, Archisman Rudra, Laurance T. Maloney
 An image is often represented by a set of detected features. We get
an enormous compression by representing images in this
way. Furthermore, we get a representation which is little affected by
small amounts of noise in the image. However, features are typically
chosen in an ad hoc manner. We show how a good set of features can be
obtained using sufficient statistics. The idea of sparse data representation
naturally arises. We treat the 1dimensional and
2dimensional signal reconstruction problem to make our ideas
concrete.
 Hierarchical Nonlinear Factor Analysis and Topographic Maps  Zoubin Ghahramani, Geoffrey E. Hinton
 We first describe a hierarchical, generative model that can be viewed
as a nonlinear generalization of factor analysis and can be
implemented in a neural network. The model performs perceptual
inference in a probabilistically consistent manner by using topdown,
bottomup and lateral connections. These connections can be learned
using simple rules that require only locally available information.
We then show how to incorporate nonadaptive lateral connections into
the generative model. The model extracts a sparse, distributed,
hierarchical representation of depth from simplified randomdot
stereograms and the localized disparity detectors in the first hidden
layer form a topographic map.
 Regression with Inputdependent Noise: A Gaussian Process Treatment  Paul W. Goldberg, Christopher K. I. Williams, Christopher M. Bishop
 Gaussian processes provide natural nonparametric prior distributions
over regression functions. In this paper we consider regression
problems where there is noise on the output, and the variance of the
noise depends on the inputs. If we assume that the noise is a smooth
function of the inputs, then it is natural to model the noise variance
using a second Gaussian process, in addition to the Gaussian process
governing the noisefree output value. We show that the posterior
distribution of the noise rate can be sampled using Gibbs
sampling. Our results on a synthetic data set give a posterior
variance that wellapproximates the true variance. We also show that
the predictive likelihood of a test data set approximates the true
likelihood better under this model than under a uniform noise model.
 Generalization in Decision Trees and DNF: Does Size Matter?  Mostefa Golea, Peter Bartlett, Wee Sun Lee, Llew Mason
 Recent theoretical results for pattern classification with thresholded
realvalued functions (such as support vector machines, sigmoid
networks, and boosting) give bounds on misclassification probability
that do not depend on the size of the classifier, and hence can be
considerably smaller than the bounds that follow from the VC theory.
In this paper, we show that these techniques can be more widely applied,
by representing other boolean functions as twolayer neural networks
(thresholded convex combinations of boolean functions). For example,
we show that with high probability any decision tree of depth no more
than $d$ that is consistent with $m$ training examples has
misclassification probability no more than
$O \left( \left( \frac{1}{m} \left( N_{\mbox{eff}} \, \overline{d} \,
\mbox{VC}_{\mbox{dim}} ({\cal U}) \,
\log^2 m \log d\right)\right)^{1/2}\right)$,
where ${\cal U}$ is the class of node decision functions,
$N_{\mbox{eff}} \le N$ can
be thought of as the effective number of leaves (it becomes small as
the distribution on the leaves induced by the training data gets far
from uniform), and $\overline{d}$ is the average leaf depth (averaged over
the training data). This bound is qualitatively different from the
VC bound and can be considerably smaller.
 A Mathematical Model of Axon Guidance by Diffusible Factors  Geoffrey J. Goodhill
 In the developing nervous system, gradients of targetderived
diffusible factors play an important role in guiding axons to
appropriate targets. What shape does a gradient set up by diffusion
have, and over what range is it capable of guiding axons? A simple
mathematical model suggests the answer is about 1 mm. This value fits
well with experimental data.
 Gradients for Retinotectal Mapping  Geoffrey J. Goodhill
 The initial activityindependent formation of a topographic map in the
retinotectal system has long been thought to rely on the matching of
molecular cues expressed in gradients in the retina and the tectum.
However, direct experimental evidence for the existence of such
gradients has only emerged in the past two years. The new data has
provoked the discussion of a new set of models in the experimental
literature. Here, these models are subjected to a rigourous
theoretical analysis.
 Linear Concepts and Hidden Variables: An Empirical Study  Adam J. Grove, Dan Roth
 A fundamental issue the arises in deciding which approach to choose
is the robustness of the density estimation approach to deviations
from the postulated model. We study this question experimentally in a
restricted, yet nontrivial and interesting case: the model considered
here is a {\em conditionally independent attribute\/} (CIA) model
which postulates a single binaryvalued {\em hidden variable\/} $z$ on
which all other attributes (i.e. the target and the observables)
depend. We observe that this model has the property that finding the
most likely value of any one variable (given known values for the
others) reduces to testing a linear function of the observed values.
 Detection of First and Second Order Motion  Alexander Grunewald, Heiko Neumann
 A model of motion detection is presented. The model contains three
stages. The first stage is unoriented and is selective for contrast
polarities. The next two stages work in parallel. A phase {\em
insensitive} stage pools across different contrast polarities through
a spatiotemporal filter and thus can detect first and second order
motion. A phase {\em sensitive} stage keeps contrast polarities
separate, each of which is filtered through a spatiotemporal filter,
and thus only first order motion can be detected. Differential phase
sensitivity can therefore account for the detection of first and
second order motion. Phase insensitive detectors correspond to
cortical complex cells, and phase sensitive detectors to simple cells.
 A Neural Network Model of Naive Preference and Filial Imprinting in the Domestic Chick  Lucy E. Hadden
 Filial imprinting in domestic chicks is of interest in psychology,
biology, and computational modeling because it exemplifies simple,
rapid, innately programmed learning which is biased toward learning
about some objects. Horn et al.\ have recently discovered a naive
visual preference for heads and necks which develops over the course
of the first three days of life. These interactions are reminiscent
of those in human language acquisition. The neurological basis of
this predisposition is almost entirely unknown; that of
imprintingrelated learning is fairly clear. This project is the
first model of the predisposition consistent with what is known about
learning in imprinting. The model develops the predisposition
appropriately, learns to ``approach'' a training object, and
replicates one interaction between the two processes. Future work
will replicate more interactions between imprinting and the
predisposition in chicks, and analyze why the system works.
 An Improved Policy Iteration Algorithm for Partially Observable MDPs  Eric A. Hansen
 A policy iteration algorithm for partially observable Markov decision
processes is described that is simpler and more efficient than an
earlier policy iteration algorithm of Sondik. The key simplification
is the representation of a policy as a finitestate controller. The
dynamicprogramming update used in the policy improvement step is
interpreted as the transformation of a finitestate controller into an
improved finitestate controller. Empirical testing shows that this
policy iteration algorithm outperforms value iteration on a range of
examples.
 An Analog VLSI Model of the Fly Elementary Motion Detector  Reid R. Harrison, Christof Koch
 Flies are capable of rapidly detecting and integrating visual motion
information in behaviorlyrelevant ways. The first stage of visual
motion processing in flies is a retinotopic array of functional units
known as elementary motion detectors (EMDs). Several decades ago,
Reichardt and colleagues developed a correlationbased model of motion
detection that described the behavior of these neural circuits. We
have implemented a variant of this model in a 2.0micron analog CMOS
VLSI process. The result is a lowpower, continuoustime analog
circuit with integrated photoreceptors that responds to motion in real
time. The responses of the circuit to drifting sinusoidal gratings
qualitatively resemble the temporal frequency response, spatial
frequency response, and direction selectivity of motionsensitive
neurons observed in insects. In addition to its possible engineering
applications, the circuit could potentially be used as a building
block for constructing hardware models of higherlevel insect motion
integration.
 Classification by Pairwise Coupling  Trevor Hastie, Robert Tibshirani
 We discuss a strategy for polychotomous classification that
involves estimating class probabilities for each pair of classes,
and then coupling the estimates together. The coupling model
is similar to the BradleyTerry method for paired
comparisons. We study the nature of the
class probability estimates that arise, and examine the performance
of the procedure in real and simulated datasets.
Classifiers used include linear discriminants, nearest neighbors,
and the support vector machine.
 Unsupervised Online Learning of Decision Trees for Hierarchical Data Analysis  Marcus Held, Joachim M. Buhmann
 An adaptive online algorithm is proposed to estimate hierarchical data
structures for nonstationary data sources. The approach is based on
the principle of {\it minimum cross entropy}
to derive a decision tree for data clustering and on a
{\it metalearning} (learning of learning) idea to adapt to changing
data characteristics. Its efficiency is demonstrated by grouping
nonstationary artifical data and by hierarchical segmentation of
LANDSAT images.
 A Simple and Fast Neural Network Approach to Stereovision  Rolf D. Henkel
 A neural network approach to stereovision is presented based on aliasing
effects of simple disparity estimators and a fast coherencedetection
scheme. Using only realistic neural circuitry, the network calculates
dense disparity maps with an associated verification count. It
simultaneously fuses the left and right input images into a single
cyclopean view.
 Selecting Weighting Factors in Logarithmic Opinion Pools  Tom Heskes
 We study the combination of experts' probability assessments in a
logarithmic opinion pool. The error of the pool is shown to be smaller
or equal than the average error of individual experts. A decomposition
in terms of individual errors of and distances between experts leads to
a robust and relatively straightforward procedure for selecting
weighting factors. Our general description includes regression models,
models for computing variances, and classification models.
 A 1,000Neuron System with One Million 7bit Physical Interconnections  Yuzo Hirai
 An asynchronous PDM (PulseDensityModulating) digital neural network
system has been developed in our laboratory. It consists of one
thousand neurons that are physically interconnected via one million
7bit synapses. It can solve one thousand simultaneous nonlinear
firstorder differential equations in a fully parallel and continuous
fashion. The performance of this system was measured by a
winnertakeall network with one thousand neurons. Although the
magnitude of the input and network parameters were identical for each
competing neuron, one of them won in 6 milliseconds because of the
asynchronous behavior of the system. A broad range of neural networks
including spatiotemporal filtering, feedforward, and feedback networks
can be run by loading appropriate network parameters from a host
system.
 MELONET I: Neural Nets for Inventing BaroqueStyle Chorale Variations  Dominik Hörnel
 MELONET I is a multiscale neural network system producing
baroquestyle melodic variations. Given a melody, the system invents a
fourpart chorale harmonization and a variation of any chorale voice,
after being trained on music pieces of composers like J. S. Bach and
J. Pachelbel. Unlike earlier approaches to the learning of melodic
structure, the system is able to learn and reproduce highorder
structure like harmonic, motif and phrase structure in melodic
sequences. This is achieved by using mutually interacting feedforward
networks operating at different time scales, in combination with
Kohonen networks to classify and recognize musical structure. The
results are chorale partitas in the style of J. Pachelbel. Their
quality has been judged by experts to be comparable to improvisations
invented by an experienced human organist.
 Nonlinear Markov Networks for Continuous Variables  Reimar Hofmann, Volker Tresp
 In this paper we address the problem of learning the structure in
nonlinear Markov networks with continuous variables. Markov networks
are well suited to model relationships which do not exhibit a natural
causal ordering. We represent the quantitative relationships between
variables using neural networks as models for conditional probability
densities.
This approach is well suited for inference by Gibbs sampling.
Using a financial and a sociological data set we show that interesting
structures can be found using our approach.
 Active Data Clustering  Thomas Hofmann, Joachim M. Buhmann
 Active data clustering is a novel technique for clustering of
proximity data which utilizes principles from sequential experiment
design in order to interleave data generation and data analysis. The
proposed active data sampling strategy is based on the expected
value of information, a concept rooted in statistical decision
theory. This is considered as an important step towards the analysis of
largescale data sets, because it offers a way to overcome the
inherent data sparseness of proximity data. We present applications to
unsupervised texture segmentation in computer vision and information
retrieval in document databases.
 Computing with Action Potentials {\rm\bf (Invited Talk)}  John J. Hopfield, Carlos D. Brody, Sam Roweis
 Most computational engineering based loosely on biology, whether
algorithmic or hardware, uses continuous variables and sigmoid units.
Most neurons communicate with action potentials, and the engineering view
is equivalent to using a ratecode for the representation of information
and computing. There are an increasing number of examples in biology where
this may not be the case. Information can be represented using the timing
of action potentials, and computed with in this representation.
The ``analog match'' problem of olfaction is one of the simplest problems which
can be efficiently solved using action potential timing and an underlying
rhythm. By using units which adapt, and using the adaptation to effect a
fundamental change of representation of a problem, the problem of
recognizing words in connected speech with uniform timewarp within a word
can be mapped into the olfaction problem. The architecture and preliminary
results of such a system (with Sam T. Roweis) will be described. Using
the fast events of biology in conjunction with an underlying rhythm gets
away from the limits of an eventdriven view of computation. When the
intrinsic hardware is much faster than the time scale of change of inputs,
it can greatly increase the computation which can be done per unit time on
a given quantity of hardware.
 Adaptation in Speech Motor Control  John F. Houde, Michael I. Jordan
 Human subjects are known to adapt their motor behavior to a shift of
the visual field brought about by wearing prism glasses over their
eyes. We have studied the analog of this phenomenon in the speech
domain. Utilizing a device that can feed back transformed speech
signals in real time, we exposed subjects to phonetically sensible,
online perturbations of their own speech patterns. We found that
speakers learn to adjust their production of a vowel to compensate for
feedback alterations that change the vowel's perceived phonetic
identity; moreover, the effect generalizes across phonetic contexts.
This phenomenon provides a new tool for probing the nature of the
sensorimotor control system underlying human speech production.
 Function Approximation with the Sweeping Hinge Algorithm  Don R. Hush, Fernando Lozano, Bill Horne
 We present a computationally efficient algorithm for
function approximation with piecewise linear sigmoidal nodes.
A one hiddenlayer network is constructed one node at a time using
the method of fitting the residual. The task of fitting individual
nodes is accomplished using a new algorithm that searchs
for the best fit by solving a sequence of Quadratic Programming problems.
This approach offers significant advantages over derivativebased
search algorithms (e.g. backpropagation and its extensions).
Unique characteristics of this algorithm include: finite
step convergence, a simple stopping criterion, a deterministic
methodology for seeking ``good'' local minima, good scaling properties
and a robust numerical implementation.
 New Approximations of Differential Entropy for Independent Component Analysis and Projection Pursuit  Aapo Hyv\"{a}rinen
 We derive a firstorder approximation of the density of maximum
entropy for a continuous 1D random variable, given a number of simple
constraints. This results in a density expansion which is somewhat
similar to the classical polynomial density expansions by
GramCharlier and Edgeworth. Using this approximation of density, an
approximation of 1D differential entropy is derived. The
approximation of entropy is both more exact and more robust against
outliers than the classical approximation based on the polynomial
density expansions, without being computationally more expensive. The
approximation has applications, for example, in independent component
analysis and projection pursuit.
 A Model of Early Visual Processing  Laurent Itti, Jochen Braun, Dale K. Lee, Christof Koch
 We propose a model for early visual processing in primates. The model
consists of a population of linear spatial filters which interact
through nonlinear excitatory and inhibitory pooling. Statistical
estimation theory is then used to derive human psychophysical
thresholds from the responses of the entire population of units. The
model is able to reproduce human thresholds for contrast and
orientation discrimination tasks, and to predict contrast thresholds
in the presence of masks of varying orientation and spatial frequency.
 The Error Coding and Substitution PaCTs  Gareth James, Trevor Hastie
 A new class of plug in classification techniques have
recently been developed in the statistics and machine learning literature.
A plug in classification technique (PaCT) is a method that takes a
standard classifier (such as LDA or TREES) and plugs it into an
algorithm to produce a new classifier. The standard classifier is
known as the Plug in Classifier (PiC). These methods often
produce large improvements over using a
single classifier. In this paper we investigate one of these
methods and give some motivation for its success.
 Wavelet Models for Video TimeSeries  Sheng Ma, Chuanyi Ji
 In this work, we tackle the problem of timeseries modeling
posed by recent discoveries on variable bit rate video traffic
and Ethernet data traffic, and
establish a wavelet model on the timeseries.
Different from the existing methods which
model the timeseries in the time domain,
we model the wavelet
coefficients in the wavelet domain.
The strength of the wavelet model includes (1)
a unified approach to model
both the longrange and the shortrange dependence in the video and data
traffic
simultaneously, $(2)$ a computationally
efficient method of developing the model and
generating high quality video traffic, and $(3)$ feasibility of
performance analysis using the model. Analysis and insights will be given to
show why wavelets provide excellent models for the timeseries.
 Extended ICA Removes Artifacts from Electroencephalographic Recordings  TzyyPing Jung, Colin Humphries, TeWon Lee, Scott Makeig, Martin J. McKeown, Vicente Iragui, Terrence J. Sejnowski
 Severe contamination of electroencephalographic (EEG)
activity by eye movements, blinks, and muscle, heart and
line noise presents a serious problem for EEG interpretation
and analysis. Rejecting contaminated EEG segments results in
a considerable loss of data and may be impractical for
clinical data. Many methods have been proposed to remove eye
movement and blink artifacts from EEG recordings. Often
regression in the time or frequency domain is performed on
simultaneous EEG and electrooculographic (EOG) recordings to
derive parameters characterizing the appearance and spread
of EOG artifacts in the EEG channels. However, EOG also
carries brain signal (Peters, 1967; Oster \& Stern, 1980),
so regressing out eye artifacts inevitably involves
subtracting a portion of the relevant EEG signal from each
record as well. This method cannot be used for muscle noise
or line noise for which there is no reference channel for
regression. Here, we propose a new and generally applicable
method for removing a wide variety of artifacts from EEG
records. The method is based on an extended version of an
Independent Component Analysis (ICA) algorithm (Bell \&
Sejnowski, 1995) for performing blind source separation on
linear mixtures of independent source signals that can be
sub or superGaussian. Our results show that ICA can
effectively detect, separate and remove the activity of a
wide variety of artifactual sources in EEG records, with
results comparing favorably to those obtained using
regression methods.
 Boltzmann Machine Learning Using Mean Field Theory and Linear Response Correction  Hilbert J. Kappen, F. B. Rodr\'{\i}guez
 The learning process in Boltzmann Machines is computationally intractible.
We present a new approximate learning algorithm
for Boltzmann Machines, which is based on mean field theory and
the linear response theorem. The computational complexity of the
algorithm is cubic in the number of neurons.
 SMap: A Network with a Simple SelfOrganization Algorithm for Generative Topographic Mappings  Kimmo Kiviluoto, Erkki Oja
 The SMap is a network with a simple learning algorithm that
combines the selforganization capability of the SelfOrganizing Map
(SOM) and the probabilistic interpretability of the Generative
Topographic Mapping (GTM). The algorithm is shown to minimize the
same error function as the GTM  the negative log likelihood  but
when compared to the GTM, the SMap seems to have a stronger tendency
to selforganize from random initial configuration. The SMap
algorithm can be further simplified to employ pure Hebbian learning,
without changing the qualitative behaviour of the network.
 Relative Loss Bounds for Multidimensional Regression Problems  Jyrki Kivinen, Manfred K. Warmuth
 We study online generalized linear regression
with multidimensional outputs, i.e., neural networks
with multiple output nodes but no hidden nodes.
We allow at the final layer transfer functions such as the softmax
function that need to consider the linear activations to all
the output neurons.
We also use a parameterization function which transforms parameter
vectors maintained by the algorithm into the actual weights.
The online algorithm we consider updates the parameters in an
additive manner, analogous to the delta rule, but because the actual
weights are obtained via the possibly nonlinear parameterization
function they may behave in a very different manner.
Our approach is based on applying the notion of a matching loss
function in two different contexts.
First, we measure the loss of the algorithm in terms of the loss that
matches the transfer function used to produce the outputs.
Second, the loss function that matches the
parameterization function can be used both as a measure of distance
between models in motivating the update rule of the algorithm and as a
potential function in analyzing its relative performance compared to
an arbitrary fixed model.
As a result, we have a unified treatment that generalizes earlier
results for the Gradient Descent and Exponentiated Gradient algorithms
to multidimensional outputs, including multiclass logistic regression.
 Analysis of Drifting Dynamics with Neural Network Hidden Markov Models  Jens Kohlmorgen, KlausRobert M\"{u}ller, Klaus Pawelzik
 A method for the analysis of nonstationary time series with
multiple operating modes is presented.
In particular, it is possible to detect and to model a switching of
the dynamics and also a less abrupt, time consuming drift
from one mode to another. This is achieved by an unsupervised
algorithm that segments the data according to inherent modes, and a
subsequent search through the space of possible drifts.
An application to
physiological wake/sleep data demonstrates that analysis and modeling
of realworld time series can be improved when the drift paradigm is
taken into account. In the case of wake/sleep data, we hope to gain
more insight into the physiological
processes that are involved in the transition from wake to sleep.
 Asymptotic Theory for Regularization: OneDimensional Linear Case  Petri Koistinen
 The generalization ability of a neural
network can sometimes be improved dramatically by regularization.
To analyze the improvement one needs more refined
results than the asymptotic distribution of the weight vector.
We study the simple case of onedimensional linear regression,
where we derive expansions for the
optimal regularization parameter and the ensuing improvement.
It is possible to construct examples where it
is best to use no regularization.
 Perturbative MSequences for Auditory Systems Identification  Mark Kvale, Christoph E. Schreiner
 In this paper we present a new method for studying auditory systems
based on msequences. The method allows us to perturbatively study
the linear response of the system in the presence of various other
stimuli, such as speech or sinusoidal modulations. This allows one
to construct linear kernels (receptive fields) at the same time that
other stimuli are being presented. Using the method we calculate
the modulation transfer function of single units in the inferior
colliculus of the cat at different operating points and discuss
nonlinearities in the response.
 Learning Humanlike Knowledge by Singular Value Decomposition: A Progress Report  Thomas K. Landauer, Darrell Laham, Peter Foltz
 Singular value decomposition (SVD) can be viewed as a method for
unsupervised training of a network that associates two classes of events
reciprocally by linear connections through a single hidden layer. SVD was
used to learn and represent relations among very large numbers of words
(20k60k) and very large numbers of natural text passages (1k70k) in which
they occurred. The result was 100350 dimensional ``semantic spaces'' in
which any trained or newly added word or passage could be represented as a
vector, and similarities were measured by the cosine of the contained angle
between vectors. Good accuracy in simulating human judgments and behaviors
has been demonstrated by performance on multiplechoice vocabulary and
domain knowledge tests, emulation of expert essay evaluations, and in
several other ways. Examples are also given of how the kind of knowledge
extracted by this method can be applied.
 A Generic Approach for Identification of Event Related Brain Potentials via a Competitive Neural Network Structure  Daniel H. Lange, Hava T. Siegelmann, Hillel Pratt, Gideon F. Inbar
 We present a novel generic approach to the problem of Event Related
Potential identification and classification, based on a competitive Neural Net
architecture. The network weights converge to the embedded signal patterns,
resulting in the formation of a matched filter bank.
The network performance is analyzed via a simulation study, exploring
identification robustness under low SNR conditions and compared to the expected
performance from an information theoretic perspective. The classifier is applied
to real eventrelated potential data recorded during a classic oddball type
paradigm; for the first time, withinsession variable signal patterns are
automatically identified, dismissing the strong and limiting requirement of
apriori stimulusrelated selective grouping of the recorded data.
The results present new possibilities in evoked potential research.
 A Neural Network Based Head Tracking System  Daniel D. Lee, H. Sebastian Seung
 We have constructed an inexpensive, videobased, motorized tracking
system that learns to track a head. It uses real time graphical user
input as a supervisory signal to train a convolutional neural network.
The inputs to the neural network consist of normalized luminance and
chrominance images and motion information from frame differences.
Subsampled images are also used to provide scale invariance. During
the online training phase, the neural network adjusts the input
weights depending upon the reliability of the different channels in
the surrounding environment. This allows the system to robustly track
a head even when other objects are moving within a cluttered background.
 Two Approaches to Optimal Annealing  Todd K. Leen, Bernhard Schottky, David Saad
 In recent years, two theoretical approaches have been
brought to bear on the analysis of online learning. The
master equation describes the evolution of the
probability density over weight space, i.e. the density
for the model parameters. Approximations to the master equation
provide partial differential equations for the evolution of the
density. From this density, objects of
experimental interest including the generalization error,
and moments of the weight error can be computed.
 Multimodular Associative Memory  Nir Levy, David Horn, Eytan Ruppin
 Motivated by the findings of modular structure in the
cortex, we study
a multimodular model of associative memory that
successfully stores memory patterns with different levels of activity.
We differentiate between two types of synaptic efficacies, intramodular
and intermodular ones. The latter signals undergo an additional nonlinear
processing before reaching the soma.
Compared with the conventional,
singlemodule associative memory network, this multimodular network has
two main advantages: it is less susceptible to damage to columnar
input, and its response is more consistent with the cognitive data pertaining
to category specific impairments.
 Inferring Sparse, Overcomplete Image Codes Using an Efficient Coding Framework  Michael S. Lewicki, Bruno A. Olshausen
 We apply a general technique for learning overcomplete bases
(Lewicki and Sejnowski, 1997) to the problem of finding efficient
image codes. The bases learned by the algorithm are localized,
oriented, and bandpass, consistent with earlier results obtained using
different methods (Olshausen and Field, 1996; Bell and Sejnowski
1997). We show that higher degrees of overcompleteness produce bases
which have much greater likelihood and results in a Gaborlike basis
with greater sampling density in position, orientation, and scale.
This framework also allows different bases to be compared objectively
by calculating their probability given the observed data. Compared to
the complete and overcomplete Fourier and wavelet bases, the learned
bases have much greater probability and thus have the potential to
yield better coding efficiency. We demonstrate the improvement in the
representation of the learned bases by showing superior noise
reduction properties.
 Learning Nonlinear Overcomplete Representations for Efficient Coding  Michael S. Lewicki, Terrence J. Sejnowski
 We derive a learning algorithm for inferring an overcomplete basis by
viewing it as probabilistic model of the observed data. Overcomplete
bases allow for better approximation of the underlying statistical
density. Using a Laplacian prior on the basis coefficients removes
redundancy and leads to representations that are sparse and are a
\emph{nonlinear} function of the data. This can be viewed as a
generalization of the technique of independent component analysis and
provides a method for blind source separation of fewer mixtures than
sources. We demonstrate the utility of overcomplete representations
on natural speech and show that compared to the traditional Fourier
basis the inferred representations potentially have much greater
coding efficiency.
 Visual Navigation in a Robot Using ZigZag Behavior  M. Anthony Lewis
 We implement a model of obstacle avoidance in flying insects on a
small, monocular robot. The result is a system that is capable of rapid
navigation through a dense obstacle field. The key to the system is the use
of zigzag behavior to articulate the body during movement. It is shown that
this behavior compensates for a parallax blind spot surrounding the focus of
expansion normally found in systems without zigzag behavior. The system
models the cooperation of several behaviors: halterocular response (similar
to VOR), optomotor response, and the parallax field computation and mapping
to motor system itself. The resulting system is neurally plausible,
simple, and should be easily hostable on aVLSI hardware.
 Factorizing Multivariate Function Classes  Juan K. Lin
 The problem of local independent component analysis is
formulated and solved in this paper. In contrast to
iterative nonlocal parametric density estimation algorithms
for source separation, here only local information is
used and an analytic solution for source separation is
obtained. The local nature of this approach allows for
separation of general smooth invertible mixing transformations,
while the analytic nature allows for a proper treatment of
source separation in the presence of uncertainty.
The mathematical framework presented extends
independent component analysis to a general factorization
of multivariate functions. Numerical examples and a
tie to decorrelation are presented.
 Silicon Retina with Adaptive Filtering Properties  ShihChii Liu
 This paper describes a small, compact circuit that captures
the adaptation properties of
the photoreceptor layer and the adaptive temporal filtering properties
of the laminar layer of the fly.
This circuit uses only six transistors and two capacitors,
and is operated in the subthreshold domain. The circuit has
a low DC gain and a high transient gain.
The adaptation time constant of the circuit can be
controlled via an external bias. Its temporal filtering
properties change with the background intensity or
signaltonoise conditions. The frequency response
of the circuit shows that in the frequency range of 1 to 100 Hz,
the circuit response goes from highpass filtering under
high light levels to lowpass filtering under
low light levels (i.e., when the signaltonoise ratio is low).
A chip with $20\times20$ pixels has been fabricated in
1.2 $\mu$m nwell technology.
 2D Observers for Human 3D Object Recognition?  Zili Liu, Daniel Kersten
 Converging evidence has shown that human object recognition depends on
observers' familiarity with objects' appearance. The more similar the
objects are, the stronger this dependence will be, and the more
important twodimensional (2D) image information will be. The degree to which 3D
structural information is used, however, remains an area of strong
debate. Previously, we showed that all models that allow rotations in
the image plane of independent 2D templates could not account for
human performance in discriminating novel object views. We now
present results from models of generalized radial basis functions
(GRBF), 2D nearest neighbor matching that allows 2D affine
transformations, and a Bayesian statistical
estimator that integrates over all possible 2D affine transformations.
The performance of the human observers relative to each of the
models is better for the novel views than for the template views,
suggesting that humans generalize better to novel views from template
views. The Bayesian estimator yields the optimal
performance with 2D affine transformations and independent 2D
templates. Therefore, no models of 2D affine operations with
independent 2D templates account for the human performance.
 Effects of Spike Timing Underlying Binocular Integration and Rivalry in a Neural Model of Early Visual Cortex  Erik D. Lumer
 A neural network that models the mammalian early visual system is
introduced to gain insight into possible mechanisms operating
during binocular vision. In the model, the desynchronized firing
of cortical neurons that first combine inputs from the two eyes
produces rivalrous activity patterns at later stages in the visual
pathway. By contrast, synchronization of firing among these cells
prevents such competition. The temporal coordination of cortical
activity and its effects on neural competition emerge naturally
from the network connectivity and from its dynamics. These results
suggest that inputrelated differences in relative spike timing
at an early stage of visual processing may give rise to the
phenomena both of perceptual integration and rivalry in binocular
vision.
 Dynamic Stochastic Synapses as Computational Units  Wolfgang Maass, Anthony M. Zador
 In most neural network models, synapses are treated as static weights
that change only on the slow time scales of learning. In fact,
however, synapses are highly dynamic, and show usedependent
plasticity over a wide range of time scales. Moreover, synaptic
transmission is an inherently stochastic process: a spike arriving at
a presynaptic terminal triggers release of a vesicle of
neurotransmitter from a release site with a probability that can be
much less than one. Changes in release probability represent one of
the main mechanisms by which synaptic efficacy is modulated in neural
circuits.
 Synaptic Transmission: An InformationTheoretic Perspective  Amit Manwani, Christof Koch
 Here we analyze synaptic transmission from an informationtheoretic
perspective. We derive closedform expressions for the lowerbounds
on the capacity of a simple model of a cortical synapse under two
explicit coding paradigms. Under the ``signal estimation'' paradigm,
we assume the signal to be encoded in the mean firing rate of a
Poisson neuron. The performance of an optimal linear estimator of
the signal then provides a lower bound on the capacity for signal
estimation. Under the ``signal detection'' paradigm, the presence
or absence of the signal has to be detected (YesNo task).
Performance of the optimal spike detector allows us to compute a
lower bound on the capacity for signal detection. We find that
single synapses (for empirically measured parameter values)
transmit information poorly but significant improvement can be
achieved with a small amount of redundancy.
 Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks  Peter Marbach, Oliver Mihatsch, Miriam Schulte, John N. Tsitsiklis
 In integrated service communication networks, an important problem is to
excercise call admission control and routing so as to optimally use the network
resources. This problem is naturally formulated as a dynamic programming
problem, which, however, is too complex to be solved exactly. We use
methods of reinforcement learning (RL), together with a decomposition
approach, to find call admission control and routing policies. The
performance of our policy for a network with approximately $10^{45}$ states
is compared with a commonly used heuristic policy.
 A Framework for MultipleInstance Learning  Oded Maron, Tom\'{a}s LozanoP\'{e}rez
 Multipleinstance learning is a variation on supervised
learning, where the task is to learn a concept given positive and
negative bags of instances. Each bag may contain many instances, but
a bag is labeled positive even if only one of the instances in it
falls within the concept. A bag is labeled negative only if all the
instances in it are negative. We describe a new general framework,
called Diverse Density, for solving multipleinstance learning
problems. We apply this framework to learn a simple description of a
person from a series of images (bags) containing that person, to a
stock selection problem, and to the drug activity prediction problem.
 An Application of ReversibleJump MCMC to Multivariate Spherical Gaussian Mixtures  Alan D. Marrs
 Applications of Gaussian mixture models occur frequently in the fields
of statistics and artificial neural networks. One of the key issues
arising from any mixture model application is how to estimate the
optimum number of mixture components. This paper extends the
ReversibleJump Monte Carlo Markov Chain algorithm to the case of
multivariate spherical Gaussian mixtures using a hierarchical prior
model. Using this method the number of mixture components is no longer
fixed but becomes a parameter of the model which we shall
estimate. The reversiblejump MCMC algorithm is capable of moving
between parameter subspaces which correspond to models with different
numbers of mixture components. As a result a sample from the full
joint distribution of all unknown model parameters is generated. The
technique is then demonstrated on a well known classification problem.
 Estimating Dependency Structure as a Hidden Variable  Marina Meil\u{a}, Michael I. Jordan
 This paper introduces the Mixture of Trees, a probability model that can
account for sparse, but dynamically changing dependence relationships
between the variables of the domain under study. We present a family of
efficient algorithms that use EM and the Maximum Spanning Tree algorithm
to find the Maximum Likelihood and the MAP Mixture of Trees for a
variety of priors, including the Dirichlet and the MDL priors.
 Structural Risk Minimization for Nonparametric Time Series Prediction  Ron Meir
 The problem of time series prediction is studied within the uniform
convergence framework of Vapnik and Chervonenkis. The dependence inherent
in the temporal structure is incorporated into the analysis, thereby
generalizing the available theory for memoryless processes. Finite sample
bounds are calculated in terms of covering numbers of the approximating
class, and the tradeoff between approximation and estimation is discussed.
Vapnik's structural risk minimization procedure is applied to achieve
consistency, and convergence rates are established within a nonparametric
setting.
 Toward a SingleCell Account for Binocular Disparity Tuning: An Energy Model May Be Hiding in Your Dendrites  Bartlett W. Mel, Daniel L. Ruderman, Kevin A. Archie
 Hubel and Wiesel (1962) proposed that complex cells in visual
cortex are driven by a pool of simple cells with the same
preferred orientation but different spatial phases. However, a
wide variety of experimental results over the past two decades
have challenged the pure hierarchical model, primarily by
demonstrating that many complex cells receive monosynaptic
input from unoriented LGN cells, or do not depend on simple
cell input. We recently showed using a detailed biophysical
model that nonlinear interactions among synaptic inputs to an
excitable dendritic tree could provide the nonlinear subunit
computations that underlie complex cell responses
(Mel, Ruderman, \& Archie 1997). Our present work extends this result
to the case of complex cell binocular disparity tuning, by
demonstrating in an isolated model pyramidal cell (1) disparity
tuning at a resolution much finer than the the overall
dimensions of the cell's receptive field, and (2)
systematically shifted optimal disparity values for rivalrous
pairs of light and dark barsboth in good agreement with
published reports (Ohzawa, Deangelis, \& Freeman, 1997). Our results
reemphasize the potential importance of intradendritic
computation for binocular visual processing in particular, and
for cortical neurophysiology in general.
 Combining Classifiers Using Correspondence Analysis  Christopher J. Merz
 Several effective methods for improving the performance of a single
learning algorithm have been developed recently. The general approach
is to to create a set of learned models by repeatedly applying the
algorithm to different versions of the training data, and then combine
the learned models' predictions according to a prescribed voting
scheme. Little work has been done in combining the predictions of a
collection of models generated by many learning algorithms having
different representation and/or search strategies. This paper
describes a method which uses the strategies of stacking and
correspondence analysis to model the relationship between the learning
examples and the way in which they are classified by a collection of
learned models. A nearest neighbor method is then applied within
the resulting representation to classify previously unseen examples.
The new algorithm consistently performs as well or better than other
combining techniques on a suite of data sets.
 Serial Order in Reading Aloud: Connectionist Models and Neighborhood Structure  Jeanne C. Milostan, Garrison W. Cottrell
 DualRoute and Connectionist SingleRoute models of reading have been
at odds over claims as to the correct explanation of the reading process.
Recent DualRoute models predict that subjects should show an increased
naming latency for irregular words when the irregularity is earlier in the
word  a prediction that has been confirmed in human experiments.
Coltheart \& Rastle (1994) claim that SingleRoute connectionist models
cannot account for this latency by positionofirregularity interaction.
A refutation of this claim is presented here, consisting of network models
which do show the interaction, along with orthographic neighborhood
statistics that explain the effect.
 Learning Path Distributions Using Nonequilibrium Diffusion Networks  Paul Mineiro, Javier Movellan, Ruth J. Williams
 Diffusion networks are a natural extension of recurrent neural networks in
which the dynamics are probabilistic. In this paper we derive the gradient
of the loglikelihood of a path with respect to the drift parameters for a
diffusion network. This gradient can be used to optimize a diffusion
network in the nonequilibrium regime for a wide variety of problems,
including reinforcement learning, filtering and
prediction, signal detection, and continuous path density
estimation. An aspect of our work which is of interest to computational
neuroscience and hardware design is that the obtained gradient is local in
space and time, i.e., no time unfolding, backpropagation of error signals,
or Boltzmann phases are required.}
 Automated Aircraft Recovery via Reinforcement Learning: Initial Experiments  Jeffrey F. Monaco, David G. Ward, Andrew G. Barto
 Initial experiments described here are directed toward using
reinforcement learning (RL) to develop an automatic recovery system (ARS) for
highagility aircraft. An ARS is an outerloop flight control system
designed to
bring an aircraft from a range of initial states to straight, level, and
noninverted flight in minimum time and while satisfying given constraints.
Here we
report results for a simple version of the problem involving only singleaxis
(pitch) simulated recoveries. Through simulated control experience using a
mediumfidelity aircraft simulation, the RL system approximated an optimal
policy
for longitudinalstick inputs to produce minimumtime transitions to
straight and
level flight in unconstrained cases, while meeting a pilotstation
acceleration constraint.
 Learning to Schedule StraightLine Code  Eliot Moss, Paul Utgoff, John Cavazos, Doina Precup, Darko Stefanovi\'{c}, Carla Brodley, David Scheeff
 Execution speed of programs on modern computer architectures is sensitive, by
a factor of two or more, to the order in which instructions are presented to
the processor. To realize potential execution efficiency, it is now customary
for an optimizing compiler to employ a heuristic algorithm for instruction
scheduling. These algorithms are painstakingly handcrafted, which is
expensive and timeconsuming. We show how to cast the instruction scheduling
problem as a learning task, so that one obtains the heuristic scheduling
algorithm automatically. Our focus is the narrower problem of scheduling
straightline code, also known as a {\em basic block} of instructions. Our
empirical results show that just a few features are adequate for quite good
performance at this task for a real modern processor, and that any of several
supervised learning methods perform nearly optimally with respect to the
features used.
 Bayesian Robustification for Audio Visual Fusion  Javier Movellan, Paul Mineiro
 We discuss the problem of catastrophic fusion in multimodal recognition
systems. This problem arises in systems that need to fuse different
channels in nonstationary environments. Practice shows that when
recognition modules within each modality are tested in contexts inconsistent
with their assumptions, their influence on the fused product tends to
increase, with catastrophic results. We explore a principled solution to
this problem based upon Bayesian ideas of competitive models and inference
robustification: each sensory channel is provided with simple whitenoise
context models, and the perceptual hypothesis and context are jointly
estimated. Consequently, context deviations are interpreted as changes in
white noise contamination strength, automatically adjusting the influence of
the module. The approach is tested on a fixed lexicon automatic audiovisual
speech recognition problem with very good results.
 A SuperadditiveImpairment Theory of Optic Aphasia  Michael C. Mozer, Mark Sitton, Martha Farah
 Accounts of neurological disorders often posit damage to a specific functional
pathway of the brain. Farah (1990) has proposed an alternative class of
explanations involving partial damage to multiple pathways. We explore this
explanation for optic aphasia, a disorder in which severe performance deficits
are observed when patients are asked to name visually presented objects, but
surprisingly, performance is relatively normal on naming objects from auditory
cues and on gesturing the appropriate use of visually presented objects. We
model this highly specific deficit through partial damage to two pathwaysone
that maps visual input to semantics, and the other that maps semantics to
naming responses. The effect of this damage is superadditive, meaning that
tasks which require one pathway or the other show little or no performance
deficit, but the damage is manifested when a task requires both pathways
(i.e., naming visually presented objects). Our model explains other phenomena
associated with optic aphasia, and makes testable experimental predictions.
 Reinforcement Learning for Continuous Stochastic Control Problems  R\'{e}mi Munos, Paul Bourgine
 This paper is concerned with the problem of Reinforcement Learning (RL)
for continuous state space and time stochastic control problems. We state
the HamiltonJacobiBellman equation satisfied by the value function and
use a FiniteDifference method for designing a convergent approximation
scheme. Then we propose a RL algorithm based on this scheme and prove its
convergence to the optimal solution.
 Enhancing QLearning for Optimal Asset Allocation  Ralph Neuneier
 In this paper, we enhance the Qlearning algorithm for optimal asset
allocation proposed in (Neuneier, 1996). The new formulation simplifies
the approach and allows policyiteration without the necessity for
having a model of the system. The new algorithm is tested on real data
of the German stock market. Furthermore, the possibility of risk
management within the framework of Markov decision problems is
analyzed. This leads to different return functions, which bridge the
gap between classical portfolio management and asset allocation
strategies derived by reinforcement learning.
 A Hippocampal Model of Recognition Memory  Randall C. O'Reilly, Kenneth A. Norman, James L. McClelland
 A rich body of data exists showing that recollection of specific
information contributes to recognition memory. We present a model,
based largely on known features of hippocampal anatomy and
physiology, that accounts for both the {\em highthreshold} nature
of this recollection process (i.e., the fact that only studied items
are recollected, although nonstudied items sometimes trigger
recollection of similar studied items), and the fact that increasing
interference leads to less recollection but apparently does not
compromise the {\em quality} of recollection (i.e., the extent to
which recollection veridically reflects events that occurred at
study). Both of these properties of recollection pose problems for
existing computational and mathematical models.
 Learning Generative Models with the UpPropagation Algorithm  JongHoon Oh, H. Sebastian Seung
 Backpropagation networks process their inputs in a bottomup fashion,
and use topdown connections to propagate an error signal for
learning. We introduce a new algorithm called uppropagation, which
uses topdown connections to generate patterns, and bottomup
connections to propagate an error signal. The error signal is part of
a computational feedback loop that adjusts the generated pattern to
match sensory input. The error signal is also used for
unsupervised learning. The algorithm is benchmarked against principal
component analysis in experiments on images of handwritten digits.
 Adaptive Choice of Grid and Time in Reinforcement Learning  Stephan Pareigis
 We propose local error estimates together with algorithms for adaptive
aposteriori grid and time refinement in reinforcement learning. We
consider a deterministic system with continuous state and time with
an infinite horizon discounted cost functional. For grid refinement we
follow the procedure of numerical methods for the
Bellmanequation. For time refinement we propose a new criterion,
based on consistency estimates of discrete solutions of the
Bellmanequation. We demonstrate that an optimal ratio of time to
space discretization is crucial for optimal learning rates and
accuracy of the approximate optimal value function.
 Reinforcement Learning with Hierarchies of Machines  Ronald Parr, Stuart Russell
 We present a new approach to reinforcement learning in which the
policies considered by the learning process are constrained by
hierarchies of partially specified machines. This allows for the use
of prior knowledge to reduce the search space and provides a framework
in which knowledge can be transferred across problems and in which
component solutions can be recombined to solve larger and more
complicated problems. Our approach can be seen as providing a link
between reinforcement learning and ``behaviorbased'' or
``teleoreactive'' approaches to control. We present provably
convergent algorithms for problemsolving and learning with
hierarchical machines and demonstrate their effectiveness on a problem
with several thousand states.
 Analog VLSI Model of Intersegmental Coordination with NearestNeighbor Coupling  Girish N. Patel, Jeremy H. Holleman, Stephen P. DeWeerth
 We have developed an analog VLSI system that models the coordination
of neurobiological segmented oscillators. We have implemented and tested
a system that consists of a chain of eleven pattern generating circuits
that are synaptically coupled to their nearest neighbors. Each pattern
generating circuit is implemented with two silicon MorrisLecar neurons
that are connected in a reciprocally inhibitory network. We discuss the
mechanisms of oscillations in the twocell network and explore system
behavior
based on isotropic and anisotropic coupling, and frequency gradients
along the chain of oscillators.
 Multitime Models for Temporally Abstract Planning  Doina Precup, Richard S. Sutton
 Planning and learning at multiple levels of temporal abstraction is a key
problem for artificial intelligence. In this paper we summarize an
approach to this problem based on the mathematical framework of Markov
decision processes and reinforcement learning. Current modelbased
reinforcement learning is based on onestep models that cannot represent
commonsense higherlevel actions, such as going to lunch, grasping an
object, or flying to Denver. This paper generalizes prior work on
temporally abstract models (Sutton, 1995) and extends it from the
prediction setting to include actions, control, and planning. We
introduce a more general form of temporally abstract model, the
{\em multitime model}, and establish its suitability for planning and
learning by virtue of its relationship to Bellman equations. This paper
summarizes the theoretical framework of multitime models and illustrates
their potential advantages in a gridworld planning task.
 Analytical Study of the Interplay between Architecture and Predictability  Avner Priel, Ido Kanter, David A. Kessler
 We study model feed forward networks as time series predictors in the
stationary limit. The focus is on complex, yet nonchaotic, behavior.
The main question we address is whether the asymptotic behavior is governed
by the architecture, regardless the details of the weights.
We find hierarchies among classes of architectures with respect to the
attractor dimension of the long term sequence they are capable of generating;
larger number of hidden units can generate higher dimensional attractors.
In the case of a perceptron, we develop the stationary solution for a
general weight vector, and show that the flow is typically one dimensional.
The relaxation time from an arbitrary initial condition to the stationary
solution is found to scale linearly with the size of the network.
In multilayer networks, the number of hidden units gives bounds on the
number and dimension of the possible attractors.
We conclude that long term prediction (in the nonchaotic regime) with such
models is governed by attractor dynamics related to the architecture.
 Correlates of Attention in a Model of Dynamic Visual Recognition  Rajesh P. N. Rao
 Given a set of objects in the visual field, how does the the visual
system learn to attend to a particular object of interest while
ignoring the rest? How are occlusions and background clutter so
effortlessly discounted for when recognizing a familiar object? In
this paper, we attempt to answer these questions in the context of a
Kalman filterbased model of visual recognition that has previously
proved useful in explaining certain neurophysiological phenomena such
as endstopping and related extraclassical receptive field effects
in the visual cortex. By using results from the field of robust
statistics, we describe an extension of the Kalman filter model that
can handle multiple objects in the visual field. The resulting
robust Kalman filter model demonstrates how certain forms of
attention can be viewed as an emergent property of the interaction
between topdown expectations and bottomup signals. The model also
suggests functional interpretations of certain attentionrelated
effects that have been observed in visual cortical neurons.
Experimental results are provided to help demonstrate the ability of
the model in performing robust segmentation and recognition of objects
and image sequences in the presence of varying degrees of occlusions
and clutter.
 An Incremental Nearest Neighbor Algorithm with Queries}  Joel Ratsaby
 We consider the general problem of learning
multicategory classification from labeled examples.
In many practical learning settings the sample size or time
available for training are limited, which may have
adverse effects
on the accuracy of the resulting classifier.
To compensate for this, researchers have
considered various ways
of trying to make the learning process more efficient so that even with
limited resources high accuracy may be possible.
The classical theory of pattern recognition
assumes labeled examples appear
according to
unknown underlying
class conditional probability distributions where the
pattern classes are picked randomly in a passive
manner according to their {\em a priori}
probabilities.
In this paper, we present experimental results for an algorithm
which actively selects
samples from different pattern classes according to a querying rule as
opposed
to the {\em a priori} probabilities.
The amount of improvement of this querybased approach over the
passive batch approach depends on the
complexity of the Bayes rule. The query rule essentially learns to
tune the subsample size of each of the pattern classes to this Bayes
complexity
which is assumed to be unknown.
The principle on which this algorithm is based is general enough to be
used
in any learning algorithm which permits a modelselection criterion and
for which
the error rate of the classifier
is calculable in terms of the complexity of the model.
 Globally Optimal Online Learning Rules  Magnus Rattray, David Saad
 We present a method for determining the globally optimal online
learning rule for a soft committee machine under a statistical
mechanics framework. This work complements previous results on locally
optimal rules, where only the rate of change in generalization error
was considered. We maximize the total reduction in generalization
error over the whole learning process and show how the resulting
rule can significantly outperform the locally optimal rule.
 Just One View: Invariances in Inferotemporal Cell Tuning  Maximilian Riesenhuber, Tomaso Poggio
 In macaque inferotemporal cortex (IT), neurons have been found to
respond selectively to complex shapes while showing broad tuning
(``invariance'') with respect to stimulus transformations such as
translation and scale changes and a limited tuning to rotation in depth.
Training monkeys with novel, papercliplike objects, Logothetis et al.~(1995)
could investigate whether these invariance properties are due to
experience with exhaustively many transformed instances of an object or
if there are mechanisms that allow the cells to show response invariance
also to previously unseen instances of that object. They found
objectselective cells in anterior IT which exhibited limited invariance
to various transformations.
 RCC Cannot Compute Certain FSA, Even with Arbitrary Transfer Functions  Mark Ring
 Existing proofs demonstrating the computational limitations of Recurrent
Cascade Correlation (RCC) and similar networks (Fahlman, 1991; Bachrach, 1988;
Mozer, 1988) explicitly limit their results to units having sigmoidal or
hardthreshold transfer functions (Giles et al., 1995; Kremer, 1996).
The proof given here shows that for any finite, discrete transfer
function used by the units of an RCC network, there are finitestate automata
(FSA) that the network cannot model, no matter how many units are used.
The proof also applies to continuous transfer functions with a finite
number of fixed points, such as sigmoid and radial basis functions.
 Recurrent Neural Networks Can Learn to Implement SymbolSensitive Counting  Paul Rodriguez, Janet Wiles
 Recently researchers have derived formal complexity analysis of analog
computation in the setting of discretetime dynamical systems. Although
these results are theoretically instructive, they do not identify relevant
issues for implementation in actual systems, or for psychological models
of sequence processing. As an empirical contrast, training recurrent
neural networks (RNNs) produces selforganized systems that can give us
complementary and constructive evidence for realizations of analog
mechanisms. Previous work showed that a RNN can learn to process a simple
contextfree language (CFL). Herein, we extend that work to show that a
RNN can learn a harder CFL by organizing its resources into a
symbolsensitive counting solution, and we provide a dynamical systems
analysis which demonstrates how the network can not only count, but also
copy and store counting information.
 EM Algorithms for PCA and SPCA  Sam Roweis
 I present an expectationmaximization (EM) algorithm for principal
component analysis (PCA). Traditional methods for finding the first
few principal components of a dataset can be quite costly when there
are many datapoints and/or the data space is of very high
dimension. These methods can also fail when the sample covariance
matrix is not of full rank and cannot deal easily with missing data
values.
 Intrusion Detection with Neural Networks  Jake Ryan, MengJang Lin, Risto Miikkulainen
 With the rapid expansion of computer networks during the past few years,
security has become a crucial issue for modern computer systems. A good
way to detect illegitimate use is through monitoring unusual user
activity. Methods of intrusion detection based on handcoded rule sets
or predicting commands online are laborous to build or not very
reliable. This paper proposes a new way of applying neural networks to
detect intrusions. We believe that a user leaves a `print' when using
the system; a neural network can be used to learn this print and
identify each user much like detectives use thumbprints to place people
at crime scenes. If a user's behavior does not match his/her print, the
system administrator can be alerted of a possible security breech. A
backpropagation neural network called NNID (Neural Network Intrusion
Detector) was trained in the identification task and tested
experimentally on a system of 10 users. The system was 96\% accurate in
detecting unusual activity, with 7\% false alarm rate. These results
suggest that learning user profiles is an effective way for detecting
intrusions.
 On the Separation of Signals from Neighboring Cells in Tetrode Recordings  Maneesh Sahani, John S. Pezaris, Richard A. Andersen
 We discuss a solution to the problem of separating waveforms produced
by multiple cells in a tetrode recording. We take an explicitly
probabilistic approach, using generative models of varying
sophistication to describe the distribution of waveforms produced by a
single cell. The models range from a single Gaussian distribution of
waveforms for each cell to a continuoushiddenstate dynamical model
with a mixture of linear propagators. We stress the overall structure
of the approach, while the details of the generative model may depend
on the specific neural preparation.
 Modeling Acoustic Correlations by Factor Analysis  Lawrence Saul, Mazin Rahim
 Hidden Markov models (HMMs) for automatic speech recognition rely on
high dimensional feature vectors to summarize the shorttime
properties of speech. Correlations between features can arise when
the speech signal is nonstationary or corrupted by noise. We
investigate how to model these correlations using factor analysis, a
statistical method for dimensionality reduction. Factor analysis uses
a small number of parameters to model the covariance structure of high
dimensional data. These parameters are estimated by an
ExpectationMaximization (EM) algorithm that can be embedded in the
training procedures for HMMs. We evaluate the combined use of
mixture densities and factor analysis in HMMs that recognize
alphanumeric strings. Holding the total number of parameters fixed,
we find that these methods, properly combined, yield better models
than either method on its own.
 Local Dimensionality Reduction  Stefan Schaal, Sethu Vijayakumar, Christopher G. Atkeson
 If globally high dimensional data has locally only
low dimensional distributions, it is advantageous to
perform a local dimensionality reduction before
further processing the data. In this paper we examine
several techniques for local dimensionality reduction
in the context of locally weighted linear regression.
As possible candidates, we derive local versions of
factor analysis regression, principal component
regression, principal component regression on joint
distributions, and partial least squares regression.
After outlining the statistical basis of these
methods, we perform Monte Carlo simulations to
evaluate their robustness with respect to violations
of their statistical assumptions. One surprising
outcome is that locally weighted partial least
squares regression offers the best average results,
thus outperforming even factor analysis, the
theoretically most appealing of our candidate
techniques.
 Comparison of Human and Machine Word Recognition  Markus Schenkel, Cyril Latimer, Marwan Jabri
 We present a study which is concerned with word recognition rates for
heavily degraded documents. We compare human with machine reading
capabilities in a series of experiments, which explores the
interaction of word/nonword recognition, word frequency and legality
of nonwords with degradation level. We also study the influence of
character segmentation, and compare human performance with that of our
artificial neural network model for reading. We found that the
proposed computer model uses word context as efficiently as humans,
but performs slightly worse on the pure character recognition task.
 Prior Knowledge in Support Vector Kernels  Bernhard Sch\"olkopf, Patrice Simard, Alex J. Smola, Vladimir Vapnik
 We explore methods for incorporating prior knowledge about a problem at hand
in Support Vector learning machines. We show that both invariances under
group transformations and prior knowledge about locality in images can be
incorporated by constructing appropriate kernel functions.
 Training Methods for Adaptive Boosting of Neural Networks  Holger Schwenk, Yoshua Bengio
 Boosting is a general method for improving the performance
of any learning algorithm that consistently generates
classifiers which need to perform only slightly better than
random guessing. A recently proposed and very promising
boosting algorithm is AdaBoost. It has been applied with
great success to several benchmark machine learning problems
using rather simple learning algorithms, in particular decision trees.
In this paper we use AdaBoost to improve the
performances of neural networks applied to character recognition
tasks. We compare training methods based on sampling the training
set and weighting the cost function. Our system achieves about
1.4\% error on a data base of online handwritten digits from more
than 200 writers. Adaptive boosting of a multilayer network
achieved less than 2\% error on the UCI Letters offline characters
data set.
 The Rectified Gaussian Distribution  Nicholas D. Socci, Daniel D. Lee, H. Sebastian Seung
 A simple but powerful modification of the standard Gaussian
distribution is studied. The variables of the rectified Gaussian are
constrained to be nonnegative, enabling the use of nonconvex energy
functions. Two multimodal examples, the competitive and cooperative
distributions, illustrate the representational power of the rectified
Gaussian. Since the cooperative distribution can represent the
translations of a pattern, it demonstrates the potential of the
rectified Gaussian for modeling pattern manifolds. The problem of
learning may be more tractable for the rectified Gaussian than for the
Boltzmann machine, owing to the technical advantages of continuous
variables.
 Learning Continuous Attractors in Recurrent Networks  H. Sebastian Seung
 One approach to invariant object recognition employs a recurrent
neural network as an associative memory. In the standard depiction of
the network's state space, memories of objects are stored as
attractive fixed points of the dynamics. I argue for a modification
of this picture: if an object has a continuous family of
instantiations, it should be represented by a continuous attractor.
This idea is illustrated with a network that learns to complete
patterns. To perform the task of filling in missing information, the
network develops a continuous attractor that models the manifold from
which the patterns are drawn. From a statistical viewpoint, the
pattern completion task allows a formulation of unsupervised learning
in terms of regression rather than density estimation.
 Minimax and Hamiltonian Dynamics of ExcitatoryInhibitory Networks  H. Sebastian Seung, Tom J. Richardson, Jeffrey C. Lagarias, John J. Hopfield
 Because the dynamics of a neural network with symmetric interactions
is similar to a gradient descent dynamics, convergence to a fixed
point is the general behavior. In this paper, we analyze the global
behavior of networks with distinct excitatory and inhibitory
populations of neurons, under the assumption that the interactions
between the populations are antisymmetric. Our analysis exploits the
similarity of such a dynamics to a saddle point dynamics. This
analogy gives some intuition as to why such a dynamics can either
converge to a fixed point or a limit cycle, depending on parameters.
We also show that the network dynamics can be written in a dissipative
Hamiltonian form.
 DataDependent Structural Risk Minimization for Perceptron Decision Trees  John ShaweTaylor, Nello Cristianini
 Perceptron Decision Trees (also known as Linear Machine DTs, etc.) are
analysed in order that data dependent Structural Risk Minimization can
be applied. Data dependent analysis is performed which indicates that
choosing the maximal margin hyperplanes at the decision nodes will
improve the generalization. The analysis is performed by defining a real
valued function class computed by decision trees and bounding its
generalization error using a novel technique. Experiments are
performed to test the approach.
 An Analog VLSI Neural Network for Phasebased Machine Vision}  Bertram E. Shi, Kwok Fai Hui
 We describe the design, fabrication and test results of an analog CMOS VLSI
neural network prototype chip intended for phasebased computer vision
algorithms. The chip implements an image filtering operation similar to
Gaborfiltering. Because a Gabor filter's output is complex valued, it can
be used to define a phase at every pixel in an image. This phase can be used
in robust algorithms for disparity estimation and binocular stereo vergence
control in stereo vision and for image motion analysis. The chip reported
here takes an input image and generates two outputs at every pixel,
corresponding to the real and imaginary parts of the output.
 Monotonic Networks  Joseph Sill
 Monotonicity is a constraint which arises in many application domains.
We present a machine learning model called the monotonic network, which
has the ability to obey monotonicity constraints exactly, i.e., by virtue
of functional form. A straightforward method for implementing and training
a monotonic network is described. Monotonic networks are proven to be
universal approximators of continous, differentiable monotonic functions.
We apply monotonic networks to a realworld task in corporate bond
rating prediction and show that they compare favorably to other
approaches.
 How to Dynamically Merge Markov Decision Processes  Satinder Singh, David Cohn
 We are frequently called upon to do multiple tasks that
compete for our attention and resource. Often we know the optimal
solution to each task in isolation. How can we exploit that knowledge
to efficiently determine good solutions for doing the tasks in
parallel? We formulate this question as that of dynamically merging
multiple Markov decision processes (MDPs), and present a new
theoreticallysound dynamic programming algorithm that assumes known
good solutions (value functions) for the individual MDPs in isolation
and efficiently constructs a good solution for doing the set of MDPs
in parallel. Our algorithm can merge MDPs dynamically, assimilating a
new MDP smoothly into an ongoing merging of previous MDPs. We analyze
various aspects of our algorithm and illustrate its use on a simple
merging problem.
 From Regularization Operators to Support Vector Kernels  Alex J. Smola, Bernhard Sch\"olkopf
 We derive the correspondence between regularization operators used in
Regularization Networks and Hilbert Schmidt Kernels appearing in
Support Vector Machines. More specifically, we prove that the Green's
Functions associated with regularization operators are suitable
Support Vector Kernels with equivalent regularization properties.
As a byproduct we show that a large number of Radial Basis Functions
namely conditionally positive definite functions may be used as
Support Vector kernels.
 Stacked Density Estimation  Padhraic Smyth, David Wolpert
 In this paper, the technique of stacking, previously only used for
supervised learning, is applied to unsupervised
learning. Specifically, it is used for nonparametric multivariate
density estimation, to combine finite mixture model and kernel density
estimators. Experimental results on both simulated data and real world
data sets clearly demonstrates that stacked density estimation
outperforms other strategies such as choosing the single best model
based on crossvalidation, combining with uniform weights, and even
the single best model chosen by ``cheating'' by looking at the data
used for independent testing.
 Online Learning from Finite Training Sets in Nonlinear Networks  Peter Sollich, David Barber
 Online learning is one of the most common forms of neural network
training. We present an analysis of online learning from finite
training sets for nonlinear networks (namely, softcommittee
machines), advancing the theory to more realistic learning scenarios.
Dynamical equations are derived for an appropriate set of order
parameters; these are exact in the limiting case of either linear
networks or infinite training sets. Preliminary comparisons with
simulations suggest that the theory captures some effects of finite
training sets, but may not yet account correctly for the presence of
local minima.
 Bidirectional Retrieval from Associative Memory  Friedrich T. Sommer, G\"unther Palm
 So far, similarity based fault tolerant retrieval naturally provided by
neural associative memories (NAM) is not widely applied as access method
for large databases.
A serious obstacle to applications is that the theoretical performance
of NAM can not be achieved in practical cases: the Willshaw model with
the highest asymptotic information capacity produces a high level of
cross talk errors in the retrieval, if fully exploited for finite sizes.
 Incorporating Contextual Information in White Blood Cell Identification  Xubo Song, Yaser AbuMostafa, Joseph Sill, Harvey Kasdan
 In this paper we propose a technique to incorporate contextual information
into object classification. In the real world there are cases where the
identity of an object is ambiguous due to the noise in the measurements
based on which the classification should be made. It is helpful to
reduce the ambiguity by utilizing extra information referred to
as context, which in our case is the identities of the accompanying
objects. This technique is applied to white blood cell
classification. Comparisons are made against the ``no context'' approach,
to demonstrate the superior classification performance achieved
by using context. In our particular application, it
significantly reduces false alarm rate and thus greatly reduces
the cost due to expensive clinical tests.
 Bach in a BoxRealTime Harmony  Randall R. Spangler, Rodney M. Goodman, Jim Hawkins
 We describe a system for learning J. S. Bach's rules of
musical harmony. These rules are learned from examples and are
expressed as rulebased neural networks. The rules are then applied
in realtime to generate new accompanying harmony for a live
performer. Realtime functionality imposes constraints on the
learning and harmonizing processes, including limitations on the types
of information the system can use as input and the amount of
processing the system can perform. We demonstrate algorithms for
generating and refining musical rules from examples which meet these
constraints. We describe a method for including a priori
knowledge into the rules which yields significant performance gains.
We then describe techniques for applying these rules to generate new
music in realtime. We conclude the paper with an analysis of
experimental results.
 Experiences with Bayesian Learning in a Real World Application  Peter Sykacek, Georg Dorffner, Peter Rappelsberger, Josef Zeitlhofer
 This paper reports about an application of Bayes' inferred neural
network classifiers in the field of automatic sleep staging.
The reason for using Bayesian learning for this task is twofold. First,
Bayesian inference is known to embody regularization automatically.
Second, a side effect of Bayesian learning leads to larger variance of
network outputs in regions without training data. This results in well
known moderation effects, which can be used to detect outiers.
In a 5 fold crossvalidation experiment the full Bayesian solution found with
R. Neals hybrid Monte Carlo algorithm was not better than a single
maximum aposteriori (MAP) solution found with D.J. MacKay's evidence
approximation. In a second experiment we studied the properties of both
solutions in rejecting classification of movement artifacts.
 The Asymptotic ConvergenceRate of Qlearning  Csaba Szepesv\'{a}ri
 In this paper we show that for discounted MDPs with discount factor
$\gamma$ the asymptotic rate of convergence of Qlearning is
$O(1/t^{1\gamma})$ if $\gamma>1/2$ and $O(\sqrt{\log\log t/ t})$
otherwise.
 Mapping a Manifold of Perceptual Observations  Joshua B. Tenenbaum
 I present a new computationalgeometric framework for
unsupervised mapping of a manifold of perceptual observations.
Nonlinear dimensionality reduction is formulated here as the problem
of trying to find a Euclidean featurespace embedding of a set of
observations that preserves as closely as possible their intrinsic
metric structure, i.e. the distances between points on the observation
manifold as measured along geodesic paths. Our isometric feature
mapping procedure (``isomap'') is able to reliably recover
lowdimensional nonlinear structure in realistic perceptual data sets,
such as a manifold of face images, where conventional global mapping
methods find only local minima. The recovered map provides a
canonical set of globally meaningful features, which allows perceptual
transformations such as interpolation, extrapolation, and analogy 
highly nonlinear transformations in the original observation space 
to be computed with simple linear operations in feature space.
 Online Learning from Finite Training Sets in Nonlinear Networks  Peter Sollich, David Barber
 Blind Separation of Radio Signals in Fading Channels  Kari Torkkola
 We apply information maximization / maximum likelihood
blind source separation to complex valued signals
mixed with complex valued nonstationary matrices.
This case arises in radio communications with baseband signals.
We incorporate known source signal distributions in the adaptation,
thus making the algorithms less ``blind''. This results in drastic
reduction of the amount of data needed for successful convergence.
Adaptation to rapidly changing signal mixing conditions,
such as to fading in mobile communications, becomes now feasible
as demonstrated by simulations.
 A Solution for Missing Data in Recurrent Neural Networks with an Application to Blood Glucose Prediction  Volker Tresp, Thomas Briegel
 We consider neural network models for stochastic nonlinear dynamical
systems where measurements of the variable of interest are only
available at irregular intervals i.e. most realizations are missing.
Difficulties arise since the solutions for prediction and maximum
likelihood learning with missing data lead to complex integrals, which
even for simple cases cannot be solved analytically. In this paper we
propose a specific combination of a nonlinear recurrent neural
predictive model and a linear error model which leads to tractable
prediction and maximum likelihood adaptation rules. In particular, the
recurrent neural network can be trained using the realtime recurrent
learning rule and the linear error model can be trained by an EM
adaptation rule, implemented using forwardbackward Kalman filter
equations. The model is applied to predict the glucose/insulin
metabolism of a diabetic patient where blood glucose measurements are
only available a few times a day at irregular intervals. The new model
shows considerable improvement with respect to both recurrent neural
networks trained with teacher forcing or in a free running mode and
various linear models.
 Selfsimilarity Properties of Natural Images  Antonio Turiel, Germ\'an Mato, N\'estor Parga, JeanPierre Nadal
 Scale invariance is a fundamental property of ensembles of
natural images.
Their non Gaussian properties are less well understood,
but they indicate the existence of a rich statistical
structure. In this work we present a detailed study of
the marginal statistics of a variable related to the edges
in the images. A numerical analysis shows that it
exhibits extended selfsimilarity.
This is a scaling property stronger than selfsimilarity:
all its moments can be expressed as a power of any given moment.
More interesting, all the exponents can be predicted in
terms of a multiplicative logPoisson process. This is the very same
model that was used very recently to predict the correct
exponents of the structure functions of turbulent flows.
These results allow us to study the underlying multifractal singularities.
In particular we find that the most singular structures are onedimensional:
the most singular manifold consists of sharp edges.
 Multiresolution Tangent Distance for Affineinvariant Classification  Nuno Vasconcelos, Andrew Lippman
 The ability to rely on similarity metrics invariant to image transformations
is an important issue for image classification tasks such as face or
character recognition. We analyze an invariant metric that has performed
well for the latter  the tangent distance  and study its
limitations when applied to regular images, showing that the most
significant among these (convergence to local minima) can be drastically
reduced by computing the distance in a multiresolution setting. This leads
to the multiresolution tangent distance, which exhibits
significantly higher invariance to image transformations,
and can be easily combined with robust estimation procedures.
 Use of a MultiLayer Perceptron to Predict Malignancy in Ovarian Tumours  Herman Verrelst, Yves Moreau, Joos Vandewalle, Dirk Timmerman
 We discuss the development of a MultiLayer Perceptron neural network
classifier
for use in preoperative differentiation between benign and malignant
ovarian tumors.
As the Mean Squared classification Error is not sufficient to make
correct and objective
assessments about the performance of the neural classifier, the concepts
of sensitivity
and specificity are introduced and combined in Receiver Operating
Characteristic
curves. Based on objective observations such as sonomorphologic
criteria, color
Doppler imaging and results from serum tumor markers, the neural network
is able to
make reliable predictions with a discriminating performance comparable
to that of
experienced gynecologists.
 Independent Component Analysis for Identification of Artifacts  Ricardo Vig\'{a}rio, Veikko Jousm\"{a}ki, Matti H\"{a}m\"{a}l\"{a}inen, Riitta Hari, Erkki Oja
 We have studied the application of an independent component analysis
(ICA) approach to the identification and possible removal of
artifacts from a magnetoencephalographic (MEG) recording. This
statistical technique separates components according to the kurtosis
of their amplitude distributions over time, thus distinguishing
between strictly periodical signals, and regularly and irregularly
occurring signals. Many artifacts belong to the last category. In
order to assess the effectiveness of the method, controlled
artifacts were produced, which included saccadic eye movements and
blinks, increased muscular tension due to biting and the presence of
a digital watch inside the magnetically shielded room. The results
demonstrate the capability of the method to identify and clearly
isolate the produced artifacts.
 Modeling Complex Cells in an Awake Macaque during Natural Image Viewing  William E. Vinje, Jack L. Gallant
 Cells in V1 are often modeled as quasilinear filters. We are
interested in the behavior of these filters during natural vision. We
apply a quasilinear filter to three stimulus sequences that simulated
all of the events occurring in and around the receptive field of a single
welltuned V1 complex cell while an animal freely viewed several natural
images (review movies). Model parameters are estimated from the cell's
responses to flashed gratings. Uncertainties in parameter estimates are
quantified and used to estimate the resulting uncertainties in the model's
predictions. We find modest but significant correlation between the
model and the cell's recorded responses to three review movies (r = 0.21,
0.25, 0.52, respectively). Prediction uncertainties are sizable, but are
not large enough to be the sole source of discrepancy between the model
and the data. The largest single source of prediction uncertainty is the
measurement uncertainty in orientation tuning. Our results indicate that
it should be possible to measure model parameters accurately enough to
substantally lower this prediction uncertainty. Low uncertainty predictions
are necessary if a quasilinear filter model is to be used as a foundation
for future models that incorporate nonlinear mechanisms.
 Competitive Online Linear Regression  Volodya Vovk
 We apply a general algorithm for merging prediction strategies
(the Aggregating Algorithm)
to the problem of linear regression with the square loss;
our main assumption is that the response variable is bounded.
It turns out that for this particular problem
the Aggregating Algorithm resembles,
but is different from,
the wellknown ridge estimation procedure.
From general results about the Aggregating Algorithm
we deduce a guaranteed bound on the difference
between our algorithm's performance
and the best, in some sense, linear regression function's performance.
We show that the AA attains the optimal constant in our bound,
whereas the constant attained by the ridge regression procedure
can be 4.3 times worse.
 On the Infeasibility of Training Neural Networks with Small Squared Errors  Van H. Vu
 We demonstrate that the problem of training neural networks with
small (average) squared error is computationally intractable.
Consider a data set of $M$ points $(X_i, Y_i)$, $i=1,2,\dots, M$, where
$X_i$ are input vectors from $R^d$, $Y_i$ are
real outputs ($Y_i \in R)$.
For a network $f_0$ in some class ${\cal F} $ of neural networks,
$(1/M) (\sum _{i=1}^M (f_0(X_i) Y_i)^2)^{1/2} 
inf_{f \in {\cal F}}(1/M) (\sum _{i=1}^M (f(X_i) Y_i)^2)^{1/2}$
is the (average) relative error that occurs when one tries to fit the data
set by $f_0$.
We will prove for several classes ${\cal F}$ of neural netwoks that achieving
relative error smaller than some fixed positive threshold
(independent from the size of the data set) is NPhard. This answers a
question of L. Jones.
 Phase Transitions and the Perceptual Organization of Video Sequences  Yair Weiss
 Estimating motion in scenes containing multiple moving objects remains
a difficult problem in computer vision. A promising approach to this
problem involves using mixture models, where the motion of each object
is a component in the mixture.
However, existing
methods typically require specifying in advance the number of components in the
mixture, i.e. the number of objects in the scene.
Here we show that the number of objects can be estimated automatically
in a maximum likelihood framework, given an assumption about the
level of noise in the video sequence. We derive analytical results
showing the number of models which maximize the likelihood for a given
noise level in a given sequence. We illustrate these results on a real
video sequence, showing how the phase transitions correspond to
different perceptual organizations of the scene.
 Hybrid NN/HMMBased Speech Recognition with a Discriminant Neural Feature Extraction  Daniel Willett, Gerhard Rigoll
 In this paper we present a novel hybrid architecture for continuous speech
recognition systems. It consists of a continuous HMM system extended by an
arbitrary neural network that is used as a preprocessor that takes several
frames of the feature vector as input to produce more discriminative
feature vectors with respect to the underlying HMM system. This hybrid
system is an extension of a stateoftheart continuous HMM system, and in
fact, it is the first hybrid system that really is capable of
outperforming these standard systems with respect to the recognition
accuracy. Experimental results show a relative error reduction of about
10\% achieved on a remarkably good recognition system based on
continuous HMMs for the Resource Management 1000word continuous speech
recognition task.
 Modelling Seasonality and Trends in Daily Rainfall Data  Peter M. Williams
 This paper presents a new approach to the problem of modelling daily
rainfall using neural networks. We first model the conditional
distributions of rainfall amounts, in such a way that the model itself
determines the order of the process, and the timedependent shape and
scale of the conditional distributions. After integrating over
particular weather patterns, we are able to extract seasonal
variations and longterm trends.
 Graph Matching with Hierarchical Discrete Relaxation  Richard C. Wilson, Edwin R. Hancock
 Our aim in this paper is to develop
a Bayesian framework for matching hierarchical
relational models. Such models are widespread
in computer vision. The framework that we adopt
for this study is provided by
iterative
discrete relaxation. Here the aim is to assign the discrete
matches so as to optimise a global cost function
that draws information concerning the consistency of match
from different levels of the hierarchy.
Our Bayesian development naturally distinguishes
between intralevel and interlevel constraints. This
allows the impact of reassigning a match to be assessed
not only
at its own (or peer) level of representation,
but also upon its parents and children in the hierarchy.
We illustrate the effectiveness of the technique in the matching
of linesegment groupings in SAR images of rural scenes.
 The Storage Capacity of a FullyConnected Committee Machine  Yuansheng Xiong, Chulan Kwon, JongHoon Oh
 We study the storage capacity of a fullyconnected committee
machine with a large number $K$ of hidden nodes. The storage
capacity is obtained by analyzing the geometrical structure of the
weight space related to the internal representation. By examining
the asymptotic behavior of the order parameters in the limit of large
$K$, the storage capacity is found to be proportional
to $K\sqrt{\ln K}$ up to the leading order. This result satisfies
the mathematical bound given by Mitchison and Durbin, whereas the
replicasymmetric solution in a conventional Gardner's approach
violates this bound.
 Hybrid Reinforcement Learning and Its Application to Biped Robot Control  Satoshi Yamada, Akira Watanabe, Michio Nakashima
 A learning system composed of linear control modules, reinforcement
learning modules and selection modules (a hybrid reinforcement learning system)
is proposed for the quick learning of realworld control problems.
The selection modules choose one appropriate control module dependent on
the state. This hybrid learning system was applied to a stilttype biped
robot control. Because it did not need to learn the control on the flat floor,
where the linear control module can control the robot, it learned the control
on a floor with a slope more quickly than the usual reinforcement learning.
If it was trained by a 2step learning (at the first learning step, the
selection module was trained by a training procedure controlled only by
the linear controller), it learned the control much more quickly.
The average number of trials (about 50) is so small that the learning
system is applicable to real robot control.
 The Efficiency and the Robustness of Natural Gradient Descent Learning Rule  Howard H. Yang, Shunichi Amari
 We have discovered a new scheme to represent the Fisher information
matrix of a stochastic multilayer perceptron.
Based on this scheme, we have designed an algorithm to compute
the inverse of the Fisher information matrix.
When the input dimension $n$ is much larger than the number of hidden
neurons, the complexity of this algorithm is of order $O(n^2)$ while
the complexity of conventional algorithms for the same purpose is of
order $O(n^3)$. The inverse of the Fisher information matrix is used
in the natural gradient descent algorithm to train singlelayer or
multilayer perceptrons. It is confirmed by simulation that the
natural gradient descent learning rule is not only efficient but also
robust.
 Multiplicative Updating Rule for Blind Separation Derived from the Method of Scoring  Howard H. Yang
 For blind source separation, when the Fisher information matrix
is used as the Riemannian metric tensor for the parameter space,
the steepest descent algorithm to maximize the likelihood function
in this Riemannian parameter space becomes the serial updating rule
with equivariant property. This algorithm can be further simplified
by using the asymptotic form of the Fisher information matrix around
the equilibrium.
 The ObserverObservation Dilemma in NeuroForecasting  Hans Georg Zimmermann, Ralph Neuneier
 We explain how the training data can be separated into a cleaned part
and an unexplainable noisy part. Analogous to the data, the neural
network is separated into a time invariant structure used for
forecasting, and a noisy part. We propose a unified theory connecting
the optimization algorithms for cleaning and learning together with
algorithms that control the data noise and the parameter noise. The
combined algorithm allows a datadriven local control of the liability
of the network parameters and therefore an improvement in
generalization. The approach is successfully evaluated for the task of
forecasting the German bond market.
We have attempted to ensure that all information is correct, but we cannot
guarantee it. Please send comments and corrections to:
L. Douglas Baker
Carnegie Mellon University
ldbapp+nips@cs.cmu.edu