This page last updated 18 November 1998.
This is a list of the papers that were presented at NIPS*98. They are listed in (approximately) alphabetical order according to the last name of the first author.
These papers should be cited as:
Advances in Neural Information Processing Systems 11,
M. S. Kearns, S. A. Solla, D. A. Cohn, eds., MIT Press, 1999.
Some of you may remember Mr. Kearns' middle initial as something different.
Following a recent marriage, it is now correctly "M. S. Kearns".


 Invited Talk: Temporally Asymmetric Hebbian Learning, Spike Timing, and Neural Response Variability  L. F. Abbott
 In most neural network studies of Hebbian learning, changes in synaptic
weights depend on pre- and postsynaptic firing rates. Data on long-term
potentiation (LTP) and depression (LTD), candidate mechanisms for Hebbian
learning in biological systems, indicate that the precise timing of pre-
and postsynaptic action potentials, not merely firing rates, determines
both the magnitude and sign of changes in synaptic efficacies. Recent data
indicate that the degree of synaptic modification is a narrow, temporally
asymmetric function of the time difference between pre- and postsynaptic
spikes. Temporally asymmetric Hebbian learning based on these data is
competitive and can reach a stable equilibrium, even in the absence of
normalization constraints. Synaptic modification for a model neuron
receiving multiple inputs automatically reaches a ``balanced state'' in which
neuronal firing is highly irregular. Such states have been proposed to
explain the variability seen in cortical responses. In these models, the
strengthening of synapses does not depend on the firing rate, but rather on
the degree of correlation between different synaptic inputs. When inputs
are uncorrelated, no systematic synaptic modification occurs, but rapid
learning can take place if inputs are appropriately correlated.
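The temporally asymmetric modification function described above can be illustrated with a minimal sketch; the exponential shape and all parameter values here are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def stdp_window(dt, a_plus=0.005, a_minus=0.00525, tau=20.0):
    """Temporally asymmetric Hebbian window.
    dt = t_post - t_pre in ms; dt > 0 (pre before post) potentiates,
    dt <= 0 (post before pre) depresses.  Exponential shape and all
    parameter values are illustrative assumptions, not the paper's."""
    dt = np.asarray(dt, dtype=float)
    return np.where(dt > 0,
                    a_plus * np.exp(-dt / tau),    # LTP branch
                    -a_minus * np.exp(dt / tau))   # LTD branch
```

Making the depression amplitude slightly larger than the potentiation amplitude gives the window a net negative area, one simple way to obtain the competitive, stable behavior described above without explicit weight normalization.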
 Influence of Changing the Synaptic Transmitter Release Probability on Contrast Adaptation of Simple Cells in the Primary Visual Cortex  Peter Adorjan, Klaus Obermayer
 The contrast response function (CRF) of many neurons in the primary
visual cortex saturates, and shifts towards higher contrast values
following prolonged presentation of high contrast visual
stimuli. Using a recurrent neural network of excitatory spiking
neurons with adapting synapses we show that both effects could be
explained by a fast and a slow component in the synaptic
adaptation. The fast component, a short-term synaptic depression
component, leads to a saturation of the CRF and a phase advance in
the cortical cells' response to high contrast stimuli. The slow
component is derived from an adaptation of the probability of
synaptic transmitter release, and changes such that the mutual
information between the input and the output of a cortical neuron is
maximal. This component, given by the infomax learning rule,
explains contrast adaptation of the averaged membrane potential
(DC component) as well as the surprising experimental result that
the stimulus-modulated component (F1 component) of a cortical cell's
membrane potential adapts only weakly. Based on our results we propose
a new experimental method to estimate the strength of the effective
excitatory feedback to a cortical neuron, and we also suggest a
relatively simple experimental test to justify our hypothesized
synaptic mechanism for contrast adaptation.
 Robust, Efficient, Globally-Optimized Reinforcement Learning with the Parti-Game Algorithm  Mohammad A. Al-Ansari, Ronald J. Williams
 Parti-game (Moore 1994a, Moore 1994b, Moore and Atkeson 1995) is a
reinforcement learning (RL) algorithm that shows considerable promise in
overcoming the curse of dimensionality (Bellman 1957) that can plague
RL algorithms when applied to high-dimensional problems. In this
paper we introduce modifications to the algorithm that further improve
its performance and robustness. In addition, while parti-game
solutions can be improved locally by standard local path-improvement
techniques, we introduce an add-on algorithm in the same spirit as
parti-game that instead tries to improve solutions in a non-local
manner.
 Hierarchical ICA Belief Networks  Hagai Attias
 We introduce a multilayer generalization of independent component
analysis (ICA). At each level, this network extracts real-valued
latent variables that are nonlinear functions of the input data with
a highly adaptive functional form, resulting in a hierarchical distributed
representation of these data. The network is based on a probabilistic
generative model, constructed by cascading single-layer local ICA
models. Although exact maximum-likelihood learning for this model is
intractable, we present and demonstrate an algorithm that maximizes a
lower bound on the likelihood. This algorithm is developed by formulating
a variational approach to hierarchical ICA networks.
 Gradient Descent for General Reinforcement Learning  Leemon Baird, Andrew W. Moore
 A simple learning rule is derived, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms, the rule also generates valueless reinforcement-learning algorithms, which learn optimal policies without learning a value function. It also allows valueless and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning. These algorithms converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed.
 Probabilistic Modeling for Face Orientation Discrimination: Learning from Labeled and Unlabeled Data  Shumeet Baluja
 This paper presents probabilistic modeling methods to solve the problem of
discriminating between five facial orientations with very little labeled
data. Three models are explored. The first model maintains no inter-pixel
dependencies, the second model is capable of modeling a set of arbitrary
pairwise dependencies, and the last model allows dependencies only between
neighboring pixels. We show that for all three of these models, the accuracy of
the learned models can be greatly improved by augmenting a small number of
labeled training images with a large set of unlabeled images using
Expectation-Maximization. This is important because it is often difficult to
obtain image labels, while many unlabeled images are readily available. Through
a large set of empirical tests, we examine the benefits of unlabeled data for
each of the models. By using only two randomly selected labeled examples per
class, we can discriminate between the five facial orientations with an accuracy
of 94%; with six examples, we achieve an accuracy of 98%.
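The labeled-plus-unlabeled EM idea can be sketched on a toy one-dimensional Gaussian class model; the paper's models are pixel-dependency models over images, so the function and data here are illustrative assumptions only:

```python
import numpy as np

def semi_supervised_em(xl, yl, xu, n_classes, n_iter=50):
    """EM for 1-D Gaussian class models: labeled points keep their hard
    labels, unlabeled points receive soft labels in the E-step.
    (Illustrative stand-in for the paper's pixel-dependency models.)"""
    mu = np.array([xl[yl == c].mean() for c in range(n_classes)])
    sigma = np.ones(n_classes)
    prior = np.ones(n_classes) / n_classes
    x = np.concatenate([xl, xu])
    for _ in range(n_iter):
        # E-step: responsibilities for the unlabeled data only
        ll = np.stack([prior[c] / sigma[c] *
                       np.exp(-(xu - mu[c]) ** 2 / (2 * sigma[c] ** 2))
                       for c in range(n_classes)])
        r = ll / ll.sum(0)
        # M-step: hard labeled counts combined with soft unlabeled counts
        for c in range(n_classes):
            w = np.concatenate([(yl == c).astype(float), r[c]])
            n = w.sum()
            mu[c] = (w * x).sum() / n
            sigma[c] = np.sqrt((w * (x - mu[c]) ** 2).sum() / n + 1e-6)
            prior[c] = n / len(x)
    return mu, sigma, prior
```

With only a couple of labeled points per class, the unlabeled mass dominates the M-step and pulls the class models toward the true mixture components, mirroring the accuracy gains reported above.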
 Making Templates Rotationally Invariant: An Application to Rotated Digit Recognition  Shumeet Baluja
 This paper describes a simple and efficient method to make
template-based object classification invariant to in-plane
rotations. The task is divided into two parts: orientation
discrimination and classification. The key idea is to perform the
orientation discrimination before the classification. This can be
accomplished by hypothesizing, in turn, that the input image belongs
to each class of interest. The image can then be rotated to maximize
its similarity to the training images in each class (these contain
the prototype object in an upright orientation). This process yields a
set of images, at least one of which will have the object in an
upright position. The resulting images can then be classified by
models which have been trained with only upright examples. This
approach has been successfully applied to two real-world vision-based
tasks: rotated handwritten digit recognition and rotated face
detection in cluttered scenes.
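The hypothesize-derotate-then-classify idea above can be sketched minimally, using 1-D circular shifts as a stand-in for in-plane image rotation (the function and data are illustrative assumptions, not from the paper):

```python
import numpy as np

def classify_rotation_invariant(x, prototypes):
    """For each class, 'derotate' the input to best match that class's
    upright prototype, then classify among the derotated candidates.
    A 1-D circular shift stands in for image rotation (sketch only)."""
    best_class, best_score = None, -np.inf
    for c, proto in enumerate(prototypes):
        # hypothesize class c: find the shift maximizing similarity
        scores = [np.dot(np.roll(x, -s), proto) for s in range(len(x))]
        s = int(np.argmax(scores))
        if scores[s] > best_score:
            best_class, best_score = c, scores[s]
    return best_class
```

At least one hypothesized derotation puts the pattern in its "upright" pose, so a classifier trained only on upright examples suffices, which is the key point of the abstract.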
 Where Does the Population Vector of Motor Cortical Cells Point during Reaching Movements?  Pierre Baraduc, Emmanuel Guigon, Yves Burnod
 Visually-guided arm reaching movements are produced by distributed neural
networks within parietal and frontal
regions of the cerebral cortex. Experimental data indicate that (1) single
neurons in these regions are broadly
tuned to parameters of movement; (2) appropriate commands are elaborated by
populations of neurons; (3) the coordinated action of neurons can be visualized
using a neuronal population vector (NPV). However, the NPV provides only a rough
estimate of movement parameters (direction, velocity) and may even fail to
reflect the parameters of movement when arm posture is changed. We designed a
model of the cortical motor command to investigate the relation between the
desired direction of the movement, the actual direction of movement and the
direction of the NPV in motor cortex. The model is a two-layer self-organizing
neural network which combines broadly-tuned (muscular) proprioceptive and
(cartesian) visual information to calculate (angular) motor commands for the
initial part of the movement of a two-link arm. The network was trained by motor
babbling in 5 positions. Simulations showed that (1) the network produced
appropriate movement direction over a large part of the workspace; (2) small
deviations of the actual trajectory from the desired trajectory existed at the
extremities of the workspace; (3) these deviations were accompanied by large
deviations of the NPV from both trajectories. These results suggest the NPV does
not give a faithful image of cortical processing during arm reaching movements.
 Tractable Variational Structures for Approximating Graphical Models  David Barber, Wim Wiegerinck
 Graphical models provide a broad framework for probabilistic
inference, with application to such diverse areas as speech
recognition (Hidden Markov Models), medical diagnosis (Belief
Networks) and artificial intelligence (Boltzmann Machines). However,
the computing time is typically exponential in the number of nodes in
the graph. We present a general framework for a class of approximating
models, based on the Kullback-Leibler divergence between an
approximating graph and the original graph. In addition to unifying
the node-elimination and structural variational frameworks, we provide
generalised meanfield equations for both directed and undirected
graphs. Simulation results on a small benchmark problem suggest that
this method compares favourably against others previously reported in
the literature.
 Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks  Peter L. Bartlett, Vitaly E. Maiorov, Ron Meir
 We compute upper and lower bounds on the VC dimension of
feedforward networks of units with piecewise polynomial
activation functions. We show that if the number of layers
is fixed, then the VC dimension grows as $W\log W$, where
$W$ is the number of parameters in the network. This contrasts
with the case where the number of layers is unbounded, in which
case the VC dimension grows as $W^2$.
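In display form, the fixed-depth result can be paraphrased as follows (the constants depending on the depth; this is a restatement, not the paper's exact theorem):

```latex
% Networks of fixed depth L with W parameters and piecewise
% polynomial activations:
c_1(L)\, W \log W \;\le\; \mathrm{VCdim} \;\le\; c_2(L)\, W \log W,
% whereas without a depth bound the VC dimension can grow as W^2.
```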
 Semi-Supervised Support Vector Machines  Kristin Bennett, Ayhan Demiriz
 We introduce a semi-supervised support vector machine method (SSSVM).
Given a training set of labeled data and a working set of unlabeled data,
SSSVM constructs a support vector machine using both the training and
working sets. We use SSSVM to solve the overall risk minimization
problem (ORM) posed by Vapnik. The ORM problem is to estimate the value
of a classification function at the given points in the working set. This
contrasts with the standard learning problem of empirical risk minimization
which estimates the classification function at all possible values.
We propose a general SSSVM model that minimizes both the misclassification
error and the function capacity based on all the available data. We show
how the SSSVM model for 1-norm linear support vector machines can be
converted to a mixed-integer program (MIP) and then solved exactly using
integer programming. Results of SSSVM-MIP and the
standard ERM approach are compared on eleven data sets. Our computational
results support the statistical learning theory results showing that
incorporating working data improves generalization when insufficient
training information is available. In every case, SSSVM either improved
or showed no significant difference in generalization compared to the
standard approach.
 Evidence for a Forward Dynamics Model in Human Adaptive Motor Control  Nikhil Bhushan, Reza Shadmehr
 Based on computational principles, the concept of an internal model
for adaptive motor control has been divided into a forward and an inverse
model. However, there is as yet little evidence that learning control
by the CNS is through adaptation of one or the other. Here we
demonstrate that three types of adaptive control processes are possible,
based on only one or a combination of inverse and forward models. For fast
reaching movements of the hand in novel force fields, the three processes
result in hand trajectories that differ from each other. We show that the
process involving a combination of a forward and an inverse model for
adaptive motor control results in a control system where key
characteristics of performance match the kinematics of human subjects. In
contrast,
adaptive control systems that rely only on an inverse model or only on a
forward model fail to produce the kinematic features observed in the
subjects. Our results provide strong evidence that learning control of
novel dynamics is via adaptation of the forward model.
 Lazy Learning Meets the Recursive Least Squares Algorithm  Mauro Birattari, Gianluca Bontempi, Hugues Bersini
 Lazy learning is a memory-based technique that, once a query is
received, extracts a prediction by locally interpolating the neighboring
examples of the query which are considered relevant according to a
distance measure. In this paper we propose a data-driven method to
select on a query-by-query basis the optimal number of neighbors to be
considered for each prediction. As an efficient way to identify and
validate local models, the recursive least squares algorithm is
introduced in the context of local approximation and lazy
learning. Furthermore, besides the winner-takes-all strategy
for model selection, a local combination of the most promising
models is explored.
The proposed method is tested on six different datasets and compared
with a state-of-the-art approach.
 Bayesian PCA  Christopher M. Bishop
 The technique of principal component analysis (PCA) has recently been
expressed as the maximum likelihood solution for a generative
latent variable model. In this paper we use
this probabilistic reformulation as the basis for a Bayesian treatment
of PCA. Our key result is that the effective dimensionality of the
latent space (equivalent to the number of retained principal
components) can be determined automatically as part of the Bayesian
inference procedure. An important application of this framework is to
mixtures of probabilistic PCA models, in which each component can
determine its own effective complexity.
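The maximum-likelihood probabilistic PCA solution that this Bayesian treatment builds on can be sketched via an eigendecomposition (a standard fit of the generative latent variable model; the Bayesian dimensionality selection itself is not reproduced here):

```python
import numpy as np

def ppca_ml(X, q):
    """Maximum-likelihood probabilistic PCA for a fixed latent
    dimensionality q: noise variance is the mean of the discarded
    eigenvalues, loadings come from the top-q eigenvectors.
    (Sketch of the generative model underlying the Bayesian treatment.)"""
    n, d = X.shape
    Xc = X - X.mean(0)
    S = Xc.T @ Xc / n                      # sample covariance
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]        # sort eigenvalues descending
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[q:].mean()              # ML estimate of noise variance
    W = evecs[:, :q] @ np.diag(np.sqrt(evals[:q] - sigma2))
    return W, sigma2
```

In the Bayesian version described above, q need not be fixed by hand: the inference procedure itself suppresses unneeded latent directions.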
 Learning Multiclass Dynamics  Andrew Blake, Ben North, Michael Isard
 A probabilistic algorithm is presented for learning the dynamics of
complex motions. Complexity is taken into account by allowing multiple
classes of motion, and an AutoRegressive Process (ARP) associated
with each class. Training sets need incorporate only indirect
observations of motion, and this allows for sensory noise. A learning
algorithm is presented for this problem based on propagation of
random samples. Experiments have been performed with visually observed
juggling, and plausible dynamical models are found to emerge from the learning process.
 Approximate Learning of Dynamic Models  Xavier Boyen, Daphne Koller
 Inference is a key component in learning probabilistic models from partially
observable data. When learning temporal models, each of the many inference
phases requires a complete traversal over a potentially very long sequence;
furthermore, the data structures propagated in this procedure can be extremely
large, making the whole process very demanding. In a previous paper, we
described an approximate inference algorithm for monitoring stochastic
processes, and proved bounds on its approximation error. In this paper, we apply
this algorithm as an approximate forward propagation step in an EM algorithm
for learning temporal Bayesian networks. We also provide a related
approximation for the backward step, and prove error bounds for the combined
algorithm. We show that EM using our inference algorithm is much faster than EM
using exact inference, with no degradation of the quality of the learned model.
We then extend our analysis to the online learning task, showing a bound on the
error resulting from restricting attention to a small window of observations.
We present an online EM learning algorithm for dynamic systems, and show that
it learns much faster than standard offline EM.
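The approximate forward-propagation step can be caricatured on a two-variable process: propagate the joint belief exactly through the transition model, then project it onto a product of per-variable marginals. This is a hypothetical minimal sketch of the projection idea, not the authors' algorithm for structured temporal networks:

```python
import numpy as np

def bk_forward_step(belief, T, project=True):
    """One forward step with marginal projection, for 2 binary
    variables (joint state = 2*x1 + x2).  T[i, j] = P(next=j | cur=i)."""
    joint = T.T @ belief                    # exact propagation (4 states)
    if project:
        b = joint.reshape(2, 2)
        m1, m2 = b.sum(1), b.sum(0)         # per-variable marginals
        joint = np.outer(m1, m2).ravel()    # approximate: product form
    return joint
```

The projection keeps the belief-state representation compact (marginals only) while, by construction, leaving each variable's marginal unchanged at that step.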
 Structure Discovery via Entropy Minimization  Matthew Brand
 We introduce a novel framework for simultaneous structure and parameter
learning in hiddenvariable conditional probability models, based on an
entropic prior and a solution for its maximum a posteriori (MAP)
estimator. The MAP estimate minimizes uncertainty in all respects:
crossentropy between model and data, entropy of the model, entropy of the
data's descriptive statistics. Iterative estimation extinguishes weakly
supported parameters, compressing and sparsifying the model. Trimming
operators accelerate this process by removing excess parameters and, unlike
most pruning schemes, guarantee an increase in posterior probability.
Entropic estimation takes a large random model and simplifies it, inducing
the structure of relations between hidden and observed variables. Applied
to hidden Markov models (HMMs), it finds a concise finitestate machine
representing the hidden structure of a signal. We entropically model
music, handwriting, and video time-series, and show that the resulting
models are highly concise, structured, predictive, and
interpretable: surviving states tend to be highly correlated with
meaningful partitions of the data, while surviving transitions provide a
low-perplexity model of the signal dynamics.
 Fisher Scoring and a Mixture of Modes Approach for Approximate Inference and Learning in Nonlinear State Space Models  Thomas Briegel, Volker Tresp
 We present Monte Carlo generalized EM equations for learning in nonlinear
state space models. The difficulties lie in the Monte Carlo E-step, which
consists of sampling from the posterior distribution of the hidden
variables given the observations. The new idea presented in this paper is
to generate samples from a Gaussian approximation to the true posterior,
from which it is easy to obtain independent samples. The parameters of the
Gaussian approximation are derived either from the extended Kalman filter
or the Fisher Scoring algorithm. In case the posterior density is
multimodal, we propose to approximate the posterior by a sum of Gaussians
(mixture of modes approach). We show that sampling from the approximate
posterior densities obtained by the above algorithms leads to better
models than using point estimates for the hidden states. In our
experiment, the Fisher Scoring algorithm obtained a better approximation
of the posterior mode than the extended Kalman filter. For a multimodal
distribution, the mixture of modes approach gave superior results.
 Nonlinear PI Control Inspired by Biological Control Systems  Lyndon J. Brown, Gregory E. Gonye, James S. Schwaber
 A nonlinear modification to PI control has been developed which
is appropriate for plants requiring exact setpoint matching and disturbance
attenuation in the presence of infrequent step changes in load disturbances
or setpoint. This control algorithm, labeled PII (proportional with
intermittent integral), is motivated by a model of a signal transduction
pathway active in mammalian blood pressure regulation.
The proportional aspect of the controller is independently designed to be
a disturbance attenuator, and setpoint matching will be achieved by
intermittently invoking an integral controller. The mechanisms observed in
the biological control model are used to tie these controllers together.
Improved performance over PI control is shown on a model of cyclopentenol
production. PI control fails at the peak operating point, as a sign change
in plant gain results in an unstable closed-loop system. Application of this
new approach to this problem results in stable exact setpoint matching for
achievable setpoints.
 Optimizing Admission Control while Ensuring Quality of Service in Multimedia Networks via Reinforcement Learning  Timothy X Brown, Hui Tong, Satinder Singh
 This paper examines the application of reinforcement learning to a
telecommunications networking problem. The problem requires that
revenue be maximized while simultaneously meeting a quality of service
constraint that forbids entry into certain states. We present a general
solution to this multicriteria problem that is able to earn
significantly higher revenues than alternatives.
 Analog VLSI Cellular Implementation of the Boundary Contour System  Gert Cauwenberghs, James Waskiewicz
 We present an analog VLSI cellular architecture implementing a simplified
version of the Boundary Contour System (BCS) for real-time focal-plane
image processing. Inspired by neuromorphic models across several layers of
visual cortex, the design integrates in each pixel the functions of simple
cells, complex cells, hypercomplex cells, and bipole cells, each tuned to
one of three orientations and interconnected on a hexagonal grid. Analog
current-mode CMOS circuits are used throughout to perform edge detection,
local inhibition, directionally selective long-range diffusive kernels,
and renormalizing global gain control. Experimental results from a
fabricated $12 \times 10$ pixel prototype in 1.2~$\mu$m CMOS technology
demonstrate the robustness of the BCS model in selecting image contours in
a cluttered and noisy background.
 Complex Cells as Cortically Amplified Simple Cells  Frances S. Chance, Sacha B. Nelson, L. F. Abbott
 Complex cells are selective for stimulus orientation and spatial frequency
(like simple cells) but not for spatial phase. While complex cell
responses to drifting gratings are relatively unmodulated, the temporal
frequency of complex cell responses to counterphase gratings is twice that
of the stimulus. We show that complex cell responses can arise through
recurrent interactions within a network of cortical cells. With weak or
absent recurrent connections, the model cells in the network show responses
similar to simple cells. Strong recurrent connections result in complex
cell responses. Our results suggest that simple and complex cells
represent the weakly and strongly coupled regimes of the same basic neural
circuit.
 Invited Talk: Statistical Natural Language Processing: Better Living Through Floating-point Numbers  Eugene Charniak
 Over the last ten years or so the field of natural language processing
(NLP) has become increasingly dominated by corpus-based methods and
statistical techniques. In this research, problems are attacked by
collecting statistics from a corpus (sometimes marked with correct
answers, sometimes not) and then applying the statistics to new
instances of the task. In this talk we give an overview of
statistical techniques in many areas of NLP, such as parsing (finding
the correct phrase structure for a sentence), lexical semantics
(learning meanings and other properties of words and phrases from
text), anaphora resolution (determining the intended antecedent of
pronouns, and noun phrases in general), and word-sense disambiguation
(finding the correct sense in context of a word with multiple
meanings). As a general rule, corpus-based, and particularly
statistical, techniques outperform hand-crafted systems, and the rate
of progress in the field is still quite high.
 Neuronal Regulation Implements Efficient Synaptic Pruning  Gal Chechik, Isaac Meilijson, Eytan Ruppin
 Human and animal studies show that the mammalian brain undergoes
massive synaptic pruning during childhood, removing about
half of the synapses by puberty.
We have previously shown that maintaining network memory
performance while synapses are deleted requires that synapses
be properly modified and pruned, removing the weaker
synapses. We now show that neuronal regulation, a mechanism
recently observed to maintain the average neuronal input field,
results in weightdependent synaptic modification.
Under the correct range of the degradation
dimension and synaptic upper bound, neuronal regulation
removes the weaker synapses and judiciously modifies the remaining
synapses. It implements near optimal synaptic modification, and
maintains the memory performance of a network undergoing massive
synaptic pruning. Thus, this paper shows that in addition to the
known effects of Hebbian changes, neuronal regulation may play an
important role in the selforganization of brain networks during
development.
 Perceiving Without Learning: from Spirals to Inside/Outside Relations  Ke Chen, DeLiang L. Wang
 As a benchmark task, the spiral problem is well known in neural networks.
Unlike previous work that emphasizes learning, we approach the problem
from a generic perspective that does not involve learning. We point out
that the spiral problem is intrinsically connected to the inside/outside
problem. A generic solution to both problems is proposed based on
oscillatory correlation using a time delay network. Our simulation results
are qualitatively consistent with human performance, and we interpret
human limitations in terms of synchrony and time delays, both biologically
plausible. As a special case, our network without time delays can always
distinguish these figures regardless of shape, position, size, and
orientation. We conjecture that visual perception will be effortful if
local activation cannot be rapidly propagated, as synchrony would not be
established in the presence of time delays.
 A Model for Associative Multiplication  G. Bjorn Christianson, Suzanna Becker
 Despite the fact that arithmetic is based on only a few hundred basic
facts and some simple algorithms, humans have a difficult time
mastering the subject, and make mistakes even with experience.
Associative multiplication, the process of doing multiplication by
memory without the use of rules or algorithms, is especially
problematic. Humans have certain characteristic errors in performing
associative multiplications, both in the type of error and in the
error frequency. We propose a model for the process of associative
multiplication, and compare the results with both data from normal
humans and from the model proposed by Anderson et al.
 A Micropower CMOS Adaptive Amplitude and Shift Invariant Vector Quantiser  Richard Coggins, Raymond Wang, Marwan Jabri
 In this paper we describe the architecture, implementation and
experimental results for an Intracardiac Electrogram (ICEG) classification
and compression chip. The chip processes and vector-quantises
30~dimensional analogue vectors while consuming a maximum of 2.5~$\mu$W
power for a heart rate of 60~beats per minute (1~vector per second) from a
3.3~V supply. This represents a significant advance on previous work which
achieved ultra low power supervised morphology classification since the
template matching scheme used in this chip enables unsupervised blind
classification of abnormal rhythms and the computational support for low
bit rate data compression. The adaptive template matching scheme used is
tolerant to amplitude variations, and inter- and intra-sample time shifts.
 Dynamics of Supervised Learning with Restricted Training Sets  A.C.C. Coolen, David Saad
 We study the dynamics of supervised learning in layered neural networks,
in the regime where the size p of the training set is proportional to the
number N of inputs. Here the local fields are no longer described by
Gaussian distributions. We use dynamical replica theory to predict the
evolution of macroscopic observables, including the relevant error measures,
incorporating the old formalism in the limit where p/N goes to infinity.
 Adding Constrained Discontinuities to Gaussian Process Models of Wind Fields  Dan Cornford, Ian T. Nabney, Christopher K. I. Williams
 Gaussian Processes provide good prior models for spatial data, but can
be too smooth. In many physical situations there are discontinuities
along bounding surfaces, for example fronts in near-surface wind fields.
We describe a modelling method for such a constrained discontinuity and
demonstrate how to infer the model parameters in wind fields with MCMC
sampling.
 A Phase Space Approach to Minimax Entropy Learning and the Minutemax Approximation  James M. Coughlan, A.L. Yuille
 There has been much recent work on learning probability distributions on
images and on image statistics. We observe that the mapping from images to
statistics is manytoone and show it can be quantified by a phase space
factor. This phase space approach throws light on the Minimax Entropy
technique for learning Gibbs distributions on images with potentials derived
from image statistics and elucidates the ambiguities that are inherent to
determining the potentials. In addition, it shows that if the phase factor
can be approximated by an analytic distribution then the computation for
Minimax entropy learning can be vastly reduced. An illustration of this
concept, using a Gaussian to approximate the phase factor, leads to a new
algorithm called ``Minutemax,'' which gives a good approximation to the
results of Zhu and Mumford in just seconds of CPU time. The phase space
approach also gives insight into the multiscale potentials found by Zhu and
Mumford and suggests that the forms of the potentials are influenced greatly
by phase space considerations. Finally, we prove that probability
distributions learned in feature space alone are equivalent to Minimax
Entropy learning with a multinomial approximation of the phase factor.
 Dynamically Adapting Kernels in Support Vector Machines  Nello Cristianini, Colin Campbell, John Shawe-Taylor
 The kernel parameter is one of the few tunable parameters in Support
Vector machines, and controls the complexity of the resulting
hypothesis. The choice of its value amounts to model selection, and is
usually performed by means of a validation set.
We present an algorithm which can automatically perform model selection
and learning with no additional computational cost and with no need
of a validation set. Theoretical results motivating this approach,
providing upper bounds on the generalisation error, and experimental
results confirming its validity are presented.
 Facial Memory Is Kernel Density Estimation (Almost)  Matthew N. Dailey, Garrison W. Cottrell, Thomas A. Busey
 We compare the ability of three exemplarbased memory models, each
using three different face stimulus representations, to account for
the probability a human subject responded ``old'' in an old/new
facial memory experiment. The models are 1) the Generalized Context
Model, 2) SimSample, a probabilistic sampling model, and 3) DBM, a
novel model related to kernel density estimation that explicitly
encodes stimulus distinctiveness. The representations are
1) positions of stimuli in MDS ``face space,'' 2) projections of
test faces onto the eigenfaces of the study set, and 3) a
representation based on response to a grid of Gabor filter jets. Of
the 9 model/representation combinations, only the distinctiveness
model in MDS space predicts the observed ``morph familiarity
inversion'' effect, in which the subjects' false alarm rate for
morphs between similar faces is higher than their hit rate for many
of the studied faces. This evidence is consistent with the
hypothesis that human memory for faces is a kernel density
estimation task, with the caveat that distinctive faces require
larger kernels than do typical faces.
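The kernel-density reading of the hypothesis can be sketched in a few lines; this is an illustrative toy (Gaussian kernels over a low-dimensional face space, hypothetical function names), not the DBM model from the paper:

```python
import math

def kde_familiarity(test_face, studied_faces, bandwidths):
    """Familiarity of a test face as a kernel density estimate over the
    studied exemplars; each exemplar has its own bandwidth, so a
    distinctive face can be given a larger kernel than a typical one."""
    density = 0.0
    for face, h in zip(studied_faces, bandwidths):
        sq_dist = sum((t - f) ** 2 for t, f in zip(test_face, face))
        density += math.exp(-sq_dist / (2 * h * h))
    return density / len(studied_faces)
```

Widening the kernel on a distinctive exemplar spreads its contribution, raising estimated familiarity at moderate distances from that exemplar.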
 Example Based Image Synthesis of Articulated Figures  Trevor Darrell
 We present a method for learning complex appearance mappings, such as
occur with images of articulated objects. Traditional interpolation
networks fail on this case since appearance is not necessarily a
smooth function nor a linear manifold for articulated objects. We
define an appearance mapping from examples by constructing a set of
independently smooth interpolation networks; these networks can cover
overlapping regions of parameter space. A set growing procedure is
used to find example clusters which are well-approximated within their
convex hull; interpolation then proceeds only within these sets of
examples. With this method physically valid images are produced even
in regions of parameter space where nearby examples have different
appearances. We show results generating both simulated and real arm
images.
 Heeger's Normalization, Line Attractor Networks and Ideal Observers  Sophie Deneve, Alexandre Pouget, Peter Latham
 Gain control by divisive inhibition, a.k.a.\ Heeger's normalization,
seems to be a general mechanism throughout the visual cortex. We
explore in this study the statistical properties of this normalization
in the presence of noise. Using simulations, we show that Heeger's
normalization is a close approximation to a maximum likelihood
estimator, which, in the context of population coding, is the same as
an ideal observer. We also demonstrate analytically that this is a
general property of a large class of nonlinear recurrent networks
with line attractors. Our work suggests that Heeger's normalization plays a
critical role in noise filtering, and that every cortical
layer may be an ideal observer of the activity in the preceding layer.
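For concreteness, the normalization itself can be written down directly; the form below (squared activity divided by the pooled squared activity plus a semi-saturation constant) is the standard divisive-inhibition model, while the function name and default constant are illustrative:

```python
def heeger_normalize(activities, sigma=1.0):
    """Divisive (Heeger) normalization: each unit's squared response is
    divided by the summed squared responses of the whole pool plus a
    semi-saturation constant sigma^2."""
    pool = sigma ** 2 + sum(a * a for a in activities)
    return [a * a / pool for a in activities]
```

At high contrast the pool term dominates, so scaling all inputs leaves the relative response pattern essentially unchanged.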
 Vertex Identification in High Energy Physics Experiments  Gideon Dror, Halina Abramowicz, David Horn
 In High Energy Physics experiments one has to sort through
a high flux of events, at a rate of tens of MHz, and select
the few that are of interest.
In making this decision one relies on the
location of the vertex where the interaction that led to the event
took place. Here we present a solution to this problem, based on two
feedforward neural networks with fixed architectures, whose parameters
are chosen so as to obtain a high accuracy.
The system is tested on many data sets, and is shown to perform better
than commonly used algorithms.
 Phase Diagram and Storage Capacity of Sequence Storing Neural Networks  A. During, A.C.C. Coolen, D. Sherrington
 We solve the dynamics of Hopfield-type neural networks which store
sequences of patterns, close to saturation. The asymmetry of the
interaction matrix in such models leads to violation of detailed
balance, ruling out an equilibrium statistical mechanical analysis.
Using generating functional methods we derive exact closed equations
for dynamical order parameters, viz. the sequence overlap and
correlation and response functions, in the limit of an infinite system
size. We calculate the time translation invariant solutions of these
equations, describing stationary limit-cycles, which leads to a phase
diagram. The effective retarded selfinteraction usually appearing in
symmetric models is here found to vanish, which causes a significantly
enlarged storage capacity of $\alpha_c=0.269$, compared to $\alpha_c=0.139$
for Hopfield networks storing static patterns. Our results are tested
against extensive computer simulations and excellent agreement is found.
 Optimizing Correlation Algorithms for Hardware-based Transient Classification  R. Timothy Edwards, Gert Cauwenberghs, Fernando Pineda
 The performance of dedicated VLSI neural processing hardware depends
critically on the design of the implemented algorithms. We have
previously proposed an algorithm for acoustic transient classification.
Having implemented and demonstrated this algorithm in
a mixed-mode architecture, we now investigate variants on the algorithm,
using time and frequency channel differencing, input and output
normalization, and schemes to binarize and train the
template values, with the goal of achieving optimal classification
performance for the chosen hardware.
 VLSI Implementation of Motion Centroid Localization for Autonomous Navigation  Ralph Etienne-Cummings, Mohammed Abdel Ghani, Viktor Gruev
 A circuit for fast, compact and low-power focal-plane motion centroid
localization is presented. This chip, which uses mixed-signal CMOS components
to implement photodetection, edge detection, ON-set detection and centroid
localization, models the retina and superior colliculus. The centroid localization
circuit uses time-windowed asynchronously triggered row and column address events
and two linear resistive grids to provide the analog coordinates of the motion
centroid. This VLSI chip is used to realize fast lightweight autonavigating vehicles.
The obstacle-avoiding line-following algorithm is discussed.
 Finite-dimensional Approximation of Gaussian Processes  Giancarlo Ferrari-Trecate, Christopher K. I. Williams, Manfred Opper
 Gaussian process (GP) prediction suffers from $O(n^3)$ scaling with the data
set size $n$. By using a finite-dimensional basis to approximate the GP
predictor, the computational complexity can be reduced. We derive optimal
finite-dimensional predictors under a number of assumptions, and show the
superiority of these predictors over the Projected Bayes Regression method
(which is asymptotically optimal). We also show how to calculate the minimal
model size for a given $n$. The calculations are backed up by numerical
experiments.
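A generalised-linear sketch of the cost argument, assuming a fixed set of Gaussian basis functions (the paper derives the optimal finite-dimensional predictor; this toy only illustrates the $O(nm^2)$ versus $O(n^3)$ structure, and all names are ours):

```python
import numpy as np

def finite_basis_predict(x_train, y_train, x_test, centers,
                         lengthscale=0.2, noise=1e-4):
    """Regression with an m-dimensional Gaussian basis: solving for the
    m weights costs O(n m^2), versus O(n^3) for a full GP on n points."""
    def phi(x):
        # n x m matrix of Gaussian bumps centred at `centers`
        return np.exp(-(x[:, None] - centers[None, :]) ** 2
                      / (2 * lengthscale ** 2))
    P = phi(x_train)
    w = np.linalg.solve(P.T @ P + noise * np.eye(len(centers)),
                        P.T @ y_train)
    return phi(x_test) @ w
```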
 Using Statistical Properties of a Labelled Visual World to Estimate Scenes  William T. Freeman, Egon C. Pasztor
 A goal of low-level vision algorithms is to infer underlying scene
information (such as velocities, shapes, or reflectances) from an
image or set of images. We introduce a simple, training-based method:
learn the probability of every scene interpretation for any local
image patch, and the probability that any local scene neighbors
another. The former probabilities give scene estimates from local
image data, and the latter allow the local estimates to propagate. We
use Markov networks to optimally propagate the local information,
represented as mixtures of Gaussians.
We demonstrate the technique for two problems: motion estimation, and
the disambiguation of shading and reflectance. We find that the local
statistical information, without added problem domain knowledge, finds
good global scene interpretations.
 Global Optimisation of Neural Network Models Via Sequential Sampling  Nando de Freitas, Mahesan Niranjan, Arnaud Doucet, Andrew Gee
 We propose a novel strategy for training neural networks using sequential
Monte Carlo algorithms. In particular, we discuss an efficient hybrid
gradient descent/sampling importance resampling algorithm that we
developed recently, under the name of hybrid SIR. This global optimisation
approach allows us to learn the probability distribution of the network
weights and outputs in a sequential framework. It is well suited to
applications involving online, nonlinear, nonGaussian or nonstationary
signal processing. We show how the new algorithm outperforms extended
Kalman filter training on several simulated examples and on a real
application involving the pricing of option contracts traded in financial
markets.
 Efficient Bayesian Parameter Estimation in Large Discrete Domains  Nir Friedman, Yoram Singer
 In this paper we examine the problem of estimating the parameters of a
multinomial distribution over a large number of discrete outcomes, most
of which do not appear in the training data. We analyze this problem
from a Bayesian perspective and develop a hierarchical prior that
incorporates the assumption that the observed outcomes constitute only
a small subset of the possible outcomes. We show how to efficiently
perform exact inference with this form of hierarchical prior
and compare our method to standard approaches to demonstrate its merits.
 Synergy and Redundancy Among Brain Cells of Behaving Monkeys  Itay Gat, Naftali Tishby
 Determining the relationship between the activity of a single nerve
cell and that of an entire population is a fundamental question that
bears on basic neural computation paradigms. In this paper we
apply an information theoretic approach to quantify the level of
cooperative activity among cells in a behavioral context. It is
possible to discriminate between synergetic activity of the cells
vs. redundant activity, depending on the difference between the
information they provide when measured jointly and the information
they provide independently. We define a synergy value that is positive
in the first case and negative in the second and show that the synergy
value can be measured by detecting the behavioral mode of the animal
from simultaneously recorded activity of the cells. We observe that
among cortical cells positive synergy can be found, while cells from
the basal ganglia, active during the same task, do not exhibit similar
synergetic activity.
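The synergy measure described above can be estimated directly from sample counts; a minimal plug-in illustration (function names are ours), where XOR-like cells are synergetic and identical cells are redundant:

```python
import math
from collections import Counter

def mutual_info(pairs):
    """I(X;Y) in bits from a list of (x, y) samples (plug-in estimate)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def synergy(behavior, cell_a, cell_b):
    """Positive when the cells carry information jointly that neither
    carries alone; negative when they are redundant."""
    joint = list(zip(behavior, zip(cell_a, cell_b)))
    return (mutual_info(joint)
            - mutual_info(list(zip(behavior, cell_a)))
            - mutual_info(list(zip(behavior, cell_b))))
```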
 A Randomized Algorithm for Pairwise Clustering  Yoram Gdalyahu, Daphna Weinshall, Michael Werman
 We present a stochastic clustering algorithm based on pairwise
similarity of datapoints.
Our method extends existing deterministic methods, including
agglomerative algorithms, min-cut graph algorithms, and connected
components, thus it provides a common framework for all these methods.
Our graphbased method differs from existing
stochastic methods which are based on analogy to physical systems.
The stochastic nature of our method makes it more robust against
noise, including accidental similarity relations and small spurious clusters.
We demonstrate the superiority of our algorithm using an example with 3
spiraling bands and a considerable amount of noise.
 Classification with Linear Threshold Functions and the Linear Loss  Claudio Gentile, Manfred K. Warmuth
 We describe a method for proving relative
loss bounds for online learning algorithms
that use linear threshold functions for classifying the examples.
For instance, the Perceptron algorithm and Winnow are such learning
algorithms. For classification problems the discrete loss is used,
i.e., the total number of prediction mistakes.
We introduce a continuous loss function called the linear loss.
Our method consists of first proving bounds w.r.t.
the linear loss and then converting these bounds to the discrete loss.
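As a concrete instance of the discrete loss, here is the classic Perceptron with a mistake counter; the linear loss itself is the paper's construction and is not reproduced here:

```python
def perceptron_mistakes(examples, epochs=1):
    """Run the classic Perceptron on (x, y) pairs with labels +/-1 and
    return the discrete loss (total number of prediction mistakes)
    together with the final weight vector."""
    w = [0.0] * len(examples[0][0])
    mistakes = 0
    for _ in range(epochs):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                mistakes += 1
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes, w
```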
 Invited Talk: Things That Think  Neil Gershenfeld
 At the rate of current progress, VLSI scaling will reach fundamental
limits early in the next century. Beyond performance, many of the
most severe constraints are increasingly associated with the cost and
packaging of bringing processing to where it is needed. An alternative to
increasingly heroic engineering means to meet these demands is to become
smarter in how we take advantage of the inherent computational
capabilities of simple physical systems. I will illustrate this promise
with a few examples, including molecular quantum computation, coding in
dissipative dynamical systems, and the response of driven musical
instruments.
 Learning Nonlinear Stochastic Dynamics Using the Generalized EM Algorithm  Zoubin Ghahramani, Sam T. Roweis
 The Expectation-Maximization (EM) algorithm is an iterative procedure
for maximum likelihood parameter estimation from data sets with
missing or hidden variables (Dempster, Laird and Rubin, 1977). It has
been applied to system identification in linear stochastic state-space
models, where the state variables are hidden from the observer and
both the state and the parameters of the model have to be estimated
simultaneously (Shumway and Stoffer, 1982). We present a
generalization of the EM algorithm for parameter estimation in
nonlinear dynamical systems. The ``expectation'' step makes use of
Extended Kalman Smoothing to estimate the state, while the
``maximization'' step reestimates the parameters using these
uncertain state estimates. In general, the nonlinear maximization step
is difficult because it requires integrating out the uncertainty in
the states. However, if Gaussian radial basis function (RBF)
approximators are used to model the nonlinearities, the integrals
become tractable and the maximization step can be solved via systems
of linear equations.
 Classification on Pairwise Proximity Data  Thore Graepel, Ralf Herbrich, Peter Bollmann-Sdorra, Klaus Obermayer
 We investigate the problem of learning a classification task
on data represented in terms of their pairwise proximities. This
representation does not refer to an explicit feature representation of
the data items and is thus more general than the standard approach of
using Euclidean feature vectors, from which pairwise proximities can
always be calculated. Our first approach is based on a combined linear
embedding and classification procedure resulting in an extension of
the Optimal Hyperplane algorithm to pseudoEuclidean data. As an
alternative we present another approach based on a linear threshold
model in the proximity values themselves, which is optimized using
Structural Risk Minimization. We show that prior knowledge about the
problem can be incorporated by the choice of distance measures and
examine different metrics w.r.t. their generalization. Finally, the
algorithms are successfully applied to protein structure data and to
data from the cat's cerebral cortex. They show better performance than
K-nearest-neighbor classification.
 Lasso is Equivalent to Adaptive Quadratic Penalization  Yves Grandvalet, St{\'e}phane Canu
 Adaptive ridge is a special form of ridge regression, balancing the quadratic penalization on each parameter of the model.
We show the equivalence between adaptive ridge and lasso (least absolute shrinkage and selection operator).
This equivalence states that both procedures produce the same estimate.
Least absolute shrinkage can thus be viewed as a particular quadratic penalization.
From this observation, we derive a fixed-point algorithm to compute the lasso solution.
We finally present a series of possible applications of this type of algorithm in regression problems: kernel regression, additive modeling and neural net training.
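One way such a fixed-point scheme can look, as a sketch: each ridge step uses per-coefficient quadratic penalties rebalanced from the previous estimate. This is our illustration of the idea, not necessarily the authors' exact update; `lam` weights the $\ell_1$ penalty $\mathtt{lam}\sum_j|\beta_j|$:

```python
import numpy as np

def lasso_by_adaptive_ridge(X, y, lam, iters=200, eps=1e-8):
    """Iteratively rebalanced ridge regression: coefficient j gets the
    quadratic penalty (lam / (2 |beta_j|)) * beta_j^2, recomputed from
    the previous estimate; the fixed point satisfies the stationarity
    conditions of the lasso penalty lam * sum_j |beta_j|."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        D = np.diag(lam / (2.0 * (np.abs(beta) + eps)))
        beta = np.linalg.solve(X.T @ X + D, X.T @ y)
    return beta
```

On an orthonormal design this converges to soft thresholding of the least-squares coefficients at lam/2, which is the known lasso solution in that case.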
 Familiarity Discrimination of Radar Pulses  Eric Granger, Stephen Grossberg, Mark A. Rubin, William W. Streilein
 The ARTMAP-FD neural network performs both identification
(placing test patterns in classes encountered during training)
and familiarity discrimination (judging whether a test pattern
belongs to any of the classes encountered during training).
The performance of ARTMAP-FD is tested on radar pulse data obtained
in the field, and compared to that of the nearest-neighbor-based
NEN algorithm and to a $k > 1$ extension of NEN.
 Fast Neural Network Emulation of Physics-Based Models for Computer Animation  Radek Grzeszczuk, Demetri Terzopoulos, Geoffrey Hinton
 Computer animation through the numerical simulation of physics-based
graphics models offers unsurpassed realism, but it can be
computationally demanding. This paper demonstrates the possibility of
replacing the numerical simulation of nontrivial dynamic models with a
dramatically more efficient ``NeuroAnimator'' that exploits neural
networks. NeuroAnimators are automatically trained offline to
emulate physical dynamics through the observation of physics-based
models in action. Depending on the model, its neural network emulator
can yield physically realistic animation one or two orders of
magnitude faster than conventional numerical simulation. We
demonstrate NeuroAnimators for a variety of physics-based models.
 A Neuromorphic Monaural Sound Localizer  John G. Harris, ChiangJung Pu, Jose C. Principe
 We describe the first single microphone sound localization system
and its inspiration from theories of human monaural sound
localization. Reflections and diffractions caused by the external
ear (pinna) allow humans to estimate sound source elevations using
only one ear. Our single microphone localization model relies on a
specially shaped reflecting structure that serves the role of the
pinna. Specially designed analog VLSI circuitry uses echo-time
processing to localize the sound. A CMOS integrated circuit has been
designed, fabricated, and successfully demonstrated on actual
sounds.
 Multiple Paired ForwardInverse Models for Human Motor Learning and Control  Masahiko Haruno, Daniel M. Wolpert, Mitsuo Kawato
 Humans demonstrate a remarkable ability to generate accurate and
appropriate motor behavior under many different and often uncertain
environmental conditions. This paper describes a new modular approach
to human motor learning and control, based on multiple pairs of
inverse (controller) and forward (predictor) models. This
architecture simultaneously learns the multiple inverse models
necessary for control as well as how to select the inverse models
appropriate for a given environment. Simulations of object
manipulation demonstrate the ability to learn multiple objects,
appropriate generalization to novel objects and the inappropriate
activation of motor programs based on visual cues, followed by online
correction, seen in the ``sizeweight illusion''.
 GLS: a Hybrid Classifier System Based on POMDP Research  Akira Hayashi, Nobuo Suematsu
 Classifier systems are now viewed as disappointing because of
problems such as the rule strength vs.\ rule set performance
problem and the credit assignment problem. In order to solve these
problems, we have developed a hybrid classifier system: GLS
(Generalization Learning System). In designing GLS, we view
classifier systems as model free learning in POMDPs and take a hybrid
approach to finding the best generalization, given the total number of
rules. GLS uses the policy improvement procedure by Jaakkola et
al. for the optimal stochastic policy when a set of rule conditions is
given. GLS uses a GA to search for the best set of rule conditions.
 Visualizing Group Structure  Marcus Held, Jan Puzicha, Joachim M. Buhmann
 Cluster analysis is a fundamental principle in exploratory data
analysis, providing the user with a description of the group structure
of given data. A key problem in this context is the interpretation and
visualization of clustering solutions in high-dimensional or abstract
data spaces. In particular, fuzzy or probabilistic descriptions of the
group structure, essential to capture inter-cluster relations, are hardly
assessable by simple inspection of the probabilistic assignment
variables. We present a novel approach for the visualization of
probabilistic group structure based on a statistical model of the object
assignments which have been observed or estimated by a probabilistic
clustering procedure. The objects or data points are embedded in a
low-dimensional Euclidean space by approximating the observed data
statistics with a Gaussian mixture model. The algorithm provides a new
approach to the visualization of the inherent structure for a broad
variety of data types, e.g.~histogram data, proximity data and
co-occurrence data. To demonstrate the power of the approach, histograms
of textured images are visualized as a large-scale data mining application.
 Unsupervised and Supervised Clustering: the Mutual Information Between Parameters and Observations  Didier Herschkowitz, JeanPierre Nadal
 Recent works in parameter estimation and neural coding have demonstrated
that optimal performance is related to
the mutual information between parameters and data.
We study this mutual information
for a family of supervised and unsupervised learning tasks.
More precisely, we consider the case
where the conditional probability distribution of each observation
depends on the parameter through their scalar product only,
the parameter and the observations
being vectors in a possibly high-dimensional space.
\par
We derive exact bounds and exact
asymptotic behaviours for the mutual information as a function
of the data size and of some properties of the probability of the data given
the parameter.
We also study the behaviour of the mutual information as predicted by replica
calculations. Finally, we discuss the universal
properties of the mutual information, especially in the limit of
large data size.
 An Integrated Vision Sensor for the Computation of Optical Flow Singular Points  Charles M. Higgins, Christof Koch
 A robust, integrative algorithm is presented for computing the
position of the focus of expansion or axis of rotation (the singular point)
in optical flow fields such as those generated by self-motion.
Measurements are shown of a fully parallel CMOS analog VLSI
motion sensor array which computes the direction of local motion (sign of
optical flow) at each pixel and can directly implement this algorithm.
The flow field singular point is computed in
real time with a power consumption of less than 2 mW. Computation of the
singular point for more general flow fields requires
measures of field expansion and rotation, which, it is shown,
can also be computed in real-time hardware, again using only the sign of the
optical flow field. These
measures, along with the location of the singular point, provide robust
real-time self-motion information for the visual guidance of a moving
platform such as a robot.
 Source Separation as a By-Product of Regularization  Sepp Hochreiter, Juergen Schmidhuber
 This paper reveals a previously ignored connection
between two important fields: regularization and
independent component analysis (ICA). We show that at
least one representative of a broad class of algorithms
(regularizers that reduce network complexity) extracts
independent features as a by-product. This algorithm is
Flat Minimum Search (FMS), a recent general method for
finding lowcomplexity networks with high generalization
capability. FMS works by minimizing both training error
and required weight precision. According to our
theoretical analysis, the hidden layer of an FMS-trained
autoassociator attempts to code each input by a sparse
code with as few simple features as possible. In
experiments, the method extracts optimal codes for
difficult versions of the ``noisy bars'' benchmark
problem by separating the underlying sources, whereas
ICA and PCA fail. Real world images are coded with
fewer bits per pixel than by ICA or PCA.
 Learning from Dyadic Data  Thomas Hofmann, Jan Puzicha, Michael Jordan
 Dyadic data refers to a domain with two finite sets of
objects in which observations are made for dyads, i.e., pairs
with one element from either set. This type of data arises naturally in
many applications ranging from computational linguistics and
information retrieval to preference analysis and computer vision. In
this paper we present a systematic, domain-independent framework of
learning from dyadic data by statistical mixture models. Our
approach covers different models with flat and hierarchical latent
class structures. We propose an annealed version of the
standard EM algorithm for model fitting which is empirically evaluated
on a variety of data sets from different domains.
 Call-based Fraud Detection in Mobile Communication Networks using a Hierarchical Regime-Switching Model  Jaakko Hollmen, Volker Tresp
 Fraud causes substantial losses to telecommunication carriers.
Detection systems which automatically detect illegal use of the
network can be used to alleviate the problem. Previous approaches
worked on features derived from the call patterns of individual
users. In this paper we present a call-based detection system based on
a hierarchical regime-switching model. The detection problem is
formulated as an inference problem on the regime probabilities.
Inference is implemented by applying the junction tree algorithm to
the underlying graphical model. The dynamics are learned from data
using iterative maximum likelihood EM learning. The methods are
assessed using fraud data from a real mobile communications network.
 Banquet Talk: The Laws of the Web  Bernardo A. Huberman
 The World Wide Web has become in a short period a standard source of
information for a large part of the world's population. Its exponential
growth has transformed it into an ecology of knowledge in which a highly
diverse quantity of information is linked in extremely complex and
arbitrary fashion.
In spite of the sheer size of the Web and its highly interactive nature,
there exist strong statistical regularities that govern the way people use
it and interact with each other. This talk will discuss the existence and
nature of such laws and their experimental verification.
 Graph Matching for Shape Retrieval  Benoit Huet, Andrew D.J. Cross, Edwin R. Hancock
 This paper describes a Bayesian graph matching algorithm for
data mining from large structural databases. The matching algorithm
uses edgeconsistency and node attribute similarity to determine the
{\it a posteriori} probability of a query graph for each of the
candidate matches in the database. The node featurevectors are
constructed by computing normalised histograms of pairwise geometric
attributes. Attribute similarity is assessed by computing the
Bhattacharyya distance between the histograms. Recognition is
realised by selecting the candidate from the database which has the
largest {\it a posteriori} probability. We illustrate the recognition
technique on a database containing 2500 line patterns extracted from
realworld imagery. Here the recognition technique is shown to
significantly outperform a number of algorithm alternatives.
 Sparse Code Shrinkage: Denoising by Maximum Likelihood Estimation  Aapo Hyvarinen, Patrik Hoyer, Erkki Oja
 Sparse coding is a method for finding a representation of
data in which each of the components of the representation is only
rarely significantly active. Such a representation is closely related
to redundancy reduction and independent component analysis, and has
some neurophysiological plausibility. In this paper, we show how
sparse coding can be used for denoising. Using maximum likelihood
estimation of non-Gaussian variables corrupted by Gaussian noise, we
show how to apply a shrinkage nonlinearity on the components of sparse
coding so as to reduce noise. A theoretical analysis of the denoising
capability of the method is given, and it is shown how to choose the
optimal basis for sparse coding. Our method is closely related to the
method of wavelet shrinkage, but has the important benefit over
wavelet methods that both the features and the shrinkage parameters
are estimated directly from the data.
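For a Laplacian component density the shrinkage nonlinearity reduces to soft thresholding; a sketch under that assumption (the threshold $\sqrt{2}\,\sigma^2/d$ for a component of scale $d$ is the standard Laplacian form; other sparse densities give other nonlinearities, and the function name is ours):

```python
import math

def laplacian_shrink(u, noise_sigma, scale):
    """Shrink one sparse-coding component u, assumed Laplacian with
    scale `scale` and observed in Gaussian noise of std noise_sigma:
    soft thresholding at sqrt(2) * noise_sigma**2 / scale."""
    threshold = math.sqrt(2.0) * noise_sigma ** 2 / scale
    return math.copysign(max(abs(u) - threshold, 0.0), u)
```

Components whose magnitude falls below the threshold are set to zero, which is what makes the denoised code sparse.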
 Convergence of the Wake-Sleep Algorithm  Shiro Ikeda, Shunichi Amari, Hiroyuki Nakahara
 The WS (Wake-Sleep) algorithm is a simple learning rule for models
with hidden variables. It is shown that this algorithm can be
applied to a factor analysis model, which is a linear version of the
Helmholtz machine. But even for a factor analysis model, general
convergence has not been proved theoretically. In this article, we
describe a geometrical understanding of the WS algorithm in
contrast with the EM (Expectation-Maximization) algorithm and the
{\em em} algorithm. As a result, we prove the convergence of the WS
algorithm for the factor analysis model. We also show the condition
for the convergence in general models.
 Learning to Find Pictures of People  Sergey Ioffe, David Forsyth
 Finding articulated objects, like people, in pictures presents a
particularly difficult object recognition problem. We show how to
find people by finding putative body segments, and then constructing
assemblies of those segments that are consistent with the constraints
on the appearance of a person that result from kinematic properties.
Since a reasonable model of a person requires at least nine segments,
it is not possible to present every group to a classifier. Instead,
the search can be pruned by using projected versions of a classifier
that accepts groups corresponding to people. We describe an efficient
projection algorithm for one popular classifier, and demonstrate that
our approach can be used to determine whether images of real scenes
contain people.
 Restructuring Sparse High Dimensional Data for Effective Retrieval  Charles Lee Isbell, Jr., Paul Viola
 The task in text retrieval is to find the subset of a collection of
documents relevant to a user's information request, usually expressed
as a set of words. Classically, documents and queries are represented
as vectors of word counts. In its simplest form, relevance is defined
to be the dot product between a document and a query vector, a measure
of the number of common terms. A central difficulty in text retrieval
is that the presence or absence of a word is not sufficient to
determine relevance to a query. Linear dimensionality reduction has
been proposed as a technique for extracting underlying structure from
the document collection. In some domains (such as vision)
dimensionality reduction reduces computational complexity. In text
retrieval it is more often used to improve retrieval performance. We
propose an alternative and novel technique that produces sparse
representations constructed from sets of highly related words.
Documents and queries are represented by their distance to these sets,
and relevance is measured by the number of common clusters. This
technique significantly improves retrieval performance, is efficient
to compute and shares properties with the optimal linear projection
operator and the independent components of documents.
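The classical word-count baseline that this technique improves on is a single dot product; a toy version for reference (names are ours):

```python
from collections import Counter

def relevance(document, query):
    """Classical vector-space relevance: dot product of the two
    word-count vectors, i.e. a weighted count of shared terms."""
    d = Counter(document.lower().split())
    q = Counter(query.lower().split())
    return sum(d[w] * q[w] for w in q)
```

The weakness the abstract points out is visible immediately: a document sharing no literal terms with the query scores zero, however related its content.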
 Attentional Modulation of Human Pattern Discrimination Psychophysics Reproduced by a Quantitative Model  Laurent Itti, Jochen Braun, Dale K. Lee, Christof Koch
 We previously proposed a quantitative model of early visual processing
in primates, based on nonlinearly interacting visual filters and
statistically efficient decisions. We now use this model to interpret
the observed modulation of a range of human psychophysical thresholds
with and without focal visual attention. Our model, calibrated by
an automatic fitting procedure, simultaneously reproduces thresholds
for four classical pattern discrimination tasks, performed while
attention was engaged by another concurrent task. Our model then
predicts that the seemingly complex improvements of certain
thresholds, which we observed when attention was fully available for
the discrimination tasks, can best be explained by a strengthening of
competition among early visual filters.
 Exploiting Generative Models in Discriminative Classifiers  Tommi Jaakkola, David Haussler
 Generative probability models such as hidden Markov models
provide a principled way of treating missing information and dealing
with variable length sequences. On the other hand, discriminative
methods such as support vector machines enable us to construct flexible
decision boundaries and often result in classification performance
superior to that of the model-based approaches. An ideal classifier
should combine these two complementary approaches. In this paper, we
develop a natural way of achieving this combination by deriving kernel
functions for use in discriminative methods such as support vector
machines from generative probability models. We provide a theoretical
justification for this combination as well as demonstrate a substantial
improvement in the classification performance in the context of DNA and
protein sequence analysis.
 Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm  Tony Jebara, Alex Pentland
 We present the CEM ({\em Conditional Expectation Maximization})
algorithm as an extension of the EM ({\em Expectation Maximization})
algorithm to conditional density estimation under missing data. A
bounding and maximization process is given to specifically optimize
conditional likelihood instead of the usual joint likelihood. We apply
the method to conditioned mixture models and use bounding
techniques to derive the model's update rules. Monotonic convergence,
computational efficiency and regression results superior to EM are
demonstrated.
 Analyzing and Visualizing Single-Trial Event-Related Potentials  Tzyy-Ping Jung, Scott Makeig, Marissa Westerfield, Jeanne Townsend, Eric Courchesne, Terrence J. Sejnowski
 Event-related potentials (ERPs) are portions of
electroencephalographic (EEG) recordings that are both time-
and phase-locked to experimental events. ERPs are usually
averaged to increase their signal/noise ratio relative to
non-phase-locked EEG activity, despite the fact that
response activity in single epochs may vary widely in
time course and scalp distribution. This study applies a linear
decomposition tool, Independent Component Analysis (ICA) (Lee et al., in
press), to multichannel single-trial EEG records to derive spatial filters
that decompose single-trial EEG epochs into a sum of temporally
independent and spatially fixed components arising from distinct or
overlapping brain or extra-brain networks. Our results show that ICA can
separate artifactual, stimulus-locked, response-locked, and non-event-related
background EEG activities into separate
components, allowing (1) removal of pervasive
artifacts of all types from single-trial EEG records, and
(2) identification of both stimulus- and response-locked EEG
components. In addition, this study proposes a new visualization
tool, the `ERP image', for investigating variability in
latencies and amplitudes of event-evoked
responses in spontaneous EEG or MEG records. We show that
sorting single-trial ERP epochs in order of a relevant
response measure (e.g.\ reaction time) and plotting the
potentials in 2D clearly reveals underlying
patterns of response variability linked to performance.
These analysis and visualization tools appear broadly
applicable to electrophysiological research on both normal
and clinical populations.
 The Belief in TAP  Yoshiyuki Kabashima, David Saad
 We show the similarity between belief propagation and TAP, for
decoding corrupted messages encoded by Sourlas's method. The latter
is a special case of the Gallager error-correcting code, where the
code word comprises products of $K$ bits selected randomly from the
original message. We examine the efficacy of solutions obtained by the
two methods for various values of $K$ and show that solutions for $K
\!\ge\! 3$ may be sensitive to the choice of initial conditions in the
case of unbiased patterns. Good approximations are obtained generally
for $K\!=\!2$ and for biased patterns in the case of $K\! \ge \!3$,
especially when Nishimori's temperature is being used.
 Optimizing Classifiers for Imbalanced Training Sets  Grigoris Karakoulas, John Shawe-Taylor
 Following recent results showing the importance of the fat-shattering
dimension in explaining the beneficial effect of a large margin on
generalization performance, further work analysed how the margin observed
on a test example could be used to give higher confidence of the
classification accuracy. The current paper investigates the implications
of these results for the case of imbalanced datasets and develops two
approaches to setting the threshold. The approaches are incorporated into
ThetaBoost, a boosting algorithm that we develop for dealing with unequal
loss functions. The performance of ThetaBoost and the two approaches are
also investigated experimentally.
 Inference in Multilayer Networks via Large Deviation Bounds  Michael Kearns, Lawrence Saul
 We study probabilistic inference in large, layered belief
networks represented as directed acyclic graphs. We show that the
intractability of exact inference in such networks does not preclude
their effective use. We give algorithms for approximate probabilistic
inference that exploit averaging phenomena occurring at nodes with
large numbers of parents. We show that these algorithms compute
rigorous lower and upper bounds on marginal probabilities of interest,
prove that these bounds become exact in the limit of large networks,
and provide rates of convergence.
 Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms  Michael Kearns, Satinder Singh
 In this paper, we address two issues of longstanding interest in the
reinforcement learning literature. First,
what kinds of performance guarantees can be made for Q-learning
after only a finite number of actions?
Second, what quantitative comparisons can be made between Q-learning and
model-based (indirect) approaches,
which use experience to estimate next-state
distributions for offline value iteration?
We first show that both Q-learning and the indirect approach enjoy
rather rapid convergence to the optimal policy as a function of the
number of state transitions observed. In particular, on the order of
only $(N\log(1/\epsilon)/\epsilon^2)(\log(N) + \log\log(1/\epsilon))$
transitions
are sufficient for both algorithms to come within $\epsilon$ of the
optimal policy, in an idealized model that assumes the observed
transitions are ``well-mixed'' throughout an $N$-state MDP. Thus, the
two approaches have roughly the same sample complexity. Perhaps surprisingly,
this sample complexity is far less than what is required for the model-based
approach to actually construct a good approximation to the next-state
distribution. The result also shows that the amount of memory required by
the model-based approach is closer to $N$ than to $N^2$.
For either approach, to remove the assumption that the observed transitions
are well-mixed, we consider a model in which the transitions are
determined by a fixed, arbitrary exploration policy. Bounds on the
number of transitions required in order to achieve a desired level of
performance are then related to the stationary distribution and
mixing time of this policy.
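For readers unfamiliar with the direct approach being analyzed, a minimal tabular Q-learning sketch follows, run on a hypothetical two-state MDP with uniformly random exploration (loosely mirroring the "well-mixed" transition assumption); the MDP, step counts, and learning-rate values are illustrative, not from the paper.

```python
import random

# Hypothetical 2-state MDP: T[s][a] = (next_state, reward).
T = {0: {0: (0, 0.0), 1: (1, 0.0)},
     1: {0: (0, 5.0), 1: (1, 1.0)}}

def q_learning(steps=5000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
    s = 0
    for _ in range(steps):
        a = rng.choice((0, 1))                   # uniformly random exploration
        s2, r = T[s][a]
        target = r + gamma * max(Q[s2].values())
        Q[s][a] += alpha * (target - Q[s][a])    # Q-learning update
        s = s2
    return Q

Q = q_learning()
policy = {s: max(Q[s], key=Q[s].get) for s in Q}
```

Here the optimal behavior is to cycle between the two states (collecting the reward of 5), and the learned greedy policy recovers it without ever building an explicit model of the next-state distributions.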
 A Polygonal Line Algorithm for Constructing Principal Curves  Bal\'azs K\'egl, Adam Krzy\.zak, Tam\'as Linder, Kenneth Zeger
 Principal curves have been defined as ``self-consistent''
smooth curves which pass through the ``middle'' of a $d$-dimensional
probability distribution or data cloud. They give a summary of the
data and also serve as an efficient feature extraction tool.
Recently, we have offered a new approach by defining principal curves
as continuous curves of a given length which minimize the expected
squared distance between the curve and points of the space randomly
chosen according to a given distribution. The new definition made it
possible to carry out a theoretical analysis of learning principal
curves from training data. In our theoretical learning scheme we consider
classes of polygonal lines with $k$ segments and with a given length,
and the algorithm chooses a curve from this class which minimizes the
average squared distance over $n$ training points drawn independently.
In this paper we propose a practical construction based on the new
definition. In each iteration of the algorithm a new vertex is added
to the polygonal line and the positions of the vertices are updated so
that they minimize a penalized squared distance criterion. Simulation
results demonstrate that the new algorithm compares favorably with
previous methods both in terms of performance and computational
complexity, and is more robust to varying data models.
 Spike-Based Compared to Rate-Based Hebbian Learning  Richard Kempter, Wulfram Gerstner, J. Leo van Hemmen
 A correlationbased learning rule at the spike level is formulated,
mathematically analyzed, and compared to learning in a firing-rate description.
A differential equation for the learning dynamics is derived under the
assumption that the time scales of learning and spiking can be separated.
Using a linear Poissonian neuron model which receives time-dependent stochastic
input, we show that spike correlations on a millisecond time scale indeed
play a role under reasonable neurobiological conditions. It is shown that
correlations between input and output spikes tend to stabilize
structure formation, provided that the form of the learning window is
in accordance with Hebb's principle.
 Exploring Unknown Environments with Real-Time Heuristic Search or Reinforcement Learning  Sven Koenig
 Learning Real-Time A* (LRTA*) is a popular control method that interleaves planning and plan
execution. Advantages of LRTA* include: it allows for fine-grained control over how much planning to
do between plan executions, it is able to use heuristic knowledge to guide planning, it can be interrupted
at any state and resume execution at a different state, and it improves its plan-execution time as it
solves similar planning tasks, until its plan-execution time is optimal. LRTA* has been shown to
solve search problems in known environments efficiently. In this paper, we apply LRTA* to the
problem of getting to a given goal location in an initially unknown environment. Uninformed LRTA*
with maximal lookahead always moves on a shortest path to the closest unvisited state, that is, to
the closest potential goal state. This was believed to be a good exploration heuristic, but we show
that it does not minimize the worst-case plan-execution time compared to other uninformed
exploration methods. This result is also of interest to reinforcement-learning researchers since
many reinforcement-learning methods use asynchronous dynamic programming, just like LRTA*.
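For concreteness, a minimal LRTA*-style sketch on a hypothetical unit-cost graph follows: the agent repeatedly does one-step lookahead, raises its heuristic estimate of the current state, and moves to the apparently best successor. The graph, zero-initialized heuristic, and step bound are illustrative assumptions, not the paper's experimental setup.

```python
# Hypothetical unit-cost graph: successors of each state; the goal is state 3.
succ = {0: [1], 1: [0, 2], 2: [1, 3], 3: []}
GOAL = 3

def lrta_star(start, max_steps=100):
    """Learning Real-Time A*: move to the successor minimizing the
    one-step cost-to-go estimate c + h, updating h(s) before each move."""
    h = {s: 0.0 for s in succ}          # admissible zero-initialized heuristic
    s, path = start, [start]
    for _ in range(max_steps):
        if s == GOAL:
            break
        # one-step lookahead: unit edge cost plus heuristic of the successor
        best = min(succ[s], key=lambda s2: 1.0 + h[s2])
        h[s] = max(h[s], 1.0 + h[best])  # LRTA* value update
        s = best
        path.append(s)
    return path, h

path, h = lrta_star(0)
```

On this toy graph the agent walks straight to the goal while leaving behind improved heuristic values that would speed up later searches from the same states.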
 Active Noise Canceling Using Analog Neuro-Chip with On-Chip Learning Capability  Soo-Young Lee, Jung Wook Cho
 A modular analogue neuro-chip set with on-chip learning capability is
developed for active noise canceling. The analogue neuro-chip set
incorporates the error backpropagation learning rule for practical
applications, and allows pin-to-pin interconnections for multi-chip boards.
The developed neuro-board demonstrated active noise canceling without any
digital signal processor. Multipath fading of acoustic channels, random
noise, and nonlinear distortion of the loudspeaker are compensated by the
adaptive learning circuits of the neuro-chips. Experimental results are
reported for cancellation of car noises in real time.
 Unsupervised Classification with Non-Gaussian Mixture Models using ICA  Te-Won Lee, Michael S. Lewicki, Terrence J. Sejnowski
 We present an unsupervised classification algorithm based on an ICA
mixture model. A mixture model is a model in which the observed data
can be categorized into several mutually exclusive data classes. In
an ICA mixture model, it is assumed that the data in each class are
generated by a linear mixture of independent sources. The algorithm
finds the independent sources and the mixing matrix for each class and
also computes the class membership probability of each data point.
This approach extends the Gaussian mixture model so that the clusters
can have non-Gaussian structure. Performance on a standard
classification problem, the Iris flower data set, demonstrates that the
new algorithm can improve classification accuracy over standard
Gaussian mixture models. We also show that the algorithm can be
applied to blind source separation in nonstationary environments. The
method can switch automatically between learned mixing matrices in
different environments. Preliminary results on natural scenes and text
image patterns show that the algorithm is able to find classes so that
one class encodes the natural images and the other class specializes in
encoding the text segments.
 Learning a Continuous Hidden Variable Model for Binary Data  Daniel D. Lee, Haim Sompolinsky
 A directed generative model for binary data using a small number of
hidden continuous units is investigated. A clipping nonlinearity
distinguishes the model from conventional principal components
analysis. The relationships between the correlations of the
underlying continuous Gaussian variables and the binary output
variables are utilized to learn the appropriate weights of the
network. The advantages of this approach are illustrated on a
translationally invariant binary distribution and on handwritten
digit images.
 Stationarity and Stability of Autoregressive Neural Network Processes  Friedrich Leisch, Adrian Trapletti, Kurt Hornik
 We analyze the asymptotic behavior of autoregressive neural
network (ARNN) processes using techniques from Markov chains and
nonlinear time series analysis. It is shown that standard ARNNs
without shortcut connections are irreducible, aperiodic and
asymptotically stationary. If linear shortcut connections are allowed,
only the shortcut weights determine whether the overall system is
stationary, hence standard conditions for linear AR processes can be
used. We further show how integrated processes can be modeled using
ARNN processes.
The asymptotic behavior of ARNNs is especially important for
predictions over larger intervals of time or when using the network to
generate artificial time series. For example, one might train the network on
an available sample and then use the trained network afterwards, driven by
artificial noise from a random number generator, to generate new data
with properties similar to those of the training sample. The asymptotic
stationarity guarantees that the ARNN model cannot show ``explosive''
behavior or growing variance with time.
 Coding TimeVarying Signals Using Sparse, ShiftInvariant Representations  Michael S. Lewicki, Terrence J. Sejnowski
 A common way to represent a time series is to divide it into
short-duration blocks, each of which is then represented by a set of
basis functions. A limitation of this approach, however, is that the
temporal alignment of the basis functions with the underlying
structure in the time series is arbitrary. We present an algorithm
for encoding a time series that does not require blocking the data.
The algorithm finds an efficient representation by inferring the best
temporal positions for functions in a kernel basis. These can have
arbitrary temporal extent and are not constrained to be orthogonal.
This allows the model to capture structure in the signal that may
occur at arbitrary temporal positions and preserves the relative
temporal structure of underlying events. The model is shown to be
equivalent to a very sparse and highly overcomplete basis. Under this
model, the mapping from the data to the representation is nonlinear,
but can be computed efficiently. This form also allows the use of
existing methods for adapting the basis itself to data.
 A V1 Model of Pop Out and Asymmetry in Visual Search  Zhaoping Li
 Visual search is the task of finding a target in an image against a
background of distractors. Unique features of targets enable them
to pop out against the background. It is known that the ease of
target detection can change when the roles of figure and ground are
switched. The mechanisms underlying pop out and asymmetry in visual
search have been elusive. This paper shows that a model of
segmentation in V1 based on intracortical interactions can explain
many of the qualitative aspects of visual search.
 Computational Differences between Asymmetrical and Symmetrical Networks  Zhaoping Li, Peter Dayan
 Symmetrically connected recurrent networks have recently been used
as models of a host of neural computations. However, because of the
separation between excitation and inhibition, biological neural
networks are asymmetrical. We study characteristic differences
between asymmetrical networks and their symmetrical counterparts,
showing that they have dramatically different dynamical behavior and
also how the differences can be exploited for computational ends. We
illustrate our results in the case of a network that is a selective
amplifier.
 Mechanisms of Generalization in Perceptual Learning  Zili Liu, Daphna Weinshall
 The learning of many visual perceptual tasks was shown to be specific
to the practiced stimulus, and new stimuli require relearning from
scratch. Here we demonstrate generalization using a novel paradigm in
motion discrimination where learning has been shown to be specific: We
trained subjects to discriminate the directions of moving dots, and
verified the previous results that learning does not transfer from the
trained direction to a new one. However, by tracking the subjects'
performance across time in the new direction, we found that their rate
of learning doubled. Therefore, we found generalization in a task
previously considered too difficult to generalize. Even for an easy
task that was found to generalize, we found a new manifestation of
this generalization: After mastering the task with an easy stimulus,
subjects who had practiced briefly to discriminate the easy stimulus
in a new direction generalized to a difficult stimulus in that
direction. This generalization depended on {\it both} the mastering
{\it and} the brief practice. The specificity of perceptual learning,
and the dichotomy between the learning of ``easy'' vs.\ ``difficult''
tasks, were hypothesized to imply that different learning processes are
involved, operating at different visual cortical areas. Here we show
how to interpret these results in terms of signal detection. With the
single assumption of limited computational resources, we obtain the
observed phenomena (transfer, change of learning rate, and no
transfer) for, respectively, increasing levels of task difficulty. It
appears that the observed generalization concurs with the expected
behavior of a generic discrimination system. Thus we challenge the
validity of using the specificity of perceptual learning to probe the
neuronal loci of modifications in the early visual cortex.
 The Effect of Eligibility Traces on Finding Optimal Memoryless Policies in Partially Observable Markov Decision Processes  John Loch
 Agents acting in the real world are confronted with the problem of making
good decisions with limited knowledge of the environment. Partially
observable Markov decision processes (POMDPs) model decision problems in
which an agent tries to maximize its reward in the face of limited sensor
feedback. Recent work has shown empirically that a reinforcement learning
(RL) algorithm called Sarsa($\lambda$) can efficiently find optimal memoryless
policies, which map current observations to actions, for POMDP problems
(Loch and Singh 1998). The Sarsa($\lambda$) algorithm uses a form of short-term
memory called an eligibility trace, which distributes temporally delayed
rewards to observation-action pairs which lead up to the reward. This paper
explores the effect of eligibility traces on the ability of the Sarsa($\lambda$)
algorithm to find optimal memoryless policies. A variant of Sarsa($\lambda$)
called $k$-step truncated Sarsa($\lambda$) is applied to four test
problems taken from the recent work of Littman; Littman, Cassandra and
Kaelbling; Parr and Russell; and Chrisman. The empirical results show that
eligibility traces can be significantly truncated without affecting the
ability of Sarsa($\lambda$) to find optimal memoryless policies for POMDPs.
 Utilizing Time: Asynchronous Binding  Bradley C. Love
 Historically, connectionist systems have not excelled at representing
and manipulating complex structures. How can a system composed of
simple neuron-like computing elements encode complex relations?
Recently, researchers have begun to appreciate that representations
can extend in both time and space. Many researchers have proposed
that the synchronous firing of units can encode complex
representations. I identify the limitations of this approach and
present an asynchronous model of binding that effectively
represents complex structures. The asynchronous model extends the
synchronous approach. I argue that our cognitive architecture
utilizes a similar mechanism.
 Analog Neural Nets with Gaussian or other Common Noise Distributions Cannot Recognize Arbitrary Regular Languages  Wolfgang Maass, Eduardo D. Sontag
 We consider recurrent analog neural nets where each gate is subject to
Gaussian noise, or any other common noise distribution whose
probability density function is nonzero on a large set. We show that
many regular languages cannot be recognized by networks of this type,
for example the language $\{w \in \{0,1\}^* \mid w \mbox{ begins with }
0\}$, and we give a precise characterization of those languages which
can be recognized. This result implies severe constraints on
possibilities for constructing recurrent analog neural nets that are
robust against realistic types of analog noise. On the other hand, we
present a method for constructing {\it feedforward} analog neural nets
that are robust with regard to analog noise of this type.
 Neural Networks for Density Estimation  Malik Magdon-Ismail, Amir F. Atiya
 We introduce two new techniques for density estimation. Our approach poses
the problem as a supervised learning task which can be performed using
neural networks. We introduce a stochastic method for learning the
cumulative distribution and an analogous deterministic technique. We
demonstrate convergence of our methods both theoretically and
experimentally, and provide comparisons with the Parzen estimate. Our
theoretical results demonstrate better convergence properties than the
Parzen estimate.
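To make the "density estimation as supervised learning" idea concrete, the sketch below fits a smooth monotone model to the empirical cumulative distribution and differentiates it to obtain a density. A two-parameter logistic stands in for the authors' neural network, and all data, learning rates, and iteration counts are illustrative assumptions.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def fit_cdf(data, iters=3000, lr=0.5):
    """Fit F(x) = sigmoid((x - mu) / s) to the empirical CDF by gradient
    descent on squared error; a 2-parameter stand-in for a network."""
    xs = sorted(data)
    n = len(xs)
    targets = [(i + 1) / (n + 1) for i in range(n)]   # smoothed empirical CDF
    mu, s = 0.0, 1.0
    for _ in range(iters):
        gmu = gs = 0.0
        for x, t in zip(xs, targets):
            u = (x - mu) / s
            f = sigmoid(u)
            d = f * (1.0 - f)                 # derivative of the sigmoid
            r = f - t                         # residual vs. empirical CDF
            gmu += 2.0 * r * (-d / s) / n
            gs  += 2.0 * r * (-d * u / s) / n
        mu -= lr * gmu
        s = max(s - lr * gs, 1e-2)            # keep the scale positive
    return mu, s

def density(x, mu, s):
    """Density estimate = derivative of the fitted CDF."""
    f = sigmoid((x - mu) / s)
    return f * (1.0 - f) / s

mu, s = fit_cdf([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
```

Because the model is a valid CDF by construction, its derivative is automatically a nonnegative density that integrates to one.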
 Signal Detection in Noisy Weakly-Active Dendrites  Amit Manwani, Christof Koch
 Here we derive measures quantifying the information loss of a synaptic
signal due to the presence of neuronal noise sources as it
electrotonically propagates along a weakly-active dendrite. We model the
dendrite as an infinite linear cable, with noise sources distributed along
its length. The noise sources we consider are thermal noise, channel
noise arising from the stochastic nature of voltage-dependent ionic
channels (K and Na) and synaptic noise due to spontaneous background
activity. We assess the efficacy of information transfer using a signal
detection paradigm where the objective is to detect the presence/absence
of a presynaptic spike from the postsynaptic membrane voltage. This
allows us to analytically assess the role of each of these noise sources
in information transfer. For our choice of parameters, we find that the
synaptic noise is the dominant noise source which limits the maximum
length over which information can be reliably transmitted.
 Exploratory Data Analysis Using Radial Basis Function Latent Variable Models  Alan D. Marrs, Andrew R. Webb
 Two developments of nonlinear latent variable models based on radial
basis functions are discussed: the first is a resampling approach that
makes more effective use of latent samples in evaluating the likelihood.
Also, the use of priors or constraints on allowable models is considered
as a means of preserving data structure on low-dimensional representations
for visualisation purposes. The former development is illustrated on both
a simulated data set and radar range profiles of ships.
 Direct Optimization of Margins Improves Generalization in Combined Classifiers  Llew Mason, Peter L. Bartlett, Jonathan Baxter
 Recent theoretical results have shown that the generalization
performance of thresholded convex combinations of base classifiers is
greatly improved if the underlying convex combination has large
{\em margins} on the training data (correct examples are classified well
away from the decision boundary). Neural network algorithms and
AdaBoost have been shown to implicitly maximize margins, thus
providing some theoretical justification for their remarkably
good generalization performance. In this paper we are concerned with
maximizing the margin explicitly. In particular, we prove a theorem
bounding the generalization performance of convex combinations in
terms of general cost functions of the margin (previous results
were stated in terms of the particular cost function
$\sgn(\text{margin} - \theta)$). We then present an algorithm (DOOM)
for directly optimizing a piecewise-linear family of cost functions
satisfying the conditions of the theorem. Experiments on several of the
datasets in the UC Irvine database are presented in which AdaBoost was used
to generate a set of base classifiers and then DOOM was used to find the
optimal convex combination of those classifiers. In all but one case
the convex combination generated by DOOM had lower test error than
AdaBoost's combination. In many cases DOOM achieves these lower test
errors by sacrificing training error, in the interests of reducing the
new cost function. The margin plots also show that the size of
the minimum margin is not relevant to generalization performance.
 Scheduling Straight-Line Code Using Reinforcement Learning and Rollouts  Amy McGovern, Eliot Moss
 The execution order of a block of computer instructions can make a
difference in its running time by a factor of two or more. In order
to achieve the best possible speed, compilers use heuristic schedulers
appropriate to each specific architecture implementation. However,
these heuristic schedulers are time-consuming and expensive to build.
In this paper, we present results using both rollouts and
reinforcement learning to construct heuristics for scheduling basic
blocks. The rollout scheduler outperformed a commercial scheduler,
and the reinforcement learning scheduler performed almost as well as
the commercial scheduler.
 On the Optimality of Incremental Neural Network Algorithms  Ron Meir, Vitaly E. Maiorov
 We study the approximation of functions by two-layer feedforward neural
networks, focusing on incremental algorithms which incrementally add
units, estimating single unit parameters at each stage. As opposed to
standard algorithms for fixed architectures, the optimization at each
stage is performed over a small number of parameters, mitigating many of
the difficult numerical problems inherent in high-dimensional nonlinear
optimization. We establish upper bounds on the approximation error
incurred in the approximation, when approximating functions from the
Sobolev space, thereby extending previous results which only provided
rates of convergence for functions in certain convex hulls of functional
spaces. By comparing our results to recently derived lower bounds, we show
that the incremental algorithms are nearly optimal in many cases, while
provably outperforming linear methods in others. Combined with recent
estimation error results for incremental greedy algorithms, a strong case
can be made for this type of approach.
 Kernel PCA and De-Noising in Feature Spaces  Sebastian Mika, Bernhard Sch\"olkopf, Alex J. Smola, Klaus-R. M\"uller, Matthias Scholz, Gunnar R\"atsch
 Kernel PCA as a nonlinear feature extractor has proven powerful as a
preprocessing step for classification algorithms. But it can also be
considered as a natural generalization of linear principal component
analysis. This gives rise to the question of how to use nonlinear
features for data compression, reconstruction, and de-noising,
applications common in linear PCA. This is a nontrivial task, as
the results provided by kernel PCA live in some high-dimensional
feature space and need not have pre-images in input space. This work
presents ideas for finding approximate pre-images, focusing on
Gaussian kernels, and shows experimental results using these
pre-images in data reconstruction and de-noising on toy examples as
well as on real-world data.
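One known approach to the pre-image problem for Gaussian kernels (of the kind explored here) is a fixed-point iteration that repeatedly re-averages the training points under kernel weights. The sketch below shows the iteration's general shape in one dimension with hypothetical expansion coefficients; in the kernel PCA setting, those coefficients would come from projecting the test point onto the leading components.

```python
import math

def gaussian_preimage(points, gamma, c, z0, iters=200):
    """Fixed-point iteration z <- sum_i w_i x_i / sum_i w_i with
    w_i = gamma_i * exp(-(z - x_i)^2 / c); a 1-D sketch of the
    pre-image step for a Gaussian kernel with width parameter c."""
    z = z0
    for _ in range(iters):
        w = [g * math.exp(-(z - x) ** 2 / c) for g, x in zip(gamma, points)]
        z = sum(wi * xi for wi, xi in zip(w, points)) / sum(w)
    return z

# Two symmetric points with equal coefficients and c = 2: the update
# reduces to z <- tanh(z), which contracts toward the points' mean.
z = gaussian_preimage([-1.0, 1.0], [1.0, 1.0], 2.0, 0.2)
```

The iterate stays inside the convex hull of the data, which is one reason this style of pre-image search is numerically well behaved.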
 Bayesian Modeling of Facial Similarity  Baback Moghaddam, Tony Jebara, Alex Pentland
 In previous work, we advanced a new technique for direct visual matching
of images for the purposes of face recognition and image retrieval,
using a probabilistic measure of similarity based primarily on a
Bayesian analysis of image differences, leading to a ``dual'' basis
similar to eigenfaces. The performance advantage of this probabilistic
matching technique over standard Euclidean nearestneighbor eigenface
matching was recently demonstrated using results from DARPA's 1996
``FERET'' face recognition competition, in which this probabilistic
matching algorithm was found to be the top performer. We have further
developed a simple method of replacing the rather costly computation of
nonlinear (online) Bayesian similarity measures by the relatively
inexpensive computation of linear (offline) subspace projections and
simple (online) Euclidean norms, thus resulting in a significant
computational speedup for implementation with very large image
databases as typically encountered in realworld applications.
 Learning InstanceIndependent Value Functions to Enhance Local Search  Robert Moll, Andrew G. Barto, Theodore Perkins, Richard S. Sutton
 Reinforcement learning methods can be used to improve the performance of
local search algorithms for combinatorial optimization by learning an
evaluation function that predicts the outcome of search. The evaluation
function is therefore able to guide search more effectively to low-cost
solutions than can the original cost function. We describe a reinforcement
learning method for enhancing local search that combines aspects of
previous work by Zhang and Dietterich (1995) and Boyan and Moore (1997,
1998). In an offline learning phase, a value function is learned that is
useful for guiding search for multiple problem sizes and instances. We
illustrate our technique by developing several such functions for the
Dial-A-Ride Problem, a variant of the well-known Traveling Salesman
Problem. Overall, our learned hill-climbing results exhibit an
improvement of more than 30\% over the standard Dial-A-Ride local search
algorithm.
 Reinforcement Learning for Trading Systems  John Moody, Matthew Saffell
 We propose to train trading systems by optimizing financial objective
functions via reinforcement learning. The performance functions that
we consider as value functions are profit or wealth, the Sharpe ratio
and our recently proposed differential Sharpe ratio for online
learning. In Moody and Wu 1997, we presented empirical results in
controlled experiments that demonstrated the advantages of
reinforcement learning relative to supervised learning. Here we
extend our previous work to compare Q-learning to a reinforcement
learning technique based on real-time recurrent learning (RTRL) that
maximizes immediate reward. We provide new simulation results that
demonstrate the presence of predictability in the monthly S\&P 500
Stock Index for the 25-year period 1970 through 1994, as well as a
sensitivity analysis that provides economic insight into the trader's
structure.
 Very Fast EM-based Mixture Model Clustering using Multiresolution Kd-trees  Andrew W. Moore
 Clustering is important in many fields including manufacturing,
biology, finance, and astronomy. Mixture models are a popular
approach due to their statistical foundations, and EM is a very
popular method for finding mixture models. EM, however, requires many
accesses of the data, and thus has been dismissed as impractical for
data mining of enormous datasets. We present a new algorithm, based
on the multiresolution kd-trees of our earlier work, which
dramatically reduces the cost of EM-based clustering, with savings
rising linearly with the number of datapoints. Although presented
here for maximum likelihood estimation of Gaussian mixture models, it
is also applicable to non-Gaussian models (provided class densities
are monotonic in Mahalanobis distance), mixed categorical/numeric
clusters, and Bayesian methods such as Autoclass.
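For reference, the un-accelerated baseline being sped up is ordinary EM for a Gaussian mixture; the kd-tree machinery only accelerates the E-step sums, which the sketch below computes naively over every point. The 1-D two-component setting and the toy data are illustrative assumptions.

```python
import math

def gauss(x, m, v):
    """Gaussian density with mean m and variance v."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def em_gmm(data, iters=50):
    """Plain EM for a two-component 1-D Gaussian mixture; the kd-tree
    variant would accelerate the per-point E-step sums done here."""
    means = [min(data), max(data)]
    vars_ = [1.0, 1.0]
    pis = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi * gauss(x, m, v) for pi, m, v in zip(pis, means, vars_)]
            tot = sum(p)
            resp.append([pk / tot for pk in p])
        # M-step: re-estimate weights, means, variances from responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pis[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            vars_[k] = max(1e-6, sum(r[k] * (x - means[k]) ** 2
                                     for r, x in zip(resp, data)) / nk)
    return pis, means, vars_

data = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
pis, means, vars_ = em_gmm(data)
```

Every E-step touches every datapoint, which is exactly the cost that grows prohibitive on enormous datasets and that the multiresolution kd-tree approach is designed to avoid.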
 A Principle for Unsupervised Hierarchical Decomposition of Visual Scenes  Michael C. Mozer
 Structure in a visual scene can be described at many levels of granularity.
At a coarse level, the scene is composed of objects; at a finer level,
each object is made up of parts, and the parts of subparts. In this work, I
propose a simple principle by which such hierarchical structure can be
extracted from visual scenes: Regularity in the relations among different
parts of an object is weaker than in the internal structure of a part. This
principle can be applied recursively to define part-whole relationships
among elements in a scene. The principle does not make use of object
models, categories, or other sorts of higher-level knowledge; rather,
part-whole relationships can be established based on the statistics of a
set of sample visual scenes. I illustrate with a model that performs
unsupervised decomposition of simple scenes. The model can account for
the results from a human learning experiment on the ontogeny of
partwhole relationships.
 Barycentric Interpolators for Continuous Space and Time Reinforcement Learning  Remi Munos, Andrew W. Moore
 In order to find the optimal control of continuous state-space and time
reinforcement learning (RL) problems, we approximate the value function (VF)
with a particular class of functions called barycentric interpolators.
We establish sufficient conditions under which an RL algorithm converges
to the optimal VF, even when we use approximate models of the state
dynamics and the reinforcement functions.
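For intuition, an interpolator of this class represents the value function inside each simplex as a convex combination of its vertex values, weighted by barycentric coordinates. A minimal 2-D sketch of that evaluation step (names and the 2-D restriction are mine, not the paper's):

```python
import numpy as np

def barycentric_interpolate(x, simplex, values):
    """Evaluate a barycentric interpolator: the value at x inside a simplex
    is the convex combination of the vertex values, with weights given by
    the barycentric coordinates of x.  2-D sketch; names are illustrative."""
    v = np.asarray(simplex, float)          # (d+1, d) simplex vertices
    A = np.vstack([v.T, np.ones(len(v))])   # last row enforces sum(lam) = 1
    lam = np.linalg.solve(A, np.append(np.asarray(x, float), 1.0))
    return float(lam @ np.asarray(values, float))
```

For example, at the midpoint of the edge between vertices with values 1 and 2 this returns 1.5, the average of the two.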
 Controlling the Complexity of HMM Systems by Regularization  Christoph Neukirchen, Gerhard Rigoll
 This paper introduces a method for regularization of HMM systems
that avoids parameter overfitting caused by insufficient training
data. Regularization is done by augmenting the EM training method
by a penalty term that favors simple
and smooth HMM systems. The penalty
term is constructed as a mixture model of negative exponential
distributions that is assumed to generate the
state dependent emission probabilities
of the HMMs. This new method is the successful transfer
of a well known regularization approach in neural networks
to the HMM domain and can be interpreted as
a generalization of traditional state-tying for HMM systems.
The effect of regularization is demonstrated for continuous
speech recognition tasks by improving overfitted triphone
models and by speaker adaptation with limited training data.
 Risk Sensitive Reinforcement Learning  Ralph Neuneier, Oliver Mihatsch
 As is well known, the expected return of a policy controlling a
Markov Decision Problem is not always the most suitable optimality
criterion.
For many applications control strategies should meet various
constraints, like avoiding very bad states (risk-avoiding) or generating
high profit within a short time even though this might
cause significant costs (risk-seeking).
We propose a modified Q-learning algorithm which uses a single continuous
parameter $\kappa \in [-1,1]$ to determine in which sense the resulting
policy is optimal.
For $\kappa=0$, the policy is optimal with respect to the usual
expected return criterion, while $\kappa \to 1$ generates a solution
which is optimal in the worst case. Analogously, the closer $\kappa$
is to $-1$ the more risk-seeking the policy becomes.
In contrast to other related approaches in the field of MDPs, we do not
have to transform the cost model or to increase the state space in
order to take risk into account.
The existing convergence results for traditional $Q$-learning remain
valid in the risk-sensitive case $\kappa\not= 0$.
We describe some potential applications in industry and evaluate our new approach by computing optimal strategies for one of them.
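One way to picture such a single-parameter family is to weight the temporal-difference error asymmetrically by $\kappa$. The sketch below illustrates the general idea only; it is not necessarily the exact transform used in the paper:

```python
import numpy as np

def risk_sensitive_q_update(Q, s, a, r, s_next, kappa, alpha=0.1, gamma=0.95):
    """One Q-learning step with a kappa-weighted, asymmetric TD error.
    kappa = 0 recovers the standard update; kappa -> 1 increasingly
    discounts good surprises (risk-avoiding), kappa -> -1 discounts bad
    surprises (risk-seeking).  Illustrative weighting, not necessarily
    the paper's exact transform."""
    td = r + gamma * Q[s_next].max() - Q[s, a]
    weight = (1.0 - kappa) if td > 0 else (1.0 + kappa)
    Q[s, a] += alpha * weight * td
    return Q
```

With `kappa = 0` both branches have weight 1 and the update reduces to ordinary Q-learning.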
 Maximum-Likelihood Continuity Mapping (MALCOM): An Alternative to HMMs  David A. Nix, John E. Hogden
 We describe Maximum-Likelihood Continuity Mapping (MALCOM), an
alternative to hidden Markov models (HMMs) for
processing sequence data such as speech. While HMMs have a
discrete ``hidden'' space constrained by a fixed
finite-automaton architecture, MALCOM has a continuous
hidden space, a {\it continuity map}, that is constrained
only by a smoothness requirement on paths through the space.
MALCOM fits into the same probabilistic framework for speech
recognition as HMMs, but it represents a more realistic
model of the speech production process. To evaluate the
extent to which MALCOM captures speech production
information, we generated continuous speech continuity maps
for three speakers and used the paths through them to
predict measured speech articulator data. The median
correlation between the MALCOM paths {\it obtained from only
the speech acoustics} and articulator measurements was 0.77
on an independent test set not used to train MALCOM or the
predictor. This unsupervised model achieved correlations
over speakers and articulators only 0.02 to 0.15 lower than
those obtained using an analogous supervised method which
{\it used articulatory measurements as well as acoustics.}
 Graphical Models for Recognizing Human Interactions  Nuria Oliver, Barbara Rosario, Alex Pentland
 We describe a real-time computer vision and
machine learning system for modeling and recognizing human actions and
interactions. Two different domains are explored:
recognition of two-handed
motions in the martial art `Tai Chi,' and multiple-person interactions
in a visual surveillance task. Our system combines top-down with
bottom-up information using a feedback loop, and is formulated within a
Bayesian framework. Two different graphical models (HMMs and Coupled
HMMs) are used for modeling both individual actions and multiple-agent
interactions, and CHMMs are shown to work more efficiently and
accurately for a given amount of training. Finally, to overcome the
limited amounts of training data, we demonstrate that `synthetic
agents' (Alife-style agents) can be used to develop flexible prior models of
the person-to-person interactions.
 General Bounds on Bayes Errors for Regression with Gaussian Processes  Manfred Opper, Francesco Vivarelli
 Based on a simple convexity lemma,
we develop bounds for different types of Bayesian prediction errors
for regression with Gaussian processes. The basic bounds are formulated
for a fixed training set. Simpler expressions are obtained for
sampling from an input distribution which equals the weight function
of the covariance kernel, yielding asymptotically tight results.
The results are compared with numerical experiments.
 Mean Field Methods for Classification with Gaussian Processes  Manfred Opper, Ole Winther
 We discuss the application of TAP mean field methods
known from the Statistical Mechanics of disordered systems
to Bayesian classification models with Gaussian processes.
In contrast to previous approaches, no knowledge about the distribution
of inputs is needed in the derivation. Simulations for standard datasets
are given.
 Coordinate Transformation Learning of Hand Position Feedback Controller by Using Change of Position Error Norm  Eimei Oyama, Susumu Tachi
 In order to grasp an object, we need to solve the inverse kinematics
problem, i.e., the coordinate transformation from the visual coordinates
to the joint angle vector coordinates. In human motion control, the
learning function of the hand position error feedback controller in the
human inverse kinematics solver is important. This paper proposes a
novel model of the coordinate transformation learning of the human
visual feedback controller, which uses the change of the joint angle
vector and the corresponding change of the square of the hand position
error norm.
 Replicator Equations, Maximal Cliques, and Graph Isomorphism  Marcello Pelillo
 We present a new energy-minimization framework for the graph isomorphism
problem which is based on an equivalent maximum clique formulation. The
approach is centered around a remarkable result proved by Motzkin and
Straus in the mid-1960s, and recently expanded in various ways, which
allows us to formulate the maximum clique problem in terms of a standard
quadratic program. To solve the program we use ``replicator'' equations, a
class of simple continuous- and discrete-time dynamical systems developed
in various branches of theoretical biology. We show how, despite their
inability to escape from local solutions, they nevertheless provide
experimental results which are competitive with those obtained using more
elaborate mean-field annealing heuristics.
 Support Vector Machines Applied to Face Recognition  P. Jonathon Phillips
 Face recognition is a $K$-class problem, where $K$ is the number of
known individuals, and support vector machines (SVMs) are a binary
classification method. By reformulating the face recognition problem
and reinterpreting the output of the SVM classifier, we developed an
SVM-based face recognition algorithm. The face recognition problem is
formulated as a problem in {\it difference space}, which models
dissimilarities between two facial images. In difference space we
formulate face recognition as a two-class problem. The classes are:
dissimilarities between faces of the same person, and dissimilarities
between faces of different people. By modifying the interpretation of
the decision surface generated by SVM, we generated a similarity
metric between faces that is learned from examples of differences
between faces. The SVM-based algorithm is compared with a principal
component analysis (PCA) based algorithm on a difficult set of images
from the FERET database. Performance was measured for both
verification and identification scenarios. The identification
performance for SVM is 77-78\% versus 54\% for PCA. For verification,
the equal error rate is 7\% for SVM and 13\% for PCA.
 The Role of Lateral Cortical Competition in Ocular Dominance Development  Christian Piepenbrock, Klaus Obermayer
 Lateral competition within a layer of neurons sharpens and localizes the
response to an input stimulus. Here, we investigate a model for the
activity-dependent development of ocular dominance maps which allows us to
vary the degree of lateral competition. For weak competition, it resembles a
correlation-based learning model and for strong competition, it becomes a
self-organizing map. Thus, in the regime of weak competition the receptive
fields are shaped by the second order statistics of the input patterns,
whereas in the regime of strong competition, the higher moments and
``features'' of the individual patterns become important. When correlated
localized stimuli from two eyes drive the cortical development we find
$(i)$~that a topographic map and binocular, localized receptive fields emerge
when the degree of competition exceeds a critical value and $(ii)$~that
receptive fields exhibit eye dominance beyond a second critical value. For
anticorrelated activity between the eyes, the second order statistics drive
the system to develop ocular dominance even for weak competition, but no
topography emerges. Topography is established only beyond a critical degree
of competition.
 Using Analytic QP and Sparseness to Speed Training of Support Vector Machines  John Platt
 Training a Support Vector Machine (SVM) requires the solution of a very
large quadratic programming (QP) problem. This paper proposes a new
algorithm for training SVMs: Sequential Minimal Optimization, or SMO.
SMO breaks the large QP problem into a series of smallest possible QP
problems which are analytically solvable. SMO thus avoids numerical QP
optimization in the inner loop. Using code which exploits sparseness of
input data speeds up SMO even further. Because large matrix computation
is avoided, SMO scales somewhere between linear and quadratic in the
training set size for various test problems, while a standard projected
conjugate gradient (PCG) chunking algorithm scales somewhere between
linear and cubic in the training set size. SMO's computation time is
dominated by SVM evaluation, hence SMO is fastest for linear SVMs and
sparse data sets. For the MNIST database, SMO is as fast as PCG
chunking; while for the UCI Adult database and linear SVMs, SMO can be
more than 1000 times faster than the PCG chunking algorithm.
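The analytic inner step that makes this decomposition possible can be sketched as follows: for a chosen pair of multipliers, the QP restricted to the constraint line has a closed-form minimizer, which is then clipped to the box. This is a stripped-down illustration with names of my choosing; the published algorithm adds working-set heuristics, an error cache, and a threshold update not shown here:

```python
import numpy as np

def smo_pair_step(alpha, i, j, y, b, C, K):
    """Analytically optimize the pair (alpha_i, alpha_j), the inner step of
    SMO: minimize the QP along the constraint line sum(alpha * y) = const,
    then clip to the box [0, C].  Stripped-down sketch: no working-set
    heuristics, error cache, or threshold update."""
    f = K @ (alpha * y) + b                    # current decision values
    Ei, Ej = f[i] - y[i], f[j] - y[j]          # prediction errors
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]    # curvature along the line
    if eta <= 0:
        return alpha, False                    # degenerate pair: skip
    aj = alpha[j] + y[j] * (Ei - Ej) / eta     # unconstrained minimizer
    # feasible segment [L, H] for alpha_j given the equality constraint
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    aj = np.clip(aj, L, H)
    new = alpha.copy()
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - aj)  # keep sum(alpha*y) fixed
    new[j] = aj
    return new, True
```

Each such step preserves the equality constraint exactly, which is why no numerical QP solver is needed in the inner loop.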
 Independent Component Analysis of Intracellular Calcium Spike Data  Klaus Prank, Julia Borger, Alexander von zur Muhlen, Georg Brabant, Christof Schofl
 Calcium (Ca$^{2+}$) is a ubiquitous intracellular messenger
which regulates cellular processes such as secretion, contraction,
and cell proliferation. A number of different cell types respond
to hormonal stimuli with periodic oscillations of the intracellular
free calcium concentration ($[Ca^{2+}]_i$). These $Ca^{2+}$ signals
are often organized in complex temporal and spatial patterns even
under conditions of sustained stimulation. Here we study the
spatiotemporal aspects of intracellular calcium ($[Ca^{2+}]_i$)
oscillations in clonal $\beta$-cells (hamster insulin secreting
cells, HIT) under pharmacological stimulation
(Sch\"{o}fl {\it et al.}, 1994). We apply a novel fast fixed-point
algorithm (Hyv\"{a}rinen and Oja, 1997) for {\it Independent
Component Analysis (ICA)} to blind source separation of the
spatiotemporal dynamics of $[Ca^{2+}]_i$ in a HIT cell. Using
this approach we find two significant independent components
out of five differently mixed input signals: one $[Ca^{2+}]_i$
signal with a mean oscillatory period of 68 s and a high-frequency
signal with a broadband power spectrum with considerable
spectral density. This result is in good agreement with a study
on high-frequency $[Ca^{2+}]_i$ oscillations (Palu\u{s} {\it et al.}, 1998).
Further theoretical and experimental studies have to be performed
to resolve the question of the functional impact of these
independent $[Ca^{2+}]_i$ signals on intracellular signaling.
 On-Line Learning with Restricted Training Sets: Exact Solution as Benchmark for General Theories  H.C. Rae, Peter Sollich, A.C.C. Coolen
 We solve the dynamics of online Hebbian learning in perceptrons exactly,
for the regime where the size of the training set scales linearly with the
number of inputs. We consider both noiseless and noisy teachers. Our
calculation cannot be extended to non-Hebbian rules, but the solution
provides a nice benchmark to test more general and advanced theories for
solving the dynamics of learning with restricted training sets.
 Learning Macro-Actions in Reinforcement Learning  Jette Randl\o v
 We present a method for automatically constructing macro-actions
from primitive actions in reinforcement learning. The agent
constructs the macros entirely from scratch during the learning
process. The overall idea is to reinforce the tendency to perform
action B after action A if such a pattern of actions has been
rewarded.
We test the method on a bicycle task, the Car On The Hill Task, the
Race-Track Task and some gridworld tasks. For the bicycle task and
the Race-Track the use of macro-actions approximately halves the
learning time, while for the gridworld tasks the learning time is
reduced by orders of magnitude.
 Learning Lie Groups for Invariant Visual Perception  Rajesh P. N. Rao, Daniel L. Ruderman
 One of the most important problems in visual perception is that of
visual invariance: how are objects perceived to be the same despite
undergoing transformations such as translations, rotations or scaling?
In this paper, we describe a Bayesian method for learning invariances
based on Lie group theory. We show that previous approaches based on
first-order Taylor series expansions of inputs can be regarded as
special cases of the Lie group approach, the latter being capable of
handling in principle arbitrarily large transformations. Using a
matrix-exponential based generative model of images, we derive an
unsupervised algorithm for learning Lie group operators from input
data containing infinitesimal transformations. The online
unsupervised learning algorithm maximizes the posterior probability of
generating the training data. We provide experimental results
suggesting that the proposed method can learn Lie group operators for
handling reasonably large 1D translations and 2D rotations.
 Regularizing AdaBoost  Gunnar R\"atsch, Takashi Onoda, KlausR. M\"uller
 Boosting methods maximize a hard classification margin and are
known as powerful techniques that do not exhibit overfitting in low-noise
cases. For noisy data, however, boosting will still try to enforce a
hard margin and thereby give too much weight to outliers, which then
leads to the dilemma of non-smooth fits and overfitting. Therefore we
propose three algorithms to allow for soft margin classification by
introducing regularization with slack variables into the boosting
concept: (1) AdaBoost$_{reg}$ and regularized versions of (2) linear and
(3) quadratic programming AdaBoost. Experiments show the usefulness of
the proposed algorithms in comparison to another soft margin
classifier: the support vector machine.
 Multielectrode Spike Sorting by Clustering Transfer Functions  Dmitry Rinberg, Hanan Davidowitz, Naftali Tishby
 We propose a new paradigm for sorting spikes in multielectrode
extracellular recordings by clustering the transfer functions
between cells and electrodes. The assumption is that for
every cell and electrode there is a stable linear system,
corresponding to the media properties and geometry.
The main advantage of this method is that it is insensitive
to variations in the cell's spike shape and amplitude. At the same
time it utilizes all other multielectrode information.
In addition, since the transfer function noise is simpler
than the variability in the spike shape, this representation
is clean and easy to classify.
We generate spike templates by clustering the complex ratio of
the Fourier transforms of spikes on the different electrodes, using
deterministic annealing to determine the number of separated cells.
The detection of the spikes is then done by using their Fourier
representation.
We apply the technique to sorting spikes generated in the escape
response system of the cockroach.
 GeneralPurpose Localization of Textured Image Regions  Ruth Rosenholtz
 We suggest a working definition of texture: Texture is stuff which is
more compactly represented by its statistics than by specifying the
configuration of its parts. This definition suggests that to find
texture we look for outliers to the local statistics, and label as
texture the regions with no outliers. We present a method, based upon
this idea, for labeling points in natural scenes as belonging to
texture regions, while simultaneously allowing us to label low-level,
bottom-up cues for visual attention. This method is strongly based
upon recent psychophysics results on processing of texture and pop-out.
 Tight Bounds for the VC-Dimension of Piecewise Polynomial Networks  Akito Sakurai
 $O(ws(s\log d+\log (dqh/s)))$ and $O(ws((h/s)\log q)+\log (dqh/s))$
are upper bounds for the VC-dimension of
a set of neural networks of
units with piecewise polynomial activation functions,
where $h$ is the number of hidden units,
$w$ is the number of adjustable parameters,
$q$ is the maximum of the number of polynomial segments of
the activation function,
and $d$ is the maximum degree of the polynomials;
also $\Omega(ws\log (dqh/s))$ is a lower bound for the VC-dimension of
such a network set, and the bounds are tight when $s=\Theta(h)$ or $s$ is constant.
For the special case $q=1$, the VC-dimension is $\Theta(ws\log d)$.
 Reinforcement Learning Based on Online EM Algorithm  Masaaki Sato, Shin Ishii
 In this article, we propose a new reinforcement learning
(RL) method based on an actor-critic architecture.
The actor and the critic are approximated by Normalized
Gaussian Networks (NGnets), which are networks of local linear
regression units. The NGnets are trained by the online EM
algorithm proposed in our previous paper.
We apply our RL method to a task of swinging up and stabilizing
a single pendulum, and a task of balancing a double pendulum
near the upright position.
The experimental results show that our RL method can be applied
to optimal control problems having continuous
state/action spaces and that it achieves good control in a small
number of trials.
 Markov Processes on Curves for Automatic Speech Recognition  Lawrence Saul, Mazin Rahim
 We investigate a probabilistic framework for automatic speech
recognition based on the intrinsic geometric properties of curves. In
particular, we analyze the setting in which two variables, one
continuous (X) and one discrete (S), evolve jointly in time. We suppose
that the vector X traces out a smooth multidimensional curve and that
the variable S evolves stochastically as a function of the arc length
traversed along this curve. Since arc length does not depend on the
rate at which a curve is traversed, this gives rise to a family of
Markov processes whose predictions, Pr[S|X], are invariant to
nonlinear warpings of time. We describe the use of such models, known
as Markov processes on curves, for automatic speech recognition, where
X are acoustic feature trajectories and S are phonetic transcriptions.
On the task of recognizing 1219 New Jersey town names, we report
systematic improvements in word accuracy over comparable,
state-of-the-art hidden Markov models.
 Shrinking the Tube: A New Support Vector Regression Algorithm  Bernhard Sch\"olkopf, Peter L. Bartlett, Alex J. Smola, Robert Williamson
 A new algorithm for Support Vector regression is proposed. For a priori
chosen $\nu$, it automatically adjusts a flexible tube of minimal radius to
the data such that at most a fraction $\nu$ of the data points lie outside.
Moreover, it is shown how to use parametric tube shapes with non-constant
radius. The algorithm is analysed theoretically and experimentally.
 Probabilistic Image Sensor Fusion  Ravi K. Sharma, Todd K. Leen, Misha Pavel
 We present a probabilistic approach for fusion of video sequences
produced by multiple imaging sensors. We model the sensor images
as noisy, locally affine functions of an underlying true scene.
Maximum likelihood estimates of the parameters in the local affine
functions are based on the local covariance of the image data, and
therefore related to local principal component analysis.
With the model parameters estimated, a Bayesian framework provides
either maximum likelihood or maximum a posteriori estimates of the
true scene from the sensor images. These true scene estimates
comprise the sensor fusion rules. We demonstrate the efficacy of
the approach on sequences of images from visible-band and infrared
sensors.
 Boxlets: a Fast Convolution Algorithm for Signal Processing and Neural Networks  Patrice Y. Simard, Leon Bottou, Patrick Haffner, Yann Le Cun
 Signal processing and pattern recognition algorithms make extensive use of
convolution. In many cases, computational accuracy is not as important as
computational speed. In feature extraction, for instance, the features of
interest in a signal are usually quite distorted. This form of noise
justifies some level of quantization in order to achieve faster feature
extraction. Our approach consists of approximating regions of the signal
with low degree polynomials, and then differentiating the resulting signals
in order to obtain impulse functions (or derivatives of impulse functions).
With this representation, convolution becomes extremely simple and can be
implemented quite effectively. The true convolution can be recovered by
integrating the result of the convolution. This method yields a substantial
speed-up in feature extraction and is applicable to convolutional neural
networks.
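The degree-0 case of this idea is easy to see in code: differencing a piecewise-constant signal leaves a sparse impulse train, the impulse trains are convolved, and two cumulative sums undo the two differentiations. This is a toy sketch of the principle only; the paper handles general low-degree polynomial pieces:

```python
import numpy as np

def impulse_convolve(f, g):
    """Convolution via derivative impulses: conv(diff f, diff g) integrated
    twice equals conv(f, g).  For piecewise-constant signals the difference
    sequences are sparse, which is what makes the middle convolution cheap."""
    df = np.diff(f, prepend=0, append=0)   # impulses where f steps
    dg = np.diff(g, prepend=0, append=0)   # impulses where g steps
    h = np.cumsum(np.cumsum(np.convolve(df, dg)))
    return h[: len(f) + len(g) - 1]        # trim the two integration tails
```

The identity holds exactly in the discrete setting, so the result matches `np.convolve(f, g)` up to floating-point error.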
 Invited Talk: Cortical Normalization Models and the Statistics of Visual Images  Eero Simoncelli
 I present a parametric statistical model for visual images in the
wavelet transform domain. The model characterizes the joint densities
of coefficients corresponding to basis functions at adjacent spatial
locations, adjacent orientations, and adjacent spatial scales. The
model is consistent with the statistics of a wide variety of images,
including photographs of indoor and outdoor scenes, medical images,
and synthetic (graphics) images, and has been used successfully in
applications of compression, noise removal, and texture synthesis.
The model also suggests a nonlinear method of removing these
dependencies, which I call ``normalized component analysis,'' in which
each wavelet coefficient is divided by a linear combination of
coefficient magnitudes at adjacent locations, orientations and scales.
This analysis provides a theoretical justification for recent divisive
normalization models of striate visual cortex. Furthermore, the
statistical measurements may be used to determine the weights that are
used in computing the normalization signal. The resulting model makes
specific predictions regarding non-specific suppression and adaptation
behaviors of cortical neurons, and thus offers the opportunity to test
directly (through physiological measurements) the ecological
hypothesis that visual neural computations are optimally matched to
the statistics of images.
 Modeling Non-Specific Suppression in V1 Neurons with a Statistically-Derived Normalization Model  Eero Simoncelli, Odelia Schwartz
 We examine the statistics of natural monochromatic images decomposed
using a multiscale wavelet basis. Although the coefficients of this
representation are nearly decorrelated, they exhibit important
higher-order statistical dependencies that cannot be eliminated with
purely linear processing. In particular, rectified coefficients
corresponding to basis functions at neighboring spatial positions,
orientations and scales are highly correlated. A method of
removing these dependencies is to {\em divide} each coefficient by a
weighted combination of its rectified neighbors. Several successful
models of neural processing in visual cortex are based on such
divisive gain control (or ``normalization'') computations, and thus our analysis
provides a theoretical justification for these models. Perhaps more
importantly, the statistical measurements explicitly specify the
weights that should be used in computing the normalization signal. We
demonstrate that this weighting is qualitatively consistent with
recent physiological measurements, and thus that early visual neural
processing is well matched to these statistical properties of images.
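In one dimension the divisive computation reduces to a few lines. The weight mask below is a placeholder of my choosing; the point of the paper is that those weights can be read off from the measured image statistics rather than hand-tuned:

```python
import numpy as np

def divisive_normalize(coeffs, weights, sigma=0.1):
    """Divide each rectified coefficient by a weighted sum of its
    rectified neighbours plus a stabilizing constant sigma
    (divisive gain control / normalization).  1-D sketch."""
    r = np.abs(coeffs)                           # rectification
    pool = np.convolve(r, weights, mode='same')  # weighted neighbourhood energy
    return r / (sigma + pool)
```

An isolated strong coefficient is barely suppressed, while a coefficient surrounded by strong neighbours is divided down, which is the gain-control behavior described above.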
 Online and Batch Parameter Estimation of Gaussian Mixtures Based on the Relative Entropy  Yoram Singer, Manfred K. Warmuth
 We describe a new iterative method for parameter estimation of Gaussian
mixtures. The new method is based on a framework developed by Kivinen
and Warmuth for supervised online learning. In contrast to gradient
descent and EM, which estimate the mixture's covariance matrices, the
proposed method estimates the {\em inverses} of the covariance
matrices. Furthermore, the new parameter estimation procedure can be
applied in both online and batch settings. We show experimentally that
it is typically faster than EM, and usually requires about half as many
iterations as EM. We also describe experiments with digit recognition
that demonstrate the merits of the online version.
 Discontinuous Recall Transitions Induced by Competition Between Short- and Long-Range Interactions in Recurrent Networks  N.S. Skantzos, C.F. Beckmann, A.C.C. Coolen
 We present exact analytical equilibrium solutions for a class of recurrent
neural network models, with both sequential and parallel neuronal dynamics,
in which there is a tunable competition between nearest-neighbour and
long-range synaptic interactions. This competition is found to induce novel
coexistence phenomena as well as discontinuous transitions between pattern
recall states, 2-cycles and non-recall states.
 Semiparametric Support Vector and Linear Programming Machines  Alex J. Smola, Thilo Friess, Bernhard Sch\"olkopf
 Semiparametric models are useful tools in the case where domain
knowledge exists about the function to be estimated or emphasis is
put on understandability of the model. We extend two learning
algorithms, Support Vector machines and Linear Programming machines,
to this case and give experimental results for SV machines.
 Learning Curves for Gaussian Processes  Peter Sollich
 I consider the problem of calculating learning curves (i.e., average
generalization performance) of Gaussian processes used for
regression. A simple expression for the generalization error in terms
of the eigenvalue decomposition of the covariance function is derived,
and used as the starting point for several approximation schemes. I
identify where these become exact, and compare with existing bounds on
learning curves; the new approximations, which can be used for any
input space dimension, generally get substantially closer to the
truth.
 Invited Talk: Computation by Cortical Modules  Haim Sompolinsky
 Understanding the principles underlying the processing of information in the
neocortex is one of the most important challenges of brain research.
Anatomical and physiological data suggest that cortical areas function
as interacting networks of local computational modules. These modules,
analogous to the hypercolumns in visual cortex, consist of hundreds of
thousands of cells which are interconnected by
massive recurrent excitation and inhibition. Recent theoretical research has
shed light on the intrinsic dynamics of these local cortical circuits, the
resultant repertoire of spatiotemporal patterns of neuronal activity, and
their possible computational
implications for feature extraction, motor planning and memory.
 Applications of Multiresolution Neural Networks to Mammography  Clay D. Spence, Paul Sajda
 We have previously presented a coarse-to-fine hierarchical
pyramid/neural network (HPNN) architecture which combines
multiscale image processing techniques with neural networks. In
this paper we present applications of this general architecture to two
problems in mammographic Computer-Aided Diagnosis (CAD). The first
application is the detection of microcalcifications. The
coarse-to-fine HPNN was designed to learn large-scale context
information for detecting small objects like microcalcifications.
Receiver operating characteristic (ROC) analysis suggests that the
hierarchical architecture improves detection performance of a
well-established CAD system by roughly 50\%. The second application is to
detect mammographic masses directly. Since masses are large, extended
objects, the coarse-to-fine HPNN architecture is not suitable for this problem.
Instead we construct a fine-to-coarse HPNN architecture which is
designed to learn small-scale detail structure associated with the
extended objects. Our initial results applying the fine-to-coarse HPNN to mass
detection are encouraging, with detection performance improvements of
15\% to 30\%. We conclude that the ability of the HPNN architecture to
integrate information across scales, both coarse-to-fine and
fine-to-coarse, makes it well suited for detecting objects
which may have contextual clues or detail structure occurring at
scales other than the natural scale of the object.
 Information Maximization in Single Neurons  Martin B. Stemmler, Christof Koch
 Information from the senses must be compressed into the limited range of
responses spiking nerve cells can generate; for optimal compression, the
neuron's response should match the statistics of stimuli encountered in
nature. Given a limit on its maximal firing rate, a nerve cell should teach
itself to use each available firing rate equally often. Given a fixed
average firing rate, a neuron should selforganize itself to respond with
high firing rates only to comparatively rare events. Since changing the
voltage-dependent ionic conductances in the nerve cell membrane alters
the representation of information, an unsupervised
learning rule is derived to show how a Hodgkin-Huxley model neuron can
adapt these conductances to optimize its firing rate response. This
learning rule corresponds to a non-Hebbian,
developmental mechanism that maximizes the rate of information
transmission.
 Computation of Smooth Optical Flow in a Feedback Connected Analog Network  Alan Stocker, Rodney Douglas
 In 1985, Tanner and Mead implemented an interesting constraint satisfaction
circuit for global motion sensing in aVLSI. We report here a new and improved
aVLSI implementation that provides smooth optical flow as well as global
motion in a two-dimensional visual field. The computation of optical flow is
an ill-posed problem, which expresses itself as the aperture problem. However,
the optical flow can be estimated by the use of regularization methods, in
which additional constraints are introduced in terms of a global energy
functional that must be minimized. We show how the algorithmic constraints of
Horn and Schunck on computing smooth optical flow can be mapped onto the
physical constraints of an equivalent electronic network.
 A Reinforcement Learning Algorithm in Partially Observable Environments Using Short-Term Memory  Nobuo Suematsu, Akira Hayashi
 We describe a Reinforcement Learning algorithm for
partially observable environments using short-term memory,
which we call BLHT. Since BLHT learns a stochastic model
based on Bayesian Learning, the overfitting problem is
reasonably solved. Moreover, BLHT has an efficient
implementation. This paper shows that the model learned
by BLHT converges to one which provides the most accurate
predictions of percepts and rewards, given short-term
memory.
 Improved Switching among Temporally Abstract Actions  Richard S. Sutton, Satinder Singh, Doina Precup, B. Ravindran
 In robotics and other control applications it is commonplace to have a
preexisting set of controllers for solving subtasks, perhaps handcrafted
or previously learned or planned, and still face a difficult problem of
how to choose and switch among the controllers to solve an overall task as
well as possible. In this paper we present a framework based on Markov
Decision Processes and Semi-Markov Decision Processes for phrasing this
problem, a basic theorem regarding the improvement in performance that can
be obtained by switching flexibly between given controllers, and example
applications of the theorem. In particular, we show how an agent can plan
with these high-level controllers and then use the results of such
planning to find an even better plan, by modifying the existing
controllers with negligible additional cost and no replanning. In one of
our examples, the complexity of the problem is reduced from 24 billion
state-action pairs to less than a million state-controller pairs.
 A Theory of Mean Field Approximation  Toshiyuki Tanaka
 I present a theory of mean field approximation based on
information geometry. This theory includes in a consistent
way the naive mean field approximation, as well as the
Thouless-Anderson-Palmer (TAP) approach and the linear
response theorem in statistical physics, giving clear
information-theoretic interpretations to them.
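As a concrete anchor, the naive mean-field approximation for a Boltzmann machine with symmetric couplings $w_{ij}$ and biases $b_i$ solves the self-consistency equations $m_i = \tanh(b_i + \sum_j w_{ij} m_j)$ for the mean activations. A fixed-point-iteration sketch (the two-unit example values are made up):

```python
import math

def naive_mean_field(w, b, n_iter=200):
    """Iterate the naive mean-field equations
    m_i = tanh(b_i + sum_j w_ij * m_j) to a fixed point."""
    n = len(b)
    m = [0.0] * n
    for _ in range(n_iter):
        m = [math.tanh(b[i] + sum(w[i][j] * m[j] for j in range(n)))
             for i in range(n)]
    return m

w = [[0.0, 0.5], [0.5, 0.0]]   # symmetric couplings
b = [0.1, -0.2]                # external fields
m = naive_mean_field(w, b)
# m now satisfies the self-consistency equations to high accuracy
```

The TAP approach adds an Onsager correction term to these same equations; the information-geometric view of the paper places both in one family of approximations.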
 Bayesian Modeling of Human Concept Learning  Joshua B. Tenenbaum
 I consider the problem of learning concepts from small numbers of
positive examples, a feat which humans perform all the time but which
computers are rarely capable of. In the spirit of trying to bridge
machine learning and cognitive science perspectives, I present both
theoretical analysis and an empirical study with human subjects for a
simple, well-known task: learning concepts corresponding to
axis-aligned rectangles in a multidimensional feature space. Existing
learning models, when applied to this task, cannot explain how
subjects generalize from only a few examples of the concept. I
propose a principled Bayesian model based on the assumption that the
observed examples are a random sample from the concept to be learned.
The Bayesian model gives precise fits to human behavior on this simple
task and provides qualitative insights into more complex, realistic
cases of concept learning.
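The random-sampling assumption yields the "size principle": a hypothesis of size $|h|$ that contains all $n$ observed examples has likelihood $|h|^{-n}$, so smaller consistent concepts dominate as examples accumulate. A toy one-dimensional interval version (the grid and values are illustrative, not the paper's rectangle task):

```python
def generalization(examples, y, lo=0, hi=10, step=1):
    """P(y in concept | examples) when concepts are intervals [a, b]
    on a grid, each consistent interval weighted by the size
    principle (b - a)^(-n)."""
    n = len(examples)
    num = den = 0.0
    a = lo
    while a < hi:
        for b in range(a + step, hi + step, step):
            if all(a <= x <= b for x in examples):
                w = float(b - a) ** (-n)   # smaller consistent concepts win
                den += w
                if a <= y <= b:
                    num += w
        a += step
    return num / den

p_in  = generalization([4, 5], 4.5)   # inside the span of the examples
p_out = generalization([4, 5], 9.0)   # far outside it
```

Generalization is near-certain between the examples and falls off sharply beyond them, which is the qualitative pattern the Bayesian model fits in human subjects.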
 Orientation, Scale, and Discontinuity as Emergent Properties of Illusory Contour Shape  Karvel K. Thornber, Lance R. Williams
 A recent neural model of illusory contour formation is based on a
distribution of natural shapes traced by particles moving with
constant speed in directions given by Brownian motions. The input to
that model consists of pairs of position and direction constraints and
the output consists of the distribution of contours joining all such
pairs. In general, these contours will not be closed and their
distribution will not be scale-invariant. In this paper, we show how
to compute a scale-invariant distribution of closed contours given
position constraints alone, and use this result to explain a well-known
illusory contour effect.
 Probabilistic Visualisation of High-dimensional Binary Data  Michael E. Tipping
 We present a probabilistic latentvariable framework for data
visualisation, a key feature of which is its applicability to binary
and categorical data types for which few established methods exist.
A variational approximation to the likelihood is exploited to derive a
fast algorithm for determining the model parameters. Illustrations of
application to real and synthetic binary data sets are given.
 SMEM Algorithm for Mixture Models  Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, Geoffrey Hinton
 We present a split and merge EM (SMEM) algorithm to overcome the
local maximum problem in parameter estimation of finite mixture models.
In the case of mixture models, non-global maxima often involve having
too many components of a mixture model in one part of the space and
too few in another, widely separated part of the space.
To escape from such configurations we repeatedly perform simultaneous
split and merge operations using a new criterion for efficiently
selecting the split and merge candidates.
We apply the proposed algorithm to the training of Gaussian mixtures
and mixtures of factor analyzers using synthetic and real data and
show the effectiveness of using the split and merge operations
to improve the likelihood of both the training data and of
held-out test data.
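One natural criterion for picking merge candidates, in the spirit of the abstract, compares the components' posterior responsibility vectors over the data: two components that claim the same points are redundant. The sketch below uses cosine similarity of responsibility vectors; the exact criterion and names are illustrative, not necessarily the authors':

```python
import math

def merge_candidates(resp):
    """Rank component pairs for merging by the cosine similarity of
    their posterior responsibility vectors over the data set.
    resp[t][i] = responsibility of component i for data point t."""
    n_comp = len(resp[0])
    def score(i, j):
        dot = sum(r[i] * r[j] for r in resp)
        ni = math.sqrt(sum(r[i] ** 2 for r in resp))
        nj = math.sqrt(sum(r[j] ** 2 for r in resp))
        return dot / (ni * nj)
    pairs = [(i, j) for i in range(n_comp) for j in range(i + 1, n_comp)]
    return sorted(pairs, key=lambda p: -score(*p))

# components 0 and 1 split responsibility for the same points,
# while component 2 owns a separate cluster
resp = [[0.5, 0.5, 0.0]] * 5 + [[0.0, 0.0, 1.0]] * 5
ranked = merge_candidates(resp)
```

The top-ranked pair is merged while an underused component elsewhere is split, and EM is rerun from the modified configuration.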
 Learning Mixture Hierarchies  Nuno Vasconcelos, Andrew Lippman
 The hierarchical representation of data has various applications in
domains such as data mining, machine vision, or information
retrieval. In this paper we introduce an extension of the
Expectation-Maximization (EM) algorithm that learns mixture
hierarchies in a computationally efficient manner. Efficiency is
achieved by progressing in a bottom-up fashion, i.e. by clustering
the mixture components of a given level in the hierarchy to obtain
those of the level above. This clustering requires only knowledge of
the mixture parameters, there being no need to resort to
intermediate samples. In addition to practical applications, the
algorithm allows a new interpretation of EM that makes clear the
relationship with nonparametric kernel-based estimation
methods, provides explicit control over the tradeoff between the
bias and variance of EM estimates, and offers new insights about the
behavior of deterministic annealing methods commonly used with EM to
escape local minima of the likelihood.
 Discovering Hidden Features with Gaussian Processes Regression  Francesco Vivarelli, Christopher K. I. Williams
 In Gaussian process regression the covariance
between the outputs at input locations $\mathbf{x}$ and $\mathbf{x}'$ is usually
assumed to depend on the distance
$(\mathbf{x} - \mathbf{x}')^\top W (\mathbf{x} - \mathbf{x}')$,
where $W$ is a positive definite matrix.
$W$ is often taken to be diagonal, but if we allow $W$ to be
a general positive definite matrix which can be tuned on the basis of
training data, then an eigenanalysis of $W$ shows that we are effectively
creating hidden features, where the dimensionality of the hidden-feature
space is determined by the data. We demonstrate the superiority of
predictions using the general matrix over those based on a diagonal
matrix on two test problems.
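The eigenanalysis claim can be checked directly: writing $W = \sum_k \lambda_k v_k v_k^\top$, the quadratic distance under $W$ equals an ordinary squared distance in the "hidden features" $z_k = \sqrt{\lambda_k}\, v_k^\top \mathbf{x}$. A small numeric check (rotation angle and eigenvalues are arbitrary):

```python
import math

theta = 0.6                      # rotation defining the hidden-feature axes
c, s = math.cos(theta), math.sin(theta)
V = [(c, s), (-s, c)]            # rows = orthonormal eigenvectors of W
lam = (2.0, 0.1)                 # large eigenvalue -> highly relevant feature

# W = sum_k lam_k * v_k v_k^T
W = [[sum(lam[k] * V[k][i] * V[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

def dist_W(x, xp):
    """(x - x')^T W (x - x') computed directly."""
    d = (x[0] - xp[0], x[1] - xp[1])
    return sum(d[i] * W[i][j] * d[j] for i in range(2) for j in range(2))

def dist_features(x, xp):
    """Plain squared distance in the features z_k = sqrt(lam_k) v_k . x."""
    out = 0.0
    for k in range(2):
        zk  = math.sqrt(lam[k]) * (V[k][0] * x[0]  + V[k][1] * x[1])
        zkp = math.sqrt(lam[k]) * (V[k][0] * xp[0] + V[k][1] * xp[1])
        out += (zk - zkp) ** 2
    return out

x, xp = (1.0, -0.5), (0.3, 0.8)
# the two distances agree: tuning W == learning hidden linear features
```

Directions of $W$ with small eigenvalues are effectively pruned, so the data determine how many hidden features matter.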
 The Bias-Variance Tradeoff and the Randomized GACV  Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, Barbara Klein
 We propose a new in-sample cross-validation based
method (randomized GACV) for choosing smoothing
or bandwidth parameters that govern the bias-variance
or fit-complexity
tradeoff in `soft' classification. Soft classification
refers to a learning procedure which estimates the probability
that an example with a given attribute vector is in
class 1 {\it vs} class 0. The target for optimizing
the tradeoff is the Kullback-Leibler distance between
the estimated probability distribution and the `true'
probability distribution, representing knowledge
of an infinite population. The method uses a
randomized estimate of the trace of a Hessian and mimics cross
validation at the cost of a single relearning with
perturbed outcome data.
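The randomized trace idea here is of the Hutchinson type: for random $\pm 1$ vectors $z$, $E[z^\top A z] = \operatorname{tr}(A)$, so the trace of a Hessian can be estimated from a few Hessian-vector products without forming the matrix. A minimal sketch (the matrix and names are illustrative, not the GACV objective itself):

```python
import random

def randomized_trace(matvec, n, n_probes=200, seed=0):
    """Estimate tr(A) as the average of z^T A z over random +/-1
    probe vectors z; only matrix-vector products are required."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Az = matvec(z)
        total += sum(zi * ai for zi, ai in zip(z, Az))
    return total / n_probes

A = [[3.0, 1.0, 0.0], [1.0, 2.0, 0.5], [0.0, 0.5, 4.0]]
def matvec(z):
    return [sum(A[i][j] * z[j] for j in range(3)) for i in range(3)]

est = randomized_trace(matvec, 3)   # true trace is 9.0
```

In the GACV setting the "matrix-vector product" is one relearning with perturbed outcome data, which is why the method costs only a single extra fit.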
 Classification in Non-Metric Spaces  Daphna Weinshall, David W. Jacobs, Yoram Gdalyahu
 A key question in vision is how to represent our knowledge of
previously encountered objects in order to classify new ones. The
answer depends on how we determine the similarity of two
objects. Similarity tells us how relevant each previously seen object
is in determining the category to which a new object belongs. Here a
dichotomy emerges. Complex notions of similarity appear necessary for
cognitive models and applications, while simple notions of similarity
form a tractable basis for current computational approaches to
classification. We explore the nature of this dichotomy and why it
calls for new approaches to wellstudied problems in learning. We
begin this process by demonstrating new computational methods for
supervised learning that can handle complex notions of similarity.
(1) We discuss how to implement parametric methods that represent a
class by its MEAN when using non-metric similarity functions;
and (2) we review nonparametric methods that we have developed using
nearest neighbor classification in non-metric spaces.
 Basis Selection For Wavelet Regression  Kevin R. Wheeler
 A wavelet basis selection procedure is presented for wavelet regression.
Both the basis and threshold are selected using cross-validation.
The method includes the capability of incorporating prior knowledge on
the smoothness (or shape of the basis functions) into the basis
selection procedure. The results of the method are demonstrated using
widely published sampled functions.
The results are contrasted with those of other basis-function-based
methods.
 DTs: Dynamic Trees  Christopher K. I. Williams, Nicholas J. Adams
 In this paper we introduce a new class of image models, which we call
dynamic trees or DTs. A dynamic tree model specifies a prior over a
large number of trees, each one of which is a tree-structured belief
net (TSBN).
Our intuition as to why DTs may be useful image models is based on the
idea that most pixels in an image are derived from a single object. We think
of an object as being described by a root of a tree, with the scale of the
object being determined by the level in the tree at which the root occurs.
Experiments show that the DT models have better translation invariance
properties than a fixed, ``balanced'' TSBN, and that Markov chain
Monte Carlo methods are effective at finding trees which have high
posterior probability.
 Experiments with an Algorithm which Learns Stochastic Memoryless Policies for Partially Observable Markov Decision Processes  John K. Williams, Satinder Singh
 Partially Observable Markov Decision Processes (POMDPs) constitute an
important class of reinforcement learning problems which present
unique theoretical and computational difficulties. In the absence of
the Markov property, popular reinforcement learning algorithms such as
Q-learning may no longer be effective, and memory-based methods which
remove partial observability via state estimation are notoriously
expensive. An alternative approach is to seek a stochastic memoryless
policy which for each observation of the environment prescribes a
probability distribution over available actions that maximizes the
average reward per time step. A reinforcement learning algorithm which
learns a locally optimal stochastic memoryless policy has been
proposed by Jaakkola, Singh and Jordan, but not empirically
verified. We present a variation of this algorithm, discuss its
implementation, and demonstrate its viability using four test problems.
 Robot Docking using Mixtures of Gaussians  Matthew M. Williamson, Roderick MurraySmith, Volker Hansen
 This paper applies the {\it Mixture of Gaussians}
probabilistic model, combined with Expectation Maximization
optimization, to the task of summarizing three-dimensional range data
for a mobile robot. This provides a flexible way of dealing with
uncertainties in sensor information, and allows the introduction of
prior knowledge into low-level perception modules. Problems with the
basic approach were solved in several ways: the mixture of Gaussians
was reparameterized to reflect the types of objects expected in the
scene, and priors on model parameters were included in the
optimization process. Both approaches force the optimization to find
`interesting' objects, given the sensor and object characteristics. A
higher level classifier was used to interpret the results provided by
the model, and to reject spurious solutions.
 Using Collective Intelligence to Route Internet Traffic  David Wolpert, Kagan Tumer, Jeremy Frank
 A COllective INtelligence (COIN) is a set of interacting reinforcement
learning (RL) algorithms designed so that their collective behavior
optimizes a global utility function. We summarize the theory of COINs,
then present experiments using that theory to design COINs to control
internet traffic routing. These experiments indicate that COINs
outperform all previously investigated RL-based, shortest-path
routing algorithms.
 Population Coding with Correlated Noise  Hyoungsoo Yoon, Haim Sompolinsky
 We study the effect of correlated noise on the accuracy of
population coding using a model of a population of neurons that are
broadly tuned to an angle in two dimensions.
The fluctuations in the neuronal activity are modeled as
Gaussian noise with pairwise correlations which decay
exponentially with the difference between the preferred
orientations of the pair.
By calculating the Fisher information of the system,
we show that in the biologically relevant regime of parameters
positive correlations
decrease the estimation capability of the network relative to
the uncorrelated population.
Moreover, strong positive correlations result in information capacity which
saturates to a finite value as the number of cells in the population grows.
In contrast, negative correlations substantially increase the
information capacity of the neuronal population.
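The effect can be seen in a two-neuron toy model: for Gaussian noise with covariance $C$, the Fisher information about the encoded angle is $J = f'(\theta)^\top C^{-1} f'(\theta)$. When the neurons' tuning derivatives are similar, positive correlation lowers $J$ and negative correlation raises it (the numbers below are illustrative, not the paper's parameters):

```python
def fisher_info(fprime, C):
    """Fisher information J = f'(theta)^T C^{-1} f'(theta) for a
    two-neuron population with Gaussian noise covariance C,
    inverting C in closed form."""
    (a, b), (c_, d) = C
    det = a * d - b * c_
    Cinv = [[d / det, -b / det], [-c_ / det, a / det]]
    return sum(fprime[i] * Cinv[i][j] * fprime[j]
               for i in range(2) for j in range(2))

fp = [1.0, 1.0]                                        # similarly tuned pair
J_pos  = fisher_info(fp, [[1.0,  0.5], [ 0.5, 1.0]])   # = 2/(1+c) = 4/3
J_zero = fisher_info(fp, [[1.0,  0.0], [ 0.0, 1.0]])   # = 2
J_neg  = fisher_info(fp, [[1.0, -0.5], [-0.5, 1.0]])   # = 4
# positive correlations lower J; negative correlations raise it
```

With uniform positive correlation $c$ this toy case gives $J = 2/(1+c)$, which stays finite as correlation grows, mirroring the saturation result for large populations.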
 Convergence Rates of Algorithms for Visual Search: Detecting Visual Contours.  A.L. Yuille, James M. Coughlan
 This paper develops a theory for the convergence
rates of algorithms for performing visual search tasks (formulated in
a Bayesian framework). Our approach makes use of the A* search
strategy and mathematical results from the theory of types in
information theory. In particular, we formulate the problem of the
detection of visual contours in noise/clutter by optimizing a global
criterion for combining local intensity and geometry information. We
analyze the convergence rates of A* search algorithms for detecting
the target contour. This analysis determines characteristics of the
domain, which we call order parameters, that determine the
convergence rates. In particular, we present a specific admissible A*
algorithm with pruning which converges, with high probability, with
expected convergence time $O(N)$ in the size of the problem. In addition, we
briefly summarize extensions of this work which address fundamental
limits of target contour detectability (i.e. algorithm-independent
results) and the use of nonadmissible heuristics.
 Distributional Population Codes and Multiple Motion Models  Richard S. Zemel, Peter Dayan
 Theoretical and empirical studies of population codes make the
assumption that neuronal activities reflect a unique and unambiguous
value of an encoded quantity. However, population activities can
contain additional information, such as multiple values of the
quantity and uncertainty about it. We have previously suggested a
method to recover the extra information by treating the activities
of the population of cells as coding for a complete distribution
over the coded quantity rather than just a single value. We now
show how this approach bears on psychophysical and
neurophysiological studies of population codes for motion direction.
Our model is consistent with both correct and erroneous human
performance on tasks in which subjects must report the directions of
motion in stimuli containing groups of dots moving in some small
number of directions. The model also suggests a natural explanation
for why a monkey's psychophysical discrimination performance is not
substantially better.
 Blind Separation of Filtered Source Using State-Space Approach  Liqing Zhang, Andrzej Cichocki
 In this paper we present a novel approach to multichannel blind
separation/generalized deconvolution, assuming that both mixing and
demixing models are described by stable linear state-space systems. We
decompose the blind separation problem into two processes: separation and
state estimation. Based on the minimization of the Kullback-Leibler
divergence, we develop a novel learning algorithm to train the matrices
in the output equation. To estimate the state of the demixing model, we
introduce a new concept, called hidden innovation, to numerically
implement the Kalman filter. Computer simulations are given to show the
validity and high effectiveness of the state-space approach.
 A High Performance kNN Classifier Using a Binary Correlation Matrix Memory  Ping Zhou, Jim Austin, John Kennedy
 This paper presents a novel and fast kNN classifier that is based on
a binary CMM (Correlation Matrix Memory) neural network. A robust
encoding method is developed that converts numerical inputs into binary
ones to meet CMM input requirements. A hardware implementation of the
CMM is described, which gives over 200 times the speed of a current
mid-range workstation, and is scalable to very large problems. When
tested on several benchmarks and compared with a simple kNN method,
the CMM classifier gave less than 1\% lower accuracy and over 4 and 12
times speedups in software and hardware respectively.
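A common way to turn a numeric attribute into the binary patterns a CMM requires is a sliding-window code, in which codes of nearby values share set bits while distant values share none. The sketch below is only illustrative; the paper's robust encoder differs in its details:

```python
def window_code(x, lo, hi, n_bits=16, w=4):
    """Encode a numeric value as a binary pattern: a window of w
    consecutive set bits whose position tracks x's place in [lo, hi],
    so codes of nearby values overlap and distant ones do not."""
    pos = round((x - lo) / (hi - lo) * (n_bits - w))
    pos = max(0, min(n_bits - w, pos))
    return [1 if pos <= i < pos + w else 0 for i in range(n_bits)]

def overlap(a, b):
    """CMM-style similarity: count of shared set bits."""
    return sum(x & y for x, y in zip(a, b))

c_near_a = window_code(0.2, 0.0, 1.0)
c_near_b = window_code(0.3, 0.0, 1.0)
c_far    = window_code(0.9, 0.0, 1.0)
# nearby values share bits; distant values do not
```

Because similarity reduces to counting shared bits, the kNN search becomes a single binary matrix-vector operation, which is what the CMM hardware accelerates.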