# Neural Information Processing Systems 1998

This is a list of the papers that were presented at NIPS*98. They are listed in (approximately) alphabetical order according to the last name of the first author.

These papers should be cited as:
Advances in Neural Information Processing Systems 11,
M. S. Kearns, S. A. Solla, D. A. Cohn, eds., MIT Press, 1999.
Some of you may remember Mr. Kearns' middle initial as something different. Following a recent marriage, it is now correctly "M. S. Kearns".
A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-R-S-T-V-W-Y-Z

#### A

Invited Talk: Temporally Asymmetric Hebbian Learning, Spike Timing, and Neural Response Variability -- L. F. Abbott
In most neural network studies of Hebbian learning, changes in synaptic weights depend on pre- and postsynaptic firing rates. Data on long-term potentiation (LTP) and depression (LTD), candidate mechanisms for Hebbian learning in biological systems, indicate that the precise timing of pre- and postsynaptic action potentials, not merely firing rates, determines both the magnitude and sign of changes in synaptic efficacies. Recent data indicates that the degree of synaptic modification is a narrow, temporally asymmetric function of the time difference between pre- and postsynaptic spikes. Temporally asymmetric Hebbian learning based on these data is competitive and can reach a stable equilibrium, even in the absence of normalization constraints. Synaptic modification for a model neuron receiving multiple inputs automatically reaches a balanced state'' in which neuronal firing is highly irregular. Such states have been proposed to explain the variability seen in cortical responses. In these models, the strengthening of synapses does not depend on the firing rate, but rather on the degree of correlation between different synaptic inputs. When inputs are uncorrelated, no systematic synaptic modification occurs, but rapid learning can take place if inputs are appropriately correlated.

Influence of Changing the Synaptic Transmitter Release Probability on Contrast Adaptation of Simple Cells in the Primary Visual Cortex -- Peter Adorjan, Klaus Obermayer
The contrast response function (CRF) of many neurons in the primary visual cortex saturates, and shifts towards higher contrast values following prolonged presentation of high contrast visual stimuli. Using a recurrent neural network of excitatory spiking neurons with adapting synapses we show that both effects could be explained by a fast and a slow component in the synaptic adaptation. The fast component---a short term synaptic depression component---leads to a saturation of the CRF and a phase advance in the cortical cells' response to high contrast stimuli. The slow component is derived from an adaptation of the probability of the synaptic transmitter release, and changes such that the mutual information between the input and the output of a cortical neuron is maximal. This component---given by the infomax learning rule---explains contrast adaptation of the averaged membrane potential (DC component) as well as the surprising experimental result, that the stimulus modulated component (F1 component) of a cortical cell's membrane potential adapts only weakly. Based on our results we propose a new experimental method to estimate the strength of the effective excitatory feedback to a cortical neuron, and we also suggest a relatively simple experimental test to justify our hypothesized synaptic mechanism for contrast adaptation.

Robust, Efficient, Globally-Optimized Reinforcement Learning with the Parti-Game Algorithm -- Mohammad A. Al-Ansari, Ronald J. Williams
Parti-game (Moore 1994a, Moore 1994b, Moore and Atkeson 1995) is a reinforcement learning (RL) algorithm that has a lot of promise in overcoming the curse of dimensionality (Bellman 1957) that can plague RL algorithms when applied to high-dimensional problems. In this paper we introduce modifications to the algorithm that further improve its performance and robustness. In addition, while parti-game solutions can be improved locally by standard local path-improvement techniques, we introduce an add-on algorithm in the same spirit as parti-game that instead tries to improve solutions in a non-local manner.

Hierarchical ICA Belief Networks -- Hagai Attias
We introduce a multilayer generalization of independent component analysis (ICA). At each level, this network extracts real-valued latent variables that are non-linear functions of the input data with a highly adaptive functional form, resulting in a hierarchical distributed representation of these data. The network is based on a probabilistic generative model, constructed by cascading single-layer local ICA models. Whereas exact maximum-likelihood learning for this model is intractable, we present and demonstrate an algorithm that maximizes a lower bound on the likelihood. This algorithm is developed by formulating a variational approach to hierarchical ICA networks.

#### B

Gradient Descent for General Reinforcement Learning -- Leemon Baird, Andrew W. Moore
A simple learning rule is derived, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms it also generates valueless reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows valueless and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning. And these algorithms converge for POMDPs without requiring a proper belief state. Simulations results are given, and several areas for future research are discussed.

Probabilistic Modeling for Face Orientation Discrimination: Learning from Labeled and Unlabeled Data -- Shumeet Baluja
This paper presents probabilistic modeling methods to solve the problem of discriminating between five facial orientations with very little labeled data. Three models are explored. The first model maintains no inter-pixel dependencies, the second model is capable of modeling a set of arbitrary pairwise dependencies, and the last model allows dependencies only between neighboring pixels. We show that for all three of these models, the accuracy of the learned models can be greatly improved by augmenting a small number of labeled training images with a large set of unlabeled images using Expectation-Maximization. This is important because it is often difficult to obtain image labels, while many unlabeled images are readily available. Through a large set of empirical tests, we examine the benefits of unlabeled data for each of the models. By using only two randomly selected labeled examples per class, we can discriminate between the five facial orientations with an accuracy of 94\%; with six examples, we achieve an accuracy of 98\%.

Making Templates Rotationally Invariant: An Application to Rotated Digit Recognition -- Shumeet Baluja
This paper describes a simple and efficient method to make template-based object classification invariant to in-plane rotations. The task is divided into two parts: orientation discrimination and classification. The key idea is to perform the orientation discrimination before the classification. This can be accomplished by hypothesizing, in turn, that the input image belongs to each class of interest. The image can then be rotated to maximize its similarity to the training images in each class (these contain the prototype object in an upright orientation). This process yields a set of images, at least one of which will have the object in an upright position. The resulting images can then be classified by models which have been trained with only upright examples. This approach has been successfully applied to two real-world vision-based tasks: rotated handwritten digit recognition and rotated face detection in cluttered scenes.

Where Does the Population Vector of Motor Cortical Cells Point during Reaching Movements? -- Pierre Baraduc, Emmanuel Guigon, Yves Burnod
Visually-guided arm reaching movements are produced by distributed neural networks within parietal and frontal regions of the cerebral cortex. Experimental data indicate that (1) single neurons in these regions are broadly tuned to parameters of movement; (2) appropriate commands are elaborated by populations of neurons; (3) the coordinated action of neurons can be visualized using a neuronal population vector (NPV). However, the NPV provides only a rough estimate of movement parameters (direction, velocity) and may even fail to reflect the parameters of movement when arm posture is changed. We designed a model of the cortical motor command to investigate the relation between the desired direction of the movement, the actual direction of movement and the direction of the NPV in motor cortex. The model is a two-layer self-organizing neural network which combines broadly-tuned (muscular) proprioceptive and (cartesian) visual information to calculate (angular) motor commands for the initial part of the movement of a two-link arm. The network was trained by motor babbling in 5 positions. Simulations showed that (1) the network produced appropriate movement direction over a large part of the workspace; (2) small deviations of the actual trajectory from the desired trajectory existed at the extremities of the workspace; (3) these deviations were accompanied by large deviations of the NPV from both trajectories. These results suggest the NPV does not give a faithful image of cortical processing during arm reaching movements.

Tractable Variational Structures for Approximating Graphical Models -- David Barber, Wim Wiegerinck
Graphical models provide a broad framework for probabilistic inference, with application to such diverse areas as speech recognition (Hidden Markov Models), medical diagnosis (Belief Networks) and artificial intelligence (Boltzmann Machines). However, the computing time is typically exponential in the number of nodes in the graph. We present a general framework for a class of approximating models, based on the Kullback-Leibler divergence between an approximating graph and the original graph. In addition to unifying the node-elimination and structural variational frameworks, we provide generalised mean-field equations for both directed and undirected graphs. Simulation results on a small benchmark problem suggest that this method compares favourably against others previously reported in the literature.

Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks -- Peter L. Bartlett, Vitaly E. Maiorov, Ron Meir
We compute upper and lower bounds on the VC dimension of feedforward networks of units with piecewise polynomial activation functions. We show that if the number of layers is fixed, then the VC dimension grows as $W\log W$, where $W$ is the number of parameters in the network. This result stands in opposition to the case where the number of layers is unbounded, in which case the VC dimension grows as $W^2$.

Semi-Supervised Support Vector Machines -- Kristin Bennett, Ayhan Demiriz
We introduce a semi-supervised support vector machine method (SSSVM). Given a training set of labeled data and a working set of unlabeled data, SSSVM constructs a support vector machine using both the training and working sets. We use SSSVM to solve the overall risk minimization problem (ORM) posed by Vapnik. The ORM problem is to estimate the value of a classification function at the given points in the working set. This contrasts with the standard learning problem of empirical risk minimization which estimates the classification function at all possible values. We propose a general SSSVM model that minimizes both the misclassification error and the function capacity based on all the available data. We show how the SSSVM model for 1-norm linear support vector machines can be converted to a mixed-integer program (MIP) and then solved exactly using integer programming. Results of SSSVM-MIP and the standard ERM approach are compared on eleven data sets. Our computational results support the statistical learning theory results showing that incorporating working data improves generalization when insufficient training information is available. In every case, SSSVM either improved or showed no significant difference in generalization compared to the standard approach.

Evidence for a Forward Dynamics Model in Human Adaptive Motor Control -- Nikhil Bhushan, Reza Shadmehr
Based on computational principles, the concept of an internal model for adaptive motor control has been divided into a forward and an inverse model. However, there is as yet little evidence that learning control by the CNS is through adaptation of one or the other. Here we demonstrate that three types of adaptive control processes are possible, based on only one or a combination of inverse and forward models. For fast reaching movements of the hand in novel force fields, the three processes result in hand trajectories that differ from each other. We show that the process involving a combination of a forward and an inverse model for adaptive motor control results in a control system where key characteristics of performance match the kinematics of human subjects. In contrast, adaptive control systems that rely only on an inverse model or only on a forward model fail to produce the kinematic features observed in the subjects. Our results provide strong evidence that learning control of novel dynamics is via adaptation of the forward model.

Lazy Learning Meets the Recursive Least Squares Algorithm -- Mauro Birattari, Gianluca Bontempi, Hugues Bersini
Lazy learning is a memory-based technique that, once a query is received, extracts a prediction interpolating locally the neighboring examples of the query which are considered relevant according to a distance measure. In this paper we propose a data-driven method to select on a query-by-query basis the optimal number of neighbors to be considered for each prediction. As an efficient way to identify and validate local models, the recursive least squares algorithm is introduced in the context of local approximation and lazy learning. Furthermore, beside the winner-takes-all strategy for model selection, a local combination of the most promising models is explored. The method proposed is tested on six different datasets and compared with a state-of-the-art approach.

Bayesian PCA -- Christopher M. Bishop
The technique of principal component analysis (PCA) has recently been expressed as the maximum likelihood solution for a generative latent variable model. In this paper we use this probabilistic reformulation as the basis for a Bayesian treatment of PCA. Our key result is that effective dimensionality of the latent space (equivalent to the number of retained principal components) can be determined automatically as part of the Bayesian inference procedure. An important application of this framework is to mixtures of probabilistic PCA models, in which each component can determine its own effective complexity.

Learning Multi-class Dynamics -- Andrew Blake, Ben North, Michael Isard
A probabilistic algorithm is presented for learning the dynamics of complex motions. Complexity is taken into account by allowing multiple classes of motion, and an Auto-Regressive Process (ARP) associated with each class. Training sets need incorporate only indirect observations of motion, and this allows for sensory noise. A learning algorithm is presented for this problem based on propagation of random samples. Experiments have been performed with visually observed juggling, and plausible dynamical models are found to emerge from the learning process.

Approximate Learning of Dynamic Models -- Xavier Boyen, Daphne Koller
Inference is a key component in learning probabilistic models from partially observable data. When learning temporal models, each of the many inference phases requires a complete traversal over a potentially very long sequence; furthermore, the data structures propagated in this procedure can be extremely large, making the whole process very demanding. In a previous paper, we describe an approximate inference algorithm for monitoring stochastic processes, and prove bounds on its approximation error. In this paper, we apply this algorithm as an approximate forward propagation step in an EM algorithm for learning temporal Bayesian networks. We also provide a related approximation for the backward step, and prove error bounds for the combined algorithm. We show that EM using our inference algorithm is much faster than EM using exact inference, with no degradation of the quality of the learned model. We then extend our analysis to the online learning task, showing a bound on the error resulting from restricting attention to a small window of observations. We present an online EM learning algorithm for dynamic systems, and show that it learns much faster than standard offline EM.

Structure Discovery via Entropy Minimization -- Matthew Brand
We introduce a novel framework for simultaneous structure and parameter learning in hidden-variable conditional probability models, based on an entropic prior and a solution for its maximum a posteriori (MAP) estimator. The MAP estimate minimizes uncertainty in all respects: cross-entropy between model and data, entropy of the model, entropy of the data's descriptive statistics. Iterative estimation extinguishes weakly supported parameters, compressing and sparsifying the model. Trimming operators accelerate this process by removing excess parameters and, unlike most pruning schemes, guarantee an increase in posterior probability. Entropic estimation takes a large random model and simplifies it, inducing the structure of relations between hidden and observed variables. Applied to hidden Markov models (HMMs), it finds a concise finite-state machine representing the hidden structure of a signal. We entropically model music, handwriting, and video time-series, and show that the resulting models are highly concise, structured, predictive, and interpretable--surviving states tend to be highly correlated with meaningful partitions of the data, while surviving transitions provide a low-perplexity model of the signal dynamics.

Fisher Scoring and a Mixture of Modes Approach for Approximate Inference and Learning in Nonlinear State Space Models -- Thomas Briegel, Volker Tresp
We present Monte-Carlo generalized EM equations for learning in nonlinear state space models. The difficulties lie in the Monte-Carlo E-step which consists of sampling from the posterior distribution of the hidden variables given the observations. The new idea presented in this paper is to generate samples from a Gaussian approximation to the true posterior from which it is easy to obtain independent samples. The parameters of the Gaussian approximation are either derived from the extended Kalman filter or the Fisher Scoring algorithm. In case the posterior density is multimodal we propose to approximate the posterior by a sum of Gaussians (mixture of modes approach). We show that sampling from the approximate posterior densities obtained by the above algorithms leads to better models than using point estimates for the hidden states. In our experiment, the Fisher Scoring algorithm obtained a better approximation of the posterior mode than the extended Kalman filter. For a multimodal distribution, the mixture of modes approach gave superior results.

Non-linear PI Control Inspired by Biological Control Systems -- Lyndon J. Brown, Gregory E. Gonye, James S. Schwaber
A non-linear modification to PI control has been developed which is appropriate for plants requiring exact set-point matching and disturbance attenuation in the presence of infrequent step changes in load disturbances or set-point. This control algorithm, labeled PII (proportional with intermittent integral), is motivated by a model of a signal transduction pathway active in mammalian blood pressure regulation. The proportional aspect of the controller is independently designed to be a disturbance attenuator, and set-point matching will be achieved by intermittently invoking an integral controller. The mechanisms observed in the biological control model are used to tie these controllers together. Improved performance over PI control is shown on a model of cyclopentenol production. PI control fails at the peak operating point, as a sign change in plant gain results in an unstable closed-loop system. Application of this new approach to this problem results in stable exact set-point matching for achievable set-points.

Optimizing Admission Control while Ensuring Quality of Service in Multimedia Networks via Reinforcement Learning -- Timothy X Brown, Hui Tong, Satinder Singh
This paper examines the application of reinforcement learning to a telecommunications networking problem. The problem requires that revenue be maximized while simultaneously meeting a quality of service constraint that forbids entry into certain states. We present a general solution to this multi-criteria problem that is able to earn significantly higher revenues than alternatives.

#### C

Analog VLSI Cellular Implementation of the Boundary Contour System -- Gert Cauwenberghs, James Waskiewicz
We present an analog VLSI cellular architecture implementing a simplified version of the Boundary Contour System (BCS) for real-time focal-plane image processing. Inspired by neuromorphic models across several layers of visual cortex, the design integrates in each pixel the functions of simple cells, complex cells, hyper-complex cells, and bipole cells, each tuned to one of three orientations and interconnected on a hexagonal grid. Analog current-mode CMOS circuits are used throughout to perform edge detection, local inhibition, directionally selective long-range diffusive kernels, and renormalizing global gain control. Experimental results from a fabricated $12 \times 10$ pixel prototype in 1.2~$\mu$m CMOS technology demonstrate the robustness of the BCS model in selecting image contours in a cluttered and noisy background.

Complex Cells as Cortically Amplified Simple Cells -- Frances S. Chance, Sacha B. Nelson, L. F. Abbott
Complex cells are selective for stimulus orientation and spatial frequency (like simple cells) but not for spatial phase. While complex cell responses to drifting gratings are relatively unmodulated, the temporal frequency of complex cell responses to counterphase gratings is twice that of the stimulus. We show that complex cell responses can arise through recurrent interactions within a network of cortical cells. With weak or absent recurrent connections, the model cells in the network show responses similar to simple cells. Strong recurrent connections result in complex cell responses. Our results suggest that simple and complex cells represent the weakly and strongly coupled regimes of the same basic neural circuit.

Invited Talk: Statistical Natural Language Processing: Better Living Through Floating-point Numbers -- Eugene Charniak
Over the last ten years or so the field of natural language processing (NLP) has become increasingly dominated by corpus-based methods and statistical techniques. In this research, problems are attacked by collecting statistics from a corpus (sometimes marked with correct answers, sometimes not) and then applying the statistics to new instances of the task. In this talk we give an overview of statistical techniques in many areas of NLP such as: parsing (finding the correct phrase structure for a sentence), lexical semantics (learning meanings and other properties of words and phrases from text), anaphora resolution (determining the intended antecedent of pronouns, and noun phrases in general), word-sense disambiguation (finding the correct sense in context of a word with multiple meanings) etc. As a general rule, corpus-based, and particularly statistical techniques outperform hand-crafted systems, and the rate of progress in the field is still quite high.

Neuronal Regulation Implements Efficient Synaptic Pruning -- Gal Chechik, Isaac Meilijson, Eytan Ruppin
Human and animal studies show that mammalian brain undergoes massive synaptic pruning during childhood, removing about half of the synapses until puberty. We have previously shown that maintaining network memory performance while synapses are deleted, requires that synapses are properly modified and pruned, removing the weaker synapses. We now show that neuronal regulation, a mechanism recently observed to maintain the average neuronal input field, results in weight-dependent synaptic modification. Under the correct range of the degradation dimension and synaptic upper bound, neuronal regulation removes the weaker synapses and judiciously modifies the remaining synapses. It implements near optimal synaptic modification, and maintains the memory performance of a network undergoing massive synaptic pruning. Thus, this paper shows that in addition to the known effects of Hebbian changes, neuronal regulation may play an important role in the self-organization of brain networks during development.

Perceiving Without Learning: from Spirals to Inside/Outside Relations -- Ke Chen, DeLiang L. Wang
As a benchmark task, the spiral problem is well known in neural networks. Unlike previous work that emphasizes learning, we approach the problem from a generic perspective that does not involve learning. We point out that the spiral problem is intrinsically connected to the inside/outside problem. A generic solution to both problems is proposed based on oscillatory correlation using a time delay network. Our simulation results are qualitatively consistent with human performance, and we interpret human limitations in terms of synchrony and time delays, both biologically plausible. As a special case, our network without time delays can always distinguish these figures regardless of shape, position, size, and orientation. We conjecture that visual perception will be effortful if local activation cannot be rapidly propagated, as synchrony would not be established in the presence of time delays.

A Model for Associative Multiplication -- G. Bjorn Christianson, Suzanna Becker
Despite the fact that arithmetic is based on only a few hundred basic facts and some simple algorithms, humans have a difficult time mastering the subject, and make mistakes even with experience. Associative multiplication, the process of doing multiplication by memory without the use of rules or algorithms, is especially problematic. Humans have certain characteristic errors in performing associative multiplications, both in the type of error and in the error frequency. We propose a model for the process of associative multiplication, and compare the results with both data from normal humans and from the model proposed by Anderson et al.

A Micropower CMOS Adaptive Amplitude and Shift Invariant Vector Quantiser -- Richard Coggins, Raymond Wang, Marwan Jabri
In this paper we describe the architecture, implementation and experimental results for an Intracardiac Electrogram (ICEG) classification and compression chip. The chip processes and vector-quantises 30~dimensional analogue vectors while consuming a maximum of 2.5~$\mu$W power for a heart rate of 60~beats per minute (1~vector per second) from a 3.3~V supply. This represents a significant advance on previous work which achieved ultra low power supervised morphology classification since the template matching scheme used in this chip enables unsupervised blind classification of abnormal rhythms and the computational support for low bit rate data compression. The adaptive template matching scheme used is tolerant to amplitude variations, and inter- and intra-sample time shifts.

Dynamics of Supervised Learning with Restricted Training Sets -- A.C.C. Coolen, David Saad
We study the dynamics of supervised learning in layered neural networks, in the regime where the size p of the training set is proportional to the number N of inputs. Here the local fields are no longer described by Gaussian distributions. We use dynamical replica theory to predict the evolution of macroscopic observables, including the relevant error measures, incorporating the old formalism in the limit where p/N goes to infinity.

Adding Constrained Discontinuities to Gaussian Process Models of Wind Fields -- Dan Cornford, Ian T. Nabney, Christopher K. I. Williams
Gaussian Processes provide good prior models for spatial data, but can be too smooth. In many physical situations there are discontinuities along bounding surfaces, for example fronts in near-surface wind fields. We describe a modelling method for such a constrained discontinuity and demonstrate how to infer the model parameters in wind fields with MCMC sampling.

A Phase Space Approach to Minimax Entropy Learning and the Minutemax Approximation -- James M. Coughlan, A.L. Yuille
There has been much recent work on learning probability distributions on images and on image statistics. We observe that the mapping from images to statistics is many-to-one and show it can be quantified by a phase space factor. This phase space approach throws light on the Minimax Entropy technique for learning Gibbs distributions on images with potentials derived from image statistics and elucidates the ambiguities that are inherent to determining the potentials. In addition, it shows that if the phase factor can be approximated by an analytic distribution then the computation for Minimax entropy learning can be vastly reduced. An illustration of this concept, using a Gaussian to approximate the phase factor, leads to a new algorithm calledMinutemax," which gives a good approximation to the results of Zhu and Mumford in just seconds of CPU time. The phase space approach also gives insight into the multi-scale potentials found by Zhu and Mumford and suggest that the forms of the potentials are influenced greatly by phase space considerations. Finally, we prove that probability distributions learned in feature space alone are equivalent to Minimax Entropy learning with a multinomial approximation of the phase factor.

Dynamically Adapting Kernels in Support Vector Machines -- Nello Cristianini, Colin Campbell, John Shawe-Taylor
The kernel-parameter is one of the few tunable parameters in Support Vector machines, and controls the complexity of the resulting hypothesis. The choice of its value amounts to model selection, and is usually performed by means of a validation set. We present an algorithm which can automatically perform model selection and learning with no additional computational cost and with no need of a validation set. Theoretical results motivating this approach providing upper bounds on the generalisation error and experimental results confirming its validity are presented.

#### D

Facial Memory Is Kernel Density Estimation (Almost) -- Matthew N. Dailey, Garrison W. Cottrell, Thomas A. Busey
We compare the ability of three exemplar-based memory models, each using three different face stimulus representations, to account for the probability a human subject responded old'' in an old/new facial memory experiment. The models are 1) the Generalized Context Model, 2) SimSample, a probabilistic sampling model, and 3) DBM, a novel model related to kernel density estimation that explicitly encodes stimulus distinctiveness. The representations are 1) positions of stimuli in MDS face space,'' 2) projections of test faces onto the eigenfaces of the study set, and 3) a representation based on response to a grid of Gabor filter jets. Of the 9 model/representation combinations, only the distinctiveness model in MDS space predicts the observed morph familiarity inversion'' effect, in which the subjects' false alarm rate for morphs between similar faces is higher than their hit rate for many of the studied faces. This evidence is consistent with the hypothesis that human memory for faces is a kernel density estimation task, with the caveat that distinctive faces require larger kernels than do typical faces.

Example Based Image Synthesis of Articulated Figures -- Trevor Darrell
We present a method for learning complex appearance mappings, such as occur with images of articulated objects. Traditional interpolation networks fail on this case since appearance is not necessarily a smooth function nor a linear manifold for articulated objects. We define an appearance mapping from examples by constructing a set of independently smooth interpolation networks; these networks can cover overlapping regions of parameter space. A set growing procedure is used to find example clusters which are well-approximated within their convex hull; interpolation then proceeds only within these sets of examples. With this method physically valid images are produced even in regions of parameter space where nearby examples have different appearances. We show results generating both simulated and real arm images.

Heeger's Normalization, Line Attractor Networks and Ideal Observers -- Sophie Deneve, Alexandre Pouget, Peter Latham
Gain control by divisive inhibition, a.k.a.\ Heeger's normalization, seems to be a general mechanism throughout the visual cortex. We explore in this study the statistical properties of this normalization in the presence of noise. Using simulations, we show that Heeger's normalization is a close approximation to a maximum likelihood estimator, which, in the context of population coding, is the same as an ideal observer. We also demonstrate analytically that this is a general property of a large class of nonlinear recurrent networks with line attractors. Our work suggests that Heeger's normalization plays a critical role in noise filtering, and that every cortical layer may be an ideal observer of the activity in the preceding layer.

Vertex Identification in High Energy Physics Experiments -- Gideon Dror, Halina Abramowicz, David Horn
In High Energy Physics experiments one has to sort through a high flux of events, at a rate of tens of MHz, and select the few that are of interest. In making this decision one relies on the location of the vertex where the interaction that led to the event, took place. Here we present a solution to this problem, based on two feedforward neural networks with fixed architectures, whose parameters are chosen so as to obtain a high accuracy. The system is tested on many data sets, and is shown to perform better than commonly used algorithms.

Phase Diagram and Storage Capacity of Sequence Storing Neural Networks -- A. During, A.C.C. Coolen, D. Sherrington
We solve the dynamics of Hopfield-type neural networks which store sequences of patterns, close to saturation. The asymmetry of the interaction matrix in such models leads to violation of detailed balance, ruling out an equilibrium statistical mechanical analysis. Using generating functional methods we derive exact closed equations for dynamical order parameters, viz. the sequence overlap and correlation and response functions, in the limit of an infinite system size. We calculate the time translation invariant solutions of these equations, describing stationary limit--cycles, which leads to a phase diagram. The effective retarded self-interaction usually appearing in symmetric models is here found to vanish, which causes a significantly enlarged storage capacity of $\alpha_c=0.269$, compared to $\alpha_c=0.139$ for Hopfield networks storing static patterns. Our results are tested against extensive computer simulations and excellent agreement is found.

#### E

Optimizing Correlation Algorithms for Hardware-based Transient Classification -- R. Timothy Edwards, Gert Cauwenberghs, Fernando Pineda
The performance of dedicated VLSI neural processing hardware depends critically on the design of the implemented algorithms. We have previously proposed an algorithm for acoustic transient classification. Having implemented and demonstrated this algorithm in a mixed-mode architecture, we now investigate variants on the algorithm, using time and frequency channel differencing, input and output normalization, and schemes to binarize and train the template values, with the goal of achieving optimal classification performance for the chosen hardware.

VLSI Implementation of Motion Centroid Localization for Autonomous Navigation -- Ralph Etienne-Cummings, Mohammed Abdel Ghani, Viktor Gruev
A circuit for fast, compact and low-power focal-plane motion centroid localization is presented. This chip, which uses mixed signal CMOS components to implement photodetection, edge detection, ON-set detection and centroid localization, models the retina and superior colliculus. The centroid localization circuit uses time-windowed asynchronously triggered row and column address events and two linear resistive grids to provide the analog coordinates of the motion centroid. This VLSI chip is used to realize fast lightweight autonavigating vehicles. The obstacle avoiding line-following algorithm is discussed.

#### F

Finite-dimensional Approximation of Gaussian Processes -- Giancarlo Ferrari-Trecate, Christopher K. I. Williams, Manfred Opper
Gaussian process (GP) prediction suffers from $O(n^3)$ scaling with the data set size $n$. By using a finite-dimensional basis to approximate the GP predictor, the computational complexity can be reduced. We derive optimal finite-dimensional predictors under a number of assumptions, and show the superiority of these predictors over the Projected Bayes Regression method (which is asymptotically optimal). We also show how to calculate the minimal model size for a given $n$. The calculations are backed up by numerical experiments.

Using Statistical Properties of a Labelled Visual World to Estimate Scenes -- William T. Freeman, Egon C. Pasztor
A goal of low-level vision algorithms is to infer underlying scene information (such as velocities, shapes, or reflectances) from an image or set of images. We introduce a simple, training-based method: learn the probability of every scene interpretation for any local image patch, and the probability that any local scene neighbors another. The former probabilities give scene estimates from local image data, and the latter allow the local estimates to propagate. We use Markov networks to optimally propagate the local information, represented as mixtures of gaussians. We demonstrate the technique for two problems: motion estimation, and the disambiguation of shading and reflectance. We find that the local statistical information, without added problem domain knowledge, finds good global scene interpretations.

Global Optimisation of Neural Network Models Via Sequential Sampling -- Nando de Freitas, Mahesan Niranjan, Arnaud Doucet, Andrew Gee
We propose a novel strategy for training neural networks using sequential Monte Carlo algorithms. In particular, we discuss an efficient hybrid gradient descent/sampling importance resampling algorithm that we developed recently, under the name of hybrid SIR. This global optimisation approach allows us to learn the probability distribution of the network weights and outputs in a sequential framework. It is well suited to applications involving on-line, nonlinear, non-Gaussian or non-stationary signal processing. We show how the new algorithm outperforms extended Kalman filter training on several simulated examples and on a real application involving the pricing of option contracts traded in financial markets.

Efficient Bayesian Parameter Estimation in Large Discrete Domains -- Nir Friedman, Yoram Singer
In this paper we examine the problem of estimating the parameters of a multinomial distribution over a large number of discrete outcomes, most of which do not appear in the training data. We analyze this problem from a Bayesian perspective and develop a hierarchical prior that incorporates the assumption that the observed outcomes constitute only a small subset of the possible outcomes. We show how to efficiently perform exact inference with this form of hierarchical prior and compare our method to standard approaches to demonstrate its merits.

#### G

Synergy and Redundancy Among Brain Cells of Behaving Monkeys -- Itay Gat, Naftali Tishby
Determining the relationship between the activity of a single nerve cell to that of an entire population is a fundamental question that bears on the basic neural computation paradigms. In this paper we apply an information theoretic approach to quantify the level of cooperative activity among cells in a behavioral context. It is possible to discriminate between synergetic activity of the cells vs. redundant activity, depending on the difference between the information they provide when measured jointly and the information they provide independently. We define a synergy value that is positive in the first case and negative in the second and show that the synergy value can be measured by detecting the behavioral mode of the animal from simultaneously recorded activity of the cells. We observe that among cortical cells positive synergy can be found, while cells from the basal ganglia, active during the same task, do not exhibit similar synergetic activity.

A Randomized Algorithm for Pairwise Clustering -- Yoram Gdalyahu, Daphna Weinshall, Michael Werman
We present a stochastic clustering algorithm based on pairwise similarity of datapoints. Our method extends existing deterministic methods, including agglomerative algorithms, min-cut graph algorithms, and connected components, thus it provides a common framework for all these methods. Our graph-based method differs from existing stochastic methods which are based on analogy to physical systems. The stochastic nature of our method makes it more robust against noise, including accidental similarity relations and small spurious clusters. We demonstrate the superiority of our algorithm using an example with 3 spiraling bands and a lot of noise.

Classification with Linear Threshold Functions and the Linear Loss -- Claudio Gentile, Manfred K. Warmuth
We describe a method for proving relative loss bounds for on-line learning algorithms that use linear threshold functions for classifying the examples. For instance the Perceptron algorithm and Winnow are such learning algorithms. For classification problems the discrete loss is used, i.e., the total number of prediction mistakes. We introduce a continuous loss function called the linear loss. Our method consists of first proving bounds w.r.t. the linear loss and then converting these bounds to the discrete loss.

Invited Talk: Things That Think -- Neil Gershenfeld
At the rate of current progress, VLSI scaling will reach fundamental scaling limits early in the next century. Beyond performance, many of the most severe constraints are increasingly associated with the cost and packaging of bringing processing to where it is needed. An alternative to increasingly heroic engineering means to meet these demands is to become smarter in how we take advantage of the inherent computational capabilities of simple physical systems. I will illustrate this promise with a few examples, including molecular quantum computation, coding in dissipative dynamical systems, and the response of driven musical instruments.

Learning Nonlinear Stochastic Dynamics Using the Generalized EM Algorithm -- Zoubin Ghahramani, Sam T. Roweis
The Expectation--Maximization (EM) algorithm is an iterative procedure for maximum likelihood parameter estimation from data sets with missing or hidden variables (Dempster, Laird and Rubin, 1977). It has been applied to system identification in linear stochastic state-space models, where the state variables are hidden from the observer and both the state and the parameters of the model have to be estimated simultaneously (Shumway and Stoffer, 1982). We present a generalization of the EM algorithm for parameter estimation in nonlinear dynamical systems. The expectation'' step makes use of Extended Kalman Smoothing to estimate the state, while the maximization'' step re-estimates the parameters using these uncertain state estimates. In general, the nonlinear maximization step is difficult because it requires integrating out the uncertainty in the states. However, if Gaussian radial basis function (RBF) approximators are used to model the nonlinearities, the integrals become tractable and the maximization step can be solved via systems of linear equations.

Classification on Pairwise Proximity Data -- Thore Graepel, Ralf Herbrich, Peter Bollmann-Sdorra, Klaus Obermayer
We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat's cerebral cortex. They show better performance than K-nearest-neighbor classification.

Lasso is Equivalent to Adaptive Quadratic Penalization -- Yves Grandvalet, St{\'e}phane Canu
Adaptive ridge is a special form of ridge regression, balancing the quadratic penalization on each parameter of the model. We show the equivalence between adaptive ridge and lasso (least absolute shrinkage and selection operator). This equivalence states that both procedures produce the same estimate. Least absolute shrinkage can thus be viewed as a particular quadratic penalization. From this observation, we derive a fixed-point algorithm to compute the lasso solution. We finally present a series of possible applications of this type of algorithm in regression problems: kernel regression, additive modeling and neural net training.

Familiarity Discrimination of Radar Pulses -- Eric Granger, Stephen Grossberg, Mark A. Rubin, William W. Streilein
The ARTMAP-FD neural network performs both identification (placing test patterns in classes encountered during training) and familiarity discrimination (judging whether a test pattern belongs to any of the classes encountered during training). The performance of ARTMAP-FD is tested on radar pulse data obtained in the field, and compared to that of the nearest-neighbor-based NEN algorithm and to a $k > 1$ extension of NEN.

Fast Neural Network Emulation of Physics-Based Models for Computer Animation -- Radek Grzeszczuk, Demetri Terzopoulos, Geoffrey Hinton
Computer animation through the numerical simulation of physics-based graphics models offers unsurpassed realism, but it can be computationally demanding. This paper demonstrates the possibility of replacing the numerical simulation of nontrivial dynamic models with a dramatically more efficient NeuroAnimator'' that exploits neural networks. NeuroAnimators are automatically trained off-line to emulate physical dynamics through the observation of physics-based models in action. Depending on the model, its neural network emulator can yield physically realistic animation one or two orders of magnitude faster than conventional numerical simulation. We demonstrate NeuroAnimators for a variety of physics-based models.

#### H

A Neuromorphic Monaural Sound Localizer -- John G. Harris, Chiang-Jung Pu, Jose C. Principe
We describe the first single microphone sound localization system and its inspiration from theories of human monaural sound localization. Reflections and diffractions caused by the external ear (pinna) allow humans to estimate sound source elevations using only one ear. Our single microphone localization model relies on a specially shaped reflecting structure that serves the role of the pinna. Specially designed analog VLSI circuitry uses echo-time processing to localize the sound. A CMOS integrated circuit has been designed, fabricated, and successfully demonstrated on actual sounds.

Multiple Paired Forward-Inverse Models for Human Motor Learning and Control -- Masahiko Haruno, Daniel M. Wolpert, Mitsuo Kawato
Humans demonstrate a remarkable ability to generate accurate and appropriate motor behavior under many different and often uncertain environmental conditions. This paper describes a new modular approach to human motor learning and control, based on multiple pairs of inverse (controller) and forward (predictor) models. This architecture simultaneously learns the multiple inverse models necessary for control as well as how to select the inverse models appropriate for a given environment. Simulations of object manipulation demonstrates the ability to learn multiple objects, appropriate generalization to novel objects and the inappropriate activation of motor programs based on visual cues, followed by on-line correction, seen in the size-weight illusion''.

GLS: a Hybrid Classifier System Based on POMDP Research -- Akira Hayashi, Nobuo Suematsu
Classifier systems are now viewed as disappointing because of their problems such as the rule strength vs rule set performance problem and the credit assignment problem. In order to solve these problems, we have developed a hybrid classifier system: GLS (Generalization Learning System). In designing GLS, we view classifier systems as model free learning in POMDPs and take a hybrid approach to finding the best generalization, given the total number of rules. GLS uses the policy improvement procedure by Jaakkola et al. for the optimal stochastic policy when a set of rule conditions is given. GLS uses GA to search for the best set of rule conditions.

Visualizing Group Structure -- Marcus Held, Jan Puzicha, Joachim M. Buhmann
Cluster analysis is a fundamental principle in exploratory data analysis, providing the user with a description of the group structure of given data. A key problem in this context is the interpretation and visualization of clustering solutions in high--dimensional or abstract data spaces. In particular, fuzzy or probabilistic descriptions of the group structure, essential to capture inter--cluster relations, are hardly assessable by simple inspection of the probabilistic assignment variables. We present a novel approach for the visualization of probabilistic group structure based on a statistical model of the object assignments which have been observed or estimated by a probabilistic clustering procedure. The objects or data points are embedded in a low dimensional Euclidean space by approximating the observed data statistics with a Gaussian mixture model. The algorithm provides a new approach to the visualization of the inherent structure for a broad variety of data types, e.g.~histogram data, proximity data and co--occurrence data. To demonstrate the power of the approach, histograms of textured images are visualized as a large--scale data mining application.

Unsupervised and Supervised Clustering: the Mutual Information Between Parameters and Observations -- Didier Herschkowitz, Jean-Pierre Nadal
Recent works in parameter estimation and neural coding have demonstrated that optimal performance is related to the mutual information between parameters and data. We study this mutual information for a family of supervised and unsupervised learning tasks. More precisely we consider the case where the dependency in the parameter of the conditional probability distribution of each observation is through their scalar product only, the parameter and the observations being vectors in a possibly high dimensional space. \par We derive exact bounds and exact asymptotic behaviours for the mutual information as function of the data size and of some properties of the probability of the data given the parameter. We study also the behaviour of the mutual information as predicted by replica calculations. Finally we discuss the universal properties of the mutual information especially in the limit of large data size.

An Integrated Vision Sensor for the Computation of Optical Flow Singular Points -- Charles M. Higgins, Christof Koch
A robust, integrative algorithm is presented for computing the position of the focus of expansion or axis of rotation (the singular point) in optical flow fields such as those generated by self-motion. Measurements are shown of a fully parallel CMOS analog VLSI motion sensor array which computes the direction of local motion (sign of optical flow) at each pixel and can directly implement this algorithm. The flow field singular point is computed in real time with a power consumption of less than 2 mW. Computation of the singular point for more general flow fields requires measures of field expansion and rotation, which it is shown can also be computed in real-time hardware, again using only the sign of the optical flow field. These measures, along with the location of the singular point, provide robust real-time self-motion information for the visual guidance of a moving platform such as a robot.

Source Separation as a By-Product of Regularization -- Sepp Hochreiter, Juergen Schmidhuber
This paper reveals a previously ignored connection between two important fields: regularization and independent component analysis (ICA). We show that at least one representative of a broad class of algorithms (regularizers that reduce network complexity) extracts independent features as a by-product. This algorithm is Flat Minimum Search (FMS), a recent general method for finding low-complexity networks with high generalization capability. FMS works by minimizing both training error and required weight precision. According to our theoretical analysis, the hidden layer of an FMS-trained autoassociator attempts at coding each input by a sparse code with as few simple features as possible. In experiments, the method extracts optimal codes for difficult versions of the noisy bars'' benchmark problem by separating the underlying sources, whereas ICA and PCA fail. Real world images are coded with fewer bits per pixel than by ICA or PCA.

Learning from Dyadic Data -- Thomas Hofmann, Jan Puzicha, Michael Jordan
Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many application ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper we present a systematic, domain-independent framework of learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.

Call-based Fraud Detection in Mobile Communication Networks using a Hierarchical Regime-Switching Model -- Jaakko Hollmen, Volker Tresp
Fraud causes substantial losses to telecommunication carriers. Detection systems which automatically detect illegal use of the network can be used to alleviate the problem. Previous approaches worked on features derived from the call patterns of individual users. In this paper we present a call-based detection system based on a hierarchical regime-switching model. The detection problem is formulated as an inference problem on the regime probabilities. Inference is implemented by applying the junction tree algorithm to the underlying graphical model. The dynamics are learned from data using iterative maximum likelihood EM-learning. The methods are assessed using fraud data from a real mobile communications network.

Banquet Talk: The Laws of the Web -- Bernardo A. Huberman
The World Wide Web has become in a short period a standard source of information for a large part of the world's population. Its exponential growth has transformed it into an ecology of knowledge in which a highly diverse quantity of information is linked in extremely complex and arbitrary fashion. In spite of the sheer size of the Web and its highly interactive nature, there exist strong statistical regularities that govern the way people use it and interact with each other. This talk will discuss the existence and nature of such laws and their experimental verification.

Graph Matching for Shape Retrieval -- Benoit Huet, Andrew D.J. Cross, Edwin R. Hancock
This paper describes a Bayesian graph matching algorithm for data-mining from large structural data-bases. The matching algorithm uses edge-consistency and node attribute similarity to determine the {\it a posteriori} probability of a query graph for each of the candidate matches in the data-base. The node feature-vectors are constructed by computing normalised histograms of pairwise geometric attributes. Attribute similarity is assessed by computing the Bhattacharyya distance between the histograms. Recognition is realised by selecting the candidate from the data-base which has the largest {\it a posteriori} probability. We illustrate the recognition technique on a data-base containing 2500 line patterns extracted from real-world imagery. Here the recognition technique is shown to significantly outperform a number of algorithm alternatives.

Sparse Code Shrinkage: Denoising by Maximum Likelihood Estimation -- Aapo Hyvarinen, Patrik Hoyer, Erkki Oja
Sparse coding is a method for finding a representation of data in which each of the components of the representation is only rarely significantly active. Such a representation is closely related to redundancy reduction and independent component analysis, and has some neurophysiological plausibility. In this paper, we show how sparse coding can be used for denoising. Using maximum likelihood estimation of nongaussian variables corrupted by gaussian noise, we show how to apply a shrinkage nonlinearity on the components of sparse coding so as to reduce noise. A theoretical analysis of the denoising capability of the method is given, and it is shown how to choose the optimal basis for sparse coding. Our method is closely related to the method of wavelet shrinkage, but has the important benefit over wavelet methods that both the features and the shrinkage parameters are estimated directly from the data.

#### I

Convergence of The Wake-Sleep Algorithm -- Shiro Ikeda, Shun-ichi Amari, Hiroyuki Nakahara
The W-S (Wake-Sleep) algorithm is a simple learning rule for the models with hidden variables. It is shown that this algorithm can be applied to a factor analysis model which is a linear version of the Helmholtz machine. But even for a factor analysis model, the general convergence is not proved theoretically. In this article, we describe the geometrical understanding of the W-S algorithm in contrast with the EM (Expectation-Maximization) algorithm and the {\em em} algorithm. As the result, we prove the convergence of the W-S algorithm for the factor analysis model. We also show the condition for the convergence in general models.

Learning to Find Pictures of People -- Sergey Ioffe, David Forsyth
Finding articulated objects, like people, in pictures presents a particularly difficult object recognition problem. We show how to find people by finding putative body segments, and then constructing assemblies of those segments that are consistent with the constraints on the appearance of a person that result from kinematic properties. Since a reasonable model of a person requires at least nine segments, it is not possible to present every group to a classifier. Instead, the search can be pruned by using projected versions of a classifier that accepts groups corresponding to people. We describe an efficient projection algorithm for one popular classifier, and demonstrate that our approach can be used to determine whether images of real scenes contain people.

Restructuring Sparse High Dimensional Data for Effective Retrieval -- Charles Lee Isbell, Jr., Paul Viola
The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words. Classically, documents and queries are represented as vectors of word counts. In its simplest form, relevance is defined to be the dot product between a document and a query vector--a measure of the number of common terms. A central difficulty in text retrieval is that the presence or absence of a word is not sufficient to determine relevance to a query. Linear dimensionality reduction has been proposed as a technique for extracting underlying structure from the document collection. In some domains (such as vision) dimensionality reduction reduces computational complexity. In text retrieval it is more often used to improve retrieval performance. We propose an alternative and novel technique that produces sparse representations constructed from sets of highly-related words. Documents and queries are represented by their distance to these sets, and relevance is measured by the number of common clusters. This technique significantly improves retrieval performance, is efficient to compute and shares properties with the optimal linear projection operator and the independent components of documents.

Attentional Modulation of Human Pattern Discrimination Psychophysics Reproduced by a Quantitative Model -- Laurent Itti, Jochen Braun, Dale K. Lee, Christof Koch
We previously proposed a quantitative model of early visual processing in primates, based on non-linearly interacting visual filters and statistically efficient decisions. We now use this model to interpret the observed modulation of a range of human psychophysical thresholds with and without focal visual attention. Our model -- calibrated by an automatic fitting procedure -- simultaneously reproduces thresholds for four classical pattern discrimination tasks, performed while attention was engaged by another concurrent task. Our model then predicts that the seemingly complex improvements of certain thresholds, which we observed when attention was fully available for the discrimination tasks, can best be explained by a strengthening of competition among early visual filters.

#### J

Exploiting Generative Models in Discriminative Classifiers -- Tommi Jaakkola, David Haussler
Generative probability models such as hidden Markov models provide a principled way of treating missing information and dealing with variable length sequences. On the other hand, discriminative methods such as support vector machines enable us to construct flexible decision boundaries and often result in classification performance superior to that of the model based approaches. An ideal classifier should combine these two complementary approaches. In this paper, we develop a natural way of achieving this combination by deriving kernel functions for use in discriminative methods such as support vector machines from generative probability models. We provide a theoretical justification for this combination as well as demonstrate a substantial improvement in the classification performance in the context of DNA and protein sequence analysis.

Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm -- Tony Jebara, Alex Pentland
We present the CEM ({\em Conditional Expectation Maximization}) algorithm as an extension of the EM ({\em Expectation Maximization}) algorithm to conditional density estimation under missing data. A bounding and maximization process is given to specifically optimize conditional likelihood instead of the usual joint likelihood. We apply the method to conditioned mixture models and use bounding techniques to derive the model's update rules. Monotonic convergence, computational efficiency and regression results superior to EM are demonstrated.

Analyzing and Visualizing Single-Trial Event-Related Potentials -- Tzyy-Ping Jung, Scott Makeig, Marissa Westerfield, Jeanne Townsend, Eric Courchesne, Terrence J. Sejnowski
Event-related potentials (ERPs), are portions of electroencephalographic (EEG) recordings that are both time- and phase-locked to experimental events. ERPs are usually averaged to increase their signal/noise ratio relative to non-phase locked EEG activity, regardless of the fact that response activity in single epochs may vary widely in time course and scalp distribution. This study applies a linear decomposition tool, Independent Component Analysis (ICA) (Lee et.\ al., in press), to multichannel single-trial EEG records to derive spatial filters that decompose single-trial EEG epochs into a sum of temporally independent and spatially fixed components arising from distinct or overlapping brain or extra-brain networks. Our results show that ICA can separate artifactual, stimulus-locked, response-locked, and non-event related background EEG activities into separate components, allowing (1) removal of pervasive artifacts of all types from single-trial EEG records, and (2) identification of both stimulus- and response-locked EEG components. Second, this study proposes a new visualization tool, the ERP image', for investigating variability in latencies and amplitudes of event-evoked responses in spontaneous EEG or MEG records. We show that sorting single-trial ERP epochs in order of a relevant response measure (e.g.\ reaction time) and plotting the potentials in 2-D clearly reveals underlying patterns of response variability linked to performance. These analysis and visualization tools appear broadly applicable to electrophyiological research on both normal and clinical populations.

#### K

The Belief in TAP -- Yoshiyuki Kabashima, David Saad
We show the similarity between belief propagation and TAP, for decoding corrupted messages encoded by Sourlas's method. The latter is a special case of the Gallager error-correcting code, where the code word comprises products of $K$ bits selected randomly from the original message. We examine the efficacy of solutions obtained by the two methods for various values of $K$ and show that solutions for $K \!\ge\! 3$ may be sensitive to the choice of initial conditions in the case of unbiased patterns. Good approximations are obtained generally for $K\!=\!2$ and for biased patterns in the case of $K\! \ge \!3$, especially when Nishimori's temperature is being used.

Optimizing Classifers for Imbalanced Training Sets -- Grigoris Karakoulas, John Shawe-Taylor
Following recent results showing the importance of the fat-shattering dimension in explaining the beneficial effect of a large margin on generalization performance, further work analysed how the margin observed on a test example could be used to give higher confidence of the classification accuracy. The current paper investigates the implications of these results for the case of imbalanced datasets and develops two approaches to setting the threshold. The approaches are incorporated into ThetaBoost, a boosting algorithm that we develop for dealing with unequal loss functions. The performance of ThetaBoost and the two approaches are also investigated experimentally..

Inference in Multilayer Networks via Large Deviation Bounds -- Michael Kearns, Lawrence Saul
We study probabilistic inference in large, layered belief networks represented as directed acyclic graphs. We show that the intractability of exact inference in such networks does not preclude their effective use. We give algorithms for approximate probabilistic inference that exploit averaging phenomena occurring at nodes with large numbers of parents. We show that these algorithms compute rigorous lower and upper bounds on marginal probabilities of interest, prove that these bounds become exact in the limit of large networks, and provide rates of convergence.

Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms -- Michael Kearns, Satinder Singh
In this paper, we address two issues of long-standing interest in the reinforcement learning literature. First, what kinds of performance guarantees can be made for Q-learning after only a finite number of actions? Second, what quantitative comparisons can be made between Q-learning and model-based (indirect) approaches, which use experience to estimate next-state distributions for off-line value iteration? We first show that both Q-learning and the indirect approach enjoy rather rapid convergence to the optimal policy as a function of the number of state transitions observed. In particular, on the order of only $(N\log(1/\epsilon)/\epsilon^2)(\log(N) + \log\log(1/\epsilon))$ transitions are sufficient for both algorithms to come within $\epsilon$ of the optimal policy, in an idealized model that assumes the observed transitions are well-mixed'' throughout an $N$-state MDP. Thus, the two approaches have roughly the same sample complexity. Perhaps surprisingly, this sample complexity is far less than what is required for the model-based approach to actually construct a good approximation to the next-state distribution. The result also shows that the amount of memory required by the model-based approach is closer to $N$ than to $N^2$. For either approach, to remove the assumption that the observed transitions are well-mixed, we consider a model in which the transitions are determined by a fixed, arbitrary exploration policy. Bounds on the number of transitions required in order to achieve a desired level of performance are then related to the stationary distribution and mixing time of this policy.

A Polygonal Line Algorithm for Constructing Principal Curves -- Bal\'azs K\'egl, Adam Krzy\.zak, Tam\'as Linder, Kenneth Zeger
Principal curves have been defined asself consistent'' smooth curves which pass through themiddle'' of a d-dimensional probability distribution or data cloud. They give a summary of the data and also serve as an efficient feature extraction tool. Recently, we have offered a new approach by defining principal curves as continuous curves of a given length which minimize the expected squared distance between the curve and points of the space randomly chosen according to a given distribution. The new definition made it possible to carry out a theoretical analysis of learning principal curves from training data. In our theoretical learning scheme we consider classes of polygonal lines with k-segments and with a given length, and the algorithm chooses a curve from this class which minimizes the average squared distance over n training points drawn independently. In this paper we propose a practical construction based on the new definition. In each iteration of the algorithm a new vertex is added to the polygonal line and the positions of the vertices are updated so that they minimize a penalized squared distance criterion. Simulation results demonstrate that the new algorithm compares favorably with previous methods both in terms of performance and computational complexity, and is more robust to varying data models.

Spike-Based Compared to Rate-Based Hebbian Learning -- Richard Kempter, Wulfram Gerstner, J. Leo van Hemmen
A correlation-based learning rule at the spike level is formulated, mathematically analyzed, and compared to learning in a firing-rate description. A differential equation for the learning dynamics is derived under the assumption that the time scales of learning and spiking can be separated. Using a linear Poissonian neuron model which receives time-dependent stochastic input we show that spike correlations on a millisecond time scale play indeed a role under reasonable neurobiological conditions. It is shown that correlations between input and output spikes tend to stabilize structure formation, provided that the form of the learning window is in accordance with Hebb's principle.

Exploring Unknown Environments with Real-Time Heuristic Search or Reinforcement Learning -- Sven Koenig
Learning Real-Time A* (LRTA*) is a popular control method that interleaves planning and plan execution. Advantages of LRTA* include: It allows for fine-grained control over how much planning to do between plan executions, is able to use heuristic knowledge to guide planning, can be interrupted at any state and resume execution at a different state, and improves its plan-execution time as it solves similar planning tasks, until its plan-execution time is optimal. LRTA* has been shown to solve search problems in known environments efficiently. In this paper, we apply LRTA* to the problem of getting to a given goal location in an initially unknown environment. Uninformed LRTA* with maximal lookahead always moves on a shortest path to the closest unvisited state, that is, to the closest potential goal state. This was believed to be a good exploration heuristic, but we show that it does not minimize the plan-execution time in the worst case compared to other uninformed exploration methods. This result is also of interest to reinforcement-learning researchers since many reinforcement learning methods use asynchronous dynamic programming, just like LRTA*.

#### L

Active Noise Canceling Using Analog Neuro-Chip with On-Chip Learning Capability -- Soo-Young Lee, Jung Wook Cho
A modular analogue neuro-chip set with on-chip learning capability is developed for active noise canceling. The analogue neuro-chip set incorporates the error backpropagation learning rule for practical applications, and allows pin-to-pin interconnections for multi-chip boards. The developed neuro-board demonstrated active noise canceling without any digital signal processor. Multi-path fading of acoustic channels, random noise, and nonlinear distortion of the loud speaker are compensated by the adaptive learning circuits of the neuro-chips. Experimental results are reported for cancellation of car noises in real time.

Unsupervised Classification with Non-Gaussian Mixture Models using ICA -- Te-Won Lee, Michael S. Lewicki, Terrence J. Sejnowski
We present an unsupervised classification algorithm based on an ICA mixture model. A mixture model is a model in which the observed data can be categorized into several mutually exclusive data classes. In an ICA mixture model, it is assumed that the data in each class are generated by a linear mixture of independent sources. The algorithm finds the independent sources and the mixing matrix for each class and also computes the class membership probability of each data point. This approach extends the Gaussian mixture model so that the clusters can have non-Gaussian structure. Performance on a standard classification problem, the Iris flower data set, demonstrates that the new algorithm can improve classification accurately over standard Gaussian mixture models. We also show that the algorithm can be applied to blind source separation in nonstationary environments. The method can switch automatically between learned mixing matrices in different environments. Preliminary results on natural scenes and text image patterns show that the algorithm is able to find classes so that one class encodes the natural images and the other class specializes on encoding the text segments.

Learning a Continuous Hidden Variable Model for Binary Data -- Daniel D. Lee, Haim Sompolinsky
A directed generative model for binary data using a small number of hidden continuous units is investigated. A clipping nonlinearity distinguishes the model from conventional principal components analysis. The relationships between the correlations of the underlying continuous Gaussian variables and the binary output variables are utilized to learn the appropriate weights of the network. The advantages of this approach are illustrated on a translationally invariant binary distribution and on handwritten digit images.

Stationarity and Stability of Autoregressive Neural Network Processes -- Friedrich Leisch, Adrian Trapletti, Kurt Hornik
We analyze the asymptotic behavior of autoregressive neural network (AR-NN) processes using techniques from Markov chains and non-linear time series analysis. It is shown that standard AR-NNs without shortcut connections are irreducible, aperiodic and asymptotically stationary. If linear shortcut connections are allowed, only the shortcut weights determine whether the overall system is stationary, hence standard conditions for linear AR processes can be used. We further show how integrated processes can be modeled using AR-NN processes. The asymptotic behavior of AR-NNs is especially important for predictions over larger intervals of time or when using the network to generate artificial time series. E.g., one might train the network on an available sample and then use the trained network afterwards---driven by artificial noise from a random number generator---to generate new data with similar properties than the training sample. The asymptotic stationarity guarantees that the AR-NN model cannot show explosive'' behavior or growing variance with time.

Coding Time-Varying Signals Using Sparse, Shift-Invariant Representations -- Michael S. Lewicki, Terrence J. Sejnowski
A common way to represent a time series is to divide it into short-duration blocks, each of which is then represented by a set of basis functions. A limitation of this approach, however, is that the temporal alignment of the basis functions with the underlying structure in the time series is arbitrary. We present an algorithm for encoding a time series that does not require blocking the data. The algorithm finds an efficient representation by inferring the best temporal positions for functions in a kernel basis. These can have arbitrary temporal extent and are not constrained to be orthogonal. This allows the model to capture structure in the signal that may occur at arbitrary temporal positions and preserves the relative temporal structure of underlying events. The model is shown to be equivalent to a very sparse and highly overcomplete basis. Under this model, the mapping from the data to the representation is nonlinear, but can be computed efficiently. This form also allows the use of existing methods for adapting the basis itself to data.

A V1 Model of Pop Out and Asymmetry in Visual Search -- Zhaoping Li
Visual search is the task of finding a target in an image against a background of distractors. Unique features of targets enable them to pop out against the background. It is known that the ease of target detection can change when the roles of figure and ground are switched. The mechanisms underlying pop out and asymmetry in visual search have been elusive. This paper shows that a model of segmentation in V1 based on intracortical interactions can explain many of the qualitative aspects of visual search.

Computational Differences between Asymmetrical and Symmetrical Networks -- Zhaoping Li, Peter Dayan
Symmetrically connected recurrent networks have recently been used as models of a host of neural computations. However, because of the separation between excitation and inhibition, biological neural networks are asymmetrical. We study characteristic differences between asymmetrical networks and their symmetrical counterparts, showing that they have dramatically different dynamical behavior and also how the differences can be exploited for computational ends. We illustrate our results in the case of a network that is a selective amplifier.

Mechanisms of Generalization in Perceptual Learning -- Zili Liu, Daphna Weinshall
The learning of many visual perceptual tasks was shown to be specific to the practiced stimulus, and new stimuli require re-learning from scratch. Here we demonstrate generalization using a novel paradigm in motion discrimination where learning has been shown to be specific: We trained subjects to discriminate the directions of moving dots, and verified the previous results that learning does not transfer from the trained direction to a new one. However, by tracking the subjects' performance across time in the new direction, we found that their rate of learning doubled. Therefore, we found generalization in a task previously considered too difficult to generalize. Even for an easy task that was found to generalize, we found a new manifestation of this generalization: After mastering the task with an easy stimulus, subjects who had practiced briefly to discriminate the easy stimulus in a new direction generalized to a difficult stimulus in that direction. This generalization depended on {\it both} the mastering {\it and} the brief practice. The specificity of perceptual learning, and the dichotomy between the learning of easy'' vs.\ difficult'' tasks, were hypothesized to imply that different learning processes are involved, operating at different visual cortical areas. Here we show how to interpret these results in terms of signal detection. With the single assumption of limited computational resources, we obtain the observed phenomena --- transfer, change of learning rate, and no transfer --- for respectively increasing levels of task difficulty. It appears that the observed generalization concurs with the expected behavior of a generic discrimination system. Thus we challenge the validity of using the specificity of perceptual learning to probe the neuronal loci of modifications in the early visual cortex.

The Effect of Eligibility Traces on Finding Optimal Memoryless Policies in Partially Observable Markov Decision Processes -- John Loch
Agents acting in the real world are confronted with the problem of making good decisions with limited knowledge of the environment. Partially observable Markov decision processes (POMDPs) model decision problems in which an agent tries to maximize its reward in the face of limited sensor feedback. Recent work has shown empirically that a reinforcement learning (RL) algorithm called Sarsa Lambda can efficiently find optimal memoryless policies, which map current observations to actions, for POMDP problems (Loch and Singh 1998). The Sarsa Lambda algorithm uses a form of short-term memory called an eligibility trace, which distributes temporally delayed rewards to observation-action pairs which lead up to the reward. This paper explores the effect of eligibility traces on the ability of the Sarsa Lambda algorithm to find optimal memoryless policies. A variant of Sarsa Lambda called k-step truncated Sarsa Lambda is applied to four test problems taken from the recent work of Littman, Littman, Cassandra and Kaelbling, Parr and Russell, and Chrisman. The empirical results show that eligibility traces can be significantly truncated without affecting the ability of Sarsa Lambda to find optimal memoryless policies for POMDPs.

Utilizing Time: Asynchronous Binding -- Bradley C. Love
Historically, connectionist systems have not excelled at representing and manipulating complex structures. How can a system composed of simple neuron-like computing elements encode complex relations? Recently, researchers have begun to appreciate that representations can extend in both time and space. Many researchers have proposed that the synchronous firing of units can encode complex representations. I identify the limitations of this approach and present an asynchronous model of binding that effectively represents complex structures. The asynchronous model extends the synchronous approach. I argue that our cognitive architecture utilizes a similar mechanism.

#### M

Analog Neural Nets with Gaussian or other Common Noise Distributions Cannot Recognize Arbitrary Regular Languages -- Wolfgang Maass, Eduardo D. Sontag
We consider recurrent analog neural nets where each gate is subject to Gaussian noise, or any other common noise distribution whose probability density function is nonzero on a large set. We show that many regular languages cannot be recognized by networks of this type, for example the language $\{w \in \{0,1\}^* |\; w \mbox{ begins with} \ 0\}$, and we give a precise characterization of those languages which can be recognized. This result implies severe constraints on possibilities for constructing recurrent analog neural nets that are robust against realistic types of analog noise. On the other hand, we present a method for constructing {\it feedforward} analog neural nets that are robust with regard to analog noise of this type.

Neural Networks for Density Estimation -- Malik Magdon-Ismail, Amir F. Atiya
We introduce two new techniques for density estimation. Our approach poses the problem as a supervised learning task which can be performed using Neural Networks. We introduce a stochastic method for learning the cumulative distribution and an analogous deterministic technique. We demonstrate convergence of our methods both theoretically and experimentally, and provide comparisons with the Parzen estimate. Our theoretical results demonstrate better convergence properties than the Parzen estimate.

Signal Detection in Noisy Weakly-Active Dendrites -- Amit Manwani, Christof Koch
Here we derive measures quantifying the information loss of a synaptic signal due to the presence of neuronal noise sources as it electrotonically propagates along a weakly-active dendrite . We model the dendrite as an infinite linear cable, with noise sources distributed along its length. The noise sources we consider are thermal noise, channel noise arising from the stochastic nature of voltage-dependent ionic channels (K and Na) and synaptic noise due to spontaneous background activity. We assess the efficacy of information transfer using a signal detection paradigm where the objective is to detect the presence/absence of a presynaptic spike from the post-synaptic membrane voltage. This allows us to analytically assess the role of each of these noise sources in information transfer. For our choice of parameters, we find that the synaptic noise is the dominant noise source which limits the maximum length over which information be reliably transmitted.

Exploratory Data Analysis Using Radial Basis Function Latent Variable Models -- Alan D. Marrs, Andrew R. Webb
Two developments of nonlinear latent variable models based on radial basis functions are discussed: the first is a re-sampling approach that makes more effective use of latent samples in evaluating the likelihood. Also, the use of priors or constraints on allowable models is considered as a means of preserving data structure on low-dimensional representations for visualisation purposes. The former development is illustrated on both a simulated data set and radar range profiles of ships.

Direct Optimization of Margins Improves Generalization in Combined Classifiers -- Llew Mason, Peter L. Bartlett, Jonathan Baxter
Recent theoretical results have shown that the generalization performance of thresholded convex combinations of base classifiers is greatly improved if the underlying convex combination has large {\em margins} on the training data (correct examples are classified well away from the decision boundary). Neural network algorithms and AdaBoost have been shown to implicitly maximize margins, thus providing some theoretical justification for their remarkably good generalization performance. In this paper we are concerned with maximizing the margin explicitly. In particular, we prove a theorem bounding the generalization performance of convex combinations in terms of general cost functions of the margin (previous results were stated in terms of the particular cost function $\sgn(\text{margin} - \theta)$). We then present an algorithm (DOOM) for directly optimizing a piecewise-linear family of cost functions satisfying the conditions of the theorem. Experiments on several of the datasets in the UC Irvine database are presented in which AdaBoost was used to generate a set of base classifiers and then DOOM was used to find the optimal convex combination of those classifiers. In all but one case the convex combination generated by DOOM had lower test error than AdaBoost's combination. In many cases DOOM achieves these lower test errors by sacrificing training error, in the interests of reducing the new cost function. The margin plots also show that the size of the minimum margin is not relevant to generalization performance.

Scheduling Straight-Line Code Using Reinforcement Learning and Rollouts -- Amy McGovern, Eliot Moss
The execution order of a block of computer instructions can make a difference in its running time by a factor of two or more. In order to achieve the best possible speed, compilers use heuristic schedulers appropriate to each specific architecture implementation. However, these heuristic schedulers are time-consuming and expensive to build. In this paper, we present results using both rollouts and reinforcement learning to construct heuristics for scheduling basic blocks. The rollout scheduler outperformed a commercial scheduler, and the reinforcement learning scheduler performed almost as well as the commercial scheduler.

On the Optimality of Incremental Neural Network Algorithms -- Ron Meir, Vitaly E. Maiorov
We study the approximation of functions by two-layer feedforward neural networks, focusing on incremental algorithms which incrementally add units, estimating single unit parameters at each stage. As opposed to standard algorithms for fixed architectures, the optimization at each stage is performed over a small number of parameters, mitigating many of the difficult numerical problems inherent in high-dimensional non-linear optimization. We establish upper bounds on the approximation error incurred in the approximation, when approximating functions from the Sobolev space, thereby extending previous results which only provided rates of convergence for functions in certain convex hulls of functional spaces. By comparing our results to recently derived lower bounds, we show that the incremental algorithms are nearly optimal in many cases, while provably out-perform linear methods in others. Combined with recent estimation error results for incremental greedy algorithms, a strong case can be made for this type of approach.

Kernel PCA and De-Noising in Feature Spaces -- Sebastian Mika, Bernhard Sch\"olkopf, Alex J. Smola, Klaus-R. M\"uller, Matthias Scholz, Gunnar R\"atsch
Kernel PCA as a nonlinear feature extractor has proven powerful as a preprocessing step for classification algorithms. But it can also be considered as a natural generalization of linear principal component analysis. This gives rise to the question how to use nonlinear features for data compression, reconstruction, and de-noising, applications common in linear PCA. This is a nontrivial task, as the results provided by kernel PCA live in some high dimensional feature space and need not have pre-images in input space. This work presents ideas for finding approximate pre-images, focusing on Gaussian kernels, and shows experimental results using these pre-images in data reconstruction and de-noising on toy examples as well as on real world data.

Bayesian Modeling of Facial Similarity -- Baback Moghaddam, Tony Jebara, Alex Pentland
In previous work, we advanced a new technique for direct visual matching of images for the purposes of face recognition and image retrieval, using a probabilistic measure of similarity based primarily on a Bayesian analysis of image differences, leading to a dual'' basis similar to eigenfaces. The performance advantage of this probabilistic matching technique over standard Euclidean nearest-neighbor eigenface matching was recently demonstrated using results from DARPA's 1996 FERET'' face recognition competition, in which this probabilistic matching algorithm was found to be the top performer. We have further developed a simple method of replacing the rather costly compution of nonlinear (online) Bayesian similarity measures by the relatively inexpensive computation of linear (offline) subspace projections and simple (online) Euclidean norms, thus resulting in a significant computational speed-up for implementation with very large image databases as typically encountered in real-world applications.

Learning Instance-Independent Value Functions to Enhance Local Search -- Robert Moll, Andrew G. Barto, Theodore Perkins, Richard S. Sutton
Reinforcement learning methods can be used to improve the performance of local search algorithms for combinatorial optimization by learning an evaluation function that predicts the outcome of search. The evaluation function is therefore able to guide search more effectively to low-cost solutions than can the original cost function. We describe a reinforcement learning method for enhancing local search that combines aspects of previous work by Zhang and Dietterich (1995) and Boyan and Moore (1997, 1998). In an off-line learning phase, a value function is learned that is useful for guiding search for multiple problem sizes and instances. We illustrate our technique by developing several such functions for the Dial-A-Ride Problem, a variant of the well-known Traveling Salesman's Problem. Overall, our learned hillclimbing results exhibit an improvement of more than 30\% over the standard Dial-A-Ride local search algorithm.

Reinforcement Learning for Trading Systems -- John Moody, Matthew Saffell
We propose to train trading systems by optimizing financial objective functions via reinforcement learning. The performance functions that we consider as value functions are profit or wealth, the Sharpe ratio and our recently proposed differential Sharpe ratio for online learning. In Moody and Wu 1997, we presented empirical results in controlled experiments that demonstrated the advantages of reinforcement learning relative to supervised learning. Here we extend our previous work to compare Q-Learning to a reinforcement learning technique based on real-time recurrent learning (RTRL) that maximizes immediate reward. We provide new simulation results that demonstrate the presence of predictability in the monthly S\&P 500 Stock Index for the 25 year period 1970 through 1994, as well as a sensitivity analysis that provides economic insight into the trader's structure.

Very Fast EM-based Mixture Model Clustering using Multiresolution Kd-trees -- Andrew W. Moore
Clustering is important in many fields including manufacturing, biology, finance, and astronomy. Mixture models are a popular approach due to their statistical foundations, and EM is a very popular method for finding mixture models. EM, however, requires many accesses of the data, and thus has been dismissed as impractical for data mining of enormous datasets. We present a new algorithm, based on the multiresolution kd-trees of our earlier work, which dramatically reduces the cost of EM-based clustering, with savings rising linearly with the number of datapoints. Although presented here for maximum likelihood estimation of Gaussian mixture models, it is also applicable to non-Gaussian models (provided class densities are monotonic in Mahalanobis distance), mixed categorical/numeric clusters, and Bayesian methods such as Autoclass.

A Principle for Unsupervised Hierarchical Decomposition of Visual Scenes -- Michael C. Mozer
Structure in a visual scene can be described at many levels of granularity. At a coarse level, the scene is composed of objects; at a finer level, each object is made up of parts, and the parts of subparts. In this work, I propose a simple principle by which such hierarchical structure can be extracted from visual scenes: Regularity in the relations among different parts of an object is weaker than in the internal structure of a part. This principle can be applied recursively to define part-whole relationships among elements in a scene. The principle does not make use of object models, categories, or other sorts of higher-level knowledge; rather, part-whole relationships can be established based on the statistics of a set of sample visual scenes. I illustrate with a model that performs unsupervised decomposition of simple scenes. The model can account for the results from a human learning experiment on the ontogeny of part-whole relationships.

Barycentric Interpolators for Continuous Space and Time Reinforcement Learning -- Remi Munos, Andrew W. Moore
In order to find the optimal control of continuous state-space and time reinforcement learning (RL) problems, we approximate the value function (VF) with a particular class of functions called the barycentric interpolators. We establish sufficient conditions under which a RL algorithm converges to the optimal VF, even when we use approximate models of the state dynamics and the reinforcement functions.

#### N

Controlling the Complexity of HMM Systems by Regularization -- Christoph Neukirchen, Gerhard Rigoll
This paper introduces a method for regularization of HMM systems that avoids parameter overfitting caused by insufficient training data. Regularization is done by augmenting the EM training method by a penalty term that favors simple and smooth HMM systems. The penalty term is constructed as a mixture model of negative exponential distributions that is assumed to generate the state dependent emission probabilities of the HMMs. This new method is the successful transfer of a well known regularization approach in neural networks to the HMM domain and can be interpreted as a generalization of traditional state-tying for HMM systems. The effect of regularization is demonstrated for continuous speech recognition tasks by improving overfitted triphone models and by speaker adaptation with limited training data.

Risk Sensitive Reinforcement Learning -- Ralph Neuneier, Oliver Mihatsch
As already known, the expected return of a policy controlling a Markov Decision Problem is not always the most suitable optimality criterion. For many applications control strategies should meet various constraints like avoiding very bad states (risk-avoiding) or generating high profit within a short time although this might probably cause significant costs (risk-seeking). We propose a modified Q-learning algorithm which uses a single continuous parameter $\kappa \in [-1,1]$ to determine in which sense the resulting policy is optimal. For $\kappa=0$, the policy is optimal with respect to the usual expected return criterion while $\kappa \to 1$ generates a solution which is optimal in the worst case. Analogous, the closer $\kappa$ is to $-1$ the more risk seeking the policy becomes. In contrast to other related approaches in the field of MDPs we do not have to transform the cost model or to increase the state space in order to take risk into account. The existing convergence results for traditional $Q$-learning remain also valid in the risk sensitive case $\kappa\not= 0$. We describe some potential applications in industry and evaluate our new approach by computing optimal strategies for one of them.

Maximum-Likelihood Continuity Mapping (MALCOM): An Alternative to HMMs -- David A. Nix, John E. Hogden
(MALCOM), an alternative to hidden Markov models (HMMs) for processing sequence data such as speech. While HMMs have a discrete hidden'' space constrained by a fixed finite-automaton architecture, MALCOM has a continuous hidden space---a {\it continuity map}---that is constrained only by a smoothness requirement on paths through the space. MALCOM fits into the same probabilistic framework for speech recognition as HMMs, but it represents a more realistic model of the speech production process. To evaluate the extent to which MALCOM captures speech production information, we generated continuous speech continuity maps for three speakers and used the paths through them to predict measured speech articulator data. The median correlation between the MALCOM paths {\it obtained from only the speech acoustics} and articulator measurements was 0.77 on an independent test set not used to train MALCOM or the predictor. This unsupervised model achieved correlations over speakers and articulators only 0.02 to 0.15 lower than those obtained using an analogous supervised method which {\it used articulatory measurements as well as acoustics.}

#### O

Graphical Models for Recognizing Human Interactions -- Nuria Oliver, Barbara Rosario, Alex Pentland
We describe a real-time computer vision and machine learning system for modeling and recognizing human actions and interactions. Two different domains are explored: recognition of two-handed motions in the martial art Tai Chi,' and multiple-person interactions in a visual surveillance task. Our system combines top-down with bottom-up information using a feedback loop, and is formulated with a Bayesian framework. Two different graphical models (HMMs and Coupled HMMs) are used for modeling both individual actions and multiple-agent interactions, and CHMMs are shown to work more efficiently and accurately for a given amount of training. Finally, to overcome the limited amounts of training data, we demonstrate that synthetic agents' (Alife-style agents) can be used to develop flexible prior models of the person-to-person interactions.

General Bounds on Bayes Errors for Regression with Gaussian Processes -- Manfred Opper, Francesco Vivarelli
Based on a simple convexity lemma, we develop bounds for different types of Bayesian prediction errors for regression with Gaussian processes. The basic bounds are formulated for a fixed training set. Simpler expressions are obtained for sampling from an input distribution which equals the weight function of the covariance kernel, yielding asymptotically tight results. The results are compared with numerical experiments.

Mean Field Methods for Classification with Gaussian Processes -- Manfred Opper, Ole Winther
We discuss the application of TAP mean field methods known from the Statistical Mechanics of disordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowlege about the distribution of inputs is needed in the derivation. Simulations for standard datasets are given.

Coordinate Transformation Learning of Hand Position Feedback Controller by Using Change of Position Error Norm -- Eimei Oyama, Susumu Tachi
In order to grasp an object, we need to solve the inverse kinematics problem, i.e., the coordinate transformation from the visual coordinates to the joint angle vector coordinates. In human motion control, the learning function of the hand position error feedback controller in human inverse kinematics solver is important. This paper proposes a novel model of the coordinate transformation learning of the human visual feedback controller, which uses the change of the joint angle vector and the corresponding change of the square of the hand position error norm.

#### P

Replicator Equations, Maximal Cliques, and Graph Isomorphism -- Marcello Pelillo
We present a new energy-minimization framework for the graph isomorphism problem which is based on an equivalent maximum clique formulation. The approach is centered around a remarkable result proved by Motzkin and Straus in the mid-1960s, and recently expanded in various ways, which allows us to formulate the maximum clique problem in terms of a standard quadratic program. To solve the program we usereplicator'' equations, a class of simple continuous- and discrete-time dynamical systems developed in various branches of theoretical biology. We show how, despite their inability to escape from local solutions, they nevertheless provide experimental results which are competitive with those obtained using more elaborate mean-field annealing heuristics.

Support Vector Machines Applied to Face Recognition -- P. Jonathon Phillips
Face recognition is a $K$ class problem, where $K$ is the number of known individuals; and support vector machines (SVMs) are a binary classification method. By reformulating the face recognition problem and re-interpreting the output of the SVM classifier, we developed a SVM-based face recognition algorithm. The face recognition problem is formulated as a problem in {\it difference space}, which models dissimilarities between two facial images. In difference space we formulate face recognition as a two class problem. The classes are: dissimilarities between faces of the same person, and dissimilarities between faces of different people. By modifying the interpretation of the decision surface generated by SVM, we generated a similarity metric between faces that is learned from examples of differences between faces. The SVM-based algorithm is compared with a principal component analysis (PCA) based algorithm on a difficult set of images from the FERET database. Performance was measured for both verification and identification scenarios. The identification performance for SVM is 77-78\% versus 54\% for PCA. For verification, the equal error rate is 7\% for SVM and 13\% for PCA.

The Role of Lateral Cortical Competition in Ocular Dominance Development -- Christian Piepenbrock, Klaus Obermayer
Lateral competition within a layer of neurons sharpens and localizes the response to an input stimulus. Here, we investigate a model for the activity dependent development of ocular dominance maps which allows to vary the degree of lateral competition. For weak competition, it resembles a correlation-based learning model and for strong competition, it becomes a self-organizing map. Thus, in the regime of weak competition the receptive fields are shaped by the second order statistics of the input patterns, whereas in the regime of strong competition, the higher moments and features'' of the individual patterns become important. When correlated localized stimuli from two eyes drive the cortical development we find $(i)$~that a topographic map and binocular, localized receptive fields emerge when the degree of competition exceeds a critical value and $(ii)$~that receptive fields exhibit eye dominance beyond a second critical value. For anti-correlated activity between the eyes, the second order statistics drive the system to develop ocular dominance even for weak competition, but no topography emerges. Topography is established only beyond a critical degree of competition.

Using Analytic QP and Sparseness to Speed Training of Support Vector Machines -- John Platt
Training a Support Vector Machine (SVM) requires the solution of a very large quadratic programming (QP) problem. This paper proposes a new algorithm for training SVMs: Sequential Minimal Optimization, or SMO. SMO breaks the large QP problem into a series of smallest possible QP problems which are analytically solvable. SMO thus avoids numerical QP optimization in the inner loop. Using code which exploits sparseness of input data speeds up SMO even further. Because large matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while a standard projected conjugate gradient (PCG) chunking algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. For the MNIST database, SMO is as fast as PCG chunking; while for the UCI Adult database and linear SVMs, SMO can be more than 1000 times faster than the PCG chunking algorithm.

Independent Component Analysis of Intracellular Calcium Spike Data -- Klaus Prank, Julia Borger, Alexander von zur Muhlen, Georg Brabant, Christof Schofl
Calcium (Ca$^{2+}$) is an ubiquitous intracellular messenger which regulates cellular processes, such as secretion, contraction, and cell proliferation. A number of different cell types respond to hormonal stimuli with periodic oscillations of the intracelluar free calcium concentration ([$Ca^{2+}]_i$). These $Ca^{2+}$ signals are often organized in complex temporal and spatial patterns even under conditions of sustained stimulation. Here we study the spatio-temporal aspects of intracelluar calcium ($[Ca^{2+}]_i$) oscillations in clonal $\beta$-cells (hamster insulin secreting cells, HIT) under pharmacological stimulation (Sch\"{o}fl {\it et al.}, 1994). We use a novel fast fixed-point algorithm (Hyv\"{a}rinen and Oja, 1997) for {\it Independent Component Analysis (ICA)} to blind source separation of the spatio-temporal dynamics of $[Ca^{2+}]_i$ in a HIT-cell. Using this approach we find two significant independent components out of five differently mixed input signals: one $[Ca^{2+}]_i$ signal with a mean oscillatory period of 68 s and a high frequency signal with a broadband power spectrum with considerable spectral density. This result is in good agreement with a study on high-frequency $[Ca^{2+}]_i$ oscillations (Palu\u{s} {\it et al.}, 1998). Further theoretical and experimental studies have to be performed to resolve the question on the functional impact of intracellular signaling of these independent $[Ca^{2+}]_i$ signals.

#### R

On-Line Learning with Restricted Training Sets: Exact Solution as Benchmark for General Theories -- H.C. Rae, Peter Sollich, A.C.C. Coolen
We solve the dynamics of on-line Hebbian learning in perceptrons exactly, for the regime where the size of the training set scales linearly with the number of inputs. We consider both noiseless and noisy teachers. Our calculation cannot be extended to non-Hebbian rules, but the solution provides a nice benchmark to test more general and advanced theories for solving the dynamics of learning with restricted training sets.

Learning Macro-Actions in Reinforcement Learning -- Jette Randl\o v
We present a method for automatically constructing macro-actions from primitive actions in reinforcement learning. The agent constructs the macros totally from scratch during the learning process. The overall idea is to reinforce the tendency to perform action B after action A if such a pattern of actions has been rewarded. We test the method on a bicycle task, the Car On The Hill Task, the Race-Track Task and some grid-world tasks. For the bicycle task and the Race-Track the use of macro actions approximately halves the learning time, while for the grid-world tasks the learning time is drastically reduced by orders of magnitude.

Learning Lie Groups for Invariant Visual Perception -- Rajesh P. N. Rao, Daniel L. Ruderman
One of the most important problems in visual perception is that of visual invariance: how are objects perceived to be the same despite undergoing transformations such as translations, rotations or scaling? In this paper, we describe a Bayesian method for learning invariances based on Lie group theory. We show that previous approaches based on first-order Taylor series expansions of inputs can be regarded as special cases of the Lie group approach, the latter being capable of handling in principle arbitrarily large transformations. Using a matrix-exponential based generative model of images, we derive an unsupervised algorithm for learning Lie group operators from input data containing infinitesimal transformations. The on-line unsupervised learning algorithm maximizes the posterior probability of generating the training data. We provide experimental results suggesting that the proposed method can learn Lie group operators for handling reasonably large 1-D translations and 2-D rotations.

Regularizing AdaBoost -- Gunnar R\"atsch, Takashi Onoda, Klaus-R. M\"uller
Boosting methods maximize a hard classification margin and are known as powerful techniques that do not exhibit overfitting for low noise cases. Also for noisy data boosting will try to enforce a hard margin and thereby give too much weight to outliers, which then leads to the dilemma of non-smooth fits and overfitting. Therefore we propose three algorithms to allow for soft margin classification by introducing regularization with slack variables into the boosting concept: (1) AdaBoost$_{reg}$ and regularized versions of (2) linear and (3) quadratic programming AdaBoost. Experiments show the usefulness of the proposed algorithms in comparison to another soft margin classifier: the support vector machine.

Multi-electrode Spike Sorting by Clustering Transfer Functions -- Dmitry Rinberg, Hanan Davidowitz, Naftali Tishby
We propose a new paradigm for sorting spikes in multi-electrode extra cellular recording by clustering the transfer functions between cells and electrodes. The assumption is that for every cell and electrode there is a stable linear system, corresponding to the media properties and geometry. The main advantage of this method is that it is insensitive to variations in the cell-spike shape and amplitude. At the same time it utilizes all other multi-electrode information. In addition, since the transfer function noise is simpler than the variablity in the spike shape, this representation is clean and easy to classify. We generate spike templates by clustering the complex ratio of the Fourier-transforms of spikes on the different electrodes, using deterministic annealing to determine the number of separated cells. The detection of the spikes is then done by using their Fourier representation. We apply the technique to sorting spikes generated in the escape response system of the cockroach.

General-Purpose Localization of Textured Image Regions -- Ruth Rosenholtz
We suggest a working definition of texture: Texture is stuff which is more compactly represented by its statistics than by specifying the configuration of its parts. This definition suggests that to find texture we look for outliers to the local statistics, and label as texture the regions with no outliers. We present a method, based upon this idea, for labeling points in natural scenes as belonging to texture regions, while simultaneously allowing us to label low-level, bottom-up cues for visual attention. This method is strongly based upon recent psychophysics results on processing of texture and popout.

#### S

Tight Bounds for the VC-Dimension of Piecewise Polynomial Networks -- Akito Sakurai
$O(ws(s\log d+\log (dqh/s)))$ and $O(ws((h/s)\log q)+\log (dqh/s))$ are upper bounds for the VC-dimension of a set of neural networks of units with piecewise polynomial activation functions, where $h$ is the number of hidden units, $w$ is the number of adjustable parameters, $q$ is the maximum of the number of polynomial segments of the activation function, and $d$ is the maximum degree of the polynomials; also $\Omega(ws\log (dqh/s))$ is a lower bound for the VC-dimension of such a network set, which are tight for the cases $s=\Theta(h)$ and $s$ is constant. For the special case $q=1$, the VC-dimension is $\Theta(ws\log d)$.

Reinforcement Learning Based on On-line EM Algorithm -- Masa-aki Sato, Shin Ishii
In this article, we will propose a new reinforcement learning (RL) method based on an actor-critic architecture. The actor and the critic are approximated by Normalized Gaussian Networks (NGnet's), which are networks of local linear regression units. The NGnet's are trained by the on-line EM algorithm proposed in our previous paper. We apply our RL method to a task for swinging-up and stabilizing a single pendulum, and a task for balancing a double pendulum near the upright position. The experimental results show that our RL method can be applied to optimal control problems having continuous state/action spaces and it achieves a good control in a small number of trial-and-errors.

Markov Processes on Curves for Automatic Speech Recognition -- Lawrence Saul, Mazin Rahim
We investigate a probabilistic framework for automatic speech recognition based on the intrinsic geometric properties of curves. In particular, we analyze the setting in which two variables --- one continuous (X), one discrete (S) --- evolve jointly in time. We suppose that the vector X traces out a smooth multidimensional curve and that the variable S evolves stochastically as a function of the arc length traversed along this curve. Since arc length does not depend on the rate at which a curve is traversed, this gives rise to a family of Markov processes whose predictions, Pr[S|X], are invariant to nonlinear warpings of time. We describe the use of such models -- known as Markov processes on curves -- for automatic speech recognition, where X are acoustic feature trajectories and S are phonetic transcriptions. On the task of recognizing 1219 New Jersey town names, we report systematic improvements in word accuracy over comparable, state-of-the-art hidden Markov models.

Shrinking the Tube: A New Support Vector Regression Algorithm -- Bernhard Sch\"olkopf, Peter L. Bartlett, Alex J. Smola, Robert Williamson
A new algorithm for Support Vector regression is proposed. For a priori chosen $\nu$, it automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction $\nu$ of the data points lie outside. Moreover, it is shown how to use parametric tube shapes with non-constant radius. The algorithm is analysed theoretically and experimentally.

Probabilistic Image Sensor Fusion -- Ravi K. Sharma, Todd K. Leen, Misha Pavel
We present a probabilistic approach for fusion of video sequences produced by multiple imaging sensors. We model the sensor images as noisy, locally affine functions of an underlying true scene. Maximum likelihood estimates of the parameters in the local affine functions are based on the local covariance of the image data, and therefore related to local principal component analysis. With the model parameters estimated, a Bayesian framework provides either maximum likelihood or maximum a posteriori estimates of the true scene from the sensor images. These true scene estimates comprise the sensor fusion rules. We demonstrate the efficacy of the approach on sequences of images from visible-band and infrared sensors.

Boxlets: a Fast Convolution Algorithm for Signal Processing and Neural Networks -- Patrice Y. Simard, Leon Bottou, Patrick Haffner, Yann Le Cun
Signal processing and pattern recognition algorithms make extensive use of convolution. In many cases, computational accuracy is not as important as computational speed. In feature extraction, for instance, the features of interest in a signal are usually quite distorted. This form of noise justifies some level of quantization in order to achieve faster feature extraction. Our approach consists of approximating regions of the signal with low degree polynomials, and then differentiating the resulting signals in order to obtain impulse functions (or derivatives of impulse functions). With this representation, convolution becomes extremely simple and can be implemented quite effectively. The true convolution can be recovered by integrating the result of the convolution. This method yields substantial speed up in feature extraction and is applicable to convolutional neural networks.

Invited Talk: Cortical Normalization Models and the Statistics of Visual Images -- Eero Simoncelli
I present a parametric statistical model for visual images in the wavelet transform domain. The model characterizes the joint densities of coefficients corresponding to basis functions at adjacent spatial locations, adjacent orientations, and adjacent spatial scales. The model is consistent with the statistics of a wide variety of images, including photographs of indoor and outdoor scenes, medical images, and synthetic (graphics) images, and has been used successfully in applications of compression, noise removal, and texture synthesis. The model also suggests a nonlinear method of removing these dependencies, which I call normalized component analysis,'' in which each wavelet coefficient is divided by a linear combination of coefficient magnitudes at adjacent locations, orientations and scales. This analysis provides a theoretical justification for recent divisive normalization models of striate visual cortex. Furthermore, the statistical measurements may be used to determine the weights that are used in computing the normalization signal. The resulting model makes specific predictions regarding non-specific suppression and adaptation behaviors of cortical neurons, and thus offer the opportunity to test directly (through physiological measurements) the ecological hypothesis that visual neural computations are optimally matched to the statistics of images.

Modeling Non-Specific Suppression in V1 Neurons with a Statistically-Derived Normalization Model -- Eero Simoncelli, Odelia Schwartz
We examine the statistics of natural monochromatic images decomposed using a multi-scale wavelet basis. Although the coefficients of this representation are nearly decorrelated, they exhibit important higher-order statistical dependencies that cannot be eliminated with purely linear processing. In particular, rectified coefficients corresponding to basis functions at neighboring spatial positions, orientations and scales are highly correlated. A method of removing these dependencies is to {\em divide} each coefficient by a weighted combination of its rectified neighbors. Several successful models of neural processing in visual cortex are based on such divisive gain control (ornormalization'') computations, and thus our analysis provides a theoretical justification for these models. Perhaps more importantly, the statistical measurements explicitly specify the weights that should be used in computing the normalization signal. We demonstrate that this weighting is qualitatively consistent with recent physiological measurements, and thus that early visual neural processing is well matched to these statistical properties of images.

On-line and Batch Parameter Estimation of Gaussian Mixtures Based on the Relative Entropy -- Yoram Singer, Manfred K. Warmuth
We describe a new iterative method for parameter estimation of Gaussian mixtures. The new method is based on a framework developed by Kivinen and Warmuth for supervised on-line learning. In contrast to gradient descent and EM, which estimate the mixture's covariance matrices, the proposed method estimates the {\em inverses} of the covariance matrices. Furthermore, the new parameter estimation procedure can be applied in both on-line and batch settings. We show experimentally that it is typically faster than EM, and usually requires about half as many iterations as EM. We also describe experiments with digit recognition that demonstrate the merits of the on-line version.

Discontinuous Recall Transitions Induced by Competition Between Short- and Long-Range Interactions in Recurrent Networks -- N.S. Skantzos, C.F. Beckmann, A.C.C. Coolen
We present exact analytical equilibrium solutions for a class of recurrent neural network models, with both sequential and parallel neuronal dynamics, in which there is a tunable competition between nearest-neighbour and long-range synaptic interactions. This competition is found to induce novel coexistence phenomena as well as discontinuous transitions between pattern recall states, 2-cycles and non-recall states.

Semiparametric Support Vector and Linear Programming Machines -- Alex J. Smola, Thilo Friess, Bernhard Sch\"olkopf
Semiparametric models are useful tools in the case where domain knowledge exists about the function to be estimated or emphasis is put onto understandability of the model. We extend two learning algorithms - Support Vector machines and Linear Programming machines to this case and give experimental results for SV machines.

Learning Curves for Gaussian Processes -- Peter Sollich
I consider the problem of calculating learning curves (i.e., average generalization performance) of Gaussian processes used for regression. A simple expression for the generalization error in terms of the eigenvalue decomposition of the covariance function is derived, and used as the starting point for several approximation schemes. I identify where these become exact, and compare with existing bounds on learning curves; the new approximations, which can be used for any input space dimension, generally get substantially closer to the truth.

Invited Talk: Computation by Cortical Modules -- Haim Sompolinsky
Understanding the principles underlying the processing of information in the neocortex is one of the most important challenges of brain research. Anatomical and physiological data suggest that cortical areas function as interacting networks of local computational modules. These modules, analogous to the hypercolumns in visual cortex, consist of hundreds of thousands of cells which are interconnected by massive recurrent excitation and inhibition. Recent theoretical research shed light on the intrinsic dynamics of these local cortical circuits, the resultant repertoire of spatio-temporal patterns of neuronal activity, and their possible computational implications for feature extraction, motor planning and memory.

Applications of Multiresolution Neural Networks to Mammography -- Clay D. Spence, Paul Sajda
We have previously presented a coarse-to-fine hierarchical pyramid/neural network (HPNN) architecture which combines multi-scale image processing techniques with neural networks. In this paper we present applications of this general architecture to two problems in mammographic Computer-Aided Diagnosis (CAD). The first application is the detection of microcalcifications. The coarse-to-fine HPNN was designed to learn large-scale context information for detecting small objects like microcalcifications. Receiver operating characteristic (ROC) analysis suggests that the hierarchical architecture improves detection performance of a well established CAD system by roughly 50\%. The second application is to detect mammographic masses directly. Since masses are large, extended objects, the coarse-to-fine HPNN architecture is not suitable for this problem. Instead we construct a fine-to-coarse HPNN architecture which is designed to learn small-scale detail structure associated with the extended objects. Our initial results applying the fine-to-coarse HPNN to mass detection are encouraging, with detection performance improvements of 15\% to 30\%. We conclude that the ability of the HPNN architecture to integrate information across scales, both coarse-to-fine and fine-to-coarse, makes it well suited for detecting objects which may have contextual clues or detail structure occurring at scales other than the natural scale of the object.

Information Maximization in Single Neurons -- Martin B. Stemmler, Christof Koch
Information from the senses must be compressed into the limited range of responses spiking nerve cells can generate; for optimal compression, the neuron's response should match the statistics of stimuli encountered in nature. Given a limit on its maximal firing rate, a nerve cell should teach itself to use each available firing rate equally often. Given a fixed average firing rate, a neuron should self-organize itself to respond with high firing rates only to comparatively rare events. Since changing the voltage-dependent ionic conductances in the nerve cell membrane alters the representation of information, an unsupervised learning rule is derived to show how a Hodgkin-Huxley model neuron can adapt these conductances to optimize its firing rate response. This learning rule corresponds to a non-Hebbian, developmental mechanism that maximizes the rate of information transmission.

Computation of Smooth Optical Flow in a Feedback Connected Analog Network -- Alan Stocker, Rodney Douglas
In 1985, Tanner and Mead implemented an interesting constraint satisfaction circuit for global motion sensing in aVLSI. We report here a new and improved aVLSI implementation that provides smooth optical flow as well as global motion in a two dimensional visual field. The computation of optical flow is an ill-posed problem, which expresses itself as the aperture problem. However, the optical flow can be estimated by the use of regularization methods, in which additional constraints are introduced in terms of a global energy functional that must be minimized. We show how the algorithmic constraints of Horn and Schunck on computing smooth optical flow can be mapped onto the physical constraints of an equivalent electronic network.

A Reinforcement Learning Algorithm in Partially Observable Environments Using Short-Term Memory -- Nobuo Suematsu, Akira Hayashi
We describe a Reinforcement Learning algorithm for partially observable environments using short-term memory, which we call BLHT. Since BLHT learns a stochastic model based on Bayesian Learning, the overfitting problem is reasonably solved. Moreover, BLHT has an efficient implementation. This paper shows that the model learned by BLHT converges to one which provides the most accurate predictions of percepts and rewards, given short-term memory.

Improved Switching among Temporally Abstract Actions -- Richard S. Sutton, Satinder Singh, Doina Precup, B. Ravindran
In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps handcrafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov Decision Processes and Semi-Markov Decision Processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers with negligible additional cost and no replanning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.

#### T

A Theory of Mean Field Approximation -- Toshiyuki Tanaka
I present a theory of mean field approximation based on information geometry. This theory includes in a consistent way the naive mean field approximation, as well as the Thouless-Anderson-Palmer (TAP) approach and the linear response theorem in statistical physics, giving clear information-theoretic interpretations to them.

Bayesian Modeling of Human Concept Learning -- Joshua B. Tenenbaum
I consider the problem of learning concepts from small numbers of positive examples, a feat which humans perform all the time but which computers are rarely capable of. In the spirit of trying to bridge machine learning and cognitive science perspectives, I present both theoretical analysis and an empirical study with human subjects for a simple, well-known task: learning concepts corresponding to axis-aligned rectangles in a multidimensional feature space. Existing learning models, when applied to this task, cannot explain how subjects generalize from only a few examples of the concept. I propose a principled Bayesian model based on the assumption that the observed examples are a random sample from the concept to be learned. The Bayesian model gives precise fits to human behavior on this simple task and provides qualitative insights into more complex, realistic cases of concept learning.

Orientation, Scale, and Discontinuity as Emergent Properties of Illusory Contour Shape -- Karvel K. Thornber, Lance R. Williams
A recent neural model of illusory contour formation is based on a distribution of natural shapes traced by particles moving with constant speed in directions given by Brownian motions. The input to that model consists of pairs of position and direction constraints and the output consists of the distribution of contours joining all such pairs. In general, these contours will not be closed and their distribution will not be scale-invariant. In this paper, we show how to compute a scale-invariant distribution of closed contours given position constraints alone, and use this result to explain a well known illusory contour effect.

Probabilistic Visualisation of High-dimensional Binary Data -- Michael E. Tipping
We present a probabilistic latent-variable framework for data visualisation, a key feature of which is its applicability to binary and categorical data types for which few established methods exist. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. Illustrations of application to real and synthetic binary data sets are given.

#### U

SMEM Algorithm for Mixture Models -- Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, Geoffrey Hinton
We present a split and merge EM (SMEM) algorithm to overcome the local maximum problem in parameter estimation of finite mixture models. In the case of mixture models, non-global maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations we repeatedly perform simultaneous split and merge operations using a new criterion for efficiently selecting the split and merge candidates. We apply the proposed algorithm to the training of Gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split and merge operations to improve the likelihood of both the training data and of held-out test data.

#### V

Learning Mixture Hierarchies -- Nuno Vasconcelos, Andrew Lippman
The hierarchical representation of data has various applications in domains such as data mining, machine vision, or information retrieval. In this paper we introduce an extension of the Expectation-Maximization (EM) algorithm that learns mixture hierarchies in a computationally efficient manner. Efficiency is achieved by progressing in a bottom-up fashion, i.e. by clustering the mixture components of a given level in the hierarchy to obtain those of the level above. This clustering requires only knowledge of the mixture parameters, there being no need to resort to intermediate samples. In addition to practical applications, the algorithm allows a new interpretation of EM that makes clear the relationship with non-parametric kernel-based estimation methods, provides explicit control over the trade-off between the bias and variance of EM estimates, and offers new insights about the behavior of deterministic annealing methods commonly used with EM to escape local minima of the likelihood.

Discovering Hidden Features with Gaussian Processes Regression -- Francesco Vivarelli, Christopher K. I. Williams
In Gaussian process regression the covariance between the outputs at input locations $\xvec$ and $\xvec'$ is usually assumed to depend on the distance $\round{\xvec - \xvec'}\tp W \round{\xvec-\xvec'}$, where $W$ is a positive definite matrix. $W$ is often taken to be diagonal, but if we allow $W$ to be a general positive definite matrix which can be tuned on the basis of training data, then an eigen-analysis of W shows that we are effectively creating hidden features, where the dimensionality of the hidden-feature space is determined by the data. We demonstrate the superiority of predictions using the general matrix over those based on a diagonal matrix on two test problems.

#### W

The Bias-Variance Tradeoff and the Randomized GACV -- Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, Barbara Klein
We propose a new in-sample cross validation based method (randomized GACV) for choosing smoothing or bandwidth parameters that govern the bias-variance or fit-complexity tradeoff in soft' classification. Soft classification refers to a learning procedure which estimates the probability that an example with a given attribute vector is in class 1 {\it vs} class 0. The target for optimizing the the tradeoff is the Kullback-Liebler distance between the estimated probability distribution and the true' probability distribution, representing knowledge of an infinite population. The method uses a randomized estimate of the trace of a Hessian and mimics cross validation at the cost of a single relearning with perturbed outcome data.

Classification in Non-Metric Spaces -- Daphna Weinshall, David W. Jacobs, Yoram Gdalyahu
A key question in vision is how to represent our knowledge of previously encountered objects in order to classify new ones. The answer depends on how we determine the similarity of two objects. Similarity tells us how relevant each previously seen object is in determining the category to which a new object belongs. Here a dichotomy emerges. Complex notions of similarity appear necessary for cognitive models and applications, while simple notions of similarity form a tractable basis for current computational approaches to classification. We explore the nature of this dichotomy and why it calls for new approaches to well-studied problems in learning. We begin this process by demonstrating new computational methods for supervised learning that can handle complex notions of similarity. (1) We discuss how to implement parametric methods that represent a class by its MEAN when using non-metric similarity functions; and (2) we review non-parametric methods that we have developed using nearest neighbor classification in non-metric spaces.

Basis Selection For Wavelet Regression -- Kevin R. Wheeler
A wavelet basis selection procedure is presented for wavelet regression. Both the basis and threshold are selected using cross-validation. The method includes the capability of incorporating prior knowledge on the smoothness (or shape of the basis functions) into the basis selection procedure. The results of the method are demonstrated using widely published sampled functions. The results of the method are contrasted with other basis function based methods.

DTs: Dynamic Trees -- Christopher K. I. Williams, Nicholas J. Adams
In this paper we introduce a new class of image models, which we call dynamic trees or DTs. A dynamic tree model specifies a prior over a large number of trees, each one of which is a tree-structured belief net (TSBN). Our intuition as to why DTs may be useful image models is based on the idea that most pixels in an image are derived from a single object. We think of an object as being described by a root of a tree, with the scale of the object being determined by the level in the tree at which the root occurs. Experiments show that the DT models have better translation invariance properties than a fixed, balanced'' TSBN, and that Markov chain Monte Carlo methods are effective at finding trees which have high posterior probability.

Experiments with an Algorithm which Learns Stochastic Memoryless Policies for Partially Observable Markov Decision Processes -- John K. Williams, Satinder Singh
Partially Observable Markov Decision Processes (POMDPs) constitute an important class of reinforcement learning problems which present unique theoretical and computational difficulties. In the absence of the Markov property, popular reinforcement learning algorithms such as Q-learning may no longer be effective, and memory-based methods which remove partial observability via state-estimation are notoriously expensive. An alternative approach is to seek a stochastic memoryless policy which for each observation of the environment prescribes a probability distribution over available actions that maximizes the average reward per timestep. A reinforcement learning algorithm which learns a locally optimal stochastic memoryless policy has been proposed by Jaakkola, Singh and Jordan, but not empirically verified. We present a variation of this algorithm, discuss its implementation, and demonstrate its viability using four test problems.

Robot Docking using Mixtures of Gaussians -- Matthew M. Williamson, Roderick Murray-Smith, Volker Hansen
This paper applies the {\it Mixture of Gaussians} probabilistic model, combined with Expectation Maximization optimization, to the task of summarizing three dimensional range data for a mobile robot. This provides a flexible way of dealing with uncertainties in sensor information, and allows the introduction of prior knowledge into low-level perception modules. Problems with the basic approach were solved in several ways: the mixture of Gaussians was reparameterized to reflect the types of objects expected in the scene, and priors on model parameters were included in the optimization process. Both approaches force the optimization to find interesting' objects, given the sensor and object characteristics. A higher level classifier was used to interpret the results provided by the model, and to reject spurious solutions.

Using Collective Intelligence to Route Internet Traffic -- David Wolpert, Kagan Tumer, Jeremy Frank
A COllective INtelligence (COIN) is a set of interacting reinforcement learning (RL) algorithms designed so that their collective behavior optimizes a global utility function. We summarize the theory of COINs, then present experiments using that theory to design COINs to control internet traffic routing. These experiments indicate that COINs outperform all previously investigated RL-based, shortest path routing algorithms.

#### Y

Population Coding with Correlated Noise -- Hyoungsoo Yoon, Haim Sompolinsky
We study the effect of correlated noise on the accuracy of population coding using a model of a population of neurons that are broadly tuned to an angle in two-dimension. The fluctuations in the neuronal activity is modeled as a Gaussian noise with pairwise correlations which decays exponentially with the difference between the preferred orientations of the pair. By calculating the Fisher information of the system, we show that in the biologically relevant regime of parameters positive correlations decrease the estimation capability of the network relative to the uncorrelated population. Moreover, strong positive correlations result in information capacity which saturates to a finite value as the number of cells in the population grows. In contrast, negative correlations substantially increase the information capacity of the neuronal population.

Convergence Rates of Algorithms for Visual Search: Detecting Visual Contours. -- A.L. Yuille, James M. Coughlan
This paper develops a theory for the convergence rates of algorithms for performing visual search tasks (formulated in a Bayesian framework). Our approach makes use of the A* search strategy and mathematical results from the theory of types in information theory. In particular, we formulate the problem of the detection of visual contours in noise/clutter by optimizing a global criterion for combining local intensity and geometry information. We analyze the convergence rates of A* search algorithms for detecting the target contour. This analysis determines characteristics of the domain, which we call order parameters, which determine the convergence rates. In particular, we present a specific admissible A* algorithm with pruning which converges, with high probability, with expected convergence time $O(N)$ in the size of the problem. In addition, we briefly summarize extensions of this work which address fundamental limits of target contour detectability (i.e. algorithm independent results) and the use of non-admissible heuristics.

#### Z

Distributional Population Codes and Multiple Motion Models -- Richard S. Zemel, Peter Dayan
Theoretical and empirical studies of population codes make the assumption that neuronal activities reflect a unique and unambiguous value of an encoded quantity. However, population activities can contain additional information, such as multiple values of the quantity and uncertainty about it. We have previously suggested a method to recover the extra information by treating the activities of the population of cells as coding for a complete distribution over the coded quantity rather than just a single value. We now show how this approach bears on psychophysical and neurophysiological studies of population codes for motion direction. Our model is consistent with both correct and erroneous human performance on tasks in which subjects must report the directions of motion in stimuli containing groups of dots moving in some small number of directions. The model also suggests a natural explanation for why a monkey's psychophysical discrimination performance is not substantially better.

Blind Separation of Filtered Source Using State-Space Approach -- Liqing Zhang, Andrzej Cichocki
In this paper we present a novel approach to multichannel blind separation/generalized deconvolution, assuming that both mixing and demixing models are described by stable linear state-space systems. We decompose the blind separation problem into two process: separation and state estimation. Based on the minimization of Kullback-Leibler Divergence, we develop a novel learning algorithm to train the matrices in the output equation. To estimate the state of the demixing model, we introduce a new concept, called hidden innovation, to numerically implement Kalman filter. Computer simulations are given to show the validity and high effectiveness of the state-space approach.

A High Performance k-NN Classifier Using a Binary Correlation Matrix Memory -- Ping Zhou, Jim Austin, John Kennedy
This paper presents a novel and fast k-NN classifier that is based on a binary CMM (Correlation Matrix Memory) neural network. A robust encoding method is developed that converts numerical inputs into binary ones to meet CMM input requirements. A hardware implementation of the CMM is described, which gives over 200 times the speed of a current mid-range workstation, and is scaleable to very large problems. When tested on several benchmarks and compared with a simple k-NN method, the CMM classifier gave less than 1\% lower accuracy and over 4 and 12 times speed-ups in software and hardware respectively.