# Neural Information Processing Systems 1999

This is a list of the papers that were presented at NIPS*99. They are listed in (approximately) alphabetical order according to the last name of the first author. The title is a link to a postscript version of the paper if the author(s) provided us with one. A compressed (gzip'd) version is available as well.

These papers should be cited as:
To appear in Advances in Neural Information Processing Systems 12,
S. A. Solla, T. K. Leen, K.-R. Müller, eds., MIT Press, 2000.

You can also jump to the names starting with
A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-R-S-T-V-W-X-Y-Z

#### A

Banquet Talk: Lightness Perception and Lightness Illusions -- Edward H. Adelson
Look at a gray object indoors or out: it still looks gray, even though the light falling upon it (and thus the light reflecting from it) varies by a thousandfold. This "lightness constancy" is a central feature of visual perception. A clever experimenter can trick the visual system into seeing two identical grays as being quite different. The lightness illusions appartently result from the same mechanisms that produce lightness constancy. Illusions are worth studying because (1) they allow a researcher to expose the inner workings of the visual system and (2) they make really cool demos.

The classical approaches to lightness perception involve normalization by a image statistics, e.g., dividing or subtracting out the mean luminance. Low-level operations such as lateral inhibition can implement such normalization locally. However, we have devised stimuli that dramatically demonstrate the inadequacy of these traditional models. The new illusions indicate the importance of mid-level visual processes that utilize information about contours, junctions, and regions. We propose that the visual system estimates "optical atmosphere" at each point in a scene, and propagates lightness constraints within regions defined by atmospheric boundaries. Lightness constancy can be considered as a statistical estimation problem, where the statistics are gathered within an adaptive window that prevents the mixture of samples from different lighting conditions.

Recurrent Cortical Competition: Strengthen or Weaken? (gzip'd) -- Peter Adorjan, Lars Schwabe, Christian Piepenbrock, Klaus Obermayer
We investigate the short term dynamics of recurrent competition and neural activity in the primary visual cortex in terms of information processing and in the context of orientation selectivity. We propose that after stimulus onset, the strength of the recurrent excitation decreases due to fast synaptic depression. As a consequence, the network is shifted from an initially highly nonlinear to a more linear operating regime. Sharp orientation tuning is established in the first highly competitive phase. In the second and less competitive phase, precise signaling of multiple orientations and long range modulation, e.g., by intra- and inter-areal connections becomes possible (surround effects). Thus the network first extracts the salient features from the stimulus, and then starts to process the details. We show that this signal processing strategy is optimal if the neurons have limited bandwidth and their objective is to transmit the maximum amount of information in any time interval beginning with the stimulus onset.

Robust Full Bayesian Methods for Neural Networks (gzip'd) -- Christophe Andrieu, Nando de Freitas, Arnaud Doucet
In this paper, we propose a full Bayesian model for neural networks. This model treats the model dimension (number of neurons), model parameters, regularisation parameters and noise parameters as random variables that need to be estimated. We then propose a reversible jump Markov chain Monte Carlo (MCMC) method to perform the necessary computations. We find that the results are not only better than the previously reported ones, but also appear to be robust with respect to the prior specification. Moreover, we present a geometric convergence theorem for the algorithm.

Inferring Parameters and Structure of Graphical Models by Variational Bayes -- Hagai Attias
This paper presents a novel practical framework for Bayesian model averaging and model selection in probabilistic graphical models. Our approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner. Unlike in large-sample approximations, these posteriors are generally non-Gaussian and no Hessian needs to be computed. The resulting algorithm generalizes the standard Expectation Maximization algorithm, and its convergence is guaranteed. We demonstrate that this approach can be applied to a large class of graphical models in several domains, including mixture models, hidden Markov models and blind source separation.

Dynamic Graphical Models for Independent Factor Analysis -- Hagai Attias
We present a new technique for time series analysis based on dynamic probabilistic networks. In this approach, the observed data are modeled in terms of unobserved, mutually independent factors, as in Independent Factor Analysis (IFA) recently introduced by (Attias 1999). However, unlike IFA, the factors are not i.i.d.; each factor has its own temporal statistical characteristics. We derive a family of EM algorithms that learn the structure of the underlying factors and their relation to the data, and demonstrate superior performance compared to IFA. Model selection issues, in particular inferring the optimal number of factors, are also addressed.

#### B

Robust Learning of Chaotic Attractors (gzip'd) -- Rembrandt Bakker, Jaap C. Schouten, Marc-Olivier Coppens, Floris Takens, C. Lee Giles, Cor M. van den Bleek
A fundamental problem with the modeling of chaotic time series data is that minimizing short-term prediction errors does not guarantee a match between the reconstructed attractors of model and experiments. We introduce a modeling paradigm that simultaneously learns to short-term predict and to locate the outlines of the attractor by a new way of nonlinear principal component analysis. Closed-loop predictions are constrained to stay within these outlines, to prevent divergence from the attractor. Learning is exceptionally fast: parameter estimation for the 1000 sample laser data from the 1991 Santa Fe time series competition took less than a minute on a 166 MHz Pentium PC.

Gaussian Fields for Approximate Inference in Layered Sigmoid Belief Networks -- David Barber, Peter Sollich
Layered Sigmoid Belief Networks are directed graphical models in which the local conditional probabilities are parameterised by weighted sums of parental states. Learning and inference in such networks are generally intractable, and approximations need to be considered. Progress in learning these networks has been made by using variational procedures. We demonstrate, however, that variational procedures can be inappropriate for the equally important issue of inference - that is, calculating marginals of the network. We introduce an alternative procedure, based on assuming that the weighted input to a node is approximately Gaussian distributed. Our approach goes beyond previous Gaussian field assumptions in that we take into account correlations between parents of nodes. This procedure is specialized for calculating marginals and is significantly faster and simpler than the variational procedure.

Image Representations for Facial Action Coding -- Marian Stewart Bartlett, Gianluca Donato, Javier R. Movellan, Joseph C. Hager, Paul Ekman, Terrence J. Sejnowski
The Facial Action Coding System (FACS) is an objective method for quantifying facial movement in terms of component actions. This system is widely used in behavioral investigations of emotion, cognitive processes, and social interaction. The coding is presently performed by highly trained human experts. This paper explores and compares techniques for automatically recognizing facial actions in sequences of images. These methods include unsupervised learning techniques for finding basis images such as principal component analysis, independent component analysis and local feature analysis, and supervised learning techniques such as Fisher's linear discriminants. These data-driven bases are compared to Gabor wavelets, in which the basis images are predefined. Best performances were obtained using the Gabor wavelet representation and the independent component representation, both of which achieved 96\% accuracy for classifying twelve facial actions. The ICA representation is 90\% more computationally efficient than the Gabor representation due to the large difference in the number of kernels. The results provide evidence for the importance of using local image bases, high spatial frequencies, and statistical independence for classifying facial actions.

Recognizing Evoked Potentials in a Virtual Environment (gzip'd) -- Jessica D. Bayliss, Dana H. Ballard
Virtual reality (VR) provides immersive and controllable experimental environments. It expands the bounds of possible evoked potential (EP) experiments by providing complex, dynamic environments in order to study cognition without sacrificing environmental control. Unfortunately, a more complex and natural'' VR environment can encourage eye movement as well as other kinds of artifact. We discuss problems with EP work in a virtual environment and how they can be overcome in the context of an experiment to detect EP's at red, green, and yellow stoplights in a virtual driving environment. Experimental results show the existence of the P3 EP at green/red lights and the contingent negative variation (CNV) EP at yellow lights. Since VR serves as a safe dynamic testbed for brain-computer interface (BCI) research, we looked at detecting the existence of the P3 EP at red lights and the absence of this signal at yellow lights. We present off- and on-line single trial recognition results and show that the P3 may successfully be used to control the brakes of a VR car at stoplights with an average accuracy of 84.5\%.

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks (gzip'd) -- Yoshua Bengio, Samy Bengio
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow only as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

Robust Neural Network Regression for Offline and Online Learning -- Thomas Briegel, Volker Tresp
We replace the commonly used Gaussian noise model in nonlinear regression by a more flexible noise model based on the Student-$t$-distribution. The degrees of freedom of the $t$-distribution can be chosen such that as special cases either the Gaussian distribution or the Cauchy distribution are realized. The latter is commonly used in robust regression. Since the $t$-distribution can be interpreted as being an infinite mixture of Gaussians, parameters and hyperparameters such as the degrees of freedom of the $t$-distribution can be learned from the data based on an EM-learning algorithm. We show that modeling using the $t$-distribution leads to improved predictors on real world data sets. In particular, if outliers are present, the $t$-distribution is superior to the Gaussian noise model. In effect, by adapting the degrees of freedom, the system can learn'' to distinguish between outliers and non-outliers. Especially for online learning tasks, one is interested in avoiding inappropriate weight changes due to measurement outliers to maintain stable online learning capability. We show experimentally that using the $t$-distribution as a noise model leads to stable online learning algorithms and outperforms state-of-the art online learning methods like the extended Kalman filter algorithm.

Low Power Communications via Reinforcement Learning (gzip'd) -- Timothy X. Brown
This paper examines the application of reinforcement learning to a wireless communication problem. The problem requires that channel utility be maximized while simultaneously minimizing battery usage. We present a solution to this multi-criteria problem that is able to significantly reduce power consumption. The solution uses a variable discount factor to capture the effects of battery usage.

An Oscillatory Correlation Framework for Computational Auditory Scene Analysis (gzip'd) -- Guy J. Brown, DeLiang L. Wang
A neural model is described which uses oscillatory correlation to segregate speech from interfering sound sources. The core of the model is a two-layer neural oscillator network. A sound stream is represented by a synchronized population of oscillators, and different streams are represented by desynchronized oscillator populations. The model has been evaluated using a corpus of speech mixed with interfering sounds, and produces an improvement in signal-to-noise ratio for every mixture.

Model Selection in Clustering by Uniform Convergence Bounds (gzip'd) -- Joachim M. Buhmann, Marcus Held
Unsupervised learning algorithms are designed to extract structure from data samples. Reliable and robust inference requires a guarantee that extracted structures are typical for the data source, i.e., similar structures have to be infered from a second sample set of the same data source. The overfitting phenomenon in maximum entropy based annealing algorithms is exemplarily studied for a class of histogram clustering models. Bernstein's inequality for large deviations is used to determine the maximal achievable approximation quality parameterized by a minimal temperature. Monte Carlo simulations support the proposed model selection criterion by finite temperature annealing.

Uniqueness of the SVM Solution (gzip'd) -- Chris Burges, David Crisp
We give necessary and sufficient conditions for uniqueness of the support vector solution for the problems of pattern recognition and regression estimation, for a general class of cost functions. We show that if the solution is not unique, all support vectors are necessarily always at bound, and we give some simple examples of non-unique solutions. We note that uniqueness of the primal (dual) solution does not necessarily imply uniqueness of the dual (primal) solution. We show how to compute the threshold $b$ when the solution is unique, but when all support vectors are at bound, in which case the usual method for determining $b$ does not work.

#### C

Reconstruction of Sequential Data with Probabilistic Models and Continuity Constraints (gzip'd) -- Miguel A. Carreira-Perpinan
We consider the problem of reconstructing a temporal discrete sequence of multidimensional real vectors when part of the data is missing, under the assumption that the sequence was generated by a continuous process. A particular case of this problem is multivariate regression, which is very difficult when the underlying mapping is one-to-many. We propose an algorithm based on a joint probability model of the variables of interest, implemented using a nonlinear latent variable model. Each point in the sequence is potentially reconstructed as any of the modes of the conditional distribution of the missing variables given the present variables (computed using an exhaustive mode search in a Gaussian mixture). Mode selection is determined by a dynamic programming search that minimises a geometric measure of the reconstructed sequence, derived from continuity constraints. We illustrate the algorithm with a toy example and apply it to a real-world inverse problem, the acoustic-to-articulatory mapping. The results show that the algorithm outperforms conditional mean imputation and multilayer perceptrons.

Model Selection for Support Vector Machines (gzip'd) -- Olivier Chapelle, Vladimir N. Vapnik
New functionals for parameter (model) selection of Support Vector Machines are introduced based on the concepts of the {\em span} of support vectors and rescaling of the feature space. It is shown that using these functionals, one can both predict the best choice of parameters of the model and the relative quality of performance for any value of parameter.

Transductive Inference for Estimating Values of Functions (gzip'd) -- Olivier Chapelle, Vladimir N. Vapnik, Jason Weston
We introduce an algorithm for estimating the values of function on specific points $x^*_1,...x^*_m$ given a set of training points $(x_1,y_1),...,(x_\ell,y_\ell)$ without estimating (as an intermediate step) the regression function. We demonstrate that this direct (transductive) way for estimating values of regression (or classification in pattern recognition) is more accurate than the traditional one based on two steps, first estimating the function and then calculating the values of this function at the point of interest.

Effective Learning Requires Neuronal Remodeling of Hebbian Synapses -- Gal Chechik, Isaac Meilijson, Eytan Ruppin
This paper revisits the classical neuroscience paradigm of Hebbian learning. We show that a necessary requirement for effective associative memory learning is that the efficacies of the incoming synapses should be uncorrelated. This requirement is difficult to achieve in a robust manner by Hebbian synaptic learning, since it depends on network level information. Effective learning can yet be obtained by a neuronal process that maintains a zero sum of the incoming synaptic efficacies. This normalization drastically improves the memory capacity of associative networks, from an essentially bounded capacity to one that linearly scales with the network's size. Such neuronal normalization can be successfully carried out by activity-dependent homeostasis of the neuron's synaptic efficacies, which was recently observed in cortical tissue. Thus, our findings strongly suggest that effective associative learning with Hebbian synapses alone is biologically implausible and that Hebbian synapses must be continuously remodeled by neuronally-driven regulatory processes in the brain.

Optimal Sizes of Dendritic and Axonal Arbors -- Dmitri B. Chklovskii
I consider a topographic projection between two neuronal layers with different densities of neurons. Given the number of output neurons connected to each input neuron (divergence or fan-out) and the number of input neurons synapsing on each output neuron (convergence or fan-in) I determine the widths of axonal and dendritic arbors which minimize the total volume of axons and dendrites. My analytical results can be summarized qualitatively in the following rule: neurons of the sparser layer should have arbors wider than those of the denser layer. This agrees with the anatomical data from retinal and cerebellar neurons whose morphology and connectivity are known. The rule may be used to infer connectivity of neurons from their morphology.

Wiring Optimization in the Brain -- Dmitri B. Chklovskii, Charles F. Stevens
The complexity of cortical circuits may be characterized by the number of synapses per neuron. We study the dependence of complexity on the fraction of the cortical volume that is made up ofwire'' (that is, of axons and dendrites), and find that complexity is maximized when wire takes up about 60\% of the cortical volume. This prediction is in good agreement with experimental observations. A consequence of our arguments is that any rearrangement of neurons that takes more wire would sacrifice computational power.

An Environment Model for Nonstationary Reinforcement Learning (gzip'd) -- Samuel P. Choi, Dit-Yan Yeung, Nevin L. Zhang
Reinforcement learning in nonstationary environments is generally regarded as an important and yet difficult problem. This paper partially addresses the problem by formalizing a subclass of nonstationary environments. The environment model, called hidden-mode Markov decision process (HM-MDP), assumes that environmental changes are always confined to a small number of hidden modes. A mode basically indexes a Markov decision process (MDP) and evolves with time according to a Markov chain. While HM-MDP is a special case of partially observable Markov decision processes (POMDP), modeling an HM-MDP environment via the more general POMDP model unnecessarily increases the problem complexity. A variant of the Baum-Welch algorithm is developed for model learning requiring less data and time.

Dynamics of Supervised Learning with Restricted Training Sets and Noisy Teachers -- A.C.C. Coolen, C.W.H. Mace
We generalize a recent formalism to describe the dynamics of supervised learning in layered neural networks, in the regime where data recycling is inevitable, to the case of noisy teachers. Our theory generates predictions for the evolution in time of training- and generalization errors, and extends the class of mathematically solvable learning processes in large neural networks to those complicated situations where overfitting occurs.

A Geometric Interpretation of $\nu$-SVM Classifiers (gzip'd) -- David Crisp, Chris Burges
We show that the recently proposed variant of the Support Vector machine (SVM) algorithm, known as $\nu$-SVM, can be interpreted as a maximal separation between subsets of the convex hulls of the data, which we call soft convex hulls. The soft convex hulls are controlled by choice of the parameter $\nu$. If the intersection of the convex hulls is empty, the hyperplane is positioned halfway between them such that the distance between convex hulls, measured along the normal, is maximized; and if it is not, the hyperplane's normal is similarly determined by the soft convex hulls, but its position (perpendicular distance from the origin) is adjusted to minimize the error sum. The proposed geometric interpretation of $\nu$-SVM also leads to necessary and sufficient conditions for the existence of a choice of $\nu$ for which the $\nu$-SVM solution is nontrivial.

Efficient Approaches to Gaussian Process Classification (gzip'd) -- Lehel Csato, Ernest Fokue, Manfred Opper, Bernhard Schottky, Ole Winther
We present three simple approximations for the calculation of the posterior mean in Gaussian Process classification. The first two methods are related to mean field ideas known in Statistical Physics. The third approach is based on Bayesian online approach which was motivated by recent results in the Statistical Mechanics of Neural Networks. We present simulation results showing: 1.\ that the mean field Bayesian evidence may be used for hyperparameter tuning and 2.\ that the online approach may achieve a low training error fast.

#### D

A Neurodynamical Approach to Visual Attention -- Gustavo Deco, Josef Zihl
We propose a system of interconnected modules consisting in populations of neurons for modeling the underlying mechanisms involved in selective visual attention. We demonstrate that it is plausible to build a neural system for visual search, which works across the visual field in parallel but due to the different intrinsic dynamics can show the two experimentally observed modes of visual attention, namely: the serial focal and the parallel spread over the space mode. In other words, neither explicit serial focal search nor saliencies maps should be assumed. The focus of attention is not included in the system but is a result of the convergence of the dynamic behavior of the neural networks. The dynamics of the system can be interpreted as an intrinsic dynamical routing for binding features if top-down information is available. The neural population dynamics are handled in the framework of the mean-field approximation. Consequently, the whole process can be expressed as a system of coupled differential equations.

State Abstraction in MAXQ Hierarchical Reinforcement Learning (gzip'd) -- Thomas G. Dietterich
Many researchers have explored methods for hierarchical reinforcement learning (RL) with temporal abstractions, in which abstract actions are defined that can perform many primitive actions before terminating. However, little is known about learning with state abstractions, in which aspects of the state space are ignored. In previous work, we developed the MAXQ method for hierarchical RL. In this paper, we define five conditions under which state abstraction can be combined with the MAXQ value function decomposition. We prove that the MAXQ-Q learning algorithm converges under these conditions and show experimentally that state abstraction is important for the successful application of MAXQ-Q learning.

The Nonnegative Boltzmann Machine (gzip'd) -- Oliver B. Downs, David J.C. MacKay, Daniel D. Lee
The nonnegative Boltzmann machine (NNBM) is a recurrent neural network model that can describe multimodal nonnegative data. Application of maximum likelihood estimation to this model gives a learning rule that is analogous to the binary Boltzmann machine. We examine the utility of the mean field approximation for the NNBM, and describe how Monte Carlo sampling techniques can be used to learn the parameters of the NNBM. Reflective slice sampling is particularly well-suited for this distribution, and can efficiently be implemented to sample the distribution. We illustrate learning of the NNBM on a translationally invariant distribution, as well as on a generative model for images of human faces.

Potential Boosters? -- Nigel Duffy, David Helmbold
Recent interpretations of the Adaboost algorithm view it as performing a gradient descent on a potential function. Simply changing the potential function allows one to create new algorithms related to AdaBoost. However, these new algorithms are generally not known to have the formal boosting property. This paper examines the question of which potential functions lead to new algorithms that are boosters. The two main results are general sets of conditions on the potential; one set implies that the resulting algorithm is a booster, while the other implies that the algorithm is not. These conditions are applied to previously studied potential functions, such as those used by LogitBoost and Doom II.

#### E

Invited Talk: Sound Processing for Cochlear Implants: Rationale, Implementation and Patient Performance -- Donald K. Eddington
Cochlear implants are electronic devices designed to take advantage of the excitable, auditory nerve fibers that remain in most deaf individuals. These devices produce sound sensations by translating acoustic signals into electric stimuli that are delivered to the nerve fibers by electrodes implanted in the patient's cochlea. By modulating the stimuli based on the acoustic input, patterns of spike activity are elicited that are designed to produce hearing sensations that patients are able to interpret. This presentation will describe the signal processing employed by current devices and present speech-reception data that illustrate the limitations these techniques impose.

Invited Talk: Global Spatial Patterning Through Distance and Delay -- Bard Ermentrout
This talk will focus on spatial pattern formation in networks of neurons in which there is underlying temporal dynamics. This results in some new ways in which connectivity that would normally not result in any kind of instabilities can lead to large amplitude spatial patterns. Recent work by Koch and Leisman and by Jirsa et al exploits this idea to generate global patterns from local heterogeneities. We focus on oscillatory and excitable networks. We discuss the general principles that underly global spatial patterns and how delays (explicit or dynamic) conspire to create global spatial patterns. We apply these general ideas to propagating waves in thalamic networks, emergence of waves in central pattern generators, and to some simple continuum cortical networks.

Neural Representation of Multi-Dimensional Stimuli (gzip'd) -- Christian W. Eurich, Stefan D. Wilke, Helmut Schwegler
The encoding accuracy of a population of stochastically spiking neurons is studied for different distributions of their tuning widths. The situation of identical radially symmetric receptive fields for all neurons, which is usually considered in the literature, turns out to be disadvantageous from an information-theoretic point of view. Both a variability of tuning widths and a fragmentation of the neural population into specialized subpopulations improve the encoding accuracy.

#### F

Learning Informative Statistics: A Nonparametric Approach -- John W. Fisher~III, Alexander T. Ihler, Paul Viola
We discuss an information theoretic approach for categorizing and modeling dynamic processes. The approach can learn a compact and informative statistic which summarizes past states to predict future observations. Furthermore, the uncertainty of the prediction is characterized nonparametrically by a joint density over the learned statistic and present observation. We discuss the application of the technique to both noise driven dynamical systems and random processes sampled from a density which is conditioned on the past. In the first case we show results in which both the dynamics of random walk and the statistics of the driving noise are captured. In the second case we present results in which a summarizing statistic is learned on noisy random telegraph waves with differing dependencies on past states. In both cases the algorithm yields a principled approach for discriminating processes with differing dynamics and/or dependencies. The method is grounded in ideas from information theory and nonparametric statistics.

Differentiating Functions of the Jacobian with Respect to the Weights -- Gary William Flake, Barak A. Pearlmutter
For many problems, the correct behavior of a model depends not only on its input-output mapping but also on properties of its Jacobian matrix, the matrix of partial derivatives of the model's outputs with respect to its inputs. We introduce the J-prop algorithm, an efficient general method for computing the exact partial derivatives of a variety of simple functions of the Jacobian of a model with respect to its free parameters. The algorithm applies to any parametrized feedforward model, including nonlinear regression, multilayer perceptrons, and radial basis function networks.

Local Probability Propagation for Factor Analysis -- Brendan Frey
Ever since Pearl's probability propagation algorithm in graphs with cycles was shown to produce excellent results for error-correcting decoding a few years ago, we have been curious about whether local probability propagation could be used successfully for machine learning. One of the simplest adaptive models is the factor analyzer, which is a two-layer network that models bottom layer sensory inputs as a linear combination of top layer factors plus independent Gaussian sensor noise. The number of bottom-up/top-down iterations needed to exactly infer the factors given a network and an input is equal to the number of factors in the top layer. In online learning, this iterative procedure must be reinitialized upon each pattern presentation and so learning becomes prohibitively slow in big networks, such as those used for face recognition and for large-scale models of the cortex. We show that local probability propagation in the factor analyzer network usually takes just a few iterations to perform accurate inference, even in networks with 320 sensors and 80 factors. We derive an expression for the algorithm's fixed point and show that this fixed point matches the exact solution in a variety of networks, even when the fixed point is unstable. We also show that this method can be used successfully to perform inference for approximate EM and we give results on face recognition.

#### G

Variational Inference for Bayesian Mixture of Factor Analysers (gzip'd) -- Zoubin Ghahramani, Matthew J. Beal
We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out, the method is not prone to overfitting. Using a stochastic procedure for adding components, it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples.

Effects of Spatial and Temporal Contiguity on the Acquisition of Spatial Information -- Thea Ghiselli-Crippa, Paul Munro
Spatial information comes in two forms: direct spatial information (for example, retinal position) and indirect temporal contiguity information, since objects encountered sequentially are spatially close. The acquisition of spatial information by a neural network is investigated here. Given a spatial layout of several objects, networks are trained on a prediction task. Networks using temporal sequences with no direct spatial information are found to develop internal representations that show distances correlated with distances in the external layout. The influence of spatial information is analyzed by providing direct spatial information to the system during training that is either consistent with the layout or inconsistent with it. This approach allows examination of the relative contributions of spatial and temporal contiguity.

Kirchoff Law Markov Fields for Analog Circuit Design -- Richard M. Golden
Three contributions to developing an algorithm for assisting engineers in designing analog circuits are provided in this paper. First, a method for representing highly nonlinear and non-continuous analog circuits using Kirchoff current law potential functions within the context of a Markov field is described. Second, a relatively efficient algorithm for optimizing the Markov field objective function is briefly described and the convergence proof is briefly sketched. And third, empirical results illustrating the strengths and limitations of the approach are provided within the context of a JFET transistor design problem. The proposed algorithm generated a set of circuit components for the JFET circuit model that accurately generated the desired characteristic curves.

Bayesian Transduction -- Thore Graepel, Ralf Herbrich, Klaus Obermayer
Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample. Hence, transduction is a less ambitious task than induction which aims at inferring a functional dependency on the whole of input space. As a consequence, however, transduction provides a confidence measure on single predictions rather than classifiers, a feature particularly important for risk--sensitive applications. We consider the case of binary classification by linear discriminant functions (perceptrons) in kernel space. From the transductive point of view, the infinite number of perceptrons is boiled down to a finite number of equivalence classes on the working sample each of which corresponds to a polyhedron in parameter space. In the Bayesian spirit the posteriori probability of a labelling of the working sample is determined as the ratio between the volume of the corresponding polyhedron and the volume of version space. Then the maximum posteriori scheme recommends to choose the labelling of maximum volume. We suggest to sample version space by an ergodic billiard in kernel space. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posteriori probability of labellings is used as a confidence measure to exclude test points of low confidence.

#### H

Bayesian Averaging is Well-Temperated -- Lars Kai Hansen
Bayesian predictions are stochastic just like predictions of any other inference scheme that generalizes from a finite sample. While a simple variational argument shows that Bayes averaging is generalization optimal given that the prior matches the teacher parameter distribution, the situation is less clear if the teacher distribution is unknown. I define a class of averaging procedures, the temperated likelihoods, including both Bayes averaging with a uniform prior and maximum likelihood estimation as special cases. I show that Bayes is generalization optimal in this family for any teacher distribution for two learning problems that are analytically tractable: learning the mean parameter of a Gaussian and asymptotics of smooth learners.

Audio Vision: Using Audio-Visual Synchrony to Locate Sounds (gzip'd) -- John Hershey, Hiroshi Ishiguro, Javier R. Movellan
Psychophysical and physiological evidence shows that sound localization of acoustic signals is strongly influenced by their synchrony with visual signals. This effect, known as ventriloquism, is at work when sound coming from the side of a TV set feels as if it were coming from the mouth of the actors. The ventriloquism effect suggests that there is important information about sound location encoded in the synchrony between the audio and video signals. In spite of this evidence, audiovisual synchrony is rarely used as a source of information in computer vision tasks. In this paper we explore the use of audio visual synchrony to locate sound sources. We developed a system that searches for regions of the visual landscape that correlate highly with the acoustic signals and tags them as likely to contain an acoustic source. We discuss our experience implementing the system, present results on a speaker localization task and discuss potential applications of the approach.

Spiking Belief Networks -- Geoffrey E. Hinton, Andrew D. Brown
We first show how to represent sharp posterior probability distributions using real valued coefficients on broadly-tuned basis functions. Then we show how the precise times of spikes can be used to convey the real-valued coefficients on the basis functions quickly and accurately. Finally we describe a simple simulation in which spiking neurons learn to model an image sequence by fitting a dynamic generative model.

Learning to Parse Images (gzip'd) -- Geoffrey E. Hinton, Zoubin Ghahramani, Yee Whye Teh
We describe a class of probabilistic models that we call credibility networks. Using parse trees as internal representations of images, credibility networks are able to perform segmentation and recognition simultaneously, removing the need for ad hoc segmentation heuristics. Promising results in the problem of segmenting handwritten digits were obtained.

Invited Talk: Animation of Human Motion -- Jessica K. Hodgins
Computer animations and virtual environments both require a source of motion for their characters. We are exploring one possible solution to this problem: applying control algorithms to physically realistic models of the systems that we would like to animate. By using these techniques to simulate humans, we are working towards avatars that are responsive to the user's subtle gestures and interactive agents that respond appropriately to events in a virtual environment. For example, we have developed control algorithms that allow rigid body models to run or bicycle at a variety of speeds, bounce on a trampoline, and to perform a handspring vault and several platform dives. To facilitate the development of new characters for an animation, we have developed algorithms that adapt existing control algorithms to a new character in a semi-automatic fashion. Recently, we have begun to use human data to adjust the behavior of the control systems. Because our goal is natural looking motion, we compare the computed motion for each simulation to that of humans performing similar maneuvers. We perform the comparison qualitatively with real video images and quantitatively with biomechanical data.

Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization (gzip'd) -- Thomas Hofmann
The project pursued in this paper is to develop from first principles a general machine learning approach to learn the similarity between text documents. We utilize a statistical latent class model to generate a decomposition of document collections in terms of topic factors. From this model a canonical kernel, the Fisher kernel, is derived within the theoretical framework of information geometry. The Fisher kernel provides a similarity function that can be used for unsupervised and supervised learning problems alike. This in particular covers the interesting case where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of this approach.

Bayesian Modelling of fMRI Time Series -- Pedro H{\o}jen-S{\o}rensen, Lars Kai Hansen, Carl Edward Rasmussen
We present a Hidden Markov Model (HMM) for infering the hidden psychological state (or neural activity) during single trial fMRI activation experiments with blocked task paradigms. Inference is based on Bayesian methodology, using a combination of analytical and a variety of Markov Chain Monte Carlo (MCMC) sampling techniques. The advantage of this method is that detection of short time learning effects between repeated trials is possible since inference is based only on single trial experiments.

Distributed Synchrony of Spiking Neurons in a Hebbian Cell Assembly (gzip'd) -- David Horn, Nir Levy, Isaac Meilijson, Eytan Ruppin
We investigate the behavior of a Hebbian cell assembly of spiking neurons formed via a temporal synaptic learning curve. This learning function is based on recent experimental findings. It includes potentiation for short time delays between pre- and post-synaptic neuronal spiking, and depression for spiking events occuring in the reverse order. The coupling between the dynamics of synaptic learning and neuronal activation leads to interesting results. We find that, in addition to synchronous or asynchronous modes, the Hebbian cell assembly can be realized in a firing pattern of distributed synchrony. The latter implies spontaneous division of the Hebbian cell assembly into groups of cells that fire in a cyclic manner. We investigate the behavior of distributed synchrony both by simulations and by analytic calculations of the resulting synaptic distributions.

Bayesian Reconstruction of 3D Human Motion from Single-Camera Video (gzip'd) -- Nicholas R. Howe, Michael E. Leventon, William T. Freeman
Three-dimensional motion capture for human subjects is underdetermined when the input is limited to a single camera, due to the inherent 3D ambiguity of 2D video. We present a system that reconstructs the 3D motion of human subjects from single-camera video, relying on prior knowledge about human motion, learned from training data, to resolve those ambiguities. After initialization in 2D, the tracking and 3D reconstruction is automatic; we show results for several video sequences. The results show the power of treating 3D body tracking as an inference problem.

Emergence of Topography and Complex Cell Properties from Natural Images using Extensions of ICA (gzip'd) -- Aapo Hyv{\"a}rinen, Patrik Hoyer
Independent component analysis of natural images leads to emergence of simple cell properties, i.e. linear filters that resemble wavelets or Gabor functions. In this paper, we extend ICA to explain further properties of V1 cells. First, we decompose natural images into independent subspaces instead of scalar components. This model leads to emergence of phase and shift invariant features, similar to those in V1 complex cells. Second, we define a topography between the linear components obtained by ICA. The topographic distance between two components is defined by their higher-order correlations, so that the components are close to each other in the topography if they are strongly dependent on each other. This leads to simultaneous emergence of both topography and invariances similar to complex cell properties.

#### I

The Parallel Problems Server: an Interactive Tool for Large Scale Machine Learning (gzip'd) -- Charles Lee Isbell,~Jr., Parry Husbands
Imagine that you wish to classify data consisting of tens of thousands of examples residing in a twenty thousand dimensional space. How can one apply standard machine learning algorithms? We describe the Parallel Problems Server (PPServer) and MATLAB*P. In tandem they allow users of networked computers to work transparently on large data sets from within Matlab. This work is motivated by the desire to bring the many benefits of scientific computing algorithms and computational power to machine learning researchers. We demonstrate the usefulness of the system on a number of tasks. For example, we perform {\em independent components analysis} on very large text corpora consisting of tens of thousands of documents, making minimal changes to the original Bell and Sejnowski Matlab source. Applying ML techniques to data previously beyond their reach leads to interesting analyses of both data and algorithms.

#### J

Maximum Entropy Discrimination -- Tommi Jaakkola, Marina Meila, Tony Jebara
We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework. Preliminary experimental results are indicative of the potential in these techniques.

Neural System Model of Human Sound Localization -- Craig T. Jin, Simon Carlile
This paper examines the role of biological constraints in the human auditory localization process. A psychophysical and neural system modeling approach was undertaken in which performance comparisons between competing models and the human subject provided the foundation for understanding the relevant biologically plausible realism constraints''. The directional acoustical cues, upon which sound localization is based, were derived from the human subject's head-related transfer functions (HRTFs). Sound stimuli were generated by convolving bandpass noise with the HRTFs and were presented to both the subject and the model. The input stimuli to the model was processed using the Auditory Image Model of cochlear processing. The cochlear data was then analyzed by a time-delay neural network which integrated temporal and spectral information to determine the spatial location of the sound source. The combined cochlear model and neural network provided a system model of the sound localization process in which measurable human-like localization performance was achieved in relationship to frequency division or tonotopicity, sound level variations, band-pass sounds with restricted frequencies, and natural'' listening conditions with variable training sounds.

Spectral Cues in Human Sound Localization -- Craig T. Jin, Anna Corderoy, Simon Carlile, Andr\'e van Schaik
In this paper we study the differential contribution of the monaural and interaural spectral cues to human sound localization. A psychophysical and analytical approach was undertaken, in which the cues to a sound's location were correlated on an individual basis with the human localization data. The spectral cues derive from the acoustical filtering of an individual's outer ear (represented by the recorded head-related transfer functions, HRTFs). Psychoacoustical experiments were conducted in virtual auditory space (VAS) in which the amplitude spectra of the sound stimulus was varied {\it independently} at each ear while preserving the normal timing cues, an impossibility in the free-field environment. The sound localization performance of four normal hearing subjects was measured in VAS using broadband sound stimuli. These stimuli were filtered using the subject's HRTFs to produce three sound conditions in which the monaural and interaural spectral cues were systematically varied. All subjects showed systematic mislocalizations on the spectrally altered sounds. The pattern of mislocalizations varied among subjects but in a systematic manner related to the available acoustical cues. The analysis of the different cues along with the subjects' localization responses suggests there are significant differences with respect to the monaural and interaural spectral cues and that the auditory system's reliance on the spectral cues varies with the sound condition.

Topographic Transformation as a Discrete Latent Variable (gzip'd) -- Nebojsa Jojic, Brendan Frey
Invariance to topographic transformations such as translation and shearing in an image has been successfully incorporated into feedforward mechanisms, eg, convolutional neural networks'', tangent propagation''. We describe a way to add transformation invariance to a generative density model by approximating the nonlinear transformation manifold by a discrete set of transformations. An EM algorithm for the original model can be extended to the new model by computing expectations over the set of transformations. We show how to add a discrete transformation variable to Gaussian mixture modeling, factor analysis and mixtures of factor analysis. We give results on filtering microscopy images, face and facial pose clustering, and handwritten digit modeling and recognition.

Broadband DOA Estimation Based on Second Order Statistics (gzip'd) -- Alexander Jourjine, Joseph O'Ruanaidh, Justinian Rosca, Scott Rickard
A parametric time-delay model of mixing is introduced. For N sources it is defined by an NxN matrix of source attenuation coefficients and an NxN matrix of delays in signal propagation times. It is shown, using TITO problem as a basis, that N statistically orthogonal sources can be separated blindly from N time-delay mixtures using only second order statistics. The separation is performed by estimation of the attenuation and delay parameters, which is done by solving non-linear cross-correlation constraints resulting from the statistical orthogonality of the sources. The parameters are then used to construct a demixing matrix of a known parametric form. A solution to the channel selection problem is presented together with a method of removal of demixing artifacts by application of Wiener filter. The results are verified in an example of speech signals by using both time and frequency domain algorithms.

#### K

Regular and Irregular Gallager-type Error-Correcting Codes (gzip'd) -- Yoshiyuki Kabashima, Tatsuto Murayama, David Saad, Renato Vicente
The performance of regular and irregular Gallager-type error-correcting codes is investigated via methods of statistical physics. The transmitted codeword comprises products of the original message bits selected by two randomly-constructed sparse matrices; the number of non-zero row/column elements in these matrices constitutes a family of codes. We show that Shannon's channel capacity is saturated for many of the regular codes while slightly lower performance is obtained for others which may be of higher practical relevance. Decoding aspects are considered by employing the TAP approach which is identical to the commonly used belief-propagation-based decoding. We show that irregular codes also saturate Shannon's capacity but have improved dynamical (decoding) properties.

Acquisition in Autoshaping -- Sham Kakade, Peter Dayan
Quantitative data on the speed with which animals acquire behavioral responses during classical conditioning experiments should provide strong constraints on models of learning. However, most models have simply ignored these data; the few that have attempted to address them have failed by at least an order of magnitude. We discuss key data on the speed of acquisition, and show how to account for them using a statistically sound model of learning, in which differential reliabilities of stimuli play a crucial role.

Approximate Planning in Large POMDPs via Reusable Trajectories (gzip'd) -- Michael Kearns, Yishay Mansour, Andrew Y. Ng
We consider the problem of reliably choosing a near-best strategy from a restricted class of strategies $\Pi$ in a partially observable Markov decision process (POMDP). We assume we are given the ability to {\em simulate\/} the POMDP, and study what might be called the {\em sample complexity\/} --- that is, the amount of data one must generate in the POMDP in order to choose a good strategy. We prove upper bounds on the sample complexity showing that, even for {\em infinitely large and arbitrarily complex\/} POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class $\Pi$, and exponentially on the horizon time. This latter dependence can be eased in a variety of ways, including the application of gradient and local search algorithms. Our measure of complexity generalizes the classical supervised learning notion of VC dimension to the settings of reinforcement learning and planning.

Actor-Critic Algorithms -- Vijay R. Konda, John N. Tsitsiklis
We propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized stationary policies. These are two-time-scale algorithms in which the critic uses TD learning with a linear approximation architecture, and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that a set of appropriate features for the critic is prescribed by the choice of parametrization of the actor. We provide an interpretation of the gradient in terms of Riemannian geometry, and conclude by discussing convergence properties and some open problems.

#### L

An Oculo-Motor System with Multi-Chip Neuromorphic Analog VLSI Control (gzip'd) -- Oliver Landolt, Steve Gyger
A system emulating the functionality of a moving eye (hence the name oculo-motor system'') has been built and successfully tested. It is made of an optical device for shifting the field of view of an image sensor by up to 45 degrees in any direction, four neuromorphic analog VLSI circuits implementing an oculo-motor control loop, and some off-the-shelf electronics. The custom integrated circuits communicate with each other primarily by non-arbitrated address-event buses. The system implements the behaviors of saliency-based saccadic exploration, and smooth pursuit of light spots. The duration of saccades ranges from 45ms to 100ms, which is comparable to human eye performance. Smooth pursuit operates on light sources moving at up to 50 degrees per second in the visual field.

An Improved Decomposition Algorithm for Regression Support Vector Machines (gzip'd) -- Pavel Laskov
A new decomposition algorithm for training regression Support Vector Machines (SVM) is presented. The algorithm builds on the basic principles of decomposition proposed by Osuna et. al., and addresses the issue of optimal working set selection. The new criteria for testing optimality of a working set are derived. Based on these criteria, the principle of maximal inconsistency'' is proposed to form (approximately) optimal working sets. Experimental results show superior performance of the new algorithm in comparison with traditional training of regression SVM without decomposition. Similar results have been previously reported on decomposition algorithms for pattern recognition SVM. The new algorithm is also applicable to advanced SVM formulations based on regression, such as density estimation and integral equation SVM.

Selective Attention for Robust Recognition of Noisy and Superimposed Patterns (gzip'd) -- Soo-Young Lee, Michael C. Mozer
Based on the early selection'' filter model and top-down attention mechanism, a new selective attention algorithm is developed to improve recognition performance for noisy patterns and superimposed patterns. The selective attention algorithm incorporates the error backpropagation rule to adapt the attention filters for a testing input pattern and an attention cue for a specific class. For superimposed test patterns an attention-switching algorithm is also developed to recognize both patterns one by one. The developed algorithms demonstrated much higher recognition rates for corrupted digit recognition tasks.

Algorithms for Independent Components Analysis and Higher Order Statistics -- Daniel D. Lee, Uri Rokni, Haim Sompolinsky
A latent variable generative model with finite noise is used to describe several different algorithms for Independent Components Analysis (ICA). In particular, the Fixed Point ICA algorithm is shown to be equivalent to the Expectation-Maximization algorithm for maximum likelihood under certain constraints, allowing the conditions for global convergence to be elucidated. The algorithms can also be explained by their generic behavior near a singular point where the size of the optimal generative bases vanishes. An expansion of the likelihood about this singular point indicates the role of higher order correlations in determining the features discovered by ICA. The application and convergence of these algorithms are demonstrated on the learning of edge features as the independent components of natural images.

An Information-Theoretic Framework for Understanding Saccadic Behaviors (gzip'd) -- Tai Sing Lee, Stella X. Yu
In this paper, we propose that information maximization can provide a unified framework for constructing and understanding preattentive saccadic behaviors. In this framework, the mutual information among the cortical representations of the retinal image, the priors constructed from our long term visual experience, and a dynamic short-term internal representation constructed from recent saccades provides a map for eye navigation. By directing the eyes to locations of minimum mutual information at each step, the preattentive saccadic system greedily collect information about the external world. This framework provides a mathematical explanation that connects several psychological phenomena, such as pop-out and inhibition of return, to long term visual experience and short term working memory. It also provides an interesting perspective on the computation in the primary visual cortex.

Can V1 Mechanisms Account for Figure-Ground and Medial Axis Effects? (gzip'd) -- Zhaoping Li
When a visual image consists of a figure against a background, V1 cells are physiologically observed to give higher responses to image regions corresponding to the figure relative to their responses to the background. The medial axis of the figure also induces relatively higher responses compared to responses to other locations in the figure (except for the boundary between the figure and the background). Since the receptive fields of V1 cells are very small compared with the global scale of the figure-ground and medial axis effects, it has been suggested that these effects may be caused by feedback from higher visual areas. I show how these effects can be accounted for by V1 mechanisms when the size of the figure is small or is of a certain scale. They are a manifestation of the processes of pre-attentive segmentation which detect and highlight the boundaries between homogeneous image regions.

Mixture Density Estimation (gzip'd) -- Jonathan Q. Li, Andrew R. Barron
Gaussian mixtures (or so-called radial basis function networks) for density estimation provide a natural counterpart to sigmoidal neural networks for function fitting and approximation. In both cases, it is possible to give simple expressions for the iterative improvement of performance as components of the network are introduced one at a time. In particular, for mixture density estimation we show that a k-component mixture estimated by maximum likelihood (or by an iterative likelihood improvement that we introduce) achieves log-likelihood within order $1/k$ of the log-likelihood achievable by any convex combination. Consequences for approximation and estimation using Kullback-Leibler risk are also given. A Minimum Description Length principle selects the optimal number of components $k$ that minimizes the risk bound.

The Relaxed Online Maximum Margin Algorithm (gzip'd) -- Yi Li, Philip M. Long
We describe a new incremental algorithm for training linear threshold functions. Our algorithm can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen examples correctly with the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. Our algorithm works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for our algorithm that is the same as that proved for the perceptron algorithm. Our analysis implies that the more computationally intensive maximum-margin algorithm also satisfies this mistake bound; this is the first worst-case performance guarantee for this algorithm. We demonstrate that our algorithm can also be used with kernel functions to efficiently classify in a very high dimensional space. We describe some experiments using our new algorithm as a batch algorithm to recognize handwritten digits. Its computational complexity and simplicity is similar to that of perceptron algorithm, but its generalization is much better. Compared to Vapnik's maximum margin classifier, the use of our algorithm in batch mode is much simpler to implement, and much more efficient in computation time, but its generalization is not as good.

Statistical Dynamics of Batch Learning (gzip'd) -- Song Li, K.Y.Michael Wong
An important issue in neural computing concerns the description of learning dynamics with macroscopic dynamical variables. Recent progress on on-line learning only addresses the often unrealistic case of an infinite training set. We introduce a new framework to model batch learning of restricted sets of examples, widely applicable to any learning cost function, and fully taking into account the temporal correlations introduced by the recycling of the examples. For illustration we analyze the effects of weight decay and early stopping during the learning of teacher-generated examples.

Constructing Heterogeneous Committees Using Input Feature Grouping: Application to Economic Forecasting (gzip'd) -- Yuansong Liao, John Moody
The committee approach has been proposed for reducing model uncertainty and improving generalization performance. The advantage of committees depends on (1) the performance of individual members and (2) the correlational structure of errors between members. This paper presents an input grouping technique for {\it designing} a {\it heterogeneous committee}. With this technique, all input variables are first grouped based on their mutual information. Statistically similar variables are assigned to the same group. Each member's input set is then formed by input variables extracted from different groups. Our {\it designed committees} have less correlation between its members, since each member observes different input variable combinations. The individual member's feature sets contain less redundant information, because highly correlated variables will not be combined together. The member feature sets contain almost complete information, since each set contains a feature from each information group. An empirical study for a noisy and nonstationary economic forecasting problem shows that committees constructed by our proposed technique outperform committees formed using several existing techniques.

A Winner-Take-All Circuit with Controlable Soft Max Property (gzip'd) -- Shih-Chii Liu
I describe a silicon network that consists of a group of excitatory neurons and one global inhibitory neuron that can be used to normalize the output signal with respect to the number of inputs. This network models the normalization property of the wide-field direction-selective cells in the fly visual system. This normalization property is useful in any network where we wish the output signal to code only the strength of the inputs and not be dependent on the number of inputs. The circuitry in each neuron is equivalent to that in Lazzaro et al's (Lazzaro 1998) winner-take-all circuit with one additional transistor. Results from a fabricated chip of 20 neurons in a 1.2$\mu$m CMOS technology show the transition between the soft-max property and the hard winner-take-all property by the tuning of one of parameter.

Perceptual Organization Based on Temporal Dynamics (gzip'd) -- Xiuwen Liu, DeLiang L. Wang
A figure-ground segregation network is proposed based on a novel boundary pair representation. Nodes in the network are boundary segments obtained through local grouping. Each node is excitatorily coupled with the neighboring nodes that belong to the same region, and inhibitorily coupled with the corresponding paired node. Gestalt grouping rules are incorporated by modulating connections. The status of a node represents its probability being figural and is updated according to a differential equation. The system solves the figure-ground segregation problem through temporal evolution. Different perceptual phenomena, such as modal and amodal completion, virtual contours, grouping and shape decomposition are then explained through local diffusion. The system eliminates combinatorial optimization and accounts for many psychophysical results with a fixed set of parameters.

Invited Talk: How Anomalous are Anomalies in Financial Time Series? -- Andrew W. Lo
Many anomalies in financial time series have been documented, each a challenge to market efficiency. However, an equal number of studies have challenged these anomalies, arguing that they are merely symptoms of data-snooping biases and overfitting. In this paper, we propose several methods for assessing the statistical and economic significance of financial anomalies. \par We begin by providing a critical review of the financial anomalies literature, categorizing anomalies systematically and performing citation counts to evaluate the importance of each anomaly in the literature. We then develop methods---both classical and Bayesian decision-theoretic---for determining the statistical significance of an anomaly after taking into account the fact that anomalies are often obtained by extensive data-mining algorithms. We conclude with with an illustrative empirical example involving multifactor linear models of US stock returns.\\ Work done in collaboration with Li Jin (ljin@mit.edu).

#### M

Neural Computation with Winner-Take-All as the Only Nonlinear Operation (gzip'd) -- Wolfgang Maass
This article initiates a rigorous theoretical analysis of the computational power of circuits that employ modules for computing winner-take-all. Computational models that involve competitive stages have so far been neglected in computational complexity theory, although they are widely used in computational brain models, artificial neural networks, and analog VLSI. Our theoretical analysis shows that winner-take-all is a surprisingly powerful computational module in comparison with threshold gates (= McCulloch-Pitts neurons) and sigmoidal gates. We prove an optimal quadratic lower bound for computing winner-take-all in any feedforward circuit consisting of threshold gates. Furthermore we show that any threshold circuit consisting of two layers of threshold gates can be simulated by a circuit that employs a single k-winner-take-all gate as its only nonlinear operation. In addition we show that arbitrary continuous functions can be approximated by circuits employing a single soft winner-take-all gate as their only nonlinear operation. Our theoretical analysis also provides answers to two basic questions that have been raised by neurophysiologists in view of the well-known asymmetry between excitatory and inhibitory connections in cortical circuits: how much computational power of neural networks is lost if only positive weights are employed in weighted sums, and how much adaptive capability is lost if only the positive weights are subject to plasticity. We refer to {\tt http://www.cis.tu-graz.ac.at/igi/maass/} for further details.

Boosting with Multi-Way Branching in Decision Trees (gzip'd) -- Yishay Mansour, David McAllester
It is known that decision tree learning can be viewed as a form of boosting. However, existing boosting theorems for decision tree learning allow only binary-branching trees and the generalization to multi-branching trees is not immediate. Practical decision tree algorithms, such as CART and C4.5, implement a trade-off between the number of branches and the improvement in tree quality as measured by an index function. Here we give a boosting justification for a particular quantitative trade-off curve. Our main theorem states, in essence, that if we require an improvement proportional to the log of the number of branches then top-down greedy construction of decision trees remains an effective boosting algorithm.

Channel Noise in Excitable Neural Membranes (gzip'd) -- Amit Manwani, Peter N. Steinmetz, Christof Koch
Stochastic fluctuations of voltage-gated ion channels generate current and voltage noise in neuronal membranes. This noise may be a critical determinant of the efficacy of information processing within neural systems. Using Monte-Carlo simulations, we carry out a systematic investigation of the relationship between channel kinetics and the resulting membrane voltage noise using a stochastic Markov model of dendritic excitability in cortical neurons. Our simulations show that kinetic parameters which lead to an increase in membrane excitability (increasing channel densities, decreasing temperature) also lead to an increase in the magnitude of the sub-threshold voltage noise. Noise also increases as the membrane is depolarized from rest towards threshold. This suggests that channel fluctuations may interfere with a neuron's ability to function as an integrator of its synaptic inputs and may limit the reliability and precision of neural information processing.

Bayesian Network Induction via Local Neighborhoods -- Dimitris Margaritis, Sebastian Thrun
In recent years, Bayes networks have become highly successful tool for diagnosis, analysis, and decision making in real-world domains. We present an efficient and robust algorithm for learning Bayes networks from data. Our approach constructs Bayes networks by first identifying each node's Markov blankets, then connecting nodes in a consistent way. In contrast to the majority of work, which typically uses hill-climbing approaches that may produce dense nets, our approach yields much more compact networks by heeding independencies in the data. Compact networks facilitate fast inference and are also easier to understand. We prove that under mild assumptions, our approach requires time polynomial in the size of the data and the number of nodes. A Monte Carlo variant, also presented here, yields comparable results at much higher speeds.

Boosting Algorithms as Gradient Descent (gzip'd) -- Llew Mason, Jonathan Baxter, Peter Bartlett, Marcus Frean
Much recent attention, both experimental and theoretical, has been focussed on classification algorithms which produce voted combinations of classifiers. Recent theoretical work has shown that the impressive generalization performance of algorithms like AdaBoost can be attributed to the classifier having large margins on the training data. We present an abstract algorithm for finding linear combinations of functions that minimize arbitrary cost functionals (i.e., functionals that do not necessarily depend on the margin). Many existing voting methods can be shown to be special cases of this abstract algorithm. Then, following previous theoretical results bounding the generalization performance of convex combinations of classifiers in terms of general cost functions of the margin, we present a new algorithm (DOOM II) for performing a gradient descent optimization of such cost functions. Experiments on several data sets from the UC Irvine repository demonstrate that DOOM II generally outperforms AdaBoost, especially in high noise situations. Margin distribution plots verify that DOOM II is willing to give up' on examples that are too hard in order to avoid overfitting. We also show that the overfitting behaviour exhibited by AdaBoost can be quantified in terms of our proposed cost function.

A Multi-class Linear Learning Algorithm (gzip'd) -- Chris Mesterharm
In this paper, we present Committee, a new multi-class learning algorithm related to the Winnow family of algorithms. Committee is an algorithm for combining the predictions of a set of sub-experts in the on-line mistake-bounded model of learning. A sub-expert is a special type of attribute that predicts with a distribution over a finite number of classes. Committee learns a linear function of sub-experts and uses this function to make class predictions. We provide bounds for Committee that show it performs well when the prediction target can be represented by a few relevant sub-experts. We also show how Committee can be used to solve more traditional problems composed of attributes. This leads to a natural extension of Committee that learns on multi-class problems that contain both traditional attributes and sub-experts.

Invariant Feature Extraction and Classification in Kernel Spaces (gzip'd) -- Sebastian Mika, Gunnar R{\"a}tsch, Jason Weston, Bernhard Sch{\"o}lkopf, Alexander J. Smola, Klaus--Robert M{\"u}ller
We incorporate prior knowledge to construct nonlinear algorithms for invariant feature extraction and discrimination. Employing a unified framework in terms of a nonlinear variant of the Rayleigh coefficient, we propose non-linear generalizations of Fisher's discriminant and oriented PCA using Support Vector kernel functions. Extensive simulations show the utility of our approach.

From Coexpression to Coregulation: An Approach to Inferring Transcriptional Regulation among Gene Classes from Large-Scale Expression Data (gzip'd) -- Eric Mjolsness, Tobias Mann, Rebecca Castano, Barbara Wold
We provide preliminary evidence that existing algorithms for inferring small-scale gene regulation networks from gene expression data can be adapted to large-scale gene expression data coming from hybridization microarrays. The essential steps are (1) clustering many genes by their expression time-course data into a minimal set of clusters of co-expressed genes, (2) theoretically modeling the various conditions under which the time-courses are measured using a continious-time analog recurrent neural network for the cluster mean time-courses, (3) fitting such a regulatory model to the cluster mean time courses by simulated annealing with weight decay, and (4) analysing several such fits for commonalities in the circuit parameter sets including the connection matrices. This procedure can be used to assess the adequacy of existing and future gene expression time-course data sets for determining transcriptional regulatory relationships such as coregulation.

Information Factorization in Connectionist Models of Perception (gzip'd) -- Javier R. Movellan, James L. McClelland
We examine a psychophysical law that describes the influence of stimulus and context on perception. According to this law choice probability ratios factorize into components independently controlled by stimulus and context. It has been argued that this pattern of results is incompatible with feedback models of perception. In this paper we examine this claim using neural network models defined via stochastic differential equations. We show that the law is related to a condition named channel separability and has little to do with the existence of feedback connections. In essence, channels are separable if they converge into the response units without direct lateral connections to other channels and if their sensors are not directly contaminated by external inputs to the other channels. Implications of the analysis for cognitive and computational neurosicence are discussed.

Invited Talk: Deconstructing Synchrony -- J. Anthony Movshon
In the early stages of visual processing, objects and scenes are represented by neurons with small visual receptive fields. Each neuron provides information about local features of a scene, but to describe a scene in terms of objects requires that these features be combined. Objects can cover wide areas of visual space and be partially occluded by other objects, so the problem of binding the separate representations of parts into coherent wholes is not a simple one. Some theorists have advanced the view that binding is a special problem and requires a special solution because it is necessary to tag'' each visual neuron to signify the object to which its activity relates. Each neuron therefore has to carry two distinct signals, one that indicates how effective a stimulus is falling on its receptive field, and a second that tags it as a member of a particular cell assembly. To make these signals distinct, von der Malsburg proposed that the effectiveness'' signal would be carried by a conventional rate code, while the tag'' signal would be created by synchronizing the spike activity of the neuron with spikes from other neurons in the same assembly. First, I will consider whether the theory is an a priori reasonable approach to solving the binding problem, and conclude that it is at best incomplete. Next, I will ask whether spike synchrony can plausibly be used as an informational code, and conclude that there are significant practical and theoretical obstacles both to encoding and to decoding information in this way. I will examine the experimental evidence usually adduced to support the synchrony hypothesis, and conclude that the evidence is largely indirect and has no proven relevance to the issue of binding per se. I will finish by asking whether the binding problem is truly of unique difficulty and requires a unique solution, and by considering some strategies for solving the binding problem that do not require the creation of a special neural code.

Churn Reduction in the Wireless Industry (gzip'd) -- Michael C. Mozer, Richard Wolniewicz, David Grimes, Eric Johnson, Howard Kaushansky
Competition in the wireless telecommunications industry is rampant. To maintain profitability, wireless carriers must control churn, the loss of subscribers who switch from one carrier to another. We explore statistical techniques for churn prediction and, based on these predictions, an optimal policy for identifying customers to whom incentives should be offered to increase retention. Our experiments are based on a data base of nearly 47,000 U.S. domestic subscribers, and includes information about their usage, billing, credit, application, and complaint history. We show that under a wide variety of assumptions concerning the cost of intervention and the retention rate resulting from intervention, churn prediction and remediation can yield significant savings to a carrier. We also show the importance of a data representation crafted by domain experts.

LTD Facilitates Learning in a Noisy Environment -- Paul Munro, Gerardina Hernandez
Long-term potentiation (LTP) has long been held as a biological substrate for associative learning. Recently, evidence has emerged that long-term depression (LTD) results when the presynaptic cell fires after the postsynaptic cell. The computational utility of LTD is explored here. Synaptic modification kernels for both LTP and LTD have been proposed by other laboratories based studies of one postsynaptic unit. Here, the interaction between time-dependent LTP and LTD is studied in small networks.

Bayesian Map Learning in Dynamic Environments (gzip'd) -- Kevin Murphy
We consider the problem of learning a dynamic deterministic finite automaton with noisy inputs and outputs. The graph of the DFA can be thought of as a topological map, which we are trying to learn using a robot which has unreliable sensors and actuators. Furthermore, the underlying topology can change over time, to reflect the fact that, e.g., doors can open and close. We therefore consider the topology as a random variable (instead of a fixed parameter), and try to learn it using Bayesian inference (instead of EM). We use sample-based methods to do the inference approximately. We give results for a simulated dynamic grid world.

#### N

Inference for the Generalization Error (gzip'd) -- Claude Nadeau, Yoshua Bengio
We perform a theoretical investigation of the variance of the cross-validation estimate of the generalization error that takes into account the variability due to the choice of training sets. This allows us to propose two new ways to estimate this variance. We show, via simulations, that these new statistics perform well relative to the statistics considered by Dietterich (1998).

Approximate Inference Algorithms for Two-Layer Bayesian Networks (gzip'd) -- Andrew Y. Ng, Michael Jordan
We present a class of approximate inference algorithms for two-layer graphical models of the QMR-DT type. We give convergence rates for these algorithms as the networks become large (subject to conditions on the size of the weights that ensure local averaging behavior), and verify these theoretical predictions empirically. We also present empirical results on the QMR-DT diagnostic inference problem.

Policy Search via Density Estimation (gzip'd) -- Andrew Y. Ng, Ronald Parr, Daphne Koller
We propose a new approach to the problem of searching a space of stochastic controllers for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). Following several other authors, our approach is based on searching in parameterized families of policies (for example, via gradient descent) to optimize solution quality. But rather than trying to estimate the values and derivatives of a policy directly, we do so indirectly using an estimate for the probability density that the policy induces on the states at the different points in time. This enables our algorithms to exploit the many techniques for {\em efficient\/} and robust approximate density propagation in stochastic systems. We show how our techniques can be applied both to deterministic propagation schemes (where the MDP's dynamics are given explicitly in compact form,) and to stochastic propagation schemes (where we have access only to a generative model, or simulator, of the MDP). We present empirical results for both of these variants on complex problems --- one with a continuous state space and complex nonlinear dynamics, and one with a large discrete state space and a transition model specified using a dynamic Bayesian network.

#### O

Resonance in Stochastic Neuron Model with Delayed Interaction (gzip'd) -- Toru Ohira, Yuzuru Sato, Jack D. Cowan
We study here a simple stochastic single neuron model with delayed self--feedback capable of generating spike trains. The model is investigated numerically and found that its spike trains show a peculiar resonant behavior between noise'' and delay''. In order to gain the insight in this resonance, we abstract the model and study a stochastic binary element whose transition probability depends on its state at a fixed interval in the past. With this abstracted model we can analytically capture the time interval histograms between spikes and how the resonance between noise and delay arises. The resonance is also observed when such stochastic binary elements are coupled through delayed interaction.

Learning Sparse Codes with a Mixture-of-Gaussians Prior (gzip'd) -- Bruno A. Olshausen, K. Jarrod Millman
We describe an algorithm for learning an optimal set of basis functions for modeling data with sparse structure. A principal challenge in learning the basis functions is presented by the fact that when the basis set is overcomplete, the posterior probability distribution over the coefficients for each data point is difficult to integrate (which is a necessary step in the learning procedure). Previous attempts at approximating the posterior typically do not properly capture its full volume, and as the prior over the coefficients is pushed to capture higher degrees of sparseness (i.e., probability more tightly peaked at zero), the problems associated with these approximations become exacerbated. Here, we address this problem by using a mixture-of-Gaussians prior on the coefficients. The prior is formed from a linear combination of two or more Gaussian distributions: one Gaussian captures the peak at zero while the others capture the spread over the non-zero coefficient values. We show that when the prior is in such a form, there exist efficient methods for learning the basis functions as well as parameters of the prior. The performance of the algorithm is demonstrated on natural images, showing similar results to those obtained with other sparse coding algorithms. Importantly, though, since the parameters of the prior are adapted to the data, no assumption about sparse structure in the images need be made a priori, rather it is learned from the data.

Optimal Kernel Shapes for Local Linear Regression -- Dirk Ormoneit, Trevor Hastie
Local linear regression performs very well in many low-dimensional forecasting problems. In high-dimensional spaces, its performance typically decays due to the well-known curse-of-dimensionality''. A possible way to approach this problem is by varying the shape'' of the weighting kernel. In this work we suggest a new, data-driven method to estimating the optimal kernel shape. Experiments using an artificially generated data set and data from the UC Irvine repository show the benefits of kernel shaping.

#### P

Graded Grammaticality in Prediction Fractal Machines (gzip'd) -- Shan Parfitt, Peter Tino, Georg Dorffner
We introduce a novel method of constructing language models, which avoids some of the problems associated with recurrent neural networks. The method of creating a Prediction Fractal Machine (PFM) is briefly described and some experiments are presented which demonstrate the suitability of PFMs for language modeling tasks. PFMs are able to distinguish reliably between minimal pairs, and their behavior is consistent with the hypothesis that well-formedness is graded' rather than absolute. These results form the basis of a discussion of the PFM's potential to offer fresh insights into the problem of language acquisition and processing.

Unmixing Hyperspectral Data -- Lucas Parra, Clay Spence, Paul Sajda, Andreas Ziehe, Klaus--Robert M{\"u}ller
In hyperspectral imagery one pixel typically consists of a mixture of the reflectance spectra of several materials, where the mixture coefficients correspond to the abundances of the constituting materials. We assume linear combinations of reflectance spectra with some additive normal sensor noise and derive a probabilistic MAP framework for analyzing hyperspectral data. As the material reflectance characteristics are not known a priori, we face the problem of unsupervised linear unmixing. The incorporation of different prior information (e.g. positivity and normalization of the abundances) naturally leads to a family of interesting algorithms, for example in the noise-free case yielding an algorithm that can be understood as constrained independent component analysis (ICA). Simulations underline the usefulness of our theory.

A Neuromorphic VLSI System for Modeling the Neural Control of Axial Locomotion -- Girish N. Patel, Edgar A. Brown, Stephen P. DeWeerth
We have developed and tested an analog/digital VLSI system that models the coordination of biological segmental oscillators underlying axial locomotion in animals such as leeches and lampreys. In its current form, the system consists of a chain of twelve pattern generating circuits that are capable of arbitrary contralateral inhibitory synaptic coupling. Each pattern generating circuit is implemented with two independent silicon Morris-Lecar neurons with a total of 32 programmable (floating-gate based) inhibitory synapses, and an asynchronous address-event interconnection element that provides synaptic connectivity and implements axonal delay. We describe and analyze the data from a set of experiments exploring the system behavior in terms of synaptic coupling.

Bifurcation Analysis of a Silicon Neuron -- Girish N. Patel, Gennady S. Cymbalyuk, Ronald L. Calabrese, Stephen P. DeWeerth
We have developed a VLSI silicon neuron and a corresponding mathematical model that is a two state-variable system. We describe the circuit implementation and compare the behaviors observed in the silicon neuron and the mathematical model. We also perform bifurcation analysis of the mathematical model by varying the externally applied current and show that the behaviors exhibited by the silicon neuron under corresponding conditions are in good agreement to those predicted by the bifurcation analysis.

Neural Network Based Model Predictive Control -- Stephen Piche, Jim Keeler, Greg Martin, Gene Boe, Doug Johnson, Mark Gerules
Model Predictive Control (MPC), a control algorithm which uses an optimizer to solve for the optimal control moves over a future time horizon based upon a model of the process, has become a standard control technique in the process industries over the past two decades. In most industrial applications, a linear dynamic model developed using empirical data is used even though the process itself is often nonlinear. Linear models have been used because of the difficulty in developing a generic nonlinear model from empirical data and the computational expense often involved in using nonlinear models. In this paper, we present a generic neural network based technique for developing nonlinear dynamic models from empirical data and show that these models can be efficiently used in a model predictive control framework. This nonlinear MPC based approach has been successfully implemented in a number of industrial applications in the refining, petrochemical, paper, and food industries. Performance of the controller on a nonlinear industrial process, a polyethylene reactor, is presented.

Large Margin DAGs for Multiclass Classification (gzip'd) -- John C. Platt, Nello Cristianini, John Shawe-Taylor
We present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multi-class classifier. For an N-class problem, the DDAG contains N(N-1)/2 classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on N and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision-node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms.

Memory Capacity of Linear vs. Nonlinear Models of Dendritic Integration (gzip'd) -- Panayiota Poirazi, Bartlett W. Mel
We have previously shown using biophysically detailed compartmental models that nonlinear interactions between nearby synaptic inputs in the branches of an active dendritic tree can provide a layer of virtual feature detectors, which can significantly boost the cell's memory capacity [Mel B. W. 92a,b]. Our aim here has been to quantify this boost by calculating the capacity of a neuron under two different modes of dendritic integration: (1) a passive dendrite model in which synaptic inputs are combined linearly across the entire cell, followed by a single global threshold, and (2) an active dendrite model in which a threshold is applied separately to the output of each branch. We focus here on the limiting case of binary-valued synaptic weights, and derive expressions which measure model capacity by counting the number of distinct input-output functions available to both neuron types. We show that (1) the application of a fixed nonlinearity to each dendritic compartment substantially increases the model's flexibility, (2) for neurons of realistic size, the capacity of the nonlinear cell can exceed that of the same-sized linear cell by more than an order of magnitude, and (3) the largest capacity boost occurs for cells with a relatively large number of dendritic subunits of relatively small size. We validated the analysis by empirically measuring memory capacity with randomized two-class classification problems, where a stochastic delta rule was used to train both linear and nonlinear models. We found that the large capacity boosts predicted for the nonlinear dendritic model were readily achieved in practice.

#### R

Predictive Sequence Learning in Recurrent Neocortical Circuits (gzip'd) -- Rajesh P. N. Rao, Terrence J. Sejnowski
Neocortical circuits are dominated by massive excitatory feedback: more than eighty percent of the synapses made by excitatory cortical neurons are onto other excitatory cortical neurons. Why is there such massive recurrent excitation in the neocortex and what is its role in cortical computation? Recent neurophysiological experiments have shown that the plasticity of recurrent neocortical synapses is governed by a temporally asymmetric Hebbian learning rule. We describe how such a rule may allow the cortex to modify recurrent synapses for prediction of input sequences. The goal is to predict the next cortical input from the recent past based on previous experience of similar input sequences. We show that a temporal difference learning rule for prediction used in conjunction with dendritic back-propagating action potentials reproduces the temporally asymmetric Hebbian plasticity observed physiologically. Biophysical simulations demonstrate that a network of cortical neurons can learn to predict moving stimuli and develop direction selective responses as a consequence of learning. The space-time response properties of model neurons are shown to be similar to those of direction selective cells in alert monkey V1.

The Infinite Gaussian Mixture Model (gzip'd) -- Carl Edward Rasmussen
In a Bayesian mixture model it is not necessary a priori to limit the number of components to be finite. In this paper an infinite Gaussian mixture model is presented which neatly sidesteps the difficult problem of finding the right'' number of mixture components. Inference in the model is done using an efficient parameter-free Markov Chain that relies entirely on Gibbs sampling.

$\nu$-Arc: Ensemble Learning in the Presence of Outliers (gzip'd) -- Gunnar R{\"a}tsch, Bernhard Sch{\"o}lkopf, Alexander J. Smola, Klaus--Robert M{\"u}ller, Takashi Onoda, Sebastian Mika
AdaBoost and other ensemble methods have successfully been applied to a number of classification tasks, seemingly defying problems of overfitting. AdaBoost performs gradient descent in an error function with respect to the margin, asymptotically concentrating on the patterns which are hardest to learn. For very noisy problems, however, this can be disadvantageous. Indeed, theoretical analysis has shown that the margin distribution, as opposed to just the minimal margin, plays a crucial role in understanding this phenomenon. Loosely speaking, some outliers should be tolerated if this has the benefit of substantially increasing the margin on the remaining points. We propose a new boosting algorithm, $\nu$-Arc, which allows for the possibility of a pre-specified fraction of points to lie in the margin area or even on the wrong side of the decision boundary.

A Recurrent Model of the Interaction Between Prefrontal and Inferotemporal Cortex in Delay Tasks -- Alfonso Renart, Nestor Parga, Edmund T. Rolls
A very simple model of two reciprocally connected attractor neural networks is studied analytically in situations similar to those encountered in delay match-to-sample tasks with intervening stimuli and in tasks of memory guided attention. The model qualitatively reproduces many of the experimental data on these types of tasks and provides a framework for the understanding of the experimental observations in the context of the attractor neural network scenario.

Understanding Stepwise Generalization of Support Vector Machines: a Toy Model (gzip'd) -- Sebastian Risau-Gusman, Mirta B. Gordon
We study the effects of introducing structure in the input distribution of the data to be learnt by a simple perceptron. We determine the learning curves within the framework of Statistical Mechanics. Stepwise generalization occurs as a function of the number of examples when the distribution of patterns is highly anisotropic. Although extremely simple, the model seems to capture the relevant features of a class of Support Vector Machines which was recently shown to present this behavior.

Reinforcement Learning Using Approximate Belief States -- Andr\'es Rodr{\'i}guez, Ronald Parr, Daphne Koller
The problem of developing good policies for partially observable Markov decision problems (POMDPs) remains one of the most challenging areas of research in stochastic planning. One line of research in this area involves the use of reinforcement learning with belief states, probability distributions over the underlying model states. This is a promising method for small problems, but its application is limited by the intractability of computing or representing a full belief state for large problems. Recent work shows that, in many settings, we can maintain an {\em approximate belief state}, which is fairly close to the true belief state. In particular, great success has been shown with approximate belief states that ignore correlations between state variables. In this paper, we investigate two methods of full belief state reinforcement learning and one novel method for reinforcement learning using factored approximate belief states. We compare the performance of these algorithms on several well-known problem from the literature. Our results demonstrate the importance of approximate belief state representations for large problems.

Nonlinear Discriminant Analysis Using Kernel Functions (gzip'd) -- Volker Roth, Volker Steinhage
Fishers linear discriminant analysis (LDA) is a classical multivariate technique both for dimension reduction and classification. The data vectors are transformed into a low dimensional subspace such that the class centroids are spread out as much as possible. In this subspace LDA works as a simple prototype classifier. The resulting decision boundaries are linear. However, in many applications the linear boundaries do not adequately separate the classes and the possibility of modelling more complex boundaries would be desirable. In this paper we present a nonlinear generalization of discriminant analysis that implements the method of representing dot products of pattern vectors by kernel functions. This technique allows to efficiently compute discriminant functions in arbitrary feature spaces for which such kernel representations exist.

A SNoW-Based Face Detector (gzip'd) -- Dan Roth, Ming-Hsuan Yang, Narendra Ahuja
A novel learning approach for human face detection using a network of linear units is presented. The SNoW learning architecture is a sparse network of linear functions over a pre-defined or incrementally learned feature space and is specifically tailored for learning in the presence of a very large number of features. A wide range of face images in different poses, with different expressions and under different lighting conditions are used as a training set to capture the variations of human faces. Experimental results on commonly used benchmark data sets of a wide range of face images show that the SNoW-based approach outperforms methods that use neural networks, Bayesian methods, support vector machines and others. Furthermore, learning and evaluation using the SNoW-based method are significantly more efficient than with other methods.

Constrained Hidden Markov Models (gzip'd) -- Sam Roweis
By thinking of each state in a hidden Markov model as corresponding to some spatial region of a fictitioustopology space'' it is possible to naturally define neighbouring states of any state as those which are connected in that space. The transition matrix of the HMM can then be constrained to allow transitions only between neighbours; this means that all valid state sequences correspond to connected paths in the topology space. This strong constraint makes structure discovery in sequences easier. I show how such constrained HMMs can learn to discover underlying structure in complex sequences of high dimensional data, and apply them to the problem of recovering mouth movements from acoustic observations in continuous speech.

Coastal Navigation with Mobile Robots (gzip'd) -- Nicholas Roy, Sebastian Thrun
The problem that we address in this paper is how a mobile robot can plan in order to arrive at its goal with minimum uncertainty. Traditional motion planning algorithms often assume that a mobile robot can track its position reliably, however, in real world situations, reliable localization may not always be feasible. Partially Observable Markov Decision Processes (POMDPs) provide one way to maximize the certainty of reaching the goal state, but at the cost of computational intractability for large state spaces. The method we propose explicitly models the uncertainty of the robot's position as a state variable, and generates trajectories through the augmented pose-uncertainty space. By minimizing the positional uncertainty at the goal, the robot reduces the likelihood it becomes lost. We demonstrate experimentally that coastal navigation reduces the uncertainty at the goal, especially with degraded localization.

An Analysis of Turbo Decoding with Gaussian Densities -- Paat Rusmevichientong, Benjamin Van Roy
We provide an analysis of the turbo decoding algorithm (TDA) in a setting involving Gaussian densities. In this context, we are able to show that the algorithm converges and that -- somewhat surprisingly --though the density generated by the TDA may differ significantly from the desired posterior density, the means of these two densities coincide.

#### S

Learning Factored Representations for Partially Observable Markov Decision Processes (gzip'd) -- Brian Sallans
The problem of reinforcement learning in a non-Markov environment is explored using a dynamic Bayesian network, where conditional independence assumptions between random variables are compactly represented by network parameters. The parameters are learned on-line, and approximations are used to perform inference and to compute the optimal value function. The relative effects of inference and value function approximations on the quality of the final policy are investigated, by learning to solve a moderately difficult driving task. The two value function approximations, linear and quadratic, were found to perform similarly, but the quadratic model was more sensitive to initialization. Both performed below the level of human performance on the task. The dynamic Bayesian network performed comparably to a model using an HMM-style representation, while requiring exponentially fewer parameters.

An Analog VLSI Model of Periodicity Extraction -- Andr\'e van Schaik
This paper presents an electronic system that extracts the periodicity of a sound. It uses three analogue VLSI building blocks: a silicon cochlea, two inner-hair-cell circuits and two spiking neuron chips. The silicon cochlea consists of a cascade of filters. Because of the delay between two outputs from the silicon cochlea, spike trains created at these outputs are synchronous only for a narrow range of periodicities. In contrast to traditional band-pass filters, where an increase in selectivity has to be traded off against decrease in response time, the proposed system responds quickly, independent of selectivity.

Lower Bounds on the Complexity of Approximating Continuous Functions by Sigmoidal Neural Networks (gzip'd) -- Michael Schmitt
We calculate lower bounds on the size of sigmoidal neural networks that approximate continuous functions. In particular, we show that for the approximation of polynomials the network size has to grow as $\Omega((\log k)^{1/4})$ where $k$ is the degree of the polynomials. This bound is valid for any input dimension, i.e. independently of the number of variables. The result is obtained by introducing a new method employing upper bounds on the Vapnik-Chervonenkis dimension for proving lower bounds on the size of networks that approximate continuous functions.

Information Capacity and Robustness of Stochastic Neuron Models (gzip'd) -- Elad Schneidman, Idan Segev, Naftali Tishby
The reliability and accuracy of spike trains have been shown to depend on the nature of the stimulus that the neuron encodes. Adding ion channel stochasticity to neuronal models results in a macroscopic behavior that replicates the input-dependent reliability and precision of real neurons. We calculate the amount of information that an ion channel based stochastic Hodgkin-Huxley (HH) neuron model can encode about a wide set of stimuli. We show that both the information rate and the information per spike of the stochastic model is similar to the values reported experimentally. Moreover, the amount of information that the neuron encodes is correlated with the amplitude of fluctuations in the input, and less so with the average firing rate of the neuron. We also show that for the HH ion channel density, the information capacity is robust to changes in the density of ion channels in the membrane, whereas changing the ratio between the $Na^{+}$ and $K^{+}$ ion channels has a considerable effect on the information that the neuron can encode. This suggests that neurons may maximize their information capacity by appropriately balancing the density of the different ion channels that underlies neuronal excitability.

SV Estimation of a Distribution's Support (gzip'd) -- Bernhard Sch{\"o}lkopf, Robert C. Williamson, Alexander J. Smola, John Shawe-Taylor
Suppose you are given some dataset drawn from an underlying probability distribution $P$ and you want to estimate a subset $S$ of input space such that the probability that a test point drawn from $P$ lies outside of $S$ is bounded by some a priori specified $0<\nu\le 1$. We propose an algorithm which approaches this problem by trying to estimate a function $f$ which is positive on $S$ and negative on the complement. The functional form of $f$ is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.

Application of Blind Separation of Sources to Optical Recording of Brain Activity (gzip'd) -- Holger Schoener, Martin Stetter, Ingo Schießl, Jennifer Lund, Niall McLoughlin, John E.W. Mayhew, Klaus Obermayer
In the analysis of data recorded by optical imaging from intrinsic signals (measurement of changes of light reflectance from cortical tissue) the removal of noise and artifacts such as blood vessel patterns is a serious problem. Often bandpass filtering is used, but the underlying assumption that a spatial frequency exists, which separates the mapping component from other components (especially the global signal), is questionable. Here we propose alternative ways of processing optical imaging data, using blind source separation techniques based on the spatial decorrelation of the data. We first perfom benchmarks on artificial data in order to select the way of processing, which is most robust with respect ot sensor noise. We then apply it to recordings of optical imaging experiments from macaque primary visual cortex. We show that our BSS technique is able to extract ocular dominance and orientation preference maps from single condition stacks, for data, where standard postprocessing procedures fail. Artifacts, especially blood vessel patterns, can often be completely removed from the maps. In summary, our method for blind source separation using extended spatial decorrelation is a superior technique for the analysis of optical recording data.

Online Independent Component Analysis with Local Learning Rate Adaptation -- Nicol N. Schraudolph, Xavier Giannakopoulos
Stochastic meta-descent (SMD) is a new technique for online adaptation of local learning rates in arbitrary twice-differentiable systems. Like matrix momentum it uses full second-order information while retaining O(n) computational complexity by exploiting the efficient computation of Hessian-vector products. Here we apply SMD to independent component analysis, and employ the resulting algorithm for the blind separation of time-varying mixtures. By matching individual learning rates to the rate of change in each source signal's mixture coefficients, our technique is capable of simultaneously tracking sources that move at very different, a priori unknown speeds.

Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks -- Mike Schuster
This paper describes bidirectional recurrent mixture density networks, which can model multi-modal distributions of the type $P(x_t|y)$ and $P(x_t|x_1, x_2, \ldots, x_{t-1}, y)$ without any explicit assumptions about the use of context. These expressions occur frequently in pattern recognition problems with sequential data, for example in speech recognition. Experiments show that the proposed generative models give a higher likelihood on test data compared to a traditional modeling approach, indicating that they can summarize the statistical properties of the data better.

Greedy Importance Sampling (gzip'd) -- Dale Schuurmans
I present a simple variation of importance sampling that explicitly searches for high density regions in the target distribution. I prove that the technique yields unbiased estimates, and show empirically it can reduce the variance of standard Monte Carlo estimators. This is achieved by concentrating samples in more important regions of the sample space.

Bayesian Model Selection for Support Vector Machines, Gaussian Processes and Other Kernel Classifiers (gzip'd) -- Matthias Seeger
We present a variational Bayesian method for model selection over families of kernels classifiers like Support Vector machines or Gaussian processes. The algorithm needs no user interaction and is able to adapt a large number of kernel parameters to given data without having to sacrifice training cases for validation. This opens the possibility to use sophisticated families of kernels in situations where the small standard kernel'' classes are clearly inappropriate. We relate the method to other work done on Gaussian processes and clarify the relation between Support Vector machines and certain Gaussian process models.

Noisy Neural Networks and Generalizations (gzip'd) -- Hava T. Siegelmann, Alexander Roiterstein, Asa Ben-Hur
In this paper we define a probabilistic computational model which generalizes many noisy neural network models, including the recent work of Maass and Sontag. We identify weak ergodicity as the mechanism responsible for restriction of the computational power of probabilistic models to {\em definite languages}, independent of the characteristics of the noise: whether it is discrete or analog, or if it depends on the input or not, and independent of whether the variables are discrete or continuous. We give examples of weakly ergodic models including noisy computational systems with noise depending on the current state and inputs, aggregate models, and computational systems which update in continuous time.

Leveraged Vector Machines -- Yoram Singer
We describe an iterative algorithm for building vector machines used in classification tasks. The algorithm builds on ideas from support vector machines, boosting, and generalized additive models. The algorithm can be used with various continuously differential functions that bound the discrete (0-1) classification loss and is very simple to implement. We test the proposed algorithm with two different loss functions on synthetic and natural data. We also describe a norm-penalized version of the algorithm for the exponential loss function used in AdaBoost. The performance of the algorithm on natural data is comparable to Support Vector machines while typically both its running time is shorter than of SVM and the sizes of the final classifiers it builds are smaller.

Reinforcement Learning for Spoken Dialogue Systems -- Satinder Singh, Michael Kearns, Diane Litman, Marilyn Walker
Recently, a number of authors have proposed treating dialogue systems as Markov decision processes (MDPs). However, the practical application of MDP algorithms to dialogue systems faces a number of severe technical challenges. We have built a general software tool (RLDS, for Reinforcement Learning for Dialogue Systems) based on the MDP framework, and have applied it to dialogue corpora gathered from two dialogue systems built at AT\&T Labs. Our experiments demonstrate that RLDS holds promise as a tool for browsing'' and understanding correlations in complex, temporally dependent dialogue corpora.

Agglomerative Information Bottleneck (gzip'd) -- Noam Slonim, Naftali Tishby
We introduce a novel distributional clustering algorithm that explicitly maximizes the mutual information per cluster between the data and given categories. This algorithm can be considered as a bottom up hard version of the recently introduced Information Bottleneck Method''. We relate the mutual information between clusters and categories to the Bayesian classification error, which provides another motivation for using the obtained clusters as features. The algorithm is compared with the top-down soft version of the information bottleneck method and a relationship between the hard and soft results is established. We demonstrate the algorithm on the {\em 20 Newsgroups} data set. For a subset of two news-groups we achieve compression by 3 orders of magnitudes loosing only 10\% of the original mutual information.

Speech Modelling Using Subspace and EM Techniques (gzip'd) -- Gavin Smith, Nando de Freitas, Tony Robinson, Mahesan Niranjan
This paper concerns modelling using a piecewise-stationary discrete-time linear stochastic state space model, with applications to speech modelling. The purpose of the paper is to compare two algorithms for model parameter estimation: subspace state space system identification (4SID) and expectation-maximisation (EM). Both algorithms estimate state sequence and parameters jointly. EM is related to Kalman smoothing, maximises likelihoods, is iterative and requires parameter initialisation. Whereas 4SID is related to Kalman filtering, minimises a criterion involving both short and long-term prediction errors using least-squares, is closed-form and requires no parameter initialisation. Therefore 4SID has the advantage that it avoids iterative algorithm problems and requires less a priori knowledge because initialisation parameters are not needed. 4SID and EM methods are compared through experiments on real speech data. EM is sensitive to initialisation. Different initialisation methods are discussed and compared.

The Entropy Regularization Information Criterion (gzip'd) -- Alexander J. Smola, John Shawe-Taylor, Bernhard Sch{\"o}lkopf, Robert C. Williamson
Effective methods of capacity control via uniform convergence bounds for function expansions have been largely limited to Support Vector machines, where good bounds are obtainable by the entropy number approach. We extend these methods to systems with expansions in terms of arbitrary (parametrized) basis functions and a wide range of regularization methods covering the whole range of general linear additive models. This is achieved by a data dependent analysis of the eigenvalues of the corresponding design matrix. Experimental evidence corroborates the new bounds.

Probabilistic Methods for Support Vector Machines (gzip'd) -- Peter Sollich
I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This can provide intuitive guidelines for choosing a good' SVM kernel. It can also assign (by evidence maximization) optimal values to parameters such as the noise level C which cannot be determined unambiguously from properties of the MAP solution alone (such as cross-validation error). I illustrate this using a simple approximate expression for the SVM evidence. Once C has been determined, error bars on SVM predictions can also be obtained.

Image Recognition in Context: Application to Microscopic Urinalysis (gzip'd) -- Xubo B. Song, Yaser Abu-Mostafa, Joseph Sill, Harvey Kasdan
We propose a new and efficient technique to incorporate contextual information into object classification. Most of the current techniques faces the problem of exponential computation cost, which is dealt with by either limiting to local searching space, which might cause incomplete context, or simplifying the problem using specific domain knowledge. In this paper, we propose a new general framework that incorporates {\it partial} context at a linear cost. This technique is applied to microscopic urinalysis image recognition, resulting in significant improvement of recognition rate than context free approach, which would have been impossible by conventional context incorporating techniques.

Hierarchical Image Probability (HIP) Models -- Clay Spence, Lucas Parra
We formulate a model for probability distributions on image spaces. We show that any distribution of images can be factored exactly into conditional distributions of feature vectors at one resolution (pyramid level) conditioned on the image information at lower resolutions. We would like to factor this over positions in the pyramid levels to make it tractable, but such factoring may miss long-range dependencies. To fix this, we introduce hidden class labels at each pixel in the pyramid. The result is a hierarchical mixture of conditional probabilities, similar to a hidden Markov model on a tree. The model parameters can be found with maximum likelihood estimation using the EM algorithm. We have obtained encouraging preliminary results on the problems of detecting various objects in SAR images and target recognition in optical aerial images.

Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks (gzip'd) -- Masashi Sugiyama, Hidemitsu Ogawa
In this paper, we consider the active learning problem in trigonometric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By analyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the computational complexity and memory required for the calculation of learning results. Finally, we give examples of sample points satisfying the condition and show that one of the examples further reduces the computational complexity and memory required for the calculation.

Predictive Approaches for Choosing Hyperparameters in Gaussian Processes -- S. Sundararajan, S. Sathiya Keerthi
Gaussian Processes are powerful regression models specified by parametrized mean and covariance functions. Standard approaches to estimate these parameters (known by the name Hyperparameters) are Maximum Likelihood (ML) and Maximum APosterior (MAP) approaches. In this paper, we propose and investigate predictive approaches, namely, maximization of Geisser's Surrogate Predictive Probability (GPP) and minimization of mean square error with respect to GPP (referred to as Geisser's Predictive mean square Error (GPE)) to estimate the hyperparameters. We also derive results for the standard Cross-Validation (CV) error and make a comparison. These approaches are tested on a number of problems and experimental results show that these approaches are strongly competitive to existing approaches.

Policy Gradient Methods for Reinforcement Learning with Function Approximation (gzip'd) -- Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor--critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

On Input Selection with Reversible Jump Markov Chain Monte Carlo Sampling (gzip'd) -- Peter Sykacek
In this paper we will treat input selection for a radial basis function (RBF) like classifier within a Bayesian framework. We approximate the a-posteriori distribution over both model coefficients and input subsets by samples drawn with Gibbs updates and reversible jump moves. Using some public datasets we compare the classification accuracy of the method with a conventional ARD scheme. These datasets are also used to infer the relevance of different input subsets.

#### T

A MEG Study of Response Latency and Variability in the Human Visual System (gzip'd) -- Akaysha C. Tang, Barak A. Pearlmutter, Tim A. Hely, Michael P. Weisend, Michael Zibulevsky
Human reaction times during sensory-motor tasks vary considerably. To understand how this variability arises, one must examine each stage of information processing. According to the conventional view, precise temporal information will be gradually lost as the information is passed through a layered network of mean-rate units.'' To test, in humans, whether populations of neurons at different stages of processing behave like mean-rate units,'' we applied Blind Source Separation to MEG signals obtained during a sensory-motor integration task. We identified multiple visual sources differing in response latency and duration, and estimated response time latency and response time variability by detecting single-trial stimulus-locked events for each source. We found a greater response time variability in the early visual responses than the later visual responses. This finding supports the hypothesis that variability in populational responses increases at successive processing stages, despite evidence in the literature that single neuron spike jitters at successive synaptic stages can be preserved.

Rules and Similarity in Concept Learning (gzip'd) -- Joshua B. Tenenbaum
A popular view holds that learning and generalizing concepts depends on two fundamentally distinct modes of representation: rules and similarity-to-exemplars. Through a combination of experiments and formal analysis, I show how a Bayesian framework offers a unifying account of both rule-based and similarity-based generalization. Bayes explains the specific workings of these two modes -- which rules are abstracted, how similarity is measured -- as well as why generalization appears rule-based or similarity-based in different situations. I conclude that the distinction between rules and similarity in concept learning may be useful at the level of heuristic algorithms but is not computationally fundamental.

Monte Carlo POMDPs (gzip'd) -- Sebastian Thrun
We present a Monte Carlo algorithm for learning to act optimally in partially observable Markov decision processes (POMDPs). Our approach uses importance sampling for representing beliefs, and Monte Carlo approximation for belief revision. Reinforcement learning (value iteration) is employed to learn value functions over belief functions, and a sample-based version of nearest neighbor is used to generalize across states. Our approach departs from previous work in the POMDP field in that it can handle real-valued state spaces. Initial empirical results suggest that our approach may work well in practical applications.

Building Predictive Models from Spatial Representations of Symbolic Sequences (gzip'd) -- Peter Tino, Georg Dorffner
We propose a novel approach for building finite memory predictive models similar in spirit to variable memory length Markov models (VLMMs). The models are constructed by first transforming the n-block structure of the training sequence into a spatial structure of points in a unit hypercube, such that the longer is the common suffix shared by any two n-blocks, the closer lie their point representations. Such a transformation embodies a Markov assumption - n-blocks with long common suffixes are likely to produce similar continuations. Finding a set of prediction contexts is formulated as a resource allocation problem solved by vector quantizing the spatial n-block representation. We compare our model with both the classical and variable memory length Markov models on three data sets with different memory and stochastic components. Our models have a superior performance, yet, their construction is fully automatic, which is shown to be problematic in the case of VLMMs.

The Relevance Vector Machine (gzip'd) -- Michael E. Tipping
This paper introduces the \emph{Relevance Vector Machine}, a generalised linear model for regression and classification whose output, like the Support Vector Machine (SVM), is a weighted sum of kernel functions associated with a subset of the training examples. A high degree of sparsity is achieved through a Bayesian treatment by introducing a prior distribution over the model weights, and maximising the marginal likelihood over the values of the hyperparameters. Examples in both regression and classification settings illustrate generalisation at least as good as a comparable SVM, while utilising dramatically fewer kernel functions.

Evolving Learnable Languages (gzip'd) -- Bradley Tonkes, Alan Blair, Janet Wiles
Traditional theories of child language acquisition center around the existence of innate, domain-specific parameters which are specifically tuned for learning a particular class of languages. More recent proposals suggest that language acquisition is assisted by the evolution of languages towards forms that are easily learnable. In this paper, we evolve combinatorial languages which can be learned by a recurrent neural network quickly and from relatively few examples. Additionally, we evolve languages for generalization in different worlds'', and for generalization from specific examples. We find that languages can be evolved to facilitate different forms of impressive generalization for a general purpose learner. The results provide empirical support for the theory that the language itself, as well as the language environment of a learner, plays a substantial role in learning: that there is far more to language acquisition than the language acquisition device.

#### V

Generalized Model Selection for Unsupervised Learning in High Dimensions (gzip'd) -- Shivakumar Vaithyanathan, Byron Dom
In this paper we describe an approach to model selection in unsupervised learning. This approach determines both the feature set and the number of clusters. To this end we first derive an objective function that explicitly incorporates this generalization. We then evaluate two schemes for model selection - one using this objective function (a Bayesian estimation scheme that selects the best model structure using the marginal or integrated likelihood) and the second based on a technique using a cross-validated likelihood criterion. In the first scheme, for a particular application in document clustering, we derive a closed-form solution of the integrated likelihood by assuming an appropriate form of the likelihood function and prior. Extensive experiments are carried out to ascertain the validity of both approaches and all results are verified by comparison against ground truth. In our experiments the Bayesian scheme using our objective function gave better results than cross-validation.

Support Vector Method for Multivariate Density Estimation (gzip'd) -- Vladimir N. Vapnik, Sayan Mukherjee
A new method for multivariate density estimation is developed based on the Support Vector Machine (SVM) solution of inverse ill-posed problems. The solution has the form of a mixture of densities. This method with Gaussian kernels compared favorably to both Parzen's method and the Gaussian Mixture Model method.

Learning from User Feedback in Image Retrieval Systems (gzip'd) -- Nuno Vasconcelos, Andrew Lippman
We formulate the problem of retrieving images from visual databases as a problem of Bayesian inference. This leads to natural and effective solutions for two of the most challenging issues in the design of a retrieval system: providing support for region-based queries without requiring prior image segmentation, and accounting for user-feedback during a retrieval session. We present a new learning algorithm that relies on belief propagation to account for both positive and negative examples of the user's interests.

#### W

Scale Mixtures of Gaussians and the Statistics of Natural Images (gzip'd) -- Martin J. Wainwright, Eero P. Simoncelli
The statistics of photographic images, when represented using multi-scale (wavelet) bases, exhibit two striking non-Gaussian behaviors. First, the marginal densities of the coefficients have extended heavy tails. Second, the joint densities exhibit variance dependencies not captured by second-order models. We examine properties of the class of Gaussian scale mixtures (GSM), and show that these densities can accurately characterize both the marginal and joint distributions of natural image wavelet coefficients. This class of model suggests a Markov structure, in which wavelet coefficients are linked by hidden scaling variables corresponding to local image structure. We derive an estimator for these hidden variables, and show that a nonlinear normalization'' procedure can be used to Gaussianize the wavelet coefficients.

Dual Estimation and the Unscented Transformation -- Eric A. Wan, Rudolph van der Merwe, Alex T. Nelson
Dual estimation refers to the problem of simultaneously estimating the state of a dynamic system and the model which gives rise to the dynamics. Algorithms include expectation-maximization (EM), Dual Kalman Filtering, and Joint Kalman methods. These methods have been recently explored in the context of nonlinear modeling, where a neural network is used as the functional form of the unknown model. Typically, an Extended Kalman Filter (or smoother) is used for the part of the algorithm that estimates the clean state given the current estimated model. An EKF may also be used as a state-space approach to estimating the weights of the network. This paper points out the flaws in using the EKF, and proposes an improvement based on a new approach called the Unscented Transformation (UT). A substantial performance gain is achieved with the same order of computational complexity as that of the standard EKF. The approach is illustrated on several dual estimation methods.

Algebraic Analysis for Non-regular Learning Machines (gzip'd) -- Sumio Watanabe
This paper mathematically clarifies the asymptotic form of the free energy or the stochastic complexity for non-regular and over-realizable learning machines. For regular learning machines, it is well known that the asymptotic form of the free energy is F(n)= (d/2)log n, where d is the dimension of the parameter space and n is the number of training samples. However, layered statistical models such as neural networks are not regular models, because the set of true parameters is not one point but an analytic set with singularities if the model contains the true distribution. For such non-regular and non-identifiable learning machines, we prove that F(n) = a log n + (b-1)loglog n, where a is a rational number and b is a natural number which are determined from the zero point of Sato-Bernstein polynomial and its multiplicity. We also show that a and b can be calculated by using resolution of singularities in algebraic geometry.

Learning Statistically Neutral Tasks without Expert Guidance (gzip'd) -- Ton Weijters, Antal van den Bosch, Eric Postma
In this paper, we question the necessity of levels of expert-guided abstraction in learning hard, statistically neutral classification tasks. We focus on two tasks, date calculation and parity-12, that are claimed to require intermediate levels of abstraction that must be defined by a human expert. We challenge this claim by demonstrating empirically that a single hidden-layer {\sc bp-som} network can learn both tasks without guidance. Moreover, we analyze the network's solution for the parity-12 task and show that its solution makes use of an elegant intermediary checksum computation.

Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology -- Yair Weiss, William T. Freeman
Local belief propagation'' rules of the sort proposed by Pearl are guaranteed to converge to the correct posterior probabilities in singly connected graphical models. Recently, a number of researchers have empirically demonstrated good performance of loopy belief propagation''--using these same rules on graphs with loops. Perhaps the most dramatic instance is the near Shannon-limit performance of Turbo codes'', whose decoding algorithm is equivalent to loopy belief propagation. Except for the case of graphs with a single loop, there has been little theoretical understanding of the performance of loopy propagation. Here we analyze belief propagation in networks with arbitrary topologies when the nodes in the graph describe jointly Gaussian random variables. We give an analytical formula relating the true posterior probabilities with those calculated using loopy propagation. We give sufficient conditions for convergence and show that when belief propagation converges it gives the correct posterior means {\em for all graph topologies}, not just networks with a single loop. The related `max-product'' belief propagation algorithm finds the maximum posterior probability estimate for singly connected networks. We show that, even for non-Gaussian probability distributions, the convergence points of the max-product algorithm in loopy networks are at least local maxima of the posterior probability. These results motivate using the powerful belief propagation algorithm in multiply connected networks, and help clarify the empirical performance results.

Probabilistic Hierarchical Clustering (gzip'd) -- Christopher K. I. Williams
There are many hierarchical clustering algorithms available, but these lack a firm statistical basis. Here we set up a hierarchical probabilistic mixture model, where data is generated in a hierarchical tree-structured manner. Markov chain Monte Carlo methods are demonstrated which can be used to sample from the posterior distribution over trees containing variable numbers of hidden units.

Population Decoding Based on an Unfaithful Model (gzip'd) -- Si Wu, Hiroyuki Nakahara, Noboru Murata, Shun-ichi Amari
The present study investigates a population decoding paradigm in which the maximum likelihood is based on an unfaithful decoding model (UMLI). This is usually the case for neural population decoding because the encoding process of the brain is not exactly known, or because a simplified model is preferred for saving computational cost. We consider an unfaithful decoding model which neglects the pair-wise correlation between neuronal activities. The decoding error of UMLI is proved to be asymptotically efficient when the neuronal correlation is uniform or of limited-range. The performance of UNMLI is compared with that of the maximum likelihood inference based on a faithful model and that of the center of mass decoding method. It turns out that UMLI has advantages of decreasing the computational complexity remarkably and maintaining a high-level decoding accuracy at the same time.

#### X

Spike-based Learning Rules and Stabilization of Persistent Neural Activity (gzip'd) -- Xiao-Hui Xie, H. Sebastian Seung
We analyze the conditions under which synaptic learning rules based on action potential timing can be approximated by learning rules based on firing rates. In particular, we consider a form of plasticity in which synapses depress when a presynaptic spike is followed by a postsynaptic spike, and potentiate with the opposite temporal ordering. Such {\em differential anti-Hebbian plasticity} can be approximated under certain conditions by a learning rule that depends on the time derivative of the postsynaptic firing rate. We argue that this learning rule acts to stabilize persistent neural activity patterns in recurrent neural networks, and illustrate its operation with numerical simulations of an autapse.

#### Y

Search for Information Bearing Components in Speech (gzip'd) -- Howard Hua Yang, Hynek Hermansky
In this paper, we use mutual information (MI) to characterize the distributions of phonetic and speaker/channel information over a time-frequency space. The MI between the phonetic label and one, two or three features is estimated. The Miller's bias formulas for entropy and MI estimates are extended to include higher order terms. The MI for speaker/channel recognition is also estimated. The results are complementary to those for phonetic classification. Our results show how the phonetic information is locally spread and how the speaker/channel is globally spread in time and frequency.

Data Visualization and Feature Selection: New Algorithms for Nongaussian Data (gzip'd) -- Howard Hua Yang, John Moody
Data visualization and feature selection methods are proposed based on the joint mutual information and ICA. The visualization methods can find many good 2-D projections for high dimensional data interpretation, which cannot be easily found by the other existing methods. The new variable selection method is found to be better in eliminating redundancy in the inputs than other methods based on simple mutual information. The efficacy of the methods is illustrated on a radar signal analysis problem to find 2-D viewing coordinates for data visualization and to select inputs for a neural network classifier.

A Generative Model for Visual Cue Combination (gzip'd) -- Zhiyong Yang, Richard S. Zemel
We develop a hierarchical generative model to study cue combination. The model consists of four layers: global shape parameters at the top, followed by global cue-specific shape parameters, then local cue-specific parameters, ending with an intensity image at the bottom. Inferring parameters from images is achieved by inverting this model. Inference produces a probability distribution at each level; using distributions rather than a single value of underlying variables at each stage preserves information about the validity of each cue for the given image, which helps make this model more powerful than standard linear combination models or other existing combination models. The parameters of the model are determined using data obtained from psychophysical experiments. In these experiments, subjects estimate surface shape from intensity images containing texture information, shading information, or both types of cues, and varying degrees of noise. This model provides a good fit to our data on the combination of these two cues, and also provides a natural account for many aspects of cue combination.

Localist Attractor Networks (gzip'd) -- Richard S. Zemel, Michael C. Mozer
Attractor networks, which map an input space to a discrete output space, are useful for pattern completion---cleaning up noisy or missing input features. However, designing a net to have a given set of attractors is notoriously tricky; training procedures are CPU intensive and often produce spurious attractors and ill-conditioned attractor basins. These difficulties occur because each connection in the network participates in the encoding of multiple attractors. We describe an alternative formulation of attractor networks in which the encoding of knowledge is local, not distributed. Although localist attractor networks have similar dynamics to their distributed counterparts, they are much easier to work with and interpret. We propose a statistical formulation of localist attractor net dynamics, which yields a convergence proof and a mathematical interpretation of model parameters. We present simulation experiments that explore the behavior of localist attractor networks, showing that spurious attractors are rare, and they facilitate two desirable properties of psychological and neurobiological models, priming---faster convergence to an attractor if the attractor has been recently visited---and gang effects---in which the presence of an attractor enhances the attractor basins of neighboring attractors.

Some Theoretical Results Concerning the Convergence of Compositions of Regularized Linear Functions -- Tong Zhang
Recently, sample complexity bounds have been derived for problems involving linear functions such as neural networks and support vector machines. In this paper, we extend some theoretical results in this area by deriving dimensional independent covering number bounds for regularized linear functions under certain regularization conditions. We show that such bounds lead to a class of new methods for training linear classifiers with similar theoretical advantages of the support vector machine. Furthermore, we also present theoretical analysis for these new methods from the asymptotic statistical point of view. This technique provides better description for large sample behaviors of these algorithms.

Semiparametric Approach to Multichannel Blind Deconvolution of Nonminimum Phase Systems (gzip'd) -- L.-Q. Zhang, Shun-ichi Amari, A. Cichocki
In this paper we discuss the semiparametric statistical model for blind deconvolution. First we introduce a Lie Group to the manifold of noncausal FIR filters. Then blind deconvolution problem is formulated in the framework of a semiparametric model, and a family of estimating functions is derived for blind deconvolution. A natural gradient learning algorithm is presented for training noncausal filters. To improve the learning efficiency, an explicit form of standardized estimating functions is given as well. Superefficiency of the natural gradient learning is proven in this framework.

Manifold Stochastic Dynamics for Bayesian Learning (gzip'd) -- Mark Zlochin, Yoram Baram
We propose a new Markov Chain Monte Carlo algorithm which is a generalization of the stochastic dynamics method. The algorithm performs exploration of the state space using its intrinsic geometric structure, which facilitates efficient sampling of complex distributions. Applied to Bayesian learning in neural networks, our algorithm was found to produce results comparable to the best state-of-the-art method while consuming considerably less time.