# Neural Information Processing Systems 1998

This is a partial list of the papers that were presented at NIPS*97. They are listed in (approximately) alphabetical order according to the last name of the first author.

These papers should be cited as:
Advances in Neural Information Processing Systems 10,
M. I. Jordan, M. J. Kearns, S. A. Solla, eds., MIT Press, 1998.

A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-R-S-T-V-W-X-Y-Z

#### A

Generalized Prioritized Sweeping -- David Andre, Nir Friedman, Ronald Parr
Prioritized sweeping is a model-based reinforcement learning method that attempts to focus an agent's limited computational resources to achieve a good estimate of the value of environment states. The classic account of prioritized sweeping uses an explicit, state-based representation of the value, reward, and model parameters. Such a representation is unwieldy for dealing with complex environments and there is growing interest in learning with more compact representations. We claim that classic prioritized sweeping is ill-suited for learning with such representations. To overcome this deficiency, we introduce generalized prioritized sweeping, a principled method for generating representation-specific algorithms for model-based reinforcement learning. We then apply this method for several representations, including state-based models and generalized model approximators (such as Bayesian networks). We describe preliminary experiments that compare our approach with classical prioritized sweeping.

Nonparametric Model-Based Reinforcement Learning -- Christopher G. Atkeson
This paper describes some of the interactions of model learning algorithms and planning algorithms we have found in exploring model-based reinforcement learning. The paper focuses on how local trajectory optimizers can be used effectively with learned nonparametric models. We find that trajectory planners that are fully consistent with the learned model often have difficulty finding reasonable plans in the early stages of learning. Trajectory planners that balance obeying the learned model with minimizing cost often do better, even if the plan is not fully consistent with the learned model.

Coding of Naturalistic Stimuli by Auditory Midbrain Neurons -- Hagai Attias, Christoph E. Schreiner
It is known that humans can make finer discriminations between familiar sounds (e.g. syllables) than between unfamiliar ones (e.g. different noise segments). Here we show that a corresponding enhancement is present in early auditory processing stages. Based on previous work which demonstrated that natural sounds had robust statistical properties that could be quantified, we hypothesize that the auditory system exploits those properties to construct efficient neural codes. To test this hypothesis, we measure the information rate carried by auditory spike trains on narrow-band stimuli whose amplitude modulation has naturalistic characteristics, and compare it to the information rate on stimuli with non-naturalistic modulation. We find that naturalistic inputs significantly enhance the rate of transmitted information, indicating that auditiory neural responses are matched to characteristics of natural auditory scenes.

#### B

Synchronized Auditory and Cognitive 40 Hz Attentional Streams, and the Impact of Rhythmic Expectation on Auditory Scene Analysis -- Bill Baird
We have developed a neural network architecture that implements a theory of attention, learning, and trans-cortical communication based on adaptive synchronization of 5-15 Hz and 30-80 Hz oscillations between cortical areas. Here we present a specific higher order cortical model of attentional networks, rhythmic expectancy, and the interaction of higher-order and primary cortical levels of processing. It accounts for the mismatch negativity'' of the auditory ERP and the results of psychological experiments of Jones showing that auditory stream segregation can depend on the rhythmic structure of inputs. The timing mechanisms of the model allow us to explain how relative timing information such as the relative order of events between streams is lost when streams are formed. The model suggests how the theories of auditory perception and attention of Jones and Bregman may be reconciled.

Using Expectation to Guide Processing: A Study of Three Real-World Applications -- Shumeet Baluja
In many real world tasks, only a small fraction of the available inputs are important at any particular time. This paper presents a method for ascertaining the relevance of inputs by exploiting temporal coherence and predictability. The method proposed in this paper dynamically allocates relevance to inputs by using expectations of their future values. As a model of the task is learned, the model is simultaneously extended to create task-specific predictions of the future values of inputs. Inputs which are either not relevant, and therefore not accounted for in the model, or those which contain noise, will not be predicted accurately. These inputs can be de-emphasized, and, in turn, a new, improved, model of the task created. The techniques presented in this paper have yielded significant improvements for the vision-based autonomous control of a land vehicle, vision-based hand tracking in cluttered scenes, and the detection of faults in the etching of semiconductor wafers.

Ensemble Learning for Multi-Layer Networks -- David Barber, Christopher M. Bishop
Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.

Radial Basis Functions: A Bayesian Treatment -- David Barber, Bernhard Schottky
Bayesian methods have been successfully applied to regression and classification problems in multi-layer perceptrons. We present a novel application of Bayesian techniques to Radial Basis Function networks by developing a Gaussian approximation to the posterior distribution which, for fixed basis function widths, is analytic in the parameters. The setting of regularization constants by cross-validation is wasteful as only a single optimal parameter estimate is retained. We treat this issue in a principled manner by assigning prior distributions to these constants, which are then adapted in light of the data under a simple re-estimation formula.

The Canonical Distortion Measure in Feature Space and 1-NN Classification -- Jonathan Baxter, Peter Bartlett
We prove that the Canonical Distortion Measure (CDM) is the optimal distance measure to use for 1 nearest-neighbour (1-NN) classification, and show that it reduces to squared Euclidean distance in feature space for function classes that can be expressed as linear combinations of a fixed set of features. PAC-like bounds are given on the sample-complexity required to learn the CDM. An experiment is presented in which a neural network CDM was learnt for a Japanese OCR environment and then used to do 1-NN classification.

Shared Context Probabilistic Transducers -- Yoshua Bengio, Samy Bengio, Jean-Fran\c{c}ois Isabelle, Yoram Singer
Recently, a model for supervised learning of probabilistic transducers represented by suffix trees was introduced. However, this algorithm tends to build very large trees, requiring very large amounts of computer memory. In this paper, we propose a new, more compact, transducer model in which one shares the parameters of distributions associated to contexts yielding similar conditional output distributions. We illustrate the advantages of the proposed algorithm with comparative experiments on inducing a noun phrase recognizer.

Refractoriness and Neural Precision -- Michael J. Berry II, Markus Meister
The relationship between a neuron's refractory period and the precision of its response to identical stimuli was investigated. The light response of retinal ganglion cells was modeled as probabilistic firing combined with a refractory period: the instantaneous firing rate is the product of a free firing rate'', which depends only on the stimulus, and a recovery function'', which depends only on the time since the last spike. In simulations, longer refractory periods were found to make the response more reproducible, eventually matching the precision of measured ganglion cell spike trains. The underlying free firing rate derived in this way often exceeded the observed firing rate by an order of magnitude, and was found to convey information about the stimulus over a much wider dynamic range. Thus the free firing rate may be the preferred variable for describing a spiking neuron's response.

Approximating Posterior Distributions in Belief Networks Using Mixtures -- Christopher M. Bishop, Neil Lawrence, Tommi Jaakkola, Michael I. Jordan
Exact inference in densely connected Bayesian networks is computationally intractable, and so there is considerable interest in developing effective approximation schemes. One approach which has been adopted is to bound the log likelihood using a mean-field approximating distribution. While this leads to a tractable algorithm, the mean field distribution is assumed to be factorial and hence unimodal. In this paper we demonstrate the feasibility of using a richer class of approximating distributions based on mixtures of mean field distributions. We derive an efficient algorithm for updating the mixture parameters and apply it to the problem of learning in sigmoid belief networks. Our results demonstrate a systematic improvement over simple mean field theory as the number of mixture components is increased.

Receptive Field Formation in Natural Scene Environments: Comparison of Single Cell Learning Rules -- Brian S. Blais, Nathan Intrator, Harel Shouval, Leon N. Cooper
We study several statistically and biologically motivated learning rules using the same visual environment and neuronal architecture. This allows us to concentrate on the feature extraction and neuronal coding properties of these rules. We find that the quadratic form of the BCM rule behaves in a manner similar to a kurtosis maximization rule when the distribution contains kurtotic directions, although the BCM modification equations are computationally simpler.

Multiple Threshold Neural Logic -- Vasken Bohossian, Jehoshua Bruck
We introduce a new Boolean computing element related to the Linear Threshold element, which is the Boolean version of the neuron. Instead of the sign function, it computes an arbitrary (with polynomialy many transitions) Boolean function of the weighted sum of its inputs. We call the new computing element an LTM element, which stands for Linear Threshold with Multiple transitions.

An Annealed Self-Organizing Map for Source Channel Coding -- Matthias Burger, Thore Graepel, Klaus Obermayer
We derive and analyse robust optimization schemes for noisy vector quantization on the basis of deterministic annealing. Starting from a cost function for central clustering that incorporates distortions from channel noise we develop a soft topographic vector quantization algorithm (STVQ) which is based on the maximum entropy principle and which performs a maximum-likelihood estimate in an expectation-maximization (EM) fashion. Annealing in the temperature parameter $\beta$ leads to phase transitions in the existing code vector representation during the cooling process for which we calculate critical temperatures and modes as a function of eigenvectors and eigenvalues of the covariance matrix of the data and the transition matrix of the channel noise. A whole family of vector quantization algorithms is derived from STVQ, among them a deterministic annealing scheme for Kohonen's self-organizing map (SOM). This algorithm, which we call SSOM, is then applied to vector quantization of image data to be sent via a noisy binary symmetric channel. The algorithm's performance is compared to those of LBG and STVQ. While it is naturally superior to LBG, which does not take into account channel noise, its results compare very well to those of STVQ, which is computationally much more demanding.

#### C

Incorporating Test Inputs into Learning -- Zehra Cataltepe, Malik Magdon-Ismail
In many applications, such as credit default prediction and medical image recognition, test inputs are available in addition to the labeled training examples. We propose a method to incorporate the test inputs into learning. Our method results in solutions having smaller test errors than that of simple training solution, especially for noisy problems or small training sets.

On Efficient Heuristic Ranking of Hypotheses -- Steve Chien, Andre Stechert, Darren Mutz
This paper considers the problem of learning the ranking of a set of alternatives based upon incomplete information (e.g., a limited number of observations). We describe two algorithms for hypothesis ranking and their application for probably approximately correct (PAC) and expected loss (EL) learning criteria. Empirical results are provided to demonstrate the effectiveness of these ranking procedures on both synthetic datasets and real-world data from a spacecraft design optimization problem.

On Parallel versus Serial Processing: A Computational Study of Visual Search -- Eyal Cohen, Eytan Ruppin
A novel neural network model of pre-attention processing in visual-search tasks is presented. Using displays of line orientations taken from Wolfe's experiments [1992], we study the hypothesis that the distinction between parallel versus serial processes arises from the availability of global information in the internal representations of the visual scene. The model operates in two phases. First, the visual displays are compressed via principal-component-analysis. Second, the compressed data is processed by a target detector module in order to identify the existence of a target in the display. Our main finding is that targets in displays which were found experimentally to be processed in parallel can be detected by the system, while targets in experimentally-serial displays cannot. This fundamental difference is explained via variance analysis of the compressed representations, providing a numerical criterion distinguishing parallel from serial displays. Our model yields a mapping of response-time slopes that is similar to Duncan and Humphreys's search surface'' [1989], providing an explicit formulation of their intuitive notion of feature similarity. It presents a neural realization of the processing that may underlie the classical metaphorical explanations of visual search.

Learning to Order Things -- William W. Cohen, Robert E. Schapire, Yoram Singer
There are many applications in which it is desirable to order rather than classify instances. Here we consider the problem of learning how to order, given feedback in the form of preference judgments, i.e., statements to the effect that one instance should be ranked ahead of another. We outline a two-stage approach in which one first learns by conventional means a {\em preference function}, of the form $\mbox{prefer}(u,v)$'', which indicates whether it is advisable to rank $u$ before $v$. New instances are then ordered so as to maximize agreements with the learned preference function. We show that the problem of finding the ordering that agrees best with a preference function is NP-complete, even under very restrictive assumptions. Nevertheless, we describe a simple greedy algorithm that is guaranteed to find a good approximation. We then discuss an on-line learning algorithm, based on the $\mbox{Hedge}$'' algorithm, for finding a good linear combination of ranking experts.'' We use the ordering algorithm combined with the on-line learning algorithm to find a combination of search experts,'' each of which is a domain-specific query expansion strategy for a WWW search engine, and present experimental results that demonstrate the merits of our approach.

Recovering Perspective Pose with a Dual Step EM Algorithm -- Andrew D. J. Cross, Edwin R. Hancock
This paper describes a new approach to extracting 3D perspective structure from 2D point-sets. The novel feature is to unify the tasks of estimating transformation geometry and identifying point-correspondence matches. Unification is realised by constructing a mixture model over the bi-partite graph representing the correspondence match and by effecting optimisation using the EM algorithm. According to our EM framework the probabilities of structural correspondence gate contributions to the expected likelihood function used to estimate maximum likelihood perspective pose parameters. This provides a means of rejecting structural outliers. We evaluate the technique on the matching of different perspective views of 3.5in floppy discs. We provide a sensitivity study based on synthetic data.

#### D

Task and Spatial Frequency Effects on Face Specialization -- Matthew N. Dailey, Garrison W. Cottrell
The double dissociation between prosopagnosia, a face recognition deficit occurring after brain damage, and visual object agnosia, difficulty recognizing other kinds of complex objects, indicates that face and non-face object recognition may be served by partially independent mechanisms in the brain. Such a dissociation could be the result of a competitive learning mechanism that, during development, devotes neural resources to the tasks they are best at performing. Studies of normal adult performance on face and object recognition tasks seem to indicate that face recognition is primarily configural, involving the low spatial frequency information present in a stimulus over relatively large distances, whereas object recognition is primarily featural, involving analysis of the object's parts using local, high spatial frequency information. In a feed-forward computational model of visual processing, two modules compete to classify input stimuli; when one module receives low spatial frequency information and the other receives high spatial frequency information, the low-frequency module shows a strong specialization for face recognition in a combined face identification/object classification task. The series of experiments shows that the fine discrimination necessary for distinguishing members of a visually homogeneous class such as faces relies heavily on the low spatial frequencies present in a stimulus.

Statistical Models of Conditioning -- Peter Dayan, Theresa Long
Recent evidence suggests that dopaminergic neurons in vertebrates report prediction errors during reward based classical and instrumental conditioning. We consider more complicated conditioning paradigms which involve combining predictions from multiple predictive stimuli. We show that our existing model fails to act in accordance with the learning data, and suggest an alternative in which there is attentional selection between different available stimuli. The new model is a form of mixture of experts (Jacobs, Jordan \& Barto, 1991) and is statistically well-founded.

Structure Driven Image Database Retrieval -- Jeremy S. De Bonet, Paul A. Viola
A new algorithm is presented which approximates the perceived visual similarity between images. The images are initially transformed into a feature space which captures visual structure, texture and color using a tree of filters. Similarity is the inverse of the distance in this perceptual feature space. Using this algorithm we have constructed an image database system which can perform example based retrieval on large image databases. Using carefully constructed target sets, which limit variation to only a single visual characteristic, retrieval rates are quantitatively compared to those of standard methods.

A Non-Parametric Multi-Scale Statistical Model for Natural Images -- Jeremy S. De Bonet, Paul A. Viola
The observed distribution of visual images is far from uniform. On the contrary, images have complex and important structure that can be used for image processing, recognition and analysis. There have been many proposed approaches to the principled statistical modeling of images, but each has been limited in either the complexity of the models or the complexity of the images. We present a non-parametric multi-scale statistical model for images that can be used for recognition, image de-noising, and in a generative mode'' to synthesize high quality textures.

Characterizing Neurons in the Primary Auditory Cortex of the Awake Primate Using Reverse Correlation -- R. Christopher deCharms, Michael M. Merzenich
While the understanding of the functional role of different classes of neurons in the awake primary visual cortex has been extensively studied since the time of Hubel and Weisel, our understanding of the basic response selectivity and functional role of neurons in the primary auditory cortex is much farther from complete. In particular, while moving bars are now well-recognized as optimal for many neurons in the primary visual cortex, for an auditory cortical neuron it is not clear how to select the most appropriate stimulus. In this study, we recorded from neurons in the auditory cortex of the awake primate, and used a novel reverse correlation technique to compute the receptive fields (or preferred stimuli) of auditory cortical neurons, encompassing both multiple frequency components and ongoing time. These spectrotemporal receptive fields make clear that neurons in the primary auditory cortex, as in the primary visual cortex, typically show considerable structure, sometimes including multiple excitatory and inhibitory regions in their receptive fields. These neurons can be sensitive to stimulus edges in frequency composition or in time, sensitive to stimulus transitions such as changes in frequency, and sensitive to combinations of stimulus features.

Using Helmholtz Machines to Analyze Multi-channel Neuronal Recordings -- Virginia R. de Sa, R. Christopher deCharms, Michael M. Merzenich
We believe that understanding neural processing will ultimately require observing the response patterns and interaction of large populations of neurons. To that end, we have developed a chronic multi-electrode implant for long term simultaneous recording from multiple neurons. The task of analyzing firing patterns, across time and between different units, arising from this electrode array is a critical and technically challenging task. We discuss a greedy, incremental algorithm from the Helmholtz machine'' family that we are using for automated discovery of ensemble neuronal events. We construct a generative model that attempts to maximize a bound on the likelihood of observed firing patterns. The model is constructed incrementally; each added unit in the model attempts to greedily maximize the likelihood. While not globally optimal, this strategy is computationally tractable. We show encouraging benchmark data on artificially constructed spike trains and promising early results on some real data.

Neural Basis of Object-Centered Representations -- Sophie Deneve, Alexandre Pouget
We present a neural model that can perform eye movements to a particular side of an object regardless of the position and orientation of the object in space, a generalization of a task which has been recently used by Olson and Gettner to investigate the neural structure of object-centered representations. Our model uses an intermediate representation in which units have oculocentric receptive fields -- just like collicular neurons -- whose gain is modulated by the side of the object to which the movement is directed, as well as the orientation of the object. We show that these gain modulations are consistent with Olson and Gettner single cell recordings in the supplementary eye field. This demonstrates that it is possible to perform an object-centered task without a representation involving an object-centered map, viz., without neurons whose receptive fields are defined in object-centered coordinates. We also show that the same approach can account for object-centered neglect, a situation in which patients with a right parietal lesion neglect the left side of objects regardless of the orientation of the objects.

Instabilities in Eye Movement Control: A Model of Periodic -- Ernst R. Dow, Thomas J. Anastasio
Nystagmus is a pattern of eye movement characterized by a smooth rotations of the eye in one direction and rapid rotations in the opposite direction that reset eye position. Periodic alternating nystagmus (PAN) is a form of uncontrollable nystagmus that has been described as an unstable but amplitude-limited oscillation. PAN has been observed previously only in subjects with vestibulo-cerebellar damage. We describe results in which PAN can be produced in normal subjects by prolonged rotation in darkness. We propose a new model in which the neural circuits that control eye movement are inherently unstable, but this instability is kept in check under normal circumstances by the cerebellum. Circumstances which alter this cerebellar restraint, such as vestibulo-cerebellar damage or plasticity due to rotation in darkness can lead to PAN.

#### E

Agnostic Classification of Markovian Sequences -- Ran El-Yaniv, Shai Fine, Naftali Tishby
Classification of finite sequences without explicit knowledge of their statistical origins is of great importance to various applications, such as text categorization and biological sequence analysis. We propose a new information theoretic algorithm for this problem based on two important ingredients. The first, motivated by the two sample problem'', provides an (almost) optimal model independent criterion for comparing two statistical sources. The second ingredient is a novel idea for agnostic statistical merging'' of sequences in an asymptotically optimal way.

A General Purpose Image Processing Chip: Orientation Detection -- Ralph Etienne-Cummings, Donghui Cai
A 80x78 pixel general purpose vision chip for spatial focal plane processing is presented. The size and configuration of the processing receptive field are programmable. The chip's architecture allows the photoreceptor cells to be small and densely packed by performing all computation on the read-out, away from the array. In addition to the raw intensity image, the chip outputs four processed images in parallel. Also presented is an application of the chip to line segment orientation detection, as found in the retinal receptive fields of toads.

#### F

Ensemble and Modular Approaches for Face Detection: A Comparison -- Rapha\"el Feraud, Olivier Bernier
A new learning model based on autoassociative neural networks is developed and applied to face detection. To extend the detection ability in orientation and to decrease the number of false alarms, different combinations of networks are tested: ensemble, conditional ensemble and conditional mixture of networks. The use of a conditional mixture of networks allows to obtain state of the art results on different benchmark face databases.

Hippocampal Model of Rat Spatial Abilities Using Temporal Difference Learning -- David J. Foster, Richard G. M. Morris, Peter Dayan
We present a model of the hippocampus dependent performance of rats in a delayed match-to-place spatial watermaze learning task. Real rats show slow learning in the first few days of training; but ultimately exhibit one-trial learning. We model the early performance using a temporal difference (TD) reinforcement learning architecture, using place cells in the hippocampus as the representation of space. We model late performance using a coordinate learning system which also employs TD learning and place cells. We show a simple integration of these two navigation models.

Bayesian Model of Surface Perception -- William T. Freeman, Paul A. Viola
Image intensity variations can result from several different object surface effects, including shading from 3-dimensional relief of the object, or paint on the surface itself. An essential problem in vision, which people solve naturally, is to attribute the proper physical cause, e.g. surface relief or paint, to an observed image. We addressed this problem with an approach combining psychophysical and Bayesian computational methods.

Regularisation in Sequential Learning Algorithms -- Jo\~ao F. G. de Freitas, Mahesan Niranjan, Andrew H. Gee
In this paper we discuss regularisation in online/sequential learning algorithms. In environments where data arrives sequentially, techniques such as cross-validation to achieve regularisation or model selection are not possible. Further, bootstrapping to determine a confidence level is not practical. To surmount these problems, a minimum variance estimation approach that makes use of the extended Kalman algorithm for training multi-layer perceptrons is employed. The novel contribution of this paper is to show the theoretical links between extended Kalman filtering, Sutton's variable learning rate algorithms and Mackay's Bayesian estimation framework. In doing so, we propose algorithms to overcome the need for heuristic choices of the initial conditions and noise covariance matrices in the Kalman approach.

A Revolution: Belief Propagation in Graphs with Cycles -- Brendan J. Frey, David J. C. MacKay
Until recently, artificial intelligence researchers have frowned upon the application of probability propagation in Bayesian belief networks that have cycles. The probability propagation algorithm is only exact in networks that are cycle-free. However, it has recently been discovered that the two best practical eror-correcting decoding algorithms are actually performing probability propagation in belief networks with cycles.

#### G

Features as Sufficient Statistics -- Davi Geiger, Archisman Rudra, Laurance T. Maloney
An image is often represented by a set of detected features. We get an enormous compression by representing images in this way. Furthermore, we get a representation which is little affected by small amounts of noise in the image. However, features are typically chosen in an ad hoc manner. We show how a good set of features can be obtained using sufficient statistics. The idea of sparse data representation naturally arises. We treat the 1-dimensional and 2-dimensional signal reconstruction problem to make our ideas concrete.

Hierarchical Non-linear Factor Analysis and Topographic Maps -- Zoubin Ghahramani, Geoffrey E. Hinton
We first describe a hierarchical, generative model that can be viewed as a non-linear generalization of factor analysis and can be implemented in a neural network. The model performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections. These connections can be learned using simple rules that require only locally available information. We then show how to incorporate non-adaptive lateral connections into the generative model. The model extracts a sparse, distributed, hierarchical representation of depth from simplified random-dot stereograms and the localized disparity detectors in the first hidden layer form a topographic map.

Regression with Input-dependent Noise: A Gaussian Process Treatment -- Paul W. Goldberg, Christopher K. I. Williams, Christopher M. Bishop
Gaussian processes provide natural non-parametric prior distributions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth function of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that the posterior distribution of the noise rate can be sampled using Gibbs sampling. Our results on a synthetic data set give a posterior variance that well-approximates the true variance. We also show that the predictive likelihood of a test data set approximates the true likelihood better under this model than under a uniform noise model.

Generalization in Decision Trees and DNF: Does Size Matter? -- Mostefa Golea, Peter Bartlett, Wee Sun Lee, Llew Mason
Recent theoretical results for pattern classification with thresholded real-valued functions (such as support vector machines, sigmoid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions). For example, we show that with high probability any decision tree of depth no more than $d$ that is consistent with $m$ training examples has misclassification probability no more than $O \left( \left( \frac{1}{m} \left( N_{\mbox{eff}} \, \overline{d} \, \mbox{VC}_{\mbox{dim}} ({\cal U}) \, \log^2 m \log d\right)\right)^{1/2}\right)$, where ${\cal U}$ is the class of node decision functions, $N_{\mbox{eff}} \le N$ can be thought of as the effective number of leaves (it becomes small as the distribution on the leaves induced by the training data gets far from uniform), and $\overline{d}$ is the average leaf depth (averaged over the training data). This bound is qualitatively different from the VC bound and can be considerably smaller.

A Mathematical Model of Axon Guidance by Diffusible Factors -- Geoffrey J. Goodhill
In the developing nervous system, gradients of target-derived diffusible factors play an important role in guiding axons to appropriate targets. What shape does a gradient set up by diffusion have, and over what range is it capable of guiding axons? A simple mathematical model suggests the answer is about 1 mm. This value fits well with experimental data.

Gradients for Retinotectal Mapping -- Geoffrey J. Goodhill
The initial activity-independent formation of a topographic map in the retinotectal system has long been thought to rely on the matching of molecular cues expressed in gradients in the retina and the tectum. However, direct experimental evidence for the existence of such gradients has only emerged in the past two years. The new data has provoked the discussion of a new set of models in the experimental literature. Here, these models are subjected to a rigourous theoretical analysis.

Linear Concepts and Hidden Variables: An Empirical Study -- Adam J. Grove, Dan Roth
A fundamental issue the arises in deciding which approach to choose is the robustness of the density estimation approach to deviations from the postulated model. We study this question experimentally in a restricted, yet non-trivial and interesting case: the model considered here is a {\em conditionally independent attribute\/} (CIA) model which postulates a single binary-valued {\em hidden variable\/} $z$ on which all other attributes (i.e. the target and the observables) depend. We observe that this model has the property that finding the most likely value of any one variable (given known values for the others) reduces to testing a linear function of the observed values.

Detection of First and Second Order Motion -- Alexander Grunewald, Heiko Neumann
A model of motion detection is presented. The model contains three stages. The first stage is unoriented and is selective for contrast polarities. The next two stages work in parallel. A phase {\em insensitive} stage pools across different contrast polarities through a spatiotemporal filter and thus can detect first and second order motion. A phase {\em sensitive} stage keeps contrast polarities separate, each of which is filtered through a spatiotemporal filter, and thus only first order motion can be detected. Differential phase sensitivity can therefore account for the detection of first and second order motion. Phase insensitive detectors correspond to cortical complex cells, and phase sensitive detectors to simple cells.

#### H

A Neural Network Model of Naive Preference and Filial Imprinting in the Domestic Chick -- Lucy E. Hadden
Filial imprinting in domestic chicks is of interest in psychology, biology, and computational modeling because it exemplifies simple, rapid, innately programmed learning which is biased toward learning about some objects. Horn et al.\ have recently discovered a naive visual preference for heads and necks which develops over the course of the first three days of life. These interactions are reminiscent of those in human language acquisition. The neurological basis of this predisposition is almost entirely unknown; that of imprinting-related learning is fairly clear. This project is the first model of the predisposition consistent with what is known about learning in imprinting. The model develops the predisposition appropriately, learns to approach'' a training object, and replicates one interaction between the two processes. Future work will replicate more interactions between imprinting and the predisposition in chicks, and analyze why the system works.

An Improved Policy Iteration Algorithm for Partially Observable MDPs -- Eric A. Hansen
A policy iteration algorithm for partially observable Markov decision processes is described that is simpler and more efficient than an earlier policy iteration algorithm of Sondik. The key simplification is the representation of a policy as a finite-state controller. The dynamic-programming update used in the policy improvement step is interpreted as the transformation of a finite-state controller into an improved finite-state controller. Empirical testing shows that this policy iteration algorithm outperforms value iteration on a range of examples.

An Analog VLSI Model of the Fly Elementary Motion Detector -- Reid R. Harrison, Christof Koch
Flies are capable of rapidly detecting and integrating visual motion information in behaviorly-relevant ways. The first stage of visual motion processing in flies is a retinotopic array of functional units known as elementary motion detectors (EMDs). Several decades ago, Reichardt and colleagues developed a correlation-based model of motion detection that described the behavior of these neural circuits. We have implemented a variant of this model in a 2.0-micron analog CMOS VLSI process. The result is a low-power, continuous-time analog circuit with integrated photoreceptors that responds to motion in real time. The responses of the circuit to drifting sinusoidal gratings qualitatively resemble the temporal frequency response, spatial frequency response, and direction selectivity of motion-sensitive neurons observed in insects. In addition to its possible engineering applications, the circuit could potentially be used as a building block for constructing hardware models of higher-level insect motion integration.

Classification by Pairwise Coupling -- Trevor Hastie, Robert Tibshirani
We discuss a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together. The coupling model is similar to the Bradley-Terry method for paired comparisons. We study the nature of the class probability estimates that arise, and examine the performance of the procedure in real and simulated datasets. Classifiers used include linear discriminants, nearest neighbors, and the support vector machine.

Unsupervised On-line Learning of Decision Trees for Hierarchical Data Analysis -- Marcus Held, Joachim M. Buhmann
An adaptive on--line algorithm is proposed to estimate hierarchical data structures for non--stationary data sources. The approach is based on the principle of {\it minimum cross entropy} to derive a decision tree for data clustering and on a {\it metalearning} (learning of learning) idea to adapt to changing data characteristics. Its efficiency is demonstrated by grouping non--stationary artifical data and by hierarchical segmentation of LANDSAT images.

A Simple and Fast Neural Network Approach to Stereovision -- Rolf D. Henkel
A neural network approach to stereovision is presented based on aliasing effects of simple disparity estimators and a fast coherence-detection scheme. Using only realistic neural circuitry, the network calculates dense disparity maps with an associated verification count. It simultaneously fuses the left and right input images into a single cyclopean view.

Selecting Weighting Factors in Logarithmic Opinion Pools -- Tom Heskes
We study the combination of experts' probability assessments in a logarithmic opinion pool. The error of the pool is shown to be smaller or equal than the average error of individual experts. A decomposition in terms of individual errors of and distances between experts leads to a robust and relatively straightforward procedure for selecting weighting factors. Our general description includes regression models, models for computing variances, and classification models.

A 1,000-Neuron System with One Million 7-bit Physical Interconnections -- Yuzo Hirai
An asynchronous PDM (Pulse-Density-Modulating) digital neural network system has been developed in our laboratory. It consists of one thousand neurons that are physically interconnected via one million 7-bit synapses. It can solve one thousand simultaneous nonlinear first-order differential equations in a fully parallel and continuous fashion. The performance of this system was measured by a winner-take-all network with one thousand neurons. Although the magnitude of the input and network parameters were identical for each competing neuron, one of them won in 6 milliseconds because of the asynchronous behavior of the system. A broad range of neural networks including spatiotemporal filtering, feedforward, and feedback networks can be run by loading appropriate network parameters from a host system.

MELONET I: Neural Nets for Inventing Baroque-Style Chorale Variations -- Dominik Hörnel
MELONET I is a multi-scale neural network system producing baroque-style melodic variations. Given a melody, the system invents a four-part chorale harmonization and a variation of any chorale voice, after being trained on music pieces of composers like J. S. Bach and J. Pachelbel. Unlike earlier approaches to the learning of melodic structure, the system is able to learn and reproduce high-order structure like harmonic, motif and phrase structure in melodic sequences. This is achieved by using mutually interacting feedforward networks operating at different time scales, in combination with Kohonen networks to classify and recognize musical structure. The results are chorale partitas in the style of J. Pachelbel. Their quality has been judged by experts to be comparable to improvisations invented by an experienced human organist.

Nonlinear Markov Networks for Continuous Variables -- Reimar Hofmann, Volker Tresp
In this paper we address the problem of learning the structure in nonlinear Markov networks with continuous variables. Markov networks are well suited to model relationships which do not exhibit a natural causal ordering. We represent the quantitative relationships between variables using neural networks as models for conditional probability densities. This approach is well suited for inference by Gibbs sampling. Using a financial and a sociological data set we show that interesting structures can be found using our approach.

Active Data Clustering -- Thomas Hofmann, Joachim M. Buhmann
Active data clustering is a novel technique for clustering of proximity data which utilizes principles from sequential experiment design in order to interleave data generation and data analysis. The proposed active data sampling strategy is based on the expected value of information, a concept rooted in statistical decision theory. This is considered as an important step towards the analysis of large-scale data sets, because it offers a way to overcome the inherent data sparseness of proximity data. We present applications to unsupervised texture segmentation in computer vision and information retrieval in document databases.

Computing with Action Potentials {\rm\bf (Invited Talk)} -- John J. Hopfield, Carlos D. Brody, Sam Roweis
Most computational engineering based loosely on biology, whether algorithmic or hardware, uses continuous variables and sigmoid units. Most neurons communicate with action potentials, and the engineering view is equivalent to using a rate-code for the representation of information and computing. There are an increasing number of examples in biology where this may not be the case. Information can be represented using the timing of action potentials, and computed with in this representation. The analog match'' problem of olfaction is one of the simplest problems which can be efficiently solved using action potential timing and an underlying rhythm. By using units which adapt, and using the adaptation to effect a fundamental change of representation of a problem, the problem of recognizing words in connected speech with uniform time-warp within a word can be mapped into the olfaction problem. The architecture and preliminary results of such a system (with Sam T. Roweis) will be described. Using the fast events of biology in conjunction with an underlying rhythm gets away from the limits of an event-driven view of computation. When the intrinsic hardware is much faster than the time scale of change of inputs, it can greatly increase the computation which can be done per unit time on a given quantity of hardware.

Adaptation in Speech Motor Control -- John F. Houde, Michael I. Jordan
Human subjects are known to adapt their motor behavior to a shift of the visual field brought about by wearing prism glasses over their eyes. We have studied the analog of this phenomenon in the speech domain. Utilizing a device that can feed back transformed speech signals in real time, we exposed subjects to phonetically sensible, on-line perturbations of their own speech patterns. We found that speakers learn to adjust their production of a vowel to compensate for feedback alterations that change the vowel's perceived phonetic identity; moreover, the effect generalizes across phonetic contexts. This phenomenon provides a new tool for probing the nature of the sensorimotor control system underlying human speech production.

Function Approximation with the Sweeping Hinge Algorithm -- Don R. Hush, Fernando Lozano, Bill Horne
We present a computationally efficient algorithm for function approximation with piecewise linear sigmoidal nodes. A one hidden-layer network is constructed one node at a time using the method of fitting the residual. The task of fitting individual nodes is accomplished using a new algorithm that searchs for the best fit by solving a sequence of Quadratic Programming problems. This approach offers significant advantages over derivative--based search algorithms (e.g. backpropagation and its extensions). Unique characteristics of this algorithm include: finite step convergence, a simple stopping criterion, a deterministic methodology for seeking good'' local minima, good scaling properties and a robust numerical implementation.

New Approximations of Differential Entropy for Independent Component Analysis and Projection Pursuit -- Aapo Hyv\"{a}rinen
We derive a first-order approximation of the density of maximum entropy for a continuous 1-D random variable, given a number of simple constraints. This results in a density expansion which is somewhat similar to the classical polynomial density expansions by Gram-Charlier and Edgeworth. Using this approximation of density, an approximation of 1-D differential entropy is derived. The approximation of entropy is both more exact and more robust against outliers than the classical approximation based on the polynomial density expansions, without being computationally more expensive. The approximation has applications, for example, in independent component analysis and projection pursuit.

#### I

A Model of Early Visual Processing -- Laurent Itti, Jochen Braun, Dale K. Lee, Christof Koch
We propose a model for early visual processing in primates. The model consists of a population of linear spatial filters which interact through non-linear excitatory and inhibitory pooling. Statistical estimation theory is then used to derive human psychophysical thresholds from the responses of the entire population of units. The model is able to reproduce human thresholds for contrast and orientation discrimination tasks, and to predict contrast thresholds in the presence of masks of varying orientation and spatial frequency.

#### J

The Error Coding and Substitution PaCTs -- Gareth James, Trevor Hastie
A new class of plug in classification techniques have recently been developed in the statistics and machine learning literature. A plug in classification technique (PaCT) is a method that takes a standard classifier (such as LDA or TREES) and plugs it into an algorithm to produce a new classifier. The standard classifier is known as the Plug in Classifier (PiC). These methods often produce large improvements over using a single classifier. In this paper we investigate one of these methods and give some motivation for its success.

Wavelet Models for Video Time-Series -- Sheng Ma, Chuanyi Ji
In this work, we tackle the problem of time-series modeling posed by recent discoveries on variable bit rate video traffic and Ethernet data traffic, and establish a wavelet model on the time-series. Different from the existing methods which model the time-series in the time domain, we model the wavelet coefficients in the wavelet domain. The strength of the wavelet model includes (1) a unified approach to model both the long-range and the short-range dependence in the video and data traffic simultaneously, $(2)$ a computationally efficient method of developing the model and generating high quality video traffic, and $(3)$ feasibility of performance analysis using the model. Analysis and insights will be given to show why wavelets provide excellent models for the time-series.

Extended ICA Removes Artifacts from Electroencephalographic Recordings -- Tzyy-Ping Jung, Colin Humphries, Te-Won Lee, Scott Makeig, Martin J. McKeown, Vicente Iragui, Terrence J. Sejnowski
Severe contamination of electroencephalographic (EEG) activity by eye movements, blinks, and muscle, heart and line noise presents a serious problem for EEG interpretation and analysis. Rejecting contaminated EEG segments results in a considerable loss of data and may be impractical for clinical data. Many methods have been proposed to remove eye movement and blink artifacts from EEG recordings. Often regression in the time or frequency domain is performed on simultaneous EEG and electrooculographic (EOG) recordings to derive parameters characterizing the appearance and spread of EOG artifacts in the EEG channels. However, EOG also carries brain signal (Peters, 1967; Oster \& Stern, 1980), so regressing out eye artifacts inevitably involves subtracting a portion of the relevant EEG signal from each record as well. This method cannot be used for muscle noise or line noise for which there is no reference channel for regression. Here, we propose a new and generally applicable method for removing a wide variety of artifacts from EEG records. The method is based on an extended version of an Independent Component Analysis (ICA) algorithm (Bell \& Sejnowski, 1995) for performing blind source separation on linear mixtures of independent source signals that can be sub- or super-Gaussian. Our results show that ICA can effectively detect, separate and remove the activity of a wide variety of artifactual sources in EEG records, with results comparing favorably to those obtained using regression methods.

#### K

Boltzmann Machine Learning Using Mean Field Theory and Linear Response Correction -- Hilbert J. Kappen, F. B. Rodr\'{\i}guez
The learning process in Boltzmann Machines is computationally intractible. We present a new approximate learning algorithm for Boltzmann Machines, which is based on mean field theory and the linear response theorem. The computational complexity of the algorithm is cubic in the number of neurons.

S-Map: A Network with a Simple Self-Organization Algorithm for Generative Topographic Mappings -- Kimmo Kiviluoto, Erkki Oja
The S-Map is a network with a simple learning algorithm that combines the self-organization capability of the Self-Organizing Map (SOM) and the probabilistic interpretability of the Generative Topographic Mapping (GTM). The algorithm is shown to minimize the same error function as the GTM -- the negative log likelihood -- but when compared to the GTM, the S-Map seems to have a stronger tendency to self-organize from random initial configuration. The S-Map algorithm can be further simplified to employ pure Hebbian learning, without changing the qualitative behaviour of the network.

Relative Loss Bounds for Multidimensional Regression Problems -- Jyrki Kivinen, Manfred K. Warmuth
We study on-line generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the softmax function that need to consider the linear activations to all the output neurons. We also use a parameterization function which transforms parameter vectors maintained by the algorithm into the actual weights. The on-line algorithm we consider updates the parameters in an additive manner, analogous to the delta rule, but because the actual weights are obtained via the possibly nonlinear parameterization function they may behave in a very different manner. Our approach is based on applying the notion of a matching loss function in two different contexts. First, we measure the loss of the algorithm in terms of the loss that matches the transfer function used to produce the outputs. Second, the loss function that matches the parameterization function can be used both as a measure of distance between models in motivating the update rule of the algorithm and as a potential function in analyzing its relative performance compared to an arbitrary fixed model. As a result, we have a unified treatment that generalizes earlier results for the Gradient Descent and Exponentiated Gradient algorithms to multidimensional outputs, including multiclass logistic regression.

Analysis of Drifting Dynamics with Neural Network Hidden Markov Models -- Jens Kohlmorgen, Klaus-Robert M\"{u}ller, Klaus Pawelzik
A method for the analysis of nonstationary time series with multiple operating modes is presented. In particular, it is possible to detect and to model a switching of the dynamics and also a less abrupt, time consuming drift from one mode to another. This is achieved by an unsupervised algorithm that segments the data according to inherent modes, and a subsequent search through the space of possible drifts. An application to physiological wake/sleep data demonstrates that analysis and modeling of real-world time series can be improved when the drift paradigm is taken into account. In the case of wake/sleep data, we hope to gain more insight into the physiological processes that are involved in the transition from wake to sleep.

Asymptotic Theory for Regularization: One-Dimensional Linear Case -- Petri Koistinen
The generalization ability of a neural network can sometimes be improved dramatically by regularization. To analyze the improvement one needs more refined results than the asymptotic distribution of the weight vector. We study the simple case of one-dimensional linear regression, where we derive expansions for the optimal regularization parameter and the ensuing improvement. It is possible to construct examples where it is best to use no regularization.

Perturbative M-Sequences for Auditory Systems Identification -- Mark Kvale, Christoph E. Schreiner
In this paper we present a new method for studying auditory systems based on m-sequences. The method allows us to perturbatively study the linear response of the system in the presence of various other stimuli, such as speech or sinusoidal modulations. This allows one to construct linear kernels (receptive fields) at the same time that other stimuli are being presented. Using the method we calculate the modulation transfer function of single units in the inferior colliculus of the cat at different operating points and discuss nonlinearities in the response.

#### L

Learning Human-like Knowledge by Singular Value Decomposition: A Progress Report -- Thomas K. Landauer, Darrell Laham, Peter Foltz
Singular value decomposition (SVD) can be viewed as a method for unsupervised training of a network that associates two classes of events reciprocally by linear connections through a single hidden layer. SVD was used to learn and represent relations among very large numbers of words (20k-60k) and very large numbers of natural text passages (1k-70k) in which they occurred. The result was 100-350 dimensional semantic spaces'' in which any trained or newly added word or passage could be represented as a vector, and similarities were measured by the cosine of the contained angle between vectors. Good accuracy in simulating human judgments and behaviors has been demonstrated by performance on multiple-choice vocabulary and domain knowledge tests, emulation of expert essay evaluations, and in several other ways. Examples are also given of how the kind of knowledge extracted by this method can be applied.

A Generic Approach for Identification of Event Related Brain Potentials via a Competitive Neural Network Structure -- Daniel H. Lange, Hava T. Siegelmann, Hillel Pratt, Gideon F. Inbar
We present a novel generic approach to the problem of Event Related Potential identification and classification, based on a competitive Neural Net architecture. The network weights converge to the embedded signal patterns, resulting in the formation of a matched filter bank. The network performance is analyzed via a simulation study, exploring identification robustness under low SNR conditions and compared to the expected performance from an information theoretic perspective. The classifier is applied to real event-related potential data recorded during a classic odd-ball type paradigm; for the first time, within-session variable signal patterns are automatically identified, dismissing the strong and limiting requirement of a-priori stimulus-related selective grouping of the recorded data. The results present new possibilities in evoked potential research.

A Neural Network Based Head Tracking System -- Daniel D. Lee, H. Sebastian Seung
We have constructed an inexpensive, video-based, motorized tracking system that learns to track a head. It uses real time graphical user input as a supervisory signal to train a convolutional neural network. The inputs to the neural network consist of normalized luminance and chrominance images and motion information from frame differences. Subsampled images are also used to provide scale invariance. During the online training phase, the neural network adjusts the input weights depending upon the reliability of the different channels in the surrounding environment. This allows the system to robustly track a head even when other objects are moving within a cluttered background.

Two Approaches to Optimal Annealing -- Todd K. Leen, Bernhard Schottky, David Saad
In recent years, two theoretical approaches have been brought to bear on the analysis of on-line learning. The master equation describes the evolution of the probability density over weight space, i.e. the density for the model parameters. Approximations to the master equation provide partial differential equations for the evolution of the density. From this density, objects of experimental interest including the generalization error, and moments of the weight error can be computed.

Multi-modular Associative Memory -- Nir Levy, David Horn, Eytan Ruppin
Motivated by the findings of modular structure in the cortex, we study a multi-modular model of associative memory that successfully stores memory patterns with different levels of activity. We differentiate between two types of synaptic efficacies, intra-modular and inter-modular ones. The latter signals undergo an additional nonlinear processing before reaching the soma. Compared with the conventional, single-module associative memory network, this multi-modular network has two main advantages: it is less susceptible to damage to columnar input, and its response is more consistent with the cognitive data pertaining to category specific impairments.

Inferring Sparse, Overcomplete Image Codes Using an Efficient Coding Framework -- Michael S. Lewicki, Bruno A. Olshausen
We apply a general technique for learning overcomplete bases (Lewicki and Sejnowski, 1997) to the problem of finding efficient image codes. The bases learned by the algorithm are localized, oriented, and bandpass, consistent with earlier results obtained using different methods (Olshausen and Field, 1996; Bell and Sejnowski 1997). We show that higher degrees of overcompleteness produce bases which have much greater likelihood and results in a Gabor-like basis with greater sampling density in position, orientation, and scale. This framework also allows different bases to be compared objectively by calculating their probability given the observed data. Compared to the complete and overcomplete Fourier and wavelet bases, the learned bases have much greater probability and thus have the potential to yield better coding efficiency. We demonstrate the improvement in the representation of the learned bases by showing superior noise reduction properties.

Learning Nonlinear Overcomplete Representations for Efficient Coding -- Michael S. Lewicki, Terrence J. Sejnowski
We derive a learning algorithm for inferring an overcomplete basis by viewing it as probabilistic model of the observed data. Overcomplete bases allow for better approximation of the underlying statistical density. Using a Laplacian prior on the basis coefficients removes redundancy and leads to representations that are sparse and are a \emph{nonlinear} function of the data. This can be viewed as a generalization of the technique of independent component analysis and provides a method for blind source separation of fewer mixtures than sources. We demonstrate the utility of overcomplete representations on natural speech and show that compared to the traditional Fourier basis the inferred representations potentially have much greater coding efficiency.

Visual Navigation in a Robot Using Zig-Zag Behavior -- M. Anthony Lewis
We implement a model of obstacle avoidance in flying insects on a small, monocular robot. The result is a system that is capable of rapid navigation through a dense obstacle field. The key to the system is the use of zig-zag behavior to articulate the body during movement. It is shown that this behavior compensates for a parallax blind spot surrounding the focus of expansion normally found in systems without zig-zag behavior. The system models the cooperation of several behaviors: halter-ocular response (similar to VOR), optomotor response, and the parallax field computation and mapping to motor system itself. The resulting system is neurally plausible, simple, and should be easily hostable on aVLSI hardware.

Factorizing Multivariate Function Classes -- Juan K. Lin
The problem of local independent component analysis is formulated and solved in this paper. In contrast to iterative non--local parametric density estimation algorithms for source separation, here only local information is used and an analytic solution for source separation is obtained. The local nature of this approach allows for separation of general smooth invertible mixing transformations, while the analytic nature allows for a proper treatment of source separation in the presence of uncertainty. The mathematical framework presented extends independent component analysis to a general factorization of multivariate functions. Numerical examples and a tie to decorrelation are presented.

Silicon Retina with Adaptive Filtering Properties -- Shih-Chii Liu
This paper describes a small, compact circuit that captures the adaptation properties of the photoreceptor layer and the adaptive temporal filtering properties of the laminar layer of the fly. This circuit uses only six transistors and two capacitors, and is operated in the subthreshold domain. The circuit has a low DC gain and a high transient gain. The adaptation time constant of the circuit can be controlled via an external bias. Its temporal filtering properties change with the background intensity or signal-to-noise conditions. The frequency response of the circuit shows that in the frequency range of 1 to 100 Hz, the circuit response goes from highpass filtering under high light levels to lowpass filtering under low light levels (i.e., when the signal-to-noise ratio is low). A chip with $20\times20$ pixels has been fabricated in 1.2 $\mu$m n-well technology.

2D Observers for Human 3D Object Recognition? -- Zili Liu, Daniel Kersten
Converging evidence has shown that human object recognition depends on observers' familiarity with objects' appearance. The more similar the objects are, the stronger this dependence will be, and the more important two-dimensional (2D) image information will be. The degree to which 3D structural information is used, however, remains an area of strong debate. Previously, we showed that all models that allow rotations in the image plane of independent 2D templates could not account for human performance in discriminating novel object views. We now present results from models of generalized radial basis functions (GRBF), 2D nearest neighbor matching that allows 2D affine transformations, and a Bayesian statistical estimator that integrates over all possible 2D affine transformations. The performance of the human observers relative to each of the models is better for the novel views than for the template views, suggesting that humans generalize better to novel views from template views. The Bayesian estimator yields the optimal performance with 2D affine transformations and independent 2D templates. Therefore, no models of 2D affine operations with independent 2D templates account for the human performance.

Effects of Spike Timing Underlying Binocular Integration and Rivalry in a Neural Model of Early Visual Cortex -- Erik D. Lumer
A neural network that models the mammalian early visual system is introduced to gain insight into possible mechanisms operating during binocular vision. In the model, the desynchronized firing of cortical neurons that first combine inputs from the two eyes produces rivalrous activity patterns at later stages in the visual pathway. By contrast, synchronization of firing among these cells prevents such competition. The temporal coordination of cortical activity and its effects on neural competition emerge naturally from the network connectivity and from its dynamics. These results suggest that input-related differences in relative spike timing at an early stage of visual processing may give rise to the phenomena both of perceptual integration and rivalry in binocular vision.

#### M

Dynamic Stochastic Synapses as Computational Units -- Wolfgang Maass, Anthony M. Zador
In most neural network models, synapses are treated as static weights that change only on the slow time scales of learning. In fact, however, synapses are highly dynamic, and show use-dependent plasticity over a wide range of time scales. Moreover, synaptic transmission is an inherently stochastic process: a spike arriving at a presynaptic terminal triggers release of a vesicle of neurotransmitter from a release site with a probability that can be much less than one. Changes in release probability represent one of the main mechanisms by which synaptic efficacy is modulated in neural circuits.

Synaptic Transmission: An Information-Theoretic Perspective -- Amit Manwani, Christof Koch
Here we analyze synaptic transmission from an information-theoretic perspective. We derive closed-form expressions for the lower-bounds on the capacity of a simple model of a cortical synapse under two explicit coding paradigms. Under the signal estimation'' paradigm, we assume the signal to be encoded in the mean firing rate of a Poisson neuron. The performance of an optimal linear estimator of the signal then provides a lower bound on the capacity for signal estimation. Under the signal detection'' paradigm, the presence or absence of the signal has to be detected (Yes-No task). Performance of the optimal spike detector allows us to compute a lower bound on the capacity for signal detection. We find that single synapses (for empirically measured parameter values) transmit information poorly but significant improvement can be achieved with a small amount of redundancy.

Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks -- Peter Marbach, Oliver Mihatsch, Miriam Schulte, John N. Tsitsiklis
In integrated service communication networks, an important problem is to excercise call admission control and routing so as to optimally use the network resources. This problem is naturally formulated as a dynamic programming problem, which, however, is too complex to be solved exactly. We use methods of reinforcement learning (RL), together with a decomposition approach, to find call admission control and routing policies. The performance of our policy for a network with approximately $10^{45}$ states is compared with a commonly used heuristic policy.

A Framework for Multiple-Instance Learning -- Oded Maron, Tom\'{a}s Lozano-P\'{e}rez
Multiple-instance learning is a variation on supervised learning, where the task is to learn a concept given positive and negative bags of instances. Each bag may contain many instances, but a bag is labeled positive even if only one of the instances in it falls within the concept. A bag is labeled negative only if all the instances in it are negative. We describe a new general framework, called Diverse Density, for solving multiple-instance learning problems. We apply this framework to learn a simple description of a person from a series of images (bags) containing that person, to a stock selection problem, and to the drug activity prediction problem.

An Application of Reversible-Jump MCMC to Multivariate Spherical Gaussian Mixtures -- Alan D. Marrs
Applications of Gaussian mixture models occur frequently in the fields of statistics and artificial neural networks. One of the key issues arising from any mixture model application is how to estimate the optimum number of mixture components. This paper extends the Reversible-Jump Monte Carlo Markov Chain algorithm to the case of multivariate spherical Gaussian mixtures using a hierarchical prior model. Using this method the number of mixture components is no longer fixed but becomes a parameter of the model which we shall estimate. The reversible-jump MCMC algorithm is capable of moving between parameter subspaces which correspond to models with different numbers of mixture components. As a result a sample from the full joint distribution of all unknown model parameters is generated. The technique is then demonstrated on a well known classification problem.

Estimating Dependency Structure as a Hidden Variable -- Marina Meil\u{a}, Michael I. Jordan
This paper introduces the Mixture of Trees, a probability model that can account for sparse, but dynamically changing dependence relationships between the variables of the domain under study. We present a family of efficient algorithms that use EM and the Maximum Spanning Tree algorithm to find the Maximum Likelihood and the MAP Mixture of Trees for a variety of priors, including the Dirichlet and the MDL priors.

Structural Risk Minimization for Nonparametric Time Series Prediction -- Ron Meir
The problem of time series prediction is studied within the uniform convergence framework of Vapnik and Chervonenkis. The dependence inherent in the temporal structure is incorporated into the analysis, thereby generalizing the available theory for memoryless processes. Finite sample bounds are calculated in terms of covering numbers of the approximating class, and the tradeoff between approximation and estimation is discussed. Vapnik's structural risk minimization procedure is applied to achieve consistency, and convergence rates are established within a nonparametric setting.

Toward a Single-Cell Account for Binocular Disparity Tuning: An Energy Model May Be Hiding in Your Dendrites -- Bartlett W. Mel, Daniel L. Ruderman, Kevin A. Archie
Hubel and Wiesel (1962) proposed that complex cells in visual cortex are driven by a pool of simple cells with the same preferred orientation but different spatial phases. However, a wide variety of experimental results over the past two decades have challenged the pure hierarchical model, primarily by demonstrating that many complex cells receive monosynaptic input from unoriented LGN cells, or do not depend on simple cell input. We recently showed using a detailed biophysical model that nonlinear interactions among synaptic inputs to an excitable dendritic tree could provide the nonlinear subunit computations that underlie complex cell responses (Mel, Ruderman, \& Archie 1997). Our present work extends this result to the case of complex cell binocular disparity tuning, by demonstrating in an isolated model pyramidal cell (1) disparity tuning at a resolution much finer than the the overall dimensions of the cell's receptive field, and (2) systematically shifted optimal disparity values for rivalrous pairs of light and dark bars---both in good agreement with published reports (Ohzawa, Deangelis, \& Freeman, 1997). Our results reemphasize the potential importance of intradendritic computation for binocular visual processing in particular, and for cortical neurophysiology in general.

Combining Classifiers Using Correspondence Analysis -- Christopher J. Merz
Several effective methods for improving the performance of a single learning algorithm have been developed recently. The general approach is to to create a set of learned models by repeatedly applying the algorithm to different versions of the training data, and then combine the learned models' predictions according to a prescribed voting scheme. Little work has been done in combining the predictions of a collection of models generated by many learning algorithms having different representation and/or search strategies. This paper describes a method which uses the strategies of stacking and correspondence analysis to model the relationship between the learning examples and the way in which they are classified by a collection of learned models. A nearest neighbor method is then applied within the resulting representation to classify previously unseen examples. The new algorithm consistently performs as well or better than other combining techniques on a suite of data sets.

Serial Order in Reading Aloud: Connectionist Models and Neighborhood Structure -- Jeanne C. Milostan, Garrison W. Cottrell
Dual-Route and Connectionist Single-Route models of reading have been at odds over claims as to the correct explanation of the reading process. Recent Dual-Route models predict that subjects should show an increased naming latency for irregular words when the irregularity is earlier in the word - a prediction that has been confirmed in human experiments. Coltheart \& Rastle (1994) claim that Single-Route connectionist models cannot account for this latency by position-of-irregularity interaction. A refutation of this claim is presented here, consisting of network models which do show the interaction, along with orthographic neighborhood statistics that explain the effect.

Learning Path Distributions Using Nonequilibrium Diffusion Networks -- Paul Mineiro, Javier Movellan, Ruth J. Williams
Diffusion networks are a natural extension of recurrent neural networks in which the dynamics are probabilistic. In this paper we derive the gradient of the log-likelihood of a path with respect to the drift parameters for a diffusion network. This gradient can be used to optimize a diffusion network in the nonequilibrium regime for a wide variety of problems, including reinforcement learning, filtering and prediction, signal detection, and continuous path density estimation. An aspect of our work which is of interest to computational neuroscience and hardware design is that the obtained gradient is local in space and time, i.e., no time unfolding, backpropagation of error signals, or Boltzmann phases are required.}

Automated Aircraft Recovery via Reinforcement Learning: Initial Experiments -- Jeffrey F. Monaco, David G. Ward, Andrew G. Barto
Initial experiments described here are directed toward using reinforcement learning (RL) to develop an automatic recovery system (ARS) for high-agility aircraft. An ARS is an outer-loop flight control system designed to bring an aircraft from a range of initial states to straight, level, and non-inverted flight in minimum time and while satisfying given constraints. Here we report results for a simple version of the problem involving only single-axis (pitch) simulated recoveries. Through simulated control experience using a medium-fidelity aircraft simulation, the RL system approximated an optimal policy for longitudinal-stick inputs to produce minimum-time transitions to straight and level flight in unconstrained cases, while meeting a pilot-station acceleration constraint.

Learning to Schedule Straight-Line Code -- Eliot Moss, Paul Utgoff, John Cavazos, Doina Precup, Darko Stefanovi\'{c}, Carla Brodley, David Scheeff
Execution speed of programs on modern computer architectures is sensitive, by a factor of two or more, to the order in which instructions are presented to the processor. To realize potential execution efficiency, it is now customary for an optimizing compiler to employ a heuristic algorithm for instruction scheduling. These algorithms are painstakingly hand-crafted, which is expensive and time-consuming. We show how to cast the instruction scheduling problem as a learning task, so that one obtains the heuristic scheduling algorithm automatically. Our focus is the narrower problem of scheduling straight-line code, also known as a {\em basic block} of instructions. Our empirical results show that just a few features are adequate for quite good performance at this task for a real modern processor, and that any of several supervised learning methods perform nearly optimally with respect to the features used.

Bayesian Robustification for Audio Visual Fusion -- Javier Movellan, Paul Mineiro
We discuss the problem of catastrophic fusion in multimodal recognition systems. This problem arises in systems that need to fuse different channels in non-stationary environments. Practice shows that when recognition modules within each modality are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We explore a principled solution to this problem based upon Bayesian ideas of competitive models and inference robustification: each sensory channel is provided with simple white-noise context models, and the perceptual hypothesis and context are jointly estimated. Consequently, context deviations are interpreted as changes in white noise contamination strength, automatically adjusting the influence of the module. The approach is tested on a fixed lexicon automatic audiovisual speech recognition problem with very good results.

A Superadditive-Impairment Theory of Optic Aphasia -- Michael C. Mozer, Mark Sitton, Martha Farah
Accounts of neurological disorders often posit damage to a specific functional pathway of the brain. Farah (1990) has proposed an alternative class of explanations involving partial damage to multiple pathways. We explore this explanation for optic aphasia, a disorder in which severe performance deficits are observed when patients are asked to name visually presented objects, but surprisingly, performance is relatively normal on naming objects from auditory cues and on gesturing the appropriate use of visually presented objects. We model this highly specific deficit through partial damage to two pathways--one that maps visual input to semantics, and the other that maps semantics to naming responses. The effect of this damage is superadditive, meaning that tasks which require one pathway or the other show little or no performance deficit, but the damage is manifested when a task requires both pathways (i.e., naming visually presented objects). Our model explains other phenomena associated with optic aphasia, and makes testable experimental predictions.

Reinforcement Learning for Continuous Stochastic Control Problems -- R\'{e}mi Munos, Paul Bourgine
This paper is concerned with the problem of Reinforcement Learning (RL) for continuous state space and time stochastic control problems. We state the Hamilton-Jacobi-Bellman equation satisfied by the value function and use a Finite-Difference method for designing a convergent approximation scheme. Then we propose a RL algorithm based on this scheme and prove its convergence to the optimal solution.

#### N

Enhancing Q-Learning for Optimal Asset Allocation -- Ralph Neuneier
In this paper, we enhance the Q-learning algorithm for optimal asset allocation proposed in (Neuneier, 1996). The new formulation simplifies the approach and allows policy-iteration without the necessity for having a model of the system. The new algorithm is tested on real data of the German stock market. Furthermore, the possibility of risk management within the framework of Markov decision problems is analyzed. This leads to different return functions, which bridge the gap between classical portfolio management and asset allocation strategies derived by reinforcement learning.

#### O

A Hippocampal Model of Recognition Memory -- Randall C. O'Reilly, Kenneth A. Norman, James L. McClelland
A rich body of data exists showing that recollection of specific information contributes to recognition memory. We present a model, based largely on known features of hippocampal anatomy and physiology, that accounts for both the {\em high-threshold} nature of this recollection process (i.e., the fact that only studied items are recollected, although nonstudied items sometimes trigger recollection of similar studied items), and the fact that increasing interference leads to less recollection but apparently does not compromise the {\em quality} of recollection (i.e., the extent to which recollection veridically reflects events that occurred at study). Both of these properties of recollection pose problems for existing computational and mathematical models.

Learning Generative Models with the Up-Propagation Algorithm -- Jong-Hoon Oh, H. Sebastian Seung
Backpropagation networks process their inputs in a bottom-up fashion, and use top-down connections to propagate an error signal for learning. We introduce a new algorithm called up-propagation, which uses top-down connections to generate patterns, and bottom-up connections to propagate an error signal. The error signal is part of a computational feedback loop that adjusts the generated pattern to match sensory input. The error signal is also used for unsupervised learning. The algorithm is benchmarked against principal component analysis in experiments on images of handwritten digits.

#### P

Adaptive Choice of Grid and Time in Reinforcement Learning -- Stephan Pareigis
We propose local error estimates together with algorithms for adaptive a-posteriori grid and time refinement in reinforcement learning. We consider a deterministic system with continuous state and time with an infinite horizon discounted cost functional. For grid refinement we follow the procedure of numerical methods for the Bellman-equation. For time refinement we propose a new criterion, based on consistency estimates of discrete solutions of the Bellman-equation. We demonstrate that an optimal ratio of time to space discretization is crucial for optimal learning rates and accuracy of the approximate optimal value function.

Reinforcement Learning with Hierarchies of Machines -- Ronald Parr, Stuart Russell
We present a new approach to reinforcement learning in which the policies considered by the learning process are constrained by hierarchies of partially specified machines. This allows for the use of prior knowledge to reduce the search space and provides a framework in which knowledge can be transferred across problems and in which component solutions can be recombined to solve larger and more complicated problems. Our approach can be seen as providing a link between reinforcement learning and behavior-based'' or teleo-reactive'' approaches to control. We present provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrate their effectiveness on a problem with several thousand states.

Analog VLSI Model of Intersegmental Coordination with Nearest-Neighbor Coupling -- Girish N. Patel, Jeremy H. Holleman, Stephen P. DeWeerth
We have developed an analog VLSI system that models the coordination of neurobiological segmented oscillators. We have implemented and tested a system that consists of a chain of eleven pattern generating circuits that are synaptically coupled to their nearest neighbors. Each pattern generating circuit is implemented with two silicon Morris-Lecar neurons that are connected in a reciprocally inhibitory network. We discuss the mechanisms of oscillations in the two-cell network and explore system behavior based on isotropic and anisotropic coupling, and frequency gradients along the chain of oscillators.

Multi-time Models for Temporally Abstract Planning -- Doina Precup, Richard S. Sutton
Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that cannot represent common-sense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models (Sutton, 1995) and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the {\em multi-time model}, and establish its suitability for planning and learning by virtue of its relationship to Bellman equations. This paper summarizes the theoretical framework of multi-time models and illustrates their potential advantages in a gridworld planning task.

Analytical Study of the Interplay between Architecture and Predictability -- Avner Priel, Ido Kanter, David A. Kessler
We study model feed forward networks as time series predictors in the stationary limit. The focus is on complex, yet non-chaotic, behavior. The main question we address is whether the asymptotic behavior is governed by the architecture, regardless the details of the weights. We find hierarchies among classes of architectures with respect to the attractor dimension of the long term sequence they are capable of generating; larger number of hidden units can generate higher dimensional attractors. In the case of a perceptron, we develop the stationary solution for a general weight vector, and show that the flow is typically one dimensional. The relaxation time from an arbitrary initial condition to the stationary solution is found to scale linearly with the size of the network. In multilayer networks, the number of hidden units gives bounds on the number and dimension of the possible attractors. We conclude that long term prediction (in the non-chaotic regime) with such models is governed by attractor dynamics related to the architecture.

#### R

Correlates of Attention in a Model of Dynamic Visual Recognition -- Rajesh P. N. Rao
Given a set of objects in the visual field, how does the the visual system learn to attend to a particular object of interest while ignoring the rest? How are occlusions and background clutter so effortlessly discounted for when recognizing a familiar object? In this paper, we attempt to answer these questions in the context of a Kalman filter-based model of visual recognition that has previously proved useful in explaining certain neurophysiological phenomena such as endstopping and related extra-classical receptive field effects in the visual cortex. By using results from the field of robust statistics, we describe an extension of the Kalman filter model that can handle multiple objects in the visual field. The resulting robust Kalman filter model demonstrates how certain forms of attention can be viewed as an emergent property of the interaction between top-down expectations and bottom-up signals. The model also suggests functional interpretations of certain attention-related effects that have been observed in visual cortical neurons. Experimental results are provided to help demonstrate the ability of the model in performing robust segmentation and recognition of objects and image sequences in the presence of varying degrees of occlusions and clutter.

An Incremental Nearest Neighbor Algorithm with Queries} -- Joel Ratsaby
We consider the general problem of learning multi-category classification from labeled examples. In many practical learning settings the sample size or time available for training are limited, which may have adverse effects on the accuracy of the resulting classifier. To compensate for this, researchers have considered various ways of trying to make the learning process more efficient so that even with limited resources high accuracy may be possible. The classical theory of pattern recognition assumes labeled examples appear according to unknown underlying class conditional probability distributions where the pattern classes are picked randomly in a passive manner according to their {\em a priori} probabilities. In this paper, we present experimental results for an algorithm which actively selects samples from different pattern classes according to a querying rule as opposed to the {\em a priori} probabilities. The amount of improvement of this query-based approach over the passive batch approach depends on the complexity of the Bayes rule. The query rule essentially learns to tune the subsample size of each of the pattern classes to this Bayes complexity which is assumed to be unknown. The principle on which this algorithm is based is general enough to be used in any learning algorithm which permits a model-selection criterion and for which the error rate of the classifier is calculable in terms of the complexity of the model.

Globally Optimal On-line Learning Rules -- Magnus Rattray, David Saad
We present a method for determining the globally optimal on-line learning rule for a soft committee machine under a statistical mechanics framework. This work complements previous results on locally optimal rules, where only the rate of change in generalization error was considered. We maximize the total reduction in generalization error over the whole learning process and show how the resulting rule can significantly outperform the locally optimal rule.

Just One View: Invariances in Inferotemporal Cell Tuning -- Maximilian Riesenhuber, Tomaso Poggio
In macaque inferotemporal cortex (IT), neurons have been found to respond selectively to complex shapes while showing broad tuning (invariance'') with respect to stimulus transformations such as translation and scale changes and a limited tuning to rotation in depth. Training monkeys with novel, paperclip-like objects, Logothetis et al.~(1995) could investigate whether these invariance properties are due to experience with exhaustively many transformed instances of an object or if there are mechanisms that allow the cells to show response invariance also to previously unseen instances of that object. They found object-selective cells in anterior IT which exhibited limited invariance to various transformations.

RCC Cannot Compute Certain FSA, Even with Arbitrary Transfer Functions -- Mark Ring
Existing proofs demonstrating the computational limitations of Recurrent Cascade Correlation (RCC) and similar networks (Fahlman, 1991; Bachrach, 1988; Mozer, 1988) explicitly limit their results to units having sigmoidal or hard-threshold transfer functions (Giles et al., 1995; Kremer, 1996). The proof given here shows that for any finite, discrete transfer function used by the units of an RCC network, there are finite-state automata (FSA) that the network cannot model, no matter how many units are used. The proof also applies to continuous transfer functions with a finite number of fixed points, such as sigmoid and radial basis functions.

Recurrent Neural Networks Can Learn to Implement Symbol-Sensitive Counting -- Paul Rodriguez, Janet Wiles
Recently researchers have derived formal complexity analysis of analog computation in the setting of discrete-time dynamical systems. Although these results are theoretically instructive, they do not identify relevant issues for implementation in actual systems, or for psychological models of sequence processing. As an empirical contrast, training recurrent neural networks (RNNs) produces self-organized systems that can give us complementary and constructive evidence for realizations of analog mechanisms. Previous work showed that a RNN can learn to process a simple context-free language (CFL). Herein, we extend that work to show that a RNN can learn a harder CFL by organizing its resources into a symbol-sensitive counting solution, and we provide a dynamical systems analysis which demonstrates how the network can not only count, but also copy and store counting information.

EM Algorithms for PCA and SPCA -- Sam Roweis
I present an expectation-maximization (EM) algorithm for principal component analysis (PCA). Traditional methods for finding the first few principal components of a dataset can be quite costly when there are many datapoints and/or the data space is of very high dimension. These methods can also fail when the sample covariance matrix is not of full rank and cannot deal easily with missing data values.

Intrusion Detection with Neural Networks -- Jake Ryan, Meng-Jang Lin, Risto Miikkulainen
With the rapid expansion of computer networks during the past few years, security has become a crucial issue for modern computer systems. A good way to detect illegitimate use is through monitoring unusual user activity. Methods of intrusion detection based on hand-coded rule sets or predicting commands on-line are laborous to build or not very reliable. This paper proposes a new way of applying neural networks to detect intrusions. We believe that a user leaves a print' when using the system; a neural network can be used to learn this print and identify each user much like detectives use thumbprints to place people at crime scenes. If a user's behavior does not match his/her print, the system administrator can be alerted of a possible security breech. A backpropagation neural network called NNID (Neural Network Intrusion Detector) was trained in the identification task and tested experimentally on a system of 10 users. The system was 96\% accurate in detecting unusual activity, with 7\% false alarm rate. These results suggest that learning user profiles is an effective way for detecting intrusions.

#### S

On the Separation of Signals from Neighboring Cells in Tetrode Recordings -- Maneesh Sahani, John S. Pezaris, Richard A. Andersen
We discuss a solution to the problem of separating waveforms produced by multiple cells in a tetrode recording. We take an explicitly probabilistic approach, using generative models of varying sophistication to describe the distribution of waveforms produced by a single cell. The models range from a single Gaussian distribution of waveforms for each cell to a continuous-hidden-state dynamical model with a mixture of linear propagators. We stress the overall structure of the approach, while the details of the generative model may depend on the specific neural preparation.

Modeling Acoustic Correlations by Factor Analysis -- Lawrence Saul, Mazin Rahim
Hidden Markov models (HMMs) for automatic speech recognition rely on high dimensional feature vectors to summarize the short-time properties of speech. Correlations between features can arise when the speech signal is non-stationary or corrupted by noise. We investigate how to model these correlations using factor analysis, a statistical method for dimensionality reduction. Factor analysis uses a small number of parameters to model the covariance structure of high dimensional data. These parameters are estimated by an Expectation-Maximization (EM) algorithm that can be embedded in the training procedures for HMMs. We evaluate the combined use of mixture densities and factor analysis in HMMs that recognize alphanumeric strings. Holding the total number of parameters fixed, we find that these methods, properly combined, yield better models than either method on its own.

Local Dimensionality Reduction -- Stefan Schaal, Sethu Vijayakumar, Christopher G. Atkeson
If globally high dimensional data has locally only low dimensional distributions, it is advantageous to perform a local dimensionality reduction before further processing the data. In this paper we examine several techniques for local dimensionality reduction in the context of locally weighted linear regression. As possible candidates, we derive local versions of factor analysis regression, principal component regression, principal component regression on joint distributions, and partial least squares regression. After outlining the statistical basis of these methods, we perform Monte Carlo simulations to evaluate their robustness with respect to violations of their statistical assumptions. One surprising outcome is that locally weighted partial least squares regression offers the best average results, thus outperforming even factor analysis, the theoretically most appealing of our candidate techniques.

Comparison of Human and Machine Word Recognition -- Markus Schenkel, Cyril Latimer, Marwan Jabri
We present a study which is concerned with word recognition rates for heavily degraded documents. We compare human with machine reading capabilities in a series of experiments, which explores the interaction of word/non-word recognition, word frequency and legality of non-words with degradation level. We also study the influence of character segmentation, and compare human performance with that of our artificial neural network model for reading. We found that the proposed computer model uses word context as efficiently as humans, but performs slightly worse on the pure character recognition task.

Prior Knowledge in Support Vector Kernels -- Bernhard Sch\"olkopf, Patrice Simard, Alex J. Smola, Vladimir Vapnik
We explore methods for incorporating prior knowledge about a problem at hand in Support Vector learning machines. We show that both invariances under group transformations and prior knowledge about locality in images can be incorporated by constructing appropriate kernel functions.

Training Methods for Adaptive Boosting of Neural Networks -- Holger Schwenk, Yoshua Bengio
Boosting is a general method for improving the performance of any learning algorithm that consistently generates classifiers which need to perform only slightly better than random guessing. A recently proposed and very promising boosting algorithm is AdaBoost. It has been applied with great success to several benchmark machine learning problems using rather simple learning algorithms, in particular decision trees. In this paper we use AdaBoost to improve the performances of neural networks applied to character recognition tasks. We compare training methods based on sampling the training set and weighting the cost function. Our system achieves about 1.4\% error on a data base of online handwritten digits from more than 200 writers. Adaptive boosting of a multi-layer network achieved less than 2\% error on the UCI Letters offline characters data set.

The Rectified Gaussian Distribution -- Nicholas D. Socci, Daniel D. Lee, H. Sebastian Seung
A simple but powerful modification of the standard Gaussian distribution is studied. The variables of the rectified Gaussian are constrained to be nonnegative, enabling the use of nonconvex energy functions. Two multimodal examples, the competitive and cooperative distributions, illustrate the representational power of the rectified Gaussian. Since the cooperative distribution can represent the translations of a pattern, it demonstrates the potential of the rectified Gaussian for modeling pattern manifolds. The problem of learning may be more tractable for the rectified Gaussian than for the Boltzmann machine, owing to the technical advantages of continuous variables.

Learning Continuous Attractors in Recurrent Networks -- H. Sebastian Seung
One approach to invariant object recognition employs a recurrent neural network as an associative memory. In the standard depiction of the network's state space, memories of objects are stored as attractive fixed points of the dynamics. I argue for a modification of this picture: if an object has a continuous family of instantiations, it should be represented by a continuous attractor. This idea is illustrated with a network that learns to complete patterns. To perform the task of filling in missing information, the network develops a continuous attractor that models the manifold from which the patterns are drawn. From a statistical viewpoint, the pattern completion task allows a formulation of unsupervised learning in terms of regression rather than density estimation.

Minimax and Hamiltonian Dynamics of Excitatory-Inhibitory Networks -- H. Sebastian Seung, Tom J. Richardson, Jeffrey C. Lagarias, John J. Hopfield
Because the dynamics of a neural network with symmetric interactions is similar to a gradient descent dynamics, convergence to a fixed point is the general behavior. In this paper, we analyze the global behavior of networks with distinct excitatory and inhibitory populations of neurons, under the assumption that the interactions between the populations are antisymmetric. Our analysis exploits the similarity of such a dynamics to a saddle point dynamics. This analogy gives some intuition as to why such a dynamics can either converge to a fixed point or a limit cycle, depending on parameters. We also show that the network dynamics can be written in a dissipative Hamiltonian form.

Data-Dependent Structural Risk Minimization for Perceptron Decision Trees -- John Shawe-Taylor, Nello Cristianini
Perceptron Decision Trees (also known as Linear Machine DTs, etc.) are analysed in order that data dependent Structural Risk Minimization can be applied. Data dependent analysis is performed which indicates that choosing the maximal margin hyperplanes at the decision nodes will improve the generalization. The analysis is performed by defining a real valued function class computed by decision trees and bounding its generalization error using a novel technique. Experiments are performed to test the approach.

An Analog VLSI Neural Network for Phase-based Machine Vision} -- Bertram E. Shi, Kwok Fai Hui
We describe the design, fabrication and test results of an analog CMOS VLSI neural network prototype chip intended for phase-based computer vision algorithms. The chip implements an image filtering operation similar to Gabor-filtering. Because a Gabor filter's output is complex valued, it can be used to define a phase at every pixel in an image. This phase can be used in robust algorithms for disparity estimation and binocular stereo vergence control in stereo vision and for image motion analysis. The chip reported here takes an input image and generates two outputs at every pixel, corresponding to the real and imaginary parts of the output.

Monotonic Networks -- Joseph Sill
Monotonicity is a constraint which arises in many application domains. We present a machine learning model called the monotonic network, which has the ability to obey monotonicity constraints exactly, i.e., by virtue of functional form. A straightforward method for implementing and training a monotonic network is described. Monotonic networks are proven to be universal approximators of continous, differentiable monotonic functions. We apply monotonic networks to a real-world task in corporate bond rating prediction and show that they compare favorably to other approaches.

How to Dynamically Merge Markov Decision Processes -- Satinder Singh, David Cohn
We are frequently called upon to do multiple tasks that compete for our attention and resource. Often we know the optimal solution to each task in isolation. How can we exploit that knowledge to efficiently determine good solutions for doing the tasks in parallel? We formulate this question as that of dynamically merging multiple Markov decision processes (MDPs), and present a new theoretically-sound dynamic programming algorithm that assumes known good solutions (value functions) for the individual MDPs in isolation and efficiently constructs a good solution for doing the set of MDPs in parallel. Our algorithm can merge MDPs dynamically, assimilating a new MDP smoothly into an ongoing merging of previous MDPs. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem.

From Regularization Operators to Support Vector Kernels -- Alex J. Smola, Bernhard Sch\"olkopf
We derive the correspondence between regularization operators used in Regularization Networks and Hilbert Schmidt Kernels appearing in Support Vector Machines. More specifically, we prove that the Green's Functions associated with regularization operators are suitable Support Vector Kernels with equivalent regularization properties. As a by--product we show that a large number of Radial Basis Functions namely conditionally positive definite functions may be used as Support Vector kernels.

Stacked Density Estimation -- Padhraic Smyth, David Wolpert
In this paper, the technique of stacking, previously only used for supervised learning, is applied to unsupervised learning. Specifically, it is used for non-parametric multivariate density estimation, to combine finite mixture model and kernel density estimators. Experimental results on both simulated data and real world data sets clearly demonstrates that stacked density estimation outperforms other strategies such as choosing the single best model based on cross-validation, combining with uniform weights, and even the single best model chosen by cheating'' by looking at the data used for independent testing.

On-line Learning from Finite Training Sets in Nonlinear Networks -- Peter Sollich, David Barber
Online learning is one of the most common forms of neural network training. We present an analysis of online learning from finite training sets for non-linear networks (namely, soft-committee machines), advancing the theory to more realistic learning scenarios. Dynamical equations are derived for an appropriate set of order parameters; these are exact in the limiting case of either linear networks or infinite training sets. Preliminary comparisons with simulations suggest that the theory captures some effects of finite training sets, but may not yet account correctly for the presence of local minima.

Bidirectional Retrieval from Associative Memory -- Friedrich T. Sommer, G\"unther Palm
So far, similarity based fault tolerant retrieval naturally provided by neural associative memories (NAM) is not widely applied as access method for large databases. A serious obstacle to applications is that the theoretical performance of NAM can not be achieved in practical cases: the Willshaw model with the highest asymptotic information capacity produces a high level of cross talk errors in the retrieval, if fully exploited for finite sizes.

Incorporating Contextual Information in White Blood Cell Identification -- Xubo Song, Yaser Abu-Mostafa, Joseph Sill, Harvey Kasdan
In this paper we propose a technique to incorporate contextual information into object classification. In the real world there are cases where the identity of an object is ambiguous due to the noise in the measurements based on which the classification should be made. It is helpful to reduce the ambiguity by utilizing extra information referred to as context, which in our case is the identities of the accompanying objects. This technique is applied to white blood cell classification. Comparisons are made against the no context'' approach, to demonstrate the superior classification performance achieved by using context. In our particular application, it significantly reduces false alarm rate and thus greatly reduces the cost due to expensive clinical tests.

Bach in a Box---Real-Time Harmony -- Randall R. Spangler, Rodney M. Goodman, Jim Hawkins
We describe a system for learning J. S. Bach's rules of musical harmony. These rules are learned from examples and are expressed as rule-based neural networks. The rules are then applied in real-time to generate new accompanying harmony for a live performer. Real-time functionality imposes constraints on the learning and harmonizing processes, including limitations on the types of information the system can use as input and the amount of processing the system can perform. We demonstrate algorithms for generating and refining musical rules from examples which meet these constraints. We describe a method for including a priori knowledge into the rules which yields significant performance gains. We then describe techniques for applying these rules to generate new music in real-time. We conclude the paper with an analysis of experimental results.

Experiences with Bayesian Learning in a Real World Application -- Peter Sykacek, Georg Dorffner, Peter Rappelsberger, Josef Zeitlhofer
This paper reports about an application of Bayes' inferred neural network classifiers in the field of automatic sleep staging. The reason for using Bayesian learning for this task is two-fold. First, Bayesian inference is known to embody regularization automatically. Second, a side effect of Bayesian learning leads to larger variance of network outputs in regions without training data. This results in well known moderation effects, which can be used to detect outiers. In a 5 fold cross-validation experiment the full Bayesian solution found with R. Neals hybrid Monte Carlo algorithm was not better than a single maximum a-posteriori (MAP) solution found with D.J. MacKay's evidence approximation. In a second experiment we studied the properties of both solutions in rejecting classification of movement artifacts.

The Asymptotic Convergence-Rate of Q-learning -- Csaba Szepesv\'{a}ri
In this paper we show that for discounted MDPs with discount factor $\gamma$ the asymptotic rate of convergence of Q-learning is $O(1/t^{1-\gamma})$ if $\gamma>1/2$ and $O(\sqrt{\log\log t/ t})$ otherwise.

#### T

Mapping a Manifold of Perceptual Observations -- Joshua B. Tenenbaum
I present a new computational-geometric framework for unsupervised mapping of a manifold of perceptual observations. Nonlinear dimensionality reduction is formulated here as the problem of trying to find a Euclidean feature-space embedding of a set of observations that preserves as closely as possible their intrinsic metric structure, i.e. the distances between points on the observation manifold as measured along geodesic paths. Our isometric feature mapping procedure (isomap'') is able to reliably recover low-dimensional nonlinear structure in realistic perceptual data sets, such as a manifold of face images, where conventional global mapping methods find only local minima. The recovered map provides a canonical set of globally meaningful features, which allows perceptual transformations such as interpolation, extrapolation, and analogy -- highly nonlinear transformations in the original observation space -- to be computed with simple linear operations in feature space.

On-line Learning from Finite Training Sets in Nonlinear Networks -- Peter Sollich, David Barber
We apply information maximization / maximum likelihood blind source separation to complex valued signals mixed with complex valued nonstationary matrices. This case arises in radio communications with baseband signals. We incorporate known source signal distributions in the adaptation, thus making the algorithms less `blind''. This results in drastic reduction of the amount of data needed for successful convergence. Adaptation to rapidly changing signal mixing conditions, such as to fading in mobile communications, becomes now feasible as demonstrated by simulations.

A Solution for Missing Data in Recurrent Neural Networks with an Application to Blood Glucose Prediction -- Volker Tresp, Thomas Briegel
We consider neural network models for stochastic nonlinear dynamical systems where measurements of the variable of interest are only available at irregular intervals i.e. most realizations are missing. Difficulties arise since the solutions for prediction and maximum likelihood learning with missing data lead to complex integrals, which even for simple cases cannot be solved analytically. In this paper we propose a specific combination of a nonlinear recurrent neural predictive model and a linear error model which leads to tractable prediction and maximum likelihood adaptation rules. In particular, the recurrent neural network can be trained using the real-time recurrent learning rule and the linear error model can be trained by an EM adaptation rule, implemented using forward-backward Kalman filter equations. The model is applied to predict the glucose/insulin metabolism of a diabetic patient where blood glucose measurements are only available a few times a day at irregular intervals. The new model shows considerable improvement with respect to both recurrent neural networks trained with teacher forcing or in a free running mode and various linear models.

Self-similarity Properties of Natural Images -- Antonio Turiel, Germ\'an Mato, N\'estor Parga, Jean-Pierre Nadal
Scale invariance is a fundamental property of ensembles of natural images. Their non Gaussian properties are less well understood, but they indicate the existence of a rich statistical structure. In this work we present a detailed study of the marginal statistics of a variable related to the edges in the images. A numerical analysis shows that it exhibits extended self-similarity. This is a scaling property stronger than self-similarity: all its moments can be expressed as a power of any given moment. More interesting, all the exponents can be predicted in terms of a multiplicative log-Poisson process. This is the very same model that was used very recently to predict the correct exponents of the structure functions of turbulent flows. These results allow us to study the underlying multifractal singularities. In particular we find that the most singular structures are one-dimensional: the most singular manifold consists of sharp edges.

#### V

Multiresolution Tangent Distance for Affine-invariant Classification -- Nuno Vasconcelos, Andrew Lippman
The ability to rely on similarity metrics invariant to image transformations is an important issue for image classification tasks such as face or character recognition. We analyze an invariant metric that has performed well for the latter - the tangent distance - and study its limitations when applied to regular images, showing that the most significant among these (convergence to local minima) can be drastically reduced by computing the distance in a multiresolution setting. This leads to the multiresolution tangent distance, which exhibits significantly higher invariance to image transformations, and can be easily combined with robust estimation procedures.

Use of a Multi-Layer Perceptron to Predict Malignancy in Ovarian Tumours -- Herman Verrelst, Yves Moreau, Joos Vandewalle, Dirk Timmerman
We discuss the development of a Multi-Layer Perceptron neural network classifier for use in preoperative differentiation between benign and malignant ovarian tumors. As the Mean Squared classification Error is not sufficient to make correct and objective assessments about the performance of the neural classifier, the concepts of sensitivity and specificity are introduced and combined in Receiver Operating Characteristic curves. Based on objective observations such as sonomorphologic criteria, color Doppler imaging and results from serum tumor markers, the neural network is able to make reliable predictions with a discriminating performance comparable to that of experienced gynecologists.

Independent Component Analysis for Identification of Artifacts -- Ricardo Vig\'{a}rio, Veikko Jousm\"{a}ki, Matti H\"{a}m\"{a}l\"{a}inen, Riitta Hari, Erkki Oja
We have studied the application of an independent component analysis (ICA) approach to the identification and possible removal of artifacts from a magnetoencephalographic (MEG) recording. This statistical technique separates components according to the kurtosis of their amplitude distributions over time, thus distinguishing between strictly periodical signals, and regularly and irregularly occurring signals. Many artifacts belong to the last category. In order to assess the effectiveness of the method, controlled artifacts were produced, which included saccadic eye movements and blinks, increased muscular tension due to biting and the presence of a digital watch inside the magnetically shielded room. The results demonstrate the capability of the method to identify and clearly isolate the produced artifacts.

Modeling Complex Cells in an Awake Macaque during Natural Image Viewing -- William E. Vinje, Jack L. Gallant
Cells in V1 are often modeled as quasi-linear filters. We are interested in the behavior of these filters during natural vision. We apply a quasi-linear filter to three stimulus sequences that simulated all of the events occurring in and around the receptive field of a single well-tuned V1 complex cell while an animal freely viewed several natural images (review movies). Model parameters are estimated from the cell's responses to flashed gratings. Uncertainties in parameter estimates are quantified and used to estimate the resulting uncertainties in the model's predictions. We find modest but significant correlation between the model and the cell's recorded responses to three review movies (r = 0.21, 0.25, 0.52, respectively). Prediction uncertainties are sizable, but are not large enough to be the sole source of discrepancy between the model and the data. The largest single source of prediction uncertainty is the measurement uncertainty in orientation tuning. Our results indicate that it should be possible to measure model parameters accurately enough to substantally lower this prediction uncertainty. Low uncertainty predictions are necessary if a quasi-linear filter model is to be used as a foundation for future models that incorporate nonlinear mechanisms.

Competitive On-line Linear Regression -- Volodya Vovk
We apply a general algorithm for merging prediction strategies (the Aggregating Algorithm) to the problem of linear regression with the square loss; our main assumption is that the response variable is bounded. It turns out that for this particular problem the Aggregating Algorithm resembles, but is different from, the well-known ridge estimation procedure. From general results about the Aggregating Algorithm we deduce a guaranteed bound on the difference between our algorithm's performance and the best, in some sense, linear regression function's performance. We show that the AA attains the optimal constant in our bound, whereas the constant attained by the ridge regression procedure can be 4.3 times worse.

On the Infeasibility of Training Neural Networks with Small Squared Errors -- Van H. Vu
We demonstrate that the problem of training neural networks with small (average) squared error is computationally intractable. Consider a data set of $M$ points $(X_i, Y_i)$, $i=1,2,\dots, M$, where $X_i$ are input vectors from $R^d$, $Y_i$ are real outputs ($Y_i \in R)$. For a network $f_0$ in some class ${\cal F}$ of neural networks, $(1/M) (\sum _{i=1}^M (f_0(X_i) -Y_i)^2)^{1/2} - inf_{f \in {\cal F}}(1/M) (\sum _{i=1}^M (f(X_i) -Y_i)^2)^{1/2}$ is the (average) relative error that occurs when one tries to fit the data set by $f_0$. We will prove for several classes ${\cal F}$ of neural netwoks that achieving relative error smaller than some fixed positive threshold (independent from the size of the data set) is NP-hard. This answers a question of L. Jones.

#### W

Phase Transitions and the Perceptual Organization of Video Sequences -- Yair Weiss
Estimating motion in scenes containing multiple moving objects remains a difficult problem in computer vision. A promising approach to this problem involves using mixture models, where the motion of each object is a component in the mixture. However, existing methods typically require specifying in advance the number of components in the mixture, i.e. the number of objects in the scene. Here we show that the number of objects can be estimated automatically in a maximum likelihood framework, given an assumption about the level of noise in the video sequence. We derive analytical results showing the number of models which maximize the likelihood for a given noise level in a given sequence. We illustrate these results on a real video sequence, showing how the phase transitions correspond to different perceptual organizations of the scene.

Hybrid NN/HMM-Based Speech Recognition with a Discriminant Neural Feature Extraction -- Daniel Willett, Gerhard Rigoll
In this paper we present a novel hybrid architecture for continuous speech recognition systems. It consists of a continuous HMM system extended by an arbitrary neural network that is used as a preprocessor that takes several frames of the feature vector as input to produce more discriminative feature vectors with respect to the underlying HMM system. This hybrid system is an extension of a state-of-the-art continuous HMM system, and in fact, it is the first hybrid system that really is capable of outperforming these standard systems with respect to the recognition accuracy. Experimental results show a relative error reduction of about 10\% achieved on a remarkably good recognition system based on continuous HMMs for the Resource Management 1000-word continuous speech recognition task.

Modelling Seasonality and Trends in Daily Rainfall Data -- Peter M. Williams
This paper presents a new approach to the problem of modelling daily rainfall using neural networks. We first model the conditional distributions of rainfall amounts, in such a way that the model itself determines the order of the process, and the time-dependent shape and scale of the conditional distributions. After integrating over particular weather patterns, we are able to extract seasonal variations and long-term trends.

Graph Matching with Hierarchical Discrete Relaxation -- Richard C. Wilson, Edwin R. Hancock
Our aim in this paper is to develop a Bayesian framework for matching hierarchical relational models. Such models are widespread in computer vision. The framework that we adopt for this study is provided by iterative discrete relaxation. Here the aim is to assign the discrete matches so as to optimise a global cost function that draws information concerning the consistency of match from different levels of the hierarchy. Our Bayesian development naturally distinguishes between intra-level and inter-level constraints. This allows the impact of reassigning a match to be assessed not only at its own (or peer) level of representation, but also upon its parents and children in the hierarchy. We illustrate the effectiveness of the technique in the matching of line-segment groupings in SAR images of rural scenes.

#### X

The Storage Capacity of a Fully-Connected Committee Machine -- Yuansheng Xiong, Chulan Kwon, Jong-Hoon Oh
We study the storage capacity of a fully-connected committee machine with a large number $K$ of hidden nodes. The storage capacity is obtained by analyzing the geometrical structure of the weight space related to the internal representation. By examining the asymptotic behavior of the order parameters in the limit of large $K$, the storage capacity is found to be proportional to $K\sqrt{\ln K}$ up to the leading order. This result satisfies the mathematical bound given by Mitchison and Durbin, whereas the replica-symmetric solution in a conventional Gardner's approach violates this bound.

#### Y

Hybrid Reinforcement Learning and Its Application to Biped Robot Control -- Satoshi Yamada, Akira Watanabe, Michio Nakashima
A learning system composed of linear control modules, reinforcement learning modules and selection modules (a hybrid reinforcement learning system) is proposed for the quick learning of real-world control problems. The selection modules choose one appropriate control module dependent on the state. This hybrid learning system was applied to a stilt-type biped robot control. Because it did not need to learn the control on the flat floor, where the linear control module can control the robot, it learned the control on a floor with a slope more quickly than the usual reinforcement learning. If it was trained by a 2-step learning (at the first learning step, the selection module was trained by a training procedure controlled only by the linear controller), it learned the control much more quickly. The average number of trials (about 50) is so small that the learning system is applicable to real robot control.

The Efficiency and the Robustness of Natural Gradient Descent Learning Rule -- Howard H. Yang, Shun-ichi Amari
We have discovered a new scheme to represent the Fisher information matrix of a stochastic multi-layer perceptron. Based on this scheme, we have designed an algorithm to compute the inverse of the Fisher information matrix. When the input dimension $n$ is much larger than the number of hidden neurons, the complexity of this algorithm is of order $O(n^2)$ while the complexity of conventional algorithms for the same purpose is of order $O(n^3)$. The inverse of the Fisher information matrix is used in the natural gradient descent algorithm to train single-layer or multi-layer perceptrons. It is confirmed by simulation that the natural gradient descent learning rule is not only efficient but also robust.

Multiplicative Updating Rule for Blind Separation Derived from the Method of Scoring -- Howard H. Yang
For blind source separation, when the Fisher information matrix is used as the Riemannian metric tensor for the parameter space, the steepest descent algorithm to maximize the likelihood function in this Riemannian parameter space becomes the serial updating rule with equivariant property. This algorithm can be further simplified by using the asymptotic form of the Fisher information matrix around the equilibrium.

#### Z

The Observer-Observation Dilemma in Neuro-Forecasting -- Hans Georg Zimmermann, Ralph Neuneier
We explain how the training data can be separated into a cleaned part and an unexplainable noisy part. Analogous to the data, the neural network is separated into a time invariant structure used for forecasting, and a noisy part. We propose a unified theory connecting the optimization algorithms for cleaning and learning together with algorithms that control the data noise and the parameter noise. The combined algorithm allows a data-driven local control of the liability of the network parameters and therefore an improvement in generalization. The approach is successfully evaluated for the task of forecasting the German bond market.

We have attempted to ensure that all information is correct, but we cannot guarantee it. Please send comments and corrections to:
L. Douglas Baker
Carnegie Mellon University
ldbapp+nips@cs.cmu.edu