# Neural Information Processing Systems 1995

This is a partial list of the papers that were presented at NIPS*95.

These papers should be cited as:
To appear in Advances in Neural Information Processing Systems 8,
D. S. Touretzky, M. C. Mozer, M. E. Hasselmo, eds., MIT Press, 1996. In press.

A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z

#### A

Selective Attention for Handwritten Digit Recognition, Ethem Alpaydin
Completely parallel object recognition is NP-complete. Achieving a recognizer with feasible complexity requires a compromise between parallel and sequential processing where a system selectively focuses on parts of a given image, one after another. Successive fixations are generated to sample the image and these samples are processed and abstracted to generate a temporal context in which results are integrated over time. A computational model based on a partially recurrent feedforward network is proposed and made credible by testing on the real-world problem of recognition of handwritten digits with encouraging results.
A New Learning Algorithm for Blind Signal Separation, S. Amari and A. Cichocki and H. H. Yang
A new on-line learning algorithm which minimizes a statistical dependency among outputs is derived for blind separation of mixed signals. The dependency is measured by the average mutual information (MI) of the outputs. The source signals and the mixing matrix are unknown except for the number of the sources. The Gram-Charlier expansion instead of the Edgeworth expansion is used in evaluating the MI. The natural gradient approach is used to minimize the MI. A novel activation function is proposed for the on-line learning algorithm which has an equivariant property and is easily implemented on a neural network like model. The validity of the new learning algorithm is verified by computer
Statistical Theory of Overtraining - Is Cross-Validation Effective?, Amari, S., Murata, N., Müller, K.-R., Finke, M. and Yang, H.
A statistical theory for overtraining is proposed. The analysis treats realizable stochastic neural networks, trained with Kullback-Leibler loss in the asymptotic case. It is shown that the asymptotic gain in the generalization error is small if we perform early stopping, even if we have access to the optimal stopping time. Considering cross-validation stopping we answer the question: In what ratio the examples should be divided into training and testing sets in order to obtain the optimum performance. In the non-asymptotic region cross-validated early stopping always decreases the generalization error. Our large scale simulations done on a CM5 are in nice agreement with our analytical findings.
Exponentially many local minima for single neurons, Peter Auer, Mark Herbster and Manfred Warmuth
We show that for a single neuron with the logistic function as the transfer function the number of local minima of the error function based on the square loss can grow exponentially in the dimension.

#### B

Classifying Facial Action, Marian Stewart Bartlett, Paul A. Viola, Terrence J. Sejnowski, Beatrice A. Golomb, Jan Larsen, Joseph C. Hager, and Paul Ekman
The Facial Action Coding System, (FACS), devised by Ekman and Friesen, provides an objective means for measuring the facial muscle contractions involved in a facial expression. In this paper, we approach automated facial expression analysis by detecting and classifying facial actions. We generated a database of over 1100 image sequences of 24 subjects performing over 150 distinct facial actions or action combinations. We compare three different approaches to classifying the facial actions in these images: Holistic spatial analysis based on principal components of graylevel images; explicit measurement of local image features such as wrinkles; and template matching with motion flow fields. On a dataset containing six individual actions and 20 subjects, these methods had 89%, 57%, and 85% performances respectively for generalization to novel subjects. When combined, performance improved to 92%.
Learning Model Bias, Jonathan Baxter
In this paper the problem of {\em learning} appropriate domain-specific bias is addressed. It is shown that this can be achieved by learning many related tasks from the same domain, and a theorem is given bounding the number tasks that must be learnt. A corollary of the theorem is that if the tasks are known to possess a common {\em internal representation} or {\em preprocessing} then the number of examples required per task for good generalisation when learning $n$ tasks simultaneously scales like $O(a + \frac{b}{n})$, where $O(a)$ is a bound on the minimum number of examples requred to learn a single task, and $O(a + b)$ is a bound on the number of examples required to learn each task independently. An experiment providing strong qualitative support for the theoretical results is reported.
EM Optimization of Latent-Variable Density Models C. M. Bishop, M. Svensen, C. K. I. Williams
There is currently considerable interest in developing general non-linear density models based on latent, or hidden, variables. Such models have the ability to discover the presence of a relatively small number of underlying "causes", which, acting in combination, give rise to the apparent complexity of the observed data set. Unfortunately, to train such models generally requires large computational effort. In this paper we introduce a novel latent variable algorithm which retains the general non-linear capabilities of previous models but which uses a training procedure based on the EM algorithm. We demonstrate the performance of the model on a toy problem and on data from flow diagnosticsfor a multi-phase oil pipeline.
Recurrent Neural Networks for Missing or Asynchronous Data Y. Bengio and F. Gingras
In this paper we propose recurrent neural networks with feedback into the input units for handling two types of data analysis problems. On the one hand, this scheme can be used for static data when some of the input variables are missing. On the other hand, it can also be used for sequential data, when some of the input variables are missing or are available at different frequencies. Unlike in the case of probabilistic models (e.g. Gaussian) of the missing variables, the network does not attempt to model the distribution of the missing variables given the observed variables. Instead it is a more discriminant'' approach that fills in the missing variables for the sole purpose of minimizing a learning criterion (e.g., to minimize an output error).
On Neural Networks with Minimal Weights, Vasken Bohossian and Jehoshua Bruck
Linear threshold elements are the basic building blocks of artificial neural networks. A linear threshold element computes a function that is a sign of a weighted sum of the input variables. The weights are arbitrary integers; actually, they can be very big integers---exponential in the number of the input variables. However, in practice, it is difficult to implement big weights. In the present literature a distinction is made between the two extreme cases: linear threshold functions with polynomial-size weights as opposed to those with exponential-size weights. The main contribution of this paper is to fill up the gap by further refining that separation. Namely, we prove that the class of linear threshold functions with polynomial-size weights can be divided into subclasses according to the degree of the polynomial. In fact, we prove a more general result---that there exists a minimal weight linear threshold function for any arbitrary number of inputs and any weight size. To prove those results we have developed a novel technique for constructing linear threshold functions with minimal weights.
A Realizable Learning Task which Exhibits Overfitting , Siegfried Bös
In this paper we examine a perceptron learning task. The task is realizable since it is provided by another perceptron with identical architecture. Both perceptrons have nonlinear sigmoid output functions. The gain of the output function determines the level of nonlinearity of the learning task. It is observed that a high level of nonlinearity leads to overfitting. We give an explanation for this rather surprising observation and develop a method to avoid the overfitting. This method has two possible interpretations, one is learning with noise, the other cross--validated early stopping.

#### C

Experiments with Neural Networks for Real Time Implementation of Optimal Control , P. Campbell, M. Dale, H.L. Ferra, and A. Kowalczyk
This paper describes a neural network based controller for allocating capacity in a telecommunications network. This system was proposed in order to overcome a "real time" response constraint. Two basic architectures are evaluated: 1) a feedforward network-heuristic and; 2) a feedforward network-recurrent network. These architectures are compared against a linear programming (LP) optimiser as a benchmark. This LP optimiser was also used as a teacher to label the data samples for the feedforward neural network training algorithm. It is found that the systems are able to provide a traffic throughput of 99% and 95%, respectively, of the throughput obtained by the linear programming solution. Once trained, the neural network based solutions are found in a fraction of the time required by the LP optimiser.
Rapid Quality Estimation of Neural Network Input Representations Kevin J. Cherkauer and Jude W. Shavlik
The choice of an input representation for a neural network can have a profound impact on its accuracy in classifying novel instances. However, neural networks are typically computationally expensive to train, making it difficult to test large numbers of alternative representations. This paper introduces fast quality measures for neural network representations, allowing one to quickly and accurately estimate which of a collection of possible representations for a problem is the best. We show that our measures for ranking representations are more accurate than a previously published measure, based on experiments with three difficult, real-world pattern recognition problems.
Laterally Interconnected Self-Organizing Maps in Hand-Written Digit Recognition , Yoonsuck Choe, Joseph Sirosh, and Risto Miikkulainen
An application of laterally interconnected self-organizing maps (LISSOM) to handwritten digit recognition is presented. The lateral connections learn the correlations of activity between units on the map. The resulting excitatory connections focus the activity into local patches and the inhibitory connections decorrelate redundant activity on the map. The map thus forms internal representations that are easy map. The map thus forms internal representations that are easy to recognize with e.g. a perceptron network. The recognition rate on a subset of NIST database 3 is 4.0% higher with LISSOM than with a regular Self-Organizing Map (SOM) as the front end, and 15.8% higher than recognition of raw input bitmaps directly. The15.8% higher than recognition of raw input bitmaps directly. These results form a promising starting point for building pattern recognition systems with a LISSOM map as a front end.
Predictive Q-Routing: A Memory-based Reinforcement Learning Approach to Adaptive Traffic Control, Samuel P.M. Choi, Dit-Yan Yeung
In this paper, we propose a memory-based Q-learning algorithm called predictive Q-routing (PQ-routing) for adaptive traffic control. We attempt to address two problems encountered in Q-routing (Boyan & Littman, 1994), namely, the inability to fine-tune routing policies under low network load and the inability to learn new optimal policies under decreasing load conditions. Unlike other memory-based reinforcement learning algorithms in which memory is used to keep past experiences to increase learning speed, PQ-routing keeps the best experiences learned and reuses them by predicting the traffic trend. The effectiveness of PQ-routing has been verified under various network topologies and traffic conditions. Simulation results show that PQ-routing is superior to Q-routing in terms of both learning speed and adaptability.
Parallel Optimization of Motion Controllers via Policy Iteration, Jefferson A. Coelho Jr., Ramesk K. Sitaraman, Roderic A. Grupen
This paper describes a policy iteration algorithm for optimizing the performance of a harmonic function-based controller with respect to a user-defined index. Value functions are represented as potential distributions over the problem domain, being control policies represented as gradient fields over the same domain. All intermediate policies are intrinsically safe, i.e. collisions are not promoted during the adaptation process.
A Dynamical Model of Context Dependencies for the Vestibulo-Ocular Reflex, Olivier J.M.D. Coenen and Terrence J. Sejnowski
The vestibulo-ocular reflex (VOR) stabilizes images on the retina during rapid head motions. The gain of the VOR (the ratio of eye to head rotation velocity) is typically around -1 when the eyes are focused on a distant target. However, to stabilize images accurately, the VOR gain must vary with context (eye position, eye vergence and head translation). We first describe a kinematic model of the VOR which relies solely on sensory information available from the semicircular canals (head rotation), the otoliths (head translation), and neural correlates of eye position and vergence angle. We then propose a dynamical model and compare it to the eye velocity responses measured in monkeys. The dynamical model reproduces the observed amplitude and time course of the modulation of the VOR and suggests one way to combine the required neural signals within the cerebellum and the brain stem. It also makes predictions for the responses of neurons to multiple inputs (head rotation and translation, eye position, etc.) in the oculomotor system.
Modern Analytic Techniques to Solve the Dynamics of Recurrent Neural Networks , A.C.C. Coolen, S.N. Laughton and D. Sherrington
We describe the use of modern analytical techniques in solving the dynamics of symmetric and nonsymmetric recurrent neural networks near saturation. These explicitly take into account the correlations between the post-synaptic potentials, and thereby allow for a reliable prediction of transients.
Extracting Tree-Structured Representations of Trained Networks, Mark W. Craven and Jude W. Shavlik
A significant limitation of neural networks is that the representations they learn are usually incomprehensible to humans. We present a novel algorithm, TREPAN, for extracting comprehensible, symbolic representations from trained neural networks. Our algorithm uses queries to induce a decision tree that approximates the concept represented by a given network. Our experiments demonstrate that TREPAN is able to produce decision trees that maintain a high level of fidelity to their respective networks, while being comprehensible and accurate. Unlike previous work in this area, our algorithm is both general in its applicability and scales well to large networks and problems with high-dimensional input spaces.
Improving Elevator Performance Using Reinforcement Learning, Robert H. Crites and Andrew G. Barto
This paper describes the application of reinforcement learning (RL) to the difficult real world problem of elevator dispatching. The elevator domain poses a combination of challenges not seen in most RL research to date. Elevator systems operate in continuous state spaces and in continuous time as discrete event dynamic systems. Their states are not fully observable and they are nonstationary due to changing passenger arrival rates. In addition, we use a team of RL agents, each of which is responsible for controlling one elevator car. The team receives a global reinforcement signal which appears noisy to each agent due to the effects of the actions of the other agents, the random nature of the arrivals and the incomplete observation of the state. In spite of these complications, we show results that in simulation surpass the best of the heuristic elevator control algorithms of which we are aware. These results demonstrate the power of RL on a very large scale stochastic dynamic optimization problem of practical utility.

#### D

Sample Complexity for Learning Recurrent Perceptron Mappings, Bhaskar Dasgupta and Eduardo D. Sontag
Recurrent perceptron classifiers generalize the classical perceptron model. They take into account those correlations and dependences among input coordinates which arise from linear digital filtering. This paper provides tight bounds on sample complexity associated to the fitting of such models to experimental data.
Geometry of Early Stopping in Linear Networks, Robert Dodier
A theory of early stopping as applied to linear models is presented. The backpropagation learning algorithm is modeled as gradient descent in continuous time. Given a training set and a validation set, all weight vectors found by early stopping must lie on a certain quadric surface, usually an ellipsoid. Given a training set and a candidate early stopping weight vector, all validation sets have least-squares weights lying on a certain plane. This latter fact can be exploited to estimate the probability of stopping at any given point along the trajectory from the initial weight vector to the least-squares weights derived from the training set, and to estimate the probability that training goes on indefinitely. The prospects for extending this theory to nonlinear models are discussed.
Temporal Difference Learning in Continuous Time and Space, Kenji Doya
A continuous-time, continuous-state version of the temporal difference (TD) algorithm is derived in order to facilitate the application of reinforcement learning to real-world control tasks and neurobiological modeling. An optimal nonlinear feedback control law was also derived using the derivatives of the value function. The performance of the algorithms was tested in a task of swinging up a pendulum with limited torque. Both the critic'' that specifies the paths to the upright position and the actor'' that works as a nonlinear feedback controller were successfully implemented by radial basis function (RBF) networks.

#### E

Analog VLSI Processor Implementing the Continuous Wavelet Transform, R. Timothy Edwards and Gert Cauwenberghs
We present an integrated analog processor for real-time wavelet decomposition and reconstruction of continuous temporal signals covering the audio frequency range. The processor performs complex harmonic modulation and Gaussian lowpass filtering in 16 parallel channels, each clocked at a different rate, producing a multiresolution mapping on a logarithmic frequency scale. Our implementation uses mixed-mode analog and digital circuits, oversampling techniques, and switched-capacitor filters to achieve a wide linear dynamic range while maintaining compact circuit size and low power consumption. We include experimental results on the processor and characterize its components separately from measurements on a single-channel test chip.
VLSI Model of Primate Visual Smooth Pursuit, Ralph Etienne-Cummings Jan Van der Spiegel Paul Mueller
A one dimensional model of primate smooth pursuit mechanism has been implemented in 2 um CMOS VLSI. The model consolidates Robinson's negative feedback model with Wyatt and Pola's positive feedback scheme, to produce a smooth pursuit system which zero's the velocity of a target on the retina. Furthermore, the system uses the current eye motion as a predictor for future target motion. Analysis, stability and biological correspondence of the system are discussed. For implementation at the focal plane, a local correlation based visual motion detection technique is used. Velocity measurements, ranging over 4 orders of magnitude with < 15% variation, provides the input to the smooth pursuit system. The system performed successful velocity tracking for high contrast scenes. Circuit design and performance of the complete smooth pursuit system is presented.

#### F

High-Speed Airborne Particle Monitoring Using Artificial Neural Networks , Alistair Ferguson, Theo Sabisch, Paul Kaye, Laurence C Dixon, and Hamid Bolouri
Current environmental monitoring systems assume particles to be spherical, and do not attempt to classify them. A laser-based system developed at the University of Hertfordshire aims at classifying airborne particles through the generation of two-dimensional scattering profiles. The performances of template matching, and two types of neural network (HyperNet and semi-linear units) are compared for image classification. The neural network approach is shown to be capable of comparable recognition performance, while offering a number of advantages over template matching.
Does the wake-sleep algorithm produce good density estimators?, Brendan J. Frey, Geoffrey E. Hinton, Peter Dayan
The wake-sleep algorithm (Hinton, Dayan, Frey and Neal 1995) is a relatively efficient method of fitting a multilayer stochastic generative model to high-dimensional data. In addition to the top-down connections in the generative model, it makes use of bottom-up connections for approximating the probability distribution over the hidden units given the data, and it trains these bottom-up connections using a simple delta rule. We use a variety of synthetic and real data sets to compare the performance of the wake-sleep algorithm with Monte Carlo and mean field methods for fitting the same generative model and also compare it with other models that are less powerful but easier to fit.
How perception guides production in birdsong learning, Christopher L. Fry
A computational model of song learning in the song sparrow (Melospiza melodia) learns to categorize the different syllables of a song sparrow song and uses this categorization to train itself to reproduce song. The model fills a crucial gap in the computational explanation of birdsong learning by exploring the organization of perception in songbirds. It shows how competitive learning may lead to the organization of a specific nucleus in the bird brain, replicates the song production results of a previous model (Doya and Sejnowski, 1995), and demonstrates how perceptual learning can guide production through reinforcement learning.
Active Learning in Multilayer Perceptrons, Kenji Fukumizu
We propose an active learning method with hidden-unit reduction, which is devised specially for multilayer perceptrons (MLP). First, we review our active learning method, and point out that many Fisher-information-based methods applied to MLP have a critical problem: the information matrix may be singular. To solve this problem, we derive the singularity condition of an information matrix, and propose an active learning technique that is applicable to MLP. Its effectiveness is verified through experiments.

#### G

Factorial Hidden Markov Models, Zoubin Ghahramani and Michael I. Jordan
We present a framework for learning in hidden Markov models with distributed state representations. Within this framework, we derive a learning algorithm based on the Expectation--Maximization (EM) procedure for maximum likelihood estimation. Analogous to the standard Baum-Welch update rules, the M-step of our algorithm is exact and can be solved analytically. However, due to the combinatorial nature of the hidden state representation, the exact E-step is intractable. A simple and tractable mean field approximation is derived. Empirical results on a set of problems suggest that both the mean field approximation and Gibbs sampling are viable alternatives to the computationally expensive exact algorithm.
Optimizing Cortical Mappings, Geoffrey J. Goodhill, Steven Finch and Terrence J. Sejnowski
Topographic'' mappings occur frequently in the brain. A popular approach to understanding the structure of such mappings is to map points representing input features in a space of a few dimensions to points in a 2 dimensional space using some self-organizing algorithm. We argue that a more general approach may be useful, where similarities between features are not constrained to be geometric distances, and the objective function for topographic matching is chosen explicitly rather than being specified implicitly by the self-organizing algorithm. We investigate analytically an example of this more general approach applied to the structure of interdigitated mappings, such as the pattern of ocular dominance columns in primary visual cortex.
A model of transparent motion and non-transparent motion aftereffects Alexander Grunewald
A model of human motion perception is presented. The model contains two stages of direction selective units. The first stage contains broadly tuned units, while the second stage contains units that are narrowly tuned. The model accounts for the motion aftereffect through adapting units at the first stage and inhibitory interactions at the second stage. The model explains how two populations of dots moving in slightly different directions are perceived as a single population moving in the direction of the vector sum, and how two populations moving in strongly different directions are perceived as transparent motion. The model also explains why the motion aftereffect in both cases appears as non-transparent motion.

#### H

Cholinergic suppression of transmission may allow combined associative memory function and self-organization in the neocortex., Michael E. Hasselmo and Milos Cekic
Selective suppression of transmission at feedback synapses during learning is proposed as a mechanism for combining associative feed back with self-organization of feedforward synapses. Experimental data demonstrates cholinergic suppression of synaptic transmission in layer I (feedback synapses), and a lack of suppression in layer IV (feed forward synapses). A network with this feature uses local rules to learn mappings which are not linearly separable. During learning, sensory stimuli and desired response are simultaneously presented as input. Feedforward connections form self-organized representations of input, while suppressed feedback connections learn the transpose of feedfor ward connectivity. During recall, suppression is removed, sensory input activates the self-organized representation, and activity generates the learned response.
Discriminant Adaptive Nearest Neighbor Classification and Regression, Trevor Hastie and Robert Tibshirani
Nearest neighbor classification expects the class conditional probabilities to be locally constant, and suffers from bias in high dimensions We propose a locally adaptive form of nearest neighbor classification to try to finesse this curse of dimensionality. We use a local linear discriminant analysis to estimate an effective metric for computing neighborhoods. We determine the local decision boundaries from centroid information, and then shrink neighborhoods in directions orthogonal to these local decision boundaries, and elongate them parallel to the boundaries. Thereafter, any neighborhood-based classifier can be employed, using the modified neighborhoods. We also propose a method for global dimension reduction, that combines local dimension information. We indicate how these techniques can be extended to the regression problem.
Hierarchical Recurrent Neural Networks for Long-Term Dependencies S. El Hihi and Y. Bengio
We have already shown that extracting long-term dependencies from sequential data is difficult, both for deterministic dynamical systems such as recurrent networks, and probabilistic models such as hidden Markov models (HMMs) or input/output hidden Markov models (IOHMMs). In practice, to avoid this problem, researchers have used domain specific a-priori knowledge to give meaning to the hidden or state variables representing past context. In this paper, we propose to use a general type of a-priori knowledge, namely that the temporal dependencies are structured hierarchically. This implies that long-term dependencies are represented by variables with a long time scale. This principle is applied to a recurrent network which includes delays and multiple time scales. Experiments confirm the advantages of such structures. A similar approach is proposed for HMMs and IOHMMs.
Discovering Structure in Continuous Variables Using Bayesian Networks, Reimar Hofmann and Volker Tresp
We study Bayesian networks for continuous variables using nonlinear conditional density estimators. We demonstrate that meaningful models of the joint density can be extracted from a limited data set in a self-organized way and we present sampling techniques for belief update based on Markov blanket conditional density models.
Gradient and Hamiltonian Dynamics Applied to Learning in Neural Networks , James W. Howse, Chaouki T. Abdallah and Gregory L. Heileman
The process of machine learning can be considered in two stages: model selection and parameter estimation. In this paper a technique is presented for constructing dynamical systems with desired qualitative properties. The approach is based on the fact that an n-dimensional nonlinear dynamical system can be decomposed into one gradient and (n - 1) Hamiltonian systems. Thus, the model selection stage consists of choosing the gradient and Hamiltonian portions appropriately so that a certain behavior is obtainable. To estimate the parameters, a stably convergent learning rule is presented. This algorithm has been proven to converge to the desired system trajectory for all initial conditions and system inputs. This technique can be used to design neural network models which are guaranteed to solve the trajectory learning problem.

#### I

Parallel analog VLSI architectures for computation of heading direction and time-to-contact, G. Indiveri, J. Kramer, C. Koch
We describe two parallel analog VLSI architectures that integrate optical flow data obtained from arrays of elementary velocity sensors to estimate heading direction and time-to-contact. For heading direction computation, we performed simulations to evaluate the most important qualitative properties of the optical flow field and determine the best functional operators for the implementation of the architecture. For time-to-contact we exploited the divergence theorem to integrate data from all velocity sensors present in the architecture and average out possible errors.

#### J

Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks, T. Jaakkola, L. K. Saul, and M. I. Jordan
Sigmoid type belief networks, a class of probabilistic neural networks, provide a natural framework for compactly representing probabilistic information in a variety of unsupervised and supervised learning problems. Often the parameters used in these networks need to be learned from examples. Unfortunately, estimating the parameters via exact probabilistic calculations (i.e, the EM-algorithm) is intractable even for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step by bounding likelihoods instead of computing them exactly. We introduce extended and complementary representations for these networks and show that the estimation of the network parameters can be made fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains. The complementary networks can be used for continuous density estimation as well.
Learning Sparse Perceptrons, Jeffrey C. Jackson and Mark W. Craven
We introduce a new algorithm designed to learn sparse perceptrons over input representations which include high-order features. Our algorithm, which is based on a hypothesis-boosting method, is able to PAC-learn a relatively natural class of target concepts. Moreover, the algorithm appears to work well in practice: on a set of three problem domains, the algorithm produces classifiers that utilize small numbers of features yet exhibit good generalization performance. Perhaps most importantly, our algorithm generates concept descriptions that are easy for humans to understand.
The Role of Activity in Synaptic Competition at the Neuromuscular Junction, Samuel R. H. Joseph and David J. Willshaw
An extended version of the dual constraint model of motor endplate morphogenesis is presented that includes activity dependent and independent competition. It is supported by a wide range of recent neurophysiological evidence that indicates a strong relationship between synaptic efficacy and survival. The computational model is justified at the molecular level and its predictions match the developmental and regenerative behaviour of real synapses.

#### K

Temporal coding in the sub-millisecond range: Model of barn owl auditory pathway, Richard Kempter, Wulfram Gerstner, J. Leo van Hemmen, and Hermann Wagner
Binaural coincidence detection is essential for the localization of externalsounds and requires auditory signal processing with high temporal precision. We present an integrate-and-fire model of spike processing in the auditory pathway of the barn owl. It is shown that a temporal precision in the microsecond range can be achieved with neuronal time constants which are at least one magnitude longer. An important feature of our model is an unsupervised Hebbian learning rule which leads to a temporal fine tuning of the neuronal connections.
Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System (UK Version), ( US Mirrored version) Dan Kershaw, Tony Robinson and Mike Hochberg
A method for incorporating context-dependent phone classes in a connectionist-HMM hybrid speech recognition system is introduced. A modular approach is adopted, where single-layer networks discriminate between different context classes given the phone class and the acoustic data.The context networks are combined with a context-independent (CI) network to generate context-dependent (CD) phone probability estimates. Experiments show an average reduction in word error rate of 16\% and 13\% from the CI system on ARPA 5,000 word and SQALE 20,000 word tasks respectively. Due to improved modelling, the decoding speed of the CD system is more than twice as fast as the CI system.
Neural networks with quadratic VC dimension, Pascal Koiran and Eduardo D. Sontag
This paper shows that neural networks which use continuous activation functions have VC dimension at least as large as the square of the number of weights w. This result settles a long-standing open question, namely whether the well-known O(w log w) bound, known for hard-threshold nets, also held for more general sigmoidal nets. Implications for the number of samples needed for valid generalization are discussed.
Examples of Learning Curves from a Modified VC-Formalism, A. Kowalczyk, J. Szymanski, P.L. Bartlett, and R.C. Williamson
We examine the issue of evaluation of model specific parameters in a modified VC-formalism. Two examples are analyzed: the 2-dimensional homogeneous perceptron and the 1-dimensional higher order neuron. Both models are solved theoretically, and their learning curves are compared against true learning curves. It is shown that the formalism has the potential to generate a variety of learning curves, including ones displaying "phase transitions".
Prediction of Beta Sheets in Proteins, Anders Krogh and Soren K. Riis
Most current methods for prediction of protein secondary structure use a small window of the protein sequence to predict the structure of the central amino acid. We describe a new method for prediction of the non-local structure called beta-sheet, which consists of two or more beta-strands that are connected by hydrogen bonds. Since beta-strands are often widely separated in the protein chain, a network with two windows is introduced. After training on a set of proteins the network predicts the sheets well, but there are many false positives. By using a global energy function the beta-sheet prediction is combined with a local prediction of the three secondary structures alpha-helix, beta-strand and coil. The energy function is minimized using simulated annealing to give a final prediction.
[If you have problems with the above link or you are based in the UK, try THIS.]

#### L

The Gamma MLP for Speech Phoneme Recognition, S. Lawrence, A. C. Tsoi, A. D. Back
We define a Gamma multi-layer perceptron (MLP) as an MLP with the usual synaptic weights replaced by gamma filters (as proposed by de Vries and Principe) and associated gain terms throughout all layers. We derive gradient descent update equations and apply the model to the recognition of speech phonemes. We find that both the inclusion of gamma filters in all layers, and the inclusion of synaptic gains, improves the performance of the Gamma MLP. We compare the Gamma MLP with TDNN, Back-Tsoi FIR MLP, and Back-Tsoi IIR MLP architectures, and a local approximation scheme. We find that the Gamma MLP results in a substantial reduction in error rates.
Silicon Models for Auditory Scene Analysis, John Lazzaro and John Wawrzynek
We are developing special-purpose, low-power analog-to-digital converters for speech and music applications, that feature analog circuit models of biological audition to process the audio signal before conversion. This paper describes our most recent converter design, and a working system that uses several copies of the chip to compute multiple representations of sound from an analog input. This multi-representation system demonstrates the plausibility of inexpensively implementing an auditory scene analysis approach to sound processing.
Handwritten Word Recognition using Contextual Hybrid Radial Basis Function Network/Hidden Markov Models Lemarie, Gilloux, Leroux
Abstract A hybrid and contextual radial basis function network/hidden Markov model off-line handwritten word recognition system is presented. The task assigned to the radial basis function networks is the estimation of emission probabilities associated to Markov states. The model is contextual because the estimation of emission probabilities takes into account the left context of the current image segment as represented by its predecessor in the sequence. The new system does not outperform the previous system without context but acts differently.
Visual gesture-based robot guidance with a modular neural system, Enno Littmann, Andrea Drees, and Helge Ritter
We report on the development of the modular neural system See-Eagle'' for the visual guidance of robot pick-and-place actions. Several neural networks are integrated to a single system that visually recognizes human hand pointing gestures from stereo pairs of color video images. The output of the hand recognition stage is processed by a set of color-sensitive neural networks to determine the cartesian location of the target object that is referenced by the pointing gesture. Finally, this information is used to guide a robot to grab the target object and put it at another location that can be specified by a second pointing gesture. The accuracy of the current system allows to identify the location of the referenced target object to an accuracy of 1cm in a workspace area of 50x50cm. In our current environment, this is sufficient to pick and place arbitrarily positioned target objects within the workspace. The system consists of neural networks that perform the tasks of image segmentation, estimation of hand location, estimation of 3D-pointing direction, object recognition, and necessary coordinate transforms. Drawing heavily on the use of learning algorithms, the functions of all network modules were created from data examples only.

#### M

Implementation Issues in the Fourier Transform Algorithm, Yishay Mansour and Sigal Sahar (LT148)
The Fourier transform of boolean functions has come to play an important role in proving many important learnability results. We aim to demonstrate that the Fourier transform techniques are also a useful and practical algorithm in addition to being a powerful theoretical tool. We describe the more prominent changes we have introduced to the algorithm, ones that were crucial and without which the performance of the algorithm would severely deteriorate. One of the benefits we present is the confidence level for each prediction which measures the likelihood the prediction is correct.
Strong Unimodality and Exact Learning of Constant Depth mu-Perceptron Networks, Mario Marchand and Saeed Hadjifaradji
We present a statistical method that exactly learns the class of constant depth mu-perceptron networks with weights taken from {-1, 0 +1} and arbitrary thresholds when the distribution that generates the input examples is member of the family of product distributions. These networks (also known as nonoverlapping perceptron networks or read-once formulas over a weighted threshold basis) are loop-free neural nets in which each node has only one outgoing weight. With arbitrary high probability, the learner is able to exactly identify the connectivity (or skeleton) of the target mu-perceptron network by using a new statistical test which exploits the strong unimodality property of sums of independent random variables.
Learning to Predict Visibility and Invisibility from Occlusion Events, Jonathan A. Marshall and Richard K. Alley and Robert S. Hubbard
Visual occlusion events constitute a major source of depth information. This paper presents a self-organizing neural network that learns to detect, represent, and predict the visibility and invisibility relationships that arise during occlusion events, after a period of exposure to motion sequences containing occlusion and disocclusion events. The network develops two parallel opponent channels or "chains" of lateral excitatory connections for every resolvable motion trajectory. One channel, the "On" chain or "visible" chain, is activated when a moving stimulus is visible. The other channel, the "Off" chain or "invisible" chain, carries a persistent, amodal representation that predicts the motion of a formerly visible stimulus that becomes invisible due to occlusion. The learning rule uses disinhibition from the On chain to trigger learning in the Off chain. The On and Off chain neurons can learn separate associations with object depth ordering. The results are closely related to the recent discovery (Assad & Maunsell, 1995) of neurons in macaque monkey posterior parietal cortex that respond selectively to inferred motion of invisible stimuli.
Premitive Manipulation Learning with Connectionism, Yoky Matsuoka
Infants' manipulative exploratory behavior within the environment is a vehicle of cognitive stimulation [McCall 1974]. During this time, infants practice and perfect sensorimotor patterns that become behavioral modules which will be seriated and imbedded in more complex actions. This paper explores the development of such primitive learning systems using an embodied light-weight hand which will be used for a humanoid being developed at the MIT Artificial Intelligence Laboratory [Brooks and Stein 1993]. Primitive grasping procedures are learned from sensory inputs using a connectionist reinforcement algorithm while two submodules preprocess sensory data to recognize the hardness of objects and detect shear using competitive learning and back-propagation algorithm strategies, respectively. This system is not only consistent and quick during the initial learning stage, but also adaptable to new situations after training is completed.
SEEMORE: A View-Based Approach to 3-D Object Recognition Using Multiple Visual Cues, Bartlett W. Mel
A neurally-inspired visual object recognition system is described called SEEMORE, whose goal is to identify common objects from a large known set---independent of 3-D viewing angle, distance, and non-rigid distortion. SEEMORE's database consists of 100 objects that are rigid (shovel), non-rigid (telephone cord), articulated (book), statistical (shrubbery), and complex (photographs of scenes). Recognition results were obtained using a set of 102 color and shape feature channels within a simple feedforward network architecture. In response to a test set of 600 novel test views (6 of each object) presented individually in color video images, SEEMORE identified the object correctly 97% of the time (chance is 1) using a nearest neighbor classifier. Similar levels of performance were obtained for the subset of 15 non-rigid objects. Generalization behavior reveals emergence of striking natural category structure not explicit in the input feature dimensions. A long paper is available at ftp://quake.usc.edu/pub/mel/papers/mel.seemore.TR96.ps.gz.
Quadratic--Type Lyapunov Functions for Competitive Neural Networks with Different Time--Scales, Anke Meyer--Bäase
The dynamics of complex neural networks modelling the self--organization process in cortical maps must include the aspects of long and short--term memory. The behaviour of the network is such characterized by an equation of neural activity as a fast phenomenon and an equation of synaptic modification as a slow part of the neural system. We present a quadratic--type Lyapunov function for the flow of a competitive neural system with fast and slow dynamic variables. We also show the consequences of the stability analysis on the neural net parameters.

#### N

Dynamics of Attention as Near Saddle-Node Bifurcation Behavior, A. Hiroyuki Nakahara and Kenji Doya
In consideration of attention as a means for goal-directed behavior in non-stationary environments, we argue that the dynamics of attention should satisfy two opposing demands: long-term maintenance and quick transition. These two characteristics are contradictory within the linear domain. We propose the near saddle-node bifurcation behavior of a sigmoidal unit with self-connection as a candidate of dynamical mechanism that satisfies both of these demands. We further show in simulations of the bug-eat-food' tasks by evolutionary programming that the near saddle-node bifurcation behavior of recurrent networks can emerge as a functional property for survival in non-stationary environments.
Control of Selective Visual Attention: Modeling the Where'' Pathway, Ernst Niebur and Christof Koch
Intermediate and higher vision processes require selection of a subset of the available sensory information before further processing. Usually, this selection is implemented in the form of a spatially circumscribed region of the visual field, the so-called focus of attention'' which scans the visual scene dependent on the input and on the attentional state of the subject. We here present a model for the control of the focus of attention in primates, based on a saliency map. This mechanism is not only expected to model the functionality of biological vision but also to be essential for the understanding of complex scenes in machine vision.
Optimal Asset Allocation using Adaptive Dynamic Programming, Ralph Neuneier
The article fomulates asset allocation as a Markovian Decision Problem and solves it with q-learning. The resulting strategy for the task of investing liquid capital in the German stock market is much more profitable than a benchmark policy.

#### O

Family Discovery, Stephen M. Omohundro
Family discovery' is the task of learning the dimension and structure of a parameterized family of stochastic models. It is especially appropriate when the training examples are partitioned into episodes' of samples drawn from a single parameter value. We present three family discovery algorithms based on surface learning and show that they significantly improve performance over two alternatives on a parameterized classification task.
Generating Accurate and Diverse Members of a Neural-Network Ensemble, David W. Opitz and Jude W. Shavlik
Neural-network ensembles have been shown to be very accurate classification techniques. Previous work has shown that an effective ensemble should consist of networks that are not only highly correct, but ones that make their errors on different parts of the input space as well. Most existing techniques, however, only indirectly address the problem of creating such a set of networks. In this paper we present a technique called ADDEMUP that uses genetic algorithms to directly search for an accurate and diverse set of trained networks. ADDEMUP works by first creating an initial population, then uses genetic operators to continually create new networks, keeping the set of networks that are as accurate as possible while disagreeing with each other as much as possible. Experiments on three DNA problems show that ADDEMUP is able to generate a set of trained networks that is more accurate than several existing approaches. Experiments also show that ADDEMUP is able to effectively incorporate prior knowledge, if available, to improve the quality of its ensemble.
Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging, Dirk Ormoneit and Volker Tresp
We compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method uses a Bayesian prior on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posterior parameter probability. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's bagging'', which recently has been found to produce impressive results for classification networks.

#### P

Symplectic Nonlinear Component Analysis, Lucas C. Parra
Statistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linear correlated and Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linear statistical dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis finally should find a factorial representation for nonlinear statistical dependent distributed signals. This paper proposes for this task a novel feed-forward, information conserving, nonlinear map - the explicit symplectic transformations. It also solves the problem of non-Gaussian output distributions by considering single coordinate higher order statistics.
Pruning with generalization based weight saliencies: gammaOBD, gammaOBS, Morten With Pedersen, Lars Kai Hansen and Jan Larsen
The purpose of most architecture optimization schemes is to improve generalization. In this presentation we suggest to estimate the weight saliency as the associated change in generalization error if the weight is pruned. We detail the implementation of both an O(N)-storage scheme extending OBD, as well as an O(N^2) scheme extending OBS. We illustrate the viability of the approach on prediction of a chaotic time series.
A Neural Network Autoassociator for Induction Motor Failure Prediction, Thomas Petsche, Angelo Marcantonio, Christian Darken, Stephen J. Hanson, Gary M. Kuhn and Iwan Santoso
We present results on the use of neural network based autoassociators which act as novelty or anomaly detectors to detect imminent motor failures. The autoassociator is trained to reconstruct spectra obtained from the healthy motor. In laboratory tests, we have demonstrated that the trained autoassociator has a small reconstruction error on measurements recorded from healthy motors but a larger error on those recorded from a motor with a fault. We have designed and built a motor monitoring system using an autoassociator for anomaly detection and are in the process of testing the system at three industrial and commercial sites.
A Neural Network Classifier for the I1000 OCR Chip, John Platt, Tim Allen
This paper describes a neural network classifier for the I1000 chip, which optically reads the E13B font characters at the bottom of checks. The first layer of the neural network is a hardware linear classifier which recognizes the characters in this font. A second software neural layer is implemented on an inexpensive microprocessor to clean up the results of the first layer. The hardware linear classifier is mathematically specified using constraints and an optimization principle. The weights of the classifier are found using the active set method, similar to Vapnik's separating hyperplane algorithm. In 7.5 minutes of SPARC 2 time, the method solves for 1523 Lagrange multipliers, which is equivalent to training on a data set of approximately 128,000 examples. The resulting network performs quite well: when tested on a test set of 1500 real checks, it has a 99.995% character accuracy rate.

#### R

Modeling Saccadic Targeting in Visual Search, Rajesh P.N. Rao, Gregory J. Zelinsky, Mary M. Hayhoe and Dana H. Ballard
Visual cognition depends critically on the ability to make rapid eye movements known as saccades that orient the fovea over targets of interest in a visual scene. Saccades are known to be ballistic: the pattern of muscle activation for foveating a prespecified target location is computed prior to the movement and visual feedback is precluded. Despite these distinctive properties, there has been no general model of the saccadic targeting strategy employed by the human visual system during visual search in natural scenes. This paper proposes a model for saccadic targeting that uses iconic scene representations derived from oriented spatial filters at multiple scales. Visual search proceeds in a coarse-to-fine fashion with the largest scale filter responses being compared first. The model was empirically tested by comparing its performance with actual eye movement data from human subjects in a natural visual search task; preliminary results indicate substantial agreement between eye movements predicted by the model and those recorded from human subjects.
A Practical Monte Carlo Implementation of Bayesian Learning, Carl Edward Rasmussen
A practical method for Bayesian training of feed-forward neural networks using sophisticated Monte Carlo methods is presented and evaluated. In reasonably small amounts of computer time this approach outperforms other state-of-the-art methods on 5 data-limited tasks from real world domains.
Modeling Interactions of the Rat's Place and Head Direction Systems, A. David Redish and David S. Touretzky
We have developed a computational theory of rodent navigation that includes analogs of the place cell system, the head direction system, and path integration. In this paper we present simulation results showing how interactions between the place and head direction systems can account for recent observations about hippocampal place cell responses to doubling and/or rotation of cue cards in a cylindrical arena (Sharp et al. 1990).
Human Face Detetion in Visual Scenes, Henry A. Rowley, Shumeet Baluja, and Takeo Kanade
We present a neural network-based face detection system. A retinally connected neural network examines small windows of an image, and decides whether each window contains a face. The system arbitrates between multiple networks to improve performance over a single network. We use a bootstrap algorithm for training, which adds false detections into the training set as training progresses. This eliminates the difficult task of manually selecting non-face training examples, which must be chosen to span the entire space of non-face images. Comparisons with another state-of-the-art face detection system are presented; our system has better performance in terms of detection and false-positive rates.
Stable Dynamic Parameter Adaptation, Stefan M. Rüger
A stability criterion for dynamic parameter adaptation is given. In the case of the learning rate of backpropagation, a class of stable algorithms is presented and studied, including a convergence proof.

#### S

Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks, David Saad and Sara A. Solla
We consider the problem of on-line gradient descent learning for general two-layer neural networks. An analytic solution is presented and used to investigate the role of the learning rate in controlling the evolution and convergence of the learning process.
Reinforcement Learning by Probability Matching , Philip N. Sabes and Michael I. Jordan
We present a new algorithm for associative reinforcement learning. The algorithm is based upon the idea of matching a network's output probability with a probability distribution derived from the environment's reward signal. This Probability Matching algorithm is shown to perform faster and be less susceptible to local minima than previously existing algorithms. We use Probability Matching to train mixture of experts networks, an architecture for which other reinforcement learning rules fail to converge reliably on even simple problems. This architecture is particularly well suited for our algorithm as it can compute arbitrarily complex functions yet calculation of the output probability is simple.
Exploiting tractable substructures in intractable networks , Lawrence K. Saul and Michael I. Jordan
We develop a refined mean field approximation for inference and learning in probabilistic neural networks. Our mean field theory, unlike most, does not assume that the units behave as independent degrees of freedom; instead it exploits in a principled way the existence of large substructures that are computationally tractable. To illustrate the advantages of this framework, we show how to incorporate weak higher-order interactions into a first-order hidden Markov model, treating the corrections (but not the first-order structure) within mean field theory.
From Isolation to Cooperation: An Alternative View of a System of Experts, Stefan Schaal and Christopher G. Atkeson
We introduce a constructive, incremental learning system for regression problems that models data by means of locally linear experts. In contrast to other approaches, the experts are trained independently and do not compete for data during learning. Only when a prediction for a query is required do the experts cooperate by blending their individual predictions. Each expert is trained by minimizing a penalized local cross validation error using second order methods. In this way, an expert is able to adjust the size and shape of the receptive field in which its predictions are valid, and also to adjust its bias on the importance of individual input dimensions. The size and shape adjustment corresponds to finding a local distance metric, while the bias adjustment accomplishes local dimensionality reduction. We derive asymptotic results for our method. In a variety of simulations we demonstrate the properties of the algorithm with respect to interference, learning speed, prediction accuracy, feature detection, and task oriented incremental learning.
Improved Silicon Cochlea using Compatible Lateral Bipolar Transistors, André van Schaik, Eric Fragnière, and Eric Vittoz
Analog electronic cochlear models need exponentially scaled filters. CMOS Compatible Lateral Bipolar Transistors (CLBTs) can create exponentially scaled currents when biased using a resistive line with a voltage difference between both ends of the line. Since these CLBTs are independent of the CMOS threshold voltage, current sources implemented with CLBTs are much better matched than current sources created with MOS transistors operated in weak inversion. Measurements from integrated test chips are shown to verify the improved matching.
Tempering Backpropagation Networks: Not All Weights are Created Equal, Nicol N. Schraudolph and Terrence J. Sejnowski
Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural information instead of discarding it, and introduce a framework for tempering'' each weight accordingly.

In the tempering model, activation and error signals are treated as approximately independent random variables. The characteristic scale of weight changes is then matched to that of the residuals, allowing structural properties such as a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights.

This approach yields hitherto unparalleled performance on the family relations benchmark, a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.

Forward-backward retraining of recurrent neural networks, Andrew Senior and Tony Robinson
This paper describes the training of a recurrent neural network as the probability estimator for a hidden Markov model off-line handwriting recognition system. Probability distributions are estimated by the network for each of a series of frames representing the features present in a vertical strip through a handwritten word. The network is trained by backpropagation through time'. The supervised training algorithm requires target outputs to be provided for each frame. Three methods for deriving these targets are presented. A novel method based upon the forward-backward algorithm widely used in speech recognition is found to result in the recognizer with the lowest error rate.
Adaptive Mixture of Probabilistic Transducers , Y. Singer
We introduce and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each model in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best model from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer.
Onset-based Sound Segmentation, Leslie S. Smith
A technique for segmenting sounds using processing based on mammalian early auditory processing is presented. The technique is based on features in sound which neuron spike recording suggests are detected in the cochlear nucleus. The sound signal is bandpassed and each signal processed to enhance onsets and offsets. The onset and offset signals are compressed, then clustered both in time and across frequency channels using a network of integrate-and-fire neurons. Onsets and offsets are signalled by spikes, and the timing of these spikes used to segment the sound.
Learning with Ensembles: How Overfitting can be Useful, Peter Sollich and Anders Krogh
We study the characteristics of learning with ensembles. Solving exactly the simple model of an ensemble of linear students, we find surprisingly rich behaviour. For learning in large ensembles, it is advantageous to use under-regularized students, which actually over-fit the training data. Globally optimal performance can be obtained by choosing the training set sizes of the students appropriately. For smaller ensembles, optimization of the ensemble weights can yield significant improvements in ensemble generalization performance, in particular if the individual students are subject to noise in the training process. Choosing students with a wide range of regularization parameters makes this improvement robust against changes in the unknown level of noise in the training data.
Beating a Defender in Robotic Soccer: Memory Based Learning of a Continuous Function, Peter Stone and Manuela Veloso
Learning how to adjust to an opponent's position is critical to the success of having intelligent agents collaborating towards the achievement of specific tasks in unfriendly environments. This paper describes our work on a Memory-based technique for to choose an action based on a continuous-valued state attribute indicating the position of an opponent. We investigate the question of how an agent performs in nondeterministic variations of the training situations. Our experiments indicate that when the random variations fall within some bound of the initial training, the agent performs better with some initial training rather than from a tabula-rasa.
Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, Richard S. Sutton
On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases, there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they used learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes ("rollouts"), as in classical Monte Carlo methods, and as in the TD(lambda) algorithm when \lambda=1. However, in our experiments this always resulted in substantially poorer performance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general lambda.

#### T

Learning the structure of similarity, Joshua B. Tenenbaum
The additive clustering (ADCLUS) model (Shepard & Arabie, 1979) treats the similarity of two stimuli as a weighted additive measure of their common features. Inspired by recent work in unsupervised learning with multiple cause models, we propose a new, statistically well-motivated algorithm for discovering the structure of natural stimulus classes using the ADCLUS model, which promises substantial gains in conceptual simplicity, practical efficiency, and solution quality over earlier efforts. We also present preliminary results with artificial data and two classic similarity data sets.
Is Learning The n-th Thing Any Easier Than Learning The First?, Sebastian Thrun
This paper investigates learning in a lifelong context. Lifelong learning addresses situations in which a learner faces a whole stream of learning tasks. Such scenarios provide the opportunity to transfer knowledge across multiple learning tasks, in order to generalize more accurately from less training data. In this paper, several different approaches to lifelong learning are described, and applied in an object recognition domain. It is shown that across the board, lifelong learning approaches generalize consistently more accurately from less training data, by their ability to transfer knowledge across learning tasks.
A Multiscale Attentional Framework for Relaxation Neural Networks, Dimitris I. Tsioutsias & Eric Mjolsness
We investigate the optimization of neural networks governed by general objective functions. Practical formulations of such objectives are notoriously difficult to solve; a common problem is the poor local extrema that result by any of the applied methods. In this paper, a novel framework is introduced for the solution of large-scale optimization problems. It assumes little about the objective function and can be applied to general nonlinear, non-convex functions; objectives in thousand of variables are thus efficiently minimized by a combination of techniques - deterministic annealing, multiscale optimization, attention mechanisms and trust region optimization methods.

#### V

Empirical Entropy Manipulation for Real-World Problems, Paul Viola, Nicol N. Schraudolph and Terrence J. Sejnowski
No finite sample is sufficient to determine the density, and therefore the entropy, of a signal directly. Some assumption about either the functional form of the density or about its smoothness is necessary. Both amount to a prior over the space of possible density functions. By far the most common approach is to assume that the density has a parametric form.

By contrast we derive a differential learning rule called EMMA that optimizes entropy by way of kernel density estimation. Entropy and its derivative can then be calculated by sampling from this density estimate. The resulting parameter update rule is surprisingly simple and efficient.

We will describe two real-world applications that can be solved efficiently and reliably using EMMA. In the first application EMMA is used to align 3D models to complex natural images. In the second application EMMA is used to detect and correct corruption in magnetic resonance images (MRI). Both applications are beyond the scope of existing parametric entropy models.

#### W

Absence of Cycles in Symmetric Neural Networks, X. Wang, A. Jagota, F. Botelho and M. Garzon
For a given recurrent neural network, a discrete-time model may have asymptotic dynamics different from the one of a related continuous-time model. In this paper, we consider a discrete-time model that discretizes the continuous-time leaky integrator model and study its parallel and sequential dynamics for symmetric networks. We provide sufficient (and necessary in many cases) conditions for the discretized model to have the same cycle-free dynamics of the corresponding continuous-time model in symmetric networks.
Bayesian methods for Mixtures of Experts (UK Version), US Mirrored version, Steve Waterhouse, David MacKay and Tony Robinson
We present a Bayesian framework for inferring the parameters of a mixture of experts model based on ensemble learning by variational free energy minimisation. The Bayesian approach avoids the over-fitting and noise level under-estimation problems of traditional maximum likelihood inference. We demonstrate these methods on artificial problems and sunspot time series prediction.
Constructive Methods for Mixtures of Experts (UK Version), US mirrored version, Steve Waterhouse and Tony Robinson
We present two additions to the hierarchical mixture of experts (HME) architecture. By applying a likelihood splitting criteria to each expert in the HME we grow'' the tree adaptively during training. Secondly, by considering only the most probable path through the tree we may prune'' branches away, either temporarily, or permanently if they become redundant. We demonstrate results for the growing and path pruning algorithms which show significant speed ups and more efficient use of parameters over the standard fixed structure in discriminating between two interlocking spirals and classifying 8-bit parity patterns.
SPERT-II: A Vector Microprocessor System and its Application to Large Problems in Backpropagation Training, John Wawrzynek, Krste Asanovic, Brian Kingsbury, James Beck, David Johnson, and Nelson Morgan
We report on our development of a high-performance system for neural network and other signal processing applications. We have designed and implemented a vector microprocessor and packaged it as an attached processor for a conventional workstation. We present performance comparisons with workstations on neural network backpropagation training. The SPERT-II system demonstrates roughly 15 times the performance of a mid-range workstation and five times the performance of a high-end workstation.
Adaptive Back-Propagation in On-Line learning of Multilayer Networks, Ansgar H.L. West and David Saad
An adaptive back-propagation algorithm is studied and compared with gradient descent (standard back-propagation) for on-line learning in two-layer neural networks with an arbitrary number of hidden units. Within a statistical mechanics framework, both numerical studies and a rigorous analysis show that the adaptive back-propagation method results in faster training by breaking the symmetry between hidden units more efficiently and by providing faster convergence to optimal generalization than gradient descent.
Gaussian Processes for Regression C. K. I. Williams and C. E. Rasmussen
The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.
A Smoothing Regularizer for Recurrent Neural Networks, Lizhong Wu and John Moody
We derive a smoothing regularizer for recurrent network models by requiring robustness in prediction performance to perturbations of the training data. The regularizer can be viewed as a generalization of the first order Tikhonov stabilizer to dynamic models. The closed-form expression of the regularizer covers both time-lagged and simultaneous recurrent nets, with feedforward nets and one-layer linear nets as special cases. We have successfully tested this regularizer in a number of case studies and found that it performs better than standard quadratic weight decay.

#### Z

High-Performance Job-Shop Scheduling With A Time-Delay TD(lambda) Network, Wei Zhang and Thomas G. Dietterich
Job-shop scheduling is an important task for manufacturing industries. We are interested in the particular task of scheduling payload processing for NASA's space shuttle program. This paper summarizes our previous work on formulating this task for solution by the reinforcement learning algorithm $TD(\lambda)$. A shortcoming of this previous work was its reliance on hand-engineered input features. This paper shows how to extend the time-delay neural network (TDNN) architecture to apply it to irregular-length schedules. Experimental tests show that this TDNN-$TD(\lambda)$ network can match the performance of our previous hand-engineered system. The tests also show that both neural network approaches significantly out-perform the best previous (non-learning) solution to this problem in terms of the quality of the resulting schedules and the number of search steps required to construct them.

We have attempted to ensure that all information is correct, but we cannot guarantee it. Please send comments and corrections to:
L. Douglas Baker
Carnegie Mellon University
ldbapp+nips@cs.cmu.edu