Ajit Singh, Andrew Guillory, Jeff Bilmes. On Bisubmodular Maximization. To appear, Artificial Intelligence & Statistics, April 2012.

Abstract: Bisubmodularity extends the concept of submodularity to set functions with two arguments. We show how bisubmodular maximization leads to richer value-of-information problems, using examples in sensor placement and feature selection. We present the first constant-factor approximation algorithm for a wide class of bisubmodular maximization problems.

Zafer Aydin, Ajit Singh, Jeff Bilmes, and William S. Noble. Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure. BMC Bioinformatics 12:154. 2011.
[Abstract] [Publisher, open access journal]


Abstract: Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight.

In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements.

We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at
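The idea of sparsifying most of a model's parameters while keeping rows of a conditional probability table valid can be illustrated with a generic magnitude-pruning sketch. This is not the sparsification algorithm from the paper, just a toy demonstration of the effect of removing the smallest parameters and renormalizing:

```python
import numpy as np

def prune_cpt(cpt, sparsity=0.9, floor=1e-6):
    """Illustrative magnitude pruning of a conditional probability table:
    zero out the smallest entries globally, then renormalize each row.
    (Generic sketch; not the algorithm from the paper.)"""
    flat = np.sort(cpt, axis=None)
    cutoff = flat[int(sparsity * flat.size)]       # global magnitude threshold
    pruned = np.where(cpt >= cutoff, cpt, 0.0)
    rows = pruned.sum(axis=1, keepdims=True)
    return pruned / np.maximum(rows, floor)        # rows sum to 1 again

rng = np.random.default_rng(1)
cpt = rng.dirichlet(np.full(8, 0.3), size=16)      # 16 contexts x 8 outcomes
sparse = prune_cpt(cpt, sparsity=0.75)
print((sparse == 0).mean())                        # fraction of zeroed parameters
```

A sparse table like this skips most multiplications during inference, which is the source of the speed-up reported in the abstract.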

Ajit P. Singh and Geoffrey J. Gordon. A Bayesian Matrix Factorization Model for Relational Data. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI). July 2010.
[Abstract] [Paper]

Abstract: Relational learning can be used to augment one data source with other correlated sources of information, to improve predictive accuracy. We frame a large class of relational learning problems as matrix factorization problems, and propose a hierarchical Bayesian model. Training our Bayesian model using random-walk Metropolis-Hastings is impractically slow, and so we develop a block Metropolis-Hastings sampler which uses the gradient and Hessian of the likelihood to dynamically tune the proposal. We demonstrate that a predictive model of brain response to stimuli can be improved by augmenting it with side information about the stimuli.

Ajit P. Singh and Geoffrey J. Gordon. A Unified View of Matrix Factorization Models. In Machine Learning and Knowledge Discovery in Databases, European Conference (ECML/PKDD). September 2008.
[Abstract] [Paper] [Publisher]

Abstract: We present a unified view of matrix factorization that frames the differences among popular methods, such as NMF, Weighted SVD, E-PCA, MMMF, pLSI, pLSI-pHITS, Bregman co-clustering, and many others, in terms of a small number of modeling choices. Many of these approaches can be viewed as minimizing a generalized Bregman divergence, and we show that (i) a straightforward alternating projection algorithm can be applied to almost any model in our unified view; (ii) the Hessian for each projection has special structure that makes a Newton projection feasible, even when there are equality constraints on the factors, which allows for matrix co-clustering; and (iii) alternating projections can be generalized to simultaneously factor a set of matrices that share dimensions. These observations immediately yield new optimization algorithms for the above factorization methods, and suggest novel generalizations of these methods such as incorporating row and column biases, and adding or relaxing clustering constraints.

Ajit P. Singh and Geoffrey J. Gordon. Relational Learning via Collective Matrix Factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). August 2008.
[Abstract] [Paper] [Publisher]

Abstract: Relational learning is concerned with predicting unknown values of a relation, given a database of entities and observed relations between entities. An example of relational learning is movie rating prediction, where the entities are users, movies, genres, and actors. Relations encode each user’s ratings of movies, the genre of each movie, and the roles that actors played in movies. A common prediction technique given one pairwise relation, for example a #users × #movies matrix, is low-rank matrix factorization. In domains with multiple relations, represented as multiple matrices, we may improve predictive accuracy by exploiting information from one relation while predicting another. To this end, we propose a collective matrix factorization model: we simultaneously factor a set of matrices, sharing parameters among factors when an entity participates in multiple relations. Each relation can have a different value type and error distribution; to handle this diversity, we generalize our model to allow nonlinear relations between the parameters and outputs, using Bregman divergences to measure error. We extend standard alternating projection algorithms to collective factorization, and derive efficient Newton-Raphson updates. Furthermore, we propose stochastic optimization methods to deal with large, sparse matrices. Our model can handle any pairwise relational schema and a wide variety of error models. We demonstrate that additional relations improve prediction in the movie rating domain. Since our model generalizes several existing matrix factorization problems, this work also yields new large scale optimization algorithms for these problems.
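As an illustrative sketch of the shared-parameter idea (not the paper's implementation, which handles general Bregman divergences and sparse matrices), collective factorization of two relations with a common entity type can be written as alternating ridge regressions. Here the movie factor participates in both a users × movies matrix and a movies × genres matrix; all sizes are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_genres, k = 30, 20, 5, 3

# Synthetic relations generated from shared low-rank factors.
U, V, W = (rng.normal(size=(n, k)) for n in (n_users, n_movies, n_genres))
X = U @ V.T            # users x movies (e.g. ratings)
Y = V @ W.T            # movies x genres (side information)

def ridge(A, B, lam=0.1):
    """Closed-form solution of min_Z ||A - Z B^T||^2 + lam ||Z||^2."""
    return A @ B @ np.linalg.inv(B.T @ B + lam * np.eye(B.shape[1]))

# Alternating least squares; the movie factor Vh is shared between the two
# factorizations, so its update stacks both relations it participates in.
Uh = rng.normal(size=(n_users, k))
Wh = rng.normal(size=(n_genres, k))
Vh = rng.normal(size=(n_movies, k))
for _ in range(50):
    Uh = ridge(X, Vh)
    Wh = ridge(Y.T, Vh)
    Vh = ridge(np.hstack([X.T, Y]), np.vstack([Uh, Wh]))

err = np.linalg.norm(X - Uh @ Vh.T) / np.linalg.norm(X)
print(round(err, 3))   # relative reconstruction error of the ratings matrix
```

Each update is one of the convex subproblems the abstract alludes to; the paper replaces the closed-form ridge step with Newton-Raphson (and a stochastic variant) to handle non-Gaussian error models.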

Andreas Krause, Ajit Singh and Carlos Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research 9(Feb), pp 235–284, 2008.
[Abstract] [Paper] [Publisher, open access journal]

Abstract: When monitoring spatial phenomena, which can often be modeled as Gaussian processes (GPs), choosing sensor locations is a fundamental task. There are several common strategies to address this task, for example, geometry or disk models, placing sensors at the points of highest entropy (variance) in the GP model, and A-, D-, or E-optimal design. In this paper, we tackle the combinatorial optimization problem of maximizing the mutual information between the chosen locations and the locations which are not selected. We prove that the problem of finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1-1/e) of the optimum by exploiting the submodularity of mutual information. We also show how submodularity can be used to obtain online bounds, and design branch and bound search procedures. We then extend our algorithm to exploit lazy evaluations and local structure in the GP, yielding significant speedups. We also extend our approach to find placements which are robust against node failures and uncertainties in the model. These extensions are again associated with rigorous theoretical approximation guarantees, exploiting the submodularity of the objective function. We demonstrate the advantages of our approach towards optimizing mutual information in a very extensive empirical study on two real-world data sets.
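The lazy-evaluation speedup mentioned in the abstract rests on submodularity: an element's marginal gain can only shrink as the selected set grows, so stale gains stored in a priority queue are valid upper bounds. A minimal sketch of lazy greedy selection follows; the toy coverage objective stands in for mutual information, and is invented for the example:

```python
import heapq

def lazy_greedy(ground_set, f, k):
    """Select k elements approximately maximizing a monotone submodular set
    function f, re-evaluating marginal gains lazily via a priority queue."""
    selected = []
    current = f(selected)
    # Max-heap of (negated gain, element); stored gains may be stale.
    heap = [(-(f([e]) - f([])), e) for e in ground_set]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        _, e = heapq.heappop(heap)
        fresh = f(selected + [e]) - current    # recompute the stale gain
        if not heap or fresh >= -heap[0][0]:   # still beats every upper bound
            selected.append(e)
            current += fresh
        else:
            heapq.heappush(heap, (-fresh, e))  # reinsert with updated gain
    return selected

# Toy submodular objective: coverage of a universe by sets.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {7}}
f = lambda S: len(set().union(*(sets[e] for e in S))) if S else 0
print(lazy_greedy(list(sets), f, 2))
```

By the classic Nemhauser–Wolsey–Fisher result, this greedy selection is within (1 − 1/e) of the optimum for monotone submodular objectives, which is the guarantee the abstract invokes.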

Tim van Allen, Ajit Singh, Russell Greiner, and Peter Hooper. Quantifying the uncertainty of a belief net response: Bayesian error-bars for belief net inference. Artificial Intelligence 172(4-5), 2008.
[Abstract] [Paper] [Publisher]

Abstract: A Bayesian belief network models a joint distribution over variables using a DAG to represent variable dependencies and network parameters to represent the conditional probability of each variable given an assignment to its immediate parents. Existing algorithms assume each network parameter is fixed. From a Bayesian perspective, however, these network parameters can be random variables that reflect uncertainty in parameter estimates, arising because the parameters are learned from data, or because they are elicited from uncertain experts.

Belief networks are commonly used to compute responses to queries—i.e., return a number for P(H=h|E=e). Parameter uncertainty induces uncertainty in query responses, which are thus themselves random variables. This paper investigates this query response distribution, and shows how to accurately model this distribution for any query and any network structure. In particular, we prove that the query response is asymptotically Gaussian and provide its mean value and asymptotic variance. Moreover, we present an algorithm for computing these quantities that has the same worst-case complexity as inference in general, and also describe straight-line code when the query includes all n variables. We provide empirical evidence that (1) our approximation of the variance is very accurate, and (2) a Beta distribution with these moments provides a very accurate model of the observed query response distribution. We also show how to use this to produce accurate error bars around these responses—i.e., to determine that the response to P(H=h|E=e) is x±y with confidence 1−δ.
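The final modeling step, fitting a Beta distribution to the computed mean and variance of the query response, is a standard method-of-moments calculation. A minimal sketch (the function name and example moments are invented for illustration):

```python
def beta_from_moments(mean, var):
    """Method-of-moments Beta(a, b) fit matching a given mean and variance.
    Requires 0 < mean < 1 and var < mean * (1 - mean)."""
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

a, b = beta_from_moments(0.7, 0.01)
# Sanity check: recover the moments of Beta(a, b).
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
print(round(mean, 3), round(var, 4))
```

Quantiles of the fitted Beta then give the x ± y error bars around the query response at confidence 1 − δ.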

Jure Leskovec, Ajit Singh, and Jon M. Kleinberg. Patterns of Influence in a Recommendation Network. In Advances in Knowledge Discovery and Data Mining, Tenth Pacific-Asia Conference (PAKDD). April 2006.
[Abstract] [Paper] [Publisher]

Abstract: Information cascades are phenomena in which individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. We consider information cascades in the context of recommendations, and in particular study the patterns of cascading recommendations that arise in large social networks. We investigate a large person-to-person recommendation network, consisting of four million people who made sixteen million recommendations on half a million products. Such a dataset allows us to pose a number of fundamental questions: What kinds of cascades arise frequently in real life? What features distinguish them? We enumerate and count cascade subgraphs on large directed graphs; as one component of this, we develop a novel efficient heuristic based on graph isomorphism testing that scales to large datasets. We discover novel patterns: the distribution of cascade sizes is approximately heavy-tailed; cascades tend to be shallow, but occasional large bursts of propagation can occur. The relative abundance of different cascade subgraphs suggests subtle properties of the underlying social network and recommendation process.

Carlos Guestrin, Andreas Krause and Ajit P. Singh. Near-optimal sensor placements in Gaussian processes. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML). 2005. Winner of the Best Paper Runner-up Award.
[Abstract] [Paper] [Publisher]

Abstract: When monitoring spatial phenomena, which are often modeled as Gaussian Processes (GPs), choosing sensor locations is a fundamental task. A common strategy is to place sensors at the points of highest entropy (variance) in the GP model. We propose a mutual information criterion, and show that it produces better placements. Furthermore, we prove that finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1 − 1/e) of the optimum by exploiting the submodularity of our criterion. This algorithm is extended to handle local structure in the GP, yielding significant speedups. We demonstrate the advantages of our approach on two real-world data sets.

Renée Elio, Afsaneh Haddadi and Ajit Singh. Abstract Task Specifications for Conversation Policies. Proceedings of the Fourth International Conference on Autonomous Agents (AGENTS), 2000.
[Abstract] [Paper] [Publisher]

Abstract: Under some views, a crucial function for conversation policies is to "constrain the messages that appear on the wire," for there can be a many-to-many mapping between an agent’s intention and the message primitive used to express that intention. In this paper, we argue that the way to constrain messages is to constrain intentions. We propose a pragmatic approach to doing this through an abstract task specification or model. Abstract task specifications are based on a simple state-space representation for problem formulation, and this representation can be used to delimit the agents’ task intentions and discourse intentions. This analysis supports a flexible and pragmatic way of handling "unexpected" messages by reference to the abstract task specification. We see an abstract task specification as one component of a publicly posted conversation policy. A simple search-assistant agent was implemented within a BDI architecture to illustrate application of these ideas.

Renée Elio, Afsaneh Haddadi and Ajit Singh. Task Models, Intentions, and Agent Conversation Policies. Sixth Pacific Rim International Conference on Artificial Intelligence (PRICAI), 2000.
[Abstract] [Paper]

Abstract: It is possible to define conversation policies, such as communication or dialogue protocols, that are based strictly on what messages and, respectively, what performatives may follow each other. While such an approach has many practical applications, such protocols support only "local coherence" in a conversation. Lengthy message exchanges require some infrastructure to lend them "global coherence." Recognition of agent intentions about the joint task is essential for this global coherence, but there are further mechanisms needed to ensure that both local and global coherence are jointly maintained. This paper presents a general yet practical approach to designing, managing, and engineering agents that can do simple run-time intention recognition without creating complex multi-state protocols. In this approach we promote developing abstract task models and designing conversation policies in terms of such models. An implemented agent assistant based on these ideas is briefly described.


Ajit P. Singh. Efficient Matrix Models for Relational Learning. Ph.D. Thesis CMU-ML-09-111. Machine Learning Department, Carnegie Mellon University. October 2009.
[Abstract] [Tech Report]

Abstract: Information integration deals with the setting where one has multiple sources of data, each describing different properties of the same set of entities. We are concerned primarily with settings where the properties are pairwise relations between entities, and attributes of entities. We want to predict the value of relations and attributes, but relations between entities violate the basic statistical assumption that data points (here, entities) are exchangeable. Furthermore, we desire models that scale gracefully as the number of entities and relations increases.

Matrices are the simplest form of relational data, and we begin by distilling the literature on low-rank matrix factorization into a small number of modeling choices. We then frame information integration as simultaneously factoring sets of related matrices: i.e., Collective Matrix Factorization. Each entity is described by a small number of parameters, and if an entity is described by more than one matrix, those parameters participate in multiple matrix factorizations. Maximum likelihood estimation of the resulting model involves a large non-convex optimization, which we reduce to cyclically solving convex optimizations over small subsets of the parameters. Each convex subproblem can be solved by Newton-Raphson, which we extend to the setting of stochastic Newton-Raphson.

To address the limitations of maximum likelihood estimation in matrix factorization models, we extend our approach to the hierarchical Bayesian setting. Here, Bayesian estimation involves computing a high-dimensional integral with no analytic form. If we resorted to standard Metropolis-Hastings techniques, slow mixing would limit the scalability of our approach to large sets of entities. We show how to accelerate Metropolis-Hastings by using our efficient solution for maximum likelihood estimation to guide the sampling process.

This thesis rests on two claims: (i) that Collective Matrix Factorization can effectively integrate different sources of data to improve prediction; and (ii) that training scales well as the number of entities and observations increases. Two real-world data sets are considered in experimental support of these claims: augmented collaborative filtering and augmented brain imaging. In augmented collaborative filtering, we show that genre information about movies can be used to increase the predictive accuracy of users' ratings. In augmented brain imaging, we show that word co-occurrence information can be used to increase the predictive accuracy of a model of changes in brain activity to word stimuli, even in regions of the brain that were never included in the training data.

Ajit P. Singh. What to do when you don't have much data: Issues in small sample parameter learning in Bayesian Networks. M.Sc. Thesis. University of Alberta, Department of Computing Science. July 2003.
[Abstract] [Paper]

Abstract: A discrete Bayesian network is a factorization of a joint distribution over random variables. The most common use of these networks is for the computation of conditional probabilities (query responses). The parameters of these networks are often learned from data. Thus network parameters are themselves uncertain, which induces a distribution for any query response. When data sets are small, the effects of parameter uncertainty can be severe. In this thesis we argue that when data sets are small, the distribution of a query response is accurately modelled by a Beta distribution. Procedures for the modeling of the query response are also reviewed. Furthermore, we examine proposed techniques for parameter learning when data sets are small and only one query is of interest.

Other Contributions

Ajit P. Singh, John Halloran, Jeff A. Bilmes, Katrin Kirchhoff, William S. Noble. To appear, RECOMB Satellite Conference on Computational Proteomics, April 2012.

Abstract: Foundational to shotgun proteomics is the spectrum identification problem: i.e., map each observed MS2 fragmentation spectrum to its corresponding peptide. In the context of database search, we propose a new class of scoring functions for peptide-spectrum matches, based on modeling peptide fragmentation using a dynamic Bayesian network (DBN). Our approach has no free parameters, involves no training, and outperforms SEQUEST on fully-tryptic searches. Our approach is also competitive with Percolator, a trained discriminative model. We further contend that the use of DBNs leads to scoring functions that are easy to extend with additional knowledge about peptide fragmentation.

Ajit P. Singh, Asela Gunawardana, Chris Meek, and Arun C. Surendran. Recommendations using Absorbing Random Walks. North East Student Colloquium on Artificial Intelligence (NESCAI). Cornell University, April 2007.
[Abstract] [Paper]

Abstract: Collaborative filtering attempts to find items of interest for a user by utilizing the preferences of other users. In this paper we describe an approach to filtering that explicitly uses social relationships, such as friendship, to find items of interest to a user. Modeling user-item relations as a bipartite graph, we augment it with user-user (social) links and propose an absorbing random walk that induces a set of stationary distributions, one per user, over all items. These distributions can be interpreted as personalized rankings for each user. We exploit sparsity of both user-item and user-user relationships to improve the efficiency of our algorithm.
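The flavor of the approach can be sketched with a simplified personalized walk on the combined graph. Note this uses a random walk with restart rather than the paper's absorbing formulation, and the toy graph below is invented for the example:

```python
import numpy as np

# Toy graph: 3 users, 4 items. R holds user-item preferences;
# S holds user-user (social) links. All sizes are illustrative.
R = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # users x items
S = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)      # users x users

n_u, n_i = R.shape
# Combined adjacency over [users | items]; social links connect users directly.
A = np.block([[S, R], [R.T, np.zeros((n_i, n_i))]])
P = A / A.sum(axis=1, keepdims=True)        # row-stochastic transition matrix

def rank_items(user, alpha=0.15, iters=200):
    """Random walk with restart from one user; returns scores over items."""
    restart = np.zeros(n_u + n_i)
    restart[user] = 1.0
    p = restart.copy()
    for _ in range(iters):
        p = alpha * restart + (1 - alpha) * (P.T @ p)
    return p[n_u:]                          # keep only the item block

scores = rank_items(0)
print(np.argsort(-scores))                  # items ranked for user 0
```

User 0's friend (user 1) likes item 2, so item 2 is ranked above the unreachable item 3 even though user 0 never touched it: the social link routes preference mass across users.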

Ajit P. Singh and Andrew W. Moore. Finding Optimal Bayesian Networks by Dynamic Programming. Technical Report CMU-CALD-05-106, Carnegie Mellon University. 2005.
[Abstract] [Paper] [Publisher]

Abstract: Finding the Bayesian network that maximizes a score function is known as structure learning or structure discovery. Most approaches use local search in the space of acyclic digraphs, which is prone to local maxima. Exhaustive enumeration requires super-exponential time. In this paper we describe a “merely” exponential space/time algorithm for finding a Bayesian network that corresponds to a global maximum of a decomposable scoring function, such as BDeu or BIC.
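The dynamic program exploits decomposability: the best network over a subset S is found by trying each variable as the sink of S and recursing on the rest. A minimal sketch (the toy score function is invented for the example; a real implementation would use BDeu or BIC local scores and prune parent sets):

```python
from itertools import combinations

def best_parents(v, candidates, local_score):
    """Best-scoring parent set for v among all subsets of candidates."""
    options = [ps for r in range(len(candidates) + 1)
               for ps in combinations(sorted(candidates), r)]
    return max(options, key=lambda ps: local_score(v, ps))

def optimal_network(variables, local_score):
    """DP over subsets: opt(S) = max over sinks v in S of
    opt(S - {v}) + best local score of v with parents in S - {v}."""
    opt = {frozenset(): (0.0, {})}            # subset -> (score, parent map)
    for r in range(1, len(variables) + 1):
        for S in map(frozenset, combinations(variables, r)):
            best = None
            for v in S:                       # try v as the sink of S
                rest = S - {v}
                ps = best_parents(v, rest, local_score)
                score = opt[rest][0] + local_score(v, ps)
                if best is None or score > best[0]:
                    best = (score, {**opt[rest][1], v: ps})
            opt[S] = best
    return opt[frozenset(variables)]

# Toy decomposable score that rewards the chain A -> B -> C,
# with a small penalty per parent to mimic a complexity penalty.
prefs = {('B', ('A',)): 2.0, ('C', ('B',)): 3.0}
score = lambda v, ps: prefs.get((v, tuple(ps)), 0.0) - 0.1 * len(ps)
total, parents = optimal_network(['A', 'B', 'C'], score)
print(total, parents)
```

The DP visits each of the 2^n subsets once, which is why the algorithm is “merely” exponential rather than enumerating all super-exponentially many DAGs.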

Ajit Singh, Russell Greiner, Victor Dorian and Brent Lefebvre. Automated diagnosis of IEMs using high-throughput NMR spectroscopy. Poster, Intelligent Systems for Molecular Biology (ISMB). 2002.
[Abstract] [PDF]

Abstract: We examine the utility of high throughput Nuclear Magnetic Resonance (NMR) data for the diagnosis of inborn errors of metabolism. We propose a two-layer Bayesian network with the novel addition of Bayesian error bars, which provide confidence intervals for our diagnoses.

Ajit Singh. Implementation of a Real-Time Kalman Filtering Navigation System. Technical Report FR-132, National Research Council Canada. 2000.