Research Areas

Machine Learning

Emphasizing theory and algorithms for learning from high-dimensional data, and reasoning under uncertainty.

Learning time & space varying-coefficient models with evolving structures, & applications in sociocultural data mining & systems biology
Learning sparse structured input/output models in very high-dimensional space, & applications in structured prediction problems in NLP, vision, & bioinformatics
Statistical models & algorithms of networks & relational data
Nonparametric Bayesian analysis, applications of Bayesian nonparametrics in data mining
Dynamic topic models for structured browsing of large corpora of text/image/video/network data
Other applications of probabilistic graphical models in Computational Biology, IR, NLP, Multimedia & Control

Computational Biology

Emphasizing the development of formal models and algorithms that address problems of practical biological and medical concern.

Genome-Transcription-Phenome-Wide Association: a new paradigm for association studies in complex diseases
Recovering and exploring time-varying gene interactions in Drosophila development & human disease progression
Computational systems biology of genome-microenvironment interactions in breast cancer
Probabilistic evolutionary models of cis-regulatory modules in Drosophila
Inference of population genetic structure, variation, migration, & evolution based on genome polymorphisms
Biological sequence analysis: motif detection, gene finding & systems biology

Machine Learning

Emphasizing theory and algorithms for learning complex probabilistic models, learning with prior knowledge, and reasoning under uncertainty.

Learning Time and Space Varying-Coefficient Models with Evolving Structures, and Applications in Socio-cultural Data Mining and Systems Biology

In this project we develop methodologies for estimating and analyzing varying-coefficient models with structural changes occurring at unknown times or locations. Such models are frequently encountered in social and biological problems where the data are structured and longitudinal, and the assumption that samples are i.i.d. draws from a single invariant underlying model no longer holds. For example, at each time point, the observation (such as a single snapshot of the social state of all actors) is distributed according to a model (such as a network) specific to that time, and therefore cannot be used directly to estimate the models underlying other time points. The main issues we address in this project include estimating the changing model structures and parameters, the number of structural changes, the change times, and the unknown coefficient functions. We will develop efficient and scalable algorithms for these problems under "small n, large p" scenarios, based on techniques such as sparse regression under various regularization and structured-constraint schemes, convex optimization, and Bayesian inference. We will also carry out an asymptotic analysis of these procedures and give conditions under which they correctly estimate the structural changes and the model coefficients.
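As a rough illustration of how sparse regression can follow a model that changes over time, the sketch below estimates the coefficient vector at a chosen time point by fitting an l1-regularized regression in which each sample is weighted by a Gaussian kernel on its distance in time. This is a minimal sketch under stated assumptions, not the project's actual estimator: the function names (`soft_threshold`, `weighted_lasso`, `time_varying_lasso`), the Gaussian kernel, the plain proximal-gradient (ISTA) solver, and the synthetic data with a single change at t = 0.5 are all choices made here for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def weighted_lasso(X, y, w, lam, n_iter=500):
    """Minimize 0.5 * sum_i w_i (y_i - x_i^T b)^2 + lam * ||b||_1 by ISTA."""
    Xw = X * w[:, None]                       # rows of X scaled by their weights
    # Step size from the Lipschitz constant of the smooth part's gradient,
    # i.e. the largest eigenvalue of X^T W X.
    L = np.linalg.eigvalsh(X.T @ Xw).max()
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = Xw.T @ (X @ b - y)             # gradient of the weighted squared loss
        b = soft_threshold(b - grad / L, lam / L)
    return b

def time_varying_lasso(X, y, times, t0, bandwidth, lam):
    """Estimate the coefficients at time t0 from kernel-weighted samples."""
    w = np.exp(-0.5 * ((times - t0) / bandwidth) ** 2)   # Gaussian kernel (assumed)
    w /= w.sum()                                          # normalize weights to sum to 1
    return weighted_lasso(X, y, w, lam)

# Synthetic example (hypothetical data): the active coefficient switches
# from feature 0 to feature 1 at time 0.5.
rng = np.random.default_rng(0)
n, p = 200, 10
times = np.linspace(0.0, 1.0, n)
X = rng.standard_normal((n, p))
beta = np.zeros((n, p))
beta[times < 0.5, 0] = 2.0
beta[times >= 0.5, 1] = -2.0
y = np.sum(X * beta, axis=1) + 0.1 * rng.standard_normal(n)

for t0 in (0.2, 0.8):
    b_hat = time_varying_lasso(X, y, times, t0, bandwidth=0.1, lam=0.1)
    print(t0, np.round(b_hat, 2))
```

The kernel bandwidth trades off bias against variance: a narrow kernel tracks changes closely but leaves few effective samples per time point, which is where the l1 penalty helps in a "small n, large p" regime. Detecting the change times themselves would need an additional mechanism that this sketch does not attempt.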

Selected Reading:

F. Guo, S. Hanneke, W. Fu and E. P. Xing, Recovering Temporally Rewiring Networks: A model-based approach, Proceedings of the 24th International Conference on Machine Learning (ICML 2007).
M. Kolar, L. Song, A. Ahmed, and E. P. Xing, Estimating Time-Varying Networks, Manuscript, arXiv:0812.5087.

Learning Sparse Structured Input/Output Models in Very High-dimensional Space, and Applications in Structured Prediction Problems in NLP, Vision, and Bioinformatics

In many high-dimensional structured I/O problems, such as genome-phenome association analysis and image segmentation, both input and output can contain tens of thousands, sometimes even millions, of inter-related features. In such settings, learning a sparse and consistent structured predictive function can be of paramount importance for both the robustness and the interpretability of the model. In this project, we attack this problem from two directions. The first direction extends the l1-regularized regression model (i.e., the lasso) to a family of sparse "structured" regression models for uncovering true associations between linked genetic variations (inputs) and networked phenotypes (outputs). These models can be cast as efficiently solvable convex optimization problems and yield parsimonious and possibly consistent maximum likelihood estimates. The second direction is based on a new statistical formalism known as maximum entropy discrimination Markov networks, which address the problem of estimating sparse structured I/O models under a maximum margin framework, but use an entropic regularizer that leads to a distribution over structured prediction functions that is simultaneously primal and dual sparse (i.e., with few support vectors and low effective feature dimension). This model can be solved efficiently via a novel algorithm that builds on variational inference and existing solvers for the maximum margin Markov network, which is a special case of our proposed model. We will also investigate theoretical guarantees for these methods, such as generalization bounds, sample complexity, and convergence behavior.
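To make the idea of a sparse "structured" regression concrete, the sketch below fits one simple member of that broad family: a multi-output regression with an l1/l2 (group lasso) penalty on the rows of the coefficient matrix, so that each input feature is either selected for all outputs jointly or dropped entirely. This is an illustrative assumption, not necessarily one of the models developed in the project. The function names (`row_soft_threshold`, `multi_output_group_lasso`), the proximal-gradient solver, and the synthetic data are all hypothetical choices made here.

```python
import numpy as np

def row_soft_threshold(B, t):
    """Proximal operator of t * sum_j ||B[j, :]||_2: shrink each row toward zero."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    # Rows whose norm is at most t are set exactly to zero.
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return B * scale

def multi_output_group_lasso(X, Y, lam, n_iter=500):
    """Minimize 0.5/n * ||Y - X B||_F^2 + lam * sum_j ||B[j, :]||_2
    by proximal gradient descent."""
    n, p = X.shape
    # Step size from the largest eigenvalue of the scaled Gram matrix.
    L = np.linalg.eigvalsh(X.T @ X / n).max()
    B = np.zeros((p, Y.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n          # gradient of the squared loss
        B = row_soft_threshold(B - grad / L, lam / L)
    return B

# Synthetic example (hypothetical data): 30 inputs, 5 correlated outputs,
# and only the first 3 inputs influence the outputs.
rng = np.random.default_rng(0)
n, p, k = 100, 30, 5
X = rng.standard_normal((n, p))
B_true = np.zeros((p, k))
B_true[:3] = 1.0
Y = X @ B_true + 0.1 * rng.standard_normal((n, k))

B_hat = multi_output_group_lasso(X, Y, lam=0.2)
selected = np.flatnonzero(np.linalg.norm(B_hat, axis=1) > 1e-8)
print("selected input features:", selected)
```

Penalizing whole rows rather than individual entries is what couples the outputs: evidence for a feature is pooled across all outputs before the keep-or-drop decision is made. Richer structure, such as a graph over the outputs or over the inputs, would require a different penalty than the one sketched here.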