Machine LearningEmphasizing theory and
algorithms for learning from high-dimensional data, and reasoning under
uncertainty. |
Computational BiologyEmphasizing
developing formal models and algorithms that address problems of
practical biological and medical concerns. |
|
|
|
|
Machine LearningEmphasizing theory and algorithms for learning complex probabilistic models, learning with prior knowledge, and reasoning under uncertainty. |
In this project we develop
methodologies for estimating
and
analyzing varying-coefficient models with structural changes occurring
at
unknown times or locations. Instances of such models are frequently
encountered
in social and biological problems where data are structured and
longitudinal,
and the iid assumptions on sample with respect to an invariant
underlying model
no longer hold. For example, at each time point, the observation (such
as a
single snapshot of social state of all actors) is distributed according
to a
model (such as a network) specific to that time, and therefore can not
be directly
used for estimating models underlying other time points. The main
issues we
concern in this project include estimating the changing model
structures and
parameters, number of structural changes, the change times, and the
unknown
coefficient functions. We will develop efficient and scalable
algorithms for
addressing these problems under "small n large p" scenarios based on
techniques such as sparse regression under various regularization and
structured-constraint schemes, convex optimization, and Bayesian
inference. We
will also focus on asymptotic analysis of the procedure and give
conditions
under which it is able to correctly estimate the structural changes and
the
model coefficients.
[Top]
In many
high-dimensional structured I/O
problems, such as
genome-phenome association analysis and image segmentation, where both
input
and output can contain tens of thousands, sometimes even millions of
inter-related
features, learning a sparse and consistent structured predictive
function can
be of paramount importance for both robustness and interpretability of
the
model. In this project, we attack this problem from two directions. One
direction we will pursue to learn sparse structured I/O models is to
extend the
l1-regularized regression model (i.e., the lasso method) to a family of
sparse
"structured" regression models in the contexts of uncovering true
associations between linked genetic variations (inputs) and networked
phenotypes (outputs), which can be cast as efficiently solvable convex
optimization problems and yield parsimonious and possibly consistent
maximum
likelihood estimates of the model. Another direction we are exploring
is based
on a new statistical formalism known as the maximum entropy
discrimination
Markov networks, which address the problem of estimating sparse
structured I/O
models under a maximum margin framework, but using a entropic
regularizer that
leads to a distribution of structured prediction functions that are
simultaneously primal and dual sparse (i.e., with few support vectors,
and of
low effective feature dimension), and can be efficiently solved via a
novel
algorithm that builds on variational inference and existing solvers for
the
maximum margin Markov network (which is a special case of our proposed
model).
We will also investigate the theoretical guarantee of these methods
such as
generalization bounds, sample complexity, convergence behavior.
[Top]
A plausible representation of the relational
information among
entities in dynamic systems such as a living cell, or a social
community, or
the internet, is a stochastic network which is topologically rewiring
and
semantically evolving over time (or space). In this project we develop
probabilistic generative models for the formation, growth, evolution,
and
dynamics of networks and relational data in general, and
inference/learning
algorithms for node labeling, link prediction, latent theme extraction,
etc.,
for network and relational data. We also work on theoretical issues,
such as
bounds, complexity, related to our models and algorithms, and
applications to
the analysis of socio-cultural networks, author-citation networks, the
blogosphere, and biological networks.
[Top]
In this project we develop nonparametric and
semiparametric Bayesian models (based on the Dirichlet process and
extensions,
sometimes known as the generalized Polya urn schemes) for analyzing
time series data,
hierarchical data, and other complex inputs with uncertain internal
structure,
which arise from temporal text mining (e.g., emails, news streams),
object
tracking (e.g., video surveillance, navigation and control) and
biological data
analysis. We develop formal probabilistic formalisms, sampling and
variational
inference algorithms, and also address theoretical issues such as
consistence,
bounds and convergence of our models and algorithms.
[Top]
This is a "google-style" project in which we develop
formal methods for visualizing, categorizing, and tracking content-rich
multimedia and network data-streams in an arbitrary-dimension
"tomographic"
space of latent semantic topics, where each entity can be represented
by a human-perceivable and interpretable "semantic" coordinates or
trajectory, so that one can succinctly browse the complex multimodal
and time-evolving data in a direct, global and on-line fashion. Such
display of complex data complements the current Google representation
of
web-information, which is one-dimensional (in terms of a rank list),
static (no change
in the topical content of the media can be traced), and unimodal (only
one
type of media can be displayed as a time, such as text only, or image
only).
One can use our system to directly visualize in the topic space a much
larger
amount of media or web information, rather than via pages of
ordered-lists of
subjects as in current Google interface. One can detect bias,
ideological
perspective, or other subtle information, which hide beneath the
topical contents in
news article or other media. One can directly measure (by eye) the
distance
between query and related entities rather than following a rank list,
which
offer more flexibility and higher accuracy. One can also track the
trajectories of
evolving entities and detection events such as birth/death of
themes/topics
using our system, which is not possible under Google.
[Top]
We design various task-specific generative, discriminative,
and hybrid graphical models and algorithms for various biological and
genetic problems (see bellow), for NLP problems such as statistical
machine
translation, for comprehending and categorizing text corpus, for
segmenting, tracking and interpreting video and caption streams from
various
sources (e.g., surveillance system, robots), and for decision making
and active
learning in dynamic environments.
[Top]
![]()
|
|
Computational BiologyWith an emphasis on developing formal models and algorithms that address problems of practical biological and medical concerns. |
Many complex disease
syndromes consist of a large number of
related, rather than independent, clinical phenotypes. Differences
between these syndromes involve the complex interplay of a large number
of
genomic variations that perturb the function of disease-related genes
in the
context of a regulatory network, rather than individually. The current
state-of-the-art in genome-wide-association studies (GWAS) is a
single-gene association
approach. A major challenge for the immediate future is therefore to
transcend the
one-gene/one-disease approach with a suite of approaches that account
for
pathway and network structure. In this project, we aim at developing
algorithms
and software to enable next-generation association analysis addressing
these
problems, in the context of asthma disease. We are developing
methodologies for
the largely unexplored but practically important problem of structured
associations between elements in the genome, transcriptome, and the
phenome.
This research will open a new paradigm for association studies of
complex
diseases, which facilitates: 1) Intra- and inter-omic integration of
data for association mapping and disease gene/pathway discovery, 2)
Thorough
explorations of the internal structures within different omic data, so
that cryptic associations that are not possibly detectable in
unstructured
analysis due to their weak statistical power can be now inferred. 3)
Joint
statistical inference of mechanisms and pathways of how variations in
DNA lead to
variations in complex traits flows through molecular networks, and
inference of condition-specific state of gene function in the molecular
networks,
and 4) Development of faster and automated computational algorithm with
greater scalability and robustness to large-scale inter-omic analysis,
and more
convenient software package and user interface.
[Top]
Due to the dynamic
nature of biological systems, biological
networks underlying temporal process such as development,
immunoresponse, and disease progression can exhibit significant
topological changes to
facilitate dynamic regulatory functions. The latent functionality or
membership
undertaken by the biomolecules as determined by these dynamic
interactions will
also exhibit rich temporal behaviors, assuming a distinct function at
one
point while leaning more towards a second function at an another point.
In
this project, we focus on two dynamic processes, the life cycle of
Drosophila melanogaster, and the progress of asthma in human, for which
we develop
methodologies to, 1) reverse-engineer latent time-evolving gene
networks based
on either microimages of spatial pattern of gene expressions in
Drosophila
embryos, or genome-wide microarray profile of gene expression
intensities; 2)
recover the transcriptional activation/repression functions from
temporal/spatial patterns of gene expressions; 3) estimate embeddings
of every genes into a latent function space and track its mixed
membership of
functions in the latent space across time. The goal is to understand
the driving
forces underlying dynamic rewiring of gene regulation circuity, and to
predict
future network structures.
[Top]
In this project we
analyze the molecular abundance profiles
(e.g., microarray, CGH, ChIp-Chip) measured in a "designer
microenvironment", realized in 3D culture model that imitates the in
vivo cellular context and dynamics of cancer progression, reversion and
apoptosis. We will develop algorithms to identify molecular
determinants and
markers of cancer states and categorize cancers on the basis of
signaling pathway
characteristics. Using probabilistic graphical modeling approaches, we
hope to infer stochastic network models for transcriptional regulation
in
response to combinations of signaling inhibitions in cancer cells.
Selected Reading:
[Top]
In this project we
study the evolutionary relationships
reflected in the sequence, ordering, position, spacing and function of
the regulatory motifs controlling body segmentation during early
embryogenesis in multiple species of the Drosophila. We are interested
in understanding
the biological driving forces, molecular mechanisms and functional
implications of motif evolution in general from this biological model,
and in
developing comparative genomic algorithms for motif finding from
unaligned
non-coding sequences.
[Top]
In this project we
develop novel statistical
models and computational algorithms based on formalisms such as the
coalescent
process, Bayesian nonparametrics, admixture models, and various
temporal/spatial
stochastic processes for uncovering the chromosomal association (i.e.,
haplotypes), population distribution (i.e., diversity and frequency)
inheritance
process (i.e., recombination/substitution/selection), and migration
history of
genetics polymorphisms such as SNPs, to address problems such as
ancestry
reconstruction, disease-related gene flow, chromosomal evolution and
genetic demography in human or organismal populations.
[Top]
In this project we
develop models and
algorithms for
understanding and uncovering the structure of genomic sequences of
higher
organisms. We develop Bayesian models for DNA/protein motif detection
and gene
finding based on both sequence-level signatures and meta-sequence-level
structural information reflecting protein-DNA binding, transcript
stability,
and prior knowledge of the organization rules of regulatory modules. We
intend
to integrate motif finding with the system biology research of gene
regulatory
network.
[Top]
|
Last updated 02/1/2009 |