Time series expression data presents an opportunity to watch (and analyze) gene regulatory programs as they unfold.  Here we address three problems in the realm of modeling dynamic gene regulation.  We develop a novel set of modeling algorithms, using an Input-Output Hidden Markov Model (IOHMM) framework to build models of regulatory activity.

The first problem we address is combinatorial regulation.  Genes are often combinatorially regulated by multiple transcription factors (TFs). Such combinatorial regulation plays an important role in development and facilitates the ability of cells to respond to different stresses. We present a new method called cDREM, capable of reconstructing dynamic models of combinatorial regulation. cDREM integrates time series gene expression data with (static) protein interaction data. The method is based on a hidden Markov model and utilizes the sparse group Lasso to identify small subsets of combinatorially active TFs, their time of activation and the logical function they implement.
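
As a rough illustration of why a sparse group Lasso penalty favors small subsets of active TFs, the sketch below computes the generic penalty for a weight vector. It is not cDREM's actual objective: the grouping of TFs, the function name, and the `lam` and `alpha` parameters are assumptions made for this example.

```python
import math

def sparse_group_lasso_penalty(weights, groups, lam=1.0, alpha=0.5):
    """Generic sparse group Lasso penalty (illustrative, not cDREM's code).

    weights: list of floats, one per TF (hypothetical layout).
    groups:  list of index lists that partition the weights (assumed grouping).
    lam:     overall regularization strength.
    alpha:   mix between the L1 term (alpha) and the group term (1 - alpha).
    """
    # L1 term: pushes individual weights to exactly zero.
    l1 = sum(abs(w) for w in weights)
    # Group term: size-scaled L2 norm per group, pushes whole groups to zero.
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(weights[i] ** 2 for i in g))
        for g in groups
    )
    return lam * (alpha * l1 + (1.0 - alpha) * group_term)

# Two hypothetical TF groups; the second group is switched off entirely,
# so it contributes nothing to either term.
weights = [3.0, 0.0, 4.0, 0.0, 0.0]
groups = [[0, 1, 2], [3, 4]]
penalty = sparse_group_lasso_penalty(weights, groups, lam=1.0, alpha=0.5)
```

A group whose weights are all zero adds nothing to the penalty, while the L1 term still rewards zeroing individual weights inside an active group. That combination is what makes the penalty suited to selecting a few TFs within a few groups.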

The second problem is the modeling of multiple dynamic regulatory networks from multiple time series expression experiments.  It is now possible to measure a patient's gene expression during the course of a treatment.  We wish to identify groups of patients with similar regulatory activity, with the expectation that this will relate to disease progression and treatment outcome.  We present here a method called SMARTS that can be used to cluster patients based on the similarity of the regulatory program they are expressing, and then identify TFs which may be differentially active between the groups.

In SMARTS each dynamic regulatory model we build is created from a set of individual time series.  Our third aim is to extend this technique to use sets of single cell gene expression experiments as the input to a regulatory model.  We present a novel technique, SCAREDY-CAT, which is able to create such models.  We use this technique to analyze the differentiation of lung epithelial cells, and show that we can reconstruct the structure of lung epithelium differentiation in an unsupervised manner.

We tie these methods together with the release of a software package that allows interested (non-technical) users to use our methods.  By developing methods for understanding the regulatory dynamics present in time series data, we enable the discovery of regulatory relationships that help us understand biological systems and mechanisms underlying disease.

Thesis Committee:
Ziv Bar-Joseph (Advisor)
Russell Schwartz
Zoltan Oltvai (University of Pittsburgh)
Naftali Kaminski (Yale School of Medicine)

A central theme in my lab is to address the following fundamental question in development and cell biology: how can a single fertilized egg in a multicellular organism develop into different cell types, and furthermore, how can these cells maintain phenotypic stability and integrity against various fluctuations, and at the same time be phenotypically plastic in order to respond to environmental change? Specifically in this talk I will focus on our theoretical studies on olfactory sensory neuron differentiation based on our previous work on epigenetic histone modification dynamics [1].

Olfaction, or the sense of smell, can be essential for the proliferation and survival of an organism. Thus most species have evolved highly sensitive olfactory systems. For example, a human nose can detect over a trillion different odors. A large number of olfactory receptor (OR) neurons, 40 million in humans, located on the nose epithelium, sense odor molecules through the transmembrane ORs, then transmit electric signals to the brain. OR genes are the largest gene superfamily in vertebrates, (including ~20% pseudogenes, i.e., dysfunctional genes that have lost protein-coding ability) found in mouse and ~900 (including ~63% pseudogenes) in humans. In their Nobel Prize-winning studies, Axel, Buck and coworkers showed that each OR neuron stochastically expresses one and only one type of OR. In fact, each cell expresses only one of the two alleles of an OR gene, that is, one of the two copies of the gene inherited from the two parents. The studies of Axel and Buck raise another intriguing question that has puzzled the field ever since: how can a cell activate the expression of one and only one allele of a single OR gene out of a large number of different types of ORs, and maintain its stable expression through the life of the cell, which is about 90 days in mice [2]? In the talk I will show how olfactory receptor neurons may use simple physics to achieve single allele activation.

[1]. Zhang, H., et al., Statistical Mechanics Model for the Dynamics of Collective Epigenetic Histone Modification. Phys. Rev. Lett., 2014. 112(6): p. 068101.
[2]. Rodriguez, I., Singular expression of olfactory receptor genes. Cell, 2013. 155(2): p. 274-7.

Understanding cellular organization is a major goal of systems biology. Cellular organization affects the behavior of cells and many diseases and disorders impact the spatial organization of cells and their morphologies in turn. There are many current means of studying these systems and their effects. High-content imaging is one high-resolution way in which to study the location of proteins within cells. Advances in imaging technologies have allowed for high quality data to be acquired from live cells in three dimensions over time.

Historically, imaging data have been analyzed using image-feature based approaches to create models predicting cell state using classification or regression based machine learning. Generative modeling tools such as CellOrganizer offer an alternative approach to modeling cells and their subcellular structures. The added benefit of this class of approaches is that they describe the statistical distributions of cells and can be sampled from to create realistic in silico instances of cells and their subcellular organization. Despite our ability to model static subcellular organization, modeling the dynamic restructuring of cells and their components remains a major challenge in systems biology. These subcellular dynamics are strongly correlated with cell cycle and disease progression and understanding them will aid in the development of treatments. Towards this goal we trained generative models describing cellular morphology dynamics by using both time series and static-time cell image datasets. At a more granular level, cell function is dependent on the proteins within it and their interactions. Not only is the organization of cells correlated with cell response, but it may also be a driving force.

To study the impact of cell shape and organization on these biochemical interactions we developed a computational pipeline to perform high-throughput spatially resolved simulations using realistic cellular geometries generated with CellOrganizer. In addition to exhibiting complex responses over time, some cells such as neurons are highly morphologically complex. As such, traditional generative modeling methods are ineffective or fail completely. We addressed this issue by expanding the capabilities of CellOrganizer to include models for neuronal shape. Together these works allow for the high-throughput study of cellular and subcellular structure for realistic and complex cellular morphologies and their dynamic responses over time.

Thesis Committee:
Robert F. Murphy (Advisor)
James Faeder
Gustavo Rohde
Ivo Sbalzarini (Max Planck Institute, Dresden)

Protein subcellular location and compartmentalization play an important role in regulating cellular processes.  Protein mislocalization alters cell signaling and is observed in diverse diseases (Hung and Link 2011).  Drug resistance can occur when proteins are mislocalized to the cytoplasm and nucleus, suggesting that the measurement of protein location can help clinicians personalize therapies and diagnose disease.  Here, two projects explore how automatically quantitating subcellular location from pathology images can be used in diagnostics and for understanding disease.

1) We developed an automated pipeline to compare the subcellular location of proteins between two sets of immunohistochemistry images.  We used the pipeline to compare images of healthy and tumor tissue from the Human Protein Atlas, ranking hundreds of proteins in breast, liver, prostate and bladder based on how much their location was estimated to have changed.  The performance of the system was evaluated by determining whether proteins previously known to change location in tumors were ranked highly.  We present a number of new candidate location biomarkers for each tissue.  Further, we identified biochemical pathways that are enriched in proteins that change location.  We confirmed some previously implicated pathways, and we report location changes in new pathways not previously associated with cancer.

2) We extended the IHC pipeline to process full slide images.  Using the pipeline we explored how measuring changes in protein subcellular location can aid in identifying adult and pediatric liver lesions.  Our results indicate that most of the time single protein measurements are poor markers for the lesions.  Next we explored lesion-specific protein signatures for identifying diseases.  Given our dataset we found a signature set of proteins that can successfully identify liver lesions in adult and pediatric populations with perfect accuracy.  Finally we report two new proteins that aid in classifying the lesions when used as part of a signature protein set.

Thesis Committee:
Robert F. Murphy (Advisor)
Chakra Chennubhotla
Gustavo Rohde
John Ozolek (University of Pittsburgh Medical Center)

Next-generation sequencing technology allows us to peer inside the cell in exquisite detail, revealing new insights into biology, evolution, and disease that would have been impossible to discover just a few years ago. The enormous volumes of data produced by NGS experiments present many computational challenges that we are working to address. In recent years, my lab has developed multiple systems for sequence analysis, including the widely-used Bowtie, TopHat and Cufflinks programs for alignment and assembly of transcripts from RNA-seq data. In this talk, I will discuss two new systems: (1) the HISAT system for spliced alignment of NGS reads, a successor to TopHat; and (2) the StringTie program for assembly and quantitation of RNA-seq data, a successor to Cufflinks. This talk describes joint work with Daehwan Kim and Mihaela Pertea.

Our lives are full of habits, good ones (example: exercise) and bad ones (example: eating unhealthy food). The imbalance in these habits is particularly evident in the world-wide prevalence of obesity. It is well established that many diseases such as cancer, diabetes, heart disease, and depression are strongly influenced by these habits. Shifting the balance between bad and good habits can therefore prevent disease and enhance well-being. Here, we propose to monitor urine insulin levels to provide people who intend to lose weight with molecular feedback on their metabolic state. The idea is to borrow the body’s own molecules used in internal communication to assist individuals externally in the conscious struggle to promote healthy life-style changes.

To this end, we have developed a mobile health platform. We currently provide the capabilities for a user to log five types of events (food, activity, weight, urine, ketostix). We have used the platform to conduct several experiments in collecting urine samples while varying food type (low carb, normal and ketogenic diets), timing of food intake, and variation across and within individuals. Urine insulin values were measured using immunosandwich electrochemiluminescence detection. Comparison of the insulin data with the food intake and exercise information indicated that, unlike blood glucose, urine insulin levels are highly sensitive to changes in diet and activity. We observed urine insulin profiles characteristic of each diet. Therefore, such measurements could be useful to health-care professionals in monitoring adherence to recommended life-style changes and to individuals in obtaining feedback on their metabolic responses to food intake.

Phenotype inference from genotype in RNA viruses maps viral genome or protein sequences to molecular functions in order to understand the molecular mechanisms responsible for functional changes. The inference is currently done through a laborious experimental process that is arguably inefficient, incomplete, and unreliable. The wealth of RNA virus sequence data associated with different phenotypes has encouraged computational approaches to aid the inference. Key residue identification and genotype-phenotype mapping function learning are two approaches that distinguish the critical positions from hitchhikers and elucidate the relations among them.

The existing computational approaches in this area focus on prediction accuracy, yet a number of fundamental problems have not been considered: the scalability of the data, the capability to suggest informative biological experiments, and the interpretability of the inferences. A common scenario of inference done by biologists with mutagenesis experiments usually involves a small number of available sequences, which is very likely to be inadequate for the inference in most setups. Accordingly, biologists desire models that are capable of inferring from such limited data, and algorithms that are capable of suggesting new experiments when more data is needed. Another important but consistently neglected property of the models is the interpretability of the mapping, since most existing models behave as 'black boxes'.

To address these issues, in the thesis I design a supervised combinatorial filtering algorithm that systematically and efficiently infers the correct set of key residue positions from available labeled data. For cases where more data is needed to fully converge to an answer, I introduce an active learning algorithm to help choose the most informative experiment from a set of unlabeled candidate strains or mutagenesis experiments to minimize the expected total laboratory time or financial cost. I also propose Disjunctive Normal Form (DNF) as an appropriate assumption over the hypothesis space to learn interpretable genotype-phenotype functions.
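
To make the two ideas above concrete, here is a minimal sketch under stated assumptions. It is not the thesis algorithm: it uses brute-force enumeration where the thesis describes an efficient filtering procedure, and the function names, the `max_size` cap, and the toy sequences are all made up for illustration. A position set is kept only if the residues at those positions never map to two different phenotype labels, and a DNF hypothesis is represented as a list of clauses.

```python
from itertools import combinations

def consistent(positions, sequences, labels):
    """True if residues at `positions` never map to two different labels."""
    seen = {}
    for seq, label in zip(sequences, labels):
        key = tuple(seq[p] for p in positions)
        # setdefault stores the first label seen for this residue pattern.
        if seen.setdefault(key, label) != label:
            return False
    return True

def smallest_consistent_sets(sequences, labels, max_size=3):
    """Try position subsets by increasing size; return the smallest consistent ones."""
    n = len(sequences[0])
    for size in range(1, max_size + 1):
        hits = [c for c in combinations(range(n), size)
                if consistent(c, sequences, labels)]
        if hits:
            return hits
    return []

def eval_dnf(dnf, seq):
    """dnf is a list of clauses; each clause is a dict {position: residue}.
    The phenotype is predicted positive if any clause is fully satisfied."""
    return any(all(seq[p] == r for p, r in clause.items()) for clause in dnf)

# Toy, made-up data: 4-residue sequences with a binary phenotype label.
sequences = ["AKLT", "AKLS", "GKLT", "GRLT", "ARLS"]
labels = [True, True, False, False, True]

key_sets = smallest_consistent_sets(sequences, labels)  # candidate key residues
dnf = [{0: "A"}]  # one-clause hypothesis: residue A at position 0
predictions = [eval_dnf(dnf, s) for s in sequences]
```

On this toy data only position 0 separates the labels, and the one-clause DNF reproduces them. Because the hypothesis is an explicit OR of ANDs over residues, it can be read directly as a biological statement, which is the interpretability argument made above.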

The main challenge of these approaches is computational efficiency, owing to the combinatorial nature of the algorithms. The solution is to exploit biologically plausible assumptions to constrain the solution space and efficiently find the optimal solutions under those assumptions.

The algorithms were validated in two ways: 1) prediction quality under cross-validation, and 2) consistency with the domain experts' conclusions. The algorithms also suggested new discoveries that have not been discussed yet. I applied these approaches to a variety of RNA virus datasets covering the majority of interesting RNA phenotypes, including drug resistance, antigenicity shift, antibody neutralization and so on, to demonstrate their predictive power and to suggest new discoveries in influenza drug resistance and antigenicity. I also demonstrate the extension of the approaches to the area of severe acute community disease.

Thesis Committee:
Roni Rosenfeld (Advisor)
Jaime Carbonell
Gilles Clermont (University of Pittsburgh)
Elodie Ghedin (New York University)

The ultimate goal of systems medicine is to enable the use of molecular profiles, such as DNA, RNA, epigenetics, and drug sensitivity profiles, for prognosis and predicting response to therapies. There is substantial need for better ways to choose and predict the outcome of therapy in individual cancer patients. For example, patients over 65 with acute myeloid leukemia have no better prognosis today than they did in 1980. For a growing number of diseases, there is a fair amount of data on molecular profiles from patients. The most important step necessary to realize this goal is to identify molecular features (e.g., expression levels of certain genes) in these data that predict clinical phenotypes such as response to a certain therapy. However, due to the high dimensionality of the data, it is an open challenge to identify robust molecular features that are consistently predictive of clinical phenotypes across many studies.

In this talk, I will present an integrative approach to reduce the dimensionality of expression data by selecting genes that represent important molecular events based on publicly available expression data. In particular, I will present computational methods for identifying the genes with certain features in the inferred expression network and gene expression signature conserved across vastly different cancer types. I will show how our approach led to novel markers for various important clinical phenotypes, such as survival time, chemosensitivity, histomorphological features and surgical resectability in cancer.

Rare variants identified from DNA sequence, especially de novo loss of function (LoF) mutations, have identified genes involved in risk for autism spectrum disorders (ASD). Multiple de novo LoF mutations in the same gene demonstrate that gene affects risk. De novo mutations occur twofold more often in ASD probands than their siblings, implying that half of the genes hit are risk genes. He et al. (2013) extract more information by using a statistical model, called TADA for Transmission And De novo Association, that integrates data from family and case-control studies to infer the likelihood a gene affects risk. Still, given limited sequence data, can we garner yet more information? Progress has been made as part of a collaborative effort to develop systems biological approaches to understanding ASD pathophysiology. Using ASD risk genes as foci, we hypothesize that genes expressed at the same developmental period and brain region, and with highly correlated co-expression, are functionally interrelated and more likely to affect risk. To find these genes we model two kinds of data: gene co-expression in specific brain regions and periods of development; and the TADA results from published sequencing studies. We model the ensemble data as a Hidden Markov Random Field, in which the graph structure is determined by gene co-expression and the model combines these interrelationships with node-specific observations: gene identity; expression; genetic data; and whether it affects risk, which will be estimated. This analysis identifies ~100 genes that plausibly affect risk, many novel and others implicated despite relatively weak genetic evidence. We will describe how these results can be used to expand our understanding of the genetics of ASD (e.g., nominating genes for targeted sequencing in new samples) and ASD neurobiology.
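
The step from "twofold more often" to "half of the genes hit" can be written out as a short calculation. This is a sketch under a simplifying assumption that is not stated in the abstract: the sibling rate r_s is taken as the background rate of de novo LoF mutations unrelated to ASD, and the proband rate is r_p = 2 r_s. The symbols r_s and r_p are introduced here for illustration only.

```latex
\[
  \frac{r_p - r_s}{r_p} \;=\; \frac{2 r_s - r_s}{2 r_s} \;=\; \frac{1}{2}
\]
```

Under that assumption, about half of the de novo LoF mutations seen in probands are in excess of background, which is the sense in which roughly half of the genes hit are expected to be risk genes.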


Kathryn Roeder is Professor of Statistics and Computational Biology. Currently her work focuses on statistical genetics and the genetic basis of complex disease. Her group has published extensively on methods for gene mapping and the genetics of autism. Roeder’s career began in the biological sciences, during which time she spent a year living in the wilderness regions of the Pacific Northwest as a research assistant for the Department of Wildlife Resources. In 1988 she received her Ph.D. in Statistics from Pennsylvania State University. Next she spent 6 years on the Statistics faculty at Yale University, where she played a pivotal role in developing the foundations of DNA forensic inference. In 1994 Roeder joined the Department of Statistics at Carnegie Mellon University. She has developed statistical methods in a wide spectrum of areas, including high dimensional inference, mixture models and nonparametric statistics. She has served as an associate editor of JASA, Biometrics and American Journal of Human Genetics. She is an elected fellow of the American Statistical Association and the Institute of Mathematical Statistics. In 1997 she received the COPSS Presidents' Award and the Snedecor Award for outstanding work in statistical applications.
