As high throughput genomic data becomes central to clinical decision making, computational bottlenecks involving scalability, security and privacy call for effective and efficient solutions.

In this talk we will go through some of the recent developments in the compression of high throughput sequence data, such as our new tool for "light genomic assembly" for improved de novo compression and the MPEG benchmarking effort towards establishing genomic sequence representation standards. We will also discuss some of the new developments in secure, collaborative genomic data processing through the use of Intel SGX (Software Guard Extensions) architectures and differentially private querying of population stratified genomic (SNV) data for genome-wide association studies (GWAS). Time permitting, we will also go through some of the algorithmic developments on cancer genome sequence analysis, especially in the context of driver gene and module identification based on new measures of random walk distances in molecular interaction networks.
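As a rough illustration of what differentially private querying of SNV data can look like, the sketch below applies the standard Laplace mechanism to a carrier count. It is not the speaker's method: the function names, the genotype encoding (0, 1, or 2 copies of the minor allele), and the choice of a simple counting query are all assumptions made for this example.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(genotypes, epsilon, rng):
    """Release a noisy count of minor-allele carriers (hypothetical query).

    Adding or removing one individual changes the count by at most 1,
    so the sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for g in genotypes if g > 0)
    return true_count + laplace_noise(1.0 / epsilon, rng)

if __name__ == "__main__":
    rng = random.Random(0)
    cohort = [0, 1, 2, 0, 1]  # toy genotypes, assumed encoding
    print(private_count(cohort, epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives stronger privacy at the cost of a noisier answer; a real GWAS pipeline would also need to track the privacy budget across many queries.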

Rapid advances in high-throughput genomics experiments, such as microarray and next-generation sequencing, have increased the availability of multi-level omics data (e.g. mRNA expression, miRNA expression, methylation, etc.) in the public domain. Integration of multi-level omics data for biomarker association, outcome prediction and disease subtype discovery has brought new computational and statistical challenges. In this talk, I will present several omics meta-analysis and integrative modeling methods we have developed for disease subtype discovery and biomarker detection with applications mostly in cancer research. The results show the benefit of information integration and careful modeling to retrieve biologically relevant information from complex experimental datasets.

An increasing number of human diseases, such as neuromuscular disorders and cancer, are attributed to defects in protein-RNA recognition. My lab studies how gene expression is regulated at the level of RNA processing, primarily by protein-RNA interactions. I will present our efforts in identifying new RNA binding proteins and performing large-scale, robust and reproducible transcriptome-wide measurements of protein-RNA interactions for hundreds of RNA binding proteins. If time permits, I’ll discuss an example in neurodegeneration and also the study of alternative splicing in single-cell transcriptomic data.


Three-dimensional organization of the human genome plays important roles in regulating its function, and a detailed structural characterization will be crucial for enabling its rational design using genome editing techniques. However, experimental studies on the structure of the genome have met with limited success so far due to its large size and amorphous shape; theoretical modeling has stayed mostly in the exploratory phase as well and lacks the accuracy desired for engineering purposes. In this talk, I will explain various modeling approaches that we have developed to reveal the genome organization at different length scales, from the nanometer unwinding of the nucleosomal DNA to the micrometer folding of the entire chromosome. A novel theoretical approach to enable de novo prediction of whole chromosome structures using only 1D sequence information will also be briefly discussed.

Faculty Hosts: Ivet Bahar, Jianhua Xing

While targeting key drivers of tumor progression (e.g., BCR/ABL, HER2, and BRAFV600E) has had a major impact in oncology, most patients with advanced cancer continue to receive drugs that do not work in concert with their specific biology.  This is exemplified by acute myeloid leukemia (AML), a disease for which treatments and cure rates (in the range of 20%) have remained stagnant. Effectively deploying an ever-expanding array of cancer therapeutics holds great promise for improving these rates but requires methods to identify how drugs will affect specific patients.  Cancers that appear pathologically similar often respond differently to the same drug regimens.

I will present our ongoing project on building an AI system that takes available molecular information, reasons about the best possible treatment strategy, and explains its reasoning. The most important step necessary to realize this goal is to identify robust molecular markers from available data to predict the response to each of hundreds of chemotherapy drugs. However, due to the high dimensionality (i.e., the number of variables is much greater than the number of samples) along with potential biological or experimental confounders, it is an open challenge to identify robust biomarkers that are replicated across different studies. I will present two distinct machine learning techniques to resolve these challenges. These methods learn, in an unsupervised fashion, the low-dimensional features that are likely to represent important molecular events in the disease process, based on molecular profiles from multiple populations of patients with a specific cancer type. I will present two applications of these two methods: AML and ovarian cancer. When the first method was applied to AML data in collaboration with UW Hematology and UW’s Center for Cancer Innovation, a novel molecular marker for topoisomerase inhibitors, widely used chemotherapy drugs in AML treatment, was revealed. The other method, applied to ovarian cancer data, led to a potential molecular driver for tumor-associated stroma, in collaboration with UW Pathology and UW Genome Sciences. Our methods are general computational frameworks and can be applied to many other diseases.

Professor Su-In Lee is an Assistant Professor in the Departments of Computer Science & Engineering and Genome Sciences at the University of Washington. She received her Ph.D. degree in Electrical Engineering from Stanford University in 2009. Before joining the UW in 2010, she was a Visiting Assistant Professor in the Computational Biology Department at Carnegie Mellon University.

Her interest is in developing advanced machine learning algorithms to analyze high-throughput data to 1) discover molecular mechanisms of diseases, 2) identify therapeutic targets, and 3) develop personalized treatment plans given an individual’s molecular profile.

She has been named an American Cancer Society Research Scholar and received the NSF CAREER award. Her lab is currently funded by the American Cancer Society, the National Institutes of Health, the National Science Foundation, the Institute of Translational Health Sciences and the Solid Tumor Translational Research.

The life sciences are becoming a big data enterprise with its own data characteristics. To make big data useful, we need to find ways of dealing with the heterogeneity, diversity, and complexity of the data, to identify problems that could not be solved before, and to develop methods to solve those new problems. In this talk, I will outline a set of novel biological problems that we proposed and solved by integrating a large amount of genomic data. A major part of the talk is on integrating 3D chromatin structures, epigenetic modifications, and transcription factors to study gene regulation.


Faculty Host: Jian Ma

Chromosome segregation during mitosis hinges on proper assembly of the microtubule spindle that establishes bipolar attachment for each chromosome to the associated spindle pole. While chromosome missegregation is very rare in normal mitotic cells, it is much more frequent in cancer cells and during meiosis I in mammalian oocytes. In these latter cases, abnormal spindle pole formation and its aberrant coordination with atypical kinetochore-spindle attachments have been shown to correlate with aneuploidy. Intriguingly, experiments demonstrate allometry of mitotic spindle and a striking size scaling with cell size across metazoans. Taken together, these experiments indicate that a conserved principle of mitotic spindle geometry and size scaling could be at play during evolution. The nature of this principle, however, is currently unknown. Researchers have focused on mechanics of spindle assembly process that might shed light on the mechanistic underpinnings resulting in the spindle geometry and size scaling. In this work we take a different standpoint and ask: What are the spindle geometry and size scaling for? We address this question from functional perspectives of the spindle assembly checkpoint (SAC). SAC is the critical surveillance mechanism that prevents premature chromosome segregation in the presence of mis-attached chromosomes. The SAC signal gets silenced after and only after the last chromosome-spindle attachment in mitosis. We established a model that explains this robustness of SAC silencing based on spindle-mediated spatiotemporal regulation of SAC proteins. Further, our results suggest that robust and timely SAC silencing entails proper geometry and size scaling of mitotic spindle, violation of which will result in premature anaphase onset or prolonged mitotic arrest. 
Our work provides a novel, function-oriented angle towards understanding the observed spindle allometry, and the universal scaling relationship between spindle size and cell size evidenced across the metazoan kingdom. In a broad sense, the functional requirement of robust SAC silencing could have helped shape the spindle assembly mechanism in evolution.

Host: Jianhua Xing

Estimating the Tree of Life will likely involve a two-step procedure, where in the first step trees are estimated on many genes, and then the gene trees are combined into a tree on all the taxa. However, the true gene trees may not agree with the species tree due to biological processes such as deep coalescence, gene duplication and loss, and horizontal gene transfer. Statistically consistent methods based on the multi-species coalescent model have been developed to estimate species trees in the presence of incomplete lineage sorting (ILS); however, the relative accuracy of these methods compared to the usual "concatenation" approach is a matter of substantial debate within the research community. In this talk I will present new state-of-the-art methods we have developed for estimating species trees in the presence of ILS, and show how they can be used to estimate species trees from genome-scale datasets with high accuracy. I will also discuss tradeoffs between data quantity and quality, and the implications for big data genomic analysis.

Gene expression has been studied extensively at the transcript level with the help of RNA-seq technology; however, less attention has been paid to pre-transcriptional and post-transcriptional gene regulation. For example, it is not clear whether genome structure plays an important role in gene functionality, nor is it clear how gene expression is regulated by translational speed on a codon basis. Recently, several high-throughput sequencing techniques have been developed to help answer these questions. Specifically, Chromosome Conformation Capture (3C) was developed to capture spatially close chromatin loci in cell nuclei and enables whole-genome structure studies, and ribosome profiling (ribo-seq) was developed to study ribosome location preferences during translation and enables genome-wide translational studies. However, the complicated experimental pipelines make these data inherently noisy, and typical approaches to processing these data are error-prone and computationally expensive. We developed various computational pipelines to process these data and advance downstream analysis of gene regulation. Specifically, we developed a graph-based test that uses 3C data to identify sets of functionally related genomic loci that are statistically spatially closer than expected by chance. Compared to typical methods, our approach is computationally inexpensive and more robust to unmeasured interactions and the inclusion of non-associated loci. We also developed a pipeline to estimate ribosome occupancy preferences on a transcript level from ribo-seq data.

This is the first systematic approach to address the ubiquitous multi-mappings in ribo-seq data and quantify ribosome loci on a transcript level. It results in better estimations of both ribosome profiles and ribosome loads. In addition, we designed a mathematical model and algorithm to recover ribosome positions from ribo-seq data. Unlike existing simple heuristics that make inaccurate assumptions on ribo-seq read digestions, our approach captured the complicated digestion pattern in a flexible and data-driven way, and outputs better ribosome profiles that help reveal biologically reasonable observations on translation patterns. Using these improved preprocessing pipelines above, we estimated the codon decoding time in yeast, and showed that both codon usage and wobble pairing play a role in regulating translational speed. Lastly, we performed the first genome-wide analysis on ribosome collisions with the help of a modified ribosome profiling protocol. Our preliminary results indicate that extreme slow-down of local ribosome movements during translation is likely to be random and rare, and the identification of programmed ribosome stalling requires further experiments with deeper sequencing. Together, our algorithms and analysis have helped to build the foundation for exploring pre- and post-transcriptional regulation in gene expression, which will help us understand the mechanism of cell growth and death, the differential gene expression across conditions and cell types, and the development and causes of diseases.
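The question behind the 3C analysis in this abstract, whether a set of loci is spatially closer than expected by chance, can be illustrated with a plain permutation test. This is a swapped-in, simpler technique and not the graph-based test developed in the thesis; the distance matrix `dist`, the function names, and the use of mean pairwise distance as the statistic are assumptions made for this sketch.

```python
import random
from itertools import combinations

def mean_pairwise(dist, loci):
    """Average distance over all unordered pairs of the given loci."""
    pairs = list(combinations(loci, 2))
    return sum(dist[i][j] for i, j in pairs) / len(pairs)

def closeness_p_value(dist, loci, n_perm, rng):
    """Permutation p-value: how often is a random locus set of the same
    size at least as compact as the observed set?"""
    observed = mean_pairwise(dist, loci)
    n = len(dist)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(range(n), len(loci))
        if mean_pairwise(dist, sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    # Toy symmetric matrix: loci 0-2 are mutually close, all other pairs are far.
    n = 6
    dist = [[0 if i == j else (1 if i < 3 and j < 3 else 10) for j in range(n)]
            for i in range(n)]
    print(closeness_p_value(dist, [0, 1, 2], 2000, random.Random(0)))
```

A test of this kind needs a distance for every pair of loci, which is one reason robustness to unmeasured interactions matters in practice.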

Thesis Committee:
Carl Kingsford (Chair)
Joel McManus
James Faeder (University of Pittsburgh)
Sridhar Hannenhalli (University of Maryland)

Influenza has been, and continues to be, a significant source of disease burden worldwide. Regular epidemics and sporadic pandemics are incredibly costly to society, not just in terms of the monetary expense of prevention and treatment, but also in terms of reduced productivity, increased absenteeism, and excessive morbidity and mortality. Major obstacles to mitigating these costs include an incomplete understanding of influenza's phylodynamics, the inherent delays of clinical surveillance and reporting, and a lack of outbreak forewarning.

The aim of this thesis is to address each of these obstacles computationally by (1) simulating transmission and evolution of influenza to explore the interplay between human immunity and viral evolution; (2) collecting and integrating a diverse set of real-time digital surveillance signals to track influenza activity; and (3) generating season-wide forecasts of influenza epidemics using an ensemble of statistical models, simulations, and human judgment.

The first part explores the concept of generalized immunity, which was previously hypothesized to be highly protective but short-lasting. Large-scale, long-term simulations based on an extension of an earlier model were used to scan immunity parameter space and indicate that the most plausible definition of generalized immunity is less protective but potentially much longer-lasting than previously assumed. The second part describes how sensor fusion and tracking can be applied to the nowcasting problem. Drawing from control theory, weather forecasting, and econometrics, an optimal filtering methodology is developed to integrate a set of proxies for influenza activity which share one common property: they are available online and in real-time. Otherwise, they are available at different temporal intervals, geographic resolutions, and historical periods, and they are noisy and potentially correlated. The resulting nowcasts are robust to failure of individual proxies and are available up to several weeks before traditional surveillance reports. The third part combines earlier results with novel methodologies to produce probabilistic forecasts of influenza spread and intensity that are timely, accurate, and actionable. In particular, an empirical Bayes method and spline regression are used to produce forecasts which only rely on the availability of historical data and are readily generalizable to other infectious diseases; and a wisdom of crowds approach is used to incorporate human judgment into the forecasting process.
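The nowcasting idea described above, fusing several noisy proxies into one running estimate, can be sketched with a scalar Kalman filter. This is a minimal illustration and not the thesis's filtering methodology: the random-walk state model, the function name `fuse_step`, and above all the assumption that proxy noise is independent (the abstract notes that real proxies may be correlated) are simplifications made for this example.

```python
def fuse_step(mean, var, process_var, observations):
    """One predict/update cycle of a scalar Kalman filter.

    observations: list of (value, noise_variance) pairs for the proxies
    available at this time step; missing proxies are simply left out.
    """
    # Predict: random-walk model, so uncertainty grows by process_var.
    var = var + process_var
    # Update: fold in each proxy in turn (assumes independent noise).
    for value, noise_var in observations:
        gain = var / (var + noise_var)
        mean = mean + gain * (value - mean)
        var = (1.0 - gain) * var
    return mean, var

if __name__ == "__main__":
    mean, var = 0.0, 1.0
    # Hypothetical weekly readings; the second proxy is missing in week 2.
    weeks = [[(2.0, 1.0), (2.4, 4.0)], [(2.2, 1.0)]]
    for obs in weeks:
        mean, var = fuse_step(mean, var, process_var=0.1, observations=obs)
        print(round(mean, 3), round(var, 3))
```

Because each proxy is folded in separately, a proxy that is missing in a given week just drops out of that update, which mirrors the robustness to individual proxy failure mentioned in the abstract.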

Thesis Committee:
Roni Rosenfeld (Advisor)
Carl Kingsford
John Grefenstette (University of Pittsburgh)
Elodie Ghedin (New York University)
