Note: I have moved to the Salk Institute for Biological Studies.
Openings:We are looking for highly motivated computational graduate students and post-docs. If interested, please email me with your CV and research interests.
We work at the interface of theoretical computer science and systems biology. We are interested in:
- Models of the evolution and development of complex biological systems, including molecular interactions in the cell and neural interactions in the brain;
- Robustness and flexibility of networks and signaling pathways in the face of environmental noise, failures, and perturbations;
- "Algorithms in nature", i.e. how collections of molecules, cells, and organisms process information and solve interesting distributed computing problems.
S. Navlakha, X. He, C. Faloutsos, Z. Bar-Joseph. Topological properties of robust biological and computational networks. J. R. Soc. Interface, 11(96):20140283, 2014.
[abstract]Abstract: Network robustness is an important principle in biology and engineering. Previous studies of global networks have identified both redundancy and sparseness as topological properties used by robust networks. By focusing on molecular subnetworks, or modules, we show that module topology is tightly linked to the level of environmental variability (noise) the module expects to encounter. Modules internal to the cell that are less exposed to environmental noise are more connected and less robust than external modules. A similar design principle is used by several other biological networks. We propose a simple change to the evolutionary gene duplication model which gives rise to the rich range of module topologies observed within real networks. We apply these observations to evaluate and design communication networks that are specifically optimized for noisy or malicious environments. Combined, joint analysis of biological and computational networks leads to novel algorithms and insights benefiting both fields.
S. Navlakha, P. Ahammad, E. W. Myers. Unsupervised segmentation of noisy electron microscopy images using salient watersheds and region merging. BMC Bioinformatics, 14:294, 2013.
[abstract]Abstract: Segmenting electron microscopy (EM) images of cellular and subcellular processes in the nervous system is a key step in many bioimaging pipelines involving classification and labeling of ultrastructures. However, fully automated techniques to segment images are often susceptible to noise and heterogeneity in EM images (e.g. different histological preparations, different organisms, different brain regions, etc.). Supervised techniques to address this problem are often helpful but require large sets of training data, which are often difficult to obtain in practice, especially across many conditions. We propose a new, principled unsupervised algorithm to segment EM images using a two-step approach: edge detection via salient watersheds following by robust region merging. We performed experiments to gather EM neuroimages of two organisms (mouse and fruit fly) using different histological preparations and generated manually curated ground-truth segmentations. We compared our algorithm against several state-of-the-art unsupervised segmentation algorithms and found superior performance using two standard measures of under-and over-segmentation error. Our algorithm is general and may be applicable to other large-scale segmentation problems for bioimages.
S. Navlakha, J. Suhan, A. Barth, and Z. Bar-Joseph. A high-throughput framework to detect synapses in electron microscopy images. Proc. 21st Intl. Conf. on Intelligent Systems for Molecular Biology/12th European Conference on Computational Biology (ISMB/ECCB), 2013. Journal version in Bioinformatics, 29(13):i9-17, 2013.
[abstract]Abstract: Synaptic connections underlie learning and memory in the brain and are dynamically formed and eliminated during development and in response to stimuli. Quantifying changes in overall number and strength of synapses is an important pre-requisite for studying connectivity and plasticity in these cases or in diseased conditions. Unfortunately, most techniques to detect such changes are either low-throughput (e.g. electrophysiology), prone to error and difficult to automate (e.g. standard electron microscopy), or too coarse (e.g. MRI) to provide accurate and large-scale measurements. To facilitate high-throughput analyses, we used a 50-year-old experimental technique to selectively stain for synapses in electron microscopy (EM) images and developed a machine learning framework to automatically detect synapses in these images. To validate our method we experimentally imaged brain tissue of the somatosensory cortex in four mice. We detected thousands of synapses in these images and demonstrate the accuracy of our approach using cross-validation with manually labeled data and by comparing against existing algorithms and against tools that process standard EM images. We also used a semi-supervised algorithm that leverages unlabeled data to overcome sample heterogeneity and improve performance. Our algorithms are highly efficient and scalable and are freely available for others to use.
S. Navlakha, A. Gitter, and Z. Bar-Joseph. A network-based approach for predicting missing pathway interactions. PLoS Comput. Biol., 8(8):e1002640, 2012.
[abstract]Abstract: Embedded within large-scale protein interaction networks are signaling pathways that encode response cascades in the cell. Unfortunately, even for well-studied species like S. cerevisiae, only a fraction of all true protein interactions are known, which makes it difficult to reason about the exact flow of signals and the corresponding causal relations in the network. To help address this problem, we introduce a framework for predicting new interactions that aid connectivity between upstream proteins (sources) and downstream transcription factors (targets) of a particular pathway. Our algorithms attempt to globally minimize the distance between sources and targets by finding a small set of shortcut edges to add to the network. Unlike existing algorithms for predicting general protein interactions, by focusing on proteins involved in specific responses our approach homes-in on pathway-consistent interactions by reducing source-target distances. We applied our method to extend pathways in osmotic stress response in yeast and identified several missing interactions, some of which are supported by published reports. We also performed experiments that support a novel interaction not previously reported. Our framework is general and may be applicable to edge prediction problems in other domains.
S. Navlakha and Z. Bar-Joseph. Algorithms in nature: the convergence of systems biology and computational thinking. Nature/EMBO Mol. Syst. Biol., 7:546, 2011.
[abstract]Abstract: Computer science and biology have enjoyed a long and fruitful relationship for decades. Biologists rely on computational methods to analyze and integrate large data sets, while several computational methods were inspired by the high-level design principles of biological systems. Recently, these two directions have been converging. In this review, we argue that thinking computationally about biological processes may lead to more accurate models, which in turn can be used to improve the design of algorithms. We discuss the similar mechanisms and requirements shared by computational and biological processes and then present several recent studies that apply this joint analysis strategy to problems related to coordination, network analysis, and tracking and vision. We also discuss additional biological processes that can be studied in a similar manner and link them to potential computational problems. With the rapid accumulation of data detailing the inner workings of biological systems, we expect this direction of coupling biological and computational studies to greatly expand in the future.
A. Thor, P. Anderson, L. Raschid, S. Navlakha, B. Saha, S. Khuller, and X-N. Zhang. Link prediction for annotation graphs using graph summarization. Proc. 10th Intl. Semantic Web Conference (ISWC), 7031:714-729, 2011.
[abstract]Abstract: Annotation graph datasets are a natural representation of scientific knowledge. They are common in the life sciences where genes or proteins are annotated with controlled vocabulary terms (CV terms) from ontologies. The W3C Linking Open Data (LOD) initiative and semantic Web technologies are playing a leading role in making such datasets widely available. Scientists can mine these datasets to discover patterns of annotation. While ontology alignment and integration across datasets has been explored in the context of the semantic Web, there is no current approach to mine such patterns in annotation graph datasets. In this paper, we propose a novel approach for link prediction; it is a preliminary task when discovering more complex patterns. Our prediction is based on a complementary methodology of graph summarization (GS) and dense subgraphs (DSG). GS can exploit and summarize knowledge captured within the ontologies and in the annotation patterns. DSG uses the ontology structure, in particular the distance between CV terms, to filter the graph, and to find promising subgraphs. We develop a scoring function based on multiple heuristics to rank the predictions. We perform an extensive evaluation on Arabidopsis thaliana genes.
R. Patro, E. Sefer, J. Malin, G. Marçais, S. Navlakha, and C. Kingsford. Parsimonious reconstruction of network evolution. Workshop on Algorithms in Bioinformatics (WABI), 2011. Journal version in Algorithms Mol. Biol., 7(1):25 2012.
[abstract]Abstract: We consider the problem of reconstructing a maximally parsimonious history of network evolution under models that support gene duplication and loss and independent interaction gain and loss. We introduce a combinatorial framework for encoding network histories, and we give a fast procedure that, given a set of duplication histories, in practice ï¬nds network histories with close to the minimum number of interaction gain or loss events. In contrast to previous studies, our method does not require knowing the relative ordering of unrelated duplication events. Results on simulated histories suggest that common ancestral networks can be accurately reconstructed using this parsimony approach.
S. Navlakha and C. Kingsford. Network archaeology: uncovering ancient networks from present-day interactions. PLoS Comput. Biol., 7(4):e1001119 2011. Presented at the 6th RECOMB Systems Biology Satellite Conference, 2010.
[abstract]Abstract: What proteins interacted in a long-extinct ancestor of yeast? How have different members of a protein complex assembled together over time? Our ability to answer such questions has been limited by the unavailability of ancestral protein-protein interaction (PPI) networks. To overcome this limitation, we propose several novel algorithms to reconstruct the growth history of a present-day network. Our likelihood-based method finds a probable previous state of the graph by applying an assumed growth model backwards in time. This approach retains node identities so that the history of individual nodes can be tracked. Using this methodology, we estimate protein ages in the yeast PPI network that are in good agreement with sequence-based estimates of age and with structural features of protein complexes. Further, by comparing the quality of the inferred histories for several different growth models (duplication-mutation with complementarity, forest fire, and preferential attachment), we provide additional evidence that a duplication-based model captures many features of PPI network growth better than models designed to mimic social network growth. From the reconstructed history, we model the arrival time of extant and ancestral interactions and predict that complexes have significantly re-wired over time and that new edges tend to form within existing complexes. We also hypothesize a distribution of per-protein duplication rates, track the change of the network's clustering coefficient, and predict paralogous relationships between extant proteins that are likely to be complementary to the relationships inferred using sequence alone. Finally, we infer plausible parameters for the model, thereby predicting the relative probability of various evolutionary events. The success of these algorithms indicates that parts of the history of the yeast PPI are encoded in its present-day form.
G. Duggal, S. Navlakha, M. Girvan, and C. Kingsford. Uncovering many views of biological networks using ensembles of graph partitions. Proc. 1st Intl. Workshop on Discovering, Summarizing, and Using Multiple Clusterings (KDD MultiClust), 2010:9, 2010.
[abstract]Abstract: Densely interacting regions of biological networks often correspond to functional modules such as protein complexes. Most algorithms proposed to uncover modules, however, produce one clustering that only reveals a single view of how the cell is organized. We describe two new methods to find ensembles of provably near-optimal modularity partitions that lie within a heuristically constrained search space. We also show how to count the number of solutions in this space that exist within a bounded modularity range. We apply our algorithms to a protein interaction network for S. cerevisiae and show how fine-grained differences between near-optimal partitions can be used to define robust communities. We also propose a technique to find structurally diverse nearoptimal solutions and show that these different partitions are enriched for different biological functions. Our results indicate that near-optimal solutions can represent alternative and complementary views of the network's structure.
J. R. White, S. Navlakha, N. Nagarajan, M. Ghodsi, C. Kingsford, and M. Pop. Alignment and clustering of phylogenetic markers - implications for microbial diversity studies. BMC Bioinformatics, 11:152, 2010.
[abstract]Abstract: Molecular studies of microbial diversity have provided many insights into the bacterial communities inhabiting the human body and the environment. A common first step in such studies is a survey of conserved marker genes (primarily 16S rRNA) to characterize the taxonomic composition and diversity of these communities. To date, however, there exists significant variability in analysis methods employed in these studies. Here we provide a critical assessment of current analysis methodologies that cluster sequences into operational taxonomic units (OTUs) and demonstrate that small changes in algorithm parameters can lead to significantly varying results. Our analysis provides strong evidence that the species-level diversity estimates produced using common OTU methodologies are inflated due to overly stringent parameter choices. We further describe an example of how semi-supervised clustering can produce OTUs that are more robust to changes in algorithm parameters. Our results highlight the need for systematic and open evaluation of data analysis methodologies, especially as targeted 16S rRNA diversity studies are increasingly relying on high-throughput sequencing technologies. All data and results from our study are available through the JGI FAMeS website.
S. Navlakha and C. Kingsford. The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26(8):1057-63, 2010.
[abstract]Abstract: Understanding the association between genetic diseases and their causal genes is an important problem concerning human health. With the recent influx of high-throughput data describing interactions between gene products, scientists have been provided a new avenue through which these associations can be inferred. Despite the recent interest in this problem, however, there is little understanding of the relative benefits and drawbacks underlying the proposed techniques. We assessed the utility of physical protein interactions for determining geneâ€“disease associations by examining the performance of seven recently developed computational methods (plus several of their variants). We found that random-walk approaches individually outperform clustering and neighborhood approaches, although most methods make predictions not made by any other method. We show how combining these methods into a consensus method yields Pareto optimal performance. We also quantified how a diffuse topological distribution of disease-related proteins negatively affects prediction quality and are thus able to identify diseases especially amenable to network-based predictions and others for which additional information sources are absolutely required. The predictions made by each algorithm considered are available online.
S. Navlakha and C. Kingsford. Exploring biological network dynamics with ensembles of graph partitions. Proc. 15th Intl. Pacific Symposium on Biocomputing (PSB), 166-177, 2010.
[abstract]Abstract: Unveiling the modular structure of biological networks can reveal important organizational patterns in the cell. Many graph partitioning algorithms have been proposed towards this end. However, most approaches only consider a single, optimal decomposition of the network. In this work, we make use of the multitude of near-optimal clusterings in order to explore the dynamics of network clusterings and how those dynamics relate to the structure of the underlying network. We recast the modularity optimization problem as an integer linear program with diversity constraints. These constraints produce an ensemble of dissimilar but still highly modular clusterings. We apply our approach to four social and biological networks and show how optimal and near-optimal solutions can be used in conjunction to identify deeper community structure in the network, including inter-community dynamics, communities that are especially resilient to change, and core-and-peripheral community members.
S. Navlakha, J. White, N. Nagarajan, M. Pop, and C. Kingsford. Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information. Proc. 14th Intl. Conf. on Research in Computational Molecular Biology (RECOMB), 5541:400–417, 2009. Journal version in J. Comput. Biol., 17(3):503–516, 2010.
[abstract]Abstract: Hierarchical clustering is a popular method for grouping to- gether similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decom- posing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation of information metric. Our approach is general and does not require the user to enter the number of clusters desired. We apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For these two applications, we test the quality of our clusters by using them to predict complex and species membership, respectively. We find that our approach generally outperforms the commonly used heuristic methods.
S. Navlakha, M.C. Schatz, and C. Kingsford. Revealing biological modules via graph summarization. J. Comput. Biol., 16(2):253-64, 2009. Presented at the 5th RECOMB Systems Biology Satellite Conference, 2008.
[abstract]Abstract: The division of a protein interaction network into biologically meaningful modules can aid with automated detection of protein complexes and prediction of biological processes and can uncover the global organization of the cell. We propose the use of a graph summarization (GS) technique, based on graph compression, to cluster protein interaction graphs into biologically relevant modules. The method is motivated by defining a biological module as a set of proteins that have similar sets of interaction partners. We show this definition, put into practice by a GS algorithm, reveals modules that are more biologically enriched than those found by other methods. We also apply GS to predict complex memberships, biological processes, and co-complexed pairs and show that in most settings GS is preferable over existing methods of protein interaction graph clustering.
S. Navlakha, R. Rastogi, and N. Shrivastava. Graph
summarization with bounded error. Proc. 34th Intl. Conf. on Management of Data (SIGMOD), 419-432, 2008.
[abstract]Abstract: We propose a highly compact two-part representation of a given graph G consisting of a graph summary and a set of corrections. The graph summary is an aggregate graph in which each node corresponds to a set of nodes in G, and each edge represents the edges between all pair of nodes in the two sets. On the other hand, the corrections portion specifies the list of edge-corrections that should be applied to the summary to recreate G. Our representations allow for both lossless and lossy graph compression with bounds on the introduced error. Further, in combination with the MDL principle, they yield highly intuitive coarse-level summaries of the input graph G. We develop algorithms to construct highly compressed graph representations with small sizes and guaranteed accuracy, and validate our approach through an extensive set of experiments with multiple real-life graph data sets. To the best of our knowledge, this is the first work to compute graph summaries using the MDL principle, and use the summaries (along with corrections) to compress graphs with bounded error.
H. Haidarian-Shahri, G. M. Namata, S. Navlakha, A. Deshpande, and N. Roussopoulos. A graph-based approach to vehicle tracking in traffic camera video streams. Proc. 5th Intl. Workshop on Data Management for Sensor Networks (VLDB DMSN), 2007.
[abstract]Abstract: Vehicle tracking has a wide variety of applications from law enforcement to traffic planning and public safety. However, the image resolution of the videos available from most traffic camera systems, make it difficult to track vehicles based on unique identifiers like license plates. In many cases, vehicles with similar attributes are indistinguishable from one another due to image quality issues. Often, network bandwidth and power constraints limit the frame rate, as well. In this paper, we discuss the challenges of performing vehicle tracking queries over video streams from ubiquitous traffic cameras. We identify the limitations of tracking vehicles individually in such conditions and provide a novel graph-based approach using the identity of neighboring vehicles to improve the performance. We evaluate our approach using streaming video feeds from live traffic cameras available on the Internet. The results show that vehicle tracking is feasible, even for low quality and low frame rate traffic cameras. Additionally, exploitation of the attributes of neighboring vehicles significantly improves the performance.