The data collected within The Cancer Genome Atlas (TCGA) project are exceptionally heterogeneous. Molecular profiling data generated by different measurement modalities, together with clinical information collected on each patient, give rise to continuous, discrete, and categorical data with different distributional properties. Additional data are generated by a variety of analyses carried out on the individual data sets. Examples, each computed per patient sample, include functional or structural annotations of mutations, assignment of a tumor to an expression subtype, and enrichment or activity of molecular pathways. The data also include missing values as well as interdependencies among the features that undoubtedly extend beyond pairwise correlations. I will describe our efforts towards identifying strong multivariate associations in the TCGA data using a framework based on random forest regression, as well as the development of web-based tools to interactively explore such associations. I will also describe our efforts to integrate these association data with other information from public biomedical resources using big graph analytics, with applications in drug repurposing.
About the Speaker