Inferring phenotype from genotype in RNA viruses maps viral genome or protein sequences to molecular functions, with the goal of understanding the molecular mechanisms responsible for functional changes. This inference is currently done through a laborious experimental process that is arguably inefficient, incomplete, and unreliable. The wealth of RNA virus sequence data associated with different phenotypes has encouraged computational approaches to aid the inference. Two such approaches are key residue identification, which separates the critical positions from hitchhikers, and genotype-phenotype mapping function learning, which elucidates the relations among those positions.
Existing computational approaches in this area focus on prediction accuracy, yet several fundamental problems have not been considered: scalability with respect to the available data, the capability to suggest informative biological experiments, and the interpretability of the inferences. When biologists perform inference with mutagenesis experiments, they usually have only a small number of sequences available, which is very likely to be inadequate for inference in most setups. Accordingly, biologists want models that can infer from such limited data, and algorithms that can suggest new experiments when more data is needed. Another important but consistently neglected property of the models is the interpretability of the mapping, since most existing models behave as 'black boxes'.
To address these issues, in this thesis I design a supervised combinatorial filtering algorithm that systematically and efficiently infers the correct set of key residue positions from the available labeled data. For cases where more data is needed to fully converge to an answer, I introduce an active learning algorithm that chooses the most informative experiment from a set of unlabeled candidate strains or mutagenesis experiments, so as to minimize the expected total laboratory time or financial cost. I also propose Disjunctive Normal Form (DNF) as an appropriate assumption over the hypothesis space for learning interpretable genotype-phenotype functions.
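As a rough illustration only, and not the algorithm developed in the thesis, the following sketch shows one simple way such filtering could work: enumerate candidate position sets up to a size bound and keep those that never assign two different labels to sequences that agree at every chosen position. It also shows how a DNF hypothesis can be read as a list of residue conjunctions. The function names, the size bound `max_size`, and the toy sequences and labels are all assumptions made for this example.

```python
from itertools import combinations


def consistent_position_sets(seqs, labels, max_size):
    """Return every position subset (size <= max_size) that is consistent
    with the labeled data: sequences that agree on all chosen positions
    must share the same label. Hypothetical helper, for illustration."""
    n_pos = len(seqs[0])
    survivors = []
    for size in range(1, max_size + 1):
        for positions in combinations(range(n_pos), size):
            seen = {}  # residue pattern at `positions` -> label first observed
            ok = True
            for seq, label in zip(seqs, labels):
                key = tuple(seq[p] for p in positions)
                if seen.setdefault(key, label) != label:
                    ok = False  # same pattern, different phenotype: filter out
                    break
            if ok:
                survivors.append(positions)
    return survivors


def eval_dnf(dnf, seq):
    """Evaluate a DNF hypothesis on one sequence.
    `dnf` is a list of clauses; each clause maps position -> required residue.
    The phenotype is predicted positive if any clause is fully satisfied."""
    return any(all(seq[p] == r for p, r in clause.items()) for clause in dnf)


# Toy data (invented for this example): 3-residue sequences, binary phenotype.
seqs = ["AKT", "AKS", "GKT", "GRS"]
labels = [1, 1, 0, 0]

print(consistent_position_sets(seqs, labels, max_size=2))
# [(0,), (0, 1), (0, 2)]

# A one-clause DNF: "residue A at position 0".
print([eval_dnf([{0: "A"}], s) for s in seqs])
# [True, True, False, False]
```

In this toy case position 0 alone explains the labels, and the surviving supersets only add hitchhiker positions. When several unrelated position sets survive, the labeled data is insufficient to decide among them, which is the situation in which choosing an additional experiment becomes useful.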
The main challenge of these approaches is computational efficiency, owing to the combinatorial nature of the algorithms. The solution is to exploit biologically plausible assumptions that constrain the solution space, and to efficiently find the optimal solutions under those assumptions.
The algorithms were validated in two ways: 1) prediction quality under cross-validation, and 2) consistency with domain experts' conclusions. The algorithms also suggested new discoveries that have not yet been discussed. I applied these approaches to a variety of RNA virus datasets covering the majority of interesting RNA virus phenotypes, including drug resistance, antigenicity shift, and antibody neutralization, to demonstrate their predictive power and to suggest new discoveries in influenza drug resistance and antigenicity. I also demonstrate that the approaches extend to the area of severe acute community disease.
Roni Rosenfeld (Advisor)
Gilles Clermont (University of Pittsburgh)
Eldie Ghedin (New York University)