Language Technologies Thesis Proposal

  • Ph.D. Student
  • Language Technologies Institute
  • Carnegie Mellon University

Feature Learning and Graphical Models for Protein Sequences

Machine learning methods rely heavily on good features, whether hand-crafted or learned. We study three problems in the context of protein sequences:

(1) drug cocktail design,
(2) studying allostery in GPCRs, and
(3) generative modeling of protein families.

We show that the core challenges underlying these tasks relate to effective feature selection, feature interpretation, and feature learning, respectively.

We address the drug cocktail design problem by providing solutions for feature selection in the context of large-scale data. To study allostery in GPCRs, we employ structure learning in Markov random fields and interpret the learned features from a biological perspective. For protein families, we investigate deep architectures for unsupervised learning of latent feature representations, and show preliminary results using Restricted Boltzmann Machines (RBMs). We propose to build a deep architecture from RBMs using Deep Boltzmann Machines (DBMs). Additionally, we propose Locally Connected Deep Boltzmann Machines (LC-DBMs), which employ sparse structure learning to trade off model agnosticism against prior knowledge.
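To make the RBM component concrete, below is a minimal sketch of a binary RBM trained with one-step contrastive divergence (CD-1), the standard approximate training procedure for RBMs. Everything here is illustrative: the layer sizes, learning rate, and toy "protein family" data (binary vectors with a few conserved positions) are assumptions for the example, not the proposal's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: encoded sequence positions -> latent features.
n_visible, n_hidden = 20, 8
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, lr=0.1):
    """One CD-1 step on a batch of binary visible vectors; returns
    the mean reconstruction error for monitoring."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities given the data, then a sample.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    # Approximate log-likelihood gradient: data term minus model term.
    batch = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2))

# Toy "family": sparse binary vectors sharing conserved positions.
data = (rng.random((100, n_visible)) < 0.2).astype(float)
data[:, :5] = 1.0   # the first five positions are conserved

errs = [cd1_update(data) for _ in range(200)]
print(f"reconstruction error: {errs[0]:.3f} -> {errs[-1]:.3f}")
```

After training, the hidden-unit probabilities `sigmoid(v @ W + b_h)` serve as a learned feature representation of each sequence; stacking such layers is the route to the DBMs discussed above.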

Thesis Committee:
Chris Langmead (Chair)
Jaime Carbonell
Bhiksha Raj
Hetunandan Kamisetty (Facebook)

Thesis Proposal Document
