Identifying genetic variants (e.g. single nucleotide polymorphisms) associated with phenotypic variation (e.g. disease status) is one of the fundamental problems in genetics. However, most genetic variants associated with complex phenotypes remain elusive. A major challenge is that the number of samples is much smaller than the number of genetic variants, so statistical power is limited to detecting only strong associations.
This thesis develops structured sparse models and algorithms for detecting genotype-phenotype associations, which enhance statistical power by exploiting the structure of the problem or prior biological knowledge. In the first part of this thesis, we develop the structured sparse models and their algorithms. We first present the adaptive multi-task Lasso and the structured input-output Lasso, which take advantage of genome annotations or group structures in the genome and traits. We then develop a structured piecewise linear Lasso, which uses the non-linear structure of the problem to detect trait-associated interactions between genetic variants.
In the second part of this thesis, we focus on scaling up algorithms for structured sparse models to analyze large-scale human genomic data. Specifically, we propose a screening algorithm for the overlapping group Lasso that allows us to safely discard irrelevant genetic variants using simple rules. Because the screening efficiently reduces the set of candidate genetic variants before the original problem is solved, it makes it feasible to solve the large structured sparse models developed in the first part of this thesis. Finally, using the structured input-output Lasso model together with the screening algorithm, we propose a work plan to identify associations in large-scale Alzheimer’s disease data.
Eric P. Xing (Chair)
Matthew Stephens (University of Chicago)