INTRO TO MACHINE LEARNING PROJECTS: Single Cell Analysis

Single Cell Analysis Using Neural Networks

Background and summary: A gene is a sequence of DNA containing the instructions for building a single molecule. While cells from the same individual share (roughly) the same DNA sequence, different cells use different subsets of the genes it encodes, and at different levels. Single-cell RNA sequencing (scRNA-seq) measures the activity levels of a set of annotated genes in a single cell. The result is a vector of continuous values termed the ‘gene expression profile’. The type of a cell (e.g., lung cell, brain cell, skin epidermis) is closely connected with its gene expression profile.

Goal: Predict the cell type from the expression profile. This matters in practice: for example, a cancer biopsy contains several different types of cells, and the ability to accurately determine the cell composition can have an important impact on the type of treatment prescribed. Accurately characterizing the set of cells from the collected expression profiles is therefore of much current interest.

There are many possible approaches to this data. While we have found that neural networks (NNs) are successful, you are not required to use them here. Another approach that seems to work well with this data is to first perform dimensionality reduction (using unsupervised or supervised methods) and then perform the classification in the reduced-dimension space. We list below some references that may help in thinking about how to work with this data.
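To make the reduce-then-classify idea concrete, here is a minimal, dependency-free sketch on made-up toy data. It is only one possible instance of the approach, not the method used in the references: it assumes PCA (computed by power iteration) as the unsupervised reduction and a nearest-centroid rule as the classifier. The expression values, the labels "A" and "B", and all function names are hypothetical. On the real data you would almost certainly use a numerical library instead.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def column_means(X):
    n = len(X)
    return [sum(col) / n for col in zip(*X)]


def top_components(X, means, k, iters=200, seed=0):
    """Top-k principal directions via power iteration with deflation.

    Runs a fixed number of iterations; there is no convergence check.
    """
    Xc = [[x - m for x, m in zip(row, means)] for row in X]
    rng = random.Random(seed)
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in means]
        for _ in range(iters):
            # Deflation: remove the directions that were already found.
            for c in comps:
                d = dot(v, c)
                v = [vi - d * ci for vi, ci in zip(v, c)]
            # Power step: w = Xc^T (Xc v), proportional to covariance times v.
            scores = [dot(row, v) for row in Xc]
            w = [dot(scores, col) for col in zip(*Xc)]
            norm = math.sqrt(dot(w, w))
            v = [wj / norm for wj in w]
        comps.append(v)
    return comps


def project(X, means, comps):
    """Center each row with the training means and project onto comps."""
    return [[dot([x - m for x, m in zip(row, means)], c) for c in comps]
            for row in X]


def fit_centroids(Z, labels):
    groups = {}
    for z, label in zip(Z, labels):
        groups.setdefault(label, []).append(z)
    return {label: column_means(rows) for label, rows in groups.items()}


def predict(Z, centroids):
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda lab: sqdist(z, centroids[lab])) for z in Z]


# Toy data (hypothetical): 6 cells x 4 genes, two made-up cell types.
train_X = [
    [9.0, 8.0, 1.0, 0.5],
    [8.5, 9.5, 0.5, 1.0],
    [9.5, 8.5, 1.5, 0.0],
    [1.0, 0.5, 9.0, 8.0],
    [0.5, 1.5, 8.5, 9.5],
    [1.5, 0.0, 9.5, 8.5],
]
train_y = ["A", "A", "A", "B", "B", "B"]
test_X = [[9.0, 9.0, 1.0, 1.0], [1.0, 1.0, 9.0, 9.0]]

means = column_means(train_X)
components = top_components(train_X, means, k=2)
centroids = fit_centroids(project(train_X, means, components), train_y)
predictions = predict(project(test_X, means, components), centroids)
print(predictions)  # ['A', 'B']
```

Note that the gene means and the components are estimated on the training cells only and then reused unchanged for the test cells, so no information from the test set leaks into the reduction step.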

While the goal listed above is the most natural task, there are many other things that can be done with this data, and these would also qualify as a project in this class. For example, we have used the data to infer interactions between genes based on co-expression, so it can be used to learn various networks (including Bayesian networks and others). While this would be an interesting choice for a project, it also requires more biological knowledge than the classification task mentioned above, and so may not be appropriate for most students in the class. If you have any other ideas for projects related to this data, we would be happy to consider them as well.
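As a rough illustration of what "co-expression" can mean in practice, the sketch below links two genes whenever the Pearson correlation of their expression levels across cells is large in absolute value. This is a deliberately simple stand-in and not the network-learning procedure referred to above; the gene names, the toy matrix, and the 0.9 threshold are all hypothetical choices.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def coexpression_edges(X, gene_names, threshold=0.9):
    """Return (gene_i, gene_j, r) for every gene pair with |r| >= threshold.

    X has one row per cell and one column per gene.
    """
    columns = list(zip(*X))  # one tuple of expression values per gene
    edges = []
    for i, j in combinations(range(len(gene_names)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) >= threshold:
            edges.append((gene_names[i], gene_names[j], round(r, 3)))
    return edges


# Toy data (hypothetical): 5 cells x 3 genes; g2 is exactly twice g1.
gene_names = ["g1", "g2", "g3"]
X = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 2.0],
    [5.0, 10.0, 3.0],
]
edges = coexpression_edges(X, gene_names)
print(edges)  # [('g1', 'g2', 1.0)]
```

With p genes this loop visits p(p-1)/2 pairs, so on the full dataset you would likely restrict it to a subset of genes or compute the correlation matrix with a numerical library.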

Input data: The datasets used for this project are drawn from 104 separate scRNA-seq experiments, each of which profiles cells from different individuals or sets of individuals, and may focus on different cell types. We provide both the full dataset and separate training and test sets since, as we discuss below, splitting such data is not a trivial task due to various artifacts that can arise. However, if you are not attempting to perform classification (for example, if you decide to focus on interactions instead), you should ignore this division and treat the two files the same.

For the training and test sets we made sure that there is no overlap in the set of experiments (i.e., studies) used; however, the set of possible cell types seen in the test data is a subset of those appearing in the training data. Keeping the experiments disjoint is meant to ensure that the classifier does not learn to predict from experimental biases instead of gene expression profiles.
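If you want to build your own validation split from the training data, the same two constraints can be enforced by splitting on experiments rather than on individual cells. The sketch below is an assumption about how such a split could be constructed, not a description of how the provided files were made; the experiment IDs, cell type names, and function name are hypothetical.

```python
def split_by_experiment(experiments, cell_types, held_out_experiments):
    """Split cell indices so that no experiment appears on both sides.

    Held-out cells whose type never occurs in the training side are dropped,
    so the held-out cell types are a subset of the training cell types.
    """
    held_out = set(held_out_experiments)
    train_idx = [i for i, e in enumerate(experiments) if e not in held_out]
    train_types = {cell_types[i] for i in train_idx}
    test_idx = [
        i for i, e in enumerate(experiments)
        if e in held_out and cell_types[i] in train_types
    ]
    return train_idx, test_idx


# Toy metadata (hypothetical): one entry per cell.
experiments = ["exp1", "exp1", "exp2", "exp2", "exp3", "exp3"]
cell_types = ["neuron", "T cell", "neuron", "hepatocyte", "neuron", "B cell"]

train_idx, test_idx = split_by_experiment(experiments, cell_types, {"exp3"})
print(train_idx)  # [0, 1, 2, 3]
print(test_idx)   # [4]  (the "B cell" from exp3 has no training examples)
```

A random split over cells would instead place cells from the same experiment on both sides, which lets a classifier score well by recognizing experiment-specific artifacts.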

You have access to the following data:
all_data.h5: Contains a few structures including:
1. Gene expression profiles (covariates) for all data (train+test); the n rows correspond to cells and the p columns correspond to genes; the (i, j)th entry is the measured expression level of the jth gene in the ith cell
2. Gene names for all genes
3. Cell types for all rows
4. Information on the source of the data for each experiment

train_data.h5: Same structures as all_data.h5, but containing only the training data.

test_data.h5: Same, for the test data.

Note that this dataset is large. The dataset can be found here: DATA

Relevant papers:
Using Neural Networks for Reducing the Dimensions of Single-cell RNA
scQuery: A Web Server for Comparative Analysis of Single-cell RNA-seq Data