Feature Attributes

 

 

> List of Attribute Groups

Table S1. A total of 162 features are grouped into 17 categories. Two styles of feature encoding are used, which yield feature vectors of very different sizes. The second column lists the number of features in each category under the "detail" encoding; under the "summary" encoding, this number is 1 for every category. (Table 3 in the paper presents the coverage of each attribute group used in our feature set.)

 

 

| Group Index | # of Features | Dataset | Attribute Property | Reference | Note |
|---|---|---|---|---|---|
| 1 | 20 | Gene Expression | Real value in [-1, 1] | [8] | Co-Expressed Score |
| 2 | 21 | GO Molecular Function | {1, 0} | [7, 15] | Co-Function Score |
| 3 | 33 | GO Biological Process | {1, 0} | [7, 15] | Co-Process Score |
| 4 | 23 | GO Component | {1, 0} | [7, 15] | Co-Location Score |
| 5 | 1 | Protein Expression | Real value, non-negative | [9] | Co-Expressed Score |
| 6 | 1 | Essentiality | {2, 1, 0} | [14] | |
| 7 | 1 | HMS_PCI Mass | {1, 0} | [5, 3] | Matrix model for co-complex and co-pathway prediction. Spoke model for direct PPI prediction [3] |
| 8 | 1 | TAP Mass | {1, 0} | [4, 3] | Same as group 7 |
| 9 | 1 | Y2H | {1, 0} | [1, 2, 3] | |
| 10 | 1 | Synthetic Lethal | {1, 0} | [10, 13] | |
| 11 | 1 | Gene Neighborhood / Gene Fusion / Gene Co-occur | {1, 0} | [10] | |
| 12 | 1 | Sequence Similarity | Real value, non-negative | [15] | |
| 13 | 4 | Homology based PPI | Discrete, non-negative (mostly 0 or 1) | [15, 16] | |
| 14 | 1 | Domain-Domain Interaction | Real value in [0, 1] | [11] | Co-Domain Score |
| 15 | 16 | Protein-DNA TF group binding | Discrete, non-negative (mostly 0) | [6] | Co-Binding Score |
| 16 | 25 | MIPS Protein Class | {1, 0} | [17] | |
| 17 | 11 | MIPS Mutant Phenotype | {1, 0} | [17] | |

 

 

 

 

→ [new] The related positive reference sets and feature sources have been updated rapidly over the years, so sharing only the extracted feature files or partial prediction scores is no longer sufficient.

Here I share the code and related files used to generate our feature set: Download (both summary and detailed encodings).

 

The general framework and the code should remain useful. To improve the results, try to obtain more recent versions of the related evidence sets.

 

 

 

> Details about Each Attribute

 

1.             Gene Expression Data

The gene expression data were obtained from ref. [8] and comprise 20 gene expression datasets recorded under more than 500 conditions (each measuring a time-series expression profile). The data were downloaded from http://www.psrg.lcs.mit.edu/Networks/data/expressionData.txt. For each pair of proteins, we can compute either one global similarity score (under "summary" encoding) or 20 distinct scores (under "detail" encoding).

·        In summary encoding, we calculated, for each pair, the Pearson correlation coefficient over all conditions and used it as a single attribute.

·        In detail encoding, we split the 500+ conditions into 20 subsets according to their experimental sources and conditions, following the criteria given in http://www.psrg.lcs.mit.edu/Networks/data/expressionData.txt. We then calculated the Pearson correlation coefficient within each subset, obtaining 20 features for this group.
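The two encodings above can be sketched as follows. This is a minimal illustration, not the released code: the function names, the toy profiles, and the way the subsets are given as index slices are all assumptions, and returning 0.0 for a constant profile is a convention chosen here only to avoid division by zero.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two 1-D expression profiles.

    Returns 0.0 when either profile is constant (an assumption made
    here so that the correlation is always defined).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def summary_feature(profile_a, profile_b):
    """Summary encoding: one correlation over all conditions."""
    return pearson(profile_a, profile_b)

def detail_features(profile_a, profile_b, dataset_slices):
    """Detail encoding: one correlation per dataset (subset of conditions)."""
    return [pearson(profile_a[s], profile_b[s]) for s in dataset_slices]

# Toy example: 6 conditions split into 2 hypothetical datasets.
a = np.array([1.0, 2.0, 3.0, 3.0, 2.0, 1.0])
b = np.array([2.0, 4.0, 6.0, 1.0, 2.0, 3.0])
slices = [slice(0, 3), slice(3, 6)]

summary = summary_feature(a, b)          # single attribute
detail = detail_features(a, b, slices)   # one attribute per dataset
```

With the real data, `slices` would hold 20 entries, one per dataset, so `detail` would contain the 20 features of this group.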

 

2.             SGD’s Gene Ontology (Co-function, Co-process, and Co-localization)

Gene Ontology (GO) based information was downloaded from SGD [7] and includes:

-         molecular function of a gene product,

-         biological process in which the gene product participates,

-         cellular component where the gene product is located.

·        In summary encoding, for each protein pair and each of the three GO hierarchies, we use as the feature a value reflecting how often both proteins fall in the same category, which gives three attribute values. We treat the functional catalog as a hierarchical tree of functional classes. Each protein either is or is not a member of each functional class, so each protein defines a "subtree" of the overall tree of classes. The "functional similarity" between two proteins is defined as the frequency at which the intersection tree of the two proteins occurs in the distribution. Intuitively, the intersection tree represents the functions shared by the two proteins. Finally, a single real value is derived to represent this similarity for each protein pair.

·        In detail encoding, we generate each GO feature as a binary feature {0, 1}: "1" means that both proteins share the same function/component/process; "0" means otherwise. There are 34 types of process, 22 types of function, and 24 types of component features. Each class is mapped to one binary variable ("attribute").
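As a rough sketch of the detail encoding, each GO class becomes one binary attribute that is 1 only when both proteins are annotated to that class. The class names and annotation sets below are made up for illustration. The `shared_class_count` helper is only a simplified stand-in for "how many times both are in the same category"; it does not implement the frequency-based intersection-tree similarity described for the summary encoding.

```python
def go_detail_features(annot_a, annot_b, classes):
    """One binary feature per GO class: 1 if both proteins belong to it."""
    return [1 if (c in annot_a and c in annot_b) else 0 for c in classes]

def shared_class_count(annot_a, annot_b):
    """Number of classes shared by the two proteins (simplified summary)."""
    return len(set(annot_a) & set(annot_b))

# Hypothetical class list and annotations, for illustration only.
classes = ["transport", "metabolism", "signal transduction"]
annot_a = {"transport", "metabolism"}
annot_b = {"metabolism", "signal transduction"}

features = go_detail_features(annot_a, annot_b, classes)
shared = shared_class_count(annot_a, annot_b)
```

The order of `classes` fixes the order of the attributes, so the same list has to be used for every protein pair.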

 

3.             Protein expression data

Ghaemmaghami et al. [13] presented experimental protein abundance data for yeast. Since this data set covers only one condition, we used the absolute difference between the two proteins' abundance values as our protein co-expression attribute.

·        In summary encoding: Because the data contain only one condition, we use the absolute difference of the protein expression values.

·        In detail encoding: We use the same absolute difference, so the detailed encoding is identical to the summary encoding for this feature.
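A minimal sketch of this attribute, assuming the abundance values are held in a dictionary keyed by protein name. The names and values are invented, and returning `None` when a protein has no measurement is an assumption of this sketch, since the text does not say how missing values are handled.

```python
def protein_coexpression(abundance, a, b):
    """Absolute difference of two proteins' abundance values.

    Returns None if either protein has no measurement (an assumption).
    """
    if a not in abundance or b not in abundance:
        return None
    return abs(abundance[a] - abundance[b])

# Hypothetical abundance values, for illustration only.
abundance = {"P1": 1200.0, "P2": 450.0}
diff = protein_coexpression(abundance, "P1", "P2")
missing = protein_coexpression(abundance, "P1", "P3")
```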

 

 

4.             Essentiality

1106 ORFs are listed in the essential ORF list, downloaded from www.sequence.stanford.edu/group/yeast_deletion_project/Essential_ORFs.txt. Following the advice of the authors of this data set, we assume that any ORF not listed is nonessential (N). An essential (E) gene is one that cannot be made into a haploid or homozygous deletion strain. The co-essentiality feature is a three-valued categorical feature: 0 means that one protein is essential and the other is not (NE/EN), 1 means that both are nonessential (NN), and 2 means that both are essential (EE).

·        In summary encoding: This is a one value feature.

·        In detail encoding: This is a one value feature. Here the detailed encoding is the same as the summary encoding for this feature.
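The three-valued coding can be written compactly as below. This is a sketch under the stated rule that any ORF absent from the essential list counts as nonessential; the ORF identifiers are placeholders, not real entries from the list.

```python
def co_essentiality(essential_orfs, a, b):
    """Return 2 if both ORFs are essential (EE), 1 if neither is (NN),
    and 0 if exactly one is (NE/EN)."""
    n_essential = int(a in essential_orfs) + int(b in essential_orfs)
    return {0: 1, 1: 0, 2: 2}[n_essential]

# Placeholder identifiers, for illustration only.
essential_orfs = {"ORF_A", "ORF_B"}
ee = co_essentiality(essential_orfs, "ORF_A", "ORF_B")
nn = co_essentiality(essential_orfs, "ORF_C", "ORF_D")
ne = co_essentiality(essential_orfs, "ORF_A", "ORF_C")
```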

 

 

5.             High throughput direct PPI data set

Two types of high-throughput direct interaction data were used, (1) data derived from mass spectrometry and (2) data from Y2H screens:

-         Mass spectrometry data: These experiments use individual proteins as "hooks" to biochemically purify protein complexes. The identity of the proteins in these complexes is then determined by mass spectrometry. TAP [4] (tandem affinity purification) and HMS-PCI [5] (high-throughput mass-spectrometry protein complex identification) are two of the protocols used for this technique. Both protocols may miss true complexes when the affinity is weak or transient, when the tagged protein is misfolded, or when its ability to interact is disturbed by the tag. We used TAP and HMS-PCI as separate attributes. To convert complex membership to interaction pairs, we use the spoke model [3] for the direct protein-protein interaction prediction task, resulting in 3224 pairs for TAP (spoke) and 3618 pairs for HMS-PCI. For the other two tasks, we use the matrix model for these two mass spectrometry features.

-         Y2H (yeast two-hybrid screen) data: In the yeast two-hybrid system, candidate pairs of proteins are expressed in yeast as two separate fusion (hybrid) proteins that are brought together by the DNA-mediated interaction of the fusion proteins. This method therefore requires that the two test proteins be capable of interacting in the environment of the nucleus, so some proteins that are natively localized in other compartments of the cell may fail to interact. 5614 Y2H interactions were downloaded from [3].

·        In summary encoding: For each high-throughput experiment, the values are taken directly from the experimental results; no further computation is applied.

·        In detail encoding: The values are likewise taken directly from the experiments, so the detailed encoding is identical to the summary encoding for this feature.
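The difference between the spoke and matrix models can be illustrated with a short sketch. It assumes each purification is given as a bait protein plus a list of co-purified preys, and it treats pairs as unordered; the function names and the toy complex are hypothetical.

```python
from itertools import combinations

def spoke_pairs(bait, preys):
    """Spoke model: the bait is paired with each prey it pulled down."""
    return {tuple(sorted((bait, p))) for p in preys if p != bait}

def matrix_pairs(bait, preys):
    """Matrix model: every pair of proteins in the purified complex."""
    members = sorted(set(preys) | {bait})
    return set(combinations(members, 2))

# Toy purification: bait "A" pulls down "B", "C" and "D".
spoke = spoke_pairs("A", ["B", "C", "D"])
matrix = matrix_pairs("A", ["B", "C", "D"])
```

For a complex with one bait and n preys, the spoke model gives n pairs while the matrix model gives (n + 1) * n / 2 pairs, which is why the matrix model is the more inclusive choice for the co-complex and co-pathway tasks.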

 

6.             Synthetic Lethal

The synthetic lethal data, encoded as a {0, 1} discrete feature for each pair, were derived from the union of the following data sets:

-         295 synthetic lethal interactions from the first high-throughput study on genetic interactions in yeast [13a]

-         591 synthetic lethal interactions parsed from MIPS, downloaded from http://mips.gsf.de/proj/yeast/tabels/interaction/genetic_interact.html

-         A genetic interaction network containing approximately 1000 genes and approximately 4000 interactions [13]

·        In summary encoding: The values are taken directly from the experimental results; no further computation is applied.

·        In detail encoding: The values are likewise taken directly from the experiments, so the detailed encoding is identical to the summary encoding for this feature.
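Taking the union of several interaction lists and turning it into a binary feature might look like the sketch below. It assumes each source is a list of gene pairs and that synthetic lethal pairs are unordered; the gene names and source lists are invented for illustration.

```python
def normalize(pairs):
    """Convert pairs to unordered form and drop self-pairs."""
    return {frozenset(p) for p in pairs if len(set(p)) == 2}

def synthetic_lethal_feature(sources, a, b):
    """Return 1 if the pair appears in any source list, otherwise 0."""
    union = set().union(*(normalize(s) for s in sources))
    return 1 if frozenset((a, b)) in union else 0

# Hypothetical source lists, for illustration only.
source_1 = [("G1", "G2")]
source_2 = [("G2", "G1"), ("G3", "G4")]
sources = [source_1, source_2]

hit = synthetic_lethal_feature(sources, "G4", "G3")
miss = synthetic_lethal_feature(sources, "G1", "G3")
```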

 

 

7.