Table S1. A total of 162 features are grouped into 17 categories. Two styles of
feature encoding are used, which produce feature vectors of very different
sizes. The second column lists the number of features in each category under
the "detailed" encoding. Under the "summary" encoding, each category
contributes a single feature. (Table 3 in the paper presents the coverage of
each attribute group used in our feature set.)
| Group Index | # of features | Dataset | Attribute Property | Reference | Note |
| 1 | 20 | Gene Expression | Real value: [-1, 1] | [8] | Co-Expressed Score |
| 2 | 21 | GO Molecular Function | | [7, 15] | Co-Function Score |
| 3 | 33 | GO Biological Process | | [7, 15] | Co-Process Score |
| 4 | 23 | GO Component | | [7, 15] | Co-Location Score |
| 5 | 1 | Protein Expression | Real Value – Non Negative | [9] | Co-Expressed Score |
| 6 | 1 | Essentiality | | [14] | |
| 7 | 1 | HMS_PCI Mass | | [5, 3] | Matrix model for co-complex and co-pathway prediction. Spoke model for direct PPI prediction [3] |
| 8 | 1 | TAP Mass | | [4, 3] | |
| 9 | 1 | Y2H | | [1, 2, 3] | |
| 10 | 1 | Synthetic Lethal | | [10, 13] | |
| 11 | 1 | Gene Neighborhood / Gene Fusion / Gene Co-occur | | [10] | |
| 12 | 1 | Sequence Similarity | Real value - Non negative | [15] | |
| 13 | 4 | Homology based PPI | Discrete: Non-negative (Most 0, 1) | [15, 16] | |
| 14 | 1 | Domain-Domain Interaction | Real value between [0, 1] | [11] | Co-Domain Score |
| 15 | 16 | Protein-DNA TF group binding | Non-negative discrete, most 0 | [6] | Co-Binding Score |
| 16 | 25 | MIPS Protein Class | | [17] | |
| 17 | 11 | MIPS Mutant Phenotype | | [17] | |
[New] Because the related positive reference sets and feature sources have been updated rapidly over the years, sharing only the extracted feature files or partial prediction scores is no longer sufficient. I therefore share here the code and related files used to generate our feature set: Download (both summary and detailed). The general framework and the code should still be useful, although you may want to look for more recent versions of the related evidence sets to improve on them.
The gene expression data were obtained from ref. [8] (downloaded from
http://www.psrg.lcs.mit.edu/Networks/data/expressionData.txt) and contain 20
gene expression datasets recorded under more than 500 conditions (each
measuring a time series expression profile). For each pair of proteins we can
compute either one global similarity score (under "summary" encoding) or 20
distinct scores (under "detailed" encoding).
· In summary encoding, we calculated, for each pair, the Pearson correlation
coefficient over all conditions and used it as one attribute.
· In detail encoding, we split the 500+ conditions into 20 subsets according to
their experimental sources and conditions, following the criteria given in
http://www.psrg.lcs.mit.edu/Networks/data/expressionData.txt. We then
calculated the Pearson correlation coefficient within each subset and
therefore obtained 20 features for this group.
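The two encodings of the gene expression feature can be illustrated with a minimal sketch. It assumes that each protein's expression profile is a plain numeric vector and that the 20 subsets are given as index slices; the function names, the toy profiles, and the choice to return 0.0 for a constant profile are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    # Assumption: a constant profile has no defined correlation, so use 0.0.
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def summary_feature(expr_a, expr_b):
    """Summary encoding: one correlation over all conditions."""
    return pearson(expr_a, expr_b)

def detailed_features(expr_a, expr_b, subset_slices):
    """Detailed encoding: one correlation per subset of conditions."""
    a = np.asarray(expr_a, dtype=float)
    b = np.asarray(expr_b, dtype=float)
    return [pearson(a[s], b[s]) for s in subset_slices]

# Toy example: 6 conditions split into 2 subsets (the real data use 20 subsets).
profile_a = [1, 2, 3, 3, 2, 1]
profile_b = [2, 4, 6, 1, 2, 3]
subsets = [slice(0, 3), slice(3, 6)]

summary = summary_feature(profile_a, profile_b)
detailed = detailed_features(profile_a, profile_b, subsets)
```

In this toy case the two subsets are perfectly correlated and perfectly anti-correlated respectively, while the single global score is only weakly positive, which shows how the detailed encoding can retain signal that the summary score averages away.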
Gene Ontology (GO) based information was downloaded from SGD [7] and includes:
- the molecular function of a gene product,
- the biological process in which the gene product participates,
- the cellular component where the gene product is located.
· In summary encoding, for each pair and for each of the three GO hierarchies,
we use as the feature the number of categories that contain both proteins.
This results in three attribute values. We treat the functional catalog as a
hierarchical tree of functional classes. Each protein either is or is not a
member of each functional class, so that each protein defines a "subtree" of
the overall hierarchical tree of classes. The "functional similarity" between
two proteins is defined as the frequency at which the intersection tree of the
two proteins occurs in the distribution. Intuitively, the intersection tree
represents the functions shared by the two proteins. Finally, a single real
value is derived to represent this similarity for a protein pair.
· In detail encoding, we generate each GO feature as a discrete feature.
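As a rough illustration of the two GO encodings for one hierarchy, the sketch below counts shared categories for the summary encoding and emits one discrete value per category for the detailed encoding. It assumes that each protein is annotated with a set of category names and that the detailed feature is a 0/1 indicator of shared membership; the text only says "discrete", so the indicator is an assumption, and the category names are made up. The sketch does not implement the intersection-tree frequency described above.

```python
def go_summary_feature(cats_a, cats_b):
    """Summary encoding: number of categories that contain both proteins."""
    return len(set(cats_a) & set(cats_b))

def go_detailed_features(cats_a, cats_b, category_order):
    """Detailed encoding (assumed form): one 0/1 value per category,
    set to 1 when both proteins belong to that category."""
    shared = set(cats_a) & set(cats_b)
    return [1 if c in shared else 0 for c in category_order]

# Hypothetical categories for one hierarchy (the real table lists 21, 33, and 23).
categories = ["kinase activity", "DNA binding", "transporter activity"]
protein_a = {"kinase activity", "DNA binding"}
protein_b = {"DNA binding", "transporter activity"}

go_summary = go_summary_feature(protein_a, protein_b)
go_detailed = go_detailed_features(protein_a, protein_b, categories)
```

Applied to each of the three hierarchies in turn, the summary function yields the three attribute values mentioned above, while the detailed function yields one vector per hierarchy.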
Ghaemmaghami et al. [13] presented experimental protein abundance data for
yeast. Since this data set covers expression under only one condition, we used
the absolute difference in abundance between the two proteins as our protein
co-expression attribute.
· In summary encoding: because the data cover only one condition, we use the
absolute difference of the protein expression values.
· In detail encoding: for the same reason, we also use the absolute difference
of the protein expression values, so the detailed encoding is the same as the
summary encoding for this feature.
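The protein co-expression attribute is a one-line computation, sketched here for completeness. The dictionary layout, the placeholder protein names, and the decision to return None when an abundance value is missing are assumptions made for illustration; the source does not say how missing measurements were handled.

```python
def protein_coexpression(orf_a, orf_b, abundance):
    """Absolute difference of the two proteins' abundance values.

    Returns None when either protein has no measurement (assumed behavior).
    """
    a = abundance.get(orf_a)
    b = abundance.get(orf_b)
    if a is None or b is None:
        return None
    return abs(a - b)

# Hypothetical abundance values keyed by placeholder protein names.
abundance = {"P1": 1200.0, "P2": 450.0}

coexpr = protein_coexpression("P1", "P2", abundance)
missing = protein_coexpression("P1", "P3", abundance)
```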
The essential ORF list, downloaded from
www.sequence.stanford.edu/group/yeast_deletion_project/Essential_ORFs.txt,
contains 1106 ORFs. Following the advice of the authors of this data set, we
assume that any ORF not listed is nonessential (N). A gene is deemed essential
(E) if it cannot be made into a haploid or homozygous deletion strain. The
co-essential feature is a three-valued categorical feature: 0 means that
exactly one of the two proteins is essential (NE or EN), 1 means that both are
nonessential (NN), and 2 means that both are essential (EE).
· In summary encoding: this is a single-valued feature.
· In detail encoding: this is a single-valued feature; the detailed encoding is
the same as the summary encoding for this feature.
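The three-valued co-essential feature can be sketched as follows, assuming the essential ORF list has been loaded into a Python set. The ORF identifiers below are placeholders, not entries from the actual list.

```python
def co_essential(orf_a, orf_b, essential_orfs):
    """Encode a pair as 0 (one essential, one not), 1 (both nonessential),
    or 2 (both essential). ORFs absent from the list count as nonessential."""
    n_essential = (orf_a in essential_orfs) + (orf_b in essential_orfs)
    # Map the count of essential members (0, 1, 2) to the feature code.
    return {0: 1, 1: 0, 2: 2}[n_essential]

# Placeholder identifiers standing in for the 1106 listed ORFs.
essential = {"ORF_1", "ORF_2"}

code_mixed = co_essential("ORF_1", "ORF_9", essential)
code_none = co_essential("ORF_8", "ORF_9", essential)
code_both = co_essential("ORF_1", "ORF_2", essential)
```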
Two types of high-throughput direct data were used: (1) data derived from mass
spectrometry and (2) data from Y2H screens.
- Mass spectrometry data: These experiments use individual proteins as ‘hooks’
to biochemically purify protein complexes. The identity of the proteins in
these complexes is then determined by mass spectrometry. TAP [4] (tandem
affinity purification) and HMS-PCI [5] (high-throughput mass-spectrometry
protein complex identification) are two of the protocols used for this
technique. Both protocols may miss true complexes when the affinity is weak or
transient, or when the tagged protein is misfolded or its interaction
capability is disturbed by the tag. We used TAP and HMS-PCI as separate
attributes. To convert complex memberships to interaction pairs, we use the
spoke model [3] for the direct protein-protein interaction prediction task,
resulting in 3224 pairs for TAP (spoke) and 3618 pairs for HMS-PCI. For the
other two tasks (co-complex and co-pathway prediction), we employed the matrix
model for these two mass spectrometry features.
- Y2H (yeast two-hybrid screen) data: In the yeast two-hybrid system, potential
pairs of proteins are expressed as two separate fusion (hybrid) proteins in
yeast that are brought together by the DNA-mediated interaction of the fusion
proteins. This method therefore requires that the two test proteins are
capable of interacting in the environment of the nucleus, so some proteins
that are natively localized in other compartments of the cell may fail to
interact. 5614 Y2H interactions were downloaded from [3].
· In summary encoding: for each high-throughput experiment, the feature values
are taken directly from the experimental results; no further calculation is
performed.
· In detail encoding: the values are likewise taken directly from the
experiments, so the detailed encoding is the same as the summary encoding for
this feature.
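The difference between the spoke and matrix models used for the mass spectrometry data can be made concrete with a short sketch. It assumes each purification is represented as a bait protein plus a list of co-purified prey proteins; the protein names are placeholders and the functions are illustrative, not the original code.

```python
from itertools import combinations

def spoke_pairs(bait, preys):
    """Spoke model: pair the bait with each co-purified prey only."""
    return {frozenset((bait, p)) for p in preys if p != bait}

def matrix_pairs(bait, preys):
    """Matrix model: pair every member of the purification with every other."""
    members = set(preys) | {bait}
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}

# One hypothetical purification: bait "A" pulls down "B", "C", and "D".
spoke = spoke_pairs("A", ["B", "C", "D"])
matrix = matrix_pairs("A", ["B", "C", "D"])
```

For a purification with n members the spoke model yields n - 1 pairs and the matrix model n(n - 1)/2, so the matrix model is the more permissive of the two, which fits its use for the co-complex and co-pathway tasks rather than for direct interaction prediction.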
The synthetic lethal data comprise:
- 295 synthetic lethal interactions from the first high-throughput study on
genetic interactions in yeast [13a],
- 591 synthetic lethal interactions parsed from MIPS, downloaded from http://mips.gsf.de/proj/yeast/tabels/interaction/genetic_interact.html,
- a genetic interaction network containing approximately 1000 genes and
approximately 4000 interactions [13].
· In summary encoding: the feature values are taken directly from the
experimental results; no further calculation is performed.
· In detail encoding: the values are likewise taken directly from the
experiments, so the detailed encoding is the same as the summary encoding for
this feature.