Attributes Set
We have used 15 different attributes for classifying protein pairs as interacting or not. These can be divided into three main categories: direct experimental data sets, indirect high-throughput data sets, and sequence-based data sources. Note that different attributes may contain different types of data (binary, continuous or categorical), and that their missing values must be handled.
Table S2. Our Attributes Set
Data Type | Dataset | Reference | Note
Direct experimental data set | Y2H | [1, 2, 3] |
Direct experimental data set | TAP Mass | [4, 3] | Spoke model
Direct experimental data set | HMS_PCI Mass | [5, 3] | Spoke model
Indirect high throughput data sets | Protein-DNA Binding | [6] | Co-Binding Score
Indirect high throughput data sets | Gene Expression | [8] | Co-Expressed Score
Indirect high throughput data sets | Protein Expression | [9] | Co-Expressed Score
Indirect high throughput data sets | Synthetic Lethal | [10] |
Indirect high throughput data sets | GO Molecular Function | [7] | Co-Function Score
Indirect high throughput data sets | GO Biological Process | [7] | Co-Process Score
Indirect high throughput data sets | Syn-expression | [10] |
Sequence based data sources | Domain-Domain Interaction | [11] | Co-Domain Score
Sequence based data sources | Gene Neighborhood | [10] |
Sequence based data sources | Gene Fusion | [10] |
Sequence based data sources | Gene Co-occur | [10] |
In the yeast two-hybrid (Y2H) system, pairs of proteins are expressed as fusion (hybrid) proteins in yeast. This method requires both proteins to be present in the nucleus, so interactions involving proteins localized to other cellular compartments may be missed. We downloaded 5614 Y2H interactions from [3]. In our data set, we also account for pairs whose Y2H value is missing.
Mass spectrometry experiments use individual proteins as ‘hooks’ to biochemically purify whole protein complexes. These complexes are then separated and identified by mass spectrometry. TAP [4] (tandem affinity purification) and HMS-PCI [5] (high-throughput mass-spectrometry protein complex identification) are two of the protocols used for this technique. Two drawbacks are that tagging may disturb complex formation and that loosely associated components may be washed off during purification.
We used TAP and HMS-PCI as separate attributes. To transform complex membership into interaction pairs, we used the spoke model [3], resulting in 3224 TAP (spoke) pairs and 3618 HMS-PCI pairs, both downloaded from [3].
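The spoke model can be illustrated with a minimal sketch. It assumes each purification is given as a bait protein plus the list of proteins pulled down with it, and it links the bait to every co-purified protein but does not link those proteins to each other. The function names and protein identifiers are hypothetical and are not taken from [3].

```python
def spoke_pairs(bait, preys):
    """Return the bait-prey pairs implied by one purification under the spoke model."""
    # Store each pair as a sorted tuple so (A, B) and (B, A) count as the same pair.
    return {tuple(sorted((bait, prey))) for prey in preys if prey != bait}


def complexes_to_pairs(purifications):
    """Union of spoke pairs over many (bait, preys) purifications."""
    pairs = set()
    for bait, preys in purifications:
        pairs |= spoke_pairs(bait, preys)
    return pairs


# Hypothetical purifications: bait A pulls down B, C, D; bait B pulls down A, E.
purifications = [("A", ["B", "C", "D"]), ("B", ["A", "E"])]
pairs = complexes_to_pairs(purifications)
```

Here the pair (A, B) is reported by both purifications but counted once, and C and D are not paired with each other because neither was the bait.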
mRNA expression data was obtained from [8]. An Excel sheet describing the 500+ expression experiments whose results we used is available here. For each pair, we calculated the Pearson correlation between the two genes' expression profiles and used it as our attribute.
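As a rough illustration of the co-expression score, the sketch below computes the Pearson correlation between two expression profiles. Skipping experiments in which either gene has no measurement, and returning None when the correlation is undefined, are assumptions made for this example and not a description of the original pipeline.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; None if undefined."""
    # Keep only experiments where both genes have a measurement (assumption).
    paired = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(paired)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in paired) / n
    mean_y = sum(b for _, b in paired) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in paired)
    var_x = sum((a - mean_x) ** 2 for a, _ in paired)
    var_y = sum((b - mean_y) ** 2 for _, b in paired)
    if var_x == 0 or var_y == 0:
        return None  # a constant profile has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log-ratio profiles over four experiments.
gene_a = [0.1, 0.5, None, 1.2]
gene_b = [0.2, 0.9, 0.4, 2.0]
score = pearson(gene_a, gene_b)
```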
Another gene-expression-related feature is syn-expression, taken from [10]. This data set is based on two large, genome-wide surveys of mRNA expression levels in yeast: the Rosetta compendium and a study of the mitotic cell cycle. All mRNA levels were converted to log-ratios and normalized. In [10] the two data sets were fused, yielding 317 measurements per gene. All pairs with a similarity above a given cutoff (0.675 [10]) were connected by a putative interaction.
We downloaded the protein-DNA binding data from [6]. As in that paper, we used a p-value cutoff of 0.001 to determine binding. For each pair of proteins, we counted the number of transcription factors that bind to both proteins and used this number as one of our attributes.
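The co-binding count can be sketched as follows. The nested-dictionary layout (transcription factor, then gene, then p-value), the strict `<` comparison against the cutoff, and all identifiers are assumptions made for illustration.

```python
def co_binding_count(pvalues, gene_a, gene_b, cutoff=0.001):
    """Count transcription factors that bind both genes at the given p-value cutoff."""
    count = 0
    for by_gene in pvalues.values():
        p_a = by_gene.get(gene_a)
        p_b = by_gene.get(gene_b)
        # A factor counts only if it has a significant p-value for both genes.
        if p_a is not None and p_b is not None and p_a < cutoff and p_b < cutoff:
            count += 1
    return count


# Hypothetical binding p-values: pvalues[tf][gene].
pvalues = {
    "TF1": {"G1": 0.0001, "G2": 0.0005},
    "TF2": {"G1": 0.0002, "G2": 0.2},
    "TF3": {"G1": 0.00001, "G2": 0.00002},
}
shared = co_binding_count(pvalues, "G1", "G2")
```

In this toy input, TF1 and TF3 bind both genes, while TF2 binds only G1, so the attribute value is 2.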
GO values were downloaded from SGD [7]. There are three hierarchies in GO: molecular function, biological process, and cellular component.
After selecting a level in each hierarchy, we determined for each pair whether the two proteins fall in the same category in that hierarchy, and used these three values as attributes.
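A minimal sketch of the three GO attributes, assuming each protein has already been mapped to its set of categories at the chosen level of each hierarchy. Returning None for an unannotated protein is an assumption made here to represent a missing value; the data layout and identifiers are hypothetical.

```python
HIERARCHIES = ("molecular_function", "biological_process", "cellular_component")


def same_category(categories_a, categories_b):
    """1 if the proteins share a category, 0 if not, None if either is unannotated."""
    if not categories_a or not categories_b:
        return None
    return 1 if set(categories_a) & set(categories_b) else 0


def go_attributes(annotations, protein_a, protein_b):
    """Return one same-category value per GO hierarchy for a protein pair."""
    result = {}
    for hierarchy in HIERARCHIES:
        cats_a = annotations.get(protein_a, {}).get(hierarchy)
        cats_b = annotations.get(protein_b, {}).get(hierarchy)
        result[hierarchy] = same_category(cats_a, cats_b)
    return result


# Hypothetical annotations at the selected level of each hierarchy.
annotations = {
    "P1": {"molecular_function": {"kinase"}, "biological_process": {"cell cycle"}},
    "P2": {"molecular_function": {"kinase"}, "biological_process": {"transport"},
           "cellular_component": {"nucleus"}},
}
attrs = go_attributes(annotations, "P1", "P2")
```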
Deng et al. [11] used maximum likelihood estimation to infer interacting domains based on sequence analysis. They use separate yeast two-hybrid protein data and treat each protein sequence as a “bag of domains”. We used their derived domain-domain interaction probability as an attribute.
Ghaemmaghami et al. [13] presented experimental protein abundance data for yeast. We used the absolute difference between the abundances of the two proteins in a pair as our protein co-expression attribute.
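The abundance-based attribute amounts to a one-line computation, sketched below. Whether the original values were raw or transformed abundances is not stated, so the numbers are hypothetical, and returning None when either protein lacks a measurement is an assumption.

```python
def abundance_difference(abundance, protein_a, protein_b):
    """Absolute abundance difference for a pair; None if either value is missing."""
    a = abundance.get(protein_a)
    b = abundance.get(protein_b)
    if a is None or b is None:
        return None
    return abs(a - b)


# Hypothetical abundance values (arbitrary units).
abundance = {"P1": 1500.0, "P2": 400.0}
diff = abundance_difference(abundance, "P1", "P2")
missing = abundance_difference(abundance, "P1", "P3")
```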
The following attributes were derived from von Mering et al. [10]:
· Conserved gene neighborhood: 42 sequenced genomes were searched for instances of conserved neighborhood between genes.
· Co-occurrence of genes: For each entry in the orthology database COG, the pattern of occurrence among 42 completely sequenced genomes was determined. Then [10] compared these patterns and recorded a putative functional interaction between those proteins whose mutual information was higher than 0.5 (close matches to the 13 most frequent patterns were ignored, as they are mostly phylogenetic).
· Gene fusion events: These were detected by the presence of a gene in more than one COG cluster. Single fusion events were not considered significant.
These were downloaded from [10].
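To make the co-occurrence criterion concrete, the sketch below computes the mutual information between two presence/absence patterns across genomes. It is an illustration only: the use of base-2 logarithms (bits) and of plain empirical frequencies is an assumption, and [10] may have used a different estimator or unit.

```python
import math
from collections import Counter


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two presence/absence profiles."""
    n = len(profile_a)
    if n == 0 or n != len(profile_b):
        raise ValueError("profiles must be non-empty and of equal length")
    joint = Counter(zip(profile_a, profile_b))
    count_a = Counter(profile_a)
    count_b = Counter(profile_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # Term p(a, b) * log2(p(a, b) / (p(a) * p(b))) using empirical frequencies.
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical occurrence patterns over eight genomes (1 = present, 0 = absent).
cog_x = [1, 1, 0, 0, 1, 0, 1, 0]
cog_y = [1, 1, 0, 0, 1, 0, 1, 0]
mi_xy = mutual_information(cog_x, cog_y)
linked = mi_xy > 0.5
```

Identical, balanced patterns give the maximum of 1 bit, while statistically independent patterns give 0.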
As pointed out in the paper, biological evidence data inherently suffer from missing values. Here we present the coverage of each attribute used in our feature set (2,395,456 pairs in total; see the ReferenceSet page for details).
Table 1. Coverage for each attribute within our data set
Attribute No. | Dataset | Coverage (= 1 - missing_value_percentage)
1 | Protein-DNA Binding | 1.0000
2 | Domain-Domain Interaction | 0.0064
3 | Gene Expression | 0.9586
4 | GO Molecular Function | 0.5188
5 | GO Biological Process | 0.6305
6 | GO Cellular Component | 0.7852
7 | Y2H | 0.3382
8 | TAP Mass | 0.0468
9 | HMS_PCI Mass | 0.0625
10 | Gene Neighborhood | 1.0000
11 | Gene Fusion | 1.0000
12 | Gene Co-occur | 1.0000
13 | Synthetic Lethal | 1.0000
14 | Syn-expression | 1.0000
15 | Protein Expression | 0.3696
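Coverage as defined in the table header can be computed with a short sketch like the following, assuming the feature set is held as one list of values per attribute with None marking a missing value. The attribute names and values are hypothetical.

```python
def coverage(values):
    """Fraction of pairs that have a non-missing value for one attribute."""
    if not values:
        raise ValueError("at least one pair is required")
    observed = sum(1 for v in values if v is not None)
    return observed / len(values)


# Hypothetical feature columns over four protein pairs; None marks a missing value.
features = {
    "gene_expression": [0.31, None, -0.12, 0.80],
    "tap_mass": [None, None, None, 1],
}
coverage_by_attribute = {name: coverage(col) for name, col in features.items()}
```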
References:
1. Uetz P, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403(6770):623-7 (2000)
2. Ito T, et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98(8):4569-74 (2001)
3. Bader GD, Hogue CWV. Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 20:991-997 (2002)
4. Gavin AC, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415(6868):141-7 (2002)
5. Ho Y, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415(6868):180-3 (2002)
7. Gene Ontology, http://www.geneontology.org/ (2004)
8. Bar-Joseph Z, Gerber G, Lee T, Rinaldi N, Yoo J, Robert F, Gordon B, Fraenkel E, Jaakkola T, Young R, Gifford D. Computational discovery of gene modules and regulatory networks. Nature Biotechnology 21(11):1337-42 (2003)
10. von Mering C, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399-403 (2002)
12. Saccharomyces Genome Database (SGD), http://www.yeastgenome.org (2004)
13. Ghaemmaghami S, et al. Global analysis of protein expression in yeast. Nature 425(6959):737-41 (2003)