Attribute Set

 

            We used 15 different attributes to classify protein pairs as interacting or not. These fall into three main categories: direct experimental data sets, indirect high-throughput data sets, and sequence-based data sources. Note that different attributes contain different types of data (binary, continuous, or categorical), and that missing values must be handled for each of them.

                                                               

Table S2. Our Attribute Set

Data Type                            Dataset                     Reference    Note
Direct experimental data set         Y2H                         [1, 2, 3]
                                     TAP Mass                    [4, 3]       Spoke model
                                     HMS_PCI Mass                [5, 3]       Spoke model
Indirect high throughput data sets   Protein-DNA Binding         [6]          Co-Binding Score
                                     Gene Expression             [8]          Co-Expressed Score
                                     Protein Expression          [9]          Co-Expressed Score
                                     Synthetic Lethal            [10]
                                     GO Molecular Function       [7]          Co-Function Score
                                     GO Biological Process       [7]          Co-Process Score
                                     Syn-expression              [10]
Sequence based data sources          Domain-Domain Interaction   [11]         Co-Domain Score
                                     Gene Neighborhood           [10]
                                     Gene Fusion                 [10]
                                     Gene Co-occur               [10]


Details About Each Attribute

1.      Yeast two-hybrid data

In the yeast two-hybrid (Y2H) system, pairs of proteins are expressed as fusion (hybrid) proteins in yeast. Note that this method requires both proteins to be present in the nucleus, so some proteins that are localized to other cellular compartments may be missed. We downloaded 5614 Y2H interactions from [3]. In our data set, we also account for missing values in this attribute.

 

2.      Mass spectrometry data

These experiments use individual proteins as ‘hooks’ to biochemically purify whole protein complexes. The complexes are then separated and identified by mass spectrometry. TAP [4] (tandem affinity purification) and HMS-PCI [5] (high-throughput mass-spectrometry protein complex identification) are two of the protocols used for this technique. Two drawbacks are that tagging may disturb complex formation and that loosely associated components may be washed off during purification.

We used TAP and HMS-PCI as separate attributes. To transform complex membership into interaction pairs, we used the spoke model [3], which yields 3224 pairs for TAP and 3618 pairs for HMS-PCI. Both sets of pairs were downloaded from [3].
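As an illustration of the spoke model, the sketch below pairs each bait protein with every prey it pulled down and creates no prey-prey pairs. It is a minimal example, not the procedure used to build the downloaded pair lists; the function name `spoke_pairs`, the input layout (bait mapped to a list of preys), and the ORF names are assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple


def spoke_pairs(purifications: Dict[str, Iterable[str]]) -> Set[Tuple[str, ...]]:
    """Expand bait -> preys purifications into unordered bait-prey pairs.

    Hypothetical helper: under the spoke model only bait-prey pairs are
    created; preys from the same purification are NOT paired with each other.
    """
    pairs = set()
    for bait, preys in purifications.items():
        for prey in preys:
            if prey != bait:  # ignore self-pairs
                # Sort so that (A, B) and (B, A) collapse into one pair.
                pairs.add(tuple(sorted((bait, prey))))
    return pairs


# Toy input (made-up purifications, for illustration only).
example = {
    "YAL001C": ["YBR123C", "YDR362C"],
    "YBR123C": ["YAL001C"],
}
pairs = spoke_pairs(example)
```

In this toy input the two preys of the first purification are not linked to each other, which is what distinguishes the spoke model from a model that connects all members of a complex.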

 

3.      Gene expression data

mRNA expression data was obtained from [8]. An accompanying Excel sheet describes the 500+ expression experiments that we used. For each pair, we calculated the Pearson correlation between the two expression profiles and used it as our attribute.
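The co-expression attribute can be sketched as follows. This is a minimal example under stated assumptions: expression profiles are given as equal-length lists with `None` marking a missing measurement, only experiments measured for both genes are used, and the score is reported as missing (`None`) when fewer than two shared measurements exist or when one profile is constant. The function name `pearson` and these missing-value rules are illustrative and are not taken from the paper.

```python
import math
from typing import Optional, Sequence


def pearson(x: Sequence[Optional[float]],
            y: Sequence[Optional[float]]) -> Optional[float]:
    """Pearson correlation over experiments observed for both genes."""
    # Assumption: drop experiments where either gene has no measurement.
    paired = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(paired)
    if n < 2:
        return None  # not enough shared data: treat the attribute as missing
    mean_x = sum(a for a, _ in paired) / n
    mean_y = sum(b for _, b in paired) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in paired)
    sxx = sum((a - mean_x) ** 2 for a, _ in paired)
    syy = sum((b - mean_y) ** 2 for _, b in paired)
    if sxx == 0 or syy == 0:
        return None  # a constant profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)


r_up = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # perfectly correlated
r_down = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # perfectly anti-correlated
r_missing = pearson([1.0, None], [2.0, 3.0])        # only one shared experiment
```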

Another gene-expression-related feature is the syn-expression attribute from [10]. This dataset is based on two large, genome-wide surveys of mRNA expression levels in yeast: the Rosetta compendium and a study of the mitotic cell cycle. All mRNA levels were converted to log-ratios and normalized. In [10] the two datasets were fused, yielding 317 measurements per gene. All pairs with a similarity above a given cutoff (0.675 [10]) were connected by a putative interaction.

 

4.      Protein-DNA binding data

We downloaded the data from [6]. As in that paper, we used a p-value cutoff of 0.001 to determine binding. For each pair of proteins, we counted the number of transcription factors that bind the genes encoding both proteins, and used this count as one of our attributes.
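The counting step described above can be sketched as follows. The layout of the p-value table (`pvalues[tf][gene]`), the helper names, and the use of a strict `<` comparison at the cutoff are assumptions made for this example; the gene and factor names are made up.

```python
from typing import Dict, Set


def bound_by_from_pvalues(pvalues: Dict[str, Dict[str, float]],
                          cutoff: float = 0.001) -> Dict[str, Set[str]]:
    """Map each gene to the set of transcription factors that bind it.

    Assumption: pvalues[tf][gene] holds the binding p-value, and a factor is
    considered bound when its p-value is below the cutoff.
    """
    bound_by: Dict[str, Set[str]] = {}
    for tf, genes in pvalues.items():
        for gene, p in genes.items():
            if p < cutoff:
                bound_by.setdefault(gene, set()).add(tf)
    return bound_by


def co_binding_score(bound_by: Dict[str, Set[str]], gene_a: str, gene_b: str) -> int:
    """Number of transcription factors bound to both genes of a pair."""
    return len(bound_by.get(gene_a, set()) & bound_by.get(gene_b, set()))


# Toy p-values (illustrative only).
pvalues = {
    "TF1": {"G1": 0.0001, "G2": 0.0005, "G3": 0.2},
    "TF2": {"G1": 0.0002, "G2": 0.5},
}
bound_by = bound_by_from_pvalues(pvalues)
score_12 = co_binding_score(bound_by, "G1", "G2")  # TF1 binds both
score_13 = co_binding_score(bound_by, "G1", "G3")  # no shared factor
```

In this sketch a gene with no bound factor simply gets a count of 0 rather than a missing value.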

 

5.      GO ontology features (Co-function, Co-process, and Co-localization)

GO annotations [7] were downloaded from SGD [12]. There are three hierarchies in GO: molecular function, biological process, and cellular component.

            After selecting a level in each hierarchy, we determined, for each pair and each hierarchy, whether the two proteins are in the same category at that level, and used these three values as attributes.
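One way to express the same-category test for a single hierarchy is sketched below. It assumes that each protein has already been mapped to the set of GO categories it belongs to at the selected level, that sharing at least one category counts as "same category", and that a pair is marked missing (`None`) when either protein has no annotation in that hierarchy. These rules, the function name `same_category`, and the category labels are assumptions for illustration, not details given in the text.

```python
from typing import Dict, Optional, Set


def same_category(categories: Dict[str, Set[str]],
                  protein_a: str, protein_b: str) -> Optional[int]:
    """Return 1 if the proteins share a category, 0 if not, None if unknown."""
    cats_a = categories.get(protein_a)
    cats_b = categories.get(protein_b)
    if not cats_a or not cats_b:
        return None  # assumption: unannotated proteins give a missing value
    return 1 if cats_a & cats_b else 0


# Toy annotations at one chosen level of one hierarchy (made-up labels).
process_level = {
    "P1": {"transport"},
    "P2": {"transport", "metabolism"},
    "P3": {"cell cycle"},
}
co_process_12 = same_category(process_level, "P1", "P2")
co_process_13 = same_category(process_level, "P1", "P3")
co_process_14 = same_category(process_level, "P1", "P4")  # P4 is not annotated
```

Running the same function once per hierarchy would give the three values described above.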

           

6.      Domain-domain interaction feature

Deng et al. [11] used maximum likelihood estimation to infer interacting domains based on sequence analysis. They used separate yeast two-hybrid protein interaction data and treated each protein sequence as a “bag of domains”. We used their derived domain-domain interaction probability as an attribute.

 

7.      Protein expression data

Ghaemmaghami et al. [13] presented experimental protein abundance data for yeast. We used the absolute difference between the abundances of the two proteins in a pair as our protein co-expression attribute.
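A minimal sketch of this attribute is shown below. It assumes abundances are stored in a dictionary keyed by protein name and that a pair is treated as missing (`None`) when either protein has no measured abundance; the function name, the units, and the numbers are illustrative only.

```python
from typing import Dict, Optional


def abundance_difference(abundance: Dict[str, float],
                         protein_a: str, protein_b: str) -> Optional[float]:
    """Absolute difference in abundance, or None if either value is missing."""
    if protein_a not in abundance or protein_b not in abundance:
        return None  # assumption: unmeasured proteins give a missing value
    return abs(abundance[protein_a] - abundance[protein_b])


# Toy abundances (made-up numbers).
abundance = {"P1": 1200.0, "P2": 450.0}
diff_12 = abundance_difference(abundance, "P1", "P2")
diff_13 = abundance_difference(abundance, "P1", "P3")  # P3 was not measured
```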

 

8.      Features related with sequence information

These attributes were derived from von Mering, et al [10]. They contain the following:

·        Conserved gene neighborhood:

42 sequenced genomes were searched for instances of conserved neighborhood between genes.

·        Co-occurrence of genes:

For each entry in the orthology database COG, the pattern of occurrence among 42 completely sequenced genomes was recorded. Then [10] compared these patterns and recorded a putative functional interaction between those proteins whose mutual information was higher than 0.5 (close matches to the 13 most frequent patterns were ignored, as they are mostly phylogenetic).

·        Gene fusion events: These were detected by the presence of a gene in more than one COG cluster. Single fusion events were not considered significant.
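To make the co-occurrence criterion concrete, the sketch below computes the mutual information between two presence/absence patterns across genomes. It is an illustration only: the function name is hypothetical, the logarithm base (2, giving bits) is an assumption because the text does not state the unit of the 0.5 threshold, and the filtering of the most frequent patterns is not reproduced.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(pattern_a: Sequence[int], pattern_b: Sequence[int]) -> float:
    """Mutual information (in bits) of two presence/absence patterns."""
    if len(pattern_a) != len(pattern_b) or not pattern_a:
        raise ValueError("patterns must be non-empty and of equal length")
    n = len(pattern_a)
    joint = Counter(zip(pattern_a, pattern_b))  # counts of (a, b) combinations
    count_a = Counter(pattern_a)
    count_b = Counter(pattern_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy patterns over four genomes (1 = gene present, 0 = absent).
mi_identical = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])
mi_independent = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

With real data each pattern would have one entry per genome (42 in the study cited above), and pairs scoring above the chosen threshold would be recorded as putative interactions.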

 

9.      Synthetic Lethal

These were downloaded from [10].

 

 

 

Coverage of Each Attribute

As noted in the paper, biological data sources inherently suffer from missing values. Here we present the coverage of each attribute used in our feature set (2,395,456 pairs in total; see the ReferenceSet page for details).

 

Table 1. Coverage of each attribute within our data set

Attribute No.   Dataset                     Coverage (= 1 - missing_value_percentage)
1               Protein-DNA Binding         1.0000
2               Domain-Domain Interaction   0.0064
3               Gene Expression             0.9586
4               GO Molecular Function       0.5188
5               GO Biological Process       0.6305
6               GO Cellular Component       0.7852
7               Y2H                         0.3382
8               TAP Mass                    0.0468
9               HMS_PCI Mass                0.0625
10              Gene Neighborhood           1.0000
11              Gene Fusion                 1.0000
12              Gene Co-occur               1.0000
13              Synthetic Lethal            1.0000
14              Syn-expression              1.0000
15              Protein Expression          0.3696
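The coverage figures follow the definition in the table header: one minus the fraction of pairs whose value for the attribute is missing. A minimal sketch, assuming missing values are encoded as `None` (the encoding and the function name are assumptions for the example):

```python
from typing import Optional, Sequence


def coverage(values: Sequence[Optional[float]]) -> float:
    """Coverage of one attribute: 1 minus the fraction of missing values."""
    if not values:
        raise ValueError("need at least one pair")
    missing = sum(1 for v in values if v is None)
    return 1 - missing / len(values)


# Toy attribute column over four pairs; two values are missing.
toy_column = [1.0, None, 0.0, None]
toy_coverage = coverage(toy_column)
```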

 

 

 

References:

1.  Uetz P, et al., A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403(6770):623-7. (2000)

2.  Ito T, et al., A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98(8):4569-74. (2001)

3.  Bader GD, Hogue CWV, Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 20:991-997. (2002)

4.  Gavin AC, et al., Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415(6868):141-7. (2002)

5.  Ho Y, et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415(6868):180-3. (2002)

6.  Lee TI, et al., Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799-804. (2002)

7.  Gene Ontology, http://www.geneontology.org/ (2004)

8.  Bar-Joseph Z, et al., Computational discovery of gene modules and regulatory networks. Nature Biotechnology 21(11):1337-42. (2003)

9.  Ghaemmaghami S, et al., Global analysis of protein expression in yeast. Nature 425(6959):737-41. (2003)

10.  von Mering C, et al., Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399-403. (2002)

11.  Deng M, et al., Inferring domain-domain interactions from protein-protein interactions. Genome Res. 12(10):1540-8. (2002)

12.  Saccharomyces Genome Database (SGD), http://www.yeastgenome.org (2004)

13.  Ghaemmaghami S, et al., Global analysis of protein expression in yeast. Nature 425(6959):737-41. (2003)