Reference Set (Gold Standard)


 

Our reference set

Table S1. Reference Set (Gold Standard)

                Set            #pairs      Reference   Note                      Use
Reference Set   Positive Set   4036        [6, 7]      Small-scale experiments   Used for training and testing
                Negative Set   2,391,420   -           Randomly generated

We use a set of protein interactions from small-scale experiments as positive reference data. Here, a positive data point is a protein pair that does interact. This reference set (4036 protein pairs) combines the DIP small-scale set [6] and the set of Bader and Hogue [7], both of which were derived from the existing literature. Small-scale experiments tend to be more accurate than high-throughput data.

 

Negative reference data (protein pairs that do not interact) are harder to obtain, because such data are rarely published. The number of protein pairs that do not interact (negatives) far exceeds the number that do (positives). In yeast, the ~6000 proteins give rise to ~18 million possible protein pairs, but current estimates of the total number of interactions are 20,000 to 30,000 [1, 2]. It has been concluded that, on average, every yeast protein interacts with 10 other proteins [4, 5]. We generated a negative set by randomly selecting pairs from the ~18 million yeast protein pairs. Based on the above estimates, the ratio of positive pairs to negative pairs is roughly 1 to 600. Accordingly, our reference set has 2,391,420 random negative pairs and 4036 positive pairs. Because only about 1 in 600 random pairs is expected to interact, we expect over 99.5% of the pairs in this negative set to be true negatives.
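The arithmetic behind these estimates, and the random selection of negative pairs, can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the set: the function name sample_negative_pairs, the fixed seed, and the choice to skip known positive pairs are all assumptions made for the example, and the figures of 6000 proteins and 30,000 interactions are the approximate values quoted above.

```python
import random
from math import comb

N_PROTEINS = 6000            # approximate number of yeast proteins (from the text)
EST_INTERACTIONS = 30_000    # upper end of the 20,000 to 30,000 estimate [1, 2]

total_pairs = comb(N_PROTEINS, 2)             # unordered pairs: 17,997,000 (~18 million)
neg_per_pos = total_pairs / EST_INTERACTIONS  # roughly 600 candidate pairs per interaction
expected_true_negative = 1 - EST_INTERACTIONS / total_pairs  # above 0.995


def sample_negative_pairs(proteins, n_pairs, known_positives=frozenset(), seed=0):
    """Draw n_pairs distinct unordered protein pairs uniformly at random.

    Hypothetical helper: skipping pairs listed in known_positives is an
    assumption of this sketch, not something stated in the text.
    known_positives must hold pairs as sorted tuples.
    """
    available = comb(len(proteins), 2) - len(known_positives)
    if n_pairs > available:
        raise ValueError("not enough candidate pairs to sample from")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_pairs:
        a, b = rng.sample(proteins, 2)     # two different proteins
        pair = tuple(sorted((a, b)))       # order-independent representation
        if pair not in known_positives:
            chosen.add(pair)
    return chosen
```

Given a list of all yeast protein identifiers (here a hypothetical protein_ids), a call such as sample_negative_pairs(protein_ids, 2_391_420) would draw a negative set of the size used in this reference set.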

 

 

 

Comparison with reference sets used in the past

Jansen et al. [2] used a "gold-standard" set in which positives are defined by the MIPS complex database and negatives are random protein pairs with different sub-cellular localizations. As we show in the following section, this is not a very good estimate of the negative data.

Zhang et al. [3] also used a positive set from the MIPS complex database; however, they used all remaining protein pairs as their negative set, which evidently greatly overestimates the number of non-interacting protein pairs.

 

 

Statistics of yeast protein pairs and their GO localization

A natural question is why we did not construct the negative set as in [2], from protein pairs that do not share the same sub-cellular localization. At first sight, that seems a reasonable choice for the negative set.

To further validate our negative set, we analyzed the relationship between sub-cellular localization and yeast protein pairs. We found (Table 1) that within a set of random yeast protein pairs, the chance that the two proteins of a pair have the same GO-Localization is around 27.33%, whereas within the set of positive (interacting) protein pairs it is around 76.78%. Conversely, about 21% of the positive pairs (695 of 3303) have different GO-Localizations, so a difference in localization alone does not rule out an interaction. Using a randomly generated set of yeast protein pairs as the negative set is therefore a reasonable choice.

 

Table 1. Statistics of yeast protein pairs and their GO localization

Protein Pair Set   Same GO-LOC        Diff GO-LOC   Protein Not in GO-LOC   Total
Positive           2536 (76.78%)      695           72                      3303
Random Neg         532615 (27.33%)    997328        418559                  1948502
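
The percentages in Table 1 can be recomputed directly from the counts. The short sketch below does this; the dictionary layout and the function name same_loc_fraction are assumptions made for illustration. The fraction is taken over the row total, including pairs with a protein not annotated in GO-LOC, since that is the convention that reproduces the reported values.

```python
# Counts copied from Table 1: (same GO-LOC, different GO-LOC, protein not in GO-LOC)
TABLE1 = {
    "Positive": (2536, 695, 72),
    "Random Neg": (532615, 997328, 418559),
}


def same_loc_fraction(same, diff, not_annotated):
    """Fraction of all pairs in a set whose two proteins share a GO localization."""
    return same / (same + diff + not_annotated)


for name, counts in TABLE1.items():
    fraction = same_loc_fraction(*counts)
    print(f"{name}: {fraction:.2%} of {sum(counts)} pairs share a localization")
```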

 

 

 

References

[1] von Mering C, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399-403 (2002)

[2] Jansen R, et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302:449-453 (2003)

[3] Zhang LV, et al. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5(1):38 (2004)

[4] Gilchrist MA, Salter LA, Wagner A. A statistical framework for combining and interpreting proteomic datasets. Bioinformatics 20(5):689-700 (2004)

[5] Tong AHY, et al. Global mapping of the yeast genetic interaction network. Science 303:808-813 (2004)

[6] Xenarios I, et al. DIP: The Database of Interacting Proteins: 2001 update. Nucleic Acids Res. 29(1):239-241 (2001)

[7] Bader GD, Hogue CWV. Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 20:991-997 (2002)