Reference Set (Gold Standard)

Our reference set

Table S1. Reference Set (Gold Standard)

	Set	#pairs	Reference	Note	Use
Reference Set	Positive Set	4036	[6, 7]	Small-scale experiments	Used for training and testing
Reference Set	Negative Set	2,391,420		Randomly generated	Used for training and testing

We use a set of small-scale protein interaction experiments as positive reference data. Here, a positive data point is a protein pair that does interact. This reference set (4036 protein pairs) combines the DIP [6] small-scale set with the set of Bader and Hogue [7]; both were derived from the existing literature. Such small-scale experiments tend to be more accurate than high-throughput data.

Negative reference data (a negative is a protein pair that does not interact) are harder to obtain, because such data are rarely published. The number of protein pairs that do not interact (negatives) far exceeds the number that do (positives). In yeast, the ~6000 proteins give rise to ~18 million possible protein pairs, but current estimates of the total number of interactions are 20,000 to 30,000 [1, 2]. It has also been concluded that, on average, every yeast protein interacts with about 10 other proteins [4, 5], which is consistent with the upper end of that range (6000 × 10 / 2 = 30,000). Based on these estimates, the ratio of positive pairs to negative pairs is roughly 1 to 600. We therefore generated a negative set by randomly selecting pairs from the ~18 million yeast protein pairs. Our reference set thus has 2,391,420 random negative pairs and 4036 positive pairs. We expect this negative set to contain over 99.5% true negative pairs.
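The arithmetic behind these estimates, and one possible way to draw a random negative set, can be sketched as follows. This is only an illustration under stated assumptions: the function `sample_random_pairs`, its `exclude` argument (for skipping known positives), and the fixed seed are hypothetical and are not taken from the procedure actually used to build the reference set.

```python
import math
import random

N_PROTEINS = 6000           # approximate yeast proteome size quoted in the text
EST_INTERACTIONS = 30_000   # upper end of the 20,000 to 30,000 estimate [1, 2]

# Number of unordered protein pairs: n * (n - 1) / 2
total_pairs = math.comb(N_PROTEINS, 2)

# Non-interacting pairs per interacting pair under these estimates
neg_per_pos = (total_pairs - EST_INTERACTIONS) / EST_INTERACTIONS

# Expected fraction of true negatives in a uniform random draw of pairs
expected_purity = 1 - EST_INTERACTIONS / total_pairs


def sample_random_pairs(proteins, n_pairs, exclude=frozenset(), seed=0):
    """Draw n_pairs distinct unordered pairs uniformly at random.

    `proteins` must be a list of unique identifiers. Pairs listed in
    `exclude` are skipped (an assumption of this sketch; the text does
    not say whether known positives were removed). The caller must make
    sure that n_pairs does not exceed the number of available pairs.
    """
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_pairs:
        a, b = rng.sample(proteins, 2)
        pair = (a, b) if a < b else (b, a)   # canonical order for an unordered pair
        if pair not in exclude:
            chosen.add(pair)
    return chosen


print(f"possible pairs:        {total_pairs:,}")
print(f"negatives per positive: {neg_per_pos:.0f}")
print(f"expected purity:        {expected_purity:.2%}")
```

With 6000 proteins and 30,000 interactions the expected purity comes out above the 99.5% quoted in the text; a lower interaction estimate only raises it.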

Comparison with reference sets used in the past

Jansen et al. [2] used a "gold-standard" set in which positives are defined by the MIPS complex database and negatives are random protein pairs with different sub-cellular localizations. As we show in the section on GO localization statistics below, this is not a very good estimate of the negative data.

Zhang et al. [3] also used a positive set from the MIPS complex database; however, they used all remaining protein pairs as their negative set, which apparently greatly overestimates the number of non-interacting protein pairs.
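To make the difference between these two negative-set styles concrete, here is a toy sketch. The protein names, localizations, and positive pairs are invented for illustration, and removing known positives before filtering by localization is an assumption of this sketch rather than a detail taken from [2] or [3].

```python
from itertools import combinations

# Toy inputs, invented for illustration only
proteins = ["A", "B", "C", "D"]
localization = {"A": "nucleus", "B": "nucleus", "C": "cytoplasm", "D": "nucleus"}
positives = {("A", "B"), ("B", "C")}   # pairs assumed to interact

# Every unordered pair, each stored in sorted order
all_pairs = set(combinations(sorted(proteins), 2))

# Style of [3]: every pair that is not a known positive
negatives_all_remaining = all_pairs - positives

# Style of [2]: non-positive pairs whose proteins have different localizations
negatives_diff_loc = {
    (a, b)
    for a, b in negatives_all_remaining
    if localization[a] != localization[b]
}

print(sorted(negatives_all_remaining))
print(sorted(negatives_diff_loc))
```

In this toy example the positive pair ("B", "C") spans two localizations, so a different-localization rule does not cleanly separate interacting from non-interacting pairs.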

Statistics of yeast protein pairs and their GO localization

A natural question is why we did not build the negative set in the style of [2], that is, from protein pairs that do not share the same sub-cellular localization. At first glance, that seems a reasonable choice for a negative set.

To further validate our negative set, we made a statistical analysis of the relationship between protein sub-cellular localization and yeast protein pairs. We found (Table 1) that within a set of random yeast protein pairs, the chance that a pair of proteins has the same GO localization is around 27.33%, whereas within the set of positive (interacting) protein pairs it is around 76.78%. In other words, a substantial share of random pairs share a localization, and not every interacting pair does, so localization alone does not separate negatives from positives. Using a randomly generated set of yeast protein pairs as the negative set is therefore a reasonable choice.

Table 1. Statistics of yeast protein pairs and their GO localization

Protein Pair Set	Same GO-LOC	Diff GO-LOC	Protein Not in GO-LOC	Total
Positive	2536 (76.78%)	695	72	3303
Random Neg	532615 (27.33%)	997328	418559	1948502
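The percentages in Table 1 can be reproduced from the raw counts with a few lines of Python. The dictionary keys and the helper `same_loc_fractions` are names chosen for this sketch; the second ratio, restricted to pairs in which both proteins have a GO localization, is an extra view added here and is not reported in the table.

```python
# Counts copied from Table 1
table1 = {
    "Positive":   {"same": 2536,   "diff": 695,    "not_in_go": 72},
    "Random Neg": {"same": 532615, "diff": 997328, "not_in_go": 418559},
}


def same_loc_fractions(row):
    """Return the row total and the same-localization fractions for one row."""
    total = row["same"] + row["diff"] + row["not_in_go"]
    annotated = row["same"] + row["diff"]   # pairs with both proteins in GO-LOC
    return {
        "total": total,
        "same_over_total": row["same"] / total,
        "same_over_annotated": row["same"] / annotated,
    }


for name, row in table1.items():
    stats = same_loc_fractions(row)
    print(
        f"{name}: total={stats['total']}, "
        f"same/total={stats['same_over_total']:.2%}, "
        f"same/annotated={stats['same_over_annotated']:.2%}"
    )
```

The reported 76.78% and 27.33% match the same-localization count divided by the row total, that is, with pairs not in GO-LOC included in the denominator.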

References

[1] von Mering C, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399-403 (2002).

[2] Jansen R, et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302:449-453 (2003).

[3] Zhang LV, et al. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5(1):38 (2004).

[4] Gilchrist MA, Salter LA, Wagner A. A statistical framework for combining and interpreting proteomic datasets. Bioinformatics 20(5):689-700 (2004).

[5] Tong AHY, et al. Global mapping of the yeast genetic interaction network. Science 303:808-813 (2004).

[6] Xenarios I, et al. DIP: The Database of Interacting Proteins: 2001 update. Nucleic Acids Res. 29(1):239-241 (2001).

[7] Bader GD, Hogue CWV. Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 20:991-997 (2002).