Reference Set (Gold Standard)
Table S1. Reference Set (Gold Standard)
 | Set | #pairs | Reference | Note | Use |
Reference Set | Positive Set | 4036 | [6, 7] | Small Scale Experiments | Used for Training & Testing |
Reference Set | Negative Set | 2,391,420 | | Randomly Generated | Used for Training & Testing |
We use a set of small-scale protein interaction experiments as positive
reference data. Here, a positive data point is a protein pair that does
interact. This reference set (4036 protein pairs) combines the small-scale
subset of DIP [6] with the Bader set [7]; both were derived from the existing
literature. Small-scale experiments tend to be more accurate than
high-throughput data.
Negative reference data (a negative is a protein pair that does not interact)
are harder to obtain, because such data are rarely published. Protein pairs
that do not interact (negatives) far outnumber those that do (positives). In
yeast, the ~6000 proteins can form ~18 million pairs, but current estimates of
the total number of interactions are 20,000 to 30,000 [1, 2]. It has been
estimated that, on average, every yeast protein interacts with 10 other
proteins [4, 5]. We generated a negative set by randomly selecting pairs from
the ~18 million yeast protein pairs. Based on the above estimates, the ratio of
positive pairs to negative pairs is roughly 1 to 600. Accordingly, our
reference set has 2,391,420 random negative pairs and 4036 positive pairs. We
expect over 99.5% of the pairs in this negative set to be true negatives.
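The sampling step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names, the fixed seed, and the exclusion of known positive pairs are assumptions that the text does not confirm.

```python
import random
from math import comb


def expected_negative_fraction(n_proteins, est_interactions):
    """Expected share of uniformly random pairs that do not interact."""
    total_pairs = comb(n_proteins, 2)  # unordered pairs, no self-pairs
    return 1.0 - est_interactions / total_pairs


def sample_negative_pairs(proteins, positives, n_negatives, seed=0):
    """Draw distinct unordered random pairs to serve as negatives.

    Assumption (hypothetical, not stated in the text): pairs already
    listed as positives are skipped.
    """
    proteins = sorted(set(proteins))
    known = {frozenset(p) for p in positives}
    # Conservative feasibility check so the loop below always terminates.
    if n_negatives > comb(len(proteins), 2) - len(known):
        raise ValueError("not enough candidate pairs")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = frozenset(rng.sample(proteins, 2))  # two distinct proteins
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)
```

With the figures quoted above, `expected_negative_fraction(6000, 30000)` is about 0.998, in line with the expectation that more than 99.5% of randomly drawn pairs are true negatives.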
Jansen et al. [2] used a "gold-standard" set in which positives are defined by
the MIPS complex database and negatives are random protein pairs with different
sub-cellular localizations. As we show in the following section, this is not a
very good approximation of the negative data.
Lan et al. [3] also used a positive set from the MIPS complex database;
however, they used all remaining protein pairs as their negative set, which
apparently over-estimates the number of non-interacting protein pairs by a
large margin.
A natural question about the random negative set is why we did not build it in
the style of [2], i.e., from protein pairs that do not share the same
sub-cellular localization. At first sight, that seems a reasonable way to
construct a negative set.
To further validate our negative set, we carried out a statistical analysis of
the relationship between protein sub-cellular location and yeast protein pairs.
We found (Table 1) that within a set of random yeast protein pairs, the chance
that a pair of proteins has the same GO-Localization is around 27.33%, whereas
within the set of positive (interacting) protein pairs, it is around 76.78%.
Notably, 695 of the 3303 positive pairs have different GO-Localizations, so a
different localization alone does not rule out an interaction. Using a randomly
generated set of yeast protein pairs as the negative set is therefore a
reasonable choice.
Table 1. Statistics of yeast protein pairs and their GO localization
Protein Pair Set | Same GO-LOC | Diff GO-LOC | Protein Not in GO-LOC | Total |
Positive | 2536 (76.78%) | 695 | 72 | 3303 |
Random Neg | 532615 (27.33%) | 997328 | 418559 | 1948502 |
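The counts in Table 1 could be tallied along the following lines. This is only a sketch under stated assumptions: `localization` is a hypothetical mapping from protein name to a set of GO localization terms, and two proteins are counted as "same" when they share at least one term; the exact matching rule is not given in the text. The percentage is taken over all pairs, including those with an unannotated protein, which matches the table (2536 / 3303 = 76.78%).

```python
def localization_stats(pairs, localization):
    """Count pairs with same, different, or missing GO localization.

    `localization` maps a protein to a set of localization terms
    (hypothetical data layout, assumed for illustration).
    """
    same = diff = missing = 0
    for a, b in pairs:
        loc_a = localization.get(a)
        loc_b = localization.get(b)
        if not loc_a or not loc_b:
            missing += 1          # at least one protein not annotated
        elif loc_a & loc_b:
            same += 1             # assumption: one shared term counts as same
        else:
            diff += 1
    total = same + diff + missing
    pct_same = 100.0 * same / total if total else 0.0
    return {"same": same, "diff": diff, "missing": missing,
            "total": total, "pct_same": pct_same}
```

Applying the same function to the positive set and to the random negative set gives the two rows of the table.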
References:
[1] von Mering C, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399-403 (2002)
[2] Jansen R, et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302:449-453 (2003)
[4] Gilchrist MA, Salter LA, Wagner A. A statistical framework for combining and interpreting proteomic datasets. Bioinformatics 20(5):689-700 (2004)
[5] Tong AHY, et al. Global mapping of the yeast genetic interaction network. Science 303:808-813 (2004)
[7] Bader GD, Hogue CWV. Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 20:991-997 (2002)