Details about Search algorithm / Evaluation setting / Validation setting

· Heuristic Local search algorithm for protein complex

Choices for how to start:

- Interesting selected nodes;

- Top degree nodes

- All related nodes on the graph searched on;

We order the seeding node list with respect to the node degree and start the search from high degree nodes to low degree nodes.

Expand current cluster heuristic:

- Generate a subset V* from all neighbours of current cluster;

- Choose top rank M nodes as candidate to expand the current cluster.

- The order of adding is based on the maximum similarity weight to the current cluster;

- Choose the node who achieves the best new cluster score among M candidates;

Search:

- Accept the new cluster candidate if with higher score;

- If lower, accept with probability exp(L' - L)/T;

- Temperature parameter T decreasing by a scaling factor alpha after each round

- Accepted cluster must score higher than a threshold

When to stop:

- Current cluster has no neighbor nodes to expand;

- Number of rounds since the last score improvements is larger than a specified number;

- The number of expanding rounds is larger a the specified number;

· Complexity of complex identification

Assuming that the graph we searched on have n nodes and e edges. The maximum size of sugraphs we detected is p (chose as 20 in our evaluations). Thus,

l For each candidate sub-graph, we need to perform the feature extractions. This would cost O(p³)

l When searching from a seeding node, we assume the average degree of nodes in the graph as q (in sparse graph, q<<n) and the cluster expansion in each round is constrained to m (set as 20 in the evaluations) maximum weight nodes. Then finding the related complex for the node would be cost O(q*p*m* p³)

l If wanting to find all the complexes in a graph, we start from all the related n nodes, totally the algorithm cost O(n*q*m* p⁴)

l Thus, if (n >> p⁴), polynomial to n. If not, the complexity controls largely by p

· Evaluation setting

There exist another large scale affinity-MS based data set from the following paper [1]:

Global landscape of protein complexes in yeast Saccharomyces cerevisiae, Nature 2006 Mar 30;440(7084):637-43. Epub 2006 Mar 22., Krogan NJ, et al, Greenblatt JF. [1]
The reason that we did not use this set as one of the training positive set to start is : this paper used MIPS manual derived complex PPIs to train to identify reliable core set of PPIs from their experimental detected PPIs (from large scale MS)
Thus, this data is not independent with MIPS manual complexes. We could not use it as testing or training versus MIPS complexes then.
Nevertheless it still could be applied in the validation to filter the newly predicted complexes.

In the evaluation, we either train with MIPS [5] manual complexes as positive and test with TAP06 complexes, Or train with TAP06 complexes as positive and test with MIPS manual complexes

We did a pre-processing to filter the complexes from TAP06 [3] which removed those complexes highly overlapped (p>0.8) with MIPS complexes.
During the search in evaluation, we remove the detected clusters that matched with the training positive complexes.
The number of training non-complexes we used in the evaluation have the power law size distribution and totally there are ~2000.

Related parameters used after selecting the best among evaluation

- CLUSTER_LIMIT 20

- CLSOVERLAPLIMIT 0.75

- CHECKNEIULIM 20

- SIMUTemp0 0.88

- SIMUALPHA 1.80

- trainPosPpara 0.0001

- LikeScoreCutoff 0

· Validation setting

Related parameters used

- trainvaliChoice mipsTrain.outVali

- trainMethodpara 0.0001

- seedChoice allGraphNode

- CLUSTER_LIMIT 20

- CLSOVERLAPLIMIT 0.8

- CHECKNEIULIM 20

- SIMUTemp 0.88

- SIMUALPHA 1.80

- LikeScoreCutoff 0

We checked our predicted clusters respect to five complexes sets:

- MIPS manual (MIPS) [5]

- Gavin 02 [2]

- Gavin 06 (TAP06) [3]

- Ho 02 [4]

- Krogan 06 [1]

l References:

[1] Krogan,N.J., et al, Greenblatt,J.F. (2006) Global landscape of protein complexes in yeast Saccharomyces cerevisiae. Nature , 440, (7084):637-43.

[2] Gavin,A.C., Bosche,M., Krause,R., Grandi,P., Marzioch,M., Bauer,A., Schultz,J., et.al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, (6868):141-7.

[3] Gavin,A.C., Aloy,P., et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature, 440, (7084):631-6.

[4] Ho,Y., Gruhler,A., Heilbut,A., Bader,GD., Moore,L., Adams,S.L., Millar,A., Taylor,P.,Bennett,K., Boutilier,K., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 6868.

[5] Mewes,H.W., Amid,C., Arnold,R., et. al (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 32, (Database issue):D41-4.