Details about Search algorithm / Evaluation setting / Validation setting

        Heuristic Local search algorithm for protein complex

 

Choices for how to start:

-          Interesting selected nodes;

-          Top degree nodes

-          All related nodes on the graph searched on;

 

We order the seeding node list with respect to the node degree and start the search from high degree nodes to low degree nodes.

 

 

Expand current cluster heuristic:

-          Generate a subset V* from all neighbours of current cluster;

-          Choose top rank M nodes as candidate to expand the current cluster.

-          The order of adding is based on the maximum similarity weight to the current cluster;

-          Choose the node who achieves the best new cluster score among M candidates;

 

Search:

-          Accept the new cluster candidate if with higher score;

-          If lower, accept with probability exp(L' - L)/T;

-          Temperature parameter T decreasing by a scaling factor alpha after each round

-          Accepted cluster must score higher than a threshold

 

When to stop:

-         Current cluster has no neighbor nodes to expand;

-         Number of rounds since the last score improvements is larger than a specified number;

-         The number of expanding rounds is larger a the specified number;

 

 

        Complexity of complex identification

 

Assuming that the graph we searched on have n nodes and e edges. The maximum size of sugraphs we detected is p (chose as 20 in our evaluations). Thus,

l          For each candidate sub-graph, we need to perform the feature extractions. This would cost O(p3)

l          When searching from a seeding node, we assume the average degree of nodes in the graph as q (in sparse graph, q<<n) and the cluster expansion in each round is constrained to m (set as 20 in the evaluations) maximum weight nodes. Then finding the related complex for the node would be cost O(q*p*m* p3)

l          If wanting to find all the complexes in a graph, we start from all the related n nodes, totally the algorithm cost O(n*q*m* p4)

l          Thus, if (n >> p4), polynomial to n. If not, the complexity controls largely by p

 

 

        Evaluation setting

 

There exist another large scale affinity-MS based data set from the following paper [1]:

  • Global landscape of protein complexes in yeast Saccharomyces cerevisiae, Nature 2006 Mar 30;440(7084):637-43. Epub 2006 Mar 22., Krogan NJ, et al, Greenblatt JF. [1]
  • The reason that we did not use this set as one of the training positive set to start is : this paper used MIPS manual derived complex PPIs to train to identify reliable core set of PPIs from their experimental detected PPIs (from large scale MS)
  • Thus, this data is not independent with MIPS manual complexes. We could not use it as testing or training versus MIPS complexes then.
  • Nevertheless it still could be applied in the validation to filter the newly predicted complexes.

 

In the evaluation, we either train with MIPS [5] manual complexes as positive and test with TAP06 complexes, Or train with TAP06 complexes as positive and test with MIPS manual complexes

  • We did a pre-processing to filter the complexes from TAP06 [3] which removed those complexes highly overlapped (p>0.8) with MIPS complexes.
  • During the search in evaluation, we remove the detected clusters that matched with the training positive complexes.
  • The number of training non-complexes we used in the evaluation have the power law size distribution and totally there are ~2000.

 

 

Related parameters used after selecting the best among evaluation

-          CLUSTER_LIMIT 20

-          CLSOVERLAPLIMIT 0.75

-          CHECKNEIULIM 20

-          SIMUTemp0 0.88

-          SIMUALPHA 1.80

-          trainPosPpara 0.0001

-         LikeScoreCutoff 0

 

 

 

        Validation setting

 

     Related parameters used

-          trainvaliChoice mipsTrain.outVali

-          trainMethodpara 0.0001

-          seedChoice allGraphNode

-          CLUSTER_LIMIT 20

-          CLSOVERLAPLIMIT 0.8

-          CHECKNEIULIM 20

-          SIMUTemp 0.88

-          SIMUALPHA 1.80

-          LikeScoreCutoff 0

 

We checked our predicted clusters respect to five complexes sets:

-          MIPS manual (MIPS) [5]

-          Gavin 02 [2]

-          Gavin 06 (TAP06) [3]

-          Ho 02 [4]

-          Krogan 06 [1]

 

 

 

 

l          References:

 

[1]     Krogan,N.J., et al, Greenblatt,J.F. (2006) Global landscape of protein complexes in yeast Saccharomyces cerevisiae. Nature , 440, (7084):637-43.

[2]     Gavin,A.C., Bosche,M., Krause,R., Grandi,P., Marzioch,M., Bauer,A., Schultz,J., et.al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, (6868):141-7.

[3]     Gavin,A.C., Aloy,P., et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature, 440, (7084):631-6.

[4]     Ho,Y., Gruhler,A., Heilbut,A., Bader,GD., Moore,L., Adams,S.L., Millar,A., Taylor,P.,Bennett,K., Boutilier,K., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 6868.

[5]     Mewes,H.W., Amid,C., Arnold,R., et. al (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 32, (Database issue):D41-4.