Evaluation of different biological data and computational classification methods for use in protein interaction prediction

Yanjun Qi1, Ziv Bar-Joseph1 and Judith Klein-Seetharaman1,2,*

1School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

2Department of Pharmacology, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261




Protein-protein interactions play a key role in many biological systems. High-throughput methods can directly detect the set of interacting proteins in yeast but the results are often incomplete and exhibit high false positive and false negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task.


However, the data sources, approaches and implementations varied. Furthermore, the protein interaction prediction task itself can be sub-divided into prediction of (1) physical interaction, (2) co-complex relationship and (3) pathway co-membership. To investigate systematically the utility of different data sources and the way the data are encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions: Random Forest (RF), RF similarity-based k-Nearest-Neighbor, Naïve Bayes, Decision Tree, Logistic Regression and Support Vector Machine.


For all classifiers, the three prediction tasks had different success rates and co-complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top-ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast-2-hybrid system were not amongst the top-ranking features under any condition.
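As a rough illustration of how a Gini-based splitting criterion can be turned into a feature ranking, the sketch below scores each feature by the largest drop in Gini impurity that a single threshold split on it achieves. This is a simplification and not the procedure used in the paper: a Random Forest accumulates such decreases over every node of every tree, whereas this example looks at one split per feature. The function names, the feature names and the toy data are all made up for illustration and do not come from the published feature set.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_gini_decrease(values, labels):
    """Largest impurity decrease from one threshold split on a single feature."""
    n = len(labels)
    parent = gini_impurity(labels)
    best = 0.0
    # Try every distinct value except the largest as a "<= threshold" split.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_features(features, labels):
    """Return (feature name, score) pairs sorted from most to least informative."""
    scores = {name: best_gini_decrease(column, labels)
              for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = interacting pair, 0 = non-interacting pair.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "expression_correlation": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "y2h_evidence": [1, 0, 0, 0, 1, 0],
}
ranking = rank_features(features, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The second step described above, re-training with only the top-ranking features, would then amount to keeping the first few entries of `ranking` and fitting the classifier on those columns alone.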


Online Access URL:  http://www3.interscience.wiley.com/cgi-bin/fulltext/112392432/HTMLSTART


Because the positive reference sets and feature sources have been updated rapidly over the years, sharing only the extracted feature files or partial prediction scores is no longer sufficient. Therefore:


→ [new]  Here I share the code and related files used to generate our feature set. Download (both summary and detailed!)


The general framework and the code should still be useful. To improve on the results, you could substitute more recent versions of the related evidence sets.


If you use this code, please cite: “Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, "Evaluation of different biological data and computational classification methods for use in protein interaction prediction", PROTEINS: Structure, Function, and Bioinformatics. 63(3):490-500. 2006”



“Data Sets”:

  Feature Attributes details

  Data sets including 162 Attributes Sets Downloads


  Performance change when using the top 20 detailed features (RF)




  Web Service to retrieve the full predictions (co-complex / physical) [new]



 “Computational Predictions Downloads”:

  Predicted interaction scores (for all possible yeast protein pairs)

        On Yeast all possible protein pairs

        Detailed feature type

        For two tasks

1.      For physical Interaction Task

2.      For co-complex Task