FAQ

(For paper: "Evaluation of different biological data and computational classification methods for use in protein interaction prediction".)

1. How did you handle the missing values when classifying samples using SVM?

Ø For classifiers that cannot handle missing feature values, we used a simple fill-in strategy: numerical features were filled with the mean, and categorical features with the most frequent value.
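The fill-in strategy can be sketched in a few lines of Python. This is only an illustration, not the code used for the paper: the function name `fill_missing`, the use of `None` as the missing-value marker, and the row/column layout are all assumptions.

```python
from collections import Counter

MISSING = None  # assumption: missing entries are represented as None


def fill_missing(rows, categorical):
    """Fill missing entries column by column.

    rows: list of equal-length lists (one list per example).
    categorical: set of column indices holding categorical features.
    Numerical columns get the mean of the observed values,
    categorical columns get the most frequent observed value.
    """
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        if j in categorical:
            fills.append(Counter(observed).most_common(1)[0][0])
        else:
            fills.append(sum(observed) / len(observed))
    return [[fills[j] if v is MISSING else v for j, v in enumerate(r)]
            for r in rows]


rows = [[1.0, "a"], [None, "b"], [3.0, None], [2.0, "b"]]
filled = fill_missing(rows, categorical={1})
```

Here the missing numerical entry is replaced by the column mean and the missing categorical entry by the most common label in that column.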

2. Which kernel did you use for SVM?

Ø We compared the polynomial (-d 2) kernel and the linear kernel using SVMLight, and reported the linear kernel results in the paper. We also varied the cost factor parameter and reported the best setting based on the train-test results.

3. How did you handle the numerical features in the Naïve Bayes classifier?

Ø We used the “-K” option of the WEKA NB classifier, which means: “Use kernel estimation for modelling numeric attributes rather than a single normal distribution.”
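To illustrate the difference between the two density models, the sketch below compares a single normal fit with a Gaussian kernel estimate on a bimodal feature. It is not WEKA's implementation: the function names and the fixed `bandwidth` value are assumptions chosen for the example.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def single_normal_density(x, values):
    """Model the feature with one normal fitted to all observed values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return normal_pdf(x, mean, math.sqrt(var))


def kernel_density(x, values, bandwidth):
    """Model the feature as an average of normals centred on each value.

    bandwidth is a hypothetical smoothing parameter for this sketch.
    """
    return sum(normal_pdf(x, v, bandwidth) for v in values) / len(values)


# A feature whose values for one class cluster around 0 and around 10.
values = [0.0, 0.1, -0.1, 10.0, 10.1, 9.9]
```

On such data the single normal puts its peak near 5, where no value was observed, while the kernel estimate keeps its mass near the two clusters.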

4. For the evaluation, you repeated the procedure 25 times and reported the average value. How is the average value derived in terms of precision-recall curves?

Ø In each test run, varying the score cutoff yields one precision-recall curve, so we obtain 25 precision vs. recall curves in total. We then average the precision values of these 25 curves at each fixed recall value.
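The averaging step can be sketched as follows. This is a minimal illustration rather than our evaluation script: the function names are hypothetical, tied scores are not treated specially, and the rule for reading a precision at a fixed recall level (the first cutoff whose recall reaches that level) is an assumption.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs, lowering the cutoff one example at a time.

    labels are 1 for positive and 0 for negative examples.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    total_pos = sum(labels)
    points, tp = [], 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        points.append((tp / total_pos, tp / k))
    return points


def precision_at(points, recall_level):
    """Precision at the first cutoff whose recall reaches recall_level (assumption)."""
    for recall, precision in points:
        if recall >= recall_level:
            return precision
    return 0.0


def average_curve(runs, recall_levels):
    """Average the per-run precision at each fixed recall level.

    runs: list of (scores, labels) pairs, one per test run.
    """
    curves = [pr_points(s, y) for s, y in runs]
    return [sum(precision_at(c, r) for c in curves) / len(curves)
            for r in recall_levels]


runs = [
    ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    ([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]),
]
avg = average_curve(runs, recall_levels=[0.5, 1.0])
```

With 25 runs, `runs` would simply hold 25 (scores, labels) pairs instead of the two toy runs used here.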

5. How did you generate the scores for “Computational Predictions Downloads” and why did you put them online?

Ø The scores are at: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/structure-9/PPI/protein05/twoTasks-fullPredict/

Ø We are currently building a web service that provides retrieval functions for computationally predicted PPI scores, and the scores above were generated to supply data for it. They may also help people who are more interested in using predicted PPI scores than in the evaluation comparison in the paper.

Ø The currently shared scores were generated by the following procedure:

1. The features are those listed in Table III of the paper (detailed encoding type).

2. The positive training sets come from DIP physical PPI data and MIPS co-complex data (Table II).

3. The score calculation:

- First, train an SVM model on a training set that includes all positive PPIs of that task and randomly sampled negative PPI examples

- Then classify all possible yeast protein-protein pairs (6270 * 6269 / 2) with the resulting model

- (In the old shared file) rank all of these pairs by classification score

- (In the old shared file) report the top ~20000 pairs from the ranked list
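The pair enumeration and ranking steps can be sketched as below. This is an illustration only: `num_pairs`, `top_pairs`, and the toy scoring function are hypothetical names, and in practice the score would come from the trained SVM model rather than a lookup table.

```python
import heapq
from itertools import combinations


def num_pairs(n):
    """Number of unordered pairs among n proteins: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def top_pairs(proteins, score_fn, k):
    """Score every unordered pair and keep the k highest-scoring ones.

    score_fn(a, b) stands in for the classifier's score for the pair (a, b).
    The generator avoids holding all scored pairs in memory at once.
    """
    scored = ((score_fn(a, b), a, b) for a, b in combinations(proteins, 2))
    return heapq.nlargest(k, scored)


# Toy stand-in for a trained model's scores (hypothetical values).
toy_scores = {("A", "B"): 0.2, ("A", "C"): 0.9, ("B", "C"): 0.5}
best = top_pairs(["A", "B", "C"], lambda a, b: toy_scores[(a, b)], k=2)
total = num_pairs(6270)
```

For 6270 proteins, `num_pairs` gives 19,653,315 candidate pairs, which is why a bounded top-k selection is more practical than sorting the full list.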

Ø I am hoping to update these scores frequently. (We currently use SVM to classify all potential pairs because RF is relatively slow.)

Ø If you would like the predicted scores before calibration, or predicted PPIs in other formats or sizes, please feel free to contact me.

6. Would you share your code for generating the features?

Ø Here I share the code and related files used to generate our feature set: Download (both summary and detailed!). The general framework and the code should be quite useful, though you may want to look for more recent versions of the related evidence sets to improve on them.