Feature Set Downloading:

[new] The related positive reference sets and feature sources have been updated rapidly over the years, so sharing only the extracted feature files or partial prediction scores is no longer good enough.

Here I therefore share the code and related files used to generate our feature set: Download (both the summary and the detailed versions).

The general framework and the code should still be quite useful. You could try more recent versions of the related evidence sets to improve the results.

Here I also share the derived feature sets (the detailed version with 162 features) to save time for others interested in this problem.

> Feature Details in the data set

Details about each feature are in this link.

Group Index	# of features	Dataset	Attribute Property	Data Position in the set
1	20	Gene Expression	Real value: [-1, 1]	1 - 20
2	21	GO Molecular Function	{1, 0}	21 - 41
3	33	GO Biological Process	{1, 0}	42 - 74
4	23	GO Component	{1, 0}	75 - 97
5	1	Protein Expression	Real value, non-negative	98
6	1	Essentiality	{2, 1, 0}	99
7	1	HMS_PCI Mass *	{1, 0}	100
8	1	TAP Mass *	{1, 0}	101
9	1	Y2H	{1, 0}	102
10	1	Synthetic Lethal	{1, 0}	103
11	1	Gene Neighborhood / Gene Fusion / Gene Co-occur	{1, 0}	104
12	1	Sequence Similarity	Real value, non-negative	105
13	4	Homology based PPI	Discrete, non-negative (mostly 0 or 1)	106 - 109
14	1	Domain-Domain Interaction	Real value: [0, 1]	110
15	16	Protein-DNA TF group binding	Discrete, non-negative (mostly 0)	111 - 126
16	25	MIPS Protein Class	{1, 0}	127 - 151
17	11	MIPS Mutant Phenotype	{1, 0}	152 - 162

* The matrix model is used for co-complex and co-pathway prediction; the spoke model is used for direct PPI prediction.
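The group layout in the table can be turned into column slices, which makes it easy to pick one evidence source out of a feature vector. The following is only a minimal sketch: it assumes a feature vector holds exactly the 162 feature values in table order, with no protein identifiers or label columns mixed in (check the file format read me before relying on this). The names `GROUPS`, `group_slices` and `group_features` are made up for this illustration and are not part of the shared code.

```python
# Group names and sizes copied from the table, in group order 1..17.
GROUPS = [
    ("Gene Expression", 20),
    ("GO Molecular Function", 21),
    ("GO Biological Process", 33),
    ("GO Component", 23),
    ("Protein Expression", 1),
    ("Essentiality", 1),
    ("HMS_PCI Mass", 1),
    ("TAP Mass", 1),
    ("Y2H", 1),
    ("Synthetic Lethal", 1),
    ("Gene Neighborhood / Gene Fusion / Gene Co-occur", 1),
    ("Sequence Similarity", 1),
    ("Homology based PPI", 4),
    ("Domain-Domain Interaction", 1),
    ("Protein-DNA TF group binding", 16),
    ("MIPS Protein Class", 25),
    ("MIPS Mutant Phenotype", 11),
]

N_FEATURES = sum(size for _, size in GROUPS)  # 162 for the detailed set


def group_slices(groups=GROUPS):
    """Map each group name to a 0-based slice, derived from the group sizes."""
    slices = {}
    start = 0
    for name, size in groups:
        slices[name] = slice(start, start + size)
        start += size
    return slices


SLICES = group_slices()


def group_features(vector, name):
    """Return the values of one feature group from a single feature vector.

    Assumption: `vector` contains only the feature values, in table order.
    """
    if len(vector) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} values, got {len(vector)}")
    return vector[SLICES[name]]
```

Because Python slices are 0-based and exclude the end index, the 1-based table position p corresponds to index p - 1; for example, `group_features(row, "Y2H")` returns a one-element list holding the value at table position 102.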

> Shared data sets

Yeast Protein ORF list

File format read me

For the physical interaction task (detailed feature type)

Positive Set PPI list (from the DIP database)
Positive Set feature set
Random Negative Set Protein Pairs list (subset size ~230,000)
Random Negative Set Protein Pairs Feature
File format read me

For the co-complex task (detailed feature type)

Positive Set PPI list (from the MIPS database complex catalogue)
Positive Set feature set
Random Negative Set Protein Pairs list (subset size ~230,000)
Random Negative Set Protein Pairs Feature
File format read me

For the co-pathway task (detailed feature type)

Positive Set PPI list (from the KEGG pathway database)
Positive Set feature set
Random Negative Set Protein Pairs list (subset size ~230,000)
Random Negative Set Protein Pairs Feature
File format read me
Note: The co-pathway relation is an extremely simplified version of the pairwise protein-protein relationships within a pathway (see the paper for the reference work on this task). Its main purpose here is to allow a comparison with the co-complex and physical interaction tasks. Future research will need to investigate protein interactions within pathways in more detail.

> Note

· A value of “-100” in the feature sets marks a missing value at that position!

· For details about the gold standard positive sets shared above, please check the “Gold Standard datasets” section in the paper.

· The negative data set I put here for each task is just a random subset of ~230,000 yeast protein pairs that are not in the positive PPI set of that task.

· In the paper, when building the train-test sets, we assume the size ratio between positive and negative examples is roughly 1:600 (estimated from experimental data).

· This ratio is still questionable and needs further discussion.

· If you happen to know a better strategy than the one I used above, I would greatly appreciate it if you could contact me.

· If you notice any mistakes in the data, please contact me as soon as possible. Thanks in advance!

· FAQ page
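
Two of the notes above matter directly when loading the data: “-100” must not be treated as a real measurement, and the train-test sets use roughly 600 negatives per positive. The sketch below illustrates both on plain Python lists. It is not the procedure from the paper; the helper names, the fixed random seed and the choice to skip missing entries when averaging are all assumptions made for this example.

```python
import random

MISSING = -100  # missing-value marker used in the shared feature sets


def mask_missing(vector, sentinel=MISSING):
    """Replace the missing-value marker with None so it is never used as data."""
    return [None if v == sentinel else v for v in vector]


def observed_mean(column, sentinel=MISSING):
    """Mean of one feature column over observed entries; None if all are missing."""
    seen = [v for v in column if v != sentinel]
    return sum(seen) / len(seen) if seen else None


def max_positives(n_negative, ratio=600):
    """Largest number of positives that n_negative negatives supports at 1:ratio."""
    return n_negative // ratio


def sample_negatives(negatives, n_positive, ratio=600, seed=0):
    """Draw n_positive * ratio negatives without replacement.

    If fewer negatives are available, all of them are returned (shuffled),
    so the caller should compare the lengths to see the ratio actually achieved.
    """
    k = min(len(negatives), n_positive * ratio)
    return random.Random(seed).sample(negatives, k)
```

For instance, `mask_missing([0.5, -100, 1])` gives `[0.5, None, 1]`. At a 1:600 ratio, a negative subset of 230,000 pairs covers at most `max_positives(230000)`, that is 383, positive examples, so larger positive sets would need more negative pairs than the shared random subset contains.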