1. Results on R21578 (7769 documents in the training set and 3019 documents in the test set)

1.1 kNN on R21578:

Result for micro F1 and result for macro F1

The graph that tunes the feature selection number (2000 is the best number for micro F1 and 1000 for macro F1)
The graph that tunes the number of nearest neighbors K (85 is approximately the best value for both micro F1 and macro F1)

The graph that tunes the fbr score (0.6 is the best value for micro F1 and 0.4 for macro F1)
(Note that the term-weighting method used here is ooc, NOT ltc. All the other classifiers use ltc.)

Note: the blue line shows the actual data points while the red line shows the regression fit. The first two subplots of every graph show the 3-fold cross-validation curves, the next two show the results averaged over the 3 folds, and the last two zoom in on the key region of the curves so they can be read more clearly.
 
 

Observation: The feature-selection curve is stable and reasonable. The curve that tunes K with micro-averaged F1 is also stable, while the curve for macro-averaged F1 has much larger variance. [1] reports that the best K is near 45, and our micro F1 curve is consistent with this. However, it seems the best K for macro-averaged F1 should be larger than 45: as K increases, the curve first rises and then stays flat. I also tuned the fbr score, always using fbr rank = 1, because R21578's average number of categories per document is 1.23 for the training set and 1.24 for the test set. The point of tuning the fbr score is to see whether scut or rcut (r=1) is better and what the tradeoff between them is. The best fbr score in [1] is 0.5, which I believe is for micro-averaged F1. My final micro result is very close to that in [1] (0.8573 vs 0.8567), while my final macro result is much higher (0.6146 vs 0.5242).
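As a side note, here is a minimal sketch (not the code actually used in these experiments) of the multi-label kNN scoring this tuning refers to: each category is scored by the summed cosine similarity of the k nearest neighbours that belong to it, and the scores are then thresholded by scut or rcut. The names and the normalisation assumption are illustrative.

```python
import numpy as np

def knn_category_scores(test_vec, train_vecs, train_labels, n_categories, k=85):
    """Score categories for one test document: sum the cosine similarities
    of its k nearest training neighbours, per category (k=85 is roughly the
    tuned value reported above)."""
    # assumes test_vec and the rows of train_vecs are L2-normalised,
    # so the dot product equals cosine similarity
    sims = train_vecs @ test_vec
    nearest = np.argsort(sims)[-k:]        # indices of the k most similar training documents
    scores = np.zeros(n_categories)
    for i in nearest:
        for c in train_labels[i]:          # a neighbour may belong to several categories
            scores[c] += sims[i]
    return scores
```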

This is a graph that compares scut (with s tuned) and rcut (with r=1). It shows that, for the micro-averaged results, scut has higher recall while rcut has higher precision; scut's F1 is a little better than rcut's, and they have comparable F0.5 values. However, for the macro-averaged results, rcut performs very poorly: although it still has higher precision, its other measures are far below scut's.
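For reference, a minimal sketch of the two decision strategies being compared; the per-category thresholds for scut are assumed to have been tuned on validation data.

```python
import numpy as np

def rcut(scores, r=1):
    """Rank cut: assign the r top-scoring categories (r=1 roughly matches
    R21578's ~1.2 labels per document)."""
    order = np.argsort(scores)[::-1]
    return set(order[:r].tolist())

def scut(scores, thresholds):
    """Score cut: assign every category whose score reaches its own
    per-category threshold (the tuned 's' values)."""
    return {c for c, s in enumerate(scores) if s >= thresholds[c]}
```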
 
 
 

1.2 Rocchio on R21578
Result for micro F1 and result for macro F1

The graph for tuning the feature selection number (2000 is the best number for both micro F1 and macro F1)
The graph for tuning the fbr score (0.6 is the best value for micro F1 and 0.3 for macro F1)
The graph for tuning Pmax (4000 is the best value for both micro F1 and macro F1)
The graph for tuning beta (-2 is the best value for both micro F1 and macro F1)
 

Observation: We had thought Rocchio would not perform well since it is a simple classifier. In fact, its performance is good, and it has a small training cost.

This is a graph that compares scut (with s tuned) and rcut (with r=1). It shows that for Rocchio, scut is better than rcut in almost every respect (on R21578, at least; the data set may also strongly influence how well a decision strategy performs).
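As a rough sketch of the Rocchio prototype being tuned here: I read beta as the weight on the negative centroid (beta = -2 is the tuned value above); treating Pmax as a cap on the negative pool and clipping negative weights are illustrative assumptions, not necessarily the exact implementation.

```python
import numpy as np

def rocchio_prototype(pos_vecs, neg_vecs, beta=-2.0, p_max=4000):
    """Build a category prototype from tf-idf document vectors: the centroid
    of the positive documents plus beta times the centroid of (at most p_max)
    negative documents.  Test documents are scored against each prototype by
    cosine similarity and then thresholded with scut or rcut."""
    proto = pos_vecs.mean(axis=0)
    neg = neg_vecs[:p_max]                 # cap the negative pool (assumed meaning of Pmax)
    if len(neg):
        proto = proto + beta * neg.mean(axis=0)
    return np.maximum(proto, 0.0)          # clip negative weights to zero (a common choice)
```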
 
 
 

1.3 Naive Bayes on R21578

Result for micro and macro F1 (using Witten-Bell smoothing; a sketch of the smoothing follows the graphs below)
The graph for tuning the feature selection number (1000 is the best number for both micro F1 and macro F1)
The graph for tuning the fbr score (0.1 is best for both micro F1 and macro F1)
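Reading "wintbell" as Witten-Bell smoothing, here is a minimal sketch of the smoothed term probability for one class; the helper name and inputs are illustrative, not the exact implementation used here.

```python
def witten_bell_prob(word, class_counts, vocab_size):
    """Witten-Bell smoothed P(word | class).  class_counts maps each term
    seen in the class to its frequency.  The probability mass T/(N+T)
    reserved for unseen events is shared equally among the V - T unseen
    words, where N is the token count and T the number of seen types."""
    n = sum(class_counts.values())         # total tokens seen in the class
    t = len(class_counts)                  # number of distinct seen words
    if class_counts.get(word, 0) > 0:
        return class_counts[word] / (n + t)
    unseen = max(vocab_size - t, 1)
    return t / ((n + t) * unseen)
```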

 

Observation: An obvious point is that NB has a reasonable micro-averaged result (though still lower than the other classifiers), but its macro-averaged result is very low. This shows NB is less powerful on rare categories than the other classifiers, so for a data set that consists mainly of rare categories, NB may not be a good choice.

Another issue is that NB has two versions: a binary version and a multi-class version. In theory both are correct, and the multi-class version obviously needs much less training cost. The problem is that it can only handle the case where a document has exactly one label, which is not true for Reuters-21578. So we used a simplified implementation: a document with more than one label is duplicated, and one copy is assigned to each category it belongs to. This simplification may affect the category priors, but the effect will be very small. In fact, we found the performance of this version is no worse than the binary version. The reason is that we should still assign d_i to c_i even if P(c_i|d_i) is only 0.2, as long as P(c_j|d_i) is even smaller for every other category c_j.
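A minimal sketch of the duplication step described above (the function and variable names are illustrative):

```python
def expand_multilabel(docs, label_sets):
    """Flatten a multi-label corpus into single-label training instances:
    each document is repeated once per category it belongs to, so the
    multi-class NB can be trained directly.  This slightly distorts the
    category priors but leaves the per-class term statistics unchanged."""
    expanded_docs, expanded_labels = [], []
    for doc, cats in zip(docs, label_sets):
        for c in cats:
            expanded_docs.append(doc)
            expanded_labels.append(c)
    return expanded_docs, expanded_labels
```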
 
 
 

1.4 SVM on R21578

Results for SVM on R21578 for micro and macro F1 (using scut with fbr=0.3; no features were removed)

The graph for tuning the feature selection number (the best value for micro F1 is 15000 and for macro F1 about 13000, which shows that feature selection is not very helpful for SVM)
The graph for tuning the fbr score (0.3 works for both micro F1 and macro F1)

In its original formulation, SVM uses 0 as the threshold for every category, so we now have three decision strategies: scut with s tuned, scut with s=0, and rcut with r=1. I tried all three. From the comparison we can see that, for the micro-averaged results, tuned scut has better recall while rcut (r=1) and scut (s=0) have better precision; the three strategies have comparable F1 and F0.5. However, for the macro-averaged results, tuned scut is clearly better on every measure except precision. Tuning the scut threshold can improve the macro result greatly.
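As a sketch of how the tuned scut thresholds can be obtained: one threshold per category, chosen to maximise F1 on held-out SVM scores, with a threshold of 0 recovering SVM's native decision rule. The function and variable names are illustrative.

```python
import numpy as np

def tune_scut_threshold(val_scores, val_truth, candidates):
    """For one category, pick the threshold that maximises F1 on validation
    data.  val_scores: SVM output scores, val_truth: boolean labels,
    candidates: threshold values to try."""
    best_t, best_f1 = 0.0, -1.0
    for t in candidates:
        pred = val_scores >= t
        tp = float(np.sum(pred & val_truth))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(val_truth.sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-12)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```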
 
 
 

1.5 Some conclusions and comments:

Here is a table for the 4 classifiers on Reuters-21578 (using 5-fold cross validation and the best tuned parameters):
 
 

 
              Micro avg. F1    Macro avg. F1
kNN           0.8557           0.5975
Rocchio       0.8474           0.5914
NB            0.8009           0.4737
SVM           0.8858           0.5957

 

We can see that NB performs the worst. The other three classifiers have close results on macro F1, although SVM does a little better on micro F1.

Since the decision strategies depend on the specific classifier and data set, it is hard to draw a general conclusion. But from the experiments above we can say (on R21578) that scut (s tuned) always has higher recall than rcut (r=1), while rcut tends to have higher precision. They have comparable performance on micro-averaged F1 and F0.5, but for macro-averaged performance, scut tends to have much better F1 and F0.5. For SVM we can also use scut with s=0, since that is the original idea of SVM, but its performance is no better than scut with s tuned.

In parameter tuning, I found that the best feature number for Rocchio, kNN and NB is 1000-2000. SVM does not depend on feature selection so much, just as Joachims observed. The fbr score is another parameter shared by the classifiers; generally speaking, 0.6 is a good value for micro F1 and 0-0.1 is best for macro F1. The curves tuned for macro-averaged results often have larger variance than those for micro-averaged results.
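For clarity, a short sketch of the two averages used throughout: micro-averaging pools the contingency counts over all categories, so frequent categories dominate, while macro-averaging gives every category equal weight, which is why the macro curves are noisier on R21578's many rare categories. Here tp, fp and fn are per-category count arrays (illustrative names).

```python
import numpy as np

def micro_macro_f1(tp, fp, fn):
    """Compute micro- and macro-averaged F1 from per-category counts."""
    # micro: pool the counts over all categories, then compute one F1
    P = tp.sum() / max(tp.sum() + fp.sum(), 1)
    R = tp.sum() / max(tp.sum() + fn.sum(), 1)
    micro = 2 * P * R / max(P + R, 1e-12)
    # macro: compute F1 per category, then take the unweighted mean
    p = tp / np.maximum(tp + fp, 1)
    r = tp / np.maximum(tp + fn, 1)
    per_cat = 2 * p * r / np.maximum(p + r, 1e-12)
    return micro, per_cat.mean()
```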

This is a graph showing the upper bound and the actual F1 performance as a function of category frequency (f1_catfre), with regression window size = 20.