Performance Comparison with Popular Classifiers

# Details of the machine learning methods we tried

Here we briefly introduce the popular methods listed above. For each method, we used cross-validation to perform model selection. The curves in the graph above show each method with the best parameters found.
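The model selection step can be sketched as follows. This is a minimal illustration, not the procedure actually run in the experiments: the function names `k_fold_splits` and `select_by_cross_validation`, the default of 5 folds, and the `evaluate` callback (which is assumed to train on one index set and return a score on the other) are all hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def select_by_cross_validation(candidates, evaluate, n_samples, k=5):
    """Return the candidate parameter with the highest mean validation score.

    `evaluate(param, train, validation)` is a hypothetical callback that
    trains a model with `param` on `train` and scores it on `validation`.
    """
    best_param, best_score = None, float("-inf")
    for param in candidates:
        scores = [evaluate(param, train, validation)
                  for train, validation in k_fold_splits(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_param, best_score = param, mean_score
    return best_param, best_score
```

In practice the data would usually be shuffled (or stratified by class) before splitting; the sketch omits this to stay short.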

One additional note: for all algorithms other than our method, missing values were filled with the mean value of the corresponding attribute before training and testing, which actually reduces the difficulty of the prediction task.
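A minimal sketch of this mean imputation is shown below. It assumes that missing values are represented as `None` and that the data is a list of numeric rows; the function name and the fallback of 0.0 for a column with no observed values are choices made for this illustration only.

```python
def fill_missing_with_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption of this sketch: a fully missing column falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if value is None else value
             for j, value in enumerate(row)]
            for row in rows]


rows = [[1.0, None],
        [3.0, 4.0],
        [None, 8.0]]
filled = fill_missing_with_column_means(rows)
# filled == [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

The sketch computes the means over all rows passed in; whether the means are taken from the training data only or from all data is not stated in the text above.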

## Support Vector Machine

We use the SVMlight tool, an implementation of Support Vector Machines (SVMs) in C. It uses a fast optimization algorithm in which the working set selection is based on steepest feasible descent, and it can handle several hundred thousand training examples.
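SVMlight reads training examples from a sparse text format, one example per line: a target value followed by `index:value` pairs with feature indices in increasing order, starting at 1. As an illustration only, a dense feature vector could be converted as sketched below; the helper name `to_svmlight_line` and the choice to map any positive label to `+1` are assumptions of this sketch, not part of the experimental setup.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVMlight input line for binary classification."""
    parts = ["+1" if label > 0 else "-1"]
    # Feature indices are 1-based and must appear in increasing order;
    # zero-valued features can be omitted in the sparse format.
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


line = to_svmlight_line(1, [0.5, 0.0, 2.0])
# line == "+1 1:0.5 3:2"
```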

## Naïve Bayes

The basic assumption made by Naïve Bayes is that the features are independent of one another given the class. Nevertheless, Naïve Bayes performs surprisingly well in many real-world domains, even though most of those domains have clear feature dependencies.
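The independence assumption means that the score of a class is simply the class prior multiplied by one per-feature likelihood for each attribute. The sketch below illustrates this for categorical features with add-one (Laplace) smoothing. It is not the WEKA implementation used in the experiments; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            feature_counts[(y, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def predict(model, x):
    """Return the label maximizing log P(c) + sum_i log P(x_i | c)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, v in enumerate(x):
            # Each feature contributes independently of the others.
            numerator = feature_counts[(label, i)][v] + 1
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(samples, labels)
```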

## Logistic Regression

Logistic regression can be used to predict a dependent variable from a set of independent variables; to determine the percentage of variance in the dependent variable explained by the independent variables; to rank the relative importance of the independent variables; to assess interaction effects; and to understand the impact of covariate control variables.

Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable (the natural log of the odds of the dependent event occurring). In this way, logistic regression estimates the probability of a certain event occurring. Note that logistic regression models changes in the log odds of the dependent variable, not changes in the dependent variable itself as ordinary least squares (OLS) regression does.
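The relationship between the probability, the odds, and the logit can be made concrete with a short sketch. The weights, bias, and inputs below are made-up values for illustration; fitting them by maximum likelihood is not shown.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of the logit: maps log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, x):
    """Probability of the event under a logistic model that is linear in the log odds."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


# Hypothetical coefficients, chosen only to illustrate the log-odds interpretation.
weights, bias = [0.8, -0.5], 0.1
p_before = predict_probability(weights, bias, [1.0, 2.0])
p_after = predict_probability(weights, bias, [2.0, 2.0])  # first feature increased by 1

# The change in log odds equals the first weight (0.8, up to rounding),
# while the change in probability itself depends on the starting point.
log_odds_change = logit(p_after) - logit(p_before)
```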

## Decision Tree-J48

J48 is WEKA's implementation of the C4.5 decision tree algorithm (Quinlan), which extends the base algorithm ID3. C4.5 handles numerical (continuous) attributes and applies post-pruning after tree induction, e.g. based on test sets, in order to increase accuracy. C4.5 can also deal with incomplete information (missing attribute values).
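Numerical attributes are handled by splitting on a threshold. The sketch below shows a simplified version of that idea: it tries the midpoints between consecutive distinct values and keeps the one with the highest information gain. This is an illustration under stated assumptions, not J48's actual code; C4.5 applies further refinements (such as its gain-ratio criterion) that are omitted here, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split, or (None, 0.0)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_gain, best_t = gain, threshold
    return best_t, best_gain


threshold, gain = best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])
# threshold == 2.5, gain == 1.0 (the split separates the two classes perfectly)
```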

## AdaBoost

Freund and Schapire (1996) developed the widely used AdaBoost.M1 algorithm, which weights instances depending on whether or not they were misclassified by previous trees. This directs attention to the instances that caused errors in previous iterations.
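One round of this reweighting can be sketched as follows. The sketch follows the standard AdaBoost.M1 update (down-weight correctly classified instances by beta = error / (1 - error), then renormalize); the function name and the choice to raise an exception when the weighted error is 0 or at least 0.5 are assumptions of this illustration, not a description of WEKA's implementation.

```python
import math


def adaboost_m1_reweight(weights, correct):
    """Return (new instance weights, vote weight) after one boosting round.

    `weights` must sum to 1; `correct[i]` says whether the current
    classifier labeled instance i correctly.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error <= 0.0 or error >= 0.5:
        # AdaBoost.M1 stops boosting in these cases; the sketch signals it.
        raise ValueError("weighted error must be in (0, 0.5)")
    beta = error / (1.0 - error)
    # Correctly classified instances are scaled down by beta < 1,
    # so misclassified instances gain relative weight.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], math.log(1.0 / beta)


new_weights, vote = adaboost_m1_reweight([0.25] * 4, [True, True, True, False])
# The single misclassified instance now carries half of the total weight.
```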

Note 1: In our experiments we use the naïve Bayes classifier (-K), logistic regression classifier, J48 classifier, and AdaBoost classifier from the machine learning toolkit WEKA [1].

Note 2: When comparing performance, the size of the training set matters in addition to the gold standard set and the features used. In our comparison, the training set size is approximately 15,000.

[1] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.