Naive Bayes and EM Algorithm
on Switchboard Corpus Classification
Pei Zheng
pzheng@cs.cmu.edu
12/07/99
I. Introduction
There are kinds of language models that deal with the classification problems, such as tfidf measure, mutual information networks, connectionist models, and naive bayes classifiers. Among then, Naive Bayes classifier is regards as one of the most popular and best performance classifiers. Especially in the text classification problems, it can produce high accuracy with low computational complexity.
The traditional way to use the Naive Bayes classifier is to perform a supervised learning. Currently, there is a trend that applies the unlabeled data in the training progress to improve the performance of the classifier. Some papers have been published in this area, and report that under certain conditions the combination of the supervised learning (using labeled data) and unsupervised learning (using the unlabeled data) will produce an improvements to the supervised learning.
The Switchboard corpus is a collection of telephone conversations between strangers who are given one of 70 topics to talk about. There are totally 67 classes in the corpus, and the number of documents in each class is different, ranged from 2 to 80. Since the number of classes is very high and the data for each class is not sufficient for training and test, this property of the corpus makes some experiments difficult. Moreover, since the corpus is actually the transcripts of the telephone conversation, there are many words appearing with very high frequency, such as Uh, Okay, Uh-huh, as well as some repeat or broken words.
The experiment work in this project contains two parts. The first part is the implementation of the Naive Bayes classifier, test and analysis on the test results of the accuracy rate under different conditions, such as with/out high frequency words, with/out low frequency words, different parameter values to treat the words that do not appear in the training set. The test results show that the accuracy rate can be improved by removing some high frequency words and low frequency words and selecting the proper value of parameters to treat the zero-counted words in the training set. The second part is to do some experiments on unlabeled data by the text classification tool, Rainbow, which was developed by researchers in CMU. The results show that for the corpus like Switchboard, the unlabeled data could not improve the accuracy rate, the reason will be analyzed in the third part of this report.
II. Approach Description
Naive Bayes classifier is one the best performance text classifiers, it is widely used in text classification. In some domain, its performance has been shown to be comparable to that of the neural network and decision tree learning. The basic idea depended is the MAP(maximum a posterior). The detailed classification algorithm is
………….(MAP)
, where c is the class of document, p(d|c) is the prior probability of the document given class c, p(c) is the probability of the class. Practically, we could estimate p(d|c) and p(c) in the above equation based on the training data. It is easy to estimate each of p(c) simply by counting the frequency with which each target class occurs in the training data. For p(d|c), Naive Bayes classifier is based on the simplifying assumption that the attribute value are conditionally independent given the target value, so we can get the Naive Bayes classifier as,
![]()
,where
is the word in the vocabulary, p(
|c) is the probability of the word
given the class c.
2. EM algorithm
In many practical learning settings, only a set of the relevant instance features might be observable. Many approaches have been proposed to handle the problem of learning in the presence of unobserved data. EM algorithm is a widely used approach for this problem, even for variables whose value can not be observed directly, provided the general form of the probability distribution governing these variables is known. There are two steps in the EM algorithm, E-step calculate the expected value of each hidden variable, assuming the current hypothesis is hold. M-step calculates a new maximum likelihood hypothesis assuming the value taken on by each hidden variable is its expected value.
EM algorithm can be applied for the training with the unlabeled data. The detailed progress is first training with the labeled data, and then classified the unlabeled data (E-step), after that, maximize the training variables with the new coming data (M-step).
The unlabeled data will improve the learning accuracy when the labeled training data is not sufficient, because more information will be added to the system. But for the problem that already has enough labeled training data, the unlabeled data can only serve as noise, and will decrease the accuracy rate.
III. Experiment and Results
The data in corpus is the original transcript data, which contains the header, punctuation, notes for some special voice or noise. In order to work on the training, it is necessary to do some pre-processing of the data set, which includes extracting class target, removing header, remove all punctuation, removing all words that are not in vocabulary, transforming all words to upper case so that there will not be some duplication of words in the vocabulary only because of different cases. All these processing guarantee that the research work will not be affected by the elements other the real information in the text.
There are two parts in this project, the first one is implement the Naive Bayes classifier and test the training and test accuracy rate under the following four different conditions
, as well as the test for the test accuracy rate with different parameter value for zero-counted word.
The second part is the EM algorithm to test the effects of the unlabeled data to improve the classification results. In this part, the "Rainbow" text classification utility has been used to test the classification result with or without removing the stop-words, with or without the unlabeled data added to the training processing.
In this experiment, I implement the Naive Bayes classifier, separate the corpus into three sets – training set (80%), test set (10%) and validation set (10%), the result is shown in Table 1.
|
Training Set (80%) |
Test Set (10%) |
Validation Set (10%) |
|
|
Keep all words |
81.36% |
71.49% |
68.56% |
|
Remove Top 100 words |
99.67% |
88.16% |
81.22% |
|
Remove once-appeared words |
85.09% |
73.25% |
71.62% |
|
Remove Both of Above |
99.84% |
89.91% |
88.65% |
Table 1. Test result of the training set, test set and validation set under different word processing
Here is also the Fig.1 which show the effects on the accuracy rate by removing different number of high frequency words.

Fig.1 Accuracy rate on test and validation set according to different number of high frequency word removed
From the results shown above, we can see that the accuracy rate can be improved by only including a subset of the words occurring in the documents in the vocabulary, which means that the top several high frequency words were removed and those words only occurred once were removed. It is not surprised to see that the accuracy rate increased so much by removing some high frequency words, the reason is that there are many useless words in the Switchboard corpus, which occurs frequently, but just some habit in speaker’s speaking, such as Uh, Um, Uh-huh, Okay and whatever. Since these kind of words do not frequently appear in other text, they are not the generally meaning stop-words, that is why we can find that the classification results provided by Rainbow under the same condition is not satisfied as we can see here. Of course, if we remove too much words, there should be information lost, and will cause the decrease of the accuracy rate.
Moreover, we should deal with the zero-counted words in the training set, so that the probability will not go to zero due to the zero-counted word and make the classifier totally down. Generally for the zero-counted word problem, the probability like
will be adopted, different p will lead to different correct classification rate. Table 2 shows the classification correct rate under different p selection.
|
Training Set (80%) |
Test Set (10%) |
|
|
0.001 |
96.16% |
71.93% |
|
0.01 |
93.70% |
71.93% |
|
0.05 |
-- |
72.80% |
|
0.08 |
-- |
72.37% |
|
0.5 |
-- |
71.93% |
|
0.1 |
86.68% |
71.93% |
|
1 |
81.36% |
71.49% |
Table 2. Accuracy rate according to different value of parameter for the zero-counted words
From the above table, we can see that the correct rate for the training set is increasing with the decreasing of the p, which means that we give smaller and smaller probability to the words that do not appear in the training set. As p decreasing, the correct rates of the test set and validation set increase at first and then decrease, even though the correct rate of the training set keep increasing. The means the overfit of the training set. The reason is that the smaller we set the probability to the zero-counted words in the training set, the better that training set fits. But for the test and validation sets, it is possible that the zero-counted word in the training set is not zero-counted any more. At this time, the overfit happens.
On Rainbow, the test includes the following combination conditions on different training sets, such as
Fig.2 shows the accuracy rate for 200 test data, no stop-words removing and with/out EM algorithm. Fig.3 shows the accuracy rate for 200 test data, stop-words removing and with/out EM algorithm.
From the test results, we can see that there is almost no improvement for the training accuracy with the adding of the unlabeled data. When we look into the Switchboard corpus carefully, we can find that it is not strange. The unlabeled data will have a good effect on the learning when the learning data is insufficient. Rainbow even randomly choose the data set, so for some class there is not so many documents, it is hard to get the example from them, and Rainbow core dumped when it met zero document class in running EM algorithm. And since there are almost 70 classes and there are only a few documents for each class, when select more train data, it is impossible to get more

Fig.2 The accuracy rate for 200 test data, no stop-words removing and with/out EM algorithm.

Fig.3 The accuracy rate for 200 test data, stop-words removing and with/out EM algorithm.
unlabeled data. So what we can see from the test results is the unlabeled data is totally served as some noise to the training when we get enough training data. So in this curve, there is no improvement from the unlabeled data.
As I have analyzed above, the stop-word removing will not as effective as the high frequency word removing, the reason is writing in the above part.
IV Conclusions
In this project, Naïve Bayes classifier and the EM algorithm are experimented under different conditions. From the test results, we can get the conclusion that the method to remove high frequency words and low frequency words will improve the accuracy; There will be overfit if we less-estimate the zero-counted words so much; The unlabeled data will improve the accuracy only when the training data is insufficient, otherwise, it can only serve as noise and decrease the accuracy.
There should be some relationship between the class high frequency word and the high frequency word in the whole vocabulary, if we can find some relationship between them, we can get better classification results than simply removing the high frequency words. I will keep doing some research on this topic.