Neural Network Classifiers for Optical Chinese Character Recognition

Richard Romero, Robert Berger, Robert Thibadeau, and Dave Touretzky

Imaging Systems Lab, Robotics Institute

Carnegie Mellon University

5000 Forbes Avenue, Pittsburgh, PA 15213

Abstract

We describe a new, publicly accessible Chinese character recognition system based on a nearest neighbor classifier that uses several sophisticated techniques to improve its performance. To increase throughput, a 400-dimensional feature space is compressed to 100 dimensions through multiple discriminant analysis. Recognition accuracy is improved by scaling these dimensions to achieve uniform variance. Two neural network classifiers are compared using the new feature space: Kohonen's Learning Vector Quantization and Geva and Sitte's Decision Surface Mapping. Experiments with a 37,000-character ground-truthed dataset show performance comparable to other systems in the literature. We are now employing noise and distortion models to quantify the robustness of the recognizer on realistic page images.

1 Introduction

1.1 History of Chinese character recognition

(1)

From 1956 to 1964, the People's Republic of China introduced over 2,000 simplified characters. The effort was undertaken in order to reduce the number of strokes necessary to form the more common characters. A standard dictionary of the simplified character set contains about 7,000 of the characters in general use.

The People's Republic of China created the simplified standard for its own use. The traditional character set is still the norm in Taiwan, Hong Kong, Macau, and in overseas Chinese publications. Also in common use are four main font styles: songti, fangsongti, kaiti, and heiti. Examples of each are shown in Figure 1.1.

1.2 Problems faced

FIGURE 1.1 Example characters in the four font styles. Left to right: songti, fangsongti, kaiti, and heiti.

A related problem is that the frequency of some characters is extremely low, measured in occurrences per million characters of text. In a frequency count over 1.8 million characters of Chinese text [Xiandai 86], 425 characters were encountered only once each, and the number of unique characters was only 4,574. Hence it is nearly impossible to build a corpus from actual scanned data that is large enough to contain all the characters that will ever be encountered, let alone obtain adequate statistical information on all characters.

We have created a set of ground-truthed data containing over 55,000 characters from 111 scanned pages. The largest segment of this data uses the songti font and the simplified character set: 68 pages and 37,021 total characters, of which 1,850 are unique. This paper reports results on the songti, simplified subset.

1.3 Proposed solutions

2 Character Features

2.1 The TECHIS feature set

2.2 Reference characters

TABLE 2.1 Brief summary of the features extracted. 
-----------------------------------------------------------------
Name                                                   Dimensions  
-----------------------------------------------------------------
Blackness (number of black pixels)                     1           
Stroke Width                                           1           
Total Stroke Length                                    1           
Horizontal Projection                                  64          
Vertical Projection                                    64          
Number of Horizontal Transitions (white-to-black)      1           
Number of Vertical Transitions (white-to-black)        1           
First Order Peripheral Features                        32          
Second Order Peripheral Features                       32          
Stroke Density at (fig) and (fig)                      16          
Local Direction Contributivity with four regions and   64          
four directions                                                    
Maximum Local Direction Contributivity                 64          
Stroke Proportion                                      32          
Black Jump Distribution in Balanced Subvectors         32          
Total Feature Dimensions                               405
-----------------------------------------------------------------

Once all of the scaling had been done, features were extracted. The total number of feature dimensions was 405.

3 Feature Transformation

3.1 The TECHIS distance measure

(EQ 1)

Simply put, the distance between the feature vector x and the prototype feature value for class i is the sum of the absolute differences of all vector elements. The next reduction, from four feature distances to a single distance, was done as:

(EQ 2)

where is the distance from feature j in class i, and N is the number of classes. This is then called the relative distance, since it is measuring distance from class i relative to the distance from all other classes. Using this method of feature combination, an impressive 99.92% recognition rate on test data of over 20,000 characters was reported. However, after implementing the same features and distance measures, we obtained only a 92% recognition rate on our 58,482 character database.

The justification for these equations rested partly on a supposed failure of statistical methods to deal properly with multiple types of multidimensional features. However, we have implemented statistical methods that both combine the different features in a beneficial manner and reduce the number of dimensions. Then, instead of a simple linear (sum of absolute differences) distance, the true Euclidean distance measure can be used.
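
The contrast between the two distance measures can be illustrated with a minimal sketch. The function names and the example vectors are hypothetical and chosen only for illustration; the sketch shows the sum-of-absolute-differences distance described for Equation 1 next to the Euclidean distance, not the TECHIS implementation itself.

```python
import math


def city_block_distance(x, prototype):
    """Sum of the absolute differences of all vector elements."""
    if len(x) != len(prototype):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, prototype))


def euclidean_distance(x, prototype):
    """Square root of the sum of squared element differences."""
    if len(x) != len(prototype):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, prototype)))


# Hypothetical three-dimensional feature vector and class prototype.
x = [1.0, 4.0, 2.0]
prototype = [2.0, 2.0, 4.0]

print(city_block_distance(x, prototype))  # 1 + 2 + 2 = 5.0
print(euclidean_distance(x, prototype))   # sqrt(1 + 4 + 4) = 3.0
```

The Euclidean distance is only meaningful once the dimensions are on comparable scales, which is one motivation for the transformation in the next section.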

3.2 Dimension reduction transformation

with
(EQ 4)

and the total within-class scatter matrix is
(EQ 5) .

The between-class scatter matrix is defined as

(EQ 6) ,

where m is the overall mean, and it can be shown that
(EQ 7)

where $S_T$ is the total scatter matrix. The scatter matrices are analogous to covariance matrices. What we wish to compute is W, a transformation matrix that maximizes the new between-class scatter, $\tilde{S}_B$, with respect to the new within-class scatter, $\tilde{S}_W$. Then, using W, we can define the following:
(EQ 8)

(EQ 9)

(EQ 10)

(EQ 11)

(EQ 12)

It follows, then, that
(EQ 13)

(EQ 14)

One way to then maximize the between-class scatter with respect to the within-class scatter is to use the determinants of the scatter matrices as an overall measure of variance. The determinant is the same as the product of the eigenvalues, and this corresponds directly with the product of the variances. So, we wish to minimize

(EQ 15) .

The solution for the matrix W can be shown to consist of the generalized eigenvectors corresponding to the largest eigenvalues in

(EQ 16)

where the $w_i$ are the columns of W. To reduce the original space of d dimensions (the size of the x vectors) to a c-dimensional space, take the eigenvectors corresponding to the c largest eigenvalues to form W.
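
For reference, the standard multiple discriminant analysis formulation in Duda and Hart (1973), on which this derivation appears to be based, can be written as follows. The symbols are the textbook's and are assumed, not confirmed, to match the numbered equations of this section: $X_i$ is the set of $n_i$ exemplars of class $i$, $N$ is the number of classes, and $m$ is the overall mean.

```latex
\begin{align*}
m_i &= \frac{1}{n_i} \sum_{x \in X_i} x \\
S_i &= \sum_{x \in X_i} (x - m_i)(x - m_i)^T \\
S_W &= \sum_{i=1}^{N} S_i \\
S_B &= \sum_{i=1}^{N} n_i \, (m_i - m)(m_i - m)^T \\
S_T &= S_W + S_B \\
y &= W^T x \\
\tilde{S}_W &= W^T S_W W \\
\tilde{S}_B &= W^T S_B W \\
J(W) &= \frac{|\tilde{S}_B|}{|\tilde{S}_W|}
      = \frac{|W^T S_B W|}{|W^T S_W W|} \\
S_B w_i &= \lambda_i S_W w_i
\end{align*}
```

In the textbook form the criterion $J(W)$ is maximized; this is equivalent to minimizing its reciprocal, the ratio of within-class to between-class determinants.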

With the nearest-neighbor based classifiers described in Section 4, it is desirable to scale the transformed features so that they have equal within-class variances, resulting in hyperspherical class distributions. In order to scale the transformed feature axes, we need an estimate of the variance in dimension i. This can be found from Equation 13, since we know the form of the transformed within-class scatter matrix. The variance in dimension i is then

(EQ 17) , with $w_i$ as column i of W.

And the transformation matrix can be scaled by replacing each $w_i$ with
(EQ 18)

thereby scaling each new dimension by the inverse of its standard deviation, so that all dimensions have equal within-class variance.
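
The scaling step can be sketched in a few lines. This is an illustration under stated assumptions, not the system's code: the within-class variance along column w of W is taken to be w^T S_W w divided by the total number of exemplars n (the normalization is an assumption), and the function names and the toy scatter matrix are hypothetical.

```python
import math


def quadratic_form(w, s):
    """Return w^T S w for a vector w and a square matrix S (nested lists)."""
    size = len(w)
    return sum(w[r] * s[r][c] * w[c] for r in range(size) for c in range(size))


def scale_columns_to_unit_variance(columns, s_w, n):
    """Divide each column of W by the within-class standard deviation
    of the dimension it produces.

    columns: the columns of W, each a list of floats
    s_w:     within-class scatter matrix in the original feature space
    n:       number of exemplars used to accumulate s_w (assumed normalizer)
    """
    scaled = []
    for w in columns:
        variance = quadratic_form(w, s_w) / n
        if variance <= 0.0:
            raise ValueError("dimension has no within-class variance")
        sigma = math.sqrt(variance)
        scaled.append([value / sigma for value in w])
    return scaled


# Toy example: two dimensions with within-class variances 4 and 1.
s_w = [[8.0, 0.0], [0.0, 2.0]]
n = 2
columns = [[1.0, 0.0], [0.0, 1.0]]

scaled = scale_columns_to_unit_variance(columns, s_w, n)
for w in scaled:
    # After scaling, every dimension has unit within-class variance.
    print(w, quadratic_form(w, s_w) / n)
```

Dimensions whose eigenvalues are close to zero would need special handling before this division; the sketch simply rejects a non-positive variance.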

3.3 Scatter matrix computation

The reason for not including characters with fewer exemplars is that their statistical validity is questionable in terms of how much they should contribute to the overall within-class variance. The more exemplars available, the more stable the estimate for , which implies that the calculations for , , and are more stable.

4 Neural Network Classifiers

Both neural classification schemes that we used are essentially nearest-neighbor prototype matching. The difference between the two lies solely in the training algorithms that are used to determine the prototypes.
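
Nearest-neighbor prototype matching itself is simple, as the following minimal sketch suggests. The data layout (a list of label and vector pairs), the function names, and the two toy prototypes are assumptions made for illustration; in the system described here the vectors would be the transformed, variance-scaled features.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance; the square root is not needed for ranking."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def classify(x, prototypes):
    """Return the label of the prototype nearest to x.

    prototypes: list of (label, vector) pairs; a class may own several.
    """
    if not prototypes:
        raise ValueError("at least one prototype is required")
    label, _ = min(prototypes, key=lambda item: squared_euclidean(x, item[1]))
    return label


# Two hypothetical classes with one prototype each.
prototypes = [("A", [0.0, 0.0]), ("B", [4.0, 4.0])]
print(classify([1.0, 1.0], prototypes))
print(classify([3.0, 5.0], prototypes))
```

Because only the prototype positions matter at recognition time, the two training algorithms can be swapped without changing the classifier.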

4.1 The LVQ algorithm

4.2 The DSM algorithm

DSM can be unstable when there is noise in the training set. In that case, Geva and Sitte suggest that LVQ might be preferable. But in our application, where the initial prototypes are drawn from noise-free PostScript fonts and the training set has been carefully ground truthed, our experiments show that our modified version of DSM retains its stability.
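
The difference between the two training rules can be sketched as single update steps. The sketch follows the published descriptions of basic LVQ (Kohonen) and DSM (Geva and Sitte) as commonly stated, not the modified version of DSM used in our experiments; the learning rate `alpha`, the function names, and the data layout are assumptions made for illustration.

```python
def squared_euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_index(x, prototypes, wanted_label=None):
    """Index of the nearest prototype, optionally restricted to one class."""
    candidates = [i for i, (label, _) in enumerate(prototypes)
                  if wanted_label is None or label == wanted_label]
    if not candidates:
        raise ValueError("no prototype available for the requested class")
    return min(candidates, key=lambda i: squared_euclidean(x, prototypes[i][1]))


def move(vector, x, step):
    """Move vector toward x (step > 0) or away from x (step < 0), in place."""
    for k in range(len(vector)):
        vector[k] += step * (x[k] - vector[k])


def lvq1_step(x, label, prototypes, alpha):
    """Basic LVQ: the winner is always updated, toward x if its class is
    correct and away from x otherwise."""
    winner = nearest_index(x, prototypes)
    sign = 1.0 if prototypes[winner][0] == label else -1.0
    move(prototypes[winner][1], x, sign * alpha)


def dsm_step(x, label, prototypes, alpha):
    """DSM: nothing changes on a correct classification. On an error the
    wrong winner is pushed away and the nearest prototype of the correct
    class is pulled toward x. Returns True if an update was made."""
    winner = nearest_index(x, prototypes)
    if prototypes[winner][0] == label:
        return False
    correct = nearest_index(x, prototypes, wanted_label=label)
    move(prototypes[winner][1], x, -alpha)
    move(prototypes[correct][1], x, alpha)
    return True


prototypes = [("A", [0.0, 0.0]), ("B", [4.0, 0.0])]
changed = dsm_step([3.0, 0.0], "A", prototypes, alpha=0.5)
print(changed, prototypes)
```

Since DSM updates prototypes only on errors, a mislabeled exemplar always moves two prototypes, which is consistent with its sensitivity to noise in the training set.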

4.3 Dynamic node addition

4.4 Training parameters and data

5 Recognition Results

5.1 Results before training

(2)

TABLE 5.1 Results of using different numbers of dimensions for data transformation.
------------------------------------------
Dimensions  Correct  Incorrect  Hit Rate %
------------------------------------------
 50         35,240   1,781      95.19
100         35,568   1,453      96.08
150         35,355   1,666      95.50
------------------------------------------
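
The hit rates in Table 5.1 follow directly from the counts, as a short check shows. The counts are taken from the table; the function name is hypothetical.

```python
def hit_rate_percent(correct, incorrect):
    """Percentage of characters classified correctly."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no characters were classified")
    return 100.0 * correct / total


# (correct, incorrect) counts from Table 5.1, keyed by number of dimensions.
rows = {50: (35240, 1781), 100: (35568, 1453), 150: (35355, 1666)}

for dimensions, (correct, incorrect) in rows.items():
    total = correct + incorrect
    rate = hit_rate_percent(correct, incorrect)
    print(f"{dimensions:3d} dimensions: {total} characters, {rate:.2f}% correct")
```

Each row sums to the 37,021 characters of the songti, simplified subset.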

5.2 Results after training

For the LVQ algorithm, values of and were found to be more suitable. The LVQ algorithm in all instances ran for 50 epochs without converging to a perfect fit of the training data, making the DSM algorithm much faster in terms of training epochs. For the three training set sizes of 6,761, 10,894, and 17,588 characters, recognition rates on the actual training sets were 99.78%, 99.72%, and 99.64%, respectively.

TABLE 5.2.a Summary of results with different training and test set sizes.
----------------------------------------------------------------------------------------------
Number of   Number of   Number of   Number of   Characters  Hit Rate %  Characters  Hit Rate %
Training    Unique      Test        Unique      Correct                 Correct
Characters  Characters  Characters  Test
                                    Characters
----------------------------------------------------------------------------------------------
 6,761      1,103       30,260      1,706       29,631      97.92       29,821      98.55
10,894      1,249       26,127      1,673       25,660      98.21       25,790      98.71
17,588      1,541       19,433      1,485       19,135      98.47       19,204      98.82
----------------------------------------------------------------------------------------------

TABLE 5.2.b Summary of results with different training sets and identical test sets.
----------------------------------------------------------
Number of   Characters  Hit Rate %  Characters  Hit Rate %
Training    Correct                 Correct
Characters
----------------------------------------------------------
 6,761      14,519      97.70       14,631      98.45
10,894      14,584      98.14       14,669      98.71
17,588      14,625      98.41       14,679      98.78
----------------------------------------------------------

FIGURE 5.3.b Log scale plot of character frequency versus error rate using DSM.

Table 5.2.a shows a summary of the results using different training set sizes. In this table, the test set consisted of all the characters that had not been used in training. In Table 5.2.b, the same three training sets are shown, but a common set of test data is used. It contains 14,861 characters, 1,354 of which are unique. In all of these results, only the training and test set sizes are varied; the learning parameters remain constant.

5.3 Conclusions

FIGURE 5.3.a Log scale plot of character frequency versus number of instances correctly classified using DSM.

In Figures 5.3.a and 5.3.b, two views of the same information are given. Both plots show that the frequency of occurrence of a character correlates strongly with its recognition rate. (In both plots, characters that were recognized without error have been omitted(3).) We believe this correlation is useful in evaluating a Chinese character recognition engine: it shows that the characters with low recognition rates tend to be those occurring with low frequency. While acceptable for Chinese character recognition, this sort of correlation is not desired at all in Latin character recognition, simply because there are only a handful of characters, each of which is frequent in comparison to a Chinese character. Similar plots are found using the LVQ algorithm.

In comparing the LVQ to the DSM training algorithm, it appears that the data is best separated not by modeling decision boundaries but by modeling the distribution centers. It does appear, however, that LVQ and DSM may converge to the same placement of class prototypes as the training set size increases.

Using the DSM algorithm, consider the case where a class has a single prototype and very few exemplars. Furthermore, assume that classes in general are uniformly distributed in their locations, and normally distributed in their shapes. The final location of the prototype under DSM is largely constrained by the convex hull of the exemplars, which is made up of the data outliers. With a small number of exemplars, there is no reason to believe that the mean of the outliers will accurately predict a mean for the class. As the number of exemplars increases, the outliers should come closer to forming a hyper-ellipse centered at the true mean of the class's distribution.

If the class is not distributed normally in shape, or if classes are not, in general, uniformly distributed in the transformed feature space, it is unclear which algorithm would yield the better answer. It does appear that being able to dynamically add prototypes is desirable, as the LVQ algorithm never found a perfect solution for the training data given a single prototype per class. This is evidence that at least some classes are either not well separated or not normally distributed.

6 Future Work

6.1 Additional features

6.2 Noise and distortion models

(4)

References

Duda, Richard; Hart, Peter: Pattern Classification and Scene Analysis 1973, 114-121. New York: John Wiley & Sons.

Geva, Shlomo; Sitte, Joaquin: Adaptive Nearest Neighbor Pattern Classification, IEEE Transactions on Neural Networks, 1991, Vol. 2, No. 2.

Kohonen, Teuvo: Self-organization and Associative Memory, 2nd ed., 1988, 199-202.

Stallings, William: Approaches to Chinese Character Recognition. Pattern Recognition 1976, Vol. 8, 87-98.

Suchenwirth, Richard; Guo, Jun; Hartmann, Irmfried; Hincha, Georg; Krause, Manfred; Zhang, Zheng: Optical Recognition of Chinese Characters. Advances in Control Systems and Signal Processing 1989, Vol. 8. Braunschweig: Friedr. Vieweg & Sohn.

Xiandai Hanyu Pinlu Cidian: Frequency Dictionary of the Modern Chinese Language. Languages Institute Press 1986.

Zhang, Zheng: A Structural Analysis Method for Constrained Handprinted Chinese Character Recognition. Journal of the Beijing Institute of Technology 1987.1, 1-7.

Appendix A: Image Samples

Figure A.1 shows some of the incorrectly classified characters. On the next page, in Figures A.2 and A.3, are portions of two of the test pages used, magnified 1.5 times.

FIGURE A.2
FIGURE A.3

Appendix B: FTP Instructions

A demonstration version of our system is available via anonymous FTP. To retrieve it, open a connection to host ftp.cs.cmu.edu (128.2.206.173) and log in as user anonymous, with userid@host as the password. Then cd to /afs/cs/project/pcvision/ChineseOCR. Note: you must cd directly to this directory; the parent directories will not be accessible to anonymous FTP users. Remember to select image (binary) mode before retrieving compressed or executable files.

There are subdirectories for document images, fonts, and the demo application, presently a SunOS 4.x SPARC executable. Versions for other platforms may be provided at a later date. A full description of the directory's organization can be found in the README file.

Footnotes

(1): The actual number of dictionary entries for the traditional character set is measured in the tens of thousands, as many words can be formed through the compounding of simpler words. But many of the characters are used so rarely that dictionaries commonly omit them.
(2): The qualifiers "approximately" and "roughly" are needed because of the problems we encountered with floating-point imprecision. The scatter matrix was ill-conditioned for finding eigenvalues and eigenvectors, and several of the eigenvalues were very close to zero.
(3): In Figure 5.3.a, the characters recognized without error are of the form $y = x$, since the number correct is the number present. In Figure 5.3.b, these same characters have a miss rate of 0, and their log becomes $-\infty$.
(4): We used a standard coding scheme known as the GB code, analogous to ASCII. Unfortunately, GB codes are only defined for the simplified character set. For the traditional character set, an encoding called Big 5 was used.