Neural Network Classifiers for Optical Chinese Character Recognition


Richard Romero, Robert Berger, Robert Thibadeau, and Dave Touretzky

Imaging Systems Lab, Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213

Abstract

We describe a new, publicly accessible Chinese character recognition system based on a nearest-neighbor classifier that uses several techniques to improve its performance. To increase throughput, a 400-dimensional feature space is compressed to 100 dimensions through multiple discriminant analysis. Recognition accuracy is improved by scaling these dimensions to achieve uniform variance. Two neural network classifiers are compared using the new feature space: Kohonen's Learning Vector Quantization and Geva and Sitte's Decision Surface Mapping. Experiments with a 37,000-character ground-truthed dataset show performance comparable to other systems in the literature. We are now employing noise and distortion models to quantify the robustness of the recognizer on realistic page images.

1 Introduction


2 Character Features

3 Feature Transformation

4 Neural Network Classifiers

Both neural classification schemes that we used are essentially nearest-neighbor prototype matching. The difference between the two lies solely in the training algorithms that are used to determine the prototypes.
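To make the distinction concrete, the following is a minimal Python sketch of the two training rules as they are commonly described in the literature, not the implementation used in this system. It assumes the LVQ1 variant of Learning Vector Quantization, Euclidean distance, a fixed learning rate, and prototypes stored as (vector, label) pairs; all function and parameter names are hypothetical. Classification is identical for both schemes, and only the update step differs.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest(prototypes, x, keep=lambda label: True):
    """Index of the closest prototype whose label passes `keep`, or None."""
    candidates = [i for i, (_, label) in enumerate(prototypes) if keep(label)]
    if not candidates:
        return None
    return min(candidates, key=lambda i: dist(prototypes[i][0], x))


def classify(prototypes, x):
    """Nearest-neighbor decision: the label of the closest prototype."""
    return prototypes[nearest(prototypes, x)][1]


def move(vec, x, rate):
    """Shift vec in place toward x (rate > 0) or away from x (rate < 0)."""
    for k in range(len(vec)):
        vec[k] += rate * (x[k] - vec[k])


def lvq1_step(prototypes, x, y, rate=0.1):
    """LVQ1 (assumed variant): always update the winning prototype,
    toward x if its label matches y, away from x otherwise."""
    vec, label = prototypes[nearest(prototypes, x)]
    move(vec, x, rate if label == y else -rate)


def dsm_step(prototypes, x, y, rate=0.1):
    """Decision Surface Mapping: update only when x is misclassified."""
    i = nearest(prototypes, x)
    if prototypes[i][1] == y:
        return  # correctly classified, so nothing changes
    move(prototypes[i][0], x, -rate)  # push the wrong winner away
    j = nearest(prototypes, x, keep=lambda label: label == y)
    if j is not None:
        move(prototypes[j][0], x, rate)  # pull the nearest correct prototype closer


# Toy usage: two 2-D prototypes, one training sample of class "B"
# that is currently closer to the "A" prototype.
protos = [([0.0, 0.0], "A"), ([1.0, 1.0], "B")]
dsm_step(protos, [0.4, 0.4], "B")
```

In this sketch LVQ1 adjusts a prototype on every training sample, whereas DSM leaves the prototypes untouched unless the sample is misclassified, so DSM spends its updates on samples near the decision boundary.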

5 Recognition Results


6 Future Work



References


Duda, Richard; Hart, Peter: Pattern Classification and Scene Analysis 1973, 114-121. New York: John Wiley & Sons.

Geva, Shlomo; Sitte, Joaquin: Adaptive Nearest Neighbor Pattern Classification, IEEE Transactions on Neural Networks, 1991, Vol. 2, No. 2.

Kohonen, Teuvo: Self-organization and Associative Memory, 2nd ed., 1988, 199-202.

Stallings, William: Approaches to Chinese Character Recognition. Pattern Recognition 1976, Vol. 8, 87-98.

Suchenwirth, Richard; Guo, Jun; Hartmann, Irmfried; Hincha, Georg; Krause, Manfred; Zhang, Zheng: Optical Recognition of Chinese Characters. Advances in Control Systems and Signal Processing 1989, Vol. 8. Braunschweig: Friedr. Vieweg & Sohn.

Xiandai Hanyu Pinlu Cidian: Frequency Dictionary of the Modern Chinese Language. Languages Institute Press 1986.

Zhang, Zheng: A Structural Analysis Method for Constrained Handprinted Chinese Character Recognition. Journal of the Beijing Institute of Technology 1987.1, 1-7.

Appendix A: Image Samples


Figure A.1 shows some of the incorrectly classified characters. Figures A.2 and A.3 show portions of two of the test pages used, magnified 1.5 times.

FIGURE A.2
FIGURE A.3

Appendix B: FTP Instructions


A demonstration version of our system is available via anonymous FTP. To retrieve it, open a connection to host ftp.cs.cmu.edu (128.2.206.173) and log in as user anonymous, with userid@host as the password. Then cd to /afs/cs/project/pcvision/ChineseOCR. Note: you must cd directly to this directory; the parent directories will not be accessible to anonymous FTP users. Remember to select image (binary) mode before retrieving compressed or executable files.
There are subdirectories for document images, fonts, and the demo application, presently a SunOS 4.x SPARC executable. Versions for other platforms may be provided at a later date. A full description of the directory's organization can be found in the README file.


Footnotes

(1)
The actual number of dictionary entries for the traditional character set is measured in the tens of thousands, as many words can be formed through the compounding of simpler words. But many of the characters are used so rarely that dictionaries commonly omit them.
(2)
The qualifiers "approximately" and "roughly" reflect the problems we encountered with floating-point imprecision. The scatter matrix was ill-conditioned for finding eigenvalues and eigenvectors, and several of the eigenvalues were very close to zero.
(3)
In Figure 5.3.a, the characters recognized without error are of the form , since the number correct is the number present. In Figure 5.3.b, these same characters have a miss rate of 0, and their log becomes .
(4)
We used a standard coding scheme known as the GB code, analogous to ASCII. Unfortunately, GB codes are only defined for the simplified character set. For the traditional character set, an encoding called Big 5 was used.