From vosse@ruls41.LeidenUniv.nl Fri Sep 30 17:23:10 EDT 1994
Article: 2167 of comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!europa.eng.gtefsd.com!howland.reston.ans.net!EU.net!sun4nl!news.nic.surfnet.nl!highway.LeidenUniv.nl!ruls41.LeidenUniv.nl!vosse
From: vosse@ruls41.LeidenUniv.nl (Theo Vosse)
Newsgroups: comp.ai.nat-lang
Subject: Language identification: Summary
Date: 28 Sep 1994 10:53:46 GMT
Organization: Leiden University, The Netherlands
Lines: 187
Distribution: world
Message-ID: <36bhvq$fbc@highway.LeidenUniv.nl>
Reply-To: vosse@ruls41.LeidenUniv.nl (Theo Vosse)
NNTP-Posting-Host: ruls41.leidenuniv.nl

Hi,

Recently, I asked for references to language identification algorithms
on the CORPORA mailing list (for information on how to subscribe:
contact corpora-request@uib.no). Below, you will find a summary of the
responses with only a few duplicates. Some papers are directly
available from the authors, which once again shows the power of the
network and the usefulness of this mailing list.

Ciao,

Theo Vosse
----------
Unit for Experimental Psychology
University of Leiden
The Netherlands

------------------------------------------------------------------------
From: Jeff Reynar

Here are a couple more references for you:

Ziegler, Douglas-Val. The Automatic Identification of Languages Using
Linguistics Recognition Signals. SUNY Buffalo Ph.D. Dissertation, 1987.

Sibun, Penelope and A. Lawrence Spitz. Language Determination: Natural
Language Processing from Scanned Document Images. To appear in
Proceedings of the Fourth Applied Natural Language Processing
Conference, Stuttgart, Germany, 1994.

The first paper has references for lots of other papers.

Jeff

------------------------------------------------------------------------
From: Gregory.Grefenstette@xerox.fr (Gregory Grefenstette)

From: cavnar@erim.org (Bill Cavnar)

[Cavnar94a] Cavnar, William B., and Trenkle, John M., "N-Gram-Based
Text Categorization," proceedings of The 1994 Symposium On Document
Analysis and Information Retrieval, University of Nevada, Las Vegas,
pp. 161-176.

which describes a very successful method of language identification
for ASCII text.

Alternatively, you can try getting it by ftp:

    ftp ftp.erim.org
    anonymous
    cd outgoing
    get cavnar_ngram_text_cat.ps
    quit

Let me know if you have problems.

-Gregory Grefenstette

--Bill Cavnar (cavnar@erim.org)

------------------------------------------------------------------------
From: paulwu@iss.nus.sg (Paul Wu)

Dear Theo,

Here is one reference I have, "N-Gram-Based Text Categorization", by
William B. Cavnar and John M. Trenkle in Symposium on Document Analysis
and Information Retrieval in Las Vegas. The author also has software
copy on line, the following is the direction to get it.

Cheers,
Paul

------------------------------------------------------------------------
From: Penni Sibun

try:

Sibun, Penelope and A. Lawrence Spitz. Language Determination: Natural
Language Processing from Scanned Document Images. To appear in
Proceedings of the Fourth Applied Natural Language Processing
Conference, Stuttgart, Germany, October 1994.

there are many other papers referenced in the above; however, i don't
currently have a plain text version of the references to post. i could
send a postscript or paper copy of our paper to anyone who wants it.

--penni

Penelope Sibun
Member of the Research Staff
Fuji Xerox Palo Alto Laboratory
3400 Hillview Avenue
Palo Alto CA 94304
(415) 813-7772
(415) 813-7081 (fax)
sibun@pal.xerox.com

------------------------------------------------------------------------
From: Thomas Edward Raffill

I believe there are several well-established techniques.
I recommend the following papers:

Vitale, Tony, "An Algorithm for High Accuracy Name Pronunciation by
Parametric Speech Synthesizer," Computational Linguistics, Vol. 17,
No. 3, 1991, p. 257-275.

This article describes some statistical and trigram techniques. It
contains a reference to a paper called "Language Identification with
Neural Networks" by Cole, Inouye, Muthusamy, and Gopalakrishnan, in
Proceedings, IEEE Pacific Rim Conference on Communications, Computers
and Signal Processing, 1989. I haven't seen that paper.

Oshika, Machi, Evans, and Tom, Feb. 1988, "Computational Techniques
for Improved Name Search," Proceedings of the 2nd Annual Applied
Natural Language Conference, Austin, Texas, ACL, 203-210.

This article describes using an HMM model for the classification.

I seem to remember one other article called something like "Triphone
Analysis" which described a technique similar to the trigram method
except that it uses phonetic information too, but I can't find the
reference right now.

Good luck in your research.

T. Raffill
raffill@holonet.net

------------------------------------------------------------------------
From: E S Atwell

Hi Theo,

This may be relevant:

Gavin Churcher, Judith Hayes, John Hughes, Stephen Johnson and Clive
Souter, "Bigram and Trigram models for language identification and
classification" in L. Evett and T. Rose, eds, Computational Linguistics
for Speech and Handwriting Recognition: Proceedings of the AISB'94
Workshop, University of Leeds, UK, 1994.

In case you have difficulty getting hold of these Proceedings, I'll ask
Gavin to email you a postscript version of this paper

regards,
eric

------------------------------------------------------------------------
From: zuijlenj@verdi.sra.com (Job van Zuijlen)

An article of interest might be:

Vitale, Tony (1991): "An Algorithm for High Accuracy Name Pronunciation
by Parametric Speech Synthesizer", Computational Linguistics, Vol 17,
No 3, pp. 257-276.
It describes a trigram approach to analyzing words (names in this case)
to determine their ethnic origin. The application is a speaking phone
book.

Job van Zuijlen
SRA Corporation
Arlington, VA 22201
USA

------------------------------------------------------------------------
From: ted@crl.nmsu.edu (Ted Dunning)

here is an internal tech report i wrote recently. it has been submitted
as a journal article.

[shar file deleted: available from the author or from me]

------------------------------------------------------------------------