Empirical Methods in Natural Language Processing: What's Happened Since the First SIGDAT Meeting?

Kenneth Ward Church
AT&T Labs-Research
180 Park Ave
Florham Park, NJ 07932-0971
kwc@research.att.com
http://www.research.att.com/~kwc

The first workshop on Very Large Corpora was held just before the 1993 ACL meeting in Columbus, Ohio. The turnout was even greater than anyone could have predicted (or else we would have called the meeting a conference rather than a workshop). We knew that text analysis was a "hot area," but we didn't appreciate just how hot it would turn out to be.

The 1990s witnessed a resurgence of interest in 1950s-style empirical and statistical methods of language analysis. Perhaps the most immediate reason for this empirical renaissance is the availability of massive quantities of data: text is available like never before. Just ten years earlier, the one-million-word Brown Corpus (Francis and Kucera, 1982) was considered large; these days, everyone has access to the web. Experiments are routinely carried out on many gigabytes of text, and some researchers are even working with terabytes.

The big difference since the first SIGDAT meeting in 1993 is that large corpora are now having a big impact on ordinary users. Web search engines/portals are an obvious example.