Experiments in Information Retrieval
from Spoken Documents


A. G. Hauptmann, R. E. Jones, K. Seymore, S. T. Slattery,

M. J. Witbrock*, and M. A. Siegler


School of Computer Science

Carnegie Mellon University

Pittsburgh, PA 15213-3890

*Justsystem Pittsburgh Research Center

4616 Henry St.

Pittsburgh, PA 15213



This paper describes the experiments performed as part of the TREC-97 Spoken Document Retrieval Track. The task was to pick the correct document from 35 hours of recognized speech documents, based on a text query describing exactly one document. Among the experiments we describe here are: vocabulary size experiments to assess the effect of words missing from the speech recognition vocabulary; experiments with speech recognition using a stemmed language model; using confidence annotations that estimate the correctness of each recognized word; and using multiple hypotheses from the recognizer. Finally, we also measured the effects of corpus size on the SDR task. Despite fairly high word error rates, information retrieval performance was only slightly degraded for speech recognizer transcribed documents.


For the first time, the 1997 Text REtrieval Conference (TREC-97) included an evaluation track for information retrieval on spoken documents. In this paper, we describe some experiments for spoken document retrieval, with details of both the speech recognition system and the information retrieval engine.

The SDR Data

The speech data were identical to the training data used in the 1997 ARPA Speech Recognition Workshop HUB-4 broadcast news evaluations [11]. The main difference lay in the split between training and testing data; here roughly half of the material was reserved for the test data and only half the total acoustic data was used for training the acoustic models. There were three "versions" of the data available from NIST: A manually generated transcript (which also contained some errors), a speech recognition transcript provided by IBM, and the raw audio data, to be transcribed by our own recognizer. There were about 1200 news stories in the training data set and 1451 in the test set.

Information Retrieval Scoring Metrics

The IR task consisted of a list of queries, for each of which one or more relevant documents were to be returned by the IR system. The test queries were designed to simulate a known-item retrieval task. For each query, only one document was considered relevant for the purposes of this evaluation. While other documents may have had some relevance to the query, only the document it was designed to retrieve was scored as a correct retrieval. To measure the effectiveness of the IR system, we report the inverse average inverse rank (IAIR):

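$$\mathrm{IAIR} = \frac{N}{\sum_{i=1}^{N} \frac{1}{r_i}}$$

where $r_i$ is the rank at which the correct document for query $i$ was returned and $N$ is the number of queries.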

One characteristic of the IAIR is that it rewards correct documents near the top more than documents in the middle or towards the end of the rankings. Both average rank and IAIR score 1.0 for a perfect retrieval and larger numbers for less than perfect retrievals. However, using the average rank metric, the difference between returning a document at rank 100 versus rank 200 is large, whereas this difference is almost negligible for the IAIR metric. At the other end of the scale, the difference between returning a document at rank 2 versus rank 10 is small for the average rank, but large for IAIR. In real life situations, where users' time is valuable, closeness to the top is more critical than the average rank over all items returned.
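The contrast between the two metrics can be checked numerically. The following is a minimal sketch that reads IAIR literally as the inverse of the average of the inverse ranks; the function names and the toy rank lists are illustrative assumptions, not part of the evaluation software.

```python
def average_rank(ranks):
    """Arithmetic mean of the ranks at which the correct documents were returned."""
    return sum(ranks) / len(ranks)


def iair(ranks):
    """Inverse average inverse rank: N divided by the sum of 1/rank."""
    return len(ranks) / sum(1.0 / r for r in ranks)


# Three queries answered at rank 1, plus one query whose target lands deeper.
for last in (2, 10, 100, 200):
    ranks = [1, 1, 1, last]
    print(last, round(average_rank(ranks), 3), round(iair(ranks), 3))
```

Moving the last document from rank 100 to rank 200 changes the average rank by 25 but changes the IAIR only in the third decimal place, whereas moving it from rank 2 to rank 10 shifts the IAIR visibly.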

The Speech Recognition Component

The Sphinx-III speech recognition system was used for the CMU TREC SDR evaluation. It was configured similarly to that used in the 1996 DARPA CSR evaluation [9], although several changes have been made to the recognizer since then. Sphinx-III is a large vocabulary, speaker independent, fully continuous hidden Markov model speech recognizer with separately trained acoustic, language and lexical models.

For the current evaluation a gender-independent HMM with 6,000 senonically-tied states and 16 diagonal-covariance Gaussian mixtures was trained on a union of the CSR Wall Street Journal corpus and the 1996 TREC-6 training set.

The decoder used a Katz-smoothed trigram language model [12] trained on the 1992-1996 Broadcast News Language Modeling (BN LM) corpus [11]. This is a fairly standard language model, much like those that have been used by the DARPA speech recognition community for the past several years. As a space optimization, singleton trigrams and bigrams were excluded. As a new feature, this language model incorporated cross-sentence-boundary trigrams to better model utterances containing more than one sentence.

The lexicon was chosen from the most common words in the corpus, at a size that balanced the trade-off between leaving words out-of-vocabulary and introducing acoustically confusable words [8]. For this evaluation, the vocabulary was comprised of the most frequent 51,000 words in the BN LM corpus, supplemented by some 200 multi-word phrases and some 150 acronyms. The vocabulary size was initially based on our experience with broadcast news, and a subsequent careful analysis of the trade-off showed that this choice was a good one. More details of the trade-off involved in vocabulary selection are provided below.

Compared with the earlier Sphinx-II speech recognition system, Sphinx-III boasts a higher accuracy but at significant computational cost. To achieve a lower word error rate of 27.4% versus 45.9% for Sphinx-II on a subset of the training data, the original Sphinx-III system processing time increased to 120 times real time on a 266 MHz DEC Alpha compared with only 1.4 times real time for Sphinx-II. By reducing the beam width of the search and by optimizing the space required, the Sphinx-III processing time was reduced to about 30 times real time, with only a slight loss in word transcription accuracy. The 75% speedup resulted in about a 10% increase in relative word error rate. Decoding the audio files in the test data required about 1000 hours of CPU time.

The Information Retrieval Component

Both documents and queries were processed using the same conditioning tools: noise filtering, stopword removal, and term stemming.
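As a rough illustration of such a conditioning pipeline, the sketch below applies the same three steps to any text. The stopword list, the punctuation-based noise filter and the crude suffix stripper are all stand-in assumptions; they are not the tools used in the evaluation, and a real system would use a full stemmer such as Porter's [6].

```python
import re

# Illustrative stopword list (assumption), far smaller than a real one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}
# Crude stand-in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def condition(text):
    """Noise filtering, stopword removal and stemming, in that order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # keep alphanumeric runs only
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


print(condition("The reporter was smoking?"))
```

Because queries and documents pass through the same function, a query token such as "Smoking?" and a document token "smoking" end up as the same index term.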

A heavily stripped down core of the CMU Informedia SEIDX engine [10] was used to compare queries with documents. A relevance score was created for each pair according to the following equation:

Query term frequency for vocabulary word i
Document term frequency for vocabulary word i
Inverse document frequency for vocabulary word i
Sign of value function (0 if 0, 1 if positive)
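The exact scoring equation is not shown above. As a rough illustration only, the sketch below combines three of the listed quantities (query term frequency, document term frequency and inverse document frequency) in one common tf-idf style product. This particular combination, the logarithmic IDF and all names are assumptions and should not be read as the SEIDX formula.

```python
import math
from collections import Counter


def idf(term, documents):
    """Log of (number of documents / number of documents containing the term)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def relevance(query, doc, documents):
    """Sum over query terms of query tf * document tf * idf (assumed form)."""
    qtf, dtf = Counter(query), Counter(doc)
    return sum(qtf[t] * dtf[t] * idf(t, documents) for t in qtf)


# Toy collection of already conditioned (stopped and stemmed) documents.
docs = [["fire", "torch", "build"], ["smok", "ban", "build"], ["build", "permit"]]
query = ["torch", "build"]
ranked = sorted(range(len(docs)), key=lambda i: relevance(query, docs[i], docs), reverse=True)
print(ranked)
```

In this toy collection the term "build" occurs in every document, so its IDF is zero and only "torch" separates the documents.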

3. Official TREC-6 SDR Results

Table 1 shows the official CMU TREC SDR results. Since the transcriptions were subject to filtering by stopword removal and stemming as discussed above, the word error rates were reported for both the unfiltered and filtered references and hypotheses. An analysis of the results showed several preprocessing errors and confirmed an insight into the relationship between word error rate and information retrieval.

Table 1: Performance of the CMU TREC-6 SDR Evaluation System according to the NIST scoring system on 49 queries. The filtered word error rate (WER) reflects the effect of stopword removal and stemming.

Vocabulary Coverage

The words that were in the queries but were missing from the speech recognizer's 51,000-word vocabulary were "CIA", "TORCHED?", "SMOKING?", "WELL-KNOWN", and "GOLDFINGER". These problems are primarily due to inconsistencies in the preprocessing phases. While "C.I.A." was in the vocabulary, "CIA" was not, resulting in a completely missed word during information retrieval. Similarly, an oversight in the preprocessing phase allowed the question mark to become part of the word in "torched?" and "smoking?". For "well-known", each of the component words "well" and "known" was in the vocabulary, but the compound "well-known" was not there as a single token, and thus was treated as an irretrievable word. The only true missing word in our 51,000-word vocabulary was "Goldfinger". Thus the 51,000-word vocabulary selection provided excellent coverage for this test evaluation.
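The preprocessing mismatches described above can be avoided by mapping query tokens and vocabulary entries to a shared canonical form before comparing them. The sketch below is one hypothetical way to do this; the normalization rules and the five-word toy vocabulary are assumptions chosen to mirror the examples in the text.

```python
def key(token):
    """Canonical form: upper case with acronym periods removed ("C.I.A." -> "CIA")."""
    return token.upper().replace(".", "")


def normalize(token):
    """Strip stray punctuation and split hyphenated compounds into their parts."""
    token = token.strip("?!,;:\"'")
    return [key(part) for part in token.split("-") if part]


def out_of_vocabulary(query_tokens, vocabulary):
    """Return the normalized query words that have no match in the vocabulary."""
    known = {key(word) for word in vocabulary}
    return [w for raw in query_tokens for w in normalize(raw) if w not in known]


vocabulary = {"C.I.A.", "TORCHED", "SMOKING", "WELL", "KNOWN"}  # toy subset
queries = ["CIA", "TORCHED?", "SMOKING?", "WELL-KNOWN", "GOLDFINGER"]
print(out_of_vocabulary(queries, vocabulary))
```

With this normalization only "GOLDFINGER" remains unmatched, which agrees with the observation that it was the only truly missing word.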

Recognition Accuracy versus Information Retrieval Quality

The official TREC results confirmed that vastly reduced word error rates translate into slight improvements in information retrieval. Comparing the performance on the baseline IBM speech recognition data with that on the CMU speech recognition output, on the filtered texts, we found that nearly doubling the filtered word error rate led to only a 14% decrease in information retrieval effectiveness as measured by IAIR.
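For reference, word error rate is conventionally computed as the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal version of that standard calculation. It is not the NIST scoring tool, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A filtered WER of the kind reported in Table 1 would apply the same function after both strings have been passed through stopword removal and stemming.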

4. Experiments

Some of the experiments described here were performed before the actual test data with queries was available from NIST. In order to allow meaningful experiments to be performed on the TREC-6 training data, 1167 documents were selected from the set and known-item retrieval style queries were generated for 374 of them by hand. In some of the very early experiments, a much smaller test set composed of only 103 broadcast news stories with associated known-item queries from a privately collected corpus was added to the 1167 documents to permit initial investigation of ideas involving the speech recognition configuration. We shall refer to this latter test set as the "small test set."

Vocabulary Size Experiments

Prior to the evaluation we attempted to find a good vocabulary size that was optimized for both speech recognition and information retrieval. We chose three different vocabulary sizes, 40,000, 51,000 and 64,000 words, constructed a language model for each one, and then performed speech recognition. Table 2 shows that as the vocabulary got larger, the rate of out-of-vocabulary words decreased, but beyond 51,000 words speech recognition accuracy did not improve. Additional vocabulary coverage was thus obtained only at the cost of adding many acoustically confusable words, and information retrieval effectiveness decreased slightly. We chose to use the 51,000-word vocabulary for our official TREC submission. As explained in the analysis of vocabulary coverage above, this vocabulary size left only one unrecognizable word among the terms used in the 49 test queries. This experiment was performed prior to the official TREC submission on the 103 queries that constituted our in-house development test set.

Vocabulary Size    Out-of-Vocabulary Rate    Word Error Rate
40k words          1.13%                     26.4%
51k words          0.83%                     26.8%
64k words          0.75%                     26.8%

Table 2: Effect of Vocabulary Size on System Performance. This experiment was performed on the "small" test set of 103 queries.

Stemmed Language Models

Using the small test set described above and the 51,000-word vocabulary, we also investigated the concept of language modeling tailored specifically to information retrieval. Since the words in the recognition output are stemmed before being used for IR, distinctions between different forms of a stem are irrelevant to the IR system. In an attempt to take advantage of this observation, a language model was built from a stemmed version of the LM training data. Each root word in the language model had multiple "pronunciations" in the lexicon to reflect the original, unstemmed, forms.

For example, suppose the root forms of the words "recognize", "recognized", and "recognition" all map into the common root "recogni"+suffix, where the suffix in this case is either "ze", "zed", or "tion". The stemmed language model would provide only one transition from the root "recogni" into words that can follow, in effect collapsing multiple paths between individual words into one path between root words. The lexicon would reflect the alternate inflected forms as alternate pronunciations of the root word, i.e.

Recogni R EH K AX G N AY Z

Recogni(2) R EH K AX G N AY Z DD

Recogni(3) R EH K AX G N IH SH AX N

The premise was that this stemmed language model would avoid much of the confusion due to acoustic variations in suffixes of words, but would aid in the correct recognition of the important roots of the words. Table 3 shows the results of these experiments. The word error rate of the stemmed language model was higher than for the baseline language model. The WER increased both if only stemmed words were counted, as well as when all original words were compared. Furthermore the information retrieval effectiveness (as measured by the inverse average inverse rank metric) also showed a decrease.
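The path-collapsing effect of stemming the language model training text can be seen on a toy corpus. The sketch below counts bigrams before and after mapping the three inflected forms from the example above to the root "recogni"; the three-sentence corpus and the hand-written stem map are illustrative assumptions.

```python
from collections import Counter

# Toy stem map taken from the "recogni" example (assumption: everything else is left as-is).
STEM = {"recognize": "recogni", "recognized": "recogni", "recognition": "recogni"}


def bigram_counts(sentences, stem=False):
    """Count adjacent word pairs, optionally after mapping words to their roots."""
    counts = Counter()
    for sentence in sentences:
        words = [STEM.get(w, w) if stem else w for w in sentence.split()]
        counts.update(zip(words, words[1:]))
    return counts


corpus = ["we recognize speech", "we recognized speech", "speech recognition works"]
plain = bigram_counts(corpus)
stemmed = bigram_counts(corpus, stem=True)
print(len(plain), len(stemmed))
```

Six distinct bigrams in the unstemmed text collapse to four in the stemmed text, and the counts for the shared root are pooled across its inflected forms.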

Language Model

Word Error Rate





26.8 %




35.1 %

23.8 %


Table 3: Using a language model built from stemmed LM training texts. This experiment was also done with the "small" 103-query in-house development test set.

Confidence Annotation

Since state-of-the-art speech recognition software does not produce a perfect transcript of what was said, we would like to obtain any extra information we can about the likelihood of correctness of particular words. This is akin to the situation in which a human annotator makes a guess at a word that was hard to hear, and marks that this word may have been mis-heard.

An ideal automatic confidence annotator would label each word produced by the speech recognizer with the label "correct" to indicate that this is in fact the word that was spoken, or "incorrect" to indicate that this word was not spoken. We will compare the results of our annotation to this ideal, which we call Perfect Annotation.

Features for Confidence Annotation

The confidence annotation we performed is based on work by Lin Chase [2], though annotation has been explored by many others including [3,4,5]. Typically confidence annotation is performed by taking information available about individual occurrences of words in the hypothesized text, from information produced within the speech recognizer, or outside the recognizer. These features are then automatically examined to find indicators of likely correctness and incorrectness. The candidate features we considered were:

Experimental Description - Confidence Annotation

For each set of features, the experiment proceeds as follows:

We conducted experiments by splitting the training data into two sections, training our decision tree on one half, testing on the other half, then reversing the roles.

Decision Tree Building

The decision tree building algorithm we use is C4.5 [7]. It functions by taking all training data, and attempting to find rules based on features which distinguish between classes. Each item of training data is a word along with its associated features (described above), and its class of correct or incorrect. The algorithm takes each feature in turn, asks a question about that feature, and uses the answer to partition the data. A feature is chosen if it has high information gain, i.e. if the resulting two groups of data contain less of a mix of correct and incorrect examples. The ideal split would create classes that contain exclusively correct or exclusively incorrect examples.

Since such ideal splits are rare, the decision tree building halts when no more information gain (reduction in entropy) can be achieved. At this point, each leaf of the tree contains examples which have all the same features for questions asked at each partition, and which are mostly of one class. The proportion of correct examples at this node is the probability of correctness that will be assigned to any word with the same features.

When using the decision tree to classify a new word, we check each of its features to find which leaf-node of the decision tree to classify it into. At that point, it is classified as having the probability of correctness corresponding to this leaf node.
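The splitting criterion and the leaf probability described above can be sketched in a few lines. This is a simplified illustration of information gain for a single threshold question on one numeric feature, not an implementation of C4.5 itself; the feature name `acoustic_score` and the four training examples are hypothetical.

```python
import math


def entropy(labels):
    """Entropy in bits of a list of booleans (True means the word was correct)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(examples, feature, threshold):
    """Entropy reduction from asking: is examples[feature] <= threshold?"""
    labels = [correct for _, correct in examples]
    left = [c for f, c in examples if f[feature] <= threshold]
    right = [c for f, c in examples if f[feature] > threshold]
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - remainder


def leaf_probability(labels):
    """Probability of correctness assigned to any word that reaches this leaf."""
    return sum(labels) / len(labels)


# Hypothetical training items: (features, is_correct).
examples = [({"acoustic_score": 0.9}, True), ({"acoustic_score": 0.8}, True),
            ({"acoustic_score": 0.3}, False), ({"acoustic_score": 0.2}, False)]
print(information_gain(examples, "acoustic_score", 0.5))
```

Here the threshold separates the classes perfectly, so the gain equals the full one bit of entropy in the unsplit data. A leaf containing three correct words and one incorrect word would assign a probability of correctness of 0.75.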

Evaluating Confidence Annotation: Cross-Entropy Reduction

The most common method of evaluating word confidence annotation is cross-entropy reduction. Cross-entropy is a measure of how well our model of the probability of word correctness corresponds to Perfect Annotation (as defined above). If our model annotates perfectly, its cross-entropy is 0. The worse the annotation performs the higher the cross-entropy.

The most naive form of confidence annotation we can perform is to tag each word with a probability of correctness equal to the overall word-accuracy. Thus if we know that our recognizer generally gets 80% of words correct, the baseline confidence annotator assigns each word an 80% probability of correctness. We then measure the quality of our annotation by measuring how much better it performs than this baseline.

Actual probability that word i is incorrect
Predicted probability that word i is incorrect

Thus we attain a figure for cross-entropy for the default model of classifying each word as correct with probability equal to the word-accuracy, and score our improvements in modeling the probability of correctness by how much they reduce cross-entropy as a percentage of this baseline.
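The baseline comparison can be made concrete with a short calculation. The sketch below uses the standard mean negative log-likelihood form of cross-entropy, expressed in terms of the probability of correctness (equivalent to working with the probability of incorrectness). The exact formula and logarithm base used in the evaluation are not shown above, so this form, the five-word example and the model probabilities are assumptions.

```python
import math


def cross_entropy(actual_correct, predicted_correct):
    """Mean negative log2-likelihood of the true labels under the predictions."""
    total = 0.0
    for is_correct, p in zip(actual_correct, predicted_correct):
        total -= math.log2(p if is_correct else 1.0 - p)
    return total / len(actual_correct)


def cross_entropy_reduction(actual_correct, predicted_correct):
    """Percentage reduction relative to tagging every word with the word accuracy."""
    accuracy = sum(actual_correct) / len(actual_correct)
    baseline = cross_entropy(actual_correct, [accuracy] * len(actual_correct))
    model = cross_entropy(actual_correct, predicted_correct)
    return 100.0 * (baseline - model) / baseline


actual = [True, True, True, True, False]   # 80% word accuracy
model = [0.9, 0.9, 0.9, 0.9, 0.3]          # hypothetical annotator output
print(round(cross_entropy_reduction(actual, model), 1))
```

A perfect annotator has a cross-entropy of 0 and therefore a reduction of 100%, while the baseline annotator scores 0% by construction. Predictions of exactly 0 or 1 on a mislabeled word would make the logarithm undefined, so a practical implementation would clip the probabilities.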

Information Retrieval Using Word Confidence Weights

First we describe two orthogonal ways of using word confidence weights in the relevance scheme described above:

Since typically, is very small when, we only take the product over terms for which the recognized word was w. Summing this value over all documents and dividing by the total number of documents gives us an approximate value of the expected document frequency for this word
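One reading of this computation is sketched below: within a document, the probability that word w truly occurs is taken as one minus the product, over the tokens recognized as w, of the probability that each is incorrect. These per-document values are then averaged over the collection. This interpretation, the function names and the example confidences are assumptions rather than a transcription of the original formula.

```python
def prob_word_in_document(confidences):
    """confidences: P(correct) for each token in the document recognized as w."""
    p_absent = 1.0
    for p in confidences:
        p_absent *= 1.0 - p   # every recognized occurrence would have to be wrong
    return 1.0 - p_absent


def expected_document_frequency(per_document_confidences):
    """Average, over all documents, of the probability that w occurs in the document."""
    n = len(per_document_confidences)
    return sum(prob_word_in_document(c) for c in per_document_confidences) / n


# Three documents: one confident hit, two uncertain hits, and no hits at all.
print(expected_document_frequency([[0.9], [0.5, 0.5], []]))
```

Documents in which w was never recognized contribute zero, which corresponds to restricting the product to terms for which the recognized word was w.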

Oracle Experiments

Since the interaction between confidence annotation and information retrieval may be complex, we also conducted an experiment to see how we could make use of confidence scores in the idealized case in which we know exactly which words are correct, and which are incorrect. We removed words in two different ways:

Table 4 shows that for both training and testing sets, the Post-Filter Oracle annotation was able to significantly reduce the IR error of the decoded transcripts. This indicates that a more realistic experiment might be able to do this as well.

We performed an analysis of some of the differences between documents in the stemmed oracle experiment and the reference information retrieval experiments. We should expect the number of query words in the correct document to decrease, since oracle confidence annotation cannot correct for substitutions and deletions, but will drop all incorrectly substituted and inserted words. A cursory glance at documents and queries revealed that some documents contain more query words as speech hypotheses than the corresponding reference transcription. Our intuition here is that speech recognition can occasionally correct for spelling errors in the references, and so words that are incorrect with respect to the reference transcription may be correct for the purposes of information retrieval.


Table 4: Baseline and Oracle Annotation on TREC-6 Training and Testing Sets. Values are IAIR. (Column headings: Baseline Performance, Oracle Annotation, Reference Transcripts, Decoded Transcripts; rows: Training Set, Testing Set.)

Information Retrieval Experiments for Confidence Annotations

In order to see how well cross-entropy reduction translates into gains in information retrieval accuracy, we conducted a series of experiments. Since we also hoped to find the best way of incorporating weights into information retrieval we performed the following information retrieval experiments:

Table 5: Confidence Annotation Performance on TREC-6 Training and Testing Sets. Values are IAIR.

The results of these experiments are found in Table 5. Although the IAIR was reduced in most cases, the upper bound found in the Oracle Annotation was not attained.

6. Using N-best Lists for Information Retrieval

Typically, speech recognition systems produce a transcription of each spoken utterance in much the same way that a human transcriptionist might. However, the transcription used is only the most probable decoding of the acoustic signal, out of a large number of hypotheses that are considered during the recognition process. It is a relatively simple matter to obtain a list of these different hypotheses, ranked in order of decreasing likelihood.

Using these additional hypotheses seems promising for information retrieval, since it offers the hope of including terms that would otherwise be missed by the speech recognizer in documents, allowing them to match with query terms and increase document recall. On the other hand, words incorrectly identified in lower ranked recognition hypotheses may cause spurious matches with query terms, decreasing retrieval precision.

Experiments Using N-Best Lists

In the context of the TREC-6 SDR task, an initial attempt was made to evaluate retrieval effectiveness using n-best hypotheses lists generated from the speech recognition decoder lattice. N-Best hypotheses were generated for the 1451 stories in the TREC-6 SDR test data. Of these, decoding failed completely in four cases, resulting in empty transcriptions. For the remaining 1447 stories, lists of the two hundred most likely hypotheses were generated for each utterance. Table 6 shows an example of N-best hypotheses.

Ideally, one would use hypothesis probabilities generated during decoding to weight the terms during retrieval, but for this preliminary experiment, the n hypotheses for each utterance were simply concatenated together into one larger document. No discounting of weights for less probable hypotheses was done.
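The unweighted concatenation described here amounts to the following minimal sketch. The data layout (a story as a list of utterances, each a list of hypothesis strings ordered from most to least likely) and the example sentences are assumptions for illustration.

```python
def nbest_document(utterances, n):
    """Concatenate the top-n hypotheses of every utterance into one bag of words."""
    words = []
    for hypotheses in utterances:
        for hypothesis in hypotheses[:n]:   # no discounting of lower-ranked hypotheses
            words.extend(hypothesis.split())
    return words


# Hypothetical story with two utterances and two hypotheses each.
story = [["the mayor spoke", "the major spoke"],
         ["about taxes", "a bout taxes"]]
print(nbest_document(story, 1))
print(nbest_document(story, 2))
```

With n = 1 the result is the ordinary single-best transcript. Larger n adds alternative words such as "major" that may match query terms, but it also repeats words that recur across hypotheses, which inflates their term frequencies.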


Table 6: The top four hypotheses for utterance three of story j960531d.7, after stop word removal and stemming. Note that the fourth hypothesis is identical to the first after this processing; the two differed only in inflected forms.

The effect on retrieval effectiveness of using the documents generated from the n-best lists in the TREC-6 test set is illustrated in Table 7. Note that for N set at 50, the performance on the hypothesized transcripts is actually slightly lower than performance on the reference transcripts (1.332). This may again be due to the effects of misspellings in the reference transcripts. These results were obtained from the official NIST queries using the full TREC-6 SDR corpus. The 49 queries include the corrected transcription for the words "well-known", "C.I.A.", "smoking?", and "torched?". Thus the baseline at 1 hypothesis is slightly higher than the official number reported in Table 1.


Table 7: IR Performance of N-Best hypotheses on the TREC-6 test set. The 49 queries include the corrected transcription for the words "well-known", "C.I.A.", "smoking?", and "torched?". Thus the baseline at 1 hypothesis is slightly higher than the official number reported in Table 1.

While it is encouraging that an improvement in retrieval can be obtained at all by this method, it is clear that further work will be required if the promise of this idea is to be realized. In particular, the increasingly harmful effect of adding large numbers of less probable hypotheses to the documents suggests that discounting each hypothesized word by its recognition score may improve performance even more.

7. Scaling Collection Size

Many of our experiments, including some of the ones reported here, seem to suffer from two problems. The effect size of our experimental variables seems to be fairly small, and the difference between the reference text retrieval and the speech recognition transcript retrieval is only a few percent of the inverse average inverse rank. If this relationship holds even as we scale to larger, more realistic, and more useful collections, then we can consider the problem of spoken document retrieval practically solved to within a few percent of perfect text retrieval effectiveness.

To test this hypothesis using the TREC-6 training set, we increased the number of text documents in the corpus up to 14,000 and measured the inverse average inverse rank for the same retrieval queries. However, instead of actually performing speech recognition on the added documents, artificially degraded texts were used. In this case, the degradation method attempted to model word errors only through deletion of term words. Although this is a primitive model of speech recognition errors, it may represent an upper performance bound.
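A deletion-only degradation of this kind can be sketched as follows. The deletion rate, the independent per-word sampling and the fixed seed are assumptions, since the text does not say how the deleted terms were chosen.

```python
import random


def degrade(tokens, deletion_rate, seed=0):
    """Drop each term independently with probability deletion_rate."""
    rng = random.Random(seed)   # fixed seed so the degraded corpus is reproducible
    return [t for t in tokens if rng.random() >= deletion_rate]


document = "officials said the fire was set deliberately late on friday".split()
print(degrade(document, 0.3))
```

Because words are only removed, never substituted or inserted, a degraded document cannot gain spurious query matches. This is one reason such a model is likely to be optimistic compared with real recognizer output.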

Figure 1 shows the relationship between the inverse average inverse rank information retrieval performance and the size of the document collection. As more documents are added to the collection, the gap between the reference (perfect text) retrieval and the speech recognition based retrieval grows. At collections larger than 10,000 documents the gap starts to widen significantly. We can therefore expect larger discrepancies between speech-transcribed and perfectly transcribed documents as collections grow, which may make spoken document retrieval unusable for collections of 100,000 documents or more.

8. Summary

There are several conclusions we can draw based on our experiments:

In general, most of our findings are very preliminary. While we believe we may have uncovered trends, the TREC-6 SDR track provided too little data for conclusive experiments, and we therefore did not conduct significance tests on the observed trends. Furthermore, the difference between the speech recognizer generated transcripts and the perfect text transcripts was too small in this corpus. However, the experiments we have done on increasing the scale of these document collections by orders of magnitude raise the concern that the initially promising results for SDR will not hold up in larger data sets.

Figure 1: Effect of collection size on IR performance of the TREC-6 training set with reference and artificially degraded documents. The X Axis is the number of documents used in the analysis, and the Y Axis is the IAIR.


  1. M.-Y. Hwang, "Subphonetic Acoustic Modeling for Speaker-Independent Continuous Speech Recognition". PhD Thesis, CMU-CS-93-230, Carnegie Mellon University, 1993.
  2. L. L. Chase, PhD thesis, Carnegie Mellon University Robotics Tech Report, 1997.
  3. S. Cox and R. Rose, "Confidence Measures for the Switchboard Database," IEEE International Conference on Acoustics, Speech and Signal Processing, 1996.
  4. L. Gillick and Y. Ito, "Confidence Estimation and Evaluation," LVCSR Hub-5 Workshop Presentation, 1996.
  5. P. Jeanrenaud, M. Siu, H. Gish, "Large Vocabulary Word Scoring as a Basis for Transcription Generation," Proceedings of Eurospeech, 1995.
  6. M. F. Porter, "An algorithm for suffix stripping," Program, 14(3):130-137, July 1980.
  7. J. R. Quinlan, C4.5: Programs for Machine Learning, San Francisco, Calif.: Morgan Kaufmann, 1993.
  8. K. Seymore, S. Chen, M. Eskenazi, and R. Rosenfeld. "Language and Pronunciation Modeling in the CMU 1996 Hub 4 Evaluation," Proc. Spoken Language Systems Technology Workshop. Morgan Kaufmann Publishers, 1997.
  9. M. Siegler, U. Jain, B. Raj, and R. Stern. "Automatic Segmentation, Classification, and Clustering of Broadcast News Audio," Proc. Spoken Language Systems Technology Workshop. Morgan Kaufmann Publishers, 1997.
  10. M. J. Witbrock, and A. G. Hauptmann, "Speech Recognition and Information Retrieval", Proceedings of the 1997 DARPA Speech Recognition Workshop, Chantilly, VA, February 2-5, 1997.
  11. D. Graff, Z. Wu, R. MacIntyre and M. Liberman, "The 1996 Broadcast News Speech and Language-Model Corpus", Proceedings of the 1997 DARPA Speech Recognition Workshop, Chantilly, VA, February 2-5, 1997.
  12. S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3),400-401, March, 1987.
  13. G. Salton, Ed., "The SMART Retrieval System", Prentice-Hall, Englewood Cliffs, NJ, 1971.