Speech Recognition for a Digital Video Library
Michael J. Witbrock and Alexander G. Hauptmann
Abstract
The standard method for making the full content of audio and video material searchable is to annotate it with human-generated meta-data that describes the content in a way that the search can understand, as is done in the creation of multimedia CD-ROMs. However, for the huge amounts of data that could usefully be included in digital video and audio libraries, the cost of producing this meta-data is prohibitive. In the Informedia Digital Video Library, the production of the meta-data supporting the library interface is automated using techniques derived from artificial intelligence (AI) research. By applying speech recognition together with natural language processing, information retrieval and image analysis, an interface has been produced that helps users locate the information they want and navigate or browse the digital video library more effectively. Specific interface components include automatic titles, filmstrips, video skims, word location marking and representative frames for shots. Both the user interface and the information retrieval engine within Informedia are designed for use with automatically derived meta-data, much of which depends on speech recognition for its production. Some experimental information retrieval results will be given supporting a basic premise of the Informedia project: that speech recognition generated transcripts can make multimedia material searchable. The Informedia project emphasizes the integration of speech recognition, image processing, natural language processing and information retrieval to compensate for deficiencies in these individual technologies.
Keywords:
video browsing, information retrieval interfaces, speech recognition, News-On-Demand, multimedia indexing and search, Informedia, artificial intelligence, automatic text summarization, video summarization, digital library
Introduction to Informedia
Vast digital libraries of video and audio information are becoming available on the World Wide Web as a result of emerging multimedia computing technologies. However, it is not enough simply to store and play back information as many commercial video-on-demand services intend to do. New technology is needed to organize and search these vast data collections, retrieve the most relevant selections, and permit them to be reused effectively.
Through the integration of technologies from the fields of natural language understanding, image processing, speech recognition and video compression, the Informedia digital video library system [Christel-94a,b][Wactlar96][Informedia95] allows a user to explore multimedia data in depth as well as in breadth. An overview of the system is shown in Figure 1. Hours of video programming are segmented into small coherent pieces and indexed according to their multimedia content. Users can actively explore the information by finding sections of content relevant to their search, rather than by following someone else’s path through the material or by serially viewing a single large chunk of pre-produced video. This active exploration is far more flexible than that provided by video-on-demand, where only one way of viewing the content is permitted. It is also more flexible than the interfaces provided by the current generation of educational CD-ROMs, where users follow a designed path through the material in a more or less passive manner. The goal in Informedia is to have the computer serve as more than just a sophisticated video delivery platform. The Informedia Digital Video Library provides the user with a tool with which to assemble, from a large corpus, an instructive set of video segments relevant to a particular information need. Using this tool, a large library of video material can be searched with very little effort.
The Informedia project is developing these new technologies and embedding them in a video library system primarily for use in education and training. To establish the effectiveness of these technologies, the project is establishing an on-line digital video library consisting of over a thousand hours of video material. In order to be able to process and search this volume of data, practical, effective and efficient tools are essential.
News-on-Demand [Hauptmann95a][Hauptmann96] is a particular collection in the Informedia Digital Library that has served as a proving ground for automatic library creation techniques. In News-on-Demand, complete automation is the principal goal. Motivated by the timeliness required of news data, and the volume of material to be indexed every day, the project has applied speech recognition, natural language processing and image understanding to the creation of a fully content-indexed library and to interactive querying. While this work is centered around processing news stories from TV broadcasts, the Informedia library creation process exemplified in News-on-Demand represents an approach that can make any video, audio or text data more accessible. This article will concentrate on the speech recognition and information retrieval aspects of automated library creation; natural language and image processing will only be covered in passing.
Content for News-on-Demand can be automatically captured off the air or via a DSS satellite receiver on a daily basis. Over the course of nearly two years, the system has captured four hundred and thirty three news broadcasts, of which three hundred and eighty two are television broadcasts stored as MPEG-1 files, and fifty one are radio broadcasts, stored as 16 kHz sixteen-bit digital audio files. These news broadcasts have been segmented into more than fourteen thousand individual news stories. In most cases, the system captures closed caption data along with the video broadcasts, and this caption data can be used along with the soundtrack to create a higher quality text transcript, thus improving searchability. For the radio broadcasts, the raw audio signal is all that the system can use to create a searchable transcript. In addition to storing, segmenting and indexing the data and its associated transcript, the system has also generated one line "headline" summaries for more than twenty three thousand individual news stories and produced more than two hundred thousand video "skims" (described below) at varying levels of detail. All processing for the News-on-Demand corpus has been fully automated; no human intervention is required.
Related Projects and Research
Most other attempts at solving the news retrieval problem by providing news databases have restricted the data to text material only. Video-on-demand systems allow a user to select, and pay for, a complete movie, but do not allow for ad-hoc search and retrieval within programs. An approximation to News-on-Demand can be found in the "CNN-AT-WORK" system offered to businesses by a CNN/Intel cooperation. At the heart of the CNN-AT-WORK solution is a digitizer that encodes the video into the INDEO compression format and transmits it to workstations over a local area network. Users can store headlines together with video clips and retrieve them later. However, this retrieval depends entirely on the separately transmitted, manually created, text "headlines" and the service does not include news sources other than CNN. In addition, CNN-AT-WORK does not feature an integrated multi-modal query interface [CNN-AT-WORK95].
Preliminary investigation into the use of speech recognition for analysis of a news story was carried out by Schäuble and Wechsler [Schäuble95]. Since they lacked a powerful speech recognizer, their approach used a phonetic engine that transformed the spoken content of the news stories into possibly erroneous phoneme strings. The query was also transformed into a phoneme string and the database searched for the best approximate match. Despite errors in recognition and word prefix and suffix mismatches, the system performed reasonably well, since these errors scatter evenly over all documents allowing the consistently high search scores of well-matching correct segments to dominate the retrieval.
Another news processing system that included video materials was the MEDUSA system [Brown95]. The MEDUSA news broadcast application could digitize and record news video and teletext transcriptions, which are equivalent to closed-captions. Instead of segmenting the news into stories, the system used overlapping windows of adjacent text lines for indexing and retrieval. During retrieval, the system responded to typed requests by returning an ordered list of the most relevant news broadcasts. Within a news broadcast, it was up to the user to select and play a region, using information provided by the system about the position of the matched keywords. The focus of MEDUSA was on the system architecture and the information retrieval component. No image processing and no speech recognition were performed. The system did, however, serve as the substrate for a later series of speech recognition and information retrieval experiments [Jones96]. Using a speech recognition system to extract words from spoken messages, these experiments evaluated information retrieval using a combination of word spotting based on rapid scanning of word lattices and whole word retrieval, in a video mail retrieval task. These latter experiments have not yet been extended to news or other broadcast data.
Other projects that seek to index and retrieve from video news sources include the Conceptually Indexed Video project at Sun [Woods96], which is attempting to build conceptual taxonomies of query terms to improve the quality of returned stories, and the VISION system at the University of Kansas which, while similar in aim to Informedia, is concentrating on the problems of compressing video data and delivering it over the Internet. It is also distinguished by its stated concentration on the use of pre-existing, mature, domain independent indexing technologies [Li96].
The Broadcast News Navigator (BNN) system [Maybury96][Mani96] has concentrated on the automatic segmentation of stories from news broadcasts using discourse structure. While a great deal of success has been achieved so far using heuristics based on stereotypical features of particular shows (e.g. "still to come on the NewsHour tonight…"), the longer term objective is to use multi-stream analysis of such features as speaker change detection, scene changes, appearance of music and so forth to achieve reliable and robust story segmentation. The system also aims to provide a deeper level of understanding of story content than is provided by simple full text search, by extracting and identifying, for example, all the named entities in a story.
The Informedia project, in as much as it involves the indexing of non-textual data, also bears similarities to projects such as QBIC [Flickner95], which applies both automatic image characterization and hand-annotation to images, and supports retrieval using image similarity. One of the more interesting features of the QBIC system is that it allows query by demonstration, with the user sketching the features desired in the retrieved image. A similar effort, which also encompasses some video material, is the Photobook system [Pentland94]. Photobook employs relatively sophisticated statistical characterizations of selected image features, such as faces, shapes and textures, to support accurate retrieval by image similarity. A final example of an image retrieval system is Chabot [Ogle95], a part of the Berkeley digital library project. This system includes an element of cross-modal operation, allowing users to search simultaneously in pre-existing annotations and color content characterizations of a large set of landscape images. This allows searches for objects such as "yellow flowers", that might not have been easily identified from the annotations or image qualities alone.
Creating an Informedia Library
The Informedia digital video library uses a combination of techniques from image processing, speech recognition, natural language processing and information retrieval. The integration of these techniques has permitted the construction of an effective interface to a digital video library, even though none of the techniques is completely reliable or error-free. Speech recognition is used for transcription and alignment, image processing is used for shot analysis and to identify representative frames, and natural language processing is used for summarization. Information retrieval allows the user to easily retrieve indexed material. Despite the imperfections in all the techniques used, and the problems inherent in working with raw broadcast data, a suite of navigation aids enabled by their use allows the user to quickly select and play back appropriate stories from the Informedia Digital Video Library.
The Informedia Digital video library system is composed of two parts: the Library Creation System and the Library Exploration Client. The Library Creation System for News-on-Demand can automatically capture one or more current news shows every night. Processing a news show for the library takes about 14 times real time on a DEC-AlphaStation 600 5/266 workstation with 256 Mbytes of memory. During library creation, the following major steps are performed:
At this point, a user with the Informedia Digital Library Client Software can access the library in a number of different ways and use different abstractions to navigate through the data it contains.
Exploring the Informedia Library
A user can type queries to the system or speak the queries in natural English. Speech recognition for the IDVL client queries is done with the Sphinx-II Speech Recognition System using a 20,000 word vocabulary based on North American Business News and modified to account for the typical phrasing of queries. Since an ad-hoc retrieval engine is used, requests for information can be posed in unconstrained English. Means for issuing simple Boolean queries are also provided, although they are seldom used. Users may refine a query by adding more words to their initial query. A variety of abstractions are available to aid users in browsing the video stories or paragraphs returned from a search of the library. The library exploration process will be described in more detail in the following sections. The roles and effects of speech, natural language and image processing techniques are illustrated in conjunction with the library exploration process.
None of the technologies underlying Informedia library creation works perfectly. The version of the Sphinx-II speech recognition system used in the system, for example, only correctly transcribes about half of the words in a typical TV news broadcast. Because of this basic imperfection, the Informedia client system has been designed to provide as much information as possible to aid users in navigating through the presented information. The goal is to allow users to find data that satisfy their information needs. By combining information derived using different processing techniques, and from different modalities, it is often possible to minimize the effects of shortcomings of the automatically derived meta-data on retrieval effectiveness. Similarly, it is possible to compensate for problems with the data itself.
The techniques from speech recognition, information retrieval, image processing and natural language processing enable the interface to support rapid and accurate search of imperfect news data. Before the more detailed discussion of speech recognition and information retrieval in Informedia that follows, a brief walk-through of a typical interaction with the system is illustrated in Figure 2.
Imagine a cautious user who is planning a trip to Europe, and who says to the system "Tell me about mad cow disease." The system searches and retrieves the best six of ninety-four matches that contain one of the words ‘mad’, ‘cow’ or ‘disease’ (Figure 2a). The user could have set options to retrieve more hits from among the news stories that match the query. Moving the mouse over the representative poster frames extracted from the stories causes a text summary headline "Britain’s secretary it’s nation’s entire herd might slaughter" to appear. Another story poster has the headline "Britain’s ending dashed European officials, Belgium, voted". Although imperfect, these summaries allow the user to select the story of greater interest, in this case the first one, which is clearly about the likelihood of drastic measures being taken to allay fears of infection; in the second, the focus is more on the crisis’ treatment in European politics. Clicking on the first poster frame starts the video of the story playing (Figure 2b). Underneath the video window is a bar with colored lines showing the exact time at which every query term was spoken. Clicking on the word ‘mad’ in the query would have highlighted the bars representing that word and the poster frames whose stories contained that word. The user could then have clicked the "next hit" button to skip past introductory material to the exact place where the word ‘mad’ was mentioned.
Since the video clip is nearly a minute long, and since it is not obvious from the automatically selected poster frame that the story begins on the topic of mad cow disease, the user switches to a "filmstrip" view of the story, where every shot is represented by one frame (Figure 2c). Again occurrences of the query words are marked on the filmstrip exactly where they occur, and the user can navigate directly to the parts of the story that are clearly relevant, such as the pictures of butchers’ shops. Alternatively, the user might have elected to enable and play a video "skim" of the story, viewing only the most important sections of the story in a fraction of the original time.
Finally, the user changes topic entirely, and in Figure 2d is shown engaged in a query about life on Mars. Having become interested in the general topic, the user has opened the catalog browser to see what other space related material is available from, in this case, NASA sources included in an accompanying educational collection.
Speech Recognition in Informedia
Table 1 shows the results of testing recognition accuracy for the Sphinx-II recognizer applied to samples of about two hours of speech from a variety of video data. These results show that the type of speech and the environment in which it was created dramatically alter the speech recognition accuracy. Substantially lower error rates can be obtained using recently developed systems such as Sphinx-III, but at greatly increased computational cost. The following paragraphs describe the conditions for each line in the table in more detail.
Table 1: Speech recognition using the CMU Sphinx-II recognition system recognizes broadcast material with word error rates between twenty and eighty five percent. Careful speakers in the lab produce error rates between eight and seventeen percent. Conditions are described in detail in the text.

| Type of Speech Data | Word Error Rate = Insertions + Deletions + Substitutions |
|---|---|
| 1) Speech benchmark evaluation | ~ 8% - 12% |
| 2) News text spoken in lab | ~ 10% - 17% |
| 3) Narrator recorded in TV studio | ~ 20% |
| 4) C-Span | ~ 40% |
| 5) Dialog in documentary video | ~ 50% - 65% |
| 6) Evening News (30 min) | ~ 33% - 50% |
| 7) Complete 1-hour documentary | ~ 65% - 75% |
| 8) Commercials | ~ 85% |

While these recognition results seem dismaying at first glance, they merely represent a first attempt at quantifying the usefulness of speech recognition for broadcast video and audio material. Fortunately, as the experiments on information retrieval described below demonstrate, speech recognition does not have to be perfect to be useful in the Informedia digital video library.
The transcript generated by Sphinx-II recognition need not be viewed by users, but can be hidden. However, the words in the transcript are time-aligned with the video for subsequent retrieval. Because, generally, only the timing information from the speech recognition output is used directly, errors in recognition are not directly visible to users and the system can tolerate higher error rates than those that would be required to produce a human-readable transcript.
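The use of timing information can be illustrated with a minimal sketch. It assumes a hypothetical representation of the time-aligned recognizer output as a list of (start time in seconds, word) pairs; the names `aligned_words`, `hit_times` and `next_hit` are illustrative and are not taken from the Informedia system.

```python
from bisect import bisect_right

# Hypothetical time-aligned recognizer output: (start time in seconds, word),
# sorted by start time. Real recognizer output would also contain errors.
aligned_words = [
    (0.0, "britain"), (0.6, "mad"), (0.9, "cow"), (1.3, "disease"),
    (12.4, "herd"), (30.2, "mad"), (30.5, "cow"),
]

def hit_times(aligned, query_terms):
    """Return the start times at which any query term was recognized."""
    terms = {t.lower() for t in query_terms}
    return [start for start, word in aligned if word.lower() in terms]

def next_hit(hits, current_time):
    """Return the first hit strictly after current_time, or None if there is none."""
    i = bisect_right(hits, current_time)
    return hits[i] if i < len(hits) else None

mad_hits = hit_times(aligned_words, ["mad"])
print(mad_hits)                  # times at which 'mad' was recognized
print(next_hit(mad_hits, 5.0))   # where playback would jump from t = 5.0 s
```

Only the word identities and their start times are needed for this kind of navigation, which is why a transcript that would be unacceptable to read can still be useful.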
Information Retrieval in Informedia
In this section, some experimental information retrieval results will be given to support the basic premise of the Informedia project: that speech recognition generated transcripts can make multimedia material searchable. More than any other "imperfect" technology, the Informedia Digital Library System depends on text transcripts that allow effective indexing and retrieval of segments relevant to a query. If a perfect, manually created transcript were available, the success of information retrieval in the Informedia Digital Library System would be assured; there are many examples of successful document retrieval systems. However, large amounts of video and audio data in the real world do not have associated perfect transcripts. Closed-caption transcripts, for example, are quite errorful. Most video and audio material has no available transcript at all. It is therefore necessary to develop and evaluate techniques for information retrieval that can be applied to imperfectly, and possibly automatically, transcribed transcripts. This is a cornerstone upon which Informedia is built.
In support of this evaluation, a series of experiments and measurements have been conducted. The information retrieval experiments were performed using data sets consisting of perfect text transcripts, closed-captioned text transcripts that were broadcast together with news shows, and transcripts created by the Sphinx-II speech recognition system. Different corpus sizes were also compared.
The most substantial body of previous work has been done at Cambridge University in the United Kingdom. Jones et al. [Jones96] used a specially constructed test set of 50 queries and 300 voice mail messages, from 15 speakers, constructed to have on average 10.8 highly relevant documents per query. They measured precision at rank 5, 10, 15 and 20 and also reported the average precision. For their data, the best performance on a hand transcribed version of the data was an average precision of 36.8 %. Average precision for the best speech data was 85.6 % of the text retrieval precision. This was achieved by combining a speech recognizer transcript (based on a similar 20,000 word North American business news language model) with a phone-lattice scanning word-spotter based on speaker independent biphone models. It should be noted that these results are comparable to the results below in terms of the techniques used, but not in terms of the data sets. Because each corpus has significantly different characteristics, one should resist the temptation to compare precision values.
METHOD OF EVALUATION
The standard metrics for retrieval effectiveness in the information retrieval literature are precision and recall [Salton71]. Precision is defined as the number of correct (relevant) hits returned by the system divided by the number of total hits returned to the user. Recall is defined as the number of correct hits returned to the user divided by the number of hits a perfect retrieval should have returned. Recall and precision are thus computed based on the retrieved set, the relevant set, and their intersection. [Jones96] reported results for precision at 5, 10, 15 and 20 items retrieved and in order to allow some comparison, we will use average precision and recall over those four ranks. Retrieval effectiveness for automatically transcribed spoken documents has been reported as a percentage of the figure for a comparable text retrieval system applied to perfect transcripts [Jones96][James96], which we will also report. It should be noted that we also computed all our results using 11 point interpolated precision, and found identical trends in all experiments.
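As a minimal sketch of these definitions, precision and recall at a given rank, and their average over ranks 5, 10, 15 and 20, can be computed as follows. The function names and the toy document identifiers are assumptions made for illustration only.

```python
def precision_recall_at(ranked, relevant, k):
    """Precision and recall over the top-k items of a ranked result list."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def averaged_over_ranks(ranked, relevant, ranks=(5, 10, 15, 20)):
    """Average precision and recall over several cut-off ranks."""
    pairs = [precision_recall_at(ranked, relevant, k) for k in ranks]
    avg_p = sum(p for p, _ in pairs) / len(pairs)
    avg_r = sum(r for _, r in pairs) / len(pairs)
    return avg_p, avg_r

# Toy example: 20 returned stories, two of which are relevant to the query.
ranked = [f"d{i}" for i in range(1, 21)]
relevant = {"d2", "d12"}
avg_precision, avg_recall = averaged_over_ranks(ranked, relevant)
print(avg_precision, avg_recall)
```

When only a few documents are relevant to each query, precision at ranks up to 20 is necessarily small even for a perfect ranking, so precision values in such a setting should be read relative to the achievable maximum.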
The precision/recall metric has the drawback that a person (or, preferably, a number of people) must manually score the test data. This is extremely tedious and time-consuming and is therefore usually only done for small sets. Any results are simply assumed to scale to larger sets. Within the Informedia project, an effort is being undertaken to measure precision and recall for a data set of 602 news stories given a list of 105 queries; this involves having each human judge make 63,210 relevance judgments. To date, a full set of these evaluations has only been completed by a single judge, and partial sets of evaluations have been made by several judges. The experimental results scored based on this set of judgments will be described below.
Making these evaluations is not an easy task, even for human judges. A subset of 100 stories, for which 3 judges completed relevance judgments for the 105 queries, demonstrated that the judges agreed on 10443 judgments and disagreed 57 times. However, of the 85 cases where at least one judge thought a story was relevant to a query, the two other judges agreed only 28 times.
In order to be able to report meaningful numbers over larger data sets, a second metric, the "average rank of the correct story", was substituted. This measure is intended to give information about retrieval effectiveness comparable to that supplied by precision and recall numbers. This metric uses a query prompt that is created for one specific document in the database set. Then the rank of this target document in the returned set is computed. Over large numbers of documents the average rank of the target document is computed. Note that this number can be expected to increase with the size of the database. This relatively simple metric permits the repetition of retrieval experiments for relatively large amounts of data without laborious manual scoring. In particular, one can empirically observe which techniques scale better than others, something that is virtually impossible to do for precision and recall metrics that are manually derived. In the future a more thorough effort will be undertaken to measure the correlation between the average and median rank and the precision recall metrics. Although the measure is expected to give similar information, it is not directly comparable to measures based on actual relevance judgments, since it assumes that there is exactly one document that should be retrieved for a given prompt, when, in fact, several documents may be relevant.
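The "average rank of the correct story" can be sketched in a few lines. The sketch assumes that each query run yields a ranked list of story identifiers plus the identifier of the single target story; how to score a target that is not retrieved at all is not specified in the text, so counting it as one position past the end of the returned list is an assumption made here for illustration.

```python
from statistics import mean, median

def rank_of_target(ranked_ids, target_id):
    """1-based rank of the target story; one past the end if it was not retrieved."""
    if target_id in ranked_ids:
        return ranked_ids.index(target_id) + 1
    return len(ranked_ids) + 1  # assumption: penalize a miss as the worst possible rank

def rank_statistics(runs):
    """Average and median target rank over (ranked_ids, target_id) pairs."""
    ranks = [rank_of_target(ranked, target) for ranked, target in runs]
    return mean(ranks), median(ranks)

# Toy example: three headline queries, each with exactly one target story "a".
runs = [
    (["a", "b", "c"], "a"),  # target at rank 1
    (["b", "a", "c"], "a"),  # target at rank 2
    (["c", "b", "x"], "a"),  # target missing, counted as rank 4
]
avg_rank, med_rank = rank_statistics(runs)
print(avg_rank, med_rank)
```

Because no relevance judgments are needed, this computation can be repeated over arbitrarily large corpora, which is the property the metric was chosen for.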
The Data for the Information Retrieval Experiments
For each of these shows with transcripts, closed-captions were also collected as they were broadcast and a speech recognition transcript was generated from the audio using the Sphinx-II speech recognition system running with a 20,000 word dictionary and language model based on the Wall Street Journal from 1987-1994.
Speech recognition for this data has a 50.7% Word Error Rate (WER) when compared to the Journal Graphics (JGI) transcripts. WER measures the number of words inserted, deleted or substituted divided by the number of words in the correct transcript. Because inserted words count as errors, WER can exceed 100% at times. Closed captions have a 15.6% WER compared to the Journal Graphics transcribed text.
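The WER definition can be made concrete with a short sketch based on a word-level edit distance. This is a generic illustration under the stated definition, not the scoring tool used in the experiments; the function name `word_error_rate` and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(insertions + deletions + substitutions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the minimum edit cost between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

# One insertion ("the") and one substitution ("written" for "britain") in 5 words.
print(word_error_rate("mad cow disease in britain", "mad cow the disease in written"))
# Two insertions against a one-word reference give a rate above 100%.
print(word_error_rate("mars", "life on mars"))
```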
The Journal Graphics transcription service also provided human-generated headlines for each of the 105 news stories. Each headline was matched to exactly one news story. The headlines were used as the query prompts in the information retrieval experiments. Thus, the rank of the correct story is defined as the rank of the news story returned by the search engine, for which the headline used as the query was created. Recall that this does not ensure that no other story is relevant to the title. In fact, in the 63,210 relevance judgments, a human judge assigned an average of 1.857 relevant documents to each headline. The average length of a headline query was 5.83 words.
In all the experiments described here, the stories being indexed were segmented by hand. Automatic segmentation methods can be expected to generate errors that may decrease retrieval effectiveness.
In these experiments, the measure of retrieval effectiveness adopted is the average rank of the query’s correct story in the retrieved set. Where possible, it is compared with actual precision and recall figures calculated with respect to human-generated relevance judgments. The base-line system uses a search engine based on TF and stop words (also referred to as "coordinate matching" by [Witten94]). The initial comparison used the 602-story corpus (set 2) to evaluate retrieval effectiveness for the closed-caption transcripts, the manual transcripts and the speech recognition transcripts in the set. Note that only for the 105 stories corresponding to the headline queries was there a choice of manual transcripts, speech recognizer output or closed-captioning available. The remaining 497 stories in the set were all derived from "perfect" manual transcripts. Thus, the data set was biased against the speech recognition data in that it mixed perfect text transcripts, some of which may have been relevant to the query headline, with the targeted speech recognized transcripts. Since the speech recognized transcripts can be expected to lose some query terms to recognition errors, their relevance ranking is likely to be somewhat lower than a comparable, relevant text story. Because of the limited number of stories for which speech, closed captions and manual transcriptions were available, accepting this bias was necessary to permit experimentation on a sizable retrieval corpus.
Experiment 1: Precision and recall for various transcription methods
The first experiment shows precision and recall for three different types of transcripts. Precision and recall were calculated using the relevance judgments of a human judge as a reference, as described earlier. The corpora used consisted of the 602 stories (described above in set 2) including 105 stories corresponding to the query headlines and 497 manually transcribed "distractor" stories. The three experimental conditions involved using the same 105 stories corresponding to the queries, but the transcripts for these 105 stories were generated in three different ways: by manual transcription, by closed-captioning, and by large vocabulary speech recognition. Precision and recall figures were computed at 5, 10, 15 and 20 stories retrieved. Two versions of the retrieval system were contrasted: a simple search engine using only TFIDF weighting and stop words, and the best version of the Informedia search engine using TFIDF, document length normalization, stop words, suffix stripping and document weight vector normalization. The latter is effectively a type of cosine distance metric. The average precision and recall figures over these 4 sets were computed and are displayed in Table 1.
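To make the ingredients of such an engine concrete, the following is a minimal sketch of TFIDF weighting with stop words and document vector normalization, scored by cosine similarity. It is not the Informedia search engine: the stop word list, the particular IDF variant (log(N/df) + 1), the function names and the toy stories are all assumptions chosen for illustration, and suffix stripping is omitted.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "to", "and", "is", "on", "for"}  # illustrative list

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Return IDF weights and unit-length TFIDF vectors for {doc_id: text}."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in df}  # one common IDF variant
    vectors = {}
    for doc_id, c in counts.items():
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[doc_id] = {t: w / norm for t, w in vec.items()} if norm else {}
    return idf, vectors

def search(query, idf, vectors):
    """Rank documents by cosine similarity between query and document vectors."""
    q = Counter(tokenize(query))
    qvec = {t: tf * idf[t] for t, tf in q.items() if t in idf}
    qnorm = math.sqrt(sum(w * w for w in qvec.values()))
    scores = {}
    for doc_id, vec in vectors.items():
        dot = sum(w * vec.get(t, 0.0) for t, w in qvec.items())
        scores[doc_id] = dot / qnorm if qnorm else 0.0
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "s1": "mad cow disease fears in britain",
    "s2": "life on mars and the search for water",
    "s3": "britain votes in european parliament",
}
idf, vectors = build_index(docs)
print(search("mad cow disease", idf, vectors))
```

Normalizing each document vector to unit length keeps long stories from outranking short ones merely because they contain more words, which is the role played by the normalization steps listed above.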
On manually prepared transcripts, which are assumed to have perfect text content, recall at the average of ranks 5, 10, 15 and 20 was 0.714 (precision = 0.097) for the standard search engine and 0.906 (precision = 0.128) for the search engine using suffix stripping, document length normalization and document weight normalization in addition to TFIDF and stop words. This shows the helpful effects of these additional search engine features. The closed captions for these same transcripts had a 15.6% word error rate compared to the perfect manual transcripts. This translated into a decreased average recall at rank 5, 10, 15 and 20 of 0.667 (precision = 0.091) for the standard search engine and an average recall of 0.849 (precision = 0.116) for the best search engine. Note that a 15.6% word error rate resulted in a 6.6% decrease in recall (6.2% decrease in precision) for the standard search engine and a 6.3% recall decrease (9.3% precision decrease) for the best search engine compared to text transcript retrieval. For speech generated transcripts, the average recall was 0.505 (precision 0.068) for the standard search engine and 0.803 (precision 0.110) for the best search engine. The 50.7% word error rate thus resulted in a 29.3% decrease in recall performance (29.9% precision decrease) compared to text retrieval from perfect transcripts for the standard search engine. For the engine with the most sophisticated weighting scheme, this decrease was 11.4% in recall and 14.1% in precision. By increasing the quality of the information retrieval engine, it was possible to palliate the effects of imperfect transcription by speech recognition. When the best search engine is used, the decrease in precision (14.1%) resulting from use of speech recognition generated transcripts is actually less than that resulting from using an information retrieval system instead of a human being to make relevance judgments (23.8%).
The errors in speech recognition accuracy are not a critical impediment to achieving good information retrieval performance.
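The rank-cutoff averaging used for these figures can be illustrated with a minimal sketch. The function names and the toy query below are hypothetical and are not taken from the Informedia code; the sketch only assumes that each query yields a ranked list of story identifiers and a set of stories judged relevant.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at(ranked: Sequence[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall computed over the top-k retrieved story ids."""
    hits = sum(1 for story in ranked[:k] if story in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def averaged_over_cutoffs(ranked: Sequence[str], relevant: Set[str],
                          cutoffs: Sequence[int] = (5, 10, 15, 20)) -> Tuple[float, float]:
    """Average precision and recall over the rank cutoffs 5, 10, 15 and 20."""
    pairs = [precision_recall_at(ranked, relevant, k) for k in cutoffs]
    avg_precision = sum(p for p, _ in pairs) / len(pairs)
    avg_recall = sum(r for _, r in pairs) / len(pairs)
    return avg_precision, avg_recall


# Hypothetical toy query: 20 retrieved stories, two of them judged relevant.
ranked = [f"s{i}" for i in range(1, 21)]
relevant = {"s2", "s12"}
avg_precision, avg_recall = averaged_over_cutoffs(ranked, relevant)
```

Because only a few stories are relevant to any one headline, precision at these cutoffs is bounded well below 1 even for perfect retrieval, which is why the precision values in this section are small in absolute terms.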
Table 1: Comparison of precision and recall for different transcript types when 105 "headline" queries were made against a corpus of 602 stories. The transcripts for 105 stories corresponding to the queries were derived, in three conditions, from manual transcription, closed-captioning, and speech recognition. 497 manually transcribed story transcripts were added to the corpus in each condition. Precision and recall figures were averaged over ranks 5, 10, 15 and 20. Hypothetical "perfect" retrieval scores, according to human relevance judgments, are also shown.
| Type of Corpus | Word error rate | Avg. Recall at rank 5/10/15/20 (TFIDF and Stop Words) | Avg. Precision at rank 5/10/15/20 (TFIDF and Stop Words) | Avg. Recall at rank 5/10/15/20 (TFIDF, Stop Words, Stemming, Document vector normalization, Document length normalization) | Avg. Precision at rank 5/10/15/20 (TFIDF, Stop Words, Stemming, Document vector normalization, Document length normalization) |
|---|---|---|---|---|---|
| Manually Prepared Transcript | 0% (base line) | 0.714 | 0.097 | 0.906 | 0.128 |
| Broadcast Closed Captions | 15.6% | 0.667 | 0.091 | 0.849 | 0.116 |
| Speech Generated Transcript | 50.7% | 0.505 | 0.068 | 0.803 | 0.110 |
| "Perfect" Retrieval on manual transcripts (human relevance judgments) | 0% | 0.992 | 0.168 | 0.992 | 0.168 |
An analysis of the large difference between the simple search engine and the full search engine (recall/precision = 0.505/0.068 vs. 0.803/0.110) showed that about half of the improvement in speech document retrieval is due to the effect of stemming. The speech recognizer misrecognizes some words and substitutes phonetically close words in their place. These substitutes are often words with the same stem but a different suffix, so stemming maps them back onto the same index term. The second biggest improvement came from vector normalization (computing the cosine distance between the query vector and the document vector instead of the Euclidean distance).
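The two effects described above can be illustrated with a small sketch. It is not the Informedia implementation: the suffix list is a deliberately crude, hypothetical stand-in for a real suffix stripper, and raw term counts are used in place of TFIDF weights to keep the example short.

```python
import math
from collections import Counter

# Hypothetical, deliberately crude suffix list (not the stemmer used in Informedia).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean(a: Counter, b: Counter) -> float:
    """Euclidean distance between two sparse term-count vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in terms))


# Toy data: the recognizer output has the right stems but the wrong suffixes.
query = ["election", "votes"]
recognized = ["elections", "voting", "senate"]

raw_score = cosine(Counter(query), Counter(recognized))
stemmed_query = Counter(crude_stem(w) for w in query)
stemmed_doc = Counter(crude_stem(w) for w in recognized)
stemmed_score = cosine(stemmed_query, stemmed_doc)

# Length effect: a ten times longer story on the same topic vs. a short off-topic one.
long_doc = Counter({t: 10 * c for t, c in stemmed_doc.items()})
off_topic = Counter(["weather", "storm"])
```

In this toy case the unstemmed query shares no terms with the recognized story, while the stemmed versions match on two of three terms. The cosine score is identical for the short and the long on-topic story, whereas the Euclidean distance places the short off-topic story closer to the query than the long on-topic one.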
Experiment 2: Average retrieval rank for correct story using various transcription methods.
The second experiment involved computing the second, "average rank of correct story" measure for the same conditions. Table 2 shows that the rank of speech recognition based transcripts is more than three times higher than that of manually generated data. Closed captions lie somewhere in between. The differences in the recall and precision figures in Table 1 are in the same direction, but are much smaller. Compared with precision and recall, the "average rank of correct story" metric seems to be a correlated, but more sensitive measure of retrieval effectiveness.
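As a minimal sketch of how such a measure can be computed, the code below takes, for each headline query, the ranked list returned by the search engine and the identifier of the story that the headline belongs to. The function names and toy data are hypothetical, and the sketch assumes that the correct story always appears somewhere in the full ranking.

```python
from typing import Dict, List, Tuple


def rank_of_correct(ranked: List[str], correct: str) -> int:
    """1-based position of the correct story in the ranked result list.

    Raises ValueError if the story is missing, since this sketch assumes
    the engine returns a complete ranking of the corpus.
    """
    return ranked.index(correct) + 1


def average_rank(results: Dict[str, Tuple[List[str], str]]) -> float:
    """Mean rank of the correct story over all headline queries."""
    ranks = [rank_of_correct(ranked, correct) for ranked, correct in results.values()]
    return sum(ranks) / len(ranks)


# Hypothetical toy results for three headline queries.
toy_results = {
    "headline 1": (["s1", "s7", "s3"], "s1"),  # correct story at rank 1
    "headline 2": (["s4", "s2", "s9"], "s2"),  # correct story at rank 2
    "headline 3": (["s5", "s8", "s6"], "s6"),  # correct story at rank 3
}
toy_average = average_rank(toy_results)
```

Unlike precision and recall, this measure needs no human relevance judgments beyond the known pairing of headline and story, and lower values indicate better retrieval.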
Table 2: Comparison of the average retrieval rank for the correct story corresponding to a query headline for manual story transcripts, closed-captioned transcripts and transcripts based on speech recognition. The same corpus was used as for the experiment described in Table 1.
| Transcript Source | Word error rate | TFIDF and Stop Words | Suffixes, Stop words, Document weight normalization, TFIDF, Document length normalization |
|---|---|---|---|
| Manually Prepared | 0% (base line) | 11.14 | 2.32 |
| Broadcast Closed Captions | 15.6% | 13.93 | 4.85 |
| Speech Recognition | 50.7% | 44.11 | 7.89 |
Experiment 3: Scaling behavior of the average retrieval rank.
Table 3. A comparison of retrieval effectiveness for manual transcripts and speech recognition transcripts for larger data sets. Each data set used the same 105 prompts, for which the corresponding story transcripts were created either manually or by a speech recognizer. In this case, though, three corpus sizes were generated by adding manually generated transcripts. Average rank figures were computed using the best retrieval system, as described in the text.
| Average rank of correct story based on: | 602 stories | 2,600 stories | 12,000 stories |
|---|---|---|---|
| Manually Prepared Transcript | 2.32 | 5.65 | 9.34 |
| Speech Generated Transcript | 7.89 | 31.16 | 60.19 |
Table 3 shows the scaling behavior of the average rank measure for the best retrieval system as the number of documents in the corpus is increased by adding more manually generated "distractor" story transcripts. The table indicates that the average rank rises more quickly for speech recognized transcripts than for manually created transcripts. However, both conditions seem to degrade approximately with the log of the size of the corpus. Recall that since only "perfect" manual transcripts were added to the corpus, this data is slightly biased against the speech recognized transcripts. One would expect better measured performance in the speech recognition condition if the additional stories in the corpus were always of the same type (i.e. speech transcript) as the original 105 stories. It is also worth noting that the ratio of the average rank between speech recognition and manual transcripts increases with the size of the corpus. This indicates that speech recognition generated transcripts are less focused on the correct topic and are more likely to be displaced by other apparently relevant stories from the "distractor" set.
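The two observations about scaling can be checked directly from the figures in Table 3. The short sketch below is only a worked calculation on the published numbers; the helper name is hypothetical.

```python
import math

# Figures copied from Table 3.
corpus_sizes = [602, 2600, 12000]
manual_rank = [2.32, 5.65, 9.34]
speech_rank = [7.89, 31.16, 60.19]

# Ratio of speech-recognition rank to manual rank at each corpus size.
ratios = [s / m for s, m in zip(speech_rank, manual_rank)]


def rank_growth_per_decade(sizes, ranks):
    """Increase in average rank per tenfold increase in corpus size,
    computed for each adjacent pair of corpus sizes."""
    points = list(zip(sizes, ranks))
    return [
        (r2 - r1) / (math.log10(n2) - math.log10(n1))
        for (n1, r1), (n2, r2) in zip(points, points[1:])
    ]


manual_growth = rank_growth_per_decade(corpus_sizes, manual_rank)
speech_growth = rank_growth_per_decade(corpus_sizes, speech_rank)
```

The ratios come out at about 3.4, 5.5 and 6.4, so the gap widens as the corpus grows. The per-decade increments stay in a narrow band for each condition (a little over 5 ranks for manual transcripts, roughly 37 to 44 for speech transcripts), which is what approximately logarithmic degradation would predict, although three data points cannot establish the functional form.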
Experiment 4: Phonetic transcription compared with large vocabulary recognition
In previous work, Schäuble and Wechsler [Schäuble95] performed experiments in which they used automatic phonetic transcriptions, as opposed to the whole word speech recognition transcripts described above, for information retrieval in a small radio news corpus. They reported reasonable success in retrieving relevant documents. Similarly, Jones et al. [Jones96] used a combination of whole word transcription and phoneme lattices to improve on the retrieval effectiveness of a system using either alone.
Although a strictly phonetic transcription of the data used in these experiments has not yet been generated, an approximation is achieved by converting story transcripts into a phonetic representation by looking up the words in the transcript in the large phonetic dictionary used by Sphinx-II [CMU-Speech95]. All substrings of between three and six phonemes in length are generated from these transcriptions and used as the lexical tokens for building the inverted index and retrieval. The prompts are also converted into phoneme based tokens in the same way.
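The token generation step can be sketched as follows. The tiny lexicon is a hypothetical stand-in for the Sphinx-II dictionary, with illustrative pronunciations and no stress markers. The sketch also makes two assumptions that the text does not settle: substrings are taken over the whole phoneme sequence, so they may span word boundaries, and words missing from the lexicon are skipped.

```python
from typing import Dict, List

# Hypothetical miniature pronunciation lexicon (illustrative entries only).
TOY_LEXICON: Dict[str, List[str]] = {
    "video": ["V", "IH", "D", "IY", "OW"],
    "news": ["N", "UW", "Z"],
}


def phoneme_tokens(words: List[str], lexicon: Dict[str, List[str]],
                   min_len: int = 3, max_len: int = 6) -> List[str]:
    """Convert words to phonemes and emit every contiguous phoneme
    substring of length min_len..max_len as an index token."""
    phones: List[str] = []
    for word in words:
        # Assumption: words without a dictionary entry contribute nothing.
        phones.extend(lexicon.get(word.lower(), []))
    tokens: List[str] = []
    for n in range(min_len, max_len + 1):
        for start in range(len(phones) - n + 1):
            tokens.append("_".join(phones[start:start + n]))
    return tokens


tokens = phoneme_tokens(["Video", "news"], TOY_LEXICON)
```

For the eight phonemes of this toy phrase the function produces 6 + 5 + 4 + 3 = 18 tokens, which are then treated exactly like words when the inverted index is built and when prompts are converted into queries.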
Table 4 gives precision and recall for word based and phonetic representations of both human generated and machine generated transcriptions. Combined word and phoneme based retrieval is also evaluated, as originally proposed by James [James96]. Figures are given for two information retrieval engines. The first observation one can make from these data is that the speech recognition vocabulary is crucial to IR performance. Reducing the text vocabulary to that of the speech recognition engine accounts for about half the loss of precision and recall resulting from the use of speech recognition based transcripts. Further analysis shows that the out-of-vocabulary (OOV) rate for the data used in these experiments is rather high; OOV terms account for 11.4% of the terms in the 105 "headline" prompts (71/620 words) and for 6% of the words in the stories. Searching on a phonetic representation of the manual transcripts slightly decreases retrieval effectiveness, but this is not reliably true for the SR based transcripts, perhaps because the phonetic transcription allows the system to bypass some errors in the recognition of word suffixes. Finally, interpolating the phonetic and whole word representations gives better retrieval performance than either alone, both for manual and automatic transcriptions. For manual transcriptions (perfect text), it is likely that the phonetic transcriptions help by providing a general suffix and prefix matching mechanism, which is not a feature of the base search engine using TFIDF and stop words. However, performance when using the "best" search engine, which already includes suffix stemming, is not changed when phonetic transcriptions are added to the already perfect text transcripts. The results for perfect text transcripts and phonemes with the full search engine are given only to illustrate the point that phoneme retrieval is not needed for perfect text systems, whereas it can be useful for speech recognized documents.
For speech recognition based transcriptions, the improvement in retrieval effectiveness derives from phonetic transcriptions, which allow matching on words outside the fixed 20,000-word speech recognition vocabulary.
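The text does not specify how the word and phoneme results are interpolated. One simple possibility, shown here purely as an assumption, is a linear combination of the per-story scores from the two indexes. The weight `lam`, the function name and the toy scores are all hypothetical, and the sketch presumes that both engines return scores on a comparable scale, such as cosine similarities.

```python
from typing import Dict, List


def interpolate_scores(word_scores: Dict[str, float],
                       phoneme_scores: Dict[str, float],
                       lam: float = 0.5) -> Dict[str, float]:
    """Linear interpolation: lam * word score + (1 - lam) * phoneme score.
    A story missing from one index contributes a score of 0 from that index."""
    stories = set(word_scores) | set(phoneme_scores)
    return {
        story: lam * word_scores.get(story, 0.0)
        + (1.0 - lam) * phoneme_scores.get(story, 0.0)
        for story in stories
    }


def ranked_stories(scores: Dict[str, float]) -> List[str]:
    """Story ids sorted by descending combined score."""
    return sorted(scores, key=lambda story: scores[story], reverse=True)


# Hypothetical toy scores: story "c" contains an out-of-vocabulary query word,
# so it is found only through the phoneme index.
word_scores = {"a": 0.2, "b": 0.5}
phoneme_scores = {"a": 0.9, "b": 0.1, "c": 0.4}
combined = interpolate_scores(word_scores, phoneme_scores, lam=0.5)
ranking = ranked_stories(combined)
```

In the toy data, story "c" would be invisible to word-only retrieval but still enters the combined ranking, which mirrors the out-of-vocabulary benefit described above.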
Table 4: Recall and precision for retrieval performed using a variety of whole word and phonetic transcript representations, for transcripts generated manually or using large vocabulary speech recognition
| Type of transcription | Recall (TFIDF + stop words) | Precision (TFIDF + stop words) | Recall (Full system with all IR features) | Precision (Full system with all IR features) |
|---|---|---|---|---|
| Words from Text | 0.714 | 0.097 | 0.906 | 0.128 |
| Words from SR | 0.505 | 0.068 | 0.803 | 0.110 |
| Words from Text without words not in SR dictionary | 0.623 | 0.081 | 0.850 | 0.119 |
| Phonemes from Text | 0.705 | 0.087 | 0.839 | 0.115 |
| Phonemes from SR | 0.544 | 0.064 | 0.762 | 0.102 |
| Text words + Text Phonemes interpolated | 0.765 | 0.107 | 0.905 | 0.129 |
| SR words + SR Phonemes interpolated | 0.616 | 0.083 | 0.831 | 0.114 |
Table 5 gives the results of a similar experiment expressed in terms of the average rank measure. As in previous experiments, the magnitude of the effect is much more noticeable when this "average rank of correct story" measure is used. The relative retrieval effectiveness for the different conditions shows exactly the same trends as before: when precision and recall are high, the average rank is relatively low, and vice versa.
Table 5. Improvements in the average rank of the correct story from speech recognized stories using phoneme recognition. This experiment is based on the set of 602 stories. For each condition, a base-line (TFIDF + stop words) is shown, as well as the best information retrieval system using TFIDF, stop words, document length normalization, document weight normalization, proximity weighting and suffix stripping. The conditions contrast words from manually transcribed text, words from a 20,000 word speech recognizer, words from the manual transcripts after the out-of-vocabulary words were removed, retrieval given only a phonetic representation of the text transcript, and retrieval given only a phonetic representation of the speech recognized transcript. The last two rows show the improvements in average rank of the correct story obtained when both words and phonemes are used for retrieval.
| Transcript type | TFIDF + stop words | Best IR System |
|---|---|---|
| Words from Text | 11.14 | 2.32 |
| Words from SR | 44.11 | 7.89 |
| Words from Text less OOVs for SR | 38.36 | 17.20 |
| Phonemes from Text | 17.74 | 9.75 |
| Phonemes from SR | 25.05 | 12.02 |
| Text words + Text Phonemes interpolated | 9.15 | 2.18 |
| SR words + SR Phonemes interpolated | 20.06 | 6.87 |
Experimental Summary
The experiments confirm the findings of Schäuble and Wechsler [Schäuble95] that phoneme-based recognition using phoneme strings of different lengths can be used for effective information retrieval. While these experiments do not directly duplicate the procedure of Schäuble and Wechsler, they are sufficiently similar to confirm their results and to show the robustness of phoneme-string based information retrieval in different implementations.
The current results also show that it is better to have a large vocabulary speech recognizer (with a lexicon of at least 20,000 words) in conjunction with a phonetic engine, where the retrieval results of the two can be combined. This is consistent with the findings of Jones et al. [Jones96] on a voice mail retrieval task. In contrast to Jones et al., the current experiments do not use word spotting or phoneme lattices to augment the vocabulary of the recognition system. Instead of word spotting, a large set of phoneme strings of various lengths is indexed and searched. In the future, these experiments will be extended to investigate the use of word and phoneme lattices that can be generated by the Sphinx-II speech recognizer. There are also subtle differences in the search engine features used by the different groups. Unlike the experiments of Jones et al., the Informedia search engine uses document weight vector normalization to gain a small improvement in retrieval effectiveness for larger corpora. However, this difference is unlikely to influence the direction and trends of the results, which support the findings of Jones et al. in all essential respects.
The experiments reported here also hint at the behavior of the retrieval system for larger collections of documents. For larger corpora it was demonstrated that the performance for speech recognition generated transcripts fell off more rapidly than the performance for equivalent manually created transcripts. The introduction of the "average rank of correct story" metric enabled measurements for collection sizes that are far beyond what humans could reasonably judge for relevance. The "average rank of correct story" metric confirmed all the basic trends in the small corpus, when compared to the recall and precision figures based on human judgments. It does, however, appear to be a more sensitive measure. It is suspected that median rank may be less sensitive and more directly comparable to the traditional measures of precision and recall.
Future Directions
There are six main research areas that contribute to the effectiveness of the Informedia Digital Video Library: data delivery, user interface design, image understanding, natural language processing, information retrieval and speech recognition. AI technologies in image understanding, natural language processing, speech recognition and information retrieval primarily affect the off-line library creation process. The online library exploration phase with the user is affected more by data delivery issues and user interface design, and to a lesser extent by natural language understanding and information retrieval for query processing and speech recognition for spoken queries.
There are two main data delivery issues: storage and transmission. How can one address the problem of the huge storage requirements of the MPEG-1 encoded video data accumulated through daily news broadcasts? An hour of video takes up about 600MB of disk space. Although the Informedia project will eventually have a terabyte of disk on which to store video, this is still barely enough for 1500 hours of video. Even with the constantly dropping prices of storage, the many thousands of hours of both broadcast and privately produced video force the consideration of new approaches to dealing with this data. It is worthwhile to investigate when data could be degraded or "forgotten". It may be useful for data to degrade to lower quality video at fewer frames per second and lower resolution. It is also possible to eliminate the video entirely and save only the audio portion. Finally, one can retain only the text transcript without audio or video. Even if enough storage is available, the need to speed access by keeping cached copies of material encourages research to find out which reduced quality representations of video material can serve as useful substitutes while the original material is fetched.
The second data delivery issue concerns the transmission of the video news story to a remote user. Essentially, one needs to provide networks fast enough to allow MPEG-1 bit rates to be transmitted continuously, and servers that can keep up with this demand for many users. The need to play back skims, which in future versions of the system will be dynamically created according to user queries, requires that the MPEG-1 data be served with very low latency, preventing many of the optimizations currently used in multimedia servers. Local caching strategies would allow a frequently used subset of the news to be stored near the client, where it can be served rapidly and without contention, while most of the less frequently used, older news is stored in a central archive. This approach would reduce network bandwidth requirements, although MPEG-1 transmission rates would still occasionally be required. Another approach is to use lower bandwidth streaming video representations, trading off reduced quality against expanded accessibility. To enable experimentation with a wide variety of networks and delivery platforms, the Informedia project is currently implementing a Web based client in Java.
The user interface issues deal with the way users explore the library once it is available. Can the user intuitively navigate the space of features and options provided in the Informedia: News-on-Demand interface? What other features should the system provide to allow users to obtain the information they are looking for? The plan for the future is to move the system into further test-bed deployments to gain insight from users and to enable evaluation of interface design alternatives. User studies have already been useful in directing the system towards better schemes for poster frame generation. In future, such studies will be used to answer questions such as whether the speech interface enables more effective querying than the typing interface, and whether skims are an effective reduced representation of the content of a video segment.
Natural language processing research for News-on-Demand needs to find ways of providing acceptable segmentation of the news broadcasts into stories, even when those broadcasts are only available as speech. It is desirable for the system to generate more meaningful short summaries of the news stories in natural sounding English, and to choose more coherent and informative sections of a story for use in skims. In both these cases, it would also be useful if the system could tailor both the headline and the skim to a model of the user’s information need, as expressed through the query. Natural language understanding also has a role to play in deciding the weightings and operators, such as adjacency and Boolean operators, that should be used with natural language queries to ensure optimal retrieval of matching concepts from the story texts. Parsing and broad semantic analysis of news stories may also be worth pursuing; the system might greatly improve if it could parse and separate out dates, major concepts and types of news sources. Finally, machine translation, both statistical and symbolic, will be fundamentally important in dealing with the fact that video and audio programs are produced in a variety of languages and that this variety is also represented among potential users of digital libraries.
Image processing research [Hauptmann95b][Zhang95] is continuing to refine the scene segmentation (the identification of cuts in the video). Within a scene and within a story, image processing gives us the key frame to represent that scene or story. The choice of a single key frame to best represent a whole scene is a subject of active research.
In the longer term, the project plans to add text detection and optical character recognition (OCR) capabilities for reading captions and text off the video images. Future work will also aim to include similarity-based image matching among the retrieval features available to a user. A simple version of similarity matching based on color histograms has already been used in one version of the system, but the aim is to move far beyond this point, and to do matching of identified objects found in the images.
The error rate of the speech recognition system still leaves much room for improvement. The experiments presented here indicate that substantial gains in retrieval effectiveness can be achieved with more accurate speech-recognition generated transcripts. The figures in this paper show that the greatest benefit for speech recognition would be obtained by reducing the number of out-of-vocabulary words. These are words that occur in the stories and the queries, but that are not represented in the speech recognizer’s active vocabulary. In addition to better handling of out-of-vocabulary words, improvements could come from better acoustic models, better language modeling and better pronunciation modeling. To improve acoustic modeling, experiments are under way to automatically adapt the existing models to broadcast news. This is achieved by combining the closed-captioned text with a speech recognition transcript of the audio stream. In principle, this could enable a computer to learn to recognize speech by watching television 24 hours a day. To improve the lexical coverage and the language model, a two-pass system is being developed. The first pass will be a standard recognition using the general English language model. The recognized words are then used in a query to an information retrieval system from which related text on the specific topic can be obtained. This text is then interpolated with the generic language model to focus on the topic domain covered in the current speech. Preliminary experiments have shown that this adaptation of the language model can improve speech recognition accuracy as well as coverage of the out-of-vocabulary lexical items.
The task for speech recognition in Informedia is not merely transcription. It matters little whether the system correctly transcribes any of the stop words. However, important concept words should be recognized correctly or retrieval effectiveness will suffer. This suggests a new approach to training and evaluating speech recognizers for information retrieval tasks, in which the score to be maximized is based on retrieval effectiveness, not on the number of words correctly transcribed. A further improvement could come through the use of confidence measures, which reflect the recognizer’s estimation of the likelihood that a word was correctly transcribed. Confidence measures offer the potential to be used as an additional weighting factor in the document vector. Finally, one would expect prosodic information, and in particular lexical stress, to reflect the importance of spoken query terms or spoken document concept words. The challenge here is to reliably identify the prosodically marked words and to favor them as relevant terms.
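One way the confidence idea might be realized is sketched below, under stated assumptions: each recognized word carries a confidence in [0, 1], stop words are discarded, and the term weight is the sum of the confidences of its occurrences rather than a raw count. The stop word list, function name and toy data are hypothetical and do not describe an existing Informedia component.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical miniature stop word list.
STOP_WORDS = {"the", "a", "of", "in", "and"}


def confidence_weighted_tf(recognized: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Build a term vector in which each occurrence counts for the
    recognizer's confidence instead of 1, ignoring stop words."""
    weights: Dict[str, float] = defaultdict(float)
    for word, confidence in recognized:
        term = word.lower()
        if term in STOP_WORDS:
            continue  # stop word accuracy does not matter for retrieval
        weights[term] += confidence
    return dict(weights)


# Hypothetical recognizer output as (word, confidence) pairs.
hypothesis = [("Clinton", 0.9), ("the", 0.99), ("Clinton", 0.6), ("clean", 0.3)]
doc_vector = confidence_weighted_tf(hypothesis)
```

Under this scheme a doubtful word such as the low-confidence "clean" above contributes little to the document vector, while repeated, confidently recognized concept words dominate it. The resulting weights could then be combined with TFIDF in the usual way.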
While the results presented here, as well as those from Schäuble and Wechsler [Schäuble95] and Jones et al. [Jones96], show some success in information retrieval from spoken documents, much remains to be done. In the near term, the combination of whole-word recognition and phoneme recognition shows clear promise. In addition, improvements can be achieved through the use of word or phoneme lattices, where the recognizer indicates multiple, ranked choices for each word or phoneme. Since recognition errors are not semantically correlated, the penalty of adding multiple word candidates into the retrieval document is likely to be outweighed by the benefit of alternate, lower ranked, but sometimes correct word or phoneme candidates.
Judicious combination of a variety of AI techniques has permitted the construction of an effective interface to a digital video library. Speech recognition is used for transcription and alignment, image processing is used for shot analysis and to identify representative frames, and natural language processing is used for summarization. Despite the imperfections in each of these techniques, and the problems inherent in processing unmodified broadcast news data, strong navigation tools supported by the use of AI allow the user to quickly retrieve appropriate stories from the Informedia Digital Video Library.
Acknowledgments
The authors are grateful for the help of Mosur Ravishankar, Paul Placeway and our other colleagues in the CMU speech group, Michael Smith, for the image processing code, Michael Christel and Mark Hoy for more than substantial coding on the user interface, Ricky Houghton, Craig Marcus, Bryan Maher and King-Sun Wai for programming and technical support, and Howard Wactlar for fearless leadership. Especial thanks go to Marci Maher, without whose sterling efforts this paper would not have been possible. We are also indebted to Hsinchun Chen for his forbearance.
References
[Brown95] Brown, M. G., Foote, J. T., Jones, G. J. F., Spärck Jones, K., and Young, S. J. "Automatic Content-based Retrieval of Broadcast News," Proceedings of ACM Multimedia. San Francisco: ACM, November, 1995, pp. 35-43.
[Christel94a] Christel, M., Stevens, S., & Wactlar, H. "Informedia Digital Video Library," Proceedings of the Second ACM International Conference on Multimedia, Video Program. New York: ACM, October, 1994, pp. 480-481.
[Christel94b] Christel, M., Kanade, T., Mauldin, M., Reddy, R., Sirbu, M., Stevens, S., and Wactlar, H. "Informedia Digital Video Library", Communications of the ACM, 38 (4), April 1994, pp. 57-58.
[CNN-AT-WORK95] Cable News Network/Intel CNN at Work - Live News on your Networked PC Product Information. http://www.intel.com/comm-net/cnn_work/index.html.
[Flickner95] Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. "Query by Image and Video Content: The QBIC System". IEEE Computer, September 1995, pp. 23-31.
[Hauptmann95] Hauptmann, A. G., Witbrock, M. J., Rudnicky, A. I., and Reed, S., Speech for Multimedia Information Retrieval, UIST-95, Proceedings of User Interface Software Technology, 1995, in press
[Hauptmann95b] Hauptmann, A. G. and Smith, M. A., Text, Speech and Vision for Video Segmentation: the Informedia Project. AAAI Fall Symposium on Computational Models for Integrating Language and Vision, Boston MA Nov 10-12 1995., pp. 90-95.
[Hauptmann96] Hauptmann, A.G. and Witbrock, M.J., Informedia News on Demand: Multimedia Information Acquisition and Retrieval, in Maybury, M. T., Ed, Intelligent Multimedia Information Retrieval, AAAI Press/MIT Press, Menlo Park, 1996 (In Press).
[Hwang94] Hwang, M., Rosenfeld, R., Thayer, E., Mosur, R., Chase, L., Weide, R., Huang, X., and Alleva, F., "Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in SPHINX-II." ICASSP-94, vol. I, pp. 549-552.
[Informedia95] http://www.informedia.cs.cmu.edu/
[James96] James D. A., System for Unrestricted Topic Retrieval from Radio News Broadcasts. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Atlanta, GA, USA, May 1996, pp. 279-282.
[Jones96] Jones, G.J.F., Foote, J.T., Spärck Jones, K., and Young, S.J., "Retrieving Spoken Documents by Combining Multiple Index Sources", SIGIR-96 Proceedings of the 1996 ACM SIGIR Conference, Zürich.
[Li96] Li, W., Gauch, S., Gauch, J., and Pua, K.M., "VISION: A Digital Video Library", Digital Libraries ’96: 1st ACM International Conference on Research and Development in Digital Libraries, Bethesda MD, March 1996.
[Mani96] Mani, I., House, D., Maybury, M. and Green, M. 1996. "Towards Content-Based Browsing of Broadcast News Video", in Maybury, M. T. (editor), Intelligent Multimedia Information Retrieval.
[Maybury96] Maybury, M., Merlino, A., and Rayson, J., submitted 1996. "Segmentation, Content Extraction and Visualization of Broadcast News Video using Multistream Analysis", in Proceedings of the ACM International Conference on Multimedia, Boston, MA.
[Ogle95] Ogle, V. and Stonebraker, M. "Chabot: Retrieval from a Relational Database of Images", IEEE Computer, Vol. 28, No 9, September 1995.
[Pentland94] Pentland, A., Picard, R., and Sclaroff, S., "Photobook: Tools for Content-Based Manipulation of Image Databases". SPIE Conference on Storage and Retrieval of Image and Video Databases II, (SPIE paper 2185-05) Feb 6-10, 1994, San Jose CA, pp. 34-47.
[Rudnicky95] Rudnicky, A., "Language Modeling with Limited Domain Data," Proceedings of the 1995 ARPA Workshop on Spoken Language Technology, in press.
[CMU-Speech95] URL: http://www.speech.cs.cmu.edu/speech/
[CMU-Speech96] URL: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[Salton71] Salton, G., Ed, "The SMART Retrieval System", Prentice-Hall, Englewood Cliffs, 1971.
[Schäuble95] Schäuble, P. and Wechsler, M. "First Experiences with a System for Content Based Retrieval of Information from Speech Recordings," IJCAI-95 Workshop on Intelligent Multimedia Information Retrieval, Maybury, M. T., (chair), working notes, pp. 59 - 69, August, 1995.
[Wactlar96] Wactlar, H. D., Kanade, T., Smith, M. A. and Stevens, S.M. "Intelligent Access to Digital Video: Informedia Project". IEEE Computer, 29(5) May 1996, pp. 46-52.
[Witten94] Witten, I.H., Moffat, A., and Bell, T.C., "Managing Gigabytes : Compressing and Indexing Documents and Images", Van Nostrand Reinhold, 1994.
[Woods96] Woods, Bill, "Conceptually Indexed Video: Enhanced Storage and Retrieval". http://www.sun.com/960201/cover/video.html
[Zhang95] Zhang, H., Low, C., and Smoliar, S. "Video parsing and indexing of compressed data," Multimedia Tools and Applications 1 (March 1995), pp. 89-111.