6. System Implementation and Evaluation: A Discussion

In general, it is essential to empirically evaluate theories and the systems that purport to implement them. Evaluations help others understand the strengths and limitations of various hypotheses and systems, and in many cases they also facilitate comparisons between competing claims. However, NLG evaluations are considered difficult (Hovy and Meeter, 1990). NLG systems can be evaluated at many different levels, some of them orthogonal to each other, and our case is no exception. There are at least three different and equally important questions that one could investigate further:

  • validity of the complexity metric: perhaps the most critical aspect, since without a valid complexity metric the system would not be able to generate reasonable captions, irrespective of how well the other components performed. The only way to corroborate the complexity metrics discussed here would be through rigorous user experiments. Fortunately, a recent dissertation on graph comprehension (Shah, 1995) examined some of the factors used in our complexity metrics and found that many of them were indeed correlated with increased times required to interpret graphs and charts.
  • validity of the discourse strategies: the paper discussed three discourse strategies for structuring the information presented in the captions. There are at least two ways to evaluate such a set of strategies. (1) One is to perform a corpus analysis on a set of charts and captions different from the one used to infer the strategies initially, in order to see how well the strategies fit the test set. This is the usual approach in machine learning, where the learning and test sets are kept separate for precisely this reason. Such an analysis would help determine whether the set of discourse strategies we came up with is both consistent and complete. However, it would require significant resources to find charts and their captions and to code them for both the data displayed and the discourse strategies used. (2) Another is to conduct user comprehension tests with various charts and captions generated using different strategies chosen at random. While this would be less efficient at testing the set of strategies for completeness, it would allow us to validate that a particular strategy (from our set of three) is best suited for particular types of charts.
  • utility of the captions generated, or the value-added test: are the captions and the graphics together better than the graphics alone for some purpose? If so, the value of generating the captions would be confirmed. We conducted an informal, subjective evaluation of the system over a period of two years. Whenever users interacted with SAGE and were unable to understand a graphic, we suggested that they generate a caption. Afterwards, we asked for feedback on their experience: whether the captions were useful, and whether they would have liked to see something different. On the basis of this feedback, we can state that the captions clearly help in understanding the graphic being presented. The need for natural language explanations seems to arise every time a novel, complex graphic is generated, something that happens quite frequently with SAGE.

A large part of the work discussed in this paper is system independent and applicable to any automatic graphic design system. Perhaps the most surprising aspect of our current implementation is how far one can get with such a simple architecture. We made certain simplifying decisions initially in order to get a prototype implemented, and few of them proved problematic down the line. Our pipelined architecture is an example. Most NLG researchers agree that the various modules in an NLG system need to be strongly interconnected, with bi-directional communication and control and shared data structures. We started off with a pipelined architecture and were surprised to find that this simplification was problematic in only one situation (which we were able to get around by planning appropriately). A pipelined approach such as ours has several advantages: not only is it easy to design, implement and test each module independently, it is also easy to extend the functionality of any individual module without significantly affecting the others. While such a simplified architecture will certainly not suffice for all generation tasks, our experience is a strong argument for trying this minimal approach first to see where it falls short and why.
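The one-way flow of a pipelined design can be illustrated with a minimal sketch. It is not the actual implementation: the stage names (select_content, order_content, realize), the CaptionState structure and the sample facts are all hypothetical, chosen only to show that each stage reads the output of the previous one and never calls back into it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CaptionState:
    """Data handed from one stage to the next (hypothetical structure)."""
    facts: List[str]
    plan: List[str] = field(default_factory=list)
    sentences: List[str] = field(default_factory=list)


Stage = Callable[[CaptionState], CaptionState]


def select_content(state: CaptionState) -> CaptionState:
    # Hypothetical stage: keep every non-empty fact as something to explain.
    state.plan = [fact for fact in state.facts if fact]
    return state


def order_content(state: CaptionState) -> CaptionState:
    # Hypothetical stage: a fixed ordering policy (alphabetical, for illustration only).
    state.plan = sorted(state.plan)
    return state


def realize(state: CaptionState) -> CaptionState:
    # Hypothetical stage: turn each planned item into one template sentence.
    state.sentences = [f"The chart shows {item}." for item in state.plan]
    return state


def run_pipeline(facts: List[str], stages: List[Stage]) -> str:
    state = CaptionState(facts=list(facts))
    for stage in stages:
        # One-way flow: a stage only sees what earlier stages produced.
        state = stage(state)
    return " ".join(state.sentences)


PIPELINE: List[Stage] = [select_content, order_content, realize]
caption = run_pipeline(["house prices on the x-axis", "", "agencies by color"], PIPELINE)
print(caption)
```

Because every stage shares the same signature, each one can be tested in isolation, and a new stage can be added to the list without touching the others. This is the ease of extension referred to above.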

Over the last two years, this system has been used to generate captions for several hundred figures in different domains (housing-sales, Napoleon's march of 1812, logistics transportation, scheduling, etc.). Porting the system from one domain to another usually requires only specifying the lexicon for the new domain (e.g., "battle," "troops," etc.). The fact that users deem the captions generated in each of these quite different domains useful and natural is testimony to the effectiveness of the caption generation mechanism currently in place.
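As a rough illustration of why porting can reduce to supplying a lexicon, the sketch below keeps the sentence-building logic fixed and swaps only a table of domain words. The dictionary keys, the sample entries and the describe_mapping helper are assumptions made for this example; they do not reflect the system's actual lexicon format.

```python
from typing import Dict

# Hypothetical domain lexicons: same keys, different domain vocabulary.
HOUSING_LEXICON: Dict[str, str] = {
    "entity": "house",
    "measure": "asking price",
    "event": "sale",
}
MARCH_LEXICON: Dict[str, str] = {
    "entity": "troops",
    "measure": "troop strength",
    "event": "battle",
}


def describe_mapping(lexicon: Dict[str, str], attribute: str, grapheme_property: str) -> str:
    """Domain-independent template; only the looked-up word changes per domain."""
    return f"The {grapheme_property} of each mark shows the {lexicon[attribute]}."


print(describe_mapping(HOUSING_LEXICON, "measure", "horizontal position"))
print(describe_mapping(MARCH_LEXICON, "measure", "line thickness"))
```

The generation code is identical for both calls; only the lexicon argument differs.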

It should be noted that the system has two shortcomings that will be addressed in future work. (1) The caption generation system, as described here, cannot in general modify the graphics designed by SAGE, even if the caption requires it. There are several cases where this capability would be extremely useful, but the caption generation system was designed to work after SAGE has designed and rendered the graphic. Coordination currently occurs in one specialized case, when the caption generator presents an example: the caption generator can then request that the graphemes corresponding to the tuple values used in the example be highlighted in the picture. (2) The system does not, as yet, analyze the data set for interesting patterns or clusters of data points. As a result, it cannot generate captions of the sort "this chart shows that sales were flat throughout 1995, but rose sharply in 1996." To do this, the system will need a clustering analysis module that can be used by the caption generator.
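To make the second shortcoming concrete, the sketch below shows the kind of data analysis a pattern-oriented caption would need. It is a deliberately simple stand-in, not the clustering analysis module mentioned above: it labels each year by the relative change between its first and last values, using thresholds (flat_tol, sharp_tol) and sample figures that are invented for illustration.

```python
from typing import Dict, List


def describe_trend(values: List[float], flat_tol: float = 0.05, sharp_tol: float = 0.25) -> str:
    """Label a series by its relative change from first to last value (assumed thresholds)."""
    if not values or values[0] == 0:
        raise ValueError("need a non-empty series with a non-zero first value")
    change = (values[-1] - values[0]) / abs(values[0])
    if abs(change) <= flat_tol:
        return "were flat"
    if change >= sharp_tol:
        return "rose sharply"
    if change <= -sharp_tol:
        return "fell sharply"
    return "rose" if change > 0 else "fell"


def trend_caption(measure: str, series_by_year: Dict[int, List[float]]) -> str:
    # One clause per year, in chronological order.
    clauses = [
        f"{describe_trend(values)} in {year}"
        for year, values in sorted(series_by_year.items())
    ]
    return f"This chart shows that {measure} " + ", then ".join(clauses) + "."


# Invented quarterly figures, for illustration only.
sales = {1995: [100, 102, 99, 101], 1996: [101, 120, 140, 160]}
print(trend_caption("sales", sales))
```

A real module would need to find the interesting segments itself rather than rely on fixed yearly boundaries, which is why a clustering analysis is proposed as future work.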
