
Related Work

Most prior work uses motion profile, some combination of relative and absolute linear and angular positions, velocities, and accelerations, as the features that drive event classification. That work follows the tradition of linguists and cognitive scientists, such as [31], [39], [47], [26, 27], and [44], who represent the lexical semantics of verbs via the causal, aspectual, and directional qualities of motion. Other linguists and cognitive scientists, such as [25] and [28], have argued that force-dynamic relations [63], such as support, contact, and attachment, are crucial for representing the lexical semantics of spatial prepositions. For example, in some situations, part of what it means for one object to be on another object is for the former to be in contact with, and supported by, the latter. In other situations, something can be on something else by way of attachment, as in the knob on the door. [50] has argued that change in the state of force-dynamic relations plays a more central role than motion profile in specifying the lexical semantics of simple spatial motion verbs. The particular relative and absolute linear and angular positions, velocities, and accelerations do not matter when picking something up or putting something down; what matters is a state change in the source of support of the patient. Similarly, what distinguishes putting something down from dropping it is that, in the former, the patient is supported throughout, while, in the latter, the patient undergoes unsupported motion.
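The put-down/drop distinction can be made concrete with a small sketch. Assuming hypothetical per-frame boolean series `supported` and `moving` for the patient (these names are illustrative, not taken from any of the systems cited here), the two event types differ only in whether support holds throughout the motion:

```python
# Hypothetical sketch: distinguishing "put down" from "drop" by the
# patient's support status rather than by its motion profile.
# The predicates `supported` and `moving` are illustrative inputs:
# one boolean per video frame.

def put_down(supported, moving):
    """The patient moves at some point and is supported throughout."""
    return any(moving) and all(supported)

def drop(supported, moving):
    """The patient undergoes unsupported motion at some point."""
    return any(m and not s for s, m in zip(supported, moving))

# The same descent classifies differently depending on support:
descent = [True, True, True]
print(put_down([True, True, True], descent))   # supported throughout: put down
print(drop([True, False, False], descent))     # support lost mid-way: drop
```

Note that neither predicate inspects positions, velocities, or accelerations; the classification turns entirely on the support time series, which is the point of the force-dynamic argument.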

The work described in this paper differs from prior work in visual-event perception in a number of respects. [71], [70], [36], and [46] describe unimplemented frameworks that are not based on force dynamics. [64] describes a system that recognizes when an event occurs but not what event occurs; that system processes simulated video and is not based on force dynamics. [5], [3], [65], [42], [66, 67], [1], [2], [41], and [45] describe systems that process simulated video and that are not based on force dynamics. [12, 13] presents event definitions that are based on force-dynamic relations but does not present techniques for recovering those relations automatically from either simulated or real video.

[73], [62], [58], [54], [14, 15], [18], and [10] present systems that recognize event occurrences from real video using motion profile but not force dynamics. These systems use hidden Markov models rather than event logic as the event-classification engine.

[24] presents a heuristic approach to stability analysis that operates on simulated video but does not perform model reconstruction or event classification. [17] and [16] present a heuristic approach to stability analysis that operates on real video but do not use stability analysis to perform model reconstruction or event classification. [9] and [23] present stability-analysis algorithms that are based on linear programming but do not use stability analysis to perform model reconstruction or event classification; these algorithms use dynamics rather than kinematics.

[49, 50, 51, 52, 53, 55] presents systems that operate on simulated video and use force dynamics to recognize event occurrences. All of that work, except [55], uses heuristic approaches to stability analysis, model reconstruction, and event classification; [55] presents an early version of the stability-analysis and event-logic-based event-recognition techniques used in the current system. [34, 35] and [33] present a system that does model reconstruction from real video but does not use the recovered force-dynamic relations to perform event classification. That work uses an approach to stability analysis based on dynamics instead of the kinematic approach used in this paper.

There is also a body of prior work that grounds fragments of natural-language semantics in physical relations between objects, either in graphically represented blocks worlds or for solving physics word problems. Examples of such work include [11], [72], and [43], as well as the ISAAC system [40] and the MECHO project [20, 19, 32]. While that work does not focus on recognizing events per se, it does relate lexical semantics to physical relations between represented objects.

LEONARD currently does not contain a learning component. It is given a fixed physical theory of the world, implicitly represented in the model-reconstruction procedure, and a fixed collection of event-type descriptions, explicitly formulated as event-logic expressions. One potential area for future work would be to learn a physical theory of the world, event-type descriptions, or both, automatically. Adding a learning component could yield more robust model-reconstruction and event-classification components than those currently constructed by hand. Techniques such as those presented in [37] and [21] might be useful for this task.
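To give a flavor of what an explicitly formulated event-type description might look like, here is a minimal sketch of an event-logic-style evaluator: an event type is a function from a frame interval to a truth value, built by combining per-frame force-dynamic primitives. The combinator names, the `"support"` attribute, and the `pick_up` definition are all illustrative assumptions, not LEONARD's actual notation.

```python
# Hypothetical sketch of interval-based event evaluation.  An event type
# maps (frames, i, j) to True/False over the inclusive interval [i, j].

def liquid(primitive):
    """Lift a per-frame predicate: it holds over an interval iff it
    holds at every frame of that interval."""
    return lambda frames, i, j: all(primitive(frames[t]) for t in range(i, j + 1))

def before(e1, e2):
    """Sequential composition: e1 holds over a prefix of the interval
    and e2 holds over the remainder."""
    def ev(frames, i, j):
        return any(e1(frames, i, k) and e2(frames, k + 1, j)
                   for k in range(i, j))
    return ev

# Illustrative per-frame primitives for a "pick up" event: the patient's
# source of support changes from the table to the hand.
supported_by_table = liquid(lambda f: f["support"] == "table")
supported_by_hand = liquid(lambda f: f["support"] == "hand")
pick_up = before(supported_by_table, supported_by_hand)

frames = [{"support": "table"}, {"support": "table"},
          {"support": "hand"}, {"support": "hand"}]
print(pick_up(frames, 0, 3))  # True: support changes from table to hand
```

A learning component, as suggested above, would amount to inducing definitions like `pick_up` from examples rather than writing them by hand.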



Jeffrey Mark Siskind
Wed Aug 1 19:08:09 EDT 2001