This paper has presented LEONARD, a comprehensive implemented system for recovering event occurrences from video input. It differs from the prior approach to the same problem in two fundamental ways. First, it uses state changes in the force-dynamic relations between objects, instead of motion profile, as the key descriptive element in defining event types. Second, it uses event logic, instead of hidden Markov models, to perform event classification. One key result of this paper is the formulation of spanning intervals, a novel efficient representation of the infinite sets of intervals that arise when processing liquid and semi-liquid events. A second key result of this paper is the formulation of an efficient procedure, based on spanning intervals, for inferring all occurrences of compound event types from occurrences of primitive event types. The techniques of force-dynamic model reconstruction, spanning intervals, and event-logic inference have been used to successfully recognize seven event types from real video: pick up, put down, stack, unstack, move, assemble, and disassemble.
Using force dynamics and event logic to perform event recognition offers four key advantages over the prior approach of using motion profile and hidden Markov models. First, it is insensitive to variance in the motion profile of an event occurrence. Second, it is insensitive to the presence of extraneous objects in the field of view. Third, it allows temporal segmentation of sequential and parallel event occurrences. Fourth, it robustly detects the non-occurrence of events as well as their occurrence.
At a more fundamental level, this paper advances a novel methodology: grounding lexical-semantic representations in visual-event perception as a means for assessing the accuracy of such representations. Prior work on lexical-semantic representations has used calculi whose semantics were not precisely specified. Lexical entries formulated in such calculi derived their meaning from intuition and thus could not be empirically tested. By providing a lexical-semantic representation whose semantics is precisely specified via perceptual grounding, this paper opens up the field of lexical semantics to empirical evaluation. The particular representations advanced in this paper are clearly only approximations to the ultimate truth. This follows from the primitive state of our understanding of language and perception. Nonetheless, I hope that this paper offers an advance towards the ultimate truth, through both its novel methodology and the particular details of its mechanisms.