
Introduction

If one were to look at the image sequence in Figure 1(a), one could describe the event depicted in that sequence by saying Someone picked the red block up off the green block. Similarly, if one were to look at the image sequence in Figure 1(b), one could describe the event depicted in that sequence by saying Someone put the red block down on the green block. One way that one recognizes that the former is a pick up event is that one notices a state change in the force-dynamic relations between the participant objects. Prior to Frame 13, the red block is supported by the green block by a substantiality constraint, the fact that solid objects cannot interpenetrate [59, 8, 6, 7, 60, 61]. From Frame 13 onward, it is supported by being attached to the hand. Similarly, one way that one recognizes that the latter is a put down event is that one notices the reverse state change in Frame 14. This paper describes an implemented computer system, called LEONARD, that can produce similar event descriptions from such image sequences. A novel aspect of this system is that it produces event descriptions by recognizing state changes in force-dynamic relations between participant objects. Force dynamics is a term introduced by [63] to describe a variety of causal relations between participants in an event, such as allowing, preventing, and forcing. In this paper, I use force dynamics in a slightly different sense, namely, to describe the support, contact, and attachment relations between participant objects.
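
To make this concrete, the following is a minimal sketch, not LEONARD's actual code, of how such a state change could be detected once the per-frame truth values of the relevant force-dynamic relations are known. The two predicates, and the idea of summarizing each frame by two booleans, are simplifications invented here for exposition.

```python
# A minimal sketch (not LEONARD's actual code) of distinguishing a
# pick up from a put down by the state change in how the patient is
# supported. Each frame t is summarized by two hypothetical booleans:
#   rests_on[t]    - patient supported by resting on the source
#                    (the substantiality constraint)
#   attached_to[t] - patient supported by attachment to the hand

def classify_support_transfer(rests_on, attached_to):
    """Return ('pick up', frame) or ('put down', frame) at the frame
    where support switches between resting-on and attachment."""
    for t in range(1, len(rests_on)):
        if rests_on[t - 1] and not rests_on[t] and attached_to[t]:
            return ("pick up", t)
        if attached_to[t - 1] and not attached_to[t] and rests_on[t]:
            return ("put down", t)
    return None

# Frames 0-12: red block rests on green block; frames 13+: held by hand.
rests_on    = [True] * 13 + [False] * 6
attached_to = [False] * 13 + [True] * 6
print(classify_support_transfer(rests_on, attached_to))  # ('pick up', 13)
```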

Figure 1: Image sequences depicting (a) a pick up event and (b) a put down event.

A number of systems have been reported that can produce event descriptions from video input. Examples of such systems include the work reported in [73], [62], [58], [14], and [10]. LEONARD differs from these prior systems in two crucial ways. First, the prior systems classify events based on the motion profile of the participant objects. For example, [58] characterize a pick up event as a sequence of two subevents: the agent moving towards the patient while the patient is at rest above the source, followed by the agent moving with the patient away from the source while the source remains at rest. Similarly, a put down event is characterized as the agent moving with the patient towards the destination while the destination is at rest, followed by the agent moving away from the patient while the patient is at rest above the destination. In contrast, LEONARD characterizes events as changes in the force-dynamic relations between the participant objects. For example, a pick up event is characterized as a change from a state where the patient is supported by a substantiality constraint with the source to a state where the patient is supported by being attached to the agent. Similarly, a put down event is characterized as the reverse state change. Irrespective of whether motion profile or force dynamics is used to recognize events, event recognition is a process of classifying time-series data. In the case of motion profile, this time-series data takes the form of relative-and-absolute positions, velocities, and accelerations of the participant objects as a function of time. In the case of force dynamics, this time-series data takes the form of the truth values of force-dynamic relations between the participant objects as a function of time. This leads to the second difference between LEONARD and prior systems. The prior systems use stochastic reasoning, in the form of hidden Markov models, to classify the time-series data into event types. In contrast, LEONARD uses logical reasoning, in the form of event logic, to do this classification.
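
The contrast between the two kinds of time-series data can be summarized with a small sketch. The type and field names below are invented for exposition; they are not part of LEONARD or of the cited systems.

```python
# Illustrative (invented) per-frame records for the two approaches.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MotionFrame:
    # Motion-profile approach: continuous measurements per frame.
    position: Tuple[float, float]
    velocity: Tuple[float, float]
    acceleration: Tuple[float, float]

@dataclass
class ForceDynamicFrame:
    # Force-dynamic approach: truth values of relations per frame.
    supports: bool     # e.g. the source supports the patient
    contacts: bool     # e.g. the source contacts the patient
    attached: bool     # e.g. the hand is attached to the patient

# An HMM-based system classifies sequences of MotionFrame-like vectors;
# LEONARD instead applies event-logic inference to sequences of
# ForceDynamicFrame-like truth values.
```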

Using force dynamics and event logic (henceforth the `new approach') to recognize events offers several advantages over using motion profile and hidden Markov models (henceforth the `prior approach'). First, the new approach correctly recognizes an event despite a wider variance in motion profile than the prior approach. For example, when recognizing, say, a pick up event, the prior approach is sensitive to aspects of event execution, such as the approach angle and velocity of the hand, that are irrelevant to whether or not the event is actually a pick up. The new approach is not sensitive to such aspects of event execution.

Second, the new approach correctly recognizes an event despite the presence of unrelated objects in the field of view. The prior approach computes the relative-and-absolute positions and motions of all objects and pairs of objects in the field of view. It then selects the subset of objects whose positions and motions best match some model. This can produce incorrect descriptions when some unintended subset matches some unintended model better than the intended subset matches the intended model. The new approach does not exhibit this deficiency. Extraneous objects typically do not exhibit the precise sequence of state changes in force-dynamic relations needed to trigger the event-classification process and thus do not generate spurious claims of event occurrences.

Third, the new approach performs temporal and spatial segmentation of events. The prior approach matches an entire image sequence against an event model. It fails if that image sequence depicts multiple event executions, either in sequence or in parallel. In contrast, the new approach can segment a complex image sequence into a collection of sequential and/or overlapping events. In particular, it can handle hierarchical events, such as move, that consist of a pick up event followed by a put down event; a brief sketch of such composition follows this discussion. It can recognize that all three events, and precisely those three events, occur in an appropriate image sequence, whereas the prior approach would try to find the single best match.

Finally, the new approach robustly detects the non-occurrence of events as well as their occurrence. The prior approach always selects the best match and reports some event occurrence for every image sequence. Thresholding the match cost does not work, because an approach based on motion profile can be fooled into recognizing an event occurrence whenever an event's motion profile is similar to that of one or more target event classes, even though the event is not actually in any of those classes. Consider, for example, the two image sequences in Figure 2. Suppose that an event-recognition system contained two target event classes, namely pick up and put down. Neither of the image sequences depicts a pick up or a put down event. Nonetheless, the prior approach might mistakenly classify Figure 2(a) as a pick up event, because the second half of this image sequence matches the second half of the motion profile of a pick up event, or as a put down event, because the first half of this image sequence matches the first half of the motion profile of a put down event. Similarly, the prior approach might mistakenly classify Figure 2(b) as a pick up event, because the first half of this image sequence matches the first half of the motion profile of a pick up event, or as a put down event, because the second half of this image sequence matches the second half of the motion profile of a put down event. In contrast, the new approach correctly recognizes that neither of these image sequences exhibits the state changes in force-dynamic relations needed to qualify as either a pick up or a put down event. All four of these advantages are discussed in greater detail in Section 5.
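
Returning to the third advantage, here is a minimal sketch of how a hierarchical move event can be composed from a pick up followed by a put down. It assumes, purely for illustration, that detected events are reported as (start, end) frame intervals and that the `sequence` helper stands in for the sequential-composition operator of event logic; the interval values are hypothetical.

```python
# A minimal sketch of hierarchical event composition: the intervals
# and the 'sequence' helper are illustrative stand-ins for event-logic
# formulas, not LEONARD's actual inference procedure.

def sequence(first, second):
    """Occurrences of 'first then second': a first-event interval met
    or followed by a second-event interval."""
    return [(s1, e2) for (s1, e1) in first
                     for (s2, e2) in second if e1 <= s2]

pick_ups  = [(0, 13)]          # hypothetical detected intervals
put_downs = [(14, 22)]
moves = sequence(pick_ups, put_downs)
print(moves)                   # [(0, 22)]  - move = pick up then put down
# All three events - the pick up, the put down, and the move - are
# reported, rather than a single best match over the whole sequence.
```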

Figure 2: Image sequences depicting non-events.

The techniques described in this paper have been implemented in a system called LEONARD. LEONARD is a comprehensive system that takes image sequences as input and produces event descriptions as output. The overall architecture of LEONARD is shown in Figure 6. The input to LEONARD consists of a sequence of images taken by a Canon VC-C3 camera and Matrox Meteor frame grabber at 320×240 resolution at 30 fps. This image sequence is first processed by a segmentation-and-tracking component. A real-time colour- and motion-based segmentation algorithm places a convex polygon around each coloured and moving object in each frame. A tracking algorithm then forms a correspondence between the polygons in each frame and those in temporally adjacent frames. The output of the segmentation-and-tracking component consists of a sequence of scenes, each scene being a set of polygons. Each polygon is represented as a sequence of image coordinates corresponding to a clockwise traversal of the polygon's vertices. The tracker guarantees that each scene contains the same number of polygons and that they are ordered so that the ith polygon in each scene corresponds to the same object. Figure 3 shows the output of the segmentation-and-tracking component on the image sequences from Figure 1. The polygons have been overlaid on the input images for ease of comprehension.
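
The scene representation just described can be sketched as follows. The type names and the invariant-checking helper are illustrative assumptions, not LEONARD's actual interface.

```python
# A sketch of the segmentation-and-tracking output: each scene is a
# list of convex polygons, each polygon a clockwise list of image
# coordinates, and the tracker keeps polygon i in every scene
# corresponding to the same object. Names are invented for exposition.
from typing import List, Tuple

Point   = Tuple[int, int]     # (x, y) image coordinates
Polygon = List[Point]         # clockwise traversal of the vertices
Scene   = List[Polygon]       # one frame; index i = tracked object i

def check_tracker_invariant(scenes: List[Scene]) -> bool:
    """Every scene must contain the same number of polygons, so that
    the ith polygon always denotes the same tracked object."""
    return len({len(scene) for scene in scenes}) <= 1
```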

Figure 3: The output of the segmentation-and-tracking component applied to the image sequences from Figure 1. (a) depicts a pick up event. (b) depicts a put down event. The polygons have been overlaid on the input images for ease of comprehension.

This scene sequence is passed to a model-reconstruction component. This component produces a force-dynamic model of each scene. This model specifies three types of information: which objects are grounded, i.e. supported by an unseen mechanism that is not associated with any visible object; which objects are attached to other objects by rigid or revolute joints; and the qualitative depth of each object, i.e. a qualitative representation of the relative distance of different objects in the field of view from the observer, in the form of a same layer relation specifying which objects are at the same qualitative depth. Figure 4 shows the output of the model-reconstruction component on the scene sequences from Figure 3. The models are depicted graphically, overlaid on the input images, for ease of comprehension. The details of this depiction scheme will be described momentarily. For now, it suffices to point out that Figure 4(a) shows the red block on the same layer as the green block up through Frame 1 and attached to the hand from Frame 14 onward. Figure 4(b) shows the reverse sequence of relations, with the red block attached to the hand up through Frame 13 and on the same layer as the green block from Frame 23 onward.
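
A hedged sketch of what such a per-scene model might look like as a data structure follows; the field names are invented for exposition and are not LEONARD's actual representation.

```python
# An illustrative per-scene force-dynamic model: which objects are
# grounded, which pairs are joined rigidly or by revolute joints, and
# a same-layer relation giving qualitative depth.
from dataclasses import dataclass, field
from typing import FrozenSet, Set

@dataclass
class SceneModel:
    grounded: Set[int] = field(default_factory=set)               # object indices
    rigid_joints: Set[FrozenSet[int]] = field(default_factory=set)
    revolute_joints: Set[FrozenSet[int]] = field(default_factory=set)
    same_layer: Set[FrozenSet[int]] = field(default_factory=set)

# Hypothetical model for Frame 14 of the pick up sequence in Figure
# 4(a): the hand (object 0) is grounded and the red block (object 1)
# is attached to it.
frame14 = SceneModel(grounded={0}, rigid_joints={frozenset({0, 1})})
```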

Figure 4: The output of the model-reconstruction component applied to the scene sequences from Figure 3. (a) depicts a pick up event. (b) depicts a put down event. The models have been overlaid on the input images for ease of comprehension. In (a), the red block is on the same layer as the green block up through Frame 1 and is attached to the hand from Frame 14 onward. In (b), the reverse sequence of relations holds, with the red block attached to the hand up through Frame 13 and on the same layer as the green block from Frame 23 onward.

This model sequence is passed to an event-classification component. This component first determines the intervals over which certain primitive event types are true. These primitive event types include SUPPORTED(x), SUPPORTS(y, x), CONTACTS(x, y), and ATTACHED(x, y). This component then uses an inference procedure to determine the intervals over which certain compound event types are true. These compound event types include PICKUP(x, y, z), PUTDOWN(x, y, z), STACK(w, x, y, z), UNSTACK(w, x, y, z), MOVE(w, x, y, z), ASSEMBLE(w, x, y, z), and DISASSEMBLE(w, x, y, z) and are specified as expressions in event logic over the primitive event types. The output of the event-classification component consists of an indication of which compound event types occurred in the input movie as well as the subsequence(s) of frames during which those event types occurred. Figure 5 shows the output of the event-classification component on the model sequences from Figure 4. The subsequences of frames during which the events occurred are depicted as spanning intervals. Spanning intervals will be described in Section 4.1.
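
The first step of this process, turning the per-frame truth value of a primitive event type into the maximal intervals over which it holds, can be sketched as follows. The representation of intervals as half-open (start, end) frame pairs is an assumption made here for simplicity.

```python
# A minimal sketch of the first step of event classification: from
# the per-frame truth values of a primitive event type (an invented
# input format) to the maximal intervals over which it is true.

def true_intervals(truth_values):
    """Return maximal half-open (start, end) frame intervals where
    the primitive event type is true."""
    intervals, start = [], None
    for t, v in enumerate(truth_values):
        if v and start is None:
            start = t
        elif not v and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(truth_values)))
    return intervals

# e.g. the source supports the patient in frames 0-12 only:
print(true_intervals([True] * 13 + [False] * 10))  # [(0, 13)]
```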

Figure 5: The output of the event-classification component applied to the model sequences from Figure 4. Note that the pick up event is correctly recognized in (a) and the put down event is correctly recognized in (b).

LEONARD is too complex to describe completely in one paper. This paper provides a detailed description of the event-classification component and, in particular, the event-logic inference procedure. The segmentation and tracking algorithms are extensions of the algorithms presented in [58] and [56], modified to place convex polygons around the participant objects instead of ellipses. The model-reconstruction techniques are extensions of those presented in [55, 57]. The model-reconstruction techniques will be described briefly below to allow the reader to understand the event-classification techniques without reference to those papers.

Figure 6: The overall architecture of LEONARD.


