This paper presents a new approach to event recognition that differs from the prior approach in two ways. First, it uses force dynamics instead of motion profile as the feature set to differentiate between event types. Second, it uses event logic instead of hidden Markov models as the computational framework for classifying time-series data containing these features. Nominally, these two differences are independent. One can imagine using hidden Markov models to classify time series of force-dynamic features or using event logic to classify time series of motion-profile features. While such combinations are feasible in principle, they are unwieldy in practice.
Consider using event logic to classify time series of motion-profile features. Motion-profile features, such as position, velocity, and acceleration, are typically continuous. A given event usually corresponds to a vague range of possible feature values. This vagueness is well modeled by continuous-output hidden Markov models. Event logic, which is discrete in nature, requires quantizing precise feature-value ranges. Such quantization can lead to a high misclassification rate. Furthermore, continuous distributions allow partitioning a multidimensional feature space into different classes where the boundaries between classes are more complex than lines along the feature axes. Emulating this in event logic would require complex disjunctive expressions.
Similarly, consider using hidden Markov models to classify time series of force-dynamic features. Suppose that a feature vector contains n features. Since both force-dynamic and motion-profile features typically relate pairs of objects, n is often quadratic in the number of event participants. Let us contrast the number of parameters needed to represent these features in both the motion-profile and force-dynamic approaches. Since, as discussed above, motion-profile features are typically continuous, hidden Markov models with continuous outputs can be used in the motion-profile approach. When the features are independent, such a model requires O(n) parameters per state to specify the output distributions. Even if one uses, say, a multivariate Gaussian to model dependent features, this requires only parameters per state to specify the output distributions in the motion-profile approach. However, force-dynamic features are Boolean. This requires using discrete-output hidden Markov models. Such models output a stream of symbols, not feature vectors. Constructing an appropriate alphabet of output symbols requires considering all possible subsets of features. This requires parameters per state to specify the output distributions in the force-dynamic approach. Thus continuous-output hidden Markov models appear to be better suited to an approach that uses motion-profile features while event logic appears to be better suited to an approach that uses force-dynamic features.
Humans use language for three fundamental purposes: we describe what we see, we ask others to perform actions, and we engage in conversation. The first two require grounding language in perception and action. Only the third involves disembodied use of language. Almost all research in computational linguistics has focused on such disembodied language use. Data-base query processing, information extraction and retrieval, and spoken-language dialog all use language solely to manipulate internal representations. In contrast, the work described in this paper grounds language in perception of the external world. It describes an implemented system, called LEONARD, that uses language to describe events observed in short image sequences.
Why is perceptual grounding of language important and relevant to computational linguistics? Current approaches to lexical semantics suffer from the `bold-face syndrome.' All too often, the meanings of words, like throw, are taken to be uninterpreted symbols, like throw, or expressions over uninterpreted symbols, like cause to go [31, 39, 47, 26, 27, 44]. Since the interpretation of such symbols is left to informal intuition, the correctness of any meaning representation constructed from such uninterpreted symbols cannot be verified. In other words, how is one to know whether cause to go is the correct meaning of throw? Perceptual grounding offers a way to verify semantic representations. Having an implemented system use a collection of semantic representations to generate appropriate descriptions of observations gives evidence that those semantic representations are correct. This paper takes a small step in this direction. In contrast to prior work, which presents informal semantic representations whose interpretation is left to intuition, it presents perceptually-grounded semantic representations. While the system described in this paper addresses only perceptual grounding of language, the long-term goal of this research is to provide a unified semantic representation that is sufficiently powerful to support all three forms of language use: perception, action, and conversation.
Different parts of speech in language typically describe different aspects of visual percepts. Nouns typically describe objects. Verbs typically describe events. Adjectives typically describe properties. Prepositions typically describe spatial and temporal relations. Grounding language in visual perception will require construction of semantic representations for all of these different parts of speech. It is likely that different parts of speech will require different machinery to represent their lexical semantics. In other words, whatever the ultimate representation of apple and chair are, they are likely to be based on very different principles than the ultimate representation of pick up and put down. These, in turn, are likely to be further different from those needed to represent in, on, red, and big. Indeed, machine vision research, at least that aspect of machine vision research that focuses on object recognition, can be viewed as an attempt to perceptually ground the lexical semantics of nouns. In contrast, this paper focuses solely on verbs. Accordingly, it develops machinery that is very different from what is typically used in the machine-vision community, machinery that is more reminiscent of that which is used in the knowledge-representation community. On the other hand, unlike typical knowledge-representation work, it grounds that machinery in image processing.
When one proposes a representation, such as cause to go, as the meaning of a word, such as throw, one must specify three things to effectively specify the meaning of that word. First, one must specify the lexical semantics of the individual primitives, how one determines the truth conditions of items like cause and to go. Second, one must specify the compositional semantics of the representation, how one combines the truth conditions of primitives like cause and to go to get the aggregate truth conditions of compound expressions like cause to go. Third, one must specify a lexical entry, a map from a word, like throw, to a compound expression, like cause to go. All three are necessary in order to precisely specify the word meaning.
Prior work in lexical semantics, such as the work of , , , [26, 27], and , is deficient in this regard. It specifies the third component without the first two. In other words, it formulates lexical entries in terms of compound expressions like cause to go, without specifying the meanings of the primitives, like cause and to go, and without specifying how these meanings are combined to form the aggregate meaning of the compound expression. This paper attempts to address that deficiency by specifying all three components. First, the lexical semantics of the event-logic primitives is precisely specified in Figure 9. Second, the compositional semantics of event logic is precisely specified in Section 3. Third, lexical entries for several verbs are precisely specified in Figure 10. These three components together formally specify the meanings of those verbs with a level of precision that is absent in prior work.
While these lexical entries are precise, there is no claim that they are accurate. Lexical entries are precise when their meaning is reduced to an impartial mechanical procedure. Lexical entries are accurate when they properly reflect the truth conditions for the words that they define. Even ignoring homonymy and metaphor, words such as move and assemble clearly have meanings that are much more complex than what is, and even can be, represented with the machinery presented in this paper. But that holds true of prior work as well. The lexical entries given in, for example, , , , [26, 27], and  also do not accurately reflect the truth conditions for the words that they define. The purpose of this paper is not to improve the accuracy of definitions. In fact, the definitions given in prior work might be more accurate, in some ways, than those given here. Rather, its purpose is to improve the precision of definitions. The definitions given in prior work are imprecise and that imprecision makes assessing their accuracy a subjective process: do humans think an informally specified representation matches their intuition. In contrast, precision allows objective assessment of accuracy: does the output of a mechanical procedure applied to sample event occurrences match human judgments of which words characterize those occurrences.
Precision is the key methodological advance of this work. Precise specification of the meaning of lexical semantic representations, by way of perceptual grounding, makes accuracy assessment possible by way of experimental evaluation. Taking this first step of advancing precision and perceptual grounding will hopefully allow us to take future steps towards improving accuracy.