Certain properties of objects are visible: for example, position, orientation, shape, size, colour, texture, and so forth. Relational variants of these properties are also visible, as are changes in such properties and relations over time. In contrast, force-dynamic properties and relations are not visible. One cannot see the fact that the door knob is attached to, and supported by, the door. One must infer that fact using physical knowledge of the world. Such knowledge includes the fact that unsupported objects fall and that attachment is one way of offering support. Using physical knowledge to infer force-dynamic properties and relations was first discussed by [49, 50, 51]. This approach later became known as the perceiver framework.
The perceiver framework states that perception involves four levels. First, one must specify the observables: what properties and relations can be discerned by direct observation. Second, one must specify an ontology: what properties and relations must be inferred from the observables. Descriptions of the observables in terms of such properties and relations are called interpretations. There may be multiple interpretations of a given observation. Third, one must specify a theory: a way of differentiating consistent interpretations from inconsistent ones. The consistent interpretations are the models of the observation. There may be multiple models of a given observation. Finally, one must specify a preference relation: a way of ordering the models. The most-preferred models of the observations are the percepts.
One can instantiate the perceiver framework for different observables, ontologies, theories, and preference relations. [49, 50, 51, 52, 53, 55] instantiated this framework for a kinematic theory applied to simulated video. [34, 35] instantiated this framework for a dynamics theory applied to real video. The framework has also been instantiated for a kinematic theory applied to real video. This paper uses this last approach.
The input to the model-reconstruction process consists of a sequence of scenes, each scene being a set of convex polygons. Each polygon is represented as a sequence of points corresponding to a clockwise traversal of the polygon's vertices. The tracker guarantees that each scene contains the same number of polygons and that they are ordered so that the polygon at a given position in each scene corresponds to the same object. The output of the model-reconstruction process consists of a sequence of interpretations, one interpretation per scene. The interpretations are formulated out of primitive properties of, and relations between, the objects in each scene: whether a polygon is grounded, whether two polygons are rigidly attached, whether two polygons are attached by a revolute joint, and whether two polygons are on the same layer.
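The input constraints described above can be checked mechanically. The following is a minimal sketch, not the tracker or the model-reconstruction code used in this paper. The names `turn`, `is_convex_clockwise`, and `validate_scene_sequence` are hypothetical, and the sketch assumes a coordinate system with the y-axis pointing up (in image coordinates with y pointing down, the sign test would flip) and polygons that do not self-intersect.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]   # vertices in clockwise order
Scene = Sequence[Polygon]   # polygon k is the same object in every scene


def turn(o: Point, a: Point, b: Point) -> float:
    """Cross product of (a - o) and (b - o); negative means a clockwise turn (y up)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def is_convex_clockwise(polygon: Polygon) -> bool:
    """True if every vertex turns clockwise (or is collinear) and at least one
    turn is strict. Assumes the polygon does not self-intersect."""
    n = len(polygon)
    if n < 3:
        return False
    turns = [turn(polygon[i], polygon[(i + 1) % n], polygon[(i + 2) % n])
             for i in range(n)]
    return all(t <= 0 for t in turns) and any(t < 0 for t in turns)


def validate_scene_sequence(scenes: Sequence[Scene]) -> None:
    """Raise ValueError if scenes differ in polygon count or contain a
    polygon that is not convex with clockwise vertex order."""
    if not scenes:
        return
    expected = len(scenes[0])
    for index, scene in enumerate(scenes):
        if len(scene) != expected:
            raise ValueError(f"scene {index} has {len(scene)} polygons, expected {expected}")
        for polygon in scene:
            if not is_convex_clockwise(polygon):
                raise ValueError(f"scene {index} contains a non-convex or non-clockwise polygon")


# Illustrative data: a square and a triangle, tracked across two scenes.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
triangle = [(2, 0), (3, 1), (4, 0)]
scene_a = [square, triangle]
scene_b = [square, [(x + 0.5, y) for x, y in triangle]]
validate_scene_sequence([scene_a, scene_b])
```

Correspondence between objects is carried purely by position in each scene's polygon list, which is why the sketch only needs to compare list lengths across scenes.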
An interpretation I is a 4-tuple whose components specify which polygons are grounded, which pairs of polygons are rigidly attached, which pairs are attached by revolute joints, and which pairs are on the same layer. Throughout this paper, interpretations will be depicted graphically, overlaid on scene images, for ease of comprehension. Figure 7 gives a sample interpretation depicted graphically. A ground symbol attached to a polygon indicates that it is grounded. A solid circle indicates that two polygons are rigidly attached at the center of the circle. A hollow circle indicates that two polygons are attached by a revolute joint at the center of the circle. The same-layer relation is indicated by giving a layer index, a small nonnegative integer, to each polygon. Polygons with the same layer index are on the same layer, while those with different layer indices are on different layers.
Figure: The graphical method for depicting interpretations that is used in this paper. A ground symbol indicates that a polygon is grounded. A solid circle indicates a rigid joint. A hollow circle indicates a revolute joint. Two polygons with the same layer index are on the same layer.
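One way to hold such an interpretation in a program is sketched below. This is an illustrative data structure only, not the representation used in this paper: the class name `Interpretation`, its field names, and the helper `pair` are assumptions, polygons are identified by their index in the scene, and the location of each joint (the center of the circle in the graphical depiction) is omitted for brevity. The same-layer relation is derived from per-polygon layer indices, mirroring the depiction convention.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pair = Tuple[int, int]  # unordered pair of polygon indices, stored as (low, high)


def pair(i: int, j: int) -> Pair:
    """Normalize an unordered pair so that (i, j) and (j, i) compare equal."""
    return (min(i, j), max(i, j))


@dataclass(frozen=True)
class Interpretation:
    grounded: FrozenSet[int]    # polygons that are grounded
    rigid: FrozenSet[Pair]      # pairs attached by a rigid joint
    revolute: FrozenSet[Pair]   # pairs attached by a revolute joint
    layer: Tuple[int, ...]      # layer index of each polygon

    def is_grounded(self, i: int) -> bool:
        return i in self.grounded

    def rigidly_attached(self, i: int, j: int) -> bool:
        return pair(i, j) in self.rigid

    def revolute_attached(self, i: int, j: int) -> bool:
        return pair(i, j) in self.revolute

    def same_layer(self, i: int, j: int) -> bool:
        # Equal layer indices mean the two polygons are on the same layer.
        return self.layer[i] == self.layer[j]


# Illustrative values: polygon 0 is a door, polygon 1 its knob,
# polygon 2 an unrelated object on a different layer.
door_scene = Interpretation(
    grounded=frozenset({0}),
    rigid=frozenset({pair(1, 0)}),
    revolute=frozenset(),
    layer=(0, 0, 1),
)
```

Here the knob is not itself grounded; it is attached to a grounded door, which is the kind of force-dynamic fact that cannot be observed directly and must be inferred.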
Model reconstruction can be viewed as a generate-and-test process. Initially, all possible interpretations are generated for each scene. Then, inadmissible and unstable interpretations are filtered out. Admissibility and stability can be collectively viewed as a consistency requirement. The stable admissible interpretations are thus the models of a scene. The nature of the theory guarantees that there will always be at least one model for each scene, namely the model where all objects are grounded. There may, however, be multiple models for a given scene. Therefore, a preference relation is applied through a sequence of circumscription processes to select the minimal, or preferred, models for each scene. While there will always be at least one minimal model for each scene, there may be several, since the preference relation may not induce a total order. If there are multiple minimal models for a given scene, one is chosen arbitrarily as the most-preferred model for that scene. The precise details of the admissibility criteria, the stability-checking algorithm, the preference relations, and the circumscription process are beyond the scope of this paper and are discussed elsewhere. What is important, for the purpose of this paper, is that, given a scene sequence, model reconstruction produces a sequence of interpretations, one for each scene, and that these interpretations are 4-tuples containing the grounded, rigid-joint, revolute-joint, and same-layer predicates. Figure 4 shows sample interpretation sequences produced by the model-reconstruction component on the scene sequences from Figure 3.
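The generate-and-test structure can be illustrated with a deliberately small toy. The sketch below is not the admissibility test, stability checker, or circumscription procedure of this paper, whose details are out of scope here. It makes three loudly simplifying assumptions: an interpretation is reduced to its set of grounded objects; the hypothetical map `rests_on` stands in for the physical theory by saying which object, if any, each object rests on; and the preference relation is strict set inclusion on the grounded set, so that fewer grounded objects are preferred. All function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List

Grounded = FrozenSet[int]  # toy interpretation: just the set of grounded objects


def all_interpretations(n: int) -> List[Grounded]:
    """Generate step: every subset of the n objects as a candidate grounded set."""
    return [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]


def is_stable(grounded: Grounded, rests_on: Dict[int, int], n: int) -> bool:
    """Toy theory: an object is supported if it is grounded or rests on a
    supported object; an interpretation is consistent if all are supported."""
    def supported(i: int, seen: FrozenSet[int]) -> bool:
        if i in grounded:
            return True
        if i in seen or i not in rests_on:
            return False  # unsupported, or a cycle with no grounded member
        return supported(rests_on[i], seen | {i})
    return all(supported(i, frozenset()) for i in range(n))


def minimal_models(models: List[Grounded],
                   strictly_preferred: Callable[[Grounded, Grounded], bool]) -> List[Grounded]:
    """Keep each model that no other model is strictly preferred to.
    A partial order may leave several such models."""
    return [m for m in models
            if not any(strictly_preferred(other, m) for other in models)]


# Illustrative scene: object 1 rests on object 0; objects 0 and 2 rest on nothing.
n_objects = 3
rests_on = {1: 0}

candidates = all_interpretations(n_objects)                                  # generate
models = [g for g in candidates if is_stable(g, rests_on, n_objects)]        # test
preferred = minimal_models(models, lambda a, b: a < b)                       # proper subset
percept = preferred[0]                                                       # arbitrary choice
```

In this toy, the interpretation that grounds every object is always among the models, and the preferred model grounds only the objects that have nothing to rest on. Because set inclusion is only a partial order, `minimal_models` returns a list, and the final step picks one element from it arbitrarily.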