Certain properties of objects are visible: for example, position,
orientation, shape, size, colour, and texture.
Furthermore, relational variants of these properties are also visible,
as well as changes in such properties and relations over time.
In contrast, force-dynamic properties and relations are not visible.
One cannot *see* the fact that the door knob is attached to, and
supported by, the door.
One must *infer* that fact using physical knowledge of the world.
Such knowledge includes the fact that unsupported objects fall and attachment
is one way of offering support.
Using physical knowledge to infer force-dynamic properties and relations was
first discussed by [49, 50, 51].
This later became known as the *perceiver framework* advanced by
[29].
The perceiver framework states that perception involves four levels.
First, one must specify the *observables*, what properties and relations
can be discerned by direct observation.
Second, one must specify an *ontology*, what properties and relations must
be inferred from the observables.
Descriptions of the observables in terms of such properties and relations are
called *interpretations*.
There may be multiple interpretations of a given observation.
Third, one must specify a *theory*, a way of differentiating
*consistent* interpretations from *inconsistent* ones.
The consistent interpretations are the *models* of the observation.
There may be multiple models of a given observation.
Finally, one must specify a *preference relation*, a way of ordering the
models.
The most-preferred models of the observations are the *percepts*.
One can instantiate the perceiver framework for different observables,
ontologies, theories, and preference relations.
[49, 50, 51, 52, 53, 55]
instantiated this framework for a kinematic theory applied to simulated video.
[34, 35] and [33] instantiated this framework for
a dynamics theory applied to real video.
[57] instantiated this framework for a kinematic theory applied
to real video.
This paper uses the latter approach.
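
The four levels of the framework can be summarised set-theoretically. The symbols below are notation assumed here for illustration and are not taken from the cited work: $o$ is an observation, $T$ is the theory viewed as a consistency predicate, and $m' \prec m$ means that model $m'$ is strictly preferred to model $m$.

$$
\begin{aligned}
\mathcal{I}(o) &= \{\, i \mid i \text{ describes } o \text{ in terms of the ontology} \,\} && \text{(interpretations)} \\
\mathcal{M}(o) &= \{\, i \in \mathcal{I}(o) \mid T(i) \,\} && \text{(models)} \\
\mathcal{P}(o) &= \{\, m \in \mathcal{M}(o) \mid \nexists\, m' \in \mathcal{M}(o) : m' \prec m \,\} && \text{(percepts)}
\end{aligned}
$$

Under this reading, $\mathcal{P}(o) \subseteq \mathcal{M}(o) \subseteq \mathcal{I}(o)$, and $\mathcal{P}(o)$ may contain more than one element whenever $\prec$ is not a total order.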

The input to the model-reconstruction process consists of a sequence of
*scenes*, each scene being a set of convex polygons.
Each polygon is represented as a sequence of points corresponding to a
clockwise traversal of the polygon's vertices.
The tracker guarantees that each scene contains the same number of polygons
and that they are ordered so that the *i*-th polygon in each scene
corresponds to the same object.
The output of the model-reconstruction process consists of a sequence of
*interpretations*, one interpretation per scene.
The interpretations are formulated out of the following primitive properties
of, and relations between, the objects in each scene.

- Polygon *p* is *grounded*. It is constrained to occupy a fixed position and orientation by an unseen mechanism that is not associated with any visible object and thus cannot move either translationally or rotationally.
- Polygons *p* and *q* are *attached* by a *rigid joint* at point *r*. Both the relative position and orientation of *p* and *q* are constrained.
- Polygons *p* and *q* are *attached* by a *revolute joint* at point *r*. The relative position of *p* and *q* is constrained but the relative orientation is not.
- Polygons *p* and *q* are on the *same layer*. Layers are a qualitative representation of depth, or distance from the observer. This representation is impoverished. There is no notion of 'in-front-of' or 'behind' and there is no notion of adjacency in depth. The only representable notion is whether two objects are on the same or different layers. The same-layer relation is constrained to be an equivalence relation, i.e. it must be reflexive, symmetric, and transitive. Furthermore, two objects on the same layer must obey the *substantiality constraint*, the constraint that they not interpenetrate [59, 8, 6, 7, 60, 61].
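
The input representation and the substantiality constraint can be sketched in code. This is a minimal illustration and not the implementation used in this work. The names `signed_area`, `is_clockwise`, `consistent_tracking`, and `interpenetrate` are hypothetical, the y axis is assumed to point up (with image coordinates, where y points down, the orientation test flips sign), and the separating-axis test is one standard way to decide whether two convex polygons overlap.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]  # vertices listed in traversal order
Scene = List[Polygon]


def signed_area(poly: Polygon) -> float:
    """Shoelace formula; negative for a clockwise traversal when y points up."""
    n = len(poly)
    return 0.5 * sum(
        poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
        for i in range(n)
    )


def is_clockwise(poly: Polygon) -> bool:
    return signed_area(poly) < 0


def consistent_tracking(scenes: Sequence[Scene]) -> bool:
    """Every scene must contain the same number of polygons."""
    return len({len(scene) for scene in scenes}) <= 1


def _project(poly: Polygon, axis: Point) -> Tuple[float, float]:
    """Interval covered by the polygon when projected onto the axis."""
    dots = [x * axis[0] + y * axis[1] for x, y in poly]
    return min(dots), max(dots)


def interpenetrate(p: Polygon, q: Polygon) -> bool:
    """Separating-axis test: True if the interiors of two convex polygons overlap."""
    for poly in (p, q):
        n = len(poly)
        for i in range(n):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
            axis = (y2 - y1, x1 - x2)  # a normal to the edge
            if axis == (0, 0):
                continue  # skip degenerate zero-length edges
            p_min, p_max = _project(p, axis)
            q_min, q_max = _project(q, axis)
            if p_max <= q_min or q_max <= p_min:
                return False  # separating axis found; mere contact is allowed
    return True
```

With this sketch, two polygons that merely touch along an edge satisfy the substantiality constraint, while two whose interiors overlap violate it if they are placed on the same layer.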

An interpretation *I* is a 4-tuple:
⟨grounded, rigid, revolute, same-layer⟩.
Throughout this paper, interpretations will be depicted graphically, overlaid
on scene images, for ease of comprehension.
Figure 7 gives a sample interpretation depicted
graphically.
The symbol ` ' attached to a polygon indicates that it is
grounded.
A solid circle indicates that two polygons are rigidly attached at the center
of the circle.
A hollow circle indicates that two polygons are attached by a revolute joint
at the center of the circle.
The same-layer relation is indicated by giving a *layer index*, a
small nonnegative integer, to each polygon.
Polygons with the same layer index are on the same layer, while those with
different layer indices are on different layers.
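
One possible encoding of an interpretation is sketched below. It is an illustration under stated assumptions, not the data structure used in this work: polygons are identified by their index in the scene, the class `Interpretation` and its field names are hypothetical, and the same-layer relation is stored as a tuple of layer indices, as in the graphical depiction, instead of as an explicit set of pairs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Point = Tuple[float, float]
Joint = Tuple[int, int, Point]  # (p, q, r): polygons p and q joined at point r


@dataclass(frozen=True)
class Interpretation:
    grounded: FrozenSet[int]   # indices of grounded polygons
    rigid: FrozenSet[Joint]    # rigid joints
    revolute: FrozenSet[Joint] # revolute joints
    layer: Tuple[int, ...]     # layer[i] is the layer index of polygon i

    def same_layer(self, p: int, q: int) -> bool:
        return self.layer[p] == self.layer[q]


def is_equivalence(interp: Interpretation) -> bool:
    """Check reflexivity, symmetry, and transitivity of same_layer by brute force."""
    ids = range(len(interp.layer))
    same = interp.same_layer
    reflexive = all(same(p, p) for p in ids)
    symmetric = all(same(p, q) == same(q, p) for p in ids for q in ids)
    transitive = all(
        same(p, r)
        for p in ids for q in ids for r in ids
        if same(p, q) and same(q, r)
    )
    return reflexive and symmetric and transitive


# Hypothetical three-polygon scene: polygon 0 is grounded, polygon 1 is rigidly
# attached to it, and polygon 2 hangs from polygon 1 by a revolute joint.
example = Interpretation(
    grounded=frozenset({0}),
    rigid=frozenset({(0, 1, (1.0, 2.0))}),
    revolute=frozenset({(1, 2, (3.0, 2.0))}),
    layer=(0, 0, 1),
)
```

Because the relation is derived from equality of layer indices, it is an equivalence relation by construction; `is_equivalence` only makes that property explicit.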

**Figure:** The graphical method for depicting interpretations that is used in
this paper.
The symbol ` ' indicates that a polygon is
grounded.
A solid circle indicates a rigid joint.
A hollow circle indicates a revolute joint.
Two polygons with the same layer index are on the same layer.

Model reconstruction can be viewed as a generate-and-test process.
Initially, all possible interpretations are generated for each scene.
Then, *inadmissible* and *unstable* interpretations are filtered out.
Admissibility and stability can be collectively viewed as a consistency
requirement.
The stable admissible interpretations are thus *models* of a scene.
The nature of the theory guarantees that there will always be at least one
model for each scene, namely the model where all objects are grounded.
There may, however, be multiple models for a given scene.
Therefore, a *preference relation* is applied through a sequence of
circumscription processes [38] to select the minimal, or
preferred, models for each scene.
While there will always be at least one minimal model for each scene, there
may be several, since the preference relation may not induce a total order.
If there are multiple minimal models for a given scene, one is chosen
arbitrarily as the most-preferred model for that scene.
The precise details of the admissibility criteria, the stability checking
algorithm, the preference relations, and the circumscription process are
beyond the scope of this paper.
They are discussed in [57].
What is important, for the purpose of this paper, is that, given a scene
sequence, model reconstruction produces a sequence of interpretations, one for
each scene, and that these interpretations are 4-tuples containing the
predicates grounded, rigid, revolute, and same-layer.
Figure 4 shows sample interpretation sequences
produced by the model-reconstruction component on the scene sequences from
Figure 3.
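
The generate-and-test view can be illustrated with a deliberately reduced sketch. It rests on several assumptions that are not taken from this work: a candidate consists only of the set of grounded polygons (joints and layers are ignored), the consistency test `is_model` is a caller-supplied stand-in for the admissibility and stability checks, and the preference relation is taken to be subset inclusion on grounded sets. The actual criteria and circumscription process are those described in [57].

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterator, List

Candidate = FrozenSet[int]  # assumption: only the grounded set is modelled


def generate(n: int) -> Iterator[Candidate]:
    """Enumerate every subset of the n polygons as a candidate grounded set."""
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            yield frozenset(subset)


def minimal_models(n: int, is_model: Callable[[Candidate], bool]) -> List[Candidate]:
    """Filter candidates with is_model, then keep the subset-minimal ones."""
    models = [c for c in generate(n) if is_model(c)]
    # A model is minimal when no other model is a proper subset of it.
    return [m for m in models if not any(other < m for other in models)]


def toy_theory(candidate: Candidate) -> bool:
    """Hypothetical consistency test: polygon 0 or polygon 1 must be grounded."""
    return 0 in candidate or 1 in candidate


minimal = minimal_models(3, toy_theory)
chosen = minimal[0]  # with several minimal models, one is picked arbitrarily
```

In this toy setting the candidate with all polygons grounded is always a model, and there are two incomparable minimal models (grounding polygon 0 alone or polygon 1 alone), which shows why a preference relation that is only a partial order can leave several minimal models to choose from.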

Wed Aug 1 19:08:09 EDT 2001