The estimation task that we have identified in the previous section differs somewhat from the estimation problem solved by Walley and Fine [27], although the assumptions behind data generation are identical. The difference is subtle, but important for avoiding confusion and fully understanding the results in this area. Although Walley and Fine's formulation has a mathematical elegance in that it allows them to identify the optimal estimator (in their sense), we believe the objective we have outlined is more representative of what one is pragmatically interested in learning in the framework of convex sets of distributions.

From a sequence of outcomes, x_{1}, x_{2}, ..., one can construct a
sequence of relative frequencies, r_{1}(A), r_{2}(A), .... Here
r_{n}(A) is the relative frequency of occurrences of event A during the
initial n trials of the sequence. Rather than estimating the credal
set that generates the data, Walley and Fine characterize all possible
subsequences of relative frequencies. For example, suppose one considers
only the subsequence of frequencies with prime indices,
r_{2}, r_{3}, r_{5}, r_{7}, r_{11}, r_{13}, r_{17}, ...,
and that this subsequence converges to a limiting frequency.
Walley and Fine give estimators that capture this limiting
frequency with asymptotic certainty.
Popper [18, Sections 63-66] calls these limiting frequencies
``middle frequencies'', and points out that sequences may have multiple
middle frequencies.
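To make middle frequencies concrete, here is a small Python sketch (our own illustration, not from Walley and Fine or Popper): a 0/1 sequence built from geometrically growing blocks has relative frequencies that oscillate forever, so it has two distinct middle frequencies. The helper names `relative_frequencies` and `block_sequence` are ours.

```python
def relative_frequencies(outcomes):
    """r_n(A) for the event A = {outcome == 1}, for n = 1..len(outcomes)."""
    freqs, count = [], 0
    for n, x in enumerate(outcomes, start=1):
        count += x
        freqs.append(count / n)
    return freqs

def block_sequence(num_blocks, growth=3):
    """Alternate blocks of 0s and 1s whose lengths grow geometrically."""
    seq, length = [], 1
    for b in range(num_blocks):
        seq.extend([b % 2] * length)   # block of 0s, then 1s, then 0s, ...
        length *= growth
    return seq

seq = block_sequence(12)
freqs = relative_frequencies(seq)
# The frequencies keep oscillating: they dip near 0.25 at the ends of
# 0-blocks and climb to 0.75 at the ends of 1-blocks.
print(round(min(freqs[1000:]), 2), round(max(freqs[1000:]), 2))  # → 0.25 0.75
```

No single limiting frequency exists for such a sequence; the two values 0.25 and 0.75 are both middle frequencies in Popper's sense.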

Note that example 3 involves exactly this type of construction, and Walley and Fine emphasize estimation for this type of sequence. On the other hand, example 1 generates a sequence with a single middle frequency and does not yield a credal set estimate under Walley and Fine's approach.

One way to state the difference between our task and Walley and Fine's task is that we are interested in limiting frequencies for subsequences of outcomes, while Walley and Fine (and Popper) give estimators for limit points of subsequences of frequencies. The former approach characterizes the sequence of outcomes, and relates directly to the underlying credal set that generates the data; the latter characterizes the sequence of relative frequencies, and how that sequence may fail to converge in the classical sense. Indeed, Walley and Fine's objectives can be pursued by throwing away all information contained in the sequence of actual outcomes, keeping only the sequence of frequencies.

It is not hard to see that an estimator for Walley and Fine's task is an estimator for our task. However, for our task, a Walley and Fine estimator can often be substantially improved upon.

Walley and Fine propose the following estimator for lower envelopes.
Consider a sequence of trials { x_{1}, x_{2}, ... }.
For any event A, the relative frequency r_{i}(A) is the fraction of the first i trials in which A occurs.
From the original sequence { x_{1}, x_{2}, ... }, we can compute a sequence of relative frequencies
{ r_{1}(A), r_{2}(A), ... }.

Walley and Fine define a class of estimators for the lower envelope having the following form:

__r__(A) = lim_{n → ∞} inf { r_{i}(A) : k(n) ≤ i ≤ n },     (1)

where k(n) is a nondecreasing sequence of indices with k(n) → ∞ that grows more slowly than n.

The lower envelope formed through Walley and Fine's estimator can be extended to a convex set (the set of all distributions that dominate these estimates). Walley and Fine prove that this set dominates the credal set that generated the data [27, Theorem 4.1]. The dual upper envelope estimator is obtained by replacing the infimum in (1) with a supremum.
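For concreteness, expression (1) can be transcribed for a finite data set as follows. This is a sketch under our own assumptions: the window function k(n) = floor(sqrt(n)) is merely one admissible choice of a slowly growing sequence, and the function names are ours, not Walley and Fine's.

```python
import math

def relative_frequencies(outcomes):
    """r_i(A) for the event A = {outcome == 1}, for i = 1..len(outcomes)."""
    freqs, count = [], 0
    for i, x in enumerate(outcomes, start=1):
        count += x
        freqs.append(count / i)
    return freqs

def lower_envelope_estimate(outcomes):
    """min { r_i(A) : k(n) <= i <= n }, with k(n) = floor(sqrt(n)), n = len(outcomes)."""
    n = len(outcomes)
    freqs = relative_frequencies(outcomes)
    k = max(1, math.isqrt(n))          # illustrative choice of k(n)
    return min(freqs[k - 1:n])         # r_i for i = k(n), ..., n

def upper_envelope_estimate(outcomes):
    """The dual estimator: supremum in place of infimum."""
    n = len(outcomes)
    freqs = relative_frequencies(outcomes)
    k = max(1, math.isqrt(n))
    return max(freqs[k - 1:n])
```

On data that behave like i.i.d. trials the two envelopes nearly coincide; on a sequence with several middle frequencies they straddle all of them.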

For many of us who are used to thinking about relative frequencies in terms of
single distributions and i.i.d. trials,
the intuition behind Walley and Fine's estimator can be quite difficult to grasp.
For example, under that intuition r_{k(n)} and r_{n} both converge to the limiting relative frequency of the
infinite sequence of trials,
so it is not obvious why examining them together should uncover more information about the
mechanism generating the data than simply looking at the common limiting relative frequency.

In fact, instead of looking at just the limiting relative frequency of an infinite sequence,
Walley and Fine's estimator simultaneously considers the whole set of possible
limiting frequencies. If observations were generated by a single distribution through
an infinite sequence of i.i.d. trials,
each relative frequency in this set would converge (by the law of large numbers) to the same limiting
relative frequency, so that asymptotically r_{k(n)} = r_{k(n)+1} = ... = r_{n-1} = r_{n},
making the infimum in (1) uninteresting.
When we drop the single distribution and i.i.d. assumptions,
the estimates become far richer.

Walley and Fine [27, Theorem 4.1(a)] prove that their estimator produces a credal set that
dominates the underlying credal set with asymptotic certainty.
Their estimator will detect divergence from i.i.d. point probability with asymptotic favorability (their
Theorem 4.1(d)).
Also, any convergent subsequence of r_{1}, r_{2}, ... converges to a frequency contained in their
estimate, and their estimate is the smallest credal set for which this is true (their Theorem 4.1).
Thus, in a certain sense, their estimator optimally characterizes the asymptotic divergence of relative
frequencies for a given sequence of outcomes.

The astute reader, however, will notice that Walley and Fine's estimator does not recover ``nature's'' exact credal set in our previous examples. Since our goal is to recover, as best we can, ``nature's'' underlying credal set, there is clearly room for improvement. We now study Walley and Fine's estimator from a slightly different perspective, which helps clarify our own approach to the problem.

The above interpretation of Walley and Fine's estimator built an analogy between
the estimator and the minimum of a sequence of estimates, { r_{k(n)}, ..., r_{n} }. This is a
translation of expression (1) and serves the purpose of clarifying the logic behind the
estimator.

Walley and Fine's estimator can also be described as the minimum estimate produced by a generator of sub-sequences; this is the description that interests us in this paper.

Consider an infinite sequence of trials X = { x_{1}, x_{2}, ... }. Consider a generator of selection rules,
i.e., an algorithm that generates infinite sub-sequences X^{s} out of the sequence X, by specifying
members of X that must also be members of X^{s}. Take the following algorithm, which generates
sub-sequences for a given n:

- For all k from k(n) to n, produce the sub-sequence X^{k} = { x_{1}, ..., x_{k} }.

Each sub-sequence X^{k} has its own relative frequency r_{k}(A).
Suppose now that n → ∞.
If the original sequence X has multiple converging
frequencies, these frequencies must be attained in some of the sub-sequences, and their minimum will be
captured by Walley and Fine's estimator. The way to produce multiple converging frequencies was
illustrated in the coin example: progressively ``longer infinite'' numbers of trials must be used to generate
each one of the frequencies.

This procedure reveals the drawback of Walley and Fine's approach. The problem is that their estimator is geared toward capturing all possible limiting frequencies, regardless of the types of sequences it may find. Rarely will a sequence of data be maliciously produced as in the coin example, with progressively ``longer infinite'' segments generated by different probabilities. In general we expect trials to be generated by selecting distributions from the credal set in some defined, deterministic way, and then generating the data from the selected distributions. This is the most relevant situation in practice, where we are interested in assessing the extent to which our assumptions of randomness, and our abstractions in the modeling process, are justified.

The main goal of this paper is to develop estimators that are suited to deal with the situation described above. Suppose that we have some deterministic procedure selecting distributions. First suppose, for the sake of argument, that we know the deterministic procedure. For example, odd trials obey one distribution, even trials obey another. Then the logical way to proceed is to partition the data into even and odd sub-sequences and estimate relative frequencies in each one of them. This agrees with intuition and with statistical practice: if we suspect differences between blocks of data, we must run some form of cross-validation among the inferences obtained from different blocks.
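The even/odd partition just described can be sketched as follows (a hypothetical illustration; the deterministic 0/1 pattern standing in for the two distributions, and all names, are ours):

```python
def frequency(outcomes):
    """Relative frequency of the event A = {outcome == 1}."""
    return sum(outcomes) / len(outcomes)

def partition_estimates(outcomes):
    """Split trials into odd- and even-indexed sub-sequences and
    return the extremes of the per-block frequency estimates."""
    odd  = outcomes[0::2]   # trials 1, 3, 5, ...
    even = outcomes[1::2]   # trials 2, 4, 6, ...
    ests = [frequency(odd), frequency(even)]
    return min(ests), max(ests)

# Odd trials behaving like a coin with P(1) = 0.2, even trials like
# P(1) = 0.8, rendered as deterministic patterns with those frequencies:
odd_part  = ([1] + [0] * 4) * 2000          # frequency 0.2
even_part = ([1] * 4 + [0]) * 2000          # frequency 0.8
seq = [v for pair in zip(odd_part, even_part) for v in pair]

low, high = partition_estimates(seq)
print(low, high)   # → 0.2 0.8
```

The per-block estimates recover the two generating frequencies, whereas a single pooled relative frequency would only report their mixture, 0.5.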

Now suppose the distribution selection mechanism is unknown. We still proceed the same way, by partitioning the data into several sub-sequences, in the hope of matching the data patterns. If we take the lower bounds of the collections of relative frequencies that emerge, we obtain estimators that can capture aspects of the data that are not captured by Walley and Fine's estimator. The learning theorem proved in the next section demonstrates that this procedure in fact creates credal sets that dominate the ``true'' credal set with asymptotic certainty. We also indicate how to combine our procedure with Walley and Fine's estimator so as to improve both estimators.

Sun Jun 29 22:16:40 EDT 1997