Next: The finite learning theorem Up: Learning Convex Sets of Previous: The estimation task

# Walley and Fine's estimation task

The estimation task that we have identified in the previous section differs somewhat from the estimation problem solved by Walley and Fine [27], although the assumptions behind data generation are identical. The difference is subtle, but important for avoiding confusion and fully understanding the results in this area. Although Walley and Fine's formulation has a mathematical elegance in that it allows them to identify the optimal estimator (in their sense), we believe the objective we have outlined is more representative of what one is pragmatically interested in learning in the framework of convex sets of distributions.

From a sequence of outcomes, x1, x2, …, one can construct a sequence of relative frequencies, r1(A), r2(A), …. Here rn(A) is the frequency of occurrences of event A during the initial n trials of the sequence. Rather than estimating the credal set that generates the data, Walley and Fine characterize all possible subsequences of relative frequencies. For example, suppose one considers only the subsequence of prime-indexed frequencies, r2, r3, r5, r7, r11, r13, r17, …, and that this subsequence converges to a limiting frequency. Walley and Fine give estimators that capture this limiting frequency with asymptotic certainty. Popper [18, Sections 63-66] calls these limiting frequencies "middle frequencies", and points out that a sequence may have multiple middle frequencies.

Note that example 3 involves exactly this type of construction. Walley and Fine emphasize estimation with this type of sequence. On the other hand, example 1 generates a sequence with a single middle frequency and does not produce a credal set estimate under Walley and Fine's approach.

One way to state the difference between our task and Walley and Fine's task is that we are interested in limiting frequencies for subsequences of outcomes, while Walley and Fine (and Popper) give estimators for limit points of subsequences of frequencies. The former approach characterizes the sequence of outcomes, and relates directly to the underlying credal set that generates the data; the latter approach characterizes the sequence of relative frequencies, and how that sequence may fail to converge in the classical sense. Walley and Fine's objectives can be pursued by throwing away all information contained in the sequence of actual outcomes, keeping only the sequence of frequencies.

It is not hard to see that an estimator for Walley and Fine's task is an estimator for our task. However, for our task, a Walley and Fine estimator can often be substantially improved upon.

Walley and Fine propose the following estimator for lower envelopes. Consider a sequence of trials { x1, x2, … }. For any event A, the relative frequency ri(A) is the fraction of the first i trials in which A occurred. From the original sequence { x1, x2, … } we can thus compute a sequence of relative frequencies { r1(A), r2(A), … }.

Walley and Fine define a class of estimators for the lower envelope having the following form:

r(A) = liminf_{n → ∞} inf { r_i(A) : k(n) ≤ i ≤ n },    (1)

where k(n) is any function such that k(n) → ∞ and k(n)/n → 0 as n → ∞. For example, k(n) = √n yields one such estimator.
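For concreteness, a finite-horizon version of this estimator can be sketched in a few lines. This is an illustrative sketch only: the function name `wf_lower_estimate` is our own label, and k(n) = √n is just one admissible choice satisfying k(n) → ∞ and k(n)/n → 0 (the estimator proper takes the liminf as n → ∞, which a finite computation can only approximate).

```python
import math

def wf_lower_estimate(outcomes, k=lambda n: max(1, int(math.sqrt(n)))):
    """Finite-horizon sketch of Walley and Fine's lower-envelope
    estimate for the event A = {x == 1}: at horizon n, take the
    minimum relative frequency r_i(A) over the window k(n) <= i <= n.
    (The true estimator is the liminf of these minima as n -> infinity.)
    """
    n = len(outcomes)
    # r_i(A): relative frequency of A over the first i trials
    freqs, count = [], 0
    for i, x in enumerate(outcomes, start=1):
        count += (x == 1)
        freqs.append(count / i)
    return min(freqs[i - 1] for i in range(k(n), n + 1))
```

For a well-behaved sequence with a single limiting frequency, the windowed minimum simply tracks that common limit; the estimator only becomes informative when the running frequencies oscillate.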

The lower envelope formed through Walley and Fine's estimator can be extended to a convex set (the set of all distributions that dominate these estimates). Walley and Fine prove that this set dominates the credal set that generated the data [27, Theorem 4.1]. The dual upper envelope estimator is obtained by replacing the infimum in (1) with a supremum.

For many of us who are used to thinking about relative frequencies in terms of single distributions and i.i.d. trials, the intuition behind Walley and Fine's estimator can be quite difficult to grasp. For example, r_{k(n)}(A) and r_n(A) both converge to the relative frequency of the infinite sequence of trials, making it non-intuitive why looking at them together should uncover more information about the mechanism generating the data than simply looking at the common limiting relative frequency.

In fact, instead of looking at just the limiting relative frequency of an infinite sequence, Walley and Fine's estimator simultaneously considers the whole set of possible limiting frequencies. If observations were generated by a single distribution through an infinite sequence of i.i.d. trials, each relative frequency in this set would converge (by the law of large numbers) to the same limiting relative frequency, making r_{k(n)} = r_{k(n)+1} = … = r_{n-1} = r_n in the limit, and making the infimum in (1) uninteresting. When we drop the single-distribution and i.i.d. assumptions, the estimates become far richer.
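This contrast can be made concrete with a small numerical comparison. The sketch below (our own illustration; the function and variable names are hypothetical) computes the spread of running frequencies over the window k(n) ≤ i ≤ n for two deterministic sequences: a strictly alternating sequence, which stands in for a well-behaved p = 0.5 source, and a sequence of 1-blocks and 0-blocks of doubling length, in the spirit of the coin example, whose running frequency oscillates between two limit points.

```python
import math

def freq_window_spread(outcomes, kfun=lambda n: max(1, int(math.sqrt(n)))):
    """Return (min, max) of the running frequencies r_i over the
    window k(n) <= i <= n; a wide spread signals that the sequence
    has more than one limiting relative frequency."""
    count, freqs = 0, []
    for i, x in enumerate(outcomes, start=1):
        count += x
        freqs.append(count / i)
    window = freqs[kfun(len(outcomes)) - 1:]
    return min(window), max(window)

# Proxy for an i.i.d. p = 0.5 source: strictly alternating outcomes.
alt = [1, 0] * 512

# Blocks of 1s and 0s whose lengths double: the running frequency
# keeps swinging between roughly 1/3 and 2/3 and never settles.
blocks, val, size = [], 1, 1
while size <= 512:
    blocks += [val] * size
    val, size = 1 - val, 2 * size
```

For the alternating sequence the window's minimum and maximum nearly coincide, so the infimum in (1) adds nothing; for the doubling-block sequence they straddle a wide interval, which is exactly the information the estimator is built to capture.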

Walley and Fine [27, Theorem 4.1(a)] prove that their estimator produces a credal set that dominates the underlying credal set with asymptotic certainty. Their estimator will detect divergence from i.i.d. point probability with asymptotic favorability (their Theorem 4.1(d)). Also, any convergent subsequence of r1, r2, … converges to a frequency contained in their estimate, and their estimate is the smallest credal set for which this is true (their Theorem 4.1). Thus, in a certain sense, their estimator optimally characterizes the asymptotic divergence of relative frequencies for a given sequence of outcomes.

The astute reader, however, will notice that Walley and Fine's estimator does not recover ``nature's'' exact credal set in our previous examples. Since our goal is to recover, as best we can, ``nature's'' underlying credal set, there is clearly room for improvement. We now study Walley and Fine's estimator from a slightly different perspective, which helps clarify our own approach to the problem.

The above interpretation of Walley and Fine's estimator built an analogy between the estimator and the minimum of a sequence of estimates, { r_{k(n)}, …, r_n }. This is a direct translation of expression (1) and serves to clarify the logic behind the estimator.

Walley and Fine's estimator can also be described as the minimum estimate produced by a generator of sub-sequences; this is the description that interests us in this paper.

Consider an infinite sequence of trials X = { x1, x2, … }. Consider a generator of selection rules, i.e., an algorithm that generates sub-sequences Xs out of the sequence X by specifying members of X that must also be members of Xs. Take the following algorithm, which generates sub-sequences for a given n:

• For all k from k(n) to n, produce the sub-sequence Xk = { x1, …, xk }.

Each sub-sequence Xk has its own relative frequency. Suppose now that n → ∞. If the original sequence X has multiple converging frequencies, these frequencies must be attained in some of the sub-sequences, and their minimum will be captured by Walley and Fine's estimator. The way to produce multiple converging frequencies was illustrated in the coin example: progressively ``longer infinite'' numbers of trials must be used to generate each one of the frequencies.
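The selection-rule generator above can be written down directly. The names `prefix_subsequences` and `min_prefix_frequency` below are our own illustrative labels; the point of the sketch is only that taking the minimum relative frequency over the generated prefixes reproduces the inner infimum of expression (1).

```python
def prefix_subsequences(X, n, k):
    """For a given n, yield the prefix sub-sequences
    X_k = {x_1, ..., x_k} for k(n) <= k <= n, as in the
    selection-rule generator described in the text."""
    for m in range(k(n), n + 1):
        yield X[:m]

def min_prefix_frequency(X, n, k):
    """Minimum relative frequency of the event {x == 1} over the
    generated prefixes: the inner minimum of the estimator."""
    return min(sum(p) / len(p) for p in prefix_subsequences(X, n, k))
```

Letting n → ∞ and taking the liminf of these minima recovers the Walley and Fine lower-envelope estimate.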

This procedure reveals the drawback of Walley and Fine's approach. The problem is that their estimator is geared toward capturing all possible limiting frequencies, regardless of the types of sequences it may find. Rarely will a sequence of data be maliciously produced as in the coin example, with progressively ``longer infinite'' segments generated by different probabilities. In general we expect trials to be generated by selecting distributions from the credal set in some defined, deterministic way, and then generating the data from the selected distributions. This is the most relevant situation in practice, where we are interested in assessing to what extent our assumptions of randomness, and our abstractions in the modeling process, are justified.

The main goal of this paper is to develop estimators that are suited to deal with the situation described above. Suppose that we have some deterministic procedure selecting distributions. First suppose, for the sake of argument, that we know the deterministic procedure. For example, odd trials obey one distribution, even trials obey another. Then the logical way to proceed is to partition the data into even and odd sub-sequences and estimate relative frequencies in each one of them. This agrees with intuition and with statistical practice: if we suspect differences between blocks of data, we must run some form of cross-validation among the inferences obtained from different blocks.

Now suppose the distribution selection mechanism is unknown. We still proceed the same way, by partitioning the data into several sub-sequences, in the hope of matching the data patterns. If we take the lower bounds of the collections of relative frequencies that emerge, we obtain estimators that can capture aspects of the data that are not captured by Walley and Fine's estimator. The learning theorem proved in the next section demonstrates that this procedure in fact creates credal sets that dominate the ``true'' credal set with asymptotic certainty. We also indicate how to combine our procedure with Walley and Fine's estimator so as to improve both estimators.


© Fabio Cozman

Sun Jun 29 22:16:40 EDT 1997