Abstract: Consider two or more forecasters, each making a sequence of predictions for different events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts or outcomes were generated?

This work presents a novel and rigorous answer to this question. We design a sequential inference procedure for estimating the time-varying difference in forecast quality as measured by a relatively large class of proper scoring rules. The resulting confidence intervals can be continuously monitored to yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by employing variance-adaptive supermartingales.

Motivated by Shafer and Vovk's game-theoretic probability, our coverage guarantees are also distribution-free, in the sense that they make no distributional assumptions whatsoever on the forecasts or outcomes. We demonstrate their effectiveness by comparing forecasts on Major League Baseball (MLB) games and statistical postprocessing methods for ensemble weather forecasts.