What does it mean that such a large % of tracker cycles have do-validate set?
The % of tracks looks more like what I would expect.  I guess it means that
do-validate tracks live longer than average, so they account for a
disproportionate share of cycles.
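
The gap between the per-cycle and per-track percentages can be checked with a
small sketch. Everything here is an assumption for illustration: the
(lifetime, flag) layout, the function name, and the numbers are made up and
are not taken from the tracker or its logs.

```python
def do_validate_fractions(tracks):
    """Return (fraction of tracks, fraction of cycles) with do-validate set.

    tracks: list of (lifetime_in_cycles, do_validate) pairs.
    This layout is hypothetical, not the tracker's real data format.
    """
    n_tracks = len(tracks)
    total_cycles = sum(life for life, _ in tracks)
    flagged_tracks = sum(1 for _, flag in tracks if flag)
    flagged_cycles = sum(life for life, flag in tracks if flag)
    return flagged_tracks / n_tracks, flagged_cycles / total_cycles


# Made-up example: one long-lived do-validate track, three short-lived others.
tracks = [(100, True), (10, False), (10, False), (10, False)]
track_frac, cycle_frac = do_validate_fractions(tracks)
print(f"tracks: {track_frac:.0%}, cycles: {cycle_frac:.0%}")
```

If the real per-track lifetimes show the same pattern, the large per-cycle %
is just a lifetime-weighting effect and not a sign that too many tracks are
being flagged.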

What is the result distribution, and what does it mean?

For evaluation, don't use Mahalanobis distance, just plain old RMS distance,
perhaps with vague end special case.  Otherwise the meaning is obscure.
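
A minimal sketch of the plain RMS distance, assuming each track can be reduced
to a list of 2D positions paired one-to-one with reference positions. The
function name and the data layout are hypothetical, and the vague end special
case is left out because these notes do not define it.

```python
import math


def rms_distance(estimated, reference):
    """Root-mean-square Euclidean distance between paired 2D points.

    estimated, reference: equal-length lists of (x, y) tuples.
    No covariance weighting is applied, so the result is in the same
    units as the positions (unlike a Mahalanobis distance).
    """
    if not estimated or len(estimated) != len(reference):
        raise ValueError("need two non-empty point lists of equal length")
    squared = [
        (ex - rx) ** 2 + (ey - ry) ** 2
        for (ex, ey), (rx, ry) in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Made-up points: one exact match and one point off by 5 units.
err = rms_distance([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Because the result stays in position units, a threshold on it can be read
directly as a distance, which is the point of preferring it for evaluation.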

Find tracks with large error and look at them.
Check that tracks flagged as moving really are moving.
Test for possible false-negative moving tracks by applying a more liberal
moving test and examining the results manually.

Some way to put "bookmarks" in the replay GUI?

Try a different data set?  In this one, the statistics of the three classes
seem basically the same.

