Splay Trees for Data Compression

Abstract

We present applications of splay trees to two topics in data compression. First is a variant of the move-to-front (mtf) data compression (of Bentley,Sleator Tarjan and Wei) algorithm, where we introduce secondary list(s). This seems to capture higher-order correlations. An implementation of this algorithm with Sleator-Tarjan splay trees runs in time (provably) proportional to the entropy of the input sequence. When tested on some telephony data, compression ratio and run time showed significant improvements over original mtf-algorithm, making it competitive or better than popular programs. For stationary ergodic sources, we analyse the compression and output distribution of the original mtf-algorithm, which suggests why the secondary list is appropriate to introduce. We also derive analytical upper bounds on the average codeword length in terms of stochastic parameters of the source.

Secondly, we consider the compression (or coding) of source sequences where the codewords are required to preserve the alphabetic order of the source symbols. We describe the use of the semisplay tree, the regular splay tree, and splay trees with depth three and four for this application, and derive upper bounds on their compression efficiency. For example, the average codeword length of the semisplay tree is is no more than twice the source entropy plus some minor terms.