A formalism for the design of optimal adaptive text data compression rules

Kevin Scott Atteson, University of Pennsylvania


Data compression is the transformation of data into representations that are as concise as possible. In particular, noiseless coding is the theory of concisely encoding randomly generated information in such a way that the original data can be completely recovered from the encoded data. We present two abstract models of sources of information: the standard finite data model and a new infinite data model. For the finite data model, Huffman coding is known to yield the smallest possible average coding length of the transformed data. In the more general infinite data model, the popular technique of arithmetic coding is optimal in a strong sense. We also demonstrate that arithmetic coding is practical in the sense that it has finite delay with probability one. In recent years, "robust" or adaptive data compression techniques have become popular. We present a methodology based upon statistical decision theory for deriving optimal adaptive data compression rules for a given class of stochastic processes. We demonstrate the use of this methodology by finding optimal data compression rules for the class of fixed-order stationary Markov chains with non-zero transition probabilities. The optimal rules for this class involve integrals that cannot be evaluated in closed form. We present an analysis of rules that are used in practice and compare these with the optimal rules. Finally, we present the results of simulations, which agree well with our asymptotic results. In our conclusions, we make suggestions on how to derive optimal rules for more general classes of stochastic processes, such as the class of Markov chains of any order.
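To make the notion of an adaptive rule for a fixed-order Markov chain concrete, the following is a minimal sketch and not the dissertation's method. The abstract does not name the rules used in practice, so the add-one (Laplace) estimate is chosen here purely as an assumption, and the function name `adaptive_code_length` and its parameters are hypothetical. The sketch sums the ideal code length, -log2 of the predicted probability, over the symbols of a string; this is the length an arithmetic coder driven by the same predictions would approach.

```python
import math
from collections import defaultdict


def adaptive_code_length(text, order=1, alphabet=None):
    """Ideal code length in bits of `text` under an add-one adaptive rule.

    Assumption (not from the source): each symbol is predicted from the
    preceding `order` symbols using the add-one (Laplace) estimate
    (count + 1) / (total + alphabet_size).
    """
    symbols = sorted(set(text)) if alphabet is None else list(alphabet)
    m = len(symbols)
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> symbols seen
    bits = 0.0
    for i, sym in enumerate(text):
        # Simplification: the first `order` symbols use a shorter context.
        ctx = text[max(0, i - order):i]
        p = (counts[ctx][sym] + 1) / (totals[ctx] + m)
        bits += -math.log2(p)
        # Update the statistics only after coding the symbol, so that a
        # decoder could maintain the same counts and reproduce p.
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits
```

For example, `adaptive_code_length("aaaa", order=0, alphabet="ab")` multiplies the predictions 1/2, 2/3, 3/4 and 4/5 to get 1/5, giving log2(5) bits, about 2.32. Because the counts are updated only after each symbol is coded, an encoder and decoder stay synchronized without transmitting the model.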

Subject Area

Computer science|Statistics

Recommended Citation

Atteson, Kevin Scott, "A formalism for the design of optimal adaptive text data compression rules" (1995). Dissertations available from ProQuest. AAI9543045.