When reading about sequence learning and the prediction of elements in a time series, you inevitably cross paths with the area of sequence compression at some point, and so did I. In this post I’d like to outline a classical compression algorithm: the 1978 version of the Lempel–Ziv sequence compression algorithm, commonly known as LZ78 (Ziv & Lempel, 1978).
Let $X$ be a finite alphabet and let $X^\ast$ denote the set of all possible strings (or “words”) over $X$. Let $\epsilon$ denote the empty string. The aim is to encode (compress) a sequence of inputs $d = x_1\ldots x_l \in X^\ast$. The approach of Lempel–Ziv splits into two phases: parsing and encoding. In the parsing phase the algorithm splits the sequence $d$ into adjacent non-empty words $ w_1,\ldots,w_p \in X^\ast$, i.e.
\[d = w_1 \ldots w_p,\]such that the following conditions hold:
- Except for the last word, the $w_i$ are mutually distinct, i.e. for $i \neq j$ with $i,j < p$ we have $w_i \neq w_j$.
- Each word is either a single element, or is obtained by concatenating one of its predecessors in the sequence with an element of $X$, i.e. for each $j$ there is an $x \in X$ such that either $w_j = x$ or there is an index $i < j$ with $w_j = w_i x$.
Such a decomposition can be obtained by building up a dictionary whose keys (in order of insertion) yield the desired words, as follows:
“At each step take the shortest prefix of the remaining sequence that is not yet contained (as a key) in the dictionary and add it to the dictionary. Continue with the rest of the sequence (i.e. after removing that prefix) until the whole sequence is consumed. The last word, which contains the final element of the sequence, will either be new or will already be contained in the dictionary (cf. the first condition in the above list).”
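To make the parsing rule concrete, here is a minimal Python sketch; the helper name `lz78_parse` is mine, not from any particular library. It grows a candidate prefix one element at a time and cuts a new word as soon as the candidate is not yet a key of the dictionary.

```python
def lz78_parse(d):
    """Split d into words w_1, ..., w_p following the greedy rule above:
    repeatedly take the shortest prefix of the remaining sequence that
    is not yet a key of the dictionary."""
    dictionary = {}      # word -> position of insertion (1-based)
    words = []
    current = ""
    for x in d:
        current += x
        if current not in dictionary:
            dictionary[current] = len(dictionary) + 1
            words.append(current)
            current = ""
    if current:          # leftover: the last word may repeat an earlier one
        words.append(current)
    return words
```

The trailing `if` covers exactly the special case from the first condition: the input may run out while the candidate prefix is still a known dictionary key.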
With $w_0 := \epsilon$, the conditions in the above list imply that each word $w_j$, $j = 1,\ldots,p$, can be uniquely expressed in the form $w_j = w_i x$ with $i < j$ and $x \in X$, hence we may define
\[\texttt{tail-index}\thinspace w_j := i \ \ \ \text{ and } \ \ \ \texttt{head} \thinspace w_j := x.\]The compressed version of the sequence $d$ is a sequence of pairs in $\{0,\ldots,p-1\} \times X$, defined by
\[\texttt{LZ78}(d) := \big[ \big( \texttt{tail-index} \thinspace w_1 ,\texttt{head} \thinspace w_1 \big), \ldots , \big( \texttt{tail-index} \thinspace w_p ,\texttt{head} \thinspace w_p \big) \big].\]The sequence can be decoded by forming a trie $T$ whose nodes are indexed by $i=1,\ldots,p$ and which store their associated elements $\texttt{head} \thinspace w_i$. We further add a node indexed by $0$ storing the empty string $\epsilon$ – this will be the root of the tree. The parent of a node is given by its tail-index, i.e. two nodes $i < j$ are connected by an edge if $\texttt{tail-index} \thinspace w_j = i$. Each node has up to $\vert X \vert$ children, and no two children of a node store the same element, with potentially one exception: the node indexed by $p$ (recall that the last word is the only one that might have appeared earlier in the list). In that case we identify the node indexed by $p$ with the sibling storing the same element. The $i$th word $w_i$ can now easily be recovered from $T$ by traversing the path from the root (node $0$) down to node $i$ and concatenating the elements stored along the way.
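Under the same assumptions, encoding and decoding can be sketched as follows; `lz78_encode` and `lz78_decode` are again hypothetical helpers of mine, operating on the word list produced by `lz78_parse`. Rather than materialising the trie, the decoder stores the full word $w_i$ at each index $i$, which amounts to the same root-to-node traversal.

```python
def lz78_encode(words):
    """Map each word w_j = w_i x to the pair (tail-index, head) = (i, x)."""
    index = {"": 0}                            # w_0, the empty string
    pairs = []
    for w in words:
        pairs.append((index[w[:-1]], w[-1]))   # the tail w[:-1] occurred earlier
        index.setdefault(w, len(index))        # the last word may be a repeat
    return pairs

def lz78_decode(pairs):
    """Invert lz78_encode: node i stores the word w_i; concatenate all words."""
    words = [""]                               # the root (node 0) stores epsilon
    for tail, head in pairs:
        words.append(words[tail] + head)       # step to the parent, append head
    return "".join(words)
```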
Example
Suppose we want to compress $d$ = “abracadabrarabarbar”. This sequence decomposes as:
a | b | r | ac | ad | ab | ra | rab | ar | ba | r.
Number the words from 1 to 11, going from left to right. In each word, everything but the last element (its tail) is a word that appears earlier in the dictionary. Replace this tail by the position assigned to it earlier, where zero corresponds to the empty string, e.g. the tail ‘ra’ of the 8th word ‘rab’ can be found at position 7, and hence ‘rab’ translates to ‘7b’. The last word ‘r’ is already contained in the dictionary as word 3, so only its index is emitted. The resulting sequence reads as
0a | 0b | 0r | 1c | 1d | 1b | 3a | 7b | 1r | 2a | 3.
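As a sanity check, the sketches above reproduce this example; the only cosmetic difference is that `lz78_encode` emits the full pair `(0, 'r')` for the repeated last word, where the shorthand above writes just its index 3.

```python
words = lz78_parse("abracadabrarabarbar")
# ['a', 'b', 'r', 'ac', 'ad', 'ab', 'ra', 'rab', 'ar', 'ba', 'r']

pairs = lz78_encode(words)
# [(0, 'a'), (0, 'b'), (0, 'r'), (1, 'c'), (1, 'd'), (1, 'b'),
#  (3, 'a'), (7, 'b'), (1, 'r'), (2, 'a'), (0, 'r')]

assert lz78_decode(pairs) == "abracadabrarabarbar"
```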
References
- Ziv, J., & Lempel, A. (1978). Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5), 530–536. https://doi.org/10.1109/TIT.1978.1055934