# LZ77

2018-06-26 21:00

This is a series of posts where I jot down my learnings from computer science papers I’ve read. Previously, Storage strategies for collections in dynamically typed languages.

LZ77 is a compression algorithm by Abramham Lampel and Jacob Ziv published in a paper in 1977 (pdf). It is the basis of many algorithms used in places like PNG and ZIP. I thought it would be interesting to see how it works.

The paper is only 6 pages long, and packed with mathematics I don’t fully understand. This will be about the compression and decompression algorithms described in the paper, without much mention of the mathematics and analysis.

*I will try to explain things in simpler terms. To be consistent with the paper and make it easier to follow, I will refer to notations in the paper using italics.*

At a high level, the algorithm performs two steps:

- Parse a
*source*,*S*, into*words* - Map
*words*into*codewords*using a*coding scheme*

*The source is a string of symbols from a finite alphabet, A, of cardinality α. Words are of at most L_{s} length. The codewords are of fixed length L_{c}, over the same A*

The parsing step is done by filling a buffer, or *window*, of size *n*, with symbols, and finding a *reproducible extension*.

To figure out what a *reproducible extension* is, let’s look at an example string. Let

*S*= 00101011,*l**e**n*(*S*) is the length of*S*, so*l**e**n*(*S*) = 8

*S*(*i*, *j*) is a substring of *S* starting from position *i* until position *j* (inclusive). As a shortcut *S*(*i*) = *S*(*i*, *i*).

A *prefix* of *S* is a substring starting from position 1 until position *j*, where *j* < *l**e**n*(*S*). For example, *S*(1, 5) = 00101 is a prefix of *S*, and *S* is not a prefix of itself.

An *extension* of the prefix *S*(1, *j*) into *S* is:

- a substring
*S*(*j*+ 1,*k*), (where*j*+ 1 ≤*k*≤*l**e**n*(*S*)), - which matches some other substring
*S*(*p*,*q*) that begins before or at*j*(*p*≤*j*) - and since these substrings must be the same length,
*q*=*p*+*k*−*j*

In our example, with *j* = 3, the extensions of *S*(1, *j*) into *S* (which must begin at *j* + 1 = 4) are:

*S*(4, 4) = 0, matches*S*(1, 1)*S*(4, 5) = 01, matches*S*(2, 3)*S*(4, 6) = 010, matches*S*(2, 4)*S*(4, 7) = 0101, matches*S*(2, 5)

The *reproducible extension* is the longest of these extensions, which is *S*(4, 7), since *l**e**n*(*S*(4, 7)) = 4, and the match is *S*(2, 5).

And the *pointer of reproduction*, denoted *p*, is where the matched substring begins, which is 2.

With this, we can now dive deeper into the encoding algorithm:

- Begin with a string of
*n*−*L*_{s}zeroes, followed the first*L*_{s}symbols from the source, call this*B*_{1} - Find the reproducible extension of the prefix
*B*_{1}(1,*n*−*L*_{s}) into*B*_{1}(1,*n*− 1), with length*l*_{1}− 1 and*p*_{1}as the pointer of reproduction - Set the source word,
*S*_{1}, to be the reproducible extension, with 1 more symbol added to the end, i.e.*S*_{1}=*B*_{1}(*n*−*L*_{s}+ 1,*n*−*L*_{s}+*l*_{1}), thus*l**e**n*(*S*_{1}) =*l*_{1}− 1 - Convert source word into codeword (details described later)
- Update the contents of the buffer by shifting out the first
*l*_{1}symbols, and adding the next*l*_{1}symbols from the source - Repeat

An example to help illustrate these steps (underscores don’t mean anything and serves as reading guide):

- Given:
*S*= 001010210210212021021200...*L*_{s}= 9*n*= 18

*B*_{1}= 000000000_001010210, 9 zeros and the first 9 symbols of*S*- The reproducible extension of
*B*_{1}(1, 9) into*B*_{1}(1, 17), is*B*_{1}(10, 11) = 00, with length 2, thus*l*_{1}= 3.*p*_{1}can be anywhere from 0 to 9, let’s pick*p*_{1}= 9 - The source word
*S*_{1}is*B*_{1}(*n*−*L*_{s}+ 1,*n*−*L*_{s}+*l*_{i}) =*B*_{1}(10, 12) = 001 - The code word
*C*_{1}= 22_02_1, calculate like so:- it is made up of three parts,
*C*_{i}=*C*_{i1}*C*_{i2}*C*_{i3} *C*_{i1}is*p*_{i}− 1 in base-*α*, which is (9 − 1)_{3}= 22, of length ⌈log (*n*−*L*_{s})⌉*C*_{i2}is*l*_{i}− 1 in base-*α*, which is (3 − 1)_{3}= 02, of length ⌈log (*L*_{s})⌉*C*_{i3}is the last symbol of the source word, which is 1

- it is made up of three parts,
- Shift the first 3 symbols from the buffer, to get 000000001010210, and add the next 3 symbols from
*S*, to get*B*_{2}= 000000_001010210_210 - Repeat
*B*_{2}= 000000001_010210210, reproducible extension of*B*_{2}(1, 9) into*B*_{2}(1, 17) is*B*(10, 12) of length 3,*l*_{2}= 4,*p*= 8,*C*_{2}= 21102*B*_{3}= 000010102_102102120, reproducible extension of*B*_{3}(1, 9) into*B*_{3}(1, 17) is*B*(10, 16) of length 7,*l*_{3}= 8,*p*= 7,*C*_{3}= 20212*B*_{4}= 210210212_021021200, reproducible extension of*B*_{4}(1, 9) into*B*_{4}(1, 17) is*B*(10, 17) of length 8,*l*_{4}= 9,*p*= 3,*C*_{4}= 02220

Decompression is doing these steps in reverse:

- Given the codeword
*C*= 2202121102... (determined above) - Load the initial decoding buffer
*D*_{1}of length*n*−*L*_{s}with zeroes, so*D*_{1}= 000000000 - Decode code word into
*p*_{1}and*l*_{1}and last symbol of*S*_{1} - Shift buffer left by
*l*_{1}times, each time adding*D*_{1}(*p*_{1}) to the end of buffer, except the last shift, which adds the last symbol of*S*_{1} - The last
*l*_{1}symbols will be*S*_{1}

Decoding our code words from the example above:

*C*_{1}= 22021*D*_{1}= 000000000*p*_{1}= 9,*l*_{1}= 3, and the last symbol is 1- shift and add to buffer
- shift left once, and add
*D*_{1}(9), 000000000 → 000000000 - shift left once, and add
*D*_{1}(9), 000000000 → 000000000 - shift left once, and add last symbol, 000000000 → 000000001

- shift left once, and add
- the last 3 symbols,
*S*_{1}is 001

The second code word:

*C*_{2}= 21102*D*_{2}= 000000001*p*_{2}= 8,*l*_{2}= 4, last symbol is 2- shift and add to buffer
- shift left once, and add
*D*_{2}(8), 000000001 → 000000010 - shift left once, and add
*D*_{2}(8), 000000010 → 000000100 - shift left once, and add
*D*_{2}(8), 000000100 → 000001001 - shift left once, and add last symbol, 000001001 → 000010102

- shift left once, and add
- the last 4 symbols
*S*_{2}is 0102

The third code word:

*C*_{3}= 20212*D*_{3}= 000010102*p*_{3}= 7,*l*_{3}= 8, last symbol is 2- shift and add to buffer
- shift left once, and add
*D*_{2}(7), 000010102 → 000101021 - shift left once, and add
*D*_{2}(7), 000101021 → 001010210 - shift left once, and add
*D*_{2}(7), 001010210 → 010102102 - shift left once, and add
*D*_{2}(7), 010102102 → 101021021 - shift left once, and add
*D*_{2}(7), 101021021 → 010210210 - shift left once, and add
*D*_{2}(7), 010210210 → 102102102 - shift left once, and add
*D*_{2}(7), 102102102 → 021021021 - shift left once, and add last symbol, 021021021 → 210210212

- shift left once, and add
- the last 8 symbols 10210212 is
*S*_{3}

The last code word:

*C*_{4}= 02220*D*_{4}= 210210212*p*_{4}= 3,*l*_{3}= 9, last symbol is 0- shift and add to buffer
- shift left once, and add
*D*_{2}(7), 210210212 → 102102120 - shift left once, and add
*D*_{2}(7), 102102120 → 021021202 - shift left once, and add
*D*_{2}(7), 021021202 → 210212021 - shift left once, and add
*D*_{2}(7), 210212021 → 102120210 - shift left once, and add
*D*_{2}(7), 102120210 → 021202102 - shift left once, and add
*D*_{2}(7), 021202102 → 212021021 - shift left once, and add
*D*_{2}(7), 212021021 → 120210212 - shift left once, and add
*D*_{2}(7), 120210212 → 202102120 - shift left once, and add last symbol, 202102120 → 021021200

- shift left once, and add
- the last 9 symbols 021021200 is
*S*_{4}