LZ77

Home

This is a series of posts where I jot down my learnings from computer science papers I’ve read. Previously, Storage strategies for collections in dynamically typed languages.

LZ77 is a compression algorithm by Abramham Lampel and Jacob Ziv published in a paper in 1977 (pdf). It is the basis of many algorithms used in places like PNG and ZIP. I thought it would be interesting to see how it works.

The paper is only 6 pages long, and packed with mathematics I don’t fully understand. This will be about the compression and decompression algorithms described in the paper, without much mention of the mathematics and analysis.

I will try to explain things in simpler terms. To be consistent with the paper and make it easier to follow, I will refer to notations in the paper using italics.

At a high level, the algorithm performs two steps:

Parse a source, S, into words
Map words into codewords using a coding scheme

The source is a string of symbols from a finite alphabet, A, of cardinality α. Words are of at most L_s length. The codewords are of fixed length L_c, over the same A

The parsing step is done by filling a buffer, or window, of size n, with symbols, and finding a reproducible extension.

To figure out what a reproducible extension is, let’s look at an example string. Let

S = 00101011,
len(S) is the length of S, so len(S) = 8

S(i, j) is a substring of S starting from position i until position j (inclusive). As a shortcut S(i) = S(i, i).

A prefix of S is a substring starting from position 1 until position j, where j < len(S). For example, S(1, 5) = 00101 is a prefix of S, and S is not a prefix of itself.

An extension of the prefix S(1, j) into S is:

a substring S(j + 1, k), (where j + 1 ≤ k ≤ len(S)),
which matches some other substring S(p, q) that begins before or at j (p ≤ j)
and since these substrings must be the same length, q = p + k − j

In our example, with j = 3, the extensions of S(1, j) into S (which must begin at j + 1 = 4) are:

S(4, 4) = 0, matches S(1, 1)
S(4, 5) = 01, matches S(2, 3)
S(4, 6) = 010, matches S(2, 4)
S(4, 7) = 0101, matches S(2, 5)

The reproducible extension is the longest of these extensions, which is S(4, 7), since len(S(4, 7)) = 4, and the match is S(2, 5).

And the pointer of reproduction, denoted p, is where the matched substring begins, which is 2.

With this, we can now dive deeper into the encoding algorithm:

Begin with a string of n − L_s zeroes, followed the first L_s symbols from the source, call this B₁
Find the reproducible extension of the prefix B₁(1, n − L_s) into B₁(1, n − 1), with length l₁ − 1 and p₁ as the pointer of reproduction
Set the source word, S₁, to be the reproducible extension, with 1 more symbol added to the end, i.e. S₁ = B₁(n − L_s + 1, n − L_s + l₁), thus len(S₁) = l₁ − 1
Convert source word into codeword (details described later)
Update the contents of the buffer by shifting out the first l₁ symbols, and adding the next l₁ symbols from the source
Repeat

An example to help illustrate these steps (underscores don’t mean anything and serves as reading guide):

Given:
- S = 001010210210212021021200...
- L_s = 9
- n = 18

B₁ = 000000000_001010210, 9 zeros and the first 9 symbols of S
The reproducible extension of B₁(1, 9) into B₁(1, 17), is B₁(10, 11) = 00, with length 2, thus l₁ = 3. p₁ can be anywhere from 0 to 9, let’s pick p₁ = 9
The source word S₁ is B₁(n − L_s + 1, n − L_s + l_i) = B₁(10, 12) = 001
The code word C₁ = 22_02_1, calculate like so:
- it is made up of three parts, C_i = C_i1C_i2C_i3
- C_i1 is p_i − 1 in base-α, which is (9 − 1)₃ = 22, of length ⌈log (n − L_s)⌉
- C_i2 is l_i − 1 in base-α, which is (3 − 1)₃ = 02, of length ⌈log (L_s)⌉
- C_i3 is the last symbol of the source word, which is 1
Shift the first 3 symbols from the buffer, to get 000000001010210, and add the next 3 symbols from S, to get B₂ = 000000_001010210_210
Repeat
- B₂ = 000000001_010210210, reproducible extension of B₂(1, 9) into B₂(1, 17) is B(10, 12) of length 3, l₂ = 4, p = 8, C₂ = 21102
- B₃ = 000010102_102102120, reproducible extension of B₃(1, 9) into B₃(1, 17) is B(10, 16) of length 7, l₃ = 8, p = 7, C₃ = 20212
- B₄ = 210210212_021021200, reproducible extension of B₄(1, 9) into B₄(1, 17) is B(10, 17) of length 8, l₄ = 9, p = 3, C₄ = 02220

Decompression is doing these steps in reverse:

Given the codeword C = 2202121102... (determined above)
Load the initial decoding buffer D₁ of length n − L_s with zeroes, so D₁ = 000000000
Decode code word into p₁ and l₁ and last symbol of S₁
Shift buffer left by l₁ times, each time adding D₁(p₁) to the end of buffer, except the last shift, which adds the last symbol of S₁
The last l₁ symbols will be S₁

Decoding our code words from the example above:

C₁ = 22021
D₁ = 000000000
p₁ = 9, l₁ = 3, and the last symbol is 1
shift and add to buffer
- shift left once, and add D₁(9), 000000000 → 000000000
- shift left once, and add D₁(9), 000000000 → 000000000
- shift left once, and add last symbol, 000000000 → 000000001
the last 3 symbols, S₁ is 001

The second code word:

C₂ = 21102
D₂ = 000000001
p₂ = 8, l₂ = 4, last symbol is 2
shift and add to buffer
- shift left once, and add D₂(8), 000000001 → 000000010
- shift left once, and add D₂(8), 000000010 → 000000100
- shift left once, and add D₂(8), 000000100 → 000001001
- shift left once, and add last symbol, 000001001 → 000010102
the last 4 symbols S₂ is 0102

The third code word:

C₃ = 20212
D₃ = 000010102
p₃ = 7, l₃ = 8, last symbol is 2
shift and add to buffer
- shift left once, and add D₂(7), 000010102 → 000101021
- shift left once, and add D₂(7), 000101021 → 001010210
- shift left once, and add D₂(7), 001010210 → 010102102
- shift left once, and add D₂(7), 010102102 → 101021021
- shift left once, and add D₂(7), 101021021 → 010210210
- shift left once, and add D₂(7), 010210210 → 102102102
- shift left once, and add D₂(7), 102102102 → 021021021
- shift left once, and add last symbol, 021021021 → 210210212
the last 8 symbols 10210212 is S₃

The last code word:

C₄ = 02220
D₄ = 210210212
p₄ = 3, l₃ = 9, last symbol is 0
shift and add to buffer
- shift left once, and add D₂(7), 210210212 → 102102120
- shift left once, and add D₂(7), 102102120 → 021021202
- shift left once, and add D₂(7), 021021202 → 210212021
- shift left once, and add D₂(7), 210212021 → 102120210
- shift left once, and add D₂(7), 102120210 → 021202102
- shift left once, and add D₂(7), 021202102 → 212021021
- shift left once, and add D₂(7), 212021021 → 120210212
- shift left once, and add D₂(7), 120210212 → 202102120
- shift left once, and add last symbol, 202102120 → 021021200
the last 9 symbols 021021200 is S₄