Home

This is a series of posts where I jot down my learnings from computer science papers I’ve read. Previously, Storage strategies for collections in dynamically typed languages.

LZ77 is a compression algorithm by Abramham Lampel and Jacob Ziv published in a paper in 1977 (pdf). It is the basis of many algorithms used in places like PNG and ZIP. I thought it would be interesting to see how it works.

The paper is only 6 pages long, and packed with mathematics I don’t fully understand. This will be about the compression and decompression algorithms described in the paper, without much mention of the mathematics and analysis.

I will try to explain things in simpler terms. To be consistent with the paper and make it easier to follow, I will refer to notations in the paper using italics.

At a high level, the algorithm performs two steps:

1. Parse a source, S, into words
2. Map words into codewords using a coding scheme

The source is a string of symbols from a finite alphabet, A, of cardinality α. Words are of at most Ls length. The codewords are of fixed length Lc, over the same A

The parsing step is done by filling a buffer, or window, of size n, with symbols, and finding a reproducible extension.

To figure out what a reproducible extension is, let’s look at an example string. Let

• S = 00101011,
• len(S) is the length of S, so len(S) = 8

S(i, j) is a substring of S starting from position i until position j (inclusive). As a shortcut S(i) = S(i, i).

A prefix of S is a substring starting from position 1 until position j, where j < len(S). For example, S(1, 5) = 00101 is a prefix of S, and S is not a prefix of itself.

An extension of the prefix S(1, j) into S is:

• a substring S(j + 1, k), (where j + 1 ≤ k ≤ len(S)),
• which matches some other substring S(p, q) that begins before or at j (p ≤ j)
• and since these substrings must be the same length, q = p + k − j

In our example, with j = 3, the extensions of S(1, j) into S (which must begin at j + 1 = 4) are:

• S(4, 4) = 0, matches S(1, 1)
• S(4, 5) = 01, matches S(2, 3)
• S(4, 6) = 010, matches S(2, 4)
• S(4, 7) = 0101, matches S(2, 5)

The reproducible extension is the longest of these extensions, which is S(4, 7), since len(S(4, 7)) = 4, and the match is S(2, 5).

And the pointer of reproduction, denoted p, is where the matched substring begins, which is 2.

With this, we can now dive deeper into the encoding algorithm:

1. Begin with a string of n − Ls zeroes, followed the first Ls symbols from the source, call this B1
2. Find the reproducible extension of the prefix B1(1, n − Ls) into B1(1, n − 1), with length l1 − 1 and p1 as the pointer of reproduction
3. Set the source word, S1, to be the reproducible extension, with 1 more symbol added to the end, i.e. S1 = B1(n − Ls + 1, n − Ls + l1), thus len(S1) = l1 − 1
4. Convert source word into codeword (details described later)
5. Update the contents of the buffer by shifting out the first l1 symbols, and adding the next l1 symbols from the source
6. Repeat

An example to help illustrate these steps (underscores don’t mean anything and serves as reading guide):

• Given:
• S = 001010210210212021021200...
• Ls = 9
• n = 18
1. B1 = 000000000_001010210, 9 zeros and the first 9 symbols of S
2. The reproducible extension of B1(1, 9) into B1(1, 17), is B1(10, 11) = 00, with length 2, thus l1 = 3. p1 can be anywhere from 0 to 9, let’s pick p1 = 9
3. The source word S1 is B1(n − Ls + 1, n − Ls + li) = B1(10, 12) = 001
4. The code word C1 = 22_02_1, calculate like so:
• it is made up of three parts, Ci = Ci1Ci2Ci3
• Ci1 is pi − 1 in base-α, which is (9 − 1)3 = 22, of length ⌈log (n − Ls)⌉
• Ci2 is li − 1 in base-α, which is (3 − 1)3 = 02, of length ⌈log (Ls)⌉
• Ci3 is the last symbol of the source word, which is 1
5. Shift the first 3 symbols from the buffer, to get 000000001010210, and add the next 3 symbols from S, to get B2 = 000000_001010210_210
6. Repeat
• B2 = 000000001_010210210, reproducible extension of B2(1, 9) into B2(1, 17) is B(10, 12) of length 3, l2 = 4, p = 8, C2 = 21102
• B3 = 000010102_102102120, reproducible extension of B3(1, 9) into B3(1, 17) is B(10, 16) of length 7, l3 = 8, p = 7, C3 = 20212
• B4 = 210210212_021021200, reproducible extension of B4(1, 9) into B4(1, 17) is B(10, 17) of length 8, l4 = 9, p = 3, C4 = 02220

Decompression is doing these steps in reverse:

1. Given the codeword C = 2202121102... (determined above)
2. Load the initial decoding buffer D1 of length n − Ls with zeroes, so D1 = 000000000
3. Decode code word into p1 and l1 and last symbol of S1
4. Shift buffer left by l1 times, each time adding D1(p1) to the end of buffer, except the last shift, which adds the last symbol of S1
5. The last l1 symbols will be S1

Decoding our code words from the example above:

1. C1 = 22021
2. D1 = 000000000
3. p1 = 9, l1 = 3, and the last symbol is 1
4. shift and add to buffer
• shift left once, and add D1(9), 000000000 → 000000000
• shift left once, and add D1(9), 000000000 → 000000000
• shift left once, and add last symbol, 000000000 → 000000001
5. the last 3 symbols, S1 is 001

The second code word:

1. C2 = 21102
2. D2 = 000000001
3. p2 = 8, l2 = 4, last symbol is 2
4. shift and add to buffer
• shift left once, and add D2(8), 000000001 → 000000010
• shift left once, and add D2(8), 000000010 → 000000100
• shift left once, and add D2(8), 000000100 → 000001001
• shift left once, and add last symbol, 000001001 → 000010102
5. the last 4 symbols S2 is 0102

The third code word:

1. C3 = 20212
2. D3 = 000010102
3. p3 = 7, l3 = 8, last symbol is 2
4. shift and add to buffer
• shift left once, and add D2(7), 000010102 → 000101021
• shift left once, and add D2(7), 000101021 → 001010210
• shift left once, and add D2(7), 001010210 → 010102102
• shift left once, and add D2(7), 010102102 → 101021021
• shift left once, and add D2(7), 101021021 → 010210210
• shift left once, and add D2(7), 010210210 → 102102102
• shift left once, and add D2(7), 102102102 → 021021021
• shift left once, and add last symbol, 021021021 → 210210212
5. the last 8 symbols 10210212 is S3

The last code word:

1. C4 = 02220
2. D4 = 210210212
3. p4 = 3, l3 = 9, last symbol is 0
4. shift and add to buffer
• shift left once, and add D2(7), 210210212 → 102102120
• shift left once, and add D2(7), 102102120 → 021021202
• shift left once, and add D2(7), 021021202 → 210212021
• shift left once, and add D2(7), 210212021 → 102120210
• shift left once, and add D2(7), 102120210 → 021202102
• shift left once, and add D2(7), 021202102 → 212021021
• shift left once, and add D2(7), 212021021 → 120210212
• shift left once, and add D2(7), 120210212 → 202102120
• shift left once, and add last symbol, 202102120 → 021021200
5. the last 9 symbols 021021200 is S4