# The LZ77 Compression Family (Ep 2, Compressor Head)

In the realm of data compression, there have been many rulers over many lands of information. But since the early days of the realm, there’s been one family that’s reigned over it with supreme authority. This ruling class is known as the Lempel-Ziv family of compressors. And it’s got bloodlines that span generations, with feuding brothers, and sisters, and cousins, even twice-removed aunts, a family tree so dynamic and interwoven that it makes “Game of Thrones” look like a gathering at a local pizza parlor. No, this family is more than that. It’s cunning, and dynamic, and efficient, traits that have allowed it to reign over the realm of compression for more than 30 years, undisputed. But to truly understand what gives this family its power, we’re going to have to go back and change the way that we approach our data, and challenge our fundamental understanding of entropy. Now, this is going to be a difficult task. But fear not, young programmer. I’m here to help. My name is Colt McAnlis. And this is Compressor Head. [MUSIC PLAYING]

What Claude Shannon developed

with the concept of entropy was a measurement of information content in a stream. Or, put another way, entropy defines the minimum number of bits needed per symbol, on average, to represent your data stream. So, for example, if you take the entropy of a stream and multiply it by the length of the stream, you get the smallest size that stream can be compressed to, according to information theory.

But there’s a reason it’s called information theory, not a law. We can break the rules. For example, if we took these sorted, linearly increasing numbers as our data stream, the entropy is about four bits per symbol. However, if we apply delta encoding, which is the process of encoding each symbol as its difference from the previous symbol, we end up with a data stream full of ones, whose entropy is effectively zero. What we did here was transform the stream into something with a different symbol frequency, dominantly ones, which changed the entropy, making it more compressible. This idea to transform
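Here’s that delta trick as a quick sketch (Python; my own toy helpers, not code from the show, and the example streams are made up for illustration):

```python
import math
from collections import Counter

def entropy_bits_per_symbol(stream):
    """Shannon entropy: average bits per symbol implied by symbol frequencies."""
    n = len(stream)
    return -sum((c / n) * math.log2(c / n) for c in Counter(stream).values())

def delta_encode(stream):
    """Encode each symbol as its difference from the previous symbol."""
    return [stream[0]] + [b - a for a, b in zip(stream, stream[1:])]

# A sorted, linearly increasing stream: 16 distinct values, 4 bits per symbol.
ramp = list(range(16))
print(entropy_bits_per_symbol(ramp))                # 4.0
# After delta encoding the stream is almost all 1s, and the entropy collapses.
print(entropy_bits_per_symbol(delta_encode(ramp)))  # ~0.34

# Peeking ahead: for a stream of just ones and twos, the same trick buys
# nothing -- the deltas are no more skewed than the original symbols.
ones_twos = [1, 2] * 8
print(entropy_bits_per_symbol(ones_twos))                # 1.0
print(entropy_bits_per_symbol(delta_encode(ones_twos)))  # ~0.99
```

The point is the frequency skew: the transform doesn’t shrink anything by itself; a statistical encoder does that afterwards.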

data into a state which can be better compressed lies at the heart of all modern compression theory. But it’s not as straightforward as it seems. Take this stream of ones and twos, for example. Applying a delta transform here doesn’t change the frequency of the symbols; the entropy remains about one bit per symbol, give or take a quarter of a bit. As such, we know that not every data transform is suited for every type of data. You need to know something about the data in your stream to understand which transforms will work on it to produce lower entropy, which, trust me, is a lot trickier than it sounds. Now, if we focus

the eye of Sauron on compressing textual data, we find that one of the more popular transforms is known as symbol grouping. You see, up until now, we’ve really only been looking at single symbols within a stream. But in reality, there’s often quite a lot of correlation between adjacent symbols. For example, you typically see the letter Q followed by the letter U. Grouping symbols means looking at the stream in terms of pairs, or longer runs, of adjacent symbols instead of just single ones. Guys, my clicker’s broken. Can– thanks. So, for that example data set, you can see that by grouping the symbols together, a new pattern emerges, which, of course, as we know, changes entropy. Thank you, please. But sadly, we don’t get a chance

to look at compressed data manually, and writing a grouping algorithm gets tough when you get complex strings. But just for giggles, we’re going to take a swing at this one anyway. Are you taking a selfie? Just turn the thing.

So, if we simply define each letter as a symbol, as we have been doing, we end up with an entropy that’s around 2.38, which isn’t that bad for a string like this. But I think we can do a little bit better. Can we speed this up next time, maybe? Actually, looking at this stream, we realize that words, or known words, would be a little bit easier to group by. For “to be or not to be,” grouping this stream by words gets us a better entropy of 1.98, which is much, much better. But I think we can still do more. Now we’re getting it. In fact, a naive approach may be to do something a little bit different: take the longest string that can be matched. In this case, “to be or not” is actually matched twice in this string, and it has the longest length. The problem here is that we end up with a whole group of single symbols that don’t have any duplicates and no clear skewed probability, and the entropy reflects this at 2.5. You see, what we really need is more of a dynamic grouping algorithm, one that tries to find a balance between the longest chains and the smallest entropy. And as you can imagine, that’s a pretty difficult problem. Thanks. Really? Let’s just go to the next scene. And with that in mind,
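here’s a rough Python sketch of the first two groupings we just tried (my own code, not the show’s; the exact values depend on whether you count spaces, so they won’t land exactly on the 2.38 and 1.98 quoted above):

```python
import math
from collections import Counter

def entropy(tokens):
    """Average bits per token implied by the token frequencies."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

text = "to be or not to be"

# Grouping 1: every character (spaces included) is its own symbol.
per_char = entropy(list(text))     # ~2.6 bits per character

# Grouping 2: whole words as symbols. "to" and "be" repeat, so the
# distribution is more skewed and the per-token entropy drops.
per_word = entropy(text.split())   # ~1.9 bits per word

print(per_char, per_word)
```

Note the trade-off being described here: fewer, longer tokens mean fewer symbols to encode, but group too aggressively and every token becomes unique, pushing entropy back up.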

it’s time to reveal the founders of the LZ dynasty. You see, back in 1977, right after an accident took him out of the running for the Winter Olympics, Abraham Lempel, together with Jacob Ziv, invented an algorithm which identifies the optimal way to tokenize a stream. The original two transforms, named LZ77 and LZ78, are so good at finding optimal tokenizations that in 30-plus years, no other algorithm has come along to replace them. Nope, the LZ family still makes up the backbone of all modern, useful compressors. The LZ algorithm works

by splitting the stream into two segments. The left side of the stream is dubbed the search buffer, and contains the symbols that we’ve already seen and processed. The right side of the stream is denoted the look-ahead buffer, and contains symbols we haven’t seen yet. In practical terms, the search buffer is some thousands of kilosymbols long, while the look-ahead buffer is some tens of symbols long.

The encoder reads a symbol from the look-ahead buffer and attempts to find a match for it in the search buffer. If the symbol is found, the encoder reads more symbols from the look-ahead buffer and continues searching backwards in the search buffer until it finds the longest match. When a match is finally settled on, the encoder outputs a token. The token has three parts: an offset, a length, and the next symbol in the stream. In this case, it would be nine backwards from the window, with a length of two symbols, where the next symbol in the look-ahead is the letter B. Once this is done, the window is shifted to the right. If the backward search yields no match for a given input symbol, or we’re seeing a symbol for the first time, then instead of outputting a token with length and offset, we output a null token, which is 0, 0, and then the symbol itself that we just encountered. The compression process comes from creating lists, and lists, and lists of these token values across the entire data set, and then combining each of the token channels in its own associative lane to create compression. Now that we’ve actually
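That whole encode/decode loop can be sketched in a few lines of Python (a toy of my own, not the show’s code; real implementations use hash chains and far larger windows than these made-up sizes):

```python
def lz77_encode(data, search_size=32, lookahead_size=8):
    """Toy LZ77: split the stream into a search buffer (already seen) and a
    look-ahead buffer, and emit (offset, length, next_symbol) tokens."""
    tokens, pos = [], 0
    while pos < len(data):
        best_off, best_len = 0, 0
        # Scan the search buffer for the longest match with the look-ahead.
        for cand in range(max(0, pos - search_size), pos):
            length = 0
            while (length < lookahead_size
                   and pos + length < len(data) - 1  # keep one literal for 'next'
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = pos - cand, length
        # No match found -> the null token: offset 0, length 0, plus the literal.
        tokens.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return tokens

def lz77_decode(tokens):
    """Rebuild the stream by copying `length` symbols from `offset` back."""
    out = []
    for off, length, sym in tokens:
        start = len(out) - off
        for i in range(length):
            out.append(out[start + i])
        out.append(sym)
    return ''.join(out)

msg = 'to be or not to be'
tokens = lz77_encode(msg)
assert lz77_decode(tokens) == msg  # round-trips losslessly
```

The tokens come out as (offset, length, next-symbol) triples, the three channels described above; on “to be or not to be” the second “to be” comes out as a single back-reference into the search buffer.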

been through encoding, it’s time to meet the rest of the family. At the top of our tree lie LZ78 and LZ77, our matriarch and patriarch algorithms. Each one has given rise to variants that change, just slightly, how you search through your search buffer or how you output tokens. For example, LZW will actually maintain a running, variable-length dictionary of symbols, always trying to optimize for the longest match possible. LZSS, on the other hand, will not emit a token if the token itself is longer than the string you’re trying to compress. Or, my favorite, LZMA, actually combines the LZ algorithm with Markov chains to produce better compression. In fact, LZMA is one of the cooler algorithms in this entire tree. Not only is it more modern, but it combines a statistical encoder with the LZ transform. And in fact, most of these algorithms do not compress as well as you would think on their own, and need statistical encoding to actually get the values you want.

If you look under the hood at most modern compression algorithms, you’ll see an LZ variant sitting around. For example, LZW is actually the backbone of PKZIP, a popular compression algorithm in the early ’90s. On the other hand, LZSS makes up most of what WinRAR uses. And this DEFLATE algorithm over here, you may recognize it. It powers everything behind GZIP, which runs most of the internet right now. Or you may know it by a better name, WinZip, which, of course, is the Windows variant of that. And of course, my favorite, LZMA, actually comes from a modern archiver known as 7-Zip that’s slowly gaining popularity. The truth of all

of this is that LZ is such a dynamic and efficient algorithm that it powers the backbone of the fastest, most-used archivers out in the public today. Even if something else comes along, the LZ family’s reign is going to last much longer than you and I realize.

In the realm of data compression, the LZ family still reigns supreme. It’s the backbone of the most popular and aggressive compressors out there. But there’s trouble in the realm. A new breed of algorithms, known as the Markov models, is threatening its reign. But that’s a story for a different show. My name is Colt McAnlis. Thanks for watching. [MUSIC PLAYING]
