The LZ77 Compression Family (Ep 2, Compressor Head)


In the realm of data compression, there have been many rulers over many lands of information. But since the early days of the realm, there’s been one family that’s reigned over it with supreme authority. This ruling class is known as the Lempel-Ziv family of compressors. And it’s got bloodlines that span generations, with feuding brothers, and sisters, and cousins, even twice-removed aunts, a family tree so dynamic and interwoven that it makes “Game of Thrones” look like a gathering at a local pizza parlor. No, this family is more than that. It’s cunning, and dynamic, and efficient, traits that have allowed it to reign over the realm of compression for more than 30 years, undisputed. But to truly understand what gives this family its power, we’re going to have to go back and change the way that we approach our data and challenge our fundamental understanding of entropy. Now, this is going to be a difficult task. But fear not, young programmer. I’m here to help. My name is Colt McAnlis. And this is Compressor Head. [MUSIC PLAYING]

What Claude Shannon developed
with the concept of entropy was a measurement of information content in a stream. Or put another way, entropy defines the minimum number of bits needed per symbol, on average, to represent your data stream. So for example, if you take the entropy of a stream and multiply it by the length of the stream, you get the smallest size that stream can be compressed to, according to information theory. But there’s a reason it’s called information theory, not a law. We can break the rules.
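As a rough illustration of that relationship, here is a minimal Python sketch; the helper names and the 16-value example stream are mine, not from the episode.

    import math
    from collections import Counter

    def shannon_entropy(stream):
        # Bits per symbol, computed from the symbol frequencies in the stream.
        counts = Counter(stream)
        total = len(stream)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def smallest_size_in_bits(stream):
        # Entropy times stream length: the floor information theory puts on the stream.
        return shannon_entropy(stream) * len(stream)

    data = list(range(16))                 # 16 equally likely symbols
    print(shannon_entropy(data))           # 4.0 bits per symbol
    print(smallest_size_in_bits(data))     # 64.0 bits for the whole stream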
So for example, if we took this sorted, linearly increasing set of numbers as our data stream, the entropy is about four bits per symbol. However, if we apply delta encoding, which is the process of encoding each symbol as the difference from the previous symbol, we end up with a data stream full of ones, whose entropy is effectively zero. Now, what we did here was transform the stream into something with a different symbol frequency, dominated by ones, which changed the entropy, making it more compressible.
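Here is a minimal sketch of that delta transform on a linearly increasing stream, reusing the shannon_entropy helper from the sketch above; again, the names are mine.

    def delta_encode(stream):
        # Encode each symbol as its difference from the previous symbol.
        out = [stream[0]]                       # the first symbol has no predecessor, keep it as-is
        for prev, cur in zip(stream, stream[1:]):
            out.append(cur - prev)
        return out

    data = list(range(16))                      # 0, 1, 2, ..., 15
    deltas = delta_encode(data)                 # 0, 1, 1, 1, ... almost all ones
    print(shannon_entropy(data))                # 4.0 bits per symbol
    print(shannon_entropy(deltas))              # ~0.34 bits per symbol, effectively zero by comparison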
This idea of transforming data into a state which can be better compressed lies at the heart of all modern compression theory. But it’s not as straightforward as it seems. So for example, take this stream of ones and twos. Applying a delta transform here doesn’t change the frequency of the symbols. The entropy remains about one bit per symbol, give or take a quarter of a bit. As such, we know that not every data transform is suited for every type of data. You need to know something about the data in your stream to properly understand which transforms work on it to produce lower entropy, which, trust me, is a lot trickier than it sounds.
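For contrast, here is the same transform on a stream that simply alternates ones and twos, where it buys us nothing; this is a sketch under the same assumptions as the snippets above.

    data = [1, 2] * 6                           # 1, 2, 1, 2, ... an even mix of two symbols
    deltas = delta_encode(data)                 # 1, 1, -1, 1, -1, ... still two symbols, roughly even
    print(shannon_entropy(data))                # 1.0 bit per symbol
    print(shannon_entropy(deltas))              # ~0.98 bits per symbol, no real improvement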
Now, if we focus the eye of Sauron on compressing textual data, we find that one of the more popular transforms is known as symbol grouping. You see, up until now, we’ve really only been looking at single symbols within a stream. But in reality, there’s often quite a lot of correlation between adjacent symbols. For example, you typically see the letter Q followed by the letter U. Grouping symbols means looking at the stream in terms of pairs or more of adjacent symbols instead of just single ones. Guys, my clicker’s broken. Can– thanks. So for that example data set, you can see that by grouping the symbols together, a new pattern emerges, which, of course, as we know, changes entropy. Thank you, please.
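Here is a minimal sketch of the pairing idea, again reusing shannon_entropy from earlier; the example string is a stand-in, not the data set shown in the episode.

    def group_pairs(stream):
        # View the stream as adjacent pairs of symbols instead of single symbols.
        # A trailing odd symbol is dropped for simplicity.
        return [stream[i:i + 2] for i in range(0, len(stream) - 1, 2)]

    text = "QUICK QUIET QUOTA"
    print(shannon_entropy(text))                # ~3.2 bits per single letter
    print(shannon_entropy(group_pairs(text)))   # ~2.4 bits per pair, i.e. ~1.2 bits per original letter

The exact numbers depend on the string; the point is just that the pair distribution is more skewed (lots of "QU") than the letter distribution is.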
But sadly, we don’t get a chance to look at compressed data manually. So writing a grouping algorithm gets tough when you get to complex strings. But just for giggles, we’re going to take a swing at this one anyway. Are you taking a selfie? Just turn the thing. So if we simply define each letter as a symbol, as we have been doing, we end up with an entropy that’s around 2.38, which isn’t that bad for a string like this. But I think we can do a little bit better. Can we speed this up next time maybe? Actually, looking at this stream, we realize that words, or known words, would probably be a little bit easier to group by. So for “to be or not to be,” for this stream, we can get a little bit better entropy at 1.98, which is much, much better. But I think we can still do more. Now we’re getting it. In fact, a naive approach may be to do something a little bit different. Perhaps we should actually take the longest string that can be matched. So in this case, “to be or not” is actually matched twice in this string, and it has the longest length. The problem here is that we end up with a whole group of single symbols that don’t have any duplicates and no clear skewed probability. And the entropy reflects this at 2.5. You see, what we really need is more of a dynamic grouping algorithm, one that tries to find a balance between the longest chains and the smallest entropy. And as you can imagine, that’s a pretty difficult problem. Thanks. Really? Let’s just go to the next scene.
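Here is a sketch of that comparison on a stand-in string; the episode’s exact string, and its 2.38 / 1.98 / 2.5 figures, are not reproduced here. It reuses shannon_entropy from the earlier sketch.

    text = "to be or not to be"
    by_letter = list(text)                      # tokenize as single letters (spaces included)
    by_word = text.split()                      # tokenize as whole words
    print(shannon_entropy(by_letter))           # ~2.6 bits per letter token
    print(shannon_entropy(by_word))             # ~1.9 bits per word token, since "to" and "be" repeat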
And with that in mind, it’s time to reveal the founders of the LZ dynasty. You see, back in 1977, right after an accident took him out of the running for the Winter Olympics, Abraham Lempel and Jacob Ziv invented an algorithm which identifies the optimal way to tokenize a stream. The original two transforms, named LZ77 and LZ78, are so good at finding optimal tokenization that in 30-plus years, there hasn’t been another algorithm to replace them. Nope, the LZ family still makes up the backbone of all modern, useful compressors.
The LZ algorithm works by splitting the stream into two segments. The left side of the stream is dubbed the search buffer and contains the symbols that we’ve already seen and already processed. The right side of the stream is denoted the look-ahead buffer and contains symbols we haven’t seen yet. In practical terms, the search buffer is some thousands of kilosymbols long, while the look-ahead buffer is some tens of symbols long. The encoder will read a symbol from the look-ahead buffer and attempt to find a match for it in the search buffer. If the symbol is found, then the encoder will go on to read more symbols from the look-ahead buffer and continue searching backwards in the search buffer until it finds the longest match. When a match is finally settled on, the encoder will output a token. The token has three parts: an offset, a length, and the next symbol in the stream. In this case, it would be nine backwards from the window, with a length of two symbols, where the next symbol in the look-ahead is the letter B. Once this is done, the window is then shifted to the right. If the backward search yields no match for a given input symbol, or we’re seeing a symbol for the first time, instead of outputting a token with length and offset, we actually output a null token, which is 0, 0, and then the symbol itself that we just encountered. The compression process comes from creating lists, and lists, and lists of these token values across the entire data set and then combining each of the channels in their own associative lanes to create compression.
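To make the token scheme concrete, here is a minimal Python sketch of an LZ77-style encoder and decoder, assuming the classic (offset, length, next symbol) token layout described above. The function names, buffer sizes, and brute-force backward search are purely illustrative; real encoders index the search buffer rather than scanning it.

    def lz77_encode(data, search_size=4096, lookahead_size=16):
        # Emit (offset, length, next_symbol) tokens by searching backwards for the longest match.
        tokens = []
        pos = 0
        while pos < len(data):
            best_offset, best_length = 0, 0
            max_length = min(lookahead_size, len(data) - pos - 1)
            # Walk the search buffer looking for the longest match with the look-ahead buffer.
            for start in range(max(0, pos - search_size), pos):
                length = 0
                while length < max_length and data[start + length] == data[pos + length]:
                    length += 1
                if length > best_length:
                    best_offset, best_length = pos - start, length
            if best_length > 0:
                tokens.append((best_offset, best_length, data[pos + best_length]))
                pos += best_length + 1
            else:
                # No match, or a first-time symbol: emit a null token (0, 0, symbol).
                tokens.append((0, 0, data[pos]))
                pos += 1
        return tokens

    def lz77_decode(tokens):
        # Rebuild the stream: copy `length` symbols starting `offset` back, then append the literal.
        out = []
        for offset, length, symbol in tokens:
            for _ in range(length):
                out.append(out[-offset])
            out.append(symbol)
        return "".join(out)

    tokens = lz77_encode("abcabcabcd")
    print(tokens)                 # [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 6, 'd')]
    print(lz77_decode(tokens))    # 'abcabcabcd'

Note that the decoder happily handles matches that reach into symbols it has only just produced, which is why a six-symbol match can be found in a three-symbol search buffer in the example above.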
Now that we’ve actually been through encoding, it’s time to meet the rest of the family. At the top of our tree lie LZ78 and LZ77, our matriarch and patriarch algorithms. Each one has given rise to variants that change just slightly how you search through your search buffer or how you output tokens. For example, LZW will actually maintain a running variable-length dictionary of symbols, always trying to optimize for the longest match possible. LZSS, on the other hand, will not emit a token if the token itself is longer than the string you’re trying to compress. Or my favorite, LZMA, actually combines the LZ algorithm with Markov chains to produce better compression. In fact, LZMA is one of the cooler algorithms in this entire tree. Not only is it more modern, but it actually combines a statistical encoder with the LZ transform. And in fact, most of these algorithms do not compress as well as you would think and need statistical encoding to actually get the values you want.
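As a rough sketch of that running-dictionary idea, here is a toy LZW encoder; it is not the exact variant any of those archivers ship, just an illustration of how the dictionary grows as longer matches are seen.

    def lzw_encode(text):
        # Emit dictionary codes, adding one new (longer) dictionary entry per code emitted.
        dictionary = {chr(i): i for i in range(256)}      # seed with every single-byte symbol
        next_code = 256
        current = ""
        codes = []
        for symbol in text:
            if current + symbol in dictionary:
                current += symbol                         # keep extending the current match
            else:
                codes.append(dictionary[current])         # emit the longest match found so far
                dictionary[current + symbol] = next_code  # remember the new, longer string
                next_code += 1
                current = symbol
        if current:
            codes.append(dictionary[current])
        return codes

    print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))         # single letters at first, then longer dictionary hits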
In fact, if you look under the hood at most modern compression algorithms, you’ll actually see an LZ variant sitting around. For example, LZW is actually the backbone of PKZIP, a popular compression algorithm in the early ’90s. On the other hand, LZSS makes up most of what WinRAR uses. And this DEFLATE algorithm over here, you may recognize it. It powers everything behind GZIP, which runs most of the internet right now. Or you may know it under a better-known name, WinZip, which, of course, is the Windows variant of that. And of course, my favorite, LZMA, actually comes from a modern archiver known as 7-Zip that’s slowly gaining popularity. The truth of all of this is that LZ is such a dynamic and efficient algorithm that it powers the backbone of the fastest, most-used archivers out there in the public today. Even if something else comes along, the LZ family’s reign is going to last much longer than you and I realize.
In the realm of data compression, the LZ family still reigns supreme. It’s the backbone of the most popular and aggressive compressors out there. But there’s trouble in the realm. A new breed of algorithms known as the Markov models is actually threatening its reign. But that’s a story for a different show. My name is Colt McAnlis. Thanks for watching. [MUSIC PLAYING]
