Let’s Program A Compression Algorithm Part 1: How To Fit A Byte Into A Bit And Other Curious Tricks

So a while ago there was a new Alice in Wonderland movie coming out and my local geek store apparently decided it would be fun to replay the trailer again and again every few minutes. So it should be no surprise I’ve got the lyrics to “White Rabbit” stuck in my head.

One pill makes you larger
And one pill makes you small
And the ones that mother gives you
Don’t do anything at all
Go ask Alice
When she’s ten feet tall

Speaking of growing and shrinking, have you ever really thought about data compression? It’s that amazing thing that lets you take an entire folder full of vacation photos and shrink it down into a single file that’s small enough to email to grandma. Data compression is also the star player behind streaming YouTube videos, fast-loading image files, compact MP3 songs, reasonable computer backups, home DVD players and anywhere else you need to take a very big file and make it very small for either storage or transport.

But how exactly does that work? You would think that making a file smaller would result in lost data, kind of like how shrinking a picture in Paint and then blowing it back up makes everything fuzzy. A 10MB file has 10MB of data, right? You’d think that the only way to shrink it is by throwing some of that data away.

After years of playing really old games on newer, bigger monitors I’ve grown fond of the fuzzy pixelated look

And yet somehow magic compression algorithms like zip manage to shrink things down and then return them to normal size without ever losing any of the original data.

To explain how that is possible we’re going to have to talk information theory, but that sounds boring so instead let’s talk fast food.

Your average fast food restaurant offers a dozen or so different kinds of sandwiches along with another dozen or so sides such as soda, onion rings and soft serve ice cream. Customers can then order whatever mix of items they want.

But not all combinations of items are equally popular. Every day hundreds and hundreds of people order sandwiches and sodas for lunch but only one or two customers a week ask for a six pack of onion rings plus an ice cream cone.

This leads to a brilliant idea: What if we make the most popular combinations of food items easier to order by giving them simple numbers? A lot of people order hamburgers, a drink and some fries so let’s call that the “#1 Combo”. Other people order chicken strips and a soda so let’s call that the “#2 Combo”. We get a lot of visitors from the gym next door who just want a salad and some water so let’s call that the “#3 Combo”.

Fast food combos are basically a way to “compress” common orders into simple numbers that chefs can “expand” by just looking at a chart. Customer wants a #3? Let’s check the menu and see which items go with that combo.

Customer service is thus much faster because now most people can just shout out a number instead of having to spend half a minute or more carefully listing out exactly what they want.
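If you like seeing ideas as code, here is a minimal Python sketch of the combo menu. The chart contents come from the combos described above, but the names `COMBOS`, `compress_order` and `expand_order` are made up for this illustration, and a real restaurant (or a real compressor) would need to handle far messier orders.

```python
# Hypothetical menu chart: combo number -> the full order it stands for.
COMBOS = {
    1: ("hamburger", "drink", "fries"),
    2: ("chicken strips", "soda"),
    3: ("salad", "water"),
}

# Reverse chart so the cashier can find a combo number from a full order.
ORDER_TO_COMBO = {items: number for number, items in COMBOS.items()}


def compress_order(items):
    """Return a combo number if the order matches one, else the order itself."""
    key = tuple(items)
    return ORDER_TO_COMBO.get(key, key)


def expand_order(order):
    """Turn a combo number back into the full list of items."""
    if isinstance(order, int):
        return COMBOS[order]
    return tuple(order)
```

Popular orders shrink to a single number, while the rare onion-rings-and-ice-cream order just gets passed along unchanged. Either way, expanding gives back exactly what the customer asked for.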

Information theory in a nutshell

Now information science has a lot of fancy words and equations for talking about this stuff but all you really need to know for now is that most types of data have patterns and those patterns can be used to create shortcuts. Restaurants use these patterns to replace complex orders with simple combo numbers and compression algorithms use these patterns to replace big files with simplified smaller files.

As an extreme example, imagine a file that’s just the word “Jabberwocky” repeated a million times. That’s roughly 11 MB of disk space, but do you really need all that? Not really. You could replace it with a file that says “JabberwockyX1000000” and as long as your code knew to interpret that as a million-item list everything would work exactly the same.
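The “word, then X, then a count” shortcut is simple enough to sketch in Python. This is only a toy under some loud assumptions: the function names `compress_repeat` and `expand_repeat` are invented here, the caller already knows which word to look for, and the file really is nothing but that one word repeated.

```python
def compress_repeat(text, word):
    """Return 'wordXcount' if text is just word repeated, else None."""
    if not word or len(text) % len(word) != 0:
        return None
    count = len(text) // len(word)
    if word * count != text:
        return None
    return f"{word}X{count}"


def expand_repeat(packed):
    """Rebuild the original text from a 'wordXcount' string."""
    # Split on the LAST 'X' so a word that contains an X still works.
    word, _, count = packed.rpartition("X")
    return word * int(count)
```

Feed it “Jabberwocky” repeated a million times and you get back a 19 character string, which `expand_repeat` can turn into the full text again without losing a single letter.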

And that’s the secret to compression: A 10MB file contains 10MB worth of 0s and 1s but not necessarily 10MB worth of unique information. By finding a more efficient way to express that same information you can shrink your files, sometimes quite dramatically.

For a more realistic example, imagine an image file made up of a few million pixels. In a raw image file each pixel is a 32-bit number telling the computer which of several million colors it should draw on the screen. But what if your image doesn’t have millions of different colors? What if it’s a plain black-and-white drawing? You don’t need a full 32 bits just to mark whether a pixel is black or white so you could save a ton of space by inventing a new compressed image format where each pixel is just a single bit, 0 for black and 1 for white.
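Here is roughly what that one-bit-per-pixel idea could look like in Python. It is a sketch, not a real image format: the names `pack_pixels` and `unpack_pixels` are made up, pixels are assumed to arrive as a plain list of 0s and 1s, and the pixel count has to be stored somewhere else so we know how much padding to throw away.

```python
def pack_pixels(pixels):
    """Pack a list of 0/1 pixels into bytes, eight pixels per byte."""
    packed = bytearray()
    for start in range(0, len(pixels), 8):
        chunk = pixels[start:start + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        # If the last chunk is short, pad it with zero bits on the right.
        byte <<= 8 - len(chunk)
        packed.append(byte)
    return bytes(packed)


def unpack_pixels(packed, count):
    """Recover the first `count` pixels from the packed bytes."""
    pixels = []
    for byte in packed:
        for shift in range(7, -1, -1):
            pixels.append((byte >> shift) & 1)
    return pixels[:count]
```

Eight pixels now fit in a single byte instead of taking 32 bits each, so the packed drawing is one thirty-second the size of the raw version and still unpacks to exactly the same pixels.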

Or maybe you have a cartoon video file with a lot of identical still shots. Instead of storing a complete image for all those identical frames maybe you could create a compressed video format with the ability to say “This frame should look just like the last one”.
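The “same as the last frame” trick can be sketched the same way. Everything here is hypothetical: `SAME_AS_LAST`, `compress_frames` and `expand_frames` are names invented for this example, and each frame is assumed to be a simple value (such as a bytes object) that can be compared with `==`. Real video formats are far more sophisticated.

```python
# Hypothetical marker meaning "this frame looks just like the last one".
SAME_AS_LAST = object()


def compress_frames(frames):
    """Replace each frame that repeats the previous one with a marker."""
    compressed = []
    previous = None
    for frame in frames:
        # The first frame is always stored in full.
        if compressed and frame == previous:
            compressed.append(SAME_AS_LAST)
        else:
            compressed.append(frame)
        previous = frame
    return compressed


def expand_frames(compressed):
    """Rebuild the full frame list by copying the last frame at each marker."""
    frames = []
    for item in compressed:
        if item is SAME_AS_LAST:
            frames.append(frames[-1])
        else:
            frames.append(item)
    return frames
```

A long still shot collapses into one full frame followed by a run of tiny markers, and expanding the list gives back every frame in its original order.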

And maybe you have a text file that’s just a little too big for an email attachment. You could just email it in parts but I bet with a little clever thought you could come up with a way to compress it.

That’s going to be the topic of this Let’s Program.

Well, that and Alice in Wonderland.