Last time we talked about Alice in Wonderland and the deep cruelty of stores that play the same handful of movie trailers on endless loop.
More importantly we also talked about how computers use compression to shrink files down for easier storage and faster transfer. Without good compression algorithms the Internet would grind to a halt and half the electronics you use on a daily basis wouldn’t exist. For instance, without compression a 2 hour HD movie would be over 300 GB; good luck fitting that all on one disc!
To better understand this vital computer science breakthrough we’re going to be writing our own ASCII text compression program.
Disclaimer: It’s not going to be a very good compression program. Like all of my Let’s Programs the goal here is education and not the creation of usable code. We’re going to be skipping out on all the security and error checking issues a professional compressor would include, and our end goal is a modest 20% reduction in file size compared to the 75%+ reduction well-known tools like Zip can pull off.
Now that your expectations have been properly lowered let’s talk about the structure of text files. After all, we can’t compress them if we don’t know what they’re supposed to look like or how they work.
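The ASCII chart we’ll be working with is a set of 256 symbols that includes all 26 letters of the English alphabet in both upper and lower case as well as all standard punctuation, some useful computer symbols (like newline) and a bunch of random symbols and proto-emoticons that seem like they were included mostly to fill up space. (Strictly speaking the original ASCII standard only defines the first 128 of those symbols; the other 128 come from the 8-bit “extended” charts built on top of it.)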
The fact that there are 256 symbols is very important because 256 is exactly how many different values a single 8-bit computer byte can hold (00000000 through 11111111, or 0 through 255). That means you can store one letter in one byte by simply using binary counting to mark down where the letter is in the ASCII chart.
For example, the capital letter “A” is in spot 65 of the ASCII chart. 65 in binary is 01000001. So to store the letter “A” in our computer we would grab a byte worth of space on our hard drive and fill it with the electronic pattern 01000001.
An ASCII text file is just a bunch of these 8-bit character bytes all in a row. If you want to save a 120 character long text you need a 120-byte long ASCII file. If you want to save a 5,000 character short story you need a roughly 5 kilobyte ASCII file.
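To make the one-letter-per-byte idea concrete, here is a minimal Python sketch. Python is just an assumption for illustration, and `to_bits` is a made-up helper name, not something from a library. It looks up a character’s spot in the chart and writes that number out as eight binary digits.

```python
def to_bits(ch):
    """Return the 8-bit binary pattern for a single one-byte character."""
    position = ord(ch)            # spot in the chart, e.g. 65 for "A"
    if position > 255:
        raise ValueError("character does not fit in one byte")
    return format(position, "08b")  # pad with leading zeros to 8 digits


text = "drink me"
print(to_bits("A"))                # 01000001
print(len(text.encode("ascii")))   # 8 (one byte per character)
```

The second print shows the file-size rule from above in miniature: eight characters in, eight bytes out.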
Ok, cool. Now we know what an ASCII file’s guts look like. Time to start looking for patterns we can use for compression.
Patterns… patterns… here’s one!
The ASCII files we’ll be working with are all based on the English alphabet, and in English not all letters are used evenly. Things like “s”, “e”, and “a” get used all the time while poor little letters like “x” and “z” hardly ever see the light of day. And don’t forget “ ”! You might not think of the blank space as a letter but just imagine tryingtowritewithoutit.
So some letters are much more common than others but ASCII stores them all in identical 8-bit bytes anyways. What if we were to change that? What if we stored the most common letters in smaller spaces, like 4-bit nibbles*?
The biggest challenge here is figuring out a way to let the computer know when it should expect a nibble instead of a byte. In normal ASCII every letter is eight bits long which makes it easy for the computer to figure out where one letter ends and the next begins.
But since we’re going to have letters of different lengths we need some way to point out to the computer what to expect. A sort of virtual name-tag to say “I’m a 4-bit letter” or “I’m a normal 8-bit letter”.
Here’s a simple idea: Our short 4-bit letters will always start with a 0. Our 8-bit ASCII letters, on the other hand, will always start with a 1.
This solution does have some drawbacks. If the first bit of our 4-bit letters is always 0 that means we only have three bits left for encoding the actual letter. Three bits only gives us eight different patterns (000 through 111) so that means we will only be able to compress eight letters.
A bigger problem comes from the 8-bit ASCII letters. They need their full 8 bits to work properly so the only way to mark them with a leading 1 is by gluing it to the front and turning them into 9-bit letters. So while our common letters had their size cut in half our uncommon letters are actually getting bigger. Hopefully we’ll still come out ahead but it might be close.
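Just how close? Here is a rough back-of-the-envelope sketch in Python (an illustration only; `average_bits` is a name invented for it). It assumes every letter is either a 4-bit shortcut or a 9-bit flagged byte, and works out the average cost per letter for a given share of shortcut letters.

```python
def average_bits(short_share):
    """Average bits per letter when `short_share` (0.0 to 1.0) of the
    letters get 4-bit codes and the rest grow to 9 bits."""
    return 4 * short_share + 9 * (1 - short_share)


for share in (0.0, 0.2, 0.5, 1.0):
    bits = average_bits(share)
    change = (bits - 8) / 8 * 100      # negative means the file shrank
    print(f"{share:.0%} short letters: {bits:.1f} bits per letter ({change:+.1f}%)")
```

Setting the average equal to the normal 8 bits gives a break-even point at exactly one letter in five. Below that the “compressed” file is bigger than the original, and getting down to 6.4 bits per letter (a 20% reduction) needs 52% of the letters to be on the short list.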
Anyways, it sounds like we’re going to have eight different shortcut codes to work with. What letters should we use them for? Well, according to Wikipedia the eight most common letters in the English language are, in order: E, T, A, O, I, N, S, H. So that’s probably a good bet if we want as much compression as possible.
Wikipedia doesn’t count the blank space as a letter, but because it’s so common in text it’s definitely something we want to compress. Let’s add it to the front of the list and drop “H”. That means the letters we will be compressing are “ ”, E, T, A, O, I, N, S.
Or more accurately “ ”, e, t, a, o, i, n, s. Remember that in ASCII upper and lower case letters are coded differently so we have to choose which we want. Since lowercase letters are more common than uppercase it makes sense to focus on them.
Now that we have our eight compression targets all we have to do is assign them one of our short codes, all of which are just the number 0 followed by some binary. Let’s go with this:
“ ” = 0000
“e” = 0001
“t” = 0010
“a” = 0011
“o” = 0100
“i” = 0101
“n” = 0110
“s” = 0111
Also remember that any letter not on this list will actually be expanded by putting a “1” in front of its binary representation.
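The code list and the “1 in front” rule fit into a few lines of Python. This is only a sketch under assumptions: `SHORT_CODES` and `encode_letter` are names invented here, bits are shown as text strings of "0" and "1" rather than packed into real bytes, and the input is assumed to be a one-byte character.

```python
# The eight shortcut codes from the list above.
SHORT_CODES = {
    " ": "0000",
    "e": "0001",
    "t": "0010",
    "a": "0011",
    "o": "0100",
    "i": "0101",
    "n": "0110",
    "s": "0111",
}


def encode_letter(ch):
    """Return the 4-bit shortcut if there is one, else "1" + the 8-bit code."""
    if ch in SHORT_CODES:
        return SHORT_CODES[ch]
    return "1" + format(ord(ch), "08b")   # flag bit glued to the full byte


print(encode_letter("e"))   # 0001
print(encode_letter("E"))   # 101000101 (capital letters get no shortcut)
```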
One Makes You Smaller
Whew! That was a lot of abstract thinking but believe it or not we now have a complete compression algorithm. And just to prove it we’re going to do a compression by hand.
But what should we practice on? Well, our theme is “Wonderland” and I seem to recall that Alice was able to shrink herself by fooling around with a bottle labeled “drink me”. In ASCII that looks like this:
d | r | i | n | k | “ ” | m | e |
01100100 | 01110010 | 01101001 | 01101110 | 01101011 | 00100000 | 01101101 | 01100101 |
Eight bits times eight letters means 64 bits total. But if we replace the space, ‘i’, ‘e’, and ‘n’ with our 4-bit shortcuts while adding a 1 flag in front of the remaining 8-bit (soon to be 9-bit) letters we get
d | r | i | n | k | “ ” | m | e |
101100100 | 101110010 | 0101 | 0110 | 101101011 | 0000 | 101101101 | 0001 |
Which is 9*4 + 4*4 bits long for a total of only 52 bits. So we saved ourselves 12 bits, which is almost 20% less space than the original. Not bad.
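If you would rather not trust hand arithmetic, the same count can be checked with a short Python sketch. The names `SHORT_LETTERS` and `compress` are invented for this illustration, and the “compressed file” is just a text string of "0" and "1" characters rather than real packed bits.

```python
SHORT_LETTERS = " etaoins"   # position in this string is the 3-bit number after the 0


def compress(text):
    """Return the compressed text as a string of "0" and "1" characters."""
    pieces = []
    for ch in text:
        spot = SHORT_LETTERS.find(ch)
        if spot != -1:
            pieces.append(format(spot, "04b"))           # 0 flag + 3-bit number
        else:
            pieces.append("1" + format(ord(ch), "08b"))  # 1 flag + full byte
    return "".join(pieces)


bits = compress("drink me")
original = 8 * len("drink me")
print(len(bits), "bits instead of", original)            # 52 bits instead of 64
print(f"{(original - len(bits)) / original:.2%} saved")  # 18.75% saved
```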
One Makes You Grow Taller
Of course, taking text and compressing it is pretty useless unless we also know how to take compressed text and expand it back into normal readable ASCII. So please take a look at the following bit sequence and see if you can figure out what it used to say:
00010011001000001011011010001
I don’t want anybody accidentally looking ahead so let’s push the answer down a page or so with another random comic.
So the first thing to do here is to take that big messy data stream and split it into individual letters. Remember that according to our rules the length of each letter is determined by whether it starts with a 0 or a 1. The 0s are 4-bit letters and the 1s are 9-bit letters.
So here we go. 00010011001000001011011010001 starts with 0 so the first letter must be four bits long: 0001
After removing those four bits we’re left with 0011001000001011011010001, which also starts with a 0 meaning our next letter is also four bits long: 0011
The next two letters work the same way and give us 0010 and 0000. Removing them leaves us with 1011011010001. Since that starts with a 1 that means our next letter is 9 bits long: 101101101
That leaves just 0001, one last four-bit letter, which gives us six letters in total:
0001 | 0011 | 0010 | 0000 | 101101101 | 0001 |
Now that we have our individual letters it’s time to turn them into, well, letters. But the kind of letters people can read.
For the four bit letters we just reference the list of short codes we made up. Scroll up in the likely event that you neglected to commit them to long term memory.
0001 | 0011 | 0010 | 0000 | 101101101 | 0001 |
e | a | t | “ ” | ? | e |
For the nine bit letters we need to remove the leading 1 and then look up the remaining eight bit code in the official ASCII chart. For instance 101101101 becomes 01101101 which is the code for “m”.
0001 | 0011 | 0010 | 0000 | 101101101 | 0001 |
e | a | t | “ ” | m | e |
And there we go, the compressed binary has successfully been turned back into human readable data.
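The split-then-look-up routine is mechanical enough to sketch in a few lines of Python. Treat this as an illustration under assumptions rather than the program this series will end up with: the bits are held as a text string of "0" and "1" characters, `expand` and `SHORT_LETTERS` are names invented for the sketch, and there is no error checking for damaged or cut-off input. As a test it is fed the 52-bit compressed “drink me” from the previous section.

```python
SHORT_LETTERS = " etaoins"   # the eight shortcut letters, in code order


def expand(bits):
    """Turn a string of "0"/"1" characters back into readable text."""
    letters = []
    i = 0
    while i < len(bits):
        if bits[i] == "0":
            # 4-bit letter: the number after the 0 flag picks the shortcut
            letters.append(SHORT_LETTERS[int(bits[i:i + 4], 2)])
            i += 4
        else:
            # 9-bit letter: drop the 1 flag, the other 8 bits are plain ASCII
            letters.append(chr(int(bits[i + 1:i + 9], 2)))
            i += 9
    return "".join(letters)


drink_me = "1011001001011100100101011010110101100001011011010001"
print(expand(drink_me))   # drink me
```

Note that the loop never needs to know the letter boundaries in advance. The first bit of each letter tells it how far to jump, which is the whole point of the 0 and 1 name-tags.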
I Don’t Like Pretending To Be A Compression Algorithm
Doing these examples by hand was a great way to prove our proposed compression algorithm actually works but I don’t think any of us want to try and compress an entire book or even an entire email by hand. It would be much better to teach the computer how to do this all for us. Which is exactly what we’re going to start working on next time.
After all, this is a Let’s Program, not a Let’s Spend A Small Eternity Doing Mathematical Busywork.
* Yes, nibble is the actual official term for half a byte. Programmers are weird like that.