Last time we invented and tested a super-simple compression algorithm that revolves around replacing the eight most common symbols in the English language with tiny 4-bit codes instead of their normal 8-bit representations (at the cost of replacing everything else with bigger 9-bit versions). We even did some examples by hand.
But this is a coding blog, so it’s time to write some actual code. When it comes to language choice our only real requirement is that the language be capable of working directly with bits, bytes and files. Pretty much every language ever has no problem doing that, so we’re free to choose whatever we want.
I’m personally going to choose Lisp and be keeping all my code in a file named “wrc.lisp” for “White Rabbit Compression”. You, of course, can follow along in whatever language you want. In fact, that would probably be the most educational approach to this series.
Anyways, like we talked about in the design phase, this project will mostly focus on processing lists of bits in batches of 8, 4 and 9. That means we’re going to need a convenient data structure for holding these lists.
Fortunately Lisp is built entirely around list processing and has some very powerful built in tools for creating and managing lists of numbers, so I will be representing our binary fragments as a plain old list of integers which will just so happen to always be either a 0 or a 1.
For those of you not in Lisp I bet your language has its own list structures. Could be a vector, a queue, a linked list, whatever. Anything that lets you add new stuff to the end and pull old stuff off the front will be fine.
Now the efficiency buffs in the audience might have noticed that we’re wasting space by using entire integers to keep track of mere bits. Shouldn’t we be using something else, like a bit vector?
Probably! But this is just the proof of concept first pass so doing things the easy way is more important than doing things the best way. If it turns out the program is too slow or memory hungry then we’ll revisit this decision.
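For the curious, here is roughly what the bit vector road not taken could look like in Common Lisp. This is only a sketch: the names *letter-a-as-list*, *letter-a-as-bits*, second-bit and bits-to-list are made up for this illustration, and nothing else in this series depends on them.

```lisp
;; A sketch of the bit-vector alternative (not used in wrc.lisp).
;; All names here are hypothetical and exist only for this example.
(defparameter *letter-a-as-list* (list 0 1 0 0 0 0 0 1))

;; COERCE turns a list of 0s and 1s into a compact bit vector.
(defparameter *letter-a-as-bits* (coerce *letter-a-as-list* 'bit-vector))

;; Individual bits are read with BIT, counting from position 0.
(defun second-bit (bits)
  (bit bits 1))

;; Converting back gives us the familiar list representation again.
(defun bits-to-list (bits)
  (coerce bits 'list))
```

The trade-off is that bit vectors are a fixed-size array type, so the cheap push and pop we get with lists would need a little more bookkeeping.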
With that in mind our first attempt at letter bit lists is going to look something like this:
; An 8-bit ASCII letter
(list 0 1 0 0 0 0 0 1)

; One of our 4-bit compressed letters
(list 0 0 1 1)

; One of our 9-bit labeled ASCII letters
(list 1 0 1 0 0 0 0 0 1)
Next up let’s write some helper functions that can turn real 8-bit bytes into our custom bit lists or turn our bit lists back into bytes.
; Add up the place value of every bit that is set
(defun 8-bit-list-to-byte (bitlist)
  (+ (* 128 (first bitlist))
     (* 64 (second bitlist))
     (* 32 (third bitlist))
     (* 16 (fourth bitlist))
     (* 8 (fifth bitlist))
     (* 4 (sixth bitlist))
     (* 2 (seventh bitlist))
     (* 1 (eighth bitlist))))

; Peel off one place value at a time, biggest first
(defun byte-to-8-bit-list (byte)
  (let ((bitlist nil))
    (if (>= byte 128)
        (progn (push 1 bitlist) (setf byte (- byte 128)))
        (push 0 bitlist))
    (if (>= byte 64)
        (progn (push 1 bitlist) (setf byte (- byte 64)))
        (push 0 bitlist))
    (if (>= byte 32)
        (progn (push 1 bitlist) (setf byte (- byte 32)))
        (push 0 bitlist))
    (if (>= byte 16)
        (progn (push 1 bitlist) (setf byte (- byte 16)))
        (push 0 bitlist))
    (if (>= byte 8)
        (progn (push 1 bitlist) (setf byte (- byte 8)))
        (push 0 bitlist))
    (if (>= byte 4)
        (progn (push 1 bitlist) (setf byte (- byte 4)))
        (push 0 bitlist))
    (if (>= byte 2)
        (progn (push 1 bitlist) (setf byte (- byte 2)))
        (push 0 bitlist))
    (if (>= byte 1)
        (progn (push 1 bitlist) (setf byte (- byte 1)))
        (push 0 bitlist))
    ; push builds the list back to front, so flip it before returning
    (nreverse bitlist)))
The logic here is pretty simple although the Lisp syntax can be a bit weird. I think 8-bit-list-to-byte is self-explanatory but byte-to-8-bit-list might need some explaining. The basic idea is that every bit in a byte has a specific value: 128, 64, 32, 16, 8, 4, 2 or 1. Because of how binary works any number that is 128 or greater must have the first bit set, any number smaller than 128 but at least 64 must have the second bit set and so on.
So we can turn a byte into a list by first checking if the number is 128 or bigger. If not we put a 0 into our list and move on. But if it is we put a 1 into our list and then subtract 128 before working with the remainder of the number. Then the next if statement checks if the remainder is 64 or bigger and so on. Since push adds each new bit to the front of the list, the bits come out backwards and we finish with nreverse to flip them into the right order. The only Lispy trick here is that Lisp if statements by default can only contain two lines of logic: the first for when the if is true and the second for when the if is false. Since we want to run two lines of logic when things are true we wrap them in a progn, which just lumps multiple lines of code into one unit.
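If the eight nearly identical if statements feel repetitive, the same idea can also be written as a loop. The sketch below is an alternative, not what wrc.lisp uses; the names byte-to-8-bit-list-loop and 8-bit-list-to-byte-reduce are invented here so they don’t clash with the real helpers. It leans on the standard logbitp function, which tests whether a given bit of an integer is set.

```lisp
;; Alternative sketch: walk the bit positions from 7 (value 128) down to 0 (value 1).
;; (logbitp position byte) is true when that bit of BYTE is set.
(defun byte-to-8-bit-list-loop (byte)
  (loop for position from 7 downto 0
        collect (if (logbitp position byte) 1 0)))

;; Going the other way: double the running total and add each new bit,
;; reading the list from left to right.
(defun 8-bit-list-to-byte-reduce (bitlist)
  (reduce (lambda (total bit) (+ (* 2 total) bit))
          bitlist
          :initial-value 0))
```

Both versions should agree with the long-hand helpers for any byte from 0 to 255, so pick whichever one you find easier to read.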
Clear as mud? Let’s move on then.
With those generic helpers out of the way let’s move on to writing the core of our compressor: A function that can accept a byte and then transform it into either a 4-bit short code or a 9-bit extended code.
The first thing we’ll want is a single place in the code where we can keep the “official” list of which letters translate to which short codes. It’s important we have only one copy of this list because later on we might find out we need to change it and updating one list is a lot easier than hunting through our code for half a dozen different lists.
(defparameter *compression-map*
  '((32  (0 0 0 0))    ; space
    (101 (0 0 0 1))    ; e
    (116 (0 0 1 0))    ; t
    (97  (0 0 1 1))    ; a
    (111 (0 1 0 0))    ; o
    (105 (0 1 0 1))    ; i
    (110 (0 1 1 0))    ; n
    (115 (0 1 1 1))))  ; s

Here I’m using the ' shortcut to create a list of lists since (list (list 32 (list 0 0 0 0))…) would get really tiring to type really fast. Syntax aside, the first item in each of the eight entries in our list is the ASCII code for the symbol we want to compress (32 is the space character) and the second item is the bit list we want it to compress to.
Now that we’ve got our compression map in whatever format you prefer all we need is an easy way to get the data we need out of that map. During compression we want to be able to take a byte and find out its short code and during decompression we want to take a short code and find out which byte it used to be.
Off the top of my head there are a few ways to do this. One would be to just loop through the entire list every time we want to do a lookup, stopping when we find an item that starts with the byte we want or ends with the list we want (or returning false if we don’t find anything). The whole list is only eight items long so this isn’t as wasteful as it might sound.
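For the record, the loop-through-the-list approach is only a couple of lines in Lisp thanks to the built-in assoc and find functions. Here is a rough sketch using a trimmed-down two-entry map; *demo-map*, demo-byte-to-code and demo-code-to-byte are hypothetical names used only for this example.

```lisp
;; A trimmed-down, hypothetical stand-in for the full compression map.
(defparameter *demo-map*
  '((101 (0 0 0 1))    ; e
    (116 (0 0 1 0))))  ; t

;; Byte -> short code: ASSOC walks the list comparing the first item of each entry.
;; Returns NIL when the byte is not in the map.
(defun demo-byte-to-code (byte)
  (second (assoc byte *demo-map*)))

;; Short code -> byte: FIND compares the second item of each entry using EQUAL.
;; Returns NIL when the code is not in the map.
(defun demo-code-to-byte (code)
  (first (find code *demo-map* :key #'second :test #'equal)))
```

Calling (demo-byte-to-code 116) gives back (0 0 1 0), while a byte that isn’t in the map, such as 82, gives back NIL.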
An easier solution (depending on your language) might be to use our master list to create a pair of hash tables or dictionaries. As you probably know these are one way lookup structures that link “keys” to “values”. Give the hash a key and it will very efficiently tell you whether or not it has stored a value for it and what that value is; perfect for our needs. The only trick is that since they are only one way and we want to both compress and decompress our data we’ll actually need two hashes that mirror each other. One would use the bytes as keys and reference the bit-lists. The other would use the bit-lists as keys and reference the bytes.
I think I’ll go with the hash approach since they are a more universal language feature than Lisp’s particular approach to list parsing.
To translate my universal compression mapping into a pair of usable hash tables I’m going to first create the hash tables as global variables and then I’m going to use a simple Lisp loop to step through every pair in the compression map. During each step of the loop we will insert the data from one pair into each hash.
(defparameter *byte-to-list-compression-hash* (make-hash-table))
(defparameter *list-to-byte-compression-hash* (make-hash-table :test 'equal))

(loop for compression-pair in *compression-map*
      do (setf (gethash (car compression-pair) *byte-to-list-compression-hash*)
               (cadr compression-pair))
         (setf (gethash (cadr compression-pair) *list-to-byte-compression-hash*)
               (car compression-pair)))
A little Lisp trivia here for anybody following along in my language of choice:
1) The :test keyword lets you tell a hash how to compare keys. The default value works great with bytes but not so great with lists so I use :test 'equal to give the *list-to-byte-compression-hash* a more list friendly comparator.
2) To pull data out of a hash you use the gethash function, and to put data in you wrap that same gethash call in a setf. The first argument is a key value, the second argument is the hash you want to use. I tend to get this backwards a lot :-(
3) “car” and “cdr” and combinations like “cadr” are all old fashioned functions for grabbing different parts of lists. I could have just as easily used “first” and “second” but what’s the fun in that?
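To see why the comparator matters, here is a small experiment you can paste into a Lisp prompt. It is just a sketch: *eql-demo-hash*, *equal-demo-hash* and lookup-with-flag are throwaway names invented for this demo and are not part of wrc.lisp.

```lisp
;; Hypothetical demo tables, separate from the real compression hashes.
(defparameter *eql-demo-hash* (make-hash-table))                ; default test is EQL
(defparameter *equal-demo-hash* (make-hash-table :test 'equal)) ; compares list contents

;; Store the same short code -> byte pair in both tables.
(setf (gethash (list 0 0 0 1) *eql-demo-hash*) 101)
(setf (gethash (list 0 0 0 1) *equal-demo-hash*) 101)

;; GETHASH returns two values: the stored value and a flag saying whether
;; the key was found. This helper gathers both into a list so we can see them.
(defun lookup-with-flag (key table)
  (multiple-value-list (gethash key table)))
```

Looking up a freshly built (list 0 0 0 1) in *equal-demo-hash* finds 101. The same lookup in *eql-demo-hash* comes back empty, because the default test only matches the exact same list object and not a new list with the same contents.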
Lisp aside, we now have an official compression map and two easy to search hashes for doing compression lookups and decompression reverse lookups. Let’s put them to good use by actually writing a compression function!
The compression function should take an ASCII byte and check if it’s in the compression map. If it is it will return the proper 4-bit short code. If it isn’t in the map it should transform it to an 8-bit list, glue on a leading 1 and then return the new 9-bit list.
(defun compress-byte (byte-to-compress)
  (let ((short-code (gethash byte-to-compress *byte-to-list-compression-hash*)))
    (if short-code
        short-code
        (append '(1) (byte-to-8-bit-list byte-to-compress)))))
Nothing all that clever here. We take the byte-to-compress and look up whatever value it has in the *byte-to-list-compression-hash*. We then use let to store that result in a local short-code variable. We then use a simple if statement to see whether short-code has an actual value (meaning we found a compression) or if it is empty (meaning that byte doesn’t compress). If it has a compressed value we just return it. Otherwise we take the original byte-to-compress, turn it into a bit list, glue a 1 to the front and return it.
Let’s see it all in action:
[1]> (load "wrc.lisp")
;; Loading file wrc.lisp …
;; Loaded file wrc.lisp
T
[2]> (compress-byte 111)
(0 1 0 0)
[3]> (compress-byte 82)
(1 0 1 0 1 0 0 1 0)
Cool. It successfully found 111 (“o”) in our compression map and shrunk it from eight bits down to four. It also noticed that 82 (“R”) was not in our map and so returned the full 8 bits with a ninth “1” marker glued to the front.
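To get a feel for what this means for a whole word, here is a rough, self-contained sketch that adds up how many bits a string would take after compression. Everything in it is a stand-in invented for this example: *sketch-map* holds only the seven letter entries from the compression map, sketch-compress-byte is a compact imitation of compress-byte, and compressed-bit-count is a hypothetical helper. It also assumes char-code returns ASCII codes for plain English letters, which is true of the Lisp implementations I know of but isn’t strictly promised by the language.

```lisp
;; Hypothetical stand-in for the compression map: letter entries only.
(defparameter *sketch-map*
  '((101 (0 0 0 1)) (116 (0 0 1 0)) (97 (0 0 1 1)) (111 (0 1 0 0))
    (105 (0 1 0 1)) (110 (0 1 1 0)) (115 (0 1 1 1))))

;; Compact imitation of compress-byte: return the 4-bit short code if there
;; is one, otherwise a 1 marker followed by the byte's eight bits.
(defun sketch-compress-byte (byte)
  (or (second (assoc byte *sketch-map*))
      (cons 1 (loop for position from 7 downto 0
                    collect (if (logbitp position byte) 1 0)))))

;; Add up the length of the bit list produced for every character in TEXT.
;; Assumes char-code yields ASCII codes for these characters.
(defun compressed-bit-count (text)
  (loop for character across text
        sum (length (sketch-compress-byte (char-code character)))))
```

A word made entirely of common letters, like "note", comes out at 16 bits instead of the usual 32. A word like "Rabbit" has three uncommon letters that cost 9 bits each, so it comes out at 39 bits instead of 48.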
Next time we’ll start looking into how to use this function to compress and save an actual file.