Now that we have a working compression function all that’s left is to grab a file, read it byte by byte, compress the bytes and then output them into a new file. Easy peasy!
Let’s start with reading a file byte by byte. This is one particular topic that depends heavily on your language, so if you’re not using Lisp you’re on your own for this step. What you’re looking for is a way to read a file one byte at a time, which might be tricky since by default most languages try to read text files one line at a time. In general you need to use some specific function or argument to force things into byte mode.
For example, compare these two Lisp functions, both run on a file called “drinkme.txt” which contains nothing but the words “drink me”:
(defun normal-read-test (filename) (let ((in (open filename))) (when in (loop for line = (read-line in nil) while line do (print line)) (close in)))) (defun byte-read-test (filename) (let ((in (open filename :element-type '(unsigned-byte 8)))) (when in (loop for testbyte = (read-byte in nil) while testbyte do (print testbyte)) (close in))))
> (normal-read-test “drinkme.txt”)
> (byte-read-test “drinkme.txt”)
Notice that in order to get actual bytes we not only had to use “read-byte” instead of “read-line”, we also had to open the file with the :element type ‘(unsinged-byte) modifier. Your language probably has similar options.
Hopefully, one way or another, you now know how to grab all the bytes in the file you want to compress. With that in place we can use the compress-byte function we wrote last time to change a file into a series of compressed bit lists.
(defun file-to-compressed-bit-lists (filename) (let ((bit-lists '()) (in (open filename :element-type '(unsigned-byte 8)))) (when in (loop for testbyte = (read-byte in nil) while testbyte do (setf bit-lists (append bit-lists (list (compress-byte testbyte))))) (close in)) bit-lists))
> (file-to-compressed-bit-lists “drinkme.txt”)
((1 0 1 1 0 0 1 0 0) (1 0 1 1 1 0 0 1 0) (0 1 0 1) (0 1 1 0)
(1 0 1 1 0 1 0 1 1) (0 0 0 0) (1 0 1 1 0 1 1 0 1) (0 0 0 1)
(1 0 0 0 0 1 0 1 0) )
Well whaddaya know, that’s the exact same compression we calculated by hand in part 2. Good sign our code probably did things right!
Lisp trivia: append adds the items of one list to another list. So to add a single list to a list of lists you actually have to wrap that list inside another list, making it a one item list whose item happens to be a list. Otherwise we’d get something like ((0 0 1 1) (1 0 1 0) 1 0 0 1) instead of ((0 0 1 1) (1 0 1 0) (1 0 0 1)).
Trivia aside, now all that’s left is to write our new compressed data back to file.
Unfortunately our compressed data is the wrong “shape” for normal file output code.
Modern computer software is more or less designed around the idea that the smallest unit of data any sane person would ever want to work with is an entire 8-bit byte. This is almost always the correct assumption for something like 99% of computer programs and is the reason programming language come shipped with libraries for reading and writing bytes.
But sometimes you need to work outside of byte land. We certainly do. Our short letters are 4-bits long and our expanded letters are 9-bits long. Neither of those will work with standard byte-based file IO.
So our data is the wrong shape. Whatever shall we do?
The obvious solution is to make it the right shape and the easiest way to do this is to take our compressed sequence of 4-bit and 9-bit letters, glue them together and then split them apart at new 8-bit boundaries. As long as the end result has the same string of 0s and 1s it doesn’t really matter how they’re grouped.
So a compressed letter list like this
0001 0011 111110100 0000 101101101 0001
Becomes a bit list like this
Which we can then turn into
00010011 11111010 00000101 10110100 01
Now we have four easy to read and write bytes plus two dangling bits. Let’s fix that by adding in some extra 0s for padding.
00010011 11111010 00000101 10110100 01000000
Looking good. Well, mostly. Those padded zeros might be a problem when we try to decompress the file. The program is going to look at those and think there’s an extra 4-bit letter there at the end.
A quick fix for this would be to have every compressed file end with a specific bit sequence. Then when the decompresser sees that sequence it knows it can stop because everything after that point is just padding.
But what symbol should we use? It can’t be one of our 4-bit codes because we need those for making the compression work. How about a really uncommon ASCII character instead? For instance, 00000000 is “null” and shouldn’t ever appear in a normal text file. That means we can use the 9-bit 100000000 code to mean “end of the compressed text” without worrying about accidentally conflicting with a real 100000000 from the text.
So our new terminated compressed message looks like:
0001 0011 111110100 0000 101101101 0001 100000000
And when cut into bytes and padded out it looks like this (termination sequence underlined and padding comes after the |)
00010011 11111010 00000101 10110100 01100000 000|00000
Some of you probably noticed that between the termination sequence and the padding our supposedly compressed message is now just as long as it would have been had we just saved it as six normal byte characters. This is because our example was so short. Adding an extra byte to the end of a five byte message is obviously a huge change, but when we start working with actual multi-kilobyte text files the impact of padding and termination will be negligible.
Negligible is kind of a funny word when your write it down. I think it’s the two “g”s. gligib. egligibl. Weird, right?
Back to the subject at hand, let’s write some code for doing an actual compression.
Now my first idea was to take the bit lists from our function, strip them out of their lists and then dump them all into a new list. But as I mentioned in the trivia section I actually had to go to quite a bit of work to get the function to create a list of lists instead of one big list of bits. So I can actually just remove a few list calls from our existing function and create a new one that naturally returns one big list of bits. A final append to add our (1 0 0 0 0 0 0 0 0) termination symbol and it’s done!
(defun file-to-compressed-bitlist (filename) (let ((bit-lists '()) (in (open filename :element-type '(unsigned-byte 8)))) (when in (loop for testbyte = (read-byte in nil) while testbyte do (setf bit-lists (append bit-lists (compress-byte testbyte)))) (close in)) (append bit-lists '(1 0 0 0 0 0 0 0 0))))
> (file-to-bitlist “drinkme.txt”)
(1 0 1 1 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 0 0 0 0
1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0)
Now to pad it out and make sure it’s evenly divisible into 8-bit bytes.
We can figure out how many dangling bits there are like this:
> (mod (length (file-to-bitlist “drinkme.txt”)) 8)
If there are 6 extra bits then we need 8 – 6 = 2 bits of padding to fill it out to an even byte length. We can generate this padding like this:
> (loop repeat 2 collect 0)
So if we mix that all into one big function:
(defun compress-terminate-and-pad-file (filename) (let* ((bitlist (file-to-compressed-bitlist filename)) (padding (loop repeat (- 8 (mod (length bitlist) 8)) collect 0))) (append bitlist padding)))
> (compress-terminate-and-pad-file “drinkme.txt”)
(1 0 1 1 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 0 0 0 0
1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0)
> (length (compress-terminate-and-pad-file “drinkme.txt”))
For the finishing touch we pop open a byte output file, read through our fully prepared list 8-bits at a time and VOILA we have file compression.
(defun white-rabbit-compress-file (input-filename output-filename) (let ((bitlist (compress-terminate-and-pad-file input-filename)) (out (open output-filename :direction :output :element-type '(unsigned-byte 8)))) (when out (loop while (> (length bitlist) 0) do (write-byte (8-bit-list-to-byte (subseq bitlist 0 8)) out) (setf bitlist (subseq bitlist 8)))) (close out)))
Nothing strange here. As long as there are bits left in the bitlist we take the first eight bits, convert them into a byte, write that byte to file and then drop those bits from the list (by setting the list equal to a subsequence starting at place 8 and continuing to the end of the list). Loop and repeat until the entire file is done.
Now let’s give it a whirl. Run a command like (white-rabbit-compress-file “drinkme.txt” “compresseddrinkme.txt”) and then look at the compressed file with some sort of hex editor. You will find, once again, the exact same sequence of ones and zeros we calculated by hand.
But turning a 9 byte file into a different 9 byte file isn’t much of a compression, so instead let’s grab the entire first paragraph from “Alice in Wonderland” and save it to a file called “1stparagraph.txt”.
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’
Run that through our compression and then look at the file sizes of the original versus the compressed. For me the original text file was 303 bytes, but the compressed version was only 216.
One word of warning though: This is a horribly slow and inefficient Lisp program that will crawl to a halt on any file more than a few kilobytes in length. There are several easy ways to speed it up but I’m not going to talk about those until the end of this LP because 1) Speed doesn’t matter during proof of concept 2) Optimizing Lisp code isn’t the main focus of this project so let’s leave it until last for the handful of people who care.
So now that we have all happily agreed to not mention how horrible this particular implementation is we can move on to part two of the project: Decompressing the files we just compressed. After all, if you can’t decompress your files you might as well just delete them and achieve 100% compression.