Let’s Program Some Weird Markov Chains Part 3: Tell Me A Story

So last time we wrote some basic tools for creating Markov Models from a text file. We just send some sentences to the program and it takes them apart in order to figure out what words follow what other words and how often.

But can we do anything interesting with this other than collect obscure trivia on the writing style of, say, Herman Melville?

Of course we can!

The most obvious and amusing thing you can do with a text based Markov Model is use it to chain words together into a sentence. This is pretty easy to do since we already wrote a function that can take any word in our model and randomly return a possible next word.

So if we use that function with “sentencestart” we will randomly get back one of the words that the program found at the start of a sentence. If we then use the function on that first word we will randomly get back a second word. And that second word will lead to a third word until at some point we randomly get “sentenceend”.

The code for that is super straightforward. Just add something like this to the end of the program you wrote for creating the Markov model in the first place:

currentWord = 'sentencestart'

stopWord = 'sentenceend'

message = []

while currentWord != stopWord:
    currentWord = getRandomFromProbabilityList(markovChain[currentWord])
    if(currentWord != stopWord):
        message.append(currentWord)

By the end of this while loop we will have an array holding a complete sentence.

The only real trick here is that you want to make sure that your sentence array does not actually include “sentencestart” or “sentenceend”. That’s why we don’t start adding items to the message array until after the first call to getRandomFromProbabilityList and why we make sure the loop ends as soon as “sentenceend” shows up and before it can wind up in the output array.

You can now sort of see a sentence by looking at the message array, but for convenience you probably want to join all the individual words together with spaces to make a single line of output. In Python you do it like this:

print(' '.join(message))

Now if you run the code and feed it our mini text file about apples you will get random output like “I have another apple pie”. You’ll notice that this sentence doesn’t exist in the original sample but that every word pair (“I have”, “have another”, “another apple”, “apple pie”) does.

Great! Now let’s try this on something much, much larger: the complete text of Moby Dick as found on Project Gutenberg. Please note that their files are stored in UTF-8 instead of ASCII format, so depending on what language you’re following along in you might need to change how you read files in order to avoid crashing the first time you run across a strange symbol.

In Python that means we need to update our read line to this:

with open(sys.argv[1], encoding="utf8") as myfile:
    data = myfile.read()

And with that we can now summon the digital ghost of Herman Melville! Just grab a public domain copy of Moby Dick and feed it into your program:

“how now,” whispered something like a sentry-box with one matter from the next movement about a mariner’s needle, is, that, parsee:—a hearse can endure; at such prodigies as if the whale-fishery contains it is this arm

Hmmm… Ok? Those are sure some words, at least. Let’s try again.

cook! ho, ho! there’s no small gold-fish has footed it beast, boat, carrying his pupils of real are of the flashing upon the people of his shrill voice, “drinking hot fire burning martyr, or two, for it

Maybe that’s a little better?

Alright, let’s be honest. This is all pure nonsense. Real books have complex sentences that are hard to mimic by looking at them just one word at a time. That’s why we’re going to need to build a better Markov chain based off of looking at multiple words.

It is important to note that even when working with multi-word chains we sometimes still need a way to look at only one word at a time. For instance, when trying to figure out how to start a sentence we won’t have multiple words to look at. All we’ll have is “sentencestart”, so we either need to build a 1 deep model to go along with our 2 deep model or we need some way to make a 2 deep sentence starting symbol.

Since we already have a 1 deep model we’ll just keep that, and the whole algorithm that builds it, and then add a 2 deep model after it.

markovChain2Deep = {}

for i in range(len(tokenList) - 2):
    if tokenList[i] not in markovChain2Deep:
        markovChain2Deep[tokenList[i]] = {}
    if tokenList[i+1] not in markovChain2Deep[tokenList[i]]:
        markovChain2Deep[tokenList[i]][tokenList[i+1]] = {}

    if tokenList[i+2] not in markovChain2Deep[tokenList[i]][tokenList[i+1]]:
        markovChain2Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]] = 1
    else:
        markovChain2Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]] += 1

Not hard to write, just a little tedious. Instead of grabbing every pair of words we grab every triplet of words in our text and then build a dictionary of first words that link to a dictionary of second words that link to a final dictionary keeping track of how often that triplet happened.

With that in place we can improve our message generating code to work a little bit better. Update it like this:

currentWord = 'sentencestart'

nextWord = getRandomFromProbabilityList(markovChain[currentWord])

stopWord = 'sentenceend'

message = []

while nextWord != stopWord:
    tempWord = getRandomFromProbabilityList(markovChain2Deep[currentWord][nextWord])
    currentWord = nextWord
    nextWord = tempWord

    if(currentWord != stopWord):
        message.append(currentWord)

print(' '.join(message))

Now that we’re linking two deep we get things like:

i fancied that the frenchman had a creditor

That actually makes sense!

Of course it can still do strange things, like here where it got caught in an “oil, nor” loop for a while:

he cannot be olive oil, nor macassar oil, nor train oil, nor train oil, nor macassar oil, nor cod-liver oil

But we still get our fair share of nonsense:

but as the standard-bearer of this ship, and make the bed corner, slips out the smoke from the rim of the enemy defray the current expenses of the west and south

What if we boost it to be three deep? Maybe a little less nonsense?

markovChain3Deep = {}

for i in range(len(tokenList) - 3):
    if tokenList[i] not in markovChain3Deep:
        markovChain3Deep[tokenList[i]] = {}
    if tokenList[i+1] not in markovChain3Deep[tokenList[i]]:
        markovChain3Deep[tokenList[i]][tokenList[i+1]] = {}
    if tokenList[i+2] not in markovChain3Deep[tokenList[i]][tokenList[i+1]]:
        markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]] = {}

    if tokenList[i+3] not in markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]]:
        markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]][tokenList[i+3]] = 1
    else:
        markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]][tokenList[i+3]] += 1

And then replace our old message producing code:

currentWord = 'sentencestart'

nextWord1 = getRandomFromProbabilityList(markovChain[currentWord])

nextWord2 = getRandomFromProbabilityList(markovChain2Deep[currentWord][nextWord1])

stopWord = 'sentenceend'

message = []

while nextWord2 != stopWord:
    tempWord = getRandomFromProbabilityList(markovChain3Deep[currentWord][nextWord1][nextWord2])
    currentWord = nextWord1
    nextWord1 = nextWord2
    nextWord2 = tempWord
    message.append(currentWord)

    #Make sure we don't exit the loop without first recording what's left in the pipeline
    if(nextWord2 == stopWord):
        message.append(nextWord1)

print(' '.join(message))

Now what sort of sentences do we get?

besides, such is the endlessness, yea, the intolerableness of all earthly ills, and that on no account kick back; for you can’t help yourself, wise stubb

Not bad. Not that good either, but it does almost sound like an old sailor giving depressing life advice.

Let’s try again…

but though the lakeman had induced the seamen to dip their ship-biscuit into the huge oil-pots and let them fry there awhile

Hey! That one actually makes a lot of sense. Success!

Or is it? Compare it to this line from the original book:

In the long try watches of the night it is a common thing for the seamen to dip their ship-biscuit into the huge oil-pots and let them fry there awhile.

So you can see that even at only three words deep we’re already reaching the point where entire segments of the original book are being reproduced word for word.

For another example consider this generated sentence:

unspeakable thing called his “flurry,” the monster horribly wallowed in his blood, at last he partially disclosed a strangely discoloured bunch or protuberance, the size of the iron part of a hoe

To this original sentence:

Still rolling in his blood, at last he partially disclosed a strangely discoloured bunch or protuberance, the size of a bushel, low down on the flank.

Once again we’re just duplicating big blocks of text. That’s because there are quite a few three word phrases that only happen once in the entire book and once you wander into those you’re trapped.
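You can even watch this trap form with a toy count. The mini-corpus below is obviously made up rather than taken from Moby Dick, but it shows that even a sentence full of repeated words can end up with three-word contexts that each allow exactly one continuation:

```python
# Hypothetical mini-corpus, chosen so words repeat a lot
tokens = "the man saw the dog and the man fed the dog".split()

# For every three-word context, record the set of words that followed it
followers = {}
for i in range(len(tokens) - 3):
    context = tuple(tokens[i:i + 3])
    followers.setdefault(context, set()).add(tokens[i + 3])

# A "trap" is a context with only one possible next word
traps = [c for c, nextWords in followers.items() if len(nextWords) == 1]
print(len(traps), 'of', len(followers), 'contexts are one-way traps')
```

At two words deep the context ('the', 'man') still has two choices ('saw' or 'fed'), but at three words deep every single context here is forced, which is exactly the word-for-word copying problem we just saw.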

Also, wow, this book is actually pretty bloody. I should have done something like Pride and Prejudice instead…

collins and maria were gone on business into the village, when she was startled by a ring at the door, and elizabeth saw her sister’s countenance change as she read were scarcely to be defined

What a relief, no blood and gore. Although we have randomly generated a pretty good way to start a thriller. Who is at the door? What does Elizabeth’s sister know? What will Collins and Maria find when they finish their business and get back to the house?

So there you go. A complete, if mediocre, tool for tearing a story apart, analyzing word pattern frequency and then stitching things back together into sentences. By this point you should feel pretty comfortable with the basic ideas behind using Markov Models and understand the pros and cons of making them deep vs shallow.

But maybe you still want a little more practice? Well then, here’s some fun projects for you:

1) Writing deeper and deeper Markov Models is tedious. Figure out a way to automate that so you can do something like generateDeepModel(text, depth)

2) Looking at output we did a really bad job at removing punctuation. Fix that.

3) Our code also treats all periods as the end of a sentence, which is a problem for titles like Mr., Mrs., Dr., and so on. Figure out a way around that.
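These are meant as homework, but in case you get stuck here’s a rough sketch of one possible shape for exercises 1 and 3. Fair warning: generateDeepModel, splitSentences, the title list, and the <DOT> placeholder are all made up for this sketch rather than taken from the code above:

```python
import re

# Exercise 1 sketch: build an arbitrarily deep nested-dictionary model
def generateDeepModel(tokenList, depth):
    model = {}
    for i in range(len(tokenList) - depth):
        # Walk (and create as needed) the nested dictionaries for the
        # `depth` words of context
        node = model
        for j in range(depth):
            node = node.setdefault(tokenList[i + j], {})
        # The innermost dictionary maps the next word to its count
        nextWord = tokenList[i + depth]
        node[nextWord] = node.get(nextWord, 0) + 1
    return model

# Exercise 3 sketch: protect a short (and far from complete) list of
# titles before splitting on periods
def splitSentences(text):
    protected = re.sub(r'\b(Mr|Mrs|Dr|Ms)\.', r'\1<DOT>', text)
    return [s.strip().replace('<DOT>', '.')
            for s in protected.split('.') if s.strip()]

print(splitSentences("Mr. Smith met Mrs. Jones. They talked."))
# ['Mr. Smith met Mrs. Jones', 'They talked']
```

If it works, generateDeepModel(tokenList, 2) should rebuild the same structure as markovChain2Deep, and bumping the depth argument replaces all the copy-and-paste loops.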

Let’s Program Some Weird Markov Chains Part 2: Need More Data

Last time we learned that a Markov Chain is a way to model lists of events and objects by basically saying “If your current item/event is X then here is a list of items/events that might come next PLUS a list of how likely each choice is.”

This is a remarkably easy thing to program in pretty much any language.

All we really need is:

  1. A way to connect an input to a list of possible outputs
  2. A way to keep track of the probability of those outputs
  3. A way to randomly select one output from a list based on those probabilities

For this Let’s Program we’ll be using Python since it’s cross platform and already on all of my computers. Be warned that I’m not much of a Python expert so please don’t take this project as an example of how to write “proper” Python code.

Step 1, connecting inputs to possible outputs, is super easy because we can just use the dictionary/map/hashtable that is undoubtedly included in whatever language you’re using. These are data structures that let you link a “key” to a “value” of any sort. Observe:

>>> dictionaryTest = {'apple': 'pie', 'chocolate': 'ice cream'}

>>> dictionaryTest['apple']

'pie'

The trick here is that we don’t want to link a word to another word, we want to link a word to a set of possible next words.

Believe it or not we can actually represent that list as a dictionary as well, a dictionary where the keys are words we have seen and the value is how often we have seen them.

>>> dictionaryTest2 = {'apple': {'pie': 15, 'tart': 6}, 'chocolate': {'ice cream': 6, 'milk': 3}}

>>> dictionaryTest2['apple']

{'pie': 15, 'tart': 6}

Now all we need is a way to use those word counts to randomly choose a word.

Pulling Marbles Out Of Bags

The most obvious (but probably not most efficient) algorithm here is to cycle through the entire dictionary and count up the total number of word occurrences in it. Ex, if pie showed up 15 times and tart showed up 6 times the total is 21.

We then randomly generate a number between 0 and our total, then loop through the array a second time. For each word count we check if our random number is less than the count. If so we return that word. If not we decrease our random number by that word count and then try again with the next key value pair.

Example 1: We randomly generate “3”. That’s less than the “pie” count of 15 so we return “pie”.

Example 2: We randomly generate “19”. That’s higher than the “pie” count of 15 so we subtract 15 from 19 and now have “4”. We then move on to “tart” and see that “4” is less than the “tart” count of 6, so we return “tart”.

Let's see that in code! These experiments are getting a little long to just type into the command line so let's put this in a file with a convenient name like “markov.py”.

from random import randint

def getRandomFromProbabilityList(probabilityList):
    #First add up every count to get the total weight
    total = 0
    for item, count in probabilityList.items():
        total = total + count

    #Then pick a random number inside that total and walk the list a
    #second time to see which word it lands on
    chosenProbability = randint(0, total - 1)
    for item, count in probabilityList.items():
        if(chosenProbability < count):
            return item
        chosenProbability = chosenProbability - count

markovModel = {'apple': {'pie':15, 'tart':6}, 'chocolate': {'ice cream': 6, 'milk': 3}}

print(getRandomFromProbabilityList(markovModel['apple']))

Here you can see we brought in our mini food themed Markov Model and then used our getRandomFromProbabilityList function to randomly generate an appropriate next word for the starting word “apple”.

Inside the getRandomFromProbabilityList you should see the two “key value” for loops that we planned. The “key” is the word and the “value” is how often we saw it. The first loop ignores the keys and just focuses on adding up all the counts. The second loop then uses a randomly chosen number to choose a word key to return.

Running this several times in a row shows that we do indeed get semi-random answers for what word should come after “apple” and that we see ‘pie’ (count 15) a little more than twice as often as ‘tart’ (count 6), which is exactly what we want.
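If you’d like harder evidence than eyeballing a few runs, a quick sanity check is to call the function a few thousand times and tally the answers. The function is repeated here just so the snippet runs on its own:

```python
from random import randint
from collections import Counter

def getRandomFromProbabilityList(probabilityList):
    # Add up the counts, pick a random spot inside the total, then walk
    # the list to see which word that spot belongs to
    total = 0
    for item, count in probabilityList.items():
        total = total + count
    chosenProbability = randint(0, total - 1)
    for item, count in probabilityList.items():
        if chosenProbability < count:
            return item
        chosenProbability = chosenProbability - count

# With weights 15 and 6 out of 21, about 5/7 of 21,000 draws
# (roughly 15,000) should come back 'pie'
tally = Counter(getRandomFromProbabilityList({'pie': 15, 'tart': 6})
                for _ in range(21000))
print(tally)
```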

And that simple code really is the core of our Markov Chain engine. Obviously we’ll need to build a few other tools in order to do things like automatically build Markov Chains from raw data or to automatically generate interesting output from our input but the real logic is all right here. That function will help us link our chains together.

Let’s Data Mine Some Text

It didn’t take long at all to build some code for pulling probabilistic events out of our Markov Model. The problem though is that to test that code we had to build a Markov Model by hand. That’s the sort of thing that can get awfully boring awfully fast so let’s figure out a way to automatically build a Markov Model from existing text.

Of course, for debugging purposes it would be silly to jump straight into working with a massive book like Moby Dick. Instead let’s cut our teeth on this fascinating little piece of micro-fiction:

I have an apple. I have another apple. I have some apples. I will make them into a pie. It will be an apple pie. The apple pie will taste good. I like apple pie.

To turn this into a Markov Chain we need to extract all the word pairs and count how often each one happened, but before that we need to do a little pre-processing to make the text more computer friendly.

Specifically we want to make sure the Markov Chain knows when sentences start and end. Knowing when and how sentences start will help the chain create sentences with a natural sounding beginning while knowing how they end will give the program a way to gracefully finish off sentences instead of just looping through an infinite chain of “the man saw the dog saw the man saw the dog saw the…”

After we’ve figured out the start and stop info we’ll also want to remove all punctuation and capitalization. Otherwise our program might accidentally think “Apple”, “apple”, “apple,” and “apple.” are four different words.

After pre-processing our silly little sample story should look more like this:

sentencestart i have an apple sentenceend sentencestart i have another apple sentenceend sentencestart i have some apples sentenceend sentencestart i will make them into a pie sentenceend sentencestart it will be an apple pie sentenceend sentencestart the apple pie will taste good sentenceend sentencestart i like apple pie sentenceend

Obviously I did that by hand, but equally obvious is the fact that I don’t want to have to edit the entirety of War and Peace just to use it in our program so let’s figure out how we might write a program to pre-process our text for us.

Hmmm… as humans we keep track of the boundary between sentences by using periods. Maybe our code can do the same? Perhaps we can scan through the text looking for periods and then replace them all with symbols showing that one sentence has ended and another is about to start. We can use whatever symbols we want but should try to avoid anything that might show up in a real piece of text. Mushing words together to create sentenceend and sentencestart is how I did it, but you could use randomly generated strings like BBSDFDSstart and BBSDFDSend if you wanted to be extra careful.

As usual there are some edge cases, very literal ones here. The first word of any text should be sentencestart even though there is no period there. And the final period of a story should be replaced by just sentenceend with no following sentencestart.

Fortunately the code for this is pretty easy thanks to regular expressions:

import re #Regular expressions
import sys #Lets us get command line arguments

#Get the first argument to the program and use it as a filename
with open(sys.argv[1], 'r') as myfile:
    data = myfile.read()

#Text should start with sentencestart
text = 'sentencestart ' + data

#Replace all periods with 'sentenceend sentencestart'
text = re.sub(r'\.', ' sentenceend sentencestart', text)

#Remove the unnecessary sentencestart from the last period in the file
text = re.sub('sentencestart$', '', text)

#Make everything lowercase
text = text.lower()

#Remove everything that isn't a letter, number, or whitespace
text = re.sub(r'[^a-zA-Z\d\s]', '', text)

#Split the pre-processed text in to an array of words
tokenList = text.split()

#Double check that the pre-processed array turned out right
print(tokenList)

Once we have our list split into a nice array it’s pretty easy to walk through that array and add every word pair to our Markov Model.

markovChain = {}

for i in range(len(tokenList) - 1):
    if tokenList[i] not in markovChain:
        markovChain[tokenList[i]] = {}

    if tokenList[i+1] not in markovChain[tokenList[i]]:
        markovChain[tokenList[i]][tokenList[i+1]] = 1
    else:
        markovChain[tokenList[i]][tokenList[i+1]] += 1


Pretty simple. We index through the array of tokens. At every step the word at the current index i is treated as the first word of our pair and the word at i + 1 is treated as the second word in the pair. We can use this to build our nested dictionary of dictionaries. The end result will eventually work sort of like this: markovChain[firstWord][secondWord] = count.

The only trick here is that you have to make sure you end the loop on the second to last token, not the last token. This is because the last token doesn’t have a “next” token so we can’t build a pair. And, depending on your language, trying to will probably throw an error.

And with that we have automatically turned a text file into a Markov Chain full of interesting facts like:

  • 5 sentences started with ‘I’
  • The word “apple” was followed by the word “pie” twice
  • The word “will” can be followed by “be”, “make” or “taste”

Great! We now have a tool that can help us build large Markov Models with significant amounts of real data. Tune in next time to see what interesting stuff we can do with that sort of information.

Let’s Program Some Weird Markov Chains Part 1: We’re Making A What?

Markov Chains are, in my personal opinion, one of the niftiest things to be found in the computer science bag of tricks. They are easy to understand and simple to program but still powerful enough to do some really cool stuff.

So what is a Markov Chain?

A Markov Chain is a way to model or simulate any sort of list of things or events where the next item in the list is influenced by the item that came before.

One of the best examples is human language. Here, try to fill in the blank: “He saw a bright red _____”.

You probably came up with something like “a bright red apple” or “a bright red car”.

You might have come up with something like “a bright red star” or “a bright red dragon”.

You probably didn’t think anything obviously wrong like “a bright red bluebird” or “a bright red emerald”.

And you definitely didn’t think of anything grammatically nonsensical like “a bright red sleeping” or “a bright red quickly”.

See? All I had to do was tell you the start of a sentence and your years of experience with human language let you instinctively guess what sort of words probably would or wouldn’t come next.

A Markov Chain is just a way to formally define this idea of “what will probably come next”.

For instance, a Markov Chain describing a family cookbook might include mathematical facts like:

  • The word “red” is followed by the word “sauce” 50% of the time, the word “apple” 25% of the time and the word “pepper” 25% of the time.
  • The phrase “preheat the oven to” is followed by “350 degrees” 80% of the time.
  • The phrase “2 cups of” is followed by the word “water” 75% of the time, the word “milk” 20% of the time and the word “butter” 5% of the time.
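Jumping ahead a bit, facts like these are easy to write down as a nested dictionary, which is the same representation the hands-on parts of this series use. The numbers below are just the made-up percentages from the list above, not real measurements:

```python
# Each outer key is a current word or phrase; each inner dictionary maps
# a possible next word to its (made-up) weight
cookbookChain = {
    'red': {'sauce': 50, 'apple': 25, 'pepper': 25},
    '2 cups of': {'water': 75, 'milk': 20, 'butter': 5},
}

print(cookbookChain['red'])  # {'sauce': 50, 'apple': 25, 'pepper': 25}
```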

Great. We now know what words tend to follow other words. That’s interesting trivia, but what can we do with it?

What we can do is simulate new cook book instructions by chaining our rules together. That’s why they’re called Markov Chains and not Markov Factoids.

Example: if you know the word “the” is sometimes followed by “red” and that “red” is sometimes followed by “apple” and that “apple” is sometimes followed by “tastes” and that “tastes” is sometimes followed by “good” then you can logically chain these rules together and create the full sentence “The red apple tastes good”.

That Seems Too Convenient. What’s the catch?

Sadly, not every chain makes sense.

Consider the sentence “Sally ate red apple pie into pieces of oil”.

Every individual step in this sentence makes sense. “Sally ate” makes sense. So does “ate red” and “red apple” and “apple pie”. And there’s nothing wrong with “pie into” or “into pieces”. And “of oil” is something you’d expect to see a lot of in a cook book.

It’s only when you link all these individually valid chains together that you wind up with nonsense. So while our cooking book Markov Chain has the potential to put together consistent sounding cooking instructions it also has the potential to spout complete gibberish.

Oh No! Can We Make The Markov Chains Smarter?

The reason our imaginary Markov Chain generated nonsense is because it is only smart enough to look one word ahead or behind at a time, but in English you often need two or three or more words of context for things to make sense.

That suggests we could improve the quality of our Markov Chain by getting rid of “word A leads to either word B or word C” and instead upgrade to “phrase A B C leads to either word X or word Y”.
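As a side note, deeply nested dictionaries aren’t the only way to store that extra context. Python will happily accept a tuple of words as a single dictionary key, which keeps any depth down to one lookup. This is just an alternative sketch, not the approach the rest of this series uses:

```python
from collections import defaultdict

# A tiny pre-processed sample; in practice this would be a whole book
tokens = "sentencestart the red apple tastes good sentenceend".split()
depth = 2

# Map each `depth`-word context (stored as a tuple) to its next-word counts
chain = defaultdict(lambda: defaultdict(int))
for i in range(len(tokens) - depth):
    context = tuple(tokens[i:i + depth])
    chain[context][tokens[i + depth]] += 1

print(dict(chain[('the', 'red')]))  # {'apple': 1}
```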

But watch out! The more context you force on your Markov Chain the less flexible it becomes, until eventually it becomes entirely useless.

How could too much context be a problem? Well, suppose you are trying to build a Markov Chain based off of Lord of the Rings with the goal of creating unique and original sentences that sound like something Tolkien would actually write. You want to make sure you get high quality sentences so you design your Markov Chain to base itself off of the past 10 words of context.

Unfortunately when you run your program you find that you aren’t getting original sentences at all! Your program does nothing but randomly print entire sentences taken word for word from the original books.

The problem was you got too specific. Tolkien might have used the individual word “elves” in hundreds of different ways throughout his books but the specific phrase “For the Elves the world moves, and it moves both…” only shows up once, leading to a boring Markov Model that does nothing but tell you with 100% certainty how specific sentences actually ended.

So building useful Markov Chains requires a certain amount of experimentation in order to find an input length with enough context to provide intelligent results but short enough that it doesn’t just directly copy whatever it is you’re modeling.

What Are They Good For?

So are Markov Chains good for anything other than generating inedible recipe fragments or producing text that sounds vaguely like it came from famous books?

Of course!

Search engines use them for ranking pages and predicting user behavior. Smart phone keyboards use them for auto complete. Statisticians use them for studying medicine, economics, sociology and pretty much everything else where things happen in a specific order.

They’re also a handy way to build video game AIs that act in intelligent semi-random ways.

So understanding the basics of Markov Chains will come in handy regardless of whether you plan to try and cure cancer or just program a really awesome giant robot videogame.

Or both. Dare to be a modern day Renaissance Man!

The Title Is “Weird” Markov Chains. What is that about?

Like I said, you can use Markov Chains to model anything that has an order. But whether or not said Markov Chain is actually good for anything is a completely separate matter.

So for this Let’s Program we’re going to start with a classic word based Markov Chain to analyze some public domain books and then try to mimic their writing style. That should be enough to prove we know what we’re doing. But after that we’re going to try a couple experimental ideas that may or may not produce anything worthwhile.

The first experiment will be to try and grab a bunch of musical scores and then build some Markov Chains that model what notes should come next in a song given the last couple of notes. This might let us generate some interesting sounding semi-random music. Or it might just generate the sound of a cat walking on a piano.

After that we’ll finish up with an even stranger experiment based around grabbing a bunch of thematically related images and using them to build a Markov Chain that predicts what color/shade a pixel should be based on the pixels around it. We will then use this to generate new images that I am almost certain will look like static but might include recognizable fragments that actually look related to the our original images. It’s an experiment I’ve been wanting to try for a while, honestly, and I’m glad to finally be getting around to it.