So last time we wrote some basic tools for creating Markov Models from a text file. We just send some sentences to the program and it takes them apart in order to figure out what words follow what other words and how often.
But can we do anything interesting with this other than collect obscure trivia on the writing style of, say, Herman Melville?
Of course we can!
The most obvious and amusing thing you can do with a text-based Markov Model is use it to chain words together into a sentence. This is pretty easy to do since we already wrote a function that can take any word in our model and randomly return a possible next word.
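In case you don’t have last time’s helper handy, here’s a minimal sketch of what getRandomFromProbabilityList could look like, assuming the model maps each possible next word to a raw count of how often it was seen:

import random

# Weighted random pick: words that showed up more often get chosen more often
def getRandomFromProbabilityList(probabilityList):
    total = sum(probabilityList.values())
    pick = random.randint(1, total)
    for word, count in probabilityList.items():
        pick -= count
        if pick <= 0:
            return word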
So if we use that function with “sentencestart” we will randomly get back one of the words that the program found at the start of a sentence. If we then use the function on that first word we will randomly get back a second word. And that second word will lead to a third word until at some point we randomly get “sentenceend”.
The code for that is super straightforward. Just add something like this to the end of the program you wrote for creating the Markov model in the first place:
currentWord = 'sentencestart'
stopWord = 'sentenceend'
message = []

while currentWord != stopWord:
    currentWord = getRandomFromProbabilityList(markovChain[currentWord])
    if currentWord != stopWord:
        message.append(currentWord)
By the end of this while loop we will have an array holding a complete sentence.
The only real trick here is that you want to make sure that your sentence array does not actually include “sentencestart” or “sentenceend”. That’s why we don’t start adding items to the message array until after the first call to getRandomFromProbabilityList and why we make sure the loop ends as soon as “sentenceend” shows up and before it can wind up in the output array.
You can now sort of see a sentence by looking at the message array, but for convenience you probably want to join all the individual words together with spaces to make a single line of output. In Python that looks like this:
print(' '.join(message))
Now if you run the code and feed it our mini text file about apples you will get random output like “I have another apple pie”. You’ll notice that this sentence doesn’t exist in the original sample but that every word pair (“I have”, “have another”, “another apple”, “apple pie”) does.
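If you want to convince yourself of that, here’s a quick optional sanity check that every adjacent pair in the generated message already exists as a transition in the one-deep model:

# Every adjacent pair of generated words should be a known transition
for first, second in zip(message, message[1:]):
    print(first, second, '->', second in markovChain[first])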
Great! Now let’s try this on something much, much larger: the complete text of Moby Dick, as found on Project Gutenberg. Please note that their files are stored in UTF-8 instead of ASCII, so depending on what language you’re following along in you might need to change how you read files in order to avoid crashing the first time you run across a strange symbol.
In python that means we need to update our read line to this:
with open(sys.argv[1], encoding="utf8") as myfile:
    data = myfile.read()
And with that we can now summon the digital ghost of Herman Melville! Just grab a public domain copy of Moby Dick and feed it into your program:
“how now,” whispered something like a sentry-box with one matter from the next movement about a mariner’s needle, is, that, parsee:—a hearse can endure; at such prodigies as if the whale-fishery contains it is this arm
Hmmm… Ok? Those are sure some words, at least. Let’s try again.
cook! ho, ho! there’s no small gold-fish has footed it beast, boat, carrying his pupils of real are of the flashing upon the people of his shrill voice, “drinking hot fire burning martyr, or two, for it
Maybe that’s a little better?
Alright, let’s be honest. This is all pure nonsense. Real books have complex sentences that are hard to mimic by looking at them just one word at a time. That’s why we’re going to need to build a better Markov chain based on looking at multiple words at a time.
It is important to note that even when working with multi-word chains we sometimes still need a way to look at only one word at a time. For instance, when trying to figure out how to start a sentence we won’t have multiple words to look at. All we’ll have is “sentencestart”, so we either need to build a 1 deep model to go along with our 2 deep model or we need some way to make a 2 deep sentence-starting symbol.
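If you wanted to go that second route, one hypothetical way to do it would be to double up the start symbol in the token stream, so the 2 deep model ends up with a (“sentencestart”, “sentencestart”) entry to begin from:

# Hypothetical: pad the token stream so every sentence starts with two start symbols.
# Building the 2 deep model from paddedTokens would then let you kick off a sentence
# from markovChain2Deep['sentencestart']['sentencestart'] with no 1 deep model needed.
paddedTokens = []
for token in tokenList:
    paddedTokens.append(token)
    if token == 'sentencestart':
        paddedTokens.append('sentencestart')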
Since we already have a 1 deep model we’ll just keep that, and the whole algorithm that builds it, and then add a 2 deep model after it.
markovChain2Deep = {}

for i in range(len(tokenList) - 2):
    if tokenList[i] not in markovChain2Deep:
        markovChain2Deep[tokenList[i]] = {}
    if tokenList[i+1] not in markovChain2Deep[tokenList[i]]:
        markovChain2Deep[tokenList[i]][tokenList[i+1]] = {}

    if tokenList[i+2] not in markovChain2Deep[tokenList[i]][tokenList[i+1]]:
        markovChain2Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]] = 1
    else:
        markovChain2Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]] += 1
Not hard to write, just a little tedious. Instead of grabbing every pair of words we grab every triplet of words in our text and then build a dictionary of first words that link to a dictionary of second words that link to a final dictionary keeping track of how often that triplet happened.
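If you want a feel for the shape of that structure, you can peek at one branch of it (the exact words and counts depend entirely on your input file):

# Assuming the words 'i' and 'have' appear next to each other somewhere in your text
print(markovChain2Deep['i']['have'])
# Prints a dictionary of third words and their counts, e.g. {'an': 1, 'another': 1}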
With that in place we can improve our message generating code to work a little bit better. Update it like this:
currentWord = 'sentencestart'
nextWord = getRandomFromProbabilityList(markovChain[currentWord])
stopWord = 'sentenceend'
message = []

while nextWord != stopWord:
    tempWord = getRandomFromProbabilityList(markovChain2Deep[currentWord][nextWord])
    currentWord = nextWord
    nextWord = tempWord
    if currentWord != stopWord:
        message.append(currentWord)

print(' '.join(message))
Now that we’re linking two deep we get things like:
i fancied that the frenchman had a creditor
That actually makes sense!
Of course it can still do strange things, like here where it got caught in an “oil, nor” loop for a while:
he cannot be olive oil, nor macassar oil, nor train oil, nor train oil, nor macassar oil, nor cod-liver oil
But we still get our fair share of nonsense:
but as the standard-bearer of this ship, and make the bed corner, slips out the smoke from the rim of the enemy defray the current expenses of the west and south
What if we boost it to be three deep? Maybe a little less nonsense?
markovChain3Deep = {}

for i in range(len(tokenList) - 3):
    if tokenList[i] not in markovChain3Deep:
        markovChain3Deep[tokenList[i]] = {}
    if tokenList[i+1] not in markovChain3Deep[tokenList[i]]:
        markovChain3Deep[tokenList[i]][tokenList[i+1]] = {}
    if tokenList[i+2] not in markovChain3Deep[tokenList[i]][tokenList[i+1]]:
        markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]] = {}

    if tokenList[i+3] not in markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]]:
        markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]][tokenList[i+3]] = 1
    else:
        markovChain3Deep[tokenList[i]][tokenList[i+1]][tokenList[i+2]][tokenList[i+3]] += 1
And then replace our old message producing code:
currentWord = 'sentencestart'
nextWord1 = getRandomFromProbabilityList(markovChain[currentWord])
nextWord2 = getRandomFromProbabilityList(markovChain2Deep[currentWord][nextWord1])
stopWord = 'sentenceend'
message = []

while nextWord2 != stopWord:
    tempWord = getRandomFromProbabilityList(markovChain3Deep[currentWord][nextWord1][nextWord2])
    currentWord = nextWord1
    nextWord1 = nextWord2
    nextWord2 = tempWord
    message.append(currentWord)

# Make sure we don't exit the loop without first recording what's left in the pipeline
if nextWord2 == stopWord:
    message.append(nextWord1)

print(' '.join(message))
Now what sort of sentences do we get?
besides, such is the endlessness, yea, the intolerableness of all earthly ills, and that on no account kick back; for you can’t help yourself, wise stubb
Not bad. Not that good either, but it almost sounds like an old sailor giving depressing life advice.
Let’s try again…
but though the lakeman had induced the seamen to dip their ship-biscuit into the huge oil-pots and let them fry there awhile
Hey! That one actually makes a lot of sense. Success!
Or is it? Compare it to this line from the original book:
In the long try watches of the night it is a common thing for the seamen to dip their ship-biscuit into the huge oil-pots and let them fry there awhile.
So you can see that even at only three words deep we’re already reaching the point where entire segments of the original book are being reproduced word for word.
For another example, compare this generated sentence:
eakable thing called his “flurry,” the monster horribly wallowed in his blood, at last he partially disclosed a strangely discoloured bunch or protuberance, the size of the iron part of a hoe
To this original sentence:
Still rolling in his blood, at last he partially disclosed a strangely discoloured bunch or protuberance, the size of a bushel, low down on the flank.
Once again we’re just duplicating big blocks of text. That’s because there are quite a few three-word phrases that only happen once in the entire book, and once you wander into those you’re trapped.
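If you’re curious just how common those dead ends are, a quick optional count over the 3 deep model makes the problem pretty obvious:

# Count how many three-word contexts only ever lead to one possible next word.
# Once you land in one of these there is no randomness left at all.
totalContexts = 0
forcedContexts = 0
for word1 in markovChain3Deep:
    for word2 in markovChain3Deep[word1]:
        for word3 in markovChain3Deep[word1][word2]:
            totalContexts += 1
            if len(markovChain3Deep[word1][word2][word3]) == 1:
                forcedContexts += 1
print(forcedContexts, 'of', totalContexts, 'three-word contexts have exactly one possible next word')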
Also, wow, this book is actually pretty bloody. I should have done something like Pride and Prejudice instead…
collins and maria were gone on business into the village, when she was startled by a ring at the door, and elizabeth saw her sister’s countenance change as she read were scarcely to be defined
What a relief, no blood and gore. Although we have randomly generated a pretty good way to start a thriller. Who is at the door? What does Elizabeth’s sister know? What will Collins and Maria find when they finish their business and get back to the house?
So there you go. A complete, if mediocre, tool for tearing a story apart, analyzing word pattern frequency and then stitching things back together into sentences. By this point you should feel pretty comfortable with the basic ideas behind using Markov Models and understand the pros and cons of making them deep vs shallow.
But maybe you still want a little more practice? Well then, here are some fun projects for you:
1) Writing deeper and deeper Markov Models by hand is tedious. Figure out a way to automate that so you can do something like generateDeepModel(text, depth). (There’s a rough sketch of one possible approach after this list if you get stuck.)
2) Looking at the output, we did a really bad job of removing punctuation. Fix that.
3) Our code also treats all periods as the end of a sentence, which is a problem for titles like Mr., Mrs., Dr., and so on. Figure out a way around that.
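If you get stuck on the first project, here’s a rough sketch of one possible approach, sticking with the same tokenList and counting scheme used above (the function name and details are just one way it could go):

def generateDeepModel(tokenList, depth):
    # Builds a dictionary nested 'depth' levels deep; the innermost values are
    # counts of how often each (depth + 1)-word sequence appeared in the text
    model = {}
    for i in range(len(tokenList) - depth):
        level = model
        # Walk (and create as needed) the nested dictionaries for the first 'depth' words
        for j in range(depth):
            word = tokenList[i + j]
            if word not in level:
                level[word] = {}
            level = level[word]
        # The final word of the window gets a count instead of another dictionary
        lastWord = tokenList[i + depth]
        level[lastWord] = level.get(lastWord, 0) + 1
    return model

# markovChain2Deep = generateDeepModel(tokenList, 2)
# markovChain3Deep = generateDeepModel(tokenList, 3)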