A Second Blind Test
Now that I’ve made a few modifications to DELPHI, it’s time to run another live test. But I can’t just ask the first test user to swing by for a second go. That user already knows some of DELPHI’s strengths and weaknesses, so they can no longer provide a truly unbiased sample of whether or not the code is working. For good test results I’m going to need a second test user with no experience with DELPHI at all.
Fortunately it’s still the holidays and finding a second user was as easy as snagging a nearby relative. This is their conversation:
DELPHI: HELLO! MY NAME IS DELPHI.
DELPHI: I CAN USE MY MYSTERIOUS POWER TO ANSWER YES OR NO QUESTIONS LIKE:
DELPHI: “WILL IT RAIN TOMORROW?” OR “DID I PASS MY LAST TEST?”
DELPHI: WHAT WOULD YOU LIKE TO KNOW?
>will i be a white Christmas?
DELPHI: I LOST MY TAROT CARDS… BUT I HAVE SOME PLAYING CARDS. THEY PREDICT THAT YOU WILL NOT BE A WHITE CHRISTMAS
>will IT be a white Christmas?
DELPHI: BASED ON THESE TEA LEAVES IT SEEMS IT WILL NOT BE A WHITE CHRISTMAS
>Will my New Year’s Resolutions last longer this year?
DELPHI: I LOST MY TAROT CARDS… BUT I HAVE SOME PLAYING CARDS. THEY PREDICT THAT YOUR NEW WILL YEAR’S RESOLUTIONS LAST LONGER THIS YEAR
>Does that mean I can resolve to learn how to fly and it still be successful?
DELPHI: TRUST YOUR INSTINCTS ON THIS ONE
>Can I resolve to grow wings and learn to fly?
DELPHI: I’M REALLY GOOD AT YES NO QUESTIONS. TRY ONE OF THOSE
>If you attempt to fail and succeed, have you really succeeded?
DELPHI: THAT LOOKED LIKE A QUESTION, BUT I’M NOT SURE HOW TO RESPOND. MAYBE YOU COULD SIMPLIFY IT TO A YES OR NO QUESTION
>Is the sky blue?
DELPHI: THE SKY CAN’T BE BLUE. THAT WOULD MESS UP THE CROSSWORD PUZZLE I’M WORKING ON
>Is the grass green?
DELPHI: FATE INDICATES THAT THE GRASS IS GREEN
>Is sky gray?
DELPHI: SKY CAN’T BE GRAY. THAT WOULD MESS UP THE CROSSWORD PUZZLE I’M WORKING ON
>Is the sky red?
DELPHI: HMMM… ECHOES OF COSMIC WISDOM SHOW THAT THE SKY IS NOT IS
>Did I break you?
DELPHI: FATE INDICATES THAT YOU DID BREAK I
>I’m sorry. Good Bye
DELPHI: I’M SORRY, I’M NOT SURE WHAT TO SAY ABOUT THAT. TRY A QUESTION LIKE ‘WHY IS THE SKY BLUE?’
>Why is he Sky Blue?
DELPHI: WHY NOT?
>Because
DELPHI: I’M SORRY, I’M NOT SURE WHAT TO SAY ABOUT THAT. TRY A QUESTION LIKE ‘WHY IS THE SKY BLUE?’
>goodbye
DELPHI: Come back anytime
Conversation Analysis
The first and most important thing I noticed is that this test user actually laughed at some of the things DELPHI did. That is a very good sign and indicates that the personality of the writing is still helping to hide the weaknesses in the code.
Aside from that, the test user asked 14 different questions. Of those, 6 were answered properly and 5 were answered improperly. The other 3 were kind of on the fence. Two of the “meh” answers were DELPHI complaining about input it didn’t understand, which is fine if it only happens occasionally. The third borderline answer was this interesting exchange:
>Will my New Year’s Resolutions last longer this year?
DELPHI: I LOST MY TAROT CARDS… BUT I HAVE SOME PLAYING CARDS. THEY PREDICT THAT YOUR NEW WILL YEAR’S RESOLUTIONS LAST LONGER THIS YEAR
>Does that mean I can resolve to learn how to fly and it still be successful?
The pattern didn’t recognize “New Year’s Resolutions” as one big noun so the response stuck a “will” in the wrong place. Normally I’d call this a problem but the user didn’t seem to notice or care about this little mistake. They just moved right on with the conversation.
Actually, this isn’t all that surprising. Humans are very good at predicting what a sentence should sound like, so you can get away with mixing up word order or letter order from time to time. Most people’s brains will just rearrange the sentence without consciously noticing there was a mistake.
So for now I’m going to ignore this and focus on the five real problems.
Easy Fixes
Of our five definite mistakes, two are easy to fix. The first was actually just a typo in one of my response patterns:
>Is the sky red?
DELPHI: HMMM… ECHOES OF COSMIC WISDOM SHOW THAT THE SKY IS NOT IS
“IS NOT IS”? Where did that come from? Well, looking at my response patterns it turns out I accidentally referenced the user’s verb twice instead of referencing the second half of their input like I should have. Notice that the output rule below has two “UIF0”s but no “UIF2”.
“Hmmm… echoes of cosmic wisdom show that UIF1 UIF0 not UIF0”
Fixing that was as simple as changing the second UIF0 to a UIF2. Still, it’s an embarrassing mistake. I guess that’s what I get for having automated tests that only check the first answer in the random response array!
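For the record, here’s the corrected version of the rule:

“Hmmm… echoes of cosmic wisdom show that UIF1 UIF0 not UIF2”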
The next easy problem was this one:
>Can I resolve to grow wings and learn to fly?
DELPHI: I’M REALLY GOOD AT YES NO QUESTIONS. TRY ONE OF THOSE
There is really no difference between a “Can” rule and a “Does” rule or an “Is” rule. So writing a “can” rule shouldn’t be a challenge. The only issue to watch out for is that your generic “Can” rule needs to be a lower priority than the helpful “Can you” rule that we’re using to provide tips on what DELPHI can and can’t do.
Here’s a test case and the code to solve it:
$testCases[21][0] = "Can this code pass all the tests?";
$testCases[21][1] = "FATE INDICATES THAT THIS CODE CAN PASS ALL THE TESTS";
push(@chatPatterns,
    [qr/\ACan ($noncaptureAdjectiveChain[a-zA-Z]+) (.+)\?\z/i,
        ["Fate indicates that UIF0 can UIF1",
         "My \"Big Book O' Wisdom\" says that UIF0 can't UIF1"] ]);
Although to be honest, if you just plug that rule in you’ll probably get an error in your tests and find that the input is being caught by the generic “Can you” rule. That’s because the “Can you” rule just looks for the word “Can” followed by the letter “i”, without caring whether that “i” is an actual word (what we want) or just part of something bigger. In this case, it’s catching the “i” in the middle of “this”. We can fix this with a few word boundary adjustments to the “Can you” regex.
/\ACan.*\bI\b/i
Now the “Can you” rule will only activate when the “I” is on its own, like it should be after the word “You” has been transformed into first person.
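If you want to convince yourself the boundaries behave, a quick throwaway check does the trick. This is a standalone snippet, not part of DELPHI itself:

# Quick sanity check for the new word boundaries
my $canYouRule = qr/\ACan.*\bI\b/i;

# "I" stands alone as its own word, so the rule fires
print "Can I fly? => ", ("Can I fly?" =~ $canYouRule ? "match" : "no match"), "\n";

# The only "i" here is buried inside "this", so the rule stays quiet
print "Can this code pass all the tests? => ",
    ("Can this code pass all the tests?" =~ $canYouRule ? "match" : "no match"), "\n";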
Slightly Less Easy Problems
Complex and compound sentences of all sorts are the natural enemy of pattern matching chatbots. Like these two examples:
>If you attempt to fail and succeed, have you really succeeded?
…
>I’m sorry. Good Bye
The first sentence is an “If X then Y” pattern. Noticing that a question starts with “If” would be easy, but deciding what to say back is difficult. The main problem is that not all sentences that start with “If” are actually yes or no questions, so we can’t just randomly throw out a yes or no answer like we do for other kinds of input. For example:
If two trains are fifty miles apart and approaching each other at twenty miles per hour, when will they meet?
So if we wanted DELPHI to respond to “If” questions we would need to actually create a system for identifying exactly what sort of “If” question was being asked. That’s quite a bit more complicated than the original scope of this program.
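That said, spotting the “If” itself really would be easy. If you wanted DELPHI to give a dedicated brush-off instead of its generic fallback complaint, a low priority rule in the same format as the others might look something like this (the response wording here is just a suggestion, not anything from DELPHI’s actual rule list):

# Hypothetical low priority deflection rule for "If" questions
push(@chatPatterns,
    [qr/\AIf\b/i,
        ["\"If\" questions are a little too deep for me. Try boiling it down to a plain yes or no question"] ]);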
Compared to the “If then” input, the second problem is almost trivially easy. All the user did was type in two complete sentences at once. The smartest way to solve that issue would be to split their input into two inputs and then choose one of them to respond to.
In fact, splitting sentences into multiple parts when you see punctuation is a pretty decent way to handle all sorts of complex input. Compound sentences, complex sentences, and multiple sentences can all be split along their periods and commas. Then you can compare each sentence fragment to the list of rules and return a response to whichever piece of input matched against a high priority rule.
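If you wanted to try it, a rough sketch might look something like this. The generateResponse function is a hypothetical stand-in for however your rule matching is actually packaged, so adjust the names to fit your own code:

# Split complex input along punctuation and answer the first fragment
# that matches a real rule. generateResponse() is assumed to return
# undef when no rule matches.
sub respondToComplexInput {
    my ($input) = @_;
    my @fragments = split /[.,;]+/, $input;
    foreach my $fragment (@fragments) {
        $fragment =~ s/\A\s+|\s+\z//g;    # trim stray whitespace
        next unless length $fragment;
        my $response = generateResponse($fragment);
        return $response if defined $response;
    }
    return undef;    # nothing matched; let the usual fallback handle it
}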
Or you can be lazy like me and ignore the problem. If the user gets politely scolded for using complex sentences and “if” statements it won’t take long for them to go back to talking in ways that DELPHI already understands. I’m not going to spend too much time worrying about a speech pattern that has only shown up once in all the user tests I’ve run.
Of course, ambitious readers might decide that they do want to tackle this problem. More power to you!
More Hard To Fix Problems: Grammar Be Hard
The second hard to fix problem can be seen here:
>Did I break you?
DELPHI: FATE INDICATES THAT YOU DID BREAK I
The problem here has to do with an annoying bit of non-uniqueness in English pronouns. In first person we have “I” and “Me” but in second person all we have is “You” and another “You”. Observe:
I gave this to you.
You gave this to me.
Our current switchFirstAndSecondPerson function isn’t smart enough to figure out that sometimes “you” should be “me” and sometimes “you” should be “I”. It always changes it to “I”. Once again, this is a problem that technically could be fixed. It is possible to build an automatic grammar parsing system that can identify the part of speech of every word in a sentence. This would then give us enough information to more intelligently swap around “I”s and “You”s and “Me”s.
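Before reaching for a full parser, here’s a toy version of the naive swap to make the failure concrete. The real switchFirstAndSecondPerson is more thorough than this, so treat it as illustration only, but it shares the same blind spot: every “you” becomes “I” no matter where it sits in the sentence:

# A deliberately dumb person-swapper: fixed one-to-one replacements,
# zero grammar awareness (illustration only, not DELPHI's actual code)
my %pronounSwap = ( 'i' => 'you', 'me' => 'you', 'you' => 'I' );

sub naivePersonSwap {
    my ($text) = @_;
    $text =~ s/\b(i|me|you)\b/$pronounSwap{lc $1}/gie;
    return $text;
}

print naivePersonSwap("Did I break you"), "\n";    # "Did you break I" -- there's our bug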
But the whole point of this Let’s Program was to build a simple pattern matching chatbot and avoid the full complexity of natural language parsing. So once again, this is a problem I’m going to ignore on the principle that being right most of the time is good enough for a chatbot with only two functions and a few hundred lines of code and data.
Besides, DELPHI is supposed to guide users to ask questions about the future, not about DELPHI itself. That should hopefully minimize the number of second-to-first person switches we have to make anyways. And if no one ever sees a certain bug, is it really a bug at all?
80% Is A Decent Grade… Especially When You Haven’t Put In Much Effort
So how successful was DELPHI this time around? Well, if we award ourselves points for the two problems we just fixed and the three slightly wrong but acceptable answers we get this:
6 good answers + 2 fixed answers + 3 borderline answers = 11 out of 14 answers
That means that DELPHI is now 78% acceptable in terms of its ability to talk with real humans. And to be honest, that’s good enough for me. The whole point of this Let’s Program was to demonstrate the bare basics of how to use pattern matching to create a very simple chatbot. I never expected it to perform nearly as well as it does.
But since we’ve come this far and added a new “Can” rule, we might as well try to hunt down one last test user and see if we really are getting an 80% success rate with DELPHI. As all good scientists know, an experiment isn’t really done until you’ve repeated it several times and made sure you can get the same answer every time.