» Let’s Program A Chatbot 12: When The Answer Key Is WrongScott Cornaby

Unrealistic Expectations

Sometimes you get halfway through a project only to realize you don’t have the time or money to do what you originally planned to do*. When that happens you have no choice but to rethink your plans, either lowering your expectations or setting a new deadline. Admittedly both approaches generally involve getting frowned at by both management and your customers but sometimes you really have no choice. Even the best of developers have limits.

Why am I bringing this up? You’ll understand in a minute, but I will tell you that it involves these still unresolved use cases:

Test Case 2 Failed!!!

Input: Does this program work?

Output: I’m sorry, could you try rewording that?

Expected: Fate indicates that this program works

Test Case 3 Failed!!!

Input: Do computers compute?

Output: I’m sorry, could you try rewording that?

Expected: Fate indicates that computers compute

At first this doesn’t look so bad. The use cases are “Do X Y?” and “Does X Y?” and all DELPHI has to do is respond back “Yes X Y”. Hardly seems like a challenge. We’ll just slip this new rule into our list after the “or” rule and right before the “is” rule.

push(@chatPatterns,
   [qr/\A(?:Do|Does) (.+)\?\z/,
      "Fate indicates that UIF0"]);

Very simple. We look for any question that starts with some form of “Do” (notice the non-capture ?: symbol) and then we just replace that one question word with our “Fate indicates that” prediction. Is that really all it took?

Test Case 2 Failed!!!

Input: Does this program work?

Output: Fate indicates that this program work

Expected: Fate indicates that this program works

Test Case 3 Passed

A success and a failure is still an overall failure. So now we need to find out what went wrong with Test Case 2 that didn’t go wrong with test Case 3. If you look closely at the expected vs actual output the only issue is verb agreement. It should be “program works”, with an ‘s’, but all we got was the original “program work” from the question.

This problem really only shows up in the third person where the question is phrased as “Does X VERB” and the answer needs to be in form “X VERBs”. It’s really a pretty simple grammar rule. At least, it’s simple for a human. DELPHI is going to need a lot of help.

Hmmm… maybe we can solve this by just slipping an ‘s’ onto the end of our response. Of course, since this only applies to third person questions we’ll have to split the original rule into two rules. Notice that only the “does” version glues a final s onto the end of the User Input Fragment from the original input:

push(@chatPatterns,
   [qr/\ADo (.+)\?\z/,
      "Fate indicates that UIF0"]);

push(@chatPatterns,
   [qr/\ADoes (.+)\?\z/,
      "Fate indicates that UIF0s"]);

Test Case 2 Passed

I’m Still Not Sure This Is Really Working

Just gluing an ‘s’ to the end of the input doesn’t seem very sophisticated. Sure, it passed our test case but I’m not sure it will really work in all scenarios. So how about we write a new test case just to make extra sure we really solved our problem?

$testCases[13][0] = "Does adding an s work well?";
$testCases[13][1] = "Fate indicates that adding an s works well";

Nope!

Test Case 13 Failed!!!

Input: Does adding an s work well?

Output: Fate indicates that adding an s work wells

Expected: Fate indicates that adding an s works well

Adding an ‘s’ to the end of the sentence isn’t enough because what we truly want is an ‘s’ on the end of the verb and there is no guarantee that the verb will be the last word in the sentence. So to fix this problem we are going to need to either:

Develop a complex system for identifying the verb in an arbitrary sentence
Decide that we don’t care about adding ‘s’s to verbs

I’m going to go with option number 2 and come up with a new definition of what is considered a “correct” answer to a “does” question.

The New Test Case

There is an easy way around having to reformat our verbs and that is by including the word “does” inside the response. For instance, these two sentences basically mean the same thing:

This sentence looks equal to the other sentence

This sentence does look equal to the other sentence

This means that we can change the response to “Does X Y?” from “Yes, X Ys” to the much simpler “X does Y”. Now we are dealing with the exact same problem we already solved for “X is Y” and “X will Y”.

Here are our updated test cases:

$testCases[2][0] = "Does this program work?";
$testCases[2][1] = "Fate indicates that this program does work";

$testCases[13][0] = "Does this approach work better?";
$testCases[13][1] = "Fate indicates that this approach does work better";

And here is our updated “does” rule (the “do” rule can stay the same):

push(@chatPatterns,
   [qr/\ADoes ($noncaptureAdjectiveChain[a-zA-Z]+) (.+)\?\z/,
      "Fate indicates that UIF0 does UIF1"]);

And, finally, here are the results

Passed 13 out of 14 tests

Test Failure!!!

Did We Learn Anything Useful Today?

The moral of today’s story is that sometimes a test case that is really hard to solve represents a problem with your expectations as much as your program. If you’re on a tight budget or schedule** sometimes it makes sense to stop and ask yourself “Can we downgrade this requirement to something simpler? Can we delay this requirement until a later release?”

After all, good software today and the promise of great software tomorrow is better than insisting on great software today and never getting it.

Although sometimes you can manage to deliver great software today and that’s even better. Reach for the stars, bold readers. I have faith in your skills!

Conclusion

Did you notice that the success rate on our last testing run was 13 out of 14? That means we’re almost done! At least, we’re almost done with the first test version of the code. I’m sure the instant we ask a human tester to talk to DELPHI we’re going to find all sorts of new test cases that we need to include.

But future test cases are a problem for the future. For now we’re only one test case away from a significant milestone in our project. So join me next time as I do my best to get the DELPHI test suite to finally announce “All Tests Passed!”

* Even worse, sometimes you’ll find out that what you want to do is mathematically impossible. This is generally a bad thing, especially if you’ve already spent a lot of money on the project.

** Or if you’re writing a piece of demo software for your blog and don’t feel like spending more than a few dozen hours on what is essentially a useless toy program