Let’s Program A Chatbot 16: Testing On Live Subjects

You Don’t Need To See The Latest Code Updates



Since last we met all I’ve done is add 25 common adjectives to DELPHI’s common adjectives list and write four more possible response patterns for every chatbot rule. These modifications didn’t involve any actual coding tricks and have made DELPHI too long to conveniently embed inside a blog post. I do promise to publish DELPHI’s complete code as soon as this Let’s Program is over but for today there’s nothing worth showing. Those of you following along at home can feel free to write your own response patterns and add your own common adjectives.



Automated Testing Alone Is Not Enough



Our automated tests made it really easy for us to keep track of whether or not DELPHI was doing what we, the programmers, wanted it to do. And that’s very important! It’s hard to write good software if you don’t have some way of keeping track of what your goals are and which of them you have and haven’t met.



But just because a program satisfies the programmer’s list of goals doesn’t mean it will satisfy the customer’s list of demands. Real world users almost always have items on their wish lists that we programmers completely overlooked.



Test users also help us programmers avoid blind spots in our testing. When a developer tries to write tests for his own code he will subconsciously tend to avoid test cases that he knows will break the program. One common example is that we programmers usually forget to write a test case for blank input because accidentally typing in blank input is the sort of mistake we don’t usually make. That makes it easy for us to forget that the problem exists at all, which can lead to fragile programs that break apart the first time a non-programmer hits enter with no input, either by accident or out of curiosity. Having a test user break your program like this is much better than waiting for a customer to make a mistake you overlooked.
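To make the blank-input blind spot concrete, here’s a tiny illustrative sketch in Python (DELPHI itself isn’t shown in this post, so the function name `generate_response` and its behavior are hypothetical, not DELPHI’s actual code):

```python
def generate_response(user_input):
    # A fragile chatbot might do user_input.split()[0] here and crash on "".
    # Guarding against empty or whitespace-only input avoids that failure.
    words = user_input.split()
    if not words:
        return "I didn't catch that. Try asking a YES/NO question."
    return "Default confused answer"

def test_blank_input():
    # The kind of test case a test user finds for us:
    # empty string and pure whitespace must not crash the program.
    assert generate_response("") != ""
    assert generate_response("   ") != ""
```

The point isn’t this particular guard clause; it’s that the blank-input test only gets written once somebody who isn’t the author sits down at the keyboard.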



So there’s really no way around it: We need to get other human beings to test out our programs and find the bugs we forget to include in our automated tests. We can then use their human feedback to set new goals for the next version of our program and to help write new use cases for our tests.



Once we have new and improved tests it’s back to modifying our code until we’ve once again reached a 100% pass rate. Then we find a new test user and have them poke at our program until they find another problem or two. And of course that leads to more use cases, which can lead to new tests and new software features. Rinse and repeat this cycle until the software finally reaches a level of quality that everyone feels comfortable with (or until you run out of time and have to give up / publish whatever you have). That’s basically the entire test driven development cycle.



DELPHI’s First Human User*



It’s the holidays now and that means friends and family as far as the eye can see. If you’re anything like me, people you haven’t seen for months are suddenly queuing up at your door to “Wish You A Merry Christmas And A Happy New Year”. These lovely people represent a valuable resource to every hobbyist programmer: A large pool of test users that will be more than happy to spend five minutes playing with your latest project.



So for this blog post I grabbed a handy holiday well-wisher and dropped them in front of DELPHI with no explanation of what it does or how it works. This is the conversation they had:



DELPHI: The oracle greets you. What questions are on your mind?


>How much longer is Xenoblade Chronicles**?


DELPHI: I’m sorry, could you try rewording that?


>Are there more than 10 hours left of Xenoblade Chronicles?


DELPHI: I’m sorry, could you try rewording that?


>What kind of questions can you answer?


DELPHI: I saw a question mark, but I didn’t understand the question. I’m very sorry


>Is a yes or no question better?


DELPHI: Let me flip a coin… the second one


>Should I have chicken or beef for dinner?


DELPHI: The later, unless it’s actually the former


>How many days until Christmas?


DELPHI: I saw a question mark, but I didn’t understand the question. I’m very sorry


>What is the weather like?


DELPHI: I’m sorry, could you try rewording that?


>Is it snowing outside?


DELPHI: Hmmm… echoes of cosmic wisdom show that it is not snowing outside


>Will it snow tomorrow?


DELPHI: it will snow tomorrow, assuming the mole people don’t invade next Tuesday


>Was school cancelled for tomorrow?


DELPHI: I’m sorry, could you try rewording that?


>Will there be school tomorrow?


DELPHI: I lost my tarot cards… but I have some playing cards. They predict that there will be school tomorrow


>What is 4 + 5?


DELPHI: I’m sorry, could you think of a simpler way to ask that question?


>What is the sum of 4 and 5?


DELPHI: That looked like a question, but I’m not sure how to respond


>Can you do math?


DELPHI: I saw a question mark, but I didn’t understand the question. I’m very sorry


>Do I go to high school or middle school?


DELPHI: Fate indicates the former


>What is today’s date?


DELPHI: I’m sorry, could you try rewording that?


>Is today Sunday?


DELPHI: I think that today is Sunday, but you should probably double check with someone else


>Is tomorrow Monday?


DELPHI: tomorrow can’t be Monday. That would mess up the crossword puzzle I’m working on


>Is it the 8th?


DELPHI: I see no reason why it wouldn’t be the 8th


>What month is it?


DELPHI: I’m sorry, could you think of a simpler way to ask that question?


>Is it december?


DELPHI: it can’t be december. That would mess up the crossword puzzle I’m working on


>Is it december or january?


DELPHI: Let me flip a coin… the first one


>Is it Sunday and the 8th?


DELPHI: it can’t be Sunday and the 8th. That would mess up the crossword puzzle I’m working on


>Will it be the 9th or he 10th tomorrow?


DELPHI: Whatever you think is correct


>I believe the 9th is correct. Am i correct?


DELPHI: That looked like a question, but I’m not sure how to respond


>Did I miss any responses?


DELPHI: That looked like a question, but I’m not sure how to respond




DELPHI: Come back anytime



Wow! DELPHI did a pretty bad job holding up its end of the conversation. On the other hand, it actually did pretty well for a chatbot with only a handful of rules and only a few weekends of development work. So even though DELPHI’s raw performance was poor I’m still happy with its performance to effort ratio.



What Did We Learn



Time to put on our thinking caps and analyze exactly what went wrong and what went right in this DELPHI test run. Those of you following along at home might want to break out some paper and jot down your own thoughts before reading my conclusions.



Of course, if you’re a programmer you probably have a high reading speed and the ability to recognize the words on your screen even when you aren’t directly looking at them. So you’ve undoubtedly already absorbed at least one or two of the conclusions I’ve written about below. Just think of it as getting a hint on how to start your own list.



BAD: DELPHI Introduction Doesn’t Give Good Enough Instructions



Since users never read the manual (and DELPHI doesn’t have a manual to read anyways) it is very important for DELPHI to provide gentle guidance on the proper way to ask it questions. And I think it’s fair to say I completely failed at this.



I probably should have warned the user to stick to YES/NO questions in the original prompt. Instead I just invited them to ask whatever was on their mind and got an open ended question about the play-time of a video game my user was interested in. Since it wasn’t a YES/NO question DELPHI gave up. I also could have done a better job of having DELPHI’s confused messages suggest better question formats. Constantly telling the user that DELPHI doesn’t know how to answer their question doesn’t do any good if I’m not also giving hints about what questions they should be asking.
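One cheap way to bake those hints in is to make every fallback message double as a tiny piece of documentation. This is just a sketch of the idea in Python, not DELPHI’s real fallback code; the list contents and names here are made up for illustration:

```python
import random

# Hypothetical fallback replies that coach the user toward the formats
# the bot understands, instead of just apologizing.
FALLBACK_HINTS = [
    "I'm sorry, could you try rewording that? I'm best at YES/NO questions.",
    "That stumped me. Try something like 'Is it snowing outside?'",
    "I only answer YES/NO and either/or questions. Care to rephrase?",
]

def confused_reply():
    # Pick a random hint so repeated failures still feel varied.
    return random.choice(FALLBACK_HINTS)
```

Each failure then nudges the user toward a question the bot can actually handle, instead of leaving them to guess.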



Fortunately the user was pretty clever and figured out on their own that switching their question to a YES/NO format might help. Unfortunately this led to our next error.



BAD: DELPHI Can’t Handle Plural And Past Tense Versions Of Its Rules



The user’s second question should have been easy. After all, it was just an “Is X Y?” question and that was one of the first rules we ever wrote.


>Are there more than 10 hours left of Xenoblade Chronicles?



Unfortunately it turns out that DELPHI only has rules specifically for “Is” and doesn’t have nearly enough brainpower to recognize that “Are” should use the same kind of rule. DELPHI also had difficulty later on when the user went first person and conjugated “Is” into “Am”. There were similar problems with past tense conjugations; DELPHI gave up on a “Was” question and a “Did” question even though logically they’re the same as “Is” and “Do”.



So it looks like we’re going to need to do some work buffing DELPHI up to work with a wide range of tenses and pluralizations: Is, Are, Am, Was, Were, Do, Does, Did.
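One way to handle that whole family of verbs is to fold them into the rule’s pattern as alternations instead of writing eight near-identical rules. Here’s a minimal sketch of that idea in Python regular expressions; the pattern and function names are illustrative, not DELPHI’s actual internals:

```python
import re

# Match "Is/Are/Am/Was/Were <anything>?" and "Do/Does/Did <anything>?"
# with one pattern each, instead of a separate rule per verb form.
IS_LIKE = re.compile(r'\A(is|are|am|was|were)\s+(.+)\?\s*\Z', re.IGNORECASE)
DO_LIKE = re.compile(r'\A(do|does|did)\s+(.+)\?\s*\Z', re.IGNORECASE)

def matches_yes_no_rule(question):
    # True if the question fits either YES/NO pattern family.
    return bool(IS_LIKE.match(question) or DO_LIKE.match(question))
```

With this approach the user’s “Are there more than 10 hours left of Xenoblade Chronicles?” falls into the same bucket as an “Is X Y?” question, which is exactly the behavior the test run showed was missing.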



BAD: DELPHI Doesn’t Know How To Talk About Itself



After their first two questions fell apart my clever test user asked an incredibly intelligent third question:



>What kind of questions can you answer?



Unfortunately that isn’t a pattern DELPHI knows how to respond to. Which is a shame because that would have been the perfect opportunity to slip a mini user manual into DELPHI’s output.
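Supporting that would just mean one more rule: a pattern that recognizes questions about the bot itself and answers with a mini user manual. A hedged Python sketch (again, the names, pattern, and manual text are invented for illustration, not DELPHI’s real rule set):

```python
import re

# Hypothetical rule: questions about the bot itself trigger a mini manual.
ABOUT_ME = re.compile(r'what kind of questions|what can you (do|answer)',
                      re.IGNORECASE)

MINI_MANUAL = ("I answer YES/NO questions like 'Is it snowing outside?' "
               "and either/or questions like 'Should I have chicken or beef?'")

def self_description(user_input):
    # Return the mini manual if the user asks about the bot, else None
    # so the other rules get a chance to match.
    if ABOUT_ME.search(user_input):
        return MINI_MANUAL
    return None
```

Sliding this rule in ahead of the fallback responses would have turned the test user’s third question into a teaching moment instead of another apology.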



GOOD: Humor Made The User Curious



My test user spent a lot longer with DELPHI than I thought they would. When I asked them what they were doing they admitted they were trying to see how many different ways DELPHI could respond to the same type of question. They also explained that they were trying to come up with new types of questions just to double check they weren’t missing an entire group of sort-of-funny chatbot replies.



This means that even though my chatbot was very flawed it made up for those flaws by being interesting enough that the user wanted to keep playing with it to see what it would say and do next. Since DELPHI is basically a toy the fact that the user enjoyed playing with it is a huge success.



GOOD: 50% Success Rate



If you count up the instances where DELPHI gave a good answer to a question and compare them to the instances where it gave a default confused answer, you’ll find it had very close to a 50% success rate. You might argue that a number that low shouldn’t count as a good thing, but I think it’s only fair to point out that DELPHI actually did manage to perform as expected in a wide variety of circumstances. No need to focus entirely on its mistakes.



I think it’s also interesting to note that the success rate seems higher in the second half of the conversation than the first. This suggests that the user eventually caught on to what kind of questions DELPHI handled best. So if I do a better job of explaining early on in the conversation that DELPHI prefers YES/NO questions the overall success rate should increase a lot.






As predicted DELPHI wasn’t quite ready for human contact. But it did better than I thought it would and now I have lots of data on what problem areas need to be tackled next. Expect my next post to be a rapid fire series of new test cases and the code to fix them.





* You might think I was DELPHI’s first human user, but I don’t count***.



** Xenoblade Chronicles is a Japanese RPG for the Nintendo Wii that has an epic, and rather long, plot. In retrospect it’s not the sort of thing one should try to speed-run during a holiday get together.



*** Because I programmed it. I wasn’t suggesting I don’t count because I’m not human. I’m totally a real human. Really