Where Are My Intelligent Assistant’s Mistakes? A Systematic Testing Approach

  • Todd Kulesza
  • Margaret Burnett
  • Simone Stumpf
  • Weng-Keen Wong
  • Shubhomoy Das
  • Alex Groce
  • Amber Shinsel
  • Forrest Bice
  • Kevin McIntosh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6654)


Intelligent assistants are handling increasingly critical tasks, but until now, end users have had no way to systematically assess where their assistants make mistakes. For some intelligent assistants, this is a serious problem: if the assistant is doing work that is important, such as assisting with qualitative research or monitoring an elderly parent’s safety, the user may pay a high cost for unnoticed mistakes. This paper addresses the problem with WYSIWYT/ML (What You See Is What You Test for Machine Learning), a human/computer partnership that enables end users to systematically test intelligent assistants. Our empirical evaluation shows that WYSIWYT/ML helped end users find assistants’ mistakes significantly more effectively than ad hoc testing. Not only did it allow users to assess an assistant’s work on an average of 117 predictions in only 10 minutes, it also scaled to a much larger data set, assessing an assistant’s work on 623 out of 1,448 predictions using only the users’ original 10 minutes’ testing effort.
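The scaling result above can be read as a simple proportion. The following is a minimal sketch of that arithmetic only, not the paper's method; the function name `coverage_fraction` is a hypothetical helper introduced here for illustration, and the two figures come directly from the abstract.

```python
def coverage_fraction(assessed: int, total: int) -> float:
    """Return the share of an assistant's predictions that were assessed.

    Hypothetical helper for illustration; it is not part of WYSIWYT/ML.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= assessed <= total:
        raise ValueError("assessed must be between 0 and total")
    return assessed / total


# Figures reported in the abstract for the larger data set.
larger_set = coverage_fraction(623, 1448)
print(f"{larger_set:.1%}")  # prints 43.0%
```

Under this reading, the users' original 10 minutes of testing effort assessed roughly 43% of the predictions in the larger data set.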


Keywords: intelligent assistants, end-user programming, end-user development, end-user software engineering, testing, machine learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Todd Kulesza (1)
  • Margaret Burnett (1)
  • Simone Stumpf (2)
  • Weng-Keen Wong (1)
  • Shubhomoy Das (1)
  • Alex Groce (1)
  • Amber Shinsel (1)
  • Forrest Bice (1)
  • Kevin McIntosh (1)

  1. School of EECS, Kelley Engr. Center, Oregon State University, Corvallis, United States
  2. Centre for HCI Design, City University London, London, UK
