
Evaluating intelligent knowledge systems: experiences with a user-adaptive assistant agent

  • Regular Paper
  • Published in: Knowledge and Information Systems (2017)

Abstract

This article examines experiences in evaluating a user-adaptive personal assistant agent designed to assist a busy knowledge worker in time management. We examine the managerial and technical challenges of designing adequate evaluation and the tension of collecting sufficient data without a fully functional, deployed system. The CALO project was a seminal multi-institution effort to develop a personalized cognitive assistant. It included a significant attempt to rigorously quantify learning capability, discussed in this article for the first time, and it ultimately led to multiple spin-outs, including Siri. Retrospection on negative and positive experiences over the six years of the project underscores best practice in evaluating user-adaptive systems. Lessons for knowledge system evaluation include: the interests of multiple stakeholders, early consideration of evaluation and deployment, layered evaluation at system and component levels, characteristics of technology and domains that determine the appropriateness of controlled evaluations, the implications of ‘in-the-wild’ versus variations of ‘in-the-lab’ evaluation, and the effect of technology-enabled functionality and its impact upon existing tools and work practices. In the conclusion, we discuss, through the lessons illustrated by this case study of intelligent knowledge system evaluation, how the development and infusion of innovative technology must be supported by adequate evaluation of its efficacy.

Notes

  1. While PTIME can be seen as a type of recommender system, evaluating a task-oriented adaptive system such as PTIME differs significantly from evaluating a classical recommender system, due to the generative, incremental, and dynamic nature of the recommendation task.

References

  1. Ackerman S (2011) The iPhone 4S’ talking assistant is a military veteran. Wired, 2011. www.wired.com/2011/10/siri-darpa-iphone/. Retrieved 26 Jan 2015

  2. Ambite JL, Barish G, Knoblock CA, Muslea M, Oh J, Minton S (2002) Getting from here to there: Interactive planning and agent execution for optimizing travel. In: Proceedings of fourteenth conference on innovative applications of artificial intelligence (IAAI’02), pp 862–869

  3. Ambite J-L, Chaudhri VK, Fikes R, Jenkins J, Mishra S, Muslea M, Uribe T, Yang G (2006) Design and implementation of the CALO Query Manager. In: Proceedings of eighteenth conference on innovative applications of artificial intelligence (IAAI’06), pp 1751–1758

  4. Aylett R, Brazier F, Jennings N, Luck M, Nwana H, Preist C (1998) Agent systems and applications. Knowl Eng Rev 13(3):303–308

  5. Azvine B, Djian D, Tsui KC, Wobcke W (2000) The intelligent assistant: an overview. In: Intelligent systems and soft computing: prospects, tools and applications. Lecture notes in computer science, vol 1804. Springer, New York, NY, pp 215–238

  6. Bank J, Cain Z, Shoham Y, Suen C, Ariely D (2012) Turning personal calendars into scheduling assistants. In: Extended abstracts of twenty-fourth conference on human factors in computing systems (CHI’12)

  7. Berry PM, Gervasio M, Peintner B, Yorke-Smith N (2007) Balancing the needs of personalization and reasoning in a user-centric scheduling assistant. Technical note 561, AI Center, SRI International

  8. Berry PM, Donneau-Golencer T, Duong K, Gervasio MT, Peintner B, Yorke-Smith N (2009a) Evaluating user-adaptive systems: lessons from experiences with a personalized meeting scheduling assistant. In: Proceedings of twenty-first conf. on innovative applications of artificial intelligence (IAAI’09), pp 40–46

  9. Berry PM, Donneau-Golencer T, Duong K, Gervasio MT, Peintner B, Yorke-Smith N (2009b) Mixed-initiative negotiation: facilitating useful interaction between agent/owner pairs. In: Proceedings of AAMAS’09 workshop on mixed-initiative multiagent systems, pp 8–18

  10. Berry PM, Gervasio M, Peintner B, Yorke-Smith N (2011) PTIME: personalized assistance for calendaring. ACM Trans Intell Syst Technol 2(4):40:1–40:22

  11. Bosker B (2013a) Tempo smart calendar app boasts Siri pedigree and a calendar that thinks for itself. The Huffington Post. www.huffingtonpost.com/2013/02/13/tempo-smart-calendar-app_n_2677927.html. Retrieved 30 June 2016

  12. Bosker B (2013b) SIRI RISING: the inside story of Siri’s origins—and why she could overshadow the iPhone. The Huffington Post. www.huffingtonpost.com/2013/01/22/siri-do-engine-apple-iphone_n_2499165.html. Retrieved 10 June 2013

  13. Bosse T, Memon ZA, Oorburg R, Treur J, Umair M, de Vos M (2011) A software environment for an adaptive human-aware software agent supporting attention-demanding tasks. Int J Artif Intell Tools 20(5):819–846

  14. Brusilovsky P, Karagiannidis C, Sampson D (2004) Layered evaluation of adaptive learning systems. Int J Contin Eng Educ Lifelong Learn 14(4–5):402–421

  15. Brusilovsky P (2001) Adaptive hypermedia. User Model User-Adapt Interact 11(1–2):87–110

  16. Brzozowski M, Carattini K, Klemmer SR, Mihelich P, Hu J, Ng AY (2006) groupTime: preference-based group scheduling. In: Proceedings of eighteenth conference on human factors in computing systems (CHI’06), pp 1047–1056

  17. Campbell M (2009) Talking paperclip inspires less irksome virtual assistant. New Scientist, 29 July 2009

  18. Carroll JM, Rosson MB (1987) Interfacing thought: cognitive aspects of human-computer interaction. MIT Press, Cambridge

  19. Chalupsky H, Gil Y, Knoblock CA, Lerman K, Oh J, Pynadath DV, Russ TA, Tambe M (2002) Electric elves: agent technology for supporting human organizations. AI Mag 23(2):11–24

  20. Cheyer A, Park J, Giuli R (2005) IRIS: integrate, relate, infer, share. In: Proceedings of 4th international semantic web conference on workshop on the semantic desktop, p 15

  21. Christie CA, Fleischer DN (2010) Insight into evaluation practice: a content analysis of designs and methods used in evaluation studies published in North American evaluation-focused journals. Am J Eval 31(3):326–346

  22. Cohen P (1995) Empirical methods for artificial intelligence. MIT Press, Cambridge

  23. Cohen P, Howe AE (1989) Toward AI research methodology: three case studies in evaluation. IEEE Trans Syst Man Cybern 19(3):634–646

  24. Cohen PR, Howe AE (1988) How evaluation guides AI research: the message still counts more than the medium. AI Mag 9(4):35–43

  25. Cohen PR, Cheyer AJ, Wang M, Baeg SC (1994) An open agent architecture. In: Huhns MN, Singh MP (eds) Readings in agents. Morgan Kaufmann, San Francisco, pp 197–204

  26. Cramer H, Evers V, Ramlal S, Someren M, Rutledge L, Stash N, Aroyo L, Wielinga B (2008) The effects of transparency on trust in and acceptance of a content-based art recommender. User Model User Adap Int 18(5):455–496

  27. Davis FD, Bagozzi RP, Warshaw PR (1989) User acceptance of computer technology: a comparison of two theoretical models. Manag Sci 35:982–1003

  28. Deans B, Keifer K, Nitz K et al (2009) SKIPAL phase 2 final technical report. Technical report 1981, SPAWAR Systems Center Pacific, San Diego

  29. Evers V, Cramer H, Someren M, Wielinga B (2010) Interacting with adaptive systems. In: Interactive collaborative information systems, volume 281 of Studies in computational intelligence. Springer, Heidelberg

  30. Freed M, Carbonell J, Gordon G, Hayes J, Myers B, Siewiorek D, Smith S, Steinfeld A, Tomasic A (2008) RADAR: a personal assistant that learns to reduce email overload. In: Proceedings of twenty-third AAAI conference on artificial intelligence (AAAI’08), pp 1287–1293

  31. Gena C (2005) Methods and techniques for the evaluation of user-adaptive systems. Knowl Eng Rev 20(1):1–37

  32. Grabisch M (1996) The application of fuzzy integrals in multicriteria decision making. Eur J Oper Res 89(3):445–456

  33. Graebner ME, Eisenhardt KM, Roundy PT (2010) Success and failure in technology acquisitions: lessons for buyers and sellers. Acad Manag Perspect 24(3):73–92

  34. Greenberg S, Buxton B (2008) Usability evaluation considered harmful (some of the time). In: Proceedings of twentieth conference on human factors in computing systems (CHI’08), pp 111–120

  35. Greer J, Mark M (2016) Evaluation methods for intelligent tutoring systems revisited. Int J Artif Intell Educ 26(1):387–392

  36. Grudin J, Palen L (1995) Why groupware succeeds: discretion or mandate? In: Proceedings of 4th European conference on computer-supported cooperative work (ECSCW’95), pp 263–278

  37. Hall J, Zeleznikow J (2001) Acknowledging insufficiency in the evaluation of legal knowledge-based systems: Strategies towards a broad based evaluation model. In: Proceedings of 8th international conference on artificial intelligence and law (ICAIL’01), pp 147–156

  38. Hitt LM, Wu DJ, Zhou X (2002) ERP investment: business impact and productivity measures. J Manag Inf Syst 19:71–98

  39. Höök K (2000) Steps to take before intelligent user interfaces become real. Interact Comput 12(4):409–426

  40. Horvitz E, Breese J, Heckerman D, Hovel D, Rommelse K (1998) The Lumière project: Bayesian user modeling for inferring the goals and needs of software users. In: Proceedings of 14th conference on uncertainty in artificial intelligence (UAI’98), pp 256–266

  41. Jameson AD (2009) Understanding and dealing with usability side effects of intelligent processing. AI Mag 30(4):23–40

  42. Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of 22nd ACM conference on knowledge discovery and data mining (KDD’02), pp 133–142

  43. Kafali Ö, Yolum P (2016) PISAGOR: a proactive software agent for monitoring interactions. Knowl Inf Syst 47(1):215–239

  44. Kahney L (2001) MS Office helper not dead yet. Wired, 19 April 2001. www.wired.com/science/discoveries/news/2001/04/43065?currentPage=all. Retrieved 8 Oct 2010

  45. Kjeldskov J, Skov MB (2007) Studying usability in sitro: simulating real world phenomena in controlled environments. Int J Hum Comput Interact 22(1–2):7–36

  46. Klimt B, Yang Y (2004) The Enron corpus: a new dataset for email classification research. In: Proceedings of 15th European conference on machine learning (ECML’04), number 3201 in lecture notes in computer science. Springer, pp 217–226

  47. Knoblock CA (2006) Beyond the elves: making intelligent agents intelligent. In: Proceedings of AAAI 2006 spring symposium on what went wrong and why: lessons from AI research and applications, p 40

  48. Kokalitcheva K (2015) Salesforce acquires “smart” calendar app Tempo, which is shutting down. Fortune. www.fortune.com/2015/05/29/salesforces-acquires-tempo/. Retrieved 30 June 2016

  49. Kozierok R, Maes P (1993) A learning interface agent for scheduling meetings. In: Proceedings of international workshop on intelligent user interfaces (IUI’93), pp 81–88

  50. Krzywicki A, Wobcke W (2008) Closed pattern mining for the discovery of user preferences in a calendar assistant. In: Nguyen NT, Katarzyniak R (eds) New challenges in applied intelligence technologies. Springer, New York, pp 67–76

  51. Langley P (1999) User modeling in adaptive interfaces. In: Proceedings of 7th international conference on user modeling (UM’99), pp 357–370

  52. Lazar J, Feng JH, Hochheiser H (2010) Research methods in human–computer interaction. Wiley, Chichester

  53. Maes P (1994) Agents that reduce work and information overload. Commun ACM 37(7):30–40

  54. McCorduck P, Feigenbaum EA (1983) The fifth generation: artificial intelligence and Japan’s computer challenge to the world. Addison Wesley, Boston

  55. Mitchell T, Caruana R, Freitag D, McDermott J, Zabowski D (1994) Experience with a learning personal assistant. Commun ACM 37(7):80–91

  56. Modi PJ, Veloso MM, Smith SF, Oh J (2004) CMRadar: a personal assistant agent for calendar management. In: Proceedings of agent-oriented information systems workshop (AOIS’04), pp 169–181

  57. Moffitt MD, Peintner B, Yorke-Smith N (2006) Multi-criteria optimization of temporal preferences. In: Proceedings of CP’06 workshop on preferences and soft constraints, pp 79–93

  58. Myers KL, Berry PM, Blythe J, Conley K, Gervasio M, McGuinness D, Morley D, Pfeffer A, Pollack M, Tambe M (2007) An intelligent personal assistant for task and time management. AI Mag 28(2):47–61

  59. Nielsen J, Levy J (1994) Measuring usability: preference vs. performance. Commun ACM 37(4):66–75

  60. Norman DA (1994) How might people interact with agents. Commun ACM 37(7):68–71

  61. Oh J, Smith SF (2004) Learning user preferences in distributed calendar scheduling. In: Proceedings of 5th international conference on practice and theory of automated timetabling (PATAT’04), pp 3–16

  62. Oppermann R (1994) Adaptively supported adaptivity. Int J Hum Comput Stud 40(3):455–472

  63. Palen L (1999) Social, individual and technological issues for groupware calendar systems. In: Proceedings of eleventh conference on human factors in computing systems (CHI’99), pp 17–24

  64. Paramythis A, Weibelzahl S, Masthoff J (2010) Layered evaluation of interactive adaptive systems: framework and formative methods. User Model User Adap Interact 20(5):383–453

  65. Peintner B, Dinger J, Rodriguez A, Myers K (2009) Task assistant: personalized task management for military environments. In: Proceedings of twenty-first conference on innovative applications of artificial intelligence (IAAI’09), pp 128–134

  66. Refanidis I, Alexiadis A (2011) Deployment and evaluation of Selfplanner, an automated individual task management system. Comput Intell 27(1):41–59

  67. Refanidis I, Yorke-Smith N (2010) A constraint-based approach to scheduling an individual’s activities. ACM Trans Intell Syst Technol 1(2):12:1–12:32

  68. Rychtyckyj N, Turski A (2008) Reasons for success (and failure) in the development and deployment of AI systems. In: Proceedings of AAAI’08 workshop on what went wrong and why: lessons from AI research and applications, pp 25–31

  69. Schaub F, Könings B, Lang P, Wiedersheim B, Winkler C, Weber M (2014) PriCal: context-adaptive privacy in ambient calendar displays. In: Proceedings of sixteenth international conference on pervasive and ubiquitous computing (UbiComp’14), pp 499–510

  70. Shakshuki EM, Hossain SM (2014) A personal meeting scheduling agent. Pers Ubiquit Comput 18(4):909–922

  71. Shen J, Li L, Dietterich TG, Herlocker JL (2006) A hybrid learning system for recognizing user tasks from desktop activities and email messages. In: Proceedings of eighteenth international conference on intelligent user interfaces (IUI’06), pp 86–92

  72. SRI International (2013) CALO: cognitive assistant that learns and organizes. https://pal.sri.com. Retrieved 10 June 2013

  73. Steinfeld A, Bennett R, Cunningham K et al (2006) The RADAR test methodology: evaluating a multi-task machine learning system with humans in the loop. Report CMU-CS-06-125, Carnegie Mellon University

  74. Steinfeld A, Bennett R, Cunningham K, et al. (2007a) Evaluation of an integrated multi-task machine learning system with humans in the loop. In: Proceedings of 7th NIST workshop on performance metrics for intelligent systems (PerMIS’07), pp 182–188

  75. Steinfeld A, Quinones P-A, Zimmerman J, Bennett SR, Siewiorek D (2007b) Survey measures for evaluation of cognitive assistants. In: Proceedings of 7th NIST workshop on performance metrics for intelligent systems (PerMIS’07), pp 189–193

  76. Stumpf S, Rajaram V, Li L, Wong W-K, Burnett M, Dietterich T, Sullivan E, Herlocker J (2009) Interacting meaningfully with machine learning systems: three experiments. Int J Hum Comput Stud 67(8):639–662

  77. Tambe M, Bowring E, Pearce JP, Varakantham P, Scerri P, Pynadath DV (2006) Electric Elves: what went wrong and why. In: Proceedings of AAAI 2006 spring symposium on what went wrong and why: lessons from AI research and applications, pp 34–39

  78. Van Velsen L, Van Der Geest T, Klaassen R, Steehouder M (2008) User-centered evaluation of adaptive and adaptable systems: a literature review. Knowl Eng Rev 23(3):261–281

  79. Viappiani P, Faltings B, Pu P (2006) Preference-based search using example-critiquing with suggestions. J Artif Intell Res 27:465–503

  80. Wahlster W (ed) (2006) SmartKom: foundations of multimodal dialogue systems. Cognitive technologies. Springer, New York

  81. Weber J, Yorke-Smith N (2008) Time management with adaptive reminders: two studies and their design implications. In: Working Notes of CHI’08 workshop: usable artificial intelligence, pp 5–8

  82. Wobcke W, Nguyen A, Ho VH, Krzywicki A (2007) The smart personal assistant: an overview. In: Proceedings of the AAAI spring symposium on interaction challenges for intelligent assistants, pp 135–136

  83. Yorke-Smith N, Saadati S, Myers KL, Morley DN (2012) The design of a proactive personal agent for task management. Int J Artif Intell Tools 21(1):90–119

Acknowledgements

We thank the anonymous reviewers for suggestions that helped to refine this article. We thank Karen Myers and Daniel Shapiro for their constructive comments, and we thank Mark Plascencia, Aaron Spaulding, and Julie Weber for help with the user studies and evaluations. We thank other contributors to the PTIME project, including Cory Albright, Emma Bowring, Michael D. Moffitt, Kenneth Nitz, Jonathan P. Pearce, Martha E. Pollack, Shahin Saadati, Milind Tambe, Joseph M. Taylor, and Tomás Uribe. We also gratefully acknowledge the many participants in our various studies, and the larger CALO team. For their feedback we thank among others Reina Arakji, Bijan Azad, Jane Davies, Nitin Joglekar, and Alexander Komashie, and the reviewers at the IAAI’09 conference where preliminary presentation of part of this work was made [8]. NYS thanks the Operations group at the Cambridge Judge Business School, where the body of the article was written, the fellowship at St Edmund’s College, Cambridge, and the Engineering Design Centre at the University of Cambridge. This material is based in part upon work supported by the US Defense Advanced Research Projects Agency (DARPA) Contract No. FA8750-07-D-0185/0004. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, or the Air Force Research Laboratory.

Author information

Corresponding author

Correspondence to Neil Yorke-Smith.

Appendices

Appendix 1: CALO Test scoring method

The annual CALO Test proceeded as follows, for project Years 2–4 (Sects. 4.1 and 4.2).

Methods. The CALO Test process consisted of five parts, as follows for PTIME.

  1. The independent evaluator (IE) defined a set of parameterized questions (PQs). These templates were made known to the PTIME team, who worked to develop the system’s capabilities towards them. There were some 60 PQs relevant to time management. For example:

    Rank the following times in terms of their suitability for [MEETING-ID], given the schedules and locations of anticipated participants.

    Each PQ was supplemented by an agreed interpretation (i.e., what the PQ means) and an ablation process (i.e., how to remove any learning in LCALO, to yield BCALO), both approved by the IE.

  2. Data was collected during the week-long critical learning period (CLP).

  3. The IE selected a subset of the PQs for evaluation. In Year 2, nine PTIME-relevant PQs were selected. In Year 3, two additional questions were selected.

  4. The IE created three instantiations of each selected PQ, relevant to the data set. For example, one instantiation of the above PQ is:

    Rank the following times in terms of their suitability for MTG-CALO-0133, given the schedules and locations of anticipated participants: (1) 7 am, (2) 10 am, (3) 10:30 am, (4) 3 pm, (5) 8 pm.

  5. The IE scored LCALO and BCALO on each such instantiated question (IQ) and produced the overall results. First, the IE determined the ‘gold-standard’ answer for each IQ. For each PQ, the process for determining the answer key was documented prior to the Test. For example, for the above PQ:

    Since this PQ is not asked from a single user’s perspective but from a global perspective (what is best considering all invitees), the Test evaluators will select an arbitrator who will be given access to all calendars and user preferences. The arbitrator may also ask any user for any information that may help the arbitrator identify the best answer. For example, the arbitrator may ask how important the meeting is to a user. The arbitrator will come up with the correct answer.

    While some PTIME IQs had objective answers, others (such as the above) had subjective answers. The IE followed the answer determination process to derive the answer key for each IQ. If necessary, the IE elicited information from the CLP participants, and if further warranted, made subjective final decisions.

    Second, the IE scored LCALO and BCALO against the answer for each IQ. Scores were between 0 (worst) and 4 (best). Again, for each PQ the process for determining the score was documented prior to the Test. For our example PQ, the process was to compare the ordered list generated by PTIME with the ordered list of the answer key by (quoting verbatim):

    Kendall rank correlation coefficient (also called Kendall Tau) with a shifted and scaled value: Kendall Tau generates numbers between −1.0 and 1.0 that indicate the correlation between two different rankings of the same items. 1.0 indicates the rankings are identical. −1.0 indicates that they are the reverse of each other. Kendall Tau accommodates ties in the ranking. To get values that range from 0 to 4 (rather than −1.0 to 1.0), we use the following adjustment: \(\text{Score} = (\text{Kendall Tau} + 1) \times 2\)

    This scoring process was encoded programmatically so that scores could be computed automatically for LCALO and BCALO.
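As an illustration only, the following minimal sketch shows how such a shifted-and-scaled Kendall Tau score can be computed in Python, assuming SciPy is available; the function and variable names are ours for illustration and are not taken from the CALO codebase.

    # Illustrative sketch of the shifted-and-scaled Kendall Tau score quoted
    # above (not the CALO Test code); assumes SciPy is installed.
    from scipy.stats import kendalltau

    def shifted_kendall_score(system_ranking, answer_key_ranking):
        """Map Kendall Tau in [-1, 1] onto the CALO Test score range [0, 4]."""
        tau, _p_value = kendalltau(system_ranking, answer_key_ranking)  # tau-b variant, handles ties
        return (tau + 1) * 2

    # Example: rank positions assigned to the five candidate times of the IQ above.
    answer_key = [1, 2, 3, 4, 5]
    system     = [1, 3, 2, 4, 5]
    print(shifted_kendall_score(system, answer_key))  # 3.6
    # An identical ranking scores 4.0, a reversed ranking 0.0, and an
    # uncorrelated ranking about 2.0 (cf. the critique in "Appendix 2").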

Tasks. Participants used CALO as part of their regular work activities during the period of the CLP. The participants pursued realistic tasks in a semi-artificial scenario, i.e., in a dedicated physical location rather than their usual workplaces (Lessons 3 and 4). Participants were given guidance about activities to try to include in their work, for instance, to schedule and attend a certain number of meetings; the independent evaluator approved the guidance. Participants were informed that the CALO system, not their work, was being evaluated, and that they might encounter bugs since the system was under ongoing development.

Appendix 2: Specific critique of the CALO Test

Further to the discussion of Sect. 4.1, the CALO Test aimed, as far as could be attained, for objectivity in providing a quantitative measure of the effects of learning on CALO’s performance. However, the nature of its scoring process introduced unintended artefacts.

First, instantiated questions (IQs) were derived from the parameterized questions (PQs) with a range of ‘difficulty’, determined by what the Independent Evaluator (IE) considered easy or difficult for a human office assistant. What is easy or difficult for an intelligent assistant, however, can differ from what is easy or difficult for a human.

Second, as described in “Appendix 1”, some IQs had subjective ‘gold-standard’ answers that required ex-post (i.e., after the activity) elicitation from subjects by the IE, and a partially subjective human decision on the answer key. More generally, a difficulty in any evaluation is defining successful completion of a task. It is worth noting how the PQs were defined by the IE to scope the information required to determine the answer key: for instance, it was not necessary to determine whether users had chosen the best schedules for their requirements out of all possible schedules, but only out of the multiple choices offered in the IQ answers.

Third, for PQs posed as multiple-choice questions, a chance effect could unintentionally favour BCALO. For example, consider a multiple-choice PQ with two possible answers, A or B, and its three instantiations to IQs. Suppose BCALO has a naive strategy of always returning answer A. There is a \(\frac{3}{8}\) probability that for exactly two of the three instantiations, A is the correct answer; in this case, BCALO scores 67% (2.67/4.0), which is higher than the LCALO target for the question!
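For concreteness, the arithmetic behind this chance effect can be checked with a few lines of Python (an illustrative calculation, not part of the CALO Test tooling):

    # Illustrative check of the chance effect: a naive baseline always answers A,
    # and each of the 3 IQs has two answers, A or B, assumed equally likely.
    from math import comb

    p_two_of_three_correct = comb(3, 2) * 0.5 ** 3  # = 3/8
    score_if_two_correct = (2 / 3) * 4              # = 2.67 out of 4.0, i.e., 67%
    print(p_two_of_three_correct, score_if_two_correct)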

Fourth, the scoring process for some PQs created artefacts. For example, consider the PQ of “Appendix 1”, which was scored using a shifted Kendall rank correlation coefficient. If CALO’s answer showed no correlation with the correct answer, it still received 2 out of 4 points; thus BCALO scored at least 2. Only failing to pick any answer from the given list of choices would score 0 points.

Fifth, even though the Test was intended to measure CALO’s learning ability, it could not do so unless the other parts required by the Test process were in good working order, so that learning data could be collected for LCALO; we recognized this point with PTIME when, for instance, the memory usage of other components slowed CALO’s responsiveness. Architecture, documentation, pretesting, debugging, usability, and user behaviour were therefore all as important to scoring well as the learning algorithms (Lesson 6).

As a rule, it is difficult in any evaluation to eliminate effects such as selection bias, experimenter bias, learning effects, and the Hawthorne effect, although proper experimental design can minimize them or at least make them measurable. The CALO Test was, in part deliberately, not structured and conducted to eliminate such effects as fully as it otherwise might have been, since the CLP was more a data-gathering exercise on the system than a regular user study. For example, whether or not it affected the data collected, there was selection bias from using subjects from our own institution (which was required for legal reasons). The Test was overseen by the IE, and monitors from the project sponsor were present; both were satisfied with the validity of the Test results.

Cite this article

Berry, P.M., Donneau-Golencer, T., Duong, K. et al. Evaluating intelligent knowledge systems: experiences with a user-adaptive assistant agent. Knowl Inf Syst 52, 379–409 (2017). https://doi.org/10.1007/s10115-016-1011-3
