Protecting victim and witness statement: examining the effectiveness of a chatbot that uses artificial intelligence and a cognitive interview

Abstract

Information of high evidentiary quality plays a crucial role in forensic investigations. Research shows that information provided by witnesses and victims often provides major leads for an inquiry. As such, statements should be obtained as soon as possible after an incident, but in many cases this is not achieved because of demands on resources. This study examined the effectiveness of a chatbot (the AI CI) that uses artificial intelligence (AI) and a cognitive interview (CI) to help record statements following an incident. After viewing a sexual harassment video, participants reported what they had witnessed using the AI CI or one of three comparison tools (Free Recall, CI Questionnaire, or CI Basic Chatbot), and their recall was scored. Measuring correct items (including descriptive items) and incorrect items (errors and confabulations), we found that the AI CI elicited more accurate information than the other tools. The implications for society include that the AI CI provides an alternative means of effectively and efficiently recording high-quality evidential statements from victims and witnesses.

Notes

  1. AI CI for research: The AI CI has tremendous potential for studying the effectiveness of the CI in different contexts, and it is widely accessible. For the purposes of this research, we created a research version of the AI CI, and the data presented in the present study were collected using this version. This research version of the AI CI is available to anyone who wants to use it for research purposes.

References

  1. Brandtzaeg PB, Følstad A (2017) Why people use chatbots. In: International conference on internet science. Springer, Cham, pp 377–392

  2. Buhrmester MD, Talaifar S, Gosling SD (2018) An evaluation of Amazon’s Mechanical Turk, its rapid rise, and its effective use. Perspect Psychol Sci 13:149–154. https://doi.org/10.1177/1745691617706516

  3. Bull R (2013) What is ‘believed’ or actually ‘known’ about characteristics that may contribute to being a good/effective interviewer? Investig Interviewing: Res Pract 5:128–143

  4. Cortina LM, Magley VJ (2003) Raising voice, risking retaliation: events following interpersonal mistreatment in the workplace. J Occup Health Psychol 8(4):247

  5. Dienes Z, Mclatchie N (2018) Four reasons to prefer Bayesian analyses over significance testing. Psychon Bull Rev 25:207–218. https://doi.org/10.3758/s13423-017-1266-z

  6. EEOC (2016) Select task force on the study of harassment in the workplace. www.eeoc.gov/eeoc/task_force/harassment/report.cfm. Accessed 16 Jan 2018

  7. EHRC (2018) Turning the tables: ending sexual harassment at work. https://www.equalityhumanrights.com/en/publication-download/turning-tables-ending-sexual-harassment-work. Accessed 4 May 2018

  8. Fisher R, Geiselman R (2010) The cognitive interview method of conducting police interviews: eliciting extensive information and promoting therapeutic jurisprudence. Int J Law Psychiatry 33(5–6):321–328. https://doi.org/10.1016/j.ijlp.2010.09.004

  9. Fisher R, Milne R, Bull R (2011) Interviewing cooperative witnesses. Curr Dir Psychol Sci 20:16–19. https://doi.org/10.1177/0963721410396826

  10. Gabbert F, Hope L, Fisher R (2009) Protecting eyewitness evidence: examining the efficacy of a self-administered interview tool. Law Hum Behav 33:298–307. https://doi.org/10.1007/s10979-008-9146-8

  11. Gabbert F, Memon A, Allan K (2003) Memory conformity: can eyewitnesses influence each other’s memories for an event? Appl Cogn Psychol 17:533–543. https://doi.org/10.1002/acp.885

  12. Gudjonsson G (2018) The psychology of interrogations and confessions: a handbook. John Wiley & Sons, London

  13. Hershkowitz I, Orbach Y, Lamb ME, Sternberg KJ, Horowitz D (2001) The effects of mental context reinstatement on children's accounts of sexual abuse. Appl Cogn Psychol 15(3):235–248

  14. Herbenick D, van Anders SM, Brotto LA, Chivers ML, Jawed-Wessel S, Galarza J (2019) Sexual harassment in the field of sexuality research. Arch Sex Behav 48(4):997–1006

  15. Kaplan R, Van Damme I, Levine L, Loftus E (2016) Emotion and false memory. Emot Rev 8:8–13. https://doi.org/10.1177/1754073915601228

  16. Kebbell M, Milne R (1998) Police officers’ perceptions of eyewitness performance in forensic investigations. J Soc Psychol 138:323–330. https://doi.org/10.1080/00224549809600384

  17. Köhnken G, Milne R, Memon A, Bull R (1999) The cognitive interview: a meta-analysis. Psychol Crime Law 5(1–2):3–27. https://doi.org/10.1080/10683169908414991

  18. Koriat A, Goldsmith M (1996) Monitoring and control processes in the strategic regulation of memory accuracy. Psychol Rev 103(3):490

  19. Lamb M, Hershkowitz I, Orbach Y, Esplin P (2008) Factors affecting the capacities and limitations of young witnesses. In: Tell me what happened: structured investigative interviews of child victims and witnesses. Wiley, Chichester, pp 19–61

  20. London K, Henry LA, Conradt T, Corser R (2013) Suggestibility and individual differences in typically developing and intellectually disabled children. In: Ridley AM, Gabbert F, La Rooy DJ (eds) Suggestibility in legal contexts: psychological research and forensic implications. Wiley-Blackwell, Chichester, pp 129–148

  21. Meissner C, Kassin S (2002) “He’s guilty!”: investigator bias in judgments of truth and deception. Law Hum Behav 26:469–480. https://doi.org/10.1023/a:1020278620751

  22. Memon A, Bull R (1991) The cognitive interview: its origins, empirical support, evaluation and practical implications. J Community Appl Soc Psychol 1(4):291–307

  23. Memon A, Holley A, Wark L, Bull R, Koehnken G (1996) Reducing suggestibility in child witness interviews. Appl Cogn Psychol 10:503–518

  24. Memon A, Meissner C, Fraser J (2010) The cognitive interview: a meta-analytic review and study space analysis of the past 25 years. Psychol Public Policy Law 16:340–372

  25. Milne R, Bull R (1999) Investigative interviewing: psychology and practice. John Wiley & Sons Ltd, Chichester

  26. Milne R, Bull R (2002) Back to basics: a componential analysis of the original cognitive interview mnemonics with three age groups. Appl Cogn Psychol 16(7):743–753

  27. Milne R, Bull R (2016) Investigative interviewing: investigation and probative value. J Forensic Pract. https://doi.org/10.1108/JFP-01-2016-0006

  28. Minhas R, Walsh D, Bull R (2017) Developing a scale to measure the presence of possible prejudicial stereotyping in police interviews with suspects: the Minhas investigative interviewing prejudicial stereotyping scale (MIIPSS). Police Pract Res 18:132–145. https://doi.org/10.1080/15614263.2016.1249870

  29. Ministry of Justice (2011) Achieving best evidence in criminal proceedings: guidance on interviewing victims and witnesses, and guidance on using special measures. Ministry of Justice, London

  30. Mortimer A, Shepherd E (1999) Frames of mind: schemata guiding cognition and conduct in the interviewing of suspected offenders. In: Memon A, Bull R (eds) Handbook of the psychology of interviewing. Wiley, Chichester, pp 293–315

  31. Murphy G, Greene C (2016) Perceptual load affects eyewitness accuracy and susceptibility to leading questions. Front Psychol 7:1322. https://doi.org/10.3389/fpsyg.2016.01322

  32. Narchet F, Meissner C, Russano M (2011) Modeling the influence of investigator bias on the elicitation of true and false confessions. Law Hum Behav 35:452–465. https://doi.org/10.1007/s10979-010-9257-x

  33. Perfect T, Wagstaff G, Moore D, Andrews B, Cleveland V, Newcombe S, Brown L (2008) How can we help witnesses to remember more? It’s an (eyes) open and shut case. Law Hum Behav 32:314–324. https://doi.org/10.1007/s10979-007-9109-5

  34. Poole D, Lamb M (1998) Investigative interviews of children: a guide for helping professionals. American Psychological Association, Washington, DC

  35. Prendinger H, Ishizuka M (eds) (2013) Life-like characters: tools, affective functions, and applications. Springer Science & Business Media

  36. Rahman A, Al Mamun A, Islam A (2017) Programming challenges of chatbot: current and future prospective. In: 2017 IEEE region 10 humanitarian technology conference (R10-HTC). IEEE, pp 75–78. https://doi.org/10.1109/R10-HTC.2017.8288910

  37. Read J, Connolly D (2017) The effects of delay on long-term memory for witnessed events. In: Toglia MP, Read JD, Ross DF, Lindsay RCL (eds) Handbook of eyewitness psychology: memory for events, vol 1. Lawrence Erlbaum Associates, Mahwah, NJ, pp 117–155

  38. Ridley A (2013) Suggestibility: a history and introduction. In: Ridley AM, Gabbert F, La Rooy DJ (eds) Suggestibility in legal contexts: psychological research and forensic implications. Wiley-Blackwell, Chichester, pp 1–19

  39. Rossmo DK (2016) Case rethinking: a protocol for reviewing criminal investigations. Police Pract Res 17:212–228. https://doi.org/10.1080/15614263.2014.978320

  40. Santtila P, Korkman J, Sandnabba NK (2004) Effects of interview phase, repeated interviewing, presence of a support person, and anatomically detailed dolls on child sexual abuse interviews. Psychol Crime Law 10:21–35. https://doi.org/10.1080/1068316021000044365

  41. Shaw J, Porter S (2015) Constructing rich false memories of committing crime. Psychol Sci 26:291–301. https://doi.org/10.1177/0956797614562862

  42. Shawar B, Atwell E (2007) Chatbots: are they really useful? J Lang Technol Comput Linguist 22(1):29–49

  43. Shawar B, Atwell E (2005) Using corpora in machine-learning chatbot systems. Int J Corpus Linguist 10:489–516. https://doi.org/10.1075/ijcl.10.4.06sha

  44. Singh A (n.d.) Bayes factor (Dienes) calculator. https://medstats.github.io/bayesfactor.html. Accessed 5 Sep 2018

  45. Stein L, Memon A (2006) Testing the efficacy of the cognitive interview in a developing country. Appl Cogn Psychol 20:597–605. https://doi.org/10.1002/acp.1211

  46. Taylor D, Dando C (2018) Eyewitness memory in face-to-face and immersive avatar-to-avatar contexts. Front Psychol 9:507. https://doi.org/10.3389/fpsyg.2018.00507

  47. Tuckey M, Brewer N (2003) The influence of schemas, stimulus ambiguity, and interview schedule on eyewitness memory over time. J Exp Psychol Appl 9:101–118

  48. Turtle J, Yuille J (1994) Lost but not forgotten details: repeated eyewitness recall leads to reminiscence but not hypermnesia. J Appl Psychol 79:260

  49. Vallano J, Compo NS (2015) Rapport-building with cooperative witnesses and criminal suspects: a theoretical and empirical review. Psychol Public Policy Law 21:85–89

  50. Walsh D, Bull R (2012) Examining rapport in investigative interviews with suspects: does its building and maintenance work? J Police Crim Psychol 27(1):73–84. https://doi.org/10.1007/s11896-011-9087-x

  51. Westera N, Kebbell M, Milne B (2011) Interviewing witnesses: do investigative and evidential requirements concur? Br J Forensic Pract 13:103–113. https://doi.org/10.1108/14636641111134341

  52. Wixted J, Ebbesen E (1997) Genuine power curves in forgetting: a quantitative analysis of individual subject forgetting functions. Mem Cognit 25:731–739. https://doi.org/10.3758/BF03211316

  53. YouTube (n.d.) https://www.youtube.com/watch?v=kg7k5x--k8o&t=22s. Accessed 3 Apr 2018

Acknowledgements

Thank you to software engineer Dylan Marriot for programming the AI CI used in this research.

Funding

Financial support for this project was obtained from All Turtles, which made all development of the tool, and all research on it, possible. The researchers were not paid for any specific results and preregistered the study. Still, the team recognises this financial support as a potential source of bias, which is part of the motivation for making the tool widely accessible to all researchers, including those not affiliated with All Turtles.

Author information

Corresponding author

Correspondence to Rashid Minhas.

Ethics declarations

Ethics statement

The present study was approved by the author’s home university and conducted in accordance with the British Psychological Society’s code of ethical conduct. A potential conflict of interest was declared throughout the ethics process because this study was funded by a San Francisco-based company called All Turtles on behalf of Spot, and one of the three authors of this paper is the co-creator of Spot. Spot is an AI chatbot that was based in part on the results of the present research but has since been modified for broader purposes. The most recent version of Spot can be accessed for free by individuals via https://app.talktospot.com/. The AI CI used in this study was specifically designed for research purposes; if you would like to conduct research using this version, please contact one of the authors of the present paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Description of AI NLP training

Training the AI

To help the AI learn which words are important and in what contexts, we manually create tables that we feed into the AI. We provide examples, and it works out the relationships between words; the more examples we provide, the more concrete the links become. For example, we manually indicate to the AI that “boss” is a “job role”, so that it learns to ask follow-up questions about “boss”. For some categories of words, training models already exist; for example, we use a standard library of names. But no standard library exists for words related to workplace harassment and discrimination. Because of this, we have created three training datasets of words and phrases to train our AI.

Group 1 relates to times and dates. For this we manually filtered a pre-existing database, removing words that were too general for our context, like “a few” and other broad numerical descriptions that were not appropriate. Group 2 relates to locations: here we created a completely custom library of workplace-related terms, like “office” or “boardroom”. Group 3 relates to people, including roles, job titles, and names. We created a bespoke library of workplace-related descriptions, like “she is my boss” or “colleague”; the names library is a standard database that has been applied unmodified. A minimal sketch of how such tables might look is given below.
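To make the shape of these libraries concrete, the sketch below represents the three groups as simple term-to-category tables (Python; the terms are examples taken from the text above, not the actual Spot training data):

    # Group 1: times and dates -- a pre-existing library, manually
    # filtered to drop over-general terms such as "a few"
    time_terms = {"monday", "last week", "this morning", "9 am"}

    # Group 2: locations -- a fully custom, workplace-specific library
    location_terms = {"office", "boardroom", "meeting room", "kitchen"}

    # Group 3: people -- a bespoke library of roles and descriptions,
    # plus a standard names library applied unmodified
    role_terms = {"boss", "colleague", "manager", "she is my boss"}
    name_terms = set()  # stands in for the standard names library

    # The categories drive the follow-up questions: e.g. any hit in
    # role_terms tells the bot that a "job role" was mentioned
    ENTITY_TABLES = {
        "time": time_terms,
        "location": location_terms,
        "person": role_terms | name_terms,
    }

Note that in the real system, tables like these are training examples for a statistical NLP model rather than a literal lookup dictionary, so matching is more flexible than this sketch suggests.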

Our own training dataset of about 1000 sentences was created in four main stages. The first stage consisted of brainstorming what we expected to be asked, resulting in about 100 sentences. In the second stage, we harvested words and phrases from news articles describing accounts of workplace harassment and discrimination; this provided different syntax and word choices, and added to our database. In stage three, we used about 200 reports submitted to the team explicitly for research purposes (from talktospot.com) to add to our database. (Note that although many reports have been created using talktospot.com, we do not have access to them unless they are explicitly sent to the research team. This means we cannot assess the quality of the AI in those interactions.)

Currently, in stage four, we are developing industry-specific words and phrases based on the industries that are using our tool. Ultimately, the database will be continuously evolving, and the AI should become increasingly attuned to the relevant words and their contexts to improve the follow-up questions and the user experience.

Appendix 2

Analyses using the Bayes factor

Introduction

Bayes factors are useful for assessing the strength of evidence for a theory, and they allow researchers to draw conclusions that cannot be inferred from orthodox statistical methods alone. Orthodox statistics model the null hypothesis (H0), generally testing whether there is no difference between means; they reveal whether there is a statistical difference between means, but nothing else. Bayes factors can be used to make a three-way distinction, testing whether the data support the null hypothesis (H0), support the alternative hypothesis (H1), or provide no evidence either way. Bayes factors also challenge common assumptions about the importance of statistical power: a high-powered non-significant result is not always evidence for H0, while a low-powered non-significant result might be; similarly, a high-powered significant result might not be substantial evidence for H1. Finally, using Bayes factors, one can specify the hypothesis in a way that is not possible with a p value (Dienes and Mclatchie 2018).

To calculate a Bayes factor, one needs a model of H0 (usually that there will be no difference between means), a model of H1 (which needs to be specified, usually from the mean difference in a previous study), and a model of the data. The Bayes factor then provides a continuous measure of evidence strength for H1 over H0, rather than a sharp significance boundary. However, as a Bayes factor of 3 often aligns with p < 0.05, a Bayes factor of 3 or more is usually understood as substantial evidence in support of H1; for symmetry, substantial support for H0 is usually understood as a Bayes factor of less than 1/3 (Dienes and Mclatchie 2018).
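Concretely, under the half-normal model of H1 used later in this appendix (a standard formulation consistent with Dienes and Mclatchie 2018; the notation below is ours), the Bayes factor compares how well each hypothesis predicts the observed mean difference d:

    \[
    B_{H(0,x)} \;=\; \frac{P(d \mid H_1)}{P(d \mid H_0)}
               \;=\; \frac{\int_0^{\infty} f(d \mid \delta)\; 2\varphi(\delta;\,0,\,x^{2})\; \mathrm{d}\delta}{f(d \mid 0)}
    \]

where f(d | δ) is the likelihood of the observed difference given a true effect δ (normal, with the observed standard error), and 2φ(δ; 0, x²) is the half-normal density with SD x that models the directional predictions of H1.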

Therefore, in the present research, as well as examining the main effects with statistics, we evaluated the theories in terms of strength of evidence, using Bayesian hypothesis testing. Bayes factors seemed appropriate as the difference between the conditions was designed to be subtle, and the video stimuli were short, so we were expecting non-significant results in some comparisons. The Bayes Factors also allowed us to make more nuanced inferences about the data that did not depend on power calculations.

Methods

For our analyses, Bayes factors (B) were used to determine the strength of the evidence for the alternative hypothesis over the null (Singh n.d.). The notation BH(0, x) indicates that the predictions of H1 were modelled as a half-normal distribution with a standard deviation (SD) of x (Dienes and Mclatchie 2018). We used previous research into “cognitive” versus “standard” interviews to specify our hypothesis: cognitive interviews have been found to elicit a median of 34% more information than standard interviews (Köhnken et al. 1999). Therefore, the SD was set to x = 34% of the highest score in the present experiment, calculated separately for each set of comparisons (according to the highest score for that set). For correct responses, we predicted that the number of correct responses would increase with the sophistication of the reporting tool, and we used this SD (34%) to test the prediction. For the first analyses (overall correct responses), the SD was set to 6.08.
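For readers who wish to reproduce this kind of calculation, the following is a minimal sketch of a Dienes-style Bayes factor with a half-normal model of H1 (Python with SciPy; the mean difference and standard error are placeholder values, since the paper reports condition means in its figures but not their standard errors):

    import numpy as np
    from scipy import stats, integrate

    def bayes_factor_half_normal(mean_diff, se, sd_h1):
        """Bayes factor BH(0, sd_h1): H1 modelled as a half-normal
        with SD = sd_h1 (directional prediction), H0 as a point
        null at zero, data modelled as normal around mean_diff."""
        # Likelihood of the observed difference under the point null
        likelihood_h0 = stats.norm.pdf(mean_diff, loc=0.0, scale=se)

        # Marginal likelihood under H1: average the likelihood over
        # a half-normal prior on the true effect delta (delta >= 0)
        def integrand(delta):
            prior = 2.0 * stats.norm.pdf(delta, loc=0.0, scale=sd_h1)
            return stats.norm.pdf(mean_diff, loc=delta, scale=se) * prior

        likelihood_h1, _ = integrate.quad(integrand, 0.0, np.inf)
        return likelihood_h1 / likelihood_h0

    # Placeholder numbers for illustration only; sd_h1 = 6.08 is the
    # SD used for the overall-correct-responses comparisons below.
    b = bayes_factor_half_normal(mean_diff=2.0, se=1.0, sd_h1=6.08)
    print(f"BH(0, 6.08) = {b:.2f}")  # B > 3: substantial support for H1;
                                     # B < 1/3: substantial support for H0

The same computation can be checked against the online calculator cited above (Singh n.d.).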

Results

Overall correct responses

The Bayes Factors between AICI and Free Recall and between Questionnaire CI and Free Recall indicated that the evidence substantially supported the alternative hypothesis, BH = 182.55 and BH = 16.65, respectively; those between AICI and Basic Chat CI, between AICI and Questionnaire CI, and between Basic Chat CI and Free Recall were insensitive, BH = 2.54, BH = 0.59, and BH = 1.54, respectively; and that between Questionnaire CI and Basic Chat CI substantially supported the null hypothesis, BH = 0.07.

The Bayes Factors thus indicated that there was substantial evidence that Questionnaire CI and AICI elicited more correct items overall than Free Recall, even though only AICI did so significantly. They also indicated that there was substantial evidence to support the null (that there was no difference in the number of correct items) when it came to comparisons between Questionnaire CI and Basic Chat CI. Finally, more data were needed to explore the other comparisons. Therefore, while the significance testing indicated that there was no difference between AICI and Basic Chat CI, between AICI and Questionnaire CI, and between Basic Chat CI and Free Recall, the Bayes Factors indicated that the data did not support this conclusion.

Dialogue

For the dialogue items, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 2.49 (34% of the highest score) for these comparisons. The Bayes Factors supported the null hypothesis when comparing AICI and Free Recall, BH = 0.24, and AICI and Questionnaire CI, BH = 0.20, but were insensitive when comparing AICI and Basic Chat CI, BH = 0.69 (inspection of Fig. 2 shows a mean score of 6.73 for Basic Chat CI users and 7.33 for AICI users). Thus, participants generally performed similarly in all conditions (compared to AICI), but more data were needed to compare AICI and Basic Chat CI. While statistical analysis suggested that there was no difference between conditions, the Bayes Factors suggested that, for the comparison between the two chatbots, the data did not support this conclusion.

Action

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.15 (34% of the highest score) for these comparisons. The Bayes Factors supported the null hypothesis when comparing AICI and Free Recall, BH = 0.24, and AICI and Questionnaire CI, BH = 0.24. However, the Bayes Factor supported the alternative hypothesis when comparing Basic Chat CI and AICI, BH = 3.98 (inspection of Fig. 2 shows a mean score of 2.47 for Basic Chat CI users and 3.2 for AICI users). The strength of evidence thus indicated that participants performed similarly across these comparisons, except when comparing the chatbots, where the evidence suggested that the Basic Chat CI elicited fewer action items than the AICI. Therefore, again, the lack of significance when comparing the chatbots cannot be interpreted as support for the null, as the Bayes Factor indicates that the AICI performed substantially better than the Basic Chat CI.

Facts

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.54 (34% of the highest score) for these comparisons. Inspection of Fig. 3 revealed that the mean score for users of the AICI was lower than those for the Questionnaire CI and the Basic Chat CI, so rather than testing the hypothesis that AICI users would perform better than these conditions against the null (no difference between conditions), we tested the strength of evidence for the size of the differences. The Bayes Factors indicated that there was substantial evidence that the AICI elicited fewer factual items than the Basic Chat CI and the Questionnaire CI, BH = 814.10 and BH = 115.47, respectively. For the comparison between Free Recall and AICI, we re-set H1 to the original prediction. The Bayes Factor indicated that the results between Free Recall and AICI were insensitive, BH = 1.34.

Basic Chat CI was therefore significantly better at eliciting factual items than AICI, but the Bayes Factor indicated that Questionnaire CI also elicited substantially more items than AICI. However, to evaluate the performance of AICI against Free Recall, more data were needed (inspection of Fig. 3 shows a mean score of 2.5 for Free Recall users and 3.1 for AICI users). Thus, it was not possible to conclude that there was no difference between these conditions.

Description

Again, we focused only on comparisons between the AICI and the other conditions, and the SD was set to 1.46 (34% of the highest score) for these comparisons.

The Bayes Factors supported the alternative hypothesis when comparing AICI and Free Recall, BH = 50,214.61; AICI and Questionnaire CI, BH = 3059.80; and AICI and Basic Chat CI, BH = 57.14. Therefore, in this case, the Bayes Factor supported the significant results for these comparisons.

Overall incorrect responses

For incorrect responses, we expected the number of mistakes to decrease as the sophistication of the reporting tool increased (the SD was set to x = 0.83).

The Bayes Factors indicated that the comparisons between Questionnaire CI and Basic Chat CI, BH = 11.99, and between Questionnaire CI and AICI, BH = 296.03, supported the alternative hypothesis: the chatbots elicited substantially fewer mistakes than the Questionnaire CI. When compared to Free Recall, participants using the Questionnaire CI made substantially more mistakes, BH = 51.17. Comparisons between AICI and Basic Chat CI and between Free Recall and Basic Chat CI were insensitive, BH = 1.99 and BH = 1.54, respectively, while that between AICI and Free Recall supported the null hypothesis, BH = 0.29.

Thus, while only participants in the Questionnaire CI condition gave significantly more incorrect responses overall than those using Free Recall, the Bayes Factors indicated substantial evidence that the Questionnaire CI also encouraged more incorrect responses than both chatbots. The Bayes Factors allowed us to conclude that there was no difference in the number of incorrect responses between AICI and Free Recall, indicating that these two tools encouraged accuracy more than the other two. Finally, the Bayes Factors indicated that we could not conclude that there were no differences between Basic Chat CI and Free Recall, or between Basic Chat CI and AICI.

Errors

For these analyses, we focused again on comparisons between the AICI and the other conditions, and the SD was set to 0.28 for these comparisons.

The Bayes Factors indicated that the results between Basic Chat CI and AICI and between Questionnaire CI and AICI supported the alternative hypothesis, BH = 11.30 and BH = 8.61, respectively, while that between Free Recall and AICI supported the null hypothesis, BH = 0.26.

Therefore, while significance testing suggested that it made no difference which reporting tool participants used, Bayesian analysis indicated that participants using the AICI made fewer errors than those using the Questionnaire CI or the Basic Chat CI, and that there was no difference in the number of errors made between AICI and Free Recall.

Confabulations

For the final analyses, we also focused on comparisons between the AICI and the other conditions, and the SD was set to 0.57 for these comparisons.

The Bayes Factors indicated that the comparisons between Free Recall and AICI, BH = 4.30, and between Questionnaire CI and AICI, BH = 106.08, supported the alternative hypothesis (performance improved as the sophistication of the tool increased). However, the comparison between Basic Chat CI and AICI, BH = 0.48, was insensitive.

Therefore, while only the Questionnaire CI encouraged participants to confabulate significantly more than Free Recall, Bayesian hypothesis testing indicated that it also encouraged participants to confabulate more than the AICI. The results also suggested that, rather than there being no difference between Basic Chat CI and AICI (inspection of Fig. 2 shows a mean score of 1.07 for Basic Chat CI users and 0.97 for AICI users), there were not enough data to draw a conclusion either way.

Discussion

Statistical analyses indicated that the AICI elicited more correct responses without compromising accuracy, and that this chatbot was particularly good at eliciting descriptive details but could improve on fact gathering. However, these analyses failed to reveal nuances in the data that the Bayes Factors did.

We considered Bayesian hypothesis testing to be appropriate for this type of research: the differences between conditions were designed to be subtle, and the stimulus was a short video (1 min 45 s) that could not elicit dramatic differences in the absolute number of recalled items, so we anticipated that Bayes Factors might clarify the results. We also wanted to test the minimum number of participants possible; although we made power calculations to reach this number, Bayes Factors do not rely on power calculations, so we considered them suitable for clarifying the results. They also confirmed in many instances that the number of participants we had tested was sufficient.

The Bayes Factors allowed us to draw conclusions that were not possible using orthodox statistics, and in some instances they supported the statistics, adding weight to the implications. For instance, when it came to recalling correct information, significance testing indicated that the AICI helped people to recall more items overall than Free Recall, that the AICI was better than the other conditions at eliciting description, and that the Basic Chat CI was better than the AICI at fact gathering; the Bayes Factors supported these results.

However, when it came to non-significance, statistical analysis fell short in three ways; an example or two is given for each. First, while there was no significant difference in the number of action items between AICI and Basic Chat CI, the Bayesian evidence supported the notion that the AICI elicited more items. Second, a non-significant result is often taken as support for H0, but in several cases the Bayes Factors indicated that this conclusion could not be made. For instance, when comparing the number of dialogue items recalled by AICI and Basic Chat CI users, significance testing indicated that there was no difference between conditions, but the Bayes Factors did not support this conclusion; non-significance is not, in itself, evidence for the null. Third, Bayes Factors were helpful when comparisons did show support for the null. For example, the comparison between Questionnaire CI and Basic Chat CI showed substantial support for the null in the number of correct items recalled overall, something orthodox statistics could not demonstrate.

Thus, using Bayes Factors, we drew more nuanced conclusions when describing the effects, indicating that Bayesian hypothesis testing is a useful way of interpreting results in studies of this kind.

Appendix 3

Example of AI chatbot

Mild harassment scenario link

https://youtu.be/C0LMPW46EQs

Mild harassment scenario screenshot

Example transcript

  • 1. Please tell me everything you can remember about what happened. Try not to leave anything out, even if it seems trivial. I have as much time as you need.

  • 2. Thank you for telling me about that. You mentioned… Please elaborate.

  • 3. Thank you for telling me about that. Please provide specifics about the month, week, day, or time this happened.

  • 4. You mentioned an office. Please describe.

  • 5. You referred to Mike. Please tell me more about him or her.

  • 6. The way this situation made you feel is important for understanding the impact on your wellbeing. Please describe in more detail how you felt as you were experiencing the situation.

  • 7. How did it affect your wellbeing after it happened?

  • 8. Did you tell anyone about the event?

  • 9. We’re almost done. Before we finish, is there any other evidence of what happened? For example, screenshots, emails, meeting notes, text messages, or recordings?
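To illustrate how fixed CI prompts like these can be combined with the entity tables described in Appendix 1 to generate probes such as questions 4 and 5, here is a minimal sketch in Python (the entity terms and the literal keyword matching are simplified stand-ins for the actual AI CI, which uses a trained NLP model):

    import re

    # Simplified entity tables (see Appendix 1); illustrative terms only
    ENTITY_TABLES = {
        "location": {"office", "boardroom", "kitchen"},
        "person": {"boss", "manager", "colleague", "mike"},
    }

    FOLLOW_UPS = {
        "location": "You mentioned {term}. Please describe.",
        "person": "You referred to {term}. Please tell me more about him or her.",
    }

    OPENING = ("Please tell me everything you can remember about what "
               "happened. Try not to leave anything out, even if it "
               "seems trivial.")

    def follow_up_questions(account):
        """Scan a free-recall account for known entities and yield
        CI-style probes, mirroring questions 4 and 5 above."""
        words = set(re.findall(r"[a-z']+", account.lower()))
        for category, terms in ENTITY_TABLES.items():
            for term in sorted(words & terms):
                yield FOLLOW_UPS[category].format(term=term)

    print(OPENING)
    account = "Mike followed me into the office and stood too close."
    for question in follow_up_questions(account):
        print(question)

Run on the example account, this yields a location probe for “office” and a person probe for “Mike”, echoing the structure of the transcript above: an open free-recall prompt first, then targeted follow-ups driven by what the reporter actually mentioned.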


About this article

Cite this article

Minhas, R., Elphick, C. & Shaw, J. Protecting victim and witness statement: examining the effectiveness of a chatbot that uses artificial intelligence and a cognitive interview. AI & Soc (2021). https://doi.org/10.1007/s00146-021-01165-5

Keywords

  • Artificial intelligence
  • Victim statement
  • Witness statement
  • Memory recall
  • Workplace harassment