When child abuse is alleged, one of the first steps in the legal process is for police to conduct an investigative interview with the child. Since physical evidence is often lacking in child abuse cases (Hartley et al. 2013; Walsh et al. 2010), police interviewers’ ability to elicit detailed and accurate information from the child can be vital to the investigation outcomes. In some jurisdictions, the audio-visual recording of children’s interviews can further be used as their evidence-in-chief should the case progress to court (Australian Law Reform Commission 2010; Ministry of Justice 2011). Given the importance of children’s interviews, many studies over the past three decades have focused on examining interview techniques to elicit children’s reports (see Lamb 2016; Saywitz et al. 2017, for reviews). Today, numerous best practice guidelines exist to instruct interviewers in evidence-based techniques that facilitate complete and accurate reporting from children (e.g. Achieving Best Evidence (ABE), Ministry of Justice (2011); National Institute of Child Health and Human Development (NICHD) protocol, Lamb et al. (2007)).

While protocols differ in emphasis, there is consensus about key elements that constitute best practice interviewing, which are typically structured into separate phases. Broadly, in the opening phase, interviewers introduce themselves to the child and build rapport, explain the need to tell the truth, and provide conversational “ground rules” to advise the child about appropriate responses for the interview (e.g. rules about not guessing answers are common: “Tell me if you don’t know an answer”; see Brubacher et al. (2015) for a review). Next, interviewers progress to a transitional phase to shift the conversation to substantive topics by eliciting the topic of concern (e.g. alleged abuse); the transition in topics should occur in a non-leading manner, such as asking, “What have you come here to talk about today?”, and interviewers must avoid introducing details about alleged abuse (Earhart et al. 2018). Once the topic of concern is established, interviewers begin the substantive phase in which the alleged abuse is explored thoroughly. Open questions should be used to prompt the child for a narrative of abusive incidents (i.e. prompts that encourage elaborate responses without specifying what information is required, such as “What happened next?” and “Tell me more about (mentioned detail)” (Powell and Snow 2007)), and leading questions should be avoided (i.e. prompts that introduce details the child has not provided, such as “He touched you on the privates, didn’t he?” when the child has not mentioned touching). For children reporting repeated episodes of abuse, interviewers should prompt the child about each incident separately using labels such as “the last time” to isolate incidents (Guadagno and Powell 2009). At the end of the substantive phase, a few specific questions about particular details may be needed to follow up on any required evidentiary information the child has not yet provided (e.g. who the perpetrator was, when and where the alleged abuse occurred) (Victorian Law Reform Commission 2004). When evidential information has been elicited, interviewers close the interview and thank the child for their time.

Despite expert agreement on the core elements of best practice interviewing, interviewers have difficulty adhering to best practice guidelines in the field (e.g. Guadagno and Powell 2009; Luther et al. 2014; Powell and Hughes-Scholes 2009; Sternberg et al. 2001; Westcott and Kynan 2006). Deficits in a range of best practice skills have been found in field interviews, including interviewers failing to test children’s understanding of the truth and omitting ground rules in the opening phase (Bull 2010; Huffman et al. 1999; Warren et al. 1996; Westcott and Kynan 2006), asking contentious or leading questions to elicit abuse topics in the transitional phase (Hughes-Scholes and Powell 2013; Powell and Hughes-Scholes 2009), relying too heavily on specific and leading questions rather than open questions, and failing to isolate and label incidents of repeated alleged abuse sufficiently in the substantive phase (Guadagno and Powell 2009; Myklebust and Bjørklund 2006; Powell et al. 2007; Sternberg et al. 2001). Specialist child forensic interview training increases interviewers’ use of best practice skills (particularly open questioning) (Cederborg et al. 2013; Price and Roberts 2011; Yi et al. 2016), but these improvements often diminish once the training is complete (Lamb 2016; Smith et al. 2009; Sternberg et al. 2001). For example, Smith et al. (2009) found that, from a range of predictor variables, only the recency of interviewers’ training affected their use of open questions; just 1 month after completing a 10-day training course, interviewers’ use of open questions was similar to their pre-training levels. Recent research has identified that spacing interviewers’ training over multiple months supports the maintenance of best practice skills for longer periods post-training (such as up to 12 months post-training in Benson and Powell (2015) and up to 4 months post-training in Cederborg et al. (2021); see also Brubacher et al. (2021b) and Lindholm et al. (2016)). However, in many jurisdictions, child interview training is only offered over an isolated, shorter block of time (e.g. between 1 and 10 days in a review by Benson and Powell (2015); also see Brubacher et al. (2018) and Lamb (2016)). Thus, it is important to identify strategies and tools that can help maintain interviewers’ skills after they complete such courses.

One strategy that assists in improving and maintaining interviewers’ post-training use of open questions is providing them with ongoing interview feedback (Lamb et al. 2002a, b; Price and Roberts 2011; Powell 2008; see Lamb 2016 for a review). For example, Lamb et al. (2002a) found that receiving detailed individual feedback on recent interviews from training instructors (forensic and developmental psychologists), coupled with group discussions every 1–2 months, produced dramatic improvements in the proportion of open questions posed by eight interviewers following their completion of a 5-day intensive training course. When this feedback was withdrawn after approximately 1 year, interviewers’ open question use started reverting to pre-feedback levels. Despite findings supporting the use of ongoing feedback, providing it is an intensive and time-consuming task. Previous studies have typically had experts such as course instructors, psychologists, or academics provide the feedback; however, the sample sizes of these studies suggest that these experts can monitor only small numbers of interviewers (e.g. eight interviewers in Lamb et al. (2002a); 12 interviewers in Price and Roberts (2011); six interviewers in Orbach et al. (2000)). Indeed, experts likely already manage full workloads, and their provision of feedback carries high financial costs (see Cyr et al. 2021). Further, in some jurisdictions, legislation precludes non-investigators from viewing completed interview transcripts or recordings to protect children’s privacy, rendering feedback from experts impossible (Powell and Barnett 2015).

A more financially feasible alternative that avoids any privacy issues is for regular feedback to come from other interviewers (i.e. peer feedback; see Cyr et al. (2021) for discussion). Peer feedback can be provided more promptly and more frequently than expert or supervisor feedback (Topping 2009) and does not heavily burden workloads, since a second interviewer is often already present during child interviews and can offer feedback soon afterwards (see ABE, Ministry of Justice 2011; Westcott and Kynan 2004). Peer interviewers are themselves currently practicing child interviewing and thus may be abreast of contemporary issues in child forensic interviewing. In education disciplines, peer feedback has long been considered beneficial to student learning since it promotes students’ self-efficacy, confidence, and vicarious learning (e.g. Christensen and Kline 2001; Seroussi et al. 2019). Students also see value in peer feedback, reporting that it can be easier to interpret than instructor feedback (Falchikov 2005; Zheng 2012). However, peers often do not have the depth of knowledge of experts, so peer feedback can be highly variable in accuracy and quality (measured as deviation from expert feedback; Hovardas et al. 2014; Lai 2016; Sluijsmans et al. 2002; Zheng 2012; see Topping 2009 for review).

To date, two studies have examined the effects of peer feedback on child forensic interviewers’ performance. In the first study, Stolzenberg and Lyon (2015) recruited law students undertaking a child interviewing unit who were required to discuss and peer review each other’s interviews during weekly peer group meetings moderated by an expert. Students improved their use of open questions (and reduced specific questions) over the semester. In the second study, Cyr et al. (2021) examined the utility of peer review as a form of “refresher training” in a sample of police interviewers who had completed their formal training in child interviewing at least a year earlier. Some interviewers met in peer groups for 9 h over a 6-month period to review each other’s interviews; instructions were provided to guide their sessions. This practice marginally increased open question use and increased interviewers’ adherence to a child interviewing protocol with discrete phases. Other interviewers received refresher training from an expert or from online activities; these refresher modes produced a similar pattern of results to the peer meetings group, but occasionally the gains made by the expert feedback group slightly surpassed those of the other modalities. In both studies, the accuracy of peer feedback was not assessed but was likely high due to the provision of expert moderation (Stolzenberg and Lyon 2015) or feedback instructions (Cyr et al. 2021) to help scaffold peer reviewers’ feedback.

One recent study has examined child forensic interviewers’ ability to write constructive, guideline-scaffolded peer reviews of child interviews (Brubacher et al. 2021a). Sixty interviewers were supplied with two guidelines: (1) specific directions to identify question types in the transcript and (2) a checklist of best practice skills. They were instructed to use these guidelines to help them write a 500-word peer review of a standardised child forensic interview transcript. Experts rated how constructive the 500-word reviews were in directing the interviewer to specific improvements. Approximately one third of the reviews provided high-quality feedback with specific, actionable improvements for the interviewer, and about half provided moderate-quality feedback with vague recommendations for improvement. Further, research assistants coded whether comments in the 500-word reviews were about skills mentioned on the provided checklist. On average, participants mentioned 10 out of 27 checklist items in their reviews, and checklist items were significantly more likely to be written about than non-checklist features. This study suggests that a structured checklist can provide helpful scaffolding to interviewers’ peer reviews; however, whether child forensic interviewers could accurately apply the two provided guidelines to peers’ interviews in the first instance was not assessed.

In the current study, we aimed to develop a structured peer review tool that assesses a range of best practice skills and to test child forensic interviewers’ accuracy when using the tool to evaluate mock transcripts of a child forensic interview. Previous education research has found some differences in the accuracy of peer reviews for high- and low-quality work (Gielen et al. 2010; Patchan and Schunn 2015). Thus, we also wanted to examine any differences in the accuracy of peer assessments made using the tool when the interviewer’s performance was varied to depict strong, poor, or mixed adherence to best practice guidelines. We created a checklist of best practice behaviours that child interviews should contain within the opening, transitional, and substantive interview phases (see Appendix). In Study 1, we recruited child forensic interviewers who had recently graduated from a 10-day child interviewing course as participants. We determined the internal reliability of the checklist and explored whether participants’ accuracy using the tool was related to the quality of the interviewer’s performance in the transcript (strong, poor, mixed). In Study 2, we explored whether results could be replicated with a separate sample of child forensic interviewers from a different jurisdiction who received only a 4-day training course in child interviewing.

Study 1

Method

Participants

An a priori power analysis determined that a sample size of 42–60 should have sufficient power to detect large effects (ηp2 = 0.15–0.20). We recruited 56 police interviewers from one jurisdiction (nfemale = 24; nmale = 32). All police members who had completed the jurisdiction’s current child forensic interviewing course were invited to participate via email. Members who returned the consent form were sent the materials. The sample had a mean age of 37.13 years (SD = 8.27; range = 24–55 years) and consisted mostly of constable (n = 21) and senior constable (n = 30) ranking officers, with a small number of sergeants (n = 5). For the 53 participants who reported how long they had been police officers, length of service ranged between 1 and 21 years (M = 8.61; SD = 4.61). Further information about the jurisdiction has not been provided to ensure participant and jurisdiction anonymity.
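
To make this kind of calculation concrete, the sketch below (in Python; not the authors’ actual procedure or software) converts partial eta-squared values of 0.15 and 0.20 to Cohen’s f and solves for the total sample size of a three-group one-way ANOVA. The assumed alpha of 0.05 and power of 0.80 are ours and are not stated in the text.

```python
# Sketch of an a priori power analysis for a one-way ANOVA with three groups.
# Assumptions (not stated in the text): alpha = .05, power = .80.
from math import sqrt
from statsmodels.stats.power import FTestAnovaPower

for eta_sq in (0.15, 0.20):
    f = sqrt(eta_sq / (1 - eta_sq))  # convert partial eta-squared to Cohen's f
    n_total = FTestAnovaPower().solve_power(
        effect_size=f, alpha=0.05, power=0.80, k_groups=3)
    print(f"partial eta-squared = {eta_sq}: total N approx. {n_total:.0f}")
```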

Participants had all recently completed the child forensic interview training program offered by their jurisdiction’s police force, and all had little experience conducting child forensic interviews beforehand (M = 0.70 years; SD = 1.39; range = 0–5 years). The training program lasted 10 days and was delivered jointly by one senior police instructor and one academic expert in child interviewing, along with a range of guest lecturers, such as speech pathologists and lawyers. Course trainees learned best practice techniques to use throughout all stages of a child forensic interview, as well as how to identify and use different question types. Trainees had many opportunities to practice their interviewing skills during the training program and to receive feedback on their performance: they completed numerous mock interviews with adults playing the role of children and at least four interviews with real children (aged between 5 and 7 years). Trainees also learned and practiced using a coding scheme to classify and identify question types.

Materials and Procedure

The research was approved by the university’s human research ethics committee, and written consent was obtained from the police organisation and individual participants. Upon conclusion of the child forensic interviewing training program, instructors emailed participants the two materials detailed below: the evaluation tool and a mock child forensic interview transcript. Participants returned the completed materials to the research team within 28 days.

Evaluation Tool

Participants were provided with a document of instructions for completing their evaluation of the interviewer’s performance in a child interview transcript. See the Appendix for the full instructions. An appropriate reliance on open-ended questioning is well established as best practice in child interviews (see ABE, Ministry of Justice 2011; NICHD protocol, Lamb et al. 2007). Thus, the instructions first asked participants to categorise each question posed in the substantive phase of the transcript as an open-invitation, open-depth, open-breadth, facilitator, specific question, or leading question (see Powell et al. 2008). Question type categories aligned with those taught in the jurisdiction’s training program. Definitions and examples of each question type were provided in the instructions for participants to refer to while completing the task (see Appendix).

Since best practice child interviewing goes beyond asking open questions, the evaluation tool also provided participants with a structured checklist of other best practice skills. The checklist was created by the research team to capture participants’ evaluations of a range of interviewer behaviours in the transcript. It comprised 19 items that each reflected a recommended child interview component or technique. The items were grouped into the three interview phases: the opening phase (3 items), the transitional phase (3 items), and the substantive phase (13 items). Because the substantive phase of a child interview is an important and sizable interview phase, the checklist items evaluating this phase covered three different skill sets: questioning children about abuse (7 items), questioning about repeated abuse specifically (4 items), and eliciting information for evidential purposes (2 items). For each item, participants were instructed to tick a box indicating that the item was present (yes), absent (no), or not applicable in the transcript they evaluated.
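
To make the structure of the tool concrete, here is a minimal sketch (in Python) of how a phase-grouped checklist and one participant’s responses could be represented and scored against an expert key. The item wording and responses below are hypothetical abbreviations, not the actual checklist items.

```python
# Hypothetical, abbreviated checklist items grouped by interview phase.
checklist = {
    "opening":      ["Ground rules delivered", "Truth-telling discussed"],
    "transitional": ["Topic of concern elicited non-leadingly"],
    "substantive":  ["Open questions used to elicit narrative",
                     "Incidents of repeated abuse labelled separately"],
}

# Expert key and one participant's responses: "yes", "no", or "not applicable".
expert_key  = {"Ground rules delivered": "yes",
               "Topic of concern elicited non-leadingly": "no"}
participant = {"Ground rules delivered": "yes",
               "Topic of concern elicited non-leadingly": "not applicable"}

# Score each keyed item as correct (1) or incorrect (0).
scores = {item: int(participant.get(item) == answer)
          for item, answer in expert_key.items()}
print(scores)
```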

Interview Transcripts

Participants were randomly allocated one of three versions of a modified child interview transcript: transcript A (n = 19), transcript B (n = 18), or transcript C (n = 19). All transcripts depicted a 6-year-old child reporting two instances of sexual abuse to a female police officer during a child forensic interview. The transcripts were based upon a real interview from a different jurisdiction not included in the study, and specific details such as names and locations were changed for anonymity. The three versions were created by systematically manipulating the quality of the interviewer’s performance in the three phases of the interview: the opening, transitional, and substantive phases. In each interview phase, the interviewer’s performance was depicted as strongly adhering to best practice for the skills required in that phase (strong), poorly adhering to best practice (poor), or a mixed performance in which some best practice elements were present or attempted but others were missing or incorrect (mixed). For example, in transcript A, the opening phase strongly adhered to best practice, the transitional phase poorly adhered to best practice, and the substantive phase was mixed. Across the three versions, the interviewer’s performance during each phase of the interview was counterbalanced, such that an interview phase performed strongly in one transcript was presented as poor in another and mixed in the third.

Where appropriate, the child’s responses were also modified to match the interviewer’s performance. For example, if questions were presented as open-ended, the child’s response was modified to be longer and more detailed than when a question was presented as a closed question. These modifications were informed by research on children’s response patterns (e.g. Krahenbuhl and Blades 2006; Lamb et al. 2003) and the researchers’ own experiences interviewing over 400 children and reviewing over 200 child forensic interview transcripts.

Coding and Analysis

For each transcript version, the correct classification of each question and the correct response to each checklist item were agreed upon by two experts in child interviewing prior to data collection (i.e. when the materials were being created). Both experts were training instructors on the jurisdiction’s child forensic interview training course. One was a police interviewer working in child abuse areas, with approximately 10 years’ experience as a police officer. The second was an academic with expertise in child forensic interviewing who had interviewed hundreds of children using best practice techniques.

For each checklist item, participant responses were marked as correct or incorrect by referring to the expert answers. For participants’ identification of question types, each question was marked as correctly or incorrectly categorised by referring to the experts’ responses, and a percentage accuracy score was derived by dividing the number of correctly identified questions by the total number of questions in the substantive phase. Approximately one third of participants ignored the category of leading questions (n = 19, 33.9%); we thus excluded the leading question category, as it skewed results, and the question type identification percentage accuracy scores presented below do not include leading questions.
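
The sketch below (Python; our reconstruction, not the authors’ scoring script) shows one way the question-type accuracy score could be computed: participant labels are compared against the expert key for the substantive phase, with questions in the leading category excluded as described above (the exact exclusion rule is our assumption).

```python
def question_type_accuracy(participant_labels, expert_labels,
                           excluded=("leading",)):
    """Percentage of questions labelled the same way as the expert key.

    Both arguments are equal-length lists of question-type labels, one per
    substantive-phase question; questions whose expert label is in `excluded`
    are dropped before scoring.
    """
    scoreable = [(p, e) for p, e in zip(participant_labels, expert_labels)
                 if e not in excluded]
    correct = sum(p == e for p, e in scoreable)
    return 100 * correct / len(scoreable)

# Hypothetical example: 3 of 4 scoreable questions match the key -> 75.0
print(question_type_accuracy(
    ["open-invitation", "specific", "facilitator", "open-depth", "leading"],
    ["open-invitation", "specific", "specific",    "open-depth", "leading"]))
```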

Results

We first explored the internal reliability of the checklist items. Cronbach’s alpha was computed to determine the overall consistency of the checklist, and the internal consistency of each subsection (i.e. opening, transitional, and substantive phases). The Cronbach’s alpha for the overall checklist showed moderate internal consistency, at 0.53. For each subsection, the Cronbach’s alphas were 0.30 (opening), 0.47 (transitional), and 0.46 (substantive). Items were considered for deletion if doing so would increase the subsection Cronbach’s alpha, unless deletion would result in only one item remaining in a subsection. In the opening section, one item was removed to improve the alpha to 0.57: “Location, time, date, persons present, and child’s age all stated for recording”. In the transitional section, one item was removed to improve the alpha to 0.61: “Phrasing of the initial question to elicit the topic of concern is best practice (‘What are you here to talk to me about today?’) and avoids ‘Why’ or ‘Can you…’”. In the substantive section, one item was removed to improve the alpha to 0.53: “Interviewer clarifies whether the abuse happened once or more than once”.
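
For transparency about the computation (a minimal sketch in Python; our reconstruction rather than the authors’ code), Cronbach’s alpha for a set of dichotomously scored items can be calculated from the item variances and the variance of the total scores, as below; the toy data are hypothetical.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array, rows = participants, columns = items (0/1)."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    sum_item_var = x.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = x.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Hypothetical toy data: 4 participants x 3 checklist items (1 = correct).
print(round(cronbach_alpha([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]), 2))  # 0.75
```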

After removal of the three items, we examined participants’ accuracy using only the 16 remaining items. Participants had an average of 11.71 of the 16 items correct (73.19%, SD = 2.47, range = 8–16). Percentage accuracy scores were computed for each checklist section by dividing participants’ number of correct items per checklist section by the total number of items in that section. For example, if a participant answered six out of 12 items in the substantive phase correctly, they received a percentage accuracy score of 50% (i.e. 6 / 12 * 100 = 50). For each of the three sections of the interview, a one-way ANOVA was conducted to compare participants’ percentage accuracy across interviewer performance levels (strong, mixed, or poor interviewer performance).1 Results of each ANOVA are presented below; see Fig. 1 for a depiction of mean scores in each subsection. All follow-up t tests were conducted with a Bonferroni adjustment (α = 0.017).
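
The following sketch (Python, with hypothetical accuracy scores rather than the study data) illustrates the analysis structure for one checklist section: a one-way ANOVA across the three performance levels, followed by pairwise t tests evaluated against the Bonferroni-adjusted alpha of .05 / 3 (about .017).

```python
from itertools import combinations
from scipy import stats

# Hypothetical percentage accuracy scores for the three transcript groups.
strong = [100.0, 100.0, 87.5, 100.0, 87.5]
mixed  = [62.5, 50.0, 75.0, 62.5, 50.0]
poor   = [87.5, 100.0, 75.0, 87.5, 100.0]

f_stat, p_val = stats.f_oneway(strong, mixed, poor)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3f}")

# Pairwise follow-up t tests, compared against the adjusted alpha of .017.
groups = [("strong", strong), ("mixed", mixed), ("poor", poor)]
for (name_a, a), (name_b, b) in combinations(groups, 2):
    t_stat, p_val = stats.ttest_ind(a, b)
    print(f"{name_a} vs {name_b}: t = {t_stat:.2f}, p = {p_val:.3f}, "
          f"significant = {p_val < .05 / 3}")
```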

Fig. 1 Mean participant accuracy on checklist items (error bars represent standard errors)

Opening

The ANOVA for percentage accuracy in the opening phase was significant, F(2, 53) = 14.14, p < 0.001, ηp2= 0.35. Follow-up t tests showed that participants had a significantly lower percentage accuracy when evaluating the mixed opening phase (M = 60.53, SD = 39.37) than the strong (M = 100.00, SD = 0.00) and the poor opening phases (M = 94.44, SD = 16.17), t(18) = 4.37, p < 0.001, d = 1.46, and t(24) = 3.46, p = 0.002, d = 1.15, respectively. There was no difference in participants’ accuracy when evaluating the strong and poor opening phases, t(17) = 1.46, p = 0.16, d = 0.51.

Transitional

The ANOVA for checklist answers in the transitional phase was significant, F(2, 53) = 24.97, p < 0.001, ηp2= 0.49. Follow-up t tests showed that participants’ percentage accuracy was significantly lower when the transitional phase was poor (M = 39.47, SD = 39.37) than when it was strong (M = 94.74, SD = 15.77) or mixed (M = 91.67, SD = 19.17), t(24) = 5.68, p < 0.001, d = 1.89, and t(26) = 5.17, p < 0.001, d = 1.72, respectively. There was no difference in participants’ accuracy when evaluating the strong and mixed transitional phases, t(35) = 0.53, p = 0.60, d = 0.18.

Substantive

The ANOVA was significant, F(2, 53) = 7.07, p = 0.001, ηp2= 0.23. Results followed the same pattern as the opening phase. That is, participants had a significantly lower percentage accuracy when evaluating the mixed substantive phase (M = 59.21, SD = 9.17) than the strong (M = 78.70, SD = 20.05) and the poor substantive phases (M = 75.45, SD = 16.77), t(24) = 3.77, p = 0.001, d = 1.30, and t(36) = 3.70, p = 0.001, d = 1.23, respectively. There was no difference in participants’ accuracy when evaluating the strong and poor substantive phases, t(35) = 0.54, p = 0.59, d = 0.18.

Since participants’ accuracy when evaluating the substantive interview phase was significantly related to the interviewer’s performance, we further examined the three subsections of the substantive phase to determine whether a particular component was driving the significant results: items about questioning, repeated abuse considerations, or evidential requirements. For items about questioning, there was no significant difference between participants’ accuracy when evaluating strong, mixed, or poor substantive sections, F(2, 53) = 1.66, p = 0.20, ηp2= 0.06. For items about repeated abuse considerations, the ANOVA was significant, F(2, 53) = 19.09, p < 0.001, ηp2= 0.42. Follow-up t tests showed that when the interviewer’s performance was mixed (M = 59.64, SD = 13.96), participants’ accuracy was significantly lower than when the interviewer’s adherence to best practice was strong (M = 85.18, SD = 17.05) or poor (M = 91.23, SD = 18.73), t(35) = 5.00, p < 0.001, d = 1.69, and t(35) = 5.89, p < 0.001, d = 1.96, respectively. There was no difference in participants’ accuracy when evaluating the strong and poor substantive phases, t(35) = 1.03, p = 0.31, d = 0.35.

For items about evidential requirements, the ANOVA was significant, F(2, 53) = 5.02, p = 0.01, ηp2= 0.16. Follow-up t tests showed that when the interviewer’s adherence to best practice was mixed (M = 39.47, SD = 42.75), participants’ accuracy was significantly lower than when it was strong (M = 77.78, SD = 30.79), t(35) = 3.11, p = 0.004, d = 1.07. When the interviewer’s adherence to best practice was poor (M = 63.16, SD = 36.67), participants’ accuracy did not significantly differ from when it was strong, t(35) = 1.31, p = 0.20, d = 0.44, or mixed, t(36) = 1.83, p = 0.08, d = 0.61.

Within-Subjects Comparisons

Collapsing over interview quality manipulations, we compared participants’ accuracy on each phase of the checklist to determine if participants were more accurate at a certain phase: opening, transitional, or substantive. A repeated measures ANOVA showed significant differences between participants’ percentage accuracy on the opening (M = 84.82, SD = 30.03), transitional (M = 75.00, SD = 36.93), and substantive phases (M = 70.98, SD = 17.84), F(2, 110) = 3.25, p = 0.04, ηp2= 0.06. Follow-up t tests showed that participants had significantly lower percentage accuracy on the substantive phase than the opening phase, t(55) = 2.87, p = 0.006, d = 0.35. Percentage accuracy on the transitional phase did not differ from the opening, t(55) = 1.42, p = 0.16, d = 0.19, or substantive phase, t(55) = 0.85, p = 0.40, d = 0.11.
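
A minimal sketch (Python; the long-format data frame below is hypothetical, not the study data) of how such a repeated-measures ANOVA can be run, with each participant contributing one accuracy score per checklist phase.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one accuracy score per participant per phase.
long = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "phase": ["opening", "transitional", "substantive"] * 4,
    "accuracy": [100.0, 50.0, 75.0, 50.0, 100.0, 58.3,
                 100.0, 50.0, 83.3, 50.0, 100.0, 66.7],
})

# One within-subjects factor (phase); prints an F test for the phase effect.
print(AnovaRM(long, depvar="accuracy", subject="participant",
              within=["phase"]).fit())
```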

Identifying Question Types

Participants’ identification of each question type is presented in Table 1. Participants were on average 81.40% accurate in their identification of question types (SD = 9.86, range = 57.20–96.59%). There were no differences in participants’ percentage accuracy of question type identification across the three transcript versions, F(2, 53) = 0.14, p = 0.87, ηp2 = 0.005. Since question types were identified throughout the substantive phase, we considered whether participants’ percentage accuracy of question type identification was related to their percentage accuracy on checklist items for the substantive phase. A linear regression showed that the variables were not related, F(1, 54) = 0.62, p = 0.80.

Table 1 Percentage of question types correctly identified

Discussion

For the opening and substantive interview phases, participants were least accurate when evaluating the transcripts depicting a mixed adherence to best practice. Previous work from the education field has found that work requiring nuanced improvements — like the mixed transcripts — is particularly difficult for peer reviewers to assess accurately (Gielen et al. 2010; Patchan and Schunn 2015). It may be that experts — but not peers — have the detailed knowledge of best practice to identify the nuanced improvements required in the mixed versions of the opening and substantive phases.

Results for the transitional phase followed a different pattern: this phase was least accurately reviewed when it was presented as a poor adherence to best practice. It is unclear why the poor transitional phase was least accurately evaluated. It is possible that the participants had particularly low knowledge of best practice interviewing procedures for the transitional phase and did not pick up on poor practices. Previous research has demonstrated that students with a low ability to peer review provided more praise (rather than criticisms), whereas students with a high ability to peer review provided more criticisms (Patchan and Schunn 2015). If our Study 1 participants had a low understanding of best practice in the transitional phase, they may have perceived all transcript versions overly positively and missed critiquing the interviewer for improvements (of which the poor version required the most improvement).

After completing Study 1, we sought to determine whether the results could be replicated with another jurisdiction (see Earp and Trafimow 2015; Simons 2013 for the importance of replicability). In Study 2, we tested the replicability of the results with a sample of child forensic interviewers who had received a shorter training program than the sample in Study 1 (4 days instead of 10 days of training).

Study 2

Method

Participants

Based on the effect sizes from Study 1 (ηp2 = 0.23–0.49), an a priori power analysis determined that between 15 and 39 participants should provide sufficient power. We recruited 37 police officers from a different jurisdiction (nfemale = 24; nmale = 13). All police members who had completed the jurisdiction’s child forensic interviewing course were invited to participate via email. The 29 participants who provided demographic information had a mean age of 35.7 years (SD = 8.26, range = 23–53 years). Of the 28 participants who reported their rank, most were constables (n = 15), with fewer senior constables (n = 6) and sergeants (n = 7). The sample had a similar number of years’ experience as police officers (M = 9.35, SD = 6.75, range = 1.5–27 years) to our Study 1 sample, t(43) = 0.52, p = 0.61, d = 0.14. Further information about the jurisdiction has not been provided to ensure participant and jurisdiction anonymity.

Participants were eligible for the study if they had recently completed the child forensic interview training program offered by their jurisdiction’s police force. This training program differed from that of participants in Study 1: it lasted only 4 days, and trainees were provided fewer opportunities to practice their interviewing skills; they completed two mock interviews with an adult playing the role of a child and only one interview with a real child (aged 5–6 years). However, like the training in Jurisdiction 1, the course was instructed by an academic expert in child forensic interviewing as well as guest lecturers (e.g. lawyers, social workers), and it covered best practice techniques to use at all stages of a child forensic interview and the same coding scheme for classifying question types. Upon entering their training, the sample had little child forensic interviewing experience (M = 1.21 years, SD = 4.56, range = 0–20 years).

Materials and Procedure

The research was approved by the University’s Human Research Ethics Committee, and written consent was obtained from the police organisation and individual participants. Study 2 used the same materials as Study 1 with one difference: the checklist evaluation tool was reduced to the 16 items determined in Study 1. Upon conclusion of the child interviewing training program, instructors emailed participants the revised evaluation tool and a mock child forensic interview transcript (transcripts were the same as those used in Study 1). Again, participants were randomly allocated to receive one of the three counterbalanced versions of a modified child forensic interview transcript: transcript A (n = 12), transcript B (n = 12), or transcript C (n = 13). Participants returned the completed materials to the research team within 28 days.

Percentage accuracy scores were computed for participants’ checklist answers by dividing the number of items answered accurately (according to the same expert scorers from Study 1) by the total number of items in the checklist subsection. For participants’ identification of question types, a percentage accuracy score was derived by dividing the number of correctly identified questions by the total number of questions in the substantive phase. As in Study 1, approximately one third of participants ignored the category of leading questions (n = 13, 37.1%), so this category was omitted from participants’ scores.

Results

We first considered the internal reliability of the checklist items with the sample. Cronbach’s alpha was computed for the overall 16-item checklist which showed low internal consistency, at 0.29. When considering each subsection separately, reliability within each scale was acceptable, 0.70 (opening), 0.66 (transitional), and 0.59 (substantive).

Participants achieved an average of 10.54 out of 16 items correct on the checklist (65.88%, SD = 2.23, range = 6–15). We explored participants’ percentage accuracy when evaluating each interview phase separately. For each of the three sections of the interview, a one-way ANOVA was conducted to compare participants’ percentage accuracy across interviewer performance levels (strong, mixed, or poor interviewer performance).1 Results of each ANOVA are presented below; see Fig. 1 for a depiction of mean scores in each subsection. All follow-up t tests were conducted with a Bonferroni adjustment (α = 0.017).

Opening

The ANOVA was significant, F(2, 34) = 28.61, p < 0.001, ηp2= 0.63. Follow-up t tests showed that participants had a significantly lower percentage accuracy when evaluating the mixed opening phase (M = 19.23, SD = 25.32) than the strong (M = 100.00, SD = 0.00) and the poor opening phases (M = 70.83, SD = 39.65), t(12) = 11.03, p < 0.001, d = 4.61, and t(23) = 3.91, p = 0.001, d = 1.63, respectively. There was no difference in participants’ accuracy when evaluating the strong and poor opening phases, t(11) = 2.55, p = 0.03, d = 1.09.

Transitional

The ANOVA was significant, F(2, 34) = 11.20, p < 0.001, ηp2= 0.40. Follow-up t tests showed that participants’ percentage accuracy was significantly lower when the transitional phase was poor (M = 33.33, SD = 32.57) than when it was strong (M = 84.61, SD = 31.52) or mixed (M = 87.50, SD = 31.07), t(23) = 4.00, p = 0.001, d = 1.68, and t(22) = 4.17, p < 0.001, d = 1.80, respectively. There was no difference in participants’ accuracy when evaluating the strong and mixed transitional phases, t(23) = 0.23, p = 0.82, d = 0.01.

Substantive

The ANOVA was significant, F(2, 34) = 6.94, p = 0.003, ηp2= 0.29. Follow-up t tests showed that participants had a significantly lower percentage accuracy when evaluating the mixed substantive phase (M = 52.78, SD = 12.48) than the poor substantive phase (M = 77.56, SD = 20.52), t(23) = 3.6, p = 0.001, d = 1.51. Percentage accuracy for evaluations of the strong substantive phase (M = 66.67, SD = 15.49) did not significantly differ from accuracy of the mixed or the poor substantive phases, t(22) = 2.42, p = 0.02, d = 1.03, and t(23) = 1.49, p = 0.15, d = 0.62, respectively.

Since participants’ accuracy when evaluating the substantive interview phase was significantly related to the interviewer’s performance, we examined each of the three subsections of the substantive phase. For items about questioning, there was no significant difference between participants’ accuracy when evaluating strong, mixed, or poor substantive sections, F(2, 34) = 2.77, p = 0.08, ηp2 = 0.14.

The ANOVA for accuracy on checklist items about repeated abuse considerations was significant, F(2, 34) = 11.83, p < 0.001, ηp2=0.41. Follow-up t tests showed that when the interviewer’s adherence to best practice was mixed (M = 47.23, SD = 26.43), participants were significantly less accurate than when adherence was strong (M = 91.67, SD = 15.08) or poor (M = 76.92, SD = 25.04), t(22) = 5.06, p < 0.001, d = 2.18 and t(23) = 2.86, p = 0.008, d = 1.21, respectively. There was no difference between participants’ accuracy on strong and poor transcripts, t(20) = 1.80, p = 0.09, d = 0.74.

The ANOVA for accuracy on items about evidential requirements was also significant, F(2, 34) = 5.46, p = 0.009, ηp2=0.24. Follow-up t tests showed that when the interviewer’s adherence to best practice was mixed (M = 37.50, SD = 37.69), participants were significantly less accurate than when it was poor (M = 84.65, SD = 31.52), t(23) = 3.40, p < 0.001, d = 1.43. When the interviewer’s adherence was strong (M = 62.50, SD = 37.69), participants’ accuracy did not differ from either poor, t(23) = 1.60, p = 0.12, d = 0.67, or mixed adherence, t(22) = 1.63, p = 0.12, d = 0.69.

Within-Subjects Comparison

Collapsing over interview quality manipulations, we compared participants’ accuracy on each phase of the checklist to determine if participants were more accurate at a certain phase: opening, transitional, or substantive. A repeated measures ANOVA showed no significant differences between participants’ percentage accuracy on the opening (M = 62.16, SD = 43.15), transitional (M = 68.92, SD = 39.71), and substantive phases (M = 65.99, SD = 19.18), F(2, 56) = 0.27, p = 0.70 (Greenhouse–Geisser correction applied).

Identifying Question Types

Table 1 presents participants’ percentage accuracy at identifying each question type. Participants were on average 77.54% accurate in their identification of question types (SD = 10.31, range = 55.26–95.42%). Participants’ accuracy of question type identification did not differ across transcript versions, F(2, 32) = 2.91, p = 0.07, ηp2= 0.15. A linear regression was conducted to determine whether participants’ accuracy identifying question types predicted their substantive phase checklist accuracy. The model was significant, F(1, 33) = 6.56, p = 0.02, and accounted for 14% of the variance (adjusted) in checklist performance. Question type identification accuracy significantly predicted substantive phase checklist accuracy, B = 0.083 (SE = 0.33), β = 0.407, t = 2.56, p = 0.015; for every percentage point increase in participants’ identification of question types, their substantive phase checklist accuracy increased by 0.083 percentage points.
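
A minimal sketch (Python; the arrays are hypothetical, not the study data) of the simple linear regression reported above, in which question-type identification accuracy predicts substantive-phase checklist accuracy; the fitted slope is interpreted as the change in checklist accuracy per percentage point of identification accuracy.

```python
import statsmodels.api as sm

# Hypothetical percentage scores for six participants.
question_type_acc = [70.0, 82.5, 88.0, 91.5, 77.0, 95.0]  # predictor
checklist_acc     = [55.0, 64.0, 70.0, 72.0, 60.0, 75.0]  # outcome

X = sm.add_constant(question_type_acc)   # adds the intercept column
model = sm.OLS(checklist_acc, X).fit()
print(model.params)        # intercept and slope (change per percentage point)
print(model.rsquared_adj)  # adjusted variance explained
```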

Comparisons Between Studies 1 and 2

Overall, Studies 1 and 2 shared very similar patterns of results. We re-ran the analyses with the samples from both studies combined and with study included as a predictor variable. For participants’ responses to checklist items about the transitional phase, study did not have a main effect, F(1, 87) = 1.23, p = 0.27, ηp2 = 0.02, or an interaction with interviewer performance level (strong, mixed, or poor), F(2, 87) = 0.08, p = 0.92, ηp2 = 0.002. Similarly, for responses to items about the substantive phase, study did not have a main effect, F(1, 87) = 2.52, p = 0.17, ηp2 = 0.03, or an interaction with interviewer performance level, F(2, 87) = 1.45, p = 0.24, ηp2 = 0.03.

For participants’ responses to the opening phase checklist items, there was a main effect of study, F(1, 87) = 15.81, p < 0.001, ηp2=0.15, which was qualified by an interaction with transcript version, F(2, 87) = 4.89, p = 0.01, ηp2=0.10. When the opening section was mixed, participants in Study 2 had significantly lower percentage accuracy scores (M = 19.23, SD = 25.31) compared to participants in Study 1 (M = 60.52, SD = 39.37), t(30) = 3.33, p = 0.02, d = 1.24. There was no effect of study when the transcript was strong, since all participants scored at ceiling in both studies. There was also no effect of study when the transcript was poor, t(13) = 1.96, p = 0.07, d = 0.91.

Collapsing over interviewer performance manipulations, a 3 (checklist phase: opening, transitional, substantive) × 2 (study: 1, 2) mixed ANOVA showed a main effect of study, F(1, 91) = 10.74, p = 0.009, ηp2 = 0.12. Participants in Study 1 (M = 76.93, SD = 16.71) had a significantly higher accuracy score overall than participants in Study 2 (M = 65.69, SD = 15.37). Study had no interaction with checklist phase, F(2, 90) = 1.76, p = 0.18, ηp2 = 0.04. For accuracy identifying question types, participants in Study 1 (M = 81.40, SD = 9.86) and Study 2 (M = 77.53, SD = 10.31) performed similarly, t(89) = 1.79, p = 0.08, d = 0.40.
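
To illustrate the structure of the combined between-subjects analyses above (interviewer performance level × study), here is a minimal sketch in Python with a hypothetical data frame; it is not the authors’ analysis script, and the numbers are illustrative only.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical phase-accuracy scores: 2 participants per cell of the
# 3 (performance) x 2 (study) between-subjects design.
df = pd.DataFrame({
    "accuracy":    [95, 100, 60, 55, 90, 85, 80, 90, 20, 30, 70, 75],
    "performance": ["strong", "strong", "mixed", "mixed", "poor", "poor"] * 2,
    "study":       ["1"] * 6 + ["2"] * 6,
})

model = smf.ols("accuracy ~ C(performance) * C(study)", data=df).fit()
print(anova_lm(model, typ=2))   # main effects of performance and study,
                                # plus their interaction
```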

Discussion

Participants in Study 2 were least accurate when reviewing the opening and substantive phases showing a mixed adherence to best practice, replicating the trend of results from Study 1. Interestingly, our finding in Study 1 that the transitional phase was least accurately evaluated when it depicted a poor adherence to best practice was also replicated. We suspect that participants in both studies had low knowledge of best practice interviewing procedures for the transitional phase and did not notice poor practices. We discuss these results further in the General Discussion.

Replicability of our results was important given the differences in training between the two samples. Patterns of results across Studies 1 and 2 were strikingly similar for participants’ accuracy on the checklist across strong, poor, and mixed interviewer performances. Participants in Study 1 slightly outperformed participants in Study 2 on overall checklist accuracy, but both samples achieved a similarly moderate level of accuracy when identifying question types. The superior overall performance by Study 1 participants suggests that interviewers trained on a 10-day course (Study 1 participants) can peer review with somewhat higher accuracy than interviewers trained on a 4-day course (Study 2 participants).

General Discussion

The current studies tested a checklist tool to guide interviewers’ peer review of child forensic interview transcripts. We examined the accuracy of peer reviewers’ scores on the checklist for transcripts that had a strong, poor, or mixed performance of best practice skills. Across two studies, participants were significantly less accurate when evaluating opening and substantive interview phases with a mixed performance and the transitional interview phase with a poor performance.

One reason for participants’ reduced accuracy when evaluating mixed interviewer performances (in the opening and substantive phases) may be that the improvements required in mixed performances are nuanced and therefore difficult to identify. In mixed transcript phases, the interviewer’s performance of best practice skills was manipulated so that she completed some skills, lacked other skills, and attempted some skills but did not properly demonstrate them. Identifying the skills that were attempted but incomplete was likely particularly difficult for participants. For example, in the mixed opening phase, the interviewer delivered only two of the three ground rules required on the checklist (see Item 3), and in the mixed substantive phase, the interviewer separated incidents of repeated abuse but did not exhaust the child’s recall of each incident (see Item 15). Identifying the nuanced errors that the mixed interviewer made on these items likely required a thorough understanding of each best practice element and detailed attention to the transcript. In contrast, when the interviewer’s skills were depicted as strong or poor, her behaviours fully aligned with or opposed each checklist item in a manner that was likely more obvious. Indeed, previous work has found that peer reviews by high school students were least accurate when nuanced improvements had to be identified (Gielen et al. 2010; Patchan and Schunn 2015). Participants in our studies had completed training courses lasting only 10 days (Study 1) or 4 days (Study 2). Child forensic interviewers have previously been shown to lack best practice knowledge after courses of similar lengths (Lamb 2016; Smith et al. 2009; Sternberg et al. 2001), whereas prolonged, ongoing training lasting many months can provide a more thorough understanding of best practice (Benson and Powell 2015; Cederborg et al. 2021; Lindholm et al. 2016). It may be that only highly trained interviewers are suited to peer review when nuanced improvements are required.

An alternative explanation for participants’ lower accuracy on transcripts with a mixed performance in the opening and substantive phases is the rigid nature of a checklist as an evaluation tool. The checklist tool forced participants (and the experts) to select a dichotomous answer for each best practice behaviour; each behaviour was marked as either present or absent. To participants lacking detailed attention to best practice, the mixed behaviours may not have obviously fallen into either category, so the correct checklist answer may not have been apparent. Comparatively, behaviours that were strong or poor likely fit the present or absent categories more clearly. While it is established that scaffolding is helpful to facilitate peer review (Brubacher et al. 2021a; Sluijsmans et al. 2002), when scaffolding tools are too rigid, they may remove the flexibility peer reviewers require to individualise reviews to the characteristics and nuances of a particular interview. Some best practice skills are highly flexible so that interviewers can tailor their approach to a particular child (e.g. adapting questioning to a child’s developmental level; Poole and Lamb 1998). It may be that peer review tools require flexibility too; perhaps prompts with free-text space rather than categorical answer boxes are more appropriate for reviewing child interviews.

Interestingly, across both of our studies, participants’ accuracy evaluating the transitional phase followed a different pattern from the other phases: participants had the lowest accuracy when evaluating the transcript that poorly adhered to best practice. Our refined checklist contained two items for the transitional phase that each focused on eliciting the topic of concern when children were not immediately forthcoming (Items 5 and 6). Previous work suggests that interviewers do not have a strong understanding of best practice techniques to use in this situation: evaluations of field transcripts have shown that shortfalls in interviewers’ adherence to best practice during the transitional phase are particularly prominent when children are not forthcoming with the substantive topic (Hughes-Scholes and Powell 2013; Powell and Hughes-Scholes 2009). We believe that, despite recent training on the topic, participants in our samples may have had a poor understanding of what constitutes best practice in the transitional phase, which led to reduced accuracy when detecting poor interviewer behaviours in this phase. Compared with the poor transitional phase, the mixed and strong phases contained fewer poor behaviours to identify, so they were likely less affected by participants’ lack of understanding.

Our finding that mixed or poor interviewer performances were at times inaccurately evaluated is troubling. Given that interviewers have previously been shown to struggle to perform a range of best practice elements in the field (e.g. Guadagno and Powell 2009; Luther et al. 2014; Powell and Hughes-Scholes 2009; Sternberg et al. 2001; Westcott and Kynan 2006), it is likely that many interviews requiring ongoing feedback are indeed performed with a mixed or poor adherence to best practice. Previous work has reported more encouraging results: two studies have shown improvements in interviewers’ questioning after ongoing peer review exercises (Cyr et al. 2021; Stolzenberg and Lyon 2015), and one study found that, after completing a checklist, interviewers wrote peer reviews in a mostly constructive manner that focused on improvements (Brubacher et al. 2021a). However, these studies did not consider the quality of the interviews being reviewed, how many improvements the interviews needed, or how nuanced those improvements were. Results from our study highlight the importance of testing peer review tools across a range of interviews varying in quality, particularly mixed and poor interviews.

One major limitation of our studies was that a key best practice element of child interviewing was not present on our checklist: the development of rapport with children. While interviewing guidelines strongly and commonly recommend building rapport by asking children innocuous questions or asking them for a narrative of an innocuous event (see ABE, Ministry of Justice 2011; NICHD protocol, Lamb et al. 2007), it was beyond the scope of our research to include evaluations of rapport-building on the checklist. The decision to omit rapport-building was made for two reasons. First, including pre-substantive rapport-building in the child interview transcripts would substantially increase the transcript length and, by extension, the time demands on participants reading the transcripts. Second, rapport-building practices are not usually recorded in child forensic interview footage or transcripts, so our participants would not be accustomed to seeing rapport-building in a transcript. Given the importance of rapport-building in child interviewing, future research should assess peer reviewers’ ability to evaluate rapport-building efforts or consider adding a rapport evaluation component to peer review tools. Future work might also examine interviewers’ ability to self-evaluate their own rapport-building in jurisdictions where rapport-building is not typically recorded.

A second limitation of our study is that participants evaluated mock (modified) child forensic interview transcripts rather than transcripts of real (unaltered) interviews. While the decision to use mock transcripts reduced the ecological validity of our materials, it allowed us to manipulate interviewer performance for a controlled and systematic examination of participants’ reviews across different levels of interview quality (strong vs. mixed vs. poor). Further, to retain as much ecological validity as possible, the manipulated interview transcripts were modelled closely on a real interview (see Brubacher et al. (2021a) for a similar procedure used to create a mock transcript).

Implications

Our studies have important implications for the field and for future research. First, while peer reviewers require scaffolding to support their reviews (Brubacher et al. 2021a, b; Sluijsmans et al. 2002), our work suggests that a checklist tool is too rigid for interviews containing mixed adherence to best practice and that more flexible tools are needed to support peer reviews of child forensic interviews in the field. We suggest that open questions, Likert scales, or free-text space might be required, and we propose that future research test these formats for peer review.

Second, our results highlight the importance of considering the quality of the interviews being peer reviewed. While previous studies have had peer reviewers rate field interviews (e.g. Cyr et al. 2021; Stolzenberg and Lyon 2015) or one modified transcript (Brubacher et al. 2021a), the quality of the interview(s) being reviewed is often not reported. Our research has shown that strong interviews are relatively easy for peer reviewers to evaluate, while mixed and poor interviews are more challenging. Researchers should be encouraged to evaluate and report the quality of interviews being reviewed. Practitioners in the field should also consider interview quality, as peer review might be a useful mode of feedback for stronger interviewers, but more experienced or expert reviewers may be required to provide feedback for less experienced interviewers who are still developing their skills.

Last, when comparing results across our two studies, we found that overall accuracy was higher for participants in Study 1 (who received 10 days of training on child interviewing) than for participants in Study 2 (who received only 4 days). Police organisations looking to implement peer review should consider the level of training their members receive in child forensic interviewing, since our results suggest that longer training may produce more accurate peer reviewers.

Conclusion

Our study was the first to consider peer reviewers’ accuracy using a structured checklist to assess strong, poor, and mixed child interview transcripts. Past literature has shown that scaffolding is valuable to inform peer reviews (Brubacher et al. 2021a; Tornwell 2018). Our results have furthered this knowledge by demonstrating that a structured checklist tool is helpful for child forensic interviewers to peer review interview transcripts that strongly adhere to best practice. However, the tool had shortcomings for participants’ evaluations of sections of the child interview that were performed poorly or contained a mixed adherence to best practice. Research attention should focus on further developing and testing tools that are more flexible than a dichotomous checklist but still provide some structure to peer reviewers.