Testing an Evaluation Tool to Facilitate Police Officers’ Peer Review of Child Interviews

Providing child forensic interviewers with ongoing opportunities for feedback is critical to maintaining their interviewing skills. Given the practical difficulties of engaging experts to provide this feedback (such as costs and workloads), the current paper explores whether a structured evaluation tool can assist police interviewers to accurately peer review interviews. A structured checklist of best practice skills was created, and participants in two studies used it to evaluate mock transcripts of child interviews that ranged in quality. Transcripts were manipulated to present the opening, transitional, and substantive interview phases as a strong, poor, or mixed performance of best practice skills. In Study 1, 56 police participants from one jurisdiction evaluated the opening and substantive phases of the transcript less accurately when the transcript contained a mixed performance of best practice, and the transitional phase less accurately when it contained poor performance. In Study 2, a similar pattern of results was replicated with a sample of 37 police interviewers from a separate jurisdiction with shorter interview training. Results suggest that structured tools are helpful to inform peer review of child interviews, but tools that are too rigid might not be helpful when nuanced improvements are required.

When child abuse is alleged, one of the first steps in the legal process is for police to conduct an investigative interview with the child. Since physical evidence is often lacking in child abuse cases (Hartley et al. 2013; Walsh et al. 2010), police interviewers' ability to elicit detailed and accurate information from the child can be vital to investigation outcomes. In some jurisdictions, the audio-visual recording of children's interviews can further be used as their evidence-in-chief should the case progress to court (Australian Law Reform Commission 2010; Ministry of Justice 2011). Given the importance of children's interviews, many studies over the past three decades have focused on examining interview techniques to elicit children's reports (see Lamb 2016; Saywitz et al. 2017, for reviews). Today, numerous best practice guidelines exist to instruct interviewers in evidence-based techniques that facilitate complete and accurate reporting from children (e.g. Achieving Best Evidence (ABE), Ministry of Justice (2011); National Institute of Child Health and Human Development (NICHD) protocol, Lamb et al. (2007)).
While protocols differ in emphasis, there is consensus about key elements that constitute best practice interviewing, which are typically structured into separate phases. Broadly, in the opening phase, interviewers introduce themselves to the child and build rapport, explain the need to tell the truth, and provide conversational "ground rules" to advise the child about appropriate responses for the interview (e.g. rules about not guessing answers are common, such as "Tell me if you don't know an answer"; see Brubacher et al. (2015) for a review). Next, interviewers progress to a transitional phase to shift the conversation to substantive topics by eliciting the topic of concern (e.g. alleged abuse); the transition in topics should occur in a non-leading manner, such as asking, "What have you come here to talk about today?", and interviewers must avoid introducing details about alleged abuse (Earhart et al. 2018). Once the topic of concern is established, interviewers begin the substantive phase in which the alleged abuse is explored thoroughly. Open questions should be used to prompt the child for a narrative of abusive incidents (i.e. prompts that encourage elaborate responses without specifying what information is required, such as "What happened next?" and "Tell me more about (mentioned detail)" (Powell and Snow 2007)), and leading questions should be avoided (i.e. prompts that introduce details the child has not provided, such as "He touched you on the privates, didn't he?" when the child has not mentioned touching). For children reporting repeated episodes of abuse, interviewers should prompt the child about each incident separately using labels such as "the last time" to isolate incidents (Guadagno and Powell 2009). At the end of the substantive phase, a few specific questions about particular details may be needed to follow up on any required evidentiary information the child has not yet provided (e.g. who the perpetrator was, when and where the alleged abuse occurred) (Victorian Law Reform Commission 2004). When evidential information has been elicited, interviewers close the interview and thank the child for their time.
Despite expert agreement on the core elements of best practice interviewing, interviewers have difficulty adhering to best practice guidelines in the field (e.g. Guadagno and Powell 2009; Luther et al. 2014; Powell and Hughes-Scholes 2009; Sternberg et al. 2001; Westcott and Kynan 2006). Deficits in a range of best practice skills have been found in field interviews, including interviewers failing to test children's understanding of the truth and omitting ground rules in the opening phase (Bull 2010; Huffman et al. 1999; Warren et al. 1996; Westcott and Kynan 2006), asking contentious or leading questions to elicit abuse topics in the transitional phase (Hughes-Scholes and Powell 2013; Powell and Hughes-Scholes 2009), relying too heavily on specific and leading questions rather than open questions, and failing to isolate and label incidents of repeated alleged abuse sufficiently in the substantive phase (Guadagno and Powell 2009; Myklebust and Bjørklund 2006; Sternberg et al. 2001). Specialist child forensic interview training increases interviewers' use of best practice skills (particularly open questioning) (Cederborg et al. 2013; Price and Roberts 2011; Yi et al. 2016), but these improvements often diminish once the training is complete (Lamb 2016; Smith et al. 2009; Sternberg et al. 2001). For example, Smith et al. (2009) found that from a range of predictor variables, only the recency of interviewers' training affected their use of open questions, and at only 1 month after completing a 10-day training course, interviewers' use of open questions was similar to their pre-training levels. Recent research has identified that spacing interviewers' training over multiple months supports the maintenance of best practice skills for longer periods post-training (up to 12 months post-training in Benson and Powell (2015) and up to 4 months post-training in Cederborg et al. (2021); see also Brubacher et al. (2021b) and Lindholm et al. (2016)).
However, in many jurisdictions, child interview training is only offered over an isolated, shorter block of time (e.g. between 1 and 10 days in a review by Benson and Powell (2015); see also Brubacher et al. (2018) and Lamb (2016)). Thus, it is important to identify strategies and tools that can help maintain interviewers' skills after they complete such courses.
One strategy that assists in improving and maintaining interviewers' post-training use of open questions is providing them with ongoing interview feedback (Lamb et al. 2002a; Price and Roberts 2011; Powell 2008; see Lamb 2016 for a review). For example, Lamb et al. (2002a) found that receiving detailed individual feedback on recent interviews from training instructors (forensic and developmental psychologists), coupled with group discussions every 1-2 months, caused dramatic improvements in the proportion of open questions posed by eight interviewers following their completion of a 5-day intensive training course. When this feedback was withdrawn after approximately 1 year, interviewers' open question use started reverting to pre-feedback levels. Despite findings supporting the use of ongoing feedback, providing it is an intensive and time-consuming task. Previous studies have typically had experts such as course instructors, psychologists, or academics provide the feedback; however, the sample sizes of these studies suggest that such experts can monitor only small numbers of interviewers (e.g. eight interviewers in Lamb et al. (2002a); 12 interviewers in Price and Roberts (2011); six interviewers in Orbach et al. (2000)). Indeed, experts likely manage already full workloads, and there are high financial costs in their provision of feedback (see Cyr et al. 2021). Further, in some jurisdictions, legislation precludes non-investigators from viewing completed interview transcripts or recordings to protect children's privacy, rendering feedback from experts impossible (Powell and Barnett 2015).
A more financially feasible alternative that avoids any privacy issues is for regular feedback to come from other interviewers (i.e. peer feedback; see Cyr et al. (2021) for discussion). Peer feedback is available to interviewers more promptly and with greater frequency than expert or supervisor feedback could be provided (Topping 2009) and does not heavily challenge workloads, since a second interviewer is often present during child interviews and can provide feedback promptly (see ABE, Ministry of Justice 2011; Westcott and Kynan 2004). Peer interviewers are currently practicing child interviewing themselves and thus may be abreast of contemporary issues in child forensic interviewing practices. In education disciplines, peer feedback has long been considered beneficial to student learning since it promotes students' self-efficacy, confidence, and vicarious learning (e.g. Christensen and Kline 2001; Seroussi et al. 2019). Students also see the value in peer feedback, reporting that it can be easier to interpret than instructor feedback (Falchikov 2005; Zheng 2012). However, peers often do not have the depth of knowledge of experts, so peer feedback can be highly variable in accuracy and quality (measured as deviation from expert feedback; Hovardas et al. 2014; Lai 2016; Sluijsmans et al. 2002; Zheng 2012; see Topping 2009 for a review).
To date, two studies have examined the effects of peer feedback on child forensic interviewers' performance. In the first study, Stolzenberg and Lyon (2015) recruited law students undertaking a child interviewing unit; they were required to discuss and peer review each other's interviews during weekly peer group meetings that were moderated by an expert. Students improved their use of open questions (and reduced specific questions) over the semester. In the second study, Cyr et al. (2021) examined the utility of peer review as a form of "refresher training" in a sample of police interviewers who had completed their formal training in child interviewing at least a year earlier. Some interviewers met in peer groups for 9 h over a 6-month period to review each other's interviews; instructions were provided to guide their sessions. This practice marginally increased open question use and increased interviewers' adherence to a child interviewing protocol with discrete phases. Other interviewers received refresher training from an expert or from online activities; these refresher modes produced a similar pattern of results to the peer meetings group, but occasionally the gains made by the expert feedback group slightly surpassed the other modalities. In both studies, the accuracy of peer feedback was not assessed but was likely high due to the provision of expert moderation (Stolzenberg and Lyon 2015) or feedback instructions (Cyr et al. 2021) to help scaffold peer reviewers' feedback.
One recent study has examined child forensic interviewers' ability to write constructive peer reviews of child interviews that are scaffolded by guidelines (Brubacher et al. 2021a). Sixty interviewers were supplied with two guidelines: (1) specific directions to identify question types in the transcript and (2) a checklist of best practice skills. They were instructed to use these guidelines to help them write a 500-word peer review of a standardised child forensic interview transcript. Experts rated how constructive the 500-word reviews were in directing the interviewer to specific improvements. They found that approximately one third provided high-quality feedback with specific, actionable improvements for the interviewer, and about half included moderate-quality feedback with vague recommendations for improvements. Further, research assistants coded whether comments in the 500-word reviews were about skills mentioned on the provided checklist. On average, participants mentioned 10 out of 27 checklist items in their reviews, and checklist items were significantly more likely to be written about than non-checklist features. This study suggests that a structured checklist can provide helpful scaffolding to interviewers' peer reviews; however, the capacity of the child forensic interviewers to accurately use the two provided guidelines on peers' interviews in the first instance was not considered.
In the current study, we aimed to develop a structured peer review tool that assesses a range of best practice skills and to test child forensic interviewers' accuracy when using the tool to evaluate mock transcripts of a child forensic interview. Previous education research has found some differences in the accuracy of peer reviews for high- and low-quality work (Gielen et al. 2010; Patchan and Schunn 2015). Thus, we also wanted to examine any differences in the accuracy of peer assessments made using the tool when the interviewer's performance was varied to depict strong, poor, or mixed adherence to best practice guidelines. We created a checklist of best practice behaviours that child interviews should contain within the opening, transitional, and substantive interview phases (see Appendix). In Study 1, we recruited child forensic interviewers who had recently graduated from a 10-day child interviewing course as participants. We determined the internal reliability of the checklist and explored whether participants' accuracy using the tool was related to the quality of the interviewer's performance in the transcript (strong, poor, mixed). In Study 2, we explored whether results could be replicated with a separate sample of child forensic interviewers from a different jurisdiction who received only a 4-day training course in child interviewing.

Participants
An a priori power analysis determined that a sample size of 42-60 should have sufficient power to detect large effects (ηp² = 0.15-0.20). We recruited 56 police interviewers from one jurisdiction (n female = 24; n male = 32). All police members who had completed the jurisdiction's current child forensic interviewing course were invited to participate via email. Members who returned the consent form were sent the materials. The sample had a mean age of 37.13 years (SD = 8.27; range = 24-55 years) and consisted of mostly constable (n = 21) and senior constable (n = 30) ranking officers, with few sergeants (n = 5). For the 53 participants who reported how long they had been police officers, tenure ranged between 1 and 21 years (M = 8.61; SD = 4.61). Further information about the jurisdiction has not been provided to ensure participant and jurisdiction anonymity.
Participants had all recently completed the child forensic interview training program offered by their jurisdiction's police force, and all had little experience conducting child forensic interviews beforehand (M = 0.70 years; SD = 1.39; range = 0-5 years). The training program lasted 10 days and was instructed jointly by one senior police instructor and one academic expert in child interviewing, as well as a range of guest lecturers, such as speech pathologists and lawyers. Course trainees learned best practice techniques to use throughout all stages of a child forensic interview, as well as how to identify and use different question types. Trainees had many opportunities to practice their interviewing skills during the training program and receive feedback on their performance: they completed numerous mock interviews with adults playing the role of children and at least four interviews with real children (aged between 5 and 7 years). Trainees also learned and practiced using a coding scheme to classify and identify question types.

Materials and Procedure
The research was approved by the university's human research ethics committee, and written consent was attained from the police organisation and individual participants. Upon conclusion of the child forensic interviewing training program, instructors emailed participants the two materials detailed below: the evaluation tool and a mock child forensic interview transcript. Participants returned the completed materials to the research team within 28 days.

Evaluation Tool
Participants were provided with a document of instructions for completing their evaluation of the interviewer's performance in a child interview transcript (see the Appendix for the full instructions). An appropriate reliance on open-ended questioning is well established as best practice in child interviews (see ABE, Ministry of Justice 2011; NICHD protocol, Lamb et al. 2007). Thus, the instructions first asked participants to categorise each question posed in the substantive phase of the transcript as an open-invitation, open-depth, open-breadth, facilitator, specific, or leading question (see Powell et al. 2008). Question type categories aligned with those taught on the jurisdiction's training program. Definitions and examples of each question type were provided in the instructions for participants to refer to while completing the task (see Appendix).
Since best practice child interviewing goes beyond asking open questions, the evaluation tool also provided participants with a structured checklist of other best practice skills. The checklist was created by the research team to capture participants' evaluations of a range of interviewer behaviours in the transcript. It comprised 19 items that each reflected a recommended child interview component or technique. The items were grouped into the three interview phases: the opening phase (3 items), the transitional phase (3 items), and the substantive phase (13 items). Because the substantive phase of a child interview is an important and sizable interview phase, the checklist items evaluating this phase covered three different skills: questioning children about abuse (7 items), questioning about repeated abuse specifically (4 items), and eliciting information for evidential purposes (2 items). For each item, participants were instructed to tick a box indicating that the item was present (yes), absent (no), or not applicable in the transcript they evaluated.

Interview Transcripts
Participants were randomly allocated one of three versions of a modified child interview transcript: transcript A (n = 19), transcript B (n = 18), and transcript C (n = 19). All transcripts depicted a 6-year-old child reporting two instances of sexual abuse to a female police officer during a child forensic interview. The transcripts were based upon a real interview from a different jurisdiction not included in the study, and specific details such as names and locations were changed for anonymity. The three versions of the transcript were created by systematically manipulating the quality of the interviewer's performance in three phases of the interview: the opening, transitional, and substantive phases. In each interview phase, the interviewer's performance was depicted as either strongly adhering to best practice performance of the skills required in the phase (strong), poorly adhering to best practice (poor), or a mixed performance where some best practice elements were present or attempted but others were missing or incorrect (mixed). For example, in transcript A, the opening phase strongly adhered to best practice, the transition phase poorly adhered to best practice, and the substantive phase was mixed. Across the three versions of the transcript, the interviewer's performance during each phase of the interview was counterbalanced, such that an interview phase performed strongly in one transcript was presented as performed poorly in another transcript and mixed in another. Where appropriate, the child's responses were also modified to match the interviewer's performance. For example, if questions were presented as open-ended, the child's response was modified to be longer and more detailed than when a question was presented as a closed question. These modifications were informed by research on children's response patterns (e.g. Krahenbuhl and Blades 2006; Lamb et al. 2003) and the researchers' own experiences interviewing over 400 children and reviewing over 200 child forensic interview transcripts.

Coding and Analysis
For each transcript version, the correct classification of each question and the correct response to each checklist item was agreed upon by two experts in child interviewing prior to data collection (i.e. when the materials were being created). Both experts were training instructors on the jurisdiction's child forensic interview training course. One was a police interviewer working in child abuse areas, with approximately 10 years' experience as a police officer. The second was an academic with expertise in child forensic interviewing, who has interviewed hundreds of children using best practice techniques.
For each checklist item, participant responses were marked as correct or incorrect by referring to the expert answers. For participants' identification of question types, each question was marked as correctly or incorrectly categorised by referring to the experts' responses, and a percentage accuracy score was derived by dividing the number of correctly identified questions by the total number of questions in the substantive phase. Approximately one third of participants ignored the category of leading questions (n = 19, 33.9%). We thus excluded the leading question category, as it skewed results; all question type identification percentage accuracy scores presented below exclude leading questions.
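The scoring rule for question type identification can be sketched as follows. This is a hypothetical illustration, not the study's scoring script: the question numbers, labels, and expert key are invented, and dropping expert-keyed leading questions from the denominator is one plausible reading of how the leading category was excluded.

```python
# Hypothetical expert key: question number -> expert-agreed question type.
EXPERT_KEY = {1: "open-invitation", 2: "specific", 3: "leading", 4: "open-depth"}

def question_accuracy(participant_labels, expert_key, excluded=("leading",)):
    """Percentage of substantive-phase questions labelled the same way as the
    experts, ignoring questions the experts placed in an excluded category."""
    scored = {q: t for q, t in expert_key.items() if t not in excluded}
    correct = sum(participant_labels.get(q) == t for q, t in scored.items())
    return 100 * correct / len(scored)

# A participant who matches the expert key on 2 of the 3 non-leading questions:
score = question_accuracy({1: "open-invitation", 2: "open-depth", 4: "open-depth"},
                          EXPERT_KEY)  # 2/3 of scored questions correct
```

A participant's overall score is thus unaffected by how they labelled (or skipped) questions the experts agreed were leading.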

Results
We first explored the internal reliability of the checklist items. Cronbach's alpha was computed to determine the overall consistency of the checklist, and the internal consistency of each subsection (i.e. opening, transitional, and substantive phases). The Cronbach's alpha for the overall checklist showed moderate internal consistency, at 0.53. For each subsection, the Cronbach's alphas were 0.30 (opening), 0.47 (transitional), and 0.46 (substantive). Items were considered for deletion if doing so would increase the subsection Cronbach's alpha, unless deletion would result in only one item remaining in a subsection. In the opening section, one item was removed to improve the alpha to 0.57: "Location, time, date, persons present, and child's age all stated for recording". In the transitional section, one item was removed to improve the alpha to 0.61: "Phrasing of the initial question to elicit the topic of concern is best practice ('What are you here to talk to me about today?') and avoids 'Why' or 'Can you…'". In the substantive section, one item was removed to improve the alpha to 0.53: "Interviewer clarifies whether the abuse happened once or more than once".
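For reference, Cronbach's alpha compares the summed variance of the individual items against the variance of participants' total scores. The sketch below uses invented 0/1 checklist data, not the study's data, purely to show the computation.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for k items: alpha = k/(k-1) * (1 - sum(Vi) / Vt),
    where Vi are the item variances and Vt is the variance of the summed
    scores. `items` is a list of columns, one list of scores per item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each rater's total
    return k / (k - 1) * (1 - sum(pvariance(c) for c in items) / pvariance(totals))

# Toy correct/incorrect (1/0) data: 3 checklist items answered by 5 participants.
items = [[1, 1, 0, 1, 0],
         [1, 0, 0, 1, 0],
         [1, 1, 0, 1, 1]]
alpha = round(cronbach_alpha(items), 2)  # 0.79 for this toy data
```

Because the item-variance and total-variance terms enter as a ratio, using population or sample variances consistently yields the same alpha.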
After removal of the three items, we examined participants' accuracy using only the 16 remaining items.
Participants had an average of 11.71 of the 16 items correct (73.19%, SD = 2.47, range = 8-16). Percentage accuracy scores were computed for each checklist section by dividing participants' number of correct items per checklist section by the total number of items in that section. For example, if a participant answered six out of 12 items in the substantive phase correctly, they received a percentage accuracy score of 50% (i.e. 6 / 12 * 100 = 50). For each of the three sections of the interview, a one-way ANOVA was conducted to compare participants' percentage accuracy across interviewer performance levels (strong, mixed, or poor interviewer performance). Results of each ANOVA are presented below; see Fig. 1 for a depiction of mean scores in each subsection. All follow-up t tests were conducted with a Bonferroni adjustment (α = 0.017).
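The analysis just described compares between-condition variability against within-condition variability; a minimal sketch with made-up accuracy scores is below (the group scores are invented, not the study's data). With a single factor, partial eta squared reduces to SS_between / (SS_between + SS_within).

```python
def one_way_anova(groups):
    """F statistic and partial eta squared for independent groups."""
    scores = [x for g in groups for x in g]
    grand = sum(scores) / len(scores)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f = (ss_between / (len(groups) - 1)) / (ss_within / (len(scores) - len(groups)))
    return f, ss_between / (ss_between + ss_within)

# Hypothetical percentage accuracy scores under three transcript conditions:
strong, mixed, poor = [80, 90, 85], [55, 60, 65], [70, 75, 80]
f, eta_p2 = one_way_anova([strong, mixed, poor])
```

A significant F would then be followed up with pairwise t tests at the Bonferroni-adjusted alpha (0.05 / 3 comparisons ≈ 0.017).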

Substantive
The ANOVA was significant, F(2, 53) = 7.07, p = 0.001, ηp² = 0.23. Results followed the same pattern as the opening phase. That is, participants had a significantly lower percentage accuracy when evaluating the mixed substantive phase (M = 59.21, SD = 9.17) than the strong

Within-Subjects Comparisons
Collapsing over interview quality manipulations, we compared participants' accuracy on each phase of the checklist to determine if participants were more accurate at a certain phase: opening, transitional, or substantive. A repeated measures ANOVA showed significant differences between participants' percentage accuracy

Identifying Question Types
Participants' identification of each question type is presented in Table 1 (range = 57.20-96.59%). There were no differences in participants' percentage accuracy of question type identification across the three transcript versions, F(2, 53) = 0.14, p = 0.87, ηp² = 0.005. Since question types were identified throughout the substantive phase, we considered whether participants' percentage accuracy of question type identification was related to their percentage accuracy on checklist items for the substantive phase. A linear regression showed that the variables were not related, F(1, 54) = 0.62, p = 0.80.

Discussion
For the opening and substantive interview phases, participants were least accurate when evaluating the transcripts depicting a mixed adherence to best practice. Previous work from the education field has found that work requiring nuanced improvements - like the mixed transcripts - is particularly difficult for peer reviewers to assess accurately (Gielen et al. 2010; Patchan and Schunn 2015). It may be that experts - but not peers - have the detailed knowledge of best practice to identify the nuanced improvements required in the mixed versions of the opening and substantive phases.
Results for the transitional phase followed a different pattern - this phase was least accurately reviewed when it was presented as a poor adherence to best practice. It is unclear why the poor transitional phase was least accurately evaluated. It is possible that the participants had particularly low knowledge of best practice interviewing procedures for the transitional phase and did not pick up on poor practices. Previous research has demonstrated that students with a low ability to peer review provided more praise (rather than criticisms), whereas students with a high ability to peer review provided more criticisms (Patchan and Schunn 2015). If our Study 1 participants had a low understanding of best practice in the transitional phase, they may have perceived all transcript versions overly positively and missed critiquing the interviewer for improvements (of which the poor version required the most improvement).
After completion of Study 1, we sought to determine whether results could be replicated in another jurisdiction (see Earp and Trafimow 2015; Simons 2013, for the importance of replicability). In Study 2, we tested the replicability of results with a sample of child forensic interviewers who received a shorter training program than the Study 1 sample (4 days instead of 10 days of training).

Participants
Based on the effect sizes from Study 1 (ηp² = 0.23-0.49), an a priori power analysis determined that between 15 and 39 participants should provide sufficient power. We recruited 37 police officers from a different jurisdiction (n female = 24; n male = 13). All police members who had completed the jurisdiction's child forensic interviewing course were invited to participate via email. The 29 participants who provided demographic information had a mean age of 35.7 years (SD = 8.26, range = 23-53 years). Of the 28 participants who reported their rank, most were constables (n = 15), with fewer senior constables (n = 6) and sergeants (n = 7). The sample had a similar number of years' experience as police officers (M = 9.35, SD = 6.75, range = 1.5-27 years) as our Study 1 sample, t(43) = 0.52, p = 0.61, d = 0.14. Further information about the jurisdiction has not been provided to ensure participant and jurisdiction anonymity.
Participants were eligible for the study if they had recently completed the child forensic interview training program offered by their jurisdiction's police force. This training program differed from the training of participants in Study 1. It lasted only 4 days, and trainees were provided fewer opportunities to practice their interviewing skills; they completed two mock interviews with an adult playing the role of a child and only one interview with a real child (aged 5-6 years). However, like the training in Jurisdiction 1, the course was instructed by an academic expert in child forensic interviewing as well as guest lecturers (e.g. lawyers, social workers) and covered best practice techniques to use at all stages of a child forensic interview and the same coding scheme for classifying question types. Upon entering their training, the sample had little child forensic interviewing experience (M = 1.21 years, SD = 4.56, range = 0-20 years).

Materials and Procedure
The research was approved by the university's human research ethics committee, and written consent was attained from the police organisation and individual participants. Study 2 used the same materials as Study 1 with one difference: the checklist evaluation tool was reduced to the 16 items determined in Study 1. Upon conclusion of the child interviewing training program, instructors emailed participants the revised evaluation tool and a mock child forensic interview transcript (transcripts were the same as those used in Study 1). Again, participants were randomly allocated to receive one of the three counterbalanced versions of a modified child forensic interview transcript: transcript A (n = 12), transcript B (n = 12), or transcript C (n = 13). Participants returned the completed materials to the research team within 28 days. Percentage accuracy scores were computed for participants' checklist answers by dividing the number of items answered accurately (according to the same expert scorers from Study 1) by the total number of items in the checklist subsection. For participants' identification of question types, a percentage accuracy score was derived by dividing the number of correctly identified questions by the total number of questions in the substantive phase. As in Study 1, approximately one third of participants ignored the category of leading questions (n = 13, 37.1%), so this category was omitted from participants' scores.

Results
We first considered the internal reliability of the checklist items with this sample. Cronbach's alpha was computed for the overall 16-item checklist, which showed low internal consistency, at 0.29. When each subsection was considered separately, reliability within each scale was acceptable: 0.70 (opening), 0.66 (transitional), and 0.59 (substantive).
Participants achieved an average of 10.54 out of 16 items correct on the checklist (65.88%, SD = 2.23, range = 6-15). We explored participants' percentage accuracy when evaluating each interview phase separately. For each of the three sections of the interview, a one-way ANOVA was conducted to compare participants' percentage accuracy across interviewer performance levels (strong, mixed, or poor interviewer performance). Results of each ANOVA are presented below; see Fig. 1 for a depiction of mean scores in each subsection. All follow-up t tests were conducted with a Bonferroni adjustment (α = 0.017).

Within-Subjects Comparison
Collapsing over interview quality manipulations, we compared participants' accuracy on each phase of the checklist to determine if participants were more accurate at a certain phase: opening, transitional, or substantive.

Identifying Question Types

Table 1 demonstrates the participants' percentage accuracy at identifying each question type. Participants were on average 77.54% accurate in their identification of question types (SD = 10.31, range = 55.26-95.42%). Participants' accuracy of question type identification did not differ across transcript versions, F(2, 32) = 2.91, p = 0.07, ηp² = 0.15. A linear regression was conducted to determine whether participants' accuracy identifying question types predicted their checklist accuracy. The model was significant, F(1, 33) = 6.56, p = 0.02, and accounted for 14% of the variance (adjusted) in checklist performance. Question type identification accuracy significantly predicted substantive phase checklist accuracy, B = 0.083 (SE = 0.33), β = 0.407, t = 2.56, p = 0.015. For every percentage point increase in participants' identification of question types, their substantive phase checklist accuracy increased by 0.083 percentage points.

Comparisons Between Studies 1 and 2
Overall, Studies 1 and 2 shared very similar patterns of results. We re-ran the analyses with the samples from both studies collapsed and study as a predictor variable. For participants' responses to checklist items about the transitional phase, there was no main effect of study, F(1, 87) = 1.23, p = 0.27, ηp² = 0.02, and no interaction with interviewer performance level (strong, mixed, or poor), F(2, 87) = 0.08, p = 0.92, ηp² = 0.002. Similarly, for responses to items about the substantive phase, there was no main effect of study, F(1, 87) = 2.52, p = 0.17, ηp² = 0.03, and no interaction with interviewer performance level, F(2, 87) = 1.45, p = 0.24, ηp² = 0.03. For participants' responses to the opening phase checklist items, there was a main effect of study, F(1, 87) = 15.81, p < 0.001, ηp² = 0.15, which was qualified by an interaction with transcript version, F(2, 87) = 4.89, p = 0.01, ηp² = 0.10. When the opening section was mixed, participants in Study 2 had significantly lower percentage accuracy scores (M = 19.23, SD = 25.31) than participants in Study 1 (M = 60.52, SD = 39.37), t(30) = 3.33, p = 0.02, d = 1.24. There was no effect of study when the transcript was strong, as participants in both studies scored at ceiling. There was also no effect of study when the transcript was poor, t(13) = 1.96, p = 0.07, d = 0.91.

Discussion
Participants in Study 2 were least accurate when reviewing opening and substantive phases that showed a mixed adherence to best practice, replicating the trend of results from Study 1. Interestingly, our finding from Study 1 that the transitional phase was evaluated least accurately when it depicted poor adherence to best practice was also replicated. We suspect that participants in both studies had limited knowledge of best practice interviewing procedures for the transitional phase and therefore did not notice poor practices. We discuss these results further in the General Discussion.
The replicability of our results across two samples was important given the differences in training between the samples. Patterns of results across Studies 1 and 2 were strikingly similar for participants' accuracy on the checklist across strong, poor, and mixed interviewer performances. Participants in Study 1 slightly outperformed participants in Study 2 on overall checklist accuracy, but both samples achieved a similarly moderate level of accuracy when identifying question types. The overall superior performance of Study 1 participants suggests that interviewers trained on a 10-day course can peer review with somewhat higher accuracy than interviewers trained on a 4-day course (Study 2 participants).

General Discussion
The current studies tested a checklist tool to guide interviewers' peer review of child forensic interview transcripts. We examined the accuracy of peer reviewers' scores on the checklist for transcripts that had a strong, poor, or mixed performance of best practice skills. Across two studies, participants were significantly less accurate when evaluating opening and substantive interview phases with a mixed performance and the transitional interview phase with a poor performance.
One reason for participants' reduced accuracy when evaluating mixed interviewer performances (in the opening and substantive phases) may be that the improvements required in mixed performances are nuanced and therefore difficult to identify. In the mixed transcript phases, the interviewer's performance of best practice skills was manipulated so that she completed some skills, lacked other skills, and attempted some skills without properly demonstrating them. Identifying the skills that were attempted but incomplete was likely particularly difficult for participants. For example, in the mixed opening phase, the interviewer delivered only two of the three ground rules required on the checklist (see Item 3), and in the mixed substantive phase, the interviewer separated incidents of repeated abuse but did not exhaust the child's recall of each incident (see Item 15). Identifying the nuanced errors that the mixed interviewer made on these items likely required a thorough understanding of each best practice element and detailed attention to the transcript. In contrast, when the interviewer's skills were depicted as strong or poor, her behaviours fully aligned with or opposed each checklist item in a manner that was likely more obvious. Indeed, previous work has found that peer reviews by high school students were least accurate when nuanced improvements had to be identified (Gielen et al. 2010; Patchan and Schunn 2015). Participants in our studies had completed training courses lasting only 10 days (Study 1) or 4 days (Study 2). Child forensic interviewers have previously been shown to lack best practice knowledge after courses of similar lengths (Lamb 2016; Smith et al. 2009; Sternberg et al. 2001), but prolonged, ongoing training lasting many months can provide a more thorough understanding of best practice (Benson and Powell 2015; Cederborg et al. 2021; Lindholm et al. 2016).
It may be that only highly trained interviewers are suited to peer review when nuanced improvements are required.
An alternative reason that participants may have had lower accuracy on transcripts with mixed performances of the opening and substantive phases is the rigid nature of a checklist as an evaluation tool. The checklist forced participants (and the experts) to select a dichotomous answer for each best practice behaviour; each behaviour was marked as either present or absent. To participants lacking detailed attention to best practice, the mixed behaviours may not have obviously fallen into either category, so the correct checklist answer may not have been apparent. By comparison, behaviours that were strong or poor likely fit into the present or absent categories more clearly. While it is established that scaffolding helps facilitate peer review (Brubacher et al. 2021a; Sluijsmans et al. 2002), when scaffolding tools are too rigid, they may remove the flexibility peer reviewers require to individualise reviews to the characteristics and nuances of a particular interview. Some best practice skills are highly flexible so that interviewers can tailor their approach to a particular child (e.g. adapting questioning to a child's developmental level; Poole and Lamb 1998). Peer review tools may require flexibility too; perhaps prompts with free-text space, rather than categorical answer boxes, are more appropriate for reviewing child interviews.
Interestingly, across both of our studies, participants' accuracy in evaluating the transitional phase followed a different pattern from the other phases: participants had the lowest accuracy when evaluating the transcript that poorly adhered to best practice. Our refined checklist contained two items for the transitional phase, each focused on eliciting the topic of concern when children were not immediately forthcoming (Items 5 and 6). Previous work suggests that interviewers do not have a strong understanding of best practice techniques for this situation: evaluations of field transcripts have shown that shortfalls in interviewers' adherence to best practice during the transitional phase are particularly prominent when children are not forthcoming with the substantive topic (Hughes-Scholes and Powell 2013; Powell and Hughes-Scholes 2009). We believe that, despite recent training on the topic, participants in our samples may have had a poor understanding of best practice in the transitional phase, which reduced their accuracy in detecting poor interviewer behaviours in this phase. Compared with the poor transitional phase, the mixed and strong phases contained fewer poor behaviours to identify and so were likely less affected by participants' lack of understanding.
Our finding that mixed or poor interviewer performances were at times inaccurately evaluated is troubling. Given that interviewers have previously been shown to struggle to perform a range of best practice elements in the field (e.g. Guadagno and Powell 2009;Luther et al. 2014;Powell and Hughes-Scholes 2009;Sternberg et al. 2001;Westcott and Kynan 2006), it is likely that many interviews requiring ongoing feedback are indeed performed with a mixed or poor adherence to best practice. Previous work has found more encouraging results: two studies have shown improvements in interviewers' questioning after ongoing peer review exercises (Cyr et al. 2021;Stolzenberg and Lyon 2015), and one study found that after completing a checklist, interviewers' peer reviews are written in a mostly constructive manner that focuses on improvements (Brubacher et al. 2021a). However, these studies did not consider the quality of the interviews being reviewed, how many improvements were needed to the interviews, or how nuanced those improvements were. Results from our study highlight the importance of testing peer review tools across a range of interviews varying in quality, particularly mixed and poor interviews.
One major limitation of our studies was that a key best practice element of child interviewing was not present on our checklist: the development of rapport with children. While interviewing guidelines strongly and commonly recommend building rapport by asking children innocuous questions or asking them for a narrative of an innocuous event (see ABE, Ministry of Justice 2011; NICHD protocol, Lamb et al. 2007), it was beyond the scope of our research to include evaluations of rapport-building on the checklist. The decision to omit rapport-building was made for two reasons. First, including pre-substantive rapport-building in the child interview transcripts would have substantially increased the transcript length and, by extension, the time demands on participants reading the transcripts. Second, rapport-building practices are not usually recorded in child forensic interview footage or transcripts, so our participants would not be used to seeing rapport-building in a transcript. Given the importance of rapport-building in child interviewing, future research should assess peer reviewers' ability to evaluate rapport-building efforts or consider adding a rapport evaluation component to peer review tools. Future work might also examine interviewers' ability to self-evaluate their own rapport-building in jurisdictions where it is not typically recorded.
A second limitation of our study is that participants evaluated mock (modified) child forensic interview transcripts rather than transcripts of real (unaltered) interviews. While the decision to use mock transcripts reduced the ecological validity of our materials, it allowed us to manipulate interviewer performance for a controlled and systematic examination of participants' reviews across different levels of interview quality (strong vs. mixed vs. poor). Further, to retain as much ecological validity as possible, the manipulated interview transcripts were modelled closely on a real interview (see Brubacher et al. 2021a for a similar procedure used to create a mock transcript).

Implications
Our studies have important implications for the field and for future research. First, while peer reviewers require scaffolding to support their reviews (Brubacher et al. 2021a, b;Sluijsmans et al. 2002), our work suggests that a checklist tool is too rigid for interviews containing mixed adherence to best practice and that more flexible tools are needed to support peer reviews of child forensic interviews in the field.
We suggest that open questions, Likert scales, or free-text space might be required, and propose that future research considers testing these formats for peer review.
Second, our results highlight the importance of researchers considering the quality of the interviews being peer reviewed. While previous studies have had peer reviewers rate field interviews (e.g. Cyr et al. 2021; Stolzenberg and Lyon 2015) or a single modified transcript (Brubacher et al. 2021a), the quality of the interview(s) being reviewed is often not reported. Our research has shown that strong interviews are relatively easy for peer reviewers to evaluate, while mixed and poor interviews are more challenging. Researchers should be encouraged to evaluate and report the quality of interviews being reviewed. Practitioners in the field should also consider interview quality: peer review might be a useful mode of feedback for stronger interviewers, but more experienced or expert reviewers may be required to provide feedback for less-experienced interviewers who are still developing their skills.
Last, when comparing results across our two studies, we found that overall accuracy was higher for participants in Study 1 (who received 10 days of training on child interviewing) than for participants in Study 2 (who received only 4 days). Police organisations looking to implement peer review should consider the level of training that their members receive on child forensic interviewing since our results suggest that longer training may lead to improved peer reviewers.

Conclusion
Our study was the first to consider peer reviewers' accuracy using a structured checklist to assess strong, poor, and mixed child interview transcripts. Past literature has shown that scaffolding is valuable to inform peer reviews (Brubacher et al. 2021a;Tornwell 2018). Our results have furthered this knowledge by demonstrating that a structured checklist tool is helpful for child forensic interviewers to peer review interview transcripts that strongly adhere to best practice. However, the tool had shortcomings for participants' evaluations of sections of the child interview that were performed poorly or contained a mixed adherence to best practice. Research attention should focus on further developing and testing tools that are more flexible than a dichotomous checklist but still provide some structure to peer reviewers.
Appendix: Instructions for Using the Evaluation Tool

Use the enclosed tool to evaluate the provided mock child forensic interview transcript. Below are some instructions to help you use the tool effectively.

Step 1: Identifying the Question Types
On page 1 of this document are definitions of different question types. Please read through the child interview transcript and code all interviewer questions that have a blank box next to them. One box aligns with one question. If you aren't sure of an answer, please take your best guess as to the question type code. All boxes should have one question code provided in them. Please note that questions posed before the child is asked for a narrative do not have a box and do not require coding.
Step 2: Documenting Your Evaluation on the Checklist

You will be presented with a table of interviewer behaviours that contribute to a best practice child interview (Table 2). After coding the question types in the transcript, please re-read the transcript a second time. As you read:

• Tick "Yes" next to a behaviour that is appropriately demonstrated during the interview.
• Tick "No" next to a behaviour that is not appropriately demonstrated during the interview.
• Tick "N/A" next to a behaviour that is not relevant to this particular transcript.

Upon completion of this task, every interviewer behaviour in the table should have a corresponding box ticked next to it (either "Yes", "No", or "N/A", depending on your answer). If you aren't sure of an answer, please take your best guess as to which box should be ticked.

Acknowledgements Thanks are due to Det. Snr. Con. Stephen Hearn for assistance in the coding of materials and to Miss Kiran Kaur for administrative assistance with data management.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.

Declarations
Ethics Approval These studies were performed in line with the principles of the Declaration of Helsinki. Approval was granted by the University Ethics Committee.

Consent to Participate
Informed consent was obtained from all individual participants included in the studies.

Competing Interests
The authors have no relevant financial or nonfinancial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Question type definitions (fragment): Encourage the interviewee to continue providing information and signal that the interviewer is listening (e.g. "Uh huh"; repeating the child's last few words). Leading: introduces details the child has not yet mentioned or confirmed, or implies a certain response is desired; open or specific questions may also be considered leading (e.g. "He had a gun, didn't he?").