1 Introduction

According to the World Health Organization, more than 800,000 people die by suicide every year, and many of these individuals have attempted suicide on multiple occasions (WHO, 2014). Indeed, a suicide attempt is one of the strongest predictors of completed suicide (Fawcett et al. 1990; Harris and Barraclough 1997; Bostwick et al. 2016). Hence, from both a clinical prevention and a research perspective, it is critically important to correctly identify individuals with a lifetime history (LTH) of suicide attempts (SA). However, extant studies on the LTH of SA vary considerably in their findings, both in terms of its prevalence and its relationship with other variables, and the often diverse populations studied make it difficult to draw comparisons. One particularly important observation about research examining the LTH of SA is that such studies are often underpinned by a survey design in which a scripted question is used to obtain an answer from a representative sample within a defined population (general population, mental health patients, veterans, etc.). As extensive survey methodological research (Biemer, 1991) has demonstrated, such survey designs come with various sources of error (e.g. errors related to interviewing and interviewers, respondents, or missing data), which accounts, at least in part, for the observed differences in the findings of previous studies that have investigated sensitive issues such as the LTH of SA.

The specific framework used in survey methodological research to describe and investigate these errors is Measurement Error, a component of Total Survey Error (TSE). This framework serves to identify specific errors related to the different components of the data collection process in this research design. A measurement error occurs when the obtained response deviates from the underlying true value of the respondent. Such errors are problematic insofar as they diminish the accuracy of the obtained parameters, such as prevalence figures, point estimates, and other inferences derived from the collected data (Groves and Lyberg 2010; Biemer 2010). Measurement errors can pertain to various aspects of the data collection process, namely the instruments, the respondent, the interviewer, and the mode of data collection.

Two widely used data collection methods/modes are interviews and self-report questionnaires. While some studies posit that clinical interview assessments result in an underreporting of sensitive behaviour compared to the self-report method (Tourangeau and Yan 2007), Hom et al. (2019) found no evidence for this claim in the case of measures of the LTH of SA. These authors demonstrated, in a sample of US military service members at risk of suicide, that around one-third of the participants reported inconsistently across five validated measures of the LTH of SA, which were administered through both self-reports and clinical interview assessments. However, they found no evidence that clinical interview assessments result in an underreporting of the LTH of SA compared to the self-report method.

Other research has shown that assessing the LTH of SA through interviews or self-report questionnaires results in the underreporting of the LTH of SA, in comparison to using data-linkage techniques to view the medical records of these same participants (Mars et al. 2016). Notably, even asking the same respondent the exact same SA question twice in a relatively short period can produce significant inconsistency in their responses. For example, Eikelenboom et al. (2014) showed that asking the following SA question twice within a two-year period led to around 30% inconsistent answers (Yes → No; No → Yes) in a sample of depressed individuals:

Have you ever made a serious attempt to end your life, for instance by harming or poisoning yourself or by getting into an accident?

Other scholars have found similar results (Christl et al. 2006; Klimes-Dougan et al. 2007). The available evidence for these respondent-related errors suggests that one of the key reasons for the inconsistent responses is that the current psychological state of the respondent influences the accuracy of their self-reporting about their prior suicidal behaviour (Eikelenboom et al. 2014; Klimes-Dougan et al. 2007). That is to say, individuals who felt depressed when completing a survey were more likely to report their prior suicide attempt than those who had either minor or no depressive symptoms.

One further potential explanation for the inconsistent responses to the question on the LTH of SA pertains to the influence of the interviewer. All three of the aforementioned studies used face-to-face interviews to assess the LTH of SA. Indeed, face-to-face interviews are often the best available method for researchers to collect data about a complex phenomenon like the LTH of SA. The extant literature suggests that, on the one hand, such sensitive and complex questions are more likely to produce variability among interviewers, because these types of questions afford more opportunities for interviewers to deviate from the scripted text, reformulate phrases, and lead the respondent to a certain answer, while, on the other hand, they also allow interviewers to intervene and help the respondent if ambiguity or misunderstandings arise (Schaeffer, Dykema, and Maynard, 2010; Schnell and Kreuter 2005; Groves and Magilavy 1986).

Two approaches are usually adopted in studies examining interviewer effects (Groves 1989, 1991). The first approach studies the relationship between (inadequate) interviewer behaviour (reading of the questions, probing, negotiating, etc.) and measurement errors. These specific effects or errors derive from the role-dependent activities of the interviewer (Davis 1997). Such interviewer-induced biases can be controlled through (additional) training, as well as, for example, by providing feedback to interviewers based on observations of their interviewing techniques (e.g. filming) (Groves, 2004). The second approach concerns the role-independent characteristics of an interviewer, such as their age, ethnicity, and gender. Role-independent effects occur when both parties in an interview (interviewer and respondent) hold a series of assumptions about the other based on their personal experience of other people with similar characteristics, which, in turn, leads both parties to expect certain answers and attitudes from each other (Davis 1997). West and Blom (2017) provide a comprehensive review of both types of interviewer effects, while Olson et al. (2020) outline the state of the art on interviewer-related error within the framework of TSE.

Concerning the present study, there is a relative dearth of research investigating interviewer effects in the field of suicide studies. Notwithstanding this lacuna, some literature has considered interviewer effects in relation to risk factors for SA. For example, Samples et al. (2014) highlighted the influence of an interviewer characteristic (race) on reported risk factors among an African American sample. In the present study, we present a roadmap for the study of interviewer-related error, and explore the role-dependent behaviour of interviewers in assessing the LTH of SA within an epidemiological research design. The roadmap consists of a three-step procedure: (1) identify the existence of interviewer-related errors, (2) try to explain these effects, and (3) study the impact of these errors on univariate statistics and on associations between variables. We have operationalized the roadmap by formulating the following three research questions:

Research question 1: Is there an effect of the interviewer on the number of positive answers to the LTH of SA question?

Research question 2: If so, can we explain these interviewer effects?

Research question 3: What evidence can be found of a moderator effect of the interviewer on the association between the LTH of SA and the risk factor suicidal ideation?

2 Methods

2.1 General sample

For the purposes of this study, we used the sample from the Netherlands Study of Depression and Anxiety (NESDA). NESDA is a longitudinal study (2004–current) that examines the development and course of depression and anxiety disorders in adults (18–65 years). The study comprises respondents from a variety of healthcare settings and at different clinical developmental stages of illness, and aims to obtain a full understanding of the course and consequences of depressive and anxiety disorders. In addition, people with neither depressive nor anxiety disorders were also recruited to take part in the study as controls. Inclusion of respondents took place between September 2004 and February 2007 within two regions in the west (Amsterdam and Leiden) and one region in the north (Groningen) of the Netherlands. Respondents were recruited from the general population (n = 564), primary care (n = 1610), and specialised mental health care (n = 807). The clinical exclusion criteria were a primary clinical diagnosis of a psychotic disorder, obsessive-compulsive disorder, bipolar disorder, or a severe addiction disorder. Respondents were also excluded if they were not able to communicate in Dutch. For further details of the rationale, objectives, and methods of NESDA, see Penninx et al. (2008). The study protocol was approved by the ethical review boards of all the participating centers, and all participants provided written informed consent.

2.2 Fieldwork procedures

The respondents were interviewed at one of the seven clinical sites in the Netherlands to obtain a baseline measurement. The same data were collected and standardised procedures were employed for all respondents. The interviews were conducted on a laptop computer running Blaise version 4.5, with built-in data entry checks on outliers and routing. Blaise is a computer-assisted interviewing (CAI) system developed by Statistics Netherlands, which was used in NESDA for computer-assisted personal interviewing (CAPI). All the interviews were taped for the monitoring of data quality and interviewer performance.

The 2981 interviews were conducted by 44 locally recruited research assistants, most of whom were psychologists, nurses, or residents in psychiatry. They received five days of training from the fieldwork staff following a detailed training manual. The training focused primarily on how to conduct the assessment in a standardised manner. Key aspects of the training were question wording and probing behaviour, as well as practising different parts of the questionnaire through role-playing exercises. The interviews were recorded and these recordings were subsequently used to provide feedback to the interviewers. This procedure was established to ensure that the interviewers adhered to the interview protocol for standardised interviewing and collected high-quality data. Throughout the fieldwork, a random selection of 10% of all the taped interviews was used to supervise the research assistants. The supervisors paid particular attention to question wording and the probing behaviour of the interviewers. Regular meetings were held with the interviewers to discuss relevant topics about the assessment, including questions about the assessment and difficulties they were having with it. Furthermore, item non-response per interviewer was continuously monitored through computer analyses in SPSS (Penninx et al. 2008).

2.3 Research question 1: Is there an effect of the interviewer on the number of positive answers to the LTH of SA question?

2.4 Sample

To answer this question, we included the 35 interviewers who had completed at least 15 interviews each (range 17–242) and excluded 9 interviewers who had conducted a total of 80 interviews between them. A further 17 respondents refused to answer the LTH of SA question. The study sample therefore totalled 2884 respondents (2981 − 80 − 17 = 2884), interviewed by 35 interviewers.

2.5 Procedures, variables and analysis

The Kish model (Kish 1962) was the first effort to model interviewer effects. Kish proposed a single-factor ANOVA model with the interviewer number as a fixed factor. Based on this model, an estimator rho (an intraclass correlation) was developed to quantify the interviewer effect. The approach is stepwise: first test the significance of the interviewer effect, then estimate its magnitude. However, two assumptions of this model are rarely met: random assignment of respondents to interviewers and an (almost) equal workload for all interviewers. Moreover, in Kish's model, unit nonresponse, workload, and the hierarchical structure in which respondents are nested under interviewers are ignored.
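
As a minimal illustration, rho can be estimated from the one-way ANOVA mean squares. The sketch below assumes a long-format data set with the hypothetical column names interviewer_id and sa_response (0 = No, 1 = Yes); under unequal workloads, the average workload is used as an approximation.

```python
import numpy as np
import pandas as pd

def kish_rho(df: pd.DataFrame, group: str, y: str) -> float:
    """Estimate Kish's intraclass correlation (rho) from the one-way
    ANOVA mean squares, using the average workload per interviewer."""
    groups = [g.to_numpy(dtype=float) for _, g in df.groupby(group)[y]]
    k = len(groups)                          # number of interviewers
    n = sum(len(g) for g in groups)          # number of respondents
    m_bar = n / k                            # average workload
    grand_mean = np.concatenate(groups).mean()
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                     for g in groups) / (k - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return (ms_between - ms_within) / (ms_between + (m_bar - 1) * ms_within)

# Usage (hypothetical): rho = kish_rho(df, "interviewer_id", "sa_response")
```

A rho close to zero indicates that interviewers contribute little systematic variance to the responses.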

As early as 1983, Dijkstra showed that hierarchical models are a better approach for studying interviewer effects (Dijkstra 1983). These hierarchical models were further developed by, for example, Hox (1994) and O'Muircheartaigh and Campanelli (1998) to take the hierarchical structure, non-randomisation, and differences in workload into consideration. However, even in these models, the relation between unit nonresponse and interviewer effects remains complex to model (West and Olson, 2010).

In this particular study, we were interested in a roadmap rather than in a specific model or the content of a specific interviewer effect. Moreover, the information on unit nonresponse within NESDA was problematic due to the complex sampling design (see Penninx et al., 2008). We therefore chose the first and simplest way to identify interviewer effects, that of Kish (1962). Biemer and Stokes (1991) have shown that this model is suitable for the explorative analysis of interviewer effects. The Kish model (a one-way ANOVA with the interviewer as a fixed factor) defines interviewer variability as the difference between the true value of a respondent and the value observed and reported by the interviewer. It is assumed that part of this variability (error) depends systematically on the specific interviewer. The (simplified) model is written as:

Respondent observed value = Respondent’s true value + Interviewer Effect + Random error.

This model assumes that the interviewer influences the responses in the same way for all the respondents in his/her assignment of interviews. However, these effects vary among the different interviewers. The random error varies from interview to interview, even when the interviews are conducted by the same interviewer (Lundquist 2006).

To specify the ANOVA model, each interviewer received a unique identification number. This number was subsequently used as the independent variable in the ANOVA model. In the present study, an explorative analysis was performed with the interviewer number as a fixed factor and the LTH of SA as the dependent variable (see above): “Have you ever made a serious attempt to end your life, for instance by harming or poisoning yourself or by getting into an accident? (No/Yes)”.
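
For illustration, this explorative ANOVA can be reproduced along the following lines. The data frame and its column names are hypothetical stand-ins for the NESDA data (simulated here so that the snippet runs as-is); the F test for the interviewer factor corresponds to the statistic reported in the results below.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative long-format data: one row per respondent, with hypothetical
# columns 'interviewer_id' (identification number) and 'sa_response'
# (LTH of SA, coded 0 = No, 1 = Yes).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "interviewer_id": rng.integers(1, 36, 2884),  # 35 interviewers
    "sa_response": rng.integers(0, 2, 2884),
})

# One-way ANOVA with the interviewer as a fixed factor
model = ols("sa_response ~ C(interviewer_id)", data=df).fit()
print(sm.stats.anova_lm(model, typ=1))  # F and p for the interviewer factor
```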

2.6 Results

Table 1 delineates the main characteristics of the sample. An analysis of variance showed that the effect of the interviewer on the number of positive answers to the LTH of SA question was significant, F(34, 2849) = 1.591, p = 0.016 (see Table 2).

Table 1 Background characteristics of the study sample (n = 2884)
Table 2 Explorative analyses of interviewer effects through an ANOVA model

3 Research question 2: Can we explain these interviewer effects?

Given the affirmative answer to research question 1, we next examined whether we could explain or understand these effects.

3.1 Sample

The selection of interviews was predicated on the need to optimise the opportunity to understand the interaction processes surrounding the question on the LTH of SA. Specifically, we used a three-step sampling procedure to achieve sufficient variability in the interactions in which we wanted to study the potential effect of interviewer behaviour on the data collection process. Firstly, from the 2884 interviews, we selected those respondents who answered positively to the LTH of SA question (n = 339). Secondly, we selected those respondents who had a positive score on suicide ideation (n = 342), as measured by the Beck Suicide Ideation Scale (Beck, 1997). Combining both selections showed that the 681 selected interviews (339 + 342) belonged to 583 unique respondents; that is, 98 respondents scored positively on both variables. Finally, a random selection of 100 interviews/tapes without positive scores on suicide ideation or the LTH of SA was added to the study sample; the number of 100 was based on logistical considerations (time and costs). This led to an overall sample of 683 (583 + 100) selected interviews to be transcribed. A mandatory element in the selection process was that the tapes of the interview had to be available, and that the passages concerning suicide ideation and the LTH of SA question were of sufficient quality to transcribe, as judged by experienced transcribers from the NESDA fieldwork team. In total, 508 interviews could be transcribed: 175 with a positive score on the LTH of SA, 167 with a positive score on suicide ideation, 75 with a positive score on both, and 91 with no positive score on either (508 = 175 + 167 + 75 + 91).
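
As a sketch of this selection logic, the union of the two positively scoring groups yields the set of unique respondents; the data frame and column names below are hypothetical stand-ins (simulated so the snippet runs), not the actual NESDA variables.

```python
import numpy as np
import pandas as pd

# Illustrative respondent-level data; 'lth_sa' and 'ideation' are
# hypothetical binary (0/1) columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "respondent_id": range(2884),
    "lth_sa": rng.integers(0, 2, 2884),
    "ideation": rng.integers(0, 2, 2884),
})

sa_ids = set(df.loc[df["lth_sa"] == 1, "respondent_id"])    # 339 in NESDA
si_ids = set(df.loc[df["ideation"] == 1, "respondent_id"])  # 342 in NESDA
selected = sa_ids | si_ids  # union: 583 unique ids in NESDA (overlap of 98)

# 100 randomly selected interviews with neither positive score
controls = df.loc[~df["respondent_id"].isin(selected), "respondent_id"].sample(
    n=100, random_state=1
)
study_ids = selected | set(controls)  # 683 interviews in total
```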

3.2 Procedures, variables, and analysis

The transcribed question-answer sequences between the interviewers and respondents were entered and coded using the Sequence Viewer software (Dijkstra 2017); a description of this program can be found in Dijkstra and Ongena (2006). We typified the complete interaction according to two characteristics of interview behaviour (Quality of the Interaction and Repair) that are known to influence data quality and response distributions. Most surveys follow the stimulus-response paradigm, which posits that, in order to compare the responses (answers) of the respondents, everybody should receive exactly the same stimulus (question). It is for this reason that, generally speaking, questions and instructions in surveys are wholly scripted and should be presented to the respondent exactly as they are scripted. However, the extant literature suggests that sensitive and complex survey questions such as the LTH of SA question are more likely to produce interviewer-related effects. On the one hand, these types of questions afford opportunities for interviewers to deviate from the scripted text, reformulate phrases, and lead the respondent to a certain answer through suggestive probing, while, on the other hand, they also enable interviewers to intervene and help the respondent when ambiguity or misunderstandings occur (Schaeffer, Dykema, and Maynard, 2010; Schnell and Kreuter 2005; Groves and Magilavy 1986). To capture these two aspects within the question-answer sequence related to the LTH of SA, we evaluated the interactions against two criteria for best-practice interviewing: Quality of the Interaction and Repair.

‘Quality of the Interaction’ was operationalised in terms of the presence (No/Yes) of inadequate interviewing behaviour, e.g. incorrect reading of the question, not following instructions, and suggestive questioning and/or probing. Attending to such behaviour is critical, insofar as its presence can alter the content of the question (stimulus), which, in turn, makes it difficult to compare the answers (responses) that are obtained. For example, interviewers might deviate from strictly standardised interviewing by unintentionally changing the content of the scripted question, by omitting phrases, by suggestively probing, that is, leading the respondent to a particular answer or alternative response, or by changing the wording of the question in such a way that the content of the original question is altered (Smit, Van der Zouwen and Dijkstra, 1997). Of course, interviewers can also make such deviations deliberately in an attempt to make the interaction with the respondent flow more naturally. Indeed, research has demonstrated (Brunton et al., 2012; Bell et al. 2016) that interviewers deviate from formally scripted questions in a concerted effort to build rapport with their participants. For a recent overview of interaction research in survey interviews, see Ongena and Dijkstra (2021). Moreover, the extant literature indicates that variability in interviewer behaviour for the purposes of establishing rapport will inevitably lead to greater variability in the response distributions (West and Blom 2017; Garbarski, Schaeffer and Dykema, 2016).

‘Repair’ is required when there are deviations from the ideal question-answer sequence, such as when the respondent does not understand (parts of) the question or the response options available to them and hence needs to ask for clarification from the interviewer. It could also be the case that the question and the response options simply do not fit the experience of this specific respondent. In such cases, interviewers must ‘negotiate’ with the respondent on how to solve this complication. Moreover, it may be unclear to the respondent – especially in the case of long questions – when they are expected to start answering, which, in turn, can result in interruptions and overlapping speech. This means that respondents often deviate from the available response options and provide an inadequate response, which, in turn, requires interviewers to initiate repair activities (Zouwen and Smit, 2006). While probing in an attempt to repair an inadequate initial response is undoubtedly difficult, it is an essential element of an interviewer’s task, insofar as it helps, explains, and guides the respondent to provide a more accurate response to sensitive and ambiguous issues and questions, which is one of the main reasons to involve interviewers in collecting data. In these cases, interviewers are instructed and trained to intervene and try to obtain better data by probing further for a proper response from the respondent. To cite Moore and Maynard (2002: 300): “…whenever a respondent’s answer fails to reproduce one of the pre-specified answer options, interviewers should probe it, that is, initiate repair and invite the respondent to redo the answer by selecting a single option.”

A behaviour coding scheme was developed to establish binary values for ‘quality of the interaction’ and ‘repair’ in a question-answer sequence. Quality of interaction was coded as inadequate when interviewers substantially deviated from prescribed formulations, posed or probed questions in a suggestive way, or otherwise deviated from questioning guidelines. Repair occurred when respondents provided inadequate answers in their turns and interviewers reacted with a negotiation or probing for an adequate answer to the original question, for instance by repeating (parts of) the original or follow-up questions or by rephrasing answers of respondents.

The accuracy of the applied codes for these two variables was checked by one of the authors (SD), and the reliability of the coding was established by having a sample of 30 transcripts coded by a second coder; agreement between the two coders was high (weighted Cohen’s kappa = 0.91). The codes were exported from Sequence Viewer into SPSS. The association between the LTH of SA answer distribution and the two criteria for evaluating the quality of the interaction was tested with chi-square statistics. Interviewer effects were tested through the ANOVA procedure described above.
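
As an illustrative sketch of these checks (with simulated stand-in codes rather than the actual Sequence Viewer exports), the reliability and association tests can be computed as follows:

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Inter-coder reliability on the 30 double-coded transcripts
coder_a = rng.integers(0, 2, 30)
coder_b = coder_a.copy()
coder_b[:2] = 1 - coder_b[:2]  # pretend the coders disagree on two cases
kappa = cohen_kappa_score(coder_a, coder_b, weights="linear")

# Association between a behaviour code (e.g. repair: 0/1) and the
# LTH of SA answer (0/1): chi-square test on the 2 x 2 table of counts
repair = rng.integers(0, 2, 508)
sa_answer = rng.integers(0, 2, 508)
table = np.array([[np.sum((repair == i) & (sa_answer == j)) for j in (0, 1)]
                  for i in (0, 1)])
chi2, p, dof, expected = chi2_contingency(table)
print(f"kappa = {kappa:.2f}, chi2 = {chi2:.2f}, p = {p:.3f}")
```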

3.3 Results

Table 3 presents both the background characteristics of the study sample and the frequency of the behaviour codes. The respondent characteristics of this sample are in line with those of the larger sample.

Table 3 Background characteristics of the study sample (n = 508)

In 6% of the interactions, the question wording was changed in such a way that the content of the question (stimulus) was not the same for those respondents as it was for others. In 12% of the interactions, the interviewer had to ‘repair’ the interaction and explain or negotiate with the respondent to obtain an accurate answer. This appears to indicate that both the interviewers and the respondents found this question difficult. Indeed, closer examination of those interactions in which the content of the question was changed showed that the interviewers rephrased the question by omitting parts of it. From a survey error perspective, one can thus argue that this specific question was simply too long and used several ambiguous words (harm, accident), both of which have been shown to lead interviewers to deviate in how they read questions (Saris and Gallhofer, 2014). As mentioned, interviewers had to initiate repair activities in 12% of the interviews to guide the respondent to an accurate answer. A closer examination of these specific interactions showed that in most of these cases the respondent either began to comment spontaneously after the topic was introduced or interrupted the interviewer during the reading of the question with a preliminary response or a question. This appears to suggest that using interviewers to administer the LTH of SA question was an expedient choice, insofar as respondents often required help, clarification, and guidance when formulating their response.

Next, we investigated whether response distributions were influenced by interviewer behaviour. The results are outlined in Table 4.

Table 4 Association between behaviour codes and the LTH of SA distributions: interviewer effects

With respect to the Quality of the Interaction, there were no general effects on the distribution of the LTH of SA, although interviewers differed significantly in the frequency of their inadequate behaviour. In the case of Repair, this behaviour was found to significantly affect the answer distributions: repair activities were much more frequent in those interactions that produced a positive response to the LTH of SA question. Conversely, the interviewers did not differ significantly in the frequency of their repair behaviour. One could argue that guiding, helping, and negotiating with the respondent in the case of a positive answer testifies to the importance of using interviewers when asking about the LTH of SA. In the present study, while the repair activities employed by the interviewers were difficult, they were certainly needed; hence, these findings appear to indicate that the role-dependent activities of the interviewer influence answer distributions. Or, to put it more explicitly: interviewers can affect the prevalence figures of the LTH of SA within a population with mental health issues.

4 Research question 3: Is there evidence of a moderator effect of the interviewer on the association between the LTH of SA and the risk factor suicidal ideation?

The fact that interviewers can influence the response distributions of specific variables in a univariate analysis raises the question of whether the associations between these variables and other variables in a model are also moderated by interviewers. To gain insight into this, we designed a small ‘quasi-experimental’ study based on the NESDA database (see Maciejewski, 2020 for more details on quasi-experimental designs).

4.1 Procedures, variables, and analysis

To examine the moderating effects of the interviewers, we chose the correlation between suicide ideation and the LTH of SA. Suicide ideation is a well-known and extensively studied risk factor for SA, and has previously been shown to be correlated with the LTH of SA (Ten Have, 2009) and with future SAs (Eikelenboom et al., 2018). In the present study, suicide ideation was measured with the five-item Scale for Suicide Ideation (SSI; Beck et al., 1979; Kliem et al. 2017). An example item is: “During the past week, did you have a wish to die?” (0 = None, 1 = Weak, 2 = Moderate to strong). Suicide ideation was operationalized as present in the past week (no/yes) when at least one of the five items gave an indication of suicide ideation (weak or moderate to strong). Additionally, we used the behaviour codes for the quality of the interaction collected for research question 2 as the basis for dividing the interviewers into two groups/conditions in the quasi-experimental design: (a) interviewers who displayed no inadequate behaviour in the Quality of the Interaction (n = 14); and (b) interviewers in whom inadequate behaviour was present (n = 14). For both groups, we considered the complete workload in the NESDA database used for research question 1: the total number of interviews was 1275 for group 1 and 1373 for group 2. For both groups we calculated a point-biserial correlation between the two binary (0/1) variables (comparable with a Pearson correlation). The difference between the two correlation coefficients was tested using Fisher’s Z statistic.
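
A minimal sketch of this analysis is given below. The group data are simulated stand-ins with hypothetical variable names, but plugging the correlations reported in Section 4.2 into the Fisher test reproduces the reported Z statistic (the corresponding p-value of 0.09 is one-sided).

```python
import numpy as np
from scipy.stats import pointbiserialr, norm

# Within each interviewer group, r is the point-biserial correlation
# between suicide ideation (0/1) and the LTH of SA (0/1).
rng = np.random.default_rng(0)
ideation = rng.integers(0, 2, 1275)  # illustrative data only
lth_sa = rng.integers(0, 2, 1275)
r, _ = pointbiserialr(ideation, lth_sa)

def fisher_z_test(r1, n1, r2, n2):
    """Z test for the difference between two independent correlations,
    based on Fisher's r-to-z transformation."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, norm.sf(abs(z))  # one-sided p-value

# Check against the values reported in Section 4.2:
z, p = fisher_z_test(0.214, 1275, 0.164, 1373)
print(f"Z = {z:.3f}, one-sided p = {p:.3f}")  # Z ≈ 1.33, p ≈ 0.09
```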

4.2 Results

With regard to group 1, the correlation between the LTH of SA and suicide ideation in the week prior to the interview was r = 0.214 (n = 1275), while for group 2 it was r = 0.164 (n = 1373). The difference between the two correlations was not significant, although a trend was observed (Z = 1.332, p = 0.09).

5 Discussion

Collecting accurate figures about (the LTH of) SA and its associated risk factors is of the utmost importance from a global health perspective. Understanding the process involved in collecting the required data, and the related error sources, constitutes an important step in explaining the observed differences between data collected in distinct circumstances and heterogeneous populations. The present study has provided a roadmap for studying interviewer-related errors underpinned by a three-step procedure: (1) identify the existence of interviewer-related errors, (2) try to explain these effects, and (3) study the impact of these errors on univariate statistics and associations between variables. Future research and further development of this roadmap should be aimed at these three steps: (1) improving models for identifying interviewer effects that take unit nonresponse into account, (2) developing better and more advanced behaviour coding, and (3) modelling research questions in more complex ways that take interviewer effects into consideration. Future research should also investigate the applicability of the roadmap for monitoring during data collection in suicide research and in other studies concerning sensitive behaviour that require an interviewer. The availability of (a) state-of-the-art software such as Sequence Viewer, (b) direct access to digital recordings of interviews, and (c) direct access to the entered data certainly puts this within reach and can help lift the monitoring of interviewers to a new level. The present study thereby provides empirical evidence on the impact of role-dependent interviewer effects on the disclosure of an LTH of SA. The results are in accordance with previous findings suggesting that these errors occur relatively often, especially for ambiguous and sensitive questions (Krysan and Couper 2003). Notwithstanding the notable strengths of the study, the implications of our findings must be understood in the context of several limitations.

Firstly, this study was conducted within the NESDA research programme, which has an extensive and expensive quality control programme (Total Quality Control: five days of interviewer training, constant monitoring of interviewer performance by fieldwork staff, audiotaping of all interviews, and listening to samples of interviews for each interviewer). This may have resulted in smaller interviewer effects than would have occurred in similar studies without such a stringent quality control programme. Therefore, the effects of the interviewer on the observed statistics shown in this study might not reflect the true extent of the phenomenon.

Secondly, we only considered the role-dependent characteristics of the interviewer. This is a rather narrow approach, considering that previous research has shown that role-independent interviewer effects, such as gender and race, can be important sources of error in the field of suicide research (Samples et al. 2014). A further limitation is that we focused on the LTH of SA, rather than, for example, on the risk factors for SA. However, it is not only the LTH of SA question that is sensitive and somewhat ambiguous, as a number of well-known risk factors like abuse, childhood trauma, domestic violence, alcohol and drug use are also sensitive and often ambiguous. Indeed, one could imagine that interviewer-related error would be even greater in models studying risk factors and SA, and that these errors could significantly moderate the established relations between variables.

Finally, another limitation of this study pertains to the research design. The most rigorous design for studying these errors would be an experimental one; however, despite the importance of investigating this phenomenon, experimental designs that compare certain interviewer behaviours with others are scarce at best. Studies of certain interviewing styles, such as personal vs. formal (Van der Zouwen et al. 1991; Mangione et al. 1992) or suggestive vs. neutral (Smit, Van der Zouwen and Dijkstra, 1997), or studies that manipulate the number of training days provided to interviewers (Billiet and Loosveldt, 1988; Fowler and Mangione 1990), have demonstrated the moderating effect of interviewers on univariate statistics. Although an experimental design is the gold standard for studying interviewer effects, Daikeler and Bosnjak (2020) state in their meta-analysis of interviewer training that such designs are expensive and difficult to incorporate into studies, which explains why they are so scarce. Moreover, the present study used data that were collected for other purposes; as such, like most other studies on interviewer-related error, our results rest on a ‘secondary analysis’ of existing data.

In conclusion, this study has provided a roadmap for studying interviewer-related errors. First, we identified whether interviewer-related error was present; second, we attempted to explain this error by examining – in this specific case – the role-dependent behaviour of the interviewer; and finally, we considered the consequences of this behaviour for both univariate statistics and the associations between variables. Despite the enormous public health interest in collecting valid and accurate data about suicidality, there is a relative dearth of systematic research investigating errors in data collection on suicidality. Adopting the framework of Measurement Error therefore represents an expedient method through which researchers can obtain more systematic knowledge in this area and classify the different sources of error, for the express purpose of understanding why results and figures on the LTH of SA vary across studies and, in turn, how researchers can improve data collection procedures in suicide research.