4.1 Evaluations of data quality
As early as the late 1960s, Balán et al. (1969) concluded, in what was probably the first (non-experimental) evaluation of calendar methods, that the calendar instrument had the following advantages over traditional question-list surveys:
1. It improved the completeness of reports by enabling the interviewer to detect ‘gaps’ in the data provided by the respondent.
2. Inconsistencies in the account could be detected by the interviewer or by the respondent himself, who could then correct his original account.
3. It facilitated recall of distinct events by displaying those events as part of a sequence, which (supposedly) led to a reduction of omissions.
4. It improved the timing of recalled events by allowing the respondent to relate events and dates from different life domains to each other.
Although the study by Balán and his colleagues did not have an experimental design, their observations are still valid today. The expected positive effects of calendar methods on completeness, consistency, recall and timing—as well as the implied effective mechanisms of the calendar—are main issues in evaluations of calendar methods.
Over the years several authors, though not explicitly referring to the Balán study, have tested one or more of the four statements mentioned above. Many have also included more general observations about the data collection process, such as experiences with different modes of data collection (see previous section), respondent-interviewer rapport, and consequences for the duration of the interview. The body of research on calendar methods also includes a few psychometric studies on the reliability and/or validity of data collected with calendar instruments.
Our review of the methodological literature reveals that the quality of the data collected with calendar instruments has been evaluated in multiple ways. The studies can be grouped into three categories:
1. Comparisons of calendar data with similar data collected with more traditional questionnaires (in a split-ballot design or otherwise), but without an external standard of comparison (Becker and Diop-Sidibe 2003; Becker and Sosa 1992; Engel et al. 2001b; Goldman et al. 1989; Yoshihama et al. 2005);
2. Studies in which the agreement between data collected with a calendar instrument and external data sources is measured, but no comparisons are made with regular questionnaires. External data sources include physicians’ records (Rosenberg et al. 1983; Wingo et al. 1988) or reports from earlier waves of longitudinal surveys (Freedman et al. 1988);
3. Experimental studies, which combine the two approaches: the authors assess the agreement between calendar data and external data and also include a control condition in which a traditional questionnaire is used (Belli et al. 2001; Van der Vaart 1996, 2004; Van der Vaart and Glasner 2005).
We will first turn to the first group and present some findings based on indirect comparisons between calendars and traditional questionnaires; after that, the results from the ‘agreement studies’ (groups 2 and 3) will be presented.
4.2 Indirect comparisons between calendar data and regular survey data
The focus of the first group of studies is mainly restricted to indirect measures of data quality, in particular consistency of the data based on logical arguments (e.g., in most societies, there should be no overlap between marriages), completeness of the data (e.g., the detection of “gaps” in employment histories) and patterns in recalled dates (such as the use of prototypical values, i.e., “heaping”). Since these studies do not include an external standard of comparison, they cannot provide direct evidence for the superiority of calendar data in terms of accuracy. However, as will be illustrated below, they do provide some indications that the calendar method overall performs better in collecting recall data than the traditional question-list.
A split-ballot comparison between a calendar method and a traditional questionnaire in a fertility study (Becker and Sosa 1992) indicated that the use of the calendar produced more consistent reports. The calendar method resulted in less superposition of (supposedly) mutually exclusive behaviors: significantly less overlap between advanced pregnancy and contraceptive use was reported in the calendar condition (1.3%) than in the traditional interview (10.3%). Also supporting positive calendar effects, an interaction was found between the recency of the behavior and the effect of the calendar (Goldman et al. 1989). Goldman and her colleagues note that the calendar instrument was especially effective in enhancing recall of contraceptive use at the beginning of the reference period. A similar effect was found in a study of domestic violence victimization (Yoshihama et al. 2005): the higher lifetime victimization rates in the calendar condition were caused by more respondents reporting incidents that took place in the distant past.
Studies that evaluated retrospective data in terms of completeness mostly concluded that the calendar method performs better than the traditional question-list. Calendars were found to be more helpful in reducing the amount of time unaccounted for in the respondent’s life course (Engel et al. 2001b; Goldman et al. 1989). This reduction is likely to be due to the visual nature of the calendar, which makes it easier for the interviewer to detect those left-out periods and ask the respondent about them (Balán et al. 1969). Overall, calendars appear to result in higher numbers of reported events and episodes, which is usually interpreted as a positive effect (Becker and Sosa 1992; Engel et al. 2001b).
Regarding the heaping of reported event dates, which occurs when respondents report prototypical values (e.g., courses starting in September, or “the accident happened two years ago”) instead of the actual values, only a few studies are known to have evaluated calendar effects (see also the next section). In an experimental evaluation, Goldman et al. (1989) found that the calendar method significantly reduced heaping in reports of contraceptive use. While in the traditional questionnaire condition a disproportionate number of women rounded durations to prototypical values of 6, 12, 24, 36, and 48 months of use, this hardly occurred in the calendar condition. It should be noted, however, that this difference was probably enhanced by the coding protocol: whereas interviewers in the questionnaire condition could record durations in either months or years, interviewers in the calendar condition were instructed to always code durations in months.
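To illustrate how heaping can be quantified, the minimal Python sketch below flags the share of reported durations that fall exactly on the prototypical values mentioned above. The data and function name are hypothetical illustrations of ours and do not reproduce the coding used by Goldman et al. (1989) or any other cited study.

```python
from collections import Counter

# Prototypical duration values (in months) around which heaping was observed.
PROTOTYPICAL_MONTHS = {6, 12, 24, 36, 48}

def heaping_share(durations_in_months):
    """Share of reports that fall exactly on a prototypical duration.

    A disproportionately large share at these values, relative to
    neighbouring values, is taken as an indication of heaping.
    """
    if not durations_in_months:
        return 0.0
    counts = Counter(durations_in_months)
    heaped = sum(counts[v] for v in PROTOTYPICAL_MONTHS)
    return heaped / len(durations_in_months)

# Hypothetical reports from the two conditions of a split-ballot study.
questionnaire_reports = [12, 24, 6, 7, 12, 36, 48, 12]
calendar_reports = [11, 23, 7, 8, 13, 35, 47, 14]
print(heaping_share(questionnaire_reports))  # 0.875: strong heaping
print(heaping_share(calendar_reports))       # 0.0: no heaping in this toy example
```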
4.3 Agreement between calendar data and external sources
The second and third groups of studies focus on direct assessments of agreement between the recalled information and external information, in particular concerning the number of events, their characteristics, and the duration or dates of events. Some authors turned to data sources such as doctors’ records (Rosenberg et al. 1983), purchase records (Van der Vaart and Glasner 2005) or population registers (Auriat 1993) to validate the retrospective reports. In the absence of this type of validating information, authors compared calendar data with respondents’ earlier (concurrent) reports from the same longitudinal study (Belli et al. 2001; Freedman et al. 1988; Van der Vaart 1996). It can be argued that comparisons of the latter type are an assessment of (test-retest) reliability rather than of validity (Dex 1995). Nevertheless, it seems reasonable to assume that the amount of error is smaller in concurrent than in retrospective reports, since the former are less affected by memory bias. As illustrated below, the results of both types of studies generally suggest that the calendar method has beneficial effects on data quality.
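As a simple illustration of the two kinds of agreement measures that recur in these studies, the sketch below computes percent agreement for month-by-month categorical reports and a Pearson correlation for a continuous measure reported both retrospectively and concurrently. The data and helper functions are hypothetical and of our own making; they are not code from any of the cited studies.

```python
import math

def percent_agreement(retrospective, external):
    """Share of time units (e.g., months) for which the retrospective report
    matches the external record, as in record-check studies."""
    matches = sum(r == e for r, e in zip(retrospective, external))
    return matches / len(external)

def pearson_r(x, y):
    """Pearson correlation between retrospective and concurrent reports,
    as used for continuous measures such as durations or income."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical month-by-month reports: 1 = behaviour present, 0 = absent.
reported = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
records  = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1]
print(percent_agreement(reported, records))   # about 0.92

# Hypothetical durations (months unemployed): retrospective vs. earlier wave.
retro = [3, 0, 12, 6, 1]
wave1 = [2, 0, 11, 6, 2]
print(round(pearson_r(retro, wave1), 2))      # about 0.99
```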
4.4 Non-experimental validation studies
While non-experimental agreement studies do not compare the performance of the calendar method to that of other methods, they do give an indication of the quality of calendar data. In this line, Rosenberg et al. (1983) performed a record check study which did not include a comparison with another type of questionnaire. Using doctors’ records as the validation measure, the authors report 90% agreement between the calendar data and the records for month-specific use of oral contraceptives; the mean duration of the reference period was 33 months. The agreement between the physicians’ records and the self-reports decreased when the brand and dose of the contraceptive were also considered.
High levels of data quality were also reported in non-experimental longitudinal studies. In their evaluation of calendar questionnaires, Hoppin et al. (1998) report very high test-retest reliability for pesticide use when respondents were contacted by telephone one to three weeks after the original interview. A more detailed study of the test-retest reliability of the calendar method, with eight to fourteen months between the interviews, found very high agreement for reported life event anchors such as marriages or immigration (Engel et al. 2001a). Freedman et al. (1988) compared respondents’ self-reports from two waves (1980 and 1985) of a longitudinal study; in the 1985 wave, a calendar instrument was used. The authors found 87% agreement between school attendance reported concurrently in 1980 and retrospectively in 1985. Part-time school attendance was remembered less well than either full-time attendance or no attendance. Responses about work in 1980 were less consistent: here, the agreement between waves was 72%. The general tendency to underreport unemployment in retrospective surveys was not fully compensated for by the calendar.
Thus, several life course studies that applied event history calendars report relatively high correspondence between retrospective calendar data and matching responses or collateral reports obtained beforehand. Similar results are found in small-scale medical studies on health timelines (e.g., Searles et al. 2000). Although these results suggest positive effects of the calendar procedures on recall accuracy, they lack an experimental design: since there is no control condition, it has not been demonstrated whether these results would have been different in a study without aided recall procedures.
4.5 Quasi-experimental studies
Only three studies so far (Belli et al. 2001, 2004; Van der Vaart 1996, 2004; Van der Vaart and Glasner 2005) have combined the approaches depicted above with an experimental design. The authors conducted split-ballot experiments in which they used calendar instruments in one condition and traditional questionnaires in the other condition. Belli et al. (2001) and Van der Vaart (1996) then validated the data from the two conditions with earlier reports from the longitudinal studies. Van der Vaart and Glasner (2005) used purchase records as validation data. Given the relevance of these studies we will discuss their results in detail below.
In the 1996 study, Van der Vaart (1996, 2004) developed and tested a calendar method (in these studies called a ‘timeline’) that was filled out by the respondents during a face-to-face interview and was subsequently used as a visual recall aid. The calendar was tested in a field experiment on educational careers during the second wave of a longitudinal social survey, comparing the retrospective reports with reports from the first wave four years earlier (the recall period was 4–8 years). Compared to the regular questionnaire procedure, adding the calendar enhanced data quality with respect to the number of educational courses followed, the starting year of the courses, and the entire sequence of types of courses taken. Although the calendar reduced recall error in the dates of courses, it did so for absolute error only: it did not affect telescoping (i.e., the direction of the net error in dates), nor did it diminish the heaping effect in reported dates. The calendar was shown to be most effective when respondents had to perform relatively difficult retrieval tasks in terms of the recency, saliency, and frequency of the target behaviour (e.g., for respondents who had followed a great number of courses).
Comparable results were found by Belli et al. (2001, 2004), who evaluated an event history calendar by means of a field experiment integrated into a longitudinal household study on social and economic behaviours. All interviews in this study were conducted by telephone in 1998, and respondents were asked for retrospective reports on the number and duration of events that occurred in 1996. The quality of the 1998 reports, collected using either a calendar (visible to the interviewer only) or a question-list, was assessed using data from the same respondents collected one year earlier on events in 1996. Compared to the question-list survey, the calendar instrument resulted in significant difference scores, indicating positive effects, for three out of six topics: the number of moves, the number of jobs, and the number of persons entering the residence. No differences in data quality were found regarding the number of persons leaving the residence, the receipt of aid for children, and the receipt of food stamps. For four out of six continuous measures, the calendar method led to significantly higher correlations with the 1996 reports than the question-list: income and the durations of periods of being unemployed, of missing work due to illness, and of missing work due to the illness of others. No differences in correlations were found for the durations of periods of working and of being on vacation. In spite of the effects on correlations, hardly any differences in mean errors were found between the two conditions.
Finally, the experimental record check study by Van der Vaart and Glasner (2005) generally confirmed the findings of the two field experiments presented above. In this study, a calendar was employed as a visual aid for respondents in a telephone survey. Unlike most calendar instruments used in the social sciences, this calendar aimed to enhance the recall of singular events (purchases of pairs of glasses) rather than episodes. The respondents’ retrospective reports, covering a recall period of over 7 years, were compared to database information on three issues: the price of the latest purchase of a pair of glasses, its date, and the number of purchases. Hardly any effects could be established regarding the number of purchases, due to a lack of variation in this measure. Regarding both the price and the date of the purchase, this study demonstrated that:
(a) The calendar had positive effects on recall accuracy, although it did not affect telescoping (the net error in dates);
(b) A more difficult recall task, in terms of saliency and recency, led to greater recall errors;
(c) Employing the calendar was especially effective in enhancing recall accuracy when the respondent’s recall task was relatively difficult, that is, for less salient and less recent purchases.
As will be discussed in more detail below, a downside of this procedure was that the response rate in the calendar condition was quite low. Sending respondents the calendar instrument beforehand probably increased the risk of refusal.
Overall, the results of these experimental studies, which compared the calendar method and the question-list method using external validating data, are mixed but quite promising. They demonstrated that the calendar method exerted positive effects on recall accuracy for different types of data and never led to worse data quality.