1 Introduction

Surveys and Survey Methodology

A survey is a systematic, standardized data collection effort that proceeds mainly by asking questions and recording the responses of the so-called survey participants or respondents.Footnote 1 Sometimes, data are not collected about individuals but about, e.g., households or firms—yet, ultimately it is still humans providing answers. Surveys can be carried out by interviewers, be self-administered by the respondents, or take on a hybrid form in which interviewers are present but, after handing over a tablet or paper questionnaire, are inactive unless needed. More specifically, the survey ‘mode’ is about how a survey is conducted: interviews in person or by telephone, web-based surveys and mail questionnaires (both self-administered), or even multiple modes for one survey across regions, persons, or time.Footnote 2

Surveys have important benefits. They can cover a broad range of topics and can also probe for very detailed information. They can be tailored to the specific research questions for which a survey is conducted. Also, a person’s attitudes, past experiences, or other information not recorded anywhere may be best or even only available by asking them. Finally, surveys benefit from decades of methodological research (see below).Footnote 3 Thus, academic research, government agencies, public opinion research and polling, and the private sector will continue to rely on surveys.

This chapter looks at paradata specifically from the perspective of survey methodology. Most survey methodologists originally come from the social sciences, psychology, and statistics. However, the field has its own terminology, goals, challenges, and thinking. Thus, how we regard and use paradata is hopefully best understood against the background context we provide next.

Surveys are conducted to answer substantive research questions,Footnote 4 mainly through statistical analysis of the collected data. Survey methodology studies the design, execution, and monitoring of surveys as well as the statistical analysis of survey data (Groves et al., 2009, ch. 1.4): what are sources of problems, how can they be measured, and what are methods to address them? Besides the quality of data and of data analysis, costs are a consideration.Footnote 5 A ‘survey organization’ (the entity conducting the survey) needs to be profitable. Costs also relate to the quality, quantity, and scope of a survey as well as to the number of surveys that can be afforded, thereby influencing the number of research questions that can be answered. Survey methodologists care about costs not only because of these real-world constraints but also because costs imply data quality trade-offs.

Survey methodologists often think about the quality of data and of data analysis along two dimensions: the survey participants (representation) and the responses (measurement). Low participation rates, an increasingly grave problem for our field, increase the costs to achieve a fixed number of respondents. Representation, meanwhile, is about differences between respondents and nonrespondents: the less the respondents are representative of the population about which a researcherFootnote 6 wishes to draw inferences, the more likely the data (analyses) are biased. Representation problems can occur at every step. First, there is rarely a complete list of all units in a population. The ‘sampling frame’ is the incomplete list that is available, from external sources or through construction, e.g., a ‘lister’ walking a neighborhood to collect addresses. ‘Coverage’ is about how representative the sampling frame is of the intended target population. Second, only a subset of the units from the sampling frame is actually selected for possible inclusion in the survey (‘sample’). Third, who from the selection ends up in the data is then determined by who is not successfully contacted and then by who decides to not participate (‘unit nonresponse’).

Imagine the eventual survey data as a rectangular, tabular data sheet: each row corresponds to exactly one respondent, and each column corresponds to exactly one survey question. Representation is then about how representative the rows are. Meanwhile, measurement considers each cell and asks whether and how the value in this cell deviates from the true value that one intended to capture.Footnote 7 Such errors can, again, occur at every step from planning to data analysis. First, researchers start from ideas or conceptions (‘constructs’) about the “elements of information” they seek (Groves et al., 2009, ch. 2.2.1, 2.3.1). Some constructs have decent, objective counterparts in the real world, but many are ‘latent’ (‘unobservable’, not directly measurable). ‘Validity’ is about the degree to which the concept in the researcher’s mind matches how respondents understand the corresponding survey question. The precise wording of a question, the question order, and other design factors can affect how participants interpret a question. Second, a response can deviate from the truth because of, e.g., recall error, low motivation, but also interviewer effects on, e.g., sensitive questions. A participant may also be unwilling or unable to respond to a particular survey ‘item’Footnote 8 at all: ‘item nonresponse’. Third, an initial or raw response may be processed or edited: the respondent might change their initial answer later or the interviewer might edit it. If only categorical answers are permitted, the initial response must also be mapped to the given categories. Sometimes, raw information is processed later on by ‘labelers’ (also called coders or annotators): e.g., coding open-text responses into categories or rating respondent behavior based on recordings.

Motivation

We discuss other data types and definitions of paradata in Sect. 2. In short, our own definition (see 2.2) is that paradata are data that are themselves not the survey’s substantive data, that would typically not exist without the particular survey, that can be automatically produced, actively collected, or constructed later on, and that relate to the survey and its processes in one of three ways: they are produced by the survey processes, often as by-products, they describe the survey processes, or they are used to manage and evaluate the survey process(es).

There are many different examples of survey paradata (Sect. 3), with much heterogeneity between them but also ample scope for designing them as needed (Sect. 4). The allure to survey methodologists comes from the goals and challenges laid out above: paradata can be employed to recognize problems in the survey data, to correct for them in the statistical analysis, and to monitor problems in near real time or even to predict them, which is the basis for interventions (Sect. 5). The hope is that paradata capture information about the processes that produce the survey data that would otherwise not be available. Some paradata types had been used in our field long before Couper (1998) first coined the term ‘paradata’.Footnote 9 As survey methodologists, we touch on direct uses of paradata in substantive research only very briefly. However, while paradata may help with problems in the substantive data, they themselves also face and pose challenges (Sects. 6 and 7). Still, the message of our broad overview is a positive one: there are low-cost paradata types that offer a great starting point.

2 Paradata and Other Data

In Sect. 2.1, we review four data types that appear in conjunction with ‘paradata’ in the literature. A key takeaway will be that there are overlaps between them and paradata, as simplified and visualized in Fig. 1. In Sect. 2.2, we discuss paradata: definitions, the relation to ‘process’ and process data, and our own, broad definition.

Fig. 1

Relation among data types. Sizes and overlaps are not proportional

2.1 Substantive Data, Metadata, Auxiliary Data, Contextual Data

Substantive Data

are “what surveys are designed to collect or produce” (Couper, 2017b, p. 4): they correspond largely to the participants’ survey responses but also include, e.g., samples and measurements taken by respondents, interviewers, or sensors (Groves et al., 2009, ch. 2.2.2; Keusch et al., 2024). Unfortunately, ‘data’ are used as both a synonym for substantive data and an umbrella term for all data types (i.e., including substantive data, paradata, metadata, and so on).

Metadata

are, nowadays, “any descriptive information about some object(s) of interest” (NAS, 2022, p. 96). Thus, survey metadata are information about the survey, its components, and the produced data—“the core of [survey] documentation” (Kreuter, 2013, p. 3). Metadata variables are on a more macro level,Footnote 10 exhibiting little variability (ibid.): e.g., the survey’s response rate is one single value. This is illustrated by considering three important categories of metadata specific to surveys:Footnote 11

  1. Descriptions of the survey include its name, an outline of study goals, the survey mode, and the interviewer training handbook.

  2. Metadata on items, often in a codebook, encompass the names of the variables, possible values, interviewer instructions, and question wordings.

  3. Aggregated data and (statistical) summaries can come from aggregating paradata (yielding, e.g., the overall response rate) or aggregating substantive data (e.g., the share of female respondents).

We notice two problems, particularly in relation to paradata. First, there is no agreement on where metadata begin: i.e., how much describing, summarizing, or, crucially, aggregation turns microdata into metadata. Overlaps and inconsistencies are thus inevitable: e.g., information on an item is metadata (about that item), but, in relation to the whole survey, is also sometimes treated as paradata. Second, often in quick succession ‘data’ are introduced to mean ‘substantive data’, and then ‘metadata’ are defined as “data about data”,Footnote 12 implying that the second “data” in “data about data” solely refer to the substantive data—erroneously (see category 3 above and Kreuter, 2013, p. 3). In actuality, there are metadata on substantive data, metadata on paradata, and so on, although these distinctions are rarely made.Footnote 13

Auxiliary Data

Without a universally accepted definition, we follow Kreuter (2013, ch. 1.3): auxiliary data are all data other than the substantive data and thus include paradata.Footnote 14 ‘Auxiliary’ is to be taken literally: supplementary data that are meant to help.

Non-paradata auxiliary data are external to and (overwhelmingly) exist independent of the survey: e.g., administrative data on the same respondents, survey organization employee data on the interviewers, or Census data on area characteristics.

Enrichment with non-paradata auxiliary data can help substantive research (broader or more in-depth information on the very same respondents), reduce respondent burden (fewer questions necessary), and guide survey processes (adjusting contact protocols based on background from the sampling frame). Survey methodology research also benefits greatly: on factual questions, one can determine whether a response is correct by contrasting the survey response with the otherwise unknown true value provided by high-quality, ‘gold-standard’ auxiliary data on the very same individuals. One can then investigate the causes of erroneous answering and offer solutions (see Sect. 1).

Contextual Data

are any information about an event’s or an individual’s context, particularly social, physical, environmental, temporal, or informational context. This also includes information about relevant reference groups (e.g., the family) and abstract concepts (e.g., local social norms and the legal environment).Footnote 15 ‘Context’ goes beyond recording these aspects in isolation, also considering how they interact.

Two Helpful Distinctions

Micro context refers to a specific individual or event, whereas macro context is on a higher level (e.g., regional). Internal context is, e.g., someone’s emotional state, while external context includes local laws.

Subject of the Context

Substantive context is what is predominantly meant by ‘context’ in the larger social science literature:Footnote 16 context pertaining to substantive research questions. For example, for survey respondents asked about their cannabis consumption, the substantive context includes their parents’ attitude and local legality. Survey scientists, however, also consider survey context: the context of conducting surveys and producing survey data, both in general and specific to a particular survey. Macro survey context can be, e.g., restrictions on freedom of expression in the survey’s locale. Such information would come from appropriate auxiliary data and be included in metadata—another example of overlap among data types.

We wish to emphasize the following overlap: most if not all micro survey context is part of paradata.Footnote 17 Micro context can influence behavior at the interview or response level and thus is part of the processes producing a survey’s data: e.g., how sensitive is a question to a particular respondent–interviewer pairing (Tourangeau & Yan, 2007, p. 860)? Much of micro context consists of such latent constructs. Thus, one has to rely on individuals’ self-reports, interviewers’ observations, and other proxy indicators. Further examples are provided in Sect. 3.

Figure 1 highlights some overarching results of Sect. 2.1. First, the data types do overlap. Second, the micro–macro consideration is useful but does not distinguish data types conclusively. Third, context in its various conceptions is part of all data types and of paradata in particular.

2.2 Paradata Definitions

Above, we discussed a first pitfall for grasping paradata: overlaps. A second challenge is that definitions in older, seminal works do not fully reflect the current understanding; there is also still no universally accepted definition (Couper, 2017b, p. 4). Third, paradata definitions in the literature rest on different definitional bases, and some even require two of them, source and content (e.g., West, 2011, p. 1; McClain et al., 2019, p. 199), perhaps because of each base’s weaknesses. For each of the three definitional bases (italicized boldface), we give example definitions and discuss some weaknesses below.

Source Paradata are “captured during the process” (Kreuter, 2013, p. 3). Sometimes the by-product or automated nature is emphasized (e.g., Couper, 2000, p. 393 and Roßmann & Gummer, 2016, p. 313).

However, substantive data are also captured during the survey process, and some paradata variables are not captured then but derived later (see Sect. 3). The by-product or automatic nature is missing from, e.g., interviewer observations.

Content Paradata are “describing” or “about” the process (Couper, 2000, p. 393; Nicolaas, 2011, p. 1).

Yet, some paradata are themselves not about any process directly: e.g., raw audio recordings, observed neighborhood characteristics, or whether respondent and interviewer have the same gender (as an aspect of their interaction).

Use Paradata are “used to manage and evaluate the survey process” (Couper, 2017b, p. 4f. on Groves & Heeringa, 2006).

However, sampling frame information and other auxiliary data are also used to “manage and evaluate the survey process”, and substantive research, too, employs paradata. Taken literally, absolutely nothing would be paradata unless and until it has been actually used to “manage and evaluate”.

Process

is a common refrain among paradata definitions. We find the singular ‘process’ misleading: it may be the reason why some equate the whole survey process with only the field phase, the data collection, or even only the interviewing process (see Couper, 2017b, p. 4, on paradata’s narrow origins). Thereby neglected processes include the design phase, postprocessing (editing, labeling, and coding), and two repeated processes: recruitment and the question–response process. The latter itself comprises comprehension, retrieval, judgment/estimation, and reporting processes (Groves et al., 2009, ch. 7.2). All these processes, and their complex relations, influence the survey as a whole and a specific cell’s value in the released substantive data.

From authors of other chapters, we learned that some of their fields struggle with how exactly the terms ‘process data’ and ‘paradata’ relate. In our field, there is near-universal agreement that all paradata are process data.Footnote 18 The reverse question is less settled: some disagree that all process data are also paradata (e.g., Lyberg, 2011, p. 8), whereas some agree, although they often equate the terms just for their paper (e.g., Kreuter et al., 2010a, p. 282 and 286). The former do not provide counterexamples. Unfortunately, neither side defines ‘process data’.

We surmise that some processes, happening in temporal or spatial proximity to survey processes, produce process data but not survey paradata:Footnote 19 e.g., internal processes such as the survey organization’s human resources, or the processes of processing, analysis, and algorithmic decision-making (Enqvist, 2024) on the released substantive data.

Our Definition

is meant to reflect the heterogeneity of what paradata are and of what is seen as paradata. It synthesizes existing definitions. Survey paradata are data

  1. that are themselves not the survey’s substantive data, and

  2. that would typically not exist without the particular survey, at least in the particular form available, and

  3. that were automatically produced, actively collected, or constructed later on, and

  4. that relate to the survey and its processes in at least one of three ways:

     a. Data produced by the survey processes, often as by-products

     b. Data describing the survey processes, including proxies for unobserved constructs and (micro-)contextual information about the survey processes

     c. Data used to manage and evaluate the survey process(es).

3 Paradata Examples

Within each category (boldface), we usually first present primary, raw paradata and then some ‘derived variables’, i.e., variables created from the former or from other data sources. The categories are meant to facilitate understanding; they partly overlap.

Timing

is first captured as time stamps (time and date) from which much can be derived: on which day of the week is the interview, is it a holiday, or how much time has passed since the start of the field phase, the last interview, etc. Response times are how long it takes a respondent (or respondent–interviewer pairing) to complete a specific item in a particular survey (Matjašič et al., 2018); these times add up to the interview duration.
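
As a minimal sketch, assuming hypothetical column names and made-up values, such derived timing variables can be computed from two time stamps per item:

```python
import pandas as pd

# Hypothetical item-level time stamps: one row per respondent-item pair.
stamps = pd.DataFrame({
    "resp_id": [1, 1, 2, 2],
    "item": ["q1", "q2", "q1", "q2"],
    "shown_at": pd.to_datetime([
        "2024-05-06 10:00:03", "2024-05-06 10:00:41",
        "2024-05-07 18:12:10", "2024-05-07 18:12:55",
    ]),
    "answered_at": pd.to_datetime([
        "2024-05-06 10:00:38", "2024-05-06 10:01:02",
        "2024-05-07 18:12:52", "2024-05-07 18:13:20",
    ]),
})
field_start = pd.Timestamp("2024-05-01")

# Derived timing paradata.
stamps["response_time_s"] = (stamps["answered_at"] - stamps["shown_at"]).dt.total_seconds()
stamps["weekday"] = stamps["shown_at"].dt.day_name()
stamps["days_in_field"] = (stamps["shown_at"] - field_start).dt.days

# Item-level response times add up to the interview duration.
duration_s = stamps.groupby("resp_id")["response_time_s"].sum()
```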

Call Records

are kept about prior contact attempts for each sampled unit. Note that survey scientists call contact attempts ‘calls’, regardless of survey mode. Together with each call’s outcomes (disposition codes: noncontact, rescheduled, completed, …; reasons for refusal), they are also termed contact history data. Recruitment phase data are the web survey analogue (McClain et al., 2019, p. 200f.). Much information can be derived: e.g., a unit’s current status; level-of-effort measures (Olson, 2006, p. 744f.); contact sequences (Durrant et al., 2019); and response histories in panelsFootnote 20 (Kreuter & Jäckle, 2008).
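
A hedged sketch of deriving some of these variables from a call-record file, with made-up disposition codes and column names:

```python
import pandas as pd

# Hypothetical call records: one row per contact attempt ('call'), regardless of mode.
calls = pd.DataFrame({
    "unit_id": [17, 17, 17, 23, 23],
    "call_no": [1, 2, 3, 1, 2],
    "disposition": ["noncontact", "rescheduled", "completed", "noncontact", "refusal"],
})

per_unit = calls.sort_values(["unit_id", "call_no"]).groupby("unit_id").agg(
    n_calls=("call_no", "size"),                              # level-of-effort measure
    current_status=("disposition", "last"),                   # unit's current status
    contact_sequence=("disposition", lambda s: "-".join(s)),  # input for sequence analysis
)
```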

Audio, Verbal, or Voice Paradata

comprise recordings and features automatically extracted in real time. Derivable variables include pitch, speed, disfluencies—particularly their levels, changes, and the respondent–interviewer similarity; overspeech; and whether a question was misread by the interviewer (Jans, 2010; Conrad et al., 2013; Olson & Parkhurst, 2013, ch. 3.3.5).

Location Paradata

can come from, e.g., GPS (Edwards et al., 2017, ch. 12.3), other devices (Keusch et al., 2024), or IP addresses (Felderer & Blom, 2022). Interviewer travel distance and patterns or whether the respondent was on the move during the interview are examples of dynamics that can be derived.

Device Paradata

mainly concern web surveys: e.g., device type (PC, smartphone, tablet), operating system, and browser settings (Callegaro et al., 2015, ch. 2.4.2.2).

Human Interface/Input Device Paradata

mostly come in two forms: Keystroke data log each key pressed by the interviewer and the respondent. For example, sequences and how often the help/back/delete keys were pressed can be derived. Mouse tracking captures a computer mouse’s movements and clicks, yielding timed sequences of coordinates and events. They allow the calculation of distance traveled by the mouse cursor, deviation from the direct path, velocity, and acceleration, as well as hovers over response options (Kieslich et al., 2019; Fernández-Fontelo et al., 2023). Both forms inform about navigation, idle times, and whether and when responses were changed and what the previous answer was. Analogues for smartphones and tablets have been developed (Schlosser & Höhne, 2020).
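
For illustration, a simplified sketch of how distance traveled, deviation from the direct path, and velocity might be computed from a timed sequence of cursor coordinates (real preprocessing is more involved, see Sect. 4):

```python
import numpy as np

# Hypothetical mouse-tracking log for one survey page: time (ms) and cursor position (px).
t = np.array([0, 120, 250, 400, 560])
x = np.array([10, 40, 90, 160, 300])
y = np.array([500, 480, 430, 380, 300])

steps = np.hypot(np.diff(x), np.diff(y))        # length of each movement segment
path_length = steps.sum()                       # total distance traveled
direct = np.hypot(x[-1] - x[0], y[-1] - y[0])   # straight line from start to end point
deviation = path_length - direct                # excess over the direct path
velocity = steps / (np.diff(t) / 1000.0)        # pixels per second, per segment
```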

Interviewer Observations

can be about, e.g., the neighborhood (signs of vandalism), dwelling (the presence of children or whether interviewer access was blocked by a gate), person, or interview (interruptions by children). Interviewer ratings are evaluations of, e.g., the respondent’s interest, effort, or satisfaction, and the interviewer–respondent interaction (Kirchner et al., 2017; Jacobs et al., 2020).

Respondent’s Ratings and Self-Ratings

mirror interviewer ratings. Respondents are either explicitly prompted for their ratings or can provide information about their particular survey in an ‘open comments’ section at the end.

Interviewer Characteristics

can be either fixed or varying. The former are sometimes seen as paradata: sociodemographics, position or experience in the organization, etc. (via employee data); or attitudes, traits, education, skills, and years working as an interviewer (via interviewers answering a separate questionnaire). Varying characteristics need to be calculated: the number of prior calls or completed interviews on the same day or during the field phase overall, time since the last interview, etc.

Few fixed respondent characteristics are widely considered paradata, except for some attitudes (about being interviewed, scientific surveys generally, this survey’s topic) and prior survey experience (Matthijsse et al., 2015; Schwarz et al., 2022, ch. 2). Varying respondent characteristics are discussed under Interactions below.

Survey/Interview Characteristics such as incentives, recruitment strategies, and offered mode can vary: across units, time, or in multicountry efforts.Footnote 21

Item Characteristics inevitably differ between items: e.g., length, response options, and topic. But even a particular item may be different across respondents: e.g., because of adaptations based on the participant’s prior responses.

Interactions

are often only accessible via proxies or subjective judgments. For each respondent–interviewer pairing, ratings can be captured, the difference in age, attitudes, or language can be derived, or specific aspects of this complex social interaction can be addressed (Bradburn, 2016). For the respondent-survey interaction, reasons for participating (Schwarz et al., 2022, ch. 2.2) or whether it was conducted in the respondent’s native tongue are examples. The respondent-item interaction, too, contains subjective aspects, e.g., item sensitivity or trouble understanding, and objective aspects, e.g., the number of all or of similar questions answered before.

Micro Survey Context

partly overlaps with observations, ratings, and interactions. Adding to Sect. 2.1, we highlight some further latent constructs (italicized) and respective proxies. Perceived level of privacy (Yan, 2021, p. 120): Was the respondent at home or in a public space? Who else was present: a boss, spouse, or children? Trust and interviewer–respondent rapport (Sun et al., 2021): Was it always the same interviewer, in panels or in continued recruitment attempts (Kühne, 2018)? Engagement and effort: Did the respondent multitask? Did they look up information in documents?

Design PhaseFootnote 22

paradata include changes made after pretesting a survey (see Sect. 5). Online comments are a simple tool for volunteering information: e.g., respondents about comprehensibility, offensiveness, or other issues with a question, or experts about design flaws (Callegaro et al., 2015, p. 105, 109).

Editing/CodingFootnote 22 paradata about each cell of the substantive data can record whether the value came from the respondent or a labeler (Sana & Weinreb, 2008). More detailed information for the latter case includes the labeler id or the rate of agreement among multiple labelers looking at the same cell.

Miscellaneous

Video recordings, eye-tracking measures, and brain activity data are rare because of equipment requirements (Callegaro et al., 2015, p. 108f.).

Some surveys inquire about providing biosamples, willingness to be contacted again, or allowing record linkage to other data. The respondent’s consent decision or reasons for refusal (Sakshaug, 2013) may be indicative of respondent behavior.

The status or relation of who provided information can be relative to the sample unit (targeted person versus family member) or to the information (information provided about oneself or about someone else). In establishment surveys, the respondent’s position within the company might influence their response.

Sensors, wearables, and apps have received attention recently (Keusch et al., 2024), but only a part of these data are paradata.

4 Collecting, Structuring, and Designing Paradata

Below, we consider some differences in how paradata types are collected and structured. This is not just inherent heterogeneity: many paradata types can also be actively designed. Thus, paradata are not necessarily ‘found’ or ‘organic data’ (Groves, 2011; Japec et al., 2015, p. 843) over which researchers have no discretion.

Resolution

Device-recorded paradata are usually constrained only theoretically, as technical resolution limits are beyond sufficient.Footnote 23 Resolution for response times should be at least 100, better 10 or 1, millisecond(s) (Mayerl, 2013, p. 3): if differences across individuals (the ‘signal’) are on the order of seconds, then measuring only to the full second adds rounding error (‘noise’) and needlessly degrades the signal-to-noise ratio.

Human-recorded paradata should heed survey methodology’s advice on how to construct items and response scales (e.g., Bradburn et al., 2004; Groves et al., 2009, ch. 7).Footnote 24 Here, higher resolution can be detrimental: e.g., contact history systems with very many disposition codes (AAPOR, 2016, p. 71ff.) may produce minuscule counts for some outcomes and errors (e.g., by inexperienced interviewers). If one anticipates combining categories later on, then the design should facilitate this.

Granularity

Ratings may reflect the whole interview, segments, or use, e.g., ‘increasing/steady/declining’ to capture dynamics. Response time can be at the level of questionnaire sections, the page (web mode), item, or even finer (see Components). Item-level analysis is impossible when measurements are only at the page level.

In web surveys, server-side measurements (Callegaro, 2013, p. 262) can always be implemented but are restricted to the page level, and (differential) transmission and loading times are unwanted components of what they capture—in contrast to client-side measurements (i.e., collected on respondents’ devices), whose collectibility, however, depends on user consent and device (Callegaro et al., 2015, ch. 5.3.4.1).

Components: Splitting/Combining

The idealized question–answer sequence comprises an interviewer reading aloud, the respondent’s cognitive processing, their answering, and the entering of the response. Thus, item-level response time can be split into four components, yielding more nuanced information.

Aggregation often refers to the level at which a variable operates (varies) within the complex, hierarchical survey structure: item, respondent, interviewer, call, or interview. Appropriate aggregation reduces informational overload (see Sect. 6) and enables targeted applications: e.g., interviewer-level monitoring and item-level design evaluation (see Sect. 5).

Ex post, one can usually decrease granularity, combine components,Footnote 25 and aggregate. The reverse direction is typically impossible: information not captured is lost forever.

Degree of Automation

Some paradata are always collected automatically (e.g., keystrokes) and some are never (e.g., interviewer observations). For others, such as response times, there are several options:

  1. Manual, ‘active’ time stamps: The interviewer presses a button after having read the question aloud completely and again when the participant starts responding.

  2. General automation: The timer always starts when the question appears on-screen and stops when the response is confirmed.

  3. Specific automation: A voice-activated system recognizes when speaking starts and stops.

Each approach has its advantages. Interviewers (1) know best when they finish and can also ignore nonanswers (e.g., thinking out loud or asking for clarification); they can also record whether a measurement was valid. Meanwhile, automation frees up the interviewer. General automation (2) is unsusceptible to inadvertent button-pressing and nonanswering but combines all interviewer and respondent components into one measurement. Specific automation (3) can separate them but is hampered by overspeech, low volume, nonanswering, and needs specialized software. Combined, semi-automated versions try to reap the benefits of each approach.

Raw Paradata

are sometimes not fit for the intended use.

Preprocessing turns raw mouse-tracking data—a continuous stream of events, coded in computer language—into comprehensible, usable information (Olson & Parkhurst, 2013, ch. 3.3.3, 3.5.1). Specialized software exists (Wulff et al., 2021; Henninger et al., 2022b), unlike for processing tasks needing human labelers such as rating recorded interactions. Other paradata may need trivial (response time = stop time − start time) or no work (interviewer ratings).

Adjustment shall denote the correction for unwanted properties. Raw response times are influenced by characteristics of the respondent (e.g., their general baseline speed; Mayerl et al., 2005), interviewer, item, device, and so on (Couper & Peterson, 2017; Sturgis et al., 2021, ch. 1). Response times become comparable only after accounting for such influences; otherwise, it would remain unclear if someone was speeding on an item or is just generally fast. This is usually done by statistical regression (Couper & Kreuter, 2013), which is only possible after data collection, i.e., not in real time. Further examples benefiting from adjusting for respondent idiosyncrasies include verbal and mouse paradata.
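
A minimal sketch of one regression-based adjustment, assuming item-level response times with respondent and item identifiers are available (the cited literature typically uses richer models, e.g., multilevel ones):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical item-level response times (seconds) in long format.
rt = pd.DataFrame({
    "resp_id": ["a", "a", "b", "b", "c", "c"],
    "item":    ["q1", "q2", "q1", "q2", "q1", "q2"],
    "seconds": [3.1, 5.2, 8.4, 12.9, 2.0, 3.3],
})

# Respondent and item terms absorb general baseline speed and item length/difficulty;
# the residual is the part of (log) response time not explained by them.
model = smf.ols("np.log(seconds) ~ C(resp_id) + C(item)", data=rt).fit()
rt["adjusted_rt"] = model.resid  # positive: slower than expected for this person and item
```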

Capturing Paradata

can take on different, potentially complementary forms. The location of capture is, mostly, about server-side versus client-side in web surveys (see Granularity). The origin of ratings can be from interviewers, respondents, or labelers; of observations from interviewers, recruiters, or listers; and of response times from human interviewers or several devices.

Availability

for all relevant units, and at the time needed, limits applications (see Sect. 5).

When: While interviewers continually observe the process, their evaluations are only recorded after the interview’s conclusion. Other paradata are available in (near) real time—even some derived variables: e.g., idle times and response editing, from keystrokes or mouse tracking. However, a need for nonautomated or time-consuming preprocessing or adjustments impedes real-time interventions (Mittereder, 2019, p. 153). On a different note, early in the field phase the paucity of paradata limits applications (West et al., 2023).

On whom: There are more data for completed cases (e.g., ratings) than for breakoffs (anything collected until termination), contacts (outcome codes; some interviewer observations), and noncontacts (GPS; neighborhood observations) (Sakshaug & Kreuter, 2011).

5 Applications

Paradata are employed for the various survey methodological challenges (see Sect. 1): errors of representation, errors of measurement, and missing data, i.e., missing units/rows (‘unit nonresponse’) and missing responses/cell entries (‘item nonresponse’). The first goal is to recognize errors and the underlying mechanisms. Then, the design of future surveys can be improved. Also, for given survey data, statistical methods such as imputation and weighting can be used to derive unbiased results even from deficient data. The second goal is to monitor data quality and to predict problems: this is the basis for interventions.

Any statistical modeling of behavior is constrained by what paradata are available for all relevant units. Prediction, in addition, is restricted by information available at prediction time (see 4).Footnote 26

Unit Nonresponse

is likely the most studied application. Other than the (very limited) sampling frame information, paradata may be all that is available for respondents and nonrespondents alike (Sinibaldi et al., 2014):Footnote 27 e.g., observations about the neighborhood, dwelling, or individual(s), call histories, and interviewer characteristics such as their voice (Kreuter & Casas-Cordero, 2010, p. 3; Olson, 2013; Charoenruk & Olson, 2018).Footnote 28

Avoidance of nonresponse bias builds on increased recruitment effort, monetary incentives (Jackson et al., 2020), or the many adaptive survey design strategies (see below), applied to cases predicted to be difficult or important for sample representativity.

Adjustment for nonresponse in the eventual data analysis of the substantive data usually involves some form of weighting based on the response propensity P. To repair nonresponse bias, a paradata variable employed in estimating P must be strongly correlated with both P and the survey variable of interest (Kreuter et al., 2010b). However, a single available paradata variable rarely exhibits enough correlation with both; using multiple variables may help (Kreuter & Olson, 2011).
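
A hedged sketch of the weighting idea, with made-up paradata variables: estimate each unit’s response propensity from paradata observed for respondents and nonrespondents alike, then weight respondents by its inverse.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical file covering respondents and nonrespondents alike.
units = pd.DataFrame({
    "n_contact_attempts": [1, 4, 2, 6, 3, 1, 5, 2],   # from call records
    "urban_neighborhood": [1, 1, 0, 0, 1, 0, 1, 0],   # e.g., an interviewer observation
    "responded":          [1, 0, 1, 0, 1, 1, 0, 1],
})

X = units[["n_contact_attempts", "urban_neighborhood"]]
propensity = LogisticRegression().fit(X, units["responded"]).predict_proba(X)[:, 1]

is_respondent = (units["responded"] == 1).to_numpy()
respondents = units[is_respondent].copy()
respondents["weight"] = 1.0 / propensity[is_respondent]
# Weighted analyses of the substantive data then use this weight
# (in practice after diagnostics, trimming, and/or calibration).
```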

Panel dropout of participants between waves of a panel survey can be studied with prior waves’ paradata: comments (McLauchlan & Schonlau, 2016); response behavior and speed (Roßmann & Gummer, 2016); interviewer observations (Plewis et al., 2017); and habitual late responding (Minderop & Weiß, 2023). Breakoffs are more frequent in web mode, on mobile (versus PC) and nonpreferred devices, and preceding response behavior such as speeding and instability is predictive (Mittereder, 2019; Couper et al., 2017; Chen et al., 2022; Mittereder & West, 2022).

Coverage Error

can be addressed in two ways (Eckman, 2013). First, two sources can be compared: e.g., the sampling frame versus a ‘lister’ walking a neighborhood to collect addresses and contact information, yielding flags, additions, and deletions of units. The degree to which self-reports, sampling frame information, and interviewer observations agree on survey inclusion criteria indicates their accuracy. Second, whether the particular circumstances affected sampling frame creation is of interest: e.g., duration, weather, time, location data, lister’s or interviewer’s discretion, and edits. Device paradata can inform about the error that would be introduced if survey participation required apps available only for some smartphone models (Couper et al., 2017, ch. 7.2). Similarly, to increase representativeness, some online panels have offered free devices and Internet access to those lacking them (Blom et al., 2017). Paradata on who was such an ‘offliner’ allow studying such programs’ success regarding participation and improving substantive results (Cornesse & Schaurer, 2021; Eckman, 2016).
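
For the first approach, comparing the frame with a lister’s independent listing reduces to simple set operations; a sketch with made-up addresses:

```python
# Hypothetical frame and lister listing for one segment.
frame = {"12 Oak St", "14 Oak St", "16 Oak St", "1 Elm Rd"}
listing = {"12 Oak St", "14 Oak St", "3 Elm Rd", "5 Elm Rd"}  # from walking the neighborhood

additions = listing - frame   # addresses the frame misses (possible undercoverage)
deletions = frame - listing   # frame entries not found on the ground
matches = frame & listing     # agreement between the two sources
```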

Errors of Measurement

(Yan & Olson, 2013) and Item Nonresponse are often studied jointly, as both concern cell entries in the substantive data (error-prone and missing entries, respectively).

Paradata measure or proxy behaviors, context, and mechanisms that influence these data quality aspects: device (Lugtig & Toepoel, 2016); multitasking (Sendelbah et al., 2016; Höhne et al., 2020b); regional context (Purdam et al., 2020); rapport (Sun et al., 2021); consistency of related answers (Revilla & Ochoa, 2015); uncertainty, slow or fast responding, changing of responses, soliciting help, interviewers misreading (Yan & Olson, 2013); ratings (Holbrook et al., 2014; Olson & Parkhurst, 2013, ch. 3.3.6); verbal paradata (Jans, 2010); reasons for participating such as incentives (Matthijsse et al., 2015; Schwarz et al., 2022); and interviewer characteristics influencing the sensitivity of a specific question (Peytchev, 2012). Respondent self-reports provide additional information to that already contained in other paradata (Revilla & Ochoa, 2015; Höhne et al., 2020a).

Adaptive Survey Design

(ASD) and Responsive Survey Design (RSD)Footnote 29 are popular, mostly to lower costs and increase data quality (e.g., Wagner et al., 2012). One perspective is that the harder a unit is to recruit, the more similar it presumably is to nonrespondents (Olson, 2013, p. 155). Thus, when data collection stabilizes, i.e., primary substantive variables do not change anymore with increased contact attempts, one may move to another RSD phase, tweak protocols, or stop data collection. Real-time interventions in ASD can be appropriate pop-up messages to prevent breakoffs (propensity predicted with paradata: Mittereder, 2019, ch. 6) or slow down speeders (Conrad et al., 2017). Offering clarifications in self-administered surveys based on age-adjusted idle time can improve response accuracy and satisfaction (Conrad et al., 2007). Allowing the interviewer to ask only the most important questions when they predict a high risk for unit nonresponse or breakoff is more drastic (Lynn, 2003).

Monitoring and Evaluation

guide the complex survey processes in real time (Couper, 2017b, p. 10). Dashboards (Mohadjer & Edwards, 2018, p. 263ff.) visualize information for survey managers: in particular, ‘key performance indicators’ of costs, data quality,Footnote 30 and interviewer performanceFootnote 31 (Meitinger et al., 2020).

Performance can improve with feedback to interviewers when data sources conflict (GPS vs. call history about locations: Edwards et al., 2017, ch. 12.3; Wagner et al., 2017, p. 221) or when paradata indicate deviations from protocols (Edwards et al., 2020). Recordings can be reviewed for quality control. Interview durations may inform about deviant behavior (fabricated interviews: Schwanhäuser et al., 2022).

Evaluation of Survey Design

in the evaluation phase (Maitland & Presser, 2018), in pretesting (Couper, 2000; Stern, 2008), by experts (comments about items: Callegaro et al., 2015, p. 105, 109), and during the field phase uses paradata to indicate problems: slow responding, rates of item nonresponse and changed responses, going back to earlier related items, interviewer evaluations, and labeler-coded behavior.

Costs

not being available in real time or in full detail hampers survey administration. Then, estimating cost parameters from call histories may help (Wagner, 2019).

Substantive Research

has used survey paradata, too, but is beyond the focus of this survey methodological chapter. For example, interviewer observations can supplement the substantive data when missing or for quality control: e.g., the presence of wheelchair ramps and cigarette butts in health surveys (West, 2018a, p. 212).

Response times or ‘latencies’ have long been used to study cognitive processes such as the degree of elaboration (deliberative-controlled or automatic-spontaneous processing), abilities, strength of attitudes, and mental availability of information (Johnson, 2004; Mayerl, 2013; Kyllonen & Zu, 2016; De Boeck & Jeon, 2019).

6 Challenges and Some Solutions

Paradata Quality

is understudied in general (West & Sinibaldi, 2013). Interviewer-produced paradata have received relatively more attention (Olson, 2013, p. 159). Automation (see 4) and objective paradata (West & Sinibaldi, 2013, ch. 14.2.3) do not guarantee high(er) quality.

Errors

, including a lack of internal validity or reliability, in interviewer observations have been noted at rates between <10% and 92% (West, 2011, p. 4; West, 2013b). Context (e.g., seasonality, cooperation, sensitivity) and characteristics of the respondent, household, area, and interviewer can influence interviewer observations (West & Li, 2019; West & Blom, 2017, ch. 4.8). Unfortunately, performance may actually decline during the field phase (West & Sinibaldi, 2013, p. 351). Also, interlabeler reliability can be challenging (verbal paradata: Jans, 2010, ch. 2.2).

Missing Values

can be frequent in, e.g., interviewers’ neighborhood observations (Olson, 2013, p. 146) and call records (Wagner et al., 2017, ch. 5.3). Reasons include ambiguous guidelines or cases (Biemer et al., 2013), forgetting when recording later,Footnote 32 and hesitancy to record sensitive information. Also, using multiple devices can hinder completeness (Höhne et al., 2020a, p. 994).

Solutions

for improving general survey quality are also informative for survey paradata quality. For instance, operationalizations of interviewer observations should heed survey methodology’s general lessons better (see 4 and Kreuter, 2018b, p. 534). Systems must facilitate easy, timely entry (ibid.), with errors easy to correct or flag: e.g., interviewers can rate each response time measurement as valid, respondent error, or interviewer error (Mayerl, 2013, step 2.2). Automatic consistency and completeness checks (West & Sinibaldi, 2013, p. 352f.) can compare across data sources (for location: from GPS vs. from call records) or to normal values (unusual response times: West, 2013a, p. 352f.): after all, interviewers can prevent or correct problems best in real time. Frequently recommended are standardized, high-quality, survey-specific training and, periodically or when needed, retraining of interviewers, reminders, checklists, instructions, and the like (e.g., Kreuter, 2018b, p. 534).
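
Such checks can be expressed as simple rules run automatically as paradata come in; a sketch with hypothetical thresholds and column names:

```python
import pandas as pd

# Hypothetical per-case paradata arriving during fieldwork.
obs = pd.DataFrame({
    "unit_id": [1, 2, 3, 4],
    "gps_distance_km": [0.1, 0.2, 35.0, None],    # GPS position vs. address from call record
    "mean_response_time_s": [14.0, 0.4, 22.0, 310.0],
    "neighborhood_obs": ["ok", "ok", None, "ok"],
})

flags = pd.DataFrame({
    "unit_id": obs["unit_id"],
    "location_mismatch": obs["gps_distance_km"] > 5,                # data sources disagree
    "unusual_timing": ~obs["mean_response_time_s"].between(1, 120), # outside normal range
    "incomplete": obs[["gps_distance_km", "neighborhood_obs"]].isna().any(axis=1),
})
# Flagged cases can go back to the interviewer or a supervisor while correction is still easy.
```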

Informed Consent

about paradata collection is an ongoing debate (Connors et al., 2019, p. 187f.). Should respondents be informed—and how?Footnote 33 Do they need to consent (Kunz et al., 2020a, p. 397f.)—at all,Footnote 34 as part of one overall agreement to participate in the survey, or in a separate paradata consent question?Footnote 35 Nonconsent may reduce participation (Couper & Singer, 2013) and bias samples (Felderer & Blom, 2022, p. 878), although much less so when following the emerging best practices (Kunz et al., 2020b). We would like to caution that in the long run a lack of transparency could backfire and reduce trust and participation rates.

Confidentiality

concerns are highest for, e.g., address details, interviewer observations, open-text answers, and recordings (Nicolaas, 2011, p. 15). Selective anonymization is hard for unstructured paradata (Kreuter, 2018b, p. 535). The general approaches for sensitive data (see Shlomo, 2018 and Bender et al., 2020) can be solutions for paradata, too. Paradata may also be used in real time, never leaving a respondent’s device (Henninger et al., 2022a, p. 16). Finally, perceived privacy (Nicolaas, 2011, p. 15), the actual driver of consent and behavior, must reflect reality.

Availability of Paradata

is hampered by organizations guarding internal best practices, resources needed for preparation, warehousing, and documentation (Nicolaas, 2011, p. 16 and 14; Olson, 2013, p. 162), and confidentiality questions (Kreuter, 2018b, p. 535). (Micro) Paradata are released more frequently nowadays but often only contain some of the paradata variables or only the completed interviews. Research about paradata may also stay internal for similar reasons or because improvements are deemed small (Wagner, 2013b, p. 166).

Standardization

may be helped indirectly by the dominance of a few software solutions (web: McClain et al., 2019, p. 201f.).Footnote 36 Yet, in contrast to metadata there are almost no universal paradata standards (Vardigan et al., 2016, p. 445; Couper, 2017b, p. 7). Even within an organization there may be heterogeneity on how to record information: e.g., among interviewers with different experiences at prior employers or between survey methodologists and interviewers. Concrete, clear standards are key. Yet, standardization must leave room for tailoring paradata (Kreuter, 2018b, p. 534): e.g., to specific contexts and needs (see West & Sinibaldi, 2013, p. 347 and 5 on nonresponse adjustment variables having to fit the specific application).

Overwhelming

users is a common worry about paradata (Couper, 1998, p. 45; Kreuter et al., 2010a, ch. 5). This is in part, but not only, about volume.

The informational content per observation is, however, only high for some variables: e.g., an interviewer’s exhaustive free-text call notes may be useful to themselves but overwhelm other follow-up interviewers or managers (West & Sinibaldi, 2013, p. 347). Standardizing and structuring the minimum informational content while making additional notes optional is an easy fix. Many paradata variables are or might be available. Beginning with those that one knows will be used and for which one has applications is a great starting point (West, 2018a, p. 213). Some paradata variables have many data points: e.g., every single mouse coordinate.

Instead of appraising every single, microlevel value, information is aggregated to the appropriate level, reduced in dimension (e.g., by clustering) or to special cases (e.g., outliers), or fed into statistical methods.

Handling Paradata

can seem daunting at first. Yet, the separate files for call records, interviewer characteristics, and item-level paradata can be merged. Levels may be changed by aggregation, or, in files, by ‘reshaping’ between long and wide data formats. All this is facilitated by software and need not be done manually.
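
A brief, hedged sketch of such merging and reshaping, with hypothetical files and column names:

```python
import pandas as pd

# Hypothetical separate paradata files.
interviews = pd.DataFrame({"unit_id": [5, 9], "interviewer_id": [101, 102]})
calls = pd.DataFrame({
    "unit_id": [5, 5, 5, 9],
    "call_no": [1, 2, 3, 1],
    "disposition": ["noncontact", "noncontact", "completed", "refusal"],
})

# Merge files on shared identifiers (long format: one row per call).
long = calls.merge(interviews, on="unit_id")

# Reshape the unbalanced call history to wide format: one column per call attempt ...
wide = long.pivot(index="unit_id", columns="call_no", values="disposition")
# ... and back to long, dropping attempts that never happened.
back = wide.reset_index().melt(
    id_vars="unit_id", var_name="call_no", value_name="disposition"
).dropna()
```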

The structure of many paradata variables can be nontrivial. Where detailed statistical analysis of paradata is needed, hierarchical, complex structures are addressed with multilevel modeling.Footnote 37 Call records are an example of unbalanced data: zero, one, or more observations per unit. Yet, this is only sometimes actually problematic. Then, simple aggregation is often sufficient: e.g., counts per unit. There are also less crude methods that can target patterns as a whole, e.g., in call histories and mouse movement trajectories (Durrant et al., 2019; Fernández-Fontelo et al., 2023).

Heterogeneity

abounds across cases. Some accrue more information (completed interviews) or more observations (repeated calls). One variable may capture different concepts (Olson, 2013, p. 159): attempts made (nonrespondents) or calls needed for success (respondents). In surveys with multiple modes (e.g., ASD and RSD), some variables are not available in each or not directly comparable (Kreuter, 2018a, p. 195).

Information Is Lacking

on many processes (respondents’ and interviewers’ true motivation, states, and behavior) or because of too few cases (e.g., breakoffs and fabricated interviews). Unsupervised learning (James et al., 2021, ch. 12) may help: e.g., clustering for finding deviant interviewer behavior (Schwanhäuser et al., 2022).
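
One hedged illustration of the idea, using a generic clustering routine rather than the cited authors’ actual method: screen interviewer-level indicators and review small, atypical clusters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical interviewer-level indicators derived from paradata:
# mean interview duration (min), share of item nonresponse, interviews per day.
X = np.array([
    [42.0, 0.02, 3.1],
    [39.5, 0.03, 2.8],
    [44.1, 0.01, 3.0],
    [12.3, 0.00, 9.5],   # noticeably short and productive
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)
# Small clusters far from the bulk of interviewers are candidates for review,
# not proof of deviant behavior.
```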

Misalignment of Incentives

between, e.g., interviewers and survey designers or researchers, can be problematic. Yet, studies of, e.g., prevalence and reasons for interviewers ignoring recommendations are rare (call timing: Wagner, 2013a, Experiment 5; travel routes: Tourangeau, 2021, p. 17f.). Remuneration schemes ignoring the time needed to record paradataFootnote 38 clash with expectations for high-quality paradata.Footnote 39 With (perhaps diffuse) monitoring, interviewers may feel the need to demonstrate performance (West & Sinibaldi, 2013, p. 343, 347). Transparency is a partial solution (West & Groves, 2013, p. 373): letting the interviewer know why they get relatively more difficult cases and that good paradata help fair evaluation.

Overall, one may need to convince the interviewers of the value of quality paradata, in general and to themselves, via improved case assignments and improved recommendations (West & Sinibaldi, 2013, p. 348; West, 2018a, p. 212). The same is true for survey managers, listers, recruiters, and other actors on the ground or in decision-making positions (Olson, 2013, p. 161).

7 Discussion

(Un)intended Consequences

of making paradata and paradata collection explicit need further study. Changed behavior among “watched” respondents is plausible but has not been found yet (Kunz et al., 2020a, p. 402) except for participation (Henninger et al., 2022a, p. 5f., 9). When recorded, interviewers produce fewer suspiciously short durations (Olbrich et al., 2022). On a different note, making interviewers predict respondent behavior could yield self-fulfilling prophecies (Eckman, 2017, ch. 3).

Perspectives

on paradata are many and varied. This is true across disciplines, as this volume shows, but also within our field. Most research has started from either the available paradata or established knowledge about surveys. Those on the ground—labelers, field staff, interviewers (Jans, 2010, ch. 2.2; West & Sinibaldi, 2013, ch. 14.2.2.1; West & Trappmann, 2019)—have hitherto untapped knowledge about processes, their own strategies, and working with researchers’ paradata instruments.

Ethics

and critical reflection of potential harm from paradata collection and applications are paramount (AAPOR, 2021). Survey methodology is shaped by mostly benign surveys. In the West or elsewhere, respondents and interviewers from some locales, contexts, or specific groups are rightfully afraid of negative consequences from honest answering or mere participation. Yet, many of the ethical and legal struggles (see also Sect. 6) are not unique to paradata (Conrad et al., 2021, p. 254).

Costs and Trade-Offs

relate questions of data quality to each other and to the real world. Paradata may be by-products—they are not why surveys are conducted—but they are not cost-free: Systems need development and maintenance; recording information (interviewers), monitoring quality (managers), and training (both) take time and effort; paradata must be preprocessed and documented before being released. That paradata are high-quality is not a given, either (see Sect. 6).

Our field does not have a common framework for all survey costs and few empirical studies on utility per dollar. Trade-offs are recognized but hard to quantify. Resources spent on paradata basics (e.g., infrastructure) cannot be spent to improve one survey’s substantive data (quantity or quality) but can benefit many future surveys.

We have discussed many examples and challenges to provide a broad overview, but one important message should not get lost: some paradata types are easy to capture and contain much information relative to the resources that must be invested.

Take a Paradata Perspective When Helpful

Whether everyone agrees that something is paradata (or would agree, had it been created differently) does not diminish its usefulness. Paradata are not an end unto themselves, but “additional […] tools” to help in practice (Couper, 2017b, p. 11), not meant to replace other tools or perspectives. Use may not seem the most important definitional base for paradata (see Sect. 2.2), but, after all, applications are why we capture paradata.