Introduction

Data generated on a massive scale and recorded automatically within digital contexts is increasingly forming the basis for contemporary social science research. One reason for this shift is that the object of study itself is transforming before our eyes: Social practices now take place to a great extent in digital spheres and social fields are extensively digitized. Depending on the viewpoint of the respective paradigm, digital environments allow people to perform everyday practices, to realize choices, to represent themselves, or to interact based on symbols, which makes data from digital contexts interesting for a variety of contemporary social scientists. Another reason for the enduring scientific popularity of these forms of data lies in the methodological advantages often attributed to them when compared with traditional data types, in particular objectivity, unobtrusiveness, unbiasedness, reliability, and so on (see Boyd and Crawford 2012, p. 663). As a result, however, the actual quality of digital process data is a rather neglected topic in the current discourse.

Yet, in the context of data generated by modern socio-technical systems [Footnote 1] (Dolata 2009; Riebling 2018), there are in fact quite serious problems regarding data quality. These problems include phenomena of bias, selectivity, erroneous aggregation, and recursive effects of the research instrument, in short, a whole series of abstract mechanisms that are anything but unfamiliar to social scientists. Like written survey data, digital data can contain erroneous and biased information (see Japec et al. 2015, p. 854 f.; Sen et al. 2019; Baur et al. 2020; Diaz-Bone et al. 2020). Although in the context of written or online surveys it is usually considered essential both to ensure data quality in the process of data generation and to correct potential quality problems, the systematic conceptualization of data quality for digital process data is still in its infancy. Even if numerous individual problems in data quality are known, or at least suspected, to exist in specific research settings, there is still a lack of a systematic understanding of the different quality-distorting mechanisms in the context of digital process data. This deficiency obscures the potentially considerable distortions of substantive findings and may, ultimately, threaten the legitimacy of research based on the analysis of digital process data. This is why the German Research Foundation (DFG 2020) recently emphasized the increasing importance of digital data and the need for quality assurance of these data.

In order to be able to deal with this major challenge of present and future research, social science requires two essential tools, which it in principle already has at its disposal: First, a systematic overview of abstract error mechanisms, and second, ways of inferring quality problems in existing data sets. For the first component, we delineate a systematic process perspective on error-inducing mechanisms of digital process data along three ideal-typical dimensions: Observational design, data generation, and data processing. For the latter component, we outline a mixed methods strategy of post-hoc quality control using a combination of simulation models and statistical identification techniques. Simulations are particularly promising, in that they allow researchers to (i) systematically control the data-generating process under different contextual conditions, (ii) model the various hypothetical error-generating mechanisms, and (iii) screen the resulting data using explorative techniques to determine what they can and cannot tell us about implemented quality impairments. As a practical use case, we will discuss a fundamental and far-reaching issue: The activities of non-human actors (bots). Bots account for the majority of global web traffic and bots of various types are active on most social media platforms, sometimes significantly affecting data structures, which thus poses one of the key challenges for modern quantitative social research and computational socioeconomics (see Gao et al. 2019, p. 94). For analyses based on the assumption that human actors are the entities to be observed, the presence of bots calls into question any supposedly unbiased “measurements”; equally, for approaches that conceive of bots as genuine elements of socio-technical systems, their identification is crucial too (see Venturini and Latour 2010, p. 8). 
Using agent-based simulations, we generate a series of artificial data sets including distortion effects from bots in systematic and controlled ways. Subsequently, we demonstrate the possibilities and challenges of post-hoc control by mobilizing geometric data analysis, an established exemplary technique for identifying issues with data quality. In the conclusion, we discuss how the generalized data quality framework can inform further research and what contribution our proposed combination of simulation and statistical screening can make as part of a much-needed multi-paradigmatic and mixed-method discourse on the quality of digital process data.
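The three steps (i) to (iii) named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation model used in this paper: the activity ranges, the function names `simulate` and `screen`, and the cutoff rule are all hypothetical, and a simple median-based activity screen stands in for geometric data analysis.

```python
import random
from statistics import median

def simulate(n_humans, n_bots, seed):
    """Steps (i) and (ii): generate actors under controlled conditions,
    with bots as the implemented error mechanism.
    All activity ranges are assumptions made for illustration."""
    rng = random.Random(seed)
    actors = []
    for i in range(n_humans):
        actors.append({"id": f"h{i}", "is_bot": False,
                       "contacts": rng.randint(1, 5)})
    for i in range(n_bots):
        actors.append({"id": f"b{i}", "is_bot": True,
                       "contacts": rng.randint(20, 40)})
    rng.shuffle(actors)
    return actors

def screen(actors, margin=10):
    """Step (iii): flag actors far above the median activity,
    without looking at the ground-truth `is_bot` label."""
    cutoff = median(a["contacts"] for a in actors) + margin
    return {a["id"] for a in actors if a["contacts"] > cutoff}

actors = simulate(n_humans=97, n_bots=3, seed=1)
flagged = screen(actors)
truth = {a["id"] for a in actors if a["is_bot"]}
print("recovered all bots:", flagged == truth)
```

Because the gap between human and bot activity is large by construction, the screen recovers every bot here. The value of a simulation lies in narrowing that gap, or varying the bot share, and observing when the screening technique starts to fail.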

Conceptualizing Quality Issues in Digital Process Data

In recent years, more and more researchers have noticed that social sciences’ established quality concepts, such as coverage error and non-response, can also be applied and translated to digital process data (see Diaz et al. 2016, p. 3). Japec et al. (2015, p. 851) relate the concept of total survey error (see Biemer 2010) to digital process data in order to conceptualize “big data total error” (BDTE). Sen et al. (2019) have developed a total error framework for digital trace data inspired by traditional error-generating processes. Thus, after a phase of focusing on the apparent novelties and unique features of digital process data, it is being increasingly recognized that the prevailing quality problems in this context are in fact similar to the problems familiar from (survey) research.

To derive a systematic overview of the possible phenomena, sources, and mechanisms of errors, we employ a process perspective. We understand data production as processes emerging from the genuine interplay of social and technological entities. The systematic conception of the data-generating process and its accompanying errors have been examined both in the context of a process-oriented theory of survey research (Bachleitner et al. 2010) and as a “statistical chain”—that is, as a relational interplay in which different entities, objects, practices, and situations jointly generate data (see Desrosières 2009; Diaz-Bone 2018; Diaz-Bone et al. 2020, p. 319). Problems with data quality (as well as adequate interpretability) arise from the inconsistency of conventions between the different links in the chain of data production. Interviewers, data managers, statisticians, and recipients will differ in their data-related knowledge, definitions, implicit assumptions, practical choices, and their conceptions about the (realist or constructivist) status of the data and its constructs. On this analytical basis, successful attempts have been made, for survey data, to trace the process of data production from start to finish, to theoretically grasp the mechanisms of distortion, and to thereby make them accessible for investigation and, eventually, correction.

Expanding on this analytical tradition, we establish a generalized, ideal-typical model of error mechanisms in digital process data along three analytically separated (yet empirically interacting) dimensions: Observational design, data generation, and data processing. Observational design describes the processes that precede the actual collection of data: The design of the architecture, programming rules (including the specification of data structures and information flow), conventions of handling information and events, and thus ultimately the arrangement of the social environment with and within which users interact and produce data. Data generation addresses the actual process of data gathering and thus the interaction processes between the users themselves and with the technical infrastructure. Data processing refers to the techniques and conventions of data handling employed by social scientists, i.e., practices such as restructuring, editing, and statistically analyzing data.

Observational Design

Analytically, the first step of quantitative data collection is a priori construction, i.e., designing the architecture that will collect the data. Across the various traditions of social science, systematic, controlled, and reflective access to data has been emphasized as an essential prerequisite for ensuring the validity and robustness of findings. To this end, social scientists are, ideally, substantially involved in defining the research question, constructing the survey instrument, and deploying it in the field. This allows the survey process to be controlled, and quality-reducing influences to be anticipated, identified, and managed. In this context, it is essential to enable the accurate reconstruction of problems and their consequences through careful documentation of the survey, the survey process, and the subsequent data management. Therefore, data quality is essentially associated with process control (Lyberg and Biemer 2008). The problems of division of labor in the process of data production and usage are well known from the organization of standardized social research (Desrosières 2009; see Diaz-Bone 2018). The division of labor in the production process of data can lead to the data being used at the end of the production chain (e.g., in secondary analyses) without knowledge of the underlying conventions (Desrosières 2001b; see Diaz-Bone 2016) of data production and data management. In traditional survey research, data problems and the conventions of their treatment can be considered as rather congruent, at least from the perspective of the survey institute and the evaluating researchers.

In the context of digital process data, however, a particularly profound division of labor between providers, computer scientists, and social scientists can be said to prevail. In contrast to survey research, the organization of the data infrastructure on the part of a producer of digital data is considerably less oriented toward scientific interests, instead following considerations of economy or efficiency (see Schmitz et al. 2009). The fact that the data may also be used for scientific purposes is a subordinate criterion, and genuinely scientific quality criteria are, according to this logic, often irrelevant (if they do not actually interfere with the provider’s own quality criteria, such as fitness for use). As a result, the opacity or impenetrability of the statistical chain can be said to be particularly pronounced in situations of this kind and the researcher’s insight to be particularly limited (see Diaz-Bone et al. 2020, p. 324).

This problem is already evident in the elementary aspect of the definition of the units: Desrosières shows that, early in the history of statistics, the assumption of a statistical equivalence space with clearly definable and identifiable units was a necessary prerequisite for subsequent analyses. In order to study the differences between units, we must first assume their unity (Desrosières 2001a). In past decades, in the context of questionnaire research (which made use of the contributions of sampling theory) it was comparatively easy to straightforwardly treat sampled actors as a statistical unit. However, according to what rules is the unit constructed in the context of digital process data? Which events and processes are attributed to an actor, and according to which rules? The scientist should know exactly how entities are defined a priori, what is actually treated (that is, kept) as a valid entity, and the principles on which they are collected in the data-generating process. The problem of unknown design decisions also manifests in the ways in which different users are handled differently, for example, in the form of offers that can be distributed differently with reference to time, specific user groups, or access type, such as when access from one specific residential area (as opposed to another) or via a smartphone (as opposed to a PC) results in a different offer in terms of price (see Morstatter et al. 2013).

Another example of design principles that one would need to know is the pre-structuring of choices and interactions, which can compromise the validity of interactional and network analyses as well as of constructs based thereon: Recommender systems such as those used by Netflix or Amazon, which direct the users’ attention by latent system-immanent choices as part of their customer retention strategy, and other algorithms implemented on similar platforms pre-structure interactions and thereby shape the observation of interactional processes. For example, Twitter uses network indicators that define a position in the timeline, which makes the application of network analyses problematic; such analyses do not only “measure” the actual communication structure, but also the communication as structurally triggered by Twitter itself. The effects of suggestions on matching-based platforms suffer from a comparable problem (see Malik and Pfeffer 2016). Principles of data aggregation represent another example of a priori design decisions that can distort network analyses. Unknown pre-defined aggregation rules make it difficult to reconstruct original data structures, for example network relationships, based on end-user data (see Howison et al. 2011, p. 781). One might think here of Twitter’s mostly opaque agglomeration of retweets: User A retweets a post by C via user B’s timeline. In the final data set, however, it looks as if A has referred directly to B. This can be particularly problematic if conclusions are to be formulated regarding the extent of the polarization of Twitter users, for example. Beyond that, the temporal mode of data storage can distort the processual structure of data and its temporal granularities. For example, several researchers who analyze Reddit data rely on the Pushshift archive, which does not have a consistent way of keeping entries up to date.
Events that may in fact be one minute apart in real time can be stored and misrepresented in the data as co-occurring within one second. Similarly, Facebook’s CrowdTangle stores time-series information about the evolution of likes for a post simply on an hourly basis.
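The retweet example can be made concrete with a short sketch. It assumes a hypothetical record layout (the fields `user`, `via`, and `original_author` are not taken from any platform's API) and contrasts the network a researcher would build from the true origin of a post with the one that results when only the intermediate timeline is recorded.

```python
from collections import Counter

# Hypothetical retweet records: `user` retweets a post written by
# `original_author`, having seen it on the timeline of `via`.
retweets = [
    {"user": "A", "via": "B", "original_author": "C"},
    {"user": "D", "via": "B", "original_author": "C"},
    {"user": "E", "via": "C", "original_author": "C"},
]

def edges(records, target_field):
    """Count directed edges from `user` to the actor in `target_field`."""
    return Counter((r["user"], r[target_field]) for r in records)

def in_degree(edge_counts):
    """Sum incoming edges per target actor."""
    degree = Counter()
    for (_, target), n in edge_counts.items():
        degree[target] += n
    return degree

true_in = in_degree(edges(retweets, "original_author"))
observed_in = in_degree(edges(retweets, "via"))

print(true_in)      # C receives all three retweets
print(observed_in)  # B appears to receive two of them
```

Under these assumptions, B looks like the most referenced actor in the observed data although B authored none of the posts, which is the kind of shift that can distort centrality or polarization measures.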

As socio-technical systems are usually not embedded in scientific fields, problems of this kind are further exacerbated, as these systems are subject to constant (economic, political, legal, etc.) adaptation pressures. For example, with changing business strategies, the architecture of a platform can be repeatedly altered. Such reorganizations of design can lead to significant transformations of data structures over time, thereby inducing structural breaks in the data (think of YouTube’s recent reorganization of how reactions to a video are shown: today, the number of thumbs-down reactions is no longer publicly displayed). Design changes in the choice and interaction options offered to users can lead to far-reaching problems in the use of data. For instance, if profiles on platforms are redesigned, and new categories (e.g., additional gender categories) are introduced, interaction practices change, and so does the data structure. Although this can be compared with survey research, where panel studies can include modified item blocks, more radical changes can be introduced in digital environments, such as a redesign of recommender systems, which becomes all the more problematic (e.g., for construct reliability) the less aware potential data users are of such reorganizations.

Yet, compared with the situation for (institutionally) collected survey data, sociologists are much less likely to be involved in the conceptualization, design, and planning phases of digital process data. Private companies such as social media platforms provide such data without the detailed documentation that would correspond to the information necessary for social scientists. In fact, because a private company is unlikely to have any interest in making these principles public, this can be seen as the default situation in large parts of current research. Thus, today, the assurance of process quality, traditionally so important in empirical research, is systematically hampered by a fundamental division of labor between private-sector providers, computer scientists, and sociologists (see Diaz-Bone et al. 2020, p. 324). Consequently, in the context of socio-technical systems, the procedures and conventions underlying data generation and data structures are not transparent to the scientist for the most part. A lack of understanding of the underlying conventions concerning the definition of data frames and the predefined data recording rules can lead to severe restrictions in the extent to which the data can be meaningfully used. If researchers are not informed about design-specific conditions, a substantial lack of knowledge about the underlying observational design will engender the risk of misjudging the phenomena observed (or of misconstruing opaquely constructed data as neutral, valid “measurement”).

Data Generation

After the construction phase and the design of an observational instrument, analytically, the phase of actual data generation follows, which is traditionally referred to as the “field phase” [Footnote 2]. Here, apart from technically flawed observation and recording processes in the narrower sense, several biasing phenomena can occur.

Again, the first issue—representativeness—is well known from traditional social sciences (see Baur et al. 2020). As a consequence of a social media company’s design principles, the population (of actors and events) under analysis in a digital context can represent a distorted image of the overall population, which is reminiscent of the coverage error and sampling error discussed in survey research (see Sen et al. 2019, p. 5). Further aspects such as the question of which client is used can impact representativeness, as some clients make it possible to avoid tracking; consequently, specific events of specific users remain unobserved, or are not connected with previous information (comparable with the differences between different modes of survey delivery such as landline vs. mobile). Human users may also generate distorted patterns whenever they do not actually constitute a single unit in the resulting data. In abstract terms, this reminds us of the situation where a third party influences a respondent during an interview, so that the resulting interview contains different patterns of response. This can be particularly problematic in the digital context, where different users can share an account (such as in streaming services) or one individual can create and operate a large number of user profiles (such as in dating services).

When it comes to the processes taking place in a socio-technical system, it is possible that entities that may not be defined as elements of the population actually become part of the data (similar to distorted selection frames in survey research). For example, bots can be designed to defraud or manipulate, or they can be used by the operators of the platform itself. Here, the problem does not end in the mere number of such artificial actors. Bot-generated events (such as contact patterns) represent even more problematic distortions for substantive analyses. For example, Schmitz et al. (2012) show that bots on a German dating platform make up less than 3% of all entities, but produce up to 33% of all first contacts and—by simultaneously exhibiting a high level of attractiveness and yet little selectivity—thereby establish a contact pattern in the data that diverges from human mating patterns, thus biasing statistical analyses of contact practices. Likewise, ideologically motivated third parties may try to exert external influence on digital interactions and communication contexts via bots or trolls (see Bratu 2017; Bulut and Yörük 2017; Starbird 2019).
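The gap between the share of bots among entities and their share among events can be illustrated with simple arithmetic. The counts below are synthetic and chosen only to reproduce the order of magnitude reported above; they are not the data of Schmitz et al. (2012).

```python
# Synthetic illustration; all counts are assumptions chosen for readability.
senders = [{"kind": "human", "first_contacts": 2} for _ in range(97)]
senders += [{"kind": "bot", "first_contacts": 32} for _ in range(3)]

def bot_share(items, weight):
    """Share of the total weight that is contributed by bots."""
    total = sum(weight(s) for s in items)
    bots = sum(weight(s) for s in items if s["kind"] == "bot")
    return bots / total

# Unit = entity: every sender counts once.
entity_share = bot_share(senders, lambda s: 1)
# Unit = event: every first contact counts once.
contact_share = bot_share(senders, lambda s: s["first_contacts"])

humans = [s for s in senders if s["kind"] == "human"]
mean_all = sum(s["first_contacts"] for s in senders) / len(senders)
mean_humans = sum(s["first_contacts"] for s in humans) / len(humans)

print(f"bots among entities: {entity_share:.1%}")
print(f"bots among first contacts: {contact_share:.1%}")
print(f"mean contacts, all vs. humans only: {mean_all} vs. {mean_humans}")
```

In this toy setting, three bots out of 100 senders generate 96 of 290 first contacts, so any statistic computed on contact events is influenced far more strongly than the entity share would suggest.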

To further complicate matters, digital process data are characterized by the fact that they are generated in a context that can have a decisive influence on the actors’ practices—for example, by defining and sorting their interaction options. In the social science literature, this is referred to as the obtrusiveness of collection methods (Webb et al. 1966). From survey research, it is well-known that the influence of the data-generating context is particularly problematic when different mode effects occur, such as in a survey that is realized by a computer-assisted questionnaire on the one hand and a written one on the other (Shin et al. 2012). Mode effects can distort digital process data, too, such as when different users are treated differently by an algorithm or by moderators in commercial algorithmic systems. As a consequence, a platform’s users and their practices can be subjected to different representations of the socio-technical environment (e.g., owing to client strategies), resulting in their actions and interactions being affected by different processes or modes. For example, YouTube recommendations are provided depending on a user’s history, as the work of Faddoul et al. (2020) shows for exposure to conspiracy theories.

Conversely, in the context of digital interaction and communication, the observed units of inquiry can react substantially to the socio-technical system, a circumstance that social science has been referring to as reactivity. In addition to and in conjunction with the designers’ influences, the users’ reactions are of particular importance when it comes to potential issues with data quality. Similar to the way different respondents react differently to the same survey instrument, thereby revealing different “response styles” (van Vaerenbergh and Thomas 2013), different usage styles can influence the content of data derived from digital interaction contexts (see Olteanu et al. 2019). Take the example of those Twitter users who are particularly opinionated about a topic and post long threads that span more than ten items. For the researcher, ten separate tweets expressing one position may make that position appear ten times more frequent than an antagonistic statement expressed in just one tweet.
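How strongly the choice of counting unit matters in the thread example can be shown in a few lines. The tweets and user labels are hypothetical.

```python
from collections import Counter

# Hypothetical tweets as (user, stance); u1 posts a ten-part thread.
tweets = [("u1", "pro")] * 10 + [("u2", "contra")]

# Unit = tweet: every item of the thread counts separately.
per_tweet = Counter(stance for _, stance in tweets)

# Unit = user: each user contributes once per stance expressed.
per_user = Counter(stance for _, stance in set(tweets))

print(per_tweet)  # pro: 10, contra: 1
print(per_user)   # pro: 1, contra: 1
```

Neither count is correct in itself; which one is adequate depends on whether the research question concerns the volume of communication or the distribution of positions among users.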

Reactivity may also manifest as strategic reaction towards the technology; for example, privacy concerns can induce self-censorship and practices of subtweeting or mock-retweeting (see Tufekci 2014). A particularly severe problem can result from the users’ (legitimate) control over the data that they created in the past. For different reasons, some users may delete posts or accounts, thus creating gaps in the data that are difficult to track and can lead to misinterpretation of content. In socio-technical systems, strategies emerge in yet another respect: For surveys, it is known that respondents sometimes exhibit strategic response behavior by orienting themselves to norms of social desirability, e.g., in the social situation of a face-to-face interview (see Blasius and Thiessen 2012). Such strategic behaviors are disproportionately more likely to occur in digital interaction contexts. In contrast to surveys, relational digital process data are not created in a context in which actors are socially independent of each other. Rather, the actors observed find themselves in a genuine social situation vis-à-vis other actors and strongly influence each other, thus reciprocally inducing strategic practices (e.g., on dating platforms, see Zillmann et al. 2011). Social desirability can manifest in the form of strategic postings with socially desirable content, which may lead to clickbait strategies. These strategies are part of the digital attention economy, with users of social platforms adjusting their publicly posted metrics in order to be placed higher in lists (referred to colloquially as “playing the algorithm game”). Other users respond reflexively to these metrics that are presented to them. On Twitter, for example, likes, retweets, and displayed “ratios” become the basis for reflexive practices that cannot be understood in the resulting data without understanding their context.

On top of that, in socio-technical systems, the provider can react to the users’ behavior, in a further recursion. For example, provider-side handling of unwanted content can involve the complete removal of specific users’ actions and communication acts from the database, e.g., when providers identify bot activities. But there are more subtle techniques that providers apply, such as “shadow banning,” where users are not banned per se, but their output is made invisible or their opportunities for being perceived are severely limited (“throttled”). If, however, the probability of a user’s actions being seen by other users is artificially decreased, their actions are not adequately represented in the resulting data. Such practices of the “silent truncation of data” have been observed, for example, in cases such as SourceForge and Wikipedia dumps (see Howison et al. 2011). Although the problem of missing data is a familiar phenomenon (here reminiscent of item non-response and unit non-response in the survey tradition), the problem is even more far-reaching for digital interaction data. Within the relational environments of socio-technical systems, data that have been deleted may have elicited reactions (e.g., responses) prior to the deletion. Such relations, however, cannot be observed and the original occasion of a response remains unknown.

When it comes to data dissemination, missing and biased data, again familiar from written surveys, can result from providers being interested in making available only specific subsamples (of entities, events, and their relations) for strategic reasons. Just as survey institutes can provide users with selective samples, selective provision (temporally, spatially, socio-structurally, etc.) of data excerpts will often result in a bias, especially because data that contain relational processes are not suitable for random sampling (see Morstatter et al. 2013; Driscoll and Walker 2014; González-Bailón et al. 2014). Using Facebook as an example, Allen et al. (2021) show that censoring URLs with fewer than 100 public shares results in a biased data set that will overestimate the share of fake news by a factor of 4.
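The mechanism behind such a censoring bias can be sketched with synthetic records. The counts are arbitrary, the assumption that fake URLs are concentrated among widely shared items is made purely for illustration, and the resulting factor is not meant to reproduce the one reported by Allen et al. (2021).

```python
# Synthetic URL records: (public_shares, is_fake). Counts are arbitrary.
urls = (
    [(10, False)] * 60      # many rarely shared, non-fake URLs
    + [(500, False)] * 30   # widely shared, non-fake URLs
    + [(500, True)] * 10    # widely shared, fake URLs (assumption)
)

def fake_fraction(records):
    """Fraction of URLs in `records` that are labelled fake."""
    return sum(is_fake for _, is_fake in records) / len(records)

THRESHOLD = 100  # only URLs with at least this many shares are released
released = [r for r in urls if r[0] >= THRESHOLD]

full = fake_fraction(urls)          # 10 of 100 URLs
censored = fake_fraction(released)  # 10 of 40 URLs
print(full, censored, censored / full)
```

Because the threshold removes mostly non-fake URLs in this setting, the released subset overstates the fake share. The direction and size of the bias depend entirely on how the censored attribute relates to the variable of interest, which the end user typically cannot observe.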

In sum, the problem identified in the context of the definition of observational design is perpetuated. More often than not, the actual data generation process—and the potential issues of data quality entailed—remain unknown to the social scientist (see Baur et al. 2020, p. 220; Diaz-Bone et al. 2020, p. 324). But beyond that, the advantages of digital process data environments, namely the fact that actors proactively access them and perform their practices there (in contrast to surveys), and that recording is highly standardized and automated, go hand in hand with systematic challenges for data quality. Cultural norms and the architecture of platforms can influence user behavior, and vice versa, which may result in severe “measurement” errors (see Malik and Pfeffer 2016), a problem that has also been labelled as “platform affordances error” (see Sen et al. 2019, p. 8). Consequently, the objectivity sometimes attributed to digital process data when compared with survey data is called into question by the complex reactivity relations and the diverse entities involved that underlie socio-technical systems (and that are—again—hardly known to the end-user).

Data Processing

From an analytical perspective, quantitative data can be understood to be further processed after they have been collected. Although the production of digital data is characterized by a specific back and forth between data generation and data processing, it is still true that all kinds of (intermediate) data sets are handled, transformed, and processed in various ways, which can lead to errors that were not originally present. For survey research, we already know that the providers’ (here: survey institutes’) decisions regarding data processing can be unsatisfactory for the end users. For example, an institute’s technocratic quality conventions, which lead to specific respondents being sorted out of the final data, can render specific ideological positions invisible (Barth and Schmitz 2018). In the survey tradition, such mistakes subsequent to data collection have been referred to as “processing errors” (see Groves and Lyberg 2010). Analogously, providers of digital data can also compromise the quality of their data through faulty processing, such as biasing transformations of signals and entities (see Sen et al. 2019, p. 7 ff.). Japec et al. (2015, p. 854 f.) discuss the different steps of (digital) data processing, in which errors can be generated via “creating or enhancing meta-data,” “record matching,” “variable coding,” “editing,” “data munging (or scrubbing),” or “data integration” (linking records across disparate systems). Ideally, a researcher (whose particular methodological perspective and research interest will determine the specific conception of data quality) should be informed about the quality conventions according to which a company processes data internally. For example, the criteria of data cleansing may vary, and so, accordingly, does the set of entities and events that are considered irregular (e.g., bots) and are removed from the system and thus from our observation.
Errors can also be caused by providers when aggregating, selecting, and reducing the data to be forwarded (what Hellerstein 2008, p. 2, refers to as “distillation error”), whereas the end user assumes that the data are of high quality and unbiased. Of course, the end users themselves can also introduce numerous errors in any kind of data through faulty data management, e.g., through the incorrect identification of valid (or invalid) cases or incorrect recoding. In the context of digital process data, there is a more systematic problem that arises from the considerable distance between the links of the statistical chain described above, i.e., between provider and end user (see Diaz-Bone et al. 2020, p. 324). End users do not know the rules according to which data structures have been created during and processed after collection. Yet not knowing provider-based conventions of data processing (e.g., classifications, aggregations, transformations, etc.) can mislead the end user into wrongly assuming that the data structure thus obtained represents an adequate image of the original, underlying data structures and processes (see Venturini and Latour 2010, p. 8; Japec et al. 2015, p. 853). For example, entities can be “concealed in logs” (Van der Aalst 2016, p. 148).

Another, more general problem is that of divergent definitions of units. Working from the assumption of a classical, individual-centered epistemology, social scientists sometimes still request or arrange the data in an actor-centric manner, i.e., as a two-dimensional flat file with human actors as the sole type of entity. However, this may be problematic when the original data structure is non-linear, e.g., relational, and contains multiple entities and interrelations, such as communicating parties and their communication acts nested in multiple and reciprocal hierarchies (see Sen et al. 2019, p. 6; Diaz-Bone et al. 2020, p. 334) [Footnote 3].

Thus, in light of the problems that will occur during this process of data production, any statistical analysis based on data from digital sources can face severe difficulties (see Olteanu et al. 2019, p. 16 f.; Japec et al. 2015, p. 854 f.). Biases may appear in many different situations, for example, in univariate distributions, descriptive parameters, estimates of regression parameters, or network parameters. In sum, when misconstruing (and thus misconstructing) actually underlying complex data relations, inappropriate data management and data analysis conventions can lead to situations where statistical models and substantive conclusions that draw on digital process data can be severely impaired.

Simulation and Post-Hoc Identification

To summarize up to this point: The quality of digital process data can be distorted in different, often unknown ways. Yet what we do know is that digital process data—much vaunted by some for its methodological virtues—are in fact haunted by issues that remind us of classic problems of data generation and data processing, such as selectivity, bias, validity, reliability, objectivity, etc. As outlined in the preceding section, a core difference between survey data and digital process data is that, for the latter, there is scarcely any comprehensive documentation for scientific end users, and scientists have little knowledge of, let alone control over, construction and collection processes. Under such conditions, however, both a priori control and process control of data quality become virtually impossible tasks and the adequacy of statistical analyses based on them must be fundamentally called into question.

For this reason, it is essential to identify data quality problems in the data available—that is, to focus on the possibilities that post-hoc quality assessment offers us. In fact, in addition to the goal of high-quality survey methodology and systematic process control, social scientists have traditionally conducted post-hoc analyses to develop hypotheses about possibly biasing phenomena. Employing descriptive statistics, and exploratory and visualization techniques such as cluster analyses (Bredl et al. 2012), classification models (Biemer 2010), or scaling approaches (Blasius and Thiessen 2015), scientists have been able to accumulate a great deal of insight by examining countless empirical data sets from different contexts. The practical knowledge gained in dealing with empirical contexts and the general phenomena of error mechanisms has proven instructive in asking the right questions of the data and using the statistical models in meaningful ways. Still, the problem of post-hoc control is that the actual error-generating mechanisms cannot be verified, but only assumed to be plausible. Whether, for example, social desirability actually guides actions in a questionnaire is ultimately an educated guess. This is why researchers have also used qualitative (cognitive) interviews with respondents to determine their perceptions of a survey.

Yet, in order to consolidate and systematize the insights gained from empirical data sets, a further methodological tool has proven to be of value in the context of survey data: Simulation techniques have been employed in order to systematically supplement the body of empirical experience. Some authors have worked with simulations that enable a more systematic understanding of data-generating (and error-generating) processes, and of what problems (and of what magnitude) are to be expected under specifiable conditions (Dijkstra et al. 1995; West 2013; McCarthy et al. 2017). Overall, in the tradition of survey research, the abductive interplay of human experience, empirical data, simulated mechanisms, and theory building has enabled social scientists to collectively achieve considerable advances in making quality problems more amenable to regulation.

In contrast to the survey tradition, however, there are not yet as many generally available empirical data sets for digital process data, and there is still a lack of practical experience, systematic comparability between contexts, and theoretical knowledge. Taking our argumentation so far into consideration, one would need to know how different kinds of biasing mechanisms are manifested in the highly aggregated and selective digital data that are provided to end users from the field of social sciences. In order to achieve such systematic knowledge, we propose the implementation of simulation techniques to model the process of data production, including possible biases, and to thereby yield data sets where the error-generating mechanisms are known. Simulation techniques are particularly promising for the objective at hand, as they allow the otherwise merely hypothesized mechanisms of quality impairment to be generated and examined in a controlled manner, which is impossible for any kind of empirically collected data (unless we are dealing with experimental approaches). However, it is important to note that simulation models are not aimed at being complex and realistic. In fact, the majority of phenomena and effects that are known to be present must be consciously kept out of the model, with a few others added parsimoniously and incrementally. It is precisely this controlled approach that underlies the potential of simulations to systematically enrich the stock of empirical experience with a series of artificial datasets.

Simulation models make it possible to recreate the different ways in which the observational design underlying socio-technical systems influences the generation of data. Techniques such as agent-based simulation models (see Jun and Sethi 2008; Macal and North 2009) can simulate both the defined forms of permissible interactions between users and the principles of hierarchizing displayed information, as well as another essential structural feature of socio-technical systems: Mutual reactivity, and thus the relationships between interacting entities. These aspects can be implemented by choosing specific topologies in agent-based simulations, and thus the rules that govern, enable, and restrict interaction and communication. In contrast to a post-facto study of the quality of data, simulation studies enable the systematic exploration of elements of the observational design that would not be observable otherwise. This is crucial in situations where the scientific community is scarcely involved in the construction of the data-generating architecture. Likewise, in the actual data generation process, simulation models can take into account the myriad ways in which interaction in socio-technical systems can actually occur; this represents a powerful way to control and to partition out the concrete effects in which social scientists are interested. In doing so, error mechanisms can be added, such as information gaps that result from the users’ or producers’ clean-up activities. These aspects of the data-generating process can be implemented by specifying the strategies and rules of interaction that the actors follow. Consequently, a variety of distortion mechanisms can be simulated by modelling not only different technically possible ways of interacting but also different socially common ones.
Finally, the simulation approach is useful as it enables us to study distorting mechanisms that may arise from applying specific data processing conventions, such as by transforming the simulated data to different data structures or selecting only very specific elements of the overall process.

In sum, this approach allows us to learn more about biasing mechanisms, and whether we can detect these built-in problems, for example, when we use statistical techniques of post-hoc identification.

A Case Study: Simulating and Identifying Bot Behavior

In order to illustrate the general strategy outlined above, we use the example of bot activity. The distortion effect we try to capture is the influence, in different scenarios, of bots on interaction patterns and on the resulting overall data structures. The population of our simulation model consists of two different types of agents, users and bots, and we assume a series of situations in which actors perceive, contact, and respond to each other in different scenarios, i.e., in different socio-technical systems that produce and (partly) provide data on the users’ profile characteristics and actions. The simulations contain examples of the three aspects of the general principle outlined above: Observational design (the a priori specification of possible interactions based on profile selection), the process of data generation (as a result of the interactions of bots and users), and data processing (where the originally relational data will be transformed to and analyzed as a cross-sectional flat file).

Socio-technical systems differ according to their prevailing interaction structures and norms; the role of bots is, of course, different in each context, depending on the respective type of empirical environment (dating platform, platform for the exchange of political views, etc.). Therefore, we simulate different scenarios, each with a different parameterization of a crucial factor: What are the consequences of bot presence in contexts that differ only in terms of the prevailing interaction structure? That is, bot activity will remain constant over the series of scenarios, but the interaction style specific to the respective socio-technical system will be systematically modified. We model the differences between the scenarios as varying homophily thresholds. In terms of empirical examples, one may compare online dating platforms—where the homophily principle is more relevant—with consumer/business-to-consumer e‑commerce websites such as eBay, where homophily is less prevalent (see Huber and Malhotra 2016; Šćepanović et al. 2017); one may think here, for example, about harvester or spam-bots. Furthermore, it has been shown that homophily is sometimes even used as a strategy for so-called astroturfing bots in order to influence discussions on certain social media and to reinforce specific opinions (Lazer et al. 2018). Although homophily cannot be considered a general principle of all interaction, it can serve as an established starting point for an exemplary, parsimonious model.

The agents are defined as being endowed with some invariant attributes. In reality, these attributes could take on many different forms, such as profile information, publicly displayed affiliations, or social status within an online community. We assign the values 0 or 1 to these invariant attributes for nine variables. For bots, we assume that they have been programmed in such a way that they are widely considered to be attractive or interesting (e.g., by publishing appealing content) in the eyes of a target audience. To instrumentalize this effect in the construction of our bots, we set one part of their profile information to 1, whereas the rest is filled randomly from a binomial distribution. As a result, an average bot is very similar to many human actors (in fact, even more similar than a randomly selected human).
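The construction of agent profiles described above can be illustrated with a minimal sketch. Only the nine binary attributes and the idea of a fixed block of ones for bots come from the text; the number of fixed bot attributes, the probability of drawing a 1, the population sizes, and all function names are assumptions made purely for illustration.

```python
import random

N_ATTRIBUTES = 9       # from the text: nine binary profile variables
N_FIXED_BOT_BITS = 3   # assumption: how many bot attributes are forced to 1
P_ONE = 0.5            # assumption: probability of a 1 in each random draw


def make_profile(is_bot, rng):
    """Return a list of nine 0/1 attributes for one agent."""
    # Each attribute is a single Bernoulli draw (a binomial with one trial).
    profile = [1 if rng.random() < P_ONE else 0 for _ in range(N_ATTRIBUTES)]
    if is_bot:
        # Bots receive a fixed block of ones; the remaining bits stay random.
        profile[:N_FIXED_BOT_BITS] = [1] * N_FIXED_BOT_BITS
    return profile


def make_population(n_users, n_bots, seed=0):
    """Create users first, then bots, each with an id, a type flag, and a profile."""
    rng = random.Random(seed)
    agents = []
    for agent_id in range(n_users + n_bots):
        is_bot = agent_id >= n_users
        agents.append({
            "id": agent_id,
            "is_bot": is_bot,
            "profile": make_profile(is_bot, rng),
        })
    return agents


# Hypothetical population sizes, chosen only for the example.
population = make_population(n_users=270, n_bots=30)
```

Fixing a seed keeps such a run reproducible, which matters once several scenarios are to be compared.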

Further, we assume that interactions must be driven by some form of commitment in order to be sustained. Interaction is not a merely deterministic process that follows from the constellation of two actors’ attributes, and it involves not only manifest (dis)similarities between two actors but also the joint production of shared conceptions (such as complementary role conceptions or complementary hierarchy relations). For a real-world example, we might consider an online media discussion where people congregate around a shared interest or topic. As the interaction continues, it can develop in such a way that participants are no longer in sufficient agreement and thus terminate the interaction. The continuation of the interaction is modeled as being dependent on a given similarity as well as on the participants’ willingness to adapt to the ongoing changes caused by the interaction, that is, on additional similarity regarding attributes that can change during an interaction. Constellations in which no shared meanings are established will cease over the course of the interaction and the resulting processes of reciprocal classification; this pattern of communication cessation should be particularly typical for bots owing to their severely restricted communicative competencies. This is why our bots are simulated in such a way that they cannot adapt to the communication offers of human users.

The actual simulation process is set up as follows: First, an agent (ego) is randomly drawn from the pool of all agents. If ego is not currently interacting with another agent, a second agent (alter) is drawn from the entire population. If ego and alter have not interacted in the past, the Jaccard distance between all fixed attributes is calculated.Footnote 4 If the Jaccard distance is less than or equal to a threshold value a, alter and ego are said to be interacting. The threshold, which lies between 0 and 1, represents the degree to which interactions are structured by homophily: Values closer to 0 mean that an interaction requires greater similarity (homophily) in order to take place. Following this step, a new agent is drawn from the pool, and we check again whether the agent is part of an interaction. If not, the process above is repeated until an agent with a raised interaction flag is drawn or the pool of available agents is empty. In the latter case, the next round begins by returning all agents to the pool. If the agent that has been drawn is marked as interacting, the interaction proceeds: In this case, the agent whose turn it is (ego) selects an element of its binary profile vector in a specified range and flips that bit, i.e., turns a 0 into a 1 or vice versa. After this change of a specific bit, the interaction partner (alter) can reciprocate by setting its corresponding bit to the same value as ego’s. However, this reciprocity can only be exercised by non-bots. Therefore, if alter is not a bot, alter will reciprocate with a probability of pr, determined by a global parameter.Footnote 5 After this change, the profile vectors are compared again, and the interaction is only continued if the Jaccard distance still falls below the threshold a.
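The core of this procedure, the distance check and a single interaction turn, can be sketched as follows. This is a simplified illustration rather than the implementation used for the reported results: the handling of two all-zero vectors, the use of "less than or equal" when re-checking the threshold, the index range of changeable bits (`mutable`), and all function names are assumptions. The parameter `p_r` corresponds to the reciprocation probability written as pr in the text.

```python
import random


def jaccard_distance(u, v):
    """Jaccard distance between two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    if either == 0:
        return 0.0  # assumption: two all-zero vectors count as identical
    return 1.0 - both / either


def may_start(ego, alter, a):
    """An interaction starts when the distance does not exceed threshold a."""
    return jaccard_distance(ego, alter) <= a


def interaction_step(ego, alter, alter_is_bot, a, p_r, mutable, rng):
    """One turn: ego flips a bit, alter may mirror it, then a is re-checked.

    `mutable` holds the indices that may change (hypothetical name).
    Both vectors are modified in place.
    Returns True if the interaction continues.
    """
    i = rng.choice(list(mutable))
    ego[i] = 1 - ego[i]                  # flip 0 -> 1 or 1 -> 0
    if not alter_is_bot and rng.random() < p_r:
        alter[i] = ego[i]                # only non-bots can reciprocate
    # assumption: "<=" is used here, as in the start condition
    return jaccard_distance(ego, alter) <= a


ego = [1, 1, 0, 1, 0, 0, 1, 0, 1]
alter = [1, 1, 0, 0, 0, 0, 1, 0, 1]
# Four shared ones out of five positions with at least one 1: distance 0.2
print(may_start(ego, alter, a=0.3))
```

Because a bot never mirrors the flipped bit, every turn in which ego changes a bit can move the pair further apart, which is the mechanism the text relies on to shorten bot interactions under strong homophily.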

Furthermore, during every interaction, the bot might terminate the interaction (with a probability that depends on the base chance pt times the duration of the ongoing interaction); this simulates a bot’s tendency to follow a specific pattern, that is, trying to achieve a desired outcome, such as getting (exclusively human) users to click on a link. Finally, we include the users’ competency to unmask bots, which depends on the experience gained in their prior bot interactions. The rule states that a user can employ one test for every interaction they have had with a bot: Each test gives the user a 25% chance of stopping the interaction. Each simulation was run for n = 5000 interaction steps.
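One possible reading of these two termination rules is sketched below. Treating the tests as independent draws, so that a user with k prior bot interactions stops with probability 1 - 0.75^k, is our interpretation and not stated in the text. The same holds for capping the bot's termination probability at 1. The value chosen for `p_t` (written pt in the text) is hypothetical.

```python
def bot_termination_probability(p_t, duration):
    """Base chance p_t scaled by the interaction's duration, capped at 1."""
    return min(1.0, p_t * duration)


def unmask_probability(prior_bot_interactions, test_chance=0.25):
    """Chance that at least one of k independent tests succeeds.

    The independence of the tests is an assumption of this sketch.
    """
    return 1.0 - (1.0 - test_chance) ** prior_bot_interactions


# Hypothetical base chance; the actual value is not given in this section.
print(bot_termination_probability(p_t=0.05, duration=4))

# With 1, 2, and 3 prior bot interactions: 0.25, 0.4375, 0.578125
print([unmask_probability(k) for k in (1, 2, 3)])
```

Under this reading, users with no prior bot contact cannot unmask a bot at all, and the chance of unmasking rises quickly with experience.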

Subsequently, we selected three datasets for an exemplary statistical post-hoc analysis. In this way, we simulate not only data-generating and error-generating mechanisms but also the very situation in which sociologists usually find themselves, namely, having to work with highly aggregated, selective data and no definitive information on how these data are generated and which quality problems they might entail. In the specific case, we are interested in whether bot presence as generated in the simulated data sets is easily noticeable and to what extent cases are falsely classified as being problematic.

In each case, the resulting relational data were aggregated to a data format still common in the social sciences: A two-dimensional data extract, which is composed of the profile variables mentioned above and augmented by selected continuous variables expressing the number of unique partners, the overall sum of interactions in which this agent participated, and the average interaction duration by contact partner. To select the sub-set of illustrative datasets, we apply the homogeneity criterion (Rosenberg and Hirschberg 2007) to both profile and interactional variables. This criterion serves to select scenarios that differ in their underlying multivariate data structure (and thus in the difficulty of their statistical classifiability). In doing so, we obtain three datasets that differ with respect to their similarity thresholds (a = 0.3, 0.65, and 0.8, with 0 being the maximum possible and 1 the minimum possible similarity)Footnote 6:

  • Scenario 1: Strong homophily selection (a = 0.3)

  • Scenario 2: Moderate homophily selection (a = 0.65)

  • Scenario 3: Low homophily selection (a = 0.8)
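The aggregation of the relational data into a flat file, as described before the list of scenarios, can be sketched as follows. The event-log format (one pair of agent ids per interaction event), the column names, and the function name are assumptions for illustration. The average duration per contact partner is computed here as the number of events divided by the number of unique partners, which is one plausible operationalization.

```python
from collections import defaultdict


def to_flat_file(profiles, events):
    """Collapse an interaction log into one row per agent.

    profiles: {agent_id: [0/1, ...]}
    events:   list of (agent_a, agent_b) pairs, one per interaction event
    """
    partners = defaultdict(set)
    n_events = defaultdict(int)
    for a, b in events:
        partners[a].add(b)
        partners[b].add(a)
        n_events[a] += 1
        n_events[b] += 1

    rows = []
    for agent_id, profile in sorted(profiles.items()):
        k = len(partners[agent_id])
        total = n_events[agent_id]
        row = {"id": agent_id}
        # Profile variables become columns v1, v2, ...
        row.update({f"v{j + 1}": bit for j, bit in enumerate(profile)})
        row["n_partners"] = k
        row["n_events"] = total
        # Agents without any interaction get 0.0 instead of a division by zero.
        row["mean_events_per_partner"] = total / k if k else 0.0
        rows.append(row)
    return rows


profiles = {0: [1, 0], 1: [0, 1], 2: [1, 1]}
events = [(0, 1), (0, 1), (0, 2)]
for row in to_flat_file(profiles, events):
    print(row)
```

The resulting rows no longer record who interacted with whom or in what order, which is exactly the loss of relational information that the post-hoc analysis has to work around.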

For identifying bots in the artificially generated datasets, we mobilize geometric data analysis as an exploratory tool, which has proven useful in the context of error identification in survey data (see Blasius and Thiessen 2012). Explorative methods of this kind are appropriate when it can be assumed that error-inducing mechanisms are involved and that these errors might manifest in the resulting data, but when it is not known a priori which errors are actually present and in which ways they will be reflected in the data. The solutions yielded by geometric data analyses represent multivariate associations in the data, in the form of two kinds of interrelated spaces: Spaces of characteristics, which enable researchers to interpret the meaning of the variables analyzed, and spaces of individuals, which can indicate and visualize multivariate outliers.

The idea is that a researcher does not know the nature and extent of the problem, but is generally aware that bots may differ both in terms of their (e.g., more homogeneous) profile characteristics and in terms of their (e.g., more extreme) interaction patterns. As the data contain variables with different scales, we employ multiple factor analysis (Pagès and Bécue-Bertaut 2006), which integrates factorial solutions for categorical and continuous variables into a common solution. Online appendix 1 contains the space of characteristics for all analyses, separated by continuous (interaction) and categorical (profile) variables. The spaces of individuals, i.e., the dispersions, are shown below. For easier visibility, bots (blue) and human users (gray) have been color-coded differently and the dispersion of both categories is indicated by concentration ellipses.

Analyzing dataset 1, which exhibits the greatest extent of homophily-based selection, yields a space of characteristics where three of the binary profile variables (1, 2, 3), a high number of contact partners, and low average interaction length describe the lower, right area of the graph (see Figs. A1 and A2 in the online appendix). Based on existing knowledge, a researcher might surmise a pattern matching the characteristics and interaction modes of bots, as this specific pattern (many contacts, short interactions) is known from bots’ activities on, for instance, dating platforms (see Schmitz et al. 2012). In fact, Fig. 1, which shows the space of individuals and thus the dispersion of all cases in the plane, allows us to identify outliers in the fourth quadrant (bottom right).

Fig. 1 Space of Individuals Dataset 1 (MFA)

Yet most bots are still located within the distribution of human users (gray circle) and would not have been unmasked without further ado. Of the 30 bots actually present, the visual inspection of the outliers would identify only six to eight cases, while not suggesting false-positive assignments. This circumstance is the result of two effects: Bots are more likely to engage in an interaction (i.e., have high numbers of interactions) owing to their advantageous profile characteristics, which make them comparatively more similar to more contact partners than the average human user, resulting in a high number of different contact partners. However, the intense demand for homophily makes continuing the interaction much harder, as bots cannot adapt to human users over the course of the symbolic interaction process, so that the distance between bots and human users increases during the interaction and the resulting average interaction length is very low. Nevertheless, this effect is not so extreme that bots could be immediately distinguished from humans in a highly distinctive form based on an initial exploratory analysis.

For dataset 2 (moderate homophily-based selection), the overall number of interactions and the average count of interaction events by contact partner are strongly positively correlated with each other, but strongly negatively correlated with the number of unique contact partners (see Fig. B1 in the online appendix). This indicates, overall, a clearly ordered, uniform interaction structure in which agents only interact with a few contacts, although over a longer series of events. Assuming that bots do not establish numerous lasting interactions, and will have unsuccessfully tried to approach many unique contact partners, one might expect them to be located at the left side of the space. As in scenario 1, this region is also described by some of the profile properties, although no longer with the same discriminatory precision (see Fig. B2 in the online appendix); thus, as expected, the profile characteristics have less importance for the initiation and continuation of interactions. However, the space of individuals does not suggest any readily identifiable outliers (see Fig. 2). Although the bots are located with disproportionate frequency in the expected second and third quadrants (top and bottom left), they do not take extreme positions in the cloud of individuals.

Fig. 2 Space of Individuals Dataset 2 (MFA)

In this scenario, by design, homophily is of less significance when it comes to continuing or terminating an interaction. The moderate pressure of homophily-based selection provides human users with sufficient potential interaction partners to which they can adapt and with whom they can form lasting, exclusive relationships, thereby obfuscating the specific interaction patterns that could potentially reveal the bots as outliers.

For dataset 3 (weak homophily-based selection), another pattern can be observed. Again, several profile indicators describe the right-hand part of the plane (1, 2, 3, and 4), but now, the overall sum of interaction events and the average duration of an interaction have the highest values along the first diagonal, whereas the number of unique interaction partners describes the second diagonal (see Figs. C1 and C2 in the online appendix). Thus, cases on the right side of the space possess a high number of interactions and long average interactions, as well as more interaction partners. One might assume that, in a context where the similarity and commitment of the interacting actors are not relevant, bots should be relatively successful in establishing numerous lasting interactions.

Accordingly, the space of individuals shows some clearly identifiable outliers on the right-hand side (see Fig. 3) and 16 of the 30 bots would be identified (with two or three false-positive classifications) via visual inspection.Footnote 7

Fig. 3 Space of Individuals Dataset 3 (MFA)
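The counts reported for datasets 1 and 3 can be restated as simple detection rates. The labels "recall" (share of bots that were flagged) and "precision" (share of flagged cases that are bots) are added here for illustration and are not part of the original analysis; the input numbers are those given in the text.

```python
def recall(true_positives, n_bots):
    """Share of all bots that were flagged as outliers."""
    return true_positives / n_bots


def precision(true_positives, false_positives):
    """Share of flagged cases that are in fact bots."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else float("nan")


N_BOTS = 30  # from the text

# Dataset 1: six to eight bots flagged, no false positives
print(recall(6, N_BOTS), recall(8, N_BOTS), precision(6, 0))

# Dataset 3: 16 bots flagged, two or three false positives
print(recall(16, N_BOTS), precision(16, 2), precision(16, 3))
```

For dataset 1, recall thus lies between 0.20 and roughly 0.27 at a precision of 1. For dataset 3, recall is roughly 0.53 and precision lies between about 0.84 and 0.89, depending on whether two or three human users are falsely flagged.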

As it turns out, albeit only for the two extreme specifications (strong and weak homophily-based contexts), we stumbled across bot-specific phenomena simply by performing a basic explorative analysis. For scenario 1, it can be assumed that strong homophily leads to a situation where bots have been sorted out by human users during the interaction process, owing to the bots’ inability to adapt, whereas weak homophily in scenario 3 promotes situations where human users did not exclude bots, but instead kept on interacting with them. Or, in short: Both constellations created discernible outlier patterns. In the case of strong homophily, bots were conspicuous in that they were rarely involved in (lasting) interactions, whereas in the case of weak homophily, bot presence became evident because they were very strongly involved in the interaction processes.

This nonlinear way in which homophily impacts the identifiability of bots demonstrates the need for further systematic simulations: Substantially more data sets need to be generated (a) over a wider parameter space, (b) with more varying parameters involved (such as heterophilous strategies), and (c) with multiple repetitions to control random variations in the individual scenarios. Only in this way can we specify the exact conditions in which the implemented errors can be understood and identified; without such a systematic, controlled comparison it is always possible that random effects will be mistaken for systematic effects.Footnote 8
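The kind of systematic comparison called for here, several thresholds crossed with repeated runs, can be sketched as a small parameter sweep. The function `run_scenario` is a deliberately simple stand-in and not the simulation model described above: it only measures the share of agent pairs that are similar enough to start an interaction at a given threshold. The agent count, the statistic, and all names are assumptions.

```python
import itertools
import random
import statistics


def jaccard_distance(u, v):
    """Jaccard distance between two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def run_scenario(a, seed, n_agents=40, n_attributes=9):
    """Stand-in for one simulation run (hypothetical).

    Returns the share of agent pairs whose distance does not exceed a.
    """
    rng = random.Random(seed)
    profiles = [[rng.randint(0, 1) for _ in range(n_attributes)]
                for _ in range(n_agents)]
    pairs = list(itertools.combinations(profiles, 2))
    eligible = sum(1 for u, v in pairs if jaccard_distance(u, v) <= a)
    return eligible / len(pairs)


def sweep(thresholds, n_repetitions):
    """Run every threshold n_repetitions times and summarize across seeds."""
    summary = {}
    for a in thresholds:
        results = [run_scenario(a, seed) for seed in range(n_repetitions)]
        # Mean and standard deviation separate systematic from random variation.
        summary[a] = (statistics.mean(results), statistics.stdev(results))
    return summary


for a, (mean, sd) in sweep([0.3, 0.65, 0.8], n_repetitions=5).items():
    print(f"a = {a}: mean = {mean:.3f}, sd = {sd:.3f}")
```

Reporting the spread across seeds alongside the mean is what allows a difference between two thresholds to be judged against the random variation within each threshold.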

Discussion

While working with digital process data is becoming increasingly relevant to the practice of social scientists from diverse paradigmatic backgrounds, the topic of data quality is still in its infancy. To address this major challenge of modern empirical research, this paper has drawn on the body of social science knowledge with respect to the empirical phenomena, theoretical conceptualizations, and methodological controllability of quality distortion. To this end, we analytically distinguished three generalized aspects that empirically interact in producing digital process data: Observational design, data generation, and data processing. These three ideal-typical dimensions describe mechanisms that may call into question the quality, validity, and reliability of digital process data on entities, events, and their relations. Whereas issues of quality of digital data can indeed be compared with those of traditional data types, a crucial meta-quality criterion of such data is transparency, i.e., the degree of inspectability and traceability of the underlying processes of data production.

Given that transparency cannot be assumed, as sociologists usually have very little if any insight into the conventions and processes that underlie the production of digital process data, we discussed the promising role of combining simulation and post-hoc identification techniques. Simulation techniques represent a way to respond to the lack of control and insight. Employing such approaches contributes to the body of empirical knowledge by adding artificial data, by providing experience about the effects of error mechanisms in the resulting data, and by showing whether the traditional identification techniques at our disposal are helpful in drawing our attention to suspicious phenomena. We illustrated this approach using the example of the identification of bots, a phenomenon that can be said to genuinely belong to socio-technical environments and that cannot be understood as a mere “external” nuisance. Yet to the extent that we are unaware of their activities, we always run the risk of misinterpreting observed actions, interactions, and communications.

In order to grasp such problems—as well as other distorting phenomena within observational design, data generation, or data processing—in a more systematic fashion, future research can build on our mixed-methods strategy by systematically generating data sets across a multitude of simulations and thereby accounting for random variations. Applying a “pipeline strategy” over a series of simulated data, varying the conditions and parameters, and running the simulations several times will enable social scientists to systematically evaluate the threshold values at which suspicious patterns come to light when using identification methods (geometric data analysis, but also finite mixture models, or clustering methods).

Yet, such automated strategies must be realized in conjunction with the interpretative competencies of the researcher. It is essential to have both an adequate understanding of the respective empirical phenomena and a theoretical understanding of possible distortion mechanisms, and, as in traditional survey research, insights must be acquired through practical engagement with empirical data sets. In doing so, knowledge gained from simulated datasets may sensitize empirical researchers who screen their real-life datasets for comparable patterns using similar identification techniques. For this purpose, entire large‑N data sets cannot and need not be examined manually: The iterative, abductive examination of samples of conspicuous cases, as well as qualitative or ethnographic investigations, can be most useful in providing further clues regarding suspicious patterns and in specifying explorative identification models. In the context of survey research, researchers have already used qualitative (cognitive) interviews with respondents on their perceptions of the survey as well as ethnographic observations of the classification practices subsequently conducted by the researchers. In similar ways, future research will employ mixed-methods approaches to quality issues of digital process data. A great deal can be learned about observational design, data generation, and data processing procedures through expert interviews with programmers (who operate in comparable contexts of practice and can inform us about common conventions) and ethnographic observations of programming activities. Complementarily, qualitative interviews with platform users can reveal their perspectives, practices (such as their ways of identifying and interacting with bots), and effects on socio-technical systems. Such accounts are essential to increase our contextual knowledge, and they can be useful for further developing simulations (e.g., for implementing more realistic strategies).

Ultimately, such mixed-methods approaches to data quality in the digital realm can help us to understand that phenomena that are interpreted as mere distortions of otherwise accurate observational data are, in fact, constitutive and generative elements of socio-technical systems.