The Complexity of Datafication: Putting Digital Traces in Context

This chapter deepens the discussion of the problem of contextualizing digital traces. First, digital traces are considered more generally as a phenomenon of media-related complexity. Second, data from learning management systems serve as an example for discussing possible strategies for putting such automatically generated data into context by triangulating them with qualitative methods. On this basis, some conclusions are drawn about the future challenges of this kind of research. Overall, this chapter can only argue by way of example, taking a specific and thus limited case of analysis. But such a detailed discussion makes it possible to outline different options for future methodological developments in media and communication research.

activities. This is, for example, the case when using a search engine or when reading newspapers online, where only a limited group of users are aware of the scope of the related traces and their further use, for example in the advertising industry (Turow 2011). But digital traces go even further: they are made not just by the users themselves but also by others when they interact online with reference to them, for example by synchronizing their address books with our digital addresses or by tagging pictures, texts or other digital artefacts with the names of other users. Nowadays, digital traces even begin before birth and extend beyond death. One example of this is the 'mediatization of parenthood' (Damkjaer 2015), which results in processes of constructing 'parenthood' before birth, as pregnancy is accompanied by an ongoing flow of communication via apps and platforms that produces digital traces of a 'forthcoming child'. The question 'who is allowed to leave these traces of an as yet unborn child?' then becomes an issue in a kind of family communication policy. In this sense, as individuals, collectivities or organizations 'we cannot not leave digital traces' (Merzeau 2009: 4) in times of deep mediatization. Therefore, datafication reflects an increasing complexity of the social world by adding a new level of social construction that is delegated to algorithms and software.
Methodologically speaking, the emergence of such digital traces is a problem for empirical media and communication research. Existing research on datafication shows that one problem is access to this kind of data. In many cases, the application programming interfaces (APIs) which open up access to it are controlled by companies in dominant positions of power, such as Apple, Twitter, Facebook or Google. However, even if such access is given, yet another problem arises: how can these data be put into context so that they can be analyzed in a socially meaningful way?
In this chapter, we want to deepen the discussion of this second problem of contextualizing digital traces. First, we will reflect on digital traces as a phenomenon of complexity more generally. Then we will take the example of data from learning management systems to discuss possible strategies for putting such automatically generated data into context by triangulating them with qualitative methods. On this basis, we finally want to draw some conclusions about the future challenges of this kind of research. Overall, this chapter can only argue by way of example, taking a specific and thus limited case of analysis. But we hope that our more detailed discussion makes it possible to outline different options for future methodological developments in media and communication research.

Digital Traces as a Phenomenon of Complexity
Understanding digital traces as the sequence of 'digital footprints' left by the use of digital media and services represents quite a new area of media and communication research. At the same time, we can relate it back to longer-running discussions about whether 'new' media also require 'new' methods of research (see for example Golding and Splichal 2013; Hutchinson 2016), and we have to contextualize it in the much more far-reaching discussion surrounding the 'digital humanities' and their methods (Baum and Stäcker 2015; Gardiner and Musto 2015). As a phenomenon, digital traces have evoked a sophisticated but also controversial methodological discussion (Kitchin 2014). In this respect, we can observe that the phenomenon is complex in several ways.
First of all, it is important to be aware that digital traces are more than just (big) data. As 'big data' is used as 'a catch-all, amorphous phrase' (Kitchin and McArdle 2016), it provokes substantial discussions about its capacity. Heavily criticized by one group of scholars (boyd and Crawford 2012; Andrejevic 2014), it is regarded as the future of empirical research by others (Mayer-Schönberger and Cukier 2013; Townsend 2013). We therefore take a different direction and return to some questions of big data in more detail later. Digital traces are a kind of digital data which become meaningful because the sequence of 'digital footprints' is related, in a technical procedure of construction, to a certain actor or action: typically an individual, but in principle also a collectivity or an organization. Through such procedures of connecting data with entities of the social world they become meaningful information, and this is the reason why companies and other data-processing organizations are highly interested in this kind of data aggregation in relation to 'real' people. For the purpose of empirical research, a good starting point is to define digital traces as numerically produced correlations of disparate kinds of data that are generated by the practices of individual, collective and corporate actors in a digitalized media environment.¹ The complexity of digital traces stems from the variety of ways in which they are produced, but also from the variety of possible correlations.
Recently, digital traces and the related possibilities of data generation have become the subject of a fundamental critique of social science methods; one that we do not share in every detail but have to be aware of. The argument is that with increasing datafication the methods of the social sciences have entered a 'crisis', as digital traces seem to be a far more suitable data source than the kinds of data typically used in the social sciences (Savage and Burrows 2007). While the sample survey and the in-depth interview once represented innovative contributions to a methodologically informed description and understanding of the social world, nowadays, because of datafication and the data sources it makes accessible, they would offer much more limited access to the procedures by which society is constructed. Society's main governing organizations (companies, administrations, educational and government institutions) get much of their information via the ongoing observation and analysis of the various digital traces people leave. Against such sources, any proposition that academic research can produce on the basis of surveys and interviews seems to be flawed. Many established methods would come under pressure from recent datafication as they cannot deliver proper answers to the problems in question, something that is described as the 'social life of methods' (Savage 2013: 5). Therefore, we would need to 'reassemble social science methods' (Ruppert et al. 2013: 22). A widely discussed conclusion from this is to think about new forms of data collection and analysis that are based on 'digital methods' (Rogers 2013: 1, 13). Methods such as crawling, scraping or data mining take digital traces as sources for empirical research. They do not use special procedures of data collection to produce data that are then analyzed; rather, they are methods of using existing digital traces as a source for analysis.
Some proponents even go one step further, arguing that digital traces would for the first time allow direct access to ongoing processes of social construction. Maybe the most prominent example is Bruno Latour's integration of the investigation of digital traces into his overall approach to social analysis (see Latour 2007). A 'digital traceability' (Venturini and Latour 2010: 6) then becomes a possibility for analyzing processes of social construction in situ: 'Being interested in the construction of social phenomena implies tracking each of the actors involved and each of the interactions between them' (Venturini and Latour 2010: 5). With digital traces, so the argument goes, we might have such direct access, as they would allow us to witness processes of assembling in the moment they take place (see Latour et al. 2012; Venturini 2012).
From our point of view, this move largely misunderstands the main points of digital traces and the complexity of their analysis. First of all, there remains the fundamental problem of misinterpreting the social world as 'flat' and therefore as reconstructable solely by an analysis of correlated 'footprints' in digital media. This is one point of access which is non-responsive, but it is one that reduces the present complexity of the datafied social world to the ontology of a flat society.² Second, and even more fundamentally, such an approach misunderstands digital traces as something 'neutral', offering us 'direct access' to society. However, digital traces are not 'neutral phenomena'; rather, they rely on the technical procedures of governing institutions: the companies, administrations and agencies that produce this kind of data. By 'governing' we mean that these institutions are organizations in a powerful position to define the character and structure of data and metadata as well as the purposes for which they can be used. Actors, whether individuals (independent workers, civic hackers), collectivities or organizations, can access these purposefully constructed and by no means objective data only in a controlled way. Therefore, as with any established method of social science, digital traces as indicators of social reality have to be critically reflected upon with regard to their particular perspective and the underlying biases with which they are produced.
It follows from this that our approach to digital traces refers back to a critique of any naïve understanding of 'big data' (cf. Puschmann and Burgess 2014). Especially beyond academic research, there are high hopes for new forms of analysis with reference to a so-called 'revolution of big data'. The core argument behind these hopes is that huge amounts of data-based information can be related and analyzed with automated procedures without predefined theoretical assumptions, and at the same time can lay the ground for predicting future developments. This would make possible a new, purely data-oriented form of knowledge production that is partly positioned against theoretically informed forms of academic research. As prominent representatives of big data analysis put it, 'no longer do we necessarily require a valid substantive hypothesis about a phenomenon to begin to understand our world' (Mayer-Schönberger and Cukier 2013: 55). Or, as formulated in the sub-title of a best-selling practical guide (Marr 2015), it is about 'using smart big data, analytics and metrics to make better decisions and improve performance'. In education, 'learning analytics' (Ferguson 2012; Papamitsiou and Economides 2014) based on big data has become the new vision of controlling and managing individual learning processes purely by algorithms. Similarly, student assessment data based on psychometric tests are used by administrators to rank schools, incentivize teachers and create accountability systems (Anagnostopoulos et al. 2013). But as Perrotta and Williamson (2016) clearly point out, the production of the underlying data structures and algorithms and their power of construction in social life are often neglected.
Such an approach reduces the complexity of the phenomenon of digital traces to a 'big data paradigm' that is about 'managing data and transforming it into usable and sellable knowledge' (Elmer et al. 2015: 3). From the point of view of empirical research methods in the social sciences, such hopes are partly based on what we can call a 'mythology of big data', that is, 'large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy' (boyd and Crawford 2012: 2). This kind of 'social analytics' (Couldry et al. 2015) refers back to the 'gradual normalisation of datafication' (van Dijck 2014: 198) as a new paradigm in science and society. This is exactly the point where we have to be careful: researchers of big data 'tend to echo these claims concerning the nature of social media data as natural traces and of platforms as neutral facilitators' (van Dijck 2014: 199). The idea is that once the easy work of gathering data is completed, the 'data will speak for itself' (Mosco 2014: 180). The hope articulated in such a discourse is that big data would offer a possibility of reducing the complexity of analyzing the social. Or put another way: big data is constructed as an easy way to handle the complexity of our datafied social world by means of datafication.
As we now know, (meta)data cannot be considered as 'raw resources' that offer any direct access to a complex datafied social world (Gitelman and Jackson 2013: 7; Bowker 2014: 1797; van Dijck 2014: 201; Borgman 2015). Rather, the main methodological task for empirical research on digital traces is to make them meaningful in a social sense, that is, to explain the causalities and relations that go beyond the pure aggregations and correlations produced by automated collections of data. As a consequence, the methodological challenge for researching transforming communications is less the automated analysis of big data, as is often postulated; rather, it lies in how to relate digital traces to further sources of data by means of which such traces can be validated and interpreted and can subsequently be drawn on in more sophisticated explanations and procedures of theory building (see Crampton et al. 2013; Lohmeier 2014). We must be very careful to avoid possible misunderstandings at this point. We share the position that competences in new forms of 'digital methods' (Rogers 2013) and 'automatized analysis' (Neuendorf 2017) are a necessity for media and communication research that endeavours to be up to date, and we subscribe to this discussion about datafication (Hepp 2016: 234-237). That said, we are critical of any approach that understands data purely as a direct source for describing society. We need to combine such data with further information about the figuration under investigation. Following semiotic theory, information is data in context, referring to its semantics (O'Connor et al. 2001).

16.3 School Learning Management Systems as an Example: Analyzing Digital Traces as Putting Them into Context
If we follow the line of argument up to this point, the main challenge is how we can analyze digital traces in a way that contextualizes them within the figurations of humans that produce these sequences of 'digital footprints' but also use them as a means of social construction. From such a point of view, we have to think about how to relate the 'information' of digital traces to the specific actor constellations, frames of relevance, and practices of communication in and by which they are produced.
The main example through which we want to discuss this challenge is that of the data systems which are nowadays widespread in schools, originally especially in the USA and the UK, but increasingly also in Germany. School learning management systems as software define the 'space' in which data are produced as 'digital traces' which, however, are also used by others to subsequently construct social reality. Or put differently: school information systems are not only a means of 'collecting' data; they are also a means for powerful processes of construction, typically on the part of their providers who do 'data analysis'. The way in which data are embedded into communicative practices in schools plays an important role: for example, the use of grades for decision-making, or the use of upload and download traces to define student involvement or teacher or parent engagement.
Learning management systems (Ifenthaler 2012) in schools and higher education institutions are supposed to support the learning processes of students and the management processes of teachers. Most studies consider the forms of instructional use, teachers' and learners' attitudes, and the impact on learning (e.g. De Smet et al. 2012). But the organizational processes of schools, that is, interactions between students, teachers and parents and within their groups, between school management and staff, school district and school board, are often neglected (see Breiter 2014). In an empirical study of German secondary schools headed by one of the authors,³ the goal was to reconstruct the school as a social organization by analyzing the communicative practices of key stakeholders. Hence, online, face-to-face and paper-based forms of communication were studied. A subset of our research addressed the interdependence of communication networks between teachers in the world of the school building and in the world of the learning management system. The underlying hypothesis assumed a very similar activity structure inside and outside the technical system: those who interact regularly and intensively offline will also do so online. For this purpose, we collected the digital traces that teachers left in the learning management system. As in most server-based systems, the paths of users can be traced back by using log-files. Log-files record information, problems or errors pertaining to the system and its applications (Markov and Larose 2007; Suneetha and Krishnamoorthi 2009; Liu 2011; Oliner et al. 2012), often in the Extended Common Log-file Format. Looking at such log-files from a web server as in Fig. 16.1 (here in an anonymized and therefore fictitious form), it is possible to identify the user by her internet protocol (IP) address (1.2.3.4) and, if multiple users share the same internet connection, additionally by the combination of browser and operating system.
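To make the structure of such an entry more concrete, the following minimal sketch parses a single line in the widely used 'combined' log layout, which we assume here to be close to the format described above. The example line, the field names and the function `parse_line` are hypothetical illustrations and do not reproduce Fig. 16.1 or the tooling used in the study.

```python
import re
from datetime import datetime

# Hypothetical example line in the common "combined" layout (not taken from Fig. 16.1).
LINE = ('1.2.3.4 - - [10/Oct/2011:13:55:36 +0200] "GET /page2.php HTTP/1.1" '
        '200 2326 "http://example.org/index.php" "Mozilla/5.0 (Windows NT 6.1)"')

# One named group per field; the referrer is the second-to-last field,
# the browser/operating system string (user agent) is the last one.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

record = parse_line(LINE)
# IP address plus user agent serve as a rough key for telling users apart
# when several of them share one internet connection.
user_key = (record["ip"], record["agent"])
```

Using the pair of IP address and user agent as a key is only a heuristic: two users with identical browsers behind the same connection would still be merged.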
Once a user is identified, one can track her movement within the site because the second-to-last field of each log entry contains the page the user came from, the so-called referrer. In the example given here, the user enters the site at index.php, stays on the page for 14 seconds and moves on to page2.php by using a hyperlink. These 'clicks' are called actions. Using this information, we can track the movements of all users separately. There are five main ways to conduct a log-file analysis: (1) display which pages of a website are accessed more than others and how many users selected a specific function; (2) show the paths of visitors through the site; (3) cluster visitors into groups, the clusters being based on movements or paths through the system; (4) use social network analysis to identify connections between users and/or websites based on the 'clickstream' data; and (5) apply other statistical methods and algorithms (e.g. multi-level analysis).
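The first two of these approaches can be sketched in a few lines. The example below assumes that the log entries have already been parsed into tuples of user key, timestamp, page and referrer; the data and the function names are hypothetical and merely mirror the index.php/page2.php example from the text.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical, already parsed hits: (user key, timestamp, page, referrer page).
hits = [
    ("u1", datetime(2011, 10, 10, 13, 55, 22), "index.php", None),
    ("u1", datetime(2011, 10, 10, 13, 55, 36), "page2.php", "index.php"),
    ("u2", datetime(2011, 10, 10, 14, 1, 0), "index.php", None),
]

def page_counts(hits):
    """Approach (1): how often each page was accessed."""
    return Counter(page for _, _, page, _ in hits)

def user_paths(hits):
    """Approach (2): the chronological path of each user through the site."""
    paths = defaultdict(list)
    for user, ts, page, _ in sorted(hits, key=lambda h: (h[0], h[1])):
        paths[user].append((ts, page))
    return dict(paths)

def dwell_times(path):
    """Seconds spent on each page before the next click (last page is unknown)."""
    return [(page, (next_ts - ts).total_seconds())
            for (ts, page), (next_ts, _) in zip(path, path[1:])]

counts = page_counts(hits)
paths = user_paths(hits)
dwell = dwell_times(paths["u1"])
```

Note that the time spent on the last page of a path cannot be derived from the log, since no further request follows it.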
Log-file analyses are non-reactive. All information is gathered on the application layer or server layer and is not actively entered by the user. Furthermore, the data are stored in a machine-readable format and can be used in real time. But this strategy for collecting data has one main disadvantage: the lack of any information about the user's practices. Furthermore, there is usually no socio-demographic information about the user. Additionally, log-files can raise serious privacy concerns.
Users normally have no control over the log-files that are produced by the server or application. As the IP address is stored, the users are easy to identify. Therefore, log-files must be anonymized by the researchers. But while this is necessary from the point of view of research ethics, it additionally limits the interpretation of such data.
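One common way of removing IP addresses while still being able to follow a user's path is to replace each address with a keyed hash. The sketch below illustrates this idea under our own assumptions; it is not a description of the procedure used in the study, and the key handling and the function name `pseudonymize` are hypothetical. Strictly speaking, this yields pseudonymized rather than fully anonymous data, as the same user remains recognizable across entries.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key: generated once, stored apart from the data and
# destroyed after processing so that the mapping cannot be recomputed.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymize(ip, key=SECRET_KEY):
    """Replace an IP address with a stable, non-reversible pseudonym."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user-" + digest[:12]

a = pseudonymize("1.2.3.4")
b = pseudonymize("1.2.3.4")   # same address, same pseudonym
c = pseudonymize("1.2.3.5")   # different address, different pseudonym
```

A keyed hash is used rather than a plain hash because the space of IP addresses is small enough to be enumerated, which would make an unkeyed hash easy to reverse.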
We gathered anonymized data from a learning management system in a larger German secondary school (>100 teachers and >1000 students). The system is mainly used by the staff for coordination and communication. As it is hosted by an external company, it can be accessed both from inside the school's network and from home. The learning management system offers the following features: announcements, calendar, file exchange and discussion groups.
The log-files we investigated span a period of 12 months, including holiday breaks. They record 120,000 hits from 138 users. After the deletion of all irrelevant data (e.g. hits by bots) and the application of path completion algorithms, the number of hits is approximately 62,000. The 138 unique users made a total of 4451 visits.⁴ Figure 16.2 shows a network graph of these data.⁵ Such a visualization makes it possible to identify three main groups in the upper part of the graph, which are connected to the categories 'miscellaneous', 'reports' and 'conferences'. All are mainly linked to dates, some to announcements and materials. Announcements and materials are more likely to be accessed than dates. This is no surprise as dates can be viewed in a calendar-like overview. The items themselves are mainly linked to the category and not linked among themselves.
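The two cleaning steps mentioned here, removing bot traffic and grouping hits into visits, can be illustrated as follows. The sketch rests on assumptions of our own: bots are recognized by a simple user-agent heuristic, and a new visit starts after 30 minutes of inactivity, which is a common convention but not necessarily the definition used in the study. Path completion is not covered.

```python
from datetime import datetime, timedelta

BOT_MARKERS = ("bot", "crawler", "spider")  # assumption: simple user-agent heuristic
SESSION_GAP = timedelta(minutes=30)         # assumption: common convention for visits

def is_bot(agent):
    """Flag hits whose user-agent string contains a typical bot marker."""
    agent = agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

def count_visits(hits):
    """hits: iterable of (user, timestamp, user agent); returns {user: visits}."""
    visits = {}
    last_seen = {}
    human_hits = (h for h in hits if not is_bot(h[2]))
    for user, ts, _ in sorted(human_hits, key=lambda h: (h[0], h[1])):
        # A first hit, or a hit after a long pause, opens a new visit.
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            visits[user] = visits.get(user, 0) + 1
        last_seen[user] = ts
    return visits

# Hypothetical hits: one teacher with a morning and a midday visit, plus one bot.
hits = [
    ("u1", datetime(2011, 10, 10, 8, 0), "Mozilla/5.0"),
    ("u1", datetime(2011, 10, 10, 8, 10), "Mozilla/5.0"),
    ("u1", datetime(2011, 10, 10, 12, 0), "Mozilla/5.0"),
    ("x9", datetime(2011, 10, 10, 9, 0), "ExampleBot/1.0"),
]
visits = count_visits(hits)
```

The choice of threshold directly affects the number of visits reported, which is one reason why such figures should always be read together with the definition behind them.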
In the bottom left are many materials closely connected to each other. Above these materials are two subjects, one bigger and one smaller. In contrast to the representation of the former three categories, the nodes overlap each other and are linked not only to the subject itself but also to each other. This overlap results from the force-directed layout and indicates that the items are closely linked together. The relatively large node size is another indicator of the intensive exchange of materials within these two subjects. To deepen such an analysis, we can draw a scatter plot of the data. Scatter plots are diagrams that use two coordinates to visualize the values of two variables. As the points have different sizes, they represent a third variable (in this case the number of uploaders). The scatter plot in Fig. 16.3 compares the number of materials per subject with the sum of hits on these materials. The size of each subject's point shows the number of different contributors. English has the most hits (2300) and the most materials (23). That is no surprise and was already to be expected from the previous data set. Spanish, on the other hand, is more interesting. It has the second most materials (15), but only around 500 hits and only three contributors. Based on the log-files, we can only speculate about the reasons.
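The three variables shown in such a scatter plot can be derived from the log data by a simple aggregation per subject. The following sketch uses invented records and a hypothetical function `summarize`; the numbers are illustrative only and are not those of the study.

```python
from collections import defaultdict

# Hypothetical records: (subject, material id, uploader, hits on that material).
materials = [
    ("English", "m1", "t1", 120),
    ("English", "m2", "t2", 80),
    ("Spanish", "m3", "t3", 30),
    ("Spanish", "m4", "t3", 10),
]

def summarize(materials):
    """Per subject: (number of materials, sum of hits, number of contributors)."""
    acc = defaultdict(lambda: {"materials": 0, "hits": 0, "uploaders": set()})
    for subject, _, uploader, hits in materials:
        row = acc[subject]
        row["materials"] += 1             # x-axis: materials per subject
        row["hits"] += hits               # y-axis: sum of hits on these materials
        row["uploaders"].add(uploader)    # point size: distinct contributors
    return {subject: (row["materials"], row["hits"], len(row["uploaders"]))
            for subject, row in acc.items()}

summary = summarize(materials)
```

Each resulting triple corresponds to one point in the plot: its two coordinates and its size.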
As we can see in this example of digital traces in a school information system, the interpretation of so-called big data is only possible with context-specific knowledge. Log-files can give researchers a broad view of an information system and its usage. They do not, however, allow researchers to identify 'significant behaviour'. The data we analyzed span about 300 days. With such a large amount of data, significant behaviour may be overlooked, and significant behaviour need not be the most common behaviour. But statistical methods such as sequential pattern or cluster analysis try to find common and frequent patterns, not the rare or unique patterns which are potentially more relevant. This may put the available methods at odds with the research aims. Additionally, patterns which can be identified statistically need to be embedded in the physical world of classrooms, different staff rooms, subject- and/or grade-related rooms and 'water coolers' (Earl 2001).⁶ To understand schools as communicative figurations, we need to identify the actor constellation, which will only partly be mapped in the log-files: non-active members of staff and their communicative practices are neglected, even if they might have a media ensemble which allows data exchange.
In our case, we accompanied our quantitative analysis with in-depth qualitative studies based on participant observations and interviews (Welling et al. 2015). Over a period of one school year, we observed teachers in their staff room as well as in subject-specific rooms. Based on an observation protocol, the use of the information system as well as the situations and locations for exchange about administrative and organizational issues were recorded and later analyzed with an open coding scheme. The interviews with different groups of teachers were recorded and coded according to the standards of qualitative data analysis. Based on both data sources, we could find clusters of activities as well as subject-specific communicative practices. In both cases, the usage of the school information system was an integral part of the data collection. This helped us to identify patterns which could be reconstructed in the log-file analysis. It offered a different and more detailed view of the organizational processes of a school beyond the data in the system. Spanish is a small subject; the teachers usually teach at different schools and need to be virtually present at different locations. Time management is difficult, and the learning management system with its calendar function allows them to schedule meetings and book resources online at any time and from anywhere. The link to their subject community is mainly organized through web-based systems. The English subject group has a long-standing tradition of exchanging classroom materials. Years before the introduction of the learning management system, they arranged their exchange via paper folders in their subject-related staff room.
This example of digital traces in school data systems highlights the relevance of digital traces in context, which can be very rich empirical data if analyzed interdependently. Dealing with log-files entails additional concerns about research ethics and privacy. Users cannot give their consent a priori.

Conclusion: Challenges Putting Digital Traces in Context
Taking the example of school data systems, we could demonstrate what it means to put digital traces in context: the data collected by the respective systems have to be linked with further, detailed information to make them socially meaningful. Only in this way do such data become a source for describing our present complex social world of datafication. For empirical analysis, this is related to three challenges which we consider fundamental for any social science analysis of digital traces. The first challenge is to find a way to grasp digital traces with reference to a defined social entity. Very often, digital traces are understood as a phenomenon of a single actor, an individual who left the traces through the use of digital media and services. While this is correct for a basic definition of digital traces as well as for many procedures of data generation (it is the single user of an online system who leaves the footprints that are collected by this system, often because the individual is of interest as a customer), our example demonstrates that we rather have to consider these individuals as social actors whose practices are located and embedded in the figurations of further institutional contexts and groups of people (in the case of our example, the organization of the school and the different groups of teachers). Only by reflecting on this do the data become meaningful. Therefore, we have to keep in mind the whole communicative figuration in which the individual who is the 'originator' of the respective traces acts. The challenge here is to find a way to link the automatically generated data with a social analysis of such a figuration. To achieve this, it seems appropriate to start the research with the frame of relevance of this figuration and to locate the analysis of the digital traces within it. By so doing, there is a chance of finding helpful ways of contextualizing.
As a second challenge, we are confronted with the triangulation of quantitative 'digital methods' (in the case of our example, the log-file analysis) with forms of qualitative analysis that offer the context information which is needed. As we have seen, combining automatically collected and processed data, on the one hand, with various forms of qualitative interviews or focus groups, on the other, is a promising method. But this again refers back to the first challenge: only if the actor constellation of the figuration under consideration is known does it become possible to conduct such interviews and focus groups. This offers rich data which, because of their richness, might at the same time become critical from a research ethics point of view.
Therefore, research ethics are the third challenge. Any approach which puts digital traces in context in such a way entails linking digital data that are left behind (partly without any detailed knowledge of this on the part of the persons concerned) with further information about certain persons. The knowledge gathered in this way can be very far reaching, often much more far reaching than the knowledge a person has about him or herself. For research ethics, one consequence of this is the necessity to inform the investigated persons in detail about such possibilities of data collection and analysis (and to offer them, for example, the opportunity to have such unknown information communicated back to them). Another consequence is that as researchers we must be very careful about how we publish such results, because the publication of data on digital traces in triangulation with further information about individuals might offer others the chance to single out these persons. Anonymization becomes an important and complicated task.
With respect to these three challenges, it becomes obvious that the meaningful analysis of digital traces is more than just a new field for media and communication research. As a new field, it makes it necessary to reflect in a new way not only on the relationship between qualitative and quantitative data but also on our (digital) research ethics. This is essential if we want to conduct a form of media and communication research that addresses the complexity of the present social world, which is increasingly characterized by datafication. We hope that this chapter offers some stimulation for further steps in such a direction.

Notes
1. The term 'trace' collects numerous meanings and appendices (to trace, track, traceable, traceability, tracing, etc.) and seems to connote an isolated object as well as an action or a process (Serres 2002: 1; Reigeluth 2014: 249). Because of this semantic richness of 'trace' in general, there is some ambiguity in determining 'digital traces' in a proper way, which we want to clarify with the definition above.
2. Here the general problem of the idea of the social world as the sum of assemblages becomes replicated (for a critique of such an approach see Couldry and Hepp 2016: 57-78).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.