5.1 What Even Is ‘Big Data’?

Big data generally capture what is easy to ensnare—data that are openly expressed (what is typed, swiped, scanned, sensed, etc.; people’s actions and behaviours; the movement of things)—as well as data that are the ‘exhaust’, a by-product … It takes these data at face value, despite the fact that they may not have been designed to answer specific questions and the data produced might be messy and dirty. (Kitchin 2014, Chap. 2, p. 3 of individual chapter version)

Rob Kitchin is possibly one of the most cited definers of ‘Big Data’, opening books and dissertations up and down the land. Yet, as we are about to discover, Kitchin himself tells us that while the term ‘Big Data’ is repeatedly defined (Kitchin 2014, Chap. 2, p. 3), big data themselves defy categorical labelling. So, it is not clear-cut, because differentiating what ‘it’ is and what they are not is often side-stepped, or comes with caveats.Footnote 1 We encountered something similar before, if you remember, in Chap. 2. When it comes to understanding what well-being is, those inclined to measure are sometimes keen to measure well-being to understand it, rather than define what it is that is being measured. In a similar way, those describing Big Data are often more concerned with what Big Data does (or do), rather than what Big Data is, or are.

In this chapter on Big Data, we will discover that how they are used can defy some of the old definitions of how to use data, or of what data are for. So, let us start with some definitions and what is different. For Kitchin, the lack of ‘ontological clarity’ of Big Data (that is, clarity about the individual concepts and categories of Big Data and the relations between them) means the term acts as a vague, catch-all label for a wide selection of data (Kitchin 2014, Chap. 2, p. 3). Despite this, he reviewed how other people define the term and proposed the key traits of Big Data. These qualities are outlined in Table 5.1. Given the word ‘big’, it is probably no surprise that volume is one of ‘the 3Vs’ identified by Doug Laney back in 2001, the other two being velocity and variety. Other qualities include exhaustivity, resolution, indexicality, relationality, extensionality and scalability (Kitchin and McArdle 2016; Kitchin 2014). But what do these terms mean? How do these characteristics help us understand the data?

Table 5.1 Ways that Big Data are different

Having established a series of classifications for Big Data, Kitchin tested his taxonomy of traits with co-author McArdle a few years later (Kitchin and McArdle 2016). They applied the categories to 26 datasets which are widely considered Big Data, drawn from across seven sources: mobile communication, websites, social media/crowdsourcing, sensors, cameras/lasers, transaction process generated data and administrative data (2016). The authors find that all seven traits in Table 5.1 apply to only ‘a handful’ of these datasets (Kitchin and McArdle 2016, 9). This shows how difficult it is to diagnose what Big Data actually are. Rather than the qualities of the data themselves, it might be more useful to instead turn to thinking about the contexts of data again: where they come from, and what they do (Oman n.d.).

The key difference in the characteristics of Big Data is context, which is often missing when the data are presented. Table 5.2 illustrates how difficult it is to diagnose what Big Data actually are without considering the qualities that affect their use. It shows there are additional Vs: veracity, value and variability—these are concerned with how well the data suit their re-purposing. Given the multiple insights and applications of data outside of their original setting, it can be difficult—even more difficult—to find certainty in them. This is because the data were collected, generated and produced for a specific reason, or as a by-product, in ways that differ from how they are re-used.

Table 5.2 Some qualities of Big Data

The value of Big Data lies in the variety of insights that are possible and that can be used for other purposes. However, there is much in the data that may not be useful, which means using Big Data can increase the risk of confounding more traditional causal explanations. Instead, the messiness of Big Data lends them to correlation, yielding many insights which can be used to enable prediction of well-being for individuals and society. We shall return to correlations and well-being in our case studies later in this chapter.

Table 5.3 looks at sources of different kinds of data typically used to predict well-being, along with their pros and cons. These sources were drawn from an article in a data science journal (Voukelatou et al. 2020), and I have synthesised these with Kitchin’s seven sources (mobile communication, websites, social media/crowdsourcing, sensors, cameras/lasers, transaction process generated data and administrative data), retaining commentary from Voukelatou et al. on the pros and cons of their use to understand well-being. You may look at these and feel that these data sources are strange ways to understand people’s well-being, given the difference between their origins and what they may be used for. You may also note that the authors’ presentation of the pros and cons, based on these sources, does not really prompt consideration of the people whose data they are, so much as their ease of use for the data scientist.

Table 5.3 Sources of Big Data and their pros and cons for well-being measurement

Returning to contexts of use: mobile phone data, for example, have a primary purpose, which is billing, or enabling apps that need location data to work (such as maps or local restaurant recommendations). This is very different from these data being used to understand trends about people and society. Our previous examples of data re-use (or secondary analysis) have largely involved data that were collected in national surveys, or through more qualitative methods with smaller samples, to understand a specific aspect of people and society more deeply in some way. Notably, even when the research question is different in Chap. 3’s examples of data re-use, the purpose of the data’s collection is not as different, or as removed from the re-use, as the ‘exhaust’ or ‘by-product’ data Kitchin refers to.

The process which has come to be known as ‘datafication’ (as coined by Mayer-Schönberger and Cukier 2013) describes the increased demand for and uses of data. As we have seen in previous centuries, appetite for numbers (pandemics being one accelerator of data desire) has coincided with technological evolutions in handling numbers. In turn, and as we have seen over the last four chapters, different disciplines have increased and expanded their capacities for data and knowing the human experience in their own, particular way, and ‘new sciences’ have been declared. ‘Big Data’, as data with the qualities presented above, result from mounting capacity and faster instruments that increase the possibilities for the origins and volumes of data that can be stored in expanding databases, or in different databases which can be readily linked for a variety of purposes. As we have also seen before, it can be difficult to decide which came first: appetite for data, or capacity to expand on data possibilities.

In the age of Big Data, these newer data sources hold a wide variety of easy-to-capture data points, including observations of how we feel, where we are (or were), who we know, what we spend—and on what. These provide information on what products we have clicked on, and those we have not bought (Turow 2011). They can show how and where we spend our spare time and our money, both off and online. They are, therefore, incredibly valuable for research and commerce.

It is not the individual data points that are important, per se, but the links between them that make them valuable. Through linking, assumptions can be made about how our behaviour, such as online spending or improved mood, can be replicated in another place or time. These insights are also linked with other more familiar data points from administrative records, for example: where we were born, how much we earn, whether we own our own house. Other data are produced by loyalty cards, smartphones and in-house devices, such as Alexa, expanding such linking opportunities. Those who may try to avoid ‘being known’ by these other data will try to bypass the systems that gather them. However, this resistance also becomes data in and of itself; avoidance still produces digital traces that can be used to gather insights. Corporations may still create an automated profile of sorts, and assumptions will be made about the kind of products ‘the resistors’ buy. The persistence of data practices and their seeming inescapability are the reason we are starting to think about the experience of Big Data as something we ‘live with’ (Kennedy et al. 2020) and as something we ‘feel’.

This chapter covers some of the pervasiveness of Big Data, alongside the possibilities that come with that. Crucially, we look at what that means for well-being. We start by looking at the ways that data about mundane aspects of our lives are increasing, alongside how normalised increasing data collection, analysis and re-use have become. These ‘data practices’ present new possibilities and realities of data-driven systems and decision-making that affect culture and society.

In this chapter, we touch on some of the uncomfortable aspects of these new realities, before historicising Big Data as well-being data to contextualise contemporary concerns regarding data practices that can be harmful. The second half of the chapter uses case studies to explore these concerns about well-being and data. Firstly, we consider a high-profile case that was billed as the promise of Big Data: Google Flu Trends (GFT), looking back from the age of COVID-19. Three further, short examples show the possibilities of social media data, place-based data, and health and fitness data to understand well-being for social and cultural policy and culture and society more generally.

5.2 Big Data: A New Way to Understand Well-being?

“Big Data” was cited 40,000 times in 2017 in Google Scholar, about as often as “happiness”! (Bellet and Frijters 2019)

The datafication of social life has led to a profound transformation in how society is ordered, decisions are made, and citizens are governed. (Hintz and Brand n.d., 2)

Digital devices and data are becoming an ever more pervasive part of social, commercial, governmental and academic practices. (Ruppert et al. 2013, 2)

The majority of Big Data are collected in a different way to the national surveys and interviews we encountered in Chaps. 3 and 4, and consequently have numerous different qualities. One is that surveys and questionnaires are, by and large, overt methods, in that it is obvious that questions are being asked to generate data. The new technologies use data which are collected covertly, often gathered on individuals without their ‘considered consent’ and often processed without transparency.

Figure 5.1 shows just a small selection of the types of personal data that are useful and valuable for social analytics and that are covered in this chapter. Social analytics involve the monitoring, analysing, measuring and interpreting of data about people’s movements, characteristics, interactions, relationships, feelings, ideas and other content. Figure 5.1 shows only a few of many more examples. Here, they are categorised into domains that share the same names as the UK’s well-being measures (although biometrics is a new addition), to enable you to cross-reference the different kinds of insights available under each domain from these data. The data are from ‘observations’ of how we move around the on and offline world. They can include behaviours collected by sensors (think of how your mobile phone uses data via GPS to tell you when the next bus is, or that you are about to encounter traffic on the motorway). They include our feelings, shared by social media data, or in apps. While demographic data have long been collected, as we know, these newer forms of data can say much more about us, our well-being and quality of life. As we shall discover, this is both for good and bad, and any insights gained need to be put into context.

Fig. 5.1 Some examples of personal data used for social analytics in the era of Big Data

As we have also discovered, data are not only numbers or text, but can be sound and pictures. Analysing these kinds of qualitative data as Big Data holds new possibilities. In some ways it is these new possibilities that feel the most uncomfortably non-human—whether it is the concern that your phone is always listening to you or, rather, that Alexa or Siri are (to humanise these technologies). Even the Street View option of Google Maps allows us to look at other people’s homes. I remember keenly finding the image of the flat I rented in London for years, only to see my washing-up through the kitchen window. I couldn’t help but think, I wish I had known they were coming.

More notable than my neglected washing-up being on public view for judgement are other visual data used for training datasets, particularly for facial recognition. There are moments when you know that facial recognition technology is being used: to log in to your phone, or at passport control at the airport, perhaps. However, it is also being developed for schools, public transport systems, workplaces and healthcare facilities (Ada Lovelace Institute 2019). Revelations about its use in shopping centres prompted media and public outrage, regulatory investigation and political criticism (Denham 2019; BBC 2019). These reactions are in part about the further encroachment on the way we live (like the call centre example from the 1990s that opens the book) and in part about the lack of consent and knowledge regarding these data being collected about us.

Some people who uploaded photos to Flickr some 10–15 years ago more recently discovered that they (as in the people’s faces and their photos) appeared in a huge facial-recognition database called MegaFace (Hill and Krolik 2019). They found the database held facial data on around 700,000 individuals, including their children, and was being downloaded by various companies to train face-identification algorithms. These algorithms were then being used to track protesters, surveil terrorists, spot problem gamblers and spy on the public at large (Hill and Krolik 2019). Notably, a colleague who read this chapter before publication—a digital sociologist,Footnote 3 no less—confessed to me their shock at reading this anecdote, as they had used Flickr and were not aware of this story. Therefore, not only are our personal data collected and used without our knowledge, but the controversies surrounding their re-use are not even shared with users. This poses questions for accountability and transparency.

The questions of who is collecting these data, who is using them, and for what, present a more complex issue than before. Public support for police use of facial recognition technology is conditional upon limitations and appropriate safeguards, but there is no such trust in private company use (Ada Lovelace Institute 2019). As we have been discovering, it is the contexts of data collection and use that we need to understand: it is the who, what, where, why and what for that are important.

5.2.1 Why We Need to Ask Critical Questions of Data in the Context of Well-being

Many issues related to Big Data don’t have clear-cut answers, especially where well-being is concerned. While data reveal details of the vulnerable, often involving risk for these people and their communities, the State uses data systems that people increasingly need to be a part of to access healthcare and welfare support (Dencik 2020). This is why the growing amount of research which problematises the utility and ethics of Big Data, and how they are used, is vital. In this area of critical data science (see Bates 2016), some researchers use Big Data to reveal the limits and social issues connected to everyday datasets that we all use, such as a search engine’s image database (e.g. Otterbacher et al. 2017). These critical studies of data and their effects on society reveal how data are capable not only of creating new problems, but of perpetuating racism and misogyny, as we discovered in Chap. 1 with Safiya Noble’s example of what happens when you search for the phrase ‘black girls’ (Noble 2018). These projects reveal data’s negative social effects, and how they are already embedded in society, exacerbating issues.

Other research aims to investigate what people know and think is going on, while also looking at the possibilities of Big Data (and their associated technologies) for understanding aspects of well-being. One such example (Living With Data n.d.) presents real-life cases of public sector data practices to members of the public. It wants to understand how much people appreciate the possible benefits and how much they doubt or distrust the possible implications of data systems and sharing in their everyday lives. One possibility being, of course, that many people may not really care as much as we think they do, or should.

We touch on these issues in this chapter. Most notable is the increase in concerns regarding the harms that Big Data and new technologies are capable of, and which are happening unchecked (e.g. the UK’s Data Justice Lab n.d.; Eubanks 2018; O’Neil 2016; Noble 2018; Benjamin 2019). There are two main problems here. One is that we are compromising well-being in the so-called aim of better understanding the human condition. The second is that we are not only using these data and technologies to understand people, but also sorting and managing them in different ways that suit those who are already more powerful.

It is vital to note that key to concerns about datafication is how these practices disproportionately affect the well-being of those already most vulnerable. Facial recognition, for example, negatively impacts people already disadvantaged, owing to its own gendered, heteronormative, classed and racialised biases (Ada Lovelace Institute 2019). These technologies are also being trialled in policing in the UK, where trials have reported more than 90% incorrect matches (Fussey and Murray 2019; Davies et al. 2018). In a more general way, all public services are adopting new data practices and possibilities.

Data-driven decision-making is growing as an everyday feature of public services. Decisions about who receives welfare (Eubanks 2018, 37), housing (Eubanks 2018, 93) and other interventions, such as child protection (Eubanks 2018, 135) or education (O’Neil 2016, 5–9, 52–60), are increasingly made by algorithms, rather than people. Even when automated decisions are questioned by people (Eubanks 2018, 141), it is unclear whether ‘experienced workers’ (Eubanks 2018, 77) or the data system has the greater influence in key decisions.

Beyond welfare, algorithms intervene in other social policy areas. They monitor the ‘quality’ of education, using dubious proxies (O’Neil 2016), with various bad outcomes, including teachers undeservedly losing their jobs.Footnote 4 In the UK in 2020, during COVID-19, an algorithm also decided the grades awarded to school-leavers in the absence of exams, owing to social distancing measures. One national media headline (Pidd 2020) called this ‘punishment by statistics’.

The UK’s A Level algorithm example was extremely high profile, causing outrage that data-driven decision-making would have such an enormous effect on the futures of these young people. It was seen as morally outrageous for a number of reasons. First, because our society dictates that these young people’s well-being should be protected. Second, the algorithm used data that no one had consented to: no one knew at the time that their prior grades could be used as a final grade. Third, the data model also included proxies for expected performance which had nothing to do with each student’s own academic record. Instead, it used their school’s overall performance in previous years—scores based on previous students’ grades, not theirs. While the governing body, Ofqual, insisted its standardisation arrangements ‘are the fairest possible to facilitate students progressing on to further study or employment as planned’ (Pidd 2020), there were further controversies over transparency around how it had arrived at ‘fair’. Ofqual subsequently published a 319-page document explaining its methodology (Pidd 2020), which was criticised for not being accessible to the general public. Therefore, not only did the whole thing seem far from fair, but Ofqual didn’t make explicit how the approach was fair to those affected.

Here we see public services failing to look after well-being through the use of data in ways which go against the moral code of fairness, accountability and transparencyFootnote 5—and without the young people’s consent. Beyond their high-profile nature, what is different about these data uses? While Chap. 2 discussed the greater role of data in public services from the 1980s onwards, this ostensibly had a different rationale. It aimed to evaluate qualities of these services, such as efficiency or cost-effectiveness. While these approaches led to flawed decisions and evaluations, assessments were made at a societal level. Contemporary data-driven decision-making, whether the allocation of resources to people or the labelling of individuals at risk, is a different approach and uses data on a different level. Or, to use the language of Chap. 3, there is a different unit of analysis, and that unit could be a vulnerable person.

In sum, why do we need to ask critical questions about how people and their well-being are being understood or about how data and data systems used to understand people can compromise well-being? Going back to those definitions, people are often concerned with the speed and size, and so on, of Big Data. Actually, as Kitchin indicates, it is the contexts of these data that are the most important ways that they are different. Not only are the contexts of origin of Big Data more different, and further from the contexts of use, than before, but the practices of analysing data feel less human. By this I mean that less human attention is now required in data analysis and in important processes that require data. What does that mean for decisions made about people and well-being?

As we will discover in a few sections, the response to COVID-19 required older data and data systems—and more human judgement—than you would have imagined if you were looking at media reports of the promise of artificial intelligence (AI) in the first half of 2020. However, as the financial value of data increases, and as they can be analysed ever more expediently, we must ask other questions. Who stands to gain and who stands to lose? Who has chosen to participate? But then, did people ever get to choose to participate in systems of well-being data? Or were we even thinking about data as ‘a thing’ about us, that affects our lives and was valuable? The next two sections deconstruct the financial value of Big Data and ask whether this reality is even new.

5.2.2 Value

Another major reason why we need to ask critical questions about Big Data and well-being concerns the financial value of knowing more about people, and the financial value of the systems that sort people for public services and welfare distribution (Eubanks 2018). Beyond public services, the value of the new ways that Big Data can work lies not just in knowing more about people, but in the potential this knowledge has to orient people’s thinking through suggestion and, in some high-profile cases, to manipulate what they do. They enable marketers to sell you products you might be most tempted by, knowing when you might be most susceptible, too, based on your previous purchases or what else you’ve looked at (Turow 2011). They also enable political campaigns to target their messages in the same way and change voting behaviour (Avila 2019; Bates et al. 2016; Murgia 2017). The recent Cambridge Analytica scandal saw Facebook implicated not only in the unethical use of people’s data, and of the knowledge it had on their behaviour, but in misinformation that is thought to have changed the results of the 2016 US presidential election and the UK’s Brexit referendum the same year.

The first and second waves of well-being (Bache and Reardon 2013), which we encountered in Chap. 2 and to which we keep returning, evolved as historical moments in which data capabilities married policy-makers’ aims: improving the way we think about measuring human progress. Similarly, well-being metrics became more viable because well-being methodologies were evolving in a way that politicians saw as favourable. Political will and academic developments work with evolving infrastructure and technological development to enable datasets to be created with more detailed and nuanced information about quality of life. These factors work together to allow new methodologies to generate new kinds of data and analytical approaches, which then, by extension, affect research and policy-making, which in turn impact upon our quality of life.

The increasing emphasis on Big Data as ‘the new oil’Footnote 6 (a misnomer, of course) is not because datasets are ‘better’ (a claim which would need some qualification) or because the technologies are new (though admittedly this is partly why they have become such a fixation). Instead, ‘Big Data’ datasets offer data with different qualities than more traditional data acquired by surveys. This means big datasets offer the capacity to answer different research questions—or answer the same research questions differently. Most importantly, they have been called the new oil because: (1) ‘data powers today’s most profitable corporations, just like fossil fuels energized those of the past’ (Matsakis 2019) and (2) these qualities can be financialised.

The amount of data on individuals that are now collected is almost impossible to visualise in our minds. The growing number of devices and sensors means we are generating ever more data that can be collected: the International Data Corporation predicts that by 2025, the total amount of digital data created worldwide will rise to 163 zettabytes (Coughlin 2018). A zettabyte is 10²¹ bytes (1,000,000,000,000,000,000,000 bytes), or one trillion gigabytes. The European Commission forecasted the European ‘data market’ to be worth as much as €106.8 billion by 2020 (Ram and Murgia 2019). These kinds of numbers reinforce the importance of looking at Big Data as social phenomena—with social effects. But how new are large datasets about people and populations?
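Before we turn to that question: to make the scale above a little more graspable, here is a quick back-of-the-envelope check in Python (a minimal sketch; the only input is the IDC forecast cited above).

```python
# One zettabyte is 10**21 bytes, i.e. one trillion gigabytes (10**12 GB).
ZETTABYTE = 10**21  # bytes
GIGABYTE = 10**9    # bytes

forecast_zb = 163   # IDC forecast for 2025 (Coughlin 2018)
total_bytes = forecast_zb * ZETTABYTE
total_gb = total_bytes // GIGABYTE

print(f"{forecast_zb} ZB = {total_bytes:.2e} bytes = {total_gb:,} GB")
# -> 163 ZB = 1.63e+23 bytes = 163,000,000,000,000 GB
```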

5.3 Are Big Data Even Actually New?

While data are ‘sold’ to us as ‘the new oil’ (The Economist 2017), large datasets, and their use to understand human behaviour, are not new; neither is the relationship between governments, commerce and value, when it comes to data. Mary Poovey’s A History of the Modern Fact: Problems of Knowledge in the Sciences of Wealth and Society (1998) describes the rise of merchants and their influence over the State, including campaigns to promote the balance of trade as the index of national well-being from the early seventeenth century onwards (Poovey 1998, 93–94). The new ‘enthusiasm for numbers’ in the early to mid-nineteenth century (Hacking 1991, 186; Porter 1986, 1996) coincided with a growing infrastructure to collect and analyse data. This desire for numbers, and the data processes that were required to provide them, led to the ‘great explosion of numbers that made the term statistics’ (Porter 1986, 11). If truth be told, the term ‘statistics’ originated for governments to understand ‘the quantum of happiness’ (Sinclair 1798, vol. 20, p. xiii). In this ‘avalanche of numbers’, ‘nation-states classified, counted and tabulated their subjects anew’ (Hacking 1990, 2; 1991, 186). However, while ‘statistics’ may be hundreds of years old, large datasets go back further.

Managing land, agricultural hierarchies and the desire to control populations have long required systems of recording. One of the oldest-known writing systems is Sumerian script, which is approximately 6000 years old (Bellet and Frijters 2019). This script is called cuneiform, and its uses are said to include the tracking of trade and taxes: you need records on who has paid, how much; who has not paid, and what they owe (Harford 2017). While the clay tablets these records were written on may not seem like a database, or feel like the Big Data futures outlined in the previous and subsequent sections, they were a dataset of sorts. Crucially, these data were used to monitor and control resources, including the management of people.

Most countries now undertake a census of sorts. The UK Census takes place every ten years and has done since 1801.Footnote 7 The first four were only headcounts, with the 1841 Census being the first to intentionally record names of all individuals in a household or institution. The UK’s ONS website offers an interesting history of censuses in the UK, back to the Domesday Book ordered by the Norman (French) king William the Conqueror in 1086 (ONS 2016). Again, censuses precede these European data moments by some 4000 years in both Egypt and China, whose governments (as they would have been formed and named in those days) recorded who lived where and how wealthy they were. The Romans held regular censuses to keep track of their expanding—and then contracting—empire. Evidence of other institutionalised data practices exists in the Bible: the book of Genesis talks of kinship and marriage records and Exodus mentions a population census to support the tabernacle. The Church collected information on births, christenings, marriages, wills and deaths; this tracked the business of a church and its parish, but was also a means of counting the faithful and tracking their wealth.

You will note that the recording of trade and of births, marriages and deaths is not so different from the administrative data that appear in all our examples of well-being data, from Tables 3.1 to 5.3. So, what is new about Big Data? We have long had large datasets that hold multiple data points on people and nations, but these are thought of as ‘state simplifications’ for officials (Scott 1998). Rationalisation and standardisation mean these representations ‘did not successfully represent the actual activity of the society depicted, nor were they intended to; they represented only the slice of it that interested the official observer’ (Scott 1998, 3). What the historian James Scott tells us here is that the sorts of information that were collected at scale lacked detail that could be used to improve quality of life. He implies, of course, that those in charge did not actually care about quality of life, only quantity of resource, whether this was people to work the land, make armies, or pay taxes. More recently, as we have seen, governments were charged with responsibility for people’s well-being, and therefore more complex data were required.Footnote 8 One such development was the social survey.

The social survey has been used to collect data which capture various qualities of life, in richer ways and for longer than it is often credited with. For example, surveys in the UK in the mid-1940s (in World War II) discovered almost one in ten households did not have the number of cups deemed necessary for essential use, and ‘the shortage of scrubbing brushes seems to have been extensively felt’ (Oman 2015, 88; ONS 2001, 9). While these were still administrative records of resource and scarcity, the survey began to be used to articulate more qualitative aspects of quality of life as proxies for well-being. This presents richer detail than many of the contemporary surveys that generate the well-being data we have seen as either objective or subjective data so far.

These more qualitative data were not only collected by the government social scientists we might imagine with clipboards. A project called Mass Observation was established in 1937 by anthropologist Tom Harrisson, poet Charles Madge and filmmaker Humphrey Jennings.Footnote 9 Mass Observation aimed to record everyday life in Britain. There were paid investigators who anonymously recorded people’s conversations and their behaviour: at work, on the street and at memorable occasions, including public meetings or sporting and religious events.

This project was reminiscent of the current idea of ‘Big Data’, not only in the scope of the data gathered, but also in how they were gathered. Mass Observation had numerous phases and at one point also used a panel of around 500 voluntary ‘observers’. The initial aims of Mass Observation were to research everyday life, making use of ‘the untrained observer, the man in the street’Footnote 10 as much as those who were thought to be skilled and qualified in gathering data of this sort (Madge and Harrisson 1937, 10). The observers used various data collection methods to generate large datasets on different topics: some maintained diaries, while others replied to open-ended questionnaires. In 1938, there was ‘a competition’ for the residents of Bolton, Lancashire (see Fig. 5.2), asking people what happiness meant for them. This was one of many themes, and people would reply to what were called directives with often very long texts describing what they thought and how they felt. The data from these and from the 1938 project can still be accessed via a vast archive at the University of Sussex.Footnote 11

Fig. 5.2 What is happiness? Mass Observation competition flyer, 1938

Mass Observation began with a positive vision of democratising the processes behind how data were gathered to better understand people’s lives. However, over time, much qualitative social research shifted towards the narrower analysis of consumer choice, and Mass Observation became a market-research firm in 1949 (Albert 2019). Mass Observation re-launched in 1981, returning to its original egalitarian ideals, and the archives are testament to the ways that Mass Observation aims to engage the public in documenting their own lives.

These historical examples of large datasets are, therefore, not so different in their qualities from the crowdsourced, location-based, time-based data on how people feel about things, as seen in Table 5.3. The purchasing of scrubbing brushes was used as proxy data for other qualities of life in the same way that our purchasing data are analysed to better understand us. Similarly, a lack of cups was indicative of a particular kind of poverty and lack of resources at a point in time, and this was analysed across the population. However, the democratic promise of Mass Observation and other projects of the time was superseded by the potential of understanding what makes people happy for commercial gain.

5.3.1 The Darker Side of Historical Well-being Data and Commercial Gain

With the rise of market research came increased interest in people’s preferences, and in what made them happy or gave them pleasure (Davies 2015; Savage 2010). This involved capturing subjective well-being data, as well as cultivating communications to imply that owning or consuming certain things would increase someone’s well-being in some way. The aim, in this context, of course, was to change people’s purchasing choices. With this shift, people as citizens became consumers. Over the years, ‘consumer sentiment’ indices have been assessed to see if these data can predict people’s behaviours on a macro level, from economic cycles (Carroll et al. 1994) to presidential popularity (Suzuki 1992). This marriage of mood and economics is not new to us, of course. In Chap. 4, we encountered the development of subjective well-being data, a newer, shinier well-being data born of a marriage of economics and psychology known as happiness economics, which was able to measure subjective well-being at population level.

Mood and sentiment analysis are not new, then. Neither are big datasets. Even Fitbits and Apple Watches are not new; not really, as attaching technologies to people’s bodies has been used to study and improve productivity, and to surveil workers and citizens, for around a hundred years (Davies 2015; Cryle and Stephens 2017). So, what is new? The amount and variety of data on the well-being of individuals and populations are increasing as technologies develop to manage greater amounts of different kinds of data, not only faster, but faster together.Footnote 12 Therefore, it is not necessarily that any one thing (not that Big Data are one thing, really) is new. Instead, it is a far more complex picture of how different aspects of, and different people across, the fields of politics, science, research and technology work together—and work with commerce. These all combine as developments in what we know, and ways of knowing, about society.

The question is, what does that mean for well-being? How can we learn from previous mistakes regarding the context of who is using what data—and to what end? COVID-19 will offer us many data insights, and many insights into how data can help us understand and look after well-being better. The next section looks at the role of data and learning in a pandemic: the role of old and new infrastructures, and of commercial and governmental data practices, in its management.

5.4 A Case Study on the Promise of Commercial Big Data

One of the most high-profile cases of the possibilities of Big Data involves a tale that begins in 2009 when a new virus was discovered. This new illness spread quickly and combined elements of bird flu and swine flu. This story opens Mayer-Schönberger and Cukier’s book, Big Data: A Revolution That Will Transform How We Live, Work and Think, which you may remember is mentioned earlier in the chapter as a much-cited originator of the term ‘datafication’ (2013). The authors explain that the only way authorities could curb the spread of this new virus was through knowing where it already was.

In the US, the Centers for Disease Control and Prevention (CDC) requested that doctors inform them of cases. However, the information on the pandemic that the CDC had to work with was out of date. This was by nature of the data collected, and their ‘data journey’ (Bates et al. 2016). There were multiple data journeys to consider: data were collected at the point someone went to the doctor, which could be days after initial symptoms, let alone contraction; sharing data with the CDC was a time-consuming procedure; and the CDC only processed the data once a week. Thus, the picture was probably weeks out of date, making intervention or behavioural analysis difficult. In other words, while the datasets were large, even potentially fairly detailed, these Big Data were too slow.

Coincidentally, so Mayer-Schönberger and Cukier tell us, a few weeks before the new disease made the headlines, Google engineers published a paper in a high-profile journal, Nature, which explained how Google could ‘predict’ the spread of the winter flu in the US. This was possible just through analysing what people had typed into their search engine (and, of course, knowing where those people typing were). It compared the CDC data on the spread of seasonal flu from 2003 to 2008 with the 50 million most common search terms in America.

The Google engineers looked for correlations between what people typed into the Google search engine and the spread of the disease. Mayer-Schönberger and Cukier point out that Google’s method doesn’t require traditional infrastructures to distribute mouth swabs or for people to go to doctors’ surgeries:

Instead, it is built on ‘big data’—the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. With it, by the time the next pandemic comes around, the world will have a better tool at its disposal to predict and thus prevent the spread. (Mayer-Schönberger and Cukier 2013, 2–3)
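To make the logic of this correlation-hunting concrete, here is a minimal sketch in Python. Everything in it (the term names, the weekly counts) is invented for illustration; the actual study screened the 50 million most common search terms against the CDC records described above.

```python
# A minimal sketch of the GFT-style approach: rank search terms by how well
# their weekly counts track official flu counts, keeping the best-correlated
# terms as candidate predictors. All data here are invented.
import numpy as np

rng = np.random.default_rng(0)
weeks = 52

# Hypothetical CDC-style weekly flu counts (the 'ground truth').
flu_cases = rng.poisson(1000, weeks) + np.linspace(0, 2000, weeks)

# Hypothetical weekly counts for two candidate search terms.
search_terms = {
    "fever and cough": flu_cases * 0.8 + rng.normal(0, 300, weeks),  # related
    "cheap flights": rng.normal(2000, 500, weeks),                   # unrelated
}

# Rank terms by the strength of their Pearson correlation with flu counts.
ranked = sorted(
    search_terms.items(),
    key=lambda item: -abs(np.corrcoef(item[1], flu_cases)[0, 1]),
)
for term, counts in ranked:
    r = np.corrcoef(counts, flu_cases)[0, 1]
    print(f"{term!r}: r = {r:+.2f}")
```

On data like these, ‘fever and cough’ correlates strongly and ‘cheap flights’ does not; but, as the rest of this section shows, a strong correlation in one period is no guarantee that the relationship holds once people’s search behaviour changes.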

Sadly, a pandemic with wider societal and well-being effects arrived after I started writing this book, and despite the promise of Big Data, it did not prevent the spread. Data hold a very important place in the story of COVID-19 and its management, but all data have limitations in how they can inform human action to change reality, as do the different ways of analysing data. Indeed, data are not just there, but are managed and used by people with their own interests. Data do not speak for themselves, but are interpreted. All data realities also involve selective judgements about which data are important and which are not. These limits are not always made as clear as they should be.

Mayer-Schönberger and Cukier’s promise of Big Data as revolutionary and transformational in the US was clearly jumping the gun. Not only was the pandemic not prevented by way of predictive analytics, but actually, part of COVID-19 data management has very much involved doctors’ surgeries and mouth swabs—in the UK at least. To clarify, I was randomly selected from data held on people registered with a GP to participate in a survey in August 2020.Footnote 13 I was contacted by the Real-time Assessment of Community Transmission (REACT) Study,Footnote 14 which is in fact a series of studies, using home testing to understand more about COVID-19, and its transmission in communities in England. The logic behind the study was that not all people with the virus were being tested at this point, either because they were asymptomatic or for some other reason. This was one of a few projects to collect data from a sample of the population, over time, in order to understand how it was spreading.

This process relied on old infrastructures: I received a letter by Royal Mail, I signed up online, and then I was sent a mouth swab—also by post. That all worked fine for me, but there was a series of steps registering different barcodes, and I found myself wondering how accessible this was for everyone (when I say everyone, I often think of my once tech-savvy Dad, who would have been bewildered by this whole process). After completing these steps, a courier was ordered to collect the test. I sat in, patiently waiting for my test to be collected, slightly anxious about what felt like a huge responsibility, and acutely aware that I might need to be ready to run out and meet a courier with my test.

I live in a high-rise with no working bell or intercom (and a bunch of other things that don’t work). For three separate days, I watched for details of the courier on the app, and out of my window, waiting for them to appear on the road, or call to say I should come down. But there was no sighting of the courier in real life and no phone call. When the app showed they were coming, they disappeared without attempting to deliver. After three attempts, I was told that this particular courier company was infamous for not bothering to try to collect from my flats, because it was too inconvenient. So, in my case, while some aspects of the traditional data infrastructure (the post) worked fine, they didn’t necessarily all work together as they might. This meant that my test remained uncollected, expired and had to be securely disposed of. It meant my data became ‘missing data’.

What I was surprised by was how the information system assumed you would live somewhere that was easy to access. As we know, many people from our poorest communities live in high-rises where the lift doesn’t work, or in flats that are difficult for a courier to access. The contexts in which data are collected (or not) can be both extraordinary and mundane, and we often don’t hear these stories: when systems work, the odd occasions when they don’t, and what that might mean for the data. Yet these contexts have a huge impact on who is readable in data and on how we understand well-being and inequality.

So why did COVID-19 data collection end up using more traditional infrastructures in the UK? On a larger scale, why did the world not use Google data as Mayer-Schönberger and Cukier predicted? As it turns out, Google Flu Trends (GFT) overestimated the peak of the 2013 flu season by 140%, and Google subsequently closed the project (Lazer et al. 2014). In 2014, a paper called ‘The Parable of Google Flu: Traps in Big Data Analysis’ was published in another high-profile academic journal, Science (Lazer et al. 2014). The authors concluded that while there was potential in these sorts of methodologies, and while Google’s efforts in projecting the flu may have been well meaning (which could be called into question), the method and data were opaque. This made it potentially ‘dangerous’ (Lazer and Kennedy 2015) to rely on GFT for any decision-making, as the context of the data and the analyses were not made explicit to public decision-makers. Of course, it is also perhaps unlikely that Google had designed the tool for public decision-making contexts,Footnote 15 considering what government officials need to understand for this kind of decision-making.

There are other limits to the data: their sample. Google has a reputation for ubiquity, yet it is not the only search engine available: people choose other search engines for various reasons. Crucially, Google also does not have global reach. Most services offered by Google China, for example, were blocked by the Great Firewall in the People’s Republic of China. This was not even the first time Google had been blocked there. So, even if GFT were still in action, would it have pre-empted the COVID-19 outbreak in Wuhan, China, before more official announcements?

If we are to think about how Big Data have transformed how we live, as Mayer-Schönberger and Cukier want us to, then we must also consider how ‘datafication’ has changed people’s practices. More and more of us scour the internet, hoping to reassure ourselves that recently developed symptoms are minor ailments. This is—as we discovered in Chap. 2—part of the anxiety introduced with audit culture: we consult technologies as a default because we can, rather than should. We search for confirmation that nothing is wrong, rather than only searching when something is wrong. In countries where access to healthcare is diminished, people are actively encouraged to search the internet before interacting with health services. Consequently, this limits the predictability of search data, as their contexts have changed.

In the case of COVID-19, people searched for symptoms they didn’t necessarily have, especially in the second quarter of 2020, when most nations were in lockdown and the severity and ramifications of the disease were becoming clearer. The implication is that searches would not necessarily have reflected the infected state of individuals, which could be aggregated to reveal community or population infections or, more importantly, to predict transmission so that it might be controlled in some way. Instead, searches for COVID-19 symptoms may well be a predictor of concern or anxiety. Ironically, then, Google searches are arguably a better indicator of negative subjective well-being than of COVID-19.

The very idea of data being reliable has fed our need to feel sure—to have objective confirmation that all was OK, is OK or will be OK—and has led to an increased reliance on data. In the case of Google searches, this reliance has triggered people to search for verification of risk or safety. So how might we have cut through the ‘noise’ that the definitions at the beginning of this chapter point to, in order to know how the disease was spreading? We are back at the chicken-and-egg dilemma: do people search about COVID-19 because they have symptoms? Or do people search about COVID-19 because they are worried about it and feel compelled to search for confirmation—or search on behalf of friends or loved ones? I watched someone use their internet searches to check our colleague’s proclaimed symptoms against the common signs of swine flu—a very collegiate individual, but one whose search history told a story of their friend’s (potential) disease state, rather than their own. In this latter case, then, Google searches were more indicative of personality than of health, or even subjective well-being, although perhaps well-being data all the same.

Bigger datasets make correlation more powerful than causation, explain Mayer-Schönberger and Cukier, devoting a whole chapter to it in their book (2013). Google queries went from 14 billion per year in 2000 to 1.2 trillion a decade later. There are even websites that show a live running tally of how many searches have been made in a day.Footnote 16 If Big Data were all about scale, then GFT would have been more, not less, likely to work on the premise of correlation as search numbers increased. The scale at which we find correlations using ‘Big Data’ may be an indicator of causation, but not proof. Is this the end of the promise of Big Data, though? If we return to a case of COVID-19 and Big Data, what might we find?

5.4.1 Linking Big Datasets: For Well-being?

On New Year’s Day, 2020, a Canadian health monitoring company alerted its customers to the COVID-19 outbreak, some days before the US CDC or the World Health Organization (WHO) alerted anyone (Niiler 2020). Of course, the disease was not yet called COVID-19, and it was not known that it would become a global pandemic. At this point, a cluster of unusual pneumonia cases had been detected. One of the companies said to have beaten the WHO to this discovery is called BlueDot, which uses AI-driven algorithms to search datasets, much like GFT.

Unlike Google Flu Trends, BlueDot’s algorithms consolidate and analyse data from numerous sources. BlueDot’s founder, Dr. Kamran Khan, explains:

We can pick up news of possible outbreaks, little murmurs or forums or blogs of indications of some kind of unusual events going on. (Khan, in Niiler 2020)

Other data sources are more official, such as statements from health organisations, livestock reports and news reports in 65 languages. BlueDot also uses ‘anonymous mobile phone data’ (Whitaker 2020), flight sales and other records. These various data points enable the prediction of a possible new serious disease. Importantly, the logic is that this approach also offers insight into how that disease becomes mobile, carried by people and by the planes that carry the people carrying the disease.

What we have done is use natural language processing and machine learning to train this engine to recognize whether this is an outbreak of anthrax in Mongolia versus a reunion of the heavy metal band Anthrax. (Niiler 2020)

Also, crucially, ‘epidemiologists check that the conclusions make sense from a scientific standpoint’ (Niiler 2020). The company website states that ‘BlueDot protects people around the world from infectious diseases with human and artificial intelligence’ (BlueDot n.d.). Therefore, despite claims to its sophistication, the automated data-sifting still requires human analysis to make sense of what has been found.
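As a toy illustration of the kind of disambiguation Khan describes (and only that: BlueDot does not publish its pipeline in detail), here is a minimal sketch of a text classifier that uses the words around a term to separate outbreak reports from unrelated mentions. All the training headlines are invented.

```python
# A toy disambiguation sketch (not BlueDot's actual method): classify
# headlines mentioning 'anthrax' as outbreak-related or music-related,
# based on the surrounding words. Training examples are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "anthrax outbreak infects livestock in rural province",
    "health officials confirm anthrax cases in cattle herd",
    "veterinary agency reports anthrax deaths among grazing animals",
    "anthrax announce reunion tour with new album",
    "heavy metal band anthrax headline summer festival",
    "fans queue overnight for anthrax concert tickets",
]
train_labels = ["outbreak"] * 3 + ["music"] * 3

vectoriser = CountVectorizer()
classifier = MultinomialNB().fit(vectoriser.fit_transform(train_texts), train_labels)

new_headlines = [
    "suspected anthrax outbreak in Mongolia under investigation",
    "anthrax reunion show sells out in minutes",
]
print(classifier.predict(vectoriser.transform(new_headlines)))
# -> ['outbreak' 'music'] on this toy data
```

Even in this toy form, the sketch shows why the human check matters: the classifier only knows the contexts it was trained on, so the epidemiologists reviewing its output remain an essential part of the system.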

Khan’s company utilised the technological developments at its disposal to synthesise many different types of data from multiple datasets to construct evidence. Only when the data were pieced together was the information useful, and only after human experts had checked it were these insights deemed useful enough to share and use. BlueDot is a commercial company. The human and artificial intelligence are synthesised as an enterprise, and Khan is often presented as both an entrepreneur and a professor of medicine and public health at the University of Toronto. Khan has also worked in hospitals, so understands how they work. Khan explains in one interview,

Disease doesn’t wait for the reviewers, so we need a more agile system. My motivation for creating a company—here to start supporting an entrepreneurial spirit—using business as the vehicle to do that. (Khan, on Charrington 20 February 2020)

There are two things to note here. Khan suggests that the old structures of peer review and scientific expertise are too slow in their use of data and evidence to tackle a global pandemic. He also suggests that his business successfully links together ‘human and artificial intelligence’ to provide what traditional science cannot: the analysis of data with veracity and variability, speed, resolution, relationality and so on. The value of BlueDot is in its claims to harnessing the qualities of Big Data.

To return to Mayer-Schönberger and Cukier, ‘Google’s method’ may not have involved distributing mouth swabs, or been built on old infrastructures, but instead, they explain:

[I]t is built on “big data”—the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. (Mayer-Schönberger and Cukier 2013, 2)

So, there we have those familiar terms of insights (a marketing term) and valuation (which we discovered from economics in Chap. 2), alongside clear communications and the presentation of novelty (Chap. 4), goods and services. Mayer-Schönberger and Cukier hint at the complex politics at play around the value of data—and the values of data—more broadly than we have already encountered.

Crucially, in a book about well-being and data, we have to note that BlueDot’s business is entrepreneurial because it is profitable. In other words, the insights have to be sold to clients and customers. BlueDot was also not the only innovator (as acknowledged by The Lancet and the MIT Technology Review [McCall 2020; Heaven 2020]). Here, we must return to the economic value of data, because of the possibilities of well-being insights and the ideological project of the well-being agenda.

If the well-being agenda is about improving the redistribution of resources as an issue of social justice, we might want to think about what position we are coming from: rather than asking, ‘what are the data limits of these well-being projects?’, we might ask, ‘what are the well-being limits of data projects like these?’ And despite the clear sophistication of BlueDot’s project, it too did not prevent COVID-19’s spread. This criticism has been noted in the MIT Technology Review:

The hype outstrips the reality. In fact, the narrative that has appeared in many news reports and breathless press releases—that AI is a powerful new weapon against diseases—is only partly true and risks becoming counterproductive. (Heaven 2020)

The point this MIT article was making here is that the over-reaching claims of AI could be damaging to its future progression, in the same way that GFT overstretched its claims.

Data and the distribution of resources are very much part of the COVID-19 story, and not just a story of private companies profiteering, either. Such competition is also reiterated by national politicians misleading the public about ‘world-beating’ systems of data (BBC 2020). In the same way that the social indicators movement was halted because it was not quite measuring what it thought it was measuring (Chap. 2), the ‘promise’ of Big Data has adjusted. The limits of Google’s approach lie in a lack of context: the nature of what people actually search for is different from what was predicted. The limits on data are social, cultural, political and economic, and by extension, these limit the possibilities for a good society. We will explore social media and mobile communications data in the final few sections to better appreciate this relationship.

5.5 Social Media Data: A Game Changer?

I am sure that social media plays a role in unhappiness, but it has as many benefits as it does negatives. (Sir Simon Wessely, president of the UK’s Royal College of Psychiatrists in Campbell 2017)

Social media platforms have an interesting relationship to well-being. They are often demonised as bad for well-being, especially for the younger generation, who are thought to dwell on images of idealised bodies and lifestyles on Instagram (Campbell 2017). All ages feel a pang looking at the picture-perfect presentations on Facebook, and even the NHS warns people to take breaks from social media (NHS 2016). Credible, successful women leave themselves vulnerable to criticism from strangers when sharing thoughts, opinions and aspects of their identity on platforms like Twitter (Lewis et al. 2016). Similarly, hate speech against people of colour (Gayle 2018) or against people for their gender identity (Pearce et al. 2020) is a reality of social media platforms. However, social media and online platforms also offer places for human connection, and have helped mitigate the social isolation brought about by measures to curb the spread of COVID-19. The jury is still out on many of the pros and cons of social media, including their propensity to spread disinformation, versus credible analysis of data and guidelines. Social media therefore hold an ambivalent place in the management of well-being.

These controversial aspects of social media are not their only connections to well-being. The data we share can make them useful for well-being analysis. The most mundane aspects of our feeds, the venting of minor irritations, celebrations of small wins or just feelings shared with friends and family mean our social media accounts are full of well-being data. Think about those ONS4 questions again (Table 4.2) that aim to gauge 'personal well-being'. They ask you to think about how you felt yesterday, in terms of happiness or anxiety, as well as whether you think what you do is worthwhile, and whether you are satisfied with your life. When you think about Facebook's most prolific posters in your timeline, much of their content will indicate how they felt in similar ways at specific moments. The addition of emoji reactions to Facebook means it is easier to proclaim whether you were happy, celebrating or anxious. The reminders of what you were doing this time last year or ten years ago mean we are telling everyone on Facebook how we feel now about how we were feeling in previous years. Crucially, this means it is even easier for Facebook to know this too, as you have essentially coded your own data for them.

This compulsion to share how we feel means we are also sharing our data with Facebook and other platforms, which are able to analyse us alongside millions of others at scale. Companies like Brandwatch monitor social media and analyse several billion emoticons each year to inform brands whether they are provoking hatred or happiness with their products. It is also possible for a broad range of actors to mine social media data, whether commercial companies, government agencies, academic researchers or amateurs with the inclination to do so. The platforms are set up with open Application Programming Interfaces (APIs). APIs are what allow other (data-mining) software to interact with social media platforms. Once access to social media data has been gained, they can be 'scraped' with comparative speed, given the right skills and software. Scraping is a process which essentially involves gathering and copying data that meet specific search terms. The data are then put into a database (which can be as crude as a spreadsheet) for later retrieval or analysis. This can be done by a person, although the term more typically refers to automated processes involving a bot or web crawler. The fact that APIs are generally open as standard indicates that these data—your data—are made available by social media platforms to be used by various different actors. Not many people think about the fact that their public post on a social media platform is public in the sense that it is no longer their private property and can be used by others in research.Footnote 17
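To make the scraping process concrete, here is a minimal sketch in Python of what an API-based scrape might look like. The endpoint, access token and response shape are all hypothetical placeholders rather than any real platform's API: each platform has its own routes, authentication scheme, rate limits and terms of use.

```python
import csv
import requests

# Hypothetical endpoint and token, for illustration only: real platforms
# each define their own API routes and authentication.
API_URL = "https://api.example-platform.com/v1/search"
TOKEN = "your-access-token"

def scrape(query: str, max_results: int = 100) -> list[dict]:
    """Fetch public posts matching a search term via a (hypothetical) API."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"q": query, "limit": max_results},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["posts"]  # assumed response shape

# Store the results in a 'database' no more sophisticated than a spreadsheet.
posts = scrape("#happiness")
with open("happiness_posts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "date", "text"])
    writer.writeheader()
    for post in posts:
        writer.writerow({k: post.get(k, "") for k in ("author", "date", "text")})
```

The point of the sketch is how little machinery is involved: one authenticated request, a loop and a spreadsheet-shaped file is enough to begin mining public posts.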

There are practical limits to what can be known through analysing people's social media posts, of course. First, people do not represent themselves neutrally on social media. As we know, people feel compelled to publish reflections on an idealised version of their lives (Kruzan and Won 2019). Of course, our social media posts don't always represent our lives as happier than they actually are: people often exaggerate the impact of minor negative events as mundane as missing the bus or being rained on. Some people collectively engage in dissatisfaction with their lot in life, leading to Twitter bubbles and what has become known as 'the culture wars',Footnote 18 the contemporary cultural conflict between social groups. This term describes a gap between those who side with a traditional, conservative approach and those with a liberal, progressive approach to society and social issues, such as immigration, abortion, LGBTQIA+ rights, and so on. The contemporary culture wars, as a struggle for dominance of values and beliefs, now take place on Twitter, and we might question the extent to which such rage and passion are indicative of someone's personal well-being, or some form of tribal rage on a larger scale. Essentially, we are seeing how important social media can be in both distorting and shaping our well-being, for better or for worse. The key to appreciating the relationship of social media, data and well-being is understanding limits and context—of collection and use.

5.5.1 Social Media Data Mining in Social and Cultural Sectors

Social media data mining is not always a large-scale affair requiring APIs and special software. As found in a six-month research project with city councils and a city-based museums group in the north of England (Kennedy 2016), many small organisations use quite basic techniques to do this work. Social and cultural policy sectors are reliant on understanding well-being data, as improving well-being is at the core of what many of them do. Yet, as Chap. 1 of this book acknowledges, the sectors do not always have the skills or confidence to use data. We will look at these sectors as a whole in greater depth in the next three chapters.

The project explored how these smaller social and cultural organisations were already using data mining and how they might use it more effectively. The researchers discovered that although software packages were adopted to analyse institutional impact and engagement on Twitter, this was largely unsystematic (Kennedy 2016, 71–72). Keen to improve their social media data mining capacity, these organisations signed up for training in new tools that would improve their capability. However, it became clear that less data mining was happening than expected, and the capacity of workshop participants to engage with training in the new tools also fell away (Kennedy 2016, 74). Doing better with data seems a good idea, but is not always as easily resourced or incorporated into working practices as initially hoped.

Local councils and social and cultural sector organisations all have limited resources. Despite enthusiasm for being, or becoming, data-driven, capacity to invest time and money in new tools at the organisational level is often lacking (Kennedy 2016; Oman 2019a, b). In the case of the cultural sector, there is a tendency to invest in grand schemes, new metrics and reports at policy level that claim to investigate the value of new and/or Big Data and the associated technologies required to generate or analyse them (Gilmore et al. 2018; Oman 2013a). However, when considering the (already ill-defined) cultural sectorFootnote 19 as a whole, differences in requirements and capacity for data technologies are obscured, and these differences are multiplied by huge variability in organisation size, type, purpose, mission and cultural offering across and within sectors (Oman 2013a). These top-down resources and contributions are not always actually used or found useful at an organisational level or across the wider sector (Oman 2013a). Some organisations recognise that their audiences are full of people whose opinions are less easily captured by Big Data. Some people, for example, still prefer telephone booking lines to web pages and are certainly not tweeting or Instagramming their experience of a show. As such, some who attend a show are less likely to be generating data on their opinions that might then be mined. Advocates for using Big Data in small organisations acknowledge that Big Data can be 'debilitating' in their complexity and challenges, but this is not always explored in a way that offers resolution (Oman 2013a), and as we have seen (Kennedy 2016), when recommendations, even training, are offered, there is not necessarily the capacity to take them up.

Yet, it can be very easy and fast to interact with Big Data as social media data, as long as you consider the limitations of the data and their origins, as well as how you might analyse them yourself. Organisations and individuals do not need Big Data analytics know-how or software, although there are excellent resources freely available to help them understand how,Footnote 20 as I found when I wanted to explore Twitter discussions about happiness. In 2013, Mass Observation recreated the Bolton happiness study on Twitter (see Fig. 5.3). This was still fairly experimental for them, as it was for me when I requested access to the tweets. They captured 25 responses at the time.

Fig. 5.3 Mass Observation happiness tweets

The sample of 25 meant that—of course—I did not require data mining or sentiment analysis software—or any knowledge of APIs. In fact, I did not even need to request these tweets from Mass Observation directly, as they are still available on Twitter by searching the hashtag (or were in August 2020 when I last checked). A cursory analysis in this case simply meant reading, and noting similarities and themes, which I could have done on a piece of paper.

So, what did this cursory analysis tell me? Whilst 20% mentioned pets, all of which were cats (it is the internet after all), one person replied with a single word: bacon. Mainly, however, people described informal, everyday participation,Footnote 21 including reading, going to gigs, watching films. There were lots of glasses of wine and some chocolate in there too. The textual content of these tweets is reproduced in Box 5.1, without Twitter handles.

Box 5.1 Tweets Answering the Question: ‘What Is Happiness?’

• Beer, maps, chocolate, quizzes, the unending pursuit of knowledge

• Ability for women to walk down the street & not be catcalled or threatened. Few happy women here

• Short term happiness is different for everyone. Long term happiness is about fulfilling your potential.

• Bacon

• 5 minutes to myself and a good book, with peppermint tea and the cats curled up around me. Absolute bliss!

• Volunteering, yoga, baking, being with loved ones, reading, warm days paddling in the sea, colourful things, exploring, my cat: D

• Doing what I love (#history), a safe home by the sea, someone to love & share things with

• Good company, fireworks, being smiled at, a job well done, ‘sweet pea’ by Manfred Mann, making someone else happy, good health.

• I am happiest when discovering/learning new things, such as reading books and finding new music.

• Happiness is cooking for those I love, with a glass of wine and giggles on the side.

• Day off. Smoke in peace.

• “What is happiness?” something to do with dopamine levels

• Making things that muself [sic], and hopefully other people will enjoy

• Loving and being loved and valued for who I actually am.

• More precisely: Time, a book, a view, a friend.

• Choices and control in life not just in shopping.

• Connecting with other people, being able to make a difference to someone else, a good book and a purring cat on my lap!

• My kids

• What is happiness?’—“A warm spot on the bed in the sunshine”

• Knowing that enough is plenty

• The scent of roses on a damp morning […] being where you are without wishing to be somewhere else

• Happiness is seeing my children flourish, Swansea City FC progress & succeed & cooking for husband. Ln that order!;)

• Love, health and a sense of purpose. Oh, and cake.

• What makes me happy? Cuddling up on the sofa with my partner & animals, a glass of wine, chocolate, a film & crochet- bliss

• Happiness is good relationships, a little more than enough money, satisfaction and contentment

You might note that the surprising variety of theories of well-being we have encountered so far in the book can be present in just 25 tweets. Some map onto clear areas of social policy; others are definitely in the private domain. Some people used negative language to imply life isn't currently great for them: 'Day off. Smoke in peace.' and 'Ability for women to walk down the street & not be catcalled or threatened. Few happy women here'. Some people were philosophical, others wistful. Some focussed on activities, others on the 'bliss' of doing nothing. The variety of tone and content makes for fascinating reading, but leaves these data wide open to interpretation—whether that is via human or artificial intelligence.
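For readers curious how even a 'cursory analysis' of this kind might look as code, here is a minimal sketch of the keyword tally. The theme keywords are my own illustrative choices, not a validated coding frame from the original exercise, and a tweet counts once per theme however many keywords it matches.

```python
import re

# Hand-picked theme keywords: illustrative choices, not a validated
# coding frame.
themes = {
    "pets": ["cat", "cats", "animals"],
    "reading": ["book", "books", "reading"],
    "food and drink": ["wine", "chocolate", "bacon", "beer", "cake", "tea"],
    "loved ones": ["love", "loved", "friend", "friends", "family", "kids", "children"],
}

def mentions(tweet: str, keywords: list[str]) -> bool:
    """Whole-word match, so 'cat' does not fire on 'catcalled'."""
    return any(re.search(rf"\b{re.escape(kw)}\b", tweet.lower()) for kw in keywords)

def tally(tweets: list[str]) -> dict[str, float]:
    """Share of tweets mentioning each theme (a tweet counts once per theme)."""
    return {
        theme: sum(mentions(t, kws) for t in tweets) / len(tweets)
        for theme, kws in themes.items()
    }

# e.g. tally(box_5_1_tweets)["pets"] would approximate the '20% mentioned
# pets' observation above, depending on the keyword list chosen.
```

Notice how much interpretive judgement hides in a few lines: the choice of keywords is itself a theory of what counts as a theme, which is precisely the point about bias made below.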

I used these tweets as a light-hearted example, with my ever so light-touch analysis, in my first ever conference presentation in 2013. In Chap. 3, I explained that my research question at the beginning of my PhD was loosely: ‘When people describe well-being, how often do they talk about participating in different kinds of activities—and what might that tell us about aspects of social and cultural policy?’ or ‘how can qualitative data collected to understand well-being tell us how people feel about what they do?’. I noted in this presentation that state-funded cultural practices (like art galleries and museums) were less frequently mentioned by people as making them happy than what is called everyday participation (Oman 2013b). This same finding emerged from my reanalysis of the ONS free text data I used in my PhD (Oman 2017, 2020). By extension, these data (with their caveats) were another dataset to suggest we should question whether cultural funding was supporting activities that made people happier or increased their well-being.

This was not the only way of analysing these tweets to make an argument about the relationship between culture and well-being. Someone else may have counted how many of these responses included something creative and used their analysis to argue they had found the value of culture to people, thereby justifying more funding. These are debates about data and their use in politics and policy that we return to in the next chapter. What is important here is that even with (arguably, especially with) such a small dataset, we can see how human bias can interact with data and lead to different arguments.

If it is difficult for humans to make categorical claims from a form of sentiment analysis that is not much more systematic or technical than reading 25 tweets, we must remember these limits when such analyses are made through machine learning. This is especially vital as time-sensitive analyses of large-scale samples of emotional expressions are being used in research on COVID-19, particularly given they are seen to have the potential to inform mental health support and help tailor risk communication to change behaviours (e.g. Pellert et al. 2020). As with all data uses mentioned in this book, it is not that using social media data or automated sentiment analysis is necessarily bad, but rather that their limits should be recognised. As ever, it is an issue of methodology, transparency, context and legibility.

5.5.2 Understanding Where People Are and How They Feel Using Twitter Data

Of course, it is not only what people say that can be mined, but also where they are. One research project attempted to gauge community well-being using Twitter data from between 27 September and 10 December 2010 (Quercia et al. 2012). Interestingly, as an aside, this coincided with the UK's Measuring National Well-being debate, which launched in November of that year. The researchers were interested in a few things. They wanted to understand more than individuals: to measure the well-being of communities. They state their intention as moving forward the recent developments in subjective well-being measures that we discovered in the last chapter. Rather than administering questionnaires on an individual basis, or in a national-level survey, they wanted to explore the recent possibilities of sentiment analysis to understand community well-being.

Social media data can significantly reduce the time-consuming processes that make large-scale surveys and qualitative work resource-heavy. Once these data have been 'scraped' and saved into a database, they can be analysed in many ways. Quercia and their co-authors were interested in whether sentiment analysis could interpret community well-being. They used a sentiment metric originally derived from studying Facebook status updates (Kramer 2010). This metric standardised the difference between the percentage of positive and negative words in a Facebook user's posts in one day. Kramer used the metric to make arguments at a national level, aiming to develop, as he suggests in the title of his paper, 'An Unobtrusive Behavioral Model of "Gross National Happiness"'.
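The general shape of such a metric can be sketched in a few lines, assuming we have a user's daily percentages of positive and negative words. The construction here, standardising each series against the user's own history and then taking the difference, is an illustration of the idea rather than a reproduction of Kramer's published formula, whose details may differ.

```python
from statistics import mean, stdev

def standardised_sentiment(pos_pct: list[float], neg_pct: list[float]) -> list[float]:
    """Sketch of a Kramer-style metric: standardise a user's daily positive
    and negative word percentages against that user's own history, then take
    the difference. Illustrative only; the published metric may differ."""
    mu_p, sd_p = mean(pos_pct), stdev(pos_pct)
    mu_n, sd_n = mean(neg_pct), stdev(neg_pct)
    return [
        (p - mu_p) / sd_p - (n - mu_n) / sd_n
        for p, n in zip(pos_pct, neg_pct)
    ]

# One value per day: positive scores suggest a happier-than-usual day for
# that user, negative scores an unhappier one (invented percentages).
daily = standardised_sentiment([3.2, 4.1, 2.8, 5.0], [1.0, 0.6, 1.4, 0.5])
```

Standardising against each user's own baseline is what makes the metric 'unobtrusive': it needs no survey questions, only the words people already post.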

His new standardised metric was found to correlate with self-reported life satisfaction. Looking at the US specifically, peaks in the metric were found that correlated with national and cultural holidays. This is fine in and of itself, but what does that tell us about well-being? That Christmas is good for well-being? Other research indicates otherwise (Holmes and Rahe 1967; Mutz 2016), suggesting it can cause feelings of stress for various reasons: financial, family, and so on. What about the days either side, when people are travelling huge distances (with everyone else) using transport infrastructure which is not fit for purpose? Or the excesses of consumption that holidays like Christmas involve, as well as their impact on the planet? What about all those who do not celebrate Christmas because they are not of a Christian denomination? In his limitations, the author acknowledges the possibility that people's propensity to wish each other 'Happy Christmas' could have affected these results. However, he decided not to control for this, as wishing someone happy holidays is a positive sentiment. We might wonder, then, whether this study was really interested in the possibilities for understanding the human experience using the details of the Facebook posts, or whether it was interested in deriving a metric that was comparable with more established methods.

Returning to the study on community well-being, the authors state, ‘it is not clear whether the correspondence between sentiment of self-reported text and well-being would hold at community level, that is, whether sentiment expressed by community residents on social media reflects community socio-economic well-being’ (Quercia et al. 2012, 965). Therefore, they do note some of the limitations of using this approach to answer their research question. However, notably, they do not acknowledge some of the limitations of the metric itself.

London was chosen as the site of the study to understand communities, socio-economics and well-being. Let's break down what they did and how. The study used four types of data gathering. It:

1. 'Crawled' Twitter accounts whose user-specified locations report London neighbourhoods.

2. Geo-referenced the Twitter accounts by converting their locations into longitude-latitude.

3. Measured socio-economic prosperity, using the UK's IMD.Footnote 22

4. Conducted sentiment analysis on tweets between particular dates from their sample.

How did these processes work?

1. How the crawl worked: the researchers chose three popular London-based profiles of news outlets: the free newspaper The Metro, which was available in London on the Tube at the time (it has since expanded), a right-wing tabloid The Sun and the centre-left newspaper The Independent. These media were chosen because they are thought to capture different demographics of class and politics. Using these three accounts as ‘seeds’, they used ‘a crawler’ to trace linked accounts. Crawlers are software that allows you to gather various kinds of available data based on who interacts with a particular website or Twitter account. In this instance, every user following these accounts was ‘crawled’.
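Computationally, a crawl of this kind is a traversal of the follower graph outwards from the seeds. The study collected every follower of the three seed accounts; the sketch below generalises this to a breadth-first traversal, with a placeholder get_followers function standing in for whatever API call returns an account's followers (it is not a real library function).

```python
from collections import deque

def crawl(seed_accounts: list[str], get_followers, max_accounts: int = 250_000):
    """Breadth-first crawl outwards from seed accounts. `get_followers` is a
    placeholder for a platform API call returning an account's followers."""
    seen = set(seed_accounts)
    queue = deque(seed_accounts)
    while queue and len(seen) < max_accounts:
        account = queue.popleft()
        for follower in get_followers(account):
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return seen

# e.g. crawl(["metro_seed", "sun_seed", "independent_seed"], get_followers)
# with hypothetical seed account names standing in for the three newspapers.
```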

2. Some Twitter users stated where they lived in their profiles. Accounts were crawled to find that 157k of 250k profiles had listed locations, with 1323 accounts specifying London neighbourhoods. The researchers then filtered out likely bots by also 'crawling' another metricFootnote 23 for each profile. This brought the sample down to 573 profiles. Once these were established, locations were converted into longitude-latitude pairs, translating these data into geographical co-ordinates, which are easier to work with.

3. The IMD is broken into 32,482 areas; 78 of these fall within the boundaries of London used by the authors (these are not necessarily fixed). The IMD offered a score for each of London's 78 census areas. The authors use a census area to represent 'a community'. We shall return to this key point in a bit, but hold that thought. The data come from the ONS' Census and form an objective list of sorts: income, employment, education, health, crime, housing and environmental quality. It is worth noting that in the IMD, the ONS talk about 'Lower Layer Super Output Areas' (LSOAs), rather than communities.

4. Sentiment analysis was undertaken on the tweets using two algorithms: (1) Kramer's metric, described above, and (2) something called a 'Maximum Entropy classifier', which uses machine learning. The algorithm in Kramer's metric has a limited dictionary, so this second machine learning package was used to improve on the first, using a training dataset of tweets containing smiley and frown-y faces. The authors argue that the results across the two algorithms correlate and are accurate. They then measured the sentiment expressed by each profile's tweets and computed, for each region, an aggregate sentiment measure of all the profiles in the region.
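A 'Maximum Entropy' classifier is mathematically equivalent to logistic regression, so a rough sketch of this two-part step, training on emoticon-labelled tweets and then aggregating scores by region, might look as follows. The training tweets, labels and region structure are invented, and scikit-learn's LogisticRegression is a stand-in for the authors' actual pipeline, not a reproduction of it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Emoticon-labelled training tweets (invented examples): the emoticon acts
# as a noisy 'distant' label, as in the training set described above.
train_texts = ["great day out :)", "stuck in traffic again :(",
               "lovely walk in the park :)", "another grey morning :("]
train_labels = [1, 0, 1, 0]  # 1 = smiley, 0 = frown-y

vectoriser = CountVectorizer()
classifier = LogisticRegression().fit(
    vectoriser.fit_transform(train_texts), train_labels
)

def region_sentiment(tweets_by_region: dict[str, list[str]]) -> dict[str, float]:
    """Score each region's tweets, then aggregate to a mean positivity
    score per region, mirroring the aggregation step described above."""
    return {
        region: classifier.predict_proba(vectoriser.transform(tweets))[:, 1].mean()
        for region, tweets in tweets_by_region.items()
    }
```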

Findings: So what did they find? Through studying the relationship between sentiment and socio-economic well-being, they found that 'the higher the normalised sentiment score of a community's tweets, the higher the community's socio-economic well-being'. In other words, the sentiment metric accounted for positive and negative sentiments, enabling each area's aggregated data to show an average score. This tended to correlate with the scale they used to indicate poverty and prosperity in each locale (the IMD).
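Comparing the aggregate sentiment scores against IMD scores is, computationally, a simple correlation exercise. The sketch below uses a rank correlation as one natural choice (the paper's exact statistical tests may differ), and the numbers are invented.

```python
from scipy.stats import spearmanr

# Invented illustration: mean sentiment per area alongside an IMD-style
# deprivation rank (1 = most deprived, 5 = least deprived). A positive rank
# correlation means happier tweets in less deprived areas.
mean_sentiment = [0.61, 0.55, 0.48, 0.40, 0.35]
deprivation_rank = [5, 4, 3, 2, 1]

rho, p_value = spearmanr(mean_sentiment, deprivation_rank)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```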

Limitations—What did the authors identify as limitations?

Demographic bias—Twitter users are certain types of people; therefore, these findings will over-represent the happiness of Twitter users—missing out on non-users.

Causality—our old friend. Though the causal direction is difficult to determine from observational data, one could repeatedly crawl Twitter over multiple time intervals, and use a cross-lag analysis to observe potentially causal relationships.

Sentiment—They tracked sentiment but not ‘what actually makes communities happy’ (Quercia et al. 2012, 968). The intention was to compare topics across communities. Their example:

given two communities, one talking about yoga and organic food, and the other talking about gangs and junk food, what can be said about their levels of social deprivation? The hope is that topical analysis will answer this kind of question and, in so doing, assist policy makers in making informed choices regarding, for example, urban planning. (Quercia et al. 2012, 968)

As evidenced by the possibilities for making an argument using the crude analysis of the Mass Observation tweets, and as suggested by the citation directly above, there is bias in the ways that Big Data can be used to inform social and cultural policy. However, this is not necessarily any more the case in these examples than in those using more traditional data sources explored earlier in the book. The ways our social worlds are ordered do not reside in the algorithms, but in the preconceptions, laziness and judgements which become reproduced through researchers and their research, and through policy-makers and their decisions. While the Quercia et al. example was presented as a binary of opposites for narrative effect, the ridiculousness of the proposition may not stop it being put into effect as a deductive study in future. The fact that gangs are unlikely to tweet about gangs is one thing; the idea that these gangs remain within their ONS-allocated geographical boundaries, called LSOAs, is another nonsense altogether.

This brings me to another point: LSOAs are not communities, not in the way that we think of community well-being as built on social relations and inter-related lives. People are not only active citizens where they live and, in a city like London especially, may actually be more likely to be active citizens where they work. Without the context of understanding London, what it is to live in London, and the complex, overlaid communities and social groups that comprise a postcode, this idea of community well-being is a misnomer. Instead, it matches one index that uses census data, which, while valuable, can be out of date and is well-known for its various limitations as a metric of socio-economic deprivation or advantage.

Perhaps another way to look at a question of community well-being might be to look at people interacting in public space. Plunz et al. (2019) also used sentiment analysis with geo-located Twitter data. They were interested in finding well-being indicators associated with urban park space. Their goal was to assess whether tweets generated in parks express more positive sentiment than tweets generated in other places in New York City. Their results suggest that tweets in Manhattan differ from those in other NYC boroughs: in Manhattan, people's tweets were more positive outside of parks than inside, whereas the opposite was true outside of Manhattan. They concluded that Twitter data could still be useful for aspects of social policy, including urban design and planning. They also note that one of the limitations of geo-located Twitter data is that GPS is less accurate than sometimes accounted for. It also does not account for elevation, so you could be on the subway underneath Central Park, or indeed stuck in traffic alongside it. It is hard to establish whether people had gone for a walk to let off steam or were commuting to work, for example.
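Classifying a tweet as 'in a park' or not is, computationally, a point-in-polygon test. A minimal sketch with the shapely library follows; the park boundary is a crude invented quadrilateral placed roughly where Central Park sits, not an official shapefile, and, as the authors note, a flat 2D test like this cannot distinguish the park from the subway beneath it.

```python
from shapely.geometry import Point, Polygon

# A toy park boundary: four invented corners roughly tracing Central Park.
# Real analyses would use official park shapefiles.
park = Polygon([(-73.973, 40.764), (-73.982, 40.768),
                (-73.958, 40.800), (-73.949, 40.797)])

def in_park(lon: float, lat: float) -> bool:
    """Classify a tweet's GPS fix as inside or outside the park polygon.
    A 2D test cannot tell street level from a subway tunnel beneath the
    park, one of the limitations noted above."""
    return park.contains(Point(lon, lat))

print(in_park(-73.9665, 40.7812))  # inside the sketch polygon: True
print(in_park(-73.9857, 40.7484))  # well to the south of it: False
```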

The relationships between where we are standing or where we live and our well-being are not new, but a feature of much philosophy on the nature of subjective experience, especially since the Enlightenment (which we shall come to in the next chapter). Big Data offer new ways to test what we know about place. However, these data and devices also make assumptions about place and experience (Wilmott 2016). The expectations and suppositions of what happens where, for whom and how drive these analyses with the same bias as other Big Data technologies, and we must be aware of the limitations of these data, of the technologies and of the ideas of well-being they claim to measure. We also need to be vigilant about who holds the data and why they are analysing them.

5.6 Fit for Purpose? Health and Well-being Tracking and Apps

Recent technological developments have seen a rise in people using wearable technologies and their mobile phones to track their movements and behaviour: periods of activity, menstruation, what they have eaten, how they have slept, how far they have walked and their heart rate, in order to gain an overall picture of their health and general well-being. These practices are frequently called the Quantified Self movement (Ruckenstein and Pantzar 2017), which refers both to the cultural phenomenon of self-tracking using one's own data and to the community of people who use and share data in this way.

The technologies are increasingly popular and are being discussed as cost-savers for the NHS, but there are barriers to their use (Jee 2016). Around five years ago, 85% of the general population did not own wearable devices (Lee et al. 2016). Therefore, measures which use datasets from these technologies will only account for a proportion of the population, who are most likely to be younger and more affluent (Strain et al. 2019) and already demonstrating an investment in their current and future well-being by owning such a device in the first place. We also do not yet fully understand the impact of COVID-19 on wearable device and app use, as at the beginning of the crisis there were stories about governments using these data to monitor compliance with lockdown measures (Digital Initiatives 2020). YouGov polling dataFootnote 24 indicate that even in July 2020, 65% of the UK had still never owned a wearable device, with 22% currently using one (and everyone else having tried one, or owned one but not currently using one). However, the same YouGov data indicate that usage had increased from 22% to 27% by January 2021, and the proportion who have never owned a device decreased at a similar rate. It seems, therefore, that the COVID-19 period has seen an increase in wearable technology use, as people take an interest in their well-being data in new ways.

Self-tracking, the practice of generating or capturing data about everyday activities like eating and exercise for purposes of self-improvement, puts data and control in the hands of people, as well as in the hands of the corporations which produce self-tracking devices and the third parties with which these data are shared (Kennedy et al. 2020). The research is ambivalent as to whether the experience of self-tracking has positive benefits, such as a perception of control, agency or, in the case of professional or amateur sport, opportunities for new communities (Ajana 2017; Lupton 2019; Pink and Fors 2017). It is also thought that these practices in and of themselves, and in their relationship to control, may decrease well-being more generally (Kennedy et al. 2020).

Data collected via mobile phone apps present similar possibilities for community and compromise. Smartphone access and usage only account for certain sections of a national demographic, much like wearable devices. Similarly, people who download an app to better understand their well-being are already self-selecting as wanting to improve their well-being, and therefore may not be considered a representative sample. A number of apps in the early 2010s sought to develop further the insights gained from subjective well-being measurement.

In 2012, experts in geography and the lived environment based at the London School of Economics created a mobile phone app to understand happiness (MacKerron and Mourato 2013). Branded a 'hedonimeter' (after the nineteenth-century invention we discovered in Chap. 2), the 'Mappiness' app asked people to allow it to collect objective data about where they were (automatically, using GPS data), what activity they were doing and who they were with (as manual entries). It also asked them to provide hedonic responses (subjective well-being data) as to how awake, happy and relaxed they were. These data were collected using sliders instead of the more traditional scales we have previously encountered. The data collected by the app were used in a number of different ways to appreciate subjective well-being, and we will touch on a couple here.
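As an illustration of the kind of record such an app generates, here is a guess at the shape of a single Mappiness-style response, sketched as a Python dataclass. The field names and the 0–100 slider scale are my assumptions for illustration; the app's actual data model is not published in this form.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical schema for one experience-sampling response: field names and
# the 0-100 slider scale are assumptions, not the app's published model.
@dataclass
class WellbeingPing:
    timestamp: datetime
    latitude: float          # captured automatically via GPS
    longitude: float
    activity: str            # manual entry, e.g. "gardening"
    companions: list[str]    # manual entry, e.g. ["partner"]
    happy: float             # slider responses, assumed 0-100
    relaxed: float
    awake: float

ping = WellbeingPing(datetime.now(), 53.48, -2.24, "pub garden",
                     ["best friends"], happy=92.0, relaxed=88.0, awake=75.0)
```

Records of this shape, repeated across thousands of users and millions of pings, are what make the analyses described next possible.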

In 2015, a report which drew on these data was published. 'Cultural Activities, Artforms and Wellbeing' reported on research commissioned by Arts Council England (ACE). The authors evaluated the hedonic readings of various activities found in the data collected by the app (Fujiwara and MacKerron 2015). Table 5.4 shows what the authors describe as 'happiness activities rankings', with theatre, dance and concerts appearing to have the highest effect and reading the lowest, unless you incorporate other 'everyday participation' activities, such as TV watching. As you can see, housework, chores and DIY are negatively associated with happiness.

Table 5.4 ‘Happiness activitiesa rankings’

Other studies cited in this report indicate that theatre has less of an effect on life satisfaction, whereas reading fares much better (Leadbetter et al. 2013). As we encountered in Chap. 4, there are conceptual differences between life satisfaction and happiness, and common sense might tell us that reading and attending a theatre performance present different kinds of well-being experiences. Yet, seeing that reading looks quite bad for well-being is surprising at first glance. Elsewhere in the report are regression tablesFootnote 25 for other activities, including birdwatching, gardening, and hunting and fishing, which appear significantly better than watching a film, or indeed poor old reading, which doesn't win on these happiness scales. Interestingly, when you go back to the Twitter data answering the question 'what is happiness?' (Box 5.1), there were many responses that answered reading, curling up on the sofa and watching a film, and so on. While the limited sample of the Twitter data makes it impossible to generalise, it certainly still poses questions as to what is going on with confounding results in various happiness data. One thing that struck me returning to these cases in 2020, a world changed by COVID-19, is the difference between activities in the home and outside the home.
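For readers wondering what sits behind a regression table of this kind, a stripped-down version of the exercise is a regression of happiness ratings on activity dummies. The sketch below uses invented data and a plain OLS model; the published study works with much richer models on repeated observations per person.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented responses: happiness ratings tagged with an activity, mimicking
# the structure behind activity 'rankings'. Real analyses use repeated
# observations per person and many control variables.
df = pd.DataFrame({
    "happy": [78, 82, 65, 60, 70, 55, 74, 58],
    "activity": ["theatre", "theatre", "reading", "reading",
                 "tv", "chores", "tv", "chores"],
})

# Each coefficient estimates an activity's association with happiness
# relative to the baseline category (here, alphabetically first: 'chores').
model = smf.ols("happy ~ C(activity)", data=df).fit()
print(model.params)
```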

Interestingly, the app’s inventors co-authored an academic article for the journal Global Environmental Change. Using the same data, they found that outdoor activities were better for well-being. They state:

[T]he predicted happiness of a person who is outdoors (+2.32), birdwatching (+4.32) with friends (+4.38), in heathland (+2.71), on a hot (+5.13) and sunny (+0.46) Sunday early afternoon (+4.30) is approximately 26 scale points (or 1.2 standard deviations) higher than that of someone who is commuting (−2.03), on his or her own, in a city, in a vehicle, on a cold, grey, early weekday morning. Equivalently, this is a difference of about the same size as between being ill in bed (−19.65) vs doing physical exercise (+6.51), keeping all other factors the same. (MacKerron and Mourato 2013, 997)

The numbers in brackets refer to the 'scale points', showing the increase in probable happiness according to where people are, what day of the week it is and what time of day it is. Interestingly, the greener the space you are in and the hotter the day, the better (though sunniness matters less than you might expect). While this may appear to be common sense in one way, when you think back to how policy relies on evidence to improve well-being, what are the policy messages here from an investment point of view?
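The 'approximately 26 scale points' can be checked by simple arithmetic on the quoted coefficients:

```python
# Adding up the coefficients quoted in the passage above reproduces the
# 'approximately 26 scale points': 23.62 for the birdwatching scenario,
# minus -2.03 for commuting, gives 25.65.
outdoors = [2.32, 4.32, 4.38, 2.71, 5.13, 0.46, 4.30]
commuting = -2.03

difference = sum(outdoors) - commuting
print(round(difference, 2))       # 25.65, i.e. roughly 26

# The comparison the authors draw: ill in bed vs physical exercise.
print(round(6.51 - (-19.65), 2))  # 26.16, about the same size
```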

I had this app for a while and my results always told me that I was happiest in a pub beer garden with my best friends. Did I know that the data I was ploughing in, when the app beeped at me to do so, were potentially going to be used to inform policy-making? Well, yes, of course, I guessed that, because I was researching well-being data and policy, which was why I downloaded the app in the first place. But did most people who were interested in how they felt doing certain things imagine the contexts of their data's potential future use?

What policy decisions should be made about beer gardens off the back of my interactions with some sliders on a mobile phone app after a few ciders on a summer's day? While these data were collected at a scale that means my personal data and my interactions are no longer visible at an individual level, the example does pose questions for some of the correlations we make with these data. Are people happier on a weekend because they are not working or because they can go to the pub?

5.7 Conclusion

Despite the conflicting evidence from different approaches to 'Big Data', people are keen to find new ways to harness them to answer the age-old policy and philosophy questions around people's well-being. The increase in well-being research coincides with an increase in research with and on Big Data. Both have possibilities and challenges, but could these be exacerbated by combining well-being research with these data practices? Do Big Data have a capacity for good when making decisions about young people's exam grades or whether someone is eligible for social housing? We reflected in this chapter on some important examples of where this went awry.

New methods and metrics using Big Data, and indeed the research going into developing new tools to harness them, are not necessarily being checked for rigour before the approach is used elsewhere, as was the case with the Twitter community study and its use of the sentiment metrics. Generalising people's happiness based on mobile phone data has its limitations. We cannot be entirely sure whether it is the aesthetic grandeur of an old Victorian bandstand in the park, whether there is a classical concert inside, whether you had enough sleep, whether you are picnicking with your favourite friends, with your kids or having time away from your kids, or indeed whether you are stuck on a delayed Tube underneath the park or walking in a hailstorm, that truly adds to (or detracts from) your momentary happiness.

The ethics of studying Big Data more broadly should be considered, as should the behaviours of those outside the sample of wearable tech or smartphone users, especially as these people may be older or poorer, for example, characteristics which we know intersect with well-being in very significant ways. Despite this, claims are still made that findings from these studies could be used to inform policy and investment. While they can offer some insights, we must be mindful of their limits, and crucially of their implications, especially in different contexts.

All in all, Big Data and new technologies, whilst not always revolutionary in kind, can offer insights into well-being that are useful for policy-makers on a national scale, in international pandemics and for people who simply want to see what other people think. But they are not without their limits, nor are they a magic bullet for the issues we have with existing data. If anything, they have the potential to exacerbate existing problems as much as to investigate solutions.

The capacity of Big Data to embrace complexity, and at greater speed, means they present new opportunities to analyse health data and, crucially, how health intersects with social concerns. Reflecting back from today on how crude the Google Flu Trends analysis of 2013 now seems, it is clear that Big Data technologies and techniques are improving at pace. The COVID-19 example, BlueDot, shows that the value of Big Data analyses now lies in their capacity to cope with more of Big Data's qualities at the same time, and in fact to harness them: their messiness, variability and size, and the capacity to link previously unconnected data sources, from farming information to flight sales. The value was in the variety of data and sources used. Yet harnessing the power of Big Data was not powerful enough to prevent a worldwide crisis, despite the grand claims.

What we think of as ‘Big Data’ offer a peculiar perspective on ‘well-being’. Consider the different things they capture, from sleep patterns to elite cycle trails to facial recognition and how many steps your walk to the post office takes. These devices exist to capture and produce data because data can be useful and commercialised. We are not even clear on whether more knowledge of the self is good for well-being or bad (yet?), let alone whether it is good at scale: that governments (and who else) know more about us. What is clear is that data are producing and changing culture and society, as much as they are capturing it.

We need to ask questions around the commercial value of these data practices alongside social justice issues. How would these data have had a greater chance of improving well-being were the contexts in which they were analysed different? Who should be included in these discussions, and who is excluded? Ultimately, how will decisions and trade-offs be made between the commercial and social justice dimensions?