Introduction

Approximately 5.3 billion people, or 66% of the world’s population, have access to the Internet (ITU, 2023) and live in an era in which information has crossed temporal and spatial boundaries, allowing the world to remain connected with the aid of an increasingly intense flow of information (Valcanis, 2011). Therefore, we can infer that the Internet has become an important component of contemporary society, with great impacts in the most diverse areas that offer opportunities for new research insights (Jarić et al., 2020).

Widespread access to the Internet has allowed the search and storage of information in various digital media, forming large sets of data known as digital corpora (Correia et al., 2021; Leetaru, 2011; Michel et al., 2011). In turn, corpora are collections of items, including Internet pages, digitized books, and posts on social networks, that can be used to generate structured data, representing repositories of knowledge and/or evidence of different products of human culture (Correia et al., 2021).

This information offers opportunities for scientific investigations that contribute to the understanding of human behavior on a large scale, because it can reach individuals whom researchers would normally have greater difficulty accessing (Gosling et al., 2010; Hargittai, 2018). Research involving digital data, which reach different societies, can help reduce biases in human behavioral studies, since such studies are carried out mostly in Western, educated, industrialized, rich, and democratic (WEIRD) societies (Henrich et al., 2010).

WEIRD societies do not represent the cultural variety existing in the general population, which prevents broad generalizations about human behavior (Henrich et al., 2010). One way around this has been the creation of methods and tools to measure the intercultural psychological distance between different populations in WEIRD societies (Muthukrishna et al., 2020). This may allow broader generalizations to be drawn from samples of different cultures (Muthukrishna et al., 2020). However, until now, there has been no way to truly access all social and cultural varieties around the world. Culturomic tools can minimize this challenge (see, e.g., Bail, 2014) because this approach is dedicated to the collection and quantitative analysis of large corpora of digital data for the study of human culture (Michel et al., 2011).

From this perspective, we understand culture as any information that can be expressed through behavior and transmitted through teaching, language, and imitation, among other forms of cultural learning (Mesoudi, 2011; Richerson & Boyd, 2005). Virtual environments can be viewed as an extension of a user’s social or offline life (Correia et al., 2017); therefore, it should be possible to explore trends and patterns in them that reflect human behavior. Some studies have worked within this approach, even without using the classic definition of culturomics (see Ding & Luo, 2022; Oliveira & Albuquerque, 2021); these works show the great potential that digital corpora have in helping us understand several phenomena related to human behavior.

For example, location data from cell phones helped in understanding urban mobility patterns and their relationship with social and health dynamics (Hassan Zadeh et al., 2019), and social networks such as Facebook, Twitter, Instagram, and Sina Weibo can be used to identify psychological characteristics and traits known as the Big Five (Azucar et al., 2018). Local press news from around the world can predict sociopolitical issues in different countries (Leetaru, 2011). Finally, dating sites can help identify human preferences concerning the search for romantic partners (Bergström, 2018). Thus, culturomic tools have great potential to increase the range of possibilities for the investigation of human behavior.

This opinion essay presents how culturomics has grown as an approach for understanding human behavior and describes its main tools. Based on the presented scenarios and cited examples, we define culturomics of human behavior (CHB) as the approach that seeks to understand, explain, and predict human behavior from digital corpora.

Brief History of Culturomics

The term culturomics was first used by Michel et al. (2011) to describe the analysis of cultural variation in sets of digitized texts written between 1800 and 2000, seeking to investigate lexicography, grammar evolution, and adoption of technologies, among other aspects. After this study, further efforts were made, and a new edition of this text corpus (Google Books Ngram Corpus) was released, covering 6% of all books published (Lin et al., 2012). As studies advanced, the objectives diversified to include measures of cultural complexity based on linguistics in a corpus of texts published over the years, highlighting the cumulative aspect of human culture (Juola, 2013). On the one hand, it was possible to identify cultural variation over time (see Gao et al., 2012; Petersen et al., 2012); on the other hand, it was possible to infer whether language is affected by political regimes (Caruana-Galizia, 2015) or social regimes (Bochkarev et al., 2014).

At that time, studies relied on the textual corpus gathered through the efforts of Michel et al. (2011) and Lin et al. (2012). However, before what we know today as culturomics was defined, some studies had already been dedicated to analyzing large sets of cell phone geolocation data to observe patterns of human mobility (González et al., 2008). Geolocation data from cell phones make it possible to identify patterns of movement and trips made by people and to predict how human behavior affects the dynamics of epidemics. This is because mobility is a crucial factor in the spread of diseases, so understanding it supports the response to these public health crises (Balcan et al., 2009; Song et al., 2010).

In the midst of this horizon of possibilities, conservation culturomics have emerged, which consist of analyzing digital data generated to provide new insights into human-nature interactions aimed at biodiversity conservation (Ladle et al., 2016). Conservation culturomics differ from iEcology because the latter studies ecological processes through online data (Jarić et al., 2020), while the former studies aspects of human culture and the human/nature relationship (Jarić et al., 2021). Conservation culturomics have been increasingly notable since their proposition, with the aim of investigating how the public interest can contribute to conservation through different approaches. These approaches include research on perceptions of national parks (Bhatt & Pickering, 2021) and people’s thinking about specific animal species (Pickering & Norman, 2020). As more approaches have emerged beyond the books gathered on Google Ngram Viewer, other data corpora have become the focus of interest. This includes posts on social networks such as Instagram (Kroetz et al., 2021), Facebook (Altay et al., 2022), Twitter (Bhatt & Pickering, 2021), news available on digital platforms (Cooper et al., 2019; Francis et al., 2019), and even the association of data from different platforms, such as social networks and searches on research sites such as Wikipedia (Fernández-Bellon & Kane, 2020).

Digital Corpora and Big Data

At the origin of culturomics, themes such as big data, web scraping, machine learning, and artificial intelligence were not very evident outside areas dedicated to information technology and computing. The vast majority of digital corpora are formed by voluminous datasets that grow rapidly and cannot be processed in the traditional way (Chen et al., 2014). Given the huge volume of data generated, fast, accurate, and efficient collection requires powerful tools, such as web scraping. Web scraping is the procedure of extracting data from the web in an automated way, without the need to manually copy or download them to a hard disk (Singrodia et al., 2019).
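To make the idea of automated extraction concrete, the following minimal sketch parses a small HTML fragment and collects the text of each post. The page markup, the class name "text", and the names PostTextParser and extract_posts are hypothetical and chosen only for illustration; a real study would download pages, respect each site's terms of use, and usually rely on a dedicated scraping library.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for illustration.
# A real study would download pages instead of using an inline string.
SAMPLE_HTML = """
<html><body>
  <div class="post"><p class="text">Wolves seen near the village</p></div>
  <div class="post"><p class="text">New park trail opens today</p></div>
  <div class="ad"><p>Buy now</p></div>
</body></html>
"""


class PostTextParser(HTMLParser):
    """Collect the text of every <p class="text"> element."""

    def __init__(self):
        super().__init__()
        self.in_text = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "text") in attrs:
            self.in_text = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_text = False

    def handle_data(self, data):
        if self.in_text:
            self.posts[-1] += data


def extract_posts(html):
    """Return the cleaned text of all posts found in an HTML string."""
    parser = PostTextParser()
    parser.feed(html)
    parser.close()
    return [post.strip() for post in parser.posts]


posts = extract_posts(SAMPLE_HTML)
print(posts)
```

The advertisement paragraph is skipped because it lacks the assumed class, which illustrates how scraping turns unstructured pages into a structured list of items.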

The culturomic approach can be further improved using machine learning (ML), which consists of sets of protocols that allow computers to automatically solve a class of tasks and continuously improve problem-solving based on performance measures (Janiesch et al., 2021; LeCun et al., 2015). Oliveira and Albuquerque (2021) used web scraping and machine learning to understand the dynamics behind the dissemination of messages with false information (fake news) on Twitter in the context of the COVID-19 pandemic. Heras-Pedrosa et al. (2020) used web scraping techniques to analyze communication in the field of public health during the COVID-19 pandemic and recorded, in real time, the emotions generated in the population through data from Twitter, YouTube, Instagram, official press sites, and Internet forums. This highlighted the potential of using multiple corpora for the same study.

In addition, advances in machine learning are important for the advancement of scientific practice in many areas. For example, in a study by Bae et al. (2021), ML was used to detect possible traces of schizophrenia in the posts on the Reddit forum aggregator. Chiong et al. (2021) used ML to track posts with depressive tendencies on social networks such as Facebook and Twitter.
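As an illustration of how ML can sort posts into categories, the sketch below trains a small multinomial naive Bayes classifier. This is not the method of any study cited above; the technique was chosen for its simplicity, and the training sentences, the labels "low_mood" and "other", and the function names are invented for the example. A toy model like this has no diagnostic value.

```python
import math
from collections import Counter, defaultdict

# Toy labeled posts, invented for illustration only.
TRAIN = [
    ("i feel hopeless and tired", "low_mood"),
    ("nothing matters i feel empty", "low_mood"),
    ("great day at the park with friends", "other"),
    ("excited about the concert tonight", "other"),
]


def tokenize(text):
    return text.lower().split()


def train(labeled_texts):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Laplace (add-one) smoothing avoids zero probabilities
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "i feel so empty"))
print(predict(model, "friends at the concert"))
```

In practice, such classifiers are trained on thousands of annotated posts and evaluated on held-out data before any conclusion is drawn.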

One of the most recent and prominent techniques among culturomic methodologies is natural language processing (NLP), a form of machine learning that uses artificial intelligence to allow computers to read and interpret information from texts (Arbieu et al., 2021; Thessen et al., 2012). This technique is increasingly being used to process, analyze, and monitor trends in large volumes of digital data, generating deep insights and reducing human work time. In a study by Arbieu et al. (2021), for example, this technique was applied to perform automatic analysis of emotions in the textual content of news publications about the reinsertion of wolves (Canis lupus) in the region of Saxony, eastern Germany. As the corpora used expanded to include social networks, online newspapers, and search engines, among others, these tools became more widely used because they allowed the exploration of these new sets of digital data.

Thus, these studies can be divided into two dimensions. The first concerns the content present in the corpus, examining changes in writing patterns, the frequency of specific terms, and the identification of motivations underlying their usage in a specific space–time context. This allowed for the detection of human cultural changes and trends through the quantitative analysis of words.
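The first dimension can be sketched as a relative-frequency count of a term per year. The mini-corpus and the function name relative_frequency are hypothetical; real analyses use corpora with millions of tokens and more careful tokenization.

```python
from collections import Counter

# Hypothetical mini-corpus of (year, text) pairs, invented for illustration.
CORPUS = [
    (1900, "virtue and honesty guide the honest man"),
    (1900, "the virtue of patience"),
    (2000, "the market and the network"),
    (2000, "honesty in the digital market"),
]


def relative_frequency(corpus, term):
    """Return {year: occurrences of term / total tokens in that year}."""
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


freq = relative_frequency(CORPUS, "virtue")
for year, value in freq.items():
    print(year, round(value, 3))
```

Dividing by the total number of tokens per year matters because corpora usually grow over time, so raw counts alone would overstate the salience of a term in recent years.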

The second dimension seeks to understand people’s engagement with elements of digital corpora, such as searches for a particular term on the Internet, views of videos and images, comments, likes, and shares. This dimension has been widely used in conservation culturomics. For example, Ladle et al. (2016) found that data from social network posts and searches for certain terms, which span both dimensions, can help identify unexplored conservation emblems and assess the cultural impact of conservation actions, such as the selection of an endemic animal as a mascot for a sporting event. The authors argue that through these data, it is possible to assess the quality of cultural ecosystem services and monitor how these services reach people (CICES, 2023).

Table 1 shows some texts about tools that can help researchers use corpora. Corpora are rich reservoirs of human culture, which can help us understand various scientific questions.

Table 1 Books on tools, applications, and practices for using digital corpora

Investigations on Human Behavior from Digital Corpora

Every day, people worldwide use the Internet to search, shop, and share part of their lives through social networks, making the Internet a significant component in various aspects of contemporary society (Mora-Rivera & García-Mora, 2021). If the content generated on the Internet is a reflection of everyday life (Correia et al., 2017), the data that come from this content can help us understand more complex phenomena. Although many studies that use culturomic methodologies do not focus on understanding human behavior, they provide clues about how culturomics can be useful for the several areas of knowledge dedicated to understanding it. In this section, we organize a synthesis of published works involving large digital corpora that can offer insights into this theme (Table 2).

Table 2 Examples of research that used digital corpora to assess human behavior

Therefore, we argue that several fields of knowledge seeking to investigate human behavior can take advantage of the potential demonstrated by culturomics. For example, studies have shown that a dataset built from world news would have made it possible to predict various political events, such as revolutions, periods of stability, and even decisions by state leaders (Leetaru, 2011). Moreover, evidence shows that such datasets can be an important foundation for understanding human preferences when cultural salience (the frequency of a given population characteristic) is used as a metric of visibility or interest (Correia et al., 2016). One example is assessing whether the body size and charisma of groups of species (amphibians, birds, mammals, and reptiles) influence their conservation (Berti et al., 2020).

Inferences regarding human mobility have also been made. González et al. (2008), for example, analyzed data on the trajectories of 100,000 anonymous cell phone users and observed that human trajectories have great temporal and spatial regularity. Studies such as this one help, for example, in urban planning and in the creation of strategies to address the spread of diseases. In this sense, the pattern of human mobility can be highly predictable; that is, people tend to go frequently to the same places and follow the same routes (Song et al., 2010). These results emphasize that the use of predictive models to understand urban mobility phenomena is not only possible but also accurate (Song et al., 2010).
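One common way to quantify this kind of regularity is through entropy: the more a person's visits concentrate on a few places, the lower the entropy and the easier the next location is to guess. The sketch below is an assumption about how such a measure might be computed, not the procedure of the cited studies; the location labels and function names are invented.

```python
import math
from collections import Counter


def random_entropy(visits):
    """log2 of the number of distinct places (all places equally likely)."""
    return math.log2(len(set(visits)))


def uncorrelated_entropy(visits):
    """Shannon entropy of visit frequencies, ignoring the order of visits."""
    n = len(visits)
    counts = Counter(visits).values()
    return -sum((c / n) * math.log2(c / n) for c in counts)


# Hypothetical sequence of location labels for one anonymous user.
visits = ["home", "work", "home", "work", "home", "gym", "home", "work"]

print(round(random_entropy(visits), 3))
print(round(uncorrelated_entropy(visits), 3))
```

Because this user spends most of the time at two places, the frequency-based entropy falls below the random-entropy ceiling. A fuller analysis would also account for the order of visits, which this sketch ignores.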

Efforts have also been made to understand how human behavior affects the dynamics of epidemics, mainly because human mobility is a crucial factor in the spread of diseases. Data from 29 countries worldwide were used for computational modeling of infectious diseases, opening the way for the development of the accurate models needed to describe and, consequently, cope with epidemics (Balcan et al., 2009). Another aspect is that online behavior may show a tendency analogous to herd behavior, becoming more collective in, for example, scenarios of risk to public health (Bentley et al., 2014). This makes online data an important tool for understanding human attitudes and actions during disease outbreaks.

Additionally, data from dating sites can be used to investigate certain aspects of choosing romantic partners, as most relationships that start online do not differ much from those formed in other contexts (Bergström, 2018). For example, men initiate contact more often than women, and preferences regarding the age of partners can vary between genders in different age groups (Bergström, 2018). Furthermore, dating sites can be valuable sources of data on how people behave during infectious disease outbreaks, when social isolation is recommended.

Recent studies have shown a change in sexual behavior during outbreaks of infectious diseases such as COVID-19: Chinese men and women aged between 18 and 45 years showed a decrease in the number of romantic partners and in sexual frequency during the pandemic (Li et al., 2020). Conversely, virtual contact with potential sexual partners may have increased in frequency during this period (Seitz et al., 2020). Additionally, during the COVID-19 pandemic, dating apps gained more than 1.5 million daily users (Ting & McLachlan, 2022). Data from these apps also helped to outline the main profile of users, showing that being young, being single, and having higher levels of stress were predictors of greater app use (Ting & McLachlan, 2022).

Several studies have focused on understanding the characteristics that define individuals or groups within a sociocultural context based on spheres of human behavior linked to gender (Seewann et al., 2022), age (Agbo-Ajala et al., 2022), and personality traits (Azucar et al., 2018). For example, Schwartz et al. (2013) used big data tools to recover messages posted on Facebook and verified whether the language used in the posts reflected the personality, gender, and age of interlocutors. Women tended to be more affectionate, whereas men were more objective and possessive; language changed with advancing age, for instance shifting from a more singular “I” communication to plural “we” questions; and the propensity to use certain words varied with the personality of the analyzed groups. For example, people with more outgoing personalities mentioned words related to greater sociability, such as “party,” “love you,” and “boys,” while more introverted people mentioned words related to more solitary activities, such as “computer,” “reading,” and “Internet” (Schwartz et al., 2013).

Political positioning has also been investigated. For example, Twitter data were used to show that the flow of political information on this network is controlled by a limited number of influencers (Casero-Ripollés, 2021). Facebook data were also analyzed to see how different political candidates communicated with civil society (Caton et al., 2015). These aspects are important to analyze as one of the ways of aggregating people and/or groups today is through their affinity with different political parties.

Since these parties are formed by individuals to represent their beliefs and values in a political scenario, we can infer that they reflect the personality characteristics, thoughts, and ideologies of their members (Jost et al., 2014; Cohen, 2003). For example, partisan inclination influences the adoption of sanitary measures during public health crises (Gollwitzer et al., 2020). Sex differences should also be considered in these studies: in the fight against COVID-19, for instance, evidence suggests that female leaders seek to minimize the impact of the virus, whereas male leaders implement risky short-term decisions to avoid harm to the economy (Luoto & Varella, 2021).

Several studies have investigated the relationship between digital media and human personality traits (Schwartz et al., 2013), as evidenced by Azucar et al. (2018). These authors showed that the way users interact on social media, including profile privacy, language, age, gender, comments, and likes, can reflect many personality traits, such as the positive link between extroversion and engagement on social media (Blackwell et al., 2017). Other studies have used tweets to understand which feelings are more prominent in environmental contexts hitherto unknown to the user, such as the COVID-19 pandemic, showing that fear was the most prominent feeling (Xue et al., 2020).

Another target of investigation was human morality, which is based on the salience of words related to moral (e.g., virtue, decency, and conscience) and virtuous (e.g., honesty, patience, and compassion) behaviors in digital corpora. With data from Google Ngram Viewer, Kesebir and Kesebir (2012) noticed a significant decline of these words in American books during the twentieth century, which for the authors would be linked to the disappearance of these concepts in public debate throughout the construction of modern history.

The way in which human beings relate to aspects of contagion and immunization against diseases can also be accessed and investigated through culturomics. For example, Young et al. (2014) used georeferencing of tweets related to HIV and the incidence maps of AIDS cases (https://aidsvu.org/), revealing a spatial correlation between publications and reported cases. Associations of this kind, between the occurrence of tweets and the occurrence of disease as reported by the United States Centers for Disease Control and Prevention (CDC), have also been observed for other infectious diseases, such as influenza (Broniatowski et al., 2013; Hassan Zadeh et al., 2019).
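A correlation of this kind between online activity and reported cases can be sketched with a Pearson coefficient computed across regions. The regional counts below are invented, and the function name pearson is illustrative; the cited studies may have used different statistics, normalization by population, and spatial models.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical counts for five regions, invented for illustration.
tweets_per_region = [12, 30, 45, 8, 27]
cases_per_region = [3, 9, 14, 2, 6]

r = pearson(tweets_per_region, cases_per_region)
print(round(r, 2))
```

A high coefficient indicates only that regions with more posts also report more cases; it says nothing about causation, and more populous regions tend to have more of both unless counts are expressed as rates.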

Besides monitoring where the public interest is concentrated, efforts have been made to assess whether it is possible to change social attitudes toward environmental crises. An analysis combining Twitter and Wikipedia data on engagement with and searches about environmental crises showed that people’s involvement was greater after they watched natural history films (Fernández-Bellon & Kane, 2020). Thus, certain digital resources can play an important role in creating connections with the natural world (Fernández-Bellon & Kane, 2020). Data from Google Trends were also used to compare awareness of climate change in certain countries with the actual risk of impacts, a comparison needed to identify countries where policies to face climate change must be improved or adapted (Archibald & Butt, 2018).

The conservation culturomics approach, which is gaining prominence and is discussed throughout this text, offers important perspectives for nature conservation, although it was not conceived as a specific discipline to study human behavior. This approach recognizes the role of the public interest as an ally for nature conservation actions, as mentioned in previous studies (Ladle et al., 2016; Nghiem et al., 2016; Ladle et al., 2019). However, it is important to emphasize that the behavioral factors that drive the adoption of pro-conservation behaviors have not yet been adequately investigated.

Limitations of the Culturomic Approach

Although the use of digital corpora is a possibility for human behavior research, data collection, analysis, and interpretation of results need to be done with caution due to several sources of bias (Griffin et al., 2020; Tufekci, 2014). For example, information may be salient in digital media, even without a greater demand from the community. This can occur for two reasons: (1) artificial, when using programs and/or transmission lists, such as bots (Liu, 2019) and spam (Wang et al., 2012), and (2) natural, when a human manually inflates certain information, such as crowdturfing (Wang et al., 2012) and fake accounts (Shen et al., 2014). All of these options end up overvaluing information that is not of interest to a group or society, which can create social problems in the offline world (Bovet & Makse, 2019; Cantarella et al., 2023).

Furthermore, the motivation for choosing the corpus is often neglected during the investigation, although some research has shown that socioeconomic profiles differ markedly between social networks. For example, most users of networks such as Snapchat, Instagram, and TikTok are young people aged between 18 and 29 years (Pew Research Center, 2022) and have higher levels of education (Hargittai, 2018). That is, when using these networks as digital corpora, caution is needed, especially when making broad generalizations. For example, Mislove et al. (2021) compared a sample of the US population with Twitter audiences based on socioeconomic factors (geography, race, and gender) and observed that Twitter audiences did not represent the region’s population. In addition to socioeconomic issues, it is important to consider the affinity of each platform with a certain type of content, because although many platforms allow the posting of text and photos, the public tends to prefer a specific type of media on each one (Di Minin et al., 2013, 2015).

Additionally, some researchers have noted that the use of big data must be combined with other methodologies at the data collection or analysis stage. These include the association of corpora, which can better predict some outcomes; data validation (e.g., through interviews) (Azucar et al., 2018); and checking for outliers within the sample (Griffin et al., 2020). Another recommendation is to examine the structure of the collected data, which sometimes does not allow for conventional analyses (for more details, see Dodds et al., 2011; Xue et al., 2020). For example, Koplenig (2017) pointed out statistical errors in the results of several articles that disregarded the temporal characteristics of the data when testing their hypotheses: observations that are close in time tend to be more similar than distant observations. Although these biases may seem to make culturomic research unfeasible, attending to the biases already identified can greatly reduce the risk of misinterpretation (Ruths & Pfeffer, 2014).
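The temporal dependence mentioned above can be checked before any hypothesis test. One simple diagnostic, assumed here for illustration rather than taken from the cited work, is the lag-1 autocorrelation of a series; first differencing (analyzing year-to-year changes instead of levels) is a common way to reduce trend-driven dependence. The series and function names below are invented.

```python
def lag1_autocorrelation(series):
    """Correlation between each observation and the one just before it."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum(
        (series[t] - mean) * (series[t - 1] - mean) for t in range(1, n)
    )
    denominator = sum((x - mean) ** 2 for x in series)
    # Note: a constant series has zero variance and would divide by zero.
    return numerator / denominator


def first_difference(series):
    """Replace levels with changes between consecutive observations."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical yearly frequency of a term (arbitrary units), with a trend.
yearly_frequency = [1, 3, 2, 5, 4, 7, 6, 9]

levels = lag1_autocorrelation(yearly_frequency)
changes = lag1_autocorrelation(first_difference(yearly_frequency))
print(round(levels, 3), round(changes, 3))
```

In this toy series the upward trend makes neighboring years resemble each other (positive autocorrelation in levels), while the differenced series no longer carries that trend. Two trending series can correlate strongly for this reason alone, which is why tests that assume independent observations can mislead.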

Conclusion

In short, the Internet has become a fundamental element of contemporary society, allowing the creation of large datasets that can be used to study and understand human behavior on a large scale. This information enables scientific investigations that reach audiences who are normally difficult to reach and provides research opportunities in several areas. The culturomic approach to human behavior seeks to understand, explain, and predict human behavior using these digital corpora. With the constant increase in the volume of data, powerful tools, such as web scraping, are needed to collect and process this information. Therefore, the use of digital corpora is a rapidly developing area of research offering opportunities for new insights in several fields.

CHB is an innovative approach aimed at analyzing large cultural datasets, particularly social media data, to understand human behavior on a global scale. Compared with cross-cultural psychology, this approach is broader and more quantitative, emphasizing large-scale data analysis. Cross-cultural psychology also explores the mind and behavior of individuals across different cultures, but its data are collected primarily from individuals through interviews (Broesch et al., 2020).

It is important to note that CHB is more akin to historical psychology than to cross-cultural psychology, as historical psychology also conducts large-scale textual analyses (Muthukrishna et al., 2021). However, we argue that CHB should be considered a distinct field that dialogues with other areas mentioned earlier, given the specific nature of the analyses and theories involved in data collection and analysis.

Therefore, we can conclude that CHB is a promising approach for understanding human behavior on a global scale. While it may share some similarities with other disciplines, it is a field with its own characteristics and methodologies that deserve to be studied independently.