1 Introduction

In the context of our continuously progressing digital age, the field of data analytics has assumed a crucial role in comprehending societal behaviors, market trends, and developing patterns [1]. The emergence of online search engines like Google Search and analytical tools like Google Trends has proven to be highly beneficial in this era [2, 3], providing information on countless subjects to users and insight into the shared awareness and interests of individuals on a global scale. The immense volume of data produced daily from searches on Google offers an unparalleled prospect for researchers, marketers, and analysts to uncover patterns, comprehend preferences, and decipher the shared inquisitiveness of our era [4, 5]. Notably, Google Trends and Google Search serve distinct purposes in the realm of research. Google Search assists users in locating specific information by generating relevant results based on their search query [6]. On the other hand, Google Trends offers aggregated insights into users’ Google Search behavior, providing data on the popularity of search words and subjects that enable individuals to gauge public interest levels over time [3].

Both Google Trends and Google Search have the potential to act as novel tools for environmental science research beyond traditional applications. As a reflection of human activities and interests, search engine data can be analyzed to draw conclusions regarding individuals’ interactions with the world around them. This is particularly salient against a backdrop of growing concern for topics like anthropogenically induced events [7, 8]. These insights can prove invaluable for researchers interested in studying how human behaviors and society as a whole impact and interact with the environment on a larger scale, as search engines provide unprecedented amounts of data on individuals’ behaviors. While individually these behaviors might look chaotic and scattered, collectively they demonstrate strong patterns and “trends” that can be used to derive comprehensive understandings of human behaviors and the environment that were not previously feasible. Studies have demonstrated the validity of this connection, finding a positive association between traditional sources like Gallup Polls and Google Trends data surrounding environmental issues [9]. Mavragani, et al. [10] also explored over 100 studies that utilized Google Trends data, showcasing how Trends and additional web-based sources were used for correlation testing, modeling, examining seasonality, and forecasting events. This is possible because, in the digital era, individuals often turn to online search engines in response to outside stimuli. Analyzing Google Trends and Search data pertaining to environmental events allows us to examine the relationships between collective human behaviors and environmental outcomes. Previous research has explored this idea in a wide variety of settings, ranging from pollen concentrations [11] to energy consumption [12].

Despite the widespread recognition of these tools for their affordability, efficiency, and dependability, their application in research has raised numerous concerns, primarily stemming from inconsistencies in the data they yield over time. For instance, Franzén [13] observed an inconsistency over a span of eleven months following the original search of the phrase “Jacob Scharf”. Such inconsistency in the use of Google Trends has also been noted by Nuti, et al. [14], as variations across studies revealed a lack of standardized protocols within scholarly publications. The discontinuation of Google Flu Trends in 2015, coupled with findings like these, points to a general wariness across disciplines, with some authors going so far as to suggest that researchers abstain from utilizing Google Trends.

However, the total avoidance of Google Trends advised by Franzén [13] was subsequently critically reviewed by scholars, and new conclusions were provided, such as by Raubenheimer [15]. That report explained the anomaly noted in Franzén [13], showing that such variations result from explicable sampling and scaling tendencies acknowledged by Google Trends. These processes increase the variability of results for words or phrases with low popularity and/or in places with a small population. It seems the richness of such “Big Data” might be prematurely dismissed because of a lack of understanding or a failure to find proper analytical tools.

To overcome this challenge, Raubenheimer [15] recommended the use of the median value from multiple aggregate samples for searches in small populations. This approach aims to control for the variation native to the Google Trends platform and has been used alongside other techniques, such as comparing results to benchmark queries, as in Qin and Peng [9], and using monthly averages drawn from weekly values to control for unusual spikes, as in Thompson, Wilby, Matthews and Murphy [7]. The emergence of these techniques for capitalizing on search engine data signals sustained interest in the subject. In this study, we explore Google Search results as a potential supplementary dataset that can validate and build upon other search engine tools like Google Trends, using Google Trends and Google Search data surrounding the Flint Water Crisis in Flint, Michigan as an example. We aim to answer two primary questions through this investigation. First, do Google Search result data align with data trends captured by Google Trends values? Second, do socially sensed data collected through Google provide unique insights into environmental crises that might be valuable to policymakers?
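The two smoothing techniques mentioned above can be illustrated with a minimal sketch. It assumes that repeated extractions of the same query are available as equally long lists and that weekly values are keyed by ISO date; both input layouts and the function names are hypothetical choices for illustration, not formats or procedures taken from the cited works.

```python
from statistics import median


def median_series(samples):
    """Combine repeated extractions of one query via the per-period median.

    `samples` is a list of equally long lists, one per extraction
    (a hypothetical layout chosen for this sketch).
    """
    if not samples:
        raise ValueError("need at least one sample")
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all samples must cover the same periods")
    # zip(*samples) groups the values that belong to the same period.
    return [median(period_values) for period_values in zip(*samples)]


def monthly_means(weekly):
    """Average weekly values into monthly values.

    `weekly` maps ISO dates ('YYYY-MM-DD') to index values; the first
    seven characters of each date identify the month.
    """
    buckets = {}
    for day, value in weekly.items():
        buckets.setdefault(day[:7], []).append(value)
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}
```

Taking the median rather than the mean keeps a single unusual extraction from dominating the combined series, which is the property that matters for low-volume queries.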

Following this introductory section, we briefly discuss the Flint Water Crisis and detail our methodology of extracting and analyzing the Google Trends and Google Search data. After presenting the analytical results in the third section, we discuss these results in detail in the fourth section. The study concludes with a summary and perspective into future studies that promote more innovative utilization of big data and big data analytics in environmental studies.

2 Case description

2.1 The Flint Water Crisis: background

The Flint Water Crisis was a significant event in recent American history. The crisis began in 2014 when the municipality transitioned its water source from Lake Huron to the Flint River, with the intention of reducing expenses. The absence of appropriate treatment measures resulted in the corrosion of the deteriorating pipes, leading to the leaching of lead into the potable water supply [16]. Consequently, this caused a substantial number of inhabitants, particularly minors, to be subjected to heightened levels of lead in the blood [17]. The crisis has brought to light inherent deficiencies in governance, environmental regulation, and public health, particularly because of the delayed response to the crisis.

2.2 Potential online presence of the crisis

To obtain a comprehensive understanding of the societal implications and consequences of such a crisis, it is imperative to delve into the patterns and dynamics of social behavior during the crisis. Previously, such endeavors often involved intensive and passive surveys, which could take months to accomplish and might present less-than-ideal results if the surveyed population was unaware of what was going on or unwilling to cooperate. It is against this backdrop that we consider the role of big data, and in particular, socially sensed big data. Big data has become increasingly integrated into research over the years, with particular attention to environmental research [18, 19, 20]. This interest stems in part from the theoretical argument that big data’s high-density nature uniquely positions researchers to extract true, unfiltered behaviors which might not otherwise be visible [21]. While there are arguments challenging the representativeness of big data [22, 23], there is nonetheless evidence that big data present unique avenues for exploring human behavior [24]. Simultaneously, there have been calls for further cross-collaboration in the big data space. Sebestyen et al. have encouraged a systems thinking approach to big data in climate change research, acknowledging the overlaps and synergies between fields such as environmental science and social science, from which both stand to benefit [25]. One theoretical development aligning with this concept is the emergence of social sensing as a form of remotely sensed data. In essence, this approach reframes big data generated by human behavior as a form of socially sensed data, akin to datasets such as remote sensing imagery and stationary air monitoring data [26, 27]. In this way, socially sensed big data such as social media and internet search data bring human behavior patterns into the field of environmental and earth sciences. It is our intention for this study to build upon these theoretical frameworks, leveraging the interconnected nature of big data and environmental research to reflect human behaviors within the context of nature-based events.

Further, the Health Belief Model, a theoretical framework attributed to Rosenstock [28], can help ground this study's relevance and highlight opportunities for policymakers. This model explains individuals' health-determining behaviors through five primary factors: perceived susceptibility, perceived severity, cues to action, perceived benefits, and perceived barriers [29]. In essence, an individual's actions to care for their health are heavily influenced by their perception of potential risks and benefits, as impacted by internal and external stimuli. This framework informs our data collection and presents opportunities for public health officials in the context of our study, especially in the era of big data in which such perceptions are quickly captured by social sensing platforms.

Given that both Google datasets are only available in an aggregated state, the Health Belief Model provides a method to unpack collective and individual user behaviors. Regarding perceived susceptibility and severity, we anticipate that Flint community members will query Google as their perceived risk increases. Each community member has their own understanding (or lack thereof) of lead's presence and hazardous nature. Their perceived susceptibility and severity may be influenced by various cues to action, including exposure to visibly abnormal water, news reports, and neighbors' experiences. As these perceived risks change and awareness grows, we expect an increase in Google searches as individuals seek to educate themselves. By including keywords beyond those directly related to lead, we aim to capture these educational interests in their most basic form.

When considering the potential impact of utilizing these data in a public health context, we focus on the cues to action, perceived benefits, and perceived barriers to behavior change. During an event like the Flint Water Crisis, individuals' perceptions about the event and its health impact can be influenced by cues to action. While these cues can take many forms, policymakers and public health officials have an opportunity to introduce positive cues around subjects of apparent interest. This is crucial when community members are weighing the benefits and barriers of taking health-impacting actions. For instance, if public officials notice increased search interest in water quality-related subjects without immediately recognizing the specific problem, they could disseminate information on how to report issues, where to obtain free test kits, or signs of common water quality problems. These steps serve the dual purpose of preparing officials for potential hazards and shaping community members' understanding of the benefits and barriers to protecting their health.

The importance of effective communication is further emphasized by Coombs' Situational Crisis Communication Theory (SCCT) [30]. This theory outlines how organizations should communicate during crises based on the type of event [31]. In the Flint Water Crisis, the slow response likely classifies it as a preventable crisis, where public officials were responsible due to negligence or intentional actions. The initial communication response was defensive (denial and diminishment), shifting to attempts at taking responsibility (rebuilding) once the crisis's true nature became clear. Consequently, the reputation of involved officials and organizations was severely damaged.

This loss of reputation and trust in local government is harmful to both community members and public officials. In future crises, people may be less likely to respond to government actions, potentially worsening risks and further eroding legitimacy. However, using social sensing big data, such as Google searches, offers an opportunity to craft more effective communication strategies. For instance, if an organization initially adopts a denial approach during an environmental crisis, a spike in relevant search terms could indicate the public's need for more information. This insight allows for a timely shift to a more effective communication strategy, mitigating reputational damage and reducing the risk of misinformation dominating the narrative.

It is against these backdrops that Google Trends and Google Search present unique opportunities. Online searches made by an individual are typically voluntary, provoked by something important to the user, and happen rather frequently, resulting in recognizable patterns at a collective level. The key is the collective pattern of certain searches, which can reveal ongoing events that might not be known to responsible governmental agencies, as was the case of the flu trend in the US that led to the development of the Google Trends tool [3]. Careful examination and utilization of these data might yield significant insights pertaining to public awareness, reactions, and mobilization throughout the entire period of the crisis, right from the very beginning, hence providing a chance for early intervention. Through the examination of online activity, it may be possible to assess the degree of public apprehension, diffusion of information, and engagement in activism. Additionally, achieving a thorough, reliable understanding of public behavior throughout this crisis presents an opportunity to detect similar future crises faster than traditional data collection methods would allow.

In this exploration, we examine both Google Trends and Google Search data and how they can be used to provide warnings of potential environmental disasters. If left unchecked, or if the response is delayed because responsible parties fail to take necessary actions, these disasters may have profound impacts on society.

2.3 Types of Google data

In our study, we examine both Google Trends data and Google Search data. Google Trends showcases the frequency with which a query was searched on Google over time for the requested geographic region. The values reported through the Trends tool are based on a representative sample of search requests, normalized by the total number of searches in the geography and time frame selected, and scaled from 0 (no searches) to 100 (most searches). Google Search data, on the other hand, refers to the number of webpages or “hits” that are found when searching a query on Google. While the Search data might not reflect individual users’ responses to an ongoing event as directly as the Trends data, the number of “hits” found on the Internet for a relevant search topic can provide insights into the depth and breadth of content available on that topic, indicating the size of the knowledge base relevant to the ongoing event. Moreover, analyzing Google Search data alongside Trends data can help identify discrepancies between the interest shown by users (as indicated by search query frequencies) and the actual availability of information or discussions related to those queries online. Considering the different representations of these two types of data, we hypothesize that while the frequency data from Google Trends has obvious peaks and valleys pertaining to the ongoing event, the knowledge base relevant to the event will keep growing even after the search frequency has peaked. This is sometimes called the “memory of the Internet”, as discussed in Kirmayer, et al. [32]: our “past indiscretions, childish mistakes, and other errancies can come back to haunt us endlessly.” However, from a data analysis standpoint, such “haunting” also serves a crucial function for policymakers, governmental agencies, and other stakeholders. It allows them to learn from past errors and, more importantly, to heighten alertness to potential future disasters and intervene more swiftly and effectively at the initial signs of trouble.
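The relative scaling described for Google Trends can be made concrete with a small sketch. It assumes hypothetical raw counts for a term and for all searches in each period; Google does not publish such counts and its sampling step is omitted here, so this only illustrates the normalize-then-scale logic described in the text, not Google's actual pipeline.

```python
def trends_style_index(term_counts, total_counts):
    """Scale per-period counts to a 0-100 index relative to the peak share.

    Both arguments are equally long lists of hypothetical raw counts:
    searches for the term and all searches in the same place and period.
    """
    if len(term_counts) != len(total_counts):
        raise ValueError("series must cover the same periods")
    # Share of all searches that were for the term in each period.
    shares = [
        term / total if total else 0.0
        for term, total in zip(term_counts, total_counts)
    ]
    peak = max(shares, default=0.0)
    if peak == 0:
        return [0 for _ in shares]
    # The period with the largest share becomes 100; others scale to it.
    return [round(100 * share / peak) for share in shares]
```

With `trends_style_index([10, 50, 20], [1000, 1000, 4000])`, the third period has more raw searches for the term than the first but a lower index, because overall search volume grew faster. This is why Trends values express relative interest rather than absolute search counts.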

2.4 Data collection: key terms

For this investigation, data were collected through keyword searches. Keywords were selected for their direct relevance to the Flint Water Crisis and for their expected potential to yield insight into public awareness and concern during the crisis, even when the public was not yet fully alert to it. Such signals could serve as an early warning, enabling public health officials to take necessary actions and prevent further worsening of the environmental disaster. The chosen keywords include “Legionnaires’ Disease,” “lead,” “bottled water,” “pneumonia,” and “water filters.” It is worth noting here that these terms were chosen without considering detailed contextual information, due to the limitations of the Google Trends tool and the extensive nature of Google Search results. The pre-processing and aggregation conducted before the data is shared through the Google Trends platform are designed to protect the privacy of individuals from whom the data is collected. While this ensures privacy, it also limits our ability as researchers to delve deeper into the context, sentiment, or identity of the individuals represented in the data. Despite this limitation, we aim to capitalize on the nature of big data, capturing a high density of data in an attempt to collect information on an average snapshot of sentiments and identities from various contexts. Our intention is for this wide array to provide a reflection of the general interest of Google Searches.

The choice of “Legionnaires’ Disease” as a keyword for this study is particularly significant in the context of the Flint Water Crisis. Flint, Michigan, experienced a serious public health crisis following a decision to switch the city’s water supply source to the Flint River, which led to widespread water contamination issues. One of the most alarming consequences was a notable outbreak of Legionnaires’ Disease, a severe form of pneumonia caused by Legionella bacteria, which thrives in contaminated water systems. This outbreak was directly linked to the water crisis. Between 2014 and 2015, Genesee County, where Flint is located, reported approximately 90 cases of Legionnaires’ Disease, resulting in at least 12 deaths and 79 people becoming ill [17]. This was a significant spike from previous years, indicating a strong correlation with the change in Flint’s water source. In this context, analyzing the search trends for “Legionnaires’ Disease” provides critical insights into public awareness and concern regarding this specific health issue as it unfolded. The keyword serves as a proxy for gauging the community’s response to the health risks posed by the contaminated water. The heightened search frequency during the crisis period would likely reflect increased public concern, information-seeking behavior, and awareness about the symptoms and prevention of the disease. Furthermore, tracking the search trends for “Legionnaires’ Disease” in relation to Flint can highlight how public interest in health-related issues evolves in response to a major environmental crisis. This can provide valuable lessons for public health communication and response strategies in similar situations. Therefore, the inclusion of this keyword is not only relevant but also critical for understanding the broader societal impact of the Flint Water Crisis.

“Lead” is the primary keyword and is pivotal due to the central role of lead contamination in the Flint Water Crisis. This crisis was primarily characterized by high levels of lead in Flint’s water supply, leading to widespread public health concerns. To capture the broad spectrum of public interest and concern related to this issue and considering the multiple meanings of the word “lead,” the term “lead” was uniquely approached in our methodology. Unlike other terms, “lead” was analyzed in Google Trends as the topic “Chemical Element” rather than as a standard query. This distinction is crucial for our analysis because topics in Google Trends encompass a wider array of search parameters compared to queries. By selecting “lead” as a topic, our analysis could capture not just direct searches but also a diverse range of related inquiries. This includes common misspellings, searches in different languages, and closely associated terms such as “lead paint” and the chemical symbol “Pb.” This comprehensive approach ensures that our data reflects a more accurate and encompassing picture of public interest and concern about lead contamination, going beyond the limitations of exact-match queries. This search choice of the keyword “lead” allows us to delve deeper into how the Flint community and the broader public responded to the lead contamination issue. It helps in understanding not just the awareness levels, but also the various dimensions in which the public sought information about lead, its effects, and related safety measures during the crisis.

“Bottled water” was a key term included in our analysis, as it directly reflects the public’s immediate response to the water crisis. The surge in searches for bottled water indicates the community’s shift towards alternative, safe water sources amidst growing concerns about the safety of tap water. This term provides insight into the behavioral adaptations of the Flint population, highlighting the extent to which the crisis disrupted daily life and forced reliance on bottled water for drinking, cooking, and other household uses.

“Pneumonia” was another significant term included in our study. Similar to “Legionnaires’ Disease,” its relevance stems from the potential health implications of the water crisis, particularly as Flint witnessed an increase in respiratory illnesses, which could be linked to waterborne pathogens. By analyzing search trends for pneumonia, we aimed to understand the community’s level of concern about respiratory health issues which may have been caused or exacerbated by the crisis. This term also stands to expand upon the data captured by “Legionnaires’ Disease”, potentially including individuals who had contracted the disease but were professionally or self-diagnosed with another respiratory illness. Notably, this extension may also capture insight into individuals with particularly high levels of vulnerability, such as those without health insurance or lacking the resources to visit a doctor. This term also aids in comprehensively examining the public’s perception and awareness of the broader health impacts of the crisis.

“Water filters” was another critical term included to understand the public’s proactive measures in response to the crisis. The increased interest in water filters signifies the community’s efforts to find immediate, in-home solutions to mitigate the risk of contaminated water. We theorize that this search term may be provoked by direct or indirect experiences with poor water quality in the home. This term provides insights into the public’s adoption of protective measures and their trust in such technologies to ensure water safety.

The terms “bottled water,” “pneumonia,” and “water filters” were chosen after several rounds of deliberate discussion regarding which keywords might be most relevant to and indicative of a water crisis such as the one in Flint, Michigan. Our purpose is to provide a multifaceted view of the public’s response to the Flint Water Crisis. Each term sheds light on a different aspect, from immediate behavioral changes and health concerns to the adoption of protective measures, and together they contribute to a comprehensive understanding of the public’s engagement and concerns during this critical period.

2.5 Data collection: Google Trends data extraction

Data from Google Trends was extracted to analyze the frequency and pattern of public interest over time. The process involved sequentially (1) entering each keyword into the Google Trends search bar; (2) setting the geographical scale to the United States, Michigan, and specifically the Flint Saginaw-Bay City Region, to capture data at national, state, and local levels; (3) applying a time filter from 2004 to 2022, encompassing periods before, during, and after the peak of the crisis. This longitudinal approach allows for a comprehensive understanding of how public interest evolved. The data output from Google Trends was automatically standardized on a scale of 0–100 by default, reflecting the relative search volume for each term. The applied filters and timeframe are critical in highlighting variations in public interest and concern, directly attributable to the unfolding events of the Flint Water Crisis.

2.6 Data collection: Google search data extraction

Google Search data was collected to complement the insights gained from Google Trends. This process involved: (1) searching each term independently on Google, followed by combined searches with “Michigan” and “Flint Michigan”, to gauge both the general and localized online engagement; (2) utilizing Google’s “Tools” function to restrict search results to individual months starting from January 2004, aligning with the timeframe used in Google Trends; and (3) recording the estimated number of search results (or “hits”) displayed below the search bar for each query. This step-by-step approach ensures that the search data is temporally aligned with the Google Trends data, providing a comparative view of search volumes and patterns over time. The use of the “Tools” function for time restriction provides a month-by-month breakdown of search interest, crucial for understanding the dynamics of public engagement during different phases of the crisis.
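The collection grid described above can be enumerated with a short sketch, which may help in checking that no term, qualifier, or month is skipped during manual recording. The end year of 2022 is an assumption taken from the Google Trends timeframe, and the function and constant names are hypothetical; the sketch only lists the queries and date windows and does not contact Google.

```python
import calendar
from datetime import date

TERMS = ["Legionnaires' Disease", "lead", "bottled water", "pneumonia", "water filters"]
# An empty suffix stands for the term searched on its own.
SUFFIXES = ["", "Michigan", "Flint Michigan"]


def month_windows(start_year, end_year):
    """Yield (first_day, last_day) for every month in the inclusive year range."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            last_day = calendar.monthrange(year, month)[1]
            yield date(year, month, 1), date(year, month, last_day)


def search_plan(terms=TERMS, suffixes=SUFFIXES, start_year=2004, end_year=2022):
    """List every (query, first_day, last_day) combination to record."""
    windows = list(month_windows(start_year, end_year))
    return [
        (f"{term} {suffix}".strip(), first, last)
        for term in terms
        for suffix in suffixes
        for first, last in windows
    ]
```

Under these assumptions the plan contains 5 terms × 3 query forms × 228 months, or 3,420 hit counts to record, which gives a sense of the manual effort involved.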

2.7 Data normalization and processing for comparison and visualization

The normalization of raw data from Google Search was a critical step to ensure comparability with Google Trends data, especially during visualization analysis. As noted above, Google Trends standardizes search data on a scale from 0 to 100 based on a query’s popularity relative to the total number of searches conducted in the chosen geographic area and time frame. To align the Google Search data with this format, we employed a normalization technique in which each term’s raw hit counts were divided by the maximum count observed for that term across the study period. This maximum count serves as a benchmark, representing the point of highest relative interest or concern.

The resulting value was then multiplied by 100, effectively scaling each term’s hits on a 0–100 scale. This normalization makes the data from both sources directly comparable at the same scale, enabling us to visualize and analyze trends and patterns across the two datasets coherently. By adopting this approach, we could accurately reflect the relative popularity and attention (search hits) each term received over time, akin to the representation in Google Trends.
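The normalization described above, dividing each monthly count by the term's maximum and multiplying by 100, can be written as a minimal sketch. The function name and the handling of an all-zero series are assumptions made for illustration.

```python
def normalize_hits(hits):
    """Rescale raw monthly hit counts to 0-100 relative to the series maximum."""
    peak = max(hits, default=0)
    if peak <= 0:
        # No hits at all: return zeros rather than dividing by zero.
        return [0.0 for _ in hits]
    return [100 * h / peak for h in hits]
```

For example, `normalize_hits([250, 1000, 500])` yields `[25.0, 100.0, 50.0]`. Because the maximum is taken per term, values are comparable over time within a term but not across terms.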

To address variations in data collection, particularly the differences in formats and reporting between Google Search and Google Trends, we took several steps. First, both datasets were aligned to the same monthly time intervals. This was crucial to maintain consistency in temporal analysis, allowing for an accurate comparison of trends over the same periods. Second, regular checks were conducted to ensure the data from both sources remained consistent over time. Any anomalies or inconsistencies were investigated and rectified to maintain the integrity of the data set.
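One way to carry out the monthly alignment and consistency check is sketched below. It assumes each dataset is held as a mapping from a `'YYYY-MM'` key to a value; that layout and the function name are hypothetical and are not a description of the tooling actually used.

```python
def align_monthly(trends, hits):
    """Join two {'YYYY-MM': value} mappings on their shared months.

    Returns (rows, missing): rows are (month, trends_value, hits_value)
    tuples sorted by month, and missing lists the months that appear in
    only one of the two sources and therefore need investigation.
    """
    shared = sorted(trends.keys() & hits.keys())
    missing = sorted(trends.keys() ^ hits.keys())
    rows = [(month, trends[month], hits[month]) for month in shared]
    return rows, missing
```

Reporting the unmatched months separately, rather than silently dropping them, makes gaps in either source visible before any comparison is drawn.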

By implementing these measures, we ensured that our comparative analysis between Google Search and Google Trends data was robust, consistent, and reliable, thereby strengthening the validity of our findings. This approach underscores the methodological rigor applied in normalizing and processing the data to yield meaningful insights into public engagement and concerns during the Flint Water Crisis.

2.8 Data analysis

The primary analysis in the current study combines visualizations of Google Trends and Google Search data to see how well the data could have provided a warning before the disaster became uncontrollable, albeit in hindsight, and how the incident interacts with the background information. We employed a mixed-method approach that combined quantitative and qualitative analyses to provide a comprehensive understanding of public interest and response to the Flint Water Crisis.

Several assumptions were made during this analysis. First, we presumed that an increase in searches for a certain term indicated heightened public interest or concern about that issue. For instance, a search for the term “Legionnaires’ Disease” suggests an interest in this lung disease, which is closely related to contamination in the water transport system. An abnormal rise in searches for such a term should immediately raise an alarm that something unusual has happened. However, it is important to note that search behavior can be influenced by various factors, including media coverage and social discourse, which may not always directly reflect individual concern or awareness. These factors should also be taken into consideration when interpreting the trends to avoid overgeneralizations.
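The idea that an abnormal rise should raise an alarm can be expressed as a simple rule. The sketch below flags a month when its value exceeds the mean of the preceding months by a chosen number of standard deviations. The trailing window of 12 months and the threshold of 3 are illustrative assumptions; the analysis in this study relies on visual inspection rather than on this rule.

```python
from statistics import mean, stdev


def flag_spikes(values, window=12, threshold=3.0):
    """Return indexes whose value is unusually high versus the trailing window.

    A value is flagged when it exceeds the trailing mean by more than
    `threshold` standard deviations. Both parameters are illustrative.
    """
    if window < 2:
        raise ValueError("window must cover at least two periods")
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any rise above it counts as a spike.
            is_spike = values[i] > mu
        else:
            is_spike = values[i] > mu + threshold * sigma
        if is_spike:
            flagged.append(i)
    return flagged
```

Because the baseline is trailing, a month that follows a spike is compared against a window that already contains the spike, so the rule tends to flag the onset of a surge rather than its continuation. As the caveat above indicates, a flagged month signals a need for investigation, not a confirmed hazard.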

The interpretation of trends was conducted with a focus on how they relate to the unfolding events of the Flint Water Crisis. For instance, spikes in searches for “bottled water” were cross-referenced with reports of water quality issues, while increases in “Legionnaires’ Disease” searches were analyzed in the context of reported outbreaks. This approach enabled us to draw connections between public search behavior and key developments in the crisis.

By employing this mixed-method approach, combining robust visualization analysis with contextual qualitative review, our study offers a nuanced understanding of how the public engaged with and responded to the Flint Water Crisis through online searches. This methodology not only provides a clear picture of public interest and concerns but also highlights the potential of integrating different data sources and analytical approaches in environmental and public health research.

3 Discussion and evaluation

3.1 Overview of Google Trends data

Upon analyzing the data from Google Trends, a clear correlation with the timeline of the Flint Water Crisis was observed. Notably, search frequencies for the queries “water filters”, “pneumonia”, and “bottled water” escalated during the early stages of the crisis, particularly from late 2014 to early 2015. This trend was evident across all geographic resolutions, with the most pronounced spikes observed in the Flint Saginaw-Bay City Region (Fig. 1a–c). A similar pattern was observed for the topic “lead” and the query “Legionnaires’ Disease”, which registered significant increases following the emergency declarations around January 2016 (Fig. 1d, e).

Fig. 1 Google Trends Data. a Google Trends Data for Water Filters. b Google Trends Data for Bottled Water. c Google Trends Data for Pneumonia. d Google Trends Data for Lead [element]. e Google Trends Data for Legionnaires’ Disease

3.2 Google search results analysis

In contrast, the Google Search data exhibited a general upward trend, with a notable increase starting in 2013 (Fig. 2a–e). This trend is likely influenced by the dynamic nature of web content creation and removal over time. However, specific trends emerged that align with key events of the crisis. For example, Google Search results for “water filters” in the context of “Michigan” and “Flint Michigan” surged around mid to late 2016 (Fig. 2a). Additionally, from February to May 2016, results for “bottled water Flint Michigan” maintained consistently high values compared to prior months (Fig. 2b). Results for “lead Flint Michigan” also showed a significant rise in December 2014, followed by fluctuating values in subsequent years (Fig. 2d), suggesting a surge in public interest and online discussion of the crisis. All of these increases can be traced back to the period around the start of the Flint Water Crisis, providing a clear time stamp for when something happened. Interestingly, the search query for “pneumonia” did not demonstrate a clear trend (Fig. 2c), suggesting a less direct online public association with the water crisis, as pneumonia can have many causes other than water contamination.

Fig. 2

Google Search Data. a Google Search Data for Water Filters. b Google Search Data for Bottled Water. c Google Search Data for Pneumonia. d Google Search Data for Lead [element]. e Google Search Data for Legionnaires’ Disease

3.3 Comparative insights

The Google Trends data demonstrated fluctuations month to month prior to the crisis, with all three geographic resolutions showing their greatest spikes in the months leading up to the emergency declaration (Fig. 1a–e). In contrast, the Google Search data exhibited less variability prior to the crisis but showed more significant variations in the months following with a clear increase in the number of “hits” for each term (Fig. 2a–e). This divergence in Google Search data is likely the result of two things: an escalation in public interest as the crisis gained media attention and a concurrent increase in the availability of relevant webpages as online content creators responded to the growing demand for information. This synergy between Google Trends and Google Search data underscores the interconnected nature of public sentiment and digital information ecosystems. It highlights how search engine data can serve as a real-time barometer of public interest, particularly in times of crisis, offering valuable insights for policymakers, health authorities, and researchers in understanding and responding to public concerns. This is precisely what this exploratory investigation intends to argue and promote.
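As an illustration only (this is not the procedure used in the present study, which relied on visual comparison of the figures), the kind of side-by-side check described above might be sketched in a few lines of Python. The sketch rescales a series of Google Search hit counts so that its maximum is 100, matching the relative scale on which Google Trends reports interest, and then asks whether the highest-valued months of the two series coincide. The month labels, the values, and the function names `rescale_0_100` and `top_months` are hypothetical and chosen purely for illustration.

```python
def rescale_0_100(values):
    """Rescale a series so that its largest value becomes 100."""
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]
    return [100.0 * v / peak for v in values]


def top_months(months, values, k=2):
    """Return the set of the k month labels with the largest values."""
    ranked = sorted(zip(values, months), reverse=True)
    return {month for _, month in ranked[:k]}


# Hypothetical monthly values, invented for illustration only.
months = ["2015-10", "2015-11", "2015-12", "2016-01", "2016-02", "2016-03"]
trends_interest = [12, 15, 30, 100, 64, 40]          # already on a 0-100 scale
search_hits = [1200, 1300, 2100, 5200, 6100, 4800]   # raw result counts

search_rescaled = rescale_0_100(search_hits)

# Months that rank among the top two in BOTH series.
shared_peaks = (top_months(months, trends_interest)
                & top_months(months, search_rescaled))
print(sorted(shared_peaks))
```

Rescaling only puts the two series on a common axis for plotting; it does not change which months rank highest, so the overlap check would give the same answer on the raw counts.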

3.4 Focused analysis on “Legionnaires’ Disease”

Particular attention was given to the term “Legionnaires’ Disease” (Figs. 1e and 2e) due to its direct linkage to the water crisis. In January 2015, studies identified a connection between a majority of Legionnaires’ cases and the decrease in chlorine levels in Flint’s water, a consequence of the switch in water sources [17]. Although this linkage was not immediately established by public health officials or recognized in formal studies, our analysis indicates that search engine activity had already begun to draw a potential connection between the water crisis and the disease, preempting formal scientific findings.

Turning again to the Google Trends data, we observe fluctuations month to month prior to the beginning of the crisis, but all three geographic resolutions see their greatest spikes in the months leading up to the emergency declaration. Google Search data, on the other hand, shows less noise prior to the crisis but much greater variation in the months that follow, alongside a general increasing trend, agreeing with our hypothesis that the water crisis triggered an increased presence of relevant content in the online knowledgebase. We theorize that this subsequent variation is likely the result of both an increase in public interest in the topic and a broader uptick in the presence of webpages (that are still online today). In either case, the Trends and Search datasets appear to be in sync in their spikes during the crisis, indicating a synergy amongst the noise. The continued increasing trend of the Google Search hits, on the other hand, suggests the event triggered a heightened content creation and dissemination effort, with more individuals, organizations, and media outlets publishing information, analysis, and updates about the water crisis and related topics. This proliferation of content likely reflects an effort to address the surge in public concern, answer questions, and provide solutions or coping strategies. The enduring presence of these webpages points to a sustained interest in the topic, possibly fueled by ongoing developments, policy changes, and community responses related to the crisis. It also indicates the role of digital platforms in facilitating a rapid and wide-reaching information exchange during times of public need, suggesting a growing need to integrate online content analysis and monitoring into public safety, policy, and management decision-making processes.

3.5 Evaluation

The current analysis of Google Trends and Google Search data and exploration of how these types of data correspond to a well-documented public health crisis offers valuable insights into incorporating big data in environmental crisis surveillance and response strategies. Our incorporation and analysis of this type of “big data” does not involve intensive data mining or machine learning algorithms, as is often deemed necessary by practitioners and policymakers. Such algorithms risk presenting these data as complex and inaccessible, resulting in a reluctance to adopt them in action plans. Instead, our approach demonstrates the potential for leveraging search query data to identify emerging health issues, track the spread of diseases, and gauge public interest and concern in real time. By integrating relatively straightforward big data analytics into environmental monitoring, assessment, crisis management and other public decision-making processes, authorities can enhance early warning systems, tailor public communications, and allocate resources more efficiently, ultimately improving health outcomes and crisis management.

Google Trends states on its website that the tool “provides access to a largely unfiltered sample of actual search requests made to Google.” The public platform can classify its search data for users based on a requested time frame (from as early as 2004 to as late as 3 days before a Google Trends search is made) in conjunction with specifically requested region(s), including by country, state, major city, or worldwide. Because of Google’s global reach and public use, Google Trends is in general a well-received analytical tool for public health-related research. Our exploration of data produced during the Flint Water Crisis echoes this sentiment, further advocating for the opportunities presented by Google Trends in research and offering Google Search data as a potential corroborating measure. The increasing content generated after the crisis, as indicated by the Google Search hits, provides a different view of the public’s engagement and level of concern regarding such a severe environmental crisis, even after the peak of the crisis has passed. From our analysis and the results, we can draw a few interesting conclusions.

In general, our investigation of searches for terms relating to the crisis appeared to follow the event’s timeline, suggesting that these relatively simple tools and readily available data sources could be invaluable for detecting, flagging, and even curbing public health crises in their early stages. We focused on the search term “Legionnaires’ Disease” to exemplify this in greater detail and found that, despite the noise, the overlaps in spikes with the water crisis demonstrated a general synergy. In an ideal world, where these approaches were available to government officials prior to the events of the Flint Water Crisis, it is likely that search engine data would have provided early warnings and opportunities for a faster response to the emerging catastrophe than what actually occurred. Given the widespread availability and growing integration of social sensing big data today, researchers and government officials now have a responsibility to prepare for future disasters.

Broadly, public health officials stand to benefit significantly by leveraging insights generated from big data from sources like internet search engines. The relationships identified in this study demonstrate the potential of big data for two primary purposes: hazard detection and resource dissemination. Given the synchronicity found between Google Trends, Google Search, and the crisis timeline in this study, public health officials can use measures derived from internet search engine data as an early warning mechanism for future events. For example, community detection tools for future lead outbreaks can be developed around the near-real-time search data generated by Google Trends and Google Search. In the event that key queries see exponential spikes, a team of public health officials can be deployed to conduct further testing, in turn expediting traditional disaster response.
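To make the early warning idea concrete, the following is a minimal sketch of one possible spike rule, offered under stated assumptions and not as a method applied in this study. It flags any month whose value exceeds the mean of the preceding months by more than a chosen number of standard deviations. The window length, the multiplier `k`, the function name `flag_spikes`, and the sample values are all hypothetical, and a simple mean-plus-deviation rule is substituted here for any more formal test of exponential growth.

```python
from statistics import mean, stdev


def flag_spikes(series, window=6, k=3.0):
    """Return the indices whose value exceeds the trailing baseline.

    The baseline for index i is the `window` values immediately before it;
    the value is flagged when it is greater than mean + k * stdev of that
    baseline. If the baseline is perfectly flat (stdev of zero), any value
    above the baseline mean is flagged.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        threshold = mean(baseline) + k * stdev(baseline)
        if series[i] > threshold:
            flagged.append(i)
    return flagged


# Hypothetical monthly relative search interest, invented for illustration.
monthly_interest = [10, 12, 11, 9, 10, 11, 60, 12]
print(flag_spikes(monthly_interest))
```

In this sketch only the seventh value (index 6) stands out against its six-month baseline. A flag of this kind would be a prompt for follow-up testing rather than evidence of a hazard in itself, and the threshold would need tuning against secondary data sources.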

Similarly, when a catastrophe is in effect, insights drawn from internet search data can identify current community needs and interests. In the case that a lead outbreak is already known, public health officials can leverage Google Trends data to identify the most prevalent questions being posed by the community, as well as the presence of answers available online. In this way, officials can tailor educational resources and public interventions to community needs, addressing informational gaps to maximize the effectiveness of relief efforts.

There are several further considerations which must be addressed when exploring the potential impacts of leveraging social sensing big data in environmental crises. First, it is important to note the limitations of these conclusions imposed by the context within which the search terms are being used. We turn once again to our query for “Legionnaires’ Disease” to explore both the pros and cons of the tools and data sources. Legionnaires’ Disease is caused by Legionella bacteria, which often appear in untreated water, and is characterized by symptoms similar to other types of pneumonia [33]. As a result, Legionnaires’ Disease does not relate directly to lead poisoning (the primary issue during this crisis), but rather stems from poor water quality, which is the result of the changed water supply source. Additionally, the disease’s similarities to other types of pneumonia mean that it may not be recognized immediately by healthcare professionals or patients as related to the water system, as is evident by the uptick in searches appearing after the outbreaks’ public acknowledgment. These characteristics make for an interesting search term to investigate, as the rise in public awareness is evident, but interest quickly tapers off, while webpage presence maintains a relatively high level compared to months prior. This presents a “chicken-or-the-egg” question as to whether public interest as gauged by search engine queries results in webpage presence or vice versa. On the one hand, it is reasonable to assume that a rise in public interest on a topic, as conveyed through a spike in searches, would likely lead to an increase in the number of webpages on that subject as we hypothesized. In other words, when people become interested in something, it stands to reason that internet content creators (bloggers, news outlets, online retailers, social media users, etc.) will produce works that address that topic, growing the public knowledgebase. 
On the other hand, an expansion of this knowledgebase, signaled by a rise in the number of webpages discussing a topic, would similarly lead to greater public interest. If websites are frequently discussing a subject, users might be driven to search online in order to learn more about it.
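One hedged way to probe this question of direction is a lagged correlation, in which one series is shifted against the other by a few months to see at which offset the two agree best. The sketch below is illustrative only and was not part of the present analysis; the series are invented, and the helper names `pearson`, `lagged_correlation`, and `best_lag` are hypothetical. Such a check can only suggest which series tends to move first and cannot establish causation.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(trends, search, lag):
    """Correlate trends[t] with search[t + lag].

    A positive lag pairs each Trends value with a later Search value, so a
    high correlation at lag > 0 means Search tends to follow Trends.
    """
    if len(trends) != len(search):
        raise ValueError("series must have the same length")
    if lag >= 0:
        xs, ys = trends[:len(trends) - lag], search[lag:]
    else:
        xs, ys = trends[-lag:], search[:len(search) + lag]
    return pearson(xs, ys)


def best_lag(trends, search, max_lag=2):
    """Return the lag in [-max_lag, max_lag] with the highest correlation."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: lagged_correlation(trends, search, lag))


# Invented example in which the Search series repeats Trends one month later.
trends = [0, 0, 10, 0, 0, 0, 3, 0]
search = [0, 0, 0, 10, 0, 0, 0, 3]
print(best_lag(trends, search))
```

With real data the best lag is rarely this clean, and short or noisy series can produce a misleading maximum, so any result would need to be read alongside the event timeline.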

In reality, the online presence of a topic is not binary and is almost always a dynamic interaction between various factors. However, this idea is notable because of the way these interactions vary by topic. Whereas we observed an overlap for Legionnaires’ Disease between Google Trends and Google Search, this does not suggest a universal synchronicity that might be exploited across disciplines. This is evident from our search term “pneumonia,” a primary symptom of Legionnaires’ Disease which fails to demonstrate the same level of overlap between Google Trends and Google Search. In short, we can logically deduce the relationship between Google Search, Google Trends, and public experience when examining a hyper-specific term like Legionnaires’ Disease but cannot confidently apply this same idea to other more general search terms (like “pneumonia”). For this reason, understanding the various implications of selected search terms, particularly in the sociocultural context of the selected timeframe, is vital to effectively leveraging Google Trends and Google Search data.

The level of specificity demanded by internet search engine datasets further poses a risk of misrepresenting human behavior and online interest. On one end, internet search engine data present opportunities for false narratives to take hold. Echoing the common phrase, “correlation does not imply causation”, similar trends in search engine data can convey the presence of a relationship that is not extant in reality. As with any statistical investigation, misconstruing these relationships can pose significant risks, as has been captured by big data’s integration into law enforcement [22]. When seeking to extract human behavior and interest for environmental crises, centering inaccurate relationships can result in the misallocation of funds, risking wasting resources and extending harm. Further, these relationships can be fabricated for malicious means, directing public ire towards populations or institutions which are not truly at fault. This is particularly important for historically marginalized communities who may be historical targets for these acts. On the other end of this conundrum, similarities in search frequency between two terms do not inherently validate assumptions of mutual impact. In other words, search frequency data is often not enough to initiate an official response to an emerging crisis. While our investigation shows a correlation between search frequency and the Flint Water Crisis, this relationship is not enough to concretely call for intervention should similar relationships emerge in the future. Indeed, our results show that not all search terms deemed relevant to the crisis strictly followed the event’s timeline, but rather pointed to a general correlation. To remedy this given the current availability of search engine data, researchers would be required to lean on secondary and tertiary measures to help further validate the presence of a harmful event.

Second, the above discussion suggests a broader implication for search term selection. In the case of our investigation, we are examining data over the course of a well-documented, city-wide event that occurred almost a decade ago. In other words, we are studying an event that has already occurred, has a highly detailed timeline, and impacted most of the population living in a location. These factors coalesce to make an ideal case for our investigation, albeit from hindsight. Our understanding of this event allowed us to choose a small number of relevant search terms and focus on a specific geography in a specific period of time, immediately reducing the potential levels of noise in our data. When seeking to apply data sources like Google Trends and Google Search in more recent (or even emerging) public health or environmental events, such as the recent COVID-19 global pandemic, researchers are rarely granted this same level of transparency. Even in cases when historical data is available for similar events, the way in which individuals utilize technologies like Google has changed dramatically over the years and will likely continue to do so unpredictably. Similarly, the identity of individuals interacting with platforms like Google is not made known through any Google platform. This complication introduces both equity considerations about the diversity of voices included in the analysis, and interpretability concerns given the makeup of the population from which samples are drawn. These considerations introduce a greater uncertainty into prediction models, even when relying on historical data for model training. As a result, researchers interested in utilizing these data in a preemptive manner, such as for the detection of emerging events, would be required to overcome the noise. This again risks reducing the effectiveness of these data as standalone tools, requiring the integration of additional variables to aid in the prediction process. 
A possible solution to such potential issues would be to establish constant monitoring mechanisms for key terms that are related to significant global crises, and to examine the trend data and search hits of these key terms frequently. The combination of these monitoring mechanisms could result in potential early detection prior to the crisis, hence enabling more efficient intervention to curb the crises or reduce their negative influences.

Third, we further highlight the importance of repeatability of research efforts, which must be addressed when seeking to utilize these types of data. As discussed previously, several researchers have called into question Google Trends’ consistency, replicability, and reliability for sound foundational research as the platform evolves through continuous internal updates and reengineering. Franzén [13] showed through a spontaneous experiment that Google Trends can demonstrate inconsistencies and inaccuracies in trend data over an eleven-month period. Although these concerns are particularly notable for areas with lower search volumes, they are nonetheless ever present when aiming to conduct thorough investigations. We further acknowledge that similar considerations arise when working with Google Search data. In some respects, Google Search data is produced from an even greater black box than Google Trends. Whereas Google Trends is a tool designed specifically for providing researchers with information that can be applied in studies, Google Search values drawn directly from the “Total Results” estimate are not validated by the organization in any clear manner. In fact, we were unable to find any documentation describing how the estimated number of results is calculated when searching for a query on Google. As such, it is possible these values reflect a filtered estimate of the total number of webpages appearing for a given search term. Results may also vary depending on factors such as internet speed, location, search history, or other variables, leading to inconsistencies.

Similar to techniques acknowledged in the Google Trends literature [15], these variations in Google Search results may be controlled through repeated sampling, allowing a mean or median value to be drawn from the recorded values. We attempted this in the current study and observed little to no variation in the search hits. Additional precautions might further limit these concerns, such as the use of a virtual private network and the clearing of search history to remove confounding variables. In any case, further research is warranted to understand the potential opportunities and pitfalls associated with Google Search data. Overall, we do not aim to discredit researchers who caution others against relying on data sources like Google Trends. Rather, we contend that it is important to understand that though Google Trends may have limitations as a standalone comprehensive research tool, it has proven advantages when paired with other datasets like Google Search for disaster detection and big data research.
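The repeated sampling idea can be summarized with a short sketch. It assumes the “Total Results” estimate for each query has been recorded by hand several times, and it reports the median together with the spread across repeats so that unstable queries can be spotted. The query labels, the counts, and the function name `summarize_repeats` are hypothetical; no Google interface is called, and the numbers are not values from this study.

```python
from statistics import median


def summarize_repeats(samples):
    """Summarize repeated hit-count readings for each query.

    `samples` maps a query string to a list of recorded counts. For each
    query the summary holds the median, the absolute range (max - min), and
    the range relative to the median.
    """
    summary = {}
    for query, counts in samples.items():
        mid = median(counts)
        spread = max(counts) - min(counts)
        summary[query] = {
            "median": mid,
            "range": spread,
            "relative_range": spread / mid if mid else 0.0,
        }
    return summary


# Hypothetical repeated readings, invented for illustration only.
repeated_counts = {
    "hypothetical query A": [15200, 15200, 15200],
    "hypothetical query B": [980, 1010, 1000, 990],
}

summary = summarize_repeats(repeated_counts)
for query, stats in summary.items():
    print(query, stats)
```

The median is used here because it is less sensitive than the mean to a single anomalous reading; where repeats barely differ, as was the case in this study, either statistic would give essentially the same value.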

4 Conclusion

In this brief investigation, we sought to facilitate an understanding of how simply checking Google Trends and Google Search data could reveal patterns of public interest, awareness, and information-seeking behavior in response to a specific event or crisis. By retrospectively examining the Flint Water Crisis environmental incident, we highlighted the value of these datasets in shedding light on the public’s response. More importantly, we demonstrated the potential of these tools in quickly gauging public sentiment and concerns, allowing policymakers, researchers, and public health officials to tailor their communications and interventions more effectively.

To this end, we charted the presence of five terms relating to the crisis in both Google Trends and Google Search. Generally, amongst the noise of the data, we found a positive, logical association between the presence of four of these terms and the crisis timeline. In other words, when examining the selected keywords, the presence of webpages reflected by Google Search and the number of user searches estimated by Google Trends spiked during key moments in the crisis in a way that was discernible from the periods immediately before or after. This relationship is indicative of two noteworthy opportunities, framed by the two research questions posed in this investigation. First, we aimed to discern whether Google Search and Google Trends data display covarying trends. We contend our results show that Google Search data behaves in a synergistic manner with Google Trends data, presenting an opportunity to validate the two against one another. Second, we set out to find whether or not Google-derived data may serve as an insightful social sensing platform during environmental catastrophes. Once again, our findings suggest that Google Trends and Google Search data may serve as insightful data sources for human behavior during environmental events.

However, we similarly discovered that these datasets are not without their limitations. We first noted that, despite the potential usefulness, Google Trends and Google Search should not be treated as perfect representations of human behavior. Instead, Google Trends provides aggregated information on users’ interaction with the Google Search platform and Google Search shows the presence of web pages created which relate to the query searched. The former demonstrates an interest from users in gathering information on a subject, while the latter is the collection of content on a subject. While these two datasets are likely to interact, it is unclear when or if the rise in one leads to the rise in another. In other words, we can make assumptions about why Google Search and Google Trends values rise and fall, but as with any human behavior proxy, there is no way for us to know for certain. These uncertainties are likely to introduce a greater level of variability to prediction models.

Additionally, we acknowledge the precautions which must be taken when seeking to leverage big data sources like Google Search and Google Trends. Specifically, the black box nature in which these datasets are generated, aggregated, and disseminated introduces complexities which must be considered. It is safe to assume that the Google platform was designed with ulterior motives beyond allowing users to gather information on a subject. As a result, these intentions are embedded in the functionalities and algorithms of the platform, influencing users’ interactions with Google and rewarding the creation of certain types of content over others. These characteristics impact the nature of the data collected through Google Trends and Google Search but are simultaneously opaque and unavoidable. Furthermore, these platforms and their intentions are subject to change, potentially compromising analyses across time in unexpected ways.

Despite these limitations, spikes and trends in Google searches can be key for infodemiology, or the use of Web-based information toward public health and policy, offering a valuable tool for identifying emerging public health threats, monitoring the spread of misinformation, and guiding timely interventions by health authorities. It is important to understand that Google Trends may be less detailed when working with less raw data, but the information it provides can be reformatted and standardized in a manner that nonetheless provides researchers with valuable information. Google Search, on the other hand, can provide more information than relative trends based on time, as search results show all web pages related to the search topic. Coupling Google Trends data with actual results from searches, such as Google Search data, can be key to enriching the data.

In our case, we aligned the Google Trends and Google Search data and found that Google Trends data reveals a public curiosity or sentiment toward a need for action, and Google Search data reveals a sort of reaction to that call to action from public entities and organizations. Our results encourage the use of Google Trends for research but advise against its use as a standalone tool. Google Search data is an excellent supplemental data source that may help to support future researchers’ efforts in developing mechanisms for accountability and collaboration in public health and environmental science initiatives.