In this section, we present the "geocrawler", which enables the collection of georeferenced posts from social networks. Furthermore, the datasets and use cases for the experiments are explained, and the time series and spatial distribution of Tweets are also shown.
Geocrawler
Natural disasters can occur unexpectedly, and their force makes them one of the most serious threats worldwide. In particular, earthquakes occur without warning, and hurricane tracks cannot be predicted accurately (Cox et al. 2013). To enhance analysis results and guarantee timely analysis, spatially and temporally relevant data must be collected as quickly as possible. We developed the geocrawler software to collect as much relevant social media data as possible within a reasonable time by requesting data from the application programming interfaces (APIs) of social media platforms. The programme can query data from multiple social media platforms such as Twitter, Flickr, YouTube or Foursquare. As the experiments focus on Twitter data, we only describe how we retrieve Tweets in this publication. Retrieval from other platforms works similarly, with adaptations specific to each social network's API.
Twitter provides two types of APIs to collect Tweets: REST and streaming (Developer 2020). The REST API offers various endpoints for using the functionalities of Twitter, such as the endpoint "search/tweets", which returns Tweets from the last seven days subject to certain limitations, such as a maximum of 450 requests per 15-min interval. These constraints make the collection process challenging and require an advanced strategy to cope with the fast-moving time window of the API and to harvest all offered Tweets with a minimal number of requests. In contrast, the streaming API provides a real-time data stream that can be filtered with multiple parameters. Our software focuses on natively georeferenced Tweets within an area of interest and uses both Twitter's REST and streaming APIs (Fig. 1).
Combining the REST and the streaming API makes crawling robust against interruptions or backend issues that would otherwise lead to missing data. For example, if data from the streaming API cannot be stored in time or an interruption occurs, the missing data can be retrieved via the REST API, which provides Tweets from the last seven days. We believe that this strategy lets us collect as much of the available data as possible. The software can also request data from other social media platforms such as YouTube, Flickr and Foursquare if the appropriate credentials for the respective network are specified.
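The interplay of the two APIs can be illustrated with a minimal sketch. It is not the geocrawler's actual implementation: the function names and the `max_silence` threshold are hypothetical, and only the seven-day search window and the limit of 450 requests per 15 min are taken from the text. The sketch derives the minimum spacing between REST requests, detects silent periods in the stream, and clips them to the part that the REST API can still deliver.

```python
from datetime import datetime, timedelta, timezone

# Limits quoted in the text, treated here as fixed constants.
REST_WINDOW = timedelta(days=7)    # REST search reaches back seven days
MAX_REQUESTS = 450                 # requests allowed per rate-limit interval
INTERVAL = timedelta(minutes=15)   # length of the rate-limit interval


def min_request_spacing():
    """Seconds to wait between REST requests to stay within the limit."""
    return INTERVAL.total_seconds() / MAX_REQUESTS


def find_gaps(timestamps, max_silence):
    """Return (start, end) pairs where consecutive streamed Tweets are
    more than `max_silence` apart (hypothetical interruption criterion)."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_silence]


def backfill_windows(gaps, now):
    """Clip each gap to the part still reachable through the REST window."""
    oldest = now - REST_WINDOW
    windows = []
    for start, end in gaps:
        if end <= oldest:
            continue  # gap lies entirely outside the seven-day window
        windows.append((max(start, oldest), end))
    return windows


if __name__ == "__main__":
    t0 = datetime(2016, 8, 24, 3, 0, tzinfo=timezone.utc)
    stamps = [t0 + timedelta(minutes=m) for m in (0, 1, 30, 31)]
    gaps = find_gaps(stamps, max_silence=timedelta(minutes=10))
    print(min_request_spacing())
    print(backfill_windows(gaps, now=t0 + timedelta(hours=1)))
```

In sparsely populated areas, a silent period does not necessarily indicate an interruption, so a threshold of this kind would have to be chosen per area of interest.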
In the context of disaster management, the geocrawler is started when a natural disaster occurs and requests data for the area of interest without additional configuration. The geocrawler can also be used for monitoring: the streaming API compensates for the REST API restrictions, which do not allow all regions to be queried sufficiently with only one developer account.
The geocrawler was used to collect the data for the use cases presented in this paper. The use cases represent different disaster types that did not occur in the same year or in the same region.
Emergency management service information products
The Emergency Management Service (EMS) provides information for emergency response related to different types of disasters as well as for prevention, preparedness, response and recovery activities. It consists of two main components: rapid mapping and early warning. The mapping component has worldwide coverage and provides maps based on satellite imagery. Authorised users can send a Service Request Form directly to the European Response Coordination Centre to trigger the mapping service and request reference, delineation or grading maps. For the targeted use cases, mapping coverage is limited to the areas requested by authorised users. For this study, the following EMS layers were used:
- Areas of interest (AOIs): An area defined by the user, which guides and limits the production to specific areas considered by the user to be impacted by the event.
- Event layer: A layer with the flood traces or damage information observed in the impacted areas using satellite data.
Official authority datasets
Our social media analysis results are validated with official authority datasets that show the actual measured areas impacted by a natural disaster. Two authorities that provide open data for natural disasters are the United States Geological Survey (USGS) and the National Hurricane Center. USGS is a scientific institution and part of the United States Department of the Interior (DOI). It provides reliable scientific information to minimise loss of life and property due to natural disasters as well as for other purposes such as managing water, biological, energy and mineral resources. The National Hurricane Center is part of the National Centers for Environmental Prediction and focuses on hazardous tropical weather, which includes forecasting and analysing the path and impact of hurricanes.
Use cases
For big data analysis, publicly available Tweets are the most suitable social media data source as the majority of Tweets are shared publicly. For all use cases, we collected natively georeferenced Tweets that are located within the respective area of interest.
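Selecting natively georeferenced Tweets within an area of interest can be sketched as follows. This is an illustration under stated assumptions rather than the geocrawler's code: the Tweet records are assumed to be dictionaries with a GeoJSON-style `coordinates` field holding a point in (longitude, latitude) order, and `BoundingBox`, `native_point` and `filter_to_aoi` are hypothetical names. The example box uses the values of the Amatrice use case.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Axis-aligned box in WGS 84 decimal degrees."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon, lat):
        return self.west <= lon <= self.east and self.south <= lat <= self.north


def native_point(tweet):
    """Return (lon, lat) if the record carries a native point, else None.

    Assumes a GeoJSON-style field: {"type": "Point", "coordinates": [lon, lat]}.
    """
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return None
    lon, lat = geo["coordinates"]
    return lon, lat


def filter_to_aoi(tweets, box):
    """Keep only Tweets whose native point lies inside the bounding box."""
    kept = []
    for tweet in tweets:
        point = native_point(tweet)
        if point is not None and box.contains(*point):
            kept.append(tweet)
    return kept


if __name__ == "__main__":
    amatrice_box = BoundingBox(west=11.7, south=41.7, east=14.7, north=43.8)
    sample = [
        {"coordinates": {"type": "Point", "coordinates": [13.29, 42.63]}},
        {"coordinates": {"type": "Point", "coordinates": [9.19, 45.46]}},
        {"coordinates": None},
    ]
    print(len(filter_to_aoi(sample, amatrice_box)))
```

The simple comparison in `contains` does not handle boxes that cross the antimeridian, which is not an issue for the areas of interest considered here.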
Amatrice earthquake
On 24 August 2016, an earthquake with a magnitude of 6.2 hit central Italy with the epicentre close to Accumoli (Geological Survey 2020a). The earthquake caused the death of more than 290 people and led to severe damage in the city of Amatrice and neighbouring cities (Abbott and Schiermeier 2016). We collected 48,992 georeferenced Tweets between 17 August and 9 September 2016 that were located within the bounding box with west, south, east and north bounds of [11.7°, 41.7°, 14.7°, 43.8°] in the World Geodetic System 1984 (WGS 84). The earthquake struck at 03:36:32 on 24 August 2016 and caused a peak in the time series of the number of Tweets per hour that coincides with the hour of the earthquake (Fig. 2). Such a sudden increase in social media activity has also been observed in other publications, e.g. Earle et al. (2012) and Resch et al. (2018).
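A time series of Tweets per hour of the kind described above can be derived by truncating each timestamp to the full hour and counting. The following is a minimal sketch; the function names are hypothetical and the timestamps in the usage example are synthetic, not taken from the dataset.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_hour(timestamps):
    """Count Tweets per hour; keys are datetimes truncated to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def peak_hour(counts):
    """Return (hour, count) for the hour with the most Tweets."""
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Synthetic timestamps for illustration only.
    stamps = [
        datetime(2016, 8, 24, 2, 50, tzinfo=timezone.utc),
        datetime(2016, 8, 24, 3, 37, tzinfo=timezone.utc),
        datetime(2016, 8, 24, 3, 41, tzinfo=timezone.utc),
        datetime(2016, 8, 24, 3, 59, tzinfo=timezone.utc),
        datetime(2016, 8, 24, 4, 5, tzinfo=timezone.utc),
    ]
    print(peak_hour(tweets_per_hour(stamps)))
```

Hours without any Tweets do not appear in the counter and would have to be filled with zeros before plotting a continuous series.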
Resch et al. (2018) analysed the 2014 Napa Valley earthquake. To evaluate the portability of the methodology to earthquakes in another region of the world, we analysed the earthquake in Amatrice, Italy. This case is especially interesting because the official language in Italy is Italian, whereas it is English in the USA. Therefore, the Amatrice earthquake can give us insights into whether the methodology works in another language, for another disaster event and on another continent.
We identified the languages used in the Tweet corpus with the Python package Polyglot (Al-Rfou 2015) and found that approximately 57.1% of the Tweets were written in English and approximately 15.4% in Italian. The language could not be determined automatically for 23.6% of the Tweets. Contrary to our expectation, English Tweets outnumber Italian Tweets, and the mixed distribution of languages poses a challenge.
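Language shares such as those reported above can be tallied with a few lines of code. The sketch below is deliberately independent of any particular detection library: `detect` is a hypothetical callable that returns a language code or `None`, and a wrapper around the language detector of choice would be passed in its place. Tweets for which detection fails or raises an error are counted as undetermined.

```python
from collections import Counter

UNDETERMINED = "undetermined"


def language_shares(texts, detect):
    """Return the percentage of texts per language code.

    `detect` is any callable mapping a text to a language code or None;
    failures are counted under UNDETERMINED.
    """
    counts = Counter()
    for text in texts:
        try:
            code = detect(text)
        except Exception:
            code = None  # treat detector errors as undetermined
        counts[code or UNDETERMINED] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: round(100.0 * n / total, 1) for code, n in counts.items()}


if __name__ == "__main__":
    # Stand-in detector for illustration; a real detector would replace it.
    lookup = {"hello": "en", "ciao": "it", "hi": "en"}
    print(language_shares(["hello", "ciao", "???", "hi"], lookup.get))
```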
Due to the comparatively large quantity of data, data points would strongly overlap and coalesce on a point map, making it impossible for the reader to draw meaningful conclusions. Spatial binning is a visualisation technique where data points are aggregated in shapes such as triangles, rectangles or hexagons. Hexagons have proven to be the most suitable shape to use (Battersby et al. 2017). Therefore, we aggregated Tweets in hexagon bins to show the spatial distribution of the Twitter datasets. Figure 3 shows the spatial distribution of Tweets for the Amatrice earthquake in the given period. Although the Amatrice earthquake was one of the most severe earthquakes in Italy, nearby urban centres show more activity than the area impacted by the natural disaster. Compared to other rural areas, however, the area around Amatrice shows more activity. Further analysis is needed, as unrelated Tweets should not be included in the results used for disaster management.
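Hexagonal binning can be sketched with axial hexagon coordinates and cube rounding, a common way to assign a point to the hexagon that contains it. The sketch is illustrative and not the procedure used to produce the figures: the function names are hypothetical, `size` is the centre-to-corner distance of a pointy-top hexagon, and the input points are assumed to be in a projected, planar coordinate system so that distances are meaningful.

```python
import math
from collections import Counter


def _cube_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error so that
    # the cube constraint q + r + s = 0 holds again.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def hex_cell(x, y, size):
    """Axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _cube_round(q, r)


def hex_center(q, r, size):
    """Centre coordinates of the hexagon with axial index (q, r)."""
    return size * math.sqrt(3) * (q + r / 2), size * 1.5 * r


def hex_bin(points, size):
    """Count points per hexagon; keys are axial (q, r) indices."""
    return Counter(hex_cell(x, y, size) for x, y in points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.1), (math.sqrt(3), 0.0)]
    print(hex_bin(pts, size=1.0))
```

The resulting counts per cell can then be mapped to a colour scale, with `hex_center` giving the position at which each hexagon is drawn.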
For the Amatrice earthquake, the EMS produced damage maps covering only a small part of the impacted area. As shown in Fig. 4, the focus of the EMS layer is on the area around Amatrice, and the red squares represent damaged buildings.
USGS provides a geospatial dataset of the peak ground acceleration (PGA) of the Amatrice earthquake. The most substantially impacted areas are close to the city of Amatrice. The PGA represents the intensity of the earthquake and is used for the comparison in this analysis because the footprint obtained by the hot spot analysis of social media should ideally resemble the PGA areas.
Hurricane Harvey
Hurricane Harvey made landfall on 25 August 2017, close to Houston, Texas, which is home to two million inhabitants. As Hurricane Harvey slowed down, heavy rainfall occurred in the area and caused severe damage, estimated at $125 billion, as well as 68 deaths (Geological Survey 2020b). We collected 135,723 georeferenced Tweets between 25 August and 7 September 2017 that were located within the bounding box with west, south, east and north bounds of [−98.8°, 27.3°, −90.4°, 31.1°] in the World Geodetic System 1984 (WGS 84). In contrast to earthquakes, hurricanes do not cause a sudden increase in Tweets because they are slow moving and can be monitored days before landfall (Fig. 5). By comparing the number of Tweets per hour in the week in which Hurricane Harvey occurred (blue) with the week before (light orange), we observe that the overall number of Tweets increased, but there is no visible peak.
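The visual comparison of two weeks can be complemented by two simple numbers, sketched below. Both measures are illustrative assumptions and not part of the analysis in this paper: `weekly_change` is the relative change in total volume between the two weeks, and `peak_ratio` relates the busiest hour to the median hour, so that a value close to 1 indicates a flat series and a large value indicates a pronounced peak.

```python
from statistics import median


def weekly_change(event_week, baseline_week):
    """Relative change in total Tweet volume (0.25 means 25% more Tweets)."""
    baseline_total = sum(baseline_week)
    if baseline_total == 0:
        raise ValueError("baseline week contains no Tweets")
    return sum(event_week) / baseline_total - 1.0


def peak_ratio(hourly_counts):
    """Largest hourly count divided by the median hourly count."""
    mid = median(hourly_counts)
    if mid == 0:
        raise ValueError("median hourly count is zero")
    return max(hourly_counts) / mid


if __name__ == "__main__":
    # Synthetic hourly counts for illustration only.
    baseline = [10, 12, 11, 9, 10, 12]
    event = [14, 15, 13, 14, 15, 16]
    print(weekly_change(event, baseline), peak_ratio(event))
```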
The spatial binning of georeferenced Tweets in the area impacted by Hurricane Harvey shows high activity in urban regions, especially in Austin, San Antonio, Houston and Baton Rouge (Fig. 6). While Austin, San Antonio and Baton Rouge were not affected by Hurricane Harvey, most of the city of Houston was flooded. This visualisation cannot be used by disaster managers to draw conclusions about which areas are impacted by a disaster, and therefore, the dataset must be further analysed.
The EMS was activated for specific regions in Texas to create flood outlines. As public authorities must request information for particular areas, some regions were not examined by EMS (Fig. 7).
In the case of Hurricane Harvey, USGS collected high water marks as indicative evidence for the extent of the flood inundation. A team of hydrologists and hydrologic technicians flagged and surveyed these marks on behalf of the US Federal Emergency Management Agency (FEMA), which coordinates the response to a disaster in the USA. Overall, they collected 2,123 georeferenced high water marks distributed along the coast of Texas and around Houston, USA. The high water marks are densely spaced and, for visual comparison, were aggregated in a polygon shown in Fig. 8.
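One simple way to aggregate point observations such as high water marks into a single polygon is a convex hull. The text does not state which aggregation method was used, so the following sketch is purely an assumption for illustration; it implements Andrew's monotone chain algorithm on planar coordinates, and the function name is hypothetical.

```python
def convex_hull(points):
    """Convex hull of 2D points (Andrew's monotone chain).

    Returns the hull vertices in counter-clockwise order, starting at the
    lowest-leftmost point. Collinear points on the boundary are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive if the turn o -> a -> b is counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    # Synthetic points for illustration only.
    marks = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
    print(convex_hull(marks))
```

A convex hull tends to overestimate the covered area when the points follow a curved coastline or river network; a concave hull or a buffer around the points would trace such shapes more closely.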
Hurricane Florence
On 14 September 2018, Hurricane Florence made landfall in North Carolina as a Category 1 hurricane. The main destruction took place in North and South Carolina, among other states, before Hurricane Florence weakened to a tropical storm. Overall, it caused up to $22 billion in damage, and 51 people died as a result of Hurricane Florence (Borter 2018). From 12 September until 19 September 2018, we collected 414,303 georeferenced Tweets that were located within the bounding box with west, south, east and north bounds of [−85.1°, 24.1°, −70.5°, 41.1°] in the World Geodetic System 1984 (WGS 84). Similar to Hurricane Harvey, there is no significant peak in the time series, as would be expected for this type of natural disaster. The time series of the two areas of interest are highly similar and only differ in magnitude (Figs. 9, 10).
In Fig. 11, the spatial Tweet distribution shows high activity in urban areas, especially in the area considered to be part of "Boswash" from Boston to Washington, D.C. (D. C. 1975). When we focus on the area where Hurricane Florence made landfall in Fig. 12, we again observe that urban areas have higher activity. Still, numerous places along the coast also show high activity. As in the case of the Amatrice earthquake, the data must be analysed further to provide useful information to disaster managers.
EMS was activated for several locations along the coast of North Carolina where Hurricane Florence made landfall. Although Hurricane Florence caused flooding in many AOIs, EMS was only able to delineate a small fraction of the impacted areas (Fig. 13).
In the case of Hurricane Florence, USGS collected 769 high water marks. The high water marks are densely clustered, and for comparison, they were aggregated in a polygon shown in Fig. 14.