Application of TAGGS
First, we applied TAGGS to the 55.1 million tweets in a historical dataset, running the algorithm as if the data were available in real time and shifting the scanning window by 6 h in each step. We first discuss the results obtained with baseline settings: a 24-h scanning window and a threshold of 0.2, meaning that all locations found in tweets scoring below the threshold (the “Voting and Assigning Locations” section) were discarded. The results for the baseline settings are summarized in Table 3. Next, we discuss the results of a sensitivity analysis for the threshold value and the size of the scanning window (the “Sensitivity Analysis” section).
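The real-time emulation described above can be sketched as a scanning window that advances in fixed steps over time-ordered tweets. This is a minimal illustration, not the actual TAGGS implementation; the tweet layout (a dict with a `time` field) and the function name are hypothetical.

```python
from datetime import datetime, timedelta

def sliding_windows(tweets, window=timedelta(hours=24), step=timedelta(hours=6)):
    """Yield (window_start, batch) pairs, where each batch contains the
    tweets falling inside a scanning window that advances by `step`,
    mimicking real-time operation on a historical dataset.
    Assumes `tweets` is sorted by its (hypothetical) `time` field."""
    if not tweets:
        return
    start = tweets[0]["time"]
    end = tweets[-1]["time"]
    while start <= end:
        stop = start + window
        batch = [t for t in tweets if start <= t["time"] < stop]
        yield start, batch
        start += step
```

With the baseline settings, each 24-h window of tweets would then be geoparsed as a unit before the window shifts forward by 6 h.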
Of the 55.1 million tweets, we found that 19.2 million mentioned at least one location, and 3.4 million tweets referenced multiple locations. In addition, when distinguishing between administrative levels, roughly half of the locations mentioned refer to a city, town, or village, while country and the lower administrative level locations each account for a quarter of the mentions.
To gain insight into the geoparsed tweets, those countries covered by the algorithm (Fig. 1) with a population of at least 10 million people were grouped according to economic development. For that purpose, we employed the income groups defined by the World Bank. For each group, the number of geoparsed tweets between August 2014 and December 2016 was plotted (Fig. 5) against the total flood losses over this period, as described in Munich Re’s NatCatSERVICE on a purchasing power parity (PPP) basis. This gives an impression of how Twitter reporting relates to flood impacts. The data made clear that in high-income (green) countries, there were about one to two orders of magnitude more tweets than in low-income (red) countries. The number of tweets in middle-income (blue and orange) countries fell between the other two groups, with a particularly large spread in the lower-middle-income (orange) countries. Notably, these numbers likely reflect a size effect, as Indonesia (IDN) and Pakistan (PAK), which had the highest number of tweets within the lower-middle-income group, also have large populations. However, the results underscored that relatively small countries, such as the Philippines (PHL) and Venezuela (VEN), generated a significant number of (geoparsed) flood tweets within their respective groups. These findings suggest that flood events, and not just the size of the population or the Twitter user base, are responsible for the high number of tweets during the investigated time period.
The plots also illustrate that in general, more flood tweets seemed to be linked to higher levels of flood damage over the study period, as the points roughly go from the bottom left-hand corner to the top right-hand corner of the diagrams. This relation is influenced by many other factors, including (but not limited to) variations in the extent of Twitter usage per country, language use per country, and keyword selection, and is therefore by no means strong enough to have any predictive power after regression analysis. That said, the existence of this relationship was in line with expectations. Namely, in countries that suffered from disastrous flood events that caused significant damage, a substantial number of tweets about flooding were generated. This illustrates that the algorithm seemed to be successful in capturing flood events around the globe.
Validation of TAGGS
To properly validate TAGGS, we defined a gold standard of manually tagged tweets. To the best of our knowledge, no other study provides a global dataset focusing on a specific event type. Therefore, we compiled a random dataset of 2785 flood-related tweets from two separate days and manually assigned locations to the tweets.
Dec 12, 2015: To check whether our model performs properly for small flood events in multiple languages, we selected a day on which multiple such events occurred across the globe, including in Indonesia, India, Kenya, Congo, Norway, the UK, Canada, and Paraguay (1282 tweets).
Dec 27, 2015: When the number of tweets that mentioned a specific location is higher, the probability of sufficient metadata being available is also higher. Therefore, we validated our algorithm on a date with multiple large events. On the date in question, several major floods received global news coverage, including floods in the USA, the UK, and Argentina (1503 tweets).
Each tweet can be labeled with one, multiple, or no locations at all. We recognized all mentions of locations on the different administrative levels that we apply the algorithm to (i.e., country, administrative subdivisions and cities, towns, and villages), including abbreviations, shorter versions, and slang, but excluded possessive pronouns (e.g., the Irish weather) and mentions of geographical features within towns and other geographical features, such as valleys and rivers. We do include location mentions when they are combined with other words (e.g., #leedsfloods) but exclude any information in the Twitter handles (e.g., @PakistanToday) because these locations are not necessarily related to the location of a possible event.
Using the manual approach, we found 2079 references to countries, administrative subdivisions, and cities, towns, and villages in 1497 of the 2785 tweets in our validation set. Then, we compared the manually labeled tweets to both the automated individual and the automated grouped geoparsing (TAGGS) approaches. For individual geoparsing, we used the location metadata but did not consider other tweets mentioning the same geographical entities, similar to Schulz et al. (2013).
Trade-Off between Recall and Precision
With geoparsing algorithms, there is a trade-off between the number of tweets that are parsed (recall) and the number of correctly parsed tweets (precision; Leidner 2007). Precision measures the number of correctly geoparsed tweets relative to the total number of geoparsed tweets. Hence, precision does not provide an indication of the total number of tweets within a location. Recall measures the number of correctly geoparsed tweets relative to the total number of tweets with a spatial reference. In essence, the greater the level of precision (i.e., the smaller the number of incorrect tags), the smaller the total number of geoparsed tweets. Conversely, if one wants to geoparse more tweets (higher recall), the number of errors within the geoparsed tweets (in terms of incorrect location assignments) will also increase (lower precision).
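These definitions can be made concrete with a small helper that scores predictions against a manually labeled gold standard. This is only an illustrative sketch; the tweet-level data layout (dicts mapping tweet id to an assigned location, with `None` for discarded tweets) is a hypothetical simplification of the validation procedure.

```python
def precision_recall(predicted, gold):
    """Compute tweet-level precision and recall for geoparsing.
    `predicted` maps tweet id -> assigned location, or None if the
    tweet was discarded (e.g., its score fell below the threshold).
    `gold` maps tweet id -> true location, for tweets that actually
    contain a spatial reference (the manually tagged gold standard)."""
    # Only tweets that received a location count as geoparsed
    geoparsed = {t: loc for t, loc in predicted.items() if loc is not None}
    correct = sum(1 for t, loc in geoparsed.items() if gold.get(t) == loc)
    precision = correct / len(geoparsed) if geoparsed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

Raising the threshold shrinks the set of geoparsed tweets (the denominator of precision), which tends to raise precision while lowering recall, exactly the trade-off described above.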
In the following sensitivity analysis, we show two series of plots (Figs. 6 and 7) delineating both individual (red) and grouped (blue) geoparsing for various model settings, namely, a varying threshold and a varying size of the scanning window. In these figures, we show three plots: (1) a plot that shows recall and precision measures for all locations that the model accounts for (i.e., countries, administrative subdivisions, and cities, towns, and villages), using all 2785 tweets; (2) a plot that shows these measures for administrative subdivisions, using only those tweets that mention such a location according to our validation set; and (3) a plot that shows precision and recall measures for all cities, towns, and villages, using only those tweets that mention such a location.
Figure 6 shows the recall and precision scores for individual and grouped geoparsing with a varying threshold. The trade-off between precision and recall is visible in the first window: when a higher threshold is chosen, more location matches are discarded, while the likelihood of a correct match is higher for the remaining locations. For individual geoparsing, as only the spatial indicators of the post itself are considered, the scores change in discrete steps. In contrast, for grouped geoparsing, the scores are averaged between tweets within the same group, and the decrease is therefore more gradual. At very high thresholds, the precision for grouped geoparsing starts to drop (for administrative subdivisions and cities/towns/villages). This is likely because the scores assigned to tweets in small groups fluctuate more than those in large groups (the “Voting and Assigning Locations” section), and hence there is more uncertainty in whether the location is assigned correctly. Therefore, when the threshold increases, small groups make up a larger share of the response set (as large groups will always have averaged medium scores), which causes the precision to drop. For thresholds between approximately 0.1 and 0.25, precision and recall measures for grouped geoparsing are optimal and higher than those for individual geoparsing at any threshold.
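The averaging and thresholding behavior of grouped geoparsing can be sketched as follows. This is a simplified stand-in for the voting step, not the published TAGGS code; the mention layout (`(toponym, candidate_location, score)` triples for tweets within one scanning window) and the function name are assumptions for illustration.

```python
from collections import defaultdict

def grouped_assign(mentions, threshold=0.2):
    """Assign a location to each toponym by averaging the scores of all
    tweets in the same group (toponym, candidate location), then
    discarding candidates whose mean score falls below the threshold.
    `mentions` is a list of (toponym, candidate_location, score) triples
    drawn from one scanning window (hypothetical layout)."""
    groups = defaultdict(list)
    for toponym, candidate, score in mentions:
        groups[(toponym, candidate)].append(score)
    assigned = {}
    for (toponym, candidate), scores in groups.items():
        mean = sum(scores) / len(scores)
        if mean >= threshold:
            # keep the best-scoring candidate per toponym
            if toponym not in assigned or mean > assigned[toponym][1]:
                assigned[toponym] = (candidate, mean)
    return {t: c for t, (c, _) in assigned.items()}
```

Because small groups average over few scores, their means fluctuate more than those of large groups, which is the behavior invoked above to explain the precision drop at very high thresholds.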
Figure 7 shows the recall and precision measures for a varying scanning window size, ranging between 6 min and 48 h. In theory, with an infinitesimally small scanning window, grouped geoparsing would yield results identical to those of individual geoparsing. It is clearly visible that, in general, both precision and recall increase as the scanning window grows. This is expected, because a larger number of tweets are grouped, and therefore the likelihood that spatial information is available increases. Although an increase in recall and precision is still visible for larger scanning windows, the increase is not substantial, which indicates that spatial information is available for most toponyms. When new floods occur, it is not desirable to take location mentions of previous floods into account. Therefore, we hypothesize that when the scanning window becomes too large, the performance of the model will decrease. Unfortunately, because of memory (RAM) constraints in our current setup, we could not test this. Ideally, the size of the scanning window would depend on the volatility of the event type: events with a longer average duration, such as droughts (people will likely refer to the same event over a longer time span), could benefit from a larger scanning window, and vice versa for shorter events.
Effect of the Event Size
Figure 8 highlights differences in performance under different flooding circumstances, using a varying threshold. On December 12, 2015, there were various smaller flood events, while on December 27, 2015, a couple of very large flood events took place (the “Validation of TAGGS” section). These two cases make clear that, using optimal model settings, TAGGS was slightly more accurate for larger-scale flood events than for smaller-scale ones. Such a finding is to be expected, because during the large flood events in the USA and UK, a larger percentage of tweets mentioned the same toponym, owing to the high level of Twitter usage in both countries. Because of the grouping approach, most of these tweets were scored, even though not all of them had spatial information available. In contrast, when a location was mentioned in a single tweet or a small group of tweets without location metadata, that tweet was not geoparsed. This latter situation is more common when a higher number of smaller events occur, as was the case on Dec 12, 2015. Just as the precision for large groups of tweets drops at a lower threshold than for small groups (the “Sensitivity Analysis” section), the precision for Dec 27, 2015 also declines at a lower threshold than that for Dec 12, 2015. We argue that the sharper drop in precision for the tweets posted on Dec 27, 2015 occurs because most groups of tweets are larger, so all tweets have relatively low scores, which are then discarded at a higher threshold. Nevertheless, the grouped algorithm still correctly geotagged about two thirds of the tweets with a location, even on days with predominantly smaller flood events.
Comparison to Other Spatial Indicators
Figure 9 illustrates the number of locations identified using the different approaches and the number of erroneous matches for the base settings. Using individual geoparsing, we found approximately 55% of these locations—of which roughly 86% were correct. The grouped geoparsing technique, developed for this research, increased the number of found locations to approximately 82%—of which about 91% are correct. In contrast, of the 2785 tweets, only 33 (~ 1.2%) have coordinate information attached. This suggests that the TAGGS approach makes significantly more spatial information available than does a strategy relying on either individual geoparsing or coordinates alone.
Comparison to Other Work
Several other studies have addressed problems similar to those in this paper. For example, Middleton et al. (2014) and Gelernter and Balaji (2013) investigated geoparsing for crisis mapping in a local setting, assuming a priori knowledge about an event. This allowed the authors to collect detailed information from the focus area of the event, which is not possible in our global approach. Zhang and Gelernter (2014) developed the Carnegie Mellon geolocator 2 algorithm. We analyzed the performance of these algorithms using the English tweets in our validation set. As shown in Table 4, TAGGS performs considerably better in terms of both precision and recall.