This section presents the experimental results of the proposed methodology. In the following, for ease of discussion, a location provided by Twitter (geotag) is referred to as a “georeference” and a location provided by the CIME algorithm is referred to as a “geolocation.” Moreover, the “local” step of the CIME algorithm, which builds a context from elements of the same post (Section 4.1), is referred to as CIME local, while the “global” step, which extends the context with elements from other connected posts in the behavioral social network (Section 4.2), is referred to as CIME global.
As discussed in the introduction, the CIME algorithm was developed to increase the number of geolocated images extracted from social media during an emergency and, ultimately, to support rapid mapping activities. We therefore evaluated the algorithm on emergency events, assessing the number, quality and relevance of the geolocated posts and images, and discussing the impact of the geolocated images on rapid mapping activities.
Section 5.1 describes the experimental setting in terms of processing pipeline and the analysis method. Section 5.2 describes the emergency events selected as case studies. Section 5.3 summarizes the results and, finally, Section 5.4 discusses them.
Analysis method and processing pipeline
The post/image extraction process includes the following phases:
Crawling. Posts from Twitter were crawled through the Twitter search API using a list of event-specific keywords as search parameters [3, 4].
Geolocation of posts. Posts were geolocated as described in Section 4. Georeferences natively provided by Twitter were also stored for comparison and as a baseline.
Geolocation of images. The geolocation assigned to a post by the algorithm was associated with all the images included in the post.
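The third phase, propagating a post's geolocation to its images, can be sketched as follows. This is a minimal illustration, assuming the geolocation step has already filled in a `geolocation` field; the `Post` class and its field names are hypothetical and not part of the original pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

LatLon = Tuple[float, float]


@dataclass
class Post:
    # Hypothetical record for a crawled post.
    post_id: str
    text: str
    image_urls: List[str] = field(default_factory=list)
    geotag: Optional[LatLon] = None        # native Twitter georeference (baseline)
    geolocation: Optional[LatLon] = None   # location inferred by the geolocation step


def geolocate_images(posts: List[Post]) -> Dict[str, LatLon]:
    """Associate each post's inferred geolocation with every image it contains."""
    image_locations: Dict[str, LatLon] = {}
    for post in posts:
        if post.geolocation is None:
            continue  # post not geolocated: its images remain without a location
        for url in post.image_urls:
            image_locations[url] = post.geolocation
    return image_locations
```

Every image inherits the location of its post unchanged, so the quality of an image's location is exactly the quality of the post-level geolocation.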
The results were analyzed considering the following criteria:
Recall: defined as the percentage of items correctly retrieved with respect to a given target set. Since the focus is on extracting and geolocating images, a recall of 100% corresponds to geolocating all images.
Precision: defined as the percentage of relevant items among those retrieved. We distinguish between geolocation precision and relevance:
Geolocation precision is assessed by comparing CIME’s geolocations against Twitter’s native georeferences as well as against manually annotated locations.
Relevance is assessed by manually labeling the geolocated images as useful or not useful for rapid mapping activities.
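Written as ratios, the two criteria read as follows. The set notation is introduced here for illustration only: $T$ is the target set (e.g., all images in a dataset), $R$ the retrieved (geolocated) items, $C \subseteq R$ the correctly geolocated items, and $R_{\mathrm{rel}} \subseteq R$ the relevant retrieved items, i.e., correctly located ones for geolocation precision or useful ones for relevance.

```latex
\mathrm{Recall} = 100 \cdot \frac{|C \cap T|}{|T|},
\qquad
\mathrm{Precision} = 100 \cdot \frac{|R_{\mathrm{rel}}|}{|R|}
```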
While evaluating recall and geolocation precision aims to demonstrate the validity of CIME by itself, evaluating the relevance of the geolocated images aims to demonstrate the usefulness of CIME for rapid mapping activities.
These evaluation criteria were translated into the analyses summarized below:
Overall number of media geolocated by CIME. Although this value does not take into account the total number of media that could potentially be geolocated, it can be compared with the number of natively georeferenced tweets, quantifying the increase provided by CIME.
Comparison with natively georeferenced locations. To evaluate the precision of the media geolocated through CIME, we selected those that are also natively georeferenced and compared the two locations. Georeferences are not always precise, but they are commonly used in applications and generally provide reasonable quality. To assess whether Twitter georeferences or CIME geolocations are more precise, we selected posts that include media and manually verified, using Google Street View, whether the locations were consistent with the media content.
Geolocation precision. Considering the entire dataset, CIME geolocations were compared to manual annotations, quantifying the agreement.
Geolocation recall. Considering the entire dataset, the percentage of tweets geolocated by CIME among those manually geolocated is calculated.
Geolocation relevance to the emergency management task. We calculated the percentage of media geolocated by CIME which were manually labeled as “useful”. Images are useful if they can contribute to rapid mapping activities (e.g., a river image, a flooded road, a blocked road, groups of people gathered in a street and so on). This value was compared to the percentage of useful natively georeferenced media.
Note that not all of these analyses were conducted on every case study introduced in Section 5.2, because manual annotation is very time-consuming and becomes impractical beyond certain data volumes. For example, the geolocation precision analysis requires labeling all the tweets in the dataset, recognizing the locations mentioned in each tweet, and finding them in a gazetteer to annotate the associated coordinates. Therefore, this analysis was performed on one case study only, namely Hurricane Sandy. The dataset for the Hurricane Sandy case study was manually annotated starting from a dataset included in [25, 26], which had already been annotated to identify location references in the text. The goal of our further annotation was to manually disambiguate these location references using OpenStreetMap.
Floods in Southern England, 2014
The first case study considered in this work is based on tweets related to the floods which occurred in Southern England in 2014.
Twitter was crawled with the keywords “England” and “flood England” from 10 to 15 February 2014, obtaining a dataset of 108,757 tweets.
In this time frame, the Copernicus EMS rapid mapping service was activated (activation EMSR069), producing rapid maps for the affected areas. In total, 22 delineation maps were produced for the areas of Bridgwater, Hambledon, Kenley, Maidenhead, Staines and Worcester. Maps produced by Copernicus EMS are based on the analysis of high-resolution SAR (Synthetic Aperture Radar) images from the Copernicus satellite system. It should be noted that, as stated in the map description, “the thematic accuracy might be lower in urban and forested areas due to known limitations of the analysis technique”.
Figure 4 shows a portion of a mapped area in the UK. The areas in red are the areas for which Copernicus EMS produced maps. Dots with numbers and markers refer to geolocated tweets with images in the area of interest. The produced maps can be interactively browsed and the geolocated tweets/images individually inspected. Figure 5 is a detail of the delineation map produced by Copernicus EMS for Staines, with the areas in blue representing flooded areas.
Figure 6 shows an example of a tweet geolocated in Queen’s Road, Datchet, by the CIME algorithm (local). The tweet is not georeferenced, but the position was derived from the tweet’s text through CIME. On the right, the extracted information is shown. The text in bold is the textual description of the location, while the coordinates correspond to the center of the location, where the tweet has been placed on the map.
Figure 7 shows other cases where CIME identified the locality rather than the street. In these cases, the posts (and the attached media) were all linked to the center of the locality. Locations at this level of precision can still be useful, since media can carry additional information that helps human operators create the maps. For instance, the image on the right contains the clearly visible name of a restaurant, and the image on the left shows rail tracks that can easily be identified in the area inferred by CIME.
Hurricane Sandy in New York, 2012
Hurricane Sandy was the most devastating Atlantic storm of 2012, causing severe damage in New Jersey and New York. An annotated dataset of tweets, mainly related to different locations in New York City and its surroundings, was provided by the University of Southampton IT Innovation Centre. The original dataset contains 1,996 tweets with manual annotations of the locations mentioned in them. This dataset was used as the basis for evaluating the algorithm, with an additional annotation for location disambiguation performed by two annotators for each post. This yielded a total of 280 distinct locations that could be manually disambiguated.
Overall number of media geolocated by CIME
The first part of Tables 2 and 3 describes the datasets in terms of the overall number of tweets, tweets with images and natively georeferenced tweets. Only 0.28% and 0.15% of the tweets in the two case studies, respectively, are natively georeferenced tweets with images.
The second part of Tables 2 and 3 summarizes the results of CIME in terms of the total number of geolocated tweets with images, distinguishing between images geolocated in the local and in the global phase. In the Southern England case, 0.64% of the tweets are tweets with images geolocated by CIME, more than twice the share of natively georeferenced tweets with images. In the Hurricane Sandy case, this share is 1.10%, more than seven times that of natively georeferenced tweets with images.
To evaluate the usefulness of the geolocated images in terms of new information provided to operators, we assessed whether the sets of geolocated and georeferenced images overlap. In the Southern England case, there are 45 overlapping images, which is only 6.5% of the CIME geolocated images, while in the Hurricane Sandy case, there are 3 overlapping images, which is only 12% of the geolocated images. This means that the vast majority of images geolocated through CIME would not be available by considering only georeferenced tweets.
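The overlap figure is the fraction of CIME-geolocated images that also appear among the natively georeferenced ones. A minimal sketch, assuming images are identified by hashable IDs (a hypothetical representation):

```python
from typing import Iterable, Hashable


def overlap_share(cime_images: Iterable[Hashable],
                  georeferenced_images: Iterable[Hashable]) -> float:
    """Fraction of CIME-geolocated images that are also natively georeferenced."""
    cime = set(cime_images)
    if not cime:
        return 0.0  # nothing geolocated, so no overlap to measure
    return len(cime & set(georeferenced_images)) / len(cime)
```

The complement, `1 - overlap_share(...)`, is the share of geolocated images that would be unavailable if only georeferenced tweets were considered.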
Note that the percentages in Tables 2 and 3 are computed with respect to the total number of tweets in each dataset. The percentages for CIME are low because only tweets with images were considered in this analysis, and these make up only 3.7% and 5.7% of the tweets in the two datasets. Considering only tweets with images, CIME was able to geolocate 11.1% and 33.78% of the posts in the two cases.
Comparison with natively georeferenced locations
To evaluate the accuracy of the images geolocated through CIME, we focused on those which are also natively georeferenced for comparison.
This analysis was performed only on the Southern England case, because the other case study did not provide enough georeferenced and geolocated tweets (see Table 3).
The georeferences and the locations provided by CIME for each tweet were validated by checking them on Google Maps and also by using Street View when needed, to assess whether they locate the content shown in the images correctly. The correct place for an image was considered as the place to be mapped (e.g., if a picture of a church was taken from a distance, the true location is the position of the church rather than the position where the picture was taken).
The locations were evaluated as follows: precise if within sight (at most 1 km), same area if between 1 and 5 km, and imprecise if more than 5 km away.
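If the distance between a location and the place depicted in the image is known, the three classes correspond to the rule below. This is only a sketch: in the study the evaluation was done visually, and treating the 1 km and 5 km boundaries as inclusive is an assumption.

```python
from enum import Enum


class Accuracy(Enum):
    PRECISE = "precise"        # within sight, at most 1 km
    SAME_AREA = "same area"    # between 1 and 5 km
    IMPRECISE = "imprecise"    # more than 5 km


def classify_error(distance_km: float) -> Accuracy:
    """Map the distance to the depicted place onto the three accuracy classes."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    if distance_km <= 1.0:
        return Accuracy.PRECISE
    if distance_km <= 5.0:
        return Accuracy.SAME_AREA
    return Accuracy.IMPRECISE
```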
Figure 8 shows the results of the comparison between georeferenced and geolocated images. On the left, all images were considered; on the right, images manually tagged as not useful (i.e., they do not show places and therefore cannot be evaluated) were excluded.
To summarize the results, the following cases were identified: precise correspondence, when both locations are precise; same area, when both locations are in the same area, or one is in the same area and the other is precise; geoloc correct, geotag incorrect, when the geolocation is precise or in the same area while the georeference is imprecise; and, vice versa, geoloc incorrect, geotag correct. Imprecise denotes cases where both locations are imprecise.
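The pairing rule can be written as a small decision function over the two per-location labels. This is an illustrative sketch; the plain string labels and the `tally` helper are hypothetical conveniences, not part of the original analysis.

```python
from collections import Counter
from typing import Iterable, Tuple

LABELS = {"precise", "same area", "imprecise"}
CORRECT = {"precise", "same area"}  # labels counted as a correct location


def summarize_pair(geoloc: str, geotag: str) -> str:
    """Combine the labels of a CIME geolocation and a Twitter geotag into a summary case."""
    if geoloc not in LABELS or geotag not in LABELS:
        raise ValueError(f"unknown label: {geoloc!r}, {geotag!r}")
    if geoloc == "precise" and geotag == "precise":
        return "precise correspondence"
    if geoloc in CORRECT and geotag in CORRECT:
        return "same area"  # both same area, or one same area and one precise
    if geoloc in CORRECT:
        return "geoloc correct, geotag incorrect"
    if geotag in CORRECT:
        return "geoloc incorrect, geotag correct"
    return "imprecise"  # both locations imprecise


def tally(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many (geolocation, geotag) label pairs fall into each case."""
    return Counter(summarize_pair(g, t) for g, t in pairs)
```

All nine label combinations map to exactly one case, so the percentages of the five cases always sum to 100%.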
In 53% of cases, CIME geolocations and georeferences referred to the same place (precise correspondence, or same area within 1 to 5 km). In several cases, only one of the two locations was accurate and, in particular, CIME geolocations were more often the accurate one: the geolocation was correct and the georeference incorrect in 23% of cases, versus 8% for the opposite situation. It should also be noted that about half of the images analyzed did not show useful content.
Geolocation precision
To evaluate the precision of CIME in terms of correctly disambiguated locations, all tweets were manually annotated to identify the mentioned locations. The manual annotation involved both location recognition and location disambiguation (see Section 2 for details about these phases). While location recognition was a purely manual task, location disambiguation was performed manually with the aid of the OpenStreetMap gazetteer accessed through the Nominatim web interface, thus using the same knowledge base as CIME. The gazetteer allows the annotator to select a specific location, rather than just a name, so that each location mention can be associated with a pair of coordinates. This kind of manual annotation is very time-consuming and impractical beyond certain data volumes. Therefore, this analysis was performed on the Hurricane Sandy case study, which includes approximately 2,000 tweets, but not on the Southern England case study, which includes over 100,000 tweets. For this analysis, each location was represented as a point corresponding to its center. A CIME location is considered “precise” if its distance to the manually annotated location is below a given threshold. We varied the distance threshold from 0 to 50 km in steps of 0.5 km. Distances were computed as geodesic distances on the WGS-84 ellipsoid.
Results are given in Fig. 9. Precision is 41% at 1 km and 57% at 10 km, and it settles at 64% at 18 km; beyond this threshold it does not increase significantly up to 50 km. The largest increase in precision occurs within the first few kilometers: Figure 10 focuses on the range 0–5 km with a 0.1 km step and shows that precision at 0.1 km is already 37%. Therefore, the majority of the locations correctly disambiguated by CIME were identified with very high geographic accuracy.
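The threshold sweep can be sketched in pure Python as below. The source does not state which geodesic implementation was used; this sketch uses Vincenty's inverse formula on the WGS-84 ellipsoid (libraries such as geopy's `geodesic` instead rely on Karney's algorithm, which also handles nearly antipodal points). The input format, a list of (CIME point, annotated point) pairs, and counting a distance equal to the threshold as a hit are assumptions.

```python
import math

# WGS-84 ellipsoid parameters (semi-major axis in metres, flattening)
WGS84_A = 6378137.0
WGS84_F = 1 / 298.257223563
WGS84_B = (1 - WGS84_F) * WGS84_A


def vincenty_km(lat1, lon1, lat2, lon2, tol=1e-12, max_iter=200):
    """Geodesic distance in km on WGS-84 using Vincenty's inverse formula."""
    a, f, b = WGS84_A, WGS84_F, WGS84_B
    L = math.radians(lon2 - lon1)
    U1 = math.atan((1 - f) * math.tan(math.radians(lat1)))  # reduced latitudes
    U2 = math.atan((1 - f) * math.tan(math.radians(lat2)))
    sinU1, cosU1, sinU2, cosU2 = math.sin(U1), math.cos(U1), math.sin(U2), math.cos(U2)
    lam = L
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cosU2 * sin_lam, cosU1 * sinU2 - sinU1 * cosU2 * cos_lam)
        if sin_sigma == 0:
            return 0.0  # coincident points
        cos_sigma = sinU1 * sinU2 + cosU1 * cosU2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cosU1 * cosU2 * sin_lam / sin_sigma
        cos2_alpha = 1 - sin_alpha ** 2
        # cos(2*sigma_m); zero for equatorial lines, where cos2_alpha == 0
        cos_2sm = cos_sigma - 2 * sinU1 * sinU2 / cos2_alpha if cos2_alpha != 0 else 0.0
        C = f / 16 * cos2_alpha * (4 + f * (4 - 3 * cos2_alpha))
        lam_prev = lam
        lam = L + (1 - C) * f * sin_alpha * (
            sigma + C * sin_sigma * (cos_2sm + C * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        if abs(lam - lam_prev) < tol:
            break
    else:
        raise ValueError("Vincenty formula did not converge (nearly antipodal points)")
    u2 = cos2_alpha * (a ** 2 - b ** 2) / b ** 2
    A = 1 + u2 / 16384 * (4096 + u2 * (-768 + u2 * (320 - 175 * u2)))
    B = u2 / 1024 * (256 + u2 * (-128 + u2 * (74 - 47 * u2)))
    delta_sigma = B * sin_sigma * (cos_2sm + B / 4 * (
        cos_sigma * (-1 + 2 * cos_2sm ** 2)
        - B / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2) * (-3 + 4 * cos_2sm ** 2)))
    return b * A * (sigma - delta_sigma) / 1000.0


def precision_curve(pairs, max_km=50.0, step_km=0.5):
    """Share of CIME locations within each threshold of the annotated location."""
    errors = [vincenty_km(*pred, *gold) for pred, gold in pairs]
    if not errors:
        return []
    curve = []
    for i in range(int(round(max_km / step_km)) + 1):
        t = i * step_km
        curve.append((t, sum(1 for e in errors if e <= t) / len(errors)))
    return curve
```

With the default arguments the sweep has 101 thresholds (0, 0.5, ..., 50 km); passing `max_km=5.0, step_km=0.1` reproduces the finer-grained range used for the close-up analysis.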
Geolocation recall
The second part of Tables 2 and 3 summarizes the total number of images geolocated by CIME. However, evaluating the recall of the algorithm requires taking into account all the locations mentioned in the tweets and calculating the ratio of locations correctly identified. This analysis was based on the same annotations used to evaluate precision on the Hurricane Sandy case study (Section 5.3.3).
Overall, CIME disambiguated 21% of the mentioned locations, so its recall is considerably lower than its precision. This is mainly because CIME always seeks a context to disambiguate a location reference: tweets mentioning a single, isolated reference are excluded, even when that location is intuitively easy to disambiguate. This point is discussed further in Section 5.4.
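The effect of requiring a context on recall can be illustrated with a deliberately simplified toy model. Here "context" is reduced to the presence of at least one other location mention in the same tweet; this rule, the helper names and the example mentions are hypothetical and do not reproduce CIME's actual local or global context construction.

```python
from typing import Dict, Iterable, List, Set, Tuple

Mention = Tuple[str, str]  # (tweet_id, location mention)


def recall(annotated: Iterable[Mention], disambiguated: Iterable[Mention]) -> float:
    """Share of manually annotated mentions that were also disambiguated."""
    annotated = set(annotated)
    if not annotated:
        return 0.0
    return len(annotated & set(disambiguated)) / len(annotated)


def toy_context_gate(mentions_by_tweet: Dict[str, List[str]]) -> Set[Mention]:
    """Toy rule: attempt a mention only if its tweet has another mention as context."""
    return {(tweet_id, m)
            for tweet_id, mentions in mentions_by_tweet.items()
            if len(mentions) >= 2
            for m in mentions}
```

Because isolated mentions are never attempted, the gated set bounds recall from above regardless of how accurate the disambiguation is: in a tweet mentioning only "New York", the mention is skipped even though it is easy to resolve.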
Relevance for the task to be performed
Geolocated and georeferenced tweets with images were annotated with respect to their relevance (that is, their potential usefulness) for rapid mapping activities. Since a prerequisite for being useful is lying within the area of interest (the mapped area), only images in that area were considered. An image was considered relevant if it shows roads, streets, areas, blocks or gatherings of people, with or without water. The aim was to compare the relevance of georeferenced and geolocated images. This analysis was performed only on the Southern England case, because the other case study did not provide enough geolocated and georeferenced images (see Section 5.3.1).
The results of this analysis are reported in Table 4, where the relevance of images geolocated in the local and in the global phase was evaluated separately. As shown, the majority of images with any kind of attached location are relevant, but images geolocated by CIME (local or global) are more often relevant than natively georeferenced images. Possible reasons for this are discussed in Section 5.4.
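The relevance comparison amounts to filtering images to the mapped area and computing the share labeled relevant for each location source. A minimal sketch, assuming the mapped area is approximated by a bounding box (the actual Copernicus EMS areas are polygons) and a hypothetical record format with `source`, `location` and `relevant` keys:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)


def relevance_by_source(images: List[dict], bbox: BBox) -> Dict[str, float]:
    """Share of relevant images per location source, inside the mapped area only."""
    min_lat, min_lon, max_lat, max_lon = bbox
    totals: Dict[str, int] = defaultdict(int)
    relevant: Dict[str, int] = defaultdict(int)
    for img in images:
        lat, lon = img["location"]
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue  # outside the area of interest: not considered
        totals[img["source"]] += 1
        relevant[img["source"]] += int(img["relevant"])
    return {src: relevant[src] / totals[src] for src in totals}
```

Keeping the local and global phases as separate `source` values yields one relevance figure per phase, alongside the one for native georeferences.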
Discussion
The results demonstrate the validity of the locations inferred by CIME, in terms of both volume and accuracy, with respect to natively georeferenced ones. Since natively georeferenced tweets are typically used in applications that need located tweets, the results suggest that CIME-geolocated tweets can be reliably used in the same applications. Moreover, most of the CIME-geolocated tweets are not natively georeferenced, thus providing an additional and complementary source.
Not only do CIME-geolocated tweets (local and global) at least double the number of geolocated media with respect to natively georeferenced ones, but these media were also evaluated as more relevant (79/83% vs 66%). This can be explained by the fact that the text accompanying relevant images is often more descriptive and contains more references to locations, making it a better target for CIME. Moreover, relevant media often attract many interactions in terms of replies, retweets, etc., which gives CIME global disambiguation an advantage.
The analysis reported in Fig. 8 for the Southern England case has shown that CIME geolocations are, on average, more precise than geotags. Additionally, this plot highlights the limitations of natively georeferenced tweets, which are often taken as ground truth in the analysis of existing algorithms (see for instance , which reports many methods using geotags as ground truth in Table 2). Indeed, among all the georeferenced tweets with a place-related image, 38% do not show a precise correspondence between the place depicted in the image and the tweet’s geotag.
One of the main limitations of CIME is its recall, which is comparatively lower than its precision on the same dataset. This is mainly the result of a design choice of the algorithm, which treats a tweet’s context as a necessary (but not sufficient) element for disambiguating location mentions. Therefore, isolated location references (i.e., location references without a context) cannot be disambiguated. CIME was designed to favor the precise disambiguation of context-dependent location references, which usually correspond to streets, roads and small villages, over locations that can be reliably disambiguated from the references themselves (e.g., “New York” in the Hurricane Sandy case study). Indeed, at least three important goals of CIME, which stem from the rapid mapping task, are not reflected in a precision/recall analysis. First, not all locations have the same importance for rapid mapping: information about lesser-known and comparatively small locations can often add more value than information about large cities already covered by other media. Second, precision can only be defined by setting an accuracy threshold (see Fig. 9), which is application-dependent and needs to be low for rapid mapping. Finally, to aid rapid mapping activities, it is necessary to ensure both the relevance of the geolocated media and the precise correspondence between the locations depicted in the images and the geolocations. The analysis and the evaluation criteria presented in this paper try to capture these aspects. Nonetheless, nothing prevents CIME from being used as part of larger systems, in conjunction with other high-recall geolocation methods, to disambiguate isolated location references.
Comparing geolocation algorithms tailored to specific domains and applications (such as rapid mapping) is not an easy task. As discussed, considering only precision and recall is often reductive, and several other features, whose relevance is largely domain-dependent, must also be taken into account. First, as described earlier, precision is a function of the threshold used to consider a location “accurate”, so two geolocation methodologies with different goals could each perform better at different thresholds. CIME, in particular, favors accuracy over volume, given a target domain that requires highly accurate geolocated items for rapid mapping purposes. This can be seen in Fig. 10, which highlights the precision at 1 km (about 40%) and at 5 km (about 50%). Moreover, other features and constraints of a geolocation algorithm have a major impact on its applicability in specific domains, for example language-specific features, the need for prior knowledge about the target event or area, or application-specific preprocessing phases. These differences hinder comparisons between geolocation algorithms, because stronger constraints can yield better performance while limiting applicability in specific domains and contexts. Finally, different geolocation algorithms may target different steps of the geolocation process (see Section 2) or different aspects of social media locations (such as home or post location, see Section 3), further hindering comparisons.
On this basis, the results of CIME on the Hurricane Sandy case study have been compared with those presented in  for the same dataset (originally described in ). A major difference between the two methodologies is that the algorithm described in  requires a preprocessing phase, which builds indices restricted to the specific target geographical area. In contrast, CIME is designed to analyze tweets without any preparation or training phase, with the aim of disambiguating locations while reacting to events in real time.
The precision at 1 km obtained by CIME is comparable to, and slightly higher than, the precision obtained on the tested datasets for the geographical distances mentioned in  (40% vs 18–36%). The recall is lower for CIME (21% vs 76–81%), for the reasons discussed above.