1 Introduction

Volunteered Geographic Information (VGI) is widely regarded as one of the most important technological tools for obtaining in-depth and rapid information for the management of a disaster event [11,12,13,14]. For flood events in particular, a large body of published research deals with extracting, analyzing and visualizing important information from VGI sources, most of which are social media.

Various case studies have explored the potential of VGI in flood event management, from the Queensland floods in Australia [11] to the Colorado floods in the USA [5]. Significant findings include the high rate at which information is produced, which reached about 15000 messages per hour in the Queensland case study, along with the general assumption that VGI contributes significantly to the immediacy and depth of information [5, 14]. Moreover, the accurate tracking of a flood event that a VGI source provides has been characterized as vital for studying such events, identifying weaknesses and thereby limiting the negative impact of similar events that may occur in the future [5].

In addition, in recent years the European Union has made a significant contribution by providing innovative solutions that aim to track effectively both the unfolding and the consequences of a disastrous event. The Copernicus emergency service provides tracked information that is automatically extracted from imagery. Although this service is very useful, it sometimes misses disastrous events; the case study used in this research, for instance (the flood of Kalamata, Greece, 2016), is not present in the Copernicus database.

Another important aspect of VGI relates to so-called Participatory GIS (PGIS) activities, in which volunteers take action in the physical environment to reduce the vulnerability of a geographic area to disastrous events. One notable approach addressed the reduction of flood vulnerability in the city of La Paz, in Baja California, Mexico [3]. Another inspiring PGIS approach took place in Brazil, where a community of more than 117 Brazilian experts from NGOs and private companies worked together to define the most suitable criteria for identifying areas as vulnerable to floods in South Brazil [4].

The data analysis part of VGI in particular poses various challenges. In the author's opinion, these challenges fall into four clusters. The first is a classification problem: while international research has proposed various classification structures [1, 6, 8], there is still room for improvement, as a more effective classification leads to a more effective delivery of vital information. The second cluster relates to precise georeferencing, since meaningful maps require that the data are sufficiently georeferenced. The third cluster includes the selection and development of appropriate visualization techniques for delivering the processed information to its final receivers. The maps and graphs created must make sense to people who may need access to the information but have no experience in reading sophisticated scientific maps and schemas; overly complicated visualizations should therefore be simplified. The last cluster relates to the automation of the data analysis procedures so that results can be delivered in real time, since VGI data analysis is a time-consuming process [7,8,9,10].

With this set of open challenges in mind, the next sections of this article describe a methodology that can be used for extracting, classifying and visualizing disaster management (DM) information.

The floods of Kalamata, Messinia, Greece (September 2016) are used as a case study. The rain that caused the flood started on 7 September 2016, and a day later the nearby area of Lakonia was also affected. The flood caused the death of three people. Extensive damage to the urban environment was reported, while 34 families were left with no place to sleep. The motorway and the airport of the city of Kalamata remained closed. Many politicians (government ministers, the ex-Prime Minister, etc.) travelled to Kalamata for disaster management purposes.

2 Data Used

A dataset was acquired from Sifter containing a corpus of 111000 tweets published from 7 September 2016, 00:00:01 GMT, until 11 September 2016, 23:59:59 GMT. This dataset included all published tweets containing at least one of the following keywords: Flood (in Greek and English), Floods (in Greek), Rain (in Greek), Storm (in Greek), Damages (in Greek), Kalamata (in Greek and Latin script). These words appear in the tweet corpus either as a hashtag (#) or as a plain word. In the current research, about 45% of the total dataset was analyzed.
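The keyword selection described above was performed when the dataset was acquired. As a hedged illustration only, the following Python sketch shows how such a match on hashtags or plain words could be expressed. The Greek spellings in `KEYWORDS` and the accent-insensitive comparison are assumptions, since the source lists the keywords only by their meaning.

```python
import re
import unicodedata

# Assumed spellings: the source names the keywords only by meaning.
KEYWORDS = [
    "flood", "πλημμύρα", "πλημμύρες", "βροχή",
    "καταιγίδα", "ζημιές", "Καλαμάτα", "Kalamata",
]

def normalize(text):
    """Lower-case the text and strip accents (combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def matches_keywords(tweet, keywords=KEYWORDS):
    """Return True if any keyword occurs as a whole word or as a hashtag."""
    # "#" is not a word character, so "#Kalamata" yields the token "kalamata".
    tokens = set(re.findall(r"\w+", normalize(tweet)))
    return any(normalize(keyword) in tokens for keyword in keywords)
```

Matching on whole tokens rather than substrings is a design choice of this sketch; it keeps a short keyword from matching inside an unrelated longer word.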

3 Methodology

The proposed methodology consists of a few basic steps (Fig. 1). The first step is data preparation. The initial dataset consisted of three csv files. These files were loaded into RStudio, and various functions were applied to eliminate all null values and to remove all graphical characters. The final dataset was then exported in csv format.
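The preparation step can be sketched as follows. This is a minimal Python illustration and not the author's R code. The column names `text` and `created_at`, the set of null markers, and the reading of "graphical characters" as Unicode symbol and control characters (for example emoji) are all assumptions.

```python
import csv
import unicodedata

NULL_MARKERS = {None, "", "NA", "NULL"}  # assumed encodings of a null value

def strip_graphical(text):
    """Replace symbol (S*) and control/format (C*) characters with spaces."""
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in ("S", "C") else ch for ch in text
    )
    return " ".join(cleaned.split())  # collapse the whitespace left behind

def prepare(rows, required=("text", "created_at")):
    """Drop rows with a null required field, then clean the tweet text."""
    prepared = []
    for row in rows:
        if any(row.get(column) in NULL_MARKERS for column in required):
            continue
        text = strip_graphical(row["text"])
        if text:
            prepared.append({**row, "text": text})
    return prepared

def merge_csv_files(paths, out_path):
    """Read several csv files (assumed to share columns) and write one file."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    prepared = prepare(rows)
    if prepared:
        with open(out_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(prepared[0]))
            writer.writeheader()
            writer.writerows(prepared)
    return prepared
```

Note that this reading of "graphical characters" also removes currency and mathematical symbols, which may or may not be desirable for a given corpus.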

Fig. 1. Methodology

The next step of the methodology was to classify the tweets into categories that make sense to visualize. Table 1 lists the categories, which emerged through a conceptual mashup of the structures of other published research [1, 6, 8]. Classification was performed either by reading each tweet one by one or by using various text queries that accelerated the process.
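The query-assisted part of the classification might look like the following Python sketch. The category labels are taken from categories named elsewhere in this article, but the English patterns are purely hypothetical placeholders and the actual queries are not given in the source. Letting a tweet match several categories is also an assumption.

```python
import re

# Hypothetical query patterns; the real queries are not listed in the source.
QUERIES = {
    "identification": re.compile(r"\b(rain|storm|flood)\w*", re.IGNORECASE),
    "consequences": re.compile(r"\b(damage|dead|injur|closed)\w*", re.IGNORECASE),
    "disaster management": re.compile(r"\b(rescue|evacuat|minister)\w*", re.IGNORECASE),
}

def classify(text, queries=QUERIES):
    """Return the labels whose query matches, in the order the queries are defined."""
    return [label for label, pattern in queries.items() if pattern.search(text)]
```

A tweet for which `classify` returns an empty list would fall back to manual reading, mirroring the two classification routes described above.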

Table 1. Classification categories

Moreover, the category “consequences” was further sub-classified into five values, ranging from I to V. The first value (I) corresponds to the simple identification of rain or a storm, while the value V corresponds to loss of human life. A similar consequence scoring was presented previously by the author [1]; however, this version of the scoring is enriched with more specific incidents (Table 2). Finally, the author created a few scatterplots in order to assess the frequency of each classified category with respect to the time of publication.

Table 2. Description of consequence score values

The third step of the methodology consisted of georeferencing the information by detecting the geolocations mentioned within the text of each tweet and by replicating each tweet N times, where N is the number of detected geolocations. The x and y coordinates of each geolocation were then added to the replicated tweets, so that each mentioned area was georeferenced exactly once (Fig. 2). The whole procedure was carried out automatically by executing a script developed by the author in the R programming language [2].

Fig. 2. Tweet geo-referencing method developed in R [2]
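The replication idea behind the georeferencing step can be sketched in a few lines. The following Python code is an illustration under stated assumptions and is not the author's R script [2]: the gazetteer, its approximate coordinates, and the simple substring matching are all hypothetical.

```python
# Hypothetical gazetteer: place name -> approximate (x, y) = (longitude, latitude).
GAZETTEER = {
    "kalamata": (22.11, 37.04),
    "sparti": (22.43, 37.07),
}

def georeference(tweets, gazetteer=GAZETTEER):
    """Emit one copy of each tweet per place name detected in its text."""
    rows = []
    for tweet in tweets:
        text = tweet["text"].lower()
        for place, (x, y) in gazetteer.items():
            if place in text:  # naive substring detection, an assumption
                rows.append({**tweet, "place": place, "x": x, "y": y})
    return rows
```

A tweet that mentions N known places therefore yields N rows, each carrying the coordinates of exactly one place, while a tweet with no detected place yields none.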

The visualization of the classified and georeferenced information constitutes the final part of the methodology. Specifically, several graphs were created that show the frequency of tweets classified into specific categories and published within different time periods. Finally, two maps were created, visualizing the number of tweets related to simple rain identification (Map I) and the consequence scores (Map II). Both maps visualize extracted information that was posted on Twitter within the first 48 h after the start of the flood event.
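The frequencies behind such graphs can be obtained by counting tweets per category and per hour. The sketch below is a hypothetical Python illustration: the field names `category` and `created_at`, the ISO timestamp format, and the choice of reference time are assumptions that are not specified in the source.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(tweets, start, max_hours=48):
    """Count tweets per (category, whole hours elapsed since start)."""
    counts = Counter()
    for tweet in tweets:
        elapsed = datetime.fromisoformat(tweet["created_at"]) - start
        hour = int(elapsed.total_seconds() // 3600)
        if 0 <= hour < max_hours:  # keep only the first max_hours hours
            counts[(tweet["category"], hour)] += 1
    return counts
```

The resulting counts can be passed to any plotting library. The default `max_hours=48` mirrors the 48 h window used for the two maps.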

4 Results

Figure 3 (left) and Table 3 display the words with the highest frequency in the analyzed text corpus. The word “Kalamata”, the name of the city most devastated by the floods, is mentioned very frequently, while the words “bad weather” and “woman”, along with “disasters” and “dead/deadly”, rank just after it. It can be assumed that Twitter provides sufficient information within the first 6 h after the start of the rain. The frequencies of the popular words multiply within 24 h (Fig. 3 (right), Table 4). “Kalamata” also appears at a high frequency in Latin script, which suggests that within 24 h the news of the Kalamata floods had spread at an international level. Moreover, apart from the disseminated information regarding the human losses, the names of various Greek politicians, such as Samaras and Tsipras, are mentioned frequently.

Fig. 3. Word frequency of tweet corpus, published within 6 h (left) and 24 h (right).

Table 3. Word frequencies within 6 h
Table 4. Word frequencies within 24 h
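Word frequencies for a given time window, such as the 6 h and 24 h windows above, can be computed along the following lines. This is a hedged Python sketch: the field names, the tokenization by word characters, and the optional stopword list are assumptions and do not describe the author's actual procedure.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

def top_words(tweets, start, hours, n=10, stopwords=frozenset()):
    """Return the n most frequent words in tweets posted within the window."""
    end = start + timedelta(hours=hours)
    counts = Counter()
    for tweet in tweets:
        posted = datetime.fromisoformat(tweet["created_at"])
        if start <= posted < end:
            words = re.findall(r"\w+", tweet["text"].lower())
            counts.update(word for word in words if word not in stopwords)
    return counts.most_common(n)
```

Calling the function twice with `hours=6` and `hours=24` yields two rankings that can be compared in the same way as Tables 3 and 4.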

Figures 4 and 5 visualize various aspects of the classified information extracted from Twitter. In these graphs, the author focuses on the information regarding flood identification and consequence score values (Fig. 4) and on emotional expression and irony (Fig. 5). It can be assumed that, within the first hours of the flood, a large amount of information is disseminated regarding flood identification and the consequences for the urban and rural environment, while emotional and ironic tweets start to be published later.

Fig. 4. Frequency of tweets related to flood identification and consequences

Fig. 5. Frequency of tweets related to ironic and emotional tweets

Figure 6 allows the information published hourly to be assessed, also taking into account whether it is day or night. As can be seen, the spread of information is higher early in the day than later at night. Moreover, Twitter users tend to post more related information during the first day of the flood event, while information related to disaster management and the expression of emotions is widely posted during the second day as well.

Fig. 6. Scatter Plots of classification categories vs time period

Figure 7 (Map I) visualizes spatially the frequency of tweets related to rain identification. As can easily be seen, the municipality of Kalamata is the one most often referred to in this thematic Twitter dataset. In the author's opinion, a map like this could be used as an indicator that something extraordinary is happening. It could thus make sense to the stakeholders of disaster management, as it could provide empirical evidence that areas with a very large number of tweets reporting rain are in danger of a flood event.

Fig. 7. Map I: Count of tweets related to rain identification

Finally, Fig. 8 (Map II) displays the consequence score values that were extracted from the thematic Twitter text corpus. Each bullet represents four reports; red is associated with the score value V, which represents loss of human life, while orange bullets represent major damage and human injuries (score value IV). The author maintains that this map could also make sense to disaster management authorities and to people without scientific training.

Fig. 8. Map II: Map of consequence score values extracted from the tweet corpus

5 Conclusions

In general, it is assumed that social media, as a source of VGI, can contribute to disaster management even in areas with a medium-sized population, and that, through the creation of various simple graphs and maps, the extracted information can make sense to DM stakeholders. There are, however, quite a few challenges on the way; appropriate classification, correct and precise georeferencing, meaningful graphs and automation are the most important of them. Considering the above, the next steps of the current research will focus on classifying the tweets automatically and effectively, probably through the use of Machine Learning. Moreover, another vital aspect of VGI is linked to the quality and credibility of the information that neo-geographers or social media users post. In the current research, the author accepts the validity of Linus's Law in VGI, according to which the more users post information about an incident in an area, the more credible and accurate the total information becomes. There is also empirical evidence that, especially regarding DM information, social media users do not tend to post lies. After all, the description of the consequences of a natural event, or the identification of rain, is not a subjective or controversial matter about which different people may hold different views. However, as a future step in the current research, various quality-related validations should be applied to ensure the credibility of the outcome.