Data Collection and Filtering
There were two main conditions for tweets to be relevant to the study: first, the tweets had to be geo-tagged, and second, they had to relate to the migration crisis in the EU. The first condition was fulfilled by filtering for tweets with attached longitude and latitude information. For the second condition, tweets were queried using a list of hashtags relevant to the migration crisis, which the authors requested from the Sächsischer Flüchtlingsrat (SFR), a registered association that campaigns for the interests and rights of refugees and asylum seekers in the Free State of Saxony, Germany. These hashtags were in English and German. Additionally, the authors decided to query hashtags not only in the major European languages (English, German, French, Spanish, Italian, Dutch) but also in the native languages of the countries facing the migration crisis. Besides Italy and Spain, a large number of people also enter Europe through Greece and Turkey (Idemudia and Boehnke 2020). Hence, the languages chosen for the initial query were English, German, Italian, Spanish, Dutch, French, Russian, Turkish and Greek.
There are two major reasons for not including languages spoken by refugees in our dataset. Firstly, there is a lack of reliable data on which languages refugees speak. A 2017 report by Translators without Borders (Footnote 1) examined in detail the assumptions made about languages spoken by refugees. The report's authors interviewed 46 humanitarian organizations and found that they did not ask about or report the mother tongue of refugees, thereby contributing to the lack of data. Nigeria, a major country of origin for refugees, is home to 520 first languages, which shows that making assumptions about expected spoken languages on the basis of home countries is very problematic. Aid workers or government agencies expecting people from Syria or Iraq to speak Arabic find that the refugees speak Kurdish or Turkmen languages. The second reason is that the authors are not familiar with any of these languages. When extracting tweets, the authors could easily mistake tweets from non-refugees containing a pre-defined hashtag (which itself would have to be translated using Google Translate) for tweets from refugees. Furthermore, Gillespie et al. (2016) found that Syrian refugees (whether Kurdish- or Arabic-speaking) are more tech-savvy in their use of social media than their Iraqi or Afghan counterparts, so treating all refugees alike could paint a misleading picture. Hence, for the sake of convenience, the authors decided to include only European languages.
Keeping all the above considerations in mind, the data to be collected ranges temporally from 2016 to 2021 and spatially covers Europe and the northern parts of Africa, filtered using the hashtags relating tweets to the migration crisis. The data was collected using the Twitter API, which limits collection to 1% of all tweets being posted at the time of the query. Geo-referenced tweets in English account for only 2.17% of all tweets, and the percentages for the other queried languages are similarly low (Leetaru et al. 2013). Moreover, Sloan et al. (2013) observed that the amount of geo-referenced tweets closely follows the geographic population distribution. Therefore, the constraints of the Twitter API were not expected to significantly hinder the work. The results of the query underwent a two-step refinement process. The first step was simply to check whether the hashtags used for the query returned tweets relevant to the migration crisis. For this step, the co-occurring hashtags were examined. This was necessary because the query used for retrieving the tweets relies on word fragments that can be ambiguous and therefore return tweets unrelated to the desired topic. An example is shown in Code 1.
WHERE LOWER(tweet_text) LIKE '%asyl%'
Code 1 Snippet from the initial SQL query for retrieving relevant tweets
The word asyl is German for the English word "asylum". The query in Code 1 returned tweets containing the word asyl in both the tweet body and the hashtags. However, since the '%' wildcard was used, the query additionally returned tweets whose hashtags merely contain asyl as a word fragment. An example of such a hashtag is asylum16, which does contain the root asyl but has no relevance to the migration crisis in the EU. Hashtags such as asylum16 are referred to as semantically non-relevant hashtags. After this initial query, both semantically relevant and non-relevant hashtags were found in the dataset, leading the authors to a second round of filtering, which constituted the second step of the aforementioned two-step refinement process.
After the removal of the semantically non-relevant hashtags, the new relevant co-occurring hashtags were included in a second query. These co-occurring hashtags were selected based on the number of times they appeared in the first query while being absent from the initially decided list of hashtags. For example, the hashtag asylrecht appeared as a co-occurring hashtag in the initial query but was not included in the list of queried hashtags; therefore, this hashtag was included in the second round of querying, after the removal of the semantically non-relevant hashtags. At this point, the hashtags in Turkish, Greek and Russian were also excluded, as the low number of tweets returned in the initial query did not justify their inclusion in the dataset. Furthermore, the non-Latin scripts of these languages posed unique challenges for the semantic filtering described above. After these refinement steps, a second query was conducted, which returned approximately 170,000 tweets from February 2016 until January 2021, posted within the bounding box of Europe. This was used as the final dataset for the study.
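The selection of new co-occurring hashtags described above can be sketched as follows. This is a minimal illustration with hypothetical tweets and a plain Python Counter, not the authors' actual pipeline; the hashtag lists and the exclusion set are illustrative stand-ins.

```python
from collections import Counter

# Seed hashtags used in the initial query (illustrative subset)
seed_hashtags = {"asyl", "refugeeswelcome", "migrants"}

# Hashtags co-occurring in the returned tweets (hypothetical sample)
tweets_hashtags = [
    ["asyl", "asylrecht"],
    ["refugeeswelcome", "asylrecht"],
    ["asyl", "asylum16"],  # asylum16: semantically non-relevant
]

# Count co-occurring hashtags that were absent from the seed list
counts = Counter(
    tag
    for tags in tweets_hashtags
    for tag in tags
    if tag not in seed_hashtags
)

# Manually curated exclusion set of semantically non-relevant hashtags
non_relevant = {"asylum16"}

# Frequent, relevant co-occurring hashtags become candidates for the second query
candidates = [tag for tag, _ in counts.most_common() if tag not in non_relevant]
print(candidates)  # ['asylrecht']
```

The manual curation step (building `non_relevant`) mirrors the semantic filtering described in the text: frequency alone cannot distinguish asylrecht from asylum16.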
Data Storage and Preparation
The entire dataset was stored on two separate PostgreSQL servers: one for the raw tweets and the other for the tweets in HLL format. Here, the term "raw tweets" refers to tweets that have been converted into the LBSM raw format using the program lbsntransform (Footnote 2). The methodology for exploring both formats of tweets followed the LBSM structure (Dunkel et al. 2019). There are, however, particularities involved in exploring the raw tweets and the HLL-format tweets, which are explained below.
The tweets in the privacy-aware data format are divided along the four facets. Each of the four facets has its own schema in the HLL server, besides tables with various facets of data. The analysis was performed mostly in separate Jupyter Notebooks by downloading CSVs from the tables and then working with the data.
The primary methodology for exploring HLL data is based on the union and intersection functions, which combine to form the inclusion–exclusion principle. Equation 4 shows the inclusion–exclusion principle for two sets, and Eq. 5 for three sets.
$$|A \cup B| = |A| + |B| - |A \cap B|$$
(4)
$$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$$
(5)
Using the HLL union and intersection functions together with the inclusion–exclusion principle, it is possible to explore the HLL sets. For example, considering tweets associated with the hashtag refugeeswelcome as set A and tweets associated with another hashtag such as migrants as set B, Eq. 4 can be used to find the total number of tweets containing both hashtags. In the case of three hashtags, Eq. 5 is suitable. A complete example using both equations is provided in the supplementary materials submitted with this work (Online Resource 2). This process can also be iterated recursively if associations with more than three hashtags are to be investigated: the result of an initial intersection between A and B can itself be intersected with the result of an intersection between sets C and D. Therefore, the HLL functions allow for flexibility in qualitative analyses.
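The arithmetic of Eq. 4 can be illustrated with ordinary Python sets standing in for HLL sets. This is only a sketch with hypothetical user IDs: a real HLL backend returns approximate cardinalities from its union function, but the inclusion–exclusion computation is identical.

```python
# Plain Python sets stand in for HLL sets; a real HLL sketch would yield
# approximate cardinalities, but the arithmetic below is the same.
a = {1, 2, 3, 4, 5}  # hypothetical users who tweeted #refugeeswelcome
b = {4, 5, 6, 7}     # hypothetical users who tweeted #migrants

# HLL natively supports union; the intersection cardinality is then
# derived by rearranging Eq. 4: |A ∩ B| = |A| + |B| - |A ∪ B|
union_count = len(a | b)
intersection_count = len(a) + len(b) - union_count

print(union_count, intersection_count)  # 7 2
```

The same rearrangement generalizes to Eq. 5 for three sets, and the result of one intersection can feed into the next, as described above.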
Visualizing the HLL data involves the use of geohashing (Niemeyer 2008), a kind of Z-order curve (Morton 1966). Geohashing was developed for the efficient storage of large amounts of geographic coordinates at the cost of coordinate precision. In the case of the HLL data format, visualizing geohashed coordinates adds a further level of security. The individual latitude and longitude values obtained after geohashing are further aggregated into grid cells. Depending on the values within each grid cell, the cells are colored to produce a choropleth map. A basemap behind the grid adds spatial context and improves data readability. Figure 1 depicts an example of the final map after this processing.
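The coordinate coarsening that geohashing performs can be sketched with a minimal, self-contained encoder. The study itself relied on existing tooling for this step; the snippet below only illustrates how precision is traded for storage, using the well-known test coordinate from the geohash literature.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a coordinate as a geohash by binary subdivision.

    Each bit halves either the longitude or latitude interval
    (alternating, longitude first); every 5 bits map to one base-32 char.
    """
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    is_lon = True
    while len(bits) < precision * 5:
        rng = lon_range if is_lon else lat_range
        val = lon if is_lon else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        is_lon = not is_lon
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i : i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, 11))  # u4pruydqqvj
```

Truncating a geohash to fewer characters yields coarser cells, so grouping coordinates by a short geohash prefix is what produces the grid cells of the choropleth map described above.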
The raw data, stored on a separate PostgreSQL server, came with several columns. Not all columns were used for this work, as some were not related to the facets of LBSM.
The spatial facet, consisting of the latitude and longitude of the tweets, was stored in the server in the Well-Known Text (WKT) format (Herring 2011). Re-projection and extraction into latitude and longitude points were performed with the ST_TRANSFORM, ST_X and ST_Y functions in PostGIS. During this stage, it was also found that the second step of the semantic filtering described in Sect. 4.1 had not been sufficient. Due to the presence of the hashtags deport and moria, Spanish words sharing the fragment deport, such as deportivo, were found in the database, as was the English hashtag moriarty. These are also semantically non-relevant hashtags but had not been caught by the two-step filtering previously described. They were consequently excluded from the third query (made on the PostgreSQL server), which retrieved all tweets into a comma-separated value (CSV) file for further analysis in Jupyter Notebooks. The percentage of such erroneous hashtags was not more than 2% in this case.
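The extraction of coordinates from WKT points, done here in PostGIS with ST_X and ST_Y, can be mirrored in plain Python. The parser below is a minimal sketch for simple POINT geometries only, shown to illustrate the format; the coordinate in the example is arbitrary, and the re-projection itself was handled by ST_TRANSFORM.

```python
import re

def parse_wkt_point(wkt: str) -> tuple[float, float]:
    """Extract (lon, lat) from a WKT point such as 'POINT(13.73 51.05)'.

    WKT stores the x coordinate (longitude) first, then y (latitude),
    matching the ST_X / ST_Y convention in PostGIS.
    """
    m = re.fullmatch(r"\s*POINT\s*\(\s*(-?[\d.]+)\s+(-?[\d.]+)\s*\)\s*", wkt)
    if m is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    return float(m.group(1)), float(m.group(2))

lon, lat = parse_wkt_point("POINT(13.7373 51.0504)")  # arbitrary example point
print(lon, lat)  # 13.7373 51.0504
```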
The raw tweets in the CSV needed some formatting before they could be effectively visualized and interpreted. This formatting depended on the facet of the data. A short summary of the formatting according to the facets is shown in Table 1.
Table 1 Summary of data formatting according to the facets

For the temporal facet, the raw tweets came with timestamps in the format YYYY:MM:DD HH:MM:SS. As previously mentioned, the dataset for this work spans five years, from 2016 to 2021. It was therefore decided to aggregate by month or year rather than by day. Furthermore, considering the migration crisis, better results would be expected from monthly aggregation than from daily or hourly aggregation. This reasoning is based on the long-term nature of the crisis, as compared to, for example, a flooding event, which has a shorter temporal duration and therefore requires more fine-granular data.
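Monthly aggregation of such timestamps can be sketched as follows. The timestamps here are hypothetical, but they follow the raw YYYY:MM:DD HH:MM:SS pattern described above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet timestamps in the raw format described above
timestamps = [
    "2016:02:11 09:15:00",
    "2016:02:28 17:40:12",
    "2016:03:01 00:05:30",
]

# Parse the raw format and bin each timestamp by year and month
per_month = Counter(
    datetime.strptime(ts, "%Y:%m:%d %H:%M:%S").strftime("%Y-%m")
    for ts in timestamps
)

print(dict(per_month))  # {'2016-02': 2, '2016-03': 1}
```

Yearly aggregation follows the same pattern with `"%Y"` as the output format.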
For the topical facet, there were two major columns in the raw dataset to consider for the analysis: the hashtags and the tweet texts, which echo the main thrust of this work.
Facet Exploration Tool
Whether for individual facets or for a combination of multiple facets, a visualization was required in which users could inspect the views simultaneously. Furthermore, when exploring such a diverse dataset, interactivity is necessary: it allows users to choose the options they wish to see and helps in presenting the data. Some information about the facets that could not be directly integrated into the static visualizations could be shown through tooltips. For these reasons, the authors decided to implement a facet exploration tool, close in design to a dashboard, to accommodate all these needs.
The final tool functions not only as a means of visualizing the dataset but also as an aid to the analysis. An introduction to the facets using elements of narrative visualization was intended to ease non-experts into the data by defining its context. This was primarily introduced in the section "Explore the Facets". In the section "Explore Events with the Facets" (shown in Fig. 2), users, both experts and non-experts, can use the facet exploration tool to explore the dataset through the events. However, the authors have yet to undertake usability tests to ascertain the effectiveness of the implemented methods. Hence, a detailed discussion of this tool and the suitability of such a dashboard is a matter for future work.