
1 Introduction

Social media platforms have become prime sources of information during crises, particularly concerning rescue and relief requests. During Hurricane Harvey, over 7 million tweets were posted about the disaster in just over a month (Footnote 1), while over 20 million tweets with the words #sandy and #hurricane were posted in just a few days during the Hurricane Sandy disaster (Footnote 2). Sharing such vital information on social media creates real opportunities for increasing citizens’ situational awareness of the crisis, and for authorities and relief agencies to target their efforts more efficiently [23]. However, with such opportunities come real challenges, such as the handling of such large and rapid volumes of posts, which renders manual processing highly inadequate [7]. The problem is exacerbated by the findings that many of these posts bear little relevance to the crisis, even those that use the dedicated hashtags [11].

These challenges create a pressing need for tools capable of automatically assessing the relevancy of crisis information, so that irrelevant posts can be filtered out quickly during a crisis, reducing the load to only those posts that matter. Recent research has explored various methods for classifying crisis data from social media platforms, aiming to automatically categorise posts as crisis-related or not using supervised [10, 13, 21, 25] and unsupervised [18] machine learning approaches. Most of these methods use statistical features, such as n-grams, text length, POS tags, and hashtags.

One of the problems with such approaches is their bias towards the data on which they are trained. This means that classification accuracy drops considerably when the data changes, for example when the crisis is of a different type, or when the posts are in a different language, in comparison to the crisis type and language the model was trained on. Training the models for all possible crisis types and languages is infeasible due to time and expense.

In our previous work, we showed that adding semantic features increases the classification accuracy when training the model on one type of crisis (e.g. floods), and applying it to another (e.g. bushfires) [11]. In this paper, we tackle the problem of language, where the model is trained on one language (e.g. English), but the incoming posts are in another (e.g. Spanish). We explore the role of adding semantics in increasing the multilingual fitness of supervised models for classifying the relevancy of crisis information.

The main contributions of this paper can be summarised as follows:

  1. We build a statistical-semantic classification model with semantics extracted from BabelNet and DBpedia.

  2. We experiment with classifying the relevancy of tweets from 30 crisis events in 3 languages (English, Spanish, and Italian).

  3. We run relevancy classifiers with datasets translated into a single language, as well as with cross-lingual datasets.

  4. We show that adding semantics increases cross-lingual classification accuracy by 8.26%–9.07% in average \(F_1\) in comparison to traditional statistical models.

  5. We show that when datasets are translated into the same language, only the model that uses BabelNet semantics outperforms the statistical model, by 3.75%.

The paper is structured as follows: Sect. 2 summarises related work. Sections 3 and 4 describe our approach and experiments on classifying cross-lingual crisis data using different semantic features. Results are reported in Sects. 4.2 and 4.3. Discussion and conclusions are in Sects. 5 and 6.

2 Related Work

Classification of social media messages about crises and disasters in terms of their relevancy has already been addressed by a number of researchers [2, 3, 9,10,11,12, 22, 25]. The classification schemes differ, however: some classify messages simply as relevant (related) or not; some include a partly relevant category; while others include the notion of informativeness (where informative is taken to mean providing useful information about the event). For example, Olteanu et al. [16] use the categories related and informative, related but not informative, and not related. Others treat relevance and informativeness as two separate tasks [5].

Methods for this kind of classification use a variety of supervised machine learning approaches, usually relying on linguistic and statistical features such as POS tags, user mentions, post length, and hashtags [8,9,10, 19, 21]. Approaches range from traditional classification methods such as Support Vector Machines (SVM), Naive Bayes, and Conditional Random Fields [9, 17, 21] to the more recent use of deep learning and word embeddings [3].

One of the drawbacks of these approaches is their lack of adaptability to new kinds of data. [9] took early steps in this area by training a model on messages about the Joplin 2011 tornado and applying it to messages about Hurricane Sandy, although the two events are still quite similar. [12] took this further by using semantic information to adapt a relevance classifier to new crisis events, using 26 different events of varying types, and showed that the addition of semantics increases the adaptability of the classifier to new, unseen, types of crisis events. In this paper, we develop that approach further by examining whether semantic information can help not just with new events, but also with events in different languages.

In general, adapting classification tools to different languages is a problem for many NLP tasks, since it is often difficult to acquire sufficient data to train separate models for each language. This is especially true for tasks such as sentiment analysis, where leveraging information from data in different languages is required. In that field, two main solutions have been explored: either translating the data into a single language (normally English) and using this single dataset for training and/or testing [1]; or training a model using weakly-labelled data without supervision [6]. Severyn et al. [20] improved the performance of sentiment classification using distant pre-training of a CNN, which consists of inferring weak labels (emoticons) from a large set of multilingual tweets, followed by additional supervised training on a smaller set of manually annotated labels. In the other direction, annotation resources (such as sentiment lexicons) can be transferred from English into the target language to augment the training resources available [14]. A number of other approaches rely on having a set of correspondences between English and the target language(s), such as those which build distributed representations of words in multiple languages, e.g. using Wikipedia [24].

We test two similar approaches in this paper for the classification of information relevancy in crisis situations: (a) translate all datasets into a single language; (b) make use of high-quality feature information in English (and other languages) to supplement the training data of our target language(s).

As far as we know, while these kinds of language adaptation methods have been frequently applied to sentiment analysis, they have not been applied to crisis classification methods. Our work mainly extends previous work that uses hierarchical semantics from knowledge graphs to perform crisis-information classification with a supervised machine learning approach [11, 12], by generating statistical and semantic features for all relevant languages and then using these to train the models, regardless of which language is required.

3 Experiment Setup

Our aim is to train and validate a binary classifier that can automatically differentiate between crisis-related and not related tweets in cross-lingual scenarios. We generate the statistical and semantic features of tweets from different languages and then train the machine learning models accordingly. In the next sections we detail: (i) the datasets used in our experiments; (ii) the statistical and semantic sets of features used; and (iii) the classifier selection process.

3.1 Datasets

For this study, we chose datasets from multiple sources. From the CrisisLex platform (Footnote 3) we selected 3 datasets: CrisisLexT26, ChileEarthquakeT1, and SOSItalyT4. CrisisLexT26 is an annotated dataset of 26 different crisis events that occurred between 2012 and 2013. Each event has 1000 labeled tweets, with the labels ‘Related and Informative’, ‘Related but not Informative’, ‘Not Related’ and ‘Not Applicable’. These events occurred around the world and hence covered a range of languages. ChileEarthquakeT1 is a dataset of 2000 tweets in Spanish (from the Chilean earthquake of 2010), where all the tweets are labeled by relatedness (relevant or not relevant). The SOSItalyT4 set is a collection of tweets spanning 4 different natural disasters which occurred in Italy between 2009 and 2014, with almost 5.6k tweets labeled by the type of information they convey (“damage”, “no damage”, or “not relevant”). Based on the guidelines of the labeling, both “damage” and “no damage” indicate relevance.

We chose all the labeled tweets from these 3 collections. Next, we consolidated some of the labels, since we aim to build a binary classifier. From CrisisLexT26, we merged ‘Related and Informative’ and ‘Related but not Informative’ into the Related category, and merged ‘Not Related’ and ‘Not Applicable’ into the Not Related category. For SOSItalyT4, we added the tweets labeled as “damage” and “no damage” to the Related category, and those labeled “not relevant” to the Not Related category.

Finally, we removed all duplicate instances from the individual datasets to reduce content redundancy, comparing the tweets pairwise after removing special characters, URLs, and user-handles (i.e., ‘@’ mentions). This resulted in 21,378 Related and 2,965 Not Related documents in the CrisisLexT26 set, 924 Related and 1,238 Not Related in the ChileEarthquakeT1 set, and 4,372 Related and 878 Not Related in the SOSItalyT4 set.
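
For illustration, the following is a minimal sketch of this de-duplication step, assuming each dataset is a list of (text, label) pairs; the normalisation mirrors the rules described above, and the function names are our own:

    import re

    def normalise(text):
        """Strip URLs, '@' mentions, and special characters before comparing tweets."""
        text = re.sub(r"http\S+", " ", text)   # remove URLs
        text = re.sub(r"@\w+", " ", text)      # remove user-handles
        text = re.sub(r"[^\w\s]", " ", text)   # remove special characters
        return re.sub(r"\s+", " ", text).strip().lower()

    def deduplicate(tweets):
        """Keep only the first occurrence of each normalised tweet text."""
        seen, unique = set(), []
        for text, label in tweets:
            key = normalise(text)
            if key not in seen:
                seen.add(key)
                unique.append((text, label))
        return unique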

Next, we applied 3 different language detection APIs: detectlanguage (Footnote 4), langdetect (Footnote 5), and TextBlob (Footnote 6). We labeled the language of a tweet where at least 2 of the APIs agreed. The entire collection contained more than 30 languages, with English (en), Spanish (es), and Italian (it) comprising almost 92% of it (29,141 out of 31,755 tweets). Given this distribution, we focused our study on these 3 languages. We first created an unbalanced set (in terms of language) for training the classifier (see Table 1-unbalanced); to reduce the imbalance between Related and Not Related tweets, we selected only 8,146 of the 29,141 tweets. We then created a balanced version of the corpus, splitting the data into a training and test set for each language with equal distribution throughout, to remove any language bias (Table 1-balanced).
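
A minimal sketch of the 2-out-of-3 agreement rule is shown below; the three detector calls are abstracted behind generic callables, since each of the APIs above is invoked differently:

    from collections import Counter

    def detect_language(text, detectors):
        """Return a language code if at least 2 of the supplied detectors agree, else None.

        `detectors` is a list of callables, each taking the tweet text and
        returning an ISO language code (e.g. 'en', 'es', 'it').
        """
        votes = []
        for detect in detectors:
            try:
                votes.append(detect(text))
            except Exception:   # a detector may fail on very short or noisy tweets
                continue
        if not votes:
            return None
        code, count = Counter(votes).most_common(1)[0]
        return code if count >= 2 else None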

Table 1. Data size for English (en), Spanish (es), and Italian (it)

We also provide, in Table 2, a breakdown of all the original datasets to give an overview of the language distribution within each crisis event set.

Table 2. Language distribution (in %) in crisis events data

3.2 Feature Engineering

We define two types of feature sets: statistical and semantic. Statistical features are widely used in various text classification problems [8,9,10, 13, 21, 25] and so we consider these as our baseline approach. They capture the quantifiable statistical properties and linguistic features of a textual post, whereas semantic features capture the named entities and their associated hierarchical semantic information.

Statistical Features were extracted for each post in the dataset, following previous work, as follows:

  • Number of nouns: nouns refer to entities occurring in the posts, such as people, locations, or resources involved in the crisis event [8, 9, 21].

  • Number of verbs: these indicate actions occurring in a crisis event [8, 9, 21].

  • Number of pronouns: similar to nouns, pronouns include entities such as people, locations, or resources.

  • Tweet Length: total number of characters in a post. The length of a post could indicate the amount of information [8, 9, 19].

  • Number of words: similar to Tweet Length, number of words may also be an indicator of the amount of information [9, 10].

  • Number of Hashtags: these reflect the themes of a post, and are manually generated by the authors of the posts [8,9,10].

  • Unigrams: the entire data (text of each post) is tokenised and represented as unigrams [8,9,10, 13, 21, 25].

The spaCy library (Footnote 7) is used to extract the Part-Of-Speech (POS) features (e.g., nouns, verbs, pronouns). Unigrams are extracted with the regexp tokenizer provided in NLTK (Footnote 8). We removed stop words using a dedicated list (Footnote 9). Finally, we applied TF-IDF vector normalisation to the unigrams, in order to weight tokens according to their relative importance within the dataset, and represented the entire collection as a set of vectors. This results in a unigram vocabulary size (for each language in the balanced data, combining test and training data) of en-7495, es-7121, and it-4882.
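
As a concrete illustration, the following sketch computes the statistical features described above, using spaCy for the POS counts and scikit-learn's TfidfVectorizer in place of the NLTK regexp tokenizer plus manual TF-IDF step; model and parameter choices here are illustrative:

    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load("en_core_web_sm")   # a language-specific model is assumed per dataset

    def statistical_features(post):
        """Counts of nouns, verbs, pronouns, and hashtags, plus length features."""
        pos = [token.pos_ for token in nlp(post)]
        return {
            "n_nouns": pos.count("NOUN"),
            "n_verbs": pos.count("VERB"),
            "n_pronouns": pos.count("PRON"),
            "tweet_length": len(post),
            "n_words": len(post.split()),
            "n_hashtags": post.count("#"),
        }

    # TF-IDF weighted unigrams over the whole collection;
    # `posts` is assumed to be the list of tweet texts, and a per-language
    # stop-word list would replace the built-in English one used here.
    vectoriser = TfidfVectorizer(lowercase=True, stop_words="english")
    unigram_matrix = vectoriser.fit_transform(posts)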

Semantic Features are curated to generalise the information representation of the crisis situations across various languages. Semantic features are designed to be broader in context and less crisis-specific, in comparison to the actual text of the posts, thereby helping to resolve the problem of data sparsity. To this end, we use the Named Entity Recognition (NER) service Babelfy (Footnote 10), and two different knowledge bases for creating these features: BabelNet (Footnote 11) and DBpedia (Footnote 12). Note that the semantics extracted by these tools are in English, and hence they bring the multilingual datasets a bit closer linguistically. The following semantic information is extracted:

  • Babelfy Entities: Babelfy extracts the entities from each post in different languages (e.g., news, sadness, terremoto), and disambiguates them with respect to the BabelNet [15] knowledge base.

  • BabelNet Senses (English): for each entity extracted from Babelfy, the English labels associated with the entities are extracted (e.g. \({news{\rightarrow }news}\), \({sadness{\rightarrow }sadness}\), \({terremoto{\rightarrow }earthquake}\)).

  • BabelNet Hypernyms (English): for each entity, the direct hypernyms (at distance-1) are extracted from BabelNet and the main sense of each hypernym is retrieved in English (e.g., from our original entities, we now get broadcasting, communication, and emotion).

  • DBpedia Properties: for each annotated entity, we also get a DBpedia URI from Babelfy. The following properties associated with the DBpedia URIs are queried via SPARQL: dct:subject, rdfs:label (only in English), rdf:type (restricted to the http://schema.org and http://dbpedia.org/ontology namespaces), dbo:city, dbp:state, dbo:state, dbp:country and dbo:country (the location properties fluctuate between dbp and dbo); e.g., dbc:Grief, dbc:Emotions, dbr:Sadness. A query sketch is given after this list.
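
The query sketch referenced above is a minimal illustration of how such properties can be retrieved for a single DBpedia URI via the public SPARQL endpoint, using the SPARQLWrapper library; only dct:subject and the English rdfs:label are queried here, and the function name is our own:

    from SPARQLWrapper import SPARQLWrapper, JSON

    def dbpedia_properties(uri):
        """Fetch dct:subject values and the English rdfs:label for a DBpedia resource."""
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setQuery(f"""
            PREFIX dct:  <http://purl.org/dc/terms/>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?subject ?label WHERE {{
                <{uri}> dct:subject ?subject ;
                        rdfs:label  ?label .
                FILTER (lang(?label) = "en")
            }}
        """)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [(row["subject"]["value"], row["label"]["value"])
                for row in results["results"]["bindings"]]

    # e.g. dbpedia_properties("http://dbpedia.org/resource/Sadness")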

The inclusion of semantic features such as hypernyms has been shown to enhance the semantic and contextual representation of a document by correlating different entities, from different languages, with a similar context [12]. For example, the entities policeman, policía (Spanish for police), fireman, and MP (Military Police) all have a common hypernym (English): defender. By generalising the semantics in one language, English, we avoid the sparsity that often results from having various morphological forms of entities across different languages (see Table 3 for an example). Similarly, the English words floods and earthquake both have natural disaster as a hypernym, as does inondazione in Italian, ensuring that the Italian word is also recognised as crisis-relevant. Adding this semantic information, through BabelNet Semantics, results in unigram vocabulary sizes of en-12604, es-11791, and it-8544.
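
To make this expansion concrete, the following is a minimal sketch that appends the English BabelNet senses and distance-1 hypernyms to a post's token list, assuming the entity-to-(sense, hypernyms) mapping has already been retrieved from Babelfy/BabelNet; the small lookup table here is purely illustrative:

    # Illustrative mapping: entity surface form -> (English sense, English hypernyms)
    BABELNET = {
        "terremoto": ("earthquake", ["natural disaster"]),
        "sadness":   ("sadness", ["emotion"]),
        "news":      ("news", ["broadcasting", "communication"]),
    }

    def expand_with_semantics(tokens):
        """Append the English sense and hypernyms for any token that is a known entity."""
        expanded = list(tokens)
        for token in tokens:
            if token.lower() in BABELNET:
                sense, hypernyms = BABELNET[token.lower()]
                expanded.append(sense)
                expanded.extend(hypernyms)
        return expanded

    # expand_with_semantics(["terremoto", "in", "Emilia"])
    # -> ["terremoto", "in", "Emilia", "earthquake", "natural disaster"]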

Table 3. Semantic expansion with BabelNet and DBpedia semantics

Finally, we extract DBpedia properties of the entities (see Table 3) in the form of subject, label, and location-specific properties. This semantic expansion of the dataset forms the DBpedia Semantics component, and results in a vocabulary size in unigrams of: en-21905, es-15388, it-10674. The two types of semantic features (BabelNet and DBpedia) are used both individually and also in combination, to develop the binary classifier.

3.3 Classifier Selection

In addressing this binary classification problem, we took into consideration the high dimensionality resulting from the tweet unigrams and semantic features, and the need to avoid overfitting. The number of training instances (which varied between 1,200 and 4,500 under different experimental setups) was much smaller than the dimensionality of the feature space (which ranged between 9,000 and 20,000). We therefore opted for a Support Vector Machine (SVM) with a linear kernel [4] as the classification model. As discussed in [3, 11], SVM performs better than other common approaches, such as classification and regression trees (CART) and Naive Bayes, on similar classification problems. The work in [3] also shows almost identical performance (in terms of accuracy) between SVM and CNN models for tweet classification. In [11], we showed the appropriateness of the SVM linear kernel over the RBF kernel, the polynomial kernel, and Logistic Regression in such a classification scenario.
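
For illustration, a minimal sketch of this setup using scikit-learn's linear-kernel SVM over TF-IDF-weighted documents is shown below; the variable names and hyper-parameters are illustrative rather than the exact values used in our experiments:

    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score

    # `documents` is assumed to hold the (optionally semantically expanded) tweet texts,
    # and `labels` holds 1 for Related and 0 for Not Related.
    model = make_pipeline(
        TfidfVectorizer(lowercase=True),
        LinearSVC(C=1.0),
    )
    scores = cross_val_score(model, documents, labels, cv=5, scoring="f1")
    print("mean F1:", scores.mean())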

4 Cross-Lingual Classification of Crisis-Information

We demonstrate and validate our classification models through multiple experiments designed to test various criteria and models. We experiment on the models created with the following combinations of statistical and semantic features, thereby enabling us to assess the impact of each classification approach:

  • SF: uses only the statistical features; this model is our baseline.

  • SF+SemBN: combines statistical features with semantic features from BabelNet (entity sense, and their hypernyms in English, as explained in Sect. 3.2).

  • SF+SemDB: combines statistical features with semantic features from DBpedia (label in English, type, and other DBpedia properties).

  • SF+SemBNDB: combines statistical features with semantic features from BabelNet and DBpedia.

We apply and validate the models above in the following three experiments:

Monolingual Classification with Monolingual Models: In this experiment, we train the model on one language and test it on data in the same language. This tests the value of adding semantics to the classifier over the baseline when the language is the same.

Cross-lingual Classification with Monolingual Models: Here we evaluate the classifiers on crisis information in languages that were not observed in the training data. For example, we evaluate the classifier on Italian when the classifier was trained on English or Spanish.

Cross-lingual Classification with Machine Translation: In the third experiment, we evaluate the classifier when the model is trained on data in a certain language (e.g. Spanish), and used to classify information that has been automatically translated from other languages (e.g. Italian and English) into the language of the training data. The translation is performed using the Google Translate API (Footnote 13). To perform this experiment, we first translate the data from each of our three languages in turn into the other two languages.

All experiments are performed on both (i) the unbalanced dataset, to adhere to the natural distribution of these languages; and (ii) the balanced dataset, to remove the bias towards any particular language caused by the uneven distribution of languages in our datasets. By default, we refer to results from the balanced dataset unless we specifically mention the unbalanced one. Results are reported in terms of Precision (P), Recall (R), \(F_1\) score, and \(\varDelta F_1\), the percentage change over the baseline, computed as \(\varDelta F_1 = \frac{(F_{1}^{semantic} - F_{1}^{SF}) \times 100}{F_{1}^{SF}}\), where \(F_{1}^{SF}\) is the \(F_1\) score of the SF model and \(F_{1}^{semantic}\) that of the semantic model.
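
For clarity, the \(\varDelta F_1\) computation corresponds to the following small helper (a direct transcription of the formula above; the example values are arbitrary):

    def delta_f1(semantic_f1, sf_f1):
        """Percentage change of a semantic model's F1 over the SF baseline's F1."""
        return (semantic_f1 - sf_f1) * 100 / sf_f1

    # e.g. delta_f1(0.60, 0.55) -> 9.09...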

4.1 Results: Monolingual Classification with Monolingual Models

For the monolingual classification, a 5-fold cross validation approach was adopted and applied to individual datasets of English, Italian, and Spanish. Results in Table 4 show that adding semantics has no impact compared with the baseline (SF model) when the language of training and testing is the same.

Table 4. Monolingual Classification Models – 5-fold cross-validation (best \(F_1\) score is highlighted for each model). en, it, and es refer to English, Italian, and Spanish respectively.

4.2 Results: Cross-Lingual Classification with Monolingual Models

This experiment involves training on data in one language and testing on another. The results, shown in Table 5, indicate that when using the statistical features alone (SF, the baseline), the average \(F_{1}\) is 0.557. When semantics are included in the classifier, the average classification performance improvement (\(\varDelta F_1\)) is 8.26%–9.07%, with a standard deviation (SDV) between 10.9% and 13.86%, across all three semantic models and all test cases. Similarly, when applied to the unbalanced datasets, performance increases by 7.44%–9.78%.

While the highest gains are observed with SF+SemBNDB, SF+SemBN exhibits the most consistent performance, improving over the SF baseline in 5 out of 6 cross-lingual classification tests, whereas SF+SemDB and SF+SemBNDB each show improvement in 4 out of 6 tests.

Table 5. Cross-Lingual Classification Models (best \(F_1\) score is highlighted for each model).

4.3 Results: Cross-Lingual Crisis Classification with Machine Translation

The results from cross-lingual classification after language translation are presented in Table 6. For each training dataset, we translate the test data into the language of the training data. For example, when the training data is in English (en), the Italian data is translated to English, and is represented in the table as it2en. We aim to analyse two aspects here: (i) how semantics impacts the classifier on the translated content; and (ii) how the classifiers perform over the translated data in comparison to cross-lingual classifiers, as seen in Sect. 4.2.

Table 6. Cross-Lingual Crisis Classification with Machine Translation (best \(F_1\) score is highlighted for each event).

From the results in Table 6, we see that, based on the average % change \(\varDelta F_{1}\) over all translated test cases (en2it, es2it, etc.), SF+SemBN outperforms the statistical classifier (SF) by 3.75% (balanced data) with a standard deviation (SDV) of 4.57%. However, the other two semantic feature models (SF+SemDB and SF+SemBNDB) do not improve over the statistical features when the test and training data are in the same language (after translation). SF+SemBN shows improvement in 4 out of 6 translated test cases, the exceptions being the cases trained on Spanish (es).

Comparing the best performing model on translated data, i.e. SF+SemBN, against the overall baseline (the SF model from the cross-lingual classification, Table 5-balanced), SF+SemBN (translation) has an average \(F_{1}\) gain (\(\varDelta F\)) over the baseline of 15.23% across the translated test cases (with an SDV of 12.6%). For example, compare \(\varDelta F\) between it-en2it for SF+SemBN in the translated model and it-en for SF in the cross-lingual model, and similarly for the other 5 test cases. Based on the average of \(\varDelta F\) across all test cases, SF+SemBN from the cross-lingual models and SF+SemBN from the translation models both improve over the baseline (SF from the cross-lingual model), by 8.26% and 15.23% respectively.

4.4 Cross-Lingual Ranked Feature Correlation Analysis

To understand the impact of the semantics and the translation on the discriminative power of the cross-lingual data from different languages, we analysed the correlation between the ranked features of each dataset under different models. For this, we considered the balanced datasets for each language and took the entire data by merging the training and test sets of each language. Next, we calculated Information Gain over each dataset (English (en), Spanish (es), and Italian (it)), across all 4 models (SF, SF+SemBN, SF+SemDB, SF+SemBNDB). We also calculated Information Gain over the translated datasets (en2it, en2es, es2it, es2en, it2en, and it2es). This provides a ranked list of features, in terms of their discriminative power in the classifiers, for each selected dataset.

For each pair of datasets, such as English (en) - Spanish (es), we consider the common ranked features with an IG score \(> 0\), and calculate Spearman’s Rank Order Correlation (which ranges between \([-1, 1]\)) across the two ranked lists. For the translated data, we analysed pairs where one dataset is translated into the language of the other, such as en-it2en and it-en2it. A sketch of this step is shown below.
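
The sketch below illustrates this step, using scikit-learn's mutual information estimate as a stand-in for the Information Gain score and SciPy's Spearman correlation; variable and function names are our own:

    from scipy.stats import spearmanr
    from sklearn.feature_selection import mutual_info_classif

    def ranked_features(X, y, feature_names):
        """Return {feature: rank} for features with an information-gain score > 0."""
        scores = mutual_info_classif(X, y, discrete_features=True)
        kept = [(name, score) for name, score in zip(feature_names, scores) if score > 0]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        return {name: rank for rank, (name, _) in enumerate(kept)}

    def rank_correlation(ranks_a, ranks_b):
        """Spearman correlation over the features common to both ranked lists."""
        common = sorted(set(ranks_a) & set(ranks_b))
        rho, _ = spearmanr([ranks_a[f] for f in common],
                           [ranks_b[f] for f in common])
        return rho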

Table 7 shows how the correlation varies across the data. These variations can be attributed to a number of factors. The overlap of crisis events in the sampled data is a crucial one: the data was sampled by language, and the separation of the source events (Table 2) was not taken into consideration. This can particularly be observed in the en-es correlation, where the highest correlation is obtained without semantics, which also explains the better performance of the SF model over the semantic models when trained on en and evaluated on es (Table 5-balanced). The correlation between en and it is approximately −0.179, which indicates almost no correlation. The increase in discriminative-feature correlations between datasets once semantics are added is in part due to the semantics being extracted in English (see Sect. 3.2), which brings the terminologies closer both semantically and linguistically.

Translating the data into the same language increases the correlation. This is expected for multiple reasons. Firstly, having the data in the same language enables the identification of more similar features, such as verbs and adjectives, across the datasets. Secondly, given the similarity of the event types covered by the three languages, such as floods and earthquakes, the nature of the information is likely to have a high contextual overlap.

Table 7. Spearman’s Rank Order Correlation between ranked informative features (based on IG) across models and languages

5 Discussion and Future Work

Our aim is to create hybrid models, by mixing semantic features with the statistical features, to produce a crisis data classification model that is largely language-agnostic. The work was limited to English, Spanish, and Italian, due to the lack of sufficient data annotations in other languages. We are currently designing a CrowdFlower annotation task to expand our annotations to several other languages.

We ran our experiments on both balanced and unbalanced datasets. However, performance over the balanced dataset provides a fairer comparison, since biases towards the dominant languages are removed. We also experimented with classifying data in their original languages, as well as with automatically translating the data into the language of the training data. Results show that with balanced datasets, translation improves the performance of all classifiers, and reduces the benefits of using semantics in comparison to the statistical classifier (SF; the baseline). One could conclude that if the data is to be translated into the same language that the model was trained on, then the statistical model (SF) might be sufficient, whereas if translation is not viable (e.g., when data arrives in unpredicted languages, or when translations are too inaccurate or untrustworthy), then the model that mixes statistical and semantic features is recommended, since it produces higher classification accuracies.

In this work, the classifiers were trained and tested on data from various types of crisis events. It is natural for some nouns to be identical across various languages, such as names of crises (e.g. Typhoon Yolanda), places, and people. In future work we will measure the level of terminological overlap between the datasets of different languages.

We augmented all datasets with semantics in English (Sect. 3.2). This is mainly because BabelNet (version 3.7) is heavily biased towards English (Footnote 14). Most existing entity extractors are also skewed heavily towards English, and hence, as a byproduct of adding their identified semantics, more terms (concepts) in a single language (English) are added to the datasets. As a consequence, the datasets of different languages are brought closer together linguistically, giving an advantage to semantic models over purely statistical ones in the context of cross-lingual analysis. We compared the vocabulary similarity between the language datasets, before and after the addition of semantics, to quantify this overlap. For instance, the cosine similarity (without semantics) between en-it is 0.311, en-es is 0.536, and it-es is 0.32; adding semantics increased the cosine similarity across all the datasets. In the current experiments, we had 6 test cases for each classification model; despite the consistency observed across the 6 cross-lingual test cases, more observations would be needed to establish that the gain achieved by the semantic models over the baseline models is statistically significant. Repeating these experiments over more languages should help; alternatively, creating multiple train and test splits for each test case could complement such analysis, which was not feasible in this study due to insufficient data. However, we did perform 10 iterations of 5-fold cross-validation over the entire dataset across all the feature sets, and found that the SF+SemBN (BabelNet semantics) model outperformed all others, in particular the baseline, with statistical significance (p = 0.0192, two-tailed t-test).

In this work, we experimented with training the model on one language at a time. Another possibility is to train the model on multiple languages, thus increasing its ability to classify data in those languages. However, generating such a multilingual model is not always feasible, since it requires annotated data in all the languages it is intended to analyse. Furthermore, the need for models that can handle other languages is likely to remain, since the language of data shared on social media during crises differs substantially depending on where these crises take place. Therefore, the ability of a model to classify data in a new language will always be a clear advantage. The curated data (with semantics) and code used in this work are being made available for research purposes (Footnote 15).

6 Conclusion

Determining which tweets are relevant to a given crisis situation is important for making more efficient use of social media and for improving situational awareness. In this paper, we demonstrated the ability of various models to classify crisis-related information from social media posts in multiple languages. We tested two approaches: (1) adding semantics (from BabelNet and DBpedia) to the datasets; and (2) automatically translating the datasets into the language that the model was trained on. Through multiple experiments, we showed that all our semantic models outperform statistical ones in the first approach, whereas only one semantic model (using BabelNet) shows an improvement over the statistical model in the second approach.