Twitter Geolocation Prediction using Neural Networks

Knowing the location of a user is important for several use cases, such as location-specific recommendations, demographic analysis, or monitoring of disaster outbreaks. We present a bottom-up study on the impact of text- and metadata-derived contextual features for Twitter geolocation prediction. The final model incorporates individual types of tweet information and achieves state-of-the-art performance on a publicly available test set. The source code of our implementation, together with individual models, is freely available at github-url.blinded.for.review.


Introduction
Data from social media platforms is an attractive real-time resource for data analysts. It can be used for a wide range of use cases, such as monitoring fire and flu outbreaks (Power et al., 2013), providing location-based recommendations (Ye et al., 2010), or conducting demographic analyses (Sloan et al., 2013). Although some platforms, such as Twitter, allow users to geolocate posts, Jurgens et al. (2015) reported that less than 3 % of all Twitter posts are geotagged. This severely impacts the use of social media data for such location-specific applications.
The location prediction task can be tackled either as a classification problem or as a multi-target regression problem. In the former case the goal is to predict a city label for a specific tweet, whereas in the latter case the goal is to predict latitude and longitude coordinates for a given tweet. Previous studies showed that text in combination with metadata can be used to predict user locations (Han et al., 2014). Liu and Inkpen (2015) presented a system based on stacked denoising auto-encoders (Vincent et al., 2008) for location prediction. State-of-the-art approaches, however, often make use of very specific, non-generalizing features based on website scraping, IP resolution, or external resources such as GeoNames. In contrast, we present an approach for geographical location prediction that achieves state-of-the-art results using neural networks trained solely on Twitter text and metadata. It does not require external knowledge sources, and hence generalizes more easily to new domains and languages.
The remainder of this publication is organized as follows: First, we provide an overview of related work for Twitter location prediction. In Section 3 we describe the details of our neural network architecture. Results on the test set are shown in Section 4. Finally, we conclude the paper with some future directions in Section 5.

Related Work
For better comparability of our approach, we focus on the shared task presented at the 2nd Workshop on Noisy User-generated Text (WNUT'16) (Han et al., 2016). The organizers introduced a dataset to evaluate individual approaches for tweet- and user-level location prediction. For tweet-level prediction the goal is to predict the location of one specific message, while for user-level prediction the goal is to predict the user's location based on a variable number of their messages. In the following, we focus on tweet-level prediction, as it is more practical in real-world applications (Han et al., 2016). The organizers evaluated team submissions based on accuracy and distance in kilometers. The latter metric accounts for wrong but geographically close predictions, for example, when the model predicts Vienna instead of Budapest.
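The distance-based metric is a great-circle distance, which can be computed with the haversine formula. A minimal sketch (the city coordinates below are approximate and purely illustrative):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# e.g. predicting Vienna (48.21, 16.37) for a gold label of
# Budapest (47.50, 19.04) is off by roughly 214 km
error_km = haversine_km(48.21, 16.37, 47.50, 19.04)
```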
We focus on the five teams that participated in the WNUT shared task. Official team results for tweet- and user-level prediction are shown in Table 1. Unfortunately, only three participants provided system descriptions, which we briefly summarize: Team FujiXerox (Miura et al., 2016) built a neural network using text, user-declared locations, timezone values, and user self-descriptions. For feature preprocessing, the authors built several mapping services using external resources such as GeoNames and time zone boundaries. Finally, they trained a neural network using the fastText n-gram model (Joulin et al., 2016) on post text, user location, user description, and user timezone.
Team csiro (Jayasinghe et al., 2016) used an ensemble learning method built on several information resources. First, the authors use post text, user location text, user time zone information, messenger source (e.g., Android or iPhone), and reverse country lookups for URL mentions to build a list of candidate cities contained in GeoNames. Furthermore, URL mentions were scraped and the website metadata was screened for geographic coordinates; the authors implemented custom scrapers for websites that are frequently shared on Twitter and sometimes provide latitude and longitude in their metadata. Second, a relationship network is built from tweets mentioning other users. Third, posts are used to find similar texts in the training data and to calculate a class-label probability from the most similar tweets. Fourth, text is classified using the geotagging tool pigeo. The output of the individual stages is then fed into an ensemble learner.

Methods
We used the WNUT'16 shared task data consisting of 12,827,165 tweet IDs, which have been assigned to a metropolitan city center from the GeoNames database (http://www.geonames.org/) using the strategy described in Han et al. (2012). As Twitter does not permit sharing individual tweets, posts need to be retrieved via the Twitter API, through which we were able to retrieve 9,127,900 (71.2 %) of them. The remaining tweets are no longer available, usually because users deleted these messages. In comparison, the winners of the WNUT'16 task (Miura et al., 2016) reported that they were able to retrieve 9,472,450 (73.8 %) tweets. The overall training data covers 3,362 individual class labels (i.e., GeoNames cities). In our subset of approximately 9 million tweets, we observed only 3,315 different classes.
For text preprocessing, we use a simple whitespace tokenizer with lowercasing, without any domain-specific processing such as Unicode normalization (Davis et al., 2001) or lexical text normalization (see, for instance, Han and Baldwin (2011)). The text of tweets and the metadata fields containing text (user description, user location, user name, timezone) are converted to word embeddings (Mikolov et al., 2013), which are then forwarded to a Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). In our experiments we randomly initialize the embedding vectors. We use batch normalization (Ioffe and Szegedy, 2015) to normalize inputs and thereby reduce internal covariate shift. The risk of overfitting through co-adapting units is reduced by applying dropout (Srivastava et al., 2014) between individual neural network layers. An example architecture for textual data is shown in Figure 1. Links mentioned in the post are handled slightly differently: we build character embeddings and feed them into an LSTM layer. Metadata fields with a finite set of elements (UTC time and source type) are directly represented as one-hot encodings.
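As an illustration, one text branch of such an architecture (embedding lookup, batch normalization, dropout, LSTM) could be sketched as follows in PyTorch. All hyperparameter values here are illustrative placeholders, not the tuned settings from Table 3:

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """One branch of the architecture: token IDs -> embeddings -> LSTM.

    Embedding and hidden sizes are illustrative guesses.
    """
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, dropout=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # randomly initialized
        self.norm = nn.BatchNorm1d(emb_dim)            # normalize embedded inputs
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        x = self.emb(token_ids)                        # (batch, seq, emb_dim)
        # BatchNorm1d expects (batch, channels, seq), hence the transposes
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        x = self.drop(x)
        _, (h, _) = self.lstm(x)                       # final hidden state
        return h[-1]                                   # (batch, hidden_dim)

branch = TextBranch(vocab_size=1000)
out = branch(torch.randint(0, 1000, (4, 12)))          # 4 tweets, 12 tokens each
```

A character-level branch for links would look the same, with characters instead of words as the embedded units.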
We connect all eight individual neural architectures with a dense layer for classification using a softmax activation function. We use stochastic gradient descent over shuffled mini-batches with Adam (Kingma and Ba, 2014) and cross-entropy loss as the objective function for classification. For parameter tuning we tested different settings on a randomly selected validation set consisting of 50,000 tweets. The final parameters of our model are shown in Table 3. The WNUT'16 task requires the model to predict class labels as well as longitude/latitude pairs. To account for this, we predict the mean city longitude/latitude given the class label. For user-level prediction, we classify all messages individually and predict the city label with the highest probability over all messages.
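The label-to-coordinate step and the user-level aggregation can be sketched as follows; the training records and probability values below are invented for illustration:

```python
from collections import defaultdict

# Toy training records (city_label, latitude, longitude) -- invented values.
train = [("vienna", 48.21, 16.37), ("vienna", 48.19, 16.41),
         ("budapest", 47.50, 19.04)]

# Mean city coordinates per class label, used to turn a predicted
# label into a longitude/latitude pair.
acc = defaultdict(lambda: [0.0, 0.0, 0])
for city, lat, lon in train:
    a = acc[city]
    a[0] += lat; a[1] += lon; a[2] += 1
centroids = {c: (a[0] / a[2], a[1] / a[2]) for c, a in acc.items()}

def user_level_label(message_probs):
    """Aggregate per-message class probabilities and return the label
    with the highest total probability over all of a user's messages."""
    totals = defaultdict(float)
    for probs in message_probs:
        for label, p in probs.items():
            totals[label] += p
    return max(totals, key=totals.get)
```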

Model combination
The internal representations of all eight different resources (i.e., text, user description, user location, user name, user timezone, links, UTC, and source) are concatenated to build a final tweet representation. We then evaluate two training strategies: In the first training regime, we train the combined model from scratch. The parameters of all word- and character-level embeddings, as well as all network layers, are initialized randomly. The parameters of the full model, including the softmax layer combining the output of the six individual LSTM models and the two metadata models, are learned jointly. For the second strategy, we first train each LSTM model separately and then keep their parameters fixed while training only the final softmax layer.
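The second strategy amounts to freezing the pretrained branches and optimizing only the final classification layer. A minimal PyTorch sketch, with a toy LSTM standing in for one pretrained branch (all dimensions are illustrative except the 3,362 city labels):

```python
import torch
import torch.nn as nn

branch = nn.LSTM(100, 128, batch_first=True)  # stands in for a pretrained submodel
for p in branch.parameters():
    p.requires_grad = False                   # keep pretrained weights fixed

head = nn.Linear(128, 3362)                   # final layer over the city labels

# Only the head's (still trainable) parameters are passed to the optimizer,
# so backpropagation leaves the frozen branch untouched.
optimizer = torch.optim.Adam(
    [p for p in head.parameters() if p.requires_grad], lr=1e-3)
```

In the full model, the frozen outputs of all branches would be concatenated before the final layer.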

Results
The individual performance of our different models is shown in Table 4. As a simple baseline, we predict the city label most frequently observed in the training data (Jakarta, Indonesia). According to our bottom-up analysis, the user-location metadata is the most productive type of information for tweet- and user-level location prediction. Using the text alone, we can correctly predict the location of 19.3 % of all tweets, with a median distance of 2,128 kilometers to the correct location. Aggregating pretrained models also improves all three evaluation metrics in comparison to training a model from scratch. For tweet-level prediction, our best merged model outperforms the best submission (FujiXerox.2) in terms of accuracy, median, and mean distance by 1.4 percentage points, 18.4 kilometers, and 392.1 kilometers, respectively. The ensemble learning method (csiro) outperforms our best model in terms of accuracy by 1.3 percentage points, but our model performs considerably better on median and mean distance, by 23.6 and 1,137.8 kilometers, respectively. Additionally, the approach of csiro requires several dedicated services, such as GeoNames gazetteers, time zone to GeoNames mappings, an IP-to-country resolver, and customized scrapers for social media websites. The authors describe custom link handling for FourSquare, Swarm, Path, Facebook, and Instagram. In our training data, these websites account for 1,941,079 (87.5 %) of all 2,217,267 shared links. It is therefore tempting to speculate that customized scrapers for these websites could further boost our results for location prediction. As team cogeo uses only the text of a tweet, the results of cogeo.1 are comparable with our text model. Our text model outperforms this approach in terms of accuracy, median, and mean distance to the gold standard by 4.7 percentage points, 1,296 kilometers, and 934 kilometers, respectively.

Table 3: Tweet-level results ranked by median error distance (in kilometers). Individual best results for all three criteria are highlighted in bold face. Full-scratch refers to a merged model trained from scratch, whereas the weights of the full-fixed model are only retrained where applicable. The baseline predicts the location most frequently observed in the training data (Jakarta).
For user-level prediction, our method performs on a par with the individual best results collected from the three top team submissions (FujiXerox.2, csiro.1, and FujiXerox.1).

Conclusion
We presented a neural network architecture for the prediction of city labels and geo-coordinates of tweets. We focus on the classification task and derive longitude/latitude information from the predicted city label. We evaluated models for individual types of Twitter (meta)data in a bottom-up fashion and identified highly location-indicative fields. The proposed combination of individual models requires no customized text preprocessing, specific website crawlers, database lookups, or IP-to-country resolution, while achieving state-of-the-art performance on a publicly available data set. For better comparability, source code and pretrained models are freely available to the community.
As future work, we plan to incorporate images as another type of metadata for location prediction using the approach presented by Simonyan and Zisserman (2014).