Abstract
Twitter has become a major player in the social media scene, with over half a billion users and over 500 million tweets published daily. Seeing this abundance of data, researchers have explored it for monitoring events and tracking epidemics. In these applications, knowing the location of the user is essential. However, most of the location information self-reported by users is difficult to process, and barely 1% of all published tweets are geolocated. Hence, user location is often inferred by analyzing publicly available information from the user's profile and tweets. In this work, we evaluate and compare 16 approaches for user location inference based on different information sources, including interaction networks and the text of tweets. We show that methods working with the user friendship network obtain higher accuracy and recall than the other methods. Building on these results, we examine how far pairs of methods agree on the predicted location and on the users they cover. We find that most methods disagree in their inferences while covering different sets of users. These results open up an opportunity to combine different methods in order to improve location accuracy and user recall. We propose four methods for combining the outputs of the evaluated methods. Two of them, one based on a weighted voting scheme (GAVe) and another based on a meta-decision tree, cover at least 98% of the users in the dataset while locating 75% of them within 100 km of their real location.
Acknowledgments
This work was partially funded by CAPES, CNPq and FAPEMIG, all Brazilian research agencies. The authors would like to thank David Jurgens for providing the source code for the four network-based methods.
Appendix A: Example of a meta-decision tree
Figure 7 shows an example of a simplified meta-decision tree obtained when training on one of the folds of the cross-validation procedure. Note that the trees are not necessarily small: we found trees with up to 41 nodes.
The root of the tree in Fig. 7 tests the average log probability returned by the temporal partitions when classifying the user's tweets with Naive Bayes (NB). If this value is less than or equal to -0.01 and the result returned by the exact match method on the self-declared location field was correct, exact match is used to predict the location of the user. Otherwise, if the exact match did not return the correct location, the left branch of the tree is followed, and the average log probability returned by the temporal partitions for NB is tested again. If it is greater than -2.06, Naive Bayes on the tweets is chosen as the final classifier to predict the location of the user. If not, the average log probability returned by the temporal partitions when classifying the user's tweets with logistic regression (LR) is checked to decide whether to use LR or to fall back on FindMe on the mentions network. The same logic applies when reading the left branch of the tree from the root.
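The path just described can be read as a chain of nested tests. The following Python sketch is only an illustration of that reading and is not the authors' implementation. The names `MetaFeatures`, `choose_method` and `exact_match_ok` are hypothetical, the LR threshold is not stated in the text and is therefore passed in as a parameter, the direction of the LR comparison is an assumption, and the root branch that the text does not spell out returns `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaFeatures:
    """Hypothetical container for the attributes tested by the tree."""
    nb_avg_log_prob: float  # average log probability over temporal partitions, Naive Bayes
    lr_avg_log_prob: float  # average log probability over temporal partitions, logistic regression
    exact_match_ok: bool    # outcome of the exact-match test on the self-declared location field


def choose_method(f: MetaFeatures, lr_threshold: float) -> Optional[str]:
    """Return the base method selected by the described path of the tree.

    lr_threshold is an assumption: the text does not give its value,
    nor which side of the test selects LR.
    """
    if f.nb_avg_log_prob <= -0.01:
        if f.exact_match_ok:
            return "exact_match"
        if f.nb_avg_log_prob > -2.06:
            return "naive_bayes_tweets"
        # Assumed direction: a higher LR log probability favours LR.
        if f.lr_avg_log_prob > lr_threshold:
            return "logistic_regression_tweets"
        return "findme_mentions"
    # The other branch from the root is not detailed in the text.
    return None


if __name__ == "__main__":
    example = MetaFeatures(nb_avg_log_prob=-0.5, lr_avg_log_prob=-1.0, exact_match_ok=False)
    print(choose_method(example, lr_threshold=-1.5))
```

Keeping the tree as explicit conditionals makes each threshold from the text visible. A tree learned on a different fold would need different tests and thresholds.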
Note that this tree gives more emphasis to the text of the tweets. This may be because the text-based methods return predictions with higher degrees of confidence and are hence preferred over the network methods. Apart from the text methods, FindMe on the mentions network also appears among the classifiers the tree can choose.
However, it is important to point out that this model is not unique. Many other trees with different combinations of methods can be generated, depending on the data fold given as input, which affects the calculation of the accuracy of the methods (see Eq. 2) and consequently changes the order in which the ordinary attributes appear in the tree.
Cite this article
Ribeiro, S., Pappa, G.L. Strategies for combining Twitter users geo-location methods. Geoinformatica 22, 563–587 (2018). https://doi.org/10.1007/s10707-017-0296-z
DOI: https://doi.org/10.1007/s10707-017-0296-z