Strategies for combining Twitter users geo-location methods

GeoInformatica

Abstract

Twitter has become a major player in the social media scene, with over half a billion users and over 500 million tweets published daily. Researchers have seen in this abundant data an opportunity to monitor events and track epidemics. In this type of application, knowing the location of the user is essential. However, most of the location information self-reported by users is difficult to process, and barely 1% of all published tweets are geolocated. Hence, user location inference is often performed by analyzing publicly available information from the user's profile and tweets. In this work, we evaluate and compare 16 approaches for user location inference based on different information sources, including interaction networks and the text of tweets. We show that methods working with the user friendship network obtain higher accuracy and recall than the other methods. From these results, we measure the agreement of pairs of methods regarding the predicted location and the users they cover. We find that most methods disagree in their inferences while covering different sets of users. These results open up an opportunity to combine different methods in order to improve location accuracy and user recall. We propose four methods for combining the outputs of the evaluated methods. Two of them, one based on a weighted voting scheme (GAVe) and another based on a meta-decision tree, cover at least 98% of the users in the dataset while locating 75% of them within 100 km of their real location.
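As a rough illustration of the ideas in the abstract, the sketch below combines the outputs of several location methods by weighted voting and then measures user coverage and accuracy within 100 km. It is not the authors' GAVe implementation: the function names, the method names used as dictionary keys, the way weights are supplied, and the choice to compute accuracy over covered users only are all assumptions made for the example.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def weighted_vote(predictions, weights):
    """Pick the location with the largest summed weight.

    predictions: {method_name: location or None}  (None = method gave no answer)
    weights:     {method_name: float}             (hypothetical, e.g. tuned on held-out data)
    Returns None when no method produced a prediction.
    """
    scores = defaultdict(float)
    for method, location in predictions.items():
        if location is not None:
            scores[location] += weights.get(method, 0.0)
    return max(scores, key=scores.get) if scores else None

def coverage_and_accuracy(predicted, truth, coords, radius_km=100.0):
    """Coverage: share of users with a prediction.
    Accuracy: share of covered users placed within radius_km of their true
    location (computing it over covered users is an assumption of this sketch).
    """
    covered = [u for u in truth if predicted.get(u) is not None]
    hits = sum(
        haversine_km(coords[predicted[u]], coords[truth[u]]) <= radius_km
        for u in covered
    )
    coverage = len(covered) / len(truth) if truth else 0.0
    accuracy = hits / len(covered) if covered else 0.0
    return coverage, accuracy
```

Because a vote only needs one method to answer, a combination like this can cover users that any single method misses, which is the effect the abstract describes.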

Acknowledgments

This work was partially funded by CAPES, CNPq and FAPEMIG, all Brazilian research agencies. The authors would like to thank David Jurgens for providing the source code for the four network-based methods.

Author information

Corresponding author

Correspondence to Gisele L. Pappa.

Appendix A: Example of a meta-decision tree

Figure 7 shows an example of a simplified meta-decision tree obtained when training on one of the folds of the cross-validation procedure. Note that the trees are not necessarily small: we found trees with up to 41 nodes.

Fig. 7 Meta-decision tree generated for one of the data folds of the cross-validation procedure

The root of the tree in Fig. 7 tests the average log probability returned by the temporal partitions when the user's tweets are classified with Naive Bayes (NB). If this value is less than or equal to -0.01 and the exact match method applied to the self-declared location field returned the correct location, exact match is used to predict the user's location. Otherwise, if exact match did not return the correct location, the left branch of the tree is followed, which again tests the average log probability returned by the temporal partitions under NB. If it is greater than -2.06, Naive Bayes on the tweets is chosen as the final classifier to predict the user's location. If not, the average log probability returned by the temporal partitions when the tweets are classified with logistic regression (LR) is checked, and this test decides whether to use LR or FindMe on the mentions network. The same logic applies when reading the left branch of the tree from the root.
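The path described above can be written as a small routing function, shown here only as a reading aid. The thresholds -0.01 and -2.06 come from the text; everything else is an assumption of this sketch: the class and function names, the returned method labels, and the encoding of the exact match test as a boolean. The text does not state the threshold or direction of the LR test, so it is left as a caller-supplied predicate, and the other branch from the root is not detailed in the text, so the function returns `None` there.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MetaAttributes:
    """Hypothetical container for the attributes tested on the described path."""
    nb_avg_log_prob: float   # average log probability over temporal partitions, Naive Bayes
    lr_avg_log_prob: float   # average log probability over temporal partitions, logistic regression
    exact_match_ok: bool     # outcome of the exact match test on the self-declared location field

def choose_method(attrs: MetaAttributes,
                  lr_test: Callable[[float], bool]) -> Optional[str]:
    """Return the label of the base method selected for one user.

    lr_test stands in for the LR node, whose threshold is not given in the text.
    """
    if attrs.nb_avg_log_prob > -0.01:
        return None  # other root branch: not detailed in the text
    if attrs.exact_match_ok:
        return "exact_match_profile"
    if attrs.nb_avg_log_prob > -2.06:
        return "naive_bayes_tweets"
    if lr_test(attrs.lr_avg_log_prob):
        return "logistic_regression_tweets"
    return "findme_mentions"
```

A call such as `choose_method(MetaAttributes(-3.0, -1.0, False), lambda v: v > -2.0)` exercises the LR node; the `-2.0` in that predicate is a placeholder, not a value from the paper.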

Note that, in this tree, more emphasis was given to the text of the tweets. This might be related to the fact that the predictions made by these methods return higher degrees of confidence, and hence are preferred over network methods. Apart from the text methods, the mentions network with FindMe also appears among the tree choices for classifiers.

However, this model is not unique. Many other trees with different combinations of methods can be generated depending on the data fold given as input, because the fold affects the calculated accuracy of the methods (see Eq. 2) and consequently changes the order in which the ordinary attributes appear in the tree.

Cite this article

Ribeiro, S., Pappa, G.L. Strategies for combining Twitter users geo-location methods. Geoinformatica 22, 563–587 (2018). https://doi.org/10.1007/s10707-017-0296-z
