Creating Extended Gender Labelled Datasets of Twitter Users

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 611)

Abstract

The gender of a Twitter user is not known a priori when analysing Twitter data, because user registration does not include gender information. This paper proposes an approach for creating extended gender-labelled datasets of Twitter users. The process starts by creating a smaller database of active Twitter users and manually labelling their gender. It continues by extracting features from the unstructured information found in each user profile and by training a gender classification model. The model is then applied to a larger dataset, providing automatic labels and corresponding confidence scores, which can be used to identify the most accurately labelled users. The resulting databases can be further enriched with additional information extracted, for example, from the profile picture and from the user location. The proposed approach was successfully applied to English and Portuguese users, leading to two large datasets containing more than 57K labelled users each.
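The pipeline described in the abstract (label a small seed set, train a classifier on profile text, label a larger set automatically, keep only confident predictions) can be illustrated with a minimal sketch. This is not the authors' implementation: the naive Bayes classifier, the `name` and `description` field names, the toy profiles, and the 0.7 threshold are all assumptions chosen only to keep the example short and self-contained.

```python
import math
from collections import Counter, defaultdict


def features(profile):
    """Lowercased word tokens from the free-text profile fields (assumed names)."""
    text = f"{profile.get('name', '')} {profile.get('description', '')}"
    return text.lower().split()


def train(labelled):
    """Fit a multinomial naive Bayes model on (profile, label) pairs."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for profile, label in labelled:
        label_counts[label] += 1
        token_counts[label].update(features(profile))
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, label_counts, vocab


def predict(model, profile):
    """Return (label, confidence); confidence is the posterior of the best label."""
    token_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        # Add-one smoothing over the training vocabulary.
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(label_counts[label] / total)
        for tok in features(profile):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalise the log scores into posterior probabilities.
    top = max(log_scores.values())
    exp_scores = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / sum(exp_scores.values())


def extend(model, unlabelled, threshold):
    """Automatically label profiles, keeping only those at or above the threshold."""
    kept = []
    for profile in unlabelled:
        label, confidence = predict(model, profile)
        if confidence >= threshold:
            kept.append((profile, label, confidence))
    return kept


# Hypothetical, hand-made seed set standing in for the manually labelled users.
SEED = [
    ({"name": "Maria Silva", "description": "mother and teacher"}, "female"),
    ({"name": "Ana Costa", "description": "proud mother"}, "female"),
    ({"name": "John Smith", "description": "father and engineer"}, "male"),
    ({"name": "Peter Jones", "description": "proud father"}, "male"),
]
# Hypothetical larger set without labels.
UNLABELLED = [
    {"name": "Maria Jones", "description": "mother"},
    {"name": "Xyz", "description": "news"},
]

model = train(SEED)
extended = extend(model, UNLABELLED, threshold=0.7)
for profile, label, confidence in extended:
    print(profile["name"], label, round(confidence, 2))
```

Raising the threshold trades dataset size for label reliability: profiles with no informative tokens, such as the second one here, receive a confidence near chance and are left out of the extended dataset.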

Notes

  1. Gay and transsexual users, as well as profiles from companies, are outside the scope of this study.

  2. http://www.faceplusplus.com/.

  3. http://geonames.usgs.gov/domestic/download_data.htm.

Acknowledgments

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.

Corresponding author

Correspondence to Joao Paulo Carvalho.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Vicente, M., Batista, F., Carvalho, J.P. (2016). Creating Extended Gender Labelled Datasets of Twitter Users. In: Carvalho, J., Lesot, MJ., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2016. Communications in Computer and Information Science, vol 611. Springer, Cham. https://doi.org/10.1007/978-3-319-40581-0_56

  • DOI: https://doi.org/10.1007/978-3-319-40581-0_56

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40580-3

  • Online ISBN: 978-3-319-40581-0

  • eBook Packages: Computer Science, Computer Science (R0)
