Abstract
The gender of a Twitter user is not known a priori when analysing Twitter data, because user registration does not include gender information. This paper proposes an approach for creating extended gender-labelled datasets of Twitter users. The process first creates a smaller database of active Twitter users and manually labels their gender. It then extracts features from the unstructured information found in each user profile and builds a gender classification model. The model is then applied to a larger dataset, providing automatic labels and corresponding confidence scores, which can be used to select the most accurately labelled users. The resulting databases can be further enriched with additional information extracted, for example, from the profile picture and from the user location. The proposed approach was successfully applied to English and Portuguese users, leading to two large datasets containing more than 57K labelled users each.
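The pipeline described in the abstract (label a small seed set, train a classifier, apply it to a larger set, and keep only confident predictions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the features or the classifier, so the character trigrams over display names, the small Naive Bayes model, the seed names, and the 0.8 threshold are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, boundary-padded string.

    Assumption: trigram features over a display name stand in for the
    paper's profile features, which the abstract does not specify.
    """
    padded = f"^{text.lower()}$"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative model)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict_proba(self, text):
        """Return a dict mapping each class label to its posterior probability."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for gram in char_ngrams(text):
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            log_scores[label] = score
        # Normalise in log space to avoid underflow.
        peak = max(log_scores.values())
        unnorm = {label: math.exp(s - peak) for label, s in log_scores.items()}
        z = sum(unnorm.values())
        return {label: value / z for label, value in unnorm.items()}


def label_users(model, names, min_confidence=0.8):
    """Auto-label names, keeping only predictions at or above the threshold."""
    labelled = []
    for name in names:
        proba = model.predict_proba(name)
        label = max(proba, key=proba.get)
        if proba[label] >= min_confidence:
            labelled.append((name, label, proba[label]))
    return labelled


# Hypothetical, manually labelled seed set (the "smaller database").
seed = [
    ("maria silva", "F"), ("ana costa", "F"), ("joana", "F"),
    ("john smith", "M"), ("pedro santos", "M"), ("carlos", "M"),
]
model = NaiveBayes().fit([n for n, _ in seed], [g for _, g in seed])

# Hypothetical larger, unlabelled set.
unlabelled = ["mariana", "joao pedro", "xyz123"]
confident = label_users(model, unlabelled, min_confidence=0.8)
for name, gender, confidence in confident:
    print(f"{name}\t{gender}\t{confidence:.2f}")
```

Raising `min_confidence` trades dataset size for label accuracy, which is the role the abstract assigns to the confidence scores.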
Notes
- 1. Gay and transsexual users, as well as company profiles, are not in the scope of this study.
Acknowledgments
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Vicente, M., Batista, F., Carvalho, J.P. (2016). Creating Extended Gender Labelled Datasets of Twitter Users. In: Carvalho, J., Lesot, MJ., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2016. Communications in Computer and Information Science, vol 611. Springer, Cham. https://doi.org/10.1007/978-3-319-40581-0_56
DOI: https://doi.org/10.1007/978-3-319-40581-0_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40580-3
Online ISBN: 978-3-319-40581-0
eBook Packages: Computer Science, Computer Science (R0)