Semantic Feature Aggregation for Gender Identification in Russian Facebook

Panicheva, Polina; Mirzagitova, Aliia; Ledovaya, Yanina

doi:10.1007/978-3-319-71746-3_1

Part of the book series: Communications in Computer and Information Science (CCIS, volume 789)

Included in the following conference series:

Conference on Artificial Intelligence and Natural Language

Abstract

This work evaluates semantic feature aggregation techniques for gender classification of public social media texts in Russian. We collect Facebook posts by Russian-speaking users and use them as training data for two topic modelling techniques and a distributional clustering approach. The output of each algorithm then serves as a feature aggregation method for gender classification on a smaller Facebook sample. The best model compares favorably with the lemma baseline and with state-of-the-art results reported for a different genre or language. We give examples of the most successful features and discuss how the three techniques differ in classification performance and feature content; the best technique clearly outperforms the others.
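
The aggregation idea in the abstract can be illustrated with a minimal sketch. It assumes that a topic model or clustering step has already assigned each lemma to one semantic group; the `aggregate_features` function, the group names, and the toy lemmas are hypothetical and are not taken from the paper. The sketch only shows how lemma counts for one user could be collapsed into a smaller set of group-level features before classification.

```python
from collections import Counter


def aggregate_features(lemmas, lemma_to_group):
    """Collapse a user's lemmas into relative frequencies of semantic groups.

    Lemmas missing from the mapping are ignored. Returns an empty dict
    when no lemma belongs to any group.
    """
    counts = Counter(lemma_to_group[l] for l in lemmas if l in lemma_to_group)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


# Hypothetical lemma-to-group mapping (English glosses for readability).
lemma_to_group = {
    "mother": "family",
    "son": "family",
    "match": "sport",
    "goal": "sport",
}

# Lemmas from one hypothetical user's posts; "today" is in no group.
user_lemmas = ["mother", "son", "goal", "today"]
features = aggregate_features(user_lemmas, lemma_to_group)
```

The resulting dictionary has one entry per semantic group rather than one per lemma, which is the dimensionality reduction that makes aggregated features usable on a small labelled sample. A hard one-group-per-lemma mapping is a simplification; topic models typically assign words to topics with probabilities.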

Notes

1. https://yandex.ru/
2. http://wwbp.org

Acknowledgments

The authors acknowledge Saint-Petersburg State University for research grant 8.38.351.2015. The reported study was also supported by RFBR grant 16-06-00529.

Author information

Authors and Affiliations

St. Petersburg State University, Universitetskaya nab. 7-9, 199034, St. Petersburg, Russia
Polina Panicheva, Aliia Mirzagitova & Yanina Ledovaya

Corresponding author

Correspondence to Polina Panicheva.

Editor information

Editors and Affiliations

ITMO University, St. Petersburg, Russia
Andrey Filchenkov
University of Helsinki, Helsinki, Finland
Lidia Pivovarova
Mendel University, Brno, Czech Republic
Jan Žižka

Appendix

Table 6. Significant lemmas (English translation)

Table 7. Significant LDA topics (English translation)

Table 8. Significant clusters (English translation)

Table 9. Significant ATM topics (English translation)

About this paper

Cite this paper

Panicheva, P., Mirzagitova, A., Ledovaya, Y. (2018). Semantic Feature Aggregation for Gender Identification in Russian Facebook. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, vol 789. Springer, Cham. https://doi.org/10.1007/978-3-319-71746-3_1

DOI: https://doi.org/10.1007/978-3-319-71746-3_1
Published: 28 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71745-6
Online ISBN: 978-3-319-71746-3
eBook Packages: Computer Science, Computer Science (R0)
