Author Profiling with Doc2vec Neural Network-Based Document Embeddings

  • Ilia Markov
  • Helena Gómez-Adorno
  • Juan-Pablo Posadas-Durán
  • Grigori Sidorov
  • Alexander Gelbukh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10062)

Abstract

To determine the demographics of authors of social media texts such as tweets, blogs, and reviews, we train a logistic regression classifier on doc2vec document embeddings. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settings the neural network-based features outperform traditional features when the same classifier is used. Our method outperforms the existing state of the art under some settings, though the current state-of-the-art results on those tasks have been quite weak.
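
The pipeline described above (documents mapped to fixed-length vectors, then a logistic regression classifier) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: it is a toy PV-DBOW-style embedding with negative sampling, followed by a hand-written binary logistic regression. All function names, hyperparameters, documents, and labels below are hypothetical, chosen only for illustration. In practice one would more likely use existing implementations, such as the `Doc2Vec` class in gensim and `LogisticRegression` in scikit-learn.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_doc_vectors(docs, dim=8, epochs=40, lr=0.05, negatives=3, seed=0):
    """Toy PV-DBOW: learn one vector per document by predicting its words.

    `docs` is a list of token lists. Hyperparameters are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    doc_vecs = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in docs]
    word_out = [[0.0] * dim for _ in vocab]  # output-side word vectors
    for epoch in range(epochs):
        for d, doc in zip(doc_vecs, docs):
            for w in doc:
                pos = index[w]
                # One observed word (label 1) plus random negatives (label 0).
                targets = [(pos, 1.0)]
                for _ in range(negatives):
                    neg = rng.randrange(len(vocab))
                    if neg != pos:
                        targets.append((neg, 0.0))
                grad_d = [0.0] * dim
                for wi, label in targets:
                    u = word_out[wi]
                    g = lr * (label - sigmoid(dot(d, u)))
                    for k in range(dim):
                        grad_d[k] += g * u[k]  # accumulate update for the doc vector
                        u[k] += g * d[k]       # update the word's output vector
                for k in range(dim):
                    d[k] += grad_d[k]
    return doc_vecs


def train_logreg(X, y, epochs=200, lr=0.5):
    """Binary logistic regression fitted by stochastic gradient ascent."""
    w, b = [0.0] * len(X[0]), 0.0
    for epoch in range(epochs):
        for x, t in zip(X, y):
            g = lr * (t - sigmoid(dot(w, x) + b))
            w = [wi + g * xi for wi, xi in zip(w, x)]
            b += g
    return w, b


def predict_proba(model, x):
    """Probability of class 1 for the document vector `x`."""
    w, b = model
    return sigmoid(dot(w, x) + b)


# Hypothetical toy corpus: four labeled documents and one unlabeled document.
DOCS = [
    "exam homework campus lecture exam".split(),
    "campus lecture homework dorm".split(),
    "mortgage pension commute office".split(),
    "office pension mortgage meeting commute".split(),
    "lecture dorm exam".split(),  # unlabeled, to be classified
]
LABELS = [0, 0, 1, 1]  # hypothetical binary profile labels for the first four

if __name__ == "__main__":
    vecs = train_doc_vectors(DOCS)          # embed all documents together
    model = train_logreg(vecs[:4], LABELS)  # fit on the labeled ones only
    print(predict_proba(model, vecs[4]))    # score the unlabeled document
```

The sketch embeds labeled and unlabeled documents in one pass for brevity; a library implementation would typically infer vectors for unseen documents in a separate step, and this excerpt does not say which procedure the authors followed.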

Keywords

Document embeddings, doc2vec, Neural networks, Machine learning, Author profiling

Notes

Acknowledgments

This work was partially supported by the Mexican Government (CONACYT projects 240844 and 20161958, SNI, COFAA-IPN, SIP-IPN 20151406, 20161947, 20161958, 20151589, 20162204, and 20162064).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Ilia Markov (1)
  • Helena Gómez-Adorno (1)
  • Juan-Pablo Posadas-Durán (2)
  • Grigori Sidorov (1)
  • Alexander Gelbukh (1)

  1. CIC, Instituto Politécnico Nacional (IPN), Mexico City, Mexico
  2. ESIME-Zacatenco, Instituto Politécnico Nacional (IPN), Mexico City, Mexico