Author Profiling with Doc2vec Neural Network-Based Document Embeddings

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10062)

Abstract

To determine author demographics of texts in social media such as Twitter, blogs, and reviews, we use doc2vec document embeddings to train a logistic regression classifier. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settings the neural network-based features outperform the traditional features when using the same classifier. Our method outperforms existing state of the art under some settings, though the current state-of-the-art results on those tasks have been quite weak.

Notes

  1. http://pan.webis.de [last access: 17.07.2016]. All other URLs in this document were also verified on this date.

  2. http://www.autoritas.es/prsoco/.

  3. https://radimrehurek.com/gensim/.

Acknowledgments

This work was partially supported by the Mexican Government (CONACYT projects 240844 and 20161958, SNI, COFAA-IPN, SIP-IPN 20151406, 20161947, 20161958, 20151589, 20162204, and 20162064).

Corresponding author

Correspondence to Ilia Markov.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Markov, I., Gómez-Adorno, H., Posadas-Durán, JP., Sidorov, G., Gelbukh, A. (2017). Author Profiling with Doc2vec Neural Network-Based Document Embeddings. In: Pichardo-Lagunas, O., Miranda-Jiménez, S. (eds) Advances in Soft Computing. MICAI 2016. Lecture Notes in Computer Science, vol 10062. Springer, Cham. https://doi.org/10.1007/978-3-319-62428-0_9

  • DOI: https://doi.org/10.1007/978-3-319-62428-0_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62427-3

  • Online ISBN: 978-3-319-62428-0

  • eBook Packages: Computer Science (R0)
