I, Me, Mine: The Role of Personal Phrases in Author Profiling

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9822)

Abstract

The Author Profiling (AP) task aims to distinguish between groups of authors labeled by a common demographic characteristic, such as gender or age, by studying their language use. In this work we studied the role of personal phrases (i.e., sentences containing first person pronouns) in the AP task. We support the idea that people better expose their personal interests and writing style when they talk about themselves and, consequently, that words near a personal pronoun reveal valuable information for the classification of authors. The evaluation on different social media data showed that phrases containing singular first person pronouns are highly valuable for predicting the age and gender of users. Using only these phrases reduced the information in the user documents by up to 60% while yielding classification performance comparable to that obtained with all available data. In addition, the results obtained with personal phrases considerably outperformed those from non-personal sentences, indicating their greater suitability for the AP task. We believe these findings could be further applied in the design of strategies for the construction of AP corpora, novel feature selection methods, and new feature and instance weighting schemes.
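
As a rough illustration of the filtering idea described in the abstract, the following Python sketch keeps only the sentences of a user document that contain a singular first person pronoun and reports how much text is discarded. It is a minimal sketch under stated assumptions, not the authors' implementation: the pronoun list, the regex-based sentence splitter, and the function names (`split_sentences`, `is_personal`, `personal_phrases`, `reduction`) are all hypothetical choices made for this example.

```python
import re

# Assumption: an English list of singular first person pronouns.
# The exact pronoun list used in the paper is not given on this page.
SINGULAR_FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def is_personal(sentence):
    """True if the sentence contains a singular first person pronoun."""
    # Tokenizing on letters only turns "I'm" into "i" and "m",
    # so contractions still match the pronoun "i".
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok in SINGULAR_FIRST_PERSON for tok in tokens)


def personal_phrases(text):
    """Return the sentences of `text` that qualify as personal phrases."""
    return [s for s in split_sentences(text) if is_personal(s)]


def reduction(text):
    """Fraction of whitespace-separated words dropped by the filter."""
    total = len(text.split())
    kept = sum(len(s.split()) for s in personal_phrases(text))
    return 1 - kept / total if total else 0.0


doc = (
    "I love hiking with my dog. The weather was cold today. "
    "It's a nice trail! Can you believe I'm forty?"
)
print(personal_phrases(doc))
print(round(reduction(doc), 2))
```

In a full pipeline the retained sentences would replace the original user document before feature extraction and classification; those steps are outside the scope of this sketch.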

Notes

  1. In this context, documents are commonly referred to as user profiles or user histories, and they correspond to all textual information generated by a user, for example, all posts from her blog or the set of tweets from her account.

  2. http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm.

  3. http://pan.webis.de/clef14/pan14-web/author-profiling.html.

  4. POS tags were obtained using the Stanford tagger: http://nlp.stanford.edu/software/tagger.shtml.

Acknowledgments

This work was supported under CONACYT project no. 247870 and scholarship 243957.

Corresponding author

Correspondence to Rosa María Ortega-Mendoza.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ortega-Mendoza, R.M., Franco-Arcega, A., López-Monroy, A.P., Montes-y-Gómez, M. (2016). I, Me, Mine: The Role of Personal Phrases in Author Profiling. In: Fuhr, N., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol 9822. Springer, Cham. https://doi.org/10.1007/978-3-319-44564-9_9

  • DOI: https://doi.org/10.1007/978-3-319-44564-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44563-2

  • Online ISBN: 978-3-319-44564-9

  • eBook Packages: Computer Science, Computer Science (R0)
