Large Scale Personality Classification of Bloggers

  • Francisco Iacobelli
  • Alastair J. Gill
  • Scott Nowson
  • Jon Oberlander
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6975)


Personality is a fundamental component of an individual’s affective behavior. Previous work on personality classification has emerged from disparate sources: the variety of algorithms and feature-selection methods applied across spoken and written data has made comparison difficult. Here, we use a large corpus of blogs to compare feature-selection approaches for classification; we also use these results to identify characteristic language information relating to personality. Using Support Vector Machines, the best accuracies range from 84.36% (openness to experience) to 70.51% (neuroticism). The best-performing feature set combined: (1) stemmed bigrams; (2) no exclusion of stopwords (i.e. common words); and (3) boolean values recording the presence or absence of each feature, rather than its rate of use. We take these findings to suggest that both the structure of the text and the presence of common words are important. We also note that a common dictionary of words used for content analysis (LIWC) performs less well in this classification task, which we propose is due to its conceptual breadth. To get a better sense of how personality is expressed in the blogs, we explore the best-performing features and discuss how these can provide a deeper understanding of personality language behavior online.
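As an illustration of the feature representation described in the abstract (stemmed bigrams, stopwords retained, boolean presence rather than frequency), the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the function names are hypothetical, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer, an assumption made only to keep the example self-contained.

```python
import re


def crude_stem(token):
    """Rough suffix stripper (assumption: stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stemmed_bigrams(text):
    """Lowercase, tokenize, stem, and pair adjacent tokens.

    Stopwords are deliberately NOT removed, so pairs such as
    ("to", "the") are kept as features.
    """
    tokens = [crude_stem(t) for t in re.findall(r"[a-z']+", text.lower())]
    return list(zip(tokens, tokens[1:]))


def build_vocabulary(documents):
    """Map every bigram seen in the documents to a column index."""
    vocab = sorted({bg for doc in documents for bg in stemmed_bigrams(doc)})
    return {bg: i for i, bg in enumerate(vocab)}


def boolean_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the bigram occurs at all, regardless of count."""
    vec = [0] * len(vocabulary)
    for bg in stemmed_bigrams(text):
        idx = vocabulary.get(bg)
        if idx is not None:
            vec[idx] = 1
    return vec


if __name__ == "__main__":
    posts = ["I was walking to the shops", "the shop the shop"]
    vocab = build_vocabulary(posts)
    for post in posts:
        print(boolean_vector(post, vocab))
```

Vectors of this kind could then be passed to a linear Support Vector Machine implementation of one's choice; the classifier itself is omitted here because the abstract does not specify its configuration.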


Machine Learning Personality Classification 





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Francisco Iacobelli (1)
  • Alastair J. Gill (2)
  • Scott Nowson (3)
  • Jon Oberlander (4)

  1. Northeastern Illinois University, Chicago, USA
  2. University of Surrey, Guildford, UK
  3. Appen Pty Ltd, Chatswood, Australia
  4. University of Edinburgh, Edinburgh, UK
