The Affects of Demographics Differentiations on Authorship Identification

Chapter
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 60)

Abstract

Many previous studies have examined how the language of a text differs with the demographic attributes of its author. This investigation differs by posing a new question: is male writing style more consistent than female style, or the opposite? We also study how style differs with age. To answer these questions, we apply authorship identification within each demographic category and compare the identification accuracy across categories. We use personal blogs or diaries, whose text properties differ from those of other text types such as essays, emails, or articles. The investigation uses a couple of intuitive feature sets and studies various parameters that affect identification performance. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, larger feature sets. The analysis also confirms the usefulness of a common users’ classifier, based on common demographic attributes, in improving performance on the author identification task.
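
The comparison described in the abstract (run authorship identification separately within each demographic category, then compare accuracies) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the chapter: the function-word list, the nearest-centroid classifier, the helper names, and the toy texts are all hypothetical stand-ins for the feature sets and classifiers the authors actually used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical compact feature set: relative frequencies of a few function words.
FUNCTION_WORDS = ("the", "and", "i", "to", "of", "a", "my", "so")


def features(text):
    """Map a text to the relative frequency of each chosen function word."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(train, test):
    """Nearest-centroid author identification; returns accuracy on `test`.

    `train` and `test` are lists of (author, text) pairs.
    """
    by_author = defaultdict(list)
    for author, text in train:
        by_author[author].append(features(text))
    centroids = {a: centroid(vs) for a, vs in by_author.items()}
    correct = 0
    for author, text in test:
        x = features(text)
        predicted = min(centroids, key=lambda a: distance(x, centroids[a]))
        correct += predicted == author
    return correct / len(test)


def accuracy_by_group(train, test, group_of):
    """Run identification within each demographic group separately."""
    results = {}
    for group in sorted(set(group_of.values())):
        tr = [(a, t) for a, t in train if group_of[a] == group]
        te = [(a, t) for a, t in test if group_of[a] == group]
        if tr and te:  # skip groups without both training and test texts
            results[group] = identify(tr, te)
    return results


# Toy, made-up data purely to show the call pattern (not real blog text).
group_of = {"u1": "female", "u2": "female", "u3": "male", "u4": "male"}
train = [
    ("u1", "i went to the market and i bought a gift for my sister"),
    ("u2", "so the weather was awful so we stayed in the house"),
    ("u3", "a review of the game and a summary of the season"),
    ("u4", "my plan is to finish my project and to start my thesis"),
]
test = [
    ("u1", "i called my friend and i told her the news"),
    ("u2", "so we drove home so late that the roads were empty"),
    ("u3", "a list of the books and a note on the style of each"),
    ("u4", "my goal is to run to the park and to train my dog"),
]
print(accuracy_by_group(train, test, group_of))
```

Under this framing, a higher within-group accuracy suggests that authors in that group are easier to tell apart from one another; with real data, each group would need comparable numbers of authors and texts for the comparison to be meaningful.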

Keywords

Web mining; information extraction; psycholinguistic; machine learning; authorship identification; demographics differentiation

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. School of Computer Science, University of Lincoln, Lincoln, UK
