An Empirical Evaluation of SVM on Meta Features for Authorship Attribution of Online Texts

  • Hongwei Yao
  • Tieyun Qian
  • Li Chen
  • Manyun Qian
  • Xueyu Mo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8284)

Abstract

Authorship attribution (AA) has been studied by many researchers. Recently, with the spread of online texts, authorship attribution of online texts has started to receive a great deal of attention. The essence of this problem is to identify a set of features that can capture the writing style of an author. However, previous studies on feature identification mainly used statistical methods and conducted experiments on small data sets, i.e., fewer than 10. This scale is far from the real application of AA of online texts. In addition, due to the special characteristics of online texts, statistical approaches are rarely used for this problem. As the performance of authorship identification depends highly on the combination of the features used and the classification methods, the feature sets for traditional authorship attribution need to be re-examined using machine learning approaches. In this paper, we evaluate the effectiveness of six types of meta features on two public data sets with SVM, a well-established machine learning technique. The experimental results show that lexical and syntactic features are the most promising features for AA of online texts. Furthermore, a number of interesting findings regarding the impact of different types of features on authorship attribution are discovered through our experiments.
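The abstract does not enumerate the six feature types, so the following is only an illustrative sketch of what a lexical style feature vector might look like before it is handed to a classifier such as an SVM. The function name `lexical_features`, the tiny `FUNCTION_WORDS` list, and the choice of features (function-word frequencies, average word length, type-token ratio) are assumptions made for illustration, not the authors' feature set.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; this is not the list used in the paper).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def lexical_features(text):
    """Return a fixed-length list of simple lexical style features."""
    # Lowercase and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        # Empty input: return an all-zero vector of the same length.
        return [0.0] * (len(FUNCTION_WORDS) + 2)
    counts = Counter(tokens)
    # Relative frequency of each function word.
    vector = [counts[w] / total for w in FUNCTION_WORDS]
    # Average word length in characters.
    vector.append(sum(len(t) for t in tokens) / total)
    # Type-token ratio, a rough measure of vocabulary richness.
    vector.append(len(counts) / total)
    return vector
```

One vector per document, paired with the author label, would then form the training input for whichever SVM implementation is used.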

Keywords

authorship attribution of online texts; meta features; comparative evaluation

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Hongwei Yao (1)
  • Tieyun Qian (1)
  • Li Chen (2)
  • Manyun Qian (2)
  • Xueyu Mo (2)
  1. State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China
  2. Department of Computer Science, Central China Normal University, Wuhan, China
