Creativity and Universality in Language, pp. 143–155

Part of the Lecture Notes in Morphogenesis book series (LECTMORPH)

Universality of Stylistic Traits in Texts

Chapter

Abstract

The style of documents is an important property that can be used as a discriminant factor in text mining applications. Among the many measures proposed to quantify writing style, some features can be characterized as universal, in the sense that they can be easily extracted from any kind of text in practically any natural language and provide accurate results when used in style-based text categorization tasks. In this paper we examine whether such universal stylometric features remain effective under difficult scenarios where the topic and/or genre of the documents used in the training phase differ from those of the questioned documents. Based on a series of experiments in authorship attribution, we demonstrate that character n-gram features are reliable and effective, provided that an appropriate number of features is used. We also show that when the number of candidate authors increases, the representation dimensionality should also increase to improve classification results.
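To make the notion of a character n-gram representation concrete, the following is a minimal sketch of how such features could be extracted. It is not the chapter's implementation: the n-gram order (`n=3`), the selection of the most frequent n-grams, the use of relative frequencies, and the function names `char_ngrams`, `build_vocabulary`, and `vectorize` are all assumptions made for illustration. The `max_features` parameter stands in for the representation dimensionality mentioned in the abstract.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(documents, n=3, max_features=1000):
    """Keep the max_features most frequent n-grams in the training documents.

    Selecting by raw frequency is an illustrative assumption, not
    necessarily the selection criterion used in the chapter.
    """
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:max_features]]


def vectorize(text, vocabulary, n=3):
    """Represent a text as relative frequencies of the vocabulary n-grams."""
    grams = char_ngrams(text, n)
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero for very short texts
    return [counts[g] / total for g in vocabulary]
```

In this sketch the vocabulary is built from the training documents only, and both training and questioned documents are then mapped to vectors with `vectorize`. Because no tokenizer or language-specific tool is involved, the same code applies to text in any language, which is the sense of "universal" used above. Varying `max_features` is one way to explore how dimensionality interacts with the number of candidate authors.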

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. University of the Aegean, Karlovassi, Greece
