Example of Application of n-grams: Authorship Attribution Using Syllables

  • Grigori Sidorov
Part of the SpringerBriefs in Computer Science book series (BRIEFSCOMPUTER)


As described in the previous chapters, the mainstream of modern computational linguistics is based on the application of machine learning methods. We formulate our task as a classification task, represent our objects formally using features and their values (thus constructing a vector space model), and then apply well-known classification algorithms. In this pipeline, the crucial question is how to select the features. For example, we can use words, word n-grams (sequences of words), or character n-grams (sequences of characters) as features. An interesting question arises: can we use syllables as features? This is very rarely done in computational linguistics, although syllables do correspond to a certain linguistic reality. This chapter explores this possibility for the authorship attribution task; it follows our research paper [99]. Note that syllables are somewhat similar to character n-grams in that they consist of several characters and are not too long.
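To make the feature types mentioned above concrete, the following is a minimal sketch, not the method used in the chapter or in [99]. It extracts word n-grams, character n-grams, and syllable-like units from a text and turns them into count vectors, which is one simple way to build a vector space model. The function names and the `naive_syllables` heuristic are assumptions made for illustration only; real syllabification requires language-specific rules.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams as tuples of n consecutive lowercase words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams over the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def naive_syllables(word):
    """Hypothetical, very rough splitter (an assumption, not a real
    syllabifier): each chunk is a run of consonants followed by a run
    of vowels; trailing consonants are attached to the last chunk."""
    chunks = re.findall(r"[^aeiouy]*[aeiouy]+", word)
    if not chunks:
        return [word]  # no vowel found: keep the word as one unit
    tail = word[len("".join(chunks)):]
    chunks[-1] += tail
    return chunks


def syllable_features(text):
    """Split every word of the text into syllable-like units."""
    words = re.findall(r"[a-z]+", text.lower())
    return [syl for w in words for syl in naive_syllables(w)]


def vectorize(features, vocabulary):
    """Map a list of features to a count vector over a fixed vocabulary."""
    counts = Counter(features)
    return [counts[f] for f in vocabulary]


if __name__ == "__main__":
    docs = ["The author writes short texts.", "Another author writes more."]
    feats = [syllable_features(d) for d in docs]
    vocabulary = sorted(set(f for doc in feats for f in doc))
    for doc_feats in feats:
        print(vectorize(doc_feats, vocabulary))
```

The resulting count vectors could then be passed to any standard classifier; swapping `syllable_features` for `word_ngrams` or `char_ngrams` changes only the feature type, not the rest of the pipeline.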


  1. Abbasi, A., Chen, H.: Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems, Vol. 20, No. 5, pp. 67–75 (2005)
  5. Argamon, S., Whitelaw, C., Chase, P., Hota, S.R., Garg, N., Levitan, S.: Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, Vol. 58, No. 6, pp. 802–822 (2007)
  9. Burrows, J.: Word-patterns and story-shapes: The statistical analysis of narrative style. Literary and Linguistic Computing, Vol. 2, No. 2, pp. 61–70 (1987)
  14. Daelemans, W.: Explanation in computational stylometry. In: Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics, pp. 451–462 (2013)
  17. Diederich, J., Kindermann, J., Leopold, E., Paass, G.: Authorship attribution with support vector machines. Applied Intelligence, Vol. 19, No. 1–2, pp. 109–123 (2003)
  21. Feng, L., Jansche, M., Huenerfauth, M., Elhadad, N.: A comparison of features for automatic readability assessment. In: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 276–284 (2010)
  24. Fucks, W.: On the mathematical analysis of style. Biometrika, Vol. 39, No. 1–2, pp. 122–129 (1952)
  42. Gómez-Adorno, H., Sidorov, G., Pinto, D., Markov, I.: A graph based authorship identification approach. Working Notes Papers of the CLEF 2015 Evaluation Labs, Vol. 1391 (2015)
  43. Grieve, J.: Quantitative authorship attribution: A history and an evaluation of techniques. MSc thesis, Simon Fraser University (2005)
  45. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explorations, Vol. 11, No. 1, pp. 10–18 (2009)
  48. Holmes, D.: Authorship attribution. Computers and the Humanities, Vol. 28, No. 2, pp. 87–106 (1994)
  50. Jarvis, S., Bestgen, Y., Pepper, S.: Maximizing classification accuracy in native language identification. In: Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 111–118 (2013)
  52. Juola, P.: Authorship attribution. Foundations and Trends in Information Retrieval, Vol. 1, No. 3, pp. 233–334 (2006)
  55. Kestemont, M.: Function words in authorship attribution. From black magic to theory? In: Proceedings of the 3rd Workshop on Computational Linguistics for Literature, pp. 59–66 (2014)
  58. Koppel, M., Winter, Y.: Determining if two documents are written by the same author. Journal of the American Society for Information Science and Technology, Vol. 65, No. 1, pp. 178–187 (2014)
  62. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, Vol. 5, pp. 361–397 (2004)
  65. Luyckx, K., Daelemans, W.: Authorship attribution and verification with many authors and limited data. In: Proceedings of the 22nd International Conference on Computational Linguistics, pp. 513–520 (2008)
  67. Markov, I., Baptista, J., Pichardo-Lagunas, O.: Authorship attribution in Portuguese using character n-grams. Acta Polytechnica Hungarica, Vol. 14, No. 3, pp. 59–78 (2017)
  68. Markov, I., Gómez-Adorno, H., Posadas-Durán, J.-P., Sidorov, G., Gelbukh, A.: Author profiling with doc2vec neural network-based document embeddings. In: Proceedings of the 15th Mexican International Conference on Artificial Intelligence, LNAI, Vol. 10062, pp. 117–131 (2017)
  69. Markov, I., Gómez-Adorno, H., Sidorov, G.: Language- and subtask-dependent feature selection and classifier parameter tuning for author profiling. Working Notes Papers of the CLEF 2017 Evaluation Labs, Vol. 1866 (2017)
  70. Markov, I., Stamatatos, E., Sidorov, G.: Improving cross-topic authorship attribution: The role of pre-processing. In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing (2017)
  71. McNamara, D., Louwerse, M., McCarthy, P., Graesser, A.: Coh-Metrix: Capturing linguistic features of cohesion. Discourse Processes, Vol. 47, No. 4, pp. 292–330 (2010)
  73. Mendenhall, T.: The characteristic curves of composition. Science, Vol. 9, No. 214, pp. 237–249 (1887)
  78. Mosteller, F., Wallace, D.L.: Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley Publishing Company (1964) (Reprinted: Stanford: Center for the Study of Language and Information (2008))
  83. Pentel, A.: Effect of different feature types on age based classification of short texts. In: Proceedings of the 6th International Conference on Information, Intelligence, Systems and Applications, pp. 1–7 (2015)
  85. Posadas-Durán, J.-P., Gómez-Adorno, H., Sidorov, G., Batyrshin, I., Pinto, D., Chanona-Hernandez, L.: Application of the distributed document representation in the authorship attribution task for small corpora. Soft Computing, Vol. 21, No. 3, pp. 627–639 (2016)
  86. Qian, T., Liu, B., Chen, L., Peng, Z.: Tri-training for authorship attribution with limited training data. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 345–351 (2014)
  90. Sapkota, U., Solorio, T., Montes-y-Gómez, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: Will out-of-topic data help? In: Proceedings of the 25th International Conference on Computational Linguistics, pp. 1228–1237 (2014)
  91. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies, pp. 93–102 (2015)
  99. Sidorov, G.: Automatic Authorship Attribution Using Syllables as Classification Features. Rhema, Vol. 1, pp. 62–81 (2018)
  102. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, Vol. 60, No. 3, pp. 538–556 (2009)
  103. Stamatatos, E.: On the robustness of authorship attribution based on character n-gram features. Journal of Law & Policy, Vol. 21, pp. 427–439 (2013)
  104. Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification task at PAN 2014. Working Notes of CLEF 2014 - Conference and Labs of the Evaluation forum, pp. 877–897 (2014)
  105. Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the author identification task at PAN 2015. Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum (2015)
  106. Stamatatos, E., Kokkinakis, G., Fakotakis, N.: Automatic text categorization in terms of genre and author. Computational Linguistics, Vol. 26, No. 4, pp. 471–495 (2000)
  107. Van Halteren, H.: Linguistic profiling for author recognition and verification. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (2004)

Copyright information

© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Grigori Sidorov (1)
  1. Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico City, Mexico
