Unsupervised Classification of Translated Texts

  • Sergiu Nisioi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9103)


In this paper we investigate whether an unsupervised classifier can automatically distinguish between the translated and original novels of a multilingual writer (Vladimir Nabokov), and whether the authorship of a translated document can be determined. We employ a rank-based document vector representation using only function words as features. To obtain the results, we propose a generalization of Ward’s hierarchical clustering method that is compatible with any similarity metric.
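As a rough illustration of what a rank-based function-word representation can look like, the sketch below turns a text into a vector of frequency ranks over a fixed word list and compares two such vectors with the Manhattan distance. It is a minimal sketch under stated assumptions, not the paper's implementation: the word list `FUNCTION_WORDS`, the whitespace tokenization, the average-rank handling of ties, and the helper names `rank_vector` and `manhattan` are all hypothetical choices made for this example.

```python
from collections import Counter

# Hypothetical, tiny function-word list; the list used in the paper is not given here.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "a"]


def rank_vector(text, vocabulary=FUNCTION_WORDS):
    """Return the rank of each vocabulary word by descending frequency in `text`.

    Rank 1 is the most frequent word. Words with equal frequency share the
    average of the ranks they span (an assumption made for this sketch).
    Tokenization is a plain lowercase whitespace split, so punctuation
    attached to a word is not stripped.
    """
    counts = Counter(text.lower().split())
    freqs = [counts[word] for word in vocabulary]
    # Indices of vocabulary words, most frequent first.
    order = sorted(range(len(vocabulary)), key=lambda i: -freqs[i])
    ranks = [0.0] * len(vocabulary)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of words tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and freqs[order[end + 1]] == freqs[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length rank vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


doc_a = rank_vector("the the the of of a")
doc_b = rank_vector("a a a of of the")
distance = manhattan(doc_a, doc_b)
```

Pairwise distances computed this way could then be handed to any hierarchical clustering routine that accepts a precomputed dissimilarity matrix; which linkage rule corresponds to the generalization of Ward’s method proposed in the paper is not specified in this abstract.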


Keywords: Function Word, Manhattan Distance, Adjusted Rand Index, Supervise Classifier, Rank Type



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Center for Computational Linguistics, University of Bucharest, Bucharest, Romania
