Problems of Authorship Identification of the National Language Electronic Discourse

  • Conference paper
Information and Software Technologies (ICIST 2015)

Abstract

The paper presents a comprehensive overview and analysis of authorship identification methods for national language electronic discourse. First, an overview and analysis of methods for the English language is presented. Next, adaptations of general methods as well as language-specific methods for national languages are considered. The challenges of authorship identification in electronic discourse are discussed, together with the requirements for developing authorship identification systems for forensic applications. Finally, recommendations for developers of authorship identification methods and tools are presented.

Acknowledgement

The authors acknowledge the contribution of the project “Lithuanian Cybercrime Centre of Excellence for Training, Research and Education”, Grant Agreement No. HOME/2013/ISEC/AG/INT/4000005176, co-funded by the Prevention of and Fight against Crime Programme of the European Union.

Author information

Corresponding author

Correspondence to Robertas Damaševičius.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Venčkauskas, A., Damaševičius, R., Marcinkevičius, R., Karpavičius, A. (2015). Problems of Authorship Identification of the National Language Electronic Discourse. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2015. Communications in Computer and Information Science, vol 538. Springer, Cham. https://doi.org/10.1007/978-3-319-24770-0_36

  • DOI: https://doi.org/10.1007/978-3-319-24770-0_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24769-4

  • Online ISBN: 978-3-319-24770-0

  • eBook Packages: Computer Science, Computer Science (R0)
