Problems of Authorship Identification of the National Language Electronic Discourse

Venčkauskas, Algimantas; Damaševičius, Robertas; Marcinkevičius, Romas; Karpavičius, Arnas

doi:10.1007/978-3-319-24770-0_36

Part of the book series: Communications in Computer and Information Science (CCIS, volume 538)

Included in the following conference series:

International Conference on Information and Software Technologies

Abstract

The paper presents a comprehensive overview and analysis of authorship identification methods for national language electronic discourse. First, an overview and analysis of methods for the English language is presented. Next, adaptations of general methods, as well as language-specific methods, for national languages are considered. The challenges of authorship identification in electronic discourse and the requirements for developing authorship identification systems for forensic applications are discussed. Finally, recommendations for developers of authorship identification methods and tools are presented.

Acknowledgement

The authors acknowledge the contribution of the project “Lithuanian Cybercrime Centre of Excellence for Training, Research and Education”, Grant Agreement No. HOME/2013/ISEC/AG/INT/4000005176, co-funded by the Prevention of and Fight against Crime Programme of the European Union.

Author information

Authors and Affiliations

Computer Science Department, Kaunas University of Technology, Studentų 50, Kaunas, Lithuania
Algimantas Venčkauskas & Arnas Karpavičius
Software Engineering Department, Kaunas University of Technology, Studentų 50, Kaunas, Lithuania
Robertas Damaševičius & Romas Marcinkevičius

Corresponding author

Correspondence to Robertas Damaševičius.

Editor information

Editors and Affiliations

Kaunas University of Technology, Kaunas, Lithuania
Giedre Dregvaite
Kaunas University of Technology, Kaunas, Lithuania
Robertas Damasevicius

Cite this paper

Venčkauskas, A., Damaševičius, R., Marcinkevičius, R., Karpavičius, A. (2015). Problems of Authorship Identification of the National Language Electronic Discourse. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2015. Communications in Computer and Information Science, vol 538. Springer, Cham. https://doi.org/10.1007/978-3-319-24770-0_36

DOI: https://doi.org/10.1007/978-3-319-24770-0_36
Published: 10 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24769-4
Online ISBN: 978-3-319-24770-0
eBook Packages: Computer Science, Computer Science (R0)
