Abstract
The paper presents a comprehensive overview and analysis of authorship identification methods for national language electronic discourse. First, methods for the English language are reviewed and analysed. Next, adaptations of general methods, as well as language-specific methods for national languages, are considered. The challenges of authorship identification in electronic discourse and the requirements for developing authorship identification systems for forensic applications are then discussed. Finally, recommendations for developers of authorship identification methods and tools are presented.
Acknowledgement
The authors acknowledge the contribution of the project “Lithuanian Cybercrime Centre of Excellence for Training, Research and Education”, Grant Agreement No. HOME/2013/ISEC/AG/INT/4000005176, co-funded by the Prevention of and Fight against Crime Programme of the European Union.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Venčkauskas, A., Damaševičius, R., Marcinkevičius, R., Karpavičius, A. (2015). Problems of Authorship Identification of the National Language Electronic Discourse. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2015. Communications in Computer and Information Science, vol 538. Springer, Cham. https://doi.org/10.1007/978-3-319-24770-0_36
DOI: https://doi.org/10.1007/978-3-319-24770-0_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24769-4
Online ISBN: 978-3-319-24770-0
eBook Packages: Computer Science (R0)