Abstract
With the popularity of computer and Internet, a growing number of criminals have been using the Internet to distribute a wide range of illegal materials and false information globally in an anonymous manner, making criminal identity tracing difficult in the cybercrime investigation process. Consequently, automatic authorship attribution of online messages becomes increasingly crucial for forensic investigation. Although researchers have got many achievements, the accuracies of authorship attribution with tens or thousands of candidate are still relatively poor which is generally among 20%~40%, and cannot be used as evidence in forensic investigation. Instead of asserting that a given text was written by a given user, this paper proposes a novel authorship attribution model combining both profile-based and instance-based approaches to reduce the size of the candidate authors to a small number and narrow the scope of investigation with a high level of accuracy. To evaluate the effectiveness of our model, we conduct extensive experiments on a blog corpus with thousands of candidate authors. The experimental results show that our algorithm can successfully output a small number of candidate authors with high accuracy.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Abbasi, A., Chen, H.: Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems 20(5), 67–75 (2005)
Burrows, J.F.: An ocean where each kind: Statistical analysis and some major determinants of literary style. Computers and the Humanities 23(4-5), 309–321 (1989)
Coifman, R.R., Victor Wickerhauser, M.: Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory 38(2), 713–718 (1992)
De Vel, O., Anderson, A., Corney, M., Mohay, G.: Mining e-mail content for author identification forensics. ACM Sigmod Record 30(4), 55–64 (2001)
Eggins, S.: Introduction to systemic functional linguistics. Continuum International Publishing Group (2004)
Frantzeskou, G., Stamatatos, E., Gritzalis, S., Katsikas, S.: Effective identification of source code authors using byte-level information. In: Proceedings of the 28th International Conference on Software Engineering, pp. 893–896. ACM (2006)
Guzmán-Cabrera, R., Montes-y-Gómez, M., Rosso, P., Villaseñor-Pineda, L.: A Web-Based Self-training Approach for Authorship Attribution. In: Nordström, B., Ranta, A. (eds.) GoTAL 2008. LNCS (LNAI), vol. 5221, pp. 160–168. Springer, Heidelberg (2008)
Van Halteren, H.: Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing (TSLP)Â 4(1), 1 (2007)
Kešelj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING, vol. 3, pp. 255–264 (2003)
Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. Language Resources and Evaluation 45(1), 83–94 (2011)
Koppel, M., Schler, J., Argamon, S., Messeri, E.: Authorship attribution with thousands of candidate authors. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 659–660. ACM (2006)
Kourtis, I., Stamatatos, E.: Author identification using semi-supervised learning. In: CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers), Amsterdam, The Netherlands (2011)
Lee, S.-I., Lee, H., Abbeel, P., Ng, A.Y.: Efficient l~ 1 regularized logistic regression. In: Proceedings of the National Conference on Artificial Intelligence, vol. 21, p. 401. AAAI Press, MIT Press, Menlo Park, Cambridge (1999, 2006)
Li, J., Zheng, R., Chen, H.: From fingerprint to writeprint. Communications of the ACM 49(4), 76–82 (2006)
Love, H.: Attributing authorship: An introduction. Cambridge University Press (2002)
Luyckx, K., Daelemans, W.: Authorship attribution and verification with many authors and limited data. In: Proceedings of the 22nd International Conference on Computational Linguistics, vol. 1, pp. 513–520. Association for Computational Linguistics (2008)
Madigan, D., Genkin, A., Lewis, D.D., Argamon, S., Fradkin, D., Ye, L.: Author identification on the large scale. In: Proc. of the Meeting of the Classification Society of North America (2005)
Peng, F., Schuurmans, D., Wang, S.: Augmenting naive bayes classifiers with statistical language models. Information Retrieval 7(3-4), 317–345 (2004)
Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pp. 199–205 (2006)
Seroussi, Y., Bohnert, F., Zukerman, I.: Authorship attribution with author-aware topic models. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pp. 264–269. Association for Computational Linguistics (2012)
Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009)
Zheng, R., Qin, Y., Huang, Z., Chen, H.: Authorship analysis in cybercrime investigation. In: Chen, H., Miranda, R., Zeng, D.D., Demchak, C.C., Schroeder, J., Madhusudan, T. (eds.) ISI 2003. LNCS, vol. 2665, pp. 59–73. Springer, Heidelberg (2003)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 IFIP International Federation for Information Processing
About this paper
Cite this paper
Yang, M., Chow, KP. (2014). Authorship Attribution for Forensic Investigation with Thousands of Authors. In: Cuppens-Boulahia, N., Cuppens, F., Jajodia, S., Abou El Kalam, A., Sans, T. (eds) ICT Systems Security and Privacy Protection. SEC 2014. IFIP Advances in Information and Communication Technology, vol 428. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55415-5_28
Download citation
DOI: https://doi.org/10.1007/978-3-642-55415-5_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-55414-8
Online ISBN: 978-3-642-55415-5
eBook Packages: Computer ScienceComputer Science (R0)