Applying Authorship Analysis to Arabic Web Content

  • Ahmed Abbasi
  • Hsinchun Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3495)


The advent and rapid proliferation of internet communication has given rise to numerous security issues. The anonymity of online media such as email, web sites, and forums makes them an attractive channel for criminal activity. Increased globalization and the borderless nature of the internet have amplified these concerns by adding a multilingual dimension. The world’s social and political climate has drawn a great deal of attention to Arabic. In this study we apply authorship identification techniques to Arabic web forum messages. Our research uses lexical, syntactic, structural, and content-specific writing style features for authorship identification. We address some of the problematic characteristics of Arabic en route to developing an Arabic language model that provides a respectable level of classification accuracy for authorship discrimination. We also run experiments to evaluate the effectiveness of different feature types and classification techniques on our dataset.
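To make the feature categories named in the abstract more concrete, the following is a minimal sketch of how a few lexical and structural style features might be computed for a single forum message. It is an illustration only, not the feature set or code used in the paper: the function name `style_features`, the specific features chosen, and the five-word `FUNCTION_WORDS` list are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical, very small list of common Arabic function words,
# chosen for illustration only; it is not taken from the paper.
FUNCTION_WORDS = ("في", "من", "على", "إلى", "عن")


def style_features(message: str) -> dict:
    """Compute a few simple lexical and structural style features."""
    # \w is Unicode-aware for str patterns in Python 3, so Arabic letters
    # match. Combining diacritics are not matched and would split a word,
    # so this sketch assumes undiacritized text.
    words = re.findall(r"\w+", message)
    n_words = len(words)
    counts = Counter(words)
    lines = message.splitlines()

    features = {
        # Lexical features
        "char_count": len(message),
        "word_count": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "vocabulary_richness": len(counts) / n_words if n_words else 0.0,
        # Structural features
        "line_count": len(lines),
        "blank_line_count": sum(1 for ln in lines if not ln.strip()),
    }
    # Relative frequency of each (assumed) function word
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words if n_words else 0.0
    return features
```

A feature dictionary like this could be turned into a numeric vector per message and passed to a classifier; which features and classifiers work best on Arabic forum text is what the experiments described in the abstract evaluate.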


Word Length; Function Word; Syntactic Feature; Lexical Feature; Arabic Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Ahmed Abbasi (1)
  • Hsinchun Chen (1)
  1. Department of Management Information Systems, The University of Arizona, Tucson, USA
