Authorship Analysis in Cybercrime Investigation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2665)


Criminals use the Internet to distribute a wide range of illegal materials globally and anonymously, which makes tracing their identities difficult in the cybercrime investigation process. In this study we propose adopting the authorship analysis framework to automatically trace the identities of cyber criminals through the messages they post on the Internet. Under this framework, three types of message features (style markers, structural features, and content-specific features) are extracted, and inductive learning algorithms are used to build feature-based models that identify the authorship of illegal messages. To evaluate the effectiveness of this framework, we conducted an experimental study on data sets of English and Chinese email and online newsgroup messages. We experimented with all three types of message features and three inductive learning algorithms. The results indicate that the proposed approach can discover the real identities of the authors of both English and Chinese Internet messages with relatively high accuracy.
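
The feature-extraction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `extract_features`, the word lists `FUNCTION_WORDS` and `CONTENT_KEYWORDS`, and the particular features chosen are all assumptions made for illustration, since the abstract names only the three feature types and not the features themselves. The sketch covers feature extraction only; the resulting vector would then be passed to whichever inductive learning algorithm is used.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only; the abstract does not
# specify which words or keywords the study actually used.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")
CONTENT_KEYWORDS = ("sale", "free", "software")


def extract_features(message: str) -> dict:
    """Map one message to a small dictionary of numeric features."""
    words = re.findall(r"[A-Za-z']+", message.lower())
    counts = Counter(words)
    n_words = max(len(words), 1)  # avoid division by zero on empty input

    features = {
        # Style markers: content-independent habits of the writer.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(counts) / n_words,
    }
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words

    # Structural features: how the message is laid out.
    paragraphs = [p for p in re.split(r"\n\s*\n", message) if p.strip()]
    features["num_lines"] = len(message.splitlines())
    features["num_paragraphs"] = len(paragraphs)
    features["has_greeting"] = float(
        bool(re.match(r"\s*(hi|hello|dear)\b", message, re.IGNORECASE))
    )

    # Content-specific features: raw counts of topic keywords.
    for kw in CONTENT_KEYWORDS:
        features[f"kw_{kw}"] = counts[kw]

    return features
```

Calling `extract_features` on each message of known authorship yields labeled feature vectors from which a classifier can be trained; the same function is then applied to a message of unknown authorship before prediction.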


Keywords: Support Vector Machine, Pirate Software, Authorship Analysis, Authorship Attribution, Federalist Paper
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  1. Artificial Intelligence Lab, Department of Management Information Systems, The University of Arizona, Tucson, USA