Journal of Computer Science and Technology, Volume 25, Issue 6, pp 1256–1266

Negative Selection of Written Language Using Character Multiset Statistics

  • Matti Pöllä
  • Timo Honkela
Regular Paper


We study the combination of symbol frequency analysis and negative selection for anomaly detection in discrete sequences, where conventional negative selection algorithms are impractical because of data sparsity. A theoretical analysis based on ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both artificially generated data and real-world language data, including the English Wikipedia. Simulation results on large reference corpora are used to study how the assumptions made in the theoretical model hold up against real-world data.
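The abstract does not spell out the algorithm, so the following is only a minimal sketch of how negative selection over character multisets could look, not the authors' implementation. It assumes a fixed window length `w`, represents each window by its sorted character tuple (so character order inside the window is ignored), and enumerates every candidate multiset exhaustively, which is only feasible for a tiny alphabet. The function names `window_multisets`, `generate_detectors`, and `detect` are hypothetical.

```python
from itertools import combinations_with_replacement


def window_multisets(text, w):
    """Return the character multiset of every length-w window as a sorted tuple."""
    return [tuple(sorted(text[i:i + w])) for i in range(len(text) - w + 1)]


def generate_detectors(self_text, alphabet, w):
    """Negative selection: keep candidate multisets that match no window of the self data.

    Assumption: candidates are enumerated exhaustively, which only works for
    small alphabets and short windows.
    """
    self_set = set(window_multisets(self_text, w))
    candidates = combinations_with_replacement(sorted(set(alphabet)), w)
    return {c for c in candidates if c not in self_set}


def detect(text, detectors, w):
    """Return the start indices of windows whose multiset matches a detector."""
    return [i for i, m in enumerate(window_multisets(text, w)) if m in detectors]


# Toy example: the self data "abab" only ever produces the multiset {a, b}.
detectors = generate_detectors("abab", "ab", 2)
anomalies = detect("abba", detectors, 2)  # the window "bb" starts at index 1
```

In this toy setting the detectors are the multisets {a, a} and {b, b}, so the unmodified self text triggers no detector while the altered text "abba" is flagged at the window "bb". For natural-language alphabets the candidate space grows quickly, so exhaustive enumeration would have to be replaced by sampling or by another detector-generation scheme.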


Keywords: negative selection, anomaly detection, frequency analysis





Copyright information

© Springer 2010

Authors and Affiliations

  1. Department of Information and Computer Science, School of Science and Technology, Aalto University, Aalto, Finland
