Text Mining Using Markov Chains of Variable Length

  • Björn Hoffmeister
  • Thomas Zeugmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3847)


When dealing with knowledge federation over text documents, one has to determine whether documents are related by context. A new approach is proposed to solve this problem.

This leads to the design of a new search engine for literature research and related problems. The idea is that the user already has some documents of interest, which are taken as input. All documents known to a classical search engine are then ranked according to their relevance to these input documents. To achieve this goal, we use Markov chains of variable length.
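The abstract does not spell out the ranking procedure, so the following is only a rough illustration of the general idea, not the authors' algorithm. It assumes a word-level model that is trained on the input documents, conditions each word on the longest preceding context (up to `max_order` words) that was seen in training, and ranks candidates by average log-likelihood per word. The class name `VariableOrderMarkovModel`, the function `rank`, the add-one smoothing, and the back-off rule are all hypothetical choices made for this sketch.

```python
import math
from collections import defaultdict


class VariableOrderMarkovModel:
    """Toy word-level model; an illustrative assumption, not the paper's method."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # counts[context][next_word] = frequency, where context is a tuple
        # holding between 0 and max_order preceding words.
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, documents):
        for doc in documents:
            words = doc.lower().split()
            self.vocab.update(words)
            for i, word in enumerate(words):
                # Record the word under every context length that fits.
                for k in range(self.max_order + 1):
                    if i - k < 0:
                        break
                    self.counts[tuple(words[i - k:i])][word] += 1

    def log_prob(self, context, word):
        # Back off to the longest suffix of the context seen in training.
        for k in range(min(self.max_order, len(context)), -1, -1):
            ctx = tuple(context[len(context) - k:])
            if ctx in self.counts:
                seen = self.counts[ctx]
                total = sum(seen.values())
                # Add-one smoothing; the extra +1 reserves mass for unknown words.
                return math.log((seen.get(word, 0) + 1) / (total + len(self.vocab) + 1))
        # Only reached when the model has not been trained at all.
        return -math.log(len(self.vocab) + 1)

    def score(self, document):
        """Average log-probability per word; higher means closer to the input documents."""
        words = document.lower().split()
        if not words:
            return float("-inf")
        total = sum(
            self.log_prob(words[max(0, i - self.max_order):i], w)
            for i, w in enumerate(words)
        )
        return total / len(words)


def rank(model, candidates):
    """Return the candidates sorted from most to least relevant under the model."""
    return sorted(candidates, key=model.score, reverse=True)


# Made-up example data: two input documents and two candidates to rank.
model = VariableOrderMarkovModel(max_order=2)
model.train(["markov chains model text", "markov chains rank text documents"])
for doc in rank(model, ["stock prices fell sharply", "markov chains model documents"]):
    print(round(model.score(doc), 3), doc)
```

Normalising by document length keeps long candidates from being penalised merely for containing more words; other normalisations or a comparison against a background model would be equally plausible under these assumptions.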

The algorithms developed have been implemented and tested on the Reuters-21578 data set.


Keywords: Markov Chain, Feature Selection, Markov Model, Search Engine, Text Retrieval
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Björn Hoffmeister (1)
  • Thomas Zeugmann (2)

  1. RWTH Aachen, Lehrstuhl für Informatik VI, Aachen
  2. Division of Computer Science, Hokkaido University, Sapporo, Japan
