Text Mining Using Markov Chains of Variable Length
When dealing with knowledge federation over text documents, one has to determine whether documents are related by context. A new approach is proposed to solve this problem.
This leads to the design of a new search engine for literature research and related problems. The idea is that one already has some documents of interest, which are taken as input. All documents known to a classical search engine are then ranked according to their relevance to the input documents. To achieve this goal, we use Markov chains of variable length.
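The ranking idea can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes a simple word-level model that backs off to the longest context seen in the input documents, add-one smoothing, and average log-probability as the relevance score. The function names (`train`, `avg_log_prob`, `rank`), the `max_order` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, max_order=2):
    """Count next-word frequencies for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()
        vocab.update(tokens)
        for i, word in enumerate(tokens):
            for k in range(min(max_order, i) + 1):
                context = tuple(tokens[i - k:i])
                counts[context][word] += 1
    return counts, vocab


def avg_log_prob(doc, counts, vocab, max_order=2):
    """Score a document by its average per-word log-probability under the model."""
    tokens = doc.lower().split()
    if not tokens:
        return float("-inf")
    v = len(vocab) + 1  # one extra slot for unseen words (add-one smoothing)
    total = 0.0
    for i, word in enumerate(tokens):
        # Variable-length step: use the longest context seen in training,
        # falling back to shorter contexts (the empty context is the last resort).
        seen = Counter()
        for k in range(min(max_order, i), -1, -1):
            context = tuple(tokens[i - k:i])
            if context in counts:
                seen = counts[context]
                break
        total += math.log((seen[word] + 1) / (sum(seen.values()) + v))
    return total / len(tokens)


def rank(seed_docs, candidates, max_order=2):
    """Return (score, document) pairs, most relevant first."""
    counts, vocab = train(seed_docs, max_order)
    scored = [(avg_log_prob(c, counts, vocab, max_order), c) for c in candidates]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    seeds = ["sugar prices rose as sugar exports fell",
             "sugar exports rose sharply"]
    pool = ["the cat sat on the mat", "sugar exports fell"]
    for score, doc in rank(seeds, pool):
        print(f"{score:.3f}  {doc}")
```

In this sketch, a candidate that shares long word sequences with the input documents is scored under longer contexts and so receives a higher average log-probability than an unrelated candidate. Averaging per word keeps long and short documents comparable.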
The algorithms developed have been implemented and tested on the Reuters-21578 data set.
Keywords: Sugar, Entropy, Assure