Filtering Very Similar Text Documents: A Case Study
This paper describes problems with the classification and filtering of very similar relevant and irrelevant real medical documents from one highly specific domain, obtained from Internet resources. Besides being similar, the document sets are often unbalanced: too few irrelevant documents are available for training. A definition of similarity is suggested. For the classification, six algorithms are tested from the document-similarity point of view. The best results are provided by the back-propagation neural network and by the support vector machine based on radial basis functions.
Keywords: Textual Document, Common Word, Internet Resource, Medical Document, Training Document