Multi-label Text Categorization Using K-Nearest Neighbor Approach with M-Similarity
Textual information is now ubiquitous, and a single text often covers several topics, which makes multi-label text categorization an important problem. Traditional methods based on the vector-space-model text representation lose word-order information. In this paper, texts are treated as symbol sequences. We propose kNN-M, a multi-label lazy learning approach derived from the traditional k-nearest neighbor (kNN) method. To find the k nearest neighbors, kNN-M evaluates the closeness of texts with M-Similarity, a flexible order-semisensitive measure that exploits sequence information in text through swap-allowed dynamic block matching. Experiments on real-world OHSUMED datasets show that our approach considerably outperforms existing ones, demonstrating the value of considering both term co-occurrence and term order in text categorization tasks.
Keywords: Text Categorization, String Match, String Kernel, Thresholding Strategy, Text Categorization Task