Abstract
In many supervised learning approaches to text classification, a large volume of manually labeled documents is needed to achieve high performance. This manual labeling is time-consuming, expensive, and inevitably somewhat inconsistent. Two common approaches to reducing the number of expensive labeled examples are: (1) selecting informative, uncertain examples for human labeling rather than relying on random sampling, and (2) combining a large amount of inexpensive unlabeled data with a small number of manually labeled examples. Previous research has typically focused on one of these approaches and has shown that the number of labeled examples required to reach a given level of performance can be reduced considerably. In this paper, we investigate a new framework that combines both approaches for similarity-based text classification. By applying our new thresholding strategy, RinSCut, to conventional uncertainty sampling, we propose a framework that automatically selects informative uncertain data to be presented to a human expert for labeling, as well as positive-certain data that can be used directly for learning without human labeling. Extensive experiments were conducted on the Reuters-21578 dataset to compare the proposed scheme with random sampling and conventional uncertainty sampling, using micro- and macro-averaged F1. The results show that when both micro- and macro-averaged measures matter, our new approach may be the optimal choice.
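The abstract does not give the exact RinSCut thresholds, so the following Python sketch is an illustration only: it assumes a similarity-based classifier has already scored each unlabeled document against a category, and it uses a fixed "uncertain" score band and a fixed "positive-certain" cut-off in place of RinSCut's per-category, rank-in-score-derived thresholds. The function name select_examples, the band (0.4, 0.6), and the cut-off 0.8 are illustrative assumptions, not values from the paper.

```python
import numpy as np

def select_examples(scores, uncertain_band=(0.4, 0.6), certain_threshold=0.8):
    """Split unlabeled documents by their per-category similarity scores.

    scores: 1-D array of similarity scores in [0, 1] for one category
            (hypothetical output of a similarity-based classifier).
    Returns indices of documents to send to a human expert (uncertain)
    and indices to auto-label as positive (positive-certain).
    """
    scores = np.asarray(scores)
    lo, hi = uncertain_band
    uncertain = np.where((scores >= lo) & (scores <= hi))[0]      # near the decision boundary
    positive_certain = np.where(scores >= certain_threshold)[0]   # well above the boundary
    return uncertain, positive_certain

# Toy usage: five unlabeled documents scored against one category.
unc, pos = select_examples([0.15, 0.45, 0.55, 0.83, 0.92])
print(unc)   # [1 2] -> ask the human expert to label these
print(pos)   # [3 4] -> add directly to the training set as positives
```

In the framework described above, the expert-labeled uncertain documents and the automatically labeled positive-certain documents would both be added to the training pool, so labeling effort is spent only where the classifier is least confident.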
Keywords
- Active Learning
- Similarity Score
- Text Classification
- Human Expert
- Uncertainty Sampling
References
Apte, C., Damerau, F., Weiss, S.M.: Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems 12(3), 233–251 (1994)
Lewis, D.D., Catlett, J.: Heterogeneous Uncertainty Sampling for Supervised Learning. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 148–156. Morgan Kaufmann, San Francisco (1994)
Lewis, D.D., Gale, W.: A Sequential Algorithm for Training Text Classifiers. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12 (1994)
Cohn, D., Atlas, L., Ladner, R.: Improving Generalization with Active Learning. Machine Learning 15(2), 201–221 (1994)
Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-Training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100 (1998)
Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.M.: Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39, 103–134 (2000)
Lee, K.H., Kay, J., Kang, B.H.: Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization. In: ICML Workshop on Text Learning (TextML 2002), Sydney, Australia, pp. 36–43 (2002)
Lee, K.H., Kay, J., Kang, B.H., Rosebrock, U.: A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization. In: Ishizuka, M., Sattar, A. (eds.) PRICAI 2002. LNCS (LNAI), vol. 2417, pp. 444–453. Springer, Heidelberg (2002)
Buckley, C., Salton, G., Allan, J., Singhal, A.: Automatic Query Expansion Using SMART: TREC 3. In: The Third Text Retrieval Conference (TREC-3). National Institute of Standards and Technology Special Publication 500-207. Gaithersburg, MD (1995)
Rocchio, J.: Relevance Feedback in Information Retrieval. In: Salton, G. (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs (1971)
Widrow, B., Stearns, S.D.: Adaptive Signal Processing. Prentice-Hall Inc., Englewood Cliffs (1985)
Yang, Y.: Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In: Proceedings of the 17th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 13–22 (1994)
Yang, Y.: A Study on Thresholding Strategies for Text Categorization. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), pp. 137–145 (2001)
The Reuters-21578 collection, originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd., is freely available for research purposes from http://www.daviddlewis.com/resources/testcollections/reuters21578/
The new Reuters collection, Reuters Corpus Volume 1, has recently been made available by Reuters, Ltd. and is freely available for research purposes from http://about.reuters.com/researchandstandards/corpus/
Porter, M.F.: An Algorithm for Suffix Stripping. Program 14(3), 130–137 (1980)
Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval 1(1/2), 67–88 (1999)
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Lee, K.H., Kay, J., Kang, B.H. (2003). Active Learning: Applying RinSCut Thresholding Strategy to Uncertainty Sampling. In: Gedeon, T.D., Fung, L.C.C. (eds) AI 2003: Advances in Artificial Intelligence. AI 2003. Lecture Notes in Computer Science, vol 2903. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24581-0_79
DOI: https://doi.org/10.1007/978-3-540-24581-0_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20646-0
Online ISBN: 978-3-540-24581-0