
Active Learning: Applying RinSCut Thresholding Strategy to Uncertainty Sampling

  • Kang Hyuk Lee
  • Judy Kay
  • Byeong Ho Kang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2903)

Abstract

In many supervised learning approaches to text classification, a large volume of manually labeled documents is needed to achieve a high level of performance. This manual labeling process is time-consuming, expensive, and prone to inconsistency. Two common approaches to reducing the number of expensive labeled examples are: (1) selecting informative uncertain examples for human labeling, rather than relying on random sampling, and (2) using a large amount of inexpensive unlabeled data together with a small number of manually labeled examples. Previous research has focused on one approach or the other and has shown considerable reductions in the number of labeled examples needed to reach a given level of performance. In this paper, we investigate a new framework that combines both approaches for similarity-based text classification. By applying our new thresholding strategy, RinSCut, to conventional uncertainty sampling, we propose a framework that automatically selects informative uncertain data to be presented to a human expert for labeling, as well as positive-certain data that can be used directly for learning without human labeling. Extensive experiments on the Reuters-21578 dataset compare our proposed scheme with random sampling and conventional uncertainty sampling, using micro- and macro-averaged F1. The results show that when both macro- and micro-averaged measures matter, our new approach may be the optimal choice.
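To make the selection step concrete, below is a minimal Python sketch, assuming a similarity-based classifier that produces one similarity score per category for each unlabeled document. The centroid-based scorer, the threshold values, and all function names are illustrative assumptions introduced here; the paper's actual RinSCut (rank-in-score) thresholds and selection rules are not reproduced.

    # Sketch only: route unlabeled documents either to the human expert
    # (uncertain scores) or directly into the training set (positive-certain
    # scores).  Thresholds and the scorer are illustrative assumptions.
    import numpy as np

    def similarity_scores(doc_vectors, centroid):
        # Cosine similarity of each document vector to a category centroid.
        norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(centroid)
        return doc_vectors @ centroid / np.clip(norms, 1e-12, None)

    def select_examples(scores, certain_threshold=0.8, uncertain_band=(0.4, 0.6)):
        # query -> uncertain documents sent to the human expert for labeling
        # auto  -> positive-certain documents added to training without labeling
        lo, hi = uncertain_band
        query = np.where((scores >= lo) & (scores <= hi))[0]
        auto = np.where(scores >= certain_threshold)[0]
        return query, auto

    # Usage with stand-in TF-IDF vectors and a stand-in category centroid.
    rng = np.random.default_rng(0)
    unlabeled = rng.random((100, 50))
    centroid = rng.random(50)
    query_idx, auto_idx = select_examples(similarity_scores(unlabeled, centroid))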

Keywords

Active Learning · Similarity Score · Text Classification · Human Expert · Uncertainty Sampling



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Kang Hyuk Lee (1)
  • Judy Kay (1)
  • Byeong Ho Kang (2)
  1. School of Information Technologies, University of Sydney, Australia
  2. School of Computing, University of Tasmania, Hobart, Australia
