Active Learning: Applying RinSCut Thresholding Strategy to Uncertainty Sampling

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 2903)

Abstract

Many supervised learning approaches to text classification need a large volume of manually labeled documents to achieve a high level of performance. Manual labeling is time-consuming, expensive, and inevitably somewhat inconsistent. Two common approaches to reducing the number of expensive labeled examples are (1) selecting informative uncertain examples for human labeling, rather than relying on random sampling, and (2) using a large amount of inexpensive unlabeled data together with a small number of manually labeled examples. Previous research has focused on one approach at a time and has shown a considerable reduction in the number of labeled examples needed to achieve a given level of performance. In this paper, we investigate a new framework that combines both approaches for similarity-based text classification. By applying our new thresholding strategy, RinSCut, to conventional uncertainty sampling, the framework automatically selects informative uncertain data to be presented to a human expert for labeling, and positive-certain data that can be used directly for learning without human labeling. Extensive experiments on the Reuters-21578 dataset compare the proposed scheme with random sampling and conventional uncertainty sampling, based on micro- and macro-averaged F1. The results show that when both macro- and micro-averaged measures matter, our new approach may be the best choice.
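The selection step described in the abstract can be illustrated with a minimal sketch. It is not the RinSCut definition, which the abstract does not give: the two-threshold band (`lower`, `upper`), the labeling `budget`, and the function name `partition_by_score` are all assumptions introduced here for illustration. Given similarity scores for one category, documents scoring at or above `upper` are treated as positive-certain, and documents inside the band are ranked by how close they sit to its midpoint, with the closest ones sent to the human expert.

```python
from typing import Dict, List, Tuple


def partition_by_score(
    scores: Dict[str, float],
    lower: float,
    upper: float,
    budget: int,
) -> Tuple[List[str], List[str]]:
    """Split unlabeled documents for one category into two groups.

    Returns (to_label, positive_certain):
      to_label          up to `budget` documents inside [lower, upper),
                        most uncertain first (closest to the band midpoint)
      positive_certain  documents with score >= upper, used without labeling

    The thresholds are hypothetical inputs; how RinSCut derives its
    thresholds is not specified in the abstract.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")

    positive_certain = sorted(d for d, s in scores.items() if s >= upper)

    midpoint = (lower + upper) / 2.0
    in_band = [d for d, s in scores.items() if lower <= s < upper]
    # Rank by distance to the midpoint; ties are broken by document id.
    in_band.sort(key=lambda d: (abs(scores[d] - midpoint), d))

    return in_band[:budget], positive_certain


# Toy similarity scores, invented for this example.
example_scores = {"d1": 0.95, "d2": 0.52, "d3": 0.40, "d4": 0.10, "d5": 0.65}
to_label, positive_certain = partition_by_score(
    example_scores, lower=0.3, upper=0.7, budget=2
)
print(to_label)          # documents to show the human expert
print(positive_certain)  # documents added to training without labeling
```

Documents below `lower` are left untouched in this sketch; whether they should be used as negative examples is a separate design decision that the abstract does not address.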

Keywords

  • Active Learning
  • Similarity Score
  • Text Classification
  • Human Expert
  • Uncertainty Sampling

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lee, K.H., Kay, J., Kang, B.H. (2003). Active Learning: Applying RinSCut Thresholding Strategy to Uncertainty Sampling. In: Gedeon, T.D., Fung, L.C.C. (eds) AI 2003: Advances in Artificial Intelligence. AI 2003. Lecture Notes in Computer Science, vol 2903. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24581-0_79

  • DOI: https://doi.org/10.1007/978-3-540-24581-0_79

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20646-0

  • Online ISBN: 978-3-540-24581-0

  • eBook Packages: Springer Book Archive