A Scalable Noise Reduction Technique for Large Case-Based Systems

  • Conference paper
Case-Based Reasoning Research and Development (ICCBR 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5650)

Abstract

Because case-based reasoning (CBR) is instance-based, it is vulnerable to noisy data. Other learning techniques, such as support vector machines (SVMs) and decision trees, have been developed to be noise-tolerant, so a certain level of noise in the data can be tolerated. By contrast, noisy data can have a large impact in CBR because inference is normally based on a small number of cases. So far, research on noise reduction has been based on a majority-rule strategy: cases that are out of line with their neighbors are removed. We depart from that strategy and use local SVMs to identify noisy cases. This is more powerful than a majority-rule strategy because it explicitly considers the decision boundary in the noise reduction process. In this paper we provide details on how such a local SVM strategy for noise reduction can be made to scale to very large datasets (> 500,000 training samples). The technique is evaluated on nine very large datasets and shows excellent performance when compared with alternative techniques.
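
To make the contrast in the abstract concrete, the following is a minimal sketch of the majority-rule baseline it describes: a case is dropped when its label disagrees with the majority label of its nearest neighbors. This is not the paper's local SVM method, which instead fits an SVM on each case's neighborhood. The function name `edit_by_majority_rule`, the choice of `k=3`, Euclidean distance, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import math


def edit_by_majority_rule(cases, labels, k=3):
    """Return the indices of cases whose label matches the majority
    label of their k nearest neighbors (the case itself is excluded)."""
    kept = []
    for i, (x, y) in enumerate(zip(cases, labels)):
        # Distance from case i to every other case (leave-one-out).
        others = [(math.dist(x, cases[j]), j)
                  for j in range(len(cases)) if j != i]
        others.sort()
        neighbor_labels = [labels[j] for _, j in others[:k]]
        majority_label, _ = Counter(neighbor_labels).most_common(1)[0]
        # Keep the case only if it agrees with its neighborhood.
        if majority_label == y:
            kept.append(i)
    return kept


# Hypothetical toy case base: two clusters, with the case at
# index 3 sitting in the "a" cluster but labelled "b".
cases = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
         (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels = ["a", "a", "a", "b", "b", "b", "b"]

kept = edit_by_majority_rule(cases, labels, k=3)
print(kept)
```

On this toy data only the mislabelled case at index 3 is dropped. The vote looks only at neighbor labels and ignores where the decision boundary lies, which is the limitation the abstract says a local SVM addresses. The all-pairs distance computation here is quadratic in the number of cases, so this sketch would not scale to the dataset sizes the paper targets.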

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Segata, N., Blanzieri, E., Cunningham, P. (2009). A Scalable Noise Reduction Technique for Large Case-Based Systems. In: McGinty, L., Wilson, D.C. (eds) Case-Based Reasoning Research and Development. ICCBR 2009. Lecture Notes in Computer Science, vol 5650. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02998-1_24

  • DOI: https://doi.org/10.1007/978-3-642-02998-1_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02997-4

  • Online ISBN: 978-3-642-02998-1

  • eBook Packages: Computer Science, Computer Science (R0)
