Abstract
Because case-based reasoning (CBR) is instance-based, it is vulnerable to noisy data. Other learning techniques, such as support vector machines (SVMs) and decision trees, have been developed to be noise-tolerant, so a certain level of noise in the data can be tolerated. By contrast, noisy data can have a large impact in CBR because inference is normally based on a small number of cases. To date, research on noise reduction has relied on a majority-rule strategy: cases that are out of line with their neighbors are removed. We depart from that strategy and use local SVMs to identify noisy cases. This is more powerful than a majority-rule strategy because it explicitly considers the decision boundary in the noise-reduction process. In this paper we describe how such a local SVM strategy for noise reduction can be made to scale to very large datasets (> 500,000 training samples). The technique is evaluated on nine very large datasets and shows excellent performance when compared with alternative techniques.
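The core idea can be illustrated with a brief sketch. This is a minimal, brute-force illustration, not the authors' FaLKM-lib implementation, and it omits the scalability machinery the paper is about. It assumes a labelled dataset (X, y), a neighborhood size k, and an RBF-kernel SVM as the local model; the hypothetical helper flag_noisy_cases marks a case as noisy when a local SVM trained on the case's k nearest neighbors places the case on the wrong side of the decision boundary relative to its own label.

```python
# Minimal sketch of local-SVM noise flagging (illustrative only; assumes
# scikit-learn and the hypothetical parameters k, C, gamma).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def flag_noisy_cases(X, y, k=50, C=1.0, gamma="scale"):
    """Return a boolean mask marking cases whose label disagrees with a
    local SVM trained on their k nearest neighbors."""
    X, y = np.asarray(X), np.asarray(y)
    # Ask for k + 1 neighbors: the nearest neighbor of a training point
    # is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    noisy = np.zeros(len(X), dtype=bool)
    for i, neighborhood in enumerate(idx):
        train = neighborhood[1:]          # exclude the case being tested
        labels = y[train]
        if len(np.unique(labels)) < 2:
            # Locally pure region: the case is suspect only if it
            # disagrees with the single local class.
            noisy[i] = y[i] != labels[0]
            continue
        local_svm = SVC(C=C, gamma=gamma).fit(X[train], labels)
        # Flag the case when the local maximal-margin boundary puts it
        # on the wrong side of its own label.
        noisy[i] = local_svm.predict(X[i:i + 1])[0] != y[i]
    return noisy

# Usage sketch: keep only the cases that are not flagged as noisy.
# mask = flag_noisy_cases(X, y, k=50)
# X_clean, y_clean = X[~mask], y[~mask]
```

Training one local model per case, as above, would not reach 500,000 samples; the scalable variant evaluated in the paper bounds the number of local models and retrieves neighborhoods through an index structure, but the underlying idea of testing each case against a local maximal-margin boundary is the same.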
Cite this paper
Segata, N., Blanzieri, E., Cunningham, P. (2009). A Scalable Noise Reduction Technique for Large Case-Based Systems. In: McGinty, L., Wilson, D.C. (eds) Case-Based Reasoning Research and Development. ICCBR 2009. Lecture Notes in Computer Science, vol. 5650. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02998-1_24
DOI: https://doi.org/10.1007/978-3-642-02998-1_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02997-4
Online ISBN: 978-3-642-02998-1