Abstract
Because case-based reasoning (CBR) is instance-based, it is vulnerable to noisy data. Other learning techniques, such as support vector machines (SVMs) and decision trees, have been developed to be noise-tolerant, so a certain level of noise in the data can be tolerated. By contrast, noisy data can have a large impact in CBR because inference is normally based on a small number of cases. To date, research on noise reduction has relied on a majority-rule strategy: cases that are out of line with their neighbors are removed. We depart from that strategy and use local SVMs to identify noisy cases. This is more powerful than a majority-rule strategy because it explicitly considers the decision boundary in the noise-reduction process. In this paper we describe how such a local SVM strategy for noise reduction can be made to scale to very large datasets (> 500,000 training samples). The technique is evaluated on nine very large datasets and shows excellent performance when compared with alternative techniques.
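The core idea can be illustrated with a brief sketch. This is a minimal, brute-force illustration, not the authors' FaLKM-lib implementation, and it omits the scalability machinery the paper is about. It assumes a labelled dataset (X, y), a neighborhood size k, and an RBF-kernel SVM as the local model; the hypothetical helper flag_noisy_cases marks a case as noisy when a local SVM trained on the case's k nearest neighbors places the case on the wrong side of the decision boundary relative to its own label.

```python
# Minimal sketch of local-SVM noise flagging (illustrative only; assumes
# scikit-learn and the hypothetical parameters k, C, gamma).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def flag_noisy_cases(X, y, k=50, C=1.0, gamma="scale"):
    """Return a boolean mask marking cases whose label disagrees with a
    local SVM trained on their k nearest neighbors."""
    X, y = np.asarray(X), np.asarray(y)
    # Ask for k + 1 neighbors: the nearest neighbor of a training point
    # is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    noisy = np.zeros(len(X), dtype=bool)
    for i, neighborhood in enumerate(idx):
        train = neighborhood[1:]          # exclude the case being tested
        labels = y[train]
        if len(np.unique(labels)) < 2:
            # Locally pure region: the case is suspect only if it
            # disagrees with the single local class.
            noisy[i] = y[i] != labels[0]
            continue
        local_svm = SVC(C=C, gamma=gamma).fit(X[train], labels)
        # Flag the case when the local maximal-margin boundary puts it
        # on the wrong side of its own label.
        noisy[i] = local_svm.predict(X[i:i + 1])[0] != y[i]
    return noisy

# Usage sketch: keep only the cases that are not flagged as noisy.
# mask = flag_noisy_cases(X, y, k=50)
# X_clean, y_clean = X[~mask], y[~mask]
```

Training one local model per case, as above, would not reach 500,000 samples; the scalable variant evaluated in the paper bounds the number of local models and retrieves neighborhoods through an index structure, but the underlying idea of testing each case against a local maximal-margin boundary is the same.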
Cite this paper
Segata, N., Blanzieri, E., Cunningham, P. (2009). A Scalable Noise Reduction Technique for Large Case-Based Systems. In: McGinty, L., Wilson, D.C. (eds) Case-Based Reasoning Research and Development. ICCBR 2009. Lecture Notes in Computer Science, vol. 5650. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02998-1_24
DOI: https://doi.org/10.1007/978-3-642-02998-1_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02997-4
Online ISBN: 978-3-642-02998-1