Abstract
Active learning is important for duplicate record identification because manually assembling a suitable set of labeled examples is difficult. The task also suffers from class imbalance: non-match samples far outnumber match samples, which leads to poor prediction performance on the match class. In this paper, we present a new active learning approach that takes certainty, uncertainty, and representativeness into account. Our method first trains two feature-subspace classifiers. The certainty classifier generates a pool of likely matches, from which informative match samples are selected for manual annotation using an uncertainty and density measure; meanwhile, non-match samples are labeled automatically to reduce human annotation effort. A detailed experimental evaluation on real-world data demonstrates the effectiveness of our algorithms.
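The selection step described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the function name `select_queries`, the two thresholds, the uncertainty score (distance of a match probability from 0.5), the density score (mean cosine similarity to the other pool members), and the use of the second subspace classifier's output as the uncertainty signal. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_queries(features, p_certainty, p_uncertainty, k=2,
                   match_threshold=0.5, auto_nonmatch_threshold=0.1):
    """Pick record pairs for manual labeling and pairs to auto-label.

    features       -- similarity feature vector for each candidate pair
    p_certainty    -- match probability from the certainty classifier
    p_uncertainty  -- match probability from the second classifier
                      (assumed here to supply the uncertainty signal)
    Both thresholds are hypothetical values, not taken from the paper.
    """
    # Matches pool: pairs the certainty classifier predicts as matches.
    pool = [i for i, p in enumerate(p_certainty) if p >= match_threshold]

    scores = {}
    for i in pool:
        # Uncertainty is 1.0 at probability 0.5 and 0.0 at 0 or 1.
        uncertainty = 1.0 - abs(2.0 * p_uncertainty[i] - 1.0)
        # Density: mean similarity to the rest of the pool, so that
        # representative samples are preferred over outliers.
        others = [j for j in pool if j != i]
        if others:
            density = sum(cosine(features[i], features[j])
                          for j in others) / len(others)
        else:
            density = 1.0
        scores[i] = uncertainty * density

    # The k highest-scoring pool members go to the human annotator.
    to_annotate = sorted(pool, key=lambda i: scores[i], reverse=True)[:k]

    # Pairs the certainty classifier confidently rejects are labeled
    # as non-matches without human effort.
    auto_nonmatch = [i for i, p in enumerate(p_certainty)
                     if p <= auto_nonmatch_threshold]
    return to_annotate, auto_nonmatch


# Toy usage with made-up feature vectors and probabilities.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
p_certainty = [0.9, 0.8, 0.05, 0.02, 0.6]
p_uncertainty = [0.95, 0.55, 0.1, 0.2, 0.5]
to_annotate, auto_nonmatch = select_queries(features, p_certainty, p_uncertainty)
print(to_annotate, auto_nonmatch)
```

Restricting manual queries to the matches pool is what addresses the imbalance in this sketch: the annotator's budget is spent on the rare class, while the abundant non-match class is filled in automatically.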
Acknowledgments
This work is partially supported by NSFC (No. 61003054, No. 61170020); the College Natural Science Research Project of Jiangsu Province (No. 10KJB520018); the Science and Technology Support Program of Suzhou (No. SG201257); the Science and Technology Support Program of Jiangsu Province (No. BE2012075); and the Open Fund of the Jiangsu Province Software Engineering R&D Center (SX201205).
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhao, P., Xin, J., Xian, X., Cui, Z. (2014). Active Learning for Duplicate Record Identification in Deep Web. In: Wen, Z., Li, T. (eds) Foundations of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54924-3_12
DOI: https://doi.org/10.1007/978-3-642-54924-3_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54923-6
Online ISBN: 978-3-642-54924-3
eBook Packages: Engineering, Engineering (R0)