Active Learning for Duplicate Record Identification in Deep Web

Zhao, Pengpeng; Xin, Jie; Xian, Xuefeng; Cui, Zhiming

doi:10.1007/978-3-642-54924-3_12

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 277)

Abstract

Active learning is important for duplicate record identification because manually assembling a suitable set of labeled examples is difficult. Duplicate record identification also suffers from imbalanced data: non-match samples far outnumber match samples, which leads to poor prediction performance on the match class. In this paper, we present a new active learning approach that takes certainty, uncertainty, and representativeness into account. Our method first trains two feature-subspace classifiers and uses the certainty classifier to generate a pool of likely matches, from which informative match samples are selected for manual annotation using an uncertainty and density measurement; meanwhile, non-match samples are labeled automatically to reduce human annotation effort. We include a detailed experimental evaluation on real-world data that demonstrates the effectiveness of our algorithms.
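
One selection round of the kind the abstract describes might look roughly like the sketch below. It is an illustration under stated assumptions, not the chapter's actual algorithm: the nearest-centroid classifier, the 0.5 and 0.2 thresholds, the product of uncertainty and density as the ranking score, and the names `fit_centroids`, `match_probability`, and `select_queries` are all hypothetical choices made only to keep the example small.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in subspace classifier (assumption): one mean vector per class."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def match_probability(centroids, X):
    """Pseudo-probability of 'match' from relative distance to the two centroids."""
    c_non, c_match = centroids
    d_non = np.linalg.norm(X - c_non, axis=1)
    d_match = np.linalg.norm(X - c_match, axis=1)
    return d_non / (d_non + d_match + 1e-12)

def select_queries(X_lab, y_lab, X_unl, cols_a, cols_b,
                   n_queries=2, non_match_cutoff=0.2):
    """Return (indices to annotate manually, indices auto-labeled as non-match)."""
    X_lab, y_lab, X_unl = np.asarray(X_lab), np.asarray(y_lab), np.asarray(X_unl)

    # Two classifiers, each trained on its own feature subspace.
    certainty = fit_centroids(X_lab[:, cols_a], y_lab)
    uncertainty_clf = fit_centroids(X_lab[:, cols_b], y_lab)
    p_cert = match_probability(certainty, X_unl[:, cols_a])
    p_unc = match_probability(uncertainty_clf, X_unl[:, cols_b])

    # The certainty classifier builds the matches pool; samples it confidently
    # rejects are labeled non-match without human effort (assumed cutoff).
    pool = np.flatnonzero(p_cert >= 0.5)
    auto_non_match = np.flatnonzero(p_cert <= non_match_cutoff)

    # Uncertainty is 1 when the second classifier outputs 0.5, 0 at 0 or 1.
    uncertainty = 1.0 - 2.0 * np.abs(p_unc[pool] - 0.5)

    # Density: mean similarity of each pool member to the whole pool,
    # used here as a simple proxy for representativeness.
    P = X_unl[pool]
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    density = (1.0 / (1.0 + dists)).mean(axis=1)

    # Rank pool members by uncertainty times density (assumed combination).
    ranked = pool[np.argsort(-(uncertainty * density))]
    return ranked[:n_queries], auto_non_match
```

Here each row is assumed to be a vector of field-similarity scores for one record pair, and `cols_a` and `cols_b` pick the two feature subspaces. The returned query indices would go to a human annotator, the auto-labeled non-matches would join the training set directly, and the classifiers would be retrained before the next round.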

Acknowledgments

This work is partially supported by NSFC (No. 61003054, No. 61170020); College Natural Science Research Project of Jiangsu Province (No. 10KJB520018); Science and Technology Support Program of Suzhou (No. SG201257); Science and Technology Support Program of Jiangsu Province (No. BE2012075); and Open Fund of Jiangsu Province Software Engineering R&D Center (SX201205).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Pengpeng Zhao, Jie Xin, Xuefeng Xian & Zhiming Cui

Corresponding author

Correspondence to Zhiming Cui.

Editor information

Editors and Affiliations

College of Computer and Software Engineering, Shenzhen University, Shenzhen, China
Zhenkun Wen
School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
Tianrui Li

Cite this paper

Zhao, P., Xin, J., Xian, X., Cui, Z. (2014). Active Learning for Duplicate Record Identification in Deep Web. In: Wen, Z., Li, T. (eds) Foundations of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54924-3_12

DOI: https://doi.org/10.1007/978-3-642-54924-3_12
Published: 20 June 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54923-6
Online ISBN: 978-3-642-54924-3
eBook Packages: Engineering, Engineering (R0)