Active Learning for Duplicate Record Identification in Deep Web

  • Conference paper
Foundations of Intelligent Systems

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 277)

Abstract

Active learning is important for duplicate record identification because manually assembling a suitable set of labeled examples is difficult. The class imbalance problem in duplicate record identification, in which non-match samples far outnumber match samples, leads to poor prediction performance on the match class. In this paper, we present a new active learning approach that takes certainty, uncertainty, and representativeness into account. Our method first trains two feature-subspace classifiers and uses the certainty classifier to generate a pool of candidate matches, from which informative match samples are selected for manual annotation using an uncertainty and density measure; meanwhile, non-match samples are labeled automatically to reduce human annotation effort. We include a detailed experimental evaluation on real-world data that demonstrates the effectiveness of our algorithms.
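The selection step outlined in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `select_queries`, the 0.5 pool threshold, the uncertainty formula `1 - |2p - 1|`, the use of mean cosine similarity as the density measure, and the product used to combine the two. The classifier outputs are passed in as plain probability lists rather than produced by trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_queries(features, p_certainty, p_uncertainty, n_queries=1, n_auto=1):
    """Hypothetical selection step (an assumption, not the paper's exact rule).

    features      -- similarity vectors, one per candidate record pair
    p_certainty   -- match probabilities from the certainty classifier
    p_uncertainty -- match probabilities from the second subspace classifier
    Returns (indices to annotate manually, indices auto-labeled as non-match).
    """
    # Step 1: the certainty classifier proposes a pool of likely matches.
    pool = [i for i, p in enumerate(p_certainty) if p >= 0.5]

    def informativeness(i):
        # Uncertainty peaks when the second classifier's probability is near 0.5.
        uncertainty = 1.0 - abs(2.0 * p_uncertainty[i] - 1.0)
        # Density: mean cosine similarity to the other pool members.
        others = [j for j in pool if j != i]
        if others:
            density = sum(cosine(features[i], features[j]) for j in others) / len(others)
        else:
            density = 1.0
        return uncertainty * density

    # Step 2: query the annotator on the most informative pool members.
    to_annotate = sorted(pool, key=informativeness, reverse=True)[:n_queries]

    # Step 3: outside the pool, auto-label the pairs that the certainty
    # classifier considers least likely to match.
    pool_set = set(pool)
    rest = [i for i in range(len(features)) if i not in pool_set]
    auto_non_matches = sorted(rest, key=lambda i: p_certainty[i])[:n_auto]
    return to_annotate, auto_non_matches


# Toy data (made up): five candidate pairs with two similarity features each.
features = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.9], [0.1, 0.1], [0.2, 0.0]]
p_certainty = [0.9, 0.8, 0.6, 0.1, 0.05]
p_uncertainty = [0.95, 0.55, 0.5, 0.2, 0.1]

to_annotate, auto_non_matches = select_queries(features, p_certainty, p_uncertainty)
print(to_annotate, auto_non_matches)
```

In this toy run, pair 2 is the most uncertain, but pair 1 is nearly as uncertain and lies in a denser region of the pool, so the combined score prefers pair 1. That is the intended effect of adding a representativeness term alongside uncertainty.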

Acknowledgments

This work is partially supported by NSFC (No. 61003054, No. 61170020); College Natural Science Research project of Jiangsu Province (No. 10KJB520018); Science and Technology Support Program of Suzhou (No. SG201257); Science and Technology Support program of Jiangsu province (No. BE2012075); and Open fund of Jiangsu Province Software Engineering R&D Center (SX201205).

Author information

Corresponding author

Correspondence to Zhiming Cui.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhao, P., Xin, J., Xian, X., Cui, Z. (2014). Active Learning for Duplicate Record Identification in Deep Web. In: Wen, Z., Li, T. (eds) Foundations of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54924-3_12

  • DOI: https://doi.org/10.1007/978-3-642-54924-3_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54923-6

  • Online ISBN: 978-3-642-54924-3

  • eBook Packages: Engineering, Engineering (R0)
