Information Integration of Partially Labeled Data

Rendle, Steffen; Schmidt-Thieme, Lars

doi:10.1007/978-3-540-78246-9_21

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

A central task when integrating data from different sources is to detect identical items. For example, price comparison websites have to identify offers for identical products. This task is known under several names, including record linkage, object identification, and duplicate detection.

In this work, we examine problem settings where some relations between items are given in advance, for example through EAN article codes in an e-commerce scenario or through manually labeled parts of the data. To represent and solve these problems, we draw on ideas from semi-supervised and constrained clustering, expressed as pairwise must-link and cannot-link constraints. We show that extending object identification with pairwise constraints yields an expressive framework that subsumes many variants of the integration problem, such as traditional object identification, matching, iterative problems, and an active learning setting.
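As a rough illustration of how known article codes could be turned into pairwise constraints, the following Python sketch derives must-link and cannot-link pairs from partially labeled offers. The function name `constraints_from_codes`, the offer identifiers, and the EAN values are hypothetical. The sketch also assumes that two different codes always denote different products, which the abstract does not state.

```python
from itertools import combinations


def constraints_from_codes(offers):
    """Derive pairwise constraints from partially labeled offers.

    offers maps an offer id to its EAN code, or to None when no code is known.
    Assumption (not from the chapter): equal codes mean the same product,
    different codes mean different products.
    """
    must_link, cannot_link = set(), set()
    # Only offers with a known code can take part in a constraint.
    labeled = sorted(o for o, ean in offers.items() if ean is not None)
    for a, b in combinations(labeled, 2):
        if offers[a] == offers[b]:
            must_link.add((a, b))
        else:
            cannot_link.add((a, b))
    return must_link, cannot_link


# Hypothetical example: three offers carry a code, one does not.
offers = {
    "o1": "4006381333931",
    "o2": "4006381333931",
    "o3": "5901234123457",
    "o4": None,
}
must_link, cannot_link = constraints_from_codes(offers)
print(must_link)    # pairs known to be the same product
print(cannot_link)  # pairs known to be different products
```

The unlabeled offer `o4` appears in no constraint, so its assignment is left entirely to the identification model.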

To solve these integration tasks, we propose an extension to current object identification models that ensures consistent solutions to problems with constraints. Our evaluation shows that additionally taking the labeled data into account dramatically increases the quality of state-of-the-art object identification systems.
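The abstract does not spell out how consistency is enforced. One simple way to obtain a solution that never violates the given constraints is a greedy, single-linkage style clustering that first merges all must-link pairs and then skips any merge that would join a cannot-link pair. The Python sketch below shows this idea only and is not the authors' model; `constrained_clusters`, the similarity values, and the threshold are assumptions made for illustration.

```python
from itertools import combinations


def constrained_clusters(items, sim, must_link, cannot_link, threshold):
    """Greedy clustering that respects pairwise constraints (illustrative only)."""
    parent = {x: x for x in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def violates(a, b):
        # True if merging the clusters of a and b joins a cannot-link pair.
        ra, rb = find(a), find(b)
        return any({find(u), find(v)} == {ra, rb} for u, v in cannot_link)

    # Step 1: must-link pairs are merged unconditionally.
    for a, b in must_link:
        parent[find(a)] = find(b)
    for u, v in cannot_link:
        if find(u) == find(v):
            raise ValueError("constraints are contradictory")

    # Step 2: merge the most similar pairs first, skipping forbidden merges.
    for a, b in sorted(combinations(items, 2), key=lambda p: sim(*p), reverse=True):
        if sim(a, b) < threshold:
            break
        if find(a) != find(b) and not violates(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise similarities between four offers.
scores = {
    frozenset(("o1", "o2")): 0.90, frozenset(("o2", "o3")): 0.80,
    frozenset(("o1", "o3")): 0.30, frozenset(("o3", "o4")): 0.20,
    frozenset(("o2", "o4")): 0.15, frozenset(("o1", "o4")): 0.10,
}
result = constrained_clusters(
    ["o1", "o2", "o3", "o4"],
    lambda a, b: scores[frozenset((a, b))],
    must_link={("o3", "o4")},
    cannot_link={("o1", "o2")},
    threshold=0.5,
)
print(result)  # [['o1'], ['o2', 'o3', 'o4']]
```

In this toy input, the cannot-link constraint blocks the most similar pair (`o1`, `o2`) from merging, while the must-link constraint pulls `o4` into a cluster that similarity alone would not have placed it in.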

Author information

Authors and Affiliations

Information Systems and Machine Learning Lab, University of Hildesheim, Germany
Steffen Rendle & Lars Schmidt-Thieme

Editor information

Editors and Affiliations

Institute of Computer Science and Institute of Business Economics and Information Systems, University of Hildesheim, Marienburgerplatz 22, 31141, Hildesheim, Germany
Christine Preisach
Lehrstuhl für Mustererkennung und Bildverarbeitung, Universität Freiburg, Gebäude 052, 79110, Freiburg i. Br, Germany
Hans Burkhardt
Institute of Computer Science and Institute of Business Economics and Information Systems, Marienburgerplatz 22, 31141, Hildesheim, Germany
Lars Schmidt-Thieme
Fakultät für Wirtschaftswissenschaften, Lehrstuhl für Betriebswirtschaftslehre, insbes. Marketing, Universitätsstraße 25, 33615, Bielefeld, Germany
Reinhold Decker

Cite this paper

Rendle, S., Schmidt-Thieme, L. (2008). Information Integration of Partially Labeled Data. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_21

DOI: https://doi.org/10.1007/978-3-540-78246-9_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics (R0)
