Models and Indices for Integrating Unstructured Data with a Relational Database

Sarawagi, Sunita

doi:10.1007/978-3-540-31841-5_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3377)

Included in the following conference series:

International Workshop on Knowledge Discovery in Inductive Databases

Abstract

Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. This integration requires extracting structured column values from the unstructured source and mapping them to known database entities. Existing methods of data integration do not effectively exploit the wealth of information available in multi-relational entities.

We present statistical models for co-reference resolution and information extraction in a database setting. We then go over the performance challenges of training and applying these models efficiently over very large databases. This requires us to break open a black box statistical model and extract predicates over indexable attributes of the database. We show how to extract such predicates for several classification models, including naive Bayes classifiers and support vector machines. We extend these indexing methods for supporting similarity predicates needed during data integration.
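
As a rough illustration of what extracting predicates over indexable attributes could look like for a naive Bayes classifier, consider the minimal sketch below. It is not taken from the chapter: the column names, probabilities, prior, and all function names are hypothetical, and the envelope construction is one simple possibility rather than the author's method. For each column it keeps only the values that could still lead to a positive classification when every other column takes its most favourable value, so the resulting per-column value lists form a conservative filter that an ordinary index can evaluate before the model is applied.

```python
import math
from itertools import product

# Hypothetical per-class value probabilities for two indexable columns.
POS = {"city": {"Mumbai": 0.7, "Pune": 0.2, "Delhi": 0.1},
       "venue": {"journal": 0.6, "conf": 0.3, "web": 0.1}}
NEG = {"city": {"Mumbai": 0.1, "Pune": 0.3, "Delhi": 0.6},
       "venue": {"journal": 0.2, "conf": 0.3, "web": 0.5}}
PRIOR_POS = 0.2  # assumed prior probability of the positive class


def log_odds_weights(pos, neg):
    """w[col][v] = log P(v | pos) - log P(v | neg) for every column value."""
    return {col: {v: math.log(pos[col][v]) - math.log(neg[col][v])
                  for v in pos[col]}
            for col in pos}


def is_positive(record, weights, bias):
    """Naive Bayes decision: positive when the total log-odds exceed zero."""
    return bias + sum(weights[col][record[col]] for col in weights) > 0


def envelope_predicate(weights, bias):
    """Per-column sets of values that do not rule out a positive decision.

    A value is kept if the record could still be positive when all other
    columns take their highest-weight value. The conjunction of these
    sets contains every positive record (it may also admit negatives).
    """
    best = {col: max(w.values()) for col, w in weights.items()}
    allowed = {}
    for col, w in weights.items():
        rest = sum(best[other] for other in weights if other != col)
        allowed[col] = {v for v, wv in w.items() if bias + wv + rest > 0}
    return allowed


def passes(record, allowed):
    """Index-friendly filter: each column value must be in its allowed set."""
    return all(record[col] in allowed[col] for col in allowed)


WEIGHTS = log_odds_weights(POS, NEG)
BIAS = math.log(PRIOR_POS) - math.log(1.0 - PRIOR_POS)
ALLOWED = envelope_predicate(WEIGHTS, BIAS)

# Sanity check: no record classified positive is rejected by the filter.
for city, venue in product(POS["city"], POS["venue"]):
    rec = {"city": city, "venue": venue}
    assert not is_positive(rec, WEIGHTS, BIAS) or passes(rec, ALLOWED)
```

In a database setting, the sets in `ALLOWED` would translate into a conjunction of `IN` conditions on the two columns, and the full classifier would then be run only on rows that survive that filter.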

Author information

Authors and Affiliations

IIT Bombay
Sunita Sarawagi

Editor information

Editors and Affiliations

Mathematics and Computer Science Department, University of Antwerp, Middelheimlaan 1, 2020, Antwerp, Belgium
Bart Goethals
Department of Computer Science, Universiteit Utrecht
Arno Siebes

About this paper

Cite this paper

Sarawagi, S. (2005). Models and Indices for Integrating Unstructured Data with a Relational Database. In: Goethals, B., Siebes, A. (eds) Knowledge Discovery in Inductive Databases. KDID 2004. Lecture Notes in Computer Science, vol 3377. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31841-5_1

DOI: https://doi.org/10.1007/978-3-540-31841-5_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25082-1
Online ISBN: 978-3-540-31841-5
eBook Packages: Computer Science, Computer Science (R0)
