Skip to main content

Models and Indices for Integrating Unstructured Data with a Relational Database

  • Conference paper
Knowledge Discovery in Inductive Databases (KDID 2004)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3377))

Included in the following conference series:

Abstract

Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. This integration requires extracting structured column values from the unstructured source and mapping them to known database entities. Existing methods of data integration do not effectively exploit the wealth of information available in multi-relational entities.

We present statistical models for co-reference resolution and information extraction in a database setting. We then go over the performance challenges of training and applying these models efficiently over very large databases. This requires us to break open a black box statistical model and extract predicates over indexable attributes of the database. We show how to extract such predicates for several classification models, including naive Bayes classifiers and support vector machines. We extend these indexing methods for supporting similarity predicates needed during data integration.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In: Sixth Workshop on Very Large Corpora New Brunswick. Association for Computational Linguistics, New Jersey (1998)

    Google Scholar 

  2. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD (2003)

    Google Scholar 

  3. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, IIWeb-2003 (2003) (to appear)

    Google Scholar 

  4. Cohen, W.W., Sarawagi, S.: Exploiting dictionaries in named entity extraction: Combining semi-markov extraction processes and data integration methods. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004) (to appear)

    Google Scholar 

  5. Jordan, M.I.: Graphical models. Statistical Science (Special Issue on Bayesian Statistics) 19, 140–155 (2004)

    MATH  Google Scholar 

  6. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning (ICML-2001), Williams, MA (2001)

    Google Scholar 

  7. McCallum, A., Wellner, B.: Toward conditional models of identity uncertainty with application to proper noun coreference. In: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, Acapulco, Mexico, August 2003, pp. 79–86 (2003)

    Google Scholar 

  8. Parag, Domingos, P.: Multi-relational record linkage. In: Proceedings of 3rd Workshop on Multi-Relational Data Mining at ACM SIGKDD, Seattle, WA (August 2004)

    Google Scholar 

  9. Sarawagi, S., Cohen, W.W.: Semi-markov conditional random fields for information extraction. In: NIPs (2004) (to appear)

    Google Scholar 

  10. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2004)

    Google Scholar 

  11. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: Proceedings of HLT-NAACL (2003)

    Google Scholar 

  12. Theobald, M., Weikum, G., Schenkel, R.: Top-k query evaluation with probabilistic guarantees. In: VLDB, pp. 648–659 (2004)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sarawagi, S. (2005). Models and Indices for Integrating Unstructured Data with a Relational Database. In: Goethals, B., Siebes, A. (eds) Knowledge Discovery in Inductive Databases. KDID 2004. Lecture Notes in Computer Science, vol 3377. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31841-5_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-31841-5_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25082-1

  • Online ISBN: 978-3-540-31841-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics