Beauty and the Beast: The Theory and Practice of Information Integration

Haas, Laura

doi:10.1007/11965893_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4353)

Included in the following conference series:

International Conference on Database Theory

Abstract

Information integration is becoming a critical problem for businesses and individuals alike. Data volumes are skyrocketing, and new sources and types of information are proliferating. This paper briefly reviews some of the key research accomplishments in information integration (theory and systems), then describes the current state of the art in commercial practice and the challenges (still) faced by CIOs and application developers. One critical challenge is choosing the right combination of tools and technologies to do the integration. Although each has been studied separately, we lack a unified (and certainly, a unifying) understanding of these various approaches to integration. Experience with a variety of integration projects suggests that we need a broader framework, perhaps even a theory, which explicitly takes into account requirements on the result of the integration, and considers the entire end-to-end integration process.

Author information

Authors and Affiliations

IBM Almaden Research Center, 650 Harry Road, San Jose, CA, 95120, USA
Laura Haas

Editor information

Editors and Affiliations

Technical University of Dortmund
Thomas Schwentick
Department of Computer Science and Engineering, University of Washington
Dan Suciu

About this paper

Cite this paper

Haas, L. (2006). Beauty and the Beast: The Theory and Practice of Information Integration. In: Schwentick, T., Suciu, D. (eds) Database Theory – ICDT 2007. ICDT 2007. Lecture Notes in Computer Science, vol 4353. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965893_3

DOI: https://doi.org/10.1007/11965893_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69269-0
Online ISBN: 978-3-540-69270-6
eBook Packages: Computer Science (R0)
