Beauty and the Beast: The Theory and Practice of Information Integration

  • Conference paper
Database Theory – ICDT 2007 (ICDT 2007)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4353))

Abstract

Information integration is becoming a critical problem for businesses and individuals alike. Data volumes are skyrocketing, and new sources and types of information are proliferating. This paper briefly reviews some of the key research accomplishments in information integration (theory and systems), then describes the current state of the art in commercial practice and the challenges (still) faced by CIOs and application developers. One critical challenge is choosing the right combination of tools and technologies to do the integration. Although each has been studied separately, we lack a unified (and certainly, a unifying) understanding of these various approaches to integration. Experience with a variety of integration projects suggests that we need a broader framework, perhaps even a theory, which explicitly takes into account requirements on the result of the integration, and considers the entire end-to-end integration process.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Haas, L. (2006). Beauty and the Beast: The Theory and Practice of Information Integration. In: Schwentick, T., Suciu, D. (eds) Database Theory – ICDT 2007. ICDT 2007. Lecture Notes in Computer Science, vol 4353. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965893_3

  • DOI: https://doi.org/10.1007/11965893_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69269-0

  • Online ISBN: 978-3-540-69270-6

  • eBook Packages: Computer Science, Computer Science (R0)
