Beauty and the Beast: The Theory and Practice of Information Integration

  • Laura Haas
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4353)

Abstract

Information integration is becoming a critical problem for businesses and individuals alike. Data volumes are skyrocketing, and new sources and types of information are proliferating. This paper briefly reviews some of the key research accomplishments in information integration (theory and systems), then describes the current state of the art in commercial practice and the challenges (still) faced by CIOs and application developers. One critical challenge is choosing the right combination of tools and technologies to do the integration. Although each has been studied separately, we lack a unified (and certainly, a unifying) understanding of these various approaches to integration. Experience with a variety of integration projects suggests that we need a broader framework, perhaps even a theory, which explicitly takes into account requirements on the result of the integration, and considers the entire end-to-end integration process.

Keywords

Information integration, data integration, data exchange, data cleansing, federation, extract/transform/load

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Laura Haas, IBM Almaden Research Center, San Jose, USA
