Abstract
Information integration is becoming a critical problem for businesses and individuals alike. Data volumes are skyrocketing, and new sources and types of information are proliferating. This paper briefly reviews some of the key research accomplishments in information integration (theory and systems), then describes the current state of the art in commercial practice and the challenges (still) faced by CIOs and application developers. One critical challenge is choosing the right combination of tools and technologies to do the integration. Although each has been studied separately, we lack a unified (and certainly, a unifying) understanding of these various approaches to integration. Experience with a variety of integration projects suggests that we need a broader framework, perhaps even a theory, that explicitly takes into account requirements on the result of the integration and considers the entire end-to-end integration process.
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Haas, L. (2006). Beauty and the Beast: The Theory and Practice of Information Integration. In: Schwentick, T., Suciu, D. (eds) Database Theory – ICDT 2007. ICDT 2007. Lecture Notes in Computer Science, vol 4353. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965893_3
DOI: https://doi.org/10.1007/11965893_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69269-0
Online ISBN: 978-3-540-69270-6
eBook Packages: Computer Science (R0)