Abstract
The advent of the Internet and the Web, and their subsequent ubiquity, have created opportunities to connect information sources across all types of boundaries (local, regional, organizational, etc.). Examples of such information sources include databases, XML documents, and other unstructured sources. Uniformly querying these information sources has been extensively investigated. A major challenge relates to query optimization. Indeed, querying multiple information sources scattered across the Web raises several barriers to achieving efficiency. This is due to the characteristics of Web information sources, which include volatility, heterogeneity, and autonomy. These characteristics impede a straightforward application of classical query optimization techniques. They add new dimensions to the optimization problem, such as the choice of objective function, the selection of relevant information sources, limited query capabilities, and unpredictable events. In this paper, we survey current research on the fundamental problems of efficiently processing queries over Web data integration systems. We also outline a classification of optimization techniques and a framework for evaluating them.
Ouzzani, M., Bouguettaya, A. Query Processing and Optimization on the Web. Distributed and Parallel Databases 15, 187–218 (2004). https://doi.org/10.1023/B:DAPD.0000018574.71588.06