Abstract
The envisioned Semantic Web aims to provide richly annotated and explicitly structured Web pages in XML, RDF, or description logics, based upon underlying ontologies and thesauri. Ideally, this should enable a wealth of query processing and semantic reasoning capabilities using XQuery and logical inference engines. However, we believe that the diversity and uncertainty of terminologies and schema-like annotations will make precise querying on a Web scale extremely elusive if not hopeless, and the same argument holds for large-scale dynamic federations of Deep Web sources. Therefore, ontology-based reasoning and querying need to be enhanced by statistical means, leading to relevance-ranked lists as query results.
This paper presents steps towards such a “statistically semantic” Web and outlines technical challenges. We discuss how statistically quantified ontological relations can be exploited in XML retrieval, how statistics can help in making Web-scale search efficient, and how statistical information extracted from users’ query logs and click streams can be leveraged for better search result ranking. We believe these are decisive issues for improving the quality of next-generation search engines for intranets, digital libraries, and the Web, and they are also crucial for peer-to-peer collaborative Web search.
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Weikum, G., Graupmann, J., Schenkel, R., Theobald, M. (2004). Towards a Statistically Semantic Web. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, TW. (eds) Conceptual Modeling – ER 2004. ER 2004. Lecture Notes in Computer Science, vol 3288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30464-7_2
DOI: https://doi.org/10.1007/978-3-540-30464-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23723-5
Online ISBN: 978-3-540-30464-7
eBook Packages: Springer Book Archive