Towards a Statistically Semantic Web

  • Gerhard Weikum
  • Jens Graupmann
  • Ralf Schenkel
  • Martin Theobald
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3288)


The envisioned Semantic Web aims to provide richly annotated and explicitly structured Web pages in XML, RDF, or description logics, based upon underlying ontologies and thesauri. Ideally, this should enable a wealth of query processing and semantic reasoning capabilities using XQuery and logical inference engines. However, we believe that the diversity and uncertainty of terminologies and schema-like annotations will make precise querying on a Web scale extremely elusive if not hopeless, and the same argument holds for large-scale dynamic federations of Deep Web sources. Therefore, ontology-based reasoning and querying need to be enhanced by statistical means, leading to relevance-ranked lists as query results.

This paper presents steps towards such a “statistically semantic” Web and outlines technical challenges. We discuss how statistically quantified ontological relations can be exploited in XML retrieval, how statistics can help in making Web-scale search efficient, and how statistical information extracted from users’ query logs and click streams can be leveraged for better search result ranking. We believe these are decisive issues for improving the quality of next-generation search engines for intranets, digital libraries, and the Web, and they are also crucial for peer-to-peer collaborative Web search.


Keywords (added by machine, not by the authors): Query Processing, Index List, Semantic Overlay Network, Metasearch Engine, Alexandria Digital Library





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Gerhard Weikum (1)
  • Jens Graupmann (1)
  • Ralf Schenkel (1)
  • Martin Theobald (1)
  1. Max-Planck Institute of Computer Science, Saarbruecken, Germany
