Information Retrieval

, Volume 8, Issue 4, pp 521–545 | Cite as

Semantic Similarity Search on Semistructured Data with the XXL Search Engine

Article

Abstract

Query languages for XML such as XPath or XQuery support Boolean retrieval: a query result is a (possibly restructured) subset of XML elements or entire documents that satisfy the search conditions of the query. This search paradigm works for highly schematic XML data collections such as electronic catalogs. However, for searching information in open environments such as the Web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a ranked list of XML elements in descending order of (estimated) relevance. Web search engines, which are based on the ranked retrieval paradigm, do, however, not consider the additional information and rich annotations provided by the structure of XML documents and their element names.

This article presents the XXL search engine that supports relevance ranking on XML data. XXL is particularly geared for path queries with wildcards that can span multiple XML collections and contain both exact-match as well as semantic-similarity search conditions. In addition, ontological information and suitable index structures are used to improve the search efficiency and effectiveness. XXL is fully implemented as a suite of Java classes and servlets. Experiments in the context of the INEX benchmark demonstrate the efficiency of the XXL search engine and underline its effectiveness for ranked retrieval.

Keywords

XML retrieval semistructured data ranked retrieval 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aboulnaga A, et al. (2001) Estimating the selectivity of XML path expressions for internet scale applications. In: VLDB 2001, pp. 591–600.Google Scholar
  2. Agirre E and Rigau G (1996) Word Sense disambiguation using conceptual density. In: 16th Int. Conf. on Computational Linguistics 1996, pp. 16–22.Google Scholar
  3. Amer-Yahia S, Botev C and Shanmugasundaram J (2004) TeXQuery: A Full-Text Search Extension to XQuery. In: WWW 2004. Online Proceedings, available from http://www2004.org/.
  4. Amer-Yahia S, et al. (2002) Tree pattern relaxation. In: Jensen CS et al., Eds., EDBT 2002, pp. 496–513.Google Scholar
  5. Baeza-Yates RA and Riberto-Neto B, Eds. (1999) Modern Information Retrieval. Addison Wesley.Google Scholar
  6. Blanken H, Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. (2003) Intelligent Search on XML Data, vol. 2818 of LNCS.Google Scholar
  7. Boag S, et al. (2002) XQuery 1.0: An XML query language. W3c recommendation, World Wide Web Consortium. http://www.w3.org/TR/xquery.
  8. Brin S and Page L (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7):107–117.Google Scholar
  9. Bruno N, Koudas N and Srivastava D (2002) Holistic twig joins: Optimal XML pattern matching. In: SIGMOD 2002, pp. 310–321.Google Scholar
  10. Budanitsky A and Hirst G (2001) Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In: Workshop on WordNet and Other Lexical Resources. Online Proceedings, available from http://engr.smu.edu/rada/mwnw/.
  11. Chen Z, et al. (2001) Counting twig matches in a tree. In: ICDE 2001, pp. 395–404.Google Scholar
  12. Chinenyanga TT and Kushmerick N (2001) Expressive and efficient ranked querying of xml data. In: WebDB 2001, pp. 1–6.Google Scholar
  13. Chinenyanga TT and Kushmerick N (2002) An expressive and efficient language for XML information retrieval. Journal of the Americal Society for Information Science & Technology, 53(6):438–453.Google Scholar
  14. Choi B, et al. (2003) On the optimality of holistic algorithms for Twig Queries. In: DEXA 2003, pp. 28–37.Google Scholar
  15. Chung C-W, et al. (2002) APEX: An adaptive path index for xml data. In: SIGMOD 2002, pp. 121–132.Google Scholar
  16. Cohen E, et al. (2002) Reachability and distance queries via 2-hop labels. In: 13th ACM-SIAM Symposium on Discrete algorithms (SODA 2002), pp. 937–946.Google Scholar
  17. Cohen S, et al. (2003) XSEarch: A semantic search engine for xml. In: VLDB 2003, pp. 45–56.Google Scholar
  18. Cohen WW (1999) Recognizing structure in web pages using Similarity Queries. In: 16th National Conference of Artificial Intelligence (AAAI)/11th Conference on Innovative Applications of Artificial Intelligence (IAAI), pp. 59–66.Google Scholar
  19. Cormen TH, Leiserson CE and Rivest RL (2001) Introduction to Algorithms. 2nd edition, MIT Press.Google Scholar
  20. Deutsch A, Fernandez MF, Florescu D, Levy AY and Suciu D (1998) XML-QL. In: QL ‘98, The Query Languages Workshop, W3C Workshop. available from http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/.
  21. Fellbaum C, ed. (1998) WordNet: An Electronic Lexical Database. MIT Press.Google Scholar
  22. Fuhr N and Großjohann K (2001) ‘XIRQL: A query language for information retrievalin XML documents. In: SIGIR 2001, pp. 172–180.Google Scholar
  23. Graupmann J, Biwer M, Zimmer C, Zimmer P, Bender M, Theobald M and Weikum G (2004) COMPASS: A concept-based Web search engine for html, xml, and deep Web data (demo). In: VLDB 2004.Google Scholar
  24. Grust T (2002)Accelerating XPath location steps. In: SIGMOD 2002, pp. 109–120.Google Scholar
  25. Grust T and van Keulen M (2003) Tree awareness for relational dbms kernels staircase join. In: Blanken H, Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. Intelligent Search on XML Data, vol. 2818 of LNCS, pp. 231–245.Google Scholar
  26. Guha S, et al. (2002) Approximate XML joins. In: SIGMOD 2002, pp. 278–298.Google Scholar
  27. Guo L, et al. (2003) XRANK: ranked keyword search over XML documents. In: SIGMOD 2003, pp. 16–27.Google Scholar
  28. Hayashi Y, et al. (2000) Searching text-rich xml documentswith Relevance Ranking. In: ACM SIGIR 2000 Workshop on XML and Information Retrieval. Online Proceedings, available from http://www.haifa.il.ibm.com/sigir00-xml/.
  29. Hirst G and St-Onge D (1998) Lexical chains as representations of context for the detection and correction of malapropisms. In: Fellbaum C, Ed. WordNet: An Electronic Lexical Database. MIT Press, pp. 305–332.Google Scholar
  30. Horrocks I (2002) DAML+OIL: A reason-able Web ontology language. in: edbt 2002, pp. 2–13.Google Scholar
  31. Jamasz M and Szpankowicz S (2001) Roget’s thesaurus: A lexical resource to treasure. in Proceedings of the NAACL Workshop “WordNet and Other Lexical Resources,” Pittsburg, pp. 186–188.Google Scholar
  32. Jamasz M and Szpankowicz S (2003) Roget’s thesaurus and semantic similarity. technicalReport TR-2003-01, University of Ottawa, Canada.Google Scholar
  33. Jiang H, et al. (2003) Holistic twig joins on indexed xml documents. In: VLDB 2003, pp. 273–284.Google Scholar
  34. Jiang JJ and Conrath DW (1997) Semantic similarity based on corpus statisticsand Lexical Taxonomy. In: 10th Int. Conf. on Research on Computational Linguistics (ROCLING 1997), Taipeh, Taiwan, pp. 19–33.Google Scholar
  35. K+02 Kaushik R, et al. (2002) Covering indexes for branching path queries. In: SIGMOD 2002, pp. 133–144.Google Scholar
  36. Kaushik R, et al. (2004) On the integration of structure indexes and Inverted Lists. In: SIGMOD 2004, pp. 779–790.Google Scholar
  37. Kazai G, et al. (2003) The INEX evaluation initiative. in: blanken h Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. Intelligent Search on XML Data, vol. 2818 of LNCS, pp. 279–293.Google Scholar
  38. Kleinberg J (1999) Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.Google Scholar
  39. Leacock C and Chodrow M (1998) Combining local context and wordnet Similarity for Word Sense Disambiguation. In: Fellbaum C, Ed. WordNet: An Electronic Lexical Database. MIT Press, pp. 265–283.Google Scholar
  40. Lenat DB and Guha RV (1990) Building Large Knowledge Based Systems. Addison Wesley.Google Scholar
  41. Lesk M (1969) Word-word association in document retrieval systems’. American Documentation, 20(1):27–38.Google Scholar
  42. Lewis WD (2002) Measuring conceptual distance using wordnet: the Design of a Metric for Measuring Semantic Similarity. The University of Arizona Working Papers in Linguistics, 12.Google Scholar
  43. Lin D (1998) An information-theoretic definition of similarity’. In: 15th Int. Conf. on Machine Learning (ICML 1998), pp. 296–304.Google Scholar
  44. Manning CD and Schuetze H (1999) Foundations of Statistical Natural Language Processing. The MIT Press.Google Scholar
  45. McHale ML (1998) A comparison of WordNet and Roget’s taxonomy for measuring semantic similarity. in: workshopon Usage of WordNet in Natural Language Processing Systems (COLING-ACL 1998). Online Proceedings, available from http://xxx.lanl.gov/abs/cmp-lg/9809003.
  46. Polyzotis N and Garofalakis MN (2002a) Statistical synopses for graph-structured xml databases. in SIGMOD 2002, pp. 358–369.Google Scholar
  47. Polyzotis N and Garofalakis MN (2002b) Structure and value synopses for xml data graphs. In: VLDB 2002, pp. 466–477.Google Scholar
  48. Qiu Y and Frei H-P (1993) Concept-based query expansion. In: SIGIR 1993, pp. 160–169.Google Scholar
  49. Qiu Y and Frei H-P (1995) Improving the retrieval effectiveness by a similarityThesaurus. Technical Report 225, Swiss Federate Institute of Technology, Zürich, Switzerland.Google Scholar
  50. Qun C, et al. (2003) D(k)-index: An adaptive structural summary for graphStructured Data. In: SIGMOD 2003, pp. 134–144.Google Scholar
  51. Rada R, et al. (1989) Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30.Google Scholar
  52. Resnik P (1995) Using information content to evaluate semanticSimilarity in a Taxonomy. In: 14th Int. Joint Conf. on Artificial Intelligence (IJCAI 95), Vol. 1. pp. 448–453.Google Scholar
  53. Resnik P (1999) Semantic similarity in a taxonomy: anInformation-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11:95–130.Google Scholar
  54. Richardson R, et al. (1994) Using WordNet as a knowledge base for measuring semantic similarity between words. In: Proceedings of the AICS Conference.Google Scholar
  55. Rubenstein H and Goodenough JB (1965) Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.Google Scholar
  56. Russel S and Norvig P (1995) Artificial Intelligence–-A Modern Approach. Prentice Hall.Google Scholar
  57. Schenkel R (2004) FliX: A flexible framework for indexing complex XML document collections. In: 1st Int. Workshop on Database Technologies for Handling XML Information on the Web.Google Scholar
  58. Schenkel R, Theobald A and Weikum G (2003) Ontology-enabled XML search. In: Blanken H, Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. Intelligent Search on XML Data, vol. 2818 of LNCS, pp. 119–131.Google Scholar
  59. Schenkel R, Theobald A and Weikum G (2004) HOPI: An efficient connection index for complex xml document collections. In: EDBT 2004, pp. 237–255.Google Scholar
  60. Schlieder T (2002) Schema-driven evaluation of approximate treepattern queries. In: EDBT 2002, pp. 514–532.Google Scholar
  61. Schlieder T and Meuss H (2000) Result ranking for structured queries against xml documents. In: DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries, available from http://www.ercim.org/publication/ws-proceedings/DelNoe01.
  62. SAD+00 Staab S, et al. (2000) Semantic community web portals. Computer Networks, 33(1–6):473–491.Google Scholar
  63. Sussna M (1993) Word sense disambiguation for free-textindexing using a Massive Semantic Network. In: 2nd Int. Conf. on Information and Knowledge Management (CIKM 1993), pp. 67–74.Google Scholar
  64. Theobald A and Weikum G (2000) Adding relevance to XML. In: 3rd Int. Workshop WebDB 2000, pp. 105–124.Google Scholar
  65. Theobald A and Weikum G (2002a) The index-based xxl searchengine for Querying XML data with relevance ranking. In: EDBT 2002, pp. 477–495.Google Scholar
  66. Theobald A and Weikum G (2002b) The XXL search engine: ranked retrieval ofxml data using indexes and ontologies. In: SIGMOD 2002, p. 615.Google Scholar
  67. Wu Y, Patel JM and Jagadish H (2002) Estimating answer sizes for xml queries in: edbt 2002, pp. 590–608.Google Scholar
  68. Wu Z and Palmer M (1994) Verb semantics and lexical selection. In: 32nd. Annual Meeting of the Association for Computational Linguistics 1994, pp. 133 –138.Google Scholar
  69. Zezula P, et al. (2003) Tree signatures for XML querying and navigation. in: 1stint. xml Database Symposium, pp. 149–163.Google Scholar

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  1. 1.Max-Planck-Institut für InformatikSaarbrückenGermany

Personalised recommendations