The VLDB Journal, Volume 17, Issue 1, pp 81–115

TopX: efficient and versatile top-k query processing for semistructured data

  • Martin Theobald
  • Holger Bast
  • Debapriyo Majumdar
  • Ralf Schenkel
  • Gerhard Weikum
Open Access
Special Issue Paper


Abstract

Recent IR extensions to XML query languages, such as XPath 2.0 Full-Text or the NEXI query language of the INEX benchmark series, reflect the emerging interest in IR-style ranked retrieval over semistructured data. TopX is a top-k retrieval engine for text and semistructured data. It terminates query execution as soon as it can safely determine the k top-ranked result elements according to a monotonic score aggregation function with respect to a multidimensional query. It efficiently supports vague search on both content- and structure-oriented query conditions, which allows dynamic query relaxation with controllable influence on the result ranking. This paper makes four main contributions: (1) fully implemented models and algorithms for ranked XML retrieval with XPath Full-Text functionality, (2) efficient and effective top-k query processing for semistructured data, (3) support for integrating thesauri and ontologies with statistically quantified relationships among concepts, leveraged for word-sense disambiguation and query expansion, and (4) a comprehensive description of the TopX system, with performance experiments on large-scale corpora such as TREC Terabyte and INEX Wikipedia.
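The early-termination idea can be illustrated with a minimal sketch of a classic threshold-style top-k algorithm over score-sorted index lists. This is not the TopX implementation, which additionally relies on cost-based index access scheduling and probabilistic candidate pruning; the function name `top_k_threshold`, the list layout, and the use of summation as the monotonic aggregation function are assumptions made for illustration only. Scores are assumed to be non-negative, and an item missing from a list is assumed to score 0 there.

```python
import heapq
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (item id, score for one query condition)


def top_k_threshold(lists: List[List[Posting]], k: int) -> Tuple[List[Posting], int]:
    """Return (top-k items with summed scores, best first; scan depth reached).

    Each list must be sorted by descending score. Summation is monotonic,
    so no unseen item can exceed the sum of the scores at the current
    scan positions (the threshold).
    """
    # Random-access tables: item -> score in each list (hypothetical layout).
    lookup: List[Dict[str, float]] = [dict(pl) for pl in lists]
    totals: Dict[str, float] = {}
    max_depth = max((len(pl) for pl in lists), default=0)
    depth = 0
    for depth in range(1, max_depth + 1):
        threshold = 0.0
        for pl in lists:
            if depth <= len(pl):
                item, score = pl[depth - 1]
                threshold += score  # upper bound contribution of this list
                if item not in totals:
                    # Resolve the full score by random access to all lists.
                    totals[item] = sum(tbl.get(item, 0.0) for tbl in lookup)
            # An exhausted list contributes 0 to the threshold.
        best = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        # Safe to stop: the k-th best seen item beats any unseen item.
        if len(best) == k and best[-1][1] >= threshold:
            return best, depth
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1]), depth


# Two hypothetical index lists, one per query condition.
list_a = [("d1", 0.75), ("d2", 0.5), ("d3", 0.25), ("d4", 0.125)]
list_b = [("d2", 1.0), ("d1", 0.5), ("d4", 0.25), ("d3", 0.125)]
result, depth_reached = top_k_threshold([list_a, list_b], k=2)
```

In this toy input the scan stops after two of the four positions, because the second-best total seen so far already reaches the threshold at that depth.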


Keywords

  • Efficient XML full-text search
  • Content- and structure-aware ranking
  • Top-k query processing
  • Cost-based index access scheduling
  • Probabilistic candidate pruning
  • Dynamic query expansion
  • DB&IR integration


Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Martin Theobald (1)
  • Holger Bast (1)
  • Debapriyo Majumdar (1)
  • Ralf Schenkel (1, corresponding author)
  • Gerhard Weikum (1)

  1. Max-Planck Institute for Informatics, Saarbruecken, Germany
