Advertisement

The VLDB Journal

, Volume 17, Issue 1, pp 151–168 | Cite as

Flexible and efficient IR using array databases

  • Roberto Cornacchia
  • Sándor Héman
  • Marcin Zukowski
  • Arjen P. de Vries
  • Peter Boncz
Open Access
Special Issue Paper

Abstract

The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), that works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language, in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing including compression, score materialization and quantization, such as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, that provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.

Keywords

Information retrieval Array databases Query optimization Database compression 

References

  1. 1.
    Ailamaki, A., DeWitt, D., Hill, M., Skounakis, M.: Weaving relations for cache performance. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 169–180, Rome (2001)Google Scholar
  2. 2.
    Anh V.N. and Moffat A. (2005). Inverted index compression using word-aligned binary codes. Inf. Retr. 8(1): 151–166 CrossRefGoogle Scholar
  3. 3.
    Anh, V.N., Moffat, A.: Simplified similarity scoring using term ranks. In: Proceedings of the International Conference on Information Retrieval (ACM SIGIR), pp. 226–233, Salvador (2005)Google Scholar
  4. 4.
    van Ballegooij, A., de~Vries, A., Kersten, M.L.: RAM: Array processing over a relational DBMS. Tech. Rep. INS-R0301, CWI (2003)Google Scholar
  5. 5.
    Barroso L.A., Dean J. and Holzle U. (2003). Web search for a planet: the google cluster architecture. IEEE Mio 23(2): 22–28 CrossRefGoogle Scholar
  6. 6.
    Baumann, P.: A database array algebra for spatio-temporal data and beyond. In: Next Generation Information Technologies and Systems, pp. 76–93 (1999)Google Scholar
  7. 7.
    Boncz, P., Zukowski, M., Nes, N.: MonetDB/X100: Hyper-pipelining query execution. In: Proceedings of the Conference of Innovative Database Research (CIDR), pp. 225–237, Asilomar (2005)Google Scholar
  8. 8.
    Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the Conference of Information and Knowledge Management (CIKM), pp. 426–434, New Orleans (2003)Google Scholar
  9. 9.
    Buneman P., Libkin L., Suciu D., Tannen V. and Wong L. (1994). Comprehension syntax. SIGMOD Record 23(1): 87–96 CrossRefGoogle Scholar
  10. 10.
    Chakravarthy U.S., Grant J. and Minker J. (1990). Logic-based approach to semantic query optimization. ACM Trans. Database Syst. 15(2): 162–207 CrossRefGoogle Scholar
  11. 11.
    Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2004 terabyte track. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (2004)Google Scholar
  12. 12.
    Clarke, C.L.A., Scholer, F., Soboroff, I.: The TREC 2005 terabyte track. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (2005)Google Scholar
  13. 13.
    Demmel, J., Dongarra, J., Ruhe, A., van~der Vorst, H.: Templates for the solution of algebraic eigenvalue problems: a practical guide. Society for Industrial and Applied Mathematics, Philadelphia (2000)Google Scholar
  14. 14.
    Eisenberg A., Melton J., Kulkarni K., Michels J.E. and Zemke F. (2004). Sql:2003 has been published. SIGMOD Rec. 33(1): 119–126 CrossRefGoogle Scholar
  15. 15.
    Furtado, P., Baumann, P.: Storage of multidimensional arrays based on arbitrary tiling. In: Proc. of the 15th International Conference on Data Engineering, ICDE99, pp. 408–489 (1999)Google Scholar
  16. 16.
    Galindo-Legaria C. and Rosenthal A. (1997). Outerjoin simplification and reordering for query optimization. ACM Trans. Database Syst. 22(1): 43–74 CrossRefGoogle Scholar
  17. 17.
    Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE), Orlando (1998)Google Scholar
  18. 18.
    Golub G.H. and Loan C.F.V. (1983). Matrix Computations. Johns Hopkins University Press, Baltimore zbMATHGoogle Scholar
  19. 19.
    Grabs T., Bhoem K. and Schek H.J. (2004). PowerDB-IR: scalable information retrieval and storage with a cluster of databases. Knowl. Inf. Systems 6(4): 465–505 Google Scholar
  20. 20.
    Graefe G. (1994). Volcano—an extensible and parallel query evaluation system. IEEE TKDE 6(1): 120–135 Google Scholar
  21. 21.
    Grossman D.A., Frieder O., Holmes D.O. and Roberts D.C. (1997). Integrating structured data and text: a relational approach. JASIS 48(2): 122–132 CrossRefGoogle Scholar
  22. 22.
    Hiemstra, D.: A linguistically motivated probabilistic model of information retrieval. In: ECDL, pp. 569–584 (1998)Google Scholar
  23. 23.
    Howe, B., Maier, D.: Algebraic manipulation of scientific datasets. In: VLDB, pp. 924–935 (2004)Google Scholar
  24. 24.
    Huffman, D.: A method for construction of minimum redundancy codes. In: Proc. of the IRE, vol.~40, pp. 1098–1101 (1952)Google Scholar
  25. 25.
    Kotlyar, V., Pingali, K., Stodghill, P.: A relational approach to the compilation of sparse matrix programs. In: Euro-Par, pp. 318–327 (1997)Google Scholar
  26. 26.
    Lerner, A., Shasha, D.: Aquery: query language for ordered data, optimization techniques, and experiments. In: VLDB, pp. 345–356 (2003)Google Scholar
  27. 27.
    Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: Design, implementation, and optimization techniques. In: Proceedings of ACM SIGMOD International Conference on Managing Data, pp. 228–239. ACM Press (1996)Google Scholar
  28. 28.
    Mahesh, K., Kud, J., Dixen, P.: Oracle at TREC 8: a lexical approach. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (1999)Google Scholar
  29. 29.
    Maier, D., Vance, B.: A call to order. In: Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25–28, 1993, Washington, pp. 1–16. ACM Press (1993)Google Scholar
  30. 30.
    Ponte, J., Croft, W.: A language modelling approach to information retrieval. In: Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98) (1998)Google Scholar
  31. 31.
    Porter M.F. (1980). An algorithm for suffix stripping. Program 14(3): 130–137 Google Scholar
  32. 32.
    Robertson S.E. and Jones K.S. (1976). Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27(3): 129–146 CrossRefGoogle Scholar
  33. 33.
    Robertson, S.E., Walker, S., Beaulieu, M.: Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In: Proceedings of the Text Retrieval Conference (TREC), pp. 143–167. Gaithersburg (1998)Google Scholar
  34. 34.
    Roelleke T., Tsikrika T. and Kazai G. (2005). A general matrix framework for modelling information retrieval. IP&M 42(1): 4–30 Google Scholar
  35. 35.
    Steinbrunn, M., Moerkotte, G., Kemper, A.: Optimizing join orders. University of Passau, Tech. rep. (1993)Google Scholar
  36. 36.
    Trotman A. (2003). Compressing inverted files. Inf. Retr. 6(1): 5–19 CrossRefGoogle Scholar
  37. 37.
    Turtle H. and Flood J. (1995). Query evaluation: strategies and optimizations. Inf. Proces. Manage. 31(6): 831–850 CrossRefGoogle Scholar
  38. 38.
    Witten I.H., Moffat A. and Bell T.C. (1999). Managing gigabytes Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco Google Scholar
  39. 39.
    Zukowski, M., Heman, S., Nes, N., Boncz, P.: Super-scalar RAM-CPU cache compression. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE), Atlanta (2006)Google Scholar

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Roberto Cornacchia
    • 1
  • Sándor Héman
    • 1
  • Marcin Zukowski
    • 1
  • Arjen P. de Vries
    • 1
  • Peter Boncz
    • 1
  1. 1.CWIAmsterdamThe Netherlands

Personalised recommendations