The VLDB Journal, Volume 18, Issue 1, pp 157–179

B-tries for disk-based string management

Regular Paper

Abstract

A wide range of applications require that large quantities of data be maintained in sort order on disk. The B-tree and its variants are efficient general-purpose disk-based data structures that are almost universally used for this task. The B-trie has the potential to be a competitive alternative for the storage of data where strings are used as keys, but has not previously been thoroughly described or tested. We propose new algorithms for the insertion, deletion, and equality search of variable-length strings in a disk-resident B-trie, as well as novel splitting strategies, which are a critical element of a practical implementation. We experimentally compare the B-trie against variants of the B-tree on several large sets of strings with a range of characteristics. Our results demonstrate that, although the B-trie uses more memory, it is faster, more scalable, and requires less disk space.
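To make the idea of a trie whose leaves are buckets more concrete, the following is a minimal in-memory sketch in Python. It is not the B-trie algorithm proposed in the paper, whose splitting strategies are not described in this abstract. Instead it assumes the simpler burst-trie behaviour, in which a full bucket is replaced by a trie node and its strings are redistributed by lead character. The names `ToyTrie`, `TrieNode`, `Bucket`, and `BUCKET_CAPACITY` are hypothetical, and nothing here touches disk.

```python
import bisect

# Hypothetical capacity; a disk-resident bucket would be sized to a disk block.
BUCKET_CAPACITY = 4


class Bucket:
    """Leaf container holding a sorted list of string suffixes."""

    def __init__(self):
        self.suffixes = []


class TrieNode:
    """Internal node with one child per lead character."""

    def __init__(self):
        self.children = {}     # char -> TrieNode or Bucket
        self.terminal = False  # True if a key ends exactly at this node


class ToyTrie:
    """Illustrative only: bursts buckets on the lead character (assumption)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node, i = self.root, 0
        while True:
            if i == len(key):
                node.terminal = True
                return
            ch = key[i]
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Bucket()
            if isinstance(child, TrieNode):
                node, i = child, i + 1
                continue
            # Reached a bucket: store the remaining suffix in sorted position.
            suffix = key[i + 1:]
            pos = bisect.bisect_left(child.suffixes, suffix)
            if pos == len(child.suffixes) or child.suffixes[pos] != suffix:
                child.suffixes.insert(pos, suffix)
            if len(child.suffixes) > BUCKET_CAPACITY:
                node.children[ch] = self._burst(child)
            return

    def _burst(self, bucket):
        """Replace an overfull bucket with a node plus smaller buckets."""
        new_node = TrieNode()
        for suffix in bucket.suffixes:
            if suffix == "":
                new_node.terminal = True
            else:
                target = new_node.children.setdefault(suffix[0], Bucket())
                # Input is sorted, so each new bucket stays sorted.
                target.suffixes.append(suffix[1:])
        # Burst again if all strings landed in one still-overfull bucket.
        for ch, child in list(new_node.children.items()):
            if len(child.suffixes) > BUCKET_CAPACITY:
                new_node.children[ch] = self._burst(child)
        return new_node

    def contains(self, key):
        """Equality search: walk trie nodes, then binary-search the bucket."""
        node, i = self.root, 0
        while True:
            if i == len(key):
                return node.terminal
            child = node.children.get(key[i])
            if child is None:
                return False
            if isinstance(child, TrieNode):
                node, i = child, i + 1
                continue
            suffix = key[i + 1:]
            pos = bisect.bisect_left(child.suffixes, suffix)
            return pos < len(child.suffixes) and child.suffixes[pos] == suffix
```

In this sketch, each level of the trie consumes one character of the key, so buckets store only the remaining suffixes. A disk-based structure would also have to decide how buckets map to disk blocks and how to split them without wasting space, which is the role the abstract assigns to the paper's splitting strategies.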

Keywords

B-tree, Burst trie, Secondary storage, Vocabulary accumulation, Word-level indexing, Data structures

Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  1. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
  2. NICTA, University of Melbourne, Parkville, Australia