Knowledge and Information Systems, Volume 51, Issue 3, pp 1023–1042

Compressed double-array tries for string dictionaries supporting fast lookup

  • Shunsuke Kanda
  • Kazuhiro Morita
  • Masao Fuketa
Regular Paper

Abstract

A string dictionary is a basic tool for storing a set of strings in many kinds of applications. Recently, many applications have come to need space-efficient dictionaries to handle very large datasets. In this paper, we propose new compressed string dictionaries using improved double-array tries. The double-array trie is a data structure that can implement a string dictionary supporting extremely fast lookup of strings, but its space efficiency is low. We introduce approaches for mitigating this disadvantage. Experimental evaluations show that our dictionaries provide faster lookup than state-of-the-art compressed string dictionaries. Moreover, their space efficiency is competitive in many cases.
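
To make the trade-off in the abstract concrete, here is a minimal sketch of a plain, uncompressed double-array trie in the classic BASE/CHECK formulation. It is not the compressed structure proposed in the paper. The function names (`codes`, `build`, `lookup`), the terminator code, and the naive first-fit placement are assumptions made for illustration only. Each lookup step is one addition and one comparison, which is why lookup is fast; the unused slots left between placed nodes are one source of the low space efficiency.

```python
from collections import deque

TERMINATOR = 0  # assumed code marking the end of a key


def codes(key):
    """Map a key to integer codes: each UTF-8 byte b becomes b + 1,
    followed by the terminator code."""
    return [b + 1 for b in key.encode("utf-8")] + [TERMINATOR]


def build(keys):
    """Build BASE and CHECK arrays with naive first-fit placement."""
    # Step 1: an ordinary pointer-based trie (dict: code -> child dict).
    root = {}
    for key in keys:
        node = root
        for c in codes(key):
            node = node.setdefault(c, {})

    # Step 2: place every node's children into the arrays.
    # Slot 0 is the root; CHECK == -1 marks a slot as unused.
    base, check = [0], [-1]
    queue = deque([(root, 0)])
    while queue:
        node, s = queue.popleft()
        if not node:  # leaf reached via the terminator: nothing to place
            continue
        # Smallest b >= 1 such that every slot b + c is still free.
        b = 1
        while any(b + c < len(check) and check[b + c] != -1 for c in node):
            b += 1
        need = b + max(node) + 1
        if need > len(check):
            grow = need - len(check)
            base.extend([0] * grow)
            check.extend([-1] * grow)
        base[s] = b
        for c, child in node.items():
            check[b + c] = s  # the child at slot b + c belongs to parent s
            queue.append((child, b + c))
    return base, check


def lookup(base, check, key):
    """Return True if key was stored; one array probe per input code."""
    s = 0
    for c in codes(key):
        t = base[s] + c
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return True


base, check = build(["trie", "tree", "try"])
print(lookup(base, check, "tree"))  # True
print(lookup(base, check, "tre"))   # False
unused = sum(1 for x in check[1:] if x == -1)
print(f"{unused} of {len(check)} slots are unused")
```

Even for three short keys, many slots stay empty because byte codes are spread over a wide range. Reducing this kind of overhead while keeping the simple lookup loop is the general direction the abstract describes.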

Keywords

  • Trie
  • Double-array
  • Compressed string dictionaries
  • Data management
  • String processing and indexing

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  1. Department of Information Science and Intelligent Systems, Tokushima University, Tokushima, Japan
