Advertisement

Arabian Journal for Science and Engineering

, Volume 43, Issue 12, pp 8209–8221 | Cite as

Multi-Stream Word-Based Compression Algorithm for Compressed Text Search

  • Emir ÖztürkEmail author
  • Altan Mesut
  • Banu Diri
Research Article - Computer Engineering and Computer Science
  • 54 Downloads

Abstract

In this article, we present a novel word-based lossless compression algorithm for text files using a semi-static model. We named this method the ‘Multi-stream word-based compression algorithm (MWCA)’ because it stores the compressed forms of the words in three individual streams depending on their frequencies in the text and stores two dictionaries and a bit vector as side information. In our experiments, MWCA produces a compression ratio of 3.23 bpc on average and 2.88 bpc for files greater than 50 MB; if a variable length encoder such as Huffman coding is used after MWCA, the given ratios are reduced to 2.65 and 2.44 bpc, respectively. MWCA supports exact word matching without decompression, and its multi-stream approach reduces the search time with respect to single-stream algorithms. Additionally, the MWCA multi-stream structure supplies the reduction in network load by requesting only the necessary streams from the database. With the advantage of its fast compressed search feature and multi-stream structure, we believe that MWCA is a good solution, especially for storing and searching big text data.

Keywords

Data compression Text compression Dictionary-based compression Compressed matching MWCA 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Bell, T.C.; Cleary, J.G.; Witten, I.H.: Text compression. vol. 348. Englewood Cliffs: Prentice Hall, Inc., Upper Saddle River, NJ, USA (1990)Google Scholar
  2. 2.
    Brisaboa, N.R.; Farina, A.; Navarro, G.; Param, J.R.: Improving semistatic compression via pair-based coding. In: International Andrei Ershov Memorial Conference on Perspectives of System Informatics (pp. 124–134) (2006)Google Scholar
  3. 3.
    Brisaboa, N.R.; Farina, A.; Navarro, G.; Esteller, M.F.: (S, C)-dense coding: an optimized compression code for natural language text databases. In: International Symposium on String Processing and Information Retrieval (pp. 122–136) (2003)CrossRefGoogle Scholar
  4. 4.
    Navarro, G.; Tarhio, J.: Boyer–Moore string matching over Ziv–Lempel compressed text. In: Annual Symposium on Combinatorial Pattern Matching (pp. 166–180) (2000)CrossRefGoogle Scholar
  5. 5.
    Moffat, A.: Word-based text compression. Softw. Pract. Exp. 19(2), 185–198 (1989)CrossRefGoogle Scholar
  6. 6.
    Ziv, J.; Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)MathSciNetCrossRefGoogle Scholar
  7. 7.
    Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)MathSciNetCrossRefGoogle Scholar
  8. 8.
    Welch, T.A.: A technique for high-performance data compression. IEEE Comput. 17(6), 8–19 (1984)CrossRefGoogle Scholar
  9. 9.
    Deutsch, L.P.: DEFLATE compressed data format specification version 1.3 (1996)Google Scholar
  10. 10.
    Cleary, J.; Witten, I.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)CrossRefGoogle Scholar
  11. 11.
    Bentley, J.L.; Sleator, D.D.; Tarjan, R.E.; Wei, V.K.: A locally adaptive data compression scheme. Commun. ACM 29(4), 320–330 (1986)MathSciNetCrossRefGoogle Scholar
  12. 12.
    Turpin, A.; Moffat, A.: Fast file search using text compression. Australian Comput. Sci. Commun. 19, 18 (1997)Google Scholar
  13. 13.
    de Moura, E.; Navarro, G.; Ziviani, N.; Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst. (TOIS) 18(2), 113–139 (2000)CrossRefGoogle Scholar
  14. 14.
    Brisaboa, N.R.; Faria, A.; Navarro, G.; Param, J.R.: Lightweight natural language text compression. Inf. Retr. 10(1), 133 (2007)Google Scholar
  15. 15.
    Brisaboa, N.R.; Farina, A.; Navarro, G.; Parama, J.R.: New adaptive compressors for natural language text. Softw. Pract. Exp. 38(13), 1429–1450 (2008)CrossRefGoogle Scholar
  16. 16.
    Brisaboa, N.; Farina, A.; Navarro, G.; Param, J.: Dynamic lightweight text compression. ACM Trans. Inf. Syst. (TOIS) 28(3), 10 (2010)CrossRefGoogle Scholar
  17. 17.
    Carus, A.; Mesut, A.: A new compression algorithm for fast text search. Turk. J. Electr. Eng. Comput. Sci. 24(5), 4355–4367 (2016)CrossRefGoogle Scholar
  18. 18.
    Farina, A.; Navarro, G.; Param, J.R.: Word-based statistical compressors as natural language compression boosters. In: Data Compression Conference, 2008. DCC 2008 (pp. 162–171) (2008)Google Scholar
  19. 19.
    Farina, A.; Navarro, G.; Param, J.R.: Boosting text compression with word-based statistical encoding. Comput. J. bxr096 (2011)Google Scholar
  20. 20.
    Burrows, M.; Wheeler, D.J.: A Block-Sorting Lossless Data Compression Algorithm. Digital Systems Research Center Research Reports, Palo Alto, California, USA (1994)Google Scholar
  21. 21.
    Awan, F.S.; Mukherjee, A.: LIPT: a lossless text transform to improve compression. In: Information Technology: Coding and Computing, 2001. Proceedings. International Conference on (pp. 452–460) (2001)Google Scholar
  22. 22.
    Sun, W.; Zhang, N.; Mukherjee, A.: Dictionary-based fast transform for text compression. In: Information Technology: Coding and Computing [Computers and Communications], 2003. Proceedings. ITCC 2003. International Conference on (pp. 176–182) (2003)Google Scholar
  23. 23.
    Skibiski, P.; Grabowski, S.; Deorowicz, S.: Revisiting dictionary-based compression. Softw. Pract. Exp. 35(15), 1455–1476 (2005)CrossRefGoogle Scholar
  24. 24.
    Amir, A.; Benson, C.: Efficient two-dimensional compressed matching. In: Data Compression Conference, 1992. DCC’92. (pp. 279–288) (1992)Google Scholar
  25. 25.
    Wu, S.; Manber, U.: Fast Text Searching with Errors. University of Arizona, Department of Computer Science, Tucson (1991)Google Scholar
  26. 26.
    Boyer, R.S.; Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977)CrossRefGoogle Scholar
  27. 27.
    Berry, T.; Ravindran, S.: A fast string matching algorithm and experimental results. In: Stringology (pp. 16–28) (1999)Google Scholar
  28. 28.
    Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)CrossRefGoogle Scholar
  29. 29.
    Knuth, D.E.; Morris Jr., J.H.; Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)MathSciNetCrossRefGoogle Scholar
  30. 30.
    Navarro, G.; Tarhio, J.: LZgrep: a Boyer–Moore string matching tool for Ziv–Lempel compressed text. Softw. Pract. Exp. 35(12), 1107–1130 (2005)CrossRefGoogle Scholar
  31. 31.
    Ziviani, N.; De Moura, E.S.; Navarro, G.; Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. Computer 33(11), 37–44 (2000)CrossRefGoogle Scholar
  32. 32.
    Heaps, H.S.: Information retrieval: Computational and theoretical aspects. Academic Press, Inc., Orlando, FL, USA (1978)Google Scholar
  33. 33.
    Huffman, D.A.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)CrossRefGoogle Scholar

Copyright information

© King Fahd University of Petroleum & Minerals 2018

Authors and Affiliations

  1. 1.Computer Engineering DepartmentTrakya UniversityEdirneTurkey
  2. 2.Computer Engineering DepartmentYildiz Technical UniversityIstanbulTurkey

Personalised recommendations