TADOC: Text analytics directly on compression

  • Regular Paper
  • Published in: The VLDB Journal

Abstract

This article provides a comprehensive description of text analytics directly on compression (TADOC), a technique that enables document analytics directly on compressed textual data, without decompressing it first. It explains the concept of TADOC and the challenges to realizing it effectively. It then presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.
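To make the idea of analytics on hierarchically compressed text more concrete, the following is a minimal sketch of a word count computed on a grammar-like rule set without expanding it. It is not the article's implementation: the toy grammar, its encoding (integers reference other rules, strings are words), and the function names `expand` and `word_count` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy grammar (an assumption, not taken from the article).
# Rule 0 is the root. An int refers to another rule; a str is a word.
GRAMMAR = {
    0: [1, "and", 1, 2],
    1: ["to", "be", 2],
    2: ["or", "not"],
}


def expand(grammar, rule=0):
    """Fully decompress a rule into its word sequence (baseline for comparison)."""
    words = []
    for symbol in grammar[rule]:
        if isinstance(symbol, int):
            words.extend(expand(grammar, symbol))
        else:
            words.append(symbol)
    return words


def word_count(grammar, root=0):
    """Count words without decompressing: each rule is counted once and reused."""
    memo = {}

    def count(rule):
        if rule not in memo:
            total = Counter()
            for symbol in grammar[rule]:
                if isinstance(symbol, int):
                    # Reuse the cached counts of the referenced rule.
                    total.update(count(symbol))
                else:
                    total[symbol] += 1
            memo[rule] = total
        return memo[rule]

    return count(root)


if __name__ == "__main__":
    print(word_count(GRAMMAR))
    print(Counter(expand(GRAMMAR)) == word_count(GRAMMAR))
```

In this sketch a rule that is referenced many times is still scanned only once, which is the kind of reuse that can make processing compressed data cheaper than processing the expanded text.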

Notes

  1. Section 7.6 shows the sensitivity on the other two datasets, D and E.

  2. The processing time in gzip is the same as in the baseline method since they both process the decompressed data.

Acknowledgements

This work is supported by the National Key R&D Program of China (Grant No. 2017YFB1003103), National Natural Science Foundation of China (Nos. 61732014 and 61802412), Beijing Natural Science Foundation (Nos. 4202031 and L192027), Tsinghua University Initiative Scientific Research Program (20191080594), and Beijing Academy of Artificial Intelligence (BAAI). Onur Mutlu is supported by ETH Zürich, SRC, and various industrial partners of the SAFARI Research Group, including Alibaba, Huawei, Intel, Microsoft, and VMware. Jidong Zhai, Xipeng Shen, and Xiaoyong Du are the corresponding authors of this paper.

Author information

Corresponding author

Correspondence to Feng Zhang.

About this article

Cite this article

Zhang, F., Zhai, J., Shen, X. et al. TADOC: Text analytics directly on compression. The VLDB Journal 30, 163–188 (2021). https://doi.org/10.1007/s00778-020-00636-3

  • DOI: https://doi.org/10.1007/s00778-020-00636-3
