TADOC: Text analytics directly on compression

Zhang, Feng; Zhai, Jidong; Shen, Xipeng; Wang, Dalin; Chen, Zheng; Mutlu, Onur; Chen, Wenguang; Du, Xiaoyong

doi:10.1007/s00778-020-00636-3

Regular Paper
Published: 19 September 2020

Volume 30, pages 163–188, (2021)
The VLDB Journal

Feng Zhang ORCID: orcid.org/0000-0003-1983-7321¹,
Jidong Zhai²,
Xipeng Shen³,
Dalin Wang¹,
Zheng Chen¹,
Onur Mutlu⁴,
Wenguang Chen² &
Xiaoyong Du¹

Abstract

This article provides a comprehensive description of text analytics directly on compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realizations. Additionally, a series of guidelines and technical solutions that effectively address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs, are presented. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.
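
To make the idea of analytics on hierarchically compressed text concrete, here is a minimal sketch in Python. It assumes a toy grammar-style representation in which each rule is a list of words and references to other rules; the `GRAMMAR` dictionary and the function names are hypothetical and are not taken from the article. The sketch illustrates only the general principle that repeated content can be processed once per rule rather than once per occurrence, and it should not be read as the authors' algorithms or data structures.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical toy grammar: rule id -> sequence of symbols.
# Integers refer to other rules; strings are words (terminals).
GRAMMAR = {
    0: [1, "and", 1, "or", 2],  # root rule
    1: [2, "is", "fast"],
    2: ["the", "cache"],
}


def expand(grammar, rule=0):
    """Decompress a rule into its full word list (reference path only)."""
    words = []
    for sym in grammar[rule]:
        if isinstance(sym, int):
            words.extend(expand(grammar, sym))
        else:
            words.append(sym)
    return words


def word_count_compressed(grammar, root=0):
    """Count words without materializing the decompressed text.

    Each rule body is scanned at most once; the per-rule result is
    memoized and reused wherever the rule is referenced again.
    """

    @lru_cache(maxsize=None)
    def counts(rule):
        total = Counter()
        for sym in grammar[rule]:
            if isinstance(sym, int):
                total.update(counts(sym))  # reuse the cached sub-result
            else:
                total[sym] += 1
        return total

    return Counter(counts(root))  # copy so callers cannot mutate the cache


if __name__ == "__main__":
    print(word_count_compressed(GRAMMAR))
```

In this sketch the decompressing `expand` function exists only to check the result; the counting path touches each rule body once, which is where a saving over scanning the full text would come from when rules are reused many times.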

Notes

Section 7.6 shows the sensitivity on the other two datasets, D and E.
The processing time in gzip is the same as in the baseline method since they both process the decompressed data.

Acknowledgements

This work is supported by the National Key R&D Program of China (Grant No. 2017YFB1003103), National Natural Science Foundation of China (Nos. 61732014 and 61802412), Beijing Natural Science Foundation (Nos. 4202031 and L192027), Tsinghua University Initiative Scientific Research Program (20191080594), and Beijing Academy of Artificial Intelligence (BAAI). Onur Mutlu is supported by ETH Zürich, SRC, and various industrial partners of the SAFARI Research Group, including Alibaba, Huawei, Intel, Microsoft, and VMware. Jidong Zhai, Xipeng Shen, and Xiaoyong Du are the corresponding authors of this paper.

Author information

Authors and Affiliations

Key Laboratory of Data Engineering and Knowledge Engineering (MOE), School of Information, Renmin University of China, Beijing, China
Feng Zhang, Dalin Wang, Zheng Chen & Xiaoyong Du
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Jidong Zhai & Wenguang Chen
Computer Science Department, North Carolina State University, Raleigh, USA
Xipeng Shen
Department of Computer Science, ETH Zürich, Zürich, Switzerland
Onur Mutlu

Corresponding author

Correspondence to Feng Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, F., Zhai, J., Shen, X. et al. TADOC: Text analytics directly on compression. The VLDB Journal 30, 163–188 (2021). https://doi.org/10.1007/s00778-020-00636-3

Received: 08 October 2019
Revised: 21 July 2020
Accepted: 02 September 2020
Published: 19 September 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s00778-020-00636-3
