Abstract
This article provides a comprehensive description of text analytics directly on compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realization. It then presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.
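To make the idea of analytics on hierarchically compressed text concrete, the following is a minimal sketch of a word count computed over a small grammar without expanding it. It is an illustration under stated assumptions, not the authors' implementation: the grammar layout (a dict from integer rule ids to symbol lists), the sample `GRAMMAR`, and the helper names `word_counts` and `expand` are all hypothetical.

```python
from collections import Counter

# Hypothetical grammar (assumption for illustration): rule id -> symbols.
# A symbol is either a word (str) or a reference to another rule (int).
# Rule 0 is the root and expands to the full text.
GRAMMAR = {
    0: [1, "c", 1, 2],
    1: [2, "a"],
    2: ["a", "b"],
}


def word_counts(grammar, root=0):
    """Count the words in the text that `root` expands to, without expanding it."""
    memo = {}  # rule id -> Counter of words in that rule's expansion

    def counts(rule):
        if rule not in memo:
            total = Counter()
            for sym in grammar[rule]:
                if isinstance(sym, int):
                    # Reuse the (memoized) counts of the referenced rule.
                    total.update(counts(sym))
                else:
                    total[sym] += 1
            memo[rule] = total
        return memo[rule]

    return counts(root)


def expand(grammar, root=0):
    """Fully decompress the text; used here only to check the result."""
    out = []
    for sym in grammar[root]:
        if isinstance(sym, int):
            out.extend(expand(grammar, sym))
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    print(word_counts(GRAMMAR))
    print(Counter(expand(GRAMMAR)))
```

In this sketch each rule's counts are computed once and reused wherever the rule is referenced, so repeated content is not reprocessed. That is the intuition behind saving both space and processing time, though the sketch says nothing about the specific algorithms or data structures the article presents.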
Notes
Section 7.6 shows the sensitivity on the other two datasets, D and E.
The processing time in gzip is the same as in the baseline method since they both process the decompressed data.
Acknowledgements
This work is supported by the National Key R&D Program of China (Grant No. 2017YFB1003103), National Natural Science Foundation of China (Nos. 61732014 and 61802412), Beijing Natural Science Foundation (Nos. 4202031 and L192027), Tsinghua University Initiative Scientific Research Program (20191080594), and Beijing Academy of Artificial Intelligence (BAAI). Onur Mutlu is supported by ETH Zürich, SRC, and various industrial partners of the SAFARI Research Group, including Alibaba, Huawei, Intel, Microsoft, and VMware. Jidong Zhai, Xipeng Shen, and Xiaoyong Du are the corresponding authors of this paper.
About this article
Cite this article
Zhang, F., Zhai, J., Shen, X. et al. TADOC: Text analytics directly on compression. The VLDB Journal 30, 163–188 (2021). https://doi.org/10.1007/s00778-020-00636-3
DOI: https://doi.org/10.1007/s00778-020-00636-3