TextBenDS: a Generic Textual Data Benchmark for Distributed Systems

Truică, Ciprian-Octavian; Apostol, Elena-Simona; Darmont, Jérôme; Assent, Ira

doi:10.1007/s10796-020-09999-y

Published: 06 March 2020

Volume 23, pages 81–100 (2021)

Information Systems Frontiers

Ciprian-Octavian Truică¹,² (ORCID: orcid.org/0000-0001-7292-4462),
Elena-Simona Apostol¹,
Jérôme Darmont³ &
Ira Assent⁴

Abstract

Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for various analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because they are costly to update and must keep track of every modification made to the dataset. Furthermore, calculation errors are introduced when only subsets of the dataset are analyzed: weighting schemes use the number of documents to score keywords and documents, so incorrect weights are computed. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which relies largely on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries with various complexities and selectivities for constructing term weighting schemes, which are used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. For example, MongoDB shows the best overall performance, while Spark’s execution time remains almost constant regardless of the weighting scheme.
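The dependence of the weights on the number of documents can be illustrated with a minimal sketch. It assumes TF-IDF as a representative weighting scheme (the abstract does not name one), whitespace tokenization, and a toy corpus; the function names `tfidf_scores` and `top_k_keywords` are hypothetical and are not taken from the TextBenDS source code.

```python
import math
from collections import Counter, defaultdict


def tfidf_scores(docs):
    """Sum TF-IDF weights per term over a list of documents (assumed scheme)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = defaultdict(float)
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)            # normalized term frequency
            idf = math.log(n_docs / df[term])   # depends on the corpus size
            scores[term] += tf * idf
    return dict(scores)


def top_k_keywords(docs, k):
    """Return the k highest-scoring (term, weight) pairs, ties broken by term."""
    ranked = sorted(tfidf_scores(docs).items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Toy corpus, invented for illustration only.
docs = ["spark hadoop spark", "hadoop hive", "mongodb hadoop", "mongodb spark"]

full = tfidf_scores(docs)         # weights over the whole corpus
subset = tfidf_scores(docs[:3])   # weights over a subset of the corpus
print(top_k_keywords(docs, 2))
print(full["hadoop"], subset["hadoop"])
```

In this toy corpus the term "hadoop" occurs in every document of the subset, so its inverse document frequency is log(1) = 0 there, while over the full corpus it is log(4/3). Weights computed on a subset therefore cannot simply be reused for the whole dataset, which is the kind of calculation error the abstract describes.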

Notes

Source code https://github.com/cipriantruica/TextBenDS

Acknowledgements

This research was funded by grant No. PN-III-P1-1.2-PCCDI-2017-0734.

Author information

Authors and Affiliations

Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
Ciprian-Octavian Truică & Elena-Simona Apostol
Department of Computer Science, Aarhus University, Aarhus, Denmark
Ciprian-Octavian Truică
Université de Lyon, ERIC EA 3083, Lyon 2, France
Jérôme Darmont
DIGIT, Department of Computer Science, Aarhus University, Aarhus, Denmark
Ira Assent

Corresponding author

Correspondence to Ciprian-Octavian Truică.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Truică, CO., Apostol, ES., Darmont, J. et al. TextBenDS: a Generic Textual Data Benchmark for Distributed Systems. Inf Syst Front 23, 81–100 (2021). https://doi.org/10.1007/s10796-020-09999-y

Published: 06 March 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s10796-020-09999-y
