Abstract
Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for various analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because they are costly to update and to keep consistent with every modification performed on the dataset. Furthermore, calculation errors are introduced when analyzing only subsets of the dataset, i.e., wrong weights are computed, because weighting schemes use the number of documents for scoring keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries with various complexities and selectivities for constructing term weighting schemes, which are used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. As an example, MongoDB shows the best overall performance, while Spark’s execution time remains almost constant regardless of weighting schemes.
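The dependence of weights on the number of documents can be illustrated with a minimal sketch. It assumes a plain TF-IDF scheme with idf = log(N / df), which is one common weighting scheme and not necessarily the exact formulation used in TextBenDS; the toy corpus and the names `tf_idf` and `top_k_keywords` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (raw TF, idf = log(N / df))."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

def top_k_keywords(docs, k):
    """Sum the TF-IDF weights over all documents and return the k best terms."""
    totals = Counter()
    for doc_weights in tf_idf(docs):
        totals.update(doc_weights)
    return [term for term, _ in totals.most_common(k)]

# Hypothetical toy corpus of already tokenized documents.
corpus = [
    ["spark", "hadoop", "query"],
    ["spark", "mongodb", "mongodb", "query"],
    ["hive", "query", "benchmark"],
    ["spark", "benchmark", "text"],
]
subset = corpus[:2]  # e.g., the result of a selective query

full_weights = tf_idf(corpus)    # weights computed over the whole dataset
subset_weights = tf_idf(subset)  # weights recomputed over the subset only
```

In this toy corpus the weight of "spark" in the first document is log(4/3) when computed over all four documents, but 0 when recomputed over the two-document subset, because both N and the document frequency change. Weights computed once during preprocessing would therefore not match the weights of the subset.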
Notes
Source code https://github.com/cipriantruica/TextBenDS
Acknowledgements
This research was funded by grant No. PN-III-P1-1.2-PCCDI-2017-0734.
Cite this article
Truică, CO., Apostol, ES., Darmont, J. et al. TextBenDS: a Generic Textual Data Benchmark for Distributed Systems. Inf Syst Front 23, 81–100 (2021). https://doi.org/10.1007/s10796-020-09999-y
DOI: https://doi.org/10.1007/s10796-020-09999-y