Abstract
Traditional standalone computing struggles to process large XML data because of its limited scalability, so distributed processing on cluster systems becomes an inevitable choice. Current distributed XML processing methods generally rely on existing distributed computing frameworks designed for general-purpose data. Given the semi-structured nature of XML and the complexity of its queries, these frameworks have limitations such as complex configuration, an inflexible working mechanism, and difficult performance optimization. In addition, distributed XML queries suffer from a low level of automation and lack effective integration with distributed XML parsing and indexing. In this paper we propose an integrated method for the distributed processing of large XML data, called the dXML method. Our method supports the distributed parsing of arbitrary XML fragments and the distributed creation of indexes, and adopts an efficient navigational XPath evaluation based on a relation index. Through a distributed XPath evaluation approach based on filter-upon-pre-evaluate, our method achieves data locality and reduces network traffic during the distributed evaluation of complex XPath predicates. dXML integrates the distributed processing of XML parsing, index creation, and XPath querying into a one-stop XML processing solution that supports the automatic distributed processing of large XML data, with lightweight configuration and a flexible working mechanism. Experimental evaluation verifies the effectiveness of dXML, and comparative experimental results show that dXML achieves better distributed query performance than typical existing navigational and Twig distributed processing methods.
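To make the data-locality idea concrete, the following is a minimal Python sketch and not the dXML algorithm itself. It assumes, for simplicity, that every fragment is a well-formed subtree (dXML supports arbitrary fragments) and that a thread pool stands in for the worker nodes. The names `FRAGMENTS`, `evaluate_locally`, and `distributed_query` are hypothetical. Each worker evaluates the path together with its predicate on its own fragment, so only the matching values travel back to the coordinator.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments: each string stands in for the part of a large
# document held by one worker node (assumed well-formed in this sketch).
FRAGMENTS = [
    "<lib><book><year>2020</year><title>A</title></book>"
    "<book><year>2019</year><title>B</title></book></lib>",
    "<lib><book><year>2020</year><title>C</title></book></lib>",
]


def evaluate_locally(fragment, path):
    """Parse one fragment and evaluate the path, predicate included, on it."""
    root = ET.fromstring(fragment)
    # Only the text of matching nodes is returned, not the fragment itself.
    return [node.text for node in root.findall(path)]


def distributed_query(fragments, path):
    """Run the local evaluation on every fragment and merge in fragment order."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda f: evaluate_locally(f, path), fragments)
        return [text for part in partials for text in part]


titles = distributed_query(FRAGMENTS, "./book[year='2020']/title")
print(titles)
```

The sketch relies on the limited XPath subset of `xml.etree.ElementTree`; predicates that span fragment boundaries, which a real distributed evaluator has to handle, are outside its scope.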
Data availability
Enquiries about data availability should be directed to the authors.
Funding
This research was supported by the Natural Science Foundation of Fujian Province of China (2022J01336, 2022J01820) and the Open Fund of the Digital Fujian Big Data Modeling and Intelligent Computing Institute.
Author information
Contributions
RC: conceptualization, formal analysis, software, wrote original draft. GC: validation, project administration. JC: validation, software. YH: validation, reviewed & edited.
Ethics declarations
Conflict of interest
None of the authors of this article has a conflict of interest.
About this article
Cite this article
Chen, R., Cai, G., Chen, J. et al. Integrated method for distributed processing of large XML data. Cluster Comput 27, 1375–1399 (2024). https://doi.org/10.1007/s10586-023-04010-0
DOI: https://doi.org/10.1007/s10586-023-04010-0