Integrated method for distributed processing of large XML data

Cluster Computing

Abstract

Traditional standalone computing scales poorly to large XML data, so distributed processing on cluster systems becomes an inevitable choice. Current distributed XML processing methods generally rely on existing distributed computing frameworks designed for general-purpose data. Given XML's semi-structured nature and complex queries, these frameworks have limitations such as complex configuration, an inflexible working mechanism, and difficult performance optimization. In addition, distributed XML queries suffer from a low level of automation and lack effective integration with distributed XML parsing and indexing. In this paper we propose an integrated method for distributed processing of large XML data, called the dXML method. Our method supports distributed parsing of arbitrary XML fragments and distributed index creation, and adopts efficient navigational XPath evaluation based on a relation index. Through a distributed XPath evaluation approach based on filter-upon-pre-evaluate, our method achieves data locality and reduces network traffic during the distributed evaluation of complex XPath predicates. dXML integrates distributed XML parsing, index creation, and XPath querying into a one-stop XML processing solution that supports automatic distributed processing of large XML data and features lightweight configuration and a flexible working mechanism. Experimental evaluation verifies the effectiveness of dXML, and comparative results show that dXML achieves better distributed query performance than typical existing navigational and Twig distributed processing methods.
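The abstract gives no algorithmic detail, so the following is only a rough illustration of the general idea behind evaluating a predicate where the data lives and shipping back just the matches. It is not the dXML implementation. The fragment contents, the function names (`evaluate_locally`, `distributed_query`, `price_over_20`), and the simplification that every fragment is a well-formed XML document are all assumptions made for this sketch; a plain loop stands in for tasks running on separate cluster nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: each string stands for the chunk held by one node.
# Assumption: every fragment is well-formed on its own, which is a
# simplification compared with the arbitrary fragments the abstract mentions.
FRAGMENTS = [
    "<items><item><price>5</price><name>pen</name></item>"
    "<item><price>50</price><name>lamp</name></item></items>",
    "<items><item><price>70</price><name>desk</name></item></items>",
]


def evaluate_locally(fragment_xml, path, predicate):
    """Runs where the fragment is stored: test the predicate on each
    candidate node and return only the names of the nodes that pass."""
    root = ET.fromstring(fragment_xml)
    return [item.findtext("name") for item in root.findall(path) if predicate(item)]


def distributed_query(fragments, path, predicate):
    """Coordinator side: collect the already-filtered partial results."""
    results = []
    for fragment in fragments:  # stands in for one task per cluster node
        results.extend(evaluate_locally(fragment, path, predicate))
    return results


def price_over_20(item):
    # Hypothetical predicate, roughly item[price > 20] in XPath terms.
    return float(item.findtext("price")) > 20


matches = distributed_query(FRAGMENTS, "item", price_over_20)
print(matches)
```

Because the predicate is tested next to the data, only the short list of matching names crosses the (simulated) network rather than whole fragments, which is the kind of traffic reduction the abstract attributes to data locality.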

Data availability

Enquiries about data availability should be directed to the authors.

Funding

This research was supported by the Natural Science Foundation of Fujian Province of China (2022J01336, 2022J01820) and the Open Fund of the Digital Fujian Big Data Modeling and Intelligent Computing Institute.

Author information

Contributions

RC: conceptualization, formal analysis, software, wrote original draft. GC: validation, project administration. JC: validation, software. YH: validation, reviewed & edited.

Corresponding author

Correspondence to Guorong Cai.

Ethics declarations

Conflict of interest

None of the authors of this article has any conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, R., Cai, G., Chen, J. et al. Integrated method for distributed processing of large XML data. Cluster Comput 27, 1375–1399 (2024). https://doi.org/10.1007/s10586-023-04010-0

  • DOI: https://doi.org/10.1007/s10586-023-04010-0
