A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents

  • Conference paper
Cloud Computing (CloudComp 2020)

Abstract

Although XML processing has been intensively studied in recent years, designing efficient implementations for evaluating XPath queries remains a challenge when the XML documents are very large. In this study, we implemented a tree-shaped data structure called a partial tree, which is intrinsically suitable for processing large XML documents on multiple computers. Our implementation uses two index sets to accelerate the evaluation of structural relationships among nodes, making it highly efficient on very large XML documents for three important classes of XPath queries: backward, order-aware, and predicate-containing queries. Experimental results show that our implementation outperforms BaseX, a state-of-the-art XML database, in both absolute loading time and execution time for the target queries. Over 358 GB of XML data, the absolute execution time averages only seconds when using 32 EC2 instances.
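To make the three query classes concrete, the following is a minimal sketch on a toy document using Python's standard `xml.etree.ElementTree`. It is not the paper's partial-tree or dual-index implementation. The document, the tag names, and the helper `following_sibling` are hypothetical. The mapping of "backward" to the parent axis and "order-aware" to the following-sibling axis is an assumption about how the abstract's terms are meant. ElementTree supports only a subset of XPath, so the following-sibling step is emulated by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; element names are illustrative only.
DOC = (
    "<site><people>"
    "<person id='p1'><name>Ann</name><phone>1</phone></person>"
    "<person id='p2'><name>Bob</name></person>"
    "</people></site>"
)
root = ET.fromstring(DOC)

# Predicate-containing query, roughly //person[phone]:
# persons that have at least one phone child.
with_phone = [p.get("id") for p in root.findall(".//person[phone]")]

# Backward query, roughly //name/parent::*:
# ElementTree writes the parent step as "..".
name_parents = [p.get("id") for p in root.findall(".//name/..")]


def following_sibling(tree_root, context_tag, target_tag):
    """Emulate //context_tag/following-sibling::target_tag by scanning
    each element's children in document order (assumed semantics)."""
    found, seen = [], set()
    for parent in tree_root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag != context_tag:
                continue
            for sib in children[i + 1:]:
                if sib.tag == target_tag and id(sib) not in seen:
                    seen.add(id(sib))
                    found.append(sib)
    return found


# Order-aware query, roughly //name/following-sibling::phone.
phones_after_name = [e.text for e in following_sibling(root, "name", "phone")]

print(with_phone, name_parents, phones_after_name)
```

Each of these queries has to relate a node to its parent or siblings, which is the kind of structural relationship the abstract says the two index sets are meant to accelerate. The single-machine scan above does not address the very large, distributed documents the paper targets.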

Notes

  1. https://www.w3.org/TR/xpath/.

  2. https://www.w3.org/TR/xquery/.

  3. http://dblp.uni-trier.de/.

  4. http://www.uniprot.org/help/uniprotkb.

  5. http://basex.org/.

  6. https://aws.amazon.com/ec2/.

  7. The factor determines the file size of an XMark-generated document, and the size grows nearly linearly with it: a factor of 1 yields about 110 MB. For example, xmark100 (factor 100) is about 11 GB, while xmark2000 (factor 2000) is about 220 GB.
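The size arithmetic in the last note can be checked with a short sketch. It assumes exactly linear scaling at 110 MB per unit of factor and 1 GB = 1000 MB; the note itself only says the relation is nearly linear. The names `MB_PER_FACTOR` and `estimated_size_gb` are hypothetical.

```python
MB_PER_FACTOR = 110  # from the note: factor 1 is about 110 MB


def estimated_size_gb(factor: float) -> float:
    """Approximate XMark document size in GB, assuming exactly linear
    scaling and decimal units (1 GB = 1000 MB)."""
    return factor * MB_PER_FACTOR / 1000


# The two documents named in the note.
for name, factor in [("xmark100", 100), ("xmark2000", 2000)]:
    print(f"{name}: about {estimated_size_gb(factor):.0f} GB")
```

Under these assumptions the estimates match the figures in the note, 11 GB and 220 GB.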

Correspondence to Wei Hao.

Copyright information

© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Hao, W., Matsuzaki, K., Sato, S. (2021). A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents. In: Qi, L., Khosravi, M.R., Xu, X., Zhang, Y., Menon, V.G. (eds) Cloud Computing. CloudComp 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-030-69992-5_2

  • DOI: https://doi.org/10.1007/978-3-030-69992-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69991-8

  • Online ISBN: 978-3-030-69992-5

  • eBook Packages: Computer Science, Computer Science (R0)
