A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents

Hao, Wei; Matsuzaki, Kiminori; Sato, Shigeyuki

doi:10.1007/978-3-030-69992-5_2

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 363)

Included in the following conference series:

International Conference on Cloud Computing

Abstract

Although XML processing has been studied intensively in recent years, designing efficient implementations for evaluating XPath queries remains a challenge when the XML documents are very large. In this study, we implemented a tree-shaped data structure called a partial tree, which is intrinsically suited to processing large XML documents on multiple computers. Our implementation uses two index sets to accelerate the evaluation of structural relationships among nodes, making it highly efficient on very large XML documents for three important classes of XPath queries: backward, order-aware, and predicate-containing queries. Experimental results show that our implementation outperforms BaseX, a state-of-the-art XML database, in both absolute loading time and execution time for the target queries. Using 32 EC2 instances, the absolute execution time over 358 GB of XML data is only seconds on average.
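To make the three query classes concrete, the following is a minimal sketch of how a structural index can answer them without walking the tree per query. The abstract does not say what the two index sets contain, so everything here is an assumption chosen for illustration: a generic (start, end) interval labeling plus a tag-to-nodes map, with the hypothetical names `IntervalIndex`, `ancestors`, `following_siblings`, and `with_child`. It is not the partial-tree implementation and it runs on a single machine.

```python
import xml.etree.ElementTree as ET

# Tiny illustrative document; the id attributes only make results readable.
DOC = ("<lib><book id='b1'><title id='t1'/><author id='a1'/></book>"
       "<book id='b2'><title id='t2'/></book></lib>")


class IntervalIndex:
    """Hypothetical index: interval labels plus a tag-to-nodes map."""

    def __init__(self, root):
        self.start, self.end, self.parent, self.by_tag = {}, {}, {}, {}
        self._clock = 0
        self._visit(root, None)

    def _visit(self, node, parent):
        # Record the entry time, index the node by tag, then recurse.
        self.start[node] = self._clock
        self._clock += 1
        self.parent[node] = parent
        self.by_tag.setdefault(node.tag, []).append(node)
        for child in node:
            self._visit(child, node)
        # Record the exit time once all descendants are labeled.
        self.end[node] = self._clock
        self._clock += 1

    def is_ancestor(self, a, d):
        # a is an ancestor of d iff a's interval strictly contains d's.
        return self.start[a] < self.start[d] and self.end[d] < self.end[a]

    def ancestors(self, tag, of_tag):
        # Backward query: //of_tag/ancestor::tag
        targets = self.by_tag.get(of_tag, [])
        return [a for a in self.by_tag.get(tag, [])
                if any(self.is_ancestor(a, d) for d in targets)]

    def following_siblings(self, tag, of_tag):
        # Order-aware query: //of_tag/following-sibling::tag
        targets = self.by_tag.get(of_tag, [])
        return [s for s in self.by_tag.get(tag, [])
                if any(self.parent[s] is self.parent[n]
                       and self.start[s] > self.start[n] for n in targets)]

    def with_child(self, tag, child_tag):
        # Predicate query: //tag[child_tag]
        children = self.by_tag.get(child_tag, [])
        return [p for p in self.by_tag.get(tag, [])
                if any(self.parent[c] is p for c in children)]


def ids(nodes):
    return [n.get("id") for n in nodes]


idx = IntervalIndex(ET.fromstring(DOC))
print(ids(idx.ancestors("book", "author")))            # //author/ancestor::book
print(ids(idx.following_siblings("author", "title")))  # //title/following-sibling::author
print(ids(idx.with_child("book", "author")))           # //book[author]
```

Each check reduces to integer comparisons on precomputed labels, which is the general reason index-based approaches help with structural relationships. The quadratic scans over candidate lists are kept for brevity only.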

Notes

1. https://www.w3.org/TR/xpath/
2. https://www.w3.org/TR/xquery/
3. http://dblp.uni-trier.de/
4. http://www.uniprot.org/help/uniprotkb
5. http://basex.org/
6. https://aws.amazon.com/ec2/
7. The factor determines the file size of an XMark-generated document. The relation is nearly linear, with factor 1 producing about 110 MB: for example, xmark100 (factor 100) is about 11 GB, and xmark2000 (factor 2000) is about 220 GB.
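The near-linear relation between the XMark factor and the file size can be checked with a short calculation. This is a sketch only: the function name `xmark_size_gb` is hypothetical, and it assumes decimal units (1 GB = 1000 MB), which is what makes the note's figures come out exactly.

```python
MB_PER_FACTOR = 110  # from the note: factor 1 is about 110 MB


def xmark_size_gb(factor, mb_per_gb=1000):
    """Approximate size in GB of an XMark document for a given factor.

    Assumes the relation is exactly linear; the note says only "nearly linear".
    """
    return factor * MB_PER_FACTOR / mb_per_gb


for factor in (100, 2000):
    print(f"xmark{factor}: about {xmark_size_gb(factor):.0f} GB")
```

Real generated files will deviate slightly from these estimates.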

Author information

Authors and Affiliations

Anhui University of Science and Technology, Taifeng Avenue 168, Huainan, Anhui, China
Wei Hao
Kochi University of Technology, 185 Miyanokuchi, Tosayamada, Kami, Kochi, 782-8502, Japan
Wei Hao, Kiminori Matsuzaki & Shigeyuki Sato

Corresponding author

Correspondence to Wei Hao.

Editor information

Editors and Affiliations

Qufu Normal University, Qufu, China
Lianyong Qi
Persian Gulf University, Bushehr, Iran
Mohammad R. Khosravi
Nanjing University of Information Science and Technology, Nanjing, China
Xiaolong Xu
Anhui University, Hefei, China
Yiwen Zhang
SCMS School of Engineering and Technology, Kerala, India
Varun G. Menon

About this paper

Cite this paper

Hao, W., Matsuzaki, K., Sato, S. (2021). A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents. In: Qi, L., Khosravi, M.R., Xu, X., Zhang, Y., Menon, V.G. (eds) Cloud Computing. CloudComp 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-030-69992-5_2

DOI: https://doi.org/10.1007/978-3-030-69992-5_2
Published: 13 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-69991-8
Online ISBN: 978-3-030-69992-5
eBook Packages: Computer Science, Computer Science (R0)