Evaluation of XPath Queries Over XML Documents Using SparkSQL Framework

  • Radoslav Hricov
  • Adam Šenk
  • Petr Kroha
  • Michal Valenta
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 716)


In this contribution, we present our approach to querying an XML document stored in a distributed system. The main goal of the paper is to describe how to use the Spark SQL framework to implement a subset of the XPath query language. We introduce and compare five methods, and in doing so we also demonstrate the current state of query optimization on the Spark SQL platform, which may be taken as a further contribution of the paper. The supported subset covers all XPath axes except the attribute and namespace axes; predicates are not implemented in our prototype. We present the implemented system, data, measurements, tests, and results. The results support our belief that our method significantly decreases the data transfers that occur in the distributed system during query evaluation.
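To make the idea of evaluating an XPath axis as a SQL query concrete, here is a minimal sketch. It is not one of the five methods from the paper, which the abstract does not describe. It assumes a commonly used pre/post-order numbering of XML nodes and uses Python's built-in sqlite3 as a local stand-in for Spark SQL; the table name `nodes`, its columns, and the helper functions are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny sample document (illustrative only, not data from the paper).
XML = "<a><b><c/><d/></b><e><c/></e></a>"

def shred(root):
    """Turn the tree into (pre, post, parent, tag) rows with one depth-first walk."""
    rows = []
    counter = {"pre": 0, "post": 0}

    def visit(elem, parent_pre):
        pre = counter["pre"]
        counter["pre"] += 1
        idx = len(rows)
        rows.append(None)  # reserve the slot so rows stay in document order
        for child in elem:
            visit(child, pre)
        post = counter["post"]
        counter["post"] += 1
        rows[idx] = (pre, post, parent_pre, elem.tag)

    visit(root, None)
    return rows

def descendants(conn, tag):
    """descendant axis: v is below u iff u.pre < v.pre and v.post < u.post."""
    sql = """
        SELECT v.pre, v.tag
        FROM nodes u JOIN nodes v
          ON u.pre < v.pre AND v.post < u.post
        WHERE u.tag = ?
        ORDER BY v.pre
    """
    return conn.execute(sql, (tag,)).fetchall()

def children(conn, tag):
    """child axis: v is a child of u iff v.parent = u.pre."""
    sql = """
        SELECT v.pre, v.tag
        FROM nodes u JOIN nodes v ON v.parent = u.pre
        WHERE u.tag = ?
        ORDER BY v.pre
    """
    return conn.execute(sql, (tag,)).fetchall()

# Assumption: an in-memory SQLite table stands in for a Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (pre INTEGER, post INTEGER, parent INTEGER, tag TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", shred(ET.fromstring(XML)))

print(descendants(conn, "b"))  # nodes reached by //b/descendant::*
print(children(conn, "a"))     # nodes reached by //a/child::*
```

Under this encoding each axis becomes a self-join with a different join condition, which is why the quality of the engine's join optimization matters for such queries.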


Keywords: Spark SQL, XML, XPath, Big data


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Radoslav Hricov (1)
  • Adam Šenk (1)
  • Petr Kroha (1)
  • Michal Valenta (1)

  1. Faculty of Information Technology, Czech Technical University in Prague, Prague, Czech Republic
