High-performance XML modeling of parallel queries based on MapReduce framework

Cluster Computing

Abstract

With data growing at an incredible rate, the development of cloud computing technologies is critical to advancing research. MapReduce is a widely adopted computing framework for data-intensive applications running on clusters. Traditional parallel XML parsing and indexing approaches are inadequate for processing large-scale XML datasets on clusters; therefore, we propose an approach that exploits data parallelism in XML processing using MapReduce in Hadoop. Our solution integrates data storage, labeling, indexing, and parallel queries to process massive amounts of XML data. Specifically, we introduce an SDN labeling algorithm and a distributed hierarchical index using DHTs. More importantly, we design a two-phase MapReduce solution that efficiently addresses labeling, indexing, and query processing on big XML data. The first MapReduce phase applies filtering, labeling, and index-building techniques: each DataNode labels elements using a map function, and a reduce function merges the results and builds the indexes. In the second phase, local XML queries over multiple partitions are performed in parallel using index-table-enabled B-SLCA. Our experimental results show the efficiency and effectiveness of the proposed parallel XML processing approach using the MapReduce framework.
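
The two-phase flow described in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the abstract does not define the SDN labeling scheme or B-SLCA, so this sketch assumes plain Dewey-style labels and a brute-force SLCA computation in their place, and it simulates the map and reduce steps with ordinary Python functions rather than Hadoop jobs. All function names (`map_label`, `reduce_index`, `slca`, `build_indexes`, `run_query`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import product


def map_label(partition_id, xml_text):
    """Phase 1 map (assumed): emit (keyword, label) pairs for one partition.

    Labels are Dewey-style tuples prefixed with the partition id; this is a
    stand-in for the paper's SDN labeling, which the abstract does not define.
    Only tag names and leading element text are indexed (tail text is ignored).
    """
    root = ET.fromstring(xml_text)
    stack = [(root, (partition_id,))]
    while stack:
        node, label = stack.pop()
        for word in [node.tag] + (node.text or "").split():
            yield word.lower(), label
        for i, child in enumerate(node):
            stack.append((child, label + (i,)))


def reduce_index(pairs):
    """Phase 1 reduce (assumed): merge pairs into keyword -> sorted labels."""
    index = defaultdict(set)
    for keyword, label in pairs:
        index[keyword].add(label)
    return {k: sorted(v) for k, v in index.items()}


def common_prefix(a, b):
    """Longest common prefix of two labels, i.e. the label of their LCA."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def slca(index, keywords):
    """Phase 2 (assumed): brute-force smallest lowest common ancestors.

    Collects the LCA of every combination of keyword occurrences, then keeps
    only the LCAs that have no proper descendant among the candidates.
    """
    lists = [index.get(k.lower(), []) for k in keywords]
    if not all(lists):
        return []
    candidates = set()
    for combo in product(*lists):
        lca = combo[0]
        for label in combo[1:]:
            lca = common_prefix(lca, label)
        if lca:
            candidates.add(lca)
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )


def build_indexes(partitions):
    """Run the simulated phase 1 once per partition."""
    return [reduce_index(map_label(pid, text))
            for pid, text in enumerate(partitions)]


def run_query(indexes, keywords):
    """Run the simulated phase 2 locally on each partition and concatenate."""
    hits = []
    for index in indexes:
        hits.extend(slca(index, keywords))
    return hits


partitions = [
    "<bib><book><title>XML</title><author>Smith</author></book>"
    "<book><title>Hadoop</title><author>Smith</author></book></bib>",
    "<bib><paper><title>XML</title><author>Smith</author></paper></bib>",
]
indexes = build_indexes(partitions)
print(run_query(indexes, ["xml", "smith"]))  # [(0, 0), (1, 0)]
```

In this toy run, the query returns the first `book` element in partition 0 and the `paper` element in partition 1, since each is the smallest subtree containing both keywords. The brute-force candidate enumeration grows with the product of the keyword list sizes, so it is only suitable for illustration.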

Author information

Corresponding author

Correspondence to Kunfang Song.

About this article

Cite this article

Song, K., Lu, H. High-performance XML modeling of parallel queries based on MapReduce framework. Cluster Comput 19, 1975–1986 (2016). https://doi.org/10.1007/s10586-016-0628-z

  • DOI: https://doi.org/10.1007/s10586-016-0628-z
