A MapReduce-Based Approach for Prefix-Based Labeling of Large XML Data

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10055)

Abstract

A massive amount of XML (Extensible Markup Language) data, which can be viewed as tree data, is available on the web. One of the fundamental building blocks of information retrieval from tree data is answering structural queries. Various labeling schemes have been suggested for rapid structural query processing. We focus on the prefix-based labeling scheme, which labels each node with the concatenation of its parent's label and its child order. This scheme has been adapted in RDF (Resource Description Framework) data management systems that index RDF data as trees by grouping subjects. Recently, a MapReduce-based algorithm for the prefix-based labeling scheme was suggested. We observe that this algorithm fails to keep label size minimized, which makes the prefix-based labeling scheme difficult to apply to massive real-world XML datasets. To address this issue, we propose a MapReduce-based algorithm for prefix-based labeling of XML data that reduces label size by adjusting the order of label assignments based on the structural information of the XML data. Experiments with real-world XML datasets show that the proposed approach is more effective than previous work.
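To make the labeling rule in the abstract concrete, the following is a minimal single-machine sketch of prefix-based (Dewey-style) labeling in Python. It is not the MapReduce algorithm proposed in the paper. The function names `prefix_labels`, `is_ancestor`, and `format_label`, the choice of `(1,)` as the root label, and the sample document are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def prefix_labels(root):
    """Return a list of (label, tag) pairs in document order.

    Each label is a tuple of 1-based child positions. A child's label is
    its parent's label with the child's own position appended.
    """
    labeled = []
    stack = [(root, (1,))]  # assumption: the root is labeled (1,)
    while stack:
        node, label = stack.pop()
        labeled.append((label, node.tag))
        children = list(node)
        # Push children in reverse so they are popped in document order.
        for position in range(len(children), 0, -1):
            stack.append((children[position - 1], label + (position,)))
    return labeled


def is_ancestor(a, b):
    """True if label `a` is a proper prefix of label `b`."""
    return len(a) < len(b) and b[:len(a)] == a


def format_label(label):
    """Render a label tuple in dotted form, e.g. (1, 2, 1) -> '1.2.1'."""
    return ".".join(str(part) for part in label)


# Hypothetical sample document, used only to exercise the sketch.
doc = ET.fromstring(
    "<lib><book><title/><author/></book><book><title/></book></lib>"
)
labels = prefix_labels(doc)
for label, tag in labels:
    print(format_label(label), tag)
```

Because an ancestor's label is always a proper prefix of its descendants' labels, a structural query such as an ancestor-descendant test reduces to a prefix comparison, as `is_ancestor` shows. Each label also repeats the full path from the root, so labels grow with node depth and with the magnitude of the child positions, which is why label size matters for large documents.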

Notes

Acknowledgement

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0101-16-0054, WiseKB: Big data based self-evolving knowledge base and reasoning platform) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1002236).

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Biomedical Knowledge Engineering Laboratory and Dental Research Institute, Seoul National University, Seoul, South Korea
  2. Department of Computer and Information Engineering, Hoseo University, Cheonan, South Korea
  3. Institute of Human-Environment Interface Biology, Seoul National University, Seoul, South Korea
