A Data Parallel Algorithm for XML DOM Parsing

  • Bhavik Shah
  • Praveen R. Rao
  • Bongki Moon
  • Mohan Rajagopalan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5679)

Abstract

The Extensible Markup Language (XML) has become the de facto standard for information representation and interchange on the Internet. XML parsing is a core operation that must be performed on an XML document before it can be accessed and manipulated. This operation is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm called ParDOM for XML DOM parsing that builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks and parsed in parallel. In the second phase, partial DOM node tree structures created during the first phase are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme – each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23] – a recently proposed parallel DOM parsing algorithm – on commodity multicore processors. Furthermore, ParDOM can process a wide variety of XML datasets with complex structures that PXP fails to parse.
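
The two-phase idea in the abstract can be illustrated with a small Python sketch. This is not the authors' ParDOM implementation: every name here (`Node`, `split_into_chunks`, `parse_chunk`, `link`, `par_dom_sketch`, `shape`) is hypothetical, text and attributes are ignored, cuts are moved to a `<` so that no tag is split, and the input is assumed to be well formed with no comments, CDATA sections, or `>` inside attribute values. The linking step below is also sequential for brevity, whereas the abstract states that ParDOM links partial trees in parallel.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Matches start, end, and self-closing element tags (simplified XML).
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

def split_into_chunks(xml, n):
    """Cut xml into roughly n pieces, moving each cut forward to the next '<'
    so that no tag is split (a simplifying assumption of this sketch)."""
    size = max(1, len(xml) // n)
    cuts = [0]
    while cuts[-1] < len(xml):
        nxt = xml.find("<", cuts[-1] + size)
        cuts.append(len(xml) if nxt == -1 else nxt)
    return [xml[a:b] for a, b in zip(cuts, cuts[1:])]

def parse_chunk(chunk):
    """Phase 1 (map): parse one chunk in isolation into partial trees.
    Returns (events, still_open): events are the unmatched end tags and the
    parentless nodes, in document order; still_open are nodes left unclosed."""
    stack, events = [], []
    for m in TAG.finditer(chunk):
        closing, name, self_closing = m.groups()
        if closing:
            if stack:
                stack.pop()                    # matched inside this chunk
            else:
                events.append(("end", name))   # closes a tag from an earlier chunk
        else:
            node = Node(name)
            if stack:
                stack[-1].children.append(node)
            else:
                events.append(("root", node))  # parent lies in an earlier chunk
            if not self_closing:
                stack.append(node)
    return events, stack

def link(partials):
    """Phase 2: stitch the partial trees together (sequentially here)."""
    open_nodes, root = [], None
    for events, still_open in partials:
        for kind, item in events:
            if kind == "end":
                open_nodes.pop()
            elif open_nodes:
                open_nodes[-1].children.append(item)
            else:
                root = item
        open_nodes.extend(still_open)
    return root

def par_dom_sketch(xml, n_chunks):
    chunks = split_into_chunks(xml, n_chunks)
    with ThreadPoolExecutor() as pool:         # chunks are parsed independently
        partials = list(pool.map(parse_chunk, chunks))
    return link(partials)

def shape(node):
    """Nested (name, children) tuples, handy for comparing trees."""
    return (node.name, [shape(c) for c in node.children])

doc = "<a><b><c/></b><d>text</d></a>"
print(shape(par_dom_sketch(doc, 3)))
```

With three chunks the document is cut into `<a><b><c/>`, `</b><d>text`, and `</d></a>`, so the middle chunk starts with an end tag whose start tag lives in the previous chunk; this is the "not necessarily matched" case the abstract describes. In CPython, threads do not give real speedup for this CPU-bound work, so the thread pool only illustrates that the map step has no dependencies between chunks.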


References

  1. Intel XML Software Suite Performance Paper, http://intel.com/software/xmlsoftwaresuite
  2. Microsoft XML Core Services (MSXML), http://msdn.microsoft.com/en-us/xml/
  3. Xerces-C++ XML Parser, http://xerces.apache.org/xerces-c/
  4. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (December 2006)
  5. Berglund, A., Boag, S., Chamberlin, D., Fernandez, M.F., Kay, M., Robie, J., Simon, J.: XML path language (XPath) 2.0 W3C working draft 16. Technical Report WD-xpath20-20020816, World Wide Web Consortium (August 2002)
  6. Cable, L., Chow, T.: JSR 173: Streaming API for XML (2007), http://jcp.org/en/jsr/detail?id=173
  7. Cameron, R.D., Herdy, K.S., Lin, D.: High performance XML parsing using parallel bit stream technology. In: CASCON 2008: Proc. of the 2008 conference of the center for advanced studies on collaborative research, New York, pp. 222–235 (2008)
  8. Chakravarty, M.M.T., Leshchinskiy, R., Jones, S.P., Keller, G., Marlow, S.: Data Parallel Haskell: a status report. In: Proc. of the 2007 Workshop on Declarative Aspects of Multicore Programming, Nice, France, January 2007, pp. 10–18 (2007)
  9. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: Proc. of the OSDI 2004, San Francisco, CA (December 2004)
  10. Engelen, R.A.V.: A framework for service-oriented computing with C and C++ Web service components. ACM Transactions on Internet Technology 8(3), 1–25 (2008)
  11. Gao, Z., Pan, Y., Zhang, Y., Chiu, K.: A high performance schema-specific XML parser. In: IEEE Intl. Conf. on e-Science and Grid Computing, December 2007, pp. 245–252 (2007)
  12. Ghuloum, A., Smith, T., Wu, G., Zhou, X., Fang, J., Guo, P., So, B., Rajagopalan, M., Chen, Y., Chen, B.: Future-proof data parallel algorithms and software on Intel multi-core architecture. Intel Technology Journal 11(4), 333–348 (2007)
  13. Ghuloum, A., Sprangle, E., Fang, J., Wu, G., Zhou, X.: Ct: A Flexible Parallel Programming Model for Tera-scale Architectures, 2007. Intel White Paper (2007)
  14. Goldman, O., Lenkov, D.: XML Binary Characterization. Technical report, World Wide Web Consortium (March 2005)
  15. Grohoski, G.: Niagara 2: A highly threaded server-on-a-chip. In: 18th Hot Chips Symposium (August 2006)
  16. Huhns, M., Singh, M.P.: Service-Oriented Computing: Key Concepts and Principles. IEEE Internet Computing 9(1), 75–81 (2005)
  17. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proc. of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pp. 59–72 (2007)
  18. Kay, M.: SAXON: The XSLT and XQuery Processor, http://saxon.sourceforge.net
  19. Kostoulas, M.G., Matsa, M., Mendelsohn, N., Perkins, E., Heifets, A., Mercaldi, M.: XML screamer: an integrated approach to high performance XML parsing, validation and deserialization. In: Proc. of the 15th International Conference on World Wide Web, New York, pp. 93–102 (2006)
  20. Li, Q., Moon, B.: Indexing and querying XML data for regular path expressions. In: Proc. of the 27th VLDB Conference, Rome, Italy, September 2001, pp. 361–370 (2001)
  21. Megginson, D.: Simple API for XML, http://sax.sourceforge.net/
  22. Nicola, M., John, J.: XML parsing: a threat to database performance. In: Proc. of the 12th International Conference on Information and Knowledge Management, pp. 175–178 (2003)
  23. Pan, Y., Lu, W., Zhang, Y., Chiu, K.: A Static Load-Balancing Scheme for Parallel XML Parsing on Multicore CPUs. In: Proc. of the 7th International Symposium on Cluster Computing and the Grid (CCGRID), Washington D.C., May 2007, pp. 351–362 (2007)
  24. Pan, Y., Zhang, Y., Chiu, K.: Simultaneous transducers for data-parallel XML parsing. In: Proc. of Intl. Symposium on Parallel and Distributed Processing, April 2008, pp. 1–12 (2008)
  25. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for Multi-core and Multiprocessor Systems. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA), Phoenix, AZ (February 2007)
  26. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27(3), 1–15 (2008)
  27. Tatarinov, I., Viglas, S.D., Beyer, K., Shanmugasundaram, J., Shekita, E., Zhang, C.: Storing and Querying Ordered XML Using a Relational Database System. In: Proc. of the 2002 ACM-SIGMOD Conference, June 2002, pp. 204–215 (2002)
  28. TPC. TPC-H (2002), http://www.tpc.org/tpch/
  29.
  30. W3C. The document object model (1998), http://www.w3.org/DOM
  31. Wu, Y., Zhang, Q., Yu, Z., Li, J.: A Hybrid Parallel Processing for XML Parsing and Schema Validation. In: Proceedings of Balisage Markup Conference (2008)
  32. Zhang, J., Lovette, K.: XimpleWare W3C Position Paper. In: W3C Workshop on Binary Interchange of XML Information Item Sets (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Bhavik Shah (1)
  • Praveen R. Rao (1)
  • Bongki Moon (2)
  • Mohan Rajagopalan (3)

  1. University of Missouri-Kansas City, USA
  2. University of Arizona, USA
  3. Intel Research Labs, USA
