World Wide Web, Volume 8, Issue 4, pp 413–438

Studying the XML Web: Gathering Statistics from an XML Sample

  • Denilson Barbosa
  • Laurent Mignet
  • Pierangelo Veltri

Abstract

XML has emerged as the language for exchanging data on the web and has attracted considerable interest both in industry and in academia. Nevertheless, to date, little is known about the XML documents published on the web. This paper presents a comprehensive analysis of a sample of about 200,000 XML documents on the web, and is the first study of its kind. We study the distribution of XML documents across the web in several ways; moreover, we provide a detailed characterization of the structure of real XML documents. Our results provide valuable input to the design of algorithms, tools and systems that use XML in one form or another.

Keywords

World Wide Web, XML, XML web, XML documents, XML processing tools

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Denilson Barbosa (1)
  • Laurent Mignet (2)
  • Pierangelo Veltri (3)
  1. Department of Computer Science, University of Toronto, Toronto, Canada
  2. IBM India Research Laboratory, New Delhi, India
  3. Department of Experimental and Clinical Medicine, Magna Graecia University of Catanzaro, Catanzaro, Italy