On the Midpoint of a Set of XML Documents

  • Alberto Abelló
  • Xavier de Palol
  • Mohand-Saïd Hacid
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3588)


The WWW contains a huge number of documents. Some of them share a subject but are generated by different people or even different organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In this paper, we offer a characterization and an algorithm to obtain the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of a midpoint would be the elements common to all documents. Nevertheless, with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given number of documents belong to the midpoint. An exact schema could always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.
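The frequency idea in the abstract (keep the elements that appear in a given number of documents) can be illustrated with a minimal Python sketch. This is not the algorithm of the paper, which is defined in terms of a resemblance function; it only counts root-to-element tag paths. The function names `element_paths` and `frequent_paths`, the `min_fraction` parameter, and the sample documents are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def frequent_paths(documents, min_fraction):
    """Keep the paths that occur in at least min_fraction of the documents.

    Hypothetical stand-in for a midpoint: a plain frequency threshold,
    not the resemblance-based characterization given in the paper.
    """
    counts = Counter()
    for doc in documents:
        # Each document contributes a path at most once (paths are a set).
        counts.update(element_paths(doc))
    needed = min_fraction * len(documents)
    return {path for path, n in counts.items() if n >= needed}


# Hypothetical sample documents sharing a subject but not a structure.
docs = [
    "<book><title/><author/><year/></book>",
    "<book><title/><author/><isbn/></book>",
    "<book><title/><publisher/></book>",
]

common_to_all = frequent_paths(docs, 1.0)
common_to_most = frequent_paths(docs, 0.6)
```

With `min_fraction` set to 1.0 only the elements shared by every document survive, which is the trivial case mentioned above; lowering the threshold admits elements such as `author` that appear in most, but not all, of the sample documents.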





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Alberto Abelló (1)
  • Xavier de Palol (1)
  • Mohand-Saïd Hacid (2)
  1. Dept. de Llenguatges i Sistemes Informàtics, U. Politècnica de Catalunya
  2. LIRIS - UFR d’Informatique, U. Claude Bernard Lyon 1
