In Search of the Lost Schema

  • Stéphane Grumbach
  • Giansalvatore Mecca
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1540)


We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions, and in particular mark-up encodings, which allow the schema to be found without knowledge of the encoding function, under reasonable assumptions on the input data. Depending on how empty sets are encoded, we propose two polynomial on-line algorithms (with different buffer sizes) that solve the schema-finding problem. We also prove that, with high probability, both algorithms find the schema after examining a fixed number of tuples, which leads in practice to linear-time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well suited to practical applications, such as structuring and wrapping HTML pages and Web sites.
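As a rough illustration of what schema finding over a mark-up encoding can look like, the following Python sketch recovers a nested structure from a tagged string by reading one mark at a time. It is not the algorithm of the paper: the tag syntax, the helper names `parse`, `observe` and `render`, and the sample data are all assumptions made for this example. The sketch further assumes that each mark identifies one position in the schema and that a run of equal sibling marks denotes a set. Empty sets are deliberately not handled, which is the case the abstract singles out as requiring separate treatment.

```python
import re

# Hypothetical mark-up syntax: <mark> ... </mark> around atomic data or nested marks.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")

def parse(encoded):
    """Turn a tagged string into a tree of (mark, children) pairs."""
    root = ("ROOT", [])
    stack = [root]
    for closing, mark, text in TOKEN.findall(encoded):
        if mark and not closing:            # opening mark: start a new node
            node = (mark, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif mark:                          # closing mark: must match the open one
            if len(stack) == 1 or stack.pop()[0] != mark:
                raise ValueError(f"unbalanced mark </{mark}>")
        elif text.strip():                  # atomic value: content is irrelevant
            stack[-1][1].append(("#data", []))
    if len(stack) != 1:
        raise ValueError("unclosed mark")
    return root

def observe(node, schema):
    """Fold one node into `schema`, a dict: mark -> list of [child_mark, repeated]."""
    mark, children = node
    if mark == "#data":
        return
    shape = []
    for child in children:
        if shape and shape[-1][0] == child[0]:
            shape[-1][1] = True             # a run of equal marks is read as a set
        else:
            shape.append([child[0], False])
        observe(child, schema)
    known = schema.setdefault(mark, shape)
    if [m for m, _ in known] != [m for m, _ in shape]:
        raise ValueError(f"inconsistent content for mark <{mark}>")
    for k, s in zip(known, shape):
        k[1] = k[1] or s[1]                 # one repeated occurrence is enough

def render(schema, mark="ROOT"):
    """Print the inferred schema; braces denote a set."""
    if mark == "#data":
        return "DATA"
    parts = []
    for child, repeated in schema[mark]:
        inner = render(schema, child)
        parts.append("{" + inner + "}" if repeated else inner)
    return f"{mark}[{', '.join(parts)}]"

encoded = "<t><n>Ann</n><p>1</p><p>2</p></t><t><n>Bob</n><p>3</p></t>"
schema = {}
observe(parse(encoded), schema)
print(render(schema))  # ROOT[{t[n[DATA], {p[DATA]}]}]
```

In this toy input the second tuple carries a single `p` value, so its set-valued attribute is only recognized because an earlier tuple showed a repetition. A mark whose content is sometimes empty would raise an error here rather than be resolved.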


Regular Expression, Database Object, Encode Function, Semistructured Data, Subsumption Relation



Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Stéphane Grumbach (1)
  • Giansalvatore Mecca (2)
  1. IASI and INRIA, Le Chesnay, France
  2. DIFA - Università della Basilicata, Potenza, Italy
