In Search of the Lost Schema
We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions and focus on mark-up encodings, which allow the schema to be found without knowledge of the encoding function, under reasonable assumptions on the input data. Depending on how empty sets are encoded, we propose two polynomial on-line algorithms (with different buffer sizes) that solve the schema finding problem. We also prove that, with high probability, both algorithms find the schema after examining a fixed number of tuples, which in practice leads to linear-time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well suited to practical applications, such as structuring and wrapping HTML pages and Web sites.
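To make the setting concrete, the following is a minimal illustrative sketch, not the algorithms proposed in the paper. It assumes a toy mark-up encoding in which every attribute value is wrapped in a known `<tag>…</tag>` pair and a set-valued attribute contains its sub-tuples one after another. The paper's algorithms work without knowing the encoding function, whereas this sketch assumes the tag syntax is known. The names `parse` and `infer`, the tag names, and the sample tuples are all hypothetical. The sketch reads tuples one at a time and refines the schema as it goes, which shows why an empty set carries no information about its inner structure until a later tuple supplies a non-empty one.

```python
import re

# Assumed toy syntax: <tag>, </tag>, or a run of plain text.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")

def parse(encoded):
    """Parse one encoded tuple into a list of nodes.

    A node is either a text string or a pair (tag, children).
    """
    root = []
    stack = [("", root)]
    for closing, tag, text in TOKEN.findall(encoded):
        if text:
            stack[-1][1].append(text)
        elif closing:
            open_tag, _ = stack.pop()
            if open_tag != tag:
                raise ValueError(f"mismatched closing tag: {tag}")
        else:
            children = []
            stack[-1][1].append((tag, children))
            stack.append((tag, children))
    if len(stack) != 1:
        raise ValueError("unclosed tag")
    return root

def infer(nodes, schema=None):
    """Refine `schema` in place with what `nodes` reveal, and return it.

    schema maps tag -> "atomic", a nested dict (a set of sub-tuples),
    or None when only empty occurrences have been seen so far.
    """
    schema = {} if schema is None else schema
    for node in nodes:
        if isinstance(node, str):
            continue  # an atomic value, handled by its enclosing tag
        tag, children = node
        if any(not isinstance(c, str) for c in children):
            # Nested elements: a set whose sub-tuples repeat the same tags.
            inner = schema.get(tag)
            schema[tag] = infer(children, inner if isinstance(inner, dict) else None)
        elif children:
            if schema.get(tag) is None:
                schema[tag] = "atomic"
        else:
            schema.setdefault(tag, None)  # empty: inner structure unknown
    return schema

# On-line use: examine tuples one by one, refining the schema.
tuples = [
    "<name>Smith</name><papers></papers>",
    "<name>Jones</name><papers><title>Cut</title><year>1997</year>"
    "<title>Weave</title><year>1997</year></papers>",
]
schema = {}
for encoded in tuples:
    infer(parse(encoded), schema)
print(schema)
```

After the first tuple alone, `papers` is still mapped to `None`; only the second tuple, which contains a non-empty set, reveals that it is a set of `(title, year)` sub-tuples.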