Representing Web Data as Complex Objects

  • Alberto H. F. Laender
  • Berthier Ribeiro-Neto
  • Altigran S. da Silva
  • Elaine S. Silva
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1875)

Abstract

The popularization of the Web has made a huge volume of data available to a large audience. In many Web sites, such as bookstores, electronic catalogs, and travel agencies, the pages are documents composed of pieces of data whose overall structure can be easily recognized. Such pages are called data-rich and can be seen as collections of complex objects. In this paper, we show how such objects can be represented by nested tables, which are simple, intuitive, and quite convenient for expressing their implicit structure. The assumption is that, for most sites of interest, only a few examples are required to reveal the structure of the objects. To corroborate our assumption, we describe a data extraction tool that adopts this approach and present results of some experiments carried out with this tool.
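As a rough illustration of the nested-table idea, the sketch below models a complex object as a row whose columns hold either atomic values or an inner table. It is not the authors' tool or notation: the `books` data, the column names, and the `unnest` helper are all hypothetical, chosen only to show how a nested column relates to a flat table.

```python
# Hypothetical data: each row is one complex object (a book).
# The "authors" column holds an inner table (a list of rows), not an atomic value.
books = [
    {
        "title": "Example Title A",
        "authors": [{"name": "Author One"}, {"name": "Author Two"}],
        "price": 10.0,
    },
    {
        "title": "Example Title B",
        "authors": [{"name": "Author Three"}],
        "price": 12.5,
    },
]


def unnest(rows, column):
    """Flatten one nested column, repeating the outer values for each inner row."""
    flat = []
    for row in rows:
        # Outer values are every column except the nested one.
        outer = {key: value for key, value in row.items() if key != column}
        for inner in row[column]:
            flat.append({**outer, **inner})
    return flat


flat_books = unnest(books, "authors")
for row in flat_books:
    print(row)
```

The nested form keeps each book as a single row, whereas the flattened form repeats the title and price once per author. That repetition is one reason a nested layout can be the more natural fit for objects found on data-rich pages.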

Keywords

Object Type; Complex Object; Travel Agency; List Type; Semistructured Data
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Alberto H. F. Laender (1)
  • Berthier Ribeiro-Neto (1)
  • Altigran S. da Silva (1)
  • Elaine S. Silva (1)
  1. Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
