Tabular Web Data: Schema Discovery and Integration

  • Prudhvi Janga
  • Karen C. Davis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8057)

Abstract

Web data such as web tables, lists, and data records from a wide variety of domains can be combined for purposes such as querying for information and creating example data sets. Locating and extracting tabular web data, and discovering and integrating its schemas, are important for combining, querying, and presenting the data effectively in a uniform format. We focus on schema generation and integration in both a static and a dynamic framework. We contribute algorithms for generating individual schemas from extracted tabular web data and for integrating the generated schemas. Our approach is novel because it provides functionality not previously addressed: it accommodates both static and dynamic frameworks, different kinds of web data types, schema discovery and unification, and table integration.

Keywords

Web tables, web lists, schema generation, schema discovery, schema integration, schema integration framework

Copyright information

© Springer-Verlag GmbH Berlin Heidelberg 2013

Authors and Affiliations

  • Prudhvi Janga (1)
  • Karen C. Davis (1)
  1. University of Cincinnati, Cincinnati, USA
