DaWaK 2013: Data Warehousing and Knowledge Discovery, pp. 26-33
Tabular Web Data: Schema Discovery and Integration
Abstract
Web data such as web tables, lists, and data records from a wide variety of domains can be combined for purposes such as querying for information and creating example data sets. Locating and extracting tabular web data, and discovering and integrating its schemas, are important for effectively combining, querying, and presenting the data in a uniform format. We focus on schema generation and integration for both a static and a dynamic framework. We contribute algorithms for generating individual schemas from extracted tabular web data and for integrating the generated schemas. Our approach is novel because it contributes functionality not previously addressed: it accommodates both the static and dynamic frameworks, different kinds of web data types, schema discovery and unification, and table integration.
Keywords
Web tables, web lists, schema generation, schema discovery, schema integration, schema integration framework