Abstract
Web data extraction systems in use today transform semi-structured Web documents and deliver structured documents to end users. Some systems provide a visual interface for users to generate the extraction rules. However, for end users, the visual presentation of Web documents is lost during the transformation process. In this paper, we propose an approach that allows a user to query extracted documents without knowledge of a formal query language. We bridge the gap between the visual presentation of Web documents and the structured documents extracted from them by providing a QBE-like (Query by Example) interface named Wdee. The principal component of our method is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Wdee generates tree skeletons based on schema information, and a user may execute queries by entering conditions in the skeletons. By maintaining the mapping between the schemata of Web documents and those of extracted documents, a visual example may be presented to end users. With the example, Wdee allows a user to construct tree skeletons in a manner that resembles the browsing of Web pages.
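To make the idea of querying through a tree skeleton more concrete, the following is a minimal sketch and not the Wdee implementation. It assumes the extracted document is XML and that a skeleton is a small tree of tags in which the user may attach a condition to any node. The names SkeletonNode, matches, and query, as well as the sample catalog data, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional
import xml.etree.ElementTree as ET


@dataclass
class SkeletonNode:
    """One node of a hypothetical tree skeleton derived from a document schema."""
    tag: str
    # Condition entered by the user for this node; None means "no constraint".
    condition: Optional[Callable[[str], bool]] = None
    children: List["SkeletonNode"] = field(default_factory=list)


def matches(elem: ET.Element, node: SkeletonNode) -> bool:
    """Return True if the element satisfies the skeleton node and its subtree."""
    if elem.tag != node.tag:
        return False
    text = (elem.text or "").strip()
    if node.condition is not None and not node.condition(text):
        return False
    # Every skeleton child must be satisfied by at least one child element.
    return all(
        any(matches(child_elem, child_node) for child_elem in elem)
        for child_node in node.children
    )


def query(root: ET.Element, skeleton: SkeletonNode) -> List[ET.Element]:
    """Collect all elements in the document that match the filled-in skeleton."""
    return [e for e in root.iter(skeleton.tag) if matches(e, skeleton)]


# Hypothetical extracted document (illustrative data only).
DOC = """<catalog>
  <book><title>Data on the Web</title><price>45</price></book>
  <book><title>Web Mining</title><price>80</price></book>
</catalog>"""

# The user fills in a single condition on the price node of the skeleton.
skeleton = SkeletonNode(
    "book",
    children=[
        SkeletonNode("title"),
        SkeletonNode("price", condition=lambda t: float(t) < 50),
    ],
)

root = ET.fromstring(DOC)
titles = [book.findtext("title") for book in query(root, skeleton)]
print(titles)
```

In this sketch the user never writes a query expression; the only input is the condition placed on a skeleton node, which is the QBE-style interaction the abstract describes. How Wdee derives skeletons from schemata and maps them back to the original Web page is not covered here.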
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, Z., Ng, W.K. (2005). WDEE: Web Data Extraction by Example. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_32
DOI: https://doi.org/10.1007/11408079_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer Science, Computer Science (R0)