Web Data Extraction Techniques and Applications Using the Extensible Markup Language (XML)

Myllymaki, Jussi; Jackson, Jared

doi:10.1007/978-1-4020-7829-3_18

Abstract

The driving force behind the technology revolution has always been one thing: information. Almost every computer-related invention since the transistor has been made to aid the transfer of information, or data, from one place to another. Although a primitive form of what we now know as the Internet already existed, less than a generation ago digital information mostly had to be carried around on magnetic media such as tapes and disks. Fortunately, the rise of the Internet and the World Wide Web in the mid-1990s removed the barrier that the physical transportation of data had placed on us.

Author information

Authors and Affiliations

IBM Almaden Research Center, San Jose, California, USA
Jussi Myllymaki & Jared Jackson

Editor information

Editors and Affiliations

University of California, Los Angeles, USA
Cornelius T. Leondes

About this chapter

Cite this chapter

Myllymaki, J., Jackson, J. (2005). Web Data Extraction Techniques and Applications Using the Extensible Markup Language (XML). In: Leondes, C.T. (eds) Intelligent Knowledge-Based Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4020-7829-3_18

DOI: https://doi.org/10.1007/978-1-4020-7829-3_18
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-7746-3
Online ISBN: 978-1-4020-7829-3
eBook Packages: Computer Science (R0)
