Web Data Extraction Techniques and Applications Using the Extensible Markup Language (XML)

  • Chapter
Intelligent Knowledge-Based Systems

Abstract

The driving force behind the technology revolution has always been one thing: information. Almost every computer-related invention since the transistor has been made to aid the transfer of information, or data, from one place to another. Although a primitive form of what we now know as the Internet already existed, less than a generation ago digital information mostly had to be carried around on magnetic media such as tapes and disks. The rise of the Internet and the World Wide Web in the mid-1990s removed the barrier that the physical transportation of data had imposed.


Copyright information

© 2005 Kluwer Academic Publishers

About this chapter

Cite this chapter

Myllymaki, J., Jackson, J. (2005). Web Data Extraction Techniques and Applications Using the Extensible Markup Language (XML). In: Leondes, C.T. (eds) Intelligent Knowledge-Based Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4020-7829-3_18

  • DOI: https://doi.org/10.1007/978-1-4020-7829-3_18

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-7746-3

  • Online ISBN: 978-1-4020-7829-3

  • eBook Packages: Computer Science (R0)
