Automating Content Extraction of HTML Documents

World Wide Web

Abstract

Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts the user from the actual content. Extracting “useful and relevant” content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable either change the font size or remove HTML and data components such as images, which detracts from a web page’s inherent look and feel. Unlike “Content Reformatting,” which aims to reproduce the entire web page in a more convenient form, our solution directly addresses “Content Extraction.” We have developed a framework that employs an easily extensible set of techniques and incorporates the advantages of previous work on content extraction. Our key insight is to work with DOM trees, a W3C-specified interface that allows programs to dynamically access document structure, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that extracts content from HTML web pages. This proxy can be used both centrally, administered for groups of users, and by individuals for their personal browsers. After receiving feedback from users of the proxy, we also created a revised version designed with improved performance and accessibility in mind.
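To make the idea of working on a DOM tree rather than on raw markup concrete, the following is a minimal Python sketch and not the framework described in this article. It assumes well-formed XHTML input so that the standard-library `xml.dom.minidom` parser can be used. The tag list, the link-text ratio heuristic, the 0.5 threshold, and all function names are illustrative assumptions.

```python
from xml.dom import minidom

# Tags dropped outright; this list is an illustrative assumption.
CLUTTER_TAGS = {"img", "script", "iframe"}
# Block-level tags checked for being dominated by links (also an assumption).
BLOCK_TAGS = {"div", "ul", "table", "td"}


def text_length(node):
    """Count text characters under a node, trimming whitespace per text node."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data.strip())
    return sum(text_length(child) for child in node.childNodes)


def link_text_length(node):
    """Count text characters that sit inside <a> elements under a node."""
    if node.nodeType != node.ELEMENT_NODE:
        return 0
    if node.tagName.lower() == "a":
        return text_length(node)
    return sum(link_text_length(child) for child in node.childNodes)


def prune(node, max_link_ratio=0.5):
    """Remove clutter elements and link-heavy blocks beneath node, in place."""
    for child in list(node.childNodes):  # copy: we mutate while iterating
        if child.nodeType != child.ELEMENT_NODE:
            continue
        tag = child.tagName.lower()
        total = text_length(child)
        link_heavy = (
            tag in BLOCK_TAGS
            and total > 0
            and link_text_length(child) / total > max_link_ratio
        )
        if tag in CLUTTER_TAGS or link_heavy:
            node.removeChild(child)
        else:
            prune(child, max_link_ratio)


def collect_text(node):
    """Return the non-empty text fragments under a node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        stripped = node.data.strip()
        return [stripped] if stripped else []
    parts = []
    for child in node.childNodes:
        parts.extend(collect_text(child))
    return parts


def extract_content(xhtml):
    """Parse well-formed XHTML, prune clutter, and return the remaining text."""
    dom = minidom.parseString(xhtml)
    prune(dom.documentElement)
    return " ".join(collect_text(dom.documentElement))


if __name__ == "__main__":
    page = (
        "<html><body>"
        '<div><a href="/a">Home</a> <a href="/b">Sports</a></div>'
        '<img src="ad.gif"/>'
        '<p>The article body stays, including an <a href="/c">inline link</a>.</p>'
        "</body></html>"
    )
    print(extract_content(page))
```

Because each filter operates on tree nodes, a further technique can be added as one more predicate in `prune` without touching the parsing step. Real-world HTML is rarely well-formed, so a practical system would need an HTML-tolerant parser in place of `minidom`.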



Author information

Corresponding author

Correspondence to Suhit Gupta.

About this article

Cite this article

Gupta, S., Kaiser, G.E., Grimm, P. et al. Automating Content Extraction of HTML Documents. World Wide Web 8, 179–224 (2005). https://doi.org/10.1007/s11280-004-4873-3

  • DOI: https://doi.org/10.1007/s11280-004-4873-3
