Web Document Analysis

  • Apostolos Antonacopoulos
  • Jianying Hu
Part of the Advances in Pattern Recognition book series (ACVPR)

Keywords

Document Image; Table Detection; Collaborative Editing; Wrapper Induction; Voice Browsing

Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  • Apostolos Antonacopoulos (1)
  • Jianying Hu (2)
  1. School of Computing, Science and Engineering, University of Salford, UK
  2. IBM T. J. Watson Research Center, Yorktown Heights, USA
