Abstract
The size and growth of the Web continue to pose new challenges for researchers. One of these challenges is helping users become familiar with a large number of Web pages. Today’s search engines provide tools that allow users to refine their queries. One approach is to refine a query based on an analysis of Web content. Possible outcomes are not only recommended collocations but also recommended page genres (e.g., discussion forums). It is proving very useful to provide details of page content when viewing the page: not only text snippets but also parts of the page menu and, for certain pages, how many posts a discussion contains, what day a review was created, or the price of a product sold on the page. Obtaining this information from unstructured or semi-structured content is not straightforward. This chapter addresses the development of methods capable of detecting and extracting information from Web pages. The concept of objects called MicroGenres is presented. Finally, we present experiments with our own Pattrio method, which provides a way to detect objects placed on Web pages.
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Snášel, V., Kudělka, M., Horák, Z. (2010). Web Content Mining Using MicroGenres. In: Velásquez, J.D., Jain, L.C. (eds) Advanced Techniques in Web Intelligence - I. Studies in Computational Intelligence, vol 311. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14461-5_4
DOI: https://doi.org/10.1007/978-3-642-14461-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14460-8
Online ISBN: 978-3-642-14461-5
eBook Packages: Engineering, Engineering (R0)