Entropy-based automated wrapper generation for weblog data extraction

Gkotsis, George; Stepanyan, Karen; Cristea, Alexandra I.; Joy, Mike

doi:10.1007/s11280-013-0269-6

Published: 21 November 2013

World Wide Web, Volume 17, pages 827–846 (2014)

Abstract

This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and the processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
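
The feed-to-HTML matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the selector strings, the exact-match comparison, and the simple hit-ratio used as a confidence score are all assumptions made for illustration. The sketch takes, for several posts, a value known from the web feed (here, the title) together with the text found at candidate locations in the post's HTML, and picks the location that most often holds the feed value.

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace and lowercase so feed and HTML strings compare cleanly."""
    return " ".join(text.split()).lower()


def derive_rule(posts):
    """Return (selector, confidence) for the selector that most often holds the feed value.

    `posts` is a list of (feed_value, elements) pairs. `elements` maps a
    hypothetical selector string to the text found at that location in the
    post's HTML. Confidence is the fraction of posts in which the chosen
    selector matched the feed value (an illustrative score, not the paper's).
    """
    hits = Counter()
    for feed_value, elements in posts:
        target = normalize(feed_value)
        for selector, text in elements.items():
            if normalize(text) == target:
                hits[selector] += 1
    if not hits:
        return None, 0.0
    selector, count = hits.most_common(1)[0]
    return selector, count / len(posts)


# Hypothetical data: three posts, with the title taken from the feed.
posts = [
    ("First post", {"h1.entry-title": "First post", "div.sidebar h3": "Recent posts"}),
    ("Second  Post", {"h1.entry-title": "second post", "div.sidebar h3": "Recent posts"}),
    ("Third post", {"h2.title": "Third post", "div.sidebar h3": "Recent posts"}),
]
rule, confidence = derive_rule(posts)
print(rule, round(confidence, 2))
```

Because each post is compared against its own feed value rather than against other posts, the cost grows linearly with the number of posts, which is one plausible reading of why the abstract contrasts this with pairwise comparison.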

Author information

Authors and Affiliations

Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
George Gkotsis, Karen Stepanyan, Alexandra I. Cristea & Mike Joy

Corresponding author

Correspondence to George Gkotsis.

Additional information

This work was conducted as part of the BlogForever project, funded by the European Commission Framework Programme 7 (FP7), grant agreement No. 269963.

About this article

Cite this article

Gkotsis, G., Stepanyan, K., Cristea, A.I. et al. Entropy-based automated wrapper generation for weblog data extraction. World Wide Web 17, 827–846 (2014). https://doi.org/10.1007/s11280-013-0269-6

Received: 31 October 2012
Revised: 24 October 2013
Accepted: 04 November 2013
Published: 21 November 2013
Issue Date: July 2014
DOI: https://doi.org/10.1007/s11280-013-0269-6
