
Entropy-based automated wrapper generation for weblog data extraction

World Wide Web

Abstract

This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates approaches based on web feeds and HTML processing to extract weblog properties. It includes a model that generates a wrapper by using web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values in the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach to deriving the rules and automating wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
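
The matching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the rule language, the string comparison, or the scoring, so the sketch assumes a rule is simply a (tag, class) pair, uses exact matching after whitespace normalization, and scores a rule by the fraction of posts in which it selects the feed value. The names `ElementTextCollector` and `score_rules` and the sample posts are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they carry no text of their own.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ElementTextCollector(HTMLParser):
    """Collect (rule, text) pairs; a rule is an assumed (tag, class) pair."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open elements as [rule, text_parts]
        self.elements = []  # closed elements as (rule, normalized_text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        rule = (tag, dict(attrs).get("class") or "")
        self.stack.append([rule, []])

    def handle_data(self, data):
        # Text counts towards every element that is currently open.
        for _, parts in self.stack:
            parts.append(data)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, plus anything
        # left unclosed inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0][0] == tag:
                for rule, parts in self.stack[i:]:
                    text = " ".join("".join(parts).split())
                    self.elements.append((rule, text))
                del self.stack[i:]
                break


def score_rules(posts):
    """Score candidate rules against feed values.

    posts: list of (post_html, feed_value) pairs, one per weblog post.
    Returns {rule: fraction of posts where the rule selects the feed value}.
    """
    hits = Counter()
    for html, feed_value in posts:
        parser = ElementTextCollector()
        parser.feed(html)
        parser.close()
        target = " ".join(feed_value.split())
        # Count each rule at most once per post.
        hits.update({rule for rule, text in parser.elements if text == target})
    return {rule: count / len(posts) for rule, count in hits.items()}


# Hypothetical posts: the feed supplies each post's title.
posts = [
    ('<div class="post"><h1 class="entry-title">Hello</h1>'
     '<p>Body one</p></div>', "Hello"),
    ('<div class="post"><h1 class="entry-title">Second post</h1>'
     '<p>Second post</p></div>', "Second post"),
]
scores = score_rules(posts)
best_rule = max(scores, key=scores.get)
```

In the second post the title text also appears in a paragraph, so the `p` rule matches there by coincidence. Aggregating over several posts is what lets the consistent rule, `("h1", "entry-title")`, outrank it; a real system would likely need approximate string matching and a richer rule language than this sketch assumes.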



Author information


Correspondence to George Gkotsis.

Additional information

This work was conducted as part of the BlogForever project, funded by the European Commission Framework Programme 7 (FP7), grant agreement No. 269963.


About this article

Cite this article

Gkotsis, G., Stepanyan, K., Cristea, A.I. et al. Entropy-based automated wrapper generation for weblog data extraction. World Wide Web 17, 827–846 (2014). https://doi.org/10.1007/s11280-013-0269-6


  • DOI: https://doi.org/10.1007/s11280-013-0269-6
