World Wide Web, Volume 17, Issue 4, pp 827–846

Entropy-based automated wrapper generation for weblog data extraction

  • George Gkotsis
  • Karen Stepanyan
  • Alexandra I. Cristea
  • Mike Joy

Abstract

This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches that use web feeds and HTML processing to extract weblog properties. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
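The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes exact text matching between a feed value and an element's text, represents a candidate rule as a hypothetical (tag, class) pair, and ranks rules by the plain fraction of posts they match, as a simple stand-in for the paper's entropy-based, probabilistic scoring. All names (`ElementTextCollector`, `candidate_rules`, `derive_rule`, `sample_posts`) and the sample HTML are invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementTextCollector(HTMLParser):
    """Collects (tag, class, text) for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open elements as (tag, class attribute)
        self.elements = []  # collected (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if any.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.elements.append((tag, cls, text))


def candidate_rules(html, feed_value):
    """Return the (tag, class) pairs whose text equals the feed value in one post."""
    parser = ElementTextCollector()
    parser.feed(html)
    parser.close()
    return {(tag, cls) for tag, cls, text in parser.elements if text == feed_value}


def derive_rule(posts):
    """posts: list of (html, feed_value). Returns (rule, support) or None.

    Support is the fraction of posts in which the rule located the feed value.
    This frequency score is an assumption, not the scoring used in the paper.
    """
    counts = Counter()
    for html, value in posts:
        counts.update(candidate_rules(html, value))
    if not counts:
        return None
    rule, hits = counts.most_common(1)[0]
    return rule, hits / len(posts)


# Two hypothetical posts; the feed supplies each post's title.
sample_posts = [
    ('<html><head><title>First post | My Blog</title></head><body>'
     '<h1 class="entry-title">First post</h1>'
     '<ul><li><a class="recent" href="/1">First post</a></li></ul>'
     '</body></html>', "First post"),
    ('<html><head><title>Second post | My Blog</title></head><body>'
     '<h1 class="entry-title">Second post</h1>'
     '<ul><li><a class="recent" href="/1">First post</a></li></ul>'
     '</body></html>', "Second post"),
]

print(derive_rule(sample_posts))  # (('h1', 'entry-title'), 1.0)
```

In this toy input the sidebar link also matches the title in the first post, but only the `h1.entry-title` rule matches in both posts. This shows why aggregating over multiple posts, rather than inspecting a single one, helps separate the intended element from coincidental matches.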

Keywords

  • Web information extraction
  • Automatic wrapper generation
  • Weblogs

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • George Gkotsis (1)
  • Karen Stepanyan (1)
  • Alexandra I. Cristea (1)
  • Mike Joy (1)

  1. Department of Computer Science, University of Warwick, Coventry, UK
