Robust Web Data Extraction Based on Unsupervised Visual Validation

  • Benoit Potvin
  • Roger Villemaire
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11431)


Visual validation is the process of validating sets of extracted entities by means of visual information. Its main advantage is that it exploits visual information for web information extraction without compromising the robustness of extractors. In this paper, we show that unsupervised visual validation can be used to create robust web data extractors. More precisely, we evaluate the performance of visual validation on a corpus of visually heterogeneous documents. The selected extraction task consists of extracting the price, name, description, and SKU of unspecified products from unseen documents. Our corpus, which we make publicly available, contains 1000 products from 100 different sources. Results also show that visual validation improves web data extraction even when the extractor is trained with visual features.


Visual validation, Robustness, Isolation forest, Web information extraction, Classifiers



The authors gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, Université du Québec à Montréal, Montréal, Canada
