Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9671)

Abstract

Recent trends in website design affect the methods used for web data extraction. Many existing methods rely on structural analysis of web pages, but since the introduction of CSS, table-based layouts are no longer used. In addition, responsive design makes layout and presentation dependent on the browsing context, which complicates the use of visual cues. We present DeepDesign, a system that semi-automatically extracts data records from web pages based on a combination of structural and visual features. It runs in a general-purpose browser, taking advantage of direct access to the complete CSS3 spectrum and the ability to trigger and execute JavaScript in the page. The user sees record matching in real time and can adapt the process dynamically if required. We present the details of the matching algorithms and evaluate them on the top ten Alexa websites.

Keywords

Data extraction, Wrapper induction, Browser

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Science, ETH Zurich, Zurich, Switzerland
