User-Friendly and Extensible Web Data Extraction

Conference paper
Part of the Lecture Notes in Information Systems and Organisation book series (LNISO, volume 26)

Abstract

The creation of web wrappers is a subject of study in the field of web data extraction. Designing a domain-specific language for a web wrapper is challenging, because it involves a tradeoff between the expressiveness of the wrapper language and its safety. In addition, little attention has been paid to executing a wrapper in a restricted environment. In this paper we present a new wrapping language, Serrano, that has three goals: (1) the ability to run in a restricted environment, such as a browser extension, (2) extensibility, to balance the tradeoff between the expressiveness of the command set and safety, and (3) processing capabilities that eliminate the need for additional programs to clean the extracted data. Serrano has been successfully deployed in a number of projects and has provided competitive results.

Keywords

Web data extraction; Safe execution; Restricted environment; Web browser extension

Notes

Acknowledgements

This work was supported by project SVV 260451.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Charles University, Prague, Czechia
