A Reference Architecture to Devise Web Information Extractors

  • Hassan A. Sleiman
  • Rafael Corchuelo
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 112)


The Web is the largest repository of human-friendly information. Unfortunately, web information is embedded in formatting tags and surrounded by irrelevant content. Researchers are working on information extractors that transform this information into structured data so that it can later be integrated into automated processes. Devising a new information extraction technique involves some tasks that are specific to that technique and many tasks that are common to all techniques. The literature lacks a reference architecture to guide software engineers in the design and implementation of information extractors; as a result, there is little reuse, and the focus is usually blurred by irrelevant details. In this paper, we present a reference architecture to design and implement rule learners for information extractors. We have implemented a software framework to support our architecture, and we have validated it by means of four case studies and a number of experiments that show that our proposal helps reduce development costs significantly.
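To make the split between common and technique-specific tasks concrete, the following is a minimal sketch in Python. It is not the authors' framework: every name in it (`Annotation`, `RuleLearner`, `DelimiterLearner`, `extract`) is hypothetical, and the toy delimiter-based learner is an assumption chosen only because it is short. The base class holds a step that any rule learner would plausibly share (checking the training annotations), and a subclass supplies only the induction step.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A labelled piece of text found in a training page (hypothetical type)."""
    label: str
    text: str


class RuleLearner(ABC):
    """Hypothetical base class: shared steps live here, induction is abstract."""

    def learn(self, pages, annotations):
        # Common task: every annotated text must occur in at least one page.
        for ann in annotations:
            if not any(ann.text in page for page in pages):
                raise ValueError(f"annotation {ann.label!r} not found in any page")
        # Technique-specific task: delegated to the concrete learner.
        return self.induce_rules(pages, annotations)

    @abstractmethod
    def induce_rules(self, pages, annotations):
        """Return a mapping from label to a compiled extraction rule."""


class DelimiterLearner(RuleLearner):
    """Toy technique (an assumption): use the characters around an example as delimiters."""

    def __init__(self, width=3):
        self.width = width

    def induce_rules(self, pages, annotations):
        rules = {}
        for ann in annotations:
            for page in pages:
                start = page.find(ann.text)
                if start < 0:
                    continue
                end = start + len(ann.text)
                left = page[max(0, start - self.width):start]
                right = page[end:end + self.width]
                # Rule: capture the shortest text between the two delimiters.
                rules[ann.label] = re.compile(
                    re.escape(left) + "(.*?)" + re.escape(right)
                )
                break
        return rules


def extract(rules, page):
    """Common task: apply learnt rules to an unseen page."""
    return {label: rule.findall(page) for label, rule in rules.items()}
```

A new technique would, under these assumptions, subclass `RuleLearner` and override only `induce_rules`; the validation in `learn` and the `extract` helper would be reused unchanged.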


Keywords: Information Extraction, Rule Learning, Reference Architecture



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Hassan A. Sleiman (1)
  • Rafael Corchuelo (1)
  1. ETSI Informática, Universidad de Sevilla, Sevilla, Spain
