On Extracting Information from Semi-structured Deep Web Documents

  • Patricia Jiménez
  • Rafael Corchuelo
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 208)


Some software agents need information that is provided by web sites, which is difficult to obtain if those sites lack a query API. Information extractors are intended to extract the information of interest automatically and offer it in a structured format. Unfortunately, most of them rely on ad-hoc techniques, which makes them fade away as the Web evolves. In this paper, we present a proposal that relies on an open catalogue of features, which makes it easy to adapt; we have also devised an optimisation that makes it very efficient. Our experimental results show that our proposal outperforms other state-of-the-art proposals.


Information extraction · Semi-structured deep-web data sources



Our work was funded by the Spanish and the Andalusian R&D&I programmes through grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, TIN2010-09988-E, TIN2011-15497-E, and TIN2013-40848-R, which received funds from the European FEDER programme.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. ETSI Informática, Sevilla, Spain
