Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction

  • Manuel Álvarez
  • Alberto Pan
  • Juan Raposo
  • Fernando Bellas
  • Fidel Cacheda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4831)


Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. Although several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques on a large number of real web sources and found them to be very effective.
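
As a rough illustration of how template regularities might be measured with edit distance, the sketch below compares two candidate records represented as sequences of HTML tag tokens. It is not the paper's algorithm: the token representation, the `TEXT` placeholder, the unit edit costs, and the function names `edit_distance` and `similarity` are all assumptions made for this example.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(
                prev[j] + 1,         # delete tok_a
                curr[j - 1] + 1,     # insert tok_b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Distance normalized to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical records: the second wraps one field in a link.
record_1 = ["<tr>", "<td>", "TEXT", "</td>", "<td>", "TEXT", "</td>", "</tr>"]
record_2 = ["<tr>", "<td>", "TEXT", "</td>",
            "<td>", "<a>", "TEXT", "</a>", "</td>", "</tr>"]

print(edit_distance(record_1, record_2))
print(similarity(record_1, record_2))
```

Records produced by the same template would be expected to score close to 1.0 under a measure like this, which is what makes such a similarity usable as input to a clustering step; the threshold for "similar enough" is a tuning choice not specified here.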


Data Record; Data Region; Candidate Division; Column Similarity; Text Node
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Manuel Álvarez (1)
  • Alberto Pan (1)
  • Juan Raposo (1)
  • Fernando Bellas (1)
  • Fidel Cacheda (1)
  1. Department of Information and Communications Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain
