Site-Wide Wrapper Induction for Life Science Deep Web Databases

  • Saqib Mir
  • Steffen Staab
  • Isabel Rojas
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5647)

Abstract

We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques learn wrappers from examples drawn from a single class of Web pages, i.e. pages that are all similar in structure and content. They therefore target pages generated from a database with the same generation template as the one observed in the example set. Life Science Web sites, however, typically contain structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such sites do not provide data alone: they also tend to provide schema information in the form of data labels, which gives further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical Deep Web sources and report promising preliminary results.
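The general idea of grouping pages by structure and then inducing one wrapper per group can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a simple greedy clustering on tag-path sets with Jaccard similarity, and it treats text that stays constant across the pages of a class as template text or labels and text that varies as data. All names (`text_paths`, `cluster_pages`, `induce_wrapper`, `extract`), the 0.7 threshold, and the toy pages are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Records (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_paths(html):
    """Return (indexed tag path, text) pairs in document order."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    counts, out = {}, []
    for path, text in parser.items:
        k = counts.get(path, 0)  # k-th text node seen under this path
        counts[path] = k + 1
        out.append((f"{path}[{k}]", text))
    return out


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_pages(pages, threshold=0.7):
    """Greedy clustering: a page joins the first class whose
    representative path set is similar enough (assumed heuristic)."""
    clusters = []  # (representative path set, member indices)
    for i, html in enumerate(pages):
        paths = {p for p, _ in text_paths(html)}
        for rep, members in clusters:
            if jaccard(rep, paths) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((paths, [i]))
    return [members for _, members in clusters]


def induce_wrapper(class_pages):
    """Map each varying path (data slot) to the nearest preceding
    constant text, which is used as its label."""
    seen = {}
    for html in class_pages:
        for path, text in text_paths(html):
            seen.setdefault(path, set()).add(text)
    wrapper, label = {}, None
    for path, _ in text_paths(class_pages[0]):
        if len(seen[path]) == 1:
            label = next(iter(seen[path]))  # constant: template text / label
        else:
            wrapper[path] = label           # varying: data slot
    return wrapper


def extract(wrapper, html):
    return {wrapper[p]: t for p, t in text_paths(html) if p in wrapper}


# Toy pages (invented for illustration): two "compound" pages, one "reaction" page.
compound_a = ("<html><body><h1>Compound</h1><table><tr><td>Name</td><td>water</td></tr>"
              "<tr><td>Formula</td><td>H2O</td></tr></table></body></html>")
compound_b = ("<html><body><h1>Compound</h1><table><tr><td>Name</td><td>ethanol</td></tr>"
              "<tr><td>Formula</td><td>C2H6O</td></tr></table></body></html>")
reaction = "<html><body><h2>Reaction</h2><p>Equation</p><p>A + B = C</p></body></html>"

pages = [compound_a, reaction, compound_b]
classes = cluster_pages(pages)
wrapper = induce_wrapper([pages[i] for i in classes[0]])
print(classes)
print(extract(wrapper, compound_b))
```

In this sketch a class with a single example page yields an empty wrapper, because no text can be observed to vary; at least two pages per class are needed before data slots can be told apart from labels.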

Keywords

Deep Web, Wrapper Generation, Information Extraction, Database

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Saqib Mir (1, 2)
  • Steffen Staab (2)
  • Isabel Rojas (1)
  1. EML Research, Heidelberg, Germany
  2. Institute for Computer Science, University of Koblenz-Landau, Koblenz, Germany
