
Instance Discovery and Schema Matching with Applications to Biological Deep Web Data Integration

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNBI, volume 6254)

Abstract

This paper presents data mining-based techniques for enabling data integration across deep web data sources. We target query processing across inter-dependent data sources. Thus, besides input-input and output-output matching of attributes, we also need to consider input-output matching. We develop data mining techniques for discovering the instances needed to query deep web data sources. These instances are drawn from the information provided by the query interfaces themselves and from the output pages of related data sources, which are obtained by query probing with dynamically identified input instances. Then, using a hierarchical representation of schemas and applying clustering techniques, we generate schema matches. We demonstrate the effectiveness of our technique by integrating 24 query interfaces.
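
As a rough illustration of how instance overlap plus clustering can yield input-input, output-output, and input-output matches, the following minimal Python sketch groups attributes whose observed instances are similar. The abstract does not specify the similarity measure or the clustering algorithm, so Jaccard similarity, single-link merging, the 0.5 threshold, and all attribute names and values here are assumptions made for illustration only, not the authors' method.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two instance sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_attributes(instances, threshold=0.5):
    """Single-link agglomerative clustering of attributes by instance overlap.

    `instances` maps an attribute name to the set of values observed for it.
    Two clusters merge when any cross-cluster pair reaches `threshold`.
    """
    clusters = [{name} for name in instances]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(clusters, 2):
            if any(jaccard(instances[x], instances[y]) >= threshold
                   for x in c1 for y in c2):
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 | c2)
                merged = True
                break  # restart the scan over the updated cluster list
    return clusters


# Hypothetical attributes and instances (not taken from the paper).
observed = {
    "srcA.input.gene": {"BRCA1", "TP53", "EGFR"},
    "srcB.output.gene_symbol": {"BRCA1", "TP53", "KRAS"},
    "srcB.input.snp_id": {"rs334", "rs429358"},
    "srcC.output.rsid": {"rs334", "rs429358", "rs7412"},
}

matches = cluster_attributes(observed)
for group in matches:
    print(sorted(group))
```

In this toy data, an input attribute of one source ends up grouped with an output attribute of another, which is the kind of input-output match that querying across inter-dependent sources requires.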

Keywords

  • Schema Match
  • Output Attribute
  • Input Attribute
  • Query Interface
  • Input Instance


Chapter
  • DOI: 10.1007/978-3-642-15120-0_12
  • Chapter length: 16 pages



Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, T., Wang, F., Agrawal, G. (2010). Instance Discovery and Schema Matching with Applications to Biological Deep Web Data Integration. In: Lambrix, P., Kemp, G. (eds) Data Integration in the Life Sciences. DILS 2010. Lecture Notes in Computer Science, vol 6254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15120-0_12


  • DOI: https://doi.org/10.1007/978-3-642-15120-0_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15119-4

  • Online ISBN: 978-3-642-15120-0

  • eBook Packages: Computer Science, Computer Science (R0)