Automatic Discovery and Inferencing of Complex Bioinformatics Web Interfaces

World Wide Web

Abstract

The World Wide Web provides a vast resource to genomics researchers, with Web-based access to distributed data sources such as BLAST sequence homology search interfaces. However, finding the desired scientific information can still be tedious and frustrating. While several well-known genomic data servers (e.g., GenBank, EMBL, NCBI) are shared and accessed frequently, new data sources are created each day in laboratories all over the world. Sharing these new genomics results is hindered by the lack of a common interface or data exchange mechanism. Moreover, the number of autonomous genomics sources and their rate of change outpace the speed at which they can be manually identified, so the available data is not being used to its full potential. An automated system that can find, classify, describe, and wrap new sources, without tedious low-level coding of source-specific wrappers, is needed to help scientists access hundreds of dynamically changing bioinformatics Web data sources through a single interface. A correct classification of any kind of Web data source must address both the capability of the source and the conversation/interaction semantics inherent in its design. We propose a service class description (SCD), a meta-data approach for classifying Web data sources that takes into account both the capability and the conversational semantics of the source. The ability to discover the interaction pattern of a Web source leads to increased accuracy in the classification process. Our results show that an SCD-based approach successfully classifies two thirds of BLAST sites with 100% accuracy and two thirds of bioinformatics keyword search sites with around 80% precision.

Author information

Corresponding author

Correspondence to Anne H. H. Ngu.

Additional information

This work was performed under the auspices of the U.S. Department of Energy by University of California, Lawrence Livermore National Laboratory under Contract W-7405-ENG-48. UCRL-JC

This work was performed while the author was a summer faculty scholar at LLNL.

About this article

Cite this article

Ngu, A.H.H., Rocco, D., Critchlow, T. et al. Automatic Discovery and Inferencing of Complex Bioinformatics Web Interfaces. World Wide Web 8, 463–493 (2005). https://doi.org/10.1007/s11280-005-0509-5

  • DOI: https://doi.org/10.1007/s11280-005-0509-5
