Integrated Term Weighting, Visualization, and User Interface Development for Bioinformation Retrieval

  • Min Hong
  • Anis Karimpour-Fard
  • Steve Russell
  • Lawrence Hunter
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3397)


This project implements an integrated biological information website that classifies technical documents, learns about users’ interests, and offers intuitive interactive visualization for navigating vast information spaces. The effective use of modern software engineering principles, system environments, and development approaches is demonstrated. Straightforward yet powerful document characterization strategies are illustrated, helpful visualization for effective knowledge transfer is shown, and current user interface methodologies are applied. A specific success of note is the collaboration of disparately skilled specialists to rapidly deliver a flexible integrated prototype that meets user acceptance and performance goals. The domain chosen for the demonstration is breast cancer, using a corpus of abstracts from publications obtained online from Medline. The terms in the abstracts are extracted by word stemming and a stop list, and are encoded in vectors. A TF-IDF technique is implemented to calculate similarity scores between a set of documents and a query. Polysemy and synonyms are explicitly addressed. Groups of related and useful documents are identified using interactive visual displays such as a spiral graph that represents the overall similarity of documents. K-means clustering of the similarities among a document set is used to display a 3-D relationship map. User identities are established and updated by observing the patterns of terms used in their queries, and from login site locations. Explicit considerations of changing user category profiles, site stakeholders, information modeling, and networked technologies are pointed out.
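The term extraction and TF-IDF scoring summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop list, the function names, the smoothed IDF formula, and the use of cosine similarity as the score are all assumptions made for illustration, and stemming and the handling of polysemy and synonyms are left out.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; the one used in the paper is not given in the abstract.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_idf(docs_tokens):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(docs, query):
    """Return (score, document index) pairs, best match first."""
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(tfidf_vector(t, idf), q_vec), i) for i, t in enumerate(docs_tokens)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Made-up example abstracts, not taken from the Medline corpus used in the paper.
docs = [
    "BRCA1 mutation in breast cancer",
    "Protein folding in yeast",
    "Breast cancer screening with mammography",
]
for score, index in rank(docs, "breast cancer mutation"):
    print(f"{score:.3f}  {docs[index]}")
```

Sparse dictionaries keep the sketch short and dependency-free; a real system over a large corpus would more likely use an inverted index or a sparse matrix library.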


Information Retrieval; Vector Space Model; Unified Medical Language System; Document Cluster; Document Vector
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Min Hong (1)
  • Anis Karimpour-Fard (1)
  • Steve Russell (1)
  • Lawrence Hunter (1)
  1. Bioinformatics, University of Colorado Health Sciences Center, Denver, USA
