Collaborative Discovery Through Biological Language Modeling Interface

  • Madhavi Ganapathiraju
  • Vijayalaxmi Manoharan
  • Raj Reddy
  • Judith Klein-Seetharaman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3864)


Scientific progress is increasing exponentially, and computational biology is a typical example. Here, problems in biology and biochemistry are solved by analogy, through the application of computational theories from physics, mathematics, statistical mechanics, materials science and computer science. More recently, under the Biological Language Modeling project, theories from language processing have been applied to the mapping of protein sequences to their structure, dynamics and function. Scientists from diverse computational and linguistics backgrounds collaborate with experimental biologists and have made significant scientific contributions. The essential component of this collaborative discovery is the web server of the biological language modeling toolkit, which enables computational and non-computational scientists to interact and collaborate with each other. The web server acts as a computational laboratory: researchers from a variety of scientific disciplines and geographical locations use its tools to characterize specific attributes of their protein or group of proteins of interest, and then combine the results with their domain expertise to arrive at conclusions. The web server is also useful for educating students who are entering computational biology research. In this paper, we describe the web server and the results obtained through local and global collaboration and education.


Mutual Information; Latent Semantic Analysis; Language Technology; Amino Acid Pair; Suffix Array
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Madhavi Ganapathiraju (1)
  • Vijayalaxmi Manoharan (1, 2)
  • Raj Reddy (1)
  • Judith Klein-Seetharaman (1, 2)

  1. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA
  2. Department of Structural Biology, University of Pittsburgh, Pittsburgh, USA
