Computational Biology and Language

  • Madhavi Ganapathiraju
  • Narayanas Balakrishnan
  • Raj Reddy
  • Judith Klein-Seetharaman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3345)


Current scientific research is characterized by increasing specialization, with knowledge accumulating rapidly through parallel advances in a multitude of sub-disciplines. Recent estimates suggest that human knowledge doubles every two to three years, and with advances in information and communication technologies, this large body of scientific knowledge is available to anyone, anywhere, anytime. This situation may also be referred to as ambient intelligence: an environment characterized by plentiful and available knowledge. The bottleneck in utilizing this knowledge is not accessing the information but assimilating it and transforming it to suit the needs of a specific application. The increasingly specialized areas of scientific research often share the common goal of converting data into insight, allowing the identification of solutions to scientific problems. Because of this common goal, there are strong parallels between different application areas that can be exploited to cross-fertilize disciplines. For example, the same fundamental statistical methods are used extensively in speech and language processing, in materials science applications, in visual processing and in biomedicine. Each sub-discipline has developed its own specialized methodologies that make these statistical methods successful for the given application. Unifying specialized areas is possible because many different problems share strong analogies, making the theories developed for one problem applicable to other areas of research. The goal of this paper is to demonstrate the utility of merging two disparate application areas to advance scientific research. The merging process requires cross-disciplinary collaboration to allow maximal exploitation of advances in one sub-discipline for the benefit of another. We demonstrate this general concept with the specific example of merging language technologies and computational biology.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Madhavi Ganapathiraju (1)
  • Narayanas Balakrishnan (2)
  • Raj Reddy (1)
  • Judith Klein-Seetharaman (3)

  1. Carnegie Mellon University, USA
  2. Indian Inst. of Science, India & Carnegie Mellon Univ., USA
  3. Carnegie Mellon University & University of Pittsburgh, USA
