Language engineering and information theoretic methods in protein sequence similarity studies

Bogan-Marta, A.; Hategan, A.; Pitas, I.

doi:10.1007/978-3-540-75767-2_8

Part of the book series: Studies in Computational Intelligence (SCI, volume 85)

The representation of biological data as text has opened new perspectives in biological research. Many biological sequence databases provide detailed information about sequences, supporting investigations such as searches, comparisons, and the establishment of relations between different sequences and species. The algorithmic procedures used for sequence analysis come from many areas of the computational sciences. In this chapter, we bring together a range of language engineering techniques and techniques based on information-theoretic principles for analyzing protein sequences from the perspective of similarity. After reviewing the state of the art through a survey of the approaches identified, we turn to the two methods we experimented with. The description of these methods and of the experiments performed opens a discussion intended for interested readers, who may find in it ideas for further improvement.

Author information

Authors and Affiliations

Department of Computers, University of Oradea, Universitatii No. 1, 410087, Oradea, Romania
A. Bogan-Marta
Institute of Signal Processing, Tampere University of Technology, Korkeakoulunkatu 1, 553, FIN-33101
A. Hategan
Department of Informatics, Artificial Intelligence and Information Analysis Laboratory, Aristotle University of Thessaloniki, 451, Thessaloniki, Greece
A. Bogan-Marta & I. Pitas

Editor information

Editors and Affiliations

Buffalo Neuroimaging Analysis Center, The Jacobs Neurological Institute, University at Buffalo, The State University of New York, 100 High Street, Buffalo, NY, 14203, USA
Arpad Kelemen
Centre for Quantifiable Quality of Service in Communication Systems (Q2S) Centre of Excellence, Norwegian University of Science and Technology, O.S. Bragstads plass 2E, N-7491, Trondheim, Norway
Ajith Abraham
Biostatistics Dept., University at Buffalo, The State University of New York, 252A2 Farber Hall, 3435 Main St., Buffalo, NY, 14214, USA
Yulan Liang

About this chapter

Cite this chapter

Bogan-Marta, A., Hategan, A., Pitas, I. (2008). Language engineering and information theoretic methods in protein sequence similarity studies. In: Kelemen, A., Abraham, A., Liang, Y. (eds) Computational Intelligence in Medical Informatics. Studies in Computational Intelligence, vol 85. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75767-2_8

DOI: https://doi.org/10.1007/978-3-540-75767-2_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75766-5
Online ISBN: 978-3-540-75767-2
eBook Packages: Engineering, Engineering (R0)
