Computational Linguistics and Biological Sequences in Artificial Intelligence

Chen, Qingfeng

doi:10.1007/978-981-99-8251-6_4

Abstract

Researchers generally regard nucleic acid as a language that carries rich information. This nucleic acid language describes the structures and processes of life, is as diverse as natural language, and shares many characteristics with it. For this reason, many studies apply results and methods from language theory to biological sequences, and computational linguistics has brought new breakthroughs to their study. Association analysis is a data mining technique for discovering frequent patterns and association rules in a dataset. In computational linguistics, it can uncover association rules in text data, which helps clarify the semantic and grammatical rules of natural languages. In biological sequence analysis, it can identify association rules in the genome and proteome that reveal interactions and functional relationships between genes or proteins. These results help interpret the data and support further development in both fields. Applying association analysis, together with computational linguistics, to the study of biological sequences is therefore a promising research direction.
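
As a rough illustration of association analysis on sequence data, the sketch below treats each DNA sequence as a "transaction" whose items are its k-mers (length-k substrings, playing the role of words) and reports rules A -> B whose support and confidence pass chosen thresholds. It is not the chapter's own algorithm: the function names `kmers` and `association_rules`, the thresholds, and the toy sequences are all assumptions made for this example.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def association_rules(sequences, k=2, min_support=0.5, min_confidence=0.8):
    """Mine pairwise rules lhs -> rhs between k-mers co-occurring in a sequence.

    Hypothetical helper: support is the fraction of sequences containing
    an itemset; confidence is support(lhs and rhs) / support(lhs).
    """
    transactions = [kmers(s, k) for s in sequences]
    n = len(transactions)

    def support(items):
        # Fraction of sequences whose k-mer set contains every item.
        return sum(1 for t in transactions if items <= t) / n

    # Keep only k-mers that are frequent on their own.
    all_items = set().union(*transactions)
    frequent = sorted(x for x in all_items if support({x}) >= min_support)

    rules = []
    for a, b in combinations(frequent, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        # Test the rule in both directions.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TTAC", "GGCC"]  # made-up sequences
    for lhs, rhs, sup, conf in association_rules(toy):
        print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

On the toy input, the only rule that passes both thresholds is CG -> AC: every sequence containing CG also contains AC, while the reverse direction falls below the confidence threshold. The same transaction view applies to text if k-mers are replaced by words and sequences by sentences.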

Author information

Authors and Affiliations

School of Computer and Electronic Information, Guangxi University, Nanning, Guangxi, China
Qingfeng Chen

About this chapter

Cite this chapter

Chen, Q. (2024). Computational Linguistics and Biological Sequences in Artificial Intelligence. In: Association Analysis Techniques and Applications in Bioinformatics. Springer, Singapore. https://doi.org/10.1007/978-981-99-8251-6_4

DOI: https://doi.org/10.1007/978-981-99-8251-6_4
Published: 26 April 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8250-9
Online ISBN: 978-981-99-8251-6
eBook Packages: Computer Science, Computer Science (R0)