Abstract
Nucleic acid sequences are widely regarded as a language that carries rich information. This language describes the structure and processes of life, and it shares many characteristics with natural language, including comparable diversity. Many studies have therefore applied results and methods from language theory to biological sequences, and computational linguistics has brought new breakthroughs to their study. Association analysis is a data mining technique for discovering frequent patterns and association rules in a dataset. In computational linguistics, it can uncover association rules in text data, which helps clarify the semantic and grammatical rules of natural languages. In biological sequence analysis, it can identify association rules in the genome and proteome, revealing interactions and functional relationships between genes or proteins. These results help interpret the data and support further progress in both fields. Applying association analysis, as used in computational linguistics, to the study of biological sequences is therefore a promising research direction.
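As a rough illustration of what association analysis over sequences can look like, the sketch below treats each sequence as a "transaction" of overlapping k-length substrings (k-mers) and derives simple pairwise rules with their support and confidence. This is not the chapter's own method: the function names `kmer_set` and `mine_rules`, the choice of k-mers as items, and the thresholds are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def kmer_set(seq, k):
    """Return the distinct k-length substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mine_rules(sequences, k=3, min_support=0.5, min_confidence=0.8):
    """Find pairwise rules lhs -> rhs between k-mers that co-occur in sequences.

    support    = fraction of sequences containing both k-mers
    confidence = (sequences with both) / (sequences with lhs)
    Hypothetical helper for illustration; thresholds are arbitrary defaults.
    """
    transactions = [kmer_set(s, k) for s in sequences]
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        item_counts.update(t)
        # Sort so each unordered pair is counted under one canonical key.
        pair_counts.update(combinations(sorted(t), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions, since confidence is asymmetric.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    toy_sequences = ["ACGT", "ACGA", "TTTT"]  # toy data, not from the chapter
    for lhs, rhs, sup, conf in mine_rules(toy_sequences, k=3,
                                          min_support=0.3, min_confidence=1.0):
        print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

The same function works on tokenized text if words replace k-mers as items, which is the parallel between the two fields that the abstract draws. Practical miners such as Apriori or PrefixSpan extend this idea to larger itemsets and ordered patterns.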
Copyright information
© 2024 Guangxi Education Publishing House
About this chapter
Cite this chapter
Chen, Q. (2024). Computational Linguistics and Biological Sequences in Artificial Intelligence. In: Association Analysis Techniques and Applications in Bioinformatics. Springer, Singapore. https://doi.org/10.1007/978-981-99-8251-6_4
DOI: https://doi.org/10.1007/978-981-99-8251-6_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8250-9
Online ISBN: 978-981-99-8251-6
eBook Packages: Computer Science, Computer Science (R0)