Computational Linguistics and Biological Sequences in Artificial Intelligence

Association Analysis Techniques and Applications in Bioinformatics

Abstract

Researchers generally regard nucleic acid sequences as a language that carries rich information. This nucleic acid language describes the structures and processes of life, and it shares many characteristics with natural language, including a comparable diversity. For this reason, many studies apply results and methods from language theory to biological sequences, and computational linguistics has brought a number of breakthroughs to biological sequence research along this route. Association analysis is a data mining technique for discovering frequent patterns and association rules in a dataset. In computational linguistics, it can uncover association rules in text data and so help clarify the semantic and grammatical rules of natural languages. In biological sequence analysis, it can identify association rules in the genome and proteome and so reveal interactions and functional relationships between genes or proteins. Such results support a better understanding of the data and encourage further development in both fields. Applying association analysis, as used in computational linguistics, to the study of biological sequences is therefore a promising research direction.
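As an illustration of what "frequent patterns and association rules" can mean for sequence data, the following is a minimal sketch and not the chapter's own method. It assumes that each sequence is treated as a transaction whose items are its distinct k-mers, and it reports pairwise rules X → Y using the standard support and confidence measures. The function names, the toy sequences, and the thresholds are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mine_rules(sequences, k=3, min_support=0.5, min_confidence=0.8):
    """Mine pairwise rules X -> Y over k-mers (illustrative sketch only).

    support(X, Y) = fraction of sequences containing both X and Y
    confidence(X -> Y) = support(X, Y) / support(X)
    """
    # Each sequence becomes a "transaction" of k-mer items (an assumption).
    transactions = [kmers(s, k) for s in sequences]
    n = len(transactions)

    # Count how many sequences contain each k-mer, then keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {x for x, c in item_counts.items() if c / n >= min_support}

    # Count co-occurrence of frequent k-mer pairs within the same sequence.
    pair_counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(t & frequent), 2):
            pair_counts[(a, b)] += 1

    rules = []
    for (a, b), c in pair_counts.items():
        if c / n < min_support:
            continue
        # Test the rule in both directions.
        for x, y in ((a, b), (b, a)):
            confidence = c / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, c / n, confidence))
    return sorted(rules)


if __name__ == "__main__":
    toy = ["ATGCA", "ATGCC", "TTGCA", "ATGAA"]  # made-up sequences
    for x, y, support, confidence in mine_rules(toy):
        print(f"{x} -> {y}  support={support:.2f}  confidence={confidence:.2f}")
```

On the toy input, the only rule that passes both thresholds is GCA → TGC: every sequence containing GCA also contains TGC. Real analyses would typically use larger itemsets, positional or ordering constraints, and statistical controls that this sketch omits.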

Copyright information

© 2024 Guangxi Education Publishing House

About this chapter

Cite this chapter

Chen, Q. (2024). Computational Linguistics and Biological Sequences in Artificial Intelligence. In: Association Analysis Techniques and Applications in Bioinformatics. Springer, Singapore. https://doi.org/10.1007/978-981-99-8251-6_4

  • DOI: https://doi.org/10.1007/978-981-99-8251-6_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8250-9

  • Online ISBN: 978-981-99-8251-6

  • eBook Packages: Computer Science, Computer Science (R0)
