
Controllable protein design with language models

  • Review Article
  • Nature Machine Intelligence

Abstract

The twenty-first century is presenting humankind with unprecedented environmental and medical challenges. The ability to design novel proteins tailored for specific purposes could transform our capacity to respond to these issues in a timely manner. Recent advances in the field of artificial intelligence are now setting the stage to make this goal achievable. Protein sequences are inherently similar to natural languages: amino acids arrange in a multitude of combinations to form structures that carry function, much as letters form words and sentences that carry meaning. Accordingly, it is not surprising that, throughout the history of natural language processing (NLP), many of its techniques have been applied to protein research problems. In the past few years we have witnessed revolutionary breakthroughs in the field of NLP. The implementation of pre-trained transformer models has enabled text generation with human-like capabilities, including texts with specific properties such as style or subject. Motivated by this success in NLP tasks, we expect dedicated transformers to dominate custom protein sequence generation in the near future. Fine-tuning pre-trained models on protein families will enable the extension of their repertoires with novel sequences that could be highly divergent yet still potentially functional. The addition of control tags specifying properties such as cellular compartment or function will further enable the controllable design of novel protein functions. Moreover, recent model interpretability methods will allow us to open the ‘black box’ and thus enhance our understanding of folding principles. Early initiatives show the enormous potential of generative language models to design functional sequences. We believe that using generative text models to create novel proteins is a promising and largely unexplored field, and we discuss its foreseeable impact on protein design.
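As a concrete illustration of the fine-tuning and control-tag workflow outlined above, the following sketch uses the Hugging Face transformers library to adapt a pretrained autoregressive protein language model to a protein family and then sample new sequences conditioned on a tag. It is a minimal sketch, not the authors' implementation: the checkpoint name (nferruz/ProtGPT2), the control-tag format, the toy training sequences and the sampling parameters are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of conditional fine-tuning:
# a pretrained causal protein language model is adapted to a protein family and
# steered with a control tag. Checkpoint, tag and sequences are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nferruz/ProtGPT2"              # assumed checkpoint; any causal protein LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

control_tag = "<family_A>"                   # hypothetical control tag for the target property
tokenizer.add_tokens([control_tag])
model.resize_token_embeddings(len(tokenizer))

# Toy fine-tuning set: family members prefixed with the control tag.
train_seqs = [
    control_tag + "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    control_tag + "MKQLEDKVEELLSKNYHLENEVARLKKLVGER",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):                       # a real run uses many sequences and epochs
    for seq in train_seqs:
        batch = tokenizer(seq, return_tensors="pt")
        # Standard causal language-modelling objective (labels are shifted internally).
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Controllable generation: prompt with the tag alone and sample candidate sequences.
model.eval()
prompt = tokenizer(control_tag, return_tensors="pt")
outputs = model.generate(**prompt, do_sample=True, top_k=950, max_length=200,
                         num_return_sequences=5, pad_token_id=tokenizer.eos_token_id)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```

In practice, one would fine-tune on the full set of family members, monitor perplexity on held-out sequences and assess the sampled designs with structure prediction or experimental screening.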


Fig. 1: Similarities between proteins and languages.
Fig. 2: Overview of the most commonly used methods for NLP problems.
Fig. 3: Schematic overview of the most widely used transformers.
Fig. 4: Model size and database growth over time.
Fig. 5: Overview of possibilities in the protein engineering field with transformer models.



Acknowledgements

N.F. acknowledges support from an AGAUR Beatriu de Pinós MSCA-COFUND Fellowship (project 2020-BP-00130).

Author information

Corresponding author

Correspondence to Noelia Ferruz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ferruz, N., Höcker, B. Controllable protein design with language models. Nat Mach Intell 4, 521–532 (2022). https://doi.org/10.1038/s42256-022-00499-z
