Evolution of Transcript Abundance is Influenced by Indels in Protein Low Complexity Regions

Dickson, Zachery W.; Golding, G. Brian

doi:10.1007/s00239-024-10158-z

Original Article
Published: 14 March 2024

Volume 92, pages 153–168 (2024)
Journal of Molecular Evolution

Abstract

Protein low complexity regions (LCRs) are compositionally biased amino acid sequences, many of which have significant evolutionary impacts on the proteins that contain them. They are mutationally unstable, experiencing higher rates of indels and substitutions than higher complexity regions. LCRs also impact the expression of their proteins, likely through multiple effects along the path from gene transcription, through translation, to eventual protein degradation. It has been observed that proteins which contain LCRs are associated with elevated transcript abundance (TAb), despite having lower protein abundance. We have gathered and integrated human data to investigate the co-evolution of TAb and LCRs through ancestral reconstructions and model inference using an approximate Bayesian computation (ABC) based method. We observe that on short evolutionary timescales TAb evolution is significantly impacted by changes in LCR length, with insertions driving TAb down. In contrast, the observed data are best explained by indel rates in LCRs that are unaffected by shifts in TAb. Our work demonstrates a coupling between LCR and TAb evolution, and the utility of incorporating multiple responses into evolutionary analyses.
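
To make the idea of likelihood-free model inference concrete, the following is a minimal sketch of approximate Bayesian inference by plain rejection sampling. It is not the authors' program: the toy model (a branch's log-TAb shift equals `beta` times the LCR length change plus Gaussian noise), the uniform prior, the slope summary statistic, and all function names and parameter values are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate_tab_shifts(beta, indels, rng, noise_sd=0.5):
    """Toy model (assumption): log-TAb shift = beta * LCR length change + noise."""
    return [beta * d + rng.gauss(0.0, noise_sd) for d in indels]


def slope_summary(indels, shifts):
    """Summary statistic: least-squares slope of TAb shift on LCR length change."""
    mean_x = statistics.fmean(indels)
    mean_y = statistics.fmean(shifts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(indels, shifts))
    sxx = sum((x - mean_x) ** 2 for x in indels)
    return sxy / sxx


def abc_rejection(indels, observed_shifts, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lands near the observed one."""
    observed_stat = slope_summary(indels, observed_shifts)
    accepted = []
    for _ in range(n_draws):
        beta = rng.uniform(-2.0, 2.0)  # hypothetical flat prior on the coupling
        simulated = simulate_tab_shifts(beta, indels, rng)
        if abs(slope_summary(indels, simulated) - observed_stat) <= tolerance:
            accepted.append(beta)
    return accepted


# Synthetic "observed" data: insertions (positive changes) push TAb down.
rng = random.Random(42)
indels = [rng.choice([-2, -1, 1, 2]) for _ in range(200)]
observed = simulate_tab_shifts(-0.3, indels, rng)

posterior = abc_rejection(indels, observed, n_draws=5000, tolerance=0.05, rng=rng)
posterior_mean = statistics.fmean(posterior)
print(f"accepted {len(posterior)} draws, posterior mean beta = {posterior_mean:.3f}")
```

The accepted draws approximate the posterior of `beta` without ever evaluating a likelihood; a negative posterior mean in this toy setting corresponds to insertions lowering TAb. Rejection sampling is used here only because it is the simplest variant to read.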

Code Availability

Custom Perl and R scripts used for quality control of input data and reconstruction of ancestral TAb and LCR states can be found on GitHub at www.github.com/zacherydickson/AncRecon-LCR-TAb. The program written to perform ABC inference of co-evolutionary models can be found on GitHub at www.github.com/zacherydickson/ABC-LCR-TAb.

Acknowledgements

This work was funded by the Natural Sciences and Engineering Research Council of Canada (grants RGPIN-202-05733 to GBG and PGSD3-547476-2020 to ZWD).

Author information

Authors and Affiliations

Department of Biology, McMaster University, Hamilton, ON, Canada
Zachery W. Dickson & G. Brian Golding

Corresponding author

Correspondence to Zachery W. Dickson.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Additional information

Communicated by Minh Bui.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dickson, Z.W., Golding, G.B. Evolution of Transcript Abundance is Influenced by Indels in Protein Low Complexity Regions. J Mol Evol 92, 153–168 (2024). https://doi.org/10.1007/s00239-024-10158-z

Received: 05 October 2023
Accepted: 24 January 2024
Published: 14 March 2024
Issue Date: April 2024
DOI: https://doi.org/10.1007/s00239-024-10158-z
