Selection pressure on human STR loci and its relevance in repeat expansion disease
Short tandem repeats (STRs) consist of tandemly repeated units of one to several base pairs. Because STRs are highly mutable owing to strand slippage during DNA synthesis, the number of repeat units changes rapidly over evolutionary time, and the range of repeat-number variation is shaped directly by selection pressure. However, two questions remain: why are STRs that cause repeat expansion diseases maintained in the human population, and why are these diseases limited to neurodegenerative diseases? By evaluating genome-wide selection pressure on STRs using a database we constructed, we identified two different patterns in the relationship between repeat-number polymorphisms at the DNA level and at the amino-acid level, although both patterns are evolutionary consequences of avoiding the formation of harmful long STRs. First, poly-proline (poly-P) repeats are encoded by a mixture of degenerate codons. Second, long poly-glutamine (poly-Q) repeats are favored at the protein level; at the DNA level, however, STRs encoding long poly-Qs are frequently interrupted by synonymous SNPs. Furthermore, apoptosis and neurodevelopment were significantly enriched biological processes specifically among genes encoding poly-Qs with repeat polymorphism. This suggests the existence of a specific molecular function for polymorphic and/or long poly-Q stretches. Given that the poly-Qs causing expansion diseases were longer than other poly-Qs, even in healthy subjects, our results indicate that the evolutionary benefits of long and/or polymorphic poly-Q stretches outweigh the risk that long CAG repeats predispose to pathological hyper-expansion. Molecular pathways in neurodevelopment that require long and polymorphic poly-Q stretches may provide a clue to why poly-Q expansion diseases are limited to neurodegenerative diseases.
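The difference between repeat length at the DNA level and at the amino-acid level can be illustrated with a toy calculation. The following is a minimal sketch, not the analysis pipeline used in this study: the sequences are invented, the function names are hypothetical, only in-frame codon repeats are considered, and the reading frame is assumed to start at the first base.

```python
# Only the codons needed for this illustration (standard genetic code).
CODON_TO_AA = {
    "CAG": "Q", "CAA": "Q",                          # glutamine
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",  # proline
}


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping a partial trailing codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def longest_amino_acid_run(cds, amino_acid):
    """Longest run of consecutive codons translating to `amino_acid` (protein-level repeat)."""
    best = current = 0
    for codon in split_codons(cds):
        if CODON_TO_AA.get(codon) == amino_acid:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def longest_pure_codon_run(cds, amino_acid):
    """Longest run of one identical codon for `amino_acid` (DNA-level repeat)."""
    best = current = 0
    previous = None
    for codon in split_codons(cds):
        if CODON_TO_AA.get(codon) == amino_acid:
            # A synonymous codon change restarts the DNA-level run.
            current = current + 1 if codon == previous else 1
            best = max(best, current)
            previous = codon
        else:
            current = 0
            previous = None
    return best


# Invented example: a poly-Q tract whose CAG repeat is interrupted by one CAA codon.
toy_poly_q = "ATG" + "CAG" * 4 + "CAA" + "CAG" * 3 + "TAA"
# Invented example: a poly-P tract encoded by a mixture of degenerate codons.
toy_poly_p = "ATG" + "CCTCCCCCACCGCCTCCC" + "TAA"

print(longest_amino_acid_run(toy_poly_q, "Q"), longest_pure_codon_run(toy_poly_q, "Q"))  # 8 4
print(longest_amino_acid_run(toy_poly_p, "P"), longest_pure_codon_run(toy_poly_p, "P"))  # 6 1
```

In both toy sequences the amino-acid repeat is longer than the longest uninterrupted codon repeat, which is the kind of discrepancy the two patterns above describe: synonymous substitutions keep the protein-level repeat intact while shortening the pure repeat at the DNA level.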
Keywords: STR polymorphism; Single amino-acid repeat; Human evolution; Triplet-repeat expansion disease; Database for human polymorphism (VarySysDB)
Short tandem repeat
Simple amino acids repeat
Coding sequence region
Coding trinucleotide short tandem repeat
The international nucleotide sequence databases collaboration
Human-gene diversity of life-style related diseases/gene diversity database system
Annotation data set for All Human Genes version 2
H-InvDB gene cluster defined by mapping of transcripts on genome sequence
Percentage of G or C at the third codon position
We are grateful to Hidetoshi Inoko for support in using the H-GOLD/GDBS data; to Yasuyuki Fujii, Katsuhiko Murakami, Yoshiharu Sato and Jun-ichi Takeda for providing gene structure and annotation data; to Ryuzo Matsumoto and Yosuke Hayakawa for useful suggestions on computer programming; and to other former members of the H-Invitational 2 consortium, the Genome Information Integration Project (GIIP), and the Integrated Database and Systems Biology Team of BIRC, AIST, for their helpful support. This research was financially supported by the Ministry of Economy, Trade and Industry of Japan (METI) and the Japan Biological Informatics Consortium (JBIC). This work was also partly supported by Grants-in-Aid for Scientific Research (C) to MKS (JSPS Grant Numbers 24510271 and 21510205) and by the Saito Gratitude Foundation to MKS.
Compliance with ethical standards
Conflict of interest
All authors declare no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.
This article does not contain any studies with human participants.
Updated data on STRs and SARs within known human transcriptome sequences will be continuously provided in the VarySysDB database (http://h-invitational.jp/varygene/home.htm). The original data on STRs and SARs in human exonic regions are available at the website of the first author, MKS (http://www.fujita-hu.ac.jp/~mshimada/sub/STR_SAR_Data.html), and through a web-based data sharing system provided by researchmap (http://researchmap.jp/shimada-mk/%E8%B3%87%E6%96%99%E5%85%AC%E9%96%8B/).