Background

The enlarged and globular brain distinguish modern humans not only from extant non-human primates but also from their extinct Homo relatives [1]. The modern human brain is approximately three-fold larger in size than that of our closest extant relative, the chimpanzee, and extinct early hominids. From developmental perspective, the larger brain size in humans is attributed to human-specific fast and prolonged neonatal and postnatal brain growth and patterning [2,3,4]. The evolutionary expansion of human brain size is heterogeneous across brain regions. For instance, the most notable expansion occurred in the neocortex that has been directly related to the emergence of higher cognitive capabilities, such as language, intelligence, and social learning [5]. Most of the expansion in brain size occurred in the last 2–3 million years of human evolution [6, 7]. The genetic basis of divergence between the highly cognitive human brain and supposedly lesser cognitive non-human primate brain has largely remained enigmatic. It has been speculated that complexity of modern human brain arose through changes in protein-coding genes and non-coding regulatory elements [8]. In particular, in relation to the evolutionary expansion of cerebral cortex, genes associated with primary microcephaly (MCPH) have been the focus of immense attention in the past couple of decades [9]. Primary microcephaly is an autosomal congenital disorder, which is characterized by significant reduction in brain mass, particularly the cerebral cortex with no other neuroanatomical abnormalities [10]. The brain size of microcephaly patients is similar to that of very primitive extinct hominids Sahelanthropus tchadensis (brain size; 370 cm3) and Australopithecus afarensis (brain size; 450 cm3). Therefore, the small brain size in microcephaly patients is considered as an example of evolutionary retrogression [7, 11]. Furthermore, microcephaly patients have mild to severe cognitive impairment [10]. MCPH is a heterogeneous disorder and associated with mutations in at least sixteen gene loci (MCPH1-16) [12]. The underlying MCPH genes are named as MCPH1, WDR62, CDK5RAP2, CASC5, ASPM, CENPJ, STIL, CEP135, CEP152, ZNF335, PHC1, CDK6, SASS6, MFSD2A, CIT, and KIF14 [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]. Almost all MCPH-associated genes express predominantly in the fetal brain and regulate neurogenesis and brain size through participation in several important cellular mechanisms including DNA replication, repair, cytokinesis, intracellular transport and autophagy [27]. Therefore, investigation of MCPH genes can reveal molecular mechanisms that are critical in the regulation of brain size and complexity [28, 29]. Indeed, many previously reported molecular evolutionary investigations have implicated MCPH genes (like ASPM and CDK5RAP2) in the expansion and reduction of brain size during primate history [30, 31].

In this study, the roles of MCPH genes in the evolutionary enlargement of human brain size was explored through molecular evolutionary analysis of eight newly identified microcephaly genes; CEP135, ZNF335, PHC1, CDK6, SASS6, MFSD2A, CIT, and KIF14. Sequence data from 48 eutherian species was employed to investigate the signatures of episodic positive selection and diversifying selection during mammalian evolutionary history. Results obtained in the present study provide a broader perspective on the evolutionary link between primary microcephaly genes and human brain size.

Results

Molecular evolutionary analysis of MCPH genes

In order to investigate the genetic basis of the evolutionary expansion of human brain, eight newly identified MCPH genes (CEP135, ZNF335, PHC1, CDK6, SASS6, MFSD2A, CIT, and KIF14) were considered as candidates for evolutionary analysis. We assembled the coding sequence for each of this selected subset of MCPH genes from a broad range of eutherian genomes which include 21 primatomorphans, 9 glirans, 6 carnivores, 2 perissodactylans, 5 cetartiodactylans, 3 chiropterans, and one animal each from Eulipotyphla and Proboscidea (Additional file 1: Figure S1). These 48 eutherian genomes provide enough genomic coverage and evolutionary breadth to perform a thorough molecular evolutionary genetic analysis. For each of these selected subsets of MCPH genes, orthologous sequence information was classified into three data sets. i.e., primates (20 genomes), nonprimate placental mammals (28 genomes), and placental mammals (48 genomes). Each of these selected subsets of datasets is subjected to maximum likelihood based codon substitution models.

Signature of pervasive positive selection in MCPH genes

In order to examine whether the pervasive positive selection has operated on selected subset of MCPH genes (CEP135, ZNF335, PHC1, CDK6, SASS6, MFSD2A, CIT, and KIF14), three pairs of site models (M1 & M2, M7 & M8, and M8a & M8) based on codon substitutions were used. These codon substitutions site models assume that selective pressure ω vary among the amino acid sites but not across the lineage [32]. The signature of positive selection is considered optimal only if two out of the three null models (M1, M7, and M8a) are rejected in the favor of more complex alternative models (M2 and M8). The results of codon substitutions site pair models (M1/M2, M7/M8, and M8a/M8) revealed the consistent signature of pervasive positive selection only for KIF14 gene in primates (Table 1 and Additional file 2: Table S1). Furthermore, significant signatures of positive selection were also detected in non-primate placental mammals and placental mammals for three (ZNF335, SASS6, and CIT) and two MCPH genes (ZNF335, and KIF14) respectively (Table 1 and Additional file 2: Table S1). Positively selected codon sites were identified by using the Bayes Empirical Bayes (BEB) method implemented in M8 codon substitution site model [33] (Table 2).

Table 1 Signature of pervasive positive selection in primary microcephaly genes
Table 2 Positive selected sites detected by using M8 codon substitution site model

Signatures of episodic positive selection

We next examined the imprints of transient or episodic positive selection in different lineages from ancestral primate branch to human terminal branch by employing branch site model (Additional file 2: Table S2). Branch site model allows the ratio of nonsynonymous (dN) to synonymous substitution (dS) rates ω to vary not only across the branches in the phylogeny but also across sites [34]. The significance of the transient imprint of adaptive evolution was determined by likelihood ratio tests (LRTs) of null model (similar to branch-site model except ω2 is restricted to one for the predefined lineage of interest) against branch-site model through log likelihood score for each model. False positive results obtained by branch-site model were eliminated by estimating the false discovery rate q value. Branch site calculations revealed no significant patterns of episodic positive selection in any lineage from primate ancestor to human terminal branch for all MCPH genes analyzed except for KIF14 in homininae ancestral branch (Additional file 2: Table S2).

Divergent selection of MCPH genes across mammals

Functional divergence among orthologous proteins during evolution may not necessarily be reflected as a signature of positive selection at the sequence level. Instead, sequence evolutionary rate variations among different clades of phylogeny can also be taken as a metric of adaptive functional diversity. For each of the eight MCPH coding regions, we estimated the divergent selective pressure between different partitions of mammalian phylogeny (primate vs. nonprimate, simians vs. nonsimians, catarrhini vs. noncatarrhini, hominidae vs. nonhominidae and hominini vs. nonhominini placental mammals) by employing clade model c (CmC) (Table 3). Similar to site and branch-site analysis, false positive data in CmC analysis was eradicated by q value [35]. The CmC analysis revealed that selective pressure on different parts of phylogeny is not significantly divergent for all analyzed microcephaly loci except for SASS6 in the comparison of simians and nonsimians placental mammals (Table 3). In this particular case of SASS6, approximately 34% of sites evolved under divergent selective pressure between simians (ωs = 0.499) and nonsimians placental mammals (ωns = 0.214) (Table 3).

Table 3 Divergent selection constraint parameters estimation and likelihood scores for eight MCPH genes on different partition of mammalian phylogeny

Discussion

Compared to other primates, including great apes, humans have very large brains. Weighing approximately 1,400 g, our brains are roughly three times larger than those of other great apes such as chimpanzees (395 g) and gorillas (490 g) [36]. In particular, during 4–5 million years of human evolution, an enormous increase in brain size has occurred, from a brain mass of 450 g found in Australopithecines to about 1400 g in modern Homo sapiens [37]. Similar to overall increase in brain size/mass, the neocortex of brain (which is involved in higher-order brain functions such as cognition, reasoning and language) has significantly enlarged in the hominin lineage after the divergence of closely related chimpanzee lineage aproximately ~ 6–7 million years ago [7]. This expansion in neocortex size in the hominin lineage might have occurred prior to the split of anatomically modern humans from archaic hominins (Neanderthals and Denisovans) approximately 550,000–750,000 years ago, as both modern humans and Neanderthals exhibit comparable overall brain size [38, 39]. However, cranial lobe size (that demarcate different regions of neocortex) does differ between anatomically modern humans and Neanderthals, most prominently parieto-temporal lobe of the neocortex has increased and orbitofrontal cortex is wider in modern humans as compared to Neanderthals [38, 40]. This indicates that certain neocortical regions have evolved after the split of Neanderthals and anatomically modern humans. The size of the neocortex is predominantly determined by the magnitude of neurogenesis and cytokinesis during fetal development. During neurogenesis, cortical neurons originate from a progenitor cell in the ventricular zone of the developing brain [41]. The progenitor cells undergo successive cycles of proliferative division before entering to neurogenic division and formation of the subventricular zone [42,43,44,45]. Massive expansion in the neocortex size during human evolution has been attributed to extraordinary expansion in neuron number through increased rate of neural progenitor cell division [5]. This expansion can be explained in the context of the “radial unit hypothesis” of cortical development [46]. This hypothesis proposes a general mechanism for a rapid increase in neocortical surface area during evolution through prolonged proliferative symmetric division period. This could yield an increased number of radial columnar units that ultimately generate neurons and consequently expand the neocortical surface area [46]. An alternate hypothesis, “intermediate progenitor model” suggests that during mammalian evolution the expansion in neocortical surface area and folding have occurred due to the escalation in basal progenitor pool size (BP originates from apical radial glia; the main neural progenitor cells in the ventricular zone) and their subsequent expansion in the subventricular zone [47]. Recently, another hypothesis linked the evolutionary expansion of neocortex to increased proliferative capacity of basal radial glia (bRG) [48]. Though the timing of brain development is conserved across mammals, however species specific differences in the duration of cortical neurogenesis (6 days in mice, 60 days in macaque, and 100 days in humans) might have contributed to the differences in neocortex size and complexity during primate history [5, 49]. Species-specific difference in BP pool size abundance and temporal aspects of neocorticogenesis might have some lineage specific genetic underpinnings [48].

The majority of MCPH gene products are known to play an important role in the regulation of duration and mode of cell division and hence identified as prime suspect in evolutionary expansion of mammalian/primate cerebral cortex [43, 50, 51]. Initial evaluation of MCPH genes has revealed that MCPH1, CDK5RAP2, ASPM, and CENPJ evolved adaptively in the human lineage [9, 52, 53]. Further investigations have extended the signatures of positive selection in these MCPH loci (MCPH1, CDK5RAP2, ASPM, and CENPJ) from human to all anthropoid primates [31]. Recently, the inclusion of the non-primate eutherian species in evolutionary genetic studies of MCPH genes revealed the signatures of pervasive positive selection on ASPM, CDK5RAP2, MCPH1, CENPJ, CEP152 and WDR62 throughout the eutherians history [54].

The present study investigates the molecular evolutionary history of eight newly identified MCPH genes by employing sequence data from 48 eutherian genomes and rigorous maximum likelihood based codon substitution models. The codon substitutions site analysis indicated that positive selection occurred during different stages of eutherian evolution in four MCPH genes, ZNF335, SASS6, CIT, and KIF14 (Table 1 and Additional file 2: Table S1). Furthermore, the codon based site models revealed an inconsistent signature of pervasive positive selection in primates for all of the microcephaly genes analyzed in the present study except KIF14 (Table 1 and Additional file 2: Table S1). Positive selection often acts for a brief period of evolutionary time or transiently on protein-coding intervals [34]. During the course of primate history brain enlargement seems to have happened episodically, therefore, transient or episodic positive selection could be of particular relevance to genes involved in brain size expansion [7]. Intriguingly, for the majority of MCPH genes analyzed in the present study, the branch-site model was unable to identify any signatures of episodic positive selection from primate ancestor to human terminal branch (Additional file 2: Table S2). These data obtained through codon substitutions site and branch-site models were corroborated further by employing clade model C (CmC) and conclusively showed that majority of the MCPH genes have maintained their conserved functions throughout the history of placental mammals (Table 3).

Multiple sequence alignments of MCPH proteins analyzed in the present study have shown human specific substitutions for ZNF335, KIF14, PHC1, MFSD2A and CIT (Additional file 2: Table S3). Majority of these human specific substitutions appear to have fixed prior to the divergence of fully modern humans from archaic humans (Neandertals and Denisovans) some 450,000 years ago (Additional file 2: Table S3). Regardless the fact that human-specific adoptive evolution has not been detected in any of the MCPH genes analyzed here, we speculate that human specific replacements in subset of MCPH proteins could potentially be important in modifying their functions during the course of hominin evolution, The evolutionary relevance of these hominin-specific amino acid substitutions in evolution of brain size and complexity needs to be validated through further studies.

Conclusion

The present study demonstrates that the evolutionary enlargement of human brain cannot be attributed solely to the protein-coding sequences of MCPH genes [55]. Instead, complex conditional effects of human specific coding and non-coding regulatory changes in MCPH and other brain related loci might have been instrumental in the evolution of human brain size during the Pliocene–Pleistocene era.

Methods

Sequence acquisition and alignment

Full-length coding and amino acid sequences of eight MCPH genes (CEP135, ZNF335, PHC1, CDK6, SASS6, MFSD2A, CIT, and KIF14) for 48 eutherian species were retrieved from Ensemble and NCBI (National Center for Biotechnology Information) by using bidirectional BLAST hit strategy [56,57,58]. These sequence data include 20 primates and 28 nonprimate eutherian species (Additional file 1: Figure S1 and Additional file 3: Data S1).

The complete genomic sequences of archaic humans, the Neandertals and Denisovans were downloaded from Max Planck Institute for Evolutionary Anthropology website (http://www.eva.mpg.de/) in binary SAM (BAM) file format with 50 × and 30 × sequence coverage, respectively [59, 60]. MCPH gene sequences were retrived by calculating the consensus sequences of respective chromosomes from BAM (binary alignment Map) files of archaic genomes and was compared with human MCPH genes by using UGENE software [61].

The orthologous coding sequences were aligned through PRANK with default parameters for the empirical codon model by using phylogenetic information as a guide [62]. PRANK concedes insertion and deletion as a distinct evolutionary event and introduces indels instead of aligning too divergent sequences and therefore reduces the number of false positive for evolutionary analysis [63, 64].

Analysis of substitution parameters and sites under positive selection

At the protein level, the ratio of nonsynonymous (dN) to synonymous (dS) substitution rates ω measures the selective pressure [32, 65]. The value of ω delineates the strength and direction of natural selection operating on protein-coding sequence; ω = 1, ω < 1, and ω > 1, indicate neutral evolution, negative selection and positive selection respectively. For each of eight MCPH genes within every three datasets (primates, nonprimate eutherians, all eutherians), selective pressure ω was estimated using five different codon substitution site models (M1, M2, M7, M8 and M8a) implemented in CodeML program from PAML4.7 software [32, 66, 67]. In this study, a well-accepted phylogeny of eutherians was used for each gene [68]. In order to estimate the sites under positive selection, we compared the LRT of three pairs of site models. The first pair compare the null model M1 (nearly neutral model that assume the existence of two classes of sites with ω = 1 and ω < 1) and alternative model M2 (positive selection model that assume an additional third class of site with ω > 1) [33, 67]. The other two pairs are null model M7 (beta) and alternative model M8 (beta, and ω2 > 1), and the last pair comparison between null model M8a (beta and ω2 = 1) and alternative model M8 [65, 69]. For this study, positive selection is inferred, if two out of three site pair models significantly reject the null model in the favor of alternative model. In addition, positively selected sites were identified by using Bayes empirical Bayes method implemented in M8 codon substitution site model [33].

Analysis of episodic positive selection

To evaluate whether or not the signature of positive selection was restricted to specific lineage, we used branch site approach implemented in CodeML [34]. This model allows for ω variation not only among prespecified branches but also among sites. The branch site model allows that phylogeny can be divided into prespecified foreground branch (ω2 >  = 1, proportion of sites may be under positive selection) and background branch (where proportion of sites experienced either purifying selection or neutral evolution 0 < ω2 <  = 1). The inference of positive selection was conducted by calculating LRT between this branch site model and null model (it is the same as branch site model but with ω2 = 1 for foreground branch) [34]. We used eutherian phylogenetic tree as an input for the detection of episodic selection at different evolutionary time point from primate ancestral branch to human terminal branch.

Analysis of divergent selection constraint

To determine the pattern of divergence in selective constraint across the phylogeny, we undertook the clade model C (CmC) approach implemented in CodeML [70]. Clade model C assumes that proportions of sites have evolved under divergent selective pressure but not necessarily under positive selection in two or more partition of phylogeny defined as priory. The LRT was conducted by comparing the CmC model with null model M2a-rel [71]. Both alternative CmC and null M2a-rel models have possessed three classes of sites with ω = 1, ω < 1. The third class of site in M2a-rel has single ω ratio (ω2 > 0) that is shared between all clades of phylogeny while CmC third class of site has ω ratios equal to the partitions of the phylogeny and varies among the partition of phylogeny [70, 71].

Statistical analysis

P values were calculated in Chi-square program of PAMLX 1.2 package [72]. The P values of the multiple testing hypothesis were corrected for false discovery rate by using the q value package in R3.5.0 [35, 73]. The q values were calculated for each analysis, ranked in ascending order by applying the bootstrap method for π0 estimation and specified fdr.level = 0.05 in q value package in R3.5.0 [74].