Introduction

The human capacity for complex spoken language is unique [1]. Speech endows us with the ability to verbally express our ideas, opinions and feelings, using rapid, precise control of the oral motor structures (larynx, mouth, tongue) to convert our thoughts into streams of sound that can be decoded by others. While vocal communication in other species sometimes exploits simple mappings between sound and meaning, the reach of human language extends far beyond this, most notably through its extraordinary generative power. A finite number of individual units of language can be combined into a limitless number of utterances, giving us the potential to express and comprehend an infinite array of concepts. Moreover, when growing up in a language-rich environment, any normal human infant becomes highly proficient in his or her native language with astonishing ease, and without the need for explicit teaching.

It has been argued for many years that inherited factors must make a key contribution to the acquisition of spoken language [2]. It is only in the past decade or so, with the rise of molecular genetics, that biologists have been able to provide the first robust empirical evidence regarding this issue. To begin investigating the pathways involved, research has focused on the roles of genes, proteins and cellular machinery in the etiology of language impairments, in which people mysteriously fail to develop normal skills despite adequate linguistic input and opportunity [3]. There is a diverse array of these language-related disorders, which usually appear in early childhood and often persist into later life, and they are common enough to have a major impact on modern society. Language problems are frequently observed co-occurring with other developmental disorders, such as autism and epilepsy [4, 5].

Prior to the advent of molecular studies of language disorders, the importance of the genome was already evident from epidemiological analyses. These disorders typically cluster in families [6–9] and monozygotic twins display substantially higher rates of concordance than dizygotic twins [10–12]. Clearly, acquisition of fluent spoken language is also influenced by the environment and its interaction with our genes. However, beyond the obvious effects of impoverished language input (for example, due to hearing problems) there is little known regarding specific environmental risk factors that may disturb linguistic development [13].

Initial clues to the molecular bases of speech and language impairments came from low-density linkage screens [14], followed by targeted association studies of particular chromosome regions and/or focused mutation screens of candidate genes [15]. In addition, studies of chromosomal abnormalities are contributing to our understanding of such disorders, and genome-wide association scans using hundreds of thousands of single nucleotide polymorphisms (SNPs) are underway in several cohorts. However, it is evident that the future of gene discovery in language-related traits, as for many other complex phenotypes, lies in large-scale DNA sequencing of entire human genomes.

Traditional sequencing methods are slow, laborious and expensive; the original human genome sequencing project cost close to US$3 billion and took more than a decade to finish [16]. Dramatic technological advances have transformed the ability to analyze our genetic makeup at single nucleotide resolution and commercialization of these 'next-generation' platforms is growing fast. At the time of writing, a human genome can be entirely sequenced in a matter of days for only a few thousand dollars, and costs continue to fall at a remarkable rate. Nevertheless, excitement over the enormous potential of the new technologies must be tempered by acknowledging the associated analytical challenges. Already, our capacity to rapidly generate large swathes of sequence data from many individuals outstrips our capacity to infer the underlying biology of a trait using such information.

Here, we begin by summarizing approaches previously applied to identify and study the first genes implicated in speech and language disorders (Table 1). We go on to discuss the promise of next-generation sequencing (NGS) for uncovering the key genomic changes that affect our speech and language abilities, not only in relevant disorders, but also in the general population. We argue that it is essential to be able to assess the functional significance of identified variants if we are to understand their biological impact and elucidate their contributions to the human traits of interest. The success of such efforts will depend on synergies between diverse research techniques, including bioinformatics and experimental analyses using model systems, as well as integration of human genome sequences and functional gene network datasets (Figure 1).

Table 1 Neurogenomics of speech and language: summary of key genes discussed in the article
Figure 1

Neurogenomics of speech and language disorders. Next-generation sequencing will yield large datasets of genomic variants with potential relevance for speech and language. Identification of key variants is critically dependent on multidisciplinary studies of function in cell lines, animal models and humans, along with integration of data on neurogenetic networks, as detailed in the text. The image under 'Next-generation sequencing' comes from istockphoto.com (DNA code; File #9614920), the boxshade plot under 'In silico analyses' is a subpart taken from Figure 4 of [17], the left-hand bottom panel of 'Cellular assays' is a subpart taken from Supplementary Figure 5c of [68], the 'Neurogenetic networks' image is taken from Figure 4b of [82] and the zebra finch image is reproduced with permission from Geoffrey Dabb and Canberra Ornithologists Group.

Gene mapping in speech and language disorders

Speech apraxia

The first gene to be clearly implicated in a speech and language disorder was FOXP2. Disruptions of this gene cause a monogenic form of developmental verbal dyspraxia (DVD), also known as childhood apraxia of speech (CAS) [17], characterized by problems with the learning and execution of coordinated movement sequences of the mouth, tongue, lips and soft palate [18, 19]. FOXP2 was discovered through molecular studies of a large three-generational pedigree (the KE family) in which half the members have CAS, accompanied by wide-ranging deficits in both oral and written language, affecting not only production but also comprehension [17]. Linkage mapping in this family identified a region on chromosome 7q31 that co-segregated perfectly with the disorder [20]. An unrelated child with similar speech and language deficits was found to carry a de novo balanced translocation involving the same interval, which directly interrupted the coding region of a novel gene, FOXP2 [17, 21]. Screening of FOXP2 in the KE family revealed that all affected members had inherited a heterozygous point mutation yielding an amino acid substitution at a key residue of the encoded protein [17]. Subsequent studies identified additional etiological FOXP2 variants (nonsense mutations, translocations, deletions) in individuals and families with speech and language problems, typically including CAS as a core feature (reviewed by Fisher and Scharff [22]). Although etiological mutations of FOXP2 are rare [23, 24], the gene provides a valuable molecular window into neurogenetic mechanisms contributing to human spoken language, as detailed elsewhere in this article.

Beyond FOXP2, additional loci that may contribute to CAS have emerged from cases of chromosomal abnormalities, identified using cytogenetic screening and/or comparative genomic hybridization (CGH). One report described a family in which three affected siblings all carry an unbalanced 4q;16q translocation [25]. Another study defined a small region on 12p13.3, containing the ELKS/ERC1 gene, commonly deleted in nine unrelated patients with delayed speech development, most of whom had a formal diagnosis of CAS [26]. Interestingly, a key isoform encoded by ELKS/ERC1 appears to be expressed specifically in the brain, where it binds to RIM proteins. In neurons, RIMs act within the presynaptic active zone, a site that integrates synaptic vesicle exo/endocytosis with intracellular signaling in the nerve terminal [27]. Certain copy number variant (CNV) syndromes with complex variable phenotypes have been linked to increased risk of CAS, including 16p11.2 microdeletions [28, 29] and 7q11.23 microduplications [30]. The rare metabolic disorder, galactosemia, is also associated with elevated incidence of CAS [31].

Specific language impairment

When a child is delayed or impaired in acquiring language, without any obvious physical or neurological cause (cleft lip/palate, intellectual disability (ID), autism, deafness, and so on) he or she is usually diagnosed with specific language impairment (SLI). Since it is defined using exclusionary criteria, SLI encompasses a range of different cognitive and behavioral profiles. The most common forms involve deficits in expressive language, either in isolation or accompanied by receptive problems.

The estimated prevalence of SLI is up to 7% in kindergarten children [32] and it shows familial clustering; twin studies consistently indicate high heritability [10, 11, 33]. In contrast to the rare cases of monogenic CAS discussed above, typical forms of SLI have a complex multifactorial basis [34]. Genome-wide linkage mapping in families with SLI has suggested the existence of multiple risk loci, on chromosomes 16q and 19q [35–38], as well as 2p and 13q [39, 40]. Targeted analysis of 16q identified variants in two genes, ATP2C2 and CMIP, associated with deficits on a non-word repetition task, considered to be an index of impaired phonological short-term memory [15, 41]. The ATP2C2 gene encodes a single subunit integral membrane P-type ATPase that catalyzes the ATP-driven transport of cytosolic calcium and manganese into the Golgi lumen [42]. This cellular role makes it a plausible candidate for SLI susceptibility, since intracellular calcium levels are intimately linked to multiple diverse aspects of neuronal function, ranging from migration to plasticity, while manganese dysregulation has been linked to neurodegenerative phenotypes. The product of CMIP contains pleckstrin homology and leucine-rich repeat domains, and is hypothesized to be an adaptor protein of the actin cytoskeleton, interacting with filamin A and RelA (an NF-kappaB subunit) [43]. Although little is known about CMIP at this stage, it is again a credible candidate for involvement in nervous system function, since cytoskeletal reorganization makes essential contributions to processes like neuronal migration and synapse formation/modification. Other candidate genes (such as CNTNAP2) have been implicated in SLI susceptibility through functional approaches [44], as highlighted elsewhere in this article.

Studies of isolated founder populations may also help pinpoint new genes contributing to language disorders. A notable example is Robinson Crusoe Island - an island of 633 residents lying west of Chile, South America - which was most recently colonized in the late 19th century [45]. Thirty-five percent of the colonizing children satisfy criteria for a diagnosis of SLI, substantially higher than the 4% prevalence rate for mainland Chile [45]. Initial molecular investigations identified several genomic regions of interest (on chromosomes 6, 7, 12, 13 and 17), but no specific risk genes have yet been discovered [46].

SLI has connections with another heritable neurodevelopmental trait, dyslexia, defined as specific significant impairments in reading and/or spelling that are not attributed to intelligence, visual acuity problems or inadequate learning opportunities. Although they do not display overt difficulties with speech or language, people with dyslexia often have subtle underlying deficits with aspects of linguistic processing [47]. Thus, genetic studies of dyslexia may be informative for understanding language pathways. We do not have space to discuss this here, and refer readers to other recent reviews [48, 49].

Stuttering

Stuttering is a neurodevelopmental disorder that disturbs the flow of speech [50]. People who stutter are affected by uncontrollable repetitions and prolongations of syllables, and by involuntary silent pauses while speaking; these difficulties begin in childhood, persisting in about 20% of case referrals [51]. Most people who suffer from persistent stuttering nevertheless display normal linguistic proficiency [52]. Stuttering is thought to have a strong genetic basis [53]. Thus far, most genome-wide investigations of persistent familial stuttering have revealed only suggestive evidence of linkage, with loci distributed across at least ten chromosomes, and little overlap between different studies, indicating that this is a complex multifactorial trait [53–55].

One of the few reports of significant linkage focused on 46 consanguineous families from Pakistan, and highlighted chromosome 12q as a site of interest [56]. Subsequent analyses of the largest family from that study found that most affected relatives carried a coding variant in the 12q23.2 gene GNPTAB, which encodes two subunits of GlcNAc-phosphotransferase (GNPT) [57]. This putative risk variant (E1200K), which altered a conserved residue of the protein, was identified in a number of other Pakistani cases, at higher frequency than in Pakistani controls. GNPT is involved in addition of a mannose 6-phosphate tag to hydrolytic enzymes, allowing them to be targeted to lysosomes. Further screening of GNPTAB, as well as GNPTG and NAGPA, two closely related genes in this metabolic pathway, identified several different coding variants that were only present in cases and not controls [57]. The proposed risk variants are rare even among people who stutter, so it is likely that there are other unknown genes involved in stuttering.

The next generation: uncovering novel risk variants

While it is clear that exciting progress has been made, many of the genetic risk factors underlying speech and language disorders and/or normal linguistic variation remain to be discovered. At the time of writing, no study had yet reported the use of NGS methodologies to specifically investigate language-related traits. However, the advent of NGS has transformed the identification of genetic variants in other important neurodevelopmental phenotypes that co-occur with language deficits, such as ID and autism spectrum disorders (ASDs). Thus far, most such research has focused on sequencing protein-coding regions of the genome (the exome) to detect de novo variants in rare and common forms of these disorders [58–60]. Since de novo mutations can have highly deleterious effects, and such mutations are subject to strong negative selection, it is hypothesized that they might be important explanations of sporadic occurrences of disorder.

Whole-exome sequencing first proved effective in detecting causal de novo variants in rare reproductively lethal neurodevelopmental disorders, such as Kabuki syndrome [61], Bohring-Opitz syndrome [62] and KBG syndrome [63]. The study that pioneered this approach assessed 13 cases of Schinzel-Giedion syndrome, which is characterized by severe ID and typical facial features, and revealed de novo gain-of-function mutations independently occurring in a single gene, SETBP1 [64]. Interestingly, haploinsufficiency of SETBP1 has been identified in some cases of expressive speech impairment [65]. SETBP1 encodes a widely expressed nuclear protein that interacts with SET, an oncogene involved in DNA replication. Recent studies have shown that SET binding protein 1 (SETBP1) also includes three highly conserved AT-hooks (motifs that bind AT-rich DNA in a non-sequence-specific manner) and that it can act as a transcription factor, directly activating targets such as Hoxa9 and Hoxa10 [66]. Functional links between SETBP1 and brain development have yet to be explored.

NGS techniques are also shedding light on the roles of de novo changes in common non-syndromic disorders [59]. A pilot study of whole-exome sequencing in sporadic cases of non-syndromic ID and their parents (parent-child trios) reported nine non-synonymous de novo mutations in different genes in seven of ten probands [67]. Since then, multiple investigations have employed similar approaches to screen trios or quads (trio plus unaffected siblings), including four large-scale whole-exome sequencing efforts across about 1,000 ASD families [68–72] (reviewed by Buxbaum et al. [60]). One conclusion of this work was that the rate of de novo mutations was higher in ASD probands than controls, and it pointed to six genes of particular interest that had recurrent loss-of-function mutations.

A major advantage of focusing on de novo mutations is that it dramatically reduces the search space for potential causative variants; it is estimated that an average of approximately one de novo coding variant arises per genome per generation [59]. Interpretation of NGS data becomes more difficult when the search criteria are broadened to encompass all potential etiological coding variants that a proband carries, and it is even more challenging if one also considers non-exonic variations throughout the entire genome. It is not currently known if the genetic architecture underlying specific speech and language disorders includes a significant role for de novo mutations. Thus, it will be important to develop alternative study designs and analytic strategies (for example, Yu et al. [73] and Lim et al. [74]) for pinpointing causative mutations in NGS data from cases and families with language impairments.
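The reduction in search space offered by the trio design can be illustrated with a short sketch. The Python example below is only a minimal illustration under stated assumptions, not a pipeline used in any of the cited studies: the `Variant` type, the `candidate_de_novo` function and all coordinates are hypothetical, and variant calls are assumed to have been made already for each family member.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """A called variant; a hypothetical minimal representation."""
    chrom: str
    pos: int
    ref: str
    alt: str


def candidate_de_novo(child: Iterable[Variant],
                      mother: Iterable[Variant],
                      father: Iterable[Variant]) -> List[Variant]:
    """Return variants called in the child but in neither parent."""
    inherited = set(mother) | set(father)
    return sorted(set(child) - inherited)


# Toy call sets with made-up coordinates (not real data).
mother_calls = [Variant("3", 100, "A", "G")]
father_calls = [Variant("7", 200, "G", "A")]
child_calls = [
    Variant("3", 100, "A", "G"),  # also seen in the mother
    Variant("7", 200, "G", "A"),  # also seen in the father
    Variant("7", 300, "C", "T"),  # seen in neither parent
]

de_novo = candidate_de_novo(child_calls, mother_calls, father_calls)
print(de_novo)
```

In this toy trio only one of the three child variants survives the filter. Candidates obtained this way would still need technical confirmation, since a variant missed in a parent looks identical to a true de novo event.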

Bridging the gap from genetic variants to biology

In the near future, NGS methods will become standard tools in molecular studies of speech and language disorders. As noted above, gene discovery strategies will need to move beyond the de novo paradigms that have been so successful for ID and ASD. Researchers will be faced with the major challenge of discerning which of the many plausibly causal variants carried by each affected person are physiologically relevant to their speech and/or language impairments. Fortunately, distinct fields combining computational and experimental methods can help ascertain the biological roles of detected variants and ultimately highlight genes important for our unique capacity for spoken language.

When focusing on protein-coding sequences, after initial filtering of identified variants from NGS data, it is possible to use predictive algorithms such as SIFT [75] and PolyPhen2 [76] to flag the most promising mutations for subsequent analyses. Computational methods such as these use known information on protein sequence and evolutionary history to rank variants as benign, possibly damaging or probably damaging. Nonetheless, as cellular pathways harbor some degree of redundancy, not all loss-of-function mutations will contribute to a given disorder and such predictions should be treated with caution. For example, sequencing of FOXP2 in a cohort of CAS/DVD cases revealed a non-synonymous substitution near the N-terminus of the protein (Q17L) in one of the probands [24], a variant that is predicted to be damaging by both SIFT and PolyPhen2. However, follow-up functional experiments of the Q17L substitution using cell models did not find adverse effects on protein characteristics, in contrast to observations for other proband mutations [77]. Moreover, the Q17L proband has an affected sibling who does not carry the substitution. Taken together, these findings make it unlikely that this particular change is etiological. Thus, although bioinformatic approaches help narrow down the list of variants from ongoing high-throughput genetic screens of speech and language phenotypes, experimental analyses in model systems are often crucial for determining causality, as well as offering deeper insights into mechanisms.
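As a rough illustration of how such predictions might be combined to build a shortlist, the Python sketch below flags variants that both tools consider potentially damaging. It is a minimal example under stated assumptions: the `Prediction` type, the `flag_for_follow_up` function, the variant names and all scores are hypothetical, the SIFT cut-off of 0.05 is assumed as a conventional threshold, and the predictions are assumed to have been obtained beforehand from the tools themselves.

```python
from dataclasses import dataclass
from typing import List

# Assumed conventional cut-off: lower SIFT scores are predicted less tolerated.
SIFT_CUTOFF = 0.05
# PolyPhen2 categories treated here as potentially damaging.
POLYPHEN_DAMAGING = {"possibly damaging", "probably damaging"}


@dataclass(frozen=True)
class Prediction:
    """In silico predictions for one missense variant (hypothetical layout)."""
    variant: str
    sift_score: float
    polyphen_label: str  # "benign", "possibly damaging" or "probably damaging"


def flag_for_follow_up(pred: Prediction) -> bool:
    """Flag a variant only when both tools predict a damaging effect."""
    sift_damaging = pred.sift_score <= SIFT_CUTOFF
    polyphen_damaging = pred.polyphen_label in POLYPHEN_DAMAGING
    return sift_damaging and polyphen_damaging


# Made-up predictions for made-up variants (not real data).
predictions: List[Prediction] = [
    Prediction("GENE_A:p.X10Y", 0.01, "probably damaging"),
    Prediction("GENE_B:p.X20Y", 0.40, "benign"),
    Prediction("GENE_C:p.X30Y", 0.02, "benign"),
]

shortlist = [p.variant for p in predictions if flag_for_follow_up(p)]
print(shortlist)
```

Requiring agreement between tools is one possible design choice that trades sensitivity for a shorter list. As the Q17L example shows, a flagged variant is a hypothesis for functional testing, not a demonstration of causality.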

The value of functional approaches is particularly apparent from studies of how FOXP2 mutations lead to speech and language disorder [22]. FOXP2 encodes a forkhead-box transcription factor. Following homo- or hetero-dimerization with other forkhead box P (FOXP) family members [78], the protein binds DNA and represses transcription of its target genes [79]. Human neuron-like cells have been used to assess two different mutant FOXP2 proteins that co-segregate with disorder in CAS/DVD families: pFOXP2.R553H [17] and pFOXP2.R328X [24]. The functional assays demonstrated that these mutations severely disrupt nuclear localization, DNA-binding ability and transactivation potential of the protein [77]. Investigations into downstream targets of FOXP2 highlighted several neuronal pathways that it regulates. Independent high-throughput studies of promoter occupancy in cells and human fetal brain reported that FOXP2 directly regulates genes involved in neurite outgrowth, synaptic plasticity and axon guidance [80, 81]. More recently, following genome-wide analyses of neural targets in vivo in mouse models, it has been shown that Foxp2 mutations can alter neurite outgrowth and branching in primary neurons [82].

A subset of FOXP2 targets are implicated in neurodevelopmental disorders that often co-occur with language deficits, such as the sushi repeat-containing protein X-linked 2 (SRPX2)-plasminogen activator receptor, urokinase-type (uPAR) complex in epilepsy and speech apraxia [83], DISC1 in schizophrenia [84] and MET in ASD [85]. The most rigorously studied FOXP2 target is CNTNAP2, encoding contactin-associated protein-like 2 (CASPR2), a transmembrane scaffolding protein that clusters K+ channels in myelinated axons [86]. CASPR2 is a member of the neurexin superfamily and, in addition to its role in mature neurons, it has been implicated in neuronal migration, dendritic arborization and spine development [87]. Homozygous loss-of-function CNTNAP2 mutations cause infant-onset epilepsy, learning deficits and language regression [88]. FOXP2 binds directly within the first intron of CNTNAP2 and is able to downregulate its expression [44]. Association analyses of quantitative phenotype data in 184 small SLI families identified a cluster of common intronic SNPs in CNTNAP2 that correlated significantly with reduced performance on linguistic tests, most strongly for the non-word repetition endophenotype [44]. The identity of the precise functional variant(s) in this region is not yet determined, but it is hypothesized that they affect the way that CNTNAP2 is regulated. Rare and common CNTNAP2 variants have also been implicated independently in ASDs [89–91], consistent with prior hypotheses that SLI and ASDs may involve some degree of shared genetic etiology. Beyond SLI, ASD and epilepsy, contributions of CNTNAP2 have been suggested for a range of other neurodevelopmental phenotypes, including schizophrenia [92], selective mutism [93] and Tourette syndrome [94].

A recent study of sporadic ASD demonstrates how the combination of NGS screens with functional experiments can shed light on language-related gene networks [68]. Whole-exome sequencing of parent-child trios identified a de novo frameshift mutation in an ASD proband, introducing a premature stop codon in FOXP1 [68]. The child was severely affected, with regression and language delays. FOXP1 is the most closely related gene to FOXP2 in the human genome and they can act synergistically to regulate shared targets in regions of co-expression [78, 95, 96]. Remarkably, the proband with the FOXP1 mutation also carried an extremely rare CNTNAP2 missense variant, inherited from his unaffected mother [68]. In cell-based functional analyses, the aberrant FOXP1 protein mislocalized to the cytoplasm and lost its transcriptional repressor properties; expression of the mutant FOXP1 isoform in cells elevated CNTNAP2 levels, unlike wild-type FOXP1 [68]. These data were consistent with a two-hit mechanism in which abnormal FOXP1 results in higher CNTNAP2 levels, amplifying any potentially deleterious effects of the missense CNTNAP2 variant of the proband [68]. Similar findings regarding multiple-hit mechanisms have emerged from independent studies of ASDs and other neurodevelopmental syndromes (for example, Leblond et al. [97]), suggesting that this may be an important model for genetic etiology of such disorders [98].

Previous screening of 49 children diagnosed with CAS/DVD did not detect any obviously etiological FOXP1 mutations [99]. However, studies of patients with mild to moderate ID and language impairment have detected rare de novo deletions and a nonsense FOXP1 variant [100, 101]. High-throughput sequencing of balanced chromosomal abnormalities in neurodevelopmental disorders identified disruptions at the FOXP1 locus [102].

There has been little reported to date on functional analyses of other genes (such as ATP2C2 and CMIP) associated with speech and language disorders, in part because no protein-coding variants have been pinpointed. As noted above, some cases of persistent stuttering carry coding variants in genes (GNPTAB, GNPTG and NAGPA) involved in lysosomal targeting of hydrolase enzymes. Interestingly, loss-of-function mutations of this pathway cause mucolipidosis disorders, which involve severe abnormalities affecting multiple systems, including skeletal, respiratory and cardiovascular tissues. Cell-based assays were recently used to analyze variants of the mannose 6-phosphate-uncovering enzyme found in people who stutter; these variants were reported to cause incorrect protein folding, decreased enzymatic activity and degradation by the proteasome [103].

It is not always feasible to carry out experimental assessments of putative risk variants. The nature of assessment is highly dependent on the type of gene product; it is difficult to test protein function if there are no known measurable properties. In contrast to NGS technologies, functional experiments typically remain high cost, time-consuming and laborious, and are less amenable for upscaling. Nevertheless, as NGS reveals additional variants potentially implicated in language impairments and other neurodevelopmental traits, we will inevitably need access to high-throughput techniques for simultaneous mutation testing to define disease-causing variants across the genome [104]. Indeed, several multiplex approaches for characterizing the functional effects of genetic variation in proteins [105], mammalian regulatory elements [106, 107] and RNA [108] have recently been developed. More and more emphasis will be placed on possible functional variants that lie outside protein-coding regions. Various efforts are underway to facilitate this transition, most notably the ENCODE project, which aims to characterize all functional elements at a genome-wide scale, including non-coding RNA and cis-regulatory elements [109]. RegulomeDB is of particular interest, as it combines data from the ENCODE project, GEO and published literature into a single, integrated database that can be used to query the functional significance of variants in both coding and non-coding regions of the genome [110].

Integrating data networks

Beyond establishing causality, functional characterization of candidate risk variants in model organisms may also help highlight pathways implicated in the origins and bases of language. For example, studies of FOXP2 across different species (mouse, bird, human) have given us initial clues into neurogenetic networks facilitating human spoken language [22, 111]. FOXP2 expression is enriched in several brain areas, including the basal ganglia, deep cortical layers, thalamus and cerebellum [112], some of which display subtle structural and functional abnormalities in people carrying FOXP2 mutations [19, 112–114]. From an evolutionary perspective, this is a highly conserved gene with regard to both the amino acid sequence of the encoded protein and the neural sites where it is expressed [95, 115]. These data suggest that ancestral forms of FOXP2 were involved in important aspects of brain development long before the emergence of spoken language. There is evidence that the functions of the gene may have been modified during human evolution ([116]; also see below), but it remains clear that its roles in the human brain are built on evolutionarily ancient pathways [1].

Extensive characterization of rodent models carrying etiological Foxp2 variants indicates roles in synaptic plasticity, motor-skill learning, and processing and integration of auditory information [117–120]. When mice are heterozygous for the mutation that causes speech problems in the human KE family, they display decreased synaptic plasticity in corticostriatal circuits and motor-skill learning deficits [117]. These mouse findings are intriguing given that affected humans have problems learning to master the rapid coordinated orofacial movements underlying speech [121]. In vivo electrophysiology recordings in awake-behaving mice revealed more about the impacts of Foxp2 on corticostriatal circuitry; mice heterozygous for the KE mutation displayed higher basal striatal activity than wild-type controls, and medium spiny neurons showed aberrant negative modulation of their firing rates during motor-skill learning [118]. Separate studies used mouse models to explore whether impairments in auditory processing and auditory-motor integration might also be relevant to FOXP2-related disorders [119, 120]. Mice carrying the KE mutation were reported to have altered auditory brainstem responses to sound, although this finding was not replicated in mice carrying a different mutation associated with speech/language problems in another family [119]. Mice carrying either etiological mutation have deficits in learning to associate auditory stimuli with motor outputs [120].

Songbirds carry their own version of FOXP2, referred to as FoxP2, and it appears to make important contributions to the functions of a striatal nucleus called Area X [122]. In zebra finches, Area X is critical for auditory-guided vocal learning, a process in which young male birds learn their song by imitating an adult tutor. Vocal learning is also a key component of human speech acquisition. FoxP2 mRNA levels in Area X are enriched in young birds during the critical song-learning period [123] and show rapid downregulation when adult birds practice their songs outside the context of courtship [124–126]. Furthermore, selective knockdown of FoxP2 in Area X disrupts the song-learning process [127] and alters dendritic spine density in this region [128].

Functional studies of genes implicated in language-related disorders may also give us entry points into mechanisms involved in language function in the general population. As discussed above, variants of CNTNAP2, a direct target of FOXP2, were associated with linguistic deficits in clinically distinct neurodevelopmental disorders [44, 88, 89, 129–131]. Subsequent studies suggested that CNTNAP2 may contribute to language processing in healthy individuals [132–134]. The cluster of CNTNAP2 SNPs that is associated with language phenotypes in SLI and ASDs has also been reported to correlate with assessments of early language development in general population samples [132]. Neuroimaging genetics studies of common CNTNAP2 SNPs in healthy samples have proposed associations with functional brain measures related to language [133, 134] and with altered structural connectivity patterns [135]. However, imaging genetics of language is a field still in its infancy; reports thus far have involved small sample sizes with limited power, as well as a substantial multiple-testing burden, and results of different studies have been largely inconsistent. Additional analyses are required to elucidate how FOXP2, CNTNAP2 and other language-related genes influence brain circuits at multiple levels of description - molecular, cellular, structural and functional.

Insights from ancient genomes

The reach of NGS technologies extends well beyond living species. These innovations have allowed molecular anthropologists to reconstruct large portions of nuclear genomes from extinct hominins that co-existed with our ancestors, such as Neanderthals [136] and Denisovans [137]. By comparing modern human sequences to ancient hominin genomes, as well as to our closest extant relatives, chimpanzees, it is possible to identify molecular variants that arose during human evolution, and roughly date them with regard to branches of the primate phylogenetic tree. As for other NGS projects, our capacity to generate large amounts of sequence data exceeds our ability to interpret it. So although scientists have successfully catalogued many of the DNA changes that occurred on our lineage, an extraordinary feat in itself, it is still a major challenge to determine which of these evolutionary events were relevant for the emergence of traits such as speech and language acquisition [1]. Here, success may depend on the integration of findings from evolutionary genomics with data from molecular studies of language-related disorders.

The best illustration of this approach comes again from work on the FOXP2 gene, which was targeted for evolutionary investigations, based on its prior link to a severe speech and language disorder. Comparative primate genomics suggests that FOXP2 probably underwent at least two interesting evolutionary events on the lineage that led to modern humans. After the human lineage split from that of the chimpanzee (several million years ago), there were changes in the coding region of the locus that yielded two amino acid substitutions in the encoded protein [138]. Although these are minor changes outside the known functional domains, when such substitutions are inserted into the endogenous Foxp2 gene of a mouse, they have subtle detectable effects on brain structure and function, including altered connectivity and plasticity of corticostriatal circuits [116]. NGS approaches indicate that these amino acid substitutions are shared by Neanderthals [136] and Denisovans [137]. (It is worth emphasizing here that the status of a single gene is not enough to determine whether or not a species can speak.) Researchers went on to identify a number of non-coding variants in intronic regions of FOXP2 that had occurred more recently on the human lineage, after the split from Neanderthals and Denisovans a few hundred thousand years ago [139]. One of these changes lies in a region that underwent a recent selective sweep, and alters a putative binding site for the POU class 3 homeobox 2 (POU3F2) transcription factor, such that it may have affected regulation of FOXP2 expression; cell-based analyses are consistent with this hypothesis [139]. Thus, as with sequence-based analyses of language-related disorders, evaluating the biological significance of interesting variants from ancient genomics requires functional studies using model systems.
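The comparative logic used to assign a substitution to the human lineage can be sketched with a simple parsimony rule: a position is counted when the human residue differs from the chimpanzee residue while the chimpanzee matches a more distant outgroup. The Python example below is a toy illustration only. The function name `human_lineage_substitutions` and the three short sequences are hypothetical and are not real FOXP2 sequences, and the inputs are assumed to be pre-aligned and of equal length.

```python
from typing import List, Tuple


def human_lineage_substitutions(human: str, chimp: str,
                                outgroup: str) -> List[Tuple[int, str, str]]:
    """Return (position, ancestral, derived) for changes assigned to the human lineage.

    A position is reported when human differs from chimpanzee and
    chimpanzee agrees with the outgroup (simple parsimony).
    Positions are 1-based.
    """
    if not (len(human) == len(chimp) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for i, (h, c, o) in enumerate(zip(human, chimp, outgroup), start=1):
        if h != c and c == o:
            changes.append((i, c, h))
    return changes


# Made-up aligned peptide fragments (not real protein sequences).
human_seq = "MKTNLSAQ"
chimp_seq = "MKTTLNAQ"
outgroup_seq = "MKTTLNAE"

derived = human_lineage_substitutions(human_seq, chimp_seq, outgroup_seq)
print(derived)
```

Position 8 in the toy alignment is not reported because human and chimpanzee agree there, so that difference is attributed to the outgroup branch. Adding archaic hominin sequences as further rows would allow the same rule to separate older changes from those that arose after the split from Neanderthals and Denisovans.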

Conclusion

The advent of whole genome NGS means that data generation will no longer be the limiting factor in understanding how genetic factors contribute to mechanisms underlying complex neurodevelopmental traits. Coupling NGS approaches to functional validation in model systems will facilitate network mapping and pathway investigation in speech and language disorders, and ultimately in normal linguistic development.