Genes with human-specific features are primarily involved with brain, immune and metabolic evolution
Humans have adapted to widespread changes in both environmental and lifestyle factors during the past 2 million years. This is evident in overall body alterations such as average height and brain size. Although we can appreciate the uniqueness of our species in many respects, the molecular variations that drive such changes are far from being fully known and explained. Comparative genomics can identify variations in genomic sequence that may provide functional information to better understand species-specific adaptations. A large number of human-specific genomic variations have been reported, but no currently available dataset comprises all of them, a gap that hinders progress in the field.
Here we critically update high-confidence human-specific genomic variants that mostly associate with protein-coding regions and find 856 related genes. The events that create such human specificity are mainly gene duplications, the emergence of novel gene regions, and sequence and structural alterations. Functional analysis of these human-specific genes identifies adaptations of the brain, immune and metabolic systems as highly represented. We further show that many of these genes may be functionally associated with neural activity and with generating the expanded human cortex in dynamic spatial and temporal contexts.
This comprehensive study contributes to the current knowledge by considerably updating the number of human-specific genes following a critical bibliographic survey. Human-specific genes were functionally assessed for the first time to such extent, thus providing unique information. Our results are consistent with environmental changes, such as immune challenges and alterations in diet, as well as neural sophistication, as significant contributors to recent human evolution.
Keywords: Human-specific, Brain, Neuron, Glia, Metabolism, Gene expression
FPKM: Fragments Per Kilobase of transcript per Million mapped reads
IPA: Ingenuity Pathway Analysis
iPSC: induced pluripotent stem cells
MOCA: Matrix of Comparative Anthropogeny
SRA: Sequence Read Archive
Since the human lineage split from that of the chimpanzee around 6 million years ago, the different species of the genus Homo (of which modern humans are now the sole representative) have evolved very rapidly, apparently superseding all other events of evolutionary novelty accumulation. Especially prominent differences are observed in aspects such as height, brain size and changes to our gut and skeleton. Environmental alterations such as diet and immune challenges are thought to have played a major role in human-specific adaptations [2, 3]. Although these phenotypic traits, which have a whole-body effect, are more readily noticeable, one can reasonably assume that humans have also undergone significant change at the microscopic scale. The question of what makes humans unique at a molecular level is now being more broadly addressed as new and advanced laboratory and bioinformatics tools are enabling comparisons between species from genetic and functional perspectives. Genetic differences between species may have distinct mechanisms of origin, such as alterations in the cytogenetic architecture, local chromosomal rearrangements, gene family duplications, single gene modifications, creations or losses, differences in gene transcription levels and/or patterns and alternative splicing. Functional differences can be observed in general behaviour or tissue and organ development and function, and molecularly in circuits, pathways or cellular variation.
Historically, genomic comparisons in this context date back to the 1970s, when studies comparing humans with non-human primates at the karyotype level were first published, revealing a very close organization of chromosome banding and identical euchromatin. Later, at the chromosome level, translocation and fission events were reported as the first detectable differences between humans and their closest relatives and these were the known genomic landmarks for the origin of Anthropoids [5, 6]. Further, using fluorescent in situ hybridization and comparative genomic hybridization arrays, human-specific segmental duplications and genes displaying human-specific copy number variation were identified. The first human-chimpanzee comparative genome map was published in 2002 and further updated in 2005. Also in 2005, the first attempt to comprehensively identify human-specific segmental duplications was published from comparisons with the chimpanzee genome, revealing the extent of such alterations, which account for ~ 2.7% of the genomic differences between these species. For comparison, at the nucleotide level, the human and chimpanzee genomes are estimated to differ by > 30 million single substitutions (or ~ 1.2% of the human genome).
Although functional differences between humans and other primates are evident in major morphological features such as the skeleton (e.g. jaws and hands), hair (humans have thinner hair) and muscle tissue, and global functions including speech and language, changes in the brain have presumably had the most significant impact on the human lineage. The size of the human brain tripled over a period of approximately 2 million years, which overlaps with the estimated period of transition from Australopithecus to Homo. Comparative neuroanatomy has revealed a specific expansion of both the neocortex, with increase in size and neuronal interconnectivity during hominid evolution, and the right side of the human brain compared to chimpanzee. While this expansion is believed to be important to the emergence of human language and other high-order cognitive functions, its genetic basis remains largely unknown.
In the two decades following the first discoveries of genomic differences between humans and other species, numerous studies have identified events that generated human-specific genetic features, such as gene duplications, structural gene alterations and accumulation of significant nucleotide substitutions. Although many authors have worked to identify the genes associated with such human-specific genetic features (hereafter referred to as ‘human-specific genes’), no comprehensive and structured list is currently available and the published literature is redundant (in the sense that the same event or gene is often reported in multiple studies) as well as diverse (in the sense that authors frequently direct their work to different aspects and subsets of genes, thus producing limited results). In summary, current knowledge on the subject is scattered and there is an inherent lack of standardization, given the diversity of studies in which one or more human-specific genes are described. Such limitations hinder the study of human-specific genes at a genomic scale, even though the information is publicly available. Through an extensive bibliographic survey, we gathered, curated and critically assessed the human-specific genes reported in the literature to provide the most comprehensive list to date. We further use this dataset as a platform to explore the general impact of these human-specific genes, assessing their biological impact through functional network and pathway analyses. Finally, we investigate differential gene expression in subpopulations of glial cells and in active versus inactive neurons to examine whether the human-specific genes are involved in specialized neural functions such as cortical development or neuronal activation.
Our results highlight the importance of rapid adaptations in immunological, neurological and metabolomic areas that likely contribute to human evolution and identify human-specific genes that are differentially expressed in the brain.
The generation of a high confidence structured dataset for human-specific genes
Before describing the obtained results, it is necessary to define our object of study. In this report we use the term human-specific gene when referring to a gene impacted by one or more genetic alterations, which seem to have happened after divergence from non-human primates (usually proposed by genomic comparison with chimpanzee) and result in the emergence of human-specific features. The event causing these genetic alterations may change the gene itself or its regulatory region, as we report in detail.
An extensive bibliographic survey (described within the Materials and Methods section) of the literature published since 2000 resulted in a selective list of 54 scientific articles describing thousands of human-specific features. After triage and manual curation of the data we obtained a set of 982 associated gene descriptors. A descriptor was the most accurate term used by the original author(s) to describe the gene of interest (e.g. name, acronym, database entry number, etc.). To standardize notation, for each gene we retrieved information from the human genome version GRCh38. Automatic annotation based on gene descriptor was carried out against the genome and 676 of these genes were directly annotated. Additionally, some gene names contained typos or were slightly modified from their actual name and over 100 other genes had been renamed or restructured since their first annotation. For such genes we carried out manual curation and further annotation when possible. In addition to these individual genes, there are 19 gene families, comprising at least 10 members each, with reported human-specific features that could not be individually attributed to a single gene (Additional file 1 Table S1). Although these gene families were treated separately (to avoid introducing bias given the high number of genes they encompass), when specific genes were described in the literature these were included in the main dataset.
Approximately 130 of the original descriptors could not be associated with any particular gene or gene family, many of these representing genomic fragments as opposed to specific genes and others obsolete or untraceable gene identifiers (IDs). A total of 856 genes (or 871 gene IDs, as some names map to multiple gene IDs, e.g. HAR1A and OR5AL1) with reported human-specific characteristics were curated and annotated and, to the best of our knowledge, comprise the most complete dataset of human-specific genomic features (Additional file 1 Table S1). This number is considerably higher than previously predicted or reported in the literature. For example, the genetics domain of the Matrix of Comparative Anthropogeny (MOCA), which is a repository for available information on human features that differ from great apes, lists only 103 genes known from the literature. Of these, over 70% are represented in our dataset and most of the remainder were either absent from the current version of the human genome or were filtered out during our manual curation process for lacking strong evidence of human specificity at the gene level.
Regarding chromosomal distribution, the 856 genes with human-specific features come from all 22 autosomes and both sex chromosomes. No gene was listed from the mitochondrial genome. When compared proportionally, the per-chromosome distribution of protein-coding genes with human-specific features was relatively similar to that of all human protein-coding genes. A few chromosomes, however, bear a significantly higher number of protein-coding genes with human-specific features. Chromosomes X and 7 seem to be particularly enriched in proteins encoded by genes with human-specific features (Additional file 1 Table S2).
Although this report successfully listed hundreds of genes, it was limited not only by the current availability of studies regarding human-specific genes, but also by poorly defined terminology (the term ‘human-specific’ is itself an object of debate, being used ambiguously to describe different levels of specificity). The field itself is especially limited by technical difficulties, such as the lack of a high-quality genome for archaic hominins, the complexity of our gene architecture, poorly defined non-coding elements, problems faced when defining genomic correspondence between species, the availability of functional data and complications in the subsequent validation of predicted variation.
Functional analyses highlight neuronal, immunological and metabolic features
In possession of the newly generated dataset of genes with human-specific features, we set out to investigate the general biological impact that altering their characteristics may have had on our species. To this end we focused on the functional analysis of each human-specific gene, searching further for overall patterns and relationships. Functional enrichment analysis was performed with FGNet using GeneTerm Linker as the underlying algorithm. The resulting network represents the links and associations between metagroups of genes and enriched terms. In total, 295 genes (~ 35%) were successfully functionally annotated by FGNet and assigned to 25 metagroups, two of which were automatically filtered out based on silhouette width. The comprehensive network of metagroups comprising 225 genes is provided as Additional file 1 Figure S1A and the description of each metagroup as Additional file 1 Table S3. Reported p-values for all metagroups are lower than 0.0006 (thus orders of magnitude lower than the threshold of 0.05) and each metagroup has at least 10 genes. Since the full network is highly complex, we manually selected 12 metagroups that we believe represent interesting functional classes at the systemic level (as opposed to broad molecular or cellular level features). This sub-network clustered into 3 broad functional categories: neural function, immunological function and metabolic function (Fig. 1b and Additional file 1 Figure S1B).
Focusing on pathways as opposed to individual categories or broad clusters of functions, we further analyzed human-specific genes using Ingenuity Pathway Analysis (IPA). In summary, IPA used 729 out of the 845 genes (~ 85%) and supported the importance of neuronal (e.g. nNOS signaling in neurons, Huntington’s disease signaling), immunological (e.g. phagosome formation, phagocytosis in macrophages) and metabolic (e.g. inositol pyrophosphates biosynthesis, adipogenesis pathway, glutamate biosynthesis and degradation) functions (Additional file 1 Figure S2). Taken together, multiple functional analysis tools have converged to generally implicate neuronal, immunological and metabolic systems in human evolution and species-specific characteristics.
Highly expressed human-specific genes are cell-type enriched across different radial glial cell populations
Multiple human-specific genes are differentially expressed upon activation in neurons derived from induced pluripotent stem cells (iPSC)
We set out to survey the scientific literature for genes previously reported as human-specific, knowing that a better understanding of how these genes have mechanistically impacted our evolution would be broadly beneficial for the study of human physiology and disease. The resulting dataset of genes associated with human-specific variants is, to the best of our knowledge, the most detailed, structured and comprehensive to date. Here we highlight higher order functional areas which house a large number of human-specific genes and are likely to be impacted by these genes and their products. Functional assessment of more than 850 human-specific genes emphasized the significance of brain, immune and metabolic adaptations.
In hindsight these findings may not be completely unexpected as infections, dietary alterations (coincident with the discovery of tools and the domestication of fire for cooking) and extraordinary brain expansion have been well documented.
Although humans possess a great degree of plasticity for adaptation, it is likely that the real origin of the adaptations that truly ignited human uniqueness lies in the time of Australopithecus and early Homo species [33, 34]. At this time there was widespread movement, the emergence of tools, an enlargement of the brain and a reduction of the masticatory apparatus relative to an increasing body size. The human brain has evolved rapidly in the past 2 million years (coinciding with the emergence of Homo species) and continues to do so through highly unstable, or rather adaptable, regions in our genome, tissue-specific and function-specific gene expression and reorganized circuitry. Nevertheless, it was very likely a conjunction of factors that enabled human evolution to occur at such a rapid rate. For example, newly formed regions of the human brain such as the prefrontal cortex seem to have far higher energy requirements than more conserved regions. It may be that it was only possible to meet such requirements through modifications to food preparation methods that ultimately resulted in higher energy intake. This example could illustrate a crosstalk between different aspects of human evolution which may have resulted in emergent properties of our species. Significant changes are also observed in local adaptations in recent human populations to environmental and behavioral factors such as diet, infections, altitude and temperature. Emerging pathogens that specifically infect humans have to some degree been impacted by our own innovations, such as agriculture, and continue to shape our immune evolution through host-pathogen interactions.
Despite limitations, our comprehensive study contributes to the current knowledge by considerably updating the number of human-specific genes and further emphasizing the importance of brain, immune and metabolic adaptation in defining our species. It also highlights the potential significance of considering metabolism in conjunction with brain function to fully understand human-specific function and disease.
Materials and methods
Database of genes with human-specific features
We extensively scanned and curated the current literature, searching for articles describing human-specific genetic features and their associated genes. PubMed (www.ncbi.nlm.nih.gov/pubmed) was used as the search platform with the criteria “Search human specific gene Filters: Publication date from 2000/01/01 to 2017/12/31” (further expanded to 2019/12/31), which resulted in over 218,000 publications. From these articles, we selected for terms such as “human-specific”, “duplication”, “de novo” and “evolution”, among other terms of interest. Studies were also assessed regarding their relevance/direct relation to the topic, design of the study, type of publication and whether or not the publication was peer reviewed. An initial subset of 36 highly relevant and non-redundant studies was selected and further expanded (mainly through citation relationships) to 54 references from which data were retrieved. These articles report human-specific genetic features, i.e. gene-related molecular characteristics that have been reported to differ between humans and other species and are likely to impact the associated gene (such as changes to the sequence of a gene promoter, exon losses, gene duplications, etc.). The genetic features are related to specific genes, which are the object of study of the present work. Gene names were listed and duplicated entries were collapsed. Ambiguities were assessed in as much detail as possible to clarify the specific gene the authors referred to. The initial list was mapped back to the GRCh38 version of the human genome and the remaining non-annotated entries mainly represented genes that have been renamed or excluded since their first annotation. The final set of genes was categorized according to the reported human-specific feature and grouped by biotypes as proposed in the Ensembl glossary (publicly available at ensembl.org/Help/Glossary).
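The step of listing gene names and collapsing duplicated entries can be illustrated with a minimal Python sketch. It is not the curation procedure used in this study, which was largely manual. The function name, the study labels and the alias table (including the misspelled variant "HAR1-A") are hypothetical and serve only to show the idea of normalizing descriptors before merging them.

```python
def collapse_descriptors(descriptors, aliases=None):
    """Collapse redundant gene descriptors into unique gene symbols.

    descriptors: iterable of (descriptor, source) pairs gathered from papers.
    aliases: hypothetical mapping from outdated or misspelled names to
             current symbols (keys are matched case-insensitively).
    Returns a dict mapping each symbol to the sorted list of its sources.
    """
    alias_map = {old.upper(): new for old, new in (aliases or {}).items()}
    collapsed = {}
    for descriptor, source in descriptors:
        key = descriptor.strip().upper()      # normalize case and whitespace
        symbol = alias_map.get(key, key)      # resolve known variants
        collapsed.setdefault(symbol, set()).add(source)
    return {symbol: sorted(sources) for symbol, sources in collapsed.items()}


# Illustrative input: the same gene reported by several (hypothetical) studies.
reported = [
    ("HAR1A", "study_A"),
    (" har1a ", "study_B"),
    ("OR5AL1", "study_A"),
    ("HAR1-A", "study_C"),  # hypothetical spelling variant
]
alias_table = {"har1-a": "HAR1A"}  # hypothetical alias table

unique_genes = collapse_descriptors(reported, alias_table)
print(unique_genes)
```

Keeping the list of sources per symbol preserves the link between each collapsed gene and the articles that reported it, which is useful when ambiguities need to be revisited by hand.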
Chromosomal distribution of human-specific protein-coding genes
There are 596 gene IDs associated with protein-coding genes. These were listed according to their chromosome of origin and the proportion of entries per chromosome was calculated. The same was done for the entire set of protein-coding genes annotated in the human genome, for comparison. In parallel, we used the GeneOverlap package (version 1.12.0) for R to infer the significance of overlapping genes. The implementation of Fisher’s exact test used by this package determined the respective p-values (which were not corrected for multiple hypothesis testing).
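The per-chromosome comparison can be sketched as a one-sided Fisher's exact test computed from the hypergeometric tail. This is a Python illustration using only the standard library, not the GeneOverlap code, and the exact test settings used by that package may differ. The function name and the toy counts in the usage lines are assumptions made for the example and are not values from this study.

```python
from math import comb


def fisher_enrichment_p(hits_on_chrom, hits_total, genes_on_chrom, genes_total):
    """One-sided Fisher's exact test for enrichment on one chromosome.

    Returns P(X >= hits_on_chrom), where X is the number of human-specific
    genes falling on the chromosome if hits_total genes were drawn at random,
    without replacement, from genes_total genes of which genes_on_chrom lie
    on that chromosome.
    """
    upper = min(hits_total, genes_on_chrom)
    denominator = comb(genes_total, hits_total)
    tail = sum(
        comb(genes_on_chrom, k) * comb(genes_total - genes_on_chrom, hits_total - k)
        for k in range(hits_on_chrom, upper + 1)
    )
    return tail / denominator


# Toy counts (invented for illustration only).
p_value = fisher_enrichment_p(
    hits_on_chrom=5, hits_total=20, genes_on_chrom=10, genes_total=100
)
proportion_in_subset = 5 / 20        # share of the subset on this chromosome
proportion_in_genome = 10 / 100      # share of all genes on this chromosome
print(proportion_in_subset, proportion_in_genome, p_value)
```

Because one test is run per chromosome, the resulting p-values would normally be adjusted for multiple testing; as stated above, no such correction was applied in this study.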
Functional analysis of genes with human-specific features
Genes were also subjected to functional analyses for the generation of a protein-protein interaction network and functional clustering using the Bioconductor package FGNet version 3.10.0 and GeneTerm Linker for functional enrichment analysis. Metagroups with a silhouette width of less than 0 were excluded and a minimum support of 3 genes was required for cluster validation.
Human protein sequences were obtained from Ensembl GRCh38 and genes with human-specific features had their respective protein sequence(s) retrieved. The retrieved sequences were submitted to AgBase GoAnna version 2.0.0 for GO assignment based on sequence homology. Blastp was used as the underlying algorithm and search parameters were an E-value cutoff of 10e-50, BLOSUM62 as the substitution matrix, a minimum of 80% sequence identity plus 75% coverage and default word size and gap penalty values. GoAnna results were submitted to AgBase GOSlim to obtain high-level summaries of functions for the given dataset and further analyses were restricted to categories of biological processes, which involve pathways and activities of multiple genes. The same protocol was used to assign GOSlim terms to the entire set of human proteins obtained from Ensembl. Results report the percentage of each term both in the set of human-specific proteins and in all human proteins, which were used as background. Against this background of expected abundance, significance for differential representation of functional terms within the human-specific subset of proteins was calculated using Fisher’s exact test (implemented in the GeneOverlap package for R, version 1.12.0) to determine the respective p-values (which were not corrected for multiple hypothesis testing).
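The homology thresholds listed above can be expressed as a simple hit filter. The Python sketch below is illustrative only: GoAnna applies these settings internally, and the record layout, field names and example hits are hypothetical. Two assumptions are made explicit in the code: the cutoff written as 10e-50 is taken literally as that numeric literal, and "coverage" is interpreted as query coverage.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one Blastp hit."""
    query: str
    subject: str
    evalue: float
    percent_identity: float
    query_coverage: float  # assumption: coverage measured on the query, in %


def passes_filter(hit, max_evalue=10e-50, min_identity=80.0, min_coverage=75.0):
    """Apply the E-value, identity and coverage thresholds to one hit.

    max_evalue defaults to the literal 10e-50 as written in the text
    (an assumption about the intended value).
    """
    return (
        hit.evalue <= max_evalue
        and hit.percent_identity >= min_identity
        and hit.query_coverage >= min_coverage
    )


# Invented example hits, one per outcome.
hits = [
    Hit("P1", "S1", 1e-80, 92.0, 88.0),  # passes all three thresholds
    Hit("P1", "S2", 1e-60, 78.5, 90.0),  # identity below 80%
    Hit("P2", "S3", 1e-20, 95.0, 99.0),  # E-value above the cutoff
    Hit("P3", "S4", 1e-55, 85.0, 60.0),  # coverage below 75%
]
kept = [hit for hit in hits if passes_filter(hit)]
print([hit.subject for hit in kept])
```

Requiring all three conditions at once keeps only close homologs, which favours confident GO transfer at the cost of leaving some proteins without an assignment.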
SRA samples of radial glial cells
We retrieved fastq files from the SRA-deposited study SRP094417, which contains 18 runs from samples of prenatal human brain, representing data with replicates from radial glial cells, outer radial glial, intermediate progenitor and mature neuronal cells. Reads are paired-end and were generated from cDNA with the Illumina HiSeq2000 platform in 2016.
RNA-Seq of iPSC
The generation and activation of human iPSC-derived neurons and RNA isolation, preparation and sequencing were described in a previous report by our group.
Both the iPSC and the SRA-retrieved RNA-Seq samples were processed with the same bioinformatics pipeline, which is composed of 5 main steps: (1) pre-trimming quality control with FastQC version 0.11.5 (bioinformatics.babraham.ac.uk/projects/fastqc); (2) read trimming with Trimmomatic version 0.36; (3) post-trimming quality control with FastQC; (4) alignment or pseudoalignment to the reference transcriptome and read counting for transcript abundance estimation with Kallisto version 0.43.0 or STAR-RSEM versions 2.5.2a and 1.2.30 [44, 45, 46]; (5) measurement of differential expression of transcripts with EdgeR version 3.18.1. Each step is generally described below.
FastQC was used for quality control of raw reads and a comparative round of quality control after running Trimmomatic, to ensure overall quality was either maintained or increased after read trimming. The set of default parameters was used for this step. Trimmomatic was employed for cleaning reads from sequencing artifacts. The set of Illumina adapters for the TruSeq paired-end library preparation kit was used as database for adapter trimming. Reads were scanned with a 4-base wide sliding window and trimmed when the average quality per base was lower than 20. Reads shorter than 40 bases after trimming were further excluded. Kallisto and STAR-RSEM were used as different alternatives to generate read counts. Kallisto performs pseudoalignments and read counts within the same command line, while STAR performs alignments to the reference transcriptome and the result is used by RSEM to generate read counts. Kallisto indexing tool was used to generate an index for the FASTA formatted file of the human transcriptome with k-mer size of 31. Reads were counted for transcript quantification using default parameters and a number of bootstrap samples of 100. As an alternative to estimate transcript abundance, STAR was used to perform alignments between the paired-end reads and the reference human transcriptome. An index was built with default parameters and the alignment was performed discarding multimappers and defining parameters for splicing treatment. Resulting bam alignment files were further converted to sam files using Samtools (samtools.sourceforge.net) and sorted with Novosort (novocraft.com/products/novosort), as an intermediate step. RSEM was used to prepare a reference file from the human transcriptome and count reads to provide transcript abundance in the paired-end mode. EdgeR was used to perform statistical analysis and define differentially expressed genes. Kallisto and STAR-RSEM results were compared to evaluate data robustness. 
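The quality trimming described above (a 4-base sliding window, a minimum average quality of 20 and a minimum length of 40 bases) can be illustrated with a simplified Python sketch. It conveys the general idea only and is not Trimmomatic's implementation, whose exact cut position and handling of read ends may differ. The function names and the example quality values are assumptions.

```python
def sliding_window_keep(qualities, window=4, min_avg_quality=20):
    """Return how many leading bases to keep.

    The read is scanned from its 5' end; it is cut at the start of the first
    window whose average quality is below min_avg_quality. Reads shorter
    than one window are kept whole (a simplification of this sketch).
    """
    n = len(qualities)
    for start in range(0, n - window + 1):
        if sum(qualities[start:start + window]) / window < min_avg_quality:
            return start
    return n


def trim_read(sequence, qualities, window=4, min_avg_quality=20, min_length=40):
    """Trim one read; return None if it ends up shorter than min_length."""
    keep = sliding_window_keep(qualities, window, min_avg_quality)
    if keep < min_length:
        return None
    return sequence[:keep], qualities[:keep]


# Invented example: 50 good bases followed by 10 poor ones.
example_qualities = [30] * 50 + [10] * 10
example_sequence = "A" * 60
trimmed = trim_read(example_sequence, example_qualities)
print(len(trimmed[0]))

# A read whose quality drops early falls below the length cutoff.
discarded = trim_read("C" * 60, [30] * 30 + [5] * 30)
print(discarded)
```

Averaging over a window, rather than cutting at the first low-quality base, tolerates isolated poor base calls while still removing a sustained drop in quality.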
In summary, when results were qualitatively similar, parameters were considered well adjusted. After assessing different thresholds, a minimum of 5 reads per transcript before normalization was needed to validate expression. Read counts generated by STAR-RSEM were used for differential expression assessment. Samples were normalized based on sample sizes and data variability was estimated according to a negative binomial dispersion parameter. Differential expression was reported with limits being a p-value of less than 0.001 and false discovery rate of less than 0.01.
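The reporting thresholds above (at least 5 reads per transcript, p-value below 0.001 and false discovery rate below 0.01) can be sketched in Python as follows. This is an illustration, not the EdgeR workflow used in the study. It assumes that the false discovery rate is obtained with the Benjamini-Hochberg procedure and that the read threshold is met when at least one sample reaches 5 reads; neither detail is stated in the text. The transcript records are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def call_differential(transcripts, min_reads=5, max_p=0.001, max_fdr=0.01):
    """Return IDs of transcripts passing the expression and significance cutoffs.

    transcripts: list of dicts with keys 'id', 'counts' (raw counts per
    sample, before normalization) and 'p' (raw p-value).
    Assumption: a transcript counts as expressed if any sample has
    at least min_reads reads.
    """
    expressed = [t for t in transcripts if max(t["counts"]) >= min_reads]
    fdr = benjamini_hochberg([t["p"] for t in expressed])
    return [
        t["id"]
        for t, q in zip(expressed, fdr)
        if t["p"] < max_p and q < max_fdr
    ]


# Invented example records.
example_transcripts = [
    {"id": "tx1", "counts": [120, 98, 15, 9], "p": 1e-6},
    {"id": "tx2", "counts": [3, 2, 4, 1], "p": 1e-8},      # too few reads
    {"id": "tx3", "counts": [50, 61, 47, 55], "p": 0.2},   # not significant
    {"id": "tx4", "counts": [10, 12, 80, 95], "p": 0.0005},
]
significant = call_differential(example_transcripts)
print(significant)
```

Filtering weakly expressed transcripts before the adjustment reduces the number of tests and therefore the size of the multiple-testing penalty.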
Quantitative RT-PCR for differentially expressed human-specific genes in iPSC data
Quantitative RT-PCR was used to validate expression patterns for the subset of genes with human-specific features shown to be differentially expressed in iPSC. cDNA synthesis was performed using the SuperScript III First-Strand Synthesis System (ThermoFisher Scientific, USA). Briefly, 500 ng of total RNA was used and random hexamer primed protocol was followed. Each cDNA sample was amplified in triplicate using SYBR Green PCR Master Mix (ThermoFisher Scientific, USA). Primer pairs used for this analysis are described in Additional file 1 Table S5.
About this supplement
This article has been published as part of BMC Bioinformatics, Volume 20 Supplement 9, 2019: Italian Society of Bioinformatics (BITS): Annual Meeting 2018. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-9.
MB and GB conceived and designed the study and drafted and revised the manuscript. MB carried out the bibliographic survey, data collection and processing, statistical analyses and gene expression studies. SK participated in the bioinformatics analyses and data collection and processing. EAOB carried out experimental validation and contributed with drafting the manuscript. GB coordinated all instances of the project. All authors read and approved the final manuscript.
The authors declare that they received no specific funding, additional to their salary, to perform the study. All laboratory supplies used for experimental validation were provided as basic infrastructure by QIMR Berghofer Medical Research Institute.
This article did not receive sponsorship for publication. Publication costs were covered by the authors.
Ethics approval and consent to participate
Consent for publication
All authors declare that they have no competing interests.
- 1. Tattersall I. Why was human evolution so rapid? In: Human Paleontology and Prehistory. 2017:1:1–9.
- 27. Arbogast T, Iacono G, Chevalier C, Afinowi NO, Houbaert X, van Eede MC, Laliberte C, Birling MC, Linda K, Meziane H, et al. Mouse models of 17q21.31 microdeletion and microduplication syndromes highlight the importance of Kansl1 for cognition. PLoS Genet. 2017;13:e1006886.
- 29. Diepenbroek M, Casadei N, Esmer H, Saido TC, Takano J, Kahle PJ, Nixon RA, Rao MV, Melki R, Pieri L, et al. Overexpression of the calpain-specific inhibitor calpastatin reduces human alpha-Synuclein processing, aggregation and synaptic impairment in [A30P]alphaSyn transgenic mice. Hum Mol Genet. 2014;23:3975–89.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.