Introduction

Data generated by next generation sequencing (NGS) are often utilized in the emerging fields of precision and personalized medicine. This massively parallel processing chemistry can identify genetic factors that predict treatment and response to therapies. Reference nucleotide sequences are critical for analyzing NGS data, as exemplified by routine clinical diagnosis for HLA antigens [1].

Genotype phasing is the process to determine if genetic variants, often single nucleotide variations, called SNVs, belong to 2 separate chromosomes (in trans). If SNVs are located on the same chromosome (in cis), they constitute a haplotype or an allele. Genotype phasing has often been inferred using computational methods [2, 3], which are prone to certain types of error [4]. These errors are encountered in samples harboring novel variants, low frequency or rare variants, and structural variants [5]. Almost all of these errors can be precluded by laboratory based methods, such as sequencing the genomes of both parents and sibling offspring [6], physical separation of homologous chromosomes in diploid cells [7, 8], sequencing in sperm cells [9], allele specific PCR [10], single DNA molecule dilution [11] and single molecule sequencing chemistry [12, 13]. These laboratory based methods are, however, labor-intensive and time consuming, and thus infrequently applied in clinical diagnostics.

The human genome contains many regions that are known as long contiguous stretches of homozygosity (LCSH) [14,15]. Their presence in unrelated individuals across different populations is attributed to a lower average recombination rate in these regions of the human genome [14].

The human atypical chemokine receptor 1 gene (ACKR1, MIM #613,665) [16] encodes a multi-pass trans-membrane glycoprotein. It is a receptor for pro-inflammatory cytokines [17] and malaria Plasmodium parasites (P. vivax and P. knowlesi) [18]. The ACKR1 glycoprotein carries the five antigens of the Duffy blood group system (Fy) [19, 20]. Recent sequencing studies in the ACKR1 gene have identified approximately 30 haplotypes, albeit at limited lengths of 2.1 kb [21], 2.5 kb [22], 5.2 kb [23], and 5.6 kb [24], respectively. We previously applied these ACKR1 haplotypes to predict the Duffy phenotype in Neanderthal samples [21]. Later, high-coverage genome sequences of Neanderthals were established [25,26,27], which confirmed our prediction [21]. A recent similar comparative study, involving long genomic segments, identified a 50 kb segment in humans, which was inherited from Neanderthals and represented a genetic risk factor in SARS-CoV-2 infection [28].

The 1000 Genomes Project (1000GP) provides a comprehensive database of genotypes and haplotypes in 2,504 unrelated individuals across 26 populations worldwide [29, 30]. As a proof of principle using data from the 1000GP for the ACKR1 gene, we establish a list of 902 haplotypes, some more than 80 kb long. Our scalable approach can be applied to any gene in any population.

Materials and methods

Algorithm workflow

A Python algorithm was developed (Supplementary Information, File S1) to download and analyze genotype data for 80.6 kb region of chromosome 1 (between positions NC_000001.11: 159,203,314–159,283,887) flanked between 2 genes, CADM3 and FCER1A, and encompassing the ACKR1 gene (Fig. 1) for all 2,504 unrelated individuals of the final release 1000GP panel (Phase 3; GRCh38) using Bcftools [31]. The SNV data was downloaded from the dbSNP database [32]. Individual sequences with heterozygosity at a single site or with complete homozygosity were automatically extracted as an unambiguous ACKR1 haplotype that can be considered experimentally confirmed, which applied a time-proven concept [4]. The algorithm outputs three files: a sequence file containing the distinct haplotypes, a meta-data file containing information about the population in which the haplotypes are found, and a folder containing graphical representations of the population distribution of the distinct haplotypes.

Fig. 1
figure 1

Schematic representation of chromosome 1 region analyzed. The ACKR1 gene is bordered by the 2 genes CADM3 in centromeric and FCER1A in telomeric direction at chromosomal position 1q23.2 (a). The structure of the ACKR1 gene (b) comprises 2 exons (closed boxes) and include the coding sequence (CDS,  black) and the 5’- and 3’-untranslated region (UTR, grey). The intron 1 joins the 2 exons (black line). The number of SNVs observed in the for the dbSNP (b) and 1000GP databases (c) are shown for the 5’-UTR, CDS, intron, CDS and 3’-UTR

Validation

Phased haplotype data for 80.6 kb region of chromosome 1 (between positions NC_000001.11: 159,203,314–159,283,887) was manually downloaded for all 2535 individuals of the 1000GP panel (Phase 3; GRCh37) from the 1000 Genomes browser. After removing 31 related individuals, haplotype data from 2504 unrelated individuals was imported into Microsoft Excel. Individuals with heterozygosity at a single site or with complete homozygosity in the 1,626 nucleotide-long ACKR1 gene (NG_011626.3; NC_000001.11:159,204,875–159,206,500) allowed unambiguous assignment of an ACKR1 haplotype. These unambiguous ACKR1 haplotypes were further analyzed individually using Excel spreadsheets, and their sequences were extended in both 5'- and 3'-directions until a heterozygous SNV was encountered. The region between 2 SNVs was catalogued as a haplotype and compared with the previous automated results. The manual analysis was performed and thus a validation dataset generated before the Python algorithm was developed.

Neanderthal genome

The published DNA sequence of the Neanderthal genome (Chagyrskaya, Altai, and Vindija 33.19, http://cdna.eva.mpg.de/neandertal/) [2527] was analyzed (Integrative genomics viewer version 2.3.20) [33] and aligned to the human genome (NCBI Build GRCh38/hg38). We searched for the longest match, if any, with the haplotypes in the 1000GP.

Results

Using the 1000GP database and a Python algorithm, we extracted and catalogued long haplotypes that encompassed the ACKR1 gene and were flanked between 2 SNVs (Fig. 1). Among 2,504 individuals included in the 1000GP database, 1,520 individuals were homozygous for the 1,626 nucleotide-long ACKR1 gene or heterozygous with only 1 SNV. The ACKR1 sequences for these individuals were further analyzed both upstream and downstream of ACKR1 gene until SNVs were encountered. The extension in both directions allowed us to identify long ACKR1 haplotypes that can be considered experimentally verified. The results obtained with our computational approach were validated by a manual method, performed in a blinded fashion.

ACKR1 and SNVs

For the ACKR1 gene (Fig. 1), the dbSNP database [32] lists 549 SNVs spread over 1,626 nucleotides (Fig. 1b). We encountered, however, only 43 SNVs of the ACKR1 gene in the 1000GP database (Fig. 1c) out of the 549 known SNVs.

ACKR1 haplotypes

We identified 31 distinct haplotypes with ≥ 10 observations (Table 1). They ranged in length from 2,383 nucleotides to 17,739 nucleotides. A total of 902 haplotypes were observed, ranging in length from 1,901 nucleotides to 80,584 nucleotides, some extending into the adjacent CADM3 and FCER1A genes (Fig. 2). The combined length of haplotype sequences comprised 19,895,388 nucleotides with a median of 16,014 nucleotides (Quartile 1 – Quartile 3: 7,588 – 30,729 nucleotides; Interquartile Range: 23,141 nucleotides). The length of the haplotypes was inversely proportional to the number of observations (Fig. 3). Most of the common haplotypes (70.13%) were small (< 10 kb; Table 2) and ranged in length between 1,901 to 9,927 nucleotides. The most common ACKR1 allele observed was the Duffy-null allele (FY*02 N.01) followed by FY*A (FY*01) and FY*B (FY*02), respectively (Table 3). For each of these 3 common ACKR1 alleles, we were able to identify reference sequences longer than 80 kb (Table 3).

Table 1 Experimentally confirmed ACKR1 haplotypes with ≥ 10 observations in the 1000GP database*
Fig. 2
figure 2

ACKR1 haplotypes observed in the 1000GP. A total of 902 unique haplotypes were observed and sorted according to their length (bars). All haplotypes comprise the ACKR1 gene (shaded column), their positions in the ACKR1 gene locus (top, see Fig. 1) is indicated. The cumulative number is listed (right). Haplotypes of similar lengths are grouped together (for exact lengths see Supplementary Information, Excel files S1 and S2)

Fig. 3
figure 3

Correlation between length and observations of ACKR1 haplotypes. The length of the ACKR1 haplotypes (x-axis) observed in the 1000GP was inversely proportional to the number of observations (y-axis)

Table 2 ACKR1 haplotypes and length distribution in the 1000GP database among 1520 individuals
Table 3 Length distribution of the 3 common ACKR1 alleles observed in the 1000GP

ACKR1 alleles in the Neanderthal samples

The 3 Neanderthal samples were GATA box negative (-67 T) and represented the ancestral FY*B allele (Table 4). None of the 3 Neanderthal ACKR1 sequences (Chagyrskaya, Altai, and Vindija 33.19) fully matched any of the 902 haplotypes. The 2 haplotypes closest to the Neanderthal sequences had 1 mismatch in the GATA box (Table 4).

Table 4 ACKR1 alleles in the 1000GP and 3 Neanderthal samples

Discussion

In the current study, we identified 902 experimentally confirmed reference haplotypes for the ACKR1 gene, using only publicly available data from the large scale 1000GP study database. Our approach is easily scalable. It can be applied to similar databases, including the UK10K Consortium [34], the African Genome Variation Project [35] and the upcoming All of Us Research Program [36]. For proof of principle, we demonstrated the application using a Python algorithm for one gene. The approach can, however, define reference sequences for any segment of the genome, with genes or without.

We showed that reference sequences can be obtained from databases and verified without ambiguity at lengths exceeding 80 kb. Such reference sequences can be catalogued inexpensively for use in clinical diagnostics. The catalogue comprised the set of the longest unique haplotypes that can be distinguished by the gene’s nucleotide sequence. In clinical diagnostics with molecular-based assays, common and well documented (CWD) [37] reference haplotypes are routinely applied, for example in HLA typing [1]. Exact matching at the haplotype level improves survival following bone marrow transplantation [38] and reduces alloimmunization in chronically transfused patients [39,40,41]. A limited number of common haplotypes represented the majority in the population [42], and identifying haplotypes from databases is an economical way to obtain such reference sequences.

Apart from clinical diagnostics, long-range haplotypes are also useful to understand the influence of environment on positive selection of genes in human populations [43], for association mapping of genes that contribute to disease and other phenotypes [44], for correlating the geographical distribution of haplotypes with endemicity of disease [45], for identifying evolutionarily conserved elements and regulatory elements [46], and for improving the reliability of genotype imputation [47]. Long haplotypes identified by using SNV data from high-density oligonucleotide arrays and the International HapMap Project [48] have been shown to be population dependent and can provide important insights into human evolutionary history [49]. These studies may also identify regions of positive selection with important roles in human health and disease [50].

Next generation sequencing is increasingly used for blood group genes [51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78]. In contrast to HLA [79], most blood group genes lack well documented long reference sequences associated with them [80]. Hence, a comprehensive reference database for blood group genes will facilitate blood group genotyping by NGS. The Erythrogene database [59] contains the complete coding region sequence of many different blood group alleles obtained from the 1000GP. However, it lacks information for sequence variants in the non-coding regions, such as promoter, splice sites and long intronic regions, which can also affect the expression of antigens and helps to ascertain the allele and its coding sequence [81,82,83,84].

A large number of haplotypes were more than 50 kb long with some extending at least to 80.5 kb in length (Fig. 2). Our observations are consistent with previous reports suggesting that most of the human genome is contained in blocks of a few kb to more than 100 kb [85, 86]. However, most of the ACKR1 haplotypes in the 1000GP were small and concentrated closely around the ACKR1 gene. The number of haplotypes decreased as their length increased and extended into the intergenic regions (Fig. 3). This is explained because most of the variants in the dbSNP database resides in the intergenic regions [87].

Our 2 haplotypes HAP897 and HAP899 (Additional file 4:Table S3), observed once each in African populations, were closest to the 3 Neanderthal samples. Both haplotypes carried the GATA box mutation (c.-67C), which all Neanderthal samples lacked (c.-67T). Individuals homozygous for the GATA box mutation (c.-67C) do not express the Duffy glycoprotein on the red cell surface [81] making them resistant to invasion by the malarial parasite P. vivax [88,89,90]. This similarity in alleles, discrepant at nucleotide position c.-67 only, was consistent with the fact that the GATA box mutation (c.-67C) started to spread in Africa only around 30,000 years ago [91], while the 3 Neanderthals Vindija, Altai and Chagyrskaya are 50,000, 120,000 and 50,000 years old, respectively [25,26,27].

In clinical diagnostics for patients, long-range haplotypes harboring novel or rare SNVs can only be detected when the haplotype is sequenced at full-length [92]. Using Sanger sequencing, we have previously characterized the ERMAP [93], ICAM4 [94], and ACKR1 [23] blood group genes at the haplotype level and identified prevalent long-range reference alleles, a time consuming and low throughput approach. We showed in this study how long contiguous stretches of homozygosity (LCSH) can serve to generate a database of long haplotypes, as defined by full length nucleotide sequences rather than the concatenation of known SNVs. Relying on SNV data would miss patients carrying novel or rare alleles with possible clinical relevance, which are not identical to the reference sequences. Features of the 1000GP allowed us to catalogue these extended nucleotide sequences with population specific frequencies. Our approach will enable the positive identification of patients carrying these reference sequences.

We plan to extend this approach to all blood group systems recognized by the International Society of Blood Transfusion (ISBT) [95]. A tool under development will allow researchers the customized online extraction of long haplotypes from databases and genes or genomic regions of their choice. Eventually, our approach can be applied to any region of a chromosome. For now, the 902 ACKR1 alleles identified through our novel approach will be useful as templates for analyzing data from NGS, thus enhancing the reliability of clinical diagnostics.

Web Resources

1000 Genomes browser (https://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/) accessed on Aug 05, 2019. ISBT (https://www.isbtweb.org/fileadmin/user_upload/Table_of_blood_group_systems_v6.0_6th_August_2019.pdf). Max Planck Institute for Evolutionary Anthropology (http://cdna.eva.mpg.de/neandertal/).