Introduction

Genetics entered an exciting era of discovery with the advent of next-generation sequencing (NGS) technology, improved bioinformatic techniques and increased international collaboration to include underrepresented diversity in genetic studies. Collaborative initiatives, such as the Human Heredity and Health in Africa (H3Africa) Consortium and the African Genome Variation Project (AGVP), are rapidly obtaining and investigating valuable genetic data that were previously unattainable (Gurdasani et al. 2015; Zheng-Bradley and Flicek 2017; Fortes-Lima et al. 2017; Mulder et al. 2018). These studies have enabled novel genetic investigations; however, large sample sizes and high-quality whole-genome data are still lacking for most populations, particularly those from southern Africa. Association studies often yield no significant single nucleotide polymorphisms (SNPs) associated with multifactorial diseases and fail to detect associations with rare genetic variants [minor allele frequency (MAF) of < 1%] in southern African populations due to a lack of statistical power. Furthermore, the vast majority of association studies continue to focus on populations of European ancestry and simple admixture scenarios (Wojcik et al. 2019).

Genetic regions associated with multifactorial diseases could be identified by investigating the allelic architecture of highly complex admixed individuals, since they received haplotypes from diverse continental populations previously exposed to various environments and pathogens (Dias-Alves et al. 2018; Mazandu et al. 2019). If such gene regions could be successfully identified, this would aid the advancement of drug therapies, the implementation of personalized medicine and vaccine development in underdeveloped countries such as South Africa. However, individuals from South Africa can be up to five-way admixed, arguably the most complex global example of admixture (Daya et al. 2013; Uren et al. 2017b). The history of South Africa contributed to this observed population substructure, which includes ancestral contributions predominantly from the indigenous hunter-gatherers of southern Africa and Bantu-speaking Africans, as well as from European-descent groups, South East Asians and East Asians (de Wit et al. 2010; Chimusa et al. 2014; Daya et al. 2014b; Uren et al. 2016).

The frequency of disease risk alleles differs between populations (Secolin et al. 2019). These disparities are exploited to map disease-causing variants of multifactorial diseases in admixed genomes, an approach better known as admixture mapping (Shriner 2013). However, additional modifications are required to conduct admixture mapping studies for individuals from southern Africa, since most computational tools are designed to infer local ancestry for two- or three-way admixed populations only (Chimusa et al. 2018; Schurz et al. 2019; Mazandu et al. 2019). In addition, statistical methods assume homogeneity and may not be applicable to Africans with more complex haplotype structures and mosaic patterns present on chromosomes generated by recent admixture events across the African continent (Fan et al. 2019). The continuous increase in non-communicable diseases in Africa and the persistent threat of emerging and re-emerging infectious diseases could in part be countered by the development of comprehensive research pipelines for disease mapping in highly admixed individuals from Africa. This review, therefore, aims to summarise the current limitations of and prospective avenues for population genomics research in relation to disease mapping in southern African populations.

Admixture mapping

The conventional method of genome-wide association studies (GWAS), which is a hypothesis-free method of detecting SNPs associated with a certain disease, is not sufficient for detecting SNPs associated with a disease in a population with admixed genomes (Visscher et al. 2017). In contrast, admixture mapping (study design summarised in Fig. 1) acknowledges biogeographic ancestry associated with specific phenotypes by including ancestry proportions (globally or locally inferred from dense genotypic data) as covariates in epidemiological studies (Thornton and Bermejo 2014; Duan et al. 2018). Instead of relying on the association between a genotype and a phenotype, as in GWAS, it considers the associations of the number of haplotypes (0, 1 or 2) of a specific ancestry with the phenotype of interest (Hoggart et al. 2004; Duan et al. 2018). The study design has recently been successfully used in a variety of complex diseases, e.g. hypertension (Zhu et al. 2005), prostate cancer in African Americans (Freedman et al. 2006), asthma-associated variants in Latinos (Gignoux et al. 2019), obstructive sleep apnea in Hispanic/Latino Americans (Wang et al. 2019), tuberculosis (TB) associated variants in South Africans (Daya et al. 2014a) and multiple sclerosis in 3692 African Americans, 3777 Hispanics and 4915 Asian Americans (Chi et al. 2019).
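The quantity tested in admixture mapping, the number of haplotypes (0, 1 or 2) of a given ancestry at a locus, can be illustrated with a minimal Python sketch. The ancestry labels, the toy cohort and the function names below are hypothetical and serve illustration only; in practice the labels would come from a local ancestry inference tool, and the association would be tested with a regression model that includes global ancestry as a covariate rather than with the simple difference in means shown here.

```python
from statistics import mean


def ancestry_dosage(hap_a, hap_b, ancestry):
    """Count how many of an individual's two haplotypes (0, 1 or 2)
    carry the given ancestry at one locus."""
    return int(hap_a == ancestry) + int(hap_b == ancestry)


def dosage_difference(individuals, ancestry):
    """Mean ancestry dosage in cases minus mean dosage in controls.

    `individuals` is a list of (hap_a, hap_b, is_case) tuples for one locus.
    This is only a descriptive summary, not a formal association test.
    """
    cases = [ancestry_dosage(a, b, ancestry) for a, b, c in individuals if c]
    controls = [ancestry_dosage(a, b, ancestry) for a, b, c in individuals if not c]
    return mean(cases) - mean(controls)


# Toy cohort at a single locus (invented data, hypothetical labels).
cohort = [
    ("KhoeSan", "Bantu", True),
    ("KhoeSan", "KhoeSan", True),
    ("European", "KhoeSan", True),
    ("Bantu", "Bantu", False),
    ("European", "Bantu", False),
    ("KhoeSan", "European", False),
]
diff = dosage_difference(cohort, "KhoeSan")
```

A difference clearly above or below zero would mark the locus as a candidate for formal testing; the sketch ignores uncertainty in the ancestry calls.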

Fig. 1

Flow diagram indicating resources and software used for admixture mapping. Black blocks indicate the analysis steps, orange blocks represent the software used to conduct the relevant step, blue blocks indicate the resource required for the step and green blocks indicate software or approaches used for visualization. The red stars indicate missing or inadequate resources for executing the analysis step in South African populations

The successful implementation of admixture mapping relies on suitable proxy haplotype reference panels that represent each ancestral group. The reference panels are used to infer the number of haplotypes that originated from a specific ancestral source population at a given locus, a procedure better known as local ancestry inference (LAI) (Paşaniuc et al. 2009; Dias-Alves et al. 2018). However, limited haplotype reference panels are available for southern African populations. Furthermore, admixture between ethnic groups creates long-range linkage disequilibrium (LD) between variants from different ethnic groups with different allelic frequencies. This subsequently results in differing ancestral haplotype LD blocks, which has implications for the use of tagging SNPs when working with admixed individuals (Skotte et al. 2019). Tagging SNPs are normally genotyped whilst conducting association studies and act as a proxy for the underlying common disease-causing variants. Association signals depend on how strongly tagging SNPs correlate with the presence of the disease-causing variant located in a haplotype LD block (Hellwege et al. 2017). The tagging SNPs could be in the same ancestral LD block but, due to admixture-induced LD, be located in a different haplotype LD block and no longer tag the original LD block containing the causal variant. Therefore, no association signal would be detected (Skotte et al. 2019). This is predominantly evident when genetic heterogeneity is present in the admixed population (Duan et al. 2018). Genetic heterogeneity within a population in this context refers to the individuals of the population having different proportions of global ancestry and/or differing local ancestry at a given locus (Duan et al. 2018).
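How well a tagging SNP tracks a causal variant is commonly summarised by the squared correlation r² between the two sites. The following minimal Python sketch computes r² from phased haplotypes and shows how the same SNP pair can be in complete LD on one ancestral background and in no LD on another. The haplotype data and the labels "ancestry X" and "ancestry Y" are invented for illustration.

```python
def r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_tag_snp, allele_at_causal_snp)
    pairs, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # allele-1 frequency, tag SNP
    p_b = sum(b for _, b in haplotypes) / n          # allele-1 frequency, causal SNP
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


# Invented haplotypes: perfect tagging on ancestry X, none on ancestry Y.
ancestry_x = [(1, 1)] * 4 + [(0, 0)] * 4
ancestry_y = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

r2_x = r_squared(ancestry_x)
r2_y = r_squared(ancestry_y)
r2_pooled = r_squared(ancestry_x + ancestry_y)
```

In this toy example the pooled sample yields an r² between the two ancestry-specific values, which illustrates why a tagging SNP chosen in one population may lose much of its signal in an admixed cohort.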

The increasing number of admixed individuals is an evolving obstacle for epidemiological association studies. Even populations assumed to be unadmixed may harbour fine-scale admixture. A different admixture mapping approach is required for southern African populations since most individuals will have ancestral contributions from more than two different non-intermating admixed subpopulations of unknown origin with different effect sizes (Bostoen 2018). The Bantu expansion shaped the genetic composition of most populations from southern Africa, which consists of the countries Namibia, Botswana, Eswatini, Lesotho and South Africa, the latter having 11 official languages which reflect the main ethnic groups (Uren et al. 2016). Differential admixture dynamics were experienced by Bantu-speaking communities in different areas of south-eastern Africa and the indigenous populations were most affected by this event (Pickrell et al. 2012; González-Santos et al. 2015; Beltrame et al. 2016; Fan et al. 2019; Tucci and Akey 2019). The migration of Bantu-speaking Africans from western to eastern and southern Africa disrupted local communities through displacement or admixture, which led to genetic and cultural exchange. This caused agricultural expansion (González-Santos et al. 2015). Several aspects, therefore, require consideration before conducting genetic studies in southern African populations. Firstly, there is extensive population heterogeneity amongst Africans, caused by complex genetic population sub-structure and differing levels of admixture (Choudhury et al. 2018; Fan et al. 2019). These groups cannot, therefore, be treated as one population in genomic studies (Patin et al. 2017). Secondly, differences in LD between populations of African descent cause different haplotype structures, resulting in reduced power to detect untyped causal loci (Campbell and Tishkoff 2008). Thirdly, derived alleles are more likely to be heterozygous than homozygous (Barnes et al. 2007).

Although admixture mapping increases power to detect disease-associated variants, because ancestral LD blocks are longer than haplotype blocks, differing amounts of ancestry and LD patterns from unknown ancestral populations could be present in each individual in the cohort under study (Zhu and Wang 2017). Adjusting for local ancestry, which reflects the admixture-induced LD within admixed populations, will significantly improve the detection of genetic variants with small or moderate effect when extensive genetic heterogeneity is present in the admixed population under study (Duan et al. 2018). However, it is important to also correct for global ancestry proportions whilst correcting for local ancestry, since association testing will take place in the context of the admixture-induced LD blocks in the admixed genome (Duan et al. 2018). This emphasizes the importance of characterising fine-scale genome variation among underrepresented southern African populations to facilitate complex-trait mapping amongst those who harbour a significant burden of global disease.

Currently, no universal standard operating procedures for admixture mapping exist, due to the unique admixture scenarios of each project (Zhang and Stram 2014). Considering finer details of population history and ancestry locally inferred for each haplotype will become a requirement for genomic studies among admixed individuals (Duan et al. 2018). This is especially true for populations with complex histories, such as those found in southern Africa.

Genomic resources

Population-specific reference genomes

The current human reference genome was essential in advancing genetic analysis such as imputation of missing genotypes required for GWAS and admixture mapping studies, as well as the identification of rare variants in mostly European and Asian populations (Ikegawa 2012). It is estimated that any two humans will share approximately 99.9% of their genomes (Li and Sadler 1991). A 0.1% difference might seem insignificant, although in a human genome this translates to approximately 3 million base pairs (Suwinski et al. 2019). Since the current reference systematically underrepresents the tremendous global sequence diversity, it is necessary to focus on individual, geographically defined populations.

Reference genomes have been assembled for multiple distinct human populations and have empirically proven their importance by improving both short-read mapping and genotype calling. They facilitate the assessment and meta-analysis of studies done on different microarrays by imputing missing genotypes and are regularly utilized for improving imputation of low-frequency and rare variants (Vergara et al. 2018). Despite the decrease in the cost of sequencing, microarrays are still the technology of choice, although these do have certain limitations. Most notably, probes on the microarray are designed from publicly available data to be an exact complement to the desired genomic region to maximize genotyping rate. A probe that is not an exact complement will result in a lower genotyping rate, which is especially problematic when considering the higher base-pair substitution rate of intergenic versus protein-coding regions (Cargill et al. 1999; Halushka et al. 1999; Leabman et al. 2003). Another limitation is the implication of a probe binding within structural variants (SVs), which have been shown to be greatly population-specific (Rosenfeld et al. 2012; Sudmant et al. 2015). If sequencing is selected, the raw data still require conversion to genomic data either through de novo assembly or by alignment to a reference genome. The latter approach, which is less computationally demanding, has not always allowed the detection of SVs. Due to the limited read length of standard NGS, this method of variant detection has a resolution well suited to single nucleotide variants (SNVs), but the reads are too short to accurately detect larger events (Merker et al. 2018).

The failure to detect variants of any size in genomic data due to the lack of a population-specific reference genome has multiple consequences. Firstly, it has an impact on the ability to identify potentially disease-associated variants in patients with underlying genetic conditions. This is applicable not only to the discovery of the variant but extends to the precise clinical diagnosis and subsequent treatment. Detecting functionally relevant genetic variants with increased accuracy with the help of a population-specific reference genome brings precision medicine to the forefront of diagnostic and treatment options. Secondly, the failure to detect variants has consequences for the generation and maintenance of allele frequency databases, as the failure to accurately curate these findings overburdens variant prioritization pipelines. When considering genomic variants linked to a condition that is more prevalent in a certain population, it is preferable to compare the genomes to a reference genome more representative of that population.

Currently, no reference genome exists that adequately represents African genetic diversity and it is unlikely that a quality reference genome representative of all populations will be achieved. Rare genetic variants could be missed when conducting studies on participants of African origin since an African pan-genome contains approximately 10% more DNA (approximately 300 million base pairs) than the presently available human reference genome (Sherman et al. 2019). Regions as large as 100,000 base pairs were identified and 387 novel contigs were located in 315 distinct protein-coding regions (Sherman et al. 2019). Furthermore, the current $2.7 billion Human Genome Reference build 38 (GRCh38) lacks genetic variation from individuals worldwide. A study conducted by Yang et al. identified certain biases in GRCh38 by sequencing three Africans, three Asians, two Europeans and three Americans with PacBio single molecule, real-time (SMRT) sequencing and comparing these to GRCh38, 174 individuals from the 1000 Genomes Project (1000GP) and 266 individuals from the Simons Genome Diversity Project (SGDP). A total of 40.8% (99,604 nonredundant SVs) were novel compared to previously published large-scale projects (Yang et al. 2019). The SGDP obtained high-quality (average coverage of 43-fold) whole genomes from 142 diverse populations and indicated that these genomes include at least 5.8 million base pairs that are not present in GRCh38 (Mallick et al. 2016). Additionally, a study in Sweden identified 61,000 novel genetic sequences from 1000 individuals that were missing in GRCh38 and nearly 40% of the genetic material could not be mapped (Eisfeldt et al. 2020).

Previous attempts to capture underrepresented southern African genetic variation have been made by the 1000GP, AGVP and SGDP to include more genome-wide genetic markers for broader groups of southern African populations (Gurdasani et al. 2015; 1000 Genomes Project Consortium et al. 2015). The 1000GP, the largest whole-genome sequencing survey, analysed 26 populations from Europe, East Asia, South Asia, the Americas and Africa. Low-coverage sequencing was used and the focus was on demographically large populations, while smaller populations were excluded, despite their respective contributions to human diversity. Although five African populations were included in the analysis, most of these populations are of recent Niger-Kordofanian ancestry (West and East African) and do not reflect the diversity present in southern Africa (1000 Genomes Project Consortium et al. 2015). Furthermore, 11%, 5% and 5% of heterozygous positions in KhoeSan, New Guineans and Australians, respectively, were not identified by the 1000GP. The SGDP also confirmed that populations from southern Africa contain the highest genetic diversity amongst modern humans (Mallick et al. 2016). Addressing this diversity, the AGVP included whole-genome sequencing across individuals belonging to ten language subgroups in southern Africa; however, the low-coverage sequencing (4× coverage) risks misclassifying both observed and imputed rare variants (Gurdasani et al. 2015). Although efforts by multiple consortia are currently expanding, the risk of eliminating genuine pathogenic variants that are segregating in the population will not be reduced in the absence of comprehensive knowledge of human genetic architecture, including rare variant frequencies.

Population-specific recombination maps

Recombination maps are often used for admixture mapping (Browning and Browning 2007). A recombination map is a genetic map that illustrates the variation of the recombination rate across a region of the genome or the entire genome (Myers et al. 2005). It is dependent on the underlying distribution of recombination events that occur between successive generations within a given population (Kong et al. 2010). The presence and activity of the PRDM9 zinc finger protein in the population under study, the ratio of males to females and the population’s genetic substructure are some of the known factors that have an effect on these recombination events. Population substructure is affected by the migratory history, the evolutionary history and the common ancestry of the population (Manu et al. 2018). The extent to which the population substructure impacts the utility of a recombination map is yet to be determined.
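In practice a recombination map is often stored as a table of physical positions (base pairs) with the corresponding cumulative genetic positions (centimorgans, cM). The minimal Python sketch below shows how such a table can be queried by linear interpolation to obtain a genetic position and a regional recombination rate in cM/Mb. The map values and function names are invented for illustration and do not correspond to any published map or software.

```python
from bisect import bisect_right


def genetic_position(map_points, bp):
    """Linearly interpolate the genetic position (cM) of a base-pair
    coordinate from sorted (bp, cM) map points on one chromosome.
    Positions outside the map are clamped to the nearest map end."""
    positions = [p for p, _ in map_points]
    if bp <= positions[0]:
        return map_points[0][1]
    if bp >= positions[-1]:
        return map_points[-1][1]
    i = bisect_right(positions, bp)
    (bp0, cm0), (bp1, cm1) = map_points[i - 1], map_points[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


def rate_cm_per_mb(map_points, start_bp, end_bp):
    """Average recombination rate (cM per megabase) over an interval."""
    cm = genetic_position(map_points, end_bp) - genetic_position(map_points, start_bp)
    return cm / ((end_bp - start_bp) / 1_000_000)


# Toy map with a recombination hotspot between 1 Mb and 2 Mb (invented values).
toy_map = [(0, 0.0), (1_000_000, 0.5), (2_000_000, 3.5), (3_000_000, 4.0)]

midpoint_cm = genetic_position(toy_map, 1_500_000)
background_rate = rate_cm_per_mb(toy_map, 0, 1_000_000)
hotspot_rate = rate_cm_per_mb(toy_map, 1_000_000, 2_000_000)
```

If the map comes from a distantly related population, the interpolated genetic distances supplied to phasing or ancestry inference software may not reflect the true recombination landscape of the cohort.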

Currently, there is a lack of high-resolution population-specific recombination maps for southern African populations. This has inevitably led to inaccuracies in studies that make use of a recombination map. These inaccuracies are exacerbated when no recombination maps for closely related populations are available. Research in southern African populations has thus been forced to make use of ancestral maps (such as European and West African maps) (Uren et al. 2017a) or to rely on ancestry informative markers to mitigate potential bias when genome-wide data are not available (Daya et al. 2013). There is thus a need for accurate, high-resolution recombination maps for southern African populations.

However, there are several uncertainties to be addressed before such a map can be used with confidence. Firstly, the accuracy of the map needs to be established. Software used to construct recombination maps has been developed and tested on populations with homogeneous ancestry (Auton and McVean 2007). Secondly, testing the accuracy of a recombination map of an admixed population is difficult, because recombination rates vary between ancestries. Any one segment of a recombination map would have a recombination rate that closely resembles the average of the rates of all the ancestries represented in the population. Thirdly, the method used to develop the map and the map itself would then have to be validated against currently available recombination maps (Kong et al. 2010). It should also be noted that the resolution of a given map relies strongly on the method used to construct the map and the number of individuals used to construct it (Halldorsson et al. 2019). The most common methods used to build recombination maps are pedigree-based methods, LD-based methods and admixture-based inference (Halldorsson et al. 2019). Of these methods, the LD-based method produces the highest resolution if only a limited number of individuals is available. However, the pedigree-based and the admixture-based methods can produce sub-kilobase resolutions when a few thousand individuals are available (Halldorsson et al. 2019). The problem with using the admixture-based method on a population for which no recombination map exists is that many methods that infer ancestry rely on a recombination map for the inference. Thus the resulting recombination map could be inaccurate because the map used for the ancestry inference might be based on a population that is distantly related to the population in question. When dealing with admixed populations, the pedigree-based method would produce the least bias due to admixture, since the algorithms employed rely on direct observations of recombination events between parent–offspring pairs (Halldorsson et al. 2019). For these reasons, the pedigree-based method should be the method of choice when a large enough sample from a population is available. The theoretical benefit of a population-specific recombination map has yet to be proven in practice, but one can expect such a map to improve the accuracy of admixture mapping, and this improved accuracy could result in the discovery of novel variants associated with numerous phenotypes.
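The idea behind pedigree-based inference, that crossovers are observed directly in parent–offspring transmissions, can be sketched as follows. This minimal Python example assumes that the parent's two haplotypes are already phased, that the transmitted haplotype is known and that there are no genotyping errors; these are simplifying assumptions made here for illustration, and the function name and data are hypothetical. Real pedigree-based methods must handle phasing uncertainty and errors.

```python
def count_crossovers(parent_hap1, parent_hap2, transmitted):
    """Count switches between the two parental haplotypes along the
    haplotype transmitted to the offspring.

    Alleles are coded 0/1 and listed in chromosomal order. Only sites where
    the parent is heterozygous are informative about the haplotype of origin.
    """
    crossovers = 0
    current_source = None
    for a1, a2, t in zip(parent_hap1, parent_hap2, transmitted):
        if a1 == a2:
            continue  # homozygous in the parent: uninformative site
        source = 1 if t == a1 else 2
        if current_source is not None and source != current_source:
            crossovers += 1  # origin switched: one observed crossover
        current_source = source
    return crossovers


# Invented example: the transmitted haplotype copies haplotype 1 for the
# first informative sites and then switches to haplotype 2.
hap1 = [0, 0, 1, 0, 1, 1]
hap2 = [1, 0, 0, 1, 0, 1]
gamete = [0, 0, 1, 1, 0, 1]

n_crossovers = count_crossovers(hap1, hap2, gamete)
```

Because the count relies only on the transmission itself, it does not depend on an existing recombination map or on ancestry inference; localising each crossover between flanking informative sites across many transmissions is what yields a map.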

Future prospects for disease mapping

Novel loci identified in African populations

Novel genetic regions associated with multifactorial diseases could be identified by investigating the allelic architecture of highly admixed individuals from southern Africa, along with fine-mapping previous genomic loci associated with complex traits (Narang et al. 2011; Gurdasani et al. 2019). The first meta-analysis conducted in sub-Saharan Africa identified a novel locus (ZRANB3) significantly associated with type 2 diabetes (T2D) in a study investigating 5231 individuals from Nigeria, Kenya and Ghana. The study also indicated the transferability of 32 established T2D loci from previous investigations and contributed to the understanding of the disease aetiology of T2D (Adeyemo et al. 2019). Furthermore, Gulsuner et al. investigated 909 schizophrenia patients and 917 healthy controls from the Xhosa population of South Africa (residing mostly in the Eastern Cape of South Africa). Not only did they identify admixture between Bantu-speaking Africans and San individuals, but they also identified more private damaging mutations in cases than in controls. Interestingly, when the same analysis was replicated in a Swedish cohort, the Xhosa individuals generally had larger effect sizes than the Swedish individuals (Gulsuner et al. 2020). Furthermore, a meta-analysis of cardiometabolic traits in 14,100 African individuals identified novel loci associated with lipid, blood cell and other traits that appear to be rare in populations from other parts of the world (Gurdasani et al. 2019). However, these studies mostly concern common genetic variants and are not adequate to identify rare genetic variants.

High-throughput technologies, such as whole-exome sequencing (WES) and whole-genome sequencing (WGS), are required to locate rare population-specific variants (Uren et al. 2017a; Retshabile et al. 2018; De La Vega and Bustamante 2018). Although WES is a cost-effective approach for identifying coding sequence targets in resource-restricted settings, WGS includes the complete and unbiased information carried by an individual, and high-coverage WGS can detect rare variants (Suwinski et al. 2019). The first deep sequencing experiment of southern African populations assessed the population substructure within a cohort of HIV-positive children from Botswana. WES data of 164 individuals from Botswana were analysed and compared with 150 similarly sequenced HIV-positive Ugandan children (Retshabile et al. 2018). Approximately 13–25% of genetic variation in populations from Botswana was not captured in current public databases. These missing variants were significantly enriched for coding variants with MAF between 1% and 5% and included predicted-damaging non-synonymous variants. This population also had more rare (< 1%) pathogenic and damaging variants (Retshabile et al. 2018). These studies highlight the untapped potential of these populations to contribute to the discovery of novel disease risk alleles in GWAS. Extending GWAS and sequencing studies to diverse populations is likely to generate a rich harvest of novel risk alleles.

Population-specific allele frequency databases

Population-specific allele frequencies have been sparsely characterised for southern African populations. For rare disease genetics, reference databases are continuously used for filtering based on allele frequency with the idea that common alleles are unlikely to be responsible for rare, highly penetrant disorders (Visscher et al. 2017). Therefore, in the absence of appropriate population reference datasets, variants can be misclassified and may lead to false disease associations. For instance, a major allele for southern African populations can be identified as minor, since the current reference genome indicates it is a minor allele (Yang et al. 2019). High-coverage whole-genome reference datasets are needed to characterize and catalog population-specific variation and facilitate genetic studies in admixed southern African populations to identify causal rare variants.
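The consequences of filtering against an unrepresentative reference database can be made concrete with a small calculation. The minimal Python sketch below computes alternate allele frequencies from genotype counts and applies a 1% frequency filter of the kind used in rare disease genetics. All counts, the threshold default and the function names are invented for illustration and do not come from any real database.

```python
def alt_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Frequency of the alternate allele from diploid genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    return (n_het + 2 * n_hom_alt) / total_alleles


def minor_allele(alt_freq):
    """Return which allele is the minor one ('alt' or 'ref')."""
    return "alt" if alt_freq < 0.5 else "ref"


def passes_rare_filter(alt_freq, threshold=0.01):
    """Keep a variant as a rare-disease candidate only if it is rare."""
    return alt_freq < threshold


# Variant 1 (invented counts): minor in a public database, major locally.
database_freq_1 = alt_allele_frequency(90, 9, 1)     # 11 of 200 alleles
local_freq_1 = alt_allele_frequency(10, 30, 60)      # 150 of 200 alleles

# Variant 2 (invented counts): looks rare in the database, common locally.
database_freq_2 = alt_allele_frequency(995, 5, 0)    # 5 of 2000 alleles
local_freq_2 = alt_allele_frequency(800, 180, 20)    # 220 of 2000 alleles

kept_using_database = passes_rare_filter(database_freq_2)
kept_using_local_panel = passes_rare_filter(local_freq_2)
```

In this toy case the second variant would be retained as a candidate when filtered against the public database but removed when filtered against a population-matched panel, which is the kind of misclassification described above.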

The clinical value contributed by the deep sequencing of whole genomes was demonstrated by the GenomeAsia 100K project (GAsP) (Wall et al. 2019). The pilot phase, which included a WGS dataset of 1739 individuals from 219 populations and 64 countries across Asia, identified a total of 194,585 novel variants with a MAF of > 1%. Overall, 23% of protein-altering variants in GAsP were not found in publicly available databases such as the Single Nucleotide Polymorphism Database (dbSNP), the Genome Aggregation Database (gnomAD), the Exome Aggregation Consortium (ExAC) and the Exome Sequencing Project (ESP) (Wall et al. 2019). Importantly, imputation accuracy using the GAsP reference panel was 93–95% compared to < 90% utilising the 1000GP reference panel. GAsP discovered thirteen unique cancer risk variants and a variant in HBB associated with beta-thalassemia. This HBB variant is found almost exclusively in South Asians and at a lower frequency in Southeast Asia. Ultimately, the GAsP reference dataset improves the ability to filter out low-probability candidates for highly penetrant disorders, to identify putatively pathogenic variants that are found at high frequency in particular populations and to infer the pathogenicity of identified variants (Wall et al. 2019). Not only did this study exceed the ability of publicly available sources to annotate protein-coding variants and capture low-frequency and rare variants unique to Asian populations, but it also improved the imputation of missing genotypes.

The Ugandan 2000 Genomes Project (UG2G) consists of 1978 individuals from rural Uganda and is the largest sequence panel from Africa (Gurdasani et al. 2019). The investigators identified 41.5 million SNPs and 4.5 million insertions and deletions. Likewise, 29% of the SNPs discovered in the UG2G project were absent in gnomAD. Furthermore, 52 population clusters in the region of Uganda (home to 9 ethnolinguistic groups) were identified, revealing complex admixture that included ancient East African pastoralists (Gurdasani et al. 2019). A genetic study conducted by Higasa et al., which included 1208 Japanese individuals, identified 156,622 previously unreported variants. Notably, these variants had allele frequencies lower than 0.5% and were functionally deleterious. This study specifically emphasized the importance of constructing an ethnicity-specific reference genome for identifying rare variants (Higasa et al. 2016).

A catalogue of known variants, whether common or rare, will allow researchers to identify mutations in protein-coding regions and rare causal variants, and to track small and discrete mutations at multiple loci across the genome. However, population-specific variants will only be accurately collected if a reference genome exists with a representative population consensus, instead of using the existing human reference genome (GRCh38) of European ancestry (currently employed as a proxy in all genomic studies) (Ballouz et al. 2019).

Avenues for future improvements

Genomic resources lack southern African representation which is impacting on research in these settings. Future investigations to address this could include the following:

  1. A consensus southern African reference genome, obtained from high-throughput whole-genome sequencing of southern African populations, is required to capture the major alleles present in the region. This will serve as a genetic toolbox to improve imputation of missing genotypes, to standardize cohorts genotyped on different arrays for meta-analysis and to minimise the possibility of misclassifying major and minor alleles for southern African populations.

  2. A southern African recombination map might improve the phasing of haplotypes to increase the accuracy of local ancestry inference in highly admixed individuals. However, there still exists some uncertainty in this regard and further investigations are warranted.

  3. A southern African population-specific catalogue is required to capture allele frequencies in this region. Rare variants could be shared amongst healthy individuals, but not be present in public databases. Only high-throughput sequencing technologies will be able to capture population-specific rare variants, since a reference genome, which is used as consensus in disease mapping, would not necessarily contain a specific variant.

  4. An electronic catalogue of phenotypic information and the associated genotypic information would enable geneticists to accurately identify genetic variants associated with disease phenotypes. However, the complexity of sample collection (due to unique ethical, cultural and socio-economic factors) in southern Africa is frequently underestimated, as reviewed elsewhere (Martin et al. 2018). The United Kingdom Biobank is a recent example of how incorporating clinical data embedded in electronic health records, combined with GWAS data and registries available for research, can benefit everyone and not just individuals from a specific region.

Conclusion

Current GWAS and admixture mapping study designs are failing to identify disease-causing loci or rare genetic variants in southern African individuals. This is largely the result of limited reference haplotype panels and, in turn, limited genetic and computational tools available for southern African populations. The majority of SNP genotyping arrays are selected from a small sample of individuals (predominantly of European ancestry) and imputation and phasing of genotypic data usually involve a human reference of European ancestry, missing approximately 10% of the genomes of individuals of African descent (De La Vega and Bustamante 2018). The genome structures of future generations might develop in a similar way to those of complex five-way admixed southern African populations, as admixture between populations originating from more than two different continents is now considered a customary feature of human populations across the globe (Busby et al. 2016; Salter-Townshend and Myers 2018).

Existing methods to detect loci associated with multifactorial diseases are not optimized for southern African ancestral groups and innovative approaches are urgently needed to study lethal communicable diseases such as TB, as well as non-communicable diseases such as cardio-metabolic diseases and type 2 diabetes in Africa. This entails the systematic development of best practices for ancestry inference, imputation and association studies. The establishment of a publicly available southern African-specific consensus reference genome is required to capture novel genetic variants and to maximize imputation for southern African populations. This will benefit future genetic studies involving complex diseases and traits by capturing rare variants previously lost due to a lack of publicly available data. Admixture mapping studies will continue to be inconclusive for populations from southern Africa if no reference panels are available to represent proxy ancestral populations contributing to their genomes. The accuracy of the local ancestry calls for southern African individuals could also be decreased if population-specific recombination maps are not available. This will, in turn, affect the accuracy of admixture mapping studies that make use of LAI.

Conducting genetic studies on admixed southern African populations, with varying ancestral contributions, could also be beneficial for genetic studies of communicable and non-communicable diseases not mentioned in this review. Without a properly representative reference genome and methodologies to analyse complex admixed southern African genomes, genomic medicine will not benefit these individuals to the extent that it benefits those of European descent. For southern African countries and ethnicities to benefit from large-scale GWAS, as most European countries have, disease variants associated with southern African-specific diseases have to be identified. This will allow precision medicine and polygenic risk scores to be implemented. Although several consortia have contributed immensely to the development and training of African genetic researchers to include more diverse populations that have traditionally been underrepresented, global collaboration is still essential to increase the genetic representation of southern African populations.