Introduction

Global population growth, accompanied by increased consumer wealth, will change food consumption patterns worldwide (Tilman and Clark 2014). By the year 2050, the demand for protein from animal sources is projected to double relative to 2000 (Alexandratos 1999). This trend raises sustainability and food security concerns, as the intensive production of animal protein adds pressure on the environment: vast amounts of scarce (non-renewable) resources such as land, water and minerals are needed. In contrast, the production of plant protein is more sustainable for the environment as fewer resources are needed (Sabaté and Soret 2014).

Potato (Solanum tuberosum L.) is a well-known starch crop. However, few realise that the potato crop also serves as an abundant source of plant protein (Jørgensen et al. 2006). Although protein content in potato tubers is relatively low (0.32–1.63%) (Bárta et al. 2012; Klaassen et al. 2019; Ortiz-Medina 2006), protein yield per hectare (ha) is high due to the high-yielding ability and high harvest index of the potato crop, which can yield up to 124 ton ha−1 (Kunkel and Campbell 1987). The potato starch industry processes potatoes to produce starch and by-products. After starch is extracted from tubers, potato fruit juice (PFJ) is released as a major aqueous by-product that contains protein. After proteins are extracted from PFJ, functional (native) potato protein isolates may be utilized in high-end food and pharmaceutical applications that include foaming agents, anti-oxidants, emulsifiers (Creusot et al. 2011; Edens et al. 1999; Kudo et al. 2009), inhibitors of faecal proteolytic compounds that cause dermatitis (Ruseler-van Embden et al. 2004) and satiety agents (Hill et al. 1990). Therefore, valorisation of protein provides opportunities to create added value for the potato starch industry. Consequently, innovative firms in the industry are keen to use protein-rich potato varieties, and high protein content has therefore emerged as a key quality trait for breeders. However, breeding for protein content in potato is challenging due to the complex genetic basis underlying the trait (Klaassen et al. 2019). To facilitate breeding for protein content in potato, improved comprehension of the inheritance, quantitative trait loci (QTLs) and relationships with other agronomically relevant traits is needed.

Knowledge on the inheritance of protein content in potato is limited. To the best of our knowledge, three genetic studies on protein content in bi-parental populations have been published (Acharjee et al. 2018; Klaassen et al. 2019; Werij 2011). These studies estimated moderate levels of trait heritability (40–74%) and identified minor-effect QTLs on chromosomes 1, 2, 3, 5 and 9 in both non-cultivated diploid and cultivated tetraploid potato germplasm. As for other crops that include soybean, maize and wheat (Balyan et al. 2013; Hwang et al. 2014; Karn et al. 2017), protein content has been described as a complex trait that is regulated by a plethora of interactions between genetic and environmental factors. Therefore, QTLs for protein content in heterozygous tetraploid (2n = 4x = 48) potato are likely to be affected by both epistasis and environmental factors.

Genome-wide association studies (GWAS) have been used to dissect the genetic architecture of complex traits in multiple species, including potato (Rosyara et al. 2016; Sharma et al. 2018). As opposed to genetic studies performed on bi-parental populations, GWAS offers the advantage of identifying QTLs within a panel of diverse individuals, and of potentially gaining a high mapping resolution for identifying candidate genes.

In this study, we carried out a GWAS to dissect the genetics of protein content in a panel of tetraploid potato. We report on the relationship between protein content and tuber under-water weight (a proxy for starch content), haplotypes underlying QTLs and putative candidate genes.

Materials and methods

Germplasm collection

The panel (N = 277) consisted of tetraploid (2n = 4x = 48) individuals. The panel was composed of 189 varieties (D’hoop et al. 2008) and 88 starch potato progenitors that originated from five potato breeding companies (Agrico, Averis Seeds, C. Meijer, HZPC and KWS) (Supplementary Table 1). These included both modern and old individuals from different market segments and geographic origins. Analysis of population structure in the panel displayed three sub-populations, as reported earlier (D’hoop et al. 2008; Vos 2016). These sub-populations, hereafter referred to as “Processing”, “Other” and “Starch”, were used for analyses.

Field trials

Raw phenotypic data were collected over years and locations (multi-location, multi-year) from unbalanced field trials that were carried out in the Netherlands. These trials were carried out in years 2008–2010 in Bant, Emmeloord, Metslawier, Rilland and Valthermond. The accessions were replicated three times or more, except for nine accessions that were replicated twice or once. A replicate (experimental unit) consisted of a four-plant plot within a row in the field. Raw phenotypic data were used to compute the BLUEs for the accessions. The trials were carried out during the conventional potato growing seasons in the Netherlands as described by D’hoop et al. (2008). Uniform seed tubers were used as planting material and were propagated at a single location 1 year prior to the trials. The seed potatoes were planted at 75-cm spacing between the rows and 35 cm between the hills. Guard rows were used to separate the plots in the trial. Regular husbandry practices for potato production in the Netherlands were carried out during the field trials. After harvest, the tubers were stored under cool conditions prior to use.

Quantification of phenotypes

Soluble protein content in potato fruit juice (PFJ) was determined by using the bicinchoninic acid (BCA) assay (Smith et al. 1985). Bovine serum albumin (BSA) was used as a standard. Protein content was quantified as described by Klaassen et al. (2019). Tuber under-water weight (UWW), a proxy for starch content, was quantified as described in a previous study (Bradshaw et al. 2008).

Best linear unbiased estimates

To estimate the best linear unbiased estimates (BLUEs), a mixed model was used. BLUEs were computed using restricted maximum likelihood (REML) (D’hoop et al. 2011) as follows:

$$ \mathrm{Response}\ (Y)=\mu +\mathrm{Accession}+\mathrm{Year}+\mathrm{Location}+\left(\mathrm{Accession}\times \mathrm{Year}\right)+\left(\mathrm{Accession}\times \mathrm{Location}\right)+\left(\mathrm{Year}\times \mathrm{Location}\right)+\left(\mathrm{Accession}\times \mathrm{Year}\times \mathrm{Location}\right)+\mathrm{Error}, $$
(1)

where “μ” represented the overall mean response and the “Accession” term was fixed. The “Year”, “Location” and residual “Error” terms were random. Broad sense heritability estimates (H2) were computed from variance components (see also D’hoop et al. 2011) as follows:

$$ {H}^2=\frac{{\sigma^2}_G}{\left({\sigma^2}_G+\frac{{\sigma^2}_{G\times Y}}{y}+\frac{{\sigma^2}_{G\times L}}{l}+\frac{{\sigma^2}_{G\times Y\times L}}{y\times l}+\frac{{\sigma^2}_e}{r\times y\times l}\right)} $$
(2)

where the variance components σ2G (“Accession”), σ2G × Y (“Accession” × “Year”), σ2G × L (“Accession” × “Location”), σ2G × Y × L (“Accession” × “Year” × “Location”) and σ2e (residual “Error”) were derived from REML. To estimate these variance components, a second REML model was fitted in which the “Accession” term was random. In the H2 equation, the terms “y”, “l” and “r” represented the number of years, number of locations and number of biological replicates, respectively.
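The heritability formula in Eq. 2 can be sketched in a few lines of Python. This is only an illustration: the function name and the variance component values below are hypothetical and are not taken from Table 1.

```python
def broad_sense_heritability(var_g, var_gy, var_gl, var_gyl, var_e, y, l, r):
    """Broad sense heritability (H2) on an accession-mean basis, following Eq. 2.

    var_g   : genotypic ("Accession") variance
    var_gy  : "Accession" x "Year" variance
    var_gl  : "Accession" x "Location" variance
    var_gyl : "Accession" x "Year" x "Location" variance
    var_e   : residual ("Error") variance
    y, l, r : number of years, locations and biological replicates
    """
    denominator = (
        var_g
        + var_gy / y
        + var_gl / l
        + var_gyl / (y * l)
        + var_e / (r * y * l)
    )
    return var_g / denominator


# Hypothetical variance components, for illustration only.
h2_example = broad_sense_heritability(
    var_g=1.0, var_gy=0.3, var_gl=0.5, var_gyl=0.6, var_e=3.0, y=3, l=5, r=3
)
print(f"H2 = {h2_example:.2f}")
```

Because every interaction and residual component is divided by the number of environments or replicates it is averaged over, adding years, locations or replicates raises H2 for a fixed set of variance components.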

Genotyping and genotype calling

The panel was genotyped using the SolSTW 20K Infinium SNP marker array (Vos et al. 2015). Genotype calling (assignment of SNP allele dosages) was carried out by using fitTetra (Voorrips et al. 2011) and Illumina GenomeStudio software version 2010.3 (Illumina, San Diego, CA, USA), as described by Vos et al. (2015). The threshold for minor-allele frequency (MAF) was set at 1.5% (equivalent to 6% for tetraploid potato with four sets of homologous chromosomes). After filtering, 14,436 high-quality SNP markers were used for GWAS. The physical coordinates of the SNPs were based on the potato reference genome, i.e. pseudomolecules v4.03 (PGSC 2011).

Population structure analysis

The population structure of the panel was analysed by using STRUCTURE software package v2.3.4 (Pritchard et al. 2000). Ten runs were performed to estimate the K values using 2000 randomly selected SNPs. A Markov chain Monte Carlo (MCMC) burn-in period of 10,000 was used and the number of iterations was set at 10,000. The appropriate number of sub-populations was determined from delta K and optimal K values (Evanno et al. 2005) based on output data derived from STRUCTURE Harvester (Earl and vonHoldt 2012) (http://taylor0.biology.ucla.edu/structureHarvester/). Membership probability estimates from the ten runs were averaged and used to assign each individual to cluster groups (sub-populations). The sub-populations were denoted as “Processing”, “Other” and “Starch”, based on prior knowledge that these three sub-populations existed in the panel (D’hoop et al. 2008; Vos 2016).

Genome-wide association study

A GWAS was performed using the phenotype (BLUEs) and genotype data. A naive model, a mixed model and a conditional mixed model were used to compute associations. For the naive model, associations between the BLUEs and SNP dosages were analysed. The mixed model was used to perform association analysis whilst correcting for kinship (K). As population structure was weak for the panel, as shown by Vos et al. (2017), we did not correct for sub-populations (Q). For the mixed model, we used the same SNPs for GWAS and kinship correction. A sub-set of these SNPs was used for inference of population structure. To dissect the effect of maturity alleles (StCDF1) that are physically positioned at the start of chromosome 5, a conditional mixed model was used. The conditional mixed model, as described in earlier studies (Kang et al. 2010; Segura et al. 2012), included the SNP marker “PotVar0079081” (chromosome 5, coordinate 4,489,481 bp) as a cofactor that tagged the early maturity allele (StCDF1.1) in a haplotype-specific manner (Willemsen 2018). The following equations were used for the GWAS models:

$$ \mathrm{Naive}\ \mathrm{model}:Y=X\upalpha +\varepsilon, $$
(3)
$$ \mathrm{Mixed}\ \mathrm{model}\ \left(\mathrm{kinship}\ \mathrm{corrected}\right):Y=X\upalpha + K\mu +\varepsilon, $$
(4)
$$ \mathrm{Conditional}\ \mathrm{mixed}\ \mathrm{model}\ \left(\mathrm{cofactor}+\mathrm{kinship}\ \mathrm{corrected}\right): $$
$$ Y=A\upbeta +X\upalpha + K\mu +\varepsilon . $$
(5)

In Eqs. 3, 4 and 5, “Y” represents the BLUEs, “X” represents the SNP markers (fixed effect), “K” represents the random kinship (co-ancestry) matrix and “A” represents the SNP marker set as cofactor (fixed). The term “ε” represents the vector of random residual errors. The term “α” represents the estimated SNP effects, “β” represents the estimated effect of the SNP marker set as cofactor and “μ” represents the estimated kinship variance component. Analyses were performed in the R software package GWASpoly (Rosyara et al. 2016). Phenotypic variance explained (R2) by SNPs was calculated from squared correlation coefficients between the BLUEs and SNP dosage scores (allele copy number).
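As a rough illustration of the naive model (Eq. 3) and of the R2 calculation for a single SNP, the Python sketch below regresses BLUEs on allele dosage and squares the correlation coefficient. It is not the GWASpoly implementation: the function names, the additive 0–4 dosage coding and the toy data are assumptions made for this example, and no kinship correction is applied.

```python
import numpy as np


def naive_snp_effect(blues, dosages):
    """Ordinary least-squares fit of BLUEs on SNP dosage (0-4 allele copies).

    Returns (intercept, alpha), where alpha is the estimated effect per allele copy.
    """
    blues = np.asarray(blues, dtype=float)
    dosages = np.asarray(dosages, dtype=float)
    design = np.column_stack([np.ones_like(dosages), dosages])
    coef, *_ = np.linalg.lstsq(design, blues, rcond=None)
    return coef[0], coef[1]


def variance_explained(blues, dosages):
    """R2 as the squared Pearson correlation between BLUEs and dosage scores."""
    r = np.corrcoef(np.asarray(blues, dtype=float),
                    np.asarray(dosages, dtype=float))[0, 1]
    return r ** 2


# Toy data (hypothetical): eight accessions scored for one SNP.
toy_dosages = [0, 1, 2, 3, 4, 0, 1, 2]
toy_blues = [1.00, 1.12, 1.18, 1.31, 1.40, 0.98, 1.09, 1.22]

intercept_hat, alpha_hat = naive_snp_effect(toy_blues, toy_dosages)
r2 = variance_explained(toy_blues, toy_dosages)
print(f"alpha = {alpha_hat:.3f}, R2 = {r2:.3f}")
```

The mixed models in Eqs. 4 and 5 additionally require a kinship matrix and REML estimation of its variance component, which this sketch deliberately leaves out.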

Significance threshold and QTL support interval

Manhattan plots were used to illustrate the genome-wide association scores of SNPs. These scores were computed from P values of the SNPs as follows:

$$ \mathrm{Association}\ \mathrm{score}=-{\log}_{10}(P). $$
(6)

We used several significance thresholds to identify QTLs. To correct for multiple testing, we used the 5% Bonferroni threshold (−log10(P) = 5.3). The Bonferroni threshold is known to inflate the probability of Type II errors (false-negative findings) in the presence of high linkage disequilibrium between markers (Gao et al. 2008; Johnson et al. 2010). Therefore, the 5% Li and Ji threshold (−log10(P) = 3.9) was also computed by correlated multiple testing (Li and Ji 2005). Correlated multiple testing was conducted at α = 0.05 to adjust for the effective number of independent tests and to compensate for Type II errors. For naive analyses, permutation testing was carried out with N = 1000 permutations at α = 0.05 to define the threshold value (−log10(P) = 5.0) (Churchill and Doerge 1994). The support intervals of QTLs were set at 1.5 Mbp for non-introgressed regions and 2.5 Mbp for introgressed regions, as described by Vos (2016).
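The association score (Eq. 6) and a Bonferroni-type threshold on the same −log10 scale can be sketched as follows. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration. The number of tests is left as a parameter and is not meant to reproduce the thresholds reported here, which depend on how tests were counted in the analysis software.

```python
import math


def association_score(p_value):
    """Association score of a SNP as defined in Eq. 6: -log10(P)."""
    return -math.log10(p_value)


def bonferroni_threshold(alpha, n_tests):
    """Bonferroni significance threshold expressed on the -log10(P) scale."""
    return -math.log10(alpha / n_tests)


# Illustrative values only: one SNP tested among a hypothetical 1000 tests.
score = association_score(1e-6)
threshold = bonferroni_threshold(alpha=0.05, n_tests=1000)
is_significant = score > threshold
print(f"score = {score:.2f}, threshold = {threshold:.2f}, significant = {is_significant}")
```

Passing an effective number of independent tests instead of the raw marker count lowers the threshold, which is the idea behind corrections for correlated markers.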

Haplotype inference

Determination of haplotypes underlying QTLs was performed using a contemporary haplotype inference method developed in tetraploid potato (Willemsen 2018). This method estimated the linkage phase between pairs of SNP markers, followed by joining of linked SNPs into haplotypes. Only SNPs exceeding the Li and Ji threshold (−log10(P) = 3.9) were used for haplotype construction and to obtain the dosages of the haplotypes.

SNP allele frequency

The SNP allele frequency (%) in the panel of tetraploid accessions was computed by using the SNP dosage scores and number of accessions as follows:

$$ \mathrm{SNP}\ \mathrm{allele}\ \mathrm{frequency}\ \left(\%\right)=\frac{\mathrm{Sum}\ \mathrm{of}\ \mathrm{SNP}\ \mathrm{dosage}\ \mathrm{scores}}{\left(\mathrm{Number}\ \mathrm{of}\ \mathrm{accessions}\times 4\right)}\times 100 $$
(7)
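Eq. 7 translates directly into code. The sketch below is a minimal Python version in which the function name and the toy dosage scores are hypothetical.

```python
def snp_allele_frequency(dosages, ploidy=4):
    """SNP allele frequency (%) from dosage scores, following Eq. 7.

    dosages : allele copy number per accession (0-4 for a tetraploid)
    ploidy  : number of homologous chromosome sets (4 for tetraploid potato)
    """
    return sum(dosages) / (len(dosages) * ploidy) * 100


# Toy panel of eight tetraploid accessions (hypothetical dosage scores).
toy_frequency = snp_allele_frequency([0, 1, 0, 2, 0, 0, 1, 0])
print(f"SNP allele frequency = {toy_frequency}%")
```

In this toy panel, four allele copies out of 32 chromosomes give a frequency of 12.5%.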

Results

Phenotypic variation of the traits (BLUEs)

The panel displayed variation for the BLUEs of protein content, tuber under-water weight (UWW) and protein content in potato fruit juice (PFJ) (Fig. 1). Protein content ranged from 0.73 to 1.72% (w/w). BLUEs for UWW ranged from 270 to 572 g 5 kg−1 and BLUEs for protein content in PFJ ranged from 0.89 to 2.44% (w/v). The broad sense heritability estimates (H2) for protein content, protein content in PFJ and UWW were 48%, 58% and 81%, respectively (Table 1). To evaluate correlations between the traits, scatterplots for the phenotypic values (BLUEs) were evaluated. Moderate to high correlations (P < 0.001) were observed for the phenotypic BLUEs (Fig. 1) (protein content versus UWW: r = 0.639; UWW versus protein content PFJ: r = 0.746; protein content versus protein content PFJ: r = 0.988).

Fig. 1 Distributions and scatterplots of the phenotypic values (BLUEs). Distributions for a protein content (% w/w), b tuber under-water weight (UWW) (g 5 kg−1) and c protein content in potato fruit juice (PFJ) (% w/v). Scatterplots for d protein content versus UWW, e UWW versus protein content in PFJ and f protein content versus protein content in PFJ. The linear regression lines are shown in blue. SD, standard deviation

Table 1 Variance components and heritability estimates for the traits

Population structure

Population structure of the panel was analysed using SNP marker data from the array. Three clusters (sub-populations) were characterized (Supplementary Fig. 2) and were denoted as “Processing” (N = 35), “Other” (N = 136) and “Starch” (N = 106). As the sub-populations showed unequal trait values (one-way ANOVA, P = 2.46 × 10−3; Supplementary Fig. 1), protein content was found to be confounded with population structure. Likewise, the Q-Q plot for naive GWAS on the panel showed inflated probabilities (Supplementary Fig. 3), which may have been caused by population structure. Therefore, a kinship-corrected GWAS was also performed to identify QTLs for protein content (Fig. 2).

Fig. 2 Manhattan plot for kinship-corrected GWAS on the panel (N = 277) with quantile-quantile (Q-Q) plot for the observed versus expected probabilities. Top (red) line indicates the Bonferroni threshold at 5.3. Lower (blue) line indicates the Li and Ji threshold at 3.9

Identification of QTLs

To identify QTLs for protein content, a kinship-corrected GWAS was carried out with 14,436 SNPs using the panel of 277 accessions. Three QTLs were identified above the Li and Ji threshold (−log10(P) = 3.9) on chromosomes 3, 5 and 7 (Fig. 2), which each explained 9–12% of the phenotypic variance (R2) (Table 2). The strongest association, which also exceeded the Bonferroni threshold, was found at the start of chromosome 5 at 4.71 Mbp (−log10(P) = 5.84; R2 = 0.11). The end of chromosome 3 harboured a QTL at 60.84 Mbp (−log10(P) = 4.07; R2 = 0.11). A third QTL was positioned at the end of chromosome 7 at 50.15 Mbp (−log10(P) = 3.97; R2 = 0.12). The naive GWAS identified significant QTLs on all twelve potato chromosomes (Supplementary Fig. 3), but these were expected to be false-positive associations because the trait values were confounded with population structure in the panel (Supplementary Fig. 1). To dissect the potential year (season) effects, correlation analysis and GWAS were performed on the BLUEs for 2008, 2009 and 2010 (Supplementary Figs. 7 and 8). As observed for the panel, GWAS on the BLUEs for 2008 showed associations at the start of chromosome 5. The BLUEs for 2009 again produced associations at the start of chromosome 5, as well as at the end of chromosome 7. For the BLUEs of 2010, associations were found at the ends of chromosomes 2 and 11. Moderate to high correlations were observed for the BLUEs between the individual years. As the raw values for UWW and protein content in PFJ were used to correct the values for protein content, GWAS were performed on these two traits as well (Supplementary Fig. 6).

Table 2 Results from kinship-corrected GWAS on the panel (N = 277)

To pinpoint putative candidate genes from the genomic regions underlying the QTLs, linkage disequilibrium (LD)-based QTL support intervals were used as described by Vos et al. (2017). The genes underlying these intervals were retrieved from the potato reference genome (PGSC 2011). From the longlists of genes (Supplementary Table 4), putative candidates were selected based on their annotation (gene name). As a result, the QTL interval on chromosome 3 co-localized with a nitrate transporter (60.09 Mbp). The interval on chromosome 5 harboured StCDF1 (4.54 Mbp) and a cluster of nine nitrate transporters (6.00–7.52 Mbp). No obvious candidate genes could be proposed to be implicated with the QTL on chromosome 7.

Haplotypes underlying QTLs

A contemporary approach by Willemsen (2018) was used to determine the haplotype-specificity of the SNP markers underlying the QTLs. Results showed that all SNPs underlying the QTL on chromosome 5 (that exceeded the Li and Ji threshold) were haplotype-specific (Table 2). These SNPs were haplotype-specific for a late maturity allele of StCDF1 (Supplementary Table 2), as proposed by Willemsen (2018). Moreover, these SNPs also tagged a unique introgression segment from wild potato (Solanum vernei Bitter & Wittm.) as described by van Eck et al. (2017). Over the years, this introgression segment has been used by potato breeders to introduce resistance against Globodera pallida nematodes (the so-called Gpa5 locus) in the genepool of cultivated potato (Rouppe van der Voort et al. 2000; Van Eck et al. 2017). Graphical genotypes of the panel, as performed by van Eck et al. (2017), illustrated that this introgression segment was mainly present in the starch varieties and starch progenitors. For these varieties and progenitors, the introgression segment was found to be present in either simplex (a single copy) or duplex (two copies) form (Supplementary Fig. 4). The SNPs underlying the QTLs on chromosome 3 and 7 were not found to be haplotype-specific.

Variance explained by multiple QTLs

By using multiple linear regression, we tested the cumulative effect of multiple significant SNP markers underlying QTLs together. The SNPs underlying the QTLs on chromosomes 3, 5 and 7 together explained 22% of the variance (Supplementary Table 3). When the SNP on chromosome 5 was excluded, the QTLs on chromosomes 3 and 7 together explained 21%. The combination of SNPs on chromosomes 5 and 7 jointly explained 20%. The QTLs on chromosomes 3 and 5 jointly explained less variance (13%).
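The cumulative variance explained by several SNPs can be illustrated with an ordinary least-squares fit on their dosage scores. The Python sketch below is an assumption-laden example: the function name, the simulated dosages and the simulated effect sizes are all hypothetical, and it does not reproduce the values in Supplementary Table 3.

```python
import numpy as np


def multiple_r_squared(blues, dosage_matrix):
    """R2 of a multiple linear regression of BLUEs on several SNP dosage columns."""
    y = np.asarray(blues, dtype=float)
    x = np.asarray(dosage_matrix, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])  # intercept + one column per SNP
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Simulated toy data (hypothetical): 50 accessions, three SNPs with dosage 0-2.
rng = np.random.default_rng(0)
toy_snps = rng.integers(0, 3, size=(50, 3))
toy_y = 1.0 + 0.10 * toy_snps[:, 0] + 0.05 * toy_snps[:, 2] + rng.normal(0.0, 0.1, 50)

r2_all = multiple_r_squared(toy_y, toy_snps)            # all three SNPs
r2_pair = multiple_r_squared(toy_y, toy_snps[:, [0, 2]])  # a pair of SNPs
print(f"R2 (three SNPs) = {r2_all:.2f}, R2 (two SNPs) = {r2_pair:.2f}")
```

In such a fit, adding a SNP can never lower R2, so a joint R2 well below the sum of the single-SNP R2 values indicates that the SNPs explain overlapping parts of the variance.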

Sub-population QTLs

In an attempt to circumvent the confounding effect of population structure in the panel, GWAS was performed on the sub-populations “Starch” (N = 106) and “Other” (N = 136). The sub-population “Processing” was not included as it consisted of a relatively small number of individuals (N = 35). Kinship-corrected GWAS on the sub-population “Starch” identified one QTL above the Li and Ji threshold at the end of chromosome 3 (R2 = 0.15) (Fig. 3; Table 3). The QTL peak caused by SNP marker PotVar0020225 was positioned 0.708 Mbp north of the QTL identified in the panel (Table 2). The naive GWAS on the sub-population “Starch” did not identify significant QTLs, although noticeable associations were found slightly below the thresholds at the end of chromosome 3 (Supplementary Fig. 5). Kinship-corrected GWAS on the sub-population “Other” identified a QTL at the start of chromosome 5 (R2 = 0.15) (Fig. 3). This sub-population QTL was positioned 0.975 Mbp (Table 3) south of the QTL found in the panel, which was introgressed from wild potato into the starch varieties and starch progenitors as a source of resistance against nematodes (Table 2). In the sub-population “Other”, the SNPs tagging this introgressed haplotype had allele frequencies below the minor allele frequency (MAF) threshold of 1.5%. Therefore, this haplotype remained unnoticed and could not give rise to a QTL in this sub-population.

Fig. 3 Manhattan plots for kinship-corrected GWAS on sub-populations a “Starch” (N = 106), b “Other” (N = 136) and c “Other” (N = 136) by including SNP marker “PotVar0079081” as a cofactor for early maturity (StCDF1.1). Quantile-quantile (Q-Q) plots for the observed versus expected probabilities are shown on the right. Top (red) line indicates the Bonferroni threshold at 5.3. Lower (blue) line indicates the Li and Ji threshold at 3.9

Table 3 Results from kinship-corrected GWAS on sub-populations “Starch” and “Other”

To verify whether the QTL at the start of chromosome 5 was associated with plant maturity (StCDF1) (Kloosterman et al. 2013), we performed conditional kinship-corrected GWAS on the sub-population “Other” by using the SNP marker “PotVar0079081” as a cofactor that tags the early maturity allele (StCDF1.1), as described by Willemsen (2018). This approach reduced the significance of the original QTL at the start of chromosome 5 (from −log10(P) = 4.46 down to 3.19) (Fig. 3). This finding suggested that the maturity score of potato varieties, as largely controlled by StCDF1.1, indirectly influenced protein content in this sub-population. By performing the cofactor analysis, an otherwise masked QTL was uncovered at the end of chromosome 12 (peak SNP: “PotVar0052807”; 59,294,858 bp; −log10(P) = 4.63). Naive GWAS on the sub-population “Other” showed inflated associations that probably caused false-positive QTLs on chromosomes 1, 2, 3, 4, 5, 7 and 10 (Supplementary Fig. 5).

Discussion

GWAS as a tool to detect QTLs

We used GWAS to shed light on the complex genetic architecture of protein content in potato. We identified QTLs with minor effects on chromosomes 3, 5, 7 and 12 (Fig. 2; Fig. 3). The QTLs identified on chromosomes 3 and 5 coincided with QTLs reported in previous studies (Acharjee et al. 2018; Klaassen et al. 2019; Werij 2011). For chromosome 3, the QTL identified in the entire panel was also observed in the sub-population “Starch”. For chromosome 5, we uncovered an introgression segment from wild potato that was associated with protein content (Supplementary Fig. 4). This introgressed segment harboured a late maturity allele of StCDF1 (Supplementary Table 2), as well as the Gpa5 resistance allele against potato cyst nematodes (Globodera pallida). However, the SNPs tagging this introgression segment did not bring forth a QTL in the sub-population “Starch”, even though the allele frequency of these SNPs in this sub-population was considerable (9–10%). We also observed that the additive effect of this QTL was lower than expected when combined with the other two QTLs on chromosomes 3 and 7 (Supplementary Table 3). We showed that protein content was confounded with population structure in the panel. This result was likely caused by higher BLUEs for protein content in the sub-population “Starch” (Supplementary Fig. 1). Therefore, we propose that the QTL on chromosome 5 in the panel could be an artefact. Validation studies, for instance using bi-parental mapping populations, may confirm the relevance of SNPs underlying this QTL for use in breeding to improve protein content. If these SNPs are to be used for breeding, they will at least provide a source of resistance against cyst nematodes and contribute towards a later maturity index due to StCDF1. In the sub-population “Other” we also identified a QTL at the start of chromosome 5. Conditional GWAS on this sub-population showed that this association was not caused by the introgression segment from wild potato. Instead, this QTL coincided with the early maturity allele of StCDF1 (StCDF1.1). Findings from GWAS on the panel as well as the sub-populations showed that different haplotypes at the start of chromosome 5 were associated with protein content.

To the best of our knowledge, the identified QTLs on chromosomes 7 and 12 have not been described before in the literature. Bi-parental populations that descend from crosses between protein-rich varieties can be used to validate and stack multiple copies of favourable alleles for multiple protein content QTLs simultaneously. For instance, the cross between the starch varieties Kartel × Seresta will allow the SNPs underlying all three QTLs identified in the panel here to segregate in nulliplex (null), simplex (one) and duplex (two) dosages in the F1 progeny. This cross will provide improved insight into the cumulative effects of the underlying haplotypes. Our results, as presented in Supplementary Table 3, suggest both additive and epistatic effects of the SNPs (alleles). We observed that the effects of genotype-by-environment (G × E) interactions were small to moderate for protein content (Table 1). On the other hand, a large proportion of variance was ascribed to the residuals (error). Hence, future genetic studies on protein content may be improved by reducing the residual error in these experiments.

Missing heritability

Studies in soybean, wheat and maize describe protein content as a complex trait that is governed by multiple genes and environmental factors. We estimated a moderate trait heritability for protein content (H2 = 0.48). This H2 value lies within the range of 40–74% reported in previous studies (Klaassen et al. 2019; Werij 2011). GWAS on the panel identified three QTLs that cumulatively explained 22% of the variance. Hence, we demonstrate a clear example of missing heritability. Several factors may have contributed to this finding, including the limited statistical power to detect loci with small effects, interactions between loci, effects of rare variants and the potential loss of true-positive QTLs due to kinship correction. Alternatively, overestimation of the broad sense heritability estimate (H2) may also have occurred. In any case, it should be noted that our H2 will be much larger than the narrow sense (h2) estimate.

To optimize the detection of QTLs by GWAS, the design and methodology should be considered carefully. Using more individuals will likely increase statistical power, as shown in numerous human and crop genetic studies, e.g. for soybean (Bandillo et al. 2015). Optimization of GWAS will likely identify loci with minor effects or those caused by rare variants with a low allele frequency. Certainly the population structure, distribution of the phenotypic values, as well as the ascertainment bias of SNPs in marker arrays should be considered beforehand as proposed by Vos (2016).

Correlation between tuber protein content and under-water weight

For other crops, a negative correlation is often observed between protein content and other major (seed) storage compounds, e.g. oil content in soybean (Patil et al. 2017). Interestingly, while expecting a similar trade-off in potato, we found a moderate positive correlation (r = 0.64) between protein content and under-water weight (UWW: a proxy for starch content) (Fig. 1). Therefore, selection pressure for high UWW in the starch genepool, aimed at increasing starch content, may have coincided with unconscious selection for high protein content (Supplementary Fig. 1). Kinship-corrected GWAS on UWW in the panel did not identify associations between UWW and maturity alleles of StCDF1 at the start of chromosome 5 (Supplementary Fig. 6). The statistical power provided by the 277 individuals here may have been insufficient to uncover significant signals due to the complex (polygenic) genetic architecture of starch content in potato. A positive correlation between protein content and UWW suggests that these traits may be (partly) interrelated due to shared biological mechanisms. It is well established that photosynthesis-derived carbon and nitrogen assimilation pathways are connected and tightly controlled in plants. Molecular studies have shown that intracellular glucose is used by plants to synthesize both protein and starch (Bihmidine et al. 2013). Reduced levels of ADP-glucose (i.e. the glucosyl donor for starch synthesis) caused by inactivated ADP-glucose pyrophosphorylase (AGPase) in barley mutants were accompanied by the downregulation of genes related to amino acid and storage protein biosynthesis (Faix et al. 2012). Therefore, the genes that regulate protein content in potato may also affect starch content. Unravelling the basis of the positive correlation between protein and starch content in potato remains to be addressed in future studies.

Putative candidate genes for protein content

To pinpoint putative candidate genes, we used LD-bound QTL support intervals to narrow down the genomic regions. This approach identified several candidates that included StCDF1 (maturity) and nitrate transporters (Supplementary Table 4). Conditional GWAS on the sub-population “Other” showed that a late maturity allele of StCDF1 was positively associated with protein content. Nitrate transporters are known to function in the uptake and allocation of inorganic nitrate (NO3−) in plants (Hsu and Tsay 2013; Léran et al. 2014). Nitrate is the predominant nitrogen-containing macronutrient in aerobic soils under temperate climatic conditions. Hence, allelic variants of nitrate transporters may differ in nitrate uptake and in their interaction with nitrogen-responsive genes that ultimately affect protein content, as proposed for rice (Hu et al. 2015). Future molecular studies on the above-mentioned candidate genes, including gene expression, overexpression and knock-out studies, are needed to determine their biological functions and effects on protein content in potato.