Introduction

Determining how much genetic variance is present in phenotypic traits is a crucial step in understanding their adaptive evolution (Fisher 1930; Mousseau and Roff 1987). Multiple measures are used to assess the evolutionary potential of traits in a population. The most frequently calculated measure is heritability (Mousseau and Roff 1987; Postma 2014), although evolvability may be more adequately measured with the mean-standardized additive genetic coefficient of variation (Houle 1992). Estimating heritability correctly and precisely is difficult: it requires a large sample size and reliable data on the relationships between individuals (Quinn et al. 2006; Morrissey et al. 2007; de Villemereuil et al. 2013). These requirements are especially difficult to fulfil in natural populations, even though results from the wild are essential when studying evolution (Kruuk and Hadfield 2007; Postma 2014).

Animal models are widely used for estimating heritability of different traits, including labile traits, such as behaviour (Stirling et al. 2002; Kruuk 2004; Postma 2014). These models decompose additive genetic variances and environmental variances based on pedigree or other relatedness data (e.g. genetic similarity), and they are very flexible in controlling for confounding effects (e.g. dominance, common environment, maternal effects) (Wilson et al. 2010). Furthermore, if repeated measurements from the same individuals are included in the animal model, it can also discern permanent environmental variance (fixed differences between the individuals due to environmental and/or non-additive genetic effects) apart from additive genetic and residual variance (Kruuk 2004; Wilson et al. 2010). In addition to the additive genetic and residual variance, determining the amount of permanent environmental variance is also essential in predicting the evolutionary response of a trait.

Simulations are a very important source of information for planning studies and for assessing the reliability of studies investigating heritability. Simulations have revealed that the sample size (de Villemereuil et al. 2013; Krag et al. 2013), the magnitude of the true heritability (Charmantier and Réale 2005; de Villemereuil et al. 2013; Krag et al. 2013), the type (genetic or social) (Bourret and Garant 2017) and the quality of the relatedness data (Israel and Weller 2000; Charmantier and Réale 2005; Kruuk and Hadfield 2007; Morrissey et al. 2007; de Villemereuil et al. 2013; Bourret and Garant 2017), the structure of the simulated population (Clément et al. 2001; Kominakis 2008), data missing not at random (Steinsland et al. 2014) and the analytical method (Kruuk and Hadfield 2007; de Villemereuil et al. 2013) can all influence heritability estimates.

However, in spite of the large amount of simulation research on the estimation of heritability (Clément et al. 2001; Morrissey et al. 2007; Bourret and Garant 2017), some aspects of this issue remain less explored. The calculation of heritability may be complicated by the remarkable within-individual variance that is characteristic of many behavioural, physiological and life history traits (Bell et al. 2009; Schoenemann and Bonier 2018; Taff et al. 2018). Within-individual variance has important biological significance, as it determines how well the individual can adapt to changing environmental conditions, which is especially important under recent climate change (Charmantier and Gienapp 2014). Moreover, within-individual variance has an essential influence on the evolution of traits, as it can promote or hinder adaptation (Piersma and Drent 2003; Snell-Rood 2013). There are, however, simulation studies showing how low repeatability (large within-individual variance) influences the estimation of statistical parameters with evolutionary relevance, as it can induce bias in, for example, among-individual and residual variance (Schielzeth et al. 2020). Importantly, the large within-individual variance of labile traits relative to among-individual variance leads to small repeatability, which can be the upper limit of heritability (but see: Dohm 2002); thus, heritability is also expected to be small (see also: Mousseau and Roff 1987; Weigensberg and Roff 1996; Stirling et al. 2002). It has been found repeatedly that lower heritability is more difficult to estimate precisely and accurately (Klein 1974; Krag et al. 2013). However, it is crucial to estimate these small heritabilities precisely; for example, in the song of the collared flycatcher (Ficedula albicollis) we have seen that revealing small but non-zero heritability can have strong theoretical implications, as it can still be the basis of evolution (Jablonszky et al. 2022).
Importantly, in spite of the well-known effect of the magnitude of heritability (Klein 1974; Charmantier and Réale 2005; Raffa and Thompson 2016), the effect of the magnitude of within-individual variance on heritability estimates has not been thoroughly tested. Because heritability is estimated from variance components, we can expect responses similar to the above-mentioned effects on the estimation of among- and within-individual variances (Schielzeth et al. 2020). In previous studies, with low repeatability the among-individual variance was biased upwards and the residual variance was usually biased downwards (Schielzeth et al. 2020). Thus, we predict less accurate (specifically upwardly biased) and less precise heritability estimates when the within-individual variance is large, especially at low sample sizes.
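
As a brief illustration of why repeatability can cap heritability, consider a decomposition into additive genetic variance (Va), permanent environmental variance (Vpe) and within-individual variance (Ve). Assuming no other sources of variance (an assumption made here for illustration only), repeatability R and heritability h2 can be written as:

$$R=\frac{Va+Vpe}{Va+Vpe+Ve},\qquad {h}^{2}=\frac{Va}{Va+Vpe+Ve}\le R$$

For example, with Va = 0.1, Vpe = 0.1 and Ve = 0.8, R = 0.2 and h2 = 0.1, so a highly labile trait necessarily has low heritability under this decomposition.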

Another factor potentially influencing heritability estimation that has received less attention is the number of repeated measurements included in the models. Collecting repeated measurements is common practice during the investigation of labile traits. Including the mean of these repeated measurements in animal models is not appropriate (Wilson et al. 2010; Ge et al. 2017; Risk and Zhu 2018; see also Garamszegi 2016 for other types of models), as the within-individual variance is removed from the variance components (i.e. the uncertainty around the mean estimate is not accounted for), resulting in upwardly biased estimates (Åkesson et al. 2008; Hadfield et al. 2010; Silva et al. 2017). Thus, all repeated measurements should be included in the models (Wilson et al. 2010). Although information on additive genetic variance comes from the data on the relatedness among individuals (thus, the reliability of the estimation depends primarily on the number of individuals), the estimation of other variance components involved in the heritability calculation, such as within-individual variance, is sensitive to the number of repeats used (Royauté and Dochtermann 2021). Thus, it would be worthwhile to investigate whether collecting more measurements from the individuals improves heritability estimation. More repeats mean a higher overall sample size, but also cover a wider range of possible trait values (up to a certain level of sampling), resulting in better estimation of all variance components (Westneat et al. 2020), so we can expect more precise and accurate heritability estimates with a higher number of repeated measurements. Conversely, if the number of repeated measurements is too low, additive genetic and permanent environmental effects cannot be separated reliably (Bourret and Garant 2017).
However, the effect of the number of repeated measurements, especially in the case of labile traits, has received less attention in simulation studies, although we know that repeated measurements can increase power in linkage analyses (Zhang and Zhong 2006; Liang et al. 2009). We are aware of only one study that showed reduced uncertainty around the estimates with an increasing number of repeated measurements (Adams 2014). Additionally, the effect of within-individual variance and the number of repeated measurements could interact. Previously, it was shown that estimation problems arising from low repeatability can be eliminated if an appropriate number of repeated measurements of the same individuals is included in the models (Martin et al. 2011; Dingemanse and Dochtermann 2013; Westneat et al. 2020). Similarly, more repeated measurements are probably necessary for the correct separation of permanent environmental and residual effects when within-individual variance is large (Martin et al. 2011).
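
The upward bias that arises from analysing individual means can be seen from a simple variance argument. The sketch below assumes a decomposition into additive genetic (Va), permanent environmental (Vpe) and within-individual (Ve) variance with n independent repeats per individual; it is an illustration under these assumptions, not a derivation taken from the cited papers. Averaging shrinks only the within-individual term, so the heritability apparent in the means is

$$Var({\overline{y}}_{i})=Va+Vpe+\frac{Ve}{n},\qquad {h}_{mean}^{2}=\frac{Va}{Va+Vpe+Ve/n}\ge \frac{Va}{Va+Vpe+Ve}$$

With Va = 0.1, Vpe = 0.1, Ve = 0.8 and n = 10, the means would suggest a heritability of 0.1/0.28 ≈ 0.36 rather than 0.1.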

In this simulation study, to fill the abovementioned gaps in our knowledge, our aim was to investigate the effect of increasing within-individual variance at different combinations of within- and between-individual sample sizes in animal models. Specifically, we simulated datasets to investigate the effect of the magnitude of the variance components (we varied additive genetic, permanent environmental and within-individual variance from small to large, between 0.1–0.5, 0–0.8 and 0.1–0.8, respectively), as well as of within- and between-individual sample sizes (1–10 repeats from 100 to 1000 individuals), on the estimation of heritability. We compared the error, accuracy, precision and power of these scenarios (see details in the “Methods” section).

Methods

Data simulation

We simulated datasets with all combinations of number of individuals (Ni = 100, 500, 1000) and number of measurements (Nr = 1, 2, 5, 10). The simulated values for the additive genetic variance (Va) and the residual variance (Ve) were 0.1, 0.3 or 0.5, and we also simulated a setting in which Va was 0.1 and Ve was 0.8 to cover the feasible range of values (resulting in 10 different variance settings). Thus, we had 120 different scenarios based on the combination of the parameter settings (10 variance settings × 3 values of Ni × 4 values of Nr). Residual variance usually represents the combined effect of within-individual variance, measurement error and unaccounted environmental variance, but as we included no measurement error or environmental variance in our simulated data, we regard this component (Ve) as within-individual variation in the following. The permanent environmental effect (Vpe) was simulated so that the sum of the variance components was 1; thus its value was between 0 and 0.8. In the models with only one measurement per individual, Vpe and Ve are summed and together represent the residual variance (hereafter Vr).
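
The scenario grid described above can be enumerated in a few lines. The following is a minimal Python sketch (the study itself was run in R); the variable names are illustrative and not taken from the original analysis code.

```python
from itertools import product

# Variance settings described in the text: Va and Ve each in {0.1, 0.3, 0.5},
# plus one extra combination (Va = 0.1, Ve = 0.8).
variance_pairs = list(product((0.1, 0.3, 0.5), repeat=2))
variance_pairs.append((0.1, 0.8))

n_individuals = (100, 500, 1000)  # Ni
n_repeats = (1, 2, 5, 10)         # Nr

scenarios = []
for (va, ve), ni, nr in product(variance_pairs, n_individuals, n_repeats):
    # Vpe fills the total variance up to 1; rounding removes floating-point noise.
    vpe = round(1.0 - va - ve, 10)
    scenarios.append({"Ni": ni, "Nr": nr, "Va": va, "Vpe": vpe, "Ve": ve})

print(len(variance_pairs), len(scenarios))  # 10 120
```

Because Vpe is derived from the other two components, it ranges from 0 (Va = Ve = 0.5) to 0.8 (Va = Ve = 0.1) without being an independent design factor.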

We simulated 100 datasets for each scenario. Running more replicates was not feasible because of the large number of scenarios and the high computational demands of the Bayesian models. As a first step, we built a pedigree in each simulation with the ‘generatePedigree’ function from the ‘geneticsPed 1.56’ package (Gorjanc et al. 2021). For all scenarios, the pedigree was simulated for the appropriate number of individuals with 5 generations (thus the number of individuals per generation was Ni/5), and with Ni/25 dams and sires per generation. For simplicity, the simulated population was assumed to be closed, with completely random mating and non-overlapping generations. To check the effect of pedigree structure on our results, we additionally ran some scenarios with different parameters for pedigree construction (5 generations, but Ni/2 dams and sires, 3 generations, lower number of sires than dams), but these settings did not influence our results qualitatively (see Tables S2–S4). The additive genetic component was simulated with the ‘rbv’ function from the ‘MCMCglmm 2.32’ package (Hadfield 2010), using the value of additive genetic variance appropriate for the scenario. The permanent environmental effect was simulated for all individuals and the within-individual term was simulated for all measurements, each with the variance specified for the scenario. All of these effects were assumed to be normally distributed. The phenotypic value for each measurement was the sum of the population mean (which was arbitrarily set to 1) and the additive genetic, permanent environmental and within-individual components:

$${y}_{ij}=\mu +{a}_{i}+ {p}_{i}+{e}_{ij}$$
(1)
$${a}_{i} \sim N(0, Va)$$
$${p}_{i}\sim N(0,Vpe)$$
$${e}_{ij}\sim N(0,Ve)$$

where yij is the phenotype of the ith individual at the jth repeat, μ is the population mean, ai is the additive genetic effect, pi is the permanent environmental effect and eij is the within-individual effect of the ith individual at the jth repeat.
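
To make Eq. 1 concrete, the following Python sketch generates one dataset on a simplified pedigree. It is not the authors' R code: the pedigree construction (first half of each generation treated as males, second half as females, Ni/25 of each sex chosen as parents) and the breeding-value recursion (parental average plus a Mendelian sampling term with variance Va/2, ignoring inbreeding) are assumptions made for illustration, and the function name is hypothetical.

```python
import random

def simulate_dataset(ni=100, nr=5, va=0.3, vpe=0.2, ve=0.5, mu=1.0, n_gen=5, seed=1):
    """Return (individual id, repeat index, phenotype) records following
    y_ij = mu + a_i + p_i + e_ij on a simple random-mating pedigree."""
    rng = random.Random(seed)
    per_gen = ni // n_gen    # individuals per generation (Ni/5 with 5 generations)
    n_parents = ni // 25     # assumed: Ni/25 sires and Ni/25 dams per generation
    half = per_gen // 2      # assumed: first half males, second half females
    records, prev_a, next_id = [], [], 0
    for gen in range(n_gen):
        if gen > 0:
            # Breeding values of the parents drawn from the previous generation.
            sires = rng.sample(prev_a[:half], n_parents)
            dams = rng.sample(prev_a[half:], n_parents)
        cur_a = []
        for _ in range(per_gen):
            if gen == 0:
                a = rng.gauss(0.0, va ** 0.5)                # founders: a ~ N(0, Va)
            else:
                mendelian = rng.gauss(0.0, (va / 2) ** 0.5)  # ignores inbreeding
                a = 0.5 * (rng.choice(sires) + rng.choice(dams)) + mendelian
            p = rng.gauss(0.0, vpe ** 0.5)                   # permanent environment
            for j in range(nr):
                e = rng.gauss(0.0, ve ** 0.5)                # within-individual term
                records.append((next_id, j, mu + a + p + e))
            cur_a.append(a)
            next_id += 1
        prev_a = cur_a
    return records

data = simulate_dataset(ni=100, nr=5)
print(len(data))  # 500 records: 100 individuals x 5 repeats
```

A dedicated routine such as ‘rbv’ accounts for the full relationship structure of the pedigree, including inbreeding; the recursion above is only the simplest approximation of that step.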

Analysis of the simulated datasets

On the generated data we run animal models with the ‘MCMCglmm’ function from the ‘MCMCglmm 2.32’ package (Hadfield 2010). The models for the scenarios with only one measurement per individual contained only one random factor of individual identity connected to the pedigree:

$${y}_{i}=mu+{a}_{i}+{r}_{i}$$
(2)
$${a}_{i}\sim N(0,Var(a))$$
$${r}_{i}\sim N(0,Var(r))$$

where yi is the estimate for the phenotype of the ith individual, mu is the estimate for the population mean, ai is the estimate of the additive genetic effect and ri is the residual effect of the ith individual. We use ri here because this term includes both permanent environmental and within-individual effects.

For the datasets with repeated measurements, we built models with two random factors for individual identity to separate additive genetic and permanent environmental effects:

$${y}_{ij}=mu+{a}_{i}+ {p}_{i}+{e}_{ij}$$
(3)
$${a}_{i}\sim N(0,Var(a))$$
$${p}_{i}\sim N(0,Var(p))$$
$${e}_{ij}\sim N(0,Var(e))$$

where yij is the estimate for the phenotype of the ith individual at the jth repeat, mu is the estimate for the population mean, ai is the estimate of the additive genetic effect, pi is the estimate of the permanent environmental effect and eij is the estimate for the within-individual effect.

Priors with inverse-Gamma distribution were used for all models. However, we checked the effect of other priors (e.g. parameter expanded prior) for some scenarios and the results remained qualitatively unchanged. The models were run for 110,000 iterations, with the first 10,000 discarded and a thinning interval of 100. Before running all simulations, the trace and distribution of all variables and the autocorrelation between iterations were checked visually for some selected scenarios.
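
For orientation, these chain settings imply the following number of stored posterior samples per model, assuming that the discarded part corresponds to the ‘burnin’ argument of ‘MCMCglmm’ and that one sample is kept every ‘thin’ iterations thereafter:

$$\frac{110{,}000-10{,}000}{100}=1000$$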

From all models, the median of the estimate of heritability (the median of additive genetic variance divided by the sum of all variance components, hereafter h2) and the variance components with their 95% credible intervals (CI) based on the whole posterior distributions were extracted with ‘HPDinterval()’. We did not use the posterior mode as it has been shown to be prone to bias (Pick et al. 2022). To assess whether our heritability estimates can be differentiated from those of a scenario with zero heritability, we also calculated, for each h2 estimate, the percentage of the values of a posterior distribution from a null model (run on a null dataset with Va = 0) that were greater than the actual estimate (Pick et al. 2022). We simulated one null dataset for each scenario in a similar way as described above but with Va = 0 and Ve = Vaactual + Ve (Vaactual is the Va of the focal scenario) to ensure the same overall variance (Pick et al. 2022). The null model was built for this null dataset in the same way as for the original dataset.
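
The two summary steps described above can be sketched as follows. This is an illustrative Python version, not the authors' R code; it assumes that h2 is computed for every posterior sample before the median is taken, and the function names and toy posterior values are hypothetical.

```python
import statistics

def heritability_posterior(va_samples, other_samples):
    """Per-sample h2 = Va / (Va + sum of the remaining variance components)."""
    return [va / (va + sum(rest)) for va, rest in zip(va_samples, other_samples)]

def null_exceedance(h2_estimate, null_h2_samples):
    """Proportion of null-model posterior h2 values greater than the estimate."""
    return sum(h > h2_estimate for h in null_h2_samples) / len(null_h2_samples)

# Toy posterior draws: Va, and (Vpe, Ve) for the same three MCMC samples.
va_draws = [0.28, 0.31, 0.35]
other_draws = [(0.22, 0.50), (0.19, 0.50), (0.15, 0.50)]
h2_draws = heritability_posterior(va_draws, other_draws)
h2_median = statistics.median(h2_draws)

# Toy h2 draws from a null model fitted to data simulated with Va = 0.
null_draws = [0.01, 0.02, 0.05, 0.40]
print(round(h2_median, 2), null_exceedance(h2_median, null_draws))  # 0.31 0.25
```

In this toy case one of four null draws exceeds the estimate, so the estimate would not count as distinguishable from zero heritability at the 0.05 threshold.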

Performance metrics

Measures of estimation error, accuracy, precision and statistical power were calculated for all scenarios for the heritability estimates, and the first three measures were also calculated for the variance components (these latter results can be seen in the Supplementary material Figs. S1–S3). Specifically, we measured estimation error as the root mean square error (RMSE), and accuracy as absolute relative bias (we use these specific terms hereafter). RMSE (a measure of estimation error, often termed accuracy, but also reflecting precision) is the square root of the average squared difference between the generating value of the parameter (p) and the estimated parameter (\(\widehat{p}\)), where n is the number of simulated datasets (as used in de Villemereuil et al. 2013; Schielzeth et al. 2020):

$$RMSE= \sqrt{\frac{{\sum }_{i=1}^{n}{({\widehat{p}}_{i}-p)}^{2}}{n}}$$
(4)

Thus, we obtained one value for each scenario reflecting the average difference of the estimates from the original simulated values. High values indicate high estimation error, and values close to zero indicate good estimation.
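
As a small worked example of Eq. 4, the following Python sketch computes RMSE for the estimates of one scenario; the function name and the toy estimates are illustrative only.

```python
import math

def rmse(estimates, true_value):
    """Root mean square error of replicate estimates around the generating value."""
    squared_errors = [(est - true_value) ** 2 for est in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two toy heritability estimates for a scenario generated with h2 = 0.3.
print(round(rmse([0.2, 0.4], 0.3), 3))  # 0.1
```

Because the errors are squared before averaging, RMSE grows both when the estimates are shifted away from the generating value and when they scatter widely around it.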

Accuracy was assessed as the absolute relative bias (Pick et al. 2022):

$$relative\;bias= \frac{1}{n}{\sum }_{i=1}^{n}\frac{\left|{({\widehat{p}}_{i}-p)}^{2}\right|}{p}$$
(5)

Thus, accuracy also resulted in one averaged value per scenario.

Precision was calculated as the inverse of the standard deviation of the heritability estimates of each run of the scenario (as used in Pick et al. 2022). The distribution of the point estimates reflects the expected distribution of the heritability estimates of 100 replicated studies.
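
A minimal Python sketch of this precision measure is given below. Whether the sample or the population standard deviation was used is not stated in the text; the sample standard deviation is assumed here, and the function name is illustrative.

```python
import statistics

def precision(estimates):
    """Inverse of the (sample) standard deviation of replicate estimates."""
    return 1.0 / statistics.stdev(estimates)

# Tightly clustered estimates give a higher precision than scattered ones.
tight = [0.29, 0.30, 0.31]
scattered = [0.10, 0.30, 0.50]
print(precision(tight) > precision(scattered))  # True
```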

Statistical power was assessed by comparing our estimates to the estimates of a null model (see above). Specifically, we calculated the proportion of the h2 values of a posterior distribution from a null model that were greater than the original estimate. Statistical power was then equal to the proportion of the simulations in which this proportion was lower than 0.05 (these estimates will be referred to as significant). Note that every performance measure resulted in one value per scenario.
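
The power calculation reduces to counting how many replicates fall below the 0.05 threshold. A short Python sketch follows, with a hypothetical function name and toy inputs.

```python
def statistical_power(exceedance_proportions, alpha=0.05):
    """Share of replicates whose null-exceedance proportion is below alpha."""
    significant = [p < alpha for p in exceedance_proportions]
    return sum(significant) / len(significant)

# Toy null-exceedance proportions from four replicate datasets of one scenario.
print(statistical_power([0.0, 0.01, 0.2, 0.04]))  # 0.75
```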

All statistical analyses were performed in the R 3.6.1 statistical environment (R Core Team 2019).

Results

The RMSE values for the h2 estimates were the highest (indicating poor performance) when the number of individuals was 100 (Fig. 1, first row). RMSE decreased by 25% on average when 10 measurements were included instead of one, but at Ni = 100 this occurred only when Va = 0.5, Vpe was 0.2 or 0 and Ve was 0.3 or 0.5 (30 and 80% decrease, respectively, Fig. 1a). Even in these cases, using 10 measurements did not have an advantage over using 5 measurements. Apart from these scenarios, RMSE was influenced by the magnitude of Va: scenarios with a higher true value of Va had higher RMSE. There was even a 2.65-fold increase in RMSE between the Va = 0.1 and Va = 0.5 scenarios when Ni = 100 (Fig. 1a).

Fig. 1

Root mean square error (RMSE), relative bias, precision and power for the heritability (h2) estimates are displayed for all scenarios, separately for the models with 100, 500 and 1000 between-individual sample size. The corresponding additive genetic variance (Va) values used in the simulations are depicted by colours and point types and within-individual variance (Ve) by shades of the respective Va value as can be seen in the legend. Vpe values were simulated in a way that the sum of all variance components became one

Precision was also low at Ni = 100 but showed different patterns when the between-individual sample size was higher (Fig. 1, second row). At Ni = 500, precision was the highest when Va = 0.1, and it dropped sharply, by 70 and 80% respectively, for the scenarios where Ve was 0.3 or 0.8 when even one repeated measurement was included (Fig. 1e). However, in the scenario of Ni = 500, Va = Ve = 0.5 and Vpe = 0, the precision of h2 estimates increased 4.59-fold when 10 measurements were included instead of one. This scenario displayed the same behaviour when Ni was 1000, along with the scenario of Va = 0.1 and Ve = 0.8 (Fig. 1f). In these cases, 10 measurements resulted in better precision than 5.

Regarding relative bias, using at least 2 measurements caused a marked improvement at Ni = 100, Va = 0.5, Vpe = 0.2 or 0 and Ve = 0.3 or 0.5 relative to the models with only one measurement (40 and 75% decrease in relative bias, respectively, Fig. 1g). At Ni = 500, relative bias decreased on average by 25% when 10 measurements were included instead of one and showed a marked decrease of 60% in the scenario where Va = 0.1 and Ve = 0.8 (Fig. 1h). At Ni = 1000, more scenarios with Va = 0.1 showed a decreasing tendency of relative bias with the number of measurements (Fig. 1i). Some scenarios across all sample sizes showed very slightly increased bias when only two measurements were included in the models compared to the one-measurement model.

The statistical power to detect significant h2 estimates increased on average by 40% when 10 measurements were included instead of one (Fig. 1, fourth row). This increase also depended on the magnitude of the Ve component: it was larger when Ve was higher (5% increase when Ve = 0.1 and 800% increase when Ve = 0.8 across all sample sizes and scenarios for the other variance components). The improvement of power relative to models with one measurement was as high as 161% for models with 2 measurements at Ni = 100 (Fig. 1j), but 5 measurements provided an additional advantage when Ni was higher. This applied only when Va = 0.1 (an increase of 47%), because the other scenarios had very high power (80–100%) at these higher sample sizes (Fig. 1k, l).

Additionally, the exact values of all performance estimates for all scenarios (Table S1) and the mean, the standard deviation and the average 95% CI width of the estimates for h2 and the variance components (Table S5) can be seen in the Supplementary material. Table S5 shows that heritability is usually underestimated. However, it is overestimated in most of the Va = 0.5, Vpe = 0.2, Ve = 0.3 scenarios with repeated measurements (with the exception of the Ni = 1000 and Nr = 10 scenario), and in half of the Va = 0.3, Vpe = 0.2 and Ve = 0.5 scenarios with repeated measurements. In the one-measurement models, if biased, Va was underestimated and Vr was overestimated. In the models with repeated measurements, the bias came from the poor separation of Va and Vpe (usually underestimation of Va and overestimation of Vpe, except in the above-mentioned scenarios where the pattern was reversed), as Ve was estimated relatively well in these models.

Overall, the scenario of Va = Ve = 0.5 and Vpe = 0 had the least bias under all sample size scenarios and the highest precision (if the number of measurements was at least five). The other scenarios with Va = 0.5 and the scenarios with Va = 0.3 at Ni = 500 or 1000 also had low relative bias, but did not show higher precision than the rest of the scenarios.

Discussion

Our simulation results highlight the need to consider collecting repeated measurements when investigating heritability. In most of the scenarios, using at least two measurements offered some advantage over using only one measurement in terms of accuracy and/or precision. For instance, in the scenario of Ni = 100, Va = Ve = 0.5, relative bias decreased by 75% and precision showed a 2.43-fold increase with at least two measurements. Within-individual variance should also be taken into account when planning studies on heritability, as the magnitude of this variance component influenced the effect of the repeated measurements on RMSE, relative bias, precision and statistical power of the heritability estimates. Models with 500 or 1000 individuals usually yielded estimates with low RMSE and relative bias and high precision, apart from scenarios with low heritability (h2 = 0.1), where bias was considerably higher and power was lower. Biased estimates were usually underestimates. Although using 100 individuals seems to be insufficient to estimate heritability reliably, taking repeated measurements when the between-individual sample size is higher can increase accuracy and power (and sometimes also precision), especially in highly labile traits (i.e. high Ve).

The heritability estimates of the models including only one measurement per individual were influenced by the between-individual sample size and the magnitude of the true heritability. Generally, the models with one measurement yielded precise heritability estimates with low RMSE and relative bias at a sample size of 500 or 1000 individuals (aside from high relative bias for some scenarios with Va = 0.1, see below). These results were expected based on the sample size recommendations of previous studies of at least 200, but possibly 300–1000 individuals (Quinn et al. 2006; de Villemereuil et al. 2013; Krag et al. 2013). The accuracy and precision of heritability estimation also depended on the true heritability value, in a similar way as was found previously. In a comprehensive simulation study using 200 or 1000 individuals, the true value of heritability (0.1, 0.3, 0.5) also influenced the RMSE of the heritability estimates, as estimates had less estimation error (i.e. lower RMSE) at 0.1 heritability (de Villemereuil et al. 2013). Another simulation study found higher bias for 0.1 than for 0.4 true heritability values when relying on 20–100 broods as sample size (Charmantier and Réale 2005), and these results also generally agree with our findings related to relative bias. Note that the trends in RMSE and in relative bias with the true heritability value were opposite to each other both in previous papers and in our study. This emphasizes the need to investigate multiple performance metrics in simulation studies. We also investigated precision, and we found that heritability estimates were generally more precise when their generating value was low (thus, the previously mentioned RMSE values may reflect the higher precision of the estimates). However, higher precision for lower heritability may be only a consequence of the fact that variance components are constrained to be positive (de Villemereuil et al. 2013; Krag et al. 2013). Nevertheless, Krag et al. (2013) demonstrated based on simulations that for reliable estimates of heritability over 0.15, sample sizes larger than 400 individuals are needed. However, in our study, we found that the estimation of a heritability of 0.1 can still have high (usually downward) relative bias and low statistical power with between-individual sample sizes of 500 or 1000. This fact is important to consider, as for example regarding behavioural traits, heritability estimates are often low, but sample size is usually below 1000 (heritability estimates were between 0.05 ± standard error: 0.02 and 0.21 ± 0.07, and number of individuals between 81 and 455 in the following papers: Blumstein et al. 2010; Santostefano et al. 2017; Jablonszky et al. 2022). Regarding life history traits, low heritability estimates (0 ± 0.01 or 0.11 ± 0.003) were also reported when investigating more than 1000 individuals (Brommer et al. 2008; Santostefano et al. 2021).

Fortunately, the estimation can be improved by collecting multiple repeated measurements. If we want to accurately separate the additive genetic, permanent environmental and within-individual variances, which could be of interest especially for labile traits, we have to include repeated measurements in the models (Kruuk 2004; Wilson et al. 2010). Furthermore, a previous study found that when the between-individual sample size is large, repeated measurements can lead to more precise heritability estimates (Adams 2014). Although in our simulation precision increased with the number of measurements in only some specific scenarios (usually when Ve was high and Vpe was low), our results indicate that repeated measurements may have other advantages. The quality of heritability estimation of the models that also contained repeated measurements depended on the between-individual sample size and on the magnitude of the true heritability as described previously, but collecting 2–5 repeated measurements usually led to 9 and 16% less biased (and in some scenarios more precise, as was mentioned previously) estimation of heritability. Using 10 measurements only offered an advantage in some cases (mostly in two scenarios: when Va = Ve = 0.5, Vpe = 0 and when Va = 0.1, Vpe = 0.1 and Ve = 0.8). Overall, the effect of the number of repeated measurements was substantial when Vpe was very low and Ve was high. The effect of repeated measurements in animal models has received little attention, but we can propose some explanations. If we sample only one measurement from labile traits with high within-individual variability, we may obtain biased results (Boake 1989; Dingemanse and Dochtermann 2013; Niemelä and Dingemanse 2018). If the sampled phenotypic values do not reflect the phenotypic variability of the population well, then genetic effects also become difficult to estimate.
Thus, more repeated measurements facilitate the less biased and more precise estimation of the residual component, which partly reflects the within-individual variance, and presumably also enable the reliable separation of additive genetic, permanent environmental and residual effects, resulting in good estimation of heritability. The first part of this explanation is corroborated by our results, as we found that the underestimation (or overestimation in some specific cases, see Results and Supplementary Table S5) of heritability was due to the poor partitioning of Va and Vpe, while Ve was generally reliably estimated with repeated measurements (see Fig. S2). The better separation of the variance components is also probable based on the generally negative trend between the number of measurements and relative bias in our results (see Fig. 1h, i). Nevertheless, our results highlight that the estimation of heritability including repeated measurements in labile traits (when its expected value is low) is not necessarily biased or imprecise, as the relative bias and precision of heritability estimates in the scenarios of Ve = 0.5 or Ve = 0.8 were comparable to the other scenarios in many cases. Additionally, simulation studies for repeatability (which, like heritability, is a ratio of variance components) recommend 4 repeated measurements with 100 or 200 individuals, which should result in accurate and precise estimates regardless of the value of the generating parameter and the complexity of relationships between the variance components (Dingemanse and Dochtermann 2013; Royauté and Dochtermann 2021). Our results generally echo this suggestion, but suggest that in the case of the estimation of heritability, sampling 100 individuals may be insufficient even if repeated measurements are taken.
However, if the within-individual variance is high and the expected heritability is low, it can be more advantageous to collect 2 measurements from 500 individuals than only one measurement from 1000 individuals.

Our results are of special interest for researchers investigating labile traits, such as behaviour, life history or physiological traits. Heritability of behaviour (usually characterized by high within-individual variation) was repeatedly found to be lower (on average 0.30) than that of morphological traits (0.46), while the heritability of life history (0.26) and physiological traits (0.33) was similar to that of behaviour (Mousseau and Roff 1987; Stirling et al. 2002). Another review with data from wild populations found an average heritability of 0.5 for behavioural traits (Postma 2014). The amount of heritable variation may also depend on whether the behaviour is learnt or not; for example, characteristics of innate calls (0.07 ± 0.05–0.38 ± 0.11, on average 0.21 ± 0.08) had higher heritability than learned song traits (0.03 ± 0.05–0.28 ± 0.09, on average 0.12 ± 0.07) in zebra finches (Taeniopygia guttata) (Forstmeier et al. 2009). Furthermore, specific studies on the heritability of behaviour that used multiple measurements from individuals usually found very low values, e.g. 0.26 (95% confidence interval (CI): 0.01–0.55) for aggressiveness (2854 tests from 679 individuals) in great tits (Parus major) (Araya-Ajoy and Dingemanse 2017), 0.06 (95% CI: < 0.01–0.17) and 0.10 (95% CI: < 0.01–0.31) for song traits (3582 songs from 81 individuals) in the collared flycatcher (Jablonszky et al. 2022), 0.21 ± 0.07 for locomotor performance (341 tests from 187 individuals) and 0.08 ± 0.04 for vigilance (1237 tests from 315 individuals) in yellow-bellied marmots (Marmota flaviventris) (Blumstein et al. 2010) and 0.05 ± 0.02 for aggressiveness (1195 tests from 455 individuals) in Mediterranean field crickets (Gryllus bimaculatus) (Santostefano et al. 2017). Regarding life history traits, heritability estimates close to 0 ± 0.01 were found for fecundity in Eastern chipmunks (Tamias striatus, 1540 individuals) (Santostefano et al. 2021), estimates for clutch size ranged between 0.15 and 0.45 in great tits (657–6156 records from 493 to 4077 individuals) and between 0.10 and 0.25 in mute swans (Cygnus olor; 430–2161 records from 208 to 509 individuals) (Quinn et al. 2006), and an estimate of 0.11 ± 0.003 was reported for laying date (11,624 observations from 2262 individuals) in common gulls (Larus canus) (Brommer et al. 2008). Heritability of various morphological traits (characterized by low within-individual variability) was found to be 0.14 ± 0.04–0.42 ± 0.04 (1620–3335 measurements from 720 to 1448 individuals) in house sparrows (Passer domesticus), 0.15 ± 0.05–0.29 ± 0.07 (1923–1981 measurements from 790 to 800 individuals) in collared flycatchers (Silva et al. 2017), 0.05 ± 0.10–0.72 ± 0.03 (302–456 individuals) in great reed warblers (Acrocephalus arundinaceus) (Åkesson et al. 2008) and 0.26 ± 0.04–0.47 ± 0.07 (2247–2564 measurements from 803 to 891 individuals) in traits of adult sheep (Bérénos et al. 2014) when repeated measurements were included. Thus, many low and non-significant heritability estimates have been reported for behavioural and life history traits, which underlines the importance of our present findings on the high relative bias in low heritability estimates even when 500 or more individuals are included. Although many of these studies yielded imprecise and non-significant results even with large sample sizes and multiple measurements, we still recommend measuring more individuals and collecting more repeated measurements because, according to our simulations, these can improve precision and statistical power in some scenarios when within-individual variance is high.

However, it should be noted that repeated measurements did not always improve the quality of the estimation, and in a few cases they even decreased precision when the precision of the single-measurement models was extremely high (see Fig. 1e, deep blue triangles). In these cases, precision still remained relatively high with repeated measurements, and the high precision may have been caused by the Va estimates of the models being stuck at zero, as Va was 0.1 in these models. In many scenarios, repeated measurements had neither a positive nor a negative effect on the performance metrics. This may have multiple potential explanations. Despite the large overall sample size, using 100 individuals leads to biased and imprecise heritability estimates; thus, it seems that the repeated measurements could not compensate for the low number of individuals. At this small sample size, bias decreased and power increased with the number of repeated measurements only when the true heritability was high (and thus relatively easily estimated; Klein 1974; Charmantier and Réale 2005; Krag et al. 2013) and the within-individual variance was also high (and permanent environmental effects were low). On the other hand, when the between-individual sample size was large and the true heritability was high, the estimates were unbiased, so repeated measurements could not offer further improvement, at least in terms of bias and power. However, repeated measurements can still improve the estimation even at these large sample sizes when heritability is low and consequently the accuracy and power of estimation are low.
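One possible intuition for why repeated measurements help mainly when within-individual variance is high can be sketched with simple variance arithmetic. The snippet below is only an illustration and not the animal model used in our simulations: the function names, the symbols Vpe (permanent environmental variance) and Vr (within-individual, residual variance), and all numerical values other than Va = 0.1 are assumptions chosen for the example. It assumes that the n measurements of an individual have independent residuals, so that averaging them shrinks the within-individual variance to Vr / n while leaving Va and Vpe unchanged.

```python
def heritability(va, vpe, vr):
    """Heritability of a single measurement: Va / (Va + Vpe + Vr)."""
    return va / (va + vpe + vr)


def additive_share_in_mean(va, vpe, vr, n):
    """Share of additive genetic variance in the mean of n measurements.

    Assumes independent residuals, so the within-individual variance of
    the individual mean is Vr / n; Va and Vpe are not reduced by averaging.
    """
    return va / (va + vpe + vr / n)


# Hypothetical scenario: low heritability, high within-individual variance.
va, vpe, vr = 0.1, 0.1, 0.8
for n in (1, 2, 5, 10):
    share = additive_share_in_mean(va, vpe, vr, n)
    print(f"n = {n:2d}: additive share in individual mean = {share:.3f}")
```

Under these assumptions the additive share rises with n but cannot exceed Va / (Va + Vpe), which may help explain why additional measurements bring little benefit when the within-individual variance is already small relative to the between-individual variance.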

We note that, although we considered 120 scenarios in our study, we could not investigate all potential factors that may influence the accuracy of heritability estimation. Further studies may explore the effects of relatedness and pedigree errors (Charmantier and Réale 2005; de Villemereuil et al. 2013; Krag et al. 2013), unequal sampling and various distributions of the response variable on the estimation of heritability (Schielzeth et al. 2020).

In sum, heritability estimates were influenced by the interaction of several factors: the between-individual and within-individual sample sizes and the true values of the additive genetic and within-individual variances. Specifically, heritability can be estimated more precisely and with less bias if 2–10 repeated measurements are taken of the focal trait, and this effect can still be significant at larger sample sizes (more than 500 individuals) if the true heritability is low. This advantage is particularly important if the within-individual variance is high, as in behavioural traits. Thus, we recommend (i) collecting data from more than 100 individuals, (ii) collecting 2–5 repeated measurements, or even 10 measurements if within-individual variance is expected to be extremely high, when the number of sampled individuals is around 500, and (iii) collecting repeated measurements when the number of individuals is around 1000 only if heritability is expected to be low and within-individual variation is expected to be high.