Background

Parent-progeny relationships, whether among individuals within in situ natural populations or ex situ genetic resource collections, are of fundamental interest to plant and animal breeders, molecular ecologists, and population geneticists. As empirical records of gene flow, pedigrees provide insight into a species’ mating system [1], including patterns of compatibility within and among gene pools [2]. In plant improvement programs, pedigrees can directly inform breeding strategies [3, 4] by facilitating the estimation of breeding values [5, 6], heritabilities [7], and relative combining abilities [8, 9]. Knowledge of family structure can also help rationalize germplasm collections [10,11,12] and guide the management of natural resources [13,14,15], including strategies for reintroducing captive stock to their natural habitats [16, 17].

The basic theoretical principle underlying parentage analysis is that parent(s) can be assigned to their respective progeny with a certain level of confidence based on the signature of genetic compatibility between generations. In other words, Mendelian laws of inheritance permit the inference of genealogical relationships, provided one has a sufficiently informative set of genetic markers that stably transmits from parents to offspring [18]. Over the years, parentage analyses have used various classes of molecular markers for this purpose, including simple sequence repeats (SSRs), variable number tandem repeats (VNTRs), amplified fragment length polymorphisms (AFLPs), and restriction fragment length polymorphisms (RFLPs). Of these, SSRs have long been held as the most appropriate markers for such analyses due to their co-dominant nature, their high polymorphic content per locus, and their relative ease of scoring [19]. Recently, however, SSR genotyping has become less common, particularly in heretofore unstudied species, due to the comparative advantages of high-throughput, sequence-based genotyping methods.

High marker number and density, genome-wide coverage, ever falling cost per datapoint, and ongoing innovation in bioinformatic pipelines [20,21,22,23,24,25] have made sequence-based markers, particularly single nucleotide polymorphisms (SNPs), the current standard platform for genotyping in both model and non-model species [26]. The majority of available parentage analysis tools were originally developed for SSR data [13, 18], with an assumption of relatively small datasets (dozens to hundreds of data points). Although both SSRs and SNPs are co-dominant markers, such tools are unable to make efficient use of genome-wide SNP data (thousands to hundreds of thousands of data points). While some more recent parentage analysis algorithms have been developed to deal with such large datasets [27,28,29,30], all require some a priori knowledge of family structure for their implementation. That is, one must specify, at least, the basic generational structure (i.e. which lines are offspring and which are potential parents) up front in order to perform a robust parentage test. For species whose individuals are particularly long-lived (e.g. trees), difficult to age (e.g. woody lianas), or inbred long ago (e.g. many landraces of cereals), even such minimal information may be unavailable.

There is a rich history of developing relationship inference methods outside of the plant sciences, particularly in the context of both human and natural animal populations [13, 31,32,33,34]. Accurate knowledge of family structure among human subjects is critical to the unbiased assessment of linkage between genetic markers and diseases. Indeed, common relationship misclassifications due to false paternity assignments, unrecorded adoptions, or sample switches can lead to a loss of power in association studies [33, 35]. Several methods have been developed to address this issue; but it is worth noting that all are based on maximum likelihood and/or Bayesian approaches that require a priori knowledge of generational classifications, parental genders, putative pedigrees, family groups, and/or marker linkage [35, 36].

There remains, therefore, a need for a simple and robust parentage analysis tool that makes efficient use of large genomic datasets and requires no prior information about family structure. The ‘apparent’ package was developed with this need in mind; and below we describe its underlying strategy, compare its functionality and performance to existing tools, and report its availability.

Implementation

Description of strategy, use, and package availability

The ‘apparent’ analysis begins with a tab-delimited input table of SNP-based genotypes across some set of loci (columns) for all individuals (rows) in the target population (see Additional file 1). In column 2 of the input file, each individual in the population is assigned to one of five classes for the analysis: Mo (exclusively considered as a potential mother, or female parent), Fa (exclusively considered as a potential father, or male parent), Off (exclusively considered as an offspring), Pa (exclusively considered as a parent, both female and male), or All (considered as a potential female parent, male parent, and offspring within the population).
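The mapping from the five class codes to candidate roles can be sketched as follows. This is a minimal Python illustration, not the package's R code; the function name `candidate_roles` and the dictionary-based input are hypothetical choices made here for clarity.

```python
from typing import Dict, List, Tuple

# Class codes as described in the text; the set names are illustrative.
FEMALE_CLASSES = {"Mo", "Pa", "All"}
MALE_CLASSES = {"Fa", "Pa", "All"}
OFFSPRING_CLASSES = {"Off", "All"}


def candidate_roles(classes: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Split individuals into potential mothers, fathers, and offspring.

    `classes` maps an individual's name to its class code (column 2 of the
    input table). An individual may appear in more than one role.
    """
    known = FEMALE_CLASSES | MALE_CLASSES | OFFSPRING_CLASSES
    unknown = set(classes.values()) - known
    if unknown:
        raise ValueError(f"Unrecognized class code(s): {sorted(unknown)}")

    mothers = sorted(n for n, c in classes.items() if c in FEMALE_CLASSES)
    fathers = sorted(n for n, c in classes.items() if c in MALE_CLASSES)
    offspring = sorted(n for n, c in classes.items() if c in OFFSPRING_CLASSES)
    return mothers, fathers, offspring


if __name__ == "__main__":
    example = {"A": "Mo", "B": "Fa", "C": "Off", "D": "Pa", "E": "All"}
    print(candidate_roles(example))
```

Coding every individual as All places each one in all three lists, which corresponds to the fully unguided analysis.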

For each of the possible pairs of i female parents (Mo, Pa, and All) and j male parents (Fa, Pa, and All), the genotype of the Expected Progeny (EPij) is constructed based only on markers that are homozygous in both parents. A rapid, pairwise calculation of genetic distance, namely Gower’s Dissimilarity coefficient (GD) [37], is then carried out between each EPij and all k potential offspring (POk) in the population (Off and All). Ranging from 0 (perfect identity) to 1 (perfect dissimilarity), GD captures the degree of genetic relatedness between two individuals by quantifying the identity-by-state of all n SNPs, according to:

$$ {GD}_{ij\mid k}\left({EP}_{ij}\mid {PO}_k\right)=1-\left(\frac{\sum \limits_{l=1}^n{s}_l{w}_l}{\sum \limits_{l=1}^n{w}_l}\right) $$
(1)

where, for each SNP l, s_l = 1 if the genotypic states are the same; s_l = 0.5 if the genotypic states differ by one allele (i.e. heterozygote vs. homozygote); s_l = 0 if the genotypic states differ by both alleles (i.e. primary homozygote vs. secondary homozygote); w_l = 1 if both individuals are genotyped; and w_l = 0 if either individual lacks an assigned genotype (e.g. missing data due to low coverage).
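To make Eq. 1 concrete, the following minimal Python sketch builds an expected progeny genotype from two parents and computes its GD to a potential offspring. It is an illustration only, not the package's R implementation. The genotype coding (0 and 2 for the two homozygotes, 1 for the heterozygote, None for missing data) and the function names are assumptions introduced here.

```python
from typing import List, Optional, Sequence

# Assumed coding: 0 = primary homozygote, 1 = heterozygote,
# 2 = secondary homozygote, None = missing genotype.
Genotype = Optional[int]


def expected_progeny(mother: Sequence[Genotype], father: Sequence[Genotype]) -> List[Genotype]:
    """Construct EP_ij using only loci that are homozygous in both parents."""
    ep: List[Genotype] = []
    for m, f in zip(mother, father):
        if m in (0, 2) and f in (0, 2):
            # Both parents homozygous: the progeny genotype is fully determined
            # (0x0 -> 0, 2x2 -> 2, 0x2 -> 1).
            ep.append((m + f) // 2)
        else:
            # Locus is not usable; treat as missing so that w_l = 0.
            ep.append(None)
    return ep


def gower_dissimilarity(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Eq. 1: GD = 1 - sum(s_l * w_l) / sum(w_l)."""
    score_sum = 0.0
    weight_sum = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # w_l = 0: locus contributes nothing
        # s_l = 1 (same state), 0.5 (one allele differs), 0 (both differ)
        score_sum += 1.0 - abs(x - y) / 2.0
        weight_sum += 1.0
    if weight_sum == 0:
        raise ValueError("No loci genotyped in both individuals")
    return 1.0 - score_sum / weight_sum


if __name__ == "__main__":
    mother = [0, 2, 1, 0, 2]
    father = [0, 0, 2, 2, None]
    ep = expected_progeny(mother, father)          # usable loci: 1st, 2nd, 4th
    print(gower_dissimilarity(ep, [0, 1, 2, 1, 0]))  # consistent offspring
    print(gower_dissimilarity(ep, [2, 1, 0, 0, 1]))  # inconsistent individual
```

In this toy example only three of the five loci are homozygous in both parents, so GD is computed over those three loci alone.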

Theoretically, if Moi and Faj are the true parents of POk, EPij and POk will be genetically identical across all homozygous parental loci, resulting in a pairwise GD equal to zero. Due to both sequencing and genotyping errors, however, in practice the calculated GD value for a true triad (Moi, Faj, POk) will be greater than zero; but it will be significantly lower than the population of GD’s calculated between EPij and all false offspring. Indeed, for a given population of individuals, a scatterplot of all possible GDij|k values exhibits a significant gap that separates true triads from spurious associations (Fig. 1a). This gap is located by scanning the ordered set of GDij|k values and detecting the place of maximum difference between two adjacent values; and the midpoint of this gap is taken as a simple threshold (Fig. 1a). A similar approach has been described as a reliable means of separating true and false parent-offspring assignments when applying discriminant analysis to thousands of homozygous loci [30, 38].
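The gap search described above (order the GD values, find the largest difference between adjacent values, take the midpoint) can be sketched in a few lines of Python. The function name `gap_threshold` and the toy GD values are hypothetical; the package itself is written in R.

```python
from typing import List, Sequence, Tuple


def gap_threshold(gd_values: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, gap_size) for a set of GD_ij|k values.

    The threshold is the midpoint of the largest gap between two adjacent
    values in the ordered set.
    """
    ordered: List[float] = sorted(gd_values)
    if len(ordered) < 2:
        raise ValueError("At least two GD values are required")

    # Differences between each pair of adjacent ordered values.
    gaps = [upper - lower for lower, upper in zip(ordered, ordered[1:])]
    idx = max(range(len(gaps)), key=gaps.__getitem__)

    threshold = (ordered[idx] + ordered[idx + 1]) / 2.0
    return threshold, gaps[idx]


if __name__ == "__main__":
    # Three small values (candidate true triads) and four large ones.
    values = [0.002, 0.001, 0.003, 0.21, 0.25, 0.30, 0.24]
    threshold, gap = gap_threshold(values)
    candidates = [v for v in values if v < threshold]
    print(threshold, gap, candidates)
```

Triads with GD below the returned threshold are only candidates at this stage; their significance is assessed separately.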

Fig. 1

The ‘apparent’ analysis plots. For a given population, a simple gap analysis separates true triads from spurious relationships. (a) Gower Dissimilarities (GDij|k) are plotted for all possible parent-offspring combinations in the population, enabling an inspection of gap size and all subsequent hypothesis testing. (b) For each significant parent-offspring association from the dyad analysis, distribution plots of mean GDi(1...j)|k values (GDM) and their standard deviation in units of GDi|k (GDCV) help visualize the analysis. In this particular example, A. arguta cv. ‘#74–32’ was correctly identified as a parent of offspring 10 despite the absence of the other parent (cv. ‘Chang Bai Mountain 5’) from the population and the confounding presence of two full-sibs (offspring 11 and 12)

Once the gap has been identified, the significance of its magnitude vis-à-vis the distribution of gap lengths throughout the plot is assessed via a Dixon test [39, 40]. If the size of the gap is declared significant, the individual significance of each triad below the gap (i.e. those triads declared as potential real parent-offspring associations) is then tested against a sample of the most closely-related GDij|k values above the gap (i.e. those triads declared as spurious). If this second Dixon test is also found to be statistically significant, the implicated triad is declared as true and its p-value reported.
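For readers unfamiliar with the Dixon test, the sketch below computes the basic Dixon ratio (often written Q or r10) for the largest value in a sample, here applied to a set of gap lengths. The text does not specify which ratio variant or critical values the package uses, so this is an assumption-laden illustration only: the function name is hypothetical, and the resulting ratio would still need to be compared against a tabulated critical value for the sample size and chosen confidence level.

```python
from typing import Sequence


def dixon_q_largest(values: Sequence[float]) -> float:
    """Dixon's Q for the largest value: (x_n - x_{n-1}) / (x_n - x_1)."""
    ordered = sorted(values)
    if len(ordered) < 3:
        raise ValueError("Dixon's Q needs at least three values")
    value_range = ordered[-1] - ordered[0]
    if value_range == 0:
        raise ValueError("All values are identical; Q is undefined")
    return (ordered[-1] - ordered[-2]) / value_range


if __name__ == "__main__":
    # Toy gap lengths: one gap is much larger than the others.
    gap_lengths = [0.001, 0.001, 0.01, 0.03, 0.05, 0.207]
    print(round(dixon_q_largest(gap_lengths), 4))
```

A ratio close to 1 indicates that the largest gap stands well apart from the remaining gap lengths.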

In the above triad analysis, a given offspring can be assigned to a pair of parents if and only if all three individuals (both parents and the offspring) are present in the genotyped population. In an attempt to identify one parent despite the absence of the other in the population, a subsequent dyad analysis can be performed. The primary challenge of such an analysis lies in discriminating an individual’s true parent from other close relatives (e.g. full siblings). To address this challenge, ‘apparent’ conducts a two-stage statistical test.

The first test makes use of the fact that, on average, an individual is more closely related to a population of its siblings than it is to a population of random individuals. For each potential offspring k and potential parent i, the package calculates the mean GD (GDM) between that POk and all expected progeny arising from the j possible triads involving potential parent i:

$$ GDM\equiv \frac{1}{j}{\sum}_j{GD}_{\left. ij\right|k} $$
(2)

For each POk, the resulting set of GDM values, one for each parent i, is treated as a normal distribution and the normal score of each value is obtained. If any normal score falls below the lower bound of the user-defined confidence interval, the pair (parent i and POk) is flagged as a potential parent-progeny set.
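The first dyad test can be sketched as follows for a single potential offspring. This Python illustration is not the package's implementation: the function name, the use of the sample standard deviation for the normal scores, and the one-sided lower bound derived from alpha are all assumptions made here.

```python
from statistics import NormalDist, mean, stdev
from typing import Dict, List


def flag_low_gdm(gd_by_parent: Dict[str, List[float]], alpha: float = 0.01) -> List[str]:
    """Flag potential parents whose GDM normal score is below the lower bound.

    `gd_by_parent` maps each potential parent i to the GD_ij|k values between
    one potential offspring k and the expected progeny of every triad
    involving parent i.
    """
    # Eq. 2: GDM is the mean GD over the j triads involving parent i.
    gdm = {parent: mean(values) for parent, values in gd_by_parent.items()}

    # Normal scores across all candidate parents (assumed: sample SD).
    all_gdm = list(gdm.values())
    centre = mean(all_gdm)
    spread = stdev(all_gdm)

    # Assumed: one-sided lower bound, e.g. about -2.33 for alpha = 0.01.
    lower_bound = NormalDist().inv_cdf(alpha)
    return [p for p, g in gdm.items() if (g - centre) / spread < lower_bound]


if __name__ == "__main__":
    background = [0.24, 0.25, 0.26, 0.25, 0.24, 0.26, 0.25, 0.25, 0.25]
    gd_by_parent = {"P0": [0.09, 0.11]}
    gd_by_parent.update({f"P{i}": [v, v] for i, v in enumerate(background, start=1)})
    print(flag_low_gdm(gd_by_parent))
```

Note that with very few candidate parents no normal score can reach a stringent bound, so a test of this kind presupposes a reasonably large set of candidates.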

The second test makes use of the fact that, on average, variation in GD is higher between an individual and a population of its siblings than between an individual and a population of the progeny of its siblings. To further test the potential parent-progeny sets flagged above, the ‘apparent’ dyad analysis thus considers the variation within the sets of GDi(1...j)|k values. Specifically, for each POk and potential parent i, the package calculates the standard deviation among the pairwise GD’s between POk and each expected progeny arising from the j triads involving potential parent i:

$$ {\sigma}_{GD_{\left.i\left(1\dots j\right)\right|k}}=\sqrt{\frac{1}{j-1}{\sum}_j{\left({GD}_{\left. ij\right|k}-\frac{1}{j}{\sum}_j{GD}_{\left. ij\right|k}\right)}^2} $$
(3)

For the purpose of testing against the background of the entire population, this standard deviation is re-expressed in units of GDi|k, the Gower Dissimilarity between POk and potential parent i itself:

$$ GDCV\equiv \frac{\sigma_{GD_{\left.i\left(1\dots j\right)\right|k}}}{GD_{i\mid k}} $$
(4)

Similar to the first test above, for each POk the resulting set of GDCV values, one for each parent i, is treated as a normal distribution and the normal score of each value is obtained. If any normal score exceeds the upper bound of the user-defined confidence interval, the pair (parent i and POk) is reported as a likely potential parent-progeny set, along with its cumulative p-value. As shown in Fig. 1b, this two-step dyad analysis is effective not only in identifying likely parents (significant outliers in both tests) but also in distinguishing such parents from other close relatives (significant outliers in the first test only).
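The second dyad test (Eqs. 3 and 4) can be illustrated in the same way. As before, this is a hedged Python sketch rather than the package's code: the function names are hypothetical, and the one-sided upper bound and sample standard deviation used for the normal scores are assumptions.

```python
from statistics import NormalDist, mean, stdev
from typing import Dict, List


def gdcv(gd_triads: List[float], gd_direct: float) -> float:
    """Eqs. 3 and 4: sample SD of GD_ij|k over j, in units of GD_i|k.

    `gd_direct` is the Gower Dissimilarity between the potential offspring k
    and the potential parent i itself.
    """
    return stdev(gd_triads) / gd_direct


def flag_high_gdcv(gdcv_by_parent: Dict[str, float], alpha: float = 0.01) -> List[str]:
    """Flag potential parents whose GDCV normal score exceeds the upper bound."""
    values = list(gdcv_by_parent.values())
    centre, spread = mean(values), stdev(values)
    upper_bound = NormalDist().inv_cdf(1.0 - alpha)  # assumed one-sided bound
    return [p for p, v in gdcv_by_parent.items() if (v - centre) / spread > upper_bound]


if __name__ == "__main__":
    print(gdcv([0.10, 0.20, 0.30], gd_direct=0.25))

    background = [0.29, 0.30, 0.31, 0.30, 0.29, 0.31, 0.30, 0.30, 0.30]
    scores = {"P0": 0.90}
    scores.update({f"P{i}": v for i, v in enumerate(background, start=1)})
    print(flag_high_gdcv(scores))
```

In the two-stage procedure, only pairs already flagged by the GDM test would be passed to this second test.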

It is important to note that the ‘apparent’ algorithm makes no assumptions about the ploidy of the species under investigation; and the strategy performs well for any level of available pedigree information, from none (completely unknown adults and offspring) to the maximum possible information available (known adults, including their genders, as well as the set of offspring). The simple approach accommodates unlimited markers across unlimited individuals, the only requirement being that the population under investigation is genotyped with bi-allelic SNP markers. The ‘apparent’ package is freely available at https://github.com/halelab/apparent and through the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org.

Method validation

To test the validity of the approach described above, we turned to the North American kiwiberry (Actinidia arguta) collection, composed of 62 tetraploid (2n = 4x = 116), dioecious genotypes [41]. From these 62 genotypes, four males and five females were used in controlled crosses to produce a total of 15 offspring of known parentage (five groups of three full-siblings each; see Additional files 2 and 3). For each of the 77 samples (62 + 15 offspring), genomic DNA was isolated from ~ 1 g of fresh young leaves using a modified CTAB protocol, cleaned with a spin column (Zymo Research, Genomic DNA Clean & Concentrator™-10), and multiplexed into genotyping-by-sequencing (GBS) libraries using a two-enzyme (PstI-MspI) protocol [42]. The libraries were sequenced using 150 bp paired-end (PE) reads on an Illumina HiSeq 2500 platform, and the CASAVA-processed sequence data were submitted to the GBS-SNP-CROP pipeline [25] for genotyping. Stringent quality filtering was carried out, as explained in detail in the pipeline documentation; and all recommended ploidy-specific parameters were used for SNP calling and genotyping.

The resulting set of genotypic data was submitted to ‘apparent’ with no accompanying generational, gender, or pedigree information. In other words, all 77 genotypes were coded as ‘All’ in the input file, meaning each individual was to be considered by ‘apparent’ as a possible mother, father, and offspring, for a total of 225,302 potential triads. Package performance was assessed using the following four metrics: 1) Number of Type I errors (false triads declared true); 2) Number of Type II errors (undeclared true triads); 3) Overall accuracy [100 * Number of declared true triads/(Number of true triads + Number of false triads declared true)]; and 4) Computation time.

Using the same set of data, we investigated the impact of total marker number on performance. Finally, we compared the simple gap-based method of triad GD threshold determination with a more intensive approach involving computation of genetic dissimilarities among technical replicates (i.e. duplicated DNA samples isolated from three different genotypes, split between different library preparations, and sequenced on different Illumina lanes).

Comparison to other parentage analysis tools

After choosing an appropriate number of loci to include in the analysis, we compared the performance of ‘apparent’ with five other parentage analysis tools, including four R packages (‘MasterBayes’ MCMCped function [27], ‘ParentOffspring’ [28], ‘Solomon’ [29], and ‘hsphase’ pogc function [30]) and the Windows-based program Cervus [43, 44], one of the most widely used software tools for parentage analysis. As described above for ‘apparent,’ we evaluated the performances of these tools using the test population of 77 A. arguta accessions. To fairly compare performance among tools, we applied the same criteria to all analyses, namely: 1) The same set of 1000 SNPs was used; 2) All 225,302 potential triads were tested (i.e. no information was provided in terms of classifying individuals as mothers, fathers, or offspring); and 3) Confidence level, when supported by a given tool, was set at 99% (α = 1%).

In addition, a more qualitative comparison of the tools was done based on their main features, ease of use, and available functions. The main features considered were marker type, parentage analysis method, number of genotype classes that must be declared, and operating system compatibility. Ease of use considers the relative level of difficulty in parameterizing the various tools, creating the needed input files, and interpreting the output. Lastly, the comparison of available functions follows the typology proposed by Jones et al. 2010 [18] to classify the various tools based on their abilities to perform paternity/maternity, parent pair allocation, parental reconstruction, sib-ship reconstruction, and full probability analyses. Also considered are the tools’ abilities to calculate exclusion probabilities, assign statistical confidence to individual parent-offspring pairs, and assess experiment-wide statistical confidence of parent-offspring assignments.

Results and discussion

GBS-SNP-CROP retained, on average, 5.14 million high-quality PE reads per genotype (Additional file 2) and called a total of 27,852 SNPs, with an average depth D = 36.0. Overall levels of heterozygosity, homozygosity, and missing data were 36.6, 51.5, and 11.8%, respectively.

Optimizing SNP number for parentage analysis

From the 27,852 SNPs called, random subsets of various sizes, ranging from 50 to 10,000 SNPs, were sampled and evaluated. Because only pairwise homozygous loci are used by ‘apparent’ for analysis, the genotype of any given EPij is based on fewer SNPs than the total available. For example, when 50 SNPs were provided to ‘apparent’, only 19 were usable in the analysis of this population; and the result was both a very high Type I error rate (99.4%) and a very low overall accuracy (0.64%). Supplying 500 SNPs to the package increased the number of usable loci to 186, which decreased the Type I error rate substantially (25.0%) and greatly improved overall accuracy (75.0%). With 1000 loci (371 SNPs used), the model became stable with no errors (100% accuracy) (Fig. 2).

Fig. 2

Influence of the number of SNP loci on error rates, accuracy, and computation time. For each set of loci sampled, the performance of the ‘apparent’ package was evaluated in terms of error rates (Types I and II) and accuracy. The times required to successfully complete the analyses were also recorded and reveal a surprising insensitivity to the number of markers used. Note that the percentage of markers usable by ‘apparent’ for the analysis (i.e. parental homozygous SNPs) is quite stable

Although 1000 was found to be the lowest acceptable number of loci for reliable parentage analysis within this A. arguta collection, the optimum number can be expected to vary according to the species under investigation, the diversity within and among lines, and the population structure. For example, parentage analysis within a highly heterozygous, outcrossing species may require a relatively larger pool of loci due to the fact that a small proportion will be homozygous for any given pair of possible parents. In comparison, a greater proportion of loci generally will be usable in a more homozygous, inbred species, thereby requiring a relatively smaller pool of loci. In practice, as long as all of the individuals in the analysis can be clearly discriminated from one another based on the available pairwise homozygous loci, there will be sufficient resolution for the ‘apparent’ analysis. And as discussed in more detail below, increasing the number of loci has very little effect on total computation time; so there is no real advantage to using a reduced marker set.

Accuracy and computation time

Using 1000 total SNPs, ‘apparent’ identified the parental pairs of all 15 offspring from the controlled crosses with 100% accuracy (no Type I or II errors), despite the complicating presence of full-sibs in the population. In addition, we found an average accuracy of 73.3% (range 33.3–100%) for dyad analysis, over the nine analyses where one male or one female parent of the known offspring was removed from the population. Dyad analysis reached a consistent 100% accuracy, however, when minimal generational information (adults vs. juveniles) was provided to the algorithm. Both the triad and dyad analyses produce easily parsable and tab-delimited output (Additional file 4), along with summary plots (Fig. 1).

While the pairwise GD between redundant genotypes (i.e. technical replicates) should in theory be zero, the existence of both sequencing and genotyping errors means that, in practice, perfect similarity is rarely observed. Using the summary plot of GDij|k values, ‘apparent’ adopts a simple gap-based method of GD threshold determination to separate putative true triads from spurious parent-progeny associations. For the test population of 77 A. arguta accessions, the true triads identified via the gap-based method had a mean GDij|k of 0.0016. In a previous study with this population [41], 99% confidence intervals for declaring redundancy were empirically determined based on distributions of GD’s obtained between pairs of both biological replicates (two independent DNA isolations from the same accession, prepared as part of the same GBS library and sequenced in the same lane) and technical replicates (a single DNA isolation, used in two separate GBS library preparations and sequenced on different lanes). The mean GDij|k for triads declared via the gap-based method is lower than both the biological (0.0024) and technical (0.0046) replicate thresholds, meaning the simple gap-based ‘apparent’ assignments are supported by empirical measures of genetic redundancy.

Recognizing that true triads exhibit a very small pairwise GDij|k, despite the presence of sequencing and genotyping errors, one can greatly accelerate the ‘apparent’ analysis by limiting the time-intensive gap analysis to only those GDij|k values below some user-specified threshold via the package’s MaxIdent parameter. The MaxIdent default of 10% greatly reduces the analysis time because all GDij|k values above 0.1 are ignored during significance testing (i.e. they cannot, by definition, be declared as true triads). In a test population of n = 77 individuals, each coded as ‘All’ (potential mothers, fathers, and offspring), pairwise GDij|k values for a total of 225,302 possible triads must be explored [n^2 * (n-1)/2]. With MaxIdent set to 0.1, however, the computation time required by ‘apparent’ for the A. arguta test population is modest (~ 20 min on a Unix workstation with a 2.6 GHz Dual Intel processor and 16 GB RAM) and fairly insensitive to the number of loci used (Fig. 2).
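The size of the unguided triad space and the effect of a MaxIdent-style cutoff can be checked with a short Python sketch. The function names are hypothetical; only the formula n^2 * (n-1)/2 and the 0.1 cutoff come from the text.

```python
from typing import List, Sequence


def unguided_triad_count(n: int) -> int:
    """Number of triads explored when all n individuals are coded as 'All'."""
    # n^2 * (n - 1) / 2; n * (n - 1) is always even, so the division is exact.
    return n * n * (n - 1) // 2


def below_max_ident(gd_values: Sequence[float], max_ident: float = 0.1) -> List[float]:
    """Keep only GD values not exceeding the cutoff for significance testing."""
    return [gd for gd in gd_values if gd <= max_ident]


if __name__ == "__main__":
    print(unguided_triad_count(77))    # the test population
    print(unguided_triad_count(154))   # doubling n gives roughly 8x the triads
    print(below_max_ident([0.001, 0.003, 0.21, 0.25]))
```

For n = 77 the formula reproduces the 225,302 triads reported above, and doubling the population size multiplies the triad space by roughly eight, in line with its cubic growth.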

As a final note on computation time, although increasing the number of loci for a given population has very little effect on total computation time, increasing the number of individuals in that population does. In the absence of guiding information (i.e. all individuals coded as ‘All’), the exploratory triad space grows as the cube of the population size, an inflation that directly influences required computation time (see Additional file 5). Users are therefore advised to minimize the size of the exploratory triad space on the basis of available gender and/or generational information. Indeed, excluding irrelevant triads from the analysis should be considered a best practice, along with including a known triad in the population (i.e. a control) and culling individuals with unusually low mean GDij|k values or mean usable number of loci (see https://github.com/halelab/apparent for details).

Comparing features and performance with other tools

As summarized in Table 1, the ‘apparent’ package offers a novel combination of features compared to those possessed by the following commonly used parentage analysis tools: ‘MasterBayes’ MCMCped function [27], ‘ParentOffspring’ [28], ‘Solomon’ [29], ‘hsphase’ pogc function [30], and Cervus [43, 44]. Only ‘apparent’ and ‘hsphase’ permit fully exploratory parentage analysis in the absence of a priori classifications of individuals (e.g. parents vs. offspring). Despite this point of commonality, ‘apparent’ greatly exceeds the functionality of ‘hsphase’ in its performance of both paternity/maternity analysis and parent pair allocation, not to mention its ability to assign statistical confidence to declared triads. The ‘apparent’ package was also designed with relative ease of use in mind, a result accomplished via simple parameterization, input file requirements, and output interpretation.

Table 1 Comparison of the ‘apparent’ R package to five currently available tools for parentage analysis, based on main features, ease of use, and available functions

In addition to occupying a unique niche among available parental analysis tools in terms of features, ‘apparent’ consistently outperformed those tools in the correct identification of parent-offspring triads in the test population of 77 A. arguta individuals. Applying the same criteria to all analyses, the overall accuracy of the five tools ranged from 2.3 to 55.6%, compared to 100% for ‘apparent’ (Table 2). Cervus, one of the most popular parentage analysis tools available, completed the analysis in just under 12 min with no Type II errors; but it committed 44 Type I errors out of a total of 59 declared significant triads. Despite these errors, Cervus proved to be one of the better overall tools of the five, with an accuracy of 50.8%. These results indicate that identifying correct parent-offspring assignments within a population lacking pedigree information is a challenge even for one of the most robust parentage analysis tools available. Notably, Cervus’ triad accuracy increased to 100% when generational information (i.e. which individuals are parents and which are offspring) was supplied to the algorithm (Table 2).

Table 2 Summary of results comparing the performance of ‘apparent’ to five other parentage analysis tools in identifying the pairs of parents of 15 A. arguta offspring in a population of 77 individuals

In the absence of a priori classifying information, ‘MasterBayes’ and ‘ParentOffspring’ exhibited similar overall accuracies (48.1 and 55.5%, respectively; Table 2). The categorical allocation analysis of ‘MasterBayes’ relies on a Markov Chain Monte Carlo approach and runs extremely fast (Table 2); and the package is arguably one of the most sophisticated and comprehensive parentage analysis tools available, owing to its ability to handle both co-dominant and dominant markers and to perform Full Probability analysis (Table 1). The low accuracy of ‘MasterBayes’ in this scenario is understandable, however, in light of the fact that its modeling framework lies firmly within the tradition of analyses developed for general, guided relationship inference in human populations [35, 36], as opposed to the single, well-defined task of unguided parent identification under consideration here. As with Cervus, the accuracy improves greatly (100%) when generational classifications (parents vs. offspring) are provided. Unlike Cervus, however, ‘MasterBayes’ is noteworthy in its difficulty of use, a result of its complex input file requirements and non-trivial parameterization.

To run the ‘ParentOffspring’ package, generational classifications (parents vs. offspring) are required; therefore, carrying out a full, unbiased exploration of the full triad space (225,302 triads) is extremely cumbersome. Even when the required generational classifications (i.e. designating the 15 known offspring as juveniles) were provided, however, the algorithm committed one Type I error (Table 2). Reducing the guiding information even a little, by classifying some full-sib offspring as adults and adults of the same gender as potential parental pairs, increased the number of Type I errors significantly and decreased the model accuracy to 55.5%. Given the impracticality of manually running all combinations of the 77 genotypes, the computation time to complete the whole analysis was estimated to be ~ 261 min, not including the time required for the manual permutation of the input files.

The ‘hsphase’ parentage assignment function pogc was only 26.1% accurate in this scenario of no available pedigree information. This was a somewhat surprising result, given the fact that both ‘hsphase’ and ‘apparent’ exclusively use homozygous parental loci for discriminating true and false parent-offspring assignments. Unlike ‘hsphase’, however, the ‘apparent’ GDij|k gap value is extensively tested based on outlier prediction (Dixon test), allowing the inference of statistical confidence for declared triads.

Of all the packages tested, ‘Solomon’ showed the worst overall performance, with an accuracy of only 2.3% in this scenario of no available pedigree information. In addition, the computational time required by ‘Solomon’ to complete the analysis was significantly longer than that required by any other package (401 min) due to the fundamental dependencies inherent in Bayesian approaches. Surprisingly, the package’s accuracy rose to a mere 2.6% when the adults and the offspring were duly classified; and in both scenarios the Type I error rate was around 97% (Table 2).

Compared to other available tools, the simplicity, speed, and accuracy of the ‘apparent’ package recommend it as a useful tool for inferring parent-offspring relationships within populations for which a priori relational information is lacking. The key column of the simple input file (Additional file 1, second column) lies at the heart of the package’s flexibility, allowing individuals in the population to be tested as both parents and offspring in the same analysis and eliminating the requirement for pedigree information. This same column also allows the user to provide additional information if it is available; thus one can easily control the type of parentage analysis performed. For example, if generational information (adults vs. offspring) and adult genders are known, either paternity or maternity analyses can be performed. If the genders are unknown, a generation-guided categorical allocation analysis is performed. Finally, when no family information is available and all individuals are to be tested as potential mothers, fathers, and offspring, ‘apparent’s novel approach to unguided categorical allocation is carried out, filling a current gap among existing parentage analysis tools.

Conclusions

By offering quick and accurate inference of parent-offspring triads within populations for which no generational, gender, or pedigree information is available, the ‘apparent’ R package occupies a unique niche among currently available parentage analysis tools. With simple parameterization and easily interpretable output, the package should be considered by molecular ecologists, population geneticists, and breeders interested in evaluating family relationships within populations of either model or non-model species for which genome-wide SNP data are available.

In terms of its range of applicability, it is worth emphasizing the fact that ‘apparent’ only attempts to identify direct parent-offspring associations (i.e. the approach only looks back a single generation to identify immediate parents). In practice, then, unless every line from all stages of a breeding program is genotyped (highly unlikely for annual crops), the required genomic data will not be available to establish the chain of generations underlying certain pedigrees of interest (e.g. the original parents of an inbred line). For this reason, the approach is more practically suited to questions of direct parentage within long-lived species, for which multiple generations co-exist and can therefore be included together in the analysis (e.g. trees, woody lianas, other perennials, clonally-propagated crops, etc.). In other words, ‘apparent’ is arguably best suited to plant species which conform to the animal model, in the sense of having co-existing parents and offspring.

Availability and requirements

Project name: apparent.

Project home page: https://github.com/halelab/apparent.

Operating system(s): Platform independent.

Programming language: R.

Other requirements: R (>= 3.0.2).

License: GPL (>= 2).

Any restrictions to use by non-academics: none.