Background

After the death of an organism, its DNA decays and is progressively lost through time [1, 2]. Under favourable conditions, DNA can preserve for hundreds of thousands of years and provide valuable information about the evolutionary history of organisms [3, 4]. Yet, only minute amounts of ancient DNA (aDNA) often remain in historical or archaeological material. In addition, most of the extracted DNA usually comes from microorganisms that spread in decaying tissues [5, 6]. Whereas microbial sequences rarely align to the reference genome used for identifying endogenous sequences if appropriate length cut-offs are used [7,8,9], contamination with DNA from closely related organisms represents a recurrent problem [10,11,12]. This is particularly true for the genomic analyses of ancient humans, as the individuals handling the specimens during excavation and at later times often leave their DNA behind [13, 14]. Because this contamination can substantially affect the results of population genetic or phylogenetic analyses, quantifying the level of contamination is crucial for downstream analyses. An estimate of the level of present-day DNA contamination is also desirable for making decisions when screening samples to identify those that can be further sequenced with reasonable effort and expenses.

Approaches to quantify the level of contamination can be divided into three categories. Some methods rely on prior knowledge of sequence differences between the contaminating and endogenous genomes [15,16,17]. Alternatively, if these differences are unknown a priori, other methods evaluate the excess of alleles compared to the expected ploidy [18,19,20,21]. The third set of methods uses patterns of chemical damage that are characteristic of aDNA [17, 22].

Amongst the approaches that rely on genetic differences, the most common strategy is to identify diagnostic positions expected to differ between the contaminating and endogenous sequences [16, 20]. The proportion of sequences that carry the contaminant allele at diagnostic positions represents an estimate of the level of contamination. This approach is particularly well-tailored for studying the mitochondrial genome, because it is an extensively studied, non-recombining locus that is often available at high coverage. In contrast, local genealogies along the nuclear genome may differ from the overall population relationship (incomplete lineage sorting), making the identification of diagnostic positions difficult. By leveraging differences in allele frequencies between populations, it is possible to estimate the proportion of present-day human DNA contamination amongst nuclear sequences from archaic hominins [21,22,23]. For the analysis of early modern humans, this approach remains challenging because of the lack of knowledge about rare sequence variants in the sample of interest that are unlikely to be shared with the present-day human contaminant. Thus, contamination estimates obtained from mitochondrial sequences are often used as a proxy for the level of nuclear DNA contamination [24,25,26]. However, the ratio of mitochondrial DNA to nuclear DNA may vary between the endogenous and contaminating DNA [27, 28], leading to potential differences in the level of contamination between the mitochondrial and nuclear genomes [29, 30].

Other approaches that compare the nuclear genetic differences between contaminating and endogenous genomes commonly exploit the ploidy of the sex chromosomes [18, 19, 31]. For instance, apparent heterozygous sites on the X chromosome of a male individual or sequences mapping to the Y chromosome for a female individual represent evidence of DNA contamination. Although these analyses do not rely on a prior knowledge about the ancestry of the ancient individual, they are either restricted to the X chromosome of male samples or cannot detect female contamination in female samples. Another concern is that the level of contamination may differ between the sex chromosomes and the autosomes if the sexes of the contaminant(s) and the ancient individual differ. Other approaches for the autosomes exist, e.g. methods using apparent alternative alleles at homozygous positions or an allelic imbalance at heterozygous positions [20, 21]. However, such approaches assume that high sequence coverage is available.

Alternatively, properties of aDNA molecules can be used to estimate contamination. Ancient DNA is typically fragmented into pieces shorter than 100 bp and exhibits miscoding base modifications that accumulate over time [32,33,34,35]. The most common miscoding lesions observed in aDNA are the results of cytosine deamination [36,37,38,39] that converts cytosine (C) into uracil (U), which is then misread as thymine (T), or 5-methylcytosine into thymine. These apparent C-to-T substitutions occur preferentially toward the ends of sequences [39], likely because single-stranded overhangs, which are common in aDNA, exhibit a rate of cytosine deamination about two orders of magnitude higher than double-stranded regions [1, 40]. To estimate present-day DNA contamination, these properties need to be formalised in a model of aDNA damage. The simplest approach (“conditional substitution analysis”) is based on a model that assumes independence between C-to-T substitutions at both ends of sequences. Testing whether these substitutions are correlated between ends may reveal a set of undamaged sequences, which are likely contaminants [3]. This method works even for low sequence coverage but is primarily used to indicate the presence of contamination. Other methods extend this approach by considering the distribution of C-to-T substitutions along sequences, either assuming a parametric ([39], PMDtools [41] and mapDamage [42, 43]) or empirical distribution of these substitutions along sequences (contDeam [17] and aRchaic [44]). Notably, these methods assume that C-to-T substitutions in the aDNA sequences are independent of each other.

Here, we introduce a novel model for aDNA damage that does not assume independence between C-to-T substitutions. Our implementation, AuthentiCT, allows both estimation of the present-day DNA contamination rate and deconvolution of endogenous and contaminating sequences solely based on patterns of aDNA damage, which makes it applicable to any species, if a suitable reference genome is available for alignment. Applying this method to both simulated and existing aDNA datasets, we find that present-day DNA contamination can be estimated from as few as 10,000 sequences, making it a practical tool in the screening of samples for aDNA preservation.

Results

Method overview

In this section, we first motivate our approach by studying the features of aDNA damage. We then formalise a model of aDNA damage and develop a mixture model to describe and distinguish endogenous from putative contaminating sequences.

Ancient DNA deamination patterns used in this study

Deamination patterns in aDNA sequences depend on the DNA library preparation method used [45]. Some methods involve ligation of adapters to double-stranded DNA (“double-stranded libraries”, [46]) while other methods convert the two DNA strands into separate library molecules (“single-stranded libraries”, [47]). Here, we focus on the damage patterns in single-stranded libraries, as they fully preserve the strand orientation of the sequenced DNA fragments and are widely used in aDNA studies [48,49,50,51,52,53,54,55,56].

While C-to-T substitutions occur predominantly at the ends of DNA fragments, they are also common in the internal part [39, 57]. Our model is motivated by the finding that these internal C-to-Ts are not independent of each other (Fig. 1). Excluding the first and last five bases to mask potential overhangs, we found that C-to-Ts are particularly common in adjacent positions in many samples, with a significant deviation from the geometric distribution expected from independent events (p < 10−15, chi-square goodness-of-fit test; see Additional file 1: Supplementary Note 1 for more details or for results excluding the first and last ten bases), and from a control using sheared present-day human DNA that was treated like aDNA [9]. Single-stranded regions inside aDNA fragments represent a possible cause.

Fig. 1
figure 1

Excess of directly adjacent C-to-T substitutions inside aDNA sequences. The observed and expected frequencies of the distance between pairs of C-to-T substitutions are compared amongst sequences that exhibit two internal C-to-T substitutions (excluding the first and last 5 bases). The colours correspond to sequence data from different archaic humans (> 500,000 sequences each; Les Cottés Z4-1514, Spy 94a, Vindija 87, Mezmaiskaya 2, Goyet Q56-1 from [58]; Scladina I-4A and Hohlenstein-Stadel from [23]; Denisova 2 from [59]; Mezmaiskaya 1 from [60]). The present-day human DNA control dataset is represented in black [9]. All sequences are derived from single-stranded DNA libraries

Model of ancient DNA damage

Motivated by this finding, we developed a model of aDNA damage that jointly models all C-to-T substitutions, accounting for the observed clustering of C-to-T substitutions within a sequence. We used a hidden Markov model (HMM), where each potentially deaminated site in the reference is an informative site, i.e. Cs or Gs for sequences aligning to the forward or reverse strand, respectively. Other positions will give the same likelihood for the endogenous and contaminating DNA models and are therefore ignored in both models. At the C or G positions of the reference genome, we classify observations either as emitting “M” (matches the reference allele), “D” (differs from the reference allele; compatible with a deamination) or “E” (other mismatches corresponding to sequencing errors or polymorphisms) (Fig. 2).

Fig. 2
figure 2

Graphical representation of the model (a) and illustration of the posterior decoding (b). a States are depicted by nodes and transitions by edges. Each state emits a match to the reference M (blue) or a mismatch, which can either be compatible with cytosine deamination, D (red), or an error (or polymorphism), E (yellow). Single-stranded states (5’o, 3’o and ss) and the double-stranded state (ds) are in light and dark green, respectively. b The posterior probability for each state is shown with different colours. (see Additional file 1: Supplementary Note 2 for an evaluation of the posterior decoding). We show bases on top for positions in the sequence (grey bar) that align to a C in the reference

The model attempts to infer which parts of the underlying DNA molecule were single-stranded and double-stranded. It uses four hidden states corresponding to double-stranded (ds) or single-stranded stretches. We further separate internal single-stranded regions (ss) from 5′ (5’o) and 3′ (3’o) single-stranded overhangs. At the first position of the sequence alignment, the chain starts in a 5′ single-stranded overhang with probability o or in a double-stranded state with probability 1-o. Then, the lengths of the different regions follow geometric distributions, with parameters lo, lss or lds for the overhangs, single-stranded and double-stranded regions, respectively. We therefore assumed the following matrix of transition probabilities:

figure a

with d5 representing the distance from the previous observation (set to the current position for the initial observation), d3 the distance to the end of the sequence and o2 the probability of a 3′ single-stranded overhang. The modelling of the state transitions between informative sites assumes that only one transition happens between two informative sites.

The chain ends with the following transition probabilities:

figure b

For the emission probabilities of the three possible observations mentioned above, we assumed that all single-stranded states (5’o, 3’o and ss) have the same emissions:

$$ {\displaystyle \begin{array}{c}\mathrm{M}\\ {}\mathrm{D}\\ {}\mathrm{E}\end{array}}\left(\begin{array}{c}\left(1-e\right)\left(1-{r}_{ss}\right)+\frac{e}{3}{r}_{ss}\\ {}\left(1-e\right){r}_{ss}+\frac{e}{3}\left(1-{r}_{ss}\right)\\ {}\frac{2e}{3}\end{array}\right). $$

Similarly, the emission vector for the double-stranded state is:

$$ {\displaystyle \begin{array}{c}\mathrm{M}\\ {}\mathrm{D}\\ {}\mathrm{E}\end{array}}\left(\begin{array}{c}\left(1-e\right)\left(1-{r}_{ds}\right)+\frac{e}{3}{r}_{ds}\\ {}\left(1-e\right){r}_{ds}+\frac{e}{3}\left(1-{r}_{ds}\right)\\ {}\frac{2e}{3}\end{array}\right). $$

Here, rss and rds denote the deamination rates in single-stranded regions (including the single-stranded overhangs) and double-stranded regions, respectively. We model polymorphisms and errors using a single rate, e, as these are indistinguishable without prior knowledge of polymorphic sites in the genomes of the source populations/species. We also note that all types of substitutions are assumed to be equally likely, a simplification that has also been used in previous work [15, 18].

Model of present-day DNA contamination

To identify contamination, we contrast the aDNA model with a model for DNA without deamination. We assumed that any difference to the reference genome arose from a constant mismatch rate e along the sequence. Assuming independence between sites, the probability of the data is simply: ed(1 − e)s where d and s are the number of mismatches and matches to the reference at informative positions, respectively. We used the same rate e for both endogenous and contaminating sequences.

Estimating present-day DNA contamination

We use a mixture model to estimate the overall contamination rate c. We denote the i-th sequence as Xi, assuming we have N sequences in total. Using the aDNA sequence model, for each sequence, we calculate PE (Xi | ϴ, e), the probability of the sequence given that the corresponding DNA fragment is endogenous, conditional on the HMM parameters ϴ. Similarly, using the model of contaminating DNA outlined above, we calculate PC (Xi | e), the probability of the sequence given that it originates from a contaminating DNA fragment. Therefore, we have P (Xi | ϴ, e) = c PC (Xi|e) + (1 − c) PE (Xi | ϴ, e). Further assuming sequences are independent, the complete likelihood is \( P\left(X|\varTheta, e\right)={\prod}_{i=1}^N\left({P}_E,\left({X}_i|\varTheta, e\right),\left(1-c\right),+,{P}_C,\left({X}_i|e\right),c\right) \). We obtain a maximum likelihood estimate of c using the L-BFGS-B algorithm (tolerance: 10−10) implemented in scipy.optimize (version 1.3.1) and estimate standard errors using the Hessian matrix of the likelihood function, generated using the numdifftools library (version 0.9.39). Assuming that the maximum likelihood estimates are normally distributed, the 95% confidence intervals (CI) are approximated as ± 1.96 standard errors.

Evaluation of AuthentiCT

We implemented this model in a program called AuthentiCT. In this section, we evaluate how well AuthentiCT is able to estimate the proportion of contaminating sequences and to separate aDNA from present-day DNA sequences.

Inference of present-day DNA contamination rates

Assessing the accuracy with artificial mixtures of ancient and present-day DNA

To test if AuthentiCT can estimate the proportion of present-day DNA contamination, we created artificial mixtures of aDNA and present-day DNA sequences in varying proportions, from 5 to 95%, in steps of 5%. As present-day contaminant, we used sequences generated from present-day human DNA previously sheared to short fragments and treated like aDNA (mimicking the treatment of a genuine present-day DNA contamination [9]; these sequences are available in Additional file 2). As aDNA sequences, we used sequences from archaic datasets generated from single-stranded libraries that exhibit minimal amounts of present-day human DNA contamination [58, 60].

For each dataset, we then compared the contamination rate estimates from AuthentiCT to the estimates from contDeam [17] and from the conditional substitution analysis ([22], as computed in [23]) (Fig. 3). The conditional substitution analysis underestimates the contamination proportion (average bias, − 6.73%; root mean square error (RMSE), 0.0741; based on 100,000 sequences), and contDeam overestimates it (average bias, 2.36–1.19%; RMSE, 0.0396–0.0320; based on 10,000 and 100,000 sequences, respectively). In contrast, our method yields more accurate estimates (average bias, 0.05–0.42%; RMSE, 0.0194–0.0191; based on 10,000 and 100,000 sequences, respectively). We note some variability in the results depending on the dataset used, which may reflect different properties that are not under our control (e.g. additional contamination in the Neandertal datasets or differences in error rates between the present-day and ancient DNA sequences).

Fig. 3
figure 3

Contamination estimates on artificial mixtures of Neandertal and present-day human DNA sequences. Estimates from AuthentiCT are shown with red dots, whereas the estimates from two other methods are shown in grey [17, 22]. Each point represents the difference between the estimated proportion of contamination and the proportion of present-day human DNA sequences introduced in the corresponding Neandertal dataset (different panels). Added contamination (x-axis) ranges from 5 to 95% in steps of 5% (see Additional file 1: Figure S3 for a narrower range from 1 to 10% in steps of 1%). Error bars correspond to 95% CI. There were not enough observations with 10,000 sequences for the method relying only on terminal deamination [3]. We note that the Neandertal datasets, particularly Spy 94a, may contain some present-day human DNA contamination in addition to the human contamination we introduced (see Additional file 1: Table S1 for the libraries used and the associated contamination estimates)

Exploring the limits of the method with simulations

To further evaluate AuthentiCT in scenarios where we have full control over the parameters, we simulated aDNA and present-day DNA sequences using the model described above, varying deamination rates, error rates, sequence lengths, GC contents, and numbers of sequences. Unless stated otherwise, each dataset contained 100,000 sequences with lengths following a shifted geometric distribution with minimum and mean lengths of 35 and 45 bp, respectively. By default, we use a GC content of 40%, a terminal deamination rate of 0.5 and an error/divergence rate of 0.001, and set the HMM parameters to o = 0.5, o2 = 0.5, lo = 0.34, lss = 0.20 and lds = 0.003.

We found that AuthentiCT performs well for datasets of 10,000 or more sequences (RMSE, 0.010 and 0.009; average bias, 0.006 and 0.007; Fig. 4a), and its performance is consistent over a wide range of deamination rates (from 0.03 to 0.5 in Fig. 4b; RMSE between 0.005 and 0.013), albeit with larger confidence intervals for lower deamination rates. The least reliable estimates were obtained for very small datasets (1000 sequences) with low contamination rates (below 10%; RMSE, 0.028; average bias, 0.010; Fig. 4a). We note that AuthentiCT overestimates contamination for low contamination rates (average bias of 0.020 and 0.009 for contamination estimates below 0.25 with a terminal deamination rate of 0.15 and 0.5, respectively). This likely represents overfitting to short sequences with few informative sites, as the bias decreases with longer sequences or higher GC contents (Additional file 1: Figure S1).

Fig. 4
figure 4

Contamination estimates on simulated datasets. The datasets differ in the number of sequences (a), deamination rates (b) and sequencing error/divergence rates (c). Each point corresponds to a set of simulated sequences with varying proportions of contamination (x-axis) from 0 to 95% in steps of 5%. The errors (y-axis) correspond to the difference between the estimated and true contamination rates. The bars correspond to the 95% CI

Another variable that may affect contamination estimates is the rate of C-to-T substitutions to the reference genome that is not induced by deamination. Although the sequencing error rate would typically not exceed 1% on sequencing platforms commonly used for ancient DNA sequencing [61, 62], divergence to the reference genome may be an issue. We therefore tested our method on simulated sequences with varying substitution rates, assuming the same rate for endogenous and contaminating sequences (see Additional file 1: Figure S2 for results with different rates). As expected, an increase of the substitution rate leads to a decrease in accuracy of the contamination estimates (RMSE of 0.006, 0.010 and 0.011 for substitution rates of 0.001, 0.01 and 0.02, respectively; average bias of 0.005, 0.009 and 0.01 for the same substitution rates; Fig. 4c). We note that divergence can also lead to alignment issues that may further introduce biases in the damage patterns, as mismatches to the reference may lead to the selective loss of sequences with additional C-to-T substitutions.

Application to real ancient DNA datasets

To further validate the method, we next applied AuthentiCT to published sequences from archaic human specimens [22, 23, 58, 60] without introducing sequences from present-day DNA. We compared our results with a previous approach [23] that relies solely on the divergence to a present-day African genome (HGDP00456, [45]). This independent method works well for Neandertals (as the contaminating modern human DNA will be much more similar to the African genome than the Neandertal genome), but will not translate to samples genetically close to their contaminant (i.e. early modern humans). Here, we use the F(A|B) statistic as a measure of divergence, as it varies little between individuals from the same population [63]. The value of this statistic for a contaminated sample is simply a linear combination of the values for the endogenous and contaminating genomes, i.e. Fobserved = c × Fcontaminant + (1 − c) × Fendogenous where c is the contamination rate. We set Fcontaminant to 0.275 (computed from the genotype calls of a present-day European genome, HGDP00521 [45]) and Fendogenous to 0.176 (Table S20 in [60]).

We note, however, that the two approaches measure slightly different contamination proportions. While AuthentiCT relies on every sequence, F(A|B) relies only on sequences that overlap informative positions. Therefore, AuthentiCT yields contamination estimates corresponding to the proportion of contaminating sequences, whereas the approach based on F(A|B) provides estimates corresponding to the proportion of bases that come from contaminating sequences. These estimates can differ if the length distributions of endogenous and contaminating sequences differ, as is the case for some datasets (Additional file 1: Supplementary Note 3). To get comparable estimates of contamination, we ran AuthentiCT on the subset of sequences that overlap the informative sites used for quantifying contamination based on the F(A|B) statistic (Fig. 5a; see Additional file 1: Table S2 for the contamination estimates per sequences). Using the same sequences may also account for potential differences in the proportion of contamination along the genome. The contamination estimates from both methods are highly correlated (Pearson’s coefficient, 0.98; p value < 10−15; Fig. 5a).

Fig. 5
figure 5

Contamination estimates in archaic and modern human DNA datasets. a Estimates from AuthentiCT (in red) and a method based on the proportion of shared derived alleles with a present-day human genome (in black; see Material and Methods). When available, we used 10,000 sequences from public archaic human DNA datasets generated by shotgun sequencing of single-stranded DNA libraries [22, 23, 58, 60]. For more details about the libraries (identifiers on the x-axis), see Additional file 1: Table S2. Bars represent 95% CI. b Estimates from AuthentiCT (in red; using 10,000 sequences) and contaminationX (in black [18];) for public DNA datasets generated by hybridization capture of single-stranded DNA libraries prepared from ancient modern human samples [48, 56]. For the Villabruna samples, the numbers after the identifiers correspond to the number of target sites in the corresponding hybridization capture experiment. Bars represent 95% CI

Finally, we applied AuthentiCT to ancient modern human DNA sequences generated by hybridization capture and with minimal amounts of present-day DNA contamination [48, 56]. We identified 19 DNA libraries derived from male specimens for which we could estimate contamination based on the proportion of observed alternative alleles on X-chromosome sequences (using contaminationX [18]), with a minimal coverage depth of 2 sequences, at least 100 informative positions and the HapMap CEU population as reference panel [64], excluding variants with a minor allele frequency lower than 5%. Both methods yielded similar low contamination estimates (Fig. 5b), which demonstrate that AuthentiCT can be applied to datasets generated by hybridization capture and with low levels of contamination (for estimates with ancient DNA from domesticated species, see also Additional file 1: Figure S4; these sequences are available in Additional file 2).

Separating ancient DNA and present-day DNA sequences

As most downstream applications assume the absence of contamination, it is often necessary to identify endogenous sequences in contaminated aDNA samples. A common way to achieve this is to compare models for each sequence whether it is endogenous or a contaminant, as done by PMDtools [41] using a likelihood-ratio test. Here, we investigate how the likelihoods of AuthentiCT contrasts with the ones from PMDtools (using recommended parameters) on the same artificial mixtures of Neandertal and present-day human DNA sequences described above. For computational convenience, we fitted the parameter values of AuthentiCT to datasets of 10,000 sequences. Note that once fitted on a subset of sequences, AuthentiCT can then be applied to larger datasets with million sequences akin to PMDtools.

A strict filter for sequences with at least one C-to-T substitution within the first or last three positions leads to about 30% of endogenous sequences being detected, with 0.4% of false positives (Fig. 6a; see Additional file 1: Figure S5 for comparison with a filter based on the number of C-to-T substitutions anywhere along the sequences). At the same false-positive rate, both PMDtools and AuthentiCT detect around 50% of ancient sequences, and the likelihood models allow fine-tuning of precision with recall (see Additional file 1: Figure S6 for the performance on other datasets). The performance of AuthentiCT and PMDtools is similar at low false-positive rates (< 0.02), which are most important for classifying sequences. However, AuthentiCT performs better for more ambiguous sequences and yields higher likelihoods for a contaminant origin for present-day DNA sequences (Additional file 1: Figure S7), which explains why it performs well for estimating contamination (Fig. 3). The two methods mostly differ in their classification of sequences that exhibit only one C-to-T substitution (Fig. 6b) in the internal part of sequences (Fig. 6c). Compared to AuthentiCT’s classification, there is an excess of sequences classified as ancient by PMDtools that exhibit non-deaminated bases adjacent to the C-to-T substitution (Fig. 6d). Some of these are therefore likely to represent polymorphisms or sequencing errors rather than deamination, indicating that AuthentiCT is more conservative when classifying these sequences as endogenous.

Fig. 6
figure 6

Classification of aDNA and present-day DNA sequences. a The receiver operating characteristic (ROC) curves illustrate the performance of AuthentiCT (solid) and PMDtools (dashed) to identify aDNA sequences. A sequence is considered ancient if the log-likelihood ratio (score) of an ancient versus present-day origin is equal to or higher than a threshold (different colours). Each point represents the average performance over 19 datasets (of 10,000 sequences each) with varying proportions of ancient and present-day DNA sequences (5 to 95% in steps of 5%) for AuthentiCT (circles), PMDtools (squares) and a filter for sequences exhibiting at least one C-to-T substitution within the first or last three positions (deam. filter; triangles). Sequences are from Mezmaiskaya 2 (libraries A9180, A9288, A9289 and R1917 [58]) and the present-day human control. The bars correspond to two standard errors. bd The distributions illustrate the number of C-to-T substitutions per sequences (b), the distance between a C-to-T substitution and the closest end of the sequence (c) or the closest non-deaminated base (d) in sequences classified as ancient only by AuthentiCT (left), only by PMDtools (middle) or both (right), using a score threshold of 3 as recommended for PMDtools [41]

Discussion

Estimating present-day DNA contamination in aDNA datasets remains a difficult task, particularly if the contaminating DNA is closely related to the DNA of the studied organism. Most approaches rely on genetic differences between the endogenous and contaminating genomes, which are often unknown a priori. Besides, the dependence on the same information used in downstream analyses is not desirable as contamination may confound signals of interest (e.g. modern human ancestry in a Neandertal genome may look like present-day human DNA contamination). Here, we developed an alternative method to estimate the proportion of present-day DNA contamination based solely on a model of aDNA damage.

AuthentiCT overcomes several shortcomings of previous methods for estimating contamination that are based on aDNA damage. Most notably, it uses every position that is potentially informative, irrespective of whether it is close to the end of a sequence or not, and accounts for clusters of C-to-T substitutions in the internal part of aDNA sequences. The latter represent a feature of aDNA deamination patterns that, to our knowledge, has not been described or exploited before. We demonstrated that this more detailed modelling of the distribution of C-to-T substitutions along aDNA sequences leads to more accurate estimates of the proportion of present-day DNA contamination than previous approaches. However, the performance of AuthentiCT and PMDtools to classify sequences with multiple C-to-T substitutions were almost identical, and filtering for aDNA sequences will yield very similar results for both methods. Thus, the improvement stems largely from a more confident detection of contaminant sequences. In contrast to PMDtools, a sequence with many non-deaminated bases is considered to be more likely a contaminant under the AuthentiCT model (Additional file 1: Figure S6).

It is important to note potential caveats. First, we assume the absence of significant levels of deamination in the contaminating DNA. This assumption does not always hold true (e.g. [13, 23, 59]) and would lead to an underestimate of the proportion of contamination (Additional file 1: Figure S8). One could test this by first identifying sequences that carry contaminant alleles at diagnostic positions in the mitochondrial genome and then checking for the presence of C-to-T substitutions [65]. Second, there may be populations of DNA fragments with different rates of damaged bases, even within a single sample, because of microstructural differences in DNA preservation, or because of different treatments [66]. These differences may lead to an overestimate of the proportion of present-day DNA contamination, as it would lead to an excess of sequences without C-to-T substitutions. Yet, amongst the 50 libraries that we tested (from 19 specimens), the confidence intervals of the contamination estimates from AuthentiCT and another independent method diverge only twice (library A9349 from the Goyet Q56-1 Neandertal, Fig. 5a, and library A0201 of Taforalt 14, Fig. 5b). As AuthentiCT can run on as few as 10,000 sequences (in 3–10 min, see Additional file 1: Figures S9 and S10 for evaluations of the run time), one could split the data by sequencing run, sequence length or other covariates to obtain stratified contamination estimates. Finally, AuthentiCT is not applicable to libraries generated after treatments that alter deamination patterns, e.g. uracil selection or treatment with a uracil-DNA-glycosylase (UDG) [67, 68].

In the future, it will be useful to extend its application to data where patterns of DNA damage differ, such as in double-stranded aDNA libraries [46]. In addition, contamination estimates might be further improved if additional features typical of aDNA are incorporated into the model, such as fragment length or the frequency of purines in the reference genome at positions flanking the sequence alignments [39].

Conclusions

AuthentiCT is useful for estimating contamination in small datasets, e.g. when screening ancient specimens with shallow sequencing depth or when the samples are badly preserved. The independence of the contamination estimates from genetic differences between the contaminating and endogenous genomes makes this method both particularly valuable for the study of ancient human samples and broadly applicable to other species.

Specimen

Archive

Accession number

Denisova 2

European Nucleotide Archive

PRJEB20653

Goyet Q56–1

European Nucleotide Archive

PRJEB21870

Hohlenstein-Stadel

European Nucleotide Archive

PRJEB29475

Les Cottés Z4–1514

European Nucleotide Archive

PRJEB21875

Mezmaiskaya 1

European Nucleotide Archive

PRJEB21195

Mezmaiskaya 2

European Nucleotide Archive

PRJEB21881

Sima de los Huesos specimens

European Nucleotide Archive

PRJEB10597

Scladina I-4A

European Nucleotide Archive

PRJEB29475

Spy 94a

European Nucleotide Archive

PRJEB21883

Taforalt specimens

Sequence Read Archive

SRP132033

Villabruna

European Nucleotide Archive

PRJEB13123

Vindija 33.19

European Nucleotide Archive

PRJEB21157

Vindija 87

European Nucleotide Archive

PRJEB21882