Using cluster analysis for grouping partial autosomal haplotypes derived from single sperm STR profiling

Background and objective: The use of single cell STR profiling for mixture deconvolution is increasingly being discussed in forensics; however, studies regarding STR profiling of single sperm are relatively rare. Considering that each sperm cell exclusively contains a haploid genome, STR profiling as well as grouping profiles from each single contributor to derive consensus profiles seems to be difficult. Thus, so far, the information obtained from gonosomal markers partially combined with previously performed whole genome amplification was used. For this study, we wanted to determine the quality of individual sperm analysis using our routine workflow and, assuming the results provided sufficient profiles, to establish means to cluster them.
Material and methods: In terms of a feasibility study, STR profiles of single sperm cells were examined using different multiplex kits and amplification conditions. Based on this database, a cluster analysis for grouping partial haploid autosomal profiles was successfully developed. Simulations were carried out to increase the database. Furthermore, the correlation between successful cluster analysis and the number of sperm, the quality of the profiles obtained and the number of contributors was investigated.
Results


Introduction
Many casework samples consist of DNA mixtures that require deconvolution so that the obtained alleles can be assigned to the individual contributors, a task that is not always possible; however, deducing the profiles of different contributors is particularly essential in cases where no suspects can be identified and ultimately only the comparison with DNA databases can lead to each single perpetrator. The deduction of individual genotypes of all mixture contributors can basically be achieved on two levels. On the one hand, a DNA profile conventionally typed from a mixed trace can be analyzed using specially developed biostatistical models containing a deconvolution function [5]. On the other hand, individual components can be separated before DNA extraction, which enables separate subsequent genotyping [7,8].
In cases of mixtures containing different cell types, cell type separation and subsequent investigation of a cell pool large enough to obtain a full STR profile can be successful [2,3,7,12,24]. The situation is different if the mixtures consist of morphologically indistinguishable cells from different contributors. This applies, for example, to a mixture consisting of blood or semen from more than one individual. In such cases, the investigation of increasingly small cell pools (down to 3 cells) or even single cells has proven to be a promising approach [3,13,15].
Although the first study on single cell STR profiling was published as early as 1997 [9], this technique is not very common in forensics, most likely because the amount of DNA in a single cell (approximately 6 pg in the case of a diploid one) is often insufficient to obtain a full profile. In addition to allelic or locus drop-out, further artefacts such as allelic drop-in and increased n − 4 or n + 4 stutter peaks, which are typical for low template profiles, complicate the interpretation of these profiles. Even so, some publications addressing various cell separation techniques such as laser microdissection [1,7,18], micromanipulation [13], fluorescence activated cell sorting (FACS) [22], or DEPArray technology [2,10,13,17,21,24] as well as different DNA extraction methods and amplification strategies, such as improved primer extension preamplification (I-PEP) PCR, low volume (LV) one-chip PCR, microfluidic droplet PCR and whole genome preamplification, have been published in recent years [11,15,16,20,23].
In addition to the complex but necessary techniques for the separation of single cells, special amplification systems or protocols were applied to account for the low initial amount of DNA in a single cell. In this context, for example, one-chip PCRs were frequently used [15]; these require a device that is found in only a few forensic DNA laboratories, where it does not serve as routine equipment. Only a small number of more recent studies used standard PCR protocols for single cell STR profiling. For example, informative partial STR profiles as well as complete consensus profiles for each of the two contributors from artificial epithelial cell mixtures could be obtained by Huffman et al. [13] using a 34 cycle AmpFlSTR® Identifiler® Plus PCR (Thermo Fisher Scientific, Waltham, MA, USA). Furthermore, in a previous study, complete or almost complete STR profiles for all contributors could be deduced from artificial as well as real blood-blood mixtures of up to three individuals using a 32 cycle PowerPlex® ESX 17 Fast (Promega, Madison, WI, USA) PCR [4]. In both studies mentioned, several partial profiles obtained from the same individual were combined into a consensus profile. Similar validation studies using standard PCR equipment have meanwhile been published [17,23].
Most of the studies dealing with single cell STR profiling were carried out on diploid cells. Only a few tried to process sperm cells, which could be a conceivable approach in cases of multiple rape, for example [14,19]. Considering that each sperm cell exclusively contains a haploid genome, STR profiling as well as grouping profiles from each single contributor to derive consensus profiles seems to be significantly more difficult. To enlarge the amount of DNA, Theunissen et al. [21] carried out whole genome amplification before STR profiling and were able to obtain partial autosomal profiles, showing an average allele recovery rate of 81% (sperm cells from fresh ejaculate) and 47-75% (for different mock samples), respectively. To group partial profiles derived from a single contributor, an X-chromosomal as well as a Y-chromosomal PCR were carried out additionally.
Encouraged by our results in the investigation of diploid single cells, we asked ourselves whether a preceding preamplification is mandatory. What quality can be achieved with individual sperm analysis using our established single cell workflow for diploid cells and, assuming the results provide sufficient profiles, how can they be grouped? Working without preamplification also means that only one PCR approach is possible. Supplementary examinations with X-chromosomal and Y-chromosomal systems, carried out for the purpose of grouping, are no longer possible. Therefore, the development of a (mathematical) method that enables reliable grouping of partial profiles is inevitably linked to this approach. Grouping cells using model-based clustering has already been published for diploid cells [6], but does this approach also work with haploid profiles? To the best of our knowledge, corresponding studies based on real data pools are not yet available. In terms of a feasibility study, STR profiles of single sperm cells were examined using the workflows established in our laboratory for examining diploid single cells. Based on this database, a method for grouping partial haploid autosomal profiles was developed.

Creating a data pool of autosomal haploid profiles
Single sperm cells were isolated from ejaculates of two healthy donors using the DEPArray™ NxT System and the CellBrowser software (Menarini Silicon Biosystems, Bologna, Italy) with the approval of the Bioethical Commission of the Ludwig Maximilians University of Munich. This technology enables single cells to be distinguished by immunofluorescent labels, verified by optical imaging and subsequently isolated using a computer-controlled semiconductor dielectrophoretic chip. To conduct the separation of single sperm cells, 30,000 sperm cells from each donor were first stained with allophycocyanin (APC) conjugated sperm head specific antibody and 4′,6-diamidino-2-phenylindole (DAPI) for the corresponding nuclei using the DEPArray™ Forensic Sample Prep Kit (Menarini Silicon Biosystems) according to the manufacturer's instructions.
DNA was isolated from each single sperm with the DEPArray™ LysePrep Kit (Menarini Silicon Biosystems) according to the manufacturer's instructions. To create full (diploid) reference profiles, DNA was extracted from 2 μl pure ejaculate of both donors using the Maxwell® RSC 48 instrument and the Maxwell® FSC DNA IQ™ Casework Kit as recommended by the manufacturer (Promega). The extracts were quantified using the Quantifiler™ Trio DNA Quantification Kit (Thermo Fisher Scientific) as suggested by the manufacturer and subsequently diluted to the recommended DNA input. Using the multiplex PCR PowerPlex® ESX 17 Fast and Fusion 6C Systems (Promega), the sex determining [...] according to our in-house validated protocol; apart from that, the manufacturer's instructions were followed. Determination of fragment length was performed on a 3500xl Genetic Analyzer (Thermo Fisher Scientific) according to the manufacturer's instructions. Data analysis was carried out using the GeneMapper® ID-X Software v1.4 (Thermo Fisher Scientific) and a detection threshold of 50 RFU. The Y-chromosomal markers (Fusion 6C) were not considered in the evaluation. In total, a data pool of 123 haploid autosomal profiles was created, consisting of 23 ESX profiles (donor 1, amplified using a 32 cycle PCR program), 79 ESX profiles (32 from donor 1 and 47 from donor 2, amplified using a 30 cycle PCR program) and 21 Fusion 6C profiles (donor 2, amplified using a 30 cycle PCR program). To assess the profile quality, the drop-out and drop-in rates of each group were calculated separately. For the Fusion 6C dataset, two different calculations were carried out, one including all 23 autosomal markers and a second including only the 16 autosomal markers that are also part of the ESX kit.
Rechtsmedizin 2 • 2024 109

Model development and simulations
We developed and tested a small variety of algorithms to reconstruct the genotypes from the haplotype data. A simple cluster procedure using complete-linkage clustering as implemented in the R function hclust (R version 4.2.2), based on a self-defined distance measure between haplotypes, performed best. We defined the distance between two haplotypes as the number of loci at which the two haplotypes showed different alleles. Loci were not counted if no allele was observed for at least one of the two haplotypes. To reduce the impact of drop-ins, we deleted all alleles that occurred only once at a locus before applying the cluster procedure to the haplotype data. Finally, all "alleles" of a cluster defined the reconstructed profile (diplotype) of one contributor.
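The procedure described above can be sketched in a few lines of Python. This is an illustrative re-implementation under stated assumptions, not the authors' R code: the function names are hypothetical, each cell is represented as a mapping from locus to a set of observed alleles, two cells are taken to "differ" at a locus when their allele sets share no allele, and the number of contributors k is assumed to be known when the cluster tree is cut.

```python
from collections import Counter
from itertools import combinations


def remove_singletons(haplotypes):
    """Drop every allele that occurs only once at its locus across all cells."""
    counts = Counter(
        (locus, allele)
        for hap in haplotypes
        for locus, alleles in hap.items()
        for allele in alleles
    )
    cleaned = []
    for hap in haplotypes:
        kept = {}
        for locus, alleles in hap.items():
            frequent = {a for a in alleles if counts[(locus, a)] > 1}
            if frequent:
                kept[locus] = frequent
        cleaned.append(kept)
    return cleaned


def distance(h1, h2):
    """Count loci typed in both cells at which no allele is shared."""
    return sum(1 for locus in h1.keys() & h2.keys() if not (h1[locus] & h2[locus]))


def complete_linkage(haplotypes, k):
    """Greedy agglomerative clustering; returns k lists of cell indices."""
    clusters = [[i] for i in range(len(haplotypes))]

    def linkage(a, b):
        # Complete linkage: the largest pairwise distance between members.
        return max(distance(haplotypes[i], haplotypes[j]) for i in a for j in b)

    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def reconstruct(haplotypes, k):
    """Return one reconstructed diplotype (locus -> alleles) per cluster."""
    cleaned = remove_singletons(haplotypes)
    profiles = []
    for members in complete_linkage(cleaned, k):
        profile = {}
        for i in members:
            for locus, alleles in cleaned[i].items():
                profile.setdefault(locus, set()).update(alleles)
        profiles.append(profile)
    return profiles


# Toy data (invented for illustration): four partial cells per donor.
cells = [
    {"L1": {10}, "L2": {20}, "L3": {30}, "L4": {40}},  # donor A
    {"L1": {11}, "L2": {20}, "L3": {31}, "L4": {40}},
    {"L1": {10}, "L2": {20}, "L3": {31}},
    {"L1": {11}, "L3": {30}, "L4": {40}},
    {"L1": {12}, "L2": {21}, "L3": {32}, "L4": {41}},  # donor B
    {"L1": {13}, "L2": {21}, "L3": {33}, "L4": {41}},
    {"L1": {12}, "L2": {21, 25}, "L3": {33}},          # allele 25 is a drop-in
    {"L1": {13}, "L3": {32}, "L4": {41}},
]
profiles = reconstruct(cells, k=2)
```

Note that two cells from the same donor still differ at roughly half of their jointly typed heterozygous loci, so same-donor distances are not zero; the clustering relies on cross-donor distances being larger still.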
To investigate the properties of the algorithm more precisely and on a larger data pool, haplotype data with varying drop-in (2% and 5%) and drop-out (37% and 54%) rates were simulated. A drop-out rate of 37% combined with a drop-in rate of 2% was chosen as condition A, 37% drop-out with 5% drop-in as condition B, and 54% drop-out with 2% drop-in as condition C. For each condition, we simulated 1000 replications of haplotype data sets for both donors with the given properties, leading to data pools A, B, and C. From these expanded data pools, 2 × 10, 2 × 20, 2 × 30, 2 × 40 and 2 × 50 haplotypes (the same number for each of the two donors) were randomly selected from A, B, and C for one cluster analysis. The random selection and cluster analysis were repeated 1000 times each. These analyses of two-person mixtures were performed to determine the number of sperm cells necessary per donor to yield a full (16 systems) and completely correct diploid profile for both donors. To investigate the effect of an increasing number of contributors on the quality of the cluster analysis, data pools for three additional donors were simulated (again under the three conditions A-C, each consisting of 1000 partial haplotypes). Cluster analysis (1000 per constellation) was performed based on 40 randomly selected partial haplotypes from each of the 2-5 donors. The quality of each cluster analysis was assessed by the number of wrong alleles (also called mismatches) per diplotype. Mismatches comprise all deviations from the actual donor alleles, i.e., incorrectly determined alleles as well as missing alleles.
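A minimal sketch of how such a simulation and the mismatch count could be set up is given below. The text does not specify the sampling model, so the details are assumptions: drop-out and drop-in are drawn independently per locus, a drop-in allele is drawn uniformly from a hypothetical per-locus allele pool, and all function and variable names are illustrative.

```python
import random

# Drop-out / drop-in rates of the three simulated conditions (from the text).
CONDITIONS = {"A": (0.37, 0.02), "B": (0.37, 0.05), "C": (0.54, 0.02)}


def simulate_haplotype(diplotype, dropout, dropin, allele_pool, rng):
    """Simulate one sperm cell profile from a diploid genotype.

    diplotype:   locus -> (allele_1, allele_2)
    allele_pool: locus -> list of alleles a drop-in may take (assumption)
    """
    hap = {}
    for locus, pair in diplotype.items():
        alleles = set()
        if rng.random() >= dropout:
            alleles.add(rng.choice(pair))  # one of the two parental alleles
        if rng.random() < dropin:
            alleles.add(rng.choice(allele_pool[locus]))
        if alleles:
            hap[locus] = alleles
    return hap


def count_mismatches(reconstructed, truth):
    """Wrong plus missing alleles of a reconstructed diplotype."""
    loci = reconstructed.keys() | truth.keys()
    return sum(
        len(reconstructed.get(locus, set()) ^ set(truth.get(locus, ())))
        for locus in loci
    )


# Example: 40 simulated cells of one invented donor under condition A.
donor = {"L1": (10, 11), "L2": (20, 20)}
pool = {"L1": [9, 10, 11, 12], "L2": [19, 20, 21]}
rng = random.Random(1)
dropout, dropin = CONDITIONS["A"]
simulated = [simulate_haplotype(donor, dropout, dropin, pool, rng) for _ in range(40)]
```

Cells simulated this way for several donors can be pooled and clustered, and count_mismatches then gives the quality measure per reconstructed diplotype.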

Results and discussion
Empirical data. The data pool created contains a total of 123 haploid profiles with 1-16 (ESX) or 22 (Fusion 6C) alleles. Alleles that occurred (sometimes in addition to a true allele) but did not correspond to the alleles of the corresponding donor were evaluated as drop-in. To compare the quality of the profiles obtained with different amplification strategies or kits (ESX with 32 or 30 cycles and Fusion 6C with 30 cycles, named ESX/32, ESX/30 and Fusion 6C/30, respectively), allele recovery and drop-in rates (per detected allele as well as per affected sample) were determined. The overall allele recovery rate ranged between 46% and 63% for the ESX/30 and ESX/32 datasets (Table 1). The locus-specific drop-out rate ranged between 34% (D19S433 and D12S391) and 72% (D2S1338) and between 24% (D3S1358 and D1S1656) and 67% (TPOX) when using ESX/30 and Fusion 6C/30 (23), respectively (Figs. 1 and 2). An increase in the drop-out rate can generally be observed with increasing fragment size. Irrespective of this, there are also indications of locus-specific increased drop-out rates (dataset Fusion 6C, locus D2S1338).
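The quality measures named above can be computed as in the following sketch. The denominators are assumptions, since the text does not spell them out: allele recovery is taken relative to one expected allele per locus and cell, and the drop-in rate is reported both per detected allele and per sample. The function name and the toy data are illustrative only, and the input is assumed to contain at least one cell with at least one detected allele.

```python
def profile_quality(cells, donor, n_loci):
    """Allele recovery and drop-in rates for single cell profiles of one donor.

    cells: list of dicts, locus -> set of observed alleles
    donor: dict, locus -> set of the donor's true alleles
    """
    expected = len(cells) * n_loci  # one true allele per locus in a haploid cell
    recovered = detected = drop_ins = affected = 0
    for cell in cells:
        cell_drop_ins = 0
        for locus, alleles in cell.items():
            true_alleles = donor.get(locus, set())
            recovered += bool(alleles & true_alleles)
            cell_drop_ins += len(alleles - true_alleles)
            detected += len(alleles)
        drop_ins += cell_drop_ins
        affected += cell_drop_ins > 0
    return {
        "allele_recovery": recovered / expected,
        "drop_in_per_allele": drop_ins / detected,
        "drop_in_per_sample": affected / len(cells),
    }


# Invented example: four cells typed at two loci, one drop-in (allele 14).
example_donor = {"L1": {10, 11}, "L2": {20}}
example_cells = [
    {"L1": {10}, "L2": {20}},
    {"L1": {11, 14}},
    {"L2": {20}},
    {"L1": {10}, "L2": {20}},
]
quality = profile_quality(example_cells, example_donor, n_loci=2)
```

The two drop-in figures can differ considerably: a single drop-in allele marks the whole sample as affected while adding only one allele to the count of detected alleles.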
For the Fusion 6C dataset, two different calculations were carried out, one including all 23 autosomal markers and a second including the 16 autosomal markers that are also part of the ESX kit. The determined values (61% and 60% allele recovery and a drop-in rate of 2%) for the 16 as well as the 23 STR loci amplified with Fusion 6C are almost identical. A similarly good allele recovery could only be achieved using ESX/32; however, with this combination, drop-ins occurred in almost 40% of the samples, which corresponds to 5% of the detected alleles. A reduction of drop-in events can be achieved by reducing the number of PCR cycles from 32 to 30, which in turn is accompanied by a significant loss of information (allele recovery decreases from 63% to 46%). As expected, the achieved allele recovery rate was below that of Theunissen et al. [21], who performed WGA before the actual STR PCR (81% compared to a maximum of 62% in our study; both values are based on the examination of fresh ejaculates).
Fig. 2 Locus-specific drop-out rates, dataset Fusion 6C (23). Loci are arranged on the X-axis according to their mean amplicon length (ascending) in the respective dye channel
Cluster analysis. Applying the developed cluster method to our empirical datasets ESX/30 and Fusion 6C/30 to reconstruct the haplotypes of the donors resulted in a complete reconstruction of each haplotype and thus underlines the correct function of the chosen approach.
Simulations. Assuming a 2-person mixture, the effect of the number of selected sperm cells per donor (2 × 20, 2 × 30, 2 × 40 and 2 × 50) on the quality of the cluster analysis for the different data pools A, B and C is shown in Fig. 3. Overall, the best results could be achieved using data pool A (drop-out rate 37%, drop-in rate 2%; Fig. 3). As expected, the proportion of complete and correct diploid profiles increases rapidly with the number of (randomly selected partial) haplotypes used for cluster analysis. Analysis based on 2 × 10 haplotypes produces diploid profiles with more than 1 mismatch in 95.5% of all cluster analyses. The use of 20, 30, 40 and 50 haplotypes per donor resulted in 58.7%, 86.2%, 92.2% and 96.5% complete and correct diploid profiles and in another 27.7%, 11.5%, 7.2% and 3.1% profiles with a maximum of only 1 mismatch per donor. While increasing the drop-out rate from 37% to 54% leads to a drastic worsening of the results (maximum of 68.3% complete and correct diplotypes when using 50 randomly selected partial haploid profiles per donor), increasing the drop-in rate from 2% to 5% has a much smaller impact (in comparison, 90.0% for 2 × 50 haplotypes).
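Why the number of cells per donor matters so much can be illustrated with a rough back-of-the-envelope calculation. It is not part of the study's simulations and rests on simplifying assumptions: loci are independent, all 16 loci are heterozygous (32 alleles), the drop-out rate is the same at every locus, drop-ins and clustering errors are ignored, and an allele only survives the singleton filter if it is detected in at least two cells of the donor. The function names are hypothetical.

```python
def p_allele_kept(n_cells, dropout):
    """P(a given allele of a heterozygous locus is detected in >= 2 of n cells)."""
    p = 0.5 * (1.0 - dropout)  # a cell carries the allele and it does not drop out
    q = 1.0 - p
    # Complement of "seen in no cell" and "seen in exactly one cell".
    return 1.0 - q ** n_cells - n_cells * p * q ** (n_cells - 1)


def p_profile_complete(n_cells, dropout, n_alleles=32):
    """P(all alleles of one donor are kept), treating alleles as independent."""
    return p_allele_kept(n_cells, dropout) ** n_alleles


# Compare the simulated cell numbers for the two drop-out rates used in the text.
overview = {
    (n, d): p_profile_complete(n, d)
    for n in (10, 20, 30, 40, 50)
    for d in (0.37, 0.54)
}
```

Under these assumptions the probability of a complete profile rises steeply with the number of cells and falls with the drop-out rate, which matches the qualitative trend of the simulations; the absolute values are not expected to match the reported percentages.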
The results for the effect of more than two donors on the quality of the cluster analysis are summarized in Fig. 4. Once again, the best overall results could be obtained from data pools with condition A. Complete and correct diplotypes could be achieved in 92.2%, 87.4%, 80.4% and 71.6% of all cluster analyses carried out on mixtures containing 2, 3, 4 and 5 contributors, respectively. For condition B (37% drop-out and 5% drop-in rate), the proportion of correctly derived diplotypes decreased (80.3%, 71.2%, 57.1% and 54.6% for 2-5 person mixtures), whereas cluster analysis based on data pools with condition C (54% drop-out and 2% drop-in rate) again showed the worst results, with only 60.7%, 28.5%, 13.5% and 5.0% correct diplotypes derived from mixtures consisting of 2, 3, 4 and 5 contributors. For further refinement of the cluster analyses, combinations with different shares of donor contributions (e.g., 10 sperm cells of donor 1 and 30 of donor 2) need to be examined in the future.
So far, the reconstruction of an autosomal (diploid) genotype from partial haplotypes has been carried out using the information obtained from gonosomal markers to group the haploid profiles that can be assigned to a single person. For this purpose, for example, multiplex PCR systems containing additional Y-chromosomal STRs were used [15]. Alternatively, whole genome amplification (WGA) was carried out beforehand to yield a sufficient amount of DNA to perform autosomal as well as X-chromosomal and Y-chromosomal multiplex PCRs for each single cell [16,21]. The Y-STR information obtained using the first mentioned method is only meaningful to a limited extent (due to the small number of Y-STRs per multiplex in connection with unavoidable drop-out events) and, moreover, only about half of the sperm cells contain a Y chromosome and thus are informative. WGA approaches are somewhat more labor intensive, but they appear to be a good way of successfully typing even a small number of sperm cells present in a mixture [21].
As shown in this study, the use of cluster analyses to group partial haploid profiles consisting only of autosomal markers appears to be a working alternative. The simultaneous examination of gonosomal markers is not necessary for grouping haploid profiles. The quality of the cluster analysis depends heavily on the completeness of the haplotypes. The better the allele recovery rate, the fewer sperm per donor are needed. Drop-ins, on the other hand, are usually identified as such in cluster analyses and have less influence on the success of the cluster analysis. When selecting and optimizing the amplification system, neither the drop-in rate nor the goal of being able to study additional gonosomal systems but the allele recovery rate seems to be crucial and should be the focus; however, the use of WGA before the actual STR typing can still be useful, especially in cases where there are only a few sperm. On the one hand, Theunissen et al. were able to obtain a higher allele recovery rate with upstream WGA (81% compared to a maximum of 62% in our study; both values are based on the examination of fresh ejaculates [21]). On the other hand, upstream WGA offers the possibility of carrying out several different autosomal multiplex PCRs per spermatozoon and thus could lead to an increasing number of divergent profiles available for a cluster analysis.
Fig. 4 Effect of an increasing number of contributors on the quality of the cluster analysis assuming drop-out/drop-in rates of 54%/2%, 37%/5% and 37%/2% and a number of 40 sperm cells per contributor for the simulations

Conclusion
- From a pool of partial haploid profiles of several individuals, generally reliable grouping can be obtained by cluster analysis and correct diploid profiles can be derived for each contributor.
- In terms of a proof of principle study, it could be shown that the grouping of partial haploid profiles is also possible without the simultaneous examination of gonosomal markers.
- However, the fewer sperm per person are available for analysis, the more the completeness of the haploid profile affects the quality of the cluster analysis.
- The question of whether routine STR profiling without prior preamplification using WGA is sufficient to obtain correct and meaningful profiles for all contributors involved thus depends crucially on how many sperm cells per donor are present in the mixture.

Fig. 1 Locus-specific drop-out rates, dataset ESX/30. Loci are arranged on the X-axis according to the mean amplicon length (ascending) in the respective dye channel

Table 1 Allele recovery and drop-in rate for all profiles obtained by amplification with the PowerPlex® ESX 17 Fast and Fusion 6C systems using a 32 or 30 cycle program. The Fusion 6C dataset was evaluated twice, including all 23 and only the 16 autosomal ESX markers