Cell-type-specific consequences of mosaic structural variants in hematopoietic stem and progenitor cells

Grimes, Karen; Jeong, Hyobin; Amoah, Amanda; Xu, Nuo; Niemann, Julian; Raeder, Benjamin; Hasenfeld, Patrick; Stober, Catherine; Rausch, Tobias; Benito, Eva; Jann, Johann-Christoph; Nowak, Daniel; Emini, Ramiz; Hoenicka, Markus; Liebold, Andreas; Ho, Anthony; Shuai, Shimin; Geiger, Hartmut; Sanders, Ashley D.; Korbel, Jan O.

doi:10.1038/s41588-024-01754-2

Cell-type-specific consequences of mosaic structural variants in hematopoietic stem and progenitor cells

Article
Open access
Published: 28 May 2024

Volume 56, pages 1134–1146, (2024)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue Submit your manuscript

Cell-type-specific consequences of mosaic structural variants in hematopoietic stem and progenitor cells

Download PDF

Abstract

The functional impact and cellular context of mosaic structural variants (mSVs) in normal tissues is understudied. Utilizing Strand-seq, we sequenced 1,133 single-cell genomes from 19 human donors of increasing age, and discovered the heterogeneous mSV landscapes of hematopoietic stem and progenitor cells. While mSVs are continuously acquired throughout life, expanded subclones in our cohort are confined to individuals >60. Cells already harboring mSVs are more likely to acquire additional somatic structural variants, including megabase-scale segmental aneuploidies. Capitalizing on comprehensive single-cell micrococcal nuclease digestion with sequencing reference data, we conducted high-resolution cell-typing for eight hematopoietic stem and progenitor cells. Clonally expanded mSVs disrupt normal cellular function by dysregulating diverse cellular pathways, and enriching for myeloid progenitors. Our findings underscore the contribution of mSVs to the cellular and molecular phenotypes associated with the aging hematopoietic system, and establish a foundation for deciphering the molecular links between mSVs, aging and disease susceptibility in normal tissues.

Dissecting single-cell genomes through the clonal organoid technique

Article Open access 18 October 2021

Somatic mutations reveal asymmetric cellular dynamics in the early human embryo

Article 22 March 2017

Long-read whole-genome analysis of human single cells

Article Open access 24 August 2023

Main

Somatic subclonal (mosaic) mutations are present in nearly all tissues and accumulate with age^1,2,3,4,5,6, yet their role in human health and disease is underexplored. Somatic structural variants, which comprise copy-number alterations (CNAs) and copy-neutral rearrangement classes, are the most common class of driver mutation in cancer^7,8. Previous studies have associated mosaic CNAs in aged donors with unusual blood cell counts and susceptibility to age-associated diseases^2,9,10,11,12, which underscores the potential for mSVs to alter molecular phenotypes in healthy individuals upon aging. However, the molecular processes behind these associations, which are anticipated to vary by cell type, are poorly understood.

Detecting mSVs poses an important technical challenge^7,11, with bulk whole-genome sequencing (WGS) typically unable to differentiate cell types and identify mSVs present with a low variant allele frequency (VAF). Additionally, WGS of single-cell-derived clones is limited to mSVs that can be cultured long-term, potentially biasing against mSVs exhibiting large segmental aneuploidies^7,13,14. Single-cell sequencing offers a solution in theory, yet most methods are suited only for detecting large CNAs, yielding an incomplete understanding of mSVs¹⁵.

Here we utilize Strand-seq, a haplotype-resolved single-cell sequencing technique^14,16,17, to investigate the functional impact of mSVs. We focus on the blood compartment, where mosaic CNAs have been documented in aged donors^2,11,18,19. Strand-seq allows resolving of diverse mSV classes, including de novo structural rearrangements, by analyzing their unique ‘diagnostic footprints’ utilizing the scTRIP framework¹⁴. Additionally, Strand-seq simultaneously yields nucleosome occupancy profiles from each single cell, generated via micrococcal nuclease (MNase) digestion¹⁶, which can be used to analyze the functional consequences of structual variants with the scNOVA framework²⁰. In 1 of every 43 hematopoietic stem and progenitor cells (HSPCs), we detect de novo mSVs, which emerge regardless of age. We resolve the cell-type identity of mSV-bearing cells, revealing they are commonly enriched in myeloid progenitors and exhibit aberrant pathway activity previously associated with aging.

Results

Single-cell-resolved mSV landscapes in HSPCs

To study mSV formation in HSPCs with cell-type-specific resolution, we analyzed cells from 19 healthy donors—ranging from newborn to 92 years of age—composed of n = 3 umbilical cord blood (UCB) and n = 16 bone marrow samples (Fig. 1a). We isolated viable CD34⁺ HSPCs (Supplementary Fig. 1) and cultured them for one cell division to enable Strand-seq (Methods). We obtained 1,133 high-quality single-cell libraries, with a mean of 432,282 uniquely mapped fragments per cell (Supplementary Fig. 2 and Supplementary Table 1). We used scTRIP¹⁴ to discover mSVs and whole chromosome aneuploidies (herein, collectively called ‘mosaicisms’), both in single cells and in subclones. Altogether, we identify 51 independently arisen mosaicisms, occurring in 16 of 19 (84%) donors (mean per donor = 2.7; range 0–8), including: 22 deletions, 12 duplications, 3 complex mSVs involving three or more breakpoints, 1 balanced inversion and 13 chromosomal losses (Fig. 1b and Supplementary Table 2). These mosaicisms affect 17 of 24 chromosomes and exhibit no chromosomal enrichment except for the Y chromosome, which was independently lost once or multiple times (leading to mosaic loss of Y (LOY)) in 8 of 12 (67%) male donors.

**Fig. 1: HSPCs acquire a wide diversity of mSVs with age, without increased chromosomal instability.**

Investigating the subclonal composition of each mosaicism (Supplementary Table 2), we find 32 that are detected in only 1 cell (‘singleton mosaicism’), while the remaining mSVs constitute subclones with a cell fraction (CF) of 1.6–56.1% (‘subclonal mosaicism’). While subclones with sex chromosome losses (n = 12 LOY; n = 1 loss of X) reach CFs up to 46.4%, we do not observe whole autosomal aneuploidies. Focusing our further investigation on the 38 autosomal mSVs, we find notable differences between singleton and subclonal mosaicisms. First, 21 of 31 singleton mSVs (68%) exhibit terminal gains or losses, whereas all seven subclonal mSVs comprise interstitial alterations. Second, all complex mSVs are singletons. These include a breakage fusion bridge cycle-mediated¹⁴ mSV on chromosome 20p, and a terminal amplification of 1q (Fig. 1c). Third, singleton mSVs are nearly 18 times larger on average than subclonal mSVs (36.9 versus 2.1 megabase pairs (Mb), respectively; P = 0.0009, Wilcoxon rank-sum test; Fig. 1d). These data indicate that singleton mSVs, detected in 1 of every 43 HSPCs, bear the characteristics of de novo rearrangements (Supplementary Notes), suggesting that not all mSVs have the same potential to form appreciable subclones.

Analyzing these data with respect to donor age shows subclonal mSV expansions (Pearson’s correlation; R = 0.16; P = 1.1 × 10⁻⁷) and sex chromosome losses (R = 0.087; P = 0.0034) are associated with increased age (Fig. 1e), consistent with previous studies of mosaic CNAs^2,11,18,19. Conversely, singleton mSVs are uncorrelated with age (R = 0.008; P = 0.79; Fig. 1e), suggesting continuous acquisition throughout life. Instead, we observe elevated numbers of de novo mSVs in cells already containing a subclonal mosaicism versus unmutated cells (Fisher’s exact test; 4.76% versus 1.96%; P = 0.038; Fig. 1f), suggesting that mSV-harboring cells may be ‘predisposed’ to accumulate further rearrangements.

Hotspots of mSV formation

Since DNA double-strand breaks (DSBs) can trigger structural rearrangements^7,21, we examined the correlation between DSB acquisition and donor age. Strand-seq enables the detection of sister chromatid exchanges (SCEs) to allow systematic mapping of DSBs following homologous repair¹⁶. We identified 4,528 SCEs in our dataset (~4 SCEs per cell, consistent with previous reports²²; Extended Data Fig. 1a). SCE abundance is inversely correlated with age (R = −0.089; P = 0.0027; Fig. 1g and Supplementary Fig. 3), with on average 4.6 SCEs per cell in individuals <60, compared with 3.9 SCEs per cell in donors >60 (Extended Data Fig. 1a). With HSPCs exhibiting largely stable acquisition of mSVs and SCEs regardless of age, these data suggest mSV formation occurs consistently throughout life.

Since structural rearrangements can be influenced by local sequence context⁷, we analyzed the genomic locations of SCEs and mSVs. The skewed distribution of SCEs along chromosomes is even more pronounced than that of mSVs (Fig. 1b and Supplementary Fig. 4): 6.67% (302 of 4,528) cluster into 20 SCE ‘hotspots’ (Methods, Extended Data Fig. 1b and Supplementary Table 3), of which five (25%) coincide with common fragile sites²³ (CFSs) (Supplementary Table 4). Notably, SCEs overlap significantly with mSV breakpoints, with 3% (133 of 4,528) of all SCEs intersecting an mSV breakpoint (P < 0.0001, derived from 10,000 permutations; Fig. 1h,i, Extended Data Fig. 1c–f and Supplementary Table 3). While CFSs are enriched for both SCEs (P < 0.0002) and mSV breakpoints (Extended Data Fig. 1g,h), we identify additional SCE hotspots with similar enrichments not previously identified as CFSs (Fig. 1b,j, Supplementary Fig. 5 and Supplementary Tables 2–4). These loci may therefore represent mSV hotspots in HSPCs.

High-precision cell-typing using nucleosome occupancy profiles

To investigate the cell-type-specific impact mSVs exert on HSPCs, we utilized a two-pronged approach by coupling single-cell mSV analysis with nucleosome occupancy-based functional profiling²⁰. First, to develop nucleosome occupancy-based cell-type classifiers²⁰, we constructed single-cell nucleosome occupancy reference profiles for HSPCs derived from both UCB and bone marrow, covering eight distinct cell types: hematopoietic stem cells (HSCs), multipotent progenitors (MPPs), lymphoid-primed multipotent progenitors (LMPPs), common lymphoid progenitors (CLPs), plasmacytoid dendritic cells, common myeloid progenitors (CMPs), granulocyte–macrophage progenitors and megakaryocyte–erythroid progenitors (MEPs) (Fig. 2a and Supplementary Fig. 6). Using well-defined immunophenotypes (Supplementary Table 5 and Supplementary Fig. 6) we index-sorted HPSCs, and devised a preamplification-free single-cell MNase sequencing (scMNase-seq) protocol (Methods) to characterize the single-cell nucleosome occupancy profile for each cell type.

**Fig. 2: scMNase-seq atlases for eight distinct HPSCs enable cell-type-aware single-cell multiomic profiling.**

We obtained 480 high-quality scMNase-seq libraries (Supplementary Table 6): 305 from bone marrow-derived HSPCs (1 donor) and 175 from UCB-derived HSPCs (5 donors) (Supplementary Table 1). Using scNOVA, we identify 899 and 819 genes exhibiting cell-type-specific nucleosome occupancy in the UCB- and bone marrow-derived datasets, respectively (Fig. 2b and Extended Data Fig. 2a). The cell-type-specific gene activities inferred from nucleosome occupancy²⁰ are broadly consistent with published transcriptomic datasets²⁴ (Fig. 2c). For example, from the bone marrow-derived nucleosome occupancy dataset, we infer increased activity of the canonical marker MME (CD10) only in CLPs²⁵, while HDC (involved in myeloid-lineage priming²⁶) exhibits increased activity in CMPs. We also observe differential nucleosome occupancy at genes not previously reported as HSPC markers, such as SH2D4B and FAT3 (Supplementary Table 7a,b).

Harnessing these gene sets, we utilized nucleosome occupancy measurements as features for developing supervised cell-type classification models using partial linear square discriminant analysis (PLS-DA) (Fig. 2d–f, Extended Data Fig. 2, Supplementary Table 7a,b and Methods). These classifiers provide excellent accuracy, with an average area under the curve of 0.97 for bone marrow and 1.00 for UCB, as estimated by leave-one-out cross-validation (Fig. 2d and Extended Data Fig. 2). Uniform manifold approximation and projection (UMAP) of the latent variables corroborate the discriminatory power of these classifiers compared with unsupervised classification (Fig. 2e,f and Extended Data Fig. 2).

Subclonal mSVs commonly exhibit a lineage bias

Having constructed nucleosome occupancy references for HSPCs, we next performed cell-typing of each Strand-seq library (Fig. 3a and Supplementary Table 8). Tissue-level cell abundances detected based on nucleosome occupancy show high consistency with previous studies^24,27,28,29, including an expanded HSC frequency in older bone marrow donors (from 8.1% to 80%; false discovery rate (FDR)-adjusted P (P_adj) = 0.013; mixed linear model analysis), and a greater abundance of MPPs in UCB versus bone marrow^24,27 (37% versus 0.1%; P_adj = 2.45 × 10⁻³³; Fisher’s exact test; Extended Data Fig. 2). Furthermore, the cell-type compositions seen in Strand-seq closely resemble estimates from orthogonal single-cell RNA sequencing (scRNA-seq) data generated from two donors (BM65, BM712), independently verifying our nucleosome occupancy-based classifiers (Fig. 3b and Supplementary Fig. 8).

**Fig. 3: mSVs in HSPCs frequently exhibit a cell-type bias.**

We next explored the cellular context of mSVs. Of the 19 subclonal mosaicisms found, 8 (42%) show significant cell-type enrichments (FDR 10%; Fig. 3c,d and Supplementary Figs. 9 and 10); and, when considering only subclonal mSVs (that is, removing sex chromosome losses), 5 of 7 (71%) show significant biases. Here, we find predominantly myeloid skewing, with 5 of 5 (100%) of the cell-biased subclonal mSVs enriched in either myeloid or myelo-primitive cell types (Fig. 3e). These lineage-biased events include: a 10-Mb inversion on chromosome Xq12-Xq21.1 enriched in MEPs (BM65); a 1-Mb duplication at 13q enriched in MEPs (BM70); a 300-kilobase (kb) duplication at 19q enriched in CMPs (BM63); and two sequentially arisen deletions at 17p (1.2 Mb) and 17q (500 kb) enriched in both CMPs and HSCs (BM712).

By comparison, sex chromosome losses exhibit more variability, with cell-type enrichments seen in only 3 of 12 (25%; all LOYs) and each of these exhibiting bias for a different cell type: MEP, LMPP and HSC, respectively (Supplementary Figs. 10 and 11). This suggests that the functional impact of LOY is less pronounced or more context-specific³⁰. Furthermore, singleton mSVs do not show cell-type enrichment (Supplementary Fig. 10), suggesting that lineage biases seen in subclonal mSVs are due to their impact on cellular function, rather than biased acquisition in a specific cell type.

Remarkably, despite the diverse genomic loci affected by subclonal mSVs, there is a notable convergence on certain molecular phenotypes. Specifically, the Ras and JAK/STAT signaling pathways, as well as lipid metabolism—previously associated with clonal hematopoiesis (CH) and leukemia^31,32—are recurrently altered (Fig. 3d–f, Supplementary Figs. 12 and 13 and Supplementary Tables 9, 10 and 17). These data link mSVs to common changes in aging-related pathways.

Cell-type-specific impact of an inversion

The molecular consequences of mosaic inversions are underexplored, since most studies are biased towards CNAs^7,11. We therefore investigated the Xq12-Xq21.1 inversion (‘Xq-Inv’), seen in 22.6% (19 of 84) of cells from a 65-year-old female donor (BM65; Fig. 4a). Nucleosome occupancy analysis²⁰ confirms the inversion lies on the active X-homolog (Supplementary Fig. 14), supporting its potential for mediating functional effects. We refined the inversion breakpoints³³ (Methods) to chrX:66753519–76960327, with confidence intervals of ~10 kb and ~18 kb, respectively. While neither breakpoint directly overlaps a gene, the inversion is predicted to fuse two topologically associating domains (TADs) by disrupting their annotated boundaries (Fig. 4b), putatively altering the respective gene regulatory environments³⁴.

**Fig. 4: Mosaic inversion contributes to MEP-biased cell fate and subclonal expansion of HSPCs through *cis* and *trans* effects.**

To investigate the potential impact of the inversion, we interrogated haplotype-resolved nucleosome occupancy profiles at cis-regulatory elements (CREs) to infer chromatin accessibility for each homolog²⁰. Using a haplotype-aware sliding window analysis (Methods), we normalized nucleosome occupancy between the active and inactive X, and compared Xq-Inv cells with unmutated cells from the same donor. We identify 13 peak regions with significantly altered nucleosome occupancy (10% FDR; Fig. 4b), with 4 (31%) located within one of the affected TADs. The strongest peak fell into an intergenic region and showed decreased nucleosome occupancy on the inverted haplotype, indicating increased chromatin accessibility²⁰. This peak is located adjacent to the androgen receptor gene (AR). Closer analysis shows three annotated AR enhancers fall within this peak (Supplementary Table 11), all residing in the fused TAD (Fig. 4b and Supplementary Fig. 14). These data suggest AR as a potential target of gene dysregulation and contributor to subclonal expansion. Indeed, androgens are used to treat bone marrow failure syndromes by inducing HSPC proliferation, albeit with an incompletely understood mode of action³⁵.

To study the downstream effects of the Xq-Inv, we performed a genome-wide search for differential gene activity²⁰, comparing the nucleosome occupancy of gene bodies between Xq-Inv and unmutated cells (Methods). We find 123 genes displaying differential nucleosome occupancy (Fig. 4c and Supplementary Table 10)—all of which reside outside the inversion locus—suggesting strong trans effects of Xq-Inv. Gene set over-representation analysis reveals dysregulation of several AR-related pathways, including Ras signaling and erythropoietin signaling (10% FDR; Fig. 4d and Supplementary Table 12). Erythropoietin signaling, for example, contributes to an erythroid-bias of HSCs in association with elevated AR activity^36,37. Finally, TF-target enrichment analysis²⁰ reveals three TFs with differential activity in Xq-Inv cells: EGR1, RUNX1 and IKZF1—all of which are linked to AR signaling (Supplementary Fig. 15). These data independently suggest AR activation as a result of Xq-Inv.

Notably, all three TFs have previously been reported to play critical roles in MEPs^38,39,40, hinting that AR activation could be a key factor in the enrichment of MEPs within the Xq-Inv subclone (Fig. 4e). To explore this, we performed a cell-type-aware nucleosome occupancy analysis in the AR gene-body, revealing elevated AR activity from the rearranged homolog in HSCs, but not in MEPs (10% FDR; Supplementary Fig. 16). Likewise, upon testing AR target genes (Supplementary Table 13) we infer increased activity in HSCs, but not MEPs, with Xq-Inv (10% FDR; Fig. 4f and Supplementary Fig. 15), indicating HSC-specific AR overactivation in Xq-Inv cells. Consistent with this, Xq-Inv HSCs contain unique differential nucleosome occupancy peaks (10% FDR), including at two AR enhancers (Fig. 4g and Supplementary Fig. 17). These enhancers, which contain binding sites for EGR1, RUNX1 and IKZF1, are more accessible in HSCs, suggesting cell-type-specific enhancer activities (Supplementary Fig. 18). Finally, where these HSCs show regulatory changes consistent with elevated AR signaling (with 3 of 4 differential nucleosome occupancy genes representing annotated AR targets), Xq-Inv myeloid cells (CMPs and MEPs) show a more diffuse signal (with 23 of 105 and 12 of 55 differential nucleosome occupancy genes being AR targets, respectively) (Supplementary Table 9 and Supplementary Fig. 15). Among the MEP-specific genes, we infer high activity of RIT1 (P_adj = 0.0057), a gene whose overexpression has been implicated in CH with MEP expansion⁴¹. Comparing the scRNA-seq data from BM65 with HSPCs from the Human Cell Atlas bone marrow cohort⁴² shows significant enrichment for AR activity in BM65 versus the Human Cell Atlas cohort in HSCs and MEPs, but not LMPPs (Supplementary Fig. 19). These findings are in line with androgen-mediating erythropoiesis through AR-dependent pathways⁴³. They further imply HSC-specific AR overactivity, with a ‘priming’ role of Xq-Inv biasing cells towards megakaryocyte–erythroid lineages.

Stepwise accumulation of mSVs in HSPCs

While our data indicate that mSVs impact molecular phenotypes, how subclonal expansions are facilitated in cells harboring more than one co-existing mSV is unclear. We explored subclone dynamics in a 71-year-old male donor (BM712) exhibiting five distinct subclones, three of which demonstrate cell-type bias (FDR 10%; Fig. 5a). Of the 123 cells sequenced, 103 (84%) harbor at least one subclonal mosaicism, including two interstitial deletions and three LOYs (Fig. 5a,b). We tracked the subclonal evolution of BM712 using shared mSVs. One subclone (26% CF) shows LOY as the only mSV event and is enriched for HSCs. The four other subclones trace back to a ~1.2-Mb deletion at 17p11.2 (17p-Del), seen in 56% of cells, followed by the progressive acquisition of a ~500-kb deletion at 17q11.2 (17q-Del) and two independent LOYs (Fig. 5c–e). Bulk WGS of CD34⁻ cells verified the subclonal 17q-Del and 17p-Del events (Fig. 5e and Supplementary Fig. 20), and revealed both mSVs are carried into mature blood cells.

**Fig. 5: mSV accumulation in a single donor associates with clonal expansion.**

To explore the functional impact of the initiating mSV (17p-Del), we compared the gene-body nucleosome occupancy of 17p-Del cells with unmutated cells from BM712 using scNOVA, identifying 76 dysregulated genes (10% FDR; Fig. 5f). TF-target over-representation analysis²⁰ shows enrichment for the targets of seven TFs, with the most significant being SREBF1 (P_adj = 0.0047) (Supplementary Fig. 21). This gene is hemizygously deleted by 17p-Del, while the other six TFs fall outside the deletion, suggesting a potential important role for SREBF1 loss in the molecular phenotype of 17p-Del cells (Fig. 5d). Protein–protein interaction mapping of all seven dysregulated TFs using STRING⁴⁴ (Supplementary Methods) reveals a significant protein–protein interaction network connecting all TFs (P = 3.57 × 10⁻⁸; Supplementary Fig. 21), highlighting their functional relationship (Supplementary Notes). Pathway enrichment analysis shows this network is enriched for MAPK signaling components (P_adj = 0.0028), previously linked to cell-cycle activation in aging HSCs⁴⁵. Finally, gene set over-representation analysis of all 76 dysregulated genes supports MAPK activation (Fig. 5g), along with dysregulation of lipid homeostasis, a contributor to increased myelopoiesis⁴⁶. Taken together, this suggests that 17p-Del triggers increased MAPK activity, potentially driving myeloid-biased clonal expansion through hemizygous SREBF1 loss.

We next investigated the consequences of 17q-Del, seen in a subclone with 43.1% CF. This deletion disrupts the NF1 tumor suppressor via hemizygous loss of protein-coding exon 1 (Fig. 5e and Supplementary Figs. 22 and 23). In addition to its well-understood roles in cancer⁴⁷, NF1 has been nominated as a CH driver by single nucleotide variant (SNV) analysis⁴⁸ (Supplementary Notes), suggesting that the 17q-Del may fuel HSPC clonal expansion. Using scNOVA, we find 112 dysregulated genes in 17q-Del cells. Pathway over-representation analysis also shows altered metabolism and upregulated mTOR signaling in the subclone (Supplementary Fig. 24). Given the known critical role of NF1 in mTOR signaling⁴⁹, and the role of mTOR signaling in cell proliferation and HSPC differentiation⁵⁰, these findings suggest that the 17q-Del induces mTOR dysregulation, potentially fostering subclonal expansion.

To further characterize these subclones, we generated 4,114 scRNA-seq libraries from CD34⁺ cells isolated from BM712 (Supplementary Fig. 25), and assigned HSPC cell types to the data using a transcriptome reference of human blood⁵¹ (Fig. 5h). To molecularly phenotype the deletion subclones, we capitalized on the fact that copy-number-imbalanced mSV classes can be utilized for targeted re-calling of CNAs in scRNA-seq data²⁰ (Methods), allowing characterization of mSV-bearing cells across a widened dynamic expression range. Using this approach, we infer that 2,571 (63%) scRNA-seq cells bear the 17p-Del, 1,841 (45%) contain the 17q-Del and 995 (24%) exhibit LOY (Supplementary Table 14)—CFs similar to the Strand-seq analyses. Co-occurrence analyses of these mosaicisms corroborate the subclonal structure identified using Strand-seq (Supplementary Fig. 26). Finally, the scRNA-seq data also verify the inferred lineage biases, with 17p-Del cells enriched for CMPs and LMPPs (P_adj = 2.0 × 10⁻¹¹, P_adj = 0.0064; Fisher’s exact test), and both 17q-Del and LOY cells enriched for HSCs (P_adj = 2.6 × 10⁻¹⁴, P_adj = 1.0 × 10⁻⁵⁶; Fisher’s exact test; Fig. 5i and Supplementary Fig. 25).

Having located the mosaic subclones in the scRNA-seq data, we more deeply characterized their molecular phenotypes controlled by cell type. First, gene ontology analysis of the differentially expressed genes between HSCs with and without LOY identifies pathways linked to HSC quiescence^52,53 (10% FDR; Supplementary Tables 15 and 16), potentially explaining the observed HSC enrichment of LOY in BM712. Next, we confirm a distinct transcriptional profile for 17q-Del cells, with differential activity seen for 16 pathways (Molecular Signatures Database (MSigDB) Hallmark; Supplementary Tables 15 and 16 and Supplementary Fig. 27) including those related to HSPC proliferation, differentiation and metabolism. These pathways include MYC and mTOR signaling through mTORC1—two known downstream effectors of somatic NF1 inactivation^49,54—which can be linked to HSC expansion and inhibition of differentiation^55,56. Indeed, we find 17q-Del cells are significantly enriched for HSCs compared with 17p-Del cells (P_adj = 2.1 × 10⁻⁵; Fig. 5g), potentially mediated through MYC and/or mTORC1 upregulation^55,56. Finally, 17q-Del cells show an altered DNA damage response, with decreased expression of BRCA1, BRCA2, FANCI and BLM—implying these cells might be prone to acquire further alterations. Together, this suggests that BM712 underwent a stepwise acquisition of a potentially ‘higher-risk’ molecular phenotype; first, HSCs were enabled to exit quiescence and bias their differentiation (17p-Del); and, second, cells became more proliferative and HSC-like, and potentially more permissive to acquiring further mutations.

Finally, we explored the presence and functional impact of these mSVs in scRNA-seq data generated from terminally differentiated CD34⁻ blood cells. We annotated 2,965 cells into eight cell types using published reference data⁵⁷ (Fig. 5j and Supplementary Fig. 28), and performed targeted CNA re-calling⁵⁸. Notably, we find a significant enrichment for monocytes in 17p-Del cells (Fig. 5k), a circulating downstream progeny of CMPs. These data underscore that these mSVs, identified in HPSCs, could impact peripheral blood cells. In contrast, our efforts to re-detect CNAs within the smaller 17q-Del region were unsuccessful due to its limited number of expressed genes, underscoring the superior capability of Strand-seq in functionally characterizing mSVs relative to scRNA-seq.

Functional effects of mSVs in blood samples

To extrapolate these findings to a larger cohort of blood samples, we interrogated the UK Biobank cohort⁵⁹. The phenotypic data paired with whole-exome sequencing (WES) data from 469,792 donors⁵⁹ provide the opportunity to study somatic mutations in relation to blood counts. Focusing on our top hits—NF1, SREBF1 and AR—we extracted rare (minor allele frequency (MAF) < 1%) SNVs and small (<50 bp) insertion and deletion variants (INDELs) from UK Biobank samples, and classified these based on their predicted impact (Supplementary Table 18). Since CNA losses affecting both the 17p-Del and 17q-Del regions were previously documented^2,60, we additionally made use of WES-based CNA calls⁶⁰ which we analyzed by burden testing (Methods). We first concentrated on the 17p-Del and 17q-Del regions, analyzing gene-disrupting SNVs. We find a bimodal VAF distribution for NF1 and SREBF1 predicted loss-of-function (pLoF) SNVs, but not for rare synonymous and rare missense variants (Fig. 6a). These data indicate that gene-disrupting pLoF SNVs represent a common source of mosaicism at these loci. Furthermore, they emphasize the link between gene-disrupting mSVs affecting SREBF1 and NF1, and clonal expansions in normal blood.

**Fig. 6: Functional effects of mSVs are supported by re-analysis of UK Biobank data.**

Furthermore, at the SREBF1 locus, we find CNA losses and pLOF SNVs are independently associated with altered blood counts (n = 2 losses and n = 74 pLOF SNVs; Supplementary Table 18 and Supplementary Figs. 29 and 30), with the SREBF1 gene being among the strongest hits within the 17p-Del region for several categories, including elevated total leukocytes (P_adj = 0.00012; loss) and elevated monocytes (P_adj = 0.0012; loss) (Fig. 6b and Supplementary Fig. 29). These findings independently support that SREBF1 loss may contribute to a cell-type bias in leukocytes, specifically towards monocytes. When repeating the same analysis for all genes in the 17q-Del region, we find losses at 5 of 6 genes are associated with elevated total leukocytes—yet, only for NF1 do we observe that both loss and pLoF SNVs are significant (P_adj = 0.042 for both; Fig. 6c, Supplementary Table 18 and Supplementary Figs. 29 and 30). This supports the contributions of both 17p-Del and 17q-Del to cell-type skewing and potentially clonal expansion in blood. Interestingly, pLOF SNVs in NF1 are associated with a marked increase in neutrophil counts (P_adj = 0.00019), strongly implicating this gene in myeloid-skewed hematopoiesis.

Lastly, we analyzed rare missense SNVs at the Xq-Inv locus (n = 5 genes), motivated by earlier reports of activating somatic missense mutations in AR⁶¹, which we reasoned could potentially mirror the AR activation molecular phenotype seen in BM65. In females, we observe a bimodal VAF for missense SNVs, but neither for pLoF nor for rare synonymous SNVs, suggesting that AR missense SNVs, but not other SNVs, exhibit somatic mosaicism (Fig. 6d and Supplementary Notes). Furthermore, five rare AR missense SNVs, but no AR pLoF SNVs, are associated with altered blood cell counts (P_adj < 0.05, for all five SNVs; Fig. 6e). These fall into exon 1 (n = 3), exon 2 (n = 1) and exon 4 (n = 1), all of which also harbor missense SNVs in cancer that impinge on AR function⁶¹. We observe association with increased nucleated red blood cell count for n = 4 missense SNVs (P_adj < 0.05, for all four), and decreased basophil count for the remaining SNV (P_adj = 0.043). These findings independently support a link between AR activation and altered cell counts in UK Biobank samples.

Discussion

Our study provides an investigation into the impact of large-scale mosaicisms on normal HPSCs. Using the resolution of Strand-seq (Supplementary Fig. 31), we identify mSVs in most (84%) donors, although mSV subclonal expansion is confined to older (>60) donors. Subclonal mSVs show myeloid cell-type bias and active proliferation pathways, mirroring important features of CH⁴⁸. Therefore, mSVs may represent an important contributor to CH, with their high prevalence potentially accounting for ‘missing’ CH drivers⁶².

Subclonal mSVs are found at diverse loci, yet result in similar dysfunctional signaling pathways, with predominant myeloid-lineage enrichment. This is notable in light of the observation of myeloid skewing in aging HSPCs²⁸ and the involvement of myeloid cells in leukemogenesis⁶³. Our findings on cell-type biases are bolstered by a recent preprint⁶⁴, which reports an in vivo screen showing pronounced myeloid bias following NF1 knockout in mouse HPSCs (Supplementary Fig. 32).

The close association of SCEs and mSVs suggests that mSVs frequently arise as a byproduct of DSB repair^21,65. Intriguingly, mSV formation appears to occur constantly over age, akin to base substitution processes showing consistent activity over life⁶⁶. However, SCE formation slightly reduces with age, perhaps due to altered DNA repair pathway activities^67,68,69. Moreover, mosaicism-bearing cells appear more prone to accumulate further mSVs—analogous to CH driven by SNVs where the presence of multiple drivers implies higher cancer susceptibility⁷⁰. Conversely, newly formed singleton mSVs often result in large terminal alterations that do not reach appreciable CF, perhaps due to the detrimental consequences of segmental aneuploidy¹³. Collectively, factors other than increased mSV formation are likely to foster mSV subclonal expansion during aging. The less effective purging of cells comprising mSVs, exhaustion of HSCs decreasing their clonal diversity¹² or changes in the bone marrow microenvironment may contribute to the subclonal expansion of mSVs in aged donors.

To better understand how mSVs clonally expand in normal blood, additional studies are required. Given its demonstrated ability to discover and functionally characterize mSVs, conducting Strand-seq at scale⁷¹ could enable future studies in larger cohorts. However, limitations remain: Strand-seq is currently not suited to detecting mSVs <200 kb, and is restricted to dividing cells that can incorporate BrdU¹⁶. Furthermore, scalable single-cell methods that account for both mSVs and SNVs are lacking, highlighting an area for future technology development.

In conclusion, this study enhances our understanding of how mSVs alter molecular phenotypes in a cell-type-specific manner. Our approach paves new ways for studying mSV landscapes in diverse normal tissues and diseases in the future.

Methods

Ethics declarations

For samples from the Department of Hematology and Oncology, Medical Faculty Mannheim, Heidelberg University, the use of primary human materials for research purposes was approved by the Medical Ethics Committee II of the Medical Faculty Mannheim of the Heidelberg University. The Ethics approval number is 2013-509N-MA. For samples from Ulm University Hospital, collection and investigation was approved by the Internal Review Board (Ethikkommission) at Ulm University (392/16). Healthy samples used in this study were obtained from waste bone fragments obtained from endoprosthetic surgery and cardiovascular surgery. Recruitment was based on availability and written, informed consent. The status ‘healthy’ (normal) was defined as being negative for HIV and hepatitis B and C, having a normal blood count and having no history of or currently active malignancy. For samples from the Department of Medicine V, Hematology, Oncology and Rheumatology, University of Heidelberg, bone marrow samples were harvested from the posterior iliac crest. The studies on aging of bone marrow HSPCs have been approved by the Ethics Committee for Human Subjects at the University of Heidelberg. Healthy human subjects were recruited through an announcement published in the Department’s Newsletter for patients and their family. Before donation, healthy subjects were examined and screened by an internist and blood examinations (complete blood count, routine panel of laboratory examinations) were performed to assure their ‘healthy’ status. UCB was collected after informed consent of the mother using the guidelines approved by the Ethics Committee on the use of Human Subjects. All donors provided written, informed consent and all interventions were performed in accordance with the Declaration of Helsinki.

Human samples

Healthy donor human UCB and bone marrow samples were obtained either as frozen aliquots of mononuclear cells (MNCs) or freshly isolated from Heidelberg University Hospital, Ulm University Hospital, Mannheim University Hospital and ATCC (ATCC PCS-800-013), and were cryopreserved in liquid nitrogen until processing. Strand-seq library generation was initiated from cultures obtained from either freshly isolated or freshly thawed MNCs. For scMNase-seq and scRNA-seq, freshly thawed MNCs were used.

Statistics and reproducibility

All significance tests used are reported, where applied, in the main text. Multiple testing correction was utilized as required, indicated by P_adj, with an FDR of 10%. No statistical method was used to predetermine sample size. No data were excluded from the analyses. Our targeted analysis of UK Biobank data employed a more stringent significance threshold of P_adj < 0.05.

HSPC culturing and Strand-seq library preparation

UCB samples were obtained from Heidelberg University Hospital. Bone marrow was isolated from donor bone marrow aspirations (n = 2), discarded pelvis from hip replacement surgeries (n = 6) or sternum removed during routine heart surgeries (n = 8) (Supplementary Table 1). Cells were stained on ice in the dark for 30 min with CD34-APC (clone 581; BioLegend; 1:100), CD38-PE/Cy7 (clone HB7; eBioscience), CD45Ra-FITC (clone HI100; eBioscience), CD90-PE (clone 5E10; eBioscience) and LIVE/DEAD Fixable Near-IR Dead Cell Stain (ThermoFisher). Single, viable CD34⁺ cells (gating as per Supplementary Fig. 1) were FACS-sorted (BD FACSMelody, 100-μM nozzle, single-cell mode, gates determined using BD FACSDiva 8.0) directly into ice-cold complete medium (Stemspan serum-free expansion medium supplemented with 100 ng ml⁻¹ SCF and Flt3 (Stem Cell Technologies) and 20 ng ml⁻¹ IL-3, IL-6, G-CSF and TPO (Stem Cell Technologies)). Cells were seeded into Corning Costar Ultra-Low Attachment 96-well plates (Sigma-Aldrich) at a density of 1–2 × 10⁵ cells per ml and cultured for 42 h in the presence of BrdU (40 μM). BrdU-containing nuclei were sorted into 96-well plates and subjected to Strand-seq using the standard library preparation protocol¹⁶, which includes treatment with MNase for DNA fragmentation. Strand-seq libraries were generated using a Biomek FXP liquid handling robotic system^16,22, and sequenced on an Illumina NextSeq 500 sequencing platform (MID-mode, 75-base pair (bp) paired-end sequencing).

scMNase-seq

HSPCs from a healthy bone marrow donor were obtained from ATCC (ATCC PCS-800-013), and UCB samples as described above. Frozen MNCs were thawed and stained as per Supplementary Table 5, with antibodies outlined in Supplementary Table 19, to distinguish the eight distinct HSPC populations outlined in Supplementary Fig. 6a. Single, viable HSPCs (gating strategy Supplementary Fig. 6b) were index-sorted using a BD FACSAria Fusion Cell Sorter (100-μM nozzle, single-cell mode) into 96-well plates containing 5 μl of modified freeze buffer (0.1% NP-40, 7.5% dimethylsulfoxide, 42.5% 2X Profreeze-CDM (Lonza) in PBS) and frozen. ScMNase-seq⁷⁸ libraries were generated from sorted, frozen single cells as per Strand-seq library preparation²², with the following modification: the Hoechst/ultraviolet treatment step was omitted (with scMNase-seq requiring no BrdU incorporation). Following single-cell sequencing, each cell had an average coverage of 613,483 uniquely mapped fragments.

Building nucleosome occupancy reference set cell-type classifiers

The scNOVA framework enables cell-typing of each Strand-seq library, which is achieved by subjecting nucleosome occupancy patterns produced through MNase digestion to machine learning-based classification²⁰. While previously applied to distinguish cell lines from distinct tissues²⁰, here we employed this approach to classify closely related HSPC cell types, based on generating single-cell nucleosome occupancy reference profiles from scMNase-seq data. To achieve this, we index-sorted both the bone marrow- and UCB-derived CD34⁺ cells from eight HSPC cell types using previously defined immunophenotypes²⁴ (Supplementary Fig. 6a and Supplementary Table 5), as described above. Indexed scMNase-seq libraries were used as the ground-truth input for cell-type classifiers. In the case of bone marrow HSPCs, the gene-body nucleosome occupancy profiles were extracted for 305 high-quality single cells and normalized by library size to obtain reads per million. These normalized values were log₂-transformed and standardized, before being subjected to supervised PLS-DA⁷⁹ to (1) identify informative feature sets, and subsequently (2) build a classification model. To identify informative feature (gene) sets for each cell type, we used variable autosomal genes to build an X-matrix (305 cells × 18,851 genes) and a Y-matrix (305 cells × 8 cell types). These X and Y variables were passed to the PLS-DA feature selection process, which outputs variance importance in projection (VIP) scores for each feature. In total, 1,904 genes with a VIP score >90% of the null distribution from the permutation test were retained for the second stage of feature selection. In the second feature selection stage, an additional X-matrix (305 cells × 1,904 genes) and Y-matrix (305 cells × 1 cell type; with cell type in this case being binary information for each cell either belonging to that cell type (1) or not (0), based on FACS indexes) were passed to the PLS-DA, and features with a VIP score >95% of the null distribution from the permutation test retained. This was repeated for each cell type, resulting in a final informative feature set of 819 marker genes (Supplementary Table 7b). We repeated these steps for 175 high-quality single cells obtained from UCB HSPCs, which resulted in 899 marker genes as significant feature sets for cell-type classification (Supplementary Table 7a). We constructed distinct nucleosome occupancy-based classifiers for bone marrow and UCB HSPCs based on nucleosome occupancy patterns in the gene bodies of selected marker genes for cells derived from each source (Supplementary Table 7a,b and Code availability).

mSV discovery in Strand-seq data

We utilized the scTRIP computational approach¹⁴ for single-cell mSV discovery, to identify duplications, deletions, inversions, whole chromosome aneuploidies and complex mSVs. This approach leverages the synergy of three distinct readouts—read depth, strand and haplotype phase—retrieved from Strand-seq data, for haplotype-aware mSV discovery. We performed segmentation of the Strand-seq data by jointly processing strand-resolved binned read depth data across all single cells of a sample, used as a multivariate input signal with a squared-error assumption¹⁴. The single-cell footprints of different mSV classes (derived from unique combinations of read depth, strand and phase) were then discovered using scTRIP (achieved by running the ‘MosaiCatcher’ pipeline with default settings)¹⁴. This approach uses a Bayesian framework to compute posterior probabilities for each mSV diagnostic footprint, and to derive haplotype-resolved mSV genotype likelihoods. Each diagnostic footprint translates into the expected number of copies sequenced in Watson (W) and Crick (C) orientation, contributing to a respective genomic segment. The framework distinguishes between WC and CW chromosomal ground states, and is thus haplotype-aware. It implicitly allows us to perform mSV discovery throughout the genome, including for chromosomes sequenced only on the C strand (CC ground state) or such sequenced only on the W strand (WW ground state), since unambiguous single-cell mSV footprints exist for each ground state¹⁴. The framework estimates clonal frequency levels for each mSV and uses them to define prior probabilities for each candidate mSV. In this way, the framework benefits from the observation of mSVs in more than one cell, enabling improved detection of mSVs in subclones¹⁴—in addition to facilitating the detection of singleton mSVs. In contrast to CNAs, balanced inversions and translocations must be present in at least two single cells to trigger an mSV call¹⁴. We verified that the frequency of singleton mSVs detected using Strand-seq is consistent with results from intermediate coverage single-cell WGS (Supplementary Fig. 33). This suggests that short-term cell culturing with BrdU does not introduce singleton mSVs.

Cell-type enrichment testing

We devised cell-type enrichment tests for each of the identified subclones exhibiting specific mSVs, using a control group consisting of all individuals over the age of 60 who were not affected by mSVs. We performed a binomial test to determine if the number of cells in a particular cell type within the subclone was greater than expected, based on the cell-type composition of the control group. We then calculated permutation-based adjusted P values for each subclonal mSV by randomly sampling the same number of HSPCs from the entire single-cell population 100,000 times and tallying the number of cells from given cell types in question belonging to that subclone.

Single-cell multiomic analysis of differential gene activities in HSPC subclones

Differentially active genes in subclones affected by mSVs were identified in the Strand-seq data using scNOVA²⁰. We used scNOVA’s infer altered gene activity module with the PLS-DA option, which is recommended for the investigation of low-CF subclones²⁰. To regress-out cell-type effects in the identification of differential gene activity, we considered predicted cell type for each single cell as a confounding factor when we executed the infer altered gene activity module. Genes within the respective deleted region were masked, to avoid spurious associations⁸⁰. Genes with significantly altered gene activity (10% FDR) were subjected to gene set over-representation analysis using ConsensusPathDB⁷⁷. Using this approach, certain pathways may exhibit a significant P value for both upregulated and downregulated genes, with some genes contained in ConsensusPathDB functioning as activators and others as suppressors. Over-represented pathways (FDR 10%) were visualized as dot plots. When comparing 17p-cells and wild-type (WT) cells in BM712 in Fig. 5f, we considered all cells carrying the 17p-Del, including those harboring other mosaicisms in addition to 17p-Del, as ‘17p-cells’.

Investigation of potential cis-effects of a balanced inversion

To investigate the local effects of Xq-Inv in BM65, we employed scNOVA²⁰. We utilized a sliding window approach suitable to uncover the cis-effects of balanced mSVs, resolved by haplotype²⁰. We focused on the Xq-Inv-affected segment, including both of its rearranged TADs. We first defined CREs based on a previous study utilizing the assay for transposase-accessible chromatin with sequencing (ATAC-seq) in HSPCs²⁴. We used a sliding window (300 kb in size, moving 10 kb each)²⁰, analyzing CREs along chromosome X, to infer chromosome-wide haplotype-specific nucleosome occupancy for the mSV subclone and WT cells, which is predictive for chromatin accessibility²⁰. For each sliding window, haplotype-specific nucleosome occupancy values at CREs from the mSV subclone (nucleosome occupancy in the active X chromosome/nucleosome occupancy in the inactive X) and WT cells (nucleosome occupancy in the active X/nucleosome occupancy in the inactive X) were compared using likelihood ratio tests to obtain nominal P values [P real]. As a multiple testing correction to control the type I error, we performed a permutation test by randomly shuffling genotype labels of each single cell (mSV or WT) in the single-cell reads per million matrix 1,000 times. For each permutation, we performed likelihood ratio tests to compare nucleosome occupancy between randomly shuffled mSV subclones and WT cells. We computed the number of incidences we observed with the same, or a lower, P value than [P real] from 1,000 permutations, and divided this value by the number of trials (n = 1,000) to estimate the permutation-adjusted P value. Sliding windows with permutation-adjusted P value lower than 0.1 were identified as significantly altered windows, and were assigned to the nearest genes within the same TAD boundaries.

scRNA-seq

Bone marrow MNCs were thawed and stained as described above, with the following antibodies: CD34-AF488 (clone 561; BioLegend; 1:20), CD38-PE/Cy7 (clone HB7; eBioscience; 1:100). Cells were washed and resuspended as above, and stained for 5 min with DAPI before sorting. The gating strategy as described in Supplementary Fig. 1 was used to sort CD34⁺ cells and CD34⁻ cells, respectively, into ice-cold 0.04% BSA in PBS using a BD FACSMelody cell sorter. For each donor, two samples were prepared: one sample of CD34⁺ cells and one sample a 50:50 mixture of CD34⁺ and CD34⁻ cells. scRNA-seq libraries for each sample were generated as per the standard 10X Genomics Chromium 3′ (v.3.1 chemistry) protocol. Completed libraries were sequenced on a NextSeq5000 sequencer (HIGH mode, 75-bp paired-ends).

scRNA-seq data processing, unsupervised clustering and cell-type annotation

Transcripts were aligned to GRCh38 and quantified into count matrices using Cellranger mkfastq and count workflows (10X Genomics, v.3.1.0, default parameters). Seurat⁸¹ (v.3.2.2) was used for quality control of single cells and unbiased clustering of the data. Briefly, cells with <1,000 unique molecular identifiers (UMIs) and cells with >6% of mitochondrial reads were removed as ‘low quality’. Normalization, feature selection, scaling and dimensionality reduction were carried out using default settings. To annotate cell types, previously reported scRNA-seq data from HSPCs⁵¹ were used as a reference for cell-type labeling using SingleR⁷². Differential expression analysis to identify cluster-/genotype-specific marker genes was carried out using the FindMarkers() function from Seurat.

Targeted CNA re-calling in scRNA-seq data

scRNA-seq data were normalized to counts per million (CPM) and transformed into log₂(CPM/10 + 1) using Seurat⁸¹ (v.3.2.2). These values were then subject to targeted CNA re-calling using the CONICSmat package⁵⁸, as described previously²⁰. For the analysis of donor BM712, all three subclonal mosaicisms were investigated: 17p-Del, 17q-Del and LOY. By default, the CONICSmat ‘plotChrEnrichment’ function considers regions with more than 100 expressed genes for CNA discovery. Since we performed targeted re-calling of CNAs previously identified with the high-breakpoint mapping resolution of Strand-seq, we considered regions with five or more expressed genes in our analysis. The numbers of expressed genes detected per mSV were as follows: 17p-Del: 24 genes; 17q-Del: 5 genes; LOY: 17 genes for CD34⁺ dataset; 17p-Del: 28 genes; 17q-Del: 5 genes; LOY: 38 genes for CD34⁻ dataset (Supplementary Table 14). To profile CNA regions, CONICSmat generates distributions of average expression levels across single cells in the given regions, and then fits one-component and two-component mixture models to these distributions. It further compares the likelihood ratios of being one-component (unimodal; that is, absence of CNAs) and two-component (bimodal; that is, presence of CNAs), to determine the most-likely state in those regions based on the Bayesian information criterion. Candidate CNA regions identified as likely to be bimodal within a 1% FDR criterion (based on a Chi-squared likelihood ratio test) were considered further for downstream analysis. Once the region was inferred to have bimodality, the posterior probability for each single cell to belong to the normal clone or CNA subclone was calculated. A posterior probability cutoff of 0.8 was used to assign single cells into one of the two clones. This analysis was repeated for each subclonal mosaicism event.

SCE mapping and locus-specific SCE enrichment

We constructed genome-wide maps of SCEs in each single cell by subjecting the Strand-seq data of single cells to the MosaiCatcher pipeline¹⁴, followed by manual inspection and curation of each call yielding SCE positional coordinates for each cell. Candidate SCEs were identified as changes in strand-state (for example, WW to WC) on a chromosome, whereby we conservatively focused on chromosomes showing only a single changepoint. Chromosomes bearing singleton mSVs were removed by manual inspection, unless the observed strand-state patterns were clearly not attributed to an mSV alone (for example, a terminal deletion together with a complete change in strand orientation, such as WW to C), signifying the co-occurrence of an mSV and an SCE. Coordinates were padded by 1 bp upstream and downstream. GRCh38 was divided into 500-kb bins using the bedtools makewindows command⁸², and overlaps between these 500-kb bins and our SCE callset were generated using bedtools intersect, giving the number of times each bin is hit by an SCE. A bin was considered to be hit if the majority of an SCE confidence interval fell within that bin, and each SCE was only counted in a single bin. To compute significance of the calculated SCE counts per bin, the count data per bin genome-wide were then fit to a negative binomial distribution using the fitdist function from fitdistrplus⁸³, and P values calculated using the qnbinom function (with size = 1.2506716, mu = 0.4823156), applying Benjamini–Hochberg correction. To compute overlap of mSV breakpoints with SCEs, we considered 200-kb-sized breakpoint regions (reported breakpoints ±100 kb).

Breakpoint refinement by WGS

Bulk genomic DNA was isolated from CD34⁻ cells (viable cells from the donors that were not put into culture to be used for Strand-seq library preparation) using the QIAamp DNA Blood Maxi Kit as per the manufacturer’s instructions. Samples were sequenced using a NextSeq5000 (HIGH mode, 75-bp paired-end). Raw WGS reads were aligned to GRCh38 using bwa (v.0.7.15), sorted, marked for duplicates and indexed. mSVs were called using Delly2 (default parameters), combining split read, paired-end and read depth analysis⁷⁶. Unfiltered mSV calls were compared with our callset. Since split read analysis failed to identify the precise breakpoints of the 17p-Del that reside in a repeat-rich region, we generated a single, directional composite bam file of this region based on our Strand-seq data to allow for 17p-Del breakpoint refinement with BreakpointR³³.

UK Biobank analysis

Data collection

The UK Biobank is a population database of approximately half a million participants⁵⁹. For SNVs and INDELs, we used the population-level exome OQFE variants for 469,792 individuals (UK Biobank field ID 23157). For autosomal large deletions, we used CNA loss calls on WES data that were recently generated by subjecting 200,624 individuals from the UK Biobank to the CNest copy-number caller⁶⁰. We considered CNA calls >1 kb. Additionally, we obtained phenotypic data for 11 blood count traits (UK Biobank category ID 100081), containing count for white blood cells, basophils, eosinophils, monocytes, neutrophils, lymphocytes, red blood cells, nucleated red blood cells, platelets, reticulocytes and high-light-scatter reticulocytes. This research was conducted under the application number 83497. The UK Biobank has ethics approval from the North West Multi-centre Research Ethics Committee (21/NW/0157).

Variant annotation

We annotated SNVs/INDELs from WES data using Variant Effect Predictor (VEP v.1.0.3) with the Loss-Of-Function Transcript Effect Estimator (LOFTEE v.0.3-beta) plugin. Variant annotation was performed using Hail v.0.2. According to annotation results, we grouped variants into rare loss of function variants (‘high confidence’ identified by LOFTEE with a MAF < 1%) and rare missense variants (missense variants annotated by VEP with MAF < 1% in the UK Biobank cohort). In the case of CNA losses, we considered deletions overlapping coding exons with MAF < 1%.

Association testing

The blood count data were rank normalized using the ‘RNOmni’ package in R⁸⁴. Linear regression models (blood count ~ genotype + covariates) were used to assess the association between three loci of interest (17p-Del, 17q-Del and Xq-Inv) and blood counts adjusted for several covariates, including age, sex and the first five principal components derived from genotype arrays. For all genes at the respective 17p and 17q loci, we used gene rare pLoF burden and rare large CNA loss burden as genotype in the regression model. For all genes at the X chromosomal locus of interest, we used gene burden for rare pLoF variants and rare missense variants in the model. Moreover, since missense variants can have distinct functional impacts, we also performed single-variant association analysis for rare missense mutations at the Xq-Inv locus by sex. The volcano plot in Fig. 6e presents nominal P values derived solely from female donors, since we made the observation of sex-biased VAF distributions at the AR locus in UK Biobank samples. For all data, see Supplementary Fig. 34. A minimum of three individuals with relevant variants was required for association tests of a given gene, with the exception of the 17p-Del CNA seen in only two UK Biobank donors based on WES. P values were obtained using the Wald test and the Benjamini and Hochberg method was used to correct for multiple hypothesis testing.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All genomics data generated in this study (Strand-seq, scMNase-seq, scRNA-seq, bulk WGS) are available under the following accession: EGAS00001006567. We re-analyzed publicly available bulk RNA-seq and bulk ATAC-seq data from HSPCs (GSE75384) to characterize signature genes while building the scMNase-seq-based cell-type classifier, and to define CREs in the HSPCs. Additionally, we utilized publicly available databases as follows: Molecular Signatures Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/), ConsensusPathDB (http://cpdb.molgen.mpg.de/).

Code availability

Our study has made publicly available scMNase/nucleosome occupancy-based classifiers for cell-typing Strand-seq libraries in HSPCs. This MATLAB-based classifier can be accessed and downloaded from GitHub, to facilitate its use in research studies (https://github.com/jeongdo801/NO_based_HSPC_classifier). The MATLAB code can be converted to other platforms (R/python) using openly accessible tools (such as large language models). To facilitate open access, the list of signature genes and their weights have also been made available in an open text format through the same hyperlink.

References

Martincorena, I. et al. Tumor evolution. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).
CAS PubMed PubMed Central Google Scholar
Loh, P.-R. et al. Insights into clonal haematopoiesis from 8,342 mosaic chromosomal alterations. Nature 559, 350–355 (2018).
CAS PubMed PubMed Central Google Scholar
Lee-Six, H. et al. Population dynamics of normal human blood inferred from somatic mutations. Nature 561, 473–478 (2018).
CAS PubMed PubMed Central Google Scholar
Jaiswal, S. et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488–2498 (2014).
PubMed PubMed Central Google Scholar
Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477–2487 (2014).
PubMed PubMed Central Google Scholar
Kakiuchi, N. & Ogawa, S. Clonal expansion in non-cancer tissues. Nat. Rev. Cancer 21, 239–256 (2021).
CAS PubMed Google Scholar
Cosenza, M. R., Rodriguez-Martin, B. & Korbel, J. O. Structural variation in cancer: role, prevalence, and mechanisms. Annu. Rev. Genomics Hum. Genet. https://doi.org/10.1146/annurev-genom-120121-101149 (2022).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Google Scholar
Sano, S. et al. Hematopoietic loss of Y chromosome leads to cardiac fibrosis and heart failure mortality. Science 377, 292–297 (2022).
CAS PubMed PubMed Central Google Scholar
Jaiswal, S. et al. Clonal hematopoiesis and risk of atherosclerotic cardiovascular disease. N. Engl. J. Med. 377, 111–121 (2017).
PubMed PubMed Central Google Scholar
Forsberg, L. A., Gisselsson, D. & Dumanski, J. P. Mosaicism in health and disease—clones picking up speed. Nat. Rev. Genet. 18, 128–142 (2017).
CAS PubMed Google Scholar
Mitchell, E. et al. Clonal dynamics of haematopoiesis across the human lifespan. Nature 606, 343–350 (2022).
CAS PubMed PubMed Central Google Scholar
Tang, Y.-C. & Amon, A. Gene copy-number alterations: a cost-benefit analysis. Cell 152, 394–405 (2013).
CAS PubMed PubMed Central Google Scholar
Sanders, A. D. et al. Single-cell analysis of structural variations and complex rearrangements with tri-channel processing. Nat. Biotechnol. 38, 343–354 (2020).
CAS PubMed Google Scholar
Gawad, C., Koh, W. & Quake, S. R. Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 17, 175–188 (2016).
CAS PubMed Google Scholar
Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).
CAS PubMed PubMed Central Google Scholar
Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell https://doi.org/10.1016/j.cell.2022.04.017 (2022).
Forsberg, L. A. et al. Age-related somatic structural changes in the nuclear genome of human blood cells. Am. J. Hum. Genet. 90, 217–228 (2012).
CAS PubMed PubMed Central Google Scholar
Jacobs, K. B. et al. Detectable clonal mosaicism and its relationship to aging and cancer. Nat. Genet. 44, 651–658 (2012).
CAS PubMed PubMed Central Google Scholar
Jeong, H. et al. Functional analysis of structural variants in single cells using Strand-seq. Nat. Biotechnol. 41, 832–844 (2023).
CAS PubMed Google Scholar
Liu, P., Carvalho, C. M. B., Hastings, P. J. & Lupski, J. R. Mechanisms for recurrent and complex human genomic rearrangements. Curr. Opin. Genet. Dev. 22, 211–220 (2012).
CAS PubMed PubMed Central Google Scholar
Sanders, A. D., Falconer, E., Hills, M., Spierings, D. C. J. & Lansdorp, P. M. Single-cell template strand sequencing by Strand-seq enables the characterization of individual homologs. Nat. Protoc. 12, 1151–1176 (2017).
CAS PubMed Google Scholar
Glover, T. W. & Stein, C. K. Induction of sister chromatid exchanges at common fragile sites. Am. J. Hum. Genet. 41, 882–890 (1987).
CAS PubMed PubMed Central Google Scholar
Corces, M. R. et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203 (2016).
CAS PubMed PubMed Central Google Scholar
Hennrich, M. L. et al. Cell-specific proteome analyses of human bone marrow reveal molecular features of age-dependent functional decline. Nat. Commun. 9, 4004 (2018).
PubMed PubMed Central Google Scholar
Chen, X. et al. Bone marrow myeloid cells regulate myeloid-biased hematopoietic stem cells via a histamine-dependent feedback loop. Cell Stem Cell 21, 747–760.e7 (2017).
CAS PubMed PubMed Central Google Scholar
Bunis, D. G. et al. Single-cell mapping of progressive fetal-to-adult transition in human naive T cells. Cell Rep. 34, 108573 (2021).
CAS PubMed PubMed Central Google Scholar
Pang, W. W. et al. Human bone marrow hematopoietic stem cells are increased in frequency and myeloid-biased with age. Proc. Natl Acad. Sci. USA 108, 20012–20017 (2011).
CAS PubMed PubMed Central Google Scholar
Amoah, A. et al. Aging of human hematopoietic stem cells is linked to changes in Cdc42 activity. Haematologica 107, 393–402 (2022).
CAS PubMed Google Scholar
Dumanski, J. P. et al. Immune cells lacking Y chromosome show dysregulation of autosomal gene expression. Cell. Mol. Life Sci. 78, 4019–4033 (2021).
CAS PubMed PubMed Central Google Scholar
van Zeventer, I. A. et al. Evolutionary landscape of clonal hematopoiesis in 3,359 individuals from the general population. Cancer Cell https://doi.org/10.1016/j.ccell.2023.04.006 (2023).
Lee, M. K. S. et al. Interplay between clonal hematopoiesis of indeterminate potential and metabolism. Trends Endocrinol. Metab. 31, 525–535 (2020).
CAS PubMed Google Scholar
Porubsky, D. et al. breakpointR: an R/Bioconductor package to localize strand state changes in Strand-seq data. Bioinformatics 36, 1260–1261 (2020).
CAS PubMed Google Scholar
Weischenfeldt, J. et al. Pan-cancer analysis of somatic copy-number alterations implicates IRS4 and IGF2 in enhancer hijacking. Nat. Genet. 49, 65–74 (2017).
CAS PubMed Google Scholar
Zhang, Q.-S. et al. Oxymetholone therapy of fanconi anemia suppresses osteopontin transcription and induces hematopoietic stem cell cycling. Stem Cell Rep. 4, 90–102 (2015).
CAS Google Scholar
McManus, J. F. et al. Androgens stimulate erythropoiesis through the DNA-binding activity of the androgen receptor in non-hematopoietic cells. Eur. J. Haematol. 105, 247–254 (2020).
CAS PubMed Google Scholar
Grover, A. et al. Erythropoietin guides multipotent hematopoietic progenitor cells toward an erythroid fate. J. Exp. Med. 211, 181–188 (2014).
CAS PubMed PubMed Central Google Scholar
Behrens, K. et al. Runx1 downregulates stem cell and megakaryocytic transcription programs that support niche interactions. Blood 127, 3369–3381 (2016).
CAS PubMed Google Scholar
Yoshida, T., Ng, S. Y.-M., Zuniga-Pflucker, J. C. & Georgopoulos, K. Early hematopoietic lineage restrictions directed by Ikaros. Nat. Immunol. 7, 382–391 (2006).
CAS PubMed Google Scholar
Desterke, C., Bennaceur-Griscelli, A. & Turhan, A. G. EGR1 dysregulation defines an inflammatory and leukemic program in cell trajectory of human-aged hematopoietic stem cells (HSC). Stem Cell Res. Ther. 12, 419 (2021).
CAS PubMed PubMed Central Google Scholar
Chen, S. et al. Impaired proteolysis of noncanonical RAS proteins drives clonal hematopoietic transformation. Cancer Discov. 12, 2434–2453 (2022).
CAS PubMed PubMed Central Google Scholar
Hay, S. B., Ferchen, K., Chetal, K., Grimes, H. L. & Salomonis, N. The Human Cell Atlas bone marrow single-cell interactive web portal. Exp. Hematol. 68, 51–61 (2018).
PubMed PubMed Central Google Scholar
Huang, C.-K., Luo, J., Lee, S. O. & Chang, C. Concise review: androgen receptor differential roles in stem/progenitor cells including prostate, embryonic, stromal, and hematopoietic lineages. Stem Cells 32, 2299–2308 (2014).
PubMed Google Scholar
Szklarczyk, D. et al. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49, D605–D612 (2021).
CAS PubMed Google Scholar
Ito, K. et al. Reactive oxygen species act through p38 MAPK to limit the lifespan of hematopoietic stem cells. Nat. Med. 12, 446–451 (2006).
CAS PubMed Google Scholar
Dragoljevic, D., Westerterp, M., Veiga, C. B., Nagareddy, P. & Murphy, A. J. Disordered haematopoiesis and cardiovascular disease: a focus on myelopoiesis. Clin. Sci. 132, 1889–1899 (2018).
CAS Google Scholar
Imbard, A. et al. NF1 single and multi-exons copy number variations in neurofibromatosis type 1. J. Hum. Genet. 60, 221–224 (2015).
CAS PubMed Google Scholar
Pich, O., Reyes-Salazar, I., Gonzalez-Perez, A. & Lopez-Bigas, N. Discovering the drivers of clonal hematopoiesis. Nat. Commun. 13, 4267 (2022).
CAS PubMed PubMed Central Google Scholar
Johannessen, C. M. et al. The NF1 tumor suppressor critically regulates TSC2 and mTOR. Proc. Natl Acad. Sci. USA 102, 8573–8578 (2005).
CAS PubMed PubMed Central Google Scholar
Zou, Z., Tao, T., Li, H. & Zhu, X. mTOR signaling pathway and mTOR inhibitors in cancer: progress and challenges. Cell Biosci. 10, 31 (2020).
PubMed PubMed Central Google Scholar
Xie, X. et al. Single-cell transcriptomic landscape of human blood cells. Natl Sci. Rev. 8, nwaa180 (2021).
CAS PubMed Google Scholar
Singh, S. K. et al. Id1 ablation protects hematopoietic stem cells from stress-induced exhaustion and aging. Cell Stem Cell 23, 252–265.e8 (2018).
CAS PubMed PubMed Central Google Scholar
Kovtonyuk, L. V. et al. Hematopoietic stem cells increase quiescence during aging. Blood 134, 2484 (2019).
Google Scholar
Zhang, P. et al. Chromatin regulator Asxl1 loss and Nf1 haploinsufficiency cooperate to accelerate myeloid malignancy. J. Clin. Invest. 128, 5383–5398 (2018).
PubMed PubMed Central Google Scholar
Laurenti, E. et al. Hematopoietic stem cell function and survival depend on c-Myc and N-Myc activity. Cell Stem Cell 3, 611–624 (2008).
CAS PubMed PubMed Central Google Scholar
Fernandes, H., Moura, J. & Carvalho, E. mTOR signaling as a regulator of hematopoietic stem cell fate. Stem Cell Rev. Rep. 17, 1312–1322 (2021).
CAS PubMed Google Scholar
Novershtern, N. et al. Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell 144, 296–309 (2011).
CAS PubMed PubMed Central Google Scholar
Müller, S., Cho, A., Liu, S. J., Lim, D. A. & Diaz, A. CONICS integrates scRNA-seq with DNA sequencing to map gene expression to tumor sub-clones. Bioinformatics 34, 3217–3219 (2018).
PubMed PubMed Central Google Scholar
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
PubMed PubMed Central Google Scholar
Fitzgerald, T. & Birney, E. CNest: a novel copy number association discovery method uncovers 862 new associations from 200,629 whole-exome sequence datasets in the UK Biobank. Cell Genom. 2, 100167 (2022).
CAS PubMed PubMed Central Google Scholar
Gottlieb, B., Beitel, L. K., Nadarajah, A., Paliouras, M. & Trifiro, M. The androgen receptor gene mutations database: 2012 update. Hum. Mutat. 33, 887–894 (2012).
CAS PubMed Google Scholar
Zink, F. et al. Clonal hematopoiesis, with and without candidate driver mutations, is common in the elderly. Blood 130, 742–752 (2017).
CAS PubMed PubMed Central Google Scholar
Bowman, R. L., Busque, L. & Levine, R. L. Clonal hematopoiesis and evolution to hematopoietic malignancies. Cell Stem Cell 22, 157–170 (2018).
CAS PubMed PubMed Central Google Scholar
Haney, M. S. et al. Large-scale in vivo CRISPR screens identify SAGA complex members as a key regulators of HSC lineage commitment and aging. Preprint at bioRxiv https://doi.org/10.1101/2022.07.22.501030 (2022).
Johnson, R. D. & Jasin, M. Sister chromatid gene conversion is a prominent double-strand break repair pathway in mammalian cells. EMBO J. 19, 3398–3407 (2000).
CAS PubMed PubMed Central Google Scholar
Alexandrov, L. B. et al. Clock-like mutational processes in human somatic cells. Nat. Genet. 47, 1402–1407 (2015).
CAS PubMed PubMed Central Google Scholar
White, R. R. & Vijg, J. Do DNA double-strand breaks drive aging? Mol. Cell 63, 729–738 (2016).
CAS PubMed PubMed Central Google Scholar
Beerman, I. Accumulation of DNA damage in the aged hematopoietic stem cell compartment. Semin. Hematol. 54, 12–18 (2017).
PubMed Google Scholar
Hsieh, J. C. F., Van Den Berg, D., Kang, H., Hsieh, C.-L. & Lieber, M. R. Large chromosome deletions, duplications, and gene conversion events accumulate with age in normal human colon crypts. Aging Cell 12, 269–279 (2013).
CAS PubMed Google Scholar
Weeks, L. D. et al. Prediction of risk for myeloid malignancy in clonal hematopoiesis. NEJM Evid. 2, EVIDoa2200310 (2023).
Google Scholar
Hanlon, V. C. T. et al. Construction of Strand-seq libraries in open nanoliter arrays. Cell Rep. Methods 2, 100150 (2022).
CAS PubMed PubMed Central Google Scholar
Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
CAS PubMed PubMed Central Google Scholar
Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
CAS PubMed PubMed Central Google Scholar
Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database 2017, bax028 (2017).
PubMed PubMed Central Google Scholar
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
CAS PubMed PubMed Central Google Scholar
Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
CAS PubMed PubMed Central Google Scholar
Kamburov, A. & Herwig, R. ConsensusPathDB 2022: molecular interactions update as a resource for network biology. Nucleic Acids Res. 50, D587–D595 (2022).
CAS PubMed Google Scholar
Lai, B. et al. Principles of nucleosome organization revealed by single-cell micrococcal nuclease sequencing. Nature 562, 281–285 (2018).
CAS PubMed PubMed Central Google Scholar
Boulesteix, A.-L. & Strimmer, K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform. 8, 32–44 (2007).
CAS PubMed Google Scholar
Roszik, J. et al. Somatic copy number alterations at oncogenic loci show diverse correlations with gene expression. Sci. Rep. 6, 19649 (2016).
CAS PubMed PubMed Central Google Scholar
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
CAS PubMed PubMed Central Google Scholar
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
CAS PubMed PubMed Central Google Scholar
Delignette-Muller, M. L. & Dutang, C. fitdistrplus: an R package for fitting distributions. J. Stat. Softw. 64, 1–34 (2015).
Google Scholar
McCaw, Z. R., Lane, J. M., Saxena, R., Redline, S. & Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 1262–1272 (2020).
CAS PubMed PubMed Central Google Scholar
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995).
Google Scholar
Palin, K. et al. Contribution of allelic imbalance to colorectal cancer. Nat. Commun. 9, 3664 (2018).
PubMed PubMed Central Google Scholar
Martens, J. H. A. & Stunnenberg, H. G. BLUEPRINT: mapping human blood cell epigenomes. Haematologica 98, 1487–1489 (2013).
CAS PubMed PubMed Central Google Scholar
Ichii, M., Oritani, K. & Kanakura, Y. Early B lymphocyte development: similarities and differences in human and mouse. World J. Stem Cells 6, 421–431 (2014).
PubMed PubMed Central Google Scholar

Download references

Acknowledgements

We acknowledge the EMBL core facilities and services for support in computing (IT), sequencing (GeneCore) and cell sorting (FACS). We thank J. Li from Southern University of Science and Technology for assisting with the analysis of the Pan-Cancer Analysis of Whole Genomes dataset. Principal funding for this work came from the European Research Council (ERC Consolidator grant (MOSAIC) grant no. 773026) to J.O.K. Additional funding was received from the National Institutes of Health (Somatic Mosaicism Across Human Tissues (SMaHT) project: grant no. UG3NS132146; J.O.K.), the Deutsche Forschungsgemeinschaft (DFG) (German Research Foundation grant no. TRR 241–375876048 (to A.D.S.)), as well as the Health + Life Science Alliance Heidelberg Mannheim (to J.O.K.), which receives state funds approved by the State Parliament of Baden-Württemberg in Germany. N.X. and S.S. were supported by the National Natural Science Foundation of China (grant no. 32200487) and the Center for Computational Science and Engineering at Southern University of Science and Technology. We thank S. Kaspar, from the Centre for Statistical Data Analysis (CSDA) at the EMBL Data Science Centre, for statistical analysis consulting.

Funding

Open access funding provided by European Molecular Biology Laboratory (EMBL).

Author information

These authors contributed equally: Karen Grimes, Hyobin Jeong.
These authors jointly supervised this work: Ashley D. Sanders, Jan O. Korbel.

Authors and Affiliations

Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
Karen Grimes, Hyobin Jeong, Benjamin Raeder, Patrick Hasenfeld, Catherine Stober, Tobias Rausch, Eva Benito & Jan O. Korbel
Department of Systems Biology, College of Life Science and Biotechnology, Yonsei University, Seoul, Republic of Korea
Hyobin Jeong
Institute of Molecular Medicine, Ulm University, Ulm, Germany
Amanda Amoah, Julian Niemann & Hartmut Geiger
Department of Human Cell Biology and Genetics, School of Medicine, Southern University of Science and Technology, Shenzhen, China
Nuo Xu & Shimin Shuai
Molecular Medicine Partnership Unit (MMPU), European Molecular Biology Laboratory, University of Heidelberg, Heidelberg, Germany
Tobias Rausch, Anthony Ho & Jan O. Korbel
Bridging Research Division on Mechanisms of Genomic Variation and Data Science, German Cancer Research Center (DKFZ), Heidelberg, Germany
Tobias Rausch & Jan O. Korbel
Department of Hematology and Oncology, Medical Faculty Mannheim of the Heidelberg University, Mannheim, Germany
Johann-Christoph Jann & Daniel Nowak
Department of Cardiothoracic and Vascular Surgery, Ulm University Hospital, Ulm, Germany
Ramiz Emini, Markus Hoenicka & Andreas Liebold
Department of Medicine V, Hematology, Oncology and Rheumatology, University of Heidelberg, Heidelberg, Germany
Anthony Ho
Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
Ashley D. Sanders
Berlin Institute of Health (BIH) at Charité–Universitätsmedizin Berlin, Berlin, Germany
Ashley D. Sanders
Charité–Universitätsmedizin Berlin, Berlin, Germany
Ashley D. Sanders

Authors

Karen Grimes
View author publications
You can also search for this author in PubMed Google Scholar
Hyobin Jeong
View author publications
You can also search for this author in PubMed Google Scholar
Amanda Amoah
View author publications
You can also search for this author in PubMed Google Scholar
Nuo Xu
View author publications
You can also search for this author in PubMed Google Scholar
Julian Niemann
View author publications
You can also search for this author in PubMed Google Scholar
Benjamin Raeder
View author publications
You can also search for this author in PubMed Google Scholar
Patrick Hasenfeld
View author publications
You can also search for this author in PubMed Google Scholar
Catherine Stober
View author publications
You can also search for this author in PubMed Google Scholar
Tobias Rausch
View author publications
You can also search for this author in PubMed Google Scholar
Eva Benito
View author publications
You can also search for this author in PubMed Google Scholar
Johann-Christoph Jann
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Nowak
View author publications
You can also search for this author in PubMed Google Scholar
Ramiz Emini
View author publications
You can also search for this author in PubMed Google Scholar
Markus Hoenicka
View author publications
You can also search for this author in PubMed Google Scholar
Andreas Liebold
View author publications
You can also search for this author in PubMed Google Scholar
Anthony Ho
View author publications
You can also search for this author in PubMed Google Scholar
Shimin Shuai
View author publications
You can also search for this author in PubMed Google Scholar
Hartmut Geiger
View author publications
You can also search for this author in PubMed Google Scholar
Ashley D. Sanders
View author publications
You can also search for this author in PubMed Google Scholar
Jan O. Korbel
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

K.G., H.J., A.D.S. and J.O.K. designed the study. H.G., D.N., J.-C.J., A.A., J.N., R.E., M.H., A.L. and A.H. contributed bone marrow samples. A.H. contributed UCB samples. K.G. performed Strand-seq experiments. B.R., P.H., C.S. and E.B. contributed to the preparation of Strand-seq libraries on a high-throughput robotics platform. K.G. developed the plate-based scMNase-seq (that is, scWGS) protocol. H.J. and K.G. were responsible for the cell-type classification models. K.G. and H.J. performed mSV calling. K.G. performed SCE calling. H.J. and K.G. performed scNOVA analysis. K.G. and T.R. performed WGS-based mosaicism calling. H.J. performed scWGS-based CNA calling. K.G. generated the scRNA-seq data. K.G. and H.J. performed the scRNA-seq data analysis. H.J. and K.G. performed the cell-type classification and enrichment testing. K.G., H.J., A.D.S. and J.O.K. interpreted the data. N.X. and S.S. performed the UK Biobank analyses. K.G., H.J., A.D.S. and J.O.K. wrote the paper, with contributions from all authors.

Corresponding authors

Correspondence to Ashley D. Sanders or Jan O. Korbel.

Ethics declarations

Competing interests

The following authors have previously disclosed a patent application (no. EP19169090) that is relevant to the use of Strand-seq for somatic structural variation analysis: A.D.S., J.O.K. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Floris Foijer, Seishi Ogawa and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 SCEs occur steadily over life but arise non-randomly across the genome.

Violin plot showing the number of SCEs per single cell per donor, in order of increasing donor age. The age of each donor is noted below the violin in each case. The median SCE count (n = 4 SCEs) is indicated by a red line. Boxplots were defined by minima = 25th percentile - 1.5X interquartile range (IQR), maxima = 75th percentile + 1.5X IQR, center = median, and bounds of box = 25th and 75th percentile. b) Manhattan plot showing the distribution of SCEs genome-wide (hg38, 500 kb bins). Red line denotes the significance cutoff used (10% FDR). The significance of SCE counts per bin were calculated by fitting the permuted data of the number of overlaps per bin genome-wide (quantified using bedtools intersect) to a negative binomial distribution using the fitdist() function from the fitdistrplus package⁸³ (evaluation of distribution of empirical vs theoretical data shown in Supplementary Fig. 4), and then computing the p-value of the actual data using the fitted negative binomial distribution as a null distribution. The resulting size and mu (size = 1.4013518 (standard error = 0.08457053), mu = 0.6714258 (standard error = 0.01213493)) were used to transform SCE counts per bin into p-values, followed by Benjamini-Hochberg correction⁸⁵ to control the FDR. c, e) Permutation summary plots for 10,000 permutations of breakpoint regions from c) all mSVs and c) subclonal mSVs vs. all SCE locations. d, f) Local Z-score plots showing enrichment Z-scores within a 20 Mb window, in 2 Mb bins, for d) all mSVs and f) subclonal mSVs. g) Permutation summary plots for 10,000 permutations of breakpoint regions from all mSVs vs. previously annotated CFSs⁸⁶. P-value is based on a one-sided permutation test. h) Local Z-score plots showing enrichment Z-scores within a 20 Mb window, in 2 Mb bins, for all mSVs vs previously annotated CFS regions.

Extended Data Fig. 2 Performance evaluation of the UCB-derived scMNase-seq-based cell-type classifier.

a) ROC curve showing leave-one-out cross-validation of the cell-type classifier’s performance using single cell NO patterns. b) UMAP projection of latent variables from the UCB HSPC cell type classifier. c) Heatmap of single cell NO of gene bodies of 175 single UCB HSPCs, generated using scMNase-seq. The 899 signature genes depicted (rows) allow for discrimination between 8 UUCB HSPC cell types (columns). Cells are grouped and color-coded by immunophenotyped cell-type identity, determined by FACS (Supplementary Fig. 6). Differential NO of marker genes is represented by Z-scores. d) UMAP projection of scMNase-seq latent variables, coloured by tissue-of-origin. BM, bone marrow. e) UMAP of single-cell transcriptome data of HSPCs from UCB and adult bone marrow, obtained from²⁷. f) Cell-types in the scRNA-seq were annotated using the singleR package⁷² based on the blueprint ref. ⁸⁷. g) Cell-type composition of scRNA-seq from UCB and BM shows that MPPs are highly enriched in UCB compared to BM HSPCs. MPPs, CLPs, and pDCs exhibit a lower prevalence than other cell-types, likely reflecting their natural scarcity in HSPCs²⁴ and known challenges with sustaining these cells in vitro⁸⁸.

Supplementary information

Supplementary Information

Supplementary Figs. 1–37, Methods, Notes and references.

Reporting Summary

Supplementary Data

Strand-seq plots of singleton mSVs.

Supplementary Table

All supplementary tables associated with this study.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Grimes, K., Jeong, H., Amoah, A. et al. Cell-type-specific consequences of mosaic structural variants in hematopoietic stem and progenitor cells. Nat Genet 56, 1134–1146 (2024). https://doi.org/10.1038/s41588-024-01754-2

Download citation

Received: 03 July 2023
Accepted: 17 April 2024
Published: 28 May 2024
Issue Date: June 2024
DOI: https://doi.org/10.1038/s41588-024-01754-2
Springer Nature America, Inc.

Cell-type-specific consequences of mosaic structural variants in hematopoietic stem and progenitor cells

Abstract

Similar content being viewed by others

Main

Results

Single-cell-resolved mSV landscapes in HSPCs

Hotspots of mSV formation

High-precision cell-typing using nucleosome occupancy profiles

Subclonal mSVs commonly exhibit a lineage bias

Cell-type-specific impact of an inversion

Stepwise accumulation of mSVs in HSPCs

Functional effects of mSVs in blood samples

Discussion

Methods

Ethics declarations

Human samples

Statistics and reproducibility

HSPC culturing and Strand-seq library preparation

scMNase-seq

Building nucleosome occupancy reference set cell-type classifiers

mSV discovery in Strand-seq data

Cell-type enrichment testing

Single-cell multiomic analysis of differential gene activities in HSPC subclones

Investigation of potential cis-effects of a balanced inversion

scRNA-seq

scRNA-seq data processing, unsupervised clustering and cell-type annotation

Targeted CNA re-calling in scRNA-seq data

SCE mapping and locus-specific SCE enrichment

Breakpoint refinement by WGS

UK Biobank analysis

Data collection

Variant annotation

Association testing

Reporting summary

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation