BART-Seq: cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and single-cell analysis
- 5.7k Downloads
We describe a highly sensitive, quantitative, and inexpensive technique for targeted sequencing of transcript cohorts or genomic regions from thousands of bulk samples or single cells in parallel. Multiplexing is based on a simple method that produces extensive matrices of diverse DNA barcodes attached to invariant primer sets, which are all pre-selected and optimized in silico. By applying the matrices in a novel workflow named Barcode Assembly foR Targeted Sequencing (BART-Seq), we analyze developmental states of thousands of single human pluripotent stem cells, either in different maintenance media or upon Wnt/β-catenin pathway activation, which identifies the mechanisms of differentiation induction. Moreover, we apply BART-Seq to the genetic screening of breast cancer patients and identify BRCA mutations with very high precision. The processing of thousands of samples and dynamic range measurements that outperform global transcriptomics techniques makes BART-Seq first targeted sequencing technique suitable for numerous research applications.
KeywordsBarcoding Single-cell RNA sequencing Targeted transcriptomics High-throughput screening Human pluripotent stem cells Multiplex PCR
Fluorescence-activated cell sorting
Glycogen synthase kinase 3
Human embryonic stem cells
Human pluripotent stem cells
Long non-coding RNA
Mouse embryonic fibroblasts
Minimum free energy
Molecular inversion probe
Quantitative polymerase chain reaction
t-distributed stochastic neighbor embedding
Uniform manifold approximation and projection
Unique molecular identifier
Indexing of next-generation sequencing (NGS) libraries by “DNA barcodes” is crucial for economies of scale in transcriptomics studies. The pooling of indexed libraries, also known as “multiplexing”, and bioinformatics analysis of sequencing reads with indices provides a basis for quantifying the transcripts. Introduction of indices during reverse transcription is a broadly used barcoding technique for labeling thousands of different gene transcripts , which on the downside leads to shallow coverage per gene. In the case of analyzing thousands of single cells, using 10× genomics or Drop-Seq platforms, for example, the transcriptional information is sufficient mostly for the classification of cell types [2, 3]. Analyzing biological processes in a greater detail requires using either global indexing techniques that provide greater coverage, like SMART-Seq2 , but at a significantly higher cost per sample, or using targeted sequencing approaches.
Methods for targeted analysis of specific transcripts and their multiplexing from many samples are generally based on the capture of the targeted regions [5, 6], or on multiplex or 2-step PCR and ligation [7, 8]. Commercial kits such as Illumina’s Targeted RNA Expression or QiaSeq Targeted RNA Panels are based on these principles. Chief drawbacks of these methods are requirement of bulk amounts of starting material and poor dynamic range readout due to the intermittent steps of fragmentation, capture by beads or by hybridization to arrays, and nested qPCR. Additionally, padlock/molecular inversion probe (MIP)-based methods [9, 10, 11] are generally used for multiplexing very high number of loci in small number of samples. A greater accuracy for measuring gene expression is offered by microfluidic devices coupled to thermocyclers, such as the Fluidigm Biomark , and by probe hybridization technologies such as the Nanostring nCounter , MERFISH , FISSEQ , or seqFISH . However, the complex workflows of these methods, and the costly operation of specialized instrumentation often prevents the analysis of thousands or even hundreds of samples, let alone application to truly massive single cell experiments.
Here, we present a novel method to serially label invariant sets of forward and reverse primers with panels of DNA barcodes, with which we generate amplicons with dual indices. We arrange the DNA barcode panels in large matrices and combine them with cDNA of bulk samples or single cells, followed by PCR and NGS. This concept of a priori sample indexing is different from the existing transcript-targeted analysis techniques, which are generally based on pre-amplification first, and indexing of the samples using DNA barcodes afterwards. The workflow, which we name Barcode Assembly foR Targeted Sequencing (BART-Seq), is inexpensive, simple, scalable, very sensitive, and accurate for omics applications using bulk samples or single cells. The relatively small number of target loci makes it readily possible to quantify gene expression. Importantly, BART-Seq can also be used for high-throughput targeted genomics, as we demonstrate in cancer patients. To make BART-Seq usable for the community, we developed a web-deployed software for designing bioinformatically optimized primers and DNA barcodes, which minimizes the sequence similarity and complementarity, hybridization to off-targets, and formation of secondary structures. Additionally, we implemented a demultiplexing pipeline to sort the amplicons to their respective samples of origin using the dual indices. Finally, we applied BART-Seq for analyzing the mechanisms of differentiation propensities of stem cells. We used human pluripotent stem cells (hPSCs) in massive sampling experiments after exposing the cells to different maintenance media and upon activation of the Wnt/β-catenin pathway at different stages of the signaling cascade. This demonstrated an effective discovery of mechanisms pertinent to cell medicines and disease modeling using BART-Seq.
We designed the oligonucleotide building blocks to ensure intra- and inter-primer compatibility during multiplex PCR and to minimize sample misidentification during demultiplexing. Briefly, all possible eight-mer oligonucleotides with 50–60% GC content were filtered for repeats, followed by global optimization using simulated annealing that selects barcode sequences with the lowest pairwise alignment scores. The ten-mer adapter oligonucleotides were designed the same way (listed in Additional file 1: Table S1). To generate multiplexed primer sets, we designed up to five putative primer pairs per target sequence using an implementation of Primer3 that ensures ending of the primers with a 3′ thymine (to account for template-independent addition of a single adenine (A) by Klenow fragment during primer synthesis, as illustrated in Fig. 1b). Besides the inter-primer compatibility, the efficiency of each primer set depends also on the entire set of target sequences; therefore, we used a simulated annealing approach that minimizes the secondary structure formation by evaluating pairwise folding using RNAcofold . Finally, we excluded primers and barcodes that align to the human genome or transcriptome (details in “Methods” section).
To assess the barcode-primer assembly method, we first targeted specific genomic loci. We co-amplified four and six regions of the human BRCA1 and BRCA2 genes, respectively, using 10 pairs of multiplexed primers, which were assembled with four barcode combinations. qPCR assessment of the pre-amplified samples using nested primers, which were homologous to the assembled primers (laying downstream to the barcodes, Additional file 2: Table S2a), indicated specific enrichment of all 10 loci (Amp1-10), shown by the significantly lower Ct compared to non-pre-amplified sample, non-targeted loci, or non-barcoded rc primers (Fig. 1d, Additional file 7: Figure S1a). Importantly, increasing the number of multiplexed primers gradually from 1 to 10 did not influence the efficiency of amplification (Fig. 1e, Additional file 7: Figure S1b), indicating that the assembly method produces excess of barcoded primers for multiplexed pre-amplification of targeted loci.
Demultiplexing of the reads mapped the amplicons exclusively to the barcode combinations that were used in the experiment. Importantly, we noted only minor sorting of amplicons to 18 additional “dummy” barcodes that were not part of the experiment (Fig. 2b, Additional file 7: Figure S1c). This proved the robustness of the barcode design and demultiplexing pipeline. Comparing the genotyping results of all 1920 multiplexed alleles (spanning 10 amplicons from 96 patient samples with two alleles each) showed that 92 out of 96 patients (~ 96%) mirrored the classification of the clinical lab (Fig. 2c, Additional file 3: Table S3a).
As we serendipitously observed shortening of some of the barcodes by a few bases, which could be due to trimming of 5′ barcode ends by λ-exo during the removal the of rc strand (Fig. 1b), we sought to reduce this effect in order to further improve the classification of amplicons. To this end, we flanked the barcodes by all possible trinucleotides in order to identify the best sequences that could “protect” the 5′ of barcodes from trimming (Additional file 1: Table S1b, Additional file 2: Table S2c). Using a matrix of 5′NNN-barcoded primers tested with a constant amount of template gDNA, we observed by NGS that the trinucleotide 5′CCA had the highest frequency among all 64 combinations (Additional file 7: Figure S1d, Additional file 3: Table S3b). Repeating the patient screening using 5′CCA-barcodes, 95 out of 96 patients (99%) were classified in agreement with the results of the clinical lab (Fig. 2d, Additional file 3: Table S3c). In the misclassified sample, the expected mutation (Mut2) was detected together with an unexpected mutation (Mut7), indicating that the misclassification might be due to sample cross-contamination. Collectively, these developments created a robust workflow for targeted sequencing in genomics studies, which we named gBART-Seq.
List of targeted genes in transcriptomics experiments
Number of targets
List of targets
Housekeeping: B2M, GAPDH; pluripotency: NANOG, POU5F1/OCT4, SOX2, LIN28A, DNMT3B, ZFP42/REX1; cell cycle: CCND1, CCNE1; control: CER1, 4 RNA spike-ins
Lateral mesoderm: EOMES, HAND1, MESP1; paraxial/(pre-)somitic mesoderm: CDX2, HOXA1, MSGN1, PAX3, TBX6; neural crest: ZIC1, MSX1; pan-primitive streak: EVX1, GSC, MIXL1, T; anterior primitive streak: NODAL; endoderm: FOXA2, SOX17; control: GAPDH, NANOG, 3 RNA spike-ins
We subsequently discovered that this was due to the index switching , spreading primarily from the wells with the highest concentration of RNA from hESCs. We later minimized this effect by increasing the diversity of the samples in the flow cell (e.g., using PhiX control or co-sequencing with non-BART-Seq libraries), and noted that this effect became marginal in those experiments (e.g., 0 pg samples in Fig. 3c). Taken together, these data show that rBART-Seq can be used for directly analyzing gene expression in numerous single cells and produce results with a broad dynamic range.
In accordance, we applied non-linear dimensionality reduction (UMAP) to analyze the single hESCs that were grown with the three maintenance media. This revealed two major subpopulations exhibiting ground state-like —NANOGHIGH ZFP42 (REX1)HIGH— and primed-like —LIN28AHIGH DNMT3BHIGH— phenotypes [25, 26, 27, 28] (Fig. 4d). Remarkably, mTeSR™1-treated cells were located primarily in the primed-like cluster, while the majority of the E8-treated cells were located in the ground state-like (naïve) cluster, suggesting that these growth conditions shift hESCs along the pluripotency axis. This indicates that preferences to use mTeSR™1 over E8, or vice versa, as a starting point for differentiation may depend on how well different protocols are tuned to the respective states of pluripotency, for example, in the case of cardiomyocyte differentiation [29, 30, 31].
Massive sequencing of defined sets of transcripts could be highly useful for numerous studies that involve analysis of vast arrays of samples in parallel. The application areas include probing of mechanisms; single-cell analysis; validating and complementing results obtained by genome-wide approaches, such as the Human Cell Atlas Project ; and screening in genome engineering, drug development, and toxicology assays. To become truly impactful, a targeted sequencing method should enable serial and straightforward production of libraries from individual samples, be cost competitive compared to genome-wide approaches, and involve simple bioinformatics analysis. It should also be sensitive and quantitative as genome-wide transcriptomics techniques or have better performance. We show here that BART-Seq is in fact the first transcript-targeted sequencing approach that fulfills these criteria.
BART-Seq overcomes important limitations of other targeted sequencing approaches. First, the workflow does not include intermittent steps of template purification that are generally incompatible with gene expression analysis as the case for other methods [7, 8]. Importantly, the entire BART-Seq process, from primer assembly to count matrices, can be completed within 5 days. Second, BART-Seq creates sequence data, while other methods such as MERFISH or seqFISH infer it from hybridization of probes [14, 16], which could limit the discovery of sequence variants and may require further validation. Third, BART-Seq is an inexpensive technique compared to other targeted and global approaches. Our estimate for the full cost of analysis per sample, e.g., one well of a 384-well plate, consisting of a single-cell or bulk gDNA/cDNA, is approximately 1 US dollar. Compared to the global transcriptomics techniques, this places BART-Seq just above Drop-Seq which is lowest in terms of cost per sample . Fourth, compared to Drop-Seq, 10×, and other methods, BART-Seq can be used to analyze a broader range of target RNAs in single cells, including non-polyadenylated lncRNAs, simply by the use of specific primer sets or random hexamers. For this reason, the method is also readily compatible with genomics studies as we show here (Fig. 2). Therefore, it could be used as an inexpensive and technically straightforward alternative to protocols involving nested PCR [7, 8, 38], gDNA circularization [38, 39], or MIPs  in genomic studies. We have not yet combined the use of unique molecular identifiers (UMIs)  with BART-Seq, which may be important in some applications for counting the absolute number of transcripts.
BART-Seq can expand the segment of targeted transcriptomics, which has not been fully exploited yet, especially in the high-throughput analysis of single cells. Using simple bioinformatics that sorts tens of thousands of amplicons that were indexed by BART-Seq, we gained important insights into the mechanisms that regulate the differentiation of hPSCs. We found that pluripotency is held at different depths when frequently used maintenance media are applied to hPSCs, a phenomenon that could explain the many cases of matching between certain maintenance media and differentiation protocols [29, 30, 31]. Moreover, we showed that the application of a small molecule that is widely considered an agonist of the Wnt/β-catenin pathway, in fact, reduces the repertoire of primitive streak-like progenitors that differentiate from hPSCs. Importantly, these experiments demonstrate that a key feature of BART-Seq is multivariable analysis, in this case of cells that were subjected to different treatments and were sampled at different time points, which can be achieved by simply increasing the size of the barcode panels (Fig. 1a). If we were to analyze the same cohort of samples as we did here by Drop-Seq or 10× techniques, for example, the analysis costs would have been drastically higher due to the use of different preparation kits for every iteration of time point, treatment, and biological replicate [2, 3]. This indicates that BART-Seq is particularly useful for kinetics studies, screens, and for linking phenotypes, e.g., fluorescent marker intensity, to the expression of transcripts and/or genotypes, a level of information that is lost with techniques that utilize pools of cells and barcoded droplets [2, 3, 41, 42].
Design of barcode panels
All possible 8-mer (barcode) and 10-mer (adapter) oligonucleotides of 50–60% GC content were computed omitting sequences with one, two, or three nucleotide repeats. All pairwise global alignment scores were computed separately for barcodes and adapters using pariwise2 from Biopython package. Whenever comparing two barcodes in all forward and reverse combinations, the maximal alignment scores were used for further analysis. Next, a global optimization heuristic (simulated annealing) was implemented to efficiently identify a set of highly unique sequences in terms of likelihood that mutations (exchange, deletion, insertion) might lead to a conversion into another sequence within the set. A random initial set of sequences was either shrunk (with 10% probability), altered by randomly exchanging sequences (36% probability), or randomly increased (54% probability). Changes were accepted if the new sum of alignment scores was lower or by change whenever exp(−Δsumscore÷T) was lower than another random number. This simulated annealing algorithm scanned temperatures T from 10,000 to 0 along 300 cooling iterations to reach a global optimum. The resulting sets were randomly divided into forward and reverse barcodes and adapters. Next, the 3′ of the forward and reverse adapters were ligated in silico to the sequences of the forward and reverse barcode sets, respectively. Finally, BLAST was used to accept 18 nt sequences without any identified hit in the human genome (for barcodes version 1, used for genomics) and transcriptome (for barcodes version 2, used for transcriptomics) as outlined in Additional file 1: Table S1.
Primer design and optimization
Primers were designed to amplify roughly 80–250 bp amplicons in locations where an adenine (A) base exists at the 3′ position of the final primer sequence after barcode assembly. This was done because the DNA Polymerase I large (Klenow) fragment frequently adds a template-independent A base to the 3′ of the newly synthesized strand. Primer3 was used with default settings, but with modified internal primer predictions such that it enforces the primer’s 3′ to end with a T nucleotide. For each template, up to five forward and reverse primer pairs were predicted. Each primer pair set was compared against the human genome using the blastn command from the blast+ package with the parameters -reward 1 -gapopen 5 -gapextend 5. Using our web-based software, the user can set the number of hits allowed for further processing. Next, given the predefined set of barcodes, adapters, and 1–5 predicted primer pairs per loci, an in silico ligation step was performed to generate all possible primer-adapter-barcode combinations. Hereby, matching forward and reverse primers defined one amplicon. To minimize the probability of forming stable dimers, we calculated the all-against-all minimal free energy (including all reverse complements) using the RNAcofold command from the ViennaRNA package version 2.1.8 with the parameters --noPS --noLP -P dna_mathews2004.par. Low predicted minimum free energy correlates to a high probability of forming a stable dimer. A simulated annealing was implemented to identify optimal combinations of each primer pair per locus, thereby taking barcode and adapter sequences into account. During optimization, the minimal value of free energy of the forward or reverse complement sequence was used for determining the probability of forming stable primer dimers. Per amplicon and gene, we started with a random initial set of primers. We proceeded to either randomly alter it (with 80% probability) or randomly exchanged amplicons if there were several amplicons available for a gene. In each step, the random change was accepted if the new sum of minimal free energies (mfe) is lower than in the last or randomly if exp(−Δmfe÷T) was lower than a uniformly drawn random number. We scanned over temperatures T from 15,000 to 0 along 500 cooling iterations. Finally, we reversed the primer sequences and linked 3′ to the reverse sequence of the respective forward or reverse adapter sequences. The primer prediction implementation is a Python-based web front end that is available online at: http://icb-bar.helmholtz-muenchen.de, of which we made the code freely available (see “Availability of data and materials” section).
Design of primer sets
Primer sets targeting 10 specific mutations in BRCA1 and BRCA2 genes [19, 20, 21] were designed based on the human genome reference hg19 (Additional file 2: Table S2a, c). Amplicon size was in the range of 75 to 248 nt to ensure detection by 2 × 150 bp paired-end sequencing. Pluripotency primer set was designed based on the analysis of publicly available RNA-Seq datasets of hESCs via NCBI-GEO from H9, H7, and HD291 cells (GSM602289, GSM1163070, GSM1163071, GSM1163072, GSM1704789, GSM1273672, GSM1327339), and own datasets. The target regions were selected for differentiation primer set using bulk RNA-Seq data produced by stimulation of hESCs by Wnt3a or CHIR99021 for 72 h. RNA-Seq reads were mapped to the genome reference hg38 using CLC Genomics Workbench (version 8.5.1) using mismatch cost: 2, insertion cost: 3, and deletion cost: 3. The regions mapped with a significantly high number of reads overlapping in the majority of the samples were used for primer design. The complete sequences of RNA spike-ins EC2 (RNA1), EC12 (RNA2), EC13 (RNA6), and EC5 (RNA8) were used as target regions (Ambion, AM1780).
Undifferentiated hESCs (H9 line) were maintained on Matrigel™ (Corning)-coated plates in mTeSR™1 medium (Stem Cell Technologies) in 5% (v/v) O2. Cells were passaged as clumps using 2 mg/ml solution of Collagenase Type IV prepared in DMEM F-12 (both from Thermo Fisher Scientific).
Growth media comparison
Cells were split and maintained for five passages in mTeSR™1, E8 (on Matrigel™), and KSR-bFGF media (on CD1-irradiated mouse embryonic fibroblasts) in parallel. E8 medium was prepared as described by Chen et al.  and KSR-bFGF media as described by Krendl et al. . Newborn human BJ fibroblasts (ATCC®) were cultured in DMEM high glucose (Thermo Fisher Scientific), supplemented with 1% GlutaMAX (Life Technologies), NEAA (Thermo Fisher Scientific), and 10% HyClone™ Fetal Bovine Serum (GE Healthcare).
Wnt/β-catenin pathway activation
hESCs and hESC line modified with doxycycline-inducible β-catenin (constitutively active form ΔN90) were maintained on Matrigel™-coated plates in mTeSR™1 medium with 25 μg/ml Hygromycin B (Thermo Fisher) in the case of β-cateninΔN90 line. For time course stimulations, the cells were dissociated to single-cell suspension with Accutase (Sigma) and seeded into 12-well plates at 2.5 × 105 cells per well in the presence of 10 μM Y-27632 (R&D Systems). The next day, the medium was changed to RPMI-1640 with l-glutamine supplemented with 1× non-essential amino acids and 1× B27 supplement without insulin (all from Life Technologies). Ligands were as follows: 10 μM CHIR99021 (Tocris) and 240 ng/ml recombinant Wnt3a (gift from Derk ten Berge, Erasmus University Medical Centre, Rotterdam). β-catenin expression was induced by adding 1 μg/ml doxycycline (Clontech). The medium and ligands were freshly re-added every 24 h.
Single-cell sorting and cDNA synthesis
hESCs were dissociated using Accutase (Sigma), and cells maintained in KSR-bFGF on MEFs were collected as clumps using Collagenase Type IV prior to Accutase treatment. Newborn human BJ fibroblasts were dissociated using Trypsin-EDTA 0.25% (Gibco). For sorting, the cells were resuspended in 1 ml of FACS buffer (4% FBS and 5 μM EDTA in PBS), filtered through a 0.2-μm nylon mesh, and single live cells (propidium iodide negative) were sorted into the 384-well plates (1–32 cells for medium comparison, and single cells for Wnt pathway activation) pre-filled with 2 μl reverse transcription mixture, using Aria III sorter (BD Biosciences).
Reverse transcription mixture (RT mix) was prepared using SuperScript™ III First-Strand Synthesis System (Invitrogen) with reverse transcriptase at a final concentration of 2.5 U/μl (nuclease-free water) and Oligo-dT primers (2.5 μM). RNA spike-ins were included in the RT mix (experiment-specific concentrations). Following sorting, plates were sealed with adhesive foils, placed immediately on dry ice for 2 min, and stored at − 20 °C. Plates were thawed at room temperature, and the reverse transcription was performed using the thermocycler program: 50 °C for 50 min and 85 °C for 5 min; RNaseH was not used.
Bulk RNA isolation
Total RNA was extracted using RNeasy Mini Kit (QIAGEN).
Klenow fill-in reaction
Unit reaction mixture was prepared in nuclease-free water by combining 1× React® 2 Buffer (Invitrogen), 0.267 mM dNTPs, 2.5 μM multiplexed rc primer mix, 2.5 μM barcode, and 0.0167 U/μl DNA Polymerase I large (Klenow) fragment (Invitrogen). The reaction was incubated at 25 °C for 1 h. Individual rc primers were used at a 0.025-μM final concentration, and barcode concentrations were matched to the total concentration of rc primers (incubation time of 2 h was also applicable). The enzyme was heat inactivated at 80 °C for 10 min.
Reverse complementary strand removal by lambda exonuclease
Products of the fill-in reaction were directly diluted as 2/3 volume ratio in the lambda reaction mixture containing 1× reaction buffer and 0.33 U/μl lambda exonuclease (New England Biolabs) and incubated at 37 °C for 30 min (incubation time of 1 h is also applicable). The enzyme was heat inactivated at 80 °C for 10 min.
PCR reactions (10 μl total) consisted of 2.5 μl (0.5× final) Platinum® Multiplex PCR Master Mix (Applied Biosystems), 1.8 μl 25 mM MgCl2 (4.5 mM final), 1.5 μl forward lambda reaction product (non-purified), 1.5 μl reverse lambda reaction product (non-purified), 2 μl cDNA, and 0.7 μl nuclease-free water (not DEPC-treated). The reaction cycle profile was as follows: initial denaturation at 95 °C for 5 min; 22 cycles of 95 °C for 30 s, 60 °C for 3 min, 72 °C for 60 s; and final extension at 68 °C for 10 min. Unit PCR reaction of genotyping assays was 20 μl, with the same concentration of reagents, and 18 cycles of PCR. Unit PCR reaction of transcriptomics experiments was 10 μl, with cycle numbers between 16 and 22.
qPCR and melting curve analysis
qPCR analyses were performed using nested primers, which are homologous to the barcode-assembled primers, excluding the barcode and the adapter regions (Additional file 2: Table S2). Unit reaction (10 μl total) consisted of 5 μl (1× final) Power SYBR™ Green PCR Master Mix (Applied Biosystems), 1 μl pre-amplification PCR product, 1 μl forward and reverse nested primers mix (each 0.2 μM final), and 3 μl nuclease-free water (not DEPC-treated). The reaction cycle profile was as follows: initial denaturation at 95 °C for 10 min followed by 35–40 cycles of 95 °C for 15 s and 60 °C for 1 min. Melting curve analysis was done by heating the amplicons from 60 to 95 °C, incrementing 0.05 °C/s. All the reactions were run as three replicates.
Sample pooling and purification
PCR products were pooled in nuclease-free falcon tubes (Ambion), mixed with 0.1 volume 3 M NaOAc (pH 5.5) (Ambion) and 2.5 volume 100% ethanol (molecular biology grade), and kept at − 20 °C overnight for precipitation. Samples were centrifuged at 4000g for 30 min in a centrifuge pre-cooled to 4 °C. The supernatant was discarded, and the samples were washed once with 500 μl ice-cold 70% ethanol. Tubes were centrifuged at 4000g for 2 min (4 °C), and the remaining supernatant was pipetted out. The pellet was air dried for 2–3 min and re-suspended in 200–400 μl nuclease-free water. Prior to library preparation, double-sided size selection was performed using Agencourt AMPure XP beads (Beckman Coulter). 0.5× and 1.5× bead to DNA ratio was used for upper and lower size limits, respectively.
RNA-Seq library preparation and sequencing
Libraries were prepared using NEBNext® Multiplex Oligos for Illumina® (New England Biolabs, E7335), and the protocol was based on NEBNext® ChIP-Seq Library Prep Master Mix Set for Illumina® (New England Biolabs, E6240) with the following modifications: end repair was performed using 1 μl NEBNext End Repair Enzyme Mix in 50 μl final reaction. PCR enrichment included 1 μl index and 1 μl universal primers in 50 μl final reaction. The enrichment PCR cycle profile was as follows: initial denaturation at 98 °C for 30 s; 10–15 cycles of 98 °C for 10 s, 65 °C for 30 s, 72 °C for 30 s; and final extension at 72 °C for 5 min. Fifteen, 15, 12, and 10 cycles of PCR enrichment was applied for genotyping, bulk dilution, media comparison, and mesoderm experiments, respectively. Beads to DNA ratios for purification steps using AMPure XP beads were adjusted according to the expected maximum and minimum amplicon size of the individual libraries. Libraries were evaluated using Agilent 2100 Bioanalyzer by High Sensitivity DNA Kit (Agilent) and quantified using Qubit® 2.0 Fluorometer by Qubit® dsDNA HS Assay Kit (Invitrogen), and by Safire II Microplate Reader (Tecan) using Quant-iT™ PicoGreen™ dsDNA Assay Kit (Invitrogen). Libraries were sequenced (paired-end) on Illumina MiSeq using MiSeq® Reagent Kit v2 (300 cycles) or Illumina NextSeq 500 using NSQ® 500/550 Mid Output Kit v2 (300 cycles). Ten percent PhiX control (Illumina, #FC-110-3001) was included in the sequencing runs as a measure against index switching  for low-diversity libraries like BART-Seq.
Demultiplexing of RNA-Seq reads to count matrices
To trace the origins of reads back to the samples, a pipeline that demultiplexed the reads and counted them while accounting for sequencing errors was implemented. FastQC software was used to create quality reports for manual inspection . Given the acceptable quality, Snakemake workflow engine  was used for automatic or step-by-step analysis of raw reads, sets of primers, linkers, barcodes, and expected amplicons. This started by trimming the read ends according to quality using Sickle , then a list of possible single nucleotide-mutated variants per barcode, excluding the ones shared with other barcodes, was created. Using the algorithm of Aho and Corasick , this list efficiently assigned barcodes to all reads while allowing at most one unambiguous mismatch. We also annotate the reads with several boolean criteria for statistical analysis of libraries. This included the information if the read contained only a primer, multiple (or no) barcodes, if the barcode contained a mismatch or if the read contained bases before the protection group. We aligned the longer amplicons to the reads using HISAT2 . The final step of the pipeline is to summarize the results. Heatmaps for each library were created per amplicon using the forward and reverse barcodes as a coordinate system, and a spreadsheet file containing the aforementioned read statistics as well as count matrices was generated. The pipeline was also made available as described in “Availability of data and materials” section.
Classification of BRCA mutations
To classify the amplicons corresponding to mutations 1–10, we generated read count per patient for both wild-type and mutation alleles (identified by top blast hit per read) and assigned the mutation type with the highest mutation read count. Read count ratios of mutation to wild-type allele per sample were computed and accepted as the mutation for ratios > 0.20.
Analysis of protection group
For the analysis of 5′ protection groups, we identified barcodes using BLAT , a BLAST-like alignment tool, with options -minScore=0 -minIdentity=95 allowing for one base mismatch at most. This was necessary to screen all possible protection groups. For each detected wild-type or mutant allele, we calculated the frequency of 64 trinucleotides for each forward and reverse barcode. Then, summing the frequencies up across all the alleles, we obtained the total frequency of each trinucleotide per barcode.
Data correction and normalization
Correction of RNA spike-in reads
First, all wells with extreme outlier spike-in reads were manually removed after inspecting the heatmaps of raw read counts (i.e., if exhibiting hundreds of folds higher/lower reads than the average). Per gene, samples exhibiting extremely low barcode-gene combinations were removed. Then, per spike-in, two-sided t test (default parameters, R version 3.5.2) was performed for each barcode against the rest of the barcodes of the same type (i.e., forward or reverse), using the data between the 5th and 95th percentiles for both groups. Barcode-spike-in combinations with p values lower than the set threshold were replaced with the median of the rest of the barcodes.
Normalization of the data
Wells were removed if the scaling factor was tenfold lower or higher than the median, to prevent overcorrection. Then, the factors were median-centered via division to preserve the read count magnitudes. Finally, raw read counts of the transcripts were divided by the scaling factors (Additional file 7: Figure S3). The corresponding script is available at the Github (see the “Availability of data and materials” section). An alternative and more precise method for normalizing the data based on the correction of spike-ins using negative binomial generalized linear modeling is also provided in the same repository.
Well filtering in single-cell experiments
Wells sorted with single cells were operationally defined as “empty” if the ratio of the sum of the spike-in reads to the total reads per sample (normalized and log-transformed) was same or higher than the negative controls (into which no cells were sorted) (negative control wells received some reads due to index switching). Samples representing the wells sorted with multiple cells were filtered based on the calculated one-cell values of the genes. Filtering the samples sorted with two cells or more, i.e., “doublets,” was done by placing a threshold estimated based on the bimodal distribution of the sum of the genes (log2-transformed) (Additional file 7: Figure S4a). Only housekeeping genes were used for filtering fibroblasts.
Analysis of gene expression
Gene expression analyses were done using custom scripts or Seurat package in R (version 2.3.4), based on normalized and log2-transformed read counts. Linear regression models were calculated using lm function (default parameters, R version 3.5.2).
We would like to thank the German Academic Exchange Service (DAAD) for supporting FU. We thank E. Rusha for assisting with the cell culture work; R. Batra for maintaining the scripts; L. Simon, R. Batra, A. Alpert, and S.S. Shen-Orr for the fruitful discussions on data analysis; Y. Mayshar for the comments on the manuscript; and I. Lerner and V. Meiner from the Center for Clinical Genetics at the Hadassah Medical Center for supplying patient DNA samples.
CS, CK, FU, FO, and MD developed the BART-Seq method. SS, PA, NSM, and FJT designed the bioinformatics tools for primer/barcode design and demultiplexing the NGS reads, and contributed writing the relevant sections. FU, FO, and DS performed the experiments. PA and SS demultiplexed the data. FO and CS contributed to genotyping analyses. FU and MD analyzed the results and wrote the manuscript. MD supervised the project. All authors read and approved the final manuscript.
FU was supported by the German Academic Exchange Service (DAAD) via the funding program Research Grants for Doctoral Candidates and Young Academics and Scientists. A Translational and Clinical Project grant by the Helmholtz Center supported this work.
Ethics approval and consent to participate
Anonymous archival samples from the Center for Clinical Genetics at the Hadassah Medical Center in Jerusalem were used for BRCA1 and BRCA2 analysis.
Consent for publication
FO, CS, CK, and MD are named on a patent application for the BART-Seq method, which does not limit the use of the method in any way.
- 5.Li H, Qu J, Fu X-D. RASL-seq for massively parallel and quantitative analysis of gene expression. Curr Protoc Mol Biol. 2012;CHAPTER:Unit4.13.Google Scholar
- 8.Herbold CW, Pelikan C, Kuzyk O, Hausmann B, Angel R, Berry D, et al. A flexible and economical barcoding approach for highly multiplexed amplicon sequencing of diverse target genes. Front Microbiol. 2015;6 [cited 2018 Nov 8]. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4503924/.
- 11.Tao L, Raz O, Marx Z, Biezuner T, Amir S, Milo L, et al. A biological-computational human cell lineage discovery platform based on duplex molecular inversion probes. bioRxiv. 2018;18:191296.Google Scholar
- 12.Kalisky T, Blainey P, Quake SR. Genomic analysis at the single-cell level. Annu Rev Genet. 2011;45 [cited 2017 Sep 8]. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3878048/.PubMedCrossRefGoogle Scholar
- 24.Sinha R, Stanley G, Gulati GS, Ezran C, Travaglini KJ, Wei E, et al. Index switching causes “spreading-of-signal” among multiplexed samples in Illumina HiSeq 4000 DNA sequencing. bioRxiv. 2017:125724. https://doi.org/10.1101/125724.
- 36.iTranscriptome. iTranscriptome-Home [Internet]. [cited 2019 Mar 19]. Available from: http://www.picb.ac.cn/hanlab/itranscriptome/Home/.
- 37.Hon C-C, Shin JW, Carninci P, Stubbington MJ. The human cell atlas: technical approaches and challenges. Brief Funct Genomics. [cited 2017 Nov 22]; Available from: https://academic.oup.com/bfg/advance-article/doi/10.1093/bfgp/elx029/4571849.
- 42.Lan F, Haliburton JR, Yuan A, Abate AR. Droplet barcoding for massively parallel single-molecule deep sequencing. Nat Commun. 2016;7 [cited 2017 Aug 24]. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4931254/.
- 45.Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.Google Scholar
- 47.Joshi NA, Fass JN. Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ files. (Version 1.33) [Software]. 2011. Available at https://github.com/najoshi/sickle
- 51.Uzbas F, Opperer F, Shaposhnikov D, Drukker M. BART-seq: cost-effective massively parallel targeted sequencing for genomics and transcriptomics. GSE107723. Gene Expression Omnibus. 2019; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107723.
- 52.Shaposhnikov D. Total RNA sequencing of a time course treatment of human embryonic stem cells with CHIR99021, recombinant Wnt3a, and a time course activation of constitutively active beta-catenin expression. GSE130381. Gene Expression Omnibus. 2019; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130381.
- 53.Sass S, Angerer P, Uzbas F, Batra R, Müller N. Software required for Bart-Seq technology. Github. 2019. https://doi.org/10.5281/zenodo.3252205.
- 54.Angerer P. Demultiplexing pipeline for BARTSeq. Github. 2019. https://doi.org/10.5281/zenodo.3251773.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.