Clonal Analysis of Cells with Cellular Barcoding: When Numbers and Sizes Matter

Bystrykh, Leonid V.; Belderbos, Mirjam E.

doi:10.1007/7651_2016_343


Part of the book series: Methods in Molecular Biology (MIMB, volume 1516)


Abstract

Cellular barcoding is a recently rediscovered tool to trace the clonal output of individual cells with genetically distinct and heritable DNA sequences. Each year a few dozen papers are published using the cellular barcoding technique. Those publications largely focus on mutually related issues, namely counting cells capable of clonal proliferation and expansion, monitoring clonal dynamics over time, tracing the origin of differentiated cells, characterizing the differentiation potential of stem cells, and similar topics. Apart from their biological content, claims, and conclusions, these studies show remarkable diversity in technical aspects of the barcoding method and sometimes in major conclusions. Although a diversity of approaches is quite usual in data analysis, divergent handling of barcode data might directly affect experimental results and their biological interpretation. Here, we describe typical challenges and caveats in the cellular barcoding publications available so far.


References

1. Schepers K, Swart E, van Heijst JWJ et al (2008) Dissecting T cell lineage relationships by cellular barcoding. J Exp Med 205:2309–2318. doi:10.1084/jem.20072462
2. Gerrits A, Dykstra B, Kalmykowa OJ et al (2010) Cellular barcoding tool for clonal analysis in the hematopoietic system. Blood 115:2610–2618. doi:10.1182/blood-2009-06-229757
3. Lu R, Neff NF, Quake SR, Weissman IL (2011) Tracking single hematopoietic stem cells in vivo using high-throughput sequencing in conjunction with viral genetic barcoding. Nat Biotechnol 29:928–933. doi:10.1038/nbt.1977
4. Verovskaya E, Broekhuis MJC, Zwart E et al (2013) Heterogeneity of young and aged murine hematopoietic stem cells revealed by quantitative clonal analysis using cellular barcoding. Blood 122:523–532. doi:10.1182/blood-2013-01-481135
5. Naik SH, Schumacher TN, Perié L (2014) Cellular barcoding: a technical appraisal. Exp Hematol 42:598–608. doi:10.1016/j.exphem.2014.05.003
6. Cheung AMS, Nguyen LV, Carles A et al (2013) Analysis of the clonal growth and differentiation dynamics of primitive barcoded human cord blood cells in NSG mice. Blood 122:3129–3137. doi:10.1182/blood-2013-06-508432
7. Brugman MH, Wiekmeijer A-S, van Eggermond M et al (2015) Development of a diverse human T-cell repertoire despite stringent restriction of hematopoietic clonality in the thymus. Proc Natl Acad Sci U S A 112:E6020–E6027. doi:10.1073/pnas.1519118112
8. Harwell CC, Fuentealba LC, Gonzalez-Cerrillo A et al (2015) Wide dispersion and diversity of clonally related inhibitory interneurons. Neuron 87:999–1007. doi:10.1016/j.neuron.2015.07.030
9. Golden JA, Cepko CL (1996) Clones in the chick diencephalon contain multiple cell types and siblings are widely dispersed. Development 122:65–78
10. Golden JA, Fields-Berry SC, Cepko CL (1995) Construction and characterization of a highly complex retroviral library for lineage analysis. Proc Natl Acad Sci U S A 92:5704–5708
11. Nguyen LV, Cox CL, Eirew P et al (2014) DNA barcoding reveals diverse growth kinetics of human breast tumour subclones in serially passaged xenografts. Nat Commun 5:5871. doi:10.1038/ncomms6871
12. Porter SN, Baker LC, Mittelman D, Porteus MH (2014) Lentiviral and targeted cellular barcoding reveals ongoing clonal dynamics of cell lines in vitro and in vivo. Genome Biol 15:R75. doi:10.1186/gb-2014-15-5-r75
13. Bhang HC, Ruddy DA, Krishnamurthy Radhakrishna V et al (2015) Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat Med 21:440–448. doi:10.1038/nm.3841
14. Wu C, Li B, Lu R et al (2014) Clonal tracking of rhesus macaque hematopoiesis highlights a distinct lineage origin for natural killer cells. Cell Stem Cell 14:486–499. doi:10.1016/j.stem.2014.01.020
15. Gerlach C, Rohr JC, Perié L et al (2013) Heterogeneous differentiation patterns of individual CD8+ T cells. Science 340:635–639. doi:10.1126/science.1235487
16. Chapal-Ilani N, Maruvka YE, Spiro A et al (2013) Comparing algorithms that reconstruct cell lineage trees utilizing information on microsatellite mutations. PLoS Comput Biol 9:e1003297. doi:10.1371/journal.pcbi.1003297
17. Cornils K, Thielecke L, Hüser S et al (2014) Multiplexing clonality: combining RGB marking and genetic barcoding. Nucleic Acids Res 42:e56. doi:10.1093/nar/gku081
18. Bystrykh LV, de Haan G, Verovskaya E (2014) Barcoded vector libraries and retroviral or lentiviral barcoding of hematopoietic stem cells. Methods Mol Biol 1185:345–360. doi:10.1007/978-1-4939-1133-2_23
19. Verovskaya E, Broekhuis MJC, Zwart E et al (2014) Asymmetry in skeletal distribution of mouse hematopoietic stem cell clones and their equilibration by mobilizing cytokines. J Exp Med 211:487–497. doi:10.1084/jem.20131804
20. Kolfschoten IGM, van Leeuwen B, Berns K et al (2005) A genetic screen identifies PITX1 as a suppressor of RAS activity and tumorigenicity. Cell 121:849–858. doi:10.1016/j.cell.2005.04.017
21. Adams BD, Guo S, Bai H et al (2012) An in vivo functional screen uncovers miR-150-mediated regulation of hematopoietic injury response. Cell Rep 2:1048–1060. doi:10.1016/j.celrep.2012.09.014
22. Nguyen LV, Pellacani D, Lefort S et al (2015) Barcoding reveals complex clonal dynamics of de novo transformed human mammary cells. Nature. doi:10.1038/nature15742
23. Akhtar W, de Jong J, Pindyurin AV et al (2013) Chromatin position effects assayed by thousands of reporters integrated in parallel. Cell 154:914–927. doi:10.1016/j.cell.2013.07.018
24. Colvin GA, Lambert J-F, Abedi M et al (2004) Murine marrow cellularity and the concept of stem cell competition: geographic and quantitative determinants in stem cell biology. Leukemia 18:575–583. doi:10.1038/sj.leu.2403268
25. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079. doi:10.1093/bioinformatics/btp352
26. McKenna A, Hanna M, Banks E et al (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303. doi:10.1101/gr.107524.110
27. Dykstra B, Olthof S, Schreuder J et al (2011) Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells. J Exp Med 208:2691–2703. doi:10.1084/jem.20111490
28. Bystrykh LV (2012) Generalized DNA barcode design based on Hamming codes. PLoS One 7:e36852. doi:10.1371/journal.pone.0036852
29. Kim S, Kim N, Presson AP et al (2010) High-throughput, sensitive quantification of repopulating hematopoietic stem cell clones. J Virol 84:11771–11780. doi:10.1128/JVI.01355-10
30. Kim S, Kim N, Presson AP et al (2014) Dynamics of HSPC repopulation in nonhuman primates revealed by a decade-long clonal-tracking study. Cell Stem Cell 14:473–485. doi:10.1016/j.stem.2013.12.012
31. Gabriel R, Kutschera I, Bartholomae CC et al (2014) Linear amplification mediated PCR--localization of genetic elements and characterization of unknown flanking DNA. J Vis Exp e51543. doi:10.3791/51543
32. Xu Q, Schlabach MR, Hannon GJ, Elledge SJ (2009) Design of 240,000 orthogonal 25mer DNA barcode probes. Proc Natl Acad Sci U S A 106:2289–2294. doi:10.1073/pnas.0812506106
33. Buschmann T, Bystrykh LV (2013) Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinformatics 14:272. doi:10.1186/1471-2105-14-272
34. Livet J, Weissman TA, Kang H et al (2007) Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system. Nature 450:56–62. doi:10.1038/nature06293
35. Wei Y, Koulakov AA (2012) An exactly solvable model of random site-specific recombinations. Bull Math Biol 74:2897–2916. doi:10.1007/s11538-012-9788-z
36. Peikon ID, Gizatullina DI, Zador AM (2014) In vivo generation of DNA sequence diversity for cellular barcoding. Nucleic Acids Res 42:e127. doi:10.1093/nar/gku604
37. Ally D, Ritland K, Otto SP (2008) Can clone size serve as a proxy for clone age? An exploration using microsatellite divergence in Populus tremuloides. Mol Ecol 17:4897–4911. doi:10.1111/j.1365-294X.2008.03962.x
38. Mock KE, Rowe CA, Hooten MB et al (2008) Clonal dynamics in western North American aspen (Populus tremuloides). Mol Ecol 17:4827–4844. doi:10.1111/j.1365-294X.2008.03963.x
39. Naxerova K, Brachtel E, Salk JJ et al (2014) Hypermutable DNA chronicles the evolution of human colon cancer. Proc Natl Acad Sci U S A 111:E1889–E1898. doi:10.1073/pnas.1400179111
40. Shlush LI, Chapal-Ilani N, Adar R et al (2012) Cell lineage analysis of acute leukemia relapse uncovers the role of replication-rate heterogeneity and microsatellite instability. Blood 120:603–612. doi:10.1182/blood-2011-10-388629
41. Mullighan CG (2013) Genomic characterization of childhood acute lymphoblastic leukemia. Semin Hematol 50:314–324. doi:10.1053/j.seminhematol.2013.10.001
42. Ding L, Ley TJ, Larson DE et al (2012) Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481:506–510. doi:10.1038/nature10738
43. Behjati S, Huch M, van Boxtel R et al (2014) Genome sequencing of normal cells reveals developmental lineages and mutational processes. Nature 513:422–425. doi:10.1038/nature13448
44. Blundell JR, Levy SF (2014) Beyond genome sequencing: lineage tracking with barcodes to study the dynamics of evolution, infection, and cancer. Genomics 104:417–430. doi:10.1016/j.ygeno.2014.09.005
45. Korhonen J, Martinmäki P, Pizzi C et al (2009) MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 25:3181–3182. doi:10.1093/bioinformatics/btp554
46. Bailey TL, Boden M, Buske FA et al (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37:W202–W208. doi:10.1093/nar/gkp335
47. Bailey TL, Johnson J, Grant CE, Noble WS (2015) The MEME suite. Nucleic Acids Res 43:W39–W49. doi:10.1093/nar/gkv416
48. van der Loo MPJ (2014) The stringdist package for approximate string matching. R J 6:111–122


Acknowledgements

We kindly acknowledge Erik Zwart, Evgenia Verovskaya, and Tilo Buschmann for their critical review, comments, and suggestions. M. Belderbos was supported by personal grants from the University Medical Center Groningen, the European Research Institute for the Biology of Ageing and the Dutch Cancer Society.

Author information

Authors and Affiliations

Laboratory of Ageing Biology and Stem Cells, European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Antonius Deusinglaan 1, Building 3226, 9713 AV Groningen, The Netherlands
Leonid V. Bystrykh & Mirjam E. Belderbos
Department of Pediatrics, University Medical Center Groningen, Groningen, The Netherlands
Mirjam E. Belderbos


Corresponding author

Correspondence to Leonid V. Bystrykh.

Editor information

Editors and Affiliations

Ottawa Hospital Research Institute, Ottawa, Ontario, Canada
Kursad Turksen

Appendix: Background and Examples of Data Analysis

1.1 Probability of Unique Barcoding

To tag each target cell with a unique barcode, the number of barcodes needs to be in excess of the number of target cells. Here, we illustrate how to calculate whether the library is sufficiently large compared to the potential number of barcoded cells. If we ignore the unequal representation of barcodes in the library (ideally all barcodes are equally frequent, yet most of the time the library is skewed; see below for how to deal with this), then the probabilities can be calculated by filling numbers into the binomial distribution model.

If n is the library size and k is the number of barcoded cells, then the approximate probability of having each cell uniquely barcoded is as follows:

$$ P\left(k,n\right) = \mathrm{binom}.\mathrm{dist}(\mathrm{hits}=1,\ \mathrm{trials}=k,\ \mathrm{probability}\ \mathrm{of}\ \mathrm{hit}=1/n,\ \mathrm{cumulative}=\mathrm{true}) $$

(1a)

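This can be presented as a formula, in which x is the number of hits (0 or 1), k is the number of trials, and 1/n is the probability of a hit:

$$ P\left(k,n\right)={\displaystyle \sum_{x=0}^1\left(\frac{k!}{x!\left(k-x\right)!}\right){\left(\frac{1}{n}\right)}^x{\left(1-\frac{1}{n}\right)}^{k-x}} $$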

(1b)

Example:

Suppose we have a library of 500 barcodes (n) and we aim to barcode 50 stem cells uniquely (k). What is our risk of having two stem cells labeled with the same barcode?

In this case:

$$ 1-P\left(50,500\right) = 1-\mathrm{binom}.\mathrm{dist}(\mathrm{hits}=1,\ \mathrm{trials}=50,\ \mathrm{probability}\ \mathrm{of}\ \mathrm{hit}=1/500,\ \mathrm{cumulative}=\mathrm{true}) = 0.004597188 $$

Therefore, this is a relatively safe experimental design (<0.5 % chance that a given barcode labels more than one cell). Note: this is not the same as the probability that at least one pair of cells anywhere in the entire set carries the same barcode (a.k.a. the “birthday paradox”). That probability is quite high. Here, we calculate the probability that one particular barcode is shared, out of all possible paired combinations. This equation will be used for analyzing barcode sequencing data below.
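The worked example can be checked with a few lines of Python. This is a minimal sketch rather than the authors' own script: the function names are hypothetical, and it assumes that all barcodes in the library are equally frequent. The second function computes the birthday-type probability mentioned in the note, for comparison.

```python
def p_hit_at_most_once(k, n):
    """Cumulative binomial P(X <= 1) for k trials (cells) and hit probability 1/n."""
    p = 1.0 / n
    return (1 - p) ** k + k * p * (1 - p) ** (k - 1)


def p_any_shared_barcode(k, n):
    """Birthday-type probability that at least one pair among k cells shares a barcode."""
    p_all_unique = 1.0
    for i in range(k):
        p_all_unique *= (n - i) / n
    return 1 - p_all_unique


# Library of 500 barcodes (n), 50 barcoded stem cells (k), as in the example.
risk = 1 - p_hit_at_most_once(50, 500)
print(f"risk according to Eq. 1a: {risk:.9f}")
print(f"birthday-type probability: {p_any_shared_barcode(50, 500):.3f}")
```

The first value reproduces the risk quoted in the example; the second shows how much higher the birthday-type probability is for the same library and cell number.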

1.2 How to Estimate Approximate Number of Barcodes from Reading a Crude PCR Barcode Sanger Chromatogram

The number of barcodes in a sample can be approximated by analysis of the Sanger chromatogram (Fig. 2). This can be done with prediction models in different ways. The easiest way is to use a random generator and to count how often a single, double, triple, or quadruple peak appears at a variable position of the barcode, depending on the number of barcodes in the mixture (n). Pseudocode would look like:

My barcode mix = n
MaxIterations = 1000
For i in range(0, MaxIterations):
    For j in range(0, n):
        Take random(A, C, G, T)
    Count frequency of single, double, triple, quadruple bases
Collect and report all frequencies

An example of the counts obtained by random simulation is as follows:

	Frequencies of base calls, N
BC mix	1	2	3	4
1	1000	0	0	0
2	240	760	0	0
3	65	557	378	0
4	17	320	572	91
5	2	177	605	216
6	1	70	568	361
7	0	53	428	519
8	0	36	350	614
9	0	13	303	684
10	0	8	230	762

Here, BC mix stands for the number of barcodes in the mixture (in rows). The columns show how often each kind of base call (single peak, double peak, etc.) occurred in 1000 iterations.

Another, more accurate approach is to count all possible combinations of base calls depending on number of trials, as follows:

	Frequencies of base calls, N
BC mix	1	2	3	4
1	4	0	0	0
2	4	12	0	0
3	4	36	24	0
4	4	84	144	24
5	4	180	600	240
6	4	372	2160	1560
7	4	756	7224	8400
8	4	1524	23184	40824
9	4	3060	72600	186480
10	4	6132	223920	818520

Both approaches can be programmed and they give equivalent results.
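Both approaches can be sketched in Python as follows. This is an illustrative sketch with hypothetical function names, not the authors' script. The simulation follows the pseudocode above. The exact count uses the standard inclusion-exclusion formula for the number of ways n draws can cover exactly j of the 4 bases, which is our choice of method for the enumeration.

```python
import random
from math import comb


def simulate_peak_counts(n_barcodes, iterations=1000, seed=0):
    """Count how often 1, 2, 3, or 4 distinct bases occur at one variable position."""
    rng = random.Random(seed)
    counts = [0, 0, 0, 0]
    for _ in range(iterations):
        bases = {rng.choice("ACGT") for _ in range(n_barcodes)}
        counts[len(bases) - 1] += 1
    return counts


def exact_peak_counts(n_barcodes):
    """Number of base combinations that show exactly 1, 2, 3, or 4 distinct bases."""
    result = []
    for j in range(1, 5):
        # Ways for n_barcodes draws to use all of j given bases (inclusion-exclusion).
        onto = sum((-1) ** i * comb(j, i) * (j - i) ** n_barcodes for i in range(j + 1))
        result.append(comb(4, j) * onto)
    return result


for n in range(1, 11):
    print(n, simulate_peak_counts(n), exact_peak_counts(n))
```

Dividing each exact count by 4 to the power n gives the expected fraction of positions with that kind of peak, which can be compared with the simulated counts divided by the number of iterations.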

1.3 Major Properties of Barcode Libraries

There are three major parameters in barcoding to check and analyze: (1) the total number of barcodes (in the library and in experimental data); (2) the randomness of barcodes regarding their sequence; and (3) the evenness of barcodes in the sample (in the library and in experimental data). The effective library size is directly affected by the randomness of synthesis: the less randomly the library is made, the fewer barcodes it will contain. Evenness (uniformity) of barcodes also affects the effective size of the library: with less evenness the effective size decreases. Finally, randomness of sequences is connected to the observed and expected distances between barcodes. Below, we present the minimal computational background to estimate these parameters.

1.4 Randomness

Poisson and binomial distributions are good at predicting the uniqueness of barcoding by random sequences, and at estimating maximal sizes of libraries and clones. Whenever possible, it is important to report how random the chemical synthesis of the used barcodes was. Sanger sequencing of E. coli clones is preferable because it is free from deep sequencing errors. Deep sequencing data are acceptable if sequencing noise is convincingly (adequately) removed.

1.5 Skewing

Skewing is a widely ignored aspect of the effective size of the vector library. It can be approached from the perspective of information theory. In an ideal library of evenly distributed frequencies of items, we have the maximal Shannon diversity index, H_sh. If items are not equal, then we can calculate the diversity index in two different ways. One is the Shannon diversity index:

$$ {H}_{\mathrm{sh}}=-{\displaystyle \sum {p}_i\times \mathrm{L}\mathrm{n}\left({p}_i\right)} $$

(2)

where p_i is the probability value for every barcode on the list. If the barcode frequencies are equal, then p_i = 1/N, where N is the number of all barcodes. If those frequencies are not equal, then for every barcode the p_i value will be the number of reads of the i-th barcode divided by the total reads for all barcodes. The barcode number corrected by those parameters (C_b) will be:

$$ {C}_{\mathrm{b}}= \exp \left({H}_{\mathrm{sh}}\right) $$

(3)

This information content, however, decreases with increased skewing. Barcodes with very small frequencies contribute progressively less to the whole library. This can be estimated directly by calculating the indexes mentioned above. An example is given in Fig. 6.

A nearly identical solution to the Shannon correction can be obtained using the information index (in bits of information):

$$ \mathrm{I}\mathrm{B}=-{\displaystyle \sum {p}_i\times {\mathrm{Log}}_2\left({p}_i\right)} $$

(4)

The correction is then done as follows:

$$ {C}_{\mathrm{b}}={2}^{\mathrm{IB}} $$

(5)

The result is identical to the Shannon index described above.

For skewing, the equitability index can be used, which is the Shannon index divided by the natural logarithm of the total number of reported barcodes.

$$ \mathrm{E}\mathrm{q}={H}_{\mathrm{Sh}}/\mathrm{L}\mathrm{n}(N) $$

(6)

Consider the example presented in Fig. 6. The table displays four data sets of ten barcodes, with either equal frequencies or different degrees of inequality. Next to the table is the normalized cumulative frequency plot.

For equal barcodes, the cumulative frequency line is diagonal. More unequal sets deviate further from the diagonal. If we calculate the Shannon diversity index for each set and then take the correction for the Shannon index as in Eq. 3, it will suggest ten barcodes for the equal set (as we might expect), nine barcodes for the linear set, and five and four for the remaining two. Although each set consists of ten barcodes, inequality of barcodes reduces their effective set size. From the plot (Fig. 6b), it becomes clear that the information content correction is roughly similar to the number of barcodes in the cumulative 95 % of barcode reads. The advantage of the Shannon index correction compared to the % threshold approach is that it corrects the library size without any additional assumptions. In other words, it is not bound to the taste and preferences of the analyst.

Note that such a cumulative plot also illustrates the degree of inequality. When inverted (plotted cumulatively from the smallest to the biggest barcodes), it is known as the Lorenz curve. Data closer to the diagonal are more equal to each other, and the other way around.

The skewing parameter equals Eq = 1 if barcodes are equal and approaches zero with increased skewing. For the power function, Eq = 0.6.
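The indexes of this section can be computed from a list of read counts as sketched below. This is a minimal illustration with hypothetical function names and made-up read counts (not the data of Fig. 6). It takes the Shannon index as the positive quantity H = −Σ p_i Ln(p_i), so that exp(H) returns N for N equally frequent barcodes.

```python
import math


def shannon_index(read_counts, log=math.log):
    """Shannon diversity of barcode read counts; zero counts are ignored."""
    total = sum(read_counts)
    return -sum((c / total) * log(c / total) for c in read_counts if c > 0)


def effective_barcode_number(read_counts):
    """Corrected barcode number C_b = exp(H_sh), as in Eq. 3."""
    return math.exp(shannon_index(read_counts))


def effective_barcode_number_bits(read_counts):
    """The same correction via the information index in bits, C_b = 2^IB (Eq. 5)."""
    return 2 ** shannon_index(read_counts, log=math.log2)


def equitability(read_counts):
    """Eq = H_sh / Ln(N), where N is the number of barcodes with reads (Eq. 6)."""
    n = sum(1 for c in read_counts if c > 0)
    return shannon_index(read_counts) / math.log(n) if n > 1 else 1.0


equal_set = [100] * 10                                     # hypothetical, perfectly even
skewed_set = [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]       # hypothetical, strongly skewed
for name, counts in [("equal", equal_set), ("skewed", skewed_set)]:
    print(name, round(effective_barcode_number(counts), 2), round(equitability(counts), 2))
```

Both data sets contain ten barcodes, but only the equal set has an effective size of ten; the natural-log and bit-based corrections agree for any input.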

1.6 Distances Between Barcodes

The distance between barcodes is defined as the number of bases that differ between a pair of barcodes. For instance, the 6-mers AAATTG and AACTTC differ by 2 bases. If we take into account substitutions only, this is referred to as the Hamming distance (D_H = 2). As a simple guide for the assessment of randomness, we can rely on the fact that the probability of a certain number of identical bases in two randomly selected barcodes, S_ki,kj, with a length of I bases follows a binomial distribution.

$$ P\left({S}_{ki,kj}\right)=\mathrm{binom}.\mathrm{dist}({S}_{ki,kj},\ \mathrm{length}=I,\ \mathrm{probability}=0.25,\ \mathrm{cumulative}=\mathrm{false}) $$

(7a)

or as an equation (suggested by Tilo Buschmann), in which s is the number of identical bases:

$$ P\left({S}_{ki,kj}=s\right)=\left(\begin{array}{c}I\\ {}s\end{array}\right)\times {\left(\frac{1}{4}\right)}^s\times {\left(\frac{3}{4}\right)}^{I-s} $$

(7b)

The degree of similarity between two randomly chosen barcodes is equal to the difference between the barcode length, Len(BC), and the Hamming distance (D_H), therefore:

$$ P\left(\mathrm{L}\mathrm{e}\mathrm{n}\left(\mathrm{B}\mathrm{C}\right)-{D}_{\mathrm{H}}\right)=P\left({S}_{ki,kj}\right) $$

(8)

The distribution consequently predicts that if you take for instance any random pair of 12-base words or compare any randomly picked word to the full library set, the most likely and frequent distance will be 9–10 bases, as shown in the table below:

Similarity	D_min	P
0	12	0.031676352
1	11	0.126705408
2	10	0.232293248
3	9	0.258103609
4	8	0.193577707
5	7	0.103241444
6	6	0.04014945
7	5	0.011471272
8	4	0.002389848
9	3	0.000354052
10	2	3.54052E-05
11	1	2.14577E-06
12	0	5.96046E-08

Likewise, any randomly picked barcode compared to the full set of barcodes of the same length follows exactly the same distribution. This approach was used for the random simulation in Fig. 5.
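The table above can be reproduced with a short Python sketch. The function name is hypothetical; the calculation is the binomial probability of s identical bases among 12 positions, with a match probability of 0.25 per position.

```python
from math import comb


def similarity_probability(s, length, p_match=0.25):
    """Binomial probability that two random barcodes share exactly s identical bases."""
    return comb(length, s) * p_match ** s * (1 - p_match) ** (length - s)


length = 12
print("Similarity", "Distance", "P", sep="\t")
for s in range(length + 1):
    print(s, length - s, f"{similarity_probability(s, length):.9g}", sep="\t")
```

The distance column is simply the barcode length minus the similarity, as in Eq. 8.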

Example:

Let us take, for example, the raw sequencing data of the mixture of two barcodes used for Fig. 5. There are 468 unique barcode sequences in this file. Suppose we do not know how many true barcodes are in the file. We can only assume that with more true barcodes we will have a better fit to what is expected from random simulation. As explained above, a truly random set of barcodes shows a (transformed) binomial distribution of distances upon random sampling of the set. We can do random sampling of the given set using a custom script with an approximate structure like the following pseudocode:

Import distance package (see below in the protocol)
Read the source file
Repeat 1000 × N times:
    Take random barcode1, barcode2
    If barcode1 not equal to barcode2:
        Measure distance
        Add result to the array of distances
Report all distance frequencies

Data for this particular case are shown in Fig. 5b. It is clear that the raw set of 2 barcodes contains an unusually high number of distances 1–3. Therefore, it disagrees with the hypothesis of the random set. One obvious explanation is that the set is contaminated with false barcodes, likely generated by sequencing errors. Therefore, this set requires filtering by small distances.
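A runnable version of the sampling pseudocode might look like the sketch below. It is not the authors' script: the function names are hypothetical, N is assumed to be the number of unique barcodes in the file, and a plain-Python Hamming distance stands in for the distance package mentioned in the protocol.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs barcodes of equal length")
    return sum(x != y for x, y in zip(a, b))


def sample_distances(barcodes, samples_per_barcode=1000, seed=0):
    """Randomly sample barcode pairs and tally the distances between them."""
    rng = random.Random(seed)
    distances = Counter()
    for _ in range(samples_per_barcode * len(barcodes)):
        barcode1, barcode2 = rng.choice(barcodes), rng.choice(barcodes)
        if barcode1 != barcode2:
            distances[hamming(barcode1, barcode2)] += 1
    return distances


# Toy input: the two 6-mers from the text, which differ at 2 positions.
print(sample_distances(["AAATTG", "AACTTC"], samples_per_barcode=50))
```

The resulting tally, normalized to fractions, can be compared with the binomial expectation for a random set; an excess of small distances points to false barcodes.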

1.7 Resolution of Barcode Detection

Let’s define our barcodes as:

True positive: barcodes that are really present in the library (sequence confirmed) and accepted by the method.

True negative: barcodes detected in the sample that are not present in the library, have an obscure origin (likely PCR errors or the like), and are rejected by the method.

False negative: barcodes that are present in the library but rejected by the method.

False positive: false barcodes that are accepted by the method.

Sensitivity is defined as the ratio of true positive barcodes to the sum of true positives and false negatives, whereas specificity is defined as the ratio of true negatives in the sample to the sum of true negative and false positive barcodes.

$$ \mathrm{Sensitivity}=\frac{\mathrm{true}\_\mathrm{positives}}{\left(\mathrm{true}\_\mathrm{positives}+\mathrm{false}\_\mathrm{negatives}\right)} $$

(9)

The denominator here is all barcodes in the library.

$$ \mathrm{Specificity}=\frac{\mathrm{true}\_\mathrm{negatives}}{\left(\mathrm{true}\_\mathrm{negatives}+\mathrm{false}\_\mathrm{positives}\right)} $$

(10)

The denominator in this formula is the sum of all false reads in the data.

The estimation of sensitivity can be illustrated by filtering of the sequencing data from the sample based on one of the parameters, like using the minimal distance or read frequency. As a reference we can take the barcode library, validated by repetitive resequencing of the batch.

To summarize: the resolution of the method is a measure of its performance regarding sensitivity and specificity. A claim of high resolution is equivalent to the claim of the best possible detection of the barcodes in the sample and the best possible separation of true and false barcodes.
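As an illustration of Eqs. 9 and 10, the sketch below computes sensitivity and specificity from three sets of barcodes. The function name and the toy sequences are hypothetical. It assumes a validated reference library, the set of all barcodes observed in the sample, and the subset accepted by the filtering method.

```python
def sensitivity_specificity(library, observed, accepted):
    """Return (sensitivity, specificity) for one filtering method.

    library  : set of validated (true) barcodes
    observed : set of all barcodes detected in the sample
    accepted : subset of observed that passed the filter
    """
    true_positives = len(accepted & library)
    false_negatives = len(library - accepted)    # library barcodes not accepted
    false_positives = len(accepted - library)
    true_negatives = len((observed - library) - accepted)
    sensitivity = true_positives / (true_positives + false_negatives)
    specificity = true_negatives / (true_negatives + false_positives)
    return sensitivity, specificity


library = {"AAA", "CCC", "GGG", "TTT"}
observed = {"AAA", "CCC", "GGG", "AAT", "CCA", "GGT"}
accepted = {"AAA", "CCC", "AAT"}
print(sensitivity_specificity(library, observed, accepted))
```

In this toy case two of four library barcodes are accepted, and two of three false barcodes are rejected. The function assumes that the library is not empty and that the sample contains at least one false barcode.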

1.8 Protocol for Converting Raw Deep Sequencing Data into Sets of Barcodes

Here, we provide a pipeline for retrieval of barcodes from deep sequencing data.

Step 1. Import raw data.

The raw sequencing data from an Illumina HiSeq2500 machine are a collection of per-cycle bcl base-call files. These bcl files are converted to compressed FASTQ.gz files with bcl2fastq (https://support.illumina.com/downloads/bcl2fastq_conversion_software_184.html). A sample sheet, provided by the researcher/operator, lets the software assign reads to samples, and samples to projects, provided that Illumina multiplex primers were used. After this de-multiplexing, the FASTQ (2) files are available to download through a file transfer protocol (FTP) server (this can be arranged differently, depending on the local organization). For further data processing, a workstation with at least 32 GB memory and 1 TB hard drive space is advised. Note that all the steps can be performed on systems using Linux, Windows, or Mac OS; however, a minimum of 8 GB RAM and 250 GB hard drive space is recommended.

Step 2. Data compression.

Multiplex barcode sequencing data (FASTQ format) range in file size between 1 and 50 GB. Depending on the operating system, this can be a challenge. To ease downstream analysis, we compress the data with the following steps.

1. Remove low-quality reads. Depending on the sequencer, the data will use a particular quality encoding scheme, for example Illumina 1.8+. Using the quality line for each read, one can filter on a predetermined cutoff value. This could be the total read quality value or a minimal quality value for the first x base pairs.

2. Collapse reads. Remove the redundant FASTQ lines 1, 3, and 4 of each record, keeping only the sequence line. Thereafter, collapse identical reads and add their frequency, as in the example below.

AACCTT
AACCTT
AAGGTT

After collapsing data:

AACCTT 2
AAGGTT 1

3. Remove single reads. After collapsing the FASTQ data, all reads with a frequency of one are removed. These single reads hold minimal biological value and needlessly increase the volume of the dataset.

By compressing FASTQ files in this way, one can reduce their size by 99 %.
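The three compression steps can be sketched in Python as follows. This is an illustration, not the authors' pipeline: the function names are hypothetical, the quality encoding is assumed to be Phred+33 (Illumina 1.8+), and the mean-quality cutoff of 30 is an arbitrary example value.

```python
from collections import Counter


def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality line (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def collapse_fastq(lines, min_mean_quality=30, min_count=2):
    """Filter reads by quality, collapse identical reads, and drop single reads."""
    counts = Counter()
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:                      # header, sequence, '+', quality
            _, sequence, _, quality = record
            record = []
            if quality and mean_phred(quality) >= min_mean_quality:
                counts[sequence] += 1
    return {seq: n for seq, n in counts.items() if n >= min_count}


example = [
    "@read1", "AACCTT", "+", "IIIIII",
    "@read2", "AACCTT", "+", "IIIIII",
    "@read3", "AAGGTT", "+", "IIIIII",
    "@read4", "AACCTT", "+", "######",            # low quality, filtered out
]
print(collapse_fastq(example))
```

For a compressed file, the same function can be fed a text handle, for instance one opened with `gzip.open(path, "rt")` from the Python standard library, where `path` is the location of the FASTQ.gz file.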

Step 3. Barcoded samples.

After data compression, all sequencing reads are further de-multiplexed into samples. Note that the initial sequencing data were already split into separate files based on the Illumina indexes. For this further de-multiplexing, we use special primers with variable tags (8–9 nt long) for amplifying the barcoded region of individual samples. Using the unique sample tag and the adjacent part of the primer, a total of 13 nt of every read is tested for exact matching. For every sample marked with a unique primer tag, a (text/csv/tsv) file is generated that lists the unique reads within this sample and their frequencies. In principle, more sophisticated tag-extraction protocols could be used. However, since the tag is positioned at the very beginning of the sequencing read, where sequencing quality is likely the best of the entire read, more sophisticated algorithms will be more time consuming and provide little improvement in the number of reads retrieved per sample.
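Exact matching of the sample tag at the start of each read can be sketched as below. The function name, sample names, and tag sequences are hypothetical (an 8-nt tag followed by 5 nt of primer, 13 nt in total). The input is assumed to be a dictionary of collapsed reads and their frequencies.

```python
def demultiplex(read_counts, sample_tags):
    """Assign collapsed reads to samples by exact match of the leading tag."""
    per_sample = {sample: {} for sample in sample_tags}
    for read, count in read_counts.items():
        for sample, tag in sample_tags.items():
            if read.startswith(tag):
                per_sample[sample][read] = count
                break                      # tags are unique, so stop at the first match
    return per_sample


sample_tags = {
    "sample_A": "ACGTACGTTGCAG",           # hypothetical tag + primer part, 13 nt
    "sample_B": "TTGCAACGTGCAG",
}
read_counts = {
    "ACGTACGTTGCAGAAAA": 4,
    "TTGCAACGTGCAGCCCC": 2,
    "GGGGGGGGGGGGGGGGG": 9,                # no matching tag, not assigned
}
print(demultiplex(read_counts, sample_tags))
```

Reads without an exactly matching tag are left out in this sketch; each per-sample dictionary could then be written to its own text/csv/tsv file.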

Step 4. Converting lists of reads into lists of barcodes.

So far, entire sequencing reads were collected into multiple separate files. Each of the reads supposedly contains a barcode. Depending on the type of barcodes, multiple search strategies are possible: exact match, regular expressions, or motif search (probably along with multiple different options). We routinely use MOODS (Motif Occurrence Detection Suite) [45] in a BioPerl or Python custom script. First, the barcode sequence (“the backbone”) is transformed into a position-weight matrix. Second, MOODS searches for any similarity to the barcode sequence. The threshold for similarity is empirically set to such a level that no more than one barcode is detected in every read. Usually it tolerates multiple mismatches in the barcode backbone. Small indels at the side of the barcode will be tolerated, too. However, indels in the center of the barcode will impair the discovery of the barcode. To circumvent this problem, an algorithm with gapped motif search must be implemented. For this, the GLAM2 tool (or a similar algorithm) might be used [46, 47]. For each sample, a file with unique barcodes and their frequencies is generated.
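Of the search strategies listed above, the regular-expression option is the simplest to illustrate. The sketch below is not the MOODS position-weight-matrix search that we routinely use, and it tolerates no mismatches in the fixed positions. The backbone pattern, flanking sequences, and function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical backbone: flank TAG, then NNN-AT-NNN-CG-NNN, then flank GAT.
BACKBONE = re.compile(r"TAG([ACGT]{3}AT[ACGT]{3}CG[ACGT]{3})GAT")


def extract_barcodes(read_counts, pattern=BACKBONE):
    """Convert {read: frequency} into {barcode: summed frequency}."""
    barcodes = Counter()
    for read, count in read_counts.items():
        matches = pattern.findall(read)
        if len(matches) == 1:              # keep reads with exactly one barcode
            barcodes[matches[0]] += count
    return barcodes


reads = {
    "CCTAGAAAATCCCCGGGGGATTT": 5,
    "TAGAAAATCCCCGGGGGATAA": 3,            # same barcode, different flanks
    "TTTTTTTTTTTTTTTT": 2,                 # no barcode
}
print(extract_barcodes(reads))
```

Identical barcodes found in different reads are merged and their frequencies summed, which already performs part of the compression described in the next step.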

Step 5. Compression of barcode lists.

The barcode list generated in the previous step is largely redundant, containing multiple identical barcodes, which must be merged, together with their frequencies. Moreover, usually all minor barcodes at a distance of 1 base from a major barcode can be eliminated (and their frequencies added to the similar major barcode, see Fig. 4). We routinely use a custom script. Two distances can be employed: Levenshtein (which takes into account both SNPs and indels) or Hamming (SNPs only). In Python and Perl, the Levenshtein package can be used for both types of distances (https://pypi.python.org/pypi/python-Levenshtein/). Several packages are available in R, such as stringdist [48] and DNABarcodes [4]. Both packages are able to measure different kinds of distances, including the ones mentioned above. Note that the routinely used threshold of Hamming distance = 1 is arbitrary and chosen for simplicity of theoretical considerations. Other distances can be used too, provided they are well justified.
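Merging of minor barcodes into a similar major barcode can be sketched as follows. This is a simplified illustration with hypothetical function names, not our custom script: it uses a plain-Python Hamming distance with a threshold of 1, visits barcodes from most to least frequent, and compares every barcode to all accepted ones, which is slow for very large lists.

```python
def within_one_mismatch(a, b):
    """True if two barcodes have equal length and differ at no more than one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def merge_close_barcodes(barcode_counts):
    """Add the frequency of each minor barcode to a major barcode at distance 1."""
    merged = {}
    # Most frequent barcodes first, so that minor variants join a major parent.
    ordered = sorted(barcode_counts.items(), key=lambda item: (-item[1], item[0]))
    for barcode, count in ordered:
        parent = next((m for m in merged if within_one_mismatch(m, barcode)), None)
        if parent is None:
            merged[barcode] = count
        else:
            merged[parent] += count
    return merged


counts = {"AACCTT": 100, "AACCTA": 3, "GGGTTT": 50}
print(merge_close_barcodes(counts))
```

To include indels, the Hamming comparison could be replaced by a Levenshtein distance from one of the packages mentioned above.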

After this step the barcode lists per sample can be used for analysis. Depending on the details of the experiment some other cutoffs can be used. For instance, steps like the 0.5 % biologically meaningful cutoff might be introduced here.

Usually, the end product is made by assembly of the individual samples into a table, representing each particular experiment. We routinely use a custom script for this purpose. From this table the dynamics of each individual barcode can be followed.
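Assembly of the per-sample barcode lists into one table can be sketched like this. The function and sample names are hypothetical, and barcodes absent from a sample are filled in with a frequency of zero, which is an assumption of this sketch.

```python
def assemble_table(samples):
    """Combine {sample: {barcode: frequency}} into rows of one table."""
    names = list(samples)
    barcodes = sorted({bc for counts in samples.values() for bc in counts})
    header = ["barcode"] + names
    rows = [[bc] + [samples[name].get(bc, 0) for name in names] for bc in barcodes]
    return [header] + rows


samples = {
    "week4": {"AACCTT": 10},               # hypothetical time points
    "week8": {"AACCTT": 7, "GGGTTT": 2},
}
for row in assemble_table(samples):
    print(*row, sep="\t")
```

Each row then shows the frequency of one barcode across all samples of the experiment.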


Cite this protocol

Bystrykh, L.V., Belderbos, M.E. (2016). Clonal Analysis of Cells with Cellular Barcoding: When Numbers and Sizes Matter. In: Turksen, K. (eds) Stem Cell Heterogeneity. Methods in Molecular Biology, vol 1516. Humana Press, New York, NY. https://doi.org/10.1007/7651_2016_343


DOI: https://doi.org/10.1007/7651_2016_343
Published: 05 April 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6549-6
Online ISBN: 978-1-4939-6550-2
eBook Packages: Springer Protocols
