Clonal Analysis of Cells with Cellular Barcoding: When Numbers and Sizes Matter

Stem Cell Heterogeneity

Part of the book series: Methods in Molecular Biology (MIMB, volume 1516)

Abstract

Cellular barcoding is a recently rediscovered tool to trace the clonal output of individual cells with genetically distinct and heritable DNA sequences. Each year a few dozen papers are published using the cellular barcoding technique. These publications largely focus on mutually related issues: counting cells capable of clonal proliferation and expansion, monitoring clonal dynamics in time, tracing the origin of differentiated cells, characterizing the differentiation potential of stem cells, and similar topics. Apart from their biological content, claims, and conclusions, these studies show remarkable diversity in technical aspects of the barcoding method and sometimes in major conclusions. Although a diversity of approaches is quite usual in data analysis, divergent handling of barcode data might directly affect experimental results and their biological interpretation. Here, we describe typical challenges and caveats in cellular barcoding publications available so far.

References

  1. Schepers K, Swart E, van Heijst JWJ et al (2008) Dissecting T cell lineage relationships by cellular barcoding. J Exp Med 205:2309–2318. doi:10.1084/jem.20072462

  2. Gerrits A, Dykstra B, Kalmykowa OJ et al (2010) Cellular barcoding tool for clonal analysis in the hematopoietic system. Blood 115:2610–2618. doi:10.1182/blood-2009-06-229757

  3. Lu R, Neff NF, Quake SR, Weissman IL (2011) Tracking single hematopoietic stem cells in vivo using high-throughput sequencing in conjunction with viral genetic barcoding. Nat Biotechnol 29:928–933. doi:10.1038/nbt.1977

  4. Verovskaya E, Broekhuis MJC, Zwart E et al (2013) Heterogeneity of young and aged murine hematopoietic stem cells revealed by quantitative clonal analysis using cellular barcoding. Blood 122:523–532. doi:10.1182/blood-2013-01-481135

  5. Naik SH, Schumacher TN, Perié L (2014) Cellular barcoding: a technical appraisal. Exp Hematol 42:598–608. doi:10.1016/j.exphem.2014.05.003

  6. Cheung AMS, Nguyen LV, Carles A et al (2013) Analysis of the clonal growth and differentiation dynamics of primitive barcoded human cord blood cells in NSG mice. Blood 122:3129–3137. doi:10.1182/blood-2013-06-508432

  7. Brugman MH, Wiekmeijer A-S, van Eggermond M et al (2015) Development of a diverse human T-cell repertoire despite stringent restriction of hematopoietic clonality in the thymus. Proc Natl Acad Sci U S A 112:E6020–E6027. doi:10.1073/pnas.1519118112

  8. Harwell CC, Fuentealba LC, Gonzalez-Cerrillo A et al (2015) Wide dispersion and diversity of clonally related inhibitory interneurons. Neuron 87:999–1007. doi:10.1016/j.neuron.2015.07.030

  9. Golden JA, Cepko CL (1996) Clones in the chick diencephalon contain multiple cell types and siblings are widely dispersed. Development 122:65–78

  10. Golden JA, Fields-Berry SC, Cepko CL (1995) Construction and characterization of a highly complex retroviral library for lineage analysis. Proc Natl Acad Sci U S A 92:5704–5708

  11. Nguyen LV, Cox CL, Eirew P et al (2014) DNA barcoding reveals diverse growth kinetics of human breast tumour subclones in serially passaged xenografts. Nat Commun 5:5871. doi:10.1038/ncomms6871

  12. Porter SN, Baker LC, Mittelman D, Porteus MH (2014) Lentiviral and targeted cellular barcoding reveals ongoing clonal dynamics of cell lines in vitro and in vivo. Genome Biol 15:R75. doi:10.1186/gb-2014-15-5-r75

  13. Bhang HC, Ruddy DA, Krishnamurthy Radhakrishna V et al (2015) Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat Med 21:440–448. doi:10.1038/nm.3841

  14. Wu C, Li B, Lu R et al (2014) Clonal tracking of rhesus macaque hematopoiesis highlights a distinct lineage origin for natural killer cells. Cell Stem Cell 14:486–499. doi:10.1016/j.stem.2014.01.020

  15. Gerlach C, Rohr JC, Perié L et al (2013) Heterogeneous differentiation patterns of individual CD8+ T cells. Science 340:635–639. doi:10.1126/science.1235487

  16. Chapal-Ilani N, Maruvka YE, Spiro A et al (2013) Comparing algorithms that reconstruct cell lineage trees utilizing information on microsatellite mutations. PLoS Comput Biol 9:e1003297. doi:10.1371/journal.pcbi.1003297

  17. Cornils K, Thielecke L, Hüser S et al (2014) Multiplexing clonality: combining RGB marking and genetic barcoding. Nucleic Acids Res 42:e56. doi:10.1093/nar/gku081

  18. Bystrykh LV, de Haan G, Verovskaya E (2014) Barcoded vector libraries and retroviral or lentiviral barcoding of hematopoietic stem cells. Methods Mol Biol 1185:345–360. doi:10.1007/978-1-4939-1133-2_23

  19. Verovskaya E, Broekhuis MJC, Zwart E et al (2014) Asymmetry in skeletal distribution of mouse hematopoietic stem cell clones and their equilibration by mobilizing cytokines. J Exp Med 211:487–497. doi:10.1084/jem.20131804

  20. Kolfschoten IGM, van Leeuwen B, Berns K et al (2005) A genetic screen identifies PITX1 as a suppressor of RAS activity and tumorigenicity. Cell 121:849–858. doi:10.1016/j.cell.2005.04.017

  21. Adams BD, Guo S, Bai H et al (2012) An in vivo functional screen uncovers miR-150-mediated regulation of hematopoietic injury response. Cell Rep 2:1048–1060. doi:10.1016/j.celrep.2012.09.014

  22. Nguyen LV, Pellacani D, Lefort S et al (2015) Barcoding reveals complex clonal dynamics of de novo transformed human mammary cells. Nature. doi:10.1038/nature15742

  23. Akhtar W, de Jong J, Pindyurin AV et al (2013) Chromatin position effects assayed by thousands of reporters integrated in parallel. Cell 154:914–927. doi:10.1016/j.cell.2013.07.018

  24. Colvin GA, Lambert J-F, Abedi M et al (2004) Murine marrow cellularity and the concept of stem cell competition: geographic and quantitative determinants in stem cell biology. Leukemia 18:575–583. doi:10.1038/sj.leu.2403268

  25. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079. doi:10.1093/bioinformatics/btp352

  26. McKenna A, Hanna M, Banks E et al (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303. doi:10.1101/gr.107524.110

  27. Dykstra B, Olthof S, Schreuder J et al (2011) Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells. J Exp Med 208:2691–2703. doi:10.1084/jem.20111490

  28. Bystrykh LV (2012) Generalized DNA barcode design based on Hamming codes. PLoS One 7:e36852. doi:10.1371/journal.pone.0036852

  29. Kim S, Kim N, Presson AP et al (2010) High-throughput, sensitive quantification of repopulating hematopoietic stem cell clones. J Virol 84:11771–11780. doi:10.1128/JVI.01355-10

  30. Kim S, Kim N, Presson AP et al (2014) Dynamics of HSPC repopulation in nonhuman primates revealed by a decade-long clonal-tracking study. Cell Stem Cell 14:473–485. doi:10.1016/j.stem.2013.12.012

  31. Gabriel R, Kutschera I, Bartholomae CC et al (2014) Linear amplification mediated PCR – localization of genetic elements and characterization of unknown flanking DNA. J Vis Exp e51543. doi:10.3791/51543

  32. Xu Q, Schlabach MR, Hannon GJ, Elledge SJ (2009) Design of 240,000 orthogonal 25mer DNA barcode probes. Proc Natl Acad Sci 106:2289–2294. doi:10.1073/pnas.0812506106

  33. Buschmann T, Bystrykh LV (2013) Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinformatics 14:272. doi:10.1186/1471-2105-14-272

  34. Livet J, Weissman TA, Kang H et al (2007) Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system. Nature 450:56–62. doi:10.1038/nature06293

  35. Wei Y, Koulakov AA (2012) An exactly solvable model of random site-specific recombinations. Bull Math Biol 74:2897–2916. doi:10.1007/s11538-012-9788-z

  36. Peikon ID, Gizatullina DI, Zador AM (2014) In vivo generation of DNA sequence diversity for cellular barcoding. Nucleic Acids Res 42:e127. doi:10.1093/nar/gku604

  37. Ally D, Ritland K, Otto SP (2008) Can clone size serve as a proxy for clone age? An exploration using microsatellite divergence in Populus tremuloides. Mol Ecol 17:4897–4911. doi:10.1111/j.1365-294X.2008.03962.x

  38. Mock KE, Rowe CA, Hooten MB et al (2008) Clonal dynamics in western North American aspen (Populus tremuloides). Mol Ecol 17:4827–4844. doi:10.1111/j.1365-294X.2008.03963.x

  39. Naxerova K, Brachtel E, Salk JJ et al (2014) Hypermutable DNA chronicles the evolution of human colon cancer. Proc Natl Acad Sci U S A 111:E1889–E1898. doi:10.1073/pnas.1400179111

  40. Shlush LI, Chapal-Ilani N, Adar R et al (2012) Cell lineage analysis of acute leukemia relapse uncovers the role of replication-rate heterogeneity and microsatellite instability. Blood 120:603–612. doi:10.1182/blood-2011-10-388629

  41. Mullighan CG (2013) Genomic characterization of childhood acute lymphoblastic leukemia. Semin Hematol 50:314–324. doi:10.1053/j.seminhematol.2013.10.001

  42. Ding L, Ley TJ, Larson DE et al (2012) Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481:506–510. doi:10.1038/nature10738

  43. Behjati S, Huch M, van Boxtel R et al (2014) Genome sequencing of normal cells reveals developmental lineages and mutational processes. Nature 513:422–425. doi:10.1038/nature13448

  44. Blundell JR, Levy SF (2014) Beyond genome sequencing: lineage tracking with barcodes to study the dynamics of evolution, infection, and cancer. Genomics 104:417–430. doi:10.1016/j.ygeno.2014.09.005

  45. Korhonen J, Martinmäki P, Pizzi C et al (2009) MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 25:3181–3182. doi:10.1093/bioinformatics/btp554

  46. Bailey TL, Boden M, Buske FA et al (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37:W202–W208. doi:10.1093/nar/gkp335

  47. Bailey TL, Johnson J, Grant CE, Noble WS (2015) The MEME suite. Nucleic Acids Res 43:W39–W49. doi:10.1093/nar/gkv416

  48. van der Loo MPJ (2014) The stringdist package for approximate string matching. R J 6:111–122

Acknowledgements

We kindly acknowledge Erik Zwart, Evgenia Verovskaya, and Tilo Buschmann for their critical review, comments, and suggestions. M. Belderbos was supported by personal grants from the University Medical Center Groningen, the European Research Institute for the Biology of Ageing and the Dutch Cancer Society.

Author information

Corresponding author

Correspondence to Leonid V. Bystrykh.

Appendix: Background and Examples of Data Analysis

1.1 Probability of Unique Barcoding

To tag each target cell with a unique barcode, the number of barcodes needs to be in excess of the number of target cells. Here, we illustrate how to calculate whether the library size is sufficiently large compared to the potential number of barcoded cells. If we ignore the inequality of barcodes present in the library (ideally they should all be equally frequent, yet most of the time the library is skewed; see below for how to deal with this), then the probabilities can be calculated by filling numbers into the binomial distribution model.

If n is the library size and k is the number of barcoded cells, then the approximate probability of having each cell uniquely barcoded is as follows:

$$ P\left(k,n\right) = \mathrm{binom}.\mathrm{dist}(\mathrm{hits}=1,\ \mathrm{trials}=k,\ \mathrm{probability}\ \mathrm{of}\ \mathrm{hit}=1/n,\ \mathrm{cumulative}=\mathrm{true}) $$
(1a)

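Written out as a formula, the cumulative binomial term for at most one hit in k trials with hit probability 1/n is:

$$ P\left(k,n\right)={\displaystyle \sum_{h=0}^1\left(\frac{k!}{h!\left(k-h\right)!}\right){\left(\frac{1}{n}\right)}^h{\left(1-\frac{1}{n}\right)}^{k-h}}={\left(1-\frac{1}{n}\right)}^k+\frac{k}{n}{\left(1-\frac{1}{n}\right)}^{k-1} $$
(1b)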

Example:

Suppose we have a library of 500 barcodes (n) and we aim to barcode 50 stem cells uniquely (k). What is our risk of having two stem cells labeled with the same barcode?

In this case:

$$ 1-P\left(50,500\right) = 1-\mathrm{binom}.\mathrm{dist}(\mathrm{hits}=1,\ \mathrm{trials}=50,\ \mathrm{probability}\ \mathrm{of}\ \mathrm{hit}=1/500,\ \mathrm{cumulative}=\mathrm{true}) = 0.004597188 $$

Therefore, this is a relatively safe experimental design (<0.5 % chance of having a pair of cells identically labeled). Note: this is not the same as calculating the probability of having at least one pair of equal barcodes anywhere in the entire set (a.k.a. the “birthday paradox”). That probability is quite high. Here, we count the probability of having equal barcodes among other paired combinations. This equation will be used for analyzing barcode sequencing data below.
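The example can be checked numerically. The following is a minimal sketch in Python, assuming equally frequent barcodes; the function names are illustrative and not part of the original protocol. It evaluates the cumulative binomial term of Eq. 1a and, for comparison, the birthday-type probability mentioned in the note.

```python
from math import comb

def p_at_most_one(k, n):
    """Cumulative binomial P(hits <= 1) for k trials with hit probability 1/n (Eq. 1a)."""
    p = 1.0 / n
    return sum(comb(k, h) * p**h * (1 - p)**(k - h) for h in (0, 1))

def p_any_shared_barcode(k, n):
    """Birthday-type probability that at least two of k cells carry the same barcode."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct

# Library of 500 barcodes (n), 50 barcoded stem cells (k)
risk = 1 - p_at_most_one(50, 500)
print(f"1 - P(50, 500)            = {risk:.9f}")
print(f"birthday-type probability = {p_any_shared_barcode(50, 500):.3f}")
```

The first value reproduces the risk quoted in the example; the second illustrates why the birthday-type probability is described as quite high.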

1.2 How to Estimate the Approximate Number of Barcodes from Reading a Crude PCR Barcode Sanger Chromatogram

The number of barcodes in a sample can be approximated by analysis of the Sanger chromatogram (Fig. 2). This can be done with prediction models in different ways. The easiest way is to use a random generator and to count how often a single, double, triple, or quadruple peak appears at a variable position of the barcode, depending on the number of barcodes in the mixture (n). Pseudocode would look like:

  • BarcodeMix = n

  • MaxIterations = 1000

  • For i in range(0, MaxIterations):

    • For j in range(0, n):

      • Take random base from (A, C, G, T)

    • Count whether the n drawn bases give a single, double, triple, or quadruple peak

  • Collect and report all frequencies
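The pseudocode above can be sketched in Python as follows. This is an illustration under the same simplifying assumptions (equally abundant barcodes, each base equally likely at a variable position); the function name and the seed argument are our own additions.

```python
import random
from collections import Counter

def simulate_peak_counts(n_barcodes, iterations=1000, seed=None):
    """Count how often a variable position shows 1, 2, 3, or 4 different bases."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(iterations):
        # Draw one base per barcode in the mixture and keep the distinct ones
        bases = {rng.choice("ACGT") for _ in range(n_barcodes)}
        counts[len(bases)] += 1
    # Order: single, double, triple, quadruple peak
    return [counts[k] for k in (1, 2, 3, 4)]

for n in range(1, 11):
    print(n, simulate_peak_counts(n, seed=n))
```

Because the draws are random, the counts differ slightly between runs unless a seed is fixed.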

An example of the counts by random is as follows:

 

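Frequencies of base calls, N (per 1000 iterations)

| BC mix | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 1000 | 0 | 0 | 0 |
| 2 | 240 | 760 | 0 | 0 |
| 3 | 65 | 557 | 378 | 0 |
| 4 | 17 | 320 | 572 | 91 |
| 5 | 2 | 177 | 605 | 216 |
| 6 | 1 | 70 | 568 | 361 |
| 7 | 0 | 53 | 428 | 519 |
| 8 | 0 | 36 | 350 | 614 |
| 9 | 0 | 13 | 303 | 684 |
| 10 | 0 | 8 | 230 | 762 |

BC mix stands for the number of barcodes in the mixture (first column). The remaining columns show the frequency of each kind of base call (1 = single peak, 2 = double peak, etc.).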

Another, more accurate approach is to count all possible combinations of base calls depending on number of trials, as follows:

 

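Frequencies of base calls, N (number of possible combinations)

| BC mix | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 4 | 0 | 0 | 0 |
| 2 | 4 | 12 | 0 | 0 |
| 3 | 4 | 36 | 24 | 0 |
| 4 | 4 | 84 | 144 | 24 |
| 5 | 4 | 180 | 600 | 240 |
| 6 | 4 | 372 | 2160 | 1560 |
| 7 | 4 | 756 | 7224 | 8400 |
| 8 | 4 | 1524 | 23184 | 40824 |
| 9 | 4 | 3060 | 72600 | 186480 |
| 10 | 4 | 6132 | 223920 | 818520 |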

Both approaches can be programmed and they give equivalent results.
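The exact counting can be sketched as follows. This is one possible implementation, not necessarily the one used by the authors: the number of ways in which n barcodes show exactly j different bases at a position is the number of ways to choose j of the 4 bases, multiplied by the number of assignments that use all j chosen bases (obtained here by inclusion-exclusion).

```python
from math import comb

def onto_assignments(n, j):
    """Number of ways n barcodes can use all of j given bases (inclusion-exclusion)."""
    return sum((-1) ** i * comb(j, i) * (j - i) ** n for i in range(j + 1))

def base_call_counts(n, alphabet=4):
    """Combinations giving exactly 1, 2, 3, or 4 different bases for n barcodes."""
    return [comb(alphabet, j) * onto_assignments(n, j) for j in range(1, alphabet + 1)]

for n in range(1, 11):
    counts = base_call_counts(n)
    fractions = [round(c / 4**n, 3) for c in counts]
    print(n, counts, fractions)
```

Dividing each count by 4^n gives the expected fraction of each peak type, which can be compared directly with the simulated frequencies per 1000 iterations.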

1.3 Major Properties of Barcode Libraries

There are three major parameters in barcoding to check and analyze: (1) the total number of barcodes (in the library and in experimental data); (2) the randomness of barcodes regarding their sequence; and (3) the evenness of barcodes in the sample (in the library and in experimental data). The effective library size is directly affected by the randomness of synthesis: the less randomly it is made, the fewer barcodes will be in the library. Evenness (uniformity) of barcodes also affects the effective size of the library: with less evenness the effective size decreases. Finally, randomness of sequences is connected to the observed and expected distances between barcodes. Below, we present the minimal computational background to estimate these parameters.

1.4 Randomness

Poisson and binomial distributions are good at predicting the uniqueness of barcoding by random sequences, and at estimating maximal sizes of libraries and clones. Whenever possible, it is important to report how random the chemical synthesis of the barcodes was. Sanger sequencing of E. coli clones is preferable because it is free from sequencing errors. Deep sequencing data are acceptable if sequencing noise is convincingly (adequately) removed.

1.5 Skewing

Skewing is a widely ignored aspect of the effective size of the vector library. It can be approached from the perspective of information theory. In an ideal library of evenly distributed frequencies of items, we have the maximal Shannon diversity index, H_sh. If items are not equally frequent, then we can calculate the diversity index in two different ways. One is the Shannon diversity index:

$$ {H}_{\mathrm{sh}}=-{\displaystyle \sum {p}_i\times \mathrm{L}\mathrm{n}\left({p}_i\right)} $$
(2)

where p_i is the probability value for every barcode on the list. If the barcode frequencies are equal, then p_i = 1/N, where N is the number of all barcodes. If the frequencies are not equal, then p_i is the number of reads for the i-th barcode divided by the total reads for all barcodes. The barcode number corrected by these parameters (C_b) will be:

$$ {C}_{\mathrm{b}}= \exp \left({H}_{\mathrm{sh}}\right) $$
(3)

This information content will, however, decrease with increased skewing. Barcodes with very small frequencies contribute progressively less to the whole library. This can be estimated directly by measuring the indexes mentioned above. An example is given in Fig. 6.

An equivalent correction can be made using the information index (in bits of information):

$$ \mathrm{I}\mathrm{B}=-{\displaystyle \sum {p}_i\times {\mathrm{Log}}_2\left({p}_i\right)} $$
(4)

The correction is then:

$$ {C}_{\mathrm{b}}={2}^{\mathrm{IB}} $$
(5)

The result is identical to the Shannon correction in Eq. 3.

For skewing, the equitability index can be used, which is the Shannon index divided by the natural logarithm of the total number of reported barcodes.

$$ \mathrm{E}\mathrm{q}={H}_{\mathrm{Sh}}/\mathrm{L}\mathrm{n}(N) $$
(6)

Consider the example presented in Fig. 6. The table displays four data sets of ten barcodes, with either equal frequencies or different degrees of inequality. Next to the table is the normalized cumulative frequency plot.

For equal barcodes, the cumulative frequency line is diagonal. More unequal sets deviate further from the diagonal. If we calculate the Shannon diversity index for each set and then take the correction for the Shannon index as in Eq. 3, it will suggest ten barcodes for the equal set (as we might expect), nine barcodes for the linear set, and five and four for the remaining two. Although each set consists of ten barcodes, inequality of barcodes reduces their effective set size. From the plot (Fig. 6b), it becomes clear that the information content correction is roughly similar to the number of barcodes in the cumulative 95 % of barcode reads. The advantage of the Shannon index correction compared to the % threshold approach is that it corrects for the library size without any additional assumptions. In other words, it is not bound to the taste and preferences of the analyst.

Note that such a cumulative plot also illustrates the degree of inequality. When inverted (that is, plotted cumulatively from the smallest to the biggest barcodes), it is known as the Lorenz curve. Data closer to the diagonal are more equal to each other, and the other way around.

The skewing parameter equals Eq = 1 if barcodes are equal and approaches zero with increased skewing. For the power function, Eq = 0.6.
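The indexes of this section can be computed in a few lines. The sketch below is illustrative: the function name and the two example data sets (ten equal barcodes, and ten barcodes with linearly increasing read counts) are our own assumptions and are not the data of Fig. 6.

```python
from math import exp, log, log2

def diversity_indexes(read_counts):
    """Shannon index, corrected barcode number, and equitability from read counts."""
    total = sum(read_counts)
    probs = [c / total for c in read_counts if c > 0]
    h_sh = -sum(p * log(p) for p in probs)   # Shannon diversity index (Eq. 2)
    ib = -sum(p * log2(p) for p in probs)    # information index in bits (Eq. 4)
    n = len(probs)
    eq = h_sh / log(n) if n > 1 else 1.0     # equitability (Eq. 6)
    return {"H_sh": h_sh, "C_b": exp(h_sh), "C_b_bits": 2 ** ib, "Eq": eq}

equal_set = [100] * 10              # hypothetical: ten equally frequent barcodes
linear_set = list(range(1, 11))     # hypothetical: read counts 1, 2, ..., 10

for name, data in (("equal", equal_set), ("linear", linear_set)):
    res = diversity_indexes(data)
    print(name, {key: round(value, 3) for key, value in res.items()})
```

The values under C_b and C_b_bits agree, which illustrates that the natural-logarithm and bit-based corrections give the same effective number of barcodes.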

1.6 Distances Between Barcodes

The distance between barcodes is defined as the number of different bases between a pair of barcodes. For instance, the 6-mers AAATTG and AACTTC differ by 2 bases. If we take into account substitutions only, this is referred to as the Hamming distance (D_H = 2). As a simple guide for the assessment of randomness, we can rely on the fact that the probability of a certain number of identical bases in two randomly selected barcodes, S_ki,kj, with a length of l bases follows a binomial distribution.

$$ P\left({S}_{ki,kj}\right)=\mathrm{binom}.\mathrm{dist}({S}_{ki,kj},\ \mathrm{length}=l,\ \mathrm{probability}=0.25,\ \mathrm{cumulative}=\mathrm{false}) $$
(7a)

or as an equation (suggested by Tilo Buschmann), where s is the number of identical positions between the two barcodes:

$$ P\left({S}_{ki,kj}=s\right)=\binom{l}{s}\times {\left(\frac{1}{4}\right)}^s\times {\left(\frac{3}{4}\right)}^{l-s} $$
(7b)

The degree of similarity between two randomly chosen barcodes is equal to the difference between the barcode length, Len(BC), and the Hamming distance (D_H), therefore:

$$ P\left(\mathrm{L}\mathrm{e}\mathrm{n}\left(\mathrm{B}\mathrm{C}\right)-{D}_{\mathrm{H}}\right)=P\left({S}_{ki,kj}\right) $$
(8)

The distribution consequently predicts that if you take for instance any random pair of 12-base words or compare any randomly picked word to the full library set, the most likely and frequent distance will be 9–10 bases, as shown in the table below:

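| Similarity | D min | P |
|---|---|---|
| 0 | 12 | 0.031676352 |
| 1 | 11 | 0.126705408 |
| 2 | 10 | 0.232293248 |
| 3 | 9 | 0.258103609 |
| 4 | 8 | 0.193577707 |
| 5 | 7 | 0.103241444 |
| 6 | 6 | 0.04014945 |
| 7 | 5 | 0.011471272 |
| 8 | 4 | 0.002389848 |
| 9 | 3 | 0.000354052 |
| 10 | 2 | 3.54052E-05 |
| 11 | 1 | 2.14577E-06 |
| 12 | 0 | 5.96046E-08 |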

Likewise, any randomly picked barcode compared to the full set of barcodes of the same length follows exactly the same distribution. This approach was used for the random simulation in Fig. 5.
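The probabilities for 12-base barcodes can be reproduced with a short script. This is a minimal sketch, assuming that each of the four bases is equally likely at every position; the function name is illustrative.

```python
from math import comb

def similarity_pmf(length, p_match=0.25):
    """P(s identical positions) for two random barcodes of the given length."""
    return [
        comb(length, s) * p_match**s * (1 - p_match) ** (length - s)
        for s in range(length + 1)
    ]

length = 12
pmf = similarity_pmf(length)
print("similarity  distance  probability")
for s, p in enumerate(pmf):
    # Hamming distance = barcode length minus number of identical positions
    print(f"{s:10d}  {length - s:8d}  {p:.9g}")
```

The most probable outcome is three identical positions, that is, a Hamming distance of 9.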

Example:

Let’s take for example the raw sequencing data of the mixture of two barcodes used for Fig. 5. There are 468 unique barcodes in this file. Suppose we do not know how many true barcodes are in the file. We can only assume that with more true barcodes we will have a better fit to what is expected by random simulation. As explained above, a truly random set of barcodes shows a (transformed) binomial distribution of distances upon random sampling of the set. We can do random sampling of the given set using a custom script of an approximate structure like the following pseudo code:

  • Import distance package (see below in the protocol)

  • Read the source file

  • Repeat 1000 × N times:

    • Take random barcode1, barcode2

    • If barcode1 not equal to barcode2:

      • Measure distance

      • Add result to the array of distances

  • Report the frequencies of all distances
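A possible Python version of this pseudocode is sketched below. To keep it self-contained, the Hamming distance is computed with a small helper instead of an external distance package, and the barcodes are passed as a list instead of being read from a file; the function names and the toy input are assumptions made for illustration only.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires barcodes of equal length")
    return sum(x != y for x, y in zip(a, b))

def sample_distances(barcodes, n_pairs, seed=None):
    """Histogram of distances between randomly drawn, non-identical barcode pairs."""
    rng = random.Random(seed)
    histogram = Counter()
    for _ in range(n_pairs):
        b1, b2 = rng.choice(barcodes), rng.choice(barcodes)
        if b1 != b2:
            histogram[hamming(b1, b2)] += 1
    return dict(sorted(histogram.items()))

# Toy input: two barcodes plus one single-base variant of the first one
toy_barcodes = ["AAATTG", "AACTTC", "AAATTC"]
print(sample_distances(toy_barcodes, n_pairs=1000, seed=1))
```

For real data, the resulting histogram can be compared with the binomial expectation for random barcodes of the same length.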

Data for this particular case are shown in Fig. 5b. It is clear that the raw set of 2 barcodes contains an unusually high number of distances 1–3. Therefore, it disagrees with the hypothesis of the random set. One obvious explanation is that the set is contaminated with false barcodes, likely generated by sequencing errors. Therefore, this set requires filtering by small distances.

1.7 Resolution of Barcode Detection

Let’s define our barcodes as:

True positive: barcodes that are really present in the library (sequence confirmed) and accepted by the method.

True negative: barcodes detected in the sample that are not present in the library, have an obscure origin (likely PCR errors or the like), and are rejected by the method.

False negative: barcodes that are present in the library but rejected by the method.

False positive: false barcodes that are nevertheless accepted by the method.

Sensitivity is defined as the ratio of true positive barcodes to the sum of true positives and false negatives, whereas specificity is defined as the ratio of true negative barcodes to the sum of true negatives and false positives.

$$ \mathrm{Sensitivity}=\frac{\mathrm{true}\_\mathrm{positives}}{\left(\mathrm{true}\_\mathrm{positives}+\mathrm{false}\_\mathrm{negatives}\right)} $$
(9)

The denominator here is all barcodes in the library

$$ \mathrm{Specificity}=\frac{\mathrm{true}\_\mathrm{negatives}}{\left(\mathrm{true}\_\mathrm{negatives}+\mathrm{false}\_\mathrm{positives}\right)} $$
(10)

The denominator in this formula is the total of all false barcodes in the data.

The estimation of sensitivity can be illustrated by filtering of the sequencing data from the sample based on one of the parameters, like using the minimal distance or read frequency. As a reference we can take the barcode library, validated by repetitive resequencing of the batch.

To summarize: the resolution of the method is a measure of its performance regarding sensitivity and specificity. A claim of high resolution is equivalent to the claim of the best possible detection of the barcodes in the sample and the best possible separation of true and false barcodes.
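Eqs. 9 and 10 can be evaluated with simple set operations once a validated reference library is available. The sketch below is illustrative: the function name and the toy barcode sets are hypothetical, and it assumes that every accepted barcode is also among the detected ones.

```python
def resolution(library, detected, accepted):
    """Sensitivity and specificity of a barcode filter against a validated library."""
    library, detected, accepted = set(library), set(detected), set(accepted)
    true_pos = len(accepted & library)               # real barcodes that were accepted
    false_neg = len(library - accepted)              # real barcodes that were rejected
    false_pos = len(accepted - library)              # false barcodes that were accepted
    true_neg = len((detected - library) - accepted)  # false barcodes that were rejected
    sens_den = true_pos + false_neg                  # all barcodes in the library
    spec_den = true_neg + false_pos                  # all false barcodes in the data
    sensitivity = true_pos / sens_den if sens_den else float("nan")
    specificity = true_neg / spec_den if spec_den else float("nan")
    return sensitivity, specificity

# Hypothetical toy data
library = {"AAAA", "CCCC", "GGGG", "TTTT"}
detected = {"AAAA", "CCCC", "GGGG", "AAAC", "CCCG", "GGGT"}
accepted = {"AAAA", "CCCC", "AAAC"}
print(resolution(library, detected, accepted))
```

Repeating the calculation for a range of filter thresholds (for example minimal distance or read frequency) shows how sensitivity and specificity trade off against each other.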

1.8 Protocol for Converting Raw Deep Sequencing Data into Sets of Barcodes

Here, we provide a pipeline for retrieval of barcodes from deep sequencing data.

Step 1. Import raw data.

The raw sequencing data from an Illumina HiSeq2500 machine are a collection of per-cycle bcl base-call files. These bcl files are converted to compressed FASTQ.gz files with bcl2fastq (https://support.illumina.com/downloads/bcl2fastq_conversion_software_184.html). A sample sheet, provided by the researcher/operator, lets the software assign reads to samples, and samples to projects, provided that Illumina multiplex primers were used. After this de-multiplexing, the FASTQ (2) files are available to download through a file transfer protocol (FTP) server (this can be arranged differently, depending on the local organization). For further data processing, a workstation with at least 32 GB memory and 1 TB hard drive space is advised. All steps can be performed on systems running Linux, Windows, or Mac OS; the minimum requirement is 8 GB RAM and 250 GB hard drive space.

Step 2. Data compression.

Multiplex barcode sequencing data (FASTQ format) range in file size between 1 and 50 GB. Depending on the operating system, this can be a challenge. To ease downstream analysis, we compress the data with the following steps.

  1. Remove low quality reads. Depending on the sequencer, the data use a particular quality encoding scheme, for example Illumina 1.8+. Using the quality line of each read, one can filter on a predetermined cutoff value. This could be the total read quality value or a minimal quality value for the first x base pairs.

  2. Collapse reads. Remove the redundant FASTQ lines 1, 3, and 4 of each record. Thereafter, collapse identical reads and add their frequency, see the example below.

    AACCTT

    AACCTT

    AAGGTT

    After collapsing data:

    AACCTT 2

    AAGGTT 1

  3. Remove single reads. After collapsing the FASTQ data, all reads with a frequency of one are removed. These singles hold minimal biological value and needlessly increase the volume of the dataset.

By compressing FASTQ files in this way, one can reduce their size by 99 %.
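The three compression steps can be sketched in Python as follows. This is a simplified illustration, not the authors' script: it assumes four-line FASTQ records, Phred+33 quality encoding (as in Illumina 1.8+), and a minimal-quality filter on the first bases of each read; the cutoff values and the file name in the comment are hypothetical.

```python
import gzip
from collections import Counter

def read_fastq(lines):
    """Yield (sequence, quality) pairs from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue                 # skip stray blank lines between records
        seq = next(it).strip()
        next(it)                     # '+' separator line, not needed
        qual = next(it).strip()
        yield seq, qual

def passes_quality(qual, min_q=20, first_n=30, offset=33):
    """True if the first `first_n` bases all reach the minimal Phred score."""
    scores = [ord(ch) - offset for ch in qual[:first_n]]
    return bool(scores) and min(scores) >= min_q

def compress_reads(lines, min_q=20, first_n=30):
    """Filter on quality, collapse identical reads, and drop reads seen only once."""
    counts = Counter(
        seq for seq, qual in read_fastq(lines) if passes_quality(qual, min_q, first_n)
    )
    return {seq: n for seq, n in counts.items() if n > 1}

# Usage with a gzipped file (hypothetical file name):
# with gzip.open("sample.fastq.gz", "rt") as handle:
#     collapsed = compress_reads(handle)

demo = [
    "@r1", "AACCTT", "+", "IIIIII",
    "@r2", "AACCTT", "+", "IIIIII",
    "@r3", "AAGGTT", "+", "IIIIII",
    "@r4", "AACCTT", "+", "##IIII",   # low-quality start, filtered out
]
print(compress_reads(demo))
```

Streaming the file record by record keeps memory use proportional to the number of unique reads, not to the size of the FASTQ file.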

Step 3. Barcoded samples.

After data compression, all sequencing reads are further de-multiplexed into samples. Note that the initial sequencing data are already split into separate files based on the Illumina indexes. For this second level of de-multiplexing, we use special primers with variable tags (8–9 nt long) for amplifying the barcoded region of individual samples. The unique sample tag plus the adjacent part of the primer, 13 nt in total at the start of every read, is tested for exact matching. For every sample marked with a unique primer tag, a file (text/csv/tsv) is generated that lists the unique reads within this sample and their frequencies. In principle, more sophisticated tag-extraction protocols could be used. However, since the tag is positioned at the very beginning of the sequencing read, where sequencing quality is likely the best of the entire read, more sophisticated algorithms will be more time consuming and provide little improvement in the retrieved numbers of reads per sample.

Step 4. Converting lists of reads into lists of barcodes.
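Exact matching of the first 13 nt can be sketched as a dictionary lookup. The sample names and prefix sequences below are hypothetical placeholders, and the input is assumed to be a dictionary of collapsed reads with their frequencies.

```python
from collections import defaultdict

def demultiplex(collapsed_reads, sample_prefixes, match_len=13):
    """Assign collapsed reads to samples by exact match of the first `match_len` nt."""
    lookup = {prefix[:match_len]: sample for sample, prefix in sample_prefixes.items()}
    per_sample = defaultdict(dict)
    for read, freq in collapsed_reads.items():
        sample = lookup.get(read[:match_len])
        if sample is not None:       # reads without a known tag are discarded
            per_sample[sample][read] = freq
    return dict(per_sample)

# Hypothetical sample tags (8 nt) followed by the first 5 nt of the primer
sample_prefixes = {
    "sample_A": "ACGTACGT" + "GGCAC",
    "sample_B": "TGCATGCA" + "GGCAC",
}
collapsed_reads = {
    "ACGTACGTGGCACTTTTAAAACCCC": 120,
    "TGCATGCAGGCACTTTTGGGGCCCC": 45,
    "ACGTACGAGGCACTTTTAAAACCCC": 3,  # one mismatch in the tag, not assigned
}
for sample, reads in demultiplex(collapsed_reads, sample_prefixes).items():
    print(sample, reads)
```

Each per-sample dictionary can then be written to a text, csv, or tsv file.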

So far, entire sequencing reads were collected into multiple separate files. Each of the reads supposedly contains a barcode. Depending on the type of barcodes, multiple search strategies are possible: exact match, regular expressions, or motif search (probably along with multiple other options). We routinely use MOODS (Motif Occurrence Detection Suite [45]) in a custom BioPerl or Python script. First, the barcode sequence (“the backbone”) is transformed into a position-weight matrix. Second, MOODS searches for any similarity to the barcode sequence. The threshold for similarity is established empirically at such a level that no more than one barcode is detected in every read. Usually it tolerates multiple mismatches in the barcode backbone. Small indels at the side of the barcode will be tolerated, too. However, indels in the center of the barcode will impair the discovery of the barcode. To circumvent this problem, an algorithm with gapped motif search must be implemented. For this, the GLAM2 tool (or a similar algorithm) might be used [46, 47]. For each sample, a file with unique barcodes and their frequency is generated.
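As an illustration of the simplest of the strategies listed above, the sketch below uses a regular expression. It is not the MOODS-based search described in the text and, unlike a position-weight matrix, it tolerates no mismatches in the flanks. The barcode design (12 random bases between two fixed flanking sequences) is a hypothetical example.

```python
import re
from collections import Counter

# Hypothetical backbone: fixed flanks around 12 variable bases
BACKBONE = re.compile(r"GGTACC([ACGT]{12})GAATTC")

def extract_barcodes(read_counts, pattern=BACKBONE):
    """Extract the variable barcode part from each read and sum the read frequencies."""
    barcodes = Counter()
    for read, freq in read_counts.items():
        hits = pattern.findall(read)
        if len(hits) == 1:           # keep reads with exactly one barcode
            barcodes[hits[0]] += freq
    return dict(barcodes)

reads = {
    "TTGGTACCAAAACCCCGGGGGAATTCAA": 5,
    "GGTACCAAAACCCCGGGGGAATTC": 2,
    "TTTTTTTTTTTTTTTTTTTT": 3,       # no barcode backbone present
}
print(extract_barcodes(reads))
```

Reads in which the pattern occurs zero times or more than once are skipped, in the same spirit as the rule that no more than one barcode is detected per read.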

Step 5. Compression of barcode lists.

The barcode list generated in the previous step is largely redundant: it contains multiple identical barcodes, which must be merged and their frequencies summed. Moreover, usually all minor barcodes at a minimal distance of 1 base from a major barcode can be eliminated (and their frequencies added to the similar major barcode, see Fig. 4). We routinely use a custom script. Two distances can be employed, Levenshtein (which takes into account both substitutions and indels) or Hamming (substitutions only). In Python and Perl, the Levenshtein package can be used for both types of distances (https://pypi.python.org/pypi/python-Levenshtein/). Several packages are available in R, such as stringdist [48] and DNABarcodes [4]. Both packages are able to measure different kinds of distances, including the ones mentioned above. Note that the routinely used threshold of Hamming distance = 1 is arbitrary and chosen for simplicity of theoretical considerations. Other distances can be used too, provided they are well justified.
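One simple way to implement this merging is a greedy pass from the most frequent to the least frequent barcode. The sketch below is an illustration and not the custom script mentioned in the text; it uses a plain-Python Hamming distance so that it runs without external packages, and the example counts are hypothetical.

```python
def hamming(a, b):
    """Number of substitutions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def merge_minor_barcodes(barcode_counts, max_dist=1):
    """Add each minor barcode to the first major barcode within `max_dist` substitutions."""
    merged = {}
    # Process from most to least frequent; ties are broken alphabetically
    ordered = sorted(barcode_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for barcode, freq in ordered:
        parent = next(
            (m for m in merged
             if len(m) == len(barcode) and hamming(m, barcode) <= max_dist),
            None,
        )
        if parent is None:
            merged[barcode] = freq   # new major barcode
        else:
            merged[parent] += freq   # likely sequencing error of `parent`
    return merged

counts = {"AAAACCCC": 100, "AAAACCCG": 3, "TTTTGGGG": 50, "TTTTGGGA": 1}
print(merge_minor_barcodes(counts))
```

The comparison against all accepted barcodes is quadratic in the worst case, which is usually acceptable after the earlier compression steps; replacing `hamming` with a Levenshtein distance would also capture indels.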

After this step the barcode lists per sample can be used for analysis. Depending on the details of the experiment some other cutoffs can be used. For instance, steps like the 0.5 % biologically meaningful cutoff might be introduced here.

Usually, the end product is made by assembly of the individual samples into a table, representing each particular experiment. We routinely use a custom script for this purpose. From this table the dynamics of each individual barcode can be followed.
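The final assembly, together with an optional frequency cutoff, can be sketched as follows. The sample names, barcode counts, and the way the 0.5 % cutoff is applied (per sample, as a fraction of all barcode reads in that sample) are assumptions made for illustration.

```python
def apply_cutoff(barcode_counts, min_fraction=0.005):
    """Keep barcodes that contribute at least `min_fraction` of the reads in a sample."""
    total = sum(barcode_counts.values())
    return {bc: n for bc, n in barcode_counts.items() if total and n / total >= min_fraction}

def assemble_table(samples):
    """Combine per-sample barcode counts into one table (list of rows)."""
    names = list(samples)
    barcodes = sorted({bc for counts in samples.values() for bc in counts})
    rows = [["barcode"] + names]
    for bc in barcodes:
        rows.append([bc] + [samples[name].get(bc, 0) for name in names])
    return rows

# Hypothetical per-sample barcode lists
samples = {
    "week_04": {"AAAACCCC": 900, "TTTTGGGG": 98, "CCCCAAAA": 2},
    "week_12": {"AAAACCCC": 400, "TTTTGGGG": 600},
}
filtered = {name: apply_cutoff(counts) for name, counts in samples.items()}
for row in assemble_table(filtered):
    print("\t".join(str(cell) for cell in row))
```

Each row of the table then shows one barcode across all samples of the experiment.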

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Bystrykh, L.V., Belderbos, M.E. (2016). Clonal Analysis of Cells with Cellular Barcoding: When Numbers and Sizes Matter. In: Turksen, K. (eds) Stem Cell Heterogeneity. Methods in Molecular Biology, vol 1516. Humana Press, New York, NY. https://doi.org/10.1007/7651_2016_343

  • DOI: https://doi.org/10.1007/7651_2016_343

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-6549-6

  • Online ISBN: 978-1-4939-6550-2

  • eBook Packages: Springer Protocols
