Background

In a directed evolution experiment, library diversity is a key factor for success. The diversity of a library is the number of unique variants it contains; because many variants are represented multiple times, library diversity is distinct from library size. The library needs to be sufficiently diverse to contain improved variants (‘winners’), but not so rich in mutations that it is swamped by non-functional protein. Consequently, to best identify winning variants in an error-prone PCR (epPCR) library, the mutations need to be at an optimal frequency (an average of 5 to 10 mutations/kb) [1] and not overly biased towards certain bases; preferential mutation of AT bases has been reported with manganese mutagenesis [2].

One way to ensure an effective diversity of a mutagenised plasmid pool is to assess the diversity in a small test library. Despite its cost in time and reagents [3], this step is strongly recommended as it avoids investing heavily in a potentially suboptimal library further downstream [1]. A test library entails plating an E. coli culture transformed with the pool of variant plasmids for growth under non-selective conditions. The colony abundance is assessed and about 10–20 clones are sequenced in order to calculate the overall mutation frequency, the error rate of the polymerase, the individual mutation frequencies and the biases associated with them [1]. When extensive sampling is done, a highly accurate picture of the diversity of the library is obtained, including the identification of a variety of sequence-specific mutation hotspots and coldspots [4]; a rough sampling is, however, generally sufficient to estimate the main contributors to the mutational spectrum of the library. One major issue with such limited sampling is that the calculated values carry a sampling error that is usually left unquantified, so the numbers may not be accurate.

In order to simplify and add statistical depth to these calculations, a new online calculator is presented here. The program, Mutanalyst (available at www.mutanalyst.com; portmanteau of mutation and analyst), uses the gene sequence and the list of mutations found to calculate the mutation frequency per sequence, the specific mutational frequencies (normalised by nucleotide distribution) and various bias indicators tabulated and graphed (vide infra for details). A novel feature of this tool is that it estimates the error associated with the values found. This implementation is driven by the need to determine the accuracy of the calculations in light of the limited sampling of the test library. This benefits the user greatly as it gives an indication of the reliability of each parameter, and thus a more informed perspective in determining the suitability of the library.

Implementation

The site is written to be compatible with modern browsers and Internet Explorer 8 or above with JavaScript enabled (default setting in all browsers). The site is a series of three static HTML files powered by client-side JavaScript. This was done in order to allow transparency in the calculations involved (explained in the site) and to allow a user to use the site offline (available at github.com/matteoferla). The operations of the site are explained in detail in the Appendix.

The program has five parts with two different starting points. One starting point allows the user to input the wild type sequence and the list of mutations in the sampled sequences (written either as 239A > T, the standard notation, or as A239T, nucleotides written with the protein notation — further details can be found on the webpage), to generate the analysis of the number of mutations per sequence.
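As an illustration of the two input notations, the following is a minimal sketch of how such entries could be parsed and checked against the wild type sequence. It is not Mutanalyst's own parser: the function names (`parse_mutation`, `matches_wild_type`) are hypothetical, and the assumption that positions are 1-indexed is made here for the example only.

```python
import re

# Hypothetical patterns for the two notations described in the text.
# '239A>T' (position first) and 'A239T' (original base first).
_POSITION_FIRST = re.compile(r"^(\d+)\s*([ACGT])\s*>\s*([ACGT])$", re.IGNORECASE)
_BASE_FIRST = re.compile(r"^([ACGT])\s*(\d+)\s*([ACGT])$", re.IGNORECASE)


def parse_mutation(text):
    """Return (position, from_base, to_base) for '239A>T' or 'A239T'."""
    token = text.strip()
    match = _POSITION_FIRST.match(token)
    if match:
        position, ref, alt = match.group(1), match.group(2), match.group(3)
    else:
        match = _BASE_FIRST.match(token)
        if not match:
            raise ValueError(f"unrecognised mutation: {text!r}")
        ref, position, alt = match.group(1), match.group(2), match.group(3)
    return int(position), ref.upper(), alt.upper()


def matches_wild_type(mutation, wild_type):
    """True if the stated original base is found at that position.

    Assumption: positions are 1-indexed relative to the supplied sequence.
    """
    position, ref, _ = mutation
    if not 1 <= position <= len(wild_type):
        return False
    return wild_type[position - 1].upper() == ref
```

Checking each entry against the wild type in this way catches typing errors in the mutation list before any frequencies are tallied.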

Results and discussion

The analysis of the number of mutations per sequence by Mutanalyst is improved compared to standard protocols [1]. It shows not only the average number of mutations, but also the mean (λ) of a Poisson distribution fitted to the values. This distribution approximates the PCR-distribution of Sun [5], but does not require knowledge of the PCR efficiency. The fitted value is important in light of the inevitable error arising from the small sample size, which can be ameliorated by imposing this valid assumption about the distribution of the values. The number of mutations per sequence is a key indicator of the diversity of the library and is also used by another program, Pedel, to determine library completeness in terms of nucleic acid mutation coverage [6–8]. Together with nucleobase-specific mutational rates, the mean number of mutations per sequence is used both in Pedel-AA [8] and in the library diversity program by Volles and Lansbury [9] to determine the library completeness in terms of amino acids. Although the two tools are unaffiliated, the Mutanalyst output can be linked directly to Pedel-AA, circumventing the need for the user to copy the parameters manually.
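The difference between the plain average and a fitted λ can be sketched as follows. The fitting procedure used by Mutanalyst is not specified in this section, so the example assumes a simple least-squares fit of the Poisson probabilities to the observed frequencies, found by a grid search; the function names and the step size are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing k mutations under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def fit_poisson_lambda(mutations_per_clone, step=0.01):
    """Least-squares estimate of lambda by grid search.

    Assumption: the fit minimises the squared difference between the
    observed frequency of each mutation count and the Poisson probability.
    """
    n = len(mutations_per_clone)
    observed = Counter(mutations_per_clone)
    max_k = max(mutations_per_clone)
    best_lam, best_sse = step, float("inf")
    for i in range(1, int((max_k + 1) / step) + 1):
        lam = i * step
        sse = sum(
            (observed.get(k, 0) / n - poisson_pmf(k, lam)) ** 2
            for k in range(max_k + 1)
        )
        if sse < best_sse:
            best_lam, best_sse = lam, sse
    return best_lam


# Ten sequenced clones, one of which is a 'jackpot' with 12 mutations.
counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 12]
average = sum(counts) / len(counts)
fitted = fit_poisson_lambda(counts)
```

In this made-up sample the single heavily mutated clone pulls the plain average upwards, whereas the fitted λ stays closer to the bulk of the clones.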

In the third section, the user can input the nucleobase-specific mutational rates, which are otherwise obtained from the list of mutations and the wild type sequence. The resultant normalised nucleobase-specific mutational rates are displayed both as a table and as a diagram to show the mutational spectrum concisely.
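A minimal sketch of one way to normalise the tallied mutations by the nucleotide distribution is given below. The exact normalisation performed by Mutanalyst is described on the webpage rather than here, so the scheme shown (dividing each tally by the fraction of the template made up of the original base, then rescaling so that the twelve values sum to one) is an assumption, as is the function name.

```python
from collections import Counter

BASES = "ACGT"


def normalised_rates(wild_type, mutations):
    """Normalise substitution tallies by the base composition of the template.

    wild_type: the template sequence.
    mutations: iterable of (from_base, to_base) pairs, one per observed mutation.
    Returns a dict mapping (from_base, to_base) to a proportion; the twelve
    proportions sum to 1 (assumed normalisation, see the lead-in).
    """
    composition = Counter(wild_type.upper())
    length = sum(composition[base] for base in BASES)
    tallies = Counter(mutations)
    weighted = {}
    for ref in BASES:
        fraction = composition[ref] / length if length else 0.0
        for alt in BASES:
            if ref == alt:
                continue
            # A base that is rare in the template has fewer chances to mutate,
            # so its tally is scaled up accordingly.
            weighted[(ref, alt)] = tallies[(ref, alt)] / fraction if fraction else 0.0
    total = sum(weighted.values())
    return {pair: (value / total if total else 0.0) for pair, value in weighted.items()}
```

For a template that is 75 % A and 25 % C, three A to G mutations and one C to T mutation give equal normalised rates, since A had three times as many opportunities to mutate.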

The final section contains the standard indicators of bias with the addition of an error estimate. The calculation of errors is accomplished thanks to the complementarity of DNA. Specifically, a mutation from one base to another on one strand is matched by a complementary mutation on the opposite strand (e.g. A → G and T → C). Consequently, these two separate values can be taken as replicates from which to derive the errors in the values. The propagation of errors was done parametrically using Eqs. 1 and 2 with the assumption that the covariance is zero in light of the independence of each mutational event (further detail is found on the webpage).

$$ \mathrm{Var}(x+y)=\mathrm{Var}(x)+\mathrm{Var}(y) \tag{1} $$
$$ \mathrm{Var}\left(\frac{x}{y}\right)\approx \frac{\mathrm{Var}(x)}{\mu_y^2}+\frac{\mu_x^2\,\mathrm{Var}(y)}{\mu_y^4} \tag{2} $$
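Eqs. 1 and 2 can be sketched in code as follows, together with the use of a mutation and its complementary counterpart as two replicates. The function names are hypothetical, and the use of the sample variance of the two replicates is an assumption; the treatment used on the site is documented on the webpage.

```python
from statistics import mean, variance


def replicate_stats(rate, complementary_rate):
    """Mean and sample variance of a rate and its complementary-strand rate.

    Example pair: A>G on one strand and T>C on the other.
    Assumption: the two values are treated as independent replicates.
    """
    pair = [rate, complementary_rate]
    return mean(pair), variance(pair)


def var_sum(var_x, var_y):
    """Eq. 1: variance of x + y with zero covariance."""
    return var_x + var_y


def var_ratio(mu_x, var_x, mu_y, var_y):
    """Eq. 2: approximate variance of x / y with zero covariance."""
    return var_x / mu_y ** 2 + (mu_x ** 2 * var_y) / mu_y ** 4
```

For instance, `replicate_stats(0.10, 0.14)` treats the two rates as replicates of one underlying value, and the resulting variances can then be carried through sums and ratios with `var_sum` and `var_ratio`.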

The indicators are as follows:

  • the sum of the four transition frequencies, i.e. purine (R) to purine, pyrimidine (Y) to pyrimidine,

  • the sum of the eight transversion frequencies, i.e. purine to pyrimidine, pyrimidine to purine,

  • the ratio of transitions over transversions,

  • the frequency of a weak-binding nucleobase pair (W: A and T) mutating,

  • the frequency of a strong-binding nucleobase pair (S: G and C) mutating,

  • the frequency of weak-binding nucleobases mutating to strong ones,

  • the frequency of strong-binding nucleobases mutating to weak ones and

  • the ratio of the latter two frequencies.
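The indicators listed above can be computed from the twelve substitution frequencies as in the following sketch. The function and key names are hypothetical, and the direction of the last ratio (weak-to-strong over strong-to-weak) is an assumption, as the list does not state which frequency is the numerator.

```python
import math

PURINES = {"A", "G"}
WEAK = {"A", "T"}  # W: weak-binding bases; G and C are strong (S)


def is_transition(ref, alt):
    """Purine to purine or pyrimidine to pyrimidine."""
    return (ref in PURINES) == (alt in PURINES)


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else math.nan


def bias_indicators(rates):
    """rates: dict mapping (from_base, to_base) to a frequency."""
    subs = {pair: v for pair, v in rates.items() if pair[0] != pair[1]}
    transitions = sum(v for (r, a), v in subs.items() if is_transition(r, a))
    transversions = sum(v for (r, a), v in subs.items() if not is_transition(r, a))
    weak_mutating = sum(v for (r, _), v in subs.items() if r in WEAK)
    strong_mutating = sum(v for (r, _), v in subs.items() if r not in WEAK)
    w_to_s = sum(v for (r, a), v in subs.items() if r in WEAK and a not in WEAK)
    s_to_w = sum(v for (r, a), v in subs.items() if r not in WEAK and a in WEAK)
    return {
        "transitions": transitions,
        "transversions": transversions,
        "ts_over_tv": _ratio(transitions, transversions),
        "weak_mutating": weak_mutating,
        "strong_mutating": strong_mutating,
        "w_to_s": w_to_s,
        "s_to_w": s_to_w,
        # Assumed direction of the ratio: W>S divided by S>W.
        "w_to_s_over_s_to_w": _ratio(w_to_s, s_to_w),
    }
```

With a perfectly unbiased spectrum, in which all twelve substitutions are equally frequent, the transition over transversion ratio is 0.5 and the weak-to-strong over strong-to-weak ratio is 1.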

The two ratios are the most important parameters, as they cover the two largest sources of bias. Even though, in terms of possible combinations, there are twice as many transversions as transitions, the steric difference between purines and pyrimidines means that transitions occur more frequently than transversions. Due to the differences in binding strength between weak-binding and strong-binding nucleobases, some epPCR methods more readily mutate weak-binding nucleotides. Thanks to the estimation of the error associated with them, these values can be compared with those of other libraries, whether successful or unsuccessful, with those from the brochure provided with a commercial enzyme (e.g. Genemorph II) or with a reference cut-off chosen by the user.

Conclusion

This easy-to-use online tool was designed to automate the calculations involved while simultaneously adding conservative error values. Specifically, it simplifies the laborious checking of sequences for mutations, the tallying of mutations, the normalisation steps and the fiddly calculations of commonly used bias indicators. This is done with the addition of informative graphs and of statistical rigour. The graphs include a Sankey diagram to show the directions of the mutations. The statistics, which go beyond current standards, include a Poisson fit for the distribution of mutations per sequence, so that the estimate is not overly affected by jackpot samplings, and the estimation and propagation of errors, which take advantage of the fact that a mutation on one strand is as likely as a mutation on the complementary strand. The main aim was to automate the analysis and to add statistical confidence to the values, which are strongly affected by the small sample size and may otherwise be misleading.

Ethical approval

No ethical approval was required for this study.

Availability and requirements

The program is available at http://www.mutanalyst.com. The source code can also be found at https://github.com/matteoferla/mutant_calculator.