Stable isotope labeling strategy based on coding theory
We describe a strategy for stable isotope-aided protein nuclear magnetic resonance (NMR) analysis, called stable isotope encoding. The basic idea of this strategy is that amino-acid selective labeling can be considered as “encoding and decoding” processes, in which the information of amino acid type is encoded by the stable isotope labeling ratio of the corresponding residue and is decoded by analyzing NMR spectra. Following this idea, the strategy reduces the required number of labeled samples by increasing the information content per sample, enabling discrimination of 19 kinds of non-proline amino acids with only three labeled samples. The idea also allows the strategy to be combined with information technologies, such as error detection by check digit, to improve the robustness of analyses with low-quality data. Stable isotope encoding will facilitate NMR analyses of proteins under non-ideal conditions, such as those in large complex systems, with low solubility, and in living cells.
Keywords: Amino-acid selective stable isotope labeling; Cell-free protein synthesis; Coding theory; Combinatorial selective labeling; Signal assignment
Stable-isotope (SI) labeling of proteins is an essential technique for investigating the three-dimensional structures, ligand interactions, or dynamics of proteins by nuclear magnetic resonance (NMR) spectroscopy. The assignment of the main-chain signals, which is generally the first step in these analyses, is usually achieved by a sequential assignment method based on a combination of triple resonance experiments on proteins uniformly labeled with 15N and 13C (Grzesiek and Bax 1993). Amino-acid selective SI labeling (AASIL) helps to discriminate the amino-acid type of each signal, independently of the triple resonance experiment-based sequential assignment. Therefore, it is especially useful for the signal assignment of difficult targets, such as large complex systems (Bertelsen et al. 2009), low-solubility proteins (Cervantes et al. 2013), and proteins in living cells (Hembram et al. 2013). The dual selective labeling method, which utilizes both amide nitrogen and carbonyl carbon labeling, narrows down the assignment possibilities even further (Kainosho and Tsuji 1982), and as a consequence leads to the assignment, without the need for triple resonance experiments, of amino-acid pairs occurring only once in the sequence. However, for the discrimination of all amino acids, these simple AASIL schemes require a large number of samples, typically equal to the number of amino acids (19 for nitrogen or 20 for carbon). For this reason, various combinatorial selective labeling (CSL) schemes (Parker et al. 2004; Shi et al. 2004; Trbovic et al. 2005; Staunton et al. 2006; Wu et al. 2006; Maslennikov et al. 2010; Sobhanifar et al. 2010; Hefke et al. 2011; Krishnarjuna et al. 2011; Jaipuria et al. 2012; Löhr et al. 2012; Maslennikov and Choe 2013) were developed to reduce the required number of samples, by representing amino acids as combinations of SI-labeled samples rather than simply assigning one amino acid to one SI-labeled sample.
For example, a CSL scheme developed by Parker et al. (2004), which is based on the dual selective approach, can discriminate 16 amino acids with one uniformly 13C- and 15N-labeled reference and four selectively (100 or 0 % for 13C and 100 or 50 % for 15N, respectively) labeled samples. The use of a labeling ratio of 100 or 50 % for 15N, rather than 100 or 0 %, ensures that 13C labeling information can be obtained from the HN(CO) spectrum, irrespective of the 15N labeling ratio. Otting and colleagues reported a simpler CSL scheme using five samples (Wu et al. 2006), based on the single selective 15N-labeling approach, in which spectral overlaps were diminished by labeling one amino acid with high occurrence and at most three with low occurrence in each sample. Dötsch and colleagues developed focused CSL (Trbovic et al. 2005; Sobhanifar et al. 2010), in which 6 or 7 amino acids frequently appearing in transmembrane regions of membrane proteins were labeled with 15N or 1-13C. They further improved the CSL to discriminate up to 20 amino acids with a number of samples labeled with 15N and/or 13C by using the dual selective approach (Hefke et al. 2011), or to discriminate 12 amino acids with only three samples by introducing a triple selective approach (Löhr et al. 2012), in which the samples were labeled with the combination of 15N, 1-13C, and 13C/15N. Choe and colleagues also improved membrane protein-focused CSL (Maslennikov et al. 2010; Maslennikov and Choe 2013) to discriminate up to 19 amino acids (all except glutamate) with six samples simply labeled with 13C and/or 15N. A couple of computational methods for designing labeling patterns for CSL were employed (Maslennikov et al. 2010; Hefke et al. 2011; Maslennikov and Choe 2013) in order to maximize the number of assignable residues by using dual selective approaches.
From the point of view of SiCode, the simplest AASIL scheme, in which 19 (for nitrogen) or 20 (for carbon) samples labeled with only a single amino acid are prepared, can be considered as a system in which each amino acid is assigned to a specific 19- or 20-digit binary codeword with only a single ‘1’ digit (Fig. S1a). One CSL scheme proposed by Parker et al. (2004), which utilizes the combination of one uniformly labeled sample and four selectively labeled samples in order to discriminate 16 amino acids, can be considered as a system in which the information is assigned to a 4-digit binary codeword (Fig. S1b). As long as binary digits are used as codewords, each sample can carry only one bit of information, thus limiting the number of discriminable amino acids to 2^n, where n is the number of selectively labeled samples. In AASIL, it is better to minimize the number of labeled samples, in terms of costs and sample preparation workload, as well as NMR machine time. Based on the SiCode concept, such minimization can be achieved by increasing the information content per sample by using three or more discrete SI-labeling levels, whereas the abovementioned CSL schemes utilize no more than two levels. As the simplest case of this idea, we have designed a novel scheme that uses ternary digits as codewords, and an example of a codeword table based on a dual selective approach is shown in Fig. 1b (see “Materials and methods” section for details). In this scheme, the ternary digits, “0”, “1”, and “2”, are represented by SI-labeling levels of 50, 75, and 100 % (for 15N) or 0, 50, and 100 % (for 13C), respectively. Moreover, by using only the codewords with at least one “2”, the sample with the largest intensity for each signal can be used as a fully-labeled reference. The number of assignable codewords based on this scheme is 19 (3^3 − 2^3), which is the exact number required for representing non-proline amino acids.
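The codeword counts quoted above can be checked by direct enumeration. The following is a minimal sketch; the function name `codewords` and its parameters are illustrative and not part of the original work.

```python
from itertools import product


def codewords(levels, samples, require_top=True):
    """List all codewords with `samples` digits drawn from `levels` labeling levels.

    When `require_top` is True, keep only codewords that contain the highest
    digit at least once, so that one sample can act as the fully labeled
    reference for every signal.
    """
    top = str(levels - 1)
    words = ["".join(map(str, digits)) for digits in product(range(levels), repeat=samples)]
    return [w for w in words if not require_top or top in w]


ternary = codewords(levels=3, samples=3)                      # three samples, three levels
binary = codewords(levels=2, samples=4, require_top=False)    # four samples, two levels

print(len(ternary))  # 19
print(len(binary))   # 16
```

The restriction to codewords containing a “2” removes the 2^3 = 8 words built only from “0” and “1”, leaving 27 − 8 = 19.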
Thus, we can discriminate 19 kinds of non-proline amino acids with only three labeled samples, by omitting the additional uniformly labeled reference sample used in the abovementioned CSL (Parker et al. 2004). As a proof of concept, we have applied this scheme to the 116-amino-acid CH domain of Smoothelin protein (BMRB ID: 11572), as described in detail in the Materials and Methods and Supplementary Information. We used an Escherichia coli-based cell-free protein synthesis system (Kigawa et al. 1999, 2004; Matsuda et al. 2007; Seki et al. 2008) supplemented with metabolic inhibitors (Yokoyama et al. 2011) in order to achieve the accurate SI-labeling ratios we designed (Fig. 1b) for preparing the three kinds of labeled samples. A pair of 2D 15N-HSQC and 2D HN(CO) spectra were acquired for each of the three samples so that six spectra in total were obtained (full spectra are shown in Fig. S2), and accurate signal intensities on each spectrum were obtained by fitting the signal to a two-dimensional Gaussian function. The 15N-labeling ratios of the corresponding residue (residue i) were calculated from the HSQC signal intensities, and the 13C-labeling ratios of the preceding residue (residue i − 1) were calculated similarly, using both the HSQC and HNCO signal intensities. These back-calculated SI labeling ratios are referred to as “SI indices”.
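One way to picture the back-calculation of SI indices is sketched below. It rests on assumptions that are not spelled out in the text: that the HSQC intensity scales with the 15N ratio of residue i, that the HN(CO) intensity scales with the product of that 15N ratio and the 13C ratio of residue i − 1, that sample-to-sample intensity differences have already been compensated, and that the largest value across the three samples corresponds to 100 % labeling. The function name and the intensities are hypothetical.

```python
def si_indices(hsqc, hnco):
    """Return (15N indices of residue i, 13C indices of residue i-1) in percent.

    `hsqc` and `hnco` hold the fitted peak intensities of one signal in each
    labeled sample, in the same sample order.
    """
    # 15N index: HSQC intensity relative to the most intense sample.
    n15 = [100.0 * v / max(hsqc) for v in hsqc]
    # 13C index: HN(CO)/HSQC ratio cancels the 15N contribution (assumption),
    # then is scaled to the sample with the largest ratio.
    ratios = [c / n for c, n in zip(hnco, hsqc)]
    c13 = [100.0 * r / max(ratios) for r in ratios]
    return n15, c13


# Invented intensities for a single signal in samples 1 to 3.
n15, c13 = si_indices(hsqc=[1000.0, 762.0, 500.0], hnco=[394.4, 609.6, 388.0])
print([round(v, 1) for v in n15])
print([round(v, 1) for v in c13])
```

Because both indices are scaled to the strongest sample, this sketch only works for codewords that contain at least one “2”.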
Figure 1c shows the cross peaks of six spectra of residue D73, which is preceded by residue A72. The SI indices of 15N of the residue i were 100, 76.2, and 50.0 %, respectively, which correspond to the codeword “210”, indicating that this signal is from an aspartate residue. The SI indices of 13C of the residue i − 1 were 49.3, 100, and 97.0 %, respectively, which correspond to the codeword “122”, revealing that the preceding residue is alanine. To investigate the decoding performance of SiCode, we analyzed 89 isolated (i.e. non-overlapping) main-chain signals out of 111 observable (non-proline) main-chain signals. The amino-acid types of all 89 isolated main-chain signals were correctly discriminated except for two preceding proline residues, as they are not SI-labeled in this scheme (see more examples in Fig. S4). The SI indices of these signals are accurate and precise enough to distinguish among the three levels, thus demonstrating that our strategy can be performed with sufficient accuracy and precision (Fig. 1d, see Supplemental Note S1 for the influence of the signal-to-noise ratio on the SI indices).
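The decoding of the D73 example can be sketched as a simple thresholding step. The digit ranges are those given in the Materials and methods (digit “1” spans 62.5 to 87.5 % for 15N and 25 to 75 % for 13C); how values falling exactly on a boundary are treated is an assumption here, and the lookup table contains only a few codewords quoted in the text, not the full table of Fig. 1b.

```python
N15_BOUNDS = (62.5, 87.5)  # range of digit "1" for 15N, in percent
C13_BOUNDS = (25.0, 75.0)  # range of digit "1" for 13C, in percent

# Partial table: only codewords explicitly mentioned in the text.
PARTIAL_TABLE = {"222": "Gly", "220": "Asn", "210": "Asp", "122": "Ala"}


def to_codeword(indices, bounds):
    """Convert SI indices (percent, one per sample) into a ternary codeword."""
    low, high = bounds
    return "".join("0" if v < low else "1" if v <= high else "2" for v in indices)


residue_i = to_codeword([100.0, 76.2, 50.0], N15_BOUNDS)
residue_prev = to_codeword([49.3, 100.0, 97.0], C13_BOUNDS)

print(residue_i, PARTIAL_TABLE[residue_i])        # 210 Asp
print(residue_prev, PARTIAL_TABLE[residue_prev])  # 122 Ala
```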
In the case of the BMX SH2 domain, 7 suspicious overlapped pairs of signals were identified by visual inspection, and 6 of them were detected by the error detection (Fig. S5). By fitting each overlapped pair to two two-dimensional Gaussian functions, 5 pairs of main-chain signals and 1 pair of side-chain signals were correctly discriminated (Fig. S5). For example, the overlapped pair signals of residues A35 and V109, shown in Fig. 2c, were successfully discriminated as two codeword and check digit pairs, “122 and 0” and “112 and 2”, corresponding to alanine and valine, respectively. These results indicate that signal overlapping, which can impair correct discrimination in the conventional CSL, was managed by the peak deconvolution implemented as Gaussian peak fitting in the decoding process of SiCode.
As described above, conventional CSL schemes generally use two SI-labeling levels, so that each labeled sample contains 1 bit of information (Parker et al. 2004; Shi et al. 2004; Trbovic et al. 2005; Staunton et al. 2006; Wu et al. 2006; Maslennikov et al. 2010; Sobhanifar et al. 2010; Hefke et al. 2011; Krishnarjuna et al. 2011; Jaipuria et al. 2012; Maslennikov and Choe 2013). In the previously mentioned triple selective CSL approach (Löhr et al. 2012), three SI-labeling types can be discriminated for both residues i and i − 1: unlabeled, 15N-labeled, or 2-13C/15N-labeled for residue i and unlabeled, 1-13C-labeled, or 1,2-13C-labeled for residue i − 1, respectively, so that each sample can be considered to contain 1 trit (ternary digit) of information. In this report, we demonstrated a version of SiCode in which each labeled sample contains 1 trit of information by using three SI-labeling levels rather than by increasing the number of labeling types. Our approach can discriminate nearly all amino acids by using a simpler combination of 15N- and 13C/15N-labelings, and can introduce additional information such as check digits for robust data analysis.
Labeling patterns more complicated than those using three labeling levels can be easily achieved by using the cell-free protein synthesis system without SI scrambling (Yokoyama et al. 2011). Assuming that thermal noise in the observed data is the main cause of distortion of the SI index, the labeling pattern should be designed to maximize the minimum Euclidean distance between amino acids to achieve the best noise tolerance (see Supplementary Note S3 for the detailed discussion). Based on this strategy, we can easily design labeling patterns for a given number of samples and a given number of amino acids, such as a pattern to discriminate 20 amino acids including proline. Since noise tolerance in discrimination between two specific amino acids depends on their information distance, specialized labeling patterns would be useful in some cases. For example, in the sequential assignment, amino acid pairs with similar Cα and Cβ chemical shifts could be easily discriminated with the help of a SiCode specially designed so that such amino acid pairs have a long information distance. When SI scrambling in the protein expression system is not strictly suppressed, the distance around the scrambling-prone amino acids should be increased. As mentioned above, the noise tolerance of a selective labeling method can be evaluated based on the information distance. SiCode is more noise-tolerant than the other selective labeling methods under a given total measurement time (see Supplementary Note S3 for details).
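The design criterion of maximizing the minimum Euclidean distance can be illustrated with a short calculation. This is only a sketch in units of labeling ratio (percent); the full treatment, including the effect of measurement time, is in Supplementary Note S3 and is not reproduced here. The function name `min_distance` is illustrative.

```python
from itertools import combinations, product
from math import dist


def min_distance(codewords, levels):
    """Smallest Euclidean distance between any two codewords.

    Each digit of a codeword is mapped to its labeling ratio in `levels`,
    so a three-sample codeword becomes a point in three-dimensional space.
    """
    points = [tuple(levels[int(d)] for d in word) for word in codewords]
    return min(dist(a, b) for a, b in combinations(points, 2))


# The 19 three-digit ternary codewords that contain at least one "2".
words = ["".join(map(str, d)) for d in product(range(3), repeat=3) if 2 in d]

print(min_distance(words, levels=(50, 75, 100)))  # 15N levels -> 25.0
print(min_distance(words, levels=(0, 50, 100)))   # 13C levels -> 50.0
```

A candidate pattern with more levels or more samples could be compared in the same way: the larger the minimum distance, the more intensity noise the pattern can absorb before two amino acids become confusable.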
One of the major motivations for using AASIL is to simplify the NMR spectrum by reducing the number of signals. Signal overlapping usually impedes amino acid discrimination, especially in CSL; therefore, in some CSL schemes the number of labeled amino acids is reduced according to their occurrence (Trbovic et al. 2005; Wu et al. 2006; Sobhanifar et al. 2010; Löhr et al. 2012). In the present study, we labeled all 19 non-proline amino acids; however, the quantitative peak fitting used for decoding information in SiCode solved the signal overlapping issue, as demonstrated. In addition to the noise tolerance based on coding theory, this feature will be especially crucial for analyzing difficult targets.
Almost all of the CSL studies, including this work, have so far used a cell-free synthesis system for protein expression in order to achieve accurate SI labeling by avoiding SI scrambling and dilution. SiCode can be performed using a protein expression system with a manageable level of SI scrambling (see Sample Preparation section in the Supplementary Information). In addition, a customized labeling pattern suitable for a specific expression system could be designed by evaluating its SI scrambling profile based on the information distance between amino acids (see Supplementary Note S3). Therefore, SiCode could also be achieved with an in vivo expression system, for example, by the combination of the single protein production system (Suzuki et al. 2007; Schneider et al. 2010) and amino-acid auxotrophic E. coli strains.
SiCode introduces a new concept into AASIL by enabling its combination with information techniques that have rarely been used in the NMR field, such as the detection of errors in signal intensities. From the standpoint of information science, its performance will be further improved, for instance, by optimizing the labeling pattern according to the amino acid content or sequence of the target, or by increasing the number of labeling levels or samples to expand the codeword space, which will enable the implementation of redundant messages for error detection and correction.
Materials and methods
Designing the codeword table
As mentioned in the text, the number of codewords consisting of three ternary digits with at least one “2” is 19. Therefore, the 19 non-proline amino acid types can be discriminated with only three kinds of labeled samples. We designed the codeword tables shown in Figs. 1b and 2a based on the following considerations (see Supplementary Note S3 with respect to the SI-labeling ratio).
First, the signal intensity of each sample is disturbed by protein concentration differences and/or other technical reasons, such as magnetic field inhomogeneity (hereafter called “intensity disturbance”). The intensity disturbance has to be compensated, because the accuracy and precision of SI indices are critical for decoding amino acid information in the SiCode strategy. For this compensation, the signal of an amino acid that is fully labeled in all samples, namely the one mapped to the codeword “222”, was used as described below (see Supplementary Note S4 for the result of the compensation). We have assigned “222” to glycine, as shown in Fig. 1b, because its signals rarely overlap with other signals in 1H-15N HSQC-type spectra and thus can be easily distinguished.
Second, as described in the text, the E. coli-based cell-free system supplemented with metabolic inhibitors to suppress isotopic scrambling (Yokoyama et al. 2011) was used in order to achieve accurate and precise SI labeling, which is crucial for SiCode. However, asparagine to aspartate conversion could not be fully suppressed, because 5-diazo-4-oxo-l-norvaline, which can achieve this suppression (Yokoyama et al. 2011), was unavailable and therefore not used in the present study. In order to overcome this scrambling issue, we assigned asparagine and aspartate to “220” and “210”, respectively. As both amino acids were designed to have the same digit, “2” or “0”, and thus the same SI-labeling ratios, in samples 1 and 3, the scrambling does not matter for these samples. In addition, we intentionally lowered the SI-labeling ratio of aspartate in the protein production processes of samples 2 and 4, whereas asparagine was fully labeled, so that the SI-labeling ratio of aspartate would remain within the range for digit “1” (between 25 and 75 % for 13C and between 62.5 and 87.5 % for 15N, as described below) even if it was increased by asparagine to aspartate conversion. For the same reason, 6-diazo-5-oxo-l-norleucine, which suppresses glutamine to glutamate conversion (Yokoyama et al. 2011), was not used in the present study. In order to overcome this scrambling, glutamine and glutamate were assigned to “022” and “021”, respectively, and the SI-labeling ratio of glutamine for the preparation of samples 3 and 4 was intentionally lowered, as described in Supplementary Information.
Third, from an economic viewpoint, we designed the table in order to limit the total consumption of relatively expensive SI-labeled amino acids; for example, tryptophan is assigned to “002”.
NMR spectral analysis for amino-acid discrimination of the Smoothelin CH domain
All NMR spectra were recorded on an AVANCE 700 spectrometer equipped with a CryoProbe (Bruker Biospin, Germany) at 295 K, and processed with the program NMRPipe (Delaglio et al. 1995). Acquisition and processing parameters are shown in Table S1. For the 1H-15N HSQC spectrum of sample 1, cross peaks were picked with the program NMRview (Johnson and Blevins 1994). The peaks were grouped so that a pair of peaks for which the chemical shift difference was less than 0.1 ppm in proton and 0.8 ppm in nitrogen was in the same group.
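The peak grouping rule can be sketched as follows. The sketch assumes that grouping is transitive (if peak A is close to B and B is close to C, all three end up in one group), which the text does not state explicitly; the function name and the example peak positions are hypothetical.

```python
def group_peaks(peaks, tol_h=0.1, tol_n=0.8):
    """Group peaks given as (1H ppm, 15N ppm) tuples.

    Two peaks are linked when they differ by less than `tol_h` ppm in proton
    and less than `tol_n` ppm in nitrogen; linked peaks share a group.
    Returns a sorted list of groups, each a list of peak indices.
    """
    parent = list(range(len(peaks)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            close_h = abs(peaks[i][0] - peaks[j][0]) < tol_h
            close_n = abs(peaks[i][1] - peaks[j][1]) < tol_n
            if close_h and close_n:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(peaks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented peak positions: the first three chain together, the last is isolated.
example = [(8.20, 120.0), (8.25, 120.5), (8.32, 121.0), (9.10, 110.0)]
print(group_peaks(example))  # [[0, 1, 2], [3]]
```

Groups with more than one member would be candidates for simultaneous fitting with several two-dimensional Gaussian functions, while single-member groups correspond to isolated signals.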
NMR spectral analysis with error detection by check digit for the BMX SH2 domain
Thirdly, if the converted check digit from the labeling ratio of sample 4 was inconsistent with the digit generated based on the defined codeword, as in Fig. 2a (see Supplementary Note S5 for the detail of check digit calculation), the judged amino acid type was considered to be incorrect.
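The consistency test described above can be sketched as a table lookup. The actual rule for generating check digits is given in Supplementary Note S5 and is not reproduced here, so this sketch does not compute check digits; it only compares an observed digit against the two codeword and check digit pairs quoted in the text for the BMX SH2 domain (alanine “122” with 0, valine “112” with 2). The names used are illustrative.

```python
# Expected check digits, limited to the pairs quoted in the text.
EXPECTED_CHECK = {"122": 0, "112": 2}


def passes_check(codeword, observed_check, table=EXPECTED_CHECK):
    """Return True when the digit decoded from sample 4 matches the table.

    A mismatch means the amino acid type judged from samples 1 to 3
    should be treated as incorrect.
    """
    if codeword not in table:
        raise KeyError(f"no check digit known for codeword {codeword!r}")
    return table[codeword] == observed_check


print(passes_check("122", 0))  # True
print(passes_check("112", 0))  # False
```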
We thank the lab members at RIKEN QBiC and RInC, particularly S. Watanabe, N. Matsuda, and Y. Kasai for their kind help in preparing the materials and S. Yasuda for secretarial assistance. This work was supported in part by a Grant-in-Aid for Scientific Research on Innovative Areas (Grant No. 25120003), a Grant-in-Aid for Challenging Exploratory Research (Grant No. 26650027), and a Grant-in-Aid for Young Scientists (B) (Grant No. 24770108) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan and the Japan Society for the Promotion of Science (JSPS).
Compliance with ethical standards
Conflict of interest
T. Kasai and T. Kigawa are co-inventors on a patent application (JP 2013-82543) related in part to the material presented here. J. Yokoyama is a salaried employee of Taiyo Nippon Sanso Corp., a company that has commercial interests in the cell-free protein synthesis system. Cell-Free Technology Application Laboratory was jointly funded by RIKEN and Taiyo Nippon Sanso Corp. S. Koshiba declares no potential conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.
- Hefke F, Bagaria A, Reckel S, Ullrich SJ, Dötsch V, Glaubitz C, Güntert P (2011) Optimization of amino acid type-specific 13C and 15N labeling for the backbone assignment of membrane proteins by solution- and solid-state NMR with the UPLABEL algorithm. J Biomol NMR 49:75–84. doi: 10.1007/s10858-010-9462-4
- Hembram DS, Haremaki T, Hamatsu J, Inoue J, Kamoshida H, Ikeya T, Mishima M, Mikawa T, Hayashi N, Shirakawa M, Ito Y (2013) An in-cell NMR study of monitoring stress-induced increase of cytosolic Ca2+ concentration in HeLa cells. Biochem Biophys Res Commun 438:653–659. doi: 10.1016/j.bbrc.2013.07.127
- Kainosho M, Tsuji T (1982) Assignment of the three methionyl carbonyl carbon resonances in Streptomyces subtilisin inhibitor by a carbon-13 and nitrogen-15 double-labeling technique. A new strategy for structural studies of proteins in solution. Biochemistry 21:6273–6279. doi: 10.1021/bi00267a036
- Maslennikov I, Klammt C, Hwang E, Kefala G, Okamura M, Esquivies L, Mörs K, Glaubitz C, Kwiatkowski W, Jeon YH, Choe S (2010) Membrane domain structures of three classes of histidine kinase receptors by cell-free expression and rapid NMR analysis. Proc Natl Acad Sci USA 107:10902–10907. doi: 10.1073/pnas.1001656107
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.