Background

The precise estimation of the frequencies of peaks in nuclear magnetic resonance (NMR) spectra is often complicated by poor signal-to-noise ratio and peak overlap. This results in only partially complete and correct peak picking. The problem aggravates especially when the peaks are highly overlapped. This is compounded by combinatorial ambiguity problems for resonance assignments and increases errors in NOE distance restraints [1]. To alleviate this problem, a more sophisticated peak decomposition algorithm, based on non-negative matrix factorization (NMF), has been developed and applied to three-dimensional (3D) NMR spectra.

Non-negative Matrix Factorization was first introduced by Paatero and Tapper as the concept of positive matrix factorization [2, 3] for estimating errors in widely varying environmental data. Their work revealed the non-negativity features of the underlying data models. Lee and Seung [4, 5] showed using an effective multiplicative algorithm parts-based representation of an object using NMF approach. A recent in-depth review on NMF algorithms discusses many forms of factorizations [6]. Because of the non-negativity and the sparseness constraints [7], NMF has wide applications in multidimensional data analysis [815]. The idea originated from the fact that in certain applications, by the rules of physics, the data quantities cannot be negative. The NMF approach was reported in application to complex metabolomic mixture analysis in two-dimensional NMR spectra [16]. Higher dimensional NMR spectral data matrices can be decomposed using NMF algorithms. The important property of NMF is the non-negative nature of the decomposed factors. Therefore, NMF processing of higher dimensional NMR spectral data can have important consequences in automated data processing.

In the automated peak picking approach peak identification is followed by the estimation of peak intensities and frequencies. Several algorithms have already been developed to perform peak picking in NMR spectra. Most of these algorithms are based on finding local maxima that fulfill certain criteria, and/or use Gaussian or Lorentzian lineshapes for lineshape fitting [1721] by minimizing the residual squared error between the observed peak shape and the assumed lineshape. Apart from the lineshape fitting methods, PICKY [22] is another program that uses a singular value decomposition of peak components for peak picking. In general, highly overlapped peaks cause the most commonly observed problems of existing peak picking algorithms. We used the basic two-dimensional (2D) NMF model, extended sequentially to ND usage to decompose a 3D data (signal) tensor. A Euclidean distance cost function was used as a measure of factorization convergence. The approach allows applying constraints if some information is known a priori, e.g. the total number of peaks or positions in common dimensions of hyper-dimensional (HD) shapes [23]. We discuss the NMF algorithm, the conditions for unique solutions of NMF models, and its applications to decompose 3D NMR signal tensors.

Methods

NMF decomposition algorithm

The basic idea of spectral factorization is to represent the multidimensional NMR spectrum as well as possible by a sum of direct products of one-dimensional shapes. The latter are expected to represent the lineshapes of resonances. In this way, i.e. if each one-dimensional shape represents a resonance line, possible overlap is deconvoluted and the factorization of the spectrum is equivalent to peak picking. The exact peak positions can simply be obtained by determining the (interpolated) maxima within the one-dimensional shapes. If it is known that the real signals in a spectrum are non-negative, a better result can be expected by introducing this condition into the factorization algorithm.

The non-negative factorization (NMF) problem may be described as follows. Given the observed data matrix Y = [y(1), y(2), …, y(T)] ∈ RmxT with Y ≥ 0. The solution is to find two matrices with only non-negative elements, the basis or mixing matrix ARm × r and the source matrix X = [x(1), x(2), …, x(T)] ∈ Rr × T, where r represents the number of true components [24]. The source matrix is expected to produce the unknown latent components of the original data matrix Y. The problem is to factorize the given data matrix such that it minimizes the squared Euclidean distance between the observed data matrix and the product of two non-negative data matrices i.e.

Y = AX + N ,

with A ≥ 0 and X ≥ 0, where NRmxT is a noise or error matrix that is to be minimized.

The divergence cost function is expressed in terms of the squared Euclidean distance given as

D F A , X = Y AX 2

The objective is to minimize the divergence of this function using a standard gradient descent technique. The divergence is calculated by component-wise calculation of the distance between matrixes Y and AX. The minimization is achieved using multiplicative update rules that update the matrices A and X iteratively until a minimum squared Euclidean distance is reached. The updates may be performed until as much minimum possible distance leading to a nonnegative solution is achieved. Any increase in the squared Euclidean distance may lead to an incorrect solution.

In matrix notation, the multiplicative update rules become

A A Y X T AX X T
X X A T Y A T AX

where ⊙ and ⊘ represent component-wise multiplication and division, respectively [25]. The proposed multiplicative update rules were originally introduced in the image space reconstruction algorithm (ISRA) [26]. The algorithm performs minimization of the squared Euclidean distance cost function using a gradient descent technique. The technique uses alternate switching between sets of parameters to generate updates on the matrices A and X until convergence is reached. The original ISRA algorithm used multiplicative updates rules by updating only the matrix X iteratively and assuming the matrix A to be known [25, 26]. The convergence to a nonnegative solution is obtained for any positive starting point given that the original input matrices contain hidden source components [27].

Extension to 3D non-negative tensor factorization

The 3D non-negative tensor factorization (NTF) model may be defined as an extension of the basic 2D models. Some 3D NTF models can be solved using the basic 2D NMF models referred to as the NTF2 model [24]. The model is illustrated in Figure 1 and is described as follows.

Y q = A D q X q ' + N q = A X q + N q , q = 1 , , Q

where Yq = [y itq ] ∈ RIxT is a q-th frontal slice (matrix) of the observed 3D data (signal) tensor YRIxTxQ. The component matrices A = [a ij ] = [a1,a2, …, a j ] ∈ RIxJ is a mixing or basis matrix and X q ' = [X jtq ] ∈ RJxTgives unknown normalized hidden components in q-th slice. The matrix X q = D q X q ' = [X jtq ] ∈ RJxT represents re-normalized source components and N q = [n itq ] ∈ RIxT represents the q-th frontal slice of the tensor NRI × T × Q representing noise in the input matrix Y.

Figure 1
figure 1

Illustration of the NTF2 model for peak picking a 3D NMR spectrum. The matrix Y is a 3D input NMR spectral data matrix. A q and X q are 2D matrices factorized from qth data plane of matrix Y representing the basis matrix and the source component matrix, respectively. N is the 3D matrix representing the residual error of the non-negative matrix factorization.

Determining the true number of components

For the NMF/NTF models the true number of components r plays an important role in reaching convergence because an approximate valid model is instrumental in capturing the true underlying structure in the data. The true number of components may be obtained using several approximate and heuristic techniques [2832]. Different numbers of components may result in different residual minima. In the present work, we applied the following procedure for calculating the true number of components. We observed the decay of residual values for different number of components r = 2, 3, 4, 5, 6, and 7. The NMF iterations were stopped when the residual showed no improvement over 10 or more consecutive iterations. The true number of components was obtained as the one that showed the minimum residual value. It was also observed that using a higher number of components than the true number did not yield better minimum residual values.

Uniqueness conditions for the 2D NMF ambiguity problem

The quadratic cost function with respect to matrices A and X may have many local minima, which leads to rotational ambiguity of the factorized matrices [33]. Therefore the alternating minimization of the squared Euclidean cost function may not result in a unique NMF solution. However, applying some preprocessing or filtering of the input matrix is sufficient to solve the NMF problem uniquely. The preprocessing involves normalization of the input matrix or normalizing the columns of the factorized matrix A and/or the rows of matrix X. We normalized each column of the matrix A to unit l1-norm. In addition, we normalized the input matrix Y to unit length at the beginning of the factorization and later used the corresponding scaling factor to obtain the original intensity values of the component peaks.

Peak picking of 3D NMR spectral data

The 3D NMF decomposition is performed as an extended 2D NMF decomposition as described above. Each slice of 2D data, taken from a 3D spectral data matrix Y, occupies data in the matrix Y q representing the qth point in the third dimension. The model factorizes peak components in one-dimensional (1D) peak shapes in the source matrix X q . Peak positions are obtained by fitting an ideal Gaussian shape of average linewidth to the observed component by minimizing the scalar product between the Gaussian shape and the observed component. Next, the linewidth of the peak is adapted to obtain an optimal agreement. The final peak positions are obtained by performing a three-point parabolic interpolation. The final peak lists are obtained by applying a user-defined intensity threshold.

Spectral data sets

The NTF2 model was applied using the 2D NMF algorithm to the previously measured 3D HNCO spectrum for the 23 kDa RcsD-ABL-HPt protein construct (residues 688-890) [34]. The spectrum was measured on a Bruker AVANCE spectrometer operating at 1H Larmor frequency of 950 MHz. The numbers of time-domain complex data points were 128 and 90 in 13C and 15N indirect dimensions respectively. Non-uniform sampling schedule was employed in both the indirect dimensions at a level of sparseness of 10 per cent, which acquired 1152 FIDs. The 203 amino acid protein gave rise to 195 expected peaks in the 3D HNCO experiment [35].

Additionally, the method was applied to a group of four synthetic signals with known positions. The synthetic signals were of different intensities and were used to construct a 2D HSQC spectrum. The efficiency of the algorithm for peak picking was assessed with different levels of noise in the spectrum. The noise was added incrementally in steps of 10 percent each. The separation of peak positions was varied from 10 to 1 points in steps of one data point each.

Results and discussion

Figure 2 shows peaks picked from the 3D HNCO spectrum of the RcsD-ABL-HPt protein on a 1H-13C projection. The model was able to pick 201 backbone and side-chain cross peaks from the HNCO spectrum. The peak list of 201 peaks is given in Additional file 1: Table S1. The NTF2 decomposition of a small region of about eight overlapped peaks of the 3D HNCO experiment is shown in Figure 3. The peak shapes determined for each dimension are plotted below each 2D projection. Peak shapes appearing in the same color in each 2D projection define one peak in the 3D spectrum. Table 1 lists the eight peaks picked in this overlap region with their assignments.

Figure 2
figure 2

1H-13C projection of the 3D HNCO NMR spectrum of RcsD-ABL-HPt (DAH) construct showing the peaks picked using the NTF2 model. Picked peaks are marked by a cross at the peak center.

Figure 3
figure 3

Non-negative matrix factorization of overlapped peaks of a small region (1H: 8.095–8.298ppm,15N: 118.813–123.806ppm,13C: 175.910–176.791ppm) of the 3D HNCO spectrum of the RcsD-ABL-HPt (DAH) protein. A)1H-13C projection. B)1H-15N projection. C)13C-15N projection. The upper row shows the peaks in 2D projections. The lower row shows the peaks factorized in 1D shapes from the corresponding projections. The peaks positions and intensities were obtained using a three-point parabolic interpolation.

Table 1 Peak list obtained by applying the NTF2 model to the overlapped region of the 3D HNCO spectrum of the RcsD-ABL-HPt construct shown in Figure 3

It can be seen that the overlapped peaks were well decomposed in all three dimensions of the HNCO experiment. Among the 201 peaks picked from the HNCO spectrum there were about 23 peaks that were overlapped in one or more dimensions. The NTF2 model was able to decompose all the overlapped peaks. Table 1 shows two peaks assigned to Cysteine 860 and to Leucine 859 which are overlapped in all three dimensions. The peaks for these two residues were well decomposed in all three dimensions as shown in Figure 3. This shows that correct peak picking of 3D NMR spectral data especially in overlapped regions is possible using the NTF2 model.

The HSQC spectrum built using four synthetic signals is shown in Figure 4A. The results of peak picking show that the algorithm could tolerate up to 60% noise in the spectrum when the peaks were separated by at least 7 points from each other. As the peaks were moved closer to each other by one point at each step, the noise tolerance started to drop. The result is shown in Figure 4B. When peaks were closer than two points, even 10% noise in the spectrum generated incorrect peak intensities.

Figure 4
figure 4

Noise and peak overlap tolerance of the NTF2 model for peak factorization of a synthetic HSQC spectrum. A) HSQC spectrum constructed using four synthetic signals. B) Effects of noise and peak overlap on peak picking. The amount of noise in the spectrum is shown on the x-axis. The y-axis shows the peak separation in number of points. Red circles indicate that peaks were incorrectly picked. Blue circles show that the peaks were correctly picked upon factorization. C) Effects of the amount of noise on the peak position determination. Differences in the peak position in number of points are shown on the y-axis. The x-axis shows the amount of noise in the spectrum. Blue points indicate that the peak was correctly picked. Red points indicate that the peak was obtained with incorrect parameters because of higher noise in the spectrum. D) Effects of the amount of noise on the peak intensity: The x-axis shows the amount of noise present in the spectrum. Differences in the peak intensity of the peak are shown on the y-axis. Blue points show that the peak was distinguishable from the noise. Red points show that the peak was obtained with incorrect parameters upon factorization.

The change in peak position of the peak with the highest intensity with varying levels of noise in the spectrum is plotted in Figure 4C. The result shows that the algorithm could tolerate up to 70% noise in the spectrum for factorizing the peak shape at its true position. It may be noted that the known peak position is being observed for its change with increasing noise in the spectrum. Therefore, with higher than 70% of noise in the spectrum, the peak appeared at its true position albeit with incorrect linewidth and intensity. Many other peaks resembled the true peak in the factorized 1D components (Matrix X q ). Therefore, it may be concluded that the algorithm could not reliably determine the true peak when more than 70% noise were present in the spectrum. Blue points indicate that the algorithm was able to distinguish the peak from noise. Red points indicate that many other peaks resembled the true peak because of higher noise in the spectrum. Because the original position of the peak was known, the difference in peak position could still be plotted in Figure 4C at noise levels higher than 70%. However, the matrix factorization residual and the number of peaks picked in each component were at unacceptable levels.

Similarly, the effect of noise on the change in peak intensity was observed on the peak selected for Figure 4C. Figure 4D shows the change in peak intensity with increasing level of noise in the spectrum. The algorithm could tolerate up to 70% noise in the spectrum for correct peak picking. The peak may be accepted or rejected at a different intensity depending on the user intensity tolerance threshold with higher noise levels in a spectrum. With more than 70% noise in the spectrum, incorrect intensities and linewidths were obtained after factorization.

In general, the method works well on overlapped peaks for two different reasons. First, the 3D peak picking data is reduced to a 2D peak picking data matrix. The 2D factorization is performed at all points in the third dimension. In case of overlapped peaks, for example, if two peaks are separated by only one or two points, the factorization can give two peaks separately in each data plane from the points that separate the two peaks. This becomes highly improbable in case of peak picking based on lineshape fitting. Second, if the peak position from one of the dimensions is known a priori then the 2D data matrix comprising unknown peak position can be factorized to get the peak position in the other dimension(s). The method has useful consequences when peaks are to be picked from hyper-dimensional data matrices as discussed below.

The usefulness of the NTF2 model becomes apparent when multi-dimensional NMR spectra have commonly referenced dimensions. Thus, if the peak positions from an already measured more sensitive NMR experiment such as 3D HNCO are known, then the same peak positions can be used to pick peaks in other spectra that have dimensions in common with HNCO, for example, 1H and/or 15N. Peak picking overlapped regions becomes easier with the NTF2 model if each common dimension has consistent spectral referencing such as the spectral width, the carrier frequency positions and the number of sampled points. Theory suggests that the components in matrix A are assumed known and the matrix X gives the hidden source components of the input matrix Y[25]. The advantage that the NTF2 model offers in decomposing a 3D NMR spectral data tensor is that peak positions from any two dimensions can be assumed known. Therefore, the reduced NTF2 model can offer accurate peak position solutions especially when the peaks are overlapped. This notion can be extended naturally to hyper-dimensional NMR spectral data tensors [23]. If peak positions from one or more dimensions are known, the peaks can be picked in the remaining dimensions by appropriately selecting the corresponding 2D data planes containing the picked peaks from the 3D data tensors.

Conclusion

We have developed a method based on non-negative matrix factorization that can be used for peak picking 3D NMR spectral data tensors. Our results demonstrate that the method is particularly useful for picking overlapped peaks. Additionally the method can be easily extended for peak picking three- or higher-dimensional NMR spectral data tensors that have commonly referenced dimensions.