Improved Baseline Correction Method Based on Polynomial Fitting for Raman Spectroscopy

Raman spectrum, as a kind of scattering spectrum, has been widely used in many fields because it can characterize the special properties of materials. However, Raman signal is so weak that the noise distorts the real signals seriously. Polynomial fitting has been proved to be the most convenient and simplest method for baseline correction. It is hard to choose the order of polynomial because it may be so high that Runge phenomenon appears or so low that inaccuracy fitting happens. This paper proposes an improved approach for baseline correction, namely the piecewise polynomial fitting (PPF). The spectral data are segmented, and then the proper orders are fitted, respectively. The iterative optimization method is used to eliminate discontinuities between piecewise points. The experimental results demonstrate that this approach improves the fitting accuracy.


1.Introduction
Raman spectrum analysis is widely carried out in various fields, such as chemistry and biology. After analyzing the scattering spectrum of the incident light, one can get the information of molecular vibration and rotation, and obtain the molecular structure characteristics. For example, from the Raman spectroscopy, one can analyze the structure of minerals, carry out archaeological studies [1], characterize the chemical properties of cells and intracellular fine molecules [2], detect the safety of food [3], and gauge geographic information [4]. However, Raman spectrum has only about 10 -8 of the intensity of the original excitation signal [5]. The presences of shot noise and fluorescence background noise which consist of the baseline seriously affect the analysis of the Raman spectrum. Therefore, the baseline correction is crucial to restore the real spectrum. Instrumental improvement and mathematical calculation are two methods to reduce the baseline drift. Compared with the mathematical calculation method, the cost of the instrument improvement method is relatively high, which limits its application to some extent. Therefore, a mathematical calculation method is a better one.
As the background usually varies from sample to sample, several baseline correction methods have been proposed and improved. The commonly used methods include polynomial fitting [6][7][8], wavelet transform [9][10][11], and the least-square method [12][13][14]. Each of these methods has their own advantages when they are used in certain situations. As for the polynomial fitting method, it is the most popular and widely used method due to its efficiency and simplicity. Baeket et al. [8] sought the peaks by inspecting the smoothed derivative of a given spectrum. After clipping out the corresponding peak regions, they estimated the background by applying a modified linear interpolation. Feng et al. [15] removed the baseline by an iterative fitting process which discarded points whose intensities were above a threshold and fitted the remaining points into a straight line. Zhao et al. [7] took the signal noise distortion and the influence of large Raman peaks on fluorescence background fitting into account. Qin et al. [16] used a piecewise linear fitting to correct the baseline. They roughly defined the abscissa of the target points and directly found the minimum value among three points before and after the abscissa. Then, the baseline between the target points was fitted in turn. Sun et al. [17] refined and improved the method of Qin. They obtained the area of extreme value (target points) by the positive and negative of the derivative and located interval to get the minimum value. He et al. [18] used a genetic algorithm to filter the background points and used the cubic spline method to fit the background points obtained. The baseline correction was finally realized. Liu et al. [19] proposed a new cost function, whose property was that the cost increased when the fitting curve moved upward and away from the real baseline, thus improving the accuracy of the baseline.
In this paper, we present a piecewise polynomial fitting (PPF) method for the baseline correction. The paper is organized as follows. In Section 2, the theoretical model of PPF is presented. We obtain the segmentation points based on the right boundary of the Raman peak by deriving the spectral data. The appropriate points and order of each section are selected to perform polynomial fitting. At the same time, the number of data involved in the fitting may change with the process of iterative optimization in order to eliminate the discontinuity at the segmentation points. Section 3 presents the simulated data to illustrate the performances of the methods. We intuitively show the whole process of segmentation point selection and the discontinuity elimination. By applications to real Raman spectra, we demonstrate the validity of our algorithm in Section 4. Finally, some conclusions are drawn in Section 5.

Methods
Defining a model of the spectrum will allow us to better access the useful information. In the process of baseline processing, the intensities of the Raman spectra of N points are expressed as are the Raman shifts corresponding to the first point and the last point, 1 seg to seg n are the Raman shifts of segmentation points, and 1 p to n p represent the highest order of polynomial in each fitting, respectively.

Selection of segmentation points
In order to conduct piecewise polynomial fitting of the baseline, the fitting points involved should contain as much background information as possible, and an isolated peak is not allowed for baseline fitting. Here, we choose the derivative method to detect the peaks. For the discrete data, the derivative is approximated with the difference as follows: ( The maximum value of the peak can be found with the sign change of the derivative from the positive to the negative. The boundary of the peak can be determined with adjacent zero positions of the derivative. Then, the segmentation points are determined according to the boundary of the peaks. If the spectrum is more complicated, you can manually select the segmentation point according to the actual situation. Compared with the characteristic peaks in a Raman spectrum, the variation of the baseline is moderate. Therefore, the change in the derivative value is relatively small where there are no characteristic peaks. Then, the derivative's difference of each peak's right n (n increases from one in increment of one) points is calculated. If the difference is less than a given threshold, it is assumed that the right side of the peak is the background information. Then, the point of the right boundary is determined as a segmentation point. The flow chart for selection of segmentation points is shown in Fig. 1.

Polynomial fitting method in each segment
For each section of the background, the curve is fitted with Zhao's method (I-ModPoly) [7], which takes the noise effect into account and avoids the mismatch peaks caused by noise.
Firstly, the polynomial fitting ( ) P x of the original spectral ( ) O x in the segmented region is performed. Then, the residual ( ) R x and standard deviation DEV of the fitting data and the original data are calculated according to (3) and (4), If it is the first iteration and the original spectral value is greater than the sum of the polynomial fitting value and the standard deviation, it proves that the peak exists, and the peak removal is needed. After removing the peak and re-fitting the spectral data, we can obtain the processed spectrum. Then, we calculate the standard deviation of the processed spectrum and add it to each point of the processed spectrum. The sum is compared with the contrastive spectrum (the contrastive spectrum is the original spectrum initially). If the value is larger than that of the contrastive spectrum, the contrastive spectrum doesn't need to be changed. Otherwise, the contrastive spectrum is changed into the sum. After several iterations, we can fit out the spectral background. If , which indicates that additional iterations cannot significantly improve the fitting, the iteration stops.

Elimination of discontinuities at the segmentation points
Because the orders and fitting data involved in each piecewise polynomial fitting section are different, the polynomial curves are not the same. Discontinuities occur at the endpoints (i.e., segment points) of the polynomial curve, which may introduce additional peaks or increase the background of some areas. We overcome the discontinuities by the following iterative optimization methods: (1) Fit the baseline by selecting the spectral data between two initially selected segment points.
(2) Calculate the difference between two adjacent fitting curves at the segmented points in the corresponding spectral shift according to (5) as (4) Repeat Steps 2 and 3 until the background is obtained.
Through the iterative optimization method, the discontinuity of segmentation points is eliminated.
After fitting the background polynomial curve, the analytical spectrum [ ] s x is obtained by subtracting fitting background [ ] b x from the original spectral [ ] y x . Piecewise polynomial fitting can solve the problem that one polynomial curve cannot effectively fit the background, and finally, the required analytical spectra are obtained. A detailed diagram of the PPF algorithm is shown in Fig. 2.

Noise analysis
Noise usually occurs in the real spectrum. So, the threshold 0 dif is determined by noise to ensure that the difference between the values of the two adjacent curves at the segment points is within the noise range. The measured spectrum noise can be classified as four types which are independent from each other, i.e. readout noise, photoelectric noise, dark noise, and fixed pattern noise. The total noise can be described as

Simulation
We simulate a Raman spectrum whose signals contain a series of Lorentzian peaks on a null baseline with appropriate locations, heights, and widths.
where 0i w is the bandwidth of the peak at the full width at half-maximum (FWHM), 0i r is the position of the peak, and 0 i A is the total area under the curve from the baseline. The Raman spectrum is composed of 7 peaks with intensities and positions. The parameters used in the simulation are listed in Table 1.  For the modeling of the Raman spectrum, the simulated spectrum is produced by mixing the baseline with several simulated Lorentzian peaks, which is shown in Fig. 3. We can obtain segmentation points by the following steps. Start from the right boundary of the peak and find the nth point as "S". The value of n is determined by the specific spectrum, which we choose is 40. If the difference of derivative between the point S and the right boundary of the peak is less than a certain threshold, the point S is not on the characteristic peaks, and its intensity is only the background value, which can be used as segment points. Otherwise, the point S is on the next characteristic peak.
After calculating the derivative of the synthesized Raman spectrum as shown in Fig. 3, the derivatives are shown in Fig. 4. After a short calculation, the right boundary of 7 peaks can be obtained, which are denoted as A − G. Points A, D, and E cannot be used as segmentation points, because they are on the next characteristic peak. Besides, the last point G cannot be used as a segmentation point as well because there is no peak behind the last point. Respectively, starting from the points B, C, and F, the next 40th points are the segmentation points, recorded as S1, S2, and S3. Spectral data are successfully segmented.
The threshold at selection of segmentation points is determined by the derivative of the spectrum. The dash dot shadow area in Fig. 5 is enlarged, and the effect diagram is shown in the lower right corner. The threshold is set between two dash lines, which needs to be set manually according to different spectra. In the fitting of each segment, we select the data between the two segments point merely. The discontinuities at the segmentation points are caused, which is shown in Fig. 6(a). Then, the iterative optimization method is used, and the final analytical spectrum is shown in Fig. 6(b), whose 0 dif is set to a minimum of 0.005 because there is no noise.
Then, we add noise to the simulated Raman spectrum. According to Huang's theory [21], the readout noise and dark noise both belong to the white noise whose mathematic models are a simple Gaussian function, and the value of photoelectron noise is proportional to the root-mean-square of the signal intensity. Figure 7 shows that under noisy condition, the PPF method can still fit the background well. By comparison, we find the accuracy of PPF is higher than that of the method of I-ModPoly and piecewise linear fitting (PLF).
All programs are written using Matlab R2016b and run under Windows 7 on a personal computer (RAM 8G, CPU 3.20 GHz). The I-ModPoly and PLF programs are written with the description of previous reports [7,17].

Experimental results and discussion
In order to demonstrate the utility of the proposed baseline correction algorithm, we obtain real Raman spectral data on two minerals from the database about RRUFF™ Project [22]; the minerals contain different impurities and therefore different baselines. The minerals used are magnetite (R080025) and topaz (R060024). Figure 8 shows the baseline correction for magnetite (780 nm), whose baseline has multiple curvatures. Figure 8(a) shows the original spectra and the background, and Fig. 8(b) shows the final pure Raman spectra fitted by three different methods with the appropriate order. For better comparison, we conduct the fifth-order fitting and the seventh-order fitting for the I-ModPoly method. For the fifth-order polynomial fitting, the spectra within the range of 150 cm -1 -950 cm -1 are fitted well. But the fitting effect is poor in the range of 950 cm -1 -1300 cm -1 . For the seventh-order polynomial fitting, the background curve at the beginning of the spectrum will be greatly bent, resulting in additional peaks, and the intensity of the peaks are also reduced. The background can be well fitted at the end region of the spectrum. Besides, other orders will lead to a worse fitting. Meanwhile, we also carry out the PLF. The fitting effect is better than that of the I-ModPoly. But if the width of the peak is large and the background below is a curve, this method is not very good. However, for the PPF method, the different orders can be used in different areas because of the segmentation spectrum data. The overall background fitting effect is perfect. Figure 9 shows the baseline correction for topaz (532 nm), whose baseline curvature varies greatly. Figure 9(a) shows the original spectra and the background, and Fig. 9(b) shows the final pure Raman spectra fitted by three different methods with the appropriate order. From the original spectrum, we can find that the fitting order of background spectra within the range of 210 cm -1 -450 cm -1 is different from other areas obviously. Therefore, if the third-order polynomial fitting is selected, the background spectra within the range of 210 cm -1 -450 cm -1 cannot be fitted. If the ninth-order polynomial is used, the other background areas will have oscillation. Moreover, because PLF uses the derivation to find the extreme and selects the neighborhood minimum, leading to the problem of overlay peaks within the range of 250 cm -1 -310 cm -1 . The PPF method uses the fifth-order in the range of 210 cm -1 -450 cm -1 and the lower order in other parts. The method can overcome the existing problems and achieve good results.

Conclusions
In this paper, we propose a novel algorithm for the baseline correction of the Raman spectrum. This baseline correction method overcomes the problem that the background cannot be fitted perfectly by one polynomial curve, especially for the Raman spectrum whose curvature of background varies greatly. In the analysis and simulation, we establish a theoretical model of the PPF, put forward an idea of segmentation, and solve the discontinuity problem in the segmentation process. Compared with the experimental results of the I-ModPoly and PLF method, the results of the PPF method show a high accuracy and validity.