Development of a scoring parameter to characterize data quality of centroids in high-resolution mass spectra

High-resolution mass spectrometry is widely used in many research fields, allowing for accurate mass determinations. In this context, high-resolution profile-mode mass spectra are commonly reduced to centroided data, which many data processing routines rely on for further evaluation. Yet information on the peak profile quality is not conserved in those approaches, so describing the reliability of results is almost impossible. We overcome this limitation by developing a new statistical parameter called the data quality score (DQS). For the DQS calculations, we performed a fast and robust regression analysis of the individual high-resolution peak profiles and considered error propagation to estimate the uncertainties of the regression coefficients. We successfully validated the new algorithm against the vendor-specific algorithm implemented in ProteoWizard's msConvert. Moreover, we show that the DQS is a sum parameter associated with centroid accuracy and precision. We also demonstrate the benefit of the new algorithm in nontarget screenings, as the DQS prioritizes signals that are not influenced by non-resolved isobaric ions or isotopic fine structures. The algorithm is implemented in the Python, R, and Julia programming languages and supports multi- and cross-platform downstream data handling. Supplementary Information: The online version contains supplementary material available at 10.1007/s00216-022-04224-y.


Sampling, sample preparation and instrumental analysis
To validate the developed centroiding algorithm, example samples were processed. The samples were grab samples taken at the effluent of the wastewater treatment plant in Warburg (Stadtwerke Warburg GmbH, Warburg, Germany). Details can be found in Itzel et al. (2020) [1]. A solid-phase extraction was performed for sample preparation within 48 h after sampling. The cartridges (150 mg, 6 mL, Oasis HLB, Waters, Eschborn, Germany) were conditioned with 2 x 5 mL methanol and equilibrated with 2 x 5 mL water. After drying under vacuum, the loaded cartridges were stored at -18 °C until further analysis. Elution was performed with 5 x 5 mL MTBE. After elution, the samples were measured within 48 h. Liquid chromatography was performed using a Dionex UltiMate 3000 HPLC system (Thermo Scientific, Bremen, Germany) with a gradient method on an XSelect HSS T3 column (2.1 mm x 75 mm, 3.5 µm particle size, Waters, Milford, MA, USA). The mobile phase consisted of eluent A: LC-MS grade water (Th. Geyer GmbH & Co. KG, Renningen, Germany) + 0.1 % formic acid (99 % purity, VWR Chemicals, Darmstadt, Germany), and eluent B: methanol (LC-MS grade, Th. Geyer GmbH & Co. KG, Renningen, Germany) + 0.1 % formic acid. The eluent gradient is displayed in figure S1. The injection volume was 20 µL and the flow rate was 0.35 mL/min. The ion source was a heated electrospray ionization source, whose operating parameters are specified in table S1. The scan settings of the Orbitrap mass spectrometer are given in table S2. MS2 spectra were not included in this data evaluation; they do, however, influence the instrument cycle time.

xcms parameters
For processing the data file with xcms, the parameters from table S4 were chosen. All other parameters were kept at their default settings. The processing was performed in R using the package xcms.

Derivation of the Data Quality Score

Orbitrap HRMS peak profiles
The error ∆Â in the Gaussian peak area Â is calculated by propagation of errors from β₀, β₁, β₂, the parameters of the weighted second-order linear regression, and ∆β₀, ∆β₁, ∆β₂, their respective standard errors. The Data Quality Score (DQS) introduced in the publication is estimated from the relative error of the peak area, ∆Â/Â. The functional relationship between the two quantities is shown in equation 8 and figure S2, where ∆Â is the error in the Gaussian peak area, Â is the Gaussian peak area, and erf(x) is the Gaussian error function. To help the user select a DQS threshold, four qualitative categories are introduced in table S5.
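The regression and error-propagation steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the published implementation: the regression weights (I²), the use of the standard errors without covariance terms, and the mapping DQS = 1 − erf(∆Â/Â) are assumptions made here, because the text names only a weighted regression, the standard errors, and the error function. The function name `gaussian_area_dqs` is hypothetical.

```python
import math
import numpy as np

def gaussian_area_dqs(mz, intensity):
    """Fit ln(I) = b0 + b1*u + b2*u**2 and return (area, rel_error, dqs)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # Centre and scale the m/z axis so the design matrix is well conditioned.
    centre = mz[np.argmax(intensity)]
    scale = np.max(np.abs(mz - centre))
    u = (mz - centre) / scale

    # Weighted linear regression on the log-transformed intensities.
    # ASSUMPTION: weights of I**2; the text only says "weighted".
    y = np.log(intensity)
    w = intensity ** 2
    X = np.column_stack([np.ones_like(u), u, u ** 2])
    xtwx = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(xtwx, X.T @ (w * y))

    # Standard errors of the coefficients from the weighted residuals.
    dof = len(u) - 3
    s2 = np.sum(w * (y - X @ beta) ** 2) / dof
    se = np.sqrt(s2 * np.diag(np.linalg.inv(xtwx)))

    b0, b1, b2 = beta
    if b2 >= 0:
        raise ValueError("profile is not peak-shaped (b2 must be negative)")

    # Gaussian area in scaled units, converted back to m/z units.
    area = math.exp(b0 - b1 ** 2 / (4 * b2)) * math.sqrt(-math.pi / b2) * scale

    # Partial derivatives of ln(area) with respect to b0, b1, b2.
    grad = np.array([1.0, -b1 / (2 * b2), b1 ** 2 / (4 * b2 ** 2) - 1 / (2 * b2)])

    # First-order error propagation using the standard errors only
    # (ASSUMPTION: covariance terms are neglected).
    rel_error = float(np.sqrt(np.sum((grad * se) ** 2)))

    # ASSUMPTION: DQS = 1 - erf(relative area error).
    dqs = 1.0 - math.erf(rel_error)
    return area, rel_error, dqs
```

Under these assumptions, a noise-free Gaussian profile with seven points gives a relative area error near zero and a DQS near 1, while a distorted profile gives a lower DQS.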

TOF HRMS peak profiles
The routine described above is limited to Gaussian-shaped peaks as observed, e.g., in Orbitrap-MS. However, our algorithm can also be extended to asymmetric peak profiles, e.g., from TOF data. Because TOF peak profiles are usually asymmetric and the Gaussian model does not account for asymmetry, the peak model must be adapted to apply the concept of data quality estimation to these instruments. We therefore decided to apply the Bi-Gaussian model, which fits the left and the right half of a peak independently and thus allows asymmetry [2]:

\[ \hat{I}(x) = \begin{cases} \hat{I}_{01} \exp\left(-\dfrac{(x - x_0)^2}{2\sigma_1^2}\right), & x \le x_0 \\ \hat{I}_{02} \exp\left(-\dfrac{(x - x_0)^2}{2\sigma_2^2}\right), & x > x_0 \end{cases} \]

with Î₀₁ and Î₀₂ the left and right half intensities and σ₁ and σ₂ the left and right half-widths. A big advantage of the Bi-Gaussian over other peak models that consider asymmetry is that it can be linearized following the procedure of Caruana et al. (1986), making the algorithm computationally very efficient [3]:

\[ \ln \hat{I}(x) = \begin{cases} \ln \hat{I}_{01} - \dfrac{1}{2\sigma_1^2}(x - x_0)^2, & x \le x_0 \\ \ln \hat{I}_{02} - \dfrac{1}{2\sigma_2^2}(x - x_0)^2, & x > x_0 \end{cases} \tag{10} \]

Eq. 10 describes two halves of a parabola of the form y = a + b·x_c² with x_c = x − x₀. Essentially, the procedure is analogous to that for symmetric peaks, except that the polynomial function has fewer parameters, and the effect on the statistical degrees of freedom must be taken into account. Furthermore, an asymmetry factor γ can be calculated. Two exemplary asymmetric peak profiles from TOF-MS, one with high and one with low DQS, are given in figure S3.
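The linearized half-parabola fit, y = a + b·x_c² on each side of the apex, can be illustrated with a short Python sketch. It is only an illustration under assumptions: the function name `fit_bigaussian` is hypothetical, the apex position `x0` is taken as known, the fit is unweighted, and the degrees of freedom are simply counted as points minus two fitted parameters per half. The asymmetry factor γ is not computed because its definition is not given in this text.

```python
import numpy as np

def fit_bigaussian(x, intensity, x0):
    """Fit ln(I) = a + b*(x - x0)**2 separately left and right of x0."""
    x = np.asarray(x, dtype=float)
    y = np.log(np.asarray(intensity, dtype=float))
    xc2 = (x - x0) ** 2

    result = {}
    for side, mask in (("left", x <= x0), ("right", x > x0)):
        # Straight-line fit of y against x_c**2: slope b, intercept a.
        b, a = np.polyfit(xc2[mask], y[mask], 1)
        if b >= 0:
            raise ValueError(f"{side} half is not peak-shaped")
        result[side] = {
            "height": float(np.exp(a)),                 # half intensity
            "sigma": float(np.sqrt(-1.0 / (2.0 * b))),  # half-width
            "dof": int(mask.sum()) - 2,                 # ASSUMPTION: n - 2
        }
    return result
```

Applied to a synthetic profile with a narrow left half and a wider right half, the two fits return different `sigma` values, which is the asymmetry that a single Gaussian cannot represent.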

Influence of low quality centroids on the construction of regions-of-interest (ROI) with xcms
In this context, we considered the wastewater dataset, centroided conventionally using msConvert, and used xcms for ROI calculation. The xcms parameters are given in table S4. Afterwards, all centroids from our new algorithm were assigned to the ROIs from the conventionally evaluated data. Centroids that could not be assigned to any ROI were defined as "outside". A quantile-quantile plot (QQ-plot) comparing the distribution of DQS values within and outside the obtained ROIs is given in figure S4. The QQ-plot shows a deviation from the black bisecting line, indicating different DQS distributions within and outside the ROIs. The lowest DQS within the ROIs is 0.48. The quantile values do not coincide, resulting in a slope greater than one, which indicates differences in data spread. Within the ROIs, centroids show an increased DQS. A possible explanation is that centroids with low DQS could not be meaningfully assigned to an ROI because of their potentially high mass deviation. Lower-DQS centroids tend towards the edge of the mass extraction window. Therefore, the DQS can be used as a weight when calculating the ROIs' average masses to improve mass accuracy compared to the conventional unweighted average masses.
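The idea of a DQS-weighted ROI mass can be written down in a few lines of Python. This is a sketch only: the function name `roi_mean_mz` and the example values are hypothetical, and using the DQS directly as a linear weight is an assumption, since the text does not specify the weighting scheme.

```python
import numpy as np

def roi_mean_mz(mz, dqs=None):
    """Average m/z of the centroids in one ROI, optionally weighted by DQS."""
    mz = np.asarray(mz, dtype=float)
    # With dqs=None this is the conventional unweighted mean.
    return float(np.average(mz, weights=dqs))

# Hypothetical ROI: two reliable centroids and one low-DQS centroid
# lying near the edge of the mass extraction window.
mz_values = [300.0000, 300.0002, 300.0030]
dqs_values = [0.9, 0.9, 0.2]

unweighted = roi_mean_mz(mz_values)
weighted = roi_mean_mz(mz_values, dqs_values)
```

In this hypothetical case the weighted mean stays closer to the two high-DQS centroids, because the outlying centroid contributes less to the average.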
Simulating the effect of peak-to-peak resolution on Data Quality Score
In the following, we analyze how the Data Quality Score is affected by the presence of partially or non-resolved neighboring peak profiles in HRMS. For this purpose, two Gaussian peaks, a left main peak (I) and a right secondary peak (II), are simulated according to the peak shape observed in Orbitrap-MS. The positions of the main and secondary peak are x_{0,1} and x_{0,2}. The peak-to-peak resolution R_p is defined from these positions and the peak widths (equation 19); equation 19 is simplified by assuming an equal width σ₁ for both peaks, because both originate from nearly the same m/z range. In the simulation, the two Gaussian peaks are generated with varying R_p. In principle, two cases can be distinguished: either the two peaks are separated, indicated by a local minimum between them, or they are not separated. In the second case, the two peaks cannot be considered separately. This differentiation by means of local minima is implemented in the same form in our algorithm. The height of the left peak (I) is set to 1, and the height of the second peak (II) is varied in the range 0-1. The number of data points per peak was chosen based on experience with Orbitrap-MS, i.e., 7 points. The peaks are fitted independently when a local intensity minimum is observed in the sum of their signals, in analogy with the method of the study. Otherwise, their sum signal is considered as one peak. Figure S5 shows the simulation results (n = 5000), i.e., the DQS of peak I as a function of R_p and the height of peak II. No noise is incorporated in this simulation, which leads to higher DQS values than in real measurement data.

Figure S5: Simulation of the influence of peak-to-peak resolution and the height of the second peak on the Data Quality Score (n = 5000). The height of the main peak is set to 1, while the height of the secondary peak is varied in the interval [0,1]. B: Relation between DQS and peak-to-peak resolution for a secondary peak of height 1.

As seen in subfigures A and B, the DQS increases with increasing peak-to-peak resolution in a step-wise manner. This can be attributed to a change in the statistical degrees of freedom, on which the DQS depends, as the separation of the two peaks improves.
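The separation criterion used in the simulation, a local minimum in the sum signal of the two peaks, can be sketched in Python as follows. The sketch makes several assumptions: the function names are hypothetical, the peak distance is given directly in units of σ rather than as R_p (whose defining equation is not reproduced in this text), and the sampling grid is a uniform grid chosen for illustration rather than the 7 points per peak of the study.

```python
import numpy as np

def sum_profile(separation, height_2, sigma=1.0, step=0.25):
    """Sum signal of peak I (height 1, at 0) and peak II (at `separation`)."""
    # ASSUMPTION: uniform grid with spacing step*sigma, 4 sigma beyond each peak.
    dx = step * sigma
    x = np.arange(-4.0 * sigma, separation + 4.0 * sigma + dx / 2, dx)
    peak_1 = np.exp(-0.5 * (x / sigma) ** 2)
    peak_2 = height_2 * np.exp(-0.5 * ((x - separation) / sigma) ** 2)
    return x, peak_1 + peak_2

def valley_indices(y):
    """Indices of interior points that are lower than both neighbors."""
    inner = y[1:-1]
    is_min = (inner < y[:-2]) & (inner < y[2:])
    return np.flatnonzero(is_min) + 1

# Overlapping peaks: no local minimum, so the sum is treated as one peak.
x_a, y_a = sum_profile(separation=1.0, height_2=1.0)
merged = valley_indices(y_a)

# Well-separated peaks: a local minimum appears and splits the profile.
x_b, y_b = sum_profile(separation=4.0, height_2=1.0)
split = valley_indices(y_b)
```

Each segment between valleys could then be passed to the peak regression on its own, which is where the reduced number of points per peak lowers the statistical degrees of freedom.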
For peaks at the same position (R_p = 0), the simulation results in DQS values close to 1 independent of the second peak height, as the sum signal has an ideal shape. If the position and height of the second peak are altered, the DQS is affected: at first, the DQS decreases slowly with distance as the peak shape becomes distorted. At a critical distance, a new local minimum appears, which is then used to split the peaks. As a result, the statistical degrees of freedom of peak I decrease drastically. At the same time, the peak distortion and hence the deviation from the Gaussian model increase. This leads to an abruptly lowered DQS. After a global minimum, the DQS increases with R_p because the reduced peak overlap reduces the mutual influence of the peaks. The model fit improves and the statistical degrees of freedom increase, so that the resulting DQS approaches 1 in a step-wise manner. This step-wise behavior is shown for a secondary peak of height 1 in figure S5B. For high R_p values, DQS values close to 1 are observed, as the Gaussian peaks do not affect each other. Overall, the higher the intensity of the secondary peak, the stronger the distortion and the lower the DQS of peak I. The simulation shows that our DQS is suitable for estimating the quality of peak profiles, since the influence of non-resolved neighboring peaks, e.g., isotopic fine structures or isobaric analytes, on centroid reliability can be quantified.

[Figure: axes "log peak height (own algorithm)" and "log intensity (msConvert)"]

Relationship between full-width half-maximum (FWHM) and standard deviation of the Gaussian peak
To determine the FWHM, the task is to find the position x_h where the intensity is half the maximum intensity. The Gaussian peak can be described by the following function:

\[ \hat{I}(x) = \hat{I}_0 \exp\left(-\frac{(x - x_0)^2}{2\sigma^2}\right) \]

The intensity is maximal at the center of the Gaussian peak, x_0:

\[ \hat{I}(x_0) = \hat{I}_0 \]

Thus, the intensity at the position x_h is:

\[ \hat{I}(x_h) = \frac{\hat{I}_0}{2} = \hat{I}_0 \exp\left(-\frac{(x_h - x_0)^2}{2\sigma^2}\right) \]

Simplifying and transforming the equations yields:

\[ x_h = x_0 \pm \sigma\sqrt{2\ln 2} \]

Because the Gaussian peak is symmetric, the FWHM follows as:

\[ \mathrm{FWHM} = 2\sigma\sqrt{2\ln 2} \approx 2.355\,\sigma \]
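The standard relation for a Gaussian peak, FWHM = 2σ·sqrt(2 ln 2), can be checked numerically with a small Python sketch. The function names are hypothetical, and the numerical variant measures the width at half maximum on a dense grid, so it agrees with the closed form only up to the grid spacing.

```python
import math
import numpy as np

def fwhm_from_sigma(sigma):
    """Closed-form FWHM of a Gaussian with standard deviation sigma."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma

def fwhm_numeric(sigma, x0=0.0, n=200001):
    """Width of the region where a sampled Gaussian is at least half its maximum."""
    x = np.linspace(x0 - 5.0 * sigma, x0 + 5.0 * sigma, n)
    y = np.exp(-0.5 * ((x - x0) / sigma) ** 2)  # maximum intensity is 1
    above = x[y >= 0.5]
    return float(above[-1] - above[0])
```

Comparing both functions for the same σ gives a quick consistency check when converting between the half-width of a fitted peak and an instrument's stated resolution.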