Background

The occurrence of organic micropollutants in surface water has raised concerns due to their harmful effects on aquatic organisms and the possible entry into human water supply [51]. Over the last two decades, the compound spectrum analysed has steadily increased, although the number of compounds included in routine monitoring programs is still rather low compared to those compounds known to be present in environmental samples [5].

The European Water Framework Directive (WFD; European Union 2000) is currently the main basis for surface water monitoring activities in European countries. It has a specific focus on European-scale Priority Substances, which are used to define the Chemical Status of a water body, together with varying lists of river basin specific pollutants (RBSPs). Thus, monitoring efforts are typically biased towards compounds relevant for larger-scale catchments. However, site-specific contamination might also contribute substantially to the likelihood that surface water bodies fail to meet environmental quality objectives, so that a further assessment is required (WFD, Annex II, Sect. 1.5, Assessment of Impact).

Water pollution due to household effluents treated in and emitted via municipal wastewater treatment plants (WWTPs) is expected to be composed of a more or less consistent, typical set of substances from major human activities including laundry care, home care, health care, personal care and food [10]. Concentrations are mainly impacted by the type of wastewater treatment used, the number of inhabitants served by the WWTP, and the effluent dilution in the receiving water [29, 37]. Additionally, micropollutants from agricultural use (dominated by pesticides) reach surface water via diffuse inputs from leaching, and in particular by surface runoff during rain events [18, 33]. Surface waters may also be contaminated by local inputs from industrial production sites, landfills, or accidental releases, either directly or via WWTPs. These inputs might contain highly specific substances or substances in much higher concentrations as compared to municipal wastewater (e.g. [14, 41, 43]).

To select the relevant compounds to be monitored among these thousands of chemicals, different approaches for prioritisation have been developed. These use either predicted environmental concentrations from consumption and emission models or measured concentrations and compare these to (eco)toxicological threshold values (e.g. [1, 11, 55, 57]). The outcomes of these methods depend strongly on the availability and quality of data, which might be very limited for certain compound classes [55]. Thus, they bias the spectrum towards well-known compounds, while unknown or unexpected compounds such as metabolites and by-products are hardly considered. Furthermore, such prioritisation approaches based on general emission scenarios will not rank site-specific chemicals among the top candidates.

With the introduction of liquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS), it became possible to screen water samples for a more comprehensive spectrum of chemicals, provided that these are amenable to the individual analytical steps of the method [21, 31, 53, 54]. As a consequence, data-driven approaches based on LC-HRMS have been put forward for the discovery and prioritisation of compounds by mining LC-HRMS data for the presence of a large number of known chemicals in so-called suspect screening approaches [17, 55, 56]. However, in a suspect screening based on a list of known chemicals, unknown or unexpected compounds such as transformation products and by-products are hardly considered, and very large compound lists have to be processed to cover all potentially relevant compounds.

Non-target screening (NTS) approaches can be applied without any prior knowledge of the compounds present, starting solely from the analytical data [31]. NTS has been successfully applied to prioritise so far unknown chemicals in rivers based on time series analysis [8, 22], their spatial trends in a river course [49], or in the context of fish mortality in a river [44]. It may be expected that NTS will be increasingly applied in water monitoring in Europe [6]. However, an exhaustive identification of all chemicals at all sites is not realistic.

In the present paper, we suggest a robust approach based on a semiautomated evaluation of non-target LC-HRMS data for the prioritisation of water bodies with site-specific contamination and the identification of the underlying chemicals. These could either be compounds which are found with the given detection limits at only one or a few sites, or compounds whose concentrations are several orders of magnitude higher at a particular site than at other sites or catchments due to local inputs. To express site-specific contamination in a single value, we calculate rarity scores for each detected peak in the dataset and prioritise sites with a high number of peaks with high rarity scores. Additionally, peak attributes with a diagnostic value such as isotope patterns, mass defects, and occurrence of homologue series are considered.

Using a set of 31 samples from the catchments of the Rivers Saale and Mulde, Germany, the prioritisation approach is demonstrated for three sites with a high number of site-specific peaks, which were further characterised and identified.

Materials and methods

Sites and sampling

Surface water was sampled at 31 sites from the catchments of the rivers Saale and Mulde, which are major tributaries of the river Elbe in Germany (Additional file 1: Figure S1). Sites were selected at rivers and streams of different size, downstream of the discharge of industrial and municipal wastewater treatment plant (WWTP) effluents, or upstream of the first WWTP. Samples were taken in 5 L aluminium containers and stored at 4 °C until extraction. Details on the sites are given in Additional file 1: Table S1.

Estimation of discharge and wastewater fraction

Discharge data for the sampling sites were obtained for each sampling date either from associated gauging stations or from the size of the flow profile of small streams and flow velocity measurements over the profile. Details are given by Hug et al. [26] and in Additional file 1: Table S2. The mean annual discharge was obtained from hydrological records for gauging stations. In general, the real wastewater fractions are likely larger than those calculated, as for WWTPs < 2000 person equivalents, no data were available and wastewater from decentralised treatment or untreated wastewater might contribute. In the study area, about 85% of the inhabitants are connected to centralised wastewater treatment, and particularly in rural areas on-site treatment, mainly through septic tanks, or discharge directly into surface waters can occasionally be found.

Chemical analyses

Information on chemicals used is given in Additional file 1: Sect. 1.1. Extraction of the samples was done by solid-phase extraction using multi-layer cartridges similar to those described elsewhere [28, 39]. Details are given in Additional file 1: Sect. 1.2. Within a previous study focusing on the ecotoxicological characterisation of these extracts by an in vitro assay, a target screening for 205 compounds was also carried out [26]. For this study, the extracts were stored at −20 °C and analysed within 3 months after extraction; the data evaluation reported here is based on the archived data. Extracts concentrated 625-fold were analysed by LC-HRMS using reversed-phase separation and electrospray ionisation (ESI) in positive (ESI+) and negative ion mode (ESI−). A nominal resolving power of 100,000 referenced to m/z 400 was used (for details, see Additional file 1: Sect. 1.3). MS/MS spectra were obtained in additional runs using data-dependent MS2 on a precursor ion list of the prioritised peaks with collision-induced dissociation (CID) and higher energy collisional dissociation (HCD) at different collision energies and a nominal resolving power of 15,000. Due to the biological analysis of the extracts [26], isotope-labelled internal standards were added only prior to LC-HRMS analysis and were used for quality control of peak detection in this study.

Automated peak detection

Raw HRMS full-scan chromatograms (m/z 100–1000) from ESI+ and ESI− runs were converted from profile to centroid mode and to .mzML format using ProteoWizard 3.0.6485 [30]. Afterwards, aligned peak lists containing 31 samples, one processing blank (i.e. from 10 mL of ultrapure water processed with the SPE cartridge as a sample) and one solvent blank (i.e. the type and amount of solvent used for eluting the cartridge processed as a sample) were generated by MZmine 2.20 [45]. We applied the steps mass detection, FTMS shoulder peak detection, chromatogram building, smoothing, peak deconvolution by local minimum search (minimum peak intensity 30,000 in ESI+ and 10,000 in ESI− mode), and alignment by the Join Aligner algorithm. Settings were slightly adjusted from Hu et al. [24] and are given in Additional file 1: Table S3. For further processing, ESI+ and ESI− peak lists with accurate m/z, retention time, peak intensity and area were exported from MZmine as .csv files. From the processing and solvent blanks, a combined blank peak list was generated in Excel by taking the maximum value of each peak across these two blanks. The sample and blank peak lists were imported into R, v3.3.0 [46], for further processing. All peaks with an area-to-height ratio > 50 were removed from the peak list to exclude signals coming from background noise (for details see [25]). Peaks with an intensity ratio < 10 between the surface water and the blank peak list were excluded from further analysis. Peak lists were exported for all individual samples as .csv files.
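The two filtering rules at the end of this workflow can be illustrated with a minimal Python sketch. It is not the R script used in the study: the `Peak` record, the function name `keep_peak` and the example values are hypothetical, and it assumes that each sample peak has already been matched to the combined blank peak list.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    mz: float            # accurate m/z
    rt: float            # retention time (min)
    height: float        # peak intensity in the sample
    area: float          # peak area in the sample
    blank_height: float  # maximum intensity in the two blanks (0.0 if absent)


def keep_peak(peak, max_area_to_height=50.0, min_sample_to_blank=10.0):
    """Return True if a peak passes the shape filter and the blank filter."""
    if peak.height <= 0:
        return False
    # Shape filter: broad, flat signals (area-to-height ratio > 50) are
    # treated as background noise.
    if peak.area / peak.height > max_area_to_height:
        return False
    # Blank filter: the sample intensity must be at least 10 times the blank.
    if peak.blank_height > 0 and peak.height / peak.blank_height < min_sample_to_blank:
        return False
    return True


# Hypothetical example peaks (values invented for illustration only).
peaks = [
    Peak(207.0121, 18.8, 5.0e6, 1.0e8, 0.0),    # sharp, not in blank: kept
    Peak(300.1000, 10.0, 4.0e4, 4.0e6, 0.0),    # area/height = 100: removed
    Peak(149.0233, 25.0, 2.0e5, 2.0e6, 1.0e5),  # sample/blank = 2: removed
]
filtered = [p for p in peaks if keep_peak(p)]
```

Peaks absent from the blank pass the blank filter by construction; how the original script treated such cases is an assumption here.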

Determination of rarity scores

From the aligned peak lists, m/z, retention times (defining a unique peak in the dataset) and the intensities in all samples were used. To identify peaks which occur at a small number of the studied sites with high intensity as compared to the other sites, we calculated a rarity score for each peak x (RSx) according to:

$$\begin{aligned} \text{RS}_{x} &= \frac{\text{maximum intensity across all sites}\,(x)}{\text{median intensity across all sites}\,(x)} \\ &\quad \cdot \frac{\text{total number of samples}}{\text{number of positive detects}} \end{aligned} \tag{1}$$

For the calculation of the median intensity, non-detects were replaced with the threshold intensity of the peak detection in MZmine (30,000 in ESI+ mode and 10,000 in ESI− mode).

The rarity scores combine a low frequency of occurrence of a peak in a dataset and its maximum signal intensity in relation to the median intensity in one single number. As for every value based on a statistical measure, the calculation of rarity scores is meaningful only for a larger dataset, although we cannot specify a minimum number of samples. While a value of 1 is the lowest possible one (with median = maximum intensity and present at all sites), the maximum values are given by the peak intensity range obtained by the instrument used and the number of samples.
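As a minimal sketch of Eq. (1), the following Python function computes the rarity score of a single peak from its intensities across all samples. The function name and the convention of encoding non-detects as `None` are assumptions made for illustration; the study itself used R.

```python
from statistics import median


def rarity_score(intensities, threshold):
    """Rarity score of one peak following Eq. (1).

    intensities: one value per sample, None for a non-detect.
    threshold:   peak detection threshold used to replace non-detects
                 when computing the median (30,000 in ESI+, 10,000 in ESI-).
    """
    detects = [i for i in intensities if i is not None]
    if not detects:
        raise ValueError("peak has no positive detects")
    filled = [threshold if i is None else i for i in intensities]
    return (max(detects) / median(filled)) * (len(intensities) / len(detects))


# A peak present at the same intensity in all 31 samples reaches the minimum of 1.
everywhere = [2.0e6] * 31
rs_everywhere = rarity_score(everywhere, 30_000)

# A peak found in only one of 31 samples at an intensity of 1e7 (ESI+ threshold).
single_site = [1.0e7] + [None] * 30
rs_single = rarity_score(single_site, 30_000)
```

In the second case the median equals the threshold, so the score is (10⁷ / 30,000) · 31, i.e. slightly above 10,000.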

An advantage of this univariate approach is that it can be applied even to datasets with a large fraction of non-detects or “zeroes” using the threshold signal intensity for the missing data. Such data may be difficult to handle by more sophisticated multivariate statistical methods, which are commonly used to prioritise peaks in metabolomics and occasionally in environmental studies (e.g. [50]). Such methods are either vulnerable to bias of the data or computation is time-consuming and requires expert knowledge [19, 20]. Common methods are the deletion of data records with missing values, imputation of single values (e.g. the detection limit or the half detection limit) or simple regression imputation. Deletion of records yields a small number of cases or variables and would remove such site-specific peaks occurring in maybe one single sample. Constant values infuse the data with unconditional and uncorrelated observations and thus bias the variance and correlation relationship as the distribution is changed. Regression imputation of perfectly correlated values is vulnerable to overestimation of the regression fit.

Determination of additional peak attributes

Mass defects (i.e. differences between the nominal mass and the accurate monoisotopic mass of an ion) were calculated using an R script, assuming that the mass defects span a range from M − 0.4 to M + 0.6 for a nominal mass M. Using the R package non-target version 1.8 [34], peak lists of all individual samples were screened for bounds of feasible isotope peaks (13C, 15N, 34S, 37Cl and 81Br) with a rule-based algorithm. Homologue series detection [35] was carried out for four or more consecutive mass differences corresponding to CH2, CH2O, C2H4O, C3H6O, C2H6SiO, CF2 and C2H4 units for singly and doubly charged ions. Peaks were finally grouped into components, i.e., the monoisotopic peak and its associated isotope or adduct peaks representing an individual chemical compound. Details of the R package non-target settings are given in Additional file 1: Table S4. Statistical analyses were conducted using R, v3.3.0 and Statistica 12 (Statsoft Inc.). For data visualisation, the R packages ggplot2 [61] and ggradar (https://rdrr.io/github/ricardo-bion/ggradar, last accessed 01/07/2019) were used.
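The mass defect assignment with a range from M − 0.4 to M + 0.6 can be sketched in Python as follows. The sign convention (accurate mass minus nominal mass) and the function name are assumptions for illustration, and the m/z of a singly charged ion is used directly in place of the mass.

```python
import math


def nominal_mass_and_defect(mz):
    """Assign a nominal mass M such that the defect lies in [-0.4, +0.6)."""
    nominal = math.floor(mz + 0.4)
    defect = mz - nominal  # negative for e.g. S-, Cl- or Br-rich ions
    return nominal, defect


# Ions mentioned in the text: the doubly charged naphthalenetrisulphonate ion
# (negative defect) and protonated HMMM (positive defect).
m_sulfo, d_sulfo = nominal_mass_and_defect(182.9594)
m_hmmm, d_hmmm = nominal_mass_and_defect(391.2294)
```

With this convention, a value such as m/z 206.70 is assigned to nominal mass 207 with a defect of −0.30 rather than to 206 with a defect of +0.70.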

Identification of prioritised peaks

For the identification of prioritised peaks, the most plausible molecular formulas were determined from the raw data files based on accurate masses and isotope patterns using the QualBrowser of Xcalibur (Thermo Scientific) by visual comparison of measured and simulated mass spectra. Possible structures were searched in compound databases (Chemspider, Royal Society of Chemistry [48]; Pubchem, NCBI [42]), and experimental MS/MS spectra were searched against MassBank [23]. Plausible candidate compounds were selected based on commercial/industrial relevance and additional literature search. For confirmation, reference standards were obtained if available. Confidence levels for the identification were assigned according to [52]. Marvin, InstantJChem and JChem for Excel (Chemaxon, Budapest, Hungary) were used for chemical structure drawing and handling and for the calculation of ion masses. Given that the scope of this study is to evaluate the potential of the approach to prioritise site-specific contamination, compound identification was based solely on a compound database search of molecular formulas, a MassBank search and confirmation of plausible hits with reference standards. We did not use additional approaches such as MS fragmentation prediction or retention time prediction to assist with the identification of candidate compounds.

Results and discussion

Prioritisation of site-specific contamination based on rarity scores

The distribution of rarity scores was similar for ESI+ and ESI− mode, with about 80% of the detected peaks showing values between 10 and 100. About 1% of peaks had values above 1000; thus, we prioritised these as representing site-specific compounds in our set of samples (Fig. 1). At this RS level, peaks down to a signal intensity of 10⁶ (which would for example correspond to a concentration of 20 ng/L of a well-ionising compound such as atrazine) might become classified as rare peaks if they occur in a low number of samples. These peaks would usually be missed if only signal intensity is used as a criterion for prioritisation. In contrast, many peaks with intensities between 10⁶ and 10⁷ occurring at many of the study sites get rarity scores below 50; thus, these are not considered for site-specific contamination. Obviously, peaks detected at low intensities at only a few sites might potentially also represent a site-specific contamination; however, this cannot be assessed based on the data, as we simply do not know whether (or at which level) the compounds are present in samples where we could not detect them.

Fig. 1

a Rarity scores in ESI+ and ESI− mode of all detected peaks (sorted according to increasing rarity scores), which is enlarged in b for the range of 95–100%

Figure 2 shows that most of the prioritised peaks with RS > 1000 actually occur in only one sample, with substantially smaller numbers in 2–7 samples in ESI+ and two or three samples in ESI−. Only one to three peaks with an RS > 1000 occur in 8–15 samples in ESI+, and in 4–10 samples in ESI−; these were mainly peaks with intensities > 5 × 10⁶ in one or a low number of samples and much lower intensities in other samples. These findings suggest that the rarity score is a suitable approach to prioritise site-specific contamination which is characterised either by high differences in intensity among the samples (as a proxy for concentration) or by a restricted frequency of occurrence. It should be noted, however, that the calculation of the RS might be problematic in rare cases: If a compound is detected in 15 out of 31 samples at peak intensities around 10⁸, but not in the other 16 samples, the resulting RS value would be about 25,000, as the median is at a threshold of 10,000. If the compound is detected in 16 out of 31 samples at peak intensities around 10⁸, but not in the other 15 samples, the resulting RS value would be around 5, as the median is about 10⁸.
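The jump between these two cases follows directly from Eq. (1). As an idealised worked example, assume that every detect has exactly the intensity 10⁸ and that non-detects are set to the ESI− threshold of 10⁴; real intensities scatter around 10⁸, so the values quoted above differ somewhat from these idealised numbers:

$$\text{RS}_{15/31} = \frac{10^{8}}{10^{4}} \cdot \frac{31}{15} \approx 2 \times 10^{4}, \qquad \text{RS}_{16/31} = \frac{10^{8}}{10^{8}} \cdot \frac{31}{16} \approx 2$$

With 15 detects the median of the 31 values falls on a non-detect (the threshold), whereas with 16 detects it falls on a detect, so a single additional detect changes the score by about four orders of magnitude.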

Fig. 2

Frequency of occurrence of peaks with rarity scores > 1000 in ESI+ and ESI− mode in the 31 samples

To prioritise sites with a specific contamination, we compared the numbers of peaks with rarity scores above threshold levels of 5000 and 1000, respectively, among all sites as shown in Fig. 3. The number of peaks with high rarity scores showed large differences among the samples. An RS value of 5000 was exceeded by up to ten compounds in ESI+ and 13 compounds in ESI− mode in one sample, and a value of 1000 by up to 91 compounds in ESI+ and up to 48 compounds in ESI− mode in individual samples, while other samples had not a single compound with rarity scores above these levels. In ESI+ mode, the sites with the largest number of rare peaks are B2, S8, DB, LN, H2 (RS > 5000) and DB, LN, LA, WE (RS > 1000). In ESI− mode, SP is clearly the site with the largest number of rare peaks (13 peaks with RS > 5000, 48 peaks with RS > 1000). Considerable numbers of peaks with RS > 1000 in ESI− mode were also detected at the sites Sol (14 peaks) and DB (11 peaks). The occurrence of detected peaks with rarity scores > 1000 in the individual samples is given in Tables S5 (ESI+) and S6 (ESI−) in Additional file 2.
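The site comparison amounts to counting, for each site, the peaks detected there whose rarity score exceeds a threshold. The following Python sketch illustrates this; the data layout (a list of score and site-set pairs), the function name and the site labels A to C are hypothetical and do not reproduce data from this study.

```python
from collections import Counter


def count_rare_peaks(peaks, threshold):
    """Count, per site, the detected peaks with a rarity score above threshold.

    peaks: iterable of (rarity_score, sites) pairs, where sites is the set
           of sites at which the peak was detected.
    """
    counts = Counter()
    for score, sites in peaks:
        if score > threshold:
            counts.update(sites)  # one count for every site showing the peak
    return counts


# Hypothetical peak table (scores and sites invented for illustration only).
peaks = [
    (12000.0, {"A"}),
    (6500.0, {"A", "B"}),
    (1500.0, {"A"}),
    (1200.0, {"B"}),
    (40.0, {"A", "B", "C"}),
]
above_1000 = count_rare_peaks(peaks, 1000)
above_5000 = count_rare_peaks(peaks, 5000)
ranking = [site for site, _ in above_1000.most_common()]
```

Sites without any peak above the threshold (site C here) simply do not appear in the counter, which corresponds to the samples without a single compound above these levels.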

Fig. 3

Comparison of the studied sites for the number of peaks above rarity scores of 1000 and 5000, respectively

Using peak attributes to further characterise site-specific contamination

For a further characterisation of site-specific contamination, we used the percentage of peaks potentially containing Cl, Br and S (as inferred from the isotopologue detection), the percentage of peaks with negative mass defect and the percentage of peaks being part of a homologue series, which are visualised in Fig. 4 for all sites. A detailed discussion of the performance of the peak attribute determination and consequences for the usage of these data is given in Additional file 1: Sect. 2.1.

Fig. 4

Comparison of the studied sites for percentages of peaks with S/Cl/Br isotope pattern (top) and percentages of peaks with negative mass defect (m/z < 500) and contained in homologue series (bottom)

For site B2, the large number of rare peaks coincides with the largest percentage of peaks with potential Cl and Br isotopologue peaks and with negative mass defects (Figs. 3, 4). Other sites with relatively high fractions of Cl or Br isotopologue peaks were H1 and H2, while relatively high fractions of S isotopologue peaks were found at B1, H1 and P2. The highest fractions of peaks in homologue series were found at sites B1, H1, M1 and M2. In ESI− mode, the largest number of rare peaks at site SP coincides with the largest percentage of peaks with potential S isotopologue peaks and with negative mass defects. Site WE showed a similarly large fraction of S isotopologue peaks and peaks with negative mass defects as site SP, but no peaks with particularly high rarity scores. As in ESI+ mode, site B2 had the largest percentage of Cl or Br isotopologue peaks, followed by site H1. The percentage of peaks contained in homologue series did not coincide with the occurrence of a large number of site-specific peaks, and in fact, no such peaks were among those with RS > 1000.

Characterisation and identification of site-specific compounds

The previous section revealed a range of sites with a specific contamination pattern and a high number of peaks with an RS > 1000, for which the generated peak attribute information can be used within the identification process. This is exemplified here for the sites SP (in ESI− mode; RS > 1000 for 48 peaks), DB (in ESI+ mode; RS > 1000 for 91 peaks) and H2 (in ESI+ mode; RS > 1000 for 47 peaks). A detailed overview of all peaks with rarity scores > 1000 in all samples is given in Tables S7 (ESI+) and S8 (ESI−) in Additional file 2.

Site SP (Spittelwasser downstream of Bitterfeld)

The Spittelwasser showed a large number of peaks with high RS values in ESI− mode, along with a large percentage of compounds containing sulphur and with negative mass defects based on the determination of peak attributes (Fig. 4). In contrast, peak numbers and numbers of compounds with high RS values in ESI+ mode were not noticeably high. As evident from Figure S4C (Additional file 1), many of these high-intensity S-containing compounds eluted around 3 min, others around 18–20 min retention time. The two most intense peaks at RT 18.8 and 19.7 min could be identified as 1- and 2-naphthalenesulphonic acid (m/z 207.0121, [M−H]⁻) based on a reference standard of 2-naphthalenesulphonic acid. Most of the other S-containing compounds were tentatively assigned as closely related compounds such as naphthalenedisulphonic acids (m/z 286.9688, [M−H]⁻) and naphthalenetrisulphonic acids (m/z 366.9256 for [M−H]⁻ and m/z 182.9594 for [M−2H]²⁻), hydroxy- and amino-naphthalenesulphonic acids (m/z 302.9637 and m/z 301.9796, respectively, both [M−H]⁻) as well as naphthylsulphate (m/z 223.0069, [M−H]⁻). These assignments were based on the match and similarity of MS/MS spectra and retention times to those of reference compounds. Details are given in Table S9 and Figures S5 to S11 (Additional file 1).
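The m/z values quoted above can be cross-checked against theoretical ion masses. The Python sketch below is only an illustration and not the Chemaxon-based calculation used in the study; the function names and the dictionary encoding of molecular formulas are assumptions, and the monoisotopic masses are standard values rounded to eight decimals.

```python
# Monoisotopic atomic masses (u) and the proton mass (u).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207117,
}
PROTON = 1.00727647


def neutral_mass(formula):
    """Monoisotopic mass of a neutral molecule, e.g. {"C": 10, "H": 8, ...}."""
    return sum(MONOISOTOPIC[element] * n for element, n in formula.items())


def deprotonated_mz(formula, charge=1):
    """m/z of the [M - nH]^(n-) ion formed by loss of `charge` protons."""
    return (neutral_mass(formula) - charge * PROTON) / charge


# Naphthalenesulphonic acid, C10H8O3S, as [M-H]-.
mz_mono = deprotonated_mz({"C": 10, "H": 8, "O": 3, "S": 1})

# Naphthalenetrisulphonic acid, C10H8O9S3, as [M-2H]2-.
mz_tri_2 = deprotonated_mz({"C": 10, "H": 8, "O": 9, "S": 3}, charge=2)
```

Both calculated values agree with the measured m/z 207.0121 and m/z 182.9594 to within about 1 mDa, which is consistent with the assigned molecular formulas but does not by itself distinguish positional isomers.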

Thus, our non-target screening approach revealed that derivatives of naphthalenesulphonic acids are important water contaminants in the Spittelwasser. The occurrence of naphthalene sulphonic acids and their derivatives in high concentrations was demonstrated for textile and tannery wastewater [9], stemming from their use as dye precursors and use in dyeing processes, and in landfill leachates [47]. In case of the Spittelwasser, it is likely that the found compounds are a legacy contamination related to the former dye (and maybe other chemicals) production at Bitterfeld. The contamination of sediments of the Mulde river and its tributary Spittelwasser with arylsulphonic acid derivatives and alkylsulphonic acid aryl esters was previously recognised [4, 15, 16]. Sediments of the Spittelwasser and lower Mulde are also heavily contaminated by persistent chlorinated compounds from the former chemical industry [15]. However, we did not detect a large percentage of chlorinated compounds in the Spittelwasser water sample, suggesting that their occurrence is limited to compounds with a high affinity to sediments and/or a poor ionisation by ESI. The peaks with high RS values at the site SP were not detected at any other site studied, except for one peak of a naphthalenedisulphonic acid found at about 100-fold lower intensity at site S1 (m/z 182.9594, RT 2.3 min).

Site DB (Dorfbach Niederschindmaas)

The Dorfbach is a small brook receiving wastewater from the WWTP of a large car manufacturer with 8000 employees, which also treats municipal wastewater of about 3000 inhabitants from adjacent settlements. In ESI+ mode, this site shows the largest number of peaks with RS values above 1000 (Fig. 3). The most intense rare peak in ESI+ mode at m/z 391.2294 could be identified as hexa(methoxymethyl)melamine (HMMM) based on a reference standard (Additional file 1: Table S10 and Figure S13). The full-scan spectrum of HMMM shows a significant in-source fragmentation resulting in the loss of one to three CH4O from the protonated molecule, which was partially assigned the other way round, as a methanol adduct of the fragment, by the non-target package (Additional file 1: Figure S14). Without a reference standard, it is indeed impossible to distinguish in-source fragmentation from methanol adduct formation in this case. Several other high-intensity peaks showed a similar full-scan mass spectral pattern (mass differences corresponding to CH4O losses), and molecular formulas suggested compounds related to HMMM. The same peaks were detected in an old HMMM reference standard stored for more than 2 years at 4 °C, where they obviously stem from hydrolysis (Additional file 1: Figure S13). Although these compounds showed a low fragment ion intensity (typically in the 10⁴ intensity range despite 10⁷ precursor ion intensity) resulting in poor MS/MS spectra (Additional file 1: Figure S14), they were tentatively identified as penta- and tetra(methoxymethyl)melamine and O-demethylated HMMM. HMMM is an important precursor of melamine–formaldehyde resins used for durable coatings such as in beverage cans and car paint finishes. Thus, the car manufacturer releasing treated wastewater into the Dorfbach is a plausible source. The technical product contains a mixture of monomers and oligomers of HMMM as well as not fully methyl-methoxylated melamine (US EPA [58]). Thus, the observed demethylated and demethoxymethylated derivatives might stem either from transformation or from these technical mixtures. HMMM and related compounds have been previously identified in wastewater and surface water [2, 3, 44]. A widespread presence of HMMM in German rivers was reported by Dsikowitzky and Schwarzbauer [13] with a huge temporal and spatial variation and maximum concentrations of up to 880 ng/L in the Mulde river. Furthermore, we detected several “rare” and high-intensity peaks with molecular formulas and retention times similar to those of HMMM (e.g. C25H28O8N4 at 22.8 min, C12H18O4N6 at 21.6 min, C25H33O5N5 at 20.3 min, C22H25O8N6 at 22.9 min, Table S10), which could be caused by the presence of similarly substituted melamines used for the production of melamine–formaldehyde resins [12]. The co-occurrence of several melamine derivatives coincides with results from Peter et al. [44], who detected a (methoxymethyl)melamine “compound family” in urban stormwater runoff in the USA.

Most of these compounds could also be detected at other studied sites (Fig. 5). At sites LN and S6, peak heights were about two orders of magnitude lower than at site DB, pointing towards some specific sources there. At the remaining sites, fewer compounds, mainly penta(methoxymethyl)melamine (PMMM) and tetra(methoxymethyl)melamine, were found, and peak heights were in general about three orders of magnitude lower, confirming a widespread occurrence of this compound class in surface waters.

Fig. 5

Occurrence and peak intensities at all the 31 studied sites of hexa(methoxymethyl)melamine (HMMM), likely transformation products and related compounds detected as site-specific contaminants at site DB (PMMM: penta(methoxymethyl)melamine; TMMM: tetra(methoxymethyl)melamine; O-des: O-desmethylated compound). Details on the tentative identification of the compounds are given in Additional file 1

Site H2 (Holtemme downstream of WWTP Silstedt)

Site H2 on the Holtemme river receives municipal wastewater from one relatively large WWTP (80,000 person equivalents) serving the town of Wernigerode and surrounding villages, and shows the second largest number of peaks with RS values > 1000 in ESI+ mode. The twenty most intense peaks with RS > 1000 and their tentative identification and confirmation are shown in Additional file 1: Table S11 and Figures S16–S19.

Among these compounds were 7-diethylamino-4-methylcoumarin, 7-ethylamino-4-methylcoumarin and 7-amino-4-methylcoumarin, which were recently identified at this site as causative compounds for the observed anti-androgenicity [40]. The latter is used as an optical brightener (or fluorescent whitening agent) for textiles and as a constituent in cleaning detergents and washing powders [27] and has not been found anywhere else in surface water or wastewater [36]. We could detect one to all three of these compounds at six other sites at much lower peak heights (Fig. 6). While at sites B2 and S8, which are located further downstream of site H2 at the Bode and Saale, respectively, the occurrence might be related to the input upstream of H2, an input apparently also occurs into the Solgraben (site Sol), the Chemnitz (C1), the Alte Luppe (LA) and the Spittelwasser (SP).

Fig. 6

Occurrence and peak intensities at all the 31 studied sites of the most intense compounds with RS values > 1000 at site H2. Details on the tentative identification of the compounds are given in Additional file 1

Furthermore, we could identify the antipsychotic drugs melperone and pipamperone and the antibiotic clarithromycin (included in the target screening compound set of [26]). All compounds could be confirmed by reference standards. The high concentrations of pipamperone and melperone (estimated > 1 µg/L) and of clarithromycin of more than 5 µg/L [26] are not likely to stem from medical use. Pipamperone was previously analysed by Van De Steene et al. [59] and was detected in WWTP effluents typically at concentrations below 40 ng/L and in surface water below 20 ng/L. However, the authors found high pipamperone concentrations of up to 36 µg/L in the effluent of a WWTP treating wastewater from pharmaceutical and chemical industries. We did not detect pipamperone at any other site, while melperone was detected at six other sites at levels at least 20-fold lower. Clarithromycin concentrations in WWTP effluents are typically in the range of 50–500 ng/L [38, 53, 60], and the wastewater fraction in the Holtemme was calculated to be about 27%. Clarithromycin, like pipamperone, was not detected at any other site. Thus, the most probable source of these compounds is the production by a pharmaceutical company located in Wernigerode. Emissions from drug manufacturing have been recognised as a significant source of pharmaceuticals at specific sites [7, 14, 25].
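The argument for clarithromycin can be made explicit with a simple dilution estimate. The Python sketch below assumes that typical municipal effluent is merely diluted by river water, without in-stream attenuation or additional sources; the function and variable names are hypothetical, and the input values are those quoted above.

```python
def expected_river_concentration(effluent_ng_per_l, wastewater_fraction):
    """River concentration expected from dilution of WWTP effluent alone."""
    return effluent_ng_per_l * wastewater_fraction


WASTEWATER_FRACTION = 0.27     # calculated wastewater fraction in the Holtemme
TYPICAL_EFFLUENT = (50, 500)   # typical clarithromycin range in effluents (ng/L)
OBSERVED_MIN = 5000            # observed clarithromycin: more than 5 ug/L (ng/L)

expected_low = expected_river_concentration(TYPICAL_EFFLUENT[0], WASTEWATER_FRACTION)
expected_high = expected_river_concentration(TYPICAL_EFFLUENT[1], WASTEWATER_FRACTION)

# Factor by which the observed level exceeds the upper end of the expectation.
excess_factor = OBSERVED_MIN / expected_high
```

Under these assumptions, municipal wastewater would explain roughly 14 to 135 ng/L in the river, so the observed concentration exceeds even the upper estimate by more than a factor of 30.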

Metoprolol, N-methyl-1-dodecylamine and tributylamine were detected at similar peak intensities at three, four and eleven other sites, respectively, but apparently not at the remaining sites, which resulted in high RS values for these compounds, although previous studies indicate a ubiquitous occurrence of metoprolol in the aquatic environment [26]. The manual re-evaluation of these compound peaks in Xcalibur indicated that this finding was based on artefacts related to peak picking with MZmine, as all three compounds were present in most samples with varying intensity. However, the peaks of these compounds, all of which bear an aliphatic amino group, were typically more than 1.5 min wide with significant tailing, which hampered the peak picking and resulted in their misclassification as site-specific contaminants. Note that in this case a false negative in peak detection resulted in a false positive assignment of a site-specific peak.

Conclusions

A new approach was proposed to identify and prioritise samples with a significant site-specific contamination based on LC-HRMS non-target screening data without any prior knowledge of the chemicals present. It is based on a simple calculation of rarity scores (RS) for each detected peak and does not require the dataset to fulfil any prerequisites for more sophisticated statistical approaches. The data processing steps with a final prioritisation of site-specific peaks and determination of peak attributes can be accomplished within 3–4 h using freely available software and can be applied by users less experienced in non-target screening or statistical data evaluation. The obtained rarity scores can be used both for ranking compounds for identification and for ranking sites with a large number of such peaks for further investigation. As the magnitude of RS values depends on the instrument used and the dataset itself, it is not possible to set a general threshold value for site-specific peaks; the selection of peaks should instead be guided by the ranking and the occurrence among the different sites, and, very pragmatically, by the time which can be spent on the subsequent identification. This second step is by far more laborious and time-consuming, although automated workflows including MS/MS fragmentation prediction and MS library search have been established (e.g. [8]). Some degree of expert knowledge is still required, but efforts can be focused on the relevant compounds and are supported by the automated annotation of isotopologues, homologue series and mass defects.

LC-HRMS instrumentation is currently becoming more frequently available also at authorities carrying out regulatory monitoring (e.g. along the Rhine river; [22, 32]). The proposed approach to detect site-specific contamination can be used by such authorities in investigative monitoring of catchments and water bodies which fail to meet quality criteria, while monitoring of priority substances and RBSPs does not indicate a chemical pollution issue. It may also be directly applied for locations where a specific contamination is suspected rather than using targeted methods focusing on a limited set of compounds. This will significantly reduce the risk of overlooking possibly hazardous chemicals (including unknowns), for which detailed investigations on sources and toxicity can follow. Ultimately, it could guide compound- and source-specific mitigation measures at sites where problematic compounds are emitted.