Which particles to select, and if yes, how many?

Micro- and nanoplastic contamination is a growing concern for environmental protection and food safety. Therefore, analytical techniques need to produce reliable quantification to ensure proper risk assessment. Raman microspectroscopy (RM) offers identification of single particles, but to ensure that the results are reliable, a certain number of particles has to be analyzed. For larger MP, all particles on the Raman filter can be detected, errors can be quantified, and the minimal sample size can be calculated easily by random sampling. In contrast, very small particles might not all be detected, demanding a window-based analysis of the filter. A bootstrap method is presented to provide an error quantification with confidence intervals from the available window data. In this context, different window selection schemes are evaluated, and there is a clear recommendation to employ random (rather than systematically placed) window locations with many small rather than few larger windows. Ultimately, these results are united in a proposed RM measurement algorithm that computes confidence intervals on-the-fly during the analysis and, by checking whether given precision requirements are already met, automatically stops if an appropriate number of particles are identified, thus improving efficiency. To provide quality control in MP quantification by Raman microspectroscopy, a window subsampling and bootstrap protocol is presented, which can provide confidence intervals that enable the assessment of the reliability of the data. This is brought together with a proposed on-the-fly algorithm that assesses the precision during the measurement and stops at the optimal point.

Supplementary Information The online version contains supplementary material available at 10.1007/s00216-021-03326-3.

6 Supplementary Information

6.1 Random Sampling on the Complete Filter

If every particle is known, then the spatial structure is irrelevant; only the set of particles is relevant. By using random sampling, the error can be controlled and minimized [2]. Thus, random sampling is to be preferred whenever possible. The following section will lay out the formal theory of the confidence interval and sample size calculation for random sampling. A brief summary is given in the Box in Figure 1.
In the statistical treatment of this problem, a perfect technical implementation of Raman measurement and particle detection shall be assumed, although this implementation is itself still a topic of ongoing, diligent work.

6.1.1 Formalizing the Estimation
Denote the number of particles and of plastic particles on the filter as N and N_p, respectively. Accordingly, the ratio of plastic particles over all particles on the filter is r = N_p / N.
Only the total number N of particles on the filter is available, but the number N_p of plastic particles on the filter is of interest, which can be calculated by N_p = r · N (equation (11)). The ratio r is unknown, and usually it is not possible to identify the type of every particle on the filter. Therefore, a subset of all particles on the filter should be selected for RM. Denote the number of particles and plastic particles within this subset as S and S_p, respectively. Both of these quantities are known after RM analysis, and the ratio of plastic particles over all particles within this subset can be calculated as r_S = S_p / S (equation (13)). This ratio r_S can then be used as an estimate for the ratio r on the filter. In that, we denote the estimate as r̂ = r_S (an estimate will always be indicated with a hat within this section).
By selecting the subset randomly, this estimate r̂ is a random variable, and by plugging this estimate into equation (11), the quantity of interest also becomes a random variable: N̂_p = r̂ · N.

6.1.2 Confidence Interval - Theory

Naturally, this estimate N̂_p might be erroneous, and its error can be quantified by considering its standard deviation sd(N̂_p), which in turn can be derived from the standard deviation sd(r̂) of the (random) ratio estimate: sd(N̂_p) = N · sd(r̂). In that, it suffices to assess the standard deviation of the ratio estimate r̂ in order to then quantify the error of the final estimate N̂_p.
This standard deviation depends on N, N_p, S, and on the selection scheme of the subset (i.e., which kind of randomness).
The obvious question to ask here is which sampling scheme is the best (i.e., results in the lowest standard deviation). Universally agreed on [33,15] and already elaborated on in the field of microplastics [2], this selection should be completely random, in the sense that each particle on the filter should have the same probability of being selected for RM identification.
By using this random sampling, the selection of particles for RM can be represented by a classical urn model without replacement, and the formula for the standard deviation is (equation (16))

sd(r̂) = √( r(1−r)/S · (N−S)/(N−1) ) .

So, after selecting the subset, doing the RM analysis, and calculating the ratio estimate r̂ (equation (13)), the standard deviation sd(r̂) can be used to determine a confidence interval around the ratio estimate r̂. This confidence interval specifies a range of the most plausible values for the ratio estimate and therefore accounts for the estimation uncertainty (in contrast to the point estimate r̂ alone).
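As a sanity check on this urn-model formula, the analytic standard deviation can be compared against a small Monte Carlo simulation of subsampling without replacement. The values of N, N_p, and S below are arbitrary illustrative choices, not values from the paper:

```python
import math
import random

def sd_ratio(N, N_p, S):
    """Analytic sd of the ratio estimate under sampling without replacement:
    sd(r_hat) = sqrt(r(1-r)/S * (N-S)/(N-1)) with r = N_p/N."""
    r = N_p / N
    return math.sqrt(r * (1 - r) / S * (N - S) / (N - 1))

# Monte Carlo check: repeatedly draw S particles without replacement
rng = random.Random(1)
N, N_p, S = 10_000, 1_000, 500
particles = [1] * N_p + [0] * (N - N_p)  # 1 = plastic, 0 = other
estimates = [sum(rng.sample(particles, S)) / S for _ in range(5_000)]

mean = sum(estimates) / len(estimates)
empirical_sd = math.sqrt(sum((x - mean) ** 2 for x in estimates)
                         / (len(estimates) - 1))
print(f"analytic: {sd_ratio(N, N_p, S):.4f}, simulated: {empirical_sd:.4f}")
```

The two printed values should agree closely, confirming that the urn model describes the random subsampling correctly.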
A confidence interval always refers to a given confidence level (1 − α) (where α is the error probability or significance level), such that the probability of the interval covering the true value r is (1 − α). Although typical choices are 80%, 90%, 95%, or 99%, the confidence level should be specified according to the actual requirements of the applied context.
The estimate r̂ (a random variable) is approximately normally distributed (central limit theorem), such that the confidence interval of r̂ can be calculated as (equation (17))

CI(r̂) = [ r̂ − z_{1−α/2} · sd(r̂), r̂ + z_{1−α/2} · sd(r̂) ] ,

where z_{1−α/2} is the (1 − α/2)-quantile of the standard normal distribution (referring to the given confidence level (1 − α)).
This (1 − α) confidence interval of the ratio estimate r̂ can then be used to calculate the (1 − α) confidence interval of the final estimate (of the number of plastic particles on the filter): CI(N̂_p) = N · CI(r̂). This confidence interval is then interpreted as the range of values for N̂_p that covers the true value N_p with probability (1 − α), which means that if this procedure of random sampling were repeated infinitely often and confidence intervals were calculated analogously, then only a fraction α of these confidence intervals would not contain the true value N_p.

6.1.3 Confidence Interval - Estimation
Unfortunately, these confidence intervals cannot be calculated exactly, as the true ratio r in the formula for sd(r̂) (equation (16)) is unknown. Instead, the confidence intervals can only be estimated by using the estimate r̂.
In that, the estimated standard deviation is

ŝd(r̂) = √( r̂(1−r̂)/S · (N−S)/(N−1) ) ,

and the estimated confidence interval of r̂ becomes

ĈI(r̂) = [ r̂ − z_{1−α/2} · ŝd(r̂), r̂ + z_{1−α/2} · ŝd(r̂) ] ,

leading to the estimated confidence interval of the final estimate (equation (20)):

ĈI(N̂_p) = N · ĈI(r̂) .
Instead of reporting solely the point estimate N̂_p, this estimated confidence interval ĈI(N̂_p) should be provided in every microplastic analysis.
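The estimated confidence interval ĈI(N̂_p) takes only a few lines to compute. The following sketch uses hypothetical counts (N = 10 000 particles on the filter, S = 500 analyzed, S_p = 50 identified as plastic) and the standard normal quantile from Python's standard library:

```python
import math
from statistics import NormalDist

def estimated_ci(N, S, S_p, alpha=0.10):
    """Point estimate and estimated (1-alpha) CI for the number of plastic
    particles on the filter, from a random subset of S particles of which
    S_p were identified as plastic."""
    r_hat = S_p / S
    sd_hat = math.sqrt(r_hat * (1 - r_hat) / S * (N - S) / (N - 1))
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    return N * r_hat, (N * (r_hat - z * sd_hat), N * (r_hat + z * sd_hat))

n_hat, (lo, hi) = estimated_ci(N=10_000, S=500, S_p=50)
print(f"N_p estimate: {n_hat:.0f}, 90% CI: [{lo:.0f}, {hi:.0f}]")
```

Reporting the interval (lo, hi) alongside the point estimate conveys the estimation uncertainty that the point estimate alone hides.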

6.1.4 Sample Size Calculation
In addition, these considerations about confidence intervals can be used to calculate the required size S of the subset, such that a confidence interval of a given length can be obtained.
As can be seen in equation (17), the confidence interval ĈI(r̂) of the ratio estimate r̂ is a symmetric interval around the point estimate r̂ with radius z_{1−α/2} · ŝd(r̂), which is frequently denoted by e and referred to as the absolute error margin.
In contrast, the relative error margin e_rel relates the absolute error margin e to the ratio estimate r̂: e_rel = e / r̂. For example, assume two different confidence intervals around the point estimates r̂ = 0.5 and r̂ = 0.1, respectively, both with a relative error of e_rel = 0.1. In that, the absolute error margins are e = e_rel · r̂ = 0.05 and e = e_rel · r̂ = 0.01, yielding the confidence intervals [0.45, 0.55] and [0.09, 0.11], respectively. In order to calculate the required number S of particles that should be subjected to RM identification, one needs the following quantities: N, r, (1 − α), and e. The first is known, the second is unknown (and no estimate r̂ is available prior to the RM identification analysis), and the remaining two should be specified according to the precision requirements of the analysis, in the sense that one should state the error probability α one is willing to accept that the ratio estimate r̂ deviates by more than e (absolute error margin) from the true ratio value r.
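The worked example above can be reproduced directly from the relation e = e_rel · r̂:

```python
# Relative vs. absolute error margin: e = e_rel * r_hat
e_rel = 0.1
for r_hat in (0.5, 0.1):
    e = e_rel * r_hat
    print(f"r_hat = {r_hat}: e = {e:.2f}, CI = [{r_hat - e:.2f}, {r_hat + e:.2f}]")
```

This prints the intervals [0.45, 0.55] and [0.09, 0.11] from the example, illustrating that the same relative error corresponds to very different absolute margins.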
The usual handling of the unknown ratio r is to assume a plausible value and use this assumed ratio.
The required minimum size S can then be calculated by solving e = z_{1−α/2} · √( r(1−r)/S · (N−S)/(N−1) ) for S, yielding

S = z_{1−α/2}² · r(1−r) · N / ( e²(N−1) + z_{1−α/2}² · r(1−r) ) .

Of course, the assumed ratio r used in this calculation prior to the RM identification process might differ from the estimate r̂ (i.e., r_S) that is obtained after RM identification. This might explain why the estimated confidence interval (equation (20)), which is calculated with the ratio estimate r̂, might not meet the previously specified precision requirements (α and e), which are based on the assumed ratio r.
Nevertheless, it is highly recommended to perform one's own sample size calculation prior to RM analysis, as each RM analysis has its own characteristics and requirements. This can be done easily with the following steps:

1. Determine the total number N of particles on the filter that might be subjected to RM identification.
2. Assume a plausible value for the ratio r of plastic particles among all particles on the filter. If it is too difficult to decide on one single value, try different plausible values. Assuming a smaller r increases the sample size but also increases the chances that the precision requirements are met.
3. State the precision requirements (α and e): only an error probability of α should be accepted that the ratio estimate r̂ deviates by more than e from the true ratio r. Frequently, the precision requirement is expressed as a relative error e_rel, which needs to be converted to the absolute error margin e = r · e_rel.
4. Determine the quantile z_{1−α/2} for the desired maximal error probability α.
5. Calculate the minimum sample size S using the sample size formula above.
Of course, all S particles that are subjected to RM identification have to be selected randomly from all N particles; otherwise, the error calculations outlined above do not hold.
As an illustration, Table S1 contains results of the sample size calculation for a specific set of precision requirements, i.e., α = 0.1 and e_rel = 0.1. Note that the required sample size increases with decreasing r and with increasing N.
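The sample size recipe can be sketched as follows; the formula is the one obtained by solving the error margin equation for S, and the grid of N and r values is illustrative rather than a reproduction of Table S1:

```python
import math
from statistics import NormalDist

def min_sample_size(N, r, alpha, e):
    """Minimum number S of particles to analyze so that the (1-alpha)
    confidence interval has absolute error margin at most e, for a filter
    with N particles and an assumed plastic ratio r."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    v = z * z * r * (1 - r)
    return math.ceil(v * N / (e * e * (N - 1) + v))

# Precision requirements as in Table S1: alpha = 0.1, e_rel = 0.1 (so e = r * e_rel)
for N in (1_000, 10_000, 100_000):
    for r in (0.5, 0.1, 0.01):
        print(f"N = {N:>7}, r = {r:>4}: S = {min_sample_size(N, r, 0.1, 0.1 * r)}")
```

Running this reproduces the qualitative pattern noted above: S grows as r decreases and as N increases.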

6.2 Spatial Structure of Particle Locations
In many MP laboratories, window sampling is necessary, and especially when approaching very small MP it will be mandatory. Here, the influence of the spatial structure of the particles can no longer be evaded by random sampling. This section will discuss the fundamental concepts of spatial structures (especially different types of influences), as laid out in the field of spatial statistics [20].
In principle, there is a so-called random point process that can generate spatial structures, which are said to be realizations of this point process. In that, we assume that our observed spatial structure on the filter (e.g., in Figure S2) belongs to a certain, but unknown, point process. In the case of MP filtration, the point process would be characterized by the properties of the particles (e.g., propensity for clustering) and the filtration setup (e.g., vacuum pump, fluid dynamics), and the spatial structure would be the actual arrangement of the particles on this one specific filter. Hypothetically, filtering the particle suspension (water sample) again would give another spatial structure, which is another realization of the same point process.
In spatial statistics, different point processes are discerned [20]. The stereotypical and idealized point process is characterized by complete spatial randomness (CSR), such that the location of every point (i.e., particle) is uniformly distributed in the area of interest (i.e., on the filter), which means that every location on the filter has the same probability of being selected as the location of a particle (see Figure S1a). With CSR, different points might be arbitrarily close to each other. Further point process types are illustrated in Figure S1d–f.
Having these different types of influences on spatial structures in mind, it appears obvious that spatial structures might be quite complex and their characterization cannot be summarized in one single quantity. In fact, the field of spatial statistics offers a range of different functions, each one only being able to describe a single aspect of a spatial structure [20]. In that, comprehensively describing or even modeling spatial structures is a very difficult task. Practically, Pitard even conclude[s] that the sampling of two-dimensional lots "(...) is an unsolvable problem" [33, p. 589].

Table S1 Exemplary values of S for different precision requirements. Columns are N and rows are r.
Interactions between particles influence only the standard deviation of the final estimate, with regularity reducing it and clustering increasing it. Consider the following: if particles express regularity (which they do, as they have a hard core), it is less likely that one single window would contain extremely many or extremely few particles compared to when particles do not express regularity. In that, the standard deviation of the number of particles in this window (and therefore also the standard deviation of the final estimate) is lower with regularity than without.
If particles cluster (with random cluster locations), it is more likely that a single window would contain extremely many particles (if the window is on a cluster) or extremely few (if the window is not on a cluster) compared to when particles do not cluster. Thus, the standard deviation of the number of particles in this window (and therefore the standard deviation of the final estimate) is higher with clustering than without.
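This effect of clustering on window-count variability can be illustrated with a small simulation on the unit square. All parameter values below (window size, cluster number, cluster spread) are arbitrary choices for illustration, not values from the paper:

```python
import random
import statistics

rng = random.Random(42)
W = 0.1  # side length of a square counting window

def csr_points(n):
    """Complete spatial randomness: uniform locations on the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def clustered_points(n_clusters, per_cluster, spread=0.02):
    """Random cluster centers with points scattered closely around them."""
    pts = []
    for _ in range(n_clusters):
        cx, cy = rng.random(), rng.random()
        pts += [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
                for _ in range(per_cluster)]
    return pts

def window_count(pts):
    """Number of points inside one randomly placed W x W window."""
    x0, y0 = rng.uniform(0, 1 - W), rng.uniform(0, 1 - W)
    return sum(x0 <= x <= x0 + W and y0 <= y <= y0 + W for x, y in pts)

csr = csr_points(1_000)          # 1000 particles, no clustering
clu = clustered_points(50, 20)   # 1000 particles in 50 clusters
csr_sd = statistics.stdev(window_count(csr) for _ in range(2_000))
clu_sd = statistics.stdev(window_count(clu) for _ in range(2_000))
print(f"sd of window counts - CSR: {csr_sd:.1f}, clustered: {clu_sd:.1f}")
```

With both patterns containing the same total number of particles, the clustered pattern yields a clearly larger standard deviation of per-window counts, matching the argument above.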
A bias would arise if characteristics of the spatial structure were systematically missed within the observed windows. This is not the case with regularity or clustering per se, as particle locations or cluster locations would still follow a random pattern. Only with an external influence (that affects the locations of particles or clusters, e.g., a vacuum pump vortex) might a bias arise, in dependence on the window selection scheme, as illustrated in Section 2 of the paper.

Furthermore, concerning the use of the terms homo-/heterogeneity, it seems that these terms are used with strongly differing meanings. In analytical chemistry they can refer to the spatial structure of particles, but also to chemical composition. Spatial statistics uses the term homogeneity in the sense of CSR (Figure S1a), where particle locations have a uniform probability distribution [20]. It appeared to the authors, however, that in analytical chemistry texts a homogeneous spatial structure typically refers to a regular point process, as in Figure S1b. This discrepancy might result from the observation that the distances between the points in a regular spatial structure are relatively similar, thus homogeneous. Due to this multitude of meanings, we urge studies to clearly and explicitly state the concept of homo-/heterogeneity that is employed, and we want to emphasize that care has to be taken with these terms in interdisciplinary communication.

Filter Edge Issues
With the random window scheme, it is important that the windows are allowed to overlap the border of the filter. Otherwise, there would be parts of the outer filter that are not properly represented within the windows. This causes an underrepresentation of the border, which can lead to a bias in the final estimate. This is exemplified in Figure S3, where windows did not overlap the border and, in the presence of an external influence (gaussian filter), the true value N_p = 4000 was overestimated by ∼80 particles.

Conservativeness in Bootstrap Estimates
By its definition, the (1 − α) confidence interval should cover the true value N_p with probability (1 − α); i.e., out of all confidence intervals (using the same setup, but different filters with the same external influences), only a fraction α should be allowed to miss the true value N_p. Figure S4 depicts, for different numbers k of windows, the fraction of all 5000 bootstrap confidence intervals that do not cover the true value. It shows that for a low number k of windows the bootstrap confidence intervals do not keep the given limit of α = 0.10 (black horizontal line), but are conservative for a larger number k of windows.
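The flavor of such a window bootstrap can be sketched as a simplified percentile bootstrap on per-window plastic-particle counts. The paper's exact procedure may differ in its details, and all numbers below are illustrative:

```python
import random
import statistics

def bootstrap_ci(window_counts, window_area_fraction,
                 n_boot=2_000, alpha=0.10, seed=0):
    """Percentile-bootstrap (1-alpha) CI for the number of plastic particles
    on the whole filter, from plastic counts in k observed windows, each
    covering the given fraction of the filter area. Windows are resampled
    with replacement."""
    rng = random.Random(seed)
    k = len(window_counts)
    reps = sorted(
        statistics.mean(rng.choice(window_counts) for _ in range(k))
        / window_area_fraction
        for _ in range(n_boot))
    return reps[int(n_boot * alpha / 2)], reps[int(n_boot * (1 - alpha / 2)) - 1]

# Illustrative data: 40 windows, each covering 0.1 % of the filter area
rng = random.Random(7)
counts = [rng.randint(0, 8) for _ in range(40)]
lo, hi = bootstrap_ci(counts, window_area_fraction=0.001)
print(f"90% bootstrap CI for N_p: [{lo:.0f}, {hi:.0f}]")
```

Because windows (not individual particles) are resampled, the bootstrap automatically reflects the between-window variability caused by the spatial structure.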
6.6 Distribution of Window Number in On-The-Fly Procedure

Within the simulation, actual window numbers after termination of the on-the-fly procedure with e_rel = 0.1 and α = 0.1 are depicted in Figure S5 for both regular and gaussian filters. The plot depicts the fraction of all 5000 bootstrap confidence intervals (with α = 0.1) that do not cover the true value of N_p = 4000, which estimates the actual error probability α, for both regular and gaussian filters. For typical numbers of windows (in the on-the-fly procedure on gaussian filters: 1300–2100, see Figure S5), the bootstrap confidence intervals are conservative, as their actual error probability is below the nominal α = 0.1.
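The on-the-fly stopping rule discussed here can be sketched as a loop that, after each newly measured window, recomputes a bootstrap confidence interval and stops once the relative precision requirement is met. This is a simplified illustration with a simulated window measurement; the parameters and the callback `measure_window` are assumptions, not the paper's implementation:

```python
import random
import statistics

def on_the_fly(measure_window, e_rel=0.1, alpha=0.10,
               min_windows=10, max_windows=5_000, n_boot=400, seed=0):
    """Measure windows until the percentile-bootstrap (1-alpha) CI of the
    mean plastic count per window is within a relative error e_rel of the
    current estimate; returns the number of windows used and the estimate."""
    rng = random.Random(seed)
    counts, est = [], 0.0
    while len(counts) < max_windows:
        counts.append(measure_window())
        if len(counts) < min_windows:
            continue
        est = statistics.mean(counts)
        reps = sorted(
            statistics.mean(rng.choice(counts) for _ in range(len(counts)))
            for _ in range(n_boot))
        lo = reps[int(n_boot * alpha / 2)]
        hi = reps[int(n_boot * (1 - alpha / 2)) - 1]
        if est > 0 and max(est - lo, hi - est) <= e_rel * est:
            break
    return len(counts), est

# Simulated RM measurement: each new window yields 0-8 plastic particles
sim = random.Random(3)
k, est = on_the_fly(lambda: sim.randint(0, 8))
print(f"stopped after {k} windows, mean count per window: {est:.2f}")
```

The number of windows at termination depends on the between-window variability, which is why Figure S5 shows different stopping-point distributions for regular and gaussian filters.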

Fig. S5 On-The-Fly Procedure: Actual Sample Sizes. For the 5000 simulated filters, the stopping points k range from 650 to 1100 with m ± sd = 872 ± 108 for regular filters and from 1300 to 2100 with m ± sd = 1665 ± 69 for gaussian filters. Window numbers were investigated in increments of 50 for k ≤ 1000 and 100 for k ≥ 1000.