Statistical Foundations of Nanoscale Photonic Imaging

In this chapter, different statistical models for the observations in nanoscale photonic imaging are discussed. While providing models of increasing accuracy and complexity, we develop a guideline for which model should be chosen in practice, depending on the total number of detected photons as well as their spatial and temporal dependency structure. We focus on different Gaussian, Poissonian, Bernoulli and Binomial models and link them to projects treated within the SFB 755.

Photon budgets vary widely across modalities. In holographic experiments, millions of photons can be collected from one sample [2]. In fluorescence microscopy, the number of photons is intrinsically limited to a few hundred or thousand per marker due to bleaching effects, and in the case of temporally resolved measurements, only a handful of photons is available per time step [3]. Similar restrictions arise in related imaging modalities, including those based on Förster resonance energy transfer (FRET) or metal-induced energy transfer (MIET); see e.g. Chap. 8 or [4,5] for a discussion. Although not within the context of nanoscale imaging, astrophysical imaging is statistically closely related. Here, there is no a priori limit on the observation time and hence on the number of photons. However, the observation time is practically limited to several minutes to avoid severe motion blur; see e.g. [6,7] for examples. We also mention positron emission tomography (PET), where the total number of emitted photons should be as small as possible to minimize the radiation dose for the patient [8]. In all of these applications, detected photons can also originate from undesired background contributions, whose nature strongly depends on the experimental setup, adding additional noise to the observations.

Purpose of the Chapter
The aim of this chapter is to give an overview of prototypical approaches to model the data emerging in photonic imaging from a statistical point of view, based on the physical modeling of photon observation. A sketch of the typical imaging setup we consider is presented in Fig. 4.1. We assume that the imaging process is described by an underlying photon intensity $\lambda : \Omega \times [0, T] \to [0, \infty)$ at the detector interface, where $\Omega$ is the spatial domain of observation and $[0, T]$ the observation period. The expected total number of detected photons is then given by

$$\int_0^T \int_\Omega \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau. \qquad (4.1)$$

Note that this includes all detected photons, including all background contributions. We will always assume $\lambda \geq 0$, which ensures that the integral in (4.1) is well-defined (however, it might be $\infty$).
Throughout this manuscript, we will discuss statistical models for the distribution of the observations Y , depending on the physical measurement setup. We assume λ to be given, as deriving or estimating λ and/or other model parameters described (implicitly) by λ is the topic of other expositions (see e.g. Chap. 5 or Chap. 11).

Measurement Devices
Depending on the type of sensor used for photon detection, different models for photonic imaging settings have been proposed. One commonality of all measurement setups is that the spatial domain of observation $\Omega$ is discretized into detector regions, so-called bins. We will assume that the detectors on all bins have identical physical properties, and we denote the centers of such bins by $x \in \Xi$, with $\Xi$ being the set of all bin centers. If a charge-coupled device (CCD) camera is used for detection, all bins (the pixels of the sensor) can be observed simultaneously. This is e.g. the case in most coherent X-ray experiments or astrophysical imaging. PET requires a tomographic setup consisting of several photomultiplier tubes (PMTs) surrounding the patient (see e.g. [9]). In confocal fluorescence microscopy, the most widely used detectors are based on avalanche photodiodes, which can measure photons in only one bin at a time. Hence, the domain of observation $\Omega$ is typically scanned by physically moving the specimen (or detector) at a fast pace. Temporally simultaneous photons can be measured as well, but this requires a different experimental setup (see e.g. [10]).
Most photon detectors rely on the photoelectric effect. With a certain probability (the quantum efficiency), incident photons will release photoelectrons on the detector surface. Since single electrons cannot be detected reliably, the signal is typically amplified by a cascade of electron-multiplying systems. This introduces additional noise due to the stochastic nature of the multiplying steps. Another complication is the existence of dead times. The dead time of a detection device refers to the time interval (after activation) during which it is unable to record another event. Dead times can, for example, arise due to the necessity to recharge conductors in-between measurements, or due to time delays caused by analog-to-digital conversion and data storage. Details on the statistics of different detectors can be found in [11, Chap. 12].

Structure and Notation
For the remainder of this chapter we will develop and discuss models for the right part of Fig. 4.1 with different degrees of accuracy. The model choice mainly depends on the total number of detected photons and on the spatial and temporal dependency structure of the randomly generated photons. We will start with the Poisson model, which is well known and most common for many applications. It can be derived immediately from (4.1) under the assumption of independence, which explains its wide use in photonic imaging (see e.g. the reviews [7,12] and the references therein). However, if it is necessary to count photons on small time scales, or if independence is not given, a more refined modeling is required. In these situations, we subsequently turn towards Bernoulli and Binomial models and discuss to what extent they are compatible with the aforementioned Poisson model. Finally, we turn to the case of large counting rates, which leads to Gaussian models based on asymptotic normality. We discuss differences and commonalities arising from the different base models and indicate which model should be used in which situation. This will be linked to different examples from this book, where we discuss whether our assumptions are met.
Let us introduce the basic notation used in this chapter. We will always assume that any observation $y$ is the realization of a random variable $Y$, and we will denote by $\mathbb{P}$ probabilities w.r.t. this random object. By $\mathbb{E}$ and $\mathbb{V}$ we will denote the expectation and variance w.r.t. $\mathbb{P}$, respectively. The letters $P$, $B$ and $N$ will denote the Poisson, Binomial and normal distributions introduced below. Random variables will always be denoted by capital letters $X, X_i, Z$ etc., and if we write i.i.d. for a sequence $X_1, X_2, \ldots$ of random variables, this stands for independent and identically distributed.

Poisson Modeling
Suppose we have a perfect photon detector that registers the individual arrival times of all emitted photons reaching a bin without missing any. We will focus on describing a single bin for the moment to avoid notational difficulties. In this situation, the total number of collected photons can often be modeled as Poissonian. A random variable $X$ follows a Poisson law with parameter (intensity) $\mu \geq 0$ if

$$\mathbb{P}(X = k) = e^{-\mu} \frac{\mu^k}{k!}, \qquad k \in \mathbb{N}_0.$$

We write $X \sim P(\mu)$. The following fundamental theorem about point processes explains why the Poisson distribution often comes into play when modeling photon counts:

Theorem 4.1 Suppose we observe a random number $N$ of photons at random arrival times $0 \leq t_1 < \cdots < t_N \leq T$ such that
(a) for each choice of disjoint intervals $I_1, \ldots, I_n \subset [0, T]$, the random variables $\#\{1 \leq i \leq N : t_i \in I_j\}$, $1 \leq j \leq n$, are independent, and
(b) the expected number of photons observed in any interval $[a, b] \subset [0, T]$ is given by $\int_a^b \mu(t)\,\mathrm{d}t$ for some intensity function $\mu \geq 0$.
Then, for all $0 \leq a < b \leq T$, the number of photons observed between time $a$ and time $b$ is Poisson distributed with parameter $\int_a^b \mu(t)\,\mathrm{d}t$, i.e.

$$\#\{1 \leq i \leq N : t_i \in [a, b]\} \sim P\Big(\int_a^b \mu(t)\,\mathrm{d}t\Big).$$
For the proof we refer to [13, Theorem 1.11.8]. In terms of probability theory, this theorem implies that the point process $X := \sum_{i=1}^N \delta_{t_i}$, with $\delta_t$ denoting the Dirac measure at $t$, is a Poisson point process with intensity $\mu$ if the stated assumptions are satisfied.
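To illustrate Theorem 4.1 numerically, the following sketch (a minimal simulation under the simplifying assumption of a constant intensity $\mu$, using only the Python standard library) generates arrival times with independent Exp($\mu$) inter-arrival gaps and checks that the counts in a subinterval $[a, b]$ have mean and variance close to $\mu(b - a)$, as a Poisson law requires.

```python
import random

def arrival_times(mu, T, rng):
    """Homogeneous Poisson process with constant intensity mu on [0, T]:
    consecutive inter-arrival gaps are Exp(mu)-distributed."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(mu)
        if t > T:
            return times
        times.append(t)

rng = random.Random(1)
mu, T, a, b = 5.0, 10.0, 2.0, 6.0
n_runs = 20000
counts = [sum(1 for t in arrival_times(mu, T, rng) if a <= t <= b)
          for _ in range(n_runs)]

mean = sum(counts) / n_runs
var = sum((c - mean) ** 2 for c in counts) / n_runs
# Theorem 4.1 predicts counts in [a, b] ~ P(mu * (b - a)) = P(20),
# so the empirical mean and variance should both be close to 20.
```

For a Poisson law, mean and variance coincide; checking this equality (the Fano factor being 1) is a standard diagnostic for Poisson-distributed photon counts.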
Let us discuss these assumptions. Condition (b) underlies our whole modeling procedure as described in (4.1) and seems universally evident. Temporal independence of the arrival times in (a) is more critical but seems (at least approximately) reasonable in many imaging modalities where photons arise from a high-intensity source, including coherent X-ray imaging. However, if the photons arise from fluorescent markers, temporal independence can be violated due to hidden internal states of the fluorophores, energy transfer between different fluorophores on small time and spatial scales (e.g. FRET), or dead times of the detectors.
If temporal independence is given, then Theorem 4.1 states that the number $Y_{x,t}$ of collected photons within a bin $B_x$ up to time $t \in [0, T]$ can naturally be modeled by a Poissonian random variable with intensity $\int_0^t \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau$. This gives rise to the following model:

Poisson model
Let the spatial domain of observation $\Omega$ be discretized into bins $B_x$ with centers $x \in \Xi$. We assume that our observations are given by a field $Y_t := (Y_{x,t})_{x \in \Xi}$ of random variables such that

$$Y_{x,t} \sim P\Big(\int_0^t \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau\Big) \qquad (4.2)$$

for some intensity function $\lambda \geq 0$.
This is the basis of many popular models covering a variety of distinct applications. Examples include PET (see Vardi et al. [9]), astronomy and fluorescence microscopy (see Bertero et al. [7] or Hohage and Werner [12]), or a more subtle model for CCD cameras due to Snyder et al. [14,15].
Note that so far we have assumed that all arriving photons are collected by the detector. This will, however, never be the case due to several physical limitations: photons that reach the detection plane can stay undetected for various reasons. For example, they can fail to free a photoelectron, miss the sensitive regions of the detector bins, or arrive during the dead time caused by a previously recorded photon. The specific efficiency depends strongly on the setup and can vary considerably. In addition to the different quantum efficiencies of different detectors, it might also happen that the detector does not cover all of $\Omega$ or has some dead subregions (like interfaces between individual elements). This causes a loss of measured photons and hence a statistical thinning of the random variable $Y_{x,t}$. In this case, the actually observed random variable $\tilde Y_{x,t}$ can be written as

$$\tilde Y_{x,t} = \sum_{i=1}^{Y_{x,t}} X_i \qquad (4.3)$$

with Bernoulli random variables $X_i$ having success probabilities $\eta_i \in [0, 1]$, where each $X_i$ indicates if the $i$th photon has been detected. If, in addition, the thinning happens identically and independently for each photon, i.e. the $X_i$ are i.i.d. with success probability $\eta$, only the parameter in the Poisson law (4.2) changes, but not its distributional structure. More precisely, in this case it follows (see the Appendix) that

$$\tilde Y_{x,t} \sim P\Big(\eta \int_0^t \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau\Big).$$

Consequently, the imperfectness of a detector (as long as the induced thinning happens independently for each photon) can be seen as a scaling of the underlying photon intensity $\lambda$ by an efficiency factor $\eta \in (0, 1]$. In agreement with Fig. 4.1 we can hence assume in the following that all physical processes causing a thinning have already been treated when modeling $\lambda$. Besides this kind of independent thinning, a further important issue in many imaging modalities is the dead time $\Delta t$ of the employed detector. Dead times can vary significantly depending on the type of detector, but are usually in the range of nanoseconds.
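The thinning property stated above can be checked by simulation. The sketch below (stdlib only; $\mu = 30$ and $\eta = 0.4$ are arbitrary illustration values) draws Poisson counts, keeps each photon independently with probability $\eta$, and verifies that mean and variance of the thinned counts both match $\eta\mu$, as expected for a Poisson law.

```python
import random

def poisson(mu, rng):
    """Sample Y ~ P(mu) by counting Exp(1) renewals in [0, mu]."""
    t, k = rng.expovariate(1.0), 0
    while t <= mu:
        t += rng.expovariate(1.0)
        k += 1
    return k

rng = random.Random(2)
mu, eta = 30.0, 0.4      # underlying Poisson parameter and efficiency
n_runs = 20000

thinned = []
for _ in range(n_runs):
    y = poisson(mu, rng)
    # each of the y photons is detected independently with probability eta
    thinned.append(sum(1 for _ in range(y) if rng.random() < eta))

mean = sum(thinned) / n_runs
var = sum((v - mean) ** 2 for v in thinned) / n_runs
# independent thinning predicts thinned counts ~ P(eta * mu) = P(12):
# mean and variance should both be close to 12.
```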
If a photon arrives at time $t \in [0, T)$, the detector will only be able to record the next photon arriving after $t + \Delta t$. Note that whenever $\Delta t > 0$, at most $T / \Delta t$ photons can be detected during the whole measurement, which contradicts (4.2) in the sense that $\mathbb{P}(Y_{x,T} > T / \Delta t) = 0$ in this case. Such an upper limit on the total number of detected photons can crucially change the distribution, as can, e.g., be seen from the following fact proven in the appendix:

Theorem 4.2 In the setting of Theorem 4.1, let $[0, T]$ be decomposed into disjoint intervals $I_1, \ldots, I_m$ and denote by $Y_i$ the number of photons observed during $I_i$. Then, conditionally on the total number $N = n$ of observed photons, the vector $(Y_1, \ldots, Y_m)$ follows a multinomial distribution with parameter $n$ and probability vector $(p_1, \ldots, p_m)$, where $p_i = \int_{I_i} \mu(t)\,\mathrm{d}t \big/ \int_0^T \mu(t)\,\mathrm{d}t$.

In other words, Theorem 4.2 states that, conditioned on the total number of photons, the arrival times of individual photons behave like a Bernoulli process with intensity $\tau \mapsto \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y$. This implies that conditioning on the total number of photons introduces a dependency structure between the numbers of counts during different time intervals. Consequently, if $\Delta t$ cannot be neglected, temporal independence is no longer given, thereby corrupting the Poisson law, and different modeling approaches are needed.

Bernoulli Modeling
To measure the temporal structure of the incoming photons, counting as described above is not sufficient. In such cases, photons are consecutively counted during (short) time frames. We suppose that the discretization of the temporal measurement process is refined such that the temporal aggregation underlying the Poisson model is no longer appropriate. This is described by (equidistant) time frames, which are consecutive intervals $I_1, I_2, \ldots, I_n \subset [0, T]$ of equal length $\delta > 0$, chosen such that the probability to observe more than one photon in each bin $B_x$ during any interval is sufficiently close to 0, and separated by a waiting time $\varepsilon > 0$, which allows us to ignore the dead time. In this situation, the following model is a reasonable approximation:

Bernoulli model
For $x \in \Xi$ and $1 \leq i \leq n$, the random variable $Y_{x,i}$ indicating whether a photon arrives in bin $B_x$ during the time interval $I_i$ follows a Bernoulli distribution,

$$Y_{x,i} \sim B(1, p_{x,i}), \qquad (4.4)$$

with success probability

$$p_{x,i} \approx \int_{I_i} \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau. \qquad (4.5)$$

As mentioned before, the detector will hardly count all arriving photons, which causes a statistical thinning as in (4.3). If the thinning happens independently of the photon arrivals, we obtain $\tilde Y_{x,i} \sim B(1, \eta \cdot p_{x,i})$ with the probability $\eta$ that an incident photon is detected, which immediately follows from $\mathbb{P}(\tilde Y_{x,i} = 1) = \eta \cdot \mathbb{P}(Y_{x,i} = 1)$. In many imaging setups, it would be difficult to store the whole time series $(Y_{x,i})_i$, for instance due to memory limitations. Examples include fluorescence microscopy setups like confocal, STED or 4Pi microscopy, or coherent X-ray imaging, where millions of photons are observed in short times, which would require an unreasonably fine time discretization. For other examples like SMS microscopy, however, the temporal structure can be important (e.g. for adjusting temporal drifts, see e.g. [16,17]), and hence most of the data of the above model has to be used. If temporal dependencies are less important, it is sufficient to count photon arrivals in some interval $I \subset [0, T]$ larger than $\delta$, i.e. to consider $Y_{x,I} := \sum_{I_i \subset I} Y_{x,i}$. The distribution of $Y_{x,I}$ depends strongly on the temporal dependency structure of the $Y_{x,i}$. In case they are independent and $p_{x,i} \equiv p_x$ for all $1 \leq i \leq n$, we obtain a Binomial model:

Binomial model
The number of photons collected in bin $B_x$ during the interval $I$ is distributed as

$$Y_{x,I} \sim B\big(\#\{I_i \subset I\}, \, p_x\big), \qquad (4.6)$$

with $p_{x,i} \equiv p_x$ for all $1 \leq i \leq n$ and $p_{x,i}$ as in (4.5).
Note that if we proceed similarly with the thinned observations $\tilde Y_{x,i}$, we obtain $\tilde Y_{x,I} \sim B\big(\#\{I_i \subset I\}, \, \eta p_x\big)$, which is the canonical thinning of (4.6), see e.g. [18].
Independence of the $Y_{x,i}$ is strongly connected to the photon source, as discussed above. If the waiting time $\varepsilon$ between consecutive time frames satisfies $\varepsilon \geq \Delta t$, the dead times of the detectors no longer influence the temporal dependency structure. The second assumption, $p_{x,i} \equiv p_x$ for all $1 \leq i \leq n$, is equivalent to stationarity of the underlying photon source, which again depends on the imaging modality. If, e.g., a freeze-dried sample is imaged sufficiently fast, then this assumption is reasonable.
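A small simulation (stdlib only, with illustrative parameter values) shows why the Bernoulli model above is a good approximation for short frames: for a constant intensity $\mu$ and frame length $\delta$, the per-frame hit probability is exactly $1 - e^{-\mu\delta} \approx \mu\delta$, and frames containing more than one photon are rare.

```python
import math
import random

rng = random.Random(3)
mu, delta = 5.0, 0.01      # constant intensity and frame length (illustrative)
n_frames = 200_000

hits = multi = 0
for _ in range(n_frames):
    # photons per frame ~ P(mu * delta); sample by accumulating Exp(1) gaps
    t, k = rng.expovariate(1.0), 0
    while t <= mu * delta:
        t += rng.expovariate(1.0)
        k += 1
    hits += k >= 1
    multi += k >= 2

p_hat = hits / n_frames                   # empirical Bernoulli success probability
p_theory = 1.0 - math.exp(-mu * delta)    # approx. mu * delta = 0.05
# frames with >= 2 photons occur with probability ~ (mu*delta)^2 / 2,
# justifying the 0/1 (Bernoulli) reduction of the per-frame counts
```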
Besides temporal dependencies, the field of random variables can also have a spatial dependency structure. In many modalities the random variables are independent for different pixels or voxels x, but on sufficiently small scales some dependency can occur, e.g., due to energy transfer between molecules.

Law of Small Numbers
It is a fundamental and well-known fact that a Binomial distribution can, in certain situations, be approximated by a Poisson distribution. In this section, we discuss how this provides a link between the initial Poisson modeling (4.2) and the preceding Bernoulli modeling (4.6). To this end, we recall the so-called law of small numbers, which will be stated in terms of Le Cam's theorem [19]. For the moment we suppress dependencies on $x$ and consider only a single Binomial random variable, corresponding to a fixed bin.

Theorem 4.3 (Le Cam) Let $X_1, \ldots, X_m$ be independent Bernoulli random variables with success probabilities $q_1, \ldots, q_m \in [0, 1)$, let $X := X_1 + \cdots + X_m$ and set $\lambda_m := -\sum_{i=1}^m \log(1 - q_i)$. Then

$$\sup_{A \subseteq \mathbb{N}_0} \big| \mathbb{P}(X \in A) - \mathbb{P}(Z \in A) \big| \leq 2 \sum_{i=1}^m \big(\log(1 - q_i)\big)^2 \qquad (4.7)$$

with $Z \sim P(\lambda_m)$.
For a textbook proof we refer to [20, Theorem 5.1]. Figure 4.3 visualizes the law of small numbers. Note that the bound on the right-hand side of (4.7) can be simplified by using $\log(1 - x) \approx -x$ for small $x$. A classical example for this law is the situation where $q_i \equiv q_m$ for all $1 \leq i \leq m$ and $q_m \cdot m$ converges to some $\lambda > 0$, i.e., $q_m \sim 1/m$. In this case we may use $\log(1 - x) \approx -x$ for small $x$ to obtain $\lambda_m \approx m q_m \to \lambda$ and $2 \sum_{i=1}^m (\log(1 - q_i))^2 \approx 2 \sum_{i=1}^m q_i^2 = 2 m q_m^2 \sim 1/m \to 0$ as $m \to \infty$, i.e., the Binomial distribution of $X$ converges rapidly to the Poisson distribution with parameter $\lambda$.
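The law of small numbers can also be checked numerically. A classical variant of Le Cam's inequality (with Poisson parameter $mq$ rather than the $-\sum \log(1 - q_i)$ parametrization used above; both nearly coincide for small $q$) bounds the total variation distance between $B(m, q)$ and $P(mq)$ by $mq^2$. The sketch below (illustrative values $m = 100$, $q = 0.02$) computes the exact distance and checks it against this bound.

```python
import math

def binom_pmf(m, q):
    """Binomial(m, q) pmf for k = 0..m, computed iteratively."""
    p, out = (1.0 - q) ** m, []
    for k in range(m + 1):
        out.append(p)
        if k < m:
            p *= (m - k) / (k + 1) * q / (1.0 - q)
    return out

def poisson_pmf(lam, kmax):
    """Poisson(lam) pmf for k = 0..kmax, computed iteratively."""
    p, out = math.exp(-lam), []
    for k in range(kmax + 1):
        out.append(p)
        p *= lam / (k + 1)
    return out

m, q = 100, 0.02
b = binom_pmf(m, q)
p = poisson_pmf(m * q, m)       # law of small numbers: P(mq) with mq = 2
tv = 0.5 * sum(abs(x - y) for x, y in zip(b, p))
lecam_bound = m * q * q         # classical Le Cam bound: 0.04
```

The exact distance is far below the bound here, illustrating how quickly the Binomial law collapses onto the Poisson law when the success probabilities are small.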
On the other hand, if the success probabilities $q_i \equiv q \in (0, 1)$ are fixed, the right-hand side of (4.7) diverges. This seems intuitive, as in this situation convergence towards a normal distribution has to be expected (cf. Sect. 4.4.1 below). This is in line with the observation that a Poisson distribution with growing parameter $\lambda_m = -m \log(1 - q)$ converges towards a normal distribution (cf. Sect. 4.4.2 below).
Let us now compare the two Poisson laws arising from Theorem 4.3 and (4.2). According to (4.4), our observations are Binomial random variables with success probabilities $q_i = p_{x,i}$ as in (4.5), where we used that the probability to observe more than one photon is close to 0. Hence, if we denote the largest time in $I_m$ by $t_m$ and again use $\log(1 - x) \approx -x$, then the number of photons observed up to time $t_m$ is approximately Poisson distributed with parameter

$$\lambda_{x,m} = p_{x,1} + \cdots + p_{x,m} \approx \int_0^{t_m} \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau,$$

ignoring the waiting times. This is in good agreement with (4.2). According to (4.7), the error in this approximation is bounded by $2 \sum_{i=1}^m p_{x,i}^2$, revealing it valid whenever the temporal discretization is sufficiently fine.

As Approximation of the Binomial Model
Besides the approximation by a Poisson distribution, it is well known that a Binomial model can also be approximated by a Gaussian one under suitable circumstances. Let us start with the Bernoulli model (4.4) and suppose that all $Y_{x,i}$ are independent with $p_{x,i} \equiv p_x$. If we are interested in the total number of counts $Y_x := \sum_{i=1}^n Y_{x,i}$ in bin $x$, the de Moivre–Laplace theorem states that

$$\frac{Y_x - n p_x}{\sqrt{n p_x (1 - p_x)}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty, \qquad (4.8)$$

where the left-hand side is just the centered and standardized version of the total number of counts $Y_x$. This implies that the distribution of $Y_x$ can be approximated by a Gaussian distribution with mean $n p_x$ and variance $n p_x (1 - p_x)$ if $n$ is sufficiently large. This gives rise to a first Gaussian model:

Gaussian model I
For each $x \in \Xi$, the number of photons observed in the bin centered at $x$ up to time $T$ is

$$Y_x \sim N\big(n p_x, \, n p_x (1 - p_x)\big), \qquad (4.9)$$

where $n = n(T) \sim T/\delta$ with the length $\delta$ of the individual time frames.
The rate of convergence in (4.8) can be made more precise. For instance, a special case of the Berry–Esseen theorem states that

$$\sup_{t \in \mathbb{R}} \left| \mathbb{P}\left( \frac{Y_x - n p_x}{\sqrt{n p_x (1 - p_x)}} \leq t \right) - \Phi(t) \right| \leq \frac{\sqrt{10} + 3}{6 \sqrt{2\pi}} \cdot \frac{p_x^2 + (1 - p_x)^2}{\sqrt{n p_x (1 - p_x)}}, \qquad (4.10)$$

where $\Phi$ denotes the distribution function of $N(0, 1)$, i.e., $\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-s^2/2}\,\mathrm{d}s$. In fact, the constant on the right-hand side of (4.10) cannot be improved [22]. An interpretation of this theorem is that the approximation leading to the model (4.9) is reasonable as soon as $n p_x (1 - p_x) > 9$, which implies that the right-hand side of (4.10) is bounded by $\frac{\sqrt{10} + 3}{18 \sqrt{2\pi}} \approx 0.137$. If the success probabilities $p_{x,i}$ do vary in $i$, the de Moivre–Laplace theorem (4.8) cannot be applied immediately. However, it is still possible, under certain conditions, to derive an approximate Gaussian model of the form (4.9) by applying the Lindeberg central limit theorem (see e.g. [23]). It states that the sum $Y_x$, after centering and standardization, still converges to $N(0, 1)$ in distribution even for non-identically distributed $Y_{x,i}$. This motivates a second Gaussian model:

Gaussian model II
For each $x \in \Xi$, the number of photons observed in the bin centered at $x$ up to time $T$ is

$$Y_x \sim N\Big( \sum_{i=1}^n p_{x,i}, \ \sum_{i=1}^n p_{x,i} (1 - p_{x,i}) \Big). \qquad (4.11)$$

Note that if the random variables $Y_{x,i}$ are dependent, the type of dependency very much determines whether a central limit theorem is still valid (with a different limiting variance), see e.g. [24].
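The quality of the normal approximation behind Gaussian models I and II can be quantified by computing the Kolmogorov distance between the Binomial law and its Gaussian approximation exactly. The sketch below (illustrative choice $p = 0.3$) checks that the distance shrinks as $n$ grows, in line with the $1/\sqrt{n}$ rate suggested by (4.10).

```python
import math

def phi(t):
    """Distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def kolmogorov_distance(n, p):
    """sup_t |P(Y <= t) - Phi((t - np) / sigma)| for Y ~ B(n, p)."""
    mean, sigma = n * p, math.sqrt(n * p * (1.0 - p))
    pmf, cdf, dist = (1.0 - p) ** n, 0.0, 0.0
    for k in range(n + 1):
        z = (k - mean) / sigma
        dist = max(dist, abs(cdf - phi(z)))   # left limit at the jump
        cdf += pmf
        dist = max(dist, abs(cdf - phi(z)))   # value at the jump
        if k < n:
            pmf *= (n - k) / (k + 1) * p / (1.0 - p)
    return dist

d50, d200 = kolmogorov_distance(50, 0.3), kolmogorov_distance(200, 0.3)
# the distance roughly halves when n grows by a factor of 4
```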

As Approximation of the Poisson Model
The Poisson model (4.2) can also be approximated by a Gaussian one. This relies on the fact that the Poisson distribution is infinitely divisible, which means that whenever $X \sim P(\mu)$, then $X$ can be represented as $X = X_1 + \cdots + X_n$ for any $n \in \mathbb{N}$ with i.i.d. random variables $X_1, \ldots, X_n \sim P(\mu/n)$. Consequently, the central limit theorem states that

$$\frac{X - \mu}{\sqrt{\mu}} \xrightarrow{d} N(0, 1) \quad \text{as } \mu \to \infty. \qquad (4.12)$$

The general Berry–Esseen theorem can also be used to bound the error of approximating $(X - \mu)/\sqrt{\mu}$ by $Z \sim N(0, 1)$; one obtains (see also [29]) a bound on the corresponding Kolmogorov distance of order $\mu^{-1/2}$. Hence, if $\mu$ is sufficiently large, the distribution of $X$ can be approximated by a Gaussian distribution with mean and variance $\mu$. If we suppose that $Y_{x,t}$ satisfies (4.2) and that $\int_0^t \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau \to \infty$ as $t \to \infty$, then the above reasoning gives rise to another Gaussian model:

Gaussian model III
For each $x \in \Xi$, the number of photons observed in the bin centered at $x$ up to time $t$ is

$$Y_{x,t} \sim N\Big( \int_0^t \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau, \ \int_0^t \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau \Big). \qquad (4.13)$$
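The infinite divisibility used in the derivation above can be verified exactly: convolving the pmf of $P(\mu/n)$ with itself $n$ times recovers the pmf of $P(\mu)$ up to numerical round-off. A minimal sketch (illustrative values $\mu = 4$, $n = 8$):

```python
import math

def poisson_pmf(lam, kmax):
    """Poisson(lam) pmf for k = 0..kmax."""
    p, out = math.exp(-lam), []
    for k in range(kmax + 1):
        out.append(p)
        p *= lam / (k + 1)
    return out

def convolve(a, b):
    """pmf of the sum of two independent non-negative integer variables."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

mu, n, kmax = 4.0, 8, 40
part = poisson_pmf(mu / n, kmax)    # pmf of one summand X_i ~ P(mu / n)
total = part
for _ in range(n - 1):
    total = convolve(total, part)   # pmf of X_1 + ... + X_n

direct = poisson_pmf(mu, kmax)
err = max(abs(total[k] - direct[k]) for k in range(kmax + 1))
```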

Comparison
Let us briefly compare the Gaussian models I–III in (4.9), (4.11) and (4.13), respectively. It is clear that (4.11) is a generalization of (4.9) to the case of non-identical success probabilities $p_{x,i}$, and both coincide if $p_{x,i}$ is independent of $i$. To compare (4.11) with (4.13), we recall our previous computation $p_{x,1} + \cdots + p_{x,n} \approx \int_0^{t_n} \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau$, where $t_n$ is the largest time in the sub-interval $I_n$. Consequently, (4.11) and (4.13) differ only in the variance, namely by the factors $1 - p_{x,i}$, which are close to 1 as the $p_{x,i}$ are usually small. Hence, all three Gaussian models are in good agreement, and (4.13) can be considered the simplest one, which should be preferred.

Thinning
Taking into account the detection efficiency η ∈ [0, 1] as discussed before, we will arrive at models similar to (4.9), (4.11) and (4.13) with the only difference being that p x , p x,i or λ are multiplied by η. In this sense, the canonical thinning of the Poisson or Binomial models carries over to the Gaussian one.

Variance Stabilization
Note that the variance in the Gaussian models I–III is always inhomogeneous, which hinders data analysis with standard methods and causes further difficulties. This can be overcome by variance stabilization. The most popular choice is the celebrated Anscombe transform, which is applied to the Poisson model (4.2) to obtain asymptotically a normal distribution with variance 1. It is based on the following result (see e.g. [30, Lemma 1]): if $Y \sim P(\mu)$ and $c \geq 0$, then, as $\mu \to \infty$,

$$\mathbb{E}\big[ 2\sqrt{Y + c} \big] = 2\sqrt{\mu} + \frac{4c - 1}{4\sqrt{\mu}} + O(\mu^{-3/2}), \qquad \mathbb{V}\big[ 2\sqrt{Y + c} \big] = 1 + \frac{3 - 8c}{8\mu} + O(\mu^{-2}).$$

From this we can conclude that the choice $c = 3/8$ ensures that the variance of $2\sqrt{Y + c}$ no longer depends on the parameter $\mu$ up to second order. To reduce the bias, $c = 1/4$ is the best choice. Furthermore, applying this result to the Poisson model (4.2) gives rise to a fourth Gaussian model:

Gaussian model IV
Denote the number of photons observed in the bin centered at $x$ up to time $t$ by $Y_{x,t}$. Then we assume

$$2\sqrt{Y_{x,t} + \tfrac{3}{8}} \sim N\Big( 2\sqrt{\textstyle\int_0^t \int_{B_x} \lambda(y, \tau)\,\mathrm{d}y\,\mathrm{d}\tau}, \ 1 \Big) \qquad (4.14)$$

for each $x \in \Xi$.
We emphasize the importance of the model (4.14) in statistics, as it turns out to be equivalent in a strict sense to the previously discussed Poisson model (4.2) as the total number of photons (and hence the parameter t) tends to ∞ (see e.g. [31-33]).
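The variance-stabilizing effect of the Anscombe transform can be checked by computing the mean and variance of $2\sqrt{Y + c}$ exactly via the Poisson pmf. The sketch below (illustrative value $\mu = 10$) compares the Anscombe choice $c = 3/8$ with the naive square root $c = 0$.

```python
import math

def poisson_pmf(lam, kmax):
    """Poisson(lam) pmf for k = 0..kmax."""
    p, out = math.exp(-lam), []
    for k in range(kmax + 1):
        out.append(p)
        p *= lam / (k + 1)
    return out

def transformed_variance(mu, c, kmax=200):
    """Variance of 2*sqrt(Y + c) for Y ~ P(mu), by direct summation."""
    pmf = poisson_pmf(mu, kmax)
    vals = [2.0 * math.sqrt(k + c) for k in range(kmax + 1)]
    mean = sum(p * v for p, v in zip(pmf, vals))
    return sum(p * (v - mean) ** 2 for p, v in zip(pmf, vals))

mu = 10.0
v_anscombe = transformed_variance(mu, 3.0 / 8.0)   # close to 1
v_naive = transformed_variance(mu, 0.0)            # off by roughly 3/(8*mu)
```

Even at this moderate intensity, $c = 3/8$ brings the variance markedly closer to 1 than the naive square root, in line with the expansion above.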

Conclusion
In this chapter we introduced models for photonic imaging setups with different degrees of accuracy. The most common and basic Poisson model (4.2) is accurate as soon as the temporal dependency can be neglected and the detector has no significant dead time. If furthermore the number of observed photons is sufficiently large on each bin, then the Gaussian model (4.13) can be used. In case of significant temporal dependency, the Bernoulli model (4.4) with time resolved individual photon arrivals or the resulting Binomial model (4.6) should be considered instead.
An overview about appropriate model choices for the various imaging techniques discussed previously is provided in Fig. 4.4.
In fluorescence microscopy, STED-based methods, which scan the sample pixelwise, record about 10-100 photons per fluorescent marker. Due to low temporal dependencies, we are thus in the scope of the Binomial or Poisson models [3]. Even though a Gaussian approximation seems questionable, as in regions of low intensity only a few photons per bin can be collected, it has been successfully applied using variance-stabilizing techniques [34]. In order to analyze STORM/PALM data, the full range of modeling approaches is applied. Individual frames contain spots with single or several photons and weak temporal dependency, calling for Bernoulli, Binomial, or Poisson models, while Gaussian approximations are used successfully for drift and rotational corrections [17]. FRET/MIET-based imaging heavily relies on the interactions of fluorescent markers, so that the assumption of temporal independence is violated. This makes the Bernoulli model the model of choice; if more photons are counted, the Binomial model can also be applied [4,5].
Another example in the scope of the Bernoulli model is the 3-photon correlation technique (see e.g. Chap. 16), where molecular structures are probed by femtosecond X-ray pulses. This leads to a high number of images consisting of a few photons only, out of which only triples are used. Inference based on this sequence of images is additionally complicated by rotations of the single target molecules [1].
X-ray diffraction imaging also allows for a whole range of models. At first glance it seems that a Gaussian model is sufficient, as millions of photons are collected in total. However, depending on the specific setup, the photon intensity $\lambda$ may vary strongly over the detection region. If imaging is performed in a near-field regime, as e.g. in many X-ray microscopy setups, the number of photons in the low-intensity regions is only about one order of magnitude lower than in the high-intensity regions, still allowing for a Gaussian model. In contrast, in far-field methods $10^4$ photons can be collected on high-intensity bins, but only a handful of photons arrives in low-intensity regions, revealing a Binomial and/or Poisson model as more suitable [12].
Appendix: Thinning of Poisson Variables

Let $Y \sim P(\mu)$ and let $X_1, X_2, \ldots$ be i.i.d. Bernoulli random variables with success probability $\eta \in [0, 1]$, independent of $Y$, and set $\tilde Y := \sum_{i=1}^Y X_i$. We will now show that the distribution of $\tilde Y$ is still Poissonian, but with parameter $\eta \cdot \mu$. To this end, observe that the probability of $\tilde Y$ being $k$ is given by the sum over all probabilities of $Y$ being $l$ and exactly $k$ out of the first $l$ $X_i$'s being 1, i.e.

$$\mathbb{P}(\tilde Y = k) = \sum_{l=k}^{\infty} \mathbb{P}(Y = l) \, \mathbb{P}\Big( \sum_{i=1}^{l} X_i = k \Big).$$
Inserting the Poisson distribution of $Y$ and the Binomial distribution of $\sum_{i=1}^l X_i$ gives

$$\mathbb{P}(\tilde Y = k) = \sum_{l=k}^{\infty} e^{-\mu} \frac{\mu^l}{l!} \binom{l}{k} \eta^k (1 - \eta)^{l-k} = e^{-\mu} \frac{(\eta\mu)^k}{k!} \sum_{l=k}^{\infty} \frac{\big((1 - \eta)\mu\big)^{l-k}}{(l-k)!} = e^{-\mu} \frac{(\eta\mu)^k}{k!} \, e^{(1-\eta)\mu} = e^{-\eta\mu} \frac{(\eta\mu)^k}{k!},$$

which proves $\tilde Y \sim P(\eta\mu)$.

Appendix: Conditioned Poisson Processes
Suppose we observe a random number $N$ of photons at random arrival times $0 \leq t_1 < \cdots < t_N \leq T$ such that the number of photons between time $a$ and time $b$ is Poisson distributed with parameter $\int_a^b \mu(t)\,\mathrm{d}t$ for a fixed function $\mu \geq 0$. Given a decomposition of $[0, T]$ into disjoint intervals $I_1, \ldots, I_m$, denote by $Y_i := \#\{1 \leq j \leq N : t_j \in I_i\}$ the number of photons observed during $I_i$. Assume furthermore that $Y_1, \ldots, Y_m$ are independent. We will now show that the conditional distribution of $(Y_1, \ldots, Y_m)$ given $N = n$ is multinomial with parameter $n$ and probability vector $(p_1, \ldots, p_m)$, where

$$p_i = \frac{\int_{I_i} \mu(t)\,\mathrm{d}t}{\int_0^T \mu(t)\,\mathrm{d}t}, \qquad 1 \leq i \leq m.$$

Therefore, let $n_1, \ldots, n_m \in \mathbb{N}_0$ be such that $\sum_{i=1}^m n_i = n$, and set $\lambda_i := \int_{I_i} \mu(t)\,\mathrm{d}t$ as well as $\lambda := \lambda_1 + \cdots + \lambda_m = \int_0^T \mu(t)\,\mathrm{d}t$. Then we have

$$\mathbb{P}(Y_1 = n_1, \ldots, Y_m = n_m \mid N = n) = \frac{\prod_{i=1}^m e^{-\lambda_i} \lambda_i^{n_i} / n_i!}{e^{-\lambda} \lambda^n / n!} = \frac{n!}{n_1! \cdots n_m!} \prod_{i=1}^m p_i^{n_i},$$

as claimed.
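The multinomial conditioning can be verified numerically in the two-interval case, where the multinomial reduces to a Binomial law: the conditional probabilities computed from the independent Poisson laws coincide with the Binomial pmf. A minimal sketch with illustrative intensities:

```python
import math

def poisson_pmf(lam, k):
    """P(Y = k) for Y ~ P(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam1, lam2, n = 1.3, 2.1, 5        # interval intensities and total photon count
lam = lam1 + lam2
p1 = lam1 / lam                    # probability of falling into the first interval

cond, binom = [], []
for k in range(n + 1):
    # P(Y1 = k, Y2 = n - k | N = n) from the two independent Poisson laws
    cond.append(poisson_pmf(lam1, k) * poisson_pmf(lam2, n - k)
                / poisson_pmf(lam, n))
    # multinomial (here: Binomial) pmf with probability vector (p1, 1 - p1)
    binom.append(math.comb(n, k) * p1 ** k * (1.0 - p1) ** (n - k))
```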