A Bayesian Treatment of the German Tank Problem

The German tank problem has an interesting historical background and is an engaging problem of statistical estimation for the classroom. The objective is to estimate the size of a population of tanks inscribed with sequential serial numbers, from a random sample. In this tutorial article, we outline the Bayesian approach to the German tank problem, (i) whose solution assigns a probability to each tank population size, thereby quantifying uncertainty, and (ii) which provides an opportunity to incorporate prior information and/or beliefs about the tank population size into the solution. We illustrate with an example. Finally, we survey problems in other contexts that resemble the German tank problem.


History
To inform their military strategy during World War II (1939-1945), the Allies sought to estimate Germany's rate of production of various military equipment (tanks, tires, rockets, etc.). Conventional methods to estimate armament production, including (i) extrapolating data on prewar manufacturing capabilities, (ii) obtaining reports from secret sources, and (iii) interrogating prisoners of war, were unreliable and/or contradictory.
In 1943, British and American economic intelligence agencies exploited a German manufacturing practice in order to statistically estimate Germany's armament production. Specifically, Germany marked its military equipment with serial numbers and codes for the date and/or place of manufacture. The intention was to facilitate handling spare parts and to trace defective equipment/parts back to the manufacturer for quality control. However, these markings on a captured sample of German equipment conveyed information to the Allies about Germany's production of it.
To estimate Germany's production of tanks, the Allies collected serial numbers on the chassis, engines, gearboxes, and bogie wheels of samples of tanks by inspecting captured tanks and examining captured records¹. Despite lacking an exhaustive sample, the sequential nature of² and patterns in these samples of serial numbers enabled the Allies to estimate Germany's tank production much more accurately (as we know postwar) than conventional intelligence methods (Tab. 1).
See Ruggles and Brodie [1] for the detailed historical account of serial number analysis to estimate German armament production during World War II.

¹ E.g., captured records from tank repair depots listed serial numbers of the chassis and engine of repaired tanks, and records from divisional headquarters listed chassis serial numbers of tanks held by a specific unit.
² Gearboxes on captured tanks, for example, were inscribed with serial numbers belonging to an unbroken sequence. Chassis serial numbers, on the other hand, were broken into blocks to distinguish models/designs, leaving gaps between the serial numbers assigned to them.

The German tank problem
Simplification of the historical context to estimate German tank production via serial number analysis [1] motivated the formulation of the textbook-friendly German tank problem [2]:

Problem statement
In the backdrop of World War II, the German military has $n$ tanks. Each tank is inscribed with a unique serial number in $\{1, \ldots, n\}$.
As the Allies, we do not know $n$, but we captured (without replacement, of course) a sample of $k$ German tanks with inscribed serial numbers $(s_1, \ldots, s_k)$.
Assuming all tanks in the population were equally likely to be captured, our objective is to estimate $n$ in consideration of the data $(s_1, \ldots, s_k)$.
In 1942, Alan Turing and Andrew Gleason discussed a variant of the German tank problem, "how best to estimate the total number of taxicabs in a town, having seen a random selection of their license numbers", in a crowded restaurant in Washington DC [3,4]. Today, with its interesting historical background [1], the German tank problem is still a suitable conversation topic for dinners and serves as an intellectually engaging, challenging, and enjoyable problem to illustrate combinatorics and statistical estimation in the classroom [5,6,7,8].
Uncertainty quantification. Any estimate of the tank population size $n$ from the data $(s_1, \ldots, s_k)$ is subject to uncertainty, since we (presumably) have not captured all of the tanks (i.e., $k < n$, probably). Quantifying uncertainty in our estimate of $n$ is important because high-stakes military decisions may be made on its basis.
Our contribution. In this pedagogical article, we outline the Bayesian approach to the German tank problem, (i) whose solution assigns a probability to each tank population size, thereby quantifying uncertainty, and (ii) which provides an opportunity to incorporate prior information and/or beliefs about the tank population size into the solution.

Survey of previous work on the German tank problem
The frequentist approach. Border [9] calls the German tank problem a "weird case" in frequentist estimation. The maximum likelihood estimator of the tank population size $n$ is the maximum serial number observed among the $k$ captured tanks, $m^{(k)} := \max_{i \in \{1, \ldots, k\}} s_i$. This estimator is biased downward: $m^{(k)} \leq n$ always holds, with strict inequality unless the tank with the largest serial number happens to be captured.
Goodman [2,10] derives the minimum-variance, unbiased estimator of the tank population size:

$$\hat{n} = m^{(k)} + \left( \frac{m^{(k)}}{k} - 1 \right). \tag{1}$$

To intuit $\hat{n}$, note (i) $n$ must be greater than or equal to $m^{(k)}$ and (ii) if we observe large (small) gaps between the serial numbers $(s_1, \ldots, s_k)$ after sorting them (incl. the gap preceding the smallest serial number), then $n$ is likely (unlikely) to be much greater than $m^{(k)}$. The estimator of $n$ in eqn. 1 quantifies how far beyond the maximum serial number $m^{(k)}$ we should estimate the tank population size, based on the gaps; $m^{(k)}/k - 1$ is the average size of the gaps. Goodman also derives a frequentist confidence interval for $n$. Clark, Gonye, and Miller explore using simulations and linear regression to discover the estimator in eqn. 1 [11].
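The gap-based reading of Goodman's estimator can be sketched in a few lines of Python. The function name `goodman_estimate` is our own, and the sample is the one used later in the Example section; this is an illustration, not code from the cited works.

```python
def goodman_estimate(serials):
    """Goodman's unbiased point estimate of the tank population size.

    Adds the average gap between sorted serial numbers, m/k - 1,
    to the maximum observed serial number m.
    """
    k = len(serials)          # number of captured tanks
    m = max(serials)          # maximum observed serial number
    average_gap = m / k - 1   # mean gap, incl. the gap before the smallest serial
    return m + average_gap


print(goodman_estimate([15, 14, 3]))  # 15 + (15/3 - 1) = 19.0
```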
For pedagogy. Champkin highlights the historical context of the German tank problem as a "great moment in statistics" [12]. Johnson lists and evaluates several intuitive point estimators for the size of the tank population [5]. Scheaffer, Watkins, Gnanadesikan, and Witmer [13] propose a hands-on learning activity to illustrate the German tank problem by sampling chips, labeled with numbers from 1 to $n$, from a bowl. Berg [6] uses the German tank problem as a competition in the classroom.
The Bayesian approach. Closely related to our paper, Roberts [14], Höhle and Held [15], Linden, Dose, and Toussaint [16], and Cocco, Monasson, and Zamponi [17] provide a Bayesian analysis of the German tank problem. They derive an analytical formula for the mean of the posterior distribution of the tank population size under an improper, uniform prior distribution. Andrews [18] outlines the Bayesian approach to the German tank problem in a blog post containing code in the R language.
Generalizations/variants. Goodman [2,10] poses a variant of the German tank problem where the initial serial number is not known; i.e., where the $n$ tanks are inscribed with serial numbers $\{b + 1, \ldots, n + b\}$ with $b$ and $n$ unknown. Lee and Miller generalize the German tank problem to the settings where the serial numbers are continuous and/or lie in two dimensions [19].

Overview of the Bayesian approach to the German tank problem
Under a Bayesian perspective [20,8,21], we treat the (unknown) total number of tanks as a discrete random variable $N$ (hence the capitalization) to model our uncertainty in it. A probability mass function of $N$ assigns a probability to each possible tank population size $n$. This probability is a measure of our degree of belief, perhaps with some basis in knowledge/data, that the tank population size is $n$ [22].
Because the observed serial numbers $(s_1, \ldots, s_k)$ provide information about the tank population size, the probability mass function of $N$ differs before and after they are collected and considered. Hence, $N$ has a prior and a posterior probability mass function.
The three inputs to a Bayesian treatment of the German tank problem are:
1. the prior mass function of $N$, which expresses a combination of our subjective beliefs and objective knowledge about the tank population size before we collect and consider the sample of serial numbers.
2. the data, the observed serial numbers $(s_1, \ldots, s_k)$, viewed as realizations of random variables owing to the stochasticity of tank-capturing.
3. the likelihood function, giving the probability of the data $(s_1, \ldots, s_k)$ under each tank population size $N = n$, based on a probabilistic model of the tank-capturing process.
The output of a Bayesian treatment of the German tank problem is the posterior mass function of the tank population size $N$, conditioned on the data $(s_1, \ldots, s_k)$. The posterior follows from Bayes' theorem and can be viewed as an update to the prior in light of the data. The posterior mass function of $N$ assigns a probability to each possible tank population size $n$ according to a compromise between its (i) likelihood, which quantifies the support the observed serial numbers $(s_1, \ldots, s_k)$ lend to the tank population size being $n$ according to our probabilistic tank-capturing model, and (ii) prior probability, which quantifies how likely we thought the tank population size might be $n$ before the serial numbers $(s_1, \ldots, s_k)$ were collected and considered [21].
The posterior mass function of $N$ is the raw, uncertainty-quantifying, Bayesian solution to the German tank problem. We may summarize the posterior by reporting its median and the high-mass subset of the natural numbers that credibly contains the tank population size. Also, we can use the posterior to answer questions such as, "what is the probability that $N$ exceeds some threshold quantity $n'$ that would alter military strategy?".

A Bayesian approach to the German tank problem
We now tackle the German tank problem from a Bayesian standpoint.
For reference, the variables are listed in Tab. 2. We use upper- and lower-case letters to represent random variables and realizations of them, respectively. Throughout, we employ the indicator function $I_A(x)$, which maps its input $x$ to 1 if $x$ belongs to the set $A$ and to 0 otherwise (if $x \notin A$).
The data. The data we obtain in the German tank problem is the vector of serial numbers inscribed on the $k$ captured tanks:

$$s^{(k)} := (s_1, \ldots, s_k). \tag{2}$$

We view the data $s^{(k)}$ as a realization of the discrete random vector $S^{(k)} := (S_1, \ldots, S_k)$. Note, at this point, we are entertaining the possibility that the order in which tanks are captured matters.
The data-generating process. The stochastic data-generating process constitutes sequential capture of $k$ tanks from a population of $n$ tanks, without replacement, then inspecting their serial numbers to construct $s^{(k)}$. We assume that each tank in the population is equally likely to be captured at each step. Then, mathematically, the stochastic data-generating process is sequential, uniform random selection of $k$ integers, without replacement, from the set $\{1, \ldots, n\}$.
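A minimal simulation of this data-generating process, assuming Python's standard `random` module as the source of randomness. The function name `capture_tanks` and the values n = 20, k = 3 are illustrative choices.

```python
import random


def capture_tanks(n, k, rng=None):
    """Simulate capturing k tanks, uniformly and without replacement,
    from a population with serial numbers 1, ..., n.

    Returns the serial numbers in order of capture.
    """
    rng = rng or random.Random()
    # random.sample draws without replacement and returns items in selection order.
    return rng.sample(range(1, n + 1), k)


# Illustrative run with a fixed seed for reproducibility.
print(capture_tanks(n=20, k=3, rng=random.Random(0)))
```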
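The likelihood function. The likelihood function specifies the probability of the data $S^{(k)} = s^{(k)}$ given each tank population size $N = n$. Each outcome $s^{(k)}$ in the sample space $\Omega_n^{(k)}$ is equally likely, where

$$\Omega_n^{(k)} := \{ (s_1, \ldots, s_k)_{\neq} : s_i \in \{1, \ldots, n\} \text{ for all } i \in \{1, \ldots, k\} \}, \tag{3}$$

with $(\cdots)_{\neq}$ meaning the elements of the vector $(\cdots)$ are unique. The number of outcomes in the sample space, $|\Omega_n^{(k)}|$, is the number of distinct ordered arrangements of $k$ distinct integers from the set $\{1, \ldots, n\}$, given by the falling factorial:

$$(n)_k := n (n-1) \cdots (n-k+1) = \frac{n!}{(n-k)!}. \tag{4}$$

Under the data-generating process, then, the probability of observing data $S^{(k)} = s^{(k)}$ given the tank population size $N = n$ is the uniform distribution:

$$\pi_{\text{likelihood}}(S^{(k)} = s^{(k)} \mid N = n) = \frac{1}{(n)_k} \, I_{\Omega_n^{(k)}}(s^{(k)}). \tag{5}$$

Interpretation. The likelihood quantifies the support the serial numbers on the $k$ captured tanks in $s^{(k)}$ lend for any particular tank population size $n$, according to our probabilistic model of the tank-capturing process [21]. We view $\pi_{\text{likelihood}}(S^{(k)} = s^{(k)} \mid N = n)$ as a function of $n$, since in practice we possess the data $s^{(k)}$ but not $n$.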
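To make the uniform likelihood concrete, the sketch below evaluates it as a function of $n$ for fixed data: one over the falling factorial when the serial numbers are distinct and all lie in $\{1, \ldots, n\}$, and zero otherwise. The function name `likelihood` is our own; exact rational arithmetic via `fractions.Fraction` is a convenience choice, not something the text prescribes.

```python
from fractions import Fraction
from math import perm  # perm(n, k) is the falling factorial n(n-1)...(n-k+1)


def likelihood(serials, n):
    """Probability of the ordered serial numbers given tank population size n."""
    k = len(serials)
    # Indicator of the sample space: distinct entries, each in {1, ..., n}.
    in_sample_space = len(set(serials)) == k and all(1 <= s <= n for s in serials)
    return Fraction(1, perm(n, k)) if in_sample_space else Fraction(0)


data = (15, 14, 3)
for n in (14, 15, 20):
    print(n, likelihood(data, n))  # zero for n = 14, then 1/2730 and 1/6840
```

Viewed as a function of $n$, the likelihood is zero below the maximum observed serial number and strictly decreasing above it, which is why the maximum likelihood estimate equals the maximum serial number.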
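The likelihood as a sequence of events. Alternatively, we may arrive at eqn. 5 from a perspective of sequential events $S_1 = s_1, S_2 = s_2, \ldots, S_k = s_k$. First, the probability of a given serial number on the $i$th captured tank, conditioned on the tank population size and the outcomes of the previous serial numbers, is the uniform distribution

$$\pi(S_i = s_i \mid N = n, S_1 = s_1, \ldots, S_{i-1} = s_{i-1}) = \frac{1}{n - i + 1} \, I_{\{1, \ldots, n\} \setminus \{s_1, \ldots, s_{i-1}\}}(s_i), \tag{6}$$

since there are $n - i + 1$ tanks to choose from at uniform random. By the chain rule, the joint probability is

$$\pi(S_1 = s_1, \ldots, S_k = s_k \mid N = n) = \prod_{i=1}^{k} \pi(S_i = s_i \mid N = n, S_1 = s_1, \ldots, S_{i-1} = s_{i-1}), \tag{7}$$

giving eqn. 5 after simplifying the product of indicator functions.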
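As a concrete check of this chain-rule argument, take $k = 3$ captured tanks whose serial numbers are distinct and all lie in $\{1, \ldots, n\}$ (an illustrative special case, so that every indicator equals one). The three conditional probabilities multiply to the reciprocal of the falling factorial:

$$\frac{1}{n} \cdot \frac{1}{n-1} \cdot \frac{1}{n-2} = \frac{1}{(n)_3}.$$

For instance, with $n = 20$ the product is $1/(20 \cdot 19 \cdot 18) = 1/6840$.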
The likelihood function in terms of the maximum observed serial number. We will find in Sec. 2.3 that only two independent features of the data $(s_1, \ldots, s_k)$ provide information about the tank population size, $N$: its (i) size, $k$, and (ii) maximum observed serial number

$$m^{(k)} := \max_{i \in \{1, \ldots, k\}} s_i. \tag{8}$$

Thus, we also write a different likelihood: the probability of observing a maximum serial number $M^{(k)} = m^{(k)}$ given the tank population size $N = n$ is the fraction of the sample space $\Omega_n^{(k)}$ in which the maximum serial number is $m^{(k)}$. To count the outcomes $(s_1, \ldots, s_k) \in \Omega_n^{(k)}$ where the maximum serial number is $m^{(k)}$, consider that (i) one of the $k$ captured tanks has serial number $m^{(k)}$ and (ii) the remaining $k - 1$ tanks have a serial number in $\{1, \ldots, m^{(k)} - 1\}$. For each of the $k$ possible positions of the maximum serial number in the vector $s^{(k)}$, there are $(m^{(k)} - 1)_{k-1}$ distinct outcomes specifying the other $k - 1$ entries. Thus:

$$\pi_{\text{likelihood}}(M^{(k)} = m^{(k)} \mid N = n) = \frac{k \, (m^{(k)} - 1)_{k-1}}{(n)_k} \, I_{\{k, \ldots, n\}}(m^{(k)}). \tag{9}$$
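The counting argument can be checked by brute force on a small population. The sketch below enumerates every ordered sample and compares the fraction with a given maximum against the count of $k$ positions times $(m - 1)_{k-1}$ arrangements. The function names and the values n = 6, k = 3 are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations
from math import perm


def prob_max_formula(m, k, n):
    """Closed form: k positions for the maximum times (m-1)_(k-1) ways to
    fill the other entries, divided by the (n)_k equally likely outcomes."""
    if not (k <= m <= n):
        return Fraction(0)
    return Fraction(k * perm(m - 1, k - 1), perm(n, k))


def prob_max_bruteforce(m, k, n):
    """Enumerate all ordered samples of k distinct serials from 1..n and
    count those whose maximum equals m."""
    outcomes = list(permutations(range(1, n + 1), k))
    hits = sum(1 for s in outcomes if max(s) == m)
    return Fraction(hits, len(outcomes))


n, k = 6, 3
for m in range(1, n + 1):
    assert prob_max_formula(m, k, n) == prob_max_bruteforce(m, k, n)
print(sum(prob_max_formula(m, k, n) for m in range(1, n + 1)))  # 1
```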

The prior distribution
The prior probability mass function $\pi_{\text{prior}}(N = n)$ expresses a combination of our subjective beliefs and objective knowledge about the total number of tanks $N$ before the data $(s_1, \ldots, s_k)$ are collected and considered.
The prior mass function we impose on $N$ is context-dependent. Based on the amount of uncertainty it admits about the tank population size (measured by e.g. entropy [23]), prior distributions roughly belong to one of the ordinal categories of informative, weakly informative, or diffuse [21]. If we do not possess prior information about the tank population size, we adopt the principle of indifference and impose a diffuse prior, e.g. a uniform distribution over a set of feasible tank population sizes. On the other hand, an informative prior might concentrate its mass around some estimate of the total number of tanks obtained through other means.
Thinking ahead to the posterior mass function of $N$, which balances the prior and the likelihood (the latter based on the data): (1) an informative prior will have a larger impact on the posterior than a diffuse one [21], which "lets the data speak for itself" [8]; (2) generally, as the number of captured tanks $k$ increases (decreases), we expect the prior to have a smaller (larger) impact on the posterior [8] as the data "overwhelms" the prior.

The posterior distribution
The posterior probability mass function of $N$ assigns a probability to each possible tank population size $n$ in consideration of its consistency with (1) the data $(s_1, \ldots, s_k)$, according to the likelihood in eqn. 5, and (2) our prior beliefs/knowledge encoded in $\pi_{\text{prior}}(N = n)$.
The posterior distribution is a conditional distribution related to the likelihood and prior mass functions by Bayes' theorem:

$$\pi_{\text{posterior}}(N = n \mid S^{(k)} = s^{(k)}) = \frac{\pi_{\text{likelihood}}(S^{(k)} = s^{(k)} \mid N = n) \, \pi_{\text{prior}}(N = n)}{\pi_{\text{data}}(S^{(k)} = s^{(k)})}, \tag{10}$$

where the denominator is the probability of the data $s^{(k)}$:

$$\pi_{\text{data}}(S^{(k)} = s^{(k)}) = \sum_{n'=0}^{\infty} \pi_{\text{likelihood}}(S^{(k)} = s^{(k)} \mid N = n') \, \pi_{\text{prior}}(N = n'). \tag{11}$$

We view $\pi_{\text{posterior}}(N = n \mid S^{(k)} = s^{(k)})$ as a probability mass function of $N$, since in practice we have $s^{(k)}$. Then, $\pi_{\text{data}}(S^{(k)} = s^{(k)})$ is just a normalizing factor for the numerator in eqn. 10. Interpreting eqn. 10, the prior mass function of $N$ is updated, in light of the data $(s_1, \ldots, s_k)$, to yield the posterior mass function of $N$. The posterior probability of $N = n$ is proportional to the product of the likelihood at and the prior probability of $N = n$, a compromise between the likelihood and the prior.
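We simplify the posterior mass function of $N$ in eqn. 10 by (i) substituting eqn. 5, (ii) restricting the sum in eqn. 11 to tank population sizes where the likelihood is nonzero, and (iii) noting the only two features of the data $(s_1, \ldots, s_k)$ that appear are (a) its size $k$ and (b) the maximum serial number $m^{(k)}$:

$$\pi_{\text{posterior}}(N = n \mid M^{(k)} = m^{(k)}) = \frac{(n)_k^{-1} \, \pi_{\text{prior}}(N = n)}{\sum_{n'=m^{(k)}}^{\infty} (n')_k^{-1} \, \pi_{\text{prior}}(N = n')} \, I_{\{m^{(k)}, m^{(k)}+1, \ldots\}}(n). \tag{12}$$

Note, we may arrive at eqn. 12 through eqn. 9 as well.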
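A minimal sketch of this posterior computation follows. It weights each feasible population size by its prior mass divided by the falling factorial, then normalizes. Two assumptions are ours: the prior has finite support and is passed as a dictionary, and exact fractions are used for the arithmetic. The function name `posterior` is hypothetical.

```python
from fractions import Fraction
from math import perm


def posterior(m, k, prior):
    """Posterior mass function of N given the maximum serial number m
    among k captured tanks (requires m >= k >= 1).

    prior: dict mapping each candidate n to its prior mass (finite support).
    Returns a dict mapping n to posterior mass; sizes with zero mass are omitted.
    """
    # Unnormalized posterior: prior mass times 1/(n)_k, nonzero only for n >= m.
    weights = {n: p / perm(n, k) for n, p in prior.items() if n >= m and p > 0}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("the prior assigns no mass to any n >= m")
    return {n: w / total for n, w in weights.items()}


# Illustration: a uniform prior on {0, ..., 35}, with k = 3 and m = 15.
uniform_prior = {n: Fraction(1, 36) for n in range(36)}
post = posterior(15, 3, uniform_prior)
print(min(post), max(post), sum(post.values()))  # 15 35 1
```

Swapping in an informative prior only requires passing a different dictionary; the likelihood part of the computation is unchanged.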
Interpretation. The posterior probability mass function of $N$ in eqn. 12 is our raw, uncertainty-quantifying solution to the German tank problem. It assigns a probability to each tank population size $n$ in consideration of the serial numbers $(s_1, \ldots, s_k)$ observed on the captured tanks, our probabilistic model of the tank-capturing process, and our prior beliefs and knowledge about the tank population size expressed in the prior mass function.
A remark on "uncertainty". The spread of the posterior mass function of $N$ in eqn. 12 reflects epistemic [24] uncertainty about the tank population size. The source of this posterior uncertainty is a lack of complete data: we have not captured all of the tanks and observed their serial numbers to be certain of the tank population size. In practice, an additional source of posterior uncertainty about the tank population size is the possible inadequacy of the model of the tank-capturing process (uniform sampling) in eqn. 5; i.e., selection bias could be present in the tank-capturing process. Our analysis here neglects this source of uncertainty.
Summarizing the posterior mass function of $N$. We may summarize the posterior mass function of $N$ with a point estimate of the tank population size and a credible subset of the natural numbers that likely contains it. A suitable point estimate of the tank population size is a median of the posterior mass function of $N$; by definition, the posterior probability that the tank population size is greater (less) than or equal to a median is at least 0.5. A suitable credible subset, which entertains multiple tank population sizes, is the $\alpha$-high-mass subset [25]

$$H_\alpha := \{ n : \pi_{\text{posterior}}(N = n \mid M^{(k)} = m^{(k)}) \geq \pi_\alpha \}, \tag{13}$$

where $\pi_\alpha$ is the largest mass to satisfy

$$\sum_{n \in H_\alpha} \pi_{\text{posterior}}(N = n \mid M^{(k)} = m^{(k)}) \geq 1 - \alpha. \tag{14}$$

In words, the $\alpha$-high-mass subset $H_\alpha$ is the smallest to (i) contain at least a fraction $1 - \alpha$ of the posterior mass of $N$ and (ii) ensure every tank population size belonging to it is more probable than any outside of it.
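Both summaries are simple to compute from any mass function stored as a dictionary. The sketch below uses a small made-up mass function (hypothetical values chosen so the arithmetic is easy to follow, not a tank posterior), and the function names are our own.

```python
from fractions import Fraction


def posterior_median(pmf):
    """Smallest n whose cumulative mass reaches 1/2 (a median of the pmf)."""
    cumulative = 0
    for n in sorted(pmf):
        cumulative += pmf[n]
        if 2 * cumulative >= 1:
            return n


def high_mass_subset(pmf, alpha):
    """alpha-high-mass subset: lower the mass threshold through the distinct
    mass values until the included sizes hold at least 1 - alpha of the mass."""
    for threshold in sorted(set(pmf.values()), reverse=True):
        subset = [n for n in pmf if pmf[n] >= threshold]
        if sum(pmf[n] for n in subset) >= 1 - alpha:
            return sorted(subset)
    return sorted(pmf)


# Hypothetical toy mass function on {1, 2, 3, 4}.
toy = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
print(posterior_median(toy))                    # 1
print(high_mass_subset(toy, Fraction(1, 4)))    # [1, 2]
print(high_mass_subset(toy, Fraction(1, 5)))    # [1, 2, 3, 4]
```

Thresholding on distinct mass values, rather than adding sizes one at a time, keeps tied sizes together, so a size is never excluded while an equally probable one is included.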
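Querying the posterior distribution. We may find the posterior probability that the tank population size belongs to any set of interest by summing the posterior mass over it. E.g., the probability the tank population size exceeds some number $n'$ is:

$$\pi_{\text{posterior}}(N > n' \mid M^{(k)} = m^{(k)}) = \sum_{n=n'+1}^{\infty} \pi_{\text{posterior}}(N = n \mid M^{(k)} = m^{(k)}). \tag{15}$$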

Posterior predictive checking
We may check the consistency of the data $s^{(k)}$ with the posterior mass function of $N$.
Conceptually, we can simulate new data $\tilde{s}^{(k)}$ using the model of the tank-capturing process under a sample of the tank population size from the posterior, then compare the simulated data $\tilde{s}^{(k)}$ to the real data $s^{(k)}$ [26,21]. More appropriately, we can compare the serial numbers in the real data $(s_1, \ldots, s_k)$ with the mass function giving the probability that the tank with serial number $s$ would be captured under this process:

$$\Pr(\text{tank } s \text{ captured} \mid M^{(k)} = m^{(k)}) = \sum_{n=m^{(k)}}^{\infty} \frac{k}{n} \, I_{\{1, \ldots, n\}}(s) \, \pi_{\text{posterior}}(N = n \mid M^{(k)} = m^{(k)}), \tag{16}$$

since $k/n$ is the probability any given viable serial number $s \in \{1, \ldots, n\}$ will be observed given the tank population size $N = n$.
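The capture probability for a serial number is a posterior-weighted average of $k/n$ over the population sizes large enough to contain it. A minimal sketch, using a hypothetical two-point posterior so the numbers can be verified by hand (the function name and toy values are ours):

```python
from fractions import Fraction


def capture_probability(s, k, pmf):
    """Probability that the tank with serial number s is among the k captured,
    averaging k/n over population sizes n >= s under the mass function pmf."""
    return sum(Fraction(k, n) * p for n, p in pmf.items() if 1 <= s <= n)


# Hypothetical posterior: N is 3 or 4 with equal probability; k = 2 captures.
toy_posterior = {3: Fraction(1, 2), 4: Fraction(1, 2)}
probabilities = {s: capture_probability(s, 2, toy_posterior) for s in range(1, 6)}
print(probabilities[1], probabilities[4], probabilities[5])  # 7/12 1/4 0
print(sum(probabilities.values()))  # 2, the number of captured tanks
```

Summed over all serial numbers, these probabilities equal $k$ rather than one, because each capture of $k$ tanks reveals $k$ serial numbers.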

Example
We illustrate the Bayesian approach to the German tank problem through an example.
The prior probability mass function of $N$. Suppose we have an upper bound $n_{\max}$ for the possible number of tanks but no other information. Then, we may impose a diffuse prior, a uniform prior probability mass function:

$$\pi_{\text{prior}}(N = n) = \frac{1}{n_{\max} + 1} \, I_{\{0, \ldots, n_{\max}\}}(n). \tag{17}$$

This prior mass function expresses: in the absence of any data $(s_1, \ldots, s_k)$ (i.e., neither the serial numbers nor $k$), we believe the total number of tanks $N$ is equally likely to be any value in $\{0, \ldots, n_{\max}\}$. Particularly, suppose $n_{\max} = 35$. Fig. 1a visualizes $\pi_{\text{prior}}(N = n)$.
The data $(s_1, \ldots, s_k)$. Now suppose we capture $k = 3$ tanks, with serial numbers $s^{(3)} = (15, 14, 3)$. See Fig. 1b. So, the maximum observed serial number is $m^{(3)} = 15$.
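The posterior probability mass function of $N$. Under the uniform prior in eqn. 17, the posterior probability mass function of $N$ in eqn. 12 becomes:

$$\pi_{\text{posterior}}(N = n \mid M^{(k)} = m^{(k)}) = \frac{(n)_k^{-1}}{\sum_{n'=m^{(k)}}^{n_{\max}} (n')_k^{-1}} \, I_{\{m^{(k)}, \ldots, n_{\max}\}}(n). \tag{18}$$

Fig. 1d visualizes the posterior probability mass function of $N$ for the data $s^{(3)}$ in Fig. 1b and the prior in eqn. 17 ($n_{\max} = 35$).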
Summarizing the posterior. Summarizing the posterior mass function of $N$, its median is 19 and its high-mass credible subset is $H_{0.2} = \{15, \ldots, 25\}$ (highlighted in Fig. 1d). For what it's worth, the data in Fig. 1b was generated from a tank population size of $n = 20$ (explaining the choice of scale in Fig. 1b).
Querying the posterior. Suppose our military strategy would change if the size of the tank population were to exceed 30. From the posterior distribution of $N$, we calculate $\pi_{\text{posterior}}(N > 30 \mid M^{(3)} = 15) \approx 0.066$.
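The summaries reported for this example can be reproduced with a short script. This is a sketch under the stated setup (uniform prior with upper bound 35, three captured tanks, maximum serial number 15), and the variable names are our own.

```python
from fractions import Fraction
from math import perm

k, m, n_max = 3, 15, 35  # sample size, maximum serial number, prior upper bound

# Posterior under the uniform prior: proportional to 1/(n)_k on {m, ..., n_max}.
weights = {n: Fraction(1, perm(n, k)) for n in range(m, n_max + 1)}
total = sum(weights.values())
posterior = {n: w / total for n, w in weights.items()}

# Median: smallest n whose cumulative posterior mass reaches 1/2.
cumulative = Fraction(0)
for n in sorted(posterior):
    cumulative += posterior[n]
    if cumulative >= Fraction(1, 2):
        median = n
        break

# High-mass subset H_0.2: add sizes from most to least probable until at
# least 80% of the mass is covered (this posterior has no ties in mass).
high_mass, covered = [], Fraction(0)
for n in sorted(posterior, key=posterior.get, reverse=True):
    high_mass.append(n)
    covered += posterior[n]
    if covered >= Fraction(4, 5):
        break

# Tail query: posterior probability that N exceeds 30.
tail = sum(p for n, p in posterior.items() if n > 30)

print(median, min(high_mass), max(high_mass), round(float(tail), 3))
# 19 15 25 0.066
```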
Posterior predictive checking. As a posterior predictive check, Fig. 2a shows how the observed serial numbers in the data $s^{(3)}$ compare with the probability of observing each serial number under the posterior mass function of $N$, according to eqn. 16.
Sensitivity of the posterior to the prior. Because of the subjectivity involved in constructing the prior, checking the sensitivity of the posterior to the prior is good practice [21]. Fig. 2b shows how the posterior mass function of $N$ changes as we increase the upper bound on the tank population, $n_{\max}$, that we impose via the prior mass function of $N$ in eqn. 17. The median of the posterior under $n_{\max} \in \{60, 70\}$ is 20 (an increase of one compared to $n_{\max} = 35$). The maximum of the high-mass subset $H_{0.2}$ increases to 29 for $n_{\max} = 70$.
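A sensitivity check of this kind amounts to recomputing the summaries over a range of upper bounds. The sketch below assumes the uniform prior and relies on the posterior being strictly decreasing in $n$ on its support, so the high-mass subset is a run of sizes starting at the maximum serial number. The function name `summarize` is hypothetical.

```python
from fractions import Fraction
from math import perm


def summarize(k, m, n_max, alpha=Fraction(1, 5)):
    """Return (median, largest member of H_alpha) for the posterior under a
    uniform prior on {0, ..., n_max}, given k tanks with maximum serial m."""
    weights = {n: Fraction(1, perm(n, k)) for n in range(m, n_max + 1)}
    total = sum(weights.values())
    # The posterior decreases in n, so one cumulative pass yields both summaries.
    cumulative, median = Fraction(0), None
    for n in range(m, n_max + 1):
        cumulative += weights[n] / total
        if median is None and cumulative >= Fraction(1, 2):
            median = n
        if cumulative >= 1 - alpha:
            return median, n


for n_max in (35, 60, 70):
    print(n_max, summarize(3, 15, n_max))
```

With three tanks and a maximum serial number of 15, the medians come out as 19, 20, and 20, and the upper end of the subset moves from 25 (for an upper bound of 35) to 29 (for an upper bound of 70).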
Capturing more tanks. Suppose we capture an additional 9 tanks and re-run the Bayesian analysis. Fig. 3 shows the updated posterior mass function of $N$. The high-mass credible subset $H_{0.2}$ shrinks considerably, to $\{19, 20\}$. This shows how more data (increasing the number of tanks captured, $k$) generally reduces our uncertainty about the tank population size.

Discussion
Selection bias. A strict assumption in the textbook-friendly German tank problem, which enables us to estimate the size of the population of tanks from a random sample of their (sequential) serial numbers, is that sampling is uniform. To check consistency of the sample with this model of the tank-capturing process, Goodman [10] demonstrates a test of the hypothesis that the sample of serial numbers is from a uniform distribution. Interesting extensions of the textbook German tank problem could involve modeling selection bias in the tank-capturing process. Such bias could arise, hypothetically, if older tanks with smaller serial numbers were more likely to be deployed in the fronts opened earlier in the war, where capturing tanks is more difficult than at less fortified fronts opened more recently.
The German tank problem in other contexts. The Bayesian probability theory to solve the German tank problem applies (perhaps, with modification) to many other contexts where we wish to estimate the size of some finite, hidden set [27], e.g.: the number of taxicabs in a city [12], the number of accounts at a bank [15], the number of furniture pieces purchased by a university [10], the number of aircraft operations at an airport [28], the extent of leaked classified government communications [29], the time needed to complete a project deadline [30], the time-coverage of historical records of extreme events like floods [31], the length of a short-tandem repeat allele [32], the size of a social network [33], the number of cases in court [34], the lifetime of a flower of a plant [35], or the duration of existence of a species [36]. Mark and recapture methods in ecology to estimate the size of an animal population [37,38] are tangentially related to the German tank problem.
The practice of inscribing sequential serial numbers on military equipment. Germany adopted the practice of marking its military equipment with serial numbers and codes to trace the equipment/parts/components back to the manufacturer. However, the sequential nature of these serial numbers was exploited by the Allies to estimate Germany's armament production. To reduce vulnerability to serial number analysis for estimating production while maintaining the advantages of tracing equipment back to the manufacturer, serial numbers and codes could instead be obfuscated by e.g. chaffing [39].

Figure 1: A Bayesian approach to the German tank problem. (a, prior) The mass function. (b, data) The data $s^{(3)}$, with maximum observed serial number $m^{(3)} = 15$. (c, likelihood) The likelihood function associated with the data $s^{(3)}$. (d, posterior) The posterior mass function of $N$. $H_{0.2}$ highlighted; median marked with vertical, dashed line.

Figure 2: Checking (a) the consistency of the data $s^{(3)}$ with the probability of observing each serial number under the tank-capturing process and the posterior mass function of $N$ and (b) the sensitivity of the posterior mass function of $N$ to the upper bound $n_{\max}$ imposed by the prior mass function of $N$.

Figure 3: The updated posterior mass function of $N$ (b) after we capture an additional 9 tanks with serial numbers in (a).

Table 2: List of parameters/variables.