Introduction

In December 2019 in Wuhan, China, an infectious disease caused by a novel coronavirus was discovered. The disease, which was later labeled COVID-19, remained confined to China for several weeks. Starting in January 2020, the epidemic spread outside China, first to Thailand, South Korea, and Japan, favored by an outflow of travelers during the Chinese Spring Festival (Milani 2021; Qiu et al. 2020). Concerned by the alarming levels of spread and severity, on 11 March 2020, the World Health Organization (WHO) declared COVID-19 a pandemic.

According to the official data released by the WHO, 1,056,157 cases were confirmed around the world by 4 April 2020, causing the death of 57,130 individuals (or 5.5% of confirmed cases). With more than 583,000 cases, Europe was the hardest hit continent at the time. Among European countries, Italy registered both the highest number of cases (almost 120,000) and the highest number of deaths (almost 15,000). Since then, the situation has worsened, reaching 13 million cases in the world and about 600,000 deaths (or about 4.5% of cases) as of mid-July 2020; Europe and Italy in particular are still among the most affected regions.

The first case of COVID-19 in Italy was registered in Lombardia on 20 February 2020, and the virus spread rapidly across the Northern Italian regions. With these alarming numbers at hand, a heated debate started in Italy concerning the number of deaths, the number of infections, and mortality rates from COVID-19 (Bonacini et al. 2021). This paper contributes to the debate, seeking to estimate these quantities early on for Lombardia. These quantities are important for epidemiologists to identify possible directions of research towards a cure and for policy makers who monitor the number of deaths from and cases of COVID-19 to make decisions about lockdowns, a policy introduced in Italy, as in many other countries, to suppress and reverse the growth trajectories of the virus (Qiu et al. 2020; Zimmermann et al. 2020). As the Italian Government maintains the power to make the rules about lockdowns more or less stringent, the results of this paper may help the government make better-informed decisions.

To provide a continuous update of the COVID-19 situation, the Italian Civil Protection Department (Protezione Civile) started publishing daily data at the regional level on deaths, cases, and swabs. Based on these updates, many argued that there was something special about Lombardia, whose mortality rate was almost 20%; for comparison, the mortality rate from the “natural experiment” among the Diamond Princess passengers was 1.5%.Footnote 1 In fact, regardless of the external validity of the Diamond Princess experiment (Heckman 1996), the data from Protezione Civile are not appropriate for shedding light on COVID-19 because they are incomplete. For example, individuals who died from COVID-19 but who were never tested are not included in the sample of Protezione Civile, nor are they classified with the specific COVID-19 code in official statistics (WHO 2020): as a consequence, the number of deaths from COVID-19 provided by Protezione Civile underestimates the true number of deaths from the infection. By the same token, asymptomatic and paucisymptomatic individuals who are not tested are not counted as infected with COVID-19. Given the tiny fraction of the population tested, the number of individuals suffering from COVID-19 provided by Protezione Civile vastly underestimates the real number. It follows that the mortality rate from COVID-19 (i.e., the ratio between deaths from and patients suffering from COVID-19, both of which are downward biased) will in general be biased.

To overcome the limitations of Protezione Civile data, in April 2020, the Italian Statistical Institute (Istat) began publishing the number of daily deaths in Italian municipalities between 2015 and 2020. So far, five waves have been released. As the collection of demographic data usually takes 4 months (Istat 2020), during the first two releases of the data, municipalities were selected on the basis of the observed deaths in 2020, and therefore, the observed municipalities did not represent a random draw of all Italian municipalities. Although Istat (2020) clearly emphasized the possible bias from the selected sample (Heckman 1979), several commentators employed the data from the selected sample to learn about the total population (see Rettore and Tonini 2020 for a review): the COVID-19 mortality in observed municipalities was used to predict the COVID-19 mortality in unobserved municipalities (Colombo and Impicciatore 2020; Modi et al. 2020), so as to obtain the COVID-19 mortality in Italy. For the third release of the data, Istat made an extraordinary effort to publish data on all the municipalities.

The main aim of this paper is to overcome the limitations of the administrative data, which are then used to obtain the correct number of deaths from, the incidence of, and the mortality rate from COVID-19 during March 2020 in Lombardia. The main theoretical argument of this paper is that the generalization of the results from the observed sample to the whole population is not recommended because of the (potential) bias that the selection of the sample might introduce. In order to learn about the population, a correction is required. To this aim, rather than using standard approaches that allow point identification, I partially identify the outcomes of interest (Manski 1990) while taking into account the selection mechanism (Horowitz and Manski 2000). Partial identification combines assumptions with the data to deliver a set of admissible values or bounds. Stronger assumptions yield stronger conclusions but decrease the credibility of the inference (Manski 2011). Given the limited knowledge about COVID-19, a distinctive feature of my approach is that I am very cautious about imposing assumptions. I start with assumptions based on definitions, and then I introduce mild assumptions (whose validity can be supported) that nonetheless have relevant identification power (i.e., they narrow the bounds substantially). As for mortality data from Istat, I begin with the second release, which is characterized by a non-random selection of municipalities, and check all the results using the release that contains all the municipalities. This gives a unique opportunity to appreciate the relevance of set identification with respect to the number of deaths, a key component of the mortality rate from COVID-19. The exercise is important because the challenges faced by Istat are common to other national Statistical Institutes (NSI) around the European Union (EU) and the solution proposed in this paper may also be adapted to other contexts. The methodologies of this paper might thus be of general interest.

To the best of my knowledge, only three papers adopt set identification to study COVID-19: Manski and Molinari (2020), Manski (2020), and Manski and Tetenov (2020). This paper contributes to that literature by adopting a population-level perspective, which admits different assumptions than an individual-level perspective and allows me to answer related, but different, questions.

In Lombardia, the Italian region that was hit the hardest by COVID-19, there were between 10,000 and 18,500 more deaths during March 2020 compared with the same period over 2015–2019, on average. A striking result is that the conclusions drawn from the standard approach, based on point identification, are rejected by the bounds introduced in this paper. The narrowest bounds on the mortality rate in Lombardia are between 0.1 and 7.5% (including asymptomatic individuals), much smaller than the 17.5% estimated with Protezione Civile data. One should be cautious when concluding that there is something special in the region (Odone et al. 2020; Favero 2020). It is therefore important that researchers carefully consider what they can learn when available data are combined with credible assumptions: amid the uncertainty about COVID-19, imposing strong assumptions may lead to wrong conclusions.

Data

In this paper, I estimate the number of deaths, the incidence of COVID-19, and its mortality rate in Lombardia during March 2020. The data on the number of deaths are from the Italian Statistical Institute (Istat). Istat releases mortality data at the municipal level and at daily frequency for the 2015–2020 period.

At the time of writing, five releases of data are available. While data for the period 2015–2019 are complete, in the first release (beginning of April) only 1000 municipalities (out of 7904) included the 2020 figures, up to 21 March 2020; in the second release (mid-April), only 1689 municipalities included the 2020 figures, up to 4 April 2020; starting from the third release (beginning of June), all the municipalities were included.Footnote 2 According to the mid-April release of the data, the sampled municipalities of Lombardia registered 19,824 deaths between 1 March 2020 and 4 April 2020, almost 13,000 more than in the corresponding period of 2015–2019, on average (Table 1). This increase represents 60% of the total increase in Italy. Although the five regions with the largest increase in the number of deaths are located in Northern Italy, and together they represent 90% of total deaths (Lombardia to Liguria in Table 5 in Appendix B), the second most affected region, Emilia-Romagna, represents less than 15% of the total increase in Italy. The main obstacle to generalizing the sample of the first two releases of mortality data to the entire population lies in the selection criteria adopted by Istat. Mortality is indeed published only for the municipalities that experienced (1) at least 10 deaths since the beginning of 2020 and (2) an increase in total mortality of at least 20% between 1 March 2020 and 4 April 2020 with respect to the 2015–2019 average of the same period (for short, below I refer to the period as “March” only).

Table 1 Mortality between 1 March 2020 and 4 April 2020 as derived from Istat data, for municipalities available in the full period 2015–2020

The data on patients who suffer from COVID-19 and the number of swabs are instead released daily by Protezione Civile for each region. The reference sample is made of individuals who are actually tested. For them, we also know mortality. However, the share of the tested individuals over the reference population is relatively small in almost all Italian regions (1.4% in Lombardia at the end of the period considered).

The selection criteria of the data released by Istat and Protezione Civile make it impossible to answer directly the 3 questions of interest to policy makers, epidemiologists, and citizens: (1) what is the true number of deaths because of COVID-19?; (2) what is the incidence of COVID-19 in the population?; (3) what is the mortality rate of COVID-19 in the population?Footnote 3 An answer to each of the above questions is important because the Italian Government monitors these variables to decide about the phasing out of, or adjustment to, the lockdown. In Section 3, I show that even with their limitations the available data are informative about each of the above questions.

Methods

The method that I use in this paper is based on partial identification rather than point identification (Manski 1990); therefore, instead of providing a single number to each question, I will provide a set of admissible values. With partial identification, assumptions (1) can be increasingly restrictive, i.e., from weaker to stronger; (2) can be refutable or non-refutable;Footnote 4 and (3) can be assessed in terms of their identification power. The general result is that the larger/stronger the set of assumptions, the smaller the identified set; however, there is no free lunch and, if an assumption turns out to be wrong, the true answer might lie outside of the estimated range. For example, assumptions required for point identification have the highest identification power (i.e., the width of the identified set is equal to zero) but in this application, they are not satisfied.
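As a generic illustration of what "a set of admissible values" means, the following minimal sketch computes worst-case bounds in the spirit of Manski (1990) on a population mean when some outcomes are missing. The function name, the toy data, and the logical range `[y_min, y_max]` are illustrative assumptions and are not taken from the application in this paper.

```python
def worst_case_mean_bounds(observed, n_missing, y_min, y_max):
    """Worst-case bounds on a population mean with missing outcomes.

    Missing outcomes are replaced by the smallest (largest) logically
    possible value to obtain the lower (upper) bound.
    """
    n_total = len(observed) + n_missing
    if n_total == 0:
        raise ValueError("the population is empty")
    observed_sum = sum(observed)
    lower = (observed_sum + n_missing * y_min) / n_total
    upper = (observed_sum + n_missing * y_max) / n_total
    return lower, upper


# Hypothetical binary outcome: 4 units observed, 6 units missing.
lower, upper = worst_case_mean_bounds([1, 1, 1, 0], n_missing=6, y_min=0, y_max=1)
print(lower, upper)  # 0.3 0.9
```

The width of the interval equals the share of missing units times the logical range, so it shrinks to zero only when nothing is missing or when further assumptions are imposed.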

Total number of deaths in March 2020

To derive the total number of deaths due to COVID-19 during the overall period t ≡ March 2020 in the region J ≡ Lombardia, I begin with the mid-April wave (second release) of Istat data, characterized by partial coverage of the municipalities, and check the predictions using all the municipalities, released at the beginning of June (third release). Thus, there is no loss of information. However, the challenges posed by the mid-April release in terms of partial availability of the data are common to several institutions around the world (for demographic information on COVID-19, see, for example, the other NSIs in the EU) and to several indicators (e.g., on the labor market, both in normal times and during the pandemic). Italian data on COVID-19 give a unique opportunity to appreciate the advantages and disadvantages of set identification when the data are only partially available.

The approach that I propose would allow Istat to release data much earlier than the standard 4-month lag. It may also be generalized to other countries or fields with minor adjustments. Finally, set identification may be used as a check for point identification and it may even be published so as to give the user a sense of the uncertainty surrounding the (preliminary) forecasts (Manski 2011).

I distinguish the universe of municipalities (\(Muni^{Tot}\)) between observed (\(Muni^{Obs.}\)) and unobserved (\(Muni^{Unobs.}\)) municipalities:

$$ Muni^{Tot}=Muni^{Obs.}+Muni^{Unobs.}. $$
(1)

The main idea underlying the paper is that the total number of deaths during period t ≡ March 2020 in region J ≡ Lombardia (\(M^{Tot}_{t,J}\); to simplify notation, from now I omit the subscripts unless necessary) is equal to

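$$ M^{Tot}=\sum\limits_{i=1}^{Muni^{Tot}} \left[ M^{Obs.}_{i}\, \mathbf{1}\left(i \in Obs.\right) + M^{Unobs.}_{i}\, \mathbf{1}\left(i \in Unobs.\right) \right], $$
(2)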

where \(M^{Obs.}\) is the number of deaths in observed municipalities during period t ≡ March 2020, \(M^{Unobs.}\) is the number of deaths in the unobserved municipalities during period t ≡ March 2020, and \(\mathbf{1}(A)\) is an indicator function that takes value one when the condition A is verified. While I observe the entire distribution function of mortality in the observed municipalities (\(M^{Obs.}\)), and whether a municipality is in the sample, I do not observe the mortality in unobserved municipalities (\(M^{Unobs.}\)). The main challenge consists in recovering \(M^{Unobs.}\).

The least demanding assumptions I can impose on the number of deaths in the unobserved municipalities are that at least no death is recorded (obtaining the lower bound \(\underline {M}\)) and at most all the citizens died (obtaining the upper bound \(\overline {M}\)), such that \(M^{Tot} \in \{\underline {M}, \overline {M} \}\):Footnote 5

However, a close reading of the selection mechanism of Istat (Section 2) introduces a powerful assumption that affects the upper bound (\(\overline {M}\)). In all of the observed municipalities, at least 10 deaths were registered since the beginning of the year, and mortality during March 2020 increased by at least 20% with respect to the average number of deaths during March of the 5 preceding years (2015–2019). Given these selection rules and following the vast majority of the papers on COVID-19, I focus on mortality during March 2020 only (see Appendix A for an example of the selected sample). The focus on March is the most appropriate because it represents the relevant period of COVID-19 disease in Lombardia. All the excess mortality that we observe is thus attributable to coronavirus and not to confounding effects. (Below I show how one can take advantage of information regarding previous months.)

For unobserved municipalities, I do not know how many deaths were registered in March 2020. Because municipalities must satisfy both conditions to be included in the sample of Istat, I know that each unobserved municipality satisfied at most one condition, but I do not know which of the two. It follows that in each unobserved municipality either at most 9 deaths (fewer than 10) were registered or the year-on-year increase in March mortality was smaller than 20%. This shrinks the bounds to \(M^{Tot} \in \{\underline {M}, \overline {M} \}\), where

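$$ \underline{M}=\sum\limits_{i=1}^{Muni^{Obs.}} M^{Obs.}_{i} $$
(3)

and

$$ \overline{M}=\sum\limits_{i=1}^{Muni^{Obs.}} M^{Obs.}_{i} + \sum\limits_{j=1}^{Muni^{Unobs.}} \max \left\{ 9,\; 1.2\, M^{Unobs.}_{t-1,j} \right\}. $$
(4)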

A similar approach to recover missing data is in Horowitz and Manski (2000). Some comments are in order. First, suppose (only to simplify exposition) that the number of deaths is a constant \(\mu_1\) in all the observed municipalities and \(\mu_0\) in all the unobserved municipalities; then from Eq. 2 I get \(M^{Tot} = Muni^{Obs.}\mu_1 + (Muni^{Tot} - Muni^{Obs.})\mu_0\) (because \(Muni^{Unobs.} = Muni^{Tot} - Muni^{Obs.}\) from Eq. 1). Defining \(\rho = Muni^{Obs.}/Muni^{Tot}\), it follows that \(M^{Tot} = Muni^{Tot}\rho \mu_1 + Muni^{Tot}(1 - \rho)\mu_0\): as \(\rho \rightarrow 1\) the observed sample of municipalities is increasingly more informative about \(M^{Tot}\), and when ρ = 1 the data provided are fully informative. Second, if municipalities were randomly drawn from the same population, then \(E[M^{Obs.}] = E[M^{Unobs.}]\) and the sample selection criteria would be independent of the outcome variable (Heckman 1979).Footnote 6 Third, these bounds are based exclusively on definitions, and therefore, their assumptions are always satisfied; for this reason, I call them “worst-case bounds” (Manski 1990). In Section 3.1.1, I consider (and support) additional mild restrictions that further shrink the width of these bounds.
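The bounds implied by the selection rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names and the toy inputs are hypothetical, and the upper bound encodes one reading of the rule, namely that an excluded municipality failed at least one criterion, so its March deaths are at most the larger of 9 and 1.2 times its 2015–2019 average.

```python
def worst_case_death_bounds(observed_deaths, unobserved_baselines,
                            min_deaths=10, min_increase=0.20):
    """Bounds on total deaths when some municipalities are unobserved.

    observed_deaths: March 2020 deaths in each observed municipality.
    unobserved_baselines: average March 2015-2019 deaths in each
        unobserved municipality (hypothetical input).
    """
    observed_total = sum(observed_deaths)
    # Lower bound: no deaths at all in the unobserved municipalities.
    lower = observed_total
    # Assumed reading of the rule: an excluded municipality failed at least
    # one criterion, so its deaths are capped by the larger of the two limits.
    count_cap = min_deaths - 1
    upper = observed_total + sum(
        max(count_cap, (1 + min_increase) * baseline)
        for baseline in unobserved_baselines
    )
    return lower, upper


def share_observed(n_observed, n_total):
    """rho: the fraction of municipalities that is observed."""
    return n_observed / n_total


# Toy example with made-up numbers (not Istat data).
lower, upper = worst_case_death_bounds([30, 50], [2.0, 10.0])
print(lower, upper, share_observed(2, 4))
```

As the share of observed municipalities approaches one, the second sum is taken over fewer municipalities and the two bounds converge.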

Further assumptions on mortality

If I impose further assumptions on the total number of deaths during the month of March 2020, I obtain narrower bounds. To this aim, consider Fig. 1. Panel (a) shows a hypothetical distribution of year-on-year mortality in a “normal” year (symmetric about zero, without loss of generality); panel (b) shows the distribution in the same municipalities in a year affected by a common shock that increases the mortality rate, like COVID-19. I also show lines for the 0% and 20% increase to reflect the rule adopted by Istat. After the shock, (1) the distribution shifts to the right and (2) the selection rule neglects a large part of municipalities where the increase in mortality is positive but smaller than the threshold set by Istat (“Unobserved” region). Mortality in unobserved municipalities may be recovered using past information:

  1. “Rule monotonicity” (i.e., \(E[M^{Unobs.}_{t}] \geq E[M^{Unobs.}_{t-1}|\text {Istat rule}]\)): in unobserved municipalities, the mortality during t ≡ March 2020 would have been no lower than the mortality of municipalities that would have been excluded if the same selection rules were applied in the previous years (t − 1 ≡ average March 2015–2019), as if COVID-19 did not reach these municipalities:

    (5)

In fact, the existing literature on COVID-19 emphasizes the spatial dimension of the virus (Kang et al. 2020). This suggests that in Lombardia all municipalities experienced COVID-19, so that the mortality associated with the outbreak of the virus adds up to the normal-times mortality:

  2. “COVID-19 monotonicity” (i.e., \(M_{t,i} \geq M_{t-1,i}\ \forall i\)): for each municipality i, the mortality during t ≡ March 2020 cannot be lower than in the previous years (t − 1 ≡ average March 2015–2019), i.e., COVID-19 is not beneficial in any municipality, so that:

    (6)

    Unlike the other assumptions, the “COVID-19 monotonicity” assumption is imposed on each unit of observation and is therefore, in principle, stronger. Differently from much of the existing literature on partial identification, which considers individuals, the unit of analysis in this application is the municipality, and thus, the assumption is really municipality-level. Examples where states rather than individuals are considered are in Manski and Pepper (2013) and Manski and Pepper (2018).Footnote 7

Further assumptions would better distinguish between the three regions in Fig. 1. To the extent that we know more about the virus, we may be more willing to impose more (and appropriate) assumptions.
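The way a monotonicity assumption raises the lower bound can be sketched as follows. This is a hedged illustration with made-up numbers: the function name is hypothetical, and the per-municipality "floor" stands for whatever minimum the chosen assumption implies (zero in the worst case; under “COVID-19 monotonicity”, the municipality's own March 2015–2019 average).

```python
def lower_bound_with_floors(observed_total, unobserved_floors):
    """Lower bound on total deaths given a floor for each unobserved municipality."""
    if any(floor < 0 for floor in unobserved_floors):
        raise ValueError("a floor on deaths cannot be negative")
    return observed_total + sum(unobserved_floors)


# Toy example: 80 observed deaths and three unobserved municipalities.
baselines = [2.0, 10.0, 4.5]  # hypothetical March 2015-2019 averages

worst_case = lower_bound_with_floors(80, [0.0] * len(baselines))
covid_monotonicity = lower_bound_with_floors(80, baselines)
print(worst_case, covid_monotonicity)  # 80.0 96.5
```

Because every floor is non-negative, the refined lower bound can never fall below the worst-case one; the upper bound is untouched by this step.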

Fig. 1 Illustrative example of a shift in mortality induced by COVID-19

Three comments are in order. First, these assumptions have identification power with respect to the lower bound of mortality (\(\underline {M}\)); in the absence of further information, the upper bound of mortality (\(\overline {M}\)) is not affected and it remains as in Eq. 4. Second, although I view the monotonicity assumptions of this subsection as mild, I acknowledge that they might not be innocuous (which is why I impose them only as a further refinement of the “worst-case bounds”). However, both assumptions imply a first-order stochastic dominance over time, which I successfully test below. For an application of first-order stochastic dominance in partial identification, see Bhattacharya et al. (2012) and Chen et al. (2018). Third, in general, going from the first to the second assumption, the bounds narrow.

Finally, it is instructive to look at the “exact DID assumption” (i.e., \({\Delta } M^{Obs.}_{t}\% = {\Delta } M^{Unobs.}_{t} \%\)), such that the average increase in mortality in unobserved municipalities would have been identical to the increase in mortality in observed municipalities, in the absence of COVID-19. This is the approach followed in some early research on this subject (see Rettore and Tonini 2020 for a survey and a critique). This assumption point identifies mortality:

(7)

This quantity reveals that the generalization of the Istat data to the whole population of interest would likely deliver an upward bias of the total mortality equal to \(bias = ({\Delta } M^{Obs.}\% - {\Delta } M^{Unobs.}\%) \times Muni^{Unobs.}\). A formal argument can be found in Heckman (1979) and the following literature.

What is the incidence of COVID-19 in the population?

The incidence of COVID-19 in the population of Lombardia is defined as the ratio of the number of people infected by COVID-19 during period t ≡ March 2020 (\(C^{Tot}\)) to the reference population P, i.e., \(\frac {C^{Tot}}{P}\).Footnote 8 Since I observe the population size, I need to recover only the true number of cases of COVID-19 (\(C^{Tot}\)). I derive this quantity using the same approach as in Section 3.1. Define \(C^{Obs.}\) as an indicator of confirmed cases of COVID-19, which takes value 1 if the tested individual is positive and 0 otherwise; P as the population of interest; and T as the number of tested individuals (i.e., swabs).Footnote 9 It follows that for T individuals I know the outcome of the test, and for NT = P − T individuals, I do not know the COVID-19 condition (\(C^{Unobs.}\) is thus defined similarly to \(C^{Obs.}\) but it is unobserved). The true number of individuals with COVID-19 in Lombardia (\(C^{Tot}\)) is

$$ C^{Tot}=\sum\limits_{T} C^{Obs.} + \sum\limits_{NT} C^{Unobs.}, $$
(8)

where sums are over individuals, and \({\sum }_{T} C^{Obs.}=C^{PC}\) is the number of individuals with COVID-19 as published by Protezione Civile. The main difference from the number of deaths is that I have less information on the Data Generating Process of COVID-19 regarding \(C^{Unobs.}\). Two polar cases are admissible: either none of the untested individuals is positive for COVID-19 (\({\sum }_{NT} C^{Unobs.}=0\)), or all of the NT untested individuals are positive (\(C_{i}^{Unobs.}=1 \forall i\) and thus \({\sum }_{NT} C^{Unobs.}={\sum }_{NT} 1=NT=P-T\)). It follows that \(C^{Tot} \in \{\underline {C}, \overline {C} \}\) where

$$ \begin{array}{@{}rcl@{}} \underline{C} &=& C^{PC} \\ \overline{C} &=& C^{PC} + (P-T), \end{array} $$
(9)

so that the incidence rate is \(\frac {C^{Tot}}{P} \in \left \{\frac {\underline {C}}{P}, \frac {\overline {C}}{P} \right \}\). These bounds rely only on definitions; therefore, I call them “worst-case bounds.”
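The worst-case bounds of Eq. 9 amount to simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the inputs are the figures for Lombardia quoted in the Results section of this paper (49,118 confirmed cases, 141,877 tests, a population of 10,051,747).

```python
def worst_case_case_bounds(confirmed, population, tested):
    """Worst-case bounds on the number of infected individuals (Eq. 9)."""
    if not 0 <= confirmed <= tested <= population:
        raise ValueError("expected 0 <= confirmed <= tested <= population")
    untested = population - tested
    lower = confirmed             # no untested individual is infected
    upper = confirmed + untested  # every untested individual is infected
    return lower, upper


population = 10_051_747
lower, upper = worst_case_case_bounds(49_118, population, 141_877)
print(lower, upper)  # 49118 9958988
print(f"incidence between {lower / population:.2%} and {upper / population:.2%}")
```

With only about 1.4% of the population tested, the two polar cases are almost as far apart as they can be, which is why further restrictions are needed to obtain informative bounds.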

Further assumptions on the incidence of COVID-19

By definition, the total number of individuals suffering from COVID-19 is a weighted sum, with weights given by the proportions of tested (T = 1) and untested (T = 0) individuals:

$$ C^{Tot}=P \left[ E(C=1|T=1)\Pr (T=1) + E(C=1|T=0)\Pr (T=0) \right]. $$
(10)

Using this definition, to narrow the bounds, I exploit the testing procedure adopted in Lombardia. In Lombardia, testing criteria required the person to show symptoms of infection to be tested.Footnote 10 I can thus recast the assumption in terms of symptoms (S = 1 for a symptomatic individual and S = 0 otherwise) and write

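$$ C^{Tot}=P \sum\limits_{\tau \in \{0,1\}} \sum\limits_{s \in \{0,1\}} E(C=1|T=\tau ,S=s)\Pr (T=\tau ,S=s). $$
(11)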

I impose the restrictions \(\Pr (T=1,S=0)=0\) (i.e., an individual has no symptoms but is nonetheless tested), an event excluded by the testing protocols of Lombardia, and \(\Pr (T=0,S=1)=0\) (i.e., the individual has symptoms but is not tested and thus no care is provided), an event excluded because in Italy the National Health Service is universalistic and funded through general taxation (by Constitutional Law).Footnote 11 Lavezzo et al. (2020), Day (2020a), Day (2020b), and Emery et al. (2020) find that the percentage of asymptomatic individuals suffering from COVID-19 in the population is up to about 80% of the individuals suffering from the virus, which corresponds to up to 4 undetected cases for each detected case. I thus impose the “symptoms-monotonicity assumption” that E[C = 1|T = 0,S = 0] ≤ 5E[C = 1|T = 1,S = 1] (I use 5 instead of 4, to be more conservative; see also Footnote 11).

Using this restriction with the definition in Eq. 11, the upper bound of Eq. 9 shrinks to

$$ \overline{C}=C^{PC} + (P-T) 5 E(C=1|T=1,S=1), $$
(12)

where E(C = 1|T = 1,S = 1) can be recovered using data on the infected population from Protezione Civile.
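A minimal sketch of Eq. 12 follows, with the rate E(C = 1|T = 1,S = 1) left as an input. The function name is hypothetical. In the usage example, the rate is measured as confirmed cases over the whole population; this is an assumption of the sketch, not something stated in the text, chosen because with the figures quoted in the Results section it gives a value close to the upper bound of 291,242 reported there.

```python
def refined_upper_bound(confirmed, population, tested, positive_rate, multiplier=5):
    """Upper bound on infections under the symptoms-monotonicity restriction (Eq. 12)."""
    if not 0 <= positive_rate <= 1:
        raise ValueError("positive_rate must be a share between 0 and 1")
    untested = population - tested
    # A share of infected individuals cannot exceed one, so cap the scaled rate.
    untested_rate = min(1.0, multiplier * positive_rate)
    return confirmed + untested * untested_rate


confirmed, population, tested = 49_118, 10_051_747, 141_877
rate = confirmed / population  # assumption of this sketch, see the lead-in
upper = refined_upper_bound(confirmed, population, tested, rate)
print(round(upper))
```

When the scaled rate reaches one, the function falls back to the worst-case upper bound of Eq. 9, so the restriction can only tighten the bound.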

What is the mortality rate of COVID-19?

I define the mortality rate from COVID-19 as the ratio of total deaths from the virus (\(D^{Tot.}\)) to total cases (\(C^{Tot.}\)), or \(MC^{*} = \frac {D^{Tot.}}{C^{Tot.}}\).Footnote 12 The excess mortality of March 2020 with respect to the same month in the average between 2015 and 2019 is due to COVID-19, because in Lombardia there was no ongoing policy in March 2015–2020 that might have increased mortality.Footnote 13

The results from Sections 3.1 and 3.2 can be used to build \(MC^{*} \in \left \{ \frac {\underline {\Delta M}}{\overline {C}} , \frac {\overline {\Delta M}}{\underline {C}}\right \}\), where Δ is for the difference between the two periods.Footnote 14
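The bounds on a ratio combine the extremes of numerator and denominator in opposite directions. The sketch below illustrates this with made-up numbers; the function name and inputs are hypothetical and do not reproduce any table of this paper.

```python
def mortality_rate_bounds(excess_lower, excess_upper, cases_lower, cases_upper):
    """Bounds on deaths/cases from bounds on excess deaths and on cases."""
    if cases_lower <= 0:
        raise ValueError("the lower bound on cases must be positive")
    if excess_lower > excess_upper or cases_lower > cases_upper:
        raise ValueError("each lower bound must not exceed its upper bound")
    # Fewest deaths over most cases, and most deaths over fewest cases.
    return excess_lower / cases_upper, excess_upper / cases_lower


# Toy example: between 100 and 200 excess deaths, between 1,000 and 5,000 cases.
lower, upper = mortality_rate_bounds(100, 200, 1_000, 5_000)
print(lower, upper)  # 0.02 0.2
```

Tightening either set of bounds narrows the interval for the mortality rate, which is why the restrictions on deaths and on cases both matter.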

Continuing with the comparison with point identification, Protezione Civile releases data on mortality. This mortality refers to people that we know died with COVID-19, because they were tested. This number does not reflect the overall mortality from COVID-19 for reasons related to testing procedures explained in Section 3.2. As a consequence, if the scope of the exercise is to derive the mortality rate from COVID-19, the information content of the data from Protezione Civile is incomplete.Footnote 15

To conclude this section, it is worth emphasizing that the mortality rate has also been derived by Manski and Molinari (2020), which makes clear the connection between the two papers. They derive the bound of mortality rate as \(MC^{*}=\frac {P(D=1)}{P(C=1)}\), which is identical to the definition in this paper.Footnote 16 There are, however, differences between the two approaches in the timing, the numerator, and the denominator. As for the timing, Manski and Molinari (2020) calculate the bounds on a daily basis (between mid-March and mid-April 2020). This is possible because the probability of deaths (D), the numerator, in Manski and Molinari (2020) is obtained from Protezione Civile and not from Istat. These two differences together show that different data provide different information and make it possible to look at different aspects of the disease. On the one hand, the data from Protezione Civile are released daily, and therefore, they make it possible to track the evolution of the virus over time; the selection rules of Istat are not informative about the daily evolution of mortality in the unobserved municipalities, and therefore, they do not allow bounds to be derived at a daily frequency. On the other hand, the reference population of Protezione Civile is made of individuals who tested positive for COVID-19, and therefore, these data are not informative about individuals who died without being tested; Istat data consider the entire population.

As for the denominator, the probability of infection (C) in Manski and Molinari (2020) also takes into account the negative predictive value of the test, through the probability that an individual is tested (T = 1) and receives a negative result (R = 0), but in fact is infected, i.e., P(C = 1|T = 1,R = 0), which is the complement of the negative predictive value. Although I recognize the relevance of this quantity, I do not consider it because I do not currently have administrative information about it (on this subject see the interesting explanation in Manski and Molinari 2020, Section 2.1).

The definition of populations used to derive the bounds is therefore somewhat different between the approaches. For this reason, the comparison of the mortality rate between this paper and Manski and Molinari (2020) will be important to understand what we can learn from different data, which implicitly allow for different assumptions; after the differences of the data are taken into account, we can also provide some (non-conclusive) empirical evidence in favor of the assumptions made by both papers—if the conclusions are similar.

Results

In this section, I apply bounds of Section 3 to obtain the true number of deaths, the incidence of COVID-19 in the population, and the mortality rate from COVID-19, for the region of Lombardia during March 2020.

For a more direct comparison to Manski and Molinari (2020) and because the epidemiological research is ongoing on the matter, I do not consider the dynamics of the epidemic (results are qualitatively similar if I impose a delay between the onset of symptoms and deaths from COVID-19 of up to 10 days, which is appropriate for Italy and above the median of 5 days; ISS 2020).Footnote 17

I first impose assumptions based exclusively on definitions, which will always be satisfied; I empirically show that the larger the set of assumptions the smaller the bounds and even mild restrictions are highly informative. However, the credibility of inference decreases with the strength of the assumptions maintained (Manski 2011, “Law of decreasing credibility”). This is well reflected in the assumptions underlying point identification, whose validity is rejected in this application.

This is a very important result for the credibility of assumptions that are imposed in the ongoing research on COVID-19 and the real-time estimates produced by the NSIs (see, for example, the large revision of mortality in Spain; similar issues are relevant in Brazil, China, and Russia, to mention a few). More generally, using a restricted sample to draw general conclusions rests critically on unsupported assumptions (or wishful extrapolation). See Manski (2011) for a complete treatment of the subject.

While interpreting the results, it is worth bearing in mind that as more data or more knowledge about the virus becomes available, more assumptions could be imposed and the bounds will narrow.

Total number of deaths in March 2020

The bounds for the total number of deaths are in Table 2. The upper bound derives from the selection rule adopted by Istat, and it is equal to 28,301 total deaths between 1 March 2020 and 4 April 2020 in Lombardia.

Table 2 Bounds on number of deaths

The lower bound depends on which assumptions I am willing to impose. Under the worst-case scenario, which relies exclusively on the idea that no deaths are registered in unobserved municipalities, at least 19,824 deaths are observed. (Notice that 19,824 is the same number reported in the descriptive statistics of Table 1.) With this minimal set of assumptions, the width of the bounds is about 8500 deaths.

The larger the set of assumptions, the narrower the bounds. In Section 3.1.1, I consider monotonicity assumptions. As an indirect test in favor of these assumptions, I successfully tested the first-order stochastic dominance, necessary for the monotonicity assumptions, by means of a Kolmogorov-Smirnov test (available upon request). The identification power of “Rule monotonicity” is already remarkable and makes the lower bound of deaths in Lombardia equal to 21,558, thus shrinking the width by 20% (to 6743 deaths); the “COVID-19 monotonicity” provides slightly more information and shrinks the width of the bounds by 30% (to 5792 deaths), setting the lower bound of deaths to about 22,500.Footnote 18
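The identification power of an assumption can be read off as the reduction in the width of the bounds. The following sketch reproduces that arithmetic from the figures quoted in this subsection (upper bound 28,301; lower bounds 19,824 and 21,558); the function and variable names are illustrative.

```python
UPPER_BOUND = 28_301
LOWER_BOUNDS = {"worst case": 19_824, "rule monotonicity": 21_558}


def width_and_reduction(lower, upper, reference_width):
    """Width of the bounds and its relative reduction versus a reference width."""
    width = upper - lower
    return width, 1 - width / reference_width


reference_width = UPPER_BOUND - LOWER_BOUNDS["worst case"]
for label, lower in LOWER_BOUNDS.items():
    width, reduction = width_and_reduction(lower, UPPER_BOUND, reference_width)
    print(f"{label}: width = {width}, reduction = {reduction:.1%}")
```

The same calculation applies to any further assumption: only the lower bound changes, while the upper bound stays fixed by the selection rule.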

Once the bounds of deaths in Lombardia during March 2020 are recovered, they can be compared to the observed mortality during the same period between 2015 and 2019 (equal to 9739 deaths, on average). Four main conclusions can be drawn from these bounds. First, no matter which assumption I impose, the number of deaths during March 2020 is substantially higher than in the (average) 2015–2019 period. The claim that deaths did not increase after COVID-19 (e.g., Becchi and Zibordi 2020) can be dismissed. Second, at least 10,000 to 13,000 more deaths were registered, depending on the assumption imposed. Third, no matter which assumption is imposed, at most 18,500 more deaths occurred in Lombardia during March 2020 than in the (average) 2015–2019 period. To better appreciate the power of set identification, I compare the predictions from this approach to the later release of the data containing all the municipalities. In Lombardia, there were 27,500 deaths over the same period of 2020, about 18,000 more than during 2015–2019.Footnote 19 This result shows that the predictions based on partial identification do not rely on wishful extrapolation: the true numbers are within the bounds introduced in this paper.
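As a minimal sketch of the comparison above, the excess-death bounds can be obtained by subtracting the 2015–2019 average from the bounds on total deaths, and the figure from the complete data release can then be checked against each identification region. All numbers are those quoted in the text; the names are hypothetical.

```python
BASELINE = 9_739   # average deaths in the same period, 2015-2019
UPPER = 28_301     # upper bound on total deaths (Istat selection rule)
REALIZED = 27_500  # deaths reported once all municipalities were released

# Bounds on total deaths under each assumption (lower, upper).
total_bounds = {
    "worst case": (19_824, UPPER),
    "rule monotonicity": (21_558, UPPER),
    "COVID-19 monotonicity": (22_500, UPPER),
}

# Excess deaths: subtract the baseline from both ends of each interval.
excess_bounds = {
    name: (lo - BASELINE, hi - BASELINE) for name, (lo, hi) in total_bounds.items()
}

# Does the realized figure fall inside every identification region?
inside = {name: lo <= REALIZED <= hi for name, (lo, hi) in total_bounds.items()}

realized_excess = REALIZED - BASELINE

for name, (lo, hi) in excess_bounds.items():
    print(f"{name:>22}: excess deaths in [{lo}, {hi}], contains realized: {inside[name]}")
print(f"realized excess deaths: {realized_excess}")
```

The lower ends (10,085 and 12,761) correspond to the "at least 10,000 to 13,000" of the text, and the upper end (18,562) to the "at most 18,500".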

Fourth, and extremely important given the several attempts to generalize the observed sample of municipalities to the entire region, if I apply the “exact DID assumption” without covariates, 30,775 deaths are estimated (30,109 considering the intervals at 95% confidence levels). This result is inconsistent with the precise and complete implementation of the selection rule of Istat: to see this, notice that the estimated number is higher than the upper bound (equal to 28,301).Footnote 20,Footnote 21 In this respect, future research on COVID-19 should pay close attention when imposing assumptions such as, for example, parallel trends (Goodman-Bacon and Marcus 2020).
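The argument in this paragraph uses the bounds as a specification check: a point estimate that falls outside the identification region signals that the assumptions behind it are incompatible with the data-generating rule. A minimal sketch of that check, using only the numbers quoted above (the helper name is hypothetical):

```python
UPPER_BOUND = 28_301  # upper bound on total deaths from the Istat selection rule

# Values quoted in the text for the "exact DID assumption" without covariates.
did_values = {
    "point estimate": 30_775,
    "value quoted with the 95% interval": 30_109,
}


def exceeds_upper_bound(estimate, upper=UPPER_BOUND):
    """True if the estimate lies above the upper bound of the identification region."""
    return estimate > upper


# Amount by which each DID value overshoots the upper bound.
overshoot = {name: value - UPPER_BOUND for name, value in did_values.items()}

for name, value in did_values.items():
    print(f"{name}: {value}, above upper bound: {exceeds_upper_bound(value)} "
          f"(by {overshoot[name]} deaths)")
```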

What is the incidence of COVID-19 in the population?

The bounds for the incidence of COVID-19 are in Table 3. As the number of swabs in Lombardia is very small (141,877 tests over a population of 10,051,747, or 1.4%), the worst-case bounds, based only on definitions, are remarkably wide: according to the lower bound, at least 49,118 people suffered from COVID-19 in Lombardia in March 2020. The upper bound is derived under the extreme possibility that all the remaining population suffers from the virus, i.e., 9,909,870 (= 10,051,747 − 141,877) individuals: this gives an upper bound of patients suffering from COVID-19 equal to 9,958,988. If I impose the “test-monotonicity assumption” of Eq. 12, the upper bound shrinks dramatically to 291,242 individuals. To achieve point identification, one can exploit the universalistic coverage of the Italian National Health Service to impose that all the people at risk of COVID-19 are tested. This would imply that data from Protezione Civile are complete (row “Protezione Civile” in Table 3). This point identification is equal to the lower bound in Table 3.Footnote 22 However, taking point identification as “the true number” neglects the untested, asymptomatic population, against epidemiological evidence (Lavezzo et al. 2020; Day 2020a; 2020b).
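The worst-case bounds on the number of cases can be reproduced from the three figures given above. The sketch below is only illustrative: the names are hypothetical, and the test-monotonicity upper bound (291,242) is not derived here because it depends on Eq. 12, which is outside this section.

```python
POPULATION = 10_051_747  # population of Lombardia
TESTED = 141_877         # individuals tested (swabs)
CONFIRMED = 49_118       # confirmed cases among the tested


def worst_case_case_bounds(population, tested, confirmed):
    """Worst-case bounds on total cases, using only definitions.

    Lower bound: none of the untested individuals is infected.
    Upper bound: all of the untested individuals are infected.
    """
    untested = population - tested
    return confirmed, confirmed + untested


share_tested = TESTED / POPULATION
lower, upper = worst_case_case_bounds(POPULATION, TESTED, CONFIRMED)

print(f"share of population tested: {share_tested:.1%}")
print(f"worst-case bounds on cases: [{lower}, {upper}]")
```

If the whole population were tested, the untested group would be empty and the two bounds would coincide.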

Table 3 Bounds on COVID incidence

With the number of infected people (CTot.), I can derive the incidence of COVID-19 in the population by dividing CTot. by the population. This is what I do in the last three columns of Table 3. The incidence rate is between 489 and 99,077 cases per 100,000 inhabitants in the worst-case bounds, and between 489 and 2897 cases per 100,000 inhabitants when imposing test-monotonicity.
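The conversion from case counts to incidence rates is a simple rescaling. As an illustrative sketch (function and variable names are assumptions, the numbers are those in the text):

```python
POPULATION = 10_051_747  # population of Lombardia

# Bounds on the total number of cases quoted in the text (lower, upper).
case_bounds = {
    "worst case": (49_118, 9_958_988),
    "test-monotonicity": (49_118, 291_242),
}


def per_100k(cases, population=POPULATION):
    """Incidence expressed as cases per 100,000 inhabitants, rounded to an integer."""
    return round(cases / population * 100_000)


incidence = {
    name: (per_100k(lo), per_100k(hi)) for name, (lo, hi) in case_bounds.items()
}

for name, (lo, hi) in incidence.items():
    print(f"{name:>18}: between {lo} and {hi} cases per 100,000 inhabitants")
```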

The worst-case bounds are not very informative, but this is not a weakness of the approach. Three issues are worth emphasizing. First, the large width of the worst-case bounds has a clear policy implication for COVID-19: ‘test, test, test’, as suggested by the WHO. If the whole population were tested, then TP = 0 and the variable would be point identified. This source of point identification is intrinsically different from that obtained using untenable assumptions (row “Protezione Civile” in Table 3). Second, from an epidemiological perspective, knowledge of the sequence of the virus and of how it interacts with people would support some assumptions rather than others. Until such knowledge is available, introducing assumptions introduces the possibility of errors. Third, for the release of data on COVID-19, it is important to have more information than is currently available: suppose we learn that a specific group of individuals in the population is immune; if we do not know the confirmed cases or swabs by group of individuals, this knowledge is useless for shrinking the bounds. Narrower bounds would be relevant for a cure against the virus and would provide the Government with better information for phasing out or adjusting the lockdown.

What is the mortality rate of COVID-19 in the population?

In Table 4, I derive the bounds on the mortality rate due to COVID-19. The row headers are the assumptions imposed on the number of deaths; the column headers are the assumptions imposed on the number of cases of COVID-19. Using exclusively the definitions for both variables (“Worst” for both column and row), the width of the bounds is very large, and the mortality rate goes from 1 in every 1000 cases (0.001 in the lower bound) to 378 (0.378 in the upper bound). The gain from imposing assumptions on the number of deaths (i.e., going from the top to the bottom of the table within the first column) is fairly limited. By contrast, the gain from imposing assumptions on the number of cases of COVID-19 is substantial: going from the left to the right of the table, the lower bound increases considerably (to between 3.5% and 4.5%).
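The bounds on a ratio combine the bounds on its numerator and denominator: the lowest rate pairs the fewest deaths with the most cases, and the highest rate pairs the most deaths with the fewest cases. The sketch below applies this to the figures quoted in the text. It assumes that the numerator is excess deaths relative to the 2015–2019 average (9739), which is the reading that reproduces the 0.001 and 0.378 reported above; it is an illustration, not the paper's own code.

```python
BASELINE = 9_739  # average deaths in the same period, 2015-2019 (assumed numerator offset)

# Bounds on total deaths (rows of the table) and on cases (columns), as (lower, upper).
death_bounds = {
    "worst": (19_824, 28_301),
    "COVID-19 monotonicity": (22_500, 28_301),
}
case_bounds = {
    "worst": (49_118, 9_958_988),
    "test-monotonicity": (49_118, 291_242),
}


def mortality_bounds(deaths, cases, baseline=BASELINE):
    """Bounds on excess deaths per case.

    Lower bound: fewest excess deaths over most cases.
    Upper bound: most excess deaths over fewest cases.
    """
    d_lo, d_hi = deaths[0] - baseline, deaths[1] - baseline
    c_lo, c_hi = cases
    return d_lo / c_hi, d_hi / c_lo


table = {
    (row, col): mortality_bounds(d, c)
    for row, d in death_bounds.items()
    for col, c in case_bounds.items()
}

for (row, col), (lo, hi) in table.items():
    print(f"deaths: {row:<22} cases: {col:<18} rate in [{lo:.3f}, {hi:.3f}]")
```

Under these inputs, moving down a column barely changes the lower bound, whereas moving from the "worst" to the "test-monotonicity" column raises it from about 0.001 to roughly 0.035–0.044.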

Table 4 Bounds on mortality rates

These rates compare to the rate of 0.176 that was discussed for a long time in the Italian debate. This ratio is obtained by dividing the number of deaths (8656) by the total number of COVID-19 cases (49,118) from Protezione Civile. Based on this approach, it was argued that there is something special in the mortality rate of Lombardia compared to the rest of the world (see Favero et al. 2020a for a summary). For example, in the Diamond Princess “experiment,” the mortality rate was 0.015.

Four main conclusions can be drawn from the bounds on the mortality rate. First, the width of the bounds in Table 4 is large for the same reason discussed above: the limited knowledge of the virus. If I choose to impose further assumptions, I introduce the possibility of errors. Second, more caution is needed when arguing that there is something special in the mortality rate of Lombardia compared to the rest of the world, because the data are consistent with a much smaller rate than that obtained using the standard approaches (for similar conclusions see Odone et al. 2020; Favero 2020). Third, the point estimate based on the standard DID approach is inconsistent with the (precise and complete) application of the selection rules of Istat, because the rate of 0.428 (± 5% confidence intervals) is above the upper bound. This result confirms and complements the warning about the exact DID assumption in this application (Section 4.1). Fourth, the worst-case bounds are comparable to those on the mortality rate calculated for Lombardia using the bounds in Manski and Molinari (2020). The lower bounds are identical. Their upper bound is remarkably smaller than mine (15% compared to 38%). The difference between the two upper bounds of mortality rates is related to the probabilities of death and of infection (Section 3.3). It is therefore useful and instructive to go from the bounds of this paper to those in Manski and Molinari (2020).
To this aim, I focus on the probability of death; the difference in the probability of infection depends on the contribution of false negative results, which is quantitatively small because of the small proportion of individuals tested in Lombardia at the end of March 2020.Footnote 23 If I derive the bounds of the two papers using the 18,562 deaths obtained from the Istat data, the upper bound of the mortality rate is 31.8% (\(=\frac {\overline {P(D=1)}}{\underline {P(C=1)}}=\frac {18562/10051747}{0.006}=\frac {0.002}{0.006}\), where 10,051,747 is the population of Lombardia); if I instead use the 8656 deaths from the Protezione Civile data, the upper bound of the mortality rate is 14.8% (\(=\frac {\overline {P(D=1)}}{\underline {P(C=1)}}=\frac {8656/10051747}{0.006}=\frac {0.001}{0.006}\)). This exercise shows that the only differences between the two approaches are in the information exploited to derive the bounds. Considered together, the two approaches give a concrete idea of the uncertainty surrounding the relevant populations and of the relevance of the information that is used: given our largely incomplete knowledge of the disease, it is worth discussing both bounds, which thus complement each other. Once the differences across the data are taken into account, the two approaches lead to identical conclusions. Finally, if I also consider the asymptomatic individuals (Day 2020b), the upper bound of mortality further drops to 7.6%, with 18,562 deaths.
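The two upper bounds above share the same denominator and differ only in the death count. The sketch below recomputes the ratio \(\overline{P(D=1)}/\underline{P(C=1)}\) from the figures in the text. One caveat, which is an inference rather than something stated in the paper: the displayed denominator 0.006 appears to be rounded, because using it verbatim gives about 30.8% and 14.4%, while the reported 31.8% and 14.8% imply a lower bound on P(C=1) of roughly 0.0058. Function names are hypothetical.

```python
POPULATION = 10_051_747      # population of Lombardia
P_C_LOWER_DISPLAYED = 0.006  # lower bound on P(C=1) as displayed (apparently rounded)

# (death count, upper bound on the mortality rate reported in the text)
sources = {
    "Istat": (18_562, 0.318),
    "Protezione Civile": (8_656, 0.148),
}


def upper_mortality(deaths, p_c_lower, population=POPULATION):
    """Upper bound on the mortality rate: upper P(D=1) over lower P(C=1)."""
    return (deaths / population) / p_c_lower


def implied_p_c_lower(deaths, reported_rate, population=POPULATION):
    """Lower bound on P(C=1) implied by a reported upper bound on the mortality rate."""
    return (deaths / population) / reported_rate


for name, (deaths, reported) in sources.items():
    with_displayed = upper_mortality(deaths, P_C_LOWER_DISPLAYED)
    implied = implied_p_c_lower(deaths, reported)
    print(f"{name}: reported {reported:.1%}, with 0.006 -> {with_displayed:.1%}, "
          f"implied lower bound on P(C=1) = {implied:.4f}")
```

Both data sources imply the same denominator to four decimal places, which is consistent with the statement that the two approaches differ only in the information used for the numerator.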

Conclusions

This paper seeks to obtain early, reliable estimates of the number of deaths, the incidence of COVID-19 in the population, and the mortality rate from COVID-19 in the Italian region of Lombardia during March 2020, using administrative data. The outcomes that I focus on are of great policy relevance, given the limited availability of both data and epidemiological knowledge of the virus, on the one hand, and the need for the policy maker to make appropriate decisions to safely restart normal life (Bonacini et al. 2021) and to manage a possible future resurgence of COVID-19, on the other hand (Favero et al. 2020b; Ceriani and Verme 2020). The case of Lombardia is very interesting in this context because it is one of the regions hardest hit by the COVID-19 pandemic in the world. I find that between 10,000 and 18,500 more deaths occurred during March 2020 than the 2015–2019 average.

Mortality rates are between 0.001 and 0.378; therefore, one should be cautious before concluding that there is something special in the mortality rate of Lombardia, because the observed data might be comparable to those of other regions in the world. If I impose further assumptions, the upper bound of mortality drops dramatically, to 7.6% if the asymptomatic individuals are considered. This percentage is well below the 17.6% discussed for a long time in Lombardia.

This paper contributes to a small literature on COVID-19 that uses partial identification. By using partial identification, I avoid strong assumptions: given the limited knowledge about the virus, this is a strength of the approach, which may be useful for the growing literature on the disease. This limited knowledge is clearly reflected in the width of the bounds (Manski and Molinari 2020). Although the bounds are wide, in this application partial identification is still more informative than point identification, because the assumptions underlying the latter approach are strongly rejected by the former. In my opinion, the limitations of point identification outlined in this paper may provide a checklist for the assumptions that are currently imposed in the research on COVID-19 (see also Goodman-Bacon and Marcus 2020).