The SIR model towards the data

In this work, the SIR epidemiological model is reformulated so as to highlight the important effective reproduction number, and to account for the generation time, the inverse of the incidence rate, and for the infectious period (or removal period), the inverse of the removal rate. The aim is to check whether the relationships the model poses among the various observables are actually found in the data. The case study is the second through the third wave of the Covid-19 pandemic in Italy. Given its scale invariance, the model is initially tested with reference to the curve of swab-confirmed infectious individuals only. It is found to match the data if the curve of the removed (that is, healed or deceased) individuals is assumed to be underestimated by a factor of about 3, together with other related curves. At the same time, the generation time, the removal period and the effective reproduction number are obtained by fitting the SIR equations to the data; the outcomes prove to be in good agreement with those of other works. Then, using knowledge of the proportion of Covid-19 transmissions likely occurring from individuals who did not develop symptoms, and who were thus mainly undetected, an estimate of the real numbers of the epidemic is obtained, which also appears to be in good agreement with results from other, completely different works. The approach of this work is new, and the procedures, computationally very inexpensive, can be applied to any other national or regional case besides the Italian case studied here.


Introduction
The SIR model [1][2][3][4][5][6], developed by Kermack and McKendrick [1] in 1927, is the well-known, very simple model of infectious diseases that considers three compartments, recalled here to fix terminology and notation: the compartment S of susceptible individuals; the compartment I of the infectious (or currently positive) individuals, who have been infected and are capable of infecting susceptible individuals during the infectious period; the compartment R of the removed individuals, who recovered from the disease or died from it, the former assumed to remain immune afterwards. Births and non-epidemic-related deaths are neglected. The cardinality of each compartment is indicated with the corresponding non-bold letter, while N denotes the total population involved at an initial time t_0:
$$N = S(t_0) + I(t_0) + R(t_0) \tag{1}$$
The disease incidence rate β is defined so that β S I gives the number of new infections per unit time [5]; the removal rate γ is defined so that γ I gives the rate at which infectious individuals "deactivate" (heal or die). Typically, β is taken constant over time, which is not the general case, owing to possible mutations of the disease carrier or to social measures that counter the spread of the infection. Also, to simplify the mathematics, two time scales are usually neglected: the generation time g, that is the time lapse of the infector-infected pairing, and the removal period r, which is the average time between infection and recovery or death, despite the important relation
$$\gamma = \frac{1}{r} \tag{2}$$
Within the removal period r, a typical infectious individual is expected to cause r β S new infections; this defines a function of time that, normalized, is called the effective reproduction number R_t (see, for instance, [7]). So, the SIR equations as used here become Eqs. 4a-4c, of which Eq. 4c, for the removed compartment, reads
$$\frac{dR}{dt}(t + r) = \gamma\, I(t) \tag{4c}$$
Of course, they imply that the sum S + I + R is conserved, so that S(t) + I(t) + R(t) = N at any time t.
e-mail: ignazio.lazzizzera@unitn.it (corresponding author)

Outlines of the work
The purpose of this work is to check whether the relations established in Eq. 4 are actually found in the data or, at least, whether "corrections" accounting for incomplete data or systematic errors may or should be introduced, so that the relationships are then satisfied. Crucial is the fact that the model is scale-invariant, which allows one to conveniently choose as a reference one sub-compartmental curve whose real data can be considered reliable, such as the curve of the swab-confirmed infectious individuals. This choice is indeed made here: the swab-confirmed infectious are mostly individuals who have developed symptoms, and they are actually found to cover a nearly constant fraction of all the infectious people, given that symptomatic and asymptomatic individuals are roughly fractions of the age groups of the over-sixties and of younger people, respectively ([8][9][10][11]). The case of the second through the third pandemic wave of Covid-19 in Italy is studied. First, it will be shown that the relation established through Eq. 4c holds true if R(t) is scaled by a factor that is obtained, together with the removal period r, by a least-square matching procedure over the data. Arguments will be given for why a scaling-up of the official data should indeed be required. Once r, and thus γ, is obtained, the effective reproduction number is evaluated through Eq. 4b; the evaluation is reliable despite using the swab-confirmed infection cases only, because that equation is scale-invariant on its own.
The transition to real numbers is finally made by correcting the swab-confirmed infectious cases for the proportion of transmissions that likely occur from asymptomatic subjects. The results are compared with those obtained at the MRC Centre for Global Infectious Disease Analysis, Imperial College London (ICL, [12]), where a completely independent approach is used.

The data set
The data set is from Italy's Department of Protezione Civile [13] and covers the period from 1 June 2020 to 31 May 2021. Since every weekend the recording of cases was postponed to a few days later, the data are smoothed, according to common practice, via a multi-day moving average; the chosen window is 9 days, so as to systematically include a couple of days after each weekend.

The swab-confirmed infectious towards the daily removed
Verifying that the relationship given by Eq. 4c is indeed found in the data is not trivial. For example, there is evidence that the monthly deaths from Covid-19 in 2020, as given by Italy's Department of Protezione Civile, are largely underestimated: this is shown by a study by ISTAT (Italy's Istituto Nazionale di Statistica) on the monthly excess of deaths in 2020, compared to the corresponding averages over the previous five years (see [14] and [15]). The matter is illustrated in Fig. 1. In addition, it is to be expected that R(t) does not include most of the cases that had an asymptomatic or mild course. Also, asymptomatic infected people are probably not reported among the infectious, whereas by far most of the reported infectious are those who had swab confirmation, whose number will be called I_sc. Let Ř denote the curve of the actually registered healed and deceased: it is found that its derivative, the daily variation, shifted forward in time, is indeed proportional to I_sc(t). To verify Eq. 4c methodically, the correction factor k_rel is introduced so as to give maximum generality to a least-square search over the positive definite form of Eq. 5, with varying k_rel and removal period r. The notation k_rel is intended to emphasize that any correction on Ř(t), possibly required by the SIR model at this stage, is relative to the swab-confirmed infectious population only. The sum in Eq. 5 is over the days of the pandemic period considered, with the choice of weighing all daily data equally. In principle two functions of the kind of Eq. 5 should be used, one for deaths and one for recoveries; however, deaths are well known to be at most some 4-5% of the whole removed compartment, so that the error made by simplifying as in Eq. 5 is probably negligible, given the huge statistical and systematic uncertainties in the data.
The minimization is performed using a C++ object of the class Minimizer of the CERN package ROOT, typically used by high energy physicists in their data analysis ([16,17]); its statistical methodology is described in [18]. Since the surface defined from the data through Eq. 5 is rather rough, the minimization algorithm is run 150,000 times to maximize the chance of hitting an optimal minimum: the initial values of k_rel and r are drawn at random in the intervals [1.0, 5.0] and [5.0, 18.0] respectively. Execution on raw and smoothed data takes about one minute altogether. The final values of k_rel and r and their uncertainties δk_rel and δr are taken as the mean and the standard deviation of the distributions of the respective outcomes of the iterated minimizations, weighted with the normalized inverse of the χ².
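The two ingredients of this procedure, the least-square form and the χ²-weighted averaging of the outcomes, can be sketched as below. This is only an illustration under stated assumptions: the form is assumed to be the sum of squared residuals of Eq. 4c with R(t) replaced by k_rel Ř(t) and I(t) by I_sc(t); the daily-removed series is interpolated linearly for non-integer r; the function names and the synthetic data are hypothetical; the ROOT Minimizer and the 150,000 random restarts are not reproduced.

```python
import math


def removed_rate_at(daily_removed, day):
    """Linearly interpolate the daily-removed series at a fractional day."""
    i = math.floor(day)
    if i < 0 or i + 1 >= len(daily_removed):
        return None  # outside the recorded period
    frac = day - i
    return (1.0 - frac) * daily_removed[i] + frac * daily_removed[i + 1]


def chi2(k_rel, r, i_sc, daily_removed):
    """Assumed form: sum over days of [k_rel * dR/dt(t + r) - I_sc(t) / r]^2."""
    total = 0.0
    for t, infectious in enumerate(i_sc):
        rate = removed_rate_at(daily_removed, t + r)
        if rate is not None:
            total += (k_rel * rate - infectious / r) ** 2  # equal weights
    return total


def weighted_mean_std(outcomes, chi2_values):
    """Mean and standard deviation weighted by the normalized inverse chi2."""
    weights = [1.0 / c for c in chi2_values]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    mean = sum(w * x for w, x in zip(weights, outcomes))
    variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, outcomes))
    return mean, math.sqrt(variance)


# Synthetic check (hypothetical numbers): removals built with k_rel = 3, r = 10.
i_sc = [1000.0 * math.exp(-((t - 60) / 15.0) ** 2) for t in range(120)]
daily_removed = [0.0] * 10 + [x / 30.0 for x in i_sc[:110]]

chi2_at_truth = chi2(3.0, 10.0, i_sc, daily_removed)
chi2_off = chi2(3.0, 13.0, i_sc, daily_removed)
k_mean, k_std = weighted_mean_std([3.0, 3.4], [1.0, 1.0])
```

In a full run, each random start in [1.0, 5.0] × [5.0, 18.0] would be handed to a local minimizer, and the list of minima with their χ² values would then be passed to `weighted_mean_std`.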
The results are shown on the first and the second lines of Table 1, for the raw and the smoothed data respectively.
Since the value of the removal period r is critical in determining k_rel, it is sought from the data in two further independent ways, as explained in the next two subsections.

The removal period from a Gaussian fit
At any new "wave" of the epidemic, the rise in the number of infectious individuals follows, to a good approximation, a sigmoidal shape: it is roughly exponential at the very beginning, up to an inflection point, after which it bends towards a plateau. Consequently, its daily variation (the time derivative) exhibits a maximum at the inflection point, around which it is approximately Gaussian. If Eq. 4c correctly describes the data, an analogous shape should be found in the second derivative of the removal curve. Very remarkably, this is in fact the case: the distance in time between the vertexes of the two fitting Gaussians gives a measurement of the removal period r, reported in the third line of Table 1 with associated uncertainty and fit χ². The uncertainty is the sum in quadrature of the uncertainties on the position in time of the vertexes of the two fitting Gaussians; the χ², with its reduced value, is the worse of the two. The fit algorithm is from the already mentioned ROOT package (CERN, [19]).
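The idea of reading the removal period off the distance between two Gaussian vertexes can be illustrated with a deliberately simplified stand-in for the ROOT fit. The sketch below does not perform a full Gaussian fit: it assumes each curve is locally Gaussian around its maximum and locates the vertex from the three samples around the largest value, using the fact that the logarithm of a Gaussian is a parabola. Function names and the synthetic curves are hypothetical.

```python
import math


def gaussian_peak(y):
    """Sub-day position of the maximum of y, assuming y is locally Gaussian.

    Three-point estimate on log(y); the samples around the maximum must be
    strictly positive.
    """
    i = max(range(len(y)), key=lambda j: y[j])
    if i == 0 or i == len(y) - 1:
        raise ValueError("maximum lies on the edge of the series")
    a, b, c = math.log(y[i - 1]), math.log(y[i]), math.log(y[i + 1])
    return i + 0.5 * (a - c) / (a - 2.0 * b + c)


def removal_period(d_infectious, d2_removed):
    """Distance in days between the vertexes of the two derivative curves."""
    return gaussian_peak(d2_removed) - gaussian_peak(d_infectious)


# Synthetic curves (hypothetical): vertexes at day 30.3 and day 40.3.
d_infectious = [math.exp(-((t - 30.3) ** 2) / (2 * 36.0)) for t in range(80)]
d2_removed = [0.3 * math.exp(-((t - 40.3) ** 2) / (2 * 49.0)) for t in range(80)]
r_estimate = removal_period(d_infectious, d2_removed)
```

On real, noisy data a fit over many points, as done in the paper, is preferable to a three-point estimate, which is shown here only because it makes the geometry of the measurement explicit.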

The removal period from the "asymmetric sigmoid derivative" fit
Given the almost sigmoidal initial growth of an epidemic wave, as already mentioned in the last subsection, an alternative fit function turns out to be an asymmetric modification of the derivative of a sigmoid, which will be called the skew sigmoid derivative. It has a single absolute maximum and is plotted in Fig. 3 for μ = 1, σ = 1 and A = 1, and for various values of the skewness parameter: for a skewness parameter equal to 0.5 one has the derivative of a true sigmoid. The fits of the skew sigmoid derivative to the first derivative of the swab-confirmed infectious curve and to the second derivative of the removal curve, respectively, are shown in Fig. 4: again, the distance in time between the vertexes of the fitting functions gives a new measurement of the removal period r, reported in the fourth line of Table 1.

The removal curve corrected relatively to the swab-confirmed infectious only
From Table 1 the removal period r is assumed to be 10 ± 2 days, bearing in mind that the data have just one-day resolution; also, comparing the χ² on the first and second lines of the table, the correction factor k_rel is taken equal to 3.14 ± 0.82. So, the curve of the removed individuals, corrected relative to the swab-confirmed infectious only, is given by k_rel Ř(t). Figure 5 illustrates this: the cyan error bars are generated by the propagation of three times the ±0.82 uncertainty on k_rel.
Fig. 4 The two fitting "asymmetric sigmoid derivative" functions in cyan and green
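The correction and its error band amount to a simple rescaling, sketched below. The values 3.14 and 0.82 and the factor of three are those quoted above; the function name and the input series are hypothetical, and the band is assumed to be obtained by scaling the registered curve with k_rel shifted by plus or minus three times its uncertainty.

```python
def corrected_removed(removed, k_rel=3.14, dk_rel=0.82, n_sigma=3.0):
    """Scale the registered removed curve by k_rel and build its error band.

    Assumption: the band is the registered curve scaled by
    k_rel - n_sigma * dk_rel and k_rel + n_sigma * dk_rel.
    """
    lower_k = k_rel - n_sigma * dk_rel
    upper_k = k_rel + n_sigma * dk_rel
    central = [k_rel * x for x in removed]
    lower = [lower_k * x for x in removed]
    upper = [upper_k * x for x in removed]
    return central, lower, upper


# Hypothetical registered removals on three days (not real data).
registered = [0.0, 100.0, 250.0]
central, lower, upper = corrected_removed(registered)
```

With these numbers the band runs from 0.68 to 5.60 times the registered curve, which gives an idea of how wide a three-sigma propagation of the k_rel uncertainty is.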

Getting the generation time and the effective reproduction number
There are several algorithms to estimate the effective reproduction number from the data: a simplified one is given in [20], where an extensive bibliography on the subject can also be found. The simplest yet effective estimate, which very directly interprets the meaning of the function (see, for instance, [21]), is given by Eq. 8. As far as the SIR model is concerned, R_t is obtained from Eq. 4b as Eq. 9, where the derivative is implemented by the symmetric difference quotient, so as to cancel the first-order error in the numerical discretization. While only the generation time g appears in Eq. 8, both g and the removal period r are present in Eq. 9; consequently, the test of the validity of the SIR model through the effective reproduction number R_t it provides is to be considered quite stringent.
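The benefit of the symmetric difference quotient on daily data can be checked numerically. The sketch below compares it with the one-sided (forward) difference on a synthetic, exponentially growing curve whose exact derivative is known; the growth rate and the series are hypothetical and serve only to show the cancellation of the first-order discretization error.

```python
import math


def forward_difference(series, t):
    """One-sided daily variation: first-order accurate in the time step."""
    return series[t + 1] - series[t]


def symmetric_difference(series, t):
    """Symmetric difference quotient: the first-order error term cancels."""
    return (series[t + 1] - series[t - 1]) / 2.0


# Synthetic infectious curve (hypothetical): I(t) = 1000 * exp(0.05 * t).
rate = 0.05
infectious = [1000.0 * math.exp(rate * day) for day in range(60)]

t = 30
exact = rate * infectious[t]  # analytic derivative of the synthetic curve
err_forward = abs(forward_difference(infectious, t) - exact)
err_symmetric = abs(symmetric_difference(infectious, t) - exact)
```

For this curve the forward difference is off by roughly rate²/2 times I(t), while the symmetric quotient is off by roughly rate³/6 times I(t), so the second error is smaller by more than an order of magnitude.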
In the previous section, the removal period r was obtained from the data using the SIR model; the question is how to get the generation time as well.
In our conventions, I(t) denotes the total of all the infectious people, swab-confirmed or not. If t_M is a day on which I(t) has a maximum, then g days earlier, i.e. at day t_M − g, the effective reproduction number R_t should be equal to 1, because an increase in the number of people becoming infectious requires R_t > 1 and a decrease requires R_t < 1. Of course, every variation of R_t has an impact on I(t) with a delay of g days, and so also on I_sc(t), assuming this to be proportional to I(t). With r fixed at 9.71 ± 2 days, as set out in the previous section, let t_g be a day on which, for a given choice of g, R_t is equal to 1. In general, checking over the data, the nearest following day t_M on which I_sc(t) has a maximum does not satisfy t_M − t_g = g, as it should; this happens only for a specific choice of g, namely, for the case being studied, g = 6, an integer value in view of the one-day resolution of the data. A convenient double check is done on the maximum of I_sc(t) falling on 2 December 2020 (see Fig. 5). Very remarkably, the height of the peaks of R_t depends on the value given to g, in the same way as the days on which R_t is equal to 1 do: so, all of these quantities are constrained by the SIR model, a fact that must be considered truly important in evaluating the validity of the model. The estimate g = 6 days is in full agreement with the average 6.7 ± 1.9 days given for Italy in Ref. [22]: this is a success of Eq. 9 that strengthens the agreement, within the uncertainties, of the resulting R_t with that from other algorithms, such as those reported in Ref. [20] and references therein. Figure 6 shows this SIR-generated R_t, together with the one from Eq. 8; for either, error bars corresponding to a ±2 days uncertainty on both g and r are also shown.

The "corrected" cumulative and daily-new infections relatively to the swab-confirmed infectious people
To avoid confusion, it is worth remarking that infections at day t is meant as the cumulative number of infections up to and including that day, while the number of infectious people at some day t refers to those people who were infected possibly earlier and are still able to transmit infection at that day. Thus, the (daily new) infections curve is different from the infectious curve.
Since N in Eq. 1 is conserved, Eq. 4a can be written in terms of I(t) + R(t) = N − S(t), as in Eq. 10. For what has been done so far, Eq. 10 must be replaced by the corresponding expression giving the corrected cumulative number of infections relative to the swab-confirmed infectious people only (Eq. 12). Figure 7 illustrates Eq. 12.
Fig. 7 Italy, Covid-19 second through the third waves: estimates of removal (dark blue curve) and total (green curve) cases as from correction relative to the swab-confirmed infectious (red) curve. Also shown are the data in dark green (total cases) and blue (removals)

Infections from asymptomatic and symptomatic infectious people: estimate of the "real" numbers
There are several studies on the relevance of SARS-CoV-2 transmission from asymptomatic people, such as [8][23][24][25] and references therein. Quite recent and complete is Ref. [25], where a decision analytical model is used to assess the proportion of SARS-CoV-2 transmissions in the community likely occurring from subjects who did not develop any symptoms. In that work, data from a meta-analysis were used to set the generation time at a median of 5 days and the infectious period at 10 days, in good agreement, respectively, with the 6 and 10 days stated in the present work. The reported conclusion is that, across a range of plausible scenarios, 59% of infection transmission occurs from persons without symptoms: no explicit uncertainty is given, but the statement that the figure should be at least 50% suggests an uncertainty of ±10%. It is also stated that infected individuals who never develop symptoms are 75% as infectious as those who do develop symptoms.
Let f^(asy) be the percentage fraction of the asymptomatic infectious subjects over all the infectious people and i^(asy) their relative infectiousness, that is, their infectiousness as a percentage of that of those who developed symptoms; their values are taken according to the best estimates discussed above. Then, in view of Eq. 11, Eq. 10 becomes an expression for T(t), the "real" cumulative number of infections at day t, whose derivative represents the "real" daily new infections. Figure 8 shows the daily new infections curve, compared with the Imperial College (ICL) model estimate, as published in [12]. The model in question is a stochastic SEIR variant that adopts multiple infectious states, which in turn reflect different COVID-19 severities. It uses an estimate of the infection fatality rate (IFR), assuming that the number of confirmed deaths from Covid-19 is equal to the real number of Covid-19 deaths; it also uses an estimate of the effective reproduction number, based on the changes in the virus transmission rate caused by the average mobility trends.
So, the ICL model's approach is totally different from the one followed in the present work; nevertheless the respective "real" daily new infections estimates appear to be in quite good agreement, except on a time interval around 1 January 2021 (day 220 in the plots of this paper), where the ICL curve shows a deep local minimum instead of a local maximum as in the data of Italy's Department of Protezione Civile. The uncertainty belt of the ICL estimates is surprisingly narrow. In Fig. 9 the present work's estimates of the "real" total cases of infections are shown, together with the estimated "real" daily new infections, the latter with their own scale on the right; the data from Italy's Department of Protezione Civile are also plotted.
Fig. 9 This work's estimates of the "real" total cases (light green) and corresponding data from Italy's Department of Protezione Civile (dark green) with scale on the left; this work's "real" daily new infections (blue) and corresponding official data (dark blue) with scale on the right
Incidentally, the ripple visible in Figs. 8 and 9 on the data from Italy's Department of Protezione Civile has the typical 7-day periodicity that arises from the reduced data recording at weekends.

Conclusions
Taking as a case study the second through the third wave of SARS-CoV-2 in Italy, the SIR model is confronted with the data, after reformulating its equations by the explicit introduction of the important effective reproduction number R_t, as well as of the generation time and the infectious period, which are usually, and erroneously, neglected. The relationships the model sets among the main observables are actually found in the data, in particular between the curve of the swab-confirmed infectious individuals and the curve of the removed (healed or deceased) subjects. Indeed, taking advantage of its scale invariance and choosing the curve of the swab-confirmed infectious people as a reference, the model suggests a correction of the number of removed individuals by a single factor, which would take into account: (a) infected people who did not develop relevant symptoms and, therefore, were not detected; (b) deaths erroneously not attributed to Covid-19. Generation time, infectious period and effective reproduction number have been sought from the data through the model. At the very end, the curve of the swab-confirmed infectious individuals has been completed for the proportion of infection transmissions likely to have occurred from individuals with no symptoms, using figures published in important works ([8][23][24][25]). Thus, an estimate of the real numbers of the pandemic in Italy is obtained for the period of time considered. All the results are in good agreement with those of other studies, in particular those of the ICL group ([12]), whose approach is totally different from the present one. The view and use of the SIR model in this work are new; the C++ code, computationally very inexpensive and available on request from the author, can be applied to any other national or regional case besides the Italian case studied here.
Funding Open access funding provided by Università degli Studi di Trento within the CRUI-CARE Agreement.

Data Availability Statement
This manuscript has associated data in a data repository. [Author's comment: Original data are those from Italy's Department of "Protezione Civile", available at a public repository according to reference [13] in the paper. The resulting estimation data, as shown in the paper, are available from the author on request.]

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.