Key Summary Points

Respiratory syncytial virus (RSV) is a common cause of respiratory tract infections, leading to hospitalisations and mortality especially in young children and older adults with underlying medical conditions.

The burden of RSV in adults is underestimated because of non-specific symptoms, lack of standard-of-care testing and lower test sensitivity compared to young children, particularly when using a single diagnostic specimen.

Quantifying the burden associated with RSV in adults is challenging and time- and resource-intensive, but this information is vital for public health decision-making.

This protocol presents a statistical modelling approach to estimate RSV disease burden in adults, which is adaptable to various data types and allows for consistent analysis across countries and settings.

The protocol proposes four event types (general practitioner visits, emergency department visits, hospitalizations and deaths) for which four primary and nine secondary outcomes are defined using International Classification of Diseases (ICD) codes.

Introduction

Background

Respiratory syncytial virus (RSV) is a common cause of respiratory tract infections in children and adults, leading to hospitalizations and death, especially in infants, older adults and those with underlying medical conditions [1–3]. The typical clinical presentation in adults varies from mild disease to severe lower respiratory tract infection (LRTI) and includes chronic respiratory and cardiac disease exacerbations as well as other cardiac manifestations [4]. RSV epidemics occur seasonally, primarily in colder months in temperate climates [5].

Quantifying RSV incidence among adults is challenging since the symptoms associated with RSV infection usually overlap with other respiratory illnesses, especially infections with influenza and other respiratory viral pathogens. Other factors, including the resolution of viral shedding before seeking medical attention, the lack of standard-of-care testing for RSV when presenting to medical care facilities, the use of case definitions that exclude some RSV illness (e.g., influenza-like illness or community-acquired pneumonia) and the low diagnostic capacity and high cost of polymerase chain reaction (PCR) testing, also contribute to the underestimation of RSV incidence in adults [6–8]. Furthermore, PCR testing of a single respiratory swab (e.g., nasal/nasopharyngeal) has reduced sensitivity in older adults compared to children, likely because of lower viral loads in their secretions, inconsistent sampling procedures and other factors [9, 10]. Consequently, several alternative time series model-based approaches have been increasingly used to assess RSV incidence retrospectively in settings with limited standard-of-care testing [6, 11–29].

These approaches involve fitting regression models to time series data extracted from real-world data (RWD), such as claims or electronic health records (EHRs), concerning outcomes potentially associated with RSV (e.g., respiratory hospitalizations). The models link the temporal variability in a pathogen, represented through a viral proxy (e.g., RSV-related ICD-coded hospitalizations), with the variability in the outcome variable to estimate the proportion of events associated with that pathogen. While doing so, the model accounts for baseline seasonality and co-circulation of other large seasonal contributors to the studied outcomes (e.g., influenza). A recent meta-analysis (US) demonstrated that application of time series models to RSV yields RSV-related incidence estimates comparable to those obtained in prospective studies (236 and 282 per 100,000 person-years, respectively, among adults aged ≥ 65 years). However, estimates based on RSV-specific ICD codes were substantially lower (1 to 5 per 100,000 person-years), suggesting that the time series models account for undiagnosed RSV-related events [30].

Linear regression, Poisson regression and negative-binomial regression are examples of common model-based approaches. Occasionally, more advanced methodologies, such as generalized additive models or hierarchical Bayesian regression, have been used [26, 31]. Furthermore, different modelling approaches use diverse definitions of outcome, risk status stratification and the time lag between viral activity and outcome. These differences underscore the necessity for a general framework to estimate RSV disease incidence from RWD. Such a framework is crucial for producing consistent estimates of RSV disease incidence across various studies using different databases.

RSV vaccines have recently been licensed to prevent lower respiratory tract disease caused by RSV in older adults, such as those aged ≥ 60 years [32, 33]. Accurate local RSV incidence data are vital to inform decisions on public health policy, such as those debated by vaccine technical committees. As setting up a prospective cohort study to establish RSV burden in adults is time- and resource-intensive, this generic protocol outlines a time series model-based method to estimate RSV disease incidence, encompassing general practitioner (GP) visits, emergency department (ED) visits, hospitalizations and deaths, that relies on data that may already be available and can be used across countries. The central components of this generic protocol are anticipated to be tailored to specific local databases, forming the final protocols for country-specific studies. This strategy facilitates methodological harmonization across countries and an integration of best practices.

Objectives

The primary objective of the studies using this generic protocol is to estimate population-based RSV incidence and mortality rates in adults. As a first step in this process, studies aim to estimate population-based RSV-attributable incidence rates of cardiorespiratory, respiratory and cardiovascular events identified from GP/outpatient, ED, hospital and death registries, stratified by age and risk status (when applicable). In addition, studies could also aim to estimate the RSV-attributable incidence rate of events associated with a subset of the primary outcomes stratified by age and risk status (when applicable).

Methods

Study Design

This is a retrospective database analysis in which data are modelled with a time series quasi-Poisson regression to assess the incidence of RSV disease. The study has been implemented in Spain, Germany, Canada and Italy. A consistent study design is employed, with adaptations made based on data availability, for each country participating in the study.

Sample Selection

The study population includes females and males aged 18 years or older who reside in the geographical areas captured by the selected databases. The study period starts after 2009 and ends before 2020, to exclude pandemics that are expected to distort RSV surveillance and incidence. The study period spans multiple years, as the year-to-year variability is essential in estimating the burden of RSV disease through modelling.

Sample Size

This is an observational study without an a priori test hypothesis. Therefore, sample size calculations are not applicable.

Measurements

Covariates included in the models as independent variables are time and viral proxies for RSV and influenza. Stratifying variables are age group and (when applicable) risk status.

Viral Proxies

Depending on data availability, viral proxies can be derived from hospital or surveillance data. The testing frequency of viral surveillance systems has been predominantly driven by influenza activity rather than RSV (or respiratory virus activity as a whole). Therefore, such systems may underestimate the circulation of RSV by testing less frequently during peak RSV activity. Also, many of these systems are based on influenza-like illness case definitions requiring fever, which is less commonly seen in RSV. As our model primarily aims to estimate medically attended RSV burden, we use hospital-based viral proxies where possible, as this allows the proxy to be directly derived from the healthcare system whose outcomes we are assessing and avoids any geographic mismatch that might arise from the use of sentinel viral surveillance data. The proxies seek to accurately track the relative level of viral activity in the community, so the absolute value of the activity is less important than consistent measurement across the year to accurately track relative trends. On this basis, as has been done in other studies [6, 29, 31], we use pediatric RSV activity for the RSV activity proxy as testing is frequent among young children, allowing for consistent measurement of RSV activity. This proxy is represented by the number of RSV-related hospitalizations (ICD-10 codes: B97.4, J21.0, J12.1, J20.5, J21.9 or ICD-9 codes: 079.6, 466.11, 480.1, 466.1) in children < 2 years. Because the vast majority of bronchiolitis cases and hospitalizations in children < 2 years are related to RSV [34, 35], the more generic bronchiolitis code (J21.9 or 466.1) is included in the RSV proxy to accommodate the reduction in RSV testing during the peak and tail of the season, which we have observed in administrative databases in several countries.
For influenza, the largest burden and most consistent testing is among older adults, so we use influenza-specific hospitalizations (ICD-10 codes: J09-J11 or ICD-9 codes: 487, 488) in adults ≥ 65 years as has been done in other studies [13].

Time lags of 0 up to 4 weeks between the viral proxy and the outcome of interest are considered during model building to account for delays between changes in viral proxy detection and the number of events. For GP visits, potential time lags are shortened (0–2 weeks) to reflect the expectation that this would be the first source of care in most cases.
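The candidate lags described above can be sketched as follows. This is a minimal illustration, not the study scripts: the function names, the event-type labels and the use of `None` for weeks without an earlier observation are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Maximum lag (in weeks) considered per event type, following the text:
# 0-2 weeks for GP visits, 0-4 weeks for the other event types.
# The dictionary keys are hypothetical labels.
MAX_LAG_WEEKS: Dict[str, int] = {"gp": 2, "ed": 4, "hospitalization": 4, "death": 4}


def lag_series(proxy: List[float], lag: int) -> List[Optional[float]]:
    """Shift a weekly proxy so that position t holds the value from week t - lag.

    The first `lag` positions have no earlier observation and are set to None.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    shifted = [None] * lag + list(proxy)
    return shifted[: len(proxy)]


def candidate_lags(proxy: List[float], event_type: str) -> Dict[int, List[Optional[float]]]:
    """Return {lag: lagged series} for every lag allowed for this event type."""
    return {lag: lag_series(proxy, lag) for lag in range(MAX_LAG_WEEKS[event_type] + 1)}
```

Weeks holding `None` would need to be dropped (or the proxy series extended backwards) before model fitting, so that all candidate models are compared on the same set of weeks.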

Stratifying Variables

Proposed age groups are 18–44 years, 45–64 years, 65–79 years and ≥ 80 years, but can be adapted to country-specific vaccination recommendations and data availability.

Risk factors for RSV are identified as the presence of at least one comorbidity code (Supplementary Materials, Table 3) within 1 year prior to the event. Low risk is defined as the absence of any comorbidity codes. Due to limited knowledge of risk factors for severe RSV disease [3], risk factors for influenza are used to develop the set of comorbidity codes [36]. Data on the risk status should be obtained from the same database as the outcome data.

Planned Outcomes

The generic protocol proposes four types of events: GP visits, ED visits, hospitalizations and deaths. A GP visit is defined as a visit to a GP. An ED visit is defined as a visit to a medical treatment facility specialized in emergency medicine, not leading to hospitalization. A hospitalization is defined as an overnight stay in a hospital.

The protocol proposes four primary outcomes: all cardiorespiratory (broad), selected cardiorespiratory (narrow), all respiratory and all cardiovascular events (see Supplementary Materials, Table 1). Both broad and narrow cardiorespiratory event definitions are considered to differentiate between the full group of respiratory and cardiovascular events and the selected cardiorespiratory codes most likely to be associated with RSV, as recommended by experts and existing literature [3, 4]. For the ICD outcome grouping, both primary and secondary diagnoses are used, as has been done in other studies [18, 22, 23, 31], to obtain a more comprehensive assessment of RSV-attributable events. This strategy is elected because the use of primary diagnosis only has been shown to underestimate the LRTI burden [37]. For deaths, outcome groups are defined using the underlying cause of death. If data on all-cause death are available, a sensitivity analysis could be conducted in which outcome groups are defined using all reported causes of death.

In addition to the primary outcomes, nine secondary outcomes are selected, based on literature review and usefulness for policy assessment, to provide more specific estimates that can be used for economic evaluation [4]. The following secondary outcomes, composed of a subset of the primary outcomes and defined by ICD code groups, are incorporated: influenza or pneumonia, bronchitis or bronchiolitis, chronic lower respiratory diseases, upper respiratory diseases, chronic heart failure exacerbations, ischaemic heart diseases, arrhythmias, cerebrovascular diseases and myocarditis (see Supplementary Materials, Table 2).

Data Requirements

The minimum data to be collected from each country-specific study include (1) outcomes (as defined above); (2) viral proxies (as defined above); (3) age group, for stratification by age groups; (4) risk status (if available), for further stratification by risk status.

Data are obtained from diverse sources, including national/regional registries or claims/EHRs from different care settings, such as GP/outpatient, ED, hospital and death registries.

Outcome data should be aggregated at least monthly, ensuring sufficient variability for seasonal modelling, and should have a well-defined catchment population (denominator) accurately representing the region/country studied, enabling incidence calculations. If the system does not have complete capture of the events in the catchment area, well-delineated adjustment factors should exist (e.g., a scaling factor to weigh up/down specific age groups). Risk status information should be extracted from the data sources from which outcome data are obtained, as the prevalence of risk factors is expected to differ by event type. To obtain risk-specific incidence rates, the catchment population should also be available stratified by risk status.

Preparation of Time Series Data

Data are aggregated weekly (or monthly) by age group and (if applicable) risk status. If cells with low counts (i.e., below the country-specific limit to guarantee anonymity, usually 5) are suppressed, a random number within the suppressed range (e.g., 1–4) is imputed to complete the time series. A shell table for constructing time series of the outcome data is given in the Supplementary Materials, Table 4.
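The imputation of suppressed cells could look like the sketch below. It assumes that suppressed cells arrive as `None` in the extracted series and that the suppressed range is 1–4; both are illustrative choices, and the function name is hypothetical.

```python
import random
from typing import List, Optional


def impute_suppressed(
    counts: List[Optional[int]],
    low: int = 1,
    high: int = 4,
    seed: Optional[int] = None,
) -> List[int]:
    """Replace suppressed cells (None) with a random integer in [low, high].

    Observed counts are returned unchanged. A seed makes the imputation
    reproducible across reruns of the analysis.
    """
    rng = random.Random(seed)
    return [rng.randint(low, high) if c is None else c for c in counts]
```

Fixing the seed is a design choice that keeps the completed time series identical between runs, which simplifies quality control of the analysis scripts.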

Viral proxy data are extracted from hospital registries as discussed above and should be aggregated at the same level as the modelled outcome data (i.e., weekly or monthly). Shell tables for the weekly and monthly viral proxy data are given in the Supplementary Materials, Tables 5 and 6.

Data Analysis

Each country-specific study should establish a Statistical Analysis Plan (SAP) adapted to the country-specific data before initiating data analysis. The example of the country-specific SAP for Spain is provided in the Supplementary Materials. Quality control of the analysis scripts is planned before analysing the data. An example of country-specific scripts can be obtained from the authors upon request.

Descriptive statistics summarize the number of events for each year, both overall and stratified by age group and (if applicable) risk status. The observed number of events is plotted over time for each outcome stratified by age group and risk status (e.g., respiratory hospitalizations for adults aged 18–44 years with high-risk conditions) to evaluate if a seasonal trend is visually present, hence qualifying the data for seasonal modelling.

The weekly (or monthly) number of events is modelled separately for each outcome and each stratum (age group and, when applicable, risk status) using a quasi-Poisson regression model to allow for potential overdispersion. The identity link function is chosen to reflect the most plausible biological relation between viral circulation and the event occurrence. Seasonal variations in the number of events are captured by the periodic time trends represented by sine and cosine terms with weekly (period = 52.143) or monthly (period = 12) periodicity. The aperiodic time trends are reflected by a polynomial up to the fourth order. The seasonal terms are included in the model to accurately model the outcome (e.g., all respiratory events), not the viral proxy (e.g., RSV). RSV enters the model as a covariate; therefore, the regional pattern of RSV should not affect the suitability of this modelling approach. The viral activity is represented by appropriately lagged viral proxies for RSV and influenza. Although we anticipate a shorter lag for influenza than for RSV, we allow the model to select the most suitable time lag for each pathogen.

Assume that the (weekly/monthly) number of events follows a Poisson distribution: \({\text{Nr}}.{{\text{events}}}_{t}\sim {\text{Poisson}}\left({\lambda }_{t} . \theta \right)\) with \(t=1, 2, 3,\ldots T\) the running week and T the total number of weeks in the study period, then the expected number of events \({E}\left({\text{Nr}}.{{\text{events}}}_{t}\right) = {\lambda }_{t}\) and the variance \({\text{Var}}\left({\text{Nr}}.{{\text{events}}}_{t}\right) = {\lambda }_{t} . \theta\), with \(\theta\) the overdispersion parameter. For weekly data, \({\lambda }_{t}\) is specified as follows:

$${\lambda }_{t}= {\beta }_{0}+\sum_{k=1}^{4}{\beta }_{k}.{t}^{k}+{\beta }_{5}.{\text{sin}}\left(\frac{2\pi .t}{52.143}\right)+ {\beta }_{6}.{\text{cos}}\left(\frac{2\pi .t}{52.143}\right)+{\beta }_{7}.{\text{sin}}\left(\frac{4\pi .t}{52.143}\right)+ {\beta }_{8}.{\text{cos}}\left(\frac{4\pi .t}{52.143}\right)+\sum_{l=1}^{L}{\beta }_{\left(8+l\right)}.{{{\text{VP}}}_{l}}_{\left(t-{m}_{l}\right)}$$

where \({\beta }_{0}\) is the expected number of baseline events, \({\beta }_{k} \left(k=1,\ldots ,4\right)\) are coefficients associated with aperiodic time trends while \({\beta }_{q} \left(q=5,\ldots ,8\right)\) are coefficients corresponding to yearly and half-yearly time trends. The effect of pathogen \(l\) (\(l = 1,\ldots ,L\) with \(L\) the total number of pathogens under consideration) is represented by the coefficient \({\beta }_{8+l}\) associated with the appropriately lagged activity of pathogens \({{\text{VP}}}_{1},\ldots , {{\text{VP}}}_{L}\), with \({m}_{l}= 0, 1,\ldots ,M\) and M the maximally allowed time lag (2 or 4, depending on the outcome).
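To make the weekly specification concrete, the sketch below assembles the covariates for one week and evaluates \({\lambda }_{t}\) under the identity link. It only illustrates the structure of the linear predictor; estimating the coefficients requires a quasi-Poisson fitting routine, which is not shown. The function names are hypothetical.

```python
import math
from typing import List, Sequence

WEEKS_PER_YEAR = 52.143  # period used for the weekly sine and cosine terms


def weekly_design_row(t: int, poly_order: int, lagged_proxies: Sequence[float]) -> List[float]:
    """Covariates for week t (1-based), in the order of the weekly equation:
    intercept, t^1..t^poly_order, yearly and half-yearly sine/cosine terms,
    then one (already lagged) viral proxy value per pathogen.
    """
    angle = 2.0 * math.pi * t / WEEKS_PER_YEAR
    row = [1.0]
    row += [float(t) ** k for k in range(1, poly_order + 1)]
    row += [math.sin(angle), math.cos(angle), math.sin(2 * angle), math.cos(2 * angle)]
    row += [float(v) for v in lagged_proxies]
    return row


def expected_events(row: Sequence[float], betas: Sequence[float]) -> float:
    """Identity link: lambda_t is the plain linear combination of the covariates."""
    if len(row) != len(betas):
        raise ValueError("row and betas must have the same length")
    return sum(b * x for b, x in zip(betas, row))
```

With a fourth-order polynomial and two pathogens (RSV and influenza), each row holds 11 covariates, matching \({\beta }_{0}\) through \({\beta }_{10}\).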

The expected number of monthly events is specified as follows:

$${\lambda }_{t}= {\beta }_{0}+\sum_{k=1}^{4}{\beta }_{k}.{t}^{k}+{\beta }_{5}.{\text{sin}}\left(\frac{2\pi .t}{12}\right)+ {\beta }_{6}.{\text{cos}}\left(\frac{2\pi .t}{12}\right)+\sum_{l=1}^{L}{\beta }_{\left(6+l\right)}.{{{\text{VP}}}_{l}}_{t}$$

where \({\beta }_{0}\), \({\beta }_{k} \left(k=1,\ldots ,4\right)\) and \({\beta }_{q} \left(q=5, 6\right)\) are defined as above, and the effect of pathogen \({{\text{VP}}}_{1},\ldots ,{{\text{VP}}}_{L}\) is represented by the coefficients \({\beta }_{6+l } (l=1, \ldots , L).\)

The model-building procedure consists of two steps: first, identify the appropriate order of the aperiodic time trend; second, determine the proper lag of the viral proxies. In the first step, the model is fitted with only time trends (periodic time trends and aperiodic time trends with all four polynomial terms). The periodic time trends are fixed to reflect the biological plausibility of seasonal trends in the data, while the order of the aperiodic time trend (\(\sum_{k=1}^{4}{\beta }_{k} . {t}^{k}\)) can be reduced down to first order (\(\alpha = 0.05\)). In the second step, each (lagged) proxy variable is added to the model from step 1, one at a time. The variable with the highest test statistic is selected for inclusion in the model. Selecting on the test statistic, which retains the sign of the association, rather than on the P value reflects the assumption that it is biologically implausible for viral activity to protect against the outcomes of interest [16, 38]. Once a pathogen is included in the model, this step is repeated with the variables corresponding to the remaining pathogens until one (lagged) variable for each pathogen is included in the final model.
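The second step of the procedure can be sketched as a greedy selection loop. The sketch assumes a hypothetical callback, `test_statistic(selected, candidate)`, that refits the model containing the time trends plus the already selected proxies and returns the test statistic of the candidate's coefficient; the fitting itself is outside the scope of the example.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, int]  # (pathogen name, lag)


def select_lags(
    pathogens: Sequence[str],
    max_lag: int,
    test_statistic: Callable[[List[Candidate], Candidate], float],
) -> List[Candidate]:
    """Pick one lagged proxy per pathogen, highest test statistic first.

    In each round every (pathogen, lag) pair of the pathogens not yet in the
    model is tried on its own; the pair with the highest test statistic is
    kept, and the round is repeated for the remaining pathogens.
    """
    selected: List[Candidate] = []
    remaining = list(pathogens)
    while remaining:
        candidates = [(p, lag) for p in remaining for lag in range(max_lag + 1)]
        best = max(candidates, key=lambda c: test_statistic(selected, c))
        selected.append(best)
        remaining.remove(best[0])
    return selected
```

Because the loop maximizes the statistic itself rather than its absolute value, a strongly negative association is never preferred over a positive one.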

Model fit is assessed visually by investigating the plots of observed versus estimated number of events over time, as there is no readily available numeric goodness-of-fit measurement. The number of events attributable to RSV is calculated as the difference between the expected number of events from the full model and those from the model without the RSV term (by setting the coefficient associated with RSV to zero). The yearly incidence rates of events attributable to RSV are calculated as the annual number of events attributable to RSV divided by the corresponding denominators (multiplied by 100,000). Depending on the database, the denominators are either the age- and risk-specific (if applicable) census population or the number of individuals captured within the registered nationally representative databases. The confidence intervals around the estimates are obtained using residual bootstrapping with 1000 bootstrapped samples [39]. Given the considerably large number of bootstrapped samples, the incidence rates are assumed to be normally distributed with a mean equal to the estimated IR. Results are presented with IRs and their 95% confidence intervals (CIs). When risk- and age-specific analyses are performed, the corresponding IR in combined age- and/or risk-specific populations is calculated as the sum of the number of events attributable to RSV across risk groups and age groups divided by the sum of the corresponding population sizes (multiplied by 100,000).
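Because the model uses an identity link, the difference between the full model and the model with the RSV coefficient set to zero reduces to the RSV coefficient multiplied by the lagged RSV proxy in each period. The sketch below illustrates that calculation, together with the incidence rate per 100,000 and the pooling across strata; the function names are hypothetical, and the bootstrap confidence intervals are not shown.

```python
from typing import List, Sequence, Tuple


def attributable_events(rsv_coefficient: float, lagged_rsv_proxy: Sequence[float]) -> List[float]:
    """Per-period RSV-attributable events: full model minus model without RSV.

    With an identity link all other terms cancel, leaving beta_RSV * VP_RSV.
    """
    return [rsv_coefficient * v for v in lagged_rsv_proxy]


def incidence_rate(annual_attributable: float, population: float) -> float:
    """Yearly RSV-attributable incidence rate per 100,000 population."""
    return 100_000 * annual_attributable / population


def combined_incidence_rate(strata: Sequence[Tuple[float, float]]) -> float:
    """Pooled rate over (attributable events, population) pairs, one per stratum:
    summed events divided by summed populations, per 100,000.
    """
    total_events = sum(events for events, _ in strata)
    total_population = sum(pop for _, pop in strata)
    return 100_000 * total_events / total_population


def percent_attributable(annual_attributable: float, annual_observed: float) -> float:
    """Yearly share of observed events attributable to RSV, as a percentage."""
    return 100 * annual_attributable / annual_observed
```

For example, 30 attributable events in a stratum of 100,000 people gives a rate of 30 per 100,000; pooling that stratum with one holding 70 events among 100,000 people gives 50 per 100,000.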

To have a broad overview of the disease burden, the yearly percentage of the number of events attributable to RSV is derived as the proportion of the yearly number of events attributable to RSV out of the yearly number of observed events (presented as percentages).

While the primary analysis for this study is based on the frequentist framework in all countries, the analysis could also be conducted in a Bayesian framework. This framework has the advantage of easily incorporating prior knowledge in parameter estimation (e.g., forcing the RSV parameter to be minimally zero to reflect the assumption that RSV is not likely to protect against the outcomes of interest) and obtaining the posterior mean with its 95% credible interval without the need for additional analyses such as bootstrapping. However, a Bayesian model comes with the risk of experiencing difficulties in obtaining convergence and, when convergence is obtained, a considerable runtime. Given the potential benefit of such analyses, weighted against their limitations, the Bayesian model similar to that proposed by Zheng et al. [31] is used as a sensitivity analysis to assess the impact of the selected framework on the primary outcomes in the first two countries.

Strengths and Limitations

This is the first protocol to present a unified approach for conducting a time series model-based study to estimate RSV burden in adults, applicable across countries using diverse data sources (GP/outpatient, ED, hospital and death registries). It is implemented in multiple countries, enabling uniform estimation of symptomatic RSV infection incidence rates in adults across different age and risk groups. This model-based study is cost-efficient as it can use existing data and can easily be implemented in multiple countries or specific regions. This facilitates comparison and/or combination of results across countries to better understand the global RSV burden in adults. To support the credibility of the results, the findings generated by applying this protocol should be compared to any existing country-specific estimates as well as to the existing pooled estimates from the meta-analysis that pools prospective studies adjusted for diagnostic testing under-ascertainment and model-based studies [1, 30]. Findings from country-specific studies could help policymakers better evaluate the impact of RSV infection on public health.

Although the proposed methodology is practical and straightforward to apply, it has some limitations. The main limitation is the availability of high-quality data. For example, when analysing hospital data, the fixed number of beds could affect hospitalization rates, and the pathogen diagnosis data indicating RSV and influenza circulation could be affected by the limited testing capacity [25]. In addition, working with administrative data comes with the risk of both under- and overestimation of the observed or recorded number of outcome events. For example, underestimation could result from omission of cardiorespiratory ICD codes when in fact such a diagnosis was involved in the hospitalization. Overestimation could result from inaccurate ICD codes being assigned such as for rule-out diagnosis.

The models used to estimate the RSV-attributable number of events include viral proxies for RSV and influenza as has been done in most published time series studies, which implicitly assumes that these are the only two pathogens that show a relevant association with the outcome of interest. If relevant associations between other pathogens (currently not included) and the outcome of interest exist and pathogen-specific time series data are available, they could be integrated into the model by including additional viral proxies, as was done for RSV and influenza. However, even without explicitly modelling these potentially relevant pathogens, they would to a great extent be indirectly accounted for in the proposed model through the periodic component and overdispersion parameter.

While the proposed quasi-Poisson model is expected to be stable and bear minimal computational burden for seasonal data, it is expected to encounter difficulties converging or accurately estimating RSV-attributable burden when data do not show a clear seasonal pattern. This may be the case for an outcome of interest in a particular country for a particular age group.