Background & Summary

Since the onset of the pandemic, the volume of COVID-19 data made available to the public has been unprecedented1. Yet disaggregated data are still scarce, incomplete, sometimes contradictory, and of limited cross-country comparability. Various reports have substantiated the need for better data. The importance of information disaggregated by gender, age, geographic location, ethnicity, and other variables relevant in a national context has already been highlighted, especially for developing countries2. Some scholars have posited that failing to acknowledge the importance of disaggregated data on gender and sex may result in significant inequalities in access to health services3. Others have argued that the current lack of disaggregated COVID-19 data will widen the existing sex and gender data gaps, which in turn will increase gender disparities in the health and socioeconomic effects of the pandemic, with a negative impact on females4. Various experts worldwide have advocated the need to produce and standardise age-disaggregated health data to improve usability and cross-country comparison5. They have shown that failing to do so leads to misinterpreting the patterns of SARS-CoV-2 transmission among and beyond the cohort of children, the prioritization of vaccination, and the assessment of secondary effects.

As of January 21st, 2022, Romania had the second-lowest vaccination rate in Europe (41% cumulative uptake of the full COVID-19 vaccination scheme)6. In addition, Romania is among the ten European countries with the highest death toll (3,106 per one million people)7. Located in the southeast of Romania, Bucharest is the capital city and the primary economic agglomeration of the country. Services, the sector most affected during the COVID-19 pandemic, represent the leading sector of the local economy8. Bucharest has a total resident population of 1,827,390 inhabitants9 and ranks third by population density among the capital cities of European Union countries, behind Paris and Athens (7,917 persons per km²)10.

The first COVID-19 case in Bucharest (case number 12 in Romania) was officially confirmed on March 7th, 2020. The patient, a 49-year-old male, arrived in Bucharest by plane on February 27th, travelling from Rome, Italy11. At that time, restrictive measures were imposed only on travellers arriving in Romania from regions in northern Italy. Evidence suggests that returning travellers from Italy had a pivotal role in the early spread of COVID-19 in Romania12. As of January 21st, 2022, the number of positive cases in Bucharest had reached 335,498, accounting for 17% of all 1,983,670 cases reported in the country13.

We present the first 250 days of the coronavirus pandemic in Bucharest (between March 7th and November 11th, 2020), comprising 46,440 individual COVID-19 cases. For each patient, the dataset gives the disaggregated biological sex and age, the COVID-19 confirmation date, the place of residence or quarantine, and the phase (stage) corresponding to the Non-Pharmaceutical Interventions (NPIs) active in Bucharest. The official measures adopted during the first 250 days vary widely in severity: from a nationwide state of emergency to almost no measures at all. The NPIs have been diverse: mask-wearing, social distancing, movement restrictions (including for people aged 65+), school closures, etc. The effect of the NPIs is likely to depend on the structure of the target population, on the share of specific groups within the total population14, and on mobility and geographic factors15. Accordingly, we divide the first 250 days into five phases based on the severity of the measures. All the information (except the NPI phase variable) was obtained, in an anonymized format, from the Bucharest Public Health Department (the Romanian Ministry of Health). Our study received ethical approval (Decision No. 1, September 1st, 2020) from the Ethics Committee of the Department of Sociology (University of Bucharest). We complied with all relevant ethical regulations.

Our dataset can be re-used in several ways: understanding epidemiological prevalence, mathematical modelling or simulation, and designing or assessing public health policy. It also allows the effectiveness of NPIs to be measured at the city (urban community) level, which may provide a fine-grained understanding of local health policy success. A sub-sample of our dataset has already been employed to estimate the role of age and sex in spreading COVID-19 in Bucharest16,17.

Methods

The dataset comprises 46,440 COVID-19 patients officially confirmed and reported between March 7th and November 11th, 2020, in Bucharest (before the start of the vaccination campaign). These real-world data were obtained from the Bucharest Public Health Department (the BPHD), Romanian Ministry of Health. To receive the data, our team sent an official written request to the BPHD (i.e., Address No. 14870, August 18th, 2020) on behalf of the University of Bucharest. The BPHD provided the dataset based on approval No. NT3054E (August 28th, 2020) in response to our request. Our study received ethical approval (Decision No. 1, September 1st, 2020) from the Ethics Committee of the Department of Sociology (University of Bucharest). The BPHD anonymized the data before releasing them to us. The 46,440 observations (patients) correspond to an investigated time window of 250 days (March 7th - November 11th, 2020). Subsequently, we qualitatively analysed the NPIs issued by the public health authorities in Romania18 during the 250-day timeframe. Based on their level of severity, we derived five stringency phases. Each of the 250 days was assigned to one of the five phases.

Figure 1 describes the steps of producing the “COVID-19 in Bucharest” dataset (or, briefly, the dataset). For each of the 46,440 COVID-19 patients, we have records of their biological sex, age, area of residence or quarantine, and official confirmation date. Each patient also has a unique identification code (patient_ID) assigned by the BPHD. These records are marked in red in Fig. 1. Information about the NPIs corresponding to the first 250-day time interval was retrieved from official press releases uploaded on the webpage of the Romanian Ministry of Internal Affairs18 and from newsletters published in the “Press releases” section on the webpage of the Romanian Ministry of Public Health19. This information was expressed in the form of the “phase” variable (marked in blue in Fig. 1). Additionally, our team derived new variables from the original data records provided by the BPHD, i.e., observation number, age_groups, month, week, and 14-day interval. These variables are marked in green in Fig. 1. Our team performed various technical validation checks on the “COVID-19 in Bucharest” dataset (plausibility, completeness, and conformance checks). These are presented in greater detail in the “Technical validation” section. For brevity, we only mention here that during the technical validation stage, we compared the data records in our dataset to the information contained in other publicly available datasets20,21,22.

Fig. 1

The sequence of steps taken to produce the dataset.

We used the severity degree of the NPIs to delineate the stringency phases. In Fig. 2, for informative and illustrative purposes, we represent the evolution of the COVID-19 cases in a longitudinal fashion (between March 7th and November 11th, 2020) by major NPIs, stringency phases, sex, and age groups. The stringency phases are displayed in temporal order. The information displayed in the main content and Fig. 2 represents the authors’ contribution.

Fig. 2

The evolution of COVID-19 confirmed cases between March 7th and November 11th, 2020, in Bucharest, Romania. We illustrate the new daily cases by sex (a) and age-groups (b) while accounting for the stringency phases. The rendered information represents the authors’ contribution.

Stringency “Phase 1” ranges between March 7th and March 15th, 2020, and corresponds to “low to moderate measures.” Forty-seven COVID-19 cases are confirmed during this phase. The major NPIs until March 15th are: restriction of outdoor or indoor public and private events to 1,000 participants (March 8th); cessation of flights (March 9th), bus rides (March 10th), and railway travel (March 12th) to and from Italy; suspension of face-to-face classes (March 11th) in all pre-university level schools and some universities; and limitation of the number of participants in indoor cultural, scientific, religious and sports activities to 50 (March 11th). “Phase 2”, matching the most severe measures, corresponds to “the state of emergency in Romania” and ranges between March 16th and May 14th, 2020. One thousand five hundred fifty-one cases are confirmed between these dates. The preliminary NPIs adopted during the state of emergency are: an extension of the suspension of in-person classes; permission of only takeaway and delivery services in restaurants and shopping malls; closure of hotels and clubs; prohibition of indoor cultural, scientific, religious, and sports events; the restriction of outdoor personal events to 100 participants; the cessation of flights to and from Spain. On March 24th, the military is deployed to help enforce a national lockdown. Movement outside the household is generally prohibited for non-essential purchases, with persons over 65 having their outdoor activity restricted, at first, to a two-hour interval (March 25th) and, then, to a three-hour interval (March 29th). Starting April 27th, non-essential movement outside the household is permitted for persons above the age of 65 both in the morning (7–11 am) and during the evening (7–10 pm).

“Phase 3”, with moderate measures, ranges between May 15th and June 16th, 2020. Nine hundred thirty-two cases are confirmed during this interval. May 15th marks the onset of the first “state of alert”, allowing hairdressing salons, barbershops, and dental clinics to reopen. Facial masks generally become mandatory in indoor public spaces, public transportation included. From June 1st, some of the movement restrictions are gradually lifted, outdoor concerts and cultural events are permitted, and restaurant terraces reopen. “Phase 4”, with 18,499 confirmed cases, is the phase of the least stringent measures, ranging between June 17th and October 6th, 2020. From June 17th, shopping malls, public nurseries, and pre-schools are allowed to reopen. Pre-university level schools reopen on September 14th. Local elections are held on September 27th, with a participation rate of 37% of the 18+ Bucharest population23. For the entire phase, facial masks remain mandatory. “Phase 5”, with low to moderate measures, ranges between October 7th and November 11th, 2020. Authorities confirm 25,274 COVID-19 cases during this phase. As of October 7th, theatres, cinemas, and show venues are closed in Bucharest, while restaurants are restricted to outdoor service only. Opening hours of stores are limited, and a night curfew is imposed. Starting November 9th, all in-person classes are suspended. Masks are mandatory in both outdoor and indoor places, working spaces included. Public and private gatherings are cancelled. Table 1 displays the major NPIs by stringency phase and time interval.

Table 1 Gantt-style table displaying the major Non-Pharmaceutical Interventions during the first 250 days of the COVID-19 pandemic in Bucharest.

Data Records

We deposited a copy of the dataset (i.e., the Bucharest COVID-19 dataset) in the generalist repository figshare24. Data are available in an Excel file format (.xlsx), which can be imported into various statistical software programs such as R, Python, SPSS, Stata, or SAS, or converted to comma-separated value format (.csv). In the database, the rows designate unique individual observations (COVID-19 confirmed cases). Each observation is assigned a number from 1 to 46,440 (the total number of cases), and observations are sorted in ascending order. Overall, the dataset provides information about all the officially confirmed individual COVID-19 cases in Bucharest, from the onset of the pandemic in the city, on March 7th, until November 11th, 2020. Specifically, we give information on the COVID-19 confirmation date and each patient’s sex, age, and geographical (administrative) location (the administrative district in Bucharest). We also give the patients’ ID codes as provided by the BPHD. These ID codes are helpful for joining (linking) this dataset to other available datasets16,17. Additionally, we derive four new variables from the original information. Namely, based on its COVID-19 confirmation date, each observation is nested in a (calendar) month, week, 14-day interval, and stringency phase. We make the dataset available in English to increase its international usage by health professionals, policy-makers, scientists, and other interested parties. We use self-explanatory and straightforward variable labels and values for users’ convenience. Also, we mark missing data by “NA” codes. We continue this section with a detailed description of all variables (data fields) available in the dataset (Table 2).

Table 2 The data fields (variables) included in the Bucharest COVID-19 dataset.

Observation_number

There are 46,440 unique data entries (observations or patients) in the dataset. Therefore, the variable takes values from 1 to 46,440. The observation numbers are sorted in ascending order. These numbers do not reflect epidemiological progression and should not be used for that purpose. Instead, this variable uniquely identifies each observation in the dataset.

Patient_ID

The BPHD assigned each COVID-19 patient a numerical code as part of the anonymization process. We include these patient ID codes because they are useful for joining (linking) the dataset to other available datasets16,17.

Confirmation_date

For each patient, this represents the official date of the test result (i.e., COVID-19 confirmation date). The dataset includes only positive tests. A total of 137 cases are reported with missing confirmation dates (marked by NA). These missing data account for 0.3% of all 46,440 patients. We record confirmation dates in the “MM/DD/YYYY” format (“month, day, year”, in left-to-right writing direction), e.g., “03/07/2020”. The maximum number of cases confirmed in a day is 1,203 (on November 3rd, 2020). Between March 7th and November 11th, 2020, there are 250 days (confirmation dates) in which the authorities report positive cases.
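
As a minimal sketch of how this field could be read, assuming the values are imported as text, the snippet below parses the MM/DD/YYYY strings and maps the NA code to a missing value. The helper name parse_confirmation_date is hypothetical and is not part of the dataset. The last lines also check that the window from March 7th to November 11th, 2020, counted inclusively, spans 250 calendar days.

```python
from datetime import date, datetime
from typing import Optional


def parse_confirmation_date(raw: str) -> Optional[date]:
    """Parse an MM/DD/YYYY string; return None for the "NA" missing code."""
    text = raw.strip()
    if text == "NA":
        return None
    return datetime.strptime(text, "%m/%d/%Y").date()


first_day = parse_confirmation_date("03/07/2020")
last_day = parse_confirmation_date("11/11/2020")

# Both endpoints are included in the count.
window_length = (last_day - first_day).days + 1
print(window_length)  # 250
```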

Sex

This represents the biological sex of the tested individuals (patients). The dataset comprises 24,696 female patients (53%) and 21,744 male patients (47%). We report no missing cases on this variable. The variable takes two values: “male” and “female”.

Age

This variable captures the equivalent of age in completed years, with values ranging from 0 (less than one year of age) to 101. The average value is 43.4, the median is 43.0, and the most frequent value within the dataset is 52.0. The standard deviation is 17.7. Thirty-six cases have missing data (these represent less than 0.1% of all cases). In the dataset, we mark missing values by NA.

Age_groups

We recode the “age” variable into five-year age groups (brackets), with categories ranging from 0–4 y.o. to 85+ y.o. We derive a total of 18 groups. The category with the highest absolute frequency is the 40–44 y.o. group (namely, 5,650 positive cases correspond to this age group, accounting for roughly 12% of all data entries). Thirty-six patients have missing information (these represent less than 0.1% of all cases). In the dataset, we mark missing values by NA.
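
The recoding described above can be sketched as follows. This is an illustration only: the function name age_group and the exact label strings (e.g., "40-44") are assumptions, and the labels stored in the published file may be written differently.

```python
from typing import Optional


def age_group(age: Optional[int]) -> str:
    """Map age in completed years to a five-year bracket; None becomes "NA"."""
    if age is None:
        return "NA"
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= 85:
        return "85+"  # open-ended top bracket
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


# Ages 0 to 101 (the range reported for the dataset) fall into 18 brackets.
brackets = {age_group(a) for a in range(0, 102)}
print(len(brackets))  # 18
```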

District

This variable refers to the six administrative units (or sectors) forming the Municipality of Bucharest, each governed by a mayor. The six districts have a clockwise geographic arrangement. For instance, District 1 is located in the north, District 4 in the south, and District 6 in the west of the city (Fig. 3). The exact district is not available for 8,043 cases (that is 17.3% of all the data entries). For these cases, we add a generic location – “Bucharest”. The “district” marks the place of residence or quarantine for a specific individual case. Due to disclosure reasons, the BPHD did not disentangle the residence from the quarantine place. The variable takes the following values: “District_1”, “District_2”, “District_3”, “District_4”, “District_5”, “District_6”, and “Bucharest”.

Fig. 3

The administrative organization of the Municipality of Bucharest into six districts.

Month

We derive this variable from the “confirmation_date” variable. We assign each “confirmation_date” to a “month” (each date is nested in a month). Consequently, we obtain nine months of investigation (March – November 2020). Two of these months are incomplete: March and November 2020. Further, October has the highest absolute frequency of observations (19,429 cases accounting for 41.8% of all cases). Also, in the dataset, we have 137 data entries with missing data that are marked by NA. The variable takes the following values: “March”, “April”, “May”, “June”, “July”, “August”, “September”, “October”, and “November”.

Week

We derive this variable from the “confirmation_date” variable. We assign each “confirmation_date” to a “week” (each date is nested in a week). The variable takes as values the week number (e.g., w10, w11, w12, …, w46). Week counting starts from the beginning of the year (2020): the first week of 2020 runs from January 1st to January 4th. We consider that each week begins on Sunday. In our dataset, the pandemic onset in Bucharest is in week 10 (i.e., w10). The largest number of reported cases is in week 45 (6,105 observations accounting for 13% of all cases). Also, in the dataset, we have 137 data entries with missing data that are marked by NA.
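
A minimal sketch of this week-numbering rule is given below, assuming that week 1 is the (possibly partial) week containing January 1st and that every later week starts on a Sunday. The function name week_label is hypothetical. Under these assumptions the rule reproduces the values stated above: March 7th, 2020 falls in w10 and November 11th, 2020 in w46.

```python
from datetime import date, timedelta


def week_label(d: date) -> str:
    """Sunday-based week number counted from January 1st, e.g. "w10"."""
    jan1 = date(d.year, 1, 1)
    # Sunday on or before January 1st (Python: Monday=0 ... Sunday=6).
    week1_start = jan1 - timedelta(days=(jan1.weekday() + 1) % 7)
    return f"w{(d - week1_start).days // 7 + 1}"


print(week_label(date(2020, 3, 7)))    # w10
print(week_label(date(2020, 11, 11)))  # w46
```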

14_day_interval

We derive this variable from the “confirmation_date” variable. We assign each “confirmation_date” to a 14-day interval, or two-week time window (each date is nested in a two-week time interval). We build this variable for potential analysis purposes. The variable takes as values pairs of week numbers (e.g., “w10_w11”, “w12_w13”, “w14_w15”, “w16_w17”, …, “w44_w45”, “w46_w47”). Note that “w10_w11” and “w46_w47” (the beginning and the end of the time window) are incomplete. The minimum number of unique COVID-19 confirmed cases (i.e., 40, representing less than 0.1% of all observations) is in weeks 10–11. The maximum number of cases is in weeks 44–45 (12,045, accounting for 26% of all observations). We have 137 data entries with missing data marked by NA in the dataset.
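
The pairing of weeks into two-week windows can be sketched as below. The sketch assumes, based on the example values listed above, that each window starts on an even week number; the function name interval_label is hypothetical.

```python
def interval_label(week: int) -> str:
    """Pair an even week with the following odd week, e.g. 11 -> "w10_w11"."""
    start = week - (week % 2)  # round down to the even week that opens the pair
    return f"w{start}_w{start + 1}"


print(interval_label(10), interval_label(45), interval_label(46))
# w10_w11 w44_w45 w46_w47
```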

Stringency_phase

We derive this variable from the “confirmation_date” variable. We assign each “confirmation_date” to a “stringency phase” (each date is nested in a phase). This variable takes the following values: “phase_1”, “phase_2”, “phase_3”, “phase_4”, and “phase_5”. We define these stringency phases based on the various NPIs adopted by the authorities. The five phases differ in the stringency of the adopted measures. “Phase_1” covers nine days: between March 7th (the first confirmed case in Bucharest) and March 15th (the last day before the state of emergency). “Phase_2” corresponds to a 60-day time window between March 16th and May 14th (the entire period of the state of emergency). “Phase_3” covers a time frame of 33 days between May 15th and June 16th (the first state of alert). “Phase_4” has a 112-day duration equivalent to the period of relaxation (between June 17th and October 6th). “Phase_5” covers 36 days and corresponds to new restrictive measures (between October 7th and November 11th). Also, in the dataset, we have 137 data entries with missing data that are marked by NA.
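
As an illustration, the phase boundaries given above can be encoded in a small lookup table. The names PHASES and stringency_phase are hypothetical; the dates are those stated in the text, with both endpoints included. The final lines confirm that the five durations add up to the 250-day window.

```python
from datetime import date
from typing import Optional

# (label, first day, last day), both ends inclusive, as described in the text.
PHASES = [
    ("phase_1", date(2020, 3, 7), date(2020, 3, 15)),
    ("phase_2", date(2020, 3, 16), date(2020, 5, 14)),
    ("phase_3", date(2020, 5, 15), date(2020, 6, 16)),
    ("phase_4", date(2020, 6, 17), date(2020, 10, 6)),
    ("phase_5", date(2020, 10, 7), date(2020, 11, 11)),
]


def stringency_phase(d: Optional[date]) -> str:
    """Return the phase label for a confirmation date; None becomes "NA"."""
    if d is None:
        return "NA"
    for label, start, end in PHASES:
        if start <= d <= end:
            return label
    raise ValueError(f"{d} is outside the 250-day window")


durations = [(end - start).days + 1 for _, start, end in PHASES]
print(durations, sum(durations))  # [9, 60, 33, 112, 36] 250
```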

Technical Validation

Before being transferred to our team, the data were collected, curated, and anonymized by the Bucharest Public Health Department (BPHD), the Romanian Ministry of Health. The BPHD anonymized the data solely to safeguard personal data, without affecting the reliability, correctness, or comprehensiveness of the data. The BPHD is a Romanian public authority mandated to develop public health policies and programs, devise preventive measures and collect public health statistics. In this regard, the BPHD is a reliable official source of data. The authors do not have details of the collection of the epidemiological data and, therefore, cannot assess the reliability of the dataset acquired from the BPHD. After receiving the dataset from the BPHD, the authors performed additional checks to ensure its technical quality. Firstly, we implemented plausibility checks, looking for duplicate cases. We identified 118 duplicate cases, i.e., cases with identical values on all variables. These cases were removed from the dataset. Secondly, we performed completeness checks. Namely, we closely examined the variables for unavailable information. We detected missing values in two variables, confirmation_date and age (137 and 36 cases, respectively), each accounting for less than 0.3% of all observations. We did not employ any imputation technique for these missing observations. Instead, we marked the missing data with “NA”.

Thirdly, we executed conformance checks and compared the information in our dataset to available alternative data. To that end, we made several comparisons. We compared the Bucharest COVID-19 dataset against the first 147 disaggregated records officially announced by the health authorities at the onset of the pandemic in Romania12,22. For each of the 147 records, the authorities provided various pieces of information: the confirmation date, patients’ age, sex, residence, probable contacts, and travel history. A human-to-human network analysis of these first 147 records is available in the literature12. Further, we compared our dataset to the total population of COVID-19 cases officially reported in Romania for the same time window (March 7th - November 11th, 2020). We illustrate the two time series in Table 3. Also, we compared our dataset to the most recently updated data on the structure of the resident population in Bucharest (as of July 1st, 2020)20. Tables 4 and 5 illustrate the comparison between the Bucharest COVID-19 dataset and the structure of the Bucharest resident population. Finally, we compared the stringency phase variable from our dataset to the stringency index developed by the University of Oxford25.

Table 3 The distribution of the COVID-19 confirmed cases in Romania vs the distribution of the COVID-19 confirmed cases in Bucharest, by weeks and phases, between March 7th and November 11th, 2020.
Table 4 The resident population of Bucharest (as of July 1st, 2020) vs the COVID-19 confirmed cases in Bucharest (March 7th - November 11th, 2020) by sex and age groups.
Table 5 The resident population of Bucharest (as of July 1st, 2020) vs the COVID-19 confirmed cases in Bucharest (March 7th - November 11th, 2020) by sex and age combined.

Additionally, we ran data type checks to ensure that the data entered had the correct data type. For instance, we examined whether the age variable contains only numerical values. Further, we ran a range check to verify whether the values taken by our variables fall within a predefined range, for example, whether the age variable takes a reasonable range of values. Finally, we performed a format check to ensure that our variables had the predefined format. For instance, we assessed whether the values taken by the confirmation_date variable are all stored in the same fixed format, i.e., MM/DD/YYYY.
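
The three kinds of checks can be sketched as follows. This is not the authors' validation code: the function name check_record, the dictionary keys, and the upper age bound of 120 are assumptions made for the illustration, and each record is assumed to hold its values as text.

```python
import re
from datetime import datetime
from typing import Dict, List

DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def check_record(record: Dict[str, str]) -> List[str]:
    """Return the problems found in one record (an empty list if none)."""
    problems = []

    age = record.get("age", "NA")
    if age != "NA":
        if not age.isdigit():                 # data type check
            problems.append("age is not a whole number")
        elif not 0 <= int(age) <= 120:        # range check (120 is an assumed bound)
            problems.append("age is out of range")

    raw_date = record.get("confirmation_date", "NA")
    if raw_date != "NA":
        if not DATE_PATTERN.fullmatch(raw_date):  # format check: MM/DD/YYYY
            problems.append("confirmation_date is not MM/DD/YYYY")
        else:
            try:                                  # must also be a real calendar date
                datetime.strptime(raw_date, "%m/%d/%Y")
            except ValueError:
                problems.append("confirmation_date is not a valid date")

    return problems


print(check_record({"age": "43", "confirmation_date": "03/07/2020"}))  # []
```

The regular expression enforces the zero-padded layout, while the call to strptime rejects strings that have the right shape but no calendar meaning, such as a month of 13.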

Usage Notes

Our data records illustrate the COVID-19 prevalence in an urban community (Bucharest) for the first 250 days through a highly granular dataset. Specifically, the dataset comprises individual-level covariates, such as the age and sex of the officially confirmed patients, recorded longitudinally (daily). We hope to contribute to the current international efforts to gather disaggregated empirical evidence on the spread of COVID-19. Our dataset may prove a valuable instrument for public health experts, policy-makers, scientists, and even journalists interested in assessing and better understanding COVID-19 spread in urban communities, especially before the introduction of vaccines. For example, our data may inform the efforts of scientists to statistically model or simulate the spread of diseases, in general, and of respiratory viruses, in particular. Our disaggregated observations may assist public health experts in building a comprehensive picture of the epidemiological situation. Moreover, they may help scientists establish (confirm) causal inferences about virus circulation patterns and address potential problems, such as Simpson’s paradox26. Furthermore, this dataset is suitable for European or global comparisons as it comprises individual-level cases that allow for standardization.

If our dataset is supplemented with compatible information available from other sources16,22,27, it may prove fruitful for gearing pharmaceutical interventions (e.g., vaccination efforts and strategies). For example, a subsample of this dataset has been partly used to estimate the role of age in spreading COVID-19 across a social network in Bucharest16,17. The age and sex of the patients confirmed positive between August 1st and October 31st, 2020, were input into relational hyperevent models28,29. Precisely, these two covariates were combined with network data (human-to-human transmission chains) to test for age and sex homophily effects17. Additionally, the variables embedded in our dataset may also be used, as covariates, in conjunction with network data, for estimating Exponential Random Graph Models (ERGMs)15,30.

Furthermore, the current dataset may prove its utility in assessing the impact of the NPIs advanced by the local authorities in Bucharest between March 7th and November 11th, 2020. Various insights concerning the effects of the NPIs may be inferred using the sex and age covariates. For instance, the information in Table 6 implies significantly higher shares of COVID-19 cases among females when the stringency levels of the NPIs are higher. Further, the evidence exhibited in Table 6 may support the very few previous studies31 that claim the average age of COVID-19 patients decreases and stabilizes over time.

Table 6 The distribution of COVID-19 confirmed cases in Bucharest (March 7th - November 11th, 2020), by sex and age, across weeks and phases.

Our dataset may also provide a better understanding of the susceptibility to infection by biological sex. Already available scientific evidence has pointed to lower treatment efficiency32, greater rates of hospitalization33, higher probability of intensive treatment34, and a higher risk of death for males35,36,37. Still, the evidence is mixed when looking at confirmed cases. Earlier reports find a sex imbalance, with COVID-19 male patients having a greater risk of infection38,39. However, more recent studies uncover no difference between males and females regarding susceptibility to infection40,41.

We find in our dataset that, overall, significantly more females than males were confirmed with COVID-19 (χ² = 187.64, df = 1, p < 0.001). Approximately 53.2% of all confirmed cases were females. The high level of detail in our data shows how males and females were affected during the first 250 days of the pandemic in Bucharest. For illustrative purposes, we report the differences between the Bucharest COVID-19 dataset and the resident population of Bucharest, by sex and age groups, at 14-day intervals (Fig. 4). For females, differences range from −6.8 to +10.6, while for males, they range from −6.3 to +7.3. In Fig. 4, bright yellow colours designate high positive differences (more COVID-19 cases than we would expect given the total population), and dark blue colours designate high negative differences (fewer COVID-19 cases than we would expect given the total population). Negative differences were found in the 0–19 age group and the 70+ age group, irrespective of sex and confirmation date. Furthermore, the dataset provides evidence that COVID-19 affected the sexes disproportionately during the first few weeks of the pandemic. Throughout weeks 12 to 21, significantly high positive differences can be noticed for females aged between 40 and 54. More adult females tested positive during the state of emergency than we would expect given their share in the total resident population. The effect is not visible in the case of men. In sum, these results may reflect occupational segregation and, consequently, support existing reports about the unbalanced composition of the global health workforce (in which females represent about 70% of workers42).
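
For readers who wish to check the overall sex comparison, the sketch below runs a chi-square goodness-of-fit test on the counts reported in the “Sex” data field (24,696 females and 21,744 males). The null hypothesis of equal shares is an assumption, since the text does not state which expected distribution was used. Under that assumption the statistic comes out at about 187.65, close to the reported value. For one degree of freedom, the p-value can be obtained from the complementary error function in the standard library.

```python
import math

observed = {"female": 24_696, "male": 21_744}
total = sum(observed.values())
expected = total / len(observed)  # assumed null hypothesis: equal shares

chi_square = sum((n - expected) ** 2 / expected for n in observed.values())

# With df = 1, P(chi-square > x) equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))
female_share = 100 * observed["female"] / total

print(round(chi_square, 2), round(female_share, 1), p_value < 0.001)
# 187.65 53.2 True
```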

Fig. 4

Statistically significant differences between the structure of COVID-19 confirmed cases in Bucharest and the structure of the resident population of Bucharest. We illustrate the significant differences by age groups and sex: females (a) and males (b). Brighter yellow colours designate high positive differences (more COVID-19 cases than expected when compared with the total population) and dark blue colours designate high negative differences (fewer COVID-19 cases than expected when compared with the total population).

Comparisons of the disaggregated COVID-19 data to population parameters are expected to have a critical role in designing health policies. The scientific community has already documented this need for demographically informed decisions, stressing the importance of linking the stringency and content of COVID-19 NPIs to key figures of the population43. For instance, school closures and curfews for individuals aged 65+ are expected to produce different outcomes depending on the structure of the population of interest. Moreover, sex- and age-disaggregated data are expected to guide the crafting of vaccination strategies44,45,46. Last but not least, our dataset provides location details for 83% of the cases (see the “district” variable), which, coupled with the “confirmation date” records, could provide a spatiotemporal picture of the first 250 days of the COVID-19 pandemic. In conclusion, we argue that the present disaggregated dataset can significantly improve the accuracy and effectiveness of NPIs, especially in countries with low vaccination rates. Moreover, we consider that the qualitative scale of stringency (Fig. 1, Table 1 and the corresponding main content) is sufficiently justified and detailed for future researchers to use, extract, or modify it for their own exploratory purposes. Also, the modelling of COVID-19 spread in this micro-context may be performed by combining our dataset with the detailed evidence available from the COVID-19 Border Accountability Project (COBAP)47.