Self-reported infections in the German National Cohort (GNC) in the context of the current research landscape

Max J Hassenstein1,2, Ghazal Aarabi3, Peter Ahnert4,5,6, Heiko Becher7, Claus-Werner Franzke8, Julia Fricke9, Gérard Krause1,10,11, Stephan Glöckner1,10, Cornelia Gottschick12, André Karch13, Yvonne Kemmling1, Tobias Kerrinnes1,10, Berit Lange1,10, Rafael Mikolajczyk12, Alexandra Nieters14,15, Jördis J Ott1,10, Wolfgang Ahrens16,17, Klaus Berger13, Claudia Meinke-Franze18, Sylvia Gastell19, Kathrin Günther16, Karin Halina Greiser20, Bernd Holleczek21, Johannes Horn12, Lina Jaeschke22, Annika Jagodzinski23,24,25, Lina Jansen26, Carmen Jochem27, Karl-Heinz Jöckel28, Rudolf Kaaks29, Lilian Krist9, Oliver Kuss30, Susan Langer12, Nicole Legath13, Michael Leitzmann27, Wolfgang Lieb31, Markus Loeffler4,5, Nina Mangold32,33, Karin B. Michels8, Christa Meisinger34,35, Nadia Obi7, Tobias Pischon22,36,37,38, Tamara Schikowski39, Sabine Schipf18, Matthias B. Schulze19, Andreas Stang28, Sabina Waniek31, Kerstin Wirkner4,5, Stefan N. Willich9, Stefanie Castell1,10


Introduction
Infectious diseases continue to play an important role in subjective disease perception, health economic considerations and public health in Germany, even though they are no longer the leading causes of morbidity and mortality. They contribute considerably to the loss of healthy years of life [1]. Because they are contagious and can be prevented through hygiene measures and vaccination, communicable diseases are important for public health and subject to public debate. In recent years, infectious diseases have been associated with the development of non-communicable diseases in a complex interplay with the immune system, psyche, microbiome, genetics and lifestyle factors [2,3]. In addition, certain infections are known to contribute to carcinogenesis, such as hepatitis B virus infection (hepatocellular carcinoma) or human papillomavirus infection (cervical carcinoma) [4]. Data and biosamples of the German National Cohort (GNC) may provide further insights in this regard and pave the way for more targeted approaches to disease prevention. In particular, the use of several omics technologies in multi-omics analyses of different biosamples will play a role. This approach aims, for example, at the joint analysis of genetic data together with data from comprehensive transcriptome, protein or microbiome analyses. In combination with self-reported infection events, such analyses can support a paradigm shift towards personalised prevention and medicine.
Central assessment tools in the GNC are the computer-assisted interview ('face-to-face') and the self-administered questionnaire, which participants fill in independently at desktop computers in the study centre (touchscreen module) [5]. For both tools, the questions on infectious diseases were compiled from 2009 onwards by experts in the area of infections and the immune system. The interview includes questions that refer to the entire life course and to infectious diseases that may have a chronic progression. These questions are worded specifically to rule out self-diagnosis.
Transient infections are queried in the touchscreen module. Infections that are mostly temporary, such as acute respiratory infections, are also suitable for self-diagnosis. The self-administered questionnaires of the touchscreen module were evaluated in both GNC pretests between 2011 and 2013. The results indicated that the infection-specific questionnaire is suitable for assessing the targeted diseases and showed acceptable reliability [6,7].
The aim of this paper is to describe the GNC tools to assess self-reported infectious diseases.
Additionally, first data on disease frequencies are reported, and the assessment tools are put into context with tools used in other studies or in the German notification system, which is based on the Infection Protection Act (German abbreviation: IfSG). In this way, GNC assessment tools are positioned within the existing research landscape.

Methods
The presented data originate from the first 101,791 participants, examined in 18 GNC study centres between 14 March 2014 and 17 March 2017. The GNC is a population-based German cohort study that aims to include 200,000 participants aged 20 to 69 years. The baseline recruitment was conducted in 2014-2019, while the follow-up period is scheduled for 20-30 years [8]. The questions on infectious diseases were included in the so-called level-1 examination and were intended for all participants. The GNC study protocol has been reviewed by the respective ethics committees and data protection officers. All participants provided signed informed consent before taking part in the study [4]. The data presented in this publication (age, gender) are not quality-assured. Also, the data on diagnoses are not validated against medical data.
The infectious disease questions considered in both the interview and the self-administered questionnaire, including the respective response options, are presented in Table 1. The interview queries medical diagnoses of infections or, in the case of sepsis, hospitalisation. In contrast, all responses to the self-administered questionnaire rely on self-diagnosis. In this publication, the interview questions were analysed only with respect to the age of the participant at the time he or she visited the study centre. Further analyses, such as the consideration of age at diagnosis, are reserved for future research. Therefore, the age-stratified figures reflect the frequencies at the age at examination.
Within the touchscreen module, the response options for the occurrence of the infectious diseases have been aggregated, i.e. the presence of at least one disease/symptom is reported. For both instruments, the response options 'no answer' and 'do not know' as well as missing values have been excluded from the calculation of proportions. These answers are reported separately.
Some participants exceeded the maximum recruitment age of 69 years at their first site visit. Therefore, the age group of >= 70 years has been specified in age-stratified analyses.
Furthermore, four participants have been excluded due to implausible or missing reports on age. For herpes zoster and postherpetic neuralgia, cases have been omitted from the analyses if the age at diagnosis was below 21 because such cases are considered medically implausible.
All analyses have been conducted with R, Version 3.6.0, including the package 'tidyverse'.
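The handling of response options described above can be illustrated with a minimal sketch. It is written in Python rather than the R/tidyverse code used for the actual analyses, and it is not the authors' implementation: the response labels ('none', 'once', '2-3 times') and the function names are hypothetical placeholders, since the real response options are those listed in Table 1.

```python
from collections import Counter

# Hypothetical labels; None stands for a missing value.
NON_ANALYSABLE = {"no answer", "do not know", None}
NO_EPISODE = "none"  # assumed label for "no episode in the past 12 months"


def collapse(response):
    """Collapse a frequency option into 'yes' (at least one episode) or 'no'.

    Non-analysable responses are passed through unchanged.
    """
    if response in NON_ANALYSABLE:
        return response
    return "no" if response == NO_EPISODE else "yes"


def summarise(responses):
    """Compute the proportion of 'yes' among valid answers only.

    Excluded responses are counted separately so they can be reported.
    """
    collapsed = [collapse(r) for r in responses]
    valid = [r for r in collapsed if r not in NON_ANALYSABLE]
    excluded = Counter(
        "missing" if r is None else r
        for r in collapsed
        if r in NON_ANALYSABLE
    )
    n_yes = valid.count("yes")
    return {
        "n_valid": len(valid),
        "n_yes": n_yes,
        "proportion": n_yes / len(valid) if valid else None,
        "excluded": dict(excluded),
    }


# Invented example data for one touchscreen question.
example = ["none", "once", "2-3 times", None,
           "do not know", "none", "no answer", "once"]
result = summarise(example)
print(result)
```

In this sketch the denominator contains only participants with a valid answer, which mirrors the rule stated above; the excluded categories are returned alongside the proportion rather than being discarded silently.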

Results
In total, 101,787 participants were included in the analysis. The median age at site visit was 53 years (interquartile range 45-62 years); 53.6% (54,526/101,787) were female. There were 2,971 (2.9%) participants in the age group of 70 years and older. In the interview, the occurrence of herpes zoster, postherpetic neuralgia, hepatitis B and C, HIV/AIDS and tuberculosis was queried in terms of a medical diagnosis. For sepsis, hospitalisation was queried. Between 0.2% (HIV/AIDS) and 8.6% (herpes zoster) of the participants reported that a medical doctor had diagnosed the respective infectious disease during their life span (Table 2). In the touchscreen module, infection episodes within the past 12 months were queried for lower respiratory tract infections (LRTI), upper respiratory tract infections (URTI), bladder infections, gastrointestinal infections and fever (>38°C). For these infections, the proportion of responders reporting at least one episode ranged from 12.2% (bladder infection) to 81.2% (URTI) (Table 2). Here, URTI were reported most frequently, followed by gastrointestinal infections, fever and lower respiratory tract infections. Figures 1 to 3 present the proportions of participants, stratified by age and sex, who reported the respective disease or symptom within the life course (interview) or within the past 12 months (touchscreen module). Hepatitis C and HIV/AIDS were the least frequently reported diseases and are additionally presented in Figure 2 with a modified scale for better readability.
HIV/AIDS was mainly reported by men, and bladder infections mainly by women. Regarding data quality, the proportion of analysable data from the interview is very high (97.1% and more, Table 2), while this proportion is lower for the questions of the touchscreen module (90.0-91.2%, Table 2).
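The sample sizes and percentages reported in the text can be cross-checked with simple arithmetic. The following short Python sketch uses only figures stated in this paper; the variable and function names are illustrative choices.

```python
# Figures taken from the Methods and Results text.
n_examined = 101_791   # first participants examined
n_excluded = 4         # implausible or missing age
n_female = 54_526
n_70_plus = 2_971

n_analysed = n_examined - n_excluded


def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


pct_female = pct(n_female, n_analysed)
pct_70_plus = pct(n_70_plus, n_analysed)

print(n_analysed, pct_female, pct_70_plus)
```

The analysed sample size and both percentages agree with the values given in the text.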

Discussion
This publication is the first to present results of the GNC regarding infectious diseases or their symptoms. Due to the long follow-up period, the GNC has immense potential for research on risk factors or predictors of infectious diseases and their effects. The storage of biosamples allows future research on biomarkers associated with increased infection risk or on biological mechanisms in the context of infections and non-communicable diseases [8].
Standardised indicators of disease frequency, such as prevalence or incidence, cannot be derived directly from the presented numbers. The calculation of these indicators is planned for future disease-specific research in order to characterise the GNC accordingly. It should be noted that only survivors can report the diseases, so the numbers might be underestimated. Hence, GNC disease frequencies will be discussed within the framework of existing studies or the German notification system in future disease-specific publications. In contrast, this publication serves to put the questionnaires used into the context of the research landscape and thereby to show the potential of the GNC.
The age-specific frequencies of the infectious diseases ascertained in the GNC and presented here can only be compared to a limited degree with notification data, health insurance data or studies with a focus on prevalence or incidence assessment. The reason is that the frequencies presented in Figures 1 and 2 do not consider the age at diagnosis, but whether the respective disease has ever been diagnosed during the life course. Thus, the figures only show the age at disease assessment on site.
Furthermore, it is known that recalling medical events correctly is a challenge, especially if the events lie further back in time [9, 10]. The diseases and symptoms queried in the touchscreen module (Figure 3) probably provide a more precise representation of the age at diagnosis, as the questionnaire queries occurrences within the 12 months before the site visit. In general, validation of the self-reports is reasonable and can, at least partially, be realised with health insurance data in the future. Here, it should be kept in mind that not all symptomatic patients necessarily visit a doctor's practice.
The interview data have excellent quality in terms of valid answers, but the touchscreen module data score considerably less well (Table 2). The frequencies presented in Table 2 refer to all participants with a valid answer (without missing values, 'no answer' and 'do not know'). It cannot be determined whether a missing answer equals the absence of the disease, and, consequently, the calculated results may be biased. This issue could be addressed in future analyses with appropriate methods, e.g. multiple imputation [11].
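How far non-analysable answers could shift a reported proportion can be illustrated with a simple bounding calculation. This is not multiple imputation and not part of the analyses in this paper; it is a deliberately crude Python sketch that assumes the two extreme cases (all non-analysable answers are 'no', or all are 'yes'). The counts in the example are invented for illustration and are not GNC data.

```python
def proportion_bounds(n_yes, n_no, n_unknown):
    """Return (lower, observed, upper) for the proportion of 'yes'.

    observed: complete-case proportion (valid answers only)
    lower:    all non-analysable answers assumed to be 'no'
    upper:    all non-analysable answers assumed to be 'yes'
    """
    n_valid = n_yes + n_no
    n_total = n_valid + n_unknown
    observed = n_yes / n_valid
    lower = n_yes / n_total
    upper = (n_yes + n_unknown) / n_total
    return lower, observed, upper


# Hypothetical counts: 10% of answers are non-analysable.
lower, observed, upper = proportion_bounds(n_yes=120, n_no=780, n_unknown=100)
print(f"lower={lower:.3f} observed={observed:.3f} upper={upper:.3f}")
```

The complete-case proportion always lies between the two bounds, and the interval widens as the share of non-analysable answers grows, which is why the touchscreen module results are more exposed to this bias than the interview results.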
In the following, all reported infectious diseases or symptoms are put into the context of the research landscape. Additionally, an outlook on research questions will be given that can be answered with GNC data in the future.

Tuberculosis
In Germany, every case of tuberculosis (TB) is required by law (IfSG) to be reported to health authorities. The data are published yearly by the Robert Koch-Institute (RKI) in their disease-specific report [12]. The German notification system meets international quality standards [13]. Nevertheless, gaps in notifications are still possible. These might occur in vulnerable populations or for forms of TB that are diagnosed only clinically, without microbiological verification. Further research data on TB are included in health insurance data. Data collections on specific populations who underwent active surveillance either for active or latent TB, e.g. for occupational reasons [14], or because of serial X-ray examinations [15], allow a different type of (historical) comparison [16]. Apart from the limitations of the GNC self-reports mentioned above, there is the challenge that latent TB may be falsely reported as tuberculosis.
Positive tests (e.g. tuberculin skin test) hint at latent TB, but unlike active tuberculosis, the condition is asymptomatic, and the microbiological detection of M. tuberculosis complex remains negative. This may result in an overestimation of the frequencies reported in the GNC. In our view, potential uses for the GNC data arise especially if the diagnosis of active or latent tuberculosis is confirmed, e.g. by information from biomaterials. In this case, crucial issues concerning disease progression after TB and possible consequences (e.g. restriction of lung function) can be investigated. The GNC follow-up period would exceed that of most other existing international cohort studies [17,18].

Herpes zoster and postherpetic neuralgia
Despite the high disease burden of herpes zoster [19], there is no obligation to report the disease to authorities in Germany because of its low contagiousness. Therefore, studies concerned with the epidemiology of the disease are mostly based on secondary data [20,21]. These studies were able to detect and quantify previously unknown cardiac and cerebrovascular consequences of a varicella-zoster virus (VZV) reactivation.
For herpes zoster and postherpetic neuralgia, an assessment of the validity of GNC pretest data has been performed. Presumably, earlier herpes zoster episodes were less likely to be reported than more recent episodes [24]. As mentioned earlier, this constitutes an important limitation of the self-reported infectious diseases in the GNC.
From an epidemiological point of view, GNC baseline and follow-up examinations are conducted within an interesting time period, since in Germany, the varicella (chickenpox) vaccination has been recommended by authorities since 2004 [25] and the herpes zoster vaccination since 2019 [26]. In contrast to studies based solely on secondary data, the longitudinal design of the GNC with repeated collection of biosamples and integration of vaccination card data allows research on risk factors and biomarkers for VZV reactivation and possibly provides insight into molecular causes of vaccination failure. The results could be used for the development of personalised vaccination strategies.

HIV/AIDS
In Germany, various tools are available to estimate the disease burden of HIV/AIDS. A case register has been available since 1982, based on voluntary and anonymous reports of patients by physicians [27]. Since 1987, laboratories have been required to anonymously notify positive HIV tests (currently § 7 par. 3 IfSG). Furthermore, reports on HIV frequencies from routine tests of all blood donations have been available [27]. From these data sources, the cause-of-death statistics of the statistical offices of the federal states and the sales data on antiretroviral therapy from pharmacy accounting centres, the RKI estimates incident HIV infections [28]. Additionally, multiple patient registers or cohorts exist, e.g. the HIV-1 seroconverter study managed by the RKI since 1997 [29] or registries with clinical patient data ([30], www.tp-hiv.de). The 'Translational Platform HIV' (TP-HIV) has an integrated biobank and, therefore, allows molecular-genetic transmission investigations [31,32]. Due to the low number of cases, it is difficult to answer specific research questions regarding HIV within the GNC; rather, information on HIV infection may be used as an exclusion criterion or confounder in immunological and infectiological GNC analyses.

Hepatitis B and C virus infection
In terms of hepatitis B and C virus infections (HBV/HCV), the long latency between the infection and clinical disease as well as the often asymptomatic nature of the condition challenge epidemiological research. This becomes apparent in the cause-of-death statistics and the statistics of notifiable infectious diseases. According to updated surveillance case definitions, acute HBV cases are registered independently of stage and clinical presentation [33]. Thereby, asymptomatic cases can be reported, which addresses the underestimation of HBV in notifiable disease records. Additionally, blood donors are tested for both viruses [34]. The screening of pregnant women generates additional information regarding chronic HBV infection [35], as do studies among high-risk populations [36] and individuals infected with HIV [37]. As for other infections, secondary data from health insurances can be analysed scientifically. However, these are restricted to individuals with insurance and focus on treatment parameters (e.g. [38]). Also, patient registries and studies comprise specific data, e.g. on disease progression (www.deutsche-leberstiftung.de). Similar to HIV/AIDS, data on hepatitis in the GNC will mainly play a role as exclusion criteria because participants are rarely affected. On the other hand, long-term follow-up data from the GNC can be used to investigate consequences of chronic viral hepatitis infection. Given the existence of serological vaccine markers for HBV, the collection of vaccination data within the GNC provides options for a comparative analysis of self-reported and serologically determined HBV immunity. Also, explorative investigations on the proportion and predictors of possible HBV vaccine non-responders can be conducted.

Infections of the upper respiratory tract
Infections of the upper respiratory tract are among the most frequent infectious diseases [39]. This is reflected in the frequencies obtained from the touchscreen module (Table 2). Moreover, e.g. influenza [...] Furthermore, the analysis of URTI data may help to confirm or extend the results of specialised local population studies (AWIS: [3,41], LIFE-Adult: [42]) or of studies focusing only on acute respiratory infections, such as GrippeWeb [43].

Infections of the lower respiratory tract
In Germany, pneumonia causes over 250,000 hospitalisations each year with a case fatality rate of approximately 13% [44]. Globally, infections of the lower respiratory tract (LRTI) are responsible for approximately 103 million DALYs [45]. With GNC data, existing studies on pneumonia from the clinical perspective (CAPNETZ [46], PROGRESS [47]) can be expanded by the analysis of molecular factors and lifestyle data, and put on a wider data basis. Furthermore, the longitudinal design allows the assessment of long-term consequences of respiratory infections such as cardiovascular events [48].
The GNC data on LRTI will be evaluated in future publications. It is conceivable, for instance, that respiratory diseases such as asthma or chronic obstructive pulmonary disease (COPD) are mistaken for LRTI, which would lead to an overestimation of cases. The comprehensive database of the GNC, including information on medication intake, allows for a more differentiated consideration in the future.

Gastrointestinal infections
Acute gastrointestinal infections lead to 63.2 million missed working days due to illness [49]. In Germany, gastrointestinal illnesses are among the infectious diseases most frequently reported to the Robert Koch-Institute, which is particularly the case for infections caused by norovirus, rotavirus, campylobacter and salmonella [50]. Mild or moderate gastrointestinal tract infections are often not reported, especially among adults [51]. The results of international studies confirm this [52-55].
Depending on the study, only 20 to 40% of the diseased consulted a doctor; stool samples with pathogen diagnostics were only available for a fraction of these patients.

Urinary tract infections
In Germany, there are no comparable studies with similar data collection on urinary tract infections (UTI). Under the IfSG, there is no obligation to notify. The former Barmer GEK health insurance determined the UTI prevalence among its insured aged 12 and over for 2013 [57]. However, this study used secondary data, which only capture cases in which a doctor has been consulted. The European ARESC study, with study centres in Germany, analysed the pathogen spectrum to assess the resistance situation. The goal was to evaluate the recommendations regarding uncomplicated cystitis [58]. Hence, GNC data can fill a gap. Apart from that, the data can be considered for the construct 'infection susceptibility' (see above section on URTI). Additionally, self-reports on UTI from international studies mostly originate from female study populations [59,60]. The self-reports from male GNC participants may add to these data and provide a broader insight into the epidemiology of transient infections. Finally, long-term effects of frequent UTI in women can be investigated, e.g. unspecific consequences such as shortened life expectancy.

Fever
Fever is a symptom of infection. It is commonly ascertained in studies as a symptom, e.g. of respiratory tract infection to differentiate between influenza-like illness (ILI) and common cold [61]. This is especially true for studies that complement the notification system by supplementary surveillance such as GrippeWeb [43]. Clinical studies on LRTI or sepsis also collect data on fever and, in some instances, also hypothermia [47]. However, temperature thresholds often differ between studies, which in turn limits cross-cohort evaluations (e.g. LIFE-Adult: >38.5°C in case of infection [42]; PROGRESS: >=38.3°C rectally or >=37.8°C orally [47]; GNC separate question: >38°C). The GNC data can be incorporated into the construct 'infection susceptibility' in future analyses and taken into account in molecular investigations of the immune system.

Sepsis
Sepsis is a common complication of various diseases and occurs especially in hospitals and intensive care. It has an incidence of approximately 110 cases per 100,000 inhabitants per year and is associated with a high case fatality rate [62]. In sepsis, the initially targeted immune response of the body expands systemically, which can lead to considerable organ damage and death. Therefore, sepsis is an important medical research area. Large cohort studies like the GNC, with information on risk factors, molecular data and health insurance data, can assist research on sepsis together with data from clinical cohorts (AlertsNet [63], SepNet study group (www.sepsis-stiftung.eu/sepnet/), PROGRESS (www.capnetz.de/html/progress/project), [47,64]), population-based cohorts (LIFE-Adult [42]) and additional health insurance data. Here, the GNC also provides data on participants who have overcome sepsis and are healthy enough to participate in the study.
In summary, numerous and very diverse research questions on all infectious diseases queried in the GNC can be investigated in the future. Thereby, the GNC complements the existing research landscape in a variety of ways.

Outlook
Based on a cross-sectional design, publications on results from the GNC in the area of infection research can be expected from 2021 onwards, e.g. on determinants of susceptibility to certain self-reported infections. In particular, it will be possible to examine the distribution of infectious diseases by social class. Further research in the field of infectious disease epidemiology will be feasible after GNC biosamples have been analysed.
Separately financed serological determinations of antibodies against the causal pathogen of Lyme disease and analyses of the stool or nasal microbiome up to high-dimensional multi-omics analyses are also conceivable. Infection-specific subcohorts with timely data and symptomatic biosample collection concerning acute infections as well as the collection of viable blood cells for characterisations of the immune system (www.info-pia.de) require additional funding. These types of investigations have not been conducted within the GNC so far. Future research in the area of infectious disease epidemiology will, therefore, focus on these additional projects and analyses.

Ethics
All described examinations on humans have been conducted with the approval of the responsible ethics committees, following national laws and the Declaration of Helsinki of 1975 (in the current, revised version). All participants provided signed informed consent.

Conflicts of interest
The authors state that there are no conflicts of interest.

Funding
This project was carried out with data from the German National Cohort (www.nako.de). The GNC is