Background

Highly pathogenic avian influenza H5N1 has a case fatality rate approximately ten times higher than the H1N1 “Spanish Flu” pandemic of 1918, but has much lower infectivity [1, 2]. Given the high pathogenicity of the virus, an H5N1 virus capable of airborne human-to-human transmission could lead to a devastating pandemic [3,4,5]. Indeed, this scenario has been predicted for many years [1, 2, 6, 7]. A number of important genetic differences between avian and human strains of influenza exist, which have so far prevented H5N1 from being transmitted directly from human to human [8,9,10,11]. However, years of viral evolution in domestic poultry populations involving close contact with humans and other mammals has the potential to lead to an accumulation of mutations improving mammalian adaptation [1, 12].

Egypt has reported more confirmed human cases of avian H5N1 than any other country, and is regarded by many observers as the epicenter of human H5N1 infections [1, 13,14,15,16,17]. Human infections have been increasing in Egypt, and nearly all cases have been linked to direct contact with poultry [18]. The potential for direct human-poultry contact is high in Egypt. In addition to the commercial poultry industry, which supplies most of the meat consumed in the country, some 4–5 million Egyptian families raise poultry at home in backyard flocks [19, 20]. A 3-year seroprevalence study found approximately 2% of Egyptians exposed to poultry had been infected with H5N1 [21]. Contact between human and avian hosts is a precondition for virus transfer, so an environment conducive to frequent contact, such as exists in Egypt, increases the probability of a human-adapted strain emerging [22, 23].

Genetic changes in viral proteins can have phenotypic consequences that increase pandemic risk [24]. A number of specific amino acid substitutions have been implicated in host specificity, particularly changes in the hemagglutinin (HA) protein that allow the virus to bind with and infect mammalian cells [2, 25,26,27,28]. In addition, there are genetic mutations known to alter the pathogenicity of the virus, increase the replicative ability of the polymerase complexes, and grant aerosol transmissibility without reassortment [28,29,30,31]. Viral genetic changes have been identified that alter susceptibility to current anti-viral drugs such as zanamivir and oseltamivir [24, 32]. All of the above are examples of phenotypic changes resulting from amino acid substitutions, often SNPs. Some potentially dangerous mutations may simultaneously compromise viral fitness, requiring the presence of other counteracting or “permissive” mutations to restore viral fitness in order to present a threat [33]. We refer to these genetic changes (or sets of concurrent changes) that alter host specificity, virulence, replication efficiency, transmissibility, or drug susceptibility in ways that increase human risk as key substitutions (KS).

Several KS have already been detected in Egypt, from samples at least as early as 2010 [18, 34, 35]. Gain-of-function studies have suggested that as few as four or five KS may be sufficient to provide H5N1 with airborne transmission, and at least two of these have been previously identified in Egyptian isolates [5, 31, 36]. Using viral genetic sequences for H5N1 uploaded to GenBank (https://www.ncbi.nlm.nih.gov/genbank/), we performed a descriptive spatial analysis to examine the distribution of KS in Egypt [37]. A spatial analysis of KS in H5N1 influenza samples has not been performed previously, despite the importance of such information to a properly designed surveillance system and subsequent intervention efforts [13, 38]. Importantly, viruses can undergo local adaptation in response to environmental selective pressures and produce clusters of KS [39]. Areas where known KS occur with high frequency and exhibit non-random spatial clustering may indicate environments particularly conducive to the natural emergence of de novo KS.

Methods

We obtained Egyptian H5N1 sequences collected between 2005 and 2015 from the Influenza Virus Sequence Database, hosted by the National Center for Biotechnology Information (NCBI) [40]. We extracted metadata for Egyptian samples, including geographic location information, from the full GenBank records, obtained using the Entrez Direct utilities via UNIX terminal [37, 41]. Inclusion criteria included all partial and complete protein sequences from avian hosts (both poultry and wild birds) with a geographic location more detailed than “Egypt.” A large number of H5N1 KS have been identified by the WHO Collaborating Center for Influenza Reference and Research at the CDC and are catalogued in a Genetic Changes Inventory, publicly available online (www.cdc.gov/flu/avianflu/h5n1/inventory.htm). We used this inventory in the current study as the definitive list of known KS. KS have been identified in all 10 viral proteins, including the surface proteins HA and neuraminidase (NA) responsible primarily for virus binding, cell fusion, and viral release, and the internal proteins (M1, M2, NP, NS1, NS2, PB1, PB1-F2, and PB2) involved in viral structure, replication, suppressing host antiviral response, and more [2]. We performed an initial screening for over 100 KS, ultimately testing for the presence of 39 separate substitutions.

Most of the selected KS (n = 29) are SNPs, such as the substitution from glutamic acid to lysine at position 627 (E627K) in the PB2 gene, shown to enhance mammalian host adaptation; two KS are amino acid deletions; and the remaining eight KS are sets of mutations (i.e., KS formed from the interaction of multiple independent substitutions), such as the dual substitutions of N200S in the NS1 gene and T47A in the NS2 gene, which together decrease host antiviral response but do not constitute KS individually. Table 1, adapted from the CDC’s Genetic Changes Inventory, lists all detected KS in Egypt, including the number of isolates in which each KS was detected, whether or not the KS had been previously reported in Egypt, and the phenotypic consequences identified in previous literature. All 39 KS are associated with increased human risk, with two partial exceptions: 1) a triple substitution of S155 N, T156A, and S223 N in the HA gene associated with both increased mammalian adaptation and reduced lethality in mice; and 2) the N66S substitution in the PB1-F2 gene associated with increased virulence and increased host antivirus response [42,43,44,45]. We selected five phenotypic consequences groups (PCGs) using the CDC’s Inventory to group KS by the resultant changes in viral characteristics (i.e. phenotypic consequences): PCG1 – host specificity, PCG2 – virulence, PCG3 – replication efficiency, PCG4 – transmissibility, and PCG5 – antiviral susceptibility. These phenotypic consequences were identified by the CDC as “signal[ing] adaptation to mammalian species or alter[ing] susceptibility to existing antivirals” thereby rendering the virus more dangerous to humans [46].

Table 1 Detected KS in Egypt with phenotypic consequences and references, adapted from CDC’s Inventory

We aligned the sequences with the A/Vietnam/1203/2004 virus, except for the NA and NS1 proteins, which we aligned with A/goose/Guangdong/1/1996, for consistent amino acid numbering with the CDC’s Inventory, and then tested for the presence of each selected KS. H5N1 samples in GenBank varied in the number of proteins for which full sequence data were available. This variation occurs when researchers choose not to sequence certain proteins or fail to upload sequences to GenBank due to quality concerns, such as unconfirmed frame shifts. As not all proteins were present in GenBank for each virus sample, the specific number of KS tested for varied between samples. Detection rates by viral sample were calculated as the number of KS detected divided by the number of KS tested for, which varied according to the number of proteins sequenced for each isolate. Results were summarized by viral protein, sample year, and by PCG.

Geospatial hot spot analyses have been used to help target areas with a disproportionate burden of disease [47, 48]. We employed the Getis-Ord Gi* statistic to identify statistically significant hot spots, or spatial clusters of elevated rates of KS detection [49]. The Getis-Ord Gi* has been shown to be effective even with low incidence rates [49, 50]. For a hot spot to be detected, a location with a high value (e.g., many KS detected) must have neighbors with high values. Neighbors are determined in one of two ways. First, because the scale at which clustering occurs is not known a priori, we used Incremental Spatial Autocorrelation to identify a statistically significant peak distance at which clustering across the whole study area was most pronounced [51,52,53]. This distance was then used to select neighbors during hot spot analyses. If no peak was detected, an average distance was calculated to yield thirty neighbors. The local sum for a feature and its neighbors is compared proportionally to the sum of all features, then a z-score is calculated to determine when this difference is too large to be the result of random chance. Statistical significance is adjusted to account for multiple testing and spatial dependence using Benjamini’s and Hochberg’s false detection rate (FDR) correction, which reduces the critical p-value thresholds as a function of the number of input features and the neighborhood structure used [54, 55].

The majority of H5N1 samples from Egypt did not include geographic information below the governorate level to allow hot spot detection, consequently dasymetric adjustment was employed to approximate sample locations within governorates [56]. We randomly distributed point locations only within populated places, as identified using the 2010 population density estimates from worldpop.org with a minimum density value of 10 persons per square kilometer [57]. This removed large uninhabited portions of many governorates from the analysis. Acknowledging the bias introduced by estimating point locations, the random distribution of points was repeated 10,000 times, with hot spot analysis performed after each iteration. Only statistically significant hot spots (p ≤ 0.05) were kept. We summarized results for each governorate by counting the total number of iterations in which at least one hot spot was detected. We considered governorates with 9500 or more detections to be probable hot spots. For comparison, we also performed a single hot spot analysis using each governorate’s geometric centroid, consistent with prior phylogeographic studies of influenza in Egypt [16, 58]. Figure 1 shows the sets of locations for which spatial clusters of high KS detection rates were evaluated.

Fig. 1
figure 1

Locations for which spatial clusters of high KS detection rates were evaluated. Governorate centroids and boundaries shown, with dasymetrically-adjusted populated places shown with hatching

Results

From 2006 to 2015, 925 H5N1 samples with at least one sequenced gene segment and a geographic location more detailed than “Egypt” were listed in GenBank. Of these, 874 samples contained at least one KS (94.5%), with an average of 3 PCGs represented. The majority of KS-positive samples (584/925) were located in Lower Egypt, particularly in the governorates of Alexandria (n = 50), Beheira (n = 96), Monufia (n = 90), Qalyubia (n = 109), Dakahlia (n = 82), and Sharqia (n = 88) (see Fig. 2).

Fig. 2
figure 2

Counts of viral samples that met inclusion criteria by governorate

The most commonly sequenced protein was HA (n = 825), followed by NA (n = 381); other viral proteins were much less frequent in the GenBank data. Detection rates varied by protein with the NA and NS1 proteins having one or more KS present in 100% of samples tested, while the PB1-F2 protein had only a single detected KS out of 88 samples tested. The average number of KS tested for was 22.6 per sample, with an average of 6.4 KS detected, giving an overall mean detection rate of 0.28. A total of 39 KS were detected at least once in H5N1 samples from Egypt. Table 2 summarizes the detected KS by protein and includes the proportion of samples in which KS were present.

Table 2 Summary of KS detections by viral protein

Although H5N1 was first reported in Egypt in 2006, 2 positive viral samples were collected in December 2005. The number of isolates uploaded to GenBank with governorate location information rose above 100 each year in 2007–2010, but dropped after 2011. Similarly, the average number of proteins sequenced per virus ranged from a high of 2.7 in 2008 to a low of 1.3 in 2014. Between 2005 and 2015, mean detection rates by year (total KS detected per year divided by total KS tested for in that year) ranged from a high of 0.38 in 2007 to a low of 0.18 in 2014. Figure 3 summarizes the temporal distribution of detected KS by year, with the 95% margin of error for KS detection rates shown.

Fig. 3
figure 3

Temporal distribution of viral samples and KS detection rates by year

The single most commonly detected KS was the change in the HA protein at amino acid site 94 from aspartic acid to asparagine, associated with increased virus binding to human sialic acid receptors [59]. KS were grouped according to phenotypic consequences as identified in the CDC’s Inventory. Among the detected KS were several not previously reported in Egypt (n = 17), representing all 5 PCGs and including 5 of the 8 detected KS associated with antiviral susceptibility. One KS not previously reported in Egypt was a set of 7 amino acid substitutions in the PB2 gene, associated with both PCG 2 and PCG 3 [60]. For a full list of detected KS with phenotypic consequences, see Table 1. The number of viral samples with one or more detected KS in each group as well as mean detection rates by PCG are summarized in Table 3.

Table 3 Summary of detected KS by phenotypic consequences groups (PCG)

Results of hot spot analyses

Two governorates contained probable hot spots of overall KS detection rates using both governorate centroids (single run) and dasymetrically-adjusted governorate boundaries (> 9500 iterations): Alexandria and Beheira. These two governorates are immediate neighbors in the northwestern portion of the Nile River Delta (see Fig. 4). No other governorates were identified as hot spots using centroids, and no other governorates exceeded the threshold of 9500 iterations using dasymetric adjustment.

Fig. 4
figure 4

Summary of hot spot analyses for overall KS detection rates. The governorates of Alexandria and Beheira were the only locations identified as containing probable hot spots using both dasymetric adjustment (> 9500 iterations) and governorate centroids

Geospatial hot spots were also evaluated for individual PCGs, using a p-value cutoff of 0.05 to evaluate statistical significance. We detected probable hot spots based upon dasymetric adjustment for PCG1 in Alexandria and Beheira and for PCG2 only in Beheira, however the remaining PCGs falling below the 9500 iterations threshold. Results were not consistent between centroid and dasymetrically-adjusted analyses. Using a single iteration with all samples placed at governorate centroids, hot spots were found for PCG1 in Beni Suef, Dakahlia, and Faiyum; for PCG2 in Alexandria and Beheira; and for PCG4 in Asyut, Faiyum, and Monufia; while PCG3 and PCG5 again did not contain any hot spots (see Fig. 5). The number of iterations in which probable hot spots were detected using dasymetric adjustment, as well as detections using centroids, are shown in Table 4.

Fig. 5
figure 5

Summary of hot spot analyses for KS detection rates by PCG, using (a) dasymetrically-adjusted sample locations (> 9500 iterations), and (b) governorate centroids

Table 4 Results of hot spots analyses for overall KS detection rates and by PCG

Discussion

This work represents the first spatial analysis of KS in H5N1 avian influenza virus in Egypt, and the first use of geospatial hot spot analysis for influenza KS. The majority of viral samples (874/925) contained one or more KS, with 39 distinct KS detected including from each of the five PCGs. The circulation of so many KS increases the probability of recombination and reassortment producing viral genomes with combinations of KS that prove to be well-adapted to infect, replicate in, and transmit between humans.

Samples with governorate location information uploaded to GenBank varied from year to year, with a low of 39 in 2005–6 and a high in 2009 with 155 isolates, however the sample sizes dropped after 2011. It is not entirely clear why fewer isolates from Egypt are being uploaded to GenBank than previously, although political turmoil and social unrest in the wake of the 2011 revolution are likely responsible. Detection rates of KS also varied, trending down during the study period from a high of 0.38 in 2007 to a low of 0.18 in 2014 (see Fig. 3). This decline is likely driven a smaller number of protein sequences per sample being uploaded in later years of the study period, with the HA gene being the most commonly uploaded. HA sequences tend to have lower detection rates due to the larger number of known HA KS being tested for. The problem of outdated or incomplete reporting of avian influenza is not new in the country, and several studies suggest the true burden of disease in Egypt is severely underestimated [3, 61]. Given the inefficiency in past years of vaccination efforts in the country, there is a pressing need for increased biosecurity and public health education to help limit direct contact between humans and poultry [62,63,64]. Due to limited resources, the ability to focus such campaigns in regions of greatest need is one of the primary contributions of this work.

Spatial clusters of overall KS detection rates were identified using hot spot analyses in the northwestern portion of the Nile River delta region in Egypt, specifically within the governorates of Alexandria and Beheira. Hot spots were also identified for individual PCGs, however the reduced sample sizes resulting from splitting the dataset into PCGs increased the variability. This can be observed in the differences between the results using dasymetrically-adjusted locations and those using centroids. For example, hot spots were detected using centroids in Beni Suef, Dakahlia, and Monufia governorates for various PCGs, despite these governorates containing hot spots in less than half of the iterations using the dasymetrically-adjusted locations. Since both centroids and dasymetrically-adjusted locations are estimates, the latter method is preferable as its iterative nature (i.e., 10,000 replicate analyses) provides an indication of the reliability of our estimates and does not assume all samples were collected at the geometric centroid of the governorate. Using dasymetrically-adjusted locations, spatial clusters of KS detection rates were identified only in Alexandria and Beheira for KS related to host specificity and virulence (PCG1 and 2).

The city of Alexandria is the second largest city in Egypt and an important seaport, as well as a popular tourist destination. The neighboring governorate of Beheira is located between Alexandria and Cairo, experiencing substantial transport between those two urban centers. A study in Beheira in 2015 claimed the majority of the population was involved in raising poultry, and found most poultry growers demonstrated poor practices with regards to avian influenza prevention and control [65]. Spatial clustering of high rates of KS in this area suggests increased surveillance is warranted. It is not clear if the detected KS represent de novo mutations in Egypt, or are being introduced from outside the country. The identification of Alexandria and Beheira as a cluster of KS detections may suggest the latter due to their location on the Black Sea/Mediterranean flyway, but more work is required to determine this with confidence [66,67,68,69]. If the majority of detected KS are the result of viral introductions, it might partially explain the observed spatial patterns, including the relatively low rates in and around Cairo.

A limitation of this analysis was that in most governorates sample sizes were too small to allow for time series analyses of detected hot spots, therefore spatial clusters were assumed to be consistent over the 10-year study period. We were also limited by the lack of fine-grain geographic location information for influenza sequences on GenBank, which were the primary source of data for this study. Recognized by previous researchers as an important shortcoming to the otherwise incredibly useful datasets available through GenBank, the lack of consistency and infrequent inclusion of sub-governorate location information required the use of approximation methods which introduced uncertainty into the analyses [70,71,72]. Previous spatial investigations of Egyptian isolates relied solely on geographic centroids of governorates, or even the country [16, 58, 73,74,75]. This and many other studies focused on the spatial dynamics of genetic data would benefit immensely from improved geographic resolution in GenBank records. We call on researchers uploading genetic data to GenBank and similar repositories to include as much geographic detail in the metadata as possible, while respecting privacy considerations when appropriate. Sampling bias, both in terms of geographic locations chosen for sample collection and in terms of genes selected for sequencing, was a challenge for this study as with all secondary analyses. It is difficult to know to what extent such bias introduced uncertainty into the current analysis. While sampling would appear to be concentrated in the most densely populated areas of the country (see Figs. 1 and 2), it is unclear from the information available in GenBank what sampling strategies were employed or why full genome sequences were not collected or cataloged. Another challenge was that at the time of this writing, the CDC’s H5N1 Genetic Changes Inventory had not been updated since June 26, 2012. This corresponds with the publication of two papers detailing gain-of-function experiments in which KS allowing avian H5N1 transmission between mammals were identified and described, leading to concerns about public dissemination of KS information [31, 36, 76]. Despite the recent lifting of the NIH’s funding pause on gain-of-function studies in December 2017, no updates to the June 2012 edition of the inventory were published prior to this study.

Conclusions

This work provides a descriptive baseline for the distribution and spatial clustering of KS in Egypt, from which future studies can be compared. High overall detection rates of KS were found to cluster spatially in the northwestern portion of the Nile River delta. Large numbers of KS were detected with a range of associated phenotypic consequences, which raise concerns of a pandemic strain developing in the region. Better geographic data and improved genetic surveillance are necessary to properly evaluate and monitor this potential threat to global public health.