Helicobacter pylori is an important human pathogen that is likely to be present in the gastric mucosa of over half of the world’s population. The prevalence of H. pylori infection appears to be higher in low- and middle-income countries than in developed countries, and prevalence often varies between ethnic groups within a country [1, 2]. Such localised differences might be attributable to socioeconomic factors [4,5,6], although H. pylori-related factors may also contribute. The prevalence of infection in Asia and Africa is 54.7% and 79.1%, respectively. In North and South America the prevalence is 37.1% and 63.4%, respectively, and in Europe the prevalence is on average 47.0% [3]. Prevalence differences between racial and ethnic groups have been described in various parts of the world, but the extent to which such differences can be attributed to socioeconomic and other possible risk factors is unclear [4,5,6]. Vietnam is the easternmost mainland country in Southeast Asia, with an estimated population of 96 million (2019, UNFPA-VN) comprising more than 50 ethnic groups of different cultures; ~ 65% of these groups are located exclusively in remote or rural areas (2019, UNFPA-VN) [7, 8]. Earlier studies in both hospital and community settings showed a high prevalence of H. pylori infection in Vietnam [9,10,11]. There is considerable variation in socioeconomic status and lifestyle across a rapidly changing Vietnam; building on previous studies [9,10,11,12,13], this study investigates the risk associated with H. pylori infection in a major urban community in southern Vietnam. Importantly, this study examines the international context of the H. pylori present in Vietnam in relation to the major H. pylori populations.

H. pylori has undergone localized co-evolution with humans for more than 60,000 years [14]. The distribution of H. pylori populations is strongly associated with human migration, and the populations are named after the geographic regions historically associated with particular human populations [15, 16]. This distribution reflects the epidemiology of the organism, which is exclusively associated with humans and is transmitted very locally, almost vertically. Importantly, the incidence and severity of gastric disease associated with H. pylori infection are linked to particular H. pylori genetic types in particular regions of the world. For instance, in East Asian countries such as Japan and Korea the incidence of gastric cancer is higher than in European and North American countries [17].

The cytotoxin-associated gene pathogenicity island (CagPAI) is one of the major virulence determinants of H. pylori. Several virulence genes in the CagPAI trigger abnormal cellular signals in the host. This abnormal cell signalling is likely to contribute to H. pylori infection-associated disease, including gastric cancer (GC). The cagA gene, present in the CagPAI, is known to be an important virulence factor and plays a key role in pathogenesis. The cagA gene is not present in all H. pylori strains: more than 90% of H. pylori isolates from East Asian countries carry cagA, compared to 50–70% of isolates from Western countries [18, 19]. In addition, studies of H. pylori isolates from East Asia showed that individuals carrying cagA positive strains have an increased risk of peptic ulcer disease (PUD) and/or GC compared to individuals from Western countries carrying cagA positive strains [20,21,22]. Functionally, the protein encoded by cagA activates several signal transduction pathways and binds and disrupts the function of epithelial junctions, leading to aberrations in tight junction function, cell polarity and cell differentiation in the host [23].

The H. pylori vacuolating cytotoxin A, encoded by the vacA gene, is endocytosed by host cells and causes changes including membrane channel formation, resulting in cytochrome c release, which initiates apoptosis and a pro-inflammatory response [24]. Particular allelic variants of vacA and cagA are associated with H. pylori-associated disease sequelae. Allelic types are associated with H. pylori populations and are probably host-specific adaptive changes [25]. The typing scheme used for vacA is based on the middle (m) and signal (s) regions of the gene, with two types defined for each region: m1 and m2, and s1 and s2, respectively. In vitro experiments showed that s1m1 strains induce cell vacuolation more frequently than s1m2 or s2m2 strains, from which it was inferred that s1m1 was more cytotoxic [26].

Vietnam has emerged as the country with the highest age-standardized incidence rate (ASR) of GC (16.3 cases/100,000 for both sexes) in Southeast Asia (GLOBOCAN 2012). Previous studies have also reported the high prevalence of H. pylori infection in Vietnam and its association with peptic ulcer diseases, active gastritis, atrophy, and intestinal metaplasia [27]. As part of this prospective cross-sectional study, we used isolate genome sequencing to investigate the H. pylori population types circulating in symptomatic Vietnamese patients. The genomic relationship between isolates and gene typing for the cagA and vacA genes (derived from the genome sequence of each isolate) provide key baseline information for identifying bacteria-associated risk factors for H. pylori-associated disease in Vietnam and for understanding how these risk factors compare with those in other parts of the world.

Materials and methods

Patient and specimen collection

We conducted a prospective cross-sectional study among patients attending the Gastroenterology Department of Gia Dinh Hospital, Ho Chi Minh City, Vietnam from August 2016 to February 2017. Patients were not selected at random; only patients with symptoms of upper gastrointestinal discomfort, heartburn, or gastric or duodenal ulcer were eligible for enrolment. Candidate patients were informed about the study procedure and written informed consent was obtained for participation. Sociodemographic and clinical information was collected for each patient using a structured questionnaire at the time of clinical presentation. An endoscopic examination was performed by a trained clinician, and two biopsy specimens (one from the gastric antrum and one from the corpus) were collected from each patient using well-washed and disinfected fibre optic endoscopes (model GIF XQ 30; Olympus, Japan). The biopsy specimens were transported to the laboratory in Stuart transport medium at 4 °C.

Isolation of H. pylori

Biopsy samples were vortexed vigorously for 5 min and plated on Brain Heart Infusion (BHI) agar (Oxoid Ltd, Hampshire, United Kingdom) supplemented with 7.5% sheep blood, 0.4% Isovitalex, and H. pylori Dent supplement (Oxoid, United Kingdom). Plates were incubated at 37 °C in an atmosphere of 5% O2, 15% CO2, and 80% N2 for 3 to 7 days. H. pylori colonies were identified based on their typical morphology, characteristic appearance on Gram staining and a positive urease test, and were subsequently confirmed by MALDI-TOF (Bruker, Germany). Isolates were stored at −80 °C in 0.5 ml of BHI broth with 20% glycerol.

Genomic DNA extraction and genome sequencing

Revived isolates were subcultured on selective BHI solid medium containing 7.5% sheep blood and 0.4% Isovitalex under microaerophilic conditions (5% O2, 15% CO2, 80% N2) at 37 °C for 3–5 days [28]. Genomic DNA was prepared from confluent growth using a commercial DNA extraction kit (Qiagen DNA Mini kit, Germany). Genomic libraries were prepared using the Nextera DNA sample preparation kit (Illumina, San Diego, USA). Library sequencing was performed on the Illumina MiSeq instrument using the V3-600 cycle, paired-end kit (Illumina, CA, USA). Readsets for isolates sequenced as part of this study are available at the National Center for Biotechnology Information (NCBI) under BioProject PRJNA689207.

Bacterial genome assembly and annotation

Sequences were analysed using the Nullarbor pipeline. In brief, low-quality bases and adaptor contamination were trimmed off with Trimmomatic [29], and readsets with at least 35× read depth of coverage were retained for analysis. Isolate purity was evaluated with Kraken (v0.10.5) [30]. SPAdes (v.3.9.0) [31] and Prokka (v.1.12) [32] were used for de novo assembly and genome annotation, respectively. We used tRNAScan and RNAmmer to identify tRNA and rRNA in the draft genomes, respectively [33, 34]. The identification of phage-related regions was carried out using the PHASTER tool [35].
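As a rough illustration of the 35× retention rule, depth of coverage can be approximated as the total number of sequenced bases divided by the genome size. The sketch below is not the pipeline's own calculation; the function names, the 1.6 Mb genome size and the read counts are hypothetical values chosen for illustration.

```python
def estimated_depth(read_lengths, genome_size_bp):
    """Approximate depth of coverage: total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return sum(read_lengths) / genome_size_bp


def passes_depth_filter(read_lengths, genome_size_bp=1_600_000, min_depth=35.0):
    """Return True when the readset meets the minimum depth threshold."""
    return estimated_depth(read_lengths, genome_size_bp) >= min_depth


# Hypothetical readsets of 300 bp reads against a ~1.6 Mb genome.
deep_readset = [300] * 250_000     # 75,000,000 bases in total
shallow_readset = [300] * 100_000  # 30,000,000 bases in total
```

Under these assumptions the first readset would be retained and the second discarded.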

Phylogenetic analysis

Forty-two reference H. pylori genome sequences representing selected H. pylori populations were downloaded from the NCBI; details are shown in Additional file 1: Table S1. Reads from the reference strains and the isolates in this study were aligned to the H. pylori strain 26695 (Accession: NC_000915) reference genome sequence using the Burrows-Wheeler Aligner MEM (v 0.7.15-r1140) algorithm [36] as implemented in Snippy; the core genome alignment was used to construct an SNP-based phylogenetic tree using FastTree [37]. SNPs were identified using Freebayes (v1.0.2) under a haploid model, with a minimum depth of coverage of 10× and an allelic frequency of 0.9 required to confidently call an SNP [38]. The phylogenetic tree was visualized using MEGA-X [39].
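The two SNP-calling thresholds (minimum depth of 10× and allelic frequency of 0.9) can be illustrated with a minimal sketch. This is not the Freebayes or Snippy implementation; the `VariantSite` record, its field names and the example sites are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSite:
    position: int   # 1-based position on the reference genome
    ref_base: str
    alt_base: str
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the alternate base


def is_confident_snp(site, min_depth=10, min_allele_fraction=0.9):
    """Apply the depth and allele-fraction thresholds to one candidate site."""
    if site.depth < min_depth:
        return False
    return site.alt_reads / site.depth >= min_allele_fraction


# Hypothetical candidate sites.
sites = [
    VariantSite(1200, "A", "G", depth=40, alt_reads=39),   # kept: fraction 0.975
    VariantSite(5300, "C", "T", depth=8, alt_reads=8),     # dropped: depth below 10
    VariantSite(9100, "G", "A", depth=50, alt_reads=30),   # dropped: fraction 0.60
    VariantSite(15000, "T", "C", depth=10, alt_reads=9),   # kept: exactly at both thresholds
]
confident_sites = [s for s in sites if is_confident_snp(s)]
```

Requiring a high allele fraction is consistent with the haploid model: a true SNP in a pure bacterial isolate should be supported by nearly all reads at that site.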

Core genome and pan-genome analysis

OrthoMCL was used to identify orthologous clusters from the predicted protein sequences of each of the studied isolates (minimum protein length of 50 amino acids, with identity and e-value thresholds of 70% and 0.00001, respectively) [40]. The identified clusters were aligned against the EggNOG database to predict a functional category. Clusters that contained proteins with more than one domain with distinct categories were assigned multiple categories. The functional categories were graphically represented using R. Proteins that could not be classified were assigned to category S (hypothetical). Graphical overviews of categorized strain-specific genes were produced using R.

Identification of virulence-associated genes and cag pathogenicity island

H. pylori virulence genes were obtained from VFDB [41]. Genes were detected using Abricate with a minimum of 80% sequence identity and 90% gene coverage [42]. Virulence gene distribution across isolates was visualised using Phandango. A visual overview of differences in gene content was obtained using Blast Ring Image Generator (BRIG) [43] with isolate genome sequences aligned against the cagPAI of H. pylori strain 26695 (typical hpEurope) or strain F57 (typical hspEastAsia).
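The identity and coverage thresholds used for virulence gene detection amount to a simple filter over candidate hits. The sketch below assumes each hit is a dictionary with hypothetical `gene`, `identity` and `coverage` keys (both as percentages); it does not parse real Abricate output, and the example values are invented.

```python
def filter_hits(hits, min_identity=80.0, min_coverage=90.0):
    """Keep hits that meet both the identity and the coverage thresholds."""
    return [
        hit for hit in hits
        if hit["identity"] >= min_identity and hit["coverage"] >= min_coverage
    ]


# Hypothetical hits for one isolate.
example_hits = [
    {"gene": "cagA", "identity": 95.2, "coverage": 99.0},   # kept
    {"gene": "vacA", "identity": 78.5, "coverage": 100.0},  # dropped: identity too low
    {"gene": "ureA", "identity": 99.1, "coverage": 85.0},   # dropped: coverage too low
    {"gene": "flaA", "identity": 80.0, "coverage": 90.0},   # kept: exactly at thresholds
]
detected_genes = [hit["gene"] for hit in filter_hits(example_hits)]
```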

Statistical analysis

Data analysis was performed using Statistical Package for Social Science (SPSS) software (IBM SPSS Statistics 23, NY, USA). Baseline descriptive statistics were summarized for the variables of interest. Comparisons between groups were performed using either the chi-squared or Fisher’s exact test for categorical variables; t-tests and the Mann–Whitney U-test were used for continuous variables. A two-sided P value of < 0.05 was considered statistically significant.

Ethics statement

The ethical review committee of the National University Ho Chi Minh City, Vietnam approved the study (Approval No: 702/DHQG-KHCN). Written informed consent was mandatory for patient enrolment in the study. For patients < 18 years, written informed consent was obtained from a parent or guardian.

Results

Patient population

One hundred sixty-one patients were enrolled in the study from August 2016 to February 2017. Among the patients, 44.7% (72/161) were male. The median age was 39.4 years (interquartile range (IQR), 32–48 years). Among the patients, 51.6% (83/161) presented with epigastralgia, 31.7% (51/161) with abdominal fullness and 23.0% (37/161) with indigestion. On endoscopic examination, 95.7% of patients had stomach inflammation, including congestion in 74.5% (120/161), erosion in 37.9% (61/161) and oedema in 26.1% (42/161) (Additional file 2: Table S2). Among the patients, 57.1% (92/161) had a primary infection (diagnosed with H. pylori infection for the first time) and 42.9% (69/161) had a secondary infection (i.e. had a previous history of H. pylori infection). There was no difference in age, sex, smoking, alcohol consumption, clinical symptoms or endoscopic findings between primary and secondary infection, although the number of symptoms was higher in patients with secondary infection. Among the 161 biopsy samples diagnosed as positive for H. pylori, 156 tested positive by rapid urease test and five by H. pylori antigen test. Initially, H. pylori was cultured from 59% (95/161) of patients, although only 87.4% (83/95) of these isolates could be revived and analysed.

Genome characteristics

Summaries of the read data set and draft genome for each of the 83 H. pylori isolates are presented in Table 1. The read depth of coverage in the isolate read sets ranged from 38 to 456×. The draft genome sequences comprised between 16 and 83 contigs. Overall, the average genome size was 1.6 Mb with 38.94% G + C content. For each isolate, the annotated genome sequence comprised between 1,451 and 1,589 protein coding regions (CDS), with ~ 92% of the genome used for protein coding.
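The G + C content statistic reported above is the percentage of unambiguous bases in an assembly that are G or C. A minimal sketch follows, assuming the assembly is supplied as a list of contig strings; the function name and the toy contigs are hypothetical, and ambiguous bases such as N are excluded from the denominator.

```python
def gc_content(contigs):
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of an assembly."""
    gc_count = 0
    total_count = 0
    for contig in contigs:
        for base in contig.upper():
            if base in "GC":
                gc_count += 1
                total_count += 1
            elif base in "AT":
                total_count += 1
            # Any other symbol (for example N) is ignored.
    if total_count == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc_count / total_count


# Toy assembly with two short contigs (not real H. pylori sequence).
toy_assembly = ["ATGCATGCAT", "ATGCNNAT"]
toy_gc_percent = gc_content(toy_assembly)
```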

Table 1 Genome statistics of the whole-genome sequences of the 83 H. pylori isolates in this study

A single, incomplete phage-associated region (8.1–13.5 kb) was detected in 17% (14/83) of the draft genome sequences. The phage sequences consisted of between nine and 14 CDSs that encode either a putative restriction-modification protein, TMP kinase, PcrA helicase, a putative transposase, or other hypothetical proteins in addition to phage-related genes (Additional file 3: Table S3).

Core and pan-genome analysis

The core- and pan-genome analysis by OrthoMCL identified 1,194 orthologous clusters (core genome) from the 119,366 annotated proteins in the 83 isolates. Of these, 1,070 orthologous clusters were assigned functional categories using the EggNOG database (Fig. 1a). A high proportion of the classified clusters belonged to functional categories J (translation, ribosomal structure, and biogenesis; 12.7%, 136/1,070) and M (cell membrane/envelope biogenesis; 7.8%, 83/1,070). Proteins with no orthologues were detected in a small number of isolates: 26% (31/83) of isolates contained either one or two proteins of this type. Most of these unique proteins belonged to functional categories V (defence mechanism) or S (hypothetical) (Fig. 1b).
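The distinction between core clusters (present in every isolate) and isolate-specific proteins (present in only one isolate) can be sketched from a cluster membership table. The code below does not reproduce OrthoMCL clustering; it assumes clustering has already been done, and the cluster and isolate identifiers are hypothetical.

```python
def classify_clusters(cluster_members, all_isolates):
    """Split clusters into core, accessory and isolate-specific groups.

    cluster_members maps a cluster identifier to the set of isolates
    that contribute at least one protein to that cluster.
    """
    isolates = set(all_isolates)
    result = {"core": [], "accessory": [], "unique": []}
    for cluster_id, members in cluster_members.items():
        present = set(members) & isolates
        if not present:
            continue  # cluster has no member among the isolates of interest
        if present == isolates:
            result["core"].append(cluster_id)
        elif len(present) == 1:
            result["unique"].append(cluster_id)
        else:
            result["accessory"].append(cluster_id)
    return result


# Hypothetical three-isolate example.
toy_isolates = ["VN01", "VN02", "VN03"]
toy_clusters = {
    "cluster_1": {"VN01", "VN02", "VN03"},  # in every isolate: core
    "cluster_2": {"VN01", "VN02"},          # in some isolates: accessory
    "cluster_3": {"VN03"},                  # in one isolate: isolate-specific
}
toy_classification = classify_clusters(toy_clusters, toy_isolates)
```

The pan-genome in this scheme is simply the union of all three groups.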

Fig. 1

A Functional classification of 1,194 core orthologous clusters produced from the set of predicted proteins encoded on the genome sequence of each of the 83 H. pylori study isolates using OrthoMCL. B Functional classification of the 28 isolate-specific genes identified as part of the comparison of the protein coding capacity of the 83 study isolates. The X-axis shows the number of genes in each functional class; the functional classes are shown on the Y-axis

Phylogenetic analysis

The genomic relationship between the 83 study isolates and 42 reference genome sequences of known H. pylori population type was inferred from the core genome using H. pylori strain 26695 (Accession: NC_000915) as the reference genome sequence for read mapping. The tree shown in Fig. 2 provides a visual summary of the relationship between isolates. Based on the core genome relationship with the 42 classified isolates, 80% (66/83) of the isolates were part of the H. pylori hspEastAsia population and the remaining 20% (17/83) were part of the H. pylori hpEurope population (Fig. 2).

Fig. 2

A tree showing the core genome relationship between the 83 Vietnamese H. pylori isolates and 42 H. pylori reference genomes. The 83 Vietnamese isolates are indicated by black terminal branches, while classified isolates are shown with coloured terminal branches as follows: hspEastAsia (blue), hpEurope (brown), hspWAfrica (pink), hpNEAfrica (purple), hspAmerind (orange) and hpAsia2 (green). The tree was inferred using the core genome comparison method as implemented in Nullarbor with H. pylori strain 26695 (Accession: NC_000915) used as the reference genome sequence for read mapping. The tree was modified using tools available in FigTree and MEGA-X

Virulence factors

Virulence factor detection using the VFDB showed that 80% (66/83) of Vietnamese isolates harboured between 110 and 113 virulence genes, including all CagPAI genes and the vacA virulence gene, whereas 20% (17/83) of isolates contained 83 to 92 virulence genes. The second group of isolates usually lacked the cag1 to cag3 and cagA to cagZ genes of the CagPAI. Genes encoding urease enzymes, most of the flagella-associated proteins, some endotoxins, and most of the Lewis antigens such as FutB, FutC and NeuA/FlmD were detected in all isolates (Fig. 3).

Fig. 3

The tree at the left shows the core genome relationship between the 83 Vietnamese isolates. The virulence gene content of each isolate is colour coded at the right. The virulence genes shown are those present in the VFDB and were detected using Abricate. Green shows genes detected with less than 90% gene coverage, orange shows genes detected with greater than 90% gene coverage, and purple shows that the gene was not detected

The virulence properties of the isolates are presented in Table 2. A complete CagPAI was present in 80% (66/83) of the genomes; of these CagPAI positive isolates, 97% (64/66) belonged to the hspEastAsia population and the remaining 3% (2/66) belonged to the hpEurope population (Table 2). Among the 17 hpEurope isolates, 15 were CagPAI negative. Most of the CagPAI positive hspEastAsia and hpEurope isolates lacked an orthologue of the DNA helicase (HP0548) present in the Western-type CagPAI sequence found in H. pylori strain 26695 (Fig. 4).
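The counts above imply a 2 × 2 table of population type against CagPAI status: 64 positive and 2 negative among the 66 hspEastAsia isolates, and 2 positive and 15 negative among the 17 hpEurope isolates. As an illustration only (this specific comparison is not reported in the study), the sketch below applies a two-sided Fisher's exact test, one of the tests named in the statistical methods, using only the Python standard library. The function name is hypothetical and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = table_probability(a)
    # Small relative tolerance so ties with the observed table are included.
    p_value = sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Rows: hspEastAsia, hpEurope; columns: CagPAI positive, CagPAI negative.
p_cagpai = fisher_exact_two_sided(64, 2, 2, 15)
```

The resulting P value is far below the conventional 0.05 threshold, consistent with the strong association between population type and CagPAI carriage described in the text.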

Table 2 H. pylori virulence factors (cagA and vacA) in study isolates
Fig. 4

Comparison of the genetic organization of the cagPAI of Vietnamese H. pylori isolates with a Western-type cagPAI (H. pylori strain 26695). The innermost blue ring shows the strain 26695 sequence, with the hpEurope classified Vietnamese isolates shown as yellow rings and the hspEastAsia Vietnamese isolates shown as pink rings. The figure was constructed using BRIG

Sequence analyses of the second repeat region of the cagA gene revealed that 95% (63/66) of cagA positive isolates, including two hpEurope isolates, were of the ABD type, while the remaining three isolates (all hspEastAsia) were EPIYA-ABC or EPIYA-ABCC types (Table 2). An ABD type second repeat region is an atypical characteristic of hpEurope strains. We also found that 5% (3/63) of isolates carrying an East Asian type cagA contained an EPIYA-like sequence, ESIYA, at the EPIYA-B segment. Three vacA types were detected among the Vietnamese isolates: 34 isolates were s1m1, 48 were s1m2 and one was s2m2. The most frequent genotypes among the cagA positive isolates were vacA s1m1/cagA+ and vacA s1m2/cagA+, which accounted for 51.5% (34/66) and 48.5% (32/66) of isolates, respectively.
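Locating EPIYA and EPIYA-like motifs in a translated cagA sequence is a simple pattern search. The sketch below only finds the canonical EPIYA motif and the ESIYA variant mentioned above; it does not assign segment types (A, B, C or D), which depends on the sequence flanking each motif. The function name and the toy protein sequence are hypothetical.

```python
import re

# Matches the canonical EPIYA motif and the ESIYA variant.
EPIYA_LIKE = re.compile(r"E[PS]IYA")


def find_epiya_motifs(protein_sequence):
    """Return (0-based start position, motif) for each EPIYA-like motif found."""
    return [
        (match.start(), match.group())
        for match in EPIYA_LIKE.finditer(protein_sequence.upper())
    ]


# Toy sequence for illustration; not a real CagA fragment.
toy_protein = "MKTEPIYAQVNKKESIYATIDDLGGEPIYAGS"
toy_motifs = find_epiya_motifs(toy_protein)
```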

Discussion

H. pylori infection is associated with the development of gastric disease in the host; the frequency of infection and the frequency of disease vary across the world, but disease is associated with particular H. pylori genetic types in particular geographic regions. Developing effective strategies to manage H. pylori-associated disease relies on understanding the local H. pylori populations. This, in conjunction with the significant H. pylori-associated disease burden in Vietnam, highlights the important knowledge gap addressed by this study. Herein, we present genomic and epidemiological data for 83 Vietnamese H. pylori isolates. H. pylori was isolated from 59% (95/161) of the biopsies from symptomatic patients. This is similar to the result of an earlier study of 270 randomly selected patients who underwent esophagogastroduodenoscopy at the endoscopy centres of two major hospitals in Hanoi and Ho Chi Minh City (the largest cities in Northern and Southern Vietnam, respectively) [27]. Our phylogenetic data show that most H. pylori isolates from symptomatic Vietnamese patients are from the hspEastAsia population (80% of isolates). The dominance of the hspEastAsia population is consistent with H. pylori populations being strongly associated with human migration [16], as historical and migration evidence suggests that the Vietnamese are more closely related to people from North Asia than to people from South Asia [44]. Moreover, migratory patterns with North Asia would have been influenced by the fact that Vietnam was under Chinese occupation for over a thousand years. Notably, a group of the Vietnamese isolates forms an exclusive clade within the hspEastAsia population, perhaps indicating that the Vietnamese were isolated from other South East Asian populations for an extended period; this may be supported by a study by Breurec et al. showing Khmer and Vietnamese isolates as deep-branching members of the hspEastAsia H. pylori population [45]. More extensive sampling of H. pylori in the region would be required to confirm an H. pylori subpopulation for Vietnam. The Vietnamese H. pylori isolates that are part of the hpEurope population are likely to have arisen through the French colonial occupation of Vietnam and other parts of South East Asia during the 19th and early 20th centuries. We observed a small number of isolates that appear to be related to the representative isolates from the hpNEAfrica or hspWAfrica populations used in our comparative analysis (Fig. 2). Alternatively, these isolates may be recombinant hybrids arising from the endemic hspEastAsia and hpEurope population strains now present in Vietnam [45].

The prevalence of H. pylori infection has been reported at between 50 and 80% in several studies conducted in adults in Vietnam; this is similar to Japan, Korea or China, and other South Asian nations [9,10,11, 46, 47]. The genetic characteristics and diversity of Vietnamese H. pylori strains could be a factor contributing to the high incidence of gastric cancer in Vietnam. Evidence indicates that the isoforms of vacA and the type and number of the EPIYA motifs in the cagA gene strongly influence the type and magnitude of the histological damage to the gastric mucosa. For example, the vacA s1m1 genotype has been associated with intestinal metaplasia, severe inflammation and a high risk of gastric cancer [20, 48, 49]. In this study, the vacA s1m1 allelic combination was detected in 41% of isolates. In addition, East Asian cagA, which is more prevalent in Vietnamese isolates, is more frequently associated with disease than Western cagA [20, 50, 51]. This study revealed a lower frequency of cagA than previous reports on Vietnamese H. pylori [52,53,54,55], which may contribute to the lower rates of gastric ulcer and gastric cancer observed in Vietnam. In dyspeptic patients from central Vietnam, the frequency of cagA+ strains was 84% [54]. Among H. pylori strains from patients with gastric cancer and peptic ulcer in Southern Vietnam, all strains were cagA positive [52]. In this study, cagA was found frequently with the vacA s1m1 allelic type (51.5%, 34/66), which is consistent with previous reports on isolates from South or North Vietnam [27, 55]. The most frequent EPIYA motif found in our isolates was ABD (95%; 63/66), which is similar to previous reports from Vietnamese patients with gastric disease [52, 55]. However, these frequencies were different in central Vietnam isolates, where the vacA s1m1/cagA+ genotype was detected in 64.86% (48/74) of isolates and the cagA ABD motif was found in a lower proportion (91%) [54].

We observed that 88.2% (15/17) of hpEurope isolates were cagA negative, possibly having lost cagA during the course of evolution; where cagA was present, it carried the ABD type EPIYA motif. The presence of the ABD type EPIYA motif pattern is an atypical characteristic of hpEurope strains, in which the ABC type EPIYA motif is more prevalent. The gene content and organization of the cagPAI are highly conserved. The phylogeny of most cagPAI genes, including cagA, was found to be similar to that of housekeeping genes, indicating that the cagPAI was probably acquired only once by H. pylori [56]. Recombination events during mixed infection have been identified as a major driving force behind allelic diversity in H. pylori. The diversity of the cagPAI largely reflects that of H. pylori’s housekeeping genes, being under diversifying or positive selection due to host polymorphisms, which could even result in modified host protein interactions [56]. Accordingly, hpEurope and hspEastAsia strains are expected to carry a Western and an East Asian cagA, respectively. A prominent example of amino acid diversity noted previously is the EPIYA motifs in the C-terminal half of cagA, which differ between Asian populations (hpAsia2; hspEastAsia) (type D) and all other populations [57]. The D type EPIYA repeat binds SHP-2 phosphatase more avidly than other types [22]. Furthermore, Furuta et al. clarified the recombination-mediated routes of cagA evolution and provided a solid basis for a deeper understanding of its function in pathogenesis [58]. Based on this observation, the predominant host may be applying a selective pressure on Vietnamese hpEurope strains for the ABD type cagA that is normally observed in hspEastAsia lineage strains.

Conclusions

Our study confirmed the high prevalence of H. pylori infection and of the most virulent genotype combination, vacA s1m1/cagA+, in H. pylori isolates recovered from symptomatic Vietnamese patients, which may explain the higher incidence rate of gastric cancer in Vietnam. Our data on the genetic architecture of H. pylori strains isolated from symptomatic Vietnamese patients showed two predominant lineages, with the majority of isolates belonging to the hspEastAsia population. However, another group of Vietnamese isolates is part of the hpEurope population. Interestingly, the hpEurope population isolates are divided into two subclusters. Although phylogeny has been improved by increasing the number of genes analyzed, analyses of a limited number of genes cannot uncover more complex evolutionary events. A further limitation of our study is that almost all enrolled patients were in the early stage of gastric disease, so we could not explore the relationship between H. pylori genotypes and disease outcomes.