Background & Summary

Listeria monocytogenes (Lm) is a facultative intracellular pathogen responsible for listeriosis, a serious disease affecting both humans and animals. Lm is a ubiquitous bacterium found in various ecological niches, including natural and farm environments1,2. In particular, soil is a primary ecological niche of Lm and may thus be important in its transmission from the natural/farm environment to food and the food-processing environment (FPE)1,2. Farm animals, in particular ruminants, are an additional important reservoir for Lm and contribute to contamination of the farm environment through fecal shedding3,4. In addition, Lm can persist for a long time in soil and the farm environment. A growing body of data is also available on the prevalence of Lm in wildlife, showing that various animal species (e.g., deer, wild boars, bears, foxes, monkeys, rodents, hedgehogs, snails, slugs and birds) can act as vehicles for this pathogen5,6,7,8,9,10,11. These findings point to an ecological role of wildlife as a reservoir of Lm and to its potential importance in the Lm infection cycle.

Lm is a genetically heterogeneous species divided into four phylogenetic lineages, of which lineages I and II are the most frequently encountered. Multilocus sequence typing (MLST) classifies Lm into clonal complexes (CCs) and sequence types (STs), which are systematically used to describe its population structure12,13,14. Certain epidemiological clones account for the majority of outbreaks and sporadic cases in humans15 and animals16, worldwide13,17. The CCs that are commonly found in food and FPE, such as the most common CC9 and CC121, but also CC1, CC2, CC4, CC5, CC6, CC8 and CC3718, pose a serious challenge in the food industry15,18,19. Moreover, they can persist in FPE for several years20,21,22,23,24. Remarkably, CC9 and CC121 are rarely reported in animals or natural/farm environments18,25.

In order to improve surveillance and the management of health risks associated with Lm, a deeper understanding of the genetic make-up of strains adapted to food and FPE is required. As part of the Horizon 2020 “One Health” European Joint Programme, the 3-year research project “LISTADAPT” (Adaptive traits of Listeria monocytogenes to its diverse ecological niches - https://onehealthejp.eu/jrp-listadapt/) aimed to identify the genetic mutations and mobile genetic elements underlying the adaptation of Lm to different ecological niches. With this objective in mind, strains were collected from (i) the farm environment and animals and (ii) the natural environment and wild animals, to study their genetic make-up and to compare it with that of strains isolated from food products and FPE. This work was made possible by the LISTADAPT consortium, which included (i) seven national reference laboratories (NRLs) for surveillance of Lm in food, animals and the environment (AT, CZ, DK, FR, IT, NO and SE) and (ii) three research laboratories at INRAE (the French National Research Institute for Agriculture, Food and Environment). Of the seven NRLs, two are also national public health laboratories (AT and CZ) in charge of the surveillance of clinical strains isolated in outbreaks and sporadic cases. In addition, 14 institutes from 12 countries participated as external partners providing isolates.

In this data descriptor, we present a dataset of 1484 high-quality draft genomes originating from Lm strains isolated in 19 European countries within the framework of the LISTADAPT project. The dataset covers a wide genetic diversity of Lm, since it includes about 79 different CCs and singleton STs, including the most prevalent CCs in Europe15 and worldwide13,17. The strains were collected along a continuum from the natural environment (wild animals and natural environment) and primary production (farm environment and farm animals with or without listeriosis symptoms) through to FPE and food products.

The dataset supports a better understanding of Lm transmission routes from the farm/natural environment to food and FPE, and of Lm ecology more generally. It may also help to assess the importance of animal and food strains for human infection. Moreover, it can be used by the scientific community (i) to improve our understanding of the Lm population structure and evolutionary history, (ii) to facilitate the detection of emerging Lm clones and (iii) to identify genetic traits related to the adaptation of Lm to particular ecological niches (ecophysiology). Such genetic traits could be used in the development of molecular assays for screening of food/FPE, animal and soil reservoirs.

Methods

Construction of the LISTADAPT dataset (n = 1484)

In order to build a dataset of Lm draft genomes suitable for investigating the adaptive traits of Lm to diverse ecological niches, we gathered a curated collection of Lm draft genomes. Strains isolated over the period 2010–2020 were preferred, regardless of their origin of isolation. We considered two geographic levels: (i) the 27 EU countries plus Norway and Switzerland, heterogeneous in size, population, climate, ecology and economic activities, and (ii) four European regions defined on the basis of country borders and roughly equal in surface area, without consideration for other criteria (South-West, Central-South, Eastern and Northern). We included strains that were distributed evenly among these four European regions. The strains were gathered from already available strain collections and extensive sampling campaigns (Fig. 1). The LISTADAPT dataset was divided into two main ecological compartments: (i) the C1 compartment, which included strains from animals and the natural/farm environment (n = 756), and (ii) the C2 compartment, which included strains from food (n = 728) (Table 1).

Fig. 1

Distribution of the LISTADAPT collection of Listeria monocytogenes strains (n = 1484) by time, geographic region and origin of isolation. (a) and (b) show the distribution of food strains by geographic region and food type, respectively. (c) and (d) show the distribution of environmental strains by geographic region and subcompartment, respectively.

Table 1 Distribution of strains from the whole LISTADAPT collection (n = 1484) by compartment and sub-compartment.

Strains selected from the initial collection of the LISTADAPT consortium

At the beginning of the LISTADAPT project, the consortium had access to a collection of about 8000 food and animal Lm strains obtained from collaborative projects or national surveillance. Most of these strains were isolated from food, whereas the remainder were isolated from animals (C1 compartment: animal and environmental strains) with a substantial under-representation of certain animal species. This compartment mainly included strains isolated from animals showing listeriosis-related symptoms. Few strains were available from asymptomatic animals, soil and the agricultural environment, originating from three European countries (France, Italy and the Czech Republic).

Animal and environmental strains included in the collection during the LISTADAPT project (n = 756)

We collected isolates from animals showing listeriosis-associated symptoms, asymptomatic animals, soil and the environment, in a large number of countries across Europe. These strains were isolated between 1978 and 2019. Regarding environmental niches, the consortium selected strains from continental environments remote from cities, large rivers, estuaries and the marine environment, to avoid selecting human or food strains released into the environment. Detailed strain information is provided in Figshare File 126. However, the six strains described by Szymczak et al.27 (Table 2) were isolated from parks on city outskirts in Poland, distant from the city center. Similarly, the 47 strains from birds (mainly seagulls) (Hellström et al.)10 were isolated from localities on the outskirts of Helsinki, Finland (Table 2).

Table 2 List of 301 animal and environmental Listeria monocytogenes strains from published microbial collections.

Strains obtained from existing microbial collections (n = 648)

To increase the size and representativeness of the Lm genome dataset the LISTADAPT consortium performed an extensive review of all recent collections of published and unpublished Lm strains and then contacted researchers in charge of these collections. Finally, 14 external partners, food and veterinary laboratories and research institutes, all dealing with Lm hazards in Europe, collaborated with the LISTADAPT consortium (Tables 2 and 3).

Table 3 List of 347 animal and environment Listeria monocytogenes strains from non-published collections.

The initial collection included more strains from animals with listeriosis-associated clinical symptoms than without symptoms. In order to reduce the number of strains originating from animals with listeriosis while maintaining maximum diversity of the dataset, we adopted an original method to select the strains based on metadata (e.g., type of sample, animal species, geographic sampling location, time of isolation and molecular typing data such as PFGE profiles). This method relies on Gower’s coefficient (GC), which is a dissimilarity measure: the “distance” between two units is the sum of all the variable-specific distances (associated with metadata categories). The GC metric allows numeric and categorical data to be combined and weights to be applied to each variable, effectively altering the importance of each metadata category (e.g., geographical region as a more important category than year of isolation). The three steps are: (i) calculating the dissimilarity matrix based on Gower’s distance, (ii) clustering the dissimilarity matrix with hierarchical clustering (agglomerative, bottom-up approach) and (iii) assessing clusters with the “Silhouette” method. The silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters. An R script available at https://github.com/lguillier/LISTADAPT/tree/master/metadata2assocation was used to perform the selection of strains based on this method. This script takes as input a Comma Separated Values (CSV) file that includes strain ID and metadata information, then outputs a CSV file of selected strains.
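The three steps can be illustrated with a minimal Python sketch. It is not the R script used in the project: the strain records, field names and weights below are hypothetical, the dissimilarity is normalised as a weighted mean (the usual form of Gower's coefficient), average linkage is an assumed choice, and keeping the first strain of each cluster is an assumed selection rule.

```python
from itertools import combinations

# Hypothetical metadata records; fields and values are illustrative only.
STRAINS = {
    "S1": {"region": "Northern", "species": "cattle", "year": 2012},
    "S2": {"region": "Northern", "species": "cattle", "year": 2013},
    "S3": {"region": "Eastern", "species": "sheep", "year": 2012},
    "S4": {"region": "Eastern", "species": "sheep", "year": 2018},
}
# Assumed weights: region counts twice as much as the other variables.
WEIGHTS = {"region": 2.0, "species": 1.0, "year": 1.0}


def gower_matrix(strains, weights):
    """Step (i): weighted Gower dissimilarity for every pair of records."""
    records = list(strains.values())
    ranges = {}
    for key in weights:
        values = [r[key] for r in records]
        if all(isinstance(v, (int, float)) for v in values):
            ranges[key] = max(values) - min(values)
    dist = {}
    for i, j in combinations(sorted(strains), 2):
        total = 0.0
        for key, w in weights.items():
            a, b = strains[i][key], strains[j][key]
            if key in ranges:  # numeric: range-normalised absolute difference
                d = abs(a - b) / ranges[key] if ranges[key] else 0.0
            else:  # categorical: 0 if identical, 1 otherwise
                d = 0.0 if a == b else 1.0
            total += w * d
        dist[i, j] = dist[j, i] = total / sum(weights.values())
    return dist


def agglomerate(ids, dist, k):
    """Step (ii): bottom-up clustering (average linkage) down to k clusters."""
    clusters = [[i] for i in ids]

    def linkage(c1, c2):
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] += clusters.pop(y)  # x < y, so index x stays valid
    return clusters


def mean_silhouette(clusters, dist):
    """Step (iii): mean silhouette width; singletons score 0. Needs >= 2 clusters."""
    scores = []
    for c in clusters:
        for p in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = sum(dist[p, q] for q in c if q != p) / (len(c) - 1)
            b = min(sum(dist[p, q] for q in o) / len(o)
                    for o in clusters if o is not c)
            scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)


dist = gower_matrix(STRAINS, WEIGHTS)
ids = sorted(STRAINS)
scores = {k: mean_silhouette(agglomerate(ids, dist, k), dist)
          for k in range(2, len(ids))}
best_k = max(scores, key=scores.get)
clusters = agglomerate(ids, dist, best_k)
selected = [c[0] for c in clusters]  # assumed rule: one strain per cluster
print(best_k, clusters, selected)
```

Raising the weight of a variable pulls strains that share its value into the same cluster, which is how a category such as geographic region can be made to dominate the selection.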

In the present study, we constructed a large dataset comprising 301 animal and environmental Lm strains from six European countries and published collections (Table 2), as well as 347 animal and environmental Lm strains from 12 European countries that were obtained from non-published collections (Table 3).

Strains collected from sampling campaigns (n = 108)

Soil, farm, and wild animal samples were collected in nine European countries (Table 4). For the collection of soil samples, the LISTADAPT project members raised awareness and organised crowd-sampling campaigns. All the soil samples were collected from agricultural or wild areas according to a common procedure provided to the samplers based on the existing recommendations reported in the literature2,28,29,30. The integration of feedback from samplers enabled a continuous improvement of the sampling protocol. The sampling campaigns were conducted in 17 areas in seven EU member states, Norway and Switzerland (Figs. 1 and 2, Table 4), namely AT, CH, CZ, FR, IT, NO, SE, SI and SK, resulting in the isolation of 58 Lm strains. Out of the 1752 available sampling records, the overall prevalence was 3%. We confirm in the present study the low prevalence of Lm in soil reported in the literature (below 1% and up to 6% depending on soil type)2,29. Soil strains from AT, FR, SI and SE were isolated by employing a two-step specific enrichment: the first enrichment was performed with modified Listeria Enrichment Broth for 24 h at 30 °C, followed by enrichment in University of Vermont Medium (UVM) enrichment broth for 48 h at 30 °C. Detection of Listeria spp. and Lm was then achieved by specific SYBR Green real-time PCR targeting prs2 and inlA genes, respectively. The samples positive for the presence of Listeria spp. and/or Lm were spread on RAPID’L.Mono agar plates (BioRAD, France). After 24 h incubation at 37 °C, colonies characteristic of Lm and other Listeria species were picked, purified and stored at –80 °C in Tryptone Soya Broth supplemented with 25% (v/v) glycerol. Strains from CH, CZ, IT and SK were isolated with the EN ISO 11290-1:2017 protocol (Horizontal method for the detection and enumeration of Lm and of Listeria spp.).

Table 4 List of 108 animal and environment strains from sampling campaigns.
Fig. 2

Microreact screenshot representing the distribution of the whole LISTADAPT dataset (n = 1484) by geographic region (a) and time (b). The k-mer-based phylogenomic clustering of the complete dataset is shown in (c). Interactive access to strain metadata and MLST types is available through Microreact44, a recently developed online tool for visualizing and sharing spatio-temporal and genetic distributions of strains (Fig. 2, accession link: https://microreact.org/project/8YtGBqEqhosJtysXTVY79M-figure-2-distribution-of-the-whole-listadapt-dataset-n1484-by-geographic-region-time-and-genetic-diversity). The interactive map of the dataset was generated using the exact, regional or national GPS coordinates, according to the level of detail available for each strain. An annual timescale was used. The core genome MLST (Moura et al.) tree was generated from the draft genome assemblies using pairwise categorical differences and the single linkage method in BioNumerics. The tree revealed three main clades corresponding to Lm phylogenetic lineages. Each clade included several clusters corresponding to MLST types (CC and singleton ST). Circles in shades of blue show food product isolates (light blue: fish products, green-blue: dairy products, blue: composite dishes, deep blue: meat products). Circles in shades of orange show animal and environment isolates (beige: soil & farm environment, golden: wild animals, deep orange: farm animals). Circle size is proportional to the number of strains included.

Regarding the farm and wild animal subcompartments, 50 Lm strains were isolated from sampling campaigns. Three campaigns targeting shelled gastropods sampled in IT, SK and CH resulted in the isolation of six strains (Figs. 1 and 2, Table 4). Sampling campaigns were also carried out for wild deer and reindeer feces in Southern Norway, and for cattle, roe deer, wild boar, wolf, bear and fox feces in the Abruzzo and Molise regions of Italy (Fig. 1, Table 4). Of the 2577 samples collected from vertebrates during the campaigns conducted in IT and NO, 41 yielded Lm isolates, corresponding to an overall prevalence of 1.6%.
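As a quick check of the prevalence figures quoted for the soil and vertebrate campaigns, the short sketch below recomputes them from the reported counts. It assumes that each isolate corresponds to one positive sampling record; the function name is illustrative.

```python
def prevalence(positives: int, samples: int) -> float:
    """Return prevalence as a percentage of the sampled units."""
    if samples <= 0:
        raise ValueError("samples must be a positive integer")
    return 100.0 * positives / samples


# Counts taken from the text; one isolate per positive record is assumed.
soil = prevalence(58, 1752)         # soil sampling campaigns
vertebrates = prevalence(41, 2577)  # vertebrate campaigns in IT and NO

print(f"soil: {soil:.1f}%  vertebrates: {vertebrates:.1f}%")
```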

Food strains included in the collection during the LISTADAPT project (n = 728)

The food strains (C2 compartment) were classified according to the five main categories of risk food matrices for Lm defined by the European Food Safety Authority (EFSA)31: dairy products (n = 119), fish and fishery products (n = 165), meat products (n = 246), vegetables and fruits (n = 95), and composite dishes (food products combining several food categories) (n = 103). Six NRL project partners (AGES, ANSES, DTU, IZSAM, SLV and VRI) were instructed to target a maximum of 30 strains per food category from their strain collections, preferring strains isolated in the last 10 years. This time period was extended for the under-represented categories (vegetables and fruits); the final dataset included strains originating from the 2002–2020 period. We excluded raw materials from the selection based on the assumption that they could be contaminated by strains originating from farms or animals. The 728 strains from the C2 compartment were isolated along the food chain, from food processing plants to food retail, in several EU countries (Table 1). Detailed strain information is provided in Figshare File 126.

Complete LISTADAPT dataset (n = 1484)

The final LISTADAPT strain dataset that we constructed in collaboration with external partners was balanced with regard to the two main compartments: C1 (animals/environment, n = 756) and C2 (food/FPE, n = 728) (Table 1). The geographic distribution covered 19 of the 27 EU countries plus Norway and Switzerland (Figs. 1 and 2).
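The compartment totals can be cross-checked against the sub-totals reported in the preceding sections. The sketch below simply re-adds the published counts; the dictionary keys are illustrative labels, not identifiers used in the dataset.

```python
# C1: animal and environmental strains, by origin (counts from the text).
c1_sources = {
    "published_collections": 301,
    "non_published_collections": 347,
    "sampling_campaigns": 108,
}

# C2: food strains, by EFSA food category (counts from the text).
c2_categories = {
    "dairy_products": 119,
    "fish_and_fishery_products": 165,
    "meat_products": 246,
    "vegetables_and_fruits": 95,
    "composite_dishes": 103,
}

c1_total = sum(c1_sources.values())
c2_total = sum(c2_categories.values())
dataset_total = c1_total + c2_total

print(c1_total, c2_total, dataset_total)
```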

Although the C1 compartment (n = 756) covered a 41-year period (1978–2019), most of the strains (75%) were isolated from 2010 onwards. This panel covered all successive years between 2009 and 2019 in at least three European regions (Fig. 1c). Between 2008 and 2019, except for the year 2013, the C1 compartment covered all successive years in the following three categories of subcompartments: farm animals, wild animals and natural/farm environment (Fig. 1d).

Although the C2 compartment (n = 728) covered an 18-year period (2002–2020), most of the strains (78%) were recent, i.e. having been isolated between 2013 and 2019 (Fig. 1b). This panel covered all successive years between 2013 and 2019, as well as the five major categories in at least three European regions (Fig. 2b).

Strains from C1 compartment were isolated from more countries (n = 18) than strains from C2 compartment (n = 6). Finally, the majority (1135 of 1484 strains, 76%) of strains from both compartments originated from the period 2011–2019 (Fig. 1a,c).

Overall, the 1484 strains clustered into 137 MLST STs, which belonged to 54 CCs and 25 singleton STs (Fig. 3). For 22 strains, the allele profile was unknown (novel ST) or incomplete (when six of the seven MLST alleles were present, a CC was assigned where possible).

Fig. 3

Distribution of the LISTADAPT dataset of Listeria monocytogenes genomes by multilocus sequence typing clonal complex (CC) or singleton sequence type (ST).

Standard strain nomenclature

In order to facilitate data sharing between partners, we adopted a standard nomenclature for strain identification (ID). This nomenclature was used as a metadata codification to allow for fast identification of the geographic region and detailed isolation source of the strains (e.g., wild animal, food product or farm environment). In more detail, the LISTADAPT code has between 10 and 15 characters; the first two letters (level 1) correspond to the country code (ISO 3166-1 alpha-2 code), which is followed by a code detailing the origin of the strain and the sample type (levels 2 to 4, depending on the nature of the source). Briefly, level 2 details the type of sample (e.g., animal species, environment and food categories) and level 3 details the nature of the sample (e.g., type of animal sample, type of food and nature of environmental sample). Level 4 gives additional information about the sample (e.g., type of preparation for food and health status of the animals). The code ends with a sequential number for each country, generated when the strain was added to the collection. For example, the strain DE-RDE-CP-13 was isolated in Germany (DE) from a roe deer (RDE) as a clinical strain (CP) and was the 13th strain from Germany included in the dataset. Supplementary Table S2 provides a detailed overview of the LISTADAPT codification.
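A code of this form can be split into its levels programmatically. The following is a minimal sketch that assumes the fields are hyphen-separated, as in the DE-RDE-CP-13 example; the class and function names are hypothetical, and the meaning of each level code still has to be looked up in Supplementary Table S2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrainCode:
    country: str   # level 1: ISO 3166-1 alpha-2 country code
    levels: tuple  # levels 2 to 4: origin and sample-type codes
    number: int    # sequential number within the country


def parse_strain_code(code: str) -> StrainCode:
    """Split a hyphen-separated LISTADAPT-style strain code into its parts."""
    parts = code.strip().split("-")
    # Country + one to three descriptive levels + sequential number.
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"unexpected number of fields in {code!r}")
    country, *levels, number = parts
    if len(country) != 2 or not country.isalpha() or not country.isupper():
        raise ValueError(f"invalid country code in {code!r}")
    if not number.isdigit():
        raise ValueError(f"invalid sequential number in {code!r}")
    return StrainCode(country, tuple(levels), int(number))


example = parse_strain_code("DE-RDE-CP-13")
print(example)
```

Keeping the descriptive levels as an ordered tuple avoids guessing which of levels 2 to 4 are present for a given source.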

Whole Genome Sequencing and genomes data analysis

The next-generation sequencing (NGS) paired-end reads (2 × 150 bp) were generated during the project on Illumina platforms. Sequencing was mainly performed by four LISTADAPT partners (AGES, IZSAM, ANSES and DTU). Figshare File 126 lists the sequencing technology and the center that performed the library preparation and produced the sequences.

The genomes were all de novo assembled and annotated with a harmonized in-house workflow named ARTwork (Assembly of reads and typing workflow)32 used in the ANSES Laboratory for Food Safety. In addition to de novo assembly, the ARTwork pipeline also performs genome annotation using Prokka33. This whole genome sequencing (WGS) workflow has been described in detail in previous publications32,34,35,36, including the integrated bioinformatics tools and their corresponding versions, enabling repeatability and comparability of the results2 (Table 5). Assembled genome files were made publicly available in FASTA format through Figshare37.

Table 5 Bioinformatics tools used and their versions.

Quality control of WGS data

Poor-quality reads or assemblies as well as contaminations can significantly affect gene prediction and cluster analyses38,39. Different WGS metrics and quality criteria were thus employed in the ARTwork pipeline to ensure high-quality WGS data. Read sets with an estimated depth of coverage < 30 × (as estimated by BBmap40) as well as contigs and scaffolds with a length of < 200 bp were excluded (n = 22). Draft genomes with a total length outside the range of 2.7–3.3 Mb and those with a total number of scaffolds > 200 (n = 46) were also excluded. In addition, inter- and intra-species contamination of reads was determined using the ConFindr software (v0.5.1)41. It was recently demonstrated that inter- and intra-species contamination of up to 10 single nucleotide variants (SNVs), as assessed by ConFindr in the conserved core genes, does not significantly impact cluster analysis39. We therefore excluded all genomes with an SNV count above this cut-off (n = 12), as well as genomes with various read- or assembly-related errors (n = 34).

The employed WGS metrics and quality criteria of the complete LISTADAPT genome dataset are reported in Figshare File 126. In total, 114 sequenced genomes were of unsatisfactory quality after quality control and were thus excluded from the final dataset. After quality control of NGS and WGS data, the final LISTADAPT dataset included 1484 genomes.
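The quality thresholds described in this section can be expressed as a simple filter. The sketch below is not part of the ARTwork pipeline: the record structure and names are hypothetical, each criterion is treated as an independent reason for exclusion, and a contamination count above 10 SNVs is assumed to be the rejection condition.

```python
from dataclasses import dataclass

MIN_DEPTH = 30            # minimum estimated depth of coverage (x)
MIN_SCAFFOLD_BP = 200     # shorter contigs/scaffolds are discarded
MIN_GENOME_BP = 2_700_000
MAX_GENOME_BP = 3_300_000
MAX_SCAFFOLDS = 200
MAX_CONTAM_SNVS = 10      # assumed: counts above this value are rejected


@dataclass
class Assembly:
    strain_id: str
    depth: float            # estimated depth of coverage
    scaffold_lengths: list  # scaffold lengths in bp
    contam_snvs: int        # contaminating SNVs in conserved core genes


def qc_failures(assembly: Assembly) -> list:
    """Return the list of failed criteria (empty if the genome passes)."""
    kept = [n for n in assembly.scaffold_lengths if n >= MIN_SCAFFOLD_BP]
    reasons = []
    if assembly.depth < MIN_DEPTH:
        reasons.append("depth")
    if not MIN_GENOME_BP <= sum(kept) <= MAX_GENOME_BP:
        reasons.append("length")
    if len(kept) > MAX_SCAFFOLDS:
        reasons.append("scaffolds")
    if assembly.contam_snvs > MAX_CONTAM_SNVS:
        reasons.append("contamination")
    return reasons


# Hypothetical records for illustration.
good = Assembly("X1", 80.0, [1_500_000, 1_400_000, 150], 2)
bad = Assembly("X2", 20.0, [1_000_000], 15)
print(qc_failures(good), qc_failures(bad))

# Exclusion counts reported in the text add up to the stated total.
excluded = {"depth_or_short_contigs": 22, "length_or_scaffolds": 46,
            "contamination": 12, "read_or_assembly_errors": 34}
print(sum(excluded.values()))
```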

Metadata and WGS data sharing

All metadata and WGS data collected herein were centralized and processed with standardized criteria for common nomenclature and NGS/WGS quality control before sharing between project partners. Reads normalized to 100 × coverage, draft assemblies (contigs and scaffolds) and annotated genomes (Genome Feature Format, GFF, and GenBank format, GBK) were also centralized in a MongoDB database located at ANSES (Maisons-Alfort Laboratory for Food Safety), providing quickly available, ready-to-use data.

Raw (non-normalized) reads for all the Lm strains sequenced in the LISTADAPT collection (n = 1484) were submitted to the NCBI Sequence Read Archive (SRA) for sharing with the LISTADAPT project’s partners. Raw (non-normalized) reads for 67 Lm food strains obtained from previous publications19,42 were submitted to the NCBI Sequence Read Archive (SRA) database and were linked to their existing accession numbers in Figshare File 126.

Data Records

All high-quality WGS data from this data descriptor are available for download from the SRA/ENA public repositories, including the sequences already available at the beginning of this study43. Assembly and annotation files are available through Figshare44. Complete metadata and quality check parameters are reported in Figshare File 126.

Technical Validation

Redundant strains

The LISTADAPT dataset was analyzed by core-genome MLST (cgMLST), using BioNumerics (Table 5) according to a fixed cgMLST scheme consisting of the 1748 loci of Moura et al.45. Strains were considered redundant when their genomes differed by fewer than 7 alleles (allele differences, AD) and they were isolated in the same year, shared the same source of isolation and shared the same geographic location (same region or country). When the latter information was not available, the provider was used instead. Although the year of isolation was unknown for four strains, they were marked as redundant because of their similar cgMLST profiles (<7 AD). Among the 1484 strains, 157 were identified as redundant. These strains were kept in the dataset and marked accordingly (Figshare File 126).
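The redundancy rule can be sketched as a pairwise comparison of allele profiles. The example below is illustrative only: it uses toy 10-locus profiles instead of the 1748-locus scheme, the record fields are hypothetical, and ignoring loci with a missing call in either strain is an assumption rather than the documented BioNumerics behaviour.

```python
from itertools import combinations


def allele_differences(p, q):
    """Count loci called in both profiles whose allele numbers differ."""
    return sum(1 for a, b in zip(p, q)
               if a is not None and b is not None and a != b)


def redundant_pairs(strains, threshold=7):
    """Pairs with < threshold AD and identical year, source and location."""
    pairs = []
    for s, t in combinations(strains, 2):
        same_context = all(s[k] == t[k]
                           for k in ("year", "source", "location"))
        close = allele_differences(s["profile"], t["profile"]) < threshold
        if same_context and close:
            pairs.append((s["id"], t["id"]))
    return pairs


# Toy records: 10 loci per profile, None marks a missing allele call.
context = {"year": 2015, "source": "cattle", "location": "FR"}
strains = [
    {"id": "A", "profile": [1] * 10, **context},
    {"id": "B", "profile": [1] * 8 + [2, 2], **context},   # 2 AD from A
    {"id": "C", "profile": [1] * 10, **{**context, "year": 2016}},
    {"id": "D", "profile": [3] * 8 + [1, 1], **context},   # 8 AD from A
]

pairs = redundant_pairs(strains)
print(pairs)
```

Strain C has the same profile as A but a different year of isolation, so it is not flagged; strain D shares the context but exceeds the allele-difference threshold.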

Consistency analysis

The present study includes 648 strains from existing collections and 108 strains isolated in the framework of this study. The strains from historical collections were provided by 19 different laboratories. The management of large strain collections may lead to storage issues, such as two strains being stored in the same tube. Furthermore, the sequencing of the strains involved several handling steps that may lead to human error.

For 380 of the 648 strains provided by partners, historical typing data were available. We established links between these typing data and the sequences obtained. The typing data consisted of conventional serotyping, molecular serotyping, or MLST obtained by individual allele sequencing or by mapping from PFGE. For conventional serotypes, the correspondence with the MLST type obtained from WGS was established following Ragon et al.12. The correspondence with molecular serotyping was established based on the Hyden et al.46 mapping system using the SeqSphere software (Table 5). For the strains isolated in Belgium (Table 3), the correspondence with PFGE was applied by our partner, based on the methodology described in Félix et al.18. For the strains isolated in Finland (Tables 2 and 3), the correspondence with PFGE was applied by our partners according to their in-house mapping methodology. The observed discordances were investigated with the partners. The strains concerned were re-sequenced if needed and discarded when the discordance remained unresolved. All results are reported in Figshare File 126.