Utilizing the Public GenomeTrakr Database for Foodborne Pathogen Traceback.

This protocol outlines all the steps necessary to become a GenomeTrakr data contributor. GenomeTrakr is an international genomic reference database of mostly food and environmental isolates of foodborne pathogens. The data and analyses are housed at the National Center for Biotechnology Information (NCBI), whose databases are freely available to anyone in the world. The Pathogen Detection browser at NCBI computes daily cluster results, adding newly submitted data to the existing phylogenetic clusters of closely related genomes. Contributors to this database can see how their new isolates relate to those in the real-time foodborne pathogen surveillance program established in the USA and a few other countries, while at the same time adding valuable new data to the reference database.


Inception of GenomeTrakr Within the FDA Mission
In 2012, the FDA began a pilot project called GenomeTrakr to build a public genomic reference database of historical food and environmental isolates of Salmonella. The goal of this project was to improve the accuracy and response time for identifying the causes of foodborne outbreaks, to identify harborage in facilities, and to aid in establishing preventative controls [1]. In this pilot, whole-genome sequencing (WGS) data were collected by a distributed set of public health laboratories, transferred to the FDA for quality screening, then uploaded under an umbrella BioProject at NCBI's SRA database (Fig. 1). The result has been a continuously growing database of genomic sequence information and accompanying metadata (e.g., geographic location, source, and date) from food, environmental, and clinical isolates. Over 1000 Salmonella genomes were collected after the first year, around 10,000 by the second year, and now, after 5 years and multiple contributors, including other US agencies and Public Health England, the maturing Salmonella database is approaching 160,000 genomes [2]. After the initial success of sequencing Salmonella, the effort expanded to Listeria monocytogenes [3] in 2013, and soon thereafter to pathogenic Escherichia coli/Shigella, Campylobacter jejuni, Vibrio parahaemolyticus, and Cronobacter. The Pathogen Detection portal at NCBI is now the central repository for foodborne pathogen genomes used for real-time surveillance in the US.

Increased Role of Food/Environmental Isolate Contributions
Foodborne pathogen isolates collected by FDA field laboratories were a major contributor of the food and environmental isolates in the PulseNet pulsed-field gel electrophoresis (PFGE) database. These isolates largely come from FDA's regular sampling of imports, routine facility swabbing, and targeted high-risk food sampling assignments. The number of food and environmental isolates contributed by state public labs varies from state to state, depending on sampling efforts and levels of collaboration with the respective state agriculture lab(s). Some states contribute quite a few food and environmental isolates and others next to none. Because membership in and contribution to the US PulseNet database are restricted to US public health labs that maintain certification, increasing the sources of food and environmental isolates from laboratories outside these members was not feasible.
Technology also shaped the role played by food and environmental isolates. Because the surveillance effort for PFGE is largely led by epidemiological data, the food/environmental isolates played a secondary role to outbreak delineation (i.e., first define the scope of the outbreak, then use patient interviews to discover potential food sources, then target sampling for those suspect foods). This model works well for low-resolution PFGE technology, but with a high-sensitivity technology such as WGS, outbreak investigations can benefit when the underlying data play a more forward role in the investigation. For example, a likely scenario under WGS is as follows: first, a genomic signal is picked up when a clinical cluster matches a food/environmental isolate; then the full epidemiological investigation is launched in response; at the same time, FDA launches additional inspections to understand the root cause of contamination along the farm-to-fork continuum. Because of the increased resolution of WGS, PulseNet is recognizing a greater number of smaller clusters. However, due to limited resources, those clusters that include a food or environmental isolate often get prioritized for follow-up over those that do not. As a result, food and environmental isolates potentially play a more important role under a genomic surveillance network. The shift to storing the WGS data in an open, public database creates an opportunity to greatly increase the diversity of these isolates by targeting potential submitters outside the PulseNet community. FDA scientists recognized this advantage early on and worked to leverage the GenomeTrakr network to include non-PulseNet laboratories, such as state agriculture labs, academic labs, and international collaborators, with the overall goal of more accurately capturing the global population diversity within key foodborne pathogen species. This effort has resulted in a higher percentage of food and environmental isolates in the Listeria WGS database: as of April 2018, 44% of PulseNet's PFGE database comprised food and environmental isolates, compared to 69% at NCBI's Pathogen Detection database [2]. Ultimately, this will help to increase the probability of a food/environmental "match" for any new isolate being added to the database, supporting the FDA's mission to pinpoint the causes of foodborne outbreaks, to identify harborage in facilities, and to use WGS data to establish preventative controls.

GenomeTrakr Data Flow and Open Source Analysis Pipelines
Sequence data are generated at one of more than 40 GenomeTrakr laboratories, then transferred immediately to our data center at FDA-CFSAN. Newly generated sequence data are processed through our internal quality control pipeline, where metrics are assessed for data quality (sequence quality and sequence coverage) and integrity (correct species and serovar assignment). Data that pass predetermined thresholds are submitted to the Sequence Read Archive (SRA) at NCBI, where they are processed through NCBI's Pathogen Detection analysis pipeline. Within a couple of days the sequences appear in the Pathogen Detection browser, where results of nightly cluster analyses are available for searching and browsing. On average, GenomeTrakr submits over 1000 isolates per month to the Pathogen Detection pipeline.
FDA monitors the public Pathogen Detection site daily, looking for mission-relevant clustering results, such as a close match between a food isolate and an isolate collected from a clinical patient, or an environmental swab isolate that matches an isolate collected from the same location in a previous year. Upon seeing results like these, one of the FDA data scientists will download the sequence data associated with a particular cluster, then rerun the SNP analysis using CFSAN's open source SNP pipeline [4]. Depending on the nature of the cluster, appropriate stakeholders will be contacted for follow-up. For example, a cluster showing clonal isolates collected from the same facility over multiple years might be sent to the FDA's Office of Compliance, where the data will be added to ongoing facility investigations. Similarly, new food + clinical matches might be forwarded to the FDA's Coordinated Outbreak Response and Evaluation (CORE) team, or a state lab might be contacted if the cluster appears to be contained within state boundaries. A regulatory response by the FDA will include all the evidence gathered across a full investigation, which might include site visits, epidemiological evidence, and supporting data from WGS cluster analysis. This three-legged stool of evidence from epidemiology, site investigations, and WGS provides support for FDA regulatory decisions.

WGS Data Collection and Analysis-Validation and Harmonization
Genomics for Food Safety (Gen-FS) is a working group in the USA [5], with representatives from the CDC, FDA, USDA, and NCBI. The Gen-FS working groups carefully harmonize quality management systems across GenomeTrakr and PulseNet, including Quality Assurance (QA) measures and accompanying quality control (QC) checks, to ensure all WGS data in the Pathogen Detection database meet the Gen-FS minimum quality standards. In addition, all downstream analyses, including cluster analysis presented through the Pathogen Detection website, outbreak analyses from PulseNet, and identification of the source of contamination events from GenomeTrakr, are harmonized such that the results are accurate and comparable across the different analysis pipelines. Benchmark datasets derived from empirical data [6] as well as simulated data [7] are used in this harmonization effort. Gen-FS also runs an annual multilab proficiency test (PT) across the PulseNet/GenomeTrakr lab network [8], which measures proficiency for each laboratory and also serves as a multilab validation exercise by assessing the accuracy and reproducibility of WGS data collection across the whole network.
The Global Microbial Identifier (GMI) [9] is an international organization dedicated to harmonizing the multiple in-country efforts of genomic pathogen surveillance. GMI is working toward a global system to aggregate, share, mine, and translate genomic data for microorganisms in real time. Multiple GMI working groups have agreed on minimum metadata standards, proposed quality control standards for the sequence data, and produced benchmark datasets for validating analysis pipelines. Additional efforts toward global harmonization include developing guidance documents on the value of WGS technologies and the value of sharing these data, both with the Food and Agriculture Organization (FAO) and the World Health Organization (WHO) (see http://www.fao.org/food/food-safety-quality/a-z-index/wgs/en/).
The GenomeTrakr database is a free, open-access database for both consumption and contribution. Contributors do not have to be affiliated with the FDA, Gen-FS, or GMI to submit data and use NCBI's Pathogen Detection portal to view clustering results. This chapter outlines all the steps necessary to independently submit data to the GenomeTrakr database at NCBI, including WGS quality standards, NCBI data submission, and finally how to view and curate your data and cluster results at NCBI.

Materials
Materials described here cover the formats of sequence data, accompanying metadata, and quality control thresholds being utilized by GenomeTrakr for the submission of raw sequence data.
1. Project creation: Establish an umbrella BioProject(s) that will hold one or multiple data BioProjects (e.g., one for each pathogen species). Email the umbrella accession to pd-help@ncbi.nlm.nih.gov and ask to have it linked to the Pathogen Detection pipeline.

Internal QA/QC Pipeline
Data should be thoroughly checked for quality before submission to NCBI. GenomeTrakr screens for quality (sequence quality, coverage, etc.) and integrity (correct ID, no contamination, etc.) (Fig. 2). At a minimum, submitters should establish the following quality control checks (Table 1).
Sequence Quality
1. The average Q score for R1 and R2 should be at Q30 or above, unless there is higher coverage (Table 1). The average Q score is calculated for each file (R1 and R2) as the mean of that file's Q histogram, where the Q histogram is the count of each Q score appearing in the file.
2. The estimated average coverage should be above the taxon-specific thresholds listed in Table 1. Estimated coverage is calculated as the total number of bases in the reads divided by the assumed genome size. The following genome sizes are used for the most common foodborne pathogens.
3. Project type tab: (a) Project data type = Genome sequencing and assembly.
4. Target tab: Populate ONLY the Organism name here, usually Genus species, or just Genus if your laboratory does not determine species, e.g., Campylobacter.
5. General info tab: click "Release immediately following processing".
(a) Public Description: e.g., "Whole genome sequencing of cultured pathogens as part of a surveillance project for the rapid detection of outbreaks of foodborne illnesses".
(c) Is your project part of a larger initiative which is already registered at NCBI? Click yes if you have an umbrella project established and include the accession here. This will properly link your data project with your overall umbrella project.
11. BioProject accession will be available on the "my submissions" page of the Submission portal and usually starts with "PRJNAxxxxxx."
12. For questions about establishing umbrella projects, linking data projects, or any other BioProject issue, contact genomeprj@ncbi.nlm.nih.gov.
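The two sequence-quality checks described under Internal QA/QC Pipeline (mean Q score per read file and estimated coverage) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not GenomeTrakr's internal pipeline: the function names are hypothetical, quality strings are assumed to use the Phred+33 encoding, and the genome size and coverage threshold must be supplied by the caller from Table 1 and the genome-size list, since those values are not reproduced here. The "unless there is higher coverage" exception to the Q30 rule is deliberately not modeled.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumption: FASTQ quality strings are Phred+33 encoded.
PHRED_OFFSET = 33


def q_histogram(quality_strings: Iterable[str]) -> Dict[int, int]:
    """Count how often each Q score appears across one file's quality lines."""
    hist: Counter = Counter()
    for qual in quality_strings:
        hist.update(ord(ch) - PHRED_OFFSET for ch in qual)
    return dict(hist)


def mean_q_from_histogram(q_hist: Dict[int, int]) -> float:
    """Mean Q score of a file, computed from its Q histogram."""
    total = sum(q_hist.values())
    if total == 0:
        raise ValueError("Q histogram is empty")
    return sum(q * count for q, count in q_hist.items()) / total


def estimated_coverage(total_bases: int, genome_size: int) -> float:
    """Total number of bases in the reads divided by the assumed genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def passes_qc(r1_hist: Dict[int, int], r2_hist: Dict[int, int],
              total_bases: int, genome_size: int,
              min_coverage: float, min_q: float = 30.0) -> bool:
    """Apply the Q30 rule to R1 and R2 and a caller-supplied coverage threshold.

    min_coverage and genome_size are taxon specific; take them from Table 1
    and the genome-size list rather than from this sketch.
    """
    q_ok = (mean_q_from_histogram(r1_hist) >= min_q
            and mean_q_from_histogram(r2_hist) >= min_q)
    cov_ok = estimated_coverage(total_bases, genome_size) >= min_coverage
    return q_ok and cov_ok
```

In use, one histogram would be built from the quality lines of each of the R1 and R2 files, and `total_bases` would be the summed read lengths across both files. Keeping the thresholds as arguments rather than constants avoids hard-coding values that differ by taxon.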

BioSample
Submit metadata to BioSample database:
1. Populate the combined pathogen package template with metadata for each isolate you intend to submit. Ensure that text is included for ALL mandatory fields. Include the word "missing" if data are not available for a given field (Table 2).
3. Click on the "New submission" button at the top of the screen.
7. "Attributes" tab: click "Upload a file using Excel", then navigate to the file for upload.
8. "Overview" tab: Check over your submission for errors, then submit.
9. BioSample accessions will be available on the "my submissions" page of the Submission portal. Accessions will start with "SAMNxxxxxxxx."
Table 2 The minimum set of metadata fields for food or environmental isolates

Required field    Description
*sample_name    Sample Name is a name that you choose for the sample (or isolate in our case). It can have any format, but we suggest that you make it concise, unique and consistent within your laboratory, and as informative as possible. Every Sample Name from a single Submitter must be unique.
*strain    This is the authoritative ID used in GenomeTrakr and usually the same as the sample_name listed above. It can have any format, but we suggest that you make it concise, unique and consistent within your laboratory, and as informative as possible. Strain names must be unique within the NCBI database.
*isolation_source    Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. For food isolates please provide generic descriptions of the food.
*lat_lon    This is a required field for NCBI, but not for GenomeTrakr. Include "missing" here if lat/long details are not provided.
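Before uploading the metadata template, the rules above (text in every mandatory field, "missing" where data are unavailable, unique sample names) can be checked with a short script. The sketch below is illustrative only: the function names are hypothetical, records are assumed to have been read from the template into dictionaries keyed by field name without the leading asterisk, and only the four fields listed in Table 2 here are covered. Uniqueness is checked within the batch only; uniqueness of strain names across the whole NCBI database cannot be verified locally.

```python
from collections import Counter
from typing import Dict, List, Optional

# Fields taken from Table 2 as shown above; the real template may require more.
REQUIRED_FIELDS = ("sample_name", "strain", "isolation_source", "lat_lon")
# Fields that must carry a real, unique value rather than "missing".
IDENTIFIER_FIELDS = ("sample_name", "strain")


def normalize_record(record: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Strip whitespace and write "missing" into empty mandatory fields."""
    cleaned = {key: (value or "").strip() for key, value in record.items()}
    for field in REQUIRED_FIELDS:
        if not cleaned.get(field):
            cleaned[field] = "missing"
    return cleaned


def find_problems(records: List[Dict[str, str]]) -> List[str]:
    """Report identifier fields that are absent or duplicated within the batch."""
    problems = []
    for field in IDENTIFIER_FIELDS:
        counts = Counter(rec.get(field, "") for rec in records)
        for value, n in counts.items():
            if value in ("", "missing"):
                problems.append(f"{field}: {n} record(s) lack a real value")
            elif n > 1:
                problems.append(f"{field}: '{value}' is used {n} times")
    return problems


if __name__ == "__main__":
    # Hypothetical isolates, for illustration only.
    batch = [
        normalize_record({"sample_name": "ISO-001", "strain": "ISO-001",
                          "isolation_source": "romaine lettuce", "lat_lon": ""}),
        normalize_record({"sample_name": "ISO-001", "strain": "ISO-002",
                          "isolation_source": "", "lat_lon": None}),
    ]
    for problem in find_problems(batch):
        print(problem)
```

Treating an empty sample_name or strain as an error, instead of filling it with "missing", is a design choice of this sketch: both fields serve as identifiers and must be unique, so a placeholder value would collide as soon as two records lacked one.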