Introduction

Genetically modified organisms (GMOs) intended for marketing as food and feed and derived products or for cultivation and release in the environment are subject to an authorization process in Europe in accordance with Directive 2001/18/EC and Regulation (EC) No 1829/2003. Commission Implementing Regulation (EU) No 503/2013 specifies the content of applications for authorization of genetically modified food and feed, including the information required to be submitted. To guide this process, the European Food Safety Authority (EFSA) has issued several guidance documents on the preparation and presentation of applications and, most importantly, on the scientific information required for risk assessment.

EFSA guidance documents are based on four pillars of GMO risk assessment: (i) a molecular characterization, which is an assessment of the molecular structure of the intended modification as well as any other unintended changes in the GMO; (ii) the comparative analysis, which is focused on compositional, nutritional and agronomic characteristics; (iii) an evaluation of potential toxicity and allergenicity; and (iv) an evaluation of the potential environmental impact of the GMO [1].

According to the EFSA guidance documents, possible alterations in the phenotype are identified through a comparative analysis of growth performance, yield, chemical composition, and more. A targeted approach (i.e., measurements of a limited number of individual compounds such as macronutrients, micronutrients, and certain crop-specific secondary metabolites) is used to detect compositional and nutritional differences between the GMO and its near-isogenic non-GM counterpart. This comparative approach applies the concept of substantial equivalence, as food and feed derived from GMOs are compared to an appropriate comparator, defined by Regulation (EC) No 1829/2003 as “a similar food or feed produced without the help of genetic modification and for which there is a well-established history of safe use” [1, 2]. In general terms, the concept of substantial equivalence is based on the notion that existing organisms, such as those used as food sources, can be used as comparators when assessing the safety of a genetically modified organism. According to EFSA, the compositional analysis is not considered an endpoint in itself but a starting point of the case-specific risk assessment, as it serves to identify intended and unintended differences and/or lack of equivalence between GM plants and derived food and feed and their comparator(s) [1]. Since EFSA began applying the substantial equivalence principle, it has become clear that the underlying criteria leave scope for different interpretations by various risk assessors and academics, who have described the principle as unfit for purpose [3,4,5,6,7,8,9]. In addition, EFSA’s comparative approach has long and frequently been criticized for its restricted and ‘biased’ selection of compounds to be analyzed, as the detection of unknown toxins or anti-nutrients is not possible using this method [10,11,12,13].

There is a long-standing and ongoing debate concerning the potential value of broader-scale methods in risk assessment, such as unbiased molecular profiling approaches [14]. Such untargeted approaches, through the quantity of the data they generate, may help to: (a) identify effects which could trigger additional risk assessment hypotheses to be tested and (b) reduce the uncertainty about whether unintended changes have been overlooked [15]. Strategies based on large-scale analysis of molecular data have been developed and successfully applied to screen genetically modified plant varieties for altered transcriptomic, proteomic or metabolomic profiles when compared to their non-GM counterparts ([16,17,18,19,20,21,22,23], reviewed in [24]). The application of such molecular profiling analyses has also been suggested by the expert group on risk assessment and management under the Cartagena Protocol on Biosafety to the United Nations Convention on Biological Diversity [25]; these should be employed in those comparison studies where the scientifically most justifiable near-isogenic and conventional comparator would not grow under the relevant stress condition, or not grow as well, e.g., after herbicide application. In addition to these previous debates, the ongoing discussion of the GMO regulations in Europe will certainly trigger revisions of the technical guidance, including EFSA guidance documents, to accommodate the risk assessment of organisms derived from New Genomic Techniques (NGTs) [26].

Several ways to apply or implement omics analysis into the risk assessment framework have been proposed [11, 13, 15, 24, 27,28,29]. More recently, EFSA has explored opportunities for integration of datasets produced via specific omics tools within risk assessment approaches in several fields, including GMO risk assessment. In their report, the authority suggested the use of case examples that could be tested to enhance confidence in the use of omics datasets in risk assessment [30]. Similar to EFSA, the U.S. National Academies of Sciences, Engineering, and Medicine also acknowledges the usefulness of omics technologies to enable an examination of a plant’s DNA sequence, gene expression, and molecular composition, as these techniques are expected to improve the efficiency of development of both non-GM and GM crops and could likewise be used to analyze new GMOs and test for unintended changes caused by the genetic engineering process [31]. However, none of these studies provide a clear implementation pathway for the GMO risk assessment in Europe. In this paper, we produced empirical data to test implementation, and we provide a clear pathway for omics analysis integration in the context of the European GMO regulation and EFSA’s guidance documents.

Material and methods

Plant material

A total of seven soybean cultivars were selected for the field experiment and subsequent omics analysis: the stacked GM event BRS1001 Intacta™ Roundup Ready™ 2 Pro soybean (IPRO; unique identifier MON-877Ø1-2 × MON-89788-1) from Embrapa Brazil, containing transgenic elements from event MON89788, conferring glyphosate-tolerance (i.a. CP4-epsps behind the chimeric promoter P-FMV/Tsf1 and a chloroplast transit peptide sequence), and from event MON87701, conferring Lepidoptera-resistance (i.a. Cry1Ac from Bacillus thuringiensis behind the A. thaliana rbcS promoter and transit peptide); the near-isogenic non-GM variety BRS284 from Embrapa; as well as five non‐GM commercial reference varieties (BRS 232; BRS 283; BRS 257; BRS 511 all from Embrapa and CD 216 from Codetec Brazil). Seeds were supplied by the local representatives and not tested in house for other GM events as they are certified seeds and follow Brazilian seed quality regulations.

Field conditions

A field experiment was conducted in the state of Santa Catarina, southern Brazil. The area is located at 27°25' S and 51°31' W and is dedicated to agricultural land use only. The exact location of the field is not published because it is a private area, but it can be provided by the authors upon request. Soil is classified as Red Latosol dystrophic with clay texture [32]. To comply with the practices used in the region, soybean seeds were planted in a no-tillage system. The area was previously treated with a systemic glyphosate-based herbicide (GBH). Prior to sowing, potassium (150 kg/ha) and phosphorus (400 kg/ha)-based fertilizers were applied and incorporated in each planting line. Soybean seeds were treated with insecticides and fungicides (active ingredients used: pyraclostrobin, thiophanate-methyl, and fipronil), as well as with a Bradyrhizobium japonicum-based inoculant (1.2 ml per kg of seed, 6 × 10^9 cfu/ml). Seeds were manually planted on November 18th, 2017, at a density of 200,000 plants/ha, with a distance of 0.10 m between plants and 0.50 m between lines. The following pesticides were used during the growing season following agricultural practice in the region: Thiamethoxam, Lambda Cyhalothrin, Lufenuron, Trifloxystrobin, Prothioconazole, Diflubenzuron, Mancozeb, Azoxystrobin, Chlorantraniliprole, Difenoconazole, Cyproconazole, Bentazon, Fomesafen, Clethodim. Other adjuvants and chemosynthetics were also used (Nimbus, Áureo, Triunfo, and methyl ester-based adjuvants). All soybean varieties were treated equally. No glyphosate-based herbicides were applied during the growing season. Leaf samples were taken at phenological stages V5 and V6, before flowering. Each sample was composed of material from the third upper leaf of four plants from inner lines. Samples were immediately placed in 3.8-ml cryogenic tubes, frozen in liquid nitrogen, and kept in a −80 °C freezer until proteins and metabolites were extracted.

This field experiment followed the EFSA guidelines for statistical analysis for the safety of genetically modified organisms [33]. Briefly, the plot area was replicated in seven plots, each defined as an area of 40 m × 7.5 m (L × W) and divided into four randomized blocks of 20 m² each (Additional File 1).

Proteomics analysis

Total protein was extracted from soybean leaf samples (86 samples corresponding to the different varieties, 4 blocks, 8 plots) according to Carpentier et al. [34] with modifications. Briefly, 100 mg of leaf tissue was weighed and mixed with the extraction buffer (5 mM EDTA, 100 mM KCl, 30% sucrose, 50 mM Tris-HCl pH 8.5, and protease inhibitor at the concentration recommended by the manufacturer; cOmplete™, EDTA-free Protease Inhibitor Cocktail from Roche). Samples were ground in a Precellys 24 automatic homogenizer. Proteins were extracted using a phenol-based solution and precipitated in ammonium acetate (100 mM) in methanol. After precipitation, proteins were washed twice with 20% DTT in acetone. The pellet was resuspended in 1.5% urea and 100 mM TEAB. Finally, protein extracts were quantified in a spectrophotometer (reading at 562 nm) with the BCA protein kit (Novagen working reagent) and adjusted to a total concentration of 2 µg/µl. Samples were then analyzed by LC–MS/MS with TMT 11-plex labeling (Arctic University of Tromsø, Proteomic Platform). First, protein samples were separated on a nanoLC before sequential injection into an Orbitrap (Q-Exactive) instrument with high mass accuracy. Then, the peptides were fragmented in an order of ten times in MS by high-energy collisional dissociation (HCD). The mass spectra of peptides and fragmented peptides were used for the identification of proteins and post-translational modifications (PTMs), as well as their quantification. For protein identification, Proteome Discoverer (Thermo Fisher Scientific) was used, and quantification was carried out using MaxQuant software with Perseus.

Metabolomic analysis

Metabolites were extracted from the collected leaf samples and sent to the Swedish Metabolomics Centre (Umeå University, Sweden) for subsequent GC–MS and LC–MS analysis. Sample preparation and metabolite extraction were performed as described by Jiye et al. [35]. To each 20 µl sample, 1000 µl of extraction buffer (60/20/20 v/v methanol:chloroform:water) was added together with a tungsten granule. Additionally, quality control (QC) samples (pooled metabolite extracts) and extraction blanks were processed and analyzed. Samples were shaken in a mixer mill and then centrifuged at 4 °C and 14,000 rpm for 10 min. The supernatants were transferred to LC and GC microvials, respectively, and the solvents were evaporated. The resulting samples were divided into three aliquots for analysis on three MS platforms. These included two UPLC/MS platforms, one optimized for positive ionization and a second optimized for negative ionization, on a UHPLC Agilent 1290 Infinity (Agilent, Waldbronn, Germany). The third aliquot was derivatized and analyzed by GC/MS. The UPLC–MS/MS platform included a Waters ACQUITY ultra-performance liquid chromatography (UPLC) system. The compounds were detected with an Agilent 6550 Q-TOF mass spectrometer equipped with a jet electrospray ion source operating in positive or negative ion mode. A reference interface was connected for accurate mass measurements. Compounds were quantified via peak area of the total ion mass. The identification of chemical compounds was based on comparisons with entries from the metabolic library of purified standards.

Statistical analyses

Two statistical approaches were applied in this study, hereafter referred to as ‘statistical analysis #i’ and ‘statistical analysis #ii’. The first analysis followed the statistical guidelines proposed by EFSA for the comparative assessment of compositional data (statistical considerations for the safety evaluation of GMOs, 2010), as discussed in the EFSA Omics Colloquium [30]. The second statistical analysis is presented here as an alternative approach that forms part of the molecular characterization in risk assessment. This new approach is based on a comprehensive untargeted metabolic and physiological assessment for the identification of unintended changes in the plant as a whole.

Prior to both statistical analyses, metabolomics and proteomics data were normalized to the median distribution, auto-scaled, and log transformed aiming to facilitate statistical comparisons. For interpretation of the numerical values, means and differences of means on the logarithmic scale have been back-transformed to geometric means and ratios of geometric means on the original scale. These data treatments follow consensus standards for these analyses [36,37,38,39].
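The pre-processing steps described above can be illustrated with a minimal sketch. It assumes the usual order of sample-wise median normalization, log2 transformation, and per-feature auto-scaling (mean-centering and division by the standard deviation); the exact order and settings used in MetaboAnalyst are not stated in the text. The function names and the toy intensity matrix are hypothetical.

```python
import math
from statistics import mean, median, stdev


def preprocess(samples):
    """Return (log2 matrix, auto-scaled matrix); rows = samples, columns = features.

    Assumed order (not confirmed by the text): median normalization,
    log2 transformation, then auto-scaling of each feature.
    """
    logged = []
    for row in samples:
        m = median(row)  # sample-wise median normalization
        logged.append([math.log2(x / m) for x in row])
    scaled = [row[:] for row in logged]
    for j in range(len(logged[0])):
        col = [row[j] for row in logged]
        mu, sd = mean(col), stdev(col)
        for i in range(len(logged)):
            # Constant features cannot be scaled; set them to zero instead.
            scaled[i][j] = (logged[i][j] - mu) / sd if sd > 0 else 0.0
    return logged, scaled


def geometric_mean_ratio(log2_gm, log2_non_gm):
    """Back-transform a difference of log2 means into a ratio of geometric means."""
    return 2 ** (mean(log2_gm) - mean(log2_non_gm))


# Hypothetical intensities: 4 samples (rows) x 4 features (columns).
toy = [
    [100.0, 220.0, 35.0, 410.0],
    [120.0, 200.0, 30.0, 450.0],
    [90.0, 260.0, 40.0, 380.0],
    [110.0, 240.0, 33.0, 400.0],
]
logged, scaled = preprocess(toy)
```

After auto-scaling, every feature has mean 0 and standard deviation 1 across samples, which is why the group comparisons reported later are read from the log-scale values and back-transformed rather than from the scaled values.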

Statistical analysis #i

Identified proteins and metabolites were statistically analyzed using the ‘endpoint-by-endpoint’ comparative analysis suggested by the EFSA statistical guidance for compositional analysis. This assessment is composed of two sets of comparative tests: first, a difference test to determine whether the GMO is different from its near-isogenic control comparator (i.e., GMO vs. non-GM); followed by an equivalence test of the GMO against a range of conventional varieties to determine whether it is equivalent to commercial reference varieties with a ‘history-of-safe-use’ (i.e., GMO vs. reference varieties (RVs) range) [33]. Statistical significance of the difference test was defined as p < 0.05, as determined using the two-tailed Student’s t-test. Values for significance were adjusted for the false discovery rate (FDR) with the Bonferroni–Holm method (p-adj FDR < 0.05) [40]. Plots were analyzed both separately (within plots) and combined (across all plots). For the equivalence test, the range of observed values from the reference varieties was determined for each analytical component and used to calculate tolerance intervals. A tolerance interval is a range of values that, with a specified degree of confidence, contains at least a specified proportion of the entire sampled population for the parameter measured. For each protein or metabolite that was significant in the difference test (GMO vs. non-GM), a 99% tolerance interval representing the equivalence limits was calculated; this interval is expected to contain, with 95% confidence, 99% of the quantities expressed in the population of commercial conventional varieties. The tolerance interval estimates were based on a total of 20 observations from 6 reference varieties from all plots. Finally, the mean value and standard deviation (mean ± SD) of each statistically different compound in the GMO (p-adj < 0.05) were compared to the 99% tolerance interval for the equivalence test. Non-equivalence was determined when the mean ± SD of a statistically different compound in the GMO fell outside the 99% RV tolerance interval.
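The equivalence check described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes approximately normal data on the log scale, uses Howe's approximation for the two-sided tolerance factor, and replaces the exact chi-square quantile with the Wilson-Hilferty approximation so that only the standard library is needed. The decision rule (the whole mean ± SD range must lie inside the interval) is one reading of the text, and all function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def chi2_quantile_wh(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile (an approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def tolerance_factor(n, coverage=0.99, confidence=0.95):
    """Howe's approximate k-factor for a two-sided normal tolerance interval."""
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    chi2 = chi2_quantile_wh(1.0 - confidence, n - 1)
    return z * math.sqrt((n - 1) * (1.0 + 1.0 / n) / chi2)


def tolerance_interval(ref_values, coverage=0.99, confidence=0.95):
    """Interval expected to contain `coverage` of the population with `confidence`."""
    k = tolerance_factor(len(ref_values), coverage, confidence)
    m, s = mean(ref_values), stdev(ref_values)
    return m - k * s, m + k * s


def is_equivalent(gm_values, ref_values):
    """Assumed rule: GM mean - SD and mean + SD must both lie inside the RV interval."""
    lower, upper = tolerance_interval(ref_values)
    m, s = mean(gm_values), stdev(gm_values)
    return lower <= m - s and m + s <= upper
```

With 20 reference observations, the factor k is roughly 3.6, so the equivalence limits sit about 3.6 reference standard deviations on either side of the reference mean. A compound that is absent from the reference varieties has no such interval, which is why undetected proteins cannot be assessed this way.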

Statistical analysis #ii

The proposed alternative statistical model for omics data analysis was applied both to single plots and across all plots. Exploratory multiple co-inertia analysis (MCIA) and principal component analysis (PCA) were conducted to investigate and geometrically project the main sources of variation present in the proteomic and metabolomic data sets (Lever, Krzywinski & Altman 2017). Then, a comparative statistical analysis of whole-proteome and metabolome data for GM vs. non-GM plants was performed to search for potential metabolic alterations. The focus was therefore not on the endpoint-by-endpoint analysis but rather on the relationships and metabolic functions of proteins and metabolites as a whole. Scaling of the data, PCA, MCIA, volcano plots, and tolerance intervals were produced using MetaboAnalyst 5.0 (https://www.metaboanalyst.ca), as well as the ggplot2, msmsTests, omicade4, and tolerance packages in the R environment. Functional enrichment analysis and interaction networks of differential proteins and metabolites were produced using Stitch 5.0 (http://stitch.embl.de) and String 11.0 (https://string-db.org).

The statistical significance of PCA was defined as p < 0.05, as determined using the two-tailed Student’s t-test and false discovery rate correction with the Bonferroni–Holm method [40]. Fold changes are also presented as base-2 logarithms (Log2FC), a widely used transformation that represents up-regulated (positive) and down-regulated (negative) compound values on a continuous scale in a reader-friendly fashion. Functional annotation and identification of enriched metabolic pathways were performed with the UniProt database (https://www.uniprot.org) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment analysis, using the differentially expressed proteins and metabolites as input. Pathways with p-adj FDR < 0.05 were considered significantly enriched. Additionally, the Stitch and String databases were used to generate biological networks of protein–metabolite interactions to facilitate data interpretation. A cut-off score of ≥ 0.4 for the confidence of interaction was used to obtain a more reliable biological network.
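As an illustration of the multiple-testing adjustment and the fold-change transformation named above, the sketch below implements Holm's step-down adjustment and a log2 fold change in plain Python. The function names are hypothetical, and the authors' actual R and MetaboAnalyst calls are not given in the text. Holm's procedure is usually described as controlling the family-wise error rate; the "p-adj FDR" label is kept here as the authors use it.

```python
import math


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def log2_fold_change(mean_gm, mean_non_gm):
    """Positive values: higher in the GMO; negative values: lower in the GMO."""
    return math.log2(mean_gm / mean_non_gm)


def significant(pvalues, alpha=0.05):
    """Indices of features whose Holm-adjusted p-value is below alpha."""
    return [i for i, p in enumerate(holm_adjust(pvalues)) if p < alpha]
```

For example, a fold change of 1.5 corresponds to a Log2FC of about 0.58, and a fold change of 0.5 to a Log2FC of exactly −1, which makes up- and down-regulation symmetric around zero.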

An additional analysis was performed to search for allergenic proteins among the statistically different proteins using the Allergen Online v.20 database (http://www.allergenonline.org). Allergens were searched using the Full FASTA 36 algorithm with an E-value cutoff of 1. Our search parameters followed the guidelines of Codex [41] for the evaluation of the potential allergenicity of novel proteins, which suggest that matches of at least 35% identity may indicate the possibility of cross-reactivity. The E-value is a statistical score calculated from the overall length of the sequence alignment and the quality (% identity and similarity) of the overlapping amino acids. The size of the E-value is inversely related to the similarity of two proteins: a low E-value indicates a high degree of similarity between the query sequence and the matching sequence from the database, while a value close to 1 indicates that the proteins are not likely to be related in evolution or structure.
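The screening criteria above amount to a simple filter on database hits, sketched below. The record layout, function name, and example values are hypothetical and do not come from the Allergen Online output format; only the two thresholds (at least 35% identity, E-value cutoff of 1) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One query-versus-allergen alignment (hypothetical record layout)."""
    query: str
    allergen: str
    identity_pct: float
    e_value: float


def flag_potential_allergens(hits, min_identity=35.0, max_e_value=1.0):
    """Keep hits meeting both the identity threshold and the E-value cutoff."""
    return [
        h for h in hits
        if h.identity_pct >= min_identity and h.e_value <= max_e_value
    ]


# Invented example values, for illustration only.
example_hits = [
    Hit("query_A", "allergen_1", 52.0, 1e-30),  # kept: high identity, low E-value
    Hit("query_B", "allergen_2", 28.0, 1e-5),   # dropped: identity below 35%
    Hit("query_C", "allergen_3", 40.0, 3.0),    # dropped: E-value above the cutoff
    Hit("query_D", "allergen_4", 35.0, 0.5),    # kept: exactly at the identity threshold
]
flagged = flag_potential_allergens(example_hits)
```

A hit flagged in this way indicates only sequence similarity to a known allergen; it is a trigger for follow-up rather than evidence of allergenicity.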

Results

Statistical analysis #i—omics data integrated into current comparative analysis of GMO

The comparative analysis of GM vs. non-GM proteomic datasets showed 15 differentially expressed proteins (eight down-regulated and seven up-regulated in the GMO) in plot 1; four proteins (one up-regulated and three down-regulated) in plot 3; 70 proteins (17 up-regulated and 53 down-regulated) in plot 4; 14 proteins (seven up-regulated and seven down-regulated) in plot 5; four proteins (one up-regulated and three down-regulated) in plot 6; two proteins (one up-regulated and one down-regulated) in plot 7; and six proteins (five up-regulated and one down-regulated) in plot 9. Combined plot analysis showed only two differentially expressed proteins, both down-regulated in the GMO (Additional file 2).

Equivalence tests showed that the majority of the differentially expressed proteins fell within the 99% tolerance interval representing the equivalence limits established from reference varieties. However, two proteins (I1KG57; I1LI58) from plot 4 and four proteins (I1KXW8; A0A0R0HT35; A0A0R0EPX0; I1MFX5) from plot 5 from the GMO fell outside the equivalence limits (Table 1). In fact, these proteins were not detected in the reference varieties. According to the EFSA statistical guidance, these particular results are considered statistically different from the near-isogenic non-GM variety, as well as non-equivalent to the commercial reference varieties available in the market.

Table 1 List of statistically different proteins among the GMO and non-GM soybean varieties single-plot analyses which fall outside the reference varieties equivalence limits

Most surprisingly, ten proteins from the non-GM variety also fell outside the equivalence limits calculated from the RV tolerance interval (Table 1). Also, an RV tolerance interval could not be calculated in two cases (plot 4 analysis: I1MPE8; A0A0R0KAT4) in which the proteins were not detected in more than two biological replicates of the reference variety samples.

Metabolomics data analysis showed three metabolites (glycine; tyrosine; melibiose) with statistically significant differences between the GMO and non-GM from plot 3, and one metabolite (xylobiose) in plot 4. Glycine (log2FC = 0.24; p-adj = 0.008), melibiose (log2FC = 0.79; p-adj = 0.023), and xylobiose (log2FC = 0.58; p-adj = 0.023) showed higher concentrations in GM samples, while tyrosine (log2FC = −1.65; p-adj = 0.023) showed significantly lower amounts compared to non-GM samples. The analysis of combined data from all plots did not show any statistical differences between the two varieties. According to equivalence testing, all metabolites fell within the 99% tolerance interval calculated from the values observed in the reference varieties. This leads us to assume that, despite the significant differences between samples derived from GM and non-GM plants, these differential metabolites are equivalent to the range observed in commercial reference varieties and, therefore, the differences are not considered biologically relevant under the current EFSA guidelines for comparative risk assessment of food and feed from GM plants.

Statistical analysis #ii—omics data integrated into the molecular characterization and allergenicity assessment

Multiple co-inertia analysis (MCIA) was performed in order to explore the experimental quality and the main sources of variation, including environmental variation, in the proteomic and metabolomic datasets from all seven plots (1, 3, 4, 5, 6, 7, 9) and three genotypes (GMO, non-GM, and RV) simultaneously. MCIA has been recognized as an excellent tool for integrating the results of different omics techniques. It is an exploratory data analysis method that is able to provide a simple graphical representation that identifies the concordance between these multiple datasets [42].

First, we performed an MCIA with datasets from all experimental plots to evaluate how the variation across all data behaves. The coordinates of each plot for each treatment are connected by lines, the lengths of which indicate the divergence (the shorter the line, the higher the level of concordance) between the metabolite and protein abundance levels for a particular plot. On principal component 1 (PC1), there is a clear grouping of plots 1, 3, and 5, and another grouping of plots 4, 6, 7, and 9, accounting for 34.7% of the total variation in the datasets. On the other hand, PC2 shows separation of plots 1, 3, 7, and 9 from plots 4, 5, and 6, accounting for 24.6% of the total variation (Fig. 1). Such results are generally in accordance with visual observations of agronomic characteristics made during the field experiment, such as differences in the development of plants from the lowland plots (1, 3, 4) compared to the other plots, probably due to the substantial variation in environmental conditions, including sunlight incidence and soil moisture accumulation.

Fig. 1

MCIA projection plot. A MCIA projection plot representing the proteomics and metabolomics datasets from seven experimental plots: PC1 = 34.7% and PC2 = 24.6%. PC1 is represented by the first axis (horizontal), and PC2 is represented by the second axis (vertical). Different symbols represent the respective treatments and omics analyses and are connected by lines where the length is proportional to the divergence between the data from the same replicate. Lines are joined by a common point, representing the reference structure, which maximizes covariance derived from the MCIA synthetic analysis. Colors represent the different field plots. B Eigenvalue and percentage graphics show the amount of variation in the dataset corresponding to each PC

We also conducted MCIA for both omics datasets within each experimental plot to evaluate the convergence and divergence of proteomic and metabolomic data from the GMO, non-GM and RV varieties inside the plots. We found that four (plots 3, 4, 7, and 9) out of seven plots presented similar trends in the proteome and metabolic profiles, in which PC1 showed a clear separation between GM and non-GM plants, accounting for 49–67% of the total variation in the dataset. However, there was no pattern in the distribution of the RV group in the MCIA across all plots. Therefore, we ran MCIA with only the GM and non-GM groups for the same plots. Running MCIA without the RV samples resulted in more distinct clustering of the GM and non-GM groups, with two PCs accounting for more than 80% of the total variance (Additional file 3). Experimental plots 1, 5 and 6 did not show any clear pattern in the MCIA distribution of datasets. There was a high divergence between proteomic and metabolomic profiles, depicted by the length of the coordinates for each biological replicate, when compared to other plots. Such results might be attributed to the variation in environmental conditions found in the respective plots located in a specific area of the field experiment (i.e., differences in the micro-climate between plots, differences in forest shading, differences in fertility and water drainage due to slope difference).

To test our alternative statistical approach, we selected the experimental plot with the lowest environmental variation. MCIA of plot 4 showed that PC1 clearly separated proteomics and metabolomics data from GM and non-GM groups, which accounted for 70.58% of the total variation in the data (Fig. 2A). We found similar results for both omics data sets analyzed separately by PCA. Clear differences between GM and non-GM groups, as well as within-group clustering of biological replicates, were demonstrated by PCA for both omics data sets, with a total of 67.9% (metabolomics) and 69.4% (proteomics) of the variance accounted for in 2 PCs (Fig. 2B, C).

Fig. 2

Exploratory analysis of omics data from plot 4. A MCIA projection plot representing the proteomics and metabolomics datasets from experimental plot 4: PC1 = 70.58% and PC2 = 15.68%. Different symbols represent the different omics analyses, and colors represent the biological replicates. B PCA projection plot for metabolomics data of plot 4: PC1 = 41.2% and PC2 = 26.7%. C PCA projection plot for proteomics data of plot 4: PC1 = 51.8% and PC2 = 17.6%

Analysis of unintended changes in proteomic and metabolomic profiles

We conducted an overall comparative analysis of the proteome and metabolome profiles to search for unintended metabolic changes in plants. We first applied a comparative assessment using pairwise t-tests (p < 0.05) between the proteomic and metabolomic profiles of the GMO vs. non-GM varieties within plot 4 as an example. Among the total of 74 analyzed metabolites, only xylobiose was present at a significantly higher concentration (fold change = 1.5; Log2FC = 0.58; p-adj = 0.023) in the GMO compared to the non-GM variety samples (Fig. 3). Fifteen other chemicals showed variation in concentrations between the two varieties, but the differences were not statistically significant according to the t-test with FDR correction (Additional file 4). The volcano plot of between-group differences, based on fold change and statistical significance, shows all compounds analyzed (Fig. 3A, B).

Fig. 3

Metabolomics comparative analysis. A Volcano plot displaying the distribution of 74 analyzed features of GM vs. non-GM groups separated by magnitude (x-axis, log2-fold change) and statistical significance (y-axis, p-adj FDR threshold = 0.05) in signal intensity. The only significant metabolite xylobiose (up-regulated for GM) is highlighted. B Average concentration for xylobiose in the GMO and non-GM samples (Student’s t-test; *p < 0.05). The bar plots on the left show the original values (mean ± SD). The box and whisker plots on the right summarize the normalized values

Using an advanced analytical approach based on the TMT 11-plex technique, a total of 5718 proteins were detected in samples from plot 4. In the comparative analysis, a total of 78 statistically different proteins were found between the GMO and the non-GM variety (p-adj FDR < 0.05). The volcano plot of differentially expressed proteins displayed 33 (42.3%) proteins significantly up-regulated and 45 (57.7%) down-regulated in GM plants (Fig. 4). Table 2 shows the protein ID and name according to the UniProt database, as well as the functional annotation and the fold change of the proteins significantly altered in the GMO.

Fig. 4

Volcano plot of proteomics comparative analysis. The plot shows the distribution of 5718 analyzed proteins of GM vs. non-GM groups separated by magnitude (x-axis, log2-fold change) and statistical significance (y-axis, p-adj FDR threshold = 0.05) in signal intensity. 78 significant differentially expressed proteins being up (33) or down-regulated (45) in the GMO are highlighted

Table 2 List of differentially expressed proteins among the GMO and the non-GM soybean variety in statistical analysis #ii

Metabolic pathway and interaction network analysis

We performed a functional enrichment analysis in order to rank associations between differentially regulated metabolites and proteins representing metabolic networks and the respective statistical probability. The association of chemicals and proteins in the biological network provides hints to their metabolic functions. Also, this analysis allows us to identify the relevant results of potentially altered pathways in the genetically modified plant. By conducting a KEGG pathway enrichment analysis in the generated biological network, we found ribosome, spliceosome, and protein export pathways as the most enriched, followed by biosynthesis of secondary metabolites, carbon fixation in photosynthetic organisms, carbon metabolism, and biosynthesis of amino acids (Table 3).

Table 3 List of altered metabolic pathways in the GM soybean plants

Network analysis using String database revealed key modules likely playing a role in the metabolism of GM plants (Fig. 5). This interaction network was divided into three main functional modules which correspond to the significantly altered pathways. Module 1 includes the KEGG altered pathways of ribosome and protein export. This module interacts with Module 2, which includes six altered proteins in the spliceosome pathway, via protein–protein interaction between a hydrolase uncharacterized protein (GLYMA09G05810.1) and 40S ribosomal proteins S12 (GLYMA13G44690.1; GLYMA15G00610.1). Module 3 is related to protein processing in endoplasmic reticulum with the protein disulfide isomerase S-2 (GLYMA19G41690.1) and an uncharacterized protein with a putative function assigned to retrograde protein transport from endoplasmic reticulum to cytosol (GLYMA03G33990.1), among other proteins.

Fig. 5

Interaction network of proteins and metabolites statistically different in GM soybean plants. The visual network was built using Stitch and String databases. Three distinct functional modules are highlighted in black dotted circles. Protein names for accessions are present in Table 2. Stronger associations are represented by thicker lines. Protein–protein interactions are shown in grey, chemical–protein interactions in green and interactions between chemicals in red

Module 1 connects to Modules 2 and 3 by sharing strong protein interactions with the stromal 70 kDa heat shock-related protein (GLYMA16G00410.1), which is further connected to acetyl-CoA carboxylase (ACCase) enzymes that are involved in ATP and RNA binding in the spliceosome pathway. ACCase-A then connects to a serine/threonine-protein kinase srk2a isoform X1 (GLYMA08G20090.3), thus linking back Modules 1 and 3.

Untargeted allergenicity analysis

We searched for potential allergens among the statistically different proteins using the peer-reviewed allergen database AllergenOnline v.20 (http://www.allergenonline.org/), which is intended for the identification of proteins that may present a potential risk of allergenic cross-reactivity. Among the 78 proteins with statistically significantly different expression levels in the GMO vs. non-GM variety comparison (p-adj FDR < 0.05), 43 were identified as having allergenic potential (Table 4). These proteins show at least 35% identity with overlapping amino acid sequences of known allergens according to the database search algorithm. Three proteins are related to pollen allergens that have been identified in different plant species. Two identified proteins showed a significant statistical score and high similarity to the gliadin allergen protein, which is a component of gluten, present in wheat. The protein matches with the highest E-value scores were heat shock cognate 70 (I1MJU7), which showed 51% identity to the allergen identified in Aedes aegypti; the pre-pro-cucumisin allergen (A0A368UIA9), with 53% identity to the full sequence identified in Cucumis melo; and the actinidain allergen (A0A0R0GB30), with 52% sequence identity to the protein found in kiwi (Actinidia deliciosa). The results suggest that further experimental studies on immunoglobulin E (IgE) binding and clinical reactivity are warranted for some of the identified potential allergens or allergenic epitopes sharing identities lower than 50% and having E-scores larger than 1.00E-4. In addition, an assessment of the literature would then inform the selection of appropriately allergic study subjects.
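The filter "p-adj FDR < 0.05" used to select the differentially expressed proteins can be illustrated with a short sketch. It assumes the Benjamini-Hochberg procedure, which is a common way of controlling the false discovery rate; this section does not state which FDR method was applied, and the raw p-values below are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five proteins (not data from the study).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
print(list(zip(raw_p, adj_p, significant)))
```

Note that two of the invented proteins have raw p-values below 0.05 but are no longer significant after adjustment, which is the purpose of the correction when many proteins are tested at once.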

Table 4 List of differentially expressed proteins with allergenic potential in the GMO variety

Discussion

The biological relevance of substantial equivalence

According to EFSA, biological relevance is based on the following three aspects: (i) the outcomes of the difference test; (ii) the outcomes of the equivalence test, as well as (iii) expert judgement regarding the implications of the changes for food and feed safety of a particular GMO [43]. The difference and equivalence tests are the basis for the analysis of substantial equivalence which in reality is an endpoint-by-endpoint comparison of a limited set of components between the GMO, the non-GM near-isogenic counterpart and several reference varieties [33]. The list of such components is determined by species and derives from external sources, like the list of analytes outlined in the OECD consensus documents for soybean composition [44].

In general, the comparative analysis is a Student’s t-test to verify the null hypothesis, which is “no difference between the GMO and its conventional counterpart”, against the alternative hypothesis: “difference between the GMO and its conventional counterpart” [33]. In the case of a GMO safety assessment, measured changes are considered to be of no biological relevance if compositional data fall within the range observed (99% tolerance interval) in traditionally cultivated crops that are considered to have a history of safe use for consumption by humans and/or domesticated animals [43]. In practice, if all tested components are found within the interval of equivalence limits, the organism is determined to be substantially equivalent, with no threshold level for equivalence. In other words, GM soy could be determined to be equivalent to common bean or maize if they were included in the analysis (Fig. 6).

Fig. 6

Graphical representation of the steps of the comparative approach according to EFSA and decision-making framework for GMO risk assessment in the EU

On the other hand, when statistically significant differences are found, the applicant usually provides additional data in support of substantial equivalence. In the case of the data submitted by the applicant for Intacta soybean (MON 87701 × MON 89788), the comparative compositional analysis revealed 11 components (out of 53) with significant differences (p ≤ 0.05) between the GMO and the conventional control (Monsanto [45]). However, when data on soybean component levels from the published scientific literature and ILSI’s Crop Composition Database were included in the comparative evaluation (Table 18 in Monsanto [45]), the statistically significantly different data points fell inside the equivalence limits. EFSA statistical guidelines require the inclusion of reference varieties in field trials conducted with a fully randomized plot layout. External datasets should only be used when a strong justification can be given as to why the first option was impossible. However, as also observed for other applications, the applicant did not explain why these databases were used in addition to the 20 soybean reference varieties which were grown side-by-side in the field trials. Including those data enlarged the equivalence limits, leading to the product meeting substantial equivalence. In this case, field trial data for reference varieties were available, which would meet EFSA’s guidelines. The EFSA scientific opinion on Intacta soybean considered that “the information available for soybean MON 87701 × MON 89788 addressed the scientific issues indicated by the Guidance document of the EFSA GMO Panel” and that “the soybean MON 87701 × MON 89788 is as safe as its comparator with respect to potential effects on human and animal health or the environment in the context of its intended uses” [46].
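The effect of pooling external data on the equivalence limits can be illustrated numerically. The sketch below is a deliberate simplification: it uses a normal-approximation range (mean ± z·SD) as a stand-in for the 99% tolerance interval and does not reproduce the mixed-model procedure of the EFSA statistical guidance. All values, as well as the names `approx_limits` and `within`, are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def approx_limits(values, coverage=0.99):
    """Simplified limits: mean +/- z * sample SD (normal approximation).
    This is NOT the exact tolerance interval used in the EFSA guidance."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    centre, spread = mean(values), stdev(values)
    return centre - z * spread, centre + z * spread


def within(value, limits):
    low, high = limits
    return low <= value <= high


# Hypothetical levels of one component (arbitrary units).
field_refs = [36.0, 37.5, 38.0, 39.0, 38.5, 37.0]  # side-by-side reference varieties
external = [32.0, 34.0, 42.0, 44.0, 41.0, 33.0]    # values from an external database
gm_value = 41.5                                    # value measured in the GM variety

narrow = approx_limits(field_refs)
wide = approx_limits(field_refs + external)

print("field references only:", narrow, within(gm_value, narrow))
print("pooled with external: ", wide, within(gm_value, wide))
```

With these invented numbers, the GM value lies outside the limits derived from the field-grown references alone but inside the limits obtained after pooling, because the more heterogeneous external data inflate the standard deviation.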

In this case, EFSA assumes that the list of components tested in the trial is sufficient to establish equivalence when no differences are found, but does not consider it sufficient to attest non-equivalence when differences are observed. This unbalanced interpretation of the same set of components is a weakness of this comparative framework and lacks scientific justification (Bohn et al., 2014; Millstone et al., 2020).

When we assessed our proteomics data as indicated by EFSA’s Guidance document (statistical analysis #i), i.e., in the same way compositional data are presented in dossiers, we encountered two statistical outcomes for which we could not assess biological relevance and which are not addressed in EFSA’s statistical guideline (Fig. 1) [33]. In the first case, we found 10 proteins from the non-GM near-isogenic comparator which fell outside the equivalence limits. In the second case, the 99% tolerance interval for the equivalence limits could not be calculated for two proteins from the reference varieties, because they could not be detected in three or more replicates of the samples. In the assessment of Intacta soybean, EFSA took note of such a statistical outcome and wrote in its opinion, under chapter 4.1.3, that constituents at levels below the limit of quantification for more than 50% of samples were omitted from the analysis [46]. In summary, EFSA’s requirements for statistical analysis were met in our dataset analysis and we can only conclude that the GMO is not equivalent to the non-GM counterpart.

A multi-omics approach based on systems biology

Systems biology is generally understood as the study of biological systems “whose behavior cannot be reduced to the linear sum of their parts’ functions” [47]. In practice, it has been also described as a “computational modeling of molecular systems and the integrative interpretation of ever larger postgenomic datasets are accepted as useful, and perhaps even necessary, components of biological research” [48].

Our newly proposed approach described in ‘Statistical analysis 2’ follows a systems biology approach. The idea is to establish a holistic perspective of the genetically modified organism, in which the genetic modification is perceived as causing a perturbation of a system (i.e., the near-isogenic non-GM counterpart). The strategy is then to monitor the responses, integrate the data and perform a computational analysis, based on bioinformatics, to describe the modified system. This strategy is not new, and it has been routinely used to understand complex traits in all fields of biology, including the study of human diseases [49].

In this proposed implementation approach, omics datasets are used to investigate gene-by-gene interactions by network modeling; and even the flow of genetic information when multiplex omics are applied. In this way, many biological processes and metabolisms can be tested by the identification of the biochemical functions from a large network of molecular interactions, including interactions among molecules of the same type, for example, protein–protein interactions, or among molecules of different types, for example, protein–RNA, or protein–metabolite interactions [50].

The organism’s adaptation to changing conditions of the receiving environment depends on its capacity to change its molecular constitution, which can be achieved by modulation of the quantitative composition and the diversity of the cell’s molecular repertoire. “Molecular diversification is particularly pronounced on the proteome level, at which multiple proteoforms derived from the same gene can in turn combinatorially form different protein complexes, thus expanding the repertoire of functional modules in the cell” [50]. The understanding of the plant protein–protein interaction network and interactome provides crucial insights into the regulation of plant developmental, physiological, and pathological processes [51]. Thus, data extracted from biochemical networks are more informative than the analysis of each single molecule alone, as in the endpoint-by-endpoint analysis performed according to EFSA’s guidance.

There are several advantages in performing a systems biology approach as compared to the comparative assessment: (1) the untargeted and unsupervised analysis of molecules provides an additional chance of detecting unintended and unexpected changes, such as new toxins and allergens; (2) the compounds or molecules analyzed are those relevant for each GMO event, as opposed to a pre-determined list of compounds per species; (3) the list of altered compounds can be used for a network analysis based on their biological functions and their participation in certain metabolic pathways; (4) the range of molecules to be analyzed depends only on the state of the art of analytical and technological development and is not restricted to a pre-determined consensus list, hence allowing the assessment to keep pace with increasingly complex metabolic changes and technological progress. Finally, the identification of potential metabolic disturbances due to the genetic modification will inform the testing of dedicated risk hypotheses; for example, if stress-related metabolism is altered in the GM variety, then acute stress-response assays are recommended. The specific testing will be applicable to the GM event on a case-by-case basis, and such analyses will complement the animal feeding studies and the molecular characterization in the hazard identification step of a risk assessment [27]. In contrast, the current approach by EFSA requires submission of particular data sets to conclude on the “comparative safety” of a GMO; however, the interpretation of these data has not been guided by specific test hypotheses.

A relevant aspect of any new analytical and statistical approach for implementation is standardization. This article does not provide a full pathway towards standardization but rather outlines the first steps for future validation. It is important to highlight that several initiatives have already accomplished a great deal of work over the past two decades towards standardization. The HUPO Proteomics Standards Initiative (PSI; www.psidev.info) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification and a list of scientific publications with standards can be found in their webpage.

Intacta soybean with altered metabolism

In this study, we identified metabolic disturbances in major pathways: the ribosome, spliceosome, protein processing and protein export metabolism. Alteration in protein-related metabolism can be related to the heterologous expression of the stacked cassette. Whereas several carbon metabolism-related proteins were present in the enriched networks, these pathways were not statistically significant. However, any metabolic imbalance in the plant can be expected to also have an impact on the carbon metabolism. The current strategy for transgenic expression is based on strong constitutive promoters (e.g., the viral P35S), which can be problematic as transgenes are overexpressed at all developmental stages and in all tissues, leading to competition for energy and building blocks for the synthesis of proteins, RNA and metabolites that are required for plant growth (Singhal et al. 2015). In addition, genomic insertions and disruptions caused by transgenesis may lead to pleiotropic effects. These effects have to be investigated as to whether they are associated with risks (risk pathway). Whereas a growth penalty might only have an agronomic impact, increased sensitivity to stress as a pleiotropic effect can lead to the production of certain secondary metabolites, such as toxins or allergens, in the plant. In our allergenicity analysis, we identified 43 potentially allergenic proteins which should be further assessed. Our results contradict the data which Monsanto presented in 2009 when requesting that Intacta soybean be placed on the EU market, and which were assessed by EFSA in 2012. In their dossier, Monsanto researchers listed 11 compounds which were statistically different from the near-isogenic counterpart in their limited comparative analysis (53 compounds). However, the differences were considered to be within the equivalence limits of the reference varieties and further analysis of their functions was not performed (Monsanto 2009).

It is not yet clear how the metabolic disturbances identified in our study would affect the performance and the safety of Intacta soybean in the field, as additional analyses are necessary. However, there are a few studies showing unintended effects which seemed to be caused by changes in other plant traits and compounds rather than by the heterologous Bt protein per se. In 2014, Monsanto scientists published results that “should be viewed as an alert that S. eridania [Spodoptera eridania] populations may increase in Bt soybeans [Intacta soybean]” [52]. Their results showed that Intacta soybean reduced larval development by 2 days and increased adult male longevity by 3 days, which indicates that the effect of Intacta soybean MON 87701 × MON 89788 on S. eridania development and reproduction can be favorable to pest development. In addition, the effect of GM Bt maize and Bt proteins on non-target organisms (e.g., Neuroptera insects) has been extensively observed over the past two decades [53, 54]. There have also been reports of phytotoxicity in Intacta soybean in response to glyphosate applications which could not yet be explained [55, 56]. Thus, understanding the underlying effects of transgene expression and mechanisms on plant molecular biology, biochemistry and physiology is crucial for predicting the effects on plant fitness and altered substances which may lead to potential risks [57, 58].

Conclusion

Taken together, our results show that a science-based, risk-related approach based on omics techniques can be implemented for risk assessment of GMOs according to the EU legislation. We demonstrated that a systems biology approach based on a holistic perspective can be more informative in risk assessment than the currently employed endpoint-by-endpoint analysis for the assessment of potential unintended effects in a GM plant. We show that current tolerance and equivalence interval analyses based on data from reference varieties create quantitative noise with a high threshold level due to genotypic variability. In contrast, the approach proposed in this paper offers several advantages for the risk assessment procedure. Untargeted omics techniques allow for monitoring case-by-case responses. They also open the possibility of integrating large datasets by generating metabolic networks. The proposed analysis pipeline addresses the existing gap between animal feeding studies and molecular characterization in the hazard identification step of a risk assessment.

In this study, we provide a concrete case explaining how this analysis can be included in risk assessment, the outcome of the analysis and how to further investigate risk-related hypotheses. The proposed systems biology-based approach identified alterations in protein and energy related metabolism of the Intacta soybean variety, which is different from the conclusions based on the current EFSA risk assessment approach. Based on the results generated by the approach proposed in our study, we conclude that the comparative assessment according to the current EFSA guidance is not fit for purpose and needs to be improved.