1 Introduction

Organic crop production shows a yield and temporal stability gap (Knapp and van der Heijden 2018) compared to conventional agriculture. The yield advantage of conventional crop production relies on external inputs that fulfil crop nutrient requirements (mineral fertilisers) and alter the relationship between crops and the natural flora (herbicides), fauna (pesticides) and microbiota (e.g. fungicides). However, such buffering of environmental variation is increasingly challenging in conventional farming as well, especially as environments become more unpredictable under climate change. Furthermore, growing environmental concerns, the need to limit the use of mineral fertilisers due to shrinking margins, the challenges posed by pesticide and herbicide resistance (Lucas et al. 2015; Baucom 2019) and the withdrawal of many widely used pesticides are all examples of how advancements in organic crop science might be of general relevance.

In this work, we addressed the particularly challenging case of organic wheat in the UK. Wheat is the most important British arable crop in terms of acreage, covering 1.797 million ha in 2018. However, organic wheat acreage has declined steeply during the past decade and covered only 0.5% of the total wheat acreage in 2018 (DEFRA 2019). Low and unreliable yield and quality are often claimed to be causes of this decline, which recalls the long-known difficulty of organic and low-input wheat producers in the British Isles in achieving breadmaking quality (Gooding et al. 1993). Suboptimal performance of organic wheat in the UK can be interpreted as the result of a complex lock-in, similar to what Meynard et al. (2018) described for crop diversification in France. The constraints identified by these latter authors are recognisable in the case of British organic wheat, namely a vicious circle between suboptimal access to appropriate cultivars and logistical and coordination constraints along the supply chain. Supply chain inefficiencies inherent to small volumes of produce are exacerbated by the lack of quantitative evidence on crop performance that could help predict aggregate output, e.g. production or quality levels of home-grown grains, in each growing season and beyond. Such uncertainty hinders the negotiation of supply contracts with flour and feed processors, thus exacerbating the competition with imported grains in the commodity market, as well as constraining the development of small-scale, short supply chains.

In the absence of external inputs, or where external inputs are limited or less efficient, cultivar choice is the major crop-specific management decision organic farmers can make on a seasonal basis, considering that crop sequence and management strategies are based on longer-term decisions about rotation and cropping system. Wheat genotypes that perform well in organic systems are generally overlooked when cultivar selection is mostly performed in conventional conditions (Murphy et al. 2007). It is increasingly recognised that organic crops ideally require cultivars bred in organic systems, e.g. the ‘organic cultivars’ and ‘organic heterogeneous material’ recently recognised within the new EU Organic Regulation (2018/848/EU). However, the transition towards an ideal state where every organic farm has access to the most appropriate seeds for their system needs to be addressed. As a matter of fact, the general theory of agroecological transition and its three stages of ‘efficiency’, ‘substitution’ and ‘system redesign’ (Hill and MacRae 1996) overlaps with the three concepts identified by Wolfe et al. (2008) as ‘conventional breeding’ (i.e. reliance on cultivars bred for conventional agriculture), ‘breeding for organic’ (i.e. breeding cultivars in line with organically relevant trait architectures) and ‘organic breeding’ (i.e. direct selection in organic conditions).

The lack of access to appropriate cultivars is strongly interlinked with supply chain constraints both downstream (farm to end-product) and upstream, in the seed and plant breeding domain. Indeed, if organic wheat were to rely on dedicated cultivars, these would likely not be financially viable in the short term, given the low acreage. Therefore, establishing a mechanism able to identify an optimal subset of cultivars among those available in the short term could be pivotal to ensure better and more reliable crop performance. In the UK, there is no formal organic cultivar evaluation programme at present, nor would it be possible to replicate the complex (and expensive) trial architecture of the conventional recommended lists in an organic setting. To support cultivar choice, organic wheat producers rely on two sources of information: official recommended lists, which are based on conventionally managed trials, and a few independent organic plot-scale trials managed by breeders and/or seed merchants, which are often carried out in single locations.

An inclusive cultivar evaluation mechanism can generate the conditions to make a wide array of innovations relevant to organic systems financially viable, which in turn is likely to be of benefit in a world of reduced input options and buffering capacity. Cultivar evaluation can be a testing ground for the ‘system-based plant breeding’ approach formulated by Lammerts van Bueren et al. (2018) and, as such, be useful to harness the potential of conventionally bred modern cultivars as well as of historic cultivars, alongside expanding the scope of organic breeding (Lammerts van Bueren and Myers 2012) and of evolutionary breeding (Phillips and Wolfe 2005).

However, organic and low-input cropping systems present peculiar challenges to experimentation, including breeding and cultivar testing. First, the reduced access to, and use of, external inputs can intensify environmental variation across sites, making the results from centralised trials harder to generalise than those in conventional cropping systems (Wolfe et al. 2008). Second, the results generated from plot-scale experimentation are less predictive of field-scale performance in organic and low-input than in conventional cropping systems, mainly because of the former’s reliance on mechanical weed management (Kravchenko et al. 2017). Hence, cultivar evaluation could benefit from a greater emphasis on on-farm, field-scale experimentation in organic and low-input cropping systems.

Indeed, many organic farmers compensate for the lack of information by directly testing cultivars in their own fields, as an apparently simple case of farmer experimentation. However, according to the stages of farmer experimentation identified by Catalogna et al. (2018), whilst the ‘real-time management’ of the experiment is indeed simple, the stages of ‘design’ and of ‘result evaluation’ bear some inherent difficulties. The design is contained within individual farms, often with a limited number of cultivars and limited replication over space or time. The evaluation of results is either limited to applying findings from ‘the past growing season’ or, when evidence from multiple seasons is available regarding, e.g. a cultivar’s stability, the identified cultivar might well be at the end of its commercial life cycle and become unavailable. Such limitations can be addressed by overcoming the isolation of individual ‘experiments’, joining them into a ‘collective experiment’.

Several examples of decentralised participatory experiments on wheat are documented in literature, especially focusing on participatory plant breeding (Ceccarelli and Grando 2007), its appropriate decentralised experimental designs (Rivière et al. 2015), its relevance in harnessing genotype-by-environment interactions in organic and low-input agriculture (Kucek et al. 2019), and its potential to help rethink breeding goals and breeding efficiency (Ceccarelli 2015). Likewise, detailed studies run at field scale across networks of farms have shed light on the agronomic and environmental drivers of organic wheat performance, particularly yield (David et al. 2005) and grain protein content (Casagrande et al. 2009).

Here, we present the results of the first two years of a collective experiment on organic winter wheat, in which we attempted to integrate elements of cultivar evaluation and of an agronomic survey. The present work started in 2017, as part of the ‘LIVESEED: Boosting organic seed production in Europe’ H2020 EU Project, from a joint initiative between the Organic Research Centre and Organic Arable, a membership-based marketing group set up by organic farmers and active in grain marketing, seed supply and technical support. The main objective was to gather quantitative evidence on organic wheat performance in real-farm, field-scale conditions and to lay the foundations for, and inform, future, more detailed and focused surveys and experiments. We assessed crop performance from multiple angles, integrating grain yield and grain quality indicators with assessments of crop morphology, weed abundance and community composition during the crop cycle, and tested the following three hypotheses:

(i) Crop performance can be interpreted based on the climatic patterns of the tillering, stem extension and reproductive phases of the crop cycle.

(ii) Grain yield, quality and weed abundance are affected by cultivar choice.

(iii) Grain yield and quality, as well as weed abundance and community composition, are affected by the emerging fertility and post-emergence weed management strategies adopted by the farmers (Fig. 1).

Fig. 1 Cultivar choice and post-emergence weed management strategies are determinants of organic wheat performance. Taller cultivars can suppress weeds better than modern dwarf cultivars (a). Post-emergence weed management in English organic wheat mostly relies on either spring tine harrowing on crops sown in 10–15-cm distant rows (b) or on interrow hoeing on crops sown in 20–25-cm distant rows (c). (Pictures show crops presented in this paper. Photographs by Mark Lea (a, b)).

2 Materials and methods

2.1 Wheat cultivars

We tested a set of seven and nine cultivars in 2017/2018 and 2018/2019, respectively, with five in common, for a total of eleven cultivars (Table 1). The cultivars were identified using information from the official recommended lists (AHDB 2019); from experimental organic plot cultivar trials, as well as those conducted by seed companies operating within the UK; and from farmers’ experience. Overall criteria to identify the set of cultivars were to cover the range of end-use quality classes and to represent a diversity of phenotypes, inclusive of modern elite as well as historic cultivars. Eight of these cultivars are part of the UK National List. They represent all four end-use groups defined by the National Association of British and Irish Millers (NABIM, now ‘UK Flour Millers’), namely: (i) group 1, defined as ‘bread-making cultivars with consistent milling and baking performance’; (ii) group 2, defined as ‘cultivars with bread-making potential’ but with ‘variability in performance or some undesirable traits’; (iii) group 3, ‘soft cultivars used for biscuits, cakes etc […] lower in protein […] and extensible but not elastic gluten’; (iv) and group 4, ‘both hard and soft wheats used mainly for animal feed’. In addition, we included the German breadmaking cultivar Montana (group E according to the German classification), the historic British cultivar Maris Widgeon and the Yield-Quality Composite Cross Population (YQCCP), constituted by bulking and reproducing on organic farms 120 crosses among 20 European cultivars registered between 1934 and 2000 (Döring et al. 2015).

Table 1 Cultivars tested, breeder, parentage, year and country of registration, end-market class (including classification by the National Association of British and Irish Millers, NABIM), endosperm texture, indicative growth cycle length (SE, onset of stem extension when sown in early October, days +/– relative to average; R, days to ripening, days +/– relative to cv. JB Diego), resistance ratings for foliar diseases (1, minimum; 9, maximum; YR, yellow rust; BR, brown rust; S, septoria). Data from the Agriculture and Horticulture Development Board Recommended Lists 2018/2019 unless stated otherwise (, missing information). 1Data from the breeder, 2Personal observation, 3Detailed description of the parent cultivars in Jones et al. (2010), 4Details on the constitution in Döring et al. (2015)

2.2 Experimental locations, crop management and climatic patterns

Six farms participated in 2017/2018, with five additional farms joining the experiment in 2018/2019, and each farm grew a subset of cultivars as strips in a commercial winter wheat field. The experimental sites were mostly located in the West Midlands, East Midlands and East Anglia regions of England, UK, and spanned between 52° 55’ N and 51° 37’ N in latitude, and between 2° 39’ W and 1° 31’ E in longitude. Experimental fields covered a range of soil textural classes from clay loams to sandy loams, including a peculiar clay loam over a shallow layer of chalk locally known as ‘Cotswolds brash’. The soil textural class of each experimental field was reported by the farmers, visually validated during field visits and contextualised in its local area using the generalised soil characteristics obtained from the ‘Soilscapes’ dataset (Farewell et al. 2011).

Crop management was decided by the farmers, who were interviewed during the growing season to ascertain field conditions, rotational position and the main crop and weed management strategies and operations (Table 2). All crops were sown in the second half of October in both 2017/2018 and 2018/2019, in nearly all cases following a legume-based ley. Upon discussion with the participating farmers, the decision was made that in each farm a uniform post-emergence mechanical weed management would be applied to all cultivars, instead of adapting the intensity of the operation to the perceived status (e.g. different weed pressure, crop’s resistance to disturbance) of each cultivar strip.

Table 2 List of participating farms, soil type, preceding crop in rotation, fertilisation strategy, tillage strategy (FS, false seeding), sowing dates and sowing scheme and post-emergence weed management strategy, where ‘narrow rows’ indicates spring tine harrowing on wheat in 10- to 15-cm-wide rows and ‘wide-rows’ indicates interrow hoeing on wheat sown in 25-cm-wide rows.

Monthly maximum, minimum, mean temperatures and rainfall amounts were obtained from the UK Governmental Meteorological Office historic station data (MetOffice 2020), providing monthly records of temperatures, rainfall and sunshine hours. Estimates for individual experimental sites were either obtained from data from the closest weather station, where it was in a range of 50 km from the farm, or from the average, weighted by the distance from the farm, of data from the two closest weather stations in opposite directions, as reported in Table 2. Overall, the 2017/2018 growing season was characterised by a markedly hot and dry late spring and summer, whereas the 2018/2019 growing season was characterised by high late spring and summer rainfall (Fig. 2). Monthly estimates were divided into timeframes that overlapped with major development phases of the crop according to the BBCH growth scale (Lancashire et al. 1991), namely the ‘tillering’ phase (BBCH GS 21 to 30, November to March), the ‘stem extension’ phase (BBCH GS 31 to 49, April and May) and the ‘reproductive phase’ (BBCH GS 51 to harvest, June and July).
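The site-level estimation and the grouping of months into crop phases described above can be illustrated with a minimal sketch. It assumes that "weighted by the distance" means inverse-distance weighting, which the text does not state explicitly; the function names `site_estimate` and `phase_summary` are hypothetical, and the caller is assumed to supply the two stations lying in opposite directions from the farm.

```python
# Months assigned to each crop phase, as stated in the text (BBCH-based timeframes).
PHASES = {
    "tillering": ["Nov", "Dec", "Jan", "Feb", "Mar"],
    "stem_extension": ["Apr", "May"],
    "reproductive": ["Jun", "Jul"],
}


def site_estimate(stations, max_km=50.0):
    """Estimate a monthly value for a farm from nearby weather stations.

    stations: list of (distance_km, value) tuples for the closest stations.
    If the nearest station lies within max_km, its value is used directly;
    otherwise the two closest stations are averaged with inverse-distance
    weights (ASSUMPTION: the weighting scheme is not specified in the text).
    """
    ordered = sorted(stations)  # tuples sort by distance first
    nearest_km, nearest_value = ordered[0]
    if nearest_km <= max_km or len(ordered) == 1:
        return nearest_value
    (d1, v1), (d2, v2) = ordered[:2]
    w1, w2 = 1.0 / d1, 1.0 / d2
    return (w1 * v1 + w2 * v2) / (w1 + w2)


def phase_summary(monthly, how="sum"):
    """Aggregate a {month: value} dict into the three crop phases.

    how="sum" suits cumulative variables (rainfall, sunshine hours);
    how="mean" suits temperatures.
    """
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    summary = {}
    for phase, months in PHASES.items():
        values = [monthly[m] for m in months]
        total = sum(values)
        summary[phase] = total if how == "sum" else total / len(values)
    return summary


# Illustrative values only (not data from the study).
rain = {"Nov": 50, "Dec": 60, "Jan": 40, "Feb": 30, "Mar": 45,
        "Apr": 20, "May": 35, "Jun": 10, "Jul": 15}
print(phase_summary(rain))
print(site_estimate([(60.0, 40.0), (120.0, 70.0)]))
```

With inverse-distance weights, the closer of the two stations contributes more to the estimate; a different weighting would change the values for farms lying beyond 50 km from any station.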

Fig. 2 Average monthly maximum (∆) and minimum (□) temperatures (a) and average monthly rainfall (b) in the Midlands and East Anglia climatic regions during the 2017/2018 and 2018/2019 growing seasons. Dots (a) and bars (b) indicate monthly values for the respective season and climatic regions, whereas dotted lines represent the 1981–2020 monthly average values for the respective climatic regions (data: Met Office 2020).

2.3 Experimental design

We adopted an incomplete block experimental design, in which every farmer was allocated a subset of cultivars (Table 3) and provided with an adequate quantity of seed to grow in one of their commercial winter wheat fields in adjacent strips wide enough to be easily drilled, managed and harvested with farm machinery according to their farm management practices. In cultivar allocation to farms, two competing constraints were considered: (i) as many cultivars as possible should be grown on the same farm, to provide on-site pairwise comparisons; (ii) as few cultivars as possible should be grown on each farm to make the drilling and harvesting operation as easy as possible and to reduce the environmental and management-related error (e.g. time constraints on harvest in a possibly short harvest window, possible errors during drilling and harvest).
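The trade-off between on-site pairwise comparisons and small per-farm cultivar subsets can be checked for any candidate allocation by counting how often each pair of cultivars shares a farm. The sketch below is only an illustration of that check, not the procedure used in this study; the farm and cultivar labels are hypothetical, as are the function names.

```python
from collections import Counter
from itertools import combinations


def pair_concurrence(allocation):
    """Count, for every pair of cultivars, the number of farms growing both.

    allocation: dict mapping a farm label to the cultivars grown on it.
    Pairs are stored as alphabetically ordered tuples.
    """
    counts = Counter()
    for cultivars in allocation.values():
        for pair in combinations(sorted(set(cultivars)), 2):
            counts[pair] += 1
    return counts


def unconnected_pairs(allocation):
    """List cultivar pairs that never occur together on the same farm."""
    all_cultivars = sorted({c for cs in allocation.values() for c in cs})
    counts = pair_concurrence(allocation)
    return [p for p in combinations(all_cultivars, 2) if counts[p] == 0]


# Hypothetical allocation, for illustration only.
allocation = {
    "farm_1": ["A", "B", "C"],
    "farm_2": ["B", "C", "D"],
}
print(pair_concurrence(allocation))
print(unconnected_pairs(allocation))
```

Pairs that never co-occur can still be compared through the mixed model, but only indirectly via cultivars they share with other farms.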

Table 3 Experimental design adopted with allocation of tested cultivars to farms.

2.4 Sampling and assessments

Crop phenology was assessed using the BBCH growth scale. All farms were visited during the second half of June in both years, in correspondence with wheat anthesis (BBCH GS 61–69), to collect key performance indicators. For each cultivar in each farm, four to five random positions were selected. When cultivar strips were not homogeneous, a stratified random sampling approach was adopted to select the subsampling positions and to correct their averaging. In each position, a 2 m² sampling area was assessed for wheat canopy height (cm), wheat canopy cover (visual estimate), ear density over 2 linear metres and foliar disease severity. The main diseases identified were septoria (Septoria tritici), brown rust (Puccinia triticina) and yellow rust (Puccinia striiformis). However, due to overlapping symptoms at the time of assessment, only the total leaf area affected by diseases was retained for analysis. In addition, ground cover was visually estimated for each weed species, thereby obtaining the total weed abundance as the sum of each species’ cover and the relative abundance of the most prevalent species as the ratio between its cover and the total cover in each sampling area.
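The two weed indicators defined above reduce to a sum and a ratio per sampling area. A minimal sketch follows, with hypothetical species labels and a hypothetical function name `weed_summary`; it is not the authors' code.

```python
def weed_summary(species_cover):
    """Summarise weed cover for one sampling area.

    species_cover: dict mapping species name to its visually estimated
    ground cover (%). Returns the total weed abundance (sum of covers),
    the most prevalent species and its relative abundance (its cover
    divided by the total cover).
    """
    total = sum(species_cover.values())
    if total == 0:
        # No weeds recorded: relative abundance is undefined, reported as 0.
        return {"total_cover": 0.0, "dominant": None, "relative_abundance": 0.0}
    dominant = max(species_cover, key=species_cover.get)
    return {
        "total_cover": float(total),
        "dominant": dominant,
        "relative_abundance": species_cover[dominant] / total,
    }


# Illustrative values only (not data from the study).
print(weed_summary({"species_A": 10, "species_B": 5, "species_C": 5}))
```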

Yield was measured by farmers using their own machinery for combining the crop and weighing grains from each strip separately. As part of Organic Arable members’ common practice, grain samples were collected at harvest in sealed sample bags and sent for analysis at the Trinity Grain Ltd. laboratory (Overton Rd, Overton, Winchester SO21 3AN, URL: trinitygrain.com) where, after determination of the mass of weed seeds and inert matter in each sample, grain moisture, protein content, specific weight and Hagberg falling number (HFN) were measured by near-infrared spectroscopy. The amount of nitrogen harvested (hereinafter ‘N harvest’) was obtained as the product of grain protein content and grain dry matter yield, divided by a protein-to-nitrogen conversion factor of 6.25.
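As a worked illustration of the N harvest calculation, the sketch below assumes that yield is expressed in t/ha at a stated moisture content and that protein is expressed as a percentage of dry matter; the function name and the example values are hypothetical.

```python
def n_harvest_kg_ha(yield_t_ha, moisture_pct, protein_pct_dm, conversion=6.25):
    """Nitrogen removed in grain (kg N/ha).

    yield_t_ha:     grain yield in t/ha at the given moisture content
    moisture_pct:   grain moisture (%) at which yield is expressed
    protein_pct_dm: grain protein content (% of dry matter)
    conversion:     protein-to-nitrogen factor (6.25, as stated in the text)
    """
    dry_matter_kg_ha = yield_t_ha * 1000.0 * (1.0 - moisture_pct / 100.0)
    protein_kg_ha = dry_matter_kg_ha * protein_pct_dm / 100.0
    return protein_kg_ha / conversion


# Illustrative values only: 5 t/ha at 15% moisture and 10% protein
# correspond to 4250 kg/ha of dry matter and 425 kg/ha of protein.
print(round(n_harvest_kg_ha(5.0, 15.0, 10.0), 1))
```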

2.5 Statistical analysis

Statistical analysis was based on the use of linear mixed-effect models (Bates et al. 2015) and explored variation across (i) environments, (ii) cultivars and (iii) different management strategies.

2.5.1 Environmental differences

To explore crop performance in response to environments, we adopted a model assuming farm, year and farm-by-year interaction as fixed terms and cultivar as a random term, formulated as:

$$ {Y}_{ify}={\mu}_0+{\beta}_f+{\chi}_y+{\beta}_f:{\chi}_y+{a}_i+{e}_{ify} $$
(1)

where $Y_{ify}$ is the value of the response variable for cultivar $i$ in farm $f$ in year $y$, $\mu_0$ is the grand mean, $\beta_f$ is the effect of the $f$th farm, $\chi_y$ is the effect of the $y$th year, $\beta_f:\chi_y$ is the interaction between the $y$th year and the $f$th farm, $a_i$ is the random effect (random intercept) of the $i$th cultivar and $e_{ify}$ is the error. The models obtained by stepwise deletion of each fixed term from model 1, starting with $\beta_f:\chi_y$, were compared by likelihood ratio test. A fixed term was considered significant when its deletion generated a significant increase in the Akaike information criterion (AIC). The main purpose of this model was to generate estimated marginal means for each farm in each year, corrected for the bias due to different subsets of cultivars (random term), that could be compared with the respective estimated values of climatic variables described in Section 2.2. Estimated marginal means of crop variables were obtained from model 1 fitted by restricted maximum likelihood (REML), with t tests using the Satterthwaite method, and, alongside the corresponding climatic estimated values, were analysed by Pearson’s product-moment correlations and principal component analysis (PCA).
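The comparison of nested models can be illustrated with a small, self-contained sketch. It takes log-likelihoods and parameter counts as given (here hypothetical numbers, assumed to come from models fitted elsewhere) and computes the likelihood ratio statistic, its chi-square p value and the change in AIC. It is a sketch of the general technique, not of the R workflow used in this study.

```python
from math import erfc, exp, pi, sqrt


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2.0 * n_params - 2.0 * log_lik


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    total = erfc(sqrt(x / 2.0))
    term = sqrt(2.0 * x / pi) * exp(-x / 2.0)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def likelihood_ratio_test(loglik_full, k_full, loglik_reduced, k_reduced):
    """Compare a full model against the same model with one fixed term deleted."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    df = k_full - k_reduced
    return {
        "statistic": statistic,
        "df": df,
        "p_value": chi2_sf(statistic, df),
        # Positive values mean that deleting the term increased AIC.
        "delta_aic": aic(loglik_reduced, k_reduced) - aic(loglik_full, k_full),
    }


# Hypothetical log-likelihoods, for illustration only.
print(likelihood_ratio_test(-100.0, 5, -103.0, 4))
```

In this illustrative case, deleting the term lowers the log-likelihood by 3 units for one parameter saved, so AIC increases by 4 and the term would be retained.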

2.5.2 Cultivar differences

To investigate cultivar differences, we used a linear mixed model assuming cultivar as a fixed term and, as random terms, farm and year within farm. The model was formulated as follows:

$$ {Y}_{ify}={\mu}_0+{\alpha}_i+{b}_f+{b}_f\ {x}_y+{e}_{ify} $$
(2)

where $Y_{ify}$ is the value of the response variable for the $i$th cultivar in the $f$th farm in the $y$th year, $\mu_0$ is the grand mean, $\alpha_i$ is the effect of cultivar $i$, $b_f$ is the random effect of the $f$th farm (random intercept), $b_f\ x_y$ is the random effect of the $y$th year within the $f$th farm (random slope) and $e_{ify}$ is the error. Significance of the cultivar effect was assessed by comparing model 2, by likelihood ratio test, against a null model containing only the random terms. From the REML-fitted model, estimated marginal means of cultivars, related standard errors and p values of pairwise comparisons were calculated with Tukey adjustment and the Kenward-Roger method for degrees of freedom. Pearson’s product-moment correlations and PCA were run with the estimated marginal means thereby obtained. The five cultivars tested in both 2017/2018 and 2018/2019, which were tested in at least five different environments, were also subjected to a dynamic stability analysis (Finlay and Wilkinson 1963) for grain yield, grain protein content, N harvest and weed ground cover at anthesis. For each variable, each cultivar's values were regressed in a linear model against the estimated marginal means of the corresponding environments obtained through model 1. Regression slopes of each cultivar were compared against the mean regression line (slope = 1) and against one another by t tests.
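The dynamic stability analysis can be sketched as an ordinary least-squares regression of one cultivar's values on the environment means, followed by a t statistic for the departure of the slope from 1. The example below is a minimal illustration with made-up numbers and a hypothetical function name; it does not reproduce the study's estimates and leaves the p value to be read from a t distribution with the returned degrees of freedom.

```python
from math import sqrt


def fw_regression(env_means, cultivar_values):
    """Finlay-Wilkinson-style regression for a single cultivar.

    env_means:       environment means (one per farm-year)
    cultivar_values: the cultivar's values in the same environments
    Returns the slope, intercept, standard error of the slope, the t
    statistic for H0: slope = 1 and the residual degrees of freedom.
    """
    n = len(env_means)
    if n != len(cultivar_values) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(env_means) / n
    mean_y = sum(cultivar_values) / n
    sxx = sum((x - mean_x) ** 2 for x in env_means)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(env_means, cultivar_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(env_means, cultivar_values))
    df = n - 2
    se_slope = sqrt(rss / df / sxx)
    t_vs_one = (slope - 1.0) / se_slope if se_slope > 0 else float("nan")
    return {"slope": slope, "intercept": intercept,
            "se_slope": se_slope, "t_vs_one": t_vs_one, "df": df}


# Illustrative values only: a cultivar responding more than average
# to improving environments (slope above 1).
env = [2.0, 3.0, 4.0, 5.0, 6.0]
cultivar = [1.0, 3.2, 4.8, 7.1, 8.9]
print(fw_regression(env, cultivar))
```

A slope above 1 indicates a cultivar that responds more than average to better environments, whereas a slope below 1 indicates a cultivar whose performance changes less across environments.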

2.5.3 Effects of management

To analyse the effects of management, a further linear mixed-effect model was tested, assuming the random effect of year within farm and the random effect of cultivar, formulated as follows:

$$ {Y}_{ifym}={\mu}_0\kern0.5em +{\gamma}_{m1}+{\gamma}_{m2}+{b}_f+{b}_f\ {x}_y+{a}_i\kern0.5em +{e}_{ifym} $$
(3)

where $Y_{ifym}$ is the value of the response variable for the $i$th cultivar in the $f$th farm adopting the $m$th management strategy in the $y$th year, $\mu_0$ is the grand mean, $\gamma_{m1}$ and $\gamma_{m2}$ are the effects of the management strategies adopted, $b_f$ is the random effect of the $f$th farm (random intercept), $b_f\ x_y$ is the random effect of the $y$th year within the $f$th farm (random slope), $a_i$ is the random effect of the $i$th cultivar (random intercept) and $e_{ifym}$ is the error. Management strategies considered were the post-emergence weed control strategy (‘wide-’ or ‘narrow-rows’ sowing schemes) and whether or not organic fertiliser was applied to the field (Table 2). Significance of each fixed term and estimated marginal means were obtained as for models 1 and 2. Multiple regressions assuming the random structure of model 3 were fitted by REML. We considered grain yield, grain protein content and N harvest as response variables. We considered crop cover, weed cover, ear density and all possible interactions between them as explanatory variables. Stepwise deletion of fixed terms started from non-significant higher-order interactions and proceeded, within same-order interactions, with the least significant terms, until, upon comparing models through likelihood ratio test, a deletion caused a significant increase in AIC.

Homoscedasticity and normality were checked by visual inspection of residual Q-Q plots. Values were log or square root transformed when necessary. We used R version 3.6.1 “Action of the Toes” (R Core Team 2017) on the platform x86_64-w64-mingw32/x64 (64-bit). The packages ‘lme4’ and ‘lmerTest’ were used for mixed-effect models. The package ‘emmeans’ was used to calculate estimated marginal means. Graphs were obtained with the package ‘ggplot2’. PCA charts were obtained with the ‘factoextra’ package.

3 Results and discussion

3.1 Exploring relationships between environments and crop performance indicators

The trial went through two seasons with markedly different climatic patterns. Important variation across farms in each season was also observed in terms of both crop performance and climatic variables. Grain yield ranged between 1.42 and 6.45 t/ha in 2017/2018 and between 1.39 and 7.24 t/ha in 2018/2019. Grain protein content ranged between 7.6% and 12.0% in 2017/2018 and between 7.9% and 13.4% in 2018/2019. When considering cultivar as a random effect (model 1), grain yield was highly influenced by farm (p < 0.001) and year-by-farm interaction (p < 0.001), but not by year (p = 0.665). Unlike yield, grain protein content, besides being affected by farm and year-by-farm interaction (p < 0.001 in both cases), was significantly higher in 2018/2019 than in 2017/2018 (p = 0.008), with values of 10.42 ± 0.3% and 9.46 ± 0.36%, respectively. A similar pattern was found for grain N harvest, whose estimated marginal mean averaged over farms was higher in 2018/2019 (65.2 ± 2.5 kg/ha) than in 2017/2018 (50 ± 3.8 kg/ha) (p < 0.001). On the contrary, grain specific weight was not affected by year-by-farm interaction, and its estimated marginal mean averaged over farms was significantly higher in 2017/2018 (73.3 ± 1.3 kg/100 l) than in 2018/2019 (70.8 ± 1.1 kg/100 l) (p = 0.021). In addition, crop canopy was 11.7% taller, on average, in 2018/2019 than in 2017/2018 (p = 0.003), which was associated with occurrences of lodging in the tallest varieties.

The PCA in Fig. 3a clearly separates the two climatic years along the horizontal axis, with variation between farms within each season developing along the vertical axis. The most remarkable difference between the two seasons lies in the climatic pattern during the reproductive stages. The 2017/2018 growing season showed an uncharacteristic drought, with total June and July rainfall between 25 mm in CV_01 and 54 mm in TF_01 (i.e. between 79 and 55% lower than the 1980–2010 average for the Midlands region), mean temperatures in line with the 1980–2010 average but sunshine hours (between 471 hours in PE_01 and 556 hours in HR_01) up to 65% higher than the 1980–2010 average for the Midlands. Rainfall during June and July was much higher in 2018/2019, ranging from 112 mm in CB_01 to 177 mm in HR_01 (38.6% more than the Midlands’ 1980–2010 average), and including flooding events affecting PE_01 in early July. Despite a wet late spring and summer, the 2018/2019 growing season was characterised by lower rainfall than 2017/2018 during April and May (crop stem extension), with values ranging between 51 mm in IP_01 (49% lower than East Anglia’s 1980–2010 average) and 98 mm in HR_01 (17.3% lower than the Midlands’ 1980–2010 average). On the contrary, in 2017/2018, rainfall during stem extension ranged between 92 and 153 mm, i.e. up to 30.8% higher than the 1980–2010 average in the Midlands.

Fig. 3 Principal component analysis (a) of estimated marginal means by environment of grain yield (Yield), grain protein content (Protein), grain N harvest (N.harvest), grain specific weight (Sp.Weight), crop height at anthesis (Height) and climatic indicators: cumulative rainfall (‘rain’), mean temperature (‘temp’) and cumulative sunshine hours (‘sun’) during tillering (‘T’, November–March), stem extension (‘SE’, April–May) and reproductive stages (‘R’, June–July) (variables with a quality of representation lower than 50% were excluded from the PCA). Relationship between grain protein content and grain yield by environment (b): estimated marginal means and related standard errors by environment, obtained from a mixed model considering cultivar as a random term and, as fixed terms, farm, year and farm-by-year interaction. Trendline equation for 2017/2018: y = 5.98 + 2.338∙log(x) (R² = 0.559, p = 0.087). Trendline equation for 2018/2019: y = 7.996 + 1.682∙log(x) (R² = 0.388, p = 0.041).

A positive correlation between the environmental estimated marginal means of grain yield and grain protein content (r = 0.607, p = 0.01) was observed. Approximating this association through a regression of grain protein on the logarithm of grain yield (Fig. 3b), the positive relationship was significant in 2018/2019 (p = 0.041) and nearly significant in 2017/2018 (p = 0.087). This is apparently in contrast with the common finding of protein dilution with increasing yield, as in Casagrande et al. (2009). A reason behind this contradiction might be that the correlations we found were based on values offset against cultivar differences. In fact, a similar finding was also reported by Jones et al. (2010), who demonstrated a yield-protein trade-off within cultivars but a positive association between yield and protein across sites, owing to increases in N availability. That nitrogen availability at each site is the most likely cause for aligned patterns of grain yield and protein is also suggested by findings from Barraclough et al. (2010) and confirmed by strong positive correlations of grain yield and protein content with N harvest (r = 0.95, p < 0.001; and r = 0.79, p < 0.001, respectively). Increased grain specific weight in 2017/2018 is fully in line with findings from the 158-year-long data series of the Broadbalk wheat experiment held at Rothamsted (UK) as analysed by Atkinson et al. (2008): specific weight was positively associated with cumulative sunshine hours and negatively associated with the number of days with more than 2-mm rainfall during grain filling.

The spring and summer of 2018 have been the object of high scientific attention due to an evident weather anomaly across Europe. On this subject, Beillouin et al. (2020) support the relevance of rainfall and temperature data, which provided a more robust explanation than soil moisture, and their analysis showed an important yield decrease in the areas of Northern Europe affected by late spring and summer drought, which our data seem not to confirm, given that no significant yield differences were found between the two growing seasons. However, the same authors point out the compound effect of different climatic anomalies that can offset each other’s impact. In fact, the yield-depressive effect of drought during reproductive stages in 2017/2018 might have been counterbalanced by lower rainfall and lower temperatures during stem extension in 2018/2019. The hypothesis of rainfall as a driver of N mineralisation (Gooding et al. 2003) could explain that 2018/2019 might have had lower N availability earlier in the season, in stages when it is more linked with increases in grain yield than in grain protein (Fischer et al. 1993). As far as grain protein content is concerned, lower values recorded in 2017/2018 are in line with Casagrande et al. (2009), who observed a depressive effect of water deficit after anthesis on grain protein content in organic winter wheat in France. Higher rainfall in correspondence with wheat anthesis in 2018/2019 could explain higher N mineralisation at a stage where it mostly affects grain N concentration rather than grain yield (Gooding et al. 2003). On the other hand, the N accumulated by the grass-clover leys which preceded nearly all the wheat fields studied is subject to leaching (Olesen et al. 2009), which could have been exacerbated by higher-than-average rainfall during stem extension in 2017/2018.

According to long-term probabilistic climatic projections, extreme reductions in summer rainfall (–47% to +2% compared to current trends) would be expected by 2070 (MetOffice 2019). In fact, a growing season like 2017/2018 is a potential example of what future climatic patterns might look like, with an increased prevalence of heat and drought stress during key reproductive stages. In this light, collecting and capitalising on crop performance data can enrich forecasts of future climate scenarios with retrospective analysis of the impact of different climatic patterns. Climatic data and soil variables, particularly regarding soil mineral N availability and crop N status, would need to be captured at a better spatial resolution and finer temporal scale. As such, climatic indices can be both covariates to improve the accuracy of cultivar evaluation (Brown et al. 2020) and explanatory variables of crop performance in organic cropping systems.

3.2 Effect of cultivars on crop performance and weed abundance

Upon offsetting for the random effect of farm and year (model 2), a significant effect of cultivar was found for grain yield, grain protein content, N harvest, grain-specific weight and Hagberg’s falling number, as well as for most of the variables measured at anthesis, namely crop canopy height, foliar disease severity and weed cover (Table 4). No significant cultivar effect was found for either crop cover at anthesis or ear density.

Table 4 Estimated marginal means and standard errors by cultivar of weed cover after anthesis, canopy height after anthesis, severity of foliar diseases after anthesis, grain yield (t/ha at 15% moisture content), grain protein content (% dry matter), grain specific weight (kg/100 l) and Hagberg’s falling number. Values obtained by the linear mixed-effect model containing ‘cultivar’ as a fixed term, ‘farm’ as a random intercept and ‘year within farm’ as a random slope (model 2; see Methods). Means with the same letter are not significantly different at a 0.95 confidence level (Tukey adjustment and Kenward-Roger method for degrees of freedom). Cultivar effect: p value of likelihood ratio test comparison between the linear mixed-effect model containing cultivar as a fixed term and the null model.

The PCA shown in Fig. 4a highlights two main patterns of negative associations: (i) grain protein content was negatively associated with grain yield (r = –0.91, p < 0.001) and N harvest (r = –0.62, p = 0.042), and (ii) weed cover at anthesis was negatively associated with both crop height (r = –0.68, p = 0.021) and crop ground cover (r = –0.48, p = 0.139) at anthesis, although only the former association was significant. Unlike what was observed when comparing different environments, the negative correlations between grain protein and both grain yield and N harvest are consistent with a yield-protein trade-off, as a result of dilution of protein by carbohydrate in the grain (Acreche and Slafer 2009) that is likely more pronounced in organic than in conventional fields (Rakszegi et al. 2016). Approximating this relationship through a linear regression, in fact, 76.8% of the variation in grain protein content was explained by variation in grain yield (Figure 4b), a tighter relationship than the positive ones found across environments (Figure 3b).

Fig. 4

(a) Principal component analysis of estimated marginal means of cultivars’ grain yield (Yield), grain N harvest (N_harvest), grain-specific weight (Sp.Weight), crop ground cover at anthesis (GrndCv), crop height at anthesis (Height), grain protein content (Protein), severity of foliar diseases (Diseases) and weed cover at anthesis (WeedCv). (b) Relationship between estimated marginal means of grain protein content and grain yield by cultivar. Estimated marginal means and related standard errors obtained from linear mixed-effect model considering cultivar as a fixed term, farm as a random intercept and year within farm as a random slope. Trendline equation (b): y = 14.59 – 1.068x (R2 = 0.765, p < 0.001).

Figure 4b suggests the existence of a high-yielding cluster, representing the NABIM group 4 cultivars Crispin, Evolution and Revelation and the group 2 Siskin, and a high-protein cluster, representing the group 1 Zyatt, the German breadmaking Montana, the British historic Maris Widgeon and YQCCP. In fact, the pairwise comparisons shown in Table 4 highlight the significantly higher yield in cvs. Crispin, Siskin and Revelation compared to that in cvs. YQCCP, Montana and Skyfall, perfectly overlapping with an inverse significant difference in protein content, with the addition of cv. Evolution alongside the low-protein group (the yield advantage of cv. Evolution over cv. Montana was nearly significant, p = 0.054). It is worth noting that cv. Skyfall, one of the main breadmaking cultivars in the UK (and the control cultivar in official cultivar trials for group 1), showed a quasi-outlier positioning along the yield-protein regression line (Figure 4b), with the highest protein content but the lowest yields. Its foliar disease severity was significantly higher than that of all other cultivars, consistent with its low yellow rust rating in the UK recommended lists. These results suggest that this cultivar is poorly suited to organic conditions, despite its prominent positioning in the non-organic market. Casagrande et al. (2009) identified the end-use class of the cultivar (‘baking quality class’) as one of the main drivers of grain protein content. This is in line with the distribution of the tested cultivars along the yield-protein trendline in our study, although some cultivars, like Siskin, behaved more like feed-grade than breadmaking cultivars. In addition, cv. YQCCP, which had never been classified in terms of end-use, seems to align with breadmaking cultivars. Its low Hagberg falling number might be a limitation for industrial but not for artisanal breadmaking.

The dual yield-protein trend, negative across cultivars but positive across environments, can help farmers better target their cultivar selection for different end-markets. This could be critical information in the UK organic wheat context, characterised by widespread difficulty in achieving breadmaking specifications. Farms with greater N availability and therefore higher yield potential, which mostly target the feed market, also appear to have a higher protein potential and could therefore consider targeting milling quality at the cost of a small yield penalty. As an example, we recorded a remarkable 13.4% protein content on a YQCCP yielding 5.04 t/ha in farm GL_01. Conversely, farms showing low yield potential, which generally grow milling cultivars, might well consider increasing their yields with more productive feed cultivars, as one of the participating farmers (SN_01) did.

Although they are an approximation due to the lack of an indisputable causal relationship, linear regressions between grain yield and grain protein content are pivotal to identify grain protein deviations (GPD). Cultivars showing a positive GPD are those whose grain protein content is significantly higher than their yield level would predict according to the yield-protein regression, and they may provide a basis for the future development of wheat cultivars adapted to organic production (Monaghan et al. 2001). Although statistical analysis of GPD is beyond the scope of this work, yield-protein estimated marginal means of Maris Widgeon and YQCCP fell at the limits of or above the 95% confidence interval of the yield-protein trade-off trendline (Figure 4b), which can possibly be explained by their better competitiveness against weeds (lowest ranking weed cover). In fact, the pairwise comparisons shown in Table 4 show that cv. YQCCP, a potential positive GPD, had significantly lower weed cover at anthesis than that of cv. Spyder, a potential negative GPD. Nearly significant lower weed cover was also found for cv. Maris Widgeon compared to that of cv. Spyder (p = 0.097), as well as for cv. YQCCP compared to that of cvs. Zyatt (p = 0.078) and Evolution (p = 0.075).
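As an illustration of how a grain protein deviation can be read off the yield-protein regression, the sketch below computes the residual of a cultivar mean from the trendline reported in the caption of Fig. 4b (y = 14.59 – 1.068x). The function names and the two example cultivars are hypothetical and are not taken from Table 4. A formal GPD analysis would also test whether each residual differs significantly from zero, which is not attempted here.

```python
# Coefficients of the cultivar-level trendline given in the Fig. 4b caption.
INTERCEPT = 14.59  # grain protein (% dry matter) extrapolated to zero yield
SLOPE = -1.068     # change in grain protein (%) per t/ha of grain yield


def expected_protein(yield_t_ha):
    """Protein content predicted by the yield-protein trendline."""
    return INTERCEPT + SLOPE * yield_t_ha


def grain_protein_deviation(yield_t_ha, protein_pct):
    """Residual from the trendline: positive means more protein than the yield predicts."""
    return protein_pct - expected_protein(yield_t_ha)


# Hypothetical cultivar means (yield in t/ha, protein in %), for illustration only.
examples = {"cultivar_A": (4.0, 11.0), "cultivar_B": (5.0, 8.9)}

for name, (grain_yield, protein) in examples.items():
    deviation = grain_protein_deviation(grain_yield, protein)
    label = "positive" if deviation > 0 else "negative"
    print(f"{name}: deviation = {deviation:+.2f} % ({label})")
```

Under these assumed values, a cultivar lying above the line would be a candidate positive GPD and one below it a candidate negative GPD, pending a proper statistical test.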

Monaghan et al. (2001) suggested that GPD is subject to genetic control and linked to the ability to accumulate N after anthesis. In addition, Gooding et al. (2012), upon comparing near-isogenic lines varying for dwarfing alleles, showed that N use efficiency relies more on N uptake than N remobilisation after anthesis. However, this latter study also found that N use efficiency patterns differ under organic, when compared to non-organic, management, with greater importance of N accumulation before, rather than after, anthesis and with a confounding effect of N uptake by weeds, in turn negatively associated with crop height. A negative correlation between cultivar estimated marginal means of crop height at anthesis and weed cover at anthesis was recorded in our study (r = –0.68, p = 0.021), with cv. Maris Widgeon and cv. YQCCP showing significantly taller canopies than that of all other cultivars (Table 4).

Overall, the differences identified between cv. Maris Widgeon and cv. YQCCP and the other modern commercial cultivars revive the debate about the value of historic vs. modern cultivar performance in organic and low-input environments. Jones et al. (2010) demonstrated, upon comparing a set of 19 cultivars released over 64 years in organic and non-organic conditions in the UK, that the correlation between year of release and N harvest was strong and positive in non-organic sites, indicating a breeding progress, but non-significant in organic sites. Furthermore, when this latter work considered breadmaking cultivars only, no significant correlation was found in organic sites between cultivar age and grain yield either, whereas a significant breeding progress was found in non-organic sites. Similar conclusions were reached more recently in Central European environments by Herrera et al. (2020) who, in addition, observed that the genetic effect on wheat yield trends over 20 years was significantly positive in ‘conventional high-input’ and ‘conventional low-input’ management systems, but not significant in ‘organic management’. Longer straw in historic cultivars can be associated with increased root proliferation (Bai et al. 2013, Barraclough 2010), particularly in terms of seedling root growth (Wojciechowski et al. 2009). Cv. Maris Widgeon was also reported to yield more than modern cultivars when grown organically (Cosser et al. 1997). However, whilst the latter work attributed the better performance of Maris Widgeon to weed tolerance rather than to weed suppressive ability, the lower weed cover found for the tallest cultivars in our work suggests that weed suppressive ability does play a role. In a previous study, in fact, Maris Widgeon and Maris Hunters (another historic cultivar) showed the best competitiveness against weeds among four more cultivars (Korres and Froud-Williams 2002).

The incomplete block experimental design does not provide consistently co-occurring cultivars across all environments, which prevents a reliable static stability analysis. However, the environmental estimated marginal means obtained through model 1 do allow the definition of environmental gradients for a dynamic stability analysis. Analysing the five cultivars tested in both growing seasons, no significant deviations from the slope of the environmental means were found for either grain yield or grain protein content, which only confirmed the positioning of each cultivar in either the high-yielding (cvs. Crispin, Siskin, Evolution) or the high-protein (cvs. Montana and YQCCP) group (Figure 5a, b). A marginally significant (p = 0.068) deviation was found for N harvest, suggesting an increased performance of cv. Crispin with an increasing environmental mean (Fig. 5c), potentially calling back the work by Baresel et al. (2008) who showed that wheat cultivar performance in terms of N use efficiency in organic cropping systems depended on whether the cultivars were grown in N-limiting (‘extensive’) or in more favourable environments. A similar analysis for weed cover at anthesis showed more markedly significant results, with a slope flatter than the environmental mean for YQCCP (p = 0.009), suggesting a better ability to suppress weed abundance under higher weed burdens. Near-significant steeper slopes were found for cv. Evolution compared to the weed cover environmental mean (p = 0.089) and for cv. Evolution (p = 0.052) and cv. Siskin (p = 0.071) compared to cv. YQCCP (Fig. 5d). These latter findings, in line with Cosser et al. (1997), highlight that historic, longer-strawed cultivars, in this case represented by the historic parentage of YQCCP, can be especially relevant in environments with high weed pressure. Andrew et al. (2015) suggested that, in high weed pressure environments, the trade-offs between yield potential and competitive ability, mediated by straw length, are negligible to the extent that competitive ability could possibly be preferred over yield potential for cultivar selection.

Fig. 5

Dynamic stability of yield (a, R2 = 0.92), grain protein content (b, R2 = 0.64), nitrogen harvest (c, R2 = 0.97) and weed cover at anthesis (BBCH GS 65) (d, R2 = 0.95) of the five cultivars tested in both 2017/2018 and 2018/2019, showing the regression line of each cultivar’s response variable against the environmental estimated marginal means. The dotted line indicates a regression with slope = 1. Asterisks indicate a slope significantly different from 1 ((*) = p < 0.1; * = p < 0.05; ** = p < 0.01). The slopes of regression lines with different letters are significantly different from one another at a 0.9 confidence level.
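The slope comparison underlying a dynamic stability analysis can be sketched as follows. This is a generic ordinary least squares illustration in the spirit of a Finlay-Wilkinson regression, not necessarily the exact procedure used in this study (see Methods). The function name and the example weed cover values are hypothetical. The code regresses one cultivar's values on the environmental means and returns the t statistic for the null hypothesis that the slope equals 1.

```python
from math import sqrt


def slope_vs_unity(env_means, cultivar_values):
    """Regress a cultivar's values on environmental means (simple OLS).

    Returns (slope, intercept, t_stat, df), where t_stat tests H0: slope = 1
    and df = n - 2 are the residual degrees of freedom.
    """
    n = len(env_means)
    if n != len(cultivar_values) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(env_means) / n
    mean_y = sum(cultivar_values) / n
    sxx = sum((x - mean_x) ** 2 for x in env_means)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(env_means, cultivar_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(env_means, cultivar_values))
    df = n - 2
    se_slope = sqrt(rss / df / sxx)
    if se_slope == 0:
        raise ValueError("perfect fit: the t statistic is undefined")
    t_stat = (slope - 1.0) / se_slope
    return slope, intercept, t_stat, df


# Hypothetical weed cover (%) of one cultivar across five environments,
# paired with the environmental mean weed cover (%). Illustrative values only.
environment_means = [10.0, 20.0, 40.0, 60.0, 80.0]
cultivar_cover = [12.0, 17.0, 24.0, 33.0, 38.0]

slope, intercept, t_stat, df = slope_vs_unity(environment_means, cultivar_cover)
print(f"slope = {slope:.3f}, t = {t_stat:.2f} on {df} df")
```

A slope below 1 for weed cover, as in this assumed example, would indicate a cultivar whose weed cover rises more slowly than the environmental mean as weed pressure increases. Converting the t statistic into a p value requires the t distribution with the returned degrees of freedom.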

More detailed field-scale studies of N use efficiency, accounting for soil mineral N availability and crop N status, as well as deeper analysis of cultivar traits that could be drivers of weed suppressive ability (e.g. seedling vigour, canopy cover during tillering and stem extension), will be pivotal to complement and validate evidence available in literature. On the other hand, developing methods to incorporate, and harness the value of, farmers’ own knowledge and observations can be critical in expanding the scope of on-farm cultivar testing and developing participatory plant breeding programmes (Annicchiarico et al. 2019). In addition, decentralised field-scale cultivar evaluation can involve processors and end-users, such as millers and bakers, thus complementing standard grain quality indicators with a direct evaluation of product quality (Kucec et al. 2017).

In summary, as pointed out by Brown et al. (2020), future developments in cultivar testing programmes should rely on an improved capacity to synthesise data from different sources, from environmental and genetic data to farmers’ and consumers’ evaluations. For example, citizen-science approaches have been adopted in participatory cultivar selection programmes across networks of farms wide enough to detect and generalise differential varietal response to climatic patterns (Van Etten et al. 2019).

3.3 Effects of management strategies

Crop management was analysed according to two main categories: (i) strategy of post-emergence weed management, with five farms relying on interrow hoeing on wheat sown in wide rows and six farms relying on spring tine harrowing on wheat sown in narrow rows, and (ii) organic fertilisation, which was applied to nine of the studied crops (farm-year combinations), whereas the other eight relied only on residual nitrogen from prior legume-based leys. All possible co-occurrences of organic fertilisation and post-emergence weed management were represented by at least three farm-year combinations (Table 1).

The additive effect of organic fertilisation and weed management significantly affected grain N harvest and weed cover at anthesis. Both management factors, analysed separately to avoid a non-homoscedastic additive model, also affected grain protein content (Table 5). Higher grain protein content and grain N harvest were recorded on crops where organic fertilisation was applied than on crops where it was not, in line with findings by Shiwakoti et al. (2020) and Olesen et al. (2009), although these latter authors also found a significant yield increase with the use of manure in organic wheat. Pedersen et al. (2012) also found that farmyard manure is a significant source of N in organic wheat but its effect interacts with soil properties and past management. Considering that all but one of the wheat fields studied in our work were preceded by either a legume or a fodder ley containing a legume and that all fields were part of rotations containing a legume-based ley, better insights on fertility management and its impact on wheat performance should be looked for along entire rotation cycles.

Table 5 Estimated marginal means and standard errors of grain protein content (% dry matter); N harvest (kg/ha); weed cover (%) at anthesis; relative abundance (%) of the most abundant weed species by sowing scheme (wide rows, crop sown in 23- to 25-cm distant rows; post-emergence weed management by interrow hoeing; narrow rows, crop sown in 12- to 15-cm distant rows; post-emergence weed management by spring tine harrowing) and by application of organic fertilisation prior to the wheat crop. Values obtained by the linear mixed-effect models containing, as indicated in the footer, either ‘sowing scheme’ or ‘organic fertilisation’ (single models) or their additive effect (additive model) as fixed term(s); ‘farm’ and ‘cultivar’ as random intercepts; and ‘year within farm’ as a random slope (model 3; see Methods). ns.: p ≥ 0.1; (*) p < 0.1; * p < 0.05; ** p < 0.01; *** p < 0.001. Values with the same letter are not significantly different at a 0.05 significance level (Tukey adjustment and Kenward-Roger method for degrees of freedom)

The effect of weed management system did not generate significantly different means for either grain protein content or N harvest, although suggesting a trend of improved performance in the ‘wide rows’ system. On the other hand, post-emergence weed management did affect weed abundance, with a 28.4% higher weed cover at anthesis recorded in the ‘narrow-’ than that in the ‘wide-rows’ farms. This was accompanied, in turn, by a consistent effect on the relative abundance of the most abundant weed species that was 16.0% higher in the farms adopting the ‘wide-rows’ system (Table 5). The effect of organic fertilisation did not generate significantly different means for weed cover, although suggesting a trend of increased weed cover in farms applying organic fertiliser. This trend is worth noting: Olesen et al. (2009) documented a significant manure-led weed abundance increase which could partially offset the yield benefit of manure application. No effect of either weed management strategy or organic fertilisation was found on crop cover at anthesis, ear density, grain yield and grain N harvest.

Multiple regressions considering ear density, weed cover at anthesis, crop cover at anthesis and all possible interactions as potential predictors of grain yield, assuming farm, year within farm and cultivar as random terms (model 3, see Methods), showed weed cover at anthesis as the only significant predictor of grain yield (p = 0.021), which decreases with increasing weed cover (Figure 6a). Similar results were reported across French organic winter wheat fields by David et al. (2005), who, more specifically, associated weed abundance at anthesis with a detrimental effect on grains/m2. The same multiple regression did not highlight any significant explanatory variable for grain N harvest but showed a near-significant (p = 0.071) positive effect of weed cover on grain protein content (Fig. 6b). This latter finding is in line with what was observed by Casagrande et al. (2009) across a range of organic wheat fields in France, although these authors highlighted the difficulty of providing a definite interpretation of this trend. Previously, Mason and Madin (1996) had found either positive or negative associations between weed abundance and wheat protein content in different sites in Australia, attributing these complex relationships to interactions between crop-weed competition for water and for N during reproductive stages. We might add that our dual finding on the association between N harvest and grain protein content, positive across environmental means but negative across cultivar means, might have mediated this result in our study.

Fig. 6

Relationship between weed cover at anthesis and grain yield (a) and grain protein content (b), resulting from linear mixed-effect model multiple regression considering, as explanatory variables, weed cover at anthesis, crop cover at anthesis, ear density and all possible interactions thereof and, as random terms, farm, year-within-farm and cultivar (model 3, see Methods). The line shows the marginal effect of weed cover on the response variable, the shaded area shows the 95% confidence interval of the marginal effect, and the dots show the partial residuals. Trendline (a): y = 4.8953 – 0.0195∙x (p = 0.021). Trendline (b): y = 9.4891 + 0.0185∙x (p = 0.071).
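To give a sense of scale for the two trendlines in this caption, the sketch below evaluates them at a few weed cover values. The function names and the chosen weed cover levels are illustrative assumptions. The lines are marginal effects from a mixed model, and the protein trend was only near-significant, so the figures are indicative rather than predictive for any single field.

```python
def predicted_yield(weed_cover_pct):
    """Grain yield (t/ha) from trendline (a): y = 4.8953 - 0.0195 x."""
    return 4.8953 - 0.0195 * weed_cover_pct


def predicted_protein(weed_cover_pct):
    """Grain protein (%) from trendline (b): y = 9.4891 + 0.0185 x."""
    return 9.4891 + 0.0185 * weed_cover_pct


# Illustrative weed cover levels (% ground cover at anthesis), chosen arbitrarily.
for cover in (10, 30, 50):
    print(f"weed cover {cover:>2}%: "
          f"yield {predicted_yield(cover):.2f} t/ha, "
          f"protein {predicted_protein(cover):.2f}%")

# Change associated with a 10-percentage-point increase in weed cover.
yield_change = predicted_yield(10) - predicted_yield(0)
protein_change = predicted_protein(10) - predicted_protein(0)
print(f"per +10 points of weed cover: {yield_change:+.3f} t/ha, {protein_change:+.3f}% protein")
```

Read this way, each additional 10 percentage points of weed cover corresponds to roughly 0.2 t/ha less grain and a little under 0.2 percentage points more protein along the fitted lines.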

Weed abundance seems therefore to be the main driver of changes in grain yield, when considered independently of environmental or cultivar effects. A wide range of weed cover at wheat anthesis was observed, between a minimum of 4.0% in 2017/2018 and 3.3% in 2018/2019 and a maximum of 66.8% in 2017/2018 and 83.3% in 2018/2019. The experimental strips supported a richness ranging from 2 to 16 species. The most frequent dominant species were Avena fatua (17 obs.), Papaver rhoeas (9 obs.) and Sinapis arvensis (8 obs.). Interrow hoeing is a more intensive system overall, which entails higher operating costs but can be more effective than spring tine harrowing in reducing weed abundance (Kolb et al. 2012). Spring tine harrowing relies on precise climatic conditions and weed and crop growth stages, namely prior to the onset of wheat stem extension, to be effective. In contrast, interrow hoeing can also be carried out after the onset of stem extension, as the participating farmers did, hence minimising weed competition for N at a critical stage for determining grain N concentration (Fischer et al. 1993). This can explain the trend of increased grain protein content and grain N harvest in farms relying on interrow hoeing.

However, the more intensive interrow hoeing system also entails higher environmental disturbance. Higher disturbance can generate shifts in the weed community towards the prevalence of more aggressive weeds exhibiting ruderal and/or competitive traits. In many of the ‘wide-rows’ farms, we observed a high abundance of wild oats (Avena fatua) growing within the wheat rows that farmers control during late spring and summer with a surfing machine to minimise harvest contamination and prevent excessive weed seed return. In contrast, Armengot et al. (2013) demonstrated that spring tine harrowing effectively reduced weed competition but maintained weed community diversity in organic cereal fields in Catalonia. Exposure to weed-related yield losses can be higher under less diverse communities than under more equitable and diverse ones, as suggested by Adeux et al. (2019).

The limited size of the dataset and the incomplete block design did not allow at present a cultivar-by-management system interaction analysis, which needs to be a priority in future research. A review by Le Campion et al. (2020) encourages a more explicit accounting for differences in environments and management, as opposed to relying on the ‘organic vs. conventional’ dichotomy, when addressing organic plant breeding. These authors suggest that effective breeding for organic farming has been reported as relying either on direct selection in organic systems (like in Murphy et al. 2007) or on selection in target pedoclimatic areas, as in Annicchiarico et al. (2010). Based on our evidence of (i) weed abundance as a key determinant of grain yield, of (ii) differences in weed abundance and community composition between the two main weed management systems observed, and of (iii) cultivar differences in weed suppressive ability, also in terms of dynamic stability of the trait, we suggest that post-emergence weed management strategy could be a key descriptor to differentiate crop management systems to explore genotype-by-management interactions. In this respect, field-scale evaluation can provide a unique outlook on crop-weed interactions and other processes difficult to capture at a plot scale (Kravchenko et al. 2017).

4 Conclusion

This work described the field-scale performance of organic winter wheat cultivars across a network of organic farms, combining aspects of an agronomic survey and of a decentralised cultivar evaluation. Whilst the results related to yield and protein content mostly overlapped with cultivars’ end-use classification, differences emerged for weed suppressive ability, particularly when comparing historic and modern genotypes. The limited size of the dataset does not allow an overall analysis of genotype-environment-management interactions, as a multi-environment complete block design would. On the other hand, dynamic stability analysis, joined with the unique outlook on weed communities that field-scale surveys could achieve, highlighted differential cultivar performance along weed pressure gradients. These outcomes emphasise the need to account for crop-weed interactions, and to further investigate their relations with nutrient use efficiency, in cultivar testing and breeding.

We suggest that future research jointly address four priorities. First, datasets of multi-environment, field-scale crop performance should be integrated with climatic and environmental data with a better resolution than those used in this work. This could enable capitalising on data from past growing seasons to facilitate farm-focused decision-making on cultivars and management strategies based on genotype- and management-by-environment interactions. Second, characterisation and categorisation of crop and weed management strategies could be pivotal to better harness genotype-by-management interactions in cultivar choice and decentralised plant breeding. Third, direct farmers’ involvement in data collection during the growing season could be of extreme relevance to harness the value of farmers’ observations. Lastly, expanding the participation in such ‘collective experiments’ to processors and end-users would help to unravel the effects of climate, cultivars and management, and their interactions, on actual end-use quality, as opposed to standardised indicators, and, as such, trigger the creation of stronger links along the supply chain.