Introduction

Quantification of plant disease intensity (the amount of disease in a population; Nutter Jr et al. 1991) is required for many different purposes, including monitoring epidemics in experiments or surveys, understanding yield loss, comparing phenotypes for disease resistance, and evaluating effects of treatments (chemical, biological, agronomic, or environmental factors) on disease (James 1974; Kranz 1988; Cooke 2006; Madden et al. 2007; Bock et al. 2010a). In all of these applications, visual estimates of disease are used to draw conclusions and/or take actions, and thus they should be as accurate as possible given the available resources and purpose, where accuracy is operationally defined as the closeness of the visual estimate to the actual value (Nutter Jr et al. 1991). The term agreement can be considered synonymous with accuracy where actual values are concerned (Madden et al. 2007). During the research process, incorrect quantification could result in a type II error (failure to reject the null hypothesis when it is false) when comparing treatments in any experimental situation, which has ramifications for the conclusions drawn from those experiments. Decisions based on such conclusions could result in wasted resources, increased disease, yield loss, and ultimately reduced profit. Thus, accurate disease intensity estimates would appear to be vital.

Plant disease severity, the subject of this review, is currently defined as the “area of a sampling unit (plant surface) affected by disease expressed as a percentage or proportion of the total area” (Nutter Jr et al. 1991). It is worth considering this definition, as it is limited to diseases whose intensity can be measured or estimated as a proportion or percentage of the specimen area (for example, soybean rust [Godoy et al. 2006], pecan scab [Yadav et al. 2013], and rice brown spot [Schwanck and Del Ponte 2014]). The current definition is not applicable to diseases that manifest as a progression of symptoms that do not lend themselves to area estimation (for example, huanglongbing [Gottwald et al. 1989], zucchini yellow mosaic virus [Xu et al. 2004], and cassava mosaic disease [Houngue et al. 2019]) and for which severity is quantified and represented by other means. Perhaps it is time to revise the definition of plant disease severity to include diseases whose symptomatology does not lend itself to expression as a proportion of specimen area diseased. However, as noted, this review focuses on those diseases that can be assessed quantitatively as a proportion based on visible symptomatic area. Full definitions of all the terms used in this review and in phytopathometry can be found in the glossary of terms and concepts in this special issue. Other terms for disease intensity germane to this review include incidence (the proportion of diseased specimens) and prevalence (the proportion of diseased plots or fields in a defined area).

The term visual estimation refers to the eye sensing a stimulus (a diseased specimen, say), followed by perception of the sensation by our brains, which is in turn followed by a cognitive process based on our training, knowledge, and expertise to classify parts of the specimen as diseased (Fig. 1). Such elementary cognition is sufficient to determine incidence of disease, but more complex cognition is required if an estimate of severity based on the proportion of area diseased is to be made. That mental process of estimation may be achieved using various scales, or severity can instead be measured using sensor-based systems and image analysis. There are three commonly used scale types for visual estimates of plant disease severity, as defined by Stevens (1946): nominal scales (where the rater uses simple descriptors to indicate degrees of severity), ordinal scales of two types (where the rater uses either qualitative [descriptive] classes or quantitative classes [defined ranges of the percentage scale], both based on rank order; Chiang et al. 2020), and ratio scales (where the rater bases severity estimates on the proportion or percentage of the specimen area diseased). Quantitative ordinal scales are discussed in detail in another article in this issue (Chiang and Bock 2021). The argument could be made that a fourth common type of scale, the interval scale, is not used in measurement of plant disease severity because interval scales have no defined zero, and all disease severity by definition has a defined zero when the host is healthy. For this reason, we choose to recognize just the three aforementioned scale types, although we recognize that some may have valid reasons for considering additional scale types.

Fig. 1 The stages in plant disease severity estimation by visual raters and by image analysis via a sensor

Our understanding of the processes, methods, and factors affecting the accuracy of severity assessment has evolved as new research results have become available, and consequently approaches to improve accuracy have been developed. This review has two purposes: first, to briefly chart the history of plant disease severity estimation and the factors that affect those estimates (including sources of error), and second, to outline the approaches and tools available to maximize the accuracy of rater estimates. It synthesizes the history of visual disease severity estimation and corrals the tools currently available to maximize the accuracy of estimates made visually by different raters; from this synthesis we distill a list of best-operating practices that may be used to maximize the accuracy of visual disease severity estimation, and point out some avenues for future research.

A brief history of visual disease severity assessment

Various original research studies have described error in visual estimates of plant disease severity as well as novel tools and approaches (Supplementary Table S1 and Fig. 2 list and provide a timeline of some of the most significant). Over the decades, review articles (Table 1) have charted practices and developments in severity estimation.

Fig. 2 The history of phytopathometry, 1892 to the present. Significant events and articles are indicated

Table 1 Some review articles and book chapters (or components of book chapters) that have described the status of visual plant disease severity estimation

Interestingly, since Kranz published an article on rater error and scale design in 1970 (Kranz 1970), phytopathometry has fallen primarily under the purview of botanical epidemiology, perhaps due to its quantitative nature and because the subject has been treated in many reviews and books primarily by epidemiologists since 1970 (most references in Supplementary Table S1). Phytopathometry is indeed critical to epidemiology, but it is equally vital to other branches of plant pathology and to other disciplines where plant disease measurement is required (for example, horticulture, agronomy, ecology, and plant breeding). Accurate assessment is vital in many studies that cut across these scientific endeavors. Based on these needs, we contend that phytopathometry in its broadest sense, including visual and sensor-based assessments, should play a more prominent role in plant pathology than it has hitherto occupied, or at least been recognized for. Indeed, we believe phytopathometry warrants status as an independent branch of plant pathology, of importance and application in several other branches of the discipline, and in other disciplines.

The rise of phytopathometry

The history of phytopathometry as it relates to visual estimation of disease severity can be divided into two phases: a pre-1970 phase, when there was no basis for assessing the accuracy of severity estimates (Chester 1950; Large 1953, 1966), and the phase since 1970, during which quantitative approaches have been developed to understand error and to improve accuracy and reliability (“the extent to which the same measurements of individuals [e.g., diseased specimens] obtained under different conditions yield similar results”; Everitt 1998) of estimates of severity (for example, Kranz 1970; Forbes and Jeger 1987; Nutter Jr et al. 1993; Nutter Jr and Schultz 1995; Nita et al. 2003; Nutter Jr and Esker 2006; Godoy et al. 2006; Bardsley and Ngugi 2013; Bock et al. 2016b; Pereira et al. 2020). The pivot between these two phases was the recognition of a need for unified methods of assessment, particularly to quantify crop loss, including standardized methods allowing for reproducibility (“the extent to which two or more raters obtain the same estimates of disease severity of the same specimens”; Madden et al. 2007; also known as inter-rater reliability). That need was recognized by the United Nations Food and Agriculture Organization in the mid-1960s (Chiarappa 1970) and culminated in publication of a crop loss assessment manual (Chiarappa 1971).

Over approximately 130 years, since the first tool was developed as an aid to quantify severity (Cobb 1892), various approaches have been developed in an attempt to standardize and improve the accuracy of visual estimates of disease severity (Chester 1950; Large 1966; James 1974; Nutter Jr 1999, 2001; Nutter Jr and Esker 2006; Madden et al. 2007; Bock et al. 2010a, 2016a, 2021; Chiang et al. 2014; Del Ponte et al. 2017). The term “phytopathometry” was first suggested in the 1950s (Large 1953, 1966), when it was defined as equivalent to “plant disease measurement” or “disease assessment.” Perhaps defining phytopathometry with the narrower, former meaning is most appropriate, as disease assessment more typically refers to the physical process of measuring disease.

Although various disease assessment terms had been used and defined previously (Chester 1950; James 1974; Berger 1980), it was only in 1991 that the first comprehensive and authoritative list of definitions and concepts used in plant disease assessment was presented in the journal Plant Disease (Nutter Jr et al. 1991). Research on phytopathometry has since provided knowledge of sources of error and various methods for augmenting visual estimates of disease severity, which are now a basis for recommending practices to improve the accuracy of those estimates. During this evolution, new terms have been coined, new technologies and methods have been used, and definitions have been revised. Thus, an updated glossary of terms and concepts in phytopathometry has been developed (this issue, Bock et al. 2021).

The early years of quantification: scales, diagrams, and field keys

The first tool developed to standardize severity assessments, and which incorporated a standard area diagram (SAD) set, was the “Cobb scale” published by Nathan Cobb (Cobb 1892). This ordinal scale had five classes corresponding to 1, 5, 10, 20, and 50% severity (and was used for “classifying” rather than for interpolation to a nearest percent estimate). The Cobb scale was modified twice: first by Melchers and Parker (1922), who labelled a maximum severity of 37% as a 100% “infection level,” and second by Peterson et al. (1948), who included additional infection levels. Such ordinal scales, whether “qualitative ordinal scales” (descriptive of symptoms) or “quantitative ordinal scales” (each class representing a defined range of the percentage scale) (Chiang et al. 2020), with or without diagrams, proliferated during the following decades, often with diagrams designed to rate severity at the organ or plant level (Gassner 1915; Tehon and Stout 1930; Trumblower 1934; Ullstrup et al. 1945; Horsfall and Barratt 1945; Croxhall et al. 1952a, b). In contrast, field keys, which may combine characteristics of qualitative and quantitative ordinal scales, were developed to estimate disease severity in whole fields. An example is the widely used 9-class key developed to assess late blight of potato (Anon. 1947). Many of the SADs developed during this period were used to group estimates into an appropriate “class” or illustrated “degree of symptoms.” Nonetheless, the value of using the continuous percentage scale was well recognized even in the 1940s (Anon. 1948). In that article, the authors pointed out that nearest percent estimates have direct biological meaning and may be compared among seasons and raters, and that the percentage scale provides a single, uniform method for many different diseases (compared to a diversity of ordinal scales or diagram-based systems).

An early quantitative ordinal scale was that of Gassner (1915). Other linear and logarithmic scales and methods were developed to assess severity on individual plant organs and whole plants (see Chester 1950). The usefulness of these tools to quantify severity accurately was implicit, and although “reproducibility and reliability” (sic) were considered important (Marsh et al. 1937), they were not addressed statistically, nor were they defined. Indeed, it was during these nascent years of phytopathometry that pre-processing of ordinal data for analysis was considered important. McKinney (1923) proposed the “infection index” (a kind of disease severity index, or DSI), which essentially summarized the frequency of severity class ratings on an ordinal scale. The early history of the DSI is described by Chester (1950). Marsh et al. (1937) commented that the DSI reduced what may be non-linear data to a single expression that is continuous and amenable to statistical analysis: “…although the estimates are not necessarily in direct linear relation to the amount of fungus present…they are reducible to a linear function of this amount.”
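
For illustration, a McKinney-type infection index is usually computed as the sum of the products of class grade and class frequency, divided by the product of the total number of specimens and the highest possible grade, expressed as a percentage. The sketch below shows this calculation; the class grades and counts are hypothetical examples, not McKinney's data.

```python
# Minimal sketch of a McKinney-type disease severity index (DSI).
# The 0-5 class grades and specimen counts below are hypothetical examples.

def mckinney_dsi(counts_by_grade: dict, max_grade: int) -> float:
    """DSI (%) = sum(grade * frequency) / (total specimens * max grade) * 100."""
    total = sum(counts_by_grade.values())
    weighted = sum(grade * n for grade, n in counts_by_grade.items())
    return 100.0 * weighted / (total * max_grade)

# Example: 50 plants rated on a hypothetical 0-5 ordinal scale
counts = {0: 10, 1: 15, 2: 12, 3: 8, 4: 3, 5: 2}
print(round(mckinney_dsi(counts, max_grade=5), 1))  # -> 34.0
```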

Phytopathometry encounters psychophysics

Historically, a widely used scale for quantifying plant disease severity has been the Horsfall and Barratt (H-B) scale (Horsfall and Barratt 1945). It is a quantitative ordinal scale with 12 classes that divide the percentage scale into logarithmically increasing and decreasing class widths below and above 50%, respectively. The rationale for the scale design was rooted in psychophysics. According to the authors, the scale was structured to reflect the “Weber-Fechner law,” which in fact conflates two independent laws (Nutter Jr and Esker 2006): (1) there is a logarithmic relationship between the intensity of the stimulus (in this case, severity of disease) and the estimated value (Fechner's law, which is false), and (2) the change in a stimulus that is just noticeable is a constant ratio of the original stimulus (Weber's law, which holds true). Horsfall and Barratt also presumed that the eye perceives diseased tissue at severities <50% and healthy tissue at severities >50%, which has never been established. Redman et al. (1968) developed a set of tables based on a formula to convert multiple H-B ratings to estimated mean percentages, effectively taking the percentage midpoint values of the ranges for each class to facilitate determination of mean percent severity. The H-B scale and its basis in psychophysics were perhaps the most dominant paradigm in phytopathometry for many decades; the scale received praise as late as the 1980s (Hollis 1984) and remains a tool used in modern research, although its psychophysical basis and structure have been seriously questioned (Hebert 1982; Nutter Jr et al. 2006; Bock et al. 2010b). Contrary to the claim of a logarithmic relationship between estimates and actual severity, it has now been demonstrated on many occasions that the relationship between estimates of disease severity and actual severity is linear (Nita et al. 2003; Nutter Jr and Esker 2006; Bock et al. 2009b).
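
To make the midpoint conversion described above concrete, the sketch below maps H-B class ratings to the midpoints of the commonly cited class ranges. These are simple range midpoints and are illustrative only; the conversion tables published by Redman et al. (1968) differ slightly, and the class boundaries shown are the widely reproduced version of the scale.

```python
# Midpoint conversion of Horsfall-Barratt (H-B) class ratings to percent severity.
# Class ranges follow the commonly cited 12-class structure; midpoints are simple
# range midpoints (Redman et al.'s published conversions differ slightly).

HB_RANGES = {           # class: (lower %, upper %)
    1: (0, 0),    2: (0, 3),    3: (3, 6),    4: (6, 12),
    5: (12, 25),  6: (25, 50),  7: (50, 75),  8: (75, 88),
    9: (88, 94), 10: (94, 97), 11: (97, 100), 12: (100, 100),
}
HB_MIDPOINTS = {c: (lo + hi) / 2 for c, (lo, hi) in HB_RANGES.items()}

def hb_mean_severity(ratings: list) -> float:
    """Estimated mean percent severity from a list of H-B class ratings."""
    return sum(HB_MIDPOINTS[r] for r in ratings) / len(ratings)

# Hypothetical H-B ratings for ten leaves
print(round(hb_mean_severity([2, 3, 3, 4, 5, 5, 6, 6, 7, 9]), 1))  # -> 28.5
```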

A flourish of manually prepared black and white SADs

The major contributions of W. Clive James, a researcher from the Canada Department of Agriculture, to the field of phytopathometry began when he published an influential article in the Canadian Plant Disease Survey (James 1971). In the article, he presented and described the preparation and usage of what he defined as “assessment keys,” which were in fact SADs for cereal, forage, and field crops, representing 19 diseases. Each key was accompanied by detailed instructions for sampling and usage. To ensure that the percent affected area was accurate, a drum scanner coupled to a computer was used to measure drawings made on paper sheets. James recommended interpolation to the nearest percent estimate using the SADs. Another important contribution was that of Dixon and Doodson (1971), who also published disease-specific SAD sets, some recommended for interpolation and others to be used alongside ordinal scales for classification of severity. Given the extensive variation in the type and intensity of symptoms across diseases, the diagram sets in those two studies varied in purpose and number, from as few as three illustrations to six or more depicting different disease severities. James (1971) recognized the advantages of using the percentage scale, but also warned that, because only a few severities are shown in the SADs, the extent of interpolation was determined by the ability of the observer. Moreover, it is interesting that the rationale for choosing the number of diagrams and their values was convenience, rather than the laws of psychophysics invoked decades earlier and, surprisingly, in more recent SADs research (Del Ponte et al. 2017). During that time, no formal quantification of the accuracy of the estimates was performed; rather, it was implicitly presumed that the SADs, or scales with diagrams, improved accuracy.

Exploring and understanding error

It was also during the early 1970s that a quantitative understanding of the characteristics of error and accuracy of visual estimates of disease severity was established. Kranz (1970), Analytis (1973), and Amanat (1976, 1977) investigated rater error and disease scales, and determined that the standard deviations of multiple raters' estimates of the same model leaves were non-constant with severity, a relationship since demonstrated for several other diseases (Fig. 3). Standard deviations of unaided rater estimates tended to be greatest in the range of 18 to 62% severity. Kranz (1970) was the first to report the minimum and maximum estimates, range, and relative errors of unaided estimates, which increased up to 50% severity and then decreased toward the maximum severity of 100% (Fig. 3). The same pattern has been confirmed more recently for other diseases, as indicated in the figure. Analytis (1973) confirmed non-homogeneity of variance with severity in the apple scab pathosystem. Various transformations of severity data were suggested to account for the non-homogeneous variance and lack of normality of these data (Kranz 1970; Analytis 1973). Amanat (1976) was the first to show that training improved precision (the degree of variability: the greater the variability, the less precise the estimates, in these cases in relation to the actual values, which is an important point; Madden et al. 2007). Precision was measured as the scatter of the points in a regression analysis, and in early studies it was noted that the precision of severity estimates was low where symptoms comprised small lesions, and that raters tended to overestimate such symptoms (Amanat 1976). Koch and Hau (1980) showed that raters preferred certain severities (“knots”) when estimating severity, generally at 5 and 10% intervals at severities >10 to 20%, which has since been observed in other pathosystems (Bock et al. 2008a). Sherwood et al. (1983) and Hock et al. (1992) also showed that overestimation was greatest at low disease severities, that at the same severity a disease with smaller lesions will generally be overestimated to a greater degree, and that, overall, visual estimates by raters were not particularly precise, confirming previous reports. Error associated with estimation due to organ type and disease severity was explored further by Forbes and Jeger (1987).

Fig. 3 The means and ranges of unaided estimates of disease severity A of stylized disease on 25 model leaves by 200 raters (Kranz 1970), B of symptoms of citrus canker on 200 leaves by 28 raters (Bock et al. 2009a), and C of symptoms of soybean rust on 50 leaves by 37 raters (Franceschi et al. 2020). The standard deviations of the means are indicated in D, E, and F, respectively

Intra-rater reliability is the closeness of repeated estimates of severity of the same specimens by the same rater (also known as “repeatability”). Inter-rater reliability is the closeness of estimates of the same specimens by different raters (also known as “reproducibility”; Madden et al. 2007). Reliability does not embrace the concept of accuracy, as no actual values are involved. Statistical analyses of inter-rater and intra-rater reliability were made by Shokes et al. (1987) using analysis of variance and correlation analysis, respectively. Indeed, with regard to plant disease severity estimation, the test/retest method to gauge intra-rater reliability was first promulgated by Shokes et al. (1987), although it was based on correlation; Amanat (1976) had used the same test/retest concept, but in relation to learning capacity. Hau et al. (1989) summarized much of this early quantitative work to explore accuracy and provided further insights into the nature of the relationships between actual disease severity and rater estimates. They also questioned the logarithmic relationship espoused by Horsfall and Barratt (1945). Nutter Jr et al. (1993) used regression analysis to further establish and understand the concepts of accuracy (using image-analyzed acetate images), inter-rater reliability, and, using the test/retest method, intra-rater reliability in visual plant disease severity estimation compared to sensor-based methods. Others also explored accuracy and variability in rater estimates using various approaches (Beresford and Royle 1991; Newton and Hackett 1994). Several studies compared rater estimation of symptom components (Beresford and Royle 1991) and rater variability (Beresford and Royle 1991; Newton and Hackett 1994; Parker et al. 1995a, b).

Arrival of personal computers and programs to aid in assessment training

Research on visual assessment of severity was impacted in the mid-1980s when personal computers and programming languages became more accessible. Several computer programs were developed with the purpose of improving rater accuracy via training based on computer-generated images of specific, measured disease severity, so that the estimate could be compared to the actual value. AREAGRAM, described in a university report by Shane et al. (1985), was the first such program; it showed raters leaves of fixed (not randomly generated) severities. This was followed by other software with similar functionality, but allowing randomly generated series of diseased leaf images in a defined severity range, including DISTRAIN (Tomerlin and Howell 1988), DISEASE.PRO (Nutter Jr and Worawitlikit 1989), and ESTIMATE (Weber and Jorg 1991). Later, new software was developed for specific diseases, symptoms, and leaf types, for example, SEVERITY.PRO (Nutter Jr et al. 1998), and COMBRO (Canteri and Giglioti 1998) was developed specifically for the sugarcane rust and borer-rot complex. Research using these tools demonstrated statistically detectable improvement in the accuracy of estimates of disease severity after training (Newton and Hackett 1994; Nutter Jr and Schultz 1995; Parker et al. 1995b; Giglioti and Canteri 1998). A potential issue with computer training is that the benefits may be short-lived (Parker et al. 1995b), with raters requiring regular re-training.

The ability of computers to quickly generate digital drawings of diseased leaves, eliminating the need to hand draw, scan, and measure, and to process and analyze the data in real time, was a significant advance. Raters could also undertake in-house training at any time of the year. Interestingly, the development of these computer programs in the 1980s and 1990s was not immediately followed by more sophisticated computerized systems able to draw more realistic symptomatic digital leaves, despite advances in software engineering. Indeed, very few training programs have been developed since (Aubertot et al. 2004; Sachet et al. 2017).

The early psychophysical basis of severity perception challenged

Starting in the 1980s, the so-called Weber-Fechner law and the ideal of the H-B scale began to be challenged. Although Kranz (1970) presented results showing that error is not symmetrical (and that logits were a suitable transformation), estimates did not follow the so-called Weber-Fechner law, because the standard deviation of rater estimates was similar and greatest between approximately 18 and 52% severity compared to other severities (Fig. 3). Hebert (1982) was the first to question the presumed psychophysical basis of plant disease severity assessment. Forbes and Jeger (1987) provided the first valuable insights into a number of factors affecting estimation of severity on different plant structures, identifying rater, actual disease severity, and plant structure as factors affecting the accuracy of estimates, and unequivocally demonstrated that rater error did not comply with the assumptions of the Weber-Fechner law. These results were reinforced by other observations that estimation error was not greatest at 50%, as had been argued by Horsfall and Barratt (1945) (Hau et al. 1989).

Forbes and Korva (1994) showed that direct use of an H-B type scale did not necessarily resolve uneven variances of estimates, and that direct percent estimates were more accurate and precise. Nita et al. (2003) compared scale types using measured, actual values. In a study on Phomopsis leaf blight of strawberry comparing direct visual estimates to the H-B scale, the authors pioneered the use of Lin's concordance correlation coefficient for determining accuracy in phytopathometry. Accuracy can be considered a product of bias and precision, where bias is the difference between the estimated mean severity and the actual mean severity, and precision is as previously defined. Bias has two forms: it may be constant, with estimates being higher or lower on average by a constant amount compared to the actual values, or it may be systematic, with estimates higher (or lower) than the actual values by an amount that is proportional to the actual severity measured. Constant bias is also known as “fixed bias” or “location shift,” while systematic bias is also known as “proportional bias” or “scale shift.” Nita et al. (2003) demonstrated that use of the H-B scale did not result in greater accuracy or reliability than direct nearest percent estimates, and the results of the study further questioned the basis of the Weber-Fechner law. These and other observations were confirmed experimentally by Nutter Jr and Esker (2006), who used the concept of the “just noticeable difference” to demonstrate that the accuracy of raters was far greater in the mid-range (25 to 75%) than the structure of the H-B scale suggests, a significant argument against use of the H-B scale (or similar scales) where more accurate methods can be applied.
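
As a worked illustration of these accuracy components, Lin's concordance correlation coefficient can be decomposed into a precision term (the Pearson correlation) and a bias-correction factor that captures the constant (location) and systematic (scale) shifts described above. The sketch below uses the standard formula with hypothetical paired values for actual (e.g., image-analyzed) and visually estimated severity; it is a generic illustration, not a reproduction of any cited study's data or software.

```python
# Minimal sketch: Lin's concordance correlation coefficient (rho_c) and its
# decomposition into precision (Pearson r) and a bias correction factor (C_b).
# The paired data below are hypothetical, not from any cited study.
import numpy as np

def lins_ccc(actual, estimate):
    x, y = np.asarray(actual, float), np.asarray(estimate, float)
    mx, my = x.mean(), y.mean()
    sx2, sy2 = x.var(), y.var()          # population (biased) variances
    sxy = ((x - mx) * (y - my)).mean()   # covariance
    rho_c = 2 * sxy / (sx2 + sy2 + (mx - my) ** 2)
    r = sxy / np.sqrt(sx2 * sy2)         # precision
    v = np.sqrt(sx2 / sy2)               # scale shift (systematic bias)
    u = (mx - my) / (sx2 * sy2) ** 0.25  # location shift (constant bias)
    c_b = 2 / (v + 1 / v + u ** 2)       # bias correction; rho_c = r * C_b
    return rho_c, r, c_b

actual   = [2, 5, 10, 15, 25, 40, 60, 75]   # measured % severity (hypothetical)
estimate = [5, 9, 14, 22, 30, 42, 65, 80]   # a rater's visual estimates (hypothetical)
rho_c, r, c_b = lins_ccc(actual, estimate)
print(f"rho_c={rho_c:.3f}  precision r={r:.3f}  bias coefficient C_b={c_b:.3f}")
```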

Various simulation studies have since confirmed that the H-B scale lacks the same power for hypothesis testing compared to the percentage scale (Bock et al. 2010b; Chiang et al. 2014; Chiang et al. 2016a, b). Indeed, since Hebert (1982) first articulated his concerns, it is now generally accepted that a linear relationship exists between estimated severity and actual severity (Nutter Jr and Esker 2006; Bock et al. 2009b), although the relationship between the error of those estimates and the actual values remains to be fully established.

The importance of instruction and experience

Instruction is associated with training, but it is not the same as computer-based training; rather, it refers to written or oral descriptions of symptoms, how to delineate them, and how to implement the rating scales used for assessments. Only recently has research shown that detailed instruction in a pathosystem and in how to rate disease severity using the chosen methods is critical for accurate and reliable assessments (Bardsley and Ngugi 2013). Indeed, instruction on use of the rating scale is also critical, as error may result from misuse (Kranz 1988; Forbes and Korva 1994; Bock et al. 2013a, b). Studies demonstrating the importance of this basic procedure of instruction should be repeated with other pathosystems to confirm these results.

Over the last 10 years, several studies have demonstrated that raters' lack of experience can result in inaccuracy and unreliability (Bock et al. 2009b; Pedroso et al. 2011; Yadav et al. 2013; Lage et al. 2015). Experienced raters tend to estimate disease severity on specimens more accurately, although some novice raters may also be intrinsically accurate. The research has demonstrated that, as a group, experienced raters are more accurate, but experience does not guarantee more accurate estimates.

Establishment and evolution of SADs research

The pioneering work by W. Clive James was highly influential on subsequent SADs research. A selected list of SADs is presented in the chapter on disease monitoring in the Plant Disease Epidemiology book by Campbell and Madden (1990). That list shows 17 studies published by other authors from 1971 to 1988, averaging about one per year, whereas 105 articles were published between 1991 and 2017, approximately four per year (Del Ponte et al. 2017). A study conducted by Amorim et al. (1993) was a turning point and the first to use regression analysis to report a measure of accuracy, although the benefits of using SADs could not be assessed because there were no data on unaided estimates. The Amorim et al. study was used as a model in several articles that followed (Godoy et al. 1997; Michereff et al. 1998, 2000; Diaz et al. 2001; Leite and Amorim 2002). Nutter Jr and Litwiller (1998) were the first to show, in a conference abstract, that SADs improved rater estimates of disease severity. Michereff et al. (2000) formally published the first comparison of the accuracy of estimates made without and with SADs, for assessing citrus leprosis, via comparison of linear regression coefficients. A plethora of SADs followed (see review by Del Ponte et al. 2017), with analyses demonstrating statistically detectable improvements in accuracy and reliability due to using SADs. Many SADs from 1970 to 2010 were based on the “Weber-Fechner” law; as noted earlier, that law conflates two laws, of which only Weber's law holds true. Consequently, the Weber-Fechner law as a principle to guide SAD design has generally been abandoned as a stated basis for scale development in more recent years (Yadav et al. 2013; Lage et al. 2015; Araújo et al. 2019). Interestingly, it was not a stated basis for defining the increments and number of diagrams in the pioneering work of James (1971). The basis for SAD design should probably be a linear scale, but with additional diagrams at low severity (Bock et al. 2010a; Schwanck and Del Ponte 2014; Franceschi et al. 2020).

Two advances in the methodology for SAD validation were important for a more appropriate understanding of the benefits of SADs. The first was the shift from linear regression to Lin's concordance correlation coefficient as a measure of accuracy and of its two main components (precision and bias), recommended as more appropriate for the purpose (Nita et al. 2003; Madden et al. 2007); Spolti et al. (2011) were the first to apply it to the study of SADs. The second was the use of statistical approaches other than regression analysis to explore accuracy, including (ordered by first use) confidence intervals (Spolti et al. 2011), equivalence tests (Yadav et al. 2013), non-parametric tests (Schwanck and Del Ponte 2014), and generalized linear mixed models (Correia et al. 2017).
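
To give a flavor of one of these approaches, the sketch below runs a paired two one-sided test (TOST) for equivalence between estimates and actual values within a hypothetical margin of plus or minus 5 percentage points. It is a generic illustration of an equivalence test, not the specific procedure or margin used in any of the studies cited above, and the paired data are invented.

```python
# Generic illustration of an equivalence test (paired TOST) between visual
# estimates and measured actual severities, with a hypothetical equivalence
# margin of +/- 5 percentage points. All data are hypothetical.
import numpy as np
from scipy import stats

actual   = np.array([2, 4, 8, 12, 20, 35, 50, 70], dtype=float)
estimate = np.array([3, 6, 9, 15, 22, 38, 52, 73], dtype=float)
delta = 5.0                          # equivalence margin (assumed)

diffs = estimate - actual
# Two one-sided tests: mean difference > -delta AND mean difference < +delta
p_lower = stats.ttest_1samp(diffs, -delta, alternative="greater").pvalue
p_upper = stats.ttest_1samp(diffs,  delta, alternative="less").pvalue
p_tost = max(p_lower, p_upper)
print(f"TOST p-value: {p_tost:.4f}  (equivalent at alpha=0.05: {p_tost < 0.05})")
```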

Research on the topic has demonstrated that several aspects of SAD design and evaluation can affect accuracy (and reliability), including rater experience (Yadav et al. 2013), pathosystem (Godoy et al. 1997), the number of diagrams and the structure and/or color of the SADs (Schwanck and Del Ponte 2014; Bock et al. 2015; Franceschi et al. 2020), the procedures followed during SAD development and validation, and other factors (Melo et al. 2020; Pereira et al. 2020). Franceschi et al. (2020) demonstrated the substantial improvements that could be made with carefully designed SADs compared to older, more basic, previously developed SADs (Fig. 4): raters' estimates were significantly more accurate with the new SADs. Thus, there may be useful room for improving accuracy based on SAD characteristics. Research has also shown that the least accurate raters tend to benefit the most from using SADs, while raters who are already accurate remain about the same (Yadav et al. 2013; Bock et al. 2015).

Fig. 4 Standard area diagrams (SADs) to estimate severity of rust (Phakopsora pachyrhizi) on soybean (Glycine max) leaves. A The original SADs (Godoy et al. 2006). B The relationship between the illustrated SAD severity and diagram number for the original SADs. C The absolute errors of estimates when using the original SADs. D The newly developed and validated SADs (Franceschi et al. 2020), a tool for more accurate estimates of rust severity. E The relationship between the illustrated SAD severity and diagram number for the newly developed SADs. F The absolute errors of estimates when using the newly developed SADs. The numbers under each leaf represent the actual percentage of leaf area showing symptoms (necrosis and chlorosis)

The availability of inexpensive scanners and portable digital cameras in the early 2000s, together with the development of plant disease-specific image analysis software, facilitated the development of SADs (Del Ponte et al. 2017). Such image analysis software included APS Assess 2.0 (Lamari 2002) and QUANT (Vale et al. 2003). The development of empirical approaches to produce more realistic SADs, combined with the accessibility of image analysis for measuring the actual values of test images, made SADs a practical, easy-to-use, and less expensive training option than computer training programs, which may have contributed to the decline of computer training systems. Only a few examples exist linking SADs and training software, based either on an ordinal scale (Aubertot et al. 2004), using Didacte-PIC (training program Canker-didacte, online: https://www62.dijon.inrae.fr/didactepic/choix_nombre_et_mode.php), or on a percent scale (Sachet et al. 2017).

The intersection of portable devices and SADs was explored by Pethybridge and Nelson (2018). Their iPad app “Estimate” has SADs for assessing the severity of Cercospora leaf spot in red and yellow table beets and allows direct data entry using different ordinal (linear or logarithmic) or continuous scales. Of the ordinal scales, a higher-resolution linear scale was most accurate (Del Ponte et al. 2019).

Comparing scale types and characteristics and evaluating impact on decisions

An early study was that of O'Brien and van Bruggen (1992), which compared three quantitative ordinal scales for relating severity of corky root of lettuce to yield loss. The scales had 7, 10, and 12 (the H-B scale) classes. Although the actual values on which accuracy was based were merely “expert” visual estimates, the authors concluded that no scale was the most accurate and precise overall; performance depended on the specific severity range and lettuce growth stage. Two years later, Forbes and Korva (1994) were the first to demonstrate that direct use of H-B type scales did not necessarily resolve uneven variances of estimates, and that direct percent estimates were more accurate and precise (direct use of the scale resulted in a “linearization” of unequal scale class intervals). As noted, Nita et al. (2003) compared direct visual estimates to H-B scale converted values and demonstrated that the H-B scale was not more accurate or reliable than direct nearest percent estimates. Similar studies on citrus canker by Bock et al. (2009b) drew similar conclusions, and Bardsley and Ngugi (2013) demonstrated that direct estimation resulted in more accurate and reliable estimates than an ordinal scale when estimating the severity of foliar bacterial spot symptoms on peach and nectarine. Hartung and Piepho (2007) also showed that accuracy was greatest using the percentage scale (although they considered a 5% ordinal scale to be sufficient).

Some studies have compared the effect of different assessment methods on the outcome of treatment comparisons. Todd and Kommedahl (1994) compared the severity of symptoms of Fusarium stalk rot of corn caused by three different Fusarium species assessed either as a percentage by image analysis (considered objective) or visually using a 1 to 4 severity scale; means separation depended on the assessment method. Similarly, Parker et al. (1995a, b) found that objectively measured severity of barley powdery mildew (using image analysis) gave different outcomes after data analysis compared to visual estimates. Bock et al. (2015) also found that use of the H-B scale could result in different means separation among treatments compared to direct percentage estimates. In a study of QTLs for oat crown rust resistance genes, 64% of the phenotypic variation was accounted for using q-PCR to quantify the pathogen (which also most precisely mapped the gene), 52% using digital image analysis, but only 41% by visual assessment (Jackson et al. 2007). Although Poland and Nelson (2011) observed little difference in the QTLs identified for resistance to northern leaf blight of corn using either a 1 to 9 scale or direct percentage estimation, their results showed the direct percentage estimates to be more precise.

During the last decade, several simulation-based studies exploring the power of the hypothesis test have demonstrated the issues associated with using quantitative ordinal scales compared to a continuous ratio scale (Bock et al. 2010b; Chiang et al. 2014). With the former, type II errors are elevated, although increasing sample size can resolve most issues. Rater bias also has problematic effects that can be magnified by quantitative ordinal scales (Chiang et al. 2016a). Several of the studies described in this paragraph, and other work (for example, Chiang et al. 2014), indicate that the H-B scale (and similar scales) has drawbacks and can result in elevated type II errors. The research has also provided a basis for developing ordinal scales that approach the accuracy, and minimize the risk of type II error, associated with nearest percent estimates (Hartung and Piepho 2007; Chiang et al. 2014) (Table 2). Furthermore, the selection of scale type can affect resource use efficiency (Chiang et al. 2016b), with more replications required to achieve the same power in a hypothesis test (i.e., the same type II error rate) when using some ordinal scales. Finally, percentage-scale severity data estimated by very accurate raters almost always lead to rejection of the null hypothesis when it is false, but for accurate raters the use of H-B type scales is more detrimental to the probability of rejecting the null hypothesis than it is for inaccurate raters (Bock et al. 2010b).
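
The sketch below illustrates, under simplified and hypothetical assumptions (normally distributed rater error, a plain two-sample t-test, an arbitrary effect size and sample size), the kind of simulation used in such comparisons: the same underlying severities are analyzed as nearest-percent estimates and as H-B class midpoints, and the proportion of rejections (power) is compared. It is not a reproduction of the methods of any of the cited studies.

```python
# Hypothetical simulation sketch: power of a two-sample t-test when severity data
# are analyzed as nearest-percent estimates vs. Horsfall-Barratt class midpoints.
# Error model, means, sample size, and replicate count are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# H-B class boundaries and range midpoints; classes 1 (0%) and 12 (100%)
# are folded into the adjacent classes for simplicity.
HB_BOUNDS = [0, 3, 6, 12, 25, 50, 75, 88, 94, 97, 100]
HB_MIDS   = [1.5, 4.5, 9, 18.5, 37.5, 62.5, 81.5, 91, 95.5, 98.5]

def to_hb_midpoint(severity):
    """Map percent severities to the midpoint of the H-B class containing them."""
    idx = np.clip(np.searchsorted(HB_BOUNDS, severity, side="right") - 1, 0, 9)
    return np.array(HB_MIDS)[idx]

def power(mu_a, mu_b, sd=8.0, n=20, reps=2000, use_hb=False):
    rejections = 0
    for _ in range(reps):
        a = np.clip(rng.normal(mu_a, sd, n), 0, 100)   # treatment A estimates
        b = np.clip(rng.normal(mu_b, sd, n), 0, 100)   # treatment B estimates
        if use_hb:
            a, b = to_hb_midpoint(a), to_hb_midpoint(b)
        rejections += stats.ttest_ind(a, b).pvalue < 0.05
    return rejections / reps

print("percent scale:", power(30, 38))
print("H-B midpoints:", power(30, 38, use_hb=True))
```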

Table 2 The “Chiang” scale: an improved 16-class quantitative ordinal scale for general assessment of plant disease severity based on the scale developed by Chiang et al. (2014) (with modification by Bock et al. 2021)

The impact of using a DSI on accuracy and type II error when using a quantitative ordinal scale was investigated by Chiang et al. (2017a, b). The results showed that DSIs based on ranges of the percentage scale are prone to overestimation if the midpoint values of the rating classes are not considered. Rater bias can further detract from the accuracy of the DSI relative to the actual mean. However, Chiang et al. (2017b) found that using quantitative ordinal rating grades or the midpoint conversion of the severity ranges resulted in similar power of the hypothesis test. The authors concluded that the principal factor determining the power of the hypothesis test (the complement of the type II error rate) when using a DSI is the nature of the intervals in the quantitative ordinal scale; an amended 10% interval scale provided a type II error rate close to that of direct estimation of disease severity. Thus, steps can be taken to maximize the utility of the DSI when selecting the scale intervals on which it will be based. DSIs remain quite widely used (Hunter and Roberts 1978; Koitabashi 2005; Nsabiyera et al. 2012; Gafni et al. 2015).
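
To illustrate the overestimation issue noted above, the hypothetical sketch below computes, from the same ratings on a scale with unequal class widths, both a DSI based on the ordinal grades and a mean severity based on the class midpoints. The ratings and scale classes are illustrative only and are not taken from Chiang et al. (2017a, b).

```python
# Hypothetical illustration: a DSI computed from ordinal grades vs. the mean
# percent severity computed from class midpoints, for a scale with unequal
# class widths. Ratings and class structure are illustrative only.

# H-B type classes 2-7 (ranges 0-3, 3-6, 6-12, 12-25, 25-50, 50-75 %)
midpoint = {2: 1.5, 3: 4.5, 4: 9.0, 5: 18.5, 6: 37.5, 7: 62.5}
ratings = [2, 2, 3, 3, 4, 4, 5, 5, 6, 7]           # ten hypothetical specimens

max_grade = 12                                      # highest class of the full scale
dsi_from_grades = 100 * sum(ratings) / (len(ratings) * max_grade)
mean_from_midpoints = sum(midpoint[r] for r in ratings) / len(ratings)

print(f"DSI from grades:       {dsi_from_grades:.1f}%")   # ~34.2%
print(f"Mean % from midpoints: {mean_from_midpoints:.1f}%")  # ~16.7%
```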

The previous sections have outlined the history and many of the advances in phytopathometry since Nathan Cobb developed a cereal rust scale in the 1890s. But there remain many unanswered questions, and there are further avenues to explore that may provide a basis for added improvements in accuracy and reliability of visual estimates of plant disease severity.

The need for a baseline for accuracy

So, a couple of questions remain despite all this progress: what is an accurate visual estimate of disease severity, and how do we know when we are close enough to the actual value? Accuracy may in part depend on the needs of a specific study, so these are not easy questions to answer. Nonetheless, based on empirical results from rater studies over the last 10 years, raters with a Lin's concordance correlation coefficient (ρc) of approximately 0.90 or more may be considered accurate (Capucho et al. 2011; Spolti et al. 2011; Duarte et al. 2013; Rios et al. 2013; Yadav et al. 2013; Bardsley and Ngugi 2013; Domiciano et al. 2014; Schwanck and Del Ponte 2014; Lage et al. 2015; Araújo et al. 2019; Franceschi et al. 2020). Inevitably this threshold is somewhat arbitrary, and the cited studies show that it varies with the study and, quite likely, with the pathosystem and several other factors. But based on the studies that have been done, and the accuracies achieved with and without instruction, training, and SADs, it is a reasonable magnitude of ρc for a rater to be considered accurate on the spectrum of known rater capability. Rarely will a visual rater have a consistent ρc > 0.95. Most commonly, raters with training, instruction, and/or SADs will have a ρc of 0.85 to 0.95. The SADs, the individual rater, and other factors will contribute to imprecision, constant bias, and systematic bias. In a real disease assessment situation, accuracy will quite likely be a little lower. But this is far better than the capability of some raters with no experience or assessment aid, and possibly only rudimentary instruction, who may have a ρc of 0.60 or less. It should be noted that accuracy, too, may be in the eye of the beholder: Altman (1991) considers 0.90 to be accurate, whereas McBride (2005) does not consider “substantial” accuracy to be achieved until ρc ≥ 0.95 (anything less being only moderately accurate or poor).

Using regression analysis, permissible accuracy has been based on a range of percentages around the actual severity (Amanat 1977; Newton and Hackett 1994). In these studies, the ranges considered accurate at specific actual severities were 0.5 to 1.5% at 1% actual severity, 3.75 to 7.00% at 5%, 7.50 to 12.50% at 10%, and 25 to 35% at 30%. As observed by Newton and Hackett (1994), this gives an upper limiting regression line with an intercept of 0.28 and a slope of 1.2, and a lower limiting regression line with an intercept of −0.28 and a slope of 0.78.
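
For illustration, these limiting lines can be applied as a simple acceptance band around any given actual severity. The sketch below uses the intercepts and slopes reported by Newton and Hackett (1994) exactly as quoted above, with hypothetical paired values.

```python
# Check whether visual estimates fall within the accuracy band defined by the
# limiting regression lines quoted above (Newton and Hackett 1994):
#   lower limit = -0.28 + 0.78 * actual, upper limit = 0.28 + 1.20 * actual.
# The paired (actual, estimated) values below are hypothetical.

def within_limits(actual: float, estimate: float) -> bool:
    lower = -0.28 + 0.78 * actual
    upper = 0.28 + 1.20 * actual
    return lower <= estimate <= upper

pairs = [(1, 1.4), (5, 7.5), (10, 11.0), (30, 24.0)]   # (actual %, estimated %)
for a, e in pairs:
    print(a, e, "within limits" if within_limits(a, e) else "outside limits")
```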

As noted, many decisions are based on estimates or measurements of plant disease severity. Thus, for these decisions to be of greatest value, they must be based on data that are true to the actual severities, i.e., the data must be accurate. The work done to date has explored many facets that affect accuracy and reliability, and plant pathologists have developed an understanding of sources of error and have addressed some of the shortcomings by implementing improved tools and approaches to estimate plant disease severity. The main sources of error are briefly considered in the next section.

Factors affecting accuracy

Scale type: Several studies have demonstrated that assessment method can affect accuracy and the outcome of an analysis (Todd and Kommedahl 1994; Parker et al. 1995a, b; Nita et al. 2003; Bock et al. 2015; Bock et al. 2009a; Bock et al. 2010b; Jackson et al. 2007; Poland and Nelson 2011; Chiang et al. 2014; Chiang et al. 2016a, b).

Raters: Probably the single biggest source of error and variability in assessment. Raters have been demonstrated to be inherently variable (Hau et al. 1989; Nutter Jr et al. 1993; Bock et al. 2009b). The majority of raters tend to overestimate disease severity, especially at low severity; a few may underestimate, but this is less common.

Rater preferences for particular severities: Raters tend to prefer certain severity values, generally at 5 and 10% intervals, particularly at severities >20% (Koch and Hau 1980; Bock et al. 2008b, 2009b; Schwanck and Del Ponte 2014).

Lack of experience, training, and instruction: Over the last 10 years, several studies have demonstrated that lack of experience can result in inaccuracy and unreliability (Bock et al. 2009b; Pedroso et al. 2011; Yadav et al. 2013; Lage et al. 2015). Training improves accuracy of estimates (Parker et al. 1995a; Nutter Jr and Schultz 1995; Bardsley and Ngugi 2013). Similarly, instruction in the pathosystem and/or rating methods can result in more accurate and reliable estimates (Bardsley and Ngugi 2013).

Symptoms: The characteristics of the symptoms can influence rater accuracy. Severity characterized by numerous small lesions tends to be more seriously overestimated than that of diseases with fewer, larger lesions (Sherwood et al. 1983; Forbes and Jeger 1987). Also, the tendency to overestimate is relatively greatest at severities <20% (Sherwood et al. 1983; Bock et al. 2008b). Whether lesions are regularly or irregularly distributed may also affect error (Hock et al. 1992).

Plant structure: The organ (plant part) or whole plant being assessed can influence the accuracy of estimation (Amanat 1977; Forbes and Jeger 1987; Nita et al. 2003). Roots are especially challenging for accurate severity estimation (Forbes and Jeger 1987).

Time: The speed at which rating is performed may affect accuracy, although few studies have examined this. Faster raters tended to have less precise estimates of severity (Parker et al. 1995b), and by extension these estimates would individually be less accurate.

Other causes: Color blindness has been reported to be detrimental to estimation of disease severity (Nilsson 1995).

There may also be interactions among the various factors listed here. A more in-depth discussion of sources of error affecting disease severity estimation is provided by Bock et al. (2010a). Other factors not yet studied may also play a role in rater error. The sources of error in plant disease assessment and the tools, methods, and approaches to increase accuracy are charted in Fig. 5.

Fig. 5 Sources of error that affect rater accuracy of individual specimen disease severity estimates during the assessment process, and approaches and tools to increase accuracy

Can visual estimates be more accurate? A primer on best practices

The advantages of the percentage ratio scale for estimating those diseases amenable to such estimation were articulated in an article in the Transactions of the British Mycological Society (Anon. 1948). The authors: (i) encouraged the use of pictorial diagrams of known severities to guide estimates more accurately, (ii) noted that percentage estimates have direct biological meaning, and (iii) stated that the percentage scale provides a single, unifying method for estimating severity for all those diseases where area is an appropriate measure of severity. As noted, the data also lend themselves to direct parametric analysis. The percentage scale is also bounded by 0 and 100%, can be subdivided, is universally known, and is applicable to measures of incidence as well as severity (James 1971). Large (1955) stated that wherever possible they strove to assess severity using percentages because it provided the “percentage of the total green area of the plant rendered inoperative by reason of the disease at the time of observation,” rather than arbitrary or subjective ordinal grading systems based on the psychology of perception, and because its objectivity allowed comparisons.

There may be reasons for selecting any one of the scale types used in plant pathology for a specific disease assessment purpose, but the user should remember that objectivity and statistically usable information content are least with the nominal scale and increase progressively through the ordinal and ratio scale types. Of course, many diseases must be assessed using a qualitative ordinal scale, but those are not considered in this review. There are various criteria to consider, and a sequence for approaching severity assessment can be followed that will help contribute to the accuracy of rater estimates and at the same time minimize the risk of type II errors. Best-operating practices (summarized in Table 3) for maximizing accuracy in a disease severity assessment activity include the following:

Table 3 Best-operating procedures for maximizing the accuracy of plant disease severity estimates (adapted from Bock et al. 2021)

First, select the most appropriate scale for the pathosystem involved, the requirements of the experiment, and the available resources. In some cases, the pathosystem may dictate the scale to be used: many systemic diseases with relatively amorphous symptoms are more readily scored using a qualitative ordinal scale, whereas pathosystems where symptoms are easily defined and quantified on an organ or whole plant lend themselves to rating with a quantitative ordinal scale or a ratio scale (the percentage scale). The percentage scale may be preferable because it provides greater accuracy of individual estimates and allows parametric statistics to be used directly with no loss of accuracy or precision (taking midpoints of quantitative ordinal scale estimates is less accurate and precise than direct estimates). Furthermore, with a quantitative ordinal scale, the rater must first learn the characteristics of the scale.

Second, provide raters with detailed instruction on (i) the pathosystem, (ii) the rating scale being used, and (iii) common sources of error in rating. These instructions should include a description of the disease symptoms and the stages they may go through, any fungal structures relevant to assessment, and where to place the boundary between healthy and diseased tissue. Other diseases or conditions that could be a source of misidentification, confusion, and error should also be described. Explicit instruction should be provided even for the percentage ratio scale (for example, if using SADs, raters should understand that the SADs are a guide for interpolating their best estimate, not a tool for classifying the specimen as the value represented by a SAD or a preferred value, which has happened in some studies [Parker et al. 1995a, b; Melo et al. 2020]). Raters should also be instructed on common sources of error, including the common tendency to overestimate (especially at low severity) and the tendency to rate in “knots” (specific values at 5 and 10% intervals). The importance of instruction has been demonstrated (Bardsley and Ngugi 2013).

Third, raters should be tested and trained for two reasons: (i) to ascertain their native ability and (ii) to ascertain whether they can improve with experience, training, and/or the use of SADs. A rater who is consistently very inaccurate should probably be replaced. Most raters respond favorably to training, which helps ensure that their estimation accuracy is sufficient. Training can be done using computer training programs (not easy to obtain now) or through the use of SADs and sets of image-analyzed diseased specimens of known actual value with which raters can gain experience. The value of training for improving accuracy has been demonstrated (Nutter Jr and Schultz 1995; Bardsley and Ngugi 2013).

Fourth, related to the previous two criteria is experience. Wherever possible, raters should be experienced (perhaps through instruction and training) so that they are comfortable rating disease severity, and they should be selected based on demonstrated experience wherever possible. Experience has been shown in several studies to be an important gauge of accuracy (Pedroso et al. 2011; Yadav et al. 2013; Lage et al. 2015).

Fifth, wherever possible, raters should use SADs as an aid, especially if they are not highly experienced and demonstrated to be accurate. There are now well over 100 studies showing that SADs improve accuracy, particularly for less experienced or less accurate raters (Pedroso et al. 2011; Yadav et al. 2013; Lage et al. 2015; Del Ponte et al. 2017). SADs also improve inter- and intra-rater reliability, most likely as a result of the increase in accuracy of individual raters.

Sixth, where possible the minimum number of raters should be used in any particular experiment, and if different raters must be used, they should ideally be allocated randomly across the experimental units. This provides a further way to isolate rater-related error so that it can be accounted for in the analysis; if raters vary and have assessed across statistical units, the error will detract from the power of the analysis. Although peripheral to disease assessment itself, resource use efficiency and sample size may be critical considerations: Chiang et al. (2016b) demonstrated the need for a minimum sample size, ideally at least 30 samples, to minimize the risk of type II errors (a rough power-based illustration is sketched below). Subsequent analysis should be appropriate for the data type (Shah and Madden 2004; Chiang et al. 2020).
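
As a rough illustration of sample-size planning (not the specific approach of Chiang et al. 2016b), the sketch below uses a standard two-sample t-test power calculation; the effect size, alpha, and target power are hypothetical placeholders chosen only to show the mechanics.

```python
# Rough illustration of sample-size planning for comparing two treatments'
# mean severities with a two-sample t-test. Effect size, alpha, and target
# power are hypothetical placeholders, not values from any cited study.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.75,  # standardized mean difference (assumed)
                                   alpha=0.05,
                                   power=0.80,
                                   alternative="two-sided")
print(f"replicates needed per treatment: {n_per_group:.1f}")
```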

The future of visual plant disease severity estimation

By applying these methods and implementing appropriate tools, the potential improvement in the accuracy of direct visual severity estimates, especially for inherently “accurate” raters, is likely approaching its limit. Further gains, of variable magnitude, may come from additional studies of rater error and, in particular, from the optimization of SAD design and deployment. But many areas remain for future research.

For example, how does rater accuracy really vary over the full range of disease severity? Nutter Jr and Esker (2006) provided valuable information over the mid-range of disease severity (25 to 75%) using the just noticeable difference. But what about severity estimates <25% or >75%? A comprehensive study could lay to rest the question of the relationship between the ability to estimate and actual disease severity.

How do symptom types and characteristics, and the likely range of disease severity for a pathosystem, affect the optimum selection of SAD severities, the range covered, and the individual severities illustrated? Further research is needed to determine how many diagrams are really needed in a SAD set, and at what point there are too many. What are the critical aspects of SAD development and validation that should be followed in all labs developing this tool, to ensure that differences in approach are not a source of error in design or validation?

What aspects of rater instruction are most important to accurate assessments? How do personality types and other psychological or gender factors affect rater accuracy?

Similarly, with ordinal scales: what further improvements might be made to scale structure to improve the accuracy of estimation? Does training or instruction affect quantitative ordinal scale use (this has never been explored)? Do SADs aid the accuracy of classification using ordinal scales (both quantitative and qualitative)?

Do the same raters need to be used for all stages in disease assessment studies of accuracy and reliability? Or can a random “sample” of raters be used to represent the population? If so, how many raters should be used in any given study to encompass likely variability?

Several other methods used for assessing disease severity are sensor-based. These may incorporate artificial intelligence (AI) and have the potential eventually to provide accurate estimates of disease severity (Bock et al. 2021). Nonetheless, in most pathosystems, visual disease estimation is, and will remain for many years to come, a standard approach, underlining the importance of accuracy in visual estimation of disease severity.