Interrater reliability of photographic assessment of thyroid eye disease using the VISA classification

Purpose To determine the interrater reliability (IRR) of thyroid eye disease (TED) photographic assessment using the VISA classification. To assess whether a VISA grading atlas improves ophthalmology trainees’ performance in photographic assessment of TED.
Methods A prospective, partially randomized, international study conducted from September 2021 to May 2022. Online study invitation was emailed to a volunteer sample group of 68 ophthalmology college accredited consultants and trainees, and 6 were excluded from the study. Participants were asked to score 10 patient photographs of TED using only the inflammation and motility restriction components of the VISA classification. IRR was compared between groups of practitioners by their level of experience. A clinical activity grading atlas was randomized to 50% of the ophthalmology trainees.
Results Overall rater ICC was 0.96 for inflammation and 0.99 for motility restriction. No statistically significant difference in IRR between rater groups was identified. Trainees with a grading atlas had the highest IRR for inflammation (ICC = 0.95). Each subcomponent of the inflammation and motility restriction components of VISA classification had an ICC considered good to excellent. The mean overall rater score was 4.6/9 for inflammation and 3.5/12 for motility restriction. For motility restriction there was a reduced mean score variance among all raters when scoring photographs with more severe motility restriction.
Conclusion IRR using the inflammation and motility restriction components of the VISA classification was excellent. A VISA grading atlas improved trainee performance in grading inflammation.
Supplementary Information The online version contains supplementary material available at 10.1007/s10792-024-02934-z.


Introduction
Thyroid eye disease (TED) is an autoimmune disease often causing permanent facial disfigurement and substantially affecting patients' quality of life and daily function [1][2][3]. While intravenous glucocorticoids remain a first-line medical treatment for TED in the active phase in many parts of the world, about 35% of cases do not respond, and 11% show reactivation of inflammation [4][5][6][7]. Immunotherapies are increasingly used in the treatment of TED not only to suppress inflammation but also to modify disease severity, with the aim of reversing proptosis and improving ocular motility [8][9][10][11]. One of the first monoclonal antibodies, teprotumumab, has been FDA-approved for the treatment of TED based on interventional trials [8,9,12]. A key requirement for conducting treatment trials in TED, and for monitoring treatment response in clinical care, is that the parameters used to define treatment response are reliable and reproducible [13].
One of the challenges in assessing TED progression and treatment response is how to classify and grade its various clinical manifestations [14,15]. Currently, periocular inflammation is routinely assessed using the Clinical Activity Score (CAS) or a derivative of it [16][17][18][19][20][21]. Ocular motility assessment is less standardised. Methods range from subjective grading of duction on a scale of 0 to 4 (where 0 is no duction and 4 is full duction) or 0 to −4 (where 0 is full duction and −4 is no duction), through estimation of the degree of duction from the corneal reflex position, to more rigorous and laborious methods using kinetic perimetry [22][23][24][25][26]. The interobserver reliability, and therefore the validation, of all of these methods remains to be explored.
The vision, inflammation, strabismus, appearance/exposure (VISA) classification categorizes four specific end points of TED and summarizes these in a clinical form useful for grading specific measurements and for guiding management [17,27,28]. The advantage of the VISA classification is that it grades both disease severity and activity using subjective and objective inputs. The clinician judges disease progression by interval change in individual sections, for example a change of 2 or more on the inflammatory score or a 15° change in motility restriction (Table 1) [17,29]. This allows the VISA classification to acknowledge different aspects of eye and periocular changes that can be disproportionately affected by TED. For example, a patient may present with predominantly extraocular muscle involvement or predominantly fat hypertrophy, which may not be reflected in grading systems such as NOSPECS that assume a linear progression of disease [15]. The VISA classification was initially validated in a pilot study of ten patients referred with TED who were assessed by two clinicians. This study found that the scores correlated 100% for vision and appearance, 80% for inflammation, and 90% for strabismus [27]. Mawn et al. (2018) conducted a study to validate the reliability of three different grading scales used to measure soft tissue changes in TED (VISA, NOSPECS, CAS). They found that intrarater reliability was better than interrater reliability (IRR) for all scales and that the VISA classification met the threshold for agreement for conjunctival and eyelid erythema but not for caruncular oedema [30].
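The interval-change rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the VISA form itself: the function name, the dictionary layout for ductions, and the reading of "15° change" as a change of at least 15° in either direction in any one gaze direction are all assumptions made for the example.

```python
# Thresholds taken from the text: a change of 2 or more points on the
# inflammatory score, or a 15 degree change in motility restriction.
INFLAMMATION_CHANGE_THRESHOLD = 2
DUCTION_CHANGE_THRESHOLD_DEG = 15


def interval_change(prev_inflammation, curr_inflammation,
                    prev_ductions, curr_ductions):
    """Report which sections show an interval change between two visits.

    prev_ductions / curr_ductions are dicts mapping a gaze direction to a
    duction in degrees (hypothetical layout). Assumption: a change counts
    if it is at least 15 degrees, in either direction, in any one gaze.
    """
    inflammation_changed = (
        abs(curr_inflammation - prev_inflammation)
        >= INFLAMMATION_CHANGE_THRESHOLD
    )
    motility_changed = any(
        abs(curr_ductions[gaze] - prev_ductions[gaze])
        >= DUCTION_CHANGE_THRESHOLD_DEG
        for gaze in prev_ductions
    )
    return {"inflammation": inflammation_changed, "motility": motility_changed}


# Made-up values for illustration; these are not study data.
visit_1 = {"upgaze": 30, "downgaze": 45, "abduction": 45, "adduction": 45}
visit_2 = {"upgaze": 15, "downgaze": 45, "abduction": 45, "adduction": 45}
result = interval_change(4, 5, visit_1, visit_2)
```

In this made-up example the inflammation score changes by only 1 point, so only the motility section is flagged.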
Studies have shown significant interrater and intrarater variability in the in-person assessment of TED patients using the VISA classification [30]. These studies recommended assessing expert agreement using photographic evaluation of patients with TED [30]. Photographic assessment is important as it enables re-assessment of specific variables at different time intervals, greater participant recruitment, and development of machine-learning based image interpretation in future studies.
The purpose of this study is to determine the accuracy of photographic assessment of TED using the VISA classification [17]. We compare the IRR of assessment between orbital and oculoplastic subspecialists, subspecialists in other areas, general ophthalmologists, and ophthalmology trainees. In addition, we assess whether a TED grading atlas improves trainee performance in assessing photographs of TED.

Study design and ethical approval
This was a prospective, partially randomized study conducted between September 2021 and May 2022, in which participants were asked to score 10 patient photographs of TED using only the inflammation and motility restriction components of the VISA classification. The Royal Victorian Eye and Ear Hospital's Human Research Ethics Committee granted ethics approval (approval number 21/1515HL) before the initiation of the study. This study adhered to the tenets of the Declaration of Helsinki. Participants were asked to respond to 13 questions: grading 10 photographs from 7 patients with TED, and 3 questions on participant demographics and clinical experience. Photographs were provided in isolation, with no additional patient information. The 10 TED photographs were selected by one of the senior authors who subspecializes in TED and covered a range of severity across 5 inflammation and 5 motility restriction photographs. Written patient consent was provided. Photographs were viewed in a randomized sequence for each rater to minimize observer fatigue and learning. Raters were asked to grade the severity of inflammation and motility restriction using the VISA classification. An abridged version of The Graves Orbitopathy Clinical Evaluation Atlas by EUGOGO [23] was randomly provided to 50% of the trainees via the survey software program to provide instructions on grading TED activity using the inflammation and motility restriction components; none of the other rater groups were provided with an atlas. Our abridged version contains the identical inflammation photographs with instructions adapted for scoring using VISA grading, as well as additional photographs on scoring motility restriction.

Participants
The online study invitation was emailed to ophthalmology college accredited consultants and trainees. The study invitation was distributed internationally via the Royal Australian and New Zealand College of Ophthalmologists, the Australian and New Zealand Society of Ophthalmic Plastic Surgeons, and The Royal College of Ophthalmologists-England. Locally, the study invitation was distributed to practitioners at The Royal Victorian Eye and Ear Hospital, Melbourne, Australia, and Sydney Eye Hospital, Sydney, Australia. Incomplete responses were excluded. Raters were grouped by subspecialty into the following rater groups: orbital and oculoplastic subspecialists, other subspecialty, general ophthalmologists, and ophthalmology trainees. 'Other subspecialist' was defined as an ophthalmologist who had completed fellowship in a subspecialty other than orbital and oculoplastics.

Data collection and analysis
Data were exported from the SurveyMonkey online platform to a Microsoft Excel spreadsheet (Version 16.66.1). Two-way mixed effect absolute agreement model intraclass correlation coefficients were calculated using IBM SPSS Statistics (Version 28). One-way ANOVA, also run in IBM SPSS Statistics (Version 28), was used to calculate the mean severity scores and variance for inflammation and motility restriction across all raters; between-group and within-group sums of squares and F values were calculated for each TED photograph. A result was considered statistically significant where the source of variation indicated a between-group sum of squares greater than the within-group sum of squares, and this corresponded to an F value greater than the F-critical value and a p-value less than 0.05. A two-tailed independent-samples t-test was used to assess the effectiveness of the TED grading atlas among the trainee cohort. Consensus for interpretation of ICC is as follows: < 0.5 poor, 0.5-0.75 moderate, 0.75-0.9 good, and > 0.9 excellent [31][32][33][34][35]. A statistician was consulted about the appropriate methods for analysis.
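For readers who want to see what the ICC calculation involves, the following is a minimal pure-Python sketch of a two-way, absolute-agreement ICC computed from a photographs-by-raters score table, together with the interpretation bands quoted above. It is not the SPSS procedure used in this study. The function names are hypothetical, the text does not state whether single-measure or average-measure ICCs were reported (so the sketch returns both), and the handling of values falling exactly on a band boundary is an assumption.

```python
from statistics import mean


def icc_absolute_agreement(ratings):
    """Two-way, absolute-agreement ICC from a complete score table.

    ratings: one row per photograph, one column per rater.
    Returns (single_measure_icc, average_measure_icc).
    Undefined (division by zero) if every score in the table is identical.
    """
    n = len(ratings)      # number of photographs (subjects)
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


def interpret_icc(icc):
    """Map an ICC to the bands quoted in the text.

    Assumption: exactly 0.5 and 0.75 fall in the higher band, and exactly
    0.9 counts as 'good', since the text reserves 'excellent' for > 0.9.
    """
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Toy table (not study data): the second rater scores every photograph
# 2 points higher, which lowers absolute agreement despite identical ranking.
toy_scores = [[1, 3], [2, 4], [3, 5]]
single_icc, average_icc = icc_absolute_agreement(toy_scores)
```

The toy table illustrates why an absolute-agreement model was a reasonable choice for grading scales: a rater who is consistently more generous is penalized, whereas a consistency-only model would ignore the offset.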

Rater demographics and clinical experience
From a total of 68 responses, 6 were incomplete and excluded from the study. The remaining 62 respondents included 18 orbital and oculoplastic subspecialists (29.0%), 9 other subspecialists (14.5%), 14 general ophthalmologists (22.5%) and 21 ophthalmology trainees, randomized to 10 with (16.1%) and 11 without atlas (17.7%). Across all rater groups, most participants managed fewer than 5 TED cases per week. Orbital and oculoplastic subspecialist raters on average treated the most TED cases of any group, with 27.7% seeing 5-10 TED cases per week and 11.1% seeing 11-15 cases per week. Participants from the general ophthalmologist group had the greatest number of total years of clinical experience, with 28.5% having 16-20 years and 64.2% having more than 20 years of clinical experience. Table 2 shows the demographics and clinical experience of each of these groups.

Severity of inflammation and motility restriction
For all raters, the mean severity score for inflammation for all patient photographs was 4.6/9, ranging from 3.0 to 5.7 for individual patient photographs (Table S1). For all raters, the mean severity score for motility restriction for all patient photographs was 3.6/12, ranging from 1.2 to 6.4 for individual patient photographs (Table S2). When comparing all raters, the variance of mean severity scores ranged from 1.5 to 3.4 for inflammation and from 0.7 to 1.5 for motility restriction. The source of this variance reflects differences between individual rater scores. For motility restriction there was a reduced variance among all raters when scoring more severe motility restriction (p-value < 0.05). There was no reduction in variance between raters when scoring inflammation regardless of severity.

Interrater reliability for inflammation and motility restriction
The ICCs for inflammation and motility restriction are shown in Table 3. Intraclass correlation coefficients for all groups were considered good to excellent; for all raters, the ICC was 0.96 for inflammation assessment and 0.99 for motility restriction. There was no statistically significant difference between rater groups for scoring of inflammation or motility restriction using the VISA classification. For inflammation scoring, the ICCs for photographs 1, 4, 6 and 7 were excellent (ICC > 0.9) and uniform across all rater groups, with minor variance noted in photograph 3 across rater groups. For motility restriction assessment, the ICCs for photographs 2, 8 and 9 were excellent (ICC > 0.9) and uniform across all rater groups, with moderate to good agreement for photographs 5 and 10 (refer to supplementary Tables S3 and S4). Intraclass correlation coefficients for each subcomponent of inflammation (caruncular oedema, chemosis, conjunctival redness, lid redness, upper lid oedema, lower lid oedema) and motility restriction scoring (upgaze, downgaze, abduction, adduction) were good to excellent for all raters (Table 4). Gaze positions had the highest ICC (ICC 0.97-0.99), followed by conjunctival redness, chemosis and upper eyelid oedema; the lowest ICC was for eyelid erythema at 0.82. Orbital and oculoplastic surgeons had the highest ICC with near perfect agreement for inflammation in all photographs. Four of the 5 motility restriction photographs had excellent IRR. Photograph 10 had moderate to good IRR for motility restriction.

Effect of a grading atlas on interrater reliability
There was a statistically significant difference in the mean severity score of inflammation for the trainee with atlas group (mean = 4.2) and the trainee group without the atlas (mean = 5.4, p-value < 0.05), Fig. 1a. The trainee with atlas group scored inflammation closer to the overall rater mean of 4.6. This suggests trainees without the use of a grading atlas tended to overestimate inflammation. The trainee with atlas group had greater IRR when assessing mild motility restriction. For example, the ICC motility restriction scores for photographs 5 and 10 for the with atlas group were 0.97 and 0.79 respectively, in comparison to 0.83 and 0.68 for the without atlas group. There was no statistically significant difference in mean severity scores for motility restriction for the trainee with atlas group (mean = 3.7) and trainee without atlas group (mean = 3.8), Fig. 1b. One-way ANOVA testing showed no significant differences in variance between the trainee with or without atlas groups for inflammation or motility restriction scoring.
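The comparison of mean scores between the two trainee groups can be sketched as an independent-samples t statistic. This is a minimal sketch under stated assumptions rather than a reproduction of the SPSS output: the text does not say whether equal variances were assumed, so the pooled-variance (Student) form is used here, the function name is hypothetical, and the scores below are made-up toy values, not study data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's independent-samples t statistic and degrees of freedom.

    Assumes equal variances in the two groups (pooled estimate).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Two-tailed 5% critical value of t for 19 degrees of freedom, which is
# what 10 + 11 trainees would give under the pooled-variance assumption.
T_CRITICAL_DF19 = 2.093

# Toy scores for illustration only.
toy_with_atlas = [3, 4, 5]
toy_without_atlas = [5, 6, 7]
t_value, degrees_of_freedom = pooled_t_statistic(toy_with_atlas, toy_without_atlas)
```

With the group sizes reported in this study, an absolute t value above `T_CRITICAL_DF19` would correspond to a two-tailed p-value below 0.05 under these assumptions.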

Discussion
This study found that photographic assessment's IRR for inflammation was excellent for raters of all skill levels. When comparing IRR between the different rater groups, the trainee cohort with the atlas had the highest ICC for inflammation at 0.95 and general ophthalmologists performed the least well (ICC = 0.86). For the inflammation section of the VISA classification, most scoring components had excellent IRR (ICC > 0.9) except perhaps eyelid redness. Nevertheless, lid redness still had good reliability with an ICC of 0.82. VISA scoring of lid redness is binary, either 0 (no erythema) or 1 (definite erythema). Redness must exceed generalized facial redness to score; however, this is highly subjective. The findings of this study are consistent with a previous study which found excellent IRR when using digital images to measure eyelid fissure height in patients with Graves' Ophthalmopathy [36]. However, IRR with photographic assessment may not be directly comparable with in person assessment. Mawn et al. showed a binary (present/absent) scoring system for the inflammation components of VISA had moderate agreement (mean fraction of agreement = 0.74) [30,37,38]. This difference in findings is likely due to different statistical measurements as well as the study design comparing two raters at different institutions, whereas our study assessed the IRR of all 62 raters. Subject variance may have also contributed to the difference in findings; our study included a range of severity, which would lead to a greater ICC (Tables S1 and S2). However, it may also reflect differences in reliability between photographic assessment and in person assessment. Studies on the validity of photographic assessment of ophthalmic conditions, including machine learning assisted systems, have suggested that while there were some useful correlations, the 2 methods are not currently interchangeable [39][40][41][42][43]. A study comparing digital image measurement with clinical measurement of eyelid fissure height in Graves' Ophthalmopathy showed fair to moderate agreement when assessing eyelid position in terms of palpebral fissure and marginal light reflex distance. A possible explanation for this discrepancy is the static eyelid position in the photographic measurement compared to the dynamic position with in person assessment [36].
Fig. 1 a VISA inflammation scores for all patient photographs for trainees with a grading atlas (yellow) and trainees without the use of a grading atlas (blue). b VISA motility restriction scores for all patient photographs for trainees with a grading atlas (yellow) and trainees without the use of a grading atlas (blue)
There was no significant difference between the ICCs of the orbital and oculoplastics group, where raters had 11-15 years of experience and had the highest clinical case load of TED patients, when compared to the trainee without atlas group, where 80% of raters had < 5 years' experience. The trainee with atlas group had the highest ICC for inflammation (ICC = 0.95). This supports the use of a grading atlas to assist consistency of scoring between raters. Our study revealed the IRR for all raters was excellent for motility restriction (ICC = 0.99). There was no statistically significant difference between the rater groups. In agreement, each component of upgaze, downgaze, abduction and adduction also had excellent IRR of above 0.95 for all raters. Our findings are consistent with a previous study that found absolute percentage agreement of 90% for motility restriction [27]. However, comparison to these findings is limited because the previous study used in person assessment and compared only two raters. Our current study showed a positive correlation between motility restriction severity and ICC, where ICC increases as motility restriction increases. This suggests that the level of interrater agreement increases with increasing motility restriction. Motility restriction photographs 5 and 10 had the lowest mean scores (1.7/12 and 1.2/12, respectively); this corresponded to the lowest ICC scores for all raters for these photographs. These findings are consistent with a previous study on reliability of measuring ductions using the light reflex method, which found an interrater overall coefficient of repeatability (CR) of 9.6 degrees with 95% confidence [22]. Therefore, smaller changes in duction measurements (< 9.6 degrees) in less severe motility restriction will likely have lower IRR. There is no consensus on what constitutes an acceptable CR value for a measurement tool; this is left to the discretion of the clinician. Intraclass correlation coefficients and CR are not directly comparable, and both may be utilized.
This study showed improved IRR for scoring inflammation and motility restriction with the use of a grading atlas in the trainee cohort. The trainee with atlas group had a greater ICC for both inflammation and motility restriction when compared to the trainee without atlas group. In comparison, the trainee without atlas group tended to overestimate inflammation; for example, periorbital swelling from orbital fat hypertrophy may be confused with eyelid oedema [44][45][46]. Dickinson and Perros developed a grading atlas for TED due to difficulties with assessment and meaningful interpretation of research [23,47]. The atlas illustrates targeted qualitative assessment for soft tissue grading of TED. Validation of this tool shows that soft tissue grading can be performed more reliably with the use of a comparative photographic atlas [23]. Our modified grading atlas has clear definitions with diagrammatic representation of VISA scoring and was designed to be used as a reference at the time of VISA scoring. Having standardized explanations of scoring at the time of assessment likely reduces variability. The grading atlas appears to enhance accuracy in assessing inflammation, and is especially useful in grading less severe motility restriction. To improve IRR, we recommend institutional consensus on a standardized TED assessment tool amongst practitioners. In institutions that treat large numbers of TED patients, a VISA grading atlas may be beneficial to less experienced staff.
There are some limitations to this study. The ICC is strongly influenced by the variance amongst the population in which it is assessed and is calculated as a ratio: (variance of interest)/(variance of interest + unwanted variance) [34,48,49]. Therefore, ICCs measured amongst different study populations might not be directly comparable. When ICCs for rater groups are calculated on an individual photograph, the variance of interest may be much smaller than the variance from all photographs collectively. This may explain the decreased ICC for individual photographs for each of the rater groups [34,50,51]. Our study contained photographs of variable size and resolution, which may have led to increased variance between scores. Attempts were made to select good quality photographs of differing disease severity based on expert opinion. To better compare the accuracy of photographic assessment of TED, a study should compare in person clinical examination with photographic assessment. When collecting data on clinical experience we did not request raters to specify their familiarity with the VISA classification. As this was an international study, some raters may be more familiar with other assessment tools. This may have led to raters' level of clinical experience being incorrectly classified. The VISA grading atlas was only randomized to 50% of the trainee group. To better assess the utility of a grading atlas, a future study may have the same group of raters assess the same patient photograph with and without the atlas to assess intraobserver reliability. This study only assessed the inflammation and motility restriction components of the VISA classification.
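The dependence of the ICC on the spread of the sample, noted among the limitations, can be illustrated with the variance ratio quoted in the text. The sketch below is a worked arithmetic example only: the function name is hypothetical and the variance values are invented to show the effect, not estimated from the study data.

```python
def icc_from_variances(variance_of_interest, unwanted_variance):
    """ICC as the ratio given in the text:
    variance of interest / (variance of interest + unwanted variance)."""
    return variance_of_interest / (variance_of_interest + unwanted_variance)


# Same rater disagreement (unwanted variance = 1.0) in both cases.
# Invented numbers: a sample spanning a wide range of severity ...
wide_range_icc = icc_from_variances(9.0, 1.0)
# ... versus a sample in which the photographs are all of similar severity.
narrow_range_icc = icc_from_variances(0.5, 1.0)
```

With identical rater disagreement, the wide-range sample yields an ICC of 0.9 while the narrow-range sample yields about 0.33, which is why ICCs from a single photograph or from a more homogeneous population tend to be lower and are not directly comparable with pooled values.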

Conclusion
To our knowledge this is the first study assessing the accuracy of photographic assessment of TED. It is also the first study to directly compare raters of different levels of clinical experience and the effect of a grading atlas on IRR. In conclusion, we found the inflammation and ocular motility grading of the VISA classification to be reliable amongst raters of all levels of clinical experience. Photographic assessment of inflammation and motility restriction in TED had excellent IRR. We recommend a future study to compare the accuracy of photographic assessment with live clinical assessment. This will be an important comparison given future advances in the development of machine-learning systems for the detection of TED.

Table 1
VISA classification scoring for inflammation and motility restriction

Table 3
Intraclass correlation coefficients (ICC) for each of the rater groups for inflammation and motility restriction