Introduction

Retinal diseases are a significant cause of visual impairment and blindness, both in adults (secondary to age-related macular degeneration (AMD) and diabetic retinopathy (DR)) [1] and in children (due to inherited retinal disorders (IRD) and retinopathy of prematurity (ROP)) [2]. Diagnosing these conditions usually involves multimodal testing and multiple consultations with retina specialists, which are often not available in a timely manner and can result in delays to sight-saving treatment. For rare diseases, reaching a final diagnosis can take several years (the 'diagnostic odyssey'), resulting in uncertainty about the prognosis and delays in appropriate care.

Healthcare data increases by approximately 50% every year, making it one of the fastest-growing digital areas [3]. Genomic data alone is as demanding in terms of data acquisition, storage, distribution, and analysis as astronomy or social media content [4]. Ophthalmology is one of the leading data generators, with 30 million optical coherence tomography (OCT) scans performed yearly in the USA [5]. This ever-increasing vast amount of data, alongside the development of cutting-edge digital technology, has made ophthalmology a pioneer in digital innovation and healthcare artificial intelligence (AI).

AI has been rapidly developing in multiple areas of medicine, including dermatologist-level performance at detecting skin cancer [6], highly accurate classification of pulmonary tuberculosis [7], and genetic variant calling and classification [8]. AI-based ophthalmology telemedicine has been beneficial during the COVID-19 pandemic [9], and remote evaluation and analysis of retinal imaging may be useful in decreasing diagnostic time and facilitating triaging and classification [10, 11].

Development of highly sensitive and specific AI-based tools requires transdisciplinary collaboration between clinicians and software engineers. Herein, we provide an overview of current methodologies used in AI system development and validation and focus on clinical applications in selected retinal diseases.

AI methodology overview

The most common techniques to develop AI-based healthcare tools will be summarised below and in Figs. 1 and 2:

  • AI is a phenomenon in which non-living entities mimic human intelligence [12]. It is an umbrella term encompassing a spectrum of computing programs. 'Rule-based', 'hard-coded' or 'symbolic AI' has existed for many decades and is the basis of many software systems, from traffic light management systems to aircraft autopilots. In healthcare, symbolic AI has multiple applications, e.g. calculating a cardiovascular risk index or estimated glomerular filtration rate (eGFR).

  • Machine learning (ML) is an AI subfield in which a program achieves a task by being exposed to vast volumes of data and gradually learning to recognise patterns within the data, allocating data to distinct classes [13]. It involves 'soft coding', meaning that the model learns from examples instead of being programmed with explicit rules [12]. ML models can be supervised (based on data labelled by humans), unsupervised (grouping unlabelled data into clusters based on shared features), or based on reinforcement learning (the system accumulates its own feedback to improve through a reward function) [14]. In medicine, supervised learning is the most common.

  • Non-neural-network supervised ML algorithms are useful in healthcare for prediction modelling and for evaluating associations and best-fit relationships between two variables (linear regression, parametric) or among multiple variables (random forest, non-parametric). The latter combines different inputs using an ensemble of flowchart-like decision trees; each tree produces an output, and a collective result is obtained by combining all the individual outputs [15]. Non-neural-network models are often combined with deep neural network (DNN) architectures to achieve improved performance (Fig. 1) [16].

  • Deep learning (DL) is a subdivision of ML, defined by the presence of multiple layers of artificial neural networks (ANN) [17]. An ANN is composed of an input layer of multiple nodes ('artificial neurons') that represent characteristics to be analysed, e.g. pixels on an image, diagnoses (International Classification of Disease (ICD) coded), age, nucleotide changes, etc., connected to one or more hidden layers that sum and analyse all inputs and transmit a final value to an output layer (Fig. 2A).

  • DNNs are multi-layered DL algorithms (often with over 100 hidden layers) and are currently the gold standard for image classification [15]. As more layers are added, an iterative training phenomenon occurs in which deep layers combine stimuli received from earlier layers and generate new ones, improving the output layer and ultimately leading to better diagnoses [8].

  • Convolutional neural network (CNN) is a type of DNN particularly useful for image and video analysis [15]. These algorithms divide images into pixels, convert them into numbers or symbols, analyse them with multiple convolutional layers that filter, merge, mask, and/or multiply features, and feed the results to a dense neural network that creates the output layer [18]. Fully convolutional networks (FCN) produce the output layers directly, without the final step of dense layers (Fig. 2A) [17]. A minimal illustrative code sketch of a small CNN follows this list.
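
To make the above concrete for readers less familiar with these architectures, the following is a minimal, illustrative Python/Keras sketch of a small CNN for binary retinal image classification (e.g. disease present or absent). It is a didactic toy under assumed settings (arbitrary input resolution, filter counts, and layer sizes), not a reproduction of any model cited in this review.

```python
# Minimal, illustrative CNN for binary retinal image classification.
# Didactic sketch only; input size and layer sizes are arbitrary assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),        # input layer: image pixels
    layers.Conv2D(16, 3, activation="relu"),  # convolutional layer: filters local features
    layers.MaxPooling2D(),                    # pooling: merges/subsamples features
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # dense ("fully connected") hidden layer
    layers.Dense(1, activation="sigmoid"),    # output layer: probability of disease
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels))
```

Replacing the final dense layers with further convolutions, so that the output is produced directly from convolutional feature maps, would turn this into a fully convolutional network (FCN).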

Fig. 1

Diagram of artificial intelligence algorithms, subfields, and mechanisms

Fig. 2

A Overview of a convolutional neural network (CNN). The process starts with an input layer, typically an image or video, that is divided into subsamples and/or pixels and analysed by multiple convolutional layers that filter, mask, or multiply features and feed the results to a dense neural network of multiple nodes ('artificial neurons'). Each node represents a characteristic to be analysed (e.g., pixels, diagnoses, age, contrast, etc.) and is connected to hidden layers that sum and analyse all inputs, combining the received stimuli and designing new ones, leading to an improved output layer and final diagnosis. B The process of developing a supervised AI model. First, a training set is created, and these images are used to train the model to interpret the different features; next, a separate, non-annotated dataset (validation set) is presented to the model whilst its configuration is still being fine-tuned; and lastly, the algorithm is tested on new data to evaluate its overall performance

Useful concepts to better understand AI literature

There are different types of models, depending on the outcome to be predicted: (i) classification models apply to categorical outputs, such as classifying retinal images as with or without DR; (ii) segmentation models are specialised for image processing and analysis, detecting the presence or absence of features (e.g., intraretinal fluid), segmenting images into known anatomical correlates, or classifying them into diagnostic categories; (iii) regression models are used when a quantitative output is needed, such as predicting central macular thickness from an OCT file [19, 20].
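
In practice, these model types often differ mainly in their final ('output') layer and loss function. The hedged Keras sketch below contrasts a classification head (probabilities over a hypothetical set of five DR grades) with a regression head (a single continuous value such as central macular thickness); the grade count, feature size, and variable names are illustrative assumptions.

```python
# Illustrative output heads only; the 256-dimensional feature vector stands in
# for the output of any upstream ("backbone") network. Names are hypothetical.
import tensorflow as tf
from tensorflow.keras import layers

features = layers.Input(shape=(256,))

# (i) Classification: categorical output, e.g. five hypothetical DR grades
dr_grade = layers.Dense(5, activation="softmax", name="dr_grade")(features)
classifier = tf.keras.Model(features, dr_grade)   # trained with a cross-entropy loss

# (iii) Regression: quantitative output, e.g. central macular thickness in microns
cmt = layers.Dense(1, activation="linear", name="central_macular_thickness")(features)
regressor = tf.keras.Model(features, cmt)         # trained with a mean-squared-error loss

# (ii) Segmentation models instead predict a label for every pixel; see the
#      encoder-decoder sketch later in this review.
```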

Different performance metrics are used to present the results of each model type. The Dice similarity coefficient (Dice score) and intraclass correlation coefficient (ICC) are metrics suitable for evaluating the performance of image segmentation DL algorithms, both ranging from 0 to 1 [21]. There are multiple performance metrics for classification and regression algorithms, such as (i) receiver operating characteristic (ROC) curves, which plot true positives (sensitivity) against false positives (1 − specificity) [22]; (ii) the area under the curve (AUC, also known as AUROC), ranging from 0 to 1, with 1 indicating a perfect algorithm [23]; (iii) precision-recall curves (PRC), which plot positive predictive value (precision) against sensitivity (also known as recall or true positive rate) [24]; (iv) the accuracy statistical score; (v) the absolute difference; and (vi) Pearson's correlation between predicted and measured values (the latter also ranging from 0 to 1) [23].
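
To ground these definitions, the following hedged sketch computes an AUROC, precision-recall pairs, accuracy, and a Dice score with NumPy and scikit-learn on invented toy values (not data from any cited study).

```python
# Illustrative computation of common performance metrics on made-up toy data.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, accuracy_score

# Classification example: true labels and model-predicted probabilities (toy values)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])

auroc = roc_auc_score(y_true, y_prob)                          # area under the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_prob)  # points of the PRC
accuracy = accuracy_score(y_true, (y_prob >= 0.5).astype(int))

# Segmentation example: Dice similarity coefficient between two binary masks
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |intersection| / (|A| + |B|); 0 = no overlap, 1 = perfect overlap."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return 2.0 * intersection / (pred_mask.sum() + true_mask.sum())

pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 1]])
print(auroc, accuracy, dice_score(pred, true))
```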

The process of developing a supervised AI model generally involves three stages: (1) training, in which the network is provided with labelled images; (2) fine-tuning, in which the model starts aiding the manual annotation and human graders correct and improve it; (3) validation or testing, in which the algorithm is tested on a hold-out dataset annotated by human graders and kept separate from the training dataset (internal validation). External validation on datasets of completely independent origin from the training dataset is the gold standard for performance evaluation, indicating generalisability (Fig. 2B) [23].
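
As an illustration of how such hold-out sets can be produced in code, the following hedged sketch splits an annotated dataset 70/15/15 into training, tuning (validation), and internal test sets using scikit-learn; the data here are random stand-ins, and an external validation set would come from an entirely independent source.

```python
# Illustrative train / validation / hold-out test split (70/15/15, an arbitrary choice).
# The "images" and "labels" below are random stand-ins for an annotated dataset.
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(100, 8, 8)            # toy images
labels = np.random.randint(0, 2, size=100)    # toy human-grader annotations

train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)

# Stages 1-2: model.fit(train_x, train_y, validation_data=(val_x, val_y))
# Stage 3 (internal validation): model.evaluate(test_x, test_y)
# External validation would call model.evaluate on a completely independent dataset.
```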

Selected retinal diseases for which AI-based tools have been developed

Diabetic retinopathy (DR)

Recent studies have shown that AI-based DR screening systems can achieve adequate levels of safety [25,26,27,28,29]. These algorithms include classical expert-designed image analysis, mathematical morphology, and transformations [30,31,32,33]. One of the approaches tested was to classify colour fundus images from training datasets into referable DR (moderate or advanced stage) or non-referable DR (no or mild DR, Table 1). These studies either built their own CNNs or used pretrained ones such as AlexNet [34], Inception V3 [35], Inception-Resnet-V2 [36], and Resnet152 [37]. Other studies tried to detect DR based on fixed features such as red lesions [38, 39], microaneurysms [40], exudates, and blood-vessel segmentation [41, 42]. Lastly, other groups introduced a method to detect DR and diabetic macular oedema (DMO) using a CNN model that was also able to determine the exact stage of DR; these studies are summarised in Table 1 [49,50,51,52,53,54,55,56,57,58,59,60, 95].
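
As an illustration of the pretrained-network strategy mentioned above, the hedged sketch below attaches a new binary head for referable versus non-referable DR to an ImageNet-pretrained Inception V3 backbone. It is a generic transfer-learning template, not a reproduction of any of the cited screening systems, and the variable names are hypothetical.

```python
# Illustrative transfer learning: pretrained Inception V3 backbone with a new
# binary head for referable vs non-referable DR. Didactic sketch only.
import tensorflow as tf
from tensorflow.keras import layers, models

backbone = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
backbone.trainable = False                   # freeze pretrained convolutional layers

model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),   # probability of referable DR
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# model.fit(fundus_images, referable_labels, validation_split=0.15, epochs=5)
```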

Table 1 Artificial intelligence in retinal disease—methods, cohorts, and overall results

Wong et al. [96] developed a model to classify DR stages based on microaneurysms and haemorrhages, while others used exudates, blood vessel mapping, and the optic disc [97, 98]. The sensitivity of automatic DR screening has been reported as ranging from 75 to 94.7%, with comparable specificity and accuracy [99]. Several publicly available retinal datasets have been used to train, validate, and test these AI systems, and also to compare performance against other systems; namely, DIARETDB1, Kaggle, E-ophtha, DDR, DRIVE, HRF, Messidor, Messidor-2, STARE, CHASE DB1, Indian Diabetic Retinopathy Image Dataset (IDRiD), ROC, and DR2 [57, 100,101,102,103,104,105,106,107,108]. Several studies have used these datasets to detect red lesions, microaneurysms, DR lesions, exudates, individual DR stages, and blood vessel segmentation [38, 40, 41, 43, 52, 109, 110].

Another area of focus is the detection of DMO, for which OCT is currently the gold-standard assessment. Several groups have used AI to detect DMO from colour fundus photography based on exudates and accurate identification of the macula. Work on automated detection from OCT imaging is ongoing, focusing on retinal layer segmentation [111, 112] and identification of specific lesions (e.g. cysts) [113,114,115,116,117,118]. Recently, DL has also been used to detect macular thickening from colour photographs, and it has been found to be comparable to OCT-measured thickness [119].

Multiple programmes have tried to use AI-based methods in population-based screening for DR. The United States Food and Drug Administration (US FDA) has recently approved IDx-DR, a CNN for screening DR stages in adults aged 22 years or older [49, 120]. Initial versions of IDx-DR were evaluated as part of the Iowa Detection Programme and showed good results in White, North African, and Sub-Saharan populations [25]. Similar software packages, such as RetmarkerDR in Portugal and EyeArt in Canada, have been tested in local screening programmes [121, 122]. Multiple South-Asian eye institutes are also involved in the development and validation of AI-based algorithms in DR [95, 123, 124]. Recently, a Singapore-based DL tool has shown diagnostic accuracy comparable to manual grading, and a semi-automated DL model involving a secondary human assessment may prove to be the most cost-effective model [125, 126]. The real-world performance of these tools remains to be tested [127].

Age-related macular degeneration (AMD)

The use of AI with DL tools has great potential in AMD, ranging from diagnosis (allowing a more efficient and accurate approach) to prognostication of affected individuals, and perhaps to directly predicting the efficacy of treatments. The most common imaging modalities being explored in the field of AI for AMD are OCT, colour fundus imaging, and fundus autofluorescence (FAF). OCT-angiography (OCTA) has also been used in DL approaches to diagnose and classify AMD, achieving high accuracy and sensitivity [128, 129]. Owing to the huge number of studies, selected key ones will be discussed, with a broad range of studies summarised in Table 1.

One of the first attempts to evaluate ML algorithms in the risk assessment of AMD was a European study by van Grinsven et al. that aimed to detect and quantify drusen on colour fundus photographs in eyes without and with early to moderate AMD [61]. This study demonstrated that the proposed system performed comparably to experienced human observers in detecting the presence of drusen and estimating their area, with an ICC greater than 0.85. For AMD risk assessment, it achieved areas under the ROC curve of 0.948 and 0.954, similar to the performance of human graders. Subsequently, the same group explored another algorithm for automatic detection of reticular pseudodrusen (RPD) [62]. This followed a multimodal imaging approach using colour fundus, FAF, and near-infrared images, with automated quantification performing similarly to the observers.

In 2018, Schmidt-Erfurth et al. evaluated the predictive potential of ML in terms of best-corrected visual acuity (BCVA) by analysing OCT volume scan features—intraretinal fluid (IRF), subretinal fluid (SRF), and pigment epithelial detachment (PED) [130]. A modest correlation was found between BCVA and OCT at baseline (R2 = 0.21), while the accuracy of functional outcome prediction increased in a linear fashion. The same group then explored automated quantification of fluid volumes using a DL method and a CNN, using OCT data from the HARBOR study (NCT00891735) for neovascular AMD (nAMD) [131]. Retinal fluid volumes (IRF, SRF, and PED) were then validated by the authors as important biomarkers in nAMD [132].

A more recent study introduced an AI system that combines 3D OCT images and automatic tissue maps in individuals with unilateral nAMD to predict progression in the contralateral eye [70]. It achieved a sensitivity of 80% at 55% specificity and a sensitivity of 34% at 90% specificity, while being able to identify high-risk groups and changes in anatomy before conversion to nAMD, outperforming 5 out of 6 experts. In addition, the Age-Related Eye Disease Studies (AREDS and AREDS2) used DL algorithms and survival analysis to predict the risk of late AMD, achieving high prognostic accuracy [133].
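
The AREDS analyses combined DL-derived imaging features with survival modelling. Purely as a hedged illustration of that general idea, the sketch below fits a Cox proportional-hazards model (via the lifelines library) relating hypothetical imaging-derived features to time until late AMD; the feature names and values are invented and not drawn from AREDS.

```python
# Illustrative Cox proportional-hazards model relating hypothetical imaging-derived
# features to time until progression to late AMD. All values are invented.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "drusen_area_mm2": [0.4, 1.2, 2.5, 2.1, 3.0, 1.8, 1.0, 0.9],
    "pigment_changes": [0,   1,   0,   0,   1,   1,   1,   0  ],
    "years_followed":  [8.0, 5.5, 2.0, 9.0, 1.5, 6.0, 4.0, 7.5],  # follow-up duration
    "late_amd_event":  [0,   1,   1,   0,   1,   0,   1,   0  ],  # 1 = progressed
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_followed", event_col="late_amd_event")
cph.print_summary()   # hazard ratios quantify each feature's association with progression
```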

Several segmentation models have been described in AMD. In 2018, De Fauw et al. created a landmark OCT image segmentation model that utilised a DL framework to perform segmentation and automated diagnosis of retinal diseases [134]. Subsequently, Liefers et al. validated a DL model for segmentation of retinal features specifically in individuals with atrophic AMD and nAMD, with results comparable to independent observers [135]. A further automated segmentation algorithm using a CNN has been explored to quantify IRF, SRF, PED, and subretinal hyperreflective material (SHRM) in nAMD [136]. There was good agreement between clinicians and the network for both the segmentation and detection of lesions (Dice scores ≥ 0.75 for all features). Two applications with validated automated DL segmentation algorithms are currently commercially available: RetinAI (RetinAI Medical AG, Switzerland) and RetInSight (Vienna, Austria) [137].
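
To make the segmentation approach tangible, below is a heavily simplified, untrained encoder-decoder ('U-Net-style') sketch that maps an OCT B-scan to a per-pixel probability map over a few hypothetical feature classes (e.g. background, IRF, SRF, PED). It illustrates the general fully convolutional design only and is not the architecture of any tool named above; the input size, depths, and class list are assumptions.

```python
# Heavily simplified encoder-decoder (U-Net-style) fully convolutional network
# outputting a per-pixel class map (e.g. background / IRF / SRF / PED).
import tensorflow as tf
from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(256, 256, 1), n_classes=4):
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolutions extract features, pooling reduces resolution
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck
    b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)

    # Decoder: upsampling restores resolution; skip connections reuse encoder features
    u2 = layers.Concatenate()([layers.UpSampling2D()(b), c2])
    c3 = layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
    u1 = layers.Concatenate()([layers.UpSampling2D()(c3), c1])
    c4 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)

    # Per-pixel class probabilities (one channel per feature class)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c4)
    return Model(inputs, outputs)

model = tiny_unet()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(oct_bscans, pixelwise_labels)   # labels: integer class per pixel
```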

Dry AMD with geographic atrophy (GA) has also been actively investigated. Zhang et al. developed a DL model that segments and classifies GA on OCT images, achieving performance similar to manual specialist assessment [138]. Another group segmented GA in both OCT and FAF images with reasonable agreement, with better performance (higher Dice score) in FAF [139]. GA algorithms have also been used to predict visual acuity (VA), with certain features, such as photoreceptor degeneration, having high predictive significance [140].

Inherited retinal disorders (IRD)

AI algorithms using multimodal imaging techniques have been developed to facilitate the diagnosis [78] and classification [80] of IRD, decipher the genetic aetiology [83], and measure the rate of progression [84, 89].

Chen et al. have developed a CNN that detects if a patient has retinitis pigmentosa (RP) by analysing colour fundus images, with an overall accuracy of 96% (versus 81.5% from four ophthalmology experts) [78]. Another group proposed an FCN that detects pigment in colour images and diagnoses RP with an accuracy of 99.5% [79].

To predict aetiologies, Miere et al. have created a CNN model that can distinguish between FAF images from patients with Stargardt disease (STGD), RP, and Best disease (BD), with an overall accuracy of 0.95 [80]. Furthermore, Fujinami-Yokokawa et al. used OCT images to predict causative genes (ABCA4, RP1L1, and EYS) through a DL platform [83]. They achieved an accuracy of 100% for ABCA4, 66.7 to 87.5% for RP1L1, 82.4 to 100% for EYS, and 73.7 to 100% for healthy control images. Miere et al. also created a CNN that is able to outperform specialists in distinguishing between FAF images of STGD and PRPH2-related macular dystrophy (AUROC 0.890 versus experts 0.816) [82]. Shah et al. also achieved an accuracy of 99.6% with a model distinguishing between OCT images from patients with STGD and controls [81]. Crincoli et al. combined image processing with a CNN to differentiate between BD and adult-onset vitelliform macular dystrophy using FAF and OCT images, with an AUROC of 0.880 [23]. Moreover, this endeavour has recently been markedly upscaled by Pontikos et al. to differentiate between 36 gene classes by exploiting multimodal imaging [141]. However, further development is needed, given that more than 300 genes are known to cause IRD to date.

STGD is the most prevalent inherited macular dystrophy, and it can affect both children and adults, with multiple ongoing clinical trials [142]. Charng et al. developed a CNN algorithm that segments flecks and is able to monitor their progression over time [84]. They obtained an overall agreement between manual and automatic segmentation of 0.54 ± 0.14 dice score for diffuse speckled patterns and 0.71 ± 0.08 for discrete flecks. Wang et al. also used FAF images, detecting and quantifying areas of atrophy in STGD and AMD [85]. They obtained an accuracy of 0.98 for differentiating normal eyes from those with AMD-related atrophy and 0.95 for eyes with STGD. Atrophic areas were also segmented manually and automatically, with an overlap ratio of 0.89 ± 0.06 in AMD and 0.78 ± 0.17 in STGD [85]. Miere et al. also assessed atrophy and developed a CNN that differentiates between FAF images with GA secondary to AMD and IRD-associated, with an AUROC of 0.981 [86].

Automatic macular OCT segmentation provided by device manufacturers' software is often inaccurate in IRD, requiring manual correction in over one-third of scans [143]. OCT images of STGD were used to create an improved DL-based algorithm that segments the inner and outer retinal limits, providing faster and better quantification of macular thickness and volume [87]. Lastly, adaptive optics scanning light ophthalmoscopy images of STGD have also been used to develop an FCN that is able to accurately count macular cones (Dice score: 0.9431 ± 0.0482) [88].

Other tools are being designed to assess disease severity and potentially have applications in determining eligibility for interventional trials. A CNN has been developed by Camino et al. that segments the preserved ellipsoid zone (EZ) area on OCT images from patients with RP and choroideremia (CHM) [89]. This tool reached 0.894 ± 0.102 similarity between automatic and manual grading for RP and 0.912 ± 0.055 for CHM. Loo et al. also targeted EZ segmentation, validating an algorithm originally developed for macular telangiectasia in patients with USH2A-related RP, with excellent applicability (Dice score 0.79 ± 0.27) [91]. Similarly, Wang et al. also tested an EZ segmentation CNN in USH2A-related RP and obtained a Dice score of 0.867 ± 0.105 [92]. CHM EZ segmentation was then attempted by Wang et al. using a non-neural random forest approach, reaching a Jaccard similarity index between manual and automated segmentation of 0.876 ± 0.066 [90].

Predicting VA based on OCT and infrared images in RP has been assessed by Liu et al., who were able to determine whether a patient with RP had VA below or above 20/40, with an AUC of 0.85 [93]. Sumaroka et al. also developed a non-neural-network model to predict foveal sensitivity (Humphrey visual field testing), VA, and the possible outcome of therapy in patients with blue cone monochromacy based on OCT scans, with good results [94].

Retinopathy of prematurity (ROP)

ROP is an important cause of preventable childhood blindness worldwide [144]. ROP causes abnormal retinal blood vessel growth and can be detected by trained ophthalmologists using indirect ophthalmoscopy; however, access to adequate, timely screening is often limited by the need for highly trained personnel and specialised equipment. DL-based detection and staging of ROP [145] by evaluation of posterior pole fundus images has been attempted with high sensitivity and specificity [146]. Authors have developed a ROP vascular severity score that correlates well with the labels set by the International Classification of Retinopathy of Prematurity committee [147]. The DeepROP score [148] and the i-ROP DL system are DL algorithms developed to evaluate clinically significant severe ROP at the posterior pole [149]. Plus disease, an aggressive form of ROP, is often difficult to diagnose given the lack of consensus among ophthalmologists; several authors have evaluated automated algorithms that may be able to objectively diagnose plus disease [150,151,152,153].

The limitations of this study are the non-systematic approach to the literature review, which may have led to some papers being omitted or not adequately prioritised, and editorial restrictions, which prevented a comprehensive review of AI applications in all retinal disorders. Substantial research has been undertaken in other fields of medical retina (e.g., uveitis and oncology), which will be reviewed in a subsequent project [154, 155].

Concluding remarks and future directions

Retinal disease has been at the forefront of AI in ophthalmology, with the first AI-related publication in the field being on DR. Since then, research groups focusing their efforts on AI have multiplied around the world, targeting all aspects of the patient journey, including diagnosis, triage, and prognostication, by leveraging multiple imaging (and functional) modalities as well as a range of AI tools. A shortage of medical professionals is anticipated in the short term, likely further increasing healthcare inequalities and challenging our ability to improve care for preventable diseases [156]. AI represents one important approach to help meet these challenges and, moreover, to facilitate improvements in patient care, both at the individual level, with more timely, accurate, and bespoke management, and at the level of population-scale healthcare. Ever-improving DNN and CNN algorithms can become a helping hand for healthcare systems as they work to meet current capacity challenges.

Despite the huge promise, many challenges remain for AI in ophthalmology, including: (i) the need for larger, more diverse datasets that fully reflect real life; (ii) closer collaboration between experts (both national and international) to develop disease-specific consensus and subsequently provide a comprehensive, large volume of image grading; and (iii) greater synergy between healthcare professionals, patients, and data scientists, communicating and improving the software interface as it is iteratively created, and ensuring it complements the human interaction that underpins the practice of medicine rather than seeking to replace it [157]. Further uses of AI are yet to be explored, such as multimodal inputs to determine the best candidates for interventional clinical trials, selection of the ideal anti-VEGF agent and therapeutic scheme in nAMD, and estimation of functional impairment from structural parameters in IRD, among others.

The future of healthcare will increasingly incorporate the advantages that AI can provide to improve the lives of our patients, no doubt performing assessments more quickly and accurately than retina specialists can currently sustainably provide, and allowing us to spend more time being better clinicians and scientists. Nevertheless, as always with new technology, there will be new learnings and surprises along the way.