Key Summary Points

Why carry out this study?

 Myopia prevalence is increasing worldwide, with half of the global population expected to have myopia by 2050.

 Although the etiology of myopia remains unclear, it is important to control myopia early in children to avoid sight-threatening complications due to high myopia in the future.

 The study asked: What are the main risk factors for myopia in children during primary school, and how can the change in these risk factors be predicted well?

What was learned from the study?

 Myopia progression in primary school children could be predicted with good accuracy using machine learning models.

 Ocular factors, such as spherical equivalent, had greater weight than environmental and genetic factors, and should be monitored annually to achieve early prediction and intervention in children with myopia.

Introduction

Myopia has become one of the most prevalent ocular disorders worldwide [1, 2] and has reached epidemic levels in China, with a prevalence of 90% among university students [3, 4]. Myopia-related complications and vision loss may become severe social concerns by 2050 [2]. The etiology of myopia remains unclear, although recent studies suggest that environmental factors may play as important a role as genetic factors [1]. Time outdoors is one promising environmental factor for myopia prevention in children, and can be increased by adding an outdoor physical activity class each day [5] or encouraging outdoor play during recess [6]. In addition, education [7] and certain reading habits such as continuous reading, a close reading distance, and a low-light environment are reported to be associated with childhood myopia [8, 9].

In 2015, Zadnik et al. [10] found that cycloplegic spherical equivalent refraction was the single best predictive factor for myopia onset in school-age children and performed as well as all eight factors together, with the area under the curve (AUC) ranging from 0.87 to 0.93 depending on the factors included. Zhang et al. [11], using ocular biometry, height, weight, and presenting visual acuity in a model, reported AUCs of 0.82–0.97 in Chinese children. In 2010, Lim et al. [12] reported that higher intakes of saturated fat and cholesterol were associated with longer axial length in schoolchildren, suggesting a possible relation between dietary factors and refractive errors. In 1996, Edwards et al. [13] reported that children who developed myopia had significantly lower intakes of fat, protein, vitamins B1, B2, and C, cholesterol, phosphorus, and iron than children who did not become myopic. However, more studies are still needed to confirm the relation between dietary factors and myopia. Recently, Tideman et al. [14] reported that a risk score combining environmental risk factors and ocular parameters can help to identify children at high risk of myopia. Nevertheless, environmental factors are varied and change every year for students during primary and middle school. Although many researchers have built myopia prediction models based on different datasets of risk factors [15], it remains unclear what the main risk factors for myopia are during primary school and how these risk factors change over time [16].

Machine learning is a method of data analysis that automates analytical model building and has been successfully used in image recognition and classification [17]. In the field of ophthalmology, machine learning has been used in the diagnosis of diabetic retinopathy [18] and in the prediction of myopia development [19, 20], orthokeratology lens prescription [21, 22], and visual acuity in patients treated for neovascular age-related macular degeneration [23]. There are many machine learning algorithms, each with its own strengths and weaknesses. Random forest is a supervised learning algorithm with the major advantage that it can be used for both classification and regression tasks, which form the majority of current machine learning systems. Although the algorithm is at risk of overfitting the data, this can be avoided through careful system design. In addition, it can handle missing values and can model categorical variables.

In this study, we developed a system that assigns probabilities for myopia progression in children using machine learning with random forest. We applied this to investigate the risk factors for myopia progression using a large sample of Chinese children who were followed for a period of 5 years.

Methods

Study Population

The Anyang Childhood Eye Study (ACES) was a school-based cohort study, which was approved by the Ethics Committee of Beijing Tongren Hospital, Capital Medical University, and adhered to the tenets of the Declaration of Helsinki. Informed written consent was obtained from at least one parent, while verbal assent was obtained from each child. Details of the methodology have been reported previously [24]. A total of 2740 grade 1 students aged 7.1 ± 0.4 years (range 6–9 years) were measured annually with ocular biometry and cycloplegic autorefraction [25, 26]. We examined the students during a fixed period of 2 months every year. We defined grade 1 as the first study year, grade 2 as the second year, and so on.

Procedures

All students had distance visual acuity measured with and without spectacles, if worn, using a logarithmic visual acuity chart (Precision Vision, La Salle, IL, USA) at a distance of 4 m [27]. A Lenstar LS900 (Haag-Streit, Koeniz, Switzerland) was used to measure axial length before cycloplegia [24]. Five repeated measurements were taken and averaged. Corneal powers were measured in the principal meridians to give the lesser (flat) and the greater (steep) corneal powers. Mean corneal power was calculated as the average of these powers [26]. Cycloplegic autorefraction was performed 30 min after one drop of topical anesthetic agent (Alcaine; Alcon, Fort Worth, TX, USA), two drops of 1% cyclopentolate (Alcon), and one drop of 0.5% tropicamide (Mydrin P; Santen, Osaka, Japan) at 5-min intervals. Three measurements were averaged (HRK-7000A, Huvitz, Gunpo, Korea), and the spherical equivalent was calculated (sphere power + cylinder power/2). Myopia progression was considered to be any increase in the myopic spherical equivalent in myopic children, while in the full cohort such a shift towards a more negative or less positive refractive error was termed a myopic shift. Information including children’s near work load, time outdoors, living habits, reading habits, food habits, and parental myopia was gathered using questionnaires administered to the parents, with five response options for each question, as described in previous studies [9, 28].
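The refraction arithmetic described above can be illustrated with a minimal sketch. It is not the study's code (the analysis was done in R); the function names and the example readings are hypothetical, and only the formulas (sphere + cylinder/2, and the mean of the flat and steep corneal powers) come from the text.

```python
from statistics import mean


def spherical_equivalent(sphere_d, cylinder_d):
    """Spherical equivalent in diopters: sphere power + cylinder power / 2."""
    return sphere_d + cylinder_d / 2.0


def averaged_spherical_equivalent(readings):
    """Average the spherical equivalent over repeated autorefraction readings.

    `readings` is a list of (sphere, cylinder) pairs in diopters.
    """
    return mean(spherical_equivalent(s, c) for s, c in readings)


def mean_corneal_power(flat_k1_d, steep_k2_d):
    """Mean corneal power: average of the flat (K1) and steep (K2) powers."""
    return (flat_k1_d + steep_k2_d) / 2.0


def annual_refractive_change(se_previous_d, se_current_d):
    """Change in spherical equivalent over one study year.

    A negative value is a shift towards more negative (or less positive)
    refraction, i.e. a myopic shift.
    """
    return se_current_d - se_previous_d


# Hypothetical readings for one right eye (not study data).
readings = [(-1.00, -0.50), (-1.25, -0.50), (-1.00, -0.75)]
se_now = averaged_spherical_equivalent(readings)
```

Because the spherical equivalent is linear in sphere and cylinder, averaging the three readings before or after applying the formula gives the same value.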

Data Analysis

Data analysis was performed using the R programming language (http://www.r-project.org/) [29] on right eyes only. The children were divided into two groups: (1) a randomly chosen subset of 10% of children as an independent dataset or “hold-out” group to test the performance of the prediction model, and (2) the remaining children (90%) as the training group to identify risk factors and establish prediction models. A regression model was first used to screen the factors to be included in the random forest model.
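The random 90%/10% division described above might look like the following sketch. The study performed this step in R; the function name, the seed, and the use of consecutive integer IDs are assumptions made purely for illustration.

```python
import random


def split_holdout(child_ids, holdout_fraction=0.10, seed=2024):
    """Randomly split IDs into a training group and a hold-out group.

    The seed is an arbitrary, hypothetical choice that makes the split
    reproducible; the paper does not report one.
    """
    rng = random.Random(seed)
    shuffled = list(child_ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    holdout = shuffled[:n_holdout]
    training = shuffled[n_holdout:]
    return training, holdout


# Hypothetical IDs 1..2740, matching the baseline cohort size.
training_ids, holdout_ids = split_holdout(range(1, 2741))
```

Fixing the split once, before any model fitting, keeps the hold-out group untouched by variable screening and parameter tuning.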

Univariate analysis was performed with the dependent variable being myopia progression in each study year, and independent variables being ocular axial length, near work, time outdoors, living habits, nutritional habits, reading habits, habits of wearing spectacles, and parental myopia (see Supplement A for the meaning of each variable). In multivariate regression analysis, categorical variables were changed into dummy variables, and the best subset of detected variables in univariate analysis was determined based on the Akaike information criterion. The relative weights of the predictive variables in the multivariate regression model were calculated.

The prediction models of each study year were determined by the random forest method (randomForest package for R, http://www.r-project.org/) [30], a mature ensemble learning method in machine learning that can be applied for classification and regression. Among the 90% of children used for training, five-fold cross-validation (80% of subjects for training and 20% for validation in each fold) was used to tune parameters and train an optimal random forest model. There were two parameters, mTry and nTree, which represent the number of randomly chosen features at each split of the decision trees and the number of trees in the random forest, respectively. Two prediction indexes were used: the rate of samples with an absolute error between actual and predicted myopia progression below a given error threshold, evaluated at various thresholds, and the coefficient of determination between actual and predicted myopia progression. Finally, the model’s performance was validated by applying it to the hold-out group.
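The cross-validation layout and the two prediction indexes described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the random forest fitting itself is omitted, and the coefficient of determination is taken here as 1 − SS_res/SS_tot, which the paper does not specify.

```python
import random


def k_fold_splits(sample_ids, k=5, seed=0):
    """Yield (training, validation) ID lists for k-fold cross-validation.

    With k=5, each fold validates on about 20% of the samples and trains
    on the remaining 80%.
    """
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, validation


def rate_within_threshold(actual, predicted, threshold_d):
    """Fraction of samples whose absolute error is below threshold_d (diopters)."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) < threshold_d)
    return hits / len(actual)


def coefficient_of_determination(actual, predicted):
    """R^2 computed as 1 - SS_res / SS_tot (an assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical progression values in diopters (not study data).
actual = [-0.50, -0.75, -1.00, -0.25]
predicted = [-0.40, -1.40, -0.90, -0.30]
accuracy_050 = rate_within_threshold(actual, predicted, 0.50)
accuracy_075 = rate_within_threshold(actual, predicted, 0.75)
```

Evaluating `rate_within_threshold` over a grid of thresholds gives the cumulative accuracy curve: the proportion of children whose predicted progression falls within each error bound.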

Results

At baseline, 2740 grade 1 students aged 7.1 ± 0.4 years were included. Boys accounted for 57.4%. From the first to the fifth study years, 2559 (93.4%), 2611 (95.3%), 2531 (92.4%), 2342 (85.5%), and 2199 (80.3%) children were re-examined, respectively (Table 1). There were no significant differences in baseline characteristics between the students included and those excluded due to incomplete follow-up, nor between the training group and hold-out group (Table 1). Table 2 shows the descriptive statistics of spherical equivalent, axial length, uncorrected distance visual acuity, and flat keratometry reading for children in each grade.

Table 1 Baseline characteristics of children between training group and hold-out group in each study year (mean ± SD)
Table 2 Distribution of SE, AL, AR, UDVA, and K1 of children in each grade (mean ± SD)

In this study, 68 variables were screened, including 23 continuous variables, 16 nominal categorical variables, and 29 ordinal categorical variables (Supplement variables). From the first to the fifth study year, 19, 23, 26, 20, and 25 variables, respectively, were retained after screening (Fig. 1 and Supplement B).

Fig. 1

Weights of predictor variables in the first study year using a random forest model. Ocular parameters are shown in green, environmental factors in yellow, nutrition factors in red, and genetic factors and gender in gray. UDVA uncorrected distance visual acuity; AL axial length; K1 the flat keratometry reading; MYOPICPARENTS2 two myopic parents; PUPIL_SIZE pupil diameter; SE spherical equivalent after cycloplegia; K2 the steep keratometry reading; GENDER male or female; PULSE heart rate; ROW quantiles of rows children sit in the classroom from 1 least to 6 most; READWEEKLY quartiles of weekly reading from 1 (lowest) to 4 (highest); DESK_LAMP the type of lamp (bulb); WHITEMEATS quartiles of frequency of eating white meat, such as fish and chicken, in the last 4 weeks from 1 lower to 4 upper; NUCVA near uncorrected visual acuity; BREAK quartiles of time keeping reading or doing close work before a break from 1 least to 4 most; ORIENTATION1,2,3 bedroom window orientated to south, west, and north, respectively (east as reference);

Figure 1 shows the weights of variables for the first study year with ocular parameters in green, environmental factors in yellow, nutrition factors in red, and genetic factors and gender in gray. The weights of variables from the second to the fifth study years are shown in Supplement B. Table 3 shows the regression coefficients of predictive variables in each study year. Six variables were significant risk factors for myopia progression in all study years: more myopia (P < 0.01 to 0.045), poorer uncorrected distance visual acuity (UDVA, P < 0.0001), longer axial length (P < 0.0001), being female (P < 0.0001), higher flat keratometry reading (K1, P < 0.0001), and having two myopic parents (P < 0.0001–0.027).

Table 3 Regression coefficients of predictor variables in the multivariate regression models of each study year

Figure 2 shows the weights of six prominent variables and two additional variables. During the five study years, these combined variables had a mean weight of 76.7% (range 69.1–86.1%). UDVA had the greatest weight (28.3%, 21.6–38.9%), followed by spherical equivalent (20.4%, 7–28.1%), axial length (12.6%, 10.2–14.4%), the flat keratometry reading (K1) (6.7%, 3.7–10.8%), gender (5.7%, 1.9–8.5%), and myopic parents (3.1%, 1–9.5%).

Fig. 2

Weights of important predictor variables in the random forest model during the five study years

Other variables were found to be significant in different study years (Figs. 1, 2 and Supplement figures). Wearing spectacles was significant at the fourth study year, with weight of 12.8%. Undergoing other myopia treatments (OMT) was significant at the second and third study years, with weights of 4% and 2.7%, respectively. Weekly time spent reading was significant in the first, third, fourth, and fifth study years (more reading, more myopia). Distance between the child's eye and book when reading was significant in the third, fourth, and fifth study years (farther distance, less myopia).

Figure 3 shows the curves of prediction accuracy with different absolute errors in each study year. When the absolute error between predicted and actual myopia progression was set at 0.50 D, the prediction accuracy was 80%. The accuracy increased to 90% for an absolute error at 0.75 D. The differences in mean myopia progression between predicted and actual values of each study year were less than 0.05 D.

Fig. 3

Prediction accuracy curves of random forest models in five study years, that is, the accumulated percentage of samples as a function of the absolute difference between predicted and actual spherical equivalent refractions

Discussion

Machine learning often uses many more variables in its prediction models because the emphasis is not on the significance of individual variables, but rather on the ability of the model to predict the dependent variable from a combination of factors. In this study on risk factors for myopia progression in primary school children, 68 variables, not including variables of binocular vision and accommodation such as phoric state or accommodative lag, were first screened using multivariate regression analysis. Among these, six variables comprising uncorrected distance visual acuity (UDVA), spherical equivalent, axial length, the flat keratometry reading (K1), gender, and myopic parents were included in the models for all study years, with a mean combined weight of 76.7%. The prediction accuracy based on these variables was greater than 80%.

During the five study years, UDVA always had the greatest weight (28.3%) with a peak at the second study year, indicating that UDVA was the best predictor for myopia progression and myopia screening [31]. This implies that UDVA of primary school children should be monitored frequently to identify children at risk of myopia. We also found that spherical equivalent was a significant predictive variable with a peak weight at the third study year. The successive weight peaks for UDVA and spherical equivalent might be explained by the peaks in myopia onset in grade 2 to grade 4 reported by previous studies [5, 32]. In other words, although more myopia was closely related to a lower UDVA, reduced UDVA occurred before myopia and thus acted as a more sensitive predictor of myopia progression.

Axial length was also found to be significant in our models, with moderate weight (12.6%), followed by the flat keratometry reading (K1) with lower weight (6.7%). Although the ratio of axial length to the corneal radius of curvature (AL/CR) is relatively good at classifying myopia grades [33], a recent study found that AL/CR was not useful in monitoring myopia progression in children due to a nonlinear relation between axial length and the corneal radius of curvature [34]. The successive lower weight of K1 and higher weight of axial length might reflect an active emmetropization process and final match between them [35].

A study on children in Singapore aged 6 months to 6 years reported that genetic factors (number of myopic parents) may play a more substantial role in early-onset myopia than environmental factors, of which neither near work nor outdoor activity was associated with early myopia [36]. Our findings were consistent with this: in the 7-year-old children at grade 1, having myopic parents carried a weight of 7%, whereas the total time spent on near work or outdoors was not a significant predictor. Furthermore, the influence of myopic parents decreased overall across the five study years (9.5%, 1.9%, 1.8%, 1.0%, 1.3%, respectively), possibly reflecting a decreased importance of genetic factors on myopia with age. It should be noted that myopic parents not only constitute a genetic factor, but are likely associated with myopigenic environments, such as more time spent at near work and less time outdoors [37].

Interestingly, continuous reading (i.e., the break variable) was significant in all five study years during primary school, indicating its association with myopia, which was consistent with our previous reports in grade 7 children [9] and a study of Australian children [8]. In the Australian study, myopia was also not associated with time spent doing near work [8]. Possible reasons for the lack of relation include inaccurate measurement of time on near work by questionnaire, dynamic changes in near work through different school years, and the effect of continuous reading (break) being masked within the total time on near work. In clinical trials, it has been demonstrated that making full use of recess time for outdoor activities can significantly control the development of myopia in children [6, 38]. Therefore, children’s myopia may be more affected by the breaks between periods of continuous reading than by the total time spent on near work. In addition, use of smartphones or other digital devices, which might be associated with myopia [39], was not measured separately in this study but was included in the total time on near work.

In this study, time spent outdoors was not significant in the model, even though it is often regarded in the literature as the most promising environmental factor for controlling myopia onset. In our previous study, time spent outdoors was associated with a change in axial length but not with a change in spherical equivalent, possibly due to insufficient statistical power [28]. Moreover, the longitudinal results in the present study suggest that environmental factors are dynamic, making it more difficult to accurately evaluate their effects on myopia progression in children. This is difficult to prove, however, because time spent outdoors is usually determined through generally rather inaccurate questionnaires. The use of wearable devices to monitor children’s studying and living environment, including time outdoors, might resolve the problem. This leaves the challenge of coping with the large amounts of data, which can be handled by machine learning.

K1 was probably a risk factor because it helps to form hyperopic defocus which has been shown to be the cause of prolonged axial myopia in animal studies. Although corneal power is basically stable after the age of 3 years, in some cases, it might produce a certain compensatory reaction in the case of excessive near work, especially when the myopic refractive error is not corrected by glasses for a long time.

The strengths of this study include a large sample size and a high follow-up rate. This study also has some limitations. First, the environmental variables were determined using a questionnaire, which might lead to recall bias. Second, our study built and tested the prediction model in the same cohort, although we divided the children into two groups; testing the model in other cohorts and populations is necessary to evaluate its generalizability. Third, the present work was designed to predict myopia progression only 12 months ahead. It is very challenging to accurately forecast the occurrence of any medical condition 10 years in the future, since during that period there may be many important environmental or behavioral changes in the patient’s life that such long-term predictions cannot take into account. Also, while it is possible to perform a retrospective analysis of the best predictors, it will be more difficult to prospectively validate these results in a 10-year follow-up study. For this reason, it is more realistic to use a series of short-term forecasts, allowing for treatment that is more relevant to the patient’s current situation.

Random forest, a machine learning model widely used in many data analysis studies, performed well in predicting myopia progression. Training the random forest model involved bootstrap sampling and random feature selection, which reduce model variance, improve generalizability, and help to avoid overfitting. This stabilizes the proposed random forest model and makes it more suitable for clinical practice. However, machine learning methods such as random forest act like a black box and are not easy to interpret. These techniques often use many variables in their prediction models because the emphasis is not on the significance of individual variables, but rather on the ability of the model to predict the dependent variable from a combination of factors. This may occasionally lead to assigned values that seem irrelevant to a clinician.

Multivariate linear regression can compensate for some of the shortcomings of random forest: it is easily interpreted, it reflects the effect of predictive variables on response variables, and it can identify important predictive variables and quantify their relative weights. It is therefore useful to combine classic regression models, which explore risk factors, with random forest, which builds highly accurate prediction models. Moreover, the random forest model can evaluate the importance of each predictive variable based on node purity in its decision trees, which might differ from the weights given by linear models.

Conclusions

We have built machine learning-based prediction models for myopia progression in primary school children for a range of study years. The models demonstrated good accuracy for predicting myopia progression and showed the interaction among different factors. Ocular factors had greater weight than environmental and genetic factors. Environmental factors are modifiable and could be used to control myopia in children; they deserve further study to evaluate their interaction effects and feasibility in different populations and individuals.