Introduction

During adolescence, teenagers undergo a plethora of biological and socially-driven developments that also influence their sleep–wake behaviour1,2,3,4. Their internal phase (chronotype) delays progressively with age until around 215, while sleep pressure (the homeostatic load) likely accumulates more slowly across the day compared to adults6,7. This predisposes teenagers, more than younger children or adults, to delay into late evening hours thereby also delaying their sleep timing. Early school start times cut teenagers’ sleep artificially short in the morning, forcing them to get up before they reach healthy amounts of 8–10 h of night-time sleep on schooldays. On weekends, teenagers sleep not only longer but also later which better suits their delayed circadian clock. This sleep timing difference between school and free days is called “social jetlag”, since it is a constant jetlag situation induced by social schedules8,9. These widespread sleep restrictions are not only connected to compromised health and well-being10,11,12,13,14, but also reduced cognition (e.g. constructive15 and creative thinking skills16,17 or verbal fluency18,19) and decreased academic performance20.

There is now substantial amount of evidence that delayed school start times help to increase sleep durations towards more healthy amounts, at least in the short-term21,22,23,24. However, it is less clear whether the increased sleep durations and better sleep quality also translate to better learning in the same students. This could reasonably be expected, given the prominent role of sleep for memory, sustained attention and concentration e.g.25,26,27.

Thus, several studies previously tried to evaluate the effects of delayed school start times on grades or scores. These studies yielded very mixed results, probably due to differences in study designs, interventions, exposure times and outcome measures28. However, all interventions assessed were static changes in school start times whereas the possibility to make school start times flexible has been largely overlooked although previously raised in a commentary29.

Here, we studied whether a flexible school start system, as implemented in a secondary school in Germany30,31, and concurrent changes in sleep were associated with changes in grades in multiple academic disciplines. The flexible system entailed that the school changed from a permanent fixed start at mainly 8 AM to a flexible school start that allowed senior students to choose daily whether to attend school at 8 AM or skip the first class and start at 8:50 AM. The flexible system was associated with a variety of short- and long-term changes in students’ sleep patterns on which we report in detail elsewhere30,31. In summary, on days with later starts, students slept on average about 1 h longer by delaying their offset times by 1 h but maintaining the same average onset times. Subjective sleep quality was improved, and the amount of alarm-driven waking slightly reduced. This was seen both immediately after changing into the flexible system30 as well as after 1 full year in the system31. Nonetheless, sleep patterns did not improve from the conventional to the flexible system overall since, on average, students did not accumulate enough late starts during the time of monitoring to result in a significant difference of the weekly average. However, there was wide variation between students on the number of late starts chosen and the sleep gain or loss achieved30,31.

In the present study, we analysed students’ quarterly grades from 12 academic school subjects across 4 years to examine effects of this new system on academic grades in detail. With 2.5 years of data prior and up to 1.5 years after the introduction of the flexible school start, we could control for some important confounders and address trends and complex interactions which started long before the system was changed. In addition, we also used longitudinal sleep data (duration, chronotype, social jetlag), which we had previously collected in these students30,31 to predict students’ quarterly grades by means of linear mixed regression models.

Methods and materials

German secondary school system

The German school system varies widely between federal states, but all students start formal schooling with primary school (children aged ~ 6–9) followed by several possible types of mostly public secondary schools. The minimum mandatory schooling years are 9–10 years. The current study took place at a public “Gymnasium”, which is the most academic secondary school type in Germany, granting access to higher education at the university level after successful completion of the final exam (“Abiturprüfung”). Gymnasiums offer a very broad education and thus specialisations are not allowed until senior years (grades 11–13) when students can then decide to focus on specific subjects. Core subjects (e.g. maths, German and one main foreign language) cannot be deselected and—depending on the state—are examined in the final exam. About 34% of all secondary school students in Germany attend Gymnasium32.

Study site and the flexible school start system

The study took place at the Gymnasium Alsdorf (50° 53′ N, 6° 10′ E), a public secondary school in the West of Germany. This particular school offers daily self-study periods during which students work through a personal 5-week curriculum with a teacher and on a subject of their own choice (so-called “Dalton system”33,34) to foster independent learning. This is different to most other Gymnasiums in Germany, which work with mostly teacher-centred methods and no personal curriculum.

On February 1st, 2016, the school changed permanently from a fixed start (“conventional system”) to a flexible start (“flexible system”) for 10–12th graders (senior students). In the conventional system, school started at 8 AM on most days. On a median of 1 day/week (depending on students’ individual timetables), however, school started with the second period at 8:50 AM. The conventional system is a common system seen across different Gymnasiums in Germany, although the exact start time might vary slightly between schools (~ 7:30–8 AM). The amount of afternoon classes and the closing times vary depending on students’ grade level and the school’s individual schedule. Most often, students have 2–4 long school days where they attend afternoon classes finishing roughly between 1 and 5 PM.

In the flexible system, the first period (lasting 08:00–08:45 AM) was made optional for senior students to attend. Senior students could thus choose daily whether to start at 8 AM with the first self-study period or skip it and start at 08:50 AM instead (called “9 AM” here for convenience) with a normal teacher-centred class. On a median of 1 day/fortnight, students also had a scheduled free second period (08:50–09:50 AM), i.e. the chance to start at 10:15 AM (“ > 9 AM”). Given the low occurrence rate of > 9 AM-starts, we did not distinguish between frequencies of 9 AM-starts and > 9 AM-starts in our analyses. Skipped first periods had to be made up for within the same week, either during gap periods or after classes ended. Skipped first periods had to be made up for within the same week, either during gap periods or after classes ended.

Free first or second periods or gap periods during the day are common in secondary schools in Germany. This reflects the different choices senior students make to focus on specialised subjects which results in individualised schedules for each student. For more information on this unique flexible system, please refer to30 which also includes an example timetable.

Study design

Official academic grades were obtained from the school registry for students that took part in our first wave in 201630 and the second wave in 201731. While grades were provided retrospectively at the end of the schoolyear of wave 2 for the past 4 years, sleep data were collected longitudinally in both waves (Fig. 1): Wave 1 consisted of baseline sleep diary data collection (= t0) covering 3 weeks in January, 2016 (Jan 8th to 31st) in the conventional system, followed by sleep diary data collection for 6 weeks (Feb 1st to Mar 14th) in the flexible system right after its introduction on Feb 1st, 2016 (= t1). Wave 2 covered the matching photoperiod and time of t1, lasting again 6 weeks (Feb 2nd to Mar 20th, 2017, = t2). No second baseline just before t2 was carried out since the school had remained in the flexible system. We excluded any sleep diary entries during the carnival holiday periods between February 4th–9th, 2016 and February 23rd–28th, 2017 from the analyses.

Figure 1
figure 1

Study design and outcome measures. Schematic of longitudinal study design including quarterly academic grades from up to 2.5 years prior to and up to 1.5 years after the introduction of the flexible system. The same students had also provided daily sleep diary data in two waves (one baseline assessment in the conventional system and 2 time points in the flexible system as described previously31.

Participants

Written informed consent was obtained from all participants (or their parents/guardians if < 18y). The study was conducted according to the Declaration of Helsinki and approved by the school board, the parent-teacher association, the school’s student association and the ethics committee of the Medical Faculty of the LMU Munich (#774–16). We used opportunity sampling without specific exclusion criteria. For response and attrition rates and filter criteria of sleep diary data and cohorts please refer to31. Academic grades were provided by the school registry from students that took part at any time during our study (t0 through t2). All included students were granted promotion to the next grade level during the study period.

Outcome measures

Sleep diary

We used a daily sleep diary (provided online via LimeSurvey.org) based on the μMCTQ35. Students provided sleep onset (note: not bedtime) and offset (wake time) of their past night’s sleep, the type of day they woke up (schoolday or free day), and when they started school (8 AM, 9 AM or > 9 AM). The questionnaire did not cover any naps during the day. For detailed sleep diary descriptions, please see30,31.

Academic grades

From the school registry, we obtained official quarterly grades awarded to participating students between the school year 2013/2014 through to 2016/2017. Of the 170 students from both waves qualifying for analysis (i.e. the cohorts previously described and used for sleep analyses; see31), 13 students had grades missing, thus resulting in a maximum sample of 157 students for the grade analyses. For the majority of these students (62%), grade data span 2.5 years in the conventional and 1.5 years in the flexible system; for those in grade level 10 at wave 2 (18%), it was 3 and 1 years, and for those at grade level 12 at wave 1 (15%) it was 2.5 and 0.5 years. The grades were provided for all academic subjects taken by a student, of which we included only 12 subjects in our analyses that most students took and assigned them to one of three disciplines: Sciences (Biology, Chemistry, Maths, Physics, Natural Sciences), Social Sciences (Geography, History), and Languages (English, German, Spanish, French, Latin). Provided grades were averages per academic quarter per academic subject over a mixture of written and oral examinations, course work and participation in class.

The school year lasted from the end of August to mid-July divided into the following quarters: quarter 1 until end of October, quarter 2 until third week of January, quarter 3 until third week of April and quarter 4 until first week of July.

In grade levels 7–10, the grading scale ranged from 1 (best) to 6 (worst) with grades ≤ 4 considered passing grades. This scale was additionally broken down into plus ( +) and minus (−) for all but grade 6 (1+, 1, 1−, 2+, 2, 2− and so on). In grade levels 11 and 12, the scale ranged from 0 (worst) to 15 (best) with ≥ 4 considered passing. Both scales were combined by transforming the 1–6 scale to a 0–15 scale based on its finer plus/minus system (i.e. 15 = 1+, 14 = 1, 13 = 1−, 12 = 2+, 11 = 2, 10 = 2− and so on) and then transformed to a more universal 0–100% scale (0 = 0%, 15 = 100%, then steps of 6.66% for grades in between).

Data analysis

Analyses were performed in SPSS Statistics (IBM, versions 24 and 25) and R (versions 3.6.1 and 3.6.3) using R studio (versions 1.1.463, 1.2.1335 and 1.2.5042). Graphs were produced using the r-package ggplot236.

Sleep data

Daily sleep data from dairies were aggregated and taken from t2 if available (else from t1). From these aggregates, we derived the following variables as per equations below: average daily sleep duration during the week (SDweek), chronotype as midsleep on free days (MSF) corrected for oversleep (MSFsc), social jetlag (SJL) as midsleep on free days (MSF) minus midsleep on work/school days (MSW), and frequency of ≥ 9 AM-starts. For the linear mixed Models 3a–d, we additionally calculated the absolute differences between t0 and t1 (i.e., from baseline to the flexible system during wave 1) for SDweek, MSFsc and SJL (X change).

$${\text{SD}}_{{{\text{week}}}} = ({\text{SD}}_{{{\text{schooldays}}}}*5 + {\text{ SD}}_{{{\text{free days}}}}*2)/7$$
$${\text{MSF}}_{sc} = {\text{ SleepOnset}}_{{{\text{free days}}}} + \frac{1}{2}{\text{SD}}_{{{\text{week}}}}$$
$${\text{SJL}} = {\text{MSF}} - {\text{MSW}}$$
$${\text{Frequency of}}\;{ } \ge 9{\text{AM-starts}} = { }({\text{n}}_{{9{\text{AM-starts}}_{{{\text{flex}}}} }} /{\text{n}}_{{{\text{schoolday-entries}}_{{{\text{flex}}}} }} )*100$$
$${\text{X}}\;{\text{ change}} = {\text{x}}_{{{\text{t}}1}} - {\text{x}}_{{{\text{t}}0}}$$

Statistical analyses

Unless indicated otherwise, descriptive statistics are reported as mean ± standard deviation and test statistics are abbreviated as follows: t, t-test; Pearson correlation; rho, Spearman rank correlation; b, unstandardized coefficient of linear regression or linear mixed models; bflex*change, unstandardized coefficient of the interaction of linear mixed models; p, significance level. The alpha-level was set to p < 0.05 for all statistical analyses. All data were tested on normality (histograms, QQ plots, Shapiro–Wilk’s test) and sphericity.

For simple grade analyses comparing accumulative grade point averages in the conventional versus the flexible system, a two-sided paired t-test was used. To this end, the grade point average was calculated as the mean grade across all subjects before or after the start-system change for each student. For more sophisticated grade analyses, we used linear mixed-effects regression models (lme4 and lmer test package37,38 in R). In total, 4 different models (plus model variations) were calculated to answer different questions based on different fixed effects, interaction terms and subcohorts (see overview Table 1). Student ID was added as random effect to all models to incorporate unsystematic differences between individuals. In all models, the outcome (dependent variable) was quarterly grades per discipline per student; the fixed effects (independent variables) were system (conventional/flexible), gender (female/male), grade level (7–12), academic quarter (1–4), and academic discipline (Sciences/Social Sciences/Languages), all entered as categorical variables. Model 1 additionally included interaction terms between discipline and gender to assess general grade influences, Model 2 included interaction terms between school start system and gender, and system and discipline to assess system effects per discipline and gender. In Models 3, we included one of the aggregated sleep-change variables (see equation above; mean-centred) as additional fixed effects, each in interaction with system (conventional/flexible): chronotype change (Model 3a), sleep duration on schooldays change (Model 3b), social jetlag change (Model 3c) or frequency of ≥ 9 AM-starts (Model 3d). In Model 4, we instead included the absolute value of chronotype, sleep duration on schooldays, social jetlag, and frequency of ≥ 9 AM-starts for the flexible system only (from t2 if available, else from t1 to maximize sample size). Since chronotype, sleep duration on schooldays, social jetlag, and frequency of ≥ 9 AM-starts were prone to collinearity, we first assessed their correlations before adding them into the models (Supplementary Fig. S1). Only chronotype and social jetlag were at least moderately correlated (rho = 0.65, p < 0.001; Supplementary Fig. S1), and results from models including just one of these variables each (4a–d) were essentially similar to Model 4e which included all sleep variables together (Supplementary Table S4). The variance inflation factor (car package in R39) also indicated no problematic collinearity for Model 4e. Marginal means of model estimates were calculated using emmeans in R40 for models where interactions were significant. All linear mixed models were visualised in tables using the sjPlot and sjmisc packages41,42 and in figures as marginal means via the ggeffects package43 in R. Simple contrast results from interactions in linear mixed models were averaged over the levels of system or gender (depending on the model), grade level, and quarter; degrees of freedom method used was Kenward-Rogers. Pairwise comparisons were adjusted with Tukey method.

Table 1 Overview of linear mixed model analyses on official, quarterly grades.

Results

In this study, we investigated whether a flexible school start system and concurrent changes in sleep (as previously described here30,31) were associated with changes in academic grades. During a first wave of sleep assessment30, the studied secondary school had changed from a conventional start system (mostly starting at 8:00 AM; baseline = t0) to a flexible start system with a daily choice between 8:00 and 8:50 AM (= t1; Fig. 1). The school has since maintained this system allowing for a second wave of sleep assessment after exactly one year (t2)31.

For the current study, we included quarterly grades of 63–157 students from these two waves irrespective of time of participation (i.e. t0/t1, t2 or all time points) (Fig. 1). The sample size varies depending on the analysis question and thus with the respective regression model calculated (see Table 1). The majority of included students were females (63–68%), were in grade levels 10 or 11 (but levels 9 and 12 were also included), and used the late-start option (“ ≥ 9 AM-use”) on about 24–28% of all recorded schooldays (see Table 2 for more cohort characteristics). In total, we analysed 5111–16,724 grades (on average 107 individual grades per person) that students received in 12 academic subjects over 2.5 years in the conventional system and 0.5 to 1.5 years in the flexible system (Fig. 1). Grades were provided by the school registry and transformed to a 0–100% scale and labelled (not aggregated) as Languages, Sciences, or Social Sciences. Median grades were 53–60% in Languages, 60% in Sciences, and 60–67% in Social Sciences (Table 2). To complement the analyses, we also used several of the existing sleep variables (chronotype expressed as MSFsc, sleep duration, social jetlag) in the flexible system and their respective change (delta) from baseline (t0) to the flexible system (t1) as well as the frequency of ≥ 9 AM-use in several model specifications (Tables 1 and 2).

Table 2 Composition of study cohorts. Displayed are cohort characteristics after standard filter criteria.

School start system showed no systematic effect on academic grades overall

First, we investigated whether the flexible system allowed students to increase their grades without considering sleep variables. At first sight, a simple comparison of overall grades yielded a small but statistically significant improvement in grade point average from 58.2% (± 2.1 SD) in the conventional to 59.6% (± 2.0 SD) in the flexible system (Fig. 2a; t[154] = − 2.15, p = 0.033, dz = 0.173). However, attributing this improvement to the flexible system is likely unwarranted. As outlined in the introduction, grades are influenced by a multitude of factors, thus comparisons that do not account for these can be misleading. We therefore applied linear mixed-effects regression models to adjust for potential confounders and including random intercept for ID to account for inter-individual differences (Table 1). When incorporating gender, grade level (i.e. indirectly age), academic quarter, and discipline in addition to school start system in the analysis, the flexible system showed no systematic relationship with students’ grades (Fig. 2b; b = − 0.10, p = 0.815, Model 1, Supplementary Table S1), hence the flexible system was not associated with students receiving better or worse grades overall in our sample (nstudents = 157).

Figure 2
figure 2

Longitudinal analysis of official quarterly grades—effects of school start system and general predictors. Quarterly grades (0–100%) from 12 academic subjects of 3 disciplines for 4 years i.e., for most students this was 2.5 years before and 1.5 years after the flexible school start was introduced (n = 157 students; 16,724 grades; 107 grades per student on average). (a) Simple, unadjusted comparison of average grades across all disciplines in the conventional and the flexible school start system via paired t-test (nID = 157). Shown are mean and 95% CI within the raw data distribution (violin plots). The apparent grade improvement in the flexible system was not confirmed in linear mixed models. (b–g) Visualization of mixed-model-determined influences on grades. Plots show marginal means from Models 1 and 2 (Supplementary Table S1), i.e. the estimated grade and 95% CI for the reference category (female student, class level 10, quarter 1, languages, conventional system). Statistical significance is indicated in (b), results of more complex cases can be found in the text and Supplementary Tables S1 and S2. (b) Effect of school start system (Model 1). (c) Effect of grade level (Model 1). (d) Effect of academic quarter (Model 1). (e) Effect of academic discipline by gender (Model 1). (f) Effect of school start system by gender (Model 2). (g) Effect of school start system by academic discipline (Model 2).

Grades varied systematically with grade level, academic quarter, discipline and gender

But what drives better grades in the unadjusted comparison if not the flexible system itself? The same factors that we adjusted for in the regressions also stood out as major predictors (Model 1, Supplementary Table S1, nstudents = 157): Students in 12th grade (the last school year) did consistently better compared to their peers across all other grade levels—a sort of “leavers effect” that has already been observed before (Fig. 2c, b = 3.44, p < 0.001)44. Moreover, we found that students enjoyed a bump in grades in the last quarter of the school year with an estimated improvement of 2.3 percentage points compared to the first quarter (Fig. 2d; b = 2.34, p < 0.001). The combination of these two effects might explain the statistically significant improvement observed in the unadjusted comparison: the flexible system replaced the conventional system mid-year between quarter 2 and 3, so quarter 4 and higher grade levels were overrepresented in the flexible system, which the t-test could not account for.

The mixed models also revealed other strong systematic influences on grades in our sample. Firstly, we observed a clear difference between the disciplines: students performed generally best in Social Sciences, followed by Sciences and then Languages (Model 1, Supplementary Table S1). Post-hoc tests (Fig. 2e) showed that these differences were highly significant for both genders (all p < 0.001; post-hoc to Model 1, Supplementary Table S2), except for girls’ grades in Sciences and Social Sciences, which were indistinguishable (b = − 0.47, p = 0.3895; post-hoc to Model 1, Supplementary Table S2).

Female gender has been reported as another driving force for higher grades45. However, girls in our sample did not outperform boys overall (Model 2, Supplementary Tables S1 and S2). Girls were significantly better in Languages (Fig. 2e; b = 4.72, p = 0.0284; post-hoc to Model 1, Supplementary Table S2), while boys surpassed them in the Social Sciences (b = − 3.31, p = 0.1269; post-hoc to Model 1, Supplementary Table S2), and both genders did equally well in Sciences (b = − 0.00, p = 0.9915; post-hoc to Model 1, Supplementary Table S2).

The flexible system was linked with subtle improvements in languages and subtle drops in social sciences grades

Although we did not find evidence that the flexible system was linked with better grades overall (Model 1, see above), the flexible system might be linked with grade improvements in certain disciplines and genders. To assess this, we looked at the interaction between (1) school start system and discipline, as well as (2) school start system and gender in a second model (Model 2; Supplementary Table S1, nstudents = 157). Neither females nor males significantly improved their overall grades from the conventional to the flexible system (Fig. 2f; post-hoc to Model 2, Supplementary Table S2). In terms of discipline effects, we found that grades in Social Sciences slightly dropped (b = 1.26, p = 0.0384; post-hoc to Model 2, Supplementary Table S2), Science grades remained unchanged (b = − 0.07, p = 0.8849; post-hoc to Model 2, Supplementary Table S2), and Language grades slightly improved (b = − 1.30, p = 0.0168, post-hoc to Model 2, Supplementary Table S2) in the flexible system. Notably, these changes were subtle but reduced the grade differences between the academic disciplines (Fig. 2g, Supplementary Table S2). These small changes in opposite directions likely explain the absence of a net effect of the flexible system on overall grades.

Improvements in chronotype, sleep duration, and social jetlag did not systematically improve grades

What was the role of sleep parameters on grade developments? We speculated that students who showed greater improvements in the flexible system (i.e., advanced chronotype, lengthened sleep duration, and lowered social jetlag) also received better grades in the flexible system. Thus, we computed changes in sleep from t0 (baseline) to t1 for each student in the subpopulation of students with sleep parameters during these time points (n = 63; Table 2). On average, these students showed a small delay in chronotype of 0.2 h (MSFsc, range = − 1.6–2.2 h), a small reduction in social jetlag of 0.3 h (− 2.0–1.9 h) and a small sleep gain of 0.1 h (− 1.3–1.1 h) but large inter-individual differences as reflected by the ranges. Adding these parameters separately into a third model (Models 3a–c, Supplementary Table S3, Fig. 3a), we found that neither changes in chronotype (flex*chronotype change: b = 0.10, p = 0.845, Fig. 3a,b) nor changes in sleep duration (flex*sleep duration change: b = − 0.77, p = 0.352, Fig. 3a,c) were systematically associated with changes in grades. Surprisingly, however, students who increased their social jetlag in the flexible system obtained slightly better grades in the flexible system (flex*social jetlag change: b = 1.28, p = 0.027; Fig. 3a,d), which was contrary to our hypothesis. Therefore, our analyses in this subsample suggest that sleep improvements experienced immediately after transitioning to the flexible system did not result in detectable higher academic achievement.

Figure 3
figure 3

Longitudinal analysis of official quarterly grades—effects of sleep and 9 AM-use. Results from linear mixed model analyses of quarterly grades (0–100%) considering sleep variables and the frequency of ≥ 9 AM-starts (see Table 2 for sample descriptions). (a,f) Schematic of the structure and results from Models 3 and 4 (Supplementary Tables S3 and S4) showing the outcome, official quarterly grades (center), all predictors (black-framed boxes), the statistical significance of their effect (arrows; black: p < 0.05, grey: p ≥ 0.05), the unstandardized regression coefficients (b-values) and ID as random intercept (dashed box). General predictors (white) are categorical variables, so the levels with the highest impact are shown compared to their reference (female, grade level 10, quarter 1, languages). b-values are approximate in (a), indicated by ≈, as representing results from Models 3a–d. (a) Effect of changes in sleep and of 9 AM-use on grade improvements from the conventional to the flexible system. Summarized results from Models 3a–d (nID = 63; Supplementary Table S3) where each model variation included a different yellow predictor in interaction with school start system (conventional/flexible; bflex*change) to model effects of sleep changes on grade changes. (b–e) Visualization of the yellow interaction effects from (a) via marginal means, i.e. grade estimates and 95% CI for the reference (female student, class level 10, quarter 1, languages) and categorical splits in the continuous sleep change variables to facilitate display. The effect of school start system on grades by (b) chronotype change (advance/delay), (c) sleep duration change (sleep loss/sleep gain), (d) social jetlag change (reduction/increase) from the conventional to the flexible system, and by (e) the frequency of 9 AM-use (< 2 days/ ≥ 2 days) in the flexible system. (f) Effect of absolute sleep characteristics on grades in the flexible system. Summarized results from Model 4e (nID = 129; Supplementary Table S4) predicting grades only for the flexible system, i.e., 1.5 years post-change, including the red sleep predictors in one common model after running separate Models (4a–d) to check for collinearity. (g–j) Visualization of the red effects from (f) via marginal means, i.e. grade estimates and 95% CI for the reference (female student, class level 10, quarter 1, languages). *p < 0.05; **p < 0.01; ***p < 0.001.

If not linked to sleep improvements, were grades nonetheless linked with the choice of more later school starts? The results of Model 3d calculated to answer this question suggests that higher 9 AM-use was associated with worse grades in the conventional system (b = − 3.04, p = 0.015), a link reversed partly—albeit not significantly—in the flexible system (flex*9 AM-use: b = 0.59, p = 0.101; Fig. 3a,e and Supplementary Table S3). Hence, students who made high use of the late-start option in the flexible system (group median was 24% of schooldays or 1 day/week; Table 2) were predominantly students obtaining lower grades, but they tended to benefit at least slightly from more later starts.

No systematic effects of chronotype, social jetlag and sleep duration on grades

Lastly, we investigated if we could find absolute effects of sleep variables (chronotype, social jetlag, sleep duration) and ≥ 9 AM-use on grades in the flexible system (Model 4, n = 129 students). Table 2 shows descriptive statistics of these three variables in this cohort (notice the wide range): chronotype or midsleep time on weekends corrected for oversleep was on average at 4:40 AM (range = 2:10–8:35), social jetlag was on average 2.0 h (0.5–6.0 h) and students slept on average 7.2 h (5.2–9.0 h). In contrast to what we had expected, none of the sleep parameters showed any significant link with grades, independent of whether they were added separately into the model (Model 4a–d; Supplementary Table S4) or together (Model 4e; Fig. 3f, Supplementary Table S4). Our results thus indicate that late chronotypes in our sample were not worse off compared to their early peers (Fig. 3f,g) and that longer sleep duration in the flexible system did not improve grades received in the flexible system (Fig. 3f,h). Similarly, social jetlag did not hamper grades to such a degree that we could detect an effect (Fig. 3f,i). Furthermore, frequency of attending school later was at a median of 1.5 days per schoolweek in this sample (28% of schooldays, IQR = 10–52%; Table 2), and we found that per every additional day a student chose to go to school later, grade estimates non-significantly decreased by 1.32 in the full Model 4e (p = 0.272, Fig. 3f,j; but see the single Model 4d in the Supplementary Table S4: b = − 2.12, p = 0.022). Although the interpretation is slightly different, this result tallies with the above finding from Model 3: At first sight, it looks as if attending school later more often would prevent students from getting better grades, but we argue that most likely it is the other way around; students who receive worse grades also liked to attend school later more often when they had the chance to do so. Overall, we could not show that chronotype and social jetlag negatively influenced grades, and it seemed as if mainly students who previously achieved lower grades in our sample liked to use the ≥ 9 AM option.

Discussion

Adolescence is a decisive time in life for teenagers around the world. Teenagers undergo many cognitive, emotional and brain structural changes that also shape their risk-taking behaviour, learning capacities and motivation to attend school46,47. A prominent change also occurs in their daily sleep–wake behaviour: teenagers tend to phase-delay their sleep–wake behaviour, which essentially means that they become night-owls5,48,49,50,51,52. This delayed phase, however, clashes with early school starts seen across many countries, thus cutting sleep short in the morning hours during the school week. Apart from many other negative (health) consequences10,11,53,54,55,56, short and low-quality sleep as well as sleepiness likely influences academic success20,57, which in turn is an important determinant of future career trajectories58. Since sleep restrictions and poor sleep habits are more severe in minority groups and disadvantaged students59,60, addressing this problem is key to closing the achievement gap between social groups. However, the evidence is not conclusive whether delayed school start times can ameliorate this pressing health and performance problem. Many previous studies suffer from study design limitations, outcome variables are not comparable, and long-term studies that track individuals over time are rare28. Here, we studied whether a novel timetabling system—a daily chosen flexible school start—has the potential to improve academic grades via improved sleep.

In our study, we found that the flexible system was only associated with higher grades at first sight. When not adjusting for confounding factors, we observed a small improvement of grades in the flexible system, which would be in line with some previous studies e.g.,55,56. However, we argue that such simple pre-post analysis of aggregated grades is not suited to answer this complex question—although this has frequently been done using cross-sectional data. Studies on grades that performed proficient analyses, such as mixed regression models44, quantile regression models61 or difference-in-difference approaches62,65,64 accounting for available confounders provided mixed results and mostly small effect sizes (for a systematic review see28). Nonetheless, positive effects of delayed school start times on academic achievement have been widely proclaimed—bound to raise falsely high expectations in parents and teachers. When we considered grade level, discipline and quarter in mixed model analyses, we found that the flexible system was clearly not associated with overall grade improvements except for subtle increases in Languages and subtle decreases in Social Sciences. In fact, the “confounders” weighed much stronger in our sample than any school start system effect on individual disciplines: graduating students did constantly better, highest grades were given in the final quarter of the year, and students were most successful in Social Sciences. Furthermore, the interplay between gender, discipline and school start system on grades is complex.

Importantly, we also did not find any expected relationships between chronotype, social jetlag, or sleep duration with grades in our sample. Neither changes in these sleep parameters from the conventional to the flexible system nor their absolute values in the flexible system showed any link with grades—except for changes in social jetlag. Surprisingly, an increase in social jetlag, not a decrease, in the flexible system was predictive of higher grades in the flexible system. We have not been able to identify obvious explanations for this finding in exploratory analyses, except for the fact that weekend sleep was much more variant and backed by fewer data points than schoolday sleep, pointing towards a potential chance finding. A likely explanation for our null-finding for the other sleep parameters is a possible lack of power in our sample of 157 students (even though we have > 16,000 longitudinal grades) given the small effect sizes previously identified (ranging around < 0.1 SD; see28). A second possibility is that the time frame students were exposed to the new system was too short (exposure length) or that the delay was too little or infrequent (dose) in our study. Furthermore, sleep variables obtained at discrete study points might not be reflective of sleep during the other academic quarters or years. Thus, we cannot preclude that we missed a subtle effect in our sample but any such effect is likely extremely small. This is also in line with several other studies that were unable to find any effect or meaningful improvements28.

Importantly, the fact that we did not detect systematic improvements in students’ grades does not mean that there were no improvements in learning. There is a substantial body of evidence supporting that both acute and chronic sleep loss compromises alertness, cognitive performance and memory, and reduces engagement to perform well (performance effort)15,65,66. Thus, improving sleep in sleep-deprived teenagers is very likely to improve their learning67,70,69. In addition, one could speculate that the flexibility and the thus putatively increased self-responsibility and self-determination of students in the flexible system, paired with the reported increase in motivation on later days, may also further improve learning. The question is whether better learning mediated by improved sleep also translates into better grades—and how much sleep improvement is needed and within what timeframe.

Additionally, students’ learning is strongly affected by many factors beyond those captured in our study or those of others on this topic. Models of teaching and learning include several nested factors, such as the individual student (e.g. motivation and prior knowledge), the individual teacher (professional competence70,71), the learning environment (e.g. socio-economic status or native language) or factors of instruction (generic and subject-specific instructional quality) but also class-level factors (learning atmosphere or class mates)72. Of these, especially instruction and teacher-level factors greatly influence students’ learning73,74. Furthermore, grades are inherently suboptimal measures of students’ academic performance, as teachers also include other factors such as compliance, effort, attitude, or behaviour in their assessment75.

Therefore, it may be a big ask and possibly naive to expect grades to improve noticeably and within a few months after delays or a flexible system have been affected. Rather, we should acknowledge students’ maintained achievements under potentially less effort and improved learning capacities (this needs to be assessed in future studies) in addition to the gift of more sleep and better well-being. Indeed, teachers at the studied school reported perceiving students as more alert and motivated and tardiness rates as decreased (personal communication). On the other hand, grades still determine future career trajectories and open doors to higher education in many countries76. In this sense, they do have a greater importance for careers—at least early on—than other measures of performance, such as standardised or non-standardised tests in class, which might be more valid for measuring academic performance under certain conditions. Additionally, despite all described influences, grades—as indicators of prior knowledge—seem to be the best predictor for achievement in university courses77.

Our study has several limitations that have not yet been mentioned. The analyses are based on quarterly aggregates of grades, so we could not consider the type of grades (oral, written, etc.) and their weights thus prohibiting detailed analyses on grade types and their possible competence aspects (e.g. knowledge, attitude, ability).

Additionally, we did not obtain information about teachers’ competence, their instructional quality or classroom atmosphere but accounted for gender, quarter, grade level, and discipline—factors that are often overlooked in the field. We also lacked socio-demographic information, which likely influence grades, such as the socioeconomic status (SES), parents’ education or ethnicity of students and their parents78. The vast majority of students included in this study were Caucasian by observation, so we had low variation with regards to ethnicity. Furthermore, the school studied was a Gymnasium, which is the most academic secondary school type in Germany leading directly to higher education. Therefore the generalisability of our results to non-Gymnasium secondary students in Germany(~ 65%32) is limited. Lastly, we did not collect objective measures of cognitive performance through cognitive test batteries but asked students to self-evaluate their quality of study and concentration levels31. We also did not collect information on napping behaviour, which might have led to an underestimation of sleep duration overall thus underestimating the positive effects of flexible start times on sleep.

In conclusion, we highlight that current early school start times around the globe are detrimental for sleep and health and likely do not allow students to excel as much as they could. Many studies have shown positive effects on sleep or well-being, when school starts were delayed21,22,23,24 or in systems were school already starts much later, such as in Uruguay or Argentina79,80. Thus, it seems fair to argue that later starts are beneficial for students in terms of health and well-being. These factors form a profound basis for good academic achievement but there are also numerous other factors that play into this and possibly mask positive effects: for example, teachers might not perform at their best later during the day or adjust their grading under the new bell times to achieve normal distributions of performance; the dose of delay might need to be higher (i.e. more delay or more uptake of later starts in a flexible system) and exposure time might need to be longer until an effect emerges, or grades are insensitive to this kind of intervention, to name only a few. But despite these complications, it should be emphasized that students can maintain their grades in addition to better sleep and well-being—a central and very important achievement in its own right. Indeed, students might need to spend less time on their homework, learn more easily or show improved sustained attention while still achieving similar but not improved grades. This question should be addressed in future research, e.g. by distributing questionnaires on time use data in general but also time spent on homework, learning and other academic activities, or running cognitive and attention tests in the field (such as the psychomotor vigilance task assessed for example with the NASA PVT + app81). Furthermore, more studies are required on how to harness the unique advantages of flexible start systems, such as promoting students’ responsibility, choice and investment, for optimal sleep and learning gains.