1 Introduction

Higher education experienced a severe disruption in 2020 with the COVID-19 pandemic, when universities all over the world did their best to continue educational activities through online teaching and assessment (Hodges et al., 2020). The consequences of this prolonged crisis are still unknown; there is a need for rigorous assessment data that can provide a glimpse of its impact on learning (IESALC-UNESCO, 2020; UNESCO, 2020). Analysis of assessment data obtained during the pandemic will provide a clearer picture of the size and direction of the “educational catastrophe” that some scholars and the media have predicted, so that appropriate measures can be implemented (Jankowski, 2020; Lake & Olson, 2020; UNESCO, 2021). The term “pandemic learning loss” has appeared frequently in the media and academic literature, and it is crucial to document its size. It is also important to explore whether the pandemic’s effects on learning differ across educational levels, areas of knowledge, countries, and socioeconomic realities (Lake & Olson, 2020; Skye et al., 2020). Data are beginning to emerge regarding learning loss in basic education, but so far they are scant in higher education (Azevedo et al., 2021; Engzell et al., 2021; Maldonado & De Witte, 2020; Donnelly & Patrinos, 2022).

For universities, the change was unexpected and required emergency measures of heterogeneous quality, including online assessment testing (IESALC-UNESCO, 2020; Lake & Olson, 2020; UNESCO, 2020). The validity and use of test results in this context required complex analyses and decisions, with mixed arguments about their usefulness and risk of biases (Jankowski, 2020; Lake & Olson, 2020).

Objective tests are the preferred instruments to assess students’ knowledge in a valid and reliable manner (AERA, 2014; Lane et al., 2015; Sánchez-Mendiola, 2020). There is widespread agreement that large-scale multiple-choice question (MCQ) testing can measure knowledge in a precise, timely, and cost-effective manner in large populations of students (Lane et al., 2015). Online testing has added advantages in terms of ease of administration, speed of analysis and scoring, and lower paper use, and it is becoming a preferred option when the technology and infrastructure are available (Butler-Henderson & Crawford, 2020; Jankowski, 2020). Computer-aided testing also has several disadvantages: the cost of software licenses, equipment, and connectivity infrastructure, and the need for digital skills among faculty and students, among others (Butler-Henderson & Crawford, 2020; Dennick et al., 2009). An important challenge during the pandemic has been the difficulty of direct proctoring and supervision during testing, which is more complex when testing occurs in students’ homes (technological issues, Internet access, electricity, possibility of cheating) (Daffin & Jones, 2018; Reisenwitz, 2020; Şenel & Şenel, 2021; UNESCO, 2020).

The National Autonomous University of Mexico (UNAM) has a long tradition of applying large-scale standardized testing for diagnostic and high-stakes exams (Valle, 2012). Since 1995, the university has applied a yearly diagnostic exam to newly admitted students to assess knowledge in several areas, plus English and Spanish (Martínez-González et al., 2018, 2020; Valle, 2012). The university’s Educational Evaluation area is a centralized department tasked with developing, validating, implementing, applying, and scoring these exams, as well as publishing institutional reports. The exams are MCQ tests built through an evidence-based process to create academically grounded instruments. The diagnostic exams have traditionally been applied face to face during the first weeks of the academic year, in the university schools’ classroom facilities, with faculty proctoring and a massive deployment of logistical resources, since the student cohorts number in the tens of thousands (Martínez-González et al., 2018, 2020; Valle, 2012). The goal of the diagnostic exams is to obtain information about students’ knowledge level at admission, identify areas of high and low performance, and implement remediation strategies to decrease dropout and academic delay during the first year of their academic trajectory.

The pandemic forced the university to apply the admission diagnostic exams online, opening a window of opportunity to collect data about the process and contrast the information obtained with previous exam administrations. The goal of this study was to compare and contrast knowledge levels in newly admitted university students, in pre- and transpandemic cohorts, and to analyze the results by knowledge area and gender.

2 Methods

2.1 Research design

The research design was quasi-experimental with static group comparisons, taking advantage of the pandemic “natural experiment” (Campbell & Stanley, 1963; Fraenkel et al., 2018).

2.2 Setting

UNAM is the largest public university in Mexico (> 369,000 students, > 42,000 faculty, 133 degree programs, http://www.estadistica.unam.mx) and one of the largest in the world. Each year, about 35,000 students are admitted to the university (Sánchez-Mendiola et al., 2020).

2.3 Population and sample

The studied populations were four cohorts admitted to the university in 2017 and 2018 (before the pandemic, control groups) and in 2020 and 2021 (during the pandemic, intervention groups). Each pair of cohorts was evaluated with the same instrument (2017 and 2021 used the same test; 2018 and 2020 had the same exam). The diagnostic exam is not mandatory and has no summative implications, although the majority of students decide to take it. The vast majority of applicants are based in Mexico City and the surrounding metropolitan area. In Mexico, all schools and universities closed due to the pandemic lockdown in mid-March 2020 and stayed closed for more than 200 days. Students who took the diagnostic exam in August 2020 finished high school at the end of Spring 2020, so their final three months or so of high school were spent in emergency remote teaching from home. The August 2021 cohort completed roughly the last year of their high school education through distance learning.

2.4 Instrumentation

The diagnostic exam is an MCQ test composed of two portions with 120 items each: a general knowledge (GK) test (mathematics, physics, chemistry, biology, world history, Mexican history, literature, and geography) and an English and Spanish (ES) exam. The proportion of items in the general knowledge test varies depending on the degree program area to which students are admitted. At UNAM, degree programs are classified into four areas: area I, physics, mathematics, and engineering sciences (PMES); area II, biological, chemical, and health sciences (BCHS); area III, social sciences (SS); and area IV, humanities and arts (HA). There are four versions of the general knowledge exam, one for each area. Only area IV includes philosophy items (Table 1).

Table 1 Structure of the general knowledge diagnostic exam, UNAM, Mexico City

The English and Spanish tests have 60 items for each language. The Spanish portion assesses four areas: reading comprehension (16 items), grammar and composition (23), vocabulary (9), and orthography (12). The English test evaluates the first three proficiency levels of the Common European Framework of Reference for Languages (CEFRL, 2001): beginner (Level A1), high beginner (Level A2), and low intermediate (Level B1). Each level has 20 items. Applicants must answer correctly at least 75% of the items in the block corresponding to a level in order to be classified at that level; those who do not reach the first level are placed in a “not classified” category. Traditionally, UNAM first-year students have low scores in English, as is frequent in public universities in Mexico.
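As an illustration of this classification rule, the following sketch (in R, with our own function and variable names, not the operational scoring code) applies the 75% threshold to each 20-item block sequentially; the sequential interpretation of the levels is an assumption based on the description above.

```r
# Illustrative sketch only: classify an applicant's English level from the
# number of correct answers in each 20-item level block (A1, A2, B1).
classify_english_level <- function(a1, a2, b1, threshold = 0.75, block_size = 20) {
  passed <- c(A1 = a1, A2 = a2, B1 = b1) / block_size >= threshold
  if (!passed["A1"]) return("Not classified")  # did not reach the first level
  if (!passed["A2"]) return("A1")              # beginner
  if (!passed["B1"]) return("A2")              # high beginner
  "B1"                                         # low intermediate
}

classify_english_level(a1 = 18, a2 = 16, b1 = 10)  # returns "A2"
```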

Since the general knowledge exams have a different structure for each area of knowledge, they were analyzed separately. The English and Spanish tests were analyzed for the total population, since they are the same for all areas.

The contents of the diagnostic exam applied after admission to the university focus on the fundamental learning outcomes expected at the end of high school and cover the official high school curriculum in Mexico. The test blueprint and items are validated by teachers with content expertise in each of the areas explored by the test. The main goal of the diagnostic exam is to assess the level of knowledge of students admitted to the university.

These exams are developed following good practices for objective large-scale standardized tests, as has been previously reported (Sánchez-Mendiola et al., 2020). In summary, the test blueprint is developed by content-expert groups, led by test development staff from the UNAM Department of Educational Evaluation, using the learning goals of the official Mexican high school curriculum as the construct to be measured. The university has created a large item bank over the course of many years, with items piloted and evaluated through psychometric analysis. Every year, items are added to the pool to renew and expand the number of available items, and different items are applied on each occasion. In this study, for purposes of pre- and transpandemic comparison, the test applied in 2020 was the same as the 2018 exam, to decrease instrument bias. The 2018 test was applied face to face in paper-and-pencil format, and the 2020 exam was applied online with remote proctoring. We also included the results of two more years, 2017 (prepandemic, paper and pencil) and 2021 (transpandemic, online), to provide a broader perspective; these years were chosen because the same instrument was used in both cohorts. Test development, application, and scoring are performed with several quality-control steps along the process, following standard practices for large-scale testing.

Tests are piloted prior to application, and test item selection from the item bank has strict criteria (moderate difficulty, high reliability, and high discrimination indices). The psychometric data obtained with classical measurement theory (CMT) and item response theory (IRT) have been previously reported (Martínez-González et al., 2018, 2020; Sánchez-Mendiola et al., 2020).

2.5 Test application

The 2017 and 2018 tests were applied in a paper-and-pencil printed format with Op-scan MCQ answer sheets, in their respective schools’ classrooms. The central Evaluation Department coordinated the logistics, and each school applied the test with supervision. The test was applied in groups of varying size, depending on the school facilities and student population, with direct proctoring by faculty. The total time for answering both exams was three hours. The answer sheets and exams were collected and analyzed in the Evaluation Department, and the results were collated in a report sent to the university authorities and each school.

Due to the pandemic lockdown, in 2020 and 2021 the test was applied online at home, with specific instructions to minimize cheating (Butler-Henderson & Crawford, 2020; Dennick et al., 2009). The university developed an in-house digital platform for the design, application, and scoring of online exams in large populations (EXAL), which allows real-time monitoring of each test-taker’s responses, as well as the time spent on each item and the total test-taking time. It registers screen and mouse/keyboard activity and flags anomalous events as incidents, issuing alerts to the remote test proctors. It does not use real-time video monitoring, which was not feasible in our setting due to the technological and Internet connectivity limitations in many test-takers’ homes.
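EXAL’s internal flagging rules are not described here; purely as a hypothetical illustration of this kind of activity monitoring, the following R sketch flags events such as leaving the exam window or long idle periods. The event names, columns, and thresholds are assumptions, not EXAL’s actual logic.

```r
# Hypothetical illustration only: flag anomalous test-taking events from an
# activity log. Column names and thresholds are assumed, not EXAL's rules.
events <- data.frame(
  student_id       = c(101, 101, 202),
  event            = c("window_blur", "idle", "window_blur"),
  duration_seconds = c(45, 120, 10)
)

flag_incidents <- function(events, max_blur = 30, max_idle = 300) {
  subset(events,
         (event == "window_blur" & duration_seconds > max_blur) |
         (event == "idle"        & duration_seconds > max_idle))
}

flag_incidents(events)  # flags the 45-second window_blur for student 101
```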

Instructions were sent to students beforehand to verify technical compatibility and to address ethical, security, and technical issues. The testing process was piloted in advance, and three hours of testing time were allowed. Test applications were completed without major problems.

2.6 Psychometric and statistical analysis

Student performance was measured with percent correct response scores (Morduchowicz, 2006). Average test scores were calculated, as well as differences between 2017–2021 and 2018–2020. Since the study uses observational data, we used propensity score matching (PSM) to build balanced, comparable control (prepandemic, face-to-face testing) and experimental (transpandemic, online testing) groups, accounting for covariates that may predict receiving the intervention (Rosenbaum & Rubin, 1983). For each pair of comparisons (2017–2021, 2018–2020), the conditional probability for each individual to be exposed to the intervention was calculated from a combination of the observed variables of interest, using the software R (r-project.org). The variables used to calculate the propensity score were gender and high school of origin, in agreement with our previous studies (Martínez-González et al., 2018). PSM assumes that two students with similar propensity scores have the same distribution of explanatory variables, so samples built with this criterion help to guarantee comparable groups. The technique used was nearest neighbor matching with a caliper.
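The study reports only that propensity scores were estimated in R from gender and high school of origin, with nearest neighbor caliper matching; a minimal sketch of one common way to do this, assuming the MatchIt package and illustrative, simulated data (not the authors’ actual code or variables), is shown below.

```r
# Sketch of the matching step, assuming the MatchIt package; the simulated data
# and column names are illustrative only.
library(MatchIt)

set.seed(1)
students <- data.frame(
  cohort      = rbinom(2000, 1, 0.5),                  # 1 = transpandemic, 0 = prepandemic
  gender      = sample(c("F", "M"), 2000, replace = TRUE),
  high_school = sample(paste0("HS", 1:10), 2000, replace = TRUE)
)

m <- matchit(
  cohort ~ gender + high_school,   # covariates named in the study
  data     = students,
  method   = "nearest",            # nearest neighbor matching
  distance = "glm",                # logistic-regression propensity score
  caliper  = 0.2                   # caliper in SD units of the propensity score
)
matched <- match.data(m)           # balanced pre/transpandemic samples
summary(m)                         # covariate balance diagnostics
```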

For the knowledge exams in test cohorts 2018–2020 and 2017–2021, pairing was performed by area of knowledge; in the Spanish and English exams, it was not necessary to divide by area since the total population completed the same instrument in each application. Results were also analyzed by gender.

Inferential statistics for group differences were computed with Student’s t-test for independent samples. Cohen’s d with pooled standard deviations was calculated as a measure of effect size between cohorts (Cohen, 1988).
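A minimal R sketch of these two computations, on simulated percent correct scores (not the study’s data), is shown below.

```r
# Minimal sketch: Student's t-test for independent samples and Cohen's d with
# pooled standard deviations, on simulated percent correct scores.
set.seed(1)
scores_pre  <- rnorm(1000, mean = 48, sd = 12)   # illustrative prepandemic scores
scores_post <- rnorm(1000, mean = 51, sd = 12)   # illustrative transpandemic scores

cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

t.test(scores_post, scores_pre, var.equal = TRUE)  # Student's t-test
cohens_d(scores_post, scores_pre)                  # effect size between cohorts
```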

Psychometric analyses were performed with CMT and IRT. Descriptive statistics, Cronbach’s alpha coefficient for reliability, the standard error of measurement, the mean difficulty index, and the point biserial correlation coefficient for discrimination were calculated. Analyses were done with ITEMAN 3.5, BILOG-MG 3.0, and Winsteps 3.0 software.
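The analyses were run with the specialized software listed above; for illustration only, the classical indices named here can be computed from a 0/1 scored response matrix as in the following R sketch (simulated data, our own function names).

```r
# Illustrative sketch of the classical indices reported (the study itself used
# ITEMAN, BILOG-MG, and Winsteps). 'responses' is a students x items 0/1 matrix,
# simulated here only so the sketch runs end to end.
set.seed(1)
responses <- matrix(rbinom(500 * 40, 1, 0.55), nrow = 500, ncol = 40)

cmt_summary <- function(responses) {
  k     <- ncol(responses)
  total <- rowSums(responses)
  alpha <- (k / (k - 1)) * (1 - sum(apply(responses, 2, var)) / var(total))  # Cronbach's alpha
  sem   <- sd(total) * sqrt(1 - alpha)                    # standard error of measurement
  p     <- colMeans(responses)                            # difficulty index (proportion correct)
  rpb   <- sapply(seq_len(k), function(i)                 # discrimination: corrected point
             cor(responses[, i], total - responses[, i])) # biserial (item-rest correlation)
  list(alpha = alpha, sem = sem, mean_difficulty = mean(p), mean_discrimination = mean(rpb))
}

cmt_summary(responses)
```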

2.7 Ethical aspects

The study complied with the Declaration of Helsinki for research involving human subjects’ data. Data were managed anonymously and confidentially.

3 Results

For the first prepandemic-transpandemic (2018–2020) comparison set, 35,584 matched students from each cohort were considered, using the PSM methodology. For the second comparison set (2017–2021), 31,574 matched students from each cohort were considered. For the Spanish and English exams, the first set of comparisons (2018–2020) used 33,585 matched students and for the second set (2017–2021) 33,481 students.

Psychometric results of all exams were appropriate for a test with diagnostic purposes. Reliability coefficients were in the range of 0.72 to 0.94 for the 2017 and 2018 exams and 0.86 to 0.95 for the 2020 and 2021 tests; overall, reliability was 5 to 10% higher in the transpandemic online exams (Table 2). Mean difficulty indices were moderate in all exams and 2 to 7% higher in the transpandemic cohorts; discrimination indices were appropriate, with point biserial correlations of 0.15 to 0.48 and higher values in the transpandemic cohorts (Table 2).

Table 2 Classical measurement theory parameters for the 2017, 2018, 2020, and 2021 diagnostic assessment examinations in UNAM students (general knowledge exams: 2018 n = 35,821; 2020 n = 41,909; 2017 n = 34,003; 2021 n = 37,688) (English and Spanish exams: 2018 n = 35,837; 2020 n = 35,534; 2017 n = 34,078; 2021 n = 35,097)

Figure 1 shows Wright maps comparing the area II (BCHS) exams, plotting the difficulty of the test items against the students’ ability in the evaluated construct, using the item response theory Rasch model (Andrich & Marais, 2019). These data show that the test difficulty is adequately calibrated for the student population, and the patterns are similar in each pair of cohorts. The other exams showed similar patterns, which adds validity evidence about the appropriateness, or fit, of the exams’ difficulty for the students’ range of ability levels.

Fig. 1

Wright maps of test item difficulty levels and student ability, estimated with the Rasch model, for the area II (biological, chemical, and health sciences) general knowledge diagnostic exams at UNAM, Mexico (2017, 2018, 2020, and 2021 cohorts)
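As an illustrative alternative to the Winsteps/BILOG workflow, a Rasch model can be fitted and a person-item (Wright) map drawn in R with the eRm package; the sketch below uses simulated responses and is not the analysis behind Fig. 1.

```r
# Illustrative alternative to the Winsteps/BILOG analysis: fit a Rasch model and
# draw a person-item (Wright) map with the eRm package, on simulated 0/1 data.
library(eRm)

set.seed(1)
ability    <- rnorm(300)                           # simulated person abilities
difficulty <- seq(-2, 2, length.out = 30)          # simulated item difficulties
prob       <- plogis(outer(ability, difficulty, "-"))
responses  <- matrix(rbinom(length(prob), 1, prob), nrow = 300)

rasch_fit <- RM(responses)   # conditional maximum likelihood item estimates
plotPImap(rasch_fit)         # Wright map: person ability vs. item difficulty
```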

The mean percent correct scores for the four cohorts by area of knowledge, Spanish and English, are shown in Fig. 2. The planned paired comparisons using the same instrument (2017 vs. 2021 and 2018 vs. 2020) showed increased scores in the four areas of knowledge during the pandemic, ranging from 2.3 to 4.4% in the 2018 vs. 2020 comparison and from 4 to 7.1% in the 2017 vs. 2021 contrast.

Fig. 2

Mean percent correct scores with standard deviations (SD) for the 2017, 2018, 2020, and 2021 diagnostic assessment exams at UNAM, Mexico. Results by cohort, area of knowledge (area I = physics, mathematics, and engineering sciences; area II = biological, chemical, and health sciences; area III = social sciences; area IV = humanities and arts), and testing modality (face to face, online). Paired comparisons’ data are shown in black bars for 2017 vs. 2021 and white bars for 2018 vs. 2020. All differences were statistically significant

3.1 Area of knowledge

Scores in the four areas of knowledge increased in the 2020 and 2021 cohorts, and all differences were statistically significant. The largest increases were observed in area I (PMES), with 4.4% in the 2018–2020 comparison and 7.1% in the 2017–2021 pair (Table 3). The smallest were in area III (SS), with 2.3% in the 2018–2020 comparison and 4% in 2017–2021. Among the four areas, general knowledge scores were highest in area IV (HA) in all cohorts, with mean percent correct scores ranging from 50.2 to 56.8.

Table 3 Number of students, mean percent correct scores, standard deviation, difference of means, effect size (Cohen’s d), and 95% confidence interval for the diagnostic assessment exams at UNAM in 2018–2020 and 2017–2021, total results and by gender

The scores in the Spanish exam had a small decrease of 1.3% in the 2020 pandemic exam compared to the 2018 prepandemic test, although in contrast there was a 1.6% increase in the 2017 vs. 2021 comparison (Fig. 2 and Table 3). The English test had a 1.7% increase when comparing pandemic vs. prepandemic scores (2020 vs. 2018), but when comparing the 2021 vs. 2017 exam, the English scores were significantly lower during the pandemic, about 7.7% less (Fig. 2 and Table 3).

Figure 3 shows the effect sizes using Cohen’s d for all paired comparisons (2017 vs. 2021, 2018 vs. 2020) for total scores in the four areas of knowledge, Spanish and English, and by gender. Effect sizes were generally larger in the 2017–2021 comparisons; they were also larger in area I (PMES) and in women (Fig. 3). Cohen suggested that d = 0.2 could be considered a “small” effect size, 0.5 “medium,” and 0.8 a “large” effect size (Cohen, 1988).

Fig. 3

Effect sizes measured with Cohen’s d for the 2017, 2018, 2020, and 2021 diagnostic assessment exams at UNAM, Mexico. Grey bars represent effect sizes for the 2018 vs. 2020 comparison and black bars the 2017 vs. 2021 comparison. (area I = physics, mathematics, and engineering sciences; area II = biological, chemical, and health sciences; area III = social sciences; area IV = humanities and arts)

The analysis by subject topic showed that in most categories scores were higher in the pandemic cohorts (2020 and 2021) than the prepandemic values (2018 and 2017), in some cases by as much as 15.7% (Mexican history) (Table 4). Geography was the only subject with negative differences in some areas of knowledge during the pandemic, and literature scores in men also showed some negative differences.

Table 4 Mean scores by topic in the UNAM diagnostic exam and differences between 2018–2020 and 2017–2021, by area of knowledge and gender (general knowledge exams: 2018–2020 n = 35,584 and 2017–2021 n = 31,574)

3.2 Gender

More than 50% of the UNAM student population are women (UNAM, 2022). The percentage of women in the four cohorts was as follows: 2017 = 53.8%, 2018 = 54%, 2020 = 53%, and 2021 = 54%. The proportion of female students varies considerably by area of knowledge: in area I (PMES), the majority were men (2018 = 65%, 2020 = 68%), whereas in areas II (BCHS) and IV (HA), the majority were women (area II, 2018 = 66% and 2020 = 68%; area IV, 2018 = 63% and 2020 = 65%). In area III (SS), there was a slightly higher percentage of women (2018 = 53% and 2020 = 51%). The pattern of gender distribution was similar in the four exam cohorts.

Men had higher mean percent correct scores in the general knowledge exams in all four areas and in all cohorts. However, before the pandemic men’s global test performance was 3.1% higher than women’s, and this gap decreased to 0.34% during the pandemic. Women had larger gains than men in both pandemic vs. prepandemic paired comparisons, and these increases were larger in the 2017 vs. 2021 comparison (Tables 3 and 4).

Men had a lower score in Spanish in 2020 compared to 2018 (− 2.4%), although the 2017–2021 comparison showed no difference (0.1%) (Table 3). In the English exam, there was a slight increase of 0.4% in 2020 compared to 2018, but there was a large decrease of − 8.4% from 2017 to 2021.

Women’s test performance showed a more favorable pattern. Their general knowledge exam scores were higher in 2020 and 2021 than in 2018 and 2017. The largest difference was found in the area I (PMES) exams, with a 6.1% increase in 2020 and an 8.6% increase in 2021; the other areas had score increases from 3.2% to 7.0%. All these differences were statistically significant (Table 3).

The performance of women in the Spanish exam had a small decrease of 0.5% in 2020 compared to 2018 and a 2.8% increase when comparing 2017 to 2021 (Table 3). The English exam in women showed an increase of 2.8% from 2018 to 2020 but a large decrease from 2017 to 2021 of − 7.1%.

4 Discussion

Diagnostic assessment of knowledge in students admitted to the university is important to obtain their baseline academic status at the beginning of the first year, so institutions can identify students that are in a disadvantaged situation and design interventions to improve their likelihood of success. A well-implemented diagnostic strategy at admission can help decrease dropout and academic delay in the first year of higher education (Bombelli & Barberis, 2012; Martínez-González et al., 2018, 2020; Porta, 2018).

UNAM’s admission diagnostic exam is designed following good practices for objective, standardized, large-scale tests (AERA, 2014; Lane et al., 2015) and has validity evidence (Martínez-González et al., 2018, 2020). This study showed an overall increase in scores during the pandemic period in the majority of knowledge domains, which was larger in the second year of the pandemic (2021). There was a small decrease in Spanish scores in the first year of the pandemic, which changed to a small increase in the second year. There was a small increase in English scores in 2020 and a substantial decrease in the 2017–2021 comparison, where students had more than one year of their education in confinement. These data do not support the hypothesis that during the pandemic there would be a large learning loss at all educational levels (Pier et al., 2021; UNESCO, 2021).

There are editorials and opinion articles that predict a large learning loss in the trans- and post-pandemic eras, but so far, concrete evidence backed by data is scarce (Azevedo et al., 2021; Pokhrel & Chhetri, 2021). A recent systematic review identified only eight published papers on pandemic learning loss (Donnelly & Patrinos, 2022). Seven studies found evidence of learning loss of 0.03 to 0.29 standard deviations (SD) in at least some of the participants, and one found learning gains. Six studies in this review involved primary-level students and only two came from higher education (Orlov et al., 2021; Gonzalez et al., 2020). Educational level is critically important in this context, since K-12 students are at a very different stage of maturity and cognitive development than university students. The majority of studies that analyzed primary-level students showed a decrease in learning that was statistically and educationally significant (Azevedo et al., 2021; Engzell et al., 2021; Maldonado & De Witte, 2020; Pier et al., 2021; Schult et al., 2021). Our findings show an increase in almost all areas of knowledge, except Spanish and English, in a large sample of higher education students. These data are compatible with the findings of Gonzalez et al. (2020) at the Universidad Autónoma de Madrid, who analyzed the effects of pandemic confinement on the autonomous learning performance of 458 higher education students in courses related to “Applied Computing,” “Metabolism,” and “Design of Water Treatment Facilities.” The experimental group (during the pandemic, 2020) performed better than the control group (prepandemic, 2017–2019); furthermore, the experimental group engaged in more continuous learning activities and had better assessment outcomes. The authors suggest that confinement had a positive effect on students’ learning strategies (Gonzalez et al., 2020). These findings agree with our study, in which we found increased knowledge levels, suggesting that higher education students can overcome the difficult pandemic situation and compensate for its potential negative academic effects through several strategies (ten Cate, 2001).

The only other study analyzing higher education students in Donnelly and Patrinos’ systematic review was from the United States and examined seven economics courses, finding worse performance in Spring 2020 than among Spring or Fall 2019 students (Orlov et al., 2021). That paper found a statistically significant drop of 0.185 SD (p = 0.015) during the pandemic semester. The authors do not report the sample size of the study, and unlike our study, it was limited to the field of economics.

Due to the large sample sizes in our study, almost all comparisons are statistically significant, although the question remains about their educational significance, as has been previously discussed in the educational research literature (McLean & Ernest, 1998). We used Cohen’s d as a measure of effect size to provide a clearer picture of the differences between the prepandemic and pandemic cohorts and found that the differences were in the small to moderate range. The use of propensity score matching provided statistically matched and balanced control and intervention groups, decreasing potential biases introduced by confounding variables. Arguing causality in this type of study is difficult without an experimental design, but as far as we know, this is one of the few studies that addresses learning loss in higher education with objective tests.

4.1 Performance by modality of test application

One effect observed in our testing experience was an increase in the proportion of students who took the test online compared with face to face: participation rose 3.4% in the 2018–2020 comparison (from 78.8 to 82.2% of the total student population) and 2.6% in the 2017–2021 cohorts (from 75.8 to 78.4%). The diagnostic exam is not mandatory, so we do not have a clear explanation for these differences, although students during the pandemic probably perceived this diagnostic, non-high-stakes exam as important for themselves and the institution and relevant to their academic history and university success (Bombelli & Barberis, 2012; Martínez-González et al., 2018; Porta, 2018). Online test application in a home environment was found to be a feasible although logistically complex modality. Before the pandemic, the university used a large amount of financial and human resources for test piloting, printing, reviewing printed exams, quality control, and test distribution and collection. This economic and logistical cost decreased substantially with the online modality. The advantages and disadvantages of online testing for high-stakes tests are controversial, although it could be argued that for formative and diagnostic assessments, where the stakes are not high, the pros of applying exams at home outweigh the cons (Baleni, 2015; Butler-Henderson & Crawford, 2020; Dlab et al., 2015).

Psychometric analysis of the tests in our study, obtained with the same instruments with proven validity and reliability, confirmed that reliability was appropriate and even increased in the online testing modality. The psychometric data on the standard error of measurement and the difficulty and discrimination indices showed a pattern consistent with good practices and international standards for objective large-scale testing, and it is interesting that these psychometric indicators were better in the online modality than in the face-to-face one. The patterns of performance in both tests were not affected by the testing modality. A potential explanation for the improvement in psychometric data when the instrument is applied online is that, despite possible technological asymmetries in equipment sophistication and connectivity at home, answering the test online may help standardize testing conditions and decrease students’ anxiety. Furthermore, students who had lived with the pandemic for a long period had likely developed online testing skills and arranged conditions at home to interact better with digital devices and online testing platforms.

Another aspect to consider is the likely sources of error in controlled face-to-face exams generated by complex logistics, as well as the different physical facilities where exams were applied across the diverse schools of the university, factors that can influence test scores (AERA, 2014). The online application of the exams was performed by the central Evaluation Department, with the same platform, staff, and instructions to discourage cheating, and with technical issues solved in real time. These factors may have provided a more homogeneous setting for the online test application, and the results appear to be valid measures of the constructs of knowledge, Spanish, and English in the respective student cohorts.

The mean difficulty indices of the 2020 and 2021 exams were higher than the prepandemic values—with the exception of Spanish—which means that the online versions of the tests were easier for the pandemic student cohorts. The discrimination index (point biserial correlation, an indicator of how well the test distinguishes more knowledgeable individuals from less knowledgeable ones) was higher in all domains of the 2020 and 2021 tests, meaning that students who answered each item correctly tended to have better overall test performance than those who did not. These data, together with the higher reliability, are signals of better instrument performance when applied online and are similar to the results of Dlab and colleagues (2015), in the sense that mean performance can be higher when tests are applied online. A relevant issue is that non-proctored online exams can yield better results because students may use other sources of information to improve their scores (Backes & Cowan, 2019; Brallier, 2015; Daffin & Jones, 2018). A recent systematic review of the effects of the pandemic on more than a million students from the health professions found that learners performed better in online assessments than in prior in-person evaluations, with differences ranging from approximately 1 to 3% (Dedeilia et al., 2023; Jaap et al., 2021). We cannot completely rule out that our students cheated or used other resources to obtain better scores, although we did not find patterns of cheating in the analysis; ultimately, if a student looks up an answer in a book or the digital library, that activity itself supports learning. Online non-proctored exams at home can also be less stressful for students when they are not continuously observed and video-recorded by remote proctors. If the test is diagnostic or formative, the incentives for cheating are low, and such exams can improve learning through “assessment for learning and as learning” (Earl et al., 2006). Moreover, the pattern of responses across all topics of the exam was very similar in all cohorts, strongly suggesting that students did not cheat or artificially inflate their scores. This is also supported by the finding that scores in Spanish and English in some comparisons not only did not increase but actually decreased; if students had cheated in the online home test, scores would likely have increased across all topics.

Unlike the general knowledge exam, the Spanish test was slightly more difficult for the population in the 2020 online modality. A possible explanation is that reading and grammar skills that have not been acquired over time will not be reflected in this type of test, irrespective of the application modality. A study by Backhoff et al. (2011) highlights the importance of Spanish knowledge and skills for students from basic through higher education. On the other hand, the large decrease in English scores in 2021 suggests that the long confinement period during the pandemic lockdown affected the learning of English as a second language, as has been reported by several authors (Muftah, 2022; Ying et al., 2021).

4.2 Performance by gender

The analysis by gender showed that men had higher scores in the knowledge exams, independently of the area of knowledge, whereas women performed better in the Spanish exam and the literature component of the general knowledge test. The English test had more homogeneous results for both sexes. These patterns have been consistent for more than 10 years at our university, although it should be pointed out that the gap between men and women decreased substantially in the online test modality during the pandemic years. This was apparent in all domains of knowledge.

Performance in exams by gender has been widely discussed in the literature, since gender roles influence several aspects of cognitive abilities and academic performance (Halpern, 2012). There are unanswered questions about why women and girls have better academic performance and completion rates in education but consistently obtain lower scores in standardized MCQ tests. In our institution, several studies of academic trajectories have shown that women earn better grades than men throughout their studies and have lower dropout rates and higher graduation rates, but obtain lower scores in the high-stakes university admission exam and the diagnostic exam applied after admission (Martínez-González et al., 2018, 2020; Campillo et al., 2017). Regarding the Spanish test, the results are consistent with previous findings that women perform better than men in this area. Other studies have shown that women usually perform better in exams that assess verbal skills (Brizzio et al., 2008).

4.3 Limitations

The study is the result of the pandemic “natural experiment,” so other variables could influence the results, not least the confounding of testing modality with the onset of the pandemic. It is therefore difficult to state with certainty how much of the increase in scores is due to the online modality per se, with its attendant implications (mainly the relatively unexplored aspects of online testing at home), and how much reflects a real increase in learning in the 2020 and 2021 cohorts due to intrinsic differences, such as more hours dedicated to study because of home confinement. The results of the pandemic cohorts are limited to two generations of students, so it is too early to say that there will be no learning loss phenomenon in the following years, after the complete effects of the ongoing pandemic settle in. It is necessary to continue monitoring students’ knowledge levels to obtain a clearer, longitudinal picture of the effects of the pandemic on learning.

5 Conclusions

The performance of two transpandemic cohorts of higher education students in a large-scale, objective, standardized diagnostic exam was higher than that of two matched prepandemic cohorts. These differences could be explained by a number of factors, including testing modality, but the data did not show evidence of a large learning loss in the pandemic groups.

Men had higher scores in the general knowledge exam and women in the Spanish exam, consistent with previous data, although the performance gap between men and women decreased substantially during the pandemic.

Online test application under pandemic conditions at home showed better psychometric data and reliability than the face-to-face modality. Large-scale online testing at home seems to be a valid and cost-effective way of applying diagnostic and formative tests in higher education.