1 Introduction

Since Germany signed the UN Convention on the Rights of Persons with Disabilities in 2009, the German school system has been changing to meet new standards for inclusive classrooms. Before 2009, children with special educational needs (SEN) primarily attended special schools and were not included in the regular school system. However, an important goal of today's German school system is to reach academic inclusion, meaning that children with and without SEN learn together in the same classroom.

In inclusive classrooms, the academic performance of children with and without SEN is very heterogeneous [1]. Children with SEN and other risk factors (e.g., minority or migration background) face an especially high risk of failing in school and manifesting academic and social problems over time (e.g., [2]). Issues relating to reading instruction are important to teachers because most students with SEN have difficulties learning to read. Indeed, over the last several decades, numerous studies have reported that the vast majority of students with SEN have difficulties in learning to read [3, 4], that they achieve significantly lower reading skills than their peers [5], and that this has consequences both within and outside the classroom [6]. These findings also apply to students learning to read German [1, 7]. Reading problems might manifest in skills related to early reading acquisition (e.g., reading fluency [8]) or in later skills (e.g., reading comprehension [9]). Reading fluency influences general reading development [10]. Because reading fluency already correlates with reading comprehension in primary school [11], it is a fundamental reading skill and an important goal in early reading development for every student [12, 13]. Influencing factors, such as general cognitive ability, phonological awareness, speech perception and production, letter-name knowledge, rapid automatized naming, and consistency of the orthography, predict early reading ability and individualize the learning process [14, 15]. Additionally, specific reading fluency instruction can be differentially effective for students with and without SEN [16].

Furthermore, the heterogeneous conditions and needs in inclusive classrooms challenge teachers in new ways, such as designing effective instruction for students with special needs. In this regard, it is important that teachers focus on individual learning growth instead of social comparison between students [17]. Only in this way can teachers provide targeted, effective reading instruction [18]. This requires a new approach: first, teachers determine the individual needs of their students. Second, they choose a lesson and deliver it to the target student. Third, they monitor individual learning growth and reflect upon their choice of instruction. If learning growth is sufficient, the teacher can continue with the same instruction or focus on a new goal. If not, the teacher can adapt the method or try another approach for that student.

Curriculum-based measurement (CBM) is a method for monitoring the learning growth of children and for supporting teachers in effective decision-making [19]. CBM was designed in the United States to address academic difficulties in special education [20]. CBM tests can be administered very frequently during lessons, take only a few minutes, and graphically show the slope of individual learning growth over a longer period. The tasks are representative of end-of-year performance goals and integrate various subskills of competence in a domain (e.g., reading or mathematics) [21]. Like any other test, CBM instruments need to meet quality criteria such as objectivity, reliability, and validity, and must be sensitive to learning growth [22]. For classroom use, the test should be simple for educators to administer and its results easy to interpret. Additionally, it is important that multiple measurement points (MPs) are comparable over time and that the test is demonstrated to be invariant across these MPs [23].

The reliability and validity of CBM instruments have been shown within the classical test theory framework [24], and they are particularly useful for children with SEN [25]. More recently, computer-based instruments have boosted the potential of CBM [26]. Computer versions can reduce time requirements, ease the creation of parallel test versions, and provide automatic feedback to students and teachers.

Curriculum-based measurement has also been studied in the context of German schools [27]. The first German CBM tests were pen-and-paper tests tracking reading or mathematics skills (e.g., [28]). However, in large classes pen-and-paper tests can create considerable additional work for teachers. Moreover, with these instruments, teachers must choose CBM tests from lower grades for children with SEN and lower ability levels (e.g., [29]), which complicates the use of CBM instruments in inclusive classrooms [30]. Newer instruments have focused on online assessments to remedy some of these problems (e.g., [31, 32]), but such online tools often cost money for teachers or administrators [33]. Nonetheless, more research is necessary regarding the use of CBM techniques in real inclusive German classrooms [34].

The web-based platform Levumi (www.levumi.de) was created by a multidisciplinary research team with the goal of providing a free online CBM tool to assess reading and mathematics competencies in primary education, with a focus on children with SEN, learning problems, behavior problems, or other risk factors. The three main goals of this research project are (1) to offer teachers a practical CBM tool for inclusive classrooms, (2) to improve research on CBM and the acceptance of CBM tools by teachers, and (3) to use the collected data for evaluating supporting materials for research and development in teaching and learning [33].

Levumi is currently available to teachers and researchers. Users can register on the website for free, and supporting material is also provided free of charge (e.g., [35]). Levumi tests can be used in all 16 German federal states and focus on learning goals shared throughout the country. Levumi runs in any major browser. This makes it easy for teachers to use the system without installing additional software, which usually requires administrative privileges that teachers do not have in typical school IT infrastructures. The only requirement is a persistent internet connection. Some tests are administered directly by the teacher; for others, the teacher activates a test for a class and the students then take it on their own by logging in with a personal ID code. Each such test begins with an instruction page, often with an interactive sample item, which requires multiple inputs to prevent a student from accidentally starting the test.

The design of the platform is visually simple, and the tests look similar to one another so that learners with SEN can easily work within the platform. For each item, the actual answer and whether or not it is correct are recorded. The sum of correct responses is used as the final score of the test. Immediately after each test, our mascot, a dragon named Levumi, shows each child whether he or she has improved. Teachers can see graphically how each student performs in comparison both to past performance and to other students. Furthermore, teachers get more detailed performance information, including the items a student struggled with. The multiple difficulty levels allow learners with SEN to use the same tools and tests as their peers, which makes supporting all learners in an inclusive classroom much easier for teachers. Additionally, teachers can track student information such as age, sex, migration background, and SEN.
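As a minimal sketch of how item-level responses might be recorded and aggregated into a test score (the class and field names here are illustrative assumptions, not Levumi's actual data model):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ItemResponse:
    item_id: str      # which item was shown
    answer: str       # the actual answer given
    correct: bool     # whether the answer was correct

@dataclass
class TestResult:
    student_id: str
    responses: List[ItemResponse] = field(default_factory=list)

    def sum_score(self) -> int:
        # The final test score is the number of correct responses.
        return sum(1 for r in self.responses if r.correct)

    def problem_items(self) -> List[str]:
        # Item-level detail shown to teachers: items answered incorrectly.
        return [r.item_id for r in self.responses if not r.correct]
```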

Levumi contains tests of important indicators of overall competence in different learning domains. These learning domains are currently reading, writing, and mathematics. For each domain, several competencies can be measured with separate test types and multiple difficulty levels. The reading test structures are outlined in Fig. 1.

Fig. 1. Test structure of the learning domain reading in the Levumi platform

The reading domain contains two competencies: reading fluency and reading comprehension. For each competency, teachers can pick from several test types and difficulty levels. For practical use in class, it is very important that teachers can easily choose a particular test and difficulty level. For each learning domain, the same difficulty structure exists across all competencies (see Fig. 1, lower part).

Each test has a unique item pool of approximately 40 to 200 items. All items are constructed upon theoretical models of reading acquisition. The web platform randomly orders the items for each measurement, creating a huge number of parallel test forms for each single measurement. In some tests, additional rules are applied to the random item ordering. Typically, these tests involve items from multiple categories, such as item types from different dimensions. Additionally, in some tests, such as the reading fluency test, words or syllables with the same initial letter are prevented from following one another, which prevents common rating mistakes. In these cases, items of the different categories or types are selected randomly in a round-robin fashion, as sketched below.
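The following sketch illustrates one way such constrained ordering could be implemented: items are drawn from their categories in round-robin fashion, and a candidate is skipped when it would repeat the initial letter of the previous item. The function and data layout are illustrative assumptions, not Levumi's actual implementation:

```python
import random
from typing import Dict, List, Optional

def order_items(pools: Dict[str, List[str]], seed: Optional[int] = None) -> List[str]:
    """Randomly order items, cycling through categories round-robin and
    avoiding two consecutive items that share the same initial letter."""
    rng = random.Random(seed)
    # Shuffle each category's pool independently.
    shuffled = {cat: rng.sample(items, len(items)) for cat, items in pools.items()}
    categories = list(shuffled)
    ordered: List[str] = []
    while any(shuffled.values()):
        for cat in categories:
            pool = shuffled[cat]
            if not pool:
                continue
            # Prefer an item whose initial letter differs from the previous
            # item's; fall back to the first item if none qualifies.
            idx = next((i for i, item in enumerate(pool)
                        if not ordered or item[0] != ordered[-1][0]), 0)
            ordered.append(pool.pop(idx))
    return ordered

# Example: syllable items in two illustrative categories.
print(order_items({"simple": ["ma", "mo", "ra"], "complex": ["scha", "qua", "bra"]}, seed=1))
```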

The Levumi reading fluency test measures fluency in reading aloud, a robust indicator of overall reading competence [36]. In these tests, the children read items aloud from a computer screen for one minute. Teachers rate correct and incorrect answers via keyboard and can account for factors specific to each learner, such as a speech impairment. Reading of syllables, words, and pseudowords is assessed through separate tests. All reading fluency tests are based on the 'Kieler Leseaufbau' [37], which applies specifically to the German language. Each test incorporates multiple difficulty levels (L0 to L4) [38], based on a range of letters (see Fig. 1). Vowels are used in every difficulty level. The lower levels use stretchable consonants (e.g., /m/, /r/, /s/, /l/) in simple word structures. In the higher levels, plosives, less common consonants, and consonant combinations (e.g., /b/, /sch/, /qu/) are used with more complex word structures. In L4, all consonants and vowels are included. A student can be tested at a difficulty level after receiving instruction on all the letters it contains. Because teachers can choose a suitable Levumi test based upon the competency level of a student rather than their age or grade, ease of use for teachers in inclusive classrooms is maintained.
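To illustrate the level-selection logic, a minimal sketch follows. The letter sets are placeholders reconstructed only from the examples above; the actual assignment follows the 'Kieler Leseaufbau' and is not reproduced here:

```python
# Illustrative, cumulative letter sets per difficulty level (not the
# authoritative assignment). Vowels appear in every level.
VOWELS = {"a", "e", "i", "o", "u"}
LEVEL_LETTERS = {
    "L0": VOWELS | {"m", "l"},                                           # stretchable consonants
    "L1": VOWELS | {"m", "l", "r", "s"},
    "L2": VOWELS | {"m", "l", "r", "s", "n", "w", "f"},
    "L3": VOWELS | {"m", "l", "r", "s", "n", "w", "f", "b", "d", "t"},   # adds plosives
    "L4": VOWELS | set("bcdfghjklmnpqrstvwxyz") | {"sch", "qu"},         # all letters, combinations
}

def eligible_levels(taught: set) -> list:
    """A student can be tested at a level once all of its letters were taught."""
    return [level for level, letters in LEVEL_LETTERS.items() if letters <= taught]

# Example: after instruction on the vowels plus /m/, /l/, /r/, /s/,
# levels L0 and L1 are eligible.
print(eligible_levels(VOWELS | {"m", "l", "r", "s"}))
```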

As part of completing the test development of the reading fluency assessment for Levumi, we present four research questions. First, to verify the psychometric properties of the test with respect to item response theory, we ask: (1) how well does a Rasch model fit the Levumi syllable reading test? Next, we wish to examine the applicability of the sum scores to important theoretical questions, the general reliability of the syllable reading test, and its ability to measure changes over time. Thus, our second and third research questions are: (2) do the sum scores of the Levumi reading fluency test possess good test-retest reliability over 2 MPs, and (3) can the Levumi test measure the learning progress of learners with SEN over multiple MPs? Lastly, we wish to examine the applicability of the assessment to learners with special educational needs, so we ask: (4) how do Levumi reading fluency test takers with SEN compare to other test takers in terms of sum scores on the syllable reading test?

2 Methods

Participants were test takers of the syllable reading fluency test on the Levumi platform in five samples. The first three samples were measured twice within a period of 7 to 10 weeks, each representing data from a single difficulty level: L2b (n = 105), L3 (n = 97), and L4 (n = 132). The fourth sample was a small group of learners with SEN from a single class (n = 8), tested at 4 MPs of difficulty level L2b, followed by 5 MPs of L3 and then 5 MPs of L4 over one school year. Lastly, the fifth sample was taken from inclusive and special schools (N = 300) and included learners with SEN (n = 46; 38 with SEN in learning, 7 with SEN in German, and 1 other).

We conducted a full Rasch analysis for the second sample, examining item fit scores (infit and outfit) for all test items. We further report Warm's weighted likelihood estimate (WLE) reliabilities for the first three samples.
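For reference, infit and outfit mean squares can be computed from the Rasch model's expected scores. A minimal sketch, assuming a dichotomous Rasch model with already estimated person abilities theta and item difficulties beta (this stands in for whatever software was actually used):

```python
import numpy as np

def rasch_fit(responses: np.ndarray, theta: np.ndarray, beta: np.ndarray):
    """Compute per-item infit/outfit mean squares for a dichotomous Rasch model.

    responses: persons x items matrix of 0/1 scores
    theta:     estimated person abilities (length = number of persons)
    beta:      estimated item difficulties (length = number of items)
    """
    # Model-expected probability of a correct response per person-item pair.
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    w = p * (1.0 - p)                              # response variance under the model
    z2 = (responses - p) ** 2 / w                  # squared standardized residuals
    outfit = z2.mean(axis=0)                       # unweighted mean square per item
    infit = (w * z2).sum(axis=0) / w.sum(axis=0)   # information-weighted mean square
    return infit, outfit
```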

Next, we analyzed the sum scores of each sample. First, we assessed test-retest reliability over 2 MPs in the first three samples. Second, we conducted a repeated measures ANOVA on the sum scores of the 8 participants in the fourth sample to assess the test's ability to measure learning progress. Lastly, we compared the results of learners with SEN to other learners in the fifth sample. A sketch of all three analyses follows.
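The three sum-score analyses in sketch form, using standard Python tooling (scipy and statsmodels here stand in for whatever software was actually used; column names are illustrative):

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# (1) Test-retest reliability: Pearson correlation of sum scores at MP1 and MP2.
def test_retest(mp1: pd.Series, mp2: pd.Series) -> float:
    r, _ = stats.pearsonr(mp1, mp2)
    return r

# (2) Repeated measures ANOVA over MPs for the tracking sample, given a
# long-format frame with columns 'student', 'mp', and 'score'.
def rm_anova(long_df: pd.DataFrame):
    return AnovaRM(long_df, depvar="score", subject="student", within=["mp"]).fit()

# (3) Comparison of learners with and without SEN, allowing unequal group
# variances (Welch's t-test).
def sen_comparison(sen_scores, other_scores):
    return stats.ttest_ind(sen_scores, other_scores, equal_var=False)
```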

3 Results

A graphical model check confirmed that all items performed equivalently across both MPs for difficulty level L3, indicating that the items function invariantly over time (a sketch of this check follows). Next, we calculated a Rasch model across both measurement points for difficulty level L3. The model fit the data well: the mean square (MSQ) of the outfit ranged from .726 to 1.682 and the MSQ of the infit from .911 to 1.079. Good values for the outfit and infit MSQ are between 0.5 and 1.5, and only values above 2.0 are considered harmful to measurement [39]. With only 2 of 112 items showing an outfit above 1.5, we concluded that the Rasch model fits our data very well. Furthermore, all three difficulty levels had good reliability within the Rasch model, WLEL2b = .919, WLEL3 = .883, and WLEL4 = .895. Therefore, we conclude that the Rasch models fit the data well and the test is sufficiently unidimensional to use sum scores.
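A graphical model check of this kind plots item difficulties estimated separately from the two MPs against each other; items near the identity line function equivalently at both points. A minimal sketch, assuming difficulty vectors beta_mp1 and beta_mp2 from two separate Rasch fits (names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def graphical_model_check(beta_mp1: np.ndarray, beta_mp2: np.ndarray):
    """Scatter item difficulties from MP1 against MP2; points close to the
    identity line indicate items that function equivalently over time."""
    fig, ax = plt.subplots()
    ax.scatter(beta_mp1, beta_mp2)
    lo = min(beta_mp1.min(), beta_mp2.min())
    hi = max(beta_mp1.max(), beta_mp2.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--")  # identity line
    ax.set_xlabel("Item difficulty at MP1")
    ax.set_ylabel("Item difficulty at MP2")
    return fig
```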

Sum scores indicated a very high level of test-retest reliability for difficulty levels L2b and L4, rL2b = .84 and rL4 = .85. The reliability of difficulty level L3 was lower, but still high, rL3 = .76. This is consistent with the other reliability analyses.

Figure 2 shows all sum scores of the tracking sample of 8 learners with SEN across one school year. Each line represents the performance of a single learner. The left section contains MPs of difficulty level L2b, the middle L3, and the right L4. Separate ANOVAs for each test confirmed significant changes over time. For the L2b test, performance changed significantly from MP1 (M = 21.8, SD = 5.4) to MP4 (M = 30.8, SD = 7.9), F(3, 21) = 13.49, p < .001. For the L3 test, no differences over time were found, F(4, 28) = 1.644, p > .10. Lastly, for the L4 test, a change from MP1 (M = 24.0, SD = 7.8) to MP5 (M = 29.8, SD = 8.6) was detected, F(4, 28) = 4.32, p < .01. We concluded that individual changes over different MPs are detectable.

Fig. 2. Individual tracking across difficulty levels L2b, L3, and L4

Lastly, the average number of items correctly solved by students with SEN (M = 28.5, SD = 9.7) did not differ from that of students without SEN (M = 30.4, SD = 14.0), Welch's t(83.4) = 1.08, p > .25.

4 Discussion and Future Work

We assessed the quality of the syllable reading fluency test across different difficulty levels. Rasch analyses verified the test's psychometric qualities. Test-retest reliability was confirmed by a graphical model check and an analysis of sum scores. Learners with SEN performed no differently than those without SEN. Also, scores at difficulty levels L2b and L4 improved significantly over the course of learning. Scores for L3 did not change, but this may represent a local plateau in the learning process.

These assessments allow for three important implications. First, good test-retest reliability indicates that changes in student responses over the course of the school year reflect changes in learner ability, not test artifacts. Second, the test is effective at measuring the learning process across multiple MPs. Lastly, equivalent test performance for learners with SEN demonstrates that the test provides a fair assessment for those learners. However, some limitations remain: we did not assess all learner groups, test types, and difficulty levels. Further work should assess other test quality criteria.

Levumi provides multiple tests to monitor learning growth in different learning domains, and development of further tests is nearly complete. An item response theory based evaluation of test quality criteria is still needed for the other test types and difficulty levels. Simultaneously, new test types for the Levumi platform are being designed. These measure competencies in reading, writing, and mathematics, and include behavior ratings. Among them are reading comprehension tests at the level of individual words and complete sentences, and mathematical assessments of early number sense and number sequencing. Work is ongoing to create more difficult tests for use in secondary schools. Lastly, CBM behavior ratings are planned for primary and secondary schools. Similar test evaluations will be performed on the new tests.

We are also improving platform use on tablets and have had the chance to collect pilot data for some tests on them. Some tests work well on tablets, but others need specific adaptations. Many participants were familiar with touch screens but required some time to familiarize themselves with Levumi. Further work should investigate whether there are any mode effects between the two systems. Providing an app that displays the tests for students is also a possible improvement.

Our research established the essential reliability and usefulness of a new web-based CBM technique. The platform allows for rapid assessments and easy tracking of children with and without special educational needs in all classroom types. The reliability of the reading fluency tests was confirmed, and development continues on new tests of different competencies. Lastly, we found no difference in performance between students with and without SEN, and no changes to test preparation or handling were required for learners with SEN.