8.2.1 Item Construction
Based on these theoretical considerations, as well as on existing test items, we began to construct atomistic test items intended to assess each sub-competency of mathematical modelling separately. As the basis for operationalisation, we used the familiar definitions of the sub-competencies explained above (Kaiser et al. 2015; an English translation can be found, for example, in Maaß 2006).
As we had in mind using the new test instrument in further studies, we aligned our work with the requirements of those studies. For example, we focused on geometric modelling problems and chose grade 9 students (15–16 years old) as our target group. These choices were not driven by content-related reasons concerning the research questions formulated above, and we expect the results of our study to be transferable, to a certain extent, to other mathematical domains. However, as Blum (2011) states, learning is always dependent on the specific context, and hence a simple transfer from one situation to another cannot be assumed. He emphasises that this applies to the learning of mathematical modelling in particular, so that modelling has to be learnt specifically. Thus, a student who is a good modeller in the field of geometry is not necessarily a good modeller in the field of functions. Of course, the restriction to geometric modelling problems limits the generalisability of our results, but at the same time allows us to gain more reliable and meaningful findings regarding the chosen topic.
Next, we present an example of a test item for each of the four sub-competencies we measured, and explain to what extent each item actually measures the respective sub-competency. To provide some evidence for the quality of the items, we report the solution frequency and the item-total correlation as an indicator of selectivity, as found in an implementation of the test in a large sample (3300 completed tests).
8.2.1.1 Simplifying: Lighthouse Item
An example of a test item that was used to measure the sub-competency of simplifying is the Lighthouse Task (see Fig. 8.1, translation). It is a modification of the well-known lighthouse question (Kaiser et al. 2015), which requires the use of a geometrical model and is suitable for grade nine students. The given situation is depicted by a picture of a lighthouse. The students’ task is to select all the information that is relevant to calculate the distance to the horizon. Thus, the item measures competencies for identifying relevant quantities and key variables, which are part of the definition of the sub-competency simplifying.
The fact that more than one answer has to be selected, namely the radius of the earth and the height of the lighthouse, reduces the probability of selecting the correct answer by guessing. The alternative answers represent misconceptions; for example, the answer “There are no clouds in the sky” reflects confusing the distance to the horizon with the visibility. The first two alternatives show different misconceptions about the dependence on the lighthouse’s location, and the last alternative represents a misunderstanding of the question, or rather the misconception that the distance to the horizon depends on the range of the light.
The distractors were developed with the help of experts in the field of modelling. We collected various items of information we thought students might select as relevant, even though they are not. In our pilot studies, as well as in the implementation of the test with a large sample, we checked these distractors and found that all of them were chosen by at least some students. The two most common mistakes were to select the distractor “Between the lighthouse and the ocean, there are 25 m of sandy beach” (25.2% of wrong answers) and not to select the second correct option, “The radius of the earth measures 6370 km” (13.5% of wrong answers).
The item was used in a study with a large sample, which led to 1473 responses to this item. A total of 45.35% of the students were able to answer this question correctly and received 2 points. Students who selected one additional distractor, or forgot to select the second correct answer without selecting one of the distractors, still received 1 point. This was the case for 27.70% of the students. Even though it is thus a relatively easy item, its item-total correlation of r = 0.43 indicates satisfactory selectivity in this sample.
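The item-total correlation used here as a selectivity indicator is the correlation between students’ scores on one item and their total score on the remaining items. As a minimal sketch (in Python, with illustrative data that are not from the study), the corrected item-total correlation can be computed as follows:

```python
import numpy as np

def item_total_correlation(scores: np.ndarray, item: int) -> float:
    """Corrected item-total correlation: Pearson r between the scores on
    one item and the sum of the scores on all remaining items."""
    item_scores = scores[:, item]
    rest_total = scores.sum(axis=1) - item_scores
    return float(np.corrcoef(item_scores, rest_total)[0, 1])

# Illustrative data: 6 students x 4 items, partial-credit (0/1/2) scoring.
scores = np.array([
    [2, 1, 2, 1],
    [1, 0, 1, 0],
    [2, 2, 2, 2],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [2, 1, 2, 2],
])
r = item_total_correlation(scores, item=0)
```

As a common rule of thumb, values above roughly 0.3 are read as satisfactory selectivity, which matches the interpretation of r = 0.43 above.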
8.2.1.2 Mathematising: Straw Bale Item
The item in Fig. 8.2 was used to assess the competencies for setting up a mathematical model from a simplified real situation (i.e. mathematising). This item is inspired by the Straw Bale Task in Borromeo Ferri (2011, pp. 84–85), which confronts students with a real-life situation of a stack of straw bales in a field. The idealising assumptions that all straw bales are the same size and that they are evenly and exactly round are given in the text, as are the diameter of 1.50 m and the depth to which the straw bales sink into the layer below them. The student’s task is to convert this situation into a mathematical representation, both graphically in a labelled drawing and symbolically as a formula, with the aim of calculating the stack’s height. The item thus measures the competencies required for choosing appropriate mathematical notations or representing situations graphically.
A correct answer must include the stack’s diameter, the depth of sinking in and, as the unknown quantity, the height of the stack. Answers using the specific sizes and those using abstract variables to denote these quantities were acceptable. Students could achieve a maximum of two points for this item, one for the correct drawing and one for a correct formula.
Use in a large sample produced 1143 responses to this item, which was correctly solved by 24.58% of the students; 36.05% scored one point and 39.37% gave a completely incorrect answer. With an item-total correlation of r = 0.50, the item’s selectivity is also within a satisfactory range. Since this item has a short-answer format, approximately 40% of the students’ answers were rated by two independent raters according to a coding manual. The interrater reliability (Cohen’s Kappa) was κ = 0.86 and thus very good.
Figure 8.3 gives an example of how this item was coded. The first solution shows a correct answer given by a student, who was able to use the given relevant information to build a graphical and a symbolic mathematical model. The answers below show incorrect responses. The answer on the left shows that the student tried to apply Pythagoras’ theorem and was not able to transform the given data into a mathematical model with which it would have been possible to solve the problem. The response on the right shows a graphical representation of the situation in which the straw bales are still shown (as noted next to the drawing). The formula used is an attempt to incorporate the given data, but it ignores the units on the one hand and, on the other, employs the formula for the area of a triangle. This mathematical model thus cannot be used for solving the task and was therefore coded zero.
8.2.1.3 Interpreting: Dresden Item
The sub-competency of interpreting a mathematical result and relating it back to the extra-mathematical context was measured with items such as the one in Fig. 8.4. In this item, students are confronted with an extra-mathematical situation that has already been simplified and converted into a mathematical model. In the Dresden item in Fig. 8.4, a boy takes a look at a photograph in which he identifies his father standing in front of a giant arch at a Christmas fair. He mathematises the situation by measuring the height of his father and of the arch in the photo, and by setting up a mathematical term that combines all given numbers and yields the numerical result 3.8. In other words, the modelling cycle has already been carried out up to the point where the mathematical result has to be related back to the context. The student’s task is to explain what the result 3.8 means in relation to the specified situation. Since the mathematical term represents the father’s height in reality, divided by his height in the photo, multiplied by the height of the arch in the photo, the correct answer, which was awarded one point, is that the arch is in reality 3.8 m high.
In our study, 56.05% of the students gave a correct answer. The most common incorrect response was that 3.8 represents the difference between the size of the father and that of the arch. This is probably because the numbers given in the picture have a difference of 2.8; students who do not pay attention to the ‘borrowing’ in the subtraction thus confuse the given result with the difference. These students clearly display a deficit in their competencies for interpreting a mathematical result, and consequently did not receive a point for their answer. The selectivity for this item was satisfactory with a value of r = 0.48, and the interrater reliability (Cohen’s Kappa) was very good with a value of κ = 0.95.
8.2.1.4 Validating: Rock Item
The sub-competency of validating was perhaps the most difficult to assess. As the definition of this sub-competency shows, it consists of different facets, namely critically checking solutions, reflecting on the choice of assumptions or of the mathematical model, and searching for alternative ways to solve the problem. To measure this sub-competency, we therefore employed a broader variety of items, which means that the items measuring validating were not as similar to each other as the items for the other sub-competencies. Figure 8.5 gives an example of an item that assessed the competencies for critically reflecting on a result. In this item, students are confronted with a photo of a girl standing beside a rock. Without being presented with a mathematical model, students are given the result of a calculation, namely the assertion that the rock is 8 m tall. They are asked to explain whether or not this result is plausible. To solve this task, students must use the photo and compare the size of the girl with that of the rock. As the rock is approximately three times as high as the girl, she would have to be more than two metres tall if the result were correct. A student’s response that clearly stated the assertion is wrong, justified this answer by comparing the sizes of the girl and the rock, and additionally identified a maximum plausible height for the girl was coded with two points. Answers like “No, since the rock is just approximately three times as big as the girl”, which did not give a maximum height for the girl, were still awarded one point. Answers coded as wrong mostly either gave no justification at all or judged the result to be plausible.
Approximately half of the students (51.05%) achieved one point on this item, and 27.56% were given two points. The selectivity was r = 0.40, and the interrater reliability was again very good with κ = 0.88. Other items that assessed this sub-competency did not focus so strongly on checking a result, but confronted students with the choice of a mathematical model and asked them to decide whether it would fit the given extra-mathematical situation. Additionally, there were items that assessed students’ abilities to find objects that help in determining the plausibility of a result. For example, students were given a photo of a dog and the claim that this dog is 28 cm high. They were asked to name one object that is approximately 28 cm high with which they could mentally compare the dog. In contrast to the Rock item in Fig. 8.5, students were not asked in this item to actually check the given result. This item assessed whether students were able to fall back on supporting knowledge (in German, “Stützpunktwissen”) as a basis for checking their results. We therefore had a broad variety of difficulty levels among the items that assessed the different facets of the sub-competency of validating.
8.2.2 Testing of Items
Before constructing test booklets, we had a phase of intensive item testing. We first presented the items to experts in the field of modelling and asked them to comment on the tasks and to indicate what they thought the items would assess. All the experts classified the items as we expected, but some items tended to assess more than just one sub-competency. We reworked those items and related them more closely to the definitions of the respective sub-competencies. Special attention was paid to the multiple-choice items and the choice of distractors. We asked the experts to comment on all the answer options that were part of the items and to add an option if they thought an answer or a typical mistake was missing.
Subsequently, we gave the items to a class of 36 students, observed their working processes and afterwards asked them in groups what problems they had had in solving the tasks. Most of their answers referred to the poor quality of a photo, which was then changed. In this phase, we identified formulations that were too complicated and made items too difficult to understand. With the help of the students’ comments, we simplified the language and clarified the references to pictures that students would subsequently be expected to use, as in the Rock item in Fig. 8.5. Students found some of the items “easy and interesting to solve, since they are different from conventional maths exercises”, but “had to think intensively” about others. These comments, as well as the analysis of the students’ answers, revealed a wide range of item difficulties, with a large number of items having a medium solution frequency, but also a substantial number of items with a high or a low solution frequency. No item remained unsolved, but no item was solved by all participants either. The qualitative analysis of students’ answers made it possible to identify possible difficulties in coding the answers, which led to small changes in formulation. It also formed the basis of a first draft of a coding manual for the test instrument.
Afterwards, we conducted a second pilot study with the aim of acquiring quantitative data to check the test’s quality and to generate solution frequencies of the various items. In this study, no item was solved by all, or by none, of the 189 students. The answers the students gave additionally helped us to improve the coding manual.
8.2.3 Combining Items into a Test
One of the most difficult challenges in constructing a test for use in an experimental design is ensuring the comparability of pre- and post-tests. This challenge of creating parallel tests becomes obsolete if one uses psychometric models and interprets responses to items as manifest indicators of one or several latent variables. The central idea is that the more pronounced a person’s latent trait is, the greater his or her probability of solving an item. Thus, in the simplest model, only the difficulty of the items and the person’s ability are taken into account. The great advantage of this model is that a person’s ability can be determined even if not all items are presented, which makes it possible to use a multi-matrix design.
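In the simplest such model, the one-parameter Rasch model, the solving probability depends only on the difference between person ability θ and item difficulty b: P(X = 1) = exp(θ − b) / (1 + exp(θ − b)). A minimal sketch of this item characteristic function (illustrative only, not the estimation procedure itself):

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Rasch model: probability that a person with ability theta
    solves an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person whose ability equals the item difficulty has a 50% solving
# probability; higher ability raises it, higher difficulty lowers it.
p_equal = rasch_probability(theta=0.0, b=0.0)  # 0.5
p_able = rasch_probability(theta=1.0, b=0.0)   # ~0.73
```

Because only the difference θ − b enters the formula, items a person never saw simply contribute nothing to the likelihood, which is what makes the multi-matrix design possible.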
Figure 8.6 illustrates the test structure. Firstly, we constructed eight item blocks, each consisting of one item per sub-competency, i.e. four items per block. No item appeared in more than one block. Secondly, we combined the item blocks into four test booklets, two for each point of measurement, so that each test booklet consisted of 16 items. We ensured that the test booklets had a similar average difficulty so as to avoid motivational problems for some groups of students. The fourteen multiple-choice items were also distributed evenly over the booklets, so that all test booklets contained both item formats.
The two booklets we used at the first point of measurement were linked to each other via two blocks (blocks 3 and 4 in Fig. 8.6). Additionally, booklet A contained items that were not part of booklet B and vice versa. The same linking method was used for the post-test, where new items (blocks 7 and 8 in Fig. 8.6) were used to link the booklets. A person who answered test booklet A in the pre-test also received post-test A, and likewise for booklet B. In this way, no student answered the same item twice. Nevertheless, since the item blocks 1, 2, 5 and 6 were used at both points of measurement, it was possible to link the two points of measurement. We determined the item difficulties using the data from all points of measurement, and then calculated the person abilities for each point of measurement separately.
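The linking constraints described above (pre-test booklets share blocks 3 and 4, post-test booklets share blocks 7 and 8, blocks 1, 2, 5 and 6 appear at both points of measurement, and no student sees a block twice) can be sketched as a concrete block assignment. The assignment below is a hypothetical reconstruction consistent with those constraints, not necessarily the one in Fig. 8.6:

```python
# Hypothetical booklet composition (block numbers as in Fig. 8.6;
# each block holds one item per sub-competency, so 4 blocks = 16 items).
booklets = {
    "pre_A": {1, 2, 3, 4},
    "pre_B": {3, 4, 5, 6},
    "post_A": {5, 6, 7, 8},
    "post_B": {1, 2, 7, 8},
}

# Pre-test booklets are linked via blocks 3 and 4, post-test booklets
# via blocks 7 and 8 ...
link_pre = booklets["pre_A"] & booklets["pre_B"]    # {3, 4}
link_post = booklets["post_A"] & booklets["post_B"]  # {7, 8}
# ... and a student sees no block twice across the two measurement points.
repeats_A = booklets["pre_A"] & booklets["post_A"]   # empty set
```

Blocks 1 and 2 (and 5 and 6) then appear at both measurement points, but always in different students’ booklets, which provides the vertical link between pre- and post-test.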
8.2.4 Methods of Data Collection
We implemented the test in 44 classes of grade 9 students, who completed the test instrument three times each. This led to a total of 3300 completed tests, which formed the basis for the evaluation of the test instrument presented in this chapter.
Each testing lesson lasted 45 min, and since each student had to answer a set of just 16 items, no time pressure was observed. The testing was performed by the teachers strictly following a written test manual, in which all details of the testing process, as well as the instructions to be read out, were recorded. In this way, a standardised execution was possible in each of the participating classes. Correct implementation was checked by random spot checks.
The completed test sheets were coded according to the coding manual. Some items were coded dichotomously, and some had partial-credit scoring, with two points for a completely correct solution and one point for a partially correct solution. A sample (40%) of the completed test sheets was rated by two independent coders. The interrater reliability for the open tasks was within a range of 0.81 ≤ κ ≤ 0.96 (Cohen’s Kappa), which reflects very good agreement.
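Cohen’s Kappa corrects the raw agreement rate of two raters for the agreement expected by chance; values above 0.80 are conventionally read as very good. A minimal sketch of the computation (with illustrative codes, not data from the study):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(counts1) | set(counts2)
    # Chance agreement: product of the raters' marginal category proportions.
    p_chance = sum((counts1[c] / n) * (counts2[c] / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Illustrative codes (0/1/2 points) from two raters for ten test sheets.
r1 = [2, 1, 0, 2, 1, 1, 0, 2, 2, 0]
r2 = [2, 1, 0, 2, 1, 0, 0, 2, 2, 0]
kappa = cohens_kappa(r1, r2)
```

Here the raters agree on 9 of 10 sheets, but the kappa value is lower than 0.9 because part of that agreement would be expected by chance alone.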
The data were scaled using a one-parameter Rasch model with the help of the software ConQuest (Wu et al. 2007). For the estimation of item and person parameters, weighted likelihood estimation was used. To determine item parameters and to evaluate the test instrument, all three points of measurement in the main study were treated as if they were independent observations of different people, even though the same person could appear in up to three different rows of the data matrix. This approach is called ‘using virtual persons’ (Rost 2004) and is used in PISA (OECD 2012) and TIMSS (Martin et al. 2016), since it is unproblematic for the estimation of item parameters. These item parameters are the basis for the evaluation of the test instrument reported in this chapter.
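The ‘virtual persons’ approach can be sketched as follows: each (student, measurement point) pair contributes its own row to the response matrix, so the same student appears several times in the joint item-parameter estimation. A minimal illustration with two students, two measurement points and three items (the data are invented; NaN marks items that were not part of a student’s booklet):

```python
import numpy as np

# Illustrative response matrices for two measurement points;
# np.nan marks items that were not part of the student's booklet.
pre = np.array([[1.0, 0.0, np.nan],
                [0.0, 1.0, 1.0]])
post = np.array([[np.nan, 1.0, 1.0],
                 [1.0, 1.0, 0.0]])

# 'Virtual persons': each student contributes one row per measurement
# point, so two students yield four rows for item-parameter estimation.
stacked = np.vstack([pre, post])
```

Item difficulties are then estimated once from the stacked matrix, while person abilities are calculated for each measurement point separately, as described above.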
8.2.5 Statistical Analyses to Answer the Research Questions
To be able to use the outcome of a probabilistic model for empirical data, it is necessary to check whether the a priori chosen model fits the data. Since the model that fits the data best is regarded as the best reproduction of the structure of the latent variable, this check of model fit can be used to gain more information about the competence structure itself. We therefore calculated several different models and compared their respective model fits. As we were interested in whether it is possible to measure the different sub-competencies separately, we compared the three models shown in Fig. 8.7. The first model is a four-dimensional one in which each sub-competency is measured as a separate dimension. The second model reflects the aggregation of sub-competencies that Brand (2014) and Zöttl (2010) chose for their research. The third model is one-dimensional. If this were found to be the best-fitting model for the empirical data, it would mean that the abilities students need to solve the different types of items, as presented in Sect. 8.2.1, are so similar that it would not be appropriate to model them as different dimensions.
When scaling empirical data with the help of item response theory (IRT), there are different ways to check how well a model fits the data. When estimating item and person parameters, the algorithm iterates until the likelihood of the observed responses reaches its maximum under the constraints of the given model. Therefore, the fit of two models can be compared by analysing their likelihoods (L). After estimating the parameters, the programme ConQuest displays the final deviance (D) of the estimation, which derives from the likelihood by D = −2 ln(L). The smaller the final deviance, the greater the likelihood and the better the model fits the data. This measure does not take into account the sample size or the number of estimated parameters; therefore, the AIC and BIC are also reported. AIC tends to prefer models that are too large, whereas BIC prefers smaller models. If both criteria prefer the same model, this is likely to be the best of the candidate models (Kuha 2004, p. 223).
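The comparison via deviance, AIC and BIC follows the standard definitions D = −2 ln(L), AIC = D + 2k and BIC = D + k ln(n), where k is the number of estimated parameters and n the number of observations. A minimal sketch with illustrative deviance values (not the values from this study):

```python
import math

def aic(deviance: float, n_params: int) -> float:
    """Akaike information criterion: AIC = D + 2k, with D = -2 ln(L)."""
    return deviance + 2 * n_params

def bic(deviance: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: BIC = D + k ln(n); the penalty
    grows with the sample size, so BIC favours smaller models."""
    return deviance + n_params * math.log(n_obs)

# Illustrative comparison: a higher-dimensional model must lower the
# deviance enough to offset its larger parameter penalty.
aic_4d = aic(deviance=50000.0, n_params=40)  # 50080.0
aic_1d = aic(deviance=50100.0, n_params=34)  # 50168.0
```

In this invented example, the four-dimensional model’s extra parameters are outweighed by its lower deviance, so AIC prefers it; with a large n, the BIC penalty per parameter is larger, which is why BIC tends towards smaller models.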