1 Introduction

Word problems (WPs) serve many purposes in mathematics education. They bring variation to the practising of basic mathematical operations and prepare students to use mathematical skills in everyday situations outside the classroom. WPs differ from other mathematical tasks that are presented in mathematical notation because the problem is laid out through text that describes a situation and a question(s) that must be answered by performing a mathematical operation(s) derived from the descriptions in the text (Verschaffel et al. 2000). The text comprehension theory of Kintsch and his collaborators (Kintsch 1998; van Dijk and Kintsch 1983) has been widely applied to describe processes required to solve mathematical WPs (Kintsch and Greeno 1985; Reusser et al. 1990). This theory distinguishes between textbases (networks of propositions within the text) and situation models (model of the situation described in the text) as two aspects necessary for an adequate mental representation of text (Kintsch 1998, p. 107).

Solving WPs is complex, as the (complete) process involves a number of phases (Montague et al. 2014; Niss 2015; Verschaffel et al. 2000). Depaepe and colleagues (2015) reviewed descriptions of different WP-solving processes (e.g., Blum and Niss 1991; Burkhardt 1994; Mason 2001; Verschaffel et al. 2000) and concluded that, essentially, the process comprises six phases that are not necessarily performed sequentially: (1) understanding and defining the problem situation and developing a situation model; (2) developing a mathematical model based on a proper situation model; (3) working through the mathematical model to acquire mathematical results; (4) interpreting the results with respect to the original problem situation; (5) examining whether the interpreted mathematical result is appropriate and reasonable for its goal; and (6) communicating the acquired solution of the original WP. According to this description, WP-solving requires students not only to apply mathematical concepts and procedures (e.g., arithmetic relations) but also to construct a mental representation (Verschaffel et al. 2015) that demands different levels of text comprehension (van Dijk and Kintsch 1983; Reusser et al. 1990).

In this study, we focused on the more demanding WPs that cannot be solved without going through important phases of the complex problem-solving process summarized by Depaepe and colleagues (2015). Instead of dealing with the difficulty from the experts’ predefined point of view, we focused on the difficulty as it appears in students’ performance (see actor-oriented theory by Lobato 2012). This study aims to (a) explore WP-item characteristics regarding linguistic and numerical factors and their difficulty levels, using item response theory (IRT) modelling; (b) examine an association among WP-solving, text comprehension and arithmetic skills; and (c) investigate whether students with different levels of text comprehension and arithmetic skills differ in their performance on WPs at different levels of difficulty.

1.1 Linguistic and mathematical factors contributing to WP difficulty

Previous studies have investigated different factors contributing to WP difficulty. For example, in the 1980s, researchers began investigating the difficulties that students encounter when solving various WPs, starting from simple arithmetic WPs (e.g., change, combine, compare: Carpenter 1985; Cummins et al. 1988; De Corte and Verschaffel 1987; Greer 1987; Riley and Greeno 1988) and progressing to more complex WPs requiring non-routine thinking (Lee et al. 2014; Verschaffel and De Corte 1997). Based on a recent literature review by Daroczy and colleagues (2015), the factors influencing WP difficulty could be distilled into three components: linguistic factors, numerical factors, and interaction between linguistic and numerical factors (e.g., reading direction and numerical process, order of number word system).

Prior studies reported that linguistic factors, such as the number of words in the WP statement, influence the difficulty of WPs (Jerman and Rees 1972). However, later research has shown that superficial textual features such as the number of words hardly explain the difficulty of WPs. For example, Lepik (1990) investigated students’ performance on WPs and found that linguistic factors, including the length of a WP statement, were not significant predictors of the proportion of correctly solved WPs but were strong predictors of the WP-solving time. In addition to this basic quantitative property of WPs, empirical evidence has convincingly shown that the semantic structure of WPs strongly impacts WP difficulty and the strategies that young children apply when solving arithmetic WPs (Cummins et al. 1988; De Corte and Verschaffel 1987; LeBlanc and Weber-Russell 1996; Riley and Greeno 1988). Semantic structure refers to meaningful relations between the known and unknown sets involved in the WP (e.g., whether the text of a simple additive WP involves a combination of two sets, a dynamic change in a start set, or a comparison between the magnitudes of two sets: see Riley and Greeno 1988 for WPs with different semantic structures). There are evidently significant differences in the probability of solution for problems both within a specific semantic type and between these semantic structure types (Cummins et al. 1988; De Corte and Verschaffel 1987). According to the literature, many young children have difficulty at the stage of comprehending sentences (Cummins 1991), while difficulties for older students may be more closely connected to the overall demands of arriving at an integrated representation of the situation described in text (LeBlanc and Weber-Russell 1996). In WPs, an additional linguistic factor is related to the role of a situation model in comprehending the meaning of the problem text.
In routine WPs, it is enough to understand the propositions presented in the text, whereas in WPs requiring non-routine thinking, the construction of an adequate situation model is necessary in order to understand the problem (Kintsch 1998; Reusser 1992). Linguistic factors, such as irrelevant information and implicit information, have been found to influence students’ comprehension of WP-situations, which is essential to the construction of situation models. According to Englert and colleagues (1987), irrelevant numerical information negatively impacted students’ WP-solving, while irrelevant linguistic information did not affect their performance. Concerning implicit information, researchers suggested that many unsuccessful problem solvers often rely on the direct translation strategy (looking for numbers and keywords) and fail to provide correct answers, especially when problems include important implicit information, which they should infer on the basis of the situation described in the text (Hegarty et al. 1995). Another factor influencing WP difficulty, which can be seen as an extended aspect of a situation model, is the necessity of using realistic considerations requiring a non-direct translation of the situation model into a mathematical one. WPs that demand the use of realistic considerations were reported to be very difficult for many students (e.g., Verschaffel and De Corte 1997; Verschaffel et al. 2000).

Several studies have shown that numerical factors, such as number properties, required operations and number of solving steps, influence difficulty too. For example, Koedinger and Nathan (2004) investigated the effect of decimal numbers on students’ WP-solving performance. Their results indicated that whole-number problems are significantly easier than decimal-number problems. Apart from the effect of number type, the type of operation required (e.g., addition and subtraction; multiplication and division) appears to have an impact on children’s solution strategies and varies widely in difficulty (De Corte and Verschaffel 1987; De Corte et al. 1988). Various kinds of arithmetic calculation errors can result from the type of operation required (Kingsdorf and Krawec 2014). In addition to the required operations, the number of solving steps was reported to have an impact on WP difficulty. For instance, Quintero (1983) examined students’ problem-solving performance on WPs with a ratio and revealed that two-step ratio WPs are more difficult than single-step ones. Problems can also require mathematical thinking that goes beyond the basic arithmetic, such as combinatorial reasoning, which has proved to be difficult for children (English 2005).

1.2 Associations between WP-solving, text comprehension, and arithmetic skills

A substantial number of studies have examined an association between WP-solving performance, arithmetic and text comprehension skills. For example, Fuchs and colleagues (2006, 2018) reported that arithmetic skills are related to WP-solving performance and can be seen as an essential foundation for WP-solving. However, their studies indicated that, although arithmetic skills are a necessary foundation, they are not sufficient for WP-solving given that WPs also require text processing when constructing a mental representation. Furthermore, this is evident in several studies that have found associations between WP-solving performance and text comprehension even after controlling for general cognitive abilities (e.g., working memory) or other factors that may be involved (e.g., technical reading, calculation skill, gender) (Boonen, de Koning et al. 2016; Boonen, van der Schoot et al. 2013; Swanson et al. 1993; Vilenius-Tuohimaa et al. 2008). Although the association between WP-solving performance, text comprehension, and arithmetic skills has received much attention in previous research (e.g., Boonen et al. 2013, 2016; Fuchs et al. 2006, 2018; Swanson et al. 1993; Vilenius-Tuohimaa et al. 2008), these studies typically investigated WP-solving performance on WPs that have simple semantic structures and did not pay any attention to the differences in the difficulty levels of WPs. There are well-established findings in the literature on how different semantic problem types have different difficulty levels (LeBlanc and Weber-Russell 1996) and result in different errors (Carpenter 1985; Cummins et al. 1988; Greer 1987; Riley and Greeno 1988). However, there is still a lack of studies on more demanding WPs that require non-routine thinking (e.g., including more complex structures, involving different factors contributing to their difficulty).

Various students may experience WP difficulty differently. Moreover, the effects of students’ skills and WP characteristics may be interrelated. For example, linguistically rather weak students (poor in text comprehension) may face challenges with linguistically complex WPs (e.g., including semantically complex features, long WP statements) and arithmetically rather weak students with arithmetically complex WPs (Daroczy et al. 2015). This assumption seems reasonable since empirical evidence has clearly shown associations between WP-solving performance, text comprehension, and arithmetic skills (Boonen et al. 2013, 2016; Fuchs et al. 2006, 2018; Swanson et al. 1993; Vilenius-Tuohimaa et al. 2008). However, it raises important questions as to whether it is possible to identify which WPs are linguistically or arithmetically complex and whether both features can explain the difficulty of more demanding WPs.

1.3 Aims

Our aim in this study is to deepen our understanding of the associations between linguistic and mathematical WP characteristics and WP-solving skills. Previous investigations of these associations were mostly conducted with simple arithmetic WPs (e.g., Boonen et al. 2013; Fuchs et al. 2006, 2018; Swanson et al. 1993; Vilenius-Tuohimaa et al. 2008). In this study, the investigation was carried out with more demanding WPs and the item-level difficulty of WPs was scrutinized. This study focused on a joint investigation of linguistic and numerical WP characteristics and took into account students’ skills in text comprehension and arithmetic, while previous studies often focused on one or the other aspect or skill (Daroczy et al. 2015). In identifying the difficulty level of WPs, previous research typically relied on classical test theory (CTT), in which the proportion of individuals answering the item correctly is used as the index of item difficulty (Finch and French 2015; Parkash and Kumar 2016; Stage 2003). However, the item difficulty index derived from CTT is often criticized because it is dependent on the sample (Chalmers 2012; Stage 2003). A widely recommended alternative to CTT is item response theory (IRT) modelling (De Ayala 2009; Finch and French 2015; Reckase 2009), in which the estimated difficulty level refers to a probability of a correct response at a given level of participant ability (Finch and French 2015). With IRT, it is possible to obtain item characteristics (e.g., item difficulty level) that are not dependent on the examinee group (Parkash and Kumar 2016; Stage 2003).
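
The sample dependence of the CTT index can be illustrated with a minimal Python sketch. The response matrices, the function name `ctt_difficulty`, and the two samples are made-up assumptions for illustration only; they are not data or code from this study.

```python
def ctt_difficulty(responses):
    """Return the proportion of correct (1) responses for each item.

    `responses` holds one row per student; each row is a list of 0/1
    scores, one per item. Hypothetical helper, for illustration only.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_students
            for j in range(n_items)]


# Made-up data: 4 students x 3 items (not from the study).
sample_a = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
# A stronger, equally made-up sample answering the same three items.
sample_b = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
]

print(ctt_difficulty(sample_a))  # [0.75, 0.5, 0.25]
print(ctt_difficulty(sample_b))  # [1.0, 0.75, 0.75]
```

The same three items receive different difficulty indices in the two samples, which is the kind of sample dependence for which the CTT index is criticized.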

The present study has the aim of answering the following research questions:

  1. Are there linguistic or mathematical features that explain the level of difficulty of various WPs?

  2. Are students’ text comprehension, arithmetic, and WP-solving skills correlated with each other?

  3. How do different patterns of students’ text comprehension and arithmetic skills predict their performance in WPs of different levels of difficulty?

2 Method

2.1 Participants and overall design

Participants comprised 891 fourth-grade students, including 446 boys and 445 girls, from different elementary schools situated in cities, small towns, and rural communities in southern Finland. All of them had Finnish as their mother tongue. All participants were asked to complete text comprehension, arithmetic, and WP-solving tests in a classroom situation as a part of the Quest for Meaning project. The data were partly used in a previous study (Kajamies et al. 2010). The University of Turku’s ethical guidelines were followed. Permissions were obtained from both the schools and the students’ guardians.

2.2 Measures

2.2.1 Mathematical word problems

Students’ WP-solving performance was measured with a WP test containing 15 items (Kajamies et al. 2003; see Appendix 1). These WPs were created in such a way that they could not be solved using straightforward strategies (e.g., by requiring students to develop a proper situation model, avoiding keywords in the WPs, and giving numerical information in a written format). Two WPs (WP6 and WP13) were constructed based on original items used in earlier studies (Verschaffel et al. 2000). The students had no time limit to complete the WP test. All WPs were assessed by giving one point for each correct answer and zero points for an incorrect answer or no response. Cronbach’s alpha for the whole test was 0.76. The number of words, irrelevant information, implicit information, the use of realistic considerations, the required problem-solving steps and arithmetic operations, and the use of decimal numbers were all noted in order to investigate WP characteristics that may influence WP difficulty level (see Table 2).
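
As a reminder of how an internal-consistency coefficient of this kind is obtained from dichotomously scored items, a minimal Python sketch follows. The score matrix and the function name `cronbach_alpha` are illustrative assumptions; the sketch uses population variances throughout, and it is not the procedure or software used in this study.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a students x items matrix of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score)).
    Assumes at least two items and a non-zero total-score variance.
    """
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up 0/1 scores for 4 students on 3 items (not from the study).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

print(round(cronbach_alpha(scores), 2))  # 0.75
```

For 0/1 items and population variances, this formula coincides with the Kuder–Richardson formula 20.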

2.2.2 Text comprehension

Text comprehension skills were assessed with the Finnish Standardized Reading Test (Lindeman 1998). The students were given 48 multiple-choice questions about the four different texts they had to read. One point was awarded for each correct answer, making a maximum score of 48 for text comprehension. The Kuder–Richardson coefficient of internal consistency (KR-20) was 0.87 (Lindeman 1998). Text comprehension was seen as an important measure of the linguistic skills of 4th graders.

2.2.3 Arithmetic skills

Arithmetic skills were measured with a time-limited (10 min) RMAT test (Räsänen 2004). The test begins with simple computations and ends with algebraic tasks. According to Räsänen (1993), the RMAT is comparable to the WRAT-R (Jastak and Wilkinson 1984). Both of them contain similar instructions, but the RMAT closely follows the Finnish mathematics curriculum (e.g., the role of fractions is not emphasized) and includes more computational tasks. Therefore, it can assess more basic arithmetic than the WRAT-R (correlations were 0.547–0.659, n = 2673, Räsänen 2004). The total number of correct solutions in the RMAT is here used as an indication of the students’ arithmetic skills. The maximum score was 56. The alpha coefficient was 0.92–0.95 (Räsänen 2004).

2.3 Analysis

The analyses in the present study were conducted in two phases. The first phase investigated item characteristics and employed item response theory (IRT) modelling to classify WPs based on their difficulty level. The second phase used k-means clustering to categorize students into groups based on their text comprehension and arithmetic skills. In addition, one-way analysis of variance (ANOVA) was used to determine whether students with different text comprehension and arithmetic skills differ in their performance in mathematical WP-solving.

2.3.1 Item response theory (IRT)

IRT is widely employed in assessment and evaluation research in the fields of education and psychology. It is an approach to testing based on the relationship between a participant’s performance on a particular test item and his or her general level of performance across all items measuring the skill in question. In technical terms, IRT attempts to model individual response patterns by determining how underlying latent trait levels (i.e., ability) interact with the item’s characteristics (e.g., item difficulty, discrimination ability) to form an expected probability of the response pattern (Chalmers 2012; Embretson and Reise 2000). In this study, IRT analyses were conducted in R 3.2.3 with the ltm (latent trait models) package, which was developed to analyze multivariate dichotomous data using latent variable models (Rizopoulos 2006).

One type of IRT model, the two-parameter logistic (2PL) model, was applied to investigate the difficulty level of WPs. It expresses the relationship between an individual’s level of the latent trait (his or her WP-solving ability) and the probability of endorsing a given item (answering the WP correctly) in the form of a logistic model (Finch and French 2015). Fit indices (AIC, BIC, and item fit) were examined to see whether the model fits the individual items well.

2PL model:

$$P\left( x_{j} = 1 \mid \theta, b_{j} \right) = \frac{e^{a_{j}\left( \theta - b_{j} \right)}}{1 + e^{a_{j}\left( \theta - b_{j} \right)}}$$

\(\theta = \text{students' ability};\,a_{j} = {\text{discrimination value of item}}\,j;\, b_{j} = {\text{difficulty level of item}}\,j.\)
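
To show how the three quantities defined above interact, the following minimal Python sketch evaluates the 2PL response probability. The function name `p_correct_2pl` and all parameter values are illustrative assumptions rather than estimates from this study, and the sketch is independent of the ltm package used for the actual analyses.

```python
import math


def p_correct_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model.

    theta: student's ability; a: item discrimination; b: item difficulty.
    """
    # Algebraically equal to exp(a * (theta - b)) / (1 + exp(a * (theta - b))).
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Illustrative (made-up) item parameters, not estimates from the study.
easy_item = {"a": 1.2, "b": -1.0}
hard_item = {"a": 1.2, "b": 1.5}

for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(p_correct_2pl(theta, **easy_item), 3),
          round(p_correct_2pl(theta, **hard_item), 3))
```

When a student's ability equals the item difficulty, the probability of a correct response is exactly 0.5. An item with a negative difficulty value is therefore answered correctly with a probability above 0.5 by a student of average ability (ability 0), and an item with a positive value with a probability below 0.5.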

2.3.2 Unidimensionality test

For selecting a suitable IRT model, the dimensionality of a set of test items has to be tested. A primary assumption underlying the 2PL model is that there is only one latent trait being measured by the set of items (unidimensionality). There are many ways to test the unidimensionality assumption (see Finch and French 2015; Verhelst 2002). One approach is to use the bootstrap modified parallel analysis test (BMPAT, Finch and Monahan 2008), which was developed based on Horn’s (1965) parallel analysis method for indicating the number of factors. The BMPAT works by checking the second eigenvalue of the observed data to see whether it is larger than the second eigenvalue of the data under the assumed IRT model. If the BMPAT test results are statistically significant for the second eigenvalue (p < 0.05), it means that the data are not unidimensional (Finch and French 2015).

2.3.3 2PL model and unidimensionality test

Table 1 shows the fit indices for the 2PL model. The results indicate that the 2PL model fits all items well. Based on the BMPAT result, the observed data are unidimensional (p > 0.05), which supports the primary assumption underlying the 2PL model.

Table 1 Fit indices and unidimensionality tests for 2PL model

3 Results

3.1 Difficulty level and WP characteristics concerning linguistic and numerical factors

Item difficulty values estimated by the IRT-analysis and linguistic and numerical factors of WPs were examined, and the results are presented in Table 2 and Fig. 1. According to Finch and French (2015), the item difficulty estimates are centred at 0. Therefore, the negative values indicate relatively easy items, while the positive values represent somewhat difficult items. The order of items was arranged based on their difficulty level, from the easiest item (WP1) to the most difficult one (WP13).

Table 2 Item difficulty (SE) and other characteristics

Fig. 1 Item characteristic curves (ICC) of 2PL model

Overall, the results showed that the association between WP characteristics concerning linguistic and numerical factors and their difficulty level is not simple and straightforward. There was no significant correlation between the number of words and the difficulty value (r(13) = 0.21, p = 0.490). Within these WPs, the need for realistic considerations did not explain the difficulty, because the two WPs requiring realistic consideration (WP6 and WP13) were located at opposite ends of the difficulty dimension. Also, WPs containing irrelevant linguistic or mathematical information were spread evenly along the difficulty dimension, so this feature did not distinguish between easy and difficult WPs. However, implicit information seemed to explain WP difficulty: the eight WPs with the lowest difficulty values had no implicit information, whereas four of the five WPs with the highest difficulty values (WP12, WP5, WP4, and WP10) had implicit information.

Neither the solving steps (r(13) = 0.22, p = 0.468) nor the number of possible operations (r(13) = 0.42, p = 0.157) correlated significantly with the difficulty value, but both of the WPs including decimal numbers appeared to be relatively difficult.

The success rates of WP9 (6.6%) and WP13 (7%) were extremely low, and the curves (see Fig. 1) suggest that the two items were very difficult. Therefore, within the model, these items’ difficulty could not be properly estimated (extremely high standard errors). None of the aspects used in the analysis (Table 2) explained the extreme difficulty of WP9. One explanation might be that it was the only WP that required combinatorial reasoning. WP13 required deep understanding of the real-life situation described in the problem. Because only a few students could solve these two extremely difficult WPs, they were excluded from further analyses.

For further analyses, WPs were classified as easy (WP1, WP6, WP3, WP14, and WP15) and difficult items (WP2, WP8, WP7, WP12, WP5, WP4, WP11, and WP10) based on their difficulty values in the IRT analyses. Cronbach’s alphas for each subgroup were 0.61 and 0.64, respectively.

3.2 Associations between text comprehension, arithmetic, and WP-solving skills

To investigate the interrelation among text comprehension, arithmetic, and WP-solving skills on easy and difficult items, Pearson correlations were calculated. The students who did not complete all three tests were excluded from the analyses (N = 55). The correlation matrix is shown in Table 3. It revealed a significant correlation between text comprehension skills and performance on both easy (r(836) = 0.41, p < 0.01) and difficult items (r(836) = 0.43, p < 0.01). Arithmetic skills also showed a significant correlation with performance on both easy (r(836) = 0.52, p < 0.01) and difficult items (r(836) = 0.53, p < 0.01).
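
For readers less familiar with the coefficient, a minimal Python sketch of the Pearson correlation is given here. The function name `pearson_r` and the two short score lists are made-up assumptions for illustration; they do not reproduce the data or the software used in this study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up scores for five students (not from the study).
text_comprehension = [1, 2, 3, 4, 5]
wp_score = [2, 4, 5, 4, 5]

print(round(pearson_r(text_comprehension, wp_score), 3))  # 0.775
```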

Table 3 Pearson’s correlation matrix between the main variables

3.3 Individual differences and how those differences relate to WP-solving performance

To investigate individual differences in text comprehension and arithmetic skills and how those differences relate to WP-solving performance, first, students were categorised based on their text comprehension and arithmetic skills. K-means clustering was conducted on the z-scores of the variables. As a result, students were classified into four different groups: poor in text comprehension and arithmetic skills (N = 154); poor in text comprehension but skilful in arithmetic skills (N = 197); skilful in text comprehension but poor in arithmetic skills (N = 288); skilful in both skills (N = 197) (see Fig. 2). Then, ANOVAs were conducted to compare students’ text comprehension and arithmetic skills between groups. The results of the ANOVAs indicate that there was a significant difference between groups in their text comprehension (F(3, 832) = 787.666, p < 0.001) and arithmetic skills (F(3, 832) = 519.959, p < 0.001). Post-hoc tests (Bonferroni) revealed that the differences in text comprehension and arithmetic skills were significant in all group comparisons (all ps < 0.001). Descriptive information concerning text comprehension and the arithmetic skills of students in each group is presented in Table 4.
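
The clustering step can be sketched in a few lines of Python. This is a plain implementation of Lloyd's k-means algorithm on z-scores with four starting centres placed at the corners (low or high on each skill). The raw scores, the starting centres, and the function names are assumptions made for illustration; the source does not report the initialisation or the software used for clustering.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise raw scores to mean 0 and (population) SD 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, n_iter=100):
    """Lloyd's algorithm; `centroids` holds the starting centres."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: sq_dist(pt, centroids[c]))
                  for pt in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(mean(dim) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centre
        if new_centroids == centroids:
            break  # converged: centres no longer move
        centroids = new_centroids
    return labels, centroids


# Made-up raw scores for 8 students (not from the study).
text = [10, 12, 11, 13, 40, 42, 41, 43]
arith = [10, 40, 12, 42, 11, 41, 13, 43]
points = list(zip(z_scores(text), z_scores(arith)))

# Assumed starting centres: one per combination of low/high on each skill.
start = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
labels, centres = k_means(points, start)
print(labels)  # [0, 1, 0, 1, 2, 3, 2, 3]
```

In this toy example, cluster 0 is low on both skills, cluster 1 is low in text comprehension but high in arithmetic, cluster 2 shows the reverse pattern, and cluster 3 is high on both.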

Fig. 2 Four clusters of students based on their text comprehension and arithmetic skills. M−L− very poor in both skills, M+L− skilful in arithmetic but poor in text comprehension, M−L+ poor in arithmetic but skilful in text comprehension, M++L++ very skilful in both skills

Table 4 Descriptive information concerning text comprehension, arithmetic, and WP-solving skills by each student group

Later, ANOVAs were conducted to investigate whether students in each group differ in their performance in mathematical WP-solving on easy and difficult items. The results of these ANOVAs show that there was a significant difference between groups in mathematical WP-solving performance on both easy items (F(3, 832) = 102.636, p < 0.001) and difficult items (F(3, 832) = 116.554, p < 0.001).
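
The F statistic of a one-way ANOVA and its degrees of freedom can be sketched as follows. The function name `one_way_anova` and the three small groups of scores are illustrative assumptions, not the study data. With 836 students in four groups, the same formulas give the degrees of freedom (3, 832) reported above.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-group variation: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up scores for three groups of three students (not from the study).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova(groups)
print(round(f_stat, 2), df_b, df_w)  # 13.0 2 6
```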

As shown in Table 4, the results of post hoc comparisons using the Bonferroni test reveal that students who were very poor in both skills had the lowest mathematical WP-solving performance on both easy (M = 0.23, SD = 0.25) and difficult items (M = 0.12, SD = 0.14) (all ps < 0.001), while students who were very skilful in both skills had the highest mathematical WP-solving performance on both easy (M = 0.71, SD = 0.23) and difficult items (M = 0.53, SD = 0.22) (all ps < 0.001). In addition, students who were poor in text comprehension but skilful in arithmetic skills (M = 0.53, SD = 0.26) performed significantly better than those who were skilful in text comprehension but poor in arithmetic skills (M = 0.46, SD = 0.27) on easy WPs (p < 0.05). However, there were no differences in these groups’ performance on difficult WPs (p = 0.14), showing that more challenging WPs require students to also be competent in text comprehension.

4 General discussion

Previous empirical evidence has convincingly shown that WP-solving performance is related to both text comprehension (Boonen et al. 2013, 2016; Swanson et al. 1993; Vilenius-Tuohimaa et al. 2008) and arithmetic skills (Fuchs et al. 2006, 2018). However, these studies mainly examined WP-solving performance on arithmetic WPs with simple semantic structures without paying any attention to the differences in the difficulty level of WPs. This study focused on WP-solving performance on demanding WPs whose solution requires students to develop a proper situation model rather than rely solely on superficial coping strategies, such as the keyword approach. The first aim of the study was to investigate WP-item characteristics regarding linguistic and numerical factors and their difficulty level, using item response theory (IRT) modelling. We wanted to find out whether the selected linguistic factors (the length of WP statement), those influencing difficulty in developing a situation model (irrelevant information, implicit information, and the use of realistic considerations), and numerical factors (number properties, required operations, and number of solving steps) could explain the difficulty level of these demanding WPs. Overall, the results revealed that superficial linguistic factors did not clearly explain WP difficulty. However, the WPs that required inference of implicit information all belonged to the difficult WPs. Both WPs including decimal numbers (WP7 and WP11) were difficult, but other numerical factors did not predict the difficulty of WPs. These results are not surprising given that the structure and context of these demanding WPs that require non-routine thinking are fairly different, and there is no common strategy to solve these problems. Individual items seem to have unique factors in their deeper structure that contribute to the item’s difficulty level.
For example, the main factor that contributes to the difficulty level of the extreme item WP13 seems to be the demand to use real-world knowledge in the modelling process (see Verschaffel and De Corte 1997), while the factor influencing WP difficulty for WP10 could be the difficulty in developing a situation model from the WP statement. One possible explanation for the finding that some of the superficially linguistically similar items appeared to be more difficult is that, in these difficult items, the deep structure is different. For example, the numbers needed in the calculations were not directly given and the students had to infer them from textual expressions.

The second aim of this study was to examine an association between text comprehension, arithmetic, and WP-solving skills on easy and difficult items. In line with prior studies (Boonen et al. 2013, 2016; Fuchs et al. 2006, 2018; Swanson et al. 1993; Vilenius-Tuohimaa et al. 2008), the results showed that text comprehension correlated with WP performance. In this study, we showed that the connection was equally strong with both easy and difficult items. Similar connections also occurred with arithmetic skills; the results indicated that there is an association between arithmetic skills and performance in both easy and difficult items.

The last aim of this study was to investigate individual differences in text comprehension and arithmetic skills and their relationships with WP-solving performance. Students were categorised, based on their text comprehension and arithmetic skills, into four different groups: very poor in both skills; poor in text comprehension but high in arithmetic skills; skilful in text comprehension but poor in arithmetic skills; very skilful in both skills. As expected, the students who were weak in both skills had the lowest mathematical WP-solving performance in both easy and difficult items, while students who were strong in both skills had the highest WP-solving performance in both easy and difficult items. In addition, students who were poor in text comprehension but strong in arithmetic skills performed better than those who were skilful in text comprehension but poor in arithmetic skills in easy WPs. However, there were no differences in the performance of these two groups on difficult WPs. This shows that arithmetic skills are needed in all WPs, but in more challenging WPs, the role of text comprehension skills becomes important as well. These results are in concordance with previous studies reporting that text comprehension skills have a prominent role in overcoming textual complexities (De Corte et al. 1985; De Corte et al. 1990), for example, when students deal with WPs containing semantically complex features (Boonen et al. 2016).

The main conclusion of the study was that the difficulty of WPs is not based on the surface linguistic features, but there seem to be features in the structure of the WP texts requiring deeper comprehension (Kintsch 1998), which can explain the differences in the levels of difficulty. However, the evidence for this conclusion is not conclusive because the WPs used in this study were not systematically planned for the comparison of these features. Future studies are needed with WPs in which surface and deep structure features of the WPs are systematically varied. Further, in investigating individual differences in WP-solving performance, the study focused mainly on text comprehension and arithmetic skills, but other individual factors (e.g., working memory, motivation) were not included.

Lastly, to examine WP-solving performance and WP difficulty, the study relied merely on WP-test achievement. In future studies, more detailed observations of students’ WP-solving processes, for example, through stimulated recall interviews, are needed to understand students’ challenges in solving different demanding WPs.

5 Limitations of the study

A major limitation of the study is that in the formulation of WPs, different mathematical and linguistic features were not systematically varied. One consequence of the missing systematic design of the WPs was that it was impossible to make clear theory-based sub-categories of the WPs. In addition, some widely studied linguistic aspects such as different semantic structures (e.g., LeBlanc and Weber-Russell 1996) were not clearly represented in the WPs. Another limitation is that WP-solving takes much time and it was not possible to collect the data with a larger number of WPs. This aspect also limited the opportunity to find reliable sub-categories representing different mathematical and linguistic aspects of problems.

6 Educational implications

Mathematical WPs can be valuable content for mathematics education, and the use of WPs requiring non-routine thinking and real-world knowledge in the modelling process, instead of mere routine problems, has been recommended by many researchers (CTGV 1992; Mason and Scrivani 2004; Pongsakdi et al. 2016; Verschaffel and De Corte 1997). It is, however, important to be aware of the remarkable differences in the levels of difficulty of the WPs, which are not always self-evident to teachers or textbook authors. Moving from routine to realistic non-routine tasks also requires novel teaching strategies and teacher scaffolding (Pongsakdi et al. 2016; Pongsakdi et al. 2019). The results of this study show that demanding non-routine WPs require high levels of text comprehension skills. Thus, practising with more demanding WPs is not only beneficial for mathematics learning but may also help to improve advanced text comprehension skills.