Introduction

Relevance

Among the 17 Sustainable Development Goals (SDGs) launched by the United Nations (UN) in 2015 (UN 2015; 2016), the fourth goal (SDG 4) is dedicated to education. Extending the scope beyond the previous agenda’s focus on primary education,Footnote 1 it aims to “promote lifelong learning opportunities for all”. This has led to “hopes for a stronger role” of adult learning and education “in global education agendas and policies” (Elfert 2019, p. 537). While UN Agendas fall into the category of soft law,Footnote 2 they reflect a need for action, and by endorsing them, UN Member States have made commitments towards trying to achieve the targets.

One of the core instruments of soft law is monitoring (Grek 2019), and it often relies on assessment (Hamilton et al. 2015). Monitoring countries’ progress towards achieving the targets of the SDGs on an international scale makes it necessary to discuss methods of assessment, especially for adult literacy and numeracy. One of the ten targets within SDG 4 directly addresses adult literacy and numeracy skills:

By 2030, ensure that all youth and a substantial proportion of adults, both men and women, achieve literacy and numeracy (SDG target 4.6; UN 2016).

To boost effective action in addressing SDG 4, the UNESCO Institute for Statistics (UIS) recently launched the Global Alliance to Monitor Learning (GAML), which

is designed to improve learning outcomes by supporting national strategies for learning assessments and developing internationally-comparable indicators and methodological tools to measure progress towards key targets of … SDG 4 (UIS 2017).

This initiative covers all ten targets of SDG 4, with thematic task forces established to address each of them. Since 2017, the task force for SDG target 4.6 has held several expert meetings in order to collect and evaluate existing tests and findings and discuss adequate testing instruments.

The dilemma is how to build on earlier – mostly Western – research on the one hand, and how, on the other hand, to avoid a monopolistic spread of definitions and test instruments throughout the world (Addey 2018). Another challenge is that the most powerful instrument, the Programme for the International Assessment of Adult Competencies (PIAAC) conducted by the Organisation for Economic Co-operation and Development (OECD),Footnote 3 is too expensive for most UN Member States. The OECD asks participating countries to organise the data collection and test analysis themselves, which requires sample sizes of around 5,000 test takers per country. Completing the test and questionnaire takes respondents approximately two hours and includes a computer-aided personal interview, which is usually carried out by a survey company charging several million euros for the data collection.

Moreover, the five proficiency levels for literacy do not cover the most basic levels of literacy, i.e. from total illiteracy onwards (there is simply a sixth category labelled “below Level 1”).Footnote 4 Since GAML is monitoring improvement by 2030, at least two reports will be needed from each country before 2030: the first assessment serves as a baseline against which the second assessment can then be compared, hopefully demonstrating improvement in adult literacy and numeracy. The timeframe for developing suitable assessment methods and tools so that the first round of assessments can begin as soon as possible is therefore tight. What is most urgently needed are tests that cover the most basic levels of literacy in a more differentiated way than “below PIAAC Level 1”. Moreover, the question arises whether existing instruments that cover lower levels of literacy can be integrated into a common scale with instruments that cover higher levels of literacy, e.g. the PIAAC scale.

State of the art

In terms of existing instruments, there are two competing approaches, which we discuss in detail in the course of this article. One is the lower-rungs approach (Brooks, Davies et al. 2001a, b), and the other is the reading components approach (Sabatini and Bruce 2009; Strucker et al. 2007). In a nutshell, the lower-rungs approach takes a differentiated look at the lowest level of literacy, and the reading components approach indicates adults’ proficiency in decoding, word recognition and word meaning (vocabulary). Both approaches have strengths and weaknesses.

Test items of the lower-rungs type have the advantage of correlating with, and complementing, higher levels on international literacy proficiency scales such as those used by PIAAC. But they have not, in fact, been translated into languages other than English and German.

By contrast, the reading components test items are not hierarchically organised and therefore are not aligned with the PIAAC scale, but they do exist in several languages. Moreover, they have been administered internationally as an add-on to the OECD’s PIAAC programme, under UNESCO’s Literacy Assessment and Monitoring Programme (LAMP)Footnote 5 as well as the World Bank’s Skills Towards Employability and Productivity (STEP) skills measurement programme.Footnote 6 However, both of these programmes were run in middle-income countries or regions, and their suitability for low-income countries remains doubtful. Another complicating factor is that the reading components test items originate from many sources and there are different versions of test sets – with different ownership.

Purpose and structural organisation of this article

Our aim in this article is to explore and clarify whether the Reading Components, as they are used in their international version (e.g. as a PIAAC add-on), can be understood as hierarchical and therefore be organised on a proficiency scale which can be aligned with and connected to international literacy scales like the one applied by PIAAC. If this is possible, the reading component items would perform like lower-rungs items and then enhance the bottom end of the scale where the most basic skills are situated. This would solve the problem of where to find test items for a range of countries (including low-income ones), as the international Reading Components are already widely used, well-accepted and available in many languages, and have also already been pretested in the countries that participated in LAMP and STEP as well as those who bought the add-on module under PIAAC.

We begin with a review, looking back into the development of each of the two competing approaches (lower rungs versus components). This is necessary to avoid confusion between earlier and more recent versions. We also present the theoretical background, the development of test items as well as pretest and main test results for both approaches, and sum up the differences in a table. We then discuss both approaches with regard to their strengths and weaknesses for monitoring SDG target 4.6 globally. This discussion leads to our three research questions, the overarching purpose of which is to find out whether one of the item sets (the Reading Components test set) could be disconnected from its theoretical background (the components approach) and re-organised in a hierarchical way (as rungs on a ladder). This would meet the requirements specified by the GAML initiative for effective assessment methods to monitor a wide range of countries’ progress in achieving SDG target 4.6. In our methodology, we describe and report on the relevant statistical tests which we carried out using item response theory (IRT)Footnote 7 and the German PIAAC Reading Components subset of data. After presenting the results, addressing each of our three research questions, we evaluate the outcomes and conclude our article with recommendations for further re-analysis and refinement.

Review: assessing the most basic levels of literacy

International large-scale assessments currently measure literacy with unidimensional and continuous competence models. What this means is that individual proficiencies are hierarchically described as being situated on a scale rising from low to high levels of competence. In terms of the main results, PIAAC and earlier international assessmentsFootnote 8 have defined four or five proficiency levels and documented the percentage of adults scoring at each of these levels for each of the participating countries (OECD 2013; OECD and Statistics Canada 2000, 2005) and an average for all of them together. For example, in 2012, on OECD average, 15.5 per cent of the participating international population (ages 16–65) scored at literacy Level 1 or below (OECD 2013, p. 257).Footnote 9

In the underlying theoretical model, literacy is defined as

the ability to understand, evaluate, use and engage with written texts to participate in society, achieve one’s goals, and develop one’s knowledge and potential (OECD 2013, p. 61).

In addition to the literacy scale, a “Reading Components” assessment was included in PIAAC’s international “Survey of Adult Skills” (OECD 2013, pp. 59, 67). According to John Sabatini, the intention was to use the information collected through this additional “battery of reading component tasks” to “draw implications for policy, as well as for learning and instruction, for adults who score at or below Level 1 in literacy proficiency” (Sabatini 2015, p. 2; emphases added).

There are also approaches to assessing basic reading and writing skills with continuous models, so-called “lower-rungs approaches” (Brooks, Giles et al. 2001), which conceive of the continuum as a ladder and take into account even barely measurable low proficiency levels. However, when complementing (rather than extending) PIAAC with the above-mentioned “battery of reading component tasks” (Sabatini 2015, p. 2), the OECD preferred a non-continuous model of three reading components and did not integrate these into the six-level literacy scale (Levels 1–5 and the “below Level 1” category). The three reading components the PIAAC add-on module tests participants on are (1) print vocabulary (word meaning), (2) sentence processing and (3) passage comprehension (Sabatini and Bruce 2009). It remains unclear why there have been no attempts up to now to find out whether it would be possible to link either these three components or the total set of component items to the PIAAC scale. Perhaps one reason is the theoretical quality of the three components. Since these components were developed independently of PIAAC, they differ from what is now being tested on the overall literacy scale (Strucker et al. 2007). However, the preparations for PIAAC did polish the reading component subtests in a way that made them suitable for international comparison (Sabatini and Bruce 2009). We assume that the theoretical differences may have decreased during this process while the similarity to the overall PIAAC literacy scale increased.

Unlike the PIAAC literacy scale, which builds on item response theory (briefly explained in footnote 7), the Reading Components in the add-on module are tested using classical test theoryFootnote 10 methods (Yamamoto et al. 2013, p. 16; Zabal et al. 2014, p. 106). Again, it is not clear why this is so. It remains open to investigation whether it would be possible to run the reading component tests under an item response model as well, and also whether they would meet the necessary quality controls. If the answer to both of these questions turned out to be yes, the reading component tests would lose their full status as component tests, but they would gain the highly relevant quality of being statistically linkable to established international literacy scales.

Fig. 1
figure 1

Lower-rungs vs. components approach

Figure 1 illustrates the theoretical assumption about the main difference between rungs and components. While both are located inside the lowest level of literacy, labelled Level I in the graph, only the lower rungs are claimed to be hierarchical and part of the overall literacy scale. The components are conceived as different elements of the reading process and thus as non-hierarchical and not comparable to the literacy scale. Both approaches are explained further below.

Both approaches compete with each other in assessing the proficiencies of adults with low literacy skills. While the reading components approach was very fruitful in the early 2000s in Canada and the United States (US), the development of testing materials in the United Kingdom (UK) and in Germany focused on the lower-rungs approach. In early versions of the components approach, the components were clearly differentiated from each other and linked to distinct aspects of reading. When PIAAC chose to take the reading components approach on board in an add-on module, it became necessary to translate the test instruments already existing in individual countries (such as Germany, for example), and to reduce them to make them suitable for application in and comparison among a wide range of countries. It can be assumed that the components approach consequently became more similar to a (lower-)rungs approach than expected. The question is whether the reduction made to meet the needs of international comparability subsequently led the components to become hierarchical parts of one latent variableFootnote 11 (i.e. reading). We return to this question in a later section of this article.

The lower-rungs approach and its implementation in the Level-One Survey (LEO) in Germany

A lower-rungs approach can be applied to describe and examine low skills in literacy. This means it enables differentiating the lowest level of the literacy scale more finely – in other words, “creating the lower rungs of the ladder” (Brooks, Davies et al. 2001a, b, p. 55). By including proficiencies “below Level 1”, the lower-rungs approach extends the lower end of the established ranking of proficiency levels, which is based on a hierarchical and unidimensional model of literacy.

For example, the New Standards Level, developed in the UK in 2000 by the Basic Skills Agency (BSA) and the Qualifications and Curriculum Authority (QCA), comprised one “Entry Level”, subdivided into Entry Levels 1–3 (E1, E2, E3), describing reading skills that are comparable to the range below IALS Literacy Level 1Footnote 12 (Brooks, Davies et al. 2001a, b; QCA 2005). These levels were applied in the Skills for Life survey conducted by the UK Department for Business, Innovation and Skills in 2011 (BIS 2012).

Another example is the Level-One Survey (LEO), which implemented four so-called Alpha Levels (α1 [letters], α2 [words], α3 [sentences], α4 [whole texts]) in Germany. They are based on theories about the acquisition of written language,Footnote 13 international large-scale assessments, national and international educational standards, and concepts of the practice of adult basic education (Dessinger 2011; Kretschmann 2011). Furthermore, the Alpha Levels were theoretically anchored within the IALS literacy scale (i.e. below IALS Level 1) by the level definitions and the “can-do” descriptions and characteristics for determining the level of difficulty (Grotlüschen 2011). Examples are provided in Fig. 2, which shows the “can do” descriptors of Alpha Level 3 in reading, and Fig. 3, which shows the “can do” descriptors of Alpha Level 4 in writing.

Fig. 2
figure 2

“Can do” descriptors and task characteristics of Alpha Level 3 (reading) (translated from Kretschmann 2011, p. 53)

Fig. 3
figure 3

“Can do” descriptors and task characteristics of Alpha Level 4 (writing) (translated from Grotlüschen et al. 2010, p. 38). Note: *Final-obstruent devoicing means that spoken words like “Hun-d” [dog] sound as if they were spelled with a hard consonant at the end (Hun-t [do-k]), making it difficult to draw conclusions from the sound towards the spelling. ** Interfixes are the spoken gaps between syllables in compound words, e.g. “Bus_halte_stelle” [bus stop]. *** CEFR is the Common European Framework of Reference which has standardised proficiency levels from A1 (lowest), A2 and B1, B2 to C1, C2 (highest)

Furthermore, the Alpha Levels have had an influence on the development of instruments and tools for assessing adult literacy proficiency in Germany. The curriculum framework for literacy and adult basic education (DVV 2014), which contains guidelines for teaching and testing reading, writing and calculating in adult basic education, was developed following Alpha Levels 1–4 (ibid.).

The reading components approach and its implementation in PIAAC

Representing basic “building blocks” of reading, component reading tasks also examine foundational reading abilities, albeit not in a hierarchical order. Before the OECD added a reading components assessment module to the international assessment of PIAAC in 2012, Statistics Canada had already decided to implement a components approach in the Canadian part of the OECD’s ALL Survey in 2003.

Early Canadian and US-American national testing components

The reading components identified in Canada offered some additional information that differentiated among types of struggling readers. The advantage of a components approach was seen in its potential to offer insights into the different ways in which weak readers lag behind. Possible difficulties are insufficient vocabulary, difficulties with basic word decoding, inadequate strategies for dealing with new or complex texts, or general comprehension problems. Statistics Canada’s expectation was that these differentiations would provide useful information to programme providers and policymakers (Murray 2001). Table 1 shows the components and tests which were discussed and subsequently recommended as being suitable for a household survey investigating adults’ reading proficiency – in this case the Canadian ALL Survey, conducted in 2003.

Table 1 Table of recommended components and tests*

The Adult Reading Components Study (ARCS; Strucker and Davidson 2003) carried out in the United States by the National Center for the Study of Adult Learning and Literacy (NCSALL) served Statistics Canada as a model for clustering adult learners into groups of reading skills levels. John Strucker and Rosalind Davidson tested 955 randomly selected learners from adult basic education (ABE) and English for speakers of other languages (ESOL) classes to assess their phonological awareness, rapid naming, word recognition, oral reading, spelling, vocabulary and background knowledge. Using a cluster analysisFootnote 14 methodology, they discerned ten clusters of reading skills levels in their sample which they deemed relevant for effective teaching and learning (Strucker and Davidson 2003, p. 126).

Further components research was conducted jointly by John Strucker (NCSALL) as well as Kentaro Yamamoto and Irwin Kirsch from the Educational Testing Service (ETS), also in the United States. They took a sample of 1,034 adults and ran, among other things, a latent class analysis (LCA)Footnote 15 based upon participants’ scores on:

  • oral vocabulary (PPVT);

  • real word reading (TOWRE A);

  • pseudo-word reading (TOWRE B);

  • spelling; and

  • short-term memory (digit span).

The result was a distinction of five classes of readers:

  1. proficient ABE, adult secondary education (ASE), and household sample readers with very strong decoding and vocabulary skills;

  2. ABE and ASE students with strong decoding skills that tend to undermine their vocabulary skills;

  3. advanced ESOL students with strong decoding but noticeably weaker English vocabulary skills;

  4. intermediate ESOL students with moderate weaknesses in decoding and vocabulary skills in English; and

  5. low intermediate ESOL students and reading disabled ABE native speakers with marked needs in decoding and vocabulary (Strucker et al. 2007).

Further results of latent class analysis with component assessment data from the Canadian International Survey of Reading Skills (ISRS)Footnote 16 were published by the Canadian Council on Learning (Murray et al. 2008). The report distinguishes six groups (A1, A2, B1, B2, C and D) based on mother tongue, immigrant status and other key characteristics including age, gender, education and employment status (ibid.).

International components suitable for comparative analyses

The developers of the reading components assessment in PIAAC 2012 applied none of the above-named tests, because they needed instruments that would enable international comparison. Whereas the developers’ conceptual framework suggested five components, only three of these reading components made it into the final assessment set. Since languages vary in terms of their writing systems (alphabetic [e.g. English], syllabic [e.g. Japanese] or logographic [e.g. Chinese]), the PIAAC Reading Components test excluded tasks for alphanumeric perception and efficiency as well as tasks for word recognition and decoding (Sabatini 2015; Sabatini and Bruce 2009). Below, we explain the three remaining components and their task-sets.Footnote 17

Print vocabulary (word meaning). To ensure cross-country comparability, the language chosen for this component in PIAAC’s add-on module was the local language used in the respondents’ neighbourhood, in the market and in popular media. The print vocabulary tasks are based on the assumption that adults know the meaning of everyday words from pictures and from listening. The 34 print vocabulary tasks assess whether a person also knows their meaning from print. For this purpose, the respondent is given a four-item multiple choice list and asked to circle the correct word that represents the meaning of an image. Thus the print vocabulary task-set seeks to determine whether individuals can identify everyday words of their local language in print.

Sentence processing. To ensure this component’s cross-country comparability, the tasks in this set were created without varying the grammatical/syntactic complexity of the sentences. Variation was, however, taken into account in the length of sentences within a basic grammatical structure, and also in the logical relationships that comprise meaning. These variations were designed with increasing difficulty and therefore indicate the individual’s proficiency at constructing basic meaning from print (Sabatini 2015, pp. 7, 11). The 22 sentence-processing tasks ask an individual to judge “whether the sentence makes sense in relation to common knowledge about the world […] or based on the internal logic of the sentence” (ibid., p. 12). Therefore, a “yes” or “no” answer represents a 50 per cent guess probability. Thus the sentence processing tasks assess the individual’s proficiency in applying his or her language skills in the context of printed text.

Passage comprehension. The passage comprehension task-set measures fluent, efficient reading performance. The 44 passage comprehension tasks are embedded in four short basic text passages designed for adult readers. In each task, respondents are asked to choose between a word that correctly fits a sentence in a passage and a second option that a skilled reader would recognise as being obviously wrong. Although reading fluency and efficiency are usually assessed by giving participants only a fixed amount of time to do the task, PIAAC 2012 allowed them as much time as they needed. The total time each individual required to complete the task-set was recorded, and average reading rates were compared afterwards. The purpose of this was to prevent biases in cross-country comparison, because differences between languages, writing systems and cultural variables were expected to affect average reading rates (Sabatini and Bruce 2009, p. 13). In their conceptual framework, John Sabatini and Kelly Bruce explain that “the time to complete will add very little additional information” about the skills of “the very low-skilled beginning reader”, but low-ability adults with high accuracy scores on the passage comprehension tasks can be identified by this measurement, because they need more time to complete the tasks than the subsample of skilled readers in each country (Sabatini and Bruce 2009, p. 13).

Table 2 sums up the differences between the lower-rungs approach and the reading components approach and their development for PIAAC.

Table 2 Summary of differences between the lower-rungs approach and the reading components approach before and during PIAAC

Research questions

Having elaborated the differences between the lower-rungs approach and the reading components approach in the previous sections of this article, we now discuss both approaches with regard to their potential suitability for monitoring SDG target 4.6 globally, which then leads to our presentation of our own research.

To assess lower reading skills, PIAAC 2012 opted for a components approach rather than a lower-rungs approach. There are two possible reasons for this. First, the design of the survey suggests there was no plan to link the Reading Components to the continuous literacy scale. The Reading Components assessment was implemented as a new domain and as an optional element of the assessment in Round 1 (2011–2012) of PIAAC’s first cycle. Furthermore, it was provided in pencil-and-paper format, whereas the main assessment was designed in a computer-based format (Kirsch and Thorn 2013). This certainly limits the comparability of both measures.

Second, Sabatini and Bruce state that even in theory, the components “do not strictly develop hierarchically” during the acquisition of reading skills (Sabatini and Bruce 2009, p. 7). Therefore, it might be inadequate to treat them as having a clearly hierarchical order.

However, the published results of the PIAAC Reading Components assessment (OECD 2013) as well as the progression of the components (from words to sentences to text passages) could point to a hierarchy among the three different types of the assessed reading component tasks.

A hierarchy of difficulty?

The published average proportions of correctly answered reading component items show differences among the three dimensions (print vocabulary, sentence processing and passage comprehension). The highest average proportions of correct answers were reached for the print vocabulary items, whereas the lowest were reached for the sentence processing items. This pattern holds independently of the individual literacy level. Furthermore, it is not only true for the German data, but also for the OECD average (OECD 2013, pp. 416–418).

Table 3 shows the average proportion of correctly answered reading component items by literacy proficiency level for the German sample. From this table it is reasonable to assume that the print vocabulary items are the easiest, and the sentence processing items are more difficult than the passage comprehension items.

Table 3 Average proportion of reading component items answered correctly, in per cent, by PIAAC literacy proficiency level (German sample)

Also, Sabatini states for the US reading components:

One may have noticed that sentence and passage reading means were closely aligned across the higher levels of literacy proficiency, with passage means sometimes higher than sentence means toward the higher proficiency levels. This is because the most difficult sentence items are typically more difficult than any of the passage items. Thus, even adults who are relatively more proficient may still make errors on these challenging sentence items while likely finding all passage items relatively easy to answer (Sabatini 2015, p. 16).

Table 4 shows the average time spent completing a reading component item, in seconds, by PIAAC literacy proficiency level for the German sample. Here, too, print vocabulary emerges as the easiest dimension, because the average time spent on completing these tasks is the shortest at all literacy levels. However, responding to the passage comprehension items takes a little longer than answering the sentence processing items. In terms of time spent on completing the tasks, it is therefore reasonable to assume that the passage comprehension items are more difficult than the sentence processing items.

Table 4 Average time spent completing a reading component item, in seconds, by PIAAC literacy proficiency level (German sample)

This pattern is also the same for the OECD average across all participating countries (in Round 1 of PIAAC’s first cycle) for time spent completing the reading component items (OECD 2013, pp. 417–418).

Considering these results, the research questions (RQ) we decided to investigate in our own research, presented in this article, were:

RQ1 Is it possible to describe the PIAAC reading component items (in the German PIAAC questionnaire) hierarchically by their difficulty?

RQ2 Provided that it is possible, what kind of hierarchical relationship exists among the three components and across all items?

RQ3 If the Rasch model proves unsatisfactory, does a 2PL Birnbaum model fit the reading component data better?Footnote 18

Methodology

In addressing our research questions, we applied methods of item response theory (IRT) to the German sample of the PIAAC Reading Components data. IRT provides probabilistically combined results regarding respondents’ trait level (competences) and item properties (difficulties) based on the probability of a correct response to a test item (Embretson and Reise 2000).

The simplest item response model, the so-called Rasch model,Footnote 19 assumes that the probability of a specified response depends on two variables: the respondent’s trait level and the difficulty of the test item (Embretson and Reise 2000, pp. 48–51). If a respondent’s trait level exceeds the difficulty of the item, there is a high probability that this person will respond correctly to the item. If the difficulty of the item exceeds the respondent’s trait level, there is a high probability that this person will respond incorrectly. In other words, the more difficult an item is, the less likely it is that a person with a particular trait level will respond correctly to this item (Embretson and Reise 2000, p. 49).
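In its standard dichotomous form (see e.g. Embretson and Reise 2000), this relationship can be written as

P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)},

where \theta_i denotes the trait level of respondent i, b_j the difficulty of item j, and X_{ij} the scored (0/1) response to item j by respondent i.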

In our research, we focused in particular on the difficulties of the items in all three reading component sets, re-analysing them in terms of their hierarchical relationship. For this purpose, we chose the one-parameter logistic Rasch model, because it is particularly suitable for estimating and scaling test items on a common scale, ordered by their difficulties. If the model holds, the Rasch model has the property of specific objectivity. This means that differences in item difficulties can be stated independently of the sample’s skills distribution (Embretson and Reise 2000; Moosbrugger 2012, p. 49).

A necessary precondition of IRT analyses is the assumption of item homogeneity and local independence, meaning that all item responses depend on the same latent variable and that, given the model parameters, no further relationships exist in the data (Embretson and Reise 2000, p. 60). One important advantage of the Rasch model is that it provides appropriate and strict model fit criteria to evaluate item homogeneity and item quality.

We carried out the estimation of a one-dimensional Rasch model using ConQuest software. Sabatini states that the translation of reading component items across languages may result in different item level difficulty estimates (Sabatini 2015, p. 11). Therefore, the analysis we present here refers to the reading component data from only one country (Germany). Our input file contained the full response data of the German sample in the PIAAC Reading Components assessment based on the reduced version of the German PIAAC Scientific Use File (SUF; Rammstedt et al. 2015).

The Reading Components sample for Germany comprises 822 cases, whereas the whole German PIAAC sample comprises 5,465 cases. The subsample is therefore not representative of the German adult population. Furthermore, Claudia Tamassia et al. note in the OECD’s Technical Report of the Survey of Adult Skills (PIAAC) that the criterion for routing respondents into the paper-based reading components assessment was not only lower literacy and numeracy skills, but also a lack of experience in handling a computer.

[The] paper-based assessment was administered to respondents who either reported they had no computer experience; failed the test of basic computer skills required to take the assessment; or refused to take the assessment on the computer (Tamassia et al. 2013, p. 2).

As a consequence of this routing process, the German sample contains substantial proportions of respondents with higher literacy skills, who solved the reading components tasks with ease, as well as a substantial proportion of adults with lower literacy skills. Across the entire 23-country PIAAC sample, an above-average proportion of adults who took the Reading Components assessment – 31 per cent, compared to 15.5 per cent in the full sample – scored at or below Level 1 (Sabatini 2015, p. 9).

Compared with the German adult population as a whole, the German Reading Components sample therefore shows a sociodemographic bias, for example a higher mean age and a higher proportion of adults who speak German as a second language.Footnote 20

The dataset we analysed comprised responses for a total of 100 PIAAC reading component items. These were 34 print vocabulary items (numbered 1–34), 22 sentence processing items (numbered 35–56) and 44 passage comprehension items (numbered 57–100). For our IRT analysis, we recoded the response data into dichotomous (0/1: incorrect/correct) data. Missing values were treated as follows: in cases where questions had been skipped (refused or not done), we recoded missing values into incorrect responses; in cases where the whole reading components assessment was broken off, we recoded the first missing value into an incorrect response and left all further missing values as missing. Afterwards, we estimated, mapped and analysed the item parameters and evaluated the quality of the items. We checked the Rasch model fit using weighted mean squares (MNSQ). A perfect item fit in terms of mean squares would be 1.0 (Wu et al. 2007, p. 54). For this study, we chose MNSQ ≥ 1.33 as the criterion for a poor item fit (Wilson 2005, p. 129; Grotlüschen et al. 2012, p. 63). Furthermore, we illustrated and described the distribution of item difficulties by means of a Wright map (see results section).Footnote 21
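To make the recoding rules above concrete, the following minimal sketch shows one way they could be implemented in Python with pandas. The file name, variable names and response codes are hypothetical placeholders rather than the actual labels and codes of the German PIAAC Scientific Use File, and the Rasch estimation itself was carried out in ConQuest, not in this script.

    import pandas as pd

    # Hypothetical file and variable names -- the real SUF uses different labels.
    df = pd.read_csv("piaac_suf_germany_reading_components.csv")
    rc_items = [f"rc_{i:03d}" for i in range(1, 101)]   # items 1-100 in test order

    scored = pd.DataFrame(index=df.index)
    for col in rc_items:
        raw = df[col]                      # assumed codes: 1 = correct, 2 = incorrect,
        item = (raw == 1).astype(float)    # 7 = skipped (refused/not done), NaN = not reached
        item[raw.isna()] = float("nan")    # keep "not reached" as missing for the moment
        item[raw == 7] = 0.0               # skipped questions count as incorrect
        scored[col] = item

    # Break-off rule: the first not-reached item is scored as incorrect,
    # all subsequent not-reached items remain missing.
    for idx, row in scored.iterrows():
        not_reached = row.index[row.isna()]
        if len(not_reached) > 0:
            scored.loc[idx, not_reached[0]] = 0.0

    # The resulting 0/1 matrix is then exported for Rasch estimation in ConQuest;
    # items with weighted MNSQ >= 1.33 would be flagged as showing poor fit.
    scored.to_csv("reading_components_dichotomous.csv", index=False)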

Subsequently, we compared the results to the outcomes of a two-parameter logistic (2PL) Birnbaum model,Footnote 22 which allows item discriminations to vary. Since items differ in their discriminating power, trait level estimates depend on the specific patterns of success and failure in the item set. In contrast to the Rasch model, items do not have equal weight in estimating trait levels (Embretson and Reise 2000, p. 53). We estimated the 2PL model using Mplus 7 software.
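In its standard form, the 2PL model adds an item-specific discrimination parameter a_j to the Rasch equation shown earlier:

P(X_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{\exp\big(a_j(\theta_i - b_j)\big)}{1 + \exp\big(a_j(\theta_i - b_j)\big)}.

Setting all a_j = 1 recovers the Rasch model, which is why the two models can be compared directly as nested alternatives.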

Results

RQ1: Is it possible to describe the PIAAC reading component items (in the German PIAAC questionnaire) hierarchically by their difficulty?

As a main result of our analysis, we found that the applied Rasch model confirmed the possibility of representing the 100 reading component items on a hierarchical scale (i.e. the overall answer to our first research question seemed to be yes). The mean squares of most items (n = 92) met the model fit criterion (MNSQ < 1.33). Only eight items did not meet this criterion: one item (item 17) from the print vocabulary item set, and seven items (items 39, 40, 44, 45, 50, 51 and 56) from the sentence processing item set.Footnote 23 Their mean squares range from 1.33 (item 45) to 1.47 (item 40). On the one hand, these items are characterised by very low discriminations, which could mean that respondents with higher abilities are not more likely to solve them than respondents with lower abilities. On the other hand, the unsatisfactory item fits could also indicate that these items do not fit a one-dimensional construct of the kind we applied here (Rost 2004, p. 98; Kelava and Moosbrugger 2012, p. 86).

Seven out of the eight unsatisfactory items belong to the sentence processing item set, indicating that roughly one-third of the sentence processing scale either does not discriminate well, or might be testing something other than sentence processing. Respondents are asked to check the sentences in terms of whether they make sense; it is possible that people only check whether they are grammatically correct without deciding whether or not they are reasonable.

Nevertheless, our Rasch analysis of the Reading Components data resulted in a coefficient alphaFootnote 24 of 0.95. This indicates a high overall internal consistency, although alpha is not a measure of item homogeneity (Schermelleh-Engel and Werner 2012, p. 132).
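For reference, coefficient alpha is conventionally computed from the item variances \sigma_i^2 and the variance of the total score \sigma_t^2 as

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_t^2}\right),

with k denoting the number of items. Because it summarises item covariances rather than testing a unidimensional structure, a high value does not by itself establish item homogeneity.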

RQ2: Provided that a hierarchical description of the PIAAC reading component items (in the German questionnaire) is possible, what kind of hierarchical relationship exists among the three components and across all items?

With the overall answer to our first research question being yes, we then addressed our second research question. The Wright map in Fig. 4 shows the results of our analysis of the 100 Reading Components items in the German PIAAC sample when applying a one-parameter logistic Rasch model. This map of latent distributions and response model parameter estimates displays a joint hierarchical scale. The horizontal axis designates the number of cases/respondents; the vertical axis designates the level of difficulty. The scale is adjusted so that “zero” corresponds to the average competence of the sample; items that are more difficult than this average receive positive values and easier items receive negative values.

Fig. 4
figure 4

Map of latent distribution and response model parameter estimates for the German PIAAC 2012 sample. Note: The vertical axis designates level of difficulty

The left-hand panel shows a representation of the latent reading competencies distribution, and the right-hand panel indicates the difficulty of the test items. Each number represents one item and the items are plotted according to their difficulties. Here, the difficulties range from –6.86 to –2.14. Item 56 and item 89 have the highest item difficulties, so they are plotted at the top of the figure, while item 17 has the lowest item difficulty, so it is plotted at the bottom of the figure. According to Rasch’s model, a person with a latent ability estimate that corresponds to the level at which the item was plotted would have a 50 per cent chance of success on that item (Wu et al. 2007).
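This 50 per cent property follows directly from the Rasch equation given in the methodology section: for \theta_i = b_j the exponent is zero, so

P(X_{ij} = 1) = \frac{\exp(0)}{1 + \exp(0)} = 0.5.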

As expected, the item difficulties are located clearly below the average of the competence distribution. This means that the majority of the sample responding to the Reading Components add-on was able to solve most of the items correctly.

When comparing the item difficulties of the three components, it is evident in Fig. 4 that most of the print vocabulary items (numbers 1–34, shown in red) are relatively easy, as expected. Furthermore, it is noticeable that the sentence processing items (numbers 35–56, shown in yellow) and the passage comprehension items (numbers 57–100, shown in green) have higher difficulties, but overlap with each other in terms of difficulty. Therefore, the implicit assumption of a components-related hierarchy of the three scales cannot be confirmed. However, while print vocabulary, sentence processing and passage comprehension are not clearly ordered like lower rungs, the general trend is that words (print vocabulary) are easier than sentences (sentence processing), which in turn are easier than short texts (passage comprehension). Thus the test items do form a hierarchy. Even under the rather strict assumption of Rasch homogeneity, all but eight items meet the model fit requirements.

The print vocabulary items numbered 17, 8 and 10 have the lowest item difficulties, ranging from –6.86 to –6.51. It is worth noting that the correct answers for these three easier items are monosyllabic, which might explain their position on the Wright map.Footnote 25 Further up in the map, the most difficult print vocabulary items already mix with items from both the sentence processing and the passage comprehension components.

Within the sentence processing component, item 35 is the one with the lowest difficulty. This seems reasonable, because the sentence consists of one definite article, one subject and one verb in simple past form. Items 49 and 53 also have low difficulties, although their sentence structures are far more complex; item 49, for example, involves an embedded relative clause and is 14 words long. By contrast, items 36, 37 and 41 have higher item difficulties, although they are main clauses and their lengths range from four to eight words. This order of difficulties contradicts the theoretical description by Sabatini and Bruce, which states that the sentence processing items in the test booklet increase in difficulty (Sabatini and Bruce 2009, p. 11).

The difficulties of the passage comprehension items are concentrated in the range between –4.95 and –2.18. Each of the passage comprehension tasks requires respondents to choose between two words within a short text. According to the results of the Rasch analysis, items 85 and 75 have the lowest item difficulties, whereas item 89 is the most difficult one. The varying difficulties of the passage comprehension items could depend on the length and familiarity of the words, the abstractness of the word meaning, and how obviously correct they appear in the context of the text passage.

RQ3: If the Rasch model proves unsatisfactory, does a 2PL Birnbaum model fit the reading component data better?

According to Kentaro Yamamoto et al., for PIAAC, “a common set of item parameter estimates of the two-parameter logistic (2PL) model and the general partial credit model (GPCM)Footnote 26 was estimated and found to fit quite well to all countries” (Yamamoto et al. 2013, p. 16) – that is, not a simple Rasch model. Indeed, the Rasch model assumption of homogeneous item discrimination is often unrealistic and artificial. More sophisticated models can cope with inhomogeneous discrimination.

As already mentioned, we found that eight reading component items showed poor Rasch model fit, that is, they did not discriminate in the same way as the others or they did not test the same latent variable. For these reasons we estimated a two-parameter logistic (2PL) Birnbaum model in order to check the item difficulties while taking different discrimination parameters into account. The estimated item discriminations ranged from 0.73 to 11.91.

All in all, we found that the two-parameter logistic Birnbaum model fits the Reading Components data better than the one-parameter logistic Rasch model. In comparison (see Table 5), the 2PL model shows lower values for the Akaike and Bayesian information criteria (AIC and BIC)Footnote 27 as well as for the sample-size adjusted BIC, and should therefore be preferred (de Ayala 2009, pp. 141–142).
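For reference, the information criteria reported in Table 5 are conventionally defined in terms of the maximised log-likelihood log L, the number of estimated parameters k and the sample size n:

\mathrm{AIC} = -2\log L + 2k, \qquad \mathrm{BIC} = -2\log L + k\log n,

while the sample-size adjusted BIC replaces n with (n + 2)/24. Lower values indicate a better trade-off between fit and model complexity, which is why the 2PL model is preferred here.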

Table 5 Model fit of 1PL and 2PL (both calculated using Mplus software)

Discussion

Component items do also function as hierarchical test items and therefore meet GAML requirements

To sum up, we found that our first research question (Is it possible to describe the PIAAC reading component items [in the German PIAAC questionnaire] hierarchically by their difficulty?) can be answered positively. Two different approaches (applying the Rasch model and the Birnbaum model) show that the components approach at least partly contains hierarchical item difficulties.

Our second research question investigated the kind of hierarchical relationships existing among the three components and across all items. We found that while the print vocabulary scale is easier than the other two, the latter have internal hierarchies but overlap with each other in terms of difficulty. Our first method, which applied the Rasch model, showed unsatisfactory item fits for 8 out of 100 items, with 7 of them belonging to the 22-item sentence processing subscale.

Our third research question investigated whether a two-parameter logistic model would lead to better fit values. The results indicate that the model fit was indeed better and that the reading components approach as used in PIAAC can also be interpreted as a hierarchical scale modelling a latent variable that could be called “reading”.

Our findings indicate an overall hierarchy of the Reading Component items, although two of the dimensions, namely sentence processing and passage comprehension, cannot be clearly separated in terms of increasing difficulty. The PIAAC assessment tasks do not selectively test comprehension of a single sentence as distinct from comprehension of a multi-sentence text passage. This is certainly a consequence of choosing reading tasks for the assessment that are less language-specific in order to improve international comparability.

Moreover, our findings demonstrate that the Reading Components test set used in PIAAC 2012 also works as a hierarchy which would indeed be linkable to an international literacy scale. The test items are available in many languages, which already enables the use of component items in a wide range of countries. Many of the subsets have been applied under PIAAC, STEP, LAMP or even IALS, spanning several supra-national organisations and thus indicating that the items are widely accepted (which would probably be more difficult if the items were owned solely by the OECD or ETS).

One conclusion of our research therefore is that it is technically possible to use the full set of PIAAC reading component test items to meet the requirements of the Global Alliance to Monitor Learning (GAML) initiative’s efforts to address all ten targets of SDG 4. Participating UN Member States can add the tests to national micro-censuses or similar surveys. Findings can be displayed in a hierarchy that is comparable across countries because of its linkability to an anchor literacy scale (e.g. PIAAC).

Tests that were developed under the Reading Components scheme become disconnected from their origins when they are made internationally comparable

To break down the reading proficiency within the lowest literacy level (“below PIAAC Level 1”) into more differentiated categories, a lower-rungs approach was developed in Europe (in the UK and Germany) and a reading components approach was developed in the United States and Canada. Both have advantages and disadvantages. The most recent and widespread version of the Reading Components is the one used in PIAAC and STEP. It differs from earlier versions, because it was adjusted for the purpose of being applicable in different countries, settings, languages and scripts. While these adjustments and test development efforts polished the test (Sabatini and Bruce 2009), one unavoidable side effect was the blurring of some of the clear differences which had been discernible among earlier components (Strucker et al. 2007).

Earlier component versions differed much more from each other and were more closely linked to different aspects of reading. One aspect, for example, was the strategy of letter-by-letter decoding of unknown words, mostly tested by using nonsense words (e.g. in the TOWRE test). Another aspect was the existence of lexical memoryFootnote 28 entries, which are drawn on in a lexical strategy of reading where fast word recognition is required; this can be tested with word recognition tests (TOWRE, PPVT). These two aspects can be interpreted using Coltheart’s dual-route theory of reading (Coltheart et al. 2001),Footnote 29 and they show up in readers with different kinds of dyslexia, requiring different treatments. Both are different from tests on language and vocabulary or tests on grammar, which indicate low language proficiencies – and thus call for language lessons rather than efforts to improve learners’ decoding or memorising skills. Another aspect has been the testing of short-term memory, attention or concentration. Many foreign-born readers may have excellent short-term memories, while locally born struggling readers may not because of generally low cognitive skills; the latter may indicate learning disabilities, and such readers may also need psychological treatment. Earlier reading components approaches also tested listening and sound discrimination skills, phonemics or phonemic awareness.Footnote 30 In cases of low test results, training would focus on syllables and rhymes, precise pronunciation and listening skills. Less important for reading, but a good indicator of literacy proficiency, are spelling skills, which require a good command of writing skills as well. Overall, the earlier versions of reading components provided in-depth knowledge about adequate pedagogical treatment. The problem is that these tests do not work for comparative studies of surveys conducted in different languages and writing systems. Most of the nationally developed reading components correlate very closely with the phonemic characteristics of particular languages and their written equivalents (Sabatini and Bruce 2009). For these reasons, it is rather difficult to develop test items that retain a close relation to the theoretical explanations while remaining internationally comparable.

In sum, useful information from earlier component versions (covering the dual-route theory of reading, short-term memory or learning disabilities; language, grammar and vocabulary; phonemic awareness, grapheme–morpheme correspondence or spelling) has been lost in the effort to make items internationally comparable. Thus, as we assumed before embarking on our research, an internationally comparative approach at this level indeed proves to be extremely difficult and, where it does work, loses its components character, shifting slightly towards a lower-rungs approach.

While the components thus lose their strong connection to their original theoretical background, the lower rungs have in recent years been described in ever more detail, providing rich didactical insights and knowledge. The can-do descriptions are a good example of this richer theoretical grounding (see also Durda et al., in this issue).

Limitations: custody of an international literacy scale – who owns it?

There is no such thing as the one and only common literacy scale, even though the items used in the PIAAC add-on module have proven to test literacy in a hierarchical order. Further research with open and large datasets would be necessary to link them to the overall PIAAC scale or any other international literacy scale. For the moment, the OECD holds custody of its PIAAC scale, and UNESCO’s LAMP component datasets are not large enough to run the necessary analyses. The dilemma remains the same. GAML has to avoid implementing a single scale and definition with a single test in possibly all UN Member States, because researchers claim that this would lead to a monopoly (Addey 2018) and re-colonisation of the so-called Global South (Grotlüschen 2018). The current solution (UIL 2019) is to propose two reporting levels according to Member States’ income category.

Moreover, the tests in the PIAAC add-on module were developed for industrialised countries. They still have a blind spot in the range between virtually no reading skills and the easiest test item. This range may be highly relevant for low-income countries.

Recommendation: re-analyse LAMP and STEP items and refine the theoretical approach to assessing reading proficiency in an internationally comparable manner

At this point it seems necessary to re-run analyses of reading components data from several other surveys like LAMP and STEP in order to find out whether they deliver similar hierarchies and, if not, whether eliminating some items might improve the scales. Another necessity is to discuss a common anchoring scale. This would enable countries to develop further and perhaps even easier test items, co-run them in their national surveys and link them to the existing set.

More theoretical work is needed for the development and interpretation of tests at the very lowest levels of literacy (see Durda et al., in this issue). Lower rungs can be described according to the proficiency the items require or according to can-do descriptions, e.g. the Alpha Levels, with 7–10 can-do descriptions at each level for both reading and writing. This provides detailed knowledge via the descriptions of the lower rungs. For surveys to be run in Germany, an adult education curriculum with formative assessment tools has already been developed based on the lower-rungs level descriptions.

Hence, to improve learning outcomes within the GAML initiative, instead of trying to find a language-independent set of test items, it would be appropriate to reconsider the advantages of a lower-rungs approach for the international assessment of reading skills, either to supplement the components approach or to leave the language-related area of “below Level 1” research to UN Member States.