With appropriate treatment, the PIAAC log file data (OECD 2017a, b, c, d, e, f, g, h, j, k, l, m, n, o, p, q, r) allow for the creation of a large number of informative indicators. Three generic process indicators derived from log file data are already included in the PIAAC Public Use File at the level of items—namely, total time on task, time to first action, and the number of interactions. This section provides a brief overview of the various research directions in which PIAAC log file data have been used so far. These studies include research on all three PIAAC domains, selected domains, and even specific items. They refer both to the data collected in the PIAAC main study and to the data collected in the field test, which was used to assemble the final instruments and to refine the operating procedures of the PIAAC main study (Kirsch et al. 2016). So far, PIAAC log files have been used to provide insights into the valid interpretation of test scores (e.g. Engelhardt and Goldhammer 2019; Goldhammer et al. 2014), test-taking engagement (e.g. Goldhammer et al. 2017a), dealing with missing responses (Weeks et al. 2016), and suspected data fabrication (Yamamoto and Lennon 2018). Other studies have concentrated on the highly interactive tasks in the domain of PS-TRE (He et al. 2019; He and von Davier 2015, 2016; Liao et al. 2019; Naumann et al. 2014; Stelter et al. 2015; Tóth et al. 2017; Vörös and Rouet 2016) and contributed to a better understanding of the adult competencies in operation.
The studies reviewed (Table 10.1) demonstrate how PIAAC log file data can contribute to describing the competencies of adults and the quality of test-taking. However, as Maddox et al. (2018; see also Goldhammer and Zehner 2017) cautioned with regard to the capturing of log events, inferences about cognitive processes remain limited, and process indicators must be interpreted carefully.
Table 10.1 Overview of studies analysing log files from the PIAAC database
Studies of Time Components Across Competence Domains
Processing times reflect the duration of cognitive processing when performing a task. Provided that information about the time allocation of individuals is available, several time-based indicators can be defined, such as the time until respondents first interact with a task or the time between the respondents’ last action and their final response submission (OECD 2019). Previous research has often focused on ‘time on task’—that is, the overall time that a respondent spent on an item. For example, analysis of the PIAAC log file data showed considerable variation in time on task in literacy and numeracy across countries, age groups, and levels of education, but comparatively little variability across competence domains and between genders (OECD 2019).
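To illustrate how such time-based indicators can be derived, the following minimal Python sketch computes time on task, time to first action, and the time between the last action and the final response submission from a single item’s event stream. The event names and record layout are illustrative assumptions, not the actual PIAAC log file schema.

```python
# A minimal sketch of deriving time-based indicators from one item's log events;
# event names and the record layout are assumptions, not the PIAAC schema.
from dataclasses import dataclass

@dataclass
class LogEvent:
    timestamp_ms: int   # time since item start in milliseconds
    event_type: str     # e.g. 'START_ITEM', 'CLICK', 'KEYPRESS', 'END_ITEM'

def time_indicators(events):
    """Return time on task, time to first action, and time from the last
    action to the final submission (all in seconds) for one respondent-item."""
    start = next(e for e in events if e.event_type == "START_ITEM")
    end = next(e for e in events if e.event_type == "END_ITEM")
    actions = [e for e in events
               if e.event_type not in ("START_ITEM", "END_ITEM")]
    time_on_task = (end.timestamp_ms - start.timestamp_ms) / 1000
    time_to_first = ((actions[0].timestamp_ms - start.timestamp_ms) / 1000
                     if actions else time_on_task)
    time_since_last = ((end.timestamp_ms - actions[-1].timestamp_ms) / 1000
                       if actions else 0.0)
    return time_on_task, time_to_first, time_since_last

events = [LogEvent(0, "START_ITEM"), LogEvent(4200, "CLICK"),
          LogEvent(9100, "KEYPRESS"), LogEvent(15500, "END_ITEM")]
print(time_indicators(events))  # (15.5, 4.2, 6.4)
```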
The (average) effect of time on task on a respondent’s probability of task success is often referred to as the ‘time on task effect’. Using a mixed-effects modelling approach, Goldhammer et al. (2014; see also Goldhammer et al. 2017a) found an overall positive relationship for the domain of problem solving, but a negative overall relationship for the domain of reading literacy. Based on theories of dual processing, this inverse pattern was explained in terms of the different cognitive processes required: while problem solving requires rather controlled processing of information, reading literacy relies on component skills that are highly automatised in skilled readers. The strength and direction of the time on task effect still varied according to individual skill level and task characteristics, such as task difficulty and the type of task considered. Following this line of reasoning, Engelhardt and Goldhammer (2019) used a latent variable modelling approach to provide validity evidence for the construct interpretation of PIAAC literacy scores. They identified a latent speed factor based on the log-transformed time on task and demonstrated that the effect of reading speed on reading literacy becomes more positive for readers with highly automatised word meaning activation skills, while—as hypothesised—no such positive interaction was revealed for perceptual speed.
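As a simplified illustration of estimating a time on task effect, the following sketch fits an ordinary per-item logistic regression of response correctness on the log-transformed time on task. This is not the mixed-effects specification used by Goldhammer et al. (2014); the file and column names are assumptions.

```python
# Simplified 'time on task effect': per item, regress the probability of a
# correct response on log-transformed time on task (plain logistic regression,
# not the full mixed-effects model). Column names are assumed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# df has one row per respondent-item: 'item', 'correct' (0/1), 'time_on_task' (s)
df = pd.read_csv("piaac_item_level.csv")      # hypothetical file
df["log_time"] = np.log(df["time_on_task"])

for item, d in df.groupby("item"):
    fit = smf.logit("correct ~ log_time", data=d).fit(disp=False)
    print(item, round(fit.params["log_time"], 3))  # sign and size of the effect
```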
Timing data are commonly used to derive indicators of disengagement (e.g. rapid guessing, rapid omissions) reflecting whether or not respondents have devoted sufficient effort to completing assigned tasks (Wise and Gao 2017). Several methods have been proposed that rely on response time thresholds, such as fixed thresholds (e.g. 3000 or 5000 ms) and visual inspection of the item-level response time distribution (for a brief description, see Goldhammer et al. 2016). The P+ > 0% method (Goldhammer et al. 2016, 2017a) and the T-disengagement indicator (OECD 2019) determine item-specific thresholds below which respondents are not assumed to have made a serious attempt to solve an item. P+ > 0% combines the response times with a probability level higher than that of a randomly correct response (in the case of the PIAAC items, the chance level was assumed to be zero since most of the response formats allowed for a variety of different responses). The T-disengagement indicator further restricts this definition by implementing an additional 5-second boundary and treating all responses below this boundary as disengaged. The main results of these studies (Goldhammer et al. 2016, 2017a; OECD 2019) revealed that, although PIAAC is a low-stakes assessment, the proportions of disengagement across countries were comparatively low and consistent across domains. Nevertheless, disengagement rates differed significantly across countries, and the absolute level of disengagement was highest for the domain of problem solving. Other factors associated with disengagement included the respondents’ level of education, the language in which the test was taken, respondents’ level of proficiency, and their familiarity with ICT, as well as task characteristics, such as the difficulty and position of a task, which indicated a reduction in test-taking effort on more difficult tasks and tasks administered later in the assessment.
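The following sketch illustrates the general logic of an item-specific disengagement threshold in the spirit of the P+ > 0% rule (with the chance level set to zero), combined with the additional 5-second boundary of the T-disengagement indicator. The scanning grid, file, and column names are assumptions for illustration, not the published procedure.

```python
# Rough sketch of an item-specific disengagement threshold: find the largest
# time t such that no response faster than t was correct (P+ > 0% with chance
# level zero), and never let the threshold fall below 5 seconds.
import numpy as np
import pandas as pd

def disengagement_threshold(times, correct, grid_step=0.5, max_t=30.0):
    """times: response times in seconds; correct: 0/1 scores for one item."""
    threshold = 0.0
    for t in np.arange(grid_step, max_t, grid_step):
        fast = times < t
        if fast.any() and correct[fast].mean() > 0:   # P+ exceeds chance (0)
            break
        threshold = t
    return max(threshold, 5.0)      # additional 5-second boundary

df = pd.read_csv("piaac_item_level.csv")              # hypothetical file
for item, d in df.groupby("item"):
    thr = disengagement_threshold(d["time_on_task"].to_numpy(),
                                  d["correct"].to_numpy())
    disengaged = d["time_on_task"] < thr
    print(item, thr, round(disengaged.mean(), 3))     # disengagement rate
```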
Similar to the issue of respondents’ test engagement, time on task can be used to determine how to treat missing responses, which may occur for various reasons, such as low ability, low motivation, or lack of time. In particular, omitted responses are an issue for the appropriate scaling of a test, because improperly treating omits as accidentally missing or as incorrect could result in imprecise or biased estimates (Weeks et al. 2016). In the PIAAC main study (OECD 2013b), a missing response with no interaction and a response time under 5 s is treated as if the respondent did not see the item (‘not reached/not attempted’). Timing information can help to determine whether this cut-off criterion is suitable and reflective of respondents having had enough time to respond to the item. Weeks et al. (2016) investigated the time on task of PIAAC respondents in the literacy and numeracy assessments to determine whether omitted responses should be treated as not administered or as incorrect. Based on descriptive results and model-based analyses comparing response times of incorrect and omitted responses, they concluded that the commonly used 5-second rule is suitable for identifying rapidly given responses, whereas it would be too strict for assigning incorrect responses.
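The 5-second rule described above can be expressed as a simple classification function; the following sketch is only a schematic rendering of the published rule, with illustrative field names.

```python
# Schematic version of the 'not reached/not attempted' rule: a missing response
# with no interactions and a response time under 5 seconds is treated as not
# administered; other missing responses are treated as omitted.
def classify_missing(response, n_actions, time_on_task_s):
    if response is not None:
        return "observed"
    if n_actions == 0 and time_on_task_s < 5.0:
        return "not_reached"        # treated as if the item was never seen
    return "omitted"                # candidate for scoring as incorrect

print(classify_missing(None, 0, 3.2))   # not_reached
print(classify_missing(None, 4, 41.0))  # omitted
```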
Time information has also been used to detect data falsification, which can massively affect the comparability of results. Taking into account various aspects of time information, ranging from time on task to keystroke timing, Yamamoto and Lennon (2018) argued that obtaining an identical series of responses is highly unlikely, especially considering PIAAC’s adaptive multistage design. They described the cases of two countries that had attracted attention because a large number of respondents had been interviewed by only a few interviewers. In these countries, the authors identified cases in which the processing of single cognitive modules was identical down to the time information; some entire cases were even duplicated. Other results showed systematic omissions of cognitive modules with short response times. Consequently, suspicious cases (or parts of them) were dropped or treated as not administered in the corresponding countries.
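A very rough illustration of screening for such duplications is to fingerprint each respondent’s module by its exact sequence of responses and timings and to report identical fingerprints. The data layout below is assumed, and the actual analyses by Yamamoto and Lennon (2018) were considerably more elaborate.

```python
# Illustrative check for suspiciously identical cases: group respondents by the
# exact sequence of responses and item-level timings within a module.
from collections import defaultdict
import pandas as pd

# hypothetical columns: respondent, item, response, time_on_task
df = pd.read_csv("module_responses.csv")

fingerprints = defaultdict(list)
for respondent, d in df.sort_values("item").groupby("respondent"):
    key = tuple(zip(d["item"], d["response"], d["time_on_task"].round(1)))
    fingerprints[key].append(respondent)

for key, members in fingerprints.items():
    if len(members) > 1:
        print("identical module data:", members)
```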
Studies of the Domain of PS-TRE
The PIAAC domain of PS-TRE measures adult proficiency in dealing with problems related to the use of information and communication technologies (OECD 2012). Such problems can range from searching the web for suitable information to organising folder structures in digital environments. Accordingly, the PS-TRE tasks portray nonroutine settings requiring effective use of digital resources and the identification of necessary steps to access and process information. Within the PS-TRE tasks, cognitive processes of individuals and related sequences of states can be mapped onto explicit behavioural actions recorded during the problem-solving process. Clicks showing, for example, that a particular link or email has been accessed provide an indication of how and what information a person has collected. By contrast, other cognitive processes, such as evaluating the content of information, are more difficult to clearly associate with recorded events in log files.
Previous research in the domain of PS-TRE has analysed the relationship between problem-solving success and the way in which individuals interacted with the digital environment. These studies have drawn on a large number of methods and indicators for process analysis, ranging from single indicators (e.g. Tóth et al. 2017) to entire action sequences (e.g. He and von Davier 2015, 2016).
A comparatively simple indicator that has a high predictive value for PS-TRE is the number of interactions with the digital environment during the problem-solving process. Supporting the assumption that skilled problem solvers engage in trial-and-error and exploration strategies, this action count positively predicted success on the PS-TRE tasks in the German and Canadian PIAAC field test data (Naumann et al. 2014; see also Goldhammer et al. 2017b) and in the 16 countries of the PIAAC main study (Vörös and Rouet 2016). Naumann et al. (2014) even found that the association followed an inverted U shape and was moderated by the number of steps required by a task. Taking into account the time spent on PS-TRE tasks, Vörös and Rouet (2016) further showed that the overall positive relationship between the number of interactions and success on the PS-TRE tasks was constant across tasks, while the effect of time on task increased as a function of task difficulty. They also revealed different time–action patterns depending on task difficulty. Respondents who successfully completed an easy task were more likely to show either a low action count with a high time on task or a high action count with a low time on task. In contrast, the more time respondents spent on the task, and the more they interacted with it, the more likely they were to solve medium and hard tasks. Although both Naumann et al. (2014) and Vörös and Rouet (2016) investigated the respondents’ interactions within the technology-rich environments, they used different operationalisations—namely, a log-transformed interaction count and a percentile grouping variable of low, medium, and high interaction counts, respectively. Nevertheless, they obtained similar and even complementary results, indicating that the interpretation of interactions during the process of problem solving is more complex than a simple more-is-better explanation, while still providing valuable information on solution behaviours and strategies.
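As an illustration of such an operationalisation, the following sketch tests for an inverted U-shaped relationship by regressing task success on a log-transformed action count and its square. This mirrors the general idea rather than the exact models of Naumann et al. (2014) or Vörös and Rouet (2016); the file and column names are assumptions.

```python
# Sketch of an inverted-U test: logistic regression of task success on the
# log-transformed action count and its quadratic term. Column names assumed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pstre_item_level.csv")       # hypothetical file
df["log_actions"] = np.log1p(df["n_actions"])  # log-transformed action count

fit = smf.logit("correct ~ log_actions + I(log_actions ** 2)",
                data=df).fit(disp=False)
print(fit.params)   # a negative quadratic coefficient suggests an inverted U
```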
Process indicators can also combine different kinds of process information. Stelter et al. (2015; see also Goldhammer et al. 2017b) investigated a log file indicator that combined the execution of particular steps in the PS-TRE tasks with time information. Assuming that a release of cognitive resources benefits the problem-solving process, they identified routine steps in six PS-TRE tasks (using a bookmark tool, moving an email, and closing a dialog box) and measured the time respondents needed to perform these steps by determining the time interval between the events that started and ended the sequences of interest (e.g. opening and closing the bookmark tool; see Sect. 10.4.2). By means of logistic regressions at the task level, they showed that the probability of success on the PS-TRE tasks tended to increase as the time spent on routine steps decreased, indicating that highly automated, routine processing supports the problem-solving process.
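The underlying indicator can be illustrated with a small sketch that measures the interval between an event opening a routine subsequence and the event closing it; the event names are invented for the example.

```python
# Sketch of the 'time on routine steps' indicator: the interval between an
# event that opens a routine subsequence and the event that closes it.
def routine_step_times(events, start_event, end_event):
    """events: list of (timestamp_ms, event_type) in chronological order.
    Returns the durations (in seconds) of all start-to-end episodes."""
    durations, opened_at = [], None
    for ts, ev in events:
        if ev == start_event:
            opened_at = ts
        elif ev == end_event and opened_at is not None:
            durations.append((ts - opened_at) / 1000)
            opened_at = None
    return durations

log = [(1000, "OPEN_BOOKMARK_TOOL"), (5400, "CLOSE_BOOKMARK_TOOL"),
       (9000, "OPEN_BOOKMARK_TOOL"), (11600, "CLOSE_BOOKMARK_TOOL")]
print(routine_step_times(log, "OPEN_BOOKMARK_TOOL",
                         "CLOSE_BOOKMARK_TOOL"))  # [4.4, 2.6]
```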
While the number of interactions and the time spent on routine steps are generic indicators applicable to several different tasks, indicators can also be highly task-specific. Tóth et al. (2017) classified the problem-solving behaviour of respondents in the German PIAAC field test using the data mining technique of decision trees. In the ‘Job Search’ item, which was included in the PIAAC field test and now serves as a released sample task of the PS-TRE domain, respondents were asked to bookmark, within a search engine environment, the websites of job search portals that required neither registration nor a fee. The best predictors included as decision nodes were the number of different websites visited (top node of the tree) and the number of bookmarked websites. Respondents who visited eight or more different websites and bookmarked exactly two websites had the highest chance of giving a correct response. Using this simple model, 96.7% of the respondents were correctly classified.
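A minimal sketch of such a decision-tree classification, using two illustrative features and assumed data, might look as follows; the tree reported by Tóth et al. (2017) was of course estimated on the actual German field-test log files.

```python
# Minimal decision-tree sketch with two illustrative features predicting a
# correct response on the 'Job Search' item. File and column names assumed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("job_search_features.csv")   # hypothetical feature table
X = df[["n_different_pages", "n_bookmarks"]]
y = df["correct"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
print("training accuracy:", tree.score(X, y))
```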
Other important contributions to analysing response behaviour in the domain of PS-TRE were made by adopting exploratory approaches from the field of text mining (He and von Davier 2015, 2016; Liao et al. 2019). He and von Davier (2015, 2016) detected and analysed robust n-grams—that is, sequences of n adjacent actions that were performed during the problem-solving process and carry a high information value (e.g. the sequence [viewed_email_1, viewed_email_2, viewed_email_1] may represent a trigram of states suggesting that a respondent revisited the first email after having seen the second one). They compared the frequencies of certain n-grams between persons who could solve a particular PS-TRE task and those who could not, as well as across three countries, to determine which sequences were most common in these subgroups. The results were quite consistent across countries and showed that the high-performing group more often utilised search and sort tools and showed a clearer understanding of sub-goals than the low-performing group.
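The extraction of n-grams from action sequences and a simple frequency comparison between groups can be sketched as follows; the action labels are invented for the example, and He and von Davier (2015, 2016) additionally applied more elaborate selection and weighting procedures borrowed from text mining.

```python
# Sketch: extract n-grams from action sequences and compare their frequencies
# between a high- and a low-performing group. Action labels are invented.
from collections import Counter

def ngrams(sequence, n):
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

high_group = [["open_email_1", "use_sort_tool", "move_email", "submit"],
              ["use_search", "open_email_2", "move_email", "submit"]]
low_group = [["open_email_1", "open_email_2", "open_email_1", "submit"]]

high_counts = Counter(g for seq in high_group for g in ngrams(seq, 2))
low_counts = Counter(g for seq in low_group for g in ngrams(seq, 2))
print(high_counts.most_common(3))
print(low_counts.most_common(3))
```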
Similarly, Liao et al. (2019) detected typical action sequences for subgroups defined by background variables, such as monthly earnings (first vs. fourth quartile), level of educational attainment, age, test language, and skill use at work. They examined the action sequences generated within a task in which respondents were required to organise several meeting room requests using different digital environments, including the web, a word processor, and an email interface. The findings of Liao et al. show not only which particular action sequences were most prominent in groups with different levels of the background variables but also that the same log event might suggest different psychological interpretations depending on the subgroup. Transitions between the different digital environments, for example, may indicate undirected behaviour if they are the predominant feature, but they may also reflect steps necessary to accomplish the task if accompanied by a variety of other features. However, although such detailed item analyses can provide deep insights into respondents’ processing, their results can hardly be generalised to other problem-solving tasks, as they are highly dependent on the analysed context.
Extending this research direction to a general perspective across multiple items, He et al. (2019) applied another method rooted in natural language processing and biostatistics, comparing entire action sequences of respondents with the optimal (sometimes multiple) solution paths of items. In doing so, they determined the longest common subsequence that the respondents’ action sequences shared with the optimal paths. He et al. were thus able to derive measures of how similar the paths of the respondents were to the optimal sequence and how consistent they were across items. They found that most respondents in the countries investigated showed overall consistent behaviour patterns. More consistent patterns were observed in the particularly high- and the particularly low-performing groups. A comparison of similarity across countries by item also showed the potential of the method for explaining why items might function differently across countries (differential item functioning, DIF; Holland and Wainer 1993), for instance, when an item is more difficult in one country than in the others.
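The longest common subsequence itself can be computed with a standard dynamic-programming routine. The following sketch normalises its length by the length of a hypothetical optimal path as one possible similarity measure; this is an illustration of the underlying technique, not necessarily the exact scoring used by He et al. (2019).

```python
# Standard dynamic-programming longest common subsequence, applied to an
# observed action sequence and a hypothetical optimal solution path.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x == y
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

optimal = ["open_web", "search", "open_result", "bookmark", "submit"]
observed = ["open_web", "open_help", "search", "open_result", "submit"]
print(lcs_length(observed, optimal) / len(optimal))   # similarity, here 0.8
```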