Background

The assessment of vocational competencies is a major challenge in the area of vocational education and training (VET), as appropriate instruments are still relatively scarce (Baethge and Arends 2009; Liedtke and Seeber 2015). In particular, there is a need for instruments to assess the content-specific competencies and skills needed for working life, in order to clarify whether vocational education prepares trainees adequately. This contribution aims: (1) to give an outline of such an instrument, one that has been designed to evaluate vocational competencies in the field of economics, and (2) to present the results of an empirical comparison of two possible test modes: computer-based assessment (CBA) versus paper-based assessment (PBA).

From an instructional perspective, new technologies play an important role in VET, and in different ways. Firstly, they shift the educational content more closely toward the actual competencies needed in the workplace (Ludwig-Mayerhofer et al. 2011; Mayer and Solga 2008). For example, under the label “Ausbildung 4.0” (by which, in imitation of “Industry 4.0,” we mean “VET 4.0”; Frey and Osborne 2013), Zinn (2015) identified digitization, industrial automation, and interconnectedness as central themes for a future reorganization of learning processes in VET. The rationale behind this reorganization is that changes to the requirements of professional life must have a direct impact on the nature of the essential workplace competencies that have to be assessed. Given the background assumption that VET should primarily enable young adults or trainees to participate in the workplace and should promote their vocational capacities, these changes in the field of learning require us to review the test modes and formats that we have deemed adequate for assessment.

As observed by Pellegrino et al. (2003; see also Wilson 2005), the triadic elements of assessment, curriculum, and instruction all have to be integrally connected. New or changed learning content and learning forms must also be reflected in the assessment, and vice versa. Additionally, the results of recent studies on final examinations point to the need to redesign assessments: Herzog (2013), for example, speaks about the “gauging of education” and criticizes trends in schools whereby teachers align their lessons too strongly with external tests, notably with respect to content, task dimensions, and the methodological structure of the lesson. A test that is closely geared to real professional situations, in respect of content and type of work, could diminish the theory–practice gap: “teaching to the test” would, at the same time, be regarded as “teaching to the job.”

Therefore, and secondly, new technologies offer new forms of, and mechanisms for, learning and competence measurement, given that it is necessary to “remove some of the constraints that have limited assessment practice in the past” (Pellegrino et al. 2003, p. 9). The computer is, particularly in the commercial sector, an important and authentic work tool in professional life; thus, it seems helpful to administer computer-based tasks, in order to strengthen their authenticity by working with a relevant tool. CBA offers the possibility of new (item) designs, as well as the integration of multimedia elements, such as video or audio sequences (Goldhammer et al. 2014) or interactive tools. These advantages have led to the assumption that tasks can be represented more authentically by CBA (Katz 2007; Huff and Sireci 2001) and that the capturing of strategic knowledge, as defined by more complex tasks, is made possible by this format (Quellmalz and Kozma 2003).

Thus, while CBA offers many opportunities and advantages for competence measurement, current assessment practice in Germany is dominated by the use of PBA (Gruttmann and Usener 2011), and there is a lack of evidence about the possible effects of putatively authentic assessment (Gulikers et al. 2005). Since the 1970s, several studies have made comparisons between CBA and PBA (Russell et al. 2003), but these have had very different designs, and the results are as multifarious as the studies themselves, notably with respect to the comparison criteria and their apparent impact on performance measurements (Bennett 2001; Pellegrino and Quellmalz 2010; Kingston 2009; Frahm 2012). Further, technological trends and computer-use behaviors are changing rapidly, so that the results of earlier studies can quickly become obsolete (Russell et al. 2003). An overview of comparability studies can be found in Wang and Shin (2009), and in a review of literature from the Texas Education Agency (2008).

A special characteristic of this study is that the CBA component of the assessment was designed, first and foremost, not to verify equivalent transferability (concurrent validity) from CBA to PBA, but to explicitly consider and utilize the opportunities and possibilities of a CBA by creating an authentic (workplace) environment. The PBA was adapted as closely as possible to the CBA by reprinting screenshots. The assessment was designed to measure competence in the world of work and training in an action-oriented and authentic way, which means: “requiring students to use the same competencies, or combinations of knowledge, skills, and attitudes, that they need to apply in the criterion situation in professional life” (Gulikers et al. 2004, p. 69). Authenticity does not inhere in relevant content as such, but must be derived from a process of generation (Achtenhagen and Weber 2003). For this reason, several criteria have been developed for the design and evaluation of authentic assessments, including aspects such as complex and unstructured tasks, fidelity of the task to the conditions under which the performance would normally occur, the production rather than reproduction of knowledge, and the achievement of validity and reliability with appropriate criteria for scoring varied products (e.g., Wiggins 1990; Herrington and Herrington 2006; Janesick 2006).

Thus, the aim of this study is to carry out an empirical comparison of CBA and PBA and to address the question as to whether there are significant differences in the ways in which commercial competence is presented in both test formats, with respect to specific content, item format, and, finally, motivational aspects.

Measuring VET competencies

Vocational competencies are defined in various ways, and are strongly linked to the professional domain to which they belong. In the following section, definitions of the concept of vocational competencies and the subordinate construct of commercial competence are given, before the assessment itself is introduced. Finally, the respective differences in test design for the two formats are discussed.

Baethge and Arends (2009, p. 9) describe vocational competencies as “young adults’ abilities to successfully apply their knowledge and experience to authentic occupational situations in selected vocational areas in the world of work.” In this case, vocational competence denotes an overarching concept that includes economic or commercial competence as one possible thematic orientation in the professional world. While there are established comparative studies and assessments in the field of general education (e.g., PISA, TIMSS), an equivalent large-scale assessment in the VET system, one that focuses on apprentices’ transition to the labor market, remains a desideratum. VET, especially the company-based VET program (the so-called “dual system”), has a long-established tradition in Germany and is of great importance, not least because many young people are trained in this system. As measured by the number of apprenticeship contracts, the commercial domain “trade and industry” is one of the major areas of VET (BMBF 2016, p. 25). Finally, there are good reasons to focus on this area, given the high level of participation and the high relevance of new technologies in the commercial domain.

Following Winther and Achtenhagen (2008, p. 100), and based on a systemic understanding of operational sub-processes and their reconstruction from real company data in real professional situations, commercial competence is defined as the ability to make business decisions and to validate them. From a theoretical and instructional perspective, one can distinguish between domain-linked (dl) and domain-specific (ds) facets of commercial competence, as proposed by Gelman and Greeno (1989). The domain-linked category refers to the notion of key vocational abilities, such as knowledge that is general but also relevant for solving vocational or professional problems. Domain-linked content is related to the general skills (mostly linguistic and mathematical) that are relevant in (commercial) occupational practice (Klotz et al. 2015); it differs from the content of the general education sector in its emphasis on professional relevance to economic areas. By way of example, we might cite the understanding or reading of simple mathematical algorithms (e.g., the rule of three to calculate currency conversions or discounts). Domain-specific content includes specific sets of rules and a practical knowledge of a professional community, exclusively in commercial professions (Klotz and Winther 2016; Oates 2004; Winther et al. 2016a).
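As a concrete illustration of such a domain-linked task, a rule-of-three calculation for a currency conversion could look as follows (the exchange rate is purely illustrative):

```latex
% Rule of three for a currency conversion (illustrative exchange rate):
% if 1 EUR corresponds to 1.10 USD, how many USD correspond to 250 EUR?
\[
  \frac{1\ \text{EUR}}{1.10\ \text{USD}} = \frac{250\ \text{EUR}}{x\ \text{USD}}
  \quad\Longrightarrow\quad
  x = 250 \cdot 1.10 = 275\ \text{USD}.
\]
```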

Test environment

To represent and measure commercial competencies in the field of VET, a test environment named ALUSIM was developed. It is a simulation of an industrial company that produces extruded aluminum products, such as beverage cans (see also Winther and Achtenhagen 2009, first edition; Winther 2010). The assessment is designed in such a way that test-takers assume the role of an apprentice in the ALUSIM company, are introduced to different situations via video clips, and then have to act in the simulated work area on a real computer. For this purpose, various documents are made available, such as customer lists and product catalogs, as well as working tools like word-processing and spreadsheet programs, email programs, calendars, calculators, and notepads. The assessment is intended to be authentic in terms of occupational situations, and appropriate with respect to the abilities acquired through vocational training. It can be described as a “simulated” performance test that assesses working products and processes (see Kubiszyn and Borich 2010, p. 186) and promotes and captures decision-making. The items are divided into four main curricular areas of focus: work preparation, purchasing, corporate communications, and sales. The response format varies between short-answer items and extended-response items (see Kubiszyn and Borich 2010; Nitko and Brookhart 2014; Hanna and Dettmer 2004). In respect of the latter, for example, emails have to be formulated (e.g., to make formal requests); for short-answer items, brief answers, consisting of numbers or key words, have to be supplied within a given template. These item formats, in contrast to multiple-choice questions, are suitable for assessing competencies that are relevant in vocational situations, because they measure how well test-takers are able to generate answers, act, and express themselves (see Hanna and Dettmer 2004).

The term “computer-based” encompasses the representation of, and the working on, tasks, and (at least partially) their evaluation (for an overview of concepts of electronic assessment see Jurecka and Hartig 2007). In particular, the simulation, as a special form of CBA, is an “artificial representation of the real thing” (Hanna and Dettmer 2004) and can open up new possibilities, because it “ensures a measurement of authentic abilities without bringing the testees in such real-world situations” (Winther and Achtenhagen 2009, p. 98).

In a second step, the PBA was constructed on the basis of the CBA: all items were transferred to a paper-based version. In order to generate the highest possible (visual) similarity, screenshots from the simulated test environment were added into the PBA, video sequences were transcribed, and, as the name implies, the handling was based on paper and pencil. The two tests (CBA and PBA) both attempted to capture the same latent commercial trait.

To ensure that the selected test content could be interpreted as relevant, preliminary studies were carried out using document analysis (including curricula, teaching materials, and regulations), interviews with experts, and workplace observations (see Winther et al. 2016b). The selection of adequate test and item content is fundamental but difficult to objectify, so it generally entails the involvement of experts (Hartig et al. 2012). In addition, technical reliability and usability considerations represent key conditions that affect the validity and general acceptability of tests (Ehlers et al. 2013). Therefore, in order to ensure the overall functionality of the CBA, a usability test was conducted using the think-aloud method (Boren and Ramey 2000; Ericsson and Simon 1984; McDonald et al. 2012; Yom et al. 2007). The results were used to adapt the user interface over several revision cycles, and to prevent interference in the operation of the CBA in advance (Sangmeister et al. 2018, in preparation).

Differences in test design: mode effects

There are good reasons to assume that CBA is bringing changes to assessment practice that affect organizational and technical aspects for schools and teachers, as well as the test design itself (Jerrim 2016). The test design constitutes the framework, setting up test-taker interactions and working steps within the assessment, and has the potential to influence performance and results. In the literature, effects that are due to differences in test design are called “mode effects,” and studies describe them as a “difference between the latent competencies of a test-taker for two tests administered in different modes” (Kröhne and Martens 2011, p. 174). The medium of administration (in our case, computer versus paper and pencil) is therefore an obvious and central design feature from which mode effects can arise; “however, in practice, the simple distinction between the two test delivery strategies is only a convenient way to communicate a conglomerate of differences between the actual test administrations that might trigger mode effects and result in nonequivalent test administrations” (Kröhne and Martens 2011, p. 171). One challenge, then, is to describe design criteria in a transparent way for each test format, in order to form a basis for interpreting or judging the applicability of the results. Some relevant criteria at the test level that are usually used in empirical analyses are: (a) information searches, (b) multimedia elements, (c) response modes, (d) input devices, and (e) navigation.

The information search criterion (a) is closely related to the type of test and differs between PBA and CBA. While in the CBA the information necessary to solve tasks must be partially filtered out by the test-takers themselves, as part of the task, the relevant documents in the PBA are bundled beforehand into a document wallet. Disadvantages can arise in interacting with the CBA, especially when the simulation is not intuitive to use and/or when the effort of searching for information costs valuable testing time.

In respect of the multimedia element criterion (b), CBA offers a series of design options in addition to the written language format that is usually used in PBA. A key difference between the tests is that in the CBA it is a video, rather than a continuous text, that introduces the situation; this automatically reduces the effort required for reading. Mangen et al. (2013) found that students who read texts in print scored significantly better than students who read the same texts digitally. The information in the PBA text is easier to read and mark up, which brings advantages in respect of the time required to answer. There are thus some indications that completing a PBA is faster than finishing a CBA (Johnson and Green 2006). On the other hand, the simulation could be perceived as a less formal and more practical testing format, compared to the traditional educational testing format of written instructions, and may thus also have a positive effect on processing motivation (Garris et al. 2002).

Aspects specifically related to handling include response mode (c) and input devices (d). Giving responses (c) in the CBA entails typing on a keyboard or clicking with a mouse, whereas in the PBA responses are handwritten with paper and pencil. Input devices (d) can be responsible for differences in performance when test time is affected, because information and communication technology (ICT) expertise might be required to handle the computer. Navigation possibilities (e) must also be considered: this involves within-item navigation, which for CBA means scrolling and clicking. Several studies confirm that too much scrolling to see the entire item can result in mode effects (Kingston 2009). Thus, a computer simulation entails a certain degree of risk, because the information needed to solve the items is not always visible at a glance. In our test the within-test navigation, meaning the sequence in which items are answered and the possibilities for revising and omitting items, was even more important. In the CBA, a predetermined task sequence has to be followed. Once completed, a task cannot be revised. It is possible, however, to finish an item without supplying an answer in the relevant location. The question of revision is often described in studies as a cause of mode effects (Lunz and Bergstrom 1994), but results and interpretations differ, and are influenced by test characteristics (speed or power). The possibility of revising may cause test-takers to deliberately seek out and work on items that they have a high probability of solving, and thus achieve better test results. On the other hand, revising tasks and “jumping back and forth” adversely affects test time, and hence may lead to a lower number of correct answers.

At the level of the sample, contextual factors are often analyzed for comparison (e.g., age, socioeconomic background), as well as motivational aspects and computer experience. Studies that examine the age of the test-takers have found different results, but there seems to be a tendency for mode effects to be more pronounced at primary-school age (Pomplun et al. 2006). This, in turn, relates to the computer experience of children; due to the progress of digitization, however, this effect is becoming less of an issue (Wang and Shin 2009).

Research questions

In order to compare both test formats, we had to ensure that the construct of commercial competence had been tested in an appropriate form in both PBA and CBA. The testing is considered valid if the established parameters of item analysis and model fit comply with current conventions. If these psychometric conditions are achieved for both tests, then a positive interpretation of construct validity is allowable, and a comparison between PBA and CBA is reasonable. In IRT, internal validity is deemed to be supported through model fit (Rost 2006). Further, the focus on vocational competencies in the workplace is strongly linked to the increasing importance of the use of information and communication technologies. In this case, we assume that a computer-based and authentic assessment increases construct validity (Gielen et al. 2003) because of an improved representation of domains (Pellegrino and Quellmalz 2010). For the purposes of verification, the reliabilities and the item- and model-fit measures of CBA and PBA were compared. We assume that CBA maps the tasks in a more authentic way, so we hypothesize that:

Hypothesis (1): CBA displays a significantly better model fit than PBA.

A detailed analysis was performed at item level to filter out items that are systematically easier (or more difficult) in either CBA or PBA, and thus show a differential item functioning (DIF) effect. Although PBA may well require declarative and procedural knowledge, its scope for mapping realistic operations in the workplace is limited (Goldhammer et al. 2014). Since authentic design (using job-specific documents and procedures) is important, it follows that job-related (specific) tasks can be represented and displayed more realistically in CBA (Huff and Sireci 2001; Katz 2007) and, thus, that items that are classified as domain-specific and more complex can be better reflected in CBA (Quellmalz and Kozma 2003). So we hypothesize that:

Hypothesis (2): DIF items for CBA are mainly domain-specific.

In addition to aspects at the test level, issues at the level of the sample should also be addressed. We examined to what extent motivational aspects differ between the test formats, and whether there is a difference over time (before and after testing). Motivation is operationalized using scales for emotional state and test attractiveness. It was expected that emotional state would turn out to be higher before and after working with the CBA, because it seems to be more engaging than classical PBA and students prefer working on a computer (see Johnson and Green 2006).

Hypothesis (3): The motivation of the test-takers, considering (a) emotional state and (b) test attractiveness, is significantly higher in CBA than in PBA both before and after testing (ex ante/ex post).

Methods

The study collected data from CBA and PBA performance tests, as well as from a motivation questionnaire. The survey was conducted in 2013 in twenty-six classes in three German federal states. The test-takers were in their second year of an apprenticeship as industrial clerks and were, on average, 20 years old, with a range of 17–37 years (SD = 2.37) (see Table 1).

Table 1 Sample of industrial clerks

Data collection was organized in a within-subject design: each test-taker had to complete the whole test, one half in PBA (first part) and the other half in CBA (second part). A balanced rotation design with a total of eight booklet rotations was therefore applied (each booklet with a sample of n = 62–72). Although both tests were designed to measure the same underlying construct, there was no overlap of items. We received data sets from PBA (n = 523) and CBA (n = 414), with a total of thirty-four items being administered to measure commercial competence. The final sample consisted of 387 apprentices, because only those examinees who had worked on both tests (PBA and CBA) were included in the analysis. To scale the competence data, invalid responses and omitted items were rated as incorrect answers (0 = no credit). Items that were not administered due to the rotation design were excluded. The performance tests were framed by questionnaires asking about aspects of motivation. The test-takers were invited to perform a self-assessment (n = 294) before and after working on CBA and PBA, including the scales “emotional state” and “test attractiveness” (adapted from the PISA studies; Kunter et al. 2002). Emotion and motivation represent two separate psychological dimensions, but they cannot be considered in isolation (Seifried and Sembill 2005; Sembill 1992; Schumacher 2002). The response scale ranges from 0 = no agreement to 3 = high agreement. Here, too, only a reduced sample was available, with n = 93 missing cases. To analyze these data, a within-subject ANOVA was used.
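For illustration, a minimal sketch of how such a two-factor within-subject ANOVA could be run is given below; the file name and column names are hypothetical, since the original analysis scripts are not part of this article.

```python
# Sketch: two-factor within-subject ANOVA on the motivation ratings.
# Assumes a long-format table with one row per person x test format x time point;
# the file and column names are illustrative only.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical file; columns: id, format (PBA/CBA), time (ex_ante/ex_post), rating (0-3)
data = pd.read_csv("motivation_long.csv")

# Both factors (test format and time of measurement) vary within persons.
result = AnovaRM(data, depvar="rating", subject="id", within=["format", "time"]).fit()
print(result.anova_table)  # F values, degrees of freedom, and p values for each effect
```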

Item Response Theory (IRT) is an established method used to relate the competence values of trainees to contextual and situational requirements (Hartig and Frey 2013). Relevant models of probabilistic test theory (De Ayala 2009) were therefore calculated with the generalised item response modelling software ACER ConQuest 2.0 (Wu et al. 2007). The present work was based on a partial credit model (Masters 1982), a form of the ordinal Rasch (1960) model that permits partial credit items, with 0 = no credit, 1 = partial credit, 2 = full credit; the estimation of person and item parameters was based on the maximum likelihood principle (multidimensional random coefficient multinomial logit model [MRCML]; see Adams et al. 1997). The question of test validity is decisive for the acceptance of an instrument and for successful testing. Construct validity, in particular, is a key test criterion that can be seen as an overarching concept (Anastasi 1986; Hartig et al. 2012). Thus, validity is not a characteristic of the test per se, but refers to the interpretation of test scores (Borsboom et al. 2004).
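For reference, the category probabilities of the partial credit model can be written as follows (standard notation after Masters 1982, with person ability θ and item step parameters δik; given here as a reminder rather than as the exact ConQuest parameterization):

```latex
% Partial credit model: probability that a person with ability \theta
% reaches category x on item i with ordered categories 0, ..., m_i
\[
  P(X_i = x \mid \theta)
    = \frac{\exp\!\left(\sum_{k=0}^{x} (\theta - \delta_{ik})\right)}
           {\sum_{h=0}^{m_i} \exp\!\left(\sum_{k=0}^{h} (\theta - \delta_{ik})\right)},
  \qquad x = 0, 1, \dots, m_i,
\]
% with the usual convention that the sum for x = 0 is identically zero.
```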

In addition to these model tests, the detection of DIF (Holland and Wainer 1993) was important, as it allows the two test formats to be compared at item level. DIF analysis is most commonly used to assess fairness in the process of instrument development and adaptation (Ackerman 1992; Roussos and Stout 1996) by checking whether individual items operate differently for different test subgroups, thus detecting violations of validity. Zumbo (2007, p. 228) divides the development of DIF research into three generations, wherein the third generation is characterized by “the matter of wanting to know why DIF occurs.” We refer to items as DIF items if the probability of a solution is not fully explained by the person’s ability and/or item difficulty (Adams and Carstensen 2002). DIF-based studies permit statements about the performance of groups on an assessment while controlling for the overall ability of the groups. According to Embretson and Reise (2000, p. 319), “a scale item displays DIF if examinees with the same latent-trait level have different probabilities of endorsing an item […].” In our case, we examined the extent to which group differences were caused by the test mode (CBA or PBA) and, therefore, not by personal characteristics such as gender or age. Group affiliation is defined by the delivery strategy, so that Group 1 = PBA and Group 2 = CBA.
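One common way to formalize this, sketched here under the assumption of a dichotomous item and two groups (it is not necessarily the exact specification used in ConQuest), is to let the item difficulty depend on the delivery mode:

```latex
% Rasch model with a group-dependent item difficulty:
% \theta_n = ability of person n, \delta_i = difficulty of item i,
% \gamma_g = main effect of group g (PBA vs. CBA), \lambda_{ig} = item-by-group interaction.
\[
  \operatorname{logit} P(X_{ni} = 1)
    = \theta_n - \left(\delta_i + \gamma_g + \lambda_{ig}\right).
\]
% Item i functions differentially between the delivery modes if \lambda_{ig} \neq 0,
% i.e., if the groups differ on this item beyond their overall ability difference.
```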

Results and discussion

Comparing facets of construct validity at test level (H1)

To check the overall fit in both formats, analyses of item fit, reliability, and model fit were used (see Table 2). The fit of individual items to the model was evaluated using the weighted mean square (wMNSQ) approach. The wMNSQ describes the deviation of the observed probability of a correct response from the model-implied probability for a given ability level (Pohl and Carstensen 2012); a wMNSQ near 1 therefore indicates a good fit, with 0.75 ≤ wMNSQ ≤ 1.33 regarded as acceptable (see Bond and Fox 2001). The respective t-values are inferential statistics for the null hypothesis that the wMNSQ equals one, with acceptable weighted fit t-values between −2.0 and 2.0 (see Wright and Masters 1982).
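As a reminder of what is being compared, the weighted mean square (infit) statistic for an item i can be written in its standard form, with xni the observed response of person n, Eni its model-implied expectation, and Wni its model-implied variance:

```latex
% Weighted mean square (infit) for item i:
\[
  \mathrm{wMNSQ}_i
    = \frac{\sum_{n} \left(x_{ni} - E_{ni}\right)^2}{\sum_{n} W_{ni}},
  \qquad
  E_{ni} = \mathrm{E}\!\left(X_{ni} \mid \hat{\theta}_n\right),\quad
  W_{ni} = \mathrm{Var}\!\left(X_{ni} \mid \hat{\theta}_n\right).
\]
% Values near 1 indicate that the observed residual variation matches the model expectation;
% values well above 1 signal underfit, values well below 1 signal overfit.
```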

Table 2 Comparison of PBA and CBA by item statistics, reliability, and model fit

As shown in Table 2, the data yielded good item statistics: for PBA, 0.83 ≤ wMNSQ ≤ 1.21 and −2.24 ≤ item thresholds (difficulty) ≤ +3.13; for CBA, 0.78 ≤ wMNSQ ≤ 1.22 and −1.77 ≤ item thresholds (difficulty) ≤ +2.62. One item for PBA and nine items for CBA showed values of |t| > 2; nevertheless, these items were not excluded, because of their relevance with regard to content. With regard to the item thresholds, the PBA covered a wider range than the CBA for the same item pool. The reliabilities are described through the WLE and EAP/PV coefficients, which indicate higher values for PBA; however, the differences between the test formats are small (WLE difference of 0.019, EAP/PV difference of 0.057). The psychometric requirements are thus met for both test formats, in respect of item fit and reliabilities.

There are no absolute measures of model fit that can be used for comparison; therefore, different deviance-based model quality indices, such as the Akaike Information Criterion (AIC), the corrected Akaike Information Criterion (cAIC), and the Bayesian Information Criterion (BIC), are reported. Deviance is a measure of the discrepancy between the model and the data: the greater the deviance, the greater the deviation from the model. Models with smaller AIC, cAIC, and BIC values are empirically preferred. In our case, the PBA data showed the better model fit (BIC = 7237.03; AIC = 7074.74; cAIC = 7084.72).
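All three indices are simple functions of the deviance D = −2 log L, the number of estimated parameters p, and the sample size N; the usual definitions are given below (the exact small-sample correction implemented in ConQuest may differ slightly from the AICc shown here):

```latex
% Deviance-based model comparison indices (D = -2 log L, p = number of parameters, N = sample size):
\[
  \mathrm{AIC} = D + 2p, \qquad
  \mathrm{BIC} = D + p \ln N, \qquad
  \mathrm{AIC}_c = D + 2p + \frac{2p(p+1)}{N - p - 1}.
\]
% Smaller values indicate a better trade-off between model fit and model complexity.
```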

Additionally, person ability (WLE) and the raw scores achieved can be considered. The WLE describes the most likely competence score for each person, given that person’s item responses (Pohl and Carstensen 2012). The mean WLE was −0.3 (SD = 1.51) for PBA and −0.6 (SD = 1.31) for CBA, and person ability covered a wider range in PBA (between −5.09 and 3.92) than in CBA (between −4.56 and 2.48). On average, 8.93 (SD = 5.81) tasks were solved in PBA and 7.94 (SD = 5.13) tasks in CBA, roughly one item less in CBA. Generally, it can be concluded that the items were rather too difficult for the sample, as indicated by the low proportion of correct answers (see Fig. 1). Only 15 items for PBA and 12 items for CBA had a proportion of correct answers (full credit) higher than 50%.

Fig. 1 Proportion of correct answers (%) for items in CBA and PBA

Both tests yielded basically acceptable values, but PBA performed slightly better in terms of item parameters, reliabilities, model fit, and person ability.

Hypothesis (1), that CBA displays a significantly better model fit than PBA, is rejected.

Contrary to the expectations formulated in Hypothesis 1, the data show a slight advantage for PBA in terms of the proportion of correct answers and model fit. However, the difference is very small (only one additional task was solved in PBA, and the deviance differs by 449.93). Test rotations had no significant effect on test performance for PBA (F < 1, p = 0.767), but there was an effect for CBA (F(7, 379) = 5.21, p < 0.001) that affects two combinations of rotation.

Mode effects at test level, as discussed above, may provide an explanation for these differences, as may sample effects such as familiarity with the test format, because trainees at vocational schools usually work with PBAs, which are presented neither in the form of a simulation nor as CBAs. The novelty of CBA could conceivably lead to a situation in which test-takers initially try out functions and options, so that the processing is less focused and accordingly more time consuming, whereas in PBA all the relevant information is present in a concentrated way (see “Measuring VET competencies” section: “information search”), and it is easier to get an overview of all the documents (see “Measuring VET competencies” section: “navigation”); this is less obvious in the CBA. Furthermore, in the CBA the revision of completed tasks is not possible (see “Measuring VET competencies” section: “revision”).

Detecting DIF to compare test delivery strategies (PBA vs. CBA) at item level (H2)

Using DIF analyses in a between-design, we identified items that are significantly easier, or alternatively more difficult, to solve in one of the two test formats (see Table 3). These items can then be examined for similarities and differences related to configuration/design and type.

Table 3 DIF items for PBA and CBA with domain-linked (dl) and domain-specific (ds) item classification

First, the DIF analysis confirmed the results above and demonstrated a group difference at test level in favor of the PBA. This means that test-takers underperformed in CBA by 0.34 logits. At the item level, it becomes apparent that 21 items have a moderate (7) or large (14) DIF effect, of which 10 can be interpreted in favor of the PBA and 11 in favor of the CBA. This classification was done according to Paek (2002). Based on the underlying theoretical model of competence, all items had been classified (ex ante) with regard to their domain-specific or domain-linked components (Winther et al. 2016b). Further investigation showed that, of the CBA DIF items, a total of 8 items (73%) were classified as domain-specific. This suggests that specific items can be displayed in the CBA in a helpful way, so the DIF analysis provides evidence that the specificity of an item had a positive effect on working with CBA.
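The classification rule can be sketched as follows; the logit thresholds of roughly 0.426 and 0.638 are the values commonly attributed to Paek (2002), converted from the ETS delta scale, and should be treated as an assumption to be checked against the original source.

```python
# Sketch: classify the DIF effect size of one item from the difference in its
# difficulty (in logits) between the two delivery modes. The thresholds are the
# values commonly cited for Paek (2002) and are an assumption here, not a quotation.

def classify_dif(delta_pba: float, delta_cba: float) -> str:
    """Return 'negligible', 'moderate', or 'large' for a single item."""
    dif = abs(delta_cba - delta_pba)
    if dif < 0.426:
        return "negligible"
    if dif <= 0.638:
        return "moderate"
    return "large"

# Example with purely hypothetical item difficulties:
print(classify_dif(delta_pba=-0.20, delta_cba=0.55))  # -> 'large'
```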

Hypothesis (2), that DIF items for CBA are mainly domain-specific, can be confirmed.

Conversely, we found that items that are easier to solve in PBA were predominantly domain-linked. Moreover, the results are independent of the response format (extended-response items and short-answer items).

Evaluation of motivational aspects at sample level (H3)

The descriptive results of motivational aspects from the self-assessment are represented in Table 4 for PBA and Table 5 for CBA.

Table 4 Mean values of motivation scales for PBA (ex ante/ex post)
Table 5 Mean values of motivation scales for CBA (ex ante/ex post)

For both tests, there is higher agreement before testing (ex ante) than after testing (ex post), for both emotional state and attractiveness of the test. Overall, the motivational ratings for both formats are rather sobering; the lowest rating was found for “attractiveness of test.”

For each scale, four values are compared: the approval ratings for the two test formats (PBA and CBA) at two points in time (ex ante and ex post).

Figure 2 shows mean emotional state as a function of the two test formats (PBA vs. CBA) and time (ex ante vs. ex post): there was no significant effect of test format, F(1, 293) = 1.4, MSE = 0.215, p = 0.238 (n.s.), ηp² = 0.005, but there was an effect of time, F(1, 293) = 72.86, MSE = 0.188, p < 0.001, ηp² = 0.199, and an interaction effect of format and time, F(1, 293) = 15.01, MSE = 0.156, p < 0.001, ηp² = 0.049. The post hoc t tests show that there were significant differences in emotional state before testing (ex ante) between PBA and CBA, t(293) = 3.74, p < 0.001, but not after testing (ex post), t(293) = −1.5, p = 0.135 (n.s.). These results also show that emotional state was more stable over time in CBA than in PBA.
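The post hoc comparisons reported here are paired t tests; a minimal sketch (with hypothetical arrays holding each person's ex-ante ratings for the two formats) could look like this:

```python
# Sketch: post hoc paired t test comparing ex-ante emotional state between PBA and CBA.
# The two arrays are hypothetical; each position belongs to the same test-taker.
import numpy as np
from scipy.stats import ttest_rel

pba_ex_ante = np.array([2.0, 1.5, 1.8, 2.2, 1.9, 2.4])
cba_ex_ante = np.array([1.8, 1.2, 1.7, 2.0, 1.6, 2.1])

t_stat, p_value = ttest_rel(pba_ex_ante, cba_ex_ante)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```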

Fig. 2 Emotional state as a function of test format (PBA vs. CBA) and time (ex ante vs. ex post). Error bars show the within-subjects standard errors of the means, using the algorithm suggested by Cousineau (2005)
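A minimal sketch of the within-subject error bars used in Figs. 2 and 3, following the normalization described by Cousineau (2005), is given below; the data matrix is hypothetical.

```python
# Sketch: within-subject standard errors following Cousineau (2005).
# Each observation is normalized by removing the participant's own mean across
# conditions and adding back the grand mean; standard errors are then computed
# per condition on the normalized scores.
import numpy as np

def cousineau_se(data: np.ndarray) -> np.ndarray:
    """data: participants x conditions matrix; returns one standard error per condition."""
    participant_means = data.mean(axis=1, keepdims=True)
    normalized = data - participant_means + data.mean()
    return normalized.std(axis=0, ddof=1) / np.sqrt(data.shape[0])

# Hypothetical ratings for four participants in the four cells (PBA/CBA x ex ante/ex post):
ratings = np.array([[2.0, 1.5, 1.8, 1.6],
                    [1.2, 1.0, 1.4, 1.1],
                    [2.5, 2.0, 2.2, 2.1],
                    [1.8, 1.4, 1.9, 1.5]])
print(cousineau_se(ratings))
```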

Figure 3 depicts the attractiveness of the test and shows a significant main effect of time, F(1, 293) = 10.715, MSE = 0.147, p = 0.001, ηp² = 0.035, but not of format, F(1, 293) = 0.439, MSE = 0.187, p = 0.508 (n.s.), ηp² = 0.001. The interaction term was significant, F(1, 293) = 4.552, MSE = 0.141, p = 0.034, ηp² = 0.015. The post hoc t test showed an effect of time (ex ante/ex post) only for PBA, t(293) = 3.883, p < 0.001.

Fig. 3 Attractiveness of test as a function of test format (PBA vs. CBA) and time (ex ante vs. ex post). Error bars show the within-subjects standard errors of the means, using the algorithm suggested by Cousineau (2005)

In summary, the mean values before testing (ex ante) differ significantly between the test formats for emotional state, but not after testing (ex post). We had expected higher ratings for CBA in particular, due to the aforementioned redesign of the test. Following the observation that the use of computers is no longer “anything special” for this sample of so-called “digital natives,” and that no disadvantages were expected from using a computer, we may assume that there were no motivational effects simply because a computer is now part of everyday life (see also Frey et al. 2009). An interesting finding, however, is that emotional state (ex ante) was lower in the CBA than in the PBA, but that the CBA values were more stable over time (ex post) and decreased less sharply. The latter was also observed for attractiveness of the test.

Hypothesis (3), that the motivation of test-takers, considering (a) emotional state and (b) test attractiveness, is significantly higher in CBA than in PBA before and after testing (ex ante/ex post), cannot be confirmed. However, motivation of test-takers was more stable over time for CBA.

Conclusions

In the context of VET competence measurement, the crucial question is not just what apprentices know, but, in particular, what they are able to do and how they act. Therefore, instruments are necessary in order to determine and externalize these competence aspects effectively, for the evaluation of performance (status quo), the quality and improvement of training, and the prospects of success in the labor market. This article presents an advanced instrument from the field of electronic assessment for measuring commercial competence, which aims to optimize competence measurement through the use of authentic content and presentation. To achieve a better understanding of effects and mechanisms, a comparison was carried out between CBA and PBA, taking mode effects into account. Because mode effects vary depending on the test content and are thus not necessarily transferable and generalizable (see Kröhne and Martens 2011; Ito and Sykes 2004), it is necessary to look at the specific instrument. The aim was to illuminate the characteristics of CBA and to use this information for the further development and optimization of such new assessments.

The study showed that both test formats (CBA and PBA) can represent commercial competence in a valid and reliable way. Contrary to expectations, PBA showed a slight advantage in respect of item and model fit. The superiority of CBA over PBA, whether in terms of the quality of the construct representation or the test motivation of the test-takers, could not be confirmed. Overall, there was only little endorsement of either format in absolute terms, but the motivation of test-takers was more stable over time for CBA and decreased less sharply. Furthermore, the specificity of the tasks matters: domain-specific items tend to be easier in CBA than in PBA, while domain-linked items (more general tasks) fit better in PBA. Depending on the purpose of a test, a combination of both forms may be appropriate. From a scientific perspective, we have gained information about assessment design criteria that can improve test performance and that should be enhanced in vocational learning settings, in order to promote competence development affecting the socialization of young adults into the world of work. At a practical level, this also supports educators in designing assessments.

Several limitations of the study suggest the need for improvements in future studies. For example, with respect to the design, there were, at the individual level, no identical tasks in both CBA and PBA that would allow a pairwise comparison. The advantages of this type of design are that a smaller sample is sufficient and that it obviates the need to test twice, which could have negative effects on motivation, such as fatigue or boredom (Texas Education Agency 2008); moreover, apprentices are generally better able to carry out a task on a second attempt. The approach is also justified by the fact that, from a pragmatic, test-economic perspective, the rotation design reduces test time even though all items could be included, and that carry-over effects could be avoided.

Moreover, additional surveys would make a deeper analysis of the mode effects possible. We described differences between CBA and PBA that may affect test scores, and the empirical analysis provided evidence for this, but it has not been possible to identify the specific driving force (see also Jerrim 2016). For example, while we generated log data for CBA that can supply information on processing procedures and times, such information is not available for the PBA, and so a comparison is not possible. The assumption that the observed differences are based on processing times (for example, typing vs. writing, or introductory text vs. video introduction) cannot be adequately tested.

To assess the “suitability” of e-assessment, the whole testing process needs to be considered, from preparation through implementation to post-processing (Ehlers et al. 2013). Issues such as avoiding interference, providing infrastructure, and ensuring data protection are essential for the use of CBA, but they were not central themes here. To make a final judgment, these criteria would have to be discussed in more detail.

With regard to content, it is worth discussing to what extent the use of computers (in terms of ICT literacy) must be regarded as another dimension of competence, and to what extent it is part of commercial expertise. Assessments used to measure computer skills, such as the iSkills (Information and Communication Technology Literacy Test) of ETS (2002) or the BCS (Basic Computer Skills) of Goldhammer et al. (2014), could help to identify the level of ICT competence and to monitor its impact on the CBA test results of commercial apprentices.