9.1 Introduction

In Kenya, Tanzania, and Uganda, education is undergoing reform to include values and competencies as integral components of learning, going beyond knowledge acquisition and the emphasis on literacy and numeracy as the only foundational skills (Kenya Institute of Curriculum Development, 2017). The inclusion of competencies and values as part of the learning process broadens and challenges the form and use of assessment, as these newly adopted competencies may require different assessment mechanisms. Educational assessment of traditional academic subjects has tended to focus on whether a competency or knowledge area has been achieved or not, with an assumption that there is a finite range which can be labelled as failure at one end and success at the other. A different approach is increasingly being applied to less content-focused learning. In this approach, there is no absolute range of performance, since what is being assessed are naturally occurring competencies. In addition, display of the competencies is typically viewed as context specific, being sensitive to critical features of the environment and the socio-cultural context.

Several factors influence how assessments are designed, including the purpose, the context, and practical constraints. Assessment tasks should provide individuals with opportunities to demonstrate their competencies in meaningful contexts. Therefore, development of assessment tasks needs to be undertaken in such a way that the tasks target the intended population in a known context, and for a known purpose. Assessment tasks can take many forms, and selection of form is a key decision in this targeting. One form is the scenario, the subject of this chapter. Use of scenario-based assessment (SBA) has been viewed as effective in the assessment of values, which are dependent on one’s perception of situations and learning experiences (Haynes et al., 2009). Values are both complex and formed by an array of social influences. Conventional test items that assume closed responses or closed self-report options may under-represent the complexity of such constructs. One of the features of SBA is that it can allow those assessed the opportunity to express themselves. In literacy assessment, for example, Sabatini et al. (2019) argue that there is a need to move beyond the constraints of the traditional passage-and-question format of reading comprehension tests, and instead use scenarios to allow individuals opportunities for self-expression. Similarly, for values, a method of assessment that allows adolescents to express themselves in their own words relative to specific scenarios, and to consider their local context, would be appropriate.

9.1.1 Characteristics of Scenario-Based Assessment

Scenario-based assessment is a method of evaluation in which situations that are hypothetical, but based in real life, are created, and to which individuals are asked to respond. An SBA task typically takes the form of a brief description of a situation, followed by a series of questions or prompts designed to elicit behaviours or perceptions associated with the target construct. SBA therefore provides a method for assessing individuals in a meaningful and purposeful context with tasks that relate to their experiences. The approach is useful for bridging across informal but complex social activities in order to provide information about human behaviour in real-world situations (Haynes et al., 2009), while ensuring the formality and rigour required to generate robust and credible data.

Three characteristics of SBA are of particular interest in the case of assessment of life skills and values at household level, as envisaged by the Assessment of Lifeskills and Values in East Africa (ALiVE) initiative. These characteristics are familiarity, explanatory facility, and capture of complex phenomena. These characteristics are discussed in the context of the SBA design allowing for open-ended responses which are then coded according to criteria set to differentiate between performance levels.

9.1.1.1 Familiarity

The familiarity that SBA can offer is two-dimensional. First, the content of the scenarios can draw from the daily life of the target respondents. Second, the administration mode of SBA can be less formal, such that it allays respondents’ assessment anxieties. SBA is apt for assessing competencies and values since these are acquired and developed through daily experiences, whether formally through structured learning processes, or informally as when adolescents interact in the community more broadly. Such experiences provide the content used to stimulate responses (Haynes et al., 2009). Sabatini et al. (2019) argue that texts or stories used to assess literacy should reflect matters familiar to the target group. While Sabatini et al. focused on assessment of literacy, the lessons learnt can be extended to the assessment of life skills and values.

The second dimension concerns administration mode. SBA’s use of an open-ended response format can reduce the perception that there is only one correct answer to an assessment item. This can encourage respondents to be more discursive in ways that reveal more about their perceptions and capacities. Scenarios can allow individuals to take varied approaches to demonstrating their capacities, making this approach beneficial for those who have difficulty with other assessment formats. The open-ended response mode can simulate a daily conversation in the hands of an experienced test administrator or interviewer.

Life skills and values are attributes that relate to or direct one’s behaviour. They are presumed to be influenced by socialisation processes within the family and community. As such, the assessment of these attributes requires approaches that reflect that socialisation. People generally find scenarios easy to identify and discuss, as they constitute the patterns of everyday life. The use of scenarios within a mode of less formal discussion is effective because the technique links what might otherwise just be stated to actual activity or behaviour. This characteristic leads to the potential role of SBA as an explanatory facility.

9.1.1.2 Explanatory Facility

Using real-world problems and social context provides a natural base for assessment of cognitive abilities and motivation to participate in creative activities (Quitadamo & Brown, 2001). Twenty-first century skills, although diverse in nature, typically draw on analytical thinking, perspective taking, creative thinking, and flexibility. A well-constructed scenario triggers questions like, “What’s the goal of all this?” and “What do I need to consider in this situation?” The scenario is ideally presented in such a way that respondents will think and react as naturally as they would in a real-life situation. Such assessments help learners to visualise, conceptualise, and comprehend the situations presented. Accordingly, SBA can support learning.

The use of real-life situations can validate individuals’ understanding of those situations and build their confidence in how to handle similar situations in future. One of the major benefits of using a scenario-based approach to learning is the development of the learner’s ability to comprehend situations and identify relevant factors (Bell et al., 2004). Learners can immerse themselves in a situation and practise skills, while continuously learning from their mistakes. Driscoll and Carliner (2005) supported SBA on the grounds that it enriches the process of learning by engaging learners in brainstorming, critical thinking, and the application and synthesis of real-life social situations. The use of relatable and engaging story-like scenarios resonates with learners, and the experiences stay with them longer, hence enhancing learning. The principles of situated learning theory (Lave & Wenger, 1991) hold that authentic activity, context, and culture are important components of knowledge acquisition. This takes us to the capacity of SBA to make visible the nature of the target construct, by constructing tasks and designing items that generate open-ended responses that are directly interpretable within the conceptual structure of that construct. Haynes et al. (2009), in designing assessments for learning experiences, argue that scenario-based evaluation enables structured analysis of the causal forces behind phenomena of interest.

9.1.1.3 Capture of Complex Phenomena

Development of SBA tools rests on several assumptions. First, the scenario must be able to address multiple aspects of the construct being targeted; these aspects are defined by the hypothesised structure of the construct. ALiVE took a ‘construct-centred’ approach to its assessment design (Pellegrino et al., 2016). Second, the scenario must have qualities that enable different levels of performance to be elicited from the respondent. Third, and most critical, the questions that the scenario poses must be able to elicit participant responses that are clearly associated with the competence or value being assessed. In terms of administrability, scenarios must be written in such a way that respondents can reasonably be assumed to possess the listening or reading comprehension, in the language of administration, necessary to understand them.

Construction of SBA tasks that adopt open-ended responses bypasses the need to develop a series of branching pathways of prompts to capture the progress of individuals through multiple processes or steps, as is typical when thinking through activities such as problem solving (e.g., Graesser et al., 2018). Instead, the design identifies the key indicators of these steps, monitors them, and uses responses as indicative behaviours of interim processes.

Education systems have traditionally acknowledged acquisition of knowledge through test and examination practices which reward correct answers. The shift to twenty-first century skills, which prioritises the development of skills described along learning continua, requires assessment approaches that can capture increasing development. Such development is typically indicated by how an individual behaves or can perform, as distinct from what an individual might know. SBA is viewed as a reliable and valid assessment method for such performance-based constructs (Care & Kim, 2018).

9.1.2 Use of Scenario-Based Assessment in ALiVE

Reflecting the contextual philosophy of ALiVE, the project selected SBA as the predominant method for assessment of self-awareness and problem solving as well as the value of respect. The medium provided the opportunity to use local stories, narratives or situations to which adolescents could respond from within the household and community environment.

Each assessment task took the form of a hypothetical situation which was described orally to the respondent. The scenarios were based on real life contexts since the strategy was to use daily situations that would be familiar to respondents and would not assume knowledge based on formal education experience. After description of the scenario, a series of questions were asked about how the respondent might react to the situation described. The scenarios and their associated items (questions to the respondents) were designed to stimulate responses that would be interpretable within the conceptualisation of each of the constructs, and that could be construed to indicate proficiency level.

The construction of the scenarios was therefore based on the hypothesised structures of each life skill and value, identified by dimensions and/or subskills (Scoular & Otieno, 2024; Chap. 6, this volume; Ngina et al., 2024; Chap. 5, this volume; Care & Giacomazzi, 2024; Chap. 4, this volume). Associated assessment frameworks were developed which included performance indicators. These frameworks guided the development of scenarios for each construct. In creating scenario ideas, the adolescents’ immediate environments were considered, resulting in the use of home, community, and school situations.

Table 9.1 presents an example of a scenario-based task from the ALiVE tool. Given widespread familiarity with the nature of problem solving, this skill is used to illustrate the approach. The adopted structure for problem solving identified eight subskills which fell within three dimensions. The assessment framework operationalised four of these subskills, organised within two of the dimensions (Care & Giacomazzi, 2024; Chap. 4, this volume). Decisions about the scope of assessment frameworks were made in recognition of the nature and limitations of the planned mode of assessment and its administration. The pragmatism of this approach is demonstrated in this example, where ‘applying the solution’, the third dimension of the adopted structure, is not included since it would not have been viable within the timeframe and realities of the assessment event. To the degree that some dimensions and subskills were not measurable in ALiVE, it is essential that reporting of results clarifies what has and has not been measured. Examination of Table 9.1 shows how the phrasing of each question (item) targets the specific subskill which it is designed to stimulate. The SBA approach is wholly dependent on the careful deconstruction of complex skills into their component parts.

Table 9.1 Illustrative task identifying subskills targeted and associated scoring rubrics
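
To make the relationship between scenario, items, subskills, and scoring rubrics concrete, the sketch below represents such a task as a simple data structure. It is an illustrative example only: the wording, subskill labels, and code values are hypothetical and do not reproduce the actual ALiVE task shown in Table 9.1.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """One question posed about the scenario, targeting a single subskill."""
    prompt: str
    subskill: str                               # e.g. "identifying the problem"
    rubric: dict = field(default_factory=dict)  # response code -> level descriptor

@dataclass
class ScenarioTask:
    """A scenario and the items asked about it."""
    construct: str
    scenario: str
    items: list

# Hypothetical example, not the actual ALiVE task wording
task = ScenarioTask(
    construct="problem solving",
    scenario="A fire breaks out near the family kitchen while the adults are away.",
    items=[
        Item(
            prompt="Do you think this situation is a problem? Why?",
            subskill="identifying the problem",
            rubric={
                0: "no response or irrelevant response",
                1: "states that it is a problem, without explanation",
                2: "states that it is a problem and explains its consequences",
            },
        ),
    ],
)

# Check that every item carries a rubric covering more than one level
for item in task.items:
    print(item.subskill, "->", sorted(item.rubric))
```

Keeping the item-to-subskill mapping and the rubric together in one structure makes it straightforward to verify that each targeted subskill is covered by at least one item and that each item can differentiate between levels.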

9.2 Development of Scenario-Based Tasks

A large technical team was formed to develop the ALiVE assessment tool. The team was composed of members from Kenya, Tanzania, and Uganda, and was sub-divided into country-based groups for much of the development work. The full technical team participated in five workshops which focused on different aspects of test development. These aspects included, but were not limited to: defining and describing the targeted skills and value; setting the scope and drafting the assessment frameworks; idea generation; and scenario development accompanied by scoring decisions. These tasks were undertaken through iterative processes of review, ‘think aloud’ and paneling activities, and piloting. Finally, data from the pilot were analysed for final revision of materials.

Those processes particular to task development are highlighted in this section.

9.2.1 Idea Generation

The process of idea generation entailed consideration of the context of adolescents based on their daily life experiences with reference to family, school and community activities, and the probable roles of youth in different tasks. It was challenging to generate ideas that would ‘carry’ content and prompts that would stimulate adolescents’ demonstration of competencies. Having only recently gone through the process of deconstructing the skills, the team needed to consider how familiar situations presented the opportunity for targeting the constructs and their subskills. In addition, ideas were debated in light of cultural differences across the three countries, supported by the results of an earlier contextualisation study (Giacomazzi, 2024).

9.2.2 Checking Utility and Developing Rubrics

The ‘think aloud’ activity was used to check first drafts of tasks. The goal was to determine whether an assessment task actually captured the intended competencies by analysing the behaviours and metacognitive reflection of the responding adolescents as they worked through the tasks. The activity was also used as a strategy to collect likely responses which would inform the development of scoring criteria, or rubrics. ‘Think aloud’ involves individuals articulating their thinking as they complete a task (Eccles & Arsal, 2017). As adolescents went through each of the tasks, they were requested to report on their mental processes orally, explaining their thinking and reasoning. The transcripts of these think aloud activities informed the review process. Checklists were designed to identify whether: scenario tasks were capturing the targeted competences, dimensions, and subskills; the tasks and items were clear; and the tasks were perceived as familiar or reasonable by the adolescent.

Where necessary, the tools were translated into the local languages used in Kenya, Tanzania, and Uganda, to remove any language barriers to understanding the tasks. To harmonise the think aloud data collection across the three countries, each member of the National Technical Team participating in the think aloud activity as an assessor was given standard field instructions. These included a script for introducing oneself to the adolescent: clearly stating one’s name, the organisation represented, the purpose of the meeting with the adolescent, and what the adolescent was expected to do. All meetings with adolescents were undertaken by a pair of assessors. After the introductions, the two assessors modelled thinking aloud in order to help the adolescent understand what to do and how to respond to the items. Once the adolescent had fully understood what to do, the assessors read the tasks to the adolescent, providing enough time for the adolescent to think through the task. If the adolescent needed help with an item, the assessors repeated the item and/or asked the adolescent to think more. If the adolescent was still not able to respond, assessors would prompt to ascertain what kind of help was needed; for example, whether it was a comprehension, familiarity, or difficulty issue. This was recorded on the think aloud record form before moving to the next task. The record form included details of the identity of the assessors, the location where the activity took place, and the adolescent’s name, sex, education (in school/out of school) and age. It provided adequate space for the assessor to write, for each item, the adolescent’s responses, both initial responses and additional responses after probing. From each of the participating countries, three or four adolescents took part.
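
As an illustration of how such record forms might be harmonised for pooling across the three countries, the sketch below writes one row per item response with a fixed column layout. The column names and values are hypothetical and do not reproduce the actual ALiVE record form fields.

```python
import csv

# Hypothetical column layout for a pooled think-aloud record form
FIELDNAMES = [
    "assessor_1", "assessor_2", "location",
    "adolescent_id", "sex", "in_school", "age",
    "item_id", "initial_response", "response_after_probing", "help_needed",
]

def write_records(path, records):
    """Write one row per item response so records can be combined across countries."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)

write_records("think_aloud_records.csv", [{
    "assessor_1": "A01", "assessor_2": "A02", "location": "Community X",
    "adolescent_id": "KE-003", "sex": "F", "in_school": True, "age": 14,
    "item_id": "PS1.1",
    "initial_response": "Yes, it is a problem because the fire can spread.",
    "response_after_probing": "",
    "help_needed": "",
}])
```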

A workshop was organized for country teams to present reports of the think aloud activity focusing on issues and solutions related to administration of the think aloud, usability of the tools, and behaviours of the adolescents as they responded to the tasks. In terms of administration issues, each team reported on:

  • Instructions to young person: what was the range of needs in terms of modelling what was required, or expanding on instructions?

  • Recording information: how adequate was the record form for capturing observations?

  • Physical environment for the process: were there any issues associated with the location for the activity that might impact on the young person’s performance or behaviour?

The usability issues of interest were whether there was evidence that the adolescent found the instructions difficult to understand or anomalous, and whether the content of the tasks was perceived as sensitive in terms of culture, gender, religion or custom. The main question for the think aloud activity was whether the task and its items appeared to capture the targeted subskill. The question was answered through two processes: first, aggregating the response data from the three countries’ think aloud activities, using the same protocols to ensure comparability across records and across countries, and adding commentary based on observations of the adolescents; second, analysing the data in order to make decisions about the utility of the tasks and items for targeting the skills and subskills and for capturing different levels of quality, and to initiate the development of scoring criteria.

9.2.3 Paneling

Armed with the information derived from the think aloud process, the technical team reviewed the tasks to decide which would proceed to the pilot. Paneling is a quality assurance process to check and improve draft assessment items in terms of content and construct validity, and the capacity of the items to capture differences in performance. A panel was formed for each construct. Members were allocated such that each panel included a minimum of two individuals who had worked on that construct, to ensure the expertise to respond to other panelists’ queries, with the remaining membership drawn roughly equally from the other construct areas. The role of the two ‘construct representatives’ was to respond to queries, to explain and to clarify, not to defend or argue the case. Membership of the panels was 14 for problem solving, 13 for self-awareness, and 9 for respect. Each panel was guided by a paneling checklist in its review of draft tasks. The panels examined whether the task and item combinations assessed (part of) the construct, what respondents would need to know to respond to the scenarios, the authenticity of the scenarios, the precision and clarity of the phrasing of tasks and items, the amount of time needed to produce an answer, the adequacy of the scoring rubrics, and equity for respondents of different backgrounds. A sample of items from the paneling checklist is provided in Table 9.2.

Table 9.2 Sample items on the paneling checklist

The panels summarised the results of their deliberations and recommendations using a paneling summary form, on which members reported the quality of the task in terms of how effectively it captured the subskills targeted, and whether its items had potential for differentiation across different levels of proficiency. Members were then required to make a preliminary decision on whether a task or its item(s) should be edited, discarded or retained.

An important activity which took place at this stage was the determination of the levels of difficulty of the assessment tasks. The utility of tasks and items depends not only on whether they appear authentic and target the skills, but on whether they can differentiate between levels of proficiency. The issue of whether the tasks and items could identify levels of proficiency was explored using two sources of information:

  1. A priori descriptions of hypothesised indicative responses

  2. Analysis of responses from the think alouds to identify different levels.

The activity informed reciprocal goals: identifying whether the scenarios could stimulate responses across a range of lower to higher proficiencies; and developing rubrics that could describe these levels of proficiency based on adolescents’ responses. Each ‘skills team’ compared the think aloud data with the a priori descriptions of the hypothesised responses or behaviours. The teams checked frequency data of responses collected through the think aloud activity against the a priori responses, which ranged from low to high proficiency levels. Where there were no frequencies against a level, this stimulated discussion on whether to improve or drop the task. Another method involved comparison of the think aloud-derived responses with each other. Teams transcribed the individual responses onto sticky-notes and arranged these in ascending or descending order according to members’ judgement of increasing or decreasing proficiency. The final step was the development of rubrics that distinguished between proficiency levels. If rubrics that could accommodate the responses for coding could not be established, the task was discarded. This establishment of coding criteria, or rubrics, was a prerequisite for the next activity, the pilot study.
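
A minimal sketch of the frequency check described above follows. The coded responses and the set of a priori levels are hypothetical; the point is simply to show how levels with no observed responses can be flagged for discussion.

```python
from collections import Counter

# Hypothetical coded think-aloud responses for one item: (respondent_id, coded_level)
coded_responses = [("r1", 0), ("r2", 1), ("r3", 1), ("r4", 2), ("r5", 2)]

# Proficiency levels described a priori in the draft rubric
a_priori_levels = [0, 1, 2, 3]

frequencies = Counter(level for _, level in coded_responses)
for level in a_priori_levels:
    n = frequencies.get(level, 0)
    note = "  <- no responses observed; discuss whether to revise or drop" if n == 0 else ""
    print(f"level {level}: {n} response(s){note}")
```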

9.2.4 Pilot and Dry Run Studies

The pilot study was undertaken across Kenya, Tanzania, and Uganda in November 2021. The number of adolescents participating varied from N = 366 for respect, to N = 392 for self-awareness and N = 395 for problem solving. Guidelines for the exercise covered what the assessors were expected to do before, during, and after the assessment with regard to safety precautions, protocols to be followed, and ethical considerations. The pilot identified terminology and language issues, and generated both qualitative and quantitative response data from the adolescents. Data analysis involved calculation of descriptive statistics, including frequency distributions and measures of central tendency and dispersion. Item response data were used to explore scale composition, followed by establishing model fit. The output of these analyses, supplemented by analyses of dry run data, was used to make final decisions on each SBA task to be used in the large-scale assessment. Notably, the pilot collected not only data coded according to the rubrics as they then stood, but also the verbatim responses of adolescents. This rich data source provided the facility for further review of coding categories where anomalous patterns of data were apparent. In turn, this informed slight changes in scenario wording and rubric criteria prior to the dry run.
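
As an illustration of the descriptive analyses mentioned above, the sketch below computes item-level frequency distributions and basic central tendency and dispersion statistics for a small set of hypothetical coded pilot responses. The item names and data are invented for illustration and are not ALiVE pilot data.

```python
import pandas as pd

# Hypothetical pilot data: one row per adolescent, one column per item, coded 0-3
pilot = pd.DataFrame({
    "ps_item_1": [0, 1, 2, 1, 3, 2, 1, 2],
    "ps_item_2": [1, 1, 2, 2, 3, 0, 1, 2],
    "sa_item_1": [0, 0, 1, 2, 2, 1, 1, 3],
})

# Frequency distribution of codes for each item
print(pilot.apply(pd.Series.value_counts).fillna(0).astype(int))

# Central tendency and dispersion per item
print(pilot.agg(["mean", "median", "std"]).round(2))
```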

A dry run was undertaken with N = 337 adolescents in Kenya in February 2022 as a test of the whole process, as distinct from the earlier pilot for which the focus was on finalisation of tools based on how the test items functioned. The dry run provided field experience to guide the fine-tuning of administration guidelines and some slight re-phrasing of scenarios. The dry run used cell-phone based data capture through KoboCollect, and only coded responses to items were recorded. An analysis team comprising 13 members from the original technical team, as well as academics and statisticians drawn from universities and the government sector, explored the dry run data. Again, item response distributions were analysed, scales were reviewed, and associations between the constructs explored (a simple illustration of such an exploration is sketched after the list below). These activities led the team to reach:

  • Consensus on the underlying definitions and descriptions of the constructs (problem solving, self-awareness, and respect);

  • Agreement on the skills structures of the target competencies;

  • Agreement on the coding of the target competencies.
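
The sketch below shows one simple way in which the associations between the three constructs could be explored, by correlating total scores per adolescent. The scores are hypothetical; the sketch does not reproduce the ALiVE analyses.

```python
import pandas as pd

# Hypothetical dry-run totals per adolescent for each construct
scores = pd.DataFrame({
    "problem_solving": [12, 8, 15, 10, 7, 14, 11, 9],
    "self_awareness":  [10, 9, 14, 11, 6, 13, 10, 8],
    "respect":         [7, 6, 10, 8, 5, 9, 8, 6],
})

# Pairwise associations between the three constructs
print(scores.corr().round(2))
```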

That ALiVE decided to adopt SBA as the main assessment approach is not a claim that the method is perfect, either for the target constructs, or for the contexts in which the assessments were to take place. Challenges remain.

9.3 Challenges of SBA Illustrated

As with the development and use of any assessment method, the benefits on offer do not come easily. Some of the challenges associated with the use of SBA are illustrated through ALiVE’s experience.

Assessment of twenty-first century skills demands multiple modes or multiple data points in order to capture their complexity effectively. Since many of these skills are demonstrated through behaviour rather than through written or oral responses, observation of behaviour as individuals perform tasks would be recommended, all things being equal. The SBA approach taken by ALiVE opts for immediate judgment by assessors concerning the relevance and skill level of an individual as they perform or provide a response. There is therefore an inherent risk posed by the capacity of assessors to capture and interpret behaviour accurately. According to Pellegrino et al. (2016), an assessment is designed to observe learners’ behaviour and draw inferences about what the learner knows or can do. The detailed development processes undertaken in this study were designed to identify fit-for-purpose coding criteria that would be relatively easy to understand and apply during an assessment event in the field. From the robust scales derived from the large-scale study results (Ariapa et al., 2024), it is clear that such understanding and application was possible in this instance. However, if the SBA offered stimulus for a wider range of proficiencies, coding would arguably present greater challenges. Hence the approach has a clear limitation for application in contexts such as household-based assessment.

Whitlock and Nanavati (2013) argue that acquisition and demonstration of skills in assessment contexts may not strongly relate to the skills that are brought to bear in real-world experience. This is precisely the situation that SBA avoids since, in this current application, it is to familiar situations that adolescents are asked to respond. Of course, a response to a hypothetical, even if authentic, situation is not the same as a true response triggered by events. To this extent, SBA cannot emulate real life. In addition, the need for a simple situation to act as a stimulus might under-represent the real demand of such situations and so be incapable of stimulating less visible skills that might be part of a natural response. What is important is that the SBA has the capacity to stimulate different responses across different individuals. This was witnessed in the ALiVE assessments as adolescents interpreted and responded to similar tasks differently. For instance, the first scenario on problem solving focused on a fire breaking out, and the first task asked the adolescents whether they considered this situation to be a problem. A sample of responses showing different levels of proficiency is given in Table 9.3.

Table 9.3 Verbatim responses representing different proficiencies

Putting aside where such differences in response are accounted for by variations in proficiency, the assumption that responses are fully determined by the construct being targeted is questionable. One of the significant challenges in assessment of complex social skills is that these are influenced by multiple phenomena, some of which are cognitive, some interpersonal, some intrapersonal, and some contextual. Ascribing a response or reaction to just one source, whether it be a trait or a situational state, is to ignore the complexity of human beings and behaviour. This remains a challenge in SBA and for ALiVE. As with other methods of assessment, the claims for construct validity remain to be adjudicated through replication.

The development of SBA entails investment of time and effort to develop relevant and engaging scenarios to assess the intended construct, and according to Ryan (2006) may not necessarily provide the intended value. One challenge is designing an administrable assessment that can encompass all the intended elements of the target construct. The SBA development team needed to keep in mind that the goal was to sample adolescent proficiencies for the purpose of establishing regional profiles, rather than to describe those proficiencies comprehensively at the individual level. In addition, the household-based assessment context was such that maintaining adolescent attention beyond 40–50 minutes would be unsustainable. These realities led to the reduction of the original construct structures to the assessment frameworks: the more easily assessable elements would be targeted in the interest of evaluating these reliably.

It is worth noting that SBA, as a method and as an interaction mode, was in general not familiar to the adolescents. Hence, they may have experienced some confusion about what was expected of them, notwithstanding the overwhelmingly positive motivation to be involved in the activity. A general challenge of SBA may have exacerbated that confusion: the difficulty individuals may have in performing complex tasks within constrained time limits. Building mental models is known to be complex, and individuals may take time to navigate the complexities of extracting meaning and integrating background knowledge (Kintsch, 2012). Whether this in fact impacted on adolescents’ capacity to demonstrate optimal levels of skill is not known.

9.4 Conclusion

The use of SBA in the ALiVE project served a major requirement: that the assessment of life skills and a value be contextualised in terms of the constructs themselves, how they play a role in daily life, and where they are demonstrated. These three conditions established a context in which adolescents were presumed to enter the assessment event confidently. The scenarios themselves were key in ensuring familiarity of content in the actual tasks. The contextualisation of the skills and value in the early part of the study (Giacomazzi, 2024; Chap. 3, this volume) ensured that their interpretation was such that tasks would stimulate the cognitive and social-emotional processes of interest. The household setting provided the actual physical environment in which the scenarios would typically play out. In addition, the simplicity of the scenarios used for the tasks was such that task descriptions and prompts could be translated and administered in the local languages of the adolescents.

Establishing these conditions does not however guarantee accuracy of results, particularly the proficiency levels estimated on the basis of adolescent responses. A question that remains to be answered is whether the results from the assessment under-estimate adolescents’ actual proficiencies. This possibility is raised due to the observation made by some assessors that adolescents were unaccustomed to the assessment mode and unsure of how to respond. The mode was less formal than that routinely associated with assessment activity, and less formal than an adolescent might expect with an adult with whom they are unfamiliar. The training of assessors therefore needs to be comprehensive, both so that assessors approach and interact with the adolescent in a manner appropriate to the goal of the activity, and so that they master the coding rubrics. If assessors are not properly trained, they are likely to introduce measurement error by either under- or over-reporting proficiency levels. The practice of pairing assessors to administer the tasks and code the adolescents’ responses minimised coding errors, hence enhancing the reliability and validity of the findings.
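
One way in which the benefit of paired coding could be quantified is through an inter-rater agreement index such as Cohen’s kappa. The sketch below works through the calculation for a small set of hypothetical paired codes; it is an illustration of the idea rather than an analysis reported by ALiVE.

```python
from collections import Counter

# Hypothetical codes assigned independently by the two paired assessors
assessor_a = [2, 1, 0, 2, 1, 1, 2, 0, 1, 2]
assessor_b = [2, 1, 0, 1, 1, 1, 2, 0, 2, 2]

n = len(assessor_a)
observed = sum(a == b for a, b in zip(assessor_a, assessor_b)) / n

# Chance agreement from each assessor's marginal distribution of codes
marg_a, marg_b = Counter(assessor_a), Counter(assessor_b)
expected = sum((marg_a[c] / n) * (marg_b[c] / n) for c in set(marg_a) | set(marg_b))

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```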