This dissertation was part of the project Measuring English Writing at Secondary Level (MEWS). The MEWS project was co-funded by the Swiss National Science Foundation (SNF) and the German Research Foundation (DFG). The study aimed to be the first large-scale study in Germany and the German-speaking part of Switzerland to focus specifically on English writing skills in upper secondary education. In addition to writing competences, students were tested on listening and reading comprehension and completed extensive background questionnaires, including a catalogue of questions concerning their media-related extramural English contacts. The following chapter describes the sample and the operationalization of the key constructs and variables for the present analysis.

5.1 Sample and Test Procedures

The initial sample of the MEWS project consisted of 2,847 German-speaking upper secondary students in their penultimate year before their baccalaureate exams (CH: n = 1,882; G: n = 965). Students were tested in a repeated-measurement design: the first measurement point took place two years before the students took their baccalaureate exam (August/September 2016), the second one year before finals (May/June 2017). Before the data collection, participating schools provided student lists, including the type of school track, date of birth, gender, and the previous school year’s grades for English, German, and Mathematics. In the process, each student was also assigned an anonymous ID number.

For Switzerland, the full population comprised students in the academic track of upper secondary education (Gymnasium) in the German-speaking cantons of Aargau, Basel City, Basel Country, Solothurn, St. Gallen, Lucerne, Schwyz, and Zurich. Participation was voluntary but recommended by the educational departments. Twenty schools agreed to participate, resulting in an initial sample of n = 1,882 students. Due to scheduling demands, schools decided to volunteer entire classes instead of samples of students from each class. The Swiss sample is thus a convenience sample. Data collection was conducted by the research team at the University of Applied Sciences Northwestern Switzerland (for more details, see Keller et al., 2020).

For Germany, the full population consisted of students in the academic upper secondary track (Gymnasium) in Schleswig-Holstein with eight years of secondary schooling. Data collection was conducted by the IEA Data Processing and Research Center in Hamburg with the consent of the Ministry of Schools and Professional Education in Schleswig-Holstein. Thirty-seven schools were recruited, and student participation was voluntary. In the end, n = 965 students from different profiles participated (for more details, see Köller et al., 2019).

For the present thesis, n = 124 students who did not provide at least one valid response to any test item were excluded from the analysis (CH: n = 53; G: n = 71). In addition, one Swiss school volunteered students but declined to provide any background information for them. Since this information played an important role in later analyses and in the scaling of the three language skills, these students were also excluded from the analysis (n = 101; n = 99 valid). Data for one visually impaired student was likewise excluded, as providing the student with all tests and questionnaires in an adequate format proved difficult. Lastly, n = 136 students with English as a native language were excluded from the dataset for the present analysis (CH: n = 103; G: n = 33).

The final dataset contained 2,487 students, with n = 1,626 students from Switzerland (58.2% female; age: \(\overline{x}_{T1} = 17.57\), \(SD_{T1} = 0.91\); \(\overline{x}_{T2} = 18.27\), \(SD_{T2} = 0.91\)) and n = 861 students from Germany (58.8% female; age: \(\overline{x}_{T1} = 16.9\), \(SD_{T1} = 0.56\); \(\overline{x}_{T2} = 17.6\), \(SD_{T2} = 0.56\)).

Data collection took place during regular school hours and was supervised by trained university students and Ph.D. students. Students wrote two essays and completed a reading and a listening comprehension test at each time point. In addition, they were asked to complete a test of cognitive abilities at T1 and to fill out an extensive background questionnaire at both T1 and T2. All tests were conducted on laptops provided by the test teams, and the data was stored electronically. Testing took about three hours per measurement point, including two breaks for the students.

5.2 Language Assessment

Students were tested on their productive writing skill and their receptive reading and listening skills. For reading and listening, test items from the German National Assessment were used (for details, see Köller et al., 2010). Students were presented with two sets of reading and two sets of listening comprehension tests (testlets) that each took, on average, 15 minutes to complete. The testlets were rotated among students between the two measurement points using a multi-matrix design to avoid sequence effects (Köller et al., 2019). The items tested the entire range (A1 to C2) of competences of the Common European Framework of Reference for Languages (CEFR) (Keller et al., 2020) and are aligned with the educational standards for English as a first foreign language (Standing Conference of the Ministers of Education and Cultural Affairs in the Federal Republic of Germany [KMK], 2014). Completing the test thus also required the understanding of longer and more complex reading and listening input, including idiomatic expressions and different linguistic registers (Keller et al., 2020).

For writing, no established large-scale procedure for assessing English as a foreign language existed. The research team therefore partnered with the Educational Testing Service (ETS) in Princeton, NJ, USA, which conducts the internationally renowned TOEFL iBT assessment (Burstein et al., 2013; Educational Testing Service [ETS], 2009). A central goal in the German and Swiss writing curricula for English as a foreign language is to develop students’ understanding of a wide range of input from audio or written sources. Students should be able to use that information to produce their own writing, address a specific audience in an appropriate and persuasive way, and state their personal opinion about a given topic (Educational Department of Basel-Stadt [EDBS], 2017; Fleckenstein et al., 2018; Institute for Quality Development in Schools of Schleswig-Holstein [IQSH], 2014). To operationalize the main learning goals from the national curricula, integrated and independent prompts were selected from the ETS TOEFL iBT pool for the MEWS study. For the independent essay, students were presented with a controversial topic and asked to write an argumentative essay in which they agreed or disagreed with the statement and supported their opinion with arguments. Students had 30 minutes to write the argumentative essay on the computer. They were advised that a good essay should be at least 300 words long. They were not required to count the words but were told that 300 words would correspond to approximately ten lines on the screen (ETS, 2009).

For the integrated essay, students were presented with a written text on a specific topic (250–300 words long) and a spoken audio input expressing opposing views to the written input (2–3 minutes long). After the input, they had 20 minutes to summarize the information (150–225 words). In contrast to the independent essay, they were not required to formulate their own opinion or conclusion (ETS, 2009). The integrated essay therefore represented a synthesis text, which required the writer to combine different language skills and writing strategies, including demonstrating a broad understanding of the opposing input sources, selecting important information, evaluating it, and rearranging it into a logical and coherent text (Keller et al., 2020; van Ockenburg et al., 2019).

Students were asked to write one independent and one integrated essay at both measurement points, resulting in four essays per student. For copyright reasons, the research team selected prompts that had already been used for TOEFL and were publicly available online. Overall, four prompts were selected that were thought to meet students’ interests and, in the case of the independent prompts, were likely to match students’ world knowledge (for details on the writing prompts, see Keller et al., 2020). As with the testlets for reading and listening, the writing prompts were rotated between the two time points.

The written texts were scored by the ETS through a combination of human scoring and automated essay evaluation (AEE). The following is a brief overview of the scoring technique used for the study. For an in-depth description of the procedure, refer to Rupp et al. (2019).

For the human scores, each essay was scored on a holistic scale from 0 to 5 by two experienced and trained human raters. A score of 0 indicated that an essay was written in a language other than English or that a student, although present for the test, did not write an essay. For both prompt types, an essay scored high if students used English accurately and made only minor grammatical and spelling errors. In addition, an independent essay scored high if ideas were well organized and developed and if students supported their ideas and opinions with examples. For an integrated essay to score high, students had to clearly summarize the main points from both the audio and the written input and contrast the two positions. Inter-rater agreement for the human rating, measured in quadratic weighted kappa, was high for both text types and at both time points (independent: \(QWK_{T1} = .639\), \(QWK_{T2} = .670\); integrated: \(QWK_{T1} = .865\), \(QWK_{T2} = .775\); Rupp et al., 2019, p. 9).
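For reference, quadratic weighted kappa compares the observed rater agreement with the agreement expected by chance, weighting each disagreement by its squared distance on the score scale. The standard definition (not taken from Rupp et al., 2019) is

\[
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\,O_{ij}}{\sum_{i,j} w_{ij}\,E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(k-1)^2},
\]

where \(O_{ij}\) is the observed frequency with which the first rater assigned score \(i\) and the second rater score \(j\), \(E_{ij}\) is the frequency expected by chance from the raters’ marginal distributions, and \(k\) is the number of score categories (here six, 0 to 5).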

For the automated essay evaluation, the students’ texts were scored using e-rater®, the AEE engine of the TOEFL iBT test (Burstein et al., 2013). E-rater® rates texts based on text features (macrofeatures), such as grammar, usage, mechanics, organization, development, discourse, collocations and prepositions, average word length, median word frequency, and sentence variety (Rupp et al., 2019). In addition to these generic macrofeatures, the model also included prompt-specific vocabulary usage measures (Attali, 2007).

Research has shown that AEE models can help boost scoring reliability, as they counterbalance human rater errors such as fatigue or leniency effects, topic effects, sequence effects, and halo effects (Deane, 2013). However, AEE cannot understand the content of an essay the way a human rater does. It might therefore assign a high score to an essay that is linguistically and grammatically correct but does not express any comprehensible and coherent thought. As a result, AEE scoring only measures the mechanical side of an essay, i.e., its textual quality. Nevertheless, the macrofeatures capture important text qualities and writing abilities that learners need in order to compose and organize a well-written text (for a detailed discussion of students’ writing skills for both prompt types, see Keller et al., 2020). To obtain one writing score for each prompt, an average score was calculated from the two human scores and the one machine score. This resulted in an HHM score (human-human-machine) for the integrated and the independent essay at each time point. To create one writing score per student and time point, the two HHM scores at each time point were again averaged, resulting in one overall writing score for T1 and one for T2.
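Assuming the simple unweighted means described above, the aggregation can be written as

\[
\mathrm{HHM} = \frac{H_1 + H_2 + M}{3}, \qquad \mathrm{Writing}_{T_t} = \frac{\mathrm{HHM}_{\mathrm{ind},T_t} + \mathrm{HHM}_{\mathrm{int},T_t}}{2},
\]

where \(H_1\) and \(H_2\) are the two human scores, \(M\) is the e-rater® score, and \(t \in \{1, 2\}\) indexes the measurement point.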

To obtain comparable measurements for all three language skills, the scores for each skill were scaled using a longitudinal multi-level two-parameter item response model in Mplus Version 8 (Muthén & Muthén, 1998–2017). After computing expected a posteriori (EAP) measures, 15 plausible values were calculated for each domain at each time point. Principal component scores from the background questionnaire (provided by the schools) and the students’ questionnaires at T1 were used as the background model for the calculation. Reliability for the plausible values reached .92 (T1) / .76 (T2) for reading, .85 (T1) / .72 (T2) for listening, and .94 (T1) / .85 (T2) for writing. Following the example of other international large-scale studies, the plausible values were standardized and transformed to \(\overline{x} = 500\) and \(SD = 100\) at T1. Plausible values for T2 were standardized and transformed along the values for T1. As a result, differences between T1 and T2 can be interpreted as gains in language competences between the measurement points (Keller et al., 2020; Köller et al., 2019).
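The essential point of the transformation is that the T1 mean and standard deviation define the metric for both time points. A minimal sketch of this linear transformation for one plausible value per domain (the function name and arguments are hypothetical; the actual scaling was part of the IRT model in Mplus):

```python
import numpy as np

def scale_to_t1_metric(pv_t1, pv_t2, target_mean=500.0, target_sd=100.0):
    """Standardize T1 plausible values to mean 500 / SD 100 and apply
    the identical linear transformation to T2, so that T2 - T1
    differences remain interpretable as competence gains."""
    m = np.nanmean(pv_t1)
    s = np.nanstd(pv_t1, ddof=1)
    t1 = (np.asarray(pv_t1) - m) / s * target_sd + target_mean
    t2 = (np.asarray(pv_t2) - m) / s * target_sd + target_mean  # same transform
    return t1, t2
```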

Table 5.1 summarizes the means and standard deviations for students’ reading, writing, and listening skills at both measurement points, as well as the correlation between them (detailed descriptions can be found in Keller et al., 2020; Köller et al., 2019; Rupp et al., 2019).

Table 5.1 Correlation between listening, reading, and writing scores

As Table 5.1 shows, the three skills are moderately correlated. This is in line with more recent research showing that a one-dimensional model of language competences might not be the best representation of the underlying data, especially for beginners and intermediate students, as not all dimensions develop simultaneously. Instead, a multi-dimensional model, in which the language skills are distinct but correlated, might better represent the underlying data structure (Jude et al., 2008; Schoonen, 2019). The present study will therefore assume such a multi-dimensional structure for the analysis. This will also allow a separate analysis of the effect of extramural contact on all three skills.

5.3 Questionnaire

This section summarizes the relevant constructs and variables for the present analysis. The complete questionnaires for Germany and Switzerland and the descriptive statistics for all variables can be found in Meyer et al. (in preparation).

5.3.1 Socio-economic Background and Gender

Measurements for gender and socio-economic background factors were collected at T1. For gender, information from the official student lists provided by the schools was used (0 = male, 1 = female). Students’ technical equipment at home was operationalized by five dichotomous items from the PISA study (0 = no; 1 = yes), measuring internet access at home, a computer for studying, and possession of a gaming console, a personal laptop, and a personal smartphone (Hertel et al., 2014; OECD Programme for International Student Assessment, 2009a, 2009b, 2012).

Students’ socio-economic background was measured by two structural factors (highest level of parental education and number of books at home) and three process factors, which measured a conducive home environment for English and parental role modeling. As Rolff et al. (2008) showed, the process factors are themselves already influenced by the structural factors.

The first structural factor, the highest level of parental education (HISCED), was operationalized as the higher of the mother’s or father’s level on the International Standard Classification of Education (ISCED) (Schroedter et al., 2006). The index ranges from 0 (no formal school leaving certificate) to 9 (second stage of tertiary education, research qualification). If a student provided information for only one parent, that parent’s level of education was used as the HISCED. The HISCED thus operationalizes the educational resources and the institutionalized cultural capital at home. It is also an approximate measure of a family’s economic resources, because educational resources are closely linked to the economic situation: a high level of qualifications is usually associated with better career prospects (Hußmann et al., 2017; Stecher, 2005).
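In data terms, the HISCED is simply the row-wise maximum of the two parental ISCED levels, with a fallback to the single reported parent. A minimal sketch (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical column names; ISCED levels as reported by the students.
parents = pd.DataFrame({
    "isced_mother": [6, None, 3],
    "isced_father": [4, 5, None],
})

# Row-wise maximum; max(axis=1) skips missing values by default, so if
# only one parent was reported, that parent's level becomes the HISCED.
parents["hisced"] = parents[["isced_mother", "isced_father"]].max(axis=1)
```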

The second structural factor, the number of books at home, was used to measure the objectified cultural capital of a family. The scale ranged from (1) No books to (7) More than 500 books. This indicator has proven to be a reliable operationalization of objectified cultural capital and also serves as a further indicator of a family’s proximity to the education system (Hußmann et al., 2017; Stecher, 2005).

In addition, the questionnaire included 26 items from the study Assessment of Student Achievements in German and English as a Foreign Language (DESI), measuring process factors for English competence and English use within the family (Wagner et al., 2009). The scale for all items ranged from 1 – Not at all true to 4 – Exactly true. A principal component analysis with all 26 items revealed five process factors: the use of English at home, parents’ English competence, parents’ use of English at work, the perceived importance of English by the parents, and parents’ interest in classroom instruction (Wagner et al., 2009). Three of these factors can be considered especially important for the present study, as they represent crucial indicators for a conducive English home environment and positive English media socialization: (1) English use within the family (6 items, \(\overline{x} = 1.9\), \(SD = 0.74\), \(\alpha = .804\)), (2) parents’ English competences (as perceived by students) (11 items, \(\overline{x} = 2.1\), \(SD = 0.78\), \(\alpha = .936\)), and (3) parents’ perceived value of English as an important investment in their children’s future (as perceived by students) (3 items, \(\overline{x} = 3.33\), \(SD = 0.63\), \(\alpha = .799\)) (for detailed results, see the electronic supplementary material). Of the 11 items loading high on the second factor, six were excluded for the present study, as they related more to school-related support than to parents’ English competence. The remaining five items still showed satisfactory reliability (5 items, \(\overline{x} = 2.43\), \(SD = 0.85\), \(\alpha = .888\)). Each group of items was combined into a mean value index, with each index ranging from 1 to 4.
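A minimal sketch of the two steps described above, extracting components and building a mean value index (variable names are hypothetical; in practice, the solution would typically be rotated and inspected before assigning items to factors):

```python
import pandas as pd
from sklearn.decomposition import PCA

def item_loadings(items: pd.DataFrame, n_components: int = 5) -> pd.DataFrame:
    """Fit a PCA on the listwise-complete item matrix and return the
    item loadings used to assign items to the process factors."""
    data = items.dropna()
    pca = PCA(n_components=n_components).fit(data)
    return pd.DataFrame(pca.components_.T, index=items.columns)

def mean_value_index(items: pd.DataFrame, factor_items: list) -> pd.Series:
    """Mean value index for one factor: the average of its items, which
    keeps the index on the original scale from 1 to 4."""
    return items[factor_items].mean(axis=1)
```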

It is important to note that neither the structural nor the process factors are entirely congruent with the complex reality of socialization conditions within a family. Nevertheless, studies have shown that, for quantitative research purposes, they allow a reliable approximation of the conditions of socio-economic socialization and thus of the social origin of the (media) habitus (Stecher, 2005).

Missing rates for these socio-economic indices ranged from 8.3% (computer for studying/internet access) to 25.5% (HISCED). Details can be found in Table 1 and Table 2 in the electronic supplementary material.

5.3.2 Media-related Extramural English Contact

For the MEWS study, an in-depth questionnaire to measure students’ extramural English contacts through media content was developed by the author of the present book. The questionnaire included traditional media forms, such as books and television, as well as newer media channels, such as surfing websites or using social media platforms. It was administered to all students participating at the second measurement point (T2) after they had written the essays and completed the listening and reading comprehension tests. In a first step, learners were asked how often they engaged with English-language media content through ten media channels. These questions then served as filter questions for the subsequent follow-up questions and will therefore be referred to as entry questions. A translated version of the entry questions can be seen in Table 5.2.

Table 5.2 Excerpt from the questionnaire: Frequency of extramural English contact

The entry questions can be used to investigate the frequency of extramural English contact through each media channel separately. Missing rates ranged from 18.7% (music) to 19.6% (books). In addition, the answers to all ten questions were combined into a mean additive index for the analysis. This index represents students’ overall frequency of media-related extramural English contact across all media channels. Cronbach’s alpha showed satisfactory reliability for the ten items (α = .77).
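A minimal sketch of the index construction and its reliability check (hypothetical variable names; listwise deletion is used here purely for brevity):

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of items (rows = students,
    columns = the ten entry questions)."""
    data = items.dropna()  # listwise deletion for this sketch
    k = data.shape[1]
    item_variances = data.var(axis=0, ddof=1).sum()
    total_variance = data.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Mean additive index: the average frequency across the ten channels.
# overall_frequency = entry_items.mean(axis=1)
```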

Follow-up questions were used to gather additional information for some of the most relevant media channels. Filters were used so that low-frequency students, who did not engage in a media channel at least 1–3 times per month, were not shown the follow-up questions. Since reading a book takes longer than watching a movie, the filter for the follow-up questions on reading books was set at reading at least 1–3 times per year.

As a result of the filters, the follow-up questions are based on differing subsamples. The subsamples for the follow-up questions will henceforth be referred to as readers, watchers, surfers, and gamers, respectively. Missing rates ranged from 23.5% to as high as 62.8% for some follow-up questions (see again Table 1 and Table 2 in the electronic supplementary material for more details). The size of each subsample, as it results from the answers to the entry questions, is reported in Table 5.3. The numbers might differ from the actual valid responses in the dataset due to answers that are missing by intention.

Table 5.3 Subsample for follow-up questions

Table 5.4 shows the follow-up questions for the individual media channels.

Table 5.4 Follow-up questions per media category

To measure gaming genres, Graham’s item battery of 12 gaming genres was adapted to German (Graham, n.d., Appendix A). Two categories (card games and gambling games and quizzes) were added to include two important categories for smartphone-based gaming. In addition, the questionnaire also listed multiplayer online role-playing games as a separate category to explicitly measure the use of highly interactive and communicative games. Examples of well-known games were included for each category.

The open-ended questions on hours spent surfing, watching, or gaming were recoded into categorical numeric variables using SPSS 25.0 (IBM Corp., Released 2017) and Microsoft Excel. Unrealistic answers were recoded as missing (e.g., 24 hours of surfing per day).
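The recoding logic can be sketched as follows (the cut points and variable names here are illustrative, not the categories actually used in the study):

```python
import numpy as np
import pandas as pd

# Hypothetical raw answers (hours per day), already parsed to numbers.
hours = pd.Series([0.5, 2.0, 5.0, 24.0, 8.0])

# Implausible values, such as 24 hours of surfing per day, become missing.
hours = hours.mask(hours >= 24, np.nan)

# Recode into ordered categories; the bins are illustrative only.
hours_cat = pd.cut(hours, bins=[0, 1, 3, 6, 24],
                   labels=["<1h", "1-3h", "3-6h", ">6h"], right=False)
```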

Students who regularly surf English-language websites (surfers) were also asked which influencers/content creators, channels, and celebrities they followed online on social media. Due to the variety and range of influencers and channels available, students were given an open-ended question. This way, students’ responses were not limited to a predefined set of answers; students were free to list as many names as they wanted to and could remember. The answers can therefore be seen as representing spontaneous recall or top-of-mind awareness, i.e., the names that are most salient to students. This technique is well established in marketing for measuring consumer brand awareness (Common Language Marketing Dictionary). Answers were categorized using the free version of the software QDA Miner 5.0 (Provalis Research, Released 2016), which allows the coding and analysis of text-based data from interviews and open-ended questions. The dataset contained 1,557 valid answers. Three students gave answers that could not be interpreted, and 217 students answered that they regularly watched YouTube videos but did not follow anybody in particular and therefore provided no names. These answers were set to missing, as they provided no further detail (n = 155). Each mentioned name was initially coded as a separate category. To summarize the data for the analysis, names were grouped into six main categories according to the type of information and entertainment they produced and the level of insight they allowed into their private lives (see Table 5.5). Names were researched across multiple social media platforms to determine their category.

While some influencers allow deep insight into their private lives and often make their everyday activities the focus of their content, others prefer not to share as much private information online. For example, gaming channels on YouTube might focus on instructional videos and walk-throughs for specific games; the creators might not appear on screen or share any private information. Other gaming creators might allow viewers insight into their lives (sometimes via a second channel or on a different platform, such as Instagram) and engage in other topics apart from gaming. To differentiate between these two forms of content creators, two categories were coded: influencers and channels. If a content creator had a large following, used multiple platforms, appeared in person, talked about their lives, or posted pictures of themselves, they were categorized as influencers. If not, they were categorized as channels.

Content creators who focused on fitness were either coded as influencers or as channels, depending on their overall online presence. Professional athletes were coded as celebrities (e.g., Roger Federer).

If the name could not be identified by online research, the name was set to missing (n = 101); the same was done for German-speaking influencers (n = 41).

Table 5.5 Coded categories: influencer followed online

Lastly, students who reported regular media-related extramural English contact were asked why they chose to engage in this extramural contact. Overall, 1,857 valid answers were recorded. Answers to the open-ended question were again analyzed using the free version of the software QDA Miner 5.0 (Provalis Research, Released 2016). Through content analysis, 28 categories were derived from the textual data and summarized into nine main categories: quality, exclusivity, internationality, convenience, appreciation for English, appreciation for the original, language learning, external influence, and other reasons (see the electronic supplementary material for the coding manual). To check the reliability of the coding, two additional coders re-coded the first n = 150 answers in the dataset. Inter-coder agreement between the author and the two coders, measured as Brennan and Prediger’s kappa (κn), was satisfactory and substantial (Coder 1: κn = .63; Coder 2: κn = .71; Rädiker & Kuckartz, 2019). Quotes chosen for this publication will be corrected for spelling mistakes and translated into English by the author.
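For reference, Brennan and Prediger’s kappa replaces the marginal-based chance correction of Cohen’s kappa with a uniform baseline over the coding categories:

\[
\kappa_n = \frac{p_0 - 1/q}{1 - 1/q},
\]

where \(p_0\) is the observed proportion of agreement between the two coders and \(q\) is the number of coding categories, so that chance agreement is fixed at \(1/q\).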

It should be mentioned that nine students stated that they only engage in regular out-of-school English contact for homework. In addition, five students said they did not engage in extramural English contact at all, even though their previous answers had activated the filter for the follow-up question. These two groups are worrisome, since only students who reported at least occasional extramural contact in the entry questions were shown this follow-up question. This points towards possible misunderstandings within the questionnaire or false declarations. An in-depth analysis of these students revealed that four of them seem to have stated that they read books in English at least 1–3 times per year, even though they later stated that they only did so for school, which the entry question specifically instructed them not to count. For the remaining cases, the data does not provide any conclusive explanation as to why they reported not engaging in voluntary extramural English contact. Most of them had selected regular extramural contact via the internet (surfing and watching videos); they might not have counted these as relevant contacts for the open-ended question. However, this is speculation. No such misunderstanding had occurred during the pilot phase, and it was therefore not anticipated for the field phase.

The media questionnaire underwent piloting in both Germany and Switzerland before the final field phase. Qualitative think-aloud interviews were conducted with 20 students (10 in Germany, 10 in Switzerland); the sessions were audio-recorded with the students’ permission. After the first round of qualitative piloting with 10 Swiss students, changes were made to the scaling of the entry questions. In the first draft, the entry questions were scaled on a 4-point scale ((almost) never, rarely, often, very often). This was followed by in-depth questions for each media category in which students reported regular extramural contact, scaled on a 5-point scale (never, 1–3 times per year, 1–3 times per month, 1–3 times per week, (almost) daily). Students pointed out the ambiguity of the first scale (e.g., what does ‘often’ mean?) and the duplication. Therefore, the questionnaire was shortened to contain only the 5-point entry questions presented in Table 5.2.

In addition, examples were added for the categories TV shows and surfing the internet. For the follow-up question on reading, the scale for the number of books was changed to cover a greater number of books. For hours spent surfing, watching movies, TV series, and TV shows, and gaming, descriptions were added to clarify the questions. The revised questionnaire was tested again with 10 German students. These interviews revealed no further misunderstandings.