1 Introduction

Bosnian/Croatian/Montenegrin/Serbian (BCMS) is one of the most widespread heritage languagesFootnote 1 in Switzerland. At least 2.4% of the Swiss population speaks BCMS on a daily basis, which corresponds to 173,546 individuals.Footnote 2 However, the BCMS spoken in Switzerland is still under-investigated, despite the increasing body of research on heritage languages (Polinsky & Scontras, 2020) and on heritage BCMS spoken in other German-speaking countries (Hansen, 2018; Hansen et al., 2013; Raecke, 2007; Romić, 2016; Schlund, 2006; Simonović & Arsenijević, 2020). One of the main challenges in the research on heritage varieties is the difficulty in obtaining and redistributing authentic data, which often limits the research settings to case studies based on online communication between peers (Kajgo, 2020; Zagoricnik, 2014). What is particularly lacking for facilitating the research on (BCMS) heritage varieties are corpora of spoken language, since heritage speakers use their heritage language predominantly in oral communication. In order to provide a resource which would enhance the study of heritage BCMS, we created a map task corpus of this variety spoken by second-generation speakers in Switzerland.

The corpus consists of data that have been collected by students during the courses “Corpus linguistics” and “BCMS as heritage language in Switzerland” at the Department of Slavonic Languages and Literatures at the University of Zurich in 2019 and 2020. The aim of the exercises was to train the students to conduct an empirical study on spoken language. The task included fieldwork, transcription of speech, linguistic annotation and analysis of selected phenomena found in the collected data. At the end of the courses, we obtained 30 short transcripts and recordings of BCMS heritage speakers having parents born in former Yugoslavia. In order to present this data to a broader audience, and to present a prototype for a corpus access of this type of spoken data, we created an interactive platform on which users can preselect the transcripts according to particular metadata and frequency distribution of pre-annotated features, sort and filter corpus counts, search the annotated turn-aligned transcripts, and add and export their own custom annotations. Hence, in addition to providing a first publicly available resource for heritage BCMS in German-speaking diaspora, with this corpus, we also present an example for visualising, structuring, and accessing spoken language data. With the corpus architecture used for this prototype, we address the need of user-group specific differentiation when accessing spoken language corpora (Fandrych et al., 2016; Goldman et al., 2005). The map task corpus of heritage BCMS is primarily tailored for linguists investigating heritage BCMS in interaction, and teachers and learners of BCMS living abroad.

1.1 About the label BCMS

The migration from the countries of former Yugoslavia to Switzerland started in the 1960s, when the name of the official language was Serbo-Croatian. However, Serbo-Croatian as standard language has never been entirely uniform: there was an Eastern and a Western variety, as well as two other (sub-) varieties: Bosnian–Herzegovinian standard-language expression and Montenegrin standard-language expression (Bugarski & Hawkesworth, 2004). Following the break-up of Yugoslavia, new standard languages have been codified from what was former known as Serbo-Croatian: Bosnian, Croatian, Serbian and Montenegrin. All these standard languages are almost completely mutually understandable, since they are all based on the Štokavian dialect, which is spoken in all of these countries.Footnote 3 While the denomination BCMS is widespread in the linguistic community, Bosnian, Serbian, Croatian and Montenegrin are also seen as separate languages, depending on the definition and approach one adopts when dealing with them. There is an extended body of literature about the controversy of BCMS and whether and under which aspects it can be considered one language (see Langston & Peti-Stantić, 2003; Gröschel, 2009; Bugarski, 2000; Kapović, 2010; Kordić, 2010). For reasons of consistency, in this paper we use the term BCMS to denominate these language(s), although the variety from Montenegro is not included in our data.

1.2 About map task corpora

Map tasks (Anderson et al., 1991; Thompson & Bader, 1993) are elicited conversations used as material for different research questions in linguistic research. In a typical map task setting, two participants take part in a cooperative problem-solving task. Both participants obtain a sheet of paper populated with various images, but only one of them also has a path drawn around the images. The participant who sees a path has the assignment of explaining to their interlocutor how to draw it on their sheet. Depending on the research design, the roles can also be switched, provided that there is a second set of pictures. The first publicly available map task corpus (HCRC) was designed to “furnish a common set of materials for the simultaneous study of several different linguistic phenomena” (Anderson et al., 1991, p. 353). Since then, several map task corpora were created for different research purposes and for different languages. Some of these are Chiba University Japanese Map Task Dialogue Corpus (2007),Footnote 4 Hamburg Map Task Corpus (HAMATACFootnote 5; Schmidt et al., 2010) and Berlin Map Task Corpus (BeMaTaCFootnote 6; Sauer & Lüdeling, 2016) for German, Montclair Map Task CorpusFootnote 7 (Pardo et al., 2018) for English, and Aix Map Task corpus (Gorisch et al., 2014) for French.

We decided to use map tasks for our pilot corpus of heritage BCMS because map task conversations are thematically similar and hence comparable, and at the same time they represent a source of semi-spontaneous speech that is differentiated enough to allow for the investigation of various speech phenomena. They are well suited for making a preliminary assessment of lexical, phonetic, morphosyntactic and disfluency patterns of heritage speakers, which was our aim in the course “BCMS as heritage language”.

2 Map Task Corpus of Heritage BCMS

In this section, we present the idea behind our map task design, the steps we undertook in order to collect the corpus data and a brief description of the participants and the language they used in map tasks conversations.

2.1 Map task design

We used an adapted version of the original HCRC map tasks (Anderson et al., 1991) with our own set of images. For each map task, we selected 11 images that represent objects/places from everyday life (e.g.: ‘bread’, ‘market’, ‘bowl’), but also rarely used words that we evaluated as challenging for heritage speakers (e.g.: ‘sloth’, ‘paper-clip’) see Figs. 1 and 2.Footnote 8 In selecting the images, we aimed to find terms that differ across BCMS standard varieties (e.g. kruh/hleb/hljeb for ‘bread’ in Croatian/Serbian/Bosnian). This was performed in order to be able to assess active and passive knowledge of “own” and “foreign” BCMS varieties in the course “BCMS as heritage language”. Our maps differ from the map tasks presented in the original HCRC Map Task Corpus (Anderson et al., 1991) in that (1) the features on the maps are not labeled, (2) the maps contain mostly features not to be encountered in a real navigation task, and (3) that the giver and the follower had the maps with the same exact features, while they had some differences in HCRC map tasks’ design.

Fig. 1
figure 1

Map tasks used in the experiments (map task A). Market: Gmihail at Serbian Wikipedia, CC BY-SA 3.0 RS https://creativecommons.org/licenses/by-sa/3.0/rs/deed.en, via Wikimedia Commons; bag clip: Rotho; beetroot: kindPNG.com; hair clipper: A&A Hair Beauty; stapler: Ofrex; node: dreamstime.com Pancaketom); key hole: Thegreenj, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons; bat: PD-USGov, exact author unknown, Public domain, via Wikimedia Commons: ladder, garbage, soup spoon: unknown author

Fig. 2
figure 2

Map tasks used in the experiments (map task B). Scissors: makeup.sk; sloth: Roy Toft; bowl: IKEA; cauliflower: shutterstock (Egor Rodynchenko); construction site: factumArchive; bread: Hubertus Schüler; rose hip: RGBStock.com (micromoth); turkey: dreamstime.com (Mike Neale); screwdriver: Narex; paperclip, windshield: unknown author

Before conducting the map tasks, we checked that (non-heritage) BCMS speakers (4 test individuals) were familiar with the represented images at least in their BCMS variety. We also checked whether the images were known to Swiss–German native speakers without migration background (5 test individuals). Interestingly, in contrast to BCMS test individuals, Swiss–German speakers were not familiar with the word for rose hip and only one of them knew Swiss–German word for beetroot. Despite that, we decided to keep these two images in the map tasks because BCMS test persons were familiar with them.

2.2 Data collection

The students’ task was to find at least two BCMS second-generation speakers originating from different regions of former Yugoslavia and to let them engage in map tasks. We defined second-generation heritage speakers as speakers of BCMS who were either born in Switzerland or who have migrated to Switzerland before starting primary school, and whose parents grew up in former Yugoslavia. Before the experiment, participants were asked to sign a declaration of consent to record and use their data for research purposes.Footnote 9 The initial plan was to record the map task conversations in a quiet room with high-resolution audio quality. However, due to COVID-19 restrictions, some map tasks were conducted online and recorded using video conferencing platforms. After each completed task, participants were requested to fill a questionnaire containing questions on the map task, as well as on their language use, social environment, self-assessment and language attitudes toward BCMS. We collected 30 map tasks conversations (overall 12,988 tokens).Footnote 10

2.3 Participants

A total of 29 participants (18 female and 11 male) took part in the task.Footnote 11 The median value for participants’ age was 23 years. Their highest educational attainment was university or higher education diploma (16) and high school (13). On average, the participants assessed their proficiency in BCMS as 4.7 on a scale from 1 to 6 (n = 19Footnote 12). 12 Participants attended classes in BCMS, and 16 did not (n = 28). Most of the participants originated from Bosnia and Herzegovina: there were 11 participants with both parents, and 6 participants with one parent from Bosnia and Herzegovina, and another from other successor states. A total of five participants had both parents originating from Serbia, four had both parents from Croatia, and two had both parents from Kosovo.Footnote 13 The majority of participants were ethnic Serbs (14) and Croats (9).Footnote 14

All participants lived in German-speaking Switzerland at the time of recording. The participants were given a pseudonym for further processing, as well as a “speaker-id”, which refers to the unique identifier for a speaker in a particular map task event.

2.4 Language in the map tasks

When asked about their first language, most participants (including those from Bosnia and Herzegovina) responded Croatian (10) or Serbian (9), and only one specified Bosnian. Other speakers either did not reply (5) or they indicated Serbo-Croatian (1), German (2) and Swiss–German (1) as their first language. While most participants named the language according to their ethnic backgrounds, naming the language proved to be a delicate subject for some participants.Footnote 15 All participants agreed that they would call BCMS naš jezik (‘our language’) when they would speak about it with their map task interlocutor. However, when asked if they spoke the same language as their interlocutor, the opinions were divided: out of 19 participants who named language differently than their interlocutor, only 8 (at least partially) agreed with the statement that it is the same language. Nevertheless, all participants spoke the Štokavian dialect in the map task, which is the dialectal basis of all standard BCMS languages. Most of them used Ijekavian (17) and Ekavian (9) variety, while three participants reported to speak Ikavian.Footnote 16

3 Corpus compilation

In this section we describe the processes of transcription, normalisation, and creation of the annotated TEI transcripts. Since the recordings contain many non-standard terms, we adopted the solution of transcribing them as they are pronounced, and normalising them in a second step.

3.1 Transcription

The corpus has been transcribed by 11 students with the help of FOLKERFootnote 17 transcribing tool using cGAT conventions (Schmidt et al., 2015). Each student transcribed the map tasks they have previously recorded (which ranges from 2 to 8). Transcribers were given the instructions to transcribe the conversations as they are pronounced, and to keep dialectal features and other non-standard features instead of correcting them to the standard language. Student’s transcripts were reviewed by two tutors. No measurements of inter-transcriber reliability were performed. As proposed by cGAT, we used pronunciation-based (semi-orthographic) transcription, and we also included notations for pauses, verbal and non-verbal units and incidents in the transcripts. The FOLKER transcripts were segmented into speaker turns.

3.2 Normalisation

We used the tool OrthoNormalFootnote 18 to add a normalisation layer for transcribed tokens. In the normalisation process, we compared the transcriptions to the standard variety that participants indicated as the language they spoke in the questionnaire, since BCMS has four different standard varieties. For instance, if a participant indicated “Croatian” as his/her language and used the eastern variant tačno instead of točno (‘correct’, ‘exactly’), as the word is said in the Croatian standard, we normalised it as točno, although tačno is correct in other BCMS standard varieties. This allows assessing how close their language use is to the standard varieties they name or regard as their own. We also normalised hesitation (ähm, äh, ääh, etc.) and acknowledgement tokens (hmhm, mhm, mh, etc.) to äh and hm respectively.

We also used OrthoNormal to annotate truncations, non-BCMS words, invented words, stutter, unclear words, and elongations (see Table 1). In doing so, we used conventions presented in Winterscheid et al. (2019). In total, 2538 tokens were affected by the normalisation (12.7%).

Table 1 Normalising conventions in XML transcripts (FLN-transcripts)

3.3 TEI encoding

We converted XML transcripts produced by OrthoNormal into TEI-XML encoded transcripts, following the TEI guidelines for transcriptions of speech implemented in the Corpus of Serbian Forms of Address (Lemmenmeier-Batinić, 2021, pp. 131–132).Footnote 19 In addition, we added the element <seg>, in which we stored non-BCMS words (@type=“non-bcms”), invented words (@type=“non-word”), elongations (@type=“elongation”), hesitation tokens (@type=“hesitation”) and acknowledgment tokens (@type=“acknowledgment”).

We tagged the corpus with the Serbian model of the CLASSLA-StanfordNLP tagger (Ljubešić & Dobrovoljc, 2019)Footnote 20 and stored the annotations for lemmas (@lemma), universal part-of-speech tags (@pos),Footnote 21 MULTEXT-East Serbo-Croatian morphosyntactic specificationsFootnote 22 (@ana), and normalisations (@norm) in the <w> element (see Example 1).Footnote 23

figure a

4 Web interface

In this section, we present the web interface of the corpus, which allows for a preliminary assessment of lexical, phonetic, and morphosyntactic features of heritage BCMS. Upon user registration, it allows users to view different transcript versions and already implemented annotations, make a pre-selection of transcripts through metadata filters, and create, store and share custom annotations directly on the corpus platform. To enable these functionalities, we elaborated the metadata so that it can be searched and filtered, converted the TEI transcripts into html files, and enriched them with annotation-related functionalities.

4.1 Metadata view

Speaker metadata collected in the questionnaire as well as the corpus counts for each speaker are presented in an interactive table created with the jQuery plug-in DataTablesFootnote 24 (see Fig. 3). The table is populated with the metadata stored in JSON format. Each column can be sorted and filtered using regular expressions. The table filter is synchronised with the map view representing the place of origin for each speaker, which gets updated according to the current table selection (see Fig. 4).Footnote 25

Fig. 3
figure 3

Interactive table and map

Fig. 4
figure 4

Filtering the table and map on participants born in Chur

As shown in Fig. 5, when the speaker’s location (marker) is clicked, the metadata table gets filtered on that speaker, and a pop-up window appears containing basic speaker information, as well as links to the transcript view and view of all the relevant places for that speaker (place of birth and residence of the speaker, and place of birth of both parents, see Fig. 6).

Fig. 5
figure 5

Pop-up window for the participant with the pseudonym Ivan

Fig. 6
figure 6

Birth place and residence of the participant (red), and birth place of both parents (green). (Color figure online)

We divided the metadata table content into three parts to allow for easier search: Speaker profile, Language/Language in map task, and Counts. Metadata in Speaker profile and in Language/Language in map task was taken from the questionnaire, while corpus counts in Counts were added after the normalisation step. Prior to the implementation, information about the speaker profile (birth place, birth year, parents’ origin, education, media usage, etc.) and about their language use (dialect, accommodation, language attitudes, reflection on language use in the map task, etc.) has been made consistentFootnote 26 and translated into English in order to fit with the main language of the web interface. In answers that refer to a 5-point Likert scale, “1” stands for “disagree/false”, and 5 stands for “agree/correct”.

The Counts-section comprises token and type count, type/token ratio,Footnote 27 speech rate, number of map task images that participants knew in BCMS, and absolute and relative counts of non-BCMS tokens in the transcripts, normalised tokens, pauses, hesitations and elongations.Footnote 28 For each corpus count we also implement the calculation of sum, average, and standard deviation of currently selected data (see Fig. 7), as well as all data in the set (in parentheses). The metadata table can be exported as a text file in CSV format.

Fig. 7
figure 7

An excerpt of the Counts section in the metadata table

4.2 Transcript view

The transcript view consists of the pictures of the original map task, the solved map task and the transcript of the conversation (see Fig. 8). Each turn is aligned with the audio segment and can be displayed on mouse click on the respective turn number. On click on the participant’s pseudonym the user gets redirected to the metadata view as shown in Fig. 5. To facilitate the reading of the transcript, we added a background colour to the segments that have been encoded in the normalisation step. Pauses are marked in parentheses: pauses shorter than 0.2 s are marked as (.), while longer pauses are reported in seconds, just like in the FOLKER output. Annotations created in the normalisation process can be viewed either separately or all together on click on Annotations button (see Fig. 9). They can be exported as CSV files.

Fig. 8
figure 8

Transcript view

Fig. 9
figure 9

Transcript annotations

The transcript view has been created with an XSL-stylesheet that can be retrieved at the corpus homepage (section Transcripts), and further edited for transforming own TEI-encoded transcripts of speech. We used the same minimally adapted XSL-stylesheet to create a HTML version of the Corpus of Serbian Forms of Address.Footnote 29 All transcripts are additionally made available for download in the start-page in following formats: TEI, FLK (raw FOLKER transcripts), FLN (normalised transcripts), TEI (tagged transcripts) and TextGrid (for the use with PraatFootnote 30).

By clicking on the button Annotate users can add their own custom annotations with the custom tag and colour by selecting one or more tokens (see Fig. 10).Footnote 31 Multi-word annotations are also supported. The annotations can be either exported as CSV and JSON files or saved directly in the user profile. Exported annotations in JSON format can be uploaded and edited in the Uploads page (see Sect. 4.3).

Fig. 10
figure 10

Custom annotations for the transcript GK_03_B

4.3 Custom annotations view (uploads)

Custom annotations are stored in the mongoDBFootnote 32 database and accessed via Python web framework Flask.Footnote 33 In the Uploads page, users can view their custom annotations and share them with other users. They can also upload and further edit their own annotations as well as shared annotations of other users by clicking on Show annotations, which redirects the users to the transcript view with custom annotations (see Fig. 11).

Fig. 11
figure 11

Uploads page

4.4 Search view

The search view allows for querying the word form (tokens), lemmas, universal parts of speech, morphosyntactic annotations, and by map task annotations added in the normalisation step (non-BCMS tokens, truncations, hesitations and acknowledgment tokens). We implemented simple as well as regular expression search for word forms. After user submits the query, the results are retrieved from the database and presented in the context of the whole turn (see Fig. 12). For each result, a link to the respective transcript is provided. The search results are available without registration, but users have to be registered to access the transcripts. Results can be exported in CSV format.

Fig. 12
figure 12

Search view: an excerpt of the hits for the search of all non-BCMS tokens

4.5 User flow diagram

The functionalities presented in prior sections are summarised in the user flow diagram in Fig. 13. The diagram shows the structure of the web-interface, the links between different web pages, as well as main functionalities provided on the corpus interface.

Fig. 13
figure 13

User flow diagram

4.6 Corpus access

Upon user registration, the corpus is accessible at https://maptask.slav.uzh.ch. For long-term deposit, we plan to store the corpus on CLARIN.SI. Current work, as well as demo code for transcript annotations is documented at the GitLab Repository of ZuCoSlaV corpora (Zurich Corpora of Slavic Varieties).Footnote 34 The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA).Footnote 35

5 Case study: heritage BCMS used by a pair of siblings

Although they are brought up in the same household, siblings can have great differences in heritage language proficiency (Aalberse & Muysken, 2013, p. 6). While Baker (2007, p. 56) argues that a younger child is simply integrated in the decisions about language that are already established in the family, Kheirkhah & Cekaite (2018) challenge that by claiming that siblings may have different experiences of the heritage language related to factors such as the age of migration, social environment and their aspirations. It is often argued that the eldest child speaks the heritage language most native like (Aalberse & Muysken, 2013; Jarovinskij, 1995; Shin, 2002; Wong Fillmore, 1991) since they receive more speech input from the parents. The school entry of the eldest child usually implies the introduction of the dominant language in the family. Commonly, younger siblings receive less input in their heritage language, which also impacts their active use of the heritage language. Siblings tend to interact among each other using the dominant language, i.e., the language spoken at school (Döpke, 1992). This has also been observed by Romić (2016, p. 200) regarding the second-generation heritage BCMS speakers in Germany, where only 6% of the survey participants (n = 103) report to always speak their heritage language with their siblings. Romić (2016) also observed that second-generation speakers speak BCMS commonly with other heritage speakers: 13% of them reported to speak BCMS with their friends from former Yugoslavia (which is more than they reported to speak BCMS with their siblings). Mayer & Lemmenmeier-Batinić (in preparation) observed that 28% of second-generation speakers living in Germany, Austria, and Switzerland report to always speak BCMS with their siblings, and 30% to always speak BCMS with people from former Yugoslavia (n = 175).Footnote 36

In the following sections, we use the BCMS Map Task corpus platform to conduct an exemplary case study of language use by a pair of siblings. After presenting the participants and corpus measurements, we analyse their use of lexical, morphosyntactic, and phonetic features.

Within the lexical domain, we study the use of lexical transfers from (Swiss-)German into BCMS (see Ščukanec, 2021, p. 263). An example for a lexical transfer from German to BCMS is anrufati (from anrufen ‘to ring sb.’, see Ščukanec et al., 2021, p. 118). Since map tasks are picture naming tasks, we expect to find frequent use of lexical transfers from Swiss–German. The adaptation of lexical transfers to BCMS morphosyntactic and phonetic systems is expected to be variable in second-generation speakers (see Ščukanec et al., 2021, pp. 125–126) and to be common among “more proficient heritage speakers” (see Brehmer, 2021, p. 33). In addition, we measure the lexical proficiency by counting the number of correctly named images in the map tasks, and comparing it to other participants’ results.

Regarding morphologic and syntactic features, we focus on verbal aspect and word order patterns, which are reported to be vulnerable domains for heritage speakers (see Brehmer, 2021, pp. 30–32; Polinsky & Kagan, 2007). Several studies report the loss or confusion of verbal aspects in Slavic heritage languages (see Hill, 2014, p. 2128), as well as on influence of dominant word order patterns of the majority languages on the heritage language (Hansen et al., 2013; Laskowski, 2014).

Regarding the domain of phonetics, we focus on the realisation of the alveolo-palatal affricate /dʑ/ (đ).Footnote 37 Given that this consonant is missing in Swiss–German, we expect deviations in its realisation (see Brehmer, 2021, p. 25). Since the devoicing of affricates has already been observed in Italian spoken by second-generation heritage speakers living in German-speaking Switzerland (see De Rosa & Schmid, 2000), we hypothesise that the speakers might have difficulties in distinguishing the alveolo-palatal /dʑ/ from its voiceless counterpart /tɕ/, which is also present in BCMS.

5.1 Participants

The speakers Marija* and Darko* were born in Chur and lived in Landquart at the time of recording (both towns are situated in the southeast of Switzerland). At the time of recording, Marija was 24, and Darko 22 years old. Maria is the eldest child in the family, Darko the second child, and they also have a younger brother (16 years old at the time of recording). Their parents are Croats from Bosnia and Herzegovina. Their father is born in Bukova Gora, where he spoke Ikavian, and their mother in Žepče, where she spoke Ijekavian (see Fig. 14). Marija attended university, and Darko has a high school diploma.

Fig. 14
figure 14

Places of parents' origin of the two siblings

5.2 Map task setting

Marija explained map task B to a male participant, Dragan*, whose parents originate from Svilajnac (Serbia), and whom she did know before the task. Darko explained map task A to a female participant Marina*, whose parents are Croats from Grabovčići (BiH) and Pokrajčići (BiH), and whom he already knew before. Hence, the setting is not identical, and the participants’ performance might vary depending on the interlocutor effect (see Son, 2016).Footnote 38 Nevertheless, since the two map tasks are comparable in content and difficulty level, we assess that the speech material provided in the conversations is adequate for a further analysis of their language use.

5.3 Language profiles

As shown in Table 2,Footnote 39 both siblings name the variant they spoke in the map task as Croatian and report to have spoken the Štokavian dialect with the Ikavian jat vowel reflex, the same variant they report to speak at home. As expected, they both report to frequently speak BCMS with their parents (4 on a scale from 1 to 5). Surprisingly Darko, the younger sibling, reports to speak BCMS with his siblings more often than Marija reports doing so (Darko: 4; Marija: 3). He also reports to speak more often BCMS with “people from former Yugoslavia” than Marija does (Darko: 5; Marija: 3). Darko specifies never to speak BCMS with his partner, while Marija leaves this question unanswered. As expected, both siblings report to mix their heritage language often with Swiss–German (Marija: 4; Darko: 5). It goes without saying that the scale is subject to personal interpretation (the participants might mean the same, but assess their answer as 4 or 5).

Table 2 Language profiles of Marija and Darko
Table 3 Quantitative information on the language used in map tasks

Regarding the experience in BCMS, both siblings report that they did not have classes in BCMS, but Marija self-evaluates her media consumption in BCMS as much higher than her brother does.Footnote 40 While Darko reports to go to their country of heritage more frequently than Marija does, according to her estimation, Marija has more regular contact with people in Bosnia and Herzegovina.

Both siblings absolutely agree with the statement that it's important for them to pass the heritage language on to the next generation. While Darko strongly disagrees with the statement that it’s important not to mix languages (e.g. Swiss–German–BCMS), Marija has less strong opinion on this matter and neither agrees nor disagrees.

In summary, the two participants show to have a meta-knowledge regarding the dialectal variant they speak, which is not always the case for heritage speakers (31% of the heritage speakers in the survey by Mayer & Lemmenmeier-Batinić [in preparation] report not to know which dialect they speak). According to their reports, they used the same variants they speak at home also in map task communication. Their use of BCMS in communication with their parents, siblings and other people is relatively in line with theoretical and empirical considerations on heritage speakers mentioned in Sect. 5. The analysis of the metadata showed that both siblings have close ties with their heritage country, that they use BCMS in their private and social life, and that they also aim to pass it to the next generation. In such, they are typical representatives of second-generation BCMS heritage speakers living in a German-speaking country (see Mayer & Lemmenmeier-Batinić, 2021).

5.4 Corpus measurements

In Table 3 the quantitative information regarding the language use in the map tasks is given. It shows that the two siblings have relatively similar language profiles. They even uttered nearly the same number of tokens in the map task conversations (605; 601), which is more than average (426.57). However, the quantitative information calculated from the normalised transcripts suggests that in comparison to an average participant the siblings used more non-standard words and more non-BCMS words. They both have a slightly faster speech rate than the average, and a lower number of pauses, hesitations and elongations. Regarding Darko, the latter is possibly due to the fact that he used more than twice as much Swiss–German words than an average participant, and hence spent less time hesitating and trying to find the correct BCMS expression. His self-evaluation regarding his frequent mixing of heritage and dominant languages is reflected in his frequent use of Swiss–German words that are directly inserted into BCMS turns, without being morphologically or phonetically adapted to it (see Sect. 5.5.1).

5.5 Analysis of language use

According to the observations shown in Sects. 5.15.4, we expect Marija to exhibit a higher level of proficiency in BCMS than Darko because she is the eldest child in the family, and because her self-evaluation on consuming BCMS media is much higher than her brother’s (see Sect. 5.3). However, the difference in heritage language proficiency between the two siblings is not expected to be large: on one hand, Darko reports to always speak BCMS with people from former Yugoslavia, and exhibits an overall similar language profile as Marija: in most questions they present none or only 1-point-difference on a Likert scale from 1 to 5 (see Table 2). On the other hand, their age difference is relatively small (2 years), so the difference in length of exposure to the heritage language is also small. Corpus measurements regarding the conversations by the two siblings draw a similar picture, with the most prominent difference being the higher relative number of non-BCMS tokens in Darko’s turns, which supports the assumption that Darko has a lower lexical proficiency in BCMS than Marija.

5.5.1 Lexical observations

The siblings showed difficulties in naming map task images in BCMS. They both named 4 out of 10 images correctly, which is lower than an average participant (the average number of correctly named images is 5.53).Footnote 41 However, while both siblings often referred to the images without naming the objects themselves (by just saying slika ‘image’), they used different strategies when confronted with difficulties in lexical retrieval. Darko mostly inserted Swiss–German lexical transfers directly in BCMS sentences without adapting them to BCMS grammatical system. For instance, the insertion schloss (‘lock’), which is used in instrumental, would have been inflected as schlossom (‘lock’-INS) in BCMS, see Example 2.Footnote 42

Example 2: Excerpt from the transcript GK_04_A (Swiss–German words are represented in italics)Footnote 43

[...] na höche malo ispod on-e slik-e sa schloss

    on height a little below that-GEN image-GEN with lock

‘[...] on the height a little bit below that image with the lock’

The use of morphologically non-integrated lexical transfers is consistent in Darko’s turns: out of 14 lexical transfers that require case marking in BCMS, the case is marked only in one of these, as Darko inflects the word schloss in genitive to schloss-a (‘lock’-GEN), as a masculine singular noun would be inflected in BCMS. Overall, Darko uttered 31 lexical transfers, which is more than any other participant. In contrast to her brother, Marija never used lexical transfers from Swiss–German, but she tried to explain the map task path with her own paraphrases in BCMS, as shown in Example 3, in which she paraphrases the word ‘paperclip’. In other cases, she used hypernyms for the words represented in the images (like bird for turkey or animal for sloth).

Example 3: Excerpt from the transcript GK_01_B (paraphrase in italics)

[...] i onda ideš .hh (0.45) prema doli (.) u sredinu između onog što drži listove i između šipka

‘[...] and then you go .hh (0.45) down (.) between the thing that holds the sheets together and the rose hip’

Darko also used hypernyms, but even in those cases he sometimes used Swiss–German words like frucht (‘fruit’) for beetroot. Marija used Swiss–German only in meta-comments about the map task, in which she expressed her difficulties in retrieving the right word, as shown in Example 4.

Example 4: Excerpt from the transcript GK_01_B (‘nothing comes to my mind’)

[...] i onda o tu ((smije se, 0.83s)) ideš (0.6) ehm (.) pored ehm ((nejasno)) (0.31) ehm (.) .hh (1.91) ono .hh ((smije se, 2.2s)) (0.62) ((smije se, 1.45s)) .hhh sorry aber miar fallt gar nüt i .h (0.29) ehm .h

‘[...] and then from there ((laughs, 0.83s)) you go (0.6) ehm (.) near ehm ((unclear)) (0.31) ehm (.) .hh (1.91) that .hh ((laughs, 2.2s)) (0.62) ((laughs, 1,45s)) .hhh sorry but nothing comes to my mind .h (0.29) ehm .h’

Non integrating other-language-items to a heritage language as Darko is the opposite of what is expected from a proficient heritage speaker (see Brehmer, 2021, p. 33). However, Marija’s use of avoidance strategies reveals difficulties in lexical retrieval as well.

5.5.2 Morphosyntactic observations

One recurring morphosyntactic deviation in Darko’s utterances is the consistent use of the conjugation dok (‘until’) with the imperfective present form of the verb biti ‘to be’ (dok si-PRS.2SG) instead of the negated perfective form budeš (dok ne-NEG budeš-PRS.2SG), which was encountered five times in the transcript (see Example 5).Footnote 44

Example 5: Excerpt from the transcript GK_04_A (dok si)

[…] dalje vuči dole dok si ispod te (1.23) ispod te äh (0.29) kutlače (.)

‘[…] draw further down until you are below that (1.23) below that äh (0.29) soup spoon (.)’

This use of verbal aspect and the missing negation is possibly due to the process of reduction, i.e., simplification of syntactic structures in heritage languages, comparable for instance to the loss of double negation in Swedish Polish (Laskowski, 2014, p. 119). Not to neglect is also a possible (Swiss-)German influence, since in (Swiss-)German there is no negation after the conjugation bis (‘until’) used in this sense.

Other participants also showed deviations in verbal aspect after subordinate sentences introduced with the conjunction dok: the search of all turns with dok (Search page) shows that out of a total of 20 occurrences, dok never occurred followed by a negated form, but always followed just by a perfective or imperfective form of the verb biti (budeš/si). The non-standard use of imperfective form in utterances such as (5) is not surprising given the loss or confusion of verbal aspects in Slavic heritage languages that was reported in several studies (see Hill, 2014, p. 2128).

In contrast, the placement of the verb in clause-final position (see Example 6) is not ungrammatical, but rather “perceived as somewhat odd” in standard BCMS (Hansen, 2018, p. 5). This phenomenon occurred only twice in Darko’s transcript, so it is hard to talk about a (deviant) pattern.

Example 6: Excerpt from the transcript GK_04_A (verb position)

[...] ali na desn-u stran-u trea-š crt-u vuč

    but on right-ACC side-ACC you need-PRS.2SG line-ACC to draw-INF

‘[...] but you have to draw the line to the right’

We detected no recurring patterns regarding deviations in verbal aspect and word order in Marija’s map task conversation. However, it has to be noted that Marija never used the construction with dok (‘until’), so we cannot know how she would use the verbal aspect in utterances like (5).

Regarding the detection of morphosyntactic patterns in the corpus, pre-annotated features such as non-standard words can be helpful in case when a particular word is non-standard because of a non-standard morphology (for instance, do donj-og-GEN.SG.M., instead of do donj-eg-GEN.SG.M., ‘until the lower’). Syntactic deviations such as word order are however not pre-annotated and have to be searched either by examining the transcript in the transcript view or by querying the transcripts on relevant phenomena in the search page.

5.5.3 Phonetic observations

As expected, both speakers showed deviations in the realisation of /dʑ/. However, the siblings showed no difficulties in distinguishing /dʑ/ from its voiceless counterpart /tɕ/, as it was hypothesised. Instead, they often uttered a less palatal sound, which can be either described as post-alveolar /dʒ/ (dž), or falling into the spectrum between /dʑ/ and /dʒ/.Footnote 45 Different realisations repeatedly occurred within the same word između ‘between’, as in Examples 7 and 8, where između is first realised as [izmedʒu] and than as [izmedʑu].

Example 7: Excerpt from the transcript GK_01_B ()

[…] i onda .h (.) ideš prema desno gori (.) do sredinu između [izmedʒu] (.) kruha i one ptice

‘[…] and then you go up to the right to the middle between the bread and that bird’

Example 8: Excerpt from the transcript GK_01_B ()

[…] i onda (0.21) do s u sredinu tamo između [izmedʑu] kruha i te ptice

‘and then to the middle between the bread and that bird’

Interestingly, the tendency to realise post-alveolar /dʒ/ instead of alveolo-palatal /dʑ/ cannot be explained as a direct influence from Swiss–German, since both sounds are absent in Swiss–German phoneme inventory. While in homeland BCMS the difference between the affricates /dʒ/ and /dʑ/, is disappearing in some regions as well, the tendency there is rather to replace /dʒ/ by /dʑ/ (see Peco, 2000; Halilović et al., 2009, p. 117).Footnote 46 This development is hence worth further investigation, also because different deviations in realisation of /dʑ/ have been evidenced by other speakers as well.Footnote 47

For finding and evaluating the use of the alveolo-palatal /dʑ/, we used the Search page (with which we could find the segments by using Regex search) and the transcript view for evaluating their realisation and creating custom annotations. We observed a relatively high number of unclear cases that would require an elaborate phonetic analysis in a further work. The difficulty in discriminating between different phonetic units is due not only to the type of segments (which are often difficult to distinguish), but also to the fact that individual words cannot be heard in a loop, since the corpus is not word but turn-aligned.

5.5.4 About the dialect use: Ikavian and (Ij)ekavian variants

Since the siblings were one of the few participants who reported to have spoken a dialect (Ikavian), we additionally analysed their use of dialect features in conversations in which they took the role of giving indications. We focused on detecting whether their use of Ikavian jat reflexes was consistent throughout the transcript, and whether there are differences between the two in using them. For finding Ikavian words we first selected all normalised words (in transcript view: Annotations/Non-standard/spoken) and then annotated the Ikavian words amongst them by adding custom annotations (in transcript view: Annotate). We also annotated their non-Ikavian counterparts as well as other non-Ikavian variants whenever they were present in the transcript with a different tag. Then, we exported the annotations and created a comparison table (Table 4).Footnote 48

Table 4 Frequency of use of words with Ikavian (i) and (Ij)ekavian (ije/je/e) jat reflex by Marija (map task: GK_01_B) and Darko (map task: GK_04_A)

As it can be observed in Table 4, both siblings used not only Ikavian, but also (Ij)ekavian variants in map task conversations. In comparison, an Ikavian participant originating from Croatia (Ante*) was consistent in using only Ikavian features throughout the transcript. Possibly, the siblings used Ijekavian features next to Ikavian because their mother originates from Žepče, where Ijekavian is spoken,Footnote 49 and they might already use the “mixed” variety at home.Footnote 50 However, it could also be the case that they tried to avoid exclusively Ikavian features because this variety is not encoded in any BCMS standard language and it has low prestige in Bosnia and Herzegovina, in contrast to Ijekavian.Footnote 51 For that reason, speakers might have adapted to their non-Ikavian interlocutors by avoiding the forms that are marked dialectally or geographically (see levelling in Aalberse & Muysken, 2013, pp. 8–9). The use of different jat reflexes could also be due to the difference of treatment of long and short jat reflexes: the long jat reflex in lijevo is consistently realised as ije, while reflexes in other words that contain short jat are less consistent.Footnote 52 Other examples of mixing variants of jat vowel reflex by a BCMS heritage speaker have been observed by Hansen et al., (2013, p. 24), and would be an interesting subject for further research of dialect levelling.

5.5.5 Summary

In this section we illustrated how the corpus can be used on the example of an investigation of similarities and differences that the siblings Marija and Darko show regarding lexical, morphosyntactic and phonetic peculiarities, as well as dialectal variants. We found no major differences between the two siblings regarding the features we investigated. The siblings exhibited great difficulties in lexical retrieval, showed the same type of deviation in realisation of the phonetic segment /dʑ/, and they both mixed dialectal variants with standard language jat reflex variants. Darko, the younger sibling, additionally showed morphosyntactic deviations from the standard norm that were absent in Marija’s utterances. However, since Marija did not use the same constructions in her indications (with dok ‘until’), this difference might just be due to chance. While this comparison is only of exemplary nature, and an overall assessment of heritage language proficiency would require analysis of more features and more data, we assess that in our case study no sufficient evidence was found that Marija, the older sibling, is more proficient in BCMS than her younger brother. Possibly, this is due to the small age difference of only 2 years.

While analysing the data, the corpus platform showed to be most useful for lexical searches, searches of pre-annotated data (non-BCMS segments, normalised segments, etc.), as well as for creating and sharing custom annotations. The latter was particularly helpful for categorising phonetic peculiarities among multiple raters. The possibility to access detailed metadata about speakers also proved relevant for the interpretation of the findings regarding phonetic and dialectal variants. The corpus platform showed to be useful for enabling access to pre-annotated (non-BCMS) and structured data that can be further elaborated. However, like in any, especially small corpus, pre-annotated as well as custom observations regarding the quantitative distribution of linguistic features should be treated with caution. For instance, while some speakers, like Marija, do not show any recurrent deviations in verbal aspect, it does not necessarily mean that they have an excellent command of it, but that they either didn’t had to use particular structures in their conversation, or that they used other structures (or avoiding strategies) instead. Similarly, speakers that used (Swiss-)German words in their map tasks are not necessarily less proficient in BCMS lexicon, but they possibly choose to deliberately use (Swiss-)German words as a strategy for more fluent speech.

6 Discussion

The BCMS map task corpus offers interactive and collaborative work on the corpus, and a possibility to download different versions of the transcripts, metadata and (pre-)annotated data on local machines, where users can profit from other software and processing options. As such, this corpus represents a compromise between web-(only)-based environments, advocated by Kemps-Snijders et al. (2008), and the “download first, then process” paradigm. Regarding the possibility of accessing corpus data after querying quantitative information, the metadata filter in this corpus is comparable to those of spoken language tools such as Lexical ExplorerFootnote 53 and ZuMal.Footnote 54 Being able to explore quantitative data for each transcript at a glance helps users to choose transcripts to examine, and it can point out speakers’ individual tendencies regarding particular linguistic and paralinguistic features. Further work on the transcripts is facilitated by the possibility to add, export and share custom annotations.

While the map task corpus platform is created for one specific corpus, its modular structure is adaptable to other resources that have the same transcript or metadata structure. Another advantage of this corpus platform is its simplicity, which is the “key issue in spreading corpus use in and beyond the research community” (von Waldenfels & Woźniak, 2017, p. 156). As observed by Fandrych et al. (2016), working with corpora of spoken language in an online setting is often challenging for students, researchers, and teachers. Since typical users of spoken language corpora have high expectations regarding platform usability, but they do not want to invest much time into learning to use new software or techniques (ibid.), we implemented simple querying and annotation techniques, which demand no additional training.

The case study showed how the corpus interface can be used for a linguistic study of heritage BCMS. The pre-selection of metadata, as well as the possibility to work upon pre-annotated features proved to be very useful for the aims of our investigation, especially regarding analysis of lexical peculiarities. The most challenging task was the annotation of phonetical deviations. This is mostly due to the fact that the corpus is turn-aligned, while it should ideally be word-aligned, in order to permit repetitive hearing of the same word (or multi-word segment).

Needless to say, the semi-orthographic transcription used in this corpus reflects transcribers’ subjective interpretations that might differ from the interpretations of other users. Since the transcripts are short, it might also lead to a slight over- or underrepresentation of features such as number of pauses and elongations in corpus counts section. However, we consider pronunciation-based transcription and the annotation of non-vocal segments and pauses a very good compromise between readability of the transcript and faithfulness to the original source.

Another limitation of the corpus is given by the type of data: while map tasks are well adapted for pilot studies, users must bear in mind that they represent an experimental situation which is (deliberately) thematically narrow. Besides, map task conversations contain for the most part encodings of directions that are embedded in a situation which is likely to lead to unusual collocations such as “ideš do büroklammera” (‘you go towards the paperclip’). Despite these challenges, the map task corpus of heritage BCMS represents a valuable insight not only into the language contact between the heritage and the majority language, but also between different BCMS varieties among themselves. The corpus can be used, for instance, as a resource for studying phonetic peculiarities found in other Slavic heritage languages, such as distinguishing contrasting pairs like palatalised and non-palatalised consonants (Sussex, 1993, p. 1014), aspiration of initial voiceless stops (see Brehmer & Kurbangulova, 2017), and the replacement of Slavic [ł] by a “less velar” or “clearer” [l] (see Sussex, 1993, p. 1013; Recasens & Espinosa, 2005). It can also be used for investigating disfluency patterns (Yılmaz & Özsoy, 2020), and discourse markers used by heritage speakers (see Hlavac, 2006), as well as for studying language accommodation to other speakers in general, and to speakers of other BCMS varieties in particular (Giles & Ogay, 2007; Ljubešić et al., 2019). Teachers and students of BCMS in the diaspora can profit from the text-audio alignment, and the possibility to annotate the transcripts and share their results with their peers. The pre-selection of transcripts according to metadata such as speaker’s origin or the proportion of normalised tokens can help teachers choose appropriate transcripts for didactic use in the classroom.

In future work, we aim to further develop the custom annotation functionality, perform word-alignment by using forced alignment techniques, and include a quality score based on a measure of overlap between original and drawn route. We also aim to use the technologies and workflows developed for this corpus platform to enhance the work on corpora of heritage Albanian we are currently working on.