Design and construction of Guayaquil radio speech corpus (CHARG)

The present paper aims to describe the process of creating CHARG—Corpus de Habla Radiofónica de Guayaquil (the Guayaquil Radiophonic Speech Corpus). It is the first systematized spoken corpus for this rather under-researched variety of Spanish. Guayaquil is the most populated city of Ecuador, while its capital city is Quito. Therefore, Ecuador is a rare case of a Spanish-speaking country with two major urban centers that belong to two separate dialectal zones, offering a very peculiar sociolinguistic context. CHARG is a corpus composed of Guayaquil radio programs. Its structure is organized by non-linguistic criteria (program type) in order to ensure a representative and balanced sample. The paper describes the design of the corpus (defining the study population, sample and stratification) and its construction (recording procedure, speakers and speech style coding, transcription and annotation). As a result, CHARG consists of 24 h of transcriptions and annotations of recordings from 142 speakers. The paper’s potential use is twofold: since it presents a step-by-step procedure of corpus construction that can be replicated, the readers might be interested in both the procedure and the corpus itself as a research material.


Introduction
Designing a spoken corpus is a challenging task for a number of reasons. The difficulties a researcher faces can broadly be grouped into two categories. The first includes a variety of dilemmas regarding the corpus's construction; the second comprises those related to the storage of files. While the latter is a rather technical issue that can be solved relatively easily in a number of ways, the former is in fact a set of possible problems concerning issues such as sound quality and database organization. This paper aims to describe the design and construction of CHARG (the Guayaquil Radiophonic Speech Corpus), and to address questions regarding the structure of a corpus based on radio programs.
Undoubtedly, the use of corpora is a well-established approach in contemporary linguistics. The most widely spoken languages, such as Spanish, are represented in a great variety of such resources, including both written and spoken corpora. The latter consist of transcriptions of spoken language and their annotation. Such databases are of great value for many subfields of linguistics, such as phonetics, discourse analysis, and dialectology.
Despite the increasing number of Spanish corpora, many varieties still lack one, including the Spanish spoken in Guayaquil, Ecuador. The main goal of this paper is to describe the design of CHARG, the Guayaquil Radiophonic Speech Corpus (Corpus del Habla Radiofónica de Guayaquil). While constructing an analyzable and representative corpus, we encountered numerous methodological obstacles. The following sections report on the design of CHARG, the tasks tackled and the final product obtained.
CHARG is the first spoken corpus of Guayaquil Spanish, one of the two main varieties of Spanish spoken in Ecuador. The Guayaquil variety of Spanish lacks systematic linguistic studies. As one of Ecuador's two most important urban centers, Guayaquil deserves a deeper insight into its speech, especially because of the unique sociolinguistic background of the variety: Ecuador is a rare case of a Spanish-speaking country with two major, almost equally populated, urban centers belonging to two separate dialectal zones. Traditionally, Ecuador is divided into four main regions: the highlands (Sierra), the coast (Costa), the Amazon (Oriente) and the Galapagos Islands. Most of the country's population lives in the highlands and on the coast (INEC, 2019), whose capital cities are respectively Quito and Guayaquil, Ecuador's two main urban centers. These two regions differ significantly in various aspects: climate, geography, demography, history, etc. Both the physical distance between Quito and Guayaquil and the topography were factors that prevented a merger. The regions are divided by a natural border, the Andes range, and no navigable river connects them, which in the past made it extremely difficult or even impossible to travel between the cities (Toscano, 1953, p. 15). Quito was proclaimed the capital city, becoming Ecuador's administrative and cultural center, while Guayaquil, to this day one of the most important Pacific harbors, plays a critical role in the country's economic development (Estrella, 2009, p. 48, 49). Nowadays, the two cities have a similar population (INEC, 2019) and the antagonism between their inhabitants is still perceptible. Due to the long-lasting separation and scarce migration, the varieties of Spanish spoken on the coast and in the highlands are considered two separate dialects, since the features of the former correspond to the innovative or intermediate Spanish varieties (e.g. because of consonant reduction) and those of the latter to the conservative ones (e.g. because of vowel reduction). The linguistic differences are clearly perceived by the speakers. While the variety spoken in Quito is the norm for the highlanders, the coastal linguistic model is the one used in Guayaquil.
Research and resources on Ecuadorian Spanish are scarce. The most complete work, El español en el Ecuador by Humberto Toscano Mateus, dates back to 1953. Although some studies have been carried out since then (Boyd-Bowman, 1953; Estrella, 2001, 2009; Haboud, 2004, 2005; Palacios, 2005; Flores, 2014; Sessarego, 2013; Strycharczuk et al., 2014), they are not abundant compared to those on many other varieties of Spanish (such as Colombian, Mexican, Cuban, Dominican, Argentinean or Chilean), and most of them focus on the highland variety. To our knowledge, there is only one published study specifically about Guayas Spanish (Estrella, 2009). Hispanic America is a strongly urban region; in the case of Ecuador, 62.7% of the population lives in cities. It is the cities that induce linguistic change (Alvar, 1990, p. 63) and indicate the prestigious linguistic norm.
Radio speech was selected for the corpus for a number of reasons. The first and most pragmatic one is that it can be fairly easily accessed from any place in the world. Secondly, previous studies have shown the radio to be a useful source of data for linguistic studies of Spanish. Some works of interest include the study by Pérez (2007) on /s/-weakening on Chilean radio and the papers by Flores (2016, 2017), where the material gathered from radio programs was used as a basis for acoustic measurements of velar palatalization. Less recent works worth mentioning are those concerning Dominican Spanish by Alba (2000) and those focusing on the linguistic norm by Ávila (2003). Moreover, mass media help to overcome the barrier of illiteracy and have an important ideological impact on their audience, also at the linguistic level. The language used on the radio tends to become a model and can simultaneously reflect a prestigious local variety. In this way, radio speakers provide valuable data about what is considered correct by a speech community. Radio speech also exhibits style and register shifts, as well as a variety of accommodative strategies applied by the speakers.
Finally, there are some features that distinguish the radio from other mass media. Due to its competition with television, the radio has to compensate for its technological inferiority. In many cases, this is done by means of a meticulous selection of content. Furthermore, since a radio set is more accessible than a TV, the radio has more coverage and more presence in its listeners' everyday life. It is also easier to access as it only engages one sense (McQuail, 2005, p. 36). According to a survey carried out by the Instituto Nacional de Estadística y Censos in 2012 (INEC, 2012), the radio is one of the most consumed mass media in Ecuador among the population aged 12 or more (Gehrke et al., 2016, p. 20), and 87.6% of Ecuadorians own a radio set. The surveyed population spends 5 h 39 min per week listening to the radio. The time spent on this activity is equal for the urban and the rural population (INEC, 2012). Although 98% claim to watch television regularly, the radio has greater geographic coverage and is listened to by as many as 83% of Ecuadorians (Gehrke et al., 2016, p. 20). The 2012 Estudio de Hábitos del Consumidor de Radio en Quito y Guayaquil (cf. Piedra, 2015, p. 32, 33) reveals that there are more than two radio sets per household and that, on average, radio is listened to for 2 h 47 min daily. Moreover, 55% of Ecuadorians listen to the radio every day, especially at home (53%), but also at work (28%) or in means of transport, both private and public (17%).
Each of the principal cities in Ecuador (Guayaquil and Quito) holds its own media, meaning that although most of the stations have a national coverage, their profile is local. According to the data from 2015 (Gehrke et al., 2016, p. 15), the number of radio stations was slightly higher in the Guayas region, with the capital city of Guayaquil, than in the region of Pichincha, with the capital city of Quito (95 and 89 respectively). The most popular radio stations in Guayaquil have an audience of almost 50,000 listeners and this index has remained stable in the past few years (Piedra, 2015, p. 25).
The numbers presented above suggest that the language spoken by radio anchors is present in the lives of the Guayaquil population and thus might have an important impact on their speech. Consequently, a study on the language of the radio can give us some insight into the linguistic variety of the region.
The main contribution of the paper is a detailed description of the corpus construction procedure. It can serve as a guideline for a researcher interested in designing a linguistic corpus based on mass media material. However, the corpus itself is also a working tool that enables different kinds of research, including e.g. phonetic, syntactic or discourse studies. It is the first corpus of spoken Guayaquil Spanish variety; therefore, its analysis on any linguistic level would be a step in filling the gap in the current state of knowledge about the dialect at hand. The corpus is also open to further annotation by researchers interested in particular issues of Guayaquil Spanish.

Designing the corpus
In order to establish a balanced study sample, researchers take different approaches, which depend heavily on the research question. One solution is to establish a representative list of speakers and obtain an equal recording time for each of them; another is to collect a number of utterances that contain a particular linguistic phenomenon. Neither of these approaches seemed appropriate for CHARG: although its annotation was focused on a specific phonetic aim, the goal was to create a universal corpus for future purposes. As such, it had to reflect the whole linguistic content to which radio listeners are exposed. The challenge was then to ensure its representativeness, meaning that all kinds of Guayaquil radio programs and all types of its speech are evidenced in the corpus, and its balance, meaning that all these kinds of programs and types of speech are represented proportionally. The structure of CHARG is inspired by the method developed in the DIES-RTP project (Difusión Internacional del Español a través de Radio, Televisión y Prensa), led by Raúl Ávila from El Colegio de México since 1989. The project has brought together researchers from various centers for Hispanic studies all around the world (López González, 2001, p. 3). Its aim is to study the vocabulary, phonology and syntax of the varieties of Spanish broadcast by different types of mass media in order to describe local norms and a general Hispanic norm (López González, 2001, p. 4). The project gathers random homogeneous samples of different types of radio and television programs, and of press texts (López González, 2001, p. 73). One of the studies carried out within DIES-RTP and following its methods was the study of the radiophonic language of Almería (López González, 2001). Since its approach is similar to the one described in the present paper, the structure design of CHARG draws heavily on its instructions.

Study population
The representativeness of any textual study stems from the definition of the "population" that the sample is expected to represent (Biber, 1993, p. 243). In the case of text corpora defined in terms of registers and genres rather than demographic criteria, it is often challenging to determine the population limits. CHARG applies the term "radiophonic universe", which means the total number of potential objects of study (López González, 2001, p. 81; López Morales, 1994, p. 41, 42). Consequently, the absolute universe of our study is all of the content broadcast by the radio stations in Guayaquil in a predetermined period of time. Since this amount of data would be excessively large, a sample, called "relative universe", was established as the object of the study. The relative universe is a representative selection of the most popular local radio stations, in an amount that makes further elaboration and research feasible.
Eight popular radio stations were selected, both AM (Amplitude Modulation) and FM (Frequency Modulation), on the basis of the data provided for the year 2018 by Mercapro, an Ecuadorian company specializing in market research and public opinion polling. The selection was supported by social media statistics (Facebook, Twitter, and Instagram). Moreover, the radio stations had to meet some additional criteria: to offer a stable schedule, preferably on their websites, and to broadcast from the city of Guayaquil. Stations broadcasting exclusively musical content were also rejected, since they would not provide any data useful for linguistic research.
Eight radio stations were selected for the project following the criteria described above. In 2018, the radio stations had a mean audience between 7648 (i99) and 34,604 (Cristal) listeners, according to the report provided by Mercapro. However, the FM stations had a higher number of followers on social media (the AM stations, such as Radio Morena, have a long history and boast a loyal audience, but are not as popular on the Internet, probably due to the listeners' average age). The numbers of followers are listed in Table 1. All of the radio stations selected, except for Radio Diblú, have a general thematic profile. Radio Diblú offers sports programming exclusively, but since it is one of the most popular radio stations in Guayaquil, it was included in the corpus.

Sampling and stratification
In order to define the study sample, proportionate stratified random sampling was chosen (Herrera Soler et al., 2011, p. 30), since stratified samples are considered more representative than non-stratified ones (Biber, 1993, p. 244). The sampling consisted of three steps: calculating the total broadcast time, the proportional recording time for each radio station, and the recording time for each program and each of six program categories (henceforth referred to as "strata") within each station's programming. This section reports on the details of this process. (The tables, figures and graphs were created by the author.)
In the calculation of the total weekly broadcast time for each radio station, prerecorded musical nighttime programming was excluded, as were weekends, since the programming for Saturdays and Sundays is irregular or consists of content that would not be valuable for this research (repeat broadcasts of Monday to Friday programs, soccer match live broadcasts, holy mass live broadcasts, presidential speeches, etc.). On the whole, the calculated broadcast time from Monday to Friday for the eight radio stations was 513 h 20 min. This number represents the relative universe of the study, which was taken as the basis for calculating the sample size. The broadcast time to be included in the corpus was set at a total of 24 h of recordings, which gave 3 h per station on average and was equal to approximately 4.68% of the relative universe defined for the study (24 h out of 513 h 20 min). The total recording time for each station is presented in Table 2.
Once the population limits and the sample were established, the hierarchical structure of the corpus was designed. As mentioned before, the sample was stratified proportionally, and the strata were equal to the program types determined in the Organic Law of Communication classification (Ley Orgánica de Comunicación, Asamblea Nacional 2013, p. 12). Article 60 of the LOC obliges all the radio stations, TV channels and the press in Ecuador, both public and private, to tag their content with at least one of the six following categories: I: Informative (informativos); O: Opinion (de opinión); E: Entertainment (entretenimiento); F: Educational/cultural (formativos/educativos/culturales); D: Sports (deportivos); P: Advertisement (publicitarios). As for the radio, the information about the content type has to be explicitly stated at the beginning of the show. The categories listed above served as strata in the corpus.
Since the programming of each radio station is modified cyclically, the period of time for the calculations and for the recordings was limited to August and September 2018 (except for one program of Radio Centro, After Office, which had to be recorded in December 2018 and January 2019 due to technical issues).
The recording times were distributed proportionally, meaning that a proportional duration of time was assigned to each program and, as a consequence, all programs are included in the same proportion they had in the original total broadcasting time. Table 3 presents the calculations for the informative programs from Radio Morena. The total weekly broadcasting time (Monday to Friday) of Radio Morena was 66 h and 44 min, which makes 13% of the total broadcasting time (513 h 20 min). 13% of the corpus total duration (24 h) is 3 h 3 min 44 s, the Radio Morena proportional recording time. After summing up each program's weekly broadcasting time and calculating 13% of each, we obtained the proportional recording times for the corpus.
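The proportional distribution described above can be illustrated with a short calculation. The following is only a sketch: the function name and the rounding to whole seconds are assumptions made for the example, not part of the original procedure, and the output need not match the published figures exactly, since the paper does not state how intermediate values were rounded or counted.

```python
from datetime import timedelta

# Figures quoted in the text.
RELATIVE_UNIVERSE = timedelta(hours=513, minutes=20)  # Mon-Fri broadcast time, eight stations
CORPUS_DURATION = timedelta(hours=24)                 # target size of the corpus


def proportional_time(broadcast: timedelta,
                      universe: timedelta = RELATIVE_UNIVERSE,
                      corpus: timedelta = CORPUS_DURATION) -> timedelta:
    """Recording time for a station or program, proportional to its broadcast time.

    Hypothetical helper: rounding to whole seconds is an assumption.
    """
    share = broadcast / universe  # dividing two timedeltas yields a float
    return timedelta(seconds=round(share * corpus.total_seconds()))


# Radio Morena's weekly broadcast time as quoted in the text.
morena = timedelta(hours=66, minutes=44)
print(f"share of the relative universe: {morena / RELATIVE_UNIVERSE:.1%}")
print(f"proportional recording time:    {proportional_time(morena)}")
```

The same function can be applied one level down, to each program within a station, because the proportion is always taken against the same relative universe.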
Table 3. Example of broadcast times and proportional recording times calculation, Radio Morena
The broadcast time of the programs tagged with two or more strata was divided by the number of tags. Every effort was made to ensure that no file exceeded 10 min in duration, in order to guarantee maximum diversity of the content. The excerpts from each program were selected randomly, avoiding the first and the last five minutes of the show.
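The two rules in this paragraph (splitting time among tags and drawing random excerpts away from the edges of a show) can be sketched as follows. The function names, the use of minutes as the unit, the uniform random draw and the treatment of the 10-minute limit as a hard cap are all assumptions made for illustration; the paper does not describe how the random selection was implemented.

```python
from __future__ import annotations

import random


def time_per_stratum(broadcast_min: float, tags: str) -> dict[str, float]:
    """Split a program's broadcast time evenly among its strata tags, e.g. "IO"."""
    return {tag: broadcast_min / len(tags) for tag in tags}


def pick_excerpt(program_min: float, excerpt_min: float,
                 margin_min: float = 5.0,
                 rng: random.Random | None = None) -> tuple[float, float]:
    """Return (start, end) in minutes for a random excerpt that avoids
    the first and last `margin_min` minutes of the show."""
    if excerpt_min > 10:
        # Assumption: the 10-minute limit is enforced strictly here.
        raise ValueError("excerpts are capped at 10 minutes")
    latest_start = program_min - margin_min - excerpt_min
    if latest_start < margin_min:
        raise ValueError("program too short for an excerpt of this length")
    rng = rng or random.Random()
    start = rng.uniform(margin_min, latest_start)
    return start, start + excerpt_min


# A hypothetical 60-minute program tagged as both Informative and Opinion.
print(time_per_stratum(60, "IO"))
print(pick_excerpt(program_min=60, excerpt_min=8))
```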

Recording procedure
All of the recordings were collected from publicly available online streams and supplemental podcasts. The online broadcasts were recorded in the .wav format using Foobar2000, ver. 1.4 (Pawlowski et al., 2018) at a 44.1 kHz sampling rate. The podcasts were downloaded and converted into .wav with Audacity, ver. 2.2.2 (Audacity Team, 2019), also at a 44.1 kHz sampling frequency. The .wav format was chosen since it is one of the standard sound formats readable by Praat (Boersma & Weenink, 2017). The programs were recorded under Windows 10 on a laptop with an Intel Core i5-8250U CPU.

Speakers
Once the recordings were gathered, we proceeded to collect the speakers' sociodemographic metadata and the speech style metadata. The speakers had to meet three conditions:
• Be at least 18 years old,
• Be native residents of Guayaquil,
• Have higher education.
The speakers were also tagged as male or female.
Taking into account the age structure of Ecuadorian society, the speakers were divided into three age groups: 18-30 years old (Z), 30-50 years old (Y) and > 50 years old (X). The youngest group was composed of speakers at the stage of individual formation and at the beginning of their professional career. Generation Y comprised speakers independent of their parents and at the height of their professional activity. Group X comprised speakers at the stage of professional maturity or close to retirement (Moreno Fernández, 2009, p. 51).
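The age coding can be expressed as a small function. This is only a sketch: the function name is hypothetical, and because the ranges as stated share the boundary value 30, assigning a 30-year-old to group Z is an assumption made here.

```python
def age_group(age: int) -> str:
    """Map a speaker's age to the group codes used in the corpus (Z, Y, X)."""
    if age < 18:
        raise ValueError("speakers under 18 do not meet the inclusion criteria")
    if age <= 30:  # assumption: the shared boundary (30) falls into group Z
        return "Z"
    if age <= 50:
        return "Y"
    return "X"


print([age_group(a) for a in (24, 41, 63)])
```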
The second condition served to guarantee that the speakers used the language proper to the geographic variety in question. Speakers who reported a long-term stay abroad in their curriculum (n = 10) were excluded and replaced with those who met the criteria.
The last condition allowed the exclusion of the diastratic variable from the study. It would be impossible to obtain a socially stratified sample in a homogeneous professional group of speakers. Therefore, the study was limited to the language spoken by one socioeconomic stratum in order to obtain information about the preferred linguistic usage in the educated higher class of the population in question. The Organic Law of Communication (LOC, 2013) assumes the professionalization of journalists in Ecuador. Consequently, according to Article 42 (Asamblea Nacional, 2013, p. 9), until the year 2019, every person who carried out journalistic activity had to obtain a university degree in Journalism ("Ciencias de la Comunicación"). Since the present study was carried out in 2018, it can be assumed that most of the speakers had obtained the degree by then or were about to obtain it. Since most of the speakers are public figures, the biographic information used in the codification process was obtained from the radio stations' websites, from the speakers' professional social media profiles, from online articles and from the programs' content itself.

Speech style
The last metadata codified in the audio files was the speech style. Although the styles or registers are defined upon non-linguistic criteria, there are fundamental linguistic differences between them and the linguistic features within a style or register are relatively stable (Biber, 1993, p. 245). Four categories were defined, depending on the level of formality: a: texts read aloud (news, documentaries, etc.), b: guided monologues (e.g. commentaries on news), c: interviews, d: talks and debates.
Some journalism genres feature more dialectal variation than others (Bell, 1991, p. 77), as they can be more susceptible either to response or to initiative in terms of speakers' accommodative strategies in relation to their audience: the more formal the format is, the more "depersonalized" it gets, while less formal formats, such as advertisement, leave more linguistic flexibility to the speaker. The distinction is rather gradual; nonetheless, it can be assumed that news programs are extremely responsive and this is where the least dialectal variation occurs, since they tend to maintain a neutral pronunciation, accessible to a wide audience. At the other extreme of the axis there are publicity, sports and entertainment programs. In their case, the host attempts to earn the audience's trust, which can be done through language accommodation. This feature is stated by Lipski (1985) for the radio in the Spanish-speaking regions of America.
To sum up, taking into account all elements described in the Methods section, each recording file was named according to the following format: Speaker's ID_gender_age group_radio station's name_program type_speech style_program title_DDMMYYYY_hour.wav. A speaker's shortest possible ID consists of five characters, the first two taken from the first name and the next three from the first surname, so that no two IDs coincide. An example of a complete file name would be: CLREY_f_Y_Centro_F_b_LaLupa_22082018_0430.wav.
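Because all metadata is encoded in the file name, it can be recovered programmatically. The parser below is a minimal sketch under several assumptions that the paper does not confirm: fields never contain an underscore, each file carries a single program-type code, and "m" is the counterpart of the attested gender code "f". The class and function names are hypothetical. The final "hour" field is kept as a plain string because its exact format is not spelled out in the text.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Codes taken from the text (strata and speech styles).
PROGRAM_TYPES = {"I": "informative", "O": "opinion", "E": "entertainment",
                 "F": "educational/cultural", "D": "sports", "P": "advertisement"}
SPEECH_STYLES = {"a": "read aloud", "b": "guided monologue",
                 "c": "interview", "d": "talk/debate"}


@dataclass(frozen=True)
class Recording:
    speaker_id: str
    gender: str
    age_group: str
    station: str
    program_type: str
    speech_style: str
    title: str
    day: date
    hour: str  # kept as written in the file name


def parse_name(filename: str) -> Recording:
    """Split a CHARG-style file name into its metadata fields."""
    parts = filename.removesuffix(".wav").split("_")
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}: {filename}")
    spk, gender, age, station, ptype, style, title, day, hour = parts
    if gender not in {"f", "m"}:  # "m" is an assumed code
        raise ValueError(f"unknown gender code: {gender}")
    if age not in {"X", "Y", "Z"}:
        raise ValueError(f"unknown age group: {age}")
    if ptype not in PROGRAM_TYPES or style not in SPEECH_STYLES:
        raise ValueError(f"unknown program type or speech style: {ptype}, {style}")
    parsed_day = datetime.strptime(day, "%d%m%Y").date()
    return Recording(spk, gender, age, station, ptype, style, title, parsed_day, hour)


rec = parse_name("CLREY_f_Y_Centro_F_b_LaLupa_22082018_0430.wav")
print(rec.speaker_id, rec.station, PROGRAM_TYPES[rec.program_type],
      SPEECH_STYLES[rec.speech_style], rec.day.isoformat())
```

Parsing every file name in this way gives a metadata table that can be used, for example, to check the duration totals per stratum, age group or speech style.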

Corpus transcription and annotation
All the utterances produced by the speakers were transcribed orthographically in Annotation Pro software (Klessa et al., 2013). The software was selected due to its user-friendly and intuitive interface. Figure 1 illustrates a fragment of orthographic transcription carried out in Annotation Pro. As the next step, the transcriptions were converted into TextGrid format for subsequent analysis in Praat (Boersma & Weenink, 2017). The automatic alignment was carried out with the EasyAlign plug-in (Goldman, 2011). The alignment consisted of two steps: grapheme-to-phoneme conversion, and then word, syllable and phone segmentation. The phonetic transcription was rendered in the SAMPA alphabet, whose major advantage is the lack of special characters. Lastly, syllabic stress was marked manually with an acute accent. Figure 2 illustrates a piece of annotation obtained in the process described above.
The syllable alignment assumed resyllabification, meaning that a word-final /s/ before a word starting with a vowel was attached to the following syllable, as in the example presented in Fig. 3 ("minutos en" [mi.'nu.to.sem]).
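The resyllabification rule can be illustrated with a toy function. This is a simplified sketch, not the procedure used by EasyAlign: it works on syllables given as plain strings, handles only word-final /s/, and does not model stress marks or any assimilation of the following consonants (so it yields "sen" where the transcription in the example shows [sem]). The function name is hypothetical.

```python
VOWELS = set("aeiou")


def resyllabify(words: list[list[str]]) -> list[str]:
    """Attach a word-final /s/ to the next word when that word starts with a vowel.

    `words` is a list of words, each given as a non-empty list of syllables.
    """
    result: list[str] = []
    carried = ""  # an /s/ detached from the previous word, if any
    for index, word in enumerate(words):
        syllables = list(word)
        syllables[0] = carried + syllables[0]
        carried = ""
        following = words[index + 1] if index + 1 < len(words) else None
        last = syllables[-1]
        if (following and len(last) > 1 and last.endswith("s")
                and following[0][0] in VOWELS):
            syllables[-1] = last[:-1]
            carried = "s"
        result.extend(syllables)
    return result


print(".".join(resyllabify([["mi", "nu", "tos"], ["en"]])))
```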
All the transcription and segmentation material obtained with EasyAlign was inspected by an expert phonetician. Segments where the plug-in failed to produce any result were corrected manually, as were segment boundaries where necessary. Because the correction was carried out by an expert phonetician, consistency was guaranteed throughout the corpus. The automatic segmentation was adjusted following a set of guidelines established in accordance with Machač and Skarnitzl (2009). The EasyAlign plug-in is an HMM aligner (Goldman, 2011, p. 2) that relies on a pronunciation dictionary and does not cover all the possible geographic phonemic varieties. It was therefore necessary in some cases to adjust the grapheme-to-phoneme conversion following the guidelines for Spanish SAMPA by Wells (1997) (e.g. the dephonologization of /L/ in favor of /jj/, standard for most Spanish dialects and not detected by EasyAlign).
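The kind of adjustment mentioned for /L/ can be pictured as a substitution over a sequence of SAMPA labels. The snippet is a hypothetical illustration only: in the corpus the adjustments were made during expert correction, and the function name and the list-of-labels representation are assumptions.

```python
def merge_palatal_lateral(phones: list[str]) -> list[str]:
    """Replace the palatal lateral /L/ with /jj/ in a list of SAMPA labels."""
    return ["jj" if phone == "L" else phone for phone in phones]


# "calle" transcribed with the lateral, as a dictionary-based conversion might give it.
print(merge_palatal_lateral(["k", "a", "L", "e"]))
```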

Corpus structure
Following the design described in the Methods section, 24 h of recordings were obtained. Figure 4 illustrates the stratification of the recordings by the type of program.
Among the radio stations included in the corpus, only Radio Cristal offers advertisement programming explicitly tagged by the speakers.
Due to unsatisfactory audio quality (background music or noise), 13% of the corpus was replaced by excerpts from other programs that met similar requirements (the type of program and the speech style).
Regarding the social structure of the corpus, utterances from 142 speakers were obtained. The speakers were stratified in three age groups and two gender groups (Fig. 5), as pointed out in the Methods section.
The distribution of duration across the social variables is presented in Fig. 6. The middle age group predominates both in the number of speakers and in total speaking time. Male speakers are also more numerous in every subgroup, with a remarkably low number of female speakers in the oldest group. The proportions obtained are not equal, but they reflect the real structure of the radio staff. Figure 7 presents the distribution of the corpus duration across speech styles. Again, commentary is the least represented speech style in the corpus. The most frequent styles are entertainment and news, which represent the opposite extremes of the speech style axis. Therefore, the potential differences in the variable "speech style" should be clearly observable in any study results.

Conclusions
The aim of this paper was to report on the design of the Guayaquil Radiophonic Speech Corpus, with the main focus on its structure. Building a linguistic corpus is often challenging, since the researcher has to decide carefully on the criteria that allow the collected material to be organized into a representative dataset. While the object of study might be linguistic, it is the non-linguistic features that guarantee that the files gathered in the corpus reflect the full profile of the speech variety.
The selection of radiophonic speech was motivated by a number of reasons. The language spoken on the radio is a valuable object of linguistic research. Not only does it provide a good sound quality, but it also covers a variety of sociolinguistic topics, due to the great presence of the radio in everyday life of this particular speech community and its possible impact on their linguistic habits. Radio speech often has the tendency to become a linguistic model. At the same time, the speakers accommodate their speech depending on the level of formality of the program. The more formal the program, the less dialectal variation occurs in the anchor's speech. In this sense, the speaker's behavior reveals what the audience considers prestigious and universal, and what is regarded as local or marked.
When it comes to the corpus structure, the sample aims to reflect the relative radiophonic universe of Guayaquil, following different program types, calculated proportionally to the broadcast duration time. In order to select the radio stations, official reports and social media statistics were used, since audience ratings are the instrument that measures their success and impact (López González, 2001, p. 21). The more listeners a station has, the more impact it possibly makes on them, also in terms of linguistic influence.
Choosing a random stratified proportional sample allowed for a systematic and organized data gathering process. A variety of program types and speech styles are represented proportionally. The corpus structure and annotation enable studies on a wide variety of research topics in the fields of phonetics, syntax, discourse studies, and others, while offering quite a complex sociolinguistic insight. With further specific annotations, CHARG might be applied in other fields of linguistics.
The steps envisaged for future work include adding layers containing prosodic information, using automatic tools such as Prosogram (Mertens, 2004). However, much linguistic information can already be extracted from the existing layers (time group analysis, segment durations on the phonetic level, parts of speech on the syntactic level, etc.). Corpus analysis tools can also be applied, and any additional tagging is possible. The substantial manual intervention in segment correction has made the corpus suitable input for automatic tools based on time stamps, for tasks such as speech synthesis and recognition training or the measurement of sound and syllable duration. Other researchers are welcome to develop the existing annotations by adding any kind of linguistic tagging to the corpus.
Apart from the corpus itself, we firmly believe that the detailed description of the design of CHARG can be of much benefit to the beginning researchers taking up the challenge of constructing a spoken corpus.