EsPal: One-stop shopping for Spanish word properties
This article introduces EsPal: a Web-accessible repository containing a comprehensive set of properties of Spanish words. EsPal is based on an extensible set of data sources, beginning with a 300 million token written database and a 460 million token subtitle database. Properties available include word frequency, orthographic structure and neighborhoods, phonological structure and neighborhoods, and subjective ratings such as imageability. Subword structure properties are also available in terms of bigrams and trigrams, biphones, and bisyllables. Lemma and part-of-speech information and their corresponding frequencies are also indexed. The website enables users either to upload a set of words to receive their properties or to receive a set of words matching constraints on the properties. The set of properties is easily extensible, and new properties will be added over time as they become available. EsPal is freely available at http://www.bcbl.eu/databases/espal/.
Keywords: Word frequency, Subtitles, Word recognition, Corpus linguistics, Psycholinguistics
Researchers from a wide range of disciplines (e.g., neuroscience, artificial intelligence, psychology, linguistics, and education, among others) who work in the interdisciplinary area of language research (e.g., language acquisition, language processing, language learning, bilingualism, and computational linguistics) need quick and efficient access to information about specific properties of words. For example, word frequency is a dominant factor in accounting for visual word recognition speed as measured by lexical decision times (Forster & Chambers, 1973; Monsell, 1991) and eye fixation durations during reading (Rayner, 2009). Unsurprisingly, reading behavior as measured by, for example, lexical decision, naming, fixation times, and so on is affected by a wide range of other properties of words, including orthographic neighborhood (Carreiras, Perea, & Grainger, 1997; Grainger, 1990), syllable frequency (Carreiras, Alvarez, & de Vega, 1993; Carreiras & Perea, 2004; Perea & Carreiras, 1998), and imageability (James, 1975), to cite just a few examples. Similarly, with regard to other fields that employ linguistic stimuli, such as memory research, it has been shown that word frequency plays a role in short-term memory (Hulme et al., 1997) and syllable length in working memory (Gathercole & Baddeley, 1990).
Given the wide range of word properties that can affect language and cognitive processing, it is desirable to have a single, integrated, and updateable source of data. For Spanish, there are now a variety of databases available, but some are based on a relatively small number of tokens (Davis & Perea, 2005; Sebastián-Gallés, Martí, Carreiras, & Cuetos, 2000; Taulé, Martí, & Recasens, 2008), while others provide information about a limited number of variables (Alonso, Fernandez, & Díez, 2011; Cuetos-Vega, González-Nosti, Barbón-Gutiérrez, & Brysbaert, 2011; Davies, 2005; Marian, Bartolotti, Chabal, & Shook, 2012). EsPal (Español Palabras, meaning simply “Spanish words”) is a Web-based repository available at http://www.bcbl.eu/databases/espal/ that has been designed to fill this gap, providing information on a comprehensive set of word properties from corpora with hundreds of millions of words.
The most similar effort is the Syllabarium (Duñabeitia, Cholin, Corral, Perea, & Carreiras, 2010), which is a Web-based tool accessing a database containing information on word frequencies and syllable frequencies by token and syllable position. Standalone software packages are also available for Spanish and other languages that provide subsets of the properties in EsPal (Davis, 2005; Davis & Perea, 2005; New, Pallier, Brysbaert, & Ferrand, 2004; Perea et al., 2006). However, given the size of the corpora (discussed below), some of the calculations for some of the properties take up to a week on a standard PC, so a precomputed set of properties is preferred. With EsPal, the back-end processing for the word and subword properties is conducted using a multistep program written in Java, which precomputes not only basic properties of word frequency and form, but also orthographic structure and neighborhoods, phonological structure and neighborhoods, lemma and part-of-speech properties, and subword structure properties related to letter bigrams and trigrams, bisyllables, and biphones. In addition, other data such as a word’s subjective ratings (e.g., familiarity, imageability, etc.) can be easily attached to the data and made searchable.
The second important factor of EsPal is the capacity to apply the exact same processing to different corpora. A number of studies have shown that, across many languages, word frequencies derived from movie subtitle corpora provide a better account for various psycholinguistic effects (Brysbaert, New, & Keuleers, 2012; Cai & Brysbaert, 2010; Cuetos-Vega et al., 2011; Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010; Keuleers, Brysbaert, & New, 2010; New, Brysbaert, Veronis, & Pallier, 2007). However, properties from written corpora have in the past been more common and may better predict some phenomena, so it is useful to have different sources of data available for researchers, depending on their goals. EsPal currently fulfills this goal by applying the same processing to both a corpus based on movie subtitles and one based on written text (fiction, nonfiction, and Web pages).
Finally, the Spanish-speaking community is diverse, and EsPal is constructed to be able to accommodate this diversity, at least in terms of phonological representation. Standard Castilian Spanish spoken in mainland Spain differs along a number of dimensions from the Spanish spoken in the Canary Islands and in Latin America (which itself is quite diverse). EsPal therefore also allows the user to choose which phonological representation is used, for example, to derive properties related to phonological neighborhoods.
In the remainder of this article, we describe the collection and preprocessing of the written and subtitle databases currently available in EsPal; how we calculate orthographic and phonological properties, subword properties, lemma and part-of-speech properties; and the source of the subjective ratings data.
Written corpus collection and preprocessing
Written corpus collection
Percentage of terms by source type in the EsPal written corpus
The academic texts are mainly Ph.D. theses selected from a wide range of scientific fields: anthropology, architecture, art, biology, law, economics, electronics, philology, philosophy, physics, history, humanities, engineering, mathematics, medicine, psychology, chemistry, telecommunications, and veterinary science. The set of culture texts is composed of news about cultural events from several newspapers and blogs of opinion about films. Legal texts include mainly rulings by the High Court of Justice of several autonomous regions in Spain, as well as news from the judiciary field as it appeared in popular newspapers (El Mundo, El País, and El Periódico). The literary texts come from several websites containing works with expired copyrights (bdigital, biblioteca_ignoria, libroteca, logos, and scribd). These works are both texts written in Spanish and translations into Spanish. The news is from the EFE Agency from January, February, and March 2000. The politics set contains news texts referring to Spain’s 2007 autonomic elections, speeches by the Spanish President during 2008, and documents taken from political party websites. The society set is composed of Web texts about religion, abortion, and psychology. Finally, the Web data are from the whole Spanish Wikipedia, circa February 2009.
The whole corpus underwent a process of cleaning to eliminate the metadata usually present in these types of texts. This process was both automatic and manual and was extremely time consuming.
Written corpus preprocessing
Before the data were incorporated into EsPal, all the text was first parsed using the FreeLing part-of-speech tagger (Padró, Collado, Reese, Lloberes, & Castellón, 2010) to output into a file one term per line with its lemma and its part of speech. The parsing resulted in a total of 309,530,600 terms (no punctuation was included). A “term” could be one or more words and included dates (17 de julio de 1990 [“July 17, 1990”]), proper nouns (Congreso de los Estados Unidos [“United States Congress”]), or phrases (por ejemplo [“for example”]). These terms were then imported into a raw_sequence table in EsPal, with one word per row (i.e., multiword terms were separated) and columns for the lemma and the part-of-speech tag (e.g., http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html). If the word came from a multiword term, then the word itself was used as the lemma. In this manner, the part-of-speech tag is maintained for the word within its larger context—for example, de [“of”] will have lemma statistics as a date and a proper noun (among others) in addition to being a preposition. In the lemma processing section below, we describe further lemma information available for words. The word and lemma were changed to all lowercase using the Java string function toLowerCase with the “es” locale. This table had a total of 325,773,444 rows. Subsequent processing of the contents of the raw_sequence table is described later.
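The row expansion described above can be illustrated with a minimal sketch. It assumes that the tagger output joins the words of a multiword term with underscores; the function name `expand_term` and the tag strings in the usage example are hypothetical. Python's `str.lower` is used here in place of the Java `toLowerCase` call with the “es” locale.

```python
def expand_term(term, lemma, tag):
    """Turn one tagged term into raw_sequence-style rows (word, lemma, tag).

    Assumption: the words of a multiword term are joined with underscores.
    """
    words = term.split("_")
    if len(words) == 1:
        # Single-word term: keep the lemma supplied by the tagger.
        return [(term.lower(), lemma.lower(), tag)]
    # Multiword term: one row per word, the word itself serves as the lemma,
    # and every row keeps the part-of-speech tag of the whole term.
    return [(w.lower(), w.lower(), tag) for w in words]


# Hypothetical usage: the tag "RG" is only an illustrative label.
rows = expand_term("Por_ejemplo", "por_ejemplo", "RG")
```

With this layout, a word such as de accumulates rows under every tag of the larger terms it appears in, which is what allows the lemma statistics described above.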
Subtitle corpus collection and pre-processing
Subtitle corpus collection
A total of 100,659 Spanish subtitle files were originally provided by the www.opensubtitles.org website including metadata about the file (such as author and total downloads). The Internet Movie Database (IMDb) ID was also supplied, by which genre, director, and cast information can be obtained. Subtitle file formats contain an index number, the start and stop time for which the subtitle is to be shown on screen in milliseconds, and the text of the subtitle, all of which were stored in the subtitles table of the database. Movies account for 65.6 % of the files, with the remainder from television episodes. A given show can be labeled with more than one genre, so the words in a subtitle file can be double counted, but across all such counts, by genre, 22.0 % of the words are from dramas, 10.9 % from comedies, 10.3 % from thrillers, 7.7 % from crime shows, 7.4 % from action shows, 7.3 % from romances, 5.8 % from mysteries, and 5.5 % from adventure shows, and the remaining are in 13 other genres, accounting for less than 5 % each. Similarly, the source show can contain more than one language, and across such counts, 52.6 % of the words are from English language shows, followed by French (8.5 %) and Spanish (5.5 %). No limits were put on the date of the source, since the subtitles themselves, uploaded by users of the website, are of recent origin. However, given the metadata maintained about the source of the words, a variety of subcorpora are possible whose properties might be more appropriate depending on the psycholinguistic question being asked.
Subtitle corpus preprocessing
For a proper parsing of text, complete sentences are needed. However, a single subtitle instance could have two speakers (usually denoted by a dash [“–”] at the beginning of each of their statements), or a single speaker’s statement could continue into the next subtitle instance (usually denoted by ellipses [“…”] at the end). Therefore, a second stage of processing was run to fill a statements table with strings that were, at a first approximation, single statements (which could contain multiple sentences). At this stage, subtitles were removed that contained metadata (such as the author of the subtitles or translations of the credits); all HTML markings were removed; and contents within brackets (often indicating sounds) were also removed.
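As a rough illustration of this second stage, the sketch below splits subtitle instances on leading dashes, joins instances linked by ellipses, and strips HTML markup and bracketed contents. It is an approximation under stated assumptions, not the EsPal implementation: the function name `to_statements` and the exact heuristics are hypothetical, and metadata filtering is omitted.

```python
import re

TAG = re.compile(r"<[^>]+>")           # HTML-style markup such as <i>...</i>
BRACKETED = re.compile(r"\[[^\]]*\]")  # bracketed contents, often sound cues
DASHES = ("-", "–")
ELLIPSES = ("...", "…")


def to_statements(subtitles):
    """Convert a list of subtitle texts into approximate single statements."""
    statements, carry = [], ""
    for raw in subtitles:
        text = BRACKETED.sub("", TAG.sub("", raw))
        parts = []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(DASHES):
                # A leading dash is taken to mark a new speaker.
                parts.append(line.lstrip("-– ").strip())
            elif parts:
                parts[-1] = (parts[-1] + " " + line).strip()
            else:
                parts.append(line)
        for part in parts:
            if not part:
                continue
            if carry:
                # Continue the statement left open by a trailing ellipsis.
                part = carry + " " + part.lstrip(".… ")
                carry = ""
            if part.endswith(ELLIPSES):
                carry = part.rstrip(".… ")
            else:
                statements.append(part)
    if carry:
        statements.append(carry)
    return statements


example = to_statements([
    "- Hola.\n- ¿Qué tal?",
    "<i>Creo que...</i>",
    "...deberíamos irnos. [RUIDO]",
])
```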
Each statement was submitted individually to FreeLing (Padró et al., 2010) for part-of-speech tagging and lemmatization. In this case, the lowercased word, lowercased lemma, and part-of-speech tag were stored directly in the raw_sequence table, along with the file ID, IMDb movie ID, statement index, and within-statement index. Thus, the provenance, or origin, of every word can be traced back, enabling further analyses, which we will be reporting in the future. In the end, words from 98,339 distinct files and 40,444 unique movies are present in this table.
Word selection and frequency processing
Counts of word types and word tokens in each corpus
The word_data table contains all the information about each word and, thus, what can be searched for simultaneously via the Web interface. We will be presenting the various properties available for each word with its column name in bold italics. For each word, we store the count (cnt), the frequency per million (frq), log10(cnt+1) (log_cnt), and log10(frq + 1/N), where N = millions of words in the database (log_frqN), which has been shown to be a fruitful way to compare frequencies across corpora (Brysbaert et al., 2011).
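The four frequency columns follow directly from the raw count and the corpus size. The sketch below is a minimal illustration; the function name `frequency_measures` and the dictionary layout are assumptions, and only the column names (cnt, frq, log_cnt, log_frqN) come from the text above.

```python
import math


def frequency_measures(cnt, total_tokens):
    """Compute the frequency columns for one word.

    cnt          -- number of occurrences of the word in the corpus
    total_tokens -- total number of word tokens in the corpus
    """
    n_millions = total_tokens / 1_000_000  # N in the text
    frq = cnt / n_millions                 # frequency per million
    return {
        "cnt": cnt,
        "frq": frq,
        "log_cnt": math.log10(cnt + 1),
        "log_frqN": math.log10(frq + 1 / n_millions),
    }


# Illustrative values only: 999 occurrences in a 100-million-token corpus.
measures = frequency_measures(999, 100_000_000)
```

Because frq + 1/N = (cnt + 1)/N, log_frqN equals log_cnt minus log10(N). It therefore stays defined for words with a count of zero and places corpora of different sizes on a common scale.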
Subtitle corpus contextual diversity processing
Recent work has found that the number of different contexts in which a word occurs can be more informative than the token frequency (Adelman, Brown, & Quesada, 2006; Brysbaert & New, 2009; Dimitropoulou et al., 2010; Keuleers et al., 2010; Perea, Soares, & Comesaña, 2013). The original EsPal subtitles database described above uses all the files available, so some shows are multiply represented. Therefore, EsPal provides a third database of properties (subtitles_cdm) that are based on the number of different movies (IMDb IDs) that the word appears in. In this database, cnt refers to the count of different movies, and frq is equal to the percent of movies (i.e., 100 * cnt/40,444). We also explored using the count of different subtitle files, with the expectation that this would have some relationship to popularity (e.g., there are almost 300 versions of Lord of the Rings: Return of the King) and, therefore, provide word frequencies that were better predictors of certain psycholinguistic variables. However, in all the cases we have explored to date, the contextual diversity based on the number of movies has given slightly better results.
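The movie-based contextual diversity measure can be sketched as follows, assuming the input is a sequence of (word, IMDb ID) pairs such as could be read from the raw_sequence table. The function name `contextual_diversity` and the sample IDs are hypothetical.

```python
from collections import defaultdict


def contextual_diversity(rows, n_movies):
    """Count, for each word, the distinct movies it appears in.

    rows     -- iterable of (word, imdb_id) pairs
    n_movies -- total number of distinct movies in the corpus
    """
    movies_by_word = defaultdict(set)
    for word, imdb_id in rows:
        movies_by_word[word].add(imdb_id)  # repeated files of a movie collapse
    return {
        word: {"cnt": len(ids), "frq": 100 * len(ids) / n_movies}
        for word, ids in movies_by_word.items()
    }


# Illustrative data: "casa" occurs twice in one movie and once in another.
cd = contextual_diversity(
    [("casa", "tt1"), ("casa", "tt1"), ("casa", "tt2"), ("caso", "tt3")],
    n_movies=4,
)
```

For the subtitles_cdm database described above, n_movies would be 40,444.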
Orthographic properties processing
The basics of the orthographic structure are, of course, present in the word column itself. In addition, the number of letters (num_letters) and whether or not there are repeated letters (rep_letters) within the word (0 = false, and 1 = true) are stored. A straightforward consonant–vowel structure (orth_cv_structure) was also created by replacing each vowel character (a,e,i,o,u, with or without accents, but not y) with “V” and all other characters with “C.” Note, however, that there are certain limitations to this simple heuristic, especially with regard to the letters y and h.
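A minimal sketch of these three properties is given below. Two points are assumptions rather than statements from the text: ü is treated as a vowel, and rep_letters is taken to mean that any letter occurs more than once anywhere in the word. The function name `orth_properties` is hypothetical.

```python
# Assumption: accented vowels and ü count as vowels; y does not.
VOWELS = set("aeiouáéíóúü")


def orth_properties(word):
    """Return num_letters, rep_letters, and orth_cv_structure for a word."""
    return {
        "num_letters": len(word),
        # 1 if some letter occurs more than once in the word, else 0
        "rep_letters": int(len(set(word)) < len(word)),
        "orth_cv_structure": "".join(
            "V" if ch in VOWELS else "C" for ch in word
        ),
    }


props = orth_properties("casa")
```

The word hoy [“today”] comes out as CVC under this heuristic even though its final y is vocalic, which is the kind of limitation noted above.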
Orthographic neighborhood variable names and meanings
Number of substitution neighbors
Number of higher frequency substitution neighbors
Frequency of the highest frequency substitution neighbor
Highest frequency substitution neighbor
List of substitution neighbors in descending frequency, with the place of the word itself marked by "OOOOOO"
Number of positions with substitution neighbors
Number of positions with higher frequency substitution neighbors
Average frequency of substitution neighbors
Number of transposed-letter neighbors
Frequency of the highest frequency transposed-letter neighbor
Highest frequency transposed-letter neighbor
List of transposed-letter neighbors in descending frequency, with the place of the word itself marked by "OOOOOO"
Number of addition-letter neighbors
Frequency of the highest frequency addition-letter neighbor
Highest frequency addition-letter neighbor
List of addition-letter neighbors in descending frequency, with the place of the word itself marked by "OOOOOO"
Number of deletion-letter neighbors
Frequency of the highest frequency deletion-letter neighbor
Highest frequency deletion-letter neighbor
List of deletion-letter neighbors in descending frequency, with the place of the word itself marked by "OOOOOO"
The character in the word at which it is no longer like any other word. Set to 0 if it is subsumed by some other word and not unique
If the word is unique, then the uniqueness point with the last letter removed
Average Levenshtein distance of the 20 closest words (OLD20)
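Two of the measures listed above, substitution neighbors and the average Levenshtein distance to the closest words, can be sketched in a few lines. This is a naive illustration over a small in-memory word list; the function names and the sample lexicon are hypothetical, and a full-size lexicon would need a more efficient approach.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def substitution_neighbors(word, lexicon):
    """Words of the same length that differ from `word` in exactly one letter."""
    return [w for w in lexicon
            if len(w) == len(word)
            and sum(x != y for x, y in zip(w, word)) == 1]


def old_n(word, lexicon, n=20):
    """Mean Levenshtein distance to the n closest other words (OLD20 if n=20).

    Assumes the lexicon contains at least one word other than `word`.
    """
    dists = sorted(levenshtein(word, w) for w in lexicon if w != word)[:n]
    return sum(dists) / len(dists)


lexicon = ["casa", "caso", "cosa", "masa", "casas", "asa", "saca"]
neighbors = substitution_neighbors("casa", lexicon)
```

Sorting such a neighbor list by frequency and comparing each frequency with that of the target word would give the higher frequency counts in the list above.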
Phonological properties processing
Spanish is a relatively transparent language, so syllable and phonological structure can be derived from the orthography in a rule-based fashion. To derive the syllable structure, we implemented, with some minor changes, the rules in Silabeador TIP (Hernández-Figueroa, Rodríguez-Rodríguez, & Carreras-Riudavets, 2009) to obtain orthographic syllable boundaries (orth_syll_structure). The most notable change was the addition of the onset, nucleus, and coda information being stored for each character. From this information, the number of syllables (num_syll) and the position of the syllable with the accent (syll_accent) were also derived.
Phonetic transcription codes used in EsPal
voiceless bilabial plosive
voiced bilabial plosive
voiceless dental plosive
voiced dental plosive
voiceless velar plosive
voiced velar plosive
voiced bilabial nasal
voiced alveolar nasal
voiced velar nasal (preceding a velar consonant)
voiced palatal nasal
voiceless palatal affricate
voiceless labiodental fricative
voiceless interdental fricative
voiceless alveolar fricative
voiced alveolar fricative (preceding a voiced consonant)
voiced palatal fricative
voiceless velar fricative
voiced alveolar lateral
voiced lateral palatal
voiced alveolar trill
voiced bilabial approximant
voiced dental approximant
voiced velar approximant
simple vibrating voiced alveolar
open central vowel
front half vowel
front closed vowel
half rounded back vowel
closed rounded back vowel
Two phonetic representations were derived, one for Castilian Spanish and one for Latin American Spanish. Although this is a complex topic and pronunciation varies dramatically within and between countries (Moreno & Mariño, 1998), for this introduction of EsPal, the only difference between these two representations is that z and c (followed by e or i) are transcribed as T in Castilian and s in Latin American Spanish. However, the software and website are capable of accommodating any number of phonetic representations, and more accurate representations can be added over time. In the database and website output, these columns and the neighborhood columns described below are prepended by either es or sa for Castilian and Latin American Spanish, respectively, depending on which representation is chosen.
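The single point of divergence between the two representations can be illustrated as below. The sketch covers only the z and c (before e or i) rule and leaves every other letter untouched, so its output is not a full phonetic transcription. The function name `transcribe_sibilants` is hypothetical, and treating accented é and í like e and i is an assumption.

```python
SOFT_VOWELS = set("eiéí")             # assumption: accented forms count too
DIALECT_CODE = {"es": "T", "sa": "s"}  # Castilian vs. Latin American


def transcribe_sibilants(word, dialect):
    """Apply only the z / c(+e,i) rule; all other letters are copied as-is."""
    code = DIALECT_CODE[dialect]
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "z" or (ch == "c" and nxt in SOFT_VOWELS):
            out.append(code)
        else:
            out.append(ch)
    return "".join(out)


castilian = transcribe_sibilants("cielo", "es")
latin_american = transcribe_sibilants("cielo", "sa")
```

Keeping the rule in a per-dialect lookup table mirrors the idea that further representations can be added without changing the rest of the processing.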
Phonological neighborhood variable names and meanings
Number of phonological neighbors (all kinds)
Number of higher frequency phonological neighbors
Frequency of the highest frequency phonological neighbor
Phonological neighbor with the highest frequency
List of phonological neighbors in descending frequency, with the place of the word itself marked by "OOOOOO"
Number of phonemes/positions with phonological neighbors
Number of phonemes/positions with higher frequency phonological neighbors
Average frequency of phonological neighbors
Phoneme position at which it is no longer like any other word. Set to 0 if it is subsumed by some other word and not unique
Number of other word entries with the same phon_structure
List of homophones in descending frequency
Infralexical, or subword, features are known to influence lexical decision and naming times (Carreiras et al., 1993; Carreiras & Perea, 2004). The processing was very similar for bigrams, trigrams, biphones, and bisyllables, so for exposition we will describe only bigram processing. A new table, bigram_raw, is created to hold, for each bigram–word-position combination, the sum of word token frequencies (frq) and word type counts from the word_data table. For instance, when the word casa [“house”] is encountered, it is found to contain three bigrams (ca, as, sa) with positions 1, 2, and 3, respectively. An entry is made in the bigram_raw table for each of these bigrams at their positions, and the frequency per million (frq) of casa is added to the token frequency column and 1 is added to the type count column. When the word caso [“case”] is encountered, ca at position 1 and as at position 2 have their token frequency and type count columns incremented by the frequency per million of caso and 1, respectively; and a new entry for so is made at position 3.
After information from all the words was added to the bigram_raw table, each word was reanalyzed to obtain properties of its bigrams. For example, across the entire word casa, we can sum or average, in terms of token frequency or type count, its three bigram frequencies. These sums and averages can also either respect the position of the bigram or not (e.g., ca at position 1 vs. at any position). Thus, there are eight bigram values that are available for each word as a whole.
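The two passes described above (accumulating bigram totals over all words, then summarizing each word) can be sketched as follows. The function names, the in-memory dictionaries standing in for the bigram_raw table, and the frequencies in the usage example are assumptions. The output keys follow the naming pattern of the one example given in the text (abs_tok_MBPF) and are otherwise inferred. A bigram that occurs twice in one word is counted twice in the position-independent totals here, which is also an assumption.

```python
from collections import defaultdict


def build_bigram_tables(word_frq):
    """First pass: accumulate [token frequency, type count] per bigram.

    word_frq -- dict mapping each word to its frequency per million (frq)
    """
    by_pos = defaultdict(lambda: [0.0, 0])   # key: (bigram, position)
    any_pos = defaultdict(lambda: [0.0, 0])  # key: bigram
    for word, frq in word_frq.items():
        for i in range(len(word) - 1):
            bigram = word[i:i + 2]
            for entry in (by_pos[(bigram, i + 1)], any_pos[bigram]):
                entry[0] += frq  # token frequency
                entry[1] += 1    # type count
    return by_pos, any_pos


def word_bigram_values(word, by_pos, any_pos):
    """Second pass: the eight word-level sums and means."""
    bigrams = [(word[i:i + 2], i + 1) for i in range(len(word) - 1)]
    if not bigrams:
        return {}
    sums = {
        "pos_tok": sum(by_pos[(b, p)][0] for b, p in bigrams),
        "pos_type": sum(by_pos[(b, p)][1] for b, p in bigrams),
        "abs_tok": sum(any_pos[b][0] for b, _ in bigrams),
        "abs_type": sum(any_pos[b][1] for b, _ in bigrams),
    }
    values = {}
    for key, total in sums.items():
        values[f"{key}_SBOF"] = total                 # sum
        values[f"{key}_MBOF"] = total / len(bigrams)  # mean
    return values


# Illustrative frequencies only.
by_pos, any_pos = build_bigram_tables({"casa": 10.0, "caso": 5.0, "saca": 2.0})
casa_values = word_bigram_values("casa", by_pos, any_pos)
```

In this toy lexicon, ca at position 1 collects the frequencies of casa and caso only, whereas ca at any position also includes saca.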
For a given word, EsPal also provides each bigram’s token frequency and type count, either for the bigram in that position only or for the bigram in that position found anywhere in a word. So caso has three nonzero bigram data sets, and the first data set has the token frequency and type count of ca at position 1 and of ca at any position. Bigram and trigram data are calculated for words with up to 20 characters. Similar processing is done for biphones on the basis of the phonetic structure (phon_structure) up to 20 phonemes and for bisyllables on the basis of the individual syllables in the orthographic syllable structure (orth_syll_structure) up to eight syllables.
To provide this large amount of infralexical information, we created a systematic method for deriving property names. Property name affixes are added for each n-gram length (bigram [B] or trigram [T]), and for each n-gram modality (orthographic [O], phonemic [P], syllabic [S]). So, bigram = BO; trigram = TO; biphone = BP; and bisyllable = BS. The system is designed to be extensible, so any other combination of interest could be added. Currently, the frequency per million (frq) is used and denoted by F in the variable name, but the count (cnt) could also be used, as well as the log of either. We can add such versions of the calculations as they are requested. Eight variables are made for each length–type combination. These have combinations that are position sensitive (pos_) or independent (abs_) sums (S) or means (M) of the token frequency (tok_) or type count (type_). The previous code is then appended to the property name. For example, the position-independent mean of biphone token frequencies is abs_tok_MBPF.
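The naming scheme can be made concrete by generating the eight names for one n-gram type. The only name confirmed by the text is abs_tok_MBPF; the remaining names below are produced by applying the same pattern and should be read as inferred. The function name `property_names` is hypothetical.

```python
from itertools import product

# Length-modality codes given in the text.
NGRAM_CODES = {"bigram": "BO", "trigram": "TO", "biphone": "BP", "bisyllable": "BS"}


def property_names(ngram, measure="F"):
    """Build the eight property names for one n-gram type.

    measure -- "F" denotes frequency per million, the measure currently used
    """
    code = NGRAM_CODES[ngram]
    return [
        f"{scope}_{unit}_{agg}{code}{measure}"
        for scope, unit, agg in product(("pos", "abs"),   # position sensitivity
                                        ("tok", "type"),  # token vs. type
                                        ("S", "M"))       # sum vs. mean
    ]


biphone_names = property_names("biphone")
```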
Lemma and part-of-speech processing
While word-form frequencies have tended to dominate analyses, the lemma and part-of-speech frequencies may also influence behavior (Baayen, Dijkstra, & Schreuder, 1997; Taft, 1979). To set the values for the lemma and part-of-speech properties, we return to the raw_sequence table. Counts were made of every unique combination of word, lemma, and part-of-speech tag, rejecting combinations where the lemma contains non-Spanish characters or is too long (> 255 characters). For the written database, there were 388,270 word–lemma–code types, and for the subtitles database, there were 404,394 word–lemma–code types. Since there was more than one row per word, these data were stored in a separate lemma_data table for searching (cf. Brysbaert et al., 2012).
For each word, EsPal gives the percentage of occurrences with each lemma–code combination. For example, the word caso most often appears as a common masculine singular noun [“case”] but can also appear as a conjunction (caso de que [“if”]), an adverb (en todo caso [“in any case”]), a preposition (en caso de [“in case of”]), a verb (yo me caso [“I marry”]), as well as a proper noun and URL. Similarly, for each lemma, EsPal gives the percentage of occurrences with each word–code combination. For example, the lemma caso, besides occurring with the previous parts of speech, also occurs with the masculine plural noun casos. The variable percent_word gives the percentage of each word (by _type or _tok) that has that word–lemma–code, and percent_lemma gives the percentage of each lemma (by _type or _tok) that has that word–lemma–code. For example, for the word–lemma–code combinations with caso as the word, percent_word_type = 16.67 % in the written database, since caso appears with six different lemma–code combinations, and the percent_word_tok for the masculine singular noun lemma–code = 81.5 %, and for the simple preposition = 5.6 %.
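The word-level percentages can be computed from counts of word–lemma–code combinations, as in the sketch below. The function name `percent_word`, the counts, and the tag strings in the usage example are illustrative assumptions and are not values from the EsPal databases.

```python
from collections import defaultdict


def percent_word(counts):
    """Compute percent_word_tok and percent_word_type per combination.

    counts -- dict mapping (word, lemma, code) to its token count
    """
    tok_total = defaultdict(int)   # tokens per word
    type_total = defaultdict(int)  # distinct lemma-code combinations per word
    for (word, _lemma, _code), cnt in counts.items():
        tok_total[word] += cnt
        type_total[word] += 1
    return {
        key: {
            "percent_word_tok": 100 * cnt / tok_total[key[0]],
            "percent_word_type": 100 / type_total[key[0]],
        }
        for key, cnt in counts.items()
    }


# Made-up counts and tag labels, for illustration only.
percentages = percent_word({
    ("caso", "caso", "NOUN"): 80,
    ("caso", "casar", "VERB"): 15,
    ("caso", "caso", "PREP"): 5,
})
```

The lemma-level percentages (percent_lemma) would follow the same pattern with totals grouped by lemma instead of by word.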
The part-of-speech tags are also expanded to allow searching and organization of results. The part-of-speech information includes Category, Type, Degree, Appreciative, Diminutive, Person, Mode, Tense, Form, Gender, Number, Function, Possessor, and Politeness. A full list for Spanish can be found on the FreeLing website (http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html), which shows, for example, how the different attributes of an adjective are specified.
Some of the lemma information is also added to the word_data table—namely, information about the most common part of speech associated with the word (the “maximum lemma”) and the “lemma frequency” of the word, which is based on the sum of the counts of all the words that have the same lemma as any of the lemmas of the word (Keuleers et al., 2010). For the maximum lemma of a word, EsPal provides the lemma itself (max_lem_lemma), the detailed part-of-speech code (max_lem_code), and the percentage of all the word’s tokens with that code (max_lem_perc), the category (max_lem_cat), and the percentage as that category (max_lem_cat_sum_perc). So for example, in the subtitles database, the word caso mentioned above appears 90.15 % as a common masculine singular noun and 90.55 % as a noun overall (the additional appearances probably labeled as a proper noun). For the lemma frequencies, EsPal makes available the log(cnt +1), as well as the log2(cnt + 1), which Keuleers et al. (2010) found helped account for more variance in lexical decision times in Dutch.
Subjective ratings, such as the imageability of the thing that a word refers to, also modulate the process of lexical access (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004). For EsPal, 6,500 words were selected (mostly nouns and verbs, although some nouns could also be considered adjectives). The words corresponded to those with the highest frequencies in the Alameda and Cuetos (1995) and the Juilland and Chang-Rodríguez (1964) word frequency lists. Nouns with gender (e.g., niña, niño) and number (e.g., corte and cortes) inflections were generally both included for evaluation. We decided to include both since, in many cases, the different usages with the two gender forms or the two number forms hold different semantic features. For instance, the word corte suggests more clearly the action of cutting than does the plural form cortes. In addition, each form involves different semantic meanings: Cortes is a term that can be used to refer to the parliament of Spain (Cortes Generales), while corte is linked more to royalty. On the other hand, some nouns could also be considered adjectives; for example, the word rojo [“red”] can refer to the color itself (as well as a Communist) or be used as an adjective. Finally, we have included nonreflexive and reflexive verbal forms when the two are common, such as aplicar [“to apply/attach”] and aplicarse [“to apply oneself/work hard”], because there are important semantic differences between them.
From the 6,500 words, we created 130 questionnaires of 100 words each. This way, each word appeared in a different position in two questionnaires and was embedded in a different context of other words. Then we created three forms for each of the 130 questionnaires, so that each word was evaluated on a scale of 1–7 for three different values: concreteness, familiarity, and imageability. Subjective ratings were obtained in two different time windows. The first wave was obtained in 1998–1999 and corresponds to the data appearing in LEXESP (Sebastián-Gallés et al., 2000). The questionnaires were answered by undergraduates from 12 different Spanish universities: Universitat Autònoma de Barcelona, Universidad Autónoma de Madrid, Universitat de Barcelona, Universidad Complutense de Madrid, Universidad de Granada, Universidad de Oviedo, Universidad de La Laguna, Universitat Rovira i Virgili, Universitat de València, Universidad de Santiago de Compostela, Universidad de Málaga, and Universidad de Salamanca. Due to the random sampling, not all words were equally evaluated, and around 2,000 words in each dimension did not reach the minimum of 30 responses. In a second wave (taking place between 2007 and 2009), an additional set of undergraduate students from the Universitat de Barcelona and Universidad de La Laguna answered new questionnaires so that a minimum of 30 responses for each word was finally reached. The data present in EsPal are the average ratings for over 6,400 words from at least 30 participants and from at least 2 universities.
Index comparisons and validity
Frequency correlations: Correlations of frequency—that is, log(count + 1)—between different corpora (number of common words)
Word naming: Regression analysis results using word length and the frequency, log(count + 1), from different corpora on word naming times (Cuetos & Barbón, 2006); N = 235 to 240 items, depending on the corpus
Picture naming: Regression analysis results using word length and the frequency, log(count + 1), from different corpora on picture naming times (Cuetos, Ellis, & Alvarez, 1999); N = 137 to 139 items, depending on the corpus
EsPal currently provides the properties of two data sources, one written and one based on subtitles, with additional information based on the contextual diversity (by movie) of the subtitles data. We provide initial evidence that these data sources, the latter especially, are comparable to other corpora in Spanish in terms of their frequency data helping to predict some psycholinguistic phenomena. We should note, however, that there are some limitations that researchers should keep in mind when using the data contained in EsPal, especially the subtitle data. These data are based on a large number of amateur translations of media that are most often English, not Spanish, in source, and since proper nouns are typically not translated (e.g., “John” is not renamed “Juan”), such terms will appear with some frequency. We have used publicly available lists of “Spanish words” in order to restrict what is inserted into our databases, as well as allow comparison with other experimental data. Even so, when using EsPal to generate Spanish words for an experiment, one should have a native speaker, from the same culture as the subjects, cull out these perhaps undesirable elements. Nevertheless, our initial validation results suggest that despite what pollution may occur because of these foreign words, the frequencies given for the “true” Spanish words are useful.
EsPal is a free online application that makes available a wide range of frequency, orthographic, phonological, and subjective information about Spanish words. EsPal provides an extensible, ever-improving, and accurate set of data sources and analyses. Initial testing of the current data indicates that they are at least comparable to extant sources. This system may, therefore, assist the research communities of many disciplines to accelerate selection of stimuli for their experiments and thereby increase the rate of scientific progress.
A cutoff of 30 characters in word length was made for processing and memory considerations. Out of over 460 million tokens in the raw subtitle data set, only 735 tokens have a length greater than 30.
The system is designed such that at this stage, it would also have been possible to further reduce the words by removing accents or tildes and collapsing the counts across the subsequent word forms. Some psycholinguistic research questions, such as studies focused on stress assignment (e.g., Shelton, Gerfen, & Gutiérrez-Palma, 2011), might benefit from this type of frequency data. However, the first version of these sources has the actual form of the word.
Note that exceptions to the rules have not been implemented, and we are investigating other methods by which to derive phonetic transcriptions.
Note that it is common to have markers for the beginning and end of words as well; for example, casa would also produce the bigrams _c and a_ and the trigrams _ca and sa_. This information will be available in a subsequent version of the database.
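The boundary-marked bigrams and trigrams mentioned in this note can be generated as in the following minimal sketch. The function `ngrams` and its parameters are assumptions made for illustration and do not reflect the EsPal code.

```python
def ngrams(word, n, mark_boundaries=False, marker="_"):
    """Return the list of character n-grams of a word.

    If mark_boundaries is True, the marker is added before and after
    the word so that word-initial and word-final n-grams are included.
    """
    if mark_boundaries:
        word = f"{marker}{word}{marker}"
    return [word[i:i + n] for i in range(len(word) - n + 1)]

plain_bigrams = ngrams("casa", 2)                         # ['ca', 'as', 'sa']
marked_bigrams = ngrams("casa", 2, mark_boundaries=True)  # ['_c', 'ca', 'as', 'sa', 'a_']
marked_trigrams = ngrams("casa", 3, mark_boundaries=True) # ['_ca', 'cas', 'asa', 'sa_']
print(plain_bigrams, marked_bigrams, marked_trigrams)
```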
We would like to thank Daniel Diaz for his technical help during the initial phases of the project. Our reviewers have been extremely helpful as well. This work was partially funded by a grant, HUM2007–30271–E/FILO, from the Spanish Ministry of Science and Innovation.
- Alameda, J., & Cuetos, F. (1995). Diccionario de frecuencias de las unidades lingüísticas del español. Oviedo: Servicio de Publicaciones de la Universidad de Oviedo.
- Brysbaert, M., & Diependaele, K. (2012). Dealing with zero word frequencies: a review of the existing rules of thumb and a suggestion for an evidence-based choice. Behavior Research Methods. doi:10.3758/s13428-012-0270-5
- Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44(4), 991–997.
- Carreiras, M., Alvarez, C. J., & de Vega, M. (1993). Syllable frequency and visual word recognition in Spanish. Journal of Memory and Language, 32(6), 766–780.
- Cuetos, F., Ellis, A. W., & Alvarez, B. (1999). Naming times for the Snodgrass and Vanderwart pictures in Spanish. Behavior Research Methods, 31(4), 650–658.
- Cuetos-Vega, F., González-Nosti, M., Barbón-Gutiérrez, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicológica: Revista de Metodología y Psicología Experimental, 32(2), 133–143.
- Dimitropoulou, M., Duñabeitia, J. A., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-based word frequencies as the best estimate of reading behavior: The case of Greek. Frontiers in Psychology, 1(218).
- Hernández-Figueroa, Z., Rodríguez-Rodríguez, G., & Carreras-Riudavets, F. (2009). Separador de sílabas del español - Silabeador TIP. Retrieved from http://tip.dis.ulpgc.es
- Hulme, C., Roodenrys, S., Schweickert, R., Brown, G. D. A., Martin, S., & Stuart, G. (1997). Word-frequency effects on short-term memory tasks: evidence for a redintegration process in immediate serial recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(5), 1217.
- Juilland, A., & Chang-Rodríguez, E. (1964). Frequency dictionary of Spanish words. The Hague: Mouton.
- Marian, V., Bartolotti, J., Chabal, S., & Shook, A. (2012). CLEARPOND: Cross-Linguistic Easy-Access Resource for Phonological and Orthographic Neighborhood Densities. PLoS ONE, 7(8), e43230.
- Monsell, S. (1991). The nature and locus of word frequency effects in reading. In D. Besner & G. W. Humphreys (Eds.), Basic processes in reading: Visual word recognition (pp. 148–197). Hillsdale, NJ: Lawrence Erlbaum Associates.
- Moreno, A., & Mariño, J. B. (1998). Spanish dialects: Phonetic transcription. Fifth International Conference on Spoken Language Processing (ICSLP ’98) (pp. 189–192). Sydney, Australia.
- New, B., Pallier, C., Brysbaert, M., & Ferrand, L. (2004). Lexique 2: A new French lexical database. Behavior Research Methods, 36(3), 516–524.
- Nogueiras, A., & Mariño, J. (2009). SAGA: Transcriptor fonético de las variedades dialectales del español. Retrieved from http://www.talp.upc.edu/index.php/technology/tools/signal-processing-tools/81-saga
- Padró, L., Collado, M., Reese, S., Lloberes, M., & Castellón, I. (2010). Freeling 2.1: Five years of open-source language processing tools. Proceedings of 7th Language Resources and Evaluation Conference. La Valletta, Malta.
- Perea, M., Soares, A. P., & Comesaña, M. (2013). Contextual diversity is a main determinant of word-identification times in young readers. Journal of Experimental Child Psychology. doi:10.1016/j.jecp.2012.10.014
- Sebastián-Gallés, N., Martí, M., Carreiras, M., & Cuetos, F. (2000). LEXESP: Léxico Informatizado del Español. Barcelona: Universitat de Barcelona.
- Taulé, M., Martí, M. A., & Recasens, M. (2008). AnCora: Multilevel annotated corpora for Catalan and Spanish. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC-2008).