Investigating diachronic trends in phonological inventories using BDPROTO

Here we present an expanded version of bdproto, a database comprising phonological inventory data from 257 ancient and reconstructed languages. These data were extracted from historical linguistic reconstructions and brought together into a single unified, normalized, accessible, and Unicode-compliant language resource. This dataset is publicly available and we aim to engage language scientists doing research on language change and language evolution. Furthermore, we identify a hitherto undiscussed temporal bias that complicates the simple comparison of ancient and reconstructed languages with present-day languages. Due to the sparsity of the data and the absence of statistical and computational methods that can adequately handle this bias, we instead directly target rates of change within and across families, thereby providing a case study to highlight bdproto’s research viability; using phylogenetic comparative methods and high-resolution language family trees, we investigate whether consonantal and vocalic systems differ in their rates of change over the last 10,000 years. In light of the compilation of bdproto and the findings of our case study, we discuss the challenges involved in comparing the sound systems of reconstructed languages with modern day languages.

today. Additionally, it is not uncommon in the world's languages to drop sounds at the end of words, particularly vowels, as was probably done in French. Therefore the proto-form for the word 'tooth' shared by these Romance languages had the shape 'dente', written *dente to denote a reconstructed form. Sometimes such forms are preserved in historical records (in which they have been used to verify the accuracy of the comparative method), but more often they are hypothetical reconstructions. With a large enough amount of reconstructed vocabulary, language scientists can posit that the parent language of modern languages, in this case socalled Proto-Romance, contained a 'd' sound in its phonological inventory, i.e., its repertoire of contrastive speech sounds (Moran 2019).
Another illustration of the comparative method with a small dataset from four Polynesian languages is given in Table 1. These four related languages belong to the large Austronesian language family. The data are taken from Crowley and Bowern (2010). Again, the words in each line are cognates and together they form cognate sets. It is immediately visible that where Tongan, Samoan and Rarotongan have /t/, Hawaiian has /k/. Similarly, where Tongan and Rarotongan have /k/, Samoan and Hawaiian have a glottal stop /ʔ/ (Table 2). Such facts, where they recur across the vocabulary, are called REGULAR SOUND CORRESPONDENCES, and they are the bread and butter of comparative reconstruction.
Since it is highly implausible that such facts are unmotivated, historical linguists assume that these cognates developed from a common ancestral word. As a result, it is assumed that the sounds attested in each cognate reflect either (i) the original sound or (ii) a sound change. The considerations for identifying a particular sound change involve a number of heuristics, including (i) majority rules, i.e., the sound that is most frequent in the cognate set is the original one, (ii) knowledge of relative frequency of sound changes, e.g., a change from /h/ to zero is more frequent than the reverse, (iii) plausible phonetic motivations, and so on (for a standard list of such heuristics, see Campbell 2004, chap. 3). In the present case, historical linguists argue that in the /t:k/ correspondence set, /t/ is original, having persisted in Tongan, Samoan, and Rarotongan, and having changed to /k/ in Hawaiian. Similarly, /k/ in Tongan and Rarotongan is considered to have been maintained from an earlier stage, and Samoan and Hawaiian /ʔ/ to be the result of a change. The sounds that are reconstructed as original are called PROTO-PHONEMES, and are marked with an asterisk. These proto-phonemes allow us to reconstruct the vocabulary of protolanguages; the proto-forms of this cognate set are *tapu 'forbidden', *tahi 'sea', *kalo 'dodge', and *aka 'root'. The changes that can be inferred from the dataset are: *t [ k (Hawaiian); *h [ ; (all but Tongan); *l [ r (Samoan, Rototongan), and *k[ʔ (Samoan, Hawaiian). Incidentally, one can identify some relative chronology within these changes, as Proto-Polynesian /*k/ must have shifted to /ʔ/ in an ancestor of Hawaiian before Proto-Polynesian *t shifted to /k/; otherwise, all instances of /k/ would have shifted to /ʔ/.
Recently the comparative reconstruction approach outlined above has been implemented programmatically (Steiner et al. 2011), so that many of the timeconsuming and redundant tasks of the historical linguist, e.g., inferring regular sound change, are automated (Bouchard-Côté et al. 2013;Hruschka et al. 2015). The resulting score of similarity from pairwise sets of words across all languages in a sample can help to identify cognates. Expert judgment is still needed, but tools (List and Moran 2013;List et al. 2018) and interfaces (List 2017) allow even the non-tech-savvy linguist to quickly identify cognates from masses of raw data, such as word lists from thousands of languages (Wichmann et al. 2017). Word lists that are coded for cognacy and phonetic similarity scores can be used as input for one of many algorithms that generate language family phylogenies. These language family trees can then be used as input to phylogenetic comparative methods that were developed by biologists for investigating the tree of life, which have now been adopted and adapted by linguists and evolutionary anthropologists to address research questions about ancient language structures, cultures, and population movements (e.g., Dediu 2010;Dunn et al. 2011Dunn et al. , 2017Verkerk 2015;Zhou and Bowern 2015).
A particularly ripe but under-researched area of historical linguistics is the crosslinguistic analysis of the evolution of sound systems (cf. Marsico 1999). Whereas data exist in both diachronic and synchronic forms, i.e., there are many detailed reconstructions of proto-languages and the sound systems of currently spoken languages are available online (e.g., Moran et al. 2014), what has been lacking so far is a digitally-accessible database of the sound systems of proto-languages. Therefore, we extracted phonological inventories from the historical linguistics literature for a large number of ancient and reconstructed languages. In the next section, we describe our data extraction and aggregation pipeline and then describe the current language sample and the challenges that arise when comparing it with present-day spoken languages. Afterwards, we provide a case study that illustrates what BDPROTO has to offer in terms of research questions in historical and evolutionary linguistics.  3 The BDPROTO database

Primary data extraction
The phonological inventories in BDPROTO were extracted manually from source texts, 1 interpreted by experts, and then codified according to standardized Unicode conventions (Moran and Cysouw 2018) for the International Phonetic Alphabet (International Phonetic Association 1999). In practice, this means that we identified scientifically rigorous publications by experts in historical linguistics, who applied the historical-comparative method to synchronic datasets (e.g., word lists, phonological descriptions, grammars) and reconstructed the contrastive sound systems of proto-languages. Because linguists often use different phonetic transcription practices, we went through the source materials in detail and unified the phonetic practices into a well-defined notational convention. 2 This allows us to identify crosslinguistic comparative concepts for the domain of phonology (cf. Haspelmath 2010). We collected phonological inventory data into spreadsheets with separate tabs for the phonological inventories and for the metadata associated with each language.
Although each dataset (see Sect. 3.2) may have project-specific information (e.g., for the ancient languages we have descriptions of syllable structure), we specify a number of fields that are common to all input sources and are required for any additional resources that are added to BDPROTO. These fields include (their data types are given in parentheses): -BdprotoID (int): primary key -LanguageName (text): language name given in the source document (original resource) -Glottocode (text): a valid Glottolog code denoting a languoid 3 -LanguageFamily (text): name of top-level language family (e.g., Afro-Asiatic) -TimeDepthYBP (int): inferred date of the proto-language in years before present -TimeDepth (text): time depth as given in the source or elsewhere (if elsewhere, the source must be cited) -Homeland (text): where the proposed homeland of the proto-language was (either in the original resource or in another source, but then it must be specified in the HomelandSource field) -HomelandSource (text): source where the homeland data was taken from -MetadataNotes (text): notes regarding the metadata file -BibtexKey (text): BibTeX key -Phoneme (text): the phonological segment -PhonemeNotes (text): optional notes about the phonological segment The resulting datasets were put into a GitHub repository, 4 with supplemental metadata for each inventory, including estimates for its age and the homeland where it was spoken. Both the time depth and the homeland of language families are hotly debated issues (see for example the discussion of the age and heartland of Indo-European, Bouckaert et al. 2012;Chang et al. 2015). Each data point in BDPROTO is also associated with a Glottolog language identifier, so that it is positioned within a language family phylogeny (Hammarström et al. 2017). 5 Each inventory has one or more bibliographic citations, which are stored in a text-based BibTeX file, where the BibTeX key is given a transparent ID that is mapped to the source document, so that we can easily keep track of the digital documents (PDFs) from which the data were extracted. Lastly, an aggregation script was then written to integrate these independently collected datasets into a unified data dump for quantitative analysis (details are given in Sect. 3.2).
For data collection, aggregation, and dissemination, we use GitHub because it is a web-based hosting service with version control and issue tracking, both of which are useful for data curation and code maintenance. For example, when we spot a data or coding error, we can log it in the repository's issue tracker and assign the appropriate issue to either the data collectors or programmers, who subsequently fix and close the issue. Thus, we keep a record of how the data has been collected and what changes we have made to it. This method is in line with best practices in terms of reproducibility and the requirements of open data science. GitHub also allows us to tag stages of the data and code with version numbers that can then be converted into releases in the software engineering sense. Here we follow recommended practice by using semantic versioning, i.e., by following a major.minor.patch version numbering scheme. 6 Thus, GitHub allows us to easily curate the BDPROTO data and code base and to integrate versions into releases, which bundle up the data and the code in into zipped file releases. These files are then archived on Zenodo, 7 e.g., (https://doi.org/10.5281/ zenodo.3521639, https://doi.org/10.5281/zenodo.3522973), an open-access repository by OpenAIRE 8 and CERN. 9 Zenodo provides a Digital Object Identifier (DOI) 10 for each release, so that it can be cited, and importantly, so that each specific release is available for reproducability of scientific results.

Data aggregation
The BDPROTO database is comprised of several individual datasets that have been collected by different researchers in different places at different times. The oldest is the original BDPROTO data from Marsico (1999). This dataset comprises 101 phonological inventories of reconstructed proto-languages. Being collected in the 1990s, the data were stored in SQL tables in ISO 8859-1 encoding. We transformed the data into CSV files encoded in UTF-8 NFC with LF and no BOM, following the recommended linguistic data encoding practices put forth in Moran and Cysouw (2018). Given the legacy character encoding, we standardized character representations to current Unicode IPA standards (Moran and Cysouw 2018) and we follow the PHOIBLE IPA conventions (Moran 2012;Moran et al. 2014). 11 Two smaller sets of phonological inventories were collected in the Department of Comparative Linguistics at the University of Zurich. The first of these two included 15 inventories from reconstructed proto-languages. The second includes 21 inventories from ancient languages, by which we mean languages for which there are written records from which scholars have analyzed their phonologies, i.e., not languages which were reconstructed based on modern day sources. Both of these datasets were extracted from scientific publications and entered by hand into Excel spreadsheets and then exported as compliant Unicode Standard UTF-8. Each resource is kept separated in the BDPROTO GitHub repository and is updated periodically.
The most recently-collected dataset is part of an ongoing collaboration with the Department of Linguistics at the Hebrew University of Jerusalem. This dataset, which we refer to as HUJI, currently comprises 120 phonological inventories from reconstructed proto-languages extracted from expert sources.
To bring these resources together into a unified format, we aggregate their phonological inventory data and metadata with a script written in R Team (2013). 12 The aggregation script includes data sanity checks and it performs various cleaning procedures. For example, input data is notoriously messy when collected by hand, so we have several data checks that identify duplicated data or data in fields that do not follow our BDPROTO data specification. We use the R testing library TESTTHAT (Wickham 2011) to catch such errors.
The aggregation script then combines the inventory data from CSV files, joins in the additional linguistic and non-linguistic metadata (described above), and outputs the combined data sources as a denormalized CSV table and as an R data object (both are available in the GitHub repository). An example of the aggregated output is given in Table 3. After these sources have been combined, we extend the phoneme inventory data with vectors of binary distinctive features from the 37 phonetic features described in PHOIBLE (Moran et al. 2014). These include common articulatory phonetic descriptors, such as 'labial', 'dental', and 'consonantal', which allow us to categorize and compare the reconstructed sound systems of these proto-languages.

The language sample
In total, there are 257 phonological inventories in the combined BDPROTO sample; they represent 206 distinct reconstructed and ancient languages from over 100 different language families. The original BDPROTO sample was devised without duplicate entries and by considering the coherence of the proposed reconstructions and their relations to their modern daughter languages. The aggregation of the original BDPROTO sample with our more recent work results in duplicate data points, i.e., different reconstructions of phonological descriptions of the same protolanguage or different DOCULECTS (Cysouw and Good 2013). We consider multiple entries a feature of our database, which allows users to explore and compare different reconstructions by different experts. Nevertheless, both the old and new BDPROTO language samples present novel challenges for comparative linguists. For example, how do we fairly assess the frequency of a particular phenomenon, e.g., reconstructed sounds, in our set of proto-languages?

Towards a typology of ancient phonological segment inventories
Studies in typology face several statistical biases when attempting to describe distributions of linguistic phenomena in the world's languages (Sherman 1975). For example, to say something meaningful about the distribution of sounds in today's spoken languages, one must consider both the BIBLIOGRAPHIC BIAS (Bakker 2011) (i. e., not all languages have been documented, and as such, there is a serious lack of available data, which is skewed towards larger well-described languages) and the GENEALOGICAL BIAS (languages fit into language families in a power-law-like distribution, i.e., a few languages families contain the majority of the world's languages, while nearly a third of all language families consist of a single language or its dialects, like Basque). 13 Both biases mean that classical statistical methods for sampling are untenable for linguistic typology (Janssen et al. 2006). 14  While the BDPROTO sample is of course restricted by bibliographic bias (because there are relatively few language reconstructions currently available and the number of reconstructions is essentially non-exhaustive), it is also constrained by a form of genealogical bias that has not been identified in the literature, as far as we know. That is, the data points in the BDPROTO sample represent different stages of language families through time. This poses a problem for the direct comparison of protolanguages that have different ages, and for the direct comparison of proto-languages with present-day languages. For example, some data points in BDPROTO represent the root-level of a language family tree, also known as a language family STOCK (Nichols 1992). A stock is a genealogical group with considerable time depth, such as Indo-European. The stock is the highest reconstructible genealogical group. 15 As such, stocks may exhibit regular sound correspondences that are not transparent to non-specialists and hence must be reconstructed by experts.
Other data points in BDPROTO are intermediate nodes in existing proposed phylogenies. For example, there are expert reconstructions of ancient Celtic, Germanic, Romance, and Balto-Slavic, each of which represents an intermediate node within the branches of the Indo-European tree, i.e., daughter languages of Indo-European but also parent languages of currently spoken languages. These genealogical classifications are referred to as GENERA (Dryer 2005). A language family genus is a genealogical classification whose time depth is not more than roughly 4000 years, and may be much younger. The relatedness of languages within these groups is fairly obvious without systematic comparative analysis. Figure 1 lists 25 of the oldest languages in the BDPROTO sample and approximately when they were spoken. It contains both language family stocks and genera. Furthermore, some data points are nested within others, e.g., Finno-Permic is an intermediate node within the Uralic stock.
The fact that languages and families of various ages are nested recursively within larger families leads to a set of problems that we have called the TEMPORAL BIAS, which makes the comparison of ancient and proto-languages across time particularly problematic. First, the set of ancient and reconstructed languages is not temporally homogeneous. In other words, languages cannot be binned into "presentday" versus "ancient and reconstructed" categories. Each ancient or reconstructed language was spoken during a particular time period. In effect, BDPROTO (or any similar database) will include languages from 500 years ago, 2000 years ago, and, potentially, 8000 years ago. This means that from a purely chronological point of view, the blind comparison of such languages is akin to comparing an 8-year old, a 25-year old, and a 70-year old. A possible solution, in principle, would be to compare languages on the basis of their estimated ages, e.g., comparing only 15 It is generally agreed-upon that 10,000 years is the maximum time depth of reconstruction for the comparative method (Nichols 1992). Past this time depth, languages have simply had too much time to mutate in vocabulary through regular processes of sound change and it has not yet been discovered how to peer further back in time (although this is an active area of research, e.g., Pagel et al. 2013). Another tool of historical linguistics, INTERNAL RECONSTRUCTION, can be applied to proto-languages, thereby looking further back into linguistic prehistory; however, internal reconstruction is generally considered to be less reliable than reconstruction based on the comparative method (Ringe 2003). Finally, it has been suggested that certain structural properties of language may be deeply-conserved population markers, allowing linguists to reach even further back into human prehistory (Nichols 1994), but this is still controversial. languages that are estimated to have been spoken 1500-2000 years before present, and so on.
Secondly, however, language families are heterogeneous in terms of their size and diversification. For example, one stock may have many nodes (e.g., Indo-European), while another may have very few (e.g., Basque). This means that the blind comparison of ancient and reconstructed languages may be like comparing the branches of an angel oak tree with those of a birch tree (Fig. 2): at the same height from the ground, the number of nodes is very different, and, conversely, at the same taxonomic level (i.e., n nodes from the present), the height from the ground can be very different.
In order to make languages spoken at different time depths and at different nodes in a family tree structure comparable, one would either need to eliminate nodes from the more diversified families or to inflate the genealogical complexity of the less diversified families. Both of these would exacerbate the already-acute problem of data sparsity when existing genealogical diversity is taken into account. While most accounts estimate the number of distinct present-day languages as somewhere between 6000 and 7000, most also estimate the number of genealogically distinct stocks as relatively small. For example, Hammarström (2016) counts 424 such families, some of which are currently isolates or very small families. Such isolates are counted as stocks, even though they might be very recent or low-level nodes in families whose other members have been lost, which means that the genealogical diversity available to us for any particular time depth might be even less. On the other hand, some experts consider it plausible that some families are related to other families at an even deeper time depth than what can be confidently reconstructed using the comparative method. Nichols (1994) mentions Afroasiatic as a likely stock that cannot be confidently reconstructed, and points to plausible deeper relations between, e.g., Indo-European and Uralic, relations which cannot be proven by currently available methods. In other words, even 'independent' families may represent earlier dependencies. To sum up, the total genealogical diversity represented by a present-day sample of languages, if we go back in time as far as current methods allow us, is simply not that great, which means that for statistical analysis, our data is very sparse.
All in all, the deep-time language tree structure, with recursively nested language family splits, means that classical and current approaches to dealing with genealogical bias in typological studies (e.g., language family as random effect structures in mixed effects models, Moran et al. 2012) are unlikely to be adequate, and the problem itself is potentially computationally intractable. 16 Now consider, for example, if we want to compare the frequency distribution of reconstructed phonemes in proto-languages with sounds found in languages spoken today. Perhaps we want to ask whether the distribution of sounds in languages today are different from those from thousands of years ago. Figure 3 shows the 30 most frequent phonemes in the broadest sample of today's languages (Moran et al. 2014) (n = 1672) in rank comparison with the frequency of reconstructed phonemes in BDPROTO. We observe a general trend in the frequency of sounds cross-linguistically in modern spoken languages, i.e., they follow a power-law distribution in which articulatorily basic sounds, like 'm', 'i', 'k', 'a', are found across the majority languages sampled, but around half of all documented speech sounds are languagespecific (as already noted by Maddieson 1984). 17 The trend is similar in the BDPROTO sample; however, there are a number of differences. For example, whereas 'm' is the most frequent sound reported in today's languages, in reconstructed languages it There are two other aspects of the temporal bias that merit mentioning. One is methodological, i.e., the sounds of reconstructed languages tend to be normalized to 'representative' basic sounds, eliding phonetic detail available for present-day spoken languages. Another aspect of the temporal bias is the assumption that distributions of phonological segments is uniform across time, which may or may not be the case. As a result, phonological distributions should be visualized not only spatially, i.e.. via maps, but also temporally, i.e., via a time-line. 17 We provide a more detailed report of general summary statistics and plots of the current data in an R markdown report at: https://github.com/bdproto/bdproto/blob/master/src/descriptive_stats.md. is 'n'. There are also sharp dips and rises in the reported frequency of sounds like 'f' and 'r'. We can speculate on individual sounds. For example, the low frequency of 'f' illustrates the late emergence of labiodental sounds since the Neolithic due to changes in human bite configuration from food processing technology and agriculture (Blasi et al. 2019). The high frequency of 'r' likely reflects (along with other sounds including vowels) a tendency to normalize detailed synchronic sound descriptions into the basic phonological variants when using the historicalcomparative method to project back in time. Even so, we need to keep in mind that the reconstruction of the actual phonetic values of ancestral proto-sounds is an unresolved problem (Coleman et al. 2015), although Bayesian computational phylogenetic approaches are being developed to try to tackle it (e.g., Blasi et al. 2019;Aston et al. 2012). We return to this and other issues in the discussion in Sect.

5.
We can at this point, however, avoid genealogical and temporal biases by investigating specific language families, regardless of their age, by comparing certain properties of proto-languages directly with the currently available data of its daughter languages. Moreover, although we cannot give precise phonetic values for reconstructed sounds from thousands of years ago, we can be fairly certain that proto-languages derived through the historical-comparative method provide us with a fairly accurate number of contrastive sounds in the phonological inventories that the parent languages would have had. Hence, we can compare whether ancient and

Case study: evolutionary rates of consonant and vowel inventories
In a study of whether phonological inventories have become more or less complex over time, Marsico (1999) shows that languages dating back as far as 10,000 years are equally complex in terms of their number of segments, consonant/vowel ratio, average number of consonants and vowels, and frequency hierarchy of the segments. However, Marsico (1999) also notes that modern languages tend to have slightly more consonants today than their ancestors did in the past. The same does not apply to vowels. On average the number of consonants and vowels across proto-languages in BDPROTO are 18 and 8, respectively. In comparison, modern spoken languages have on average 22 consonants and 8 vowels (Maddieson 1984). 18 This is in itself an interesting finding, because it is typically assumed that reconstructed phonological systems are likely to show more complexity than their daughter languages, due to inherent biases of the comparative method. For example, Fox (1995) argues that some reconstructions of the Proto-Indo-European consonant system contain more consonants than any of the daughter languages. If reconstructed languages indeed tend to have an artefactual inflation of inventory size, this would mean that the difference between reconstructed and present-day phonological inventories is underestimated in the present study.
Why is it that we observe more consonants in phonological inventories today than we see in reconstructed ancient languages of the past? We decided to test whether eight language families show greater rates of change in consonant inventory size as compared to vowel inventory size using phylogenetic comparative methods (see also Marsico et al. 2018, Moran and. Specifically, we use BayesTraits V3 (Meade and Pagel 2014), which implements a generalized least squares approach to modeling the evolution of continuously varying traits (Pagel 1997 and multistate traits (Pagel et al. 2004). We chose these eight language families because they have high-resolution expert-created phylogenies: Arawakan (language sample used for current study n=38; Walker and Ribeiro 2011), Austronesian (83; Gray et al. 2009 Figure 4 shows box plots of the ranges of vowel and consonant inventory size in the language samples used for phylogenetic ancestral state estimations. P gives proto-language reconstruction from BDPROTO. R gives the ancestral state estimation using BayesTrait V3's analysis Continuous. An asterisk * indicates whether the rate of change of vowel or consonant inventory size is faster. 19 Table 4 gives the mean rates of change of consonant and vowels on the branches of the listed phylogenetic tree sets by 1000s of years. Our results suggest a mixed picture for the acquisition of new consonants vs vowels over the last 10,000 years. In five out of eight language families sampled, the rate of change in consonants systems is greater. But in Turkic, Arawakan and Indo-European, vowel inventory   19 We also observe that ancestral state estimates of vowel and consonant inventory sizes are generally closer to the mean of the range than expert reconstructions of proto-languages. This means there is a difference between the well-worked historical-comparative method used by linguists to reconstruct protolanguages and the ancestral states generated through phylogenetic analysis. The latter optimize ancestral states by looking both at the ancestor as well as the daughter(s) of each node, which may have an effect of averaging out differences rather than picking up on (directional) trends. This observation warrants a closer evaluation using directional models of feature change.
size changes faster than consonant inventory size. These three language families have in common a wider range of vowel inventory sizes as compared to the other families. However, if we take into account the mean and standard deviations of the rates given in Table 4, a high variance does not always entail a high rate of change and vice versa. Dravidian, for instance, has the second highest rate of change for consonants, but the Dravidian languages are less variable in their inventory size than Bantu and Indo-European. While this is an interesting result, it is not clear a priori whether modeling change in phoneme inventory size should be done this way. BayesTraits V3's (Meade and Pagel 2014) Continuous analysis considers all numbers floats and can hence arrive at an ancestral inventory size of 8.5, for example, which is not in line with our understanding of how phoneme inventories evolve. An alternative approach would be using BayesTraits V3's Multistate analysis, which regards all inventory sizes as separate states, separate entities with no pre-defined relationship between them. However, this approach has its own problems. Bantu has 36 different consonant inventory sizes. As a result, to model these requires 36 × 36 − 36 = 1260 parameters (i.e., change from inventory size 6 to inventory size 7, from size 6 to inventory size 8, etc.). These have to be restricted in a meaningful way. In the following analyses, we restrict these changes to a single parameter so that we can compare the value of this parameter between consonant and vowel inventory sizes. Ideally, however, we would allow increases or decreases of, say, 2 or 3 phonemes at a time too, because if a particular feature is added to a phonological system (such as lengthening or nasalization) this may result in the instantaneous increase or decrease of the inventory across some number of affected phonemes. Such a more refined model is not possible within Multistate, which is agnostic with regard to such specificity. Table 5 shows the median rate parameter values for modeling consonant and vowel inventory size in eight language families. The Multistate model seems to arrive at two solutions that are fundamentally different in nature: the first four families, Arawakan, Dravidian, Tupi-Guarani and Turkic, have high rates of change, with large standard deviations, both for consonants and for vowels. The last three families, Austronesian, Bantu, and Pama-Nyungan, have very low rates, and a difference between consonants and vowels, i.e., consonants evolve faster. Indo-European is somewhere in-between: for vowels we observe the first pattern, while the analysis of the consonant inventories shows a bipartite distribution as it contains both very low and high rates (see the low median but the high standard deviation). The posterior distributions of the rate parameter values for Austronesian, Bantu, and Pama-Nyungan are displayed in Fig. 5. The main difference between these two groups is the size of the family: Arawakan, Dravidian, Tupi-Guarani and Turkic are small in terms of number of daughter languages, Austronesian, Bantu, and Pama-Nyungan are big, Indo-European is in-between. The high rate typical for the first group signifies a poorly updated prior (uniform, 0-100) and thus indicates a poor fit of the data, presumably because of the small size of the datasets. The diverging results for the three bigger families show that the Multistate model is relevant for analyzing phoneme inventory size, given a large enough family. Figure 5 shows that in Austronesian, Bantu, and Pama-Nyungan, consonants evolve faster than vowels. This matches the results of the Continuous model for these three families (see Fig. 4), where it was Arawakan, Indo-European, and Turkic for which vowel inventories change faster than consonant inventories.
The results of the Continuous analyses suggest differential rates of change in consonants and vowels by language family, while the Multistate analyses are uninformative in this regard. This finding is surprising to us because the synchronic data suggest that there is a diachronic pressure on languages to expand their consonant inventories at a greater rate than vowels, in line with the finding by Marsico (1999). For example, on average languages have more consonants than vowels, so we might expect phonological inventories to universally acquire consonants at a faster rate. Consonants are also more likely to be borrowed than vowels (Moran et al. 2014). Lastly, the synchronic data show more phonetic diversity in consonant inventories, suggesting a greater number of lexical contrasts made available by consonants. For example, there are three times as many contrastive consonants than vowels in the world's languages. Consonant inventories also range more in size, from 6 to 90 (Rotokas in Papua New Guinea vs the click language !Xu, spoken in Botswana and Namibia) than vowel qualities, which range from 2 to 14 (Maddieson 2013a, b). All in all, it is surprising that any language family shows a greater rate of change in vowel inventories than in consonant inventories. We end this section by highlighting the family-specific nature of patterns of change. 20 This finding is in contrast to the null hypothesis, i.e., that there is no relationship between family and rate of change in the domain of phonological segment inventories or parts thereof.

Discussion
Our findings warrant further research, but we might already speculate on where to look next. First of all, phonological segment inventories can be subject to a highly diverse range of potentially universal pressures (or biases or constraints). Such pressures can be related to the synchronic structure of phonological systems (e.g., Clements 2009), articulatory or perceptual properties of sounds (e.g., Ohala 1981Ohala , 2010, usage-based effects on languages (e.g., Bybee 2001Bybee , 2006, or the function of sounds as the basis of lexical contrasts. But they can also be related to socio-cultural, biological, or environmental factors, such as population size, genes, or humidity. Many such factors have been proposed in the literature (e.g., Pericliev 2004;Hay and Bauer 2007;Everett 2013;Greenhill 2014;Everett et al. 2015;Maddieson and Coupé 2015;Everett 2017;Benítez-Burraco and Moran 2018) and continue to be hotly debated (e.g., Moran et al. 2012;Donohue and Nichols 2011;Greenhill 2016;Roberts 2018;Mendívil-Giró 2018). In the following, we mention just a few candidates for causal factors that shape phonological segment inventories by shaping the directionality of language change. Inventories of both vowels and consonants can be extended through the use of secondary articulatory features. For example, a vowel space can be expanded straightforwardly by the contrastive features of length and nasalization. On the other hand, consonant inventories can be expanded by numerous features, such as length, labialization, palatalization, velarization, and more. 21 Mathematically, it may be the case that there is a greater number of dimensions for consonant inventories to expand, but that there are other constraints on how consonant or vowel inventories increase in size. Hence to create more and more vocabulary, increasing the number of contrastive sounds in the phonological inventory while keeping the number of distinctive phonetic features at a minimum is said to embody the principle of feature economy (Clements 2009). An example is given by Moran (2012), who shows that vowel systems tend to expand from the cardinal vowels through the highly economic features of length and nasalization before filling in the vowel space with peripheral vowels that require finer articulatory features and perceptual distinctions. Furthermore, there is an asymmetry of feature economy such that vowel inventories tend to be more economical than consonant inventories (Coupé et al. 2011). Thus, the articulatory and perceptual constraints that may govern the changes in phonological inventories over time must be incorporated into models of the evolution of spoken languages.
Another important consideration related to lexical contrasts and sound changes over time is that of FUNCTIONAL LOAD. Based on the premise that the primary function of a phonological system is to allow utterances to be distinguished, Hockett (1966) explored the idea that some phonological segments "do more of this job than others". Martinet (1955Martinet ( , 1977 hypothesized that segments with a higher functional load would be more resistant to change. Interestingly, it has been found, albeit on the basis of a small cross-linguistic sample, that consonants tend to have a higher functional load than vowels (Oh et al. 2013). While Martinet's hypothesis is not supported by all studies, the present study corroborates it-to the extent that the finding that consonants tend to have a higher functional load than vowels can be generalized-on the basis of a large cross-linguistic database.
Functional load, in turn, may be causally related to sound change. For example, it might be that vowels are more likely than consonants to be involved in mergers, thereby favoring the reduction of vowel inventories. Vowels may also be more likely to involved in chain shifts, which could be taken as indicating a generally lower functional load. However, the empirical evidence does not allow us to make these generalizations, since there are no large-scale databases of sound changes, and those databases that do exist do not provide information about phonological change per se (e.g., mergers and splits). As such, we merely point to the possibility that vowels and consonants participate differentially in classical phonological sound changes.
Whatever the pressures or biases shaping phonological segment inventories may be, the findings of our case study point to several facts that warrant attention. First is the fact that the evolution of phonological inventories does not follow a universal pathway. Rather, individual families show particular biases of change. Secondly, all families show a clear directionality of change, i.e., there are no unbiased families. Bickel (2013) argues that situations like these, i.e., when there is a cross-family preference for change while individual families show particular preferences, point to (i) the operation of universal pressures leading to preferred pathways or directions of change, and (ii) the relative strength of these pressures. In the present case, the findings from our sample provide preliminary (and hence tentative) evidence for both pressures that lead to an expansion of consonant inventories and pressures that lead to an expansion of vowel inventories. These pressures must be relatively strong, since no unbiased families were found in our sample, but it is hard to say which pressures are stronger, because the respective proportions of vowelincreasing and consonant-increasing families in the sample are very close (5/8 vs. 3/8). Obviously, in order to further explore this avenue of research, a larger sample is needed. Likewise, much work remains to be done in order to investigate particular causal theories linking a particular pressure (e.g., functional load) with a particular change preference (e.g., higher rate of change).
However, it should also be pointed out that the results reported here may also stem at least in part from biases inherent to the data and the methods used. As we noted in Sect. 3.3, historical-comparative reconstructions of phonological inventories suffer from a number of empirical issues and sampling biases.
First, the reconstruction of actual phonetic values of ancestral proto-sounds is an unresolved problem (Blasi et al. 2019;Coleman et al. 2015;Aston et al. 2012). For many present-day languages, we have access to detailed phonetic descriptions (including audio recordings and their computerized acoustic analysis), which allow us to identify with some degree of certainty the primary phonetic realizations of particular segment types. For example, it is increasingly clear whether a voiceless stop is best analyzed as dental or alveolar in a particular language. However, reconstructed proto-languages do not allow for such detailed analysis, for obvious reasons. As a result, there is a tendency to normalize detailed synchronic sound descriptions into the basic phonological variants when using the historicalcomparative method to project back in time. In practice, this means that it is likely that, e.g., taps, trills, and flaps are all normalized to /r/, or close-mid and open-mid vowels are normalized to /e/, and so on. Additionally, reconstructing sounds relies on the principle of maximum parsimony, which fails to take into account sound change reversals, i.e., a sound change that takes place, but then reverses back to its previous sound. Presently, these problems limit the overall accuracy of statements about the worldwide distribution of sounds and their precise phonetic values in ancient languages. This is a ripe area for future research (cf. Blasi et al. 2019;Coleman et al. 2015;Aston et al. 2012).
Second, there are several biases regarding statistical sampling that are problematic for cross-linguistic language data in general, and BDPROTO in particular. These include the bibliographic bias (not all languages are documented, so our data samples are incomplete) and the genealogical bias (language families contain an uneven and skewed distribution of languages due to the effects of history). In addition to these well-described biases, we propose a novel problem for the comparative study of ancient and modern languages in terms of language evolution, i.e., the temporal bias. The temporal bias is due to language family trees being recursively nested data structures, which in turn encode multiple levels of language family splits, through time. 22 The temporal bias presents a potentially computationally intractable problem for current statistical approaches that address genealogical bias in comparative linguistic studies. Namely, to account for autocorrelation, which arises from the inter-relatedness of languages within language families, the confound 'language family' is typically treated as a random structure in mixed effects models. However, approaches that can deal with nested temporal data points must be found if we are to describe in more detail the differences in phonological systems of ancient and modern languages crosslinguistically.
Hence, another avenue for future research is to investigate the phonetic and phonological features involved in creating lexical contrasts in languages. One approach is taken for example by Nikolaev and Grossman (2018), who count places of articulation, rather than phonemes. They argue that adding new secondary articulations is easier than adding new places of articulation, so their model presupposes that places are more stable than individual segment counts. The investigation of broad phonetic and phonological features and how they may have evolved in phonological systems over time is another area for further research using BDPROTO.

Conclusion
In this article, we present an expanded version of BDPROTO (Marsico et al. 2018), an open-access database of phonological inventories from a sample of 257 ancient and reconstructed languages. 23 BDPROTO provides a rich resource for investigating historically reconstructed languages and whether they show any significant changes with languages spoken today. After an initial brief overview of the historicalcomparative method, we describe the data extraction and aggregation techniques that we used to create the BDPROTO database. Finally, in a case study we use phylogenetic methods to show that the evolution of consonant and vowel systems have differential rates of change -an unexpected observation given what we know about the ancient and reconstructed languages in the BDPROTO sample and their modern descendants. We hope that this resource will be used by both traditional philologists and computational linguist alike, and everybody in-between. Since there are often several reconstructions for a given proto-language, collecting these in one place will allow for critical evaluation and further study.
This study also identifies a type of bias in typological studies that has not yet been identified, namely, temporal bias. This bias stems from the fact that languages are nested within families that are nested within larger families, which are in turn nested within larger families. This problem is likely to be computationally intractable, at least with respect to traditional methods of treating family as a random effect. This is in line with the finding of Piantadosi and Gibson (2014), who estimate that one would need hundreds of independent languages in order to demonstrate a linguistic universal above and beyond genealogical and areal effects, and are generally pessimistic about the sample size of present-day languages and families providing an empirical basis for such inferences.
A way forward may be found in approaches that directly target language change. At present, such approaches focus on estimating the probabilities of particular changes (e.g., the Family Bias Method, Bickel 2013). However, there may be a limit to what one can do with increasingly sophisticated statistical analyses applied to currently available data. Ultimately, it is likely that in order to directly model language change, resources much richer than those currently available will be necessary. For making progress in research on the evolution of phonological systems, a suite of resources-of which BDPROTO would be part of-can be envisioned. Other such future resources include (i) extensive cross-linguistic databases of sound change, both of pathways (represented, minimally, as inputoutput processes in particular contexts) and of classical types of phonological change, such as merger and split; (ii) cross-linguistic databases of contact-induced phonological change, such as the borrowing of segments and other phonological properties (such as tone); and (iii) databases of expert-established and computationally-derived phylogenies that include best estimates of time depths of the various nodes of families.