1 Introduction

Due to the specific morpheme-syllabic character of the Chinese writing system (Chao 1968: 121), we have considerably fewer clues regarding the original pronunciation of the oldest attested stages of the Chinese language than we do for languages which are written in alphabetic writing systems. As a result, reconstructing the pronunciation of Old Chinese constitutes a challenge in its own right, and quite a few scholars have proposed a variety of reconstructions which differ considerably from one another (Li 李方桂 1971; Karlgren 1957; Wang 王力 1980; Pan 潘悟云 2000; Starostin 1989; Baxter 1992; Zheng Zhang 郑张尚芳 2003). Apart from the internal structure of Chinese characters, rhyme evidence plays a crucial role in the reconstruction of Old Chinese phonology (Baxter 1992). Based on the fundamental assumption that words which regularly rhyme in older stages of Chinese reflect words with similar pronunciation in their finals, we can systematically investigate Chinese poetry from coherent epochs, assigning words to classes of similar pronunciations. In classical Chinese scholarship, rhyme analysis has a long tradition, going back to scholars like Wu Yu 吳棫 (1100–1154), who was one of the first to systematically assign Chinese characters to specific rhyme classes (He 何九盈 2006: 163).

Up to the end of the 19th century, traditional Chinese rhyme analysis, which was especially devoted to 詩經 shijing ‘the Book of Odes’ (ca. 1050–600 BC), led to the identification of more than 30 distinct rhyme categories (韻部 yunbu, see Baxter 1992: 141–150). The classical approach to rhyme analysis, sometimes called 丝贯绳牵法 siguan shengqian fa ‘link-and-bind method’ (Geng 耿振生 2004), or 韵脚系联法 yunjiao xilian fa ‘rhyme linking method’ (Lv 呂胜男 2009) starts from the collection of words which can be shown to rhyme with each other (usually represented by one Chinese character), and then clusters these words into rhyme groups by applying a greedy strategy (Geng 耿振生 2004). This strategy searches exhaustively for connected components in a rhyme network in which rhyme words are modeled as nodes and attested rhyme instances are represented as links between the nodes (List 2017).
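The linking strategy described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any particular scholar: the function name `rhyme_groups` and the letters used as rhyme words are hypothetical, and the attested rhyme instances are assumed to be given as simple pairs.

```python
from collections import defaultdict

def rhyme_groups(rhyme_instances):
    """Cluster rhyme words into groups by linking every attested rhyme pair."""
    # Build an undirected adjacency list: one node per rhyme word.
    neighbors = defaultdict(set)
    for word_a, word_b in rhyme_instances:
        neighbors[word_a].add(word_b)
        neighbors[word_b].add(word_a)

    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(neighbors[word] - component)
        seen |= component
        groups.append(component)
    return groups

# Hypothetical rhyme pairs; the letters stand in for rhyme words.
pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("C", "D")]
groups = rhyme_groups(pairs)
```

In this toy input, the single pair ("C", "D") is enough to merge two otherwise separate groups into one, which is how one doubtful rhyme can fuse whole categories under this strategy.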

The most obvious drawback of the classical rhyme analysis is its resolution power; following the idea of connected components blindly will yield very large groups of rhymes and a very small number of distinct categories. The classical analysis favors lumping over splitting, and is furthermore vulnerable to incorrectly identified rhyme patterns and other kinds of errors in the data. The problems of the classical rhyme analysis were explicitly addressed in the Old Chinese reconstruction system of Baxter (1992), which proposed six main vowels for Old Chinese and a total of 52 distinct rhyme groups, thus drastically expanding the number of rhyme categories proposed for Old Chinese by classical scholarship. The choice of a six-vowel system was further substantiated by the fact that the reconstruction systems by Sergei A. Starostin and Zheng Zhang Shangfang 郑张尚芳, proposed independently around the same time, also employed six vowels (see Starostin 1989; Zheng Zhang 郑张尚芳 2003). The proposal by Baxter (1992) was further substantiated by a statistical test which tested the likelihood of specific rhyme category groupings to have occurred by chance. In the recently proposed new reconstruction for Old Chinese by Baxter and Sagart (2014), the rhyme schema by Baxter (1992) was only slightly modified by adding a new coda *-r for rhyme words which rhyme both with words in coda *-n and *-j. This resulted in six additional rhyme categories, one for each of the six main vowels *a, *e, *i, *o, *u, and *ə.

2 Vowel purity and rhyme evidence

According to Ho (2016: 176–184), the Old Chinese reconstruction by Baxter and Sagart (2014) contradicts important rhyming principles, especially the principle of vowel purity, according to which rhymes in the Book of Odes were very strict regarding the identity of vowels, while consonant differences could easily be tolerated. According to the author, vowel purity is in conflict in many cases where pronunciations as suggested by the Old Chinese reconstruction by Baxter and Sagart point to different vowels, while the respective words frequently rhyme in the Book of Odes. The argument by Ho (2016) rests on two fundamental assumptions. First, Ho assumes that vowel purity was a key principle in Chinese rhyming. Second, Ho claims that the reconstruction system by Baxter and Sagart is in strong conflict with this principle. Unfortunately, he does not provide any concrete examples, apart from contrasting traditional rhyming categories with the more fine-grained rhyming categories as they were first proposed by Baxter (1992).

Due to the lack of external evidence for Old Chinese pronunciation, the first assumption is very difficult to check. The argument of the author itself rests uniquely on perceived rhyming tendencies in current folk traditions in China. While they may seem suggestive at first sight, they stand in strong contrast to classical rhyme traditions which evolved during the Tang dynasty (618–907) and took the prescriptions in official rhyme books for granted, as well as to cross-linguistic tendencies of rhyme production, which may favor similarity in vowels, but do not necessarily prescribe identity. This is, for example, reflected in German rhyme tradition, in which words with vowels [y] and [i] freely rhyme with each other, as in nieder [niːdər] ‘down’ and Brüder [bryːdər] ‘brothers’ (see also Peust 2014: 62; Footnote 1). Another obvious problem of vowel purity is the fact that the Book of Odes from which the rhyme categories are drawn does not reflect a coherent speech variety that was spoken at a single place and time (Baxter 1992: 343–366). On the contrary, the Book of Odes was compiled over a period of at least 400 years (from about 1000 until 600 BC, cf. Kern 2004), and scholars have long suggested that certain passages reflect dialectal rhyme patterns (Baxter and Sagart 2014: 278f). So even when disregarding the problem of overarching rhyme traditions superimposed by society, it would be rather surprising if the system of rhyming showed no stages of transition and conflicts resulting from language change and dialectal influence.

We can illustrate this further by having a look at concrete poems in the Book of Odes. Table 1 gives Ode 10 as an example, contrasting what scholars believe reflects the perceived rhyme structure during the time the poem was composed (column rhyme), the traditional opinion regarding the rhyme group to which the rhyme words belong (column group), and reconstructions in four different systems (see the table for details). As we can see from this example, stanza 1 shows an impure rhyme in two systems, contrasting the vowels [ə] and [e], namely, those of Pan Wuyun 潘悟云 (Pan 潘悟云 2000) and Wang Li 王力 (Wang 王力 1980). This impure rhyme was also recognized in traditional Chinese phonology, as the rhyme words belong to the two distinct traditional rhyme groups 微 wei and 脂 zhi. The OCBS system (Baxter and Sagart 2014) and the system by Starostin (Starostin 1989) do not show this conflict, as they propose only the vowel [ə] in this group. If we compare across the following stanzas, we can see that all reconstruction systems show specific conflicts regarding the principle of vowel purity, including the traditional classification upon which Ho (2016) bases his criticism. A crucial question for Old Chinese reconstruction is to what degree one should try to avoid impure rhymes, and to what degree one should accept them as reflecting vivid poetry which does not necessarily follow strict rules. How much vowel purity do we need to assume for the Book of Odes?

Table 1 Comparing impure and pure rhymes in Ode 10 and how they are reflected in different reconstruction systems

We cannot directly test the importance of vowel purity for Old Chinese rhyming, as our information regarding Old Chinese vowels relies on reconstructions, and these reconstructions may well have been proposed with the principle in mind, be it explicitly or intuitively. Whether a given reconstruction system is in strong conflict with the vowel purity principle, on the other hand, can be directly tested by inspecting the actual data. Given the restricted corpus of the Book of Odes, an exhaustive investigation of the conflicting cases is possible, and one could compare all Odes in the corpus in different reconstruction systems, just as we have illustrated for Ode 10 in Table 1. Such a qualitative evaluation has the obvious disadvantage that it would be very time-consuming, both for the experts who carry it out and for the scholars who read the reports. In order to avoid the problems resulting from manual comparisons, we propose a quantitative test that automatically measures the degree by which reconstruction systems deviate from the principle of vowel purity. By modeling Chinese rhyme data from the Book of Odes as a weighted network in which rhyme words serve as the nodes and attested rhyme occurrences in the Book of Odes are modeled as links between the rhyme words, we can not only test how well a given reconstruction system conforms to Ho’s (2016) vowel purity criterion, but we can even compare alternative reconstruction systems directly with each other.

3 Evaluating vowel purity in reconstruction

3.1 Materials

3.1.1 Rhyme data

The rhyme data used for the experiment follow the rhyme assignments for the Book of Odes provided in Baxter (1992), which were digitized and converted into a machine-readable format in List (2017). The data are available online as an interactive application, the Shījīng Rhyme Browser (http://digling.org/shijing/), where all rhyme decisions can be interactively searched and inspected in the reconstruction systems by Baxter and Sagart (2014) and Pan 潘悟云 (2000). The former is available for download; the latter was taken from the Thesaurus Linguae Sericae (Harbsmeier and Jiang 2009). The dataset lists all potential rhyme words in the Book of Odes, which were determined by taking the final character in each line of each stanza across the 305 poems of the Book of Odes. This list of potential rhyme words is contrasted with the actual rhyme words as assigned in Baxter (1992). The interactive application visualizes rhyme annotations by coloring words which are marked as rhyming in the same color, as shown in Table 2 for poem number 60.

Table 2 Example of the structure and display of rhymes of the Book of Odes in the Shījīng rhyme browser

In List (2017), the rhyme data are used to construct a rhyme network of all rhyming words in the Book of Odes. In this network, rhyme words (represented by Chinese characters) are represented as nodes, and links between the nodes are drawn whenever two rhyme words actually rhyme in the Book of Odes. The whole network comprises 1845 nodes and 5266 links between the nodes. The number of recurring links between two nodes is counted and weighted, using specific weighting principles, such as (a) counting formulaic lines (lines recurring in the collection) only once, and (b) taking the size of the group in which two words rhyme into account when establishing the weights (so that large groups of rhyming words are not scored more often than smaller ones). As network weighting itself is not of primary importance for the approach presented in this paper, we refer the readers to List (2017), where the rhyme network construction process is described in detail. All data underlying the study are accessible online at https://zenodo.org/badge/latestdoi/43676744, and we used these data to create the rhyme network for our study. Figure 1 illustrates the structure of the rhyme network by showing a small part of the full graph, corresponding to the codas reconstructed as *-ar, *-an, and *-aj in the reconstruction of Baxter and Sagart (2014).
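As a rough sketch of how such a network can be assembled, the following Python snippet turns a list of rhyme sets into link counts. It only counts plain co-occurrences: the weighting by group size and the special treatment of formulaic lines described in List (2017) are not reproduced here, and the function name `build_rhyme_network` as well as the toy rhyme sets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_rhyme_network(rhyme_sets):
    """Count how often each unordered pair of rhyme words rhymes together."""
    edge_counts = Counter()
    for words in rhyme_sets:
        # Every pair of distinct words in one rhyme set yields one link.
        for pair in combinations(sorted(set(words)), 2):
            edge_counts[pair] += 1
    return edge_counts

# Hypothetical rhyme sets (one per rhyme sequence in a stanza).
rhyme_sets = [["A", "B", "C"], ["A", "B"], ["C", "D"]]
network = build_rhyme_network(rhyme_sets)
nodes = {word for pair in network for word in pair}
```

Here the pair ("A", "B") receives a count of 2 because it rhymes in two different sets, while all other pairs are attested only once.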

Fig. 1

Example for a small part of the rhyme network based on the data in List (2017), for rhyme words reconstructed with coda *-ar (black nodes), *-aj (gray nodes), and *-an (white nodes) in the reconstruction system of Baxter and Sagart (2014). Nodes correspond to rhyme words, and edges indicate whether the nodes they connect rhyme together in the Book of Odes. Edge weights represent the frequency of rhyme-word co-occurrences, and node weights represent the general frequency by which the words occur in rhyme position in the Book of Odes

3.1.2 Reconstruction systems

For all 1845 rhyme words in the network, Old Chinese readings in eight different reconstruction systems were collected from different sources. The system of Baxter and Sagart (2014) is available online for download. Unfortunately, it covers only 1431 characters of the full set of 1845 rhyme words, and 414 readings are missing. The Eastling project (Shanghai Normal University 上海师范大学 2016, http://www.eastling.org/oc/oldage.aspx) offers Old Chinese reconstructions by various authors, including the systems proposed in Karlgren (1957), Li 李方桂 (1971), Wang 王力 (1980), Zheng Zhang 郑张尚芳 (2003), and the most recent proposals according to the system of Pan 潘悟云 (2000). The Eastling data have a broad coverage, and only 15 out of 1845 readings in the original rhyme data from List (2017) were missing in this collection, thus comprising a total of 1830 readings for each of the five different reconstruction systems. In order to make sure that these different systems are reflected correctly, we compared the Eastling data with original and alternative sources. For Li 李方桂 (1971), we compared the Eastling data with the charts provided in Shen 沈鍾偉 (2005) (see Footnote 2), and for Wang 王力 (1980) and Karlgren (1957), we compared them with the original sources. Given that Pan Wuyun 潘悟云 and Zheng Zhang Shangfang 郑张尚芳 were involved in the creation of Eastling, and that especially the reconstructions of the system outlined in Pan 潘悟云 (2000) are only available online, we assume that the data for these two reconstruction systems are truthfully displayed. Apart from a few incorrect characters in the source by Wang 王力 (1980), which we manually corrected, our comparison did not reveal any errors. In addition to the five reconstruction systems, Eastling also offers readings attributed to William Baxter, but since we could not match these readings with any known published source by Baxter, we did not use them in our analysis.

The Tower of Babel project (Starostin 2008, http://starling.rinet.ru/) further offers an exhaustive database of character readings following the Old Chinese reconstruction system by Starostin (1989), which was compiled by Sergei Starostin himself from 1991 on and was expanded in the years thereafter. While the original publication by Starostin (1989) lists readings for all rhyme words in the Book of Odes, the online version only offers 1358 character readings for the 1845 characters in our base list, with 487 readings missing. The Old Chinese reconstruction by Schuessler (2007) was collected from a recently published digital version of the book. Unfortunately, only 1224 readings for the 1845 rhyming characters in the Book of Odes could be found, leaving us with 621 missing character readings.

In order to compare the different rhyme systems for vowel purity, the main vowels for all available character readings for the 1845 rhyme words in the rhyme networks were extracted and added as meta-data to each rhyme in the network. The different vowel systems proposed in the different reconstruction systems are shown in Table 3. Although each of our 8 systems has much more than 1200 readings (see column 3 in Table 3), the intersection between all systems is surprisingly low, and if we only retain those readings reflected in all samples, a sample of 875 nodes remains. The data by Schuessler (2007) is missing the largest amount of characters (621 readings), followed by the data of Starostin (1989, 487 readings), and Baxter and Sagart (2014, 414 readings).
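Restricting the comparison to characters covered by every system amounts to a set intersection. The snippet below is a minimal sketch under the assumption that each system is stored as a mapping from characters to main vowels; the system names, characters, and vowels are placeholders, not data from the study.

```python
def shared_characters(readings_by_system):
    """Return the characters that have a reading in every system."""
    character_sets = [set(readings) for readings in readings_by_system.values()]
    return set.intersection(*character_sets)

# Hypothetical readings: system name -> {character: main vowel}.
readings_by_system = {
    "system_1": {"A": "a", "B": "e", "C": "ə"},
    "system_2": {"A": "a", "C": "i"},
    "system_3": {"A": "o", "B": "e", "C": "ə", "D": "u"},
}
common = shared_characters(readings_by_system)
```

Because a character drops out as soon as a single system lacks it, the shared set can be much smaller than the coverage of any individual system.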

Table 3 Vowel systems across different Old Chinese reconstructions

It is important to note in this context that missing readings cannot be easily added without the assistance of those who originally created a given reconstruction system. While certain aspects in Old Chinese reconstruction are systematic, allowing us to project attested Middle Chinese readings back to Old Chinese, the projection rules which differ in the reconstruction systems proposed by different scholars do not necessarily allow us to replicate their judgments, as scholars use a range of different types of evidence, including Chinese character structure, evidence from excavated texts, and early borrowings into neighboring languages (see especially Baxter and Sagart 2014 for a discussion of the different types of evidence used in reconstruction). As a result, we cannot simply add the missing character readings in our comparative dataset without running the danger of incorrectly representing a given reconstruction system. For our comparison, we are left with what we have, and we need to address the problems resulting from gaps in the data. But since we provide all data as an Additional file with this study, we hope that collaborative efforts of the scholarly community may eventually close the gaps in the future.

When comparing across datasets, it is important that we compare samples of the data containing exactly the same nodes, as in smaller or larger samples the basic characteristics, such as the number of edges, may differ, thus giving the reconstruction systems we want to compare different starting chances. The difference is further confirmed by the data on network density, that is, the number of edges divided by the number of potential edges in a network. The number of potential edges is the number of edges in a network in which all nodes are connected with each other, and it can be calculated with the help of the formula \( \left({n}^2- n\right)/2 \), where n is the number of nodes in the network (see Footnote 3). Network density for the different subgraphs is reported in Table 3. As can be seen from the scores, the subgraphs of the different reconstruction systems slightly differ in density depending on the coverage of the data sample, with the smaller datasets showing a higher density.
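The density formula can be checked with a short calculation. The sketch below applies it to the node and link counts reported above for the full rhyme network (1845 nodes, 5266 links); the function names are illustrative, and the resulting value is not one of the subgraph densities listed in Table 3.

```python
def potential_edges(n):
    """Number of edges in a network where all n nodes are connected."""
    return (n * n - n) // 2

def density(n_nodes, n_edges):
    """Fraction of potential edges that are actually present."""
    return n_edges / potential_edges(n_nodes)

# Node and link counts of the full rhyme network as reported in the text.
full_density = density(1845, 5266)
print(round(full_density, 4))
```

With 1,701,090 potential edges, the full network realizes only about 0.3% of all possible links, which illustrates how sparse rhyme networks of this kind are.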

3.2 Methods

We need a measure for the purity of clusters in a graph. If the theory of vowel purity holds, we should expect a high degree of isolation for those rhyme words which can be grouped by the same vowel. We thus want to compare how well a given external grouping of the nodes in our network (the vowels reconstructed for the rhyme words in a given reconstruction system) conforms to the internal ordering in our network (as reflected by the rhyme relations among the rhyme words). If we accept that we will have a certain degree of vowel impurity in all rhyme networks, be it due to the fact that the poets deliberately decided to tolerate this, or that the underlying data reflects different stages in language history, we would still assume that words rhyme more often with each other if they have the same main vowel.

We can illustrate this notion of purity by creating a fictive dataset of six rhyme words which we label 1, 2,…, 6, and of which 1, 2, and 3 share the same vowel, and 4, 5, and 6 share a vowel, which is different from the vowel of 1, 2, and 3. In Table 4, we display two matrices which contrast different fictive types of rhyme co-occurrence for our six words. If two words rhyme, this is indicated by a cross in the cell of the matrix. Impure rhymes in which two vowels of different quality rhyme with each other are further marked by shading the cell in gray. From the two different matrices, we can easily see that the first one (matrix A) would intuitively reflect a higher degree of vowel purity than the second one (matrix B), simply because the number of impure rhymes is much lower in matrix A.

Table 4 Rather high and rather low degree of vowel purity in a fictive set of six rhyming words

The same information can also be displayed in a network, in which our words 1, 2,…, 6 are modeled as nodes, and the information whether they rhyme with each other in the sources (matrices A and B) is displayed by drawing an edge between the nodes. This is illustrated in Fig. 2, and we can see that the network visualization makes it even easier to see the difference between the intuitively rather pure rhyme network in A and the rather impure rhyme network in B. But our intuitive assessment may easily betray us if the data become more complex. For this reason, we need a way to measure to which degree a given network structure (the rhyme co-occurrences in the Book of Odes) is in conflict with a given external division of the nodes (the vowels, as annotated in the reconstruction systems of different scholars).

Fig. 2

Comparing networks with (a) high and (b) low “purity” regarding the relation of colors and edges

A measure that captures exactly what we want to test is assortativity (Newman 2003). Assortativity tests whether nodes sharing connections in a graph are also similar regarding other characteristics. In social network analyses it can, for example, be used to test whether observed patterns in a network, like friendship, come along with properties of the individuals, such as language or gender (ibid.). Assortativity can be measured by calculating the assortativity coefficient of a network in which all nodes have a given attribute. The basic idea of this coefficient is to compare the proportion of edges connecting nodes with the same attribute with the proportion of edges connecting nodes with different attributes. Calculating the assortativity coefficient in a network is straightforward. Given a network with nodes and node attributes, one first calculates an attribute mixing matrix which indicates the proportion of edges between all attributes. Based on this matrix, the assortativity coefficient can then be calculated with the help of the formula:

$$ r=\frac{\mathrm{Trace}(m)-\lVert m^2\rVert }{1-\lVert m^2\rVert } \tag{1} $$

where m is the attribute mixing matrix, Trace is the sum of the cells on the diagonal from top left to bottom right, and \( \lVert m^2\rVert \) is the sum of all cells in the matrix \( m^2 \) (see Newman 2003 for details). An assortativity coefficient equal to 1 indicates full assortativity, with all edges only connecting nodes with the same attributes. A coefficient of 0 indicates no assortativity, and scores between 0 and −1 indicate inverse assortativity, in which edges have the tendency to connect nodes with different attributes (ibid.).

As an example of how to calculate the assortativity for a given network, consider again our two networks in Fig. 2. In both networks, colors indicate node attributes, and even from eyeballing, we have already seen above that network A has a high assortativity (as there is only one edge connecting red and blue nodes), while network B has a lower assortativity. In order to calculate the assortativity coefficient for the two networks, we first need to determine the proportion of the edges connecting different types of nodes with each other. Assuming a directed network (see Footnote 4), in which we can draw two different edges between two nodes, both indicating the direction (from 1 to 2, or from 2 to 1, as in a one-way street), we have 14 edges (2 × 7) in the first and 18 edges (2 × 9) in the second network (see also Table 4, where the original matrices are given). The proportion of edges linking from red to red, red to blue, blue to red, and blue to blue can then be arranged in a contingency matrix, as illustrated in Table 5, and this matrix is then used as input for formula (1) to calculate the assortativity coefficient r. For the networks in Fig. 2, this yields:

$$ {r}_A=\frac{0.86-0.5}{1-0.5}=0.72 $$

$$ {r}_B=\frac{0.56-0.51}{1-0.51}=0.1 $$

Table 5 Calculating the attribute mixing matrices for the networks from Fig. 2

We can see from this example that the assortativity coefficient confirms the intuition we might have already had by eyeballing the networks in Fig. 2, namely, that the network structure in network A reflects the coloring of the nodes much better than in network B.
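The calculation can be retraced with a small Python sketch. It builds the mixing matrix from an edge list and then evaluates the coefficient as the trace minus the sum of all cells of the squared matrix, divided by one minus that sum. The edge lists below are assumptions: only the numbers of pure and impure edges (six pure and one impure for A, five pure and four impure for B) follow from the figures given in the text, and the exact layout of Fig. 2 may differ.

```python
def mixing_matrix(edges, attribute):
    """Proportion of directed edges between each pair of attribute values."""
    values = sorted(set(attribute.values()))
    index = {value: i for i, value in enumerate(values)}
    counts = [[0] * len(values) for _ in values]
    for a, b in edges:
        # Each undirected edge is counted once in each direction.
        counts[index[attribute[a]]][index[attribute[b]]] += 1
        counts[index[attribute[b]]][index[attribute[a]]] += 1
    total = 2 * len(edges)
    return [[cell / total for cell in row] for row in counts]

def assortativity(edges, attribute):
    """Assortativity coefficient r computed from the mixing matrix m."""
    m = mixing_matrix(edges, attribute)
    size = len(m)
    trace = sum(m[i][i] for i in range(size))
    # Sum of all cells of the matrix product m * m.
    sum_m2 = sum(m[i][k] * m[k][j]
                 for i in range(size) for j in range(size) for k in range(size))
    return (trace - sum_m2) / (1 - sum_m2)

colors = {1: "red", 2: "red", 3: "red", 4: "blue", 5: "blue", 6: "blue"}
# Assumed edge lists; only the edge counts are taken from the text.
edges_a = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
edges_b = [(1, 2), (1, 3), (2, 3), (4, 5), (5, 6),
           (1, 4), (2, 5), (3, 6), (3, 4)]
r_a = assortativity(edges_a, colors)
r_b = assortativity(edges_b, colors)
```

With unrounded intermediate values, these assumed networks give roughly 0.71 for A and 0.10 for B; the small difference from the value 0.72 printed for network A would be explained by inserting the rounded trace of 0.86 into the formula.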

When comparing two or more reconstruction systems with each other, we need to be careful in correctly interpreting the results. If one system has a high assortativity coefficient, this confirms a tendency to produce clusters of high purity. If the assortativity coefficient of another system is lower, however, this could be triggered by the topological structure of the network alone, and not by the reconstruction system. As scholars have chosen their reconstructions independently, assuming different numbers of vowels for their reconstructions, it may well be that the initial number of vowels might favor or disfavor a given analysis. A hypothetical system of one single vowel, for example, would receive the highest assortativity coefficient simply due to the fact that it covers the full network, and in the light of the theory of vowel purity in rhyming, this would also reflect a pure rhyming behavior, as all rhyming instances would show the same vowel.

We need to make sure that the distribution we obtain for a given reconstruction system is not due to chance. More concretely, we are interested not only in whether the distribution of vowels across a rhyme network is due to chance alone, but also in comparing across different reconstruction systems which system is least likely to have arisen by chance. Comparability can be achieved by comparing the results obtained for a given reconstruction system with the results of a random distribution obtained for the same dataset. The random distribution can be created by shuffling the node labels (the vowels for each Chinese character in our case). In order to normalize the data, one then measures the degree to which the original result differs from the results obtained for the randomized distribution, that is, how unlikely it is that a given system could have been produced by chance. If we only wanted to test whether a given distribution is likely to be due to chance, we could calculate the p-value, using the formula:

$$ p=\left( S+1\right)/\left( R+1\right), $$

where S is the number of random distributions with an assortativity coefficient higher than the one we observed, and R is the number of all random distributions we created. The p-value ranges between 0 and 1, and the lower the value we obtain, the less we would expect that the observed distribution was created by chance. It is customary in the social sciences to set an arbitrary threshold for the p-value, indicating when an experiment is accepted to confirm a hypothesis and when it is rejected. This value is usually 0.05 in psychology and sociology, but much lower in physics.
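The p-value formula translates directly into code. In the sketch below, the function name `p_value` and the list of random coefficients are hypothetical; the list merely stands in for the coefficients obtained from shuffled networks.

```python
def p_value(observed, random_scores):
    """Permutation p-value: p = (S + 1) / (R + 1)."""
    # S: random runs whose coefficient exceeds the observed one.
    s = sum(1 for score in random_scores if score > observed)
    r = len(random_scores)
    return (s + 1) / (r + 1)

# Hypothetical coefficients from 1000 shuffled networks, none above 0.88.
random_scores = [0.001 * i - 0.5 for i in range(1000)]
p = p_value(0.88, random_scores)
```

Because of the added 1 in numerator and denominator, the value never reaches zero: with 1000 random distributions, the smallest possible result is 1/1001, or roughly 0.001.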

In addition, since we do not only want to test whether a given reconstruction system is significant with respect to the principle of vowel purity, we also need to find a way to compare different reconstruction systems with each other. A good score for this difference is to count the number of standard deviations between the mean of the randomized distribution and the non-randomized test (Lopez et al. 2013), which can be done with the help of the formula:

$$ \sigma =\frac{r_A-{r}_E}{s_E}, $$

where \( r_A \) is the attested assortativity coefficient, \( r_E \) is the mean of the assortativity coefficients in the random sample (the expected assortativity), and \( s_E \) is the standard deviation of these coefficients. This score, which we will call the sigma score in the following, tells us how unexpected a given analysis is with respect to an analysis which was carried out randomly: the higher the score, the less we expect an analysis to be due to chance. In the context of vowel purity in Chinese rhyme networks, this means that the higher the score of a reconstruction system, the more closely it groups the rhymes by vowel quality. By reporting both the sigma scores and the p-values, we further make sure that our results are generally significant.
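A minimal sketch of the sigma score is given below. It assumes the population standard deviation (`statistics.pstdev`); the text does not specify whether the population or the sample standard deviation was used, and the random coefficients are invented for illustration.

```python
from statistics import mean, pstdev

def sigma_score(observed, random_scores):
    """Standard deviations between the observed and the expected coefficient."""
    expected = mean(random_scores)   # r_E
    spread = pstdev(random_scores)   # s_E (population standard deviation, assumed)
    return (observed - expected) / spread

# Hypothetical coefficients from five shuffled networks.
random_scores = [0.0, 0.2, -0.2, 0.1, -0.1]
sigma = sigma_score(1.0, random_scores)
```

In this toy case the random coefficients scatter around zero with a standard deviation of about 0.14, so an observed coefficient of 1.0 lies roughly seven standard deviations above the expectation.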

A further problem mentioned above is the problem of sample size. Since we have a considerable amount of missing readings in our data, we need to make sure that the differences do not influence our results. In order to control this, we apply a straightforward re-sampling procedure by randomly selecting a certain number of nodes from the networks which occur in all reconstruction systems and re-running the complete analysis on these subsets of the data. For this purpose, we created 10 random samples for varying numbers of nodes, ranging from 100 characters up to 800 characters (all random samples as well as the source code to create new random samples are given in the Additional file 1: supplementary material). We ran our basic analysis on all these subsets and averaged the results for a given number of nodes. In this way, we tested the robustness of our approach when dealing with datasets of different sizes and random collections of subsets of the data.
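The re-sampling step can be sketched as follows. This is an illustration under stated assumptions, not the source code from the supplementary material: the function name `draw_samples`, the seed, and the placeholder node identifiers are hypothetical.

```python
import random

def draw_samples(shared_nodes, sizes, per_size=10, seed=0):
    """Draw `per_size` random node subsets for every requested size."""
    rng = random.Random(seed)
    pool = sorted(shared_nodes)
    return {size: [rng.sample(pool, size) for _ in range(per_size)]
            for size in sizes}

# Placeholder node identifiers standing in for the 875 shared characters.
shared_nodes = [f"char_{i}" for i in range(875)]
samples = draw_samples(shared_nodes, sizes=range(100, 900, 100))
```

Each subset would then define the subgraph on which the analysis is repeated, and the scores of the ten subsets of one size would be averaged.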

4 Results

We computed the assortativity coefficients for the original and the randomized data based on the Book of Odes network for all eight reconstruction systems. The randomized distribution was obtained by shuffling the node labels in each network 1000 times and storing the assortativity coefficient for each run. Thanks to the NetworkX software package (Hagberg 2009), all computations could be carried out in Python, and all source code to replicate the analyses reported here is given in the Additional file 1: supplementary material. In all cases, our primary question was to which degree the division of the rhyme words in the network according to their reconstructed vowels would reflect the “natural” division of the networks into rhyme classes as represented in the annotated network of rhymes in the Book of Odes. Table 6 shows the results for this experiment for the 875 character readings.
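The shuffling experiment can be illustrated with the following self-contained sketch. It is written in plain Python instead of NetworkX and is not the code from the supplementary material. The assortativity function uses the fact that, for an undirected network, the sum of all cells of the squared mixing matrix equals the sum of the squared shares of edge ends per label. The toy network (two fully connected groups of six words joined by one impure rhyme), the vowel labels, and the seed are assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def assortativity(edges, label):
    """Attribute assortativity of an undirected network."""
    same = sum(1 for a, b in edges if label[a] == label[b]) / len(edges)
    # Share of edge ends attached to each label (row sums of the mixing matrix).
    ends = Counter()
    for a, b in edges:
        ends[label[a]] += 1
        ends[label[b]] += 1
    total = 2 * len(edges)
    expected = sum((count / total) ** 2 for count in ends.values())
    return (same - expected) / (1 - expected)

def shuffled_scores(edges, label, runs=1000, seed=42):
    """Assortativity after randomly reassigning the labels to the nodes."""
    rng = random.Random(seed)
    nodes = sorted(label)
    values = [label[node] for node in nodes]
    scores = []
    for _ in range(runs):
        rng.shuffle(values)
        scores.append(assortativity(edges, dict(zip(nodes, values))))
    return scores

# Toy network: words 0-5 rhyme with each other, as do words 6-11,
# and one impure rhyme links word 5 to word 6.
edges = list(combinations(range(6), 2)) + list(combinations(range(6, 12), 2))
edges.append((5, 6))
vowels = {node: ("a" if node < 6 else "e") for node in range(12)}

observed = assortativity(edges, vowels)
scores = shuffled_scores(edges, vowels)
exceeding = sum(1 for score in scores if score > observed)
```

The observed coefficient and the list of shuffled coefficients are the two ingredients from which both the p-value and the sigma score are derived.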

As one can see from the results in Table 6, the reconstruction system by Baxter and Sagart (2014) outperforms all other systems. With an assortativity coefficient of 0.88 and a sigma score of 79, it shows a higher degree of assortativity than the other systems, and a generally high assortativity with respect to vowel purity. The next in order is the system of Starostin (1989), with an assortativity coefficient of 0.84 and a sigma score of 74. The system of Li 李方桂 (1971) performs worst, with a sigma score of 56, followed by the system of Wang 王力 (1980) with a sigma score of 61. As the p-values in the last column in Table 6 indicate, all of our experiments are highly significant: none of the 1000 random distributions of vowels achieved a higher assortativity coefficient than the one obtained for the observed data. Regardless of the reconstruction system, all reconstructions show a strong tendency to reflect vowel purity.

Table 6 Results of the analysis for the complete dataset (including all characters reflected in all reconstruction systems), a total of 875 nodes

As we mentioned before, due to the large number of missing readings in our data, we need to control for the sample size. As a strategy, we carried out the re-sampling procedure outlined at the end of Section 3.2, in which we split the data into randomly selected samples of varying sizes of 100, 200,…, up to 800 characters and then applied our basic method to those subsets of the data. The averaged results for the ten different samples we used in each analysis are given in Table 7. For reasons of space, we only report ranks and sigma scores, but all detailed analyses are provided in the Additional file 1: supplementary material. All p-values for these analyses were highly significant with p < 0.01. As can be seen from the table, all studies on the subsets confirm the tendency we also saw in the full sample from Table 6, and especially the ranks are remarkably stable (the only exception being the analyses by Schuessler and Karlgren in the lower ranks). What one can also see is that the size of the networks has a direct impact on the sigma scores, which is easy to understand keeping in mind that if we select only a small number of nodes, the evidence for rhyme co-occurrences will drastically shrink.

Table 7 Results of the re-sampling test on randomized subsets of the data with varying numbers of characters, and the resulting rankings for all datasets for the respective analysis. Each of the eight re-sampling trials consists of ten randomly selected sets of characters

Apart from the remarkable robustness of the results across different random samples of the data, the difference between the reconstruction systems regarding their individual degrees of vowel purity is also quite striking. This is interesting since scholars have often emphasized the similarities between the more recently proposed reconstruction systems (Behr 1999). Given that we only investigate the main vowels, thus ignoring all other potential disagreements, this shows that we are still far away from a communis opinio on Old Chinese phonology. The differences between the reconstructions are further illustrated in Fig. 3, where we contrast the reconstructed vowels for 300 characters out of the 1830 character readings in the data. While we can see a rather high agreement in the majority of patterns, especially between the six-vowel systems of Old Chinese, it is also easy to identify certain individual differences in the reconstructions. These cases show that the variation is not triggered by one major disagreement, but by a notable number of individual reconstructions in which scholars differ.

Fig. 3

Comparing the rhyme patterns across different reconstruction systems. The figure shows three subsets of 100 characters each as they occur in the rhyme data of the Book of Odes; all of them include missing characters along with the respective vowel readings. While a definite structural similarity can be detected, we also find remarkable differences. In the figure, each cell corresponds to one reading for a given character in the row. Since the characters are too small to be readable, we offer a high-resolution version of this figure in the Additional file 1: supplementary material

The assortativity coefficients of all systems and the high significance of our randomized tests indicate that vowel purity plays an important role in Old Chinese rhyming. If vowel quality were independent of rhyme decisions, we would expect the assortativity coefficients to be close to zero, as we found in the random distributions. What this means more concretely is shown in Fig. 4, where we show the full rhyme network in which nodes have been colored according to the system of Baxter and Sagart (2014). From this perspective, we can see that the network is highly structured. Most rhymes which are topographically close form organic groups in the network, as shown by their colors. That one and the same vowel further forms multiple distinct clusters is also to be expected, as vowel quality is not the only factor conditioning rhyming. Furthermore, given the overall structure of the network with its one larger component that connects almost all of the characters, we can also see that the rhyme purity assumption is essentially an assumption of degree: we find definite clusters which obviously correspond to words with a very similar if not identical pronunciation in Old Chinese, but we also find obvious transitions between all rhyme groups.

Fig. 4

The Book of Odes network with vowels colored according to the reconstruction system of Baxter and Sagart (2014)

5 Discussion

What can we learn from this experiment? Surprisingly, the reconstruction system of Baxter and Sagart (2014), which was heavily criticized by Ho (2016) for its lack of vowel purity, seems to evince a much higher purity of vowels than all other popular reconstruction systems for Old Chinese, regardless of the number of vowels which these systems actually reconstruct. If vowel identity was indeed a valid criterion for the choice of rhyming words in Old Chinese times, this could be seen as strong evidence for the superiority of the reconstruction system by Baxter and Sagart (2014), closely followed by the system of Starostin (1989). Yet, we should be careful with our conclusions, since vowel purity is surely only one factor that may have contributed to Old Chinese rhyming practice, and we cannot be sure how important this factor was. In order to use the vowel purity criterion to favor or disfavor certain reconstruction systems of Old Chinese, more evidence on the universality or the areal prevalence of this principle in rhyming would be required. Since rhyming practice results from the interaction between language, culture, and cognition, more studies on cross-linguistic and cross-cultural rhyming practices would be needed before external criteria could be used as clear evidence for or against a given Old Chinese reconstruction.

Even if we refuse to use the results of this research to rank or evaluate the different reconstruction systems of Old Chinese, we consider it a valuable contribution to the field of Chinese historical linguistics, as we have shown that we can easily design quantitative tests that check to what degree different reconstruction systems conform to a given criterion. By expanding this principle to the finals of different reconstruction systems, we could, for example, test the general degree of purity with respect to the rhymes in the Book of Odes. As shown in List (2017), we can also use the rhyme networks to resolve uncertainties inside a given reconstruction system. Due to the diversity of poetry collections like the Book of Odes itself, we could further compare rhyming behavior across different partitions of the data, thus testing current hypotheses regarding its development history. Given the crucial role that Chinese plays for the history of the Sino-Tibetan language family, research along these lines may not only have an impact on Chinese historical linguistics, but may also help us to gain new insights into the prehistory of one of the largest language families in the world.

Given that Chinese is not the only language whose older stages are reflected in rhyming, one may even think of applying the method to other languages, such as Tangut (Arakawa 2001) or Egyptian (Peust 2014). When taken with care, network studies on rhyming practice may provide additional evidence for original pronunciation, especially in those situations where the writing system lacks precision in truthfully representing speech in phonetic detail. These methods may also be used to investigate cross-linguistic rhyming tendencies. So far, the vowel purity principle is still a hypothesis rather than a confirmed effect. By adding more data from different languages to the sample, one could investigate whether it reflects a universal tendency rather than a specific tendency in Old Chinese rhyming.

This paper shows that a thorough quantitative comparison can give us new insights into the problems in the reconstruction of Old Chinese, but also into the more general problems of reconstruction in historical linguistics. Instead of dismissing theories or reconstructions by cherry-picking particular examples, a thorough and, if possible, exhaustive evaluation may often allow us to look at problems from a fresh perspective. Unfortunately, increasing the amount of data amenable to quantitative investigations is time-consuming. For this reason, the results presented in this paper can only be regarded as preliminary until the existing data are more consistently checked and new data have been added. In order to tackle these problems in the future, collaborative efforts are required, and all scholars should try to contribute by sharing their data as transparently as possible.