Introduction

Although still in its infancy, digital humanities research supported by big data and deep learning has become a hot topic in recent years. Researchers began to use digital methods to study cultural issues quantitatively, such as examining cultural evolution (Lewens 2015) through the diachronic changes of n-gram frequency (Michel et al. 2011; Lansdall-Welfare et al. 2017; Alshaabi et al. 2021; Newberry and Plotkin 2022) and word-level semantics (Newberry et al. 2017; Garg et al. 2018; Kozlowski et al. 2019; Giulianelli et al. 2020). This trend also spread to the study of ancient civilizations. Scholars from different cultural backgrounds have investigated the culture of ancient Rome (Dexter et al. 2017), ancient Greece (Assael et al. 2022), and Natufian (Resler et al. 2021) with the assistance of computer technology. It is acknowledged that ancient China was one of the longest-standing civilizations in human history, with a culture that evolved over the past thousands of years. Various ancient literature has been handed down over time, providing extensive textual records of Chinese culture. With the digitized versions of these classics, we can gain a glimpse into the cultural evolution in ancient China.

Ancient Chinese classics are highly intertextual texts. Ever since the doctrine “A transmitter and not a maker, believing in and loving the ancients” was set out in the Analects (Legge 1861. VII.I), quoting previous texts became a convention of literary creation in ancient China. Chinese scholars have long studied this cultural phenomenon from different perspectives. For example, Pan-ma i-t’ung (published around AD 1200) demonstrated the character differences between two history books, Records of the Grand Historian (published around 91 BC) and Book of Han (published around AD 82). Since the Qing Dynasty, scholars have enumerated parallel intertextual associations between ancient classics (Chen 1989; He et al. 2004). However, intertextuality (Kristeva 1980) is not only a matter of connections between words and phrases; it also manifests hierarchically at higher levels (Riffaterre 1994; Alfaro 1996), such as document, author, and community. The traditional form of high-level intertextuality study was holistic literary criticism by scholars. For example, the Ming dynasty scholar Ling Zhilong compiled previous scholars’ literary criticism of the above two history books. Literary criticism addressed the style, skill, and viewpoints of literature, and was seen as a formidable endeavour due to the complexity of Chinese culture. Both parallel enumeration and literary criticism are limited by the reading and memory of scholars, which restricts discussion of large-scale corpora. Assisted by computer technology and digital literature, scholars have recently begun to study intertextuality within large-scale data.

Various natural language processing (NLP) methods have been applied to the intertextuality modelling of ancient literature. Earlier automatic detection methods for text-level intertextuality aimed to discover similar phrases or sequences by lexical matching (Lee 2007; Coffee et al. 2012a; Coffee et al. 2012b; Ganascia et al. 2014; Forstall et al. 2015), an approach that is rigid and insufficient for semantic modelling. Non-literal features such as synonyms (Büchler et al. 2014; Moritz et al. 2016) and rhythm (Neidorf et al. 2019) also imply intertextuality, yet they require language-specific design. Topic modelling supports passage-level modelling (Scheirer et al. 2016), but its dependence on expert annotation limits its generalization to diverse corpora. Simple statistics on text-level results contribute to document-level modelling (Hartberg and Wilson 2017), but they ignore the overall connections between documents. Graph structure, in turn, appears to be an appropriate way to model intertextuality at the community level (Romanello 2016; Rockmore et al. 2018). Intertextuality modelling on classical literature widely supports cultural studies, such as quantitative literary criticism and stylometry (Forstall et al. 2011; Burns et al. 2021). Existing related studies on Chinese literature have been limited to detection methods (Liang et al. 2021; Li et al. 2022; Yu et al. 2022) and shallow studies of intertextual texts on small corpora (Sturgeon 2018a; Sturgeon 2018b; Huang et al. 2021; Deng et al. 2022), falling short of macroanalysis (Jockers 2013) of Chinese culture.

In this paper, we conducted a macroanalysis of ancient Chinese culture on an unprecedented large-scale corpus spanning nearly 3000 years. Figure 1a presents a schematic of this corpus. This corpus consists of 30,880 articles from 201 ancient Chinese books (or anthologies). It covers various topics, such as philosophy, religion, and politics, including the famous works of major cultural groups (e.g., Analects of Confucianism; Tao Te Ching of Taoism). The history books (e.g., Book of Han) and comprehensive anthologies (e.g., Collected Works of Han) of each era are also involved.

Fig. 1: Dataset and modelling framework.

a The dataset of ancient Chinese literature with an instance in each era. The names of the dynasties and the approximate AD years are marked on the timeline. For each period, it gives one instance book and indicates its subject. b Hierarchical framework with three modules for multilevel intertextuality modelling.

In this work, we modelled ancient Chinese literature with a hierarchical framework. The cultural thought of civilization is composed of multiple levels, such as doctrines, individuals, and communities. Moreover, cultural evolution manifests hierarchically with microevolution and macroevolution (Mesoudi 2017; Gray and Watts 2017). A comprehensive discussion of cultural evolution requires multilevel perspectives. Therefore, this framework models intertextual associations from the text level to the community level with three modules. A schematic of the framework is shown in Fig. 1b. The text-level detection module tracks intertextual sentences with deep-learning models. The book-level aggregation module gathers text-level clues and abstracts various books into an association graph. The community-level inference module applies topological propagation to explore intertextual associations in the cultural community. After the modelling, millions of intertextual sentence pairs and a book-level intertextual association graph are ready for cultural analysis.

In the experiment, we detected 2.6 million pairs of intertextual sentences and then built them into an association graph. For a specific text collection, its intertextual distribution refers to its quantitative intertextual associations with other texts. Based on the modelling results, we can study ancient Chinese culture through the intertextual distribution among ancient literature.

In the cultural analysis, we considered cultural evolution from the perspective of cultural groups and religions. Schools of thought and religions were part and parcel of ancient Chinese culture (Schwartz 1985). The Hundred Schools of Thought that originated in the axial age were the prototype of ancient Chinese philosophy (Graham 1989). They rose and fell over the millennia that followed. The introduction of foreign cultures, like Buddhism (Chen 1964), also influenced the evolution of native culture. In this paper, we disentangled the cultural evolution of ancient China on three levels: (1) The interaction between individual scholars and philosophical schools; (2) The rise and fall of schools in Chinese history and culture; (3) The cross-culture communication with Buddhism.

Specifically, we validated several acknowledged cultural phenomena: the evolutionary paths of Confucianism and Taoism, and the booms and declines of the Hundred Schools of Thought. We also provided quantitative suggestions for cultural problems that are yet to be definitely resolved, such as the school attribution of Lüshi Chunqiu, the authorship attribution of Collected Works of Tao Yuanming, and the influence of Confucianism and Taoism across different cultural domains. Furthermore, we quantitatively discussed the interaction between Buddhism and native culture, revealing how cultural integration has evolved over time.

In addition, we have developed an online platform to display this corpus, along with millions of intertextual associations detected in this work. The platform supports custom data analysis, which encourages researchers and enthusiasts to gain insight into this work.

Methods

Data

We built two datasets: the classic dataset and the era-text dataset. We considered several factors when building them: era balance, representativeness, and official-folk balance. Together, the two datasets consist of 30,880 articles from 201 books (or anthologies).

Classic dataset

The classic dataset is composed of the most prominent and influential books that represent the core culture of ancient China. Before the Tang Dynasty (618–907), literature was copied manually. Due to the long history and the limitations of publishing technology, only time-tested classics have been handed down to this day. Therefore, we added all the collected pre-Tang literature to the classic dataset. In the Tang Dynasty, the invention of block printing led to the rapid development of the publishing industry, resulting in explosive growth in the amount of literature. Until the mid-18th century, China printed more books than the rest of the world combined (Gernet 1996). Because this study focuses on the evolution of early thought in ancient China, we selected only several of the most famous classics from after the Tang Dynasty. The well-known digital library of ancient Chinese classics, CTEXT (https://ctext.org/), adopted similar rules to build its collection of core classics. We drew on the literature samples of CTEXT to build the classic dataset.

Our research focuses on ideological evolution, so books in the classic dataset should reflect cultural thought with good data quality. Therefore, we further screened the classic dataset to filter out inappropriate books, including commentary books, mathematics books, dictionaries, excavated literature (e.g., Mawangdui Silk Texts), and lengthy novels.

Finally, the dataset of ancient Chinese classics contains 133 books, including 8984 articles. Table 1 shows the time-period statistics of this corpus. It covers various aspects of culture, such as philosophy, mythology, politics, and religion. In this dataset, the earliest book was created around 1000 BC (e.g., Book of Documents), while the latest book was published around AD 1750 (e.g., The Scholars).

Table 1 Time-period statistics for the classic dataset.

Era-text dataset

We aim for the era-text dataset to reflect the contemporary culture of each period, encompassing both official and folk traditions. To achieve this, we focused on history books and anthologies. As ancient China had a tradition of producing history books for each dynasty, history books typically reflected official attitudes. We added the official history (Twenty-Four Histories), large-scale chronicle history books (Zizhi Tongjian and Continued Zizhi Tongjian Changbian), and 15 other influential history books to the era-text dataset. In addition, we included Quan shang gu san dai Qin Han San guo Liu chao wen, a series of large-scale anthologies organized by era. It collects a wide variety of works from numerous authors, including prose, essays, religious scriptures, and inscriptions. These anthologies comprehensively record the contemporary culture of ancient China. To further enrich the era-text dataset, we added 13 well-proofread anthologies.

We categorized these history books and anthologies by era. For history books (e.g., Zizhi Tongjian) that cover multiple eras, we divided them into corresponding eras. Finally, we got 55 history books and 13 anthologies, containing 21,896 articles. Table 2 shows the time-period statistics of this corpus. These works chronicle Chinese history and culture from the legendary period (e.g., Bamboo Annals, from 2600 BC) to the Ming Dynasty (e.g., History of Ming, ending in AD 1644).

Table 2 Time-period statistics for the era-text dataset.

Data processing

Because ancient Chinese characters may have multiple written forms, we used the open-source toolkit OpenCC (https://github.com/BYVoid/OpenCC) to map them to a unique root character before encoding them with deep learning models. The maximum sentence length was set to 50 characters. Sentences exceeding this length were divided into two sentences. This setting covers more than 99% of sentences.

Intertextuality detection usually aims to discover meaningful textual connections. Texts without actual meaning cannot indicate an ideological connection between texts. Therefore, we used additional computational rules to avoid inappropriate text pairs. First, we removed sentences (clauses) with fewer than three characters remaining after stopwords (such as prepositions and pronouns) were removed. Then, with predefined rules, we removed sentences that carry no ideological content, such as tone expressions, dates, lengths, quantities, and formats. After filtering, there are about 436,000 sentences with 840,000 clauses in the classic dataset, and 2,113,000 sentences with 4,526,000 clauses in the era-text dataset.
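The length and stopword rules above can be sketched roughly as follows. This is only an illustration: the stopword set, the punctuation set, and the midpoint split are assumptions made for the example, since the text does not specify the actual stopword list or how over-long sentences were divided.

```python
# Hypothetical stopword and punctuation sets; the lists used in the study are not given.
STOPWORDS = set("之乎者也而其於以")
PUNCTUATION = set("，、；：。？！")
MAX_LEN = 50  # maximum sentence length stated in the text


def split_long(sentence, max_len=MAX_LEN):
    """Divide an over-long sentence into two parts (assumption: split at the midpoint)."""
    if len(sentence) <= max_len:
        return [sentence]
    mid = len(sentence) // 2
    return [sentence[:mid], sentence[mid:]]


def content_length(text):
    """Count the characters left after dropping stopwords and punctuation."""
    return sum(1 for ch in text if ch not in STOPWORDS and ch not in PUNCTUATION)


def keep(text, min_chars=3):
    """Keep a sentence or clause only if at least `min_chars` content characters remain."""
    return content_length(text) >= min_chars
```

For instance, under these assumed lists `keep("學而時習之")` is true because three content characters remain, whereas `keep("而已")` is false.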

Challenge and limitation

The collection and processing of ancient Chinese literature present challenges and limitations. Although we used punctuated text in this work, the original ancient Chinese literature has no punctuation. For unpunctuated data, an automatic punctuation model should be applied beforehand. Moreover, ancient literature may exist in multiple versions. In our dataset, we opted to include only one widely circulated version of each book. This may restrict the applicability of the dataset for researchers interested in other versions.

Additionally, the selection of appropriate literature collections for cultural analysis from a vast pool of ancient literature requires expert knowledge. In our study, humanities scholars specializing in Chinese history and philosophy were consulted.

Modelling framework

Considering that intertextuality and cultural evolution can manifest at multiple levels, we developed a hierarchical framework to analyze ancient literature. This framework captures intertextuality at three levels, ranging from micro to macro. At the text level, similar sentence pairs shared between books are detected by deep neural networks. At the book level, books are abstracted into an intertextual association graph based on the text-level results. At the community level, information propagates through the topological structure of the book-level graph, thus exploring intertextuality in the cultural community. This hierarchical approach provides both micro-evidence and macro-quantification for intertextual associations and cultural evolution.

Text-level detection

The study of cultural evolution is concerned with the connections of thoughts. Each sentence often expresses a distinct thought, making it a suitable quantitative unit. Therefore, we traced intertextuality at the sentence level. We assumed that the more similar sentences two books share, the more closely connected they are.

The dissemination of text is not static but mutates. The micro-evolution of texts has multiple patterns (Tamariz 2019), such as replication, expansion, and succession. Therefore, this module traced similar sentence pairs shared between books with three patterns: overall similarity, partial similarity, and paraphrased similarity. A sketch is given in Fig. 2a.

Fig. 2: Modelling methods of quantitative intertextuality.

a Three patterns of similarity between sentences. Darker colour indicates more similar text. b The explicit intertextuality and implicit intertextuality between the three books.

Overall similarity

Two sentences express a similar meaning with closely matching wording.

Partial similarity

Two sentences share similar parts.

Paraphrased similarity

A similar meaning is expressed with different wording. The text may be broken up and reorganized.

Deep neural networks (Vaswani et al. 2017) and pre-training methods (Devlin et al. 2019) have shown excellent performance in text feature extraction. Contrastive learning (Chen et al. 2020) can produce text similarity models with a custom-defined notion of similarity without supervision, which makes it suitable for text-level intertextuality detection. To obtain sentence representations for these three patterns, we used RoBERTa-base (Liu et al. 2019), a pre-trained language model that can be further fine-tuned for our task using different training strategies.

The overall similarity pattern can be treated as the overall semantic similarity between sentences. To train model1, we used SimCSE (Gao et al. 2021), a contrastive learning method for extracting sentence embeddings.

For the paraphrased similarity pattern, the sentence structure may be reconstructed. We trained a second model, model2, for this pattern, with its loss being a weighted sum of loss1 and loss2. The loss1 was calculated in the same way as for model1.

For loss2, we randomly dropped and shuffled the clauses and n-grams in the original sentence to obtain a new sentence, which serves as another positive sample for contrastive learning. Negative samples are the other sentences in the batch. The final loss for model2 is:

$$loss = loss_1 + r \ast loss_2 \tag{1}$$

r is a hyperparameter that modulates the emphasis between sentence structure and semantics.
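As a rough illustration of how the two loss terms could be combined, the sketch below builds a structurally perturbed positive by dropping and shuffling clauses and evaluates an in-batch contrastive loss on toy vectors. It is not the training code of this work: the function names, the drop probability, and the temperature are assumptions, and n-gram level perturbation is omitted for brevity. Only the form loss = loss1 + r * loss2 comes from Eq. 1.

```python
import math
import random


def perturb_clauses(clauses, drop_prob=0.2, rng=None):
    """Drop some clauses at random and shuffle the rest (hypothetical augmentation)."""
    rng = rng or random.Random()
    kept = [c for c in clauses if rng.random() >= drop_prob]
    if not kept:  # never return an empty sentence
        kept = [rng.choice(clauses)]
    rng.shuffle(kept)
    return kept


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """positives[i] is the positive for anchors[i]; other rows in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def model2_loss(loss1, loss2, r=0.2):
    """Eq. 1: weighted sum of the semantic term and the structural term."""
    return loss1 + r * loss2
```

In an actual training loop, the anchors and positives would be encoder outputs for the original and perturbed sentences rather than fixed toy vectors.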

For the partial similarity pattern, sentences are considered similar if they share similar clauses. We detected similarities at the clause level using both model1 and model2.

In large-scale information retrieval, brute-force search is often impractical due to the time and resources required. Therefore, retrieval usually follows a multi-step process that balances precision and efficiency.

The first step is to recall potential candidates. In our work, we identified the K members most similar to each sentence embedding. Then, we calculated a distance threshold and used it to select the sufficiently similar candidates.

For each pattern, we used the following steps to detect:

1. Extract embeddings of all sentences using the RoBERTa model.

2. De-duplicate embeddings. For each embedding, find its TopK similar embeddings. Denote all embedding pairs obtained as P.

3. Calculate the Euclidean distance of each embedding pair in P and take the t-th percentile as the similarity threshold \(d_{thr}\).

4. Keep only the sentence pairs whose embedding distance is smaller than \(d_{thr}\).

We detected similar pairs with these three strategies separately and gathered their results. The detected similar sentence pairs give concrete evidence of text-level intertextuality.
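The recall-then-threshold procedure can be sketched on toy vectors as follows. A brute-force search stands in for the vector index, the embeddings are assumed to be already extracted and de-duplicated, and the nearest-rank percentile with an inclusive comparison is an assumption made so that a tiny example still returns a pair. All function names are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def topk_pairs(embeddings, k):
    """Recall step: for each embedding, collect its k nearest neighbours (brute force)."""
    pairs = {}
    for i, u in enumerate(embeddings):
        neighbours = sorted(
            (euclidean(u, v), j) for j, v in enumerate(embeddings) if j != i
        )
        for dist, j in neighbours[:k]:
            pairs[(min(i, j), max(i, j))] = dist  # store each unordered pair once
    return pairs


def percentile_threshold(distances, t):
    """Nearest-rank t-th percentile of the candidate distances."""
    ordered = sorted(distances)
    rank = max(1, math.ceil(t / 100 * len(ordered)))
    return ordered[rank - 1]


def detect(embeddings, k=100, t=1):
    """Return the index pairs whose distance does not exceed the t-th percentile."""
    pairs = topk_pairs(embeddings, k)
    d_thr = percentile_threshold(pairs.values(), t)
    return {pair for pair, dist in pairs.items() if dist <= d_thr}
```

With millions of sentences, the brute-force loop would be replaced by an approximate nearest-neighbour index; the thresholding logic stays the same.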

Book-level aggregation

Text-level results can support textual research on microevolution. However, analysis at the macro level requires these results to be gathered and aggregated. In this module, we aggregated the text-level intertextuality results and synthesized them into a book-level intertextual association graph g. In this graph, each node \(B_i\) represents a book, and there are N books in total. The edges indicate the intertextual associations between books. Suppose two books \(B_i\) and \(B_j\) contain \(n_i\) and \(n_j\) distinct sentences, respectively, and \(s_{ij}\) distinct similar sentence pairs were detected between them. The edge weight \(\alpha_{ij}\) between \(B_i\) and \(B_j\) is calculated as follows:

$$\alpha_{ij} = \frac{s_{ij}}{n_i \ast n_j} \tag{2}$$

Node \(B_i\) has a one-hot node feature \(x_i = [x_{i1},x_{i2} \ldots x_{iN}]\), where \({x_{ii}} = {1}\).
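As a small worked illustration of Eq. 2, the sketch below turns pair counts and sentence counts into edge weights. The book labels and numbers are made up for the example, and storing each weight in both directions is an assumption based on the symmetry of the formula.

```python
def edge_weights(pair_counts, sentence_counts):
    """alpha_ij = s_ij / (n_i * n_j), stored for both (i, j) and (j, i)."""
    alpha = {}
    for (i, j), s_ij in pair_counts.items():
        weight = s_ij / (sentence_counts[i] * sentence_counts[j])
        alpha[(i, j)] = alpha[(j, i)] = weight
    return alpha


# Hypothetical books "A" and "B": 6 similar pairs, 200 and 300 sentences.
alpha = edge_weights({("A", "B"): 6}, {"A": 200, "B": 300})
```

Dividing by the product of the two sentence counts normalizes for book length, so that long books do not dominate the graph merely because they offer more candidate pairs.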

Community-level inference

Text-level intertextuality can be observed explicitly. However, some intertextual connections can be implicit, with no direct textual association. In this study, we treated these classics as a cultural community and explored the implicit intertextuality at the community level. A schematic is shown in Fig. 2b.

Explicit intertextuality

If two books share similar sentences, they are explicitly intertextual.

Implicit intertextuality

If Book1 and Book2 are explicitly intertextual, and Book2 and Book3 are explicitly intertextual, then it can be inferred that Book1 and Book3 are implicitly intertextual.

This module performs inference by propagating and aggregating information through the topology of the intertextual association graph:

$$I_{ex} = \sum_{j} \alpha_{ij} \ast x_j \tag{3}$$
$$x_i^\prime = x_i + I_{ex} \tag{4}$$
$$I_{im} = \sum_{j} \alpha_{ij} \ast x_j^\prime \tag{5}$$
$$y_i = x_i^\prime + r^\prime \ast I_{im} \tag{6}$$

The first operation gathers explicit intertextuality \(I_{ex}\) into the node feature. The second operation infers and integrates the implicit intertextuality \(I_{im}\). r′ is a custom weight that adjusts the emphasis on implicit intertextuality. After two rounds of graph computation, the node feature of \(B_i\) is \(y_i = [y_{i1},y_{i2} \ldots y_{iN}]\), where \(y_{ij}\) indicates the united intertextual score \(I_{ij}\) between \(B_i\) and \(B_j\).

The node feature reflects the distribution of intertextuality for each book within the community. Excessive aggregation of information on the graph can lead to over-smoothing, which is detrimental to node features. Therefore, we limited the number of graph computations to two. Sparsity is an issue that often plagues text-based cultural analysis. With this method, the sparsity of intertextuality detection results can be alleviated.
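The two propagation steps of Eqs. 3 to 6 can be traced on a three-book toy graph. This is a minimal sketch under stated assumptions: the weights are invented, the value of r′ is arbitrary, self-loops are excluded, and the clipping described in the settings is left out.

```python
def propagate(alpha, features):
    """One graph computation: node i sums alpha[i][j] * features[j] over its neighbours j."""
    n = len(alpha)
    return [
        [sum(alpha[i][j] * features[j][c] for j in range(n) if j != i) for c in range(n)]
        for i in range(n)
    ]


def infer(alpha, r_prime=0.2):
    """Return the united scores y, where y[i][j] relates book i to book j."""
    n = len(alpha)
    x = [[1.0 if i == c else 0.0 for c in range(n)] for i in range(n)]  # one-hot features
    i_ex = propagate(alpha, x)                                          # Eq. 3
    x1 = [[x[i][c] + i_ex[i][c] for c in range(n)] for i in range(n)]   # Eq. 4
    i_im = propagate(alpha, x1)                                         # Eq. 5
    return [
        [x1[i][c] + r_prime * i_im[i][c] for c in range(n)]             # Eq. 6
        for i in range(n)
    ]


# Toy graph: Book1-Book2 and Book2-Book3 are linked, Book1-Book3 are not.
alpha = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.4],
    [0.0, 0.4, 0.0],
]
y = infer(alpha)
```

In this toy graph, Book1 and Book3 share no edge, yet `y[0][2]` becomes positive after the second step because both are linked to Book2, which is the implicit intertextuality described above.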

Settings and modelling results

In text-level detection, we trained the models on an Nvidia GTX 1080 Ti GPU. The optimizer was Adam (Kingma and Ba 2015). We took the pre-trained ancient Chinese RoBERTa-base model as a basis. For both model1 and model2, we fine-tuned the base model for 10 epochs at a learning rate of 1e-6. The batch size was 32. The r for the loss of model2 was set to 0.2. For similarity detection, we set K to 100 and t to 1 based on our data scale and observations. The large-scale vector search tool Faiss (Johnson et al. 2019) was applied to speed up vector retrieval.

In book-level aggregation, we found that different genres have different punctuation styles, which distorts the total number of sentences. After observation, we found that in this dataset, the number of sentences with at least two clauses is relatively stable. Therefore, we set the number of sentences \(n_i\) of the book \(B_i\) to the number of sentences with at least two clauses.

In community-level inference, r′ was set to a value that makes implicit intertextuality one-fifth of explicit intertextuality. \(x_i^\prime\) and \(I_{im}\) were clipped at ten times their mean. In the modelling after adding era-text, the information propagation between era-text nodes was blocked to evaluate each era independently.

As a result, the detection module identified over 411,000 pairs of similar sentences between classics and 2,209,000 pairs between classics and era-text. An intertextual association graph was built from these pairs.

Manual evaluation of text-level detection

Note that in this corpus, each sentence has millions of intertextual candidates from books on diverse topics. As a result, the likelihood of any two sentences being intertextual is extremely low. Building a hand-labelled test set to evaluate the recall rate is nearly impossible. Therefore, we manually evaluated the accuracy rate with the same number of recalled sentences.

We invited three people with graduate degrees and research experience in Chinese classical literature to conduct the manual evaluation. The evaluators were asked to assess the intertextuality of each detected sentence pair. A pair received 1 point if the two sentences shared a similar meaning, topic, or structural style, and 0 points otherwise. We took the single-pattern methods as baselines: we used the SimCSE (Gao et al. 2021) model to detect the same number of pairs at the sentence and clause levels, respectively. One pair was randomly sampled from each book in the dataset of classics, giving three groups of 133 pairs each.

The results are shown in Table 3. The average accuracy of our proposed multi-pattern detection model is 82.22% (Pearson's r = 0.74), while the two single-pattern baselines reach 73.70% and 45.92%. This suggests that the multi-pattern design can improve intertextuality detection performance.

Table 3 The results of the manual evaluation.

Ablation of community-level inference

We performed an ablation study on a specific book to validate the designed inference module. Figure 3 shows the intertextual connection between Analects and other classics. To compare the modules fairly, we adjusted the weights r′ so that explicit and implicit intertextuality have equal status in united intertextuality.

Fig. 3: The intertextual associations of Analects towards other books in different modelling stages.

The values are mean-normalized, and their standard deviations are given for each panel. a Number of similar sentence pairs s. b Explicit intertextual scores Iex. c Implicit intertextual scores Iim. d United intertextual scores I.

The number of similar pairs varies widely due to the varying length of books. After aggregation, normalized explicit intertextual scores are obtained. However, some books do not share similar sentences, resulting in vacancies. Implicit intertextual scores are positively correlated with explicit intertextuality. It fills the gap of explicit intertextuality and alleviates sparsity. In addition, the introduction of implicit intertextuality brings smoothness, leading to more robust united intertextual scores (std = 0.81) than explicit intertextual scores (std = 1.07).

Indicator robustness

A metric that is susceptible to data variance is not ideal. Therefore, we examined these two concerns regarding the intertextual score I:

Q1: Is the intertextual score affected evidently by data size?

Q2: Does the intertextual score decrease noticeably due to language discrepancy in different eras?

For Q1, we calculated the correlation between data size and intertextuality with the classic dataset. The two variables used in the correlation calculation are as follows:

Data Size: the number of sentences involved in intertextuality detection for each book.

Intertextual Score: The average intertextual score of each book with all other books.

Our results show that there was no significant correlation between data size and intertextual score (\(r = - 0.1427,\,P = 0.1025,\,n = 133\)). Therefore, we consider that the decrease in the Historical Status Index H is not due to data size.

For Q2, let us examine some cases. Jin Si Lu of the Song Dynasty (published around 1175) and Chuan Xi Lu of the Ming Dynasty (published around 1472–1529) are two famous works of Neo-Confucianism, which emerged as a continuation of Confucianism thousands of years after its birth. Compared with previous books, is the intertextuality between these two books and Confucianism prominent?

To answer this question, we ranked the intertextual scores between all books and the keystone works of Confucianism and observed where these two books are placed. We found that these two books rank highly (1/131, 2/131), even surpassing Confucian books that are closer in time to the Axial period. Therefore, we consider that language differences across eras do not have an obvious impact on the intertextual score.

Through these two examinations, we can conclude that the indicator, intertextual score I, is robust to data variance.

Results

Study 1. Interaction between scholars and schools

At the first level, we discussed the interaction between scholars and schools. Schools can be remoulded by later generations of scholars during their thousands of years of evolution. Confucianism and Taoism were the most influential philosophical schools in ancient China. We examined their evolutionary paths by assessing the preference of their followers through intertextual distributions of literary works. Besides, some literature is controversial or ambiguous in the mists of antiquity. To clarify the true path of cultural evolution, we provided quantitative suggestions for the school attribution of Lüshi Chunqiu and the authorship attribution of the Collected Works of Tao Yuanming.

In the axial age, religion and philosophy transformed drastically in various civilizations. The Hundred Schools of Thought, which arose in the Eastern Zhou Dynasty (around 500 BC), were the prototype of Chinese philosophy. The enduring and pervasive influence of schools such as Confucianism, Taoism, Mohism, Legalism, and Military makes them essential to any discussion of ancient Chinese culture (Sima 1959; Ban 1962).

Scholars and schools are symbiotic. Scholars were inevitably exposed to mainstream schools of their periods, while the doctrines of schools needed to be passed down to subsequent scholars. In this section, we investigated the interaction between scholars and schools through the intertextual associations of their literature. We calculated the Tendency Index T between 125 ancient Chinese classics and the five schools mentioned above. This index shows the ideological tendency of a particular collection of texts toward each school. The schematic diagram of this index is shown in Fig. 4a, and the details of its design are as follows.

Fig. 4: Calculation of analysis index.

a Calculation of Tendency Index T. b Calculation of Historical Status Index H.

Based on the consensus of Chinese philosophy (Feng and Bodde 1948), we selected the keystone works as the benchmarks for each school. We first calculated the average intertextual score between a book and the keystone works of each school. The Tendency Index is defined as the ratio of the average intertextual score with a specific school to the mean of these scores across all schools. Suppose there are books \(B = \{ B_1,B_2 \ldots B_m\}\) and schools \(S = \{ S_1,S_2 \ldots S_v\}\). The intertextual score between any two books \(B_i\) and \(B_j\) is \(I_{ij}\), which can be obtained from the node features of the association graph. For a book \(B_i\) and a school \(S_k\) with l keystone works, \(T_{ik}\) is calculated as follows:

$$IS_{ik} = \frac{1}{l}\sum\limits_{B_j \in S_k} I_{ij} \tag{7}$$
$$\overline{IS_i} = \frac{1}{v}\sum\limits_{S_k \in S} IS_{ik} \tag{8}$$
$$T_{ik} = \frac{IS_{ik}}{\overline{IS_i}} \tag{9}$$

Tik reflects the tendency of book Bi for school Sk compared to other schools. When Tik > 1, Book Bi has an above-average preference for school Sk.
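A short numerical sketch of Eqs. 7 to 9 may make the index concrete. The scores, keystone labels, and the restriction to three schools are invented for illustration; only the formulas follow the definitions above.

```python
def tendency_index(scores, schools):
    """scores: {keystone work: I_ij} for one fixed book B_i.
    schools: {school name: list of its keystone works}."""
    # Eq. 7: average score of the book with each school's keystone works
    is_ik = {
        school: sum(scores[work] for work in works) / len(works)
        for school, works in schools.items()
    }
    # Eq. 8: mean of the school-level averages
    mean_is = sum(is_ik.values()) / len(is_ik)
    # Eq. 9: ratio to the mean
    return {school: value / mean_is for school, value in is_ik.items()}


# Hypothetical intertextual scores of one book with four keystone works.
scores = {"c1": 0.6, "c2": 0.4, "t1": 0.2, "m1": 0.3}
schools = {"Confucianism": ["c1", "c2"], "Taoism": ["t1"], "Mohism": ["m1"]}
T = tendency_index(scores, schools)
```

Here the school-level averages are 0.5, 0.2, and 0.3, so the index is above 1 only for the first school. By construction, the values of T for one book sum to the number of schools.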

We also examined the significance of text-level intertextuality. Specifically, we investigated whether sentences from a specific book have a significantly greater probability of being detected in the keystone works of a school than the average of other schools. Considering that these books typically contain a large number of sentences \(({\text{Median}} = 2739)\), we employed a one-tailed Z-test statistic. This statistic was constructed from the similar sentence pairs detected. Suppose there are books Bi and Bj containing ni and nj sentences after data processing. There are sij distinct similar sentence pairs detected between them. For book Bi and school Sk, the calculation of test statistic Z is as follows:

$$P_{ik} = \frac{1}{l}\sum\limits_{B_j \in S_k} \frac{s_{ij}}{n_j} \tag{10}$$
$$\overline{P_{ik^\prime}} = \frac{1}{v - 1}\sum\limits_{S_{k^\prime} \ne S_k} P_{ik^\prime} \tag{11}$$
$$\sigma_{ik}^2 = P_{ik}\left( 1 - P_{ik} \right)\frac{1}{l}\sum\limits_{B_j \in S_k} \frac{1}{n_j} \tag{12}$$
$$\overline{\sigma_{ik^\prime}}^2 = \frac{1}{v - 1}\sum\limits_{S_{k^\prime} \ne S_k} \sigma_{ik^\prime}^2 \tag{13}$$
$$Z = \left( P_{ik} - \overline{P_{ik^\prime}} \right)/\sqrt{\sigma_{ik}^2 + \overline{\sigma_{ik^\prime}}^2} \tag{14}$$

We set the significance level α to 0.05. With the Tendency Index and the P-value of this test, we discuss the scholar-school linkages quantitatively.
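The test statistic in Eqs. 10–14 can be sketched in a few lines of Python. The helper names `school_stats` and `z_test`, the input layout, and the sentence-pair counts below are hypothetical and chosen only to make the arithmetic visible. The sketch assumes every school has at least one keystone work and that the combined variance is non-zero.

```python
from math import sqrt
from statistics import NormalDist


def school_stats(pairs, sizes):
    """Return (P_ik, sigma_ik^2) for one school.

    pairs: s_ij, similar sentence pairs between book B_i and each keystone work B_j
    sizes: n_j, number of sentences in each keystone work B_j
    """
    l = len(sizes)
    p = sum(s / n for s, n in zip(pairs, sizes)) / l       # Eq. 10
    var = p * (1 - p) * sum(1 / n for n in sizes) / l      # Eq. 12
    return p, var


def z_test(target, others):
    """One-tailed Z-test of school S_k (target) against the other v - 1 schools."""
    p_k, var_k = school_stats(*target)
    rest = [school_stats(*other) for other in others]
    p_rest = sum(p for p, _ in rest) / len(rest)           # Eq. 11
    var_rest = sum(v for _, v in rest) / len(rest)         # Eq. 13
    z = (p_k - p_rest) / sqrt(var_k + var_rest)            # Eq. 14
    p_value = 1 - NormalDist().cdf(z)                      # upper-tail probability
    return z, p_value


# Toy counts: (s_ij list, n_j list) per school, all values hypothetical.
target = ([30, 50], [1000, 1000])
others = [([10, 10], [1000, 1000])]
z, p_value = z_test(target, others)
print(z, p_value, p_value < 0.05)
```

The upper-tail probability is used because the alternative hypothesis is that the target school attracts a greater share of similar sentences than the other schools do on average.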

Evolutionary path of philosophical schools

The schools in ancient China were constantly evolving as scholars reshaped previous theories. As acknowledged in the history of Chinese philosophy (Feng and Bodde 1948), the original Taoist philosophy inspired the Taoist religion and Wei Jin metaphysics, while Neo-Confucianism inherited the theories of Confucianism. This section validates these evolutionary paths of Taoism and Confucianism quantitatively.

Taoism was a philosophical school that mainly advocated conformity to nature. Taoist religion evolved from Taoist philosophy and remains the most prominent native religion today (Raz 2012). The representatives of Taoist philosophy, Laozi and Zhuangzi, were revered as the founder and patriarch of the Taoist religion respectively. Figure 5a shows the Tendency Index of two Taoist religious classics, Cantongqi and Wen Shi Zhen Jing. They were significantly inclined towards Taoist philosophy (Cantongqi, \(T = 2.48\), \(P = 0.0142\), \(n = 529\); Wen Shi Zhen Jing, \(T = 2.62\), \(P = 0.0019\), \(n = 879\), for Taoism). This demonstrates the consistency between Taoist religion and Taoist philosophy in their evolution.

Fig. 5: Tendency index of several classics towards five schools of thought.

The dynasty of publication and the corresponding AD years of each book are shown below. The keystone works of each school are listed on the right, including the time of publication. a Tendency Index of two Taoist classics, Cantongqi and Wen Shi Zhen Jing, and of the collected works of two scholars, Ruan Ji and Ji Kang. b Tendency Index of two Neo-Confucian classics, Jin Si Lu and Chuan Xi Lu. c Tendency Index of Lüshi Chunqiu. d Tendency Index of the Collected Works of Tao Yuanming and of its widely accepted and controversial parts.

Apart from this religious re-creation, Taoism inspired a new school of philosophy. Wei Jin metaphysics, a variant of Taoist philosophy, arose during the Three Kingdoms period (220–280) and flourished in the Jin Dynasty (266–420). Ruan Ji and Ji Kang were two of its representative scholars. Figure 5a shows the Tendency Index of their collected works. Compared with the other four schools, scholars of Wei Jin metaphysics were closer to the theories of Taoism (Collected Works of Ruan Ji, \(n = 1590\); Collected Works of Ji Kang, \(P = 3.57 \times 10^{-6}\), \(n = 2209\), for Taoism).

This kind of transformation also occurred in Confucianism. Confucianism, which originated around 500 BC (Yao 2000), had an extensive impact on ancient Chinese culture and spread throughout East Asia. Over millennia, the philosophy evolved, and Neo-Confucianism became the new representative of Confucianism in the Song Dynasty (960–1279) and Ming Dynasty (1368–1644) (Bol 2008). Jin Si Lu and Chuan Xi Lu, written by Zhu Xi and Lü Zuqian, and by Wang Yangming, respectively, were two of its most famous classics. Their Tendency Index is shown in Fig. 5b. The significant intertextual connection between the two works and Confucianism confirms their inheritance (Jin Si Lu, \(T = 2.84\), \(P = 1.01 \times 10^{-13}\), \(n = 2914\); Chuan Xi Lu, \(T = 3.36\), \(P = 2.33 \times 10^{-15}\), \(n = 2495\), for Confucianism).

Controversial literature attribution

Because of their antiquity, information about some classics has become vague over thousands of years of circulation. Attributing ancient literature to appropriate schools and original authors has been a long-discussed issue in Chinese cultural studies, and scholars have recently begun quantitative investigations in this regard (Zhu et al. 2021; Zhou et al. 2023). In this section, we provide quantitative suggestions for controversial literature based on its intertextual distributions among schools.

Appropriate school attribution could contribute to the study of the influence and evolution of cultural thought. For example, Lüshi Chunqiu, an encyclopedic classic from the Warring States Period, was compiled in 239 BC with the support of the politician Lü Buwei. It brought together doctrines from various schools. However, there is no consensus on its predilection among them (e.g., Syncretism theory, Taoism theory, and Confucianism theory (Chen 2001)).

In Fig. 5c, our quantitative modelling result shows that Lüshi Chunqiu is a syncretic work (\(T = 0.78\sim 1.43\)) led by Taoism (\(T = 1.43\), \(P = 0.0004\), \(n = 6118\), for Taoism). This indicates that the editors compiled the theories of that period syncretically, with a slight inclination toward Taoism.

The variation of intertextual distributions can also be applied to controversial authorship attribution. Some ancient books were published in the name of famous scholars, but the real authors may be someone else. However, works by different people have their own styles. The divergence in thought between a famous scholar and an impostor could be reflected in the intertextual variation of their works.

For example, Tao Yuanming is widely recognized as a representative of Chinese individual liberalism (Swartz 2008). He refused to serve the government and pursued a pastoral life. His yearning for a free life was depicted in his poems, which are highly consistent with the tenets of Taoism. He is considered to have had a strong predilection for Taoism and to have been slightly affected by Confucianism. Therefore, it is puzzling to find that the Tendency Index shown in Fig. 5d indicates a significant predilection for Confucianism in the Collected Works of Tao Yuanming (\(T = 2.41\), \(P = 0.0007\), \(n = 2119\), for Confucianism).

Further investigation revealed that the authorship of some parts of the Collected Works of Tao Yuanming is controversial. The version compiled by Xiao Tong (501–531) did not contain Five Sets of Filial Piety Biographies and Book of Ministers, while the version of Yang Xiuzhi (509–582) added them. Yang Xiuzhi mentioned in the preface that Xiao Tong’s version was missing these two parts, so he added them to prevent them from being lost in future generations.

However, later scholars gradually became suspicious of these two parts. The most famous example is the assertion in Siku Quanshu (Ji 1997). Judging them “self-contradictory” and “meaningless”, Siku Quanshu declared that Five Sets of Filial Piety Biographies and Book of Ministers were counterfeit. This view remains popular today, owing to the authority of Siku Quanshu.

To find clues to this dispute, we compared the intertextual distributions of the widely accepted and controversial parts. We divided the Collected Works of Tao Yuanming into two parts: collection 1 contains Five Sets of Filial Piety Biographies and Book of Ministers, while collection 2 contains the remaining works. The Tendency Index for the two collections is shown in Fig. 5d. The “Tao Yuanming” of collection 1 exhibited a significant preference for Confucianism (\(T = 3.02\), \(P = 0.0002\), \(n = 436\), for Confucianism), while the “Tao Yuanming” of collection 2 inclined towards Taoism (\(T = 2.45\), \(P = 0.0658\), \(n=1683\), for Taoism) and had an above-average preference for Confucianism (\(T = 1.51\), \(P = 0.1985\), \(n = 1683\), for Confucianism). The modelling result of collection 2 is consistent with the actual behaviour and mainstream understanding of Tao Yuanming.

The Tendency Index shows an antithesis between the controversial sections and other parts in terms of their intertextual connections to Confucianism and Taoism. Considering the life experience of Tao Yuanming, our finding lends further support to the speculation that Five Sets of Filial Piety Biographies and Book of Ministers were forged by others in the name of Tao Yuanming.

Nevertheless, it is also worth considering that these two books were intended as textbooks for family education. If we treat them as the original works of Tao Yuanming, the intertextual discrepancy in the results reveals the divergence between the personal pursuits of Tao Yuanming (Taoism) and his aim to educate future generations (Confucianism).

Study 2. Vicissitudes of schools

At the second level, we studied the rise and fall of schools in different eras and domains. Scholars have employed character co-occurrence (Yang and Song 2022), syntactic patterns (Lee et al. 2018), and topic analysis (Nichols et al. 2018) to quantitatively measure the grammatical and ideological connections in ancient Chinese literature, thus supporting research into cultural differences and thought evolution. In this section, we studied cultural phenomena through diachronic and field-specific intertextual distributions. We investigated quantitative evidence for the connections between historical events and school status. Besides, schools’ claims have their own focus, making them favoured by different aspects of culture. We quantitatively discussed the status of Confucianism and Taoism in various cultural domains.

To achieve this, we divided ancient China into 12 eras and built an era-text corpus from history books and anthologies. The era-text corpus is a comprehensive collection of literature from official and folk sources, allowing it to be taken as an indicator of the prevailing thought of each era. The era-text was grouped by era and added to the intertextuality modelling. For a specific collection of texts, its intertextual association with the era-text of an era implies its popularity in that era. The Historical Status Index H was designed to measure the school status in each era. The schematic diagram of this index is shown in Fig. 4b, and the details of the calculation are as follows.

We first calculated the average intertextual score between the keystone works of each school and the era-text in each era. For each school, its index H is defined as the ratio of the average intertextual score in a specific era to its mean across all eras and schools. Let \(S = \{ S_1,S_2, \ldots, S_v\}\) denote the set of schools, and \(E = \{ E_1,E_2, \ldots, E_f\}\) denote the set of eras. For a given school \(S_k\) and era \(E_e\), where school \(S_k\) has \(l\) keystone works and era \(E_e\) has \(c\) books in era-text, the Historical Status Index \(H_{ke}\) is calculated as follows:

$$\begin{array}{*{20}{c}} {ISE_{ke} = \frac{1}{{l \ast c}}\mathop {\sum }\limits_{B_i \in E_e,B_j \in S_k} I_{ij}} \end{array}$$
(15)
$$\begin{array}{*{20}{c}} {\overline {ISE} = \frac{1}{{v \ast f}}\mathop {\sum }\limits_{S_k \in S,E_e \in E} ISE_{ke}} \end{array}$$
(16)
$$\begin{array}{*{20}{c}} {H_{ke} = \frac{ISE_{ke}}{\overline {ISE} }} \end{array}$$
(17)

\(H_{ke}\) reflects the status of school \(S_k\) in the era \(E_e\). If the mean value \(\bar H\) over the schools in era \(E_e\) exceeds 1.00, it suggests that the schools had an above-average influence in era \(E_e\). The Historical Status Index of five schools in 12 eras is shown in Fig. 6.
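Eqs. 15–17 can be illustrated with a short Python sketch. The function names, the dictionary keyed by (era-text book, keystone work) pairs, and all scores below are hypothetical, and the per-era mean is computed here as a plain average over schools, which is an assumption about how the line chart in Fig. 6 is obtained.

```python
def historical_status(I, schools, eras):
    """Historical Status Index H_ke for every (school, era) pair.

    I:       maps (era_text_book, keystone_work) to an intertextual score I_ij
    schools: maps a school S_k to its l keystone works
    eras:    maps an era E_e to the c books of its era-text
    """
    ise = {}
    for school, works in schools.items():
        for era, books in eras.items():
            total = sum(I[(book, work)] for book in books for work in works)
            ise[(school, era)] = total / (len(works) * len(books))   # Eq. 15
    ise_bar = sum(ise.values()) / len(ise)                           # Eq. 16 (v * f terms)
    return {key: value / ise_bar for key, value in ise.items()}     # Eq. 17


def era_mean(H, era):
    """Mean of H over all schools in one era (assumed definition of the era mean)."""
    values = [h for (school, e), h in H.items() if e == era]
    return sum(values) / len(values)


# Toy input: two schools, two eras, one placeholder book or work in each.
schools = {"A": ["work_a"], "B": ["work_b"]}
eras = {"E1": ["book_1"], "E2": ["book_2"]}
I = {
    ("book_1", "work_a"): 1.0,
    ("book_2", "work_a"): 3.0,
    ("book_1", "work_b"): 2.0,
    ("book_2", "work_b"): 2.0,
}
H = historical_status(I, schools, eras)
print(H, era_mean(H, "E1"), era_mean(H, "E2"))
```

Because every value is divided by the grand mean over all schools and eras, H averages to 1 overall, so an era mean above 1 marks a period in which the keystone works were more intertextual with the era-text than usual.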

Fig. 6: Historical status index H of five schools in history.

The timeline gives the name of each era, with the approximate AD years of its beginning and end. The histogram shows the H of each school, while the line chart indicates its mean value in each era.

School transformation in history

As society changed, schools of thought experienced booms and declines in Chinese history. Historical events such as wars, policies, and regime changes have influenced the schools’ evolution. In this section, we investigated the quantitative textual evidence of these connections through the diachronic changes in their intertextual distributions.

The results show that the keystone classics of these five schools were highly intertextual with era-text for about a thousand years (\(\bar H > 1\)), after which their intertextuality gradually decreased (\(\bar H < 1\)). Although the original texts created during the axial period were still classic, they gradually became unsuitable for the new era (Feng and Bodde 1948). This could be the reason for the decrease in the \(\bar H\) index. Throughout the millennium of prosperity, we can observe the connections between school transformation and historical events.

The popularity of the school of Military was affected by the division of the country in the Three Kingdoms period (220–280), when China was divided into three comparable kingdoms. The country was in turmoil, and wars often broke out between these three kingdoms. Against this background, the school of Military, which was themed on the philosophy of war, reached its heyday (\(H = 3.15\), for Military in the Three Kingdoms period).

The linkage between political events and the prosperity and decline of schools is manifested in the quantitative results. Confucianism was a school of humanism (Juergensmeyer 2005), while Legalism was a school that advocated legal institutions. Some scholars believe that ancient China was influenced by both Confucianism and Legalism (Zhou 2011; Zhao 2015). In the Qin (221 BC–207 BC) and Han (202 BC–AD 220) Dynasties, favour from the government made these two schools stand out rapidly. The Shang Yang Reform (356 BC and 350 BC) and the advocacy of the emperor Qin Shi Huang brought Legalism to a peak in the Qin Dynasty (\(H = 3.67\), for Legalism in the Qin Dynasty). However, this brief prosperity ended with the demise of the Qin Dynasty. The policy implemented in the Han Dynasty, which banned other philosophical schools and venerated Confucianism, caused a drop in \(\bar H\) and made Confucianism (\(H = 1.73\), for Confucianism in the Han Dynasty) exceed the others (\(H = 0.94\sim 1.23\), other schools in the Han Dynasty). This advantage persisted, and Confucianism long remained the dominant philosophical school in ancient China.

School influence in various domains

Confucianism and Taoism were representative schools of collectivism and individualism in ancient China (Munro 1985). As the two most prominent native philosophical schools, Confucianism and Taoism have often been compared. In this section, we studied the influence of Confucianism and Taoism through their intertextual distributions among various cultural domains.

Confucianism placed greater emphasis on family and social relations, whereas Taoism focused more on nature and the spirit. For most of the time since the Han Dynasty (202 BC–AD 220), Confucianism was far superior to other schools of thought. Nevertheless, there was an anomaly in history. As shown in Fig. 6, Taoism experienced a revival from the Three Kingdoms period (220–280) to the Jin Dynasty (266–420). In the Jin Dynasty, the status of Confucianism (\(H = 2.14\), for Confucianism in the Jin Dynasty) and Taoism (\(H = 2.10\), for Taoism in the Jin Dynasty) was very close. This revival stemmed from the collapse of the Han Dynasty, which had advocated Confucianism. During this period, people sought a successor among the theories of other schools (Feng and Bodde 1948). Against this background, Wei Jin metaphysics developed from Taoist theory. However, this prosperity did not last long. After the brief revival, Taoism decayed while Confucianism remained the mainstream.

In addition to the diachronic investigation, we discussed the status of Confucianism and Taoism in different cultural domains according to their intertextuality with texts on related topics. History books in ancient China tended to record political events. Therefore, we took the intertextual associations with history books to indicate the political status of a school. The average Tendency Index of history books is shown in Fig. 7a. We also tested whether the Tendency Index towards Confucianism significantly exceeds that towards Taoism with a one-tailed paired samples t-test. The distribution of their difference value is shown in Fig. 7b and approximates a normal distribution. The significance level α is set to 0.05. Confucianism exceeded Taoism significantly in the political domain (\(P = 2.22 \times 10^{-15}\), \(n = 55\)).

Fig. 7: A comparison of intertextual distributions between Confucianism and Taoism.

a The average Tendency Index of history books. b Difference value distribution of Tendency Index between Confucianism and Taoism among 55 history books. c The average Tendency Index of 125 classics from various cultural domains. d Difference value distribution of Tendency Index between Confucianism and Taoism among 125 classics. e Tendency Index of 125 classics towards Confucianism and Taoism, sorted by their difference value.

Although Taoism did not replace Confucianism in the political domain, it is comparable to Confucianism in broader cultural communities. We calculated the average Tendency Index of 125 classics from various cultural domains, and the result is shown in Fig. 7c. We tested whether the Tendency Index towards the two schools differs with a two-tailed paired samples t-test. The distribution of their difference value is shown in Fig. 7d and approximates a normal distribution. The significance level α is set to 0.05. There is no significant difference between Confucianism and Taoism among these classics (\(P = 0.8014\), \(n = 125\), not rejecting the null hypothesis). Specifically, Fig. 7e shows the Tendency Index of the 125 classics towards the two schools. Among these classics, Confucianism and Taoism had respective advocacy groups. Books on politics and regulations are highly intertextual with Confucianism, while books on mythology, occultism, and medicine are close to Taoism.
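The two paired samples t-tests used in this subsection can be sketched as follows. This is a standard-library illustration under stated assumptions, not the code behind the reported values: `paired_t_test` and `t_upper_tail` are hypothetical helper names, the Tendency Index values are invented, and the tail probability is obtained by numerically integrating the Student's t density, which loses precision for extremely small P-values. In practice a library routine such as `scipy.stats.ttest_rel` would normally be preferred.

```python
import math
from statistics import mean, stdev


def t_upper_tail(t, df, steps=20000):
    """P(T > t) for Student's t with df degrees of freedom (Simpson's rule)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    upper = max(0.5 - total * h / 3, 0.0)  # P(T > |t|)
    return upper if t >= 0 else 1.0 - upper


def paired_t_test(x, y, alternative="two-sided"):
    """Paired samples t-test on the differences x - y.

    alternative="greater" tests whether the mean of x exceeds the mean of y;
    "two-sided" tests whether the two means differ in either direction.
    Assumes the differences are not all identical (non-zero variance).
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    if alternative == "greater":
        return t, t_upper_tail(t, n - 1)
    return t, 2 * t_upper_tail(abs(t), n - 1)


# Invented Tendency Index values of five books towards each school.
confucian = [2.1, 1.8, 2.5, 1.9, 2.2]
taoist = [1.0, 1.2, 0.9, 1.4, 1.1]
t_stat, p_one = paired_t_test(confucian, taoist, "greater")
_, p_two = paired_t_test(confucian, taoist)
print(t_stat, p_one, p_two)
```

The one-tailed form matches the directional question asked of the history books, while the two-tailed form matches the non-directional comparison across the 125 classics.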

These indicators show that Confucianism has advantages in the political field, while Taoism attempted to surpass Confucianism yet failed. However, Taoism was on par with Confucianism in other fields of ancient Chinese culture. Thus, it is suggested that in ancient China, the political domain was the territory of collectivism, while individualism flourished in the diverse cultural fields.

Study 3. Communication with foreign culture

At the third level, we investigated the communication between ancient China and foreign cultures, with a focus on Buddhism, one of the most influential foreign religions in ancient China. The preaching of Buddhism experienced imitation and integration (Zürcher 2007). We started by identifying the native schools that are most intertextual with Buddhism and then discussed the different stages of the infiltration between Buddhism and native Chinese culture.

Although the dissemination of information in ancient times was much slower than it is now, ancient China had extensive communication with foreign cultures. As a result, the cultural evolution of ancient China was not isolated. Buddhism, a religion that originated in ancient India, spread to ancient China during the Han Dynasty. Buddhist scriptures were translated into Chinese, and the translations were widely disseminated over the next millennia.

In this section, we investigated the preaching of Buddhism in ancient China through the intertextual association between Buddhist scriptures and native classics. We selected the four most influential Buddhist scriptures in ancient China (Diamond Sutra, Lotus Sutra, Shurangama Sutra, and Avatamsaka Sutra) as the keystone works of Buddhism and added them to the modelling. The diachronic changes of intertextual distributions reveal the evolution of cultural integration in different stages. The topics of intertextual associations show the commonalities between Buddhism and native culture.

Analogue in native cultural groups

As a newly introduced religion, Buddhism inevitably interacted with the existing native cultural groups in its preaching. Taoist religion, which developed from Taoist philosophy, was the dominant indigenous religion in China. Scholars generally believe that Buddhism and Taoism imitated each other in many ways (Mollier 2008), including textual scriptures, image symbols, and organization. In this study, we concentrate on textual scriptures. Figure 8a shows the Tendency Index of Buddhist scriptures towards the five native schools. Taoism is the closest native school as expected (\(T = 1.83\), \(P = 0.0131\), \(n = 62693\), for Taoism).

Fig. 8: Quantitative intertextuality results of Buddhist scriptures and indigenous literature.

a Tendency Index of Buddhism toward five native schools. b Similar textual cases between Buddhist scriptures and native classics. c Top 10 native classics that are most intertextual to Buddhism before its introduction. d Similar textual cases of the top three classics in (c). e Top 10 native classics that are most intertextual to Buddhism after its introduction. f Similar textual cases of the top three classics in (e). In (c and e), different colours represent the subject of classics. The area of the block reflects the intertextual score with Buddhist scriptures. The Tendency Index T between each book and Buddhism is shown at the top left of each block. In (b, d, and f), two kinds of highlighting mark identical characters and synonyms, respectively. The approximate years of publication are indicated in the second column.

In addition, we found that Buddhist scriptures borrowed language expressions from existing Chinese terms in the translations of Buddhist concepts. Figure 8b shows two cases from the detected intertextual sentences. The term “Amrita” (meaning “immortality drink”) was borrowed from the word “甘露” (gan lu, meaning “sweet dew”) when translated into Chinese. This word refers to “rain” in the native Taoist classic Tao Te Ching. Similarly, the Chinese translation of “Sattva” (meaning “sentient beings”) employed the term “众生” (zhong sheng, meaning “all living beings”), as found in the Taoist classic Zhuangzi.

Evolution of cultural integration

Apart from the philosophical schools, intercultural communication manifested in various aspects of society. Therefore, we expanded the horizons to broader cultural domains. In this section, we compared the intertextual associations between Buddhism and native literature before and after its introduction.

During the Jin Dynasty (266–420), these four Buddhist scriptures were translated into Chinese, paving the way for Buddhism to flourish in ancient China. After separating the texts before and after AD 420, we ranked native classics based on their intertextual scores with Buddhist scriptures. The top 10 classics are shown in Fig. 8c and e. We also juxtaposed Buddhism with five native schools and calculated the Tendency Index of these classics.

Before the introduction of Buddhism, the native classics most similar to it often focused on myth and religion, implying that the Chinese versions of Buddhist scriptures retained their original themes. The similarity may also be attributable to the assimilation of the corresponding native literature in the Chinese translations of Buddhist scriptures. Specifically, three similar cases from the top three classics are shown in Fig. 8d. The Chinese versions of Buddhist scriptures share similar phrases with native myths in their discussions of mysteries, including the control of ghosts and gods, and the description of the mysterious phenomenon of “burning day and night”. They also mimicked the language expression of native religious discourses. For example, the description of the choice between justice and evil is highly consistent between Cantongqi and Avatamsaka Sutra.

After the introduction of Buddhism, Buddhist doctrines diffused into various domains of native culture. Compared to the previous period, there was an overall increase in the Tendency Index of Buddhism among the top 10 classics, indicating the growth of Buddhism’s influence on Chinese culture. One notable change is the emergence of three native Buddhist works, which shows that Buddhism had built its own advocacy group in ancient China. These works remoulded Buddhism in a new cultural environment with localized doctrines. In addition to expanding its own religious territory, Buddhism integrated into other native religions (Zürcher 1980). For example, the top-ranked work in Fig. 8e is the native religious classic Xuan Zhu Lu, which deeply absorbed Buddhist theories. In terms of missionary targets, the preaching of Buddhism was not limited to ordinary people and even reached the supreme ruler, such as Emperor Wu of Liang (464–549), who ranks third in Fig. 8e. With the advocacy of the emperor, the Liang Dynasty was the heyday of Buddhism in the Southern Dynasties (Strange 2011). In detail, Fig. 8f shows three similar cases from the top three native classics after the introduction of Buddhism. Religious concepts from Buddhism were mixed into Chinese as new words (e.g., ten directions, immeasurable and Buddha). India’s “Ganges River” flowed into ancient China along with Buddhist scriptures.

Online platform

In this paper, we focused on the theme of cultural evolution. However, there are many other meaningful findings in our modelling results, which await further explanation by relevant scholars. Therefore, we have developed an online platform (http://evolution.pkudh.xyz/) featuring an interactive visualization system that displays the corpus and intertextual sentences. This platform shows millions of intertextual cases detected in this work and provides support for further data analysis. Even researchers without programming backgrounds can gain valuable insights into our work and develop further studies using this tool. Several screenshots of the platform are shown in Fig. 9.

Fig. 9: Screenshots of the online data analysis platform for ancient Chinese literature.

a Intertextual sentence browsing from corpus. b Intertextual sentence statistics and visualization within a custom collection. c Visualization of intertextual sentences distribution among different chapters of a book within a custom collection.

Discussion

With the leap forward of big data and AI technology, computer-assisted cultural studies have expanded in both scale and depth. Intricate cultural problems can be discussed quantitatively with the support of large-scale data. In this paper, we used digital methods to quantify the cultural evolution of China over the past thousands of years within a large-scale corpus of ancient literature.

We validated several acknowledged cultural phenomena. The two evolutionary paths of Taoism and Confucianism, inspiring new branches of schools and migrating to religious fields, were confirmed by intertextual associations. Besides, we provided quantitative evidence for the connections between the schools’ status and several historical events, which shows the intertwining of philosophical schools and politics in ancient China.

Through our analysis, we gained quantitative insights into some long-debated cultural problems. For literature with controversial school attribution, our findings suggest that Lüshi Chunqiu is a syncretic work headed by the theory of Taoism. As for literature with controversial authorship attribution, we revealed that Five Sets of Filial Piety Biographies and Book of Ministers are divergent from other works of Tao Yuanming in ideological preference. In the comparison between Confucianism and Taoism, we propose that collectivism represented by Confucianism was mainstream in the political domain, while individualism represented by Taoism was active in extensive fields of ancient Chinese culture.

Furthermore, we investigated intercultural communication between Buddhism and Chinese native culture. The results suggest that the influence of this foreign culture evolved at different stages, from imitation to integration. In the early days, Buddhism imitated similar aspects of native culture to ease resistance (Kohn 1995). After the initial prosperity of Buddhism in China, it was remoulded through localized Buddhist works. As time went by, Buddhism became a part of the local culture. It was evident in various cultural domains of ancient China.

Our study demonstrates that hierarchical intertextuality modelling is a promising tool for cultural analysis within a large-scale corpus. However, there are still limitations in quantitative intertextuality research on Chinese literature. The evolution of language over time makes it challenging to detect intertextuality between ancient Chinese and modern Chinese. Besides, intercultural communication across different languages requires cross-lingual detection, which remains underexplored.

This research represents an innovative attempt to study the evolution of Chinese culture from a digital perspective. It provides new insights into the interpretation of ancient Chinese culture and raises important questions for further exploration: How did ancient Chinese culture evolve into its modern form? What was the impact of global culture on this process of evolution? To conduct more comprehensive research, interdisciplinary and intercultural collaboration is necessary.