1 Introduction

In [1], the linguist Peter Trudgill set out to confirm his observations of Norwich English. Variation in the pronunciation of the suffix -ing in present participles and place names reflects social stratification, which means that people with different occupational backgrounds in Norwich pronounce this phoneme in different ways.

Trudgill divided the various classes in England into six categories according to factors such as occupation, education, income, locality and housing: the upper middle class, the middle middle class, the lower middle class, the upper working class, the middle working class and the lower working class [1]. Additionally, the Norwich experiment covers word-list, reading-passage, formal speech and casual speech styles. In a specific social situation, a speaker's linguistic behavior occupies a different position on the spectrum from formal to informal. There are two pronunciation variants in the experiment: the standard form (ng)-1 = [iŋ], which scores 1, and the dialect form (ng)-2 = [in], which scores 2. The sum of the scores of each sample is divided by the number of tokens to obtain the average score. Then, 1 is subtracted from the average score and the result is multiplied by 100 to obtain the final score. Using this method, always using (ng)-2 yields a score of 100, whereas consistently adopting (ng)-1 yields a score of 0.
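As a minimal sketch of this scoring scheme (the per-token scores below are hypothetical), the (ng) index can be computed as follows:

```python
def ng_index(variant_scores):
    """Trudgill-style (ng) index: 1 = standard [iŋ], 2 = dialect [in].

    The average token score is shifted to a 0-100 scale, where 0 means the
    standard form was used throughout and 100 means the dialect form was
    used throughout.
    """
    average = sum(variant_scores) / len(variant_scores)
    return (average - 1) * 100

# Hypothetical sample: 7 dialect tokens and 3 standard tokens
print(ng_index([2] * 7 + [1] * 3))  # -> 70.0
```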

The final result is shown in Fig. 1. First, this variable clearly correlates with social class; for example, working-class samples preferred the (ng)-2 variant. Moreover, the method reveals quantitative stylistic differences in pronunciation: formal speech can differ considerably from casual or daily speech. Additionally, the array is highly consistent, with scores increasing regularly across rows and columns. Although the social classes differ greatly in their use of the (ng) variants, they all shift in the same direction as the stylistic context changes. Recently, building on Trudgill, increased interest in computational social science [2] and computational sociolinguistics [3] has led to the finding that, from the analysis of certain linguistic variables, writing style is affected by occupation.

Fig. 1

(ng) Scores of different occupations in different styles

Occupation profiling is a subtask of authorship profiling. This task is related to both personal writing style and text classification: individual writing unconsciously reflects a stylistic “fingerprint” from which the language structure of unseen documents can credibly be deduced [4]. Authorship profiling analyzes a document to determine certain demographics, such as age, gender or even nationality. In Walter’s opinion, with increasing age, the usage of pronouns decreases while prepositions are used more frequently, and teenagers use non-dictionary words in blogs more than adults do [5]. Although this task has been extensively studied across a wide range of languages, several challenges remain. First, research on Chinese authorship profiling is still in its early stages; to date, there is no standard public corpus for Chinese authorship profiling studies. Second, most of the abovementioned tasks can be regarded as free-text profiling, which means there are no restrictions on format and structure, such as the number of words or rhyme rules. Finally, researchers pay more attention to gender and age profiling, and only a few focus on occupation profiling.

Chinese classical poetry of the Tang Dynasty has a confusing grammatical structure owing to several restrictions, for example, the tonal patterns, the number of characters in a line, and the number of lines in a poem. Moreover, as shown in Fig. 2, a certain number of domain-specific named entities exist in poems, especially ancient place names and literary quotations. Another feature of classical poetry is its range of subjects. Generally, Cen Shen and Gao Shi preferred to write frontier poems, while the representative pastoral landscape poets are Meng Haoran and Wang Wei. Consequently, subject features are useful for poet occupation profiling.

Fig. 2

Illustrative examples of named entities in poems

To address these characteristics of poem text, in this research, we summarize our major contributions as follows:

1. Our work can be seen as an extension of Trudgill’s approach: poetry writing style can predict a poet’s occupation. In addition, a public corpus for poet occupation profiling is established.

2. We first collected domain-specific named entities from an existing corpus to establish a precise named entity dictionary. We then propose a novel Domain-Knowledge Transformer model for poet occupation profiling, which combines domain knowledge from traditional poetry studies with a Transformer. The results illustrate that the proposed model is effective.

3. A description of other social-attribute features of poetry style that can predict occupation is given. During the design of the Domain-Knowledge Transformer model, we test different combination levels.

2 Related Works

Currently, many researchers pay attention to long-text author profiling in natural language processing. Various author characteristics, e.g., gender [6], age [7], educational background [8], educational level [9], language background [10], and personality type [11], have been predicted from textual datasets over the last two decades. In 2003, Pennebaker [12] found that young people prefer to use the past tense of verbs and first-person pronouns, whereas senior citizens tend to adopt the future tense, nouns, prepositions and articles. Wall and Stuart [13] found significant gender differences in the comments on websites; most of the comments were male dominated. In news subtitles, women-oriented topics include teachers, health and rape, while sports are dominated by men. In terms of fitness, they found terms related to women, such as diet, women and methods (Pilates and yoga), while fitness is the sports topic least dominated by men. In terms of interpersonal relationships, women made slightly more than half of the comments; common words for women include boyfriend, husband, and mother, while common words for men include brother, playboy and man. Albert Gatt [14] built on Labov's [15] pioneering work on using stylistic variation to predict social stratification, developing and comparing deep learning approaches that estimate a person's social class.

Machine learning models have recently been successfully applied to authorship profiling. For example, for high-dimensional data, random forests are effective and widely applied to authorship profiling. There is also evidence that naive Bayes can achieve promising results in authorship profiling. Goswami [16] added various special features, such as non-dictionary words, the average sentence length, and slang words that appear more than 50 times; using the naive Bayes classifier, 80% accuracy was obtained for age group detection and 89% for gender detection. Soler Company [17] and Wanner [18] applied SVMs to various literature and blog article datasets, showing that syntactic and discourse features achieve good results in author gender classification. Ameer [19] applied five classical machine learning methods, logistic regression (LR), the J48 decision tree, sequential minimal optimization (SMO), naive Bayes (NB) and random forest (RF), to various feature set combinations on several English datasets for the PAN authorship attribution shared task, for example, syntactic part-of-speech features, word n-grams, part-of-speech n-grams and character n-grams for age and gender recognition. SMO ultimately had the highest accuracy.

At present, deep learning approaches have been proposed for authorship profiling and have achieved impressive performance. Suman [20] used a three-layer LSTM model and achieved accuracies of 66.25% for gender and 22.22% for age. In López-Santillán’s article [21], genetic programming is applied to generate a new document embedding method; it uses various word embedding approaches, such as word2vec, BERT and fastText, to obtain a weighted average and achieves the best results on nine datasets when predicting characteristics of individual authors, such as age, gender and personality. A bidirectional Transformer method (BiCTr) proposed by Das and Paik [22] can predict the gender of named entities in documents; their approach consists of two transformers, one trained on the NER task and the other trained on gender marking. On all four datasets, their transformer model acquires the highest F1 score.

Among the above studies, researchers pay more attention to gender and age profiling, and only a few focus on occupation profiling. Moreover, most of the abovementioned studies can be regarded as free-text profiling; to our knowledge, no existing study addresses poet profiling.

3 Methodology

Chinese classical poems not only exhibit the usual Chinese language-related problems, such as different forms of writing, but also involve traditional domain knowledge, for example, named entities, themes and poet ages. To address these problems in Tang poetry, this paper proposes a novel hybrid Domain-Knowledge Transformer model. Moreover, to improve the performance of poet occupation profiling, as shown in Fig. 3, we adopt the following three steps: settling language-related problems, exploiting the benefits of domain knowledge, and leveraging the advantages of Transformer.

Fig. 3

Overall view of the proposed model

3.1 Language Related Component

In this subsection, we explore several language-related problems of Chinese occupation profiling. Although several tools are available for Chinese language processing, they are not suitable for solving these particular problems. Therefore, the following solutions were implemented in Python.

  • Rarely used Chinese characters: There is a certain number of rarely used characters in Chinese, especially in ancient Chinese, which produce unrecognizable or illegal characters during the word embedding process. The first module of this component addresses this problem, which decreases the final embedding dimension of the deep learning component of our model and therefore decreases the computational complexity.

  • Confusion of traditional and simplified Chinese: Unlike most Indo-European languages, Chinese has two forms of writing, traditional Chinese and simplified Chinese. Multiform characters increase the difficulty of identification. This module improves the quality of the named entity module of the domain knowledge component of the proposed model and hence increases its accuracy.

  • Alphabetization: Compared with traditional Indo-European languages, Chinese has its own distinctive phonetic features: there are no compound consonants, each syllable has its own tone, and vowels dominate. Moreover, poems from the Tang Dynasty have their own meter and rhyme. Therefore, alphabetization is necessary.

3.2 Domain Knowledge Component

Subject features are also useful for poet occupation profiling in the Tang Dynasty. Generally, there are many subjects in Chinese classical poetry, for example, farewell, boudoir grievance, frontier history, and pastoral landscape. Considering that there is no corpus of labeled poem themes, in the first module of the domain knowledge component, we select LDA to extract topic features as a substitute for poetry theme features. However, LDA is an unsupervised model, so we need a way to assess the quality of the learned topics. For the classical LDA model, Blei [23] suggests that perplexity is only a crude measure, but it is helpful (when using LDA) for getting 'close' to the appropriate number of topics in a corpus. Perplexity is defined as follows:

$$perplexity\left({D}_{test}\right)=exp\left\{-\frac{{\sum }_{d=1}^{M}\mathit{log}\,p\left({w}_{d}\right)}{{\sum }_{d=1}^{M}{N}_{d}}\right\},$$
(1)

where \(M\) represents the number of documents in the test set, \({N}_{d}\) represents the size of document d (i.e., the number of words), and \(p({w}_{d})\) represents the probability of the document. Because we use the bag-of-words model, the likelihood of a document is the product of the likelihoods of all of its words. Therefore, \(p({w}_{d})\) is computed as follows:

$$p\left({w}_{d}\right)=\sum_{z}p\left(z\right)p\left(w|z\right),$$
(2)

where \(z\) ranges over the trained topics and \(w\) refers to a word in the test document. In natural language processing, perplexity is a method for measuring the quality of a language probability model. A language probability model can be treated as a probability distribution over sentences or whole paragraphs, and a better model of the unknown distribution will assign higher probabilities to test data. Thus, lower perplexity indicates less surprise.
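As an illustrative sketch (assuming a trained gensim LdaModel named lda and a held-out bag-of-words corpus test_corpus, both hypothetical names), the perplexity of Eq. (1) can be obtained from gensim's per-word likelihood bound:

```python
import numpy as np
from gensim.models import LdaModel  # `lda` is assumed to be an already trained LdaModel

def lda_perplexity(lda: LdaModel, test_corpus):
    """Perplexity of a held-out bag-of-words corpus, as in Eq. (1).

    gensim's log_perplexity() returns the per-word likelihood bound
    (average log-likelihood per word), so exponentiating its negation
    gives the perplexity.
    """
    per_word_bound = lda.log_perplexity(test_corpus)
    return np.exp(-per_word_bound)
```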

Although perplexity and likelihood have rigorous mathematical definitions, the topics they select are not guaranteed to be easily interpretable. We therefore also adopt topic coherence, a more human-oriented evaluation, to discriminate good topics from bad ones. It is mainly used to measure whether the words within a topic are coherent. Topic coherence is computed over the first n words of each proposed topic and is calculated as the average or median similarity score of the word pairs in a given topic. A better model produces higher topic coherence scores.

There are several evaluation measures designed for coherence. Pointwise mutual information (PMI), which is usually used to evaluate the correlation between two words, was first adapted by Newman [24] to measure topic coherence. Based on a sliding window, Mimno [25] proposed the UCI measure, which averages the PMI of word pairs for a given topic. As shown in formula (3), \(\varepsilon\) indicates a smoothing factor, which guarantees that the score returns a real number (usually \(\varepsilon =1\) is selected, as mentioned in [25]; a smoothing count of 1 is included to avoid taking the logarithm of zero). Röder [26] suggests calculating the smoothed conditional probability between top word pairs; the UMass coherence is calculated from logarithmic conditional probabilities. Because the UMass measure does not rely on a pretrained external corpus, we use it to evaluate our LDA models.

$$PMI\left({w}_{i},{w}_{j}\right)=log\frac{p\left({w}_{i},{w}_{j}\right)+\varepsilon }{p\left({w}_{i}\right)\cdot p\left({w}_{j}\right)},$$
(3)
$${C}_{UCI}=\frac{2}{N\cdot \left(N-1\right)}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}PMI\left({w}_{i},{w}_{j}\right),$$
(4)
$${C}_{UMass}=\frac{2}{N\cdot \left(N-1\right)}\sum_{i=2}^{N}\sum_{j=1}^{i-1}log\frac{p\left({w}_{i},{w}_{j}\right)+\varepsilon }{p\left({w}_{j}\right)}.$$
(5)
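As a minimal sketch (not the authors' code) of how the UMass coherence in Eq. (5) can be computed, assuming a trained gensim LdaModel named lda together with its dictionary and bag-of-words corpus (hypothetical names):

```python
from gensim.models import CoherenceModel

# `lda`, `corpus` and `dictionary` are assumed to exist from LDA training.
coherence_model = CoherenceModel(
    model=lda,            # trained LdaModel
    corpus=corpus,        # bag-of-words corpus; UMass needs no external corpus
    dictionary=dictionary,
    coherence='u_mass',   # log conditional probability measure of Eq. (5)
)
print(coherence_model.get_coherence())  # higher (less negative) is better
```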

The second module in the domain knowledge component is the named entity module. As a special literary form, poems contain a certain number of named entities, especially ancient place names, personal names and literary quotations. It is difficult to recognize these named entities owing to the lack of appropriate annotated datasets. To improve the accuracy of poetry named entity recognition, as shown in Fig. 4, four steps are used. First, we selected 20 poets of the Tang Dynasty [27] and extracted the annotated named entities in the corpus; the marked named entities include ancient personal names, ancient place names, plants, clothes, and literary quotations. Second, we count the word frequency of these selected named entities, and a POS tag is appended to each word. Third, the TF-IDF algorithm is applied to the named entity sets to obtain weighted words. Finally, we build named entity dictionaries and add them to the Chinese word segmentation software jieba to recognize the named entities in the datasets used in this study.
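A minimal sketch of the final step, assuming a hypothetical dictionary file poem_entities.txt with illustrative entries and weights (jieba's user dictionary format is one "word frequency POS-tag" entry per line):

```python
import jieba

# Hypothetical user dictionary built from the extracted named entities.
with open("poem_entities.txt", "w", encoding="utf-8") as f:
    f.write("玉门关 1000 ns\n")   # ancient place name (assumed weight)
    f.write("王昭君 800 nr\n")    # ancient personal name (assumed weight)

jieba.load_userdict("poem_entities.txt")   # listed entities are kept as single tokens
print(jieba.lcut("春风不度玉门关"))          # segmentation now preserves 玉门关
```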

The final module in the domain knowledge component of Fig. 3 mainly marks the features of poet age and official career path. Generally, we separate the Tang Dynasty into four periods: Early Tang, Prosperous Tang, Middle Tang and Late Tang. Simultaneously, people in the Tang Dynasty could become government administrators not only through the imperial examination but also through recommendation (such as Li Bai), joining the army (such as Gao Shi), or becoming a secretary in a garrison command (such as Li Shangyin). The way an official was selected is also reflected in the poet's writing. In this module, we search the CBDB [28] corpus to find the age and the official career path of each poet in our datasets (Fig. 4).

Fig. 4

Overall view of the establishment of the domain-related named entity dictionary

3.3 Deep Learning Component

As a special literary form, poems are both fragmentary and integral, which means that seemingly disconnected elements express the same artistic conception of a poem. Therefore, it is necessary both to capture the detailed information and to grasp the global information in the poems. A CNN, instead of a fully connected network, is used for classification, since it can acquire not only indivisible local features but also some long-range contextual information of a poem. With the help of multi-head attention, Transformer can grasp deeper semantic information in Tang Dynasty poems.

As demonstrated in Fig. 3, the deep learning component contains an embedding module, a Transformer module and a CNN module. In the embedding module, BERT [29] is used for vectorization. Then, in the Transformer module, we apply the multi-head self-attention layer proposed in the classic Transformer [30]. Finally, the CNN module is adopted to obtain both indivisible local features and long-range contextual information of the classical poems for classification.

There is an encoder-decoder structure in the classical Transformer. Because occupation profiling is a classification task, we adopt only the encoder part. The meaning of a given word in a sentence may change dramatically at different positions, and since Transformer is based entirely on the self-attention mechanism, self-attention alone cannot capture word position information. Therefore, we need to add positional encodings to each word embedding, as illustrated in Fig. 3. In this paper, we employ the fixed positional encoding proposed by Vaswani [30]:

$$PE\left(pos,2i\right)=\mathit{sin}\left(pos/{10000}^{\frac{2i}{{d}_{model}}}\right),$$
(6)
$$PE\left(pos,2i+1\right)=\mathit{cos}\left(pos/{10000}^{\frac{2i}{{d}_{model}}}\right),$$
(7)

where \(pos\) is the position of the character in the poem, \(i\) indexes the embedding dimension, and \({d}_{model}\) is the dimension of the character embedding.
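A minimal NumPy sketch of this fixed sinusoidal encoding (a generic implementation under the assumption that d_model is even, not the authors' exact code):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding of shape (max_len, d_model), Eqs. (6)-(7)."""
    pos = np.arange(max_len)[:, None]                  # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]               # dimension indices
    angles = pos / np.power(10000, 2 * i / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe
```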

Transformer completely discards the horizontal (sequential) propagation of the traditional RNN by adopting the self-attention mechanism and propagates only in the vertical direction; it therefore only needs to stack self-attention layers. In this way, the computation of each layer can be performed in parallel and accelerated on a GPU. For each input word embedding, we need to construct the input of self-attention. Here, Transformer first multiplies the word embedding by three weight matrices to obtain the query (Q), key (K) and value (V) embeddings. The reason for multiplying by these three parameter matrices, instead of directly using the original word vector, is that the additional parameters improve the model. Different parameters attend to different aspects: some may focus only on local information, while others attend to global information. With the multi-head attention mechanism, each head has its own duties.

Scaled dot-product attention is the most important part of self-attention. With the help of highly optimized matrix multiplication operations, this mechanism is more space-efficient and much faster than traditional additive attention [31] and multiplicative attention. For the self-attention mechanism, the matrices Q (query), K (key) and V (value) all come from the same input and are processed according to the following steps:

First, we calculate the dot product between Q and K. To guarantee the stability of the gradient, the result is divided by \(\sqrt{{d}_{k}}\), where \({d}_{k}\) is the dimension of the key vectors. Then, we apply the softmax operation to normalize, which yields a probability distribution, and multiply by the matrix V to acquire the weighted representation. The attention matrix is computed as follows:

$$Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V.$$
(8)
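An illustrative NumPy sketch of Eq. (8) (single-head, unbatched, for clarity; not the authors' implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in Eq. (8)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values
```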

Then, a multi-scale CNN model is used on each poem to determine the poet occupation. With multiple filters of different sizes, the CNN can capture local information while keeping the complexity of the network model low. Finally, we use softmax to output the most likely occupation label.
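A minimal PyTorch sketch of such a multi-scale 1D CNN classifier; the layer sizes here are illustrative assumptions rather than the tuned values reported in Sect. 4.2:

```python
import torch
import torch.nn as nn

class MultiScaleCNNClassifier(nn.Module):
    """Parallel 1D convolutions with different kernel sizes, max-pooled and
    concatenated, followed by a linear occupation classifier."""

    def __init__(self, d_model=300, n_filters=64, kernel_sizes=(1, 2, 3, 4), n_classes=5):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, n_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, d_model, seq_len)
        pooled = [conv(x).relu().max(dim=-1).values for conv in self.convs]
        features = torch.cat(pooled, dim=-1)    # multi-scale feature vector
        return self.fc(features)                # logits; softmax is applied in the loss
```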

4 Results and Analysis

In this section, a new set of occupation datasets is described. We also provide several baselines and compare them with our proposed model. In addition to accuracy and precision, we also use recall and the F1-score for evaluation. Finally, we present the result analysis and a related ablation study.

4.1 Datasets

In this experiment, we selected five occupations from CBDB [28]: Emperor, Minister of Personnel, Imperial Censor, Assistant to the Chief Local Official and Monk, and established a dataset from these poets' poems. However, some poets first held the occupation of Imperial Censor and later became a Minister of Personnel; because such poems cannot be unambiguously used for occupation profiling, we removed all these records. Table 1 presents the descriptive statistics of the dataset. It can be seen that the corpus is imbalanced. However, Zhou [32] mentioned that the corpus of Quan Tang Shi is also imbalanced: nearly half of the poets in the Tang Dynasty produced only one or two poems, and the average number of poems per poet in Quan Tang Shi is almost 19. For each poem in the dataset, we marked the poet’s age and official career path. Poet occupation profiling can be treated as a classification task, so we annotated each poem with the poet’s occupation and treated it as the classification label. The final files are stored as JSON documents in the following format: Model = [Author’s Occupation, Author, Title, Poem]. An example of a document annotation is as follows: [Author’s Occupation: ‘皇帝’ “Emperor”, Author: ‘武则天’ “Wu Zetian”, Title: ‘如意娘’, Poem: ‘看朱成碧思纷纷, 憔悴支离为忆君。不信比来长下泪, 开箱验取石榴裙。’].
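As a sketch of one such record (the English key names are illustrative assumptions; only the field order follows the schema above):

```python
import json

# One hypothetical record in the dataset, following the
# [Author's Occupation, Author, Title, Poem] schema described above.
record = {
    "occupation": "皇帝",        # Emperor
    "author": "武则天",          # Wu Zetian
    "title": "如意娘",
    "poem": "看朱成碧思纷纷, 憔悴支离为忆君。不信比来长下泪, 开箱验取石榴裙。",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```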

Table 1 Datasets statistics

4.2 Parameter Setting

Since LDA is an unsupervised model, to find the optimal number of topics we built LDA models with different values of the number of topics (k) and picked the one that gives the highest coherence value. Choosing a k that marks the end of a rapid growth in topic coherence usually yields meaningful and interpretable topics. Figure 5 shows how the coherence score changes as the number of topics (k) increases on the different datasets.
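A minimal sketch of this model-selection loop, assuming the poems are already tokenized into a hypothetical texts list (the coherence call follows the UMass sketch in Sect. 3.2):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# `texts` is a hypothetical list of tokenized poems, e.g. [["春", "风", ...], ...]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

scores = {}
for k in range(2, 21):                                   # candidate numbers of topics
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                        coherence='u_mass')
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)                     # highest coherence wins
print(best_k, scores[best_k])
```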

Fig. 5

Coherence scores for different occupations

4.3 Experimental Design

We consider the following common deep learning and machine learning approaches for authorship profiling as baselines:

Naive Bayes: The first machine learning model applied to authorship profiling of poems written by Du Fu and Li Bai, where it achieved impressive results [33].

SVM: Markov [34] proposed that it is the most suitable approach for long-text authorship profiling.

CNN: CNN [35] is the most popular deep learning approach for short text authorship profiling.

BERTAA: Fabien [36] introduced BertAA, a fine-tuned pretrained BERT language model with an additional dense layer and a softmax activation to perform authorship classification.

CNN + LSTM: Sboev [37] performed a comparative study of machine learning techniques for gender and age attribution on Russian-language texts; their CNN + LSTM model demonstrates accuracy close to the state of the art.

BiGRU + Attention: Kodiyan [38] presented their BiGRU + Attention model for the AP Shared Task at PAN 2017 (classifying language variety and gender of a Twitter user based only on their Twitter messages). The dataset contains tweets in four different languages.

BiCTransformer: Das and Paik [39] proposed a novel supervised learning approach based on the transformer network to identify the gender of named entities.

For all deep learning methods, the input embedding dimension is 300. For the CNN model, we use one-dimensional convolutions; the number of filters is 64, the kernel sizes are chosen from {1, 2, 3, 4}, and the number of kernels is chosen from {50, 100, 200, 250, 300}. For the basic Transformer (BERT) and the DKT, the number of heads in the multi-head attention is chosen from {2, 4, 6, 8}, and the dimension of the multi-head attention output is chosen from {16, 32, 64, 128, 256, 512}. For the BERT model, we first extract a 300-dimensional word embedding from the bert-base-chinese model. An early stopping mechanism is adopted, and we use Adam optimization [36] with shuffled mini-batches (batch size 16) to improve the effectiveness of our model. We use a learning rate of \(1{{\text{e}}}^{-8}\). For the best generalization performance, we train for up to 1000 epochs with early stopping for all deep learning methods. In addition, L2 regularization and 25% dropout are employed to avoid overfitting. We use the standard cross-entropy loss for optimization. The average performance over runs is reported as the final result in this research. All experiments are implemented in Python 3.8 and were carried out on a personal server with an Nvidia GeForce RTX 3080 (16 GB).
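A minimal PyTorch sketch of this training setup (model, train_loader and val_loader are hypothetical names; the early-stopping patience and L2 strength are assumed values, and the 25% dropout is assumed to live inside the model):

```python
import torch
import torch.nn as nn

# `model`, `train_loader` and `val_loader` are assumed to be defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-8, weight_decay=1e-4)  # weight_decay = assumed L2 strength
criterion = nn.CrossEntropyLoss()                       # standard cross-entropy loss
best_val, patience, bad_epochs = float("inf"), 10, 0    # assumed early-stopping patience

for epoch in range(1000):
    model.train()
    for poems, labels in train_loader:                  # shuffled mini-batches of size 16
        optimizer.zero_grad()
        loss = criterion(model(poems), labels)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(p), y).item() for p, y in val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                       # early stopping
```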

4.4 Experimental Results and Analysis

4.4.1 Experimental Results

Table 2 demonstrates the results on the aforementioned dataset. Tenfold cross-validation is used to reduce the impact of the stochastic nature of deep learning models. First, the results of the deep learning approaches are better than those of the machine learning ones. Generally speaking, for small datasets, machine learning models usually achieve better performance than deep learning ones; however, machine learning typically relies on feature engineering, and researchers need to select the best features for the proposed model. For our task, classical poems of the Tang Dynasty, the features selected by the Domain Knowledge Component represent only a few features that are easy to extract. Other features, such as rhetoric, rhythm, level and oblique tones, and genres, also have a significant influence on predicting poet occupation; however, these features are more difficult to extract with sufficient accuracy. Therefore, for our task, deep learning models are more suitable. The experimental results show that the three basic deep learning models achieve almost the same performance with respect to all four measures, far behind the proposed model. This is most likely because Chinese classical poetry is very different from modern Chinese; ancient Chinese, such as classical poetry, is not well handled by the BERT-Base Chinese model because this model is pretrained on modern Chinese. The SOTA methods previously applied to identifying natural attributes (gender, age) also achieve almost the same performance with respect to all four measures, and some even perform worse than the CNN model. The main reason is that there is currently a lack of deep learning models for author social attribute recognition, and the existing SOTA methods are mostly used to identify gender, which is a binary classification problem; as the number of categories increases, the accuracy of these models decreases rapidly. Our integrated model achieves the best performance, gaining over a 10% improvement across all measures. The results clearly show the effectiveness of our integrated model.

Table 2 Experimental results (tenfold cross-validation)

Figure 6 presents the performance on the different occupations. As shown in Fig. 6, the Assistant and Monk categories outperform the others. As shown in Table 1, there are more samples in these two categories. Therefore, similar to other authorship attribution datasets, the number of samples per class affects the performance of this task.

Fig. 6

Comparison of the performance of five different occupations

4.4.2 Visualization

As shown in Fig. 6, one phenomenon deserves particular attention: among all five datasets, Emperor, with the fewest samples, obtains satisfactory accuracy scores of almost 95%. Word clouds are adopted to visualize the common word frequency statistics for all five datasets. As shown in Fig. 7, aside from the most common words, namely, ‘Moon’ (‘月’), ‘Sun’ (‘日’), ‘Return’ (‘还’), and ‘Come’ (‘来’), which are also the most common words in Tang poetry overall, there are dramatic differences in the word clouds of poems created by different occupations. Emperors prefer to use ‘Chariot’ (‘辇’), ‘Palace’ (‘阙’), and ‘Mountain’ (‘岫’). Ministers of Personnel prefer to use ‘Open’ (‘开’), ‘Fly’ (‘飞’), and ‘Remain’ (‘馀’). Imperial Censors prefer to use ‘Ending’ (‘尽’), ‘Will’ (‘将’), and ‘Time’ (‘岁’). Official assistants prefer to use ‘Toward’ (‘向’), ‘Deep’ (‘深’), and ‘Distance’ (‘万里’). Monks prefer to use ‘Taoism’ (‘道’), ‘Good’ (‘好’), and ‘Want’ (‘欲’). This finding illustrates that the writing style of each occupation is considerably different.

Fig. 7

Visualization results for five different occupations

4.5 Ablation Study

To demonstrate the effectiveness of each component of the proposed model, the relevant evaluation is made in this subsection.

4.5.1 Language Related Component

To show the influence of the Alphabetization module presented in ‘Language Related Component’, we compare the performance of the Language Related Component with and without this module. Figure 8 demonstrates that the contribution of the Alphabetization module is not very large, but it is helpful for all methods. Alphabetization makes it easier to extract syllable features in poetry, where each syllable has its own tone and vowels dominate, which increases the performance of the system.

Fig. 8

The effect of using the Alphabetization module on the overall performance of the Language Related Component for our task

4.5.2 Domain Knowledge Component

In this part, we implemented ablation experiments to demonstrate the validity of each module in the domain knowledge component. In these experiments, we tested four simplified settings obtained by dropping all the modules of the domain knowledge component (No-Domain Knowledge) and by adding the named entities (Named Entities), the themes (Theme), the official career path (Official Career Path) and the ages (Age) to our model. We also show the performance of their combination (Pool).

Figure 9 suggests that the domain knowledge component is an important part of our model. The performance of Pool gains over an 8% improvement compared with No-Domain Knowledge across all four measures. Similarly, it can be observed that the named entities in poems, contributing an improvement of over 5.6% across all four measures, are relatively important in the domain knowledge component. A certain number of named entities exist in poems, especially famous names, which indicate the age and society of the poets and increase recognition accuracy. We also find that the themes play a relatively important role in our model, increasing accuracy by 1.78%, which illustrates the effectiveness of text style characteristics in our model. Both Age and Official Career Path enhance the results of the proposed model, contributing 4.54% and 1.24% to the accuracy, respectively. Thus, the other tags also play a crucial role in our task.

Fig. 9

The effect of the Domain Related Knowledge Component for our task

4.5.3 Deep Learning Component

To show the validity of the deep learning component, we perform similar evaluations in this subsection. In this experiment, we examine three simplified models obtained by removing BERT, the Transformer and the CNN, respectively. As shown in Table 3, all parts of the model improve the results of Chinese authorship profiling.

Table 3 Effectiveness of Deep Learning Components

It can be observed that BERT enhances the performance of our model, contributing 0.69%, 0.79%, 0.72%, 0.72% and 0.88% of the accuracy on the Monk, Assistant to the Chief Local Official, Imperial Censor, Minister of Personnel and Emperor datasets, respectively. Therefore, the BERT embedding module is an important part of our model.

Transformer can be stacked to a considerable depth, directly calculates the relevance between each pair of words in a poem and fully exploits the writing style of the poems. The results illustrate that without the Transformer component, the average performance of the model decreases by 2.84% in terms of accuracy. Therefore, the Transformer component is a vital part of our hybrid model.

Moreover, as shown in Table 3, the CNN model strengthens the results of the proposed model, increasing the accuracy by 2.69%, 1.41%, 1.71%, 1.36%, and 1.64% on the Monk, Assistant to the Chief Local Official, Imperial Censor, Minister of Personnel and Emperor datasets, respectively. Therefore, the CNN classifier is also a significant part of our model.

5 Conclusions

Inspired by Trudgill and the recent interest in computational sociolinguistics, we propose a novel combined method utilizing the advantages of both traditional domain knowledge and deep learning methods for occupation profiling. To the best of our knowledge, we are the first to use classical poetry from the Tang Dynasty for authorship profiling. The proposed approach consists of three components that address language-related problems, incorporate domain knowledge and exploit the benefits of Transformer. The experimental results illustrate that our hybrid method is effective for poet occupation profiling. For future work, other deep poetry-related features, such as genre, tone and rhyme, will be considered, and more effective representations will be designed to increase identification accuracy.