1 Introduction

1.1 Motivation

Compiling corpora for low-resource languages is a valuable undertaking for preservation, education, knowledge acquisition, and the monitoring of demographic and political processes. This work presents a large machine-readable corpus of Persian literary text suitable for studying a variety of NLP problems in Persian, including lexical semantics, authorship prediction, style classification, and computational approaches to rhetorical figures and metaphor. Further, since the corpus covers the majority of available Persian poems from the ninth to the twenty-first century, it facilitates computational studies that track meaning shifts over time.

Poetry explores the space of imagination beyond linguistic interpretation and pragmatics (Kadkani, 1943; Atashi, 2004), yet brings distinctive insights (Hobbs, 1990). Persian poetry, in particular, has not only played a profound role in shaping Persian culture, politics, and literature, but has also had a pervasive influence on world literature (Mohaqeqi et al., 2014; Tusi, 2013). Even in the United States, Rumi, Khayyam, and Hafez are among the best-selling poets.

1.2 Previous work

While there has been substantial computational research on poetry in English (Lau et al., 2018; Greene et al., 2010; Genzel et al., 2010; Hayward, 1996), Chinese (Zhang et al., 2017; Liu et al., 2019), and German (Baumann et al., 2018), among others, Persian poetry has not been widely studied in computational linguistics. The works that do exist point to the need for more comprehensive and systematic resources. For instance, Asgari and Chappelier (2013) and Asgari et al. (2013) apply topic modeling to Persian poetry, but their collection covers works by 30 poets in only one style, Ghazal. The corpus presented here covers those datasets as well as other styles. Other studies (Malmasi & Dras, 2015; Seraji et al., 2012; Khashabi et al., 2021) focus solely on contemporary Farsi.

1.3 Contributions

We introduce a corpus of Persian literary text that encompasses poems from the ninth to twenty-first centuries as well as the two main collections of myths and stories, Gulistan and Panchatantra. Saadi’s Gulistan includes a mix of poetry and prose. These collections are essential for designing basic models for processing literary text, such as spell checkers and temporal and structural analyses. Table 1 provides essential statistics on the corpus.

In addition, in order to study the symbolism and rhetorical figures in Persian, we annotate 4192 lines of poetry with six rhetorical figures. We explain these figures and their similarities to their English counterparts. We present detailed statistics about the corpus, and, in a series of computational experiments, we introduce a baseline for classifying authors and styles. Finally, we present a study on semantic shifts in different eras of Persian poetry.

While our experiments focus on poetry, some collections consist of a mix of prose and poetry, and keeping only the poetry would lead to incomplete collections. Thus, including prose is useful for completeness and may be beneficial for certain kinds of temporal and structural analyses. Also, we made deliberate choices in selecting the most prominent styles and poets for the classification experiments. Our criteria are based on factors such as popularity, as well as differences in content and intent among the selected poets. By taking this approach, we aim to present a balanced representation that can be applicable to cross-linguistic studies while avoiding excessive information specific to Persian poetry.

1.4 Outline

The rest of the paper is organized as follows. In Sect. 2, we explain the data collection and normalization process used to compile a clean, well-organized corpus of Persian literary text from the ninth to twenty-first century. In Sect. 3, we describe the human annotation process used to endow our corpus with annotations for century, style, author, and a rich set of rhetorical figures. To demonstrate the merits of the corpus, in Sect. 4, we present a series of computational analyses that offer new techniques to investigate literary developments over time and present an open-source library for style classification. Additionally, we describe a series of classification experiments to show the distinction between the authors and different periods.

Table 1 Our corpus coverage

2 Corpus compilation

We had to overcome several challenges to make this corpus possible. The first of these is the limited availability of relevant source texts on the web. To complete some of the collections and annotations, we had to obtain access to a diverse set of resources. Second, cleaning and normalizing online text in Farsi requires correcting for various keyboard layouts, including Arabic. We release Python code with this submission for cleaning and normalizing diverse forms of Persian text,Footnote 1 as most existing libraries are designed only for modern text. Third, annotating literary text is expensive and requires expert annotators. We have annotated part of the corpus with key rhetorical figures such as metaphor with high reliability following carefully designed protocols.

2.1 Crawling

To compile a comprehensive corpus with poems and stories spanning from the ninth to twenty-first centuries, we had to crawl multiple sources and request access to online teaching material to put together pieces and make collections complete. Our corpus was collected by crawling several Persian literary websites.Footnote 2 The released version of our corpus does not include poems from the twenty-first century due to copyright concerns. However, we include modern poetry in our experimental analyses in Sect. 4. We have obtained all required permissions to release the rest of the corpus publicly.

2.2 Linguistic challenges

Persian is an Indo-European language with a comparably rich morphology, conventionally written in Arabic script. The language poses a number of special challenges. Orthographic variability results from the frequent omission of vowel diacritics, alternative encoding of characters, and diverse shapes for affixes. Additional challenges include morphological complexities in the inflectional paradigms for nouns, verbs and adjectives, which involve multiple stems for verbs, and irregularities in nouns and verbs borrowed from Arabic. The language allows free word order, with a default of SOV.

2.3 Normalization and cleaning process

To ensure the quality of the data, we relied on both automated means and manual corrections. An important aspect of this is orthographic normalization. For this task, first we tokenized each line using space and punctuation characters as separators. We developed a tool to normalize Persian text in accordance with the list of undesired forms proposed by the Academy of Persian Language and Literature.Footnote 3 According to these, the use of certain letters and letter combinations, imported from general Arabic and Western usage, is deemed incorrect. We have replaced such instances with the standard form of Persian letters.

Arabic characters that are represented with alternative Unicode codepoints have been replaced with their correct Persian form. Such cases are typically the result of using an Arabic keyboard layout. For instance, Arabic letter kaf (ك, U+0643) is replaced with Arabic letter keheh (ک, U+06A9), and Arabic letter high hamza waw (ٶ, U+0676) is replaced with Arabic letter waw with hamza above (ؤ, U+0624).

Left-to-right and right-to-left control characters are removed, since the script is always right to left. We did not remove the zero width non-joiner (U+200C), since it is commonly used in Persian for inflected adjectives and nouns, as well as for morphological changes in verbs to show tense, aspect, and mood.
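The normalization steps described above can be illustrated with a minimal sketch. It is not the released tool: the names `CHAR_MAP`, `normalize`, and `tokenize` are hypothetical, the mapping contains only the two replacements named in the text, and the punctuation set used for tokenization is an assumption.

```python
import re

# Replacements named in the text; extending this table to further
# characters would be an assumption beyond what the paper lists.
CHAR_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u0676": "\u0624",  # ARABIC LETTER HIGH HAMZA WAW -> WAW WITH HAMZA ABOVE
}
DIRECTIONAL_MARKS = "\u200E\u200F"  # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER, deliberately absent from the table

# Translation table: remap the Arabic variants, delete the directional marks.
_TABLE = {ord(src): dst for src, dst in CHAR_MAP.items()}
_TABLE.update({ord(ch): None for ch in DIRECTIONAL_MARKS})

# Separators: whitespace plus a small, assumed set of Latin and Persian
# punctuation (Arabic comma, semicolon, and question mark).
_SEPARATORS = re.compile(r"[\s.,;:!?\u060C\u061B\u061F]+")


def normalize(text: str) -> str:
    """Map Arabic variants to Persian forms and drop directional marks."""
    return text.translate(_TABLE)


def tokenize(line: str) -> list[str]:
    """Normalize a line, then split it on whitespace and punctuation."""
    return [tok for tok in _SEPARATORS.split(normalize(line)) if tok]
```

Because the zero width non-joiner is neither whitespace nor part of the translation table, a token such as a verb with a ZWNJ-attached prefix survives both functions as a single unit.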

We are distributing the normalized version of the data (Raji et al., 2023). While normalization tools for Farsi are readily accessible, our approach prioritizes ensuring a high level of quality of the data instead of relying on such tools. Considering the limitations of most spell-checkers that are tailored to contemporary Farsi text, these tools may not adequately handle older prose or poetry.

Moreover, while the large size of the corpus makes manual proofreading prohibitively costly, we have corrected instances of misspellings as we came across them while working with the data. After cleaning, the vocabulary size is reduced by around 3000.

3 Corpus annotation

The corpus includes the poet and century as metadata. We annotated each poem based on an inventory of styles and temporal periods. Additionally, we partially annotated the corpus with six rhetorical figures commonly used in Persian poetry.

3.1 Style, author, and temporal annotations

3.1.1 Background on Persian poetry styles

3.1.1.1 Classical style

Classical Persian poetry is classified into conventional styles in part based on different metrical and structural features. The prosody of classical Persian verse is based on the line, called a bayt, which consists of half lines that are metrically identical (Tousi, 1974; Tabatabai, 2001; Shamisa, 1999; Perry, 2011). Figures 2 and 3 from Sect. 3 show examples of two lines of a poem. The half-lines are usually written side by side.

Different styles of classic poetry are usually distinguishable by examining the rhymes of the words in the half-lines. Some of the styles have a similar pattern of rhymes, in which case the content and topic of the poem may be used to classify them.

Figure 1 illustrates the differences and similarities in the four most prevalent styles in Persian poetry. Note that Persian has a right-to-left script. The letters at the end of each half-line indicate the position of rhyming words or phrases, which signal different styles. As shown, in Masnavi each pair of half-lines has its own rhyme, while in Ghazal the rhymes occur at the end of each line. Hence, these two styles can be distinguished just by the positions of their rhymes. In contrast, the position of the rhyming words is the same in ghazal and qaside. However, a qaside could be used by religious poets as an educational poem, whereas ghazals (literally: love-songs) are much shorter poems, adopted by mystic poets and Sufis as a medium for the expression of love for the divine. From the fourteenth century CE, Persian poets became more interested in ghazals, and the qaside form declined.

Fig. 1 Structure of rhyming words in different classical styles. The letters indicate the position of the rhyming words

3.1.1.2 Modern poetry

Poetry remained a prominent form of literature in Iran through the twenty-first century. The modern style does not possess the features of the traditional styles and has different topics, content, and goals. Poems are not confined to the two half-lines format and the rhetorical figures, themes, metrics, and prosody are different, making this style easily distinguishable from classical poetry.

3.1.2 Annotation process

We manually labeled the collections with their styles. For this annotation, each collection was labelled by two annotators. The annotators and adjudicators are all adult native Farsi speakers and expert linguists who have completed at least two undergraduate classes in Persian literature and poetry.

Most of the collections have different chapters with different styles and the most prominent collections were annotated with corresponding styles. While our paper does not delve into the intricacies of Persian poetic meters that define fine-grained styles and genres, we have focused on "Masnavi", "Ghazal", "Qaside", and "Modern" styles for our annotations. The inter-rater agreement for style is almost perfect (\(\kappa =0.96\)). We study this data in further detail in Sect. 4.

Table 2 Distribution of number of poets, collections, and poems over time

We also included the name of the poet and centuries following classifications proposed in the humanities (Safa, 1993; Browne, 1999; Rypka, 2013). This information was taken from the metadata available in the original sources. Table 2 shows the distribution of poets across different centuries.

3.2 Rhetorical figure annotations

Persian poetry uses rich symbolism (Seyed-Gohrab, 2011), and from the very beginning has extensively relied on a diverse inventory of rhetorical figures (Seyed-Gohrab, 2011; Arberry, 2008; Lewis, 2014; Meisami, 2014). Traditional figures of Persian rhetoric, when being analyzed in their stylistic function and expressive potential, are useful tools for describing the poet’s style and imagery. While these devices and techniques are studied in several domains (Tom & Eves, 2012; Bush, 2012; Fengjie et al., 2016; García et al., 2018) in other languages, this is the first large resource available to study them in Persian.

We chose the collection of ghazals by Hafez, which consists of 4192 lines of poetry. We selected this collection because of its rich symbolism, because of the important role of Hafez in Persian literature and culture, and because the poems are among the most complex ghazals yet also the most popular in Iran, Afghanistan, and Tajikistan. The analysis in Asgari et al. (2013) is on the same poetry style.

3.2.1 Background on rhetorical figures

There are many rhetorical figures in Persian. Here, we briefly explain the six rhetorical figures used in this study. The decision to use these figures is based on their frequency in Persian poetry and their similarity to figures used in English. Note, however, that the English counterparts may have some differences with the forms as used in Persian.

3.2.1.1 Iham

The term Iham literally means creating doubt, or making one suppose. It refers to a deliberate use of lexical ambiguity, whereby the poet employs a word with two different meanings and arranges the context surrounding it in a way that one meaning is more immediate and the other remote, yet both can make sense.

As an example, the Persian word for “curtain” refers to two concepts: (1) the penetralia, the most intimate part of the house, and (2) two semitones in traditional Persian music. The figure is constructed so that the direct meaning comes to mind first, while the remote meaning is the one actually intended.

3.2.1.2 Majaz (metonymy)

Majaz refers to the use of a word where only a related meaning is intended. For example, a part may stand for the whole, as when the word for “head” is used instead of “person”.

3.2.1.3 Esteara (metaphor)

Esteara refers to a form of metaphor based on similarities with the intended subject, where the subject itself is removed. For example, the word for “lion” in certain contexts means a brave warrior, and the word for “fire” can metaphorically describe the sorrow of losing the lover. In example (1), musk deer is a metaphor for the poet’s lover.

3.2.1.4 Kenaya

This refers to a particular kind of esteara, which we annotate separately due to its particular prominence, including in everyday language. A kenaya is a form of allusion employed when being indirect is deemed more polite, appropriate, or preferable for other reasons. In example (1) earlier, Khotan, the name of an ancient city near Kashgar, China, is used as a kenaya for the hometown of the poet. In many cases, kenayas are phrases. Example (2) below means “having intentions to do something”. Literal meanings of Examples (3) and (4) are to “draw on the water” and “to give to the wind”. As a kenaya, they mean “to do something in vain” and “to waste”, respectively.

3.2.1.5 Tashbih (simile)

In most cases, similes in Persian poetry come with an explicit marker, such as “x like y” or “x as y”. As in English, when the marker is removed, tashbih is very similar to esteara (metaphor). The most frequent unigrams used with this rhetorical figure in Hafez’s poetry are the words for “hair”, “love”, and “heart”.

3.2.1.6 Jenas

This device involves the use of words that are (or appear to be) derived from a common root, as in example (6), with the words pronounced nāz and nyāz. There are two kinds of jenas: (a) when the two words are homonyms, and (b) when they are very similar in writing or pronunciation.

The examples below include pairs of words in a line that are similar in spelling or pronunciation. These words can appear in any order and anywhere within a line. Incidentally, in example (6), the two words used together (“goblet of the king”) form a metaphor for the universe.

A complete list of identified jenas is released along with the corpus. Figure 2 shows an example of jenas where two homonyms are used with different meanings.

Fig. 2 Example of a line with jenas rhetorical figure (1) in the Hafez ghazal collection. The highlighted words are homonyms, the first being a noun meaning “remedy” and the second a verb meaning “distress”

3.2.2 Annotation process

For these annotations, the words or phrases corresponding to the rhetorical figures in each line of poem are labelled by two annotators.Footnote 4 The annotators and adjudicators underwent a substantial period of training on the relevant linguistic devices before conducting the annotation. In our annotation protocol, we ask the annotators to label the six rhetorical figures described above. We prepared an annotation platform for the annotators where they could choose among the defined rhetorical devices and explain their observations. Each line of a poem was annotated with up to six rhetorical figures. Figure 3 shows three rhetorical figures used in one line of poetry. In general, spans of text are annotated, and some figures require marking multiple spans.

Some of these figures are purely morphological (e.g., jenas), or involve words and phrases that do not have a second meaning (e.g., tashbih). For the others, the annotation also provides the literal meaning of each word or phrase when needed.

Fig. 3 An example of a line (two half lines) with kenaya (1), esteara (2) and tashbih (3) rhetorical figures in the Hafez ghazal collection

For tashbih, the annotators marked the word that is described by another word in the poem. For example, in Fig. 3, “belonging” is described as being similar to “a paint” that covers whatever is underneath it. This relationship is labelled as (belonging:color) in the annotations. For jenas, the annotators provide the pair of words in the poem that create the rhetorical figure. The poem in Fig. 2 is annotated as (remedy–distress) for jenas. The words creating the rhetorical figure are highlighted in both poems.

Iham, kenaya, esteara, and majaz are more involved. For each word or phrase, the annotators provide the concept or concepts described by that word or phrase. Since the concept is not in the poem, it can be described in different ways. Hence, an adjudication process was used to select a consistent description after the initial annotation process. For iham, which is inherently ambiguous, the annotation does not indicate which meaning may be the intended one.

3.2.3 Annotation results

We chose the collection of ghazals by Hafez and present annotations for 4192 lines of poetry. The inter-annotator agreement was measured at the line level across a test set of 500 lines and resulted in a Cohen’s \(\kappa \) score of 0.78 (average across the rhetorical figures), which indicates strong agreement between annotators. Table 3 shows what fraction of lines contain each of the rhetorical figures. We observe that kenaya is particularly common in the annotated data.
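As an illustration of how a line-level agreement score of this kind can be computed, the sketch below implements Cohen’s \(\kappa \) for two annotators and averages it over rhetorical figures. The data layout (one list of 0/1 labels per figure, one entry per line) and the function names are assumptions made for the example, not the authors’ evaluation script.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_kappa(annotator_a, annotator_b):
    """Average kappa over figures; inputs map figure name -> per-line 0/1 labels."""
    scores = [cohen_kappa(annotator_a[f], annotator_b[f]) for f in annotator_a]
    return sum(scores) / len(scores)
```

With toy labels `[1, 1, 0, 0]` and `[1, 0, 0, 0]`, observed agreement is 0.75 and chance agreement is 0.5, giving \(\kappa = 0.5\).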

Table 3 Distribution of different rhetorical figures in Hafez poetry

4 Experiments and analysis

In this section, we study how NLP techniques can be used automatically for informed explorations of Persian text at a variety of different analytical levels. We also provide detailed information about the distribution of texts over time, word counts, the average length of lines in poems in classic and modern texts, and more. However, we will focus our analysis on Persian poetry in this paper.

4.1 Style classification

In this section, we present several baselines for style prediction in Persian poetry.

4.1.1 Rule-based approach

We have implemented an open-source tool that recognizes classical styles of Persian poetry, except for ghazal and qaside, using a rule-based algorithm based on formal features. As shown in Fig. 1, ghazal and qaside are the only two classical styles that share the same rhyme positions and therefore cannot be distinguished using rules. The algorithm looks at the positions of rhyming words and attempts to predict the style of the poem using simple rules.
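A minimal sketch of such a rule, under strong simplifying assumptions, is given below. It uses the last few characters of a half-line as a crude stand-in for the rhyme, covers only the masnavi versus ghazal/qaside distinction described in Sect. 3.1.1, and uses hypothetical names throughout; it is not the released tool.

```python
def rhyme_key(half_line, n=2):
    """Crude rhyme proxy (an assumption): the last n characters of a half-line."""
    return half_line.strip()[-n:]


def classify_by_rhyme(lines, n=2):
    """Guess a style from rhyme positions.

    `lines` is a list of (first_half, second_half) tuples, one per line.
    """
    if len(lines) < 2:
        return "unknown"
    # Ghazal and qaside: one rhyme shared by the end of every line.
    end_keys = {rhyme_key(second, n) for _, second in lines}
    if len(end_keys) == 1:
        return "ghazal_or_qaside"
    # Masnavi: the two half-lines of each line rhyme with each other,
    # and the rhyme changes from line to line.
    if all(rhyme_key(first, n) == rhyme_key(second, n) for first, second in lines):
        return "masnavi"
    return "unknown"
```

The `ghazal_or_qaside` outcome is intentionally left unresolved, since the text states that these two styles cannot be told apart by rhyme position alone.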

4.1.2 Supervised learning

To distinguish between ghazal and qaside, we train a supervised model. We compiled a dataset of 1100 poems, 595 of which are ghazals, based on the most notable poets in each style.Footnote 5 We established a train–test split at an 80–20% ratio. As models, we consider a convolutional neural network (CNN) (Kim, 2014) as well as a linear SVM model. The CNN model consists of a convolutional layer and a fully-connected layer to predict the label. A dropout rate of 0.5 is applied to the convolutional layer. The CNN model obtains an accuracy of 74%, while the SVM sentence classifier obtains an accuracy of 95%. The lower accuracy of the former likely stems from the small size of the training data.

4.1.3 Modern style classification

The modern style merits special consideration. The rhetorical figures, themes, metrics and prosody are different, making this style easily distinguishable from classical poetry. We also observed a considerable difference in the length of poems compared to classical styles. However, the difference between half-lines and lines is not as obvious as in classical Persian poems.

We found that poems of the twenty-first century can easily be distinguished from others by the two models. The same SVM model with bag-of-words features and the CNN model with word2vec input both attain an accuracy of 89% at distinguishing modern poetry from classic poetry. The word2vec embedding model was trained on our corpus.

4.2 Poet and century classification

In what follows, we describe baseline experiments for predicting the century and the poet of poems from different historic periods.

4.2.1 Models

To further study the differences and commonalities between poems in different centuries and the style of authors, we ran a linear SVM model with bag-of-words features using a train–test split of 85–15%. A second model, a CNN, uses word2vec (trained on our corpus) as input, and consists of two parallel CNN layers with 50 filters each and kernel sizes of 4 and 10, a max-pooling layer, and two dense layers for predicting the labels. As in our previous model, a dropout rate of 0.5 is applied to the hidden layer.

4.2.2 Results

The results in Table 4 show that using different inputs and the unbalanced size of the test sets for each class significantly affect the CNN model. A t-test indicates statistically significant results with \(p<0.05\) and \(t< -70.8\).

The confusion matrix for our temporal classification in Fig. 4 reveals the similarity of poems in the tenth to thirteenth centuries, as well as in the eighteenth to twentieth centuries.

Table 4 \(F_1\)-scores for authorship classification for well-known poets
Fig. 4 The confusion matrix for century classification. The results are largely better for the sixteenth, seventeenth, and twenty-first centuries because cleaner data is available, whereas the results for some time periods are weaker due to the small size of the dataset or similarity between the author styles

Table 5 Top 10 words with highest drifts in meaning over time

4.3 Tracking changes in context

In addition to the usage of the words as rhetorical figures, another meaningful study is to assess context shifts of words.

4.3.1 Algorithm

To observe how the contexts of words have changed over time, we applied dynamic Bernoulli embedding for language evolution (Rudolph & Blei, 2018) on our data. The method was originally devised to study language evolution and meaning shifts. We adapt the method to instead study changes in the contexts of words over time. Other methods that study context changes (Mihalcea & Nastase, 2012; Hamilton et al., 2016) propose algorithms for aligning embeddings that are trained separately on data in each time slice. However, such algorithms are highly sensitive to the size of the data that is used in each slice. Since the time slices in our corpus are very heterogeneous in size, such algorithms are not good candidates for our analysis.

Fig. 5 Example of a line (two half lines) of a poem by Hafez with wine referring to the alcohol prohibition in the Islamic era. Wine here is used as a symbol of all of the prohibited activities

4.3.2 Results

As we can observe in Table 5, wine, beware, message, king, and mirror exhibit the highest drifts. The given numbers represent the absolute total drift of the word vector, assessed in terms of the Euclidean distance between each word’s embeddings in the first and the final time slices (Rudolph & Blei, 2018).
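The drift measure described above can be sketched as follows, assuming the dynamic embeddings have already been trained and are available as one vector per time slice for each word. The container layout and the function names are hypothetical, and the toy vectors used to exercise the code are placeholders rather than values from the corpus.

```python
import math


def total_drift(trajectory):
    """Euclidean distance between a word's first and final time-slice vectors.

    `trajectory` is a sequence of equal-length vectors, ordered by time slice.
    """
    return math.dist(trajectory[0], trajectory[-1])


def top_drifting(embeddings, k=10):
    """Rank words by total drift; `embeddings` maps word -> trajectory."""
    scores = {word: total_drift(traj) for word, traj in embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

Intermediate time slices do not enter this score; only the endpoints of each trajectory are compared, which matches the description of the drift as a distance between the first and final slices.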

The change of context for some of these words, such as wine and king, in the history of Persian poetry has been the subject of previous studies (Kadkani, 1943; Sharifnasab, 2004; Shabestary, 2008). The results of our analysis accord well with the explanations and observations made in these works. For instance, neighboring words of wine have changed from lover, beloved, dance, and happy in Sufi poetry to martyr, blood, country, and war. One sub-cluster of this includes words such as forbidden and affectation that refer to alcohol prohibition in the Islamic era. Figure 5 presents an example of this case.

This has also been discussed by poetry critics (Pourjavadi, 2008). The context of king has changed since the tenth and eleventh centuries (mostly influenced by Ferdowsi, Hafez, and Rumi), from land, war, and horse to gift, heart, love, and poverty. The complete list of words, together with the results of the analysis and the code, is provided with this submission.Footnote 6

5 Conclusion

Preservation, revitalization, and documentation purposes call for the availability of computational resources and methodologies for low-resource languages. We take a step in this direction by furthering research not just for modern Farsi, but also middle and old Persian, by introducing a large standardized and machine-readable corpus of Persian literary text that is annotated for century and style. We have additionally annotated Hafez’s ghazals with critical rhetorical figures such as metaphor. Our computational experiments provide insights into how Persian poetic language has evolved. Additionally, our investigations suggest the effectiveness of supervised and unsupervised techniques in studying poems and poetic styles.

Low-resource languages expand the range of languages traditionally studied in computational linguistics and often serve as a test-bed for validating current methods and techniques. Our resource can contribute to research on metaphor, lexical semantics, text generation, and entailment in addition to cross-linguistic studies. Although these topics have been studied on a number of poems over the years in linguistics and Persian literature departments, such studies have never had the tools and resources to consider such a wide-coverage corpus while taking advantage of NLP techniques.