Song authorship attribution: a lyrics and rhyme based approach

In this work, we apply authorship attribution to a large-scale corpus of song lyrics. As a sub-category of poetry, song lyrics embody cultural elements as well as stylistic attributes that are not present in prose. We draw attention to special characteristics such as repetitive sound patterns and rhyme based structures in lyrics that can be key to ownership, and that present opportunities not available for authorship attribution of other types of text such as tweets, emails, and blog posts. We first create a new balanced, large-scale data set of 12,000 song lyrics from 120 different artists. We propose CNN models for authorship attribution on this song lyric data set, in order to use structural information included in the lyrics, similarly to image classification. We conduct experiments at the character and sub-word levels that mostly reflect positional information. In addition, we use phoneme level features, which intrinsically involve attributes such as repetitions, rhyme, and meter, and represent elements unique to verse-based textual compositions. We attempt to discover idiosyncratic features and consequently author and genre associations by working with variants of CNN architectures that have been successfully used in other text classification domains. Our architecture choice results in a particular focus on lyric attributes residing in neighboring regions, since CNNs do not capture long-term textual dependencies. Finally, we empirically evaluate our results in comparison with the findings of previous text classification research from different domains.


Introduction
Music classification tasks such as genre, mood and emotion recognition, as well as artist attribution, are historically addressed using audio features in either traditional machine learning or deep learning models (Eghbal-Zadeh et al., 2015; Zhang et al., 2016a). Lyrics, as an integral part of many musical compositions, can play a promising role in music classification since they are widely accessible and easier to process and store in comparison to audio signals. Lyrics provide stylistic and structural attributes such as rhymes and repetitions that mimic or accompany audio related properties, along with semantic information unique to textual data. It has also been shown that models including textual features can have a boosting effect over models based purely on audio (Mayer et al., 2008).
Song lyrics also pose special challenges for authorship attribution approaches, because they are relatively short and repetitive, often consisting of just a few distinct words. Many stylistic properties of lyrics, such as rhyme schemas, verse structure, etc., are not usually considered in authorship analyses, which focus on linear text structure. Thus song lyrics have rarely been addressed within authorship attribution.
This work tackles both authorship attribution and genre classification on a new dataset of 12,000 song lyrics belonging to 120 different artists and 12 distinct genre labels. We employ a variant of a simple CNN architecture that has been used in text classification and authorship attribution tasks for short texts. Apart from character and sub-word level embeddings that indicate positional and lexical information, we focus on phoneme-level embeddings, in order to capture repetitive sound patterns and rhyme based structures that stand out as some of the most idiosyncratic elements in lyrical texts. These special characteristics of lyrics can be key to ownership, and present opportunities not available for authorship attribution of other types of text such as tweets, emails, and blog posts.
One limiting factor of this work is that in music, the actual authorship of lyrics is very hard to trace, since recording artists often compose lyrics collaboratively or simply acquire the lyrics from song writers. By adopting our authorship attribution approach, we make several critical assumptions: (1) that singers/performers possess certain core lyrical characteristics either when producing their own lyrics or when selecting lyrics that suit their style, and (2) that they preserve these characteristics over time and that their style changes only to a limited degree. Regarding (1), we consider the singer of a song as the author of its lyrics for the purposes of authorship classification. As for (2), we do not specifically address stylistic changes over time, clustering all songs by one singer together as that author's work.
To our knowledge, this paper represents the first attempt at large-scale song lyrics authorship attribution, achieving an overall accuracy score of 23.6% with an individually trained model and 29.8% with a model combination over different embedding types. We further use the data for music genre classification, yielding an accuracy of over 47% for 10 common genres. Despite certain challenges presented by lyrics such as use of uncommon words and high variation in text length, we thus achieve promising results on author attribution over a large number of classes. Finally, we implement occlusion graphs in order to show the contribution of the analyzed lyrics features. Occlusion graphs allow us to visually detect the text components that our model variants take advantage of while generating predictions, by evaluating predictions made when certain parts of the input are occluded. We use these graphs for detailed error analysis on song authorship attribution in our corpus. They show that the authorship attribution models make crucial use of positional information of parts of the input, which we take as evidence in favor of the CNN approach.

Related work
Authorship attribution (AA) is an important field of research in computational text analysis. Authorship attribution has been a part of the PAN-CLEF shared tasks consecutively in 2018 and 2019, within the scope of cross-fandom authorship attribution. As described by Kestemont et al. (2018), the shared task of cross-fandom AA presents a collection of fan fiction writings from different domains and aims to detect authors for unseen documents. Although the cross-domain property of the shared task relates to our AA implementation on cross-genre lyrics, the nature of lyrics is quite different from other types of text, calling for unique approaches that should take stylistic properties of lyrics into account. This necessity is grounded in the work by Custódio and Paraboni (2019), who apply a multinomial logistic regression model to a compilation of Portuguese and English lyrics separately, and show that their model performs much better than the 2018 PAN-CLEF shared task baseline model. Their results, particularly on English lyrics, show that an ensemble approach combining words, characters and rare symbols performs much better on literary domains as opposed to lyrics, for which another level of text representation is required to improve over a character-based method. In regard to the effects of linguistic aspects in any AA task, Sundararajan and Woodard (2018) emphasize that on top of the contribution of purely syntactic elements, non-lexical components such as characters and non-topic words play an important role especially in cross-domain attribution tasks, as such components concern author-specific style rather than domain-specific topics.

Text classification in the music domain
Traditional text-based music classification approaches center around statistical methods such as bag-of-words indexing and term frequency analysis, in particular before the introduction of neural network based methods in NLP in 2013 and 2014. Linguistic features from part of speech tags to rhyme patterns have been used in diverse music classification tasks, such as genre classification (Mayer et al., 2008; Fell & Sporleder, 2014), best/worst song identification and publication date identification (Fell & Sporleder, 2014), as well as discovery of artist similarity (Logan et al., 2004). These studies do not directly attempt authorship attribution and rely on standard machine learning techniques.

Text classification with CNNs
The idea of making use of convolutional neural networks (CNNs), either exclusively or in combination with other neural architectures, stems from the success of earlier neural text classification approaches. Kim (2014) successfully incorporated CNNs to achieve state-of-the-art results in four different sentence-level text classification tasks such as sentiment analysis and question classification. Soon after, Zhang et al. (2016b) showed that character-level embeddings can outperform approaches with other neural networks and traditional classification methods in text classification accuracy. Ruder et al. (2016) combine the architectural simplicity of Kim (2014) with the benefits of character embeddings shown by Zhang et al. (2016b) for competitive results on authorship attribution across five different data sets such as tweets, movie reviews and Reddit gaming posts. Among their data collections with at least 50 different author labels, their best results are on the Twitter data set without preprocessing; the authors suspect that clues such as mentions and hashtags, along with relatively low variability in text size, contribute to better F1 scores. In contrast, their worst-performing data set is the Reddit posts, negatively influenced by differing text lengths and domain-specific styles. With regard to the model architecture, the variety of data and the number of authors per data set, this research offers a reasonable benchmark for our experiments, particularly in the absence of lyrics-based AA work at a similar scale in the literature.
As another effective CNN-based authorship attribution approach, character n-grams have been employed by Shrestha et al. (2017) for classifying tweets taken from the data set created in a different micro-message AA study by Schwartz et al. (2013). This Twitter data set is reported to exclude mentions as superficial clues, but does not provide information about the handling of hashtags. The inclusion of saliency filters to capture contributing character patterns in the work of Shrestha et al. (2017) is also of great importance to our study for non-accuracy-based evaluation purposes.

Deep learning approaches in music classification
The success of neural networks on text classification has been recognized in the music classification literature addressing tasks other than authorship attribution. Fell et al. (2018) utilize CNNs to detect and segment repetitive blocks (e.g., chorus, verse) within song lyrics. CNNs have also been combined with LSTMs to attribute authorship by lexical, syntactic and structural analysis (Jafariakinabad and Hua, 2019), and in a genre classification experiment over a 117-genre and 20-genre label data set using lyrics alone (Tsaptsinos, 2017). To the best of our knowledge, the research proposed here is the first to make use of convolutions for the task of song authorship attribution based on lyrics only.


Data
We build our new corpus of song lyrics based on the Wasabi Project data set (Fell et al., 2019). The complete data set 1 , consisting of 2.1M song metadata entries at the time of retrieval, was preprocessed in two consecutive tiers to obtain the data used by our models. The first level of preprocessing involved selecting attributes such as artist and album info, and removing audio features. Lyrics were retrieved from lyricwikia 2 , an online wiki-based lyrics database. Alongside the actual lyrics, structural attributes such as the number of lines or the number of tokens in each line were also calculated and appended. We deleted all non-English lyrics according to the langdetect tool 3 , duplicate lyrics (resulting from cover songs), incomplete sections resulting from missing data, and songs without lyrics. Finally, we assigned consistent genre labels based on the available ones: the preprocessed data set lacked genre labels for almost half of the entries, and the set of genre labels in the remaining entries had more than 450 elements. For entries with missing labels, we identified the most common genre label for each artist and assigned that label to all their unknown genre fields. To reduce the overall number of genres, after a detailed survey of the genre classification literature, we decided to map all genre labels to 14 main (parent) categories: 'Hip Hop', 'Reggae', 'Gospel&Religious', 'Pop', 'Jazz', 'Heavy Metal', 'Punk', 'Blues', 'Rock', 'Country', 'Folk', '(Electronic) Dance', 'R&B' and 'Rest'. A detailed account of this process can be found in the genre mapping appendix in the public project repository 4 .
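The label imputation step described above can be sketched as follows; the record format (dictionaries with 'artist' and 'genre' keys, with None for unknown genres) is a hypothetical stand-in, not the actual data schema used in the project:

```python
from collections import Counter

def impute_genres(songs):
    """Fill missing genre labels with each artist's most common known
    genre. Songs are dicts with 'artist' and 'genre' keys (hypothetical
    format); a genre of None marks an unknown label."""
    by_artist = {}
    for s in songs:
        if s["genre"] is not None:
            by_artist.setdefault(s["artist"], Counter())[s["genre"]] += 1
    # majority genre per artist, derived from the known labels only
    majority = {a: c.most_common(1)[0][0] for a, c in by_artist.items()}
    for s in songs:
        if s["genre"] is None:
            s["genre"] = majority.get(s["artist"])  # stays None if no known genre
    return songs
```

Artists with no known genre at all keep the unknown marker, mirroring the entries that could not be labeled in the original pipeline.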
In the second phase of preprocessing, we aimed to narrow down the collection to obtain: (1) a balanced corpus in terms of both the artist and the genre label distribution, and (2) as many texts as possible for each artist label. These reduction criteria led to the dismissal of samples from two genre labels, 'Rest' and 'Reggae', as the former contained lyrics with inconsistent style from many hybrid or rare genre labels, and the latter could not provide a sufficient number of texts. For each of the remaining 12 parent genre labels, we selected a final collection of 10 artists, where each artist contributed 100 distinct song lyrics. The process of picking songs out of 100+ lyric candidates for each artist was carried out with a basic cosine similarity algorithm that prioritizes lyrics with a sufficient level of dissimilarity from the already chosen instances. This selection step additionally excludes all cover songs from the data set as cases of debated authorship. Table 1 lists each step and the resulting number of song lyrics in both phases of preprocessing. The final corpus of song lyrics with author and genre metadata is freely available at the project repository 5 .
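The dissimilarity-driven selection can be sketched as a greedy filter; the bag-of-words representation and the similarity threshold below are illustrative assumptions, not the exact settings used for the corpus:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_dissimilar(lyrics, k, threshold=0.8):
    """Greedily keep lyrics whose cosine similarity to every
    already-selected text stays below `threshold`, until k songs
    are collected. Near-duplicates (e.g. covers) are filtered out."""
    selected, bags = [], []
    for text in lyrics:
        bag = Counter(text.lower().split())
        if all(cosine(bag, prev) < threshold for prev in bags):
            selected.append(text)
            bags.append(bag)
        if len(selected) == k:
            break
    return selected
```

Because identical texts have cosine similarity 1.0, this filter drops exact duplicates as a side effect, which matches the exclusion of cover songs mentioned above.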
The design of our data set preparation is based on several inter-related elements. As opposed to the input types for the neural text classification and authorship attribution models presented in previous work (Kim, 2014; Ruder et al., 2016; Shrestha et al., 2017), song lyrics as a text type involve greater character, word and line level variability per author and genre, as depicted in Fig. 1. As a pertinent property of lyrics, it has been reported in the domain of genre classification that both for human and for machine classifiers, genre labels such as 'Folk' and 'Reggae' are slightly more difficult to classify than 'Rap', which is generally the easiest label to target (Fell & Sporleder, 2014). An imbalanced data set that favors such easily classified genres would therefore artificially facilitate the classification task. We chose to retain as many different genres and authors as possible in order to represent the variability that is inherent in music and lyrics, even though this makes the attribution task more challenging. On the other hand, only the most prolific musicians produce more than 100 lyrics in their lifetimes, making it harder to provide sufficient examples for model training, as opposed to other text domains such as tweets and emails. The diversity in the number of authors and the scarcity of samples per author therefore make our data set particularly challenging not only for lyrics classification but also for text classification in general.

Table 1 Preprocessing steps and the resulting number of song lyrics: ∼850k after the first phase of preprocessing; assignment of parent genres and removal of lyrics from genres with an insufficient number of songs, ∼570k; selection of artists that have at least 100 song lyrics, ∼125k; random selection of 10 artists from each genre & 100 songs from each artist, 12k.

Experiments
Our approach follows the simple CNN architecture introduced by Kim (2014). In order to address linguistic style that distinguishes verse from prose and accounts for the rhythmical essence of song lyrics, we propose using embeddings of sub-word and phoneme components in addition to the character-level embeddings which produce the best results in general neural text classification applications. The phoneme component in particular is hypothesized to capture regularities in the song lyrics such as rhyme schemes. Our model variants, apart from hyper-parameter configuration, follow the same layout proposed by Kim (2014): the padded one-dimensional input text is turned into a collection of embedding vectors with respect to a selected feature type such as characters, sub-words, or phonemes; these vectors are then processed with successive convolutional and pooling layers that include varying kernel sizes and numbers of filters; the pooled outputs of parallel convolutional layers are concatenated to form a single consolidated strand that contains the summary of the different kernel operations; and eventually this strand is connected to a dense layer with a softmax function, after dropout is applied for regularization purposes. The model architecture framework is visualized in Fig. 2.
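The layout described above can be sketched as a single forward pass. The following is a minimal NumPy sketch with made-up dimensions (50-symbol vocabulary, 8-dimensional embeddings, kernel widths 3/4/5 with 16 filters each); dropout and training are omitted, and the configuration is illustrative rather than the one used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def kim_cnn_forward(ids, emb, kernels, dense_w):
    """Forward pass of a Kim-style single-channel CNN: embedding lookup,
    parallel convolutions with different kernel widths, global max
    pooling, concatenation, then a dense layer with softmax."""
    x = emb[ids]                                  # (seq_len, emb_dim)
    pooled = []
    for k in kernels:                             # k: (width, emb_dim, n_filters)
        width, emb_dim, n_filters = k.shape
        windows = np.stack([x[i:i + width].ravel()
                            for i in range(len(x) - width + 1)])
        feats = np.maximum(windows @ k.reshape(width * emb_dim, n_filters), 0.0)
        pooled.append(feats.max(axis=0))          # global max pool per filter
    h = np.concatenate(pooled)                    # consolidated strand
    logits = h @ dense_w
    e = np.exp(logits - logits.max())             # numerically stable softmax
    return e / e.sum()

# Hypothetical dimensions for illustration; 120 author labels as in the corpus.
vocab, emb_dim, n_filters, n_authors = 50, 8, 16, 120
emb = rng.normal(size=(vocab, emb_dim))
kernels = [rng.normal(size=(w, emb_dim, n_filters)) for w in (3, 4, 5)]
dense_w = rng.normal(size=(3 * n_filters, n_authors))
probs = kim_cnn_forward(rng.integers(0, vocab, size=30), emb, kernels, dense_w)
```

The three parallel kernel widths act as n-gram detectors over the embedded sequence, which is why neighboring regions of the lyric dominate the learned features.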
We build three model variants, each using a different type of feature: (i) self-generated dense vectors of character embeddings obtained directly from GloVe word embeddings (Pennington et al., 2014); (ii) pre-trained sub-word level embeddings based on Byte-Pair Encoding (Heinzerling and Strube, 2018); (iii) phoneme encodings. Phonemes are the basic sound units a language system provides; a phonemic representation of a word therefore captures its pronunciation. Pronunciation features have been known to aid NLP tasks for poetry (Colton et al., 2012; Tobing & Manurung, 2015), and may help in classifying rhyme schemas and repetitions in song lyrics. Unlike word, sub-word and character embeddings, pre-trained pronunciation embeddings are not available in the literature. We mapped tokens in the data to their phoneme representations with the help of the CMU dictionary 6 , which contains more than 130k unique English word pronunciations. For words with unknown pronunciations, we deployed another CMU tool 7 , while trivial preprocessing steps such as hyphen removal and restoration of omitted letters (e.g. "livin'" -> "living") were taken to further decrease the number of untranslatable words. With the unknown translations reduced to less than 0.1% of all texts, the complete lyric collection is represented by 88 phoneme symbols, including one for unknowns.
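A minimal sketch of this phoneme mapping follows; the two-entry dictionary below is a hypothetical stand-in for the 130k+ entry CMU dictionary, and the normalization covers only the elision example mentioned above:

```python
# Hypothetical mini pronunciation dictionary standing in for the CMU
# dictionary; values are lists of ARPAbet phonemes with stress digits.
MINI_CMU = {
    "living": ["L", "IH1", "V", "IH0", "NG"],
    "alright": ["AO0", "L", "R", "AY1", "T"],
}

def normalize(token):
    """Lowercase, drop hyphens, and restore a common elision
    (e.g. "livin'" -> "living")."""
    token = token.lower().replace("-", "")
    if token.endswith("in'"):
        token = token[:-1] + "g"
    return token

def to_phonemes(tokens, unk="<UNK>"):
    """Map each token to its phoneme sequence; words without a known
    pronunciation collapse to a single unknown symbol."""
    out = []
    for tok in tokens:
        out.extend(MINI_CMU.get(normalize(tok), [unk]))
    return out
```

In the actual pipeline the unknown symbol accounts for less than 0.1% of the text, so the phoneme channel sees an almost fully translated corpus.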
To combine the contributions of the different types of embeddings, we implemented a combined model that uses three different channels whose outputs are averaged before the softmax layer as a summary statistic of the three channels. This model is compared with what we refer to as single channel models, which work exclusively with one type of embedding at a time. In all model variations, the same weight initialization rules were applied for consistency.
For sparse embeddings, each component in the vocabulary was mapped to a one-hot vector according to its index, and for dense embeddings, components were either matched with existing pre-trained dense vectors or paired with an unknown vector. As a remedy for overfitting, caused mainly by the shortage of training examples, we enforced early stopping with different patience parameters. The impact of this measure on the overfitting problem and on the overall training of the author classification models is demonstrated in Fig. 3. With regard to training optimization, all our models used the Adam optimizer for its reported memory efficiency benefits (Kingma & Ba, 2014). We trained with rather small learning rate alternatives, finally settling on the default rate of 0.001 for the Adam optimizer in the Keras API. This rate helped the search for more optimal sets of weights in backpropagation without significantly expanding the training time. The configuration details, along with other particulars such as vocabulary sizes, can be found in the project repository mentioned earlier.

Authorship attribution
We begin with the authorship attribution task, and apply all three embedding models in 10-fold cross validation. We obtain an average accuracy of 23.6% with our character embedding model (CHAR-CV10), with slightly lower scores for the sub-word (SW-CV10; 20.6%) and phoneme encodings (PH-CV10; 16.2%). In addition, we ran a combination of the three individual models (CO1AR) on a randomly selected fixed data set split, which yields an overall accuracy score of 29.8%, improving over the pure character embeddings. For all model types, our classifiers were built as single multi-class classifiers that aim to distinguish between 120 different author labels. We compare our authorship attribution results to two previous approaches and report them in Table 2 along with the scores of our four model variants. Note that these benchmarks work on non-lyrical texts and with different amounts of data. The 'CNN-1' model proposed by Shrestha et al. (2017) is particularly comparable due to the number of distinct class labels, while the 'CNN-char' model from Ruder et al. (2016) is akin to ours in regard to the reported length variations in its data instances. It is noticeable that the reported accuracy is directly related to the number of target classes and the nature of the data sets (e.g., minimal text length variation in tweets and very large training sets lead to better results). Text length variation is particularly relevant in our experiments, since documents with a lot of padding yield lower accuracy scores in all model variants. The lack of a consistent text length across training examples stands out as the major factor behind overfitting, due to the dominance of padding values over informative components.

Occlusion analysis
For the task of authorship attribution, the use of lyrics has been restricted largely to genre labels in the literature, most probably due to the issues regarding the actual authorship of lyrics. Consequently, our authorship classification accuracy results can be compared only to out-of-domain works that match our experiments on certain criteria such as the number of classes or corpus size. As a more general limitation of reporting accuracy results, the numbers alone give no hint of where on the input the models concentrate to arrive at their predictions. To address both of these constraints, we provide occlusion analysis results as a more qualitative assessment of how our models perform. Occlusion maps were introduced by Zeiler and Fergus (2014) as a way of discovering the effect of the input on the output, by directly "occluding" sections of the input and recording the changes in the output probability distribution. We adapted the same approach to take a sample from the training set of our phoneme based model and to document how the elimination of certain parts of each sample reorganizes the prediction probabilities over target class labels. In Fig. 4, the first two occlusion excerpts are taken from a Chris Brown song that is incorrectly labeled as Trey Songz, and the last one is from a Trey Songz song labeled as Chris Brown.

Table 2 Comparative analysis of similar CNN-based authorship attribution models in the literature. For models with instances equally distributed over labels (i.e. with balanced data sets), micro-averaged F1-scores are equal to test accuracies. *Models and the best score introduced by Shrestha et al. (2017) are shown in italics. **Models and the best score reported by Ruder et al. (2016) are marked with asterisks. Our models, along with our best score, are given in bold.
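The occlusion procedure itself is model-agnostic and can be sketched as below; the toy classifier, which over-relies on a single word much like the superficial clues discussed in our error analysis, is purely illustrative:

```python
def occlusion_scores(tokens, predict, true_label, mask="<PAD>"):
    """For each position, occlude one token and record how much the
    model's probability for the true label drops relative to the
    unoccluded input. `predict` maps a token list to a probability list."""
    base = predict(tokens)[true_label]
    drops = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        drops.append(base - predict(occluded)[true_label])
    return drops

# Toy stand-in classifier that keys on the word "nobody", mimicking
# the artist-specific-word behavior observed in our models.
def toy_predict(tokens):
    return [0.9, 0.1] if "nobody" in tokens else [0.5, 0.5]

drops = occlusion_scores(["i", "want", "nobody"], toy_predict, true_label=0)
```

Positions whose occlusion causes the largest probability drop are the ones the model depends on most, which is exactly what the occlusion maps in Fig. 4 and Fig. 5 visualize.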
One immediate observation in this example, which also holds across other samples, is that blocking the exact same phoneme particles in the same lyric may have different effects on the output probability distribution (see "it's alright, it's okay" in the last line of the last excerpt). This provides clear evidence that the model is not oblivious to the location of each input part and considers its surroundings while making predictions. 8 On the other hand, analogous to taking advantage of mentions and hashtags in tweets, our models greedily infer from highly artist-specific words, often causing confusion. In the specific comparison presented in Fig. 4, the word "nobody" has a remarkable weight in the prediction process, leading to a misclassification error. This indicates how a single word or phoneme part can be associated with only a few of the labels, and how such links are exploited as superficial and erroneous clues.
To emphasize the issue of superficial clues, Fig. 5 illustrates a very direct example where the model relies on a single portion of the input for the entire prediction. Some artists persistently use certain words that are not found in the lyrics of other authors in the data set. The limited segment taken from a song by the band Acappella in Fig. 5 contains the word "scriptural"; Acappella being a contemporary Christian band, we discovered that this very specific word is commonly encountered in their songs. Such words can reflect a theme that is exclusively preferred by certain genres or artists. In more straightforward situations, artists continuously use their own names in lyrics. Taking advantage of such shallow and direct hints is arguably an important contributor in any text classification task, as they are also valid indicators of authorship. However, since these indicators can be discovered by much simpler models and can be mimicked easily, we prefer our models to predict based on more meaningful stylistic motifs rather than the mere presence or absence of a word or phoneme.

Genre classification
Our genre classification results (multi-class classification on all 12 genres) follow a similar trend to the authorship results, with the best model being the character embedding variant (41.6% ± 1.8), followed by the sub-word (39.5% ± 1.1) and phoneme (39.1% ± 1.2) models. The combined model on a fixed data set split yielded an overall accuracy score over 12 classes of 47.5%. Unlike authorship attribution, genre classification on lyrics permits in-domain comparisons from the literature, which are listed in Table 3.
Apart from plain accuracy metrics, our results show that general genres such as 'Rock' and 'Pop' exhibit less specificity, and thus less predictability, as has been reported by others (Fell & Sporleder, 2014). We also found that certain historically linked genre label pairs, such as 'Blues' and 'R&B', are very likely to be confused with each other. A more detailed account of such cross evaluations of genre labels can be seen in the confusion matrix derived from the combined genre classification model, in Fig. 6.

Conclusion
We have implemented variations of single channel CNN models inspired by a number of effective convolutional network based text classification frameworks in the literature. To account for the challenges introduced by lyrics, such as unknown and slang words, varying lengths, and repetitive structures, we proposed a multimodal approach that integrates character, phoneme and sub-word level embeddings of the inputs. We obtained overall accuracy scores of approximately 48% in genre and 30% in author classification tasks, and were able to analyse the impact of lyrics-specific phoneme features using occlusion maps. Aside from providing a benchmark for authorship attribution of lyrical texts, we have also made available a large pre-processed, public and ready-to-use lyrics data set for similar research purposes. In future work, we aim to integrate additional structural and linguistic analyses of lyrics, and to use 2D representations of lyrics that would increase the chances of capturing neighboring patterns between vertically aligned lines, such as rhyme schemas.