1 Introduction

Although music is primarily an auditory experience, users’ first impressions of unfamiliar artists’ work are largely influenced by associated marketing materials, including images and visual branding. For the most part, cover images, such as album art, are designed to embody the artist’s esthetics [1], and they are frequently used in online music discovery and streaming platforms, such as Spotify, Pandora, Last.fm, and Deezer, to attract new listeners and provide users with a multisensory, crossmodal experience congruous with the meanings, moods, and ambience of the songs [2]. Historically, album artworks have mirrored the musical trends and identities of their musicians. For example, in the 1960s, the Beatles used an album cover reminiscent of contemporary visual art to match the experimental direction of the music in their iconic album Sgt. Pepper’s Lonely Hearts Club Band [1].

In recent years, the meteoric rise of online streaming platforms has made it easier for independent musicians to broadcast their music to listeners around the world. At the same time, this places the burden of marketing and other non-music tasks on the musician [3]. This may be especially challenging for independent musicians, who might lack the financial resources, time, or design experience necessary to commission or create suitable album artwork. Given the important role that such artwork plays in communicating the themes of the album’s songs, procuring suitable images is a matter of concern for independent musicians. Indeed, online communities of musicians have dedicated multiple discussion threads on online forums (subreddits) to potential solutions to this problem. Some common suggestions on these forums include using image retrieval to source artworks (from websites like Google Images or DeviantArt) or using design tips and image editing software to produce DIY album art. However, these strategies risk copyright infringement and are limited by musicians’ design skills and access to relevant software.

To this end, we propose the use of state-of-the-art image generation and styling techniques in computer vision and deep learning, paired with an interface designed to be fast, convenient, and intuitive for users. We describe a system architecture that relies on user-submitted music and lyrics as input for two models: a generative adversarial network (GAN) [4] text-to-image model, which generates seed images from each line of the lyrics, and a style transfer model, which uses information about the dominant moods in the audio file to pull related “emotional” images from a previously assembled database and layer them onto the seed images. Style transfer [5] is a deep learning–based method that combines the neural representations of two images, resulting in a third image that blends the two. Through this two-step approach, the tool generates a selection of copyright-free, custom cover artworks that capture some of the semantic meaning of the lyrics and the mood of the music. From this selection, musicians can download the cover artwork that best accompanies their music. We opted for a modern responsive layout and a minimalist design to make the process of album art generation as seamless as possible.

Finally, to assess whether the proposed architecture is suitable for its intended purpose, we designed a simple user study targeting musicians in diverse genres to measure the usability of the application and the suitability of the output images. We recruited 35 professional and amateur musicians from around the world, who tested the application with their own songs and subsequently assessed (1) whether the GAN-generated images were usable as cover artwork and (2) whether the system itself was easy to use.

2 Related work

By combining semantic information from the lyrics and acoustic information from the audio, our proposed system can synthesize album art while also providing a new way for users to visualize music. Therefore, we consider previous work on both music visualization and album art generation.

One of the approaches to music visualization is through musical moods. Laurier et al. [6] designed an application that visualizes mood probabilities in real time, displaying the probability of each mood during a specific time frame using bar charts. Husain et al. [7] followed the same approach, designing visual textures based on the relationship between moods and different visual elements, such as color and shapes. Another approach is the visualization of lyrics. Funasawa et al. [8] created a system to identify keywords within each line and used image retrieval to form a slideshow that displays images corresponding to words in each line of lyrics. Visualyre adopts both approaches in a multimodal fashion. The lyrics are visualized line-by-line, but rather than using keywords to retrieve existing images, it uses image generation to synthesize new images using each sentence as an input. The moods extracted from the audio are visualized via style transfer, using pictures related to each mood as texture images, generating copyright-free images suitable for use as album art.

Current methods for creating album art rely on websites that help users generate different kinds of cover art for various media from a base stock image. The selection of stock images is quite limited and may not contain what the user desires. A further limitation of these applications is that other users may use the same stock image, which makes the resulting album art no longer unique.

Hepburn et al. addressed the limitation above by using a DC-GAN [9] model. This model was trained on the One Million Audio Cover Images for Research (OMACIR) dataset to automatically synthesize album covers [10]. They also experimented with the AC-GAN [11] model to generate album art for a specific genre. Visualyre adopts a similar approach to synthesize album covers. However, instead of using musical genres as input, it uses a set of lyrics and an audio file from a user-provided song. A GAN model synthesizes an image from the lyrics, which is further enhanced with the moods of the audio file. Moreover, the authors only showcased the results of their model, while our work incorporates a user interface that allows users to generate album covers in an interactive and personalized way.

3 System description

Given that album artwork can serve as a preliminary means of communicating the themes and sentiments of a song (or collection of songs) to a would-be listener, the most helpful image-generation tool for independent musicians would do more than produce attractive images—it would account for both the semantic meanings of the lyrics and the emotional moods of the instrumentation. As such, Visualyre uses the lyrics (text) and music (audio) of the input song as features for the generation of representative images.

To do this, it deploys three modules that pair the information from the audio files and the lyrics: (1) a text-to-image GAN model to synthesize images from lyrics, (2) an audio analyzer to evaluate the moods of a song file based on its audio features, and (3) a style transfer model to adapt the image from the first module to a particular mood. In this way, users are able to fine-tune and adjust the generated images, achieving a similar effect to image filters on smartphone camera applications. The system architecture is shown in Fig. 1.

Fig. 1

System architecture: lyrics are used to synthesize images through a text-based image generator. The audio file is analyzed to evaluate the presence of four different moods. This evaluation is used to sample which style images are used in style transfer

3.1 Image generation

In the first module, a text encoder extracts features from each line in the input lyrics and uses these features to condition the synthesis of images. This is done by employing a text-to-image generation model. Multiple approaches have been proposed over the years to improve these models, such as using a stacked approach to refine the images [12], employing attention to synthesize fine-grained details [13], and inferring semantic information to guide the image generation [14].

Visualyre uses a DM-GAN to visualize the lyrics, generating an image for each sentence in the input lyrics. DM-GAN [15] makes use of dynamic memory to refine the content of the synthesized images, even if the quality of the previous images is not good. This generative model is trained using the MS COCO dataset [16], which contains a total of 123,287 images (82,783 for training and 40,504 for validation) with five captions per image. The captions of the images are short descriptions in English of a particular scene; as such, the model can only generate images from English sentences. Once our system generates a selection of images, users must select one before proceeding to the next step. Figure 2 shows a selection of one of the synthesized images.
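The per-line generation step can be sketched as follows. This is a minimal illustration: `dmgan_generate` is a stand-in for the actual DM-GAN inference call, which is not shown here.

```python
def dmgan_generate(sentence, n_candidates=4):
    """Stand-in for DM-GAN inference: returns candidate images for one
    English sentence. A real implementation would run the trained model."""
    return [f"image<{sentence}>#{i}" for i in range(n_candidates)]

def generate_from_lyrics(lyrics):
    """Generate candidate seed images line by line, skipping blank lines,
    and keep track of which lyric fragment produced each image (used by
    the lyrics viewer to highlight the source line on hover)."""
    results = []
    for line in lyrics.splitlines():
        line = line.strip()
        if line:
            results.append((line, dmgan_generate(line)))
    return results
```

For example, `generate_from_lyrics("Hello darkness\n\nmy old friend")` yields two (line, candidates) pairs, one per non-empty lyric line.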

Fig. 2

Graphical user interface of Visualyre and its components. This figure shows the selection of a synthesized image from the image generator prior to applying style transfer

3.2 Audio analysis

Music encompasses more than just lyrics; in many cases (though outside the target user base for our application), music may be completely instrumental and devoid of lyrics. Thus, to make more efficient use of the source material (music), one must also attend to the emotionality of the instrumentation [17]. To generate cover art capable of representing these musical emotions, we utilize binary classifiers to detect the relative presence of four different emotions—anger, happiness, sadness, and relaxation—based on the musical features extracted from the song’s audio signal. These classifiers were introduced by Laurier et al. using support vector machine models and were recently ported to pre-trained deep convolutional neural network models by the Music Technology Group [18].

For our model, we normalized the scores of the positive label (e.g., “angry” rather than “not angry”) of these classifiers and used this information to derive an audio-based mood probability score, which assigns a probability to each mood based on the score of its classifier. The higher the score, the more likely that mood will be sampled for style transfer (through the process described in Sect. 3.3, below). Likewise, if a particular score is very small, its sampling probability will be very low and other moods will be sampled instead. This allows us to enhance the images generated in the first phase by giving them an artistic feel.
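As a sketch, the mood probability score might be computed by normalizing the positive-class outputs of the four classifiers into a distribution and sampling from it. The exact normalization Visualyre uses is not specified here, so this is an assumption:

```python
import random

def mood_probabilities(scores):
    """Normalize positive-class classifier scores into a probability
    distribution over moods (assumes at least one nonzero score)."""
    total = sum(scores.values())
    return {mood: s / total for mood, s in scores.items()}

def sample_mood(scores, rng=random):
    """Sample a mood in proportion to its probability: high-scoring moods
    are sampled often, near-zero moods almost never."""
    probs = mood_probabilities(scores)
    moods, weights = zip(*probs.items())
    return rng.choices(moods, weights=weights, k=1)[0]
```

For instance, with scores `{"anger": 1.0, "happiness": 3.0, "sadness": 0.0, "relaxation": 0.0}`, happiness receives probability 0.75 and is sampled three times as often as anger.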

3.3 Style transfer

To visually represent the moods obtained from the audio analysis described above, we use a technique called style transfer: the process of changing the style of an image to match that of another image while preserving the original image’s content [19]. Earlier style transfer models could only apply a single, fixed style, as a separate model had to be trained for each style image [5, 20]. However, Ghiasi et al. [21] introduced a style prediction network that can predict the style embeddings of an arbitrary style image, effectively enabling stylization using any pair of content and style images. Their network is trained using content images from the ImageNet dataset and style images from the Kaggle Painter By Numbers (PBN) dataset, which consists of 79,433 paintings.

For Visualyre, we utilized Ghiasi et al.’s style transfer model by preparing a small dataset of “emotional” images, called the “Style Bank,” assembled by querying Unsplash for images that reflect the four moods assessed by our binary classifiers (Sect. 3.2): anger, happiness, sadness, and relaxation. We used the mood names and words with similar meanings (e.g., “fury,” “joy,” “calm,” and “sorrow”) as search queries to obtain images relevant to a particular mood. Our intention was to use style transfer to apply the “mood” of these images to the images generated by the GAN model (Sect. 3.1). To determine which “mood” image should be combined with the GAN-generated image, we relied on the mood probability scores derived from the audio analysis. After sampling a mood, the system selects a representative image of the chosen mood from the Style Bank, which serves as the style image. This style image is then used to generate eight different versions of the previously selected seed image. Together with the base synthesized image, we show the user a total of nine different images to choose from, as shown in Fig. 1.
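The selection step can be sketched as follows. This is illustrative only: `stylize` stands in for the arbitrary style transfer model, and the Style Bank is represented as a plain dictionary of hypothetical filenames rather than actual Unsplash images.

```python
import random

STYLE_BANK = {  # hypothetical contents; the real bank holds Unsplash images
    "anger": ["fury_01.jpg", "rage_02.jpg"],
    "happiness": ["joy_01.jpg", "sun_02.jpg"],
    "sadness": ["sorrow_01.jpg", "rain_02.jpg"],
    "relaxation": ["calm_01.jpg", "sea_02.jpg"],
}

def stylize(seed_image, style_image, variant):
    # Stand-in for the style transfer model: returns one stylized variant.
    return f"{seed_image}|{style_image}|v{variant}"

def build_selection(seed_image, mood_probs, n_variants=8, rng=random):
    """Sample a mood, pick one of its style images, and return the base
    image plus eight stylized variants: nine images shown to the user."""
    moods, weights = zip(*mood_probs.items())
    mood = rng.choices(moods, weights=weights, k=1)[0]
    style_image = rng.choice(STYLE_BANK[mood])
    variants = [stylize(seed_image, style_image, i) for i in range(n_variants)]
    return [seed_image] + variants
```

Returning the unmodified seed image as the first element mirrors the interface, where the base synthesized image is always offered alongside its stylized versions.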

3.4 Graphical user interface

As per the above description, Visualyre synthesizes cover artworks in a sequential manner. To make this process as seamless and dynamic as possible for the end user, we opted for a responsive layout and a minimalist design, showing each component only when it is needed. This means that the entire process occurs on the same page, without reloading or redirecting to another URL. Instead, the UI components and their respective contents change reactively according to the user’s interaction with Visualyre. Figure 2 shows the graphical user interface and its different components (which are also listed below).

  • Song input: A form where users add lyrics and upload an audio file. The users input the lyrics by copying and pasting them into a text box and upload the song by choosing a file from their device.

  • Navigation menu: Buttons that enable travel to the next step or to the previous one. This component also indicates whether the system has finished analyzing the moods from the submitted audio file or whether any errors occurred during the analysis.

  • Progress tracker: Allows users to check their current step at any time. The name of each step provides a subtle description of the action required during that step.

  • Image selection: A grid where users can select or hover over generated or stylized images. Multiple images are displayed simultaneously, and users can scroll to display further images when necessary. To advance to the style selection, a user must select one of the synthesized images. Similarly, users must select a stylized image before downloading it.

  • Lyrics viewer: An area where users can view the input lyrics while selecting synthesized images. When the user hovers over or selects an image, the system highlights the fragment of lyrics used to generate that image.

4 System interaction

To use Visualyre, users proceed through a series of steps, which are detailed below and illustrated in the user experience flow in Fig. 3. The letters indicate the current node from the flowchart.

  1. An empty form (A) is displayed to the user.

  2. The user inputs the lyrics in the text field and uploads an audio file (B). The user clicks SUBMIT to start the audio analyzer.

  3. The top left button in the navigation changes to READY (C), indicating that the mood detection of the audio file has finished.

  4. The user clicks CONTINUE and is redirected to the image selection. In the background, the generator starts generating images from the lyrics.

  5. The images are generated and shown in a grid-like fashion.

  6. The user selects a synthesized image (D).

  7. The user clicks CONTINUE and is redirected to the style selection.

  8. Style transfer is applied to the selected image by sampling style images based on the results of the audio analysis.

  9. The stylized images are created and shown in a grid-like fashion.

  10. The user selects a stylized image (E). The image is shown on the left with a DOWNLOAD button.

  11. The user clicks DOWNLOAD, acquiring their desired stylized image.

Fig. 3

User experience flow. This figure shows the route that the users must follow to synthesize a stylized image from a song file and its respective lyrics. The letter indicates the current status of the application

5 Evaluation methodology

5.1 Prescreening

We used the Prolific platform to recruit an expert group of musicians to evaluate the usability of Visualyre and its suitability for generating cover art for their songs. Prolific allows us to restrict participants according to certain categories, one of them being musical instruments. Thus, prior to recruitment, we utilized this item (“Do you play a musical instrument; if so for how many years?”) to pre-select an eligible sample population from Prolific with the highest likelihood of having experience writing songs.

We then conducted a prescreening survey, in which we asked participants to share their previous musical experience. Additionally, because Visualyre requires the use of an original song with English lyrics, we asked participants whether they had previously written a song with English lyrics and whether they were willing to share such a song to evaluate our application. Based on these criteria, we initially recruited N = 621 participants for the prescreening survey, of which N = 95 suitable candidates were invited to participate in the user study. All prescreening survey participants were compensated with $0.20, regardless of their answers. The prescreening included an attention check [22], which ensured that participants were reading the questions prior to answering them.

5.2 Telemetry

We used telemetry in our application to (1) confirm that participants interacted with the application prior to answering the evaluation survey (described in Sect. 5.3, below) and (2) determine how long they spent performing each step. To do this, we added a “Create ID” button at the start of the application that generates a unique 7-digit ID, which is used to log the participants’ actions with a corresponding timestamp. Table 1 shows all the action types and the moment when they are logged. By checking whether a specific ID has completed the download action, we can assess whether a participant has used the application from start to finish, as instructed.
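A minimal sketch of this logging scheme follows; the action names used here are illustrative, and the full set is listed in Table 1.

```python
import random
import time

def create_id(rng=random):
    """Generate a 7-digit participant ID, as issued by the 'Create ID' button."""
    return f"{rng.randrange(10**7):07d}"

class TelemetryLog:
    """Append-only log of (participant ID, action, timestamp) records."""

    def __init__(self):
        self.records = []

    def log(self, participant_id, action):
        self.records.append((participant_id, action, time.time()))

    def completed(self, participant_id):
        # A participant finished the study if their ID reached "download".
        return any(pid == participant_id and action == "download"
                   for pid, action, _ in self.records)
```

Per-step durations can then be recovered by differencing the timestamps of consecutive actions logged under the same ID.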

Table 1 Types of actions logged through telemetry

5.3 Evaluation metrics

We provided each eligible participant with an evaluation survey, which listed instructions for how to use the application, specified that the user must upload an original English song and download one of the stylized images, and indicated that participants were free to drop out from the evaluation at any time. There were no time constraints when using the application. Out of the 95 candidates, 35 participants (27 male, 7 female, 1 non-binary) answered all the survey questions and used the application from start to finish with an original English song. This was corroborated by the telemetry detailed above. Participants were compensated $5 for their involvement with the study.

The age of the participants ranged from 19 to 61 (mean age = 28.43, SD = 9.73); most participants reported using English (17) as a main language, followed by Spanish (5) and Polish (4). The rest of the languages were Hebrew (2), Portuguese (2), Czech (1), Dutch (1), French (1), Greek (1), and Swedish (1). Participants also labelled the genres of the songs they submitted, and the most common genres were rock (6) and pop (6). The rest of the genres were R&B (4), electronic (3), metal (3), acoustic (2), alternative (2), house (2), indie (2), punk (2), rap (2), and piano-ballad (1). Some participants wrote more than one genre for their song. The usage time ranged from 2:41 to 44:14 (mean = 11:59, SD = 9:32). Out of the 35 participants, only 5 of them spent more than 20 min using the application, while 8 of them used it for less than 5 min.

After using the application, participants were invited to complete the survey portion of the evaluation, where they rated the application across different metrics. Users had to rate eight different questions on a 6-point Likert scale ranging from strongly agree (6) to strongly disagree (1). These questions were designed to enable a quantitative evaluation of the app’s usability and utility for generating cover art. To link these responses with the telemetry data, we asked participants to provide both their Prolific ID and the 7-digit ID generated by our application. Table 2 shows the questions from the survey with their associated metrics.

Table 2 Survey questions with their respective metrics

For the qualitative evaluation, users had to share their impressions of the application and their previous experiences seeking cover art for their music by answering two open-ended questions. These questions were intended to solicit further information about the app’s effectiveness, usability, and necessity. The first, “What are your impressions of Visualyre?” (Q14), yielded responses from all 35 participants. The second was a follow-up to “Prior to using this application, have you had any difficulties obtaining a cover image for your music?” (Q25); respondents who answered “Yes” to this initial question were asked to elaborate on their previous experiences (Q26). An additional question, “Have you used a similar tool before?” (Q15), was added to assess the novelty of the application.

6 Findings

In this section, we showcase the results of our evaluation by grouping and analyzing the results of the quantitative and qualitative evaluations across different metrics. The results of the quantitative evaluation are summarized in Table 3 below.

Table 3 Results of the quantitative analysis

6.1 How intuitive is Visualyre?

As Visualyre requires multiple steps to accomplish music visualization (as per Sect. 4), we wanted to check whether users were able to complete this step-by-step procedure smoothly. Accordingly, we asked participants to rate whether “The application was easy to use from beginning to end” (Q4) using the 6-point Likert scale described above. The mean score for this question was 5.09 (SD = 0.92), suggesting that most users agreed that the application was intuitive and easy to use throughout the entire procedure.

Of the 28 participants who commented on the app’s usability on Q14, 22 expressed positive sentiments. Eleven indicated that the app was easy to use, and one elaborated, “It’s pretty easy to use. smooth transitions between pages and it didn’t take long to upload and analyze the song” (p19). In terms of user experience, 14 classified the app as “cool,” “interesting,” or “creative,” and four lauded the app’s concept/features as “nice,” “good,” or even “great.” Two noted that it was “fun” or otherwise enjoyable to use, with one writing, “I really enjoyed that a lyric from the song could be interpreted in different ways by choosing a different style for the artwork” (p17). Finally, three remarked that they liked the app or were happy to have used it, independent of their intention to use it again.

Although Visualyre was praised for its intuitiveness, some users expressed concerns. Three indicated that the app was incomplete in its current form, either for functional (n = 1) or esthetic (n = 2) reasons. For example, one such respondent noted, “It doesn’t work at the moment, but I can see it might end up being something very interesting” (p6). Another, who had experienced technical difficulties, expressed the need for functional improvements, while three others recommended specific improvements or features, including the ability to re-access previous information, access a broader range of images, or upload original work as a starter image (p3, p25, p27, respectively).

6.2 How effective are the methods of text-to-image generation and style transfer for visualizing lyrics?

Visualyre combines the methods of text-to-image generation and style transfer to synthesize the resulting images. To evaluate the first method, we asked participants to rate whether “The application displayed at least one image that matches the theme of the lyrics” (Q5). The mean score for this question was 4.26 (SD = 1.46). To evaluate the second method, we asked participants to rate whether “The application displayed at least one style that matches the tonality of the song” (Q6). The mean score for this question was 4.66 (SD = 1.08). Participants considered style transfer to be a more effective visualization method than text-to-image generation, which may be due to the limitations of the MS COCO dataset used to train the current text-to-image model and the difficulty of representing abstract concepts through this dataset. Nevertheless, both mean scores were well within the positive (agree) end of the spectrum, which suggests that Visualyre was at least somewhat effective at visualizing lyrics.

6.3 Are the images generated by Visualyre suitable as cover art?

All participants used the application until they downloaded at least one of the stylized images. To determine whether these images could reasonably be used as cover art, we asked users to rate whether “The downloaded image is suitable as cover art” (Q7). The mean score for this question was 4.60 (SD = 1.03).

We also added two follow-up questions to determine why users considered the downloaded image suitable as cover art: “The downloaded image matches the intent or intention of the song” (Q8) and “The downloaded image matches the intent or intention of the corresponding lyrics” (Q9). The first question yielded a mean score of 4.20 (SD = 1.16); the second question yielded a mean score of 4.14 (SD = 1.19). Combined with the preference for style transfer over text-to-image generation indicated in response to Q5 and Q6 (see Sect. 6), these results show that users are somewhat willing to use images generated by Visualyre as cover art despite the limitations of text-to-image generation.

In response to Q14, respondents elaborated on their positive and negative perceptions of the cover art. Six of the 16 participants who commented offered positive appraisals. Four said that the images were “interesting” or “unique” while two said they were “impressive.” Two also described the images as “beautiful” or “nice,” one based on the text-to-image generation (“makes beautiful covers for every line” (p13)) and the other based on the style transfer (“when combined with the different styles, they can give really nice examples” (p17)). Five respondents described them as abstract, conceptual, or random/non-representative, and in four of these cases, the respondent regarded this as a negative attribute. For instance, one remarked, “Bit confused as to what some of the images were” (p1), while another clarified, “[the image] didn’t add to the meaning of the song – the images were quite conceptual images rather than illustrative images” (p22). Five respondents also used more affective negative-leaning language, describing the images as “strange,” “creepy,” “disturbing,” “trippy,” or “nonsensical.” One specified, “Some of the ones regarding people were scary” (p33). Only three respondents found the images to be unattractive, sub-par, or uninteresting, one of whom wrote, “The images provided were not that interesting … The images feel esasily identifiable as being AI generated” (p31). Finally, one participant commented not on the content of the image, but on the low file quality upon download. Figure 4 showcases some of the stylized images that our participants downloaded with their respective base images and target mood. To preserve anonymity, we only show two words from the sentence that was used to synthesize the base image.

Fig. 4

Synthesized images downloaded by our users. The base images are located on the top, and the stylized images are located on the bottom

6.4 How effective is Visualyre for musicians?

Of the 18 participants who commented on the effectiveness of the app, 14 attested to its actual or potential efficacy. Six confirmed that they found the app generally useful, and three indicated that they would use the app for their own projects. Speaking from his own experience, one respondent summarized, “As an artist I personally know how important it to get the right art work for a song or project. Most artist struggle when it comes to creating there desired art work so Visualyre can change that by show casing defferent art for artists to just pick” (p24). Six participants regarded the app as promising, and three suggested that it might work better for artists in specific musical genres, for example, “for indie and experimental music [rather] than for traditional pop” (p14). However, three respondents expressed confusion about the purpose of the app, and one stated that they did not find the app useful in its current form, writing “I don’t know what to think about it. […] but for now I don’t believe that it can be really useful” (p28).

These findings complement the results of “How likely are you to recommend this application to other artists?” (Q12) from the quantitative part of the survey, wherein 71.4% of respondents indicated that they were somewhat likely to very likely to recommend this application to others. The average score for this question was 4.29 (SD = 1.30) on a scale ranging from very likely (6) to very unlikely (1).

6.5 Is Visualyre novel?

Though none of the respondents commented directly about the novelty of the app, we may deduce from the prevalence of “cool/interesting” comments (n = 14; i.e., “converting audio files and interpreting lyrics in a pictorial way is something really cool!” (p26)), the renderings of the app as a “concept” or “idea” (n = 3; i.e., “I think it is a great concept” (p29)), the tendency to describe the app as “promising,” or as having “potential” (n = 6), or even the comments of confusion (n = 3), that users had not encountered an app like this before (there were 20 unique responses of this kind). This is further supported by the responses to another question, which asked, “Have you used a similar tool before?” All but one of the respondents answered “No,” not including another who compared the app to Google’s DeepMind. Thus, it seems safe to say that the app was novel for a majority of the respondents.

6.6 Is there a need for an application like Visualyre?

Of the eight participants who responded to the second qualitative question, seven confirmed that their previous experiences obtaining/creating cover art for their music were difficult; one misinterpreted the prompt, commenting instead on their technical difficulty when utilizing Visualyre. The main reasons given for previous difficulties included sourcing issues (n = 4), including the inability to find an appropriate artist or to obtain image rights, and time (n = 3). One participant explained, “Well, it takes me a lot of time to find the right cover art because I’m such a perfectionist. Sometimes I do find the right image but when I contact the owner, they won’t let me use it, which is understandable” (p15). Contrary to our expectations, only two respondents cited financial obstacles, and only one cited skill-related difficulties. Three resorted to making their own artwork, though one conceded that this was “fun” despite the time commitment (p16), and another acknowledged that they have the requisite skills: “I end up doing my own artwork since I’m an illustrator” (p15). Moreover, one respondent noted that Visualyre would have been especially useful in the early phases of his career (p24), suggesting a more circumscribed area of “necessity.” Indeed, it is worth noting that 27 respondents (77.1%) did not report previous issues with obtaining cover artwork. As such, it may be more appropriate to refer to the app’s convenience, as it saves artists time more so than money. This may explain why, when asked “Would you pay to use a similar service?” (Q17), 42.9% of respondents answered “Yes,” despite concerns with the quality of the generated images. Taken together with the predominantly positive feedback about the app’s effectiveness for musicians (see Sect. 6), these comments seem to confirm that there is an audience for this kind of tool.

7 Discussion

Thus far, our data indicate that Visualyre is an effective tool for musicians to generate cover images and album art for their own songs. Responses to "What are your impressions of Visualyre?" (Q14) revealed that users found Visualyre highly usable, largely novel, and reasonably effective, though it does not yet yield wholly suitable results. Responses to "Prior to using this application, have you had any difficulties in obtaining a cover image for your music?" (Q25) confirmed the need for this tool amongst a small segment of the target population (22.9%), though for different reasons than we anticipated. Most of the participants were willing to recommend the app to another musician, and 42.9% would pay for cover art generation as a service.

Overall, more than 60% of the qualitative responses leaned positive, while less than 30% leaned negative, indicating general support for the app. Of the four participants who explicitly reported on the app's performance against their expectations, two were pleased to find that it performed better than expected. One stated that the app performed as planned "for someone who want a quick, abstract and unique album cover" (p8). Finally, one complained that the app differed from their expectations because the images were nonrepresentational: "I was expecting a more finished product with actual images instead of random lines and colors" (p2). We expect that these issues could be addressed with more robust text-to-image generation models, given that participants reported less concordance between lyrics and images than between tone and images in the quantitative survey.

Although only a minority of users reported prior difficulties in procuring album art, the largely positive response to Visualyre suggests that it may be a useful tool for independent musicians to consider for online platforms and promotional material, as both the generated images and the interface received satisfactory feedback. For the musician, this presents a low-cost alternative to designing album art, so the effort, time, and resources saved can be diverted back to core music-making activities.

8 Conclusion

In this paper, we proposed a novel application of text-to-image computational techniques in the form of Visualyre, a lyric visualization tool that combines two techniques (text-based image generation via DM-GAN and arbitrary style transfer) to sequentially synthesize images based on lyric segments and the overall mood of uploaded songs. Combined with the built-in options for user selection and input, we think that this method allows for the generation of images that capture some aspect of a song's semantic meaning or mood.
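At a high level, the two-stage pipeline described above can be sketched as follows. This is an illustrative outline only: `generate_image` and `apply_style` are placeholder stubs standing in for the DM-GAN generator and the arbitrary style transfer network, and the mood labels in the Style Bank lookup are assumptions, not the moods actually supported by Visualyre.

```python
def generate_image(lyric_line: str) -> str:
    """Placeholder for DM-GAN text-to-image synthesis on one lyric segment."""
    return f"image<{lyric_line}>"

def apply_style(image: str, mood: str) -> str:
    """Placeholder for arbitrary style transfer, with a mood-indexed Style Bank.

    The mood names here are illustrative assumptions.
    """
    style_bank = {"happy": "bright", "sad": "muted", "calm": "soft"}
    style = style_bank.get(mood, "neutral")
    return f"{image}+{style}"

def visualyre_pipeline(lyrics: str, mood: str) -> list:
    """Generate one candidate image per lyric segment, then stylize by mood."""
    segments = [line.strip() for line in lyrics.splitlines() if line.strip()]
    return [apply_style(generate_image(seg), mood) for seg in segments]

# One stylized candidate per non-empty lyric line.
candidates = visualyre_pipeline("golden sunrise\nendless road", "happy")
```

In the actual application, the user then selects among the per-segment candidates, which is where the user input mentioned above enters the pipeline.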

Moving forward, we consider several areas where Visualyre can be improved. Firstly, our GAN model currently generates an image for every line in the lyrics. Some of these lines may be too short, or may contain only abstract concepts that are very difficult for the image generation model to synthesize. One improvement could thus be to expand the training dataset to include artistic or abstract imagery. Secondly, the range of supported moods in the audio analysis was limited, and the reference images in the Style Bank were selected subjectively. Future iterations could include a more comprehensive mood detection model for audio analysis and crowdsourced annotations for a more consistent estimation of the reference images' emotionality. Thirdly, Visualyre only supports lyrics written in English. Another improvement would therefore be to add multilingual support, perhaps by training GAN models on multilingual datasets or by adding machine translation capabilities to Visualyre's architecture. Finally, we note that the downloadable images have low resolution. We plan to use super-resolution models [23] to enable high-resolution downloads in future iterations. We also plan to use cloud computing for increased scalability, enabling simultaneous access for musicians around the world. This would also provide an opportunity for real-world feedback from actual users of the application, rather than the paid participants used in this study. With the release of VQ-GAN [24] and CLIP [25], initial image generation using these models may offer an alternative to the DM-GAN used in Visualyre, though we note that CLIP-guided text-to-image synthesis is considerably slower, so retaining our current DM-GAN model may yet be advantageous.

With Visualyre, one of our goals is to make a modest contribution to the independent music community by generating artwork that matches artists' musical and artistic intentions. In the future, we will explore other possible applications, such as allowing users to browse music through synthesized images.