1 Introduction

The task of speech-to-image generation aims to automatically yield photo-realistic and semantically consistent photographs directly from given speech signals. In recent years, this topic has drawn rapidly growing interest from multidisciplinary communities. It can potentially be used in a wealth of real-world applications, such as creating novel, visually interesting photos that provide ideas for visual artists, editing photographs according to a spoken description, generating new data in machine learning [1] when augmentation is needed in training, e.g., of a classifier, and helping disabled persons produce pictures.

The availability of a speech-to-image transform would allow for an AI system to check the visual implications of a spoken narrative.

We think that research on synthetic-image generation directly from speech audio may provide a way of looking into the problems that the human brain is confronted with when forming a visual association on the basis of speech input. Note that in this task, an intermediate textual representation can be avoided (in children, learning to read text happens at a much later stage). In traditional AI, the abstract symbolic nature of text is considered to be an essential property and the pinnacle of intelligent information processing. However, the assumption that decoded speech (‘recognized text’) is a necessity for image generation has not been proven. From the point of view of deep machine learning, there is no strict necessity to rely on an intermediate symbolic text code. Vectorial representations may also be suitable for abstract representation, as has been shown extensively in recent years [2].

The task of speech-to-image generation has a variety of advantages over text-to-image synthesis [3, 4], whose goal is to produce high-resolution image samples that are semantically aligned with input natural-language descriptions.

Firstly, spoken language may play a more advanced role in controlling and directing an image-generation process than the text modality. Compared to encoded text, speech signals contain, in addition to the pertinent phonemes, rich characteristics that can be roughly divided into timbre, prosody, rhythm, intonation, stressed components, etcetera [5, 6]. Prosody allows speakers to express attitudes and emotional states, providing key non-verbal cues that help the listener to understand the spoken description [7]. For instance, when a person says ‘There is a fire’ in a neutral or falling intonation, it is probable that the fire is not big. However, if the speaker says the same sentence in a sharp intonation, then it may be more likely that this is a case of a large, risky fire. The example shows that text alone may not be sufficient to convey the subtleties of the speaker’s intention. It is possible to introduce non-verbal cues, e.g., emotional features, for conditional image synthesis. Incorporating prosody information into the design of picture-generation systems can also be beneficial in applications where users influence the desired photographs by adjusting the intonation of the speech probe. Since current deep-learning-based text-to-speech synthesis models allow for prosody modulation [8], synthesizing pictures on the basis of the emotional features and linguistic information of speech signals should be feasible.

Secondly, the auditory modality is the most commonly used and natural way for humans to communicate information with each other in daily life, whereas written language is slower and more elaborate. For these reasons, compact speech-to-image generation systems may become interesting in comparison to a pipeline of a speech recognizer followed by a standard text-to-image algorithm.

Thirdly, there are about 3500 languages lacking an orthography or written form [9], which makes it impossible to train a text-to-image synthesis model for these languages; thus, text-to-image systems cannot benefit these populations. Since acquiring speech signals is relatively easy and not limited by the presence of a writing system, an adequate speech-to-image generation architecture can be designed for these populations. This may also be interesting for language research in the humanities: the presence of such tools would allow subjects and scholars an immediate inspection of the correspondence between a spoken description and the generated image.

In summary, spoken descriptions are suitable to explicitly and accurately describe an image using linguistic and emotional information, and research on speech-to-image synthesis has practical and scientific implications. Nevertheless, speech-to-image generation remains a considerably challenging cross-modal task. Different from textual descriptions, a speech signal is not discrete but continuous, lacking regular breaks between words, i.e., the spaces of written text. This characteristic makes it difficult for a model to grasp the linguistic information of spoken image captions while learning the corresponding embeddings.

We are not the first to attempt to translate speech signals directly into high-resolution pictures. For example, Li et al. [10] introduced a multi-stage speech-to-image GAN architecture to produce photo-realistic pictures semantically correlated with the input spoken description. In order to better capture the linguistic information, the researchers adopted a well-trained image encoder as a ‘teacher’ to train the speech encoder from scratch. Wang et al. [9] designed a new speech-embedding network (SEN) to obtain the speech vector and developed a relation-supervised densely-stacked generative model to yield high-quality photographs.

Fig. 1

The comparison between the stacked framework and the proposed architecture. The multi-stage pipeline (a) entails training separate generators to obtain high-quality samples. The presented Fusion-S2iGan (b) is capable of producing visually plausible pictures only employing a single generator/discriminator pair. In (a), \(G_{0}\)-\(G_{2}\) are generators and \(D_{0}\)-\(D_{2}\) are discriminators. In (b), \(B_{0}\)-\(B_{6}\) are the dual-residual speech-visual fusion blocks discussed in Sect. 4, and G and D are the generator network and discriminator network, respectively

These speech-to-image GAN models adopt a multi-stage framework (see Fig. 1a), where several generators and discriminators are employed to produce visually plausible samples. Although this architecture has achieved promising results in the speech-to-image generation task, three important problems remain. First, this framework entails training separate networks and is therefore inefficient as well as time-consuming [4]. Even worse, it is difficult for the final generator to output perceptually realistic photographs when the earlier generator networks do not converge to a global optimum [3, 11,12,13]. Second, this architecture ignores the quality of the outputs of the earlier generator networks [14]: contextual vectors are not used to enhance and modulate the visual feature maps in the generator for precursor images, which comprises only up-sampling operations and convolutional layers [3]. Third, several discriminator networks need to be trained.

In recent years, we have proposed two novel single-stage text-to-image GAN models, i.e., DTGAN [3] and DiverGAN [4], to address the above-mentioned issues of a multi-stage architecture. Both DTGAN and DiverGAN are capable of adopting a single generator/discriminator pair to produce photo-realistic and semantically correlated image samples on the basis of given natural-language descriptions. In DTGAN, dual-attention models, conditional adaptive instance-layer normalization and a new type of visual loss were presented. In DiverGAN, the sentence-level attention models introduced in DTGAN were extended to word-level attention modules, to better control an image-generation process using word features. Moreover, we proposed to insert a dense layer into the pipeline to address the lack-of-diversity problem present in current single-stage text-to-image GAN models. Inspired by these previous works and aiming to address the limitations of a stacked framework, our research introduces a visual+speech fusion module and incorporates several effective loss functions. These advancements collectively contribute to a novel single-stage speech-to-image architecture.

Looking at recent publications on text-to-image generation [15,16,17], it is clear that this field has become a thriving area of research. Despite the significant progress made by models like OpenAI's DALL-E 2 [16] and the stable diffusion model [15] in producing high-quality images, these achievements often come at the cost of requiring billions of images and hundreds of GPU hours for training [15]. Additionally, generating images based on speech descriptions with these models necessitates the extra step of translating speech into text. In contrast, our model offers a direct approach to generating images from speech, while being trained on smaller data sets. This highlights the potential for efficient and effective image generation directly from speech inputs, even with limited training data and using a common single-GPU platform.

The contributions of this paper can be summarized as follows:

  • We present a novel, effective and efficient single-stage architecture called Fusion-S2iGan (see Fig. 1b) for speech-to-image transforms, which is capable of producing high-quality and semantically consistent pictures using only a single generator/discriminator pair.

  • A visual+speech fusion module (VSFM) is designed to effectively feed the speech information from a speech encoder to the neural network while improving the quality of generated photographs. More importantly, we spread the bimodal information over almost all layers of the generator. This allows for an influence of the speech over features at various hierarchical levels in the architecture, from crude early features to abstract late features.

  • To the best of our knowledge, we are the first to apply (1) the hinge loss, (2) deep attentional multimodal similarity model (DAMSM) loss and (3) matching-aware zero-centered gradient penalty (MA-GP) loss in speech-to-image generation, which are beneficial for the convergence and stability of the generative model.

  • Extensive experiments are carried out on four benchmark data sets, i.e., CUB bird [18], Oxford-102 [19], Flickr8k [20] and Places-subset [21]. The experimental results suggest that the proposed Fusion-S2iGan has the capacity to yield better pictures than current multi-stage speech-to-image GAN models such as StackGAN++ [22], Li et al. [10] and S2IGAN [9].

  • We explore how far current single-stage text-to-image methods can be used for the speech-to-image transform task, as described in Sect. 5.4.3.

The remainder of this paper is organized as follows. Section 2 reviews related works. Section 3 presents preliminaries on conditional GANs and scaled dot-product attention. In Sect. 4, the architecture of the proposed Fusion-S2iGan is introduced in detail; in particular, Sect. 4.2 elaborates on the presented visual+speech fusion module. Experimental settings are discussed in Sect. 5 and results are reported in Sects. 5.2, 5.3 and 5.4. Section 6 draws the conclusions.

2 Related works

In this section, research fields related to our work are described, including CGAN in text-to-image generation and speech-audio-to-image synthesis.

2.1 CGAN in text-to-image generation

Text-to-image generation aims to automatically produce visually plausible and semantically consistent image samples, given natural-language descriptions. Many text-to-image synthesis approaches are built upon the original conditional generative adversarial network (cGAN) [23] due to its appealing performance. A cGAN-based text-to-image synthesis architecture is composed of a generator and a discriminator that are trained with competing goals. The generator model takes a random latent vector and textual vectors as inputs and outputs the corresponding photograph, attempting to approximate the text-conditioned data distribution. In the meantime, the discriminator network learns to separate the real text-image pair from the fake text-image pair. We roughly group these approaches into two categories according to the number of generators and discriminators they use.

2.1.1 Multi-stage models

Zhang et al. [22] presented the first multi-stage text-to-image generation framework, named StackGAN, which employs several generators and discriminators to boost image quality and semantic relevance. StackGAN serves as a strong basis for subsequent research. Xu et al. [24] proposed to incorporate a spatial-attention module into the design of a stacked architecture, in order to better bridge the semantic gap between vision and language. Qiao et al. [25] developed MirrorGAN, which introduces an image-to-text model to ensure that synthesized pictures are semantically related to given textual descriptions. Zhu et al. [14] built DM-GAN, where a dynamic-memory module is applied to enhance the image quality of the initial stage. CPGAN [26] designed a memory structure to parse the produced image in an object-wise manner and introduced a conditional discriminator to promote the semantic alignment of text-image pairs.

2.1.2 Single-stage methods

Reed et al. [27] were the first to use a single generator/discriminator pair to yield samples on the basis of natural-language descriptions. However, the resolution of the generated pictures is limited owing to the unstable training process as well as the lack of an effective structure. Tao et al. [28] developed a matching-aware zero-centered gradient penalty loss to help stabilize the training and improve the image quality of a single-stage text-to-image GAN model. Zhang et al. [3] presented DTGAN, in which dual-attention models, conditional adaptive instance-layer normalization and a new type of visual loss are designed to generate perceptually realistic images using only a single generator/discriminator pair. Zhang et al. [4] proposed DiverGAN, which inserts a dense layer into the pipeline to address the lack-of-diversity problem present in current single-stage text-to-image GAN models. Zhang et al. [29] introduced linear-interpolation and triangular-interpolation techniques to explain the single-stage text-to-image GAN model. Moreover, a Good/Bad data set was created to select successfully generated images and corresponding good latent codes.

2.2 Speech-audio-to-image synthesis

With the recent rapid advances in generative adversarial networks (GANs) [30] and conditional GANs (cGANs) [23], speech-to-image generation has made promising advances in image quality and semantic consistency when given speech-audio signals as inputs. Several studies focused on synthesizing images conditioned on the sound of music. Chen et al. [31] made the first attempt to use the cGAN to produce samples on the basis of music audio. In addition, an image-to-sound network was introduced to synthesize music according to instrument pictures. Hao et al. [32] presented a unified architecture (CMCGAN) for audio-visual mutual synthesis. Specifically, CMCGAN incorporates audio-to-visual, audio-to-audio, visual-to-audio and visual-to-visual networks into one pipeline for cyclic consistency and convenience.

Some publications tried to reconstruct a facial photograph from a short audio segment of speech. Duarte et al. [33] proposed an end-to-end speech-to-face GAN model called Wav2Pix, which has the ability to synthesize diverse and promising face pictures according to a raw speech signal. Oh et al. [34] developed a reconstructive speech-to-face architecture named Speech2Face that contains a voice encoder and a pre-trained face decoder network. The encoder is used to extract face information from the given speech and the decoder aims to reconstruct a realistic face sample.

Different from the above approaches, several papers aimed at translating a spoken description of an image into a high-quality picture directly. Li et al. [10] attempted to apply a multi-stage speech-to-image GAN model to yield perceptually plausible pictures semantically correlated with input speech-audio descriptions. To better acquire the speech embedding, the researchers used a well-trained image encoder as a ‘teacher’ to train the speech encoder from scratch. Wang et al. [9] proposed S2IGAN where a speech-embedding network (SEN) was designed to obtain the spoken vector and a matching loss and a distinctive loss were presented to train SEN. In addition, a relation-supervised densely-stacked generative model is introduced to produce high-resolution images.

This paper focuses on solving the task of speech-to-image synthesis. In contrast to previous works, our approach aims to generate high-quality images from spoken descriptions, leveraging only a single generator/discriminator pair.

3 Preliminaries

3.1 Conditional generative adversarial network (CGAN)

A CGAN is an advanced version of the GAN, using conditional contexts, such as class labels, textual descriptions, and low-resolution images, in combination with a random code as the inputs for G. G yields samples that are correlated with the given conditional contexts, and the aim of D is to distinguish the real pair \((x, c)\) from the fake pair \((G(z, c), c)\). Specifically, the objective \(V(G, D)\) of a CGAN can be formulated as follows:

$$\begin{aligned} \begin{aligned} \min _{G}\max _{D}V(G,D&)={\mathbb {E}}_{x\sim p_{data}(x)}\left[ \log D(x, c)\right] +\\&{\mathbb {E}}_{z\sim p_{z}(z),c\sim p_{c}(c)}\left[ \log (1 - D(G(z, c), c)) \right] \end{aligned} \end{aligned}$$
(1)

where \(p_{c}(c)\) is the distribution of c.
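
To make the objective in Eq. (1) concrete, the following is a minimal PyTorch-style sketch of its two sides; it is an illustration rather than the training objective used by Fusion-S2iGan (which relies on the hinge and MA-GP losses of Sect. 4.3). It assumes that D returns logits and that the non-saturating generator loss replaces the second term of Eq. (1), as is common in practice.

```python
import torch
import torch.nn.functional as F

def cgan_losses(D, G, x, c, z):
    """Illustrative losses for Eq. (1); D(x, c) is assumed to return a logit."""
    fake = G(z, c)
    real_logits = D(x, c)
    fake_logits = D(fake.detach(), c)
    # Discriminator side: maximize log D(x, c) + log(1 - D(G(z, c), c))
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    # Generator side: non-saturating variant of minimizing log(1 - D(G(z, c), c))
    gen_logits = D(fake, c)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```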

The goal of speech-to-image generation is to yield perceptually plausible samples that are semantically correlated with the linguistic content of input spoken descriptions. Mathematically, let \(\{ (I_{i}, S_{i})\}_{i=1}^{n}\) represent a suite of n image-spoken caption pairs for training, where \(I_{i}\) indicates an image and \(S_{i}=(s_{i}^{1}, s_{i}^{2},..., s_{i}^{k})\) refers to a set of k speech-audio descriptions of \(I_{i}\). The generator of a speech-to-image GAN model aims to synthesize a high-resolution and semantically related picture \({\hat{I}}_{i}\) on the basis of a speech signal \(s_{i}\) randomly picked from \(S_{i}\). In the meantime, the discriminator is trained to separate the real image-speech pair \((I_{i}, s_{i})\) from the fake image-speech pair \(({\hat{I}}_{i}, s_{i})\).

3.2 Scaled dot-product attention

In general, an attention function takes a query \(q_{i}\in {\mathbb {R}}^{d}\) and a set of key-value pairs as the inputs, and then generates a weighted sum of the values. The weight assigned to each value is determined by a softmax function applied to the dot product between the query and all keys. The method for computing attention weights can be defined as follows:

$$\begin{aligned} Attention(q_{i},K,V)=Softmax\left(\frac{D(q_{i}, K^{T})}{\sqrt{d}}\right)V \end{aligned}$$
(2)

where \(D(\cdot )\) represents the dot-product operation and Softmax denotes the softmax function. \(K=(k_{1}, k_{2},...,k_{n}), k_{i}\in {\mathbb {R}}^{d}\), and \(V=(v_{1}, v_{2},...,v_{n}), v_{i}\in {\mathbb {R}}^{d}\) are the keys and the values, respectively. \(\frac{1}{\sqrt{d}}\) refers to the scaling factor.
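
As an illustration, a minimal PyTorch sketch of Eq. (2) for a single query is given below; the tensor shapes in the usage example are assumptions chosen for the sake of the demonstration.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, K, V):
    """Eq. (2): q has shape (d,), K and V have shape (n, d)."""
    d = q.shape[-1]
    scores = K @ q / d ** 0.5            # dot products between the query and all keys, scaled by 1/sqrt(d)
    weights = F.softmax(scores, dim=-1)  # attention weights over the n key-value pairs
    return weights @ V                   # weighted sum of the values, shape (d,)

q, K, V = torch.randn(64), torch.randn(10, 64), torch.randn(10, 64)
out = scaled_dot_product_attention(q, K, V)   # out.shape == (64,)
```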

4 The proposed approach

In this section, we discuss the overall architecture of Fusion-S2iGan, which can be seen in Fig. 2. After that, a module called the visual+speech fusion module (VSFM) is presented as a means of efficiently and effectively combining visual feature maps with the global speech embedding, resulting in high-resolution images. Subsequently, we describe the loss functions used to train our network.

Fig. 2

The overall framework of the proposed Fusion-S2iGan. FC is a dense layer, Conv is a convolutional layer, ReLU is a ReLU activation function and BN is a Batch-Normalization operation. Additionally, Residual Block and Fusion Module refer to the presented dual-residual speech-visual fusion block (see (c)) discussed in Sect. 4.1.1 and visual+speech fusion module (see (d)) discussed in Sect. 4.2, respectively. Furthermore, PAM, SMM and WFM in (d) represent the pixel-attention module, speech-modulation module and weighted-fusion module, respectively, discussed in Sect. 4.2. Note that we do not plot the up-sample layers between Residual Blocks in (a) due to the limited space

4.1 Overall architecture

The overall framework of Fusion-S2iGan for a speech-to-image transform is presented in Fig. 2. The architecture only consists of a generator and a discriminator. Next, we introduce these two components one by one.

4.1.1 Generator

The generator is able to project a speech signal into a photo-realistic and semantically consistent picture, as shown in Fig. 2a. More specifically, the generator network is composed of a dense layer transforming a latent code into the initial feature map and seven dual-residual speech-visual fusion blocks modulating the visual feature map with the spoken vector derived from a speech encoder. The speech encoder is employed to learn the semantic representation and conceptual meaning of a given spoken description that capture the discriminative visual details [35].

The designed dual-residual speech-visual fusion block (see Fig. 2c) contains two effective and efficient visual+speech fusion modules (VSFM) (see Fig. 2d) as well as a suite of ReLU activation functions and convolutional layers. Each VSFM comprises Batch Normalization (BN) [36], a pixel-attention module (PAM), a speech-modulation module (SMM) and a weighted-fusion module (WFM). The dual-residual speech-visual fusion block allows us to easily enhance model capacity by effectively increasing the number of layers, while also stabilizing the training process by maintaining more original features than cascade structures. Benefiting from such dual-residual speech-visual fusion blocks, the generator network has the ability to yield high-quality pictures. The process of synthesizing photographs on the basis of a speech signal is formulated as follows:

$$\begin{aligned}&h_{0}=F_{0}(z) \end{aligned}$$
(3)
$$\begin{aligned}&h_{1}=F_{1}^{Dual}(h_{0},s) \end{aligned}$$
(4)
$$\begin{aligned}&h_{i}=F_{i}^{Dual}(h_{i-1}\uparrow ,s) \quad \text {for} \quad i=2,3,...,7 \end{aligned}$$
(5)
$$\begin{aligned}&o=G_{c}(h_{7}) \end{aligned}$$
(6)

where z is a latent vector randomly sampled from the normal distribution, \(F_{0}\) is a dense layer, s is the speech embedding from a speech encoder, \(F_{i}^{Dual}\) is the presented dual-residual speech-visual fusion block and \(G_{c}\) is the last convolutional layer. The details of the proposed visual+speech fusion module will be discussed in Sect. 4.2.
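
For concreteness, a simplified PyTorch-style sketch of Eqs. (3)–(6) is shown below. The `SimpleFusionBlock` is only a stand-in (Batch Normalization plus a speech-conditioned scale and shift with a residual connection) for the full dual-residual speech-visual fusion block of Fig. 2c, and the latent dimension and channel width are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusionBlock(nn.Module):
    """Simplified stand-in for the dual-residual speech-visual fusion block of Fig. 2c:
    BN plus a speech-conditioned scale/shift and a residual connection (the real block
    uses two VSFMs, detailed in Sect. 4.2)."""
    def __init__(self, ch, s_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(ch)
        self.affine = nn.Linear(s_dim, 2 * ch)   # speech embedding -> per-channel scale and shift
        self.conv = nn.Conv2d(ch, ch, 3, 1, 1)

    def forward(self, h, s):
        scale, shift = self.affine(s).chunk(2, dim=1)
        out = self.bn(h) * scale[..., None, None] + shift[..., None, None]
        return h + self.conv(F.relu(out))        # residual connection

class Generator(nn.Module):
    """Sketch of Eqs. (3)-(6): a dense layer F_0, seven fusion blocks F_1..F_7 with
    up-sampling in between, and a final convolution G_c producing a 256x256 image."""
    def __init__(self, z_dim=100, s_dim=1024, ch=256):
        super().__init__()
        self.fc = nn.Linear(z_dim, ch * 4 * 4)                                  # Eq. (3)
        self.blocks = nn.ModuleList([SimpleFusionBlock(ch, s_dim) for _ in range(7)])
        self.to_rgb = nn.Conv2d(ch, 3, 3, 1, 1)                                 # G_c in Eq. (6)

    def forward(self, z, s):
        h = self.fc(z).view(z.size(0), -1, 4, 4)                                # h_0
        h = self.blocks[0](h, s)                                                # Eq. (4)
        for block in self.blocks[1:]:                                           # Eq. (5)
            h = block(F.interpolate(h, scale_factor=2), s)
        return torch.tanh(self.to_rgb(h))
```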

4.1.2 Discriminator

The architecture of the discriminator network comprises an image encoder, a pre-trained speech encoder and a conditional discriminator network, as depicted in Fig. 2b. The choice of the hyper-parameters in the discriminator is based on our earlier work [3], but textual features are replaced with speech embeddings. To be specific, the image encoder is constructed with one convolutional layer (stride 1, kernel size 3, padding 1) followed by six successive residual blocks. Each block consists of two convolutional layers: the first layer (stride 2, kernel size 4, padding 1) halves the spatial size of the input feature map and the second one (stride 1, kernel size 3, padding 1) further distills image features. After each convolutional layer, Leaky-ReLU activation [37] with a slope of 0.2 is utilized to help the training. The numbers of filters for the residual blocks are 64, 128, 256, 512, 1024 and 1024, respectively. In order to stabilize the learning, we incorporate a residual connection into each block. During training, a picture is fed into the image encoder to extract the image features of size (1024, 4, 4), which are combined with the spoken vector from the speech encoder into a joint embedding with a dimension of (2048, 4, 4). After that, the joint features are passed through the conditional discriminator to obtain the final conditional score, which is utilized to determine whether the input image-spoken caption pair is real or fake.
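
A rough PyTorch-style sketch of this discriminator is given below; the width of the stem convolution, the 1×1 projection used in the skip path and the exact head of the conditional discriminator are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    """Residual down-sampling block: a stride-2 4x4 conv halves the spatial size,
    a stride-1 3x3 conv further distills the features (Leaky-ReLU slope 0.2)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 4, 2, 1)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, 1, 1)
        self.skip = nn.Conv2d(c_in, c_out, 1)   # 1x1 projection for the residual path (an assumption)

    def forward(self, x):
        h = F.leaky_relu(self.conv1(x), 0.2)
        h = F.leaky_relu(self.conv2(h), 0.2)
        return h + F.avg_pool2d(self.skip(x), 2)

class Discriminator(nn.Module):
    """Image encoder (six residual down-sampling blocks) plus a conditional head."""
    def __init__(self, s_dim=1024, stem_ch=32):
        super().__init__()
        chs = [64, 128, 256, 512, 1024, 1024]
        self.stem = nn.Conv2d(3, stem_ch, 3, 1, 1)
        self.blocks = nn.Sequential(*[DownBlock(c_in, c_out)
                                      for c_in, c_out in zip([stem_ch] + chs[:-1], chs)])
        self.cond = nn.Sequential(               # conditional discriminator head (assumed layout)
            nn.Conv2d(1024 + s_dim, 1024, 3, 1, 1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(1024, 1, 4, 1, 0),
        )

    def forward(self, img, s):
        feat = self.blocks(self.stem(img))                 # (B, 1024, 4, 4) for a 256x256 input
        s_map = s[:, :, None, None].expand(-1, -1, 4, 4)   # replicate the speech vector spatially
        return self.cond(torch.cat([feat, s_map], dim=1))  # conditional real/fake score
```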

4.2 Visual+speech fusion module

It is widely known that the semantic relationships between the visual content of photographs and the corresponding conditional contexts, e.g., class labels and textual descriptions, play a significant role in an image-generation process. However, this correlation is more complicated for speech-to-image synthesis, since the speech signal is long and continuous, lacking word boundaries, i.e., the spaces in natural-language descriptions. It is thus very difficult to model the affinities between the linguistic content of spoken descriptions and the visible objects. Moreover, word-level modulation modules fail to fuse spoken information and visual feature maps due to the continuity of speech signals. In this case, a crucial problem presents itself: how to effectively inject the information from a spoken probe into a neural network which has a large 2D color image as its output?

The input speech signal needs to be converted to the global speech vector similar to the sentence embedding [3] and then can be made to modulate feature maps in the network. In this section, we develop an effective and efficient visual+speech fusion module (VSFM) to facilitate the visual feature maps with the global speech embedding and yield high-resolution pictures.

4.2.1 Overall architecture

The framework of the proposed visual+speech fusion module is shown in Fig. 2d. Given an intermediate feature map \(F\in {\mathbb {R}}^{C\times H\times W}\) and the speech vector \(s \in {\mathbb {R}}^{D}\) from a speech encoder as inputs, the process of the VSFM is composed of three steps:

  1. The VSFM first applies Batch Normalization (BN) [36] on F, acquiring a new feature map \(F{}'\in {\mathbb {R}}^{C\times H\times W}\). This may help stabilize the learning of the conditional generative adversarial network (cGAN) and accelerate the training process.

  2. \(F{}'\in {\mathbb {R}}^{C\times H\times W}\) is refined by the pixel-attention module (PAM) \(M_{p}\) and the speech-modulation module (SMM) \(M_{s}\), respectively.

  3. The VSFM effectively fuses their outputs \(M_{p}(F{}')\in {\mathbb {R}}^{C\times H\times W}\) and \(M_{s}(F{}', s)\in {\mathbb {R}}^{C\times H\times W}\) using the weighted-fusion module (WFM) \(M_{f}\). Meanwhile, a residual connection is employed to obtain the final enhanced result \(F{}'' \in {\mathbb {R}}^{C\times H\times W}\).

Note that we spread the bimodal information of the VSFM over all layers of the generator network. This allows for an influence of the speech over features at various hierarchical levels in the architecture, from crude early features to abstract late features. The overall process of the VSFM can be formulated as follows:

$$\begin{aligned}&F{}'=\textrm{BN}(F) \end{aligned}$$
(7)
$$\begin{aligned}&F{}''=F{}'+M_{f}(M_{p}(F{}'), M_{s}(F{}', s)) \end{aligned}$$
(8)

where BN indicates a Batch-Normalization operation. The details of PAM, SMM and WFM will be described in the following subsections.
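
The composition in Eqs. (7) and (8) can be sketched as follows; the PAM, SMM and WFM sub-modules are passed in as arguments here and are detailed, with their own sketches, in the following subsections.

```python
import torch.nn as nn

class VSFM(nn.Module):
    """Sketch of Eqs. (7)-(8): BN, then PAM and SMM in parallel, fused by WFM
    and added back through a residual connection."""
    def __init__(self, ch, pam, smm, wfm):
        super().__init__()
        self.bn = nn.BatchNorm2d(ch)
        self.pam, self.smm, self.wfm = pam, smm, wfm

    def forward(self, feat, s):
        f1 = self.bn(feat)                                    # Eq. (7)
        return f1 + self.wfm(self.pam(f1), self.smm(f1, s))   # Eq. (8)
```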

Fig. 3

Overview of the introduced pixel-attention module, which aims to assign larger weights to discriminative and informative pixels. BN and ReLU refer to Batch Normalization and a ReLU activation function, respectively. \(1\times 1\) conv and \(3\times 3\) conv indicate the \(1\times 1\) and \(3\times 3\) convolutional operation, respectively

4.2.2 Pixel-attention module (PAM)

Pictures and photographs consist of visual pixels which are highly important for image quality. Learning the long-range contextual dependency of each position of a picture is essential for producing perceptually plausible samples. However, convolutional operations can only grasp local relationships between spatial contexts and thus fail to ‘see’ the entire image field. Hence, the pixel-attention module (PAM) is introduced to effectively model the spatial affinities between visual pixels and enable crucial and informative positions to receive more attention from the generator.

Figure 3 illustrates the process of the pixel-attention module (PAM). For a feature map \(F{}'\in {\mathbb {R}}^{C\times H\times W}\), we first feed it into a \(3\times 3\) convolutional layer followed by BN and a ReLU function to reduce the channel dimension to \({\mathbb {R}}^{C/r\times H\times W}\). This may integrate and strengthen the visual feature map across the channel and spatial directions. Subsequently, we use a \(1\times 1\) convolutional layer followed by a sigmoid function to process the features and obtain the pixel-attention map \(PA \in {\mathbb {R}}^{1\times H\times W}\). Mathematically,

$$\begin{aligned} PA=\sigma (f_{1}^{1\times 1}(ReLU(\textrm{BN}(f_{0}^{3\times 3}(F{}'))))) \end{aligned}$$
(9)

where f is a convolutional layer, ReLU is the ReLU function and \(\sigma \) is the sigmoid function. We then multiply the original feature map \(F{}'\in {\mathbb {R}}^{C\times H\times W}\) element-wise (broadcasting over the channel dimension) with the pixel-attention map \(PA \in {\mathbb {R}}^{1\times H\times W}\) to acquire the refined result. Specifically,

$$\begin{aligned} M_{p}(F{}')=F{}'\odot PA \end{aligned}$$
(10)

where \(\odot \) is the element-wise multiplication.
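
A possible PyTorch realization of Eqs. (9) and (10) is sketched below; the channel-reduction ratio r is an assumption.

```python
import torch.nn as nn

class PAM(nn.Module):
    """Sketch of Eqs. (9)-(10): a 3x3 conv + BN + ReLU reduces the channels by a factor r,
    then a 1x1 conv + sigmoid produces the one-channel pixel-attention map PA."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(ch, ch // r, 3, 1, 1),
            nn.BatchNorm2d(ch // r),
            nn.ReLU(inplace=True),
        )
        self.to_map = nn.Sequential(nn.Conv2d(ch // r, 1, 1), nn.Sigmoid())

    def forward(self, f1):
        pa = self.to_map(self.reduce(f1))   # PA, shape (B, 1, H, W)
        return f1 * pa                      # Eq. (10): broadcast element-wise multiplication
```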

4.2.3 Speech-modulation module (SMM)

An adequate modulation method needs to be developed to ensure the semantic consistency and quality of synthesized photographs [4]. To this end, the speech-modulation module (SMM) is introduced to inject the features derived from the speech signal into the network at the proper points in the network architecture.

Inspired by [28], SMM facilitates the visual feature map using detailed linguistic cues captured from the speech embedding s. To be specific, we adopt two small multi-layer perceptrons to project s into the linguistic cues \(WA \in {\mathbb {R}}^{C\times 1\times 1}\) and \(BA \in {\mathbb {R}}^{C\times 1\times 1}\). After that, WA and BA are employed to scale and shift \(F{}'\). The process of SMM can be defined as follows:

$$\begin{aligned}&WA=MLP(ReLU(MLP(s))) \end{aligned}$$
(11)
$$\begin{aligned}&BA=MLP(ReLU(MLP(s))) \end{aligned}$$
(12)
$$\begin{aligned}&M_{s}(F{}', s)=F{}'\odot WA+ BA \end{aligned}$$
(13)

where MLP is a fully-connected perceptron layer and \(M_{s}(F{}', s)\) is the output from SMM.
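
A sketch of Eqs. (11)–(13) follows; the hidden width of the two perceptrons is an assumption.

```python
import torch.nn as nn

class SMM(nn.Module):
    """Sketch of Eqs. (11)-(13): two small MLPs map the speech embedding s to the
    per-channel scale WA and shift BA used to modulate the feature map."""
    def __init__(self, ch, s_dim=1024, hidden=256):
        super().__init__()
        self.to_scale = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, ch))
        self.to_shift = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, ch))

    def forward(self, f1, s):
        wa = self.to_scale(s)[..., None, None]   # WA, shape (B, C, 1, 1)
        ba = self.to_shift(s)[..., None, None]   # BA, shape (B, C, 1, 1)
        return f1 * wa + ba                      # Eq. (13)
```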

4.2.4 Weighted-fusion module (WFM)

We do not simply fuse the attended and enhanced visual feature maps from PAM and SMM using an addition or element-wise multiplication, since pixels and speech signals refer to two very different modalities and need to be combined in a more advanced manner. Here, we propose an efficient weighted-fusion module (WFM) to highlight the discriminative and significant regions in an adaptive manner and produce high-quality image samples.

Fig. 4

Overview of the proposed weighted-fusion module, which effectively combines the outputs from the pixel-attention module (PAM) and speech-modulation module (SMM) in an adaptive manner. \(M_{p}(F{}')\) and \(M_{s}(F{}', s)\) denote the result of PAM and SMM, respectively. GAP, MLP and GELU represent the global-average pooling, fully-connected perceptron layer and Gaussian Error Linear Unit (GELU) [38], respectively. \(WF_{1}\) and \(WF_{2}\) are the channel-aware weight matrices. \(M_{f}(F{}', s)\) is the final refined result

The detailed structure of the weighted-fusion module (WFM) is depicted in Fig. 4. For the outputs \(M_{p}(F{}')\) and \(M_{s}(F{}', s)\) from PAM and SMM, respectively, we perform an element-wise addition between them as a first step, acquiring an intermediate feature map \(IF\in {\mathbb {R}}^{C\times H\times W}\). In the next step, IF is used to compute the final weights for both \(M_{p}(F{}')\) and \(M_{s}(F{}', s)\). More specifically, we employ global-average pooling (GAP) to process IF to aggregate holistic and discriminative information, thereby obtaining a channel-feature vector \(IF{}'\in {\mathbb {R}}^{C\times 1\times 1}\). Subsequently, we feed \(IF{}'\) into two fully-connected perceptron layers, in which the first one is to compress and integrate the channel features and the other aims to recover the channel dimension and capture the semantic importance of the image-attention mask and the speech-modulation module at the level of the channels. After that, we apply a softmax function across the channel dimension to get the contextually channel-aware weight matrix \(WF \in {\mathbb {R}}^{C\times 2}\). Afterward, WF is split into two separate channel-wise weight matrices \(WF_{1} \in {\mathbb {R}}^{C\times 1}\) and \(WF_{2} \in {\mathbb {R}}^{C\times 1}\). Mathematically,

$$\begin{aligned}&IF=M_{p}(F{}')+M_{s}(F{}', s) \end{aligned}$$
(14)
$$\begin{aligned}&IF{}'=GAP(IF)\end{aligned}$$
(15)
$$\begin{aligned}&WF=softmax(MLP(GELU(MLP(IF{}')))) \end{aligned}$$
(16)
$$\begin{aligned}&WF=[WF_{1}; WF_{2}] \end{aligned}$$
(17)

where GAP is the global-average pooling and GELU is the Gaussian Error Linear Unit (GELU) [38]. \(WF_{1}\) and \(WF_{2}\) are resized to \({\mathbb {R}}^{C\times 1\times 1}\) and applied to \(M_{p}(F{}')\) and \(M_{s}(F{}', s)\) to obtain the final fusion result, which is denoted as follows:

$$\begin{aligned} M_{f}(F{}', s)=WF_{1}\odot M_{p}(F{}') + WF_{2}\odot M_{s}(F{}', s) \end{aligned}$$
(18)
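
Eqs. (14)–(18) can be sketched as follows. The bottleneck ratio of the MLP is an assumption, and the softmax is taken over the two branches for each channel, which is how we interpret the adaptive weighting described above.

```python
import torch
import torch.nn as nn

class WFM(nn.Module):
    """Sketch of Eqs. (14)-(18): GAP, an MLP with GELU and a softmax produce two
    channel-wise weight vectors WF_1 and WF_2 that re-weight the PAM and SMM outputs."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // r), nn.GELU(), nn.Linear(ch // r, 2 * ch))

    def forward(self, mp, ms):
        mixed = mp + ms                                   # Eq. (14)
        v = mixed.mean(dim=(2, 3))                        # Eq. (15): global-average pooling, (B, C)
        wf = self.mlp(v).view(-1, mp.size(1), 2)          # Eq. (16): weights WF in R^{C x 2}
        wf = torch.softmax(wf, dim=-1)                    # softmax over the two branches (per channel)
        wf1 = wf[..., 0].unsqueeze(-1).unsqueeze(-1)      # Eq. (17): WF_1, resized to (B, C, 1, 1)
        wf2 = wf[..., 1].unsqueeze(-1).unsqueeze(-1)      # WF_2
        return wf1 * mp + wf2 * ms                        # Eq. (18)
```

Instances of PAM, SMM and this WFM can be plugged into the VSFM skeleton of Sect. 4.2.1 to obtain one complete fusion module.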

4.3 Objective function

An adversarial loss is employed to match generated samples to input speech signals. Inspired by [4], we utilize the hinge objective [39] for stable training instead of the vanilla GAN objective. The adversarial loss for the discriminator is defined as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\text {adv}}^{D}=&{\mathbb {E}}_{x\sim p_{\text {data}}}\left[ \text {max}(0,1-D(x,s)) \right] \\&+\frac{1}{2}{\mathbb {E}}_{{\hat{x}}\sim p_{G}}\left[ \text {max}(0,1+D({\hat{x}},s)) \right] \\&+\frac{1}{2}{\mathbb {E}}_{x\sim p_{\text {data}}}\left[ \text {max}(0,1+D(x,{\hat{s}})) \right] \end{aligned} \end{aligned}$$
(19)

where s is a given spoken caption, \({\hat{s}}\) is a mismatched speech-audio description, x is the real image from the distribution \(p_{data}\) and \({\hat{x}}\) is the synthesized sample from the distribution \(p_{G}\).

To enhance the image quality and semantic consistency of produced pictures, we adopt the matching-aware zero-centered gradient penalty (MA-GP) loss [28] for the discriminator, which applies gradient penalty to real images and input spoken descriptions. The MA-GP Loss is formulated as follows:

$$\begin{aligned} {\mathcal {L}}_{\text {M}}={\mathbb {E}}_{x\sim p_{\text {data}}}\left[ (\left\| \nabla _{x}D(x,s) \right\| _{2}+\left\| \nabla _{s}D(x,s) \right\| _{2})^{p}\right] \end{aligned}$$
(20)

For the generator, we apply an adversarial loss and a deep attentional multimodal similarity model (DAMSM) loss [24] to train the network.
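
The discriminator-side losses of Eqs. (19) and (20) and a hinge-style adversarial term for the generator can be sketched as follows; the gradient-penalty exponent p and any loss weights are assumptions in the spirit of the settings of Tao et al. [28], and the DAMSM term is omitted here.

```python
import torch

def d_hinge_loss(d_real, d_fake, d_mismatch):
    """Eq. (19): scores for real pairs, generated pairs and mismatched pairs."""
    return (torch.relu(1.0 - d_real).mean()
            + 0.5 * torch.relu(1.0 + d_fake).mean()
            + 0.5 * torch.relu(1.0 + d_mismatch).mean())

def ma_gp_loss(D, x, s, p=6):
    """Eq. (20): gradient penalty on real images and their matched speech embeddings."""
    x, s = x.requires_grad_(True), s.requires_grad_(True)
    out = D(x, s)
    grad_x, grad_s = torch.autograd.grad(out.sum(), (x, s), create_graph=True)
    grad_norm = grad_x.flatten(1).norm(2, dim=1) + grad_s.flatten(1).norm(2, dim=1)
    return grad_norm.pow(p).mean()

def g_hinge_loss(d_fake):
    """Hinge-style adversarial term for the generator (the DAMSM loss is omitted here)."""
    return -d_fake.mean()
```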

5 Experiments

In this section, a variety of quantitative and qualitative evaluations are conducted to validate the effectiveness of the presented Fusion-S2iGan in yielding visually realistic and semantically consistent images. The evaluations are carried out on two synthesized spoken caption-image data sets, namely Oxford-102 [19] and CUB bird [18], as well as two real spoken caption-image data sets, namely Flickr8k [20] and Places-subset [21]. Specifically, we describe the details of experimental settings in Sect. 5.1. Afterward, our proposed Fusion-S2iGan is compared with previous CGAN-based approaches for text-to-image generation using the synthesized data sets in Sect. 5.2. In Sect. 5.3, a similar comparison is performed using the real speech data sets. Subsequently, an analysis of the individual contributions from different components of our Fusion-S2iGan is conducted in Sect. 5.4.

5.1 Experimental settings

Datasets To evaluate the proposed Fusion-S2iGan, we perform extensive experiments on two synthesized spoken caption-image data sets and two real spoken caption-image data sets, which are employed by Li et al. [10] and S2IGAN [9].

  • CUB bird [18]. The CUB data set includes a total of 11,788 images, which is divided into 8,855 training pictures and 2,933 testing pictures. Each picture is accompanied by 10 natural-language descriptions. To evaluate the task of speech-to-image generation, Li et al. [10] and S2IGAN [9] transform textual descriptions to spoken captions utilizing a text-to-speech method such as Tacotron 2 [40].

  • Oxford-102 [19]. The Oxford-102 data set is composed of 8,189 pictures, in which 5,878 pictures belong to the training set and the other 2,311 pictures are used for testing. Each picture contains 10 textual descriptions, which are transformed to speech signals in the same way as for the CUB data set.

  • Flickr8k [20]. The Flickr8k data set is a more challenging data set comprising 8,000 scene pictures and each image is paired with 5 real spoken descriptions collected by [41]. We split the Flickr8k data set according to S2IGAN [9].

  • Places-subset [21]. The Places-subset data set is a subset of the Places Audio Caption data set [42, 43], which encompasses real spoken captions of pictures from the Places 205 data set [21]. It contains a total of 13,803 image-spoken caption pairs belonging to 7 categories, which are divided into 10,933 training pairs and 2,870 testing pairs.

In summary, two synthesized spoken caption-image data sets and two real spoken caption-image data sets are adopted to evaluate the proposed Fusion-S2iGan.

Implementation details For the speech encoder, following the structure of S2IGAN [9], we use two 1-D convolution blocks, two bidirectional gated recurrent units (GRUs) and a self-attention module to obtain the speech embedding. To better capture the semantic representation of spoken descriptions on the Places-subset data set, we replace the Inception-V3 [44] pre-trained on ImageNet [45] in S2IGAN with the ResNet [46] trained on Places 205 [21]. Following the settings of S2IGAN [9], the dimension of the speech vector is set to 1024. The training of Fusion-S2iGan is based on our earlier work [3]. Specifically, we utilize the Adam optimizer [47] with \(\beta _{1}=0.0\) and \(\beta _{2}=0.9\) to train the networks. Furthermore, we follow the two timescale update rule (TTUR) [48] and set the learning rates for the generator and the discriminator to 0.0001 and 0.0004, respectively. We set the batch size to 32. We implement Fusion-S2iGan in PyTorch [49]. All the experiments are conducted on a single NVIDIA Tesla V100 GPU (32 GB memory).
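
For reference, a rough PyTorch sketch of such a speech encoder is given below; the mel-spectrogram input and all kernel-size, stride and width choices are illustrative assumptions and not the exact S2IGAN configuration.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Rough sketch of the S2IGAN-style speech encoder described above: two 1-D
    convolution blocks, two bidirectional GRU layers and self-attention pooling."""
    def __init__(self, n_mels=40, hidden=512, out_dim=1024):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=6, stride=2, padding=2),
            nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, kernel_size=6, stride=2, padding=2),
            nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        )
        self.gru = nn.GRU(hidden, out_dim // 2, num_layers=2, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(out_dim, 1)        # scores for self-attention pooling over time

    def forward(self, mel):                      # mel: (B, n_mels, T)
        h = self.convs(mel).transpose(1, 2)      # (B, T', hidden)
        h, _ = self.gru(h)                       # (B, T', out_dim)
        w = torch.softmax(self.attn(h), dim=1)   # temporal attention weights
        return (w * h).sum(dim=1)                # global speech embedding s, shape (B, out_dim)
```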

Evaluation metrics We verify the effectiveness of Fusion-S2iGan by computing four widely employed evaluation metrics, i.e., the inception score (IS) [50], the Fréchet inception distance (FID) [44], the mean average precision (mAP) and R(ecall)@50 [9], which are also used by S2IGAN [9]. The first two assess the diversity and perceptual quality, and the last two measure the semantic relevancy between synthesized samples and their spoken descriptions.

  • Inception score (IS) [50]. The IS is obtained by computing the KL divergence between the conditional class distribution and the marginal class distribution. The synthesized pictures are divided into multiple groups and the IS is calculated on each group of photographs, then the average and standard deviation of the score are reported. Higher IS demonstrates better quality and diversity among the generated images [4].

  • Fréchet inception distance (FID) [44]. The FID calculates the Fréchet distance between the distribution of generated images and the distribution of true data. A lower FID score means that the generated pictures are closer to the corresponding real pictures.

  • Mean average precision (mAP) and R(ecall)@50 [9]. The mAP and R(ecall)@50 are speech-image retrieval metrics, introduced by S2IGAN to measure the semantic relevancy for the synthesized and real speech data sets, respectively. Higher mAP or R@50 suggest better semantic consistency.

It should be noted that standard deviation on performances cannot be given for all metrics reported (FID and mAP). This is due to the fact that we wanted the measurements to be comparable with other studies. However, given the size of the test sets: \(N=30k\) for the CUB data set and \(N=10k\) for the Oxford-102 data set, we expect the measurements to be fairly reliable. As a comparative illustration: In case of a measured accuracy of 0.80 and a data set of \(N=30k\), at \(\alpha =0.01\), the confidence band would span 0.794 to 0.806, i.e., a deviation of just 0.75%.
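
As a quick check of this figure under the usual normal approximation to the binomial confidence interval: the half-width of the band is \(z_{0.995}\sqrt{p(1-p)/N}=2.576\times \sqrt{0.8\cdot 0.2/30{,}000}\approx 0.006\), which indeed yields the stated interval from 0.794 to 0.806.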

Table 1 The IS, FID and mAP of previous text-to-image and speech-to-image approaches and Fusion-S2iGan on the CUB and Oxford-102 data sets

5.2 Results on the synthesized speech datasets

5.2.1 Quantitative results

We compare Fusion-S2iGan with previous single-stage [4, 28, 53] and multi-stage [9, 10, 14, 22, 24, 25, 51, 52] cGAN-based methods in text-to-image generation and speech-to-image transforms on the CUB and Oxford-102 data sets. The IS, FID and mAP of Fusion-S2iGan and the other compared approaches on the CUB and Oxford-102 data sets are shown in Table 1. Note that the reported scores in Table 1 are based on the results presented in these publications. We can see that Fusion-S2iGan achieves the best scores on speech-to-image synthesis, significantly improving the IS from 4.29 to 4.82 on the CUB data set and from 3.69 to 3.81 on the Oxford-102 data set, reducing the FID from 14.50 to 13.74 on the CUB data set and from 48.64 to 40.08 on the Oxford-102 data set and increasing the mAP from 9.04 to 11.49 on the CUB data set and from 13.40 to 17.72 on the Oxford-102 data set. Notably, Fusion-S2iGan in speech-to-image generation even outperforms DM-GAN [14] doing text-to-image generation on the CUB data set (improving the IS from 4.75 to 4.82) and DTGAN [3] doing text-to-image synthesis on the Oxford-102 data set (improving the IS from 3.77 to 3.81). The experimental results suggest that Fusion-S2iGan is capable of yielding perceptually plausible pictures with higher quality and better diversity than state-of-the-art speech-to-image transform models and many text-to-image approaches.

Fig. 5

Qualitative comparison of AttnGAN [24], DTGAN [3] conditioned on the textual descriptions, S2IGAN [9] and Fusion-S2iGan on the basis of the speech signals on the CUB and Oxford-102 data sets. The spoken descriptions are shown above the images

5.2.2 Qualitative results

In addition to the quantitative experiments, we carry out a qualitative comparison on the CUB and Oxford-102 data sets, which is presented in Fig. 5. Note that in order to visualize the synthesized images of S2IGAN [9], we utilize the public code (Footnote 1) to train the network from scratch due to the lack of available pre-trained models. It was noted that the contrast of the generated images was dull for the S2IGAN code, as given. By enhancing brightness and contrast, this problem could be solved. However, we found this is only needed for human evaluation: On 20 selected images (Footnote 2), the IS score for S2IGAN was the same for the raw (grayish) output and the enhanced output samples. As can be seen in Fig. 5a, the shape of the birds produced by AttnGAN [24] and S2IGAN [9] (2nd, \(3\textrm{rd}\), \(6\textrm{th}\) and \(8\textrm{th}\) column) is strange, the image details are lost (\(2\textrm{nd}\), \(3\textrm{rd}\), \(5\textrm{th}\), \(6\textrm{th}\) and \(7\textrm{th}\) column) and the backgrounds are blurry (\(1\textrm{st}\), \(3\textrm{rd}\), \(6\textrm{th}\) and \(8\textrm{th}\) column). However, Fusion-S2iGan yields clearer and more photo-realistic samples than AttnGAN and S2IGAN, which demonstrates the effectiveness of our architecture. For example, the birds synthesized by Fusion-S2iGan have a clearer shape and richer color distributions compared to AttnGAN and S2IGAN in the \(2\textrm{nd}\), \(3\textrm{rd}\), \(6\textrm{th}\), \(7\textrm{th}\) and \(8\textrm{th}\) column. Furthermore, as shown in the \(1\textrm{st}\), \(2\textrm{nd}\), \(4\textrm{th}\), \(6\textrm{th}\) and \(8\textrm{th}\) column, Fusion-S2iGan produces visually plausible birds with more vivid details than AttnGAN and S2IGAN.

The qualitative results of DTGAN [3], S2IGAN [9] and Fusion-S2iGan on the Oxford-102 data set are illustrated in Fig. 5b, suggesting that Fusion-S2iGan is able to adopt a single generator/discriminator pair to produce high-resolution pictures that correspond well to the given speech-audio descriptions. For instance, Fusion-S2iGan yields perceptually realistic flowers with a more vivid shape than DTGAN and S2IGAN in the \(1\textrm{st}\), \(2\textrm{nd}\), \(6\textrm{th}\), \(7\textrm{th}\) and \(8\textrm{th}\) column. In addition, as shown in the \(3\textrm{rd}\), \(4\textrm{th}\), \(5\textrm{th}\) and \(6\textrm{th}\) column, the flowers generated by Fusion-S2iGan have clearer details and richer color distributions than DTGAN and S2IGAN.

The above experimental results indicate that Fusion-S2iGan equipped with the dual-residual speech-visual fusion blocks has the ability to effectively inject the speech information into the generator network and generate high-quality photographs.

5.2.3 Human evaluation

To evaluate the image quality and semantic consistency of StackGAN++ [22], S2IGAN [9] and our Fusion-S2iGan, we conduct a human test on the CUB and Oxford-102 data sets. We selected 100 pictures randomly from each data set. Then, we provided the same spoken captions to users and asked them to choose the best sample synthesized by the three approaches based on the image details and the corresponding speech description. In addition, we computed the final scores based on the judgment of two judges to ensure fairness. The results are shown in Fig. 6. We can see that our presented Fusion-S2iGan performs better than StackGAN++ and S2IGAN on both data sets, particularly on the CUB data set. This indicates the effectiveness of our approach.

Fig. 6

Human test results (ratio of 1st) of StackGAN++ [22], S2IGAN [9] and our Fusion-S2iGan on the CUB and Oxford-102 data sets

Table 2 The IS, FID and R@50 of AttnGAN [24] based on the textual descriptions, and of Li et al. [10], StackGAN++ [22], S2IGAN [9] and Fusion-S2iGan conditioned on the spoken captions, on the Flickr8k and Places-subset data sets

5.3 Results on the real speech datasets

In addition to the synthesized speech data sets, we also evaluate Fusion-S2iGan on the challenging real speech data sets, i.e., Flickr8k and Places-subset data sets, from both quantitative and qualitative perspectives.

5.3.1 Quantitative results

Table 2 reports the IS, FID and R@50 of Fusion-S2iGan and the other compared approaches on the Flickr8k and Places-subset data sets. It can be observed that Fusion-S2iGan performs better than S2IGAN by notably enhancing the IS from 8.72 to 11.70 on the Flickr8k data set and from 4.04 to 5.05 on the Places-subset data set, reducing the FID from 93.29 to 70.80 on the Flickr8k data set and from 42.09 to 25.68 on the Places-subset data set and improving the R@50 from 16.40 to 34.95 on the Flickr8k data set and from 12.95 to 28.37 on the Places-subset data set. Notably, Fusion-S2iGan obtains a remarkably lower FID in speech-to-image synthesis than AttnGAN doing text-to-image generation on both data sets, which demonstrates that the distribution of the samples synthesized by our model is closer to the real data distribution. Specifically, Fusion-S2iGan reduces the FID from 84.08 to 70.80 on the Flickr8k data set and from 35.59 to 25.68 on the Places-subset data set while increasing the IS from 4.59 to 5.05 on the Places-subset data set. We can also see that Fusion-S2iGan gets a lower R@50 score than AttnGAN on both data sets. The reason for this may be that speech signals lack regular breaks between words, i.e., the spaces in written text.

Fig. 7

Qualitative comparison of S2IGAN [9] and Fusion-S2iGan conditioned on the speech signals on the Flickr8k and Places-subset data sets. Ground-truth images are shown above the samples produced by S2IGAN and the spoken descriptions are presented above the ground-truth images

5.3.2 Qualitative results

For qualitative comparison, we visualize the pictures produced by Fusion-S2iGan and S2IGAN and the corresponding ground-truth images on the Flickr8k and Places-subset data sets in Fig. 7, which suggests that Fusion-S2iGan is capable of yielding photo-realistic and semantically consistent photographs when given speech signals. For example, in terms of complex-scene generation, Fusion-S2iGan generates visually plausible dogs with more vivid details and clearer backgrounds than S2IGAN in the \(1\textrm{st}\), \(5\textrm{th}\), \(6\textrm{th}\) and \(8\textrm{th}\) column. We can also observe that Fusion-S2iGan synthesizes a man surfing on realistic sea waves (\(2\textrm{nd}\) column), two clear persons standing on the beach (\(3\textrm{rd}\) column), a plausible football player in red and a reasonable background (\(4\textrm{th}\) column) and a running girl (\(4\textrm{th}\) column), whereas S2IGAN yields unclear objects (\(1\textrm{st}\), \(2\textrm{nd}\), \(4\textrm{th}\), \(5\textrm{th}\), \(6\textrm{th}\) and \(8\textrm{th}\) column) and blurry backgrounds (\(4\textrm{th}\) and \(7\textrm{th}\) column). Furthermore, it can be seen that the number of objects produced by Fusion-S2iGan is correct in the \(1\textrm{st}\), \(2\textrm{nd}\), \(3\textrm{rd}\), \(5\textrm{th}\), \(6\textrm{th}\), \(7\textrm{th}\) and \(8\textrm{th}\) column.

As can be seen in Fig. 7b, Fusion-S2iGan produces promising and high-quality complex-scene pictures, i.e., a realistic bedroom (\(1\textrm{st}\) column), plausible kitchens (\(2\textrm{nd}\) and \(8\textrm{th}\) column), high-quality living rooms (\(3\textrm{rd}\) and \(5\textrm{th}\) column), visually promising dining rooms (\(4\textrm{th}\) and \(7\textrm{th}\) column) and a photo-realistic hotel room (\(6\textrm{th}\) column), although the Places-subset data set is very challenging. However, some examples generated by S2IGAN (\(1\textrm{st}\), \(4\textrm{th}\), \(6\textrm{th}\), \(7\textrm{th}\) and \(8\textrm{th}\) column) are not plausible. For instance, the shape of the beds is not clear (\(1\textrm{st}\) and \(6\textrm{th}\) column) and the color distribution is rough (\(4\textrm{th}\) and \(8\textrm{th}\) column).

The above analysis indicates that Fusion-S2iGan has the capacity to achieve very good results on the real speech data sets due to the effective speech-visual fusion module and loss functions.

5.3.3 Human evaluation

To further assess the image quality and semantic consistency of StackGAN++, S2IGAN and our proposed Fusion-S2iGan, we perform a human test on the Flickr8k and Places-subset data sets using the identical process as in Sect. 5.2.3. Figure 8 shows that our approach performs significantly better than StackGAN++ and S2IGAN on both data sets, demonstrating the superiority of our proposed Fusion-S2iGan.

Fig. 8

Human test results (ratio of 1st) of StackGAN++ [22], S2IGAN [9] and our Fusion-S2iGan on the Flickr8k and Places-subset data sets

5.4 Ablation/substitution tests of the proposed approach

In order to evaluate the effectiveness of the different components in Fusion-S2iGan, we perform a series of substitution tests on the CUB, Oxford-102, Flickr8k and Places-subset data sets. Ablation usually refers to the removal of processing steps and evaluating the effects on performance. Here, however, the word ‘substitution’ is more appropriate, referring to the in-place replacement of a network module by another variant. In the evaluation we used 30k sample images for the CUB data set and 10k samples for the Oxford-102 data set in order to obtain reliable performance estimates.

Table 3 Substitution study on pixel-attention modules
Table 4 Substitution test on fusion methods
Table 5 Substitution study on the visual+speech fusion module (VSFM)
Table 6 Effect of the number of fusion modules in the dual-residual speech-visual fusion block

5.4.1 Effectiveness of the pixel-attention module (PAM)

To validate the effectiveness of the introduced pixel-attention module (PAM), we investigate the performance of Fusion-S2iGan with other attention modules on the CUB and Oxford-102 data sets. Specifically, we replace the PAM in the visual+speech fusion module (VSFM) using the bottleneck attention module (BAM) [54], polarized self-attention (PolarizedAttn) [55] and contextual transformer attention (CoTAttn) [56], respectively. The comparison results are depicted in Table 3. We can see that PAM outperforms BAM, PolarizedAttn and CoTAttn by increasing the IS by 0.12 on the CUB data set and 0.06 on the Oxford-102 data set, reducing the FID by 0.33 on the CUB data set and 1.16 on the Oxford-102 data set and enhancing the mAP by 0.75 on the CUB data set and 0.44 on the Oxford-102 data set. The results demonstrate that PAM can improve the quality and semantic relevancy of produced pictures, relative to the other methods.

5.4.2 Effectiveness of the weighted-fusion module (WFM)

To further prove the benefits of the proposed weighted-fusion module (WFM), we conduct a substitution test on fusion methods. We replace the WFM with the element-wise addition and multiplication operations, respectively. Table 4 reports the quantitative results on the CUB and Flickr8k data sets. It can be observed that the WFM performs better than both the addition and product operations, significantly improving the IS from 4.68 to 4.82 on the CUB data set and from 10.23 to 11.70 on the Flickr8k data set, reducing the FID from 15.25 to 13.74 on the CUB data set and from 78.03 to 70.80 on the Flickr8k data set and increasing the mAP from 11.11 to 11.49 on the CUB data set and the R@50 from 30.82 to 30.95 on the Flickr8k data set. The above analysis suggests the effectiveness of the presented WFM in comparison to plain additive or product-based weighting.

5.4.3 Effectiveness of the visual+speech fusion module (VSFM)

To verify the effectiveness of the developed visual+speech fusion module (VSFM) and to investigate how far current single-stage text-to-image algorithms can be used for speech-to-image generation, we explore the results of Fusion-S2iGan with the sentence-visual fusion modules of DTGAN [3] and DFGAN [28] on the CUB and Oxford-102 data sets. The results of the substitution study are displayed in Table 5. We can see that the VSFM achieves the best scores, notably increasing the IS by 0.42 on the CUB data set and 0.23 on the Oxford-102 data set, reducing the FID by 3.00 on the CUB data set and 8.29 on the Oxford-102 data set and increasing the mAP by 3.29 on the CUB data set and 1.40 on the Oxford-102 data set.

It can also be observed that DTGAN does not obtain promising performance on the speech data sets. The reason behind this result may be that the dual-attention module in DTGAN is not suitable for processing the continuous speech signals and modeling the semantic relationships between vision and speech. The above experimental results indicate that with the VSFM, Fusion-S2iGan is able to yield high-resolution pictures semantically aligning with the input speech-audio descriptions.

5.4.4 Effectiveness of the number of fusion modules in the residual-structure

To evaluate the impact of the number of fusion modules in the dual-residual speech-visual fusion block, we compare the results of one fusion module (single-module) and two fusion modules (dual-module) on the Flickr8k and Places-subset data sets, as shown in Table 6. We can observe that the proposed dual-residual speech-visual fusion block outperforms the block with a single module on both data sets, which validates the effectiveness of the dual-residual structure.

6 Conclusion

In this paper, we propose a unified, novel end-to-end speech-to-image transform architecture named Fusion-S2iGan. Fusion-S2iGan is able to use only a single generator/discriminator pair to project a speech signal directly into a high-quality picture. Fusion-S2iGan adopts a new and effective visual+speech fusion module (VSFM) to modulate the visual feature maps with the speech information and boost the resolution of the produced samples. More significantly, Fusion-S2iGan spreads the bimodal information over all layers of the generative model. This allows for an influence of the speech over features at various hierarchical levels in the architecture, from crude early features to abstract late features. The hinge objective, deep attentional multimodal similarity model (DAMSM) loss and matching-aware zero-centered gradient penalty (MA-GP) loss are introduced to stabilize the training of the conditional generative adversarial network (cGAN). Fusion-S2iGan is evaluated on four public data sets, i.e., CUB bird, Oxford-102, Flickr8k and Places-subset, outperforming current speech-to-image methods. Moreover, we explore whether current text-to-image approaches are effective for a speech-to-image transform. Furthermore, our presented VSFM is a general fusion module and can easily be integrated into current speech-to-image frameworks to improve image quality and semantic consistency. More significantly, our developed architecture overcomes the issues present in existing multi-stage speech-to-image algorithms and can serve as a strong basis for developing better speech-to-image generation models. In the future, we will investigate how to inject the emotional features derived from the speech signal into the speech-to-image synthesis pipeline and how to combine speech signals with other conditional contexts, e.g., sketches, to control the produced pictures. In this work we focused on producing natural-scene pictures from spoken descriptions; however, it is intriguing to study image generation for other domains such as artistic fonts, drawings, cartoons, sketches and paintings, since they play a significant role in humans’ daily life.