1 Introduction

With the widespread adoption of deep learning (DL), substantial progress has been achieved in face recognition and face detection tasks. Further, various forms of face manipulation, such as face synthesis, face swapping, and facial attribute editing, have been studied extensively [1,2,3]. Face recognition is already deployed in practice but still demands attention. Moreover, the COVID-19 crisis has forced many more daily tasks around the world to be digitalized, and some of them are likely to remain digital after the pandemic. Precise face recognition and verification are therefore needed for security in many places. For example, at airports, baggage drops, security screening, and gate check-ins can consume considerable processing time, and face ID can relieve these overcrowded points. Similarly, banks offer customers a face ID feature for logging in, and ATMs use facial features for the withdrawal of money. Face recognition with age progression can be helpful in all of the above applications; an automatic face aging method backed by machine learning could handle large databases and auto-readjust e-records. Further use cases include contactless attendance systems in offices and colleges, passport renewal, electronic customer-retailer business, and cases of child abduction, where a person's biological face appearance changes after many years. It therefore becomes necessary to examine the existing face aging methods and their current state, so that the open challenges of face aging can be solved in the future. This is why face aging is a topic of utmost interest. Based on these considerations, two methods, CycleGAN and AttentionGAN, are compared and evaluated for the face aging application. Both have attracted tremendous attention for image-to-image translation and have been generating remarkable results. To the best of our knowledge, no comparison between these two GANs on the face aging task across different datasets has been reported. In this paper, extensive experiments are conducted that compare CycleGAN and AttentionGAN in their ability to produce plausible, photorealistic face images. Some applications of face aging, or face age progression, and its related research fields are shown in figure 1.

Figure 1

Various applications of face aging and its related research fields.

The main objectives of this paper are:

  1. To compare the CycleGAN and AttentionGAN models on the face aging application and list their merits and demerits.

  2. To evaluate their performance quantitatively and qualitatively on two datasets, CelebA-HQ and FFHQ.

  3. To show the potential and robustness of CycleGAN and AttentionGAN, and to enumerate various perspectives for future directions.

1.1 Face aging

Face aging is the procedure of rendering a face image at a different age, termed the target age, by applying natural aging effects or rejuvenating effects to the given face image. The block diagram of face age progression, or face age synthesis, is shown in figure 2.

Figure 2

Face age synthesis [4].

1.1a Input dataset/target images: The input dataset and target images are pre-processed so that the images fed in are of good quality. At the input, it is also ensured that each image contains only a face, and the face is cropped to enhance focus on it. The numbers of images in the input dataset and the target dataset are kept roughly equal. These are then used for training to learn the pattern of the transition from the given input images to the desired target images.

1.1b Face age synthesis process: This process applies deep learning to the given database. It extracts features from the face and memorizes them for the new transition. The larger the database, the better, because more data provides more variation and repetition for learning and recognizing the features, which in turn helps the model predict accurately. During training, a conversion is learned for every image, and the testing images produce results from that learned training. Testing images thus indicate how effectively the system has been trained on the given dataset, which also depends on the quality and amount of data provided. The testing results then give the transition required from the input image to the target image.

1.1c Synthesized images: Not every transition will have been learned from the training images, so in the testing phase not all outputs are as desired. The amount of training data, the quality of the dataset, the pre-processing steps, and the algorithm together determine the accuracy of the final output.

Every human's aging process is unique. For illustration, the age-progressed faces of Albert Einstein are shown in figure 3. Further, there are four conceptual terms associated with age. Actual age, also termed real age, is the person's true age. Appearance age is the age suggested by the person's visual appearance. Perceived age is the age estimated by human subjects from the visual appearance, and estimated age is the age predicted by a machine based on the visual appearance.

Figure 3

Face Aging of Albert Einstein [4].

There are many causes behind the remarkable changes of human face aging; roughly, the process divides into two stages. The first is primary growth, i.e., from birth through childhood. This stage shows changes mainly in the facial curve, the facial features (eyes, chin, mouth, etc.), and their distribution. This is called shape change or craniofacial growth, with only a slight effect on skin color at this stage. The second stage is adult aging, i.e., growth from adulthood to old age. In this stage the prominent changes are in skin texture and color: wrinkles appear, facial lines become visible, craniofacial changes are minor, and there is a reduction in muscle strength and elasticity [5,6,7]. Thus, in face age progression most of the variation is associated with the face texture.

Some signs of facial skin aging are shown in figure 4. Texture changes include forehead wrinkles; crow's feet wrinkles, or lateral canthal lines; glabellar frown lines, i.e., vertical lines between the brows; under-eye bags or full bags; wrinkles under the eye; nasolabial folds; paler or yellower skin; marionette lines that run from the corners of the mouth to the corners of the chin; and wrinkles above the upper lip. There are also geometrical changes such as brow drop, an increase in vertical lower-eyelid length, nose elongation and tip movement, lower-face ptosis, and chin sagging; the small focal accumulation of fat in the lower cheek overlying the jaw bone is termed a jowl [7].

Figure 4

Facial aging marks on the complete face [7].

Geometrical transformations involve two aspects of the facial feature points. The first is the size of the facial components and their distribution on the face, such as the distances between different facial parts, which contribute to face aging; for example, the distance between the brows and the eyes, or between the nose and the lips. The second is the face contour or face shape. The shape change during age progression is mainly jowl variation, which occurs because of facial skin loss and a noticeable reduction in muscle.

Texture variations are changes in the skin, occurring directly or through age-related changes in muscle and fat. The facial skeleton, which can alter the face geometry, also shows its effect on the skin texture. Thus, texture changes on the human face play a vital role in face age progression.

Further, the face pattern is unique to each person and is linked with several internal factors such as hormonal changes, stress, and ethnicity. External factors such as environmental conditions, geographical conditions, and lifestyle also affect face aging. Some internal and external factors are discussed below.

1.1d Internal factors: Ethnicity describes a particular population in terms of genetic similarities. Facial differences between ethnic groups reflect genetic differences that also show in face age progression. Some age-related facial factors, such as skin color, skin thickness, and natural moisturization [8, 9], can differ from one ethnic group to another. Human skin also shows many gender differences [10]. On average, female skin is thinner than male skin, and males generally develop deeper wrinkles than females. Male skin creases appear at a later age but are more noticeable when they do. Moreover, human skin itself produces hormones [10, 11], which support the growth and biological functioning of the skin and its muscles.

1.1e External factors: The changes in the human face are not caused by internal factors alone; external factors, such as geographical area, gravity, pollution, temperature, and working environment, also play a significant role in face aging. Lifestyle matters greatly as well, including habits such as nutrition, sleep, drugs, exercise, exposure to UV rays, and many more.

Further, face aging approaches can be categorized in various ways. The input can be 2D or 3D face images; a 3D representation provides both shape and texture information and can potentially produce better results than 2D images. As a person grows, the facial changes are observed as texture changes and geometrical or shape changes. The conventional methods for face age progression are prototype-based methods and physical model-based methods, but with advances in research, deep learning-based approaches can deal with larger and more complex datasets and have shown notable results in the face aging field. Figure 5 shows the different categories of face age progression.

Figure 5

Face age progression different categories.

2 Related work

This section reviews the relevant previous work on GANs for image-to-image translation.

2.1 GAN

The GAN framework presented by Goodfellow et al [12] marked a substantial improvement in deep generative models. A GAN comprises two neural networks, a generator and a discriminator, which play a two-player, non-cooperative game against each other: the gain of one network comes at the cost of the other. The generator generates fake images, and the discriminator distinguishes actual images from fake ones. Both models are trained simultaneously. Feedback from the discriminator helps the generator improve its performance; the generator cannot access the real data, while the discriminator sees both real and fake data [13]. The generator and discriminator networks can consist of convolutional layers, as in the Deep Convolutional GAN (DCGAN), or of fully connected layers [14]. In a fully-connected GAN both the generator and the discriminator use fully-connected networks, while in a convolutional GAN a Convolutional Neural Network (CNN) is used in both, although training with CNNs is harder than with fully connected networks. The two networks play the min-max game expressed mathematically in equation (1) as:

$$ \begin{aligned} \min_{G} \max_{D} V(D, G) =&\; E_{p\sim s_{data}(p)} \left[ \log D(p) \right] \\ & + E_{r\sim s_{r}(r)} \left[ \log \left( 1 - D(G(r)) \right) \right], \end{aligned} $$
(1)

where G is the generator, D is the discriminator, \(E_{p\sim s_{data}(p)}[\log D(p)]\) is the log probability that D predicts the real-world data to be genuine, and \(E_{r\sim s_{r}(r)}[\log(1 - D(G(r)))]\) is the log probability that G's generated data is judged not genuine. Here r is the random noise input, p is a real data instance, and s is a probability distribution.
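As a concrete illustration of equation (1), the following is a minimal PyTorch sketch of one alternating training step. The networks G and D, their optimizers, and the tensor shapes are illustrative assumptions, not details from the works compared in this paper; the generator uses the common non-saturating surrogate loss.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, real_p, noise_r, opt_G, opt_D):
    """One alternating update of the min-max game in equation (1)."""
    real_label = torch.ones(real_p.size(0), 1)
    fake_label = torch.zeros(real_p.size(0), 1)

    # Discriminator step: ascend log D(p) + log(1 - D(G(r)))
    opt_D.zero_grad()
    fake_p = G(noise_r).detach()              # block gradients into G
    loss_D = bce(D(real_p), real_label) + bce(D(fake_p), fake_label)
    loss_D.backward()
    opt_D.step()

    # Generator step: non-saturating surrogate for min log(1 - D(G(r)))
    opt_G.zero_grad()
    loss_G = bce(D(G(noise_r)), real_label)
    loss_G.backward()
    opt_G.step()
    return loss_G.item(), loss_D.item()
```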

Alqahtani et al 2019 [15] surveyed various GANs with their specific applications and provided a much broader explanation of GANs in several aspects; the paper also addressed the future challenges for GANs in terms of training and standard evaluation metrics. Goodfellow et al [16] in 2020 gave a wider explanation of GANs: a GAN is a type of artificial intelligence algorithm whose generative models can produce highly realistic images, but, being grounded in game theory, it still poses many challenges, such as the continuing difficulty of training. Given the problems of training and evaluating GANs, wide space remains for the research community to explore. Guo et al 2020 [17] proposed a combined GAN (Com-GAN) and studied improved GANs for generating fundus images; it showed high-quality results in comparison to other generative models, so further research can be done with Com-GAN for image generation and image translation tasks in various fields.

Earlier supervised learning approaches with deep neural networks for face aging [18,19,20] needed paired datasets for training, which are quite difficult to obtain. To overcome this problem, GANs and their variants have been trained for face aging with unpaired datasets. Conditional GANs (cGANs) [21] have been used for face aging with unpaired data [22,23,24,25,26,27,28,29,30], attaining more realistic results than conventional approaches such as prototype methods [31] and physical model-based methods [32].

2.2 CycleGAN

The CycleGAN framework was introduced by Zhu et al in 2017 [33] for image-to-image translation without the need for a paired training database. The mappings from domain P (input) to domain Q (output) and vice versa are learned with the help of cycle consistency losses. The paper's hypothesis was that some underlying associations exist between the two domains, and it combined the two "cyclic losses" with "adversarial losses" on the input and output domains. As shown in figure 6, the framework has two mapping functions, G: P → Q and F: Q → P, and two discriminators, \(D_{Q}\) and \(D_{P}\). The discriminator \(D_{Q}\) pushes the generator G to translate P into newly synthesized output images; \(D_{P}\) and F play the same roles in the opposite direction. The cycle-consistency loss in figure 6, also termed the cyclic loss, helps recover the real input image in its own domain from the generated output image in the other domain. So there are two cyclic losses: one in the forward direction, stated as \(\left( {p \to G\left( p \right) \to F\left( {G\left( p \right)} \right) \approx p} \right)\), and one in the backward direction, expressed as \(\left( {q \to F\left( q \right) \to G\left( {F\left( q \right)} \right) \approx q} \right)\). The main objective consists of two terms: the adversarial losses, which match the distribution of the synthesized images to the data distribution of the target domain, and the cycle consistency loss, which prevents the learned mappings G and F from contradicting each other. Mathematically, the adversarial loss is shown in equation (2) and the cycle consistency loss in equation (3).

Figure 6

CycleGAN model overview [33].

For the mapping function G: P → Q with discriminator \(D_{Q}\), the adversarial loss is expressed as:

$$ \begin{aligned} L_{GAN}(G, D_{Q}, P, Q) =&\; E_{q\sim s_{data}(q)} \left[ \log D_{Q}(q) \right] \\ & + E_{p\sim s_{data}(p)} \left[ \log \left( 1 - D_{Q}(G(p)) \right) \right], \end{aligned} $$
(2)

where G tries to generate images G(p) and \(D_{Q}\) tries to differentiate the synthesized images G(p) from the images q in domain Q. G aims to minimize this objective against an adversary D that attempts to maximize it, i.e., \(\min_{G} \max_{D_{Q}} L_{GAN}(G, D_{Q}, P, Q)\).

Cycle-consistency loss:

$$ \begin{aligned} L_{cyc}(G, F) =&\; E_{p\sim s_{data}(p)} \left[ \left\| F(G(p)) - p \right\|_{1} \right] \\ & + E_{q\sim s_{data}(q)} \left[ \left\| G(F(q)) - q \right\|_{1} \right]. \end{aligned} $$
(3)

With this, the reconstructed image F(G(p)) closely resembles the input image p.

The objective function for optimization, combining the adversarial losses and the cycle-consistency loss, is presented in equations (4), (5), and (6) as:

$$ \begin{aligned} L(G, F, D_{P}, D_{Q}) =&\; L_{GAN}(G, D_{Q}, P, Q) + L_{GAN}(F, D_{P}, Q, P) \\ & + \mathcal{M}\, L_{cyc}(G, F), \end{aligned} $$
(4)
$$ G^{*}, F^{*} = \arg \min_{G, F} \max_{D_{P}, D_{Q}} L(G, F, D_{P}, D_{Q}), $$
(5)
$$ Loss_{complete} = Loss_{adv} + {\mathcal{M}} Loss_{cyc} , $$
(6)

where \(Loss_{adv}\) is the adversarial loss, \(Loss_{cyc}\) is the cyclic loss, and \(\mathcal{M}\) controls the relative importance of the two objectives.
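To make equations (4)-(6) concrete, the following is a hedged PyTorch sketch of the generator-side objective for one batch. It assumes sigmoid-output discriminators and uses the logarithmic adversarial loss of equation (2); the released CycleGAN code uses a least-squares variant instead, so this is a sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn

bce, l1 = nn.BCELoss(), nn.L1Loss()
M = 10.0  # the weight ℳ of the cycle-consistency term in equation (6)

def cyclegan_generator_loss(G, F, D_P, D_Q, p, q):
    """Generator-side objective of equation (4) for one batch (p, q)."""
    fake_q, fake_p = G(p), F(q)
    pred_q, pred_p = D_Q(fake_q), D_P(fake_p)

    # Adversarial terms: G must fool D_Q, F must fool D_P
    loss_adv = bce(pred_q, torch.ones_like(pred_q)) \
             + bce(pred_p, torch.ones_like(pred_p))

    # Cycle-consistency terms of equation (3): F(G(p)) ≈ p, G(F(q)) ≈ q
    loss_cyc = l1(F(fake_q), p) + l1(G(fake_p), q)

    return loss_adv + M * loss_cyc
```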

The CycleGAN model stretches the limits of what is achievable in an unsupervised setting. Its generality has been demonstrated, with outperforming results, on a broad range of applications without paired data. CycleGAN is best at changes of texture and color, whereas geometric changes show little progress [33].

Welander et al 2018 [34] presented a comparison between CycleGAN and UNIT on multi-contrast MR images, showing that CycleGAN produced more visually realistic results. Nanavati et al in 2020 [35] presented a comparative analysis of GANs including SinGAN, cGAN, CycleGAN, and StarGAN, and showed numerically that CycleGAN performed well against the other GANs on several metrics, such as root mean square error (RMSE), universal quality index (UQI), multi-scale structural similarity (MS-SSIM), and visual information fidelity (VIF). Burad et al 2020 [36] compared CycleGAN with progressive growing GAN and reported that progressive growing GAN can be better for medical images because it preserves more detail and produces higher-resolution images than CycleGAN.

2.3 AttentionGAN

AttentionGAN was presented by Tang et al [37]. Although various existing methods for image-to-image translation had produced remarkable results, they still left visual artifacts in the output images because of weak translation of high-level semantic content. AttentionGAN identifies the foreground objects and minimizes changes to the background. As shown in figure 7 (scheme B), AttentionGAN produces attention masks and content masks, which are combined with the generated output to produce high-quality target images. A higher intensity in the attention mask for a particular region indicates a higher contribution to the change. The generality of AttentionGAN has been shown on images from several different applications, with sharper and more realistic results [37].

Figure 7

AttentionGAN framework [37] applied on face aging application.

AttentionGAN proposed two schemes: scheme A and scheme B. In scheme A, the generator focuses on the specific sections of the image responsible for producing a good output expression, such as the eyes and mouth, while other parts, such as hair, glasses, and clothes, remain untouched. Scheme A therefore works best where there is a greater overlapping similarity between the two domains, such as translating one facial expression into another. To overcome the disadvantages of scheme A, scheme B was proposed. Its two generators G and F are each composed of two sub-networks that produce several intermediate attention masks and content masks, eliminating the drawback of scheme A. By generating both foreground and background attention masks, scheme B lets the model alter the foreground while simultaneously preserving the background of a given face image. Mathematically, the content and attention masks are multiplied with the generator output to produce a realistic image, as shown in equation (7):

$$ G\left( p \right) = \mathop \sum \limits_{f = 1}^{n - 1} (C_{q}^{f} *A_{q}^{f} ) + p*A_{q}^{b} , $$
(7)

where n attention masks are produced, the foreground masks \(\{ A_{q}^{f} \}_{f = 1}^{n - 1}\) and one background mask \(A_{q}^{b}\); \(G(p)\) is the generated target face-aged image; P and Q are the two domains; p is the input image; \(C_{q}^{f}\) denotes a content mask; and q denotes images in the other domain. To produce a reconstructed image from G(p), another generator F, with a structure similar to G, is used. With the help of its content masks, attention masks, and G(p), the original image p is recreated, expressed mathematically in equation (8) as:

$$ F(G(p)) = \mathop \sum \limits_{f = 1}^{n - 1} \left( C_{p}^{f} * A_{p}^{f} \right) + G(p) * A_{p}^{b}, $$
(8)

where \(F(G(p))\) is the recreated image, which is close to the real image p. Further, \(C_{p}^{f}\) is a content mask, G(p) is a synthesized image, and \(A_{p}^{b}\) and \(A_{p}^{f}\) are the background and foreground attention masks respectively. Thus, the aim of image-to-image conversion is achieved.
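The composition in equations (7) and (8) reduces to a weighted sum of masked terms. The sketch below shows the arithmetic only; the tensor shapes and the mask lists (as produced by the generator sub-nets) are assumptions for illustration.

```python
import torch

def compose_output(content_masks, attention_masks, bg_attention, base):
    """Equation (7) with base = p, or equation (8) with base = G(p).

    content_masks, attention_masks: lists of n-1 tensors shaped like the image;
    bg_attention: the background attention mask A^b;
    base: the image whose background is carried through.
    """
    out = base * bg_attention                 # background term
    for C_f, A_f in zip(content_masks, attention_masks):
        out = out + C_f * A_f                 # foreground terms
    return out
```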

This paper uses AttentionGAN (scheme B) in the comparison on the face aging application for image-to-image translation. The optimization objective of AttentionGAN scheme B is expressed in equation (9):

$$ L = L_{GAN} + \epsilon_{cycle} * L_{cycle} + \epsilon_{id} *L_{id} , $$
(9)

where \(L_{GAN}\) is the GAN loss, \(L_{cycle}\) is the cyclic loss, and \(L_{id}\) is the identity-preserving loss. \(\epsilon_{cycle}\) and \(\epsilon_{id}\) are hyperparameters that weight each term.
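A brief sketch of the weighted combination in equation (9) follows; the weights and the L1 form of the identity term are illustrative assumptions, not the values used in [37].

```python
import torch.nn as nn

l1 = nn.L1Loss()
eps_cycle, eps_id = 10.0, 0.5   # illustrative weights, not the paper's values

def attentiongan_objective(loss_gan, loss_cycle, G, F, p, q):
    # Identity-preserving term: mapping an image already in its target
    # domain should leave it unchanged (assumed L1 form)
    loss_id = l1(G(q), q) + l1(F(p), p)
    return loss_gan + eps_cycle * loss_cycle + eps_id * loss_id
```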

Also, the min-max game of AttentionGAN works as expressed in equation (10) as:

$$ \begin{aligned} L_{AGAN}(G, D_{QA}) =&\; E_{q\sim s_{data}(q)} \left[ \log D_{QA}([A_{q}, q]) \right] \\ & + E_{p\sim s_{data}(p)} \left[ \log \left( 1 - D_{QA}([A_{q}, G(p)]) \right) \right], \end{aligned} $$
(10)

where \(D_{QA}\) is the attention-guided discriminator, G is a generator, \([A_{q}, G(p)]\) are fake image pairs, and \([A_{q}, q]\) are real image pairs. In this equation, \(D_{QA}\) tries to discriminate the generated pairs \([A_{q}, G(p)]\) from the real pairs \([A_{q}, q]\). Similarly, from the other domain, \(L_{AGAN}(F, D_{PA})\) is defined, where F is a generator and \(D_{PA}\) is a discriminator that attempts to distinguish fake pairs \([A_{p}, F(q)]\) from actual pairs \([A_{p}, p]\). The discriminator thus focuses on the main content and overlooks unrelated content. This attention-guided discriminator is used for scheme A only; in scheme B, the generator is itself effective at learning the main content from the source and target domain images.

2.4 Face aging GAN variants

Image synthesis has turned out to be an interesting area in the computer vision field, and human faces have been studied thoroughly in multimedia, computer graphics, and computer vision [4]. In the last few years, a growing amount of research on face age progression and its associated applications has been described, such as face age estimation, cross-age face analysis, and entertainment. Some impressive current face age progression approaches using GANs are summarized in table 1.

Table 1 Literature of GANs based on face aging methods.

3 The proposed evaluation method

The comparative assessment of the two image-to-image translation methods determines which model can produce more realistic output images, how robust each model is, i.e., how it behaves with different kinds of input images, and which model preserves identity better. Therefore, qualitative and quantitative measures are used to evaluate the CycleGAN and AttentionGAN frameworks.

3.1 Dataset

  • CelebA-HQ: The dataset was introduced by Karras et al [51] in the paper on progressive growing of GANs for improved quality. It contains 30,000 high-quality face images derived from the CelebA dataset and is highly diverse, with large variations in pose, annotations, and background clutter. A total of 1000 images from CelebA-HQ are used for training. For the face aging task, two groups of images are taken: a younger group (ages up to 18 years) and an older group (ages above 55 years).

  • FFHQ: The Flickr-Faces-HQ dataset contains 70,000 high-quality human face images in PNG format. It has large variations in image background, age, and ethnicity, and includes images with accessories such as eyeglasses, sunglasses, and caps. A total of 1000 images, again split into younger and older age groups, are used to train the two frameworks.

Some images from the CelebA-HQ and FFHQ datasets are presented in figure 8.

Figure 8

Some images from the CelebA-HQ dataset (upper row) and the FFHQ dataset (lower row).

3.2 Training and implementation details

Both datasets are split into training and testing sets in a 70-30% ratio: 706 training images and 300 testing images are used for each experiment. A system with a single Nvidia GeForce GTX 1660 Ti GPU is used. Training took approximately 10 hours per age group for AttentionGAN and approximately 12 hours per age group for CycleGAN. Both models completed 210 epochs; the same number of epochs is used in both models for a fair comparison. The optimal number of training epochs for a GAN depends on the dataset size, the dataset type, and the application for which the GAN is used [52]. The training loss graphs are presented in figures 9(a) and (b) for the FFHQ dataset and figures 9(c) and (d) for the CelebA-HQ dataset. The training loss plots (figure 9) indicate that the losses of G and D should remain consistent throughout training: the G loss should stay above the D loss and should not collapse. A collapsing generator (G) loss suggests the model has become too good at generating images, while a collapsing discriminator (D) loss indicates either that the generator has become very good or that the discriminator has stopped improving at separating real from fake face images; either condition hinders the learning process. Since the G and D losses should neither rise nor fall too much, a training loss graph that stays roughly level throughout training is desired. Images created during training can be used for model selection, and additional training epochs do not necessarily mean better-quality synthesized images. The models were finally trained for 210 epochs. Constant monitoring of the training outputs showed that the early outputs are blurry and become more and more realistic as training progresses; after the 40th epoch the synthesized images visibly improve and the age-progressed faces take on meaningful aging signs, indicating that both models have learned well enough.
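A minimal sketch of the 70-30 split described above follows; the file handling and the fixed seed are illustrative assumptions, not the exact pipeline used here.

```python
import random

def split_70_30(image_paths, seed=0):
    """Shuffle and split a list of image paths 70-30, e.g. ~706 train / ~300 test."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)   # fixed seed for a reproducible split
    cut = int(0.7 * len(paths))
    return paths[:cut], paths[cut:]
```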

Figure 9

Training loss graph (a) CycleGAN, (b) AttentionGAN with the FFHQ dataset, (c) CycleGAN and (d) AttentionGAN with the CelebA-HQ dataset.

4 Simulation results

Extensive experiments are performed to compare the face aging results of CycleGAN and AttentionGAN.

4.1 Qualitative evaluation

4.1.1 Face aging:

Figures 10 and 11 present the aged face images generated by the CycleGAN and AttentionGAN frameworks from the CelebA-HQ (figure 10(a)) and FFHQ (figure 10(b)) datasets. The images are generated for the 55+ age group. The input face images cover a wide range of people across gender, expression, makeup, pose, and race. The synthesized aged face images are photo-realistic and retain the original details of the face, such as skin texture and muscles, while adding aging effects; for illustration, the hair goes grey and skin wrinkles appear, and all face images preserve their original identities. While hair usually turns grey as the human face ages, this differs from person to person and depends on various internal and external factors and on the training images, which explains why some of the synthesized face images in figures 10 and 11 show fewer face aging effects. In the comparison in figure 10(a), CycleGAN performs better than AttentionGAN; the opposite holds in figure 10(b), where AttentionGAN renders the aged faces better than CycleGAN. Thus, the dataset plays a major role, along with the framework, in generating quality images. Besides, the individual results generated by CycleGAN and AttentionGAN are shown in figure 11, where AttentionGAN outperforms CycleGAN by generating sharper, more realistic, and visually appealing aged faces, since the key contribution of AttentionGAN is that it learns the foreground while preserving the background of an image. Each method thus generates significant results on its own. Increasing the number of training images and running the training twice or more could improve the results further.

Figure 10

Comparison of images generated from AttentionGAN and CycleGAN.

Figure 11

Synthesized results from the CycleGAN and the AttentionGAN with CelebA-HQ and FFHQ dataset.

4.2 Robustness

Figure 12 shows the robustness of CycleGAN and AttentionGAN to profile face images, expression, and occlusion. The age-progressed face images remain photorealistic and true to the given inputs; the images are obtained for the 55+ age group. Thus, both models are robust to pose, expression, and occlusion. The CycleGAN and AttentionGAN models take the whole face as input and generate realistic age-progressed images including hair, whereas previous methods still work with cropped faces and do not include hair aging [23, 39, 53]. The robustness of CycleGAN and AttentionGAN is similar, with the small exception that AttentionGAN's synthesized face images have minor artifacts.

Figure 12

Robustness of CycleGAN and AttentionGAN with side pose, expression, and occlusion.

4.3 Aging signs

Figure 13 shows the aging signs that occur with age progression in CycleGAN and AttentionGAN. Figures 13(a) and (b) illustrate that CycleGAN synthesizes smooth aging variations with high fidelity in different facial regions; the lower half of the face in figure 13(a) shows the skin texture changing, wrinkles deepening as age progresses, and the lips becoming thinner. Even with these significant aging signs, identity is well preserved. In figure 13(b) the half-face illustrates the global behavior: the nasolabial fold appears and becomes prominent with aging.

Figure 13

Face aging signs: In CycleGAN (a) Cheeks wrinkles and skin texture change, (b) Nasolabial fold appears with aging. In AttentionGAN (c) Wide deep orbit, eyebrows thin with age progression and (d) Hair whitening with aging.

Further, figures 13(c) and (d) show AttentionGAN: in figure 13(c) the effects around the eyes include wrinkles, wider eye orbits, and thinning eyebrows, clearly showing the change of the age pattern. In figure 13(d), hair whitening occurs; hair varies in shape, color, and texture and is therefore hard to model. As expected with age progression, the hair grows wispy and thin, and this is reproduced in the simulation. It validates the ability to preserve the necessary facial details while aging.

Thus, the images from both CycleGAN and AttentionGAN are visually realistic, and both models synthesize aged faces convincingly.

4.4 Quantitative evaluation

4.4.1 Identity preservation:

Following convention [25, 27, 29, 41,42,43], the online face analysis tool Face++ is used to estimate identity preservation objectively. Images from the FFHQ and CelebA-HQ datasets are evaluated for a confidence score with the Face++ tool [54], which measures the similarity between the synthesized aged face image and the real face image. A score above the threshold value (76.5) signifies high similarity between the two face images [54]. Table 2 shows the confidence scores for CycleGAN and AttentionGAN for the proposed work, with a graphical representation in figure 14. As shown in table 2, the confidence scores for both CycleGAN and AttentionGAN are above the threshold, meaning each framework preserves identity well in the aged face images; moreover, the values for CycleGAN are better than those for AttentionGAN.
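Identity preservation can be scored programmatically. Below is a hedged sketch of calling the Face++ Compare API [54]; the endpoint URL and field names reflect the publicly documented v3 API at the time of writing and should be verified against the current documentation before use.

```python
import requests

def facepp_confidence(real_path, aged_path, api_key, api_secret):
    """Return the Face++ similarity confidence between two face images."""
    url = "https://api-us.faceplusplus.com/facepp/v3/compare"
    with open(real_path, "rb") as f1, open(aged_path, "rb") as f2:
        resp = requests.post(
            url,
            data={"api_key": api_key, "api_secret": api_secret},
            files={"image_file1": f1, "image_file2": f2},
        )
    return resp.json().get("confidence")   # compare against the 76.5 threshold
```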

Figure 14

Graphical representation of confidence score for CycleGAN and AttentionGAN using FFHQ and CelebA-HQ dataset.

Table 2 The confidence score for CycleGAN and AttentionGAN.

4.5 Image quality assessment

Fidelity is a significant aspect of evaluating image generation tasks. Following [35, 37, 55, 56], several image quality assessment metrics are used for the extensive quantitative evaluation: Frechet Inception Distance (FID) [41], Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM), Universal Quality Image Index (UQI), and Visual Information Fidelity (VIF). These are the most frequently used metrics for assessing the quality of GAN samples [35].

The FID score measures how real the generated face images are in comparison to the real face images. Assuming \((N_{t}, P_{t})\) and \((N_{g}, P_{g})\) are the means and covariances of the features of the true and generated face images respectively, it is expressed mathematically as:

$$ \mathrm{FID} = \left\| N_{t} - N_{g} \right\|^{2} + \mathrm{tr}\left( P_{t} + P_{g} - 2\left( P_{t} P_{g} \right)^{1/2} \right). $$
(11)

A lower value represents better image quality. The dataset should be large to compute a reliable FID score [45]. So, following [45], reference FID values are obtained by computing the FID between two splits of the real image dataset, giving baseline scores of 87.1 for CelebA-HQ and 132.6 for FFHQ. The outcomes of computing the FID between real and synthesized face images are presented in table 3: on CelebA-HQ, AttentionGAN has the better FID score, while on FFHQ, CycleGAN does. The graphical representation of the FID values is shown in figure 15(b). As shown, model performance varies with the dataset used.
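Equation (11) can be evaluated directly from Inception feature statistics. The NumPy/SciPy sketch below assumes the feature matrices (one row per image) have already been extracted elsewhere, e.g. with a pretrained Inception network.

```python
import numpy as np
from scipy import linalg

def fid(features_real, features_gen):
    """Equation (11) from per-image feature matrices (rows are images)."""
    N_t, N_g = features_real.mean(axis=0), features_gen.mean(axis=0)
    P_t = np.cov(features_real, rowvar=False)
    P_g = np.cov(features_gen, rowvar=False)
    covmean = linalg.sqrtm(P_t @ P_g)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    return np.sum((N_t - N_g) ** 2) + np.trace(P_t + P_g - 2.0 * covmean)
```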

Table 3 Evaluation with Frechet Inception Distance (FID), lower is better.

Figure 15

(a) Quantitative results with the image quality assessment scores; (b) FID, (c) PSNR, (d) SSIM, (e) VIF, and (f) UQI are their graphical representations.

Besides this, quality is assessed with PSNR, which has been used extensively in numerous digital image measurements. PSNR predates SSIM, is simple to compute, and is considered tested and valid. A higher PSNR value is better. Mathematically, PSNR is expressed in equation (12):

$$ \mathrm{PSNR} = 10 \log_{10}\left( \frac{max^{2}}{MSE} \right), $$
(12)

where max is the highest intensity value of the image and MSE is the mean squared error between the two images. Figure 15(a) shows the PSNR values (in dB) for CycleGAN and AttentionGAN, with the graphical presentation in figure 15(c). The numerical simulations make clear that AttentionGAN performs better than CycleGAN here, indicating better quality of the synthesized images (as presented in figure 11). The AttentionGAN generator is the same as CycleGAN's with some modifications, i.e., the built-in sub-generators (figure 7) that help learn the foreground and preserve the background details of an image during conversion.
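Equation (12) is straightforward to compute; a sketch for 8-bit images (max = 255) follows.

```python
import numpy as np

def psnr(real, generated, max_value=255.0):
    """Equation (12) in dB; higher is better."""
    mse = np.mean((real.astype(np.float64) - generated.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)
```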

Further, the SSIM measurement is built from three factors, contrast (C), luminance (L), and structure (S) (equation (13)), chosen to agree with the working of the human visual system [57]. SSIM values range between 0 and 1, where 1 means the generated image matches the original perfectly. SSIM is expressed mathematically in equations (13) and (14) as:

$$ \mathrm{SSIM}(q, q') = L(q, q')\, C(q, q')\, S(q, q'), $$
(13)

where L, C, and S are functions that compare the images q and q' for luminance, contrast, and structure respectively.

$$ \mathrm{SSIM}(q, q') = \frac{\left( 2 u_{q} u_{q'} + C_{1} \right)\left( 2 \sigma_{qq'} + C_{2} \right)}{\left( u_{q}^{2} + u_{q'}^{2} + C_{1} \right)\left( \sigma_{q}^{2} + \sigma_{q'}^{2} + C_{2} \right)}, $$
(14)

where q and q' are the intensity values of the real and generated images, with means \(u_{q}, u_{q'}\), variances \(\sigma_{q}^{2}, \sigma_{q'}^{2}\), and covariance \(\sigma_{qq'}\). \(C_{1}\) and \(C_{2}\) (and \(C_{3}\) in the three-factor form) are constants that avoid zero denominators [57]. Figures 15(a) and (d) present the SSIM values for CycleGAN and AttentionGAN and their graphical representation respectively. Here, CycleGAN's SSIM values are better than AttentionGAN's for both datasets, in line with CycleGAN's strength at texture and color changes [33].
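Rather than re-implementing equation (14), an off-the-shelf implementation can be used; the sketch below relies on scikit-image, whose windowed SSIM follows the same formulation (argument names follow its current API and may change between versions).

```python
from skimage.metrics import structural_similarity

def ssim_score(real_gray, generated_gray):
    """Windowed SSIM in [0, 1]; 1 means a perfect structural match."""
    return structural_similarity(real_gray, generated_gray, data_range=255)
```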

A high VIF (visual information fidelity) indicates that the results are similar to the target output, with a recognizable structure and visually pleasing images. Mathematically, VIF is expressed as:

$$ \mathrm{VIF} = \frac{\mathop \sum \nolimits_{k \in \text{subbands}} I(\vec{C}^{R,k}; \vec{F}^{R,k} \mid S^{R,k})}{\mathop \sum \nolimits_{k \in \text{subbands}} I(\vec{C}^{R,k}; \vec{E}^{R,k} \mid S^{R,k})}, $$
(15)

where the numerator and denominator are the information extracted from the given image and the generated face images respectively; \(\vec{C}^{R,k}\) denotes R elements of the random field (RF), with coefficients drawn from subband k. VIF thereby delivers an image quality measure.

For VIF, a higher value is better. Figures 15(a) and (e) show the VIF values for CycleGAN and AttentionGAN; both show better results on the FFHQ dataset. Thus, model performance depends on the type and quality of the dataset at the input.
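VIF implementations are also available off the shelf. The sketch below uses the `vifp` (pixel-domain VIF) function from the sewar package as a stand-in for equation (15); this is an assumption about available tooling, not the evaluation code used here, and the image paths are illustrative.

```python
import cv2
from sewar.full_ref import vifp

real = cv2.imread("real_face.png", cv2.IMREAD_GRAYSCALE)   # illustrative paths
aged = cv2.imread("aged_face.png", cv2.IMREAD_GRAYSCALE)
score = vifp(real, aged)   # higher is better
```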

UQI (universal quality image index) was designed to model any image distortion as a combination of three factors: loss of correlation, contrast distortion, and luminance distortion. Mathematically, UQI (Q) is expressed as:

$$ Q = \frac{\sigma_{qq'}}{\sigma_{q} \sigma_{q'}} \cdot \frac{2\, \overline{q}\, \overline{q'}}{\left( \overline{q} \right)^{2} + \left( \overline{q'} \right)^{2}} \cdot \frac{2 \sigma_{q} \sigma_{q'}}{\sigma_{q}^{2} + \sigma_{q'}^{2}}. $$
(16)

Computed over sliding windows, this yields a map of Q values; the average of this map gives the overall quality measure, expressed as:

$$ {\text{Q}} = \frac{1}{M}\mathop \sum \limits_{j = 1}^{M} Q_{j} , $$
(17)

where M is the number of window steps, which depends on the image size. As SSIM was built from UQI, it is noticeable that the UQI results in figure 15(a) are somewhat closer to 1 than the SSIM results.

A higher value signifies better quality. Figures 15(a) and (f) show the UQI values for CycleGAN and AttentionGAN. The two models differ only slightly in output values, but overall AttentionGAN performs better on UQI than CycleGAN.
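As a worked form of equation (16), the sketch below computes the three factors globally with NumPy; in practice Q is computed over sliding windows and averaged as in equation (17), which this sketch omits.

```python
import numpy as np

def uqi_global(q, q2):
    """Global UQI of equation (16); windowed averaging (eq. (17)) is omitted."""
    q = q.astype(np.float64).ravel()
    q2 = q2.astype(np.float64).ravel()
    mq, mq2 = q.mean(), q2.mean()
    sq, sq2 = q.std(), q2.std()
    cov = ((q - mq) * (q2 - mq2)).mean()
    return (cov / (sq * sq2)) \
         * (2 * mq * mq2 / (mq**2 + mq2**2)) \
         * (2 * sq * sq2 / (sq**2 + sq2**2))
```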

4.5.1 Subjective evaluation:

For further evaluation of the results, a user study was performed following [23,24,25, 41, 42]. Ten randomly selected original face images, with their corresponding generated images for the 55+ age group, were used. Twenty evaluators assessed the aged face images from CycleGAN and AttentionGAN against the input face images and were asked to select the more realistic aged faces, i.e., those with natural aging effects and fewer artifacts. CycleGAN received 57% of the votes, AttentionGAN 36%, and 7% judged the two models overall equal. In a few output images, the AttentionGAN syntheses were better than CycleGAN's. Model performance depends greatly on the number of training images; a larger dataset can yield better results, and the type of dataset matters most.

5 Pros and Cons of CycleGAN and AttentionGAN

CycleGAN

Pros: CycleGAN's biggest advantage is that it can produce remarkable results without the need for a paired dataset. Moreover, it works well for texture and color changes [33]. For identity preservation (table 2), FID score (table 3, figure 15(b)), and SSIM (figures 15(a) and (d)), CycleGAN is the better performer in this paper.

Cons: In this paper, CycleGAN requires more training time than AttentionGAN for the same number of epochs and training images, because CycleGAN performs the bidirectional translation simultaneously (converting images for age progression as well as regression) even though this paper focuses on face age progression only.

AttentionGAN

Pros: AttentionGAN likewise works with unpaired datasets. As illustrated in figure 11, the individual results synthesized by AttentionGAN are better than CycleGAN's: it produces more realistic and sharper aged face images.

Cons: AttentionGAN produces artifacts in some images, where the CycleGAN images are better (figure 10).

6 Conclusion and future scope

Image-to-image conversion with GANs has developed rapidly. Unsupervised algorithms can produce remarkable results, comparable to supervised learning, without paired data. This paper has shown that the results of the comparison between CycleGAN and AttentionGAN on the face aging task vary with each dataset. Both CycleGAN and AttentionGAN produce visually realistic and significant aging signs. However, some AttentionGAN results show artifacts in comparison to CycleGAN, while the individual results obtained from AttentionGAN are better than CycleGAN's. It is therefore hard to claim strongly why one model performs slightly better than the other, as model performance depends on several factors. Overall, CycleGAN performs better than AttentionGAN in this paper.

Several open face aging problems invite further improvement. As research is a continuous process, the proposed work can be extended with different datasets for various applications, such as ethnicity-based face aging. The datasets used in face aging methods are mostly mixtures of ethnic groups, yet face shape and features differ across ethnicities, so collecting more specific, ethnicity-based datasets could enable a more detailed study of face aging. Moreover, comparison experiments on other GAN variants could further help the research community, since different datasets produce varying output results on the same model. The proposed work uses 2D images, so it can be extended to 3D images; this may require long processing times and more memory, because 3D face images require a 3D face scanner [11], but since 3D faces describe the facial structure and associated texture more clearly, they can generate more photo-realistic and precise results. The objective function for evaluating the performance of GANs is still an open problem [58]. Besides, human face aging depends on internal and external factors: human emotions contract and expand certain facial muscles, developing lines and wrinkles that become permanent with age progression even when the face relaxes. A study of wrinkles on individual faces under changing facial expressions could therefore support the face aging process. A healthy lifestyle, with a healthy diet and exercise, slows aging compared with smoking and alcoholism; thus, to precisely predict a person's future face, external factors also need to be linked and tuned in the automatic face age progression method to produce more realistic results. A fusion of such factors could also help extract rich information from a person's face to recognize which parts of the face age faster.