1 Introduction

The aim of new technologies is normally to make a specific process easier, more accurate, faster or cheaper. In some cases they also enable us to perform tasks or create things that were previously impossible. Over recent years, one of the most rapidly advancing scientific techniques for practical purposes has been Artificial Intelligence (AI). AI techniques enable machines to perform tasks that typically require some degree of human-like intelligence. With recent developments in high-performance computing and increased data storage capacities, AI technologies have been empowered and are increasingly being adopted across numerous applications, ranging from simple daily tasks, intelligent assistants and finance to highly specific command and control operations and national security. AI can, for example, help smart devices or computers to understand text and read it out loud, hear voices and respond, view images and recognize objects in them, and even predict what may happen next after a series of events. At higher levels, AI has been used to analyze human and social activity by observing people's conversations and actions. It has also been used to understand socially relevant problems such as homelessness and to predict natural events. AI has been recognized by governments across the world as a potential major driver of economic growth and social progress (Hall and Pesenti 2018; NSTC 2016). This potential, however, does not come without concerns over the wider social impact of AI technologies, which must be taken into account when designing and deploying these tools.

Processes associated with the creative sector demand significantly different levels of innovation and skill sets compared to routine behaviours. While the accomplishments of AI rely heavily on the conformity of data, creativity often exploits the human imagination to drive original ideas that may not follow general rules. Fundamentally, creative practitioners have a lifetime of experience to build on, enabling them to think ‘outside the box’ and ask ‘What if?’ questions that cannot readily be addressed by constrained learning systems.

There have been many studies over several decades into the possibility of applying AI in the creative sector. One of the limitations in the past was the readiness of the technology itself; another was the belief that AI could attempt to replicate human creative behaviour (Rowe and Partridge 1993). A recent survey by AdobeFootnote 1 revealed that three quarters of artists in the US, UK, Germany and Japan would consider using AI tools as assistants in areas such as image search, editing and other ‘non-creative’ tasks. This indicates a general acceptance of AI as a tool across the community and reflects a general awareness of the state of the art, since most AI technologies have been developed to operate in closed domains where they can assist and support humans rather than replace them. Better collaboration between humans and AI technologies can thus maximize the benefits of this synergy. All that said, the first painting created solely by AI was auctioned for $432,500 in 2018.Footnote 2

Applications of AI in the creative industries have increased dramatically in the last five years. Based on analysis of data from arXivFootnote 3 and Gateway to Research,Footnote 4 Davies et al. (2020) revealed that the growth rate of research publications on AI (relevant to the creative industries) exceeds 500% in many countries (in Taiwan the growth rate is 1490%), and that most of these publications relate to image-based data. Analysis of company usage from the Crunchbase databaseFootnote 5 indicates that AI is used more in gaming, immersive applications, advertising and marketing than in other creative applications. Caramiaux et al. (2019) recently reviewed AI in the current media and creative industries across three areas: creation, production and consumption. They provide details of AI/ML-based research and development, as well as emerging challenges and trends.

In this paper, we review how AI and its technologies are, or could be, used in applications relevant to the creative industries. We first provide an overview of AI and current technologies (Sect. 2), followed by a selection of creative domain applications (Sect. 3). We group these into subsectionsFootnote 6 covering: (i) content creation, where AI is employed to generate original work; (ii) information analysis, where statistics of data are used to improve productivity; (iii) content enhancement and post production workflows, used to improve the quality of creative work; (iv) information extraction and enhancement, where AI assists in interpretation, clarifies semantic meaning and creates new ways to exhibit hidden information; and (v) data compression, where AI helps reduce the size of the data while preserving its quality. Finally, we discuss the challenges and future potential of AI in the creative industries in Sect. 4.

2 An introduction to artificial intelligence

Artificial intelligence (AI) embodies a set of codes, techniques, algorithms and data that enables a computer system to develop and emulate human-like behaviour and hence make decisions similar to (or, in some cases, better than) humans (Russell and Norvig 2020). When a machine exhibits full human intelligence, it is often referred to as ‘general AI’ or ‘strong AI’ (Bostrom 2014). However, currently reported technologies are normally restricted to operating in a limited domain on specific tasks; this is called ‘narrow AI’ or ‘weak AI’. In the past, most AI technologies were model-driven: the nature of the application is studied and a model is mathematically formed to describe it. Statistical learning is also data-dependent, but relies on rule-based programming (James et al. 2013). Previous generations of AI (from the mid-1950s until the late 1980s; Haugeland 1985) were based on symbolic AI, following the assumption that humans use symbols to represent things and problems. Symbolic AI was intended to produce general, human-like intelligence in a machine (Honavar 1995), whereas most modern research is directed at specific sub-problems.

2.1 Machine learning, neurons and artificial neural networks

The main class of algorithms in use today is based on machine learning (ML), which is data-driven. ML employs computational methods to ‘learn’ information directly from large amounts of example data without relying on a predetermined equation or model (Mitchell 1997). These algorithms adaptively converge to an optimum solution and generally improve their performance as the number of samples available for learning increases. Several types of learning algorithm exist, including supervised learning, unsupervised learning and reinforcement learning. Supervised learning algorithms build a mathematical model from a set of data that contains both the inputs and the desired outputs (each output usually representing a classification of the associated input vector), while unsupervised learning algorithms model problems using unlabeled data. Self-supervised learning is a form of unsupervised learning where the data themselves provide the measurable structure used to build a loss function. Semi-supervised learning employs a limited set of labeled data to label a (usually larger) amount of unlabeled data; both sets are then combined to train a new model. Reinforcement learning methods learn from trial and error and are effectively self-supervised (Russell and Norvig 2020).
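
To make the distinction between supervised and unsupervised learning concrete, the minimal sketch below (using scikit-learn; the two-cluster toy data are an illustrative assumption, not drawn from any work cited here) fits a classifier on labeled pairs and a clustering algorithm on the same inputs without labels.

```python
# Minimal illustration of supervised vs. unsupervised learning (scikit-learn).
# The toy dataset below is an illustrative assumption, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),     # samples drawn around (0, 0)
               rng.normal(3.0, 1.0, (50, 2))])    # samples drawn around (3, 3)
y = np.array([0] * 50 + [1] * 50)                 # labels (the desired outputs)

# Supervised: learn a mapping from inputs X to labels y.
clf = LogisticRegression().fit(X, y)
print("supervised prediction:", clf.predict([[2.5, 2.5]]))

# Unsupervised: model structure in X without using y at all.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("unsupervised cluster assignment:", km.predict([[2.5, 2.5]]))
```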

Modern ML methods have their roots in the early computational model of a neuron proposed in 1943 by Warren McCulloch (a neuroscientist) and Walter Pitts (a logician) (McCulloch and Pitts 1943); this is shown in Fig. 1a. In their model, the artificial neuron receives one or more inputs, each of which is independently weighted. The neuron sums these weighted inputs and passes the result through a non-linear function known as an activation function, representing the neuron’s action potential, which is then transmitted along its axon to other neurons. The multi-layer perceptron (MLP) is a basic form of artificial neural network (ANN) that gained popularity in the 1980s. It connects its neural units in a multi-layered architecture, typically one input layer, one hidden layer and one output layer (Fig. 1b). These layers are generally fully connected to adjacent layers (i.e., each neuron in one layer is connected to all neurons in the next layer). The disadvantage of this approach is that the total number of parameters can be very large, which can make such networks prone to overfitting.
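
As a sketch of this computation (the weights, inputs and the choice of a sigmoid activation are illustrative assumptions), a single artificial neuron forms a weighted sum of its inputs and passes it through a non-linear activation; stacking such neurons in layers gives a small MLP forward pass:

```python
import numpy as np

def sigmoid(z):
    """A common choice of non-linear activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Weighted sum of inputs followed by a non-linear activation."""
    return sigmoid(np.dot(w, x) + b)

# One hidden layer of 3 neurons and one output neuron: a tiny MLP forward pass.
x = np.array([0.5, -1.2, 3.0])                  # input vector
W1 = np.random.randn(3, 3); b1 = np.zeros(3)    # hidden-layer weights and biases
w2 = np.random.randn(3);    b2 = 0.0            # output-neuron weights and bias

hidden = sigmoid(W1 @ x + b1)
output = neuron(hidden, w2, b2)
print(output)
```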

For training, the MLP (and most supervised ANNs) utilizes error backpropagation to compute the gradient of a loss function. This loss function maps the event values from multiple inputs into one real number that represents the cost of that event. The goal of the training process is therefore to minimize the loss function over multiple presentations of the input dataset. The backpropagation algorithm was originally introduced in the 1970s, but its popularity peaked after 1986, when Rumelhart et al. (1986) described several neural networks where backpropagation worked far faster than earlier approaches, making ANNs applicable to practical problems.
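
In modern frameworks, the gradients required by backpropagation are computed automatically. The sketch below (a PyTorch example with an arbitrary two-layer network, random data and hand-picked hyper-parameters) shows a typical loop that minimizes a loss over repeated presentations of the data:

```python
import torch
import torch.nn as nn

# A small MLP: one hidden layer, trained by gradient descent via backpropagation.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()              # maps outputs + targets to one real number
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 4)                       # a batch of 32 random input vectors
y = torch.randint(0, 3, (32,))               # random class labels (illustrative only)

for epoch in range(100):
    optimizer.zero_grad()                    # clear previously accumulated gradients
    loss = loss_fn(model(x), y)              # forward pass and loss evaluation
    loss.backward()                          # backpropagate gradients of the loss
    optimizer.step()                         # update weights to reduce the loss
```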

Fig. 1 a Basic neural network unit by McCulloch and Pitts. b Basic multi-layer perceptron (MLP)

2.2 An introduction to deep neural networks

Deep learning is a subset of ML that employs deep artificial neural networks (DNNs). The word ‘deep’ means that there are multiple hidden layers of neuron collections with learnable weights and biases. When the data being processed occupy multiple dimensions (images, for example), convolutional neural networks (CNNs) are often employed. CNNs are a (loosely) biologically-inspired architecture; their filter responses are tiled so that they overlap, providing a better representation of the original inputs.

The first CNN was designed by Fukushima (1980) as a tool for visual pattern recognition (Fig. 2a). This so-called Neocognitron was a hierarchical architecture with multiple convolutional and pooling layers. LeCun et al. (1989) applied the standard backpropagation algorithm to a deep neural network with the purpose of recognizing handwritten ZIP codes. At that time, it took three days to train the network. Lecun et al. (1998) proposed LeNet5 (Fig. 2b), one of the earliest CNNs, which could outperform other models for handwritten character recognition. The deep learning breakthrough occurred in the 2000s, driven by the availability of graphics processing units (GPUs) that could dramatically accelerate training. Since around 2012, CNNs have represented the state of the art for complex problems such as image classification and recognition, having won several major international competitions.
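
A LeNet5-style network is compact enough to sketch directly. The PyTorch layer sizes below follow the structure summarized in Fig. 2b, but details such as the activation choice and input size are our own simplifications rather than a faithful reproduction of the original model.

```python
import torch
import torch.nn as nn

# A LeNet5-style CNN for 32x32 single-channel digit images (a simplified sketch).
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # -> 6 x 14 x 14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # -> 16 x 5 x 5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),                 # "flattening" conv
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(120, 84), nn.Tanh(), nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))   # softmax is applied inside the loss

logits = LeNet5()(torch.randn(1, 1, 32, 32))        # one 32x32 grayscale image
```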

Fig. 2 a Neocognitron (Fukushima 1980), where U\(_s\) and U\(_c\) learn simple and complex features, respectively. b LeNet5 (Lecun et al. 1998), consisting of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers and finally a softmax classifier

A CNN learns its filter values based on the task at hand. Generally, the CNN learns to detect edges from the raw pixels in the first layer, then uses those edges to detect simple shapes in the next layer, and so on, building complexity through subsequent layers. The higher layers produce high-level features with more semantically relevant meaning. This means that the algorithms can exploit both low-level features and a higher-level understanding of what the data represent. Deep learning has therefore emerged as a powerful tool for finding patterns, analyzing information and predicting future events. The number of layers in a deep network is in principle unlimited, but most current networks contain between 10 and 100 layers.

Goodfellow et al. (2014) proposed an alternative form of architecture, referred to as a Generative Adversarial Network (GAN). GANs consist of two competing modules: the first (the generator) creates images, and the second (the discriminator) checks whether the received image is real or synthesized by the first module. This competition results in generated images that are very similar to real ones. Because of their performance in producing deceptively realistic results, GAN technologies have become very popular and have been applied to numerous applications, including those related to creative practice.

While many types of machine learning algorithm exist, in this paper we place particular emphasis on deep learning methods because of their prominence and performance. We describe various applications relevant to the creative industries and critically review the methodologies that achieve, or have the potential to achieve, good performance.

2.3 Current AI technologies

This section presents state-of-the-art AI methods relevant to the creative industries. Readers who prefer to focus on the applications themselves may skip ahead to Sect. 3.

2.3.1 AI and the need for data

An AI system effectively combines a computational architecture and a learning strategy with a data environment in which it learns. Training databases are thus a critical component in optimizing the performance of ML processes and hence a significant proportion of the value of an AI system resides in them. A well-designed training database with appropriate size and coverage can help significantly with model generalization and avoiding problems of overfitting.

In order to learn without being explicitly programmed, ML systems must be trained using data with statistics and characteristics typical of the particular application domain under consideration. This is true regardless of the training method (see Sect. 2.1). Good datasets typically contain large numbers of examples with a statistical distribution matched to this domain. This is crucial because it enables the network to estimate gradients in the data (error) domain and hence converge to an optimum solution, forming robust decision boundaries between its classes. After training, the network will then be able to reliably match new, unseen inputs to the right answer when deployed.

The reliability of training dataset labels is key to achieving high-performance supervised deep learning. These datasets must comprise: (i) data that are statistically similar to the inputs the models will see in real situations, and (ii) ground truth annotations that tell the machine what the desired outputs are. For example, in segmentation applications, the dataset would comprise the images and the corresponding segmentation maps indicating homogeneous or semantically meaningful regions in each image. Similarly, for object recognition, the dataset would include the original images while the ground truth would be the object categories, e.g., car, house, human, types of animal, etc.

Some labeled datasets are freely available for public use,Footnote 7 but these are limited, especially in certain applications where data are difficult to collect and label. One of the largest, ImageNet, contains over 14 million images labeled into 22,000 classes. Care must be taken when collecting or using data to avoid imbalance and bias: skewed class distributions where the majority of data instances belong to a small number of classes while other classes are sparsely populated. For instance, in colorization, blue may appear more often because it is the color of the sky, while pink flowers are much rarer. This imbalance causes ML algorithms to develop a bias towards classes with a greater number of instances; hence they preferentially predict majority-class data, while features of minority classes are treated as noise and often ignored.

Numerous approaches have been introduced to create balanced distributions; these can be divided into two major groups: modification of the learning algorithm, and data manipulation techniques (He and Garcia 2009). Zhang et al. (2016) address the class-imbalance problem by re-weighting the loss of each pixel at training time based on the rarity of the pixel color. Recently, Lehtinen et al. (2018) introduced an innovative approach to learning via their Noise2Noise network, which demonstrates that it is possible to train a network without clean data if the corrupted data comply with certain statistical assumptions. However, this technique needs further testing and refinement to cope with real-world noisy data. Typical data manipulation techniques include downsampling majority classes, oversampling minority classes, or both. Two primary techniques are used to expand, adjust and rebalance the number of samples in the dataset and, in turn, to improve ML performance and model generalization: data augmentation and data synthesis. These are discussed further below.
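
Both families of approach can be illustrated briefly. In the PyTorch sketch below (the class counts, features and weighting scheme are invented for illustration), the loss is re-weighted inversely to class frequency, and minority examples are over-sampled when drawing training batches:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Suppose class 0 has 900 examples and class 1 only 100 (illustrative counts).
features = torch.randn(1000, 8)
labels = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])

# (a) Modify the learning algorithm: weight the loss inversely to class frequency.
class_counts = torch.bincount(labels).float()
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# (b) Manipulate the data: over-sample minority-class examples when drawing batches.
sample_weights = class_weights[labels]                 # one weight per training example
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=64, sampler=sampler)
```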

2.3.1.1 Data augmentation

Data augmentation techniques are frequently used to increase the volume and diversity of a training dataset without the need to collect new data. Instead, existing data are used to generate additional samples through transformations such as cropping, flipping, translating, rotating and scaling (Anantrasirichai et al. 2018; Krizhevsky et al. 2012). This can increase the representation of minority classes and also help to avoid overfitting, which occurs when a model memorizes the full dataset instead of learning only the underlying concepts of the problem. GANs (see Sect. 2.3.3) have recently been employed with success to enlarge training sets, with the most popular network currently being CycleGAN (Zhu et al. 2017). The original CycleGAN mapped one input to only one output, causing inefficiencies when dataset diversity is required. Huang et al. (2018) improved CycleGAN with a structure-aware network to augment training data for vehicle detection, and a slightly modified architecture has been trained to transform contrast CT images (computed tomography scans) into non-contrast images (Sandfort et al. 2019). A CycleGAN-based technique has also been used for emotion classification, to amplify cases of extremely rare emotions such as disgust (Zhu et al. 2018). IBM Research introduced a Balancing GAN (Mariani et al. 2018), where the model learns useful features from majority classes and uses these to generate images for minority classes while avoiding features close to those of majority cases. An extensive survey of data augmentation techniques can be found in Shorten and Khoshgoftaar (2019).
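
Such transformations are typically applied on the fly while loading each training image, so every epoch sees slightly different samples. A minimal torchvision sketch (the specific transforms and their parameters are arbitrary choices) might look like this:

```python
from torchvision import transforms

# Randomized geometric and photometric transforms applied each time an image is loaded,
# so every training epoch effectively sees slightly different versions of the dataset.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Usage sketch: augmented = augment(pil_image), where pil_image is a PIL.Image from disk.
```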

2.3.1.2 Data synthesis

Scientific or parametric models can be exploited to generate synthetic data in applications where it is difficult to collect real data and where data augmentation techniques cannot increase the variety in the dataset. Examples include signs of disease (Alsaih et al. 2017) and geological events that rarely happen (Anantrasirichai et al. 2019). In the case of creative processes, problems are often ill-posed, as ground truth data or ideal outputs are not available. Examples include post-production operations such as deblurring, denoising and contrast enhancement. Synthetic data are often created by degrading the clean data. Su et al. (2017) applied synthetic motion blur to sharp video frames to train their deblurring model. LLNet (Lore et al. 2017) enhances low-light images and is trained using a dataset generated with synthetic noise and intensity adjustment, while LLCNN (Tao et al. 2017) employs a gamma adjustment technique.
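
A simple degradation pipeline of this kind can be written in a few lines; in the NumPy sketch below (the noise level and gamma range are illustrative assumptions, not the settings used in the cited works), the clean image acts as ground truth and the degraded copy as the network input:

```python
import numpy as np

def degrade(clean, noise_sigma=0.05, gamma_range=(1.5, 3.0), rng=np.random.default_rng()):
    """Create a synthetic low-light, noisy input from a clean image with values in [0, 1]."""
    gamma = rng.uniform(*gamma_range)
    dark = np.power(clean, gamma)                               # gamma adjustment darkens the image
    noisy = dark + rng.normal(0.0, noise_sigma, clean.shape)    # additive Gaussian noise
    return np.clip(noisy, 0.0, 1.0)

clean = np.random.rand(64, 64, 3)        # stand-in for a clean training image
degraded_input = degrade(clean)          # training pair: (degraded_input, clean)
```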

Fig. 3 CNN architectures for a object recognition, adapted from \(^{8}\), and b semantic segmentation \(^{9}\)

2.3.2 Convolutional neural networks (CNNs)

2.3.2.1 Basic CNNs

Convolutional neural networks (CNNs) are a class of deep feed-forward ANN. They comprise a series of convolutional layers designed to take advantage of 2D structure, such as that found in images. These employ locally connected layers that apply convolution operations between a kernel of predefined size and an internal signal; the output of each convolutional layer is thus the input signal modified by a convolution filter. The weights of the filter are adjusted according to a loss function that assesses the mismatch (during training) between the network output and the ground truth values or labels. Commonly used loss functions include \(\ell _1\), \(\ell _2\), SSIM (Tao et al. 2017) and perceptual loss (Johnson et al. 2016). These errors are backpropagated through multiple forward and backward iterations, and the filter weights are adjusted based on estimated gradients of the local error surface. This in turn drives which features are detected, associating them with the characteristics of the training data. The early layers in a CNN extract low-level features conceptually similar to the visual basis functions found in the primary visual cortex (Matsugu et al. 2003).

In the most common CNN architecture (Fig. 3aFootnote 8), the outputs of the convolution layers are connected to a pooling layer, which combines the outputs of neuron clusters into a single neuron. Subsequently, activation functions such as tanh (the hyperbolic tangent) or ReLU (Rectified Linear Unit) are applied to introduce non-linearity into the network (Agostinelli et al. 2015). This structure is repeated with similar or different kernel sizes. As a result, the CNN learns to detect edges from the raw pixels in the first layer, then combines these to detect simple shapes in the next layer. The higher layers produce higher-level features, which have more semantic meaning. The last few layers represent the classification part of the network. These consist of fully connected layers (i.e., connected to all the activation outputs in the previous layer) and a softmax layer, in which the output classes are modelled as a probability distribution by exponentially scaling the outputs to lie between 0 and 1 (also referred to as a normalised exponential function).

VGG (Simonyan and Zisserman 2015) is one of the most common backbone networks, offered at two depths: VGG-16 and VGG-19, with 16 and 19 layers respectively. The networks incorporate a series of convolution blocks (comprising convolutional layers, ReLU activations and a max-pooling layer), and the last three layers are fully connected with ReLU activations. VGG employs very small receptive fields (3 \(\times\) 3 with a stride of 1), allowing deeper architectures than older networks. DeepArt (Gatys et al. 2016) employs a VGG network without the fully connected layers, demonstrating that the higher layers of the VGG network can represent the content of an artwork. The pre-trained VGG network is widely used to provide a measure of perceptual loss (and style loss) during the training of other networks (Johnson et al. 2016).
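
A common pattern is to freeze a pre-trained VGG and compare the feature maps of a generated image and its target. The sketch below assumes torchvision's VGG-16; the chosen layer index is arbitrary and the weight-loading argument may differ between torchvision versions:

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """L2 distance between VGG-16 feature maps of a prediction and its target."""
    def __init__(self, layer_index=16):                    # truncate after an intermediate block
        super().__init__()
        vgg = models.vgg16(pretrained=True).features       # argument name may vary by version
        self.extractor = nn.Sequential(*list(vgg.children())[:layer_index]).eval()
        for p in self.extractor.parameters():
            p.requires_grad = False                         # VGG stays fixed during training

    def forward(self, prediction, target):
        return nn.functional.mse_loss(self.extractor(prediction), self.extractor(target))

# Usage sketch: both tensors are N x 3 x H x W images normalized as VGG expects.
# loss = PerceptualLoss()(generated_images, ground_truth_images)
```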

2.3.2.2 CNNs with reconstruction

The basic structure of CNNs described in the previous section is sometimes called an ‘encoder’. This is because the network learns a representation of a set of data, which often has fewer parameters than the input. In other words, it compresses the input to produce a code or a latent-space representation. In contrast, some architectures omit pooling layers in order to create dense features in an output with the same size as the input.

Alternatively, the size of the feature map can be enlarged to that of the input via deconvolutional layers or transposed convolution layers (Fig. 3bFootnote 9). This structure is often referred to as a ‘decoder’, as it generates the output using the code produced by the encoder. Encoder-decoder architectures combine an encoder and a decoder; autoencoders are a special case of encoder-decoder model in which the output has the same size as the input and is trained to reconstruct it. Encoder-decoder models are well suited to creative applications, such as style transfer (Zhang et al. 2016), image restoration (Nah et al. 2017; Yang and Sun 2018; Zhang et al. 2017), contrast enhancement (Lore et al. 2017; Tao et al. 2017), colorization (Zhang et al. 2016) and super-resolution (Shi et al. 2016).

Some architectures also add skip connections or a bridge section (Long et al. 2015) so that local and global features, as well as semantics, are captured together, providing improved pixel-wise accuracy. These techniques are widely used in object detection (Anantrasirichai and Bull 2019) and object tracking (Redmon and Farhadi 2018). U-Net (Ronneberger et al. 2015) is perhaps the most popular network of this kind, even though it was originally developed for biomedical image segmentation. The network consists of a contracting path (encoder) and an expansive path (decoder), giving it a u-shaped architecture. The contracting path consists of the repeated application of two 3 \(\times\) 3 convolutions, each followed by a ReLU, and a max-pooling layer. Each step in the expansive path consists of a transposed convolution layer for upsampling, followed by two sets of convolutional and ReLU layers, and concatenation with the corresponding-resolution features from the contracting path.
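
The essential pattern, a contracting path, an expansive path and concatenation of corresponding-resolution features, can be sketched as follows; the depth and channel counts are deliberately reduced and are not those of the original U-Net.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, as used along both U-Net paths."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=1):
        super().__init__()
        self.enc1 = conv_block(in_ch, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)   # upsampling
        self.dec1 = conv_block(32, 16)           # 32 = 16 upsampled + 16 skipped channels
        self.head = nn.Conv2d(16, out_ch, 1)

    def forward(self, x):
        s1 = self.enc1(x)                        # contracting path, full resolution
        s2 = self.enc2(self.pool(s1))            # contracting path, half resolution
        up = self.up(s2)                         # expansive path: transposed convolution
        d1 = self.dec1(torch.cat([up, s1], dim=1))   # skip connection via concatenation
        return self.head(d1)

mask = TinyUNet()(torch.randn(1, 3, 64, 64))     # output: 1 x 1 x 64 x 64
```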

2.3.2.3 Advanced CNNs

Some architectures introduce modified convolution operations for specific applications. For example, dilated convolution (Yu and Koltun 2016), also called atrous convolution, enlarges the receptive field to support feature extraction both locally and globally. The dilated convolution is applied to the input with a defined spacing between the values in the kernel. For example, a 3 \(\times\) 3 kernel with a dilation rate of 2 has the same receptive field as a 5 \(\times\) 5 kernel but uses only 9 parameters. This has been used for colorization in the creative sector by Zhang et al. (2016). ResNet is an architecture developed for residual learning, comprising several residual blocks (He et al. 2016). A single residual block has two convolution layers and a skip connection between the input and the output of the last convolution layer. This mitigates the problem of vanishing gradients, enabling very deep CNN architectures. Residual learning has become an important part of the state of the art in many applications, such as contrast enhancement (Tao et al. 2017), colorization (Huang et al. 2017), SR (Dai et al. 2019; Zhang et al. 2018a), object recognition (He et al. 2016) and denoising (Zhang et al. 2017).
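
The sketch below combines both ideas in a single block: a residual connection around two 3 \(\times\) 3 convolutions that use a dilation rate of 2, so each kernel covers a 5 \(\times\) 5 receptive field while keeping 9 weights (the channel count is an arbitrary choice):

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Two dilated 3x3 convolutions with a skip connection from input to output."""
    def __init__(self, channels=64, dilation=2):
        super().__init__()
        pad = dilation                                    # keeps the spatial size unchanged
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation))

    def forward(self, x):
        return torch.relu(x + self.body(x))               # residual (identity) connection

y = DilatedResidualBlock()(torch.randn(1, 64, 32, 32))    # same shape as the input
```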

Traditional convolution operations are performed on a regular grid, which is limiting in applications where an object and its parts do not align with that grid. Deformable convolution (Dai et al. 2017) has therefore been proposed to allow the region of support of the convolution operation to take on any shape, instead of just the traditional square. This has been used in object detection and SR (Wang et al. 2019a). 3D deformable kernels have also been proposed for denoising video content, as they can better cope with large motions, producing cleaner and sharper sequences (Xiangyu Xu 2019).

Capsule networks were developed to address some of the deficiencies with traditional CNNs (Sabour et al. 2017). They are able to better model hierarchical relationships, where each neuron (referred to as a capsule) expresses the likelihood and properties of its features, e.g., orientation or size. This improves object recognition performance. Capsule networks have been extended to other applications that deal with complex data, including multi-label text classification (Zhao et al. 2019), slot filling and intent detection (Zhang et al. 2019a), polyphonic sound event detection (Vesperini et al. 2019) and sign language recognition (Jalal et al. 2018).

2.3.3 Generative adversarial networks (GANs)

The generative adversarial network (GAN) is a recent algorithmic innovation that employs two neural networks: a generative and a discriminative network. The GAN pits one against the other in order to generate new, synthetic instances of data that can pass for real data. The general GAN architecture is shown in Fig. 4a. The generative network produces new candidates designed to increase the error rate of the discriminative network, until the discriminator can no longer tell whether these candidates are real or synthesized. The generator is typically a deconvolutional neural network, and the discriminator is a CNN. Recent successful applications of GANs include SR (Ledig et al. 2017), inpainting (Yu et al. 2019), contrast enhancement (Kuang et al. 2019) and compression (Ma et al. 2019a).
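
A minimal training loop makes this adversarial interplay explicit. In the toy PyTorch sketch below, the generator and discriminator are small fully connected networks and the 'real' data are random samples; practical GANs use the convolutional architectures described above.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))   # generator
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))            # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 2.0           # stand-in for real data samples
    fake = G(torch.randn(64, latent_dim))                  # candidates from the generator

    # Discriminator update: label real samples 1 and generated samples 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```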

GANs have a reputation for being difficult to train, since the two models are trained simultaneously to find a Nash equilibrium while each model updates its cost (or error) independently. Failures often occur when the discriminator cannot feed back information that is good enough for the generator to make progress, leading to vanishing gradients; the Wasserstein loss is designed to prevent this (Arjovsky et al. 2017; Frogner et al. 2015). A specific condition or characteristic, such as a label associated with an image, rather than a generic sample from an unknown noise distribution, can be included in the generative model, creating what is referred to as a conditional GAN (cGAN) (Mirza and Osindero 2014). This improved GAN has been used in several applications, including pix2pix (Isola et al. 2017) and deblurring (Kupyn et al. 2018).

In theory, the generator in a GAN will not learn to create genuinely new content; it will simply try to make its output look like the real data. Therefore, to produce creative works of art, the Creative Adversarial Network (CAN) has been proposed by Elgammal et al. (2017). This works by including an additional signal in the generator to prevent it from generating content that is too similar to existing examples. As in other CNN-based approaches, a perceptual loss based on VGG16 (Johnson et al. 2016) has become common in applications where new images are generated that share the semantics of the input (Antic 2020; Ledig et al. 2017).

Most GAN-based methods are currently limited to the generation of relatively small square images, e.g., 256 \(\times\) 256 pixels (Zhang et al. 2017). The highest resolution achieved up to the time of this review is 1024 \(\times\) 1024 pixels, by NVIDIA Research, whose progressive growing of GANs (Karras et al. 2018) can generate near-realistic 1024 \(\times\) 1024-pixel portrait images (trained for 14 days). However, the problem of obvious artefacts in transition areas between foreground and background persists.

Another form of deep generative model is the variational autoencoder (VAE). A VAE is an autoencoder whose encoding distribution is regularised to ensure that the latent space has good properties for supporting the generative process; the decoder then samples from this distribution to generate new data. Compared with GANs, VAEs are more stable during training, while GANs are better at producing realistic images. Recently, DeepMind (Google) has included vector quantization (VQ) within a VAE to learn a discrete latent representation (Razavi et al. 2019). Its performance for image generation is competitive with BigGAN (Brock et al. 2019), but with a greater capacity for generating a diverse range of images. There have also been many attempts to merge GANs and VAEs so that the end-to-end network benefits from both good samples and good representation, for example using a VAE as the generator of a GAN (Bhattacharyya et al. 2019; Wan et al. 2017). However, the results have not yet demonstrated significant improvement in overall performance (Rosca et al. 2019), and this remains an ongoing research topic.
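
The defining ingredients of a VAE, the reparameterized latent sample and a loss combining reconstruction error with a KL regularizer on the latent distribution, can be sketched as follows (the layer sizes and data are illustrative):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU())
        self.to_mu, self.to_logvar = nn.Linear(256, latent_dim), nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, data_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    recon_term = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # regularizes the latent space
    return recon_term + kl_term

x = torch.rand(16, 784)                      # a batch of inputs scaled to [0, 1]
recon, mu, logvar = TinyVAE()(x)
loss = vae_loss(x, recon, mu, logvar)
```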

A review of recent state-of-the-art GAN models and applications can be found in Foster (2019).

Fig. 4 Architectures of a GAN, b RNN for drawing sketches (Ha and Eck 2018)

2.3.4 Recurrent neural networks (RNNs)

Recurrent neural networks (RNNs) have been widely employed for sequential recognition; they offer benefits in this respect by incorporating at least one feedback connection. The most commonly used type of RNN is the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997), as this solves the vanishing-gradient problems observed in traditional RNNs by memorizing sufficient context information in time-series data via its memory cell. Deep RNNs use their internal state to process variable-length input sequences, combining information across multiple levels of representation. This makes them amenable to tasks such as speech recognition (Graves et al. 2013), handwriting recognition (Doetsch et al. 2014) and music generation (Briot et al. 2020). RNNs are also employed in image and video processing applications, where recurrence is applied to convolutional encoders for tasks such as drawing sketches (Ha and Eck 2018) and deblurring videos (Zhang et al. 2018). VINet (Kim et al. 2019) employs an encoder-decoder model with an RNN to estimate optical flow, processing multiple input frames concatenated with the previous inpainting results. An example network using an RNN is illustrated in Fig. 4b.
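
A typical sequence-modelling pattern, sketched below with arbitrary dimensions, feeds a feature sequence through an LSTM and predicts from its final hidden state:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """LSTM over a feature sequence followed by a linear prediction head."""
    def __init__(self, feat_dim=20, hidden_dim=64, num_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                        # x: batch x time x feat_dim
        outputs, (h_n, c_n) = self.lstm(x)       # memory cells carry context through time
        return self.head(h_n[-1])                # classify from the final hidden state

logits = SequenceClassifier()(torch.randn(8, 50, 20))   # 8 sequences of 50 time steps
```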

CNNs extract spatial features from their input images using convolutional filters, while RNNs extract sequential features from time-series data using memory cells. Extending these ideas, 3D CNNs, CNN-LSTMs and ConvLSTMs have been designed to extract spatio-temporal features from video sequences. The 3D activation maps produced in 3D CNNs are able to analyze temporal or volumetric context, which is important in applications such as medical imaging (Lundervold and Lundervold 2019) and action recognition (Ji et al. 2013). The CNN-LSTM simply concatenates a CNN and an LSTM (the 1D output of the CNN is the input to the LSTM) to process time-series data. In contrast, ConvLSTM is an LSTM variant in which the internal matrix multiplications are replaced with convolution operations at each gate of the LSTM cell, so that the LSTM input can take the form of multi-dimensional data (Shi et al. 2015).
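
The CNN-LSTM combination can be sketched by applying a small CNN to every frame and passing the resulting per-frame feature vectors to an LSTM; all sizes in this sketch are illustrative:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Per-frame CNN features fed to an LSTM for video-level prediction."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, video):                    # video: batch x time x 3 x H x W
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))    # run the CNN over every frame
        feats = feats.view(b, t, -1)             # back to (batch, time, features)
        _, (h_n, _) = self.lstm(feats)           # temporal modelling over the sequence
        return self.head(h_n[-1])

logits = CNNLSTM()(torch.randn(2, 16, 3, 64, 64))   # 2 clips of 16 frames each
```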

2.3.5 Deep reinforcement learning (DRL)

Reinforcement learning (RL) refers to ML algorithms trained to make a sequence of decisions. Deep reinforcement learning (DRL) combines ANNs with an RL architecture that enables RL agents to learn the best actions in a virtual environment in order to achieve their goals. An RL agent comprises a policy, which maps an input state to an output action, and an algorithm responsible for updating this policy. This is done by leveraging a system of rewards and punishments to acquire useful behaviour, effectively a trial-and-error process. The framework trains using a simulation model, so it does not require a predefined training dataset, either labeled or unlabeled.
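
Even a tabular sketch captures the essential loop: the agent acts according to its current policy, observes a reward and the next state, and updates its action-value estimates by trial and error. The toy chain environment below is invented purely for illustration and is not drawn from the cited works.

```python
import numpy as np

# Toy environment: 5 states in a chain; action 1 moves right, action 0 moves left.
# Reaching the right-most state gives a reward of 1 and ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))         # action-value estimates defining the policy
alpha, gamma, epsilon = 0.1, 0.9, 0.1       # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    for step in range(50):                  # cap the episode length
        # Epsilon-greedy policy: mostly exploit current estimates, occasionally explore.
        greedy = rng.choice(np.flatnonzero(Q[s] == Q[s].max()))   # break ties randomly
        a = rng.integers(n_actions) if rng.random() < epsilon else int(greedy)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Temporal-difference (Q-learning) update of the action-value estimate.
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == n_states - 1:
            break

print(np.argmax(Q, axis=1))   # learned greedy action per non-terminal state (1 = move right)
```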

However, pure RL requires an excessive number of trials to learn fully, something that may be impractical in many (especially real-time) applications if training from scratch (Hessel et al. 2018). AlphaGo, a computer program developed by DeepMind Technologies that can beat a human professional Go player, employs RL on top of a pre-trained model to improve its play strategy against a particular opponent.Footnote 10 RL could be useful in creative applications where there may not be a predefined way to perform a given task, but where there are rules that the model has to follow to perform its duties correctly. Current applications involve end-to-end RL combined with CNNs, including gaming (Mnih et al. 2013), and RL combined with GANs to optimize painting strokes in stroke-based rendering (Huang et al. 2019). Recently, RL methods have been developed using a graph neural network (GNN) to play Diplomacy, a highly complex 7-player (large-scale) board game (Anthony et al. 2020).

Temporal difference (TD) learning (Gregor et al. 2019; Chen et al. 2018; Nguyen et al. 2020) has recently been introduced as a model-free reinforcement learning method that learns how to predict a quantity that depends on future values of a given signal. That is, the model learns from an environment through episodes with no prior knowledge of the environment. This may well have application in the creative sector for storytelling, caption-from-image generation and gaming.

3 AI for the creative industries

AI has increasingly (and often mistakenly) been associated with human creativity and artistic practice. As it has demonstrated abilities to ‘see’, ‘hear’, ‘speak’, ‘move’, and ‘write’, it has been applied in domains and applications including: audio, image and video analysis, gaming, journalism, script writing, filmmaking, social media analysis and marketing. One of the earliest AI technologies, available for more than two decades, is Autotune, which automatically fixes vocal intonation errors (Hildebrand 1999). An early attempt to exploit AI for creating art occurred in 2016, when a three-dimensional (3D) printed painting, the Next Rembrandt,Footnote 11 was produced solely based on training data from Rembrandt’s portfolio. It was created using deep learning algorithms and facial recognition techniques.

Creativity is defined in the Cambridge Dictionary as ‘the ability to produce original and unusual ideas, or to make something new or imaginative’. Creative tasks generally require some degree of original thinking, extensive experience and an understanding of the audience, while production tasks are, in general, more repetitive or predictable, making them more amenable to being performed by machines. To date, AI technologies have produced mixed results when used for generating original creative works. For example, GumGumFootnote 12 creates a new piece of art following the input of a brief idea from the user. The model is trained by recording the preferred tools and processes that the artist uses to create a painting. A Turing test revealed that it is difficult to distinguish these AI generated products from those painted by humans. AI methods often produce unusual results when employed to create new narratives for books or movie scripts. BotnikFootnote 13 employs an AI algorithm to automatically remix texts of existing books to create a new chapter. In one experiment, the team fed the seven Harry Potter novels through their predictive text algorithm, and the ‘bot’ created rather strange but amusing sentences, such as “Ron was standing there and doing a kind of frenzied tap dance. He saw Harry and immediately began to eat Hermione’s family” (Sautoy 2019). However, when AI is used to create less structured content (e.g., some forms of ‘musical’ experience), it can demonstrate pleasurable difference (Briot et al. 2020).

In the production domain, Twitter has applied automatic cropping to create image thumbnails that show the most salient part of an image (Theis et al. 2018). The BBC has created a proof-of-concept system for automated coverage of live events, in which the AI-based system performs shot framing (wide, mid and close-up shots), sequencing and shot selection automatically (Wright et al. 2020). However, the initial results show that the algorithm needs some improvement if it is to replace human operators. Nippon Hoso Kyokai (NHK, Japan’s Broadcasting Corporation) has developed a new AI-driven broadcasting technology called “Smart Production”. This approach extracts events and incidents from diverse sources such as social media feeds (e.g., Twitter), local government data and interviews, and integrates these into a human-friendly, accessible format (Kaneko et al. 2020).

In this review, we divide creative applications into five major categories: content creation, information analysis, content enhancement and post production workflows, information extraction and enhancement, and data compression. However, it should be noted that many applications exploit several categories in combination. For instance, post-production tools (discussed in Sects. 3.3 and 3.4) frequently combine information extraction and content enhancement techniques. These combinations can together be used to create new experiences, enhance existing material or to re-purpose archives (e.g., ‘Venice Through a VR Lens, 1898’ directed by BDH Immersive and Academy 7 ProductionFootnote 14). These workflows may employ AI-enabled super-resolution, colorization, 3D reconstruction and frame rate interpolation methods. Gaming is another important example that has been key for the development of AI. It could be considered as an ‘all-in-one’ AI platform, since it combines rendering, prediction and learning.

We categorize the applications and the corresponding AI-based solutions as shown in Table 1. For those interested, a more detailed overview of contemporary Deep Learning systems is provided in Sect. 2.3.

Table 1 Creative applications and corresponding AI-based methods

3.1 Content creation

Content creation is a fundamental activity of artists and designers. This section discusses how AI technologies have been employed both to support the creative process and as creators in their own right.

3.1.1 Script and movie generation

The narrative or story underpins all forms of creativity across art, fiction, journalism, gaming and other forms of entertainment. AI has been used both to create stories and to optimize the use of supporting data, for example organizing and searching through huge archives for documentaries. The script of a fictional short film, Sunspring (2016),Footnote 15 was entirely written by an AI machine known as Benjamin, created by New York University. The model, based on a recurrent neural network (RNN) architecture, was trained using science fiction screenplays as input, and the script was generated with random seeds from a sci-fi filmmaking contest. Sunspring has some unnatural story lines. In the sequel, It’s No Game (2017), Benjamin was used only in selected areas and in collaboration with humans, producing a more fluid and natural plot. This reinforces the notion that current AI technology works more effectively in conjunction with humans than when left to its own devices. In 2016, IBM Watson, an AI-based computer system, composed the 6-min trailer of a horror film called Morgan.Footnote 16 The model was trained with more than 100 trailers of horror films, enabling it to learn their normative structure and patterns. Later, in 2018, Benjamin was used to generate a new film, ‘Zone Out’ (produced within 48 h). The project also experimented further with face-swapping, based on a GAN, and voice-generating technologies. This film was entirely directed by AI, but includes many artefacts and unnatural scenes, as shown in Fig. 5a.Footnote 17 Recently, ScriptBookFootnote 18 introduced a story-awareness concept for AI-based storytelling. Its generative models focus on three aspects: awareness of characters and their traits, awareness of a script’s style and theme, and awareness of a script’s structure, so that the resulting script is more natural.

In gaming, AI has been used to support design, decision-making and interactivity (Justesen et al. 2020). Interactive narrative, where users create a storyline through actions, has been developed using AI methods over the past decade (Riedl and Bulitko 2012). For example, MADE (Massive Artificial Drama Engine for non-player characters) generates procedural content in games (Héctor 2014), and deep reinforcement learning has been employed for personalization (Wang et al. 2017). AI DungeonFootnote 19 is a web-based game that is capable of generating a storyline in real time, interacting with player input. The underlying algorithm requires more than 10,000 label contributions for training to ensure that the model produces smooth interaction with the players. Procedural generation has been used to automatically randomize content so that a game does not present content in the same order every time (Short and Adams 2017). Modern games often integrate 3D visualization, augmented reality (AR) and virtual reality (VR) techniques, with the aim of making play more realistic and immersive. Examples include Vid2Vid (Wang et al. 2018) which uses a deep neural network, trained on real videos of cityscapes, to generate a synthetic 3D gaming environment. Recently, NVIDIA Research has used a generative model [GameGAN by Kim et al. (2020b)], trained on 50,000 PAC-MAN episodes, to create new content, which can be used by game developers to automatically generate layouts for new game levels in the future.

Fig. 5 a A screenshot from ‘Zone Out’, where the face of the woman was replaced with a man’s mouth\(^{17}\). b Music transcription generated by AI algorithm\(^{31}\)

3.1.2 Journalism and text generation

Natural language processing (NLP) refers to the broad class of computational techniques used to analyze and generate speech and text; it analyzes natural language data and trains machines to perceive and generate human language directly. NLP algorithms frequently involve speech recognition (Sect. 3.4), natural language understanding [e.g., BERT by Google AI (Devlin et al. 2019)] and natural language generation (Leppänen et al. 2017). Automated journalism, also known as robot journalism, describes automated tools that can generate news articles from structured data. The process scans large amounts of assorted data, orders key points, and inserts details such as names, places, statistics and figures (Cohen 2015). This can be achieved through NLP and text mining techniques (Dörr 2016).

AI can help to break down barriers between different languages through machine translation (Dzmitry Bahdanau 2015). A conditional GAN with an RNN architecture has been proposed for language translation by Subramanian et al. (2018); it was used for the difficult task of generating English sentences from Chinese poems, and creates understandable text, though sometimes with grammatical errors. CNN and RNN architectures have been employed to translate video into natural language sentences (Venugopalan et al. 2015). AI can also be used to rewrite an article to suit several different channels or audience tastes.Footnote 20 A survey of recent deep learning methods for text generation by Iqbal and Qureshi (2020) concludes that text generated from images could be most amenable to GAN processing, while topic-to-text translation is likely to be dominated by variational autoencoders (VAEs).

Automated journalism is now quite widely used. For example, the BBC reported on the UK general election in 2019 using such tools.Footnote 21 Forbes uses an AI-based content management system, called Bertie, to provide reporters with first drafts and templates for news stories.Footnote 22 The Washington Post also has a robot reporting program called Heliograf.Footnote 23 Microsoft announced in 2020 that it uses automated systems to select the news stories that appear on the MSN website.Footnote 24 These applications demonstrate that current AI technology can be effective in supporting human journalists in constrained cases, increasing production efficiency.

3.1.3 Music generation

There are many different areas where sound design is used in professional practice, including television, film, music production, sound art, video games and theatre. Applications of AI in this domain include searching through large databases to find the most appropriate match for such applications (see Sect. 3.2.3) and assisting sound design. Currently, several AI-assisted composition systems support music creation. The process generally involves using ML algorithms to analyze data to find musical patterns, e.g., chords, tempo and length, from various instruments, synthesizers and drums. The system then suggests newly composed melodies that may inspire the artist. Example software includes Flow Machines by Sony,Footnote 25 Jukebox by OpenAIFootnote 26 and NSynth by Google AI.Footnote 27 In 2016, Flow Machines launched a song in the style of The Beatles, and in 2018 the team released the first AI album, ‘Hello World’, composed by the artist SKYGGE (Benoit Carré) using an AI-based tool.Footnote 28 Coconet uses a CNN to infill missing pieces of music.Footnote 29 Modelling musical creativity is often achieved using the Long Short-Term Memory (LSTM) network, a special type of RNN architecture (Sturm et al. 2016) (an example of the output of such a model is shown in Fig. 5b,Footnote 30 and the reader can experience AI-based music at the Ars Electronica Voyages ChannelFootnote 31). The model takes a transcribed musical idea and transforms it in meaningful ways. For example, DeepJ composes music conditioned on a specific mixture of composer styles using a Biaxial LSTM architecture (Mao et al. 2018). More recently, generative models based on LSTM neural networks have been configured to generate music (Li et al. 2019b).

Alongside these methods of musical-notation-based audio synthesis, there also exists a range of direct waveform synthesis techniques that learn and/or act directly on the audio waveform itself [for example, Donahue et al. (2019) and Engel et al. (2019)]. A more detailed overview of deep learning techniques for music generation can be found in Briot et al. (2020).

Fig. 6 Example applications of the pix2pix framework (Isola et al. 2017)

3.1.4 Image generation

AI can be used to create new digital imagery or art-forms automatically, based on selected training datasets, e.g., new examples of bedrooms (Radford et al. 2016), cartoon characters (Jin et al. 2017) or celebrity headshots (Karras et al. 2018). Some applications produce a new image conditioned on an input image, referred to as image-to-image translation or ‘style transfer’. It is called translation or transfer because the output image has a different appearance from the input but similar semantic content; that is, the algorithm learns the mapping between an input image and an output image. For example, grayscale tones can be converted into natural colors (Zhang et al. 2016) using eight simple convolution layers to capture localized semantic meaning and to generate the a and b color channels of the CIELAB color space. This involves mapping class probabilities to point estimates in ab space. DeepArt (Gatys et al. 2016) transforms an input image into the style of a selected artist by combining feature maps from different convolutional layers. A stroke-based drawing method trains machines to draw and generalise abstract concepts in a manner similar to humans using RNNs (Ha and Eck 2018).

A Berkeley AI Research team has successfully used GANs to convert between two image types (Isola et al. 2017), e.g., from a Google map to an aerial photo, from a segmentation map to a real scene, or from a sketch to a colored object (Fig. 6). They have published their pix2pix codebaseFootnote 32 and invited the online community to experiment with it in different application domains, including depth map to street view, background removal and pose transfer. For example, pix2pix has been usedFootnote 33 to create a Renaissance portrait from a real portrait photo. Following pix2pix, a large number of research works have improved the performance of style transfer. Cycle-consistent adversarial networks (CycleGAN) (Zhu et al. 2017) and DualGAN (Yi et al. 2017) have been proposed for unsupervised learning. Both algorithms are based on similar concepts: the images of both groups are translated twice (e.g., from group A to group B, then back to the original group A) and the loss function compares the input image with its reconstruction, computing what is referred to as the cycle-consistency loss. Samsung AI has shown, using GANs, that it is possible to turn a portrait image, such as the Mona Lisa, into a video in which the portrait’s face speaks in the style of a guide (Zakharov et al. 2019). Conditional GANs can be trained to transform a human face into one of a different age (Song et al. 2018b), and to change facial attributes such as the presence of a beard, skin condition, hair style and color (He et al. 2019).

Several creative tools have employed ML-AI methods to create new unique artworks. For example, PicbreederFootnote 34 and EndlessFormsFootnote 35 employ Hypercube-based NeuroEvolution of Augmenting Topologies (Stanley et al. 2009) as a generative encoder that exploits geometric regularities. ArtbreederFootnote 36 and GANVAS StudioFootnote 37 employ BigGAN (Brock et al. 2019) to generate high-resolution class-conditional images and also to mix two images together to create new interesting work.

Fig. 7 a Real-time pose animator\(^{38}\). b Deepfake applied to replace Alden Ehrenreich with young Harrison Ford in Solo: a star wars story by derpfakes\(^{50}\)

3.1.5 Animation

Animation is the process of using drawings and models to create moving images. Traditionally this was done by hand-drawing each frame in the sequence and rendering these at an appropriate rate to give the appearance of continuous motion. In recent years, AI methods have been employed to automate the animation process, making it easier, faster and more realistic than in the past. A single animation project can involve several shot types, ranging from simple camera pans on a static scene to more challenging dynamic movements of multiple interacting characters [e.g., basketball players (Starke et al. 2020)]. ML-based AI is particularly well suited to learning models of motion from captured real motion sequences. These motion characteristics can be learnt using deep learning-based approaches, such as autoencoders (Holden et al. 2015), LSTMs (Lee et al. 2018) and motion prediction networks (Starke et al. 2019); at inference time, the characteristics captured by the trained model are applied to animate characters and dynamic movements. In simple animation, the motion can be estimated using a single low-cost camera. For example, Google Research has created software for pose animation that turns a human pose into a cartoon animation in real timeFootnote 38. This is based on PoseNet (estimating pose positionFootnote 39) and FaceMesh (capturing face movement (Kartynnik et al. 2019)), as shown in Fig. 7a. Adobe has also created Character Animator softwareFootnote 40 offering lip synchronisation, eye tracking and gesture control through webcam and microphone inputs in real time. This has been adopted by Hollywood studios and other online content creators.

AI has also been employed for rendering objects and scenes. This includes the synthesis of 3D views from motion capture or from monocular cameras (see Sect. 3.4.6), shading (Nalbach et al. 2017) and dynamic texture synthesis (Tesfaldet et al. 2018). Creating realistic lighting in animation and visual effects has also benefited from combining traditional geometric computer vision with enhanced ML approaches and multiple depth sensors (Guo et al. 2019). Animation is not only important within the film industry; it also plays an important role in the games industry, where it is responsible for the portrayal of movement and behaviour. Animating characters, including their faces and postures, is a key component of a game engine. AI-based technologies have enabled digital characters and audiences to co-exist and interact.Footnote 41 Avatar creation has also been employed to enhance virtual assistants,Footnote 42 e.g., using proprietary photoreal AI face synthesis technology (Nagano et al. 2018). Facebook Reality Labs have employed ML-AI techniques to animate realistic digital humans, called Codec Avatars, in real time, using GAN-based style transfer and a VAE to extract avatar parameters (Wei et al. 2019). AI is also employed to up-sample frame rates in animation (Siyao et al. 2021).

3.1.6 Augmented, virtual and mixed reality (VR, AR, MR)

AR and VR use computer technologies to create a fully simulated environment or one that is real but augmented with virtual entities. AR expands the physical world with digital layers via mobile phones, tablets or head mounted displays, while VR takes the user into immersive experiences via a headset with a 3D display that isolates the viewer (at least in an audio-visual sense) from the physical world (Milgram et al. 1995).

Significant predictions have been made about the growth of AR and VR markets in recent years, but these have not yet been realised.Footnote 43 This is due to many factors, including equipment cost, available content and the physiological effects of ‘immersion’ (particularly over extended time periods) caused by conflicting sensory interactions (Ng et al. 2020). VR can be used to simulate a real workspace for training workers, for safety and to prevent the real-world consequences of failure (Laver et al. 2017). In the healthcare industry, VR is being used increasingly in various sectors, ranging from surgical simulation to physical therapy (Keswani et al. 2020).

Gaming is often cited as a major market for VR, along with related areas such as pre-visualisation of designs or creative productions (e.g., in building, architecture and filmmaking). Good lists of VR games can be found in many articles.Footnote 44 Deep learning technologies have been exploited in many aspects of gaming, for example in VR/AR game design (Zhang 2020) and in detecting emotion while users experience VR, in order to improve immersion (Quesnel et al. 2018). More recently, AI gaming methods have been extended into the area of virtual production, where the tools are scaled to produce dynamic virtual environments for filmmaking.

AR arguably has greater near-term potential for growth than VR, and applications have been developed in education and for creating shared information, work or design spaces, where it can provide added 3D realism for the users interacting in the space (Palmarini et al. 2018). AR has also gained interest for augmenting experiences in movie and theatre settings.Footnote 45 A review of current and future trends in AR and VR systems can be found in Bastug et al. (2017).

MR combines the real world with digital elements (or the virtual world) (Milgram and Kishino 1994). It allows us to interact with objects and environments in both the real and virtual world by using touch technology and other sensory interfaces, to merge reality and imagination and to provide more engaging experiences. Examples of MR applications include the ‘MR Sales Gallery’ used by large real estate developers.Footnote 46 It is a virtual sample room that simulates the environment for customers to experience the atmosphere of an interactive residential project. The growth of VR, AR and MR technologies is described by Immerse UK in their recent report on the immersive economy in the UK 2019.Footnote 47 Extended reality (XR) is a newer technology that combines VR, AR and MR with internet connectivity, which opens further opportunities across industry, education, defence, health, tourism and entertainment (Chuah 2018).

An immersive experience with VR or MR requires good quality, high-resolution, animated worlds or 360-degree video content (Ozcinar and Smolic 2018). This poses new problems for data compression and visual quality assessment, which are currently the subject of increased research activity (Xu et al. 2020). AI technologies have been employed to make AR/VR/MR/XR content more exciting and realistic, and to robustly track and localize objects and users in the environment, for example through automatic map reading using image-based localization (Panphattarasap and Calway 2018) and gaze estimation (Anantrasirichai et al. 2016; Soccini 2017). Oculus Insight, by Facebook, uses visual-inertial SLAM (simultaneous localization and mapping) to generate real-time maps and position tracking.Footnote 48 More sophisticated approaches, such as Neural Topological SLAM, leverage semantics and geometric information to improve long-horizon navigation (Chaplot et al. 2020). Combining audio and visual sensors can further improve navigation from egocentric observations in complex 3D environments, which can be achieved through a deep reinforcement learning approach (Chen et al. 2020).

3.1.7 Deepfakes

Manipulations of visual and auditory media, either for amusement or malicious intent, are not new. However, advances in AI and ML methods have taken this to another level, improving their realism and providing automated processes that make them easier to render. Text generator tools, such as those by OpenAI, can generate coherent paragraphs of text with basic comprehension, translation and summarization but have also been used to create fake news or abusive spam on social media.Footnote 49 Deepfake technologies can also create realistic fake videos by replacing some parts of the media with synthetic content. For example, substituting someone’s face while hair, body and action remain the same (Fig. 7bFootnote 50). Early research created mouth movement synthesis tools capable of making the subject appear to say something different from the actual narrative, e.g., President Barack Obama is lip-synchronized to a new audio track in Suwajanakorn et al. (2017). More recently, DeepFaceLab (Perov et al. 2020) provided a state-of-the-art tool for face replacement; however, manual editing is still required in order to create the most natural appearance. Whole body movements have been generated via learning from a source video to synthesize the positions of arms, legs and body of the target (Chan et al. 2019).

Deep learning approaches to Deepfake generation primarily employ generative neural network architectures, e.g., VAEs (Kietzmann et al. 2020) and GANs (Zakharov et al. 2019). Despite rapid progress in this area, the creation of perfectly natural figures remains challenging; for example deepfake faces often do not blink naturally. Deepfake techniques have been widely used to create pornographic images of celebrities, to cause political distress or social unrest, for purposes of blackmail and to announce fake terrorism events or other disasters. This has resulted in several countries banning non-consensual deepfake content. To counter these often malicious attacks, a number of approaches have been reported and introduced to detect fake digital content (Güera and Delp 2018; Hasan and Salah 2019; Li and Lyu 2019).

3.1.8 Content and captions

There are many approaches that attempt to interpret an image or video and then automatically generate captions based on its content (Pu et al. 2016; Xia and Wang 2005; Xu et al. 2017b). This can successfully be achieved through object recognition (see Sect. 3.4); YouTube has provided this function for both video-on-demand and livestream videos.Footnote 51

Conversely, AI can also help to generate a new image from text. However, this problem is far more complicated; attempts so far have been based on GANs. Early work by Mansimov et al. (2016) was capable of generating background image content with relevant colors but with blurred foreground details. A conditioning augmentation technique was proposed to stabilize the training process of the conditional GAN, and also to improve the diversity of the generated samples (Zhang et al. 2017). Recent methods with significantly increased complexity are capable of learning to generate an image in an object-wise fashion, leading to more natural-looking results (Li et al. 2019c). However, limitations remain; for example, artefacts often appear around object boundaries, or inappropriate backgrounds can be produced if the words of the caption are not given in the correct order.
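As a concrete illustration of the conditioning augmentation idea, the sketch below maps a text embedding to a Gaussian and samples a conditioning vector via the reparameterisation trick, which smooths the text-conditioning manifold; it is a simplified interpretation rather than the implementation of Zhang et al. (2017), and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Map a text embedding to a Gaussian and sample a conditioning vector from it."""
    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        # A single linear layer produces both the mean and the log-variance
        self.fc = nn.Linear(embed_dim, cond_dim * 2)

    def forward(self, text_embedding):
        stats = self.fc(text_embedding)
        mu, log_var = stats.chunk(2, dim=-1)
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)        # reparameterisation trick
        c = mu + eps * std                 # sampled conditioning vector
        return c, mu, log_var              # mu/log_var feed a KL regulariser during training

# The sampled vector c is typically concatenated with a noise vector and fed to the generator.
c, mu, log_var = ConditioningAugmentation()(torch.rand(4, 1024))
```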

3.2 Information analysis

AI has proven capability to process and adapt to large amounts of training data. It can learn and analyze the characteristics of these data, making it possible to classify content and predict outcomes with high levels of confidence. Example applications include advertising and film analysis, as well as image or video retrieval, for example enabling producers to acquire information, analysts to better market products or journalists to retrieve content relevant to an investigation.

3.2.1 Text categorization

Text categorization is a core application of NLP. This generic text processing task is useful in indexing documents for subsequent retrieval and content analysis (e.g., spam detection, sentiment classification, and topic classification). It can be thought of as the generation of summarised texts from full texts. Traditional techniques for both multi-class and multi-label classifications include decision trees, support vector machines (Kowsari et al. 2019), term frequency–inverse document frequency (Azam and Yao 2012), and extreme learning machine (Rezaei-Ravari et al. 2021). Unsupervised learning with self-organizing maps has also been investigated (Pawar and Gawande 2012). Modern NLP techniques are based on deep learning, where generally the first layer is an embedding layer that converts words to vector representations. Additional CNN layers are then added to extract text features and learn word positions (Johnson and Zhang 2015). RNNs (mostly based on LSTM architectures) have also been concatenated to learn sentences and give prediction outputs (Chen et al. 2017; Gunasekara and Nejadgholi 2018). A category sentence generative adversarial network has also been proposed that combines GAN, RNN and reinforcement learning to enlarge training datasets, which improves performance for sentiment classification (Li et al. 2018b). Recently, an attention layer has been integrated into the network to provide semantic representations in aspect-based sentiment analysis (Truşcă et al. 2020). The artist, Vibeke Sorensen, has applied AI techniques to categorize texts from global social networks such as Twitter into six live emotions and display the ‘Mood of the Planet’ artistically using six different colors.Footnote 52
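A minimal sketch of this embedding-plus-CNN pattern for text classification is shown below (PyTorch); the vocabulary size, filter widths and class count are placeholders rather than values taken from any cited work.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """An embedding layer followed by 1-D convolutions for text classification."""
    def __init__(self, vocab_size, num_classes, embed_dim=128, num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Parallel convolutions over 3-, 4- and 5-word windows capture local word order
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (3, 4, 5)]
        )
        self.classifier = nn.Linear(num_filters * 3, num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(feats, dim=1))  # class logits

model = TextCNN(vocab_size=20000, num_classes=4)
logits = model(torch.randint(0, 20000, (8, 50)))         # 8 documents, 50 tokens each
```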

3.2.2 Advertisements and film analysis

AI can assist creators in matching content more effectively to their audiences, for example recommending music and movies in a streaming service such as Spotify or Netflix. Learning systems have also been used to characterize and target individual viewers, optimizing the time they spend on advertising (Lacerda et al. 2006). This approach assesses what users look at, how long they spend browsing adverts, and how they participate on social media platforms. In addition, AI can be used to inform how adverts should be presented to help boost their effectiveness, for example by identifying suitable customers and showing the ad at the right time. This normally involves gathering and analysing personal data in order to predict preferences (Golbeck et al. 2011).

Contextualizing social-media conversations can also help advertisers understand how consumers feel about products and detect fraudulent ad impressions (Ghani et al. 2019). This can be achieved using NLP methods (Young et al. 2018). Recently, an AI-based data analysis tool has been introduced to assist filmmaking companies in developing strategies for how, when and where prospective films should be released (Dodds 2020). The tool employs ML approaches to model patterns in historical data about film performance, associated with each film’s content and themes. Similar analysis is used in the games industry, where the behaviour of each player is analyzed so that the company can better understand their style of play and decide when best to approach them to encourage spending.Footnote 53

3.2.3 Content retrieval

Data retrieval is an important component in many creative processes, since producing a new piece of work generally requires a significant amount of research at the start. Traditional retrieval technologies attach metadata or annotation text (e.g., titles, captions, tags, keywords and descriptions) to the source content (Jeon et al. 2003). The manual annotation process needed to create this metadata is, however, very time-consuming. AI methods have enabled automatic annotation by supporting the analysis of media based on audio and object recognition and scene understanding (Amato et al. 2017; Wu et al. 2015).

In contrast to traditional concept-based approaches, content-based image retrieval (or query by image content (QBIC)) analyzes the content of an image rather than its metadata. A reverse image search technique (one of the techniques Google Images usesFootnote 54) extracts low-level features from an input image, such as points, lines, shapes, colors and textures. The query system then searches for related images by matching these features within the search space. Modern image retrieval methods often employ deep learning techniques, enabling image-to-image searching by extracting low-level features and then combining these to form semantic representations of the reference image that can be used as the basis of a search (Wan et al. 2014). For example, when a user uploads an image of a dog to Google Images, the search engine will return the dog breed, show similar websites by searching with this keyword, and also show selected images that are visually similar to that dog, e.g., with similar colors and background. These techniques have been further improved by exploiting features at local, regional and global image levels (Gordo et al. 2016). GAN approaches combined with learning-based hashing have also been proposed for scalable image retrieval (Song et al. 2018a). Video retrieval can be more challenging due to the requirement to understand activities, interactions between objects and unknown context; RNNs have provided a natural extension that supports the extraction of sequential behaviour in this case (Jabeen et al. 2018).
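The following sketch illustrates the general deep feature matching idea behind such retrieval systems (not Google’s pipeline or any cited author’s method): a pretrained CNN backbone embeds each image and candidates are ranked by cosine similarity. The file names are hypothetical and the weights argument assumes a recent torchvision release (≥ 0.13).

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Pretrained ResNet-50 with its classification head removed, used as a feature extractor
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Return a 2048-dimensional feature vector for one image."""
    with torch.no_grad():
        return backbone(preprocess(Image.open(path).convert("RGB")).unsqueeze(0))

# Rank database images by cosine similarity to the query embedding (file names hypothetical)
query = embed("query.jpg")
database = {p: embed(p) for p in ["a.jpg", "b.jpg", "c.jpg"]}
ranked = sorted(database, key=lambda p: -F.cosine_similarity(query, database[p]).item())
```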

Music information retrieval extracts features of sound, and then converts these to a meaningful representation suitable for a query engine. Several methods for this have been reported, including automatic tagging, query by humming, search by sound and acoustic fingerprinting (Kaminskas and Ricci 2012).

3.2.4 Recommendation services

A recommendation engine is a system that suggests products, services or information to users based on analysis of data. For example, a music curator creates a soundtrack or a playlist of songs with a similar mood and tone, bringing related content to the user. Curation tools, capable of searching large databases and creating recommendation shortlists, have become popular because they can save time, elevate brand visibility and increase connection to the audience. The techniques used in recommendation systems generally fall into three categories: (i) content-based filtering, which uses a single user’s data; (ii) collaborative filtering, the most prominent approach, which derives suggestions from many other users; and (iii) knowledge-based systems, driven by specific user queries, which are generally employed in complex domains where the first two cannot be applied. The approach can also be hybrid; for instance, content-based filtering exploits individual metadata while collaborative filtering finds overlaps between user playlists. Such systems build a profile of what users listen to or watch, and then look at what other people with similar profiles listen to or watch. ESPN and Netflix have partnered with Spotify to curate playlists from the documentary ‘The Last Dance’; Spotify has created music and podcast playlists that viewers can check out after watching the show.Footnote 55
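To illustrate the collaborative filtering category, the toy example below scores unseen items for a user from item-item cosine similarities computed over a small rating matrix; real systems operate at vastly larger scale and typically use matrix factorisation or learned embeddings rather than this minimal sketch.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated"
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

# Item-item cosine similarity computed from the columns of R
norms = np.linalg.norm(R, axis=0, keepdims=True) + 1e-9
S = (R / norms).T @ (R / norms)

def recommend(user, top_k=2):
    """Score unseen items by similarity-weighted sums of the user's own ratings."""
    scores = S @ R[user]
    scores[R[user] > 0] = -np.inf      # do not re-recommend items the user already rated
    return np.argsort(scores)[::-1][:top_k]

print(recommend(user=1))
```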

Content summarization is a fundamental tool that can support recommendation services. Text categorization approaches extract important content from a document into key indices (see Sect. 3.2.1). RNN-based models incorporating attention models have been employed to successfully generate a summary in the form of an abstract (Rush et al. 2015), short paragraph (See et al. 2017) or a personalized sentence (Li et al. 2019a). The gaze behavior of an individual viewer has also been included for personalised text summarization (Yi et al. 2020). The personalized identification of key frames and start points in a video has also been framed as an optimization problem in Chen et al. (2014). ML approaches have been developed to perform content-based recommendations. Multimodal features of text, audio, image, and video content are extracted and used to seek similar content in Deldjoo et al. (2018). This task is relevant to content retrieval, as discussed in Sect. 3.2.3. A detailed review of deep learning for recommendation systems can be found in Batmaz et al. (2019).

3.2.5 Intelligent assistants

Intelligent Assistants employ a combination of AI tools, including many of those mentioned above, in the form of a software agent that can perform tasks or services for an individual. These virtual agents can access information via digital channels to answer questions relating to, for example, weather forecasts, news items or encyclopaedic enquiries. They can recommend songs, movies and places, as well as suggest routes. They can also manage personal schedules, emails, and reminders. The communication can be in the form of text or voice. The AI technologies behind the intelligent assistants are based on sophisticated ML and NLP methods. Examples of current intelligent assistants include Google Assistant,Footnote 56 Siri,Footnote 57 Amazon Alexa and Nina by Nuance.Footnote 58 Similarly, chatbots and other types of virtual assistants are used for marketing, customer service, finding specific content and information gathering (Xu et al. 2017a).

3.3 Content enhancement and post production workflows

It is often the case that original content (whether images, videos, audio or documents) is not fit for the purpose of its target audience. This could be due to noise caused by sensor limitations, the conditions prevailing during acquisition, or degradation over time. AI offers the potential to create assistive intelligent tools that improve both quality and management, particularly for mass-produced content.

3.3.1 Contrast enhancement

The human visual system employs many opponent processes, both in the retina and visual cortex, that rely heavily on differences in color, luminance or motion to trigger salient reactions (Bull and Zhang 2021). Contrast is the difference in luminance and/or color that makes an object distinguishable, and it is an important factor in any subjective evaluation of image quality. Low-contrast images exhibit a narrow range of tones and can therefore appear flat or dull. Non-parametric methods for contrast enhancement involve histogram equalisation, which spreads the intensities of an image across its bit-depth limits, from 0 to a maximum value (e.g., 255 for 8 bits/pixel). Contrast-limited adaptive histogram equalisation (CLAHE) is one example that is commonly used to adjust a histogram and reduce noise amplification (Pizer et al. 1987). Modern methods have further extended performance by exploiting CNNs and autoencoders (Lore et al. 2017), inception modules and residual learning (Tao et al. 2017). Image Enhancement Conditional Generative Adversarial Networks (IE-CGANs) designed to process both visible and infrared images have been proposed by Kuang et al. (2019). Contrast enhancement, along with other methods discussed later, suffers from a fundamental lack of data for supervised training, because real image pairs with low and high contrast are unavailable (Jiang et al. 2021). Most of these methods therefore train their networks with synthetic data (see Sect. 2.3.1).
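As a concrete example of the non-parametric approach, the snippet below applies OpenCV’s CLAHE implementation to a greyscale image and to the luminance channel of a colour image; the clip limit, tile size and file names are illustrative choices.

```python
import cv2

# Load a low-contrast image as greyscale and apply contrast-limited adaptive
# histogram equalisation (CLAHE); the clip limit bounds noise amplification.
gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)     # hypothetical file name
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
cv2.imwrite("enhanced.png", enhanced)

# For colour images, a common approach is to equalise only the luminance channel
# of a Lab-converted image and then convert back to BGR.
bgr = cv2.imread("input_colour.png")
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
out = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
cv2.imwrite("enhanced_colour.png", out)
```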

3.3.2 Colorization

Colorization is the process that adds or restores color in visual media. This can be useful for coloring archive black and white content, enhancing infrared imagery (e.g., in low-light natural history filming) and restoring the color of aged film. A good example is the recent film “They Shall Not Grow Old” (2018) by Peter Jackson, which colorized (and corrected for speed and jerkiness, added sound and converted to 3D) 90 minutes of footage from World War One. The workflow was based on extensive studies of WW1 equipment and uniforms as a reference point and involved time-consuming use of post production tools.

The first AI-based techniques for colorization used a CNN with only three convolutional layers to convert a grayscale image into chrominance values and refined them with bilateral filters to generate a natural color image (Cheng et al. 2015). A deeper network, but still only with eight dilated convolutional layers, was proposed a year later (Zhang et al. 2016). This network captured better semantics, resulting in an improvement on images with distinct foreground objects. Encoder-decoder networks are employed in Xu et al. (2020).
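A minimal sketch in the spirit of such shallow colorization networks (not the cited authors’ exact models) is shown below: three convolutional layers map a luminance channel to two chrominance channels, which would be trained against the ab channels of colour images in the Lab colour space. Layer widths are illustrative.

```python
import torch
import torch.nn as nn

class ShallowColorizer(nn.Module):
    """Three convolutional layers mapping a luminance channel to two chrominance channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, kernel_size=3, padding=1), nn.Tanh(),  # ab channels scaled to [-1, 1]
        )

    def forward(self, luminance):          # (batch, 1, H, W)
        return self.net(luminance)         # (batch, 2, H, W)

# Training would minimise an L1/L2 loss against the true ab channels of colour images;
# at inference the predicted ab is combined with the input L and converted back to RGB.
ab = ShallowColorizer()(torch.rand(1, 1, 128, 128))
```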

Colorization remains a challenging problem for AI, as recognized in the recent Challenge in Computer Vision and Pattern Recognition Workshops (CVPRW) in 2019 (Nah et al. 2019). Six teams competed and all of them employed deep learning methods. Most of the methods adopted an encoder-decoder or a structure based on U-Net (Ronneberger et al. 2015). The deep residual net (ResNet) architecture (He et al. 2016) and the dense net (DenseNet) architecture (Huang et al. 2017) have both demonstrated effective conversion of grayscale to natural-looking color images. More complex architectures have been developed based on GAN structures (Zhang et al. 2019), for example DeOldify and NoGAN (Antic 2020). The latter model was shown to reduce temporal color flickering in video sequences, which is a common problem when enhancing colors on an individual frame-by-frame basis. Infrared images have also been converted to natural color images using CNNs (e.g., Limmer and Lensch 2016) (Fig. 8a) and GANs (e.g., Kuang et al. 2020; Suarez et al. 2017).

Fig. 8 Image enhancement. a Colorization for infrared image (Limmer and Lensch 2016). b Super-resolution (Ledig et al. 2017)

3.3.3 Upscaling imagery: super-resolution methods

Super-resolution (SR) approaches have gained popularity in recent years, enabling the upsampling of images and video spatially or temporally. This is useful for up-converting legacy content for compatibility with modern formats and displays. SR methods increase the resolution (or sample rate) of a low-resolution (LR) image (Fig. 8b) or video. In the case of video sequences, successive frames can, for example, be employed to construct a single high-resolution (HR) frame. Although the basic concept of the SR algorithm is quite simple, there are many problems related to perceptual quality and restriction of available data. For example, the LR video may be aliased and exhibit sub-pixel shifts between frames and hence some points in the HR frame do not correspond to any information from the LR frames.

With deep learning-based technologies, the LR and HR images are matched and used for training architectures such as CNNs, to provide high quality upscaling potentially using only a single LR image (Dong et al. 2014). Sub-pixel convolution layers can be introduced to improve fine details in the image, as reported by Shi et al. Residual learning and generative models are also employed (e.g., Kim et al. 2016; Tai et al. 2017). A generative model with a VGG-basedFootnote 59 perceptual loss function has been shown to significantly improve quality and sharpness when used with the SRGAN by Ledig et al. (2017). Wang et al. (2018) proposed a progressive multi-scale GAN for perceptual enhancement, where pyramidal decomposition is combined with a DenseNet architecture (Huang et al. 2017). The above techniques seek to learn the implicit redundancy present in natural data to recover missing HR information from a single LR instance. For single image SR, the review by Yang et al. (2019) suggests that methods such as EnhanceNet (Sajjadi et al. 2017) and SRGAN (Ledig et al. 2017), which achieve high subjective quality with good sharpness and textural detail, cannot simultaneously achieve low distortion loss (e.g., mean absolute error (MAE) or peak signal-to-noise ratio (PSNR)). A comprehensive survey of image SR is provided by Wang et al. (2020b). It observes that more complex networks generally produce better PSNR results and that most state-of-the-art methods are based on residual learning and use \(\ell _1\) as one of the training losses (e.g., Dai et al. 2019; Zhang et al. 2018a).
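The sub-pixel convolution idea can be sketched as follows: a final convolution predicts \(r^2\) channel groups which are then rearranged spatially to upscale the image by a factor of r in each dimension. This is a generic illustration with arbitrary layer sizes, not a reproduction of any cited network.

```python
import torch
import torch.nn as nn

class SubPixelUpscaler(nn.Module):
    """Sub-pixel convolution: predict r*r channel groups, then rearrange them spatially."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.to_subpixels = nn.Conv2d(channels, 3 * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)   # (B, 3*r^2, H, W) -> (B, 3, r*H, r*W)

    def forward(self, lr_image):
        return self.shuffle(self.to_subpixels(self.features(lr_image)))

sr = SubPixelUpscaler()(torch.rand(1, 3, 64, 64))   # -> (1, 3, 256, 256)
```

The network learns the feature extraction in LR space and defers upsampling to the final layer, which keeps the computation cheap compared with operating on upscaled inputs.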

When applied to video sequences, super-resolution methods can exploit temporal correlations across frames as well as local spatial correlations within them. Early contributions applying deep learning to achieve video SR gathered multiple frames into a 3D volume which formed the input to a CNN (Kappeler et al. 2016). Later work exploited temporal correlation via a motion compensation process before concatenating multiple warped frames into a 3D volume (Caballero et al. 2017) using a recurrent architecture (Huang et al. 2015). The framework proposed by Liu et al. (2018a) upscales each frame before applying another network for motion compensation. The original target frame is fed, along with its neighbouring frames, into intermediate layers of the CNN to perform inter-frame motion compensation during feature extraction (Haris et al. 2019). EDVR (Wang et al. 2019), the winner of the NTIRE19 video restoration and enhancement challenges in 2019,Footnote 60 employs a deformable convolutional network (Dai et al. 2017) to align two successive frames. Deformable convolution is also employed in DNLN (Deformable Non-Local Network) (Wang et al. 2019a). At the time of writing, EDVR (Wang et al. 2019) and DNLN (Wang et al. 2019a) are reported to outperform other methods for video SR, followed by the method of Haris et al. (2019). This suggests that deformable convolution plays an important role in overcoming inter-frame misalignment, producing sharp textural details.

3.3.4 Restoration

The quality of a signal can often be reduced due to distortion or damage. This could be due to environmental conditions during acquisition (low light, atmospheric distortions or high motion), sensor characteristics (quantization due to limited resolution or bit-depth, or electronic noise in the sensor itself) or ageing of the original medium, such as tape or film. The general degradation model can be written as \(I_{obs} = h * I_{ideal} + n\), where \(I_{obs}\) is an observed (distorted) version of the ideal signal \(I_{ideal}\), h is the degradation operator, \(*\) represents convolution, and n is noise. The restoration process tries to reconstruct \(I_{ideal}\) from \(I_{obs}\); h and n are values or functions that depend on the application. Signal restoration can be addressed as an inverse problem and deep learning techniques have been employed to solve it. Below we divide restoration into four classes that relate to work in the creative industries, with examples illustrated in Fig. 9. Further details of deep learning for inverse problem solving can be found in Lucas et al. (2018).

3.3.4.1 Deblurring

Images can be distorted by blur, due to poor camera focus or camera or subject motion. Blur-removal is an ill-posed problem represented by a point spread function (PSF) h, which is generally unknown. Deblurring methods sharpen an image to increase subjective quality, and also to assist subsequent operations such as optical character recognition (OCR) (Hradis et al. 2015) and object detection (Kupyn et al. 2018). Early work in this area analyzed the statistics of the image and attempted to model physical image and camera properties (Biemond et al. 1990). More sophisticated algorithms such as blind deconvolution (BD), attempt to restore the image and the PSF simultaneously (Jia 2007; Krishnan et al. 2011). These methods however assume a space-invariant PSF and the process generally involves several iterations.

As described by the image degradation model, the PSF (h) is related to the target image via a convolution operation. CNNs are therefore inherently applicable for solving blur problems (Schuler et al. 2016). Deblurring techniques based on CNNs (Nah et al. 2017) and GANs (Kupyn et al. 2018) usually employ residual blocks, where skip connections are inserted every two convolution layers (He et al. 2016). Deblurring an image from coarse-to-fine scales is proposed in Tao et al. (2018), where the outputs are upscaled and are fed back to the encoder-decoder structure. The high-level features of each iteration are linked in a recurrent manner, leading to a recursive process of learning sharp images from blurred ones. Nested skip connections were introduced by Gao et al. (2019), where feature maps from multiple convolution layers are merged before applying them to the next convolution layer (in contrast to the residual block approach where one feature map is merged at the next input). This more complicated architecture improves information flow and results in sharper images with fewer ghosting artefacts compared to previous methods.
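A minimal residual block of the kind referred to above (two convolutions bridged by a skip connection) can be sketched as follows; channel counts and depth are illustrative rather than taken from any cited deblurring model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolutions with a skip connection, as commonly stacked in deblurring networks."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)           # skip connection eases gradient flow

# A deblurring network typically stacks many such blocks between a feature-extraction head
# and a reconstruction tail that outputs the sharpened image (or a residual added to the input).
blocks = nn.Sequential(*[ResidualBlock() for _ in range(8)])
features = blocks(torch.rand(1, 64, 128, 128))
```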

In the case of video sequences, deblurring can benefit from the abundant information present across neighbouring frames. The DeBlurNet model (Su et al. 2017) takes a stack of nearby frames as input and uses synthetic motion blur to generate a training dataset. A spatio-temporal recurrent network exploiting a dynamic temporal blending network is proposed by Hyun Kim et al. (2017). Zhang et al. (2018) have concatenated an encoder, recurrent network and decoder to mitigate motion blur. Recently, a recurrent network with iterative updating of the hidden state, denoted IFI-RNN, was trained using a regularization process to create sharp images with fewer ringing artefacts (Nah et al. 2019). A Spatio-Temporal Filter Adaptive Network (STFAN) has been proposed (Zhou et al. 2019), where the convolutional kernel is acquired from the feature values in a spatially varying manner. IFI-RNN and STFAN produce comparable results and currently achieve the best performances in terms of both subjective and objective quality measurements [the average PSNRs of both methods are higher than that of Hyun Kim et al. (2017) by up to 3 dB].

Fig. 9 Restoration for a deblurring (Zhang et al. 2018), b denoising with DnCNN (Zhang et al. 2017), and c turbulence mitigation (Anantrasirichai et al. 2013). Left and right are the original degraded images and the restored images respectively

3.3.4.2 Denoising

Noise can be introduced from various sources during signal acquisition, recording and processing, and is normally attributed to sensor limitations when operating under extreme conditions. It is generally characterized in terms of whether it is additive, multiplicative, impulsive or signal dependent, and in terms of its statistical properties. Noise is not only visually distracting; it can also affect the performance of detection, classification and tracking tools. Denoising nodes are therefore commonplace in post production workflows, especially for challenging low-light natural history content (Anantrasirichai et al. 2020a). In addition, noise can reduce the efficiency of video compression algorithms, since the encoder allocates wasted bits to represent noise rather than signal, especially at low compression levels. This is the reason that film-grain noise suppression tools are employed in certain modern video codecs (such as AV1) prior to encoding by streaming and broadcasting organisations.

The simplest noise reduction technique is weighted averaging, performed spatially and/or temporally as a sliding window, also known as a moving average filter (Yahya et al. 2016). More sophisticated methods, however, perform significantly better and are able to adapt to changing noise statistics. These include adaptive spatio-temporal smoothing through anisotropic filtering (Malm et al. 2007), nonlocal transform-domain group filtering (Maggioni et al. 2012), the Kalman-bilateral mixture model (Zuo et al. 2013), and spatio-temporal patch-based filtering (Buades and Duran 2019). Prior to the introduction of deep neural network denoising, methods such as BM3D (block matching 3-D) (Dabov et al. 2007) represented the state of the art in denoising performance.

Recent advances in denoising have almost entirely been based on deep learning approaches and these now represent the state of the art. RNNs have been employed successfully to remove noise in audio (Maas et al. 2012; Zhang et al. 2018b). A residual noise map is estimated in the Denoising Convolutional Neural Network (DnCNN) method (Zhang et al. 2017) for image-based denoising; for video-based denoising, a spatial and a temporal network are concatenated (Claus and van Gemert 2019), where the latter handles brightness changes and temporal inconsistencies. FFDNet is a modified form of DnCNN that works on reversibly downsampled subimages (Zhang et al. 2018). Liu et al. (2018b) developed MWCNN, a similar system that integrates multiscale wavelet transforms within the network to replace max pooling layers in order to better retain visual information; this integrated wavelet/CNN denoising system currently provides state-of-the-art performance for additive white Gaussian noise (AWGN). VNLnet combines a non-local patch search module with DnCNN; the first part extracts features, while the latter mitigates the remaining noise (Davy et al. 2019). Zhao et al. (2019a) proposed a simple and shallow network, SDNet, which uses six convolution layers with skip connections to create a hierarchy of residual blocks. TOFlow (Xue et al. 2019) offers an end-to-end trainable convolutional network that performs motion analysis and video processing simultaneously. GANs have been employed to estimate a noise distribution, which is subsequently used to augment clean data for training CNN-based denoising networks (such as DnCNN) (Chen et al. 2018b). GANs for denoising data have been proposed for medical imaging (Yang et al. 2018), but they are not popular in the natural image domain due to the limited data resolution of current GANs. However, CycleGAN has recently been modified to attempt denoising and enhancing low-light ultra-high-definition (UHD) videos using a patch-based strategy (Anantrasirichai and Bull 2021).

Recently, the Noise2Noise algorithm has shown that it is possible to train a denoising network without clean data, under the assumption that the data is corrupted by zero-mean noise (Lehtinen et al. 2018). The training pair of input and output images are both noisy and the network learns to minimize the loss function by solving the point estimation problem separately for each input sample. However, this algorithm is sensitive to the loss function used, which can significantly influence the performance of the model. Another algorithm, Noise2Void (Krull et al. 2019), employs a novel blind-spot network that does not include the current pixel in the convolution. The network is trained using the noisy patches as input and output within the same noisy patch. It achieves comparable performance to Noise2Noise but allows the network to learn noise characteristics in a single image.
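The core of the Noise2Noise idea can be sketched as a training step in which both the input and the target are independently corrupted copies of the same image; the small CNN, the Gaussian noise model and the L2 loss below are illustrative choices under the zero-mean noise assumption, not the published training setup.

```python
import torch
import torch.nn as nn

# A simple denoiser stands in for the network; Noise2Noise places no constraint on its design.
denoiser = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(clean_batch, noise_std=0.1):
    """Train with two independently corrupted copies of the same image; the clean batch is
    only used here to synthesise the pair and is never shown to the network as a target."""
    noisy_input  = clean_batch + noise_std * torch.randn_like(clean_batch)
    noisy_target = clean_batch + noise_std * torch.randn_like(clean_batch)
    loss = nn.functional.mse_loss(denoiser(noisy_input), noisy_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = training_step(torch.rand(4, 3, 64, 64))
```

Because the L2 loss is minimised by the conditional mean and the noise is zero-mean, the network converges towards the same solution it would reach with clean targets.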

NTIRE 2020 held a denoising grand challenge within the IEEE CVPR conference that compared many contemporary high performing ML denoising methods on real images (Abdelhamed et al. 2020). The best competing teams employed a variety of techniques using variants on CNN architectures such as U-Net (Ronneberger et al. 2015), ResNet (He et al. 2016) and DenseNet (Huang et al. 2017), together with \(\ell _1\) loss functions and ensemble processing including flips and rotations. The survey by Tian et al. (2020) states that SDNet (Zhao et al. 2019a) achieves the best results on ISO noise, and FFDNet (Zhang et al. 2018) offers the best denoising performance overall, including Gaussian noise and spatially variant noise (non-uniform noise levels).

Neural networks have also been used for other aspects of image denoising: Chen et al. (2018a) have developed specific low-light denoising methods using CNN-based methods; Lempitsky et al. (2018) have developed a deep learning prior that can be used to denoise images without access to training data; and Brooks et al. (2019) have developed specific neural networks to denoise real images through ‘unprocessing’, i.e., they regenerate raw captured images by inverting the processing stages in a camera to form a supervised training system for raw images.

3.3.4.3 Dehazing

In certain situations, fog, haze, smoke and mist can create mood in an image or video. In other cases, they are considered as distortions that reduce contrast, increase brightness and lower color fidelity. Further problems can be caused by condensation forming on the camera lens. The degradation model can be represented as \(I_{obs} = I_{ideal} t + A (1-t)\), where A is the atmospheric light and t is the medium transmission. The transmission t can be estimated using a dark channel prior, based on the observation that the lowest value of each color channel of haze-free images is close to zero (He et al. 2011). In Berman et al. (2016), the true colors are recovered based on the assumption that an image can be faithfully represented with just a few hundred distinct colors. The authors showed that tight color clusters change because of haze and form lines in RGB space, enabling them to be readjusted. The scene radiance (\(I_{ideal}\)) is attenuated exponentially with depth, so some work has included an estimate of the depth map corresponding to each pixel in the image (Kopf et al. 2008). CNNs are employed to estimate the transmission t and the dark channel by Yang and Sun (2018). Cycle-Dehazing (Engin et al. 2018) builds on the CycleGAN architecture (Zhu et al. 2017), combining cycle-consistency loss (see Sect. 3.1.4) and perceptual loss (see Sect. 2.3.2) in order to improve the quality of textural information recovery and generate visually better haze-free images. A comprehensive study and an evaluation of existing single-image dehazing CNN-based algorithms are reported by Li et al. (2019). It concludes that DehazeNet (Cai et al. 2016) performs best in terms of perceptual loss, MSCNN (Tang et al. 2019) offers the best subjective quality and superior detection performance on real hazy images, and AOD-Net (Li et al. 2017) is the most efficient.
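A simplified version of dark channel prior dehazing, following the stated degradation model, is sketched below; the patch size, the 0.1% rule for estimating atmospheric light and the clipping bounds are common but illustrative choices, and production methods add refinement steps (e.g., guided filtering of t) omitted here.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch=15):
    """Per-pixel minimum over colour channels, followed by a local minimum filter."""
    return minimum_filter(image.min(axis=2), size=patch)

def estimate_transmission(hazy, omega=0.95, patch=15):
    """Estimate atmospheric light A and transmission t from the dark channel prior."""
    dc = dark_channel(hazy, patch)
    # Atmospheric light: average colour of the brightest 0.1% of dark-channel pixels
    idx = np.argsort(dc.ravel())[-max(1, dc.size // 1000):]
    A = hazy.reshape(-1, 3)[idx].mean(axis=0)
    # t = 1 - omega * dark_channel(I / A); omega < 1 keeps a little haze for realism
    t = 1.0 - omega * dark_channel(hazy / A, patch)
    return A, np.clip(t, 0.1, 1.0)

def dehaze(hazy):
    """Invert I_obs = I_ideal * t + A * (1 - t) to recover the scene radiance."""
    A, t = estimate_transmission(hazy)
    return np.clip((hazy - A) / t[..., None] + A, 0.0, 1.0)

# hazy: float RGB image in [0, 1], shape (H, W, 3)
restored = dehaze(np.random.rand(240, 320, 3))
```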

A related application is underwater photography (Li et al. 2016), as commonly used in natural history filmmaking. CNNs are employed to estimate the corresponding transmission map or ambient light of an underwater hazy image in Shin et al. (2016). More complicated structures, merging U-Net with multi-scale estimation and incorporating cross-layer connections, produce even better results (Hu et al. 2018).

3.3.4.4 Mitigating atmospheric turbulence

When the temperature difference between the ground and the air increases, the air layers move upwards rapidly, leading to a change in the interference pattern of the light refraction. This is generally observed as a combination of blur, ripple and intensity fluctuations in the scene. Restoring a scene distorted by atmospheric turbulence is a challenging problem. The effect, which is caused by random, spatially varying perturbations, makes a model-based solution difficult and, in most cases, impractical. Traditional methods have involved frame selection, image registration, image fusion, phase alignment and image deblurring (Anantrasirichai et al. 2013; Xie et al. 2016; Zhu and Milanfar 2013). Removing the turbulence distortion from a video containing moving objects is very challenging, as multiple frames are generally used and they need to be aligned. Temporal filtering with local weights determined from optical flow is employed to address this by Anantrasirichai et al. (2018). However, artefacts in the transition areas between foreground and background regions can remain. An ML-based method for removing atmospheric turbulence from a single image is proposed by Gao et al. (2019). Deep learning techniques to solve this problem are still in their early stages. However, one reported method employs a CNN to support deblurring (Nieuwenhuizen and Schutte 2019) and another employs multiple frames using a GAN architecture (Chak et al. 2018). This however appears only to work well for static scenes.

3.3.5 Inpainting

Inpainting is the process of estimating lost or damaged parts of an image or a video. Example applications for this approach include the repair of damage caused by cracks, scratches, dust or spots on film or chemical damage resulting in image degradation. Similar problems arise due to data loss during transmission across packet networks. Related applications include the removal of unwanted foreground objects or regions of an image and video; in this case the occluded background that is revealed must be estimated. An example of inpainting is shown in Fig. 10. In digital photography and video editing, perhaps the most widely used tool is Adobe Photoshop,Footnote 61 where inpainting is achieved using content-aware interpolation by analysing the entire image to find the best detail to intelligently replace the damaged area.

Recently, AI technologies have been reported that model the missing parts of an image using content in proximity to the damage, as well as global information, to assist in extracting semantic meaning. Xie et al. (2012) combine sparse coding with deep neural networks pre-trained with denoising auto-encoders. Dilated convolutions are employed in two concatenated networks for spatial reconstruction of the coarse and fine details (Yu et al. 2018). Some methods allow users to interact with the process, for example inputting information such as strong edges to guide the solution and produce better results. An example of such user-guided free-form image inpainting is given by Yu et al. (2019). Gated convolution is used to learn the soft mask automatically from the data and the content is then generated using both low-level features and extracted semantic meaning. Chang et al. (2019) extend the work of Yu et al. (2019) to video sequences using a GAN architecture. VINet, a video inpainting method reported by Kim et al. (2019), offers the ability to remove moving objects and replace them with content aggregated from both spatial and temporal information using CNNs and recurrent feedback. Black et al. (2020) evaluated state-of-the-art methods by comparing performance based on the classification and retrieval of fixed images. They reported that DFNet (Hong et al. 2019), based on U-Net (Ronneberger et al. 2015) with fusion blocks added in the decoding layers, outperformed other methods over a wide range of missing pixels.
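The gated convolution idea can be sketched as two parallel convolutions, one of which produces a soft mask that modulates the other’s features, allowing the network to distinguish valid pixels from holes; activation choices vary across implementations and this is an illustrative layer rather than the cited authors’ code.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: a learned soft mask modulates the features at every location."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        # Sigmoid gate in [0, 1] scales the feature response per pixel and channel
        return torch.sigmoid(self.gate(x)) * torch.tanh(self.feature(x))

# Input: masked RGB image concatenated with its binary hole mask (4 channels in total)
layer = GatedConv2d(4, 32)
out = layer(torch.rand(1, 4, 128, 128))
```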

Fig. 10 Example of inpainting: (left to right) original image, mask and inpainted image

3.3.6 Visual special effects (VFX)

Closely related to animation, the use of ML-based AI in VFX has increased rapidly in recent years. Examples include the BBC’s His Dark Materials and Avengers: Endgame (Marvel).Footnote 62 These both use a combination of physics models with data-driven results from AI algorithms to create high-fidelity, photorealistic 3D animations, simulations and renderings. ML-based tools transform the actor’s face into the film’s character using head-mounted cameras and facial tracking markers. With ML-based AI, a single image can be turned into a photorealistic and fully-clothed production-level 3D avatar in real time (Hu et al. 2017). Other techniques related to VFX can be found in Sect. 3.1 (e.g., style transfer and deepfakes), Sect. 3.3 (e.g., colorization and super-resolution) and Sect. 3.4 (e.g., tracking and 3D rendering). AI techniquesFootnote 63 are increasingly being employed to reduce the human resources needed for certain labour-intensive or repetitive tasks such as match-move, tracking, rotoscoping, compositing and animation (Barber et al. 2016; Torrejon et al. 2020).

3.4 Information extraction and enhancement

AI methods based on deep learning have demonstrated significant success in recognizing and extracting information from data. They are well suited to this task since successive convolutional layers efficiently perform statistical analysis from low to high level, progressively abstracting meaningful and representative features. Once information is extracted from a signal, it is frequently desirable to enhance it or transform it in some way. This may, for example, make an image more readily interpretable through modality fusion, or translate actions from a real animal to an animation. This section investigates how AI methods can utilize explicit information extracted from images and videos to construct such information and reuse it in new directions or new forms.

3.4.1 Segmentation

Segmentation methods are widely employed to partition a signal (typically an image or video) into a form that is semantically more meaningful and easier to analyze or track. The resulting segmentation map indicates the locations and boundaries of semantic objects or regions with parametric homogeneity in an image. Pixels within a region could therefore represent an identifiable object and/or have shared characteristics, such as color, intensity, and texture. Segmentation boundaries indicate the shape of objects and this, together with other parameters, can be used to identify what the object is. Segmentation can be used as a tool in the creative process, for example assisting with rotoscoping, masking, cropping and for merging objects from different sources into a new picture. Segmentation, in the case of video content, also enables the user to change the object or region’s characteristics over time, for example through blurring, color grading or replacement.Footnote 64

Classification systems can be built on top of segmentation in order to detect or identify objects in a scene (Fig. 11a). This can be compared with the way that humans view a photograph or video, spotting people or other objects and interpreting visual details or the scene as a whole. Since different objects or regions will differ to some degree in terms of the parameters that characterize them, we can train a machine to perform a similar process, providing an understanding of what the image or video contains and of activities in the scene. This can in turn support classification, cataloguing and data retrieval. Semantic segmentation classifies all pixels in an image into predefined categories, implying that it performs segmentation and classification simultaneously. The first deep learning approach to semantic segmentation employed a fully convolutional network (Long et al. 2015). In the same year, the encoder-decoder model in Noh et al. (2015) and the U-Net architecture (Ronneberger et al. 2015) were introduced. Following these, a number of modified networks based on these architectures have been reported (Asgari Taghanaki et al. 2021). GANs have also been employed for the purpose of image translation, in this case to translate a natural image into a segmentation map (Isola et al. 2017). The semantic segmentation approach has also been applied to point cloud data to classify and segment 3D scenes, e.g., Fig. 11b (Qi et al. 2017).
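A toy one-level encoder-decoder with a skip connection, in the spirit of U-Net, is sketched below to show how per-pixel class scores are produced; real segmentation networks use several such levels and far more channels, and the class count here is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """One-level encoder-decoder with a skip connection producing per-pixel class scores."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.enc = conv_block(3, 32)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec = conv_block(64, 32)                          # 64 = upsampled 32 + skip 32
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)  # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.down(e))
        d = self.dec(torch.cat([self.up(b), e], dim=1))        # skip connection restores detail
        return self.head(d)                                    # (B, num_classes, H, W)

logits = TinyUNet()(torch.rand(1, 3, 128, 128))
```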

3.4.2 Recognition

Object recognition has been one of the most common targets for AI in recent years, driven by the complexity of the task but also by the huge amount of labeled imagery available for training deep networks. The performance in terms of mean Average Precision (mAP) for detecting 200 classes has increased more than 300% over the last 5 years (Liu et al. 2020). The Mask R-CNN approach (He et al. 2017) has gained popularity due to its ability to separate different objects in an image or a video giving their bounding boxes, classes and pixel-level masks, as demonstrated by Ren et al. (2017). Feature Pyramid Network (FPN) is also a popular backbone for object detection (Lin et al. 2017). An in-depth review of object recognition using deep learning can be found in Zhao et al. (2019b) and Liu et al. (2020).
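As an illustration of how such instance segmentation models are typically used in practice, the snippet below runs the pretrained Mask R-CNN (ResNet-50 + FPN backbone) shipped with torchvision; the weights argument assumes a recent torchvision release (≥ 0.13) and the confidence threshold is arbitrary.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained Mask R-CNN with a ResNet-50 + FPN backbone (COCO classes)
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Input: a list of float images in [0, 1], each of shape (3, H, W)
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])[0]

# Each prediction contains per-instance boxes, class labels, scores and pixel masks
boxes, labels, scores, masks = (predictions[k] for k in ("boxes", "labels", "scores", "masks"))
keep = scores > 0.7                        # keep only confident detections
print(labels[keep], boxes[keep].shape, masks[keep].shape)
```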

YOLO (You Only Look Once) and its variants represent the current state of the art in real-time object detection and tracking (Redmon et al. 2016). YOLO works on a frame-by-frame basis and is fast enough to process at typical video rates (currently reported up to 55 fps). It divides an image into regions, predicts bounding boxes using a multi-scale approach and gives probabilities for each region. The latest model, YOLOv4 (Bochkovskiy et al. 2020), concatenates YOLOv3 (Redmon and Farhadi 2018) with a CNN that is 53 layers deep, with SPP-blocks (He et al. 2015) or SAM-blocks (Woo et al. 2018) and a multi-scale CNN backbone. YOLOv4 offers real-time computation and high precision [up to 66 mAP on Microsoft’s COCO object dataset (Lin et al. 2014)].

On the PASCAL visual object classes (VOC) Challenge datasets (Everingham et al. 2012), YOLOv3 leads object detection on the VOC2010 dataset with a mAP of 80.8% (YOLOv4 performance on this dataset had not been reported at the time of writing), and NAS-Yolo is the best for the VOC2012 dataset with a mAP of 86.5%Footnote 65 (the VOC2012 dataset has a larger number of segmentations than VOC2010). NAS-Yolo (Yang et al. 2020b) employs Neural Architecture Search (NAS) and reinforcement learning to find the best augmentation policies for the target. In the PASCAL VOC Challenge for semantic segmentation, FlatteNet (Cai and Pu 2019) and FDNet (Zhen et al. 2019) lead the field, achieving mAP scores of 84.3% and 84.0% on VOC2012 data, respectively. FlatteNet integrates a fully convolutional network with pixel-wise visual descriptors converted from feature maps. FDNet links all feature maps from the encoder to each input of the decoder, leading to a very dense network and precise segmentation. On the Microsoft COCO object dataset, MegDetV2 (Li et al. 2019d) ranks first on both the detection leaderboard and the semantic segmentation leaderboard. MegDetV2 combines ResNet with FPN and uses deformable convolution to train the end-to-end network with large mini-batches.

Fig. 11 Segmentation and recognition. a Object recognition (Kim et al. 2020a). b 3D semantic segmentation (Qi et al. 2017)

Recognition of speech and music has also been successfully achieved using deep learning methods. Mobile phone apps that capture a few seconds of sound or music, such as Shazam,Footnote 66 characterize songs based on an audio fingerprint using a spectrogram (a time-frequency graph) that is used to search for a matching fingerprint in a database. Houndify by SoundHoundFootnote 67 exploits speech recognition and searches content across the internet. This technology also provides voice interaction for in-car systems. Google proposed a full visual-speech recognition system that maps videos of lips to sequences of words using spatiotemporal CNNs and LSTMs (Shillingford et al. 2019).
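A highly simplified sketch of spectrogram-based fingerprinting is given below: local time-frequency peaks are extracted from a log-magnitude spectrogram and their coordinates form a crude fingerprint. Deployed systems hash pairs of peaks and match them robustly against large databases; the file name, window sizes and thresholds here are illustrative.

```python
import numpy as np
import librosa
from scipy.ndimage import maximum_filter

# Load a few seconds of audio and compute a log-magnitude spectrogram
y, sr = librosa.load("clip.wav", sr=22050, duration=5.0)     # hypothetical file name
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
S_db = librosa.amplitude_to_db(S, ref=np.max)

# Pick local time-frequency peaks; the set of (frequency bin, time frame) pairs acts as a
# crude fingerprint that could be hashed and looked up in a database
peaks = (S_db == maximum_filter(S_db, size=(20, 20))) & (S_db > -40)
freq_bins, time_frames = np.nonzero(peaks)
fingerprint = set(zip(freq_bins.tolist(), time_frames.tolist()))
print(len(fingerprint), "landmark peaks")
```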

Emotion recognition has also been studied for over a decade. AI methods have been used to learn, interpret and respond to human emotion, via speech (e.g., tone, loudness, and tempo) (Kwon et al. 2003), face detection (e.g., eyebrows, the tip of the nose, the corners of the mouth) (Ko 2018), and combined audio and video (Hossain and Muhammad 2019). Such techniques have also been used in security systems and for fraud detection.

A further task, relevant to video content, is action recognition. This involves capturing spatio-temporal context across frames, for example: jumping into a pool, swimming, getting out of the pool. Deep learning has again been extensively exploited in this area, with the first report based on a 3D CNN (Ji et al. 2013). An excellent state-of-the-art review on action recognition can be found in Yao et al. (2019). More recent advances include temporal segment networks (Wang et al. 2016) and temporal binding networks, where the fusion of audio and visual information is employed (Kazakos et al. 2019). EPIC-KITCHENS is a large dataset focused on egocentric vision that provides audio-visual, non-scripted recordings in native environments (Damen et al. 2018); it has been extensively used to train action recognition systems. Research on sign language recognition is also relevant to creative applications, since it studies body posture, hand gesture and facial expression, and hence involves segmentation, detection, classification and 3D reconstruction (Jalal et al. 2018; Kratimenos et al. 2020; Adithya and Rajesh 2020). Moreover, visual and linguistic modelling has been combined to enable translation between spoken/written language and continuous sign language videos (Bragg et al. 2019).

3.4.3 Salient object detection

Salient object detection (SOD) is a task based on visual attention mechanisms, in which algorithms aim to identify the objects or regions that are most likely to be the focus of attention. SOD methods can benefit the creative industries in applications such as image editing (Cheng et al. 2010; Mejjati et al. 2020), content interpretation (Rutishauser et al. 2004), egocentric vision (Anantrasirichai et al. 2018), VR (Ozcinar and Smolic 2018), and compression (Gupta et al. 2013). The purpose of SOD differs from fixation detection, which predicts where humans look, but there is a strong correlation between the two (Borji et al. 2019). In general, the SOD process involves two tasks: saliency prediction and segmentation. Recent supervised learning technologies have significantly improved SOD performance. Hou et al. (2019) merge multi-level features of a VGG network with fusion and cross-entropy losses. A survey by Wang et al. (2021) reveals that most SOD models employ VGG and ResNet as backbone architectures and train the model with the standard binary cross-entropy loss. More recent work has developed end-to-end frameworks with GANs (Wang et al. 2020a), and some works include depth information from RGB-D cameras (Jiang et al. 2020). More details on recent SOD for RGB-D data can be found in Zhou et al. (2021). When detecting salient objects in video, an LSTM module is used to learn saliency shifts (Fan et al. 2019). The SOD approach has also been extended to co-salient object detection (CoSOD), which aims to detect co-occurring salient objects in multiple images (Fan et al. 2020).

3.4.4 Tracking

Object tracking is the temporal process of locating objects in consecutive video frames. It takes an initial set of object detections (see Sect. 3.4), creates a unique ID for each of these initial detections, and then tracks each of the objects, via their properties, over time. Similar to segmentation, object tracking can support the creative process, particularly in editing. For example, a user can identify and edit a particular area or object in one frame and, by tracking the region, these adjusted parameters can be applied to the rest of the sequence regardless of object motion. Semi-supervised learning is also employed in SiamMask (Wang et al. 2019b) offering the user an interface to define the object of interest and to track it over time.

Similar to object recognition, deep learning has become an effective tool for object tracking, particularly when tracking multiple objects in a video (Liu et al. 2020). Recurrent networks have been integrated with object recognition methods to track the detected objects over time (e.g., Fang 2016; Gordon et al. 2018; Milan et al. 2017). VOT benchmarks (Kristan et al. 2016) have been reported for real-time visual object tracking challenges run in both the ICCV and ECCV conferences, and tracking performance has been observed to improve year on year. The best performing methods include Re\(^3\) (Gordon et al. 2018) and Siamese-RPN (Li et al. 2018), achieving 150 and 160 fps respectively at an expected overlap of 0.2. MOTChallengeFootnote 68 and KITTIFootnote 69 are the most commonly used datasets for training and testing multiple object tracking (MOT). At the time of writing, ReMOTS (Yang et al. 2020a) is the best performer with a mask-based MOT accuracy of 83.9%. ReMOTS fuses the segmentation results of Mask R-CNN (He et al. 2017) and a ResNet-101 (He et al. 2016) backbone extended with FPN.

3.4.5 Image fusion

Image fusion provides a mechanism to combine multiple images (or regions therein, or their associated information) into a single representation that has the potential to aid human visual perception and/or subsequent image processing tasks. A fused image (e.g., a combination of IR and visible images) aims to express the salient information from each source image without introducing artefacts or inconsistencies. A number of applications have exploited image fusion to combine complementary information into a single image, where the capability of a single sensor is limited by design or observational constraints. Existing pixel-level fusion schemes range from simple averaging of the pixel values of registered (aligned) images to more complex multiresolution pyramids, sparse methods (Anantrasirichai et al. 2020b) and methods based on complex wavelets (Lewis et al. 2007). Deep learning techniques have been successfully employed in many image fusion applications. An all-in-focus image is created using multiple images of the same scene taken with different focal settings (Liu et al. 2017) (Fig. 12a). Multi-exposure deep fusion is used to create high-dynamic range images by Prabhakar et al. (2017). A review of deep learning for pixel-level image fusion can be found in Liu et al. (2018). Recently, GANs have also been developed for this application (e.g., Ma et al. 2019c), with an example of image blending using a guided mask (e.g., Wu et al. 2019).
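A very simple pixel-level multi-focus fusion rule, selecting whichever registered source image has the higher local variance (a crude focus measure), is sketched below; it ignores the consistency checks and multiresolution machinery of the methods cited above, and the file names are hypothetical.

```python
import numpy as np
import cv2

def multifocus_fuse(img_a, img_b, window=9):
    """Fuse two registered greyscale images by picking, per pixel, the source with the
    higher local variance (used here as a crude sharpness/focus measure)."""
    def local_variance(img):
        img = img.astype(np.float32)
        mean = cv2.blur(img, (window, window))
        mean_sq = cv2.blur(img * img, (window, window))
        return mean_sq - mean * mean

    mask = local_variance(img_a) >= local_variance(img_b)
    return np.where(mask, img_a, img_b)

a = cv2.imread("focus_near.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
b = cv2.imread("focus_far.png", cv2.IMREAD_GRAYSCALE)
cv2.imwrite("all_in_focus.png", multifocus_fuse(a, b))
```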

The performance of a fusion algorithm is difficult to assess quantitatively as no ground truth exists in the fused domain. Ma et al. (2019b) show that a guided filtering-based fusion (Li et al. 2013) achieves the best results based on the visual information fidelity (VIF) metric, but also note that fused images with very low correlation coefficients, measuring the degree of linear correlation between the fused image and its source images, can perform well in subjective assessment.

Fig. 12 Information enhancement. a Multifocal image fusion. b 2D to 3D face conversion generated using the algorithm proposed by Jackson et al. (2017)

3.4.6 3D reconstruction and rendering

In the human visual system, a stereopsis process (together with many other visual cues and priors (Bull and Zhang 2021)) creates a perception of three-dimensional (3D) depth from the combination of two spatially separated signals received by the visual cortex from our retinas. The fusion of these two slightly different pictures gives the sensation of strong three-dimensionality by matching similarities. To provide stereopsis in machine vision applications, images are captured simultaneously from two cameras with parallel camera geometry, and an implicit geometric process is used to extract 3D information from these images. This process can be extended using multiple cameras in an array to create a full volumetric representation of an object. This approach is becoming increasingly popular in the creative industries, especially for special effects that create digital humansFootnote 70 in high-end movies or live performance.
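As a simple illustration of the implicit geometry (a standard relation for rectified parallel cameras, not specific to any system discussed here): with focal length \(f\), camera baseline \(B\) and measured disparity \(d\) between matched points, depth is recovered as \(Z = fB/d\), so small disparity errors translate into increasingly large depth errors for distant objects.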

To convert 2D to 3D representations (including 2D+t to 3D), the first step is normally depth estimation, which is performed using stereo or multi-view RGB camera arrays. Consumer RGB-D sensors can also be used for this purpose (Maier et al. 2017). Depth estimation based on disparity can also be assisted by motion parallax (using a single moving camera), focus and perspective. For example, Ummenhofer et al. (2017) learn motion parallax using a chain of encoder-decoder networks. Google Earth computes topographical information from images captured from aircraft and adds texture to create a 3D mesh. As demands for higher depth accuracy have increased and real-time computation has become feasible, deep learning methods (particularly CNNs) have gained more attention. A number of network architectures have been proposed for stereo configurations, including the pyramid stereo matching network (PSMNet) (Chang and Chen 2018), a stacked hourglass architecture (Newell et al. 2016), a sparse cost volume network (SCV-Net) (Lu et al. 2018), a fast DenseNet (Anantrasirichai et al. 2021) and a guided aggregation net (GA-Net) (Zhang et al. 2019). On the KITTI Stereo benchmark (Geiger et al. 2012), the LEAStereo team from Monash University ranks first at the time of writing (with 1.65% of pixels reported as erroneous). They exploit a neural architecture search (NAS) techniqueFootnote 71 in which the best network is designed by another neural network.
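
Whichever network estimates the disparity map, converting it to metric depth for a rectified (parallel) stereo rig is a simple geometric step, \(Z = fB/d\), where \(f\) is the focal length in pixels, \(B\) the camera baseline and \(d\) the disparity. The sketch below (with hypothetical function and parameter names) illustrates this conversion.

```python
# Depth from disparity for a rectified stereo pair: Z = f * B / d.
# Deep stereo networks such as PSMNet or GA-Net estimate d; the conversion
# to metric depth is this purely geometric step.
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disparity=0.1):
    """disparity: array of pixel disparities; returns depth in metres."""
    d = np.maximum(disparity, min_disparity)   # avoid division by zero
    return focal_px * baseline_m / d
```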

3D reconstruction generally uses one of three representations: volumetric, surface-based or multi-plane. Volumetric representations can be achieved by extending the 2D convolutions used in image analysis. Surface-based representations, e.g., meshes, can be more memory-efficient, but are not regular structures and thus do not map easily onto deep learning architectures. The state-of-the-art methods for volumetric and surface-based representations are Pix2Vox (Xie et al. 2019) and AllVPNet (Soltani et al. 2017), reporting Intersection-over-Union (IoU) measures of 0.71 and 0.83 respectively, reconstructed from 20 views on the ShapeNet dataset benchmark (Chang et al. 2015). GAN architectures have been used to generate non-rigid surfaces from a monocular image (Shimada et al. 2019). The third type of representation is formed from multiple planes of the scene and offers a trade-off between the first two: it is efficient to store while remaining amenable to training with deep learning. The method in Flynn et al. (2019), developed by Google, achieves view synthesis with learned gradient descent. A review of state-of-the-art 3D reconstruction from images using deep learning can be found in Han et al. (2019).
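
The IoU figures quoted above are computed over voxel occupancy grids. A minimal sketch of that measure is given below (thresholds and grid resolutions vary between papers, and the function name is ours).

```python
# Intersection-over-Union between two voxel occupancy grids, the measure
# quoted for ShapeNet reconstruction benchmarks.
import numpy as np

def voxel_iou(pred, target, threshold=0.5):
    """pred, target: 3D arrays of occupancy probabilities or {0, 1} labels."""
    p = pred >= threshold
    t = target >= threshold
    intersection = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    return intersection / union if union > 0 else 1.0
```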

Recently, low-cost video plus depth (RGB-D) sensors have become widely available. Key challenges in RGB-D video processing include synchronisation, alignment and data fusion between multimodal sensors (Malleson et al. 2019). Deep learning approaches have also been used to achieve semantic segmentation, multi-modal feature matching and noise reduction for RGB-D information (Zollhöfer et al. 2018). Light field cameras, which capture both the intensity and direction of light rays, produce denser data than RGB-D cameras. Depth information for a scene can be extracted from the displacements across the image array, and 3D rendering using deep learning approaches has been reported by Shi et al. (2020). Recent state-of-the-art light field methods can be found in the review by Jiang et al. (2020).

3D reconstruction from a single image is an ill-posed problem. However, it is possible with deep learning because of the network’s ability to learn semantic meaning (similar to object recognition, described in Sect. 3.4). Trained on 2D RGB images with 3D ground truth, a model can predict what kind of scene and objects are contained in the test image. Deep learning-based methods also provide state-of-the-art performance for generating the corresponding right view from a left view in a stereo pair (Xie et al. 2016), and for converting 2D face images to 3D face reconstructions using CNN-based encoder-decoder architectures (Bulat and Tzimiropoulos 2017; Jackson et al. 2017), autoencoders (Tewari et al. 2020) and GANs (Tian et al. 2018) (Fig. 12b). Creating 3D models of bodies from photographs is the focus of Kanazawa et al. (2018); here, a CNN is used to translate a single 2D image of a person into parameters of shape and pose, as well as to estimate camera parameters. This is useful for applications such as virtual modelling of clothes in the fashion industry. A recent method reported by Mescheder et al. (2019) is able to generate a realistic 3D surface from a single image by introducing the idea of a continuous decision boundary within the deep neural network classifier. For 2D image to 3D object generation, generative models offer the best performance to date, with the state-of-the-art method, GAL (Jiang et al. 2018), achieving an average IoU of 0.71 on the ShapeNet dataset. The creation of a 3D photograph from 2D images is also possible via tools such as SketchUpFootnote 72 and Smoothie-3D.Footnote 73 Very recently (February 2020), Facebook enabled users to add a 3D effect to any 2D image.Footnote 74 They trained a CNN on millions of pairs of public 3D images and their associated depth maps. Their Mesh R-CNN (Gkioxari et al. 2019) leverages the Mask R-CNN approach (He et al. 2017) for object recognition and segmentation to help estimate depth cues. A common limitation when converting a single 2D image to a 3D representation arises from occluded areas that require spatial interpolation.

AI has also been used to increase the dimensionality of audio signals. Humans can locate a sound in space because the brain senses the differences between the arrival times of the sound at the left and right ears, and between the volumes (interaural levels) that the two ears hear. Moreover, our outer ears (pinnae) distort the sound, telling us whether it emanates from in front of or behind the head. With this knowledge, Gao and Grauman (2019) created binaural audio from a mono signal, driven by the subject’s visual environment, to enrich the perceptual experience of the scene. Their framework exploits a U-Net to extract audio features, which are merged with visual features extracted from a ResNet to predict the sound for the left and right channels. Subjective tests indicate that this method can improve realism and the sensation of being in a 3D space. Morgado et al. (2018) expand mono audio, recorded using a 360\(^\circ\) video camera, to sound over the full viewing surface of a sphere. The process extracts semantic environments from the video with CNNs, and high-level visual and audio features are then combined to generate the sound corresponding to different viewpoints. Vasudevan et al. (2020) also include depth estimation to improve the realism of the resulting super-resolved spatial sound.
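
The interaural time and level differences described above can be illustrated directly from a stereo recording. The sketch below is illustrative only, with our own function names, and is unrelated to the visual-to-binaural networks cited; it estimates both cues with basic signal processing.

```python
# Estimating the interaural time difference (ITD) and level difference (ILD)
# between left and right channels: an illustration of the localisation cues
# described in the text.
import numpy as np

def interaural_time_difference(left, right, sample_rate):
    """left, right: 1D arrays of equal length. Returns the lag in seconds,
    positive when the sound reaches the left ear first (right channel delayed)."""
    corr = np.correlate(right, left, mode='full')
    lag = np.argmax(corr) - (len(left) - 1)
    return lag / sample_rate

def interaural_level_difference(left, right):
    """Level difference in dB between the two channels."""
    rms = lambda x: np.sqrt(np.mean(x ** 2) + 1e-12)
    return 20.0 * np.log10(rms(left) / rms(right))
```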

3.5 Data compression

Visual information is the primary consumer of communications bandwidth across broadcasting and internet communications. The demand for increased quality and quantity of visual content is particularly driven by the creative media sector, with growing numbers of users expecting higher quality and new experiences. Cisco predict, in their Visual Networking Index report (Barnett et al. 2018), that there will be 4.8 zettabytes (4.8 \(\times 10^{21}\) bytes) of global annual internet traffic by 2022—equivalent to all movies ever made crossing global IP networks in 53 seconds. Video will account for 82 percent of all internet traffic by 2022. This growth will be driven by demands for new formats and more immersive experiences with multiple viewpoints, greater interactivity, higher spatial resolutions, frame rates and dynamic range, and wider color gamut. It is creating a major tension between available network capacity and required video bit rate. Network operators, content creators and service providers all need to transmit the highest quality video at the lowest bit rate, and this can only be achieved through the exploitation of content awareness and perceptual redundancy to enable better video compression.

Traditional image encoding systems (e.g., JPEG) encode each picture without reference to any other frames. This is normally achieved by exploiting spatial redundancy through transform-based decorrelation, followed by quantization and variable-length symbol encoding. While video can also be encoded as a series of still images, significantly higher coding gains can be achieved if temporal redundancies are also exploited, using inter-frame motion prediction and compensation. In this case the encoder processes the low-energy residual signal remaining after prediction, rather than the original frame. A thorough coverage of image and video compression methods is provided by Bull and Zhang (2021).
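
The transform-plus-quantization step at the heart of intra-frame coding can be sketched in a few lines. The toy example below is illustrative only: real codecs add perceptual quantization matrices, zig-zag scanning and entropy coding, and the function names are ours. It applies a block DCT and uniform quantization to an 8\(\times\)8 block of 8-bit pixels.

```python
# Toy illustration of JPEG-style intra coding: block-wise DCT decorrelation
# followed by uniform quantization (no scanning or entropy coding).
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(block, step=16):
    """block: 8x8 float array of 8-bit pixel values; returns quantized DCT indices."""
    coeffs = dctn(block - 128.0, norm='ortho')   # decorrelate, zero-centred
    return np.round(coeffs / step).astype(np.int32)

def decode_block(indices, step=16):
    """Inverse of encode_block (up to quantization error)."""
    return idctn(indices * float(step), norm='ortho') + 128.0
```

The quantization step size controls the rate-distortion trade-off: larger steps discard more high-frequency detail and produce fewer bits after entropy coding.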

Deep neural networks have gained popularity for image and video compression in recent years and can achieve consistently greater coding gains than conventional approaches. Deep compression methods are also now starting to be considered as components in mainstream video coding standards such as VVC and AV2. They have been applied to optimize a range of coding tools, including intra prediction (Li et al. 2018; Schiopu et al. 2019), motion estimation (Zhao et al. 2019b), transforms (Liu et al. 2018), quantization (Liu et al. 2019), entropy coding (Zhao et al. 2019a) and loop filtering (Lu et al. 2019). Post-processing is also commonly applied at the video decoder to reduce various coding artefacts and enhance the visual quality of the reconstructed frames (e.g., Xue and Su 2019; Zhang et al. 2020). Other work has implemented a complete coding framework based on neural networks using end-to-end training and optimisation (Lu et al. 2020). This approach represents a radical departure from conventional coding strategies and, while it is not yet competitive with state-of-the-art conventional video codecs, it holds significant promise for the future.
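
A typical decoder-side post-processing network takes the decoded frame and predicts a residual correction. The PyTorch sketch below shows a generic residual CNN post-filter of this kind; it is our own minimal illustration (the class name, depth and width are arbitrary), not one of the cited architectures, and in practice it would be trained with an L1 or L2 loss against the uncompressed original frames.

```python
# Minimal residual CNN post-filter of the kind applied at the decoder to
# reduce coding artefacts (a sketch only, not any cited architecture).
import torch
import torch.nn as nn

class PostFilter(nn.Module):
    def __init__(self, channels=64, layers=8):
        super().__init__()
        blocks = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(layers - 2):
            blocks += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        blocks += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*blocks)

    def forward(self, decoded):
        # Predict the coding error and add it back to the decoded frame.
        return decoded + self.body(decoded)
```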

Perceptually based resampling methods, built on CNN and GAN super-resolution, have also been introduced recently. Disney Research proposed a deep generative video compression system (Han et al. 2019) that involves downscaling using a VAE and entropy coding via a deep sequential model. ViSTRA2 (Zhang et al. 2019b) exploits adaptation of spatial resolution and effective bit depth, downsampling these parameters at the encoder based on perceptual criteria and up-sampling at the decoder using a deep convolutional neural network. ViSTRA2 has been integrated with the reference software of both HEVC (HM 16.20) and VVC (VTM 4.01), and evaluated under the Joint Video Exploration Team Common Test Conditions using the Random Access configuration. Results show consistent and significant compression gains against both HM and VTM based on Bjøntegaard Delta measurements, with average BD-rate savings of 12.6% (PSNR) and 19.5% (VMAF) over HM, and 5.5% and 8.6% over VTM. This work has been extended to a GAN architecture by Ma et al. (2020a). Recently, Mentzer et al. (2020) optimized a neural compression scheme with a GAN, yielding reconstructions with high perceptual fidelity, and Ma et al. (2021) combined several quantitative losses to maximize perceptual video quality when training a relativistic sphere GAN.
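
The BD-rate figures quoted above are computed with the standard Bjøntegaard metric: fit the rate-distortion points of each codec with a cubic polynomial in the quality dimension and integrate the gap in log-rate over the overlapping quality range. A compact sketch of that calculation (the function name and argument conventions are ours) is shown below.

```python
# Sketch of the Bjøntegaard Delta rate (BD-rate) calculation.
import numpy as np

def bd_rate(rates_ref, quality_ref, rates_test, quality_test):
    """Rates in kbps, quality as PSNR or VMAF; returns the average % rate
    change of the test codec relative to the reference (negative = saving)."""
    log_ref, log_test = np.log(rates_ref), np.log(rates_test)
    p_ref = np.polyfit(quality_ref, log_ref, 3)      # cubic fit: quality -> log rate
    p_test = np.polyfit(quality_test, log_test, 3)
    lo = max(min(quality_ref), min(quality_test))    # overlapping quality range
    hi = min(max(quality_ref), max(quality_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```

A negative value indicates the bit-rate saving of the test codec over the reference at equal quality.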

As in all deep learning applications, training data are a key factor in compression performance. Research by Ma et al. (2020) has demonstrated the importance of large and diverse datasets when developing CNN-based coding tools. Their BVI-DVC database is publicly available and produces significant improvements in coding gain across a wide range of deep learning networks for coding tools such as loop filtering and post-decoder enhancement. An extensive review of AI for compression can be found in Bull and Zhang (2021) and Ma et al. (2020b).

4 Future challenges for AI in the creative sector

There will always be philosophical and ethical questions relating to the creative capacity, ideas and thought processes, particularly where computers or AI are involved. The debate often focuses on the fundamental difference between humans and machines. In this section we will briefly explore some of these issues and comment on their relevance to and impact on the use of AI in the creative sector.

4.1 Ethical issues, fakes and bias

An AI-based machine can work ‘intelligently’, providing an impression of understanding, but it nonetheless performs without ‘awareness’ of the wider context. It can, however, offer probabilities or predictions of what could happen next, chosen from several candidates on the basis of a model trained on the available data. With current technology, AI cannot truly offer broad context, emotion or social relationships. However, it can affect modern human life culturally and societally. UNESCO has specifically commented on the potential impact of AI on culture, education, scientific knowledge, communication and information provision, particularly in relation to the problems of the digital divide.Footnote 75 AI appears to amplify the gap between those who can and those who cannot use new digital technologies, leading to increasing inequality of information access. In the context of the creative industries, UNESCO notes that collaboration between intelligent algorithms and human creativity may eventually bring important challenges for the rights of artists.

One would expect the authorship of AI creations to reside with those who develop the algorithms that drive the artwork. Issues of piracy and originality thus need special attention and careful definition, and both deliberate and, perhaps, unintentional exploitation need to be addressed. We must also be cognizant of how easily AI technologies can be accessed and misused in the wrong hands. AI systems are now becoming very competent at creating fake images, videos, conversations and all manner of content. Against this, as reported in Sect. 3.1.7, other AI-based methods are under development that can, with some success, detect these fakes.

The primary learning algorithms for AI are data-driven. This means that, if the data used for training are unevenly distributed or unrepresentative because of human selection criteria or labeling, the results after learning can be equally biased and ultimately judgemental. For example, streaming media services suggest movies that users may enjoy, and these suggestions must not privilege specific works over others. Similarly, face recognition or autofocus methods must be trained on a broad range of skin types and facial features to avoid failure for certain ethnic groups or genders. Bias in algorithmic decision-making is also a concern of governments across the world.Footnote 76 Well-designed AI systems can not only increase the speed and accuracy with which decisions are made, but they can also reduce human bias in decision-making processes. However, throughout the lifetime of a trained AI system, the complexity of the data it processes is likely to grow, so even a network originally trained with balanced data may subsequently develop some bias; periodic retraining may therefore be needed. A review of the various sources of bias in ML is provided by Ntoutsi et al. (2020).

Dignum (2018) provides a useful classification of the relationships between ethics and AI, defining three categories: (i) Ethics by Design, methods that ensure ethical behaviour in autonomous systems; (ii) Ethics in Design, methods that support the analysis of the ethical implications of AI systems; and (iii) Ethics for Design, codes and protocols that ensure the integrity of developers and users. A discussion of ethics associated with AI in general can be found in Bostrom and Yudkowsky (2014).

AI can, of course, also be used to help identify and resolve ethical issues. For example, Instagram uses an anti-bullying AIFootnote 77 to identify negative comments before they are published and asks users to confirm if they really want to post such messages.

4.2 The human in the loop: AI and creativity

Throughout this review we have recognized and reported on the successes of AI in supporting and enhancing processes within constrained domains where there is good availability of data as a basis for ML. We have seen that AI-based techniques work very well when they are used as tools for information extraction, analysis and enhancement. Deep learning methods that characterize data from low-level features and connect these to extract semantic meaning are well suited to such applications. AI can thus be used successfully to perform tasks that are too difficult or too time-consuming for humans, such as searching through a large database and examining its data to draw conclusions. Post-production workflows will therefore see increased use of AI, including enhanced tools for denoising, colorization, segmentation, rendering and tracking. Motion and volumetric capture methods will benefit from enhanced parameter selection and rendering tools. Virtual production methods and games technologies will see greater convergence and increased reliance on AI methodologies.

In all the above examples, AI tools will not be used in isolation as a simple black box solution. Instead, they must be designed as part of the associated workflow and incorporate a feedback framework with the human in the loop. For the foreseeable future, humans will need to check the outputs from AI systems, make critical decisions, and feedback ‘faults’ that will be used to adjust the model. In addition, the interactions between audiences or users and machines are likely to become increasingly common. For example, AI could help to create characters that learn context in location-based storytelling and begin to understand the audience and adapt according to interactions.

Currently, the most effective AI algorithms still rely on supervised learning, where ground truth data readily exist or where humans have labeled the dataset prior to using it for training the model (as described in Sect. 2.3.1). In contrast, truly creative processes do not have pre-defined outcomes that can simply be classed as good or bad. Although many may follow contemporary trends or be in some way derivative, based on known audience preferences, there is no obvious way of measuring the quality of the result in advance. Creativity almost always involves combining ideas, often in an abstract yet coherent way, from different domains or multiple experiences, driven by curiosity and experimentation. Hence, labeling of data for these applications is not straightforward or even possible in many cases. This leads to difficulties in using current ML technologies.

In the context of creating a new artwork, generating low-level features from semantics is a one-to-many relationship, leading to inconsistencies between outputs. For example, if a group of artists is asked to draw a cat, the results will all differ in color, shape, size, context and pose. The results of a creative process are thus unlikely to be structured, and hence may not be suitable for use with ML methods. We have previously referred to the potential of generative models, such as GANs, in this respect, but these are not yet sufficiently robust to consistently create results that are realistic or valuable. Also, most GAN-based methods are currently limited to the generation of relatively small images and are prone to artefacts at transitions between foreground and background content. It is clear that substantial additional work is needed before significant value can be extracted from AI in this area.

4.3 The future of AI technologies

Research into, and development of, AI-based solutions continue apace. AI is attracting major investment from governments and large international organisations, alongside venture capital investment in start-up enterprises. ML algorithms will be the primary driver for most AI systems in the future, and AI solutions will, in turn, impact an even wider range of sectors. The pace of AI research has been predicated not just on innovative algorithms (the basics are not too dissimilar to those published in the 1980s), but also on our ability to generate, access and store massive amounts of data, and on advances in graphics processing architectures and parallel hardware capable of processing data at this scale. New computational solutions, such as quantum computing, will likely play an increasing role in this respect (Welser et al. 2018).

In order to produce an original work, such as music or abstract art, it would be beneficial to support increased diversity and context when training AI systems. The quality of the solution in such cases is difficult to define and will inevitably depend on audience preferences and popular contemporary trends. High-dimensional datasets that can represent some of these characteristics will therefore be needed. Furthermore, the loss functions that drive the convergence of a network’s internal weights must reflect perceptual quality rather than simple mathematical differences. Loss functions that better reflect human perception of performance or quality are therefore an important area for further research.

ML-based AI algorithms are data-driven; hence, how to select and prepare data for creative applications will be key to future developments. Defining, cleaning and organizing bias-free data for creative applications are not straightforward tasks. Because data collection and labeling can be highly resource intensive, labeling services are expected to become more popular in the future. Amazon currently offers a cloud management tool, SageMaker,Footnote 78 that uses ML to determine which data in a dataset need to be labeled by humans, and consequently sends these data to human annotators through its Mechanical Turk system or via third-party vendors. This can reduce the resources needed by developers during the key data preparation process. In this and other contexts, AI may converge with blockchain technologies. Blockchains create decentralized, distributed, secure and transparent networks that can be accessed by anyone in public (or private) blockchain networks. Such systems may be a means of trading trusted AI assets, or alternatively AI agents may be trusted to trade other assets (e.g., financial or creative assets) across blockchain networks. Recently, Microsoft has tried to improve small ML models hosted on public blockchains and plans to expand to more complex models in the future.Footnote 79 Blockchains make it possible to reward participants who help to improve models, while providing a level of trust and security.

As the amount of unlabeled data grows dramatically, unsupervised or self-supervised ML algorithms are prime candidates for underpinning the next generation of ML. Techniques exist that employ neural networks to learn the statistical distribution of the input data and then transfer it to the distribution of the output data (Damodaran et al. 2018; Xu et al. 2019; Zhu et al. 2017). These techniques do not require precisely matched pairs of inputs and ground truth, relaxing the constraints on a range of applications.

It is clear that current AI methods do not mimic the human brain, or even parts of it, particularly closely. The data-driven learning approach with error backpropagation is not apparent in human learning. Humans learn in complex ways that combine genetics, experience and prediction-failure reinforcement. A nice example is provided by Yann LeCun of NYU and FacebookFootnote 80 who describes a 4–6 month old baby being shown a picture of a toy floating in space; the baby shows little surprise that the object defies gravity. Showing the same image to the same child at around 9 months produces a very different result, even though the child is very unlikely to have been explicitly taught about gravity. It has instead learnt by experience and is capable of transferring its knowledge across a wide range of scenarios never previously experienced. This form of reinforcement and transfer learning holds significant potential for the next generation of ML algorithms, providing much greater generalization and scope for innovation.

Reinforcement Learning generally refers to a goal-oriented approach, which learns how to achieve a complex objective through reinforcement via penalties and rewards applied to its decisions over time. Deep Reinforcement Learning (DRL) integrates this approach into a deep network which, with little initialisation and through self-supervision, can achieve extraordinary performance in certain domains. Rather than depending on manual labeling, DRL automatically extracts weak annotation information from the input data, reinforced over several steps. It thus learns the semantic features of the data, which can be transferred to other tasks. DRL algorithms have beaten human experts at video games and the world champions of Go. The state of the art in this area is progressing rapidly and the potential for strong AI, even with the ambiguous data of the creative sector, is significant. However, realizing this will require a major research effort, as the human processes that underpin it are not well understood.
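
The reward-driven update at the core of reinforcement learning can be written in a few lines. The sketch below shows classical tabular Q-learning; it is our own minimal illustration, and the env interface (reset(), step() and an actions list) is assumed purely for the example. DRL replaces the table with a deep network approximator and adds techniques such as experience replay.

```python
# Minimal tabular Q-learning: learning from rewards and penalties over time.
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """env is assumed to expose reset() -> state, step(action) -> (state, reward, done)
    and a list env.actions; this interface is an assumption for illustration."""
    q = defaultdict(float)                        # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:         # explore occasionally
                action = random.choice(env.actions)
            else:                                 # otherwise exploit the current estimate
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            future = 0.0 if done else max(q[(next_state, a)] for a in env.actions)
            # Temporal-difference update towards the reward plus discounted future value
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
    return q
```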

5 Concluding remarks

This paper has presented a comprehensive review of current AI technologies and their applications, specifically in the context of the creative industries. We have seen that ML-based AI has advanced the state of the art across a range of creative applications including content creation, information analysis, content enhancement, information extraction, information enhancement and data compression. ML–AI methods are data driven and benefit from recent advances in computational hardware and the availability of huge amounts of data for training—particularly image and video data.

We have differentiated throughout between the use of ML–AI as a creative tool and its potential as a creator in its own right. We foresee, in the near future, that AI will be adopted much more widely as a tool or collaborative assistant for creativity, supporting acquisition, production, post-production, delivery and interactivity. The concurrent advances in computing power, storage capacities and communication technologies (such as 5G) will support the embedding of AI processing within and at the edge of the network. In contrast, we observe that, despite recent advances, significant challenges remain for AI as the sole generator of original work. ML–AI works well when there are clearly defined problems that do not depend on external context or require long chains of inference or reasoning in decision making. It also benefits significantly from large amounts of diverse and unbiased data for training. Hence, the likelihood of AI (or its developers) winning awards for creative works in competition with human creatives may be some way off. We therefore conclude that, for creative applications, technological developments will, for some time yet, remain human-centric—designed to augment, rather than replace, human creativity. As AI methods begin to pervade the creative sector, developers and deployers must however continue to build trust; technological advances must go hand-in-hand with a greater understanding of ethical issues, data bias and wider social impact.