1 Introduction

High dynamic range (HDR) video provides a significant improvement in visual quality compared to traditional low dynamic range (LDR) video. With up to 96 bits per pixel (BPP), compared to a standard image of 24 BPP, a single uncompressed HDR frame of \( 1920\,{\times }\,1080 \) resolution requires 24 MB, and a minute of data at 30 fps is 42 GB [5]. To cope effectively with this large amount of data, efficient compression is required. Moreover, if HDR is to gain wide acceptance, and find use in broadcast, internet streaming, remote gaming, etc., it is crucial that computationally efficient encoding and decoding are possible.

HDR video compression may be classified as either a one-stream or two-stream approach [11]. A two-stream method separates the single HDR video input stream into base and detail streams which are then compressed separately according to their individual characteristics. One-stream methods, on the other hand, take advantage of the higher bit-depth available in modern video codecs. A transfer function (TF) is used to map the HDR video input stream to a single, high bit-depth stream and optionally some metadata to aid the post-processing before display. A number of the proposed one-stream methods [9, 25] use complex TFs, requiring many floating-point operations for both compression and decompression.

In this paper we evaluate whether straightforward power functions, with their associated computational benefits, can be used to efficiently compress HDR video. We propose an HDR video compression method, the Power Transfer Function (PTF), which aims to provide real-time HDR video encoding without a loss in quality or compression performance.

The key contributions of this work are:

  • The presentation of PTF, a straightforward HDR transfer function, with an objective evaluation demonstrating that the method is a highly performant HDR video compression technique.

  • An evaluation of the performance of PTF, showing that an analytic implementation of the method exceeds 380 fps for decoding video on commodity hardware, outperforming an analytic implementation of another leading transfer function by over an order of magnitude and a pre-calculated look-up table implementation by a factor of approximately 1.5.

2 Related work

HDR video compression methods can be split into two broad categories: one-stream and two-stream. Two-stream methods have the advantage that they can work well on existing 8-bit infrastructure. One-stream methods, on the other hand, require at least 10-bit infrastructure. The advantage of one-stream methods is that they follow a similar pipeline to those used for LDR video, without the need for secondary streams to be transmitted or combined before display.

Two-stream methods can be considered either backwards compatible or non-backwards compatible based on whether one of the streams can be presented using a non-HDR aware video player. Mantiuk et al. [21, 24] presented a method which, following the overall structure proposed by [30], tone maps the HDR data to create a backwards compatible image [21]. This image is then restored to a colour space compatible with the original, and the difference in luminance between the reconstructed frame and the original is stored as a residual data stream. The decoding is performed by reconstructing the tone mapped image and then applying the previously created residuals. The method proposed by Lee and Kim [20] also follows the structure proposed by [30]. In this method the backwards compatible frames are generated using a temporally coherent tone mapper and the residual is created by taking the logarithm of the ratio of the reconstructed image to the original HDR image. To reduce noise, the residual stream is cross-bilaterally filtered [13]. Other proposed two-stream methods include goHDR [6] and optimal exposure [12].

Several one-stream HDR video compression methods have been proposed in the last 10 years. One of the earliest, by Mantiuk et al. [23], extended the existing MPEG-4 encoder and attempted to preserve colour and luminance levels visible to the human eye. This mapped real-world luminances from linear RGB to an 11-bit perceptually uniform luma space and chrominance into an 8-bit uniform chromaticity scale similar to that used in LogLUV encoding [19]. We will refer to this method as HDRV for the remainder of this paper. Garbas and Thoma [17] presented a temporally coherent extension of the Adaptive LogLUV function [26] suitable for HDR video compression. The proposed method maps real-world luminance into a 12-bit luma space and preserves chrominance in 8-bit \(u'v'\) chroma channels similar to LogLUV [19]. We will refer to this method as Fraunhofer for the remainder of this paper. Zhang et al. [33] developed a method that converts HDR data to a 32-bit LogLUV colour space [19], after which the 16-bit luminance channel is converted to 14-bit by non-linear quantisation, similar to Lloyd–Max optimisation [28].

The Perceptual Quantizer (PQ) method is based on the fitting of a polynomial function to the peaks in the Barten model of visual perception [7]. Compression is provided by means of a closer fit to a human visual response curve [25]. This method has recently been included in a SMPTE standard, ST2084 [1].

More recently, Borer [9] proposed a compression method based on the log and gamma segments of Mantiuk’s analytic model [24] that increases the dynamic range that can be distributed by a factor of 50. This Hybrid–Log–Gamma (HLG) method has been developed to provide support for a display independent television system [10], and has also been included in the Arib STD-B67 standard [3].

Fig. 1 Example pipelines used for encoding and decoding HDR using the PTF. a Takes in HDR video frames in either scene or display referred scale and outputs YCbCr for encoding with a standard encoder. b Takes as input the encoded bitstream and outputs HDR frames at the initial scale. The dashed lines denote optional processing. a PTF encode; b PTF decode

3 Power Transfer Function

The human visual system (HVS) has greater sensitivity to relative differences in darker areas of a scene than in brighter areas [14, 31]. This non-linear response can be generalised by a straightforward power function. The Power Transfer Function (PTF) distributes the available code values to preserve detail in the areas of the HDR content in which the HVS is more sensitive. PTF therefore allocates more values to the dark regions than to the light regions. The theoretical properties of the power functions used in PTF are presented in Sect. 3.3.

3.1 Motivation

The recent addition of higher bit-depth support to commonly used video encoding standards such as advanced video coding (AVC) [32], high efficiency video coding (HEVC) [29] and methods such as VP9 has diminished the need for two-stream methods. Instead, this support has motivated an investigation into the efficient mapping of HDR data into 10 and 12 bits. For this purpose, PQ [25] uses a perceptual encoding to map the contrast sensitivity of the HVS to the values available in the video stream. This perceptual encoding, however, relies on a complex transfer function.

In this paper, we investigate whether a transfer function implemented using straightforward power functions can provide an efficient mapping. Power functions also provide computational benefits, particularly for lower integer powers. While the PQ mapping [25] requires many calculations, a power function can be computed with a single operation.

Algorithm 1 PTF encode, \(\text {PTF}_{\gamma }\)
Algorithm 2 PTF decode, \(\text {PTF}_{\gamma }'\)

3.2 Method

PTF is a single stream method, converting HDR input into a single set of compressed output frames. To achieve this compression, PTF utilises the power function \( f(x) = Ax^\gamma \), where A is a constant, x is normalised image data contained by the set \([0, 1] \subset \mathbb {R}\), and \(\gamma \in \mathbb {R}^+\).
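As a concrete illustration, the forward and inverse mappings can be sketched as follows. This is our own sketch, not a reproduction of the paper's Algorithms 1 and 2: the function names, the choice A = 1, and the assumption that encoding applies the concave exponent \(1/\gamma\) (so that decoding reduces to the few multiplies described in Sect. 4.4) are ours.

```python
import numpy as np

def ptf_encode(frame, gamma=4.0, norm=1.0):
    # Normalise to [0, 1], then apply the concave branch x**(1/gamma):
    # more code values are spent on the dark regions where the HVS is
    # more sensitive (assumed direction; A is taken as 1).
    x = np.clip(np.asarray(frame, dtype=np.float64) / norm, 0.0, 1.0)
    return x ** (1.0 / gamma)

def ptf_decode(luma, gamma=4.0, norm=1.0):
    # Inverse mapping: luma back to linear HDR values at the original scale.
    return norm * np.asarray(luma, dtype=np.float64) ** gamma
```

A round trip through both functions recovers the input up to floating-point precision, which is the sense in which the TF itself is lossless; quantisation to integer code values is what introduces coding error.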

The straightforward nature of the PTF method is shown in Fig. 1a, b, which present the general pipeline in which PTF is used, and by Algorithms 1 and 2, which detail the compression and decompression procedures, \(\text {PTF}_{\gamma }\) and \(\text {PTF}_{\gamma }'\), respectively.

Before an HDR video is compressed using PTF, it is normalised to the range [ 0, 1 ] with a normalisation factor \( \mathfrak {N}\) using the relation \( x = S / \mathfrak {N} \), where S is the full range HDR data. If the footage is of an unknown range then it can be analysed to determine the correct \(\mathfrak {N}\) for encoding, or, for live broadcast, \( \mathfrak {N}\) can be set to the peak brightness the camera is capable of capturing or the display is capable of presenting.

If the normalisation factor is variable, then it can be stored as metadata along with the video data to correctly rescale the footage for display. Each input frame may be normalised independently; however, this may introduce artefacts, as the scaling and nonlinearity can interact and lead to the accumulation of errors when using predicted frames. More often a global or temporal normalisation factor is used. The metadata can be passed either at the bitstream level, e.g. with supplemental enhancement information (SEI) messages, or at the container level, e.g. in MPEG-4 Part 14 (MP4) data streams.
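A minimal sketch of a global normalisation factor and the subsequent 10-bit quantisation, assuming peak-value normalisation and full-range code values; the helper names are hypothetical, not the paper's:

```python
import numpy as np

def global_norm_factor(frames):
    # A global normalisation factor: the peak value over the whole sequence,
    # stored once as metadata rather than varying per frame (which can
    # interact badly with predicted frames, as noted above).
    return max(float(np.max(f)) for f in frames)

def quantise_10bit(luma):
    # Map [0, 1] luma to the 10-bit integer code values handed to the
    # video encoder (full range, 0..1023 assumed).
    return np.round(np.asarray(luma) * 1023.0).astype(np.uint16)
```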

Following compression with PTF, the data must be converted into the output colour space to be passed to the video encoder, and if chroma subsampling is to be used, reduced to the correct format.

Fig. 2 Contrast sensitivity plots showing the GDF as implemented by the DICOM standard, the Ferwerda TVI [15] as used by HDRV, the Adaptive LogLUV used by the Fraunhofer method, Dolby PQ, BBC HLG, and \(\text {PTF}_{4}\) and \(\text {PTF}_{8}\). Luminance and contrast are shown on logarithmic scales

3.3 Theoretical analysis

Figure 2 presents a comparison of just noticeable difference (JND) characteristics from various methods and standards. The Greyscale Display Function (GDF) is an implementation of the Barten contrast sensitivity function (CSF) [7] developed for the digital imaging and communications in medicine (DICOM) standard [2]. This CSF plots a relationship between luminance and luma such that the contrast steps between each consecutive luma value are not perceptible. Methods with contrast steps larger than those of the GDF are likely perceptible at that luminance. The DICOM standard GDF is defined with a lower bound of \(5 \times 10^{-2}\). As the Fraunhofer method is based on log luminance, it exhibits a purely linear plot in Fig. 2.

To understand how power functions could be adapted for HDR video compression, we investigated the JND characteristics of PTF with the \(\gamma \) values 4 and 8. We chose integer values as we expected them to exhibit reduced computational cost compared with non-integer values. In Sect. 4.2 we investigate the role of \(\gamma \) in PTF compression. Figure 2 shows how \(\text {PTF}_{4}\) compares against other methods. \(\text {PTF}_{8}\) allocates too few values to the brighter regions of the image while reserving a large proportion of the available luma values for a region very close to the lower bound. However, this does give \(\text {PTF}_{8}\) the ability to store a very high dynamic range.
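The JND-style curves in Fig. 2 can be approximated for PTF as follows. This is our reconstruction, not the authors' plotting code; it assumes full-range code values and a pure power-law decode:

```python
import numpy as np

def ptf_contrast_steps(gamma, peak=10000.0, bits=10):
    # Decode every luma code value, then compute the relative contrast
    # step between consecutive codes -- the quantity a JND plot shows
    # against luminance.
    codes = np.arange(2 ** bits) / (2 ** bits - 1)
    lum = peak * codes ** gamma            # power-law decode of each code
    step = np.diff(lum)                    # luminance step per code
    mid = 0.5 * (lum[:-1] + lum[1:])       # luminance at which the step occurs
    return mid, step / np.maximum(mid, 1e-30)
```

Plotting the returned contrast against luminance on log–log axes reproduces the qualitative behaviour described above: the relative step shrinks towards bright regions, and larger \(\gamma\) concentrates ever more codes near the lower bound.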

The power function used in PTF is similar to the well-known gamma correction used in LDR video, adapted instead to provide HDR video compression. Figure 3 presents a comparison of the shape of the proposed TFs in a normalised space. As a linear plot would express no compression, we can see that \(\text {PTF}_{2.2}\) provides only a small amount of compression.

Fig. 3 Graph showing encoding and decoding transfer functions. Presented are \(\text {PTF}_{4}\) and \(\text {PTF}_{8}\) alongside the PQ and HLG curves. \(\text {PTF}_{2.2}\) is presented for comparison with an example LDR Gamma function. HLG has been rescaled to the [0, 1] range for comparison with the other TFs

Fig. 4 The relationship between \(\gamma \) and coding error for PTF at different bit-depths across a range of metrics. The results are the average distortion introduced by PTF for the selection of HDR images in Online Resource 1. a HDR-VDP-2.2.1 Q correlate; b puPSNR; c PSNR-RGB

4 Results

To evaluate how the efficiency of PTF compares with other proposed methods, it has been compared with the following four state-of-the-art one-stream methods (described in more detail in Sect. 2): HDRV [23], Fraunhofer [17], PQ [25], and HLG [9]. For fairness, HDRV and Fraunhofer were adapted from their original presentation for use with a 10-bit video encoder. HDRV was implemented with the luminance range reduced to \(1\times 10^{-5}\) to \(1\times 10^{4}\,\hbox {cd/m}^{2}\) such that the TVI curve [15] could provide a mapping from luminance to 10-bit luma. The Fraunhofer implementation uses Adaptive LogLUV [26] which provides mappings for a flexible number of bits.

These methods will be compared on an objective basis using the metrics presented in this section. Subsequently, an analysis of the effect of \(\gamma \) on the coding error introduced by compression is provided. The results of the objective evaluation performed on the compression methods are then presented. Finally, the computational performance of PTF in contrast with PQ and look-up tables is addressed.

Fig. 5 The evaluation pipeline used for comparing compression methods. The dashed line denotes comparison of coding errors only

4.1 Metrics

The following three metrics are used to provide results for the evaluation.

Peak signal to noise ratio (PSNR) is one of the most widely used metrics for comparing processed image quality. To adapt the method for HDR imaging, \(L_\mathrm{peak}\) was fixed at \(10{,}000\hbox { cd/m}^{2}\) and the result was taken as the mean of the channel results.

$$\begin{aligned} \mathrm{PSNR}_\lambda&= 20\log _{10} \left( \frac{L_\mathrm{peak}}{\sqrt{\mathrm{MSE}_\lambda }} \right) \end{aligned}$$
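Under this definition, with \(L_\mathrm{peak}\) fixed at 10,000 cd/m² and the per-channel results averaged as described above, a sketch of the computation might look like this (our illustration, not the paper's evaluation code):

```python
import numpy as np

def psnr_hdr(reference, test, l_peak=10000.0):
    # Per-channel MSE over the spatial axes of an (H, W, C) image,
    # PSNR per channel against the fixed peak, then the channel mean.
    diff = np.asarray(reference, float) - np.asarray(test, float)
    mse = np.mean(diff ** 2, axis=(0, 1))
    return float(np.mean(20.0 * np.log10(l_peak / np.sqrt(mse))))
```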

Perceptually Uniform PSNR (puPSNR) was proposed as an extension to PSNR such that it is capable of handling real-world luminance levels without affecting the results for existing displays [4]. The metric maps the range \(1\times 10^{-5}\) to \(1 \times 10^{8}\,\hbox {cd/m}^{2}\) in real-world luminance to values that approximate perceptually uniform values derived from a CSF. The PSNR is then calculated from the remapped luminance.

HDR-VDP-2.2.1 (HDR Visual Difference Predictor) is an objective metric based on a detailed model of human vision [22]. The metric estimates the probability at which an average human observer will detect differences between a pair of images in a psychophysical evaluation. The visual model used by this metric takes several aspects of the human visual system into account, such as intra-ocular light scatter, photo-receptor spectral sensitivities and contrast sensitivity. HDR-VDP-2.2.1 has been shown to be the objective metric that correlates most highly with subjective studies [18, 27].

The metrics were calculated for every frame, except HDR-VDP-2.2.1 which was every 10th frame due to its computational expense, and averaged to produce a final figure for the sequence.

4.2 Analysis of power functions

Figure 4a–c show the motivation for the selection of \(\gamma \) by comparing the average distortion introduced by PTF over a range of \(\gamma \) values. These figures suggest that the different metrics favour certain \(\gamma \) values over others. A dataset of 20 HDR images was used for computing the results (these are shown in Online Resource 1).

The pipeline used for this analysis is shown in Fig. 5. After compression and colour conversion the images were not passed through the video encoder and were instead immediately decompressed to ascertain just the coding errors introduced by each \(\gamma \) value. The \(\gamma \) values used in the evaluation ranged from 0.25 to 10 in steps of 0.25. The evaluation was performed at four bit-depths: 8, 10, 12 and 16. PSNR-RGB suggests that a \(\gamma \) of 2.2 will give the best results; this is also the value commonly used for LDR video. The HDR-VDP-2.2.1 Q correlate indicates that a \(\gamma \) of around 4 will perform best, and puPSNR a \(\gamma \) of around 6. Figure 3 shows that the PQ TF proposed by Miller et al. [25] can be closely approximated by a \(\gamma \) value of 8, hence this value was also tested. Integer values are favoured as the operations required to decode are significantly faster than for non-integers, as discussed in Sect. 4.4. Based on the peaks of the graphs, and similarities to the GDF and PQ (see Sect. 3.3), the four implementations of PTF chosen for testing were: \(\text {PTF}_{2.2}\), \(\text {PTF}_{4}\), \(\text {PTF}_{6}\) and \(\text {PTF}_{8}\).

Table 1 The ten HDR video sequences used to evaluate the methods, showing resolution and dynamic range

Also of note in Fig. 4 is how the peak in quality does not shift greatly as the bit-depth is increased. This suggests that \(\gamma \) will not need to be changed at bit-depths of 12 and above. This will be explored further in future work.

Fig. 6 Rate-distortion characteristics showing the results of each method averaged over the ten sequences with three metrics: HDR-VDP-2.2.1, PSNR and puPSNR. The output BPP is shown on a logarithmic scale to improve clarity. a HDR-VDP-2.2.1 average; b puPSNR average; c PSNR-RGB average

4.3 Quality

The approach used for quality comparison is outlined in Fig. 5. For each of the compression methods the pipeline is executed in its entirety. The content is provided as individual HDR frames in OpenEXR format. The compression method's encoding process is run on each of the ten sequences of frames, presented in Table 1, to produce 10-bit files in YCbCr format. These sequences were chosen as they cover a wide range of content types, such as computer graphics renderings and video captured by a SphereonVR HDR Video Camera or an ARRI Alexa. Each scene consisted of 150 frames and was encoded at 24 frames per second. The encoding was conducted with the HEVC encoder x265, due to its computational efficiency, and 4:2:0 chroma subsampling with the quantisation parameters \(\text {QP} \in [5, 10, 15, 20, 25, 30, 35]\). The Group Of Pictures (GOP) structure contained both bi-directional (B) and predicted (P) frames, and the pattern used was (I)BBBP, where the intra (I) frame period was 30 frames. The encoded bitstreams were then decoded using the HEVC Test Model (HM) [29] reference decoder, and subsequently using the individual compression method's decoding process.

4.3.1 Analysis

Figure 6a–c show the results for each of the tested methods for the three quality metrics presented in Sect. 4.1. On each of the figures an increase on the Y axis indicates improved objective quality, and a decrease on the X axis indicates reduced bit-rate. Therefore, results closest to the top-left corner are preferred. For each method at each QP, the average BPP of the encoded bitstreams across all sequences is calculated and plotted against the average quality measured. The ten HDR video sequences used to test the compression methods are shown in Table 1. Results for individual sequences are presented in Online Resource 3.

4.3.2 Discussion

The rate-distortion plots shown in Fig. 6 present the trade-off between bit-rate and quality for each method. If a plotted line maintains a position above another, this indicates that improved quality can be consistently obtained from a method even with a reduction in bit-rate.

These figures show that \(\text {PTF}_{2.2}\) achieves the highest average PSNR, followed by HLG and then \(\text {PTF}_{4}\). As PSNR does not perceptually weight the error encountered, \(\text {PTF}_{2.2}\) is rated highly. This is because the close to linear mapping provided by \(\text {PTF}_{2.2}\) reduces error in the bright regions while failing to preserve detail in the dark regions. The reduced error on the relatively large values found in the bright regions therefore favours \(\text {PTF}_{2.2}\) when tested with PSNR.

HDR-VDP-2.2.1 and puPSNR [4, 22] use perceptual weightings that recognise that error in the dark regions is more noticeable to the HVS than error in the bright regions. These metrics show that on average \(\text {PTF}_{4}\) exhibits less error for a given bit-rate than the other methods, although for certain sequences, such as Beer Festival 4, \(\text {PTF}_{6}\) achieves the highest quality. \(\text {PTF}_{4}\) weights error in the dark regions more highly than \(\text {PTF}_{2.2}\) but less highly than \(\text {PTF}_{6}\) or \(\text {PTF}_{8}\).

The Bjøntegaard delta metric [8] calculates the average difference in quality between pairs of methods encoding sequences at the same bit-rate. Using this metric we can determine the average HDR-VDP-2.2.1 Q correlate gain over the range of bit-rates achieved by PTF when compared with the other methods evaluated. From Table 2 it can be seen that \(\text {PTF}_{4}\) gained 0.32 over PQ, 2.90 over HLG, 7.28 over Fraunhofer and 13.35 over HDRV. We can also see that \(\text {PTF}_{4}\) gained 0.96 over \(\text {PTF}_{6}\), 2.24 over \(\text {PTF}_{8}\) and 2.39 over \(\text {PTF}_{2.2}\). A table showing Bjøntegaard delta bit-rate metric results is available in Online Resource 4. A useful feature of PTF is its adaptability which enables the use of different \(\gamma \) values to provide the best performance for particular sequences.
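A sketch of the Bjøntegaard delta quality computation [8], assuming the standard approach of fitting each rate-distortion curve with a cubic in log bit-rate and averaging the vertical gap over the overlapping rate range; function and parameter names are ours:

```python
import numpy as np

def bd_quality(rates_a, qual_a, rates_b, qual_b):
    # Fit each RD curve with a cubic polynomial in log10(bit-rate),
    # then integrate the difference between the fits over the rate
    # range common to both curves. Positive result: method A is better.
    la, lb = np.log10(rates_a), np.log10(rates_b)
    pa, pb = np.polyfit(la, qual_a, 3), np.polyfit(lb, qual_b, 3)
    lo, hi = max(la.min(), lb.min()), min(la.max(), lb.max())
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    ib = np.polyval(np.polyint(pb), hi) - np.polyval(np.polyint(pb), lo)
    return (ia - ib) / (hi - lo)
```

Feeding in the HDR-VDP-2.2.1 Q correlates measured at each QP for two methods yields the delta-VDP figures of the kind reported in Table 2.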

Table 2 Bjøntegaard delta VDP results showing the average improvement in HDR-VDP-2.2.1 Q correlate results between pairs of methods over ten sequences
Table 3 Difference in decoding time per frame between \(\text {PTF}_{4}'\), \(\text {PQ}_{\text {forward}}\) and their LUT equivalents across a range of sequences and averaged over five tests per sequence performed on a workstation PC

4.4 Computational performance

High performance is essential for real-world encoding and decoding. With that in mind we compared PTF against an analytical implementation of PQ [25] and against look-up tables (LUTs).

Table 3 shows the decoding performance of \(\text {PTF}_{4}'\) and PQ and their LUT equivalents, \(\text {PTF}_{4}'\) LUT and PQ LUT, for the scenes presented in Table 1. The 1D LUTs were generated by storing the result of each transfer function for every 10-bit input value in a floating-point array. The scaling required to reconstruct the full HDR frame was also included in the table to improve performance resulting in a mapping from 10-bit compressed RGB to full HDR floating-point. The results were produced by a single-threaded C++ implementation compiled with the Intel C++ Compiler v16.0. Only the inner loop was timed so disk read and write speeds are not taken into account. Each result was taken as the average of five tests per method on each sequence to reduce the variance associated with CPU timing. The software was compiled with the AVX2 instruction set with automatic loop-unrolling, O3 optimisations and fast floating-point calculations. The machine used to run the performance tests was an Intel Xeon E3-1245v3 running at 3.4 GHz with 16 GB of RAM running the Microsoft Windows 8.1 x86-64 operating system.
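The LUT construction described above might be sketched as follows, in Python rather than the paper's single-threaded C++ (names are ours):

```python
import numpy as np

def build_decode_lut(gamma=4.0, norm=1.0, bits=10):
    # Precompute the transfer function, with the HDR rescaling folded in,
    # for every possible code value: 1024 float entries for 10-bit input,
    # mapping compressed codes directly to full HDR floating-point.
    codes = np.arange(2 ** bits, dtype=np.float64) / (2 ** bits - 1)
    return norm * codes ** gamma

def lut_decode(frame_codes, lut):
    # Decoding a frame is then a single table lookup (gather) per sample.
    return lut[frame_codes]
```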

The encoding performance was also evaluated for the methods. In this case the mapping was from full HDR floating-point to 10-bit output and hence the LUT implementations could not include scaling in the table. The sequences, resolution and sequence lengths were the same as above. \(\text {PTF}_{4}\) encoding was achieved on average per frame in 4.37 ms, PQ encoding in 72.59 ms, \(\text {PTF}_{4}\) LUT in 4.02 ms and PQ LUT in 4.21 ms.

The results demonstrate that the straightforward floating-point calculations required to decode \(\text {PTF}_{4}\) can outperform the floating-point calculations required to decode PQ by a factor of 29.63, and even the indexing needed to use a look-up table by a factor of 1.48. The high performance of \(\text {PTF}_{4}'\) is due to its compilation into only a few instructions, in this case three multiplies, which can have high performance SIMD implementations. PTF also avoids any branching, improving performance on pipelined architectures. Encoding with \(\text {PTF}_{4}\) can be achieved at a speed comparable to the use of a LUT and greatly in excess of an analytic implementation of PQ.
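The three-multiply decode can be illustrated with a scalar sketch of the branch-free inner loop (the paper's implementation is vectorised C++; the function name and the folding of the normalisation factor into a single scale are ours):

```python
def ptf4_decode(code, scale):
    # Analytic PTF_4 decode of a normalised code value: two squarings
    # plus one scaling multiply -- three multiplies in total, with no
    # branches, so it vectorises trivially.
    x2 = code * code
    return x2 * x2 * scale
```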

5 Conclusion and future work

In this paper, we have introduced and evaluated a straightforward method of compressing HDR video streams. We have shown that a transfer function based on power functions is capable of producing high quality compressed HDR video and that the compression can be achieved using straightforward techniques which lend themselves to implementation in real-time and low-power environments. On a commodity desktop machine, PTF can be decoded at over 380 fps, outperforming an analytic implementation of PQ by a factor of over 29.5 and a look-up table implementation by a factor of nearly 1.5. Encoding performance outperforms PQ by a factor of 16.6 and is only slightly slower than a LUT. Thanks to its straightforward nature, PTF is amenable to acceleration through the use of hardware such as FPGAs and GPUs. We intend to develop an implementation on such platforms in the future. As a continuation of this work we would like to confirm the objective results with a subjective evaluation. This could also serve as further confirmation of the correlation between HDR-VDP-2.2.1 results and experiments involving human participants.