Discriminative feature encoding for intrinsic image decomposition

Intrinsic image decomposition is an important and long-standing computer vision problem. Given an input image, recovering the physical scene properties is ill-posed. Several physically motivated priors have been used to restrict the solution space of the optimization problem for intrinsic image decomposition. This work takes advantage of deep learning, and shows that it can solve this challenging computer vision problem with high efficiency. The focus lies in the feature encoding phase to extract discriminative features for different intrinsic layers from an input image. To achieve this goal, we explore the distinctive characteristics of different intrinsic components in the high-dimensional feature embedding space. We define feature distribution divergence to efficiently separate the feature vectors of different intrinsic components. The feature distributions are also constrained to fit the real ones through a feature distribution consistency. In addition, a data refinement approach is provided to remove data inconsistency from the Sintel dataset, making it more suitable for intrinsic image decomposition. Our method is also extended to intrinsic video decomposition based on pixel-wise correspondences between adjacent frames. Experimental results indicate that our proposed network structure can outperform the existing state-of-the-art.


I. INTRODUCTION
In terms of intrinsic image decomposition, the albedo image A indicates the surface material's reflectivity, which is unchanging under different illumination conditions, while the shading image S accounts for illumination effects due to object geometry and camera viewpoint [1]. It is an ill-posed problem to reconstruct these two intrinsic images from a single color image I, which follows the image formation model I = A × S (Eq. (1)). To solve this challenging inverse image formation problem, many researchers have tried applying physically motivated priors as constraints to disambiguate the decomposition [2], [3], [4], [5], [6], [7], [8], [9]. These methods usually represent the priors in the form of energy terms and solve the decomposition problem through graph-based inference algorithms. With the surge of ground-truth intrinsic decomposition data [10], [11], [12], data-driven deep learning methods [1], [13], [14], [15], [16], [17] have achieved promising decomposition results and have drawn increasing research interest. However, fully-supervised methods require high-quality, densely-labelled decompositions, which are expensive to acquire. To overcome this problem, methods training across different datasets [17], training on synthetic datasets [15], [17], adding additional constraints [16], and reusing physically motivated priors [1] have been proposed.
When developing their specific deep learning techniques, previous methods usually extract features via a shared encoder, and then use different decoders to disentangle information for specific intrinsic layers. Observing the different distributions of albedo and shading in the gradient domain [2], it is natural to assume that features representing different intrinsic layers can be separated in the embedding space. With the features separated during the encoding phase, decoders can be relieved from distilling clues for specific targets and focus on the reconstruction procedure. This idea motivates the research in this paper.
We propose a novel two-stream encoder-decoder network for intrinsic image decomposition. In particular, our feature distribution divergence (FDD) constraint is designed to encourage the two encoders to extract distinctive features for different intrinsic layers. Our feature distribution consistency (FDC) constraint is used to encourage the features of a reconstructed intrinsic layer to have a distribution pattern similar to that of ground-truth decompositions. Moreover, we provide an approach to deal with the illumination inconsistency between the ground truth shading and input images in the MPI Sintel dataset, making it more suitable for intrinsic image decomposition. We also provide an intrinsic decomposition method for video data based on pixel-wise correspondences between adjacent frames. This work is an extension of our previously published workshop paper [18], giving more detailed method descriptions, novel technical contributions, and more comprehensive experiments.
The major contributions of this work are:
• A novel two-stream encoder-decoder network for intrinsic image decomposition, in which discriminative feature encoding is achieved via feature distribution divergence and feature distribution consistency constraints.
• A data refinement algorithm for the MPI Sintel dataset, producing a more physically consistent dataset that better suits the intrinsic decomposition task.
• Experimental results on various datasets to demonstrate the effectiveness of our proposed method, including experimental extension to decomposition of video data.

II. RELATED WORK

A. Approaches
Intrinsic image decomposition is a long-standing computer vision problem. However, it is a seriously ill-posed problem to recover an albedo layer and a shading layer from a single color image [15]. In recent decades, considerable effort has been devoted to this challenging problem. These approaches can be coarsely classified into optimization-based methods using physically motivated priors, and deep learning based, data-driven methods [15], [19]. There are also approaches using multiple images as input [20], [21], [22], [23], [24], [25], treating the reflectance as a constant factor under changing illumination. These methods require the images to be captured by a static camera with varying illumination. A generative adversarial network (GAN) based domain transfer framework has also been applied to image layer separation tasks [26]. Additional cues, including depth maps [27], [28], [29], [30] and near-infrared images [31], are also taken into account in some work. Here, we focus on key works recovering intrinsic images from a single input.

B. Physically motivated priors based methods
To solve this ill-posed intrinsic decomposition problem, researchers have derived several physically-inspired priors to constrain the solution space [15]. Land et al. [2] proposed the retinex algorithm, exploring the different properties of intrinsic components in the gradient domain: large derivatives are perceived as changes in reflectance properties, while smoother variations are seen as changes in illumination. Based on this assumption, many priors for intrinsic image decomposition have been explored. Derived from a piece-wise constant property, reflectance sparsity [3], [4] and low-rank reflectances [8] have been used as constraints. Other constraints include the distribution difference in the gradient domain [32], [33], [34], non-local textures [5], [6], shape and illumination [7], and user strokes [8], [9]. These hand-crafted priors are unlikely to be valid across complex datasets [19]. Bi et al. [32] presented an approach using the L1 norm for piece-wise image flattening, and proposed an algorithm for complex scene-level intrinsic image decomposition. Li et al. [33] presented a method to automatically extract two layers from an image based on differences in their distributions in the gradient domain. Sheng et al. [34] proposed an approach based on illumination decomposition, in which a shading image is decomposed into drift and step shading channels based on different distribution properties in the gradient domain. Based on an analysis of the logarithmic transformation, Fu et al. [35] introduced a weighted variational model to refine the regularization terms for intrinsic image decomposition. Later, Fu et al. [36] presented an algorithm incorporating a reflectance sparseness regularizer based on the L0 norm and a shading smoothness regularizer based on total variation. Non-local texture constraints [5], [6] are used to find pixels with the most similar reflectance within an image. Krebs et al. [37] developed a method for intrinsic image decomposition from a single RGB or multispectral image, taking into consideration the mathematical properties of the mean and standard deviation along the spectral axis. Although these methods restrict the solution space to a feasible region, such specifically designed priors cannot hold under complex conditions. The results are largely influenced by the parameter settings, which need expert knowledge. Our method differs in that it is data-driven, and parameters are automatically learned from the dataset.

C. Deep learning methods
Thanks to the public availability of intrinsic image datasets such as the MIT intrinsic dataset [10], the MPI Sintel dataset [11] and Intrinsic Images in the Wild (IIW) [12], application of deep learning to intrinsic decomposition has surged [13], [38], [39], [40], [41]. Direct intrinsics [14] provided the first entirely deep learning model that directly outputs albedo and shading layers given a color image. Results from this method are blurred due to down-sampling during encoding and deconvolution during decoding.
Fan et al. [16] provided a network structure using a domain filter between edges in the guidance map to encourage piece-wise constant reflectance. [42] used region masks to guide the separation of different intrinsic components. By utilizing additional constraints, the solution space for the problem is further restricted. Seo et al. [43] proposed an image-decomposition network which makes use of all three consistency premises from retinex theory. In this work, pseudo images (generated, color-transferred multi-exposure images) are used for training. Baslamisli et al. [1] presented a two-stage framework that first splits the image gradients into albedo and shading components, which are then fed into decoders to predict pixel-wise intrinsic values. In their later work [44], the gradient descriptors for albedo and shading are derived from a physics-based reflection model and used to compute the shading map directly from RGB image gradients. [45] and [46] derived fine-grained shading components from a physics-based image formation model, in which the shading component is further decomposed into direct and indirect components, and shape-dependent and shape-independent ones. These works proposed novel methods by revisiting physically motivated priors. Shi et al. [15] trained a model to learn albedo, shading and specular images on a large-scale object-level synthetic dataset obtained by rendering ShapeNet [48]. Sial et al. [47] trained an intrinsic decomposition model on a synthetic ShapeNet-based scene dataset which has more realistic lighting effects. Li et al. [17] presented an end-to-end learning approach that learns better intrinsic image decomposition by leveraging datasets with different types of labels.
The majority of these methods extract features via a shared encoder, and then use different decoders to disentangle information for specific intrinsic layers. In contrast to these works, we try to exploit the difference between intrinsic components in feature space through a novel two-stream framework. With the features separated in the encoding phase, decoders can be relieved from distilling clues for specific targets and focus on the reconstruction procedure.

Fig. 1. Framework of our two-stream intrinsic image decomposition network. The input image is passed through two sub-network streams for albedo and shading image reconstruction respectively. We use the extractor in VGG-19 as the encoder structure, which extracts multi-scale feature maps. These are then aggregated by sequences of upsampling, concatenation, and convolution. Finally, three residual dilated blocks are used as a decoder to reconstruct intrinsic images from the fused feature maps. ⊕ denotes feature aggregation, ⊙ denotes element-wise multiplication, and rounded boxes represent loss computations. Cycle, FDC and FDD mean cycle loss, feature distribution consistency and feature distribution divergence respectively.

D. Intrinsic video
Kong et al. [49] defined intrinsic video estimation as the problem of extracting temporally coherent albedo and shading from video alone. Ye et al. [50] proposed a probabilistic approach to propagate the reflectance from the initial intrinsic decomposition of the first frame. In order to achieve temporal consistency, these methods rely on optical flow to provide correspondences across time. Meka et al. [51] presented the first approach to tackle the hard intrinsic video decomposition problem at real-time frame rates. This method applies global consistency constraints in space and time based on random sampling. Lei et al. [52] presented a novel and general approach for blind video temporal consistency, trained directly on a pair of original and processed videos instead of a large dataset. In this paper, we simply extend our intrinsic image decomposition method to video based on optical flow, preserving temporal consistency during the decomposition process.

A. Network structure
Our network architecture is visualized in Figure 1. The framework consists of two streams of encoder-decoder sub-networks. One performs albedo image reconstruction, and the other, shading image reconstruction. Taking the albedo sub-network as an example, the input image is passed through a convolutional encoder to extract multi-level features, which are then aggregated by sequences of upsampling, concatenation, and convolution. In the decoding phase, the fused multi-scale features are fed into a sequence of three residual dilated blocks to reconstruct the albedo intrinsic image. The shading sub-network has the same structure as the albedo sub-network. In practice, we adopt VGG-19 [53] pretrained on ImageNet [54] as the initial encoder.
Previous works usually use a shared encoder to extract features containing both albedo and shading information. Different decoders are then applied to distill clues from the comprehensive features for specific intrinsic image reconstruction. This 'Y'-shaped framework can be formulated as

Â = g_a(f(I; Θ); Ω_a), Ŝ = g_s(f(I; Θ); Ω_s), (2)

where f(·; Θ) and g(·; Ω) denote the feature encoder and decoder respectively, and Θ and Ω represent their corresponding trainable parameters. For brevity, we write g(·; Ω_a) as g_a(·): a subscript on a function symbol means that the trainable parameters differ with respect to the specific intrinsic component, while the network frameworks are the same. Unlike such methods, our network design has two encoders, for the albedo and shading images respectively. In this paper, we denote this structure as an ' '-shaped framework:

Â = g_a(f_a(I; Θ_a); Ω_a), Ŝ = g_s(f_s(I; Θ_s); Ω_s). (3)

Using this framework, the encoders (f_a(·), f_s(·)) are able to extract features more pertinent to their reconstruction targets (albedo, shading). In Figure 2, we visualize the feature distributions of different network structures, which explains our idea pictorially.
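The difference between the two formulations can be sketched as follows. `encode` and `decode` are hypothetical single-layer stand-ins for f(·; Θ) and g(·; Ω), not the actual VGG-based sub-networks; the shapes are illustrative only.

```python
import numpy as np

def encode(image, theta):
    # placeholder encoder: one linear map standing in for f(.; Theta)
    return np.tanh(image @ theta)

def decode(feat, omega):
    # placeholder decoder standing in for g(.; Omega)
    return feat @ omega

rng = np.random.default_rng(0)
image = rng.normal(size=(4, 8))          # 4 "pixels", 8 input channels
theta_shared = rng.normal(size=(8, 16))  # shared encoder weights ('Y'-shaped)
theta_a = rng.normal(size=(8, 16))       # albedo-stream encoder weights
theta_s = rng.normal(size=(8, 16))       # shading-stream encoder weights
omega_a = rng.normal(size=(16, 3))       # albedo decoder weights
omega_s = rng.normal(size=(16, 3))       # shading decoder weights

# 'Y'-shaped framework (Eq. (2)): one encoder feeds two decoders
feat = encode(image, theta_shared)
albedo_y, shading_y = decode(feat, omega_a), decode(feat, omega_s)

# two-stream framework (Eq. (3)): separate encoders f_a, f_s per stream
albedo_2s = decode(encode(image, theta_a), omega_a)
shading_2s = decode(encode(image, theta_s), omega_s)
```

In the two-stream case, each encoder is free to specialize its features for a single reconstruction target.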
In each row of Figure 2, a feature embedding visualization using t-SNE is provided on the right. Each data point represents a feature extracted by the encoder, which is then fed into the corresponding intrinsic image reconstruction decoder. We use the same color coding for the embedding's data points and the extracted feature vectors from the simplified network structure. The proposed ' '-shaped framework (given in detail in Figure 1) results in a better feature embedding: in the embedding space, the features for different intrinsic components are better separated.
The rest of this section introduces the core idea and detailed design of the discriminative feature encoding. Then, important constraints for our intrinsic decomposition network are explained.
B. Discriminative feature encoding

1) Basis: Our work is inspired by Land et al. [2]: the retinex approach assumes that albedo and shading layers possess different properties in the gradient domain. By utilizing such discriminative properties, the intrinsic decomposition results can be improved. In this work, we study and exploit the discriminative properties in a more general convolutional feature space. We next describe the proposed discriminative feature encoding in detail.
2) Feature distribution divergence: As Figure 1 shows, the encoding phase consists of multiple (convolution, ReLU, max-pooling) blocks, through which the input signal is encoded at several different abstraction levels. The multi-scale features are denoted {f_E1, ..., f_En}, in which f_Ei represents the output feature of the i-th block. We define the feature distance function d : R^(m×n×c) × R^(m×n×c) → R, where c denotes the number of feature channels and the input signal has spatial size m × n:

d(f_a^Ei, f_s^Ei) = α d_cos(f_a^Ei, f_s^Ei) + β d_L1(f_a^Ei, f_s^Ei), (4)

d_cos(f_a^Ei, f_s^Ei) = 1 − (1/N_i) Σ_(x,y) <f_a(x,y), f_s(x,y)> / (||f_a(x,y)|| ||f_s(x,y)||),

d_L1(f_a^Ei, f_s^Ei) = h((1/N_i) Σ_(x,y) ||f_a(x,y) − f_s(x,y)||_1).

The feature distance measurement is based on the cosine distance between two vectors and the L1 norm. In Eq. (4), f_a and f_s represent features from the albedo encoder and the shading encoder respectively, <·, ·> is the inner product in Euclidean space, N_i = m_i × n_i, and (x, y) represents a spatial location in a feature map. h(·) is a distance rescaling function in the form of a modified sigmoid function, ensuring d_L1 ∈ (0, 1).
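The feature distance can be sketched as follows. This is a minimal numpy illustration of a cosine-distance term plus a sigmoid-rescaled L1 term, as described in the text; the specific weighting and the form of h(·) are assumptions, not the paper's definitive formulation.

```python
import numpy as np

def feature_distance(fa, fs, alpha=0.1, beta=0.9):
    """Sketch of the feature distance d over one abstraction level.

    fa, fs: (m, n, c) feature maps from the albedo and shading encoders.
    Combines a mean cosine distance with a sigmoid-rescaled mean L1 gap;
    the weighting (alpha, beta) and h(.) are illustrative assumptions.
    """
    m, n, c = fa.shape
    a = fa.reshape(m * n, c)
    s = fs.reshape(m * n, c)
    # cosine distance averaged over the N_i = m * n spatial locations
    cos = np.sum(a * s, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(s, axis=1) + 1e-8)
    d_cos = 1.0 - cos.mean()
    # mean L1 gap rescaled by a modified sigmoid h, so the term lies in (0, 1)
    l1 = np.abs(a - s).sum(axis=1).mean()
    d_l1 = 2.0 / (1.0 + np.exp(-l1)) - 1.0
    return alpha * d_cos + beta * d_l1
```

Identical feature maps give a distance near zero; dissimilar maps give a strictly positive distance, which the FDD constraint pushes upward during training.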
3) Feature distribution consistency: The feature distribution divergence aims to increase the distance between the feature vectors embedded by different encoders. However, this is not sufficient for discriminative feature encoding. The core idea of Fisher's linear discriminant is to maximize the distance between classes while simultaneously minimizing the distance within classes. As an analogue of that, along with the feature distribution divergence described above, we use the feature perceptual loss [55] between the predicted and ground truth intrinsic images to constrain the encoding process, encouraging the embedded features to fit the real distribution.
We use the same distance measurement as for the feature distribution divergence; d(f_pred^Ei, f_real^Ei) denotes the feature distance at the i-th abstraction level.
The feature distribution consistency L_fdc is formulated as

L_fdc = Σ_(i=1..n) γ_i d(f_pred^Ei, f_real^Ei), (6)

where d(f_pred^Ei, f_real^Ei) represents the feature similarity between f_pred and f_real. Minimizing Eq. (6) encourages the predicted and ground truth intrinsic images to have similar perceptual features. In practice, the encoders are reused to extract features from the predicted and target results in our framework, so the embedded feature distribution can be optimized directly during training. Empirically, we set [γ_1, ..., γ_5] = [1.0, 1.0, 1.0, 1.0, 1.0], α = 0.1, and β = 0.9.
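The γ-weighted accumulation over abstraction levels can be sketched as follows; the per-level distance here is a plain mean L1 gap standing in for d(·, ·), so this is an illustration of the loss structure rather than the exact measure.

```python
import numpy as np

def fdc_loss(feats_pred, feats_real, gammas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Sketch of the feature distribution consistency loss: a gamma-weighted
    sum of per-level distances between features of the predicted and the
    ground-truth intrinsic images. The mean-L1 per-level distance is a
    simplifying stand-in for the paper's d(., .)."""
    total = 0.0
    for gamma, fp, fr in zip(gammas, feats_pred, feats_real):
        total += gamma * np.abs(fp - fr).mean()
    return total
```

Because the same encoders extract both feature sets, gradients from this loss shape the embedding directly.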
C. Basic supervision constraints

1) Use of losses: Besides the above constraints for discriminative feature encoding, several basic supervision losses are adopted to train the intrinsic image decomposition network.
As described in Eq. (3), given an image I, the albedo image A and the shading image S are predicted through the trained g_a ∘ f_a and g_s ∘ f_s. With densely-labelled intrinsic images A and S as ground truth data, we constrain the pixel-wise predictions using the reconstruction loss L_rec and the gradient loss L_grad.
2) Cycle loss: The cycle loss is used to encourage the product of the predicted A and S to be similar to the input image I.
3) Gradient loss: We also use image gradients as supervision to help preserve details in the intrinsic images:

L_grad = ||∇_x Â − ∇_x A||_1 + ||∇_y Â − ∇_y A||_1 + ||∇_x Ŝ − ∇_x S||_1 + ||∇_y Ŝ − ∇_y S||_1, (8)

where ∇_x, ∇_y are the image gradients in the x and y directions.
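The basic supervision terms can be sketched as follows. The L1 penalties are simplifying assumptions for illustration; the paper's exact norms may differ.

```python
import numpy as np

def reconstruction_loss(pred, gt):
    # pixel-wise gap between a predicted intrinsic layer and its ground truth
    return np.abs(pred - gt).mean()

def cycle_loss(albedo, shading, image):
    # the product of predicted albedo and shading should resynthesize the input
    return np.abs(albedo * shading - image).mean()

def gradient_loss(pred, gt):
    # match image gradients in x and y to preserve fine details
    gx_p, gy_p = np.diff(pred, axis=1), np.diff(pred, axis=0)
    gx_g, gy_g = np.diff(gt, axis=1), np.diff(gt, axis=0)
    return np.abs(gx_p - gx_g).mean() + np.abs(gy_p - gy_g).mean()
```

All three terms vanish when the prediction is exact, so they can be summed with scalar weights into a total training loss.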

D. Adjustment for sparsely-labelled data
1) Sparse labelling: In addition to the densely-labelled datasets mentioned above, the sparsely-labelled IIW dataset has recently become available, with a larger number of real-world images. To apply our core idea in this situation, we must slightly adjust the training framework, as we first explain; we then give the loss functions measuring the intrinsic image reconstruction quality.
2) Framework adjustment: In order to train on the sparsely-labelled IIW dataset, there are a few barriers to overcome. Figure 3 shows the adjustments made to our framework.
One problem is the lack of dense supervision during the albedo reconstruction procedure. The reconstruction loss in Eq. (7) and the gradient loss in Eq. (8) both need pixel-wise dense supervision. However, the IIW dataset only has sparse annotations of reflectance comparisons at selected points of the images. Therefore, we have to apply alternative constraints utilizing the sparse labelling. Specifically, we use the ordinal loss to measure the difference between the output albedo image and the annotated input. Furthermore, smoothness constraints are applied to model the smoothness prior of the albedo component.
A further problem is the lack of shading ground truth, which is necessary in our two-stream training framework. The core idea of the feature distribution divergence is to extract distinctive features corresponding to different target intrinsic layers from the same input image. However, there is no annotation for the shading layers in this dataset. To circumvent the lack of ground truth data, we directly synthesize the target shading image from the input I and the reconstructed albedo A, using Eq. (1). The synthesized shading image can then be used as dense supervision.
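Inverting the formation model I = A × S to obtain a dense shading target can be sketched as follows; the clamping epsilon is an assumption added to avoid division by zero, not part of the paper's formulation.

```python
import numpy as np

def synthesize_shading(image, albedo, eps=1e-4):
    """Recover a dense shading target from the input image and the
    reconstructed albedo via the formation model I = A x S, i.e. S = I / A.
    eps guards against division by near-zero albedo (an assumption)."""
    return np.clip(image / np.maximum(albedo, eps), 0.0, None)
```

The resulting shading map stands in for the missing ground truth in the shading stream.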
Last but not least, the lack of reference intrinsic images causes problems for computing the feature distribution consistency. The FDC is designed to constrain the intrinsic image to have a distribution similar to the corresponding real reference. In detail, this constraint is achieved by minimizing the feature perceptual loss between the reconstructed and reference images. However, dense ground truth images are not provided in the IIW dataset, making ground truth features unavailable. To solve this problem, we maintain an image pool for each training stream, in which batches of reconstructed intrinsic images are gathered as guidance for the feature perceptual loss.
3) Training constraints: As noted, we modify the constraints to suit the sparse annotations in the IIW dataset. We now describe the ordinal loss and the smoothness constraints used.
We first consider the ordinal loss. Since dense ground truth labels are not available, [12] introduced the weighted human disagreement rate (WHDR) as an error metric. Following [17], we use an ordinal loss based on WHDR as the sparse supervision term.
The ordinal loss L_ord is obtained by accumulating the errors over all annotated pairs in the albedo image:

L_ord(A) = Σ_(i,j) e_i,j(A),

where e_i,j(A) represents the error for a pair of annotated pixels (i, j) in the predicted albedo image A. A detailed definition is provided in Appendix A. We next consider smoothness priors; we adopt the same ones as [17]. The smoothness constraints comprise the albedo smoothness constraint L_asmooth and the shading smoothness constraint L_ssmooth. The albedo component is constrained using a multi-scale L1 smoothness term, through which the albedo layer reconstruction is encouraged to be piecewise constant. Shading smoothness is constrained using a densely-connected L2 term. Detailed definitions are provided in Appendix A.
The total loss for sparsely-labelled data is:

L_sparse = L_total\Â + L_ord + λ_As L_asmooth + λ_Ss L_ssmooth,

where L_total\Â means the total loss function L_total for densely-labelled data in Eq. (9), with the terms containing the dense albedo ground truth Â removed, and λ_As and λ_Ss weight the smoothness priors. A study considering different values of λ_As and λ_Ss is provided in Section V-B3.

E. Adjustment for video data
The MPI Sintel dataset is composed of short films, so it naturally has temporal consistency.We also investigated a suitable framework for intrinsic decomposition of such video data.
For video intrinsic decomposition, adjacent pairs of frames are input into our two-stream networks to get the corresponding intrinsic layers.In addition to using the single image decomposition framework, optical flow is computed from the sequential input images, and used to provide temporal consistency guidance for the output intrinsic layers.
Optical flow is typically used to construct temporal correspondences between two adjacent frames, assuming that the pixel intensities of an object do not change between consecutive frames, and that neighbouring pixels have similar motion.
In this work, the optical flow field for pairs of consecutive input images is obtained directly from the MPI Sintel dataset. Besides the single image intrinsic decomposition loss L_total in Eq. (9), the optical flow u is used to enhance the temporal consistency of the output intrinsic layers:

L_temp = λ_A Σ_i ω_u(i) |A_(t+1)(i + u(i)) − A_t(i)| + λ_S Σ_i ω_u(i) |S_(t+1)(i + u(i)) − S_t(i)|,

where A_t(i) is the i-th pixel in the albedo image of frame t. We denote the optical flow map between the consecutive frames (t, t + 1) as u. The mask ω_u records which pixels have a valid optical flow value: ω_u(i) = 0 if pixel i is occluded, and 1 otherwise. We use λ_A and λ_S to balance the importance of the temporal consistency terms for the albedo and shading layers.
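A per-layer temporal term can be sketched as follows. Integer-valued flow and an L1 penalty are simplifying assumptions for illustration; a real implementation would use sub-pixel warping.

```python
import numpy as np

def temporal_loss(layer_t, layer_t1, flow, valid, lam=1.0):
    """Sketch of a temporal consistency term for one intrinsic layer: each
    valid pixel of frame t should match its flow-displaced counterpart in
    frame t+1. flow is (h, w, 2) holding (dy, dx); valid plays the role of
    the occlusion mask omega_u."""
    h, w = layer_t.shape
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if not valid[y, x]:
                continue  # occluded pixel: omega_u = 0, excluded from the sum
            dy, dx = flow[y, x]
            ty, tx = y + int(dy), x + int(dx)
            if 0 <= ty < h and 0 <= tx < w:
                total += abs(layer_t1[ty, tx] - layer_t[y, x])
                count += 1
    return lam * total / max(count, 1)
```

The same function would be applied to the albedo and shading streams with weights λ_A and λ_S.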

A. Basic approach
The MPI Sintel dataset [11] is a publicly-available, densely-labelled dataset containing complex indoor and outdoor scenes. It was originally designed for optical flow evaluation. For research into intrinsic image decomposition, ground truth shading images have been rendered with a constant gray albedo, taking illumination effects into account. However, due to the creation process, the original input frames cannot be reconstructed from the ground truth albedo and shading layers through Eq. (1).
As the first row of Figure 5 shows later, the specular component of the shading image cannot be observed in the original image, which means it does not share the same illumination condition. Although the simplified image formation model Eq. (1) need not be strictly respected, it is physically incorrect to extract a shading layer depicting different illumination effects from the original image. To overcome this inconsistency, previous works [14] directly resynthesize the original images I from the ground truth albedo A and shading S via Eq. (1). However, this approach does not deal with the specular component of the shading layer, which is considered not to be modeled well by Eq. (1) [15].
In this paper, we propose an approach to refine the dataset in order to shift it into a domain more representative of real images. The refined MPI Sintel dataset (MPI RD) obeys the image formation model in Eq. (1), and its shading layers contain no color information (gray shading). In addition, the shading layers in MPI RD maintain consistency with the original images in two respects: first, the specular component is removed from the shading layer; second, shape details observed in the original images are preserved in the shading layer. We describe our data refinement procedure in Algorithm 1. In summary, we shift the distribution of the albedo layer to a higher mean value, and then reconstruct the shading layer from the original image and the shifted albedo (steps 1 to 5). Next, invalid pixels in the reconstructed shading layer are computed using local linear embedding (LLE) [57], with the input I adopted as the guiding image to construct the embedding weights (steps 6 to 7). Finally, the input image is resynthesized from the processed albedo and shading images (step 8).
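Steps 1 to 5 and 8 of this procedure can be sketched as follows. The shift rule, target mean, and clamps are illustrative assumptions, and the LLE in-painting of invalid shading pixels (steps 6 to 7) is omitted.

```python
import numpy as np

def refine_frame(image, albedo, target_mean=0.7):
    """Sketch of the refinement: shift the albedo distribution toward a
    higher mean, re-derive shading as S = I / A, flag invalid shading
    pixels (which Algorithm 1 would hand to LLE), and resynthesize the
    input. target_mean and the additive shift are assumptions."""
    # steps 1-5: shift albedo to a higher mean, then reconstruct shading
    shifted = np.clip(albedo + (target_mean - albedo.mean()), 1e-4, 1.0)
    shading = image / shifted
    # pixels whose shading falls outside (0, 1) would go to LLE in-painting
    valid = (shading > 0.0) & (shading < 1.0)
    # step 8: resynthesize the input from the processed layers
    resynth = shifted * shading
    return shifted, shading, valid, resynth
```

By construction the resynthesized image matches the input exactly wherever the shading is kept, which is the consistency property MPI RD is designed to satisfy.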

B. Improvements
Although data refinement Algorithm 1 successfully suppresses inconsistencies in the intrinsic components of the MPI dataset, problems remain in terms of temporal consistency. For example, we can observe intensity jittering in consecutive albedo images, and areas lacking detail in shading images. We now discuss possible causes for these defects, as determined by investigating the dataset, and provide methods to alleviate them.
Jittering effects can be observed in frames with notable statistical changes (mean color changes caused by large dark object occlusion). Areas lacking detail are usually observed in shading images whose source image has low-contrast areas. Note that in the single-image refinement procedure, the validity mask is computed based on the source image and the original albedo and shading images. Invalid pixels (pixel values ∉ (0, 1)) are directly truncated, and therefore not used when computing statistics of the albedo image. Such invalid areas in the shading image are then reconstructed using the LLE algorithm. In fact, the areas lacking detail often overlap with the LLE-reconstructed areas, resulting in low-quality reconstruction. In conclusion, jittering effects are mainly due to large dark region changes between frames, while lack of detail is mainly caused by invalid region reconstruction artifacts.
This analysis suggests a simple method to deal with these problems. To avoid intensity jittering between consecutive frames, we use optical flow to construct correspondences between frames, and expand the pixel set used for statistical computation. This increases the robustness of the image statistics. To overcome the lack of detail in shading images, the region reconstruction method is optimized to take temporal correspondences into consideration.

C. Temporal consistency measurement
To measure temporal consistency in the sequential data, we use the same video temporal consistency metric (TCM) as [58]:

TCM(t) = ||O_t − warp(O_(t−1))|| / ||V_t − warp(V_(t−1))||,

where O_t and V_t represent the t-th frames of the output video (O) and the input video (V) respectively, and warp(·) is the warping function using the optical flow. The TCM of the t-th frame is calculated using the warping error between frames; the norm ||·|| of a matrix is the sum of squares of its elements. Through this equation, the processed video (O) is encouraged to be temporally consistent in accordance with variations in the input video.
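A numpy sketch of this metric is given below, assuming the frames have already been warped by the optical flow. The ratio-of-warping-errors form is a reconstruction from the surrounding text, not a verbatim transcription of [58]; the epsilon is an added assumption.

```python
import numpy as np

def tcm(out_t, out_prev_warped, in_t, in_prev_warped, eps=1e-8):
    """Sketch of the temporal consistency metric: the warping error of the
    output video normalized by the warping error of the input video. ||.||
    is the sum of squares of the elements; frames are assumed pre-warped by
    optical flow. The exact normalization is an assumption."""
    err_out = np.sum((out_t - out_prev_warped) ** 2)
    err_in = np.sum((in_t - in_prev_warped) ** 2)
    return err_out / (err_in + eps)
```

A value near zero means the output is temporally steadier than the input; a value near one means the output varies in step with the input's own motion.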
In order to visualize the video processing effects, we record the pixel-wise temporal consistency values in a TCM map, in which the TCM score is computed per pixel.

A. Datasets 1) MPI Sintel dataset and our refined version:
Sintel is an open-source 3D animated short film, which has been published in many formats for various research purposes. For intrinsic image decomposition, the clean pass images and the corresponding albedo and shading layers have been published as the MPI Sintel dataset, containing 18 sequences with a total of 890 frames. As discussed in Section IV, there is severe illumination inconsistency between the input frames and the shading layers in this dataset. Therefore, we provide the refined MPI Sintel dataset as a more suitable dataset for intrinsic image decomposition.
Figure 4(bottom) compares our refined MPI dataset (MPI RD) to the original MPI dataset (MPI). In the shading layer of the first column, we can see that in our refined shading image, the specularity on the girl's shoulder is removed, making the shading illumination consistent with the original input image. In the second and third columns, the shading layers from the MPI RD dataset contain more geometric details than those from the MPI dataset. For instance, the wooden cart's coarse surface is depicted in the refined shading in the third column, while the original shading from the MPI dataset has a smooth surface. These examples demonstrate that our refined MPI RD improves consistency between the intrinsic decomposition and the input image. In Figure 4(bottom, right), the mean squared error (MSE) between the input image I and the resynthesized image A × S is computed. The MSE for the MPI RD dataset is significantly smaller than for the MPI dataset, showing that the intrinsic decomposition model Eq. (1) is well respected in the refined dataset.
For data augmentation, we randomly resize the input image by a scale factor in [0.8, 1.3], and randomly crop a 288 × 288 patch from the resized image per iteration. We also use horizontal flipping in the training phase. When comparing methods, following [16], we evaluate our results on both a scene split and an image split. For a scene split, half of the scenes are used for training and the other half for testing. For an image split, all 890 images are randomly separated into two sets. Evaluation on a scene split is considered more challenging.

2) IIW dataset: Intrinsic Images in the Wild (IIW) [12] is a large-scale, public dataset of real-world scenes intended for intrinsic image decomposition. It contains 5,230 real images of mostly indoor scenes, combined with a total of 872,161 crowdsourced annotations of reflectance comparisons between pairs of points sparsely selected throughout the images (on average 100 judgements per image). Following many prior works [13], [41], [39], [16], we split the IIW dataset by placing the first of every five consecutive images, sorted by image ID, into the test set, and the others into the training set. WHDR from [12] is employed to measure the quality of the reconstructed albedo images.
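The training-time augmentation described above for the MPI data (random rescale in [0.8, 1.3], random 288 × 288 crop, random horizontal flip) can be sketched as follows; nearest-neighbour resizing and the minimum-size guard are simplifying assumptions for illustration.

```python
import numpy as np

def augment(image, rng, patch=288):
    """Sketch of the augmentation pipeline: random rescale by a factor in
    [0.8, 1.3] (nearest-neighbour here for brevity), random patch crop,
    and random horizontal flip. image is (h, w, c)."""
    s = rng.uniform(0.8, 1.3)
    h, w = image.shape[:2]
    # guard so the rescaled image is always large enough to crop from
    nh, nw = max(int(h * s), patch), max(int(w * s), patch)
    ys = (np.arange(nh) * h // nh).clip(0, h - 1)
    xs = (np.arange(nw) * w // nw).clip(0, w - 1)
    resized = image[ys][:, xs]
    # random crop of a patch x patch window
    y0 = rng.integers(0, nh - patch + 1)
    x0 = rng.integers(0, nw - patch + 1)
    crop = resized[y0:y0 + patch, x0:x0 + patch]
    # random horizontal flip
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop
```

Each training iteration draws a fresh scale, crop position, and flip decision from the generator.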
For the IIW dataset, our proposed network structure cannot be directly used due to the lack of dense labelling of albedo and shading layers; only sparse, relative reflectance annotations are provided. In order to take advantage of the proposed feature distribution divergence and feature distribution consistency, we modify the network. In detail, the predicted dense albedo is collected into an image pool to describe the distribution of albedo. The shading reconstructed from the original image and the predicted albedo is used as dense supervision for shading prediction, and is also collected in an image pool to describe the shading distribution. We set the weights in Eq. (6) to [γ_1, ..., γ_5] = [0, 0, 0, 1.0, 1.0].

B. Comparison to state-of-the-art methods
1) Using MPI Sintel and the refined dataset: As Table I shows, our method achieves the best results on the MPI Sintel dataset using the image split. On the more challenging scene split, our method is competitive with the state of the art, and achieves the best results in 5 out of the 9 cases in the table. We show a group of qualitative results evaluated on the scene split in Figure 5. While the MSCR [14] results are relatively blurred due to the large kernel convolutions and down-sampling, our method provides sharper results, comparable to Revisiting [16]. Moreover, our shading layer depicts shadow areas better than [16].
As explained in Section IV, the MPI Sintel dataset has issues of data consistency between the original input images and the corresponding shading images. Because of the proposed feature distribution divergence, feature distribution consistency, and the use of the cycle loss, our method is sensitive to such data inconsistency. Therefore, we compare our method to the state-of-the-art methods on the more challenging scene split of the refined MPI Sintel dataset. As Table II shows, our method achieves the best result, demonstrating the effectiveness of our method and of the data refinement process.
To further validate the effectiveness of the proposed method, we also conducted an ablation study on the training losses as well as the network architecture. Results are given in the bottom part of Table II. 'Plain' represents the baseline two-stream network structure shown in Figure 2(b). In the ablation study, the gradient loss and SSIM loss are progressively added to train the plain network. The experimental results show that using the gradient loss and SSIM loss simultaneously achieves better intrinsic image decomposition results. The proposed network architecture is denoted by 'Ours', as defined in Figure 2(c). It can be observed that using only the feature distribution divergence (w/o FDC) or only the feature distribution consistency (w/o FDD) does not improve results much, while using both of them results in considerable improvement.
Figure 6 displays a side-by-side comparison with two other methods using the refined dataset MPI RD. As can be seen, our method better separates shading from albedo information. For example, our method outputs consistent shadow around the girl's neck.
2) Using the MIT intrinsic dataset: We further experimented on the MIT intrinsic dataset [10], which consists of object-level real images. In this experiment, classical methods [7], [35], [33] as well as learning-based methods [16], [15], [14], [39] were compared to our proposed method. We also conducted an ablation study on the training strategies of our proposed network, including training from scratch (Ours scratch), pre-training on the original MPI Sintel dataset (Ours MPI), and pre-training on the refined MPI Sintel dataset (Ours RD). The experimental results are reported in Table III, and representative instances are selected for visual comparison in Figure 7. As in [16], we used the 220 images in the dataset. In comparisons to previous methods, the split from [7] was used. Our refined MPI Sintel dataset has grayscale shading images, so we first pre-trained the model on MPI RD and then fine-tuned it on the MIT training set.
Numerical results are shown in Table III. Our method (Ours RD) achieves the best results in most cases in the table. Moreover, Ours RD performs better than Ours MPI in terms of LMSE, and both pre-training methods perform better than training from scratch, demonstrating that pre-training on the refined MPI Sintel dataset helps intrinsic image decomposition on the MIT dataset.
Qualitative results are illustrated in Figure 7. We can observe that our method (Ours RD) predicts sharp and accurate intrinsic layers. Classical methods may produce meaningful layer separation results, but good results depend on parameter tuning, which requires expert knowledge. Compared to (Ours scratch) and (Ours MPI), (Ours RD) achieves better region consistency with the ground truth.
Using both simultaneously provides the best result. In Figure 9(c, d, e), three representative parameter settings are used to show the resulting intrinsic decomposition. The albedo image in Figure 9(e) contains more precise contours and consistent region colors.

4) Intrinsic decomposition of video data: In this experiment, we evaluated the proposed intrinsic decomposition method on video data with respect to reconstruction quality of the intrinsic images and to temporal consistency, through comparisons to alternative methods.
Framewise methods used for comparison include MSCR [14] and Revisiting [16]. The blind video temporal consistency method [52] is an unsupervised video smoothing method. In the experiment (Ours+DVP), it is applied as post-processing to increase the temporal consistency of the results produced by our framewise method (Ours). (Ours+Flow) is the proposed intrinsic decomposition method for video data. The temporal consistency constraint is applied by taking optical flow as input to construct correspondences between adjacent frames. We also extend the temporal consistency constraint to MSCR in the same way as for (Ours+Flow), in (MSCR+Flow). All methods were trained and tested on the MPI VRD dataset.
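A minimal NumPy sketch of a flow-based temporal consistency term follows. The nearest-neighbour warping and mask conventions here are our own assumptions for illustration; the paper's exact loss may differ:

```python
import numpy as np

def temporal_consistency_loss(pred_t, pred_tp1, flow, occlusion):
    """Warp the prediction at frame t+1 back to frame t using the
    optical flow, and penalise differences at pixels whose flow is
    valid (occlusion mask == 0). Illustrative sketch only."""
    h, w, _ = pred_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour warp: pixel (y, x) in frame t corresponds to
    # (y + flow_y, x + flow_x) in frame t+1.
    wy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    wx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    warped = pred_tp1[wy, wx]
    valid = (occlusion == 0)[..., None]
    diff = np.abs(pred_t - warped) * valid
    return float(diff.sum() / max(valid.sum(), 1))

# Identical static frames with zero flow incur zero loss.
p = np.random.default_rng(1).random((8, 8, 3))
zero_flow = np.zeros((8, 8, 2))
no_occ = np.zeros((8, 8))
print(temporal_consistency_loss(p, p, zero_flow, no_occ))  # 0.0
```

Masking out occluded pixels matters because their flow correspondences are invalid, as illustrated by the white regions of the occlusion masks in Figure 10.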
The intrinsic decomposition accuracy scores of framewise methods and flow methods are shown in Table V, while the temporal consistency scores for 8 short test videos are shown in Table VI. Using adjacent-frame temporal consistency as a constraint, (Ours+Flow) and (MSCR+Flow) achieve much better temporal consistency metric scores than their framewise counterparts, while the intrinsic decomposition accuracy scores remain almost unchanged. This demonstrates the effectiveness of our proposed extension method for video data. As a result, (Ours+Flow) achieves the best average accuracy and temporal consistency for intrinsic decomposition of video data.
We also visualize the temporal consistency of a specific frame clipped from a video, using a temporal inconsistency metric (TICM) map to highlight temporally inconsistent areas. First, the TICM map is computed using Eq. (14). Then, the TICM map is smoothed by a Gaussian kernel of size 65. Finally, the Jet colormap is inverted to highlight inconsistent areas with warm colors (the colder the color, the better the consistency). A qualitative comparison of different intrinsic decomposition methods on video data is shown in Figure 10. It can be observed that, using the temporal consistency constraint, both (MSCR+Flow) and (Ours+Flow) achieve better temporal consistency compared to their framewise counterparts (MSCR and Ours).
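The TICM smoothing step can be sketched in NumPy; the separable convolution and the sigma choice are our assumptions, and the colormap inversion is left to standard plotting tools:

```python
import numpy as np

def gaussian_kernel(size=65, sigma=None):
    """Normalised 1D Gaussian kernel; size 65 as in the visualisation
    above. The sigma default is an assumption."""
    sigma = sigma or size / 6.0
    x = np.arange(size) - size // 2
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def smooth_ticm(ticm):
    """Separable Gaussian smoothing of a per-pixel temporal
    inconsistency map, mirroring the 65-tap blur used for the
    qualitative figures."""
    k = gaussian_kernel(65)
    # Convolve each row, then each column ('same'-size output).
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, ticm)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, rows)

ticm = np.zeros((100, 100))
ticm[50, 50] = 1.0                # a single temporally inconsistent pixel
out = smooth_ticm(ticm)
print(out.shape)  # (100, 100)
```

The blur turns isolated inconsistent pixels into visible blobs, making the per-method comparison in Figure 10 easier to read.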
In Figure 11, temporal consistency loss is added to the framewise methods (MSCR) and (Ours). In the (Ours+Flow) result, temporally inconsistent areas are suppressed compared to (Ours). However, in the (MSCR+Flow) result, temporal inconsistency is unexpectedly amplified in the highlighted area in the red box. In addition, the (MSCR+Flow) result is more blurred than the one without temporal consistency loss. In MSCR, multiple scales of feature maps are merged in the encoding phase. While features from deeper layers contain high-level knowledge, they may lack texture details. The temporal consistency constraint may encourage MSCR to pay more attention to high-level features, resulting in more blurred outputs. The case of (MSCR+Flow) indicates that while the temporal consistency loss can easily be extended to other deep neural networks, the outcome will largely depend on the characteristics of the specific method.

VI. CONCLUSIONS
In this paper, we have presented a novel two-stream encoder-decoder network for intrinsic image decomposition. Our method is able to exploit discriminative properties of the features for different intrinsic images. Specifically, our feature distribution divergence is designed to increase the distance between features corresponding to different intrinsic images, and our feature perceptual loss is applied to constrain the feature distribution. These two modules work together to encode discriminative features for intrinsic image decomposition. We have also provided an algorithm to refine the MPI Sintel dataset to make it more suitable for intrinsic image decomposition. Visual results for MPI RD and the more challenging IIW dataset demonstrate that our proposed method can achieve results with better albedo and shading separation than existing methods. Its extension to video data is able to decompose video into intrinsic image sequences with temporal consistency.
The limitations of our method are of two kinds. Firstly, the mathematical model of intrinsic decomposition applied in this work is relatively preliminary. We cannot recover scene properties such as 3D geometry, light source positions, global illumination, etc. It is a worthwhile challenge to exploit feature discrimination properties in more complex and powerful mathematical models. Secondly, the temporal consistency constraint used to extend our method to intrinsic video decomposition does not generalize well. As Figure 11 shows, directly using the temporal consistency loss in MSCR can result in unwanted blurring. In the future, it is of interest to derive more general temporal consistency constraints for intrinsic video decomposition.

Acknowledgements. Portions of this work were presented at the International Conference on Computer Vision Workshops in 2019 [18]. This work was supported by the National Natural Science Foundation of China (NSFC) (Grants 61972012, 61732016).

Declarations
Conflict of interest The authors declare that they have no conflict of interest.

A. Ordinal loss
For each pair of annotated pixels (i, j) in the predicted albedo image A, we define an error function e_{i,j}(A), where r_{i,j} denotes the relative reflectance (albedo) judgement from IIW. The value of r_{i,j} is 1, 0, or −1 depending on whether the relative brightness of pixel i is greater than, the same as, or lower than that of pixel j.
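For illustration, a common hinge-style form of this ordinal error, as used in several IIW-based works, can be sketched as follows; the margin value and the exact hinge form are assumptions, not necessarily this paper's equation:

```python
import numpy as np

def ordinal_loss(albedo, pairs, margin=0.12):
    """Illustrative hinge-style ordinal loss over sparse reflectance
    judgements. `pairs` is a list of (i, j, r_ij) with r_ij in
    {1, 0, -1}; `albedo` maps pixel index to predicted reflectance."""
    total = 0.0
    for i, j, r in pairs:
        d = np.log(albedo[i]) - np.log(albedo[j])
        if r == 0:          # equal reflectance: penalise any log difference
            total += d ** 2
        elif r == 1:        # pixel i brighter: d should exceed the margin
            total += max(0.0, margin - d) ** 2
        else:               # pixel j brighter: -d should exceed the margin
            total += max(0.0, margin + d) ** 2
    return total / max(len(pairs), 1)

A = {0: 0.8, 1: 0.8, 2: 0.2}
print(ordinal_loss(A, [(0, 1, 0), (0, 2, 1)]))  # 0.0 (both judgements satisfied)
```

Working in log space makes the comparison a ratio of reflectances, which matches the multiplicative image formation model.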

B. Smoothness priors
The albedo component is constrained using a multi-scale L1 smoothness term:

L_{A-smooth} = \sum_{l} \sum_{i} \sum_{j \in N(l,i)} v_{l,i,j} \, \| \log A_{l,i} - \log A_{l,j} \|_1,   (16)

in which N(l, i) indicates the 8-connected neighborhood of the pixel at position i at scale l. v_{l,i,j} is the weight corresponding to the similarity between the pair of albedo pixels (i, j), formulated as exp(−(1/2)(f_{l,i} − f_{l,j})^T Σ^{−1} (f_{l,i} − f_{l,j})). The more similar the two pixels, the larger the weight, so albedo differences between similar pixels are penalised more heavily. f_{l,i} is the feature vector defined as [p_{l,i}, I_{l,i}, c¹_{l,i}, c²_{l,i}], where p_{l,i} is the spatial position, I_{l,i} is the image intensity, and c¹_{l,i} and c²_{l,i} are the first two elements of chromaticity. Σ is the covariance matrix defining the distance between two feature vectors. This albedo smoothness term encourages the reconstructed albedo layer to be piecewise constant.
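The pairwise similarity weight above can be sketched directly; the feature layout and the identity covariance are illustrative assumptions:

```python
import numpy as np

def pairwise_weight(f_i, f_j, Sigma):
    """Mahalanobis-form similarity weight between two pixel feature
    vectors, following the formula above: identical features give
    weight 1, dissimilar features give weights approaching 0."""
    d = np.asarray(f_i, dtype=float) - np.asarray(f_j, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d))

# Feature: [x, y, intensity, c1, c2]; identity covariance for illustration.
Sigma = np.eye(5)
f = [0, 0, 0.5, 0.3, 0.3]
same = pairwise_weight(f, f, Sigma)
far = pairwise_weight(f, [0, 0, 0.9, 0.1, 0.6], Sigma)
print(same)  # 1.0
print(far < same)  # True
```

In the full loss, this weight multiplies the L1 albedo difference, so similar-looking neighbouring pixels are pushed toward the same albedo while dissimilar ones are left free.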
The shading smoothness is formulated using a densely-connected L2 term:

L_{S-smooth} = \sum_{i,j} \hat{W}_{i,j} (\log S_i - \log S_j)^2,

where Ŵ is a bi-stochastic weight matrix derived from W, in which W_{i,j} = exp(−(1/(2σ²)) ‖p_i − p_j‖²). A detailed derivation can be found in [23], [60]. This weight measures the positional difference between a pair of pixels (i, j) in an image, with greater weight for nearby pixels.

APPENDIX B FURTHER RESULTS
Further qualitative comparisons are shown in Figure 12 (using the refined MPI Sintel dataset) and Figure 13 (using the IIW dataset).

Fig. 2 .
Fig. 2. Feature distributions of different network structures. In each row, left: simplified network structure; right: features visualized by t-SNE. The dotted curve represents the separating boundary between different domains in feature space. (a): 'Y'-shaped framework using a shared encoder. (b): plain two-stream framework using two independent encoder-decoder sub-networks. (c): our ' '-shaped framework, in which features from different streams interact during training.

Fig. 3 .
Fig. 3. Framework adjustment for sparsely-labelled data. Changed parts are highlighted in green.

Fig. 4 .
Fig. 4. Refined (MPI RD) and original (MPI) MPI Sintel datasets. The top row shows an example of illumination inconsistency in MPI. At the bottom is an example for MPI RD. At the bottom left, each image is split into two parts: the left shows the refined data, and the right the original data. The close-ups show detail regions in the shading images, which in MPI RD exclude specular components and preserve more geometric details. At bottom right is an MSE comparison between the original and resynthesized images. The MSE value for MPI RD is significantly lower than for MPI.

Fig. 5 .
Fig. 5. Qualitative comparison on the MPI Sintel dataset. The visual results are evaluated on the more challenging scene split. Regions with obvious differences are highlighted in dotted boxes.

Fig. 6 .
Fig. 6. Qualitative comparison on our refined MPI Sintel dataset MPI RD. Visual results are evaluated on the scene split. Our method is better at separating albedo and shading components. Regions with obvious differences are highlighted in dotted boxes.

3) Using the IIW dataset: In Table IV, we report results obtained using the test set of the IIW dataset. Our proposed

Fig. 7. Fig. 8.
Fig. 7. Qualitative comparison on the MIT intrinsic dataset. The intrinsic image decomposition results from classical methods (columns 2 and 3) and learning-based methods (columns 4 and 5) are compared with our result (column 8). Results of different training strategies for our network (columns 6-8) are also provided.

Fig. 10 .
Fig. 10. Qualitative comparison of different intrinsic decomposition methods on video data. Left to right, top row: albedo ground truth, optical flow map, and occlusion mask (pixels in white have invalid optical flow). Rows 2-6: intrinsic decomposition results for various methods, showing albedo prediction, TICM map, and both overlaid. Colder colors indicate better temporal consistency. By using the temporal consistency loss, both MSCR+Flow and Ours+Flow achieve better temporal consistency compared to their framewise counterparts MSCR and Ours. Results show frame 32 in cave 4.

Fig. 11 .
Fig. 11. Effects of using the temporal consistency loss, showing frame 23 in bamboo 2. The temporal consistency loss is added to the MSCR method (bottom left) and our method (bottom right). In the Ours+Flow result, the temporally inconsistent areas are suppressed compared to the Ours result. However, in the MSCR+Flow result, temporal inconsistency is unexpectedly amplified in the highlighted area in the red box. In addition, the MSCR+Flow result is more blurred.

Fig. 12 .Fig. 13 .
Fig. 12. A further qualitative comparison on our refined MPI Sintel dataset MPI RD, using the scene split. Our method is better at separating albedo and shading components. Regions with obvious differences are highlighted in dotted boxes.
Algorithm 1 Data refinement for MPI Sintel
Input: original MPI Sintel dataset comprising input images I, albedo images A, and shading images S
Output: refined MPI Sintel dataset comprising I*, A*, S*, such that I* = A*S*, as required by the intrinsic decomposition model in Eq. (1)
for each i ∈ [1, N] do
1: convert the RGB images into L*a*b* space, and extract the L channel as {I_i, A_i, S_i};
2: reconstruct the albedo and shading: Â_i = I_i / S_i, Ŝ_i = I_i / A_i
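A sketch of the per-image refinement step, assuming the algorithm keeps the original albedo and recomputes the shading so that I = A × S holds exactly; which reconstructed layer the full algorithm keeps is not fully shown above, so this choice is an assumption:

```python
import numpy as np

def refine_example(I_L, A_L, S_L, eps=1e-6):
    """One refinement step on the luminance (L) channel: re-derive each
    layer from the other two so that the triplet exactly satisfies
    I = A * S. This sketch keeps the original albedo and recomputes
    the shading; the reconstructed albedo is shown for completeness."""
    A_hat = I_L / np.maximum(S_L, eps)   # albedo implied by I and S
    S_hat = I_L / np.maximum(A_L, eps)   # shading implied by I and A
    return A_L, S_hat                    # refined pair: I = A_L * S_hat by construction

I = np.full((4, 4), 0.32)
A = np.full((4, 4), 0.8)
S = np.full((4, 4), 0.5)                 # inconsistent: 0.8 * 0.5 != 0.32
A_star, S_star = refine_example(I, A, S)
print(np.allclose(A_star * S_star, I))   # True
```

After refinement, the resynthesis MSE between I and A* × S* is zero by construction, which matches the behaviour reported for MPI RD in Figure 4.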

TABLE I: COMPARISON OF METHODS USING THE MPI SINTEL DATASET.

TABLE II: COMPARISON OF METHODS USING THE MPI RD DATASET.

TABLE III: METHOD COMPARISON AND ALTERNATIVE TRAINING APPROACH COMPARISON USING THE MIT INTRINSIC DATASET. THE TOP 3 METHODS ARE CLASSICAL METHODS, WHILE THE OTHERS ARE LEARNING-BASED ONES. NOTE THAT BARRON ET AL.'S METHOD RELIES ON SPECIALIZED PRIORS AND MASKED OBJECTS PARTICULAR TO THIS DATASET. ALTERNATIVE TRAINING STRATEGIES ARE GIVEN IN THE BOTTOM 3 ROWS. OURS (SCRATCH) MEANS TRAINING FROM SCRATCH. OURS (MPI) MEANS PRE-TRAINED ON THE ORIGINAL MPI SINTEL DATASET. OURS (RD) MEANS PRE-TRAINED ON THE REFINED MPI SINTEL DATASET.

TABLE IV: METHOD COMPARISON ON THE IIW TEST SET.