Correction: Spatio-Temporal Outdoor Lighting Aggregation on Image Sequences Using Transformer Networks

This erratum corrects errors in Sections 1, 3, and 5 of Lee et al. (2022). Some of the text in these sections was reproduced in a non-final form, which resulted in the omission of several major extensions made during the revision process. Figures and tables are not affected.


Introduction
• The following sentence in the 3rd paragraph (Fig. 1) should be modified as: In our work, we go in a similar direction as we robustly estimate the global sun direction and other lighting parameters (Lalonde & Matthews, 2014) by fusing estimates from both the spatial and temporal domains.
• The 5th paragraph (Fig. 2) should be modified as: … which accounts for individual orientations and fields of view of the input frames. With this novel pipeline, we are, to the best of our knowledge, the first to use an attention-based model for the task of lighting estimation. Finally, we extend our lighting model. Unlike previous work, which predicted only the sun direction, the proposed work estimates the parameters of the Lalonde-Matthews outdoor illumination model (Lalonde & Matthews, 2014). (The original article can be found online at https://doi.org/10.1007/s11263-022-01725-2.)
• The list of contributions in the 6th paragraph (Fig. 3) should be modified as:

Proposed Method
• An additional sentence should be inserted after the last sentence of the 1st paragraph: In this way, the samples obtained from each sequence provide different observations for the same global lighting condition. This design is motivated by our empirical results, which showed that lighting can be estimated well from many small parts.
• The 2nd paragraph (Fig. 4) is completely rewritten as: All image crops are passed through the backbone network and projected to a sequence of patch embeddings. We then add an orientation-invariant positional encoding and pass the sequence to our transformer network. Through the attention layers, the noisy spatio-temporal observations can be effectively aggregated into a final estimate. The weighted features are fed to a dense layer that produces the estimated Lalonde-Matthews illumination model parameters. The sun direction estimates are formulated in their own camera coordinate systems. We compensate for the camera yaw angle of each subimage in order to obtain aligned estimates in a unified global coordinate system. Our final prediction is given as the average of all estimates. Note that the sky parameters of the Lalonde-Matthews model do not require the alignment step, as they do not vary with respect to the camera yaw angle. The assumption behind our spatio-temporal aggregation is that distant sun-environment lighting can be considered invariant under small-scale translations (e.g., driving) and that the variation in lighting direction is negligible for short videos. In the following sections, we introduce the details of our method.
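The yaw compensation and averaging described in the rewritten paragraph can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: the helper names (`sun_angles_to_vec`, `aggregate_sun_direction`), the spherical-angle convention, and the assumption that yaw is a rotation about the vertical axis are ours.

```python
import numpy as np

def sun_angles_to_vec(theta, phi):
    """Spherical sun angles (zenith theta, azimuth phi) to a unit vector."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def aggregate_sun_direction(local_estimates, yaw_angles):
    """Rotate each per-subimage sun estimate by its camera yaw and average.

    local_estimates: list of (theta, phi) in each subimage's camera frame.
    yaw_angles: camera yaw (radians) of each subimage.
    """
    aligned = []
    for (theta, phi), yaw in zip(local_estimates, yaw_angles):
        v = sun_angles_to_vec(theta, phi)
        # A rotation about the vertical (z) axis undoes the camera yaw.
        c, s = np.cos(yaw), np.sin(yaw)
        R_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        aligned.append(R_z @ v)
    mean = np.mean(aligned, axis=0)
    return mean / np.linalg.norm(mean)  # final prediction: normalized average
```

Averaging unit vectors and renormalizing, rather than averaging angles, avoids wrap-around problems near the azimuth discontinuity.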

Lighting Estimation
• The 1st paragraph should be modified as: … where w_sun ∈ ℝ³ and w_sky ∈ ℝ³ are the mean sun and sky colors, (β, κ) are the sun shape descriptors, t is the sky turbidity, l_sun = [θ_sun, φ_sun] is the sun position, γ_l is the angle between the light direction l and the sun position l_sun, and f_P is the Preetham sky model (Preetham et al., 1999). For more details, please refer to Lalonde and Matthews (2014).
• The following sentence should be inserted at the beginning of the 2nd paragraph (Fig. 6): Among the parameters, the sun direction may be the most critical component. Unlike our predecessors … Since the two loss functions L_sun and L_param have similar magnitudes, we define the final loss function as their sum: L = L_sun + L_param.
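Counting the components listed above (w_sun ∈ ℝ³, w_sky ∈ ℝ³, β, κ, t, and l_sun = [θ_sun, φ_sun]) yields the 11-dimensional output vector mentioned in the Alignment subsection. A minimal sketch of unpacking such a vector follows; the field ordering and names are assumptions for illustration, not the paper's actual layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LMParams:
    w_sun: np.ndarray   # mean sun color, in R^3
    w_sky: np.ndarray   # mean sky color, in R^3
    beta: float         # sun shape descriptor
    kappa: float        # sun shape descriptor
    t: float            # sky turbidity
    l_sun: np.ndarray   # sun position [theta_sun, phi_sun]

def unpack(v: np.ndarray) -> LMParams:
    """Split an 11-dimensional network output into Lalonde-Matthews parameters.

    Component count: 3 + 3 + 1 + 1 + 1 + 2 = 11.
    """
    assert v.shape == (11,)
    return LMParams(w_sun=v[0:3], w_sky=v[3:6], beta=float(v[6]),
                    kappa=float(v[7]), t=float(v[8]), l_sun=v[9:11])
```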

Orientation-Invariant Positional Encoding
• The occurrences of the abbreviation fov (field of view) in the 1st paragraph (Fig. 7) should be substituted with a spherical angle symbol: For example, the top left pixel gets a coordinate of (−h/2, v/2) for a pinhole camera model with a field of view of h and v horizontally and vertically, respectively.
• The first occurrence of x_i in equation 5 (Fig. 8) should be substituted with x_i^enc: We use an absolute positional encoding, i.e., x_i^enc = x_i + p_i, where the positional encoding p_i and the subimage feature vector x_i ∈ ℝ^{d_x} are superimposed.
• The following sentence should be inserted after the last sentence: The resulting positional encoding of a subimage is the stacked vector of the three cyclic positional encodings. Note that the depth parameter d is carefully determined so that the depth of the stacked vector matches the channel size of the transformer network.
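A cyclic positional encoding of the kind described can be sketched with integer-frequency sinusoids, which are periodic in 2π so that wrapped angles receive identical codes. The particular frequency schedule, the choice of angles, and the function names below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def cyclic_encoding(angle, depth):
    """Sinusoidal encoding of a cyclic quantity (in radians).

    Integer frequencies keep the encoding periodic in 2*pi, so angles that
    wrap around the circle map to the same code. `depth` must be even.
    """
    assert depth % 2 == 0
    freqs = np.arange(1, depth // 2 + 1)  # frequencies 1, 2, ..., depth/2
    return np.concatenate([np.sin(freqs * angle), np.cos(freqs * angle)])

def subimage_encoding(azimuth, elevation, roll, depth_per_angle):
    """Stack three cyclic encodings into one positional code.

    The total depth, 3 * depth_per_angle, should match the channel size
    of the transformer network, as noted in the text.
    """
    return np.concatenate([cyclic_encoding(a, depth_per_angle)
                           for a in (azimuth, elevation, roll)])
```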

Calibration
• Occurrences of 'calibration' and 'calibrated' should be substituted with 'alignment' and 'aligned'. This change includes the subsection title.
• The first two sentences are completely rewritten to reflect the changes introduced by an extended sun and sky model.
• A new sentence is inserted at the end of the 1st paragraph.
The correct text for these three changes is:

Alignment
Our neural network outputs the lighting parameters as an 11-dimensional vector for a given sequence of image patches. Although this prediction is made by considering patches from different temporal and spatial locations, the sun direction estimates are in their own local camera coordinate systems. Therefore, we perform an alignment step using the camera ego-motion data to transform the estimated sun direction vectors into the world coordinate system. We assume that the noise and drift in the ego-motion estimation are small relative to the lighting estimation error. Therefore, we employ a widely used structure-from-motion (SfM) technique, such as Schonberger & Frahm (2016), to estimate the ego-motion of an image sequence.
Each frame f has a camera rotation matrix R_f, and the resulting aligned vector is computed as R_f^{-1} · v_pred. Finally, we take the mean of the aligned lighting estimates as our final prediction.
• The second paragraph should be removed.
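The alignment computation above can be sketched in a few lines of numpy. We assume here, for illustration only, that R_f maps world to camera coordinates, so that its inverse (the transpose, for a rotation matrix) performs the alignment; the actual convention depends on the SfM output.

```python
import numpy as np

def align_sun_directions(v_preds, rotations):
    """Transform per-frame sun direction estimates into world coordinates.

    v_preds:   (N, 3) unit sun-direction estimates in camera coordinates.
    rotations: (N, 3, 3) camera rotation matrices R_f from SfM ego-motion.

    Applies R_f^{-1} (the transpose, since R_f is a rotation) to each
    estimate and returns the normalized mean as the final prediction.
    """
    aligned = np.stack([R.T @ v for R, v in zip(rotations, v_preds)])
    mean = aligned.mean(axis=0)
    return mean / np.linalg.norm(mean)
```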

Conclusion
• The 2nd paragraph (Fig. 11) is completely rewritten as: Although we demonstrated visually appealing results in augmented reality applications, intriguing future research topics remain open. Intuitively, the performance of the model should scale with the sequence length, as more information is present. We plan to scale both our model and data to examine the limits of attention-based spatio-temporal aggregation for lighting estimation. Another interesting direction would be the integration of our method into reconstruction pipelines, such as SLAM; knowing the lighting direction and shadow casting can help initialize camera estimation. Lastly, we want to investigate the sampling methods further. Instead of picking 8 random frames from an image sequence, we could select consecutive frames and experiment with the number of frames and the distance from the starting point.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.