1 Introduction

Fig. 1: Diagrams of the initial acquisition methods and the proposed SFF-RTI acquisition method

Shape from focus (SFF) [1] is an imaging method that seeks to capture the 3D shape of an object. It works by moving the focal plane, and therefore the center of the camera’s depth of field. The focal plane is the two-dimensional plane in the scene, perpendicular to the optical axis, at which the sharpest focus is obtained. By correlating the movement of the focal plane with a per-pixel estimation of focus, the depth of each analyzed point in the scene can be estimated relative to the camera (Fig. 1a). This can be used to build a 3D model of the object with correct proportions.

On the other hand, reflectance transformation imaging (RTI) [2] is a multi-light imaging method used to better highlight the surface topography of an object, assisting in the computation of surface normals. This is done by imaging a subject from a single camera viewpoint under multiple known directions of illumination (Fig. 1b). The resulting mathematical interpretation of the surface can be used to compute the surface normals of an object and to understand the reflectance properties of the surface for tasks such as digital relighting. This mathematical model derived from RTI has taken different forms over the years to improve its accuracy [2,3,4]. This provides a greater opportunity for observation to those who cannot have direct physical access to the subject, for either geographical or conservation reasons, allowing for further scientific analysis as well as wider access for the general public.

The former method aims to capture the overall shape of the object first and foremost, while the latter works on a smaller scale, capturing finer details such as nicks, dents, or general texture. These two levels of detail are important not only for analyzing an object in the appropriate context in a quantitative manner but also for creating a convincing virtual representation of the object, whether viewed from an aesthetic or a scientific standpoint.

In this paper, we will first explain the functioning of shape from focus and briefly review the relevant scientific literature, followed by the same for reflectance transformation imaging. These sections will also discuss the strengths and weaknesses of the two methods to highlight where exactly a fusion of the two could prove interesting. Following this, the SFF-RTI method will be proposed in full, with an explanation of the techniques and models used. Finally, qualitative and quantitative discussions of the SFF-RTI results will provide perspectives relevant to the development of the imaging methodology as well as the actual viewing and quality of the results.

2 Limits of current methodologies

2.1 Advantages and disadvantages of shape from focus

First developed by Nayar and Nakagawa [1], shape from focus (SFF) is a passive imaging technique for recovering the shape of objects from a series of differently focused images. It rests on the understanding that high-frequency spatial variation can be used as an indicator of focus and that optical blur acts as a low-pass filter. It is defined as a passive imaging technique because it does not introduce any illumination during the acquisition process. Nayar and Nakagawa implemented a localized measure of focus, the sum-modified Laplacian, to track the focus of different points on the object across the image stack. By maximizing this focus measure, a general idea of the object shape can be recovered. This initial rough shape is then refined using a three-point Gaussian interpolation around the point of focus for each pixel.

These results are typically stored in a depth map: a grayscale image that contains at each pixel a value which is representative of the depth of that point on the surface with respect to the image sensor. Examples can be seen in Fig. 2. This method became the basis for many similarly-posed shape recovery problems: estimate the focus of a set of varyingly focused images to use as a rough depth estimate for further correction.

Fig. 2: SFF depth maps generated using an increasing number of Z positions from left to right

Efforts in the field of focus estimation and shape from focus have also included the processing and analysis of images in the frequency domain [5,6,7] and machine learning methodologies [8,9,10]. The former set of approaches is considered outside the scope of this work, as we combine images at the gradient level, which is already a representation of the focus of the scene. The latter set is not investigated because the proposed changes to the shape from focus methodology do not concern the depth estimation step and therefore do not overlap with these papers.

The limits of shape from focus found in the development of the proposed methodology are based on both the characteristics of the imaging setup used and the properties of the focus measure that is employed.

One such limit concerns the focus measure operator being used. A typical method of estimating focus is to look for high spatial frequencies, since such frequencies are attenuated by optical blur, which, as mentioned above, acts as a low-pass filter. To estimate the focus per pixel, a neighborhood around that pixel is sampled. The size of this neighborhood can determine the method’s susceptibility to noise.

Secondly, the only dimension of variation in a shape from focus collection is along the Z axis. This can result in effects such as the point spread function causing an unclear understanding of edges if the blurring effect is too strong. This is especially apparent on the outside edges of the subject, as even a blurred edge can be measured as having a higher spatial frequency than a black background (registered as nothing). This lack of dimensional variation can also cause “holes” in the resulting depth map: pixels where the depth of the surface could not be estimated. This can be attributed either to shadowing in the scene causing the same poor estimation discussed with respect to the background, or to the diffuse lighting not providing a sharp relief of small-scale surface features and therefore not producing a high spatial frequency response.

This is all to say that shape from focus, as the name implies, aims to recover the shape of an object. With respect to an object, the shape can be considered the large-scale features while the texture can be seen as the small-scale features.

2.2 Advantages and disadvantages of multi-light imaging

First proposed by Woodham [11], photometric stereo is a method of imaging where an object is lit from different angles and captured using a single camera viewpoint. Images of all these light positions, combined with the spatial consistency of the object and camera, allow one to estimate the reflectance at each pixel on the surface for a given light angle and then compute for each point the normal vector of the surface given the total “light position” space. Woodham’s original technique proposed the usage of only three light positions under the assumption of Lambertian surface reflectance. This has since been extended to account for n light positions. This computation of surface normals can be displayed as a pseudo color image, with the X, Y, and Z components of the normal vector being interpreted as the R, G, and B channels, respectively.
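As a concrete illustration of the n-light extension, the Lambertian model reduces to a per-pixel least-squares problem: the known light directions are stacked into a matrix, the observed intensities into a vector, and the albedo-scaled normal is the vector that best explains the intensities. The sketch below is a minimal NumPy illustration under this assumption; the function and variable names are ours, not taken from [11].

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Estimate per-pixel surface normals from n differently lit images.

    images     : (n, H, W) grayscale intensities, one image per light position.
    light_dirs : (n, 3) unit vectors pointing from the surface towards each light.
    Returns an (H, W, 3) array of unit normals under the Lambertian assumption.
    """
    n, H, W = images.shape
    I = images.reshape(n, -1)                              # (n, H*W)
    # Solve light_dirs @ g = I in the least-squares sense; g = albedo * normal.
    g, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)     # (3, H*W)
    albedo = np.linalg.norm(g, axis=0) + 1e-8
    return (g / albedo).T.reshape(H, W, 3)
```

The X, Y, and Z components of the recovered normals can then be mapped to R, G, and B for the pseudo-color visualization mentioned above.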

Another form of multi-light imaging, introduced by Malzbender et al. [2], is RTI, which can be paired with continuous models such as polynomial texture mapping (PTM), hemispherical harmonics (HSH) [3], and discrete modal decomposition (DMD) [4]. The aim of this method is to take a large multi-light dataset and fit one of the aforementioned models, which describes at each pixel the surface reflectance properties. This can be used to digitally relight an object, making it a tool that can inform the end-user of shape and texture cues not readily apparent in normal 2D imagery.

Building off of these, Fattal et al. [12] devised a method of combining multi-light images using multiscale bilateral filtering. The aim of this work was to create RGB images that better convey a sense of shape and detail to the viewer. The results are successful yet remain rooted in two-dimensional images: they convey a sense of texture well, but they still lack an understanding of the 3D shape of an object. On the opposite side of the field, Raskar et al. [13] attempted to better understand and convey the overall 3D structure in non-photorealistic images. By measuring the width of shadows projected by surface features and correlating that with various known light angles, a non-photorealistic image was generated that emphasizes the shape of the object.

These methods can provide informative and high-quality normal maps, giving an idea of the surface and its details but without an absolute geometric dimension. That is to say, the normal maps describe the surface relative to itself, but they lack an absolute frame of reference and cannot as easily be converted into a 3D model for visualization.

2.3 Motivations for SFF-RTI

As discussed in the previous section, shape from focus can miss certain details due to a diffuse light not providing sharp relief of small-scale surface details or due to lighting placement. Also, the resolution of the resulting shape from focus depth map, as well as the scale at which it can detect surface details, both depend on the change in focus distance between images. On the other hand, multi-light imaging uses multiple lighting positions and specular reflections to capture the small-scale and large-scale surface details without understanding the shape of the imaged object.

It is in the context of the aforementioned strengths and weaknesses of the two methods that we propose SFF-RTI. The SFF-RTI method aims to leverage their strengths while minimizing their weaknesses. This is not to imply that all the gaps in information will be filled, but improvement can still be seen by augmenting one method with the other. In this paper, the opportunity was taken to enhance the data used as input for shape from focus, granting it greater sensitivity to the smaller details and texture cues on the surface of the object.

3 Proposed SFF-RTI

First, the input data used for the proposed SFF-RTI algorithm are a combination of the shape from focus and RTI data collections: for each Z position used for shape from focus, there is an image at each light position (LP), and vice versa (Fig. 1c). In this current iteration, the same light positions are used for each Z position. Although it could prove interesting to see whether different information can be gathered by varying the light positions used with each Z position, this is not explored in this work.

The first step in the proposed method is to estimate the focus of the scene on a per-pixel basis using a focus measure. In this case, the focus measure operator used is the sum-modified Laplacian, first proposed alongside the original shape from focus paper [1]. Following the original definition, Eq. 1 defines the modified Laplacian and Eq. 2 dictates its summation.

$$\begin{aligned} \mathrm{ML}(x,y) =&\; |2I(x,y) - I(x-\text{step},y) - I(x+\text{step},y)| \nonumber \\ &+ |2I(x,y) - I(x,y-\text{step}) - I(x,y+\text{step})|, \end{aligned}$$
(1)

where “step” is a preset value defining the size of the local neighborhood.

$$\begin{aligned} \mathrm{SML}(i,j) = \sum _{x=i-N}^{i+N}\;\sum _{y=j-N}^{j+N}\mathrm{ML}(x,y) \quad \text{for} \quad \mathrm{ML}(x,y) \ge T_{1}, \end{aligned}$$
(2)

where \(T_1\) is a preset threshold value. The values of the “step” size and \(T_{1}\) are taken from the original shape from focus paper [1]: \(T_{1} = 7\) and step = 1.
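The following is a minimal NumPy sketch of Eqs. 1 and 2, assuming a grayscale input image; splitting the modified Laplacian into separate X and Y terms is our convenience for reuse in Eq. 6 and is not part of the original formulation. The function names and the use of SciPy's box filter are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def modified_laplacian(img, step=1):
    """Modified Laplacian of Eq. (1), returned as separate X and Y terms."""
    I = img.astype(np.float64)
    p = np.pad(I, step, mode='edge')
    ml_x = np.abs(2 * I - p[step:-step, :-2 * step] - p[step:-step, 2 * step:])
    ml_y = np.abs(2 * I - p[:-2 * step, step:-step] - p[2 * step:, step:-step])
    return ml_x, ml_y

def sum_modified_laplacian(img, N=2, step=1, T1=7):
    """SML of Eq. (2): sum of thresholded ML values over a (2N+1) x (2N+1) window."""
    ml_x, ml_y = modified_laplacian(img, step)
    ml = ml_x + ml_y
    ml[ml < T1] = 0.0                                  # discard responses below T1
    win = 2 * N + 1
    return uniform_filter(ml, size=win) * win * win    # box sum over the window
```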

In a shape from focus algorithm, these focus maps can be used to represent the per-pixel focus at each imaged Z position. By correlating the greatest estimated per-pixel focus with the corresponding Z position, a depth map can be created which represents the in-scene depth for each pixel respective to the Z position measurements. Once the focus maps are computed, they are stacked and the Z position which correlates to the maximum focus response at each pixel is determined. This results in a rough estimation of depth that is then refined using a three-point Gaussian interpolation at each maximum.
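A minimal sketch of this standard depth-from-focus step, assuming a stack of precomputed focus maps and uniformly spaced Z positions, is given below; the three-point Gaussian interpolation is implemented as a parabolic fit in log space, which is one common reading of the refinement in [1], and the names are ours.

```python
import numpy as np

def depth_from_focus(focus_maps, z_positions):
    """Depth map from a stack of per-Z focus maps via peak finding and refinement.

    focus_maps  : (nZ, H, W) focus responses (e.g. SML), one map per Z position.
    z_positions : (nZ,) uniformly spaced focus distances for each map.
    """
    nZ, H, W = focus_maps.shape
    k = np.clip(np.argmax(focus_maps, axis=0), 1, nZ - 2)   # coarse peak index
    rows, cols = np.indices((H, W))
    Fm = np.log(focus_maps[k - 1, rows, cols] + 1e-12)
    F0 = np.log(focus_maps[k,     rows, cols] + 1e-12)
    Fp = np.log(focus_maps[k + 1, rows, cols] + 1e-12)
    # Three-point Gaussian refinement: vertex of the parabola fitted in log space.
    delta = (Fp - Fm) / (2.0 * (2.0 * F0 - Fm - Fp) + 1e-12)
    dz = z_positions[1] - z_positions[0]                     # Z spacing
    return z_positions[k] + delta * dz
```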

In the proposed method, multiple light positions are gathered for each Z position. To use this extra acquisition dimension alongside shape from focus, the information across all the light positions is integrated to create one representation of the object per Z position. One method of achieving this is by calculating the full vector gradient (FVG) [14], a method originally devised to account for the correlation in multidimensional data when computing the gradient.

Fig. 3: Example multi-light imaging layout with five light positions

This idea can be extended to the dimension spanning the light positions (the LP dimension). The cosine similarity (CS), Eq. 3, is computed for each pair of light positions; since the similarity of a direction with itself is 1, each diagonal element of the resulting Gram matrix is equal to 1. This idea is seen in Eqs. 4 and 5. The latter specifically shows the Gram matrix computed for an SFF-RTI acquisition with 5 light positions, visualized in Fig. 3. The computed Gram matrix is symmetric about the diagonal, as the cosine similarity computed for a pair of positions is the same in either order. Angles that are closer to each other, such as \(L_{4}\) and \(L_{5}\), have a cosine similarity closer to 1.0. Conversely, angles that are farther apart, such as \(L_{1}\) and \(L_{3}\), produce a value closer to \(-\)1.0. It follows that angles that are near orthogonal, such as \(L_{1}\) and \(L_{2}\), result in a cosine similarity near 0. This weighting de-emphasizes information that does not correlate across multiple light positions, since it has a lower probability of being reliable.

$$\begin{aligned} \text{Cosine similarity} := \cos (\theta ) = \frac{\overrightarrow{\textrm{OL}_{i}}\cdot \overrightarrow{\textrm{OL}_{j}}}{\left\| \overrightarrow{\textrm{OL}_{i}}\right\| \left\| \overrightarrow{\textrm{OL}_{j}}\right\| } \end{aligned}$$
(3)
$$\begin{aligned} G = \begin{bmatrix} 1 &{} \text{CS}_{L_{1},L_{2}} &{} \cdots &{} \text{CS}_{L_{1},L_{n}}\\ \text{CS}_{L_{2},L_{1}} &{} 1 &{} \cdots &{} \vdots \\ \vdots &{} \vdots &{} \ddots &{} \text{CS}_{L_{n-1},L_{n}}\\ \text{CS}_{L_{n},L_{1}} &{} \cdots &{} \text{CS}_{L_{n},L_{n-1}} &{} 1 \end{bmatrix} \end{aligned}$$
(4)
$$\begin{aligned} G = \begin{bmatrix} 1.0 &{} 0.092 &{} -0.463 &{} -0.173 &{} 0.504\\ 0.092 &{} 1.0 &{} -0.615 &{} 0.707 &{} 0.629\\ -0.463 &{} -0.615 &{} 1.0 &{} 0.112 &{} -0.140\\ -0.173 &{} 0.707 &{} 0.112 &{} 1.0 &{} 0.760\\ 0.504 &{} 0.629 &{} -0.140 &{} 0.760 &{} 1.0 \end{bmatrix} \end{aligned}$$
(5)
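A minimal sketch of this computation, assuming the light positions are given as 3D vectors from the object center O to each light \(L_i\), is shown below (the function name is ours).

```python
import numpy as np

def light_gram_matrix(light_positions):
    """Gram matrix of cosine similarities between light directions (Eqs. 3-4).

    light_positions : (n, 3) vectors from the object center O to each light L_i.
    Returns the symmetric (n, n) matrix with ones on the diagonal.
    """
    L = np.asarray(light_positions, dtype=np.float64)
    L /= np.linalg.norm(L, axis=1, keepdims=True)   # unit direction vectors
    return L @ L.T                                  # entry (i, j) = cos(theta_ij)
```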

This matrix is then used to define the autocorrelation matrix of the gradient of the image, Eq. 6. For Eq. 6, the SML is broken into its X and Y components by splitting Eq. 1 in the middle so that only one dimension is used at a time. Also of note is that the threshold value \(T_{1}\) in Eq. 2 is halved to account for having half the number of summed values to compare.

$$\begin{aligned} \mathrm{SML} = \mathrm{SML}(I)_x \cdot G \cdot \mathrm{SML}(I)_x + \mathrm{SML}(I)_y \cdot G \cdot \mathrm{SML}(I)_y. \end{aligned}$$
(6)
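Read per pixel, Eq. 6 is a quadratic form: the n-vector of per-light X responses (and likewise the Y responses) is weighted by the Gram matrix G. A minimal NumPy sketch, assuming the directional SML responses have been stacked along the light dimension as in the earlier sketches, is given below; the function name is ours.

```python
import numpy as np

def full_vector_gradient(sml_x, sml_y, G):
    """Combine multi-light focus responses with the Gram matrix as in Eq. (6).

    sml_x, sml_y : (n, H, W) directional SML responses, one slice per light position.
    G            : (n, n) Gram matrix of light-direction cosine similarities.
    Returns an (H, W) integrated focus map for a single Z position.
    """
    fvg_x = np.einsum('ihw,ij,jhw->hw', sml_x, G, sml_x)   # per-pixel x^T G x
    fvg_y = np.einsum('ihw,ij,jhw->hw', sml_y, G, sml_y)
    return fvg_x + fvg_y
```

The map produced for each Z position can then replace the single-illumination focus map in the peak-finding step described above.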

4 Experiment methodology

4.1 Experiments

Experiments were conducted to demonstrate not only the methods of multi-light integration, but also how the results of the chosen method can vary based on the window size of the focus measure operator. This is a parameter often tweaked in SFF setups and so this experiment provides a brief discussion on the value chosen for SFF-RTI. Two statistical measures were chosen to describe the performance of the methods: the peak signal-to-noise ratio (PSNR) and the root mean square error (RMSE).

The motivation for using PSNR as a metric for this comparison is to see how much noise has been introduced into the focus maps by adding another dimension of information (the integrated light positions). To achieve this, the PSNR was computed between the focus maps generated by a traditional shape from focus method and the focus maps generated for each Z position across all light positions using the proposed SFF-RTI framework. For plotting, the PSNR was averaged across all Z positions for each dataset.

$$\begin{aligned} \text{PSNR} = 10\,\log _{10} \left( \frac{N R^{2}}{\mathrm{MSE}(Z_{\mathrm{SFF}}, Z_{\mathrm{SFF}\text{-}\mathrm{RTI}})} \right) , \end{aligned}$$
(7)

where R is the maximum value possible in the image determined by the bit depth and \(Z_{\text {SFF}}\) and \(Z_{\text {SFF-RTI}}\) are the estimated values of depth in the SFF and SFF-RTI depth maps, respectively.

The RMSE is computed between the ground truth depth map and those generated using the SFF and SFF-RTI methods.
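For reference, the two metrics can be computed as in the sketch below; the PSNR here uses the conventional \(R^{2}/\mathrm{MSE}\) form (Eq. 7 additionally carries a factor N), and the default R = 255 assumes 8-bit data. The function names are ours.

```python
import numpy as np

def rmse(depth_est, depth_gt):
    """Root mean square error between an estimated and a ground-truth depth map."""
    return np.sqrt(np.mean((depth_est - depth_gt) ** 2))

def psnr(z_sff, z_sffrti, R=255.0):
    """PSNR between SFF and SFF-RTI maps; R is set by the bit depth of the data."""
    mse = np.mean((z_sff.astype(np.float64) - z_sffrti.astype(np.float64)) ** 2)
    return 10.0 * np.log10(R ** 2 / mse)
```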

Fig. 4: RGB image and ground truth simulated for the “statue” object using Blender

4.2 Generation of data

For the development and validation of the proposed methods, synthetic data were generated using Blender [15], a 3D modeling and animation program released under the GNU GPL; specifically, Blender version 3.1.0 was used. 3D models of cultural heritage objects released under Creative Commons licenses were used for simulation. For these simulations, the desired number of lamps and a camera are placed to simulate an RTI dome setup, with the camera able to modify its focal point and move along the z-axis, i.e., towards and away from the object being imaged. Using ray tracing, Blender is able to realistically render the objects used to validate the methodology.
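As an illustration, such an acquisition could be scripted through Blender's Python API (bpy) roughly as follows; the light coordinates, focus distances, aperture value, and output paths are placeholders rather than the values used in this work, the script assumes the scene already contains an active camera, and property names may differ slightly between Blender versions.

```python
import bpy

# Illustrative acquisition parameters (placeholders).
light_positions = [(0.5, 0.0, 0.8), (0.0, 0.5, 0.8), (-0.5, 0.0, 0.8), (0.0, -0.5, 0.8)]
focus_distances = [0.8, 0.9, 1.0, 1.1, 1.2]        # focal-plane distances in metres

cam = bpy.context.scene.camera                     # assumes an active camera exists
cam.data.dof.use_dof = True                        # enable depth of field
cam.data.dof.aperture_fstop = 2.8                  # aperture controls DoF thickness

lights = []
for pos in light_positions:
    bpy.ops.object.light_add(type='POINT', location=pos)
    lights.append(bpy.context.object)

for zi, dist in enumerate(focus_distances):
    cam.data.dof.focus_distance = dist             # move the focal plane
    for li, light in enumerate(lights):
        for other in lights:                       # light only one lamp per shot
            other.hide_render = (other is not light)
        bpy.context.scene.render.filepath = f"//renders/z{zi:02d}_l{li:02d}.png"
        bpy.ops.render.render(write_still=True)
```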

This approach was taken for a few reasons. The first is that performing acquisitions with a 3D model provides an absolute understanding of the surface geometry, yielding ground truth for both depth maps and surface normals. The second is that it allows a variety of different objects to be imaged any number of times while also allowing the capture methodology to be changed as often as desired. This is in contrast to a real object of cultural heritage, which might require highly controlled conditions such as room temperature or ambient light.

The third reason is that the simulation of the data collection removes many factors present in real-world image acquisitions such as camera jitter due to movement and inconsistent alignment of an object in the camera field of view across multiple data collections.

The first 3D model used for the generation of synthetic data was acquired through the website Sketchfab. Uploaded by “Archéomatique”, it depicts the “Statue du parc d’Austerlitz, Ajaccio (2A)” [16], a statue in Austerlitz Park in Ajaccio; it is unclear whether the model was created using a photogrammetric imaging method or by hand. Examples of the images generated for this object using Blender can be seen in Fig. 4.

The second 3D model used was provided by the Museum of King Jan III’s Palace at Wilanow in Poland [17]. While the 3D model was provided by the museum, details of the model are available on Sketchfab at the URL in [17]. It was chosen to contrast the statue object since it lacks extreme differences in surface depths. Being a smaller object, it is also easier to image the texture of the surface. Examples of the images generated for this object using Blender can be seen in Fig. 5.

Table 1 lists all permutations of the synthetic imagery datasets generated for this paper. The only camera parameter that changes between acquisitions is the aperture size. As the number of depth positions increases, the aperture is opened up more. This results in a shallower depth of field which aims to maximize the gains in shape recovery.

Fig. 5: RGB image and ground truth simulated for the “gemma” object using Blender

Table 1: Details of the generated datasets

A note before the analysis: a restricted range of depths was used for the following analyses. This was done because there is a large number of depth positions between the brick wall behind the statue and the alcove where the figure resides, resulting in no usable data for many positions. Furthermore, the figure and the overall statue itself are the more interesting subjects and so were chosen as the focus of the acquisition. This is not to say that the wall behind is unimportant, but it does not appear so from the given perspective and for the current application.

5 Results and discussion

5.1 Methods of multi-light integration

The methods tested for the integration of multi-light images are the FVG, the mean gradient response, and the maximum gradient response. The methodology behind the FVG is discussed in Sect. 3. For the other two methods, they take the pixel-wise mean gradient value or the maximum gradient value, respectively, across all multi-light images. Examples of the three methods compared with SFF can be seen in Fig. 6.
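The two baseline integrations reduce the stack of per-light gradient responses to a single map per Z position with a pixel-wise statistic; a one-line NumPy sketch of each is given below (the function names are ours).

```python
import numpy as np

def mean_gradient_response(sml_stack):
    """Pixel-wise mean over the light dimension: (n, H, W) -> (H, W)."""
    return np.mean(sml_stack, axis=0)

def max_gradient_response(sml_stack):
    """Pixel-wise maximum over the light dimension: (n, H, W) -> (H, W)."""
    return np.max(sml_stack, axis=0)
```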

Figures 7 and 8 show plots of the computed RMSE between the ground truth depth maps and those generated with SFF-RTI for the “statue” and “gemma” objects respectively. Along the x-axis of each plot is a set of number pairs. These pairs, as indicated by the axis label, describe first the number of light positions used for a data collection and then the number of depth positions. The number of light positions holds the sorting priority in that all the datasets with a certain number of light positions are grouped together.

5.1.1 Quantitative analysis

Fig. 6: Comparison of computed gradients for the “gemma” object

Fig. 7: RMSE computed between resulting depth maps and ground truth for the “statue” object

When looking at the RMSE in Fig. 7, it can be seen that the RMSE for any FVG dataset is lower than that of its SFF counterpart (the SFF dataset with the same number of Z positions). Interestingly, the SFF-RTI dataset using 5 light positions with 20 Z positions (for a total of 100 images) shows a significant increase in quality over the SFF dataset with 100 images for all integration methods, with FVG showing the greatest improvement of the three.

As for the “gemma” object, the RMSE can be seen in Fig. 8. The RMSE values for this object are much more stable, with smaller decreases in error due to additional light angles. The overall range of the RMSE for the “gemma” object varies between approximately 0.036 and 0.041. This is a drastic decrease in range compared to the RMSE of the “statue”, which varied between approximately 0.05 and 0.1.

Furthermore, there is not always a distinct increase in detail when using SFF-RTI over SFF. The reason for this likely lies in the structural differences between the two objects. The “statue” object has large height differences along its surface, allowing a moving light to illuminate these differences, whereas the “gemma” object has a flatter surface, so more detail can be seen with one or only a few light positions.

Thus, the proposed SFF-RTI acquisition method can provide a clear numerical improvement over SFF when applied to certain objects. While a topographically complex object such as the “statue” benefits from different light angles, the “gemma” object does not. When SFF-RTI does provide more interesting detail, however, the full vector gradient performs admirably, even showing an increase in shape recovery with fewer images than would be used for SFF.

Fig. 8: RMSE computed between resulting depth maps and ground truth for the “gemma” object

5.1.2 Qualitative analysis

Fig. 9: Normal maps computed from SFF-RTI (RTI:20, SFF:20) using the full vector gradient (a), the mean gradient response (b), and the maximum gradient response (c)

Figure 9 shows a comparison of the three integration methods for the previously used ROI focused on the head of the figure of the “statue” object. All three images were computed from a dataset using 20 light positions and 20 Z positions. Some points of improvement can be seen when comparing the results computed using the FVG with those computed using the mean and maximum gradient responses. The main observation is that there appears to be more noise in the mean gradient response image than in the FVG. The maximum gradient response ends up being the noisiest of the three methods. While the areas that exhibit error in the FVG result can also be seen in the others, the affected area is smaller for the FVG.

Fig. 10: Region of interest comparing depth maps of SFF and SFF-RTI using the full vector gradient near the head of the statue

Another point of interest is highlighted in Fig. 10. Red arrows indicate areas where the shape was not recovered well when using SFF or the mean gradient response method for SFF-RTI. This remains true even when increasing the number of images collected for SFF, showing that the issue is not simply a lack of resolution. The FVG, however, performs well in these areas with only a few light positions. Increasing the number of light angles can increase the quality of the recovered shape for both multi-light integration methods, although it does not fix both of the holes discussed in the case of the mean gradient response.

A similar comparison for the “gemma” object can be seen in Fig. 11, focused on the eye of the face in the center of the object. All three images were computed from a dataset using 5 light positions and 20 Z positions. Visually, the only clearly visible difference is in the rendering of the surface texture, with the maximum gradient response returning the roughest result. This is expected, as the other two methods combine multiple images in which the texture is likely to differ slightly due to the change in light direction.

Looking at both the quantitative and qualitative results, it cannot be said which integration method is definitive across all objects. However, the “statue” object continues to show evidence that the shape recovery of a complex object can be improved using the full vector gradient.

Fig. 11: Normal maps computed from SFF-RTI (RTI:10, SFF:10) using the full vector gradient (a), the mean gradient response (b), and the maximum gradient response (c)

5.2 Focus estimation window size

When estimating focus with the sum-modified Laplacian, the local neighborhood surrounding each pixel is analyzed. The size of this neighborhood can be altered to change the number of points sampled when analyzing the high spatial frequency detail. An experiment was conducted to determine the effect of window size on the SFF-RTI results. This change in neighborhood size is made in the spatial dimensions (X and Y), not in the dimension that contains the multi-light imagery (LP) or the one that describes the change in focus distance (Z).
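A sketch of such a sweep, reusing the sum_modified_laplacian and psnr helpers from the earlier sketches and assuming the SFF images and SFF-RTI stacks have already been loaded per Z position, could look as follows; the mean gradient response is used here purely as one example of multi-light integration, and the variable names are ours.

```python
import numpy as np

# (2N+1) x (2N+1) neighbourhoods: 3x3, 5x5, 7x7, 9x9
for N in (1, 2, 3, 4):
    psnr_per_z = []
    for img_sff, stack_rti in zip(sff_images, sffrti_stacks):
        f_sff = sum_modified_laplacian(img_sff, N=N)
        f_rti = np.mean([sum_modified_laplacian(im, N=N) for im in stack_rti], axis=0)
        psnr_per_z.append(psnr(f_sff, f_rti, R=255.0))   # 8-bit data assumed
    print(f"window {2*N+1}x{2*N+1}: average PSNR = {np.mean(psnr_per_z):.2f} dB")
```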

Figure 13 consists of four subplots, each of which shows the average PSNR computed between the SFF-RTI focus estimates and those of a corresponding SFF dataset. The motivation and usage of this statistic are discussed in Sect. 4.1. The PSNR is plotted against the window size of the focus measure operator on the x-axis. The four subplots correspond to different SFF-RTI datasets, chosen to provide context as to how varying the number of light or depth positions affects the results.

5.2.1 Qualitative analysis

Fig. 12: Comparison of normal maps for SFF-RTI using the full vector gradient for various window sizes

When looking at Fig. 12, it can be seen that the window size has a clear effect on the quality of the depth map. The first observation demonstrating this effect is that the level of noise, expressed as holes in the depth map, is decreased as the window size increases. An expected effect of increasing the sampling area of a kernel is a decrease in noise as more points are sampled from a larger window, reducing the effects of statistical outliers.

Another effect of increasing the window size is that features begin to blend into each other and lose some of the depth detail. This can be seen best in the roundels located to either side of the figure’s head as well as the area around the head of the figure itself. The shapes of the roundels spread over a larger area as the window size increases, the one on the right almost losing the point in its center completely. The head of the figure becomes a bit more blurred, with the eye sockets and chin especially blending more into their surrounding features. The eye sockets become less of a defined feature and more of a slight dip in the surface while the chin also loses its consistent definition and outline, becoming a bit more misshapen and smooth.

5.2.2 Quantitative analysis

Fig. 13: PSNR results comparing effect of window sizes for various datasets

When looking at how the window size affects PSNR in Fig. 13, two main observations can be made. The first is that using the FVG consistently results in a higher PSNR than taking the mean of the multi-light images. The second is that as the size of the kernel window increases, the PSNR decreases. The former can be attributed to the fact that the FVG weights the information across the multi-light dimension so that high-frequency detail detected between drastically different illuminations has less of an effect on the map than details seen under similar illumination, which are deemed more reliable. On the other hand, simply averaging all the multi-light images results in a map where details seen under all illuminations, no matter how different, have equal weight, which can de-emphasize desired information.

The second observation can be explained in a similar way. As the window size increases, the number of samples used in the computation of the edge response at a given point increases. This can de-emphasize high spatial frequency information, as more low-frequency information is considered, lowering the magnitude of the former; the high-frequency information becomes a minority in the sampled distribution of points. On top of these observations, the SML also seems to perform the best overall when it comes to PSNR, an observation made clearer with more light positions.

Looking at both the qualitative and quantitative effects of window size, it would appear that a balancing of the quality and noise with the desired level of detail is required. This could also vary based on the geometric complexity of an object as simpler surfaces might return fewer holes in the depth map despite using a smaller window size. It is with this in mind that a kernel size of (5 \(\times \) 5) was chosen for these tests with the current object of study.

6 Conclusion

In this paper, we have proposed a fusion of two different imaging methods, shape from focus and reflectance transformation imaging, with the goal of enhancing the understanding of both the shape of an object and its smaller surface details. Three methods were demonstrated for the integration of differently lit images: the mean gradient response, the maximum gradient response, and the full vector gradient. Results show that using multi-light image collections with shape from focus can provide higher-quality shape recovery of an object with fewer images. However, this only holds true for objects with a significantly complex surface.

Going forward, this methodology can be used to task acquisitions of objects based on different scales of texture. This can be achieved by understanding what depth levels might be more useful when focus stacking images to create a synthetically extended depth of field. Furthermore, this work will be used to increase the quality of the estimation of surface micro-geometry so that a psychometric study can be conducted regarding the effect that changes in micro-geometry have on the overall surface appearance.