1 Introduction

Reconstructing 3D ranges from a set of snapshots taken from different perspectives is a classical computer vision problem of long-standing interest. Its applications range from 3D mapping and navigation to 3D printing, computational photography, video games, and heritage archival. Only recently have these techniques matured enough to provide industrial-scale robustness, accuracy and scalability. The goal of an image-based 3D reconstruction algorithm can be described as estimating the most likely 3D model from a given set of images under suitable assumptions about materials, viewpoints and lighting conditions. A general MVS pipeline includes:

  • Image collection.

  • Calibration of the camera settings of each image.

  • Correspondence detection among images.

  • Reconstruction of the 3D ranges according to the geometric correspondences.

  • Optionally, reconstruction of the materials of the scene.

Most MVS algorithms focus on fusing, merging and refining 2D views to achieve a dense and accurate estimation of the 3D ranges. Meanwhile, DoG-based feature matching algorithms, which are less capable of detecting and matching features under affine distortions, make 3D range reconstruction inefficient unless a large number of snapshots from all possible perspectives is provided.

Accurate and dense correspondence matching plays an important role in MVS: accurate correspondence matching is the protagonist in camera matrix calibration, making image patches from different coordinate frames coherent; dense correspondence matching establishes a dense depth cloud, which makes the 3D range surface more accurate and easier to smooth and merge. The matches used to construct the 3D ranges are determined by image registration. Registering 3D ranges to 2D images depends largely on an accurate 2D camera calibration with respect to the acquired geometric 3D model, and an accurate camera position estimate largely determines the quality of the 3D range construction. Thus the registration issue can be simplified to how to obtain the camera matrix within the framework of the projection model [13]. A large part of the recent success of MVS is due to the success of the underlying Structure from Motion algorithms that compute the camera parameters.

Camera calibration is fundamental to MVS registration. It refers to a set of values describing a camera configuration, that is, camera pose information consisting of location and orientation, and camera intrinsic properties such as focal length and pixel sensor size. There are many different ways or models to parametrize this camera configuration. There exist many cues that can be used to calculate the camera parameters from images, including stereo correspondence, pre-configured devices, snapshot calibration, etc. The method an algorithm applies can largely restrict the range of its application: a controlled MVS capture uses diffuse lights and a turntable to collect the images, outdoor capture collects series of images around small-scale scenes, and crowd-sourcing draws on online photo-sharing websites. Generally speaking, algorithms that are capable of handling arbitrarily taken snapshots are more desirable. In this situation, a robust, dense and accurate correspondence detection and matching scheme becomes quite necessary for MVS.

Different from the traditional DoG-based feature matching algorithms, we propose a novel MVS method utilizing our previously published AIFD to detect and match correspondences from image to image and introducing a homography model to define the smooth planes of the 3D objects. AIFD is a feature detection and description method that provides improved resilience to affine and scale changes. It borrows some ideas from SIFT [14], such as the scale space and pyramid structure, but it is more capable of dealing with image content captured from different viewpoints, which suits the special requirements of MVS.

Scale-invariant feature detectors, like SIFT, SURF [6], ALP [4], etc., have achieved success in many applications, including content-based visual retrieval, robotic navigation and image registration. However, their sensitivity to viewpoint changes has for a long time greatly restricted their application to a wider range of tasks, such as 3D registration. Borrowing the basic principles of SIFT, we previously proposed the Affine Invariant Feature Detector (AIFD), which has better resilience to affine transformations. Equipped with this more advanced affine-invariant feature detector, we can now seek the connections between images for 3D range reconstruction by detecting matched features.

Correctly detected correspondences between two images constitute a stereo system, which can provide depth information. The set of points defined by this depth information outlines the structure of the 3D ranges. In practice, most scenes or portions of the object are covered by more than two images, which helps to calibrate denser and more accurate spatial information. The origins of multi-view stereo can be traced back to human stereopsis and the first attempts to solve the stereoscopic matching problem as a computational problem [4]. To this day, two-view stereo remains a very active and fruitful research area. The multi-view version of stereo originated as a natural improvement of the two-view case. Instead of capturing two images from different perspectives, multi-view stereo captures more viewpoints in between to increase robustness, e.g. to image noise, surface texture and viewpoint. What started as a way to improve two-view stereo has nowadays evolved into a different type of problem.

Only when equipped with a sufficient number of correspondences across different images can we approach an accurate camera matrix estimation by DLT. Knowledge of the registration from 3D ranges to 2D images improves the mapping of 3D textures and, in turn, can be treated as a homography reference to be applied to other 2D images [3]. With this iterative registration-mapping method, the 3D ranges can be more precisely registered to an uncalibrated, arbitrarily taken snapshot [1].

Based on our proposed pipeline, a progressive mapping-and-registration method from 3D models to 2D images can be formed. The experiments in the section below show that our proposed registration method outperforms the traditional edge/corner-based method [2]. Through our registration proposal, the stereo-mapped 3D model can be introduced into more applications thanks to its efficiency and simplicity [5].

2 Stereo Visual Based 3D Range Methods

A 3D range model refers to a collection of points representing the distances in a scene from a specified viewpoint, which is normally associated with some type of sensor, such as a laser device [7]. In a well-formed range model, each pixel value reflects the corresponding distance to a certain view plane [8]. If the sensor used to produce the range model is properly calibrated, the pixel values directly give the distance in physical units, such as meters [9].

The sensor device used to produce the range model is sometimes referred to as a range camera. Range cameras can operate according to a number of different techniques [10], including stereo triangulation, sheet-of-light triangulation, time-of-flight, structured light, interferometry, and coded aperture. Sheet-of-light triangulation scans the scene with a sheet of light, which creates a reflected line as seen from the light source. From any point out of the plane of the sheet the line will typically appear as a curve, the exact shape of which depends both on the distance between the observer and the light source, and on the distance between the light source and the reflected points. By observing the reflected sheet of light with a camera (often a high-resolution camera) and knowing the positions and orientations of both camera and light source, it is possible to determine the distances between the reflected points and the light source or camera. By illuminating the scene with a specially designed light pattern, structured light [11], depth can be determined using only a single image of the reflected light. The structured light can take the form of horizontal and vertical lines, points, or checkerboard patterns. Depth can also be measured using the standard time-of-flight (ToF) technique, more or less like a radar, in that a range image similar to a radar image is produced, except that a light pulse is used instead of an RF pulse. In interferometry, by illuminating points with coherent light and measuring the phase shift of the reflected light relative to the light source, it is possible to determine depth; under the assumption that the true range image is a more or less continuous function of the image coordinates, the correct depth can be obtained using a technique called phase unwrapping. Finally, depth information may be partially or wholly inferred alongside intensity through deconvolution of an image captured with a specially designed coded aperture pattern, a specific complex arrangement of holes through which the incoming light is either allowed through or blocked.

Among all of these techniques, stereo triangulation is the most popular and widely applied technique for 3D range detection, where the depth data are determined from the data acquired by a stereo or multiple-camera system. In this way, the depth of a certain point in the scene can be determined, for example, from the center point of the line between the camera focal points. In order to solve the depth measurement problem with a stereo camera system, it is necessary to detect corresponding points in the different images. Correctly specifying the correspondences between the different images is one of the main tasks when applying this type of technique. For instance, it is difficult to detect correspondences for image points that lie inside regions of homogeneous intensity or color. As a consequence, stereo-triangulation-based 3D ranging can produce reliable depth estimates only for a subset of all the points visible from the multiple camera views.

The advantage of this technique is that the measurement is passive and does not place special requirements on the scene illumination. The other techniques mentioned do not have to tackle the correspondence problem, but they depend on particular scene illuminations.

Fig. 1 Relationship of image displacement to depth with stereoscopic images, assuming flat co-planar images

In a stereo image pair, a pixel value does not store the grey-scale level at a certain position but identifies depth information. Stereo cameras capture two images of the scene with a small difference between them. In Fig. 1, the light from the point A on the right is transmitted through the entry points of the pinhole cameras at B and D, onto the image screens at E and H. In the diagram the distance between the centers of the two camera lenses is \(BD = BC + CD\). The triangles ACB and BFE are similar, as are ACD and DGH.

$$\begin{aligned} \begin{aligned} \text {Therefore displacement } d&=EF+GH\\&={\frac{(BC+CD)\cdot BF}{AC}}\\&={\frac{BD\cdot BF}{AC}}\\&={\frac{k}{z}}{\text {, where } k=BD\cdot BF,\ z=AC}\\ \end{aligned} \end{aligned}$$
(1)

Assuming the cameras are at the same level and the image planes are flat and lie in the same plane, the displacement along the y axis between the same pixel in the two images is \(d={\frac{k}{z}}\), where k is the distance between the two cameras times the distance from the lens to the image. The depth components in the two images, \(z_{1}\) and \(z_{2}\), are given by,

$$\begin{aligned} \begin{aligned} z_{2}(x,y)&=\min \left\{ v:v=z_{1}\left( x,y-{\frac{k}{z_{1}(x,y)}}\right) \right\} \\ z_{1}(x,y)&=\min \left\{ v:v=z_{2}\left( x,y+{\frac{k}{z_{2}(x,y)}}\right) \right\} \end{aligned} \end{aligned}$$
(2)

Because SSD involves squaring the point differences, many implementations use the Sum of Absolute Differences (SAD) as the basis for computing the matching measure; other methods use normalized cross-correlation (NCC). A least-squares measure can also be used to evaluate the information content of the stereoscopic images, given the depth at each point z(x, y). Image rectification is required to adjust the images as if they were co-planar, which may be achieved by a linear transformation. The images also need to be rectified so that each image is equivalent to one taken from a pinhole camera projected onto a flat plane.
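
As an illustrative sketch of the block-matching search described above (not part of the method proposed in this paper), the following Python fragment computes a disparity map with a naive SAD scan over rectified grayscale images and converts it to depth through the relation \(d = k/z\); the window size, disparity range and the value of k are arbitrary assumptions.

```python
import numpy as np

def sad_disparity(left, right, max_disp=64, win=5):
    """Naive SAD block matching for rectified grayscale images.

    For each pixel in the left image, slide a window along the same row of
    the right image and keep the displacement with the smallest sum of
    absolute differences (SAD).  SSD or NCC could be substituted here.
    """
    h, w = left.shape
    half = win // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            for d in range(max_disp):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1]
                cost = np.abs(patch.astype(np.float32) - cand).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

def disparity_to_depth(disp, k=1000.0):
    """Depth from the relation d = k/z, i.e. z = k/d.

    k stands for the baseline times the focal length in pixels; the default
    value here is purely illustrative."""
    z = np.full_like(disp, np.inf)
    np.divide(k, disp, out=z, where=disp > 0)
    return z
```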

The normal distribution is,

$$\begin{aligned} P(x,\mu ,\sigma )={\frac{1}{\sigma {\sqrt{2\pi }}}}e^{{-{\frac{(x-\mu )^{2}}{2\sigma ^{2}}}}} \end{aligned}$$
(3)

Probability is related to information content described by message length L,

$$\begin{aligned} \begin{aligned} P(x)&=2^{{-L(x)}} \\ L(x)&=-\log _{2}{P(x)} \end{aligned} \end{aligned}$$
(4)

So,

$$\begin{aligned} L(x,\mu ,\sigma )=\log _{2}(\sigma {\sqrt{2\pi }})+{\frac{(x-\mu )^{2}}{2\sigma ^{2}}}\log _{2}e \end{aligned}$$
(5)

The least-squares measure may be used to measure the information content of the stereoscopic images, given the depths at each point z(x, y). First, the information needed to express one image in terms of the other is derived. This is called \(I_m\).

A color difference function can be used to fairly measure the difference between colors. The color difference function is written cd in the following. The measure of the information needed to record the color matching between the two images is,

$$\begin{aligned} I_{m}(z_{1},z_{2})={\frac{1}{\sigma _{m}^{2}}}\sum _{{x,y}}{\text {cd}}\left( {\text {color}}_{1}\left( x,y+{\frac{k}{z_{1}(x,y)}}\right) ,{\text {color}}_{2}(x,y)\right) ^{2} \end{aligned}$$
(6)

An assumption is made about the smoothness of the image: two pixels are more likely to be of the same color the closer the voxels they represent are. This measure is intended to favor colors that are similar being grouped at the same depth. For example, if an object in front occludes an area of sky behind, the measure of smoothness favors the blue pixels all being grouped together at the same depth. The total measure of smoothness uses the distance between voxels as an estimate of the expected standard deviation of the color difference,

$$\begin{aligned} \begin{aligned} I_{s}(z_{1},z_{2})&={\frac{1}{2\sigma _{h}^{2}}}\sum _{{i:\{1,2\}}}\sum _{{x_{1},y_{1}}}\sum _{{x_{2},y_{2}}} \\&{\frac{{\text {cd}}({\text {color}}_{i}(x_{1},y_{1}),{\text {color}}_{i}(x_{2},y_{2}))^{2}}{(x_{1}-x_{2})^{2}+(y_{1}-y_{2})^{2}+(z_{i}(x_{1},y_{1})-z_{i}(x_{2},y_{2}))^{2}}} \end{aligned} \end{aligned}$$
(7)

The total information content is then summed as,

$$\begin{aligned} I_{t}(z_{1},z_{2})=I_{m}(z_{1},z_{2})+I_{s}(z_{1},z_{2}) \end{aligned}$$
(8)

The z component of each pixel must be chosen to provide the minimum value for the information content. This will give the most likely depths at each pixel. The minimum total information measure is,

$$\begin{aligned} I_{{{\text {min}}}}=\min {\{i:i=I_{t}(z_{1},z_{2})\}} \end{aligned}$$
(9)

The depth functions for the left and right images are the pair,

$$\begin{aligned} (z_{1},z_{2})\in \{(z_{1},z_{2}):I_{t}(z_{1},z_{2})=I_{{{\text {min}}}}\} \end{aligned}$$
(10)
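
For concreteness, the information terms of Eqs. (6)-(8) can be evaluated for a candidate pair of depth maps as sketched below; the Euclidean color difference, the \(\sigma \) values and the Monte-Carlo sub-sampling of pixel pairs in the smoothness term are illustrative assumptions, and an actual solver would search over \((z_1, z_2)\) to minimise the total, as in Eqs. (9)-(10).

```python
import numpy as np

def cd(c1, c2):
    # Simple Euclidean color difference; any fair metric could be used.
    return np.linalg.norm(np.asarray(c1, float) - np.asarray(c2, float))

def matching_info(img1, img2, z1, k, sigma_m=10.0):
    """I_m of Eq. (6): cost of explaining img2 by img1 shifted by k/z1."""
    h, w = z1.shape
    total = 0.0
    for x in range(h):
        for y in range(w):
            y_shift = int(round(y + k / z1[x, y]))
            if 0 <= y_shift < w:
                total += cd(img1[x, y_shift], img2[x, y]) ** 2
    return total / sigma_m ** 2

def smoothness_info(img, z, sigma_h=10.0, pairs=2000, rng=None):
    """Monte-Carlo estimate of the smoothness term of Eq. (7) for one view."""
    rng = rng or np.random.default_rng(0)
    h, w = z.shape
    total = 0.0
    for _ in range(pairs):
        x1, x2 = rng.integers(0, h, 2)
        y1, y2 = rng.integers(0, w, 2)
        d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z[x1, y1] - z[x2, y2]) ** 2
        if d2 > 0:
            total += cd(img[x1, y1], img[x2, y2]) ** 2 / d2
    return total / (2 * sigma_h ** 2)

def total_info(img1, img2, z1, z2, k):
    # Eq. (8): candidate depth maps minimising this sum are kept (Eqs. 9-10).
    return (matching_info(img1, img2, z1, k)
            + smoothness_info(img1, z1) + smoothness_info(img2, z2))
```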

3 Orthogonal Projection

Orthographic projection (sometimes orthogonal projection) is a means of representing three-dimensional objects in two dimensions. It is a form of parallel projection in which all the projection lines are orthogonal to the projection plane [1], resulting in every plane of the scene appearing in affine transformation on the viewing surface. The obverse of an orthographic projection is an oblique projection, which is a parallel projection in which the projection lines are not orthogonal to the projection plane.

The term orthographic is sometimes reserved specifically for depictions of objects where the principal axes or planes of the object are also parallel with the projection plane [1], but these are better known as multi-view projections. When the principal planes or axes of an object are not parallel with the projection plane, but are rather tilted to reveal multiple sides of the object, the projection is called an axonometric projection. Sub-types of multi-view projection include plans, elevations and sections. Sub-types of axonometric projection include isometric, dimetric and trimetric projections.

A simple orthographic projection onto the plane z = 0 can be defined by the following matrix:

$$\begin{aligned} P = \begin{bmatrix} 1&\quad 0&\quad 0 \\ 0&\quad 1&\quad 0 \\ 0&\quad 0&\quad 0 \\ \end{bmatrix} \end{aligned}$$
(11)

For each point \(v = (v_x, v_y, v_z)\), the transformed point would be

$$\begin{aligned} P_{v} = \begin{bmatrix} 1&\quad 0&\quad 0 \\ 0&\quad 1&\quad 0 \\ 0&\quad 0&\quad 0 \\ \end{bmatrix}\,\, \begin{bmatrix} v_x \\ v_y \\ v_z \end{bmatrix} = \begin{bmatrix} v_x \\ v_y \\ 0 \end{bmatrix} \end{aligned}$$
(12)

Often, it is more useful to use homogeneous coordinates. The transformation above can be represented for homogeneous coordinates as

$$\begin{aligned} P = \begin{bmatrix} 1&\quad 0&\quad 0&\quad 0 \\ 0&\quad 1&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 1 \end{bmatrix} \end{aligned}$$
(13)

For each homogeneous vector \(v = (v_x, v_y, v_z, 1)\), the transformed vector would be

$$\begin{aligned} Pv = \begin{bmatrix} 1&\quad 0&\quad 0&\quad 0 \\ 0&\quad 1&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 1 \end{bmatrix}\,\, \begin{bmatrix} v_x \\ v_y \\ v_z \\ 1 \end{bmatrix} = \begin{bmatrix} v_x \\ v_y \\ 0 \\ 1 \end{bmatrix} \end{aligned}$$
(14)
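
A minimal NumPy check of the homogeneous orthographic projection of Eqs. (13)-(14); the sample point is arbitrary.

```python
import numpy as np

# Orthographic projection onto the plane z = 0 in homogeneous coordinates.
P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 1]], dtype=float)

v = np.array([2.0, 3.0, 5.0, 1.0])   # homogeneous point (v_x, v_y, v_z, 1)
print(P @ v)                          # -> [2. 3. 0. 1.]: the z component is dropped
```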

4 AIFD Based Feature Matching

Our previously published feature descriptor AIFD achieves its resilience to affine and scale changes by reshaping the multi-scale image representation and local extrema detection so as to maintain a linear relationship under affine transformations. Instead of relying on image simulations, AIFD achieves its affine and scale invariance entirely through its internal mechanisms when dealing with transformed visual content. It is therefore more concise, more reliable, more feasible for a wider range of applications, and has more potential for future research.

It is based on our previously proposed affine scale space. For a given image I(x, y), its scale space is given by a family of pre-smoothed images \(L(x,y,\sigma )\), where the scale parameter is pre-defined according to the kernel size:

$$\begin{aligned} \begin{aligned} g(x,y;\sigma )&=\frac{1}{2\pi \sigma ^{2}}e^{-\frac{x^{2}+y^{2}}{2\sigma ^{2}}} \\ \text {such as,} \quad L(x,y;\sigma )&=g(x,y;\sigma )*I(x,y) \end{aligned} \end{aligned}$$
(15)
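
A minimal sketch of Eq. (15) using SciPy's Gaussian filtering; the particular \(\sigma \) values are illustrative and not those prescribed by the pyramid structure described later.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_scale_space(image, sigmas=(1.6, 2.0, 2.5, 3.2)):
    """Pre-smoothed images L(x, y; sigma) = g(x, y; sigma) * I(x, y), Eq. (15)."""
    image = image.astype(np.float32)
    return [gaussian_filter(image, sigma) for sigma in sigmas]
```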

Based on this definition, a structure more adaptable to the affine transformation is defined below:

$$\begin{aligned} \begin{aligned} g(\eta ,\varSigma _{s})_{A}&=\tfrac{1}{2 \pi \sqrt{det \varSigma _{s}}} e^{- \tfrac{\eta ^{T}\varSigma _{s}^{-1}\eta }{2}}. \\ \end{aligned} \end{aligned}$$
(16)

where, \(\varSigma _{s}=A\sigma ^{2}A^{T}\).
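The deformed kernel of Eq. (16) can be sampled directly, as sketched below; the kernel radius and the numerical normalisation are illustrative choices.

```python
import numpy as np

def affine_gaussian_kernel(A, sigma, radius=10):
    """Anisotropic Gaussian kernel of Eq. (16) with Sigma_s = A sigma^2 A^T.

    A is the 2x2 affine matrix; radius controls the spatial support and is an
    illustrative choice, not prescribed by the paper."""
    A = np.asarray(A, dtype=float)
    sigma_s = sigma ** 2 * (A @ A.T)
    inv = np.linalg.inv(sigma_s)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(sigma_s)))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    eta = np.stack([xs, ys], axis=-1)                      # (..., 2)
    quad = np.einsum('...i,ij,...j->...', eta, inv, eta)   # eta^T Sigma_s^-1 eta
    kernel = norm * np.exp(-0.5 * quad)
    return kernel / kernel.sum()                           # normalise numerically
```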

Fig. 2 The pipeline of AIFD. The parameters in the dashed box are pre-set and determined only by the pyramid structure

In this formula, A represents the affine transformation, a \(2 \times 2\) matrix, and \(\sigma \) is the scale. This deformed Gaussian kernel is specialized to generate an affine scale space which maintains a linear relationship regardless of viewpoint changes. Based on this structure, images from any viewpoint can be well represented at multiple scales. From the definition of the affine scale space, the conventional isotropic scale space can be deemed a special case whose affine transformation equals the \(2 \times 2\) identity matrix (Fig. 2).

The similarity of two pieces of visual content largely depends on the matched features detected from the scale space. For the conventional scale space, several approaches to detect local maxima or minima from derivatives have been proposed [12], and local LoG extrema detection outperforms all others in terms of the accuracy and efficiency of a method in practice [17].

The Laplacian of Gaussian (LoG) scale space can be mathematically expressed as:

$$\begin{aligned} \nabla ^{2}L=L_{xx}+L_{yy} \end{aligned}$$
(17)

In this formula, L represents the scale space. The local maximum or minimum over the Laplacian can then be selected as the feature candidates.
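
A minimal sketch of single-scale LoG extrema detection corresponding to Eq. (17); the threshold and the 3 × 3 neighbourhood are illustrative, and the method described here additionally compares candidates against neighbouring scales.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, maximum_filter, minimum_filter

def log_extrema(image, sigma=2.0, threshold=0.02):
    """Candidate keypoints as local extrema of the LoG response (Eq. 17)."""
    response = gaussian_laplace(image.astype(np.float32), sigma)
    is_max = (response == maximum_filter(response, size=3)) & (response > threshold)
    is_min = (response == minimum_filter(response, size=3)) & (response < -threshold)
    ys, xs = np.nonzero(is_max | is_min)
    return list(zip(xs, ys))
```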

Borrowing the idea of the LoG, we have also proposed an affine LoG, with the purpose of promoting feature candidate detection over the affine scale space. Instead of a direct Laplacian operation, we have proposed a feasible implementation based on our proposed pyramid structure to efficiently generate the affine LoG. With this implementation, the affine Gaussian and LoG scale spaces can be generated simultaneously. More information about the affine scale space and affine LoG can be found in [18].

Once the affine scale space is obtained, we can detect the local extrema by comparing the Harris and Hessian matrices at each point. If we suppose,

$$\begin{aligned} A=\begin{bmatrix} \frac{\partial ^{2}f}{\partial x^{2}}&\frac{\partial ^{2}f}{\partial x \partial y} \\ \frac{\partial ^{2}f}{\partial x \partial y}&\frac{\partial ^{2}f}{\partial y^{2}} \end{bmatrix} \qquad B=\begin{bmatrix}\left( \frac{\partial f}{\partial x}\right) ^{2}&\frac{\partial f}{\partial x}\frac{\partial f}{\partial y}\\ \frac{\partial f}{\partial x}\frac{\partial f}{\partial y}&\left( \frac{\partial f}{\partial y}\right) ^{2}\end{bmatrix} \end{aligned}$$
(18)

Denoting the eigenvalues of A as \(\psi _{1}\) and \(\psi _{2}\) and the eigenvalues of B as \(\nu _{1}\) and \(\nu _{2}\), we can then detect the features by analysing,

$$\begin{aligned} \frac{1}{4}\min \{\psi _{1},\psi _{2}\}^{2}>\max \{\nu _{1},\nu _{2}\} \end{aligned}$$
(19)
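
A minimal sketch of the eigenvalue test of Eqs. (18)-(19) for a single candidate point, assuming the two \(2 \times 2\) matrices have already been assembled from the derivative responses.

```python
import numpy as np

def passes_eigen_test(hessian, harris):
    """Eigenvalue test of Eq. (19) on a single candidate point.

    hessian: 2x2 matrix A of second derivatives (Eq. 18, left);
    harris:  2x2 matrix B of products of first derivatives (Eq. 18, right)."""
    psi = np.linalg.eigvalsh(hessian)   # eigenvalues of A
    nu = np.linalg.eigvalsh(harris)     # eigenvalues of B
    return 0.25 * min(psi) ** 2 > max(nu)
```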

To find the feature candidates in this way, we apply the LoG first- and second-derivative filters to form the Harris and Hessian matrices.

$$\begin{aligned} \begin{aligned} l_{1}=&A^{-1}\eta \frac{1}{\pi \sigma ^{6}}e^{-\frac{\eta ^{T}(AA^{T})^{-1}\eta }{2\sigma ^{2}}}\left( 2-\frac{\eta ^{T}(AA^{T})^{-1}\eta }{2\sigma ^{2}}\right) \\ l_{axx}=&\frac{1}{\pi \sigma ^{6}}\left[ \left( \frac{\eta ^{T}(AA^{T})\eta }{2\sigma ^{2}}-3\right) (\frac{M_{a}(1,1)}{\sigma ^{2}}-1)-1\right] \\ l_{ayy}=&\frac{1}{\pi \sigma ^{6}}\left[ \left( \frac{\eta ^{T}(AA^{T})\eta }{2\sigma ^{2}}-3\right) (\frac{M_{a}(2,2)}{\sigma ^{2}}-1)-1\right] \\ l_{axy}=&\frac{1}{\pi \sigma ^{6}}\left( \frac{\eta ^{T}(AA^{T})\eta }{2\sigma ^{2}}-3\right) \frac{M_{a}(1,2)}{\sigma ^{2}} \end{aligned} \end{aligned}$$
(20)

The same structure as depicted in Fig. 3 is applied to generate the LoG derivative scale space used to detect the local extrema.

Borrowing the idea of CDVS [16], we have proposed a polynomial expression to represent the affine scale space (and also the affine LoG and affine LoG derivatives) in a continuous-scale form. Suppose the continuous scale space can be approximated by a cubic polynomial, which can be written as:

$$\begin{aligned} L(\sigma )=a\cdot \sigma ^3+b\cdot \sigma ^2 + c\cdot \sigma +d \end{aligned}$$
(21)

where the parameters a, b, c, d are images of the same size as the scale-space images, \(\sigma \) represents a scale value within the octave range, and \(L(\sigma )\) represents the image at the specified scale; it can be an image of the Gaussian scale space, the LoG scale space or the LoG derivatives at any scale. This expression can be deployed as the continuous form representing the scale space of the Gaussian, LoG or LoG derivatives. Apparently, \(\sigma \) can be any value within the octave range, and \(L(\sigma )\) represents the corresponding scale-space image. According to the expression above, the image at any specified scale can be calculated given the parameters a, b, c, d, which are related to the input image. Thanks to the pyramid structure, four scale-space images within each octave can easily be generated.

The equations for the four parameters can be written as:

$$\begin{aligned} \begin{bmatrix}a(x,y) \\ b(x,y) \\ c(x,y) \\ d(x,y) \end{bmatrix} = M\, \begin{bmatrix}L(\sigma _{1}) \\ L(\sigma _{2}) \\ L(\sigma _{3}) \\ L(\sigma _{4}) \end{bmatrix}, \end{aligned}$$
(22)

where M is a \(4 \times 4\) matrix and a, b, c, d are linear combinations of \(L(\sigma _{1}), L(\sigma _{2}), L(\sigma _{3}),\) \(L(\sigma _{4})\). The matrix M is determined only by the pyramid structure. Thanks to the linear property of the polynomial expression, M is not affected by a change of viewpoint, which can be demonstrated by supposing the affine transformation is A:

$$\begin{aligned} L(A,\sigma ) =\begin{bmatrix}\sigma ^{3}&\sigma ^{2}&\sigma&1 \end{bmatrix} M\, \begin{bmatrix}L(A,\sigma _{1}) \\ L(A, \sigma _{2}) \\ L(A,\sigma _{3}) \\ L(A,\sigma _{4}) \end{bmatrix} \end{aligned}$$
(23)
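
A minimal sketch of the per-pixel cubic fit of Eqs. (21)-(22); here M is taken as the inverse Vandermonde matrix of the four octave scales, which is one concrete way to satisfy Eq. (22) and is fixed by the pyramid layout.

```python
import numpy as np

def scale_poly_coeffs(images, sigmas):
    """Per-pixel cubic coefficients a, b, c, d of Eq. (21) from four
    scale-space images (Eq. 22).  M is the inverse Vandermonde matrix of the
    four sigma values, so it depends only on the pyramid layout."""
    V = np.array([[s ** 3, s ** 2, s, 1.0] for s in sigmas])   # 4x4
    M = np.linalg.inv(V)
    stack = np.stack(images, axis=0)                # (4, H, W)
    return np.tensordot(M, stack, axes=([1], [0]))  # (4, H, W): a, b, c, d

def eval_scale(coeffs, sigma):
    """Image at an arbitrary scale inside the octave, via Eq. (21)."""
    a, b, c, d = coeffs
    return a * sigma ** 3 + b * sigma ** 2 + c * sigma + d
```
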
Fig. 3 Proposed pyramid structure to speed up the generation of the Gaussian, LoG and LoG derivative scale spaces

Candidates whose Harris-Hessian matrix difference is smaller than 0.001 will also be rejected, to guarantee that the extremum is larger, up to a certain level, than the surrounding points, as measured by \(R=Tr(H)^2/Det(H)\), which equals \((\gamma +1)^2/\gamma \).

Since \(L(\varvec{x+\Delta x})\) is the local extremum, its derivative equals \(\varvec{0}\). By taking the derivative on both sides of the expansion, we have:

$$\begin{aligned} \Delta \varvec{x}=-\{D^{2}L(\varvec{x})\}^{-1}DL(\varvec{x}) \end{aligned}$$
(24)

If the offset is larger than 0.5 in any dimension, it implies that the real extremum is closer to a different pixel sample. Since the local extremum scale of each pixel sample can be quite diverse, a feature located only at the specified integer position is no longer adequate.

The offset \(\hat{x}\) is added to the detected integer position to locate the local extremum to sub-pixel precision, according to Eq. 24. The Hessian matrix and local gradient of the pixel sample can be obtained from the corresponding LoG derivative polynomial expressions.
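
A minimal sketch of the sub-pixel refinement of Eq. (24), assuming the gradient and Hessian of the response have already been evaluated from the LoG derivative polynomial expressions.

```python
import numpy as np

def subpixel_offset(gradient, hessian):
    """Newton step of Eq. (24): offset of the true extremum from the sample.

    gradient is DL(x) and hessian is D^2 L(x).  Offsets larger than 0.5 in any
    dimension indicate that a neighbouring sample is closer to the extremum
    and should be re-examined."""
    offset = -np.linalg.solve(hessian, gradient)
    return offset, bool(np.all(np.abs(offset) <= 0.5))
```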

To construct an affine gradient-based feature descriptor, we introduce the affine gradient filter and the related scale space in this section. To begin with, the definition of the image gradient is given by:

$$\begin{aligned} \nabla I=\left( \frac{\partial I}{\partial x} , \frac{\partial I}{\partial y} \right) \end{aligned}$$
(25)

It is equivalent to the first-order image derivatives. The traditional method to calculate the image gradient is to take the difference between neighbouring pixels in the form:

$$\begin{aligned} \nabla I=\frac{1}{2}\left( I(x-1,y)-I(x+1,y), I(x,y-1)-I(x,y+1)\right) \end{aligned}$$
(26)

For every point of image gradient, its direction and magnitude can be given by:

$$\begin{aligned} \begin{aligned} m(x,y)=&\sqrt{\left( \frac{\partial I}{\partial x}\right) ^2+ \left( \frac{\partial I}{\partial y}\right) ^2}, \\ \theta =&\arctan \left( \frac{\frac{\partial I}{\partial x}}{\frac{\partial I}{\partial y}}\right) . \end{aligned} \end{aligned}$$
(27)

In the scale space, the gradient is taken on the smoothed image \(L(x,y;\sigma )=g(x,y;\sigma )*I(x,y)\), where I is the original image, \(*\) is the convolution operation and g is the Gaussian filter. Thus, the derivative of the Gaussian-blurred image of the corresponding scale can be equivalently obtained by filtering the image with the Gaussian derivative filters, which can be derived as:

$$\begin{aligned} \begin{bmatrix} g_{x}\\ g_{y} \end{bmatrix}=\begin{bmatrix}\frac{\partial g}{\partial x}\\ \frac{\partial g}{\partial y} \end{bmatrix}=\begin{bmatrix} \frac{x}{2\pi \sigma ^{4}}e^{-\frac{x^{2}+y^{2}}{2\sigma ^{2}}} \\ \frac{y}{2\pi \sigma ^{4}}e^{-\frac{x^{2}+y^{2}}{2\sigma ^{2}}}\end{bmatrix}=\frac{1}{2\pi \sigma ^{2}}e^{-\frac{x^{2}+y^{2}}{2\sigma ^{2}}}\begin{bmatrix}\frac{x}{\sigma ^{2}} \\ \frac{y}{\sigma ^{2}}\end{bmatrix} \end{aligned}$$
(28)

Thus, the Gaussian derivatives scale space can easily be obtained by multiplying the corresponding Gaussian scale space with \(x/\sigma ^2\) and \(y/\sigma ^2\).
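
A minimal sketch of the Gaussian derivative filtering of Eq. (28); the filters are sampled explicitly and applied by convolution, and the kernel radius is an illustrative choice.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_derivative_filters(sigma, radius=None):
    """Gaussian first-derivative filters g_x, g_y of Eq. (28), obtained by
    multiplying the Gaussian kernel with x/sigma^2 and y/sigma^2."""
    radius = radius or int(np.ceil(3 * sigma))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    g = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return (xs / sigma ** 2) * g, (ys / sigma ** 2) * g

def gradient_scale_space(image, sigma):
    """Scale-space gradient obtained by filtering the image with g_x and g_y."""
    gx, gy = gaussian_derivative_filters(sigma)
    image = image.astype(np.float32)
    return convolve(image, gx), convolve(image, gy)
```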

Borrowing the idea of affine LoG, we can have the affine Gaussian derivative filters as:

$$\begin{aligned} \begin{aligned} \begin{bmatrix} g^A_{x} \\ g^A_{y}\end{bmatrix} =g^{A}_{\eta }=&\frac{A^{-1}\eta }{2\pi \sigma ^{4}}e^{-\frac{\eta ^{T}(AA^{T})^{-1}\eta }{2\sigma ^{2}}} \end{aligned} \end{aligned}$$
(29)

where A is a \(2\times 2\) matrix, indicating the affine transformation, \(\eta =A\begin{bmatrix}x \\ y \end{bmatrix}\). Like the isotropic Gaussian derivative scale space, the affine Gaussian derivative scale space can also be generated by multiplying the corresponding Gaussian scale space with \(A^{-1}\eta /\sigma ^2\).

Gradients from affine-transformed images are related by the affine matrices between the different viewpoints. Around each feature, the gradients, relocated according to the affine transformation, can then form a histogram serving as the feature descriptor (Figs. 4, 5).

The gradient relocation can be done in the form of:

$$\begin{aligned} \begin{bmatrix} x^{\prime } \\ y^{\prime }\end{bmatrix}=A\begin{bmatrix} x \\ y \end{bmatrix} \end{aligned}$$
(30)

where x, y are the indices of the gradient samples around the detected feature, within a patch whose size is pre-defined by the feature scale, and \(x^{\prime }\) and \(y^{\prime }\) are the indices accounting for the affine transformation. The relocated gradients can then be collected according to the newly calculated indices. Interpolation of the gradients may also be applied if the newly calculated indices are not integers.
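
A minimal sketch of the gradient relocation of Eq. (30) on a square patch; nearest-neighbour rounding is used here in place of the interpolation mentioned above, and the patch size is assumed to have been fixed from the feature scale.

```python
import numpy as np

def relocate_gradients(gx, gy, A):
    """Re-index the gradient patches around a feature according to Eq. (30).

    gx, gy: square gradient patches centred on the feature.
    A: 2x2 affine matrix between viewpoints."""
    size = gx.shape[0]
    half = size // 2
    out_x = np.zeros_like(gx)
    out_y = np.zeros_like(gy)
    A = np.asarray(A, dtype=float)
    for r in range(size):
        for c in range(size):
            x, y = c - half, r - half                    # patch-centred coords
            xp, yp = A @ np.array([x, y], dtype=float)   # new index (x', y')
            xi, yi = int(round(xp)) + half, int(round(yp)) + half
            if 0 <= xi < size and 0 <= yi < size:
                out_x[yi, xi] = gx[r, c]
                out_y[yi, xi] = gy[r, c]
    return out_x, out_y
```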

Fig. 4 Gradient relocated from the specified area around the detected features. The index of the gradient can be calculated according to the affine transformations

Fig. 5 The gradient around the feature will be added to the gradient histogram weighted by its magnitude and a Gaussian-weighted circular window [14]

By assigning an orientation to each feature, the feature descriptor can be represented relative to this orientation and thus achieve invariance to image rotation. To calibrate the orientation of a feature, an area of scale-space gradients around the feature is first formed after applying our proposed gradient relocation, which eliminates the effect of the affine distortion. The area of gradients to be collected is a square whose size equals three times the feature scale. Then, the orientation of each scale-space gradient sample is added to the orientation histogram, weighted by its gradient magnitude and by a Gaussian-weighted circular window with 1.5 times the scale [14].

Then, the orientation histogram is subdivided into 36 bins covering the \(360^{\circ }\) range of orientations and filled with the corresponding accumulated magnitudes. The peak of the histogram points to the main direction. Any other local peak that is within \(80\%\) of the highest peak and higher than the average of its two neighbors is assigned an additional orientation; for features with multiple peaks, separate features are created at the same location and scale but with different orientations [14].

Furthermore, the main direction of the feature can be refined by utilizing the second-order Taylor expansion in the form of:

$$\begin{aligned} \Delta \varvec{x}=-\{D^{2}L(\varvec{x})\}^{-1}DL(\varvec{x}) \end{aligned}$$
(31)

Supposing the detected main direction bin is M and its two neighbor direction bins are \(M_{-}\) and \(M_{+}\), the above equation can be implemented in the form of:

$$\begin{aligned} \Delta M = - \frac{\frac{1}{2}(M_{+}-M_{-})}{M_{+}-2M+M_{-}} \end{aligned}$$
(32)

Then, the main direction is:

$$\begin{aligned} \theta =(M+\Delta M) \cdot \frac{360}{36} \end{aligned}$$
(33)

Offsets larger than 0.5 will be automatically rejected.
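
A minimal sketch of the orientation assignment of Eqs. (32)-(33): a 36-bin magnitude-weighted histogram is built from the relocated gradients (orientations assumed in degrees), the peak is refined by the parabolic step, and over-large offsets are discarded; the Gaussian window weighting described above is omitted for brevity.

```python
import numpy as np

def dominant_orientation(magnitude, orientation, n_bins=36):
    """Histogram-based orientation assignment with the refinement of
    Eqs. (32)-(33).  orientation is assumed to be in degrees."""
    bins = (orientation % 360.0) / (360.0 / n_bins)
    hist, _ = np.histogram(bins.astype(int) % n_bins,
                           bins=np.arange(n_bins + 1), weights=magnitude)
    m = int(np.argmax(hist))
    m_minus, m_plus = hist[(m - 1) % n_bins], hist[(m + 1) % n_bins]
    denom = m_plus - 2.0 * hist[m] + m_minus
    delta = 0.0 if denom == 0 else -0.5 * (m_plus - m_minus) / denom
    if abs(delta) > 0.5:            # offset too large: reject the refinement
        delta = 0.0
    return (m + delta) * (360.0 / n_bins)
```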

Fig. 6 Orientation histogram, created by subdividing the surrounding gradient into 36 bins according to the gradient orientations and accumulating weighted magnitude. The largest bin will be selected as the main direction of the descriptor

Assigning an orientation to each feature descriptor provides invariance to image rotation by relating the representation to the feature itself. The gradient is introduced with the purpose of avoiding the effects of illumination changes. The patch used to collect the gradients is then steered to the main direction to generate the traditional SIFT descriptor [14]. Image rotation can also be treated as a special case of affine transformation in the form of:

$$\begin{aligned} R(\theta )={\begin{bmatrix}\cos \theta&-\sin \theta \\\sin \theta&\cos \theta \\\end{bmatrix}} \end{aligned}$$
(34)

Combining the affine transformation, the total transformation can be synthesized as:

$$\begin{aligned} A^{\prime }={\begin{bmatrix}\cos \theta&-\sin \theta \\\sin \theta&\cos \theta \\\end{bmatrix}}\cdot A \end{aligned}$$
(35)

By applying the gradient relocation according to the synthesized transformation matrix, a square patch of gradients can be obtained (Fig. 7).

Fig. 7 Gradient relocation according to the synthesized transformation matrix. After the relocation, we will obtain a set of gradients eliminating the affine transformation and pointing to the main direction

Fig. 8 Borrowing the idea of SIFT descriptor, affine descriptor can be generated from the relocated gradient

As depicted in Fig. 8, the affine descriptor can then be generated. It has exactly the same form as the SIFT descriptor and can be directly applied in many existing SIFT-based applications.

Fig. 9 A typical example of our proposed method. We compare the performance of our proposed method with some other frequently applied algorithms on this special case [15]

Fig. 10 Typical 3D and 2D images with fiducial markers: a CBCT axial slice (a), a 2D-MAX LAT (b) and a 2D-MAX AP (c) image. The arrows indicate the locations of the fiducial markers [15]

5 Camera Matrix Calculation Via the Image-Image Feature Correspondences

With the matched correspondences, we can then apply the DLT method to calculate the camera matrix that defines the mapping from 2D image positions to the 3D model (Figs. 8, 9, 10).

The mapping from the coordinates of a 3D point P to the 2D image coordinates of the point’s projection onto the image plane, according to the pinhole camera model, is given by

$$\begin{aligned} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \frac{f}{x_3} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \end{aligned}$$
(36)

where \((x_1, x_2, x_3)\) are the 3D coordinates of P relative to a camera-centered coordinate system, \((y_1, y_2)\) are the resulting image coordinates, and f is the camera’s focal length, for which we assume \(f > 0\). Furthermore, we also assume that \(x_3 > 0\). To derive the camera matrix, this expression is rewritten in terms of homogeneous coordinates. Instead of the 2D vector \((y_1,y_2)\) we consider the projective element (a 3D vector) \( {\mathbf {y}}=(y_{1},y_{2},1)\), and instead of equality we consider equality up to scaling by a non-zero number, denoted \(\sim \). First, we write the homogeneous image coordinates as expressions in the usual 3D coordinates.

$$\begin{aligned} {\begin{pmatrix}y_{1}\\ y_{2}\\ 1\end{pmatrix}}={\frac{f}{x_{3}}}{\begin{pmatrix}x_{1}\\ x_{2}\\ {\frac{x_{3}}{f}}\end{pmatrix}}\sim {\begin{pmatrix}x_{1}\\ x_{2}\\ {\frac{x_{3}}{f}}\end{pmatrix}} \end{aligned}$$
(37)

Finally, also the 3D coordinates are expressed in a homogeneous representation \(\mathbf {x}\) and this is how the camera matrix appears:

$$\begin{aligned} {\begin{pmatrix}y_{1}\\ y_{2}\\ 1\end{pmatrix}}\sim {\begin{pmatrix}1&{}0&{}0&{}0\\ 0&{}1&{}0&{}0\\ 0&{}0&{}{\frac{1}{f}}&{}0\end{pmatrix}}\,{\begin{pmatrix}x_{1}\\ x_{2}\\ x_{3}\\ 1\end{pmatrix}} \end{aligned}$$
(38)

Or,

$$\begin{aligned} \mathbf {y} \sim \mathbf {C} \, \mathbf {x} \end{aligned}$$
(39)

where \(\mathbf {C}\) is the camera matrix, which here is given by

$$\begin{aligned} {\mathbf {C}}={\begin{pmatrix}1&{}0&{}0&{}0\\ 0&{}1&{}0&{}0\\ 0&{}0&{}{\frac{1}{f}}&{}0\end{pmatrix}} \end{aligned}$$
(40)

and the corresponding camera matrix now becomes

$$\begin{aligned} {\mathbf {C}}={\begin{pmatrix}1&{}0&{}0&{}0\\ 0&{}1&{}0&{}0\\ 0&{}0&{}{\frac{1}{f}}&{}0\end{pmatrix}}\sim {\begin{pmatrix}f&{}0&{}0&{}0\\ 0&{}f&{}0&{}0\\ 0&{}0&{}1&{}0\end{pmatrix}} \end{aligned}$$
(41)

The last step is a consequence of \(\mathbf {C}\) itself being a projective element.

The camera matrix derived here may appear trivial in the sense that it contains very few non-zero elements. This depends to a large extent on the particular coordinate systems which have been chosen for the 3D and 2D points. In practice, however, other forms of camera matrices are common, as will be shown below.

The camera matrix \(\mathbf {C}\) derived in the previous section has a null space which is spanned by the vector

$$\begin{aligned} {\mathbf {n}}={\begin{pmatrix}0\\ 0\\ 0\\ 1\end{pmatrix}} \end{aligned}$$
(42)

This is also the homogeneous representation of the 3D point which has coordinates (0,0,0), that is, the “camera center” (aka the entrance pupil; the position of the pinhole of a pinhole camera) is at O. This means that the camera center (and only this point) cannot be mapped to a point in the image plane by the camera (or equivalently, it maps to all points on the image as every ray on the image goes through this point).

For any other 3D point with \(x_{3}=0\), the result \(\mathbf {y} \sim \mathbf {C}\,\mathbf {x}\) is well-defined and has the form \({\mathbf {y}}=(y_{1}\,y_{2}\,0)^{\top }\). This corresponds to a point at infinity in the projective image plane (even though, if the image plane is taken to be a Euclidean plane, no corresponding intersection point exists).

In this way, we can register the 2D images to the 3D stereo-vision-based models.
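
As a reference sketch of the DLT estimation referred to in this section, the following standard formulation recovers the \(3 \times 4\) camera matrix from at least six 3D-2D feature correspondences via SVD; point normalisation, which is usually recommended for numerical stability, is omitted, and the function names are our own.

```python
import numpy as np

def dlt_camera_matrix(points_3d, points_2d):
    """Direct Linear Transform: estimate the 3x4 camera matrix C with x ~ C X.

    points_3d: iterable of (X, Y, Z); points_2d: iterable of (u, v).
    Requires at least six correspondences (11 degrees of freedom)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        Xh = [X, Y, Z, 1.0]
        # Two equations per correspondence (Hartley-Zisserman DLT rows).
        rows.append([0, 0, 0, 0] + [-w for w in Xh] + [v * w for w in Xh])
        rows.append(Xh + [0, 0, 0, 0] + [-u * w for w in Xh])
    A = np.asarray(rows, dtype=float)        # 2n x 12 design matrix
    _, _, vt = np.linalg.svd(A)
    C = vt[-1].reshape(3, 4)                 # null-space vector -> camera matrix
    return C / np.linalg.norm(C)

def project(C, point_3d):
    """Project a 3D point with the estimated camera matrix (x ~ C X)."""
    x = C @ np.append(np.asarray(point_3d, float), 1.0)
    return x[:2] / x[2]
```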

6 Experiment

Detecting features across images taken from different perspectives is fundamental for acquiring accurate depth information, and accurate depth information is in turn fundamental for an accurate 3D-to-2D image registration. Accordingly, we report the feature matching performance of our method on images taken from different perspectives for 3D-2D registration in Table 1.

In the table, we present the performance of our proposed algorithm compared with some of the most widely applied feature matching algorithms in 3D-2D image registration, including SIFT, SURF and ALP. By comparison, our proposed method outperforms the other feature matching algorithms, especially on the 2D registration images. Generally speaking, our proposed 3D-2D registration method depends largely on an accurate feature matching algorithm that can detect as many as possible of the potential correlations between images from different perspectives (Tables 2, 3).

Table 1 The average matching performance upon the images from the different perspectives (dataset QUBR)
Table 2 Comparison of our proposed method and some state-of-the-art algorithms
Table 3 Comparison of our proposed method and some state-of-the-art algorithms

The clinical image database was used to quantitatively evaluate the performance of our proposed AIFD-based method in comparison with three state-of-the-art 3D-2D registration methods. Selection of the state-of-the-art methods was limited to methods that are well established in the field of 3D-2D registration and that are capable of registering a 3D image either to one 2D view or to multiple 2D views simultaneously. There were about 14,000 centerline points per 2D image, for which the distance transform was precomputed so as to speed up the nearest-neighbor search to the projected 3D centerline points. In the BGB method the 3D intensity gradients were computed using the Canny edge detector, which resulted in about 17,000 edge points. The 2D intensity gradients were computed by the central-difference kernel.

Parameters of the state-of-the-art 3D-2D registration methods were experimentally set to obtain the best registration performance on the clinical image dataset. For the MIP-MI method, the sampling step along the projection rays was 0.375 mm and the intensities were discretized in 64 bins to compute the MI histograms. ICP had no user-controlled parameters, while in the BGB method the sensitivity of the angle weighting function was set to \(n=4\).

7 Conclusion

In this paper, we presented a novel method for 3D-2D rigid registration based on our previously proposed feature matching algorithm AIFD, which is better able to detect correspondences across images taken from different perspectives. The main advantage of the proposed method is that it is more robust to viewpoint differences, requiring fewer snapshots around the scene. The experiments show that our proposed method performs best among the most widely applied registration methods, and its overall execution time is also quite fast.

Translation of any 3D-2D registration method into clinical practice requires extensive and rigorous evaluation on real-patient image databases. Therefore, we acquired a clinical image database representative of cerebral EIGI and established a highly accurate gold-standard registration that enables objective quantitative evaluation of 3D-2D rigid registration methods. The quantitative and comparative evaluation against three state-of-the-art methods showed that the performance of the proposed method best met the demands of cerebral EIGI.