AIFD-Based 2D Image Registration to Multi-View Stereo Mapped 3D Models
Abstract
Multi-view stereo (MVS) based 3D range reconstruction generates 3D ranges by analyzing snapshots of the surroundings taken from different perspectives. Unlike the traditional approach, which employs expensive and difficult-to-maintain laser range devices to calibrate the range of real 3D objects, MVS achieves its results by seeking geometrical correlations between correspondences across snapshots from different perspectives. Interest in MVS keeps rising thanks to the fast development of digital maps and 3D printing. Several MVS algorithms have been well developed and have achieved success in reconstructing 3D ranges. However, most of these algorithms focus mainly on the fusion and merging of different scenes and on surface refinement. The limited capability of existing feature matching algorithms on affine-transformed images means that current MVS algorithms need a huge number of images with only tiny perspective differences. In this paper, we propose a new MVS algorithm that deploys our previously published Affine Invariant Feature Descriptor (AIFD) to detect and match correspondences across different perspectives, and that applies the homography matrix and segmentation to define the planes of the objects. Thanks to the AIFD and homography-based projection model, our proposed MVS algorithm outperforms other MVS algorithms in terms of speed and efficiency.
Keywords
AIFD · Feature matching · Homography · Registration · Camera model · Multi-view stereo

1 Introduction

A typical MVS pipeline consists of the following steps:

1. Image collection.
2. Calibration for the differences in camera settings between images.
3. Correspondence detection among images.
4. Reconstruction of the 3D ranges according to the geometrical correspondences.
5. Optional reconstruction of the materials of the scene.
Accurate and dense correspondence matching plays an important role in MVS: accurate matches are the key to camera matrix calibration, bringing image patches from different coordinate frames into coherence, while dense matches establish a dense depth cloud, making the 3D range surface more accurate and easier to smooth and merge. The matches used to construct the 3D ranges are determined by image registration. Registering 3D ranges to 2D images largely depends on an accurate 2D camera calibration with respect to the acquired geometric 3D model, and an accurate camera position estimate largely determines the quality of the 3D range's construction. Thus the registration issue can be simplified to computing the camera matrix within the framework of the projection model [13]. A large part of the recent success of MVS is due to the success of the underlying Structure from Motion algorithms that compute the camera parameters.
Camera calibration is the foundation of MVS registration. It refers to a set of values describing a camera configuration: camera pose information consisting of location and orientation, and intrinsic properties such as focal length and pixel sensor size. There are many ways to parametrize this configuration, and many cues that can be used to calculate the camera parameters from images, including stereo correspondence, pre-installed devices, and snapshot calibration. The capture method an algorithm assumes can largely restrict its range of application: a controlled MVS capture uses diffuse lights and a turntable to collect the images, outdoor capture takes a series of images around a small-scale scene, and crowdsourcing draws on online photo-sharing websites. Generally speaking, algorithms capable of handling arbitrarily taken snapshots are more desirable. In this situation, a robust, dense and accurate correspondence detection and matching scheme becomes essential for MVS.
Unlike the traditional DoG-based feature matching algorithms, we propose a novel MVS method that utilizes our previously published AIFD to detect and match correspondences from image to image, and introduces the homography model to define the smooth planes of the 3D objects. AIFD is a feature detection and description method that provides improved resilience to affine and scale changes. It borrows some ideas from SIFT [14], such as the scale space and pyramid structure, but it is more capable of dealing with image content seen from different viewpoints, which suits the special requirements of MVS.
Scale-invariant feature detectors such as SIFT, SURF [6] and ALP [4] have achieved success in many applications, including content-based visual retrieval, robotic navigation and image registration. However, their sensitivity to viewpoint changes has long restricted their use in a wider range of applications, such as 3D registration. Borrowing the basic principles of SIFT, we previously proposed the Affine Invariant Feature Detector (AIFD), which has better resilience to affine transformations. Equipped with this more advanced affine-invariant feature detector, we can now seek the connections between images for 3D range reconstruction by detecting matched features.
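As a minimal sketch of the matching step that SIFT-family detectors (and AIFD-style pipelines) typically rely on, the following code performs nearest-neighbour descriptor matching with Lowe's ratio test in plain NumPy. The function name and the toy descriptor vectors are illustrative assumptions, not taken from the AIFD implementation.

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match feature descriptors with Lowe's ratio test: accept a
    match only if the nearest neighbour is clearly closer than the
    second-nearest one."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        if nearest < ratio * second:
            matches.append((i, int(order[0])))
    return matches

# Toy descriptors: each row is one feature vector.
d1 = np.array([[1.0, 0.0], [0.0, 1.0]])
d2 = np.array([[0.9, 0.1], [5.0, 5.0], [0.1, 0.9]])
print(match_descriptors(d1, d2))  # [(0, 0), (1, 2)]
```

The ratio test discards ambiguous matches in repetitive texture, which is exactly where homogeneous-region correspondence failures (discussed in Sect. 2) arise.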
A correctly detected pair of correspondences between two images constitutes a stereo system, which can provide depth information, and a set of points defined by this depth information outlines the structure of the 3D ranges. In practice, most scenes, or parts of the object, will be covered by more than two images, which helps to calibrate denser and more accurate spatial information. The origins of multi-view stereo can be traced back to human stereopsis and the first attempts to solve stereoscopic matching as a computational problem [4]. Two-view stereo algorithms remain a very active and fruitful research area to this day. The multi-view version of stereo originated as a natural improvement to the two-view case: instead of capturing two images from different perspectives, multi-view stereo captures more viewpoints in between to increase robustness, e.g. to image noise, surface texture and viewpoint. What started as a way to improve two-view stereo has nowadays evolved into a different type of problem.
Only when equipped with a sufficient number of correspondences across different images can we approach an accurate camera matrix estimate via DLT. Knowledge of the registration from 3D ranges to 2D images improves the mapping of 3D textures, and can in turn be treated as a homography reference to be applied to other 2D images [3]. With this iterative registration-mapping method, the 3D ranges can be registered more precisely to an uncalibrated, arbitrarily taken snapshot [1].
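The homography reference mentioned above can be estimated from four or more matched point pairs with the same DLT machinery. The sketch below shows the standard (unnormalised) DLT for a 3x3 homography; the function name and the toy points are illustrative assumptions.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform for a 3x3 homography H with dst ~ H @ src.
    src, dst: (N, 2) arrays of matched points, N >= 4."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two rows of A in A h = 0.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, float))
    H = vt[-1].reshape(3, 3)      # null-space vector of A
    return H / H[2, 2]            # fix the projective scale

# Points related by a known homography (here a scale plus translation).
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
H_true = np.array([[2, 0, 1], [0, 2, -1], [0, 0, 1]], float)
pts = np.column_stack([src, np.ones(4)]) @ H_true.T
dst = pts[:, :2] / pts[:, 2:]
H = estimate_homography(src, dst)
print(np.allclose(H, H_true))
```

In practice one would wrap this in RANSAC [4] to reject outlier matches before trusting the plane estimate.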
Based on our proposed pipeline, a progressive mapping-and-registration method from 3D models to 2D images can be formed. The experiments in the sections below show that our proposed registration method outperforms the traditional edge/corner-based method [2]. Through our registration proposal, the stereo mapped 3D model can be introduced to more applications thanks to its efficiency and simplicity [5].
2 Stereo Visual Based 3D Range Methods
A 3D range model refers to a collection of points representing the distances in a scene from a specified viewpoint, which is normally associated with some type of sensor, such as a laser device [7]. In a well-formed range model, each pixel value reflects the corresponding distance to a certain view plane [8]. If the sensor used to produce the range model is properly calibrated, the pixel values directly give the distance in physical units, such as meters [9].
The sensor device used to produce the range model is sometimes referred to as a range camera. Range cameras can operate according to a number of different techniques [10], including stereo triangulation, sheet-of-light triangulation, time-of-flight, structured light, interferometry and coded aperture. Sheet-of-light triangulation illuminates the scene with a sheet of light, which creates a reflected line as seen from the light source. From any point out of the plane of the sheet, the line will typically appear as a curve whose exact shape depends both on the distance between the observer and the light source and on the distance between the light source and the reflected points. By observing the reflected sheet of light with a camera (often a high-resolution one) and knowing the positions and orientations of both camera and light source, it is possible to determine the distances between the reflected points and the light source or camera. By illuminating the scene with a specially designed light pattern, structured light [11], depth can be determined using only a single image of the reflected light; the pattern can take the form of horizontal and vertical lines, points, or checkerboard patterns. Depth can also be measured using the standard time-of-flight (ToF) technique, much like a radar, in that a range image similar to a radar image is produced, except that a light pulse is used instead of an RF pulse. By illuminating points with coherent light and measuring the phase shift of the reflected light relative to the light source, it is possible to determine depth; under the assumption that the true range image is a more or less continuous function of the image coordinates, the correct depth can then be obtained using a technique called phase unwrapping.
Depth information may also be partially or wholly inferred, alongside intensity, through deconvolution of an image captured with a specially designed coded aperture: a specific, complex arrangement of holes through which the incoming light is either allowed through or blocked.
Among all of these techniques, stereo triangulation is the most popular and widely applied technique for 3D range detection: the depth data are determined from data acquired by a stereo or multi-camera system. In this way the depth of a given point in the scene can be determined, for example relative to the center of the line between the cameras' focal points. To solve the depth measurement with a stereo camera system, it is necessary to detect the corresponding points in the different images, and correctly specifying these correspondences is one of the main tasks when applying this type of technique. For instance, it is difficult to detect correspondences for image points that lie inside regions of homogeneous intensity or color. As a consequence, stereo triangulation can produce reliable depth estimates only for a subset of all the points visible from the multiple-view cameras.
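The stereo triangulation described above can be sketched as a linear (DLT) triangulation from two known camera matrices. The two cameras and the 3D point below are synthetic, and the function name is an assumption for illustration, not code from any particular MVS system.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point from its
    projections x1, x2 under two 3x4 camera matrices P1, P2."""
    # Each view contributes two linear constraints x * (P row3 . X) = P row . X
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                   # homogeneous solution of A X = 0
    return X[:3] / X[3]

# Two hypothetical cameras: identity pose and a 1-unit baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true))
```

Points in textureless regions produce unreliable x1, x2 matches, which is why triangulated depth is sparse in homogeneous areas.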
3 Orthogonal Projection
Orthographic projection (sometimes orthogonal projection) is a means of representing three-dimensional objects in two dimensions. It is a form of parallel projection in which all the projection lines are orthogonal to the projection plane, resulting in every plane of the scene appearing in affine transformation on the viewing surface. The obverse of an orthographic projection is an oblique projection, a parallel projection in which the projection lines are not orthogonal to the projection plane.
The term orthographic is sometimes reserved specifically for depictions of objects where the principal axes or planes of the object are parallel with the projection plane, but these are better known as multiview projections. When the principal planes or axes of an object are not parallel with the projection plane, but are instead tilted to reveal multiple sides of the object, the projection is called an axonometric projection. Subtypes of multiview projection include plans, elevations and sections; subtypes of axonometric projection include isometric, dimetric and trimetric projections.
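A minimal numeric illustration of the orthographic model: depth is simply discarded, so points differing only in z project to the same image location. The matrix below is the textbook orthographic projection onto the plane z = 0, not anything specific to this paper.

```python
import numpy as np

# Orthographic projection: parallel rays orthogonal to the image plane,
# so the depth coordinate is dropped (projection onto the plane z = 0).
P_ortho = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1]], float)

# Two homogeneous points with the same (x, y) but different depths.
points = np.array([[2.0, 3.0, 5.0, 1.0],
                   [2.0, 3.0, 9.0, 1.0]])
proj = points @ P_ortho.T
proj = proj[:, :2] / proj[:, 2:]
print(proj)  # both rows map to (2, 3): depth does not move the image point
```

Contrast this with a perspective camera matrix, where dividing by the depth-dependent third coordinate makes image position depend on z.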
4 AIFD Based Feature Matching
Our previously published feature descriptor AIFD achieves its resilience to affine and scale changes by reshaping the multi-scale image representation and local extrema detection so as to maintain a linear relationship under affine transformations. Instead of relying on image simulations, AIFD achieves its affine and scale invariance entirely through its internal mechanisms when dealing with transformed visual content. It is therefore more concise, more reliable and applicable to more scenarios, and has more potential for future research.
The affine scale space is built from a deformed Gaussian kernel \(G_{A,\sigma }(\mathbf {x})=\frac{1}{2\pi \sigma ^{2}|\det A|}\exp \left( -\frac{\mathbf {x}^{\top }(AA^{\top })^{-1}\mathbf {x}}{2\sigma ^{2}}\right) \), i.e. a Gaussian with covariance \(\sigma ^{2}AA^{\top }\). In this formula, A represents the affine transformation, a \(2 \times 2\) matrix, and \(\sigma \) is the scale. This deformed Gaussian kernel is specialized to generate an affine scale space that maintains a linear relationship regardless of the change of viewpoint. Based on this structure, images from any viewpoint can be well represented at multiple scales. From the definition of the affine scale space, the conventional isotropic scale space can be deemed a special case whose affine transformation equals the \(2 \times 2\) identity matrix (Fig. 2).
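A discrete version of such a deformed (anisotropic) Gaussian kernel can be sketched as follows, assuming the kernel is a Gaussian with covariance \(\sigma ^{2}AA^{\top }\); the function name and the example shear matrix are illustrative assumptions.

```python
import numpy as np

def affine_gaussian_kernel(A, sigma, radius):
    """Discrete anisotropic Gaussian with covariance sigma^2 * A @ A.T.
    With A equal to the identity this reduces to the ordinary
    isotropic Gaussian kernel."""
    cov = sigma ** 2 * (A @ A.T)
    inv = np.linalg.inv(cov)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    pts = np.stack([xs, ys], axis=-1)                     # (H, W, 2) offsets
    quad = np.einsum('...i,ij,...j->...', pts, inv, pts)  # x^T cov^-1 x
    k = np.exp(-0.5 * quad)
    return k / k.sum()                                    # discrete normalisation

A = np.array([[1.0, 0.6], [0.0, 1.0]])  # a shear, standing in for a view change
k = affine_gaussian_kernel(A, sigma=2.0, radius=10)
print(k.shape, round(float(k.sum()), 6))  # (21, 21) 1.0
```

Convolving an image with kernels of this family, over a range of \(\sigma \), is one way to realise the affine scale space pyramid described above.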
The similarity of two pieces of visual content largely depends on the matched features detected in the scale space. For the conventional scale space, several approaches to detecting local maxima or minima from derivatives have been proposed [12], and local LoG extrema detection outperforms the others in terms of accuracy and efficiency in practice [17].
Borrowing the idea of LoG, we have also proposed an affine LoG, with the purpose of promoting feature candidate detection over the affine scale space. Instead of a direct Laplacian operation, we proposed a feasible implementation based on our pyramid structure that generates the affine LoG efficiently, so that the affine Gaussian and LoG scale spaces can be generated simultaneously. More information about the affine scale space and affine LoG can be found in [18].
Candidates for which the Hessian-based measure falls below 0.001 are also rejected, to guarantee that the extremum stands out to a certain level against the surrounding points; the measure used is \(R=Tr(H)^2/Det(H)\), which equals \((\gamma +1)^2/\gamma \), where \(\gamma \) denotes the ratio between the two eigenvalues of the Hessian H.
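The edge-rejection criterion \(R=Tr(H)^2/Det(H)\) can be sketched numerically as below; the threshold \(\gamma =10\) and the toy Hessians are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def passes_edge_test(H, gamma=10.0):
    """SIFT-style edge rejection on the 2x2 Hessian H of the response:
    keep a candidate only if Tr(H)^2 / Det(H) < (gamma + 1)^2 / gamma,
    i.e. the two principal curvatures differ by less than a factor gamma."""
    tr, det = np.trace(H), np.linalg.det(H)
    if det <= 0:          # curvatures of opposite sign: not a clean extremum
        return False
    return tr ** 2 / det < (gamma + 1) ** 2 / gamma

blob = np.array([[2.0, 0.0], [0.0, 1.8]])   # similar curvatures: kept
edge = np.array([[50.0, 0.0], [0.0, 0.5]])  # one dominant curvature: rejected
print(passes_edge_test(blob), passes_edge_test(edge))  # True False
```

The ratio form avoids computing eigenvalues explicitly: trace and determinant of the 2x2 Hessian are enough.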
The offset \(\hat{x}\) is added to the detected integer position to bring the local extremum location to sub-pixel precision, according to the formula Eq. 31. The Hessian matrix and local gradient at the pixel sample can be obtained from the corresponding LoG derivative polynomial expressions.
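The sub-pixel refinement amounts to fitting a quadratic around the detected extremum and solving \(\hat{x}=-H^{-1}\nabla D\). The toy response function and the names below are assumptions for illustration only.

```python
import numpy as np

def subpixel_offset(grad, hess):
    """Quadratic-fit refinement around a detected extremum:
    offset x_hat = -H^{-1} g, clamped if the fit leaves the pixel."""
    offset = -np.linalg.solve(hess, grad)
    return np.clip(offset, -0.5, 0.5)

# Toy 2D response D(x) = 1 - (x - c)^T (x - c), with its extremum at c,
# sampled at the integer position (0, 0).
c = np.array([0.3, -0.2])          # true sub-pixel extremum location
grad = -2 * (np.zeros(2) - c)      # gradient of D at (0, 0)
hess = -2 * np.eye(2)              # Hessian of D (constant here)
print(subpixel_offset(grad, hess))
```

For this quadratic the recovered offset equals c exactly; on real LoG responses the fit is only local, hence the clamp to half a pixel.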
Gradients from affine-transformed images are constrained by the affine matrices between the different viewpoints. Around each feature, the gradients, relocated according to the affine transformation, then form a histogram that serves as the feature descriptor (Figs. 4, 5).
By assigning an orientation to each feature, the feature descriptor can be represented relative to this orientation and thereby achieve invariance to image rotation. To calibrate the orientation of a feature, an area of scale space gradients around the feature is first formed, after our proposed gradient relocation has eliminated the effect of the affine distortion. The area of gradients collected is a square whose side equals 3 times the feature scale. The orientation of each scale space gradient sample is then added to the orientation histogram, weighted by its gradient magnitude and by a Gaussian-weighted circular window of 1.5 times the scale [14].
The orientation histogram is subdivided into 36 bins covering the \(360^{\circ }\) range of orientations and filled with the corresponding accumulated magnitudes. The peak of the histogram points to the feature's main direction. Any other local peak that is within \(80\%\) of the highest peak and higher than the average of its two neighbors is assigned an additional orientation, so features with multiple peaks are created at the same location and scale but with different orientations [14].
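The 36-bin orientation histogram and its peak selection can be sketched as follows. Note that the simplified peak test here (strictly greater than both neighbours) stands in for the paper's comparison against the neighbour average, and all names and data are illustrative assumptions.

```python
import numpy as np

def dominant_orientations(angles_deg, magnitudes, n_bins=36, thresh=0.8):
    """36-bin weighted orientation histogram; returns the main peak plus
    any other local peak within 80% of it (each would spawn an extra
    feature with the same location and scale)."""
    hist, edges = np.histogram(angles_deg % 360.0, bins=n_bins,
                               range=(0.0, 360.0), weights=magnitudes)
    peaks = []
    for i in range(n_bins):
        left, right = hist[(i - 1) % n_bins], hist[(i + 1) % n_bins]
        if hist[i] > left and hist[i] > right and hist[i] >= thresh * hist.max():
            peaks.append(0.5 * (edges[i] + edges[i + 1]))  # bin centre
    return peaks

# Toy gradient samples: a strong cluster near 12 deg, a weaker one near 190 deg.
angles = np.array([12.0, 14.0, 11.0, 190.0, 191.0])
mags = np.array([1.0, 1.0, 1.0, 1.2, 1.3])
print(dominant_orientations(angles, mags))  # [15.0, 195.0]
```

A production version would also interpolate each peak with a parabola over its two neighbouring bins for sub-bin accuracy, as in SIFT [14].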
5 Camera Matrix Calculation via the Image-Image Feature Correspondences
With the matched correspondences, we can then apply the DLT method to calculate the camera matrix, which defines the mapping from 2D image positions to the 3D model (Figs. 8, 9, 10).
The camera matrix derived here may appear trivial in the sense that it contains very few nonzero elements. This depends to a large extent on the particular coordinate systems which have been chosen for the 3D and 2D points. In practice, however, other forms of camera matrices are common, as will be shown below.
For any other 3D point with \(x_{3}=0\), the result \(\mathbf {y} \sim \mathbf {C}\,\mathbf {x}\) is well-defined and has the form \({\mathbf {y}}=(y_{1}\,y_{2}\,0)^{\top }\). This corresponds to a point at infinity in the projective image plane (even though, if the image plane is taken to be a Euclidean plane, no corresponding intersection point exists).
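The DLT estimation of a full 3x4 camera matrix from 3D-2D correspondences, as used in this section, can be sketched as below. The synthetic camera and random points are illustrative; at least six correspondences are needed for the matrix's 11 degrees of freedom.

```python
import numpy as np

def dlt_camera_matrix(X, x):
    """DLT estimate of the 3x4 camera matrix P from n >= 6 world
    points X (n, 3) and their 2D projections x (n, 2)."""
    rows = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        p = [Xw, Yw, Zw, 1.0]
        # Two linear constraints per correspondence: A vec(P) = 0.
        rows.append(p + [0, 0, 0, 0] + [-u * c for c in p])
        rows.append([0, 0, 0, 0] + p + [-v * c for c in p])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 4)

# Synthetic check: project points with a known P, then recover it.
rng = np.random.default_rng(0)
P_true = np.hstack([np.eye(3), np.array([[0.1], [0.2], [2.0]])])
X = rng.uniform(-1, 1, (8, 3)) + np.array([0, 0, 5.0])
h = np.column_stack([X, np.ones(8)]) @ P_true.T
x = h[:, :2] / h[:, 2:]
P = dlt_camera_matrix(X, x)
P /= P[-1, -1]; P_true /= P_true[-1, -1]   # remove the projective scale
print(np.allclose(P, P_true, atol=1e-6))
```

In practice the 2D and 3D coordinates are first normalised (centred and scaled) before the SVD, which greatly improves numerical conditioning on real data.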
In this way, the 2D images can be registered to the stereo-visual-based 3D models.
6 Experiment
Detecting features across images from different perspectives is fundamental for acquiring accurate depth information, and accurate depth information is in turn fundamental for an accurate 3D-to-2D image registration. Accordingly, we report the feature matching performance of our method on images from different perspectives for 3D-2D registration in Table 1.
In the table, we compare the performance of our proposed algorithm against some of the most widely applied feature matching algorithms in 3D-2D image registration, including SIFT, SURF and ALP. By comparison, our proposed method outperforms the other feature matching algorithms, especially on the 2D registration images. Generally speaking, our proposed 3D-2D registration method depends largely on an accurate feature matching algorithm that can detect as many of the potential correlations between images from different perspectives as possible (Tables 2, 3).
Table 1. The average matching performance on images from different perspectives (dataset QUBR)

| Skewing | SIFT Number | SIFT Ratio | ALP Number | ALP Ratio | SURF Number | SURF Ratio | AIFD Number | AIFD Ratio |
|---|---|---|---|---|---|---|---|---|
| 1 | 423.345 | 0.305 | 454.521 | 0.408 | 371.667 | 0.380 | 483.479 | 0.623 |
| 2 | 93.243 | 0.058 | 102.356 | 0.095 | 89.032 | 0.047 | 436.138 | 0.577 |
| 3 | 81.543 | 0.052 | 97.456 | 0.085 | 72.546 | 0.038 | 381.59 | 0.496 |
| 4 | 24.456 | 0.017 | 37.056 | 0.031 | 22.453 | 0.015 | 328.897 | 0.431 |
| 5 | 16.657 | 0.010 | 21.802 | 0.0155 | 15.234 | 0.00967 | 289.805 | 0.378 |
| 6 | 13.234 | 0.0078 | 18.307 | 0.0135 | 12.342 | 0.00503 | 235.845 | 0.322 |
| 7 | 5.213 | 0.0017 | 6.234 | 0.0029 | 3.434 | 0.00093 | 183.395 | 0.289 |
| 8 | 1.367 | 0.0013 | 2.921 | 0.00192 | 1.412 | 0.00107 | 157.281 | 0.228 |
| 9 | 0.572 | 0.0003 | 0.560 | 0.0005 | 0.442 | 0.0003 | 127.362 | 0.220 |
Table 2. Comparison of our proposed method and some state-of-the-art algorithms

| View | Method | MEAN ± STD (DSA) | MEAN ± STD (MAX) | SR % (DSA) | SR % (MAX) | CR mm (DSA) | CR mm (MAX) | Time (s) |
|---|---|---|---|---|---|---|---|---|
| LAT | MIPMI | 0.32 ± 0.21 | 0.56 ± 0.53 | 72.32 | 34.23 | 4 | 3 | 76.3 |
| | ICP | 0.44 ± 0.23 | – | 42.02 | – | 1 | – | 1.1 |
| | BGB | 0.40 ± 0.37 | 0.41 ± 0.36 | 52.38 | 48.43 | 3 | 2 | 13.4 |
| | MGP | 0.61 ± 0.37 | 0.63 ± 0.39 | 73.23 | 69.98 | 5 | 3 | 0.9 |
| | MGP+BGB | 0.26 ± 0.23 | 0.29 ± 0.27 | 73.21 | 72.21 | 5 | 3 | 12.8 |
| | AIFD | 0.62 ± 0.25 | 0.58 ± 0.31 | 70.31 | 68.23 | 5 | 4 | 5.7 |
| AP | MIPMI | 0.27 ± 0.32 | 0.68 ± 0.45 | 91.78 | 32.87 | 9 | 3 | 65.2 |
| | ICP | 0.32 ± 0.25 | – | 72.48 | – | 1 | – | 0.4 |
| | BGB | 0.32 ± 0.35 | 0.44 ± 0.33 | 58.32 | 52.13 | 3 | 5 | 13.8 |
| | MGP | 0.53 ± 0.27 | 0.63 ± 0.33 | 92.43 | 85.68 | 10 | 9 | 0.9 |
| | MGP+BGB | 0.28 ± 0.17 | 0.39 ± 0.27 | 95.45 | 85.3 | 11 | 8 | 10.5 |
| | AIFD | 0.45 ± 0.33 | 0.56 ± 0.21 | 72.34 | 63.21 | 4 | 4 | 6.8 |
Table 3. Comparison of our proposed method and some state-of-the-art algorithms

| View | Method | MEAN ± STD (DSA) | MEAN ± STD (MAX) | SR % (DSA) | SR % (MAX) | CR mm (DSA) | CR mm (MAX) | Time (s) |
|---|---|---|---|---|---|---|---|---|
| LAT | MIPMI | 0.23 ± 0.22 | 0.57 ± 0.46 | 67.32 | 32.12 | 3 | 3 | 116.3 |
| | ICP | 0.48 ± 0.33 | – | 41.57 | – | 2 | – | 1.1 |
| | BGB | 0.38 ± 0.32 | 0.38 ± 0.31 | 51.32 | 42.41 | 2 | 2 | 11.4 |
| | MGP | 0.61 ± 0.37 | 0.63 ± 0.39 | 73.23 | 69.98 | 5 | 2 | 1.8 |
| | MGP+BGB | 0.24 ± 0.17 | 0.26 ± 0.21 | 72.29 | 71.87 | 4 | 2 | 18.7 |
| | AIFD | 0.45 ± 0.34 | 0.23 ± 0.46 | 73.21 | 61.32 | 4 | 3 | 8.2 |
The clinical image database was used to quantitatively evaluate the performance of our proposed AIFD-based method in comparison with three state-of-the-art 3D-2D registration methods. The selection of state-of-the-art methods was limited to methods that are well established in the field of 3D-2D registration and that are capable of registering a 3D image either to one 2D view or to multiple 2D views simultaneously. There were about 14,000 centerline points per 2D image, for which the distance transform was precomputed so as to speed up the nearest-neighbor search for the projected 3D centerline points. In the BGB method the 3D intensity gradients were computed using the Canny edge detector, which resulted in about 17,000 edge points; the 2D intensity gradients were computed with the central difference kernel.
The parameters of the state-of-the-art 3D-2D registration methods were set experimentally to obtain the best registration performance on the clinical image dataset. For the MIPMI method, the sampling step along the projection rays was 0.375 mm and the intensities were discretized into 64 bins to compute the MI histograms. The ICP method had no user-controlled parameters, while in the BGB method the sensitivity of the angle weighting function was set to \(n=4\).
7 Conclusion
In this paper, we presented a novel method for 3D-2D rigid registration based on our previously proposed feature matching algorithm AIFD, which is better able to detect correspondences across images from different perspectives. The main advantage of the proposed method is that it is more robust to viewpoint differences, so fewer snapshots around the object are required. The experiments show that our proposed method performs best among the most widely applied registration methods, and its overall execution time is also quite fast.
Translating any 3D-2D registration method into clinical practice requires extensive and rigorous evaluation on real-patient image databases. We therefore acquired a clinical image database representative of cerebral EIGI and established a highly accurate gold standard registration that enables objective quantitative evaluation of 3D-2D rigid registration methods. The quantitative and comparative evaluation against three state-of-the-art methods showed that the performance of the proposed method best met the demands of cerebral EIGI.
References
1. Datta R, Joshi D, Li J, Wang JZ (2008) Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv 40(2):5
2. Du S, Guo Y, Sanroma G, Ni D, Wu G, Shen D (2015) Building dynamic population graph for accurate correspondence detection. Medical Image Analysis
3. Du S, Liu J, Zhang C, Zhu J, Li K (2015) Probability iterative closest point algorithm for m-D point set registration with noise. Neurocomputing 157:187–198
4. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24(6):381–395
5. Florack L, Maas R, Niessen W (1999) Pseudo-linear scale-space theory. Int J Comput Vision 31(2–3):247–259
6. Förstner W, Gülch E (1987) A fast operator for detection and precise location of distinct points, corners and centres of circular features. In: Proceedings of the ISPRS intercommission conference on fast processing of photogrammetric data, pp 281–305
7. Gao Y, Ji R, Cui P, Dai Q, Hua G (2014) Hyperspectral image classification through bilayer graph-based learning. IEEE Trans Image Process 23(7):2769–2778
8. Gao Y, Ji R, Liu W, Dai Q, Hua G (2014) Weakly supervised visual dictionary learning by harnessing image attributes. IEEE Trans Image Process 23(12):5400–5411
9. Gao Y, Wang M, Tao D, Ji R, Dai Q (2012) 3D object retrieval and recognition with hypergraph analysis. IEEE Trans Image Process 21(9):4290–4303
10. Gao Y, Wang M, Zha ZJ, Shen J, Li X, Wu X (2013) Visual-textual joint relevance learning for tag-based social image search. IEEE Trans Image Process 22(1):363–376
11. Gonzalez RC, Woods RE, Eddins SL (2004) Digital image processing using MATLAB. Pearson Prentice Hall, Upper Saddle River
12. Lindeberg T (1992) Scale-space behaviour of local extrema and blobs. J Math Imaging Vis 1(1):65–99
13. Lindeberg T (2013) Generalized axiomatic scale-space theory. Adv Imaging Electron Phys 178:1
14. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
15. Mitrovic U, Špiclin Ž, Likar B, Pernuš F (2013) 3D–2D registration of cerebral angiograms: a method and evaluation on clinical images. IEEE Trans Med Imaging 32(8):1550–1563
16. Paschalakis S, Francini G (2014) Test model 12: compact descriptors for visual search. ISO/IEC JTC1/SC29/WG11/N14961, MPEG, Strasbourg, France
17. Tuytelaars T, Mikolajczyk K (2008) Local invariant feature detectors: a survey. Found Trends Comput Gr Vis 3(3):177–280
18. Zhao B, Lepsoy S, Magli E (2015) Affine scale space for viewpoint invariant keypoint detection. In: Multimedia Signal Processing (MMSP), 2015 IEEE 17th International Workshop on, pp 1–6
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.