# Divide and Conquer: Efficient Large-Scale Structure from Motion Using Graph Partitioning

## Abstract

Despite significant advances in recent years, structure-from-motion (SfM) pipelines suffer from two important drawbacks. Apart from requiring significant computational power to solve the large-scale computations involved, such pipelines sometimes fail to reconstruct correctly when the accumulated error in incremental reconstruction is large or when the number of 3D-to-2D correspondences is insufficient. In this paper we present a novel approach to mitigate the above-mentioned drawbacks. Using an image match graph based on matching features, we partition the image data set into smaller sets or components which are reconstructed independently. Following such reconstructions we utilise the available epipolar relationships that connect images across components to correctly align the individual reconstructions in a global frame of reference. This yields a speed-up of at least one order of magnitude and also mitigates the problem of reconstruction failures, with only a marginal loss in accuracy. The effectiveness of our approach is demonstrated on several large-scale real-world data sets.

## 1 Introduction

In structure from motion (SfM) we typically use many images of a scene to solve for both the 3D scene being viewed and the parameters of the cameras involved. Most contemporary large-scale SfM methods [1, 2, 3, 4, 5] use the bundle adjustment method [6] which simultaneously optimises for both structure and camera parameters using point correspondences in images by minimising a global cost function. However, being a joint optimisation over all cameras and 3D points, bundle adjustment often fails for large data sets. This is typically due to an accumulation of error in an incremental reconstruction or when cameras are weakly connected to 3D feature points. In addition, owing to the very large number of variables involved, bundle adjustment is also very computationally demanding and time consuming. In this paper we adopt a divide-and-conquer strategy that is designed to mitigate these problems. In essence, our approach partitions the full image data set into smaller sets that can each be independently reconstructed using a standard approach to bundle adjustment. Subsequently, by utilising available geometric relationships between cameras across the individual partitions, we solve a global registration problem that correctly and accurately places each individual 3D reconstructed component into a single global frame of reference.

Our main contributions are twofold:

1. A principled method based on normalised cuts [7] to partition the match graph of a large collection of images into disjoint connected components which can be independently and reliably reconstructed. This process also automatically identifies a set of connecting images between the components which can be used to register the independent reconstructions. Specifically, these are the image pairs specified by the cut edges in the graph.

2. A method for registering the point clouds corresponding to the independent connected components using pairwise epipolar geometry relationships. The epipolar-based registration technique proposed in this paper is more robust than the standard techniques for registering point clouds using 3D-3D or 3D-2D correspondences. Registration methods based on 3D point correspondences do not use all available information (image correspondences) and may fail when the point clouds do not have a sufficient number of 3D points in common. 3D-2D based methods, such as a sequential bundler [1, 2, 8], often result in broken reconstructions when the number of points available is inadequate for re-sectioning or when common 3D points are removed at the outlier rejection stage [1] (see Table 4). The proposed registration algorithm using pairwise epipolar geometry alleviates this problem, as is shown in Fig. 1 and discussed in Sect. 4. Considered as an independent approach, the epipolar-based algorithm can also be used to register independently reconstructed point clouds by introducing a few connecting images.

Frahm *et al.* [9, 10] try to find some representative “iconic images” from the image data set, partition the iconic scene graph, reconstruct each cluster and register the clusters using 3D similarity transformations. Snavely *et al.* [11, 12] and Havlena *et al.* [13] compute skeletal sets from the match graph to reduce image matching. All these methods reduce the set of images on which they run SfM. Moreover, incremental bundle adjustment is also known to suffer from drift due to an accumulation of errors which increases as the number of images grows [1, 5, 14]. Crandall *et al.* [5, 14] propose an MRF-based discrete formulation coupled with continuous Levenberg-Marquardt refinement for large-scale SfM to mitigate this problem. To reduce matching time, Wu [1] (henceforth VSFM) proposed preemptive matching to reduce the number of pairs to be matched; moreover, all cameras and 3D points are optimised only after a certain number of new cameras have been incorporated into the iterative bundler. Although VSFM demonstrates approximately linear running time, it sometimes fails for large data sets when the accumulated errors of the iterative bundler become large [1]. Although there have been some recent global methods [15, 16], to be able to solve large-scale SfM problems such methods need to be exceedingly robust. Farenzena *et al.* [17] also propose to merge smaller reconstructions in a bottom-up dendrogram; however, their largest data sets contain only 380 images, and their use of reprojection errors of common 3D points for merging is unsuitable for very large data sets.

In our approach, we propose to decompose the image set into smaller components so that the match graph of each component is densely connected. This is likely to yield correct 3D reconstructions, since fewer problems are encountered during the re-sectioning stage of a standard iterative bundler and the reconstruction is robust. Restricting pairwise image matching to within each component also yields a significant reduction in computation time. Moreover, the SfM reconstruction of each component can be carried out in parallel. Our approach is conceptually depicted in Fig. 2.

The rest of the paper is organised as follows. Section 2 discusses our method of decomposing the image set into smaller groups and also determining the connecting images between individual groups. Section 3 provides the overview of our registration process. Section 4 reports the results of our experiments on different data sets, and Sect. 5 concludes the paper.

## 2 Data Set Decomposition Using Normalised Cuts

When the acquisition of images can be planned, the image set can be grouped by physical structure, with each group linked to its neighbouring group by a *connecting* image which sees parts of both the buildings. We call such data sets *organised*.

In case such planned acquisition is not possible, the collection of images needs to be automatically partitioned into smaller components. Unorganised data sets downloaded from the Internet are typical examples. In such cases a method for automatically grouping images into visually similar sets and finding connecting images needs to be established. To this end, we train a vocabulary tree [18] using all image features (SIFTGPU [19]) and extract the top \(p\) (typically p = 80) similar images for each image in the set. We form a match graph where each node is an image and the edge weights between two nodes are the similarity values obtained from the vocabulary tree. We aim to partition the set of images such that each partition is densely connected. The partitions only capture dense connectivity of matched image features and need not represent a single physical structure. Here the dense connectivity ensures that SfM reconstruction is less likely to fail due to the paucity of reliable matches or accumulated error or drift.
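As a minimal illustration of this partitioning step, the sketch below performs a two-way normalised cut (Shi and Malik [7]) on a toy similarity-weighted match graph by thresholding the generalised Fiedler vector; the cut edges then give the candidate connecting image pairs. This is only a small-scale sketch, not the authors' implementation, which partitions a much larger vocabulary-tree match graph.

```python
import numpy as np

def normalised_cut_bipartition(W):
    """Two-way normalised cut: threshold the generalised Fiedler vector
    of the normalised graph Laplacian at zero.

    W : symmetric (n, n) affinity matrix of the match graph, where
        W[i, j] is the vocabulary-tree similarity of images i and j.
    Returns a boolean label per image (the two partitions).
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetrically normalised Laplacian: L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    # Second-smallest eigenvector, mapped back to the generalised problem
    fiedler = d_inv_sqrt * vecs[:, 1]
    return fiedler >= 0

# Toy match graph: two densely matched 4-image clusters joined by one weak edge
n = 8
W = np.zeros((n, n))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 0.1        # weak cross-cluster similarity

labels = normalised_cut_bipartition(W)

# Cut edges (endpoints in different partitions) are the candidate
# connecting image pairs used later for registration.
cut_edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if W[i, j] > 0 and labels[i] != labels[j]]
print(labels, cut_edges)
```

On this toy graph the two dense clusters separate cleanly and the single weak edge (3, 4) is recovered as the cut edge, i.e. the connecting image pair.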

**Extracting connecting images:** The number of candidate connecting images is often very large. Reducing the number of connecting images reduces the time needed to estimate pairwise epipolar geometry. The connecting image extraction process is as follows:

1. For each of the connecting images, reject the outlier **out edges** (both within and across components) using a measure of the robustness of the epipolar computation (Eq. 6).

2. If the number of out edges retained is less than \(T\) (typically \(T\) = 60 \(\%\) of the original out degree), remove the image from the set of connecting images.

3. Compute the mean of the similarity scores of all the retained out edges for the current image.

4. If the similarity score for a cut edge exceeds the mean similarity values of the images it connects, mark those images as connecting images.
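The four steps above can be sketched in Python. The graph layout, the thresholds, and the `robustness` score (a stand-in for Eq. 6, which is not reproduced here) are illustrative assumptions, not the paper's exact values; step 4 is implemented under one plausible reading (the cut-edge similarity must exceed the mean at both endpoints).

```python
def prune_connecting_images(candidates, graph, robustness,
                            robust_min=0.5, keep_ratio=0.6):
    """Steps 1-3: drop unreliable out edges, discard candidates with too
    few surviving edges, and record each survivor's mean edge similarity.

    graph[i]   : dict neighbour -> vocabulary-tree similarity
    robustness : dict frozenset({i, j}) -> epipolar robustness score
                 (an illustrative stand-in for Eq. 6 of the paper)
    """
    mean_sim = {}
    for img in candidates:
        kept = {j: s for j, s in graph[img].items()
                if robustness[frozenset((img, j))] >= robust_min}   # step 1
        if len(kept) < keep_ratio * len(graph[img]):                # step 2
            continue
        mean_sim[img] = sum(kept.values()) / len(kept)              # step 3
    return mean_sim

def select_connecting_pairs(cut_edges, graph, mean_sim):
    """Step 4: keep a cut edge only if its similarity exceeds the mean
    similarity at both endpoints (one reading of the criterion)."""
    return [(i, j) for i, j in cut_edges
            if i in mean_sim and j in mean_sim
            and graph[i][j] > mean_sim[i] and graph[i][j] > mean_sim[j]]

# Two components {a, b, c} and {x, y, z} with cut edge (a, x)
graph = {
    "a": {"b": 0.6, "c": 0.5, "x": 0.7},
    "b": {"a": 0.6, "c": 0.85},
    "c": {"a": 0.5, "b": 0.85},
    "x": {"a": 0.7, "y": 0.6, "z": 0.5},
    "y": {"x": 0.6, "z": 0.65},
    "z": {"x": 0.5, "y": 0.65},
}
robustness = {frozenset((i, j)): 1.0 for i in graph for j in graph[i]}
robustness[frozenset(("x", "z"))] = 0.1   # one unreliable epipolar estimate

mean_sim = prune_connecting_images(["a", "x"], graph, robustness)
pairs = select_connecting_pairs([("a", "x")], graph, mean_sim)
print(mean_sim, pairs)
```

Here the unreliable edge (x, z) is dropped in step 1, both candidates survive step 2, and the cut edge (a, x) is retained in step 4 because its similarity (0.7) exceeds both endpoint means.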

## 3 Registration of Independent Component Reconstructions

In this section, we describe how each of the individually reconstructed groups of cameras is aligned, or registered, to a single frame of reference. To register a pair of 3D reconstructions, we need to estimate the relative transformation between them. In what follows, we describe how we estimate the relative rotation, translation and scale between a pair of reconstructions using epipolar relationships between the reconstructed cameras and the connecting cameras. While estimating epipolar geometry, we use focal lengths extracted from the EXIF information of the images.
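As a concrete illustration of the epipolar relationship being exploited (with camera intrinsics assembled from an EXIF focal length), the following synthetic check verifies that the constraint \(x_2^\top E\, x_1 = 0\) holds in normalised coordinates. The focal length, principal point, and relative pose values are made up for the example; the convention \(X_2 = R X_1 + t\), \(E = [t]_\times R\) is assumed.

```python
import numpy as np

def skew(v):
    """3x3 cross-product matrix so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def intrinsics_from_exif(focal_px, cx, cy):
    """Camera matrix assuming square pixels and a known principal point."""
    return np.array([[focal_px, 0.0, cx],
                     [0.0, focal_px, cy],
                     [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)

# Made-up relative pose between two cameras: X2 = R @ X1 + t
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
E = skew(t) @ R                     # essential matrix for this convention

# Focal length as if read from EXIF (value is illustrative)
K = intrinsics_from_exif(1200.0, 640.0, 360.0)

# Synthetic 3D points in front of camera 1
X1 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))
X2 = X1 @ R.T + t

# Project to pixels, then map back to normalised coordinates via K^{-1}
u1 = (X1 / X1[:, 2:]) @ K.T
u2 = (X2 / X2[:, 2:]) @ K.T
x1 = u1 @ np.linalg.inv(K).T
x2 = u2 @ np.linalg.inv(K).T

# Epipolar constraint x2^T E x1 = 0 holds in normalised coordinates
residuals = np.einsum('ni,ij,nj->n', x2, E, x1)
print(np.abs(residuals).max())
```

The residuals are zero up to floating-point error, confirming that with intrinsics known from EXIF the epipolar constraint can be checked directly on normalised image coordinates.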

Let us consider two independently reconstructed groups of cameras \(A\) and \(B\).\(^{1}\) Let \(\mathbb {C}_{AB}\) be the set of connecting cameras between \(A\) and \(B\). We first fix the relative scale between \(A\) and \(B\) using the approach described in Sect. 3.1. Once this relative scale is fixed, the two reconstructions \(A\) and \(B\) are related by a rigid or Euclidean transformation which can be estimated using the method detailed in Sect. 3.2.

### 3.1 Relative Scale Estimation Between a Pair of Reconstructions

### 3.2 Relative Rotation and Translation Estimation Between a Pair of Reconstructions

Once \(A\) is resized to have the same scale as \(B\), the two reconstructions are related by a rigid or Euclidean transformation. Earlier, we estimated the motion of a connecting camera \(k\) in the frame of reference of \(A\) to be a rotation \(\widehat{R}_{Ak}\) and a translation \(\widehat{s}_{AB}\widehat{T}_{Ak}\). Similarly, the motion of \(k\) in the frame of reference of \(B\) is given by \(\widehat{R}_{Bk}\) and \(\widehat{T}_{Bk}\).
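A least-squares sketch of this alignment is shown below. It assumes the extrinsics convention \(x_k = R_k x + T_k\) and the frame change \(x_A = R x_B + t\), so each connecting camera gives \(R = R_{Ak}^\top R_{Bk}\) and \(t = R_{Ak}^\top (T_{Bk} - T_{Ak})\); rotations are combined by chordal averaging (SVD projection). The paper's actual estimator is a robust motion averaging in the spirit of [21], so this noise-free illustration only shows the algebra.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rodrigues(axis, angle):
    """Rotation matrix about an axis by the given angle (Rodrigues)."""
    axis = np.asarray(axis, float) / np.linalg.norm(axis)
    K = skew(axis)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def register_rigid(poses_A, poses_B):
    """Estimate (R, t) with x_A = R @ x_B + t from connecting cameras
    reconstructed in both frames; each pose is (R_k, T_k) with
    x_k = R_k x + T_k. Rotations are combined by chordal averaging:
    the SVD projects the summed per-camera estimates onto SO(3)."""
    M = sum(Ra.T @ Rb for (Ra, _), (Rb, _) in zip(poses_A, poses_B))
    U, _, Vt = np.linalg.svd(M)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    t = np.mean([Ra.T @ (Tb - Ta) for (Ra, Ta), (Rb, Tb)
                 in zip(poses_A, poses_B)], axis=0)
    return R, t

# Synthetic check with a made-up ground-truth frame change
R_true = rodrigues([1.0, 2.0, 0.5], 0.4)
t_true = np.array([2.0, -1.0, 0.5])
rng = np.random.default_rng(1)
poses_A, poses_B = [], []
for _ in range(5):
    Rb = rodrigues(rng.normal(size=3), rng.uniform(0, np.pi))
    Tb = rng.normal(size=3)
    Ra = Rb @ R_true.T              # consistent with x_A = R x_B + t
    Ta = Tb - Ra @ t_true
    poses_A.append((Ra, Ta))
    poses_B.append((Rb, Tb))

R_est, t_est = register_rigid(poses_A, poses_B)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

With noisy per-camera estimates the same SVD projection returns the closest rotation to the chordal mean, which is why it is preferred over naively averaging rotation matrices.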

## 4 Experimental Results

In this section we present results on both *organised* and *unorganised* image data sets. For our experiments, we used an Intel i7 quad-core machine with 16 GB RAM and a GTX 580 graphics card. We first present our result on an *organised* image set acquired at Hampi (see Fig. 3). The data set consists of 2337 images covering 4 temple buildings, whose physical footprint covers an area of approximately \(160\times 94\) m.\(^{2}\)

For reconstructing the images in each individual set we use VSFM [1] as the iterative bundler, and we merge the resulting reconstructions into a common frame of reference using the method described in Sect. 3. Figure 6a shows our reconstruction after registration superimposed on a view from Google Earth. As we do not have ground truth for such real-world data, we use the output of VSFM applied to the entire data set with all-pairs matching as our baseline reconstruction; all-pairs matching is necessary here because the preemptive matching scheme suggested in [1] fails on this data set. Figure 6b shows the comparison, where the red point cloud is obtained from VSFM and the green points are obtained using our method. VSFM took 5760 min to reconstruct the data set using all-pairs matching, whereas our method takes 2578 min (also using all-pairs matching). The computation time of our method is calculated as the time required to reconstruct the largest component plus the total time for registration, since the components are reconstructed in parallel.

We also compare the 3D camera rotations and positions (i.e. translations) obtained by our method against the ‘ground truth’ provided by VSFM. As the two sets of camera estimates are in different frames of reference and may also differ in scale, we align them in a common Euclidean reference frame by computing the best similarity transformation (a Euclidean transformation plus a global scale) between them. The results of this comparison are presented in Table 1. While the rotation error is in absolute degrees, the overall scale of the reconstruction is arbitrary, so we present the errors in translation (position) estimates as a fraction of the graph diameter of the full reconstruction. As can be seen, apart from being much faster than VSFM, our result is qualitatively similar to that obtained by VSFM.
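The similarity alignment used for this comparison can be sketched with the closed-form least-squares fit (Umeyama-style, via the SVD of the cross-covariance); this is our illustration of the evaluation step on made-up camera positions, not necessarily the exact solver used in the experiments.

```python
import numpy as np

def fit_similarity(src, dst):
    """Closed-form least-squares similarity transform (Umeyama):
    returns (s, R, t) minimising ||dst - (s * R @ src + t)||^2 over
    rotations R, scale s > 0 and translation t.
    src, dst : (n, 3) arrays of corresponding camera positions."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, S, Vt = np.linalg.svd(cov)
    sign = np.sign(np.linalg.det(U @ Vt))    # guard against reflections
    D = np.array([1.0, 1.0, sign])
    R = U @ np.diag(D) @ Vt
    var_src = ((src - mu_s) ** 2).sum() / len(src)
    s = (S * D).sum() / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic camera positions related by a made-up ground-truth similarity
rng = np.random.default_rng(2)
src = rng.normal(size=(30, 3))
angle = 0.3
R_gt = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                 [np.sin(angle), np.cos(angle), 0.0],
                 [0.0, 0.0, 1.0]])
s_gt, t_gt = 2.5, np.array([1.0, -2.0, 0.3])
dst = s_gt * src @ R_gt.T + t_gt

s, R, t = fit_similarity(src, dst)
print(round(s, 6), np.allclose(R, R_gt), np.allclose(t, t_gt))
```

Once the two camera sets are brought into this common frame, rotation errors can be read off in degrees while position errors are normalised by the reconstruction's graph diameter, as in Table 1.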

**Table 1.** Comparison of total reconstruction by VSFM against individual reconstructions registered by our method

| Error entity | Error unit | Mean error | Median error | RMS error |
|---|---|---|---|---|
| Camera rotation | Degrees | 1.93 | 1.57 | 2.66 |
| Camera translation | Ratio of graph diameter | 0.012 | 0.0091 | 0.041 |

**Table 2.** Data sets used in our experiments

| Data set | No. of images | No. of components | No. of images reconstructed |
|---|---|---|---|
| Rome | 13783 | 24 | 10534 |
| Hampi | 3017 | 7 | 2584 |
| St Peter’s Basilica | 1275 | 5 | 1236 |
| Colosseum | 1164 | 3 | 1032 |

**Table 3.** Time statistics of our method on different data sets compared with VSFM

| Data set | Match graph creation using vocabulary tree (mins) | Pairwise matching (mins) | Reconstruction and registration (mins) | Total time by us (mins) | Pairwise matching by VSFM (mins) | Reconstruction by VSFM (mins) | Total time by VSFM (mins) |
|---|---|---|---|---|---|---|---|
| Rome | 768 | 502 | \(\mathbf {27}\) | \(\mathbf {1297}\) | N/A | N/A | N/A |
| Hampi | 481 | 424 | \(\mathbf {8}\) | \(\mathbf {913}\) | 9522 | 59 | 9581 |
| St Peter’s Basilica | 98 | 22 | \(\mathbf {4}\) | \(\mathbf {124}\) | 1385 | 10 | 1395 |
| Colosseum | 83 | 24 | \(\mathbf {3}\) | \(\mathbf {110}\) | 1394 | 9 | 1403 |

We now present results on *unorganised* data sets downloaded from the Internet. We downloaded approximately 13K images of Central Rome from Flickr and tested our algorithm on this data set. Figure 9 shows the reconstruction using our method; this data set could not be reconstructed using VSFM with our hardware resources. Figure 9d shows the reconstruction overlaid on Google Maps. We also ran our algorithm on the St Peter’s Basilica and Colosseum data sets obtained from [1], the results of which are shown in Figs. 10 and 11 respectively. Table 2 shows the total number of connected components and the total number of images reconstructed for each of the data sets. The time statistics of our algorithm for the different data sets are presented in Table 3. In most cases we had to use all-pairs matching in VSFM, as preemptive matching was causing the reconstruction to break in the middle, a failure also reported in [1]; in our case we used the initial match graph obtained from the vocabulary tree. It is evident that most of the time is consumed by matching. The reconstruction and total registration time taken by our approach is significantly less than the reconstruction time of VSFM, an overall speed-up of at least one order of magnitude. We also note that iterative bundle adjustment schemes often result in broken reconstructions even within a component. Table 4 presents statistics of such breaks; in all such cases we have been able to register the broken components using pairwise epipolar geometry on the connecting images in the broken components, identified automatically from the match graph. Finally, we remark in passing that we also experimented with the method presented in [17] using the authors’ code. On the Hampi data set, [17] failed to reconstruct in more than 24 h. While [17] is faster than the original bundle adjustment, its runtime complexity is far inferior to the approximately \(O(n)\) complexity of VSFM. In an additional test on a 300-image subset of the Hampi data set, [17] was 10 times slower than VSFM and produced a significantly poorer result.

**Table 4.** Statistics of breaks in reconstruction of the data sets

| Data set | No. of components | No. of components broken by VSFM | Total no. of components including broken sub-components |
|---|---|---|---|
| Rome | 24 | 5 | 33 |
| Hampi | 7 | 2 | 9 |
| St Peter’s Basilica | 5 | 1 | 6 |
| Colosseum | 3 | 2 | 6 |

## 5 Conclusion

We have presented a new pipeline for automatic 3D reconstruction from a large collection of images. We have demonstrated the utility of partitioning the images into clusters that can be independently and reliably reconstructed and then aligned in a global frame of reference. Results on a number of large data sets demonstrate that our method yields large speed improvements compared to the state of the art without any significant loss of accuracy.

## Footnotes

- 1.
In this section, we use lower case letters to denote individual cameras and upper case letters to denote groups of reconstructed cameras.

- 2.
We point out here that these buildings are far more complex than urban buildings and even heritage sites such as the Notre Dame cathedral reconstructed in [4]. Specifically, these temples have fluted pillars, are replete with ornate carvings and sculptures, and feature repeated patterns as well as layered cupolas. The complexity of these structures can also be judged from the building footprint as seen in the plan view presented in Fig. 6a.

## Notes

### Acknowledgement

The authors affiliated with the Indian Institute of Technology Delhi (BB, SP and SB) acknowledge the support of the Department of Science and Technology, Government of India under the Indian Digital Heritage programme.

## References

- 1. Wu, C.: Towards linear-time incremental structure from motion. In: Proceedings of the International Conference on 3D Vision, 3DV 2013, pp. 127–134 (2013)
- 2. Agarwal, S., Snavely, N., Seitz, S.M., Szeliski, R.: Bundle adjustment in the large. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 29–42. Springer, Heidelberg (2010)
- 3. Snavely, N., Seitz, S., Szeliski, R.: Modeling the world from internet photo collections. Int. J. Comput. Vis. **80**, 189–210 (2008)
- 4. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: exploring photo collections in 3D. In: Proceedings of ACM SIGGRAPH, pp. 835–846 (2006)
- 5. Crandall, D.J., Owens, A., Snavely, N., Huttenlocher, D.P.: Discrete-continuous optimization for large-scale structure from motion. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3001–3008 (2011)
- 6. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment – a modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) ICCV-WS 1999. LNCS, vol. 1883, pp. 298–372. Springer, Heidelberg (2000)
- 7. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. **22**, 888–905 (2000)
- 8. Wu, C., Agarwal, S., Curless, B., Seitz, S.: Multicore bundle adjustment. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3057–3064 (2011)
- 9. Frahm, J.-M., Fite-Georgel, P., Gallup, D., Johnson, T., Raguram, R., Wu, C., Jen, Y.-H., Dunn, E., Clipp, B., Lazebnik, S., Pollefeys, M.: Building Rome on a cloudless day. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 368–381. Springer, Heidelberg (2010)
- 10. Raguram, R., Wu, C., Frahm, J.-M., Lazebnik, S.: Modeling and recognition of landmark image collections using iconic scene graphs. Int. J. Comput. Vis. **95**, 213–239 (2011)
- 11. Agarwal, S., Snavely, N., Simon, I., Seitz, S., Szeliski, R.: Building Rome in a day. In: Proceedings of the International Conference on Computer Vision, pp. 72–79 (2009)
- 12. Snavely, N., Seitz, S., Szeliski, R.: Skeletal graphs for efficient structure from motion. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
- 13. Havlena, M., Torii, A., Pajdla, T.: Efficient structure from motion by graph optimization. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 100–113. Springer, Heidelberg (2010)
- 14. Crandall, D.J., Owens, A., Snavely, N., Huttenlocher, D.P.: SfM with MRFs: discrete-continuous optimization for large-scale reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. **35**, 2841–2853 (2013)
- 15. Moulon, P., Monasse, P., Marlet, R.: Global fusion of relative motions for robust, accurate and scalable structure from motion. In: Proceedings of IEEE International Conference on Computer Vision, pp. 3248–3255 (2013)
- 16. Jiang, N., Cui, Z., Tan, P.: A global linear method for camera pose registration. In: Proceedings of IEEE International Conference on Computer Vision, pp. 481–488 (2013)
- 17. Farenzena, M., Fusiello, A., Gherardi, R.: Structure-and-motion pipeline on a hierarchical cluster tree. In: Proceedings of IEEE International Conference on Computer Vision Workshop on 3-D Digital Imaging and Modeling, pp. 1489–1496 (2009)
- 18. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2161–2168 (2006)
- 19. Wu, C.: SiftGPU: a GPU implementation of scale invariant feature transform (SIFT) (2007). http://cs.unc.edu/~ccwu/siftgpu
- 20. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, New York (2004)
- 21. Govindu, V.M.: Lie-algebraic averaging for globally consistent motion estimation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2004)
- 22. Govindu, V.M.: Combining two-view constraints for motion estimation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 218–225 (2001)
- 23. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. **60**, 91–110 (2004)