Batch image alignment via subspace recovery based on alternative sparsity pursuit

The problem of robust alignment of batches of images can be formulated as a low-rank matrix optimization problem, relying on the similarity of well-aligned images. Going further, observing that the images to be aligned are sampled from a union of low-rank subspaces, we propose a new method based on subspace recovery techniques to provide more robust and accurate alignment. The proposed method seeks a set of domain transformations which are applied to the unaligned images so that the resulting images are made as similar as possible. The resulting optimization problem can be linearized as a series of convex optimization problems which can be solved by alternative sparsity pursuit techniques. Compared to existing methods like robust alignment by sparse and low-rank models, the proposed method can more effectively solve the batch image alignment problem, and extract more similar structures from the misaligned images.

be found online. These increasing data have the potential for information mining, but also raise some tough issues for preprocessing. Misalignment of images is a common problem in many image data sets used in computer vision and machine learning applications. To deal with this batch image alignment problem, one possible solution is to seek a group of transformations that adjust the unaligned images according to similarity or other measures [1,2]. A problem is that such methods are not robust enough to handle the corruption or illumination variation which often occurs in real-world applications.
Clearly, if one finds a group of optimal transformations and applies them to the unaligned images, the resulting aligned images will be very similar. If these images are vectorized and arranged as columns of a matrix, the constructed matrix will ideally be of low column rank. Since partial corruption or occlusion will affect the low-rank property, a method called robust alignment by sparse and low-rank decomposition (RASL) [3] was proposed to handle these issues, based on low-rank models. These have recently shown strength in many fields such as signal recovery and dimension reduction [4]. The core of a low-rank model is that high-dimensional data, such as images and video sequences, are drawn from low-dimensional structures which lie in low-rank subspaces [5]. This idea is applied to batch image alignment by treating the images as samples from the low-rank subspaces.
Since linear subspaces are embedded in a high-dimensional space [6], it is possible to seek the underlying structures for a batch of images by subspace recovery. However, in practice, high-dimensional data are seldom drawn from a single low-rank subspace; it is more reasonable to expect that high-dimensional data are drawn from several low-rank subspaces rather than just from one. Based on this idea, we propose a method that considers the unaligned images to be lying in a union of low-rank subspaces. Specifically, each aligned image is sampled from one of the union of subspaces, and can be represented as a linear combination of other images in the same subspace [7]. Further consideration of the sparse model in linear subspaces and high-dimensional data analysis [6,8] leads us to model the subspace recovery problem using a sparse representation.
In summary, in this paper, we propose a new method for batch image alignment based on seeking a set of optimal transformations via a subspace recovery technique. The proposed method is formulated as an optimization problem which can be approximately solved by linearization and alternative sparsity pursuit. After obtaining the optimal solution, we can recover the underlying structures of a batch of images to deal with misalignment, and remove partial corruption and occlusion.

Problem formulation
In this section, we formulate the problem of batch image alignment by modeling unaligned images and sparse errors. The aim is to search for a set of transformations and to recover the low-dimensional structures embedded in high-dimensional space.

Unaligned image model
Given a set of unaligned images $I_1, \ldots, I_n$ of the same object, we assume that they can be transformed to similar, well-aligned images by a set of domain transformations $\tau_1, \ldots, \tau_n$. Stacking the transformed images as vectors, we can construct the matrix:
$$D \circ \tau = [\mathrm{vec}(I_1^0) \,|\, \cdots \,|\, \mathrm{vec}(I_n^0)] \in \mathbb{R}^{m \times n} \quad (1)$$
where $m$ is the number of pixels per image, $I_i^0 = I_i \circ \tau_i$, $i = 1, \ldots, n$, is a well-aligned version of image $i$, and the operator $\circ$ denotes the transformation applied to produce it. Pixel $(x, y)$ of the transformed image $I_i^0$ is given by
$$I_i^0(x, y) = (I_i \circ \tau_i)(x, y) = I_i(\tau_i(x, y)) \quad (2)$$
Since the aligned images are similar, they can be treated as samples from a union of low-dimensional subspaces. Assuming a sufficient sampling density, each image can be represented as a linear combination of the other images from the same subspace [9]. As shown in Fig. 1, compared with the dimension of the entire union of subspaces (i.e., several subspaces of a high-dimensional space), the dimension of a single subspace is so small that the representation of each image is sparse [10]. Thus, we can use the model:
$$D \circ \tau = A, \quad A = AW \quad (3)$$
where $W \in \mathbb{R}^{n \times n}$ is a sparse coefficient matrix and $A \in \mathbb{R}^{m \times n}$ is a self-represented matrix. We may then formulate the batch image alignment problem as
$$\min_{A, W, \tau} \|W\|_0 \quad \text{s.t.} \quad D \circ \tau = A, \; A = AW \quad (4)$$
where $\|\cdot\|_0$ represents the $\ell_0$-norm, which counts the number of nonzero entries of the matrix $W$.
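The self-representation $A = AW$ can be illustrated on synthetic data. The following is a minimal sketch, not the proposed algorithm: the dimensions, the two random subspaces, and the use of least squares (which gives a minimum-norm rather than a sparsest representation) are all assumptions made for illustration. Because the subspace labels are known here, each column is expressed only through the other columns of its own subspace, which yields the block-diagonal structure of the ideal coefficient matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n_per = 50, 3, 8  # ambient dim, subspace dim, samples per subspace (hypothetical)

# Two random d-dimensional subspaces of R^m, each sampled n_per times.
bases = [np.linalg.qr(rng.standard_normal((m, d)))[0] for _ in range(2)]
A = np.hstack([B @ rng.standard_normal((d, n_per)) for B in bases])
n = A.shape[1]
labels = np.repeat([0, 1], n_per)

# Build W column by column: express a_j using only the OTHER columns
# that lie in the same subspace (labels are known in this toy setting).
W = np.zeros((n, n))
for j in range(n):
    idx = np.flatnonzero((labels == labels[j]) & (np.arange(n) != j))
    coef, *_ = np.linalg.lstsq(A[:, idx], A[:, j], rcond=None)
    W[idx, j] = coef

residual = np.linalg.norm(A - A @ W)          # close to zero: A = AW holds
nonzeros_per_column = np.count_nonzero(W, axis=0)
print(residual, nonzeros_per_column.max(), "of", n)
```

In the alignment setting the subspace memberships are unknown, which is why the sparsity of $W$ has to be pursued by optimization rather than constructed directly as above.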

Sparse error model
In general, partial corruption and occlusion may exist which will disrupt the low-dimensional subspaces. Since such errors usually occur in a small region of an image and have arbitrarily large magnitudes (especially for face images), these errors can be modelled as sparse errors [11]. In order to separate them from the well-aligned images, we modify Eq. (4) to
$$\min_{A, W, E, \tau} \|W\|_0 + \lambda \|E\|_0 \quad \text{s.t.} \quad D \circ \tau = A + E, \; A = AW \quad (5)$$
where $E \in \mathbb{R}^{m \times n}$ is the sparse error matrix and $\lambda$ weights the sparsity of the errors against that of the coefficients. Our objective is to reconstruct $A$, distributed over a union of low-rank subspaces, and to handle the influence of sparse errors.
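The effect of sparse errors on the low-dimensional structure can be checked numerically. The sketch below is illustrative only; the matrix sizes, the 10% corruption rate, and the error magnitude are assumptions. It builds a rank-2 matrix $A$, adds a sparse matrix $E$ with large entries, and compares the numerical ranks of $A$ and $D = A + E$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 60, 12, 2  # pixels, images, true rank (hypothetical sizes)

# Clean data: columns drawn from a single r-dimensional subspace.
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Sparse errors: about 10% of the entries, with large magnitude.
mask = rng.random((m, n)) < 0.10
E = np.zeros((m, n))
E[mask] = 10.0 * rng.standard_normal(int(mask.sum()))

D = A + E
rank_clean = np.linalg.matrix_rank(A)
rank_corrupt = np.linalg.matrix_rank(D)
print("corrupted fraction:", mask.mean())
print("rank of A:", rank_clean, "rank of A + E:", rank_corrupt)
```

Even though only a small fraction of the entries is corrupted, the rank of the observed matrix increases, which is why $E$ is modelled and separated explicitly.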

Solution via iterative linearization and alternative sparsity pursuit
In this section, we exploit an iterative scheme [3] to obtain a practical solution to the batch image alignment minimisation problem in Eq. (5).

Convex relaxation
The optimization problem in Eq. (5) is nonconvex and NP-hard because of the $\ell_0$-norm. Fortunately, sparse representation and compressed sensing theory shows that it can be approximately solved by replacing the $\ell_0$-norm by the $\ell_1$-norm [4,12,13]. Doing so, Eq. (5) becomes:
$$\min_{A, W, E, \tau} \|W\|_1 + \lambda \|E\|_1 \quad \text{s.t.} \quad D \circ \tau = A + E, \; A = AW \quad (6)$$
Since Gaussian noise always exists in real images, Eq. (6) can be relaxed to tolerate it to some extent by bounding the residual, measured in the Frobenius norm $\|\cdot\|_F$, by a tolerable noise level $\varepsilon$ (Eq. (7)).

Problem linearization
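The nonlinearity of the constraint $D \circ \tau = A + E$ makes the solution of Eq. (6) intractable. In practice, we assume that the change produced by $\tau$ is small enough that we can linearize about the current estimate of $\tau$ to approximate the constraint. Each transformation $\tau_i$ (e.g., an affine transformation) can be represented by a vector of $p$ parameters [14], yielding $\tau = [\tau_1 | \cdots | \tau_n] \in \mathbb{R}^{p \times n}$. Specifically, if initial transformations $\tau$ are known, we can use the first-order approximation
$$D \circ (\tau + \Delta\tau) \approx D \circ \tau + \sum_{i=1}^{n} J_i \, \Delta\tau \, e_i e_i^{T}$$
where $J_i = \frac{\partial}{\partial \zeta} \mathrm{vec}(I_i \circ \zeta) \big|_{\zeta = \tau_i} \in \mathbb{R}^{m \times p}$ is the Jacobian of the $i$-th image with respect to its transformation parameters and $e_i$ denotes the $i$-th standard basis vector of $\mathbb{R}^n$. The problem then becomes
$$\min_{A, W, E, \Delta\tau} \|W\|_1 + \lambda \|E\|_1 \quad \text{s.t.} \quad D \circ \tau + \sum_{i=1}^{n} J_i \, \Delta\tau \, e_i e_i^{T} = A + E, \; A = AW \quad (8)$$
In order to obtain the approximate solution to Eq. (6), we repeatedly linearize about the current estimate of $\tau$ and solve a series of optimization problems using Eq. (8). In other words, we seek a small change in $\tau$ in each iteration, to gradually approximate the correct transformations. In this way, we can obtain approximate transformations [15,16] to recover the underlying subspaces.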
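The quality of the first-order approximation behind the linearization can be checked on a toy example. The sketch below is illustrative only: the synthetic Gaussian-blob image, the restriction to pure translations ($p = 2$), and the finite-difference Jacobian are assumptions, not the implementation used in the experiments. It compares the exactly warped image with the linearized prediction for a small parameter change.

```python
import numpy as np

def warp(tau, size=32, sigma=5.0):
    """Vectorized image I∘τ for a translation τ = (tx, ty).

    The 'image' is a synthetic Gaussian blob, so warping reduces to
    evaluating the blob at the translated pixel coordinates.
    """
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    tx, ty = tau
    r2 = (xs + tx - size / 2) ** 2 + (ys + ty - size / 2) ** 2
    return np.exp(-r2 / (2.0 * sigma ** 2)).ravel()

def jacobian(tau, h=1e-4):
    """Central finite-difference Jacobian of vec(I∘τ), shape (m, p)."""
    tau = np.asarray(tau, dtype=float)
    cols = []
    for k in range(tau.size):
        step = np.zeros_like(tau)
        step[k] = h
        cols.append((warp(tau + step) - warp(tau - step)) / (2.0 * h))
    return np.stack(cols, axis=1)

tau = np.array([1.0, -2.0])    # current estimate (hypothetical)
dtau = np.array([0.3, 0.2])    # small update (hypothetical)

J = jacobian(tau)
exact = warp(tau + dtau)
linear = warp(tau) + J @ dtau  # first-order prediction

lin_err = np.linalg.norm(exact - linear)
change = np.linalg.norm(exact - warp(tau))
print("linearization error relative to image change:", lin_err / change)
```

The approximation is only accurate for small updates, which is why the linearization is repeated about each new estimate of $\tau$.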
The detailed iterative linearization procedure to solve the batch image alignment problem is summarized in Algorithm 1. Iteration stops when the difference between the current objective function and the previous one meets a predefined stopping criterion.

Solution for inner loop by alternative sparsity pursuit
In the linearized image alignment problem, a key step is to find the solution to the convex optimization subproblem in Eq. (8) in Step 3, the inner loop of Algorithm 1. The recently developed alternating direction method (ADM) and linearized alternating direction method (LADM) can be applied to solve such problems quickly and effectively [17,18]. Before using the ADM and LADM, the augmented Lagrange multiplier (ALM) method [19] is applied to the original problem. Firstly, we define:
In the ADM method, the unknowns in the augmented Lagrangian function are iteratively minimized one by one: in other words, the sparsity of $W$ and that of $E$ are pursued alternately until convergence [7]. The solution for $A_{k+1}$ after one iteration follows directly from these iterations. Secondly, when updating $W$ and $E$, the Lagrangian function can be rewritten by taking the constraints in Eq. (8) into account, so that $W$ and $E$ can be updated alternately using Eqs. (15) and (16), where $\bar{W} = I - W$. By linearizing the quadratic terms in Eqs. (15) and (16), we can obtain approximate solutions for $W$ and $E$, in which the conditions $\eta_1 \geq \|A\|_2^2$ and $\eta_2 \geq \|\bar{W}\|_2^2$ guarantee that the solution generated by LADM converges to a KKT (Karush-Kuhn-Tucker) point of Eq. (8) [20]. $\Gamma_\alpha(\cdot)$ is a soft-thresholding operator defined as
$$\Gamma_\alpha(x) = \mathrm{sgn}(x) \max(|x| - \alpha, 0)$$
where $\mathrm{sgn}(\cdot)$ represents the sign function. When $\Gamma_\alpha(x)$ operates on a matrix, it acts element-wise. Finally, the solution to $\Delta\tau$ in Eq. (11) is easily obtained. The Lagrange multiplier matrix $Y$ and penalty parameter $\mu$ are updated following Eq. (12). The complete procedure for the inner loop of Algorithm 1 using alternative sparsity pursuit is summarized in Algorithm 2.
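The soft-thresholding operator and the role of the linearization constant can be sketched on a generic problem. The code below is not the update rule of Algorithm 2. It applies the same idea (linearize the quadratic term, then soft-threshold) to the standard $\ell_1$-regularized least-squares problem $\min_w \alpha\|w\|_1 + \frac{1}{2}\|Aw - b\|_2^2$, i.e., the iterative shrinkage-thresholding algorithm (ISTA). The function names, problem sizes, and data are assumptions. Choosing $\eta \geq \|A\|_2^2$ here plays the same role as the conditions on $\eta_1$ and $\eta_2$ above.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Element-wise soft-thresholding: sgn(x) * max(|x| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista(A, b, alpha, iters=200):
    """Minimize alpha*||w||_1 + 0.5*||A w - b||_2^2 by linearized steps."""
    eta = np.linalg.norm(A, 2) ** 2  # squared spectral norm of A
    w = np.zeros(A.shape[1])
    history = []
    for _ in range(iters):
        grad = A.T @ (A @ w - b)      # gradient of the quadratic term
        w = soft_threshold(w - grad / eta, alpha / eta)
        history.append(alpha * np.abs(w).sum() + 0.5 * np.sum((A @ w - b) ** 2))
    return w, history

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
w_true = np.zeros(60)
w_true[[3, 17, 42]] = [2.0, -1.5, 1.0]  # sparse ground truth (hypothetical)
b = A @ w_true

w_hat, history = ista(A, b, alpha=0.1)
print("objective: first", history[0], "last", history[-1])
```

With this step size the objective value does not increase from one iteration to the next, which is the practical benefit of bounding $\eta$ from below by the squared spectral norm.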

Experimental results and discussion
In this section, we verify the proposed method on several data sets, including face images, handwritten digits, and video sequences. In all experiments, we select the target regions from unaligned images manually, or by using object detectors (such as a face detector). These target regions are preprocessed to a uniform size, and used as the original unaligned images forming the input to our algorithm.

Robustness to sparse errors
In an experiment on images of a dummy head which contain sparse errors including corruption and occlusion [3], the correctness and robustness of the proposed method are illustrated in Fig. 2. In this experiment, the input images are the target regions to align. After alignment, we achieve well-aligned images, and reconstruct the underlying structures shown in Figs. 2(b) and 2(c). The averages of the original, the aligned, and the reconstructed images are shown in Fig. 2(e). These results demonstrate that the set of transformations found can successfully deal with misalignments. Moreover, the sparse errors can be separated by recovering the underlying structures from the union of subspaces. In this experiment, since we do not know which subspace each image belongs to, we cannot arrange the images from the same subspaces together in the data matrix. This leads to a difference between the structure of the estimated coefficient matrix Ŵ in this experiment, shown in Fig. 3, and the ideal structure in Fig. 1. However, each column of this coefficient matrix still has many very small elements, which reveals the sparsity of the self-representation of the reconstructed images and supports the reasoning behind the proposed model.

Algorithm 2 Alternative sparsity pursuit (inner loop)
Input: A0 ∈ R m×n , W0 ∈ R n×n , E0 ∈ R m×n , ∆τ0 ∈ R p×n , J i ∈ R m×p for i = 1, . . . , n, λ, ρ, η1, η2
while not converged do
  Step 1: update A
  Step 2: update W
  Step 3: update E
  Step 4: update ∆τ
end while

Face image alignment
To further verify the efficacy of the proposed method, we carried out an experiment on more challenging natural face images from the Labeled Faces in the Wild (LFW) database [21]. These are real-world face images with uncontrolled misalignments, under varying illumination. About 35 face images of each person were used in the experiment. The unaligned face regions were used as input images. As shown in Fig. 4, clearer average faces were obtained after alignment with the proposed method.
Fig. 3 Estimated coefficient matrix Ŵ in the dummy head experiment.
Since we have no ground truth for these images, we evaluate the experimental results according to the similarity of the images: after alignment, the images should be more similar. We thus measure image similarity using peak signal to noise ratio (PSNR) and structural similarity index (SSIM) [22,23]. The mean PSNR and SSIM values for images of each subject are shown in Fig. 4, while the mean PSNR and SSIM values for all subjects in the LFW database are given in Tables 1 and 2. They show that the proposed method can reconstruct more similar and general structures from the high-dimensional data than the RASL method [3]. These results show the strength of the proposed method for batch image alignment.
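The PSNR-based similarity measure can be sketched as follows. The PSNR formula is standard; averaging it over all image pairs of a subject is an assumption made for this illustration, since the exact averaging protocol is not spelled out above, and the function names and synthetic data are hypothetical.

```python
import numpy as np
from itertools import combinations

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def mean_pairwise_psnr(images, peak=255.0):
    """Average PSNR over all unordered pairs (assumed protocol)."""
    vals = [psnr(x, y, peak) for x, y in combinations(images, 2)]
    return float(np.mean(vals))

# Synthetic check: noisy copies of one image versus shifted noisy copies.
rng = np.random.default_rng(0)
base = rng.uniform(0, 255, size=(32, 32))
aligned = [base + rng.normal(0, 2.0, size=base.shape) for _ in range(5)]
misaligned = [np.roll(base, shift=3 * k, axis=1) + rng.normal(0, 2.0, size=base.shape)
              for k in range(5)]

print("aligned:   ", mean_pairwise_psnr(aligned))
print("misaligned:", mean_pairwise_psnr(misaligned))
```

A higher mean value indicates that the images in the set are more similar to each other, which is the sense in which the measure is used here in the absence of ground truth.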
We also validated the proposed method using real face images from a video sequence. The video sequence consists of 140 frames of Al Gore talking [3]. We sampled one of every 7 frames; the 20 sampled images and their results after alignment are shown in Fig. 5. As Figs. 5(b) and 5(c) show, the proposed method successfully aligns the speaker. The estimated coefficient matrix for this experiment is shown in Fig. 5(d); it is similar to the ideal one in Fig. 1. This result shows that the proposed method works well with video sequence data. Since adjacent frames from a video sequence are highly correlated, they are drawn from the same subspace with high probability. In contrast, if a frame is far from the current one, the structure of its subspace may differ. This estimated coefficient matrix further supports the use of a model based on a union of subspaces in this task. The results in Tables 3 and 4 show that the proposed method is better at processing video sequences than RASL. Based on the results of the above experiments, we conclude that the proposed method outperforms RASL. The RASL method models data using robust principal component analysis (RPCA) [4], which assumes that data are drawn from a single subspace [5]. Unlike RASL, the proposed method reconstructs data from a union of subspaces, which enables it to describe the structure of the data more accurately, leading to better results.

Handwritten digit image alignment
A further kind of data set was used to verify the proposed method. It comprises handwritten digits, which are widely used in machine learning algorithms [24]. The images of handwritten digits were taken from the MNIST database [24]. We experimented on 100 images of the digit "3". Results achieved by the proposed method and RASL are shown in Fig. 6 and Tables 5 and 6. These results again allow us to conclude that the proposed method leads to better image alignment results.
(e) by our proposed method. The red circles mark some obvious differences between the two methods, which support the conclusion that our proposed method is more accurate.

Conclusions
In this paper, a new method for batch image alignment has been proposed which can handle sparse errors. Several experiments have verified the robustness of the proposed method, as well as its effectiveness and superiority. Compared to existing methods, the proposed method is better at extracting the general underlying structures from high-dimensional data with misalignment and sparse errors. It could readily be extended to deal with 3D structures or much higher-dimensional data; this will be studied in our future work.