1 Introduction

Smart Cities are evolving along a number of paths, such as social services and governance, water and power networks, and healthcare and transportation systems. With the increasing number of vehicles, buildings and citizens, many studies have been devoted to developing extensive big data processing and analysis methods for modeling human behavior, urban systems and so on, while new techniques, such as memetic computing, have been adopted for these challenging tasks. Among the different real-world areas of smart cities, video surveillance is indispensable for guaranteeing public security, especially for traffic and safety management. As one of the critical learning tasks in video surveillance, Person Re-Identification (Re-ID) aims to track and identify a given pedestrian in a multi-camera scene under various challenging conditions [1,2,3,4,5], such as changes in camera view, person pose and illumination, background clutter, and occlusion. In general, most current state-of-the-art Person Re-ID methods first adopt a CNN-based detector to obtain cropped pedestrian images, and then learn a specific distance metric for ranking, where the distance is usually measured by the similarity between two low-dimensional image features extracted by a CNN framework [4, 5]. For example, the work in [6] developed FastReID to provide strong baselines capable of achieving state-of-the-art performance, where pedestrian images are first cropped from video sequences by a human labeler or a CNN-based detector, and then various CNN backbones with certain aggregation strategies and loss functions are trained to extract pedestrian image features for the distance metric. In addition, to handle real-world video Re-ID problems in surveillance applications, [1] developed a video-based Person Re-ID method by exploring useful information captured from sequences. In detail, it first crops the image sequences belonging to a certain pedestrian via a detector. It then trains a special aggregation network with these sequential data to capture both spatial and temporal information for feature extraction, and the extracted features are used in the distance metric to realize the Re-ID task.

However, the distance-based ranking methods only focus on the pair-wise similarity within the training dataset, while the distribution of the whole dataset is neglected. In some real-world cases, e.g., when the pose of a pedestrian changes greatly, the pedestrian image with a front view may be quite different from the one with a back view. In such a case, the system may judge them as belonging to different pedestrians, since the distance-based ranking method only considers pair-wise similarity. However, if the similarity is measured by involving the distribution information or the intrinsic manifold, such a problem may be avoided, as the pedestrian images belonging to the same person but with different views lie on the same manifold. This is the main drawback of distance-based ranking methods. Obviously, a good Re-ID method should consider the image features as well as the intrinsic structure of the image database. To handle the above problem, a key issue is how to incorporate the data distribution or manifold information into the calculation of the similarity between any pair of images, so that the extracted features can provide more geometrical structure for guiding the distance-based ranking method. Motivated by this, in this paper we consider the graph embedding method, which takes the underlying structure into account for the distance metric.

The Graph Embedding (GE) methods [7,8,9,10] are developed to preserve the properties of a graph for feature extraction; these methods first construct a graph to approximate the geometrical structure of the data and then learn a projection matrix to map the high-dimensional dataset to a low-dimensional subspace while preserving the geometrical information. From this point of view, GE can naturally preserve the distribution of the dataset. Many GE methods have been proposed during the past decades in order to discover the intrinsic geometrical structure of the data manifold [11,12,13,14,15,16,17,18,19]. Typical methods include ISOMAP [20], Locally Linear Embedding (LLE) [21] and Laplacian Eigenmap (LE) [22]. However, these methods cannot handle newly arriving samples and hence suffer from the out-of-sample problem. To solve this problem, He et al. developed Locality Preserving Projections (LPP) [23] to project the original high-dimensional data into a low-dimensional subspace with a projection matrix. The out-of-sample problem can then be naturally handled, as the projection matrix can be applied to newly arriving samples. Cai et al. [24, 25] further pointed out that LPP is time-consuming, especially for solving the generalized eigenvalue problem when the data matrix is high-dimensional. They therefore proposed Spectral Regression (SR) to handle the above problem, in which they first calculate the low-dimensional embedding and then obtain the projection by solving a regression problem. Recently, Nie et al. [26] have analyzed in detail the equivalence relationship between LPP and SR, through which they proposed a more efficient and effective method, namely Unsupervised Large Graph Embedding (ULGE), for dimensionality reduction and subspace clustering [27, 28].

On the other hand, the graph neural network (GNN) has been one of the most popular topics during the past few years and has been proved effective for dealing with structured data [29,30,31,32]. For example, the works in [33, 34] have applied GCNs to the diagnosis problem of COVID-19. The goal of GNNs [29, 30] is to integrate the local geometrical structure and the data features in the convolutional layers, so that both the connectivity patterns and the feature attributes of graph-structured data can be preserved via graph convolution, significantly outperforming many state-of-the-art methods. However, most of these methods can only be applied to small-scale graphs due to computational limitations. Another type of GNN is the Spatial GNN [31, 32], which constructs a new feature vector for each vertex by aggregating the neighborhood information, but these methods need a predefined local graph and therefore cannot learn the global information. Our work can potentially be viewed as a way to make GNNs deal with large-scale graphs without local sampling.

In general, the above graph construction methods can characterize the geometrical structure of the data manifold well. However, these methods are all unsupervised, which means they do not utilize discriminative information, such as class labels or side information, that could enhance the clustering performance. Since the graph construction is unsupervised, it may happen that data points belonging to two different clusters or classes are connected. In such a case, the constructed graph tends to be incorrect and the learned graph embedding may contain mistakes. Another problem is that these methods are usually two-stage approaches, i.e., graph construction and subspace learning are separated, and the two stages do not share any information to improve the graph embedding. In practice, we hope the geometrical information learned by the graph embedding procedure can be used to enhance the locality-preserving ability of the graph construction, and the modified graph can in turn promote the learning of the graph embedding [35].

In this paper, to handle this problem, we develop a new spectral regression method for dimensionality reduction by integrating both graph construction and a manifold regularization term into the same framework. To handle the case that nearby data points from different clusters may be mis-connected, we adopt side information for graph construction, which imposes the constraint that two data points have no link when they clearly belong to different clusters. In addition, we further impose normalization and symmetry constraints to make the graph doubly-stochastic. As mentioned in [36], the doubly-stochastic property also makes the graph highly robust and less sensitive to the parameters. Benefiting from such a graph, we then incorporate graph construction and subspace learning into a unified loss term. Therefore, the subspace results can be utilized in the graph construction, and the updated graph can in turn incorporate discriminative information for graph embedding. Extensive simulations have been conducted on several benchmark datasets. The simulation results indicate that the proposed work achieves much better clustering and image retrieval performance than other methods.

The main contributions of the proposed work are as follows:

  1. 1)

    we propose a discriminative and doubly-stochastic graph for characterizing the geometrical structure of the data manifold. By taking side information into consideration, the proposed graph guarantees that data points from different clusters will not be mis-connected. In addition, the normalization and symmetry constraints make the graph doubly-stochastic so that it is highly robust and less sensitive to the parameters;

  2. 2)

    we incorporate graph construction and subspace learning into a unified loss term. Therefore, the subspace results can be utilized in the graph construction, and the updated graph can in turn incorporate discriminative information for graph embedding;

  3. 3)

    we also develop an iterative solution to handle the above optimization problem, and theoretical analysis shows its convergence;

  4. 4)

    we apply the proposed method to Person Re-Identification, and extensive simulation results verify its effectiveness.

This paper is structured as follows: in Section 2, we briefly review some related work on graph embedding; in Section 3, we give a detailed description of the proposed work; in Section 4, we conduct extensive simulations to show the effectiveness of the proposed work on Person Re-Identification; and the final conclusion is drawn in Section 5.

2 Related work

In this section, we first give some notations used in this work and then review LE, LPP, SR and ULGE.

2.1 Notations

Specifically, let \(X = \left\{ {x_{1}},{x_{2}}, \cdots ,{x_{l}} \right\} \in {R^{D \times l}}\) be the original high-dimensional dataset, where each xi belongs to a class \({c_{i}} \in \left\{ {1,2, \cdots ,c} \right\}\), li is the number of data points in the i th class and l is the total number of data points over all classes. We also denote by \(Y = \left\{ {{y_{1}},{y_{2}}, \cdots ,{y_{l}}} \right\} \in {R^{d \times l}}\) the low-dimensional representation of X. In the graph-based subspace learning framework, a similarity matrix is defined to measure the similarity among the data. In detail, denote \(G = \left( {V, E} \right)\) as the graph, where V is the vertex set of G representing the training set and E is the edge set of G associated with the weight matrix W that encodes the geometrical information. In addition, let \(L = D - W\) and \(\widetilde{L} = {D^{-1/2}}L{D^{-1/2}}\) be the graph Laplacian and normalized graph Laplacian matrices, which approximate the geometrical structure of the dataset, where D is a diagonal matrix satisfying \({D_{jj}} = \sum\nolimits_{i = 1}^{l} {{W_{ij}}}\).
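To make the notation concrete, the following minimal sketch (not part of the original formulation) builds a heat-kernel kNN similarity matrix W, the degree matrix D, and the two Laplacians defined above; the kNN rule and the kernel width sigma are illustrative assumptions, since this section does not fix a particular weighting scheme.

```python
import numpy as np

def knn_graph_laplacians(X, k=5, sigma=1.0):
    """Build a kNN similarity graph W on the columns of X (D x l), the degree
    matrix D, the Laplacian L = D - W and the normalized Laplacian
    L_tilde = D^{-1/2} L D^{-1/2} used throughout this section."""
    l = X.shape[1]
    # pairwise squared Euclidean distances between the samples (columns of X)
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    W = np.zeros((l, l))
    for i in range(l):
        idx = np.argsort(dist2[i])[1:k + 1]          # k nearest neighbours, skipping self
        W[i, idx] = np.exp(-dist2[i, idx] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)                           # make the graph symmetric
    deg = W.sum(axis=1)                              # D_jj = sum_i W_ij
    D = np.diag(deg)
    L = D - W                                        # graph Laplacian
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_tilde = d_inv_sqrt @ L @ d_inv_sqrt            # normalized graph Laplacian
    return W, D, L, L_tilde
```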

2.2 Review of LE, LPP and SR

The goal of LE is to calculate the low-dimensional representation as follows:

$$ {\min_{Y}}Tr\left( {{{\left( {YD{Y^{T}}} \right)}^{- 1}}YL{Y^{T}}} \right). $$
(1)

where the optimal Y is formed by the eigenvectors corresponding to the smallest eigenvalues of \(D^{-1}L\) or \(D^{-1/2}LD^{-1/2} = \widetilde{L}\). However, LE cannot calculate the graph embedding for newly arriving data points and hence cannot handle the out-of-sample problem. To solve this problem, LPP further assumes that the low-dimensional representation can be obtained by projecting the high-dimensional data, i.e., \(Y = V^{T}X\), where V is the projection matrix. It then calculates the projection matrix with the following objective function:

$$ {\min_{V}}Tr\left( {{{\left( {{V^{T}}\left( {XD{X^{T}}{\text{ + }}\alpha I} \right)V} \right)}^{- 1}}{V^{T}}XL{X^{T}}V} \right). $$
(2)

where I is an identity matrix introduced to avoid an ill-posed problem and α is a small value. The optimal V can be calculated by solving the generalized eigenvalue decomposition (GEVD) of \(\left( {XD{X^{T}} + \alpha I} \right)^{- 1}XL{X^{T}}\). The out-of-sample problem can then be naturally handled by projecting newly arriving data points into the low-dimensional embedding with the projection matrix. However, the computational cost of LPP is huge, given that \(XDX^{T}\) and \(XLX^{T}\) are dense matrices. To solve this problem, SR was developed, which first calculates the low-dimensional representation Y following (1) and then calculates the projection matrix by solving a regression problem as follows:

$$ {\min_{V}}\left\| {{V^{T}}X - Y} \right\|_{F}^{2}{\text{ + }}\alpha \left\| V \right\|_{F}^{2}. $$
(3)

The optimal solution of SR is \(V = \left( {X{X^{T}} + \alpha I} \right)^{- 1}X{Y^{T}}\), which can be efficiently solved by well-studied methods, such as LSQR. SR is more efficient than LPP since it only needs to calculate the GEVD of the sparser matrix \(D^{-1}L\). However, SR is not in general equivalent to LPP, and neither SR nor LPP can handle large-scale datasets. As pointed out in [26], when the degrees of the similarity matrix W are all equal to 1, i.e., D = I, so that the graph Laplacian matrix equals the normalized one, \(L = I - W = \widetilde{L}\), the optimal solution of SR becomes equivalent to that of LPP.
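A compact sketch of the two SR steps discussed above is given below, using dense NumPy operations for clarity; in practice sparse eigensolvers and LSQR would be used, and the function name and parameters here are illustrative assumptions rather than the published implementation.

```python
import numpy as np

def spectral_regression(X, W, d=10, alpha=0.01):
    """Two-step SR sketch: (i) compute the embedding Y from the graph following (1)
    (here via the normalized affinity), (ii) solve the ridge regression (3) for the
    projection V so that new samples can be embedded as V^T x."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    W_norm = (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    # top-d eigenvectors of the normalized affinity = bottom eigenvectors of L_tilde
    vals, vecs = np.linalg.eigh(W_norm)
    Y = vecs[:, -d:].T                                         # d x l embedding
    # ridge regression step of (3): V = (X X^T + alpha I)^{-1} X Y^T
    D_dim = X.shape[0]
    V = np.linalg.solve(X @ X.T + alpha * np.eye(D_dim), X @ Y.T)
    return Y, V
```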

2.3 Unsupervised large graph embedding (ULGE) with anchor graph construction

Though the conventional graph embedding methods have achieved satisfactory results, they cannot be extended to large-scale datasets because the computational complexities of both graph construction and graph embedding are not linear in the number of data points. To handle this problem, ULGE develops an efficient SR method for graph embedding based on anchor graph construction [37].

In general, the anchor graph method first seeks m anchors, where m ≪ n, and then calculates the weight matrix between the anchors and the data points. In detail, let \(A = \left \{ {{a_{1}},{a_{2}}, {\ldots } {a_{m}}} \right \} \in {R^{d \times m}}\) represent the set of anchor points and Z ∈ Rm×n the adjacency matrix, where each element Zij evaluates the similarity between xj and ai under the constraints Zij ≥ 0, \(\sum \nolimits _{i = 1}^{m} {{Z_{ij}}} = 1\). Then, the anchor graph W is constructed based on Z as follows:

$$ {W} = {Z^{T}}{{\Delta}^{- 1}}Z = S^{T}S \in {R^{n \times n}}. $$
(4)

where Δ ∈ Rm×m is a diagonal matrix satisfying \({{\Delta }_{ii}} = \sum \nolimits _{j = 1}^{n} {{Z_{ij}}} \), and S = Δ− 1/2Z ∈ Rm×n is the bilinear decomposition of W. It can easily be proved that W is doubly-stochastic and hence has a probabilistic meaning. Compared with the k NN graph, which needs O(n2k) operations to construct, the anchor graph construction is more efficient since it only needs O(m3 + nm2) computational complexity.

Finally, ULGE formulates the graph embedding Y from the eigenvectors corresponding to the largest eigenvalues of \(S^{T}S\) (equivalently, the top right singular vectors of S) and calculates the projection matrix by solving a regression problem as in (3). Since S ∈ Rm×n, the computational complexity of this decomposition is linear in n and much smaller than that of directly computing the GEVD of \(\widetilde {L}\).
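The sketch below illustrates the anchor-graph pipeline of (4) and the ULGE-style embedding step; it uses plain k-means to pick anchors and an ad-hoc kernel bandwidth, whereas ULGE itself relies on a more elaborate anchor selection, so this is only an approximation under those assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def anchor_graph_embedding(X, m=500, k=5, d=10, seed=0):
    """Anchor-graph sketch in the spirit of (4): pick m anchors, build the sparse
    affinity Z (m x n) between anchors and samples, form S = Delta^{-1/2} Z, and
    take the top-d right singular vectors of S as the embedding Y."""
    n = X.shape[1]
    # anchors via plain k-means on the samples (columns of X)
    anchors = KMeans(n_clusters=m, n_init=4, random_state=seed).fit(X.T).cluster_centers_.T
    # squared distances from every anchor to every sample (m x n)
    dist2 = (np.sum(anchors ** 2, axis=0)[:, None] + np.sum(X ** 2, axis=0)[None, :]
             - 2.0 * anchors.T @ X)
    Z = np.zeros((m, n))
    for j in range(n):
        idx = np.argsort(dist2[:, j])[:k]            # k closest anchors of x_j
        w = np.exp(-dist2[idx, j] / (dist2[idx, j].mean() + 1e-12))
        Z[idx, j] = w / w.sum()                      # columns sum to one (sum_i Z_ij = 1)
    delta = Z.sum(axis=1)                            # Delta_ii = sum_j Z_ij
    S = Z / np.sqrt(np.maximum(delta, 1e-12))[:, None]   # S = Delta^{-1/2} Z
    # W = S^T S is doubly-stochastic; its top eigenvectors are the right singular
    # vectors of S, so the n x n matrix W never has to be formed explicitly
    U, sig, Vt = svds(S, k=d)
    Y = Vt                                           # d x n embedding
    return Z, S, Y
```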

Notes:

Cai et al. [27, 38] recently developed a graph embedding method for subspace clustering, namely Large-scale Spectral Clustering (LSC), which is also based on anchor graph construction and shares a similar concept with ULGE. However, LSC only formulates the graph embedding Y from the eigenvectors corresponding to the largest eigenvalues of \(S^{T}S\) and then performs conventional k-means on Y to obtain the cluster labels of the dataset. Therefore, compared with ULGE, LSC cannot obtain the graph embedding for newly arriving data and thus cannot handle the out-of-sample problem. In other words, ULGE can be viewed as an extension of LSC.

3 Semi-supervised adaptive graph embedding (SAGE)

3.1 Motivation and problem formulations

The above graph construction methods can capture the geometrical information of the dataset well [26]. However, one problem is that they are all unsupervised and do not utilize the discriminative information provided by the class labels to enhance the clustering performance. A case in point is that if two data points xi and xj belong to different cluster structures, they should not share any common anchors, so that \({s_{i}^{T}}s_{j}=0\). An illustration is shown in Fig. 1.

Fig. 1
figure 1

Motivations: the conventional anchor graph construction is unsupervised and does not utilize the discriminative information provided by the class labels. This may cause nearby data points belonging to different classes to share common anchors, hence causing mis-connections between them. An ideal case should guarantee that there is no connection between any pair of data points that belong to different classes

To solve the above problem, it is natural to integrate side information into the graph construction so that samples from different clusters have no link. Here, we introduce a matrix T ∈ Rn×n, where Tij = 1 if xi and xj should not be connected in the graph and Tij = 0 otherwise. Then, we can constrain the graph weight matrix W with WijTij = 0 during optimization. As can be seen in the simulations, more and more such links are disconnected so that a clearer clustering structure can be observed. A small sketch of how T can be built from side information is given below.
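The following minimal sketch assembles the side-information matrix T from a partially labelled dataset; the helper names and the pair-extraction strategy are illustrative assumptions, not the exact procedure used in the experiments.

```python
import numpy as np

def cannot_link_matrix(n, cannot_link_pairs):
    """Side-information matrix T: T_ij = 1 when x_i and x_j are known to belong to
    different clusters (so W_ij is forced to zero), and T_ij = 0 otherwise."""
    T = np.zeros((n, n))
    for i, j in cannot_link_pairs:
        T[i, j] = 1.0
        T[j, i] = 1.0                    # keep T symmetric, matching W = W^T
    return T

def pairs_from_partial_labels(labels, known_idx):
    """Derive cannot-link pairs from the subset of labelled samples
    (e.g. the randomly selected 10% side information used in the simulations)."""
    pairs = []
    for a in range(len(known_idx)):
        for b in range(a + 1, len(known_idx)):
            i, j = known_idx[a], known_idx[b]
            if labels[i] != labels[j]:   # different known labels -> cannot link
                pairs.append((i, j))
    return pairs
```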

Another problem is that the graph construction and subspace learning are separated, so the two stages do not share any information to enhance the performance of graph embedding. To this end, we propose to simultaneously calculate the optimal graph matrix W, the bilinear decomposition S and the graph embedding Y as follows:

$$ \begin{aligned} W,S,Y =& {\arg_{W,S,Y}}\min \left\| {W - {S^{T}}S} \right\|_{F}^{2}{\text{ + }} \gamma Tr\left( {YL{Y^{T}}} \right)\\ &s.t. S \ge 0, L=I-S^{T}S\\ &W \ge 0, W = {W^{T}}, W{1^{T}} = {1^{T}}, \\ &\sum\nolimits_{ij} {{{\left( {T \odot W} \right)}_{ij}}} = Tr\left( {TW} \right) = 0 \end{aligned} $$
(5)

where γ is a parameter balancing the tradeoff between the two terms. We further add the constraint of degree normalization, i.e., Dii = 1, equivalently W1T = 1T, where 1 ∈ Rn is the all-ones vector. Since W is both non-negative and symmetric, imposing W1T = 1T makes it doubly-stochastic [39] and less sensitive to the parameters. From (5), we can see that graph construction and subspace learning are unified in the same objective function. Then, the subspace results can be used to enhance the geometry-preserving ability of the graph construction, and the modified graph can in turn incorporate discriminative information for graph embedding.

3.2 Solution

We next show how to obtain the optimal W, S and Y in (5) using an alternating optimization approach. In detail, let St, Wt and Yt be the values of S, W and Y at the t-th iteration. To calculate the graph embedding Yt+ 1, we first fix Wt and St. Then (5) reduces to the conventional graph-based subspace learning problem, i.e.

$$ \begin{aligned} Y_{t+1} =& {\arg_{Y}}\min Tr\left( {YL{Y^{T}}} \right) = {\arg_{Y}}\max Tr\left( {Y{S_{t}^{T}}S_{t}{Y^{T}}} \right)\\ &s.t. Y{Y^{T}} = I, \end{aligned} $$
(6)

The optimal Yt+ 1 is obtained from the eigenvectors of \({S_{t}^{T}}S_{t}\) corresponding to its largest eigenvalues, which are exactly the top right singular vectors of St.
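Since St has only m rows, this eigen-decomposition never needs the full n × n matrix; a minimal sketch of the Y-update under this observation is shown below.

```python
import numpy as np
from scipy.sparse.linalg import svds

def update_Y(S_t, d):
    """Y-step of (6): with W_t and S_t fixed, the optimal Y is spanned by the
    eigenvectors of S_t^T S_t with the largest eigenvalues, i.e. the top-d right
    singular vectors of S_t, so the n x n matrix is never formed explicitly."""
    U, sig, Vt = svds(S_t, k=d)      # S_t is m x n, so this costs roughly O(n m d)
    return Vt                        # d x n embedding, satisfying Y Y^T = I
```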

To calculate the updated St+ 1, we replace the objective function of (5) with its first-order approximation around St. Then (5) can be rewritten as:

$$ \begin{aligned} &\left\| {W_{t} - {S^{T}}S} \right\|_{F}^{2} + \gamma Tr\left( {S{W_{t}^{Y}}{S^{T}}} \right)\\ &\approx \frac{\eta }{2}\left\| {S - {S_{t}}} \right\|_{F}^{2} + \left\langle {S - {S_{t}},{\nabla_{S}}Q\left( {{S_{t}}} \right)} \right\rangle + Q\left( {{S_{t}}} \right)\\ &= \frac{\eta }{2}\left\| {S - {S_{t}} + \frac{1}{\eta }{\nabla_{S}}Q\left( {{S_{t}}} \right)} \right\|_{F}^{2} + P\left( {{S_{t}}} \right) \end{aligned} $$
(7)

where \({W_{t}^{Y}} \in R^{n \times n}\) with each element satisfying \({{W_{t}^{Y}}}|_{ij}=||y_{i}-y_{j}||_{2}^{2}\), so that \(Tr(YLY^{T})=\sum \nolimits _{ij}{||y_{i}-y_{j}||_{2}^{2}(S^{T}S)_{ij}}=Tr(S{W_{t}^{Y}}S^{T})\) up to a constant factor, \(Q\left (S \right )=\left \| W_{t}-{{S}^{T}}S \right \|_{F}^{2}+\gamma Tr\left (S{W_{t}^{Y}}{{S}^{T}} \right )\), and \({{\nabla }_{S}}Q\left ({{S}_{t}} \right )={{S}_{t}}{S_{t}^{T}}{{S}_{t}}-{{S}_{t}}\left (W_{t}-\gamma {W_{t}^{Y}} \right )\) is the partial derivative of \(Q\left (S \right )\) w.r.t. S evaluated at St. Here η is a Lipschitz-type parameter, which we choose as \(\eta =\left \| {{S}^{T}}S \right \|_{F}^{2}\), and \(P\left ({{S}_{t}} \right )\) is a term depending only on St that can be neglected. The updated St+ 1 is then obtained by solving:

$$ {S_{t + 1}} = \arg {\min_{S}}\frac{\eta }{2}\left\| {S - {S_{t}} + \frac{1}{\eta }{\nabla_{S}}Q\left( {{S_{t}}} \right)} \right\|_{F}^{2}\quad s.t.\ S \ge 0 $$
(8)
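A sketch of the resulting S-update is given below; it performs one projected gradient step with the gradient and step size defined above, using dense matrices for readability (constant factors are absorbed into the step), so it is illustrative rather than an exact reproduction of Algorithm 1.

```python
import numpy as np

def update_S(S_t, W_t, Y_t, gamma):
    """S-step sketched from (7)-(8): one projected gradient step on
    Q(S) = ||W_t - S^T S||_F^2 + gamma * Tr(S W^Y S^T), followed by projection
    onto S >= 0 (constant factors are absorbed into the step size)."""
    # W^Y_ij = ||y_i - y_j||^2, built from the current embedding (columns of Y_t)
    sq = np.sum(Y_t ** 2, axis=0)
    W_Y = sq[:, None] + sq[None, :] - 2.0 * Y_t.T @ Y_t
    # gradient of Q at S_t, as defined above (dense matrices, for readability only)
    grad = (S_t @ S_t.T) @ S_t - S_t @ (W_t - gamma * W_Y)
    # step scale eta = ||S^T S||_F^2, computed as ||S S^T||_F^2 (same Frobenius norm)
    eta = np.linalg.norm(S_t @ S_t.T, 'fro') ** 2 + 1e-12
    S_next = S_t - grad / eta                        # gradient step of (8)
    return np.maximum(S_next, 0.0)                   # projection onto S >= 0
```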

We next fix Yt and St and update Wt+ 1. Then (5) becomes an instance of quadratic programming (QP), which can be written as follows

$$ \begin{aligned} W_{t+1}=&{{\arg }_{W}}\min \left\| W-{{W}_{t}^{0}} \right\|_{F}^{2} \\ &s.t.\ W \ge 0, W = {W^{T}}, W{1^{T}} = {1^{T}}, \\ &\sum\nolimits_{ij} {{{\left( {T \odot W} \right)}_{ij}}} = Tr\left( {{W^{T}}T} \right) = 0 \end{aligned} $$
(9)

where \({{W}_{t}^{0}}={S_{t}^{T}}S_{t}\). For efficient computation, we divide the QP problem into two convex sub-problems:

$$ W_{t+1}={{\arg }_{W}}\min \left\| W-{W_{t}^{0}} \right\|_{F}^{2}\quad s.t.\ W\ge 0. $$
(10)

and

$$ \begin{aligned} W_{t+1} =& {\arg_{W}}\min \left\| {W - {{W_{t}^{0}}}} \right\|_{F}^{2}\quad \\ &s.t. W = {W^{T}}, W{1^{T}} = {1^{T}}, Tr\left( {{W^{T}}T} \right) = 0 \end{aligned} $$
(11)

The Wt+ 1 in (10) is simply obtained by keeping the non-negative elements of \({W_{t}^{0}}\) and setting the rest to zero. On the other hand, (11) is solved by taking the Lagrangian:

$$ \begin{array}{l} J\left( W \right) = \left\| {W - {{W_{t}^{0}}}} \right\|_{F}^{2} - t Tr\left( WT \right)\\ \quad \quad \quad \quad - \mu \left( {W{1^{T}} - {1^{T}}} \right) - \mu \left( {{W^{T}}{1^{T}} - {1^{T}}} \right) \end{array} $$
(12)

where t and μ ∈ Rn are the Lagrangian multipliers. We then set the derivative of \(J\left (W \right )\) w.r.t. W to zero, i.e.

$$ W={W_{t}^{0}}+t T+{{\mu }^{T}}1+{{1}^{T}}\mu. $$
(13)

To fulfill \(Tr\left (WT \right )=0\), we have:

$$ \begin{aligned} &Tr\left( {\left( {{W_{t}^{0}} + t T + {\mu^{T}}1 + {1^{T}}\mu } \right)T} \right) = 0\\ &\Rightarrow Tr\left( {{W_{t}^{0}}T} \right) + \left| T \right|t + 2\mu T{1^{T}} = 0\\ &\Rightarrow t = \frac{ - Tr\left( {{W_{t}^{0}}T} \right) - 2\mu T{1^{T}}}{\left| T \right|} \end{aligned} $$
(14)

Considering the normalized constraint W1T = 1T or 1W = 1, we have

$$ \left\{\begin{aligned} 1W{1^{T}} =& n = 1{W_{t}^{0}}{1^{T}} + \left| T \right|t + n\mu {1^{T}} + n{\text{1}}{\mu^{T}}\\ \!1WT{1^{T}} = & \left| T \right| = 1{W_{t}^{0}}T{1^{T}} \!+ \!1{T^{2}}{1^{T}}t + 1{\mu^{T}}\left| T \right| + n\mu T{1^{T}}\\ W{1^{T}} =& {1^{T}} = {W_{t}^{0}}{1^{T}} + t T{1^{T}} + n{\mu^{T}} + {1^{T}}\mu {1^{T}} \end{aligned} \right. $$
(15)

We then substitute (14) for t in (15) and, after some derivations, obtain

$$ \begin{array}{l} \left\{ \begin{aligned} \mu {1^{T}} =& 1{\mu^{T}} = \frac{{1^{T}}1}{2n}{R_{t}} + \frac{{1^{T}}\mu T{1^{T}}}{n}\\ \mu T{1^{T}} =& \frac{2n \cdot 1\left( {T - \frac{\left| T \right|}{2n}I} \right){R_{t}}}{2{n^{2}} + 2\left| T \right| - 4n \cdot 1{T^{2}}{1^{T}}/\left| T \right|}\\ {\mu^{T}} =& \frac{I}{n}{R_{t}} + \frac{2T{1^{T}}\mu T{1^{T}}}{\left| T \right|n} - \frac{{1^{T}}\mu {1^{T}}}{n} \end{aligned} \right.\\ \Rightarrow {\mu}^{T} = \left( {\frac{I}{n} - \frac{{1^{T}}1}{2{n^{2}}} - \frac{\left( {T - \frac{\left| T \right|}{2n}I} \right){1^{T}}1\left( {T - \frac{\left| T \right|}{2n}I} \right)}{\left( {n \cdot 1{T^{2}}{1^{T}} - {n^{2}}\left| T \right|} \right)/2 - {\left| T \right|^{2}}/2}} \right)R_{t} \end{array} $$
(16)

where Rt is a fixed matrix defined as:

$$ R_{t}={1^{T}} - {W_{t}^{0}}{1^{T}} + \frac{Tr\left( {{W_{t}^{0}}} \right)}{\left| T \right|}T{1^{T}} $$
(17)

The detailed derivation of (16) from (15) can be found in Appendix A. Then, by replacing μ in (13) with (16), we obtain the updated Wt+ 1. This alternating projection procedure converges due to Von Neumann's successive projection lemma [36, 40, 41]. Therefore, we iteratively update Y, S and W following (6), (8) and (9), respectively, until the objective function in (5) converges below a given small value. The detailed algorithm for iteratively solving (5) is given in Algorithm 1, and an illustrative sketch of the overall iteration is shown after Algorithm 1.

figure a
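The sketch below outlines the overall alternating procedure, reusing the update_Y and update_S helpers sketched earlier; for the W-step it replaces the closed-form Lagrangian solution (13)-(17) with a simple Sinkhorn-style sequence of projections onto the individual constraints, which follows the same alternating-projection idea but is only an illustrative simplification under that assumption.

```python
import numpy as np

def update_W(W0, T, n_proj=20):
    """W-step sketch for (9): instead of the closed-form Lagrangian solution
    (13)-(17), cycle through simple projections enforcing non-negativity (10),
    symmetry, the cannot-link constraint Tr(TW) = 0 and (approximately) the
    row-stochastic constraint W 1^T = 1^T."""
    W = W0.copy()
    for _ in range(n_proj):
        W = np.maximum(W, 0.0)                           # (10): W >= 0
        W = 0.5 * (W + W.T)                              # W = W^T
        W[T > 0] = 0.0                                   # Tr(TW) = 0: drop forbidden links
        W = W / (W.sum(axis=1, keepdims=True) + 1e-12)   # approximate W 1^T = 1^T
    return 0.5 * (W + W.T)

def sage_iterations(S0, T, d, gamma, n_iter=30):
    """Outer loop of Algorithm 1 (sketch): alternate the Y-, S- and W-updates of
    (6), (8) and (9), reusing update_Y and update_S from the earlier sketches,
    and monitor the objective of (5)."""
    S = S0.copy()
    W = S.T @ S
    for it in range(n_iter):
        Y = update_Y(S, d)                               # (6)
        S = update_S(S, W, Y, gamma)                     # (8)
        W = update_W(S.T @ S, T)                         # (9)
        L = np.eye(W.shape[0]) - S.T @ S
        obj = np.linalg.norm(W - S.T @ S, 'fro') ** 2 + gamma * np.trace(Y @ L @ Y.T)
        print(f"iter {it:02d}  objective {obj:.4f}")
    return W, S, Y
```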

3.3 Computational complexity analysis

We next analyze the computational complexity of the proposed SAGE. Based on the basic steps in Algorithm 1, the computational complexity can be divided into four parts:

  1. 1)

    Obtain the m anchors by k-means. If we choose the Balanced and Hierarchical K-means (BAHK) method as in [42], this needs O(ndlog(m)t), where t is the number of iterations;

  2. 2)

    Initialize the similarity matrix S0 and form the original graph weight matrix W0, which requires O(ndm);

  3. 3)

    Update the low-dimensional representation Y, the similarity matrix S, and the graph weight matrix W according to Algorithm 1. In detail, to update Yt+ 1 in each iteration, we need to perform the SVD of matrix St, whose computational complexity is O(nmk), where k is the number of nearest neighbors; to update St+ 1, we need to calculate \({W_{t}^{Y}}\) and \(\nabla_{S}Q(S_{t})\), whose computational complexities are O(nd2) and O(nmd + nmq), respectively, where q is the average number of non-zero entries of Wt or \({W_{t}^{Y}}\), so the total complexity is O(nd2 + nmd + nmq); to update Wt+ 1, we need to run the alternating projection procedure of (10) and (11), whose total complexity is O(nqtp), where tp is the number of iterations of the alternating projection procedure;

  4. 4)

    Calculate the projection matrix V by solving the regression problem of (3), where the computational complexity is O(ndm).

Considering that m ≪ n, the overall computational complexity of the proposed SAGE is O(ndm). Compared with conventional spectral methods such as LPP and SR, the proposed method is computationally efficient and can handle large-scale data.

4 Simulations

In this section, we evaluate the proposed method on one toy example and several real-world benchmark datasets, and compare its performance with other state-of-the-art methods. In the toy example, we generate a two-circle dataset with two clusters, each following a circle distribution with a different radius; for the real-world evaluation, we choose four benchmark datasets, i.e., the COIL100 [43], CASIA-HWDB [44], Yale-B [45] and Fashion-MNIST [46] datasets, to evaluate the performance of graph embedding. To further show the effectiveness of the proposed work, we also apply it to the task of content-based image retrieval. Our goal is to verify that the distances calculated from the low-dimensional representations preserve the geometrical structure of the data manifold well, which benefits real-world applications.

4.1 Parameter analysis

We first verify the convergence of the proposed method. Specifically, we randomly initialize the similarity matrix S0 and the graph weight matrix W0, fix the regularization parameter γ = 10 in (5), and set the number of anchors to 500. For the parameter T, we randomly select 10% side information to form the matrix T. We then show the convergence of the objective function value in (5) on the CASIA-HWDB [44] dataset. Figure 2 shows the iterative procedure of the proposed method for updating the graph weights. From the simulation results, we can see that, as the iterations proceed, the mis-connected links between different classes are gradually weakened until the structures of the two classes are distinctly separated. This well illustrates the effectiveness of the proposed method.

Fig. 2
figure 2

Graph weight updated via iterative procedure

We next evaluate some important parameters of the proposed work. It should be noted that there are three main parameters in our work: the matrix T, the anchor number and the tradeoff parameter γ. To find suitable values, we first fix the matrix T, then set the candidates of the anchor number to [500, 1000, 1500, 2000, 2500] and those of γ to [10− 3, 10− 2, 10− 1, 1, 10, 102, 103], respectively. We then run the proposed method and calculate the clustering accuracies for each combination of T, anchor number and γ, from which the optimal values are selected according to the best accuracies. For the parameter T, 10% side information is randomly selected to form the matrix T. The reason is that the proposed work mainly focuses on semi-supervised learning, which aims to utilize side information to enhance the clustering or ranking performance. If we choose 0% side information, the proposed work becomes a purely unsupervised learning method, so the parameter γ loses its purpose, as it balances the manifold term and the side-information regularization term; on the other hand, if we choose 20% side information, we use a large amount of discriminative information. A good strategy is to use as little side information as possible while still achieving competitive results, so we randomly select 10% side information to form the matrix T. Figure 3 shows the accuracies with the number of anchors varied from 500 to 2500 and γ from 10− 3 to 103. From Fig. 3, we can see that the results are satisfactory and relatively stable when γ falls in the range [0.1, 103] and the number of anchors is larger than 1000.

Fig. 3
figure 3

Parameter analysis

4.2 Subspace embedding evaluations

In this section, we evaluate the effectiveness of the proposed work on one synthetic dataset and four benchmark datasets. The synthetic dataset contains two classes, each following a circle distribution. In this toy example, we show how the proposed method updates the graph weights so that both the geometrical structure and the discriminative information can be preserved. For the real-world evaluation, we assess the performance of subspace embedding on the COIL100 [43], CASIA-HWDB [44], Yale-B [45] and Fashion-MNIST [46] datasets.

COIL100 dataset :

[43] is an object image dataset in which each object is viewed from varying angles at intervals of five degrees, so that each object has 72 images. The original size of each cropped image is 128 × 128 with 24-bit color per pixel. In our study, we down-sample the images to 32 × 32 and convert them to 256 gray levels.

CASIA-HWDB dataset :

[44] is a handwritten image dataset which includes both isolated characters (52 categories covering 26 upper-case and 26 lower-case letters) and handwritten digits (10 categories covering the digits 0-9). In our study, we choose a subset that includes images of the 52 isolated characters. The subset has 12456 samples with an image size of 16 × 16 in 256 gray levels.

Yale-B dataset :

[45] is a famous face dataset that contains 16128 images of 38 subjects under 64 illumination conditions and 9 poses. In our study, we resize the images to 32 × 32 pixels and choose approximately 64 near-frontal images under different illuminations for each subject.

Fashion-MNIST dataset :

[46] is a Zalando’s clothes images dataset having 10 classes. It has 60000 and 10000 images for training set and test set, respectively. Each image is with 28 × 28 size in 256 gray level.

We first utilize the two-circle dataset to verify the effectiveness of the proposed method, where we randomly select 10% side information to form the matrix T. Figure 2 shows the updated graph weight matrix during the iterative procedure of Algorithm 1, where the black lines indicate the weight connections. From Fig. 2, we can observe that the graph weights between pairwise data points belonging to different clusters are gradually weakened, while those within the same cluster are strengthened. Therefore, the clustering structure of the data manifold becomes much more distinctive. This illustrates the effectiveness of the proposed method.

We next evaluate the clustering results of the proposed method, using Clustering Accuracy (ACC) and Normalized Mutual Information (NMI) to measure the performance of the different methods [47]. The average ACC and NMI results over 20 random splits are given in Tables 1 and 2, where we evaluate the clustering results for different numbers of clusters k. In Tables 1 and 2, the left columns are the results of LE, LPP, SR, ULGE and the unsupervised version of the proposed method, while the right two columns are the results of the semi-supervised versions of SAGE with 10% and 20% side information. From the simulation results, we can obtain the following observations:

  1. 1)

    ULGE and the unsupervised version of SAGE outperform the conventional graph embedding methods such as LE, LPP and SR by approximately 4-5%. The reason is that ULGE and the proposed method utilize the anchor graph, whose doubly-stochastic property makes the graph highly robust and less sensitive to the parameters;

  2. 2)

    the unsupervised version of SAGE is superior to the other methods, with approximately 3% and 4% improvements over ULGE and the remaining methods, respectively. This is because the proposed method incorporates graph construction and graph embedding learning into a unified framework;

  3. 3)

    by utilizing side information, the semi-supervised version of SAGE achieves much better results than the other state-of-the-art unsupervised methods. The improvement reaches approximately 5% on the COIL100 dataset given 20% side information. This indicates that the discriminative information indeed improves the subspace clustering results.

Table 1 Clustering performance based on different methods on four datasets (ACC)
Table 2 Clustering performance based on different methods on four datasets (NMI)

4.3 Applications for person re-identification

In this section, we apply the proposed method to Person Re-Identification [48, 49]. In detail, given a query person image, we first calculate its low-dimensional representation. We then calculate the Euclidean distance between the query image and the database person images in the low-dimensional subspace. Finally, the database images with the smallest distances to the query image are chosen as the relevant results. The original query combined with the relevant data forms a new query set for another retrieval round, and the procedure is repeated iteratively until the retrieval performance satisfies the user; a sketch of this loop is given below. In this work, we select three benchmark Re-ID datasets, i.e., the PKU [50], WARD (Wide Area Camera Network) [51] and RAiD (Re-Identification Across Indoor-Outdoor) [52] datasets, for evaluation.
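A sketch of this retrieval-with-feedback loop is given below; the projection matrix V is assumed to come from the proposed SAGE training, and feedback_fn is a hypothetical callback standing in for the user's relevance judgments.

```python
import numpy as np

def reid_retrieval_with_feedback(V, query, gallery, feedback_fn, top_k=10, n_rounds=4):
    """Retrieval loop described above (sketch): project the query and gallery images
    into the learned subspace with the projection matrix V, rank the gallery by
    Euclidean distance, then fold the user-confirmed relevant images back into the
    query set for the next round.  `feedback_fn` is a hypothetical callback that
    returns the gallery indices the user marks as relevant."""
    q_set = V.T @ query.reshape(-1, 1)               # d x 1 low-dimensional query
    g_low = V.T @ gallery                            # d x N low-dimensional gallery
    ranked = np.array([], dtype=int)
    for _ in range(n_rounds):
        # distance of every gallery image to its nearest vector in the query set
        d2 = ((g_low[:, :, None] - q_set[:, None, :]) ** 2).sum(axis=0).min(axis=1)
        ranked = np.argsort(d2)[:top_k]              # top-k candidates of this round
        relevant = feedback_fn(ranked)               # user relevance feedback
        if len(relevant) == 0:
            break
        q_set = np.concatenate([q_set, g_low[:, relevant]], axis=1)
    return ranked
```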

PKU-ReID dataset [50] contains 114 individuals with 1824 images captured from two disjoint camera views. For each person, eight images are captured from eight different orientations under one camera view and are normalized to the same pixel size. In our study, we split the dataset into two parts randomly, i.e., 57 individuals for training and the other 57 for testing.

WARD dataset [51] is also a Re-ID dataset collected with three non-overlapping cameras. Each person has several images in each camera, and a total of 4786 images of 70 different individuals have been extracted. The original image size is 320 × 240 pixels. In our work, we down-sample them to 96 × 48 pixels.

RAiD dataset [52] is collected at Winston Chung Hall of UC Riverside. It is a 4-camera Re-ID dataset with 2 indoor and 2 outdoor cameras. A total of 43 people walked through these camera views, resulting in 6920 images. The size of each cropped image is originally 128 × 64 with 24-bit color. In our work, we down-sample the images to 32 × 16 pixels.

4.3.1 Qualitative analysis

We first present the qualitative analysis of the proposed method for Person Re-ID. Figure 4 shows the retrieval results of the proposed work under different query data for the PKU [50], WARD [51] and RAiD [52] datasets. The simulation results show that the proposed work achieves nearly 100% accuracy for certain query data, illustrating its effectiveness. Figures 5, 6 and 7 show the image retrieval results of the proposed method within a scope of 10. In these results, the images in the first column represent the queries and those in the following columns are the retrieval results with the top 15, 25 and 25 values under different iterations for the PKU [50], WARD [51] and RAiD [52] datasets, respectively. The image with a blue box in each group represents the query image randomly selected from the corresponding dataset, while those with green boxes represent images related to the query and those with yellow boxes represent images not related to the query. From Figs. 5, 6 and 7, we can observe that the retrieval performance for the given queries is good, reaching nearly 90% accuracy after the fourth iteration with the user-provided relevance feedback. In addition, we can see that as the number of iterations increases, the retrieval results are greatly enhanced, showing that the user's relevance feedback provides useful supervised information for Person Re-ID.

Fig. 4
figure 4

The retrieval results of the proposed work under different query data: the images in the first column of each subfigure are the query person images, and the following ones are the top relevant images that are closest to the query images

Fig. 5
figure 5

The ranking results with different rounds of user relevance feedback: the first to the fourth rows are the ranking results associated with the zeroth, first, second and fourth rounds of user relevance feedback

Fig. 6
figure 6

The ranking results with different rounds of user relevance feedback: the first to the fourth rows are the ranking results associated with the zeroth, first, second and fourth rounds of user relevance feedback

Fig. 7
figure 7

The ranking results with different rounds of user relevance feedback: the first to the fourth rows are the ranking results associated with the zeroth, first, second and fourth rounds of user relevance feedback

4.3.2 Quantitative analysis

In this subsection, we compare the proposed method with other graph embedding methods for content-based image retrieval. The compared methods are the same as in the data clustering evaluation, except that we do not compare with LE [22] and LSC [38], since they cannot calculate the low-dimensional representations of query data, i.e., they cannot handle out-of-sample data. In our study, we use the MAP-scope curve to evaluate the performance of the proposed work and to compare it with other state-of-the-art methods. In detail, the scope is the number of top-ranked images returned for a certain query, while Mean Average Precision (MAP) represents the average ratio of the number of relevant images to a given scope. Therefore, it evaluates the accuracy of a given method under different scopes; a small sketch of this measure is given below.
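The sketch below computes this MAP-scope measure under the interpretation described above; the function names are illustrative assumptions.

```python
import numpy as np

def precision_at_scope(ranked_labels, query_label, scope):
    """Ratio of relevant images among the top-`scope` retrieved images."""
    top = np.asarray(ranked_labels[:scope])
    return float(np.mean(top == query_label))

def map_scope(all_rankings, query_labels, scope):
    """MAP-scope as described above (sketch): average over all queries of the
    ratio of relevant images within the given scope.  `all_rankings[i]` holds
    the gallery labels ordered by distance to query i."""
    scores = [precision_at_scope(r, q, scope)
              for r, q in zip(all_rankings, query_labels)]
    return float(np.mean(scores))
```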

Figure 8 gives the mean MAP-scope curves under fixed iteration numbers 1, 2 and 4, respectively. In addition, Fig. 9 gives the MAP as the number of iterations increases, at fixed scopes of 4, 7 and 10 for the PKU dataset, and 10, 20 and 50 for the WARD and RAiD datasets, respectively. From the results in Figs. 8 and 9, we can see that the MAPs of all methods improve as the user feedback increases. This indicates that user feedback is indeed helpful in providing discriminative information for the retrieval problem. For example, in Fig. 9 the proposed method achieves a 15% improvement at the fourth iteration over the case without iteration at scope 20 on the RAiD dataset. In addition, the proposed method obtains better retrieval results than the other compared methods over almost all scopes and iterations.

Fig. 8
figure 8

MAP-scope curves based on the whole test set for different feedback iterations: the top to the bottom rows represent the results of the PKU [50], WARD [51] and RAiD [52] datasets

Fig. 9
figure 9

MAP curves based on the whole test set at scopes 5, 10 and 20 for the PKU [50] dataset; scopes 10, 20 and 30 for the WARD [51] dataset; and scopes 10, 20 and 50 for the RAiD [52] dataset

5 Conclusion

In this paper, we develop a new scalable manifold ranking method for Person Re-ID by incorporating both graph weight construction and a manifold regularization term in the same framework. The graph we develop is discriminative and doubly-stochastic, which makes it highly robust and insensitive to the dataset and parameters. In addition, by taking side information into consideration, the proposed graph guarantees that data points from different clusters will not be mis-connected, which enhances the classification performance. Benefiting from such a graph, we then incorporate graph construction and subspace learning into a unified loss term. Therefore, the subspace results can be used to enhance the geometry-preserving ability of the graph construction, and the modified graph can in turn incorporate discriminative information for graph embedding. Simulations indicate that the proposed method achieves superior Person Re-ID performance.

While the proposed work achieves satisfactory results, our future work will focus on several issues. First, the graph neural network (GNN) has been one of the most popular topics during the past few years and has proved effective for dealing with structured data; we may extend the proposed work to form a graph convolutional layer so that it can benefit from the strong ability of GNNs to capture both geometrical and attribute features. Secondly, we may also connect the proposed work with a CNN to form an end-to-end deep learning framework, which adopts the powerful feature extraction ability of the CNN while maintaining the geometrical information via the GCN. This is also of great significance for improving the performance of image retrieval.