1 Introduction

State-of-the-art image recognition algorithms usually adopt a local patch-based, multiple-layer pipeline to obtain a good representation. These methods start from local image patches, using either normalized raw pixel intensities or descriptors such as the scale-invariant feature transform (SIFT) [1] or the histogram of oriented gradients (HOG) [2], and encode them into an overcomplete representation using algorithms such as \(k\)-means or sparse coding. After coding, global image representations are formed by spatially pooling the coded local descriptors. Methods following this pipeline have achieved competitive performance on image classification tasks [3]. In the whole procedure, the spatial pooling step brings a substantial performance improvement. One significant milestone in the construction of this arsenal of tools is the spatial pyramid matching (SPM) introduced in [4], which partitions the image into increasingly fine subregions and computes histograms of local features found inside each subregion. The empirical success of this technique stems from the fact that the spatial cue is integrated, and an approximate geometric matching is actually performed when multiple resolutions are combined in a principled way.

The codebook model, a simplified version of such a pipeline without spatial pooling, has also been considered for 3D shapes. The basic idea of using a codebook to represent a shape as a histogram of occurrences of visual words is commonly referred to in the literature as the Bag-of-Words (BoW) or Bag-of-Features (BoF) approach. Several authors have introduced such BoF approaches for 3D shape retrieval. Early research mainly dealt with global Euclidean transformations (rigid motion) [5] and multiple views [6]. By defining the visual words on segmented shape regions, Toldo et al. [7] obtained encouraging shape categorization and retrieval results. Darom et al. [8] achieved state-of-the-art retrieval accuracy by designing local vertex-wise features that are robust to scale changes and partial mesh matching. The codebook model has been shown to be a promising method for partial shape retrieval [8, 9]. Recent efforts have also focused on achieving deformation invariance for nonrigid shapes by replacing the Euclidean metric with its geodesic counterpart [10]. The geodesic distance, however, suffers from strong sensitivity to topological noise, which limits its usefulness in real applications.

This problem is well handled by tools from the emerging field of diffusion geometry, which provides a generic framework for many intrinsic methods in the analysis of geometric shapes. Diffusion geometry formulates heat diffusion processes on manifolds. Coifman and Lafon [11] introduced invariant metrics known as diffusion distances, defined as the \(L_{2}\)-norm difference between the energy distributions of two points, each initialized with a unit impulse function and diffused for a given time. The diffusion distance is more robust to topological noise than the geodesic one. Reuter et al. [12] adopted the eigenvalues of the Laplace–Beltrami (LB) operator to construct a global shape descriptor called ShapeDNA. Based on the theoretical work in [13], Lévy [14] showed that the eigenfunctions of the LB operator are well adapted to the geometry and the topology of an object. Later, several spectral descriptors were proposed to characterize the geometric features of a 3D surface [15–17]. By aggregating these spectral descriptors, the Shape Google algorithm [18, 19] was proposed as a classical method for deformable shape retrieval. It uses the multiscale diffusion heat kernels as “geometric words” and constructs a compact and informative shape representation by means of the codebook approach.

More recently, there have been several attempts to adapt 2D planar shape contexts [20] and popular image feature detectors [21] and descriptors [22] to 3D surfaces. This line of work partially inspires our proposed approach. Another inspiration comes from the great success of SPM in the image domain. Spatially enhanced techniques for 3D shape recognition were explored earlier in [23, 24], but these works are not intrinsic, i.e., shape deformations affect the descriptors. “Geometric expressions” [18] was an earlier attempt to exploit intrinsic geometry, but the authors only dealt with local relative spatial positions by considering the diffusion distance between pairs of vertices. Our approach, on the other hand, models global absolute spatial positions, which allows us to retain and exploit the information contained in the whole 3D shape.

Our contributions are threefold: (1) we propose to adopt the second eigenfunction of the LB operator to construct a global surface coordinate system that is insensitive to shape deformation, (2) we develop a proper generalization of the SPM to surfaces and show a numerical way to construct it, and (3) we experimentally demonstrate that introducing the global spatial context significantly improves the discriminative power of the descriptor in 3D matching and retrieval.

The rest of this paper is organized as follows. Section 2 provides a brief background on the LB operator, its discretization and eigenanalysis, followed by the codebook model. In Sect. 3, we propose the intrinsic spatial pyramid matching (ISPM) approach. Experimental results on two 3D datasets are presented in Sect. 4. Finally, we conclude and point out future work directions in Sect. 5.

2 Background

2.1 Laplace–Beltrami operator

Let \(\mathbb M \) be a smooth orientable 2-manifold (surface) embedded in \(\mathbb R ^3\). A global parametric representation (embedding) of \(\mathbb M \) is a smooth vector-valued map (also called surface patch) \({\varvec{x}}\) defined from a connected open set (parametrization domain) \(U\subset \mathbb R ^2\) to \(\mathbb M \subset \mathbb R ^3\) such that

$$\begin{aligned} {\varvec{x}}({\varvec{u}})=\left( x^{1}({\varvec{u}}), x^{2}({\varvec{u}}), x^{3}({\varvec{u}}) \right) \end{aligned}$$
(1)

where \({\varvec{u}}=(u^1,u^2)\in U\). Note that the components of \({\varvec{x}}\) and \({\varvec{u}}\) are denoted by superscripts in place of subscripts. This superscript convention stems from the use of tensor notation which greatly simplifies the formalism of the theory of surfaces [25].

Given a twice-differentiable function \(f:\,\mathbb M \rightarrow \mathbb R \), the LB operator [26] is a second-order partial differential operator defined as

$$\begin{aligned} \Delta _\mathbb{M } f&= -\frac{1}{\sqrt{|g|}}\sum _{i,j=1}^{2}\frac{\partial }{\partial u^{j}} \left( \sqrt{|g|}\,g^{ij} \frac{\partial f}{\partial u^{i}}\right) \nonumber \\&= -\sum _{i,j=1}^{2}g^{ij}\frac{\partial }{\partial u^{j}} \frac{\partial f}{\partial u^{i}} +(\hbox {lower order terms}) \end{aligned}$$
(2)

where the matrix \(g=(g_{ij})\) is referred to as the Riemannian metric tensor on \(\mathbb M \), \(g^{ij}\) denote the elements of the inverse metric tensor \(g^{-1}\), and \(|g|\) is the determinant of \(g\). The functions \(g_{ij}\) are sometimes referred to as the metric coefficients. The Riemannian metric \(g\) is an intrinsic quantity in the sense that it relates to measurements inside the surface. It is the analogue of the speed of a space curve and determines all the intrinsic properties of the surface \(\mathbb M \). These properties depend on the surface itself and not on its embedding in space. Furthermore, the tensor \(g\) is invariant to rotation of the surface in space because it is defined in terms of inner products, which are rotation invariant.

2.2 Discretization

Assume that the surface \(\mathbb{M }\) is approximated by a triangular mesh. A triangle mesh \(\mathbb{M }\) may be defined as \(\mathbb{M }=(\mathcal{V },\mathcal{E })\) or \(\mathbb{M }=(\mathcal{V },\mathcal{T })\), where \(\mathcal{V }=\{{\varvec{v}}_{1},\ldots ,{\varvec{v}}_{m}\}\) is the set of vertices, \(\mathcal{E }=\{e_{ij}\}\) is the set of edges, and \(\mathcal{T }=\{{\varvec{t}}_{1},\ldots ,{\varvec{t}}_{n}\}\) is the set of triangles. Each edge \(e_{ij}\) (denoted by \([{\varvec{v}}_{i},{\varvec{v}}_{j}]\) or simply \([i,j]\)) connects a pair of vertices \(\{{\varvec{v}}_{i},{\varvec{v}}_{j}\}\). Two distinct vertices \({\varvec{v}}_{i},{\varvec{v}}_{j}\in \mathcal{V }\) are adjacent (denoted by \({\varvec{v}}_{i}\sim {\varvec{v}}_{j}\) or simply \(i\sim j\)) if they are connected by an edge, i.e., \(e_{ij}\in \mathcal{E }\). The neighborhood (1-ring) of a vertex \({\varvec{v}}_{i}\) is the set \({\varvec{v}}_{i}^{\star }=\{{\varvec{v}}_{j}\in \mathcal{V }: i\sim j\}\).

Several discretizations of the LB operator are available in the literature. In this paper, we use the approach developed in [27], which employs a mixed finite element/finite volume method on triangle meshes. Hence, the value of \(\Delta _\mathbb{M }f\) at a vertex \(\varvec{v}_{i}\) (or simply \(i\)) can be approximated using the cotangent weight scheme:

$$\begin{aligned} \Delta _\mathbb{M }f(i)\approx \frac{1}{a_{i}}\sum _{j\sim i} \frac{\cot \alpha _{ij} + \cot \beta _{ij}}{2}\bigl (f(i)-f(j)\bigr ) \end{aligned}$$
(3)

where \(\alpha _{ij}\) and \(\beta _{ij}\) are the angles \(\angle ({\varvec{v}}_{i}{\varvec{v}}_{k_1}{\varvec{v}}_{j})\) and \(\angle ({\varvec{v}}_{i}{\varvec{v}}_{k_2}{\varvec{v}}_{j})\) of the two faces \({\varvec{t}}^{\alpha }=\{{\varvec{v}}_{i},{\varvec{v}}_{j},{\varvec{v}}_{k_1}\}\) and \({\varvec{t}}^{\beta }=\{{\varvec{v}}_{i},{\varvec{v}}_{j},{\varvec{v}}_{k_2}\}\) adjacent to the edge \([i,j]\), and \(a_i\) is the area of the Voronoi cell at vertex \(i\). It is worth pointing out that the cotangent weight scheme is numerically consistent and preserves several important properties of the continuous LB operator, including symmetry and positive-definiteness [28].
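To make the construction concrete, the following is a minimal numerical sketch of the cotangent scheme (in Python with NumPy/SciPy, our assumed tooling); for brevity it uses barycentric vertex areas in place of exact Voronoi cells.

```python
import numpy as np
from scipy import sparse

def cotangent_laplacian(verts, tris):
    """Minimal sketch of the cotangent scheme of Eq. (3).
    verts: (m, 3) vertex positions; tris: (n, 3) triangle indices.
    Returns the sparse symmetric cotangent matrix C and the diagonal
    mass matrix A, so that L = A^{-1} C discretizes Delta_M.
    Barycentric vertex areas stand in for exact Voronoi cells."""
    m = verts.shape[0]
    I, J, W = [], [], []
    areas = np.zeros(m)
    for t in tris:
        p = verts[t]
        area = 0.5 * np.linalg.norm(np.cross(p[1] - p[0], p[2] - p[0]))
        areas[t] += area / 3.0                    # barycentric lumping
        for k in range(3):                        # angle at corner k faces edge (i, j)
            i, j = t[(k + 1) % 3], t[(k + 2) % 3]
            u, v = p[(k + 1) % 3] - p[k], p[(k + 2) % 3] - p[k]
            cot = np.dot(u, v) / (2.0 * area)     # cos/sin = dot / |cross|
            I += [i, j]; J += [j, i]; W += [-0.5 * cot, -0.5 * cot]
    C = sparse.coo_matrix((W, (I, J)), shape=(m, m)).tocsr()
    C = C - sparse.diags(np.asarray(C.sum(axis=1)).ravel())   # rows sum to zero
    return C, sparse.diags(areas)
```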

Define the weight function \(\omega : \mathcal V \times \mathcal V \rightarrow \mathbb R \) as

$$\begin{aligned} \omega _{ij}=\left\{ \begin{array}{ll} \frac{\cot \alpha _{ij} + \cot \beta _{ij}}{2a_i} &{}\quad \text{ if } i\sim j \\ 0 &{}\quad \text{ o.w. } \end{array}\right. \end{aligned}$$
(4)

Then, for a function \(f: \mathcal V \rightarrow \mathbb{R }\) that assigns to each vertex \(i\in \mathcal V \) a real value \(f(i)\) (we can view \(f\) as a column vector of length \(m\)), we may write the LB operator given by Eq. (3) as

$$\begin{aligned} Lf(i)=\sum _{j\sim i} \omega _{ij}\,\bigl (f(i) -f(j)\bigr ), \end{aligned}$$
(5)

where the matrix \(L=(\ell _{ij})\) is given by

$$\begin{aligned} \ell _{ij}=\left\{ \begin{array}{ll} d_{i}&{}\quad \text{ if } i=j \\ -\omega _{ij} &{}\quad \text{ if } i\sim j \\ 0 &{}\quad \text{ o.w. } \end{array}\right. \end{aligned}$$
(6)

and \(d_{i}=\sum _{j=1}^{m} {\omega _{ij}}\) is the weighted degree of the vertex \({\varvec{v}}_{i}\).

2.3 Eigenanalysis and spectral signatures

Note that \(\omega _{ij}\ne \omega _{ji}\) in general, which implies that \(L\) is not a symmetric matrix. Thus, the spectrum (set of eigenvalues) of the eigenvalue problem \(L{\varvec{\varphi }}_{i}=\lambda _{i}{\varvec{\varphi }}_{i}\) may not be real [29]. Noting that \(\omega _{ij}=\kappa _{ij}/a_{i}\), where

$$\begin{aligned} \kappa _{ij}=\left\{ \begin{array}{ll} \frac{\cot \alpha _{ij} + \cot \beta _{ij}}{2} &{}\quad \text{ if } i\sim j \\ 0 &{}\quad \text{ o.w. } \end{array}\right. \end{aligned}$$
(7)

we may factorize the matrix \(L\) as \(L=A^{-1}C\), where \(A=\mathrm{diag} (a_{i})\) is a positive-definite diagonal matrix called the lumped mass matrix and \(C=(c_{ij})\) is a sparse symmetric matrix referred to as the stiffness matrix, given by

$$\begin{aligned} c_{ij}=\left\{ \begin{array}{ll} \sum _{k=1}^{m}\kappa _{ik}&{}\quad \text{ if } i=j \\ -\kappa _{ij} &{}\quad \text{ if } i\sim j \\ 0 &{}\quad \text{ o.w. } \end{array}\right. \end{aligned}$$
(8)

Therefore, we may write the eigenvalue problem \(L{\varvec{\varphi }}_{i}=\lambda _{i}{\varvec{\varphi }}_{i}\) as a generalized problem

$$\begin{aligned} C{\varvec{\varphi }}_{i}=\lambda _{i} A{\varvec{\varphi }}_{i},\quad i=1,2,\dots ,m \end{aligned}$$
(9)

The eigenvalues \(\lambda _i\) and associated eigenfunctions \({\varvec{\varphi }}_i\) of the LB operator can be computed by solving the above generalized problem. We may sort the eigenvalues in ascending order as \(0=\lambda _1<\lambda _{2}\le \dots \le \lambda _{m}\) with corresponding eigenfunctions \({\varvec{\varphi }}_{1}, {\varvec{\varphi }}_{2},\dots ,{\varvec{\varphi }}_{m}\), where each eigenfunction \({\varvec{\varphi }}_{i}=(\varphi _{i}({\varvec{v}}_{1}),\dots ,\varphi _{i}({\varvec{v}}_{m}))'\) is an \(m\)-dimensional vector. Note that the eigensystem \(\{\lambda _i,{\varvec{\varphi }}_{i}\}_{i}\) is intrinsic to the manifold and enjoys the nice property of being isometry invariant.
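Continuing the sketch above, the eigenpairs can be obtained with a standard sparse symmetric solver; this is a sketch assuming SciPy's eigsh, using shift-invert around a small negative shift because \(C\) itself is singular (\(\lambda_1=0\)).

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lb_eigensystem(C, A, k=150):
    """Solve the generalized problem C phi = lambda A phi of Eq. (9)
    for the k smallest eigenpairs. C and A are the sparse stiffness
    and lumped mass matrices from the previous sketch."""
    # shift-invert around a tiny negative sigma: C is only positive
    # semidefinite (lambda_1 = 0), so sigma = 0 would be singular
    vals, vecs = eigsh(C, k=k, M=A, sigma=-1e-8, which='LM')
    order = np.argsort(vals)
    return vals[order], vecs[:, order]   # columns are phi_1, ..., phi_k
```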

Based on the obtained eigenfunctions and eigenvalues, several spectral signatures have been proposed in the literature to describe a single vertex on a surface. Sun et al. [15] introduced the heat kernel signature (HKS) based on the fundamental solution of the heat equation (heat kernel). Its scale-invariant version (SIHKS) was developed in [17]. Another physically inspired descriptor is the wave kernel signature (WKS), which was proposed in [16]. Unlike the HKS, the WKS separates influences of different frequencies, treating all frequencies equally. These descriptors have been shown to achieve an excellent performance in 3D shape analysis and recognition.

2.4 Bag-of-feature model

Given a set of local point-wise signatures densely computed at each vertex of the mesh surface, we quantize the signature space to obtain a compact histogram representation of the shape using the codebook approach. The geometric word vocabulary in the codebook model may be constructed in various ways, e.g., by approximate \(k\)-means [30] or hierarchical \(k\)-means [31]. We use the plain \(k\)-means method, which is also used in the Shape Google algorithm [19]. Thus, the “geometric words” of a vocabulary \(P=\{{\varvec{p}}_k, k = 1, 2,\ldots ,K\}\) are obtained as the \(K\) centroids of \(k\)-means clustering in the signature space. For each shape, a set of local spectral descriptors \(S=\{{\varvec{s}}_t, t=1,2,\ldots , T \}\) of a given type is extracted for comparison. Each local descriptor \({\varvec{s}}_t\) (represented as a vector) is associated with its nearest geometric word \( NN ({\varvec{s}}_t)\) in the codebook. Using a vector coding technique, such as hard counting or ambiguity modeling, each shape is then described by a histogram \(H\). Since the number of vertices usually differs among meshed shapes, an appropriate normalization technique is essential for the codeword-cumulative histogram representation.
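As an illustration, the following sketch builds such a vocabulary with plain \(k\)-means; SciPy's kmeans2 is an assumed stand-in for the clustering routine, and the multiple-restart protocol mirrors the one reported in Sect. 4.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def build_vocabulary(descriptors, K=32, restarts=3, seed=0):
    """Sketch of vocabulary construction by plain k-means: the K
    cluster centroids become the "geometric words". The best of
    several restarts (lowest quantization error) is kept."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for _ in range(restarts):
        centroids, labels = kmeans2(descriptors, K, minit='++', seed=rng)
        err = ((descriptors - centroids[labels]) ** 2).sum()
        if err < best_err:
            best, best_err = centroids, err
    return best
```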

The traditional codebook is the histogram of the number of local descriptors assigned to each geometric word. For each codeword \({\varvec{p}}_i\), the differences between \({\varvec{p}}_i\) and the vectors \({\varvec{s}}_t\) assigned to it are accumulated under the \(L_0\) norm as follows:

$$\begin{aligned} q_i=\sum _{{\varvec{s}}_t:NN({\varvec{s}}_t)=i}\Vert {\varvec{s}}_t-{\varvec{p}}_i \Vert _0 \end{aligned}$$
(10)

However, this may be disadvantageous because one local descriptor can be modeled better by multiple geometric words. The hard counting strategy in the traditional codebook results in quantization loss. A related method is codeword uncertainty [32], in which one image region distributes its probability mass over more than one codeword. In the Shape Google algorithm, the authors also adopted this technique: each descriptor carries a constant total probability mass of 1, which is distributed over all relevant codewords. Relevancy is determined by the ratio of the Gaussian kernel values over all codewords \({\varvec{p}}_i\) in the vocabulary

$$\begin{aligned} q_i =\sum _{t=1}^{T}\frac{\displaystyle \frac{1}{\sigma \sqrt{2\pi }} e^{ -\frac{1}{2}\frac{\Vert {\varvec{s}}_t-{\varvec{p}}_i \Vert _2^{2}}{\sigma ^2}}}{\sum _{k=1}^{K}\frac{1}{\sigma \sqrt{2\pi }} e^{-\frac{1}{2}\frac{\Vert {\varvec{s}}_t-{\varvec{p}}_k \Vert _2^{2}}{\sigma ^2}}} \end{aligned}$$
(11)

Thus, in the traditional codebook, each local descriptor selects the single best candidate geometric word, whereas codeword uncertainty does not assign the descriptor solely to the best fitting word but divides it over multiple codewords. Histograms produced by either scheme can be compared via the Chi-squared kernel. In the following section, we introduce an ISPM kernel.
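A minimal sketch of the two coding schemes, assuming NumPy and L1-normalized output histograms; the hard-counting variant is written in its usual counting form (each descriptor votes once for its nearest word), and the soft variant implements Eq. (11).

```python
import numpy as np

def hard_histogram(S, P):
    """Traditional codebook: each descriptor votes only for its
    nearest geometric word; the histogram is L1-normalized."""
    d2 = ((S[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # (T, K) squared distances
    q = np.bincount(d2.argmin(axis=1), minlength=P.shape[0]).astype(float)
    return q / q.sum()

def soft_histogram(S, P, sigma):
    """Codeword uncertainty, Eq. (11): each descriptor spreads a unit
    of probability mass over all codewords in proportion to a Gaussian
    kernel (the 1/(sigma*sqrt(2*pi)) factors cancel in the ratio)."""
    d2 = ((S[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # (T, K)
    w = np.exp(-0.5 * d2 / sigma ** 2)
    w /= w.sum(axis=1, keepdims=True)   # each descriptor's mass sums to 1
    q = w.sum(axis=0)
    return q / q.sum()
```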

3 Intrinsic spatial pyramid matching

3.1 Isocontours

The eigenfunctions of the LB operator enjoy nice properties, including isometry invariance and robustness to pose variations such as translation and rotation. These eigenfunctions are orthogonal, \(\langle {\varvec{\varphi }}_{i},{\varvec{\varphi }}_{j}\rangle _{A}=0\) for all \(i\ne j\), where the orthogonality is defined in terms of the \(A\)-inner product, \(\langle {\varvec{\varphi }}_{i},{\varvec{\varphi }}_{j}\rangle _{A}={\varvec{\varphi }}_{i}'A{\varvec{\varphi }}_{j}\). Moreover, any function \(f: \mathcal V \rightarrow \mathbb R \) (viewed as a column vector of length \(m\)) on the triangle mesh \(\mathbb M \) can be written in terms of the eigenfunctions as follows:

$$\begin{aligned} f=\sum _{i=1}^{m}\alpha _{i}{\varvec{\varphi }}_{i}, \quad \text{ where }\quad \alpha _{i}=\langle f,{\varvec{\varphi }}_{i}\rangle _{A}. \end{aligned}$$
(12)

Note that since the sum of each row in the matrix \(C\) equals zero, the first eigenvalue \(\lambda _1\) is zero and the corresponding eigenfunction \({\varvec{\varphi }_1}\) is a constant \(m\)-dimensional vector. The top row of Fig. 1 shows a 3D horse model colored by the second, third and fourth eigenfunctions, while the bottom row displays the isocontours of these eigenfunctions.
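In matrix form, and assuming the eigenfunctions are \(A\)-orthonormal (\(\Phi'A\Phi=I\), which is how generalized symmetric eigensolvers typically return them), Eq. (12) becomes a one-line computation; a sketch:

```python
import numpy as np

def expand_in_eigenbasis(f, Phi, A):
    """Eq. (12) in matrix form: alpha_i = <f, phi_i>_A, and f is
    recovered as Phi @ alpha when Phi holds all m A-orthonormal
    eigenfunctions as columns."""
    alpha = Phi.T @ (A @ f)   # A-inner products of f with each phi_i
    return alpha, Phi @ alpha
```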

Fig. 1 a–c 3D horse model colored by \({\varvec{\varphi }}_{2}, {\varvec{\varphi }}_{3}, {\varvec{\varphi }}_{4}\). d–f Level sets of \({\varvec{\varphi }}_{2}, {\varvec{\varphi }}_{3}, {\varvec{\varphi }}_{4}\)

We can use the variational characterizations of the eigenvalues in terms of the Rayleigh–Ritz quotient. That is, the second eigenvalue is given by

$$\begin{aligned} \lambda _2 = \inf _{f\perp {\varvec{\varphi }}_{\mathbf{1}}} \frac{f'Cf}{f'Af} =\inf _{f\perp {\varvec{\varphi }}_{\mathbf{1}}} \frac{\sum _{i\sim j}\kappa _{ij}(f({\varvec{v}}_{i})-f({\varvec{v}}_{j}))^2}{\sum _{i} f({\varvec{v}}_{i})^{2}a_{i}} \end{aligned}$$
(13)

and \({\varvec{\varphi }}_{\mathbf{2}}=(\varphi _{2}({\varvec{v}}_{1}),\dots ,\varphi _{2}({\varvec{v}}_{m}))'\) is its corresponding eigenvector.

The eigenvalues and eigenfunctions have a nice physical interpretation: the square roots of the eigenvalues \(\sqrt{\lambda _{i}}\) are the eigenfrequencies of a membrane and the \(\varphi _{i}(x)\) are the corresponding amplitudes at \(x\). In particular, the second eigenvalue \(\lambda _2\) corresponds to the sound we hear the best [33]. On the other hand, Uhlenbeck [34] showed that the eigenfunctions of the LB operator are Morse functions on the interior of the domain of the operator. This generic property of the eigenfunctions gives rise to the construction of the associated intrinsic isocurves.

3.2 Intrinsic spatial partition

Motivated by the invariance properties of the second eigenfunction of the LB operator, by its generic property as a Morse function, and by the fact that the second eigenvalue intuitively corresponds to the sound we hear the best [33], we propose to use the level sets (isocontours) of the second eigenfunction as cuts to partition a surface. In Fig. 2a–c, we show some examples of the level curves of \({\varvec{\varphi }_{2}}\). In Fig. 2a, we can observe that the isocontours are consistent under a large global deformation (first column), under a small local bend (second column), and across shapes from different classes that share a similar topological structure (third column). The correspondence of isocontours on shapes from the same class is displayed in Fig. 2b, which shows shapes with various topological structures. Finally, the consistency of isocontours on shapes from different classes is displayed in Fig. 2c. Although the shapes are clearly different, their isocontours capture their intrinsic correspondence well.

Fig. 2 a Isocontours are invariant under both global and local deformations. b Proportionality correspondence of pairwise nonrigid shapes with varied topological structure. c Isocontours are consistent among different classes of shapes.

The level sets of the second eigenfunction have previously been used to extract curve skeletons of nonrigid shapes [35], which is a strong indication that these isocontours capture the global topological structure of shapes.
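As a concrete illustration, a vertex can be assigned to a patch simply by binning its value of \({\varvec{\varphi }}_{2}\); this is a minimal sketch in which equal-width bins over the range of the eigenfunction are an assumed isocontour spacing.

```python
import numpy as np

def intrinsic_partition(phi2, R):
    """Assign each vertex to one of R patches by binning phi_2.
    phi2: (m,) values of the second eigenfunction at the vertices.
    Equal-width bins over [min, max] are an assumed spacing; the
    returned indices run from one end of the shape to the other,
    matching the sub-histogram order of Eq. (14)."""
    edges = np.linspace(phi2.min(), phi2.max(), R + 1)
    return np.clip(np.digitize(phi2, edges[1:-1]), 0, R - 1)
```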

3.3 Matching by intrinsic spatial partition

Instead of representing the whole shape by the codebook model, which ignores the spatial layout of the local descriptors, we propose to enhance the discrimination by integrating the distribution of local descriptors over the spatial patches determined by the intrinsic spatial partition. For any shape cut by isocontours at resolution \(\mathtt R \), its description \(H\) is the concatenation of \(\mathtt R \) sub-histograms:

$$\begin{aligned} H = [h^1,h^2,\ldots ,h^i,\ldots ,h^\mathtt R ] \end{aligned}$$
(14)

where \(h^i\) is the sub-histogram at the \(i\)th position according to the intrinsic spatial partition, ordered from one end of the shape to the other. Note that the isocontour sequence might start from either end, and the situation differs from shape to shape. For example, in Fig. 2a, the heads of the first and third rabbits are colored in blue, whereas the coloring of the second rabbit is reversed. To guarantee that semantically corresponding parts are matched in the comparison, we use an order-insensitive comparison strategy. First, we obtain a new histogram \(T\) by inverting the order of the sub-histograms in \(H\):

$$\begin{aligned} T = [h^\mathtt R ,h^{(\mathtt R -1)},\cdots ,h^{i},\cdots ,h^1]. \end{aligned}$$
(15)

Second, to compare two shapes \(P\) and \(Q\), we define their dissimilarity under this feature as follows:

$$\begin{aligned} \mathcal B ^\mathtt{R }(P, Q) = \min (\mathcal A ^\mathtt{R }(H_{P}, H_{Q}), \mathcal A ^\mathtt{R }(H_{P}, T_{Q})) \end{aligned}$$
(16)

where \(H_P\) and \(H_Q\) denote the histograms of \(P\) and \(Q\), respectively. In other words, there are two possible matching schemes between the isocontour sequences of two shapes: head-to-head and head-to-end. We take the scheme with the minimum cost as the better match. For each scheme, the dissimilarity measure \(\mathcal A ^\mathtt{R }(\cdot ,\cdot )\) is defined as

$$\begin{aligned} \mathcal A ^\mathtt{R }(H_P, H_Q) = \sum _{i=1}^\mathtt{R } \sum _{k=1}^{K} \Psi (h_P^i(k),h_Q^i(k)) \end{aligned}$$
(17)

where \(\Psi (\cdot ,\cdot )\) can be any histogram comparison metric. In this paper, we use the Chi-squared kernel, where \(h_P^i(k)\) and \(h_Q^i(k)\) are the accumulated codes of the local descriptors from \(P\) and \(Q\) that fall into the \(k\)th codeword cell/channel of the \(i\)th patch.
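A sketch of this order-insensitive, single-resolution comparison (Eqs. (14)–(17)), assuming the sub-histograms are stacked as rows of an array and using the Chi-squared distance for \(\Psi\):

```python
import numpy as np

def chi2(h, g, eps=1e-10):
    """Chi-squared distance between two histograms."""
    return 0.5 * np.sum((h - g) ** 2 / (h + g + eps))

def partition_dissimilarity(H_P, H_Q):
    """Eq. (16): order-insensitive dissimilarity at one resolution.
    H_P, H_Q: (R, K) arrays of per-patch sub-histograms ordered along
    phi_2; reversing Q's rows (Eq. (15)) gives the head-to-end scheme."""
    head_to_head = sum(chi2(h, g) for h, g in zip(H_P, H_Q))
    head_to_end = sum(chi2(h, g) for h, g in zip(H_P, H_Q[::-1]))
    return min(head_to_head, head_to_end)
```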

The degree of resolution affects the performance of the spatial partition-based method. To further improve the results, we extend the spatial pyramid, which has been shown to yield excellent performance in image analysis, to nonrigid 3D shapes. The spatial pyramid divides an image into a multi-level pyramid of increasingly fine subregions and computes a codebook descriptor for each subregion. We construct a sequence of histograms at resolutions \(\{\mathtt{R }=2^{\ell }, \ell =0,\ldots , L\}\) such that the surface at level \(\ell \) has \(2^\ell \) patches, for a total of \(2^{L+1}-1\) patches. Thus, the final dissimilarity between the histograms of \(P\) and \(Q\) is given by

$$\begin{aligned} \mathcal D ^L(P,Q)&= \mathcal B ^L(P,Q)+\sum _{\ell =0}^{L-1}\frac{1}{2^{L-\ell }}(\mathcal B ^{\ell }(P,Q)- \mathcal B ^{\ell +1}(P,Q))\nonumber \\&= \frac{1}{2^{L}}\mathcal B ^0(P,Q)+\sum _{\ell =1}^{L}\frac{1}{2^{L-\ell +1}}\mathcal B ^{\ell }(P,Q) \end{aligned}$$
(18)

The weight associated with each level is set to \(1/2^{L-\ell }\), which is inversely proportional to the cell width at that level. Intuitively, we want to penalize matches found in larger cells because they involve increasingly dissimilar features. Concerning the implementation, one issue that arises is normalization. To easily compare the single-level partition and ISPM methods, we normalize the histogram at each resolution using the \(L_1\)-norm.
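Using the single-level measure sketched above, the pyramid combination of Eq. (18) reduces to a short loop; a sketch:

```python
def ispm_dissimilarity(subhists_P, subhists_Q, L):
    """Eq. (18): weighted combination over pyramid levels 0..L.
    subhists_X[l] is the (2**l, K) array of L1-normalized
    sub-histograms of a shape at level l; partition_dissimilarity
    is the single-level measure from the previous sketch."""
    d = partition_dissimilarity(subhists_P[0], subhists_Q[0]) / 2 ** L
    for l in range(1, L + 1):
        d += partition_dissimilarity(subhists_P[l], subhists_Q[l]) / 2 ** (L - l + 1)
    return d
```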

4 Experimental results

The performance of our proposed intrinsic spatial pyramid was evaluated on two datasets, namely the SHREC 2011 benchmark [36] and the TOSCA-based robust shape retrieval database used in [19]. The first dataset is used to validate the discriminative power of ISPM across shape categories, and the second to test the robustness of ISPM.

4.1 SHREC 2011 database

The SHREC 2011 database contains 600 watertight triangle meshes, equally divided into 30 categories. SHREC 2011 is the most diverse nonrigid 3D shape database available today in terms of object classes and deformations. In Fig. 3, we show two shapes from each class in the dataset. The retrieval performance is evaluated using the discounted cumulative gain (DCG) [37]. All normalized DCG values lie in the interval \([0,1]\); higher numbers are better, and the results are comparable across queries.

Fig. 3 Sample shapes in the SHREC 2011 dataset

We performed ISPM based on the HKS and SIHKS dense descriptors, which showed excellent performance with the codebook model in the Shape Google algorithm. The first 150 eigenvalues and eigenfunctions of the LB operator of each shape are used. We experimentally select the best parameters for HKS and SIHKS on the SHREC 2011 dataset. For the HKS, we sample the diffusion time as \(t = t_0 \alpha ^{\tau }\), where \(\tau \) ranges from 0 to a given scale \(T\) with a resolution of \(1/4\); we set \(T = 5\), \(t_0 = 0.01\) and \(\alpha = 4\). For the SIHKS, we use \(t = \alpha ^{\tau }\), where \(\tau \) ranges from 1 to a given scale \(T\) with finer increments of \(1/16\); we choose \(T = 25\) and \(\alpha = 2\). After applying the logarithm, derivative and Fourier transform, all frequencies are used to obtain the best result.
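As a sketch of how these parameters translate into descriptors, the following assumes the eigenpairs from Sect. 2.3 and evaluates the HKS at the reported time samples (the SIHKS additionally applies the logarithm, derivative and Fourier-transform steps, which are omitted here); the grid sizes match the descriptor dimensions of 21 and 385 reported below.

```python
import numpy as np

def hks(evals, evecs, times):
    """Heat kernel signature k_t(x, x) = sum_i exp(-lambda_i * t) * phi_i(x)^2,
    computed for all vertices at once. evals: (k,), evecs: (m, k)."""
    return (np.exp(-np.outer(times, evals)) @ (evecs ** 2).T).T   # (m, len(times))

# Diffusion-time grids implied by the reported parameters:
tau_hks = np.arange(0.0, 5.0 + 1e-9, 0.25)      # T = 5, resolution 1/4 -> 21 samples
t_hks = 0.01 * 4.0 ** tau_hks                   # t = t0 * alpha^tau
tau_si = np.arange(1.0, 25.0 + 1e-9, 1.0 / 16)  # T = 25, increments 1/16 -> 385 samples
t_sihks = 2.0 ** tau_si                         # t = alpha^tau (log/derivative/FFT omitted)
```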

The computation of the vocabulary is performed offline in advance. To avoid poor local minima, the clustering is repeated three times, each time with a new set of initial cluster centroids, and the solution with the lowest sum of distances is retained. The running time depends on the number of descriptors (number of vertices), the dimension of the descriptor, and the vocabulary size (number of clusters). Since we simplify each mesh to 2,000 faces, we obtain a set of approximately \(6\times 10^5\) descriptors. The vocabulary size is fixed at 32; the dimension of the HKS is 21 and that of the SIHKS is 385. It is worth pointing out that we ran the BoF experiments on SHREC 2011 with vocabulary sizes of 8, 12, 16, 24, 32, 48, 64, 80 and 200. The different descriptors attain their best results with size 32, with only slight changes for larger sizes. As a result, we fixed the vocabulary size at 32 for SHREC 2011. The running times to obtain the vocabulary for HKS and SIHKS are 1,043 and 9,033 s, respectively.

First, we examine the effect of integrating spatial cues on surfaces via the intrinsic partition. Figure 4 shows the performance improvement obtained by matching shapes directly using intrinsic partitions on the SHREC 2011 dataset. As the number of intrinsic partitions increases, both HKS and SIHKS improve substantially. The performance of Shape Google is plotted as the points with partition number one. Clearly, the intrinsic spatial cues introduced by our framework significantly outperform Shape Google. We only show results with up to 19 partitions for visualization purposes; in fact, the retrieval results continue to improve slightly even when 1,024 partitions are used.

Fig. 4 Performance improvement by increasing the number of intrinsic partitions on the SHREC 2011 dataset

Table 1 Performance (DCG) comparison of ISPM and single-level partition on SHREC 2011

Next, let us examine the behavior of the ISPM. For completeness, Table 1 lists the performance achieved using just the highest level of the pyramid (the “single” columns) as well as the performance of the complete matching scheme using multiple levels (the “pyramid” columns). For both HKS and SIHKS, the results improve considerably as we go from \(L = 1\) to a multi-level setup. We do not display the results for \(L=0\) because its highest single level is the same as its pyramid. Although matching at the highest pyramid level seems to account for most of the improvement, using all the levels together helps provide stable results. For HKS with codeword uncertainty, single-level performance actually drops as we go from \(L=7\) to \(L=9\). This means that the highest level of the \(L=9\) pyramid is too finely subdivided, with individual bins yielding too few matches. Despite the diminished discriminative power of the highest level, the performance of the entire \(L=9\) pyramid remains essentially identical to that of the \(L=7\) pyramid. Thus, the main advantage of the intrinsic spatial pyramid representation stems from combining multiple resolutions in a principled fashion, and it is robust to failures at individual levels.

In Fig. 5, we show two examples of the top nine retrieval results for the different methods. There are plenty of examples demonstrating that our proposed ISPM method improves on Shape Google; we take two to illustrate the idea. Between each two blue lines in the figure, the upper row is our approach, while the bottom row is Shape Google. For the first query, alien, SIHKS confuses it with a spider, while HKS confuses it with a Santa. This is because those objects also have several long, thin pipe-like parts and flat globular parts in similar proportions. The spatial partition separates pipe-like parts and globular parts into different sub-histograms according to their global spatial positions, resulting in a more descriptive representation. In particular, the Santa and alien models share a similar body shape, but Santa's hat and the alien's horns are spatially inverted, even though these two parts are similar in terms of the proportion of primitive geometric elements. For the second query, dinosaur, ISPM successfully removes the incorrect results of the gorilla and woman models. The error with the armadillo remains, however, which is a case where ISPM fails. This is understandable, since even humans may misrecognize it at first glance. In terms of global shape structure, the armadillo and dinosaur models are almost isometric, so ISPM considers the semantically corresponding parts as good matches and compares them by their corresponding regions. Because of the similar geometric details of the four legs and the tail, ISPM is still not able to distinguish between these two shapes.

In addition to comparing shapes at their coarsest level (shape-to-shape), ISPM is also able to quantify differences at finer levels (patch-to-patch).

Fig. 5 Retrieval results using SIHKS and HKS and their ISPM versions. Erroneous results are marked with red dashed boxes. a On the left is the query shape alien; the rows on the right show its top nine retrieval results. b On the left is the query shape dinosaur; the rows on the right show its top nine retrieval results

4.2 TOSCA database

We also tested our algorithm on the TOSCA database, which was used to validate Shape Google [19]. The total positive set size was 531, categorized into 13 classes. Within each class, the shapes underwent different types of transformations: null, isometry, topology, isometry + topology, triangulation and partiality. We used 456 shapes as negatives. All the shapes were normalized to approximately the same scale. The retrieval quality was quantitatively measured using the receiver operating characteristic (ROC) curve, which represents a trade-off between the percentage of similar shapes correctly identified as similar (true positive rate, TPR) and the percentage of dissimilar shapes wrongly identified as similar (false positive rate, FPR). We used the same experimental settings as in [19]. Since the behavior of the traditional codebook and the uncertainty codebook is consistent, we only display the results of the traditional codebook for brevity. The highest level of ISPM is chosen as \(L=8\), and the dictionary size is set to 48. Figure 6 shows the ROC curves of both BoF and ISPM using HKS and SIHKS for each class of transformations. As can be seen, the ISPM method outperforms the BoF of Shape Google in all cases.

Fig. 6 ROC curves (true positive vs. false positive rate) for different classes of shape transformations using SIHKS and HKS with both BoF and ISPM

4.3 Strengths and weaknesses of the proposed approach

Like BoF, the proposed method performs worse than ShapeDNA on the SHREC 2011 benchmark. This is due in part to the fact that ShapeDNA is particularly good at retrieving near-isometric copies of a shape. On the TOSCA-based database, however, which contains different classes of mesh transformations, ISPM performs best, on the grounds that the partitions provided by the second eigenfunction are stable. We may summarize the strengths and weaknesses of our approach as follows:

  • Strengths: (1) The main advantage of ISPM over BoF is its integration of spatial information in a principled way. (2) ISPM provides a coarse correspondence of shapes.

  • Weaknesses: (1) A major drawback of ISPM is determining an appropriate partition number. Such a limitation also exists in the original SPM paper of Lazebnik et al. [4]. Using too many partitions on the surface tends to degrade the performance of the proposed algorithm, largely because of mismatching. (2) Unlike graph-based methods, ISPM may still lose topological information, which is critical for distinguishing between shapes from different classes.

5 Conclusion and future work

We developed an intrinsic version of the SPM, making it suitable for the analysis of deformable 3D shapes. Our construction is based on the isocontours of the second eigenfunction of the LB operator on Riemannian manifolds. The proposed partitioning captures global topological information about the shape and provides a deformation-invariant representation. Furthermore, the ISPM is able to establish a global correspondence among shapes. It can be used in combination with any dense shape descriptor, e.g., the heat kernel signature or the scale-invariant heat kernel signature, and consistently achieves a notable improvement over the BoF model, which only encodes orderless local information.

We plan to extend this work in two directions. First, the spatial pyramid framework offers insights into the success of the different dense shape descriptors in our experiments. Performing a spatial partition-based investigation of all the recent spectral descriptors, such as the wave kernel signature, may therefore provide practical guidance for further applications. Second, using the proposed global system of coordinates, intrinsic versions of many other aggregation-based compact representations popular in image analysis, such as the Fisher vector, can be designed. We intend to explore these constructions in future work.