TransHist: Occlusion-robust shape detection in cluttered images

Shape matching plays an important role in various computer vision and graphics applications such as shape retrieval, object detection, image editing, image retrieval, etc. However, detecting shapes in cluttered images is still quite challenging due to the incomplete edges and changing perspective. In this paper, we propose a novel approach that can efficiently identify a queried shape in a cluttered image. The core idea is to acquire the transformation from the queried shape to the cluttered image by summarising all point-to-point transformations between the queried shape and the image. To do so, we adopt a point-based shape descriptor, the pyramid of arc-length descriptor (PAD), to identify point pairs between the queried shape and the image having similar local shapes. We further calculate the transformations between the identified point pairs based on PAD. Finally, we summarise all transformations in a 4D transformation histogram and search for the main cluster. Our method can handle both closed shapes and open curves, and is resistant to partial occlusions. Experiments show that our method can robustly detect shapes in images in the presence of partial occlusions, fragile edges, and cluttered backgrounds.


Introduction
Shape matching plays an important role in various computer vision and graphics applications such as shape retrieval, object detection, image editing, image retrieval, etc. Compared to gradient and texture features, shape features are much more reliable when objects are characterized by distinctive shapes, such as road signs in images and videos. In this paper, we focus on detecting shapes in cluttered images by analyzing point-to-point transformations.
In the early days, methods were proposed to measure shape similarity by transforming shapes into other domains, using, e.g., wavelet-based transforms [1] and Fourier transforms [2][3][4]. Methods have also been proposed to transform shapes into the curvature domain, using curvature inflection points for shape matching [5,6]. Later, various shape descriptors were proposed and used for measuring shape similarity. Shapes may also be described using triangle areas, by forming a set of triangles at a reference point [7]. Use has also been made of an integral kernel to extract shape characteristics within a region centered at a reference point [8,9]. The state-of-the-art shape context shape descriptor utilizes a log-polar diagram to statistically record the spatial distribution of shapes at each sample point [10].
However, most existing shape matching algorithms or shape descriptors are only designed for matching shapes with clean and clear edges. Unsatisfactory results may be obtained if they are used to detect shapes in cluttered images. First of all, edges in cluttered images are fragile edges: shapes may be cut into fragments (blue box in Fig. 1(b)). With no tolerance for fragile edges, whole-shape descriptors naturally fail to detect shapes with fragile edges. Furthermore, region-based shape descriptors are also unable to extract features from fragile edges since they assume that shapes are closed. Secondly, partial occlusions (yellow box in Fig. 1(b)) typically occur, again hindering whole-shape descriptors from finding the correct solution. Lastly, the background in cluttered images may be extremely noisy (red box in Fig. 1(b)). Noise greatly affects shape descriptors in the spatial domain. Moreover, cluttered backgrounds also increase computational costs.
The inner-distance shape context [11] extended the original shape context to tackle partial-shape matching, but it still requires the shapes to be closed. Recently, Kwan et al. [12] proposed a point-based shape descriptor called the pyramid of arc-length descriptor (PAD) to tackle partial occlusion, and their method is also applicable to open curves. However, their method is greatly affected by cluttered backgrounds since they rely on distance fields for final shape matching. Moreover, it is computationally expensive.
In this paper, we present an efficient shape detection method tailored for shape detection in cluttered images. We found that, while shape detection can be regarded as finding the optimal transformation from a queried shape to the shape in the image, we may first identify the optimal transformation for each point on the shape and then summarise all transformations to get the optimal transformation for the whole shape.
In order to find the optimal transformation for each point efficiently, we adopt PAD [12] as the shape descriptor for each point and find the optimal corresponding point in the image by comparing similarity of PAD. We use PAD because of its scale, rotational, and translational invariance. Furthermore, PAD is applicable to both closed and open curves, and is robust to partial occlusion. We then calculate the transformation between corresponding points based on PAD (Fig. 5), and form a 4D transformation histogram to summarise the transformations of all points on the shape. The main cluster of transformations is the detected result. Figure 1(c) shows the result detected by directly applying PAD, while Fig. 1(d) shows the result detected by our method. It can be seen that our method is more robust to noise and therefore more appropriate for detecting shapes in cluttered images. We later provide the results of two experiments to validate the robustness of our method to partial occlusion and cluttered backgrounds.
Our contributions can be summarized as follows:
• an efficient method for detecting shapes in cluttered images;
• the ability to handle fragile edges, partial occlusion, and cluttered backgrounds simultaneously.

Related work
Detecting shapes in images is a fundamental problem in computer vision and graphics. Over the past few decades, various two-dimensional shape descriptors have been proposed to describe the characteristics of a shape. They can be broadly classified into two categories: global shape descriptors, which describe the characteristics of the whole shape, and local shape descriptors, which describe the local characteristics of a shape at specific points.

Global shape descriptors
A popular family of global shape descriptors performs a domain transform on the shape. Fourier descriptors [2][3][4] can be used as shape signatures that capture shape characteristics (e.g., centroid distance and cumulative angles along the shape boundary) in the Fourier domain. Wavelet descriptors [13] rely on wavelet transforms to obtain a multiresolution representation of a shape from the shape boundary. Radon descriptors [14] rely on R-transforms, a variant of Radon transforms, of the shape to obtain the shape properties. Curvature scale space (CSS) [5,6] describes a shape by recording its inflection points at different levels of smoothing. It represents the changes in location of inflection points on the shape boundary with smoothing. Later, Lee et al. [15] proposed shape signature harmonic embedding (SSHE), which uses discrete harmonic functions to replace smoothing in the construction. Descriptors have also been proposed based on moment theory, including Hu moments [16], Zernike moments [17], and image moments [18,19]. Bernier and Landry [20] proposed a polar representation that plots the orientation of each boundary point referenced from the centroid of the shape as a description.
However, all of the above global shape descriptors describe shapes in an integral manner. They do not extract any local detail of the shape. Thus, we cannot directly use these global shape descriptors to measure shape similarity for open curves or complex shapes. Furthermore, global shape descriptors may also fail when partial occlusions exist.

Local shape descriptors
Local shape descriptors are generally point-based, describing local shape characteristics at reference points. One descriptor is built for each reference point, and all descriptors together form a rich description of the whole shape.
Shape context [10] is the state-of-the-art point-based shape descriptor. It describes the shape distribution of boundary points by a log-polar diagram centered at a reference point. Mori et al. [21][22][23] introduced fast pruning to speed up the matching process for shape contexts. Inner distance shape contexts [11] are an extension of shape contexts which use inner distance instead of traditional Euclidean distance in length measurement. However, shapes need to be closed to measure inner distances. Furthermore, all methods based on shape contexts are error-prone when used to detect shapes in cluttered images since shape contexts describe shapes in raster space.
Tănase et al. [24,25] proposed use of a turning function that measures the turning angle along the boundary of the shape. Triangle area representation (TAR) [7] describes the shape of a reference point by the areas of triangles starting from this point. Integral invariants [8] describe the shape of a reference point using the integral kernel within a region centered at this point. Hong and Soatto [9] further proposed a multi-scale integral invariant approach.
Although these point-based shape descriptors describe local shape characteristics, global normalization is still needed to achieve scale invariance, as they are not inherently scale invariant. These point-based descriptors cannot be directly applied to detecting shapes in cluttered images.
Recently, Kwan et al. [12] proposed a locally scale-invariant shape descriptor, the pyramid of arc-length descriptor (PAD). The locally scale-invariant property enables more robust detection of shapes in the presence of cluttered edges and noise. However, PAD only represents a very limited range of local shapes at a reference point, so further evaluation is needed for shape matching. While Kwan et al. used a distance field for this purpose, the resulting ability to detect shapes in cluttered images is poor, since distance fields are quite sensitive to noise in raster images. Instead, we adopt a transformation histogram for shape matching, making our method much more robust to noise.

Shape-based object detection
Various methods have been proposed to detect objects in 2D or 3D space [26,27]. Here, we only focus on object detection in 2D, i.e., images. Most existing methods are based on shape context [10] since it is the state-of-the-art shape descriptor. In particular, Lian et al. [28] proposed to detect shapes with a novel outlier-resistant shape context distance that ignores outliers in norm-2 distance in the original shape context. Thayanathan et al. [29] introduced a continuity constraint into shape context matching, restricting the correspondence to be formed by nearby points. However, these methods cannot achieve scale invariance and are only applicable to detecting shapes at the same scale.
Riemenschneider et al. [30] proposed a partial-shape matching method to locate objects. However, their method is not truly scale invariant. Shape bands [31] can tolerate shape deformation to some extent, within a fixed bandwidth. But this method is also not scale invariant and is error-prone in the presence of cluttered edges. Cheng et al. [32] proposed to use a boundary band map to search for repeated elements with similar shapes, but user interaction is needed. The chordiogram method [33] works by first forming a set of chords by joining boundary points together, and then uses the lengths, orientations, and normal directions of the boundary points forming the chords to form a chordiogram for shape matching. However, this method is neither scale invariant nor rotationally invariant. Chi and Leung [34] proposed to decompose a shape into primitives and perform partial-shape matching by searching an indexed structure of primitives. However, they do not take scale invariance into consideration. In contrast, our approach achieves scale invariance and rotational invariance; only a single description is needed to describe the local shape in a scale-invariant and rotation-invariant manner.
Methods based on neural networks have also been proposed to detect objects [35][36][37]. While neural networks may achieve better results than traditional low-level methods, they rely highly on training data. In this paper, we aim to detect shapes in cluttered images by only relying on the shapes' characteristics.

System overview
In this paper, we propose a novel shape detection method which calculates and analyzes a transformation histogram relating the queried shape to the cluttered image. Figure 2 shows the framework of our method. Given a cluttered input image and a queried shape, we first extract the edges from both the image and the queried shape. To extract edges from the cluttered image, we use the Canny edge detector [38]. For the queried shape, we simply identify all boundary pixels as edges of the shape. We then calculate the local shape feature of each point using an existing point-based shape descriptor, the pyramid of arc-length descriptor (PAD) [12]. Note that PAD can be computed for all points no matter whether they are on closed or open curves. Moreover, PAD only describes local shape features along a single edge. No redundant or disturbed information from the cluttered background is embedded in PAD. In addition, it is scale invariant, rotationally invariant, and translationally invariant, allowing the detection of shapes with changes in size and orientation.
The next step is to find all pairs of edge points with similar PAD features (see Section 4.1). We observe that there is a high probability that the points will be correctly matched, and the transformations for all correctly matched point pairs should be quite similar (Fig. 6). Thus, we first calculate the transformation for each pair of edge points in correspondence (see Section 4.2), and then form a transformation histogram using all transformations, to identify the main cluster that represents the transformation of the whole shape (see Section 4.3).
Our method is fully parallelisable for use on a GPU.

Point-to-point matching via PAD
Given the edge maps from an image and a query shape, we first extract the shape features of all reference points on the edges using a local point-based shape descriptor, the pyramid of arc-length descriptor (PAD) [12]. The features extracted by PAD are locally scale invariant, rotationally invariant, and translationally invariant. PAD can provide the precise transformation between two points with similar local shapes, which matches our requirement well. Note, however, that our framework can accept any local shape descriptor that provides point-to-point transformations.
For completeness, we briefly introduce PAD and how it is used for local shape matching between two reference points. The PAD shape feature is extracted from the integral of absolute curvature (IAC) [39]. Given a curve (Fig. 3), the integral of absolute curvature $\tau$ over a curve segment between points $s$ and $t$ is defined as
$$\tau(s : t) = \int_{s}^{t} |\kappa(x)| \, dx,$$
where $\kappa(x)$ is the curvature at point $x$ on the curve. PAD encodes the shape by combining the IAC domain and the arc-length domain. It constructs a pyramid of arc-length intervals centered at a point $p$, such that each interval corresponds to a fixed integral value of absolute curvature ($2^i \Delta\tau$), cumulated from $p$. The pyramid of arc-lengths $L_i$ and $R_i$ can be extracted by integrating different levels of absolute curvature in the IAC domain. For level $i$, the two intervals $(l_i : p)$ and $(p : r_i)$ have the same IAC value $2^i \Delta\tau$, so
$$\tau(l_i : p) = \tau(p : r_i) = 2^i \Delta\tau.$$
After cumulating arc-length on both the left- and right-hand sides at different levels, the PAD vector is defined using this set of arc-length values. The corresponding arc-lengths of these IAC intervals form the initial PAD vector $M_{init}$. Figure 4 shows the set of intervals sampled for 5 levels and the IAC value accumulated for intervals from each level.
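The construction of the arc-length pyramid can be illustrated on a discretized curve. The following is a minimal sketch under stated assumptions, not the implementation of Kwan et al.: the curve is taken to be an open polyline, the IAC is approximated by summing absolute turning angles at the vertices, and the function names (`turning_angles`, `arc_length_to_iac`, `arc_length_pyramid`) are hypothetical. It returns the raw arc-lengths for each level only, without any normalization.

```python
import math

def turning_angles(points):
    """Absolute turning angle at each vertex of an open polyline (0 at both ends)."""
    n = len(points)
    angles = [0.0] * n
    for k in range(1, n - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[k - 1], points[k], points[k + 1]
        d = math.atan2(y2 - y1, x2 - x1) - math.atan2(y1 - y0, x1 - x0)
        d = (d + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        angles[k] = abs(d)
    return angles

def arc_length_to_iac(points, angles, p, step, target):
    """Walk from vertex p in direction step (+1 or -1) until the accumulated
    absolute turning reaches target. Return the arc-length walked, or None
    if the curve ends before the target is reached."""
    iac, length, k = 0.0, 0.0, p
    while 0 <= k + step < len(points):
        length += math.dist(points[k], points[k + step])
        k += step
        iac += angles[k]
        if iac >= target - 1e-12:
            return length
    return None

def arc_length_pyramid(points, p, delta_tau, levels):
    """Return [(L_0, R_0), (L_1, R_1), ...], where level i spans an IAC of
    2**i * delta_tau on each side of vertex p."""
    angles = turning_angles(points)
    pyramid = []
    for i in range(levels):
        target = (2 ** i) * delta_tau
        left = arc_length_to_iac(points, angles, p, -1, target)
        right = arc_length_to_iac(points, angles, p, +1, target)
        pyramid.append((left, right))
    return pyramid
```

On a circular arc, where curvature is constant, the arc-length at each level doubles along with the IAC value. On a straight segment the target IAC is never reached, so the sketch returns `None` for that side.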
Normalization is then performed on $M_{init}$ to achieve scale invariance. The final PAD vector $m(p)$ is defined from the normalized arc-lengths together with a sign $s \in \{+1, -1\}$. Here, $s$ is the sign of curvature at $p$, indicating whether the curve is convex ($+1$) or concave ($-1$) near the point of interest. The PAD distance between the local shapes around two points $p$ and $q$ is denoted $D_{p,q}$, and is the $l_\infty$-norm of the difference of the two PAD vectors:
$$D_{p,q} = \| m(p) - m(q) \|_\infty.$$
We can now estimate the local shape similarity of two points using this PAD distance. We find all point pairs (one on the queried shape and the other in the image) with PAD distance smaller than or equal to $K = 0.2$ and denote these point pairs as matching pairs. We may decrease $K$ to force the matching pairs to have more similar local shapes, but with reduced tolerance for shape deformations.
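The selection of matching pairs can be sketched as a brute-force comparison of PAD vectors. This is an illustrative sketch only: it assumes that the PAD vectors have already been computed and have equal length, that a pair is accepted when its PAD distance is at most K (a smaller distance meaning more similar local shapes), and the function names are hypothetical.

```python
def pad_distance(m_p, m_q):
    """l-infinity norm of the difference of two equal-length PAD vectors."""
    if len(m_p) != len(m_q):
        raise ValueError("PAD vectors must have the same length")
    return max(abs(a - b) for a, b in zip(m_p, m_q))

def find_matching_pairs(query_pads, image_pads, k=0.2):
    """Return (i, j, distance) for every query point i and image point j
    whose PAD distance is at most k."""
    pairs = []
    for i, m_p in enumerate(query_pads):
        for j, m_q in enumerate(image_pads):
            d = pad_distance(m_p, m_q)
            if d <= k:
                pairs.append((i, j, d))
    return pairs
```

Each pair is tested independently of the others, so the double loop lends itself to a parallel implementation.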

Transformation of matching pairs
We define the transformation between two points, and thus two local shapes, to be a 2D transformation comprising scaling, rotation, and translation along the x- and y-axes. For example, Fig. 5 shows a queried shape $Q$ (left) and an edge map $E$ (right) with an identified matching pair $p_i \in Q$ and $q_j \in E$. We build a vector (red arrows in Fig. 5) from the endpoints of the last level coverage of PAD for points $p_i$ and $q_j$. The relative magnitude of the two vectors indicates the change in scale $s_{p_i,q_j}$ between the two shapes under the PAD coverage. The angular difference of the two vectors indicates the change in orientation $\theta_{p_i,q_j}$ of the two shapes locally. The translation between the two points $(x_{p_i,q_j}, y_{p_i,q_j})$ can be computed as the spatial distance between the two vectors. Then we can write the transformation $T_{i,j}$ as
$$T_{i,j} = (s_{p_i,q_j}, \theta_{p_i,q_j}, x_{p_i,q_j}, y_{p_i,q_j}),$$
where $i$ and $j$ are indices of points on the queried shape $Q$ and the edge map of the cluttered image $E$ respectively.
Note that the transformation model used here first translates the queried shape to the position defined by the vector, and then scales and rotates the queried shape correspondingly. Translation is referenced to the centroid of the shape in order to avoid deviations in position of the matching pair. This avoids translations being affected by the scaling and rotation components.
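As a rough illustration, the four components of a point-to-point transformation could be computed as below. This is a sketch under explicit assumptions rather than the exact model used in our implementation: the function name is hypothetical, and the translation is assumed to be the image position to which the centroid of the queried shape is sent, i.e., a query point x is mapped to t + s R(θ)(x − c), with t chosen so that p is mapped onto q.

```python
import math

def pair_transformation(p_left, p_right, q_left, q_right, p, q, centroid):
    """Return (scale, rotation, tx, ty) for one matching pair.

    p_left, p_right: endpoints of the last-level PAD coverage around p (query).
    q_left, q_right: endpoints of the last-level PAD coverage around q (image).
    centroid: centroid of the queried shape (assumed reference for translation).
    """
    vp = (p_right[0] - p_left[0], p_right[1] - p_left[1])
    vq = (q_right[0] - q_left[0], q_right[1] - q_left[1])

    # Relative magnitude of the two vectors gives the change in scale.
    scale = math.hypot(vq[0], vq[1]) / math.hypot(vp[0], vp[1])

    # Angular difference gives the change in orientation, wrapped to [-pi, pi).
    theta = math.atan2(vq[1], vq[0]) - math.atan2(vp[1], vp[0])
    theta = (theta + math.pi) % (2 * math.pi) - math.pi

    # Assumed model: x -> t + scale * R(theta) * (x - centroid), with p -> q.
    dx, dy = p[0] - centroid[0], p[1] - centroid[1]
    c, s = math.cos(theta), math.sin(theta)
    tx = q[0] - scale * (c * dx - s * dy)
    ty = q[1] - scale * (s * dx + c * dy)
    return scale, theta, tx, ty
```

Under this assumed model, matching pairs that belong to the same instance of the shape all yield the same translation, regardless of where the pair lies on the shape.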

Transformation histogram
We now obtain a set of transformations $T = \{T_{i,j}\}$ for all matching pairs. All these transformations hint at possible locations of the queried shape in the cluttered image. It can be easily observed from Fig. 6 that matching pairs between two similar shapes should have similar transformations. Based on this observation, we use a transformation histogram to cluster the transformations.
Before putting these transformations into the histogram, we first normalize each component for better comparison. Given the original scale $s_{i,j}$ for a transformation $T_{i,j}$, we assume the scale of the queried shape cannot be larger than the size of the cluttered image. Therefore, we normalize the scale by the diagonal of the cluttered image. We only take the normalized scale $s_{i,j} \in [0, 1]$ as an effective transformation and discard all transformations with normalized scale larger than 1. For orientation, we simply normalize it by the maximum possible rotation, i.e., $\pi$. The range of the normalized rotation $\theta_{i,j}$ is thus $[-1, +1)$. We also normalize the x-translation $x_{i,j}$ and the y-translation $y_{i,j}$ by the width and height of the cluttered image respectively. Any normalized translation components $x_{i,j}$ and $y_{i,j}$ outside the range $[0, 1]$ are taken as unsuitable transformations and discarded.
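The normalization and filtering described above might be sketched as follows. This is a minimal illustration: the function name is hypothetical, and the scale is assumed to be expressed in pixels (for instance, the size of the transformed queried shape) so that dividing it by the image diagonal is meaningful.

```python
import math

def normalize_transformation(size, theta, tx, ty, width, height):
    """Normalize one transformation. Return (s, r, x, y), or None if the
    transformation should be discarded.

    size:   scale component in pixels (assumption, see lead-in).
    theta:  rotation in radians, in [-pi, pi).
    tx, ty: translation in pixels, in image coordinates.
    """
    s = size / math.hypot(width, height)  # normalize by the image diagonal
    r = theta / math.pi                   # normalized rotation in [-1, 1)
    x = tx / width
    y = ty / height
    # Discard transformations whose scale or translation leaves [0, 1].
    if not (0.0 <= s <= 1.0 and 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        return None
    return s, r, x, y
```

Returning `None` for discarded transformations lets a caller filter them out before voting.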
Each matching pair contributes a score to its corresponding histogram bin. The score of a matching pair $(i, j)$ is defined in terms of $D(i, j)$, the PAD distance between $i$ and $j$, and $\kappa$, the curvature. The aim is to consider local similarity and local smoothness for each matching pair $(i, j)$. We want locally more similar matching pairs to contribute higher scores. Since a smaller PAD distance means shapes are more similar, we put $D(i, j)$ in the denominator. We further weight the score by the local smoothness of the matching pair $(i, j)$, as smoother edges deliver less shape information and are more likely to be matched with other smooth edges. By contrast, more rapidly changing edges contain more information. Figure 7 shows examples of matching two smooth edge segments and two more rapidly changing edge segments. Dashed lines indicate pairs of points which are locally similar and form a matching pair. We can easily observe that each point in Fig. 7(a) matches several locally similar points in Fig. 7(b). In contrast, the indicated point in Fig. 7(c) only matches one point in Fig. 7(d).
Without consideration of local smoothness, the transformation histogram will be dominated by matching pairs of smooth points. We overcome this issue by weighting scores of matching pairs by their curvatures, κ(i) and κ(j).
For each bin $n$ corresponding to a certain transformation range $[T_n, T_{n+1})$, the final score $S_n$ is the sum of scores of all matching pairs for that bin:
$$S_n = \sum_{(i,j)\,:\,T_{i,j} \in [T_n, T_{n+1})} \mathrm{score}(i, j),$$
where $\mathrm{score}(i, j)$ denotes the score of matching pair $(i, j)$. In practice, we set the numbers of bins for scale, rotation, and x- and y-translations to {10, 10, 50, 50}. By finding the bin with the largest score in the histogram, we can find the target shape in the image. Since it is possible that no similar shape exists in the cluttered image, we set a threshold on the fraction of points required for a match to exist. If the fraction is too low (empirically < 30%), we conclude that there is no similar shape in the image and return no match.
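The voting step can be sketched with a sparse 4D histogram. This is a simplified illustration and not our GPU implementation: the function names are hypothetical, the per-pair scores are taken as given inputs, the transformations are assumed to be already normalized and filtered, and the check on the fraction of supporting points is omitted.

```python
from collections import defaultdict

BINS = (10, 10, 50, 50)  # scale, rotation, x-translation, y-translation

def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to a bin index in 0 .. n_bins - 1."""
    k = int((value - lo) / (hi - lo) * n_bins)
    return min(max(k, 0), n_bins - 1)

def vote(transformations, scores, bins=BINS):
    """Accumulate per-pair scores into a sparse 4D histogram keyed by bin."""
    hist = defaultdict(float)
    for (s, r, x, y), score in zip(transformations, scores):
        key = (
            bin_index(s, 0.0, 1.0, bins[0]),
            bin_index(r, -1.0, 1.0, bins[1]),
            bin_index(x, 0.0, 1.0, bins[2]),
            bin_index(y, 0.0, 1.0, bins[3]),
        )
        hist[key] += score
    return hist

def main_cluster(hist):
    """Return (bin, total score) of the highest-scoring bin, or None if empty."""
    if not hist:
        return None
    return max(hist.items(), key=lambda kv: kv[1])
```

A dictionary is used here because most of the 10 × 10 × 50 × 50 bins stay empty; a dense array would serve equally well.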

Experiments
In this section, we describe several experiments conducted to evaluate our shape detection method. Firstly we show some results of detecting shapes in cluttered images.
Note that all results are directly plotted onto edge maps to aid visualization. We conduct two further experiments to validate the robustness to occlusion and cluttered edges respectively.

Shape detection in cluttered image
First we show a few results to demonstrate the ability of our method to detect shapes in cluttered images. Figure 8(a) shows a real life photo containing a swan in the background. Since most existing shape descriptors cannot support open curves, we only compare our method with PAD-DF [12] and IDSC [11].
IDSC fails to detect the swan as inner distance is not defined for points across the disjoint components (see Fig. 8(b)). PAD-DF fails to locate the swan as it is confused by the clutter (see Fig. 8(c)). Our method successfully finds the location of the swan in the image (see Fig. 8(d)).
Since it is unfair to compare IDSC in such cases, in the next two results, we only compare our method with PAD-DF. Figure 9(a) shows an image from an object detection dataset, the ETHZ shape dataset [40]. The aim is to detect the Apple logo (shown at top left in Fig. 9(a)). Leaves on trees in the right part of the image lead to crowded edges (see Fig. 9(b)), which lead to incorrect detection results for PAD-DF (the Apple logo marked in red). In contrast, our method successfully avoids the effects of crowded edges and correctly matches the Apple logo in the image (see Fig. 9(c)).
We show another example in Fig. 10(a), where the Apple logo is partially occluded by a wire. When detecting shapes in real images, PAD-DF is much more error-prone in the presence of crowded edges. Even with partial occlusion, our method still recognizes the Apple logo correctly (see Fig. 10(c)).

Tolerance to occlusion
Partial occlusion is one of the major challenges in shape detection, and it is quite common in real world scenarios. This experiment aimed to explore the tolerance to partial occlusion, using the dataset proposed by Kwan et al. [12]. It contains a set of shapes and clipped instances of them. The shapes are from the MPEG7 dataset [41], and comprise 70 classes each with 20 shapes. Each shape is gradually occluded from left to right. The occlusion rate goes from 0% to 90% in 10% increments, giving 14,000 shapes and partial shapes in total.
We follow the rendering approach used by Kwan et al. [12] to quantify the goodness of matching: we render the clipped instance (after transformation) on top of the original shape. Let $C$ be the set of pixels belonging to the transformed clipped instance in the space of the original shape and $C_o$ be the set of pixels where $C$ should be, again in the space of the original shape. We then measure the matched fraction γ between $C$ and $C_o$; γ = 1 indicates a perfect match. Due to rasterization and numerical errors, we may not get a perfect match even if the match is visually perfect. Hence, we regard a transformation with γ > 0.95 as a successful match. Figure 11 plots the success rate against the degree of occlusion. We can see that PAD-DF and our method can still recognize shapes even in the presence of significant occlusion: even with 80% occlusion, we still achieve a success rate of around 25%. Most other descriptors, including TAR, SC, CSS, and II, are unable to deal with partial occlusion, and their success rates drop extremely quickly. Clearly, whole shape matching descriptors are inappropriate for measuring shape similarity in the presence of occlusion. Our transformation approach also outperforms PAD-DF, since the transformation histogram guarantees that matching pairs with similar transformations are grouped together, while the distance field is greatly affected by the cluttered background.

Tolerance to cluttered edges
We further evaluate robustness to cluttered edges since it is an important factor when detecting shapes in cluttered images. To mimic cluttered edges, we add random arcs to the edge map of a clean shape. We only control the total length of all added arcs with respect to the length of the edges of the original shape. Figure 12 shows a few examples of cluttered instances with different amounts of clutter.
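One way to generate such clutter is sketched below. This is an illustration only, not the generator used in the experiment: the arcs are assumed to be circular, the parameter ranges (`max_radius`, `max_span`) and function names are hypothetical, and the arcs are returned in parametric form rather than rasterized into the edge map. The only quantity controlled is the total arc length relative to the edge length of the original shape.

```python
import math
import random

def random_arc(rng, width, height, max_radius, max_span):
    """Sample a circular arc as (cx, cy, radius, start_angle, span)."""
    cx, cy = rng.uniform(0, width), rng.uniform(0, height)
    radius = rng.uniform(1.0, max_radius)
    start = rng.uniform(0, 2 * math.pi)
    span = rng.uniform(0.1, max_span)
    return cx, cy, radius, start, span

def add_clutter(edge_length, ratio, width, height, seed=0,
                max_radius=50.0, max_span=math.pi):
    """Sample arcs until their total length equals ratio * edge_length.
    Return (arcs, total_length)."""
    rng = random.Random(seed)
    budget = ratio * edge_length
    arcs, total = [], 0.0
    while total < budget:
        cx, cy, radius, start, span = random_arc(rng, width, height,
                                                 max_radius, max_span)
        length = radius * span
        if total + length >= budget:
            # Trim the last arc so the total length matches the budget.
            span = (budget - total) / radius
            arcs.append((cx, cy, radius, start, span))
            total = budget
            break
        arcs.append((cx, cy, radius, start, span))
        total += length
    return arcs, total
```

For example, `add_clutter(1000.0, 0.5, 640, 480)` samples arcs whose combined length is 50% of a shape with 1000 pixels of edge.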

Fig. 12 Cluttered instances, indicating the amount of clutter relative to the original total edge length.

Figure 13 plots the results of our experiment. Since most existing descriptors do not support matching shapes with multiple boundaries and open curves, we only compared our approach with shape contexts and PAD-DF. Our method is much more successful at matching than the other two methods in the presence of cluttered edges. Shape contexts are highly influenced by cluttered edges since they affect the global quantity used for normalization. PAD-DF is also affected by cluttered edges because the distance field is extremely sensitive to noise. Our method works best since cluttered edges are filtered out in the transformation domain.

Discussion and conclusions
In this paper we have presented a new shape detection approach that robustly detects shapes in cluttered images. By utilizing PAD, our method is able to support shape matching for both open curves and closed shapes, in a scale-invariant, rotation-invariant, and translation-invariant manner. Moreover, our method can detect shapes in the presence of partial occlusion.
Our method also has certain limitations. It is sensitive to shape deformation, including change of perspective, because the shape descriptor we use to calculate the transformation does not provide perspective invariance. Figure 14(b) shows a failure in such a case. Although we successfully find the location of the Apple logo in the image, we fail to match the whole shape with a correct transformation. Our method is also sensitive to noise of a kind that leads to changes in curvature of boundary points.