
1 Introduction

Man-made environments abound with varied manifestations of planar homogeneous texture, i.e., regularly repeating structure or motifs aligned along planes. Figure 1 depicts such “texture” from various indoor scenes in the MIT Indoor67 dataset [1] — repeating objects defining scene content (stacked laundry machines, bookshelves, wine barrels, theater seating, etc.), architectural and structural elements (brickwork, frameworks, repeating beams and columns), carpets printed or engraved with uniform patterns, tilings, ceiling fixtures, and even shadows (provided the light source is sufficiently far away and the blocking objects uniformly spaced)! Such ubiquitous textures hold great potential for favorably influencing high-level scene understanding. Yet, the tools currently at our disposal are woefully inadequate for detecting and analyzing textured regions “in the wild”, which is key to realizing that potential. In this paper, we examine the technical challenges in detecting these textured regions, develop the machinery necessary to overcome them, and then exploit the detected regions for scene understanding.

Fig. 1. Abundantly present and variedly manifested, homogeneous texture in indoor scenes can serve as a useful mid-level feature for recognition. Rectification of such texture can mitigate in-class variation arising from perspective projection.

Even though invariant texture description and recognition have received regular attention for decades in computer vision, these low-level vision tasks have not been actively pursued as a means to solve high-level vision problems. The reasons are manifold. Firstly, it is difficult to secure a precise definition for texture [2], and its optimal representation often necessitates a variety of different mechanisms (such as reaction-diffusion models, grey-level co-occurrence, transform methods, etc.) depending on the circumstances. The same texture can also look significantly different at different scales. When we want to detect and analyze textures in the wild (that is, when the textured regions have not been segmented or cropped), the task is complicated by another order of magnitude. Figure 2 illustrates, using some MIT Indoor67 images, the challenges involved in localizing such patterns. The texture of interest is often interleaved with other scene content, and such outliers can occupy a large spatial extent — e.g., aisles separating seating sections, beams or arches interfering with repeating columns, a backdrop visible through a colonnade, or music stands cluttering patterned woodwork on a concert stage. Photometric severities may be present, such as reflections blocking out underwater pool lanes, or low image contrast and varying illumination over a given texture due to insufficient lighting in underground cellars. Finally, texture projected to the image plane inevitably exhibits perspective distortion. Existing region extractors [3] only afford affine invariance, detect low-level features such as blobs and edges, and cannot localize potentially large patches depicting meaningful texture. In this regard, our first contribution is the projective-invariant detection of homogeneous textured planar patches in real-world images, as well as their affine rectification. In Fig. 2, our approach is seen to successfully overcome the aforementioned challenges, commonplace in real images. We also present quantitative evaluations of our method, outperforming existing work on the tasks of detection and rectification.

Our second contribution is to show how detected homogeneous texture and its recovered projective parameters can be used to obtain indoor geometric layouts in multi-planar textured scenes. In doing so, we sidestep the error-prone, ill-posed computation of vanishing points to establish room orientation, and eschew the simplistic Manhattan or box layout assumption [4]. This also contrasts with existing work [5] that employs machine learning to localize room faces in space and scale.

As seen in Fig. 1, semantically similar image patches can exhibit significant viewpoint differences. Since gradient-based low-level image descriptors used in recognition, such as SIFT or HOG, are not invariant to projective transforms, this can adversely affect classification performance. Our third contribution is to demonstrate that plane projective rectification can benefit a recognition pipeline by mitigating this geometric in-class variation. We report improved classification on the MIT Indoor67 [1] benchmark when densely extracted descriptors from affine-rectified texture are included in the image representation, suggesting the feasibility of employing texture cues to achieve rectification in realistic scenes, which in turn improves recognition performance.

Fig. 2. Detection in the wild: the proposed method can detect and rectify meaningful planar homogeneous texture in indoor scenes, despite outliers with large spatial support, photometric severities and significant perspective distortion. Clockwise: concert hall, train station, wine cellar, swimming pool.

2 Related Work

Textured patch detection and rectification are often performed together, since rectifying the perspective effect makes the repeating patterns or symmetries easier to detect. This can be done by exploiting recurring instances of low-level features [6–12], leveraging different classes of symmetries detected in the images [13, 14], or by rank minimization of the intensity matrix [15]. However, most of these works require restrictive assumptions — e.g., specific symmetries, that the repeating elements form a lattice, that the symmetry type or the repeating element is given, etc. These are serious qualifications in the face of the real-life challenges discussed in the preceding section. Thus, despite the long line of works cited above, there is a paucity of evidence that these methods can work on real images collected in the wild, since they have not been demonstrated on images as rich and complex as, say, those found in the benchmark MIT Indoor67, but mainly on limited textures such as building facades, text, or even just pre-segmented or cropped patterns. Different from these approaches, we adopt a frequency based approach [16] in this paper, as it is capable of describing any generic homogeneous texture (from portholes in laundry machines to shadows — see Fig. 1), not necessarily composed of texels that can be sensed by a given feature detector (lines, blobs, edges, etc.). While the TILT algorithm of [15] also does not involve feature detection, it is applicable to a very limited class of texture — that which upon rectification gives a low-rank matrix. Homogeneity, on the other hand, is a more general assumption.

Shape-From-Texture (SFT). Our work is also related to classical shape-from-texture (SFT) theory — in particular the class of methods that work with planar homogeneous texture [16–18]. However, unlike SFT, our goal is not to recover the surface normal but to perform planar rectification. We therefore re-parameterize the local change in dominant texture frequency [16, 19, 20] as a function of the plane projective homography instead of the surface slant and tilt. The resulting formulation circumvents the need to define and relate coordinate systems and, more importantly, does not require knowledge of the focal length, hence has wider applicability. [21] have previously performed SFT without a calibrated camera, jointly recovering surface normal and focal length, but assume the fronto-parallel appearance of the texture is known a priori. On the other hand, we only make the weak assumption of texture homogeneity. Criminisi and Zisserman [22] also recover vanishing lines from projected homogeneous texture, but their approach involves a computationally expensive search for the direction of maximum variance of a similarity measure, appears sensitive to parameters such as the size of the image patch the measure is computed over, and has only been demonstrated on cropped texture exhibiting a grid structure.

Scene Layout. Automatic detection of dominant rectangular planar structure has been previously presented in limited, simplistic, and non-cluttered man-made indoor [23] or urban [24] environments. [5] have demonstrated the localization of primary indoor room faces (walls, ceiling, floor) by employing sophisticated machine learning, while [25] have detected depth-ordered planes. However, all these approaches assume the scene is aligned with a triplet of principal directions defining the coordinate frame (Manhattan layout), and that these directions can be reliably recovered in a scene. On the other hand, our method detects homogeneous texture to recover geometric layout in multi-planar indoor scenes that do not necessarily conform to the above assumptions.

Indoor Scene Recognition. Since indoor scenes can be well described by the objects and components they contain, their recognition has often been approached through the detection of class-discriminative, mid-level visual features or parts that preserve semantics and spatial information [1]. Automatic part learning from images labeled only with the scene category has received wide attention [26–29]. However, this problem is already ill-posed — both the appearance models of parts and their instances in given images are unknown — and is further exacerbated by the large viewpoint variation inherent in scenes. Instead, we employ a generic hand-crafted texture projection model to perform appearance and projective invariant detection of a wide range of meaningful textured scene regions.

Finally, our work is fundamentally different from that on invariant texture description or recognition based on hand-crafted descriptors [30] or by training classifiers for semantic or material properties of texture [31]. Where that line of work is focused on recognizing a wide range of generic texture from cropped images, we aim to detect a specified form of texture in indoor scenes, identify and address the challenges therein, and to explore how it helps high-level scene understanding tasks. We also differ from work that aims to learn to predict the presence or absence of generic material attributes in scenes [32].

3 Main Framework

3.1 Texture Frequency Projection Model

Shape-from-texture relates texture surface coordinates to corresponding camera coordinates in terms of the slant and tilt of the tangent plane at a point [16, 33], or in terms of the plane gradients or normal [19, 21, 34]. Surface coordinates (expressed in camera reference frame) are then projected to the image plane via scaled orthographic or perspective projection, assuming the camera focal length is known. Since we are interested in planar rectification, we can relate the surface and image points via a planar homography. This does not require the focal length, but the downside, as we shall see shortly, is that the rectification is only up to an affine ambiguity. Let us represent the projective transform from the image plane to the textured surface plane as a 3\(\times \)3 homography H, i.e., \(\mathbf {x_s'} = H \mathbf {x_i}\) (see Fig. 3). H can be decomposed to separate the contributions of the affine part and the projective part [35]:

$$\begin{aligned} H = H_A H_P = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ h_7 & h_8 & 1 \end{pmatrix} \end{aligned}$$
(1)
Fig. 3. Assorted hats along the bottom clutter this MIT Indoor67 clothingstore image (right), yet the texture is correctly affine-rectified by the proposed approach (center). For illustration, metric rectification (left) was manually performed, removing any rotation or anisotropic scaling.

That is, the image coordinates \(\mathbf {x_i}\) are first transformed by the “purely” projective homography (i.e. what is left in the projective group after removing the affine group) to some intermediate plane coordinates \(\mathbf {x_s} = (x_s~y_s)^T\), followed by the affine transform \(H_A\) to obtain the world (fronto-parallel) plane coordinates \(\mathbf {x_s'} = (x_s'~y_s')^T\). Note that the last row of \(H_P\) is the same as the last row of H. We consider the role of \(H_A\) first, which provides the transformation:

$$\begin{aligned} x_s' = a_{11} x_s + a_{12} y_s + a_{13}, \qquad y_s' = a_{21} x_s + a_{22} y_s + a_{23} \end{aligned}$$
(2)

The transpose of the Jacobian of \(H_A\), given as:

$$\begin{aligned} \mathbf {J^T_A} = \begin{pmatrix} \frac{\partial x_s'}{\partial x_s} & \frac{\partial y_s'}{\partial x_s} \\ \frac{\partial x_s'}{\partial y_s} & \frac{\partial y_s'}{\partial y_s} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix} \end{aligned}$$
(3)

transforms a world plane spatial frequency \(\mathbf {u_s'} = (u_s' ~ v_s')^T\) — which is constant over the entire plane, since we have assumed homogeneity of texture on the surface — into the frequency \(\mathbf {u_s} = (u_s~ v_s)^T = \mathbf {J^T_A} \mathbf {u_s'}\) on our intermediate plane (cf. [16]). Clearly, frequency \(\mathbf {u_s}\) on the intermediate plane, albeit different from world plane frequency \(\mathbf {u_s'}\), is also constant, i.e., does not vary spatially. In other words, homogeneous texture upon affine transform remains homogeneous, as also observed in [22]. A similar analysis for \(H_P\), which transforms image points \(\mathbf {x_i} = (x_i~ y_i)^T\) into points \(\mathbf {x_s} = (x_s~y_s)^T\) on our intermediate plane, gives:

$$\begin{aligned} \mathbf {J^T_P} = \begin{pmatrix} \frac{\partial x_s}{\partial x_i} & \frac{\partial y_s}{\partial x_i} \\ \frac{\partial x_s}{\partial y_i} & \frac{\partial y_s}{\partial y_i} \end{pmatrix} = \frac{1}{(h_7 x_i + h_8 y_i + 1)^2} \begin{pmatrix} h_8 y_i + 1 & -h_7 y_i \\ - h_8 x_i & h_7 x_i + 1 \end{pmatrix} \end{aligned}$$
(4)

\(\mathbf {J^T_P}\) transforms the intermediate plane constant frequency \(\mathbf {u_s} = (u_s~v_s)^T\) to image plane variable frequency \(\mathbf {u(x_i)} = (u_i~ v_i)^T = [u(\mathbf {x}_i)~v(\mathbf {x}_i)]^T = \mathbf {J^T_P} \mathbf {u_s}\). While the above analysis is applicable to any spatial frequency component, in Sect. 3.2 we shall obtain a robust instantaneous estimate of the dominant spatial frequency component in a given image patch depicting real-world texture, which inevitably contains multiple frequency components. Denote this estimate as \(\mathbf {\tilde{u}(x_i)} = (\tilde{u}_i~ \tilde{v}_i)^T = [\tilde{u}(\mathbf {x}_i)~\tilde{v}(\mathbf {x}_i)]^T\). We then arrive at a method to recover \(H_P\) by minimizing the following re-projection error \(E_{RP}(h_7, h_8, u_s, v_s)\) over the projective parameters \(h_7\), \(h_8\) and the intermediate plane frequency \(u_s\), \(v_s\):

$$\begin{aligned} E_{RP} = \sum _{x_i} \sum _{y_i} \left( \frac{(h_8 y_i + 1)\,u_s - h_7 y_i v_s}{(h_7 x_i + h_8 y_i + 1)^2} - \tilde{u}_i \right)^2 + \left( \frac{(h_7 x_i + 1)\,v_s - h_8 x_i u_s}{(h_7 x_i + h_8 y_i + 1)^2} - \tilde{v}_i \right)^2 \end{aligned}$$
(5)

Optimizing Eq. 5 is a nonlinear least squares problem, and we solve it using the Levenberg-Marquardt algorithm. Observe that our method allows the recovery of \(H_P\) and not \(H_A\). This is because \(\mathbf {J^T_A}\) maps the fronto-parallel plane frequency \(\mathbf {u_s'} = (u_s' ~ v_s')^T\) to a different but still constant frequency \(\mathbf {u_s} = (u_s ~ v_s)^T\). As such, a planar rectification only to within an ambiguous affine transform \(H_A^{-1}\) of the fronto-parallel plane may be obtained.
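For concreteness, the sketch below sets up this optimization in Python with SciPy's Levenberg-Marquardt solver, assuming the per-pixel dominant-frequency estimates of Sect. 3.2 are already available; all names are illustrative, and the fronto-parallel initialization is one reasonable choice, not necessarily the one used in our experiments.

```python
# Minimal sketch: recover H_P by minimizing the re-projection error of Eq. 5.
# xi, yi, u_tilde, v_tilde are H x W arrays (pixel coordinates and the
# dominant-frequency estimates from Sect. 3.2).
import numpy as np
from scipy.optimize import least_squares

def residuals(params, xi, yi, u_tilde, v_tilde):
    """Per-pixel residuals of the two terms inside Eq. 5."""
    h7, h8, us, vs = params
    w = (h7 * xi + h8 * yi + 1.0) ** 2            # common denominator
    ru = ((h8 * yi + 1.0) * us - h7 * yi * vs) / w - u_tilde
    rv = ((h7 * xi + 1.0) * vs - h8 * xi * us) / w - v_tilde
    return np.concatenate([ru.ravel(), rv.ravel()])

def recover_Hp(xi, yi, u_tilde, v_tilde):
    # Fronto-parallel initialization: no projective distortion, mean frequency.
    x0 = [0.0, 0.0, u_tilde.mean(), v_tilde.mean()]
    sol = least_squares(residuals, x0, method='lm',
                        args=(xi, yi, u_tilde, v_tilde))
    h7, h8, us, vs = sol.x
    Hp = np.array([[1.0, 0, 0], [0, 1.0, 0], [h7, h8, 1.0]])  # Eq. 1, right factor
    return Hp, (us, vs)
```

Warping the patch by the recovered \(H_P\) (e.g., with cv2.warpPerspective, up to a suitable placement of the output window) then yields the affine-rectified texture.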

3.2 Optimal Estimation of Dominant Frequency in Projected Homogeneous Texture

A Gabor filter \(h(\mathbf {u};\mathbf {x})\) with center frequency \(\mathbf {u} = (u, v)\) can be convolved with an image \(f(\mathbf {x})\) to give its frequency content near \(\mathbf {u}\) at point \(\mathbf {x} = (x,y)\):

$$\begin{aligned} A(\mathbf {u};\mathbf {x}) = |f(\mathbf {x}) * h(\mathbf {u};\mathbf {x}) |\end{aligned}$$
(6)

Since a given texture may exhibit multiple frequencies, which may also be oriented differently, one must discern the component that can be reliably tracked over the image, so as to be able to use the projection model developed in Sect. 3.1. In this regard, Super and Bovik [16] have previously demonstrated estimation of the dominant texture frequency — a distinct peak at any given point, around which most of the energy is concentrated in a narrow band — employing a frequency demodulation model (DEMOD) from [20]. Briefly, denote the horizontal and vertical partial derivatives of Gabor filter \(h(\mathbf {u};\mathbf {x})\) by \(h_x(\mathbf {u};\mathbf {x})\) and \(h_y(\mathbf {u};\mathbf {x})\), respectively, and the corresponding amplitude response (Eq. 6) by \(B(\mathbf {u};\mathbf {x})\) and \(C(\mathbf {u};\mathbf {x})\), respectively. Then, an unsigned instantaneous estimate \(|\tilde{\mathbf {u}}(\mathbf {x}) |\) of the dominant frequency component may be computed for the filter h that maximizes the response \(A(\mathbf {u};\mathbf {x})\) at each point as:

$$\begin{aligned} |\tilde{u}(\mathbf {x})| = \frac{B(\mathbf {u};\mathbf {x})}{2\pi A(\mathbf {u};\mathbf {x})}, \qquad |\tilde{v}(\mathbf {x})| = \frac{C(\mathbf {u};\mathbf {x})}{2\pi A(\mathbf {u};\mathbf {x})} \end{aligned}$$
(7)

The sign at each pixel is defined by the frequency plane quadrant wherefrom the maximizing Gabor is sampled. Only quadrants I or IV are used, since the Fourier spectrum is symmetric.
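To make Eqs. 6–7 concrete, here is a minimal numpy sketch of this demodulation for a grayscale image; the Gabor bank, window size, and bandwidth are illustrative choices, not the exact settings of [16].

```python
# Sketch of dominant-frequency demodulation (Eqs. 6-7): at each pixel, take
# the Gabor with the largest amplitude response and demodulate it.
import numpy as np
from scipy.signal import fftconvolve

def gabor_and_derivs(u, v, sigma, size=31):
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    h = np.exp(-(X**2 + Y**2) / (2 * sigma**2)) \
        * np.exp(2j * np.pi * (u * X + v * Y))   # complex Gabor h(u; x)
    hx = (-X / sigma**2 + 2j * np.pi * u) * h    # dh/dx
    hy = (-Y / sigma**2 + 2j * np.pi * v) * h    # dh/dy
    return h, hx, hy

def demod(f, bank):
    """f: grayscale float image; bank: list of (u, v, sigma) center frequencies.
    Returns the unsigned estimates |u~(x)|, |v~(x)| of Eq. 7."""
    best_A = np.zeros_like(f)
    best_B = np.zeros_like(f)
    best_C = np.zeros_like(f)
    for (u, v, sigma) in bank:
        h, hx, hy = gabor_and_derivs(u, v, sigma)
        A = np.abs(fftconvolve(f, h, mode='same'))    # Eq. 6
        B = np.abs(fftconvolve(f, hx, mode='same'))
        C = np.abs(fftconvolve(f, hy, mode='same'))
        m = A > best_A                                # maximizing filter per pixel
        best_A[m], best_B[m], best_C[m] = A[m], B[m], C[m]
    den = 2 * np.pi * np.maximum(best_A, 1e-12)
    return best_B / den, best_C / den                 # Eq. 7
```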

Fig. 4. Affine rectification of texture (a) via the model developed in Sect. 3.1, applied to non-optimal frequency estimate by DEMOD [16] is prone to drift (b); optimal frequency estimation via GCO improves performance (c).

Frequency Drift. For the MIT Indoor67 airport_inside image patch shown in Fig. 4(a), DEMOD [16] provides a rather poor estimate of the dominant frequency, resulting in poor rectification using the model from Sect. 3.1. This is not surprising given the grim challenges we outlined in Sect. 1. Figure 5 examines the dominant frequency estimates in detail. Since the given texture does not extend to the lower left and lower right regions of the image patch, the maximizing Gabor drifts in both center frequency (Fig. 5(a)) and orientation (Fig. 5(c)) in these regions (brighter pixels depict numerically larger values). A 1D plot along the dotted line (Fig. 5(b)) shows the center frequency deviating in these regions from an otherwise increasing pattern. The orientation plot (Fig. 5(d)) reveals that the Gabors predominantly fire strongly at the horizontal bars in the image (\(18^{\circ }\), \(0^{\circ }\), \(-18^{\circ }\) as one moves from left to right). However, in the lower region, it is the vertical bars (\(-72^{\circ }\), \(90^{\circ }\)) that define the “dominant” Gabors. Figures 5(e) and (g), respectively, show the resulting horizontal and vertical estimates obtained via Eqs. 7; the corresponding surface plots, depicted in Figs. 5(f) and (h), show large discontinuities. We propose to resolve drift by enforcing smoothness via the following regularized graph cut problem [36]:

Fig. 5. Top: closer look at drift in the dominant instantaneous frequency estimate via DEMOD [16]. Bottom: GCO resolves drift in both center radial frequency and orientation. Right: GCO also resolves quadrant ambiguity, if any.

$$\begin{aligned} E(f) = \sum _{p \in \mathcal {P}} D_p (f_p) + \sum _{\{p,q\} \in \mathcal {N}} V_{p,q} (f_p , f_q) \end{aligned}$$
(8)

where \(\mathcal {P}\) is the set of sites p to be labeled (pixels), and \(\mathcal {N}\) is the 8-neighbourhood system. Our set of labels \(\mathcal {L}\) is the Gabor filter bank. The unary term is defined as \(D_p (f_p) = \alpha /A(f_p;p)\), where \(A(\mathbf {u};\mathbf {x})\) is as dictated by Eq. 6, \(\alpha =1\) and \(f_p = (\Omega _p, \theta _p)\in \mathcal {L}\) gives the filter with center frequency \(\mathbf {u} = (\Omega _p \sin \theta _p, ~\Omega _p \cos \theta _p)\) at \(\mathbf {x} = p\). The pairwise term \(V_{p,q} (f_p , f_q) = V(f_p , f_q)\) forces the center radial frequency \(\Omega _p\) and orientation \(\theta _p\) to be smooth:

$$\begin{aligned} V(f_p , f_q) = \beta (\Omega _p - \Omega _q)^2 + \gamma \{(\sin \theta _p - \sin \theta _q)^2 + (\cos \theta _p - \cos \theta _q)^2\} \end{aligned}$$
(9)

Demodulation (Eqs. 7) is then performed after solving Eq. 8 for the optimal labeling f. We call this scheme Graph Cut Optimization (GCO), solved via \(\alpha \)-expansion [36]. As seen in Fig. 5(bottom), it yields a smooth, monotonically increasing frequency and orientation profile, consequently providing an improved rectification (Fig. 4(c)) compared to the non-optimal case (Fig. 4(b)).
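For illustration, the sketch below evaluates exactly the energy of Eqs. 8–9 but, for brevity, minimizes it with iterated conditional modes (ICM) as a stand-in for the \(\alpha \)-expansion of [36]; parameter values and array layouts are placeholders.

```python
# Sketch of the smoothness-regularized labeling of Eqs. 8-9 (ICM stand-in).
import numpy as np

def gco_icm(A, labels, alpha=1.0, beta=1.0, gamma=1.0, iters=5):
    """A: (H, W, L) Gabor amplitude responses (Eq. 6) for L filters;
    labels: list of (Omega, theta) per filter. Returns a label map."""
    H, W, L = A.shape
    Om = np.array([Om_p for Om_p, _ in labels])
    s = np.sin([th_p for _, th_p in labels])
    c = np.cos([th_p for _, th_p in labels])
    D = alpha / np.maximum(A, 1e-12)              # unary term of Eq. 8
    f = D.argmin(axis=2)                          # init: best filter per pixel
    offs = [(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)]
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                nb = [f[y+dy, x+dx] for dy, dx in offs
                      if 0 <= y+dy < H and 0 <= x+dx < W]
                # pairwise term of Eq. 9 against current neighbor labels
                V = sum(beta * (Om - Om[q])**2
                        + gamma * ((s - s[q])**2 + (c - c[q])**2) for q in nb)
                f[y, x] = (D[y, x] + V).argmin()
    return f
```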

A workaround to drift is to perform robust parameter estimation via RANSAC [37]. While this can seemingly handle drift (see Fig. 4(d)), the % outliers is significantly higher than when GCO is used in conjunction with RANSAC (Table 1). Later, in Sect. 4.2, we employ the % outliers as a metric to “detect” homogeneous texture, and since GCO renders the % outliers a more adequate measure, it is indispensable if we wish to reliably differentiate non-textured surfaces from textured surfaces perturbed by other scene elements.

Table 1. Recovered projective parameters and % outliers for the example texture in Fig. 4(a) via DEMOD, GCO, DEMOD+RANSAC and GCO+RANSAC

Quadrant Ambiguity. DEMOD [16] also fails on, e.g., the subway patch in Fig. 6(o), because it can only measure frequency orientations modulo-\(\pi \) (frequency estimates from opposite quadrants have the same magnitude). This wrapped orientation may result in sharp discontinuities between neighboring frequency estimates. As explained in Fig. 5(top right), the orientation of the rails increases as one moves from left to right (\(36^{\circ }\), \(54^{\circ }\), \(72^{\circ }\)), and wraps around back to \(-90^{\circ }\). We extend our set of labels \(\mathcal {L}\) to include filters sampled at orientations from all four quadrants, and rely on the smoothness constraint between neighboring pixels to resolve the quadrant ambiguity. As seen in Fig. 5(lower right), the optimal orientations recovered by GCO are those sampled from quadrant III rather than I, thereby ensuring a smoother transition into quadrant IV with respect to both the demodulated horizontal and vertical frequency.

Fig. 6. Affine rectification of homogeneous texture: given image, TILT [15], REM [6], DEMOD [16] with our model in Sect. 3.1, proposed (GCO), and ground truth.

4 Experiments

4.1 Affine Rectification

The proposed affine rectification is evaluated on \(N = 30\) patches, cropped from various images in MIT Indoor67, depicting some homogeneous texture under perspective projection. We compare with TILT (Transform-Invariant Low-rank Texture) [15] using publicly available code, with REM (Repetition Maximization) [6] using their command-line tool, and with our implementation of DEMOD [16] in conjunction with our model from Sect. 3.1, thereby encompassing techniques based on low-rankness, recurring elements, and frequency. Following TILT and REM, a multi-scale approach is also implemented for the proposed GCO scheme (see supple. material). We define the mean estimation error as \(\frac{1}{N}\sum _{i=1}^{N}\sqrt{(\tilde{h}_{7i}-h_{7i})^2 + (\tilde{h}_{8i}-h_{8i})^2}\), where \(\tilde{h}_{7i}\), \(\tilde{h}_{8i}\) are the parameters returned by an algorithm, and \(h_{7i}\), \(h_{8i}\) are the ground truth parameters obtained by manual annotation of vanishing points. The various algorithms fare as follows: TILT: 0.496, DEMOD: 0.386, and GCO: \(\mathbf {0.186}\). REM does not return the estimated parameters, hence its performance is not quantified. Our proposed GCO substantially improves upon pure DEMOD. TILT performs even worse than DEMOD; more on that below.
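In code, this metric is simply (variable names hypothetical):

```python
# Mean estimation error over N patches; *_est are algorithm outputs,
# *_gt are the manually annotated ground-truth projective parameters.
import numpy as np

def mean_estimation_error(h7_est, h8_est, h7_gt, h8_gt):
    return np.mean(np.sqrt((h7_est - h7_gt)**2 + (h8_est - h8_gt)**2))
```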

Figure 6 presents some qualitative results. REM — which has only been demonstrated on properly cropped, printed patterns in [6] — succeeds only in the infrequent cases where it can detect some regular lattice structure (e.g., k, l), but usually either produces a partial rectification (c, n) or fails altogether. TILT, in general, also performs well on only a few cases, where the underlying texture is low-rank (a, b), but breaks down when this assumption is violated — e.g., port-holes (d) or barrels (e), where the gradients are isotropic in all directions. On the other hand, our robustified frequency based scheme (GCO) is seen to handle such texture very well, corroborating our intuition that homogeneity is a more general assumption than low-rankness. TILT and REM also seem to fail on cases exhibiting large perspective distortion — e.g., the textured ceilings in cases (p, q) — and when illumination changes over the texture (m, o, r). On the other hand, the use of Gabor filters allows our frequency based scheme to perform remarkably well in these challenging cases. Provided the scale of the texture is small (i.e., the texture contains higher frequencies) relative to the scale of the surface it covers, a frequency based representation is resilient to slow-varying (low-frequency) photometric changes [16].

4.2 Detection in the Wild

Overlapping patches (80\(\,\times \,\)80 pixels) are sampled over a multi-scale image pyramid (details in supple. material) to decide whether or not they are textured planar patches. GCO and robust parameter estimation via RANSAC (with the outlier threshold fixed at 0.01) are performed on each patch individually. Our error measure (Eq. 5) is not affine invariant, so we employ the following heuristic normalization. First, the dynamic range of the optimal radial frequency \(|\tilde{\mathbf {u}}(\mathbf {x}_i)| = \sqrt{\tilde{u}_i^2 + \tilde{v}_i^2}\) of RANSAC inliers is computed as \(\mathcal {DR} = \max _{i \in inliers} |\tilde{\mathbf {u}}(\mathbf {x}_i)| - \min _{i \in inliers} |\tilde{\mathbf {u}}(\mathbf {x}_i)|\). A normalized residual error is then computed for all pixels (i.e., inliers and outliers) as \(\mathcal {E}(\mathbf {x}_i) = \{\tilde{\mathbf {u}}(\mathbf {x}_i) - \mathbf {J_P^T}(\mathbf {x}_i)\, \mathbf {u_s}\}/\mathcal {DR}\), followed by re-evaluating the % outliers (which serves as the decision score).
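A sketch of this normalization follows; reading the decision rule as a threshold on the magnitude of the normalized residual is our interpretation, and all names are illustrative.

```python
# Normalized residual and % outliers (the detection decision score).
# u_tilde, v_tilde: per-pixel frequency estimates; xi, yi: coordinates;
# h7, h8, us, vs: parameters fitted by RANSAC; inliers: boolean mask.
import numpy as np

def outlier_score(u_tilde, v_tilde, xi, yi, h7, h8, us, vs, inliers,
                  thresh=0.01):
    radial = np.hypot(u_tilde, v_tilde)
    dr = radial[inliers].max() - radial[inliers].min()   # dynamic range DR
    w = (h7 * xi + h8 * yi + 1.0) ** 2
    pred_u = ((h8 * yi + 1.0) * us - h7 * yi * vs) / w   # J_P^T u_s (Eq. 4)
    pred_v = ((h7 * xi + 1.0) * vs - h8 * xi * us) / w
    resid = np.hypot(u_tilde - pred_u, v_tilde - pred_v) / dr
    return np.mean(resid > thresh)                       # fraction of outliers
```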

A quantitative evaluation is performed on 300 images sampled from MIT Indoor67 (with at least 3 from each scene category) that have been manually annotated with quadrilaterals indicating the homogeneous textured regions, their plane projective parameters, and their semantic class IDs (left/right wall, ceiling, floor). We define true positives (TP), false positives (FP) and false negatives (FN) as follows. For precision [TP/(TP+FP)], TP is the number of candidate patches whose estimated semantic class (see Sect. 4.3) matches an annotated region with 50 % intersection-over-detection (i.e., at least 50 % of the candidate’s area should cover the annotation), while FP is the number of candidates that fail this criterion. For recall [TP/(TP+FN)], TP is the number of annotated regions that are “fired on” by one or more candidates (with the correct semantic class) such that their area is covered beyond a certain threshold (we evaluate at both coverage \(\ge 50\,\%\) and \(\ge 80\,\%\)), while FN is the number of annotated regions that fail this criterion. Note that for recall, TP + FN = 1367, the total number of annotated regions, as in object detection [38].
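The matching criteria are perhaps easiest to parse in code; below is a sketch using shapely polygons, with illustrative data structures (lists of (polygon, class) pairs).

```python
# Sketch of the coverage-based matching behind precision and recall.
from shapely.ops import unary_union

def is_true_positive(det_poly, det_cls, annots):
    # Precision side: >=50% of the candidate's area must cover a same-class
    # annotated region (intersection-over-detection).
    return any(det_cls == a_cls and
               det_poly.intersection(a_poly).area / det_poly.area >= 0.5
               for a_poly, a_cls in annots)

def is_recalled(a_poly, a_cls, detections, coverage=0.5):
    # Recall side: the annotated region must be covered beyond `coverage`
    # by the union of same-class detections.
    same = [d_poly for d_poly, d_cls in detections if d_cls == a_cls]
    if not same:
        return False
    covered = unary_union(same).intersection(a_poly).area
    return covered / a_poly.area >= coverage
```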

Fig. 7. Scene layout estimation by homogeneous texture detections, and associated vanishing lines. Scene with vanishing point clusters (left), box layouts [5] (center), proposed (right). Left wall = red, right wall = yellow, ceiling = blue, floor = green. Best viewed in color.

Figure 8 presents precision-recall curves and recall vs. # proposals curves for our method, as well as for TILT [15] (using the ratio of final to initial rank as a decision score). One can observe considerably superior performance by our method, with an average precision of 0.53, compared to 0.15 for TILT. Both methods improve in recall with an increasing # proposals, but ours maintains a larger recall for the same # proposals from the outset. Some qualitative results are presented in Fig. 2 (and many more in supple. material).

4.3 Indoor Scene Geometric Layout Estimation

Hedau et al. [5] have previously demonstrated the estimation of indoor scene geometric layout by using orthogonal vanishing points [39] to establish room orientation, and then using machine learning with rich feature sets [40] to localize room faces (i.e., ceiling, walls, floor) in space and scale. Figure 7 identifies the shortcomings of such an approach, using MIT Indoor67 images. The presence of more than three dominant planar directions (b, f, g), the absence of straight lines in a certain direction (c), a forked layout (d), and non-Manhattan structure, commonplace in real-world scenes (e), are scenarios where such a scheme is apt to provide incorrect room orientation, while face localization is also prone to error (a, h) owing to the limitations of a learning-based system, such as non-exhaustive training data.

Our detections and the recovered projective parameters provide an alternative scheme for estimating indoor geometric layout in textured scenes (Fig. 7) that requires neither vanishing points nor machine learning. A given detection may be classified as a vertical/horizontal surface depending on the slope of the vanishing line, and as left/right wall or ceiling/floor depending on the position of this line with respect to the patch center (see top right of Fig. 7 for details; a sketch of such a rule follows below). The top 150 detections (sorted by % of RANSAC outliers) are then subjected to non-max suppression (NMS) performed across semantic classes (i.e., an incoming detection is not admitted if at least 50 % of its area is already occupied by any previously admitted, and thus higher-ranked, patch that is not from the same class). Of course, the proposed scheme requires the scene faces to be textured. For example, Fig. 7(g) shows a scenario where the non-textured ceiling or walls cannot be correctly assigned a semantic face category.
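The sketch below illustrates one such labeling rule, assuming the vanishing line of a detected plane is \(h_7 x + h_8 y + 1 = 0\) (the line that \(H_P\) maps to infinity); the slope test and sign conventions here are illustrative stand-ins for the exact rules in Fig. 7 (top right).

```python
# Classify a detection as ceiling/floor or left/right wall from its
# vanishing line h7*x + h8*y + 1 = 0 and the patch center (cx, cy).
# Image coordinates are assumed to have y increasing downward.

def face_label(h7, h8, cx, cy):
    if abs(h8) >= abs(h7):                # roughly horizontal vanishing line
        y_line = -(h7 * cx + 1.0) / h8    # line's height at the patch center
        return 'ceiling' if y_line > cy else 'floor'
    else:                                 # roughly vertical vanishing line
        x_line = -(h8 * cy + 1.0) / h7
        return 'right wall' if x_line < cx else 'left wall'
```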

Fig. 8. Precision-recall and recall vs. # proposals.

Fig. 9. Sample correct classifications.

4.4 Indoor Scene Classification

Table 2 quantitatively demonstrates that affine rectification of textured patches detected via the proposed approach (with the decision threshold fixed at 50 % RANSAC outliers) can improve scene classification performance. Best practices for dense local feature based classification, as suggested in [41], are followed (details in supple. material), using Fisher encoding with sum pooling [42], Hellinger kernel mapping, one-vs-all linear SVMs [43], and various gradient and thresholding based descriptors. Both regular and rectified representations (wherein dense descriptors are extracted from affine-rectified patches) are computed, and then combined via the score fusion scheme of [26].
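A minimal sketch of this classification stage follows, assuming Fisher vectors have already been computed per image (X_reg from regular dense descriptors, X_rect from descriptors on affine-rectified patches); averaging the two SVMs' scores is a simple stand-in for the fusion scheme of [26].

```python
# Hellinger kernel mapping + one-vs-all linear SVMs + score fusion (sketch).
import numpy as np
from sklearn.svm import LinearSVC

def hellinger(X):
    # signed square root (Hellinger mapping) followed by L2 normalization
    Xh = np.sign(X) * np.sqrt(np.abs(X))
    return Xh / np.maximum(np.linalg.norm(Xh, axis=1, keepdims=True), 1e-12)

def fused_predict(X_reg, X_rect, y, X_reg_test, X_rect_test):
    svm_reg = LinearSVC().fit(hellinger(X_reg), y)    # one-vs-all multiclass
    svm_rect = LinearSVC().fit(hellinger(X_rect), y)
    scores = (svm_reg.decision_function(hellinger(X_reg_test))
              + svm_rect.decision_function(hellinger(X_rect_test)))
    return svm_reg.classes_[scores.argmax(axis=1)]
```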

In general, our rectification based representations on their own perform slightly worse than regular ones, since descriptors are extracted only from detected textured regions, which, more often than not, span the image only in some spatial regions and at certain scales, not exhaustively, thereby losing some discriminative power. Interestingly, however, LBP, perhaps because it is inherently a texture descriptor, performs significantly better with rectified, detected texture. For similar reasons, both representations perform almost the same with CENTRIST — again, a texture descriptor. Finally, our results suggest that features extracted upon planar rectification are also complementary to regular features, a finding that is consistent across all the descriptors experimented with. Figure 9 shows some sample images that were misclassified using a regular representation (SIFT+HOG), but correctly classified using the rectified one (SIFT_Rect+HOG_Rect). A notable property among most of them is the presence of large perspective distortion, as well as high-frequency homogeneous texture.

Table 2. Left: MIT Indoor67 classification improvement with Fisher encoding of dense descriptors (CENTRIST [44], LBP [45], SIFT [46, 47], HOG2\(\,\times \,\)2 [48, 49]) extracted from affine-rectified texture patches. Right: state-of-the-art performance — all (except SIFT [28]) involve learning-based feature extraction, unlike ours

5 Conclusion

This paper has demonstrated a projective-invariant method to detect homogeneous texture and to perform its affine rectification in challenging, real-world indoor scenes, outperforming existing representative work. Homogeneous texture is seen to provide cues for indoor geometric layout estimation in scenes where vanishing points cannot be reliably computed or the Manhattan assumption is violated. Rectified homogeneous texture also facilitates improved indoor scene recognition on the MIT Indoor67 benchmark, demonstrating that plane projective rectification can improve performance in a recognition system.