1 Introduction

Autonomous vehicles, advanced driver assistance systems, robots and other intelligent devices need to understand their environment. For this purpose, both geometric (distance) and semantic (classification) sources of information are useful. We want to represent these inputs in a very compact model and compute them in real-time to serve as a building block of higher-level modules, such as localization and planning.

Fig. 1

The proposed approach: pixel-wise color, semantic and depth information serve as input to our Slanted Stixels model, which is a compact semantic representation of a 3D scene that accurately handles arbitrary scenarios such as San Francisco city. The optional over-segmentation in the top-right yields significant speed gains while nearly retaining the depth and semantic accuracy (Color figure online)

The Stixel world has been successfully used for representing traffic scenes, as introduced in Pfeiffer and Franke (2011). It has shown its potential particularly in the Bertha-Benz drive (Ziegler et al. 2014b), where it was successfully applied to visual scene understanding in autonomous driving. This success has led to increased interest in the model from the intelligent vehicles community over the past years (Schneider et al. 2016; Hernandez-Juarez et al. 2017a; Benenson et al. 2011; Cordts et al. 2014, 2017; Ignat 2016; Levi et al. 2015; Carrillo and Sutherland 2016; Hernandez-Juarez et al. 2017b).

The Stixel world defines a compact medium-level representation of dense 3D disparity data obtained from stereo vision using rectangles, the so-called Stixels, as elements. Stixels are classified either as ground-like planes, upright objects or sky, which are important geometric elements found in man-made environments. This representation transforms millions of disparity pixels into hundreds or thousands of Stixels. At the same time, most scene structures, such as free space and obstacles, which are relevant for autonomous driving tasks, are adequately represented.

The idea behind the Stixel model is that planar surfaces are dominant in man-made environments, and scenes can be modeled under this assumption. Scene structure found in urban environments can be modeled with certain constraints, e.g. the sky is above the horizon line and objects usually lie on the ground. Generally, the geometric constraints of a scene are tied to the vertical direction. Hence, the environment can be modeled as a column-wise segmentation of the image with a 3D stick-like shape, i.e. a set of Stixels, c.f. Fig. 1. The segmentation is estimated by solving a column-wise energy minimization problem that takes depth and semantic cues as inputs, as well as a priori information used to regularize the solution.

The Stixel model has been successfully used for automotive vision applications either to decrease parsing time, increase accuracy, or both. Examples of works using the Stixel representation cover different topics such as object recognition (Benenson et al. 2012; Li et al. 2016), building a grid map over time (Muffert et al. 2014) and autonomous driving (Ziegler et al. 2014b). Specifically, for motion planning in the context of autonomous driving, the Stixel model has been used to encode the geometric constraints of a given scene, c.f. Ziegler et al. (2014b, a).

We propose a new depth model that is able to accurately represent arbitrary kinds of slanted objects and non-flat roads. The improved Stixel representation outperforms the original Stixel model in scenarios with non-flat roads, while keeping the same accuracy on flat road scenes. The induced extra computational complexity is reduced by incorporating an over-segmentation strategy that can be applied to any Stixel model proposed so far. An earlier version of our work (Hernandez-Juarez et al. 2017b) proposed a simple over-segmentation strategy that provided faster execution at the expense of decreasing the accuracy of the model. This paper introduces a novel over-segmentation approach based on a Fully Convolutional Network (FCN) that outperforms the previous strategy, achieving similar speed-ups while retaining most of the accuracy of the original version. An overview of our method is shown in Fig. 1.

Our main contributions are: (1) a novel depth model to accurately represent arbitrary kinds of slanted surfaces in the Stixel representation; (2) a novel over-segmentation prior designed to reduce the run-time of the method; (3) an effective over-segmentation strategy based on a shallow Fully Convolutional Network; (4) a new synthetic dataset with non-flat roads that includes pixel-level semantic and depth ground-truth, which is publicly available; and (5) an in-depth evaluation in terms of run-time as well as semantic and depth accuracy carried out on this novel dataset and several real-world benchmarks. Compared to the existing state-of-the-art approaches, our method substantially improves the depth accuracy in non-flat road scenarios.

Fig. 2

Scene representation obtained by our method of a challenging street environment with a slanted road. Both geometric (top) and semantic (bottom) representations are shown (Color figure online)

The remainder of this paper is structured as follows. Section 2 reviews the state of the art. Section 3 presents the new Stixel formulation. We present two over-segmentation methods in Sect. 4. Section 5 explains the experiments we carried on and discusses their results. Finally, we state our conclusions in Sect. 6.

2 Related Work

Our proposed method introduces a novel Stixel-based scene representation that is able to account for non-flat roads, c.f. Fig. 2. We also devise an approximation to reduce the computational complexity of the underlying Dynamic Programming algorithm.

First, we comment on works proposing different road scene models. Occupancy grid maps are models used to represent the surroundings of the vehicle (Dhiman et al. 2014; Muffert et al. 2014; Nuss et al. 2015; Thrun 2002). Typically, a grid in bird’s eye perspective is defined and used to detect occupied grid cells and then, from this information, to extract the obstacles, drivable area, and unobservable areas from range data. These grids and the Stixel world both represent the 2D image in terms of column-wise stripes, allowing the camera data to be captured in a polar fashion. Also, the Stixel data model is similar to the forward step usually found in occupancy grid maps (Cordts et al. 2017). However, the Stixel inference method in the image domain presents important differences compared to classical grid-based approaches.

Our work builds upon the proposal from Schneider et al. (2016): they use semantic cues in addition to depth to extract a Stixel representation, which is able to provide a rich yet compact representation of the traffic scene. However, their model assumes a constant road slant and is therefore limited to flat road scenarios. In contrast, our proposal overcomes this drawback by incorporating a novel plane model together with effective priors on the plane parameters.

Our proposal of using Stixel cuts is related to Cordts et al. (2014): they use fast object detectors for different object classes, e.g. the Viola-Jones cascade detector (Viola and Jones 2001), to produce top and bottom Stixel cuts that are used as prior information, which is then integrated into the Stixel algorithm. They show that using object-level knowledge provides significant accuracy improvements. Instead, we leverage semantic information as pixel-level knowledge in our model for the same purpose of improving accuracy. Semantic segmentation identifies the objects and other elements of the image, e.g. walls or sidewalks, providing pixel-level information instead of boxes around the objects. Also, semantic segmentation requires a single predictor, while the method proposed by Cordts et al. (2014) needs a detector trained for each object class. In contrast, we define a Stixel cut prior to generate an over-segmentation of the optimal Stixel cuts in order to speed up the execution of the algorithm.

There are some methods (Benenson et al. 2011; Ignat 2016; Levi et al. 2015), that represent simplified scene models with a single Stixel per column. The advantage of these approaches is that the computational complexity of the underlying algorithms is linear, but they cannot represent some complex scenarios found in the real world, e.g. a pedestrian and a building in the same column.

Recent work by Carrillo and Sutherland (2016) uses edge-based disparity maps to compute Stixels. Their method is fast but they show that it gives inferior accuracy compared to the original Stixel model (Pfeiffer et al. 2013).

Levi et al. (2015) first introduced the use of an FCN in Stixel-based methods. Their FCN takes a single RGB image as input and estimates the bottom of the first non-road Stixel, i.e. the closest obstacle. We use an FCN for an entirely different objective: to extract a Stixel cut over-segmentation that accelerates the execution of the algorithm. Moreover, the input of our FCN is a disparity map obtained from a stereo camera.

Finally, there are some works proposing fast implementations for Stixel computation. The FPGA implementation from Muffert et al. (2014) runs at 25 Hz with a Stixel width of 5 pixels, but the authors do not indicate the image resolution. Hernandez-Juarez et al. (2017a) present a GPU-accelerated implementation that runs at 26 Hz for a Stixel width of 5 pixels and image resolution of \(1024\times 440\) pixels, computed using a Semi-Global Matching (SGM) (Hirschmüller 2008) stereo algorithm. We propose a novel approximation that accelerates the computation by reducing the algorithmic complexity. Accordingly, our proposal could benefit from the aforementioned FPGA- or GPU-accelerated implementations.

3 Stixel Model

The Stixel world is a compressed representation of a 3D scene that preserves its relevant structure. Since the structure in street environments is dominant in the vertical domain, the Stixel world leverages this idea to model a scene without taking into account the horizontal neighborhood. This assumption leads to an efficient inference method and also allows the inference to be performed on all columns in parallel.

The Stixel world is defined as a segmentation of image columns into stick-like super-pixels with class labels and a 3D planar depth model, c.f. Fig. 3. We consider three structural classes: object, ground and sky. These classes have properties that are derived from an underlying 3D model: object Stixels have roughly constant distance and usually lie on the ground, sky Stixels are at infinite distance, and for ground Stixels we favor planes in accordance with the expected ground.

The Stixel world has several properties that are useful for higher-level processing stages: (1) it is a medium-level scene representation that significantly reduces the number of elements, e.g. from millions of pixels to hundreds of Stixels, while keeping an abstract representation of physical extent, depth and semantics; (2) the representation is based upon a street model; (3) the representation is not high-level, because an object is represented by more than one Stixel horizontally and can also be split into more than one Stixel vertically, e.g. due to occlusions or slanted objects such as cars viewed from behind.

Fig. 3

Example of input disparity measurements (black lines) and output Stixels encoded with semantic colors (colored lines) for a typical scene column (right). Adapted from Cordts et al. (2017) (Color figure online)

The joint Stixel segmentation and labeling problem is carried out via optimization of the column-wise posterior distribution defined over a Stixel segmentation given all measurements from that particular image column. In the following, we drop the column indexes for ease of notation. We obtain Stixel widths \(>1\), as illustrated e.g. in Fig. 1, by down-sampling the inputs. This width is fixed and is chosen to reduce the computational complexity during inference; however, heavy down-sampling leads to a degradation in accuracy (Cordts et al. 2017).

A Stixel column segmentation consists of an arbitrary number of Stixels, each representing four random variables: the Stixel extent via bottom and top row, as well as its class and geometric depth model. Thereby, the number of Stixels itself is a random variable that is optimized jointly during inference. To this end, the posterior probability is defined by means of the unnormalized prior and likelihood distributions

(1)

where Z is the normalizing partition function. Transformed to log-likelihoods via

(2)

where the resulting energy function decomposes into a likelihood (data) term, \(E_{data}\), and a prior term, \(E_{prior}\).

(3)

3.1 Data Term

The likelihood term thereby rates how well the measurements at each pixel fit the overlapping Stixel

(4)

This pixel-wise energy is further split into a semantic and a depth term

(5)

The parameter \(w_l\) controls the influence of the semantic data term. The input is provided by an FCN that delivers normalized semantic scores \(l_v(c_i)\) with \(\sum _{c_i}l_v(c_i) = 1\) over all classes \(c_i\) at each pixel v. The semantic energy favors semantic classes of the Stixel that fit the observed pixel-level semantic input (Schneider et al. 2016). The semantic likelihood term is

(6)
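To make the role of the semantic scores concrete, the following sketch accumulates a per-Stixel semantic energy from the pixel-wise FCN scores \(l_v(c_i)\). It assumes a negative log-score form weighted by \(w_l\) purely for illustration; the exact functional form is the one given in Eq. (6), and all variable names are placeholders.

```python
import numpy as np

def semantic_energy(scores, v_bottom, v_top, class_id, w_l=1.0):
    """Illustrative semantic data term for one Stixel hypothesis.

    scores: (H, C) array of normalized per-pixel semantic scores l_v(c)
            for one image column (each row sums to 1).
    The energy is low when the Stixel class matches the pixel-level input,
    here computed as negative log-scores summed over the covered rows.
    """
    l = scores[v_bottom:v_top + 1, class_id]
    return -w_l * np.sum(np.log(np.clip(l, 1e-12, 1.0)))

# Toy column of 10 rows whose upper half strongly supports class 1.
rng = np.random.default_rng(0)
scores = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=10)
scores[5:] = [0.05, 0.9, 0.05]
print(semantic_energy(scores, 5, 9, class_id=1))   # low energy: classes agree
print(semantic_energy(scores, 5, 9, class_id=0))   # higher energy: mismatch
```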

The depth model is designed to represent the different characteristics of the different geometric classes, i.e. object, ground and sky Stixels. Furthermore, the model enforces multiple stacked Stixels in cases of objects with the same class but different depths.

Our depth input is a dense disparity map: each pixel is assigned a disparity value or is masked as invalid, i.e. \(d_v \in \{0 ... d_{max}, d_{invalid}\}\). The depth term is defined by means of a probabilistic and generative sensor model that rates the accordance of the depth measurement at a given row with the Stixel

(7)

Invalid disparity measurements \(d_{inv}\) have to be handled as well; therefore, a prior probability of a valid disparity value is defined as \(p_{val}\)

(8)

where the measurement model considers valid disparities only. It comprises a constant outlier probability and a Gaussian sensor noise model for valid measurements, modulated by the confidence of the disparity estimate

(9)

that is centered at the expected disparity given by the depth model of the Stixel; normalization constants ensure that both components are proper distributions. Similarly to Pfeiffer et al. (2013), we use the confidence of the depth estimates to influence the shape of the distribution. The Gaussian models the typical disparity noise, while the uniform distribution, weighted by the outlier probability, makes the model more robust to outliers. The standard deviation models the noise of the stereo matching algorithm and depends on the class \(c_i\).
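For illustration, the sketch below evaluates such a mixture sensor model for a single disparity measurement: a uniform outlier component blended with a Gaussian centered at the disparity expected by the Stixel, gated by a prior probability of the measurement being valid. The constants and the way the confidence reweights the mixture are placeholders, not the exact quantities of Eqs. (7)-(9).

```python
import math

def depth_likelihood(d_meas, d_expected, sigma, d_max,
                     p_valid=0.98, p_out=0.1, confidence=1.0, invalid=-1.0):
    """Probability of one disparity measurement under a Stixel depth model
    (uniform outlier component + Gaussian around the expected disparity)."""
    if d_meas == invalid:
        return 1.0 - p_valid                      # invalid pixels: flat probability
    w_gauss = (1.0 - p_out) * confidence          # confidence shifts weight to the Gaussian
    gauss = math.exp(-0.5 * ((d_meas - d_expected) / sigma) ** 2) \
            / (sigma * math.sqrt(2.0 * math.pi))
    uniform = 1.0 / (d_max + 1.0)
    return p_valid * ((1.0 - w_gauss) * uniform + w_gauss * gauss)

def depth_energy(d_meas, d_expected, sigma, d_max):
    """Negative log-likelihood used as the per-pixel depth energy."""
    return -math.log(depth_likelihood(d_meas, d_expected, sigma, d_max))

print(depth_energy(30.2, 30.0, sigma=1.0, d_max=128))  # consistent measurement: low cost
print(depth_energy(80.0, 30.0, sigma=1.0, d_max=128))  # outlier: cost stays bounded
```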

3.1.1 New Depth Model

The depth model defines the 3D outline of a Stixel using very few parameters per Stixel and reflects our assumptions on the surrounding scene. Both the data term (c.f. Eq. (9)) and the priors (c.f. Sect. 3.2) have a significant impact on the inferred depth model. In previous formulations, the three geometric classes were designed using restrictive constant-height (ground Stixels) and constant-depth (object and sky Stixels) assumptions per Stixel, e.g. a single constant disparity for each object Stixel.

This paper introduces a new plane depth model that relaxes the previous assumptions in favor of a more accurate depth representation. The new model is formulated such that it interacts nicely with this well-founded and experimentally validated depth sensor model. To this end, we formulate the depth model using two random variables defining a plane in disparity space, which evaluates to the expected disparity at a given row via

(10)

Note that we assume narrow Stixels and thus can neglect one plane parameter, i.e. the roll.

This model is a generalization of the class-specific depth models used in previous works, allowing for a more flexible representation of the scene thanks to the extra free parameter, c.f. Fig. 4. The different Stixel classes, i.e. object, ground and sky, are modeled through priors, as explained in Sect. 3.2.5. The details of the inference are given in Sect. 3.3.
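As a concrete reading of the plane model, the sketch below evaluates the expected disparity of a Stixel at the rows it covers from two plane parameters, here called slant a and offset b (the symbols of Eq. (10) may differ). Setting a = 0 recovers the constant-disparity object model of the original formulation.

```python
import numpy as np

def plane_disparity(a, b, v_bottom, v_top):
    """Expected disparity of a Stixel at each row it covers.

    The Stixel depth model is a line in (row, disparity) space:
        d(v) = a * v + b
    a = 0 reproduces the classic constant-disparity object model, while
    a != 0 represents slanted roads or slanted object surfaces.
    """
    rows = np.arange(v_bottom, v_top + 1)
    return a * rows + b

# Fronto-parallel object at ~25 px disparity vs. a slanted ground segment.
print(plane_disparity(a=0.0,  b=25.0, v_bottom=100, v_top=104))
print(plane_disparity(a=-0.3, b=80.0, v_bottom=100, v_top=104))
```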

Fig. 4

Comparison of original (Schneider et al. 2016) (top) and our Slanted Stixels (bottom): due to the fixed slant in the original formulation, the road surface is not well represented as illustrated on the top-left figure. The novel model is capable of reconstructing the whole scene accurately

3.2 Prior Term

The prior captures knowledge about the segmentation independent of the measurements. In this section we define the priors used in our model; they are based on Cordts et al. (2017). The Markov property is exploited so that the prior reduces to pair-wise relations between subsequent Stixels. Accordingly, the prior is computed as

(11)

The pairwise term between subsequent Stixels, introduced in the following sections, is the sum of all these priors. The prior of the first Stixel, however, does not include pairwise terms, i.e.

(12)

3.2.1 Model Complexity Prior

A model complexity term favors solutions composed of fewer Stixels and thus incurs a cost for each Stixel in the column segmentation:

(13)

There is a trade-off between compactness and accuracy. A high \(C_{mc}\) parameter leads to a very compact segmentation, i.e. few Stixels. However, a representation with few Stixels is more likely to have lower accuracy; e.g. a solution comprised of a single Stixel spanning the whole column would result in a huge disparity and semantic error.

3.2.2 Segmentation Priors

The model has to enforce that every pixel is assigned to exactly one Stixel, i.e. Stixels do not overlap, they cover the entire column, and they are connected. Therefore, the first priors are defined to comply with the following rules: the first Stixel must begin in row 1 and the last Stixel must end in row h, i.e.

(14)
(15)

Furthermore, every Stixel must be connected to the next one and the Stixel top row must be greater than the bottom row, i.e.

(16)
(17)

3.2.3 Structural Priors

The gravity prior penalizes a flying object, i.e. an object Stixel not lying on top of the preceding ground Stixel,

(18)

where the penalized quantity is the difference between the object Stixel disparity at its bottom pixel \(v_i^b\) and the disparity of the ground Stixel at the top row \(v_{i}^t\). It only applies when \(s_i\) is an object Stixel and \(s_{i-1}\) is a ground Stixel.

The depth ordering prior penalizes a combination of two staggered object Stixels when the upper of the two is closer (in distance to the car) than the lower one.

(19)

A novel prior is introduced in this paper: the ground gap prior penalizes two consecutive ground Stixels when the bottom disparity of the upper Stixel, i.e. the disparity at row \(v_{i}^b\), and the disparity of the lower Stixel evaluated at the same row \(v_{i}^b\) do not match.

(20)

These structural priors do not enforce hard constraints. Instead, they penalize unusual combinations, e.g. a flying object (gravity prior) or traffic signs (ordering prior).
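The sketch below shows how such soft structural penalties could be scored for a pair of vertically adjacent Stixels, evaluating each Stixel's plane at the relevant rows. The penalty shapes (scaled absolute and hinge differences) and weights are illustrative assumptions; rows are counted from the bottom of the column here.

```python
def disparity_at(stixel, row):
    """Evaluate a Stixel's plane model d(v) = a*v + b at a given row."""
    return stixel["a"] * row + stixel["b"]

def gravity_penalty(lower, upper, weight=1.0):
    """Soft penalty for 'flying' objects: an object Stixel whose bottom
    disparity does not match the ground Stixel it should rest on."""
    if lower["cls"] == "ground" and upper["cls"] == "object":
        gap = abs(disparity_at(upper, upper["vb"]) - disparity_at(lower, lower["vt"]))
        return weight * gap
    return 0.0

def ordering_penalty(lower, upper, weight=1.0):
    """Penalize stacked object Stixels where the upper one is closer
    to the car (larger disparity) than the lower one."""
    if lower["cls"] == "object" and upper["cls"] == "object":
        diff = disparity_at(upper, upper["vb"]) - disparity_at(lower, lower["vt"])
        return weight * max(0.0, diff)
    return 0.0

def ground_gap_penalty(lower, upper, weight=1.0):
    """Penalize consecutive ground Stixels whose planes do not meet
    at the bottom row of the upper Stixel."""
    if lower["cls"] == "ground" and upper["cls"] == "ground":
        row = upper["vb"]
        return weight * abs(disparity_at(upper, row) - disparity_at(lower, row))
    return 0.0

# Rows counted from the bottom: a ground Stixel with an object standing on it.
ground = {"cls": "ground", "a": -0.5, "b": 60.0, "vb": 1,  "vt": 80}
obj    = {"cls": "object", "a":  0.0, "b": 18.0, "vb": 81, "vt": 120}
print(gravity_penalty(ground, obj))   # > 0: the object 'floats' slightly
```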

3.2.4 Transition Priors

These priors define the knowledge regarding the transition between a pair of Stixels.

(21)

where \(\gamma _{c_i,c_{i-1}}\) is the cost of transitioning from the previous Stixel class \(c_{i-1}\) to the current Stixel class \(c_i\). It is defined via a two-dimensional transition matrix covering all combinations of classes. Only first-order relations are modeled in order to keep inference efficient.
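A minimal sketch of the class-transition lookup follows; the entries of \(\gamma _{c_i,c_{i-1}}\) are model parameters, so the numbers below are placeholders only.

```python
# Placeholder transition costs gamma[c_prev][c_cur]; e.g. jumping from
# ground directly to sky is made expensive, ground -> object is cheap.
GAMMA = {
    "ground": {"ground": 0.5, "object": 0.1, "sky": 2.0},
    "object": {"ground": 1.0, "object": 0.4, "sky": 0.2},
    "sky":    {"ground": 5.0, "object": 5.0, "sky": 0.5},
}

def transition_cost(c_prev, c_cur):
    """First-order transition prior between vertically adjacent Stixels."""
    return GAMMA[c_prev][c_cur]

print(transition_cost("ground", "object"))   # cheap: objects usually stand on ground
```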

3.2.5 Plane Prior

In this paper, we propose a new additional prior term that uses the specific properties of the three geometric classes. We expect the two random variables representing the plane parameters of a Stixel to be Gaussian distributed, i.e.

(22)

This prior favors planes in accordance with the expected 3D layout of the corresponding geometric class. For instance, object Stixels are expected to have an approximately constant disparity, i.e. a slant close to zero. The expected road slant can be set using prior knowledge or a preceding road surface detection. For sky Stixels we expect infinite distance, i.e. zero disparity, and therefore set the expected plane parameters to zero.

The standard deviations of the plane parameters are used to enforce the assumptions for each Stixel class, i.e. the more confident we are that object Stixels have constant distance, the closer to 0 we set the standard deviation of the object slant. The same applies to ground Stixels: if we know the road is not slanted, we can rely on the expected road model and set small standard deviations. For sky Stixels, a disparity different from 0 does not make sense; therefore, we set both standard deviations to (approximately) zero.

Note that the novel formulation is a strict generalization of the original method: the two are equivalent if the slant is fixed, e.g. to zero for object Stixels.
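A sketch of how the plane prior could be scored follows: independent Gaussian penalties on the two plane parameters, with per-class means and standard deviations chosen along the lines described above (tight deviations enforce the class assumption, zero disparity for sky). All numerical values are illustrative, not the parameters used in the paper.

```python
# Per-class expectations for (slant a, offset b) and how strongly to enforce
# them; None leaves a parameter unconstrained. Illustrative values only.
PLANE_PRIOR = {
    "object": {"mu_a": 0.0,  "sigma_a": 0.05, "mu_b": None, "sigma_b": None},
    "ground": {"mu_a": -0.5, "sigma_a": 0.30, "mu_b": 60.0, "sigma_b": 20.0},
    "sky":    {"mu_a": 0.0,  "sigma_a": 1e-3, "mu_b": 0.0,  "sigma_b": 1e-3},
}

def gaussian_energy(x, mu, sigma):
    """Negative log of an (unnormalized) Gaussian penalty."""
    return 0.5 * ((x - mu) / sigma) ** 2

def plane_prior_energy(cls, a, b):
    p = PLANE_PRIOR[cls]
    e = gaussian_energy(a, p["mu_a"], p["sigma_a"])
    if p["mu_b"] is not None:            # the offset may be left free (e.g. objects)
        e += gaussian_energy(b, p["mu_b"], p["sigma_b"])
    return e

print(plane_prior_energy("object", a=0.02, b=25.0))  # near-constant disparity: cheap
print(plane_prior_energy("sky",    a=0.0,  b=5.0))   # non-zero sky disparity: expensive
```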

3.3 Inference

The sophisticated energy function defined in Sect. 3 is optimized via Dynamic Programming as in Pfeiffer and Franke (2011). However, we must also optimize jointly for the novel depth model. When optimizing for the plane parameters of a certain Stixel, it becomes apparent that all other optimization parameters are independent of the actual choice of the plane parameters. We can thus simplify

(23)

Thus, we minimize the global energy function with respect to the plane parameters of all Stixels and all geometric classes independently. We can find an optimal solution of the resulting weighted least squares problem in closed form. However, we still need to compare the Stixel measurements to our new plane depth model. Therefore, the complexity added to the original formulation is another quadratic term in the image height.
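For a fixed Stixel extent, fitting the plane parameters is a small weighted least-squares problem in (row, disparity) space, with the Gaussian plane prior acting as a ridge-style regularizer. The sketch below shows the closed-form normal-equation solution under these assumptions; the notation and the way the prior enters are illustrative.

```python
import numpy as np

def fit_stixel_plane(rows, disparities, weights,
                     mu=(0.0, 0.0), prior_precision=(0.0, 0.0)):
    """Closed-form weighted least squares for d(v) = a*v + b.

    rows, disparities, weights: per-pixel measurements inside the Stixel
    (weights can encode stereo confidence and down-weight outliers).
    mu / prior_precision: Gaussian prior on (a, b); a precision of 0 means
    no prior on that parameter.
    """
    V = np.stack([rows, np.ones_like(rows)], axis=1).astype(float)  # design matrix
    W = np.diag(np.asarray(weights, dtype=float))
    P = np.diag(prior_precision)                                     # prior as ridge term
    A = V.T @ W @ V + P
    rhs = V.T @ W @ np.asarray(disparities, dtype=float) + P @ np.asarray(mu, dtype=float)
    a, b = np.linalg.solve(A, rhs)
    return a, b

rows = np.arange(100, 120)
disp = 80.0 - 0.4 * rows + np.random.default_rng(1).normal(0.0, 0.3, rows.size)
a, b = fit_stixel_plane(rows, disp, np.ones(rows.size))
print(round(a, 2), round(b, 1))   # roughly recovers slope -0.4 and offset 80
```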

3.4 Stixel Cut Prior

The Stixel inference process described so far requires estimating the cost of each possible Stixel in an image. However, many Stixels can be trivially discarded, e.g. in image regions with homogeneous depth and semantic input, making it possible to avoid the computation steps associated with them.

We propose a novel prior that exploits hypothesis generation to significantly reduce the computational burden of the inference task. To this end, we formulate a new prior similar to Cordts et al. (2014); however, instead of Stixel bottom and top probabilities, we incorporate generic likelihoods for pixels being the cut between two Stixels.

We leverage this additional information by adding a novel prior term for a Stixel

(24)

where the prior term encodes the confidence for a cut at each row; a confidence of zero implies that no cut between two Stixels is placed at that row.

As described in Pfeiffer (2014), we can formulate a recursive definition of the optimization problem in order to solve it with a Dynamic Programming scheme. To simplify our description, we use a special notation to refer to Stixels: \(ob_{b}^{t} = \{v^{b}, v^{t}, object\}\). Similarly, \(OB^k\) is defined as the minimum aggregated cost of the best segmentation from position 0 to k, and the Stixel at the end of the segmentation associated with this minimum cost is denoted as \(ob^k\). We next show a recursive definition of the problem:

(25)

We only show the case for object Stixels; the other cases are solved analogously, with \(GR^k\) and \(SK^k\) standing for ground and sky, respectively. The base case, i.e. segmenting a column consisting of the single pixel at the bottom, is defined as \(OB^0 = E_{data}(ob_0^0)+E_{prior}(ob_0^0)\). Our method relies on all the optimal cuts being included in the over-segmentation [cuts in Eq. (25)]; therefore, only those positions are checked as Stixel bottom and top rows. This reduces the complexity of the Stixel estimation problem for a single column to \(\mathcal {O}(h' \times h')\), where \(h'\) is the number of over-segmentation cuts computed for this column and \(h' \ll h\), with h the image height.
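A minimal Dynamic Programming sketch over a single column, restricted to a given list of candidate cut rows as in Eq. (25), is shown below. For brevity the class handling is collapsed into a single table and the cost function is a stand-in for the full data and prior terms; it is meant only to illustrate how limiting the candidate cuts shrinks the search.

```python
def best_segmentation(cuts, stixel_cost):
    """Segment one column using only the candidate cut rows in `cuts`.

    cuts: sorted candidate Stixel top rows, ending at the last image row.
    stixel_cost(vb, vt): data + prior cost of a Stixel spanning rows [vb, vt].
    Returns (minimal total cost, list of (vb, vt) Stixels).
    """
    best = {-1: (0.0, [])}                    # aggregated cost up to a cut row
    for k, top in enumerate(cuts):
        candidates = []
        for prev in [-1] + list(cuts[:k]):    # previous cut row = bottom - 1
            cost, seg = best[prev]
            c = cost + stixel_cost(prev + 1, top)
            candidates.append((c, seg + [(prev + 1, top)]))
        best[top] = min(candidates, key=lambda x: x[0])
    return best[cuts[-1]]

# Toy cost: disparity variance inside the Stixel plus a per-Stixel penalty.
disp = [50] * 30 + [20] * 40 + [0] * 30       # ground-like / object / sky column
def cost(vb, vt):
    seg = disp[vb:vt + 1]
    mean = sum(seg) / len(seg)
    return sum((d - mean) ** 2 for d in seg) + 100.0

print(best_segmentation([29, 69, 99], cost))  # recovers the three segments
```

With \(h'\) candidate cuts per column, the double loop above performs \(\mathcal {O}(h' \times h')\) cost evaluations, which is exactly the complexity reduction discussed in the text.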

The computational complexity reduction becomes apparent in Fig. 5. As stated in Cordts et al. (2017), the inference problem can be interpreted as finding the shortest path in a directed acyclic graph. Our approach prunes all vertices whose Stixel top row is not included in the Stixel cut prior, c.f. Fig. 5b.

Fig. 5

Stixel inference illustrated as shortest path problem on a directed acyclic graph: the Stixel segmentation is computed by finding the shortest path from the source (left gray node) to the sink (right gray node). The vertices represent Stixels with colors encoding their geometric class, i.e. ground, object and sky. Only the incoming edges of ground nodes are shown for simplicity. Adapted from Cordts et al. (2017) (Color figure online)

4 Generation of the Stixel Cut Prior

The previous section explained how a Stixel cut prior can be used to reduce the computational complexity of the Stixel inference. The idea is that many Stixel cuts can be trivially discarded, e.g. in image regions with homogeneous depth and semantic input; we can save a lot of computation by not processing those unlikely Stixel cuts. The goal is to devise a fast method that generates an over-segmentation of the optimal Stixel cuts: if the optimal cuts are included in the generated hypotheses, then the Stixel algorithm provides the same output as in the original case, while performing far fewer computation steps.

We propose two methods to generate Stixel cuts. The first is a simple strategy that detects extreme points of the disparity signal to identify candidate cuts, c.f. Sect. 4.1. It is very fast, but misses some of the optimal Stixel cuts and therefore reduces the final accuracy of the Stixel inference. The second method uses a shallow Fully Convolutional Network (FCN), trained on the disparity map, to infer likely Stixel cuts, c.f. Sect. 4.2. This strategy is also very fast, since the FCN is small, and is able to recover almost all of the optimal Stixel cuts. For both methods, we additionally leverage semantic segmentation information by including the edges of the semantic image in the set of generated Stixel cuts.

4.1 Time Series Compression

The first method to generate Stixel cuts is based on the work of Ignat (2016) and has linear time complexity and linear memory requirements. In that work, each column of the disparity map is treated independently as a time series, i.e. a signal with measurements at equal intervals. An extreme point detection step first generates a list of possible Stixel cuts, and subsequent filters are then applied to this list to produce the final Stixel segmentation. As we want to obtain an over-segmentation containing all the optimal Stixel cuts, we only use the first step of their proposal.

The detection of extreme points is based on techniques for time series compression (Fink and Gandhi 2011). A time series can be compressed by selecting local extreme points, i.e. maxima and minima of a function within a range. The assumption is that local extreme points are enough to find the important parts of the signal, and the rest would be unimportant points or noise.

In Ignat (2016), only left and right extrema are selected, while other kinds of extrema are discarded. Given a time series \(\{t_1,t_2,\dotsc ,t_i,\dotsc ,t_{n-1},t_n\}\) and a point \(t_i\) with \(1< i < n\), the definitions of left and right minima are as follows (the definitions of maxima are symmetric):

  • \(t_i\) is a left minimum if \(t_i < t_{i-1}\) and there is a \(t_j\) such that \(j > i\) and \(t_i = \dotsc = t_j < t_{j+1}\).

  • \(t_i\) is a right minimum if \(t_i < t_{i+1}\) and there is a \(t_j\) such that \(j < i\) and \(t_{j-1} > t_j = \dotsc = t_i\).

Similarly, we generate Stixel cuts by selecting the left and right extrema together with the first and last points of the pixel sequence in each column. The example in Fig. 6 illustrates the method; the predicted Stixel cuts are indicated in red. In this example, the vertical resolution is reduced by a factor of roughly 3.3, which implies correspondingly reduced computational work for the Stixel inference task.
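A sketch of the extreme-point detection on one disparity column follows, using the left/right minimum and maximum definitions above; plateau handling is simplified, and the first and last rows are always kept as cuts.

```python
def extreme_point_cuts(column):
    """Candidate Stixel cuts for one column: left/right extrema of the
    disparity signal plus the first and last rows (simplified variant)."""
    n = len(column)
    cuts = {0, n - 1}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and column[j + 1] == column[i]:
            j += 1                               # [i, j] is a run of equal values
        if i > 0 and j < n - 1 and column[i - 1] != column[i]:
            is_min = column[i] < column[i - 1] and column[j] < column[j + 1]
            is_max = column[i] > column[i - 1] and column[j] > column[j + 1]
            if is_min or is_max:
                cuts.add(i)                      # left extremum of the run
                cuts.add(j)                      # right extremum of the run
        i = j + 1
    return sorted(cuts)

disparity_column = [60, 58, 58, 55, 57, 57, 57, 40, 40, 42, 41, 0, 0, 0]
print(extreme_point_cuts(disparity_column))      # e.g. [0, 3, 4, 6, 7, 8, 9, 13]
```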

Fig. 6

Generated Stixel cuts (highlighted in red) using the left and right extrema as defined in Ignat (2016), and also cuts generated from semantic segmentation. Stixel cut density is \(30\%\), equivalent to a \(3.3\times \) reduction in vertical resolution (Color figure online)

4.2 FCN-Based Method

We propose a shallow Fully Convolutional Network, c.f. Fig. 8, that generates a set of promising Stixel cuts from disparity images, c.f. Fig. 7. We follow the proposal of Jasch et al. (2018) and use disparities instead of depth. We have experimentally found that adding the RGB image to the input of the network does not improve the accuracy of the method, compared to the simpler and faster strategy of directly adding the edges of the semantic image to the set of generated Stixel cuts.

Fig. 7

Generated Stixel cuts (highlighted in red) for the FCN-based method. Stixel cut density is \(31.5\%\), equivalent to a \(3.2\times \) reduction in vertical resolution (Color figure online)

We design the network to provide an over-segmentation of the optimal Stixel cuts that is significantly smaller than the total number of potential Stixel cuts (which equals the height of the image). Also, the computational work required for network inference must be small, ideally similar to that of the Time Series method proposed in Sect. 4.1. In the remainder of this section, we first discuss the proposed network architecture and then describe the training data and strategy.

Fig. 8

Definition of the proposed Fully Convolutional Network for generating Stixel cuts

4.2.1 Network Architecture

Our proposal is based on the architecture described by Schneider et al. (2017). They present a multi-modal FCN designed for semantic segmentation with a mid-level fusion architecture that exploits complementary input cues, i.e. RGB and disparity data. Their design includes the Network in Network (NiN) approach proposed by Lin et al. (2013). Our network inherits the branch that processes the disparity data and discards the branch operating on the RGB data; it is described in detail in Fig. 8. The proposed FCN is a very shallow network with three consecutive NiN blocks and a final deconvolution that recovers the desired resolution of the Stixel cuts. The output of the FCN is a binary image indicating, for each pixel, whether or not there is a Stixel cut.
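Since the exact layer hyper-parameters are specified in Fig. 8, the sketch below only mirrors the described structure: three NiN blocks (a spatial convolution followed by 1\(\times \)1 convolutions) applied to the disparity input, and a deconvolution that restores the input resolution with a per-pixel cut/no-cut output. The channel counts, kernel sizes and total stride are assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn

def nin_block(in_ch, out_ch, kernel_size, stride):
    """Network-in-Network block: a spatial conv followed by two 1x1 convs."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=kernel_size // 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
    )

class StixelCutFCN(nn.Module):
    """Shallow FCN: disparity map in, per-pixel cut/no-cut logits out.
    Channel counts, kernel sizes and the total stride of 8 are assumed."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nin_block(1,   32, kernel_size=7, stride=2),
            nin_block(32,  64, kernel_size=5, stride=2),
            nin_block(64, 128, kernel_size=3, stride=2),
        )
        # Deconvolution recovers the input resolution; 2 maps = cut / no cut.
        self.upsample = nn.ConvTranspose2d(128, 2, kernel_size=16, stride=8, padding=4)

    def forward(self, disparity):
        return self.upsample(self.features(disparity))

net = StixelCutFCN()
logits = net(torch.randn(1, 1, 384, 768))   # a down-sampled disparity map
print(logits.shape)                         # torch.Size([1, 2, 384, 768])
```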

4.2.2 Training Data

We trained the proposed FCN using disparity maps generated from images in the Synthia synthetic dataset (Ros et al. 2016) and from images in a real-data sequence (6757 images) recorded in San Francisco, c.f. Fig. 9. In both cases, the disparity maps are generated from the left and right RGB images using a stereo matching algorithm (Hirschmüller 2008). This is the expected situation in a realistic scenario, where the SGM algorithm in the perception pipeline generates the disparity map and feeds the FCN that produces the Stixel over-segmentation.

Fig. 9

Sample image from the real-data sequence used for Stixel cut generation. Stixel cut ground-truth is highlighted in red (Color figure online)

The ground-truth for the training data (the expected Stixel cuts) is generated as a combination of methods. For the annotated synthetic dataset, which contains both pixel-level semantic and instance-level annotations, the ground-truth includes as desired Stixel cuts the boundaries of the instances and of the semantic classes in the image (as in Cordts et al. 2017). The Stixel cuts associated with disparity changes are then obtained by running the Stixel inference method. For the real-data sequence, we only perform this last step, because no ground-truth annotations are available.

As discussed previously, c.f. Sect. 3.2.1, the parameters of the Stixel model represent a trade-off between compactness and accuracy. Since we need an over-segmentation of the optimal Stixel cuts, we adjust the parameters of the model to be conservative, favoring accuracy over compactness.

The idea of using the Stixel model as a way to train a fast and simple neural network to approximate the optimal Stixel segmentation is inspired by model distillation techniques (Bucila et al. 2006). The comparatively slow Dynamic Programming method to solve the probabilistic model is used to transfer the knowledge inside the complex model to a fast and compact FCN that approximates the optimal Stixel cuts.

4.2.3 Training Strategies

Since our problem is to classify each pixel of the input disparity map as cut or not-cut, we use cross-entropy as the loss function to be minimized. The distribution of cut/not-cut pixels is strongly imbalanced and, accordingly, we introduce a class-balancing weight in the loss function, similarly to Xie and Tu (2017). These weights cause the FCN to generate wider edges, c.f. Fig. 7. This is useful, since the FCN only needs to roughly detect the Stixel cut positions; the precise localization is left to the Stixel inference.
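A sketch of the class-balancing scheme in the spirit of Xie and Tu (2017) follows: the per-pixel cross-entropy is weighted inversely to how frequent cut and not-cut pixels are in the batch. The exact weighting used in the paper may differ, so this is only an illustration.

```python
import torch
import torch.nn.functional as F

def balanced_cut_loss(logits, target):
    """Class-balanced cross-entropy for cut/no-cut prediction.

    logits: (N, 2, H, W) network output; target: (N, H, W) with 1 = cut.
    Cut pixels are rare, so each class is weighted by the relative
    frequency of the other class (HED-style edge-detection balancing).
    """
    beta = target.float().sum() / target.numel()   # fraction of cut pixels
    weights = torch.stack([beta, 1.0 - beta])      # weight for class 0, class 1
    return F.cross_entropy(logits, target, weight=weights)

logits = torch.randn(2, 2, 64, 64, requires_grad=True)
target = (torch.rand(2, 64, 64) > 0.9).long()      # ~10% of the pixels are cuts
loss = balanced_cut_loss(logits, target)
loss.backward()
print(float(loss))
```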

Fig. 10

The SYNTHIA-SF Dataset. A sample frame (left) with its depth (center) and semantic labels (right)

We set the learning rate to \(10^{-8}\) and the batch size to five: four of those inputs are Synthia images and one of them is a real-data image. The missing disparities are encoded as \(-1\). Input normalization is done by subtracting the mean value from the disparity map. We initialize the FCN with the weights used in Schneider et al. (2017), since semantic segmentation is a similar problem.

5 Experiments

This section assesses the accuracy and run-time of our proposal. An important concern is to verify that our method not only improves the representation of scenes with non-flat roads, but also maintains the accuracy for scenes containing only flat roads. For that purpose, we present the datasets of synthetic and real data used to evaluate our proposal in Sect. 5.1. We introduce inputs, metrics, baselines, and other experimental details in Sect. 5.2. Finally, quantitative and qualitative results are reported in Sect. 5.3.

5.1 Datasets

As our Stixel model represents geometric and semantic information, we must evaluate the accuracy of our method for both. For that purpose, we select Ladicky (Ladicky et al. 2014), an annotated subset of KITTI (Geiger et al. 2012), which is, to the best of our knowledge, the only dataset containing both dense semantic labels and depth ground-truth. It consists of a set of 60 images with 0.5 MP resolution that we use for evaluating Stixel semantic and depth accuracy. We follow the suggestion given by the author (Ladicky et al. 2014) to ignore the three rarest object classes, which leaves us with 8 classes.

Additionally, for training our semantic segmentation FCN, we use publicly available semantic annotations on other parts of KITTI (Kundu et al. 2014; He and Upcroft 2013; Sengupta et al. 2013; Xu et al. 2013; Zhang et al. 2015). Our total training set is composed of 676 images, where we harmonized the object classes used by the different authors to the previously mentioned set suggested by Ladicky et al. (2014). This harmonization and data processing is the same as applied in previous work (Schneider et al. 2016) to allow for a fair comparison.

In order to further evaluate disparity accuracy we use the training data of the well-known stereo challenge KITTI 2015 (Geiger et al. 2012). This dataset provides a set of 200 images with sparse disparity ground-truth obtained from a laser scanner. There is no suitable semantic ground-truth available for this dataset.

Furthermore, we also evaluate semantic accuracy using Cityscapes (Cordts et al. 2016), a highly complex dataset with dense annotations of 19 classes on \(\sim 3000\) images for training and 500 images for validation that we use for testing.

Unfortunately, all the above datasets were generated in flat road environments. Hence, they only help us validate that we are not decreasing our accuracy for this kind of environment. In order to compare the accuracy of competing algorithms on non-flat road scenarios, we need a new dataset.

Therefore, we introduce a new synthetic dataset inspired by Ros et al. (2016). This dataset has been generated with the purpose of evaluating our proposed model; however, it contains enough information to be useful in additional related tasks, such as object recognition, semantic and instance segmentation, among others.

SYNTHIA-San Francisco (SYNTHIA-SF) consists of photo-realistic frames rendered from a virtual city and comes with precise pixel-level depth and semantic annotations for 19 classes c.f. Fig. 10. This new dataset contains 2224 images that we use to evaluate both depth and semantic accuracy in non-flat roads.

Table 1 Accuracy of our methods compared to Semantic Stixels (Schneider et al. 2016), raw SGM and FCN
Table 2 Number of Stixels (\(10^3\)) generated by our methods compared to Semantic Stixels (Schneider et al. 2016) and raw input (total number of pixels)

5.2 Experiment Details

5.2.1 Metrics

We evaluate our proposed method in terms of semantic and depth accuracy using two metrics. The depth accuracy is measured as the rate of outliers of the disparity estimates, the standard metric used in the KITTI benchmark (Geiger et al. 2012). An outlier is a disparity estimate with an absolute error larger than 3 px or a relative deviation larger than 5% compared to the ground-truth. The semantic accuracy is evaluated with the average Intersection-over-Union (IoU) over all classes, which is also a standard measure for semantic segmentation (Everingham et al. 2015). We measure the number of Stixels generated per image to quantify the complexity of the obtained representation. Finally, we evaluate the inference speed of the algorithm using the frame-rate (Hz), which helps us estimate whether our system is capable of real-time performance. All the execution times of Stixels and SGM are obtained using a multi-threaded implementation running on standard consumer hardware (Intel i7-6800K). The semantic segmentation FCN frame-rate estimates are obtained on a Maxwell NVIDIA Titan X. The Stixel frame-rate includes the over-segmentation approach. Note that the Stixel frame-rate is variable when an over-segmentation method is used, i.e. it depends on the number of Stixel cuts removed; we therefore report a representative frame-rate. Similarly to Cordts et al. (2017), we can maximize the throughput of the system by computing SGM and semantic segmentation in parallel; the system then runs with a delay of one frame.
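Both metrics can be stated compactly; the sketch below computes the disparity outlier rate exactly as defined above (absolute error > 3 px or relative deviation > 5%) and a mean IoU over the classes that occur, on synthetic toy data.

```python
import numpy as np

def disparity_outlier_rate(d_est, d_gt, valid):
    """Fraction of outliers: |error| > 3 px or relative deviation > 5%."""
    err = np.abs(d_est - d_gt)
    outlier = (err > 3.0) | (err > 0.05 * np.abs(d_gt))
    return outlier[valid].mean()

def mean_iou(pred, gt, num_classes):
    """Average Intersection-over-Union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

rng = np.random.default_rng(0)
gt = rng.uniform(5.0, 80.0, size=(100, 100))
est = gt + rng.normal(0.0, 1.0, size=gt.shape)
print(disparity_outlier_rate(est, gt, valid=np.ones_like(gt, dtype=bool)))
print(mean_iou(rng.integers(0, 3, (50, 50)), rng.integers(0, 3, (50, 50)), 3))
```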

5.2.2 Baseline

Semantic Stixels (Schneider et al. 2016) serve as our comparison baseline, as they achieve state-of-the-art results in terms of Stixel accuracy. We provide the accuracy of our new disparity model, c.f. Sect. 3. Finally, we evaluate the complexity of the fast approach defined in Sect. 3.4, with the two over-segmentation techniques presented in Sects. 4.1 and 4.2.

Fig. 11

Frame-rate of our method (only the Stixel computation step and the corresponding over-segmentation approach) compared to Semantic Stixels (Schneider et al. 2016) for SYNTHIA-SF (image resolution of 1920\(\times \)1080) on a multi-threaded CPU implementation (Intel i7-6800K) computed with a Stixel width of 8 pixels and equivalent down-sampling in the v-direction. Different methods of over-segmentation are also compared, these are: Time Series c.f. Sect. 4.1, FCN c.f. Sect. 4.2

Table 3 Per-stage report of frame-rate of our pipeline for a stereo pair of resolution 1242 \(\times \) 375. OS stands for Over-segmentation
Fig. 12

Exemplary outputs on real data: in all cases with non-flat roads our model correctly represents the scene, while retaining accuracy on objects. The last example shows a failure case, where our approach classifies the road as sidewalk due to erroneous semantic input. However, the original approach reconstructs a wall in this case, highlighted by a red circle. This could trigger emergency braking (Color figure online)

5.2.3 Input

As input, we use disparity images obtained via SGM (Hirschmüller 2008) and pixel-level semantic labels computed by an FCN (Long et al. 2015). We use the same FCN model used in Schneider et al. (2016) without retraining, to allow for comparison. For the same reason, we set Stixel width to 8 px. The same down-sampling is applied in the vertical direction. The rest of the parameters used are taken from Schneider et al. (2016).

We use the camera parameters obtained after calibration to set the expected ground plane parameters. For object Stixels, we constrain the slant to zero, i.e. constant disparity, because the disparity input is too noisy for the slanted object model. Finally, since sky Stixels cannot have slanted surfaces, we constrain both of their plane parameters so that the expected disparity is zero.

In order to improve the computational efficiency of our approach, we use the two Fast Stixel over-segmentation methods presented in Sect. 4.1, labeled as Time Series, and Sect. 4.2, labeled as FCN.

5.3 Results

The quantitative results of our proposals and baselines, as described in Sect. 3, are shown in Tables 1 and 2 and Fig. 11.

The first observation is that our method achieves comparable or slightly better results on all datasets with flat roads, c.f. Semantic Stixels vs. Ours for the Ladicky, KITTI 15 and Cityscapes datasets in Table 1. These results indicate that the novel, more flexible model does not harm the accuracy in such scenarios.

We also observe that our novel model is able to accurately represent non-flat scenarios, in contrast to the original Stixel approach, yielding a substantial improvement in depth accuracy of more than \(16\%\), c.f. Semantic Stixels vs. Ours for the SYNTHIA-SF dataset in Table 1. Additionally, to verify that our method also works on real data, we provide a video of the Stixel 3D representation of a challenging non-flat road scene as supplementary material. Results also improve in terms of semantic accuracy, which we attribute to the joint semantic and depth inference benefiting from a better depth model.

A perfect over-segmentation method would find all optimal cuts, and consequently, it would have the same accuracy as not using any over-segmentation.

Our novel approach Fast: FCN has an accuracy almost equal to not using any over-segmentation method (in all cases but one). Note that our proposed Fast: FCN approach is superior to the Fast: Time Series method in all cases, c.f. the comparison of both methods for the SYNTHIA-SF dataset in Table 1.

Both over-segmentation methods increase the error on our challenging SYNTHIA-SF dataset; we attribute this to the difficult road Stixel cuts in these scenes, c.f. No over-segmentation vs. the Fast methods in Table 1.

All variants are compact representations of the surroundings, since the complexity of the Stixel representation is small compared to the high-resolution input images, c.f. Table 2.

Our last observation is that the proposed Fast variants improve the run-time of the original Stixel approach by up to \(2\times \), and also improve the novel Slanted Stixel approach by up to \(7\times \), with only a slight drop in depth accuracy c.f. Fig. 11. The benefit increases with higher resolution input images due to the quadratic and cubic computational complexity of the original and slanted Stixel inference methods, respectively. We also detail per-stage run-time c.f. Table 3 for completeness.

In addition to the quantitative evaluation presented above, we have visually inspected many of the obtained Stixel representations to check the qualitative differences between our proposal and previous work. Figure 12 illustrates some of these examples, in which scenes with non-flat roads are correctly represented and all objects in the scenario are identified by our proposal, while the previous model produces an incomplete road representation or even generates non-existing objects at some road positions.

6 Conclusions

This paper presented a novel depth model for the Stixel world that is able to account for non-flat roads and slanted objects in a compact representation that overcomes the previous restrictive constant height and depth assumptions. This change in the way Stixels are represented is required for difficult environments that are found in many real-world scenarios. Moreover, in order to significantly reduce the computational complexity of the extended model, a novel approximation has been introduced that consists of checking only reasonable Stixel cuts inferred using fast methods. We showed in extensive experiments on several related datasets that our depth model is able to better represent slanted road scenes, and that our approximation is able to reduce the run-time drastically, with only a slight drop in accuracy.

As future work, we would like to focus on circumventing the limitations of our method. Namely: (1) the vertical/column independence assumed by the model does not hold in general; a more global representation, e.g. super-pixels that span both vertically and horizontally, would be more compact and less prone to errors; (2) some surfaces, e.g. cars, are not well represented by a linear model; a more complex depth model with specific models for each semantic class could represent the scene more faithfully, although a model with more free variables could also lead to a poor representation because of noise; (3) the proposed over-segmentation algorithm has a non-predictable run-time, which is an undesirable characteristic for a real-time system; the worst case, i.e. no Stixel cuts removed, is as slow as not using over-segmentation at all (although very unlikely); (4) if the stereo rig moves during operation, there could be an offset in roll, effectively breaking the vertical world assumption.