Stixmantics: A Medium-Level Model for Real-Time Semantic Scene Understanding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8693)


In this paper we present Stixmantics, a novel medium-level scene representation for real-time visual semantic scene understanding. Relevant scene structure, motion and object class information is encoded using so-called Stixels as primitive elements. Sparse feature-point trajectories are used to estimate the 3D motion field and to enforce temporal consistency of semantic labels. Spatial label coherency is obtained by using a CRF framework.
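As a rough illustration of how a CRF can enforce spatial label coherency over segment-level primitives, the following minimal sketch computes the MAP labeling of a chain of neighbouring segments. The chain topology, the Potts pairwise term, the Viterbi (min-sum) inference, and the names `map_labels_chain`, `unary`, and `smooth_cost` are all assumptions made for this example; the abstract does not specify the graph structure, the potentials, or the inference method used in Stixmantics.

```python
import numpy as np


def map_labels_chain(unary, smooth_cost):
    """MAP labeling of a chain CRF with Potts pairwise terms (Viterbi, min-sum).

    Hypothetical helper for illustration only, not the paper's model.
    unary: (n_segments, n_labels) array of per-segment label costs.
    smooth_cost: penalty paid when two neighbouring segments differ in label.
    """
    n, k = unary.shape
    pairwise = smooth_cost * (1.0 - np.eye(k))  # Potts: 0 if equal, else penalty
    cost = unary[0].copy()
    backptr = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # total[a, b]: best cost with label a at segment i-1 and label b at i
        total = cost[:, None] + pairwise
        backptr[i] = np.argmin(total, axis=0)
        cost = total[backptr[i], np.arange(k)] + unary[i]
    # Backtrack from the cheapest final label.
    labels = np.empty(n, dtype=int)
    labels[-1] = int(np.argmin(cost))
    for i in range(n - 1, 0, -1):
        labels[i - 1] = backptr[i, labels[i]]
    return labels


# Toy costs (made up): the middle segment weakly prefers label 1.
unary = np.array([[0.1, 0.9], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9], [0.2, 0.8]])
independent = map_labels_chain(unary, smooth_cost=0.0)
smoothed = map_labels_chain(unary, smooth_cost=0.5)
```

With no smoothness penalty each segment simply takes its cheapest label, so the weak outlier in the middle survives; with a penalty of 0.5 the outlier is overruled by its neighbours, which is the coherency effect the pairwise term is meant to provide.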

The proposed model abstracts and aggregates low-level pixel information to gain robustness and efficiency, yet retains enough flexibility to adequately model complex scenes such as urban traffic. Our experimental evaluation focuses on semantic scene segmentation using a recently introduced dataset for urban traffic scenes. Compared with our best baseline approach, we achieve state-of-the-art performance while reducing inference time by a factor of more than 2,000, requiring only 50 ms per image.
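To put the timing claim in perspective, the stated figures can be combined as below. This is only a back-of-the-envelope reading and assumes that the factor of 2,000 applies directly to the 50 ms figure; the symbols $t_{\text{ours}}$, $t_{\text{baseline}}$, and $f_{\text{ours}}$ are introduced here for illustration.

```latex
\begin{align*}
  t_{\text{ours}} &= 50\,\text{ms}
    \;\Rightarrow\; f_{\text{ours}} = \frac{1}{0.05\,\text{s}} = 20\ \text{images/s}, \\
  t_{\text{baseline}} &> 2000 \times 50\,\text{ms} = 100\,\text{s per image}.
\end{align*}
```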


Keywords: semantic scene understanding, bag-of-features, region classification, real-time, stereo vision, stixels



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Environment Perception, Daimler R&D, Sindelfingen, Germany
  2. Department of Computer Science, TU Darmstadt, Germany
