Semantic Road Segmentation via Multi-scale Ensembles of Learned Features

  • Jose M. Alvarez
  • Yann LeCun
  • Theo Gevers
  • Antonio M. Lopez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7584)


Semantic segmentation refers to the process of assigning an object label (e.g., building, road, sidewalk, car, pedestrian) to every pixel in an image. Common approaches formulate the task as a random field labeling problem that models the interactions between labels by combining local and contextual features such as color, depth, edges, SIFT, or HOG. These models are trained to maximize the likelihood of the correct classification given a training set. However, such approaches rely on hand-designed features and incur a high computational cost during inference.
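The random field labeling formulation mentioned above can be illustrated with a minimal sketch. The following is not the paper's implementation; it assumes a 4-connected grid with per-pixel unary costs and a simple Potts pairwise penalty, the standard form such energies take:

```python
import numpy as np

def crf_energy(unary, labels, pairwise_weight=1.0):
    """Energy of a labeling under a grid random field (illustrative sketch).

    unary: (H, W, L) array of per-pixel, per-label costs.
    labels: (H, W) integer label map.
    """
    h, w = labels.shape
    # Unary term: cost of the chosen label at each pixel.
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Pairwise Potts term: penalise disagreement between 4-neighbours.
    e += pairwise_weight * (labels[:, 1:] != labels[:, :-1]).sum()
    e += pairwise_weight * (labels[1:, :] != labels[:-1, :]).sum()
    return float(e)
```

Inference then amounts to searching for the label map minimizing this energy; the paper's contribution concerns how the unary costs are obtained.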

Therefore, in this paper, we focus on estimating the unary potentials of a conditional random field via ensembles of learned features. We propose an algorithm based on convolutional neural networks that learns local features from training data at multiple scales and resolutions. The diversity among these features is then exploited through a weighted linear combination. Experiments on a publicly available database show the effectiveness of the proposed method for semantic road scene segmentation in still images. The algorithm outperforms appearance-based methods, and its performance is comparable to state-of-the-art methods that use additional sources of information such as depth, motion, or stereo.
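The ensemble step described above can be sketched as follows. This is a hedged illustration, not the authors' exact pipeline: it assumes each scale produces per-pixel class probabilities, and that the combined probabilities are turned into unary costs via a negative log, a common convention for CRF unaries:

```python
import numpy as np

def combine_scales(scores_per_scale, weights):
    """Weighted linear combination of per-scale class scores into unary costs.

    scores_per_scale: list of (H, W, L) arrays of class probabilities,
                      one per scale (hypothetical inputs for illustration).
    weights: list of non-negative combination weights, one per scale.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so weights sum to 1
    combined = sum(w * s for w, s in zip(weights, scores_per_scale))
    # Negative log converts combined probabilities into unary costs.
    return -np.log(np.clip(combined, 1e-12, 1.0))
```

In this sketch, scales whose classifiers are more reliable would receive larger weights, which is one way the diversity among the learned features can be exploited.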


Keywords: Convolutional Neural Network · Conditional Random Field · Weighted Linear Combination · Unary Potential · Average Recognition Rate





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Jose M. Alvarez (1, 3)
  • Yann LeCun (1)
  • Theo Gevers (2, 3)
  • Antonio M. Lopez (3)
  1. Courant Institute of Mathematical Sciences, New York University, New York, USA
  2. Faculty of Science, University of Amsterdam, Amsterdam, The Netherlands
  3. Computer Vision Center, Univ. Autònoma de Barcelona, Barcelona, Spain
