Abstract
Bag-of-words based image representation is a successful approach for object recognition. Generally, the successive stages of the process (feature detection, feature description, vocabulary construction, and image representation) are performed independently of the object classes to be detected. In such a framework, it has been found that combining different image cues, such as shape and color, often yields lower-than-expected results.
This paper presents a novel method for recognizing object categories from multiple cues: the shape and color cues are processed separately and then combined by modulating the shape features with category-specific color attention. Color is used to compute bottom-up and top-down attention maps. Subsequently, these color attention maps are used to modulate the weights of the shape features. In regions with higher attention, shape features are given more weight than in regions with low attention.
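The weighting step described above can be illustrated with a minimal sketch. It is an assumption-laden reading of the abstract rather than the authors' implementation: it assumes every local feature has already been quantized to one shape word and one color word, and that the top-down attention of a feature for a class is a given probability table indexed by its color word. The function name `color_attention_histograms` and the parameter `p_class_given_color` are hypothetical, and the bottom-up attention map is not modeled.

```python
import numpy as np


def color_attention_histograms(shape_words, color_words, p_class_given_color, n_shape_words):
    """Build one attention-weighted shape histogram per class (illustrative only).

    shape_words:          (n_features,) shape visual-word index of each local feature
    color_words:          (n_features,) color visual-word index of each local feature
    p_class_given_color:  (n_classes, n_color_words) assumed attention table,
                          one weight per class and color word
    n_shape_words:        size of the shape vocabulary
    Returns an array of shape (n_classes, n_shape_words), each row L1-normalized.
    """
    shape_words = np.asarray(shape_words)
    color_words = np.asarray(color_words)
    table = np.asarray(p_class_given_color, dtype=float)

    # Attention of every feature for every class, looked up via its color word.
    attention = table[:, color_words]  # (n_classes, n_features)

    n_classes = attention.shape[0]
    hist = np.zeros((n_classes, n_shape_words))
    for c in range(n_classes):
        # Each feature votes for its shape word with its attention as the weight,
        # instead of the unit vote used in a plain bag-of-words histogram.
        hist[c] = np.bincount(shape_words, weights=attention[c], minlength=n_shape_words)

    # L1-normalize each class histogram, leaving all-zero rows untouched.
    totals = hist.sum(axis=1, keepdims=True)
    return hist / np.where(totals > 0, totals, 1.0)


# Toy example: 4 features, 3 shape words, 2 color words, 2 classes.
histograms = color_attention_histograms(
    shape_words=[0, 1, 1, 2],
    color_words=[0, 0, 1, 1],
    p_class_given_color=[[0.9, 0.1], [0.1, 0.9]],
    n_shape_words=3,
)
```

In this toy example, features carrying color word 0 dominate the histogram for class 0, while features carrying color word 1 dominate the histogram for class 1, which mirrors the idea that shape features in high-attention regions receive more weight.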
We compare our approach with existing methods that combine color and shape cues on five data sets in which the relative importance of the two cues varies, namely, Soccer (color predominance), Flower (color and shape parity), PASCAL VOC 2007 and 2009 (shape predominance) and Caltech-101 (color co-interference). The experiments clearly demonstrate that on all five data sets our proposed framework significantly outperforms existing methods for combining color and shape information.
Cite this article
Khan, F.S., van de Weijer, J. & Vanrell, M. Modulating Shape Features by Color Attention for Object Recognition. Int J Comput Vis 98, 49–64 (2012). https://doi.org/10.1007/s11263-011-0495-2
DOI: https://doi.org/10.1007/s11263-011-0495-2