A Context-Aware Capsule Network for Multi-label Classification

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11131)


Recently proposed Capsule Network is a brain inspired architecture that brings a new paradigm to deep learning by modelling input domain variations through vector based representations. Despite being a seminal contribution, CapsNet does not explicitly model structured relationships between the detected entities and among the capsule features for related inputs. Motivated by the working of cortical network in HVS, we seek to resolve CapsNet limitations by proposing several intuitive modifications to the CapsNet architecture. We introduce, (1) a novel routing weight initialization technique, (2) an improved CapsNet design that exploits semantic relationships between the primary capsule activations using a densely connected Conditional Random Field and (3) a Cholesky transformation based correlation module to learn a general priority scheme. Our proposed design allows CapsNet to scale better to more complex problems, such as the multi-label classification task, where semantically related categories co-exist with various interdependencies. We present theoretical bases for our extensions and demonstrate significant improvements on ADE20K scene dataset.


  1. 1.
    LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)CrossRefGoogle Scholar
  2. 2.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)Google Scholar
  3. 3.
    Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)Google Scholar
  4. 4.
    Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28, pp. 91–99 (2015)Google Scholar
  5. 5.
    Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. arXiv preprint arXiv:1506.02640 (2015)
  6. 6.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)Google Scholar
  7. 7.
    Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)CrossRefGoogle Scholar
  8. 8.
    Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, pp. 3859–3869 (2017)Google Scholar
  9. 9.
    LeCun, Y., Cortes, C., Burges, C.J.: The MNIST database of handwritten digits (1998)Google Scholar
  10. 10.
    Bar, M.: Visual objects in context. Nature Rev. Neurosci. 5(8), 617 (2004)CrossRefGoogle Scholar
  11. 11.
    Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)Google Scholar
  12. 12.
    Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 44–51. Springer, Heidelberg (2011). Scholar
  13. 13.
    Hinton, G.E., Sabour, S., Frosst, N.: Matrix capsules with EM routing. In: International Conference on Learning Representations (2018)Google Scholar
  14. 14.
    Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. Ser. B 39, 1–38 (1977)MathSciNetzbMATHGoogle Scholar
  15. 15.
    LeCun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 97–104 (2004)Google Scholar
  16. 16.
    Xi, E., Bing, S., Jin, Y.: Capsule network performance on complex data. arXiv preprint arXiv:1712.03480 (2017)
  17. 17.
    Zheng, S., et al.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1529–1537 (2015)Google Scholar
  18. 18.
    Ramasinghe, S., Rajasegaran, J., Jayasundara, V., Ranasinghe, K., Rodrigo, R., Pasqual, A.A.: Combined static and motion features for deep-networks based activity recognition in videos. IEEE Trans. Circuits Syst. Video Technol., 1 (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.ConscientAI LabsColomboSri Lanka
  2. 2.Australian National UniversityCanberraAustralia

Personalised recommendations