Image-Text Dual Model for Small-Sample Image Classification

  • Fangyi Zhu
  • Xiaoxu Li
  • Zhanyu Ma (corresponding author)
  • Guang Chen
  • Pai Peng
  • Xiaowei Guo
  • Jen-Tzung Chien
  • Jun Guo
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 772)


Small-sample classification is a challenging problem in computer vision with many applications. In this paper, we propose an image-text dual model to improve classification performance on small-sample datasets. The dual model consists of two sub-models: an image classification model and a text classification model. After training the two sub-models separately, we fuse them with a novel method rather than simply combining their outputs. The dual model uses text information to mitigate the difficulty of training deep models on small-sample datasets. To demonstrate its effectiveness, we conduct extensive experiments on LabelMe and UIUC-Sports. The experimental results show that the proposed model achieves the highest image classification accuracy among all compared models on both datasets.


Keywords: Small-sample image classification, Ensemble learning, Deep convolutional neural network



This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61773071, Grant 61628301, Grant 61402047 and Grant 61563030, in part by the Beijing Nova Program Grant Z171100001117049, in part by the Beijing Natural Science Foundation (BNSF) under Grant 4162044, and in part by the CCF-Tencent Open Research Fund.



Copyright information

© Springer Nature Singapore Pte Ltd. 2017

Authors and Affiliations

  • Fangyi Zhu (1)
  • Xiaoxu Li (1, 2)
  • Zhanyu Ma (1, corresponding author)
  • Guang Chen (1)
  • Pai Peng (3)
  • Xiaowei Guo (3)
  • Jen-Tzung Chien (4)
  • Jun Guo (1)
  1. Pattern Recognition and Intelligent System Lab, Beijing University of Posts and Telecommunications, Beijing, China
  2. School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China
  3. Youtu Lab, Tencent Technology, Shanghai, China
  4. Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu City, Taiwan
