
Knowledge-driven description synthesis for floor plan interpretation

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)


Image captioning is a well-studied problem in AI. Caption generation from floor plan images has applications in indoor path planning, real estate, and architectural design. Several methods in the literature generate captions or semi-structured descriptions from floor plan images. Because a caption alone cannot capture fine-grained details, researchers have also proposed generating descriptive paragraphs from images. However, these descriptions have a rigid structure and lack flexibility, which makes them difficult to use in real-time scenarios. This paper proposes two models for text generation from floor plan images: description synthesis from image cue (DSIC) and transformer-based description generation (TBDG). Both models use modern deep neural networks for visual feature extraction and text generation; they differ in how they take input from the floor plan image. The DSIC model uses only visual features extracted automatically by a deep neural network, whereas the TBDG model learns from textual captions extracted from the input floor plan images, paired with paragraphs. The specific keywords generated in TBDG, learned together with the paragraphs, make it more robust on general floor plan images. Experiments on a large-scale publicly available dataset, with comparisons against state-of-the-art techniques, show the superiority of the proposed models.
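The difference in input pathways between the two models can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the class names `DSICSketch` and `TBDGSketch`, the toy `Image` type, and the stub functions are hypothetical and stand in for the deep networks the paper describes. The sketch shows only the data flow (image to visual features to text, versus image to textual captions to text), not the authors' architectures.

```python
from dataclasses import dataclass
from typing import Callable, List

# Toy grayscale image: rows of pixel values (hypothetical stand-in for a floor plan).
Image = List[List[int]]


@dataclass
class DSICSketch:
    """Hypothetical DSIC-style pathway: image -> visual features -> text."""
    extract_features: Callable[[Image], List[float]]
    decode: Callable[[List[float]], str]

    def describe(self, image: Image) -> str:
        return self.decode(self.extract_features(image))


@dataclass
class TBDGSketch:
    """Hypothetical TBDG-style pathway: image -> textual captions -> text."""
    extract_captions: Callable[[Image], List[str]]
    generate: Callable[[List[str]], str]

    def describe(self, image: Image) -> str:
        return self.generate(self.extract_captions(image))


# Stub components; a real system would use trained networks here.
def row_means(image: Image) -> List[float]:
    # Placeholder "visual feature": mean intensity of each row.
    return [sum(row) / len(row) for row in image]


def decode_stub(features: List[float]) -> str:
    return f"Description decoded from {len(features)} visual features."


def captions_stub(image: Image) -> List[str]:
    # Placeholder keywords; fixed here, not derived from the image.
    return ["bedroom", "kitchen"]


def generate_stub(captions: List[str]) -> str:
    return "The plan contains: " + ", ".join(captions)


toy_image = [[0, 255], [255, 0]]
dsic_text = DSICSketch(row_means, decode_stub).describe(toy_image)
tbdg_text = TBDGSketch(captions_stub, generate_stub).describe(toy_image)
print(dsic_text)
print(tbdg_text)
```

In this framing, the intermediate textual captions in the second pathway are what the abstract credits for the added robustness of TBDG; the first pathway has no such text-level intermediate.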





Funding was provided by the Science and Engineering Research Board (ECR/2016/000953).

Author information


Corresponding author

Correspondence to Chiranjoy Chattopadhyay.


About this article

Cite this article

Goyal, S., Chattopadhyay, C. & Bhatnagar, G. Knowledge-driven description synthesis for floor plan interpretation. IJDAR 24, 19–32 (2021).
