
People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13833)

Abstract

To support common annotation tasks in visual media production and archiving, we propose two datasets covering the annotation of the bustle of a scene (i.e., from populated to unpopulated), the cinematographic type of a shot, and the time of day and season of a shot. The dataset for bustle and shot type, called People@Places, adds annotations to the Places365 dataset, and the ToDY (time of day/year) dataset adds annotations to the SkyFinder dataset. For both datasets, we provide a toolchain for creating automatic annotations, which have been manually verified and corrected for parts of the two datasets. We provide baseline results for these tasks using the EfficientNet-B3 model, pretrained on the Places365 dataset.
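As a rough illustration of the baseline setup, the sketch below builds a multi-task EfficientNet-B3 classifier with the timm library. It is a minimal sketch under stated assumptions: the class counts and the shared-backbone, multi-head layout are illustrative (the paper does not specify the head design here), and timm's pretrained=True loads ImageNet weights, so the Places365-pretrained weights used for the baseline would need to be loaded separately.

    import timm
    import torch
    import torch.nn as nn

    # Illustrative class counts; the actual label sets are defined by the datasets.
    NUM_BUSTLE = 2      # populated / unpopulated (a graded scale is also plausible)
    NUM_SHOT_TYPE = 5   # hypothetical, e.g. close-up through long shot
    NUM_TOD = 4         # hypothetical, e.g. morning / day / evening / night

    class SceneAnnotator(nn.Module):
        """EfficientNet-B3 backbone with one linear head per annotation task."""

        def __init__(self):
            super().__init__()
            # num_classes=0 makes timm return pooled features instead of logits
            self.backbone = timm.create_model("efficientnet_b3",
                                              pretrained=True, num_classes=0)
            dim = self.backbone.num_features  # 1536 for EfficientNet-B3
            self.bustle = nn.Linear(dim, NUM_BUSTLE)
            self.shot_type = nn.Linear(dim, NUM_SHOT_TYPE)
            self.time_of_day = nn.Linear(dim, NUM_TOD)

        def forward(self, x):
            feats = self.backbone(x)
            return self.bustle(feats), self.shot_type(feats), self.time_of_day(feats)

    model = SceneAnnotator().eval()
    with torch.no_grad():
        logits = model(torch.randn(1, 3, 300, 300))  # 300x300 is B3's native input size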



Notes

  1. https://www.ebu.ch/metadata/ontologies/ebucore/ebucore_LocationTimeType.html
  2. https://cv.iptc.org/newscodes/scene/
  3. https://en.wikipedia.org/wiki/Drawing
  4. https://github.com/openvinotoolkit/cvat
  5. https://rhodesmill.org/pyephem/ (see the sketch after this list)
  6. https://github.com/openvinotoolkit/cvat
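Footnote 5 points to PyEphem, suggesting that the ToDY toolchain derives time-of-day labels from the computed sun position at each SkyFinder camera. The sketch below is one plausible version of such a labeling step; the camera metadata fields, the label set and the twilight thresholds are illustrative assumptions, not the authors' actual scheme.

    import math
    import ephem

    def time_of_day(lat_deg, lon_deg, timestamp_utc):
        """Classify a timestamp by solar altitude at the camera location.

        lat_deg/lon_deg: camera coordinates in decimal degrees (assumed available
        in the SkyFinder metadata). timestamp_utc: e.g. "2016/06/21 12:00:00";
        PyEphem parses date strings as UTC.
        """
        obs = ephem.Observer()
        obs.lat = str(lat_deg)   # PyEphem reads strings as degrees, floats as radians
        obs.lon = str(lon_deg)
        obs.date = timestamp_utc
        sun = ephem.Sun(obs)     # apparent sun position for obs.date at obs location
        altitude = math.degrees(float(sun.alt))
        if altitude < -6.0:      # below civil twilight
            return "night"
        if altitude < 0.0:
            return "twilight"
        return "day"

Finer-grained labels (e.g., separating morning from afternoon) could additionally use the sun's azimuth or the local clock time; as the abstract notes, automatic labels of this kind were manually verified and corrected for parts of the dataset.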


Acknowledgments

The research leading to these results has been partially funded by the program “ICT of the Future” of the Austrian Federal Ministry of Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK) in the project “TailoredMedia”, and by the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 951911 AI4Media (https://ai4media.eu). The authors would like to thank Martin Winter, Hermann Fürntratt and Stefanie Onsori-Wechtitsch for support with the face detector and annotation tool setup, and Levi Herrich for checking and correcting the time of day annotations.

Author information


Corresponding author

Correspondence to Werner Bailer.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bailer, W., Fassold, H. (2023). People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13833. Springer, Cham. https://doi.org/10.1007/978-3-031-27077-2_38


  • DOI: https://doi.org/10.1007/978-3-031-27077-2_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27076-5

  • Online ISBN: 978-3-031-27077-2

  • eBook Packages: Computer Science, Computer Science (R0)
