Abstract
The need for automated real-time visual systems in applications such as smart camera surveillance, smart environments, and drones necessitates the improvement of methods for visual active monitoring and control. Traditionally, the active monitoring task has been handled through a pipeline of modules such as detection, filtering, and control. However, the parameters of such methods are difficult to jointly optimize and tune for real-time processing on resource-constrained systems. In this paper, a deep Convolutional Camera Controller Neural Network is proposed that maps visual information directly to camera movement, providing an efficient solution to the active vision problem. It is trained end-to-end, without bounding box annotations, to control a camera and follow multiple targets from raw pixel values. Evaluation through both a simulation framework and a real experimental setup indicates that the proposed solution is robust to varying conditions and achieves better monitoring performance than traditional approaches, both in the number of targets monitored and in effective monitoring time. A further advantage of the proposed approach is that it is computationally less demanding and runs at over 10 FPS (\(\sim 4\times \) speedup) on an embedded smart camera, providing a practical and affordable solution to real-time active monitoring.
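To make the end-to-end idea concrete, the sketch below shows, purely illustratively, how a small convolutional network can map a raw frame directly to a discrete camera command (no detection, filtering, or control modules in between). The layer sizes, the five-command action set, and all parameter names here are assumptions for illustration only, not the actual C\(^{3}\)Net architecture or its trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_relu(x, w, stride=2):
    # Naive valid convolution with ReLU: x is (H, W, Cin), w is (k, k, Cin, Cout).
    k = w.shape[0]
    H = (x.shape[0] - k) // stride + 1
    W = (x.shape[1] - k) // stride + 1
    out = np.empty((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = np.tensordot(patch, w, axes=3)
    return np.maximum(out, 0.0)

COMMANDS = ["stay", "left", "right", "up", "down"]  # illustrative action set

def controller(frame, params):
    # Two conv stages extract features; global average pooling and a linear
    # head score the discrete pan/tilt commands; argmax picks the movement.
    h = conv2d_relu(frame, params["w1"])
    h = conv2d_relu(h, params["w2"])
    feat = h.mean(axis=(0, 1))            # global average pooling
    logits = feat @ params["w3"]
    return int(np.argmax(logits))

params = {
    "w1": rng.normal(0.0, 0.1, (5, 5, 3, 8)),
    "w2": rng.normal(0.0, 0.1, (3, 3, 8, 16)),
    "w3": rng.normal(0.0, 0.1, (16, 5)),
}
frame = rng.random((60, 80, 3))           # a small RGB frame, values in [0, 1]
command = controller(frame, params)       # index into COMMANDS
```

In a trained system the weights would of course come from end-to-end supervision on camera-movement labels rather than random initialization; the point of the sketch is only the single pixels-to-command forward pass.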
Funding
This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 739551 (KIOS CoE) and by the Government of the Republic of Cyprus through the Directorate General for European Programmes, Coordination and Development.
Supplementary information
Below are the links to the electronic supplementary material.
Supplementary file2 (MP4 459 KB)
Supplementary file3 (AVI 571 KB)
Supplementary file4 (AVI 1420 KB)
Supplementary file5 (AVI 1147 KB)
Supplementary file6 (AVI 340 KB)
Supplementary file7 (AVI 2892 KB)
Supplementary file8 (AVI 11225 KB)
Supplementary file9 (AVI 1647 KB)
Supplementary file10 (AVI 3247 KB)
Supplementary file11 (AVI 3246 KB)
Supplementary file12 (AVI 1675 KB)
Supplementary file13 (AVI 1459 KB)
Cite this article
Kyrkou, C. \(\text{C}^{3}\text{Net}\): end-to-end deep learning for efficient real-time visual active camera control. J Real-Time Image Proc 18, 1421–1433 (2021). https://doi.org/10.1007/s11554-021-01077-z