
Tracking subjects and detecting relationships in crowded city videos


Multi-subject tracking in crowded videos is an established yet challenging research direction in computer vision and information processing. It is highly applicable in smart cities (e.g., public safety, crowd management, urban planning), autonomous vehicles, robotic vision, and psychology (e.g., understanding social interaction and crowd behavior). In this work, we propose a real-time approach that reveals tracks of subjects in ordinary videos captured in highly populated pedestrian areas, such as squares, malls, and stations. The tracks are discovered based on the proximity of the detected bounding boxes of subjects in consecutive video frames. Track fragmentation and identity switching are reduced by a re-identification phase that caches unassociated detections and mutually projects interrupted tracks. Because the proposed approach does not require time-consuming extraction of appearance-based features, it achieves superior tracking speed. In addition, we demonstrate the usability and applicability of the tracker by extracting valuable information about body-joint positions from the discovered tracks, which opens promising possibilities for detecting human relationships and interactions. We demonstrate accurate detection of couples based on their hand-holding activity and of families based on children’s body proportions. The discovery of these entitative groups is especially challenging in crowded city scenes where many subjects appear in each frame.
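To make the idea of proximity-based association concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "proximity" is measured by intersection over union (IoU) of axis-aligned boxes and that matching is greedy; the names `iou`, `associate`, and the threshold `min_iou` are hypothetical. Detections left unmatched are the kind of input a re-identification phase could cache, but that caching logic is not shown here.

```python
from typing import Dict, List, Tuple

# A box is (x1, y1, x2, y2) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box],
    detections: List[Box],
    min_iou: float = 0.3,  # hypothetical threshold, not taken from the paper
) -> Tuple[Dict[int, int], List[int]]:
    """Greedily match each track's last box to a detection in the next frame.

    Returns (matches, unmatched), where matches maps track id to the index
    of the matched detection and unmatched lists the leftover indices.
    """
    # Score every (track, detection) pair and visit the best overlaps first.
    pairs = sorted(
        (
            (iou(box, det), tid, j)
            for tid, box in tracks.items()
            for j, det in enumerate(detections)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap too little
        if tid in matches or j in used:
            continue  # each track and each detection is matched at most once
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched


if __name__ == "__main__":
    previous = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
    current = [(51, 50, 61, 60), (1, 0, 11, 10), (200, 200, 210, 210)]
    print(associate(previous, current))
```

Greedy matching keeps the sketch short; an optimal assignment solver could be substituted without changing the interface.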




  1. Different notation for \({d_{t}^{i}}\) and \({b_{t}^{I}}\) is used pragmatically, since the input detections \({d_{t}^{i}}\) and the output tracked bounding boxes \({b_{t}^{I}}\) do not, in general, represent one identical set: tracking often discards some detections (e.g., outliers) and derives new ones (e.g., detections of occluded subjects).

  2. We acknowledge that, in reality, not every pair holding hands needs to be a couple, nor does every child accompanied by an adult need to be from the same family. For simplicity, however, we consider this assumption valid in the experimental evaluation.




This research is supported by the Czech Science Foundation project No. GA19-02033S.

Author information


Corresponding author

Correspondence to Petr Elias.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Elias, P., Macko, M., Sedmidubsky, J. et al. Tracking subjects and detecting relationships in crowded city videos. Multimed Tools Appl (2022).


  • Multi-subject tracking
  • Relationship detection
  • 2D skeleton sequences
  • Video analysis
  • Smart cities