1 Introduction

Video surveillance systems are among the most widely deployed public safety and law enforcement tools. They provide a significant source of data and have become a cornerstone of most investigations. A fundamental professional need is to extract useful information from the massive quantity of visual data these systems generate.

Most existing video surveillance systems provide only the infrastructure to capture, transmit, store and distribute video images. Tedious tasks such as detecting an incident in a live stream or searching the archives for a specific scene depend on scarce and expensive human resources. Automatic scene description is therefore essential in video surveillance: it facilitates the post-processing of incidents, such as the dispatching of patrols. Moreover, video search can be raised to another level by introducing object detection and tracking, together with more specific features such as direction, shape, deformability or interaction. In particular, motion is a key cue; the aim is to focus on non-linear motion such as that observed during an interaction between objects.

This work proposes a new approach for a generic, context-independent textual description of real-world video surveillance scenes. We present new sentence representations based on well-structured templates, which can be applied to generate scene descriptions similar to those used in police reports. Our approach is based on our ontology described in [1] and detailed in Section 3, "Proposed approach".

The remainder of this paper is organised as follows. Section 2 reviews work related to automatic video scene description. Section 3 presents our proposed approach: we introduce how we produce activity matrices of useful characteristics that can be used for generating alerts and querying the scenes, and how we generate textual descriptions from these matrices. Section 4 reports the experiments and results. Section 5 discusses the difficulties, drawbacks and advantages of our approach. Finally, Section 6 presents a general conclusion, the primary contributions and future work.

2 State of the art

Automatic video scene description includes understanding and differentiating between a diversity of backgrounds, objects, interactions and scene types. Moreover, it requires translating this information into a comprehensible textual description, i.e., natural language. In the last decade, researchers have studied multiple strategies and ontologies to bridge the gap between visual content and textual description, with the computer vision and natural language processing (NLP) communities addressing the issue separately or jointly [2].

Two main types of approaches can be distinguished in the state of the art: behaviour understanding combined with sentence generation on one side, and sequence learning approaches on the other.

Sequence learning approaches directly learn how to match video content with sentences. They can be divided into video encoding and decoding stages. In the encoding stage, visual features are extracted and learnt using different types of deep neural networks, such as CNNs, RNNs or LSTMs; the result is a fixed- or dynamic-length real-valued vector. In the decoding stage, the resulting vector is used for text generation. The main techniques may involve speech recognition, language modelling, image captioning, translation and more. These approaches are considered domain-specific and are suitable for short video clips with limited vocabularies of objects and activities [3]. Some interesting works following the sequence learning approach for video description can be found in [4, 5].

Behaviour understanding mainly relies on extracting features used to train individual classifiers that identify the background, objects and actions in the scenes. Sentence generation generally requires a template with some syntactical structure, such as Subject-Verb-Object (SVO) tuples. It may use a probabilistic model to map the essential visual content extracted from the video onto each template element. However, these templates are mostly dedicated to a non-generic variety of scenes [6].

Working specifically on video surveillance scenes, most of the research works targeting the description of these scenes follow the behaviour understanding and sentence generation approaches [7, 8]. Different templates have been proposed, like:

  • (Object) (Action) in (Place) [at (high/low/middle) speed].

  • A (colour) (size) (speed) (object type) coming from (entry zone) toward (exit zone).

The complexity here is due to the diversity of scenes in terms of location, object types and numbers, and the variety of actions and interactions.

To simplify this problem, researchers proposed additional assumptions, focusing on particular types of objects, specific contexts and specific actions. But these assumptions further restrict applicability to the real world. As an example, the system in [9] assumes a surveillance scene with prior region information and only four object classes (pedestrian, car, bike and cycle), while no interaction is considered. Another example is [10], where the concepts include vehicles, people and traffic signs, allowing users to annotate traffic events. Other works [11, 12] focus on only one type of interaction: aggressiveness.

Our goal is to design a description system for video surveillance that takes advantage of behaviour understanding and sentence generation approaches, while integrating improvements based on machine and deep learning methods.

3 Proposed approach

Our approach's key idea is to leverage meaningful content features from the scene for better understanding and appropriate description. We compute a dense set of spatiotemporal feature vectors to provide a localised description of the action. These features are carefully selected to satisfy the needs of police operators and investigators: they are useful for generating alerts, querying the video surveillance footage and inferring textual sentences. We feed these features into a supervised learning method for interaction classification. The proposed framework is conceptually simple and has low storage and computational cost, making it attractive for real-time implementation.

Our approach, named Video Surveillance Scene Description (VSSD), includes the following stages. First, we segment the video based on the number of objects in the scene. At this stage, we distinguish between deformable and non-deformable objects, because there is a direct connection between an object's deformability and its behaviour in an interaction. Second, we generate an activity map, highlighting the routes and the hot spots in the field of view (FOV) and indicating the background changes. Next, we extract several features and feed them to a DNN for interaction classification. Then, we identify the critical moments in the scene. Finally, we fill an activity matrix from which we generate a textual description following our templates.

We make only two assumptions: (1) the camera is fixed and (2) only two moving objects are present in the scene. Although this case is of high interest, scientific studies targeting it are lacking.

In the diagram shown in Fig. 1, we present our approach's workflow.

Fig. 1 Proposed approach diagram, where each phase connected to an arrow's starting point feeds the phase connected to that arrow's terminal point

3.1 Segmentation and tracking

We tested many of the available segmentation and multi-object tracking algorithms; Table 1 lists some of the evaluated methods. After comparative tests, we selected the method in [19], called "Motion-Based Multiple Object Tracking", provided by MATLAB. This algorithm is based on two main steps: (1) detecting moving objects in each frame using Gaussian mixture models and (2) tracking the moving objects from frame to frame using a Kalman filter.

Table 1 List of some tested algorithms for objects segmentation and tracking
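For illustration only, the same two-step idea (Gaussian-mixture foreground detection followed by Kalman-filter tracking) could be sketched as follows in Python/OpenCV. This is not the MATLAB implementation we used; the input file name, noise covariances and morphological clean-up are assumptions, and only the largest blob is tracked in this toy example.

```python
import cv2
import numpy as np

# Step 1: GMM-based foreground detection; Step 2: Kalman prediction/correction.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

kalman = cv2.KalmanFilter(4, 2)  # state (x, y, vx, vy), measurement (x, y)
kalman.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
kalman.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kalman.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kalman.errorCovPost = np.eye(4, dtype=np.float32)

cap = cv2.VideoCapture("scene.avi")               # hypothetical input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg_subtractor.apply(frame)             # foreground mask from the GMM model
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    prediction = kalman.predict()                 # predicted track position for this frame
    if contours:                                  # keep only the largest detected blob
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        centroid = np.array([[x + w / 2], [y + h / 2]], np.float32)
        kalman.correct(centroid)                  # update the track with the detection
cap.release()
```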

The next step is to cut the video into scenes according to the number of objects present: zero, one, two or more.

3.2 Object classification

From a surveillance point of view, the actions of non-deformable objects are easy to analyse, whereas the deformable parts of an object move freely in unpredictable ways, making interactions more complicated and harder to interpret.

Our main goal is to obtain an abstract description. To the best of our knowledge, there is no generic algorithm that can segment any object type into its semantic sub-components robustly under varying external factors, i.e., invariant with respect to scene visibility and lighting. Therefore, we limit the classification and refrain from analysing sub-object segments when interacting deformable objects are present. We examine the blob encasing each object to determine whether it is deformable or not. The applied method is our own and was published in [20].

3.3 Background model and activity localisation

Once the object segmentation task is accomplished, we know, for each video frame, whether a pixel belongs to an object or to the background. Using this information, we build the coefficient matrix \({C}_{t}\). \({C}_{t}\) is a cumulative matrix, where each value represents the number of frames in which the corresponding pixel belonged to the background, from frame \(0\) until frame \(t\). The matrix is obtained by cumulating individual binary matrices \({M}_{t}\), each for a given frame at time \(t\), as in (1) and (2):

$$M_{t}\left(x,y\right)=\begin{cases} 0, & p\left(x,y\right)\in \text{an object at time } t \\ 1, & \text{otherwise} \end{cases}$$
(1)

where \(\mathrm{p}\left(\mathrm{x},\mathrm{y}\right)\) is the pixel of coordinates (x, y).

$${C}_{t}={\sum }_{k=0}^{t}{M}_{k}$$
(2)

Based on this coefficient matrix, we next calculate the background model.

3.3.1 Background change detection

Since the camera is fixed, we can locate the background parts that have changed due to the objects' movement, the routes they follow and the activity spots.

The temporary background model \({B}_{t}\) is a matrix representing the temporary cumulative background, where all pixels of the frame are taken into consideration except those belonging to an object in the current frame. It is calculated as in (3):

$$B_{0}=I_{0}\cdot C_{0}$$
(3)

Then for each frame of the sequence, we update the value of the matrix as follows in (4):

$$B_{t}=\left(B_{t-1}\cdot C_{t-1}+I_{t}\cdot M_{t}\right)/C_{t}$$
(4)

The symbol (·) denotes element-wise (point-to-point) multiplication and the symbol (/) element-wise division, except where the denominator is zero, in which case the result is set to zero. \({\mathrm{I}}_{\mathrm{t}}\) is the frame at time \(t\).

For a video sequence \(S= \left\{{I}_{t}\,|\,t=0..n\right\}\), \({B}_{n}\) represents the background model: it is simply the average image without the moving objects.

In reality, more than one background model may be needed, especially for long scenes. This is due to possible permanent changes in the background, light changes or the displacement of inert objects. Therefore, the algorithm generates and uses a new background model every time a large percentage of pixels significantly change their values for a relatively long period.
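The NumPy sketch below shows how Eqs. (1)–(4) can be evaluated incrementally; the function and variable names are ours, and the binary masks are assumed to come from the segmentation stage of Sect. 3.1.

```python
import numpy as np

def background_model(frames, masks):
    """Incrementally compute C_t (Eq. 2) and B_t (Eqs. 3-4).

    frames: iterable of greyscale frames I_t as 2-D float arrays.
    masks:  iterable of binary matrices M_t (Eq. 1): 1 for background pixels,
            0 for pixels belonging to an object at time t.
    Returns the final background model B_n and coefficient matrix C_n.
    """
    frames, masks = iter(frames), iter(masks)
    I0, M0 = next(frames), next(masks)
    C = M0.astype(np.float64)                 # C_0 = M_0
    B = I0 * C                                # Eq. (3): B_0 = I_0 . C_0
    for I, M in zip(frames, masks):
        C_prev = C
        C = C_prev + M                        # Eq. (2), cumulated frame by frame
        num = B * C_prev + I * M              # numerator of Eq. (4)
        B = np.divide(num, C, out=np.zeros_like(num), where=C > 0)  # 0 where C is null
    return B, C
```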

3.3.2 Routes and activity spots plan

In the coefficient matrix \({C}_{t}\), low values correspond to the paths of the moving objects. To extract the routes and the activity map, we first normalise \({C}_{n}\): \({C}_{n}={C}_{n}/n\). Then, we apply to \({C}_{n}\) a simple image processing method in three steps:

  1. Pixel quantification: the pixel intensity interval is reduced to four values representing, respectively, no activity, marginal areas, regular routes and hot spots (highly frequented locations).

  2. Morphological filtering: we perform opening and closing to smooth the map.

  3. Background addition: the resulting matrix is added to the background model \({B}_{n}\).

We obtain a scene model which can be used for detecting anomalies and in the description phase. Figure 2 shows a scene model example; a code sketch of these three steps follows the figure caption.

Fig. 2 Scene model (right) obtained by adding the processed coefficient matrix to the background model. Marginal areas (A), routes (B) and activity hot spots (C) are shown in different colours. Scene "LeftBox" from the CAVIAR dataset (2004)
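As an illustration, the three steps above could be implemented as follows; the quantification thresholds, kernel size and blending weights are assumptions, not the values used in our experiments.

```python
import numpy as np
import cv2

def scene_model(C_n, B_n, n_frames):
    """Build the scene model of Sect. 3.3.2 from C_n and the background model B_n."""
    C = C_n.astype(np.float64) / n_frames                # normalise C_n by the frame count
    # 1. Pixel quantification into four levels; low coefficients correspond to
    #    pixels frequently covered by objects, i.e. routes and hot spots.
    levels = np.digitize(C, bins=[0.25, 0.5, 0.75])      # 0 = hot spot ... 3 = no activity
    activity = ((3 - levels) * 85).astype(np.uint8)      # rescale to 0..255 for display
    # 2. Morphological opening then closing to smooth the map.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    activity = cv2.morphologyEx(activity, cv2.MORPH_OPEN, kernel)
    activity = cv2.morphologyEx(activity, cv2.MORPH_CLOSE, kernel)
    # 3. Add the processed matrix to the background model.
    return cv2.addWeighted(B_n.astype(np.uint8), 0.7, activity, 0.3, 0)
```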

3.4 Feature extraction

The choice of the features to extract from the video scenes impacts the efficiency and accuracy of the method.

Five types of features were extracted:

  1. Object spatial features: mainly the dimensions (width, height, surface, perimeter), position, shape (bounding box, intensity, RGB, Hu moments [21]) and type (deformable/non-deformable) of each object.

  2. Object temporal features: variations of the spatial features between frames, including displacement-related quantities such as distance, speed and angle.

  3. Inter-object features: the differences of spatial features between the two objects.

  4. Inter-frame features: for most of the above features \(f\), we extract the first derivative \(f^{\prime}\) and the second derivative \(f^{\prime\prime}\). Then, for each \(f\), \(f^{\prime}\) and \(f^{\prime\prime}\), we compute seven global inter-frame features over a fixed-size window: the minimum, the first, the last, the middle, the average, the median and the standard deviation, all normalised by the maximum value.

  5. Trajectory features: we consider three trajectories, one for each object centroid and one for their middle point, to which we apply three smoothing filters: average (5), first-order (6) and second-order prediction (7), according to a Taylor expansion.

    $${x}_{p}(t+1)=\left(x(t)+x(t+2)\right)/2$$
    (5)
    $$ x_{p} \left( {t + 1} \right) = x\left( t \right) + x^{\prime}\left( t \right) $$
    (6)
    $$x_{p} \left( {t + 1} \right) = x\left( t \right) + x^{\prime}\left( t \right) + \left( {x^{\prime\prime}\left( t \right)} \right)/2$$
    (7)

For each of the nine resulting trajectories (three trajectories, each smoothed by the three filters), we calculate two features: (a) the standard deviation of the distance between the filtered position and the observed centroid position, and (b) the largest such distance. These values characterise the smoothness of the trajectories; a sketch is given below.
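The sketch below illustrates items 4 and 5 of the list above: the seven window statistics for a scalar feature series, and the trajectory-smoothness features derived from Eqs. (5)–(7). Backward differences stand in for the derivatives, which is an assumption of this sketch.

```python
import numpy as np

def window_stats(f):
    """Seven global inter-frame features of a feature series f over one window,
    normalised by the maximum value (Sect. 3.4, item 4)."""
    f = np.asarray(f, dtype=np.float64)
    m = np.max(np.abs(f)) or 1.0                      # normalisation, guarded against zero
    return np.array([f.min(), f[0], f[-1], f[len(f) // 2],
                     f.mean(), np.median(f), f.std()]) / m

def trajectory_smoothness(traj):
    """Smoothness features of one trajectory (Sect. 3.4, item 5)."""
    x = np.asarray(traj, dtype=np.float64)            # shape (T, 2): one centroid per frame
    v = x[1:] - x[:-1]                                # backward difference for x'(t)
    a = v[1:] - v[:-1]                                # backward difference for x''(t)
    preds = {
        "average":      ((x[:-2] + x[2:]) / 2,           x[1:-1]),  # Eq. (5)
        "first_order":  (x[1:-1] + v[:-1],               x[2:]),    # Eq. (6): x(t) + x'(t)
        "second_order": (x[2:-1] + v[1:-1] + a[:-1] / 2, x[3:]),    # Eq. (7)
    }
    feats = {}
    for name, (predicted, observed) in preds.items():
        d = np.linalg.norm(predicted - observed, axis=1)
        feats[name] = (d.std(), d.max())              # (a) standard deviation, (b) largest distance
    return feats
```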

3.5 Scene classification

In our ontology [1], we proposed classifying the video scenes into 15 types according to the number of objects before and after any interaction combined with other factors like background changes and feature changes. We mainly monitor the changes of two characteristics, the Hu moments and the surface of the object. Table 2 summarises the expected evolutions of those features in different cases. We show in Fig. 3 an example of scene type 7, corresponding to an object left in the scene; the scene is from the dataset VISOR [22].

Table 2 Scene classification into 15 types, according to number of objects before and after the action, to the background changes and to objects’ features
Fig. 3 (left) A scene where a person leaves a bag and another person later comes and takes it. (right) Comparison between the frames and the corresponding temporary background model: the X axis is the frame number, the Y axis is the percentage of change between each frame and the temporary background model. The graph indicates critical background changes near frames 1500 and 1700, the moments of bag deposit and retrieval. VISOR dataset, 2017

3.6 Interaction classification

An essential task for the safety observers of public places is to identify irregular actions. Police officers want to observe any unusual interaction between humans or vehicles. An interaction between two objects is called distant or remote if it does not involve physical contact. Most of the physical interactions are preceded by remote ones. Note, as an example, that people shout before they start fighting. Hence, it is crucial to detect the moment a distant interaction starts.

Furthermore, for the benefit of law enforcement, the aggressiveness of the interaction must also be identified. We accordingly add two more subclasses: aggressive and peaceful interaction.

Interaction classification is accomplished using three different multi-layered DNNs. First, we apply a binary interaction classification over fixed-size windows of the scenes to detect the existence of an interaction. Next, scenes containing an interaction undergo two more in-depth analyses, adding the sub-classes distant/physical and aggressive/peaceful. As post-processing, each window's classification result is smoothed for temporal consistency [23] in order to label the complete scene.
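As a simple illustration of this temporal-consistency post-processing, a sliding majority vote over the per-window labels could look as follows; the actual smoothing follows [23], and the window length k is a free parameter.

```python
import numpy as np

def smooth_labels(window_labels, k=5):
    """Smooth per-window class labels by a sliding majority vote before labelling the scene."""
    labels = np.asarray(window_labels)
    half = k // 2
    padded = np.pad(labels, half, mode="edge")        # replicate the border labels
    smoothed = []
    for i in range(len(labels)):
        window = padded[i:i + k]
        vals, counts = np.unique(window, return_counts=True)
        smoothed.append(vals[np.argmax(counts)])      # majority label in the window
    return np.array(smoothed)
```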

3.7 Scene analysis and description

Our description schema is inspired by real incident cases. Incident police reports should contain the five Ws: who, what, where, when and why. For the sake of objectivity, we replace the why with how. These key points frame our textual description.

At relevant moments in the scene, two kinds of descriptions are generated. The first concerns each moving object's initial state separately: position, deformability, speed and movement direction. The second focuses on the interaction between the two objects: it reports the type of interaction, its aggressiveness and its influence on each object's after-state.

3.7.1 Scene key moments

To choose the key moments suitable for description, we rely on irregularities in the objects' characteristics or in the interaction. By irregularities, we mean sudden changes, local maxima or minima, and so forth. Several characteristics determine the behaviour of an object:

  1. Object characteristics: deformability, shape (invariant Hu moments, surface), relative position with reference to labelled areas or an area of interest (see Fig. 4), and displacement-related features (speed, direction).

  2. Inter-object characteristics: the distance between the objects, and their relative position or direction.

  3. Interaction features: existence, distance and aggressiveness.

Fig. 4 An example of a key frame, over which we show the eight directions (in red and green) and the sixteen areas. The trajectories are also visible (object 1 in red, object 2 in green). Scene "LeftBox", CAVIAR dataset [24]

These characteristics are represented using graph-based pattern discovery, from which we extract key moments to generate the description. An example of the keyframes for the scene "LeftBox" from the database [24] is shown in Fig. 4.

Key moments are defined according to essential variations in the extracted features. The importance of these variations differs from one scene type to another and from one user's needs to another. To keep our system generic, control over the description density is therefore given to the user as a hyper-parameter, and thresholds can likewise be set as hyper-parameters to control the amount of generated data. Triggered by the key moments, an activity matrix is generated.
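A minimal sketch of how key moments could be extracted from a single characteristic series is given below; the jump threshold plays the role of the user-controlled hyper-parameter, and the irregularity tests used in our system are richer than this.

```python
import numpy as np

def key_moments(series, jump_thr=2.0):
    """Flag frames where a characteristic changes abruptly or reaches a local extremum."""
    f = np.asarray(series, dtype=np.float64)
    d = np.abs(np.diff(f))
    sudden = np.where(d > jump_thr * (d.std() or 1.0))[0] + 1       # sudden changes
    local_max = np.where((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:]))[0] + 1
    local_min = np.where((f[1:-1] < f[:-2]) & (f[1:-1] < f[2:]))[0] + 1
    return sorted(set(sudden) | set(local_max) | set(local_min))    # candidate key frames
```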

3.7.2 Activity matrix

For each key moment \({k}_{i}\), \(i \in \left\{1,2,\ldots,m\right\}\), the corresponding characteristics are generated and filled into a vector \({V}_{i}\), see (8).

$${V}_{i}=\left\{F,O1,O2,IO,IN\right\}$$
(8)

where:

\(F\) is the frame number of the key moment \({k}_{i}\),

\(O1\) and \(O2\) are the characteristics vectors of objects 1 and 2,

\(IO\) contains the inter-object characteristics, and

\(IN\) contains the interaction feature values.

Now, consider:

$$A=\left\{{V}_{i},i=1\cdots m\right\}$$
(9)

The vectors \({V}_{i}\) form the activity matrix \(A\), Eq. (9), which stores, for the chosen key moments of the scene, a full set of feature values describing it. This matrix can be used to build sophisticated scenario models based on context-dependent thresholds.

Finally, \(A\) is mapped into textual descriptions by simply applying logical rules to fill the blanks in our structured templates.
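A possible in-memory assembly of \(A\) is sketched below; the per-frame feature dictionaries and their keys are hypothetical names, not part of our implementation.

```python
import numpy as np

def activity_matrix(key_moments, features):
    """Assemble A (Eq. 9) with one row V_i = {F, O1, O2, IO, IN} (Eq. 8) per key moment."""
    rows = []
    for k in key_moments:
        O1 = features["object1"][k]         # object 1 characteristics at frame k
        O2 = features["object2"][k]         # object 2 characteristics at frame k
        IO = features["inter_object"][k]    # inter-object characteristics
        IN = features["interaction"][k]     # interaction feature values
        rows.append(np.concatenate([[k], O1, O2, IO, IN]))
    return np.vstack(rows)
```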

3.7.3 Scene description

We introduce template models for textual description. We propose two types of templates:

  1. Object templates: for each of the objects, three templates were introduced. In these templates, words between quotes denote values, clauses between curly brackets surround options and square brackets indicate optional elements. At the key moment where an object enters or exits the scene, the description follows two slightly different templates, called the "object entrance template" and the "object exit template". In both cases, we use an absolute description (e.g., big) instead of a comparative description (e.g., bigger) of the current moment \({k}_{i}\) relative to the previous one \({k}_{i-1}\). At each key moment of an object other than the entrance and the exit, the "object template" shown in figure a is used.

figure a

  2. Inter-object templates: two templates were introduced.

    a. "Inter-objects entrance template": when the second object enters the scene, the description is activated at this key moment and follows figure b.

figure b

    b. "Inter-objects general template": at each of the other key moments, the description follows figure c.

figure c

In order to establish the mapping between the activity matrix \(A\) and the templates, we use logical threshold-based rules [1]. A sample rule, considering the relative position and direction of the two objects, is presented below; two conditions are applied.

For object 1, at a key moment \({k}_{i}\) in a given scene, let \({\alpha}\) be the angle between the direction vector of object 1 and the vector formed by the centroid of object 1 and the centroid of object 2, as in (10):

$${\alpha}=\widehat{\left(\overrightarrow{\left({c}_{1,i-1},{c}_{1,i}\right)},\overrightarrow{\left({c}_{1,i},{c}_{2,i}\right)}\right)}$$
(10)

Then, if \(\left|\alpha \right|\le 45^{\circ}\), we add {toward the object 2} to the description; if \(\left|\alpha \right|\ge 135^{\circ}\), we add {away from the object 2}.
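This rule translates directly into code; a possible sketch (with hypothetical centroid arguments) is:

```python
import numpy as np

def relative_motion_phrase(c1_prev, c1, c2):
    """Sample mapping rule of Eq. (10): decide whether object 1 moves toward or
    away from object 2 at key moment k_i (centroids given as (x, y) arrays)."""
    motion = np.asarray(c1) - np.asarray(c1_prev)        # direction vector of object 1
    to_obj2 = np.asarray(c2) - np.asarray(c1)            # vector from object 1 to object 2
    cos_a = np.dot(motion, to_obj2) / (np.linalg.norm(motion) * np.linalg.norm(to_obj2) + 1e-9)
    alpha = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    if alpha <= 45:
        return "toward the object 2"
    if alpha >= 135:
        return "away from the object 2"
    return ""                                            # no phrase added otherwise
```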

Finally, the system generates two kinds of descriptions: a full one and a short one. The full description covers every key moment and includes all the feature values. The short description contains only the features associated with the most important key moments, those showing irregularities. Tables 7 and 8 show an example of describing the scene "LeftBox" from the CAVIAR dataset [24]: the activity matrix is given in Table 7 and the full description in Table 8.

“At frame 392: Object 2: Deformable object 2 moves, in F spot, on the right middle of the inside area of the camera field of view, heading immediately down right, toward the object 1, no big change occurring respectively on its shape and big changes occurring respectively on its surface having now bigger one, and having respectively slight increasing of its speed.

Object 1 & Object 2: The two objects are respectively receding; a physical peaceful interaction occurs between them”.

The above is an example of the description our system generates for a single key moment.

4 Experiments and results

Our proposed method consists of several stages: conceptual design, modelling, and learning for classification. We carried out experiments to evaluate the learning stage of the process.

4.1 Dataset selection, preparation and pre-processing

There is no dataset dedicated to two-object interaction in surveillance video. We examined many available general video surveillance datasets, some of which are mentioned in Table 3. From these datasets, we extracted 323 scenes suitable for our experiments and annotated them manually, obtaining a total of 1903 seconds of video. A small number of scenes were later discarded due to tracking failure. We pre-processed this crafted dataset by cutting it into fixed-size windows of 25 frames each. We obtained 6029 windows, of which only 2208 contained an interaction; among the interaction windows, 5962 were labelled distant interactions and 67 physical ones. Then, we extracted the five types of features discussed earlier, which gave us 2498 features.

Table 3 Used datasets

In the data pre-processing phase, we augmented the dataset by adding artificial scenes, namely temporally reversed footage. We also duplicated the positive cases to balance the negative and positive inputs for the three classifications. The three classifiers were trained and tested on the dataset described in Table 4.
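A sketch of this augmentation and balancing step is given below, assuming the windows are stored as an (n, 25, n_features) array; the duplication factor computed from the class ratio is our assumption here, not necessarily the exact balancing used in Table 4.

```python
import numpy as np

def augment_and_balance(windows, labels):
    """Reverse the footage of each window and oversample the positive class."""
    X = np.concatenate([windows, windows[:, ::-1, :]])      # add temporally reversed scenes
    y = np.concatenate([labels, labels])                    # reversed windows keep their label
    pos, neg = X[y == 1], X[y == 0]
    reps = max(1, int(round(len(neg) / max(1, len(pos))))) # duplication factor for positives
    X_bal = np.concatenate([neg, np.tile(pos, (reps, 1, 1))])
    y_bal = np.concatenate([np.zeros(len(neg)), np.ones(reps * len(pos))])
    return X_bal, y_bal
```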

Table 4 Balanced dataset input characteristics

4.2 Classification training and results

To implement our classification models, after many tests with classical ML and NN algorithms, we chose multi-layered DNNs: feedforward fully connected networks, called pattern recognition networks [30] in the MATLAB and Simulink environment, trained by error backpropagation. Several tests were made to achieve the desired outputs, and for each of the three classifications the DNN model showing the best results was selected. Table 5 lists the parameters used for the three classification DNNs and Table 6 shows the results of the three classifications.
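Our models were trained as MATLAB pattern recognition networks; for readers who prefer Python, an equivalent feedforward classifier could be sketched as follows. The hidden-layer sizes and the train/test split are illustrative assumptions, not the parameters of Table 5.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def train_interaction_classifier(X, y):
    """X: (n_windows, 2498) feature matrix; y: labels of one of the three classifications."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
    clf = make_pipeline(
        StandardScaler(),                          # the extracted features have very different scales
        MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                      solver="adam", max_iter=300),
    )
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))
    return clf
```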

Table 5 Used parameters for the 3 classification DNNs
Table 6 Results for the three classification DNN algorithms
Table 7 Activity matrix A for the scene "LeftBox" [24]. The red colour marks irregular characteristic values. Object 1 shows key moments due to its direction near frames 101, 110, 185 and 230, and due to its speed near frame 160. Object 2 shows key activities due to its direction near frame 185, its speed near frames 145, 160 and 230, and its shape near frame 230. The interaction between the two objects begins near frame 145 and ends near frame 190

5 Discussion

In our approach, we faced many challenges; however, important outcomes were delivered.

While the size of the dataset we crafted remains insufficient for a sharp evaluation benchmark, the absence of a specific annotated dataset for two-object interaction confirms that the problem is real and still unresolved.

The segmentation and tracking algorithm suffered from:

  1. A major issue: the classical occlusion problem when two moving objects are physically close. Using the Kalman filter to estimate locations did not work well for moving occluded objects: the occluded object's location and bounding box were estimated for a number of frames while the foreground object was mis-detected. Consequently, the physical interaction in a scene was detected for only a few frames, and analysing and describing the interaction had no further effect until the two objects separated.

  2. A marginal issue: false segmentation of an object moving against a complex background. This can trigger a description declaring a significant change in the object's Hu moments or surface; it can be overcome at the thresholding level.

A simple solution could be to replace this algorithm. Given the recent good results of deep learning object detectors such as YOLOv5 and Faster R-CNN, a good plan would be to test these algorithms; if one delivers better results and satisfies all the conditions, it can replace the selected tracking and segmentation algorithm in our overall approach.

On the other hand, we claim that our VSSD approach differs positively on the following points. The input features are dedicated to the interaction classification process, whereas many other methods do not extract appropriate features from the videos. VSSD takes into consideration the diversity of scenes and does not impose restrictions on the scene context.

VSSD implements the original classification idea of distant versus physical interaction, while other works focus only on the physical one. Detecting distant aggressive interaction can alert the observers, in a surveillance control room, at early stages, giving them precious time to act.

The object features are expressed and stored, forming a higher-level semantic set of metadata. Such annotation is immensely helpful to the end user when querying the archives for an incident with a specific description. We tried to cover, with these features, most of the queries used in practice to search the archives for an incident.

Finally, the description alert flags depend on a set of hyper-parameters whose values are necessarily context-dependent; this leaves the main control in the user's hands. The grammar is expandable, so the system can be considered a toolset. The new structured templates contain the primary information reported by the police in real incident descriptions. Consequently, the textual descriptions can be generated automatically as draft reports.

Table 8 The full description of the scene "LeftBox" [24], the result of mapping matrix A into our proposed templates. At each key moment, corresponding to one of the irregularities mentioned in the caption of Table 7 and marked in red, the full state of the objects and of the interaction is described. Notice that near frame 145 the two objects start a distant peaceful interaction, which ends near frame 190.

6 Conclusion and perspectives

In this work, we address the fundamental problem of generating textual descriptions of the important content of video surveillance scenes. We build on our new generic, context-free ontology focused on object interactions. We point out some basic, often overlooked, practical needs and provide solutions for them.

While analysing and understanding a wide variety of video scenes, our approach introduces new concepts and highlights important features for classifying the interactions of video objects. We propose a new classification of scenes based on background changes, the number of objects and the study of specific object characteristics. We classify interactions using deep learning. We introduce a useful activity matrix that highlights appropriate key moments and serves for description based on graph pattern discovery.

Finally, we propose effective rule-based templates to structure the textual descriptions. Our templates are suited to CCTV reports for real incident description, since one of the authors is a Major heading a central CCTV control room in the Lebanese Internal Security Forces. We therefore know that, when properly implemented and used, visual surveillance systems supported by intelligent video analysis can become a very effective weapon in the hands of law enforcement agencies. We intend to train the classifiers on a real dataset, but this project is highly sensitive and requires special permissions. From a practical point of view, we can imagine generating specific scene descriptions integrating labelled contextual information.

We consider our work a leading project that can be extended by analysing deeper levels of classification. We may also add more semantic depth to the description model, for example by deciding whether a specific interaction has a "bad" or "good" overall influence.

Furthermore, to obtain better tracking and segmentation results, the current tracking and segmentation algorithm can be replaced by a more recent deep learning-based algorithm, potentially YOLOv5 or Faster R-CNN.

As a more classic extension, it would be interesting to apply our approach to more complex scenes showing interactions between more than two objects.

Finally, when applied to thousands of hours of video surveillance footage, the output data of our system form big data. This accumulated metadata can then be learned through clustering and regression to model and predict object behaviours and interactions.