Introduction

A video analytics system comprises multiple cameras installed at locations of interest, connected to edge nodes, which in turn are connected to the decision-making core, usually hosted in an on-premises data center or in the cloud [1,2,3]. The edge node is assumed to have relatively low computational power, sufficient for basic pre-processing and for forwarding the video streams to the core. The core is a Graphics Processing Unit (GPU)-based system with software capable of parallelizing computation over the GPUs, enabling the use of deep learning for computer vision [4]. This generic video analytics framework is illustrated in Fig. 1.

Fig. 1
figure 1

Generic video stream analytics framework

Video streams consist of a time-based sequence of frames. Computer vision techniques are used to process individual frames. Inferences from these individual frames are then collated over time to derive actionable inputs based on the application's processing logic. Computer vision tasks such as image classification, object detection, and segmentation are compute-intensive and need to be performed for each video frame.

According to Cisco, by 2022, 82% of all internet traffic is expected to be video, and live video usage will increase 15-fold by 2021 [5, 6]. Furthermore, video surveillance demands a steady upstream of video frames from deployment locations to the cloud, typically via the edge. A single IP camera can produce up to 300 GB of data per month, and an estimated 770 million such video surveillance cameras were in use globally in 2020 [7, 8], requiring significant network bandwidth and storage. A December 2019 report states that the US alone has around 70 million CCTV cameras [9]. There is thus an increasing need for resource-optimized video surveillance systems with reduced:

  • communication overheads and delay

  • storage requirements

After cost, data privacy concerns are the major barrier to implementing AI-based video surveillance systems [10]. According to a poll released by the American Civil Liberties Union of Massachusetts, 80% of the voters support a complete ban on facial recognition-based surveillance [11]. Three countries (Belgium, Luxembourg and Morocco) have even banned the use of facial recognition technology [12]. The draft of the “Civil society initiative for a ban on biometric mass surveillance practices”, an initiative funded by European Digital Rights (EDRi), raises serious concerns over fundamental rights of citizens, such as privacy, free speech and freedom from discrimination, being compromised by mass video surveillance [13]. Beyond hackers obtaining live streams or accessing recorded feeds from surveillance systems, individuals' privacy is also at risk from operators misusing recorded footage. There have been cases of major privacy breaches from video surveillance systems [14,15,16,17]. Recently, a group of hackers breached the live feeds of 150,000 surveillance cameras inside hospitals, organizations, police departments, prisons and schools [18]. Thus, it is imperative to preserve the privacy of individuals whenever video surveillance systems are deployed at scale. This applies to the various stakeholders (students and staff) in a campus setting too. Privacy preservation is thus emerging as a major requirement for future campus-wide video surveillance applications. Therefore, the motivation for this work is:

  • resource optimization in a campus-wide video analytics system in terms of communication overheads, compute and storage requirements; and

  • privacy preservation of the stakeholders for specific applications.

The main contributions of this paper are:

  • A multi-layered framework, Campus-wide Video Analytics Framework (CVAF), is proposed, which is resource optimized and privacy preserving while meeting the functional requirements for campus surveillance.

  • Low cost-edge devices and efficient algorithms are deployed for:

    • frame filtering

    • inference deduction

    • face obfuscation for privacy preservation

  • Five custom video-analytics applications are built and deployed for experimental evaluation of the proposed framework.

  • Server-side storage and computing is significantly reduced and some applications are entirely off-loaded to the edge.

  • Experimental results establish the viability of a low resource, privacy preserving video surveillance framework.

The rest of the paper is organized as follows: Sect. 2 discusses the existing techniques that have been applied across different relevant implementations of video surveillance systems to solve at least one of the above-stated problems. Section 3 describes the Campus-wide Video Analytics Framework (CVAF) in detail. Section 4 discusses the implementation details with the experimental results, while Sect. 5 concludes the paper with an insight into the possible directions for future work in the domain.

Related work

The fundamental decision that affects both the network bandwidth consumption and computational load on the core is whether to allow a frame to be sent to the core or to drop it. The point where this decision is taken is crucial to its impact. Thus, resource optimized video-applications typically resort to some sort of filtering at the video camera or the edge to reduce the load on the core.

Authors in [19] propose RealEdgeStream (RES) framework which filters out low-value video streams at the edge based on configurable rules while performing inference on the local cloudlets. Their approach reduces the network bandwidth consumption by allowing minimal data to be sent to the cloud (core). The system is evaluated using a simulation and thus the exact hardware requirements are not available.

Authors in [20] propose an FPGA-based smart camera design to provide efficient stream processing of the captured video on the camera itself. The FPGA-based smart cameras perform initial filtering, while the edge servers perform basic deep learning-based inference with only advanced inference being done in the cloud. Filtering on the camera reduces the network bandwidth and storage consumption. However, specialized cameras need to replace all existing cameras.

Authors in [21] describe their pilot study in which they used NVIDIA Jetson TX2 embedded computing platform in the visual sensor and all computation is made onboard. Inference is done on the sensor itself and only meta-data travels through the network which offers a privacy compliant tracking solution. They implemented the project as a pilot study with 20 visual sensors in the city of Liverpool and were able to reduce the network bandwidth requirements.

Chameleon [22] is a video-surveillance system that dynamically adjusts the video stream configurations based on video segment profiles, intending to optimize resource utilization and increase accuracy. It was able to achieve the same accuracy with only 30–50% of the resources used by traditional systems. It, however, only reduces the compute load on the GPU in systems where there is a choice between different neural networks, frame rates and image qualities. The authors note the need for reducing network bandwidth consumption and for profiling at the edge.

Authors in [23] address the issue of network bandwidth consumption. They design FilterForward, which reduces the network bandwidth consumption by transmitting only the relevant frames to the datacenter. Frames that are not relevant are discarded at the edge (early discard strategy). Thus, it allows for optimization of the feeds for multiple video analytics applications. The edge node that the paper describes, however, is a heavy-duty computer with a quad-core i7-6700K CPU and 32 GB of RAM.

In [24], the authors describe a 3-layer architecture for video stream analytics in which the edge layer performs low-level feature extraction, the fog layer performs recognition, and the cloud computing layer performs behavior analysis tasks. The architecture intends to offload the computation to the edge and fog, addressing only one of the stated objectives for a campus-wide video surveillance and analytics framework.

Table 1 Existing video-surveillance frameworks – optimization and privacy assurance evaluation

Recently, researchers have started focusing on privacy preservation in video-surveillance applications. An extensive survey of visual privacy preservation methods is presented in [25]. One of the methods they discuss extensively is redaction, in which regions of interest, for example a face or an identity-card number, are detected and blurred, replaced, removed or de-identified. Redaction can be performed at the camera, video processor, database or the user interface. A system for privacy-preserving video surveillance that moves the computation for face detection, obfuscation and encryption to the camera itself is presented in [26]. Authors in [27] use a privacy mediator, a module installed in the camera which denatures the live video feeds. However, such a solution would require complete replacement of existing cameras that do not offer this feature, entailing significant costs.

REVAMP2T [28] is a pipeline for detection, re-identification, and tracking across multiple cameras without the need for storing the streaming data. Although it does not cater to multiple video analytics applications, it is an important breakthrough in the privacy-preserving tracking of individuals. It uses key features of the pedestrians at runtime, without the use of personally identifiable information, to track their movement.

Authors in [29] introduce OpenFace and RTFace. OpenFace, trained for a pool of users, detects and classifies their faces. It uses a deep neural network for feature extraction followed by support vector machines for classification. RTFace, built on top of OpenFace, is a complete pipeline to track and blur regions of interest, i.e., faces recognized by OpenFace, in real time at full frame rate. This enables privacy preservation for live video analysis. The system uses cloudlets on virtual machines on top of rack servers with Intel Xeon E5 v4 processors. The system's accuracy and speed, however, degrade as the pool of users grows.

Authors in [30] design PrivacyEye, a privacy-preserving and efficient mobile video analytics system composed of two major modules: a privacy-preserving module and an efficient video-processing module. The privacy-preserving module employs an adversarial training process to extract information useful for model prediction while preserving the privacy of the user's sensitive information. The efficient video-processing module combines a dynamic frame scheduler and optical-flow-based feature propagation to greatly reduce video processing time.

Authors in [31] propose DeepCache, which accelerates the execution of CNN models for continuous vision tasks by leveraging video temporal locality.

Table 1 summarizes the relevant frameworks and how they cater to the need of network bandwidth reduction, storage consumption reduction and privacy preservation.

Campus-wide video analytics framework

Campus-wide video surveillance systems and applications are gaining traction. From basic recording of video feeds and manual review, such applications have evolved to use AI-based interventions and automatically initiated actions. Thus, video surveillance-based systems are becoming increasingly complex, requiring much higher storage and computing power. While the cloud has been able to cater to the storage and compute requirements, response times for critical surveillance applications have required moving many of the analysis tasks to the edge. Thus, an effective campus surveillance system is resource heavy, entailing significant costs. Privacy has also emerged as a major requirement for such surveillance systems, with many countries adopting strict laws to prevent unauthorized video recording, storage and analysis to safeguard an individual’s right to privacy. This necessitates efficient privacy preservation mechanisms to be built into, or over, existing surveillance systems. Thus, it is desirable that surveillance systems for specific campuses be customized, resource efficient, and privacy preserving while delivering the required functionality with acceptable accuracy and performance. We propose the Campus-wide Video Analytics Framework (CVAF), a resource-optimized and privacy-preserving framework that moves frame filtering, inferencing and privacy preservation (face obfuscation) tasks to the edge. This reduces the network bandwidth and video storage requirements significantly while preserving the privacy of the students and the faculty. Figure 2 provides an overview of the framework.

Major design considerations for CVAF emerged from:

  • Analysis of the recorded video feeds in the existing NVR surveillance system on campus indicated that more than 70% of the recorded video streams did not contain information of interest for a given application.

  • Due to the high network bandwidth requirements of IP cameras, the campus network can become constrained, requiring network infrastructure management solutions to be deployed in addition to storage provisioning and expansion.

  • A majority of professionals in the field of video surveillance say that edge-based analytics is ideal for video surveillance, as it makes more efficient use of bandwidth and storage, since video data is large, complex and used in real time [32].

  • The European Data Protection Supervisor (EDPS), the European Union’s (EU) independent data protection authority describes 3 main data protection issues in case of video surveillance [33]:

    • Minimize gathering of irrelevant footage,

    • Notify the stakeholders, and

    • Timely and automatic deletion of footage.

Fig. 2
figure 2

Campus-wide video analytics framework (CVAF)

Fig. 3
figure 3

Frame dropping logic implemented at Layer 1 edge nodes

Fig. 4
figure 4

Operation of layer 2 edge nodes

Fig. 5
figure 5

High-level architecture of the server-side application

The CVAF framework has a 3-layer architecture in which the layer 1 edge is responsible for frame filtering, the layer 2 edge is responsible for deriving inferences and layer 3 is the server core for maintaining application-specific campus-wide analytics. The video cameras installed at different locations of interest are connected to the layer 1 edge nodes, which are light-weight compute devices (Raspberry Pi or its variants) capable of ingesting video streams, executing the frame-dropping logic and forwarding the stream to the layer 2 edge nodes. The layer 2 edge nodes are GPU-based edge devices optimized for deep-learning inference applications (NVIDIA Jetson Nano/Xavier or its variants) [34]. They are capable of processing multiple filtered streams from layer 1, performing inference using deep learning models and initiating any application-specific configured actions. The video stream is then forwarded to the in-premises server(s). NVIDIA Jetson Nano is a family of low-cost GPU devices that consume only 5 W of power and can process up to 5 video streams while performing deep learning inference [35]. These are used to perform inference and to detect and remove Personally Identifiable Information (PII) at the edge, ensuring privacy preservation. In particular, layer 2 edge devices perform feature extraction and object detection using inference-optimized (using TensorRT or ONNX) models [36, 37]. Furthermore, the layer 2 edge nodes perform inference and drop a frame if it is not required to be stored; if it is, they obfuscate PII in the frame, then encrypt and forward it. This significantly reduces the network, storage and computational load on the core. The in-premises server(s) comprising the core host all the relevant web applications, including application-specific databases and file systems for storage. In our setup, two Dell Precision T5500 and Dell Precision 5820 workstations serve as the core.
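The per-frame decision flow at a layer 2 edge node described above (infer, trigger any configured action, then either drop the frame or obfuscate, encrypt and forward it) can be sketched as follows. The module boundaries, names and callback signatures are illustrative assumptions for exposition, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Layer2Config:
    """Hypothetical wiring of the layer 2 processing stages."""
    infer: Callable[[Any], Any]          # deep-learning inference on a frame
    trigger: Callable[[Any], None]       # application-specific action trigger
    should_store: Callable[[Any], bool]  # application rule: keep this frame?
    obfuscate: Callable[[Any], Any]      # blur PII regions (e.g. faces)
    encrypt: Callable[[Any], bytes]      # encrypt before forwarding to the core

def process_frame(frame: Any, cfg: Layer2Config) -> Optional[bytes]:
    """Layer 2 pipeline: run inference, fire any configured action,
    then drop the frame or obfuscate + encrypt it for the core."""
    inference = cfg.infer(frame)
    cfg.trigger(inference)           # e.g. notify registered devices
    if not cfg.should_store(inference):
        return None                  # frame dropped at the edge
    return cfg.encrypt(cfg.obfuscate(frame))
```

In this sketch, dropping a frame at layer 2 simply means returning nothing to the forwarding stage, which is what reduces the network and storage load on the core.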

Table 2 Applications implemented on CVAF and layer-wise processing performed
Fig. 6
figure 6

Working screenshots of applications implemented on CVAF. a Parking Management, b traffic rule violation detection, c crowd detection, d teaching effectiveness analysis, and e video-based attendance management

Table 3 Layer-wise technical specification for CVAF implementation
Table 4 Number of devices used for each video application
Fig. 7
figure 7

Network bandwidth consumption for different video analytics applications with and without CVAF

Fig. 8
figure 8

Storage consumption for different video analytics applications with and without CVAF

Layer 1: Inexpensive edge filtering

Layer 1 edge nodes filter out frames that are not useful for further processing, for instance similar frames during time periods of low or no activity within the campus. Thus, the layer 1 edge nodes perform a singular task, i.e., deciding whether to allow a frame to be forwarded to the core, based on the similarity of successive frames. Layer 1 drops frames, generated from a static camera at 25 frames per second, if they are similar to the preceding frame beyond a certain threshold. We calculate the similarity between two successive frames as the Mean Squared Error (MSE) between them. There are more advanced methods available for finding the similarity between two images [38,39,40]. However, MSE is used as the similarity measure because of its efficiency in determining the similarity between successive frames 25 times per second [41]. Here, speed was given precedence over accuracy due to the non-critical nature of the deployed video applications. Furthermore, this allowed the inexpensive layer 1 edge nodes to process and filter 2–3 video streams simultaneously during peak loads. Figure 3 shows the schematic diagram of the frame-dropping logic implemented in the layer 1 edge nodes.

The frames from the cameras are received by the layer 1 edge node, temporarily stored in a buffer and processed by the frame-dropping logic, which maintains the application-specific MSE threshold. If the calculated MSE between the present and the previous frame exceeds the MSE threshold, the frame is forwarded to the layer 2 edge node; otherwise, it is dropped.
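The frame-dropping logic above can be sketched as follows. Frames are modeled as NumPy arrays (in deployment they would arrive from an OpenCV capture); the threshold value and the generator structure are illustrative assumptions, not the exact deployed code:

```python
import numpy as np

def mse(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Mean Squared Error between two equally sized frames."""
    diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
    return float(np.mean(diff ** 2))

def filter_frames(frames, threshold: float):
    """Yield only frames whose MSE w.r.t. the last forwarded frame
    exceeds the application-specific threshold; near-duplicate
    frames are dropped."""
    last_kept = None
    for frame in frames:
        if last_kept is None or mse(frame, last_kept) > threshold:
            last_kept = frame
            yield frame  # forward to the layer 2 edge node
        # otherwise the frame is dropped at layer 1
```

Note that the comparison is against the last *forwarded* frame rather than the immediately preceding one, so slow, gradual scene changes still eventually trigger a forward once the accumulated difference crosses the threshold.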

Layer 2: Inference

The layer 2 edge nodes implement the computer vision tasks, including face obfuscation for privacy preservation, and also implement the video processing logic for specific applications. They receive frames from layer 1, perform real-time inference, trigger specific actions as per the application-specific action trigger, send the inference to the server for further processing or action, and generate notifications for registered devices if required. If, based on the application, frames have to be sent to the server, all Personally Identifiable Information (PII) is obfuscated before each frame is forwarded. Figure 4 shows a schematic diagram of the processing at the layer 2 edge node. CVAF performs real-time face detection using a Haar cascade and applies a Gaussian blur on detected faces for face obfuscation. Although more accurate methods of face detection are available and well documented, we choose the Haar cascade for its real-time performance and low memory consumption [42,43,44,45,46].

Layer 3: Server

The Server fulfills two requirements: secure storage of video streams for multiple video analytics applications and hosting of web applications to enable the use of the inference. Figure 5 shows a high-level architecture of the server-side video analytics applications. The Data Receiver Module receives data from the Layer 2 Edge Nodes and saves the video stream to the Application-Specific Secure Storage and the inference data to the Application-Specific Database, respectively. The Analytics Engine analyzes the data and generates patterns and insights. The Web Application provides a dashboard to the Administrator. Administrators with special privileges have application-specific private key which they can use to view the obfuscated video streams through the Web App when required.

Implementation and experimental results

We implemented the proposed framework at our institutional campus. Five major video-analytics applications were deployed:

  • Parking Management System

  • Traffic Rule Violation Detection System

  • Crowd Detection System

  • Teaching Effectiveness Analysis System

  • Video-based Attendance System

as a pilot program. These applications use real-time feeds from IP cameras over the campus network. Cameras installed inside buildings are connected to layer 1 edge nodes through Wi-Fi, while cameras installed outside the buildings are connected to layer 1 edge nodes through a Power over Ethernet (PoE) LAN. The action trigger is specific to the application and decides the action to be triggered when a particular inference is made. The server receives inference data and video streams from the layer 2 edge nodes and stores the video streams; pre-configured triggers initiate actions/notifications as required. The servers comprising the core also maintain the application-specific and campus-wide analytics for all the deployed video applications. Table 2 describes the applications implemented on CVAF and the layer-by-layer processing performed, while Fig. 6 shows working screenshots of these applications.

Table 5 Time taken per frame for different operations at Layer 1 and Layer 2 edge nodes
Table 6 Improvement in the average response time of different applications during peak and non-peak hours
Fig. 9
figure 9

Average response times for different applications deployed with and without CVAF over 24 h. a Parking management, system, b traffic rule violation detection system, c crowd detection system, d teaching effectiveness analysis system, and e face-recognition-based attendance management system

Fig. 10
figure 10

Overall effectiveness of CVAF over 24 h

Fig. 11
figure 11

User responses on installation of teaching effectiveness analysis system with and without CVAF

Experimental setup

For the evaluation of the effectiveness of CVAF, 54 IP cameras were deployed on campus, transmitting frames at 25 FPS. We measure the network bandwidth and storage consumption for the applications implemented with and without CVAF. For layer 1 edge nodes, we use the Raspberry Pi 3 Model B+ [52]; for layer 2 edge nodes, we use the NVIDIA Jetson Nano; and for the servers at the core, we use the Dell Precision workstations. Table 3 shows the specifications of layer 1, layer 2 and the server. Table 4 shows the number of cameras and layer 1 and 2 nodes used.

Results

We evaluated CVAF on the following parameters:

  • Overall network bandwidth usage by the video analytics applications in a 24-h window with and without CVAF optimizations (Fig. 7).

  • Overall storage consumption for video analytics applications with and without CVAF optimizations (Fig. 8).

  • Percentage delay caused due to CVAF and the percentage improvement in the response time of the applications (Table 6).

Figure 7 shows a radar chart of the combined network bandwidth consumption for the deployed video analytics applications with and without CVAF optimizations (filtering, inferencing and obfuscation turned on or off). CVAF reduces the network bandwidth usage by 27% even during peak usage hours and by 70% during non-peak usage hours. A majority of the optimization is attained by layer 1 frame dropping during periods of low activity. During peak hours, the percentage of frames dropped at layer 1 decreases, while post-processing filtering and inferencing at layer 2 become more effective.

Figure 8 shows a radar chart of the storage consumption for the deployed video analytics applications with and without CVAF optimizations. CVAF reduces the storage consumption by 40% during peak usage hours and by 75% during non-peak usage hours, primarily due to duplicate frame dropping at Layer 1 and post-processing frame dropping at Layer 2. If video streaming is enabled at the server side, then the layer 2 frame dropping is disabled and frames are forwarded to the server, reducing the effectiveness of the filtering.

Table 5 shows the time taken by different operations at the layer 1 and layer 2 edge nodes. Layer 1 duplicate-frame determination and dropping introduces a 6% delay in overall latency and layer 2 introduces an 11% delay, bringing the total overhead due to CVAF to around 17%. However, the lower load on the network and quicker inference at the edge improve the response time of the applications by nearly 34%.

The average response time of a given application is the average time taken between the receipt of the frame at layer-1 edge and the final response generated at either layer-2 or server core depending upon the video application. The average response time decreases considerably when the application is deployed with CVAF. Table 6 shows the improvement in the average response time of different applications during the peak and non-peak hours.

The Parking Management System performs best when deployed on CVAF as its peak time of operation is highly concentrated, i.e., mostly during the morning and evening hours. The improvement in average response time is attributed to the frame-dropping logic on the layer 1 edge nodes, which filters frames not containing substantial information, and to the early inference on the layer 2 edge nodes. Figure 9 shows the variation in the average response time for different applications on an hourly basis. The overall effectiveness of CVAF is defined as follows:

$$E_w = \frac{1}{2}\left(\frac{N_{traditional,w}- N_{CVAF,w}}{N_{traditional,w}} + \frac{S_{traditional,w}- S_{CVAF,w}}{S_{traditional,w}}\right)$$

where

  • w is the time window taken for calculating the effectiveness

  • \(E_w\) is the effectiveness of CVAF in a time window w

  • \(N_{traditional,w}\) and \(N_{CVAF,w}\) are the network bandwidth consumption in the traditional system and the CVAF system, respectively, in a time window w

  • \(S_{traditional,w}\) and \(S_{CVAF,w}\) are the storage consumption in the traditional system and the CVAF system, respectively, in a time window w
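The effectiveness metric can be computed directly from measured bandwidth and storage figures for a window; the function name and sample values below are illustrative, not figures from the deployment:

```python
def cvaf_effectiveness(n_trad: float, n_cvaf: float,
                       s_trad: float, s_cvaf: float) -> float:
    """Overall effectiveness E_w: the mean of the fractional
    network-bandwidth saving and the fractional storage saving
    in a time window w."""
    net_saving = (n_trad - n_cvaf) / n_trad
    storage_saving = (s_trad - s_cvaf) / s_trad
    return (net_saving + storage_saving) / 2
```

For example, a window in which CVAF uses 30% of the traditional bandwidth and 25% of the traditional storage yields an effectiveness of (0.70 + 0.75) / 2 = 0.725, in the range the night-time measurements fall into.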

Figure 10 shows the hourly variation of the overall effectiveness of CVAF. CVAF achieves an effectiveness in the 0.75–0.80 range during late evening/night, when there is little or no human movement on the campus. The effectiveness decreases sharply during peak hours, in the morning when people and vehicles are entering the campus and in the evening when they are leaving.

We also surveyed 200 students and faculty members in the institution regarding their acceptance of the Teaching Effectiveness Analysis System. The pie chart in Fig. 11a shows that a majority of the students and faculty members did not want cameras to be installed in the classrooms due to privacy concerns. However, when privacy preservation and a recording-less model were factored in, the respondents agreed to having the system installed in the classroom.

Conclusions

We present a low-cost, resource-optimized and privacy-preserving Campus-wide Video Analytics Framework (CVAF), which is novel and caters to multiple video applications with varying requirements. Privacy preservation is expected to emerge as a major requirement for large-scale video surveillance applications in the near future. Our framework attains acceptable performance using off-the-shelf hardware components. Two major optimizations are duplicate frame dropping, to reduce network bandwidth and storage requirements, and fast face detection and obfuscation, for privacy preservation. Deploying light-weight pre-trained AI models on GPU-based edge devices greatly speeds up video processing and reduces latency significantly. A major implication of this research is that standard off-the-shelf hardware, open-source software and existing network infrastructure can be used to build a minimum viable and scalable framework that optimizes resource utilization and preserves the privacy of the users. Future work shall focus on the security aspects of the video surveillance framework so that it can work seamlessly in open environments such as smart cities. Ensuring performance, security and privacy at scale shall be the next big challenge for video analytics frameworks.