What Does It Take?

Goal: Define and dissect the data pipeline with a focus on key criteria, and enable the reader to understand how this pipeline maps to Intel hardware and software blocks. With this knowledge, the reader will also be equipped to optimize the partitioning and to future-proof the system with security and manageability.

  • Key considerations and elements in architecting a data pipeline

  • Basic tasks of an E2E IMSS pipeline – Capture, Storage, and Display

  • Evolution of IMSS Systems – Analog to Digital to Connected to Intelligent

  • Sensing the World – Video and Beyond

  • Making Sense of the World – Algorithms, Neural Networks, and Metadata

  • Architecting IMSS Systems – IP Cameras, Network Video Recorders (NVRs), and Accelerators

The chapter will start by defining the purpose of a data pipeline, the key elements of the pipeline and the key criteria for specifying a data pipeline. The next section will describe the types of data comprising a data pipeline, key characteristics of the data types, and the relationship of the data types to each other. The final section will apply the data pipeline concept to the fundamental system architectures commonly used in the IMSS space and demonstrate how IMSS systems are evolving. This builds the foundation for the subsequent chapter describing the attack models on IMSS systems and how to mitigate the attacks.

IMSS Data Pipeline Terminology

  • Pipeline – A complete description of the data elements, processing steps, and resources required to implement an IMSS system

  • CCTV (Closed-Circuit Television) – An analog security system built upon modifications of television standards

  • CIF, QCIF ((Quarter) Common Interchange Format) – Early video standards for analog systems: CIF = 352x288 pixels, QCIF = 176x144 pixels

  • D1 (Television Standard) – NTSC broadcast and DVD television standard of 720x480 pixels at 29.97 frames per second

  • IMSS (Digital Safety and Security) – A system whose primary purpose is to enhance the safety and security of information, assets, and/or personnel

  • Video Frame – A two-dimensional array of pixels. Frames are typically described in terms of rows (lines) x columns (width); hence a 1080x1920 frame has 1080 rows and 1920 columns of pixels. Frames may be either progressive (every line is captured in each frame) or interlaced (every other line is captured in each frame, requiring at least two frames to sample every line).

  • FPS (Frames Per Second) – The temporal sampling rate of a frame-capture-oriented video stream, that is, the number of frames captured per second

  • ML (Machine Learning) – Any of several techniques whereby machines extract information by observing and then analyzing/processing real-world data

  • CNN (Convolutional Neural Networks) – A specific branch of Machine Learning based on algorithms relying on convolution, that is, multiplication and accumulation, as the key operation

  • Video – A time sequence of image data, typically image frames captured at rates sufficient for humans to perceive motion. As used in this context, color images captured from visible-light sensors

  • IP (Intellectual Property) – Any knowledge regarding methods, data, processes, or other non-tangible items that has value and is owned by a legal entity. IP often carries enforceable rights for use, or for prevention of use by others.

  • Metadata – Information inferred from sensed data or other data sets

  • Pixel (Picture Element) – A single element of a picture describing the video at a single spatial location. Pixels are arranged in arrays of rows and columns to form a video frame. A pixel may have a single value (grey scale) for monochrome frames or three values (for example, RGB) for color frames.

  • PII (Personally Identifiable Information) – The most sensitive type of metadata, often subject to contractual, regulatory, and/or legal controls

Defining the Data Pipeline – Key Concepts

The fundamental purpose of an IMSS system is to sense information in the physical world and allow appropriate actions to be taken based on that information. Accomplishing this purpose is the role of the data pipeline. A typical IMSS system will span several physical locations and several processing steps. Consequently, the data pipeline will also span several physical locations and several logical processing steps or tasks operating on the sensed data.

Defining a data pipeline starts with understanding its key elements. The elements can be classified into broad categories – the sensed data, the algorithms that transform data into information (metadata), the decisions, D, that are taken based on the metadata, and the actions resulting from those decisions. Definition of the data pipeline can then be approached as a set of interrelated questions about sensed data, algorithms, metadata, decisions, and actions. This fundamental data pipeline is shown schematically in Figure 3-1.

Figure 3-1
A data pipeline: sensed data flows into algorithms (where planning and operations are performed), producing information (metadata) that feeds a decision with outcomes Action 1 (Conf A), Action 2 (Conf B), or No Action.

Fundamental data pipeline features

  • What are the desired actions to be taken and at what confidence level?

  • What information is needed to make the decisions that will lead to actions being taken?

  • What is the sensed data that needs to be gathered from a scene to create that information?

  • What are the algorithms required to transform sensed data to information and information into decisions?

A key aspect of the data pipeline, Figure 3-1, is that its definition should proceed in the opposite direction of the final data flow. In the example shown, the starting point is the set of desired actions and the confidence level at which each action should be taken, with the null case of “No Action” clearly understood. With this framework in mind, let us look at each of the key questions in more depth.

Desired Actions and Outcomes

The most critical step in defining a data pipeline is understanding what actions will be taken as a result of the system. The more specifically and narrowly the desired actions are defined, the more precisely the data pipeline can be specified. Ideally, a system has a single, well-defined action or set of actions as the desired outcome. These actions may range from issuing a ticket for a traffic violation, to triggering an alarm on intrusion into a restricted area, to identifying personnel authorized for a particular activity. By far the most common action is to take no action at all. It is critical to understand the array of real-world events that will result in no action being taken, especially as these are usually far more numerous than the events that will trigger an action. Failure to do so is the leading cause of false negatives, that is, failures to act when action should have been taken.

If you don’t know where you are going, then any road will do.

It is common for a single system to have multiple possible actions as outcomes. Based on the same database of real-world information, that database may drive multiple actions (or non-actions). For example, a camera viewing a street scene in an urban area may capture information on the number of vehicles, the presence of pedestrians, and the state of the roadway. A traffic bureau may use the data in real time to advise commuters on traffic routes; a developer may use demographic data collected over several weeks to determine whether a store of a particular type should be opened in the area; and the city repair department may use the information to monitor infrastructure status and schedule maintenance. These all have vastly different goals, but each can be defined in terms of several key criteria to determine the overall system needs.

Accuracy. The single criterion most influencing data pipeline design is accuracy, or the ability to reliably take the correct action. In principle, perfect accuracy would require infinite information, driven by the substantial number of unusual or outlier events. While it may seem paradoxical that there are many unusual events, the underlying observation is that observing, and hence comprehending, exceedingly rare events requires enormous amounts of observational data. Again, in principle, to observe an event that occurs once in 500 years will require about 500 years of data!

The accuracy specified should be related to the consequence of error, that is, of taking the incorrect action. The consequences of taking an incorrect action can be further segregated into false positives and false negatives. The consequences of these types of errors are very application dependent, and it is often possible to tune the system to be more tolerant of one type of error than the other.

Accuracy requirements will strongly impact system specifications for the data resolution at which the scene is sampled, the computational complexity requirements of the algorithms to analyze the video, and storage requirements, among others.
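To make these trade-offs concrete, the short sketch below (Python, with invented event counts) shows how false positives and false negatives roll up into the accuracy figures a pipeline specification might quote; the numbers are purely illustrative.

```python
# Illustrative only: how false positives/negatives become accuracy metrics.
def pipeline_accuracy(true_pos, false_pos, true_neg, false_neg):
    """Return common accuracy metrics computed from a confusion matrix."""
    total = true_pos + false_pos + true_neg + false_neg
    return {
        "accuracy": (true_pos + true_neg) / total,        # overall correctness
        "precision": true_pos / (true_pos + false_pos),   # penalizes false alarms
        "recall": true_pos / (true_pos + false_neg),      # penalizes missed events
    }

# Hypothetical day of events for a traffic-violation system; tuning the system
# shifts errors between false positives and false negatives.
print(pipeline_accuracy(true_pos=95, false_pos=20, true_neg=9870, false_neg=15))
```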

Frequency of Actions – Throughput. Another critical factor is determining how often an action needs to be taken. The frequency of taking an action may be either event driven (i.e., act on detection of event A) or periodic (take an action every hour). If the action is event driven, it may be necessary to specify the minimum time between successive events in which the system is expected to act, or the time interval within which multiple events must occur (send a ticket only if the light is red and the vehicle is in the intersection).

The frequency at which actions must be taken, the throughput, will drive the sample rate of the scene and hence the total amount of data collected. The throughput requirements of the system drive many system parameters such as the total storage required as well as communication bandwidth for local and wide area links.

Latency. A third critical factor in specifying the system goals for acting is the latency from the time the scene is sampled to the time an action is potentially taken. In the urban street scene example, the time span required to act may range from a few seconds to several months.

A system may also have multiple latencies associated with the entire architecture. There may be a latency defined for sensed data capture, a second latency for analytics, and a third for the action to be taken. The data capture latency may be determined by the characteristics of the object being sensed – how fast it is moving, how long a traffic light is on, etc. The latency for analytics may be determined by the algorithmic complexity and the computational power available. The latency for action may be determined by the mechanical constraints of an actuator on a factory floor, or the reaction time of a human operator.

At the most abstract level, a data pipeline can be thought of as a balance of the factors discussed earlier. A simple example is shown in Figure 3-2 comparing the specifications of two systems on a limited number of attributes. (Note that latency is graphed as 1/Latency such that shorter latencies have higher values). In this case, two actions are considered, each with an associated Accuracy and Confidence Level. A single value for throughput and latency is specified for the systems. In the example shown, System 1 has been optimized for greater accuracy at the sacrifice of throughput and latency relative to System 2. Neither system is “better” than the other; each will be appropriate for a given application.

Figure 3-2
A radar chart comparing two systems across six specifications: System 1 scores higher on Accuracy 1, Accuracy 2, and Conf A; System 2 scores higher on 1/Latency, Throughput, and Conf B.

Example system specifications

At this point, the reader may well ask – why not just specify a “best” system that is the best of both worlds, that is, the superset of System 1 and System 2? The answer is that such a system specification may well violate program constraints such as cost, power, schedule, resources, etc. The “best” system often ends up becoming a camel.

A camel is a horse designed by a committee.

Three Basic Tasks – Storage, Display, and Analytics

To this point, we have described the fundamental data pipeline elements – sensed data, algorithms for analytics, decisions, and actions resulting from decisions. In addition to the fundamental elements, two additional elements are often present – storage and display, as in Figure 3-3.

Storage is often required to meet legal and insurance requirements that the data be archived as evidence and available for a specified period. The purpose is to allow retrospective access to the data for analysis of key events of interest and/or provide a legal source of record for evidence purposes. Two types of data are stored – the sensed data, for example, a video stream, and metadata. Metadata is information about the sensed data.

Display is required when the decision-making element is a human operator, DO. In this case, the analytics and decision elements are taken over by the operator. The default mechanism for providing information to the operator is a visual display. In some installations, the display function may take the form of an array of two or more panels.

Figure 3-3
Sensed data and the information (metadata) produced by the algorithms are retained in storage, which feeds a display and a decision with outcomes Action 1 (Conf A), Action 2 (Conf B), or No Action.

Fundamental elements plus storage and display

In applications where the decisions are taken by algorithms, the display function may be omitted. This situation will be examined in more detail in the section related to Machine Learning and neural networks.

Basic Datatypes and Relationships – Sensed Data, Algorithms, and Metadata

A Digital Surveillance system operates with diverse types of data, each of which has a specific purpose. When architecting a digital surveillance system, it is necessary to identify the basic data types in the data pipeline. These data types have quite distinctive characteristics and require different treatment in their processing, storage, and protection.

Figure 3-4
Sensed data reaches the algorithms directly or via descriptive metadata and becomes inferred metadata, which drives the decision with outcomes Action 1 (Conf A), Action 2 (Conf B), or No Action.

Basic data types and relationships

The basic data type of sensed data has already been introduced. The sensed data is the information captured about the real world through a sensor, together with the transforms of that data into a form suitable for analysis. Fundamentally, the information contained in the sensed data is defined at the moment the sensed data is captured. The subsequent transforms of the sensed data do not change the information content, only the representation of the information. The sensed data is typically the largest of the data types in terms of quantity. Video data sets can easily reach hundreds of Gigabytes (GB) in size. An uncompressed video stream can reach tens of Megabytes (MB) per second, and even compressed video can reach several to tens of Megabits per second (Mbps). The sensed data will often contain sensitive information such as Personally Identifiable Information (PII) subject to legal or regulatory control.

The second data type consists of algorithms. While algorithms are normally considered processing elements, when stored or transmitted, the algorithms can be considered a particularly sensitive type of data. The instructions for algorithms are typically represented in computer code. The code is the data type that represents the knowledge of how to transform the sensed data into information. The information output by the algorithm is referred to as Inferred Metadata and will be discussed in detail in the concluding section. The size of the algorithmic data can vary substantially from one algorithm to another but is stable during the operation of the pipeline. The algorithmic code does not change, though which portions are executed may depend on the sensed data. Corruption of or tampering with the algorithm may lead to “undefined results” and hence errors in the decision and action elements of the pipeline. Algorithms will often use information in the Descriptive Metadata as an input to properly interpret the structure and format of the sensed data. Additionally, the algorithm itself is often the result of significant investment in its development. The algorithms are often highly valued Intellectual Property (IP) and may comprise a substantial portion of an enterprise’s net worth. Combined with the quality of the sensed data, the algorithm is critical in determining the accuracy and confidence level of the information used to take decisions and implement actions.

The final general data type consists of two subtypes: Descriptive Metadata – information about the sensed data, and Inferred Metadata – information derived, or inferred, from the sensed data. Descriptive metadata describes how the sensed data was created and records key traits of the sensed data such as location, time stamps, file sizes, codecs (encoding methods), and file names. Descriptive metadata enables use of the sensed data and provides context. Descriptive metadata is typically less than 1–2% of the size of the sensed data. The descriptive metadata is often embedded into the same file or data structure as the sensed data for ease of reference. Descriptive metadata typically does not contain sensitive information per se but, if corrupted or tampered with, can affect how the sensed data is interpreted, or whether it is even usable.

Inferred metadata is information derived from the sensed data that summarizes or classifies the information content in a more concise form. Typical examples of inferred metadata are identifications of objects in a scene, such as a pedestrian, a vehicle, or the state of a traffic light. Inferred metadata is the information that will be used for making decisions and taking actions. Consequently, the critical characteristics of inferred metadata are accuracy and confidence level, two of the fundamental criteria we defined for describing the data pipeline goals. Corruption or tampering of the inferred metadata, causing errors in its accuracy or confidence level, will lead to errors in decisions and hence in the actions taken. For this reason, the inferred metadata must be accorded the same level of protection as the sensed data. Additionally, the inferred metadata will often contain the same Personally Identifiable Information (PII) as the sensed data and is subject to the same legal or regulatory controls.
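As a concrete illustration, the sketch below shows one plausible way the two metadata subtypes could sit alongside a video clip; the field names and values are hypothetical, not a standard schema.

```python
# Hypothetical structures for the two metadata subtypes (field names invented).
descriptive_metadata = {            # how the sensed data was created
    "camera_id": "cam-017",
    "timestamp_utc": "2023-04-01T14:22:05Z",
    "location": {"lat": 45.52, "lon": -122.68},
    "resolution": [1920, 1080],     # columns x rows
    "frame_rate_fps": 30,
    "codec": "H.264",
    "file_name": "cam-017_20230401_1422.mp4",
}

inferred_metadata = {               # derived from the sensed data by analytics
    "frame": 1842,
    "detections": [
        {"label": "pedestrian", "confidence": 0.91, "bbox": [412, 220, 88, 190]},
        {"label": "vehicle",    "confidence": 0.87, "bbox": [900, 310, 240, 140]},
    ],
}
```

Note that the detections carry the accuracy and confidence attributes, and potentially the PII, discussed above.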

Fundamental Principle of Computation:

GIGO: Garbage In, Garbage Out

In summary, the three fundamental data types each have unique and critical characteristics and requirements. The interaction of these data types is critical to success of the data pipeline. Furthermore, protection of these elements is critical to the data pipeline operating securely and in a predictable manner.

Table 3-1 Summary of Data Types and Characteristics

Evolution of IMSS Systems, or a Brief History of Crime

The primary purpose of this text is to describe the architecture and requirements of a modern IMSS system based on video analytics. For many readers, it will be helpful to place the modern system in context with its predecessors to understand the evolution and the motivation for the modern systems.

IMSS 1.0 In the Beginning, There Was Analog…

The very earliest systems were based on analog technology, essentially television on a private system. Prior to IMSS 1.0, security systems were based on either human observers or no surveillance at all. Human observers were able to monitor only one area at a time, and continuous coverage demanded multiple shifts. In addition, humans were subject to errors in recall, attention, etc. The expense of human security systems restricted their use to only high value situations and often only at specific locations such as checkpoints, lobbies, etc.

The analog system architecture is described in Figure 3-5. The system consists of:

  • Capture: Video sensors, typically QCIF (176x144 pixels) or CIF (352x288 pixels)

  • Camera Connection: Coaxial cable

  • Storage – Magnetic tape, for example, VHS tape with one recorder per camera

  • Inferred metadata (analysis) – Performed manually by human watching recorded scene

  • Decision – Decisions made by human after watching video on a display

The IMSS 1.0 systems offered the advantage that multiple locations could be observed simultaneously for the cost of a camera and a recorder per location. Further, continuous coverage could be maintained at substantially lower cost than with human observers, and a persistent record was maintained on the videotape. Finally, the recorded video could be used as evidence in legal and other proceedings, without relying on human memory or interpretation.

Figure 3-5
The IMSS 1.0 analog pipeline: sub-TV-resolution sensor data flows to the algorithms directly or via hardware-defined descriptive metadata, then to a display monitor, inferred metadata, and the decision with outcomes Action 1 (Conf A), Action 2 (Conf B), or No Action.

IMSS 1.0 analog system

IMSS 1.0 systems still suffered from significant drawbacks imposed by the technology. The VHS magnetic tapes possessed limited storage, approximately two to six hours of broadcast quality television. Extending the recording life required reducing the camera resolution from the D1 standard of 720x480 pixels to the CIF and even QCIF referred to previously. In addition, the frame rate was often reduced from the D1 standard 29.97 FPS to a lower value, again to extend recording time on the magnetic tape. Additionally, due to limitations on cable length, the VHS recorder had to be located near the cameras, so retrieval of any data meant traveling to the location being monitored. Finally, there was no real time response – unless the system was monitored by a human observer. The system was primarily retrospective.

IMSS 1.0 systems suffered from additional vulnerabilities and security risks. The limited camera resolution imposed by the storage technology made identification and classification of subjects difficult because of the low fidelity of the video. The analytics functions were strongly dependent on the human operator, leading to inconsistencies in analysis. Tapes were often kept in unsecured locations and were subject to theft or erasure by unskilled persons. While IMSS 1.0 systems had significant advantages over the previous human-based methods, the disadvantages were the impetus for IMSS 2.0 systems.

IMSS 2.0 …And Then There Was Digital…

The evolution to IMSS 2.0 was driven primarily by the conversion from an analog representation of the video data to a digital representation. A digital representation leveraged the wider computer technology base, enabling greater customization. Storage capacity was no longer dictated by analog television standards, but by the variety of storage capacities provided by the emerging computer industry. A second consequence of conversion to a digital representation was the emergence of video compression technology. Much of the information in a video frame is redundant, varying little from one frame to another.

The key innovation was to change how the information was encoded, moving from an analog representation to a digital representation. In an analog representation, the video information is carried by a continuous signal that may take any value between a specified lower bound and an upper bound. Analog encoding required very strict adherence to timing conventions for both the information in a line and the information in a frame, that is, a sequence of lines displayed together. Figure 3-6 compares typical analog and digital video encoding techniques. The intensity of the video signal is represented in relative units designated IRE (derived from the initials of the Institute of Radio Engineers), which correspond to specific voltage values. Higher IRE values correspond to brighter (whiter) tones and lower values to darker (blacker) tones. The precise correspondence depended on the exact analog system used, requiring substantial calibration to achieve correct results. Additionally, the timing is shown for two representative analog systems, NTSC (primarily US) and PAL (primarily European). The PAL system uses 625 lines, of which 576 carry visible video information; the NTSC system uses 525 lines, of which 480 carry video information. Add in different refresh rates (PAL ~50 Hz and NTSC ~60 Hz), and conversion from one system to the other was not for the faint of heart. In summary, the analog encoding system was extremely rigid in practice and limited one to a few standardized choices, largely dictated by the ultimate display technology to be used.

Conversely, digital encoding was much more flexible, and conversion from one format to another (an operation known as transcoding) was considerably more straightforward. A digital frame was represented as an array of M x N pixels, as illustrated on the right-hand side of Figure 3-6. Each square in the array represents one pixel. Each pixel can be described by three numbers, in this example, three values of Red, Green, and Blue. It should be noted that RGB is not the only representation; indeed, there are other representations of practical application. The three values of the pixel are represented in a binary format consisting of 1s and 0s. In this example, eight (8) binary values (bits) are used to represent each value of R, G, or B. This representation allows 2^8 = 256 different values, or hues, for each color. For all three colors, with eight bits each, 256 * 256 * 256 = ~16.7M different colors are possible with a relatively compact representation. Finally, the blue value is shown as a waveform with allowed values of either “0” or “1.” The precise voltages corresponding to a “0” or “1” are entirely arbitrary, allowing great flexibility in system design.
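A minimal sketch of the arithmetic above: eight bits per channel gives 2^8 = 256 levels per color and roughly 16.7 million colors per pixel, and a digital frame is simply an M x N array of such pixels (NumPy is used here purely for illustration).

```python
import numpy as np

BITS_PER_CHANNEL = 8
levels = 2 ** BITS_PER_CHANNEL        # 256 hues per color channel
colors = levels ** 3                  # 16,777,216 distinct RGB colors
print(levels, colors)

# An M x N digital frame, three 8-bit values (R, G, B) per pixel.
M, N = 480, 720                       # rows (lines) x columns (width)
frame = np.zeros((M, N, 3), dtype=np.uint8)
frame[100, 200] = (0, 0, 255)         # set one pixel to pure blue
```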

Figure 3-6
Two illustrations: in analog video, higher IRE values correspond to whiter tones and lower values to blacker tones, with system-dependent refresh rates; in digital video, eight binary values represent each of R, G, and B, and each box in the M x N array is one pixel.

Analog vs. digital video encoding

The image pixel array size is a flexible parameter for system design and now decoupled from the display choice. Using digital signal processing techniques, the MxN pixel array can be scaled and cropped to fit a wide array of display formats.

While decoupling the sensor capture architecture from the display architecture is a major benefit of the digital representation, far more important was the ability to apply digital signal processing techniques to the video signal. The ability to detect and remove redundant information to achieve video compression is the most critical feature impacting system architecture. The next section will describe the principles of video compression and the impact on storage, transmission, and display of the video.

Video compression technology uses two techniques to retain and store only the critical information. The first technique takes advantage of the operation of the human visual system. A well-known example is shown in Figure 3-7, which shows how the human eye perceives spatial frequency vs. contrast, with higher contrast (greater difference between black and white) at the bottom of the graph. As the contrast decreases, the human visual system becomes less able to perceive information. In a digital representation, signal processing can be used to filter out information that humans cannot perceive, and so save storage. The mechanisms will be discussed in more detail in a later section.

Figure 3-7
A contrast-sensitivity graph: an inverted parabola separates the region the eye can see (inside) from the unseen region (outside), plotted over vertical bands of alternately shaded lines that progressively decrease in width.

Human visual system (Source: Understandinglowvision.com)

The second technique relies on the observation that much of the data in two successive video frames is the same. In an analog system, this redundant data must be captured, stored, and recalled for each video frame, leading to a tremendous amount of redundant information.

Figure 3-8 illustrates a typical scene, here composed of a bicyclist traveling along a road with buildings and scenery in the background. Between video frames, the only object changing is the bicycle; all other objects are the same as in the previous frame. Digital signal processing allows the video system to identify which parts of the scene are constant, performing the mathematical equivalent of “Do not Resend this information.”

Between these two techniques, it is typical to achieve a data reduction of 50X, that is, fifty times less data to be transmitted and stored compared to the uncompressed video stream. As an example, suppose one was capturing a scene at a standard HD resolution of 1280x720 pixels at 8 bits/pixel (for a gray-scale image) at a rate of 30 frames per second (fps). The load on the system network bandwidth and storage would be 1280 x 720 pixels x 30 fps x 8 bits/pixel ≈ 221 Mbit/s, or roughly 27 MB per second. For color imagery, it would be three times this size. Conversely, with compression, the data is reduced to approximately 0.5 MB/s. The actual reduction seen is scene-dependent – scenes with few moving objects and/or less noise will compress better than scenes with the opposite traits.
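The arithmetic behind those figures, written out as a small sketch (the 50X ratio is the illustrative value used above, not a guaranteed outcome):

```python
def raw_rate_bytes_per_s(width, height, fps, bits_per_pixel):
    """Uncompressed stream rate in bytes per second."""
    return width * height * fps * bits_per_pixel / 8

raw = raw_rate_bytes_per_s(1280, 720, 30, 8)            # gray-scale HD stream
print(f"uncompressed:       {raw / 1e6:.1f} MB/s")      # ~27.6 MB/s
print(f"color (3 channels): {3 * raw / 1e6:.1f} MB/s")
print(f"~50X compression:   {raw / 50 / 1e6:.2f} MB/s") # scene dependent
```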

Figure 3-8
Three video frames in which a bicyclist traverses a winding road with mountains, sun, and buildings in the background; only the cyclist changes position from frame to frame.

Redundant video frame information (road, buildings, mountains) and changing information (bicycle)

As important as compression, the move to the digital domain also permitted encryption of the video data. Encryption is a reversible mathematical process for translating “plaintext,” data anyone can read, into “ciphertext,” data only those with the correct key and algorithm can read. Ciphers have been employed since antiquity to protect military, commercial, and even scientific secrets. Methods range from simple substitution codes to the public key/private key technique illustrated in Figure 3-9, a common member of the class of asymmetric cryptography. In the public key/private key example, the public key is published, and anyone can encrypt the data. However, only those with access to the private key can decrypt the data and read the ciphertext.

From a security perspective, the data is now much more secure from tampering and access by unauthorized parties. Only those with the correct key can read the video data. Of course, the data is only as secure as the key. If the key is compromised, then the data is now readable. In IMSS 2.0 systems, the encryption is applied primarily to data in storage, not during the transmission of the data or during the display of the data.

It is important to note that the encryption must be applied after the video compression, so that the redundant data patterns have already been removed before encryption. This is necessary to prevent breaking the encryption by examining the ciphertext for patterns that correlate with the input plaintext. Encryption techniques depend on mathematical algorithms that map an input symbol to an output symbol based upon the key. In general, the more complex the algorithm and the key, the more difficult it is to break the code. For the majority of practical IMSS systems, the algorithm is known, documented, and implemented as a standard. This is a requirement for encryption systems such as the public key/private key flow illustrated in Figure 3-9, because the users must know the encryption algorithm. A second motivation is that encryption throughput is often enhanced by dedicated hardware; hence the hardware must be designed to precisely execute the algorithm for both encryption and decryption. The result is that the security of the data is entirely dependent on the security of the key. For this reason, the more complex the key, the harder it is to guess the correct value. Initial deployments utilized 56-bit keys, enabling 2^56 different keys, or about 7.2x10^16 combinations. Initially thought sufficient, such keys have now been broken with the inexorable advance of computational capabilities and hence are no longer considered secure. Commonly used modern encryption systems have a minimum key length of 128 bits, allowing 2^128 ≈ 3.4x10^38 possible keys, making brute-force attempts at guessing a key impractical as of this writing. However, the advent of quantum computing foreshadows an evolution to 256-bit keys within the operational lifetime of many systems being designed and deployed today. Security breaches are now centered on the processes for securely creating, distributing, and storing the keys.
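A minimal sketch of the public key/private key flow of Figure 3-9, using the third-party Python cryptography package and assuming RSA-OAEP for the asymmetric step. In practice, bulk video is encrypted with a symmetric cipher (for example, AES) and the asymmetric pair protects only the symmetric key; this sketch encrypts the short plaintext directly for clarity.

```python
# pip install cryptography   (third-party package; RSA-OAEP chosen for this sketch)
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()          # published; anyone may encrypt

plaintext = b"Here is my private data"
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = public_key.encrypt(plaintext, oaep)           # unreadable in transit or storage
assert private_key.decrypt(ciphertext, oaep) == plaintext  # only the key holder recovers it

# Key-space arithmetic from the text: 2^56 vs. 2^128 possible keys.
print(f"56-bit key space:  {2.0**56:.1e}")    # ~7.2e16
print(f"128-bit key space: {2.0**128:.1e}")   # ~3.4e38
```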

Figure 3-9
A block diagram of encryption and decryption: the plaintext “Here is my private data” is encrypted with the public key into ciphertext and decrypted with the private key back into the original plaintext.

Digital data enables encryption and decryption (Source: ico.org.uk)

The introduction of digital technology enabled three critical capabilities: separating the video data format from the display technology, video compression, and encryption/decryption. The resulting IMSS 2.0 digital system architecture is described in Figure 3-10. The system consists of:

  • Capture: Video sensors, resolution up to HD now possible

  • Camera Connection: Ethernet, usually Power over Ethernet (POE)

  • Video data compression

  • Encryption and Decryption of Video Data

  • Storage – Rotating magnetic media, HDD, with multiple cameras per unit

  • Inferred metadata (analysis) – Performed manually by human watching recorded scene

  • D: Decisions made by a human after watching video on display

Figure 3-10
The IMSS 2.0 digital system: captured sensor data and descriptive metadata are compressed, encrypted, and written to digital storage, then decrypted, decompressed, and reformatted for display, producing inferred metadata and the decision with outcomes Action 1 (Conf A), Action 2 (Conf B), or No Action.

IMSS 2.0 digital system architecture

IMSS 2.0 systems offered qualitative improvements in system flexibility over the analog IMSS 1.0 systems in several key respects. The higher resolution sensors meant either a wider area of coverage at the same number of pixels on a target or a higher number of pixels on a target for the same coverage. The former required fewer cameras for a given installation; the latter improved image quality and the ability to identify objects more confidently. The ability to tune the video compression algorithm meant longer recording time could be achieved either by increasing the video compression ratio for a given amount of storage or by increasing the amount of storage purchased. Finally, the number of cameras per recording unit may be varied by trading off total recording time and the video compression ratio. Figure 3-11 schematically represents the flexibility of the IMSS 2.0 system architecture over IMSS 1.0 systems. The arrow indicates increasing values of a quantity. Key to this flexibility is that an objective may often be achieved in more than one way, for example, increasing recording time by increasing video compression or storage.

Figure 3-11
A graph with sensor resolution on the x-axis, video compression on the y-axis, and a diagonal storage line, spanning the trade-offs from less to more recording time, from lower to better image quality, and from less storage to more coverage.

IMSS 2.0 system architecture flexibility

IMSS 2.0 systems still retained significant drawbacks shared with IMSS 1.0 systems. Again, the digital recorder had to be located relatively near the cameras, so retrieval of any data meant traveling to the location being monitored. There was no real-time response – unless the system was monitored by a human observer. The analytics functions were still strongly dependent on the human operator, thus leading to inconsistencies in analysis. The system was still primarily retrospective.

In terms of security risks, IMSS 2.0 systems did offer some advantages but retained significant drawbacks. Analog tape systems required operators to periodically replace or rewind the tapes manually. Consequently, analog video tapes could be viewed and erased by anyone with access to the storage/playback system. Digital storage enabled digital encryption and access methods to be employed. Passwords could be used to restrict viewing and erasing to only authorized personnel. It was also now possible to use automated programs to set the recording time and to determine when previous data would be erased and recorded over. The result was a substantial increase in system availability and robustness. However, the systems were often installed in unsecured, remote locations and remained vulnerable to physical tampering or destruction. Techniques as simple as a spray can of paint or a fogger could disable a system and evade detection until a human was sent on-site to investigate.

IMSS 3.0 …Better Together – Network Effects…

Many of the deficiencies of IMSS 2.0 systems related to the data being local and inaccessible to remote operators. In smaller systems or for high-value assets such as critical infrastructure, it is possible to co-locate the IMSS system and the operators, but this is not broadly economically feasible. The rise of the Internet enabled IMSS systems to adopt technology intended for a much broader set of applications. Leveraging this broad technology base enabled cost-effective implementations and integration into the broader IT infrastructure of the Internet.

The key difference between IMSS 2.0 digital systems and IMSS 3.0 digital systems is the extension of digital encoding to comprehend the transport of the digital data. In IMSS 2.0 digital systems, encryption was primarily used to protect stored data; compression was primarily used to make more efficient use of storage media. IMSS 2.0 systems were primarily implemented on dedicated infrastructure over local distances (<<1 km) with known and predictable data patterns. The move to an Internet-centered system architecture meant comprehending the transmission of video data over substantial distances, with low latency, on a shared infrastructure. Key to the Internet-based system is error resilience, that is, how to respond to lost or corrupted data.

Breaking Up Is Hard to Do…Packets Everywhere…

The Internet was designed to support a wide variety of applications; hence it was built on a flexible structure. A complete description is beyond the scope of this tome; a simplified version is presented here to bring out the basic concepts. Figure 3-12 shows four basic components, or layers, that provide the Internet with the necessary flexibility to support a broad range of applications.

Because the Internet is a shared resource, all the information is broken up into packets, or discrete chunks. A packet comprises discrete elements, each of which is used by a different layer of software to accomplish a specific portion of the data transport task. At the heart of the packet is the application data. The application data is determined by the application and could be anything of interest – video, audio, voice, or database records. The size of the application data is in theory quite flexible, though in practice one of a few standard sizes is chosen. This reflects the practice that packet processing is often accelerated by hardware with predefined characteristics such as buffer sizes and register allocations. As an example, for Ethernet protocols a common maximum packet size is 1500 bytes, and a common lower bound is 576 bytes. The practical consequence is that, even with compression, all video streams will require multiple packets to be transmitted.

The next element in the packet is the Transmission Control Protocol, commonly referred to as TCP. Transmission may fail in one of two ways – either data corruption or loss of a packet in transit. This element contains information for detecting and, in some cases, correcting errors in the application data. The element can also determine whether a packet is missing and request retransmission. The right-hand side of Figure 3-12 illustrates a case where three data packets are sent, D1 through D3; however, one packet has been lost (D2, as indicated by the dotted line). In this case, retransmission may be requested and data packet 2 resent. This element may also contain information regarding the ordering of the data, as packets may not be received in the order transmitted.

Conceptually, the IP layer contains information related to routing, such as the destination address. The network uses this element to transport the packet from the point of origin, across the network, to the destination. While traversing the network, the packet may pass through several nodes, and individual packets may take unique routes through the network. There is not necessarily a single unique path through the network.

Finally, the link layer is related to the physical characteristics of the connection to which the device is attached. Note that it is not required that the origin and destination devices have the same physical connection type. As an example, an originating device may use a copper Ethernet cable, an intermediate device may convert the traffic to optical fiber, and the destination device may be connected over Wi-Fi.
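A toy sketch of the layering idea: each layer prepends its own header to the application data before the packet goes on the wire. The header fields here are simplified stand-ins chosen for illustration, not the real TCP/IP wire formats.

```python
import struct

def build_packet(app_data: bytes, src_port: int, dst_port: int,
                 src_ip: str, dst_ip: str) -> bytes:
    """Toy encapsulation: network header + transport header + application data."""
    transport_hdr = struct.pack("!HHI", src_port, dst_port, len(app_data))
    network_hdr = (bytes(int(o) for o in src_ip.split(".")) +
                   bytes(int(o) for o in dst_ip.split(".")))   # 8 bytes of addresses
    return network_hdr + transport_hdr + app_data

chunk = b"\x00" * 1400   # one slice of a compressed video frame (fits a 1500-byte packet)
packet = build_packet(chunk, src_port=5004, dst_port=5004,
                      src_ip="192.168.1.10", dst_ip="203.0.113.7")
print(len(packet), "bytes including the toy headers")
```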

Figure 3-12
A simplified Internet packet across four layers: the application layer holds the application data; the transport layer adds TCP; the Internet layer adds IP; and the data link layer, at the bottom, wraps IP, TCP, and the application data with a header and footer.

Simplified Internet Packet description

What is the impact of introducing networking on architecting a secure IMSS 3.0 system? The most critical impact is that the system elements can now be physically disaggregated from one another. There is no longer a requirement to have a sensor unit, such as a video camera, in close physical proximity to the recording unit. Similarly, there is no longer a requirement to have the display and operator in close physical proximity to the storage unit. Equally critical is that the shared, open network infrastructure now means data can be shared across a wide range of actors and geographies. The open, shared network infrastructure is both an advantage – you can share with anyone, anywhere, and a disadvantage – anyone, anywhere can potentially access your data as it travels across the network. In security terms, your attack surface has just greatly expanded.

Learning to Share…

In practice, there are two fundamental strategies to employ in designing an architecture. The first is to restrict oneself to private network infrastructure only. In the private infrastructure approach, a single entity owns all the elements of the system – the sensors, the storage, the displays, and the operator terminals. In practice, the private infrastructure approach is only feasible when the value of the data being protected justifies the substantial costs of construction, operation, and maintenance. Even then, private infrastructure severely limits the geographic span and access of the network in all but a very few instances. The second strategy is to use the shared public infrastructure; in this case, encryption during transport becomes critical to ensure secure use of the shared infrastructure.

Encryption during transport then becomes a question of which elements of the packet to encrypt. Figure 3-13 schematically illustrates how the elements of the packet are used as the packet traverses the network from origin to destination via one or more intermediate nodes. Dotted arrows indicate interaction between the nodes at each level. The criteria become: what information is being protected? The application data itself? The source and destination? The data integrity – that is, correction for data corruption and/or retransmission of lost packets? Intertwined with the data integrity question is that of authentication. Is the source trusted? Is the destination a trusted destination? Has an intermediate party tampered with the data, and how can such tampering be detected? The specific attacks, vulnerabilities, and countermeasures will be dealt with in detail in a later chapter. At this point, the intent is to raise the readers’ awareness of the vulnerabilities and critical architectural decision points.
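One common answer to the “what do we encrypt?” question is to protect everything above the transport connection with TLS, leaving the IP routing information visible so intermediate nodes can still forward the packets. The sketch below uses Python’s standard ssl module; the host name, port, and request path are illustrative placeholders, not part of the text.

```python
import socket
import ssl

context = ssl.create_default_context()   # verifies the server certificate (authentication)

# "nvr.example.org", port 443, and the request path are placeholders for a real destination.
with socket.create_connection(("nvr.example.org", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="nvr.example.org") as tls_sock:
        tls_sock.sendall(b"GET /streams/cam-017 HTTP/1.1\r\n"
                         b"Host: nvr.example.org\r\n\r\n")
        reply = tls_sock.recv(4096)       # ciphertext on the wire, plaintext here
```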

Several different protocols may be used at each layer of the system, and each requires its own modification of the packet structure. It is not unusual for the packet structure to be modified as the packet traverses the system. Some representative protocols are given in Table 3-2. As an example, consider an origin device connected by a Wi-Fi link to a router, connected by Ethernet cable to a second router, connected by Bluetooth to a destination device. Packets traversing this route would have the Physical/Data Link element modified from a Wi-Fi to an Ethernet to a Bluetooth structure; return packets from the destination to the origin device would follow the inverse transformation. Note that the protocols indicated by the dashed arrows in Figure 3-13 must match at each end of the arrow. Hence the transport layer must match the transport layer, the application layer the application layer, and so forth, even though the layers underneath are changing.

Table 3-2 Representative Networking Protocols

Hook Me Up…Let’s Get Together

IMSS 3.0 introduced critical networking capabilities, enabling the system to be physically distributed. The resulting IMSS 3.0 digital system architecture is described in Figure 3-14. The system consists of:

Figure 3-13
Internet protocol layers from an origin node to a destination node via an intermediate node: the application and transport layers communicate end to end, while the Internet and physical layers communicate hop by hop through intermediate router nodes.

Internet Protocol layers

  • Capture: Video sensors, resolution up to HD or even higher resolution is now possible

  • Camera Connection: Ethernet, usually Power over Ethernet (POE)

  • Video data compression between nodes

  • Network Link at Specified Points

    • Encryption and Decryption of Sensor Data between nodes

    • Compression and Decompression of Sensor Data

  • Storage – Rotating magnetic media, HDD, with multiple cameras per unit

    • Compressed and Encrypted sensor data

  • Display Functionality – Decrypt, Decompress, and format to display one or more sensor data streams

  • Inferred metadata (analysis) – Performed manually by human watching recorded scene

  • Decisions made by a human after watching video on display

The key distinction of IMSS 3.0 systems, as shown in Figure 3-14, is the separation of the system elements made possible by the introduction of the network capability. Commonly, the network architecture is partitioned into a private network and a public network. The private network will aggregate data from multiple sensors, as shown on the left of Figure 3-14. Note that not all the sensors may natively support network capability; hence the public router may need to construct the appropriate packets and select the appropriate protocols. The network aggregation point may also optionally support storage of data or may simply pass the data through to a node further along in the system. If storage functionality is present, the node is referred to as a Network Video Recorder (NVR). The second critical feature is a complementary router node connected to the public network, designated in the diagram as the “Public Router.” The public router will accept data from the private router and/or storage, compress and encrypt the data, packetize the data to the appropriate granularity, and generate the addresses and protocols in each packet to ensure arrival at the destination node. Between the origin node (NVR) and the destination node (Operations Center), there may be one or more intermediate nodes on the public network. To construct secure systems, it is quite critical that the origin node have separate router functions for the private and public networks to provide isolation.

As noted in Figure 3-14, multiple NVR units may connect to a given destination node, here an operations center. The disaggregation allowed by multiple NVRs connecting to a single operation center increases the overall system efficiency and response time. Particularly expensive resources such as displays, human operators, and response systems need not be duplicated multiple times across the entire system. These resources can be concentrated in a single location and shared. Additionally, because the information from multiple locations is available at a single location, the operators can obtain a much more complete situational awareness. As an example, a traffic monitoring system can gather the sensor data from an entire city, rather than look at the traffic patterns in only a single neighborhood or a single street.

Figure 3-14
The IMSS 3.0 networked pipeline: three sensors with network links feed a network video recorder containing a private router, a public router, descriptive metadata, and local data storage; an operations center contains a public router, inferred metadata, a display, the decision (actions or no action), and main data storage. The two sides are connected across the public network between the NVR and the operations center.

IMSS 3.0 networked system architecture

A key system architecture decision is the partitioning of storage between “local” storage near the sensors and “main” storage near the operators. The partitioning is driven by the balance of network bandwidth and latency available for the expected volume of sensor data against the cost of providing local vs. main storage. One extremum of the continuum is to have no local storage at all and send all the sensor data to the operations center for storage. The consequence is a heavy premium on both network bandwidth and latency, in the face of unpredictable loading of the public network. The other extremum is to store all sensor data at the NVR level in local storage and only forward sensor data to the destination node (the operations center) as requested for analysis. The consequence is minimizing the network bandwidth required; however, latency may become an issue depending on the public network loading. Additionally, with access to only a portion of the data, it may be difficult for the operations center to know which sensor data to request. The intricacies of network analysis are beyond the scope of this book; suffice it to say that an application-dependent balance will need to be determined.
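A back-of-the-envelope sketch of that balance; the camera count, per-camera bitrate, and retention periods below are assumptions for illustration only.

```python
def storage_needed_tb(cameras, mbps_per_camera, retention_days):
    """Storage (TB) needed to retain compressed streams for a retention period."""
    bytes_per_camera_day = mbps_per_camera * 1e6 / 8 * 86_400
    return cameras * bytes_per_camera_day * retention_days / 1e12

# Assumed figures: 64 cameras at 4 Mbps each.
print(f"NVR (local), 7-day retention: {storage_needed_tb(64, 4, 7):.1f} TB")
print(f"Main, 90-day retention:       {storage_needed_tb(64, 4, 90):.1f} TB")
print(f"Uplink if every stream is forwarded live: {64 * 4} Mbps")
```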

Data Rich, Information Sparse…

IMSS 3.0 systems, for all their advantages, retained one serious weakness from previous generations. While IMSS 3.0 systems enabled the aggregation of massive amounts of data, much of the data was redundant and difficult to correlate to extract actionable information. IMSS 3.0 systems still rely on human operators viewing displays to make inferences from the sensor data, make decisions, and take actions. In principle, there is no fundamental difference from previous generations; in practice, there is a considerable qualitative change. As the amount of aggregated sensor data increases, it becomes increasingly difficult for human operators to assimilate and synthesize the sensor data and form reliable inferences about a situation. Knowing which sensor data to access, and in what combination, becomes an increasing challenge for the operators, requiring increasing amounts of expertise and training. In the city traffic example cited previously, imagine a moderate-size city with a few hundred cameras on the traffic grid feeding a display system that can service tens of the sensor streams at once, in any combination. How to select which sensor streams? How to select which combinations of the streams? The impact of variation in the human operators’ skill on overall system performance is greatly magnified in these cases.

From a system security perspective, IMSS 3.0 systems also introduced a vulnerability in traversing public networks. Without a clear understanding of the potential attacks on packetized data, IMSS 3.0 systems were open to exploitation. Using encryption and secure protocols, it is feasible to construct robust systems resistant to attack. Doing so requires substantial knowledge of how networks work and how the packetized data is transformed during the transmission process. Use of the public network meant that attackers no longer required physical access to the data, but could remotely access the data over the public network. Attacks could come either while traversing the network or when the data was stored, either locally or at the main storage locations such as data centers. A practical consequence is to multiply the number of potential attackers to anyone with access to the public network – in effect, the global population. The number and types of attacks grew rapidly and were difficult for human defenders to monitor, identify, and react to in a timely fashion.

Addressing both the massive data available for analysis and the necessity to secure that data from increasingly sophisticated attackers forced the next step in IMSS system evolution.

IMSS 4.0…If I Only Had a Brain…

A well-developed toolkit for analyzing images was developed under the general heading of Computer Vision. Recently, there has been a resurgence in a related but distinct branch of analytics under the general heading of Artificial Intelligence (AI) or Machine Learning (ML). The field of Artificial Intelligence (AI) can be traced back at least to the 1950s, its origins often attributed to a 1956 conference at Dartmouth College. Since the inception of AI, the field has endured several periods swinging between irrational exuberance and disappointment with actual performance, followed by a fallow period. The latest revival started in 2012, using machine learning (ML) techniques that require intense computation, built on a technique called Convolutional Neural Networks (CNN). Figure 3-15 shows the decline in the cost of computation, storage, and networking that enabled the introduction of machine learning techniques. The compute demand follows from the fundamental operation of a CNN, matrix multiplication. In 2012, Moore’s Law of computation finally intersected the CNN mathematical techniques to produce practical results. The CNN algorithms require trillions of operations per second (TOPS) in combination. The large databases, primarily video data, required enough memory to hold tens of millions of parameters and intermediate results during computation. The storage demands of the video data required low-cost storage; a single video camera could easily generate one Terabyte (1,000 Gigabytes) of data per month. Transmitting the data to where it could be processed demanded connection speeds on the order of megabits per second per camera. Architecting, developing, and installing the new systems at economically viable costs required coordinated scaling in compute, storage, and transmission.
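A rough sketch of the scaling arithmetic in this paragraph: converting one Terabyte per month into a sustained bit rate, and estimating the multiply-accumulate load of a single convolutional layer (the layer dimensions are assumed for illustration).

```python
# 1 TB/month of video expressed as a sustained link rate.
TB = 1e12
seconds_per_month = 30 * 24 * 3600
mbps = TB * 8 / seconds_per_month / 1e6
print(f"1 TB/month  ~= {mbps:.1f} Mbit/s per camera, sustained")

# Multiply-accumulate count for one assumed convolutional layer:
# 224x224 output positions, 64 output filters, 3x3 kernel over 64 input channels.
macs_per_frame = 224 * 224 * 64 * (3 * 3 * 64)
tops_at_30fps = macs_per_frame * 30 * 2 / 1e12      # 2 ops (multiply + add) per MAC
print(f"one conv layer ~= {macs_per_frame / 1e9:.1f} GMAC/frame, "
      f"{tops_at_30fps:.2f} TOPS at 30 fps")
```

A full network stacks tens of such layers, which is how the total lands in the trillions of operations per second.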

Figure 3-15
A line graph showing the decline in the cost of Internet bandwidth, the average transistor cost, and disk drive prices from 1956 to 2019.

Falling costs: compute, storage and networking (Source: The Economist, Sept 14th, 2019)

When all elements were in place, it was possible for the next stage to appear – intelligent IMSS systems.

Classical CV Techniques – Algorithms First, Then Test against Data

Traditional Computer Vision (CV) consists of a developer selecting and connecting computational filters, based on linear algebra, with the goal of extracting key features of a scene and then correlating those features with an object(s) so the system can recognize the object(s).[1] The key feature of the traditional CV methodology is that the developer selects which filters to use and, hence, which features will be used to identify an object. This method works well when the object is well defined and the scene is well understood or controlled. However, as the number of objects increases or the scene conditions vary widely, it becomes increasingly difficult for the developer to predict the critical features that must be detected to identify an object.
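A minimal sketch of the “algorithms first” approach: the developer hand-picks a filter (here a Sobel-style kernel, a common choice) and thereby decides that vertical edges are the feature worth detecting. The synthetic frame and the plain-Python filtering loop are illustrative, not a production CV routine.

```python
import numpy as np

# Developer-chosen filter: a 3x3 Sobel-style kernel that responds to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def filter2d(image, kernel):
    """Slide the kernel over the image, multiplying and accumulating at each step."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic gray-scale frame with a bright vertical stripe; the filter responds
# strongly at the stripe boundary, the feature the developer chose to look for.
frame = np.zeros((8, 8))
frame[:, 4:] = 255.0
print(np.abs(filter2d(frame, sobel_x)).max())
```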

Deep Learning – Data First, Then Create Algorithms

The terminology used in the AI field can often be confusing; for the purposes of this discussion, we will use the taxonomy described in Figure 3-16. AI shall refer to tasks that are related to the physical world, focused on the subset related to perceptual understanding. In this sense, AI is a subset of the broader field of data analytics, which may or may not refer to the real world. The next level is machine learning, a class of algorithms that relies on exposing a mathematical model to numerous examples of objects, the algorithms then “learning” how to identify objects from the examples. This approach stands in direct contrast to the traditional CV approach, where a human programmer decides explicitly which features will identify an object.

The mathematical models are inspired by the neural networks found in nature, which are based on a hierarchical structure. The neurons are arranged in layers, each layer extracting more complex and more abstract information based on the processing performed by previous layers. Using the human visual system as an example, the first layer recognizes simple structures in the visual field: color blobs, corners, and how edges are oriented. The next layer will take this information, construct simple geometrical forms, and perform tracking of these forms. Finally, these geometric forms are assembled into more complex objects such as hands, automobiles, and strawberries based on attributes such as combinations of shapes, colors, textures, and so forth.

Figure 3-16
An illustration of the A I taxonomy. It includes A I with perceptual understanding and data analytics encompassing, machine learning, deep learning, and convolutional neural networks.

AI taxonomy

Deep learning and CNNs are based on the hierarchical neural network approach, building from the simple to the complex. Rather than trying to “guess” the right filters to apply to identify an object, the neural network first applies many filters to the image in the convolution stage and extracts features based on the responses to those filters. In the fully connected stage, the network analyzes the features with the goal of associating each input image with an output node for each type of object, the value at each output node representing a score for how strongly the image matches the object associated with that node.

Convolutional Neural Networks and Filters…Go Forth and Multiply…and Multiply

The convolution in CNN refers to a set of filters, each filter built on matrix multiplication. An incoming set of data is multiplied by parameters, designated as weights; the results are summed, and a filter is deemed “activated” if the sum exceeds a designated threshold. The filters are arranged in a hierarchy, with the results of earlier filters feeding into later filters. An example of this hierarchical approach is shown in Figure 3-17 for the case of an automobile sitting on a field of grass. The very low-level features are drawn from a very local region of pixels, recognizing simple attributes – colors, lines, edges. The next layer of mid-level features combines the low-level features to start to form primitive structures – circles, edges extended across larger numbers of pixels, larger areas of color. The third level of high-level features creates more complex, more abstract features associated with the car, the grass, etc. Some filters will have a strong response to a feature, say a wheel; other filters will respond more strongly to the grass. By knowing which set of filters is responding strongly, one can infer what object(s) are present in the scene.

Figure 3-17
An illustration of the C N N feature visualization. A car on the grass field is drawn using low-level features, mid-level features, and high-level features, which passes through a trainable classifier.

CNN feature visualization

The term Neural Network is derived from how the mathematical filters are connected into a structure. There are two stages to the process of recognizing an object – feature extraction (convolutions) and classifiers (fully connected layers). The detailed mathematical descriptions are beyond the scope of this book; however, there are numerous treatments available in the literature for those wanting more detailed information.

A simplified example of a neural network and its operation is shown in Figure 3-18. The key difference between DL and traditional CV is that in deep learning, there is no attempt to preselect which filters or features are the key identifiers for a particular object. Instead, all the filters are applied to every object during the convolution phase to extract features. During the fully connected phase, the system “learns” which features characterize a particular object. The learning occurs at a series of nodes that analyze the strength of particular features in some combination. If the response of that combination of features is over some threshold, then the node is activated and considered a characteristic of that object. It is applying many filters and having to look at all the combinations of features that makes deep learning so computationally intense. In addition, the more combinations of features and combinations of combinations the network examines in the fully connected phase, the more likely key characteristics that describe an object are to be discovered. At the end, a score is given related to how probable it is that the input image matches that particular class or object at the output.

In practice, this means more and more layers of nodes have a higher probability of matching an input image with the correct output node, or class—hence the term “deep learning.” In return for the computational resources used, DL is often much more robust than conventional CV methods and, hence, much more accurate across a broader range of objects and scene conditions. The developer will need to assess whether the increased robustness justifies the additional computation relative to traditional CV methods for the particular use case.

The image data is input at the left of the diagram, in the form of an image in the example shown. The first section applies the filters described in Figure 3-17. Note the filters are applied in layers, analogous to the neural network layers in a biological brain. The outputs of the filters in the first layer are fed to one or more filters in the second layer, and so forth, until a set of features has been generated. The next section is referred to as the fully connected layers, which assess the features that were created and the relative strengths of the features in various combinations. The output nodes at the right of the diagram represent the possible outcomes, referred to as “classes.” The result is a score for each output node (class) estimating the strength of response of that particular output node to the input image. All output nodes will have some response to the input image, though it may be quite low. However, it is not unusual for more than one output node to have a significant response to a given image. The output score is often misinterpreted as the probability of a node being correct; it is only the score given by the network. Adding all the scores together will, in general, not add up to 1 or any other particular value. (An exception is when a normalization operation has been implemented, but even that does not give the probability of being correct.)
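To make the score-versus-probability distinction concrete, the hedged sketch below (Python/NumPy, with made-up score values) shows raw output-node scores that do not sum to one, and a softmax normalization whose outputs do sum to one yet are still only rescaled network scores, not the probability that the classification is correct.

```python
import numpy as np

# Illustrative raw scores from three output nodes (classes); the values are made up.
raw_scores = np.array([2.3, 0.4, 1.1])   # e.g., "human", "strawberry", "bicycle"
print(raw_scores.sum())                  # 3.8 -- raw scores need not sum to 1

def softmax(x: np.ndarray) -> np.ndarray:
    """Normalize scores so they are positive and sum to 1."""
    e = np.exp(x - x.max())              # subtract the max for numerical stability
    return e / e.sum()

normalized = softmax(raw_scores)
print(normalized, normalized.sum())      # roughly [0.69 0.10 0.21], sums to 1.0
# Even after normalization, these remain relative scores produced by the network,
# not the probability that the top-ranked class is actually correct.
```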

Figure 3-18
An illustration of the neural networks. A photo of a girl is processed through convolution and fully connected layers and identified as a Swedish girl.

Neural networks features and classifiers example

This raises the question of how to interpret the outputs of the neural networks and how to define the classes that are the output nodes. How does the neural network know how to match an image input to output node, or class? To understand that, we need to explore the entire deep learning flow.

Teaching a Network to Learn…

Earlier it was stated that the difference between traditional computer vision techniques and machine learning techniques is the deep learning aspect. In deep learning, the data comes first, then the algorithm or filters – how is this accomplished? The deep learning workflow for neural networks is illustrated in Figure 3-19. The workflow is partitioned into two phases: training and inferencing. The previous section described how inferencing occurs. Data is presented to the network and, after processing through successive layers of convolution filters, each of the output nodes, or classes, is given a score. The algorithm that determines the score for the classes is created during the “training” phase of the process.

The training phase starts with the selection of a neural network model, the basic structure of filters and layers and how these filters and layers are connected. There are a wide variety of models in existence which represent trade-offs between computational complexity, accuracy, latency, and system constraints such as power and cost. These aspects will be discussed in a later section. The selected model is represented schematically by the circles (nodes) and arrows (connections between nodes) in the diagram.

During the training phase, the network is exposed to many images of a given object(s). For each image, it is told what the object in the image is—that is, which output node the input image should map to. During the training phase, the parameters of the CNN in the convolution and fully connected sections are modified to minimize the error between the input image and the correct output node or class. Because the network is exposed to large numbers of examples of the object, it learns which features are associated with the object over a large sample and, hence, which features tend to be persistent. Eventually, the parameters of the network are finalized when application requirements for number of classes (objects) to be recognized and accuracy are met. Once training is complete, the network can be deployed for use in the field, the step called inferencing or scoring. The green mound represents the decrease in error that occurs during training over multiple parameters. For purposes of illustration, the decrease over two parameters is shown.

Figure 3-19
An illustration of the Neural network. Training phase includes a human, a bicycle, and lots of labeled data of strawberries, passing through C N N, and the output of cycle from the training phase is forwarded to the inferencing phase, which gives a decreased error, indicated by a mound for the cycle.

Neural network: training and inferencing

In the training phase, there must be a source of “ground truth,” usually in the form of labeled data curated by humans. It is this labeled data that will determine what classes of objects the neural network model will respond to. In the example shown, images of Humans, Strawberries, and Bicycles are used; therefore, the only objects the neural network will be capable of recognizing are Humans, Strawberries, and Bicycles. A critical consequence is that this neural network will try to map ANY object into Humans, Strawberries, and Bicycles. Shown an image of a refrigerator, the model will give it a score in each of the Humans, Strawberries, and Bicycles classes, though probably a low score. For this reason, when evaluating the results of a neural network, it is critical to understand not only what the highest-ranking class is, but how strongly the object scored in that class.

During training, if an error is made in classification, then a correction is made to the parameters in the model. In Figure 3-19, the neural network has been presented with an image of a Bicycle and has incorrectly mapped it to the class “Strawberry.” The neural network model is then adjusted by altering parameters in the convolution and fully connected sections until the correct prediction is made for that image. For pragmatic models with acceptable accuracy over a wide range of real-world conditions, the training data set is typically in the range of tens of thousands to hundreds of thousands of examples. The size of the training data required is a function of the number of classes to be recognized, the robustness to different observing conditions (lighting, angles of observation, presence of other objects), etc. Training a neural network is subject to the classic computing maxim GIGO, Garbage In, Garbage Out: poor data labeling, insufficient data, and poorly defined output classes will yield poor results.
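The parameter-adjustment step described above can be sketched with a toy gradient-descent loop. The example below (Python/NumPy, a deliberately tiny stand-in for a real CNN training framework; the synthetic data and learning rate are assumptions for illustration) repeatedly shows labeled examples to a one-layer model and nudges its weights to reduce the classification error, which is the essence of the training phase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled "ground truth": 2-D features, label 1 if x0 + x1 > 0, else 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(np.float64)

w = np.zeros(2)        # model parameters ("weights"), adjusted during training
b = 0.0
lr = 0.1               # learning rate

def predict(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid "score" per example

for epoch in range(100):
    p = predict(X, w, b)
    error = p - y                       # how wrong the current parameters are
    w -= lr * (X.T @ error) / len(X)    # adjust parameters to reduce the error
    b -= lr * error.mean()

accuracy = ((predict(X, w, b) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")   # approaches 1.0 as the error shrinks
```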

Types of Neural Networks: Detection and Classification

There are numerous types of neural network models; two are especially common in IMSS video systems. In the training phase, we tell the CNN what objects are in the image—that is, which output node an input image should map to. Conversely, in the inference stage, we don’t know what is in the scene. This means that the network must first determine if there is an object of interest in the scene (detection phase) and, if so, identify what the object is (classification phase). There are networks that are optimized for each task. In addition, it is possible to mix DL and conventional techniques (e.g., one could perform detection with traditional CV and classification with DL).

Detection is done at the video frame level. It consists of examining a video frame to detect how many objects are in it. A frame of video may contain 0, 1, 2, or many objects. The key metric for detection is frames per second (fps).

Classification is identifying the objects detected in a frame. Is it a car, a person, etc.? A classification may have multiple attributes (e.g., car—blue, sedan, Audi*), and a frame may give rise to 0, 1, 2, or many classification tasks. Classification is measured in inferences/second. Identifying one object equals one inference.

An example of a complete Deep Learning data flow is shown in Figure 3-20 for a street scene. Starting in the upper left portion of the diagram, the model is trained on a large data set of street scenes. At this point, it is determined that the model will be trained to recognize automobiles and pedestrians. Each training sample requires tens of GOPs (one GOP = one Giga Operation = 1 billion operations), and training the model requires tens of thousands of examples, which means the training is often performed on specialized systems in a data center or the equivalent. Once the model is trained, it is commonplace for mathematical optimizations to be performed to increase computational efficiency while preserving desirable traits. Common techniques include removing nodes or filters that are unused in the final model or have only a minimal impact, combining filters using mathematical transforms that are equivalent but require less compute, and reducing precision, for example, using 8-bit operations instead of 16-bit operations. Collectively, these constitute model compression.
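One of the compression techniques mentioned, reduced precision, can be illustrated with a minimal sketch (Python/NumPy, not tied to any specific toolkit; the random weights are placeholders): 32-bit floating-point weights are mapped to 8-bit integers with a scale factor, cutting storage by roughly 4x at the cost of a small quantization error.

```python
import numpy as np

rng = np.random.default_rng(1)
weights_fp32 = rng.normal(scale=0.05, size=1000).astype(np.float32)

# Symmetric linear quantization to signed 8-bit integers.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to estimate the error introduced by the reduced precision.
weights_restored = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - weights_restored).max()

print(f"storage: {weights_fp32.nbytes} B -> {weights_int8.nbytes} B")
print(f"max quantization error: {max_error:.6f} (scale = {scale:.6f})")
```

Real deployments typically pair this with calibration data and per-channel scales, but the storage and compute savings follow the same principle.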

As mentioned previously, it is common to train multiple models depending on the application, such as a detection model and one or more classification models, as shown in Figure 3-20. The detection and classification models are then deployed to the location(s) where inferencing is to occur. There are two broad classes of system architecture in this respect. In the first class, all the data is sent to a central location for the inferencing to take place. The advantage is that computing and storage resources can be amortized; the disadvantage is that transporting the required amount of data to be analyzed may strain network bandwidth. Conversely, in the second class of system architecture, the analysis is performed at or near the site at which the data is collected. The disadvantage is that compute resources may be more limited, but now only the results of inferencing need to be sent elsewhere, or in some cases, action can be taken locally. Hybrid systems are on a continuum between the two extremes cited.

Figure 3-20
An illustration of the Deep Learning flow for a street scene. It includes datasets for training, raw bayer video stream, sampling down, general network training machine, and more. Output is vehicle type classification and human detection.

Example of Deep Learning flow: street monitoring

In either system architecture, the task is inferencing – ingesting data from the real world and extracting actionable information. Starting in the lower left-hand corner, the image is ingested into the system for analysis. If connected directly to a sensor, such as an image sensor, then the raw image will be processed through an Image Signal Processor to convert the raw sensor data into a format the neural network can operate on. Conversely, the image may be coming from another device which has already converted the image to a standard format, compressed it, and encoded the video into a standard video format such as High Efficiency Video Coding (HEVC). Once the video data is ingested, as either sensor data or a video stream, it is usually resampled from the video resolution to the resolution required by the neural network model using well-known scaling, cropping, and color space conversion techniques. The video stream is then sent one frame at a time to the first neural network, where detection occurs. Typical families of neural network models used for detection are SSD (Single Shot Detection) and YOLO (You Only Look Once). Recall that detection networks operate on video frames; their performance is measured in frames per second (fps). The output of the detection network is a series of Regions of Interest (ROIs), each containing a bounding box identifying the portion of the image in which an object of interest is located. Some classes of detection networks may also perform an initial classification of the object into broad classes such as car, pedestrian, or bicycle. Classification networks, in contrast, operate on ROIs, not video frames, and are measured in inferences per second. The two metrics are often conflated, which will lead to serious errors in system specification and sizing if they are not properly applied.
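Because the two metrics are so easily conflated, a short back-of-the-envelope calculation helps when sizing a system. The sketch below (Python, with illustrative camera counts and object densities that are assumptions, not requirements) converts a camera count and analyzed frame rate into a detection load in frames per second and a classification load in inferences per second.

```python
# Illustrative sizing example; all input values are assumptions for this sketch.
cameras = 16
detection_fps_per_camera = 10          # frames analyzed per second per camera
avg_objects_per_frame = 3              # average ROIs produced by detection
classifiers_per_object = 2             # e.g., one vehicle-type and one attribute network

detection_load_fps = cameras * detection_fps_per_camera
classification_load_ips = (detection_load_fps
                           * avg_objects_per_frame
                           * classifiers_per_object)

print(f"detection load:      {detection_load_fps} frames/s")
print(f"classification load: {classification_load_ips} inferences/s")
# 160 frames/s of detection can easily become 960 inferences/s of classification,
# which is why the two metrics must not be interchanged when sizing hardware.
```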

Following the detection function, it is common that more specialized classification networks may then operate on the ROIs identified by the detection network. In the example shown in Figure 3-20, subsequent networks specialized for pedestrians and automobiles are used to gather more fine-grained information about the objects. The automotive neural network may provide information about the make and model of the automobile, the color, the license plate, etc. The pedestrian classification network may provide information about the location of the pedestrian in the scene, demographic data, etc.

A Pragmatic Approach to Deep Learning …Key Performance Indicators (KPIs)

The previous sections described a high-level overview of how a deep learning system works at a conceptual level in terms of the neural network model workflow and basic concepts. We will now turn to pragmatic considerations in implementing a deep learning-based approach for IMSS 4.0 systems. In architecting an IMSS 4.0 system, the critical concept is that the DL workflow has different KPIs and capabilities at different points in the system architecture.

The system architecture KPIs are driven by the relative number of units at each level in the system architecture, the power available, the location and size of the required data, and cost constraints. Table 3-3 gives typical system level KPIs for DL that are related to the system architecture structure. The actual values may vary for an application or use case. For purposes of taxonomy, the system architecture is partitioned into three general elements: Data center, Gateways, and Cameras/Edge devices. The boundaries between these elements are not rigid and are subject to adaptation depending on the specific industry.

Data centers refer to installations aggregating large compute resources in a controlled environment and may be either public (e.g., Cloud Service Providers) or private. Gateways are often located remotely, often in unsupervised locations with uncontrolled access. The primary purpose of a gateway is to access data from multiple sources, process some of the data locally and often store some of the data locally. If a data center is present, the gateway will often send processed data to the data center for further analysis. Multiple gateways will typically support a single processing element in a data center. Finally, a camera/edge device is responsible for sensing the data directly, performing any required signal processing to transform the data to a consumable format and forward the data to a gateway. It is not unusual that a single data center processor may be ingesting data originating from 100 to 5000 edge devices. Managing the data flow between the edge and the data center is a critical system architecture function.

The resources and capabilities of each of the elements vary substantially across the system architecture. The tasks allocated to the elements thus vary to reflect the differing capabilities and resources available. Using three elements as a starting point, it is possible to architect a variety of different Deep Learning based IMSS systems. The optimal system architecture for a given application will depend on the neural network(s) selected, the storage needed, the compute performance of the system components, cost and speed of networks, and program constraints (cost, schedule, pre-existing systems that must be supported). The relative values of these KPIs will determine the optimal partitioning of the workload among the three elements. The KPIs of each of these elements and the effect on system architecture will be investigated in some detail later in this chapter.

Table 3-3 Typical Deep Learning-based System KPIs

One Size Doesn’t Fit All…

The constraints and KPIs are different at each point in the system architecture; hence it is difficult for a single type of processor architecture to satisfy all the demands. This will be reinforced when we examine the performance of different CNNs on different processor architectures. Different processor architectures are advantaged for different CNNs. There is no one-size-fits-all solution or single best processor architecture for the entire DL workflow.

IMSS 4.0: From Data to Information

IMSS 4.0 introduced the critical concept of Machine Learning enabling the mass transformation of data to information. Previously, the transformation of data to information was gated by the ability of a human to view the data, assess it, make inferences, and act upon it. IMSS 4.0 systems are distinguished by the ability of machine learning to transform data into information. There are two immediate consequences. First, the system itself can make decisions within specified bounds, relieving humans of many rote tasks. Second, and as important, for those decisions reserved to humans, much of the data has been preprocessed to present only the most critical and relevant information for human consideration and judgment. The resulting IMSS 4.0 digital system architecture is described in Figure 3-21. The system consists of:

  • Capture: Video sensors, resolution up to HD now possible

  • Camera Connection: Ethernet, usually Power over Ethernet (POE)

  • Video data compression between nodes

  • Router: Network Link at Specified Points

    • Encryption and Decryption of Sensor Data between nodes

    • Compression and Decompression of Sensor Data

  • Storage – Rotating magnetic media, HDD, with multiple cameras per unit

    • Compressed and Encrypted sensor data

  • Display Functionality – Decrypt, Decompress and format to display one or more sensor data streams

  • Inference: Inferred metadata (analysis) – Performed using inferencing capability at multiple locations in the system architecture

  • Routine decisions, DL, made by machine learning based on inference

  • Critical decisions, DH, made by a human based on inferenced data preprocessed by AI

The IMSS 4.0 system architecture allows for considerably more flexibility in where functions such as storage and video analytics are placed. With the addition of inferencing to video analytics, the element formerly designated NVR now becomes more powerful, able to take on decision making tasks in real time, superseding its formerly passive role as a recording device. In recognition of this extended functionality, this element is promoted to a Video Analytics Node (VAN). The decisions, DL, and the corresponding responses, ActionL, will greatly simplify the overall system architecture constraints. The ability to act locally will substantially reduce both the network bandwidth to the operations center and the compute required at the operations center. In the overall IMSS 4.0 system architecture it is quite common to have a mixture of NVR and VAN elements in the overall system. The information sent from the VAN to the operations center can be any mix of metadata derived from inferencing and sensor data.

Figure 3-21
An illustration. It includes a sensor, private router, descriptive metadata, local data storage, and public router in the video analytics node. Operations center has a public router, inferred metadata, display, decision, and main data storage, both are connected by a public router which is connected to N V R or VAN 1 through N.

IMSS 4.0 system architecture

At the operations center, the data and metadata from multiple Video Analytics Nodes and NVRs may be combined. Like the VAN operation, a substantial portion of the aggregate data may be analyzed by Machine Learning algorithms, and decisions, DL, and responses, ActionL, taken without human intervention. The scope of these Machine Learning decisions must be carefully considered and bounded as part of the overall system architecture design, development, and validation. The inferencing may operate either on real time information streaming into the operations center, on stored data or some combination of the two sources.

Those decisions and actions reserved for humans, DH, and ActionH, can still benefit from the inferencing operations. The output of the inferencing operations may be presented to humans as part of the display element. Using inferencing in this manner will greatly reduce the cognitive load on human operators by presenting only the most relevant data for decisions requiring human judgment. Determining what information to display to the human operators, in what format, and the options permitted are a key step in the IMSS 4.0 system architecture. The tighter the constraints on real time response and accuracy, the more critical this often-overlooked design element becomes.

Information Rich, Target Rich…

Previously, it was highlighted that a drawback of IMSS 3.0 systems is that they were data rich, but information sparse. For adversaries to access significant information often means wading through mounds of data to extract the critical information. Hours, days, weeks, or months of video data amounting to Terabytes or more data would need to be exfiltrated, sifted through, and analyzed before useful information was available. The advent of machine learning changes this paradigm completely. The essence of machine learning is to distill the mass of data into easily accessible information, the information being substantially smaller in size. Consequently, in some sense, machine learning has created a target-rich environment for adversaries by concentrating the vast amount of data into a compact information representation.

Task Graph – Describing the Use Case/Workload – Overview

In the first section of the chapter, we discussed the evolution of IMSS systems from their analog starts to modern day systems based on digital architecture and artificial intelligence. In this section, we will detail the workflows found in the distinct phases of the IMSS system. The key concept is distinguishing the workflows performed, the sequence of tasks, from the underlying hardware architecture. A given workflow can be mapped to many different hardware architectures.

The workflows are commonly described as a graph, a series of tasks connected by arrows representing the transition from one task to another. The graph describing the workflow will be denoted as a task graph. Task graphs tend to be associated with specific locations or nodes in the system architecture. We will describe four distinct types of nodes in the system architecture, and the associated task graphs:

  • Sensors, when based on video often referred to as Internet Protocol Cameras (IPC)

  • Network Video Recorders (NVRs)

  • Video Analytic nodes

  • Video Analytic Accelerators, specialized devices that rely on a host

For purposes of clarity, the task graphs will be discussed in the context of an IMSS 4.0 type system. Once understood in this context, the reader will be able to easily extrapolate equivalent task graphs for prior generations by removing functions present only in IMSS 4.0 systems.

Sensors and Cameras – Sampling the Scene in Space, Time, and Spectra

In modern IMSS systems, the most common sensor is the video camera. By convention, a camera is a device that converts electromagnetic radiation into a two-dimensional array of pixels in a focal plane. The electromagnetic spectrum spans a considerable range as characterized by either wavelength or frequency (the two are related by the speed of light, c, such that every wavelength corresponds to a unique frequency; the wavelength is influenced by the medium the radiation is travelling through). The typical video camera senses visible light in the electromagnetic spectrum, in either intensity-only (black and white) or color modes. Video also implies that the scene is captured at some constant sampling rate.
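For reference, the parenthetical wavelength to frequency relationship can be written and evaluated for visible light; the constants and the 550 nm example below are standard physical values, not figures from this chapter.

$$ \nu = \frac{c}{\lambda},\qquad \lambda \approx 550\ \text{nm} \;\Rightarrow\; \nu = \frac{3.0\times 10^{8}\ \text{m/s}}{550\times 10^{-9}\ \text{m}} \approx 5.5\times 10^{14}\ \text{Hz} $$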

Figure 3-22
A work-flow of Internet Protocol Camera. It includes video sensor data, I S P, encryption, storage key, feature matching, encrypt the storage key, encrypt video data, O S D blend, video display, A I preprocesses, detection, classification, tracking, etc.

Task graph for Internet Protocol Camera (IPC)

Video sensors are characterized by a few key metrics described in Table 3-4.

Table 3-4 Selected Sensor Characteristics and Metrics

These basic sensor characteristics determine many of the fundamental system parameters as regards data bandwidth and processing. Selecting the sampling parameters determines what types of information are gathered about the scene, and hence what information is available for analysis. The data generated by a sensor can be approximated as

Equation 1 Video Data Generated in 1 Second

$$ {Data}_{1\ second}= Resolution\times Spectral\ Sampling\ Components\times Frame\ Rate $$

The implied data rates for common sensor formats and frame rates are shown in Table 3-5. Note the numerical values are approximate and reflect only the final image size. In practice, sensors will incorporate additional rows and columns for the purposes of removing noise and artifacts such as dark current. For a particular instantiation, refer to the sensor data sheet to understand the actual sensor readout specifications. The configurations given in Table 3-5 are not exhaustive and even within categories some variation exists. Note the two separate definitions for 4K resolution. The Data Rate from sensor is an estimate of the data traversing the path from the video sensor to the Image Signal Processor (ISP). At this point in the signal chain, the video data is raw and if intercepted, can easily be read and decoded by unauthorized actors.
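To put Equation 1 into numbers, the sketch below (Python; the 8-bit sample depth is added explicitly as an assumption, since Equation 1 counts samples rather than bits) estimates the raw single-component (Bayer) rate off the sensor and the three-component rate out of the ISP for a 1080p30 stream.

```python
# Illustrative data-rate estimate for a 1080p30 sensor; an 8-bit depth is assumed.
width, height = 1920, 1080
frame_rate = 30            # frames per second
bits_per_sample = 8

def data_rate_mbps(components: int) -> float:
    samples_per_second = width * height * components * frame_rate
    return samples_per_second * bits_per_sample / 1e6

raw_bayer_mbps = data_rate_mbps(components=1)   # one color sample per pixel
post_isp_mbps = data_rate_mbps(components=3)    # R, G, B per pixel after interpolation

print(f"raw sensor (Bayer): ~{raw_bayer_mbps:.0f} Mbps")   # ~498 Mbps
print(f"out of ISP (RGB):   ~{post_isp_mbps:.0f} Mbps")    # ~1,493 Mbps
```

Actual sensors add readout overhead (extra rows and columns), so the data sheet values will be somewhat higher, as noted above.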

Converting Sampled Data to Video

The spectral sampling metric describes how many spectral bands the sensor is trying to capture; for a typical video sensor, this is three bands – Red (R), Green(G), and Blue (B). Each pixel can capture only one band and hence information about the other two bands is lost. Part of the function of the ISP is to estimate the values of the missing bands, for example, if the Green bandpass is measured then the ISP will attempt to interpolate the values of Red and Blue at that pixel based on the values of Red and Blue for the surrounding pixels. The final column, Data Rate out of the ISP, estimates the data rate in Mbps after the color interpolation.

Table 3-5 Common Sensor Formats and Data Rates

A final observation in Table 3-5 concerns the precision, or number of bits used to represent a pixel value. A standard representation has been 8 bits, allowing for 256 distinct levels. The 8b representation has worked well for many sensor and display technologies over the history of video. However, with the advent of newer sensor and display technologies, it has become feasible to capture a wider dynamic range, referred to as High Dynamic Range (HDR) encoding. The trend is toward 10b to 12b encoding for HDR systems.

Referring again to Figure 3-22, once the sensor data has been processed to form a complete image, several options are available for further action. The simplest is to perform any data composition onto the image such as date, time, location, and camera ID. The Composition for Display and OSD blocks perform this function.

The video data can then be encoded to reduce the data volume using video compression techniques. Common video compressors are h.264 and h.265. Video compressors operate by removing redundant spatial and temporal information. The compression rate achieved depends on the original resolution, the complexity of the scene, and how fast things are changing in the scene. A low-resolution scene with no changes will compress much more than a high-resolution scene monitoring a busy highway with lots of change. Another often overlooked influence on the compression ratio is the amount of noise in the scene; an image captured at night will have much more noise than the same scene captured in daylight. Because compression relies on removing redundant information, it will not compress random or uncorrelated noise. A very rough heuristic for compression ratio based on initial resolution is given in Table 3-6. Depending on the codec used, settings, and content, the observed compression ratio can vary by up to a factor of three from the values shown. It is strongly recommended that architects obtain and/or simulate video streams and codecs relevant to their applications to estimate pragmatic compression ratios.

Table 3-6 Heuristics for Video Compression Ratio
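As a hedged example of applying such a heuristic, the snippet below estimates the compressed bitrate of the 1080p30 stream computed earlier. The 200:1 ratio used here is purely illustrative; consult Table 3-6 and representative streams for actual values.

```python
# Purely illustrative: compression ratios vary widely with content, noise, and codec.
post_isp_mbps = 1493              # uncompressed RGB rate for 1080p30 (from Equation 1)
assumed_compression_ratio = 200   # hypothetical ratio; see Table 3-6 for heuristics

compressed_mbps = post_isp_mbps / assumed_compression_ratio
print(f"estimated compressed bitrate: ~{compressed_mbps:.1f} Mbps per stream")
# ~7.5 Mbps; a noisy night-time scene or a busy highway could be several times higher.
```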

Transporting Data – Getting Safely from Point A to Point B

Once encoded, the video data should be encrypted to preserve privacy and confidentiality of the data. The encryption data rate is equal to the bit rate after compression as shown in Table 3-6. Finally, the data is prepared for transmission from the IPC to the world via a Wide Area Network (WAN) such as Ethernet or cellular data. The video compression step is critical to minimize bandwidth load on the WAN. It is not uncommon for the WAN to support hundreds to thousands of cameras; hence the aggregate bandwidth can accumulate rapidly.
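A minimal sketch of encrypting one compressed chunk before transmission, using an authenticated cipher, is shown below (Python with the third-party cryptography package; the chunk contents, camera ID, and key handling are placeholders, not a production key-management scheme).

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Placeholder for one compressed video chunk (e.g., an HEVC segment); bytes are illustrative.
encoded_chunk = b"\x00\x00\x00\x01..."

session_key = AESGCM.generate_key(bit_length=256)   # per-session, ephemeral key
aesgcm = AESGCM(session_key)
nonce = os.urandom(12)                              # must be unique per encryption
camera_id = b"IPC-0042"                             # bound to the ciphertext as associated data

ciphertext = aesgcm.encrypt(nonce, encoded_chunk, camera_id)

# The receiver (gateway, NVR, or VAN) decrypts with the same session key and nonce.
recovered = aesgcm.decrypt(nonce, ciphertext, camera_id)
assert recovered == encoded_chunk
```

Note that compression is applied before encryption; encrypted data looks random and would not compress afterward.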

The preceding system functions describe a traditional IPC such as might be used in an IMSS 3.0 system. An IMSS 4.0 system would include one or more of the AI-inferencing steps in Figure 3-22 to analyze the data:

  • AI preprocessing/Color Space Conversion (CSC)/Crop

  • Detection on selected frames

  • Tracking of objects in selected Frames

  • One or more classification operations

The advantage of performing these operations at the IPC stage is twofold: 1) to reduce the amount of data sent over the WAN to a few KB/s and 2) to provide greater security for the data. No data need ever leave the device, and the attack surface is much smaller. The disadvantage is fitting the analytic operations into the power and computational budget of the IPC device. The analytic blocks will be described in more detail in the next section.

NVR/Video Analytic Nodes – Making Sense of The World

The second major component of an IMSS system is the aggregation point for multiple video streams. Depending on the functionality, the aggregation point can be either a Network Video recorder (NVR) or a Video Analytics Node. The task graph for an NVR/Video Analytics node is shown in Figure 3-23.

Figure 3-23
A work flow of N V R or Video Analytics node. It includes encrypted video stream, I S P, decryption and encryption, storage key, feature matching, encrypt video data, video display, A I preprocesses, detection, classification, tracking, etc.

Task graph for NVR/Video Analytics node

Storing Data – Data at Rest

The initial elements of the task graph are common to the NVR and the Video Analytics Node. Multiple video streams are ingested from either a Local Area Network (LAN) or a WAN. The streams from the IPC are assumed to be both encrypted and encoded for video compression. The encryption key used for transmission over the WAN/LAN is typically session dependent and hence ephemeral. The first step is to decrypt the video streams using the session key and then re-encrypt them using a permanent key for storage. Whether the system is assigned a single tenant or multiple tenants determines whether a single storage key or multiple keys are used. Typically, all ingested video streams are stored. The storage period may range from a few hours up to 30 days (or even months, depending upon the user requirements, often guided by law enforcement authorities), or in some cases storage may be permanent. The number of video streams, the bit rate per stream, and the retention period provide an estimate of the total storage required.

Equation 2 Storage

$$ {Storage}_{GB}= \frac{Number\ of\ Video\ Streams\times Bit\ Rate\ (Mbps)\times Retention\ Period\ (seconds)}{8\times 1000} $$

Recall the estimates made earlier in Table 3-6 for bit rates per stream at selected camera resolutions. Based on these values, Table 3-7 shows a range of common storage requirements. Again, the actual storage requirements will depend on the details of the system configuration; however, these are representative of many configurations. It is apparent that the storage requirements can vary over several orders of magnitude. For systems with mixed cameras, a first approximation of the total storage requirement can be obtained by summing the contributions of the individual camera types.

Table 3-7 Video Storage Heuristics
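As a worked example of Equation 2 for a mixed camera fleet, the sketch below (Python; the camera counts, bitrates, and retention period are assumptions for illustration, not recommendations) sums the per-type contributions.

```python
# Illustrative mixed fleet; the bitrates and counts below are assumptions.
fleet = [
    {"name": "1080p streams", "count": 100, "bitrate_mbps": 4.0},
    {"name": "4K streams",    "count": 20,  "bitrate_mbps": 16.0},
]
retention_days = 30
retention_seconds = retention_days * 24 * 3600

total_gb = 0.0
for cam in fleet:
    # Equation 2: streams x Mbps x seconds, converted from Mbit to GB (divide by 8 x 1000).
    gb = cam["count"] * cam["bitrate_mbps"] * retention_seconds / (8 * 1000)
    total_gb += gb
    print(f"{cam['name']}: {gb:,.0f} GB")

print(f"total: {total_gb / 1000:,.1f} TB for {retention_days} days of retention")
# Roughly 233 TB in this example, illustrating how quickly retention drives storage.
```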

Table 3-7 also provides an estimate of the decryption and encryption rates required; note that this is the required rate for each function. Again, there is a large variation in the required rates. Implementations may vary from an algorithm run on a general-purpose CPU at the lower end to requiring dedicated hardware accelerators at the higher end.

Converting Reconstructed Data to Information – Inferencing and Classification

The key distinction between NVRs and Video Analytics Nodes is at the stage of converting the video data into information. There is not a bright line between the two, and an optimal system design will blend and balance the two methods in a complementary fashion. Recall from the earlier discussion that IMSS 3.0 and earlier systems rely on a human being to interpret the data. The distinguishing feature of an IMSS 4.0 system is that machine learning is added to support and enhance the interpretation by humans.

Human Consumption – Display

An NVR relies on human interpretation of video streams for converting data to information. This method relies on the training and skill of the operator. The video data is typically presented to the human operator in the form of one or more displays on a monitor. Referring to Figure 3-23, the video streams may be either “real-time” from the ingested video streams or accessing stored video streams or a combination of the two. In either case, the first step is to decrypt the data with the appropriate key, and then to decode the data from the compressed format to a raw video stream suitable for display. This is the inverse of the process in Tables 3-5 and 3-6, described by Equations 1 and 2. Depending on the number of video streams, the original video stream resolution(s), and frame rate, there is a potential for exceptionally large data flows to be created. The video streams will typically need to be scaled to fit multiple streams on one or more monitors. Once the video streams are composed for display, the operator can then observe and interpret the video streams.

Recalling our earlier discussion of IMSS 2.0 systems, the operator approach still suffered from the drawbacks pointed out then.

There was no real time response – unless the system was monitored by a human observer. The analytics functions were still strongly dependent on the human operator, thus leading to inconsistencies in analysis. The system was still primarily retrospective.

Despite these drawbacks, there are still compelling arguments for retaining human operators as the final arbiters and decision makers.

Machine Consumption – Algorithms, Databases, and Metadata

Video Analytics Nodes differ from NVRs in adding ML based on AI techniques to analyze and winnow the data, highlighting the important from the mundane. Again referring to Figure 3-23, the analytics path can operate on either real-time or stored video streams. The selection of video streams for analysis may range from a subset of the video streams analyzed periodically to the entire suite of video streams. Like the display function, it is often necessary to scale and/or crop the video streams to match the input size requirements of a particular neural network. Input sizes may range from 224x224 pixels up to a full HD stream of 1920x1080p.

A selected set of Neural Networks is shown in Table 3-8 for common networks used for video analytics as of this writing. The network input size, compute requirements in GFLOPs (10^9 operations per frame), and millions of parameters (MParams, 10^6 parameters per model) are shown. The reader should note that Neural Network models are rapidly evolving. It is not unusual for a network to go from discovery in an academic or industrial research setting to deployment in a matter of several months. As of this writing, hundreds of neural network models are in use, ranging from public, general-purpose models to models optimized for very specific tasks. There is a trade-off between model complexity (MParams), compute (GFLOPs), and accuracy, as illustrated in the following.

Table 3-8 Selected Neural Network Models and Metrics as of the Time of Publication

Referring to Figure 3-23, the first step is to scale and crop the incoming video frame to match the input size of the detection network. In selecting a detection network, the general heuristics are that accuracy improves with larger input sizes, more compute per video frame, and more parameters; conversely, throughput decreases while latency, power, and cost increase. Proper selection of the detection network requires balancing these factors against the application needs.

Recalling Figure 3-20, the output of the detection Neural Network model is a set of bounding boxes, or Regions of Interest (ROIs), identifying the location of objects in the video frame and perhaps a first-order identification such as car or person. There may be one, many, or no objects detected in a particular frame. In the example shown in Figure 3-20, the two classes of interest are cars and people. To gather more information, the ROIs are passed to specialized Neural Networks optimized for cars and people, respectively. The input sizes of the classification networks are smaller than those of detection networks because the objects have already been isolated from the full video frame. The objects detected by the detection network will have an arbitrary size depending on where they are in the video frame, their distance, the effective focal length of the camera lens, and the resolution of the image sensor. This necessitates a second scaling operation between the output of the detection network and the input of the classification network.
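That second scaling operation can be sketched as follows (Python with NumPy and OpenCV's cv2.resize; the frame contents, ROI coordinates, and 224x224 classifier input size are illustrative assumptions).

```python
import numpy as np
import cv2

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)    # decoded video frame (H x W x C)

# Example ROI from a detection network: (x, y, width, height) in frame coordinates.
x, y, w, h = 600, 400, 180, 260

roi = frame[y:y + h, x:x + w]                         # crop the detected object
classifier_input = cv2.resize(roi, (224, 224))        # scale to the classifier's input size

print(roi.shape, "->", classifier_input.shape)        # (260, 180, 3) -> (224, 224, 3)
```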

As with the detection network, selection of the classification network requires balancing input size, compute, and the number of parameters against throughput, latency, power, and cost. Estimating the impact of a given network on accuracy can be performed independently of the system architecture; however, throughput, latency, cost, and power are strong functions of the underlying hardware and software selections comprising the system architecture.

The output of the classification neural networks will be a feature vector, typically 128 to 512 bytes in length. Each feature vector corresponds to a set of descriptors or attributes of the object. The feature vector is referred to as metadata. The feature vector and the ROI are returned as the result, giving both the object identification and the location of the object in the frame. With this information, it is possible to combine the results with the video data, perhaps drawing a bounding box around the object using the ROI information and adding color coding or text annotation to the video frame (OSD Blend Video + AI Results block).
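Feature vectors become useful metadata when they can be compared, for example in the feature-matching step shown in the task graphs. A common approach is cosine similarity between vectors, as in the minimal sketch below (Python/NumPy; the 128-element vectors, the synthetic gallery, and the 0.8 matching threshold are illustrative assumptions).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [-1, 1]; values near 1 suggest the same object or identity."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
query = rng.normal(size=128)                  # feature vector of a newly observed object
gallery = {                                   # previously stored feature vectors
    "object_A": query + 0.05 * rng.normal(size=128),   # nearly the same object
    "object_B": rng.normal(size=128),                   # an unrelated object
}

THRESHOLD = 0.8                               # assumed decision threshold
for name, vec in gallery.items():
    score = cosine_similarity(query, vec)
    print(f"{name}: similarity {score:.2f} -> {'match' if score > THRESHOLD else 'no match'}")
```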

From this point, the data flow is like that described earlier for the NVR, except now it is feasible to construct a database of objects’ identity, location, and times. The data can be used to flag events of interest to a human operator, relieving the operator of the tedium of monitoring routine events. In addition, the database can be queried to identify trends over time that would not otherwise be apparent. The guiding principle is that routine decisions, DR, can be made by machines and critical decisions, DC, can be retained for human operators.

Video Analytic Accelerators – Optimized Analytics

The final major component of an IMSS system is the introduction of specialized accelerators for the required processing. General-purpose systems work well when the number of video streams is modest and/or the compute per stream is modest. However, as Table 3-8 indicates, the compute load per Neural Network can be quite substantial.

The overall data flow is similar to that of the NVR/Video Analytics node, with a few crucial differences. Figure 3-24 describes the Video Analytics Accelerator data flow for High Density Deep Learning (HDDL) segments. The primary difference is that the accelerator is a specialized device optimized for the intense compute loads and memory bandwidths demanded by advanced Neural Network applications. The dotted line demarcates the physical and logical boundary between the host system and the accelerator. In the example shown, the interface is a PCIe interface, common across a wide variety of computer systems.

Figure 3-24
A work flow of Video Analytics accelerator. It includes encrypted video streams, decryption and encryption, feature matching, encrypt the storage key, encrypt video data, A I preprocesses, detection, classification, tracking, etc.

Video analytics accelerator

The introduction of the PCIe interface potentially exposes data transfers across the interface to interception and modification by adversaries. For this reason, it is necessary to ensure the data is encrypted and decrypted as part of traversing the PCIe (or equivalent) interconnect. Whether a single tenant or multiple tenants access the accelerator determines whether a single key is sufficient or a multi-key schema is required. The second modification relates to minimizing the bandwidth traversing the PCIe interface. Referring to Tables 3-5 and 3-6 regarding compressed vs. uncompressed video, it is clearly advantageous in all but the most modest applications to compress the video before sending it across the PCIe link. This will not only conserve bandwidth for other system activities but will also notably reduce power consumption.

Within the accelerator, it is critical to ensure that there are features and capabilities to ensure that both the data and the Neural Network model are protected. The trained Neural Network model embodies substantial Intellectual Property value, in some cases representing much of a company’s valuation.

Like the rapid evolution of Neural Networks themselves, the Neural Network Accelerators are rapidly evolving to service the computational, power, and cost requirements of applications. For inferencing tasks, the performance of an accelerator is up to 10x that of a general compute platform in terms of both absolute performance (FPS) and cost effectiveness (FPS/$).

Conclusions and Summary

At the beginning of this chapter, we set out to cover critical aspects of IMSS systems as a foundational framework for addressing security in IMSS systems. At that time, the goal was to address the following key topics:

  • Key considerations and elements in architecting a data pipeline

  • Basic tasks of an E2E IMSS pipeline – Capture, Storage, and Display

  • Evolution of IMSS Systems – Analog to Digital to Connected to Intelligent

  • Sensing the World – Video

  • Making Sense of the World – Algorithms, Neural Networks, and Metadata

  • Architecting IMSS Systems – IP Cameras, Network Video Recorders (NVRs), and Accelerators

In this chapter, we have discussed the basic IMSS systems and data paths. The key considerations regarding decision-making, accuracy, throughput, and latency were introduced. From these concepts, the fundamentals of capture, storage, and display were developed and related to how decisions are made, and action taken. A key concept was how data is transformed to information and information to action. The security risks at each of these stages were delineated.

The next stage was to apply these concepts to review the evolution of IMSS systems from the earliest analog systems to modern AI-based systems, showing how the basic concepts have evolved over that progression. At each stage, the security risks particular to that stage of evolution, from IMSS 1.0 to IMSS 4.0 architectures, were brought forward. Key architectural and feature changes were described, along with their impact on the overall system capabilities.

Key to recent advancements is the introduction of Machine Learning and AI in the form of Neural Networks. These advancements enable new and valuable features but also introduce potential vulnerabilities if not properly addressed.

Finally, basic elements of the IMSS system in terms of IP cameras, Network Video recorders/Video Analytics Nodes and Accelerators were described.

In the subsequent chapters, we will use this foundational understanding and framework to further explore the strengths, vulnerabilities, and strategies for robust implementation of security in the AI world. We will examine some representative systems.