Putting It All Together – What Does It Take to Build a System?

Goal: Building on the foundations in system architecture and security practices, create pragmatic end-to-end (E2E) IMSS systems. The key concepts are:

  • Define Cloud to Edge system requirements and expectations

  • Define key system metrics – accuracy, throughput, latency, power

  • Define key system blocks and architecture, concept of resource graph

  • First order system analysis and derived system attributes

  • Identify key security vulnerabilities and mitigations

This chapter will examine three different IMSS system applications, starting with the least sophisticated and progressing to more sophisticated systems.

IMSS Workstation (NVR Light, IMSS 3.0) – 4 dumb cameras and storage only; < 8 video streams of data; alerts based on light analytics in the host; action taken by humans reviewing a display. Solution: Intel® Core™ x86, TGL-class device + 3 screens; forward videos to the corporate office under operator control

IMSS Enterprise (Video Analytics, IMSS 4.0) – 32 smart cameras; each performs detection and forwards ROI and metadata to a video analytics node; the video analytics node performs classification and feature matching. Solution: Xeon™ edge server with AI-enhanced instruction set as the accelerator class

IMSS Smart City (Video Accelerator, IMSS 4.0) – 1000 smart cameras; object detection and attribute classification for multiple object classes; real-time and historical correlations. Solution: Xeon™ rack server, 1000 streams in 1000 watts, PCIe card-based AI class accelerator, edge processing for low-latency response as needed

Resource Graph

Previously, we introduced the concept of the task graph: a sequence of tasks connected to accomplish an overall result. A task graph is an abstract object; executing it requires mapping the tasks to an infrastructure composed of components such as hardware, software, and firmware. A single task graph can be mapped to multiple resource graphs. Figure 6-1 shows the example application we will use for this section.

Figure 6-1
A flow diagram. Encrypted video streams ingested in the NVR or AI box are decoded and classified, resulting in encrypted video or metadata forwarding, a monitor display of OSD-blended video plus AI results, and closest feature vector matching.

Example task graph for IMSS 4.0 system

The components of a resource graph can be characterized in terms of key building blocks, connections, and metrics. Figure 6-2 shows an example of a resource graph for an NVR/VAA Node. A resource graph is historically shown in the format of a block diagram, which is the notation we will use here. The blocks correspond to the nodes of a graph and the interconnects correspond to the connections between graph nodes. In this example, the functions enclosed in the box are assumed to be a single physical device, connected to other physical devices. Common interconnects between physical devices are listed, each of which has an associated bandwidth, latency, and power per bit transmitted. Each interconnect will also have an associated Bit Error Rate (BER). Internal to the physical device there will be one or more interconnect techniques. Commonly a Network on Chip (NoC) will interconnect internal components. Additionally, dedicated interconnects are often used for low-speed devices, shown here as the Peripheral subsystem (PSS) and High-Speed IO Subsystems (HSIO SS).

Figure 6-2
A block diagram of VA or AI box use. AI, media, display, CPU, and memory interact with components including peripherals, camera, monitor, x64b LPDDR memory, and storage.

Resource graph WORKSTATION example

A key difference between external and internal interconnects is accessibility. External interconnects can be more easily accessed and intercepted than internal interconnects. This does not mean internal interconnects are immune to disruption and probing, but attacking them requires more sophisticated techniques and higher levels of skill.

Some properties of the resource graph are independent of the task graph or provide upper bounds. Table 6-1 shows a very simplified set of interconnects, restricted to bandwidth only. Even so restricted, multiple options are available spanning a wide range of values. In addition to the raw bandwidth, the different interconnects often have protocols optimized for a specific task. A complete discussion of the detailed impact of interconnects on system performance is beyond the scope of this work. For the remainder of this discussion, we will assume that the performance is dominated by the execution of the tasks on the major resource blocks rather than the interconnects.

Table 6-1 Selected Interconnects and Properties

The major resource blocks are shown in Figure 6-2. It is feasible in many instances to assign an element from the task graph to one or more elements of the resource graph. As an example, the media decode task could be mapped either to a CPU or to a media decoder HW block. However, the performance and other metrics will not be the same between the two choices.

In this section, we will use the task graph in Figure 6-1 as a reference and map it to different resource graphs as the requirements become more stringent and the system scales in both the number of streams and the analytics requirements. The primary rationale for this discussion is to highlight the security aspects of IMSS systems, hence we will look at a simplified set of parameters. Table 6-2 contains the default parameters used in the remainder of the analysis. It is based on a 1080p30 video stream and a representative estimate for the compute required for video analytics during the classification and detection phases discussed in Chapter 3. The compute for the analytics is given in GigaOperations per second (GOPs), where 1 GOP = 1 billion (1 × 10⁹) operations. The storage assumption here is that each storage device supports 8 TB (8 × 10¹² bytes). The general principles used in the analyses can be extended to other situations by modifying these parameters.

Table 6-2 Default Parameters for Purposes of Analysis
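As a rough sketch of how the defaults combine into system-level figures (the ~5 Mbps compressed bitrate for a 1080p30 stream is our assumption, not a value from Table 6-2; substitute your own measured bitrates):

```python
# Derive aggregate bandwidth and retention storage from per-stream defaults.
# The 5 Mbps compressed bitrate for 1080p30 is an assumed value; substitute
# measured bitrates for your codec and scene content.
BITRATE_BPS = 5e6          # ~5 Mbps per compressed 1080p30 stream (assumption)
SECONDS_PER_DAY = 86_400

def aggregate_bandwidth_MBps(num_streams: int, bitrate_bps: float = BITRATE_BPS) -> float:
    """Total ingest bandwidth in MB/s for num_streams compressed streams."""
    return num_streams * bitrate_bps / 8 / 1e6

def retention_storage_TB(num_streams: int, days: int,
                         bitrate_bps: float = BITRATE_BPS) -> float:
    """Storage (TB, decimal) to retain num_streams for the given days."""
    bytes_total = num_streams * (bitrate_bps / 8) * days * SECONDS_PER_DAY
    return bytes_total / 1e12

# 100 streams retained for 30 days lands near the 160 TB figure used later
# for the Enterprise scenario.
print(aggregate_bandwidth_MBps(100))      # 62.5 MB/s
print(retention_storage_TB(100, 30))      # ~162 TB
```
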

Using these parameters, we will examine three different scenarios as described in Table 6-3. The applications shown are representative and are not meant to be exhaustive. These correspond roughly to IMSS 4.0 systems for small businesses (SMB), an enterprise- or critical-infrastructure-scale system (Enterprise), and a metropolitan deployment (Smart City). These scenarios correspond to systems based on workstations, edge servers, and data centers, respectively. The architecture applied to each summarizes the types of resources that will be used in constructing a solution. The scenarios are distinguished primarily by the number of video streams being processed and the assumption that all video streams are being analyzed. In practice, only a subset of the video streams may be analyzed, in which case the subsequent values for video bandwidth, storage, and analytics compute would be adjusted. A further assumption is that the object tracking function in Figure 6-1 is invoked. This block has a parameter where M of N frames use an object tracking algorithm rather than performing classification and detection on every frame. This is critical in practical systems for balancing the memory and compute requirements against the required accuracy. In the compute values for Table 6-3, we assumed an M of N frames value of 1 of 2, that is, object tracking is performed on every other frame. This results in a reduction in classification and detection compute requirements by a factor of 2 relative to performing these operations on every frame. Note the values for Compute are given in TOPs = Trillions of Operations per second (1 × 10¹² operations per second).

Table 6-3 Scenarios for IMSS 4.0 Systems Examination

Also note that TOPs is a very crude estimate of the computational requirements and is also very architecture-dependent. An accelerator optimized for video analytics tasks can execute a given neural network model with many fewer operations than a compute unit optimized for another task, say, graphics. The difference can be up to 3x fewer TOPs required by the accelerator compared to the non-specialized units. In practice, always use the performance on the actual workloads when sizing systems and comparing options.

Crawl – Starting Small – Workstation: An SMB IMSS System

We will begin our examination of the SMB IMSS system by explicitly reprising the calculations used to arrive at the values in Table 6-3. These can be used as a guide for adapting the results to your own needs.

Table 6-4 Calculations for SMB IMSS 4.0 System
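The style of calculation in Table 6-4 can be sketched as follows. The bitrate and the per-inference GOPs values here are placeholders we chose for illustration, not figures from the table; use values measured for your actual networks and cameras:

```python
# Reprise of the Table 6-4 style calculations for a small SMB system.
# The bitrate and per-inference GOPs values are placeholders for
# illustration; substitute figures for your own networks and streams.
FPS = 30
DET_GOPS = 5.0    # GOPs per detection inference (assumed)
CLS_GOPS = 1.0    # GOPs per classification inference (assumed)

def smb_storage_TB(streams: int, days: int, bitrate_bps: float = 5e6) -> float:
    """Retention storage in TB for the given stream count and days."""
    return streams * (bitrate_bps / 8) * days * 86_400 / 1e12

def smb_compute_TOPS(streams: int, analyzed_fraction: float = 0.5) -> float:
    """Analytics compute with M-of-N tracking: only analyzed_fraction of
    frames run detection+classification; tracked frames are ~free here."""
    gops_per_stream = (DET_GOPS + CLS_GOPS) * FPS * analyzed_fraction
    return streams * gops_per_stream / 1000  # GOPs/s -> TOPs

print(smb_storage_TB(4, 30))    # ~6.5 TB: one 8 TB HDD suffices
print(smb_compute_TOPS(4))      # 0.36 TOPs under these assumptions
```
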

The estimate given for the Classification compute implicitly assumes there is one object per frame to be identified. In practice, the number of objects per frame may vary depending on the application, and the estimate will need to be adjusted. An example is monitoring the entrance to a business, where, depending on the time of day, there may be zero, one, or multiple persons in the camera view. In this case, one can either size the system for the worst case (the highest number of people expected at once) or allow some latency in the system, storing the items to be classified and processing the excess detections at a chosen rate, such as one per frame. The architectural trade-off between provisioning for worst-case compute and accepting latency is one we will encounter in all the scenarios examined.
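The worst-case vs. latency trade-off can be seen in a toy backlog model, where a variable number of detections per frame is classified at a fixed budget of one per frame and the excess is queued (the arrival pattern is invented purely for illustration):

```python
# Toy model of deferred classification: detections arrive in bursts but
# are classified at a fixed budget per frame; extras are queued.
# The arrival pattern below is invented purely for illustration.
def simulate_backlog(arrivals_per_frame, budget_per_frame=1):
    """Return the peak number of queued (deferred) classifications."""
    backlog, max_backlog = 0, 0
    for arrivals in arrivals_per_frame:
        backlog = max(0, backlog + arrivals - budget_per_frame)
        max_backlog = max(max_backlog, backlog)
    return max_backlog

# A burst of 3 people in view for 5 frames, then an empty lobby.
arrivals = [3, 3, 3, 3, 3] + [0] * 20
print(simulate_backlog(arrivals))  # peak backlog of 10 queued detections
```

Provisioning for the worst case (a budget of 3 per frame) would keep the backlog at zero; a budget of 1 trades roughly a third of a second of added latency at 30 fps for a 3x smaller compute provision.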

Figure 6-3 shows the critical external connections of an SMB system block diagram (resource graph) built using the device shown in Figure 6-2. Refer to Figure 6-1 for the task graph described here. The compressed video streams from the external camera(s) are ingested through a Gigabit Ethernet (GbE) port via an Ethernet switch. The first step is to decrypt the video streams with a session key using either a dedicated hardware engine or software decryption on the CPU. The choice will depend on the number of streams and the encryption protocols; however, it is preferred to use the HW encryption engines when available to reduce host compute loads. It is then recommended to re-encrypt the video streams using a persistent key for storage. Once ingested into the x86® system, the video streams may be stored on one or more SSD or HDD devices via PCIe or SATA, respectively. For the example shown, 8 TB of storage is required for 30 days of retention, which can be satisfied by a single HDD unit or multiple SSDs. Additional streams and/or longer retention periods will require additional storage units and usage of additional PCIe/SATA lanes from the host.

Many applications will require local display of the video data on one or more monitors as shown at the top of Figure 6-3. The GPU can be used to decode the video streams desired for display and composite the video streams on the desired monitor(s) alongside operator control and system status information. Core™ x86® systems supporting up to three displays or more are commonly available. Connection to the displays may be via DisplayPort, HDMI, or Type C USB/ThunderBolt® connections.

An SMB system may run stand-alone without any external Ethernet connections (i.e., it may be “air gapped” from the public Internet). Optionally, an external connection, shown as Ethernet2 in Figure 6-3, to an enterprise or public cloud may be added to leverage cloud storage, processing, or to enable viewing from devices via the public Internet. We’ll discuss the security implications of this option later in this section.

The video analytics requirements in Table 6-4 can be approached in several ways depending on the specific x86® system chosen and the details of the detection and classification networks chosen. The simplest approach is to assign the video analytics to the CPU complex. The CPU will often have the broadest coverage in terms of models supported and developer familiarity. In these cases, the decoded video streams will be sent from the GPU to the CPU for processing. If the CPU is not able to perform all the video analytics, then the next candidate will be the GPU. The decoded video is already present in the GPU subsystem and memory; however, the GPU analytics tasks may also be competing with the display functions for compute and memory resources.

Figure 6-3
An illustration of the SMB system. Cameras 1 to N connect via Ethernet and USB to the x86 Core system, which connects to 1 to N SSDs or HDDs, an analytics accelerator, and, via the cloud, to 1 to N workstations and data centers.

SMB system block diagram

For a moderate number of video streams and moderate levels of video analytics, the CPU and/or GPU will often suffice. It should be noted that modern Core™ x86 systems can decode up to 32 video streams, depending on the video stream resolution and scene complexity. This will substantially exceed the video analytics capabilities of the base Core™ x86 systems if all or even a substantial fraction of the streams require analysis. There are two fundamental approaches in this situation. The first is to forward the compressed, unanalyzed streams to a remote processing capability in the cloud via the second Ethernet port connected to a WAN. While relieving the immediate local pressure, this approach does not scale well when the total system involves large numbers of cameras, as we will see in subsequent discussions. The second approach is to add an external analytics accelerator via either an M.2 or PCIe form factor card. These accelerators are much more efficient at performing the analytics task both in terms of performance per dollar and performance per watt. The accelerators are available in a range of capabilities from as low as 5 TOPs to over 100 TOPs. These accelerators have the added advantage of freeing up the CPU/GPU complex for other application tasks. When using an accelerator for video analytics, it is strongly recommended to ensure a video codec is incorporated in the accelerator. Uncompressed video has a memory and BW footprint anywhere from 25x to 150x larger than compressed video and can quickly strain interconnect bandwidth, memory capacity, and even system power as the number of streams and the analytics complexity grow.
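The 25x to 150x expansion figure can be checked with a quick calculation, assuming NV12 (4:2:0, 1.5 bytes per pixel) for the uncompressed format and two representative compressed bitrates of our choosing:

```python
# Uncompressed vs. compressed bandwidth for a 1080p30 stream.
# NV12 (4:2:0, 1.5 bytes/pixel) is assumed for the uncompressed format;
# the two compressed bitrates are representative choices, not fixed values.
W, H, FPS = 1920, 1080, 30
uncompressed_Bps = W * H * 1.5 * FPS            # ~93.3 MB/s

for label, mbps in [("high-quality", 30), ("typical surveillance", 5)]:
    compressed_Bps = mbps * 1e6 / 8
    print(f"{label}: {uncompressed_Bps / compressed_Bps:.0f}x expansion")
# ~25x at 30 Mbps, ~150x at 5 Mbps -- spanning the 25x-150x range cited.
```
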

From a security viewpoint, all external connections must be addressed with multiple levels of protection. Data and models in transit and at rest must be encrypted and authenticated for integrity using the techniques described in previous chapters.

SMB System Assets and Threats

To provide the right level of security for this simple SMB system, let’s break it down to the basics: Assets and Threats. Then we can design the right level of security.

The primary assets in this system are the video streams, the detection and classification processing being performed on those streams, and the results of that processing. As the streams are generated, processed, and stored, they may be exposed to various threats. The first threat to streams is in the network connecting the video cameras and the core system, which functions as a smart digital video recorder (DVR). The second threat is when the streams are stored on the SSD/HDD volumes. The third is when the streams are displayed, and the fourth threat is when they are sent on via the Wide Area Network (WAN) to a video storage server.

The analytics applications performing the detection and classification are loaded into the system by a system administrator with the rights to add or update applications on the system. These may be provided by the OEM that manufactured the system, by a third party, or by the system operator themselves. The applications can be copied, cloned, reverse engineered, or tampered with when they are sent to the system operator, while in storage on the device, or while running on the system.

And finally, the results of the analytics processing can be stored in a file structure separate from the stream storage, inserted into the stream data channel, or used to generate graphics that are composited with the video stream and re-encoded for storage and transmission over the WAN. This data may be more privacy-sensitive and therefore may have a higher security requirement than the video stream itself.

Note the assumption that all the assets are protected inside the system. Let's discuss those threats as well to understand what is being relied on inside the system to protect the assets. Whenever instructions or data are decrypted (in storage, in DRAM, or in the various caches and buffers used during internal processing), they must be protected. In this environment, the threats may come from anything that is assumed to be trusted inside the system: other user mode applications, middleware, and drivers; the system software (drivers, operating system, hypervisor); and the infrastructure that controls access to the assets, including the users and administrators themselves. Also assumed is that the system administrators have a maintenance process that monitors for vulnerability disclosures and updates from the systems' software and firmware suppliers, and installs those updates promptly.

Using Information Security Techniques to Address These Threats to an SMB IMSS System

Booting a system with a secure, cryptographically authenticated root of trust is the foundation of system security. Extending that chain of trust through the BIOS, FW, and OS ensures that the entire SW bill of materials, up to the user-loaded applications, has not been tampered with.

The video streams from cameras attached with USB-A, USB-B, or USB-C cannot be secured. There is no standard native protocol that provides encryption. The only solution would be a proprietary one that encrypts the packet payloads in the camera and decrypts them in the system. For all the Ethernet interfaces, the standard ONVIF protocol allows link encryption to be applied to keep the video confidential in transit (using the Secure Real-Time Streaming Protocol). S-RTSP also allows for stream hashing to detect any errors in the stream (particularly due to tampering). This is initiated with a secure key exchange between the camera and the DVR. As mentioned previously, the video security is terminated (decrypted) in the network layer, and the video must be re-encrypted to store it on the SSD/HDD volume. Disk encryption protects the confidentiality and privacy of the video stream in storage against removal of the storage volume. Depending on how the disk encryption key is stored and its use is controlled, it may also protect against some software attacks.
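The stream-hashing idea can be illustrated with a simplified stand-in. This is not the actual SRTP construction (which specifies its own key derivation and packet format); it only shows the tamper-detection principle of an authentication tag keyed by the session key from the camera/DVR exchange:

```python
import hmac, hashlib, os

# Simplified stand-in for per-packet stream authentication: an HMAC tag
# is computed over each payload with the session key negotiated between
# camera and DVR. Real SRTP defines its own key derivation and packet
# format; this only illustrates the tamper-detection principle.
session_key = os.urandom(32)   # would come from the secure key exchange

def tag(payload: bytes) -> bytes:
    return hmac.new(session_key, payload, hashlib.sha256).digest()

def verify(payload: bytes, received_tag: bytes) -> bool:
    # compare_digest avoids timing side channels on the comparison
    return hmac.compare_digest(tag(payload), received_tag)

packet = b"encrypted video payload"
t = tag(packet)
print(verify(packet, t))                    # True
print(verify(b"tampered payload!!", t))     # False: tampering detected
```
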

Protecting applications when they are sent to the system administrator, when stored on the platform, and when they are running involves several different security implementations. Encryption will provide confidentiality when the application is deployed to the end user. Some license managers will do that. To protect the application from being copied or cloned from storage, it also must be stored encrypted, and the decryption authorization must be performed by the license manager.

In the most basic context, faith that an application will do what it is supposed to do and not do what it is not supposed to do depends on whether the supplier of the application is trustworthy. The company that wrote the application needs to be reputable and have a secure design lifecycle process that mandates secure coding practices and tests for vulnerabilities. Using digitally signed applications (and verifying the signatures) ensures through a third party signature authority that the application came from the signing company, that it stands behind the application, and that it has not been tampered with.

The results of the detection and classification can also be protected using the same methods as video streams: link encryption and device (i.e., hard disk) encryption.

These methods provide basic security under the assumptions that the system is physically secured, the entire software stack is trustworthy, and the people with access to the system (especially the administrators) are trustworthy. You will read in the Enterprise and Smart City sections what can be done if these assumptions are not true.

Walk – Let’s Get Enterprising – Edge Server: Critical Infrastructure

The next application to be considered is an enterprise-level system, an increase of approximately one order of magnitude in video streams. These applications require the processing of a hundred to a couple of hundred video streams, almost always requiring that all video streams be analyzed and that the analytics occur in real time. Critical infrastructure, industrial processes, and factories all require real-time responses and usually analytics across the entire web of video feeds. The interrelationships between the information in the different video streams are as critical as the analysis of the information in a single video stream.

Referring to Table 6-3 for the Enterprise system, the compute, bandwidth, and storage requirements exceed what the system described in the previous section can support, requiring a fundamental change in architecture. Figure 6-4 introduces the resource graph for an enterprise-level system in the form of a server-class product, using the Intel® Xeon™ product line as an example; similar considerations would hold for alternative products. In addition to the requirements noted earlier, at this level of workload consolidation, it is not unusual for the resources to be shared among multiple entities. Virtualization and security therefore become more critical considerations in addition to raw performance metrics.

In comparing the SMB resource graph in Figure 6-2 to the Enterprise resource graph in Figure 6-4, critical architectural choices quickly become apparent. In the SMB system, considerable functionality is devoted to interacting with the specific environmental sensors such as video (note the MIPI camera interface, for example, and I2S for audio) as well as interacting with the operator (note the graphics and display functions). Supporting the graphics and display functions requires the media blocks as well. The overall result is to limit the compute, memory, and storage capabilities in order to meet cost and power constraints.

In the Enterprise device, Figure 6-4, the greyed-out blocks indicate many of these functions have been eliminated. This frees up power, die area, and package interface pins to focus on compute, memory, and storage functions. These are exactly the features required to support the higher-density video loads that Enterprise systems will see. The number of CPU clusters is increased, from the 4–8 CPU cores of an SMB system to the 12–50+ cores of a server-type system. In addition to increasing the number of cores, the Instruction Set Architecture (ISA) of the cores is extended to support the operations required for video analytics, including both training and inferencing. At a minimum, the CPU should support vector instruction set extensions and preferably tensor instruction set extensions. The Intel® Xeon™ cores have extensions to support both types of operators. In addition, from Table 6-3, we note that HDD capacity on the order of 160 TB of storage is required. At 8 TB per HDD, approximately 20 HDDs are required. This requires a similar number of PCIe/SATA lanes, potentially with SATA port multipliers. Overall, substantially more high-speed IO lanes are required than are supported by a Core™ level device.
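The drive-count arithmetic is simple enough to sketch directly (one host PCIe/SATA lane per drive is assumed here; port multipliers or HBAs can reduce the host lane count in practice):

```python
import math

# Storage-unit and IO-lane sizing for the Enterprise scenario.
# One PCIe/SATA lane per drive is assumed; SATA port multipliers or
# host bus adapters can reduce the host lane count in practice.
REQUIRED_TB = 160
DRIVE_TB = 8

drives = math.ceil(REQUIRED_TB / DRIVE_TB)
print(drives)   # 20 drives, and a similar number of host IO lanes
```
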

Figure 6-4
A block diagram of the VA or Xeon device. Modules of AI compute, CPU, and memory interconnect with components including peripherals, storage, monitor, and x64b LPDDR memory channels 1 to 8.

Resource graph for enterprise device

Next, recall that the relationships between video streams are as important as the information in the individual streams in order to understand the entire context of the environment and the actions taking place. Referring to Figure 6-1, this is where the feature-matching function becomes critical. Seeking patterns and correlations among the inference results from multiple video streams across time and the different locales monitored by the cameras is how the context is understood. This requires accessing data across the entire stored database and holding a substantial fraction in working memory so these correlations can be made in real time. The multiple memory channels in the Intel® Xeon™ devices make this feasible. Table 6-5 shows potential memory configurations with a device supporting 8 memory channels. Each channel is 64b at up to 6400 MT/s, resulting in up to 50 GB/s of memory BW per channel. The memory configuration will determine the database size which can be addressed for the feature-matching function. A common configuration would be in the 1 to 2 TB range for a system with 100 cameras. This would allow access to several hours of video across all cameras, per the Table 6-2 description of 160 TB over 30 days of video storage (~4½ hours of video per TB).

Table 6-5 Server Class Memory Configurations
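These channel-bandwidth and working-set figures can be sketched as follows (the ~5 Mbps per-stream bitrate is our assumption, consistent with the earlier examples):

```python
# Memory bandwidth and video working set for a server-class device.
# 8 channels x 64 bits (8 bytes) x 6400 MT/s, per the text.
channels, width_bytes, mts = 8, 8, 6400e6
per_channel_GBps = width_bytes * mts / 1e9
total_GBps = channels * per_channel_GBps
print(per_channel_GBps, total_GBps)   # 51.2 GB/s per channel, 409.6 total

# Hours of stored video addressable per TB of working memory, assuming
# 100 streams at ~5 Mbps each (the bitrate is our assumption).
per_stream_Bps = 5e6 / 8
hours_per_TB = 1e12 / (100 * per_stream_Bps) / 3600
print(round(hours_per_TB, 1))         # ~4.4 hours across all 100 cameras
```
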

Figure 6-5 describes an Enterprise-class system architecture based on the enterprise device of Figure 6-4. Video streams may be ingested from a cloud type of interface, either from the SMB systems described in the previous section or directly from cloud-connected cameras. The enterprise solution requires great compute flexibility because the number and types of video streams ingested may not be known ahead of time or may change over time. The content of the video streams may also be quite diverse, with some streams having had no processing and others, such as those from the SMB-type systems, having been partially analyzed through the detection phase or through the classification phase (see Figure 6-1). Table 6-6 indicates the extreme range of video ingestion data rates and hence the impact on both storage in HDD/SSD and the number of video streams accessible for processing in working memory as indicated in Table 6-5. For the example shown in Table 6-6, if an input system has performed the detection phase, then only the Region of Interest (ROI) is sent to the Enterprise system; if the input system has performed classification, then only a feature vector is sent.

Table 6-6 Video Data Ingestion Rate vs. Analytics Stage
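The shape of this range can be sketched with illustrative numbers. The ROI fraction and the feature-vector size below are our assumptions, chosen only to show the relative scale; Table 6-6 carries the authoritative values:

```python
# Per-stream ingestion rate by the analytics stage already performed at
# the source. The ROI fraction (~40%) and the 128-dim float32 feature
# vector are assumptions for illustration only.
FPS = 30
raw_Bps = 5e6 / 8                      # full compressed stream, ~5 Mbps
roi_Bps = raw_Bps * 0.4                # assume ROIs average ~40% of frame
fv_Bps  = 128 * 4 * FPS                # one 128-dim float32 vector per frame

for label, rate in [("raw video", raw_Bps), ("ROI only", roi_Bps),
                    ("feature vector", fv_Bps)]:
    print(f"{label}: {rate/1e3:.1f} KB/s per stream")
# Under these assumptions, feature vectors are ~40x smaller than the
# raw compressed stream.
```
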

A similar consideration holds for the relationship between the incoming streams and the compute requirements of the system. (See Table 6-2 for classification and detection parameters.) The following table demonstrates the impact on the overall system architecture of performing the detection and classification operations at different points in the enterprise system. The typical enterprise system will be fed by some combination of dumb cameras, smart cameras, and SMB (or equivalent) sources.

Table 6-7 Server Class Compute Requirements vs. Analytics Stage

Clearly the compute requirements are even more strongly influenced by the source processing than the memory bandwidth and storage requirements.

Figure 6-5
An illustration of the enterprise system. Video streams 1 to N connect via the cloud to the x86 Xeon system, which connects to 1 to N SSDs or HDDs, an analytics accelerator, and, via the cloud, to 1 to N workstations and data centers.

Enterprise system block diagram

Depending on the configuration and model of the server CPU selected, it is possible to set a few guidelines for selection based on a 100-stream scenario, in which all video streams are analyzed.

  • All server class devices will have sufficient compute to perform feature matching.

  • Some server class devices will have sufficient compute power to perform classification.

    • Even in these cases, an accelerator may prove more cost effective and power efficient from a TCO viewpoint and should be considered. Analysis should be performed.

  • Few, if any, server class devices will have sufficient compute power to perform detection plus classification and will require accelerator(s) to achieve the necessary performance.

    • Video analytic accelerators should integrate the media codec with inference acceleration and have a dedicated memory. This accelerator greatly reduces the resource burden on the host since all video decode and inference are self-contained, requiring only a compressed video stream from the host.

Another consideration for detection workloads is that few server class devices incorporate a media codec. Media decode functions can be performed in software on the CPU cores, but will be inefficient compared to a HW accelerated codec. Analysis should be performed to determine the performance with the expected video streams; historically, between 2 and 8 video streams can be decoded per core depending on the original video resolution, frame rate, codec, and video stream structure. It should be noted that the video decode function may also consume a significant fraction of the available memory bandwidth.
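Using the historical 2 to 8 streams-per-core range cited above, the core budget for software decode of a 100-stream system brackets as follows:

```python
import math

# Cores needed for software decode of 100 streams, given the historical
# range of 2-8 decodable streams per core cited in the text.
streams = 100
for per_core in (2, 8):
    print(math.ceil(streams / per_core))   # 50 cores worst case, 13 best
```

Even the best case consumes a substantial fraction of a server CPU, which is why a HW-accelerated codec (on the host or in the accelerator) is usually preferred.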

Using the methodology in this section, the reader can estimate compute requirements for a particular system by substituting the appropriate values for the key system factors:

  • Number of compressed video streams

    • Frames per second…

  • Number of compressed Regions of Interest

    • Frames per second, ROI size, Compression ratio…

  • GOPs per Detection inference

  • GOPs per Classification inference

The better the understanding of the types and structure of the incoming sources and the processing which has previously occurred, the lower the risk of either under-provisioning or over-provisioning the system. Again, best practice is to perform a proof of concept on the proposed system using the actual workloads; however, an analysis at this level of granularity may narrow the scope of system configurations to be tested. Many vendors will also report performance of common inference models through the MLCommons organization (mlcommons.org/en for the English language version). In addition to reporting results in different categories, the organization also supplies open datasets and tools for use. MLCommons is supported by a broad cross-section of leading players in the field of AI and machine learning.
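The estimation methodology above can be collected into a single parameterized sketch. Every default value here is a placeholder of ours, not a figure from the tables; replace them with measurements from your own networks and sources:

```python
# A generic compute estimator parameterized by the key system factors
# listed above. All default values are placeholders; replace them with
# measurements from your own networks and sources.
def analytics_TOPS(num_streams: int,
                   fps: float = 30,
                   det_gops: float = 5.0,     # GOPs per detection (assumed)
                   cls_gops: float = 1.0,     # GOPs per classification (assumed)
                   objects_per_frame: float = 1.0,
                   analyzed_fraction: float = 0.5,  # M of N = 1 of 2
                   source_stage: str = "none") -> float:
    """TOPs needed at this node, given how much analysis the source
    already performed ('none', 'detected', or 'classified')."""
    det = det_gops if source_stage == "none" else 0.0
    cls = cls_gops * objects_per_frame if source_stage in ("none", "detected") else 0.0
    gops = num_streams * (det + cls) * fps * analyzed_fraction
    return gops / 1000

print(analytics_TOPS(100, source_stage="none"))        # 9.0 TOPs
print(analytics_TOPS(100, source_stage="detected"))    # 1.5 TOPs
print(analytics_TOPS(100, source_stage="classified"))  # 0.0: feature match only
```
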

Server-class devices (Figure 6-4) used in Enterprise systems (Figure 6-5) support a substantially higher number of high-speed I/Os. These IOs can often be configured to support multiple protocols, the most common being the PCIe and SATA standards. SATA is more often associated with HDD; PCIe with SSD and peripherals, including accelerators. The processing of the incoming video streams as described in Table 6-6 will also impact the use of the storage resources. Within a single 2U rack (3.5" height) it is possible to mount up to twenty 3.5" HDDs. Assuming each drive supports 8 TB of storage, we estimate a nominal total of 160 TB of storage located within the same rack. (Of course, HDD density is increasing over time, so this estimate should be verified at the time of system architecture analysis.)

Table 6-8 Storage Requirements vs. Analytics Stage

Based on these estimates, a 2U server rack would easily support ingestion of 100 streams if the sources performed detection, sending only ROIs, and only classification were done at the Enterprise system; this requires roughly eight 8 TB drives. If the sources had performed both detection and classification, so that only feature matching was required, then as little as 4 TB, or a single 8 TB storage unit, would suffice. Conversely, if raw video were sent, then the storage requirements may exceed the capabilities of a single rack, and Network Attached Storage (NAS) architectures are required. In the example shown, the storage requirements are at the edge of what a single 2U rack system can support.
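These per-stage storage figures follow from the same retention arithmetic used earlier. The ROI fraction (~40%) and the 128-dim float32 feature vector are illustrative assumptions of ours:

```python
# Storage needed for 30-day retention of 100 streams, by the analytics
# stage already performed at the source. The ROI fraction (~40%) and the
# 128-dim float32 feature vector are illustrative assumptions.
SECONDS = 30 * 86_400
STREAMS = 100

rates_Bps = {
    "raw video":      5e6 / 8,        # ~5 Mbps compressed (assumed)
    "ROI only":       5e6 / 8 * 0.4,  # assumed ROI fraction
    "feature vector": 128 * 4 * 30,   # 128-dim float32 at 30 fps (assumed)
}
for stage, rate in rates_Bps.items():
    tb = STREAMS * rate * SECONDS / 1e12
    print(f"{stage}: {tb:.0f} TB")
# raw ~162 TB (20 drives), ROI ~65 TB (~8 drives), vectors ~4 TB (1 drive)
```
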

Operator interaction is typically through remote workstations via a cloud interface rather than a local interface. The operator interaction will consist of some combination of compressed video streams, analytics results, and user interface functions. Because a remote operator workstation may support multiple displays, the forwarded information may comprise a few to several tens of video streams. There may also be multiple operators' or clients' workstations supported, in some cases including the workstation that initially forwarded the video data to the edge server. The methodology used to estimate video ingestion based on Table 6-6 can also be used to estimate video egress bandwidths. The final egress destination for data will often be a datacenter for further processing of the video data. Because there are multiple points of consumption for the data generated at this stage, depending on the number of clients and the types of data sent, the data egress BW may be substantially larger than the data ingress. In addition, the data egress BW may be substantially more variable than the ingress BW. In this critical infrastructure example, the ingress is dominated by the cameras monitoring the facility throughout the day, representing a relatively constant load. Conversely, the operators and clients may only be present during a single shift, for both workstation-type clients and those accessing via a data center.

From a security perspective, the edge server application adds multi-tenancy as a feature. While the server may have a single owner, multiple tenants will reside on the edge server, as shown in Figure 6-5. The video streams entering the edge server may need to maintain isolation and authentication. If an endpoint is breached, then the malicious code must at a minimum be confined, then detected and, if possible, eradicated. A similar consideration applies to the data and telemetry egressing the edge server to the operators and the data center. As part of the communications protocol, these entities will certainly be sending messages and data requests back to the edge server, which could be corrupted. Virtualization and memory encryption techniques are critical to maintaining confidentiality in a multitenant environment.

A second class of challenges is that, in contrast to the workstation environment, the edge server is often in a physically unsecured location. This makes the edge server vulnerable to physical attacks such as the use of interposers, bus snooping, power supply and pin glitching, and clock manipulation. Systems located in remote or off-site locations are particularly susceptible. Memory encryption and IO channel encryption can mitigate these attacks, which attempt to intercept and/or corrupt data.

Edge Server Critical Infrastructure System Assets and Threats

As in the SMB system, the primary assets in an edge server critical infrastructure system are the video streams, the detection and classification processing being performed on those streams, and the results of that processing. Please refer to the SMB System Assets and Threats section for the details.

The system environment is significantly different. On the enterprise premises side of the edge server, there are more types of devices networked into the system. Figure 6-5 shows local DVR workstations connected to the edge server. Proper network security practices would place those devices on a separate physical network, or on a VPN using managed switches. This may not always be the case; not only might there be many other types of devices with a physical security role, but general office devices might also be connected to the network without anyone realizing it. These all provide more ways that access can be gained to the assets from devices inside and outside the LAN. In Figure 6-5, there are also remote workstations (or cellphones, tablets, and personal computers) for emergency access to time-sensitive data on the edge server over the WAN. Those external devices, including the local networks they are on, are yet another risk that can allow unauthorized access to the assets. Furthermore, the open ports for WAN access are an attack point.

In addition, because this class of system is expected to have multiple users at a time, the threat environment is different from that of an SMB system. Not only do more users represent more threats simply by their numbers, but it is also more difficult to track behaviors to determine whether the assets are under attack, and more difficult to attribute a data leak to a particular user. Unless there is high certainty that the multiple users can be trusted and will not make naïve mistakes such as falling for phishing schemes, employing zero trust security principles is the best way to manage risk.

Using Information Security Techniques to Address the Threats to an Edge Critical Infrastructure System

Foundational security, as described in the Small Business threat mitigation section, is the basis for all system security and must be applied to Edge Critical Infrastructure systems.

For enterprises where the value of the assets is high enough to warrant risk mitigations, the critical applications should be run in trusted execution environments such as virtual machines. These help protect against inevitable vulnerabilities in the OS and applications, as well as against naive or inattentive users of the system. Malicious attackers that appear to be authorized through phished or stolen credentials will also have more difficulty gaining access to system assets. Such agents may further escalate their privileges to administrator, which is why zero trust methods, such as barring administrators from viewing, copying, or modifying the assets, are important protections.

Run – Forecast: Partly Cloudy – Data Center: Building Blocks for a Smart City

The final application class we will examine is that of a data center, another order-of-magnitude increase in the number of video streams comprehended, to a thousand or several thousands. Data center architectures are required when the processing required exceeds what a single server or a single rack of servers can contain. Typically, this will involve both the analysis of the video streams and the incorporation of other enterprise-level workloads. These enterprise-level workloads may or may not be directly related to the video analytics task. Common enterprise-level workloads are diverse, comprising recommendation engines for e-commerce, logistics and inventory, remote gaming, accounting, and applications specific to the clients of the data center. For these reasons, a data center will contain a much more diverse set of resources, consisting of at least CPUs, GPUs, dedicated storage elements, IPUs (infrastructure processing units), NICs (network interface cards), and analytics accelerators.

The basic compute building block for the data center remains the server architecture previously described in Figure 6-4. However, because of the order-of-magnitude increase in data volume, the interconnect, as reflected by the networking component, becomes much more critical. The data center architecture is dominated by how the compute server elements, storage elements, and the external world are networked together. Data center architectures continue to evolve and are vendor specific. A complete discussion of all data center architectures is beyond the scope of this work. A representative data center architecture is given in Figure 6-6.

The data center connects to the external world through the core network shown at the top of the diagram. In general, there will be more than one router at the core level, communicating data from the outside world to the layers below. In principle, the core layer can connect to any of the networking layers below it, represented here as the Aggregation Network. The aggregation network is composed of clusters, in this example two aggregation nodes per cluster.

Aggregation Nodes within the same cluster can communicate with each other as well as the Core Network layer and the compute resources at the next level. The connections in the aggregation layer may take a variety of forms – mesh, star, ring, and bus are the most common topologies. The selection of a specific topology for connectivity will be based on a balance of scalability, fault tolerance, quality of service, and security.
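The fault-tolerance side of this topology trade-off can be illustrated with a small sketch that removes each link in turn and checks whether the aggregation nodes stay connected. The topology generators and node counts below are illustrative and not tied to the specific fabric in Figure 6-6.

```python
from collections import deque
from itertools import combinations

def mesh(n):
    """Every aggregation node linked to every other node."""
    return {frozenset(e) for e in combinations(range(n), 2)}

def star(n):
    """All nodes linked through a single hub (node 0)."""
    return {frozenset((0, i)) for i in range(1, n)}

def ring(n):
    """Each node linked to its two neighbors."""
    return {frozenset((i, (i + 1) % n)) for i in range(n)}

def connected(n, edges):
    """Breadth-first check that all n nodes are reachable from node 0."""
    adj = {i: [] for i in range(n)}
    for e in edges:
        a, b = tuple(e)
        adj[a].append(b)
        adj[b].append(a)
    seen, queue = {0}, deque([0])
    while queue:
        for nb in adj[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == n

def survives_single_link_failure(n, edges):
    """True if the fabric stays connected after any one link fails."""
    return all(connected(n, edges - {e}) for e in edges)
```

With four nodes, the mesh and ring survive any single link failure while the star does not; the scalability side of the trade-off is link count, since a mesh needs n(n-1)/2 links while a ring needs only n.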

Figure 6-6
An illustration of the data center architecture, including a core network, 2 aggregation nodes in each of 3 clusters, and compute and storage units at the top of the rack and in the server racks.

Data center architecture

The next layer down comprises the compute and storage resources, which are physically arranged in racks. Each rack will consist of several rack units, which can be compute, storage, or some combination. The rack units are similar in concept to the edge server described in Figure 6-4 in the previous section. The difference is primarily one of scale: in contrast to the edge server, the scale of the workload has grown beyond the ability of a single device or a single group of devices in a rack to address. In the example shown here, the connection between the aggregation network layer and the racks is through a top-of-rack (ToR) router. The ToR mediates all communication between the components in the rack and the aggregation layer.

The components in the rack may be populated according to the specific needs projected for the data center as a whole. In Figure 6-6, the server racks to the left and center are primarily composed of computational elements, as described in Figure 6-4. The server racks on the right-hand side are a mix of computational elements and storage elements.

From the preceding discussion, it is apparent that connectivity of the elements through the tiers in Figure 6-6 is the critical emergent property in data centers. An important emerging trend in data center architecture is the Infrastructure Processing Unit (IPU), a device specialized to perform infrastructure tasks such as routing, scheduling, and allocating tasks to resources. An example of an IPU-based data center architecture is shown in Figure 6-7. An intelligent network links the IPU-based devices together. The IPU devices provide the interface between the processing units and the network described in Figure 6-6, as well as comprising the tiers for Aggregation and Core Network functions.

Figure 6-7
An illustration with cloud connectivity, GP compute with 2 CPUs, shared storage, acceleration services with 2 XPUs, and ML or AI services with CPU, VPU, or XPU, along with an IPU in each.

IPU-based data center

The IPU is optimized for network functions only, with no compromises for general compute or specialized compute functions such as analytics, graphics, or storage. The software controlling the data center infrastructure runs isolated from the client application software, providing a high degree of protection. A common IPU architecture for the data center network enables a common software stack for the infrastructure applications and services, whose instruction set architecture and accelerators for network functions can be accessed uniformly and efficiently. Data center operators can configure the core, aggregation, and ToR (or equivalent) topologies and operations independently of client application code, avoiding resource contention and security concerns. The client code runs on the CPU/GPU/XPU/VPU, and the data center provider code runs on the IPU. In summary, the benefits of an IPU-based data center are 1) a highly optimized network infrastructure; 2) system-level security, control, and isolation; 3) common software frameworks for infrastructure functions; and 4) flexibility via SW programmability to implement a wide variety of data center architectures.

Smart City Data Center System Assets and Threats

In addition to the video and applications assets described previously, Smart Cities often collect many more types of data than a typical physical security system does. Audio may be collected to provide more complete situational awareness or to enable citizens to interact with city services by talking to a services kiosk. Audio data can be used to locate gunshots, thunder, car crashes, and crowds at events. Other city services may require environmental sensors for air or water quality, motion or vibration sensors, or sonar, radar, or lidar for distance measurements to locate objects. In addition, many city services are installed and supported by data carriers, so communications are often included in these systems, not only for the city services but also for the carriers' direct customers. For some of these use cases, secure time as an input to the system is also critical data, especially when using data from different edge devices or when combining heterogeneous types of data to get more accurate results.

The actions taken by these services also represent critical assets. Traffic controls, emergency services, law enforcement, and health services can improve our quality of life and even save lives. Particularly due to the latter, the proper function of the sensor devices, software, and outputs of these systems must be reliable and trustworthy.

These geographically distributed devices communicate over private or public networks. The edge devices and local aggregation servers are often in physical locations that are difficult to physically secure. This makes the devices and the networks more vulnerable than SMB or Edge infrastructure systems generally are.

The services for a smart city may be publicly funded, but often the equipment and services are leased from an entity that provided the capital in exchange for profits from the data and services. In both cases, reliability and trust are required; in the latter, however, there is a complex relationship between the city, the data owners, the data consumers (the entities whose business depends on that data and who provide services), and the citizens (who benefit from the services and may also be consuming that data). Privacy laws and regulations may apply to some of the data as well.

To maximize the value of the data, it must be broadly available via the public Internet through web browsers, applications, and data APIs. The data's value also makes it an attractive target for exploitation and theft. Public availability means the data is accessed not only by poorly secured devices but also by enterprises eager to profit, legally or otherwise, by manipulating citizens, the services, or the data itself.

So, to summarize, a smart city represents the worst-case scenario for security: a very complex, geographically distributed system with little physical access control; a lot of valuable applications and data; complex ownership, use, and control of the applications and data; public access over the Internet; and lives depending on its proper behavior. It is a high-threat environment, vulnerable to both physically present and remote agents.

Using Information Security Techniques to Address the Threats to a Smart City System

As always, foundational security is essential to protecting the primary assets: applications and data. Devices must employ secure boot and have authenticated firmware, OS, drivers, services, and applications.

When possible, applications should be run in SGX secure enclaves to provide the highest level of hardware-enforced isolation and cryptographic protection. If the devices are not capable of running an enclave, then the applications should be run in a launch-authenticated virtual machine running on a type 1 hypervisor. The applications must be encrypted when they are transmitted and stored, and only decrypted in the virtual machine. Likewise, input and output data must be encrypted in transit and in storage, and only decrypted in an authenticated enclave or virtual machine.
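The attestation-gated key release this implies can be sketched as follows: a key broker hands out the data-decryption key only when the reported launch measurement matches an expected value. The chained-hash measurement (in the style of a TPM PCR extend), the component names, and the broker API are all illustrative assumptions, not any particular product's protocol.

```python
import hashlib
import hmac

def launch_measurement(*components: bytes) -> bytes:
    """Chained hash of boot components, PCR-extend style:
    m = H(m || H(component)) applied for each stage in order."""
    m = b"\x00" * 32
    for c in components:
        m = hashlib.sha256(m + hashlib.sha256(c).digest()).digest()
    return m

class KeyBroker:
    """Releases the data-decryption key only to an environment whose
    attested measurement matches the expected (golden) value."""
    def __init__(self, golden: bytes, data_key: bytes):
        self._golden, self._data_key = golden, data_key

    def release_key(self, attested: bytes) -> bytes:
        if not hmac.compare_digest(attested, self._golden):
            raise PermissionError("measurement mismatch: key withheld")
        return self._data_key

# Illustrative component names; any change to any stage alters the measurement.
golden = launch_measurement(b"firmware-v2", b"kernel-5.15", b"analytics-app")
broker = KeyBroker(golden, data_key=b"\x42" * 32)
```

Because each stage is folded into the chained hash, a tampered application (or kernel, or firmware) produces a different measurement and the key, and hence the data, stays out of reach.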

The OpenVINO Security Add-on will do all of this with a secure protocol that protects the encryption keys and uses a secure method of attesting the secure boot and the enclave or virtual machine launch measurement.

Zero trust authorization protocols should also be employed to minimize the risk of gaps in security protocols and lapses by personnel. Access to any valuable asset should be limited to the actions required and only be granted at the time that action is needed. For example, an administrator needs to install or uninstall applications, but rarely needs to see the binary executable code. The authorization should be granted for only that one task and, once completed, terminated automatically. The authorizations need to be multi-factor authenticated, fine grained, and time limited.
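The grant-one-task, expire-automatically pattern can be sketched with HMAC-signed, single-task, time-limited tokens. The token format, field names, and in-memory secret are illustrative assumptions; a real deployment would use an HSM-held key and a standard token format, plus multi-factor checks at grant time.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # illustrative only; a real system keeps this in an HSM

def grant(user, task, ttl_s, now=None):
    """Issue a token authorizing one user for one task, for a short window.
    Fields must not contain '|' in this toy format."""
    expiry = int((time.time() if now is None else now) + ttl_s)
    msg = f"{user}|{task}|{expiry}".encode()
    tag = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{user}|{task}|{expiry}|{tag}"

def authorize(token, user, task, now=None):
    """Valid only for the named user, the named task, and before expiry."""
    now = time.time() if now is None else now
    try:
        u, t, expiry, tag = token.split("|")
    except ValueError:
        return False
    good = hmac.new(SECRET, f"{u}|{t}|{expiry}".encode(),
                    hashlib.sha256).hexdigest()
    return (hmac.compare_digest(tag, good)
            and u == user and t == task and now < int(expiry))
```

An administrator granted an install-app token cannot reuse it to view binaries, and the token lapses on its own after the TTL, matching the fine-grained, time-limited behavior described above.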

Employing these techniques will keep the risk of loss, penalties, and damage to the system operator's reputation as low as possible.

In the next chapter, you will learn how to keep IMSS cybersecurity capabilities up-to-date as threats evolve and as standards and laws change to keep up with technology.