DeepSHARQ: hybrid error coding using deep learning

Cyber-physical systems operate under changing environments and on resource-constrained devices. Communication in these environments must use hybrid error coding, as purely proactive or reactive schemes either cannot fulfill application demands or perform suboptimally. However, finding optimal coding configurations that fulfill application constraints (e.g., tolerable loss and delay) under changing channel conditions is a computationally challenging task. Recently, the systems community has started addressing such problems with hybrid decomposed solutions, i.e., algorithmic approaches for well-understood, formalized parts of the problem and learning-based approaches for parts that must be estimated (either because of uncertainty or computational intractability). For DeepSHARQ, we revisit our own recent work and limit the learning problem to block length prediction, the major contributor to inference time (and its variation) when searching for hybrid error coding configurations. The remaining parameters are found algorithmically, and hence we make individual contributions to finding close-to-optimal coding configurations in both areas, combining them into a hybrid solution. DeepSHARQ applies block length regularization in order to reduce the size of the neural networks in comparison to purely learning-based solutions. The hybrid solution is nearly optimal with respect to the channel efficiency of the coding configurations it generates, as it is trained so that deviations from the optimum are upper-bounded by a configurable percentage. In addition, DeepSHARQ is capable of reacting to channel changes in real time, thereby enabling its use in cyber-physical systems even on resource-constrained platforms. Tightly integrating algorithmic and learning-based approaches allows DeepSHARQ to react to channel changes faster, and with more predictable timing, than solutions that rely on only one of the two approaches.

the code rate, as part of the selected modulation and coding scheme (MCS), is adapted. The incremental redundancy follows a fixed schedule with a fixed number and sequence of redundancy versions (RVs). However, on higher layers, specifically the transport layer, this parameterization needs to consider and fulfill application requirements, which makes it a complex task. Finding such a configuration has been mathematically well understood for decades [17,18,21,22], including finding optimal configurations that fulfill application requirements while minimizing redundancy overhead. However, this task does not allow for a closed-form representation whose complexity is independent of the channel parameters. Instead, it is a search problem whose complexity depends on its input parameters; e.g., a linear increase in round-trip time leads to a more-than-linear increase in the number of configurations to evaluate. Executing the search for realistic channel parameters on realistic CPS computing devices proved intractable [23].
Based on an efficient, but still intractable, reimplementation of the full search [21], we set out to bring hybrid error coding to resource-constrained devices. In one branch, we approached the problem using machine learning [23], in particular supervised learning with deep neural networks. In a second branch, we successfully decomposed the search problem into stages and improved individual stages algorithmically, achieving optimal redundancy efficiency with shorter inference time [24]. In this article, we build on the decomposed search and combine algorithmic and learning-based approaches into DeepSHARQ: a search with minimal run time yet high efficiency.
The contribution of this article is threefold: (a) We describe a decomposition of the HARQ coding configuration search, allowing for optimizations at different stages. (b) We implement the search algorithm DeepSHARQ, which leverages both algorithmic and learning-based approaches to infer efficient coding configurations in real time. (c) We evaluate DeepSHARQ and compare it against existing solutions, showing its usability on resource-constrained devices.
The remainder of this article is structured as follows: first, we describe related approaches to our work (Sect. 2) and give background on error control at the transport layer of packet networks (Sect. 3). How optimal HARQ configurations can be determined is explained in Sect. 4. Our approach, DeepSHARQ, is described in detail in Sect. 5. This is extended by a description of the model training process (Sect. 6) and an evaluation of the search (Sect. 7). Section 8 outlines directions for future research and Sect. 9 concludes the paper.

Related work
The end-to-end design paradigm [25] has led to many proposals to complement error coding in the lower layers with coding at the transport layer in order to improve reliability without prohibitively increasing the delay [11,12,21,22,26–30]. Maximum Distance Separable (MDS) block codes ensure that the number of correctable losses equals the number of transmitted parity packets. MDS codes have been used to provide predictable reliability under time constraints [21], reduce delay in multimedia communication [28], and avoid feedback implosion in multicast [16,31]. Despite their higher loss rate floor, and hence the extra redundancy they must transmit to achieve the same performance as MDS codes, binary codes have also been a mechanism of choice due to their reduced coding complexity [11,29,30]. Finally, making the end-to-end delay independent of the block length is possible with windowed Random Linear Codes (RLC), which evenly distribute the parity packets over the source packets. RLC codes have been shown to reduce the in-order delay, and hence the tail delay, in fully reliable protocols [15,32]. However, this delay reduction comes at the cost of lower code rates than block codes [33], and the run-time complexity of their matrix inversion hinders their deployment in packetized layers [29,34,35]. Michel et al. [12] have extended QUIC with the three aforementioned code families, showing that RLC codes achieve the lowest delay. In this paper, we have nevertheless opted for block codes, which in principle have a larger delay, because i) we implement a delay-aware scheme, which ensures that no packet's delay exceeds the application's target delay, and ii) we target code configurations that approach the theoretical minimum under timing constraints [36], whereas windowed RLC codes are limited from the code rate standpoint [33].
As in almost any other field, the significant advances in Deep Learning (DL) have made their way into networked communications [37], e.g., adaptive video streaming [38,39], channel state information prediction [27,40], congestion control [6,8,10], and protocol optimization [26,27,41]. In the context of error control, Chen et al. [26] use reinforcement learning to select the code rate of an FEC scheme in order to improve the quality of experience in real-time video streaming. Cheng et al. [27] implement an LSTM network that predicts the future loss pattern in a block of data packets and, based on it, selects the amount of redundancy to transmit. Hu et al. [19] also use LSTM networks to predict loss patterns, but propose a model compression method to enable fast inference and compensate for the large complexity of LSTM networks.
Non-learning-based approaches have also been proposed to implement adaptive error control [13,17,18,22]. Tickoo et al. [22] implement a loss-tolerant TCP with an adaptive FEC scheme based on MDS codes that, similar to our approach, adjusts the transmitted redundancy to the channel characteristics. Adaptive, RLC-based error control is proposed in [17], where the authors show that the proposed mechanism is on par with pure ARQ in throughput- and delay-bound scenarios. [13] proposes a new code construction for low-delay stream codes and presents an adaptive algorithm that outperforms MDS codes. Michel et al. [18] implemented adaptive FEC in QUIC and evaluated the algorithm's performance for applications with different requirements, showing the benefit of FEC over QUIC's purely reactive error control.

Background
Error control is a key function in the most common transport protocols, as it compensates for losses in the lower layers in order to provide the desired reliability level. This section introduces the different building blocks of error control.

Transport layer error control
Networked systems experience packet losses for multiple reasons, e.g., buffer overflows at congested links, channel noise and fading, and medium-access collisions. PHY/MAC layers already implement error correction mechanisms that transmit some form of redundancy to allow for loss recovery. However, these mechanisms fail to provide predictable reliability and end-to-end guarantees [20]. Therefore, error control in the upper layers must complement them [25].
Automatic Repeat reQuest (ARQ) has traditionally been the scheme of choice in the most widely deployed transport protocols, i.e., TCP and QUIC. ARQ requires a feedback mechanism to signal either the reception of packets with acknowledgments (ACK) or packet losses with negative acknowledgments (NAK). TCP implements cumulative ACKs referring to the last correctly received byte, whereas QUIC implements a selective packet-based mechanism in which every received and processed packet is ACKed. Although an ACK could be issued for every packet, both TCP and QUIC implement ACK aggregation mechanisms that reduce the receiver-side traffic; e.g., see delayed ACKs in TCP [42] and ACK aggregation in QUIC [43]. On the other hand, NAKs have typically been implemented for multicast [44,45] to avoid the feedback implosion problem, i.e., the sender in a multicast group being overwhelmed by the ACKs from all receivers, both in terms of received traffic and processing time [16]. When packet retransmissions are triggered depends on the implemented loss detection algorithm [14,46–48]. TCP was originally designed with a purely time-based retransmission mechanism. However, more recent algorithms also use duplicate ACKs/NAKs as packet loss signals, which provides faster reactions than timers at the risk of wrongly deeming a packet lost due to packet reordering in the network. Regardless of the implemented algorithm, retransmissions are never triggered before the round-trip time (RTT) that is required to collect feedback for a packet, and hence we say that ARQ's delay is RTT-dependent.
Obtaining feedback is not always possible if i) the application's target delay is not large enough to wait for feedback, or ii) a feedback channel does not exist (e.g., television broadcasting). In such cases, Forward Error Coding (FEC) is more suitable for the task. Unlike ARQ, FEC proactively transmits redundancy information (RI). As no information about lost packets is available at the time of transmitting the redundancy, FEC must encode parity packets, which are a linear combination of data packets, so that losses can be recovered by solving a linear equation system at the receiver (see Sect. 3.2 for a detailed description of how these packets are encoded). As a result, the loss recovery delay is no longer RTT-dependent, but it is proportional to the source packet intervals that the sender must wait to collect packets before encoding.
As the ARQ and FEC delays differ in nature, it stands to reason that both approaches should be combined to provide optimal predictable reliability under delay constraints. When combined, the optimal balance between proactive (FEC) and reactive (ARQ) redundancy can be found such that the transmitted RI is minimized. Hybrid ARQ (HARQ) implements precisely that behavior: parity packets can be transmitted in the proactive or reactive cycles, and the sender stops transmitting redundancy when the receiver signals it has enough to recover the losses, or when it is too late to recover them in time. Figure 1 provides a graphical comparison of the three aforementioned schemes.

Packet coding
When implemented in the transport layer, HARQ transmits parity packets (or, more generally, parity symbols) to recover the losses. A block code C(n, k): F_q^k → F_q^n transforms a message vector m into a codeword c ∈ C. The finite field F_q has size q. Typically, the field is selected from the family of Galois fields GF(2^m) for binary representation, where m is the number of bits per symbol in the alphabet.
Here, k is the block length (number of symbols in m) and n the codeword length (number of symbols in c). The symbols are encoded by a matrix-vector multiplication with the generator matrix G (c = m · G). At the receiver, the original message vector is recovered by the inverse operation (m = ĉ · Ĝ^(-1)), where Ĝ is a k × k submatrix of G whose columns are selected based on the positions of the received symbols ĉ. Figure 2 shows how the encoding operation is performed. We assume a systematic code is used, i.e., the k × k identity matrix is part of G, and thus the codeword contains a verbatim copy of the message vector. Systematic codes reduce the coding complexity, as only p = n − k symbols are encoded instead of n. They also achieve better error correction behavior: if the linear system cannot be solved (e.g., it is underdetermined because fewer than k packets were received), the receiver can still forward the received verbatim data without decoding. Finally, they allow for data transmission before all k packets are collected for encoding, which reduces the end-to-end delay. While the physical layer performs the coding operation at the symbol level (i.e., directly on bits), IP networks are packetized erasure channels: full packets are lost in the network, either because packets with uncorrectable bit flips are not forwarded to the upper layers or because packets are dropped due to buffer overflows. As a result, HARQ at the transport layer must be capable of recovering full packets. Assume an IP packet is MTU bytes long. With virtual interleaving, the packets can be split into smaller symbols of m bits, k packets are grouped in the interleaver buffer, and the coding operations are iterated over the complete packet length.
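Since MDS coding in GF(2^8) requires full finite-field arithmetic, the packet-level mechanics are easiest to illustrate with the binary (XOR) case discussed later in this section. The sketch below (function names are ours, not from the paper's implementation) encodes k equal-length packets systematically with a single XOR parity packet and recovers one erased packet:

```python
def xor_encode(block):
    """Systematic encoding: append one XOR parity packet to k source packets.
    Packets are equal-length byte strings (padded to MTU size in practice)."""
    parity = bytearray(len(block[0]))
    for pkt in block:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return list(block) + [bytes(parity)]

def xor_recover(received, k):
    """Recover a single erased source packet (None marks a loss)."""
    missing = [i for i, p in enumerate(received[:k]) if p is None]
    if not missing:
        return received[:k]
    if len(missing) > 1 or received[k] is None:
        return None  # undecodable: a systematic code still forwards verbatim data
    rec = bytearray(len(received[k]))
    for pkt in received:
        if pkt is not None:
            for j, b in enumerate(pkt):
                rec[j] ^= b  # XOR of all survivors equals the missing packet
    received[missing[0]] = bytes(rec)
    return received[:k]
```

A real deployment would apply virtual interleaving over MTU-sized packets and stronger codes; this only illustrates the systematic structure and the verbatim-forwarding property.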
In [29], we showed that packetization directly impacts the complexity of the system: while matrix inversion has typically dominated the run-time complexity of coding in the physical layer, the matrix-vector multiplication dominates in packetized layers. As a result, a different code construction may be the best option depending on the channel conditions and the platform the protocol runs on.

Code construction
Three different families of codes have been proposed for the transport layer: MDS [21,22,31], binary [11,29,30], and RLC codes [18,34,35]. They vary in error correction capabilities, underlying field size, and generator matrix construction, and they have different algorithmic tools at their disposal for efficient implementation [31,49]. (The MTU is the maximum transmission unit of the underlying medium, e.g., 1500 bytes in Ethernet.)
Maximum Distance Separable (MDS) codes [31] guarantee that the minimum distance between codewords is d_min = e + 1, i.e., they meet the Singleton bound with equality, where e = n − k is the number of correctable erasures [50]. For this property to hold, any k × k submatrix of G must be invertible. Cauchy and Vandermonde matrices fulfill this property, and thus they are frequently used to construct this type of code, usually in GF(2^8) so that symbols are one byte long.
Matrix inversion is, at the symbol level, the main contributor to the run-time complexity. Binary codes [51–53] overcome this limitation by decoding without an explicit matrix inversion. However, operating in GF(2) does not guarantee the invertibility of every square submatrix. As a result, the loss rate floor is lifted compared with MDS codes; conversely, binary codes require excess parity packets to achieve the same loss rate as MDS codes. It can be shown that this excess redundancy shrinks for very large block lengths [52]. Hence, binary codes have dominated physical layer deployments, e.g., LDPC codes [51] in 4G and 5G, or polar codes [53] in 5G, where such large block lengths are common. However, they can also perform well in the transport layer when running on resource-constrained devices: since most CPUs do not directly support operations in high-order Galois fields, binary codes, which can be implemented with simple XORs, can significantly reduce the run-time complexity [29].
Finally, Random Linear Codes (RLC) follow a random code construction, similar to some binary codes [51,52] (which are in fact a sub-family of RLC codes), but in high-order Galois fields, so that the rows are linearly independent with high probability, lowering the loss rate floor of random codes. However, these codes need many resources for matrix inversion [34,35], and it remains an open research question whether they can be used efficiently on embedded devices, the natural components of CPS.
In the following, this paper assumes systematic MDS codes are used. However, the presented algorithms are code-agnostic as long as the probabilities of losing a packet and of triggering retransmission rounds (see Eqs. 8 and 2 in Sect. 4.1) are adapted to model other codes' properties (e.g., random binary, polar, or RLC codes, and non-systematic codes).

Fig. 2 Encoding process of a systematic code with a block length k, p parity packets, and a generator matrix G. Symbols are packets of MTU bytes

In this section, we introduce SHARQ, an algorithm that finds the optimal configuration in polynomial time.

Problem statement
The performance of every HARQ scheme is governed by two parameters: the block length k, i.e., how many data packets are encoded, and the repair schedule N_P, which dictates how the p parity packets are distributed among the N_C repair cycles (see Fig. 3); the number of parity packets is the 1-norm of the repair schedule vector (p = ||N_P||_1). The objective is to find the HARQ configuration that minimizes the transmitted RI (see Eq. 1) while meeting the application and network constraints at the same time. Minimizing the RI is essential for any communication system; otherwise, resources (i.e., energy and bandwidth) are wasted through the increased throughput, which is unfair to the other systems sharing the communication channel. Formally, the problem considers three constraints: (i) every data packet must be received within the application target delay, (ii) the average number of lost packets cannot exceed the application target loss rate, and (iii) the transmission data rate should not increase beyond the bottleneck data rate of the communication channel. The redundancy information is a weighted sum over the entries of the repair schedule N_P (see Eq. 1). The weight is the probability of the corresponding cycle being required, where p[c] = n[c] − k is the cumulative number of parity packets until round c and w_R[c] is the weight for N_P[c].

Fig. 3 HARQ delay budget. We analyze the impact of the repair schedule N_P on the achievable capacity of HARQ in the transport layer

The weight w_R[c] is the probability of cycle c being triggered in a multicast group with R receivers. It can be shown that, for sufficiently large block lengths, the probability of triggering a new retransmission in a binary erasure channel decreases exponentially with the number of cycles. In such a case, the optimal repair schedule can be straightforwardly built: N_P is an all-ones vector except for the last entry, which is p − N_C + 1. However, in the short block length regime, such a naive repair schedule construction may be suboptimal [24]: if the probability of cycle N_C − 1 failing is sufficiently high, accumulating packets in later rounds approaches FEC behavior, i.e., all parity packets are transmitted with very high probability. In such cases, parity packets should be brought forward to reduce the probability of the later cycles in the schedule; see Sect. 4.3 for an algorithm that efficiently finds the optimal schedule. While the FEC delay (Eq. 3) depends on the source packet interval T_s required to collect k data packets before encoding, the ARQ delay (Eq. 4) is RTT-dominated due to the ACK-triggered retransmission process. The HARQ delay (Eq. 5) can be represented as the combination of its FEC and ARQ components, as depicted in Fig. 3. D_RS is the response delay of the system and models operating system delays, e.g., packet management or scheduling. Although a more precise adaptation can be achieved by feeding dynamic response delays into the algorithm [54], we have opted for a conservative constant value (D_RS = 1 ms) to reduce the dimensions of the input dataset; see Sect. 6.1.
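Under the i.i.d. erasure model used throughout this section, the cycle trigger probabilities and the weighted RI sum can be sketched as follows. This is a simplification for a single receiver (R = 1); the failure condition assumes an MDS code, and the normalization by k is our reading of Eq. 1:

```python
from math import comb

def cycle_weights(k, Np, pe):
    """Probability that each repair cycle fires on a binary erasure channel
    with loss rate pe, single receiver. Cycle c > 0 fires when the losses
    among the packets sent so far exceed the parity sent so far (MDS)."""
    w = [1.0]                      # the proactive (FEC) cycle is always sent
    n = k + Np[0]
    for c in range(1, len(Np)):
        p_cum = n - k
        # P(more than p_cum of the n packets sent so far are lost)
        fail = sum(comb(n, i) * pe**i * (1 - pe)**(n - i)
                   for i in range(p_cum + 1, n + 1))
        w.append(fail)
        n += Np[c]
    return w

def redundancy_information(k, Np, pe):
    """Expected parity packets per source packet (one plausible normalization)."""
    w = cycle_weights(k, Np, pe)
    return sum(wi * ni for wi, ni in zip(w, Np)) / k
```

On a lossless channel only the proactive cycle contributes, while on a fully lossy channel every cycle fires, matching the intuition that later cycles are "cheaper" only as long as they are unlikely to be needed.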
The model also considers an upper bound on the time required to detect that a packet is lost (D_PL), i.e., the maximum time the system needs to mark a packet as lost after its transmission, which determines when a new retransmission round is triggered. D_PL solely depends on the loss detection algorithm implemented in the transport protocol [14,46–48]. For the remainder of the paper, D_PL = 4.5 · T_s, which assumes the mechanism in [46] is implemented (see Sect. 5.4 for more details on why this is the case). The error control presented in this paper assumes some periodicity in the application data arrival, e.g., video streaming with a constant frame rate or sensors in CPS with a constant sampling rate. Equation 5 accordingly considers that the inter-packet time is constant over the optimization time window D_T. However, the proposed mechanisms can also be applied to bursty, time-aware traffic: the T_s estimation function must detect a burst (e.g., when the application does not provide further data after one T_s), in which case a new constraint is added that caps k to the maximum achievable block length for such a burst. The model also assumes a symmetrical network delay for simplicity. However, in the future, we intend to integrate DeepSHARQ into the time-aware protocol introduced in [55] to also provide predictable error control over networks with asymmetrical delays.
The packet loss rate is given in Eq. 6, where P(I_k = i) is the probability of being unable to decode exactly i data packets (i.e., the loss rate as seen by the application) when a systematic MDS code is used, and b = max(p + 1, i). Although we have already applied the framework presented here to channels with memory, such as the Gilbert-Elliott channel [21], in this paper we limit ourselves to the more tractable i.i.d. channels in order to support the reader's intuition. This is motivated by the fact that, if the protocol reacts to channel changes fast enough, the underlying channel can be modeled as a binary erasure channel with packet loss probability p_e. While the previous two constraints deal purely with application requirements, the data rate constraint avoids network congestion by ensuring that the transmitted data rate (Eq. 9) stays below the bottleneck data rate of the network, R_C. Once the formal model is defined, an algorithm must be implemented that finds the optimum fast enough to react to changes within the channel coherence time, i.e., the time the channel properties remain unchanged.
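For intuition, the residual loss rate of a systematic MDS code on a binary erasure channel admits a compact sketch. The l/n factor (the expected share of data packets among l losses) is our simplification of the exact per-packet expression in Eq. 6:

```python
from math import comb

def residual_plr(k, p, pe):
    """Residual packet loss rate seen by the application for a systematic
    MDS code C(n, k), n = k + p, on an i.i.d. erasure channel with loss
    probability pe. Decoding fails when more than p of the n packets are
    lost; given l total losses, an expected fraction l/n of them hit data
    packets, which are then forwarded unrecovered (systematic code)."""
    n = k + p
    plr = 0.0
    for l in range(p + 1, n + 1):           # only undecodable loss counts
        p_l = comb(n, l) * pe**l * (1 - pe)**(n - l)
        plr += p_l * (l / n)                # expected share of lost data packets
    return plr
```

Two sanity checks: with p = 0 the code adds nothing and the residual rate equals the raw channel loss rate, and on a lossless channel it is zero.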

SHARQ
Scheduled HARQ (SHARQ) is a search algorithm that, given the application delay and loss rate constraints and the channel state information, finds the (k, N_P) that minimizes the RI. SHARQ's algorithm (see Alg. 1) takes as input the maximum block length (k_max) and the maximum number of parity packets (p_max). Given the maximum block lengths allowed by the delay and loss rate constraints, k_max is the minimum of the two. Let p_opt(k) = min{p | PLR_HARQ(k, p) ≤ PLR_T} be the optimal number of parity packets for a block length k to fulfill the packet loss rate constraint; it can be shown to be a monotonically increasing function of k. The loss rate constraint solely depends on k and p (see Eq. 6). Therefore, as the block length increases, the RI decreases if p is kept constant: RI(k, p) > RI(k + 1, p). Conversely, the PLR increases because the same number of parity packets carries information from more data packets: p_opt(k) ≤ p_opt(k + 1), with equality holding if the PLR increase is not large enough to surpass PLR_T. It directly follows that the maximum number of parity packets is p_max = p_opt(k_max). Due to the monotonically increasing nature of the PLR, k_max and p_max can be found with a binary search with run-time complexity O(m · C_PLR), with m the number of bits per symbol in the Galois field and C_PLR = O(k + log(p)) the complexity of obtaining the PLR (see Appendix A for more details on the PLR complexity).
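The monotonicity argument above makes p_opt amenable to a standard binary search. In this sketch, `plr` is a stand-in for the residual loss-rate model of Eq. 6 (any function monotonically decreasing in p works):

```python
def p_opt(k, plr_target, plr, p_max):
    """Smallest p with plr(k, p) <= plr_target, via binary search.
    Relies on plr(k, p) being monotonically decreasing in p."""
    lo, hi = 0, p_max
    if plr(k, hi) > plr_target:
        return None                 # constraint unsatisfiable within p_max
    while lo < hi:
        mid = (lo + hi) // 2
        if plr(k, mid) <= plr_target:
            hi = mid                # mid satisfies: answer is mid or smaller
        else:
            lo = mid + 1            # mid fails: answer is strictly larger
    return lo
```

With a toy model plr(k, p) = 0.5^(p+1), a target of 0.1 yields p_opt = 3, since 0.5^4 = 0.0625 is the first value below the target.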
If (k, p) is known, N_C can be directly obtained: as long as there are enough parity packets to fill later cycles and the delay budget allows it, adding a cycle can only reduce the RI, because every newly transmitted parity packet reduces the probability of later cycles. Therefore, N_C is simply the maximum number of cycles that fit in the remainder of the delay budget. SHARQ clearly decouples the delay and PLR constraints, resulting in a more structured and efficient exploration of the search space: for every block length, p solely depends on the PLR constraint, whereas N_C solely depends on the delay constraints. Finally, the graph search in Sect. 4.3 is used to find the optimal N_P. The graph search has run-time complexity C_GS = O(p^2 · N_C), and hence the run-time complexity of the SHARQ search algorithm is in O(N_C,max · k_max · p_max^2).

Algorithm 2 Graph Search
Require: k, p, N_C. Ensure: N_P* = arg min RI(N_P).

The objective of the graph search algorithm is to find the schedule N_P with minimum RI, given a (k, p) pair and N_C. As seen in Eq. 1, the RI is a weighted sum over the entries of N_P. Each weight is the probability that the corresponding retransmission round is required. This structure creates a trade-off: packets in later rounds are less likely to be transmitted and hence have a lower cost in terms of RI. However, putting fewer packets into the early rounds increases the probability that the later rounds are needed.
The key observation for efficiently finding the optimal schedule is that the weight for round c only depends on the number of packets in the rounds before c, not on how they are scheduled. In other words, if we have already scheduled x packets into y rounds, the cost of assigning dx packets to the next round is the same regardless of how the x packets were scheduled before. This structure can be expressed as a graph (Eq. 10), with edge weights reflecting the RI cost (Eq. 11). The edges are chosen to enforce that every retransmission round (i.e., N_P[c] for c > 0) is assigned at least one packet. Consequently, we must also ensure that we do not assign too many packets to one round, as we need at least one packet for every following round. An example of the resulting graph is shown in Fig. 4. Each path through the graph from the start node to (p, N_C) corresponds to a schedule. Since the edge weights equal the required RI, the schedule achieving the minimal RI corresponds to the shortest path. This shortest path can be computed with the dynamic programming approach shown in Alg. 2: for each layer, we relax the nodes between the lower and upper bounds on the number of packets admissible for the corresponding round, as per the restriction above. We store both the minimum distance in D and a parent pointer, allowing us to reconstruct the shortest path at the end.
The edge weights can be obtained in O(p). Each layer has O(p) nodes with O(p) predecessors each. Since there are N_C layers, the time complexity is O(p^2 · N_C).
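The layered shortest-path computation can be sketched compactly. Here `weight(c, x)` stands in for the probability that cycle c fires given x parity packets scheduled before it (the paper's Eq. 11); its exact form is an assumption of this sketch. Layer 0 is the proactive (FEC) cycle, which may stay empty, while each of the N_C retransmission cycles must receive at least one packet:

```python
def graph_search(p, Nc, weight):
    """DP sketch of Alg. 2: distribute p parity packets over one FEC cycle
    plus Nc retransmission cycles, minimizing the expected RI. Runs in
    O(p^2 * Nc), matching the complexity analysis above."""
    # dist maps cumulative packets scheduled so far -> (min cost, schedule)
    dist = {0: (0.0, [])}
    layers = Nc + 1
    for c in range(layers):
        later = layers - 1 - c          # cycles still to fill after c
        new = {}
        for x, (d, sched) in dist.items():
            lo = 0 if c == 0 else 1     # FEC cycle may stay empty
            hi = p - x - later          # leave >= 1 packet per later cycle
            for dx in range(lo, hi + 1):
                cost = d + weight(c, x) * dx
                if x + dx not in new or cost < new[x + dx][0]:
                    new[x + dx] = (cost, sched + [dx])  # relax node (x+dx, c)
        dist = new
    return dist[p]                      # min RI and the schedule achieving it
```

With a weight that decays in the cumulative parity (e.g., 0.5^x), scheduling one proactive packet before two reactive ones beats a purely reactive schedule, illustrating the "bring parity forward" effect described in Sect. 4.1.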

DeepSHARQ
Based on SHARQ's search structure, DeepSHARQ applies learning algorithms to estimate the block length and implements a simple schedule construction, reducing the run-time complexity compared to purely learning-based or purely algorithmic solutions.

Design principles
DeepSHARQ is designed with two main principles in mind: i) in contrast to purely learning-based approaches, DeepSHARQ exploits SHARQ's search structure to simplify the learning problem, thereby requiring smaller neural networks to achieve similar inference accuracy, and ii) DeepSHARQ relaxes the optimality constraint to achieve predictably low inference times.
SHARQ quickly finds p with a binary search and N_C with a closed-form expression. Thanks to SHARQ's structured search, it becomes apparent that the iteration over all possible block lengths and the graph search are responsible for most of the inference time. DeepSHARQ tackles this by inferring the block length with a neural network and using a simple repair schedule construction that, despite being suboptimal, does not produce significant RI increases.

Output space regularization
The quantization of the output space makes small variations in the input produce significantly different block lengths in the output. Consider, e.g., an increase in the delay budget: the extra time may suffice for yet another retransmission cycle, which may significantly drop the maximum block length that fits in the remaining time. Figure 5 shows how significant the block length variations are: in each subfigure, a different input parameter is linearly increased. Changes in D_T, PLR_T, and T_s produce relatively smooth variations in k that could be easily learned. However, a linear increase in p_e results in quasi-random behavior of the optimal block length. Although the block length variations may differ for other application and channel models, Fig. 5 clearly illustrates how difficult the output space can be to learn. We propose a different training mechanism that tackles this problem by simplifying the output space via regularization. Instead of predicting the optimal label for each configuration, we train networks to predict any block length out of a set of valid block lengths. The set of valid k's is selected so that the RI deviation from the optimal RI stays within configurable limits. Formally, given the set K_v of all block lengths that fulfill the requirements (see Sect. 4.1), the neural network is allowed to predict any block length in this regularized subset of K_v.
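One plausible reading of this regularization is a multiplicative deviation bound: any feasible block length whose RI is within a factor (1 + ε) of the optimum becomes a valid label. The sketch below assumes a function `ri` mapping each feasible k to its minimal RI; both names are ours:

```python
def valid_block_lengths(ri, K_v, epsilon):
    """Regularized label set: all feasible block lengths whose RI is within
    a factor (1 + epsilon) of the optimal RI. `ri` maps a feasible k to its
    minimal RI; `K_v` is the set of block lengths fulfilling the constraints."""
    ri_opt = min(ri(k) for k in K_v)
    return {k for k in K_v if ri(k) <= (1 + epsilon) * ri_opt}
```

For ε = 0 this degenerates to the single optimal label, i.e., the hard-to-learn output space of Fig. 5; growing ε smooths the label map at a bounded efficiency cost.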

Repair schedule construction
It can be proved that, when error control is provided without any timing constraints, the probability of decoding failure decreases exponentially with every newly transmitted parity packet. The repair schedule in such a case is an all-ones vector, so that the contribution of every new packet to the RI also decreases exponentially (see Eq. 1 in Sect. 4).
SHARQ's simple schedule is based on this theoretical optimum: for N_C = 0, all p packets are transmitted in the FEC cycle, whereas for N_C ∈ [1, p] the FEC cycle is set to 0, followed by all-ones entries and p − N_C + 1 in the last cycle. Despite being suboptimal [24], such a naive repair schedule has the advantage that it can be constructed in O(1) and, as we show in Sect. 7.3, the RI increase it produces is negligible.
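The construction just described is a one-liner; the sketch below follows our reading of the rule, with index 0 holding the FEC cycle:

```python
def simple_schedule(p, Nc):
    """O(1) repair schedule: all parity in the FEC cycle if Nc == 0;
    otherwise an empty FEC cycle, one packet per retransmission cycle,
    and the surplus p - Nc + 1 in the last cycle (assumes 1 <= Nc <= p)."""
    if Nc == 0:
        return [p]
    return [0] + [1] * (Nc - 1) + [p - Nc + 1]
```

The entries always sum to p, and the per-cycle minimum of one packet is respected by construction.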

System architecture
DeepSHARQ's pipeline is depicted in Fig. 6; the neural network has 4 hidden layers with 150 neurons each and leaky ReLU activations, plus a softmax output layer (see Fig. 7). DeepSHARQ inherits some of its algorithmic components from SHARQ [24], namely the binary search for p, the closed-form expression for N_C, and the constraint fulfillment check once the configuration is found, which notifies the application whether the channel supports its requirements. On the other hand, the graph search in Sect. 4.3 is substituted by the simple repair schedule construction in Sect. 5.3, so that the run-time complexity of finding the schedule is reduced from O(p_max^2 · N_C,max) to O(1), and the block length selection goes from a full search over k ∈ [1, k_max] in SHARQ to a neural network prediction with run-time complexity O(1) in DeepSHARQ. As a result, the major contributors to DeepSHARQ's complexity are the binary search for p, with O(m · (k_max + log(p_max))), and the RI calculation. Table 2 shows how the run-time complexity of the search has been reduced with every newly proposed algorithm.
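The composition of the pipeline stages can be sketched as follows. Every stage here is a stand-in for a component described above (neural network inference, binary search, closed-form cycle count, O(1) schedule construction); the signatures are our assumptions, not the paper's API:

```python
def deepsharq(channel_state, predict_k, p_opt, n_cycles, schedule):
    """Pipeline sketch: chain the four DeepSHARQ stages into one
    configuration (k, N_P). All four callables are hypothetical stand-ins."""
    k = predict_k(channel_state)      # neural network inference, O(1)
    p = p_opt(k, channel_state)       # binary search over the parity count
    Nc = n_cycles(k, channel_state)   # closed-form number of repair cycles
    Np = schedule(p, Nc)              # O(1) simple schedule construction
    return k, Np
```

A constraint fulfillment check on the resulting (k, N_P), as inherited from SHARQ, would follow this chain before the configuration is handed to the application.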
Although DeepSHARQ has not been designed for a specific transport protocol, it assumes the implemented transport layer functions fulfill certain requirements. In the following, we describe such assumptions and how DeepSHARQ interacts with the other transport functions.

Loss detection
DeepSHARQ triggers the repair cycles in the schedule N_P if losses are detected in a block. This paper assumes the algorithm presented in [46] is implemented, which maintains a packet loss count at the receiver that is increased if (i) an out-of-order packet arrives, or (ii) a packet timeout expires. The timeout is configured between 1 and 2 times the inter-packet time (T_s). A new cycle is triggered when the loss count reaches a configurable threshold. The higher the threshold, the higher the algorithm's robustness against in-network packet reordering. For time-bound scenarios with target delays in the same order of magnitude as the RTT and T_s (see Table 3), packet reordering is equivalent to packet loss if the packets arrive outside of the time budget, i.e., D_T milliseconds after the transmission of the first packet in the block. Therefore, packet reordering has little impact in such scenarios. We consider a low threshold of three loss counts and a packet timeout of 1.5 · T_s, which results in D_PL ≤ 4.5 · T_s. The delay model in Sect. 4.1 considers the worst-case detection delay to ensure the parity packets arrive at the receiver in time. Recently, new algorithms have been proposed that perform better in channels with significant packet reordering [47,48]. In future work, we plan to integrate these more recent algorithms into our model for faster and more accurate loss detection.

Footnote 5: Given a systematic code in GF(2^m), it is theoretically possible to construct a (k_max, p_max) code with k_max = p_max = 2^m. However, the MDS coder implementation considered in this paper enforces the code (k, p) to fulfill k + p ≤ 2^m. Therefore, both variables must be treated independently in the complexity analysis.
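The counting rule described above can be sketched as follows (a simplified stand-in for the algorithm in [46], not its actual implementation; class and method names are our own):

```python
class LossDetector:
    """Count suspected losses and trigger a repair cycle at a threshold.

    The count increases on out-of-order arrivals and on expired packet
    timeouts (the timeout is configured between 1 and 2 inter-packet times,
    e.g., 1.5 * T_s as in the text).
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.loss_count = 0
        self.highest_seq = -1

    def on_packet(self, seq):
        """Register an arrival; returns True if a repair cycle should fire."""
        if seq < self.highest_seq:
            self.loss_count += 1          # out-of-order arrival
        else:
            self.highest_seq = seq
        return self._check()

    def on_timeout(self):
        """Called when a packet timeout expires without an arrival."""
        self.loss_count += 1
        return self._check()

    def _check(self):
        if self.loss_count >= self.threshold:
            self.loss_count = 0           # reset after triggering a cycle
            return True
        return False
```

A higher threshold tolerates more reordering before triggering, matching the robustness trade-off discussed above.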

Congestion control
DeepSHARQ ensures the transmitted data rate does not exceed the channel data rate (see Eq. 9 in Sect. 4.1). However, it does not implement any mechanism to sample the bottleneck data rate and ensure it is not exceeded; instead, it relies on congestion control for that. Congestion control is available in most transport layer protocols because it is key to a fair share of the available network resources. Although DeepSHARQ is congestion-control-agnostic and could in principle coexist with any of the many proposed algorithms, we recommend BBR-like algorithms [9,54] that try to operate at the Bandwidth-Delay Product (BDP). Operating at the BDP is crucial for CPS, as it keeps network buffers empty, thereby minimizing the end-to-end delay while the data rate stays close to the bottleneck data rate.

Channel estimation
DeepSHARQ's ability to fulfill the application requirements depends on the precision of the estimated channel model. Another benefit of implementing BBR-like congestion control is that it provides an estimate of two of DeepSHARQ's input parameters: RTT and R_C [56]. An estimation of the remaining parameter, the channel loss rate, is proposed in [21], which uses gaps in the data stream to estimate p_e. In addition, the tolerated delays in CPS are so small that they are typically in the same order of magnitude as the channel coherence time, or even smaller. In other words, the channel can be considered constant during the time budget, and [21,56] provide an estimation precise enough for most IP deployments. Nevertheless, fast-changing, dynamic channels can have a coherence time in the single-millisecond range, which poses a more challenging scenario. Machine-learning-based solutions seem promising for such a small granularity as well [26,27]. We believe this is an interesting parallel research path that could enable DeepSHARQ even in the most demanding channels.

Model training
Finding the right hyperparameters is essential to achieve good performance in data-intensive tasks. This section analyzes the different components used in the learning process to shed some light on the model selection process, as well as to ensure the results are reproducible.

Dataset generation
We have designed the dataset with two objectives in mind: i) it must represent current deployments faithfully, and ii) it must generalize for any of the included deployments. The model uses six input parameters:
• Application parameters: target erasure rate PLR_T, target delay D_T, and source packet interval T_s.
• Network parameters: channel data rate R_C, channel erasure rate p_e, and round-trip time RTT.
We have considered traces obtained in the wild for the most common network deployments, i.e., broadband, 4G [57], 5G [58,59], and WiFi [60,61] deployments. For application-related parameters, we have used delay and reliability constraints of traditional [62] as well as more demanding applications still under deployment [7,63]. For each of the parameters, an order of magnitude is selected from Table 3 with equal probability, and a randomly selected number between 1 and 9 is prepended to that order of magnitude. Finally, Alg. 1 is executed for the input with a slight modification: not only the optimal block length k_opt is logged, but k_min and k_max are also obtained, which are respectively the minimum and maximum block lengths that ensure the RI deviates at most δ from the optimal RI (see Sect. 5.2). The resulting dataset is split into training, validation, and test (20%) sets. Such a dataset simplification reduces the time and resources spent in training without a negative impact on the system's performance, as DeepSHARQ nevertheless discards any predicted k that does not meet the constraints. The models in Sect. 4 consider three other parameters that are not included in the dataset: the packet length P_L, the processing delay D_RS, and the loss detection delay D_PL. We assume the packet length is fixed to the MTU, and hence P_L = 1,500 bytes. We also consider a rather conservative constant value for the processing delay, D_RS = 1 ms, to reduce the dimensions of the dataset. Finally, D_PL is linearly dependent on T_s, and hence it adds no new information as an input.
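The sampling procedure above can be sketched as follows. The exponent lists below are purely illustrative placeholders; the actual per-parameter magnitude ranges are those in Table 3:

```python
import random

# Hypothetical orders of magnitude per parameter; the real ranges are in
# Table 3 of the paper.
MAGNITUDES = {
    "D_T": [-3, -2, -1],       # target delay in seconds (illustrative)
    "p_e": [-4, -3, -2, -1],   # channel erasure rate (illustrative)
}

def sample_parameter(name, rng=random):
    """Pick an order of magnitude uniformly at random, then prepend a
    uniformly chosen digit in [1, 9], yielding d * 10^e."""
    exponent = rng.choice(MAGNITUDES[name])
    digit = rng.randint(1, 9)
    return digit * 10 ** exponent
```

Each sampled value therefore has a single significant digit, which keeps the dataset covering several orders of magnitude with uniform probability per magnitude.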

Loss definition
Unlike common classification problems, in which the neural network learns the mapping from the input to a single valid output, DeepSHARQ's neural network is trained to accept as correct any label within a range. Therefore, we propose a new loss that accounts for the fact that the true label is a set and not a single value. Given the true label k_i and the neural network prediction k̂_i, the proposed loss is based on the binary cross-entropy H(k_i, k̂_i), where p[k̂_i ∈ K_v^δ] is the probability that the neural network predicts any block length that belongs to the accepted range, and p[k_i ∈ K_v^δ] = 1 because the true label always belongs to that range by definition. Since the target probability is 1, the cross-entropy reduces to

H(k_i, k̂_i) = −log p[k̂_i ∈ K_v^δ],

and the loss L(k, k̂) in Eq. 12 is this term averaged over every batch of size N:

L(k, k̂) = −(1/N) Σ_{i=1}^{N} log p[k̂_i ∈ K_v^δ].

In Sect. 7, we show that this loss allows the model to correctly learn the mapping from input parameters to any label in the set of valid labels.
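A minimal sketch of this loss for one sample follows (pure Python for clarity; in practice it would operate on PyTorch tensors, and the convention that output index i corresponds to block length i + 1 is our assumption):

```python
import math

def set_cross_entropy(logits, k_min, k_max):
    """Per-sample loss: -log of the probability mass the softmax output
    assigns to the accepted block length range [k_min, k_max].

    Because the target probability p[k_i in K_v^delta] is 1, the binary
    cross-entropy reduces to -log p[k_hat in K_v^delta].
    """
    m = max(logits)                       # subtract max to stabilize softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Probability mass over the accepted range (index i <-> block length i + 1)
    p_valid = sum(exps[k - 1] for k in range(k_min, k_max + 1)) / total
    return -math.log(p_valid)
```

Note that, unlike the traditional cross-entropy, the accepted range differs per sample, which is why the loss must be evaluated independently for every sample in the batch.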

Ablation study
Tuning the learning rate hyperparameter is instrumental to successfully training neural networks. PyTorch implements various learning rate policies that can be configured to schedule the learning rate, such as plateau or super-convergence [64]. Both policies benefit from large maximum learning rates that allow for a longer exploration phase and low learning rates to fine-tune the model weights. The super-convergence scheduler begins with a rising phase that goes from start_lr to max_lr, after which it decays towards end_lr, which is substantially lower than start_lr. The plateau learning rate policy monitors the validation loss to estimate the effectiveness of the current learning rate (i.e., if after patience epochs the validation loss did not decrease by at least threshold, it decays the learning rate by a constant factor until min_lr has been reached). The super-convergence policy requires an optimizer with momentum, and hence we trained all the models with momentum-enabled stochastic gradient descent. Super-convergence varies the momentum between 0.85 and 0.95, while it is constant at 0.9 for plateau. Figure 8 shows both policies' accuracy and learning rate evolution with DeepSHARQ's neural network limited to 1,000 epochs. The plateau policy reaches lower learning rates faster than super-convergence, resulting in higher initial accuracy but convergence towards lower accuracy in the second half of the training. On the other hand, super-convergence surpasses plateau in accuracy for the last hundred epochs due to its extended high-learning-rate exploration phase. We selected super-convergence with max_lr = 0.04, as it achieves the best performance. Training the models for 1,000 epochs takes approximately 24 h on a PC with an Intel Core i7-7700 CPU at 3.6 GHz and 8 cores, with an average core load of approximately 50%. The main bottleneck is the calculation of the loss function in Sect. 6.2, which, unlike the traditional cross-entropy loss, must be independently evaluated for every sample in the batch because every input may consider a different K_v^δ set.

Fig. 8 Validation accuracy and learning rate evolution for different learning rate policies. Three maximum learning rates have been used for super-convergence (i.e., 0.02, 0.03, and 0.04), whereas patience values (10 and 15) and learning rate reduction factors (0.7 and 0.8) have been used for plateau.

However, the significantly smaller models (see Sect. 7.2) counteract the impact of the longer training phase when DeepSHARQ is deployed at scale on a significant number of end devices. In addition, thanks to the broad set of channels and applications considered in Sect. 6.1, DeepSHARQ is readily deployable on the most common networks nowadays without a lengthy re-training for fine-tuned adaptation. Table 4 presents the ablation study we performed to select the final hyperparameters. All the presented results are for models trained with a range of valid labels for δ = 0.3 (see Sect. 6.2). The regularization factor is a key parameter for super-convergence, as high learning rates already act as a form of regularization [64], and combining them with L2 regularization with a high factor can be detrimental to performance (see Table 4, rows 2 and 6). We also tested multiple epoch counts and selected 1,000 as it strikes the right balance between good performance and training time.
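The plateau policy's decision rule can be sketched as follows (a simplified stand-in for PyTorch's ReduceLROnPlateau, not its actual implementation; parameter defaults mirror the values explored in Table 4):

```python
class PlateauScheduler:
    """Decay the learning rate by `factor` whenever the validation loss has
    not improved by at least `threshold` for more than `patience` epochs."""

    def __init__(self, lr, factor=0.7, patience=10, threshold=1e-4, min_lr=1e-9):
        self.lr, self.factor = lr, factor
        self.patience, self.threshold, self.min_lr = patience, threshold, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new lr."""
        if val_loss < self.best - self.threshold:
            self.best = val_loss          # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0       # start a fresh patience window
        return self.lr
```

The super-convergence policy, by contrast, follows a fixed one-cycle shape independent of the validation loss, which is why it can sustain high learning rates for longer.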
For DeepSHARQ, we have selected the model with 4 hidden layers and 150 neurons because, for the selected δ, it achieves a good compromise between accuracy and model size. However, the method proposed here is flexible enough to allow for different model selections: while a large model with close-to-optimal accuracy may be a good option on powerful PCs, smaller models, e.g., trained with larger δ's and hence at the cost of an RI increase, may be desirable for resource-constrained platforms.

Evaluation
In the following, we evaluate the newly proposed neural network approach, as well as DeepSHARQ's real-time response to channel changes.

Methodology
All the models evaluated in this section have been trained following Sect. 6, using PyTorch 1.12.1. Only the test dataset has been used to generate the accuracy results and Cumulative Distribution Functions (CDFs) presented here, and the algorithms have been executed on a PC running Ubuntu 22.04 LTS with an Intel Core i7-7700 CPU at 3.6 GHz and 32 GB RAM. For the inference time evaluation, the PyTorch-trained models have been ported to TensorFlow 2.11 using the TensorFlow Backend for ONNX and executed with TensorFlow Lite via the tflite Rust crate. Two different neural networks are considered: i) k_opt, trained to predict the optimal block length, and ii) k_range, trained to predict any block length in a range of valid block lengths (see Sect. 5). The models were trained for 1,000 epochs, using the super-convergence learning rate policy configured with max_lr = 4 · 10^−3, start_lr = 2 · 10^−2, and min_lr = 2 · 10^−9. The smoother output space allows the models to converge faster towards high accuracy (conversely, low loss) and to experience a lower variance within the high-learning-rate phase of super-convergence (see Sect. 6). This result is somewhat expected, since more labels are accepted as "correct" for k_range models, and hence the accuracy as defined in the learning process is increased. However, extending the range of valid labels also improves the model performance from an information-theoretical standpoint (see Fig. 10). If a valid configuration is defined as any configuration with an information rate below the time-bound channel capacity, i.e., any configuration meeting all the constraints, then k_range models are able to find more valid configurations over the test dataset the larger the range of valid labels is.
Such a trade-off between RI optimality and model size is particularly beneficial for resource-constrained devices, in which the bottleneck is the CPU rather than the network, both in terms of processing speed and energy consumption [29], especially when connected to 5G networks [59]. In contrast, the DeepHEC model presented in [23]

The cost of optimality
DeepSHARQ slightly deviates from the optimization problem defined in Sect. 4.1, as it introduces two sources of suboptimal configurations: i) the neural network, which can either be directly trained to allow for suboptimal block lengths, or produce misclassifications even when trained to predict the optimum, and ii) the simple schedule construction. Figure 11 compares k_opt with 5 layers and 250 neurons, and k_range with 4 layers and 150 neurons, each model implementing two different repair schedule constructions: the graph search in Sect. 4.3 (graph) and the simple repair schedule in Sect. 5 (simple).

Fig. 11 DeepSHARQ's neural network has 4 hidden layers and 150 neurons per layer, and it has been trained with a dataset ensuring that RI ≤ (1 + δ) · RI_opt with δ = 0.1.

The results show that allowing for suboptimal configurations reduces the tail and average inference time, and increases its predictability. SHARQ's graph search for optimal repair schedules results in a long tail in the inference time due to its quadratic complexity, while increasing the unpredictability of the system at the same time. On the other hand, k_range reduces the average delay by 40% in comparison to k_opt due to the smaller neural network. The faster inference time comes at the cost of a data rate increase. As expected, k_opt produces no increase in significantly more cases than k_range (69% vs. 32%). However, in both cases, the increase is below 100 kbps in 87% of the cases and below 1 Mbps for the 97th percentile, and thus it does not seem prohibitively large when looking at the data rates in current deployments [57,59,61]. When it comes to the tail data rate increase, k_opt performs worse than k_range, which shows that learning a range instead of a single label acts as a regularization mechanism that improves generalizability.
Finally, although the graph search makes a slight difference for k_opt, it does not make any significant difference for k_range, which further supports the design decision of opting for a suboptimal but faster scheduler. Figure 12 compares DeepSHARQ's inference time with three previously published algorithms: SHARQ [24], Fast Search [23], and DeepHEC [23]. Although Fast Search outperforms any other model in 43% of the cases, it also has a tail delay, i.e., the largest inference time experienced by the algorithm, 6 orders of magnitude higher than DeepSHARQ's. The three other models trade some inference time in the lower percentiles to provide a much more predictable inference time over the complete dataset. More precise statistics on the inference time are collected in Table 5, which shows that not only does DeepSHARQ outperform the other algorithms in terms of the mean and median inference time, but its standard deviation is also at least two orders of magnitude smaller.

Inference time
SHARQ shows that a purely algorithmic solution is able to achieve high predictability. However, deep learning solutions are the only ones able to consistently achieve low delay (see DeepHEC and DeepSHARQ). DeepSHARQ outperforms all other models in terms of tail inference time and predictability thanks to i) its smaller neural network, which is the component consuming most of the delay budget, and ii) its simple schedule, which avoids spending precious time on finding a better schedule that nevertheless produces no significant RI reduction (see Sect. 7.3).
The results presented here show that modeling the problem purely with deep learning results in excessively large models. DeepHEC learns (k, p, N_C), and hence needs larger neural networks to achieve similar performance in terms of supported coding configurations (5 hidden layers and 250 neurons per layer, see [23]). DeepSHARQ halves the inference time compared to DeepHEC, and its tail delay is an order of magnitude smaller. The repair schedule construction is the only algorithmic component that remains in DeepHEC. Although it could be learned as well, e.g., by applying reinforcement learning, the neural network would be expected to grow even larger due to the increased complexity of the problem to solve. Combining both approaches, as DeepSHARQ does, simplifies the problem enough for small neural networks to learn it while minimizing the time spent deriving the remaining parameters and verifying constraint fulfillment. DeepSHARQ's average inference time is an order of magnitude smaller than the end-to-end delay requirements of tactile applications [7], so that it can react to channel changes faster than they are detected, even for the most demanding applications delay-wise. Two orders of magnitude of headroom until the target delay of the most demanding applications is reached leaves plenty of room for increased inference time when the algorithm is executed on more constrained devices.

Future work
While initial works have evaluated the coding configuration search in a concrete transport protocol [21], this and other recent articles have looked at the search problem in isolation. Future work would be to integrate DeepSHARQ into existing transport layer protocols (e.g., QUIC [43] or CoAP [65]) and evaluate its performance under changing channel conditions. These practical evaluations would assess the ability of the protocol to meet the application constraints in a changing environment, and they would also involve the use of embedded devices, testing whether they are capable of running the search in real time. In addition, we intend to further integrate DeepSHARQ with state-of-the-art loss detection algorithms [47,48].
In [29], we proved that a priori there is no single code that is optimal in terms of energy efficiency; rather, optimality depends on the hardware the code is executed on, as well as on channel conditions and application requirements. Given the recently increasing interest in energy-aware systems [66,67], a further line of research would involve making DeepSHARQ energy-aware as well. This involves changing the search problem to incorporate energy as an input metric, i.e., an application-defined energy limit per application packet, and as an output metric, i.e., how much energy a coding configuration demands.
From a theoretical communication perspective, a further line of research involves results on finite block coding [36]. These results allow computing the optimal block length based on channel parameters. Central to this is, however, the computation of the dispersion metric-something that has not been done practically in network protocols.

Conclusion
In this article, we presented DeepSHARQ, an approach to finding optimal hybrid error coding configurations in real time. Starting from the (computationally complex) search problem, we presented a decomposed search algorithm that improves the complexity of the search using algorithmic as well as learning-based methods. We proposed a new training methodology that, by exploiting the quantized nature of the HARQ configurations, improves the neural network performance in learning terms (i.e., achieved accuracy) as well as in communication terms (i.e., yielded configurations supported by the channel capacity). Our evaluations show that DeepSHARQ delivers both on the efficiency of the inferred configurations and on the speed of executing the inference. To the best of our knowledge, this is the best approach so far to finding coding configurations, and it makes it possible to execute this demanding task on devices with limited resources, a common trait of cyber-physical system hardware.
Author Contributions PG, KV, and TH developed the algorithms. PG, KV, AS, and TH conceptualized the experiments. PG, KV, and MM carried out the experiments. PG and AS wrote the main manuscript. KV contributed to Sect. 4. MM contributed to Sect. 6. All the authors have reviewed the text and agree with its content.
Funding Open Access funding enabled and organized by Projekt DEAL. This work was funded by the German Research Foundation (DFG) grants 315036956 as part of SPP 1914-e.LARN (see http://larn.systems) and 389792660 as part of TRR 248-CPEC (see https://perspicuous-computing.science).

Data availability statement
The datasets, algorithm implementation, and scripts used to generate the results presented here are available through the GitLab webpage of some of the authors (https://git.nt.unisaarland.de/open-access/deepsharq) as well as Zenodo (https://doi.org/10.5281/zenodo.8026445).

Conflict of interest
The authors have no relevant financial or nonfinancial interests to disclose. The authors have no conflicts of interest to declare that are relevant to the content of this article. All the authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.

Ethical approval Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A
The run-time complexity of the HARQ packet loss rate in Eq. 6 can be reduced to linear complexity if some optimizations are taken into account. Splitting the PLR into the two components in Eq. A1, it can be shown that each of the components can be calculated in O(k_max + log(p_max)) (see Eq. A2 and Eq. A4). Bear in mind that the derivations presented here consider the alternative expression for the probability of i packet losses in systematic MDS codes presented in Eq. A5: