1 Introduction

In recent years, the proliferation of Internet of Things (IoT) devices and the exponential growth of data generated at the edge of networks have fueled the emergence of edge computing as a promising paradigm for efficient data processing and analysis. Edge computing leverages the computational resources available at the network periphery, closer to where data is generated, to provide low-latency and high-throughput services for various applications, ranging from real-time analytics to augmented reality [1, 2].

Central to the efficacy of edge computing is the efficient allocation of computational tasks among edge devices and centralized servers, aiming to strike a balance between resource utilization, latency reduction, and energy efficiency. Particularly, in inference tasks, where predictive models are deployed to analyze data and make real-time decisions, the allocation of tasks to appropriate computational resources becomes crucial [3].

The landscape of artificial intelligence (AI) has witnessed remarkable advancements, driven by breakthroughs in AI training algorithms and the availability of powerful hardware accelerators. As a result, the deployment of multiple compact AI inference models at the edge has become increasingly prevalent. These models, characterized by their small size and efficient computational footprint, offer significant advantages in terms of resource utilization, latency reduction, and energy efficiency. By leveraging techniques such as model distillation, quantization, and pruning, researchers have been able to create highly optimized inference models that are well-suited for deployment on resource-constrained edge devices [4]. Furthermore, the proliferation of specialized hardware accelerators, such as GPUs, TPUs, and FPGAs, has further accelerated this trend, enabling edge devices to perform complex AI inference tasks with unprecedented speed and efficiency. As such, the availability of multiple compact AI inference models at the edge presents a unique opportunity to enhance the performance and scalability of edge computing systems while catering to the diverse needs of emerging edge applications [5,6,7,8].

In this paper, we address the challenging problem of inference task scheduling and offloading in edge computing environments characterized by time and energy constraints. Specifically, we consider scenarios where edge devices are equipped with multiple small-sized inference models, each varying in accuracy and inference times (see Fig. 1). The objective is to assign inference tasks to either local edge models or centralized edge server models in a manner that maximizes accuracy while adhering to stringent time and energy constraints imposed by the application domain.

The problem of inference task offloading under time and energy constraints in edge computing is inherently complex and challenging, with its computational complexity exacerbated by the NP-hard nature of the problem. As a combinatorial optimization problem, determining the optimal allocation of inference models for inference tasks between local models and edge server models involves exploring a vast search space of possible task assignments while simultaneously considering multiple conflicting objectives, such as maximizing accuracy, while respecting time and energy constraints. The NP-hardness of the problem implies that finding an exact solution within a reasonable amount of time becomes increasingly difficult as the size of the problem instance grows, rendering traditional exact algorithms impractical for real-world deployment [9].

The significance of solving this problem lies in its direct impact on the performance, scalability, and cost-effectiveness of edge computing systems across various application domains. For instance, in healthcare applications, such as remote patient monitoring and real-time diagnosis, timely and accurate inference results are critical for ensuring patient safety and well-being. Similarly, in autonomous vehicles and smart transportation systems, efficient inference task offloading can enhance decision-making processes, leading to improved traffic management, accident prevention, and passenger safety. Furthermore, in industrial IoT (IIoT) applications, such as predictive maintenance and quality control, the ability to offload inference tasks optimally can result in substantial cost savings, productivity gains, and operational efficiency improvements [10, 11].

The literature contains little research on inference task scheduling and offloading under constraints, mainly because the problem is new. The work in [12] focuses on selecting data that is likely to yield low inference accuracy and offloading it to an edge server under an energy constraint. This work points in the direction of the more general problem tackled in this paper. However, it uses only a single inference model and a single edge server, which is a special case of the broader problem. Additionally, the time dimension, which is important for real-time applications, is not addressed. The work in [13] covers more of the problem by exploiting the flexibility of deploying multiple inference models on the edge device while respecting a time constraint. It lacks the energy dimension, however, which is important for battery-powered edge devices. Furthermore, that system only handles offloading to a single edge server. This leaves a research gap: the broader problem, in which both time and energy constraints are respected while inference accuracy is maximized by assigning models from a set of local and edge server models to inference tasks, remains to be addressed.

In this paper, we propose a novel metaheuristic algorithm for addressing the challenge of inference task scheduling and offloading in edge computing environments under time and energy constraints. Our primary objective is to devise an efficient algorithm which offers decision-making strategies that dynamically assign inference tasks between local edge models and centralized edge server models, aiming to maximize accuracy while respecting real-time requirements and energy limitations. Through a combination of theoretical analysis, algorithm design, and empirical evaluations, we seek to provide a practical solution that optimizes the allocation of computational resources in edge computing systems, thereby advancing the state-of-the-art in edge computing research and facilitating the development of intelligent and energy-efficient edge applications.

Fig. 1 Inference model selection between local models and the edge server

The main contributions of this paper can be summarized as follows:

  • We formulate the problem of parallel inference model scheduling and offloading under time and energy constraints as an optimization problem, supported by accuracy, inference time and energy models.

  • We propose HGSTO, a hybrid genetic algorithm that combines the accuracy of the genetic algorithm (GA) with the speed of simulated annealing (SA), which helps accelerate the evolution process and converge to the best solution in fewer generations.

  • We perform experiments comparing HGSTO against GA, SA, and Particle Swarm Optimization (PSO), and demonstrate the efficiency and accuracy of HGSTO.

The rest of this paper is organized as follows. Section 2 presents the related works. In Sect. 3 we describe the system model. In Sect. 4 we propose HGSTO and explain the solution steps. In Sect. 5 we present the experiment setup and results in addition to analysis of the obtained results. Finally, we conclude this work in Sect. 6.

2 Related works

The problem of task offloading in edge computing has garnered significant attention in recent literature due to its crucial role in optimizing resource utilization and enhancing system performance. Various approaches have been proposed to address different aspects of task offloading [14]. For instance, works such as [15,16,17] focus on offloading decisions based on data characteristics and network conditions, aiming to minimize latency and maximize throughput. Similarly, [18,19,20] explore task offloading strategies considering energy consumption and device capabilities, with a focus on energy-aware scheduling algorithms. Other works, such as [21,22,23], target both energy and latency as objectives, aiming either to reduce both or to strike a compromise between them. These works, however, are not suitable for real-time applications where strict deadlines and energy budgets must be met.

Few works have explicitly considered time and energy constraints, either individually or in combination. For instance, [24,25,26] propose a time-aware task offloading framework that prioritizes tasks based on their deadlines and dynamically allocates resources to meet real-time requirements. Similarly, [27, 28] introduce an energy-aware task offloading approach that optimizes energy consumption by balancing workload distribution across edge devices and servers. Furthermore, [29,30,31] investigate the joint optimization of time and energy constraints in task offloading decisions, employing mathematical modeling techniques to formulate the problem as a multi-objective optimization task. These works underscore the importance of considering both time and energy constraints in task offloading decisions, highlighting their interdependence and the need for holistic optimization approaches in edge computing environments.

Various methods have been employed to address the task offloading problem in edge computing, reflecting the diverse nature of the challenges involved. Traditional optimization techniques, such as mixed linear programming (MLP) [32, 33] and branch and bound [33], have been utilized to formulate task allocation as a mathematical optimization problem, enabling the derivation of exact or near-optimal solutions under specific constraints. Additionally, machine learning approaches have gained prominence for their ability to adaptively learn and optimize task offloading decisions based on historical data and real-time observations [34, 35]. Furthermore, metaheuristic algorithms, including genetic algorithms, simulated annealing, and particle swarm optimization, have been applied to tackle the NP-hard nature of the task offloading problem by efficiently exploring the solution space and identifying high-quality solutions [36, 37]. Each of these methods offers distinct advantages and trade-offs, depending on the specific characteristics of the problem instance and the requirements of the edge computing environment.

Only two works in the literature try to improve the inference accuracy of an edge computing system through offloading to edge servers. In the first, the authors of [12] propose a data selection scheme for IoT devices in which an edge device offloads data that would likely lead to inaccurate inference if processed locally, thus improving the overall accuracy of the system. Their scheme performs the selection under a given energy constraint. It shows promising results; however, it does not consider a time constraint, which renders it unsuitable for real-time applications. In addition, it considers only a single inference model on the edge device, which offers few options for maximizing accuracy. In the second, the authors of [13] studied inference task offloading under a time constraint in a system where edge devices are equipped with multiple inference models varying in size and accuracy, in addition to an edge server. Their system leverages the fact that it has two parallel machines, namely the edge device and the edge server. They proposed \(AMR^2\), a scheduling scheme based on LP relaxation and rounding which considers all possible cases of scheduling two tasks between the edge device and the edge server. They relax the problem’s constraints to allow fractional values and then round to obtain the result. On the edge device, they use dynamic programming to schedule the tasks. Their scheme performed better than a greedy approach. However, their system does not take the energy constraint of the edge device into consideration. Although using the edge device’s onboard inference models along with offloading tasks to the edge server can reduce the total inference time of the system, on battery-powered edge devices it costs a significant amount of valuable energy.
These works leave a research opportunity for the more general and practical problem of considering both time and energy constraints while leveraging the flexibility of having multiple local inference models in addition to multiple edge servers. This problem is significantly more challenging; however, it better represents real-world applications.

As far as we know, no other work has considered the problem of inference task offloading to maximize the accuracy under both time and energy constraints. Therefore, this work is designed to fill this research gap by providing a practical solution to this problem.

3 System model

Table 1 Symbols and notations
Fig. 2 HGSTO system model (Color figure online)

Consider a system (see Fig. 2) where each edge device is equipped with a set of local inference models denoted as \({\mathcal {L}} = \{1,\ldots ,L\}\). Every edge node is connected to a set of edge servers \({\mathcal {E}}\), each equipped with a single inference model. The set of server inference models is denoted as \({\mathcal {S}}=\{1,\ldots ,S\}\). An edge node has access to the set of inference models \({\mathcal {M}}=\{1,\ldots ,M\} = {\mathcal {S}} \cup {\mathcal {L}}\), consisting of both the server and the local models. At each time slot, the edge node receives a set of inference tasks denoted as \({\mathcal {J}} = \{1,\ldots ,J\}\).

3.1 Inference accuracy

The inference models deployed in edge nodes can be a single tunable model, where adjusting the input hyperparameters changes the accuracy and inference time. Another option is to deploy multiple instances of the same type of model with different internal structures, such as layer sizes in the case of Deep Neural Networks (DNNs). Alternatively, different types of models varying in size and average top-1 accuracy can be used. Since the real top-1 accuracy of each model for a given inference task is not known before the inference is performed, we use the average accuracy estimated from historical top-1 accuracy measurements. The average top-1 accuracy of model i is denoted as \(A_i\), where \(i \in {\mathcal {M}}\). The average top-1 accuracy of each edge server model is assumed to be significantly higher than that of every local inference model on the edge device:

$$\begin{aligned} A_j > A_i \quad \forall i \in {\mathcal {L}}, j \in {\mathcal {S}} \end{aligned}$$

3.2 Time delay model

Similar to the estimated average accuracy, we estimate the average inference time of each model i, denoted as \(T^{\text {inf}}_i\) where \(i \in {\mathcal {M}}\), by averaging historical measured inference times. Data preprocessing time is considered part of \(T^{\text {inf}}_i\).

Let \(T^{\text {lat}}_i\) be the average communication latency of edge server i, which can be estimated from historical measurements and is updated after every transmission. We define \(T^{\text {off}}_{ij}\) as the estimated time to offload task j, where \(j \in {\mathcal {J}}\), to edge server i. \(T^{\text {off}}_{ij}\) can be estimated from the connection bandwidth and the size of task j, denoted as \(size_j\), and is given by:

$$\begin{aligned} T^{\text {off}}_{ij} = \frac{size_j}{b_i} + T^{\text {lat}}_i \end{aligned}$$

where \(b_i\) is the bandwidth of the communication channel between the edge device and edge server i. The value of \(b_i\) can be measured and updated regularly by probing the edge server.

We define \(T^{\text {task}}_{ij}\) as the total time it takes to process task j using model i including inference and offloading times.

$$\begin{aligned} T^{\text {task}}_{ij} &= T^{\text {inf}}_i \quad \forall i \in {\mathcal {L}}, \; j \in {\mathcal {J}} \\ T^{\text {task}}_{ij} &= T^{\text {inf}}_i+T^{\text {off}}_{ij}+T^{\text {resp}}_i \quad \forall i \in {\mathcal {S}}, \; j \in {\mathcal {J}} \end{aligned}$$

where \(T^{\text {resp}}_i\) represents the average response time from edge server i given by:

$$\begin{aligned} T^{\text {resp}}_i = \frac{size_r}{b_i} + T^{\text {lat}}_i \end{aligned}$$

where \(size_r\) is a constant representing the response size.
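The per-task time model above can be illustrated with a minimal Python sketch. The `Server` container, the function names and the numeric values in the usage note are hypothetical and chosen only for illustration; the formulas follow the definitions of \(T^{\text {off}}_{ij}\), \(T^{\text {resp}}_i\) and \(T^{\text {task}}_{ij}\) given above.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Estimated parameters of one edge server (hypothetical container)."""
    bandwidth: float  # b_i, data units per time unit
    latency: float    # T^lat_i, average communication latency
    t_inf: float      # T^inf_i, average inference time of the server model


def offload_time(size_j: float, server: Server) -> float:
    # T^off_ij = size_j / b_i + T^lat_i
    return size_j / server.bandwidth + server.latency


def response_time(size_r: float, server: Server) -> float:
    # T^resp_i = size_r / b_i + T^lat_i
    return size_r / server.bandwidth + server.latency


def task_time_server(size_j: float, size_r: float, server: Server) -> float:
    # T^task_ij for i in S: inference + offloading + response
    return server.t_inf + offload_time(size_j, server) + response_time(size_r, server)


def task_time_local(t_inf_local: float) -> float:
    # T^task_ij for i in L: only the local inference time
    return t_inf_local
```

For example, with an assumed bandwidth of 10, latency of 0.5, server inference time of 2.0, task size of 20 and response size of 5, the offload time is 2.5, the response time is 1.0, and the total task time on the server is 5.5 time units.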

Let \(x_{ij} \in \{0, 1\}\) be a binary variable indicating whether inference model i is assigned to inference task j. We define \(T^{\text {slot}}_k\) as the total time to process a complete time slot \(k \in \Delta\). Since local inference and offloading are performed in parallel, \(T^{\text {slot}}_k\) is the maximum of the total local inference time, denoted as \(T^{\text {local}}_k\), and the total server time, denoted as \(T^{\text {server}}_k\), which includes the offloading, inference and response times of all offloaded tasks.

$$\begin{aligned} T^{\text {slot}}_k = \max (T^{\text {local}}_k, T^{\text {server}}_k) \quad \forall k \in \Delta \end{aligned}$$

where

$$\begin{aligned} T^{\text {local}}_k = \sum _{i=1}^{L}\sum _{j=1}^{J} x_{ij}T^{\text {task}}_{ij} \end{aligned}$$

\(T^{\text {server}}_k\) is calculated using Algorithm 1.

Algorithm 1 Steps to calculate \(T^{\text {server}}_k\)

\(T^{\text {server}}_i\) is the total offload, inference and response times for all tasks offloaded to edge server with model i. \(T^{\text {off}}\_{\text {accu}}\) is a variable to accumulate and track offload times for all edge servers. \(T^{\text {inf}}\_{\text {accu}}_{i}\) is a variable that accumulates and tracks inference times for each edge server with model i.

Fig. 3 Edge device model

The edge devices, including normal edge nodes and edge servers, are assumed to have two queues as depicted in Fig. 3, one for computation and another for communication. This allows inference to run in parallel with offloading or receiving. In Fig. 4, we show an example schedule for 10 inference tasks where 4 tasks are processed using local inference models and 6 are offloaded to 3 different edge servers. At \(t_1\) we have:

$$\begin{aligned} T^{\text {off}}\_{\text {accu}} &= T^{\text {off}}_{1,1} \\ T^{\text {inf}}\_{\text {accu}}_{1} &= \max (T^{\text {inf}}\_{\text {accu}}_{1}, T^{\text {off}}\_{\text {accu}}) + T^{\text {inf}}_{1} \\ &= T^{\text {off}}\_{\text {accu}} + T^{\text {inf}}_{1} = T^{\text {off}}_{1,1} + T^{\text {inf}}_{1} \\ T^{\text {server}}_1 &= \max (T^{\text {server}}_1, T^{\text {inf}}\_{\text {accu}}_{1})+T^{\text {resp}}_1 \\ &= T^{\text {inf}}\_{\text {accu}}_{1}+T^{\text {resp}}_{1} = T^{\text {off}}_{1,1} + T^{\text {inf}}_{1}+T^{\text {resp}}_{1} \end{aligned}$$

Following the steps in Algorithm 1 on the example shown in Fig. 4, we find \(T^{\text {server}}_i\) values shown in Table 2. The value of \(T^{\text {server}}\) is taken as the maximum value of all edge server total times. Finally, the total slot time \(T^{\text {slot}}\) is found by taking the larger value between total time of local inference \(T^{\text {local}}\) and \(T^{\text {server}}\).

$$\begin{aligned} T^{\text {server}} = \max (T^{\text {server}}_1, T^{\text {server}}_2, T^{\text {server}}_3) = T^{\text {server}}_2 \end{aligned}$$
$$\begin{aligned} T^{\text {slot}} = \max (T^{\text {local}}, T^{\text {server}}_2) = T^{\text {server}}_2 \end{aligned}$$
Table 2 Example values of \(T^{\text {server}}_i\) for time steps 2–5
Fig. 4 An example of a schedule for 3 edge servers and 10 inference tasks (Color figure online)
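The accumulation pattern used in the worked example at \(t_1\) can be sketched in Python as follows. This is a reconstruction from the update equations and variable descriptions in this section, not a transcription of Algorithm 1. It assumes that offloaded tasks are transmitted one after another over a single communication queue in the order given, and that all accumulators start at zero. The function and argument names are hypothetical.

```python
def server_time(offloaded, t_off, t_inf, t_resp):
    """Estimate T^server for one slot.

    offloaded: list of (task j, server i) pairs in transmission order.
    t_off[(i, j)]: offload time of task j to server i.
    t_inf[i], t_resp[i]: inference and response times of server i.
    Returns (T^server, dict of T^server_i per server).
    """
    off_accu = 0.0     # T^off_accu: shared by all servers (one uplink queue)
    inf_accu = {}      # T^inf_accu_i: when server i finishes its latest inference
    server_total = {}  # T^server_i: when server i has returned its latest response

    for j, i in offloaded:
        # The task must be fully transmitted before it can be processed.
        off_accu += t_off[(i, j)]
        # Inference starts once the data has arrived and the server is free.
        inf_accu[i] = max(inf_accu.get(i, 0.0), off_accu) + t_inf[i]
        # The response follows the inference and any earlier response.
        server_total[i] = max(server_total.get(i, 0.0), inf_accu[i]) + t_resp[i]

    return max(server_total.values(), default=0.0), server_total


def slot_time(t_local, t_server):
    # T^slot = max(T^local, T^server): local and offloaded work run in parallel.
    return max(t_local, t_server)
```

For the first offloaded task, the loop reduces to \(T^{\text {off}}_{1,1} + T^{\text {inf}}_{1} + T^{\text {resp}}_{1}\), which matches the expansion shown for \(t_1\).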

3.3 Inference energy

Let \(E^{\text {off}}_{ij}\) be the energy cost of offloading task j to edge server i. \(E^{\text {off}}_{ij}\) depends on the offload time \(T^{\text {off}}_{ij}\) and on \(c_i\), the average energy cost of transmitting data to edge server i in one time unit. \(c_i\) depends on several factors. The communication medium, such as Wi-Fi, cellular, Bluetooth, or Zigbee, affects energy consumption: each medium has different power requirements, data rates, and transmission ranges, which influence the overall energy cost. The energy consumption of a wireless device is significantly influenced by its transmission power level. Typically, higher transmit power levels lead to increased energy usage, particularly when maintaining communication over longer distances or in challenging environments with obstacles or interference. Additionally, the power consumed by the device while idle or in standby mode adds to the overall energy cost. Furthermore, signal strength and quality play crucial roles in determining energy consumption, particularly in systems where transmission power adapts to signal conditions. In scenarios where reliable communication is essential, such as environments with weak or noisy signals, higher power levels may be necessary, resulting in higher energy consumption. Moreover, environmental factors like interference, obstacles, and electromagnetic noise can further affect energy consumption by influencing signal propagation and reception quality. Common optimization techniques used to reduce energy consumption during wireless communication, such as data compression, packet aggregation, adaptive modulation, and power control algorithms, must also be taken into consideration because they significantly affect energy consumption.

We assume that \(c_i\) can be calculated internally by monitoring battery usage and the network adapter’s configurations such as the transmission power. By averaging these measured power usage metrics we can estimate \(c_i\). \(E^{\text {off}}_{ij}\) is given by:

$$\begin{aligned} E^{\text {off}}_{ij} = \frac{T^{\text {off}}_{ij}}{c_i} \end{aligned}$$

Similarly, the energy cost of the inference response denoted by \(E^{\text {resp}}_i\) is given by:

$$\begin{aligned} E^{\text {resp}}_i = \frac{T^{\text {resp}}_{i}}{c_i} \end{aligned}$$

Let \(E^{\text {inf}}_i\) be the average energy cost of performing the inference of an inference task using model i. The inference energy cost is negligible compared to the offloading energy cost. Therefore it is defined as a constant which can be estimated using the inference time and the maximum power consumption of the edge device’s CPU in the worst case under full load. Let \(E^{\text {task}}_{ij}\) be the total energy cost of processing task j using model i given by:

$$\begin{aligned} E^{\text {task}}_{ij} &= E^{\text {inf}}_i \quad \forall i \in {\mathcal {L}}, \; j \in {\mathcal {J}} \\ E^{\text {task}}_{ij} &= E^{\text {inf}}_i+E^{\text {off}}_{ij}+E^{\text {resp}}_i \quad \forall i \in {\mathcal {S}}, \; j \in {\mathcal {J}} \end{aligned}$$

Finally, we define \(E^{\text {slot}}_k\) as the total energy consumption for slot k given by:

$$\begin{aligned} E^{\text {slot}}_k = \sum _{i=1}^{M}\sum _{j=1}^{J}x_{ij}E^{\text {task}}_{ij} \end{aligned}$$

3.4 Problem formulation

In this section, we model the allocation of inference models to inference tasks under time and energy constraints, with the goal of maximizing accuracy, as a single-objective optimization problem. The optimization problem is formulated as follows:

$$\begin{aligned} Maximize\quad A^{\text {slot}}_k = \sum _{i=1}^{M}\sum _{j=1}^{J} x_{ij}A_i \end{aligned}$$
(1)

where \(A^{\text {slot}}_k\) is the total accuracy for time slot k. Given E and T as the energy and time constraints respectively, and with \(E^{\text {slot}}_k\) the total energy consumption of slot k, Eq. (1) is subject to:

$$\begin{aligned} T^{\text {slot}}_{k} \le T \quad \forall k \in \Delta \end{aligned}$$
(2)
$$\begin{aligned} E^{\text {slot}}_k \le E \quad \forall k \in \Delta \end{aligned}$$
(3)
$$\begin{aligned} \sum _{i=1}^{M}x_{ij} = 1 \quad \forall j \in {\mathcal {J}} \end{aligned}$$
(4)

Eq. (2) guarantees that the parallel processing time of each slot respects the time constraint. Similarly, Eq. (3) ensures that the energy consumption of each time slot respects the given energy constraint. Finally, Eq. (4) guarantees that each inference task is assigned exactly one inference model, which produces a complete solution.
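As an illustration of the formulation, the following Python sketch evaluates the objective of Eq. (1) and checks constraints (2) and (3) for a candidate schedule. It assumes that a schedule is encoded as a list in which position j holds the (zero-based) index of the model assigned to task j, so that constraint (4) holds by construction. The function names and the layout of the `e_task` table are hypothetical.

```python
def slot_accuracy(assignment, accuracy):
    # A^slot = sum of A_i over the model chosen for each task (Eq. 1).
    return sum(accuracy[i] for i in assignment)


def slot_energy(assignment, e_task):
    # E^slot = sum of E^task_ij over the chosen (model, task) pairs.
    # e_task[i][j] is the energy cost of running task j with model i.
    return sum(e_task[i][j] for j, i in enumerate(assignment))


def is_feasible(t_slot, e_slot, t_max, e_max):
    # Constraints (2) and (3): stay within the time and energy budgets.
    return t_slot <= t_max and e_slot <= e_max
```

Here `t_slot` is assumed to be computed separately from the time delay model, since it depends on the parallel execution of local and offloaded tasks rather than on a simple sum.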

This problem can be thought of as an instance of the classic knapsack problem (KP), in which we fill our schedule (i.e., the knapsack) with inference models (i.e., pieces) to maximize accuracy (i.e., profit) while respecting the time and energy constraints (i.e., the knapsack's weight and volume capacities). Here, the pieces and the knapsack have two dimensions, so the problem is an instance of the multi-dimensional KP. Additionally, inference models (i.e., pieces) can be reused when constructing a schedule, so the problem becomes an instance of the unbounded multi-dimensional KP (UMdKP). However, we consider parallel schedules in which inference tasks can be processed locally and on edge servers at the same time. In UMdKP terms, this means that pieces can overlap in the weight dimension (i.e., time) but not in the volume dimension (i.e., energy), so the UMdKP formulation does not apply directly. Alternatively, a separate knapsack can be considered for each edge server, which makes the problem an instance of the multi-knapsack problem (MKP). However, the problem then becomes much more difficult to model, especially when trying to uphold the time constraint over all knapsacks.

4 A hybrid genetic algorithm for selective inference task offloading (HGSTO)

In this section we propose a hybrid genetic algorithm for selective task offloading (HGSTO) using genetic algorithms (GA) and simulated annealing (SA). The rationale for incorporating GAs into our methodology stems from their inherent capability to efficiently explore complex, multidimensional solution spaces, particularly in optimization problems characterized by diverse and interrelated variables. Despite their effectiveness in traversing vast search spaces, GAs are susceptible to convergence towards local minima, potentially limiting the discovery of globally optimal solutions. To mitigate this limitation, we augment the GA framework with a simulated annealing local search step. Simulated annealing, renowned for its ability to escape local optima through probabilistic transitions, offers a complementary mechanism to the deterministic exploration of GAs. By incorporating SA as a local search mechanism within the GA framework, we enhance the algorithm’s ability to escape local minima while accelerating the evolution process. This synergistic combination harnesses the accuracy of GAs for multidimensional problems while leveraging the speed and robustness of SA to prevent premature convergence and promote the discovery of high-quality solutions [38, 39]. The main steps of HGSTO are illustrated in Fig. 5.

Fig. 5 HGSTO steps flow diagram

4.1 Population initialization

Consider a population of solutions (i.e., schedules) \({\mathcal {P}}\) of size P, where each solution s is a vector of model indices, one per task, with every element \(i \in {\mathcal {M}}\) and \(|s| = |{\mathcal {J}}|\). The initial population can greatly affect the convergence speed of the genetic algorithm, since broad coverage of the solution space is critical to finding the best solution. Furthermore, increasing the population density in promising areas can also accelerate convergence. However, our main goal in this work is to design a fast algorithm, which makes complex initialization schemes unsuitable. We experimented with multiple initialization schemes, including Latin hypercube sampling and uniform random initialization, and found that these schemes only add runtime overhead without improving either the accuracy or the convergence speed of the genetic algorithm. Therefore, our solution uses a simple pseudo-random number generator based on a modified version of Donald E. Knuth’s subtractive random number generator algorithm [40].
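A minimal Python sketch of this kind of initialization is shown below. It draws every gene independently from a pseudo-random generator. Note that Python's `random` module is based on the Mersenne Twister, not on the modified subtractive generator used in this work, so the sketch only illustrates the solution encoding. The function name and the zero-based model indices are assumptions.

```python
import random


def init_population(pop_size, num_tasks, num_models, seed=None):
    """Create pop_size random schedules.

    Each schedule is a list of length num_tasks whose j-th entry is the
    index (0 .. num_models - 1) of the model assigned to task j.
    """
    rng = random.Random(seed)  # local generator, so runs can be reproduced
    return [
        [rng.randrange(num_models) for _ in range(num_tasks)]
        for _ in range(pop_size)
    ]
```

Because every task receives exactly one model index, each generated schedule is a complete solution in the sense of Eq. (4), although it may still violate the time or energy constraint.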

4.2 Fitness evaluation

The fitness function f of a solution \(s_k\) for a time slot k is calculated using \(T^{\text {slot}}_k\), \(A^{\text {slot}}_k\) and \(E^{\text {slot}}_k\) in addition to time and energy constraints denoted as T and E respectively. We define \(f(s_k)\) as follows:

$$\begin{aligned} \delta _T = \frac{T^{\text {slot}}_k - T}{T} \\ \delta _E = \frac{E^{\text {slot}}_k - E}{E} \end{aligned}$$
$$\begin{aligned} \omega _T = {\left\{ \begin{array}{ll} 1 &{} \text {if} \quad (\delta _T < 0) \quad \text {and} \quad (|\delta _E| \le \delta _{min}) \\ 1 - |\delta _T| &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} \omega _E = {\left\{ \begin{array}{ll} 1 &{} \text {if} \quad (\delta _E < 0 ) \quad \text {and} \quad (|\delta _T| \le \delta _{min}) \\ 1 - |\delta _E| &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)
$$\begin{aligned} \omega = \frac{1}{2}\omega _T + \frac{1}{2}\omega _E \end{aligned}$$
(7)
$$\begin{aligned} f(s_k) = \omega A^{\text {slot}}_k \end{aligned}$$
(8)

\(\delta _T\) and \(\delta _E\) represent the distance ratios of \(T^{\text {slot}}_k\) and \(E^{\text {slot}}_k\) from the given constraints T and E respectively. \(\delta _{min}\) is the distance ratio below which we consider a constraint to be met. \(\omega _T\) and \(\omega _E\) are the time and energy penalty weights respectively. The penalties scale with the distance (in either direction) from the given constraints, which pushes solutions to minimize that distance and thus make full use of the available time and energy budgets. In some cases, however, a single constraint is the limiting one. Eqs. (5) and (6) handle this case: when one constraint is within \(\delta _{min}\) of its limit, the penalty for the other constraint is dropped as long as that constraint is respected, so solutions are not forced to exhaust both budgets.
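Eqs. (5) to (8) translate directly into a short Python function, sketched below. The default value of `delta_min` is an arbitrary placeholder chosen for illustration and is not a value taken from this work. The argument names are hypothetical.

```python
def fitness(a_slot, t_slot, e_slot, t_max, e_max, delta_min=0.05):
    """Penalized fitness f(s_k) = omega * A^slot_k (Eqs. 5-8)."""
    # Relative distance of the slot time and energy from their budgets.
    delta_t = (t_slot - t_max) / t_max
    delta_e = (e_slot - e_max) / e_max

    # Eq. (5): no time penalty if time is within budget and energy sits
    # at its limit (within delta_min); otherwise scale with the distance.
    if delta_t < 0 and abs(delta_e) <= delta_min:
        omega_t = 1.0
    else:
        omega_t = 1.0 - abs(delta_t)

    # Eq. (6): the symmetric rule for the energy penalty.
    if delta_e < 0 and abs(delta_t) <= delta_min:
        omega_e = 1.0
    else:
        omega_e = 1.0 - abs(delta_e)

    omega = 0.5 * omega_t + 0.5 * omega_e  # Eq. (7)
    return omega * a_slot                  # Eq. (8)
```

For instance, a schedule with total accuracy 10 that exceeds the time budget by 20% and uses only half of the energy budget receives \(\omega _T = 0.8\), \(\omega _E = 0.5\), and therefore a fitness of about 6.5. A schedule that uses half of the time budget while exactly meeting the energy budget keeps its full accuracy as fitness.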

4.3 Termination condition

The termination condition of a GA is a critical stopping criterion that determines when the optimization process should stop. Common termination conditions include reaching a specified number of max generations, achieving a desired level of fitness improvement, or surpassing a predefined computational budget. Choosing the appropriate termination condition is crucial to the performance of HGSTO. Therefore, we use a solution stagnation monitoring method to halt the evolution when the algorithm fails to produce significantly improved solutions over a certain number of generations, indicating that further iterations are unlikely to yield substantial improvements. By monitoring solution stagnation, the GA can dynamically adjust its exploration-exploitation trade-off, allocating computational resources more efficiently towards promising regions of the search space. This termination condition helps prevent the algorithm from wasting computational resources on unproductive iterations, thereby improving convergence speed and solution quality. The fitness of the top-1 ranking solution of each generation is compared against solutions from the last N generations to determine whether there is any significant improvement [41]. The termination condition steps are presented in Algorithm 2.

Algorithm 2 Termination condition
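One possible way to implement such a stagnation check is sketched below in Python. It is an illustrative interpretation of the description above rather than a reproduction of Algorithm 2. The class name, the `window` parameter (playing the role of N) and the `min_improvement` threshold are assumptions.

```python
from collections import deque


class StagnationMonitor:
    """Signals a stop when the best fitness has stalled for `window` generations."""

    def __init__(self, window, min_improvement):
        self.history = deque(maxlen=window)  # best fitness of the last N generations
        self.min_improvement = min_improvement

    def should_stop(self, best_fitness):
        # Stalled only if a full window exists and the new best does not
        # improve on any of the remembered values by at least the threshold.
        stalled = len(self.history) == self.history.maxlen and all(
            best_fitness - past < self.min_improvement for past in self.history
        )
        self.history.append(best_fitness)
        return stalled
```

In a GA loop, `should_stop` would be called once per generation with the fitness of the top-ranked solution. With `window=3` and `min_improvement=0.01`, the sequence of best fitness values 1.0, 1.5, 1.8, 1.9, 1.9, 1.9, 1.9 triggers the stop only at the last value, once three consecutive generations have shown no significant gain.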

4.4 Neighborhood exploration

GAs are known to be susceptible to falling into local minima due to their reliance on deterministic mechanisms for solution exploration and exploitation. As GAs iteratively evolve a population of candidate solutions through processes such as selection, crossover, and mutation, they tend to converge towards regions of the search space that offer locally optimal solutions. However, this deterministic exploration can inadvertently lead to premature convergence, trapping the algorithm in suboptimal solutions known as local minima. To mitigate this limitation, incorporating simulated annealing as a local search (i.e., neighborhood exploration) method offers a complementary approach. SA introduces stochasticity into the optimization process, allowing the algorithm to explore regions beyond local optima with a certain probability, thereby escaping local minima. By leveraging SA’s probabilistic transitions, GAs can effectively navigate complex solution spaces, balancing exploration and exploitation to promote the discovery of globally optimal solutions. This synergistic combination of GA and SA not only enhances the algorithm’s robustness against local minima but also accelerates convergence towards high-quality solutions in challenging optimization problems [39].

Simulated annealing (SA) is an optimization technique inspired by the physical process of annealing in metallurgy, where a material is heated and then slowly cooled to reach a low-energy state. In the context of optimization, SA iteratively explores the solution space by mimicking this annealing process. The steps of simulated annealing are presented in Algorithm 4. The algorithm starts from an initial solution and sets parameters such as the initial temperature \(t_0\) and the cooling rate \(c_r\). The initial temperature controls the level of exploration in the early stages of the algorithm, while the cooling rate determines how the temperature decreases over iterations. At each iteration, we generate a neighboring solution \(s_{new}\) by applying a perturbation to the current solution, as shown in Algorithm 3. This perturbation could involve randomly modifying one or more components of the solution, such as swapping two elements or perturbing values within a certain range. \(s_{new}\) is then evaluated and compared against the current solution \(s_{current}\): if its fitness value \(f(s_{new})\) is better than \(f(s_{current})\), it is accepted as the current solution. Otherwise, it is accepted with the probability given by Eq. (9), which depends on the difference in fitness values and the current temperature. The temperature is then updated according to the cooling rate \(c_r\), which controls the rate at which the algorithm transitions from exploration to exploitation, gradually reducing the likelihood of accepting worse solutions as the algorithm progresses [42].

Algorithm 3 SA get neighbor algorithm

Algorithm 4 SA algorithm

$$\begin{aligned} p_{accept}(e_{old}, e_{new}, t) = {\left\{ \begin{array}{ll} 1.0 &{} \text {if} \quad (e_{new} > e_{old}) \\ e^{\frac{e_{new} - e_{old}}{t}} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)
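The acceptance rule of Eq. (9) and the cooling loop described above can be sketched as follows. This is a minimal illustration and not the authors' Algorithm 3 or 4: the neighbor move (reassigning one randomly chosen task to a random model index), the geometric cooling \(t \leftarrow c_r \cdot t\), the temperature floor `t_min`, and all function and parameter names are assumptions. Fitness is maximized, as in Eq. (9).

```python
import math
import random


def p_accept(e_old: float, e_new: float, t: float) -> float:
    """Acceptance probability of Eq. (9) for a maximized fitness."""
    if e_new > e_old:
        return 1.0
    return math.exp((e_new - e_old) / t)


def get_neighbor(solution, rng, n_models=4):
    """Assumed perturbation: reassign one random task to a random model index."""
    neighbor = list(solution)
    neighbor[rng.randrange(len(neighbor))] = rng.randrange(n_models)
    return neighbor


def simulated_annealing(s0, fitness, neighbor_fn, t0=1.0, cr=0.95,
                        iters=200, t_min=1e-12, rng=None):
    """Return the best solution seen and its fitness."""
    rng = rng or random.Random()
    s_cur, e_cur = list(s0), fitness(s0)
    s_best, e_best = s_cur, e_cur
    t = t0
    for _ in range(iters):
        s_new = neighbor_fn(s_cur, rng)
        e_new = fitness(s_new)
        # Better solutions are always accepted; others with probability Eq. (9).
        if rng.random() < p_accept(e_cur, e_new, t):
            s_cur, e_cur = s_new, e_new
            if e_cur > e_best:
                s_best, e_best = s_cur, e_cur
        # Geometric cooling (assumed schedule), floored to avoid division by zero.
        t = max(t * cr, t_min)
    return s_best, e_best
```

With a high temperature the exponential term is close to 1 and most worse neighbors are accepted; as the temperature decays, the same fitness drop is accepted less and less often, which is the shift from exploration to exploitation described in the text.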

4.5 Reproduction process

A crossover operator plays a crucial role in generating new offspring solutions by combining information from two parent solutions. One common approach is the uniform crossover operator, where the offspring solution o is constructed by randomly selecting values from each parent solution with equal probability. In this process, for each inference model index in the offspring solution, a random choice is made between the corresponding values from the two parents \(p_1\) and \(p_2\). This crossover mechanism promotes diversity in the offspring population by recombining genetic information from different parent solutions, potentially leading to the discovery of novel and high-quality solutions in the search space. The parents are selected from the population using tournament selection, where a number \(\psi\) of individuals, referred to as tournament size, are randomly chosen from the population. These individuals then compete against each other, and the one with the highest fitness value is selected as a parent. This process is repeated until the desired number of parents for mating is reached. Tournament selection offers several advantages, including simplicity of implementation, computational efficiency, and robustness against premature convergence. By allowing weaker individuals to participate in the tournament, tournament selection maintains diversity in the population, enabling the algorithm to explore a wider range of solutions. Additionally, the tournament size parameter allows for control over the selection pressure, with larger tournaments favoring stronger individuals and smaller tournaments promoting diversity.
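Tournament selection and uniform crossover as described above can be sketched as follows. This is a minimal illustration under assumptions: a solution is a list of inference model indexes, fitness is maximized, and the function names and the `rng` argument are hypothetical.

```python
import random


def tournament_select(population, fitnesses, psi, rng):
    """Pick psi distinct random individuals and return the fittest one."""
    contenders = rng.sample(range(len(population)), psi)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def uniform_crossover(p1, p2, rng):
    """Build an offspring whose genes come from p1 or p2 with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
```

A larger `psi` makes it more likely that the fittest individuals are among the contenders, which raises selection pressure; with `psi` equal to the population size the best individual always wins.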

The newly generated offspring undergo probabilistic mutations to keep the population diverse and help nudge solutions out of local minima. A mutation probability parameter \(\delta\) determines whether an offspring undergoes a mutation. A higher mutation probability increases the chances of introducing random changes, which can be beneficial for exploration but may also disrupt promising solutions. Conversely, a lower mutation probability might lead to slower exploration but can help preserve promising solutions. We use a fading parameter \(\gamma\) to reduce the mutation probability \(\delta\) over iterations. This initially promotes exploration, with a higher mutation rate that helps discover diverse regions of the solution space, and gradually shifts towards exploitation and refinement of promising solutions.

Instead of applying random mutations to the offspring, we use a hill climbing algorithm (see Algorithm 6), so that mutations move the offspring towards fitter solutions rather than perturbing them arbitrarily. This further accelerates the evolution of solutions, which converge in fewer iterations.
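A mutation step that is gated by a fading probability and realized through hill climbing can be sketched as follows. This is not the authors' Algorithm 6: the single-gene move, the number of attempts `steps`, the multiplicative fading \(\delta \leftarrow \gamma \cdot \delta\), and all function names are assumptions. Fitness is assumed to be maximized.

```python
import random


def hill_climb(solution, fitness, n_models, steps, rng):
    """Try single-gene changes and keep only those that raise fitness."""
    best, best_fit = list(solution), fitness(solution)
    for _ in range(steps):
        candidate = list(best)
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_models)
        cand_fit = fitness(candidate)
        if cand_fit > best_fit:
            best, best_fit = candidate, cand_fit
    return best


def mutate(offspring, fitness, delta, n_models, rng, steps=5):
    """Apply hill climbing with probability delta, else return a copy."""
    if rng.random() < delta:
        return hill_climb(offspring, fitness, n_models, steps, rng)
    return list(offspring)


def faded(delta, gamma):
    """Assumed multiplicative fading of the mutation probability."""
    return delta * gamma
```

Because only improving moves are kept, the mutated offspring is never worse than the original, which is the property that distinguishes this step from a purely random mutation.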

We employ an elitism mechanism to preserve the best solutions from one generation to the next. In elitism, a number \(\varphi\) of the fittest individuals from the current generation are carried over directly to the next generation without undergoing any genetic operations, such as crossover or mutation. This ensures that the best solutions found so far are not lost during the evolution process, which in turn accelerates convergence.
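Elitism as described above amounts to copying the \(\varphi\) fittest individuals into the next generation unchanged. A minimal sketch, assuming list-based solutions and a maximized fitness (the function name is hypothetical):

```python
def select_elites(population, fitnesses, phi):
    """Return copies of the phi fittest individuals, best first."""
    ranked = sorted(range(len(population)),
                    key=lambda i: fitnesses[i], reverse=True)
    return [list(population[i]) for i in ranked[:phi]]
```

The next generation would then consist of these elites followed by enough offspring to restore the population size.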

Putting all steps together, we provide HGSTO's pseudocode in Algorithm 5.

Algorithm 5 HGSTO algorithm

Algorithm 6 Hill climbing algorithm

4.6 Complexity analysis of HGSTO

Assuming the fitness function has a time complexity of O(F) for a single solution, we can analyse the impact of evaluating the fitness function on the overall time complexity of the GA with SA as a local search by finding the complexity of each algorithm separately. Let G be the number of generations, P be the population size, J be the number of inference model indexes per solution, which is also the number of inference tasks per time slot, A be the number of iterations in SA, and F be the time complexity of the fitness function for a single solution.

Genetic algorithm operations: selection, crossover, and mutation take \(O(P \times J)\) per generation, and fitness evaluation takes \(O(P \times F)\) per generation. SA is performed once per generation with A iterations. Each iteration involves generating a neighboring solution, evaluating its fitness, and accepting or rejecting it based on a probabilistic criterion. Over A iterations, the time complexity of SA per generation is therefore approximately \(O(A \times (J + F))\).

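Combining the time complexities of the genetic algorithm operations and simulated annealing gives \(O(G \times (P \times (J + F) + A \times (J + F)))\). Assuming that evaluating a solution requires reading all J of its entries, F is at least linear in J, so the \(P \times J\) term is absorbed by \(P \times F\), and the overall time complexity of the GA with SA as a local search is approximately \(O(G \times (P \times F + A \times (J + F)))\).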

It is important to note that this is the time complexity of the worst-case scenario, in which the termination criterion is never triggered. In practice, HGSTO converges before reaching G generations in most cases.

5 Experimental results

In this section, we assess the performance of the proposed HGSTO algorithm through testbed experiments. The following algorithms are examined as reference points.

First, Particle Swarm Optimization (PSO) is one of the classic metaheuristics used in the literature for the MdKP [43, 44]. It is particularly renowned for its efficacy in exploring high-dimensional and complex search spaces, which makes it a suitable baseline method. PSO mimics the social behavior of bird flocks or fish schools, where individuals, known as particles, navigate the search space by iteratively adjusting their positions based on their own experiences and the collective knowledge of the swarm. This cooperative search mechanism enables PSO to efficiently explore diverse regions of the search space, adapt to dynamic environments, and converge towards optimal solutions. By comparing the performance of our proposed algorithm against PSO, we can assess its effectiveness and competitiveness in addressing the optimization problem at hand. PSO in its original form is primarily designed for continuous search spaces, where solutions are represented by real-valued vectors. In our experiments, we adapt PSO to discrete search spaces by using integer position vectors that represent inference model indexes, together with integer velocity vectors. After every update to the position vector, we clip the new values to make sure they stay in the permitted range.
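One way to realize the discrete adaptation described above is sketched below. The paper does not state the velocity update it uses, so the canonical PSO rule with inertia weight `w` and acceleration coefficients `c1` and `c2`, the rounding of velocities to integers, and all names and default values are assumptions.

```python
import random


def pso_step(x, v, pbest, gbest, n_models, rng, w=0.7, c1=1.5, c2=1.5):
    """One particle update with integer velocities and clipped positions."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # Canonical velocity rule (assumed), rounded to the nearest integer.
        vel = round(w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi))
        # Clip the new position to the valid model index range [0, n_models - 1].
        pos = min(max(xi + vel, 0), n_models - 1)
        new_v.append(vel)
        new_x.append(pos)
    return new_x, new_v
```

Here `x` is a particle's current assignment of model indexes to tasks, `pbest` is the best position that particle has visited, and `gbest` is the best position found by the swarm.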

Second, simulated annealing (SA) has been used in the literature for both task scheduling [45, 46] and the MdKP [47, 48] and has proven to be effective. Additionally, its probabilistic acceptance criterion allows it to escape local optima and explore diverse regions of the search space, making it particularly well-suited for complex and multimodal optimization problems and a compelling baseline choice. Furthermore, since HGSTO uses SA as a step to accelerate convergence, a direct comparison of HGSTO against standalone SA is important. SA is implemented as in HGSTO, following Algorithms 3 and 4.

Finally, we compare against the unmodified version of GA to observe the improvements HGSTO provides on top of GA. GA is implemented as in HGSTO, without the additional modifications. All compared algorithms, including HGSTO, use the same fitness function given in Eq. (8). Parameters for all implemented algorithms are presented in Table 4. All algorithms are implemented in Python 3.12, with PyTorch used for the inference models. Parameters such as population size and swarm size for HGSTO, GA, and PSO have been set to the minimum values that offer the highest performance; higher values only increase execution times while providing no improvement in performance.

We conduct a real-world case study where we use a Raspberry Pi 5 as the edge device, which is equipped with a quad-core Arm Cortex-A76 processor @ 2.4 GHz and 8 GB of RAM. The edge device is connected through WiFi (802.11ac) to an access point, which in turn is connected by Ethernet to a set of edge servers consisting of desktop computers. To mitigate potential traffic contention with other devices, the experiments were carried out in an environment with no other devices competing for network resources. In our experiments, we use power measured in watts as a constraint instead of energy measured in kWh for convenience.

The case study consists of an object detection application using the ImageNet-Mini dataset [49] as a source of data, with 3923 images ranging in size from 10 KB to 10 MB. The edge device is assumed to receive 10 inference tasks (i.e., images) per time slot. We deploy three lightweight object classification models on the edge device for local inference, namely ResNet18 and ResNet34 [50] and ShuffleNetV2 [51]. Other inference models were also considered, such as AlexNet, GoogleNet, MobileNetV2, and SqueezeNet1.1. However, these models did not offer a high enough accuracy-to-inference-time ratio compared to the chosen models, so the algorithms never used them in any solution, and they were therefore eliminated from the experiment. On the edge servers, we deploy ResNeXt101 [52], which is a larger and more accurate model. During the deployment phase, we run tests on these models to estimate the average inference time and energy consumption of each inference model on each machine (see Table 3).

Table 3 Models average accuracies and inference times on the ImageNet-Mini dataset using our HW
Table 4 Algorithms parameters

5.1 Evaluation metrics

We employ four metrics to assess the performance of HGSTO: scheduling time, accuracy, execution time, and power consumption. Scheduling time, quantified in milliseconds, denotes the duration required by the algorithm to generate a schedule for a given time slot. This metric is measured individually for each time slot and subsequently averaged across all slots. Similarly, accuracy, execution time, and power consumption determine the quality and adherence of the schedules generated by the algorithm to the prescribed constraints. Each schedule for every time slot undergoes evaluation through inference utilizing the designated inference models. Subsequently, these metrics are aggregated across all time slots to yield a comprehensive assessment across the dataset.

5.2 Performance under different number of iterations

We first assess the algorithms' performance across varying numbers of iterations, from 10 to 500, in order to determine optimal values for subsequent experiments and ensure fair comparisons. Subplot 1 of Fig. 6 illustrates the scheduling time trends for GA, PSO, and SA, showing a linear increase in scheduling time with the number of iterations, with a distinct slope for each algorithm. Notably, while the scheduling time of SA increases more slowly, it is bound to surpass that of HGSTO given enough iterations. In contrast, HGSTO maintains a consistent scheduling time as a result of its termination condition, which halts the algorithm when no improved solutions are found. In Subplot 2, it is evident that HGSTO yields schedules with comparable or superior average accuracy compared to GA, despite the latter's advantage of running for more iterations. The accuracy of SA, in turn, improves noticeably as the number of iterations increases. Subplot 3 shows that all algorithms adhere to the prescribed time constraint of 500 ms, except for SA: with fewer than 200 iterations, its schedules consistently exceed the 500 ms threshold, and only beyond 200 iterations does SA begin to generate schedules that conform more closely to the constraint. Finally, in Subplot 4, both HGSTO and GA make nearly full use of the given power constraint of 30 W, leading to superior average accuracy and reduced scheduling times compared to SA and PSO. This superiority stems from the ability of HGSTO and GA to discover high-quality solutions that leverage extensive offloading, as opposed to relying solely on local inference. For the same reason, HGSTO shows slightly higher power consumption than GA while still adhering to the power constraint.

Fig. 6 Performance comparison under varying iteration numbers with time constraint set to 500 ms and power constraint set to 30 W (Color figure online)

5.3 Performance under different time constraints

In Fig. 7, we analyse the performance of HGSTO under varying time constraints from 200 to 500 ms while the power constraint is set to 30 W. We note that HGSTO exhibits a marginally higher scheduling time in comparison to SA and GA, yet it delivers superior average accuracy relative to PSO, GA, and SA. Subplot 3 indicates that all methods under the specified parameters conform impeccably to the time constraint provided. Moreover, Subplot 4 highlights that HGSTO excels in identifying solutions that maximize utilization of the designated power constraint (i.e., 30 W), consequently yielding enhanced average accuracy.

Fig. 7 Performance comparison under varying time constraints with power constraint set to 30 W (Color figure online)

5.4 Performance under different power constraints

In Fig. 8, we observe the performance of HGSTO while varying the power constraint from 10 to 50 W with the time constraint set to 500 ms. In Subplot 2, HGSTO demonstrates superior accuracy compared to the other algorithms. Subplot 3 reveals a consistent behavior across all compared algorithms: at 10 W, schedules take 450 ms on average because the power limitation forces reliance on local inference over offloading. At 15 W, however, the average time drops notably to 350 ms, as solutions move certain tasks from inferior local inference to offloading onto the best edge server, i.e., the one with minimal latency and maximal accuracy. This shift exhausts the power budget of the edge device, leaving it idle and unable to utilize the remaining time. Beyond 15 W, both average time and accuracy rise gradually as the power budget expands, until it reaches 30 W, where the edge device becomes time-constrained and additional power no longer improves accuracy.

Fig. 8 Performance comparison under varying power constraints with time constraint set to 500 ms (Color figure online)

5.5 Performance under varying number of edge servers

In Fig. 9, we present the results of experiments conducted with varying numbers of available edge servers, ranging from 1 to 10, while keeping the time and power constraints constant at 500 ms and 30 W, respectively. In Subplot 2, it is evident that HGSTO capitalizes on the increasing number of edge servers by generating solutions that leverage offloading, resulting in higher accuracy compared to GA, SA, and PSO. Subplot 3 illustrates that GA and HGSTO provide solutions with average times closest to the given time constraint, with HGSTO noticeably closer than GA. Subplot 4 shows that HGSTO finds solutions with power consumption similar to that of GA up to 5 edge servers, after which HGSTO starts finding superior solutions with higher average power as a result of its ability to better search the solution space.

Fig. 9 Performance comparison under different number of available edge servers with time constraint set to 500 ms and power constraint set to 30 W (Color figure online)

We use the t-test to assess the statistical significance of the performance differences between HGSTO and GA, SA, and PSO. Each algorithm was run 100 times over the ImageNet-Mini dataset, and the fitness and accuracy of the resulting solutions were evaluated. As shown in Table 5, all calculated p-values are below the conventional significance threshold of 0.05. This indicates that the advantage of HGSTO over the compared algorithms is statistically significant rather than coincidental.

Table 5 Statistical T-test p-value results of HGSTO compared to GA, PSO and SA in terms of fitness and accuracy over 100 test runs

6 Conclusion

In this work, we investigated the intricate problem of selective inference task offloading in edge computing environments constrained by both time and energy considerations. Specifically, we delved into the decision-making process at the edge devices, where they determine whether to execute inference tasks locally using multiple inference models or offload them to edge servers, aiming to maximize overall accuracy. Demonstrated to be NP-Hard, this problem necessitated the development of an efficient solution approach. To address this challenge, we introduced HGSTO, a hybrid metaheuristic algorithm that amalgamates the accuracy and versatility of genetic algorithms (GAs) with the stochasticity inherent in simulated annealing (SA). By leveraging the strengths of both techniques, HGSTO harnesses the evolutionary optimization capabilities of GAs while integrating the adaptive search mechanisms of SA, thereby providing a powerful tool for navigating the complex solution space of selective inference task offloading. Through extensive experimentation and evaluation on the ImageNet-Mini dataset, we demonstrated the effectiveness of HGSTO in producing high-quality solutions that optimize both accuracy and resource utilization in edge computing environments.

As future work, we are considering enhancing HGSTO by integrating deep reinforcement learning (DRL) agents into its framework. By leveraging the capabilities of DRL, HGSTO could potentially benefit from improved decision-making processes and enhanced adaptability to dynamic network environments.