
Motivation: With the increasing demand for mobile devices, image retrieval approaches can simplify different promising applications. Examples are facial recognition for authentication and object classification for visually impaired persons [1]. The most common solution to handle such applications is based on mobile cloud servers where a user device uploads new image data to cloud computing servers to perform data processing and machine learning tasks. Unfortunately, some problems arise when uploading image data directly to cloud computing servers, making the transmission slower. Mobile Edge Computing (MEC) (also known as Fog Computing) has emerged and developed in recent years to tackle such limitations. MEC eliminates delays by providing enormous amounts of computational resources for computational tasks with high computational demands in applications that require low latency such as real-time image processing, interactive virtual Reality (VR), Augmented Reality (AR), and remote control of scientific facilities. MEC nodes can be used on any user device to access universal data stored in cloud computing servers [2].

Moreover, mobile user devices can issue image retrieval requests and obtain results from mobile edge servers that are closer to users than cloud computing servers. MEC combined with image retrieval can also alleviate performance bottlenecks in various applications, such as electronic healthcare, where reducing the response time enables patients to access healthcare services quickly. Another example is autonomous driving, where MEC can be used to detect various objects along the road [2].

Deep learning has been extraordinarily successful across various application domains, including computer vision and natural language processing. Nevertheless, achieving high accuracy requires substantial computational and memory resources in both the training and testing phases. The training phase is particularly expensive because the optimization of millions of parameters must be repeated many times [3].

MEC servers mainly aim to minimize the data traffic on the core network. The Convolutional Neural Network (CNN) is one of the best solutions for computer vision applications that require accurate detection, especially in video images, and it is becoming more prevalent due to its superior performance when trained with massive amounts of data. Deep learning learns high-level features from large datasets incrementally, which reduces the need for hand-crafted feature extraction [2]. While edge computing can supply the scalability and privacy advantages discussed earlier in this section, many major challenges remain for using deep learning at the edge. One major challenge is accommodating the high resource requirements of deep learning on less powerful edge computing resources. Deep learning needs to be executed on a variety of edge devices, ranging from reasonably provisioned edge computing servers equipped with a GPU, to smartphones with mobile processors, to bare-bones Raspberry Pi devices and microcontrollers [4] (Fig. 1).

Fig. 1: System architecture of image retrieval in MEC

The proposed approach can find applications in machine learning for medicine, which is currently hampered by restricted dataset availability for algorithm training and testing, owing to limited access to electronic medical records and ethical requirements to protect patient privacy [5]. Digital imaging and electronic data exchange and storage are now the standard, but the requirements for privacy preservation are equally strict. To protect personal privacy while encouraging scientific research on large datasets that aim to enhance patient care, technical solutions that address data security and data sharing requests are mandatory. Digital image datasets that are easily shareable, durably storable, and remotely accessible on cloud computing, high-performance computing, and edge computing servers have driven the aforementioned success of machine learning for medical imaging.

Related work: In MEC-based wireless networks, devices can offload all or part of their computation tasks to the MEC server, reducing processing time and saving energy on the devices. Considerable attention has been paid to cooperative communication techniques for MEC under various objectives. For instance, the works in [6, 7] propose an energy-latency-aware task offloading approach that optimizes the trade-off between energy consumption and application completion time. The authors in [8] deploy the MEC server to improve the Quality of Experience (QoE) for dynamic adaptive video streaming by considering the distortion-rate characteristics of video processing.

Meanwhile, image retrieval has long been an important research subject in machine vision and multimedia. The purpose is to retrieve, from a large image corpus, the images most relevant to a given query with a fast response time. Content-based image retrieval using visual features is the fundamental step of such a system. To cope with the different transformations of visual objects or scenes, visual features are extracted and subsequently transformed into a fixed-size vector for comparison. For instance, Zhou et al. [9] propose a collaborative index embedding algorithm that searches the shared image-level neighborhood structures and unifies their index matrices by generating sparse index matrices with scale-invariant feature transform (SIFT) and CNN features. Hörster and Lienhart [10] present two different options for determining interest points and scales for feature extraction: sparse features in a Gaussian pyramid and dense features extracted on regular grid points at different scales. Every feature vector extracted from the images is then replaced by the most similar visual word, determined as the closest word in the high-dimensional feature space.

Principal component analysis (PCA) preserves the global intrinsic structure of the image data when extracting features. However, it does not use the label information of the images. In contrast, linear discriminant analysis (LDA) keeps the global relationships of the image data, yet it remains difficult to extract discriminative features that are helpful for image retrieval [11]. Feature matching also takes considerable time, because the query image must be matched against all the images stored in the dataset, and high-dimensional image features are far more expensive to compare than low-dimensional ones [12, 13].

Alasadi et al. [14] implemented a novel convolutional neural network adversarial framework that achieves the best accuracy for face matching while reducing the discrepancy in TPR and FPR across various demographic groups. Wang et al. [15] use a single neural network to learn features that transfer from high-resolution to MEC-based low-resolution images. Wang et al. [16] present a CNN that removes speckles from noisy input images, achieving a significant improvement over state-of-the-art speckle reduction methods. Kaissis et al. [17] presented a supervised machine-learning model capable of predicting clinically relevant molecular subtypes of pancreatic ductal adenocarcinoma (PDAC) from diffusion-imaging-derived radiomic features. Despite these studies, the following question remains: “Can adversarial attacks on medical images be crafted as easily as attacks on natural images? If not, why?”. “Furthermore, to the best of our knowledge, no previous work has investigated the detection of medical image adversarial examples” [18]. The authors in [18] show that the different degrees of adversarial vulnerability of deep neural networks on medical versus natural images can help build a more comprehensive understanding of the accuracy and robustness of deep learning models in different domains.

Contributions: In this paper, we study the image retrieval problem in the MEC context, described as follows. The system architecture of MEC consists of three main components: mobile users, edge servers, and high-performance computing servers. Mobile devices communicate with edge servers through LTE or Wi-Fi, and the edge servers connect to high-performance computing servers via the Internet backbone. A large amount of labeled image data is stored on the high-performance computing servers. From the mobile users' perspective, the edge servers and high-performance computing servers appear as a single service provider. A mobile user uploads an image to the service provider to launch an image retrieval request. The service provider then processes the request and returns the label information of the most similar image to the mobile user.

In this paper, our approach aims to protect the dataset from unintended leakage and deliberate disclosure attempts. The dataset sharing enabled by digital imaging and data storage threatens to erode personal privacy and loosen individuals' control over their data in the name of scientific research. The field of privacy-preserving machine learning offers techniques to bridge the gap between protecting individual data and sharing datasets for scientific research and medical routine. Overall, the contributions of this paper can be summarized as follows:

  • We present a CNN model for privacy-preserving deep learning applications. To protect personal privacy while encouraging scientific research on large datasets that aim to enhance patient care, technical solutions that address data security and sharing requests are mandatory.

  • We propose a model that reduces core-network traffic by transmitting compact feature vectors instead of raw images.

  • We significantly minimize feature matching time by introducing a novel CNN-based approach.

Paper organization: The rest of the paper is organized as follows. Section 2 presents the proposed MEC-based image retrieval framework and its components. Section 3 describes the experimental setup and implementation details, and Sect. 4 presents the numerical results. Section 5 concludes the paper and provides future directions.

2 MEC-based image retrieval framework

This section provides an overview of the MEC-based image retrieval network and its components; we then discuss the underlying CNN concepts.

2.1 Network architecture

As shown in Fig. 2, we propose a novel CNN and train it on the image training dataset. We then use the CNN to extract discriminative features from the image dataset. During system deployment, the mobile user captures images and initiates image retrieval requests by uploading them to the edge server via LTE or Wi-Fi. The edge server receives the image and automatically evaluates it to extract the features. Then, the high-performance computing server or the edge server sends back the labels of similar images as retrieval results to the mobile device. Images with similar labels will have similar features, while images with different labels will have different features.

Fig. 2: The guided feature extraction approach for image retrieval

Our novel CNN architecture has several advantages. First, it demands less network bandwidth, because the CNN extracts low-dimensional features from the image, reducing the core network bandwidth consumption. Second, our approach can provide high retrieval accuracy. Third, our approach protects the dataset from unintended leakage and deliberate disclosure attempts, because the model is stored on the high-performance or edge computing server and only a small feature vector is transmitted over the network. Thus, recovering the original image from the transmitted features is difficult.

2.2 RADICAL-Pilot

RADICAL-Pilot (RP) aims to ease scientific workflows by providing the conceptual and functional tools required to enable scalable execution of many-task workloads. RP decouples the initial resource acquisition from the task-to-resource assignment: once the pilot is active and scheduled by the resource management system, it pulls tasks for execution. This capability allows tasks to be executed on remote resources [19]. RADICAL-Pilot is an implementation of the pilot abstraction, designed and implemented to provide scalable and efficient launching and execution of heterogeneous tasks across multiple, possibly different, remote resources, as presented in Fig. 3.
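To illustrate this decoupling, the following is a minimal sketch of submitting a pilot and a batch of tasks with the radical.pilot Python API. The resource label, core count, runtime, image names, and the CNN inference script cnn_predict.py are placeholders, and the exact class names may vary across RP releases (older versions use ComputeUnit/UnitManager naming), so this is only an assumed outline rather than the authors' actual script.

```python
import radical.pilot as rp

# Acquire resources first (the pilot), independently of any tasks.
session = rp.Session()
pmgr    = rp.PilotManager(session=session)
tmgr    = rp.TaskManager(session=session)

# Hypothetical resource request; label, core count and runtime are placeholders.
pilot = pmgr.submit_pilots(rp.PilotDescription({'resource': 'local.localhost',
                                                'cores'   : 4,
                                                'runtime' : 30}))
tmgr.add_pilots(pilot)

# Describe one task per image; the active pilot pulls these for execution.
tasks = []
for image in ['img_0.ppm', 'img_1.ppm']:          # hypothetical inputs
    td            = rp.TaskDescription()
    td.executable = 'python3'
    td.arguments  = ['cnn_predict.py', image]     # hypothetical inference script
    tasks.append(td)

tmgr.submit_tasks(tasks)
tmgr.wait_tasks()
session.close()
```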

Fig. 3: RADICAL-Pilot programming and execution model design architecture [19]

2.3 CNN in machine learning

Our CNN-based design contains neurons that have learnable weights and biases. In the following, each layer of the CNN is explained in more detail.

Convolution layer: The convolution layer is one of the key components of a CNN. In image classification tasks, its input consists of one or more channels. A single output matrix is computed as:

$$A_j = f\left( {\sum_{i = 1}^N {X_i W_{i,j} + b_j } } \right).$$
(1)

For each input matrix \(X_i\), a convolution with the corresponding kernel matrix is performed. The matrices computed for all input channels are summed, and the bias value is added to the sum. The output matrix \(A_j\) is then obtained by applying the activation function \(f\) to each element of the resulting matrix.

The learning process aims to find the kernel matrices \(W\) and the biases \(b\) that extract features useful for image classification. The back-propagation method is usually used to optimize the neural network weights. The convolution layer is often followed by a pooling layer, which reduces the number of output neurons from the convolution layer. The purpose of this layer is to reduce the feature dimension and avoid over-fitting. For instance, a max-pooling layer with a 2 × 2 kernel selects the highest value among each group of four neighboring elements of the input matrix and produces one element of the output matrix. When computing the error function with the back-propagation procedure, the gradient signal is routed only to the neurons that contributed to the pooling output.
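To illustrate these operations, a minimal PyTorch sketch of a convolution layer followed by ReLU and 2 × 2 max pooling is shown below; the channel counts and the 48 × 48 input size are illustrative only and are not the configuration evaluated later in Sect. 3.

```python
import torch
import torch.nn as nn

# Illustrative layer: 3 input channels, 16 output feature maps, 3x3 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
act  = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2)   # keeps the max of each 2x2 neighborhood

x = torch.randn(1, 3, 48, 48)        # one 48x48 RGB image (batch of 1)
a = act(conv(x))                     # Eq. (1): per-channel convolutions summed, bias added, then f
y = pool(a)                          # halves the spatial resolution
print(a.shape, y.shape)              # torch.Size([1, 16, 46, 46]) torch.Size([1, 16, 23, 23])
```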

ReLU activation: In a fully connected or convolution layer, the layer input is multiplied by the weights, and the result is passed through an activation function, i.e., a non-linearity. Commonly used activation functions include the sigmoid and tanh functions. The sigmoid squashes each input element into the range [0, 1]: extremely large inputs produce outputs close to 1, while strongly negative inputs produce outputs close to 0. In our case, we use the ReLU function f(x) = max(0, x) to increase learning speed and classification performance in the convolutional neural network. ReLU does not saturate in the positive region, unlike the sigmoid and tanh activation functions; it is computationally efficient, converges faster than sigmoid and tanh, and has turned out to be more biologically plausible.

For example, suppose a sigmoid gate with input x is used, as in Fig. 4. During the backward pass, the upstream gradient is multiplied by the local gradient ∂σ/∂x of the sigmoid, and the chain rule yields the downstream gradient that is passed back.
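Concretely, the local gradient of the sigmoid has a well-known closed form, so the chain rule at this gate reads:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{\partial \sigma}{\partial x} = \sigma(x)\left(1 - \sigma(x)\right), \qquad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial x}.$$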

Fig. 4: Overview of the neuron downstream gradient over a sigmoid gate

Back propagation: The speed and stability of neural network training are influenced by several hyperparameter choices, such as batch learning, momentum, and weight decay. The weight is updated by the equation below:

$$W_i \left( {new} \right) = W_i \left( {old} \right) - \eta \left( {\frac{\partial L}{\partial W_i }} \right) + \alpha \Delta W_i \left( {old} \right) - \lambda \eta W_i$$
(2)

where \(W_i(old)\) is the old weight vector, \(\frac{\partial L}{\partial W_i}\) is the error gradient with respect to the old weight vector, and \(\eta\) is the learning rate. The momentum term is \(\alpha \Delta W_i(old)\), where \(\alpha\) is the momentum rate; it makes learning faster. The term \(\lambda \eta W_i\) is the weight decay term, where \(\lambda\) is the decay rate; it shrinks the weights toward zero at each learning iteration, stabilizing the learning process.
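As a sketch, Eq. (2) can be implemented directly as below; the gradient grad is assumed to be provided by back-propagation, and the hyperparameter values are illustrative rather than those used in Sect. 3.

```python
import numpy as np

def sgd_step(w, grad, delta_w_old, eta=0.01, alpha=0.9, lam=5e-4):
    """One update of Eq. (2): gradient step + momentum + weight decay."""
    w_new   = w - eta * grad + alpha * delta_w_old - lam * eta * w
    delta_w = w_new - w          # stored as the momentum term of the next step
    return w_new, delta_w

# Toy usage with a random weight vector and gradient.
w       = np.random.randn(10)
grad    = np.random.randn(10)
delta_w = np.zeros_like(w)
w, delta_w = sgd_step(w, grad, delta_w)
```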

Loss function: We use the cross-entropy loss to measure the difference between two vectors: the output of our classifier, which contains the predicted probabilities for all classes, and the one-hot vector of the true label. The cross-entropy between these two vectors is calculated as follows:

$$D(Y,L) = - \sum_i {L_i \log Y_i } .$$
(3)
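A minimal sketch of Eq. (3), assuming the classifier output Y is already a probability vector (e.g., after a softmax) and L is the one-hot vector of the true label:

```python
import numpy as np

def cross_entropy(y_pred, y_true, eps=1e-12):
    """D(Y, L) = -sum_i L_i * log(Y_i); eps avoids log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_pred = np.array([0.7, 0.2, 0.1])    # predicted class probabilities
y_true = np.array([1.0, 0.0, 0.0])    # one-hot true label
print(cross_entropy(y_pred, y_true))  # ~0.357
```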

2.4 Proposed solutions

We propose a novel CNN-based architecture for image retrieval using a mobile edge server and RADICAL-Pilot for feature extraction, as shown in Fig. 5. Our system consists of three layers: (1) smartphones (mobile users), (2) edge servers, and (3) RADICAL-Pilot. The image dataset, which contains many redundant features, is stored on the edge computing server.

Fig. 5: a Network framework; b edge computing server setup; and c RADICAL-Pilot setup

3 Performance evaluation

We detail the setups of the proposed framework and present the experimental results to evaluate the performance of our proposed approach for image retrieval.

3.1 Implementation details

Our model is implemented using the PyTorch open-source framework [20, 21]. Our method was validated on the German Traffic Sign Recognition Benchmark (GTSRB) dataset [22], which consists of 39,295 training images and 12,631 testing images in 43 classes representing different traffic signs. The dataset provides reliable ground truth based on semi-automatic annotation, and each real-world traffic sign instance occurs only once. It offers different strengths and challenges with respect to this work's goal of validating matching, network transmission time, and accuracy across different classes. In this paper, we focus on the matching and feature extraction time as the sensitive characteristics.

The feature vector is obtained by a Convolutional Neural Network (CNN) that consists of three convolution layers, with a ReLU activation function after each layer except the last convolution layer. The network input is a traffic sign image of size 48 × 48. The first convolution layer has 64 feature maps with a 3 × 3 filter, the second has 96 feature maps with a 3 × 3 filter, and the third has 128 feature maps with a 4 × 4 filter. The feature layer is used as input to the fully connected classification network, whose first fully connected layer has 128 neurons and whose second fully connected layer maps these 128 neurons to 43 output neurons. After each of the convolution layers, we use a max-pooling layer that downsamples the activation maps.
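A minimal PyTorch sketch consistent with one reading of this description is given below; the 3-channel input, the absence of padding, and the placement of a 2 × 2 pooling layer after every convolution are assumptions, so the flattened feature size (inferred here with nn.LazyLinear) may differ from the exact implementation.

```python
import torch
import torch.nn as nn

class TrafficSignCNN(nn.Module):
    def __init__(self, num_classes=43):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3),  nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 96, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(96, 128, kernel_size=4),           nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128),               # infers the flattened size on first call
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model  = TrafficSignCNN()
logits = model(torch.randn(1, 3, 48, 48))     # one 48x48 input image
print(logits.shape)                           # torch.Size([1, 43])
```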

We implement our models using the CUDA package. The training batch size is 50 and the validation batch size is 30. Adam optimization with a learning rate of 0.001 and a weight decay of 0.0005 was employed to update the trainable weights and biases of the proposed CNN. The network was trained for 10 epochs.
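Continuing the sketch above, a minimal training loop matching these settings could look as follows; the random tensor dataset is only a stand-in for the real GTSRB DataLoader, and TrafficSignCNN is the class sketched earlier in this section.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device    = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model     = TrafficSignCNN().to(device)               # model sketched above
criterion = nn.CrossEntropyLoss()                     # cross-entropy loss of Eq. (3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

# Stand-in dataset: random tensors in place of the real GTSRB loader.
dummy        = TensorDataset(torch.randn(200, 3, 48, 48), torch.randint(0, 43, (200,)))
train_loader = DataLoader(dummy, batch_size=50, shuffle=True)

for epoch in range(10):                               # 10 epochs, as stated above
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```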

For validation, we use accuracy to evaluate the overall performance. Other metrics such as ROC curves introduce additional difficulties: applying ROC to multiclass problems with a “one-vs-all” or “one-vs-rest” strategy requires many figures to show the overall picture. Moreover, the dataset used in this paper has balanced samples across classes, and the proposed application assigns uniform importance to all classes. For these reasons, accuracy provides a clear and precise metric to quantify the performance of the proposed approach.

In the testing phase, the pre-processed images are provided along with their labels to the network, and the classification accuracy is calculated over all images in the testing set. The compact network also greatly reduces the number of parameters to be learned, which helps avoid overfitting and, in our case, reduces the load on the network. The learning parameters associated with these techniques are chosen based on common values or selected according to practical studies. For instance, the initial learning rate is selected in accordance with the value reported in [22]. The weights of our network are initialized using the random initialization procedure proposed in [23].

Experiment setup: The experimental environment comprises a mobile device, an edge server, and RADICAL-Pilot.

  • Mobile Device: we used an LG Fortune equipped with an 8 MP camera and a 1.4 GHz processor running Android 7.1.2 (Nougat) OS with 4G compatibility.

  • Edge Server: the edge server is a laptop equipped with an Intel i7-6700 @ 3.4 GHz CPU and 16 GB of memory, running Ubuntu 16.04. We used a Flask server [24] to set up the connection between the Android phone and the computer. Flask is a web framework that makes it possible to set up a client–server connection between any computer and any mobile device with a few lines of code.

  • RADICAL Pilot: RP is a pilot system [19], written in Python and following a building-blocks approach, mainly dedicated to executing applications composed of many computational tasks on high-performance computing (HPC) resources.

Two sets of experiments are designed to test the accuracy and speed of our CNN using a local client–server setup and a RADICAL-Pilot (RP) setup. In both experiments, we focus on two important metrics, the CNN prediction time and the CNN prediction accuracy, using four images chosen randomly from the testing dataset.

Local Client-Server Setup In this experiment, we implemented a Flask server on a laptop and a Flask client on an Android phone. The server is responsible for receiving the image to be classified from the client end (the Android device), processing it with the preinstalled CNN model, and returning the classification label. The connection between the client and the server is direct; thus, we expect a noticeably short latency between both ends. We embedded the server with our CNN implementation: when the server is up and running, it waits for an image; whenever an image is received from the client side, it triggers the CNN model. The utility function that we implemented takes a considerable amount of time to convert the image received from the Android device, whatever its format, into the PPM format expected by our CNN model. Once the image is converted, the CNN processes it, and the image category is enqueued as a message. Finally, the server returns the data to the client side and dequeues the data from the communication channel.
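A minimal sketch of such a Flask endpoint is shown below; the route name, the multipart field name 'image', and the predict_label helper (which stands in for the CNN described above and returns a fixed label here) are hypothetical, and the PPM conversion step is only indicated by a comment.

```python
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

def predict_label(image):
    """Hypothetical wrapper around the trained CNN; returns a class label string."""
    # The image would be converted to 48x48 PPM/tensor form and fed to the model here.
    return "speed_limit_30"

@app.route('/classify', methods=['POST'])
def classify():
    # The Android client uploads the image as multipart form data under 'image'.
    img = Image.open(request.files['image'].stream)
    return jsonify({'label': predict_label(img)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)    # reachable by the phone over Wi-Fi/LTE
```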

RADICAL Pilot Setup We use RP to run our CNN model on the HPC system via Secure Shell (SSH) communication channels from a Virtual Machine (VM). RP is responsible for scheduling and submitting the CNN scripts as a batch of tasks to a compute node on one of the remote resources; the required resources are specified in advance by the user in the RP script. Our CNN runs on a CPU and uses a single core per image. The RP script submits one image per task (unit) to the remote resource, waits for that image to be processed, and returns the results to the VM.

4 Numerical results

Figure 6 presents the prediction time of the CNN with the extra server-client communication time. The CNN time can be computed using the following equation:

$$CNN_{Time} = CNN_{tp} + Flex_{tr}$$
(4)

where CNNtp denotes the time taken by the CNN to process the image, and Flextr represents the time taken to send an image to the server, receive the data back from the CNN, and send it to the client (e.g., the mobile device).

Our experiments show that our CNN can achieve up to 85.07% accuracy with a CNN time of 30 seconds. The CNN approach achieves the highest retrieval accuracy because it is trained on all images of the dataset, incorporating all types of dissimilarities, whereas classic algorithms handle only limited types of dissimilarities. For example, in weight-adaptive projection matrix learning (WAPL), covering only partial types of dissimilarities reduces the capability of extracting discriminative features and therefore results in lower retrieval accuracy. In practice, different dissimilarities contribute differently to learning, and treating every dissimilarity equally weakens the ability of the convolutional neural network to extract discriminative features.

The optimal dimension of the feature dataset can be estimated in terms of the number of positive intrinsic structure features, which can help us design different interaction strategies between edge computing servers and high-performance computing servers to meet different user requirements in terms of retrieval accuracy and response time. In addition, feature matching time is reduced since matching in the CNN is performed in a low-dimensional space; therefore, the average response time is reduced, and our CNN approach achieves a minimum response time. The main purpose of the CNN is to extract a small number of effective discriminative features and simultaneously match these features to obtain the results with the guided feature extraction.

We examine the response time of our approach when the mobile device connects to the edge computing server. Figure 6 shows that our CNN approach reduces the average response time to recognize the five images to between 1000 ms and 3000 ms, depending on the size of the images. The CNN model is prepared to run on the remote resource by setting up the required Python environment in advance; when RP submits the script to the remote resource, the CNN model is triggered and starts processing the received image. Depending on the image size (KB) and the number of images, this process can take from a few seconds up to a minute, as shown in Fig. 7. This indicates that with the development of 5G technology, which makes the transmission between mobile devices and edge computing servers faster, our approach can significantly reduce the response time and handle huge datasets. The proposed model is also more secure, because only a small feature vector is transmitted, from which it is difficult or impossible to recover the original image.

Fig. 6: CNN prediction time (green) and accuracy level (red)

Fig. 7: CNN prediction time (green) and accuracy level (red) for RADICAL-Pilot

In our CNN architecture, approximately 75% of the GTSRB images are used for training and 25% for testing. Figure 6 shows that our CNN approach achieves an accuracy of 85.07%, which can be attributed to the fact that our CNN removes redundant features. We could improve the accuracy by increasing the number of layers in the CNN, but that would affect the training and testing times and increase the traffic on the network. The CNN can simultaneously extract effective discriminative features from the image dataset on the high-performance computing server and match them with the images on the edge computing server. Our CNN also aims to extract features that preserve the intrinsic structure of the image data.

5 Conclusion

In this paper, we proposed a novel CNN architecture on an edge computing server for feature extraction in image retrieval. The proposed approach achieves an improved retrieval accuracy of 80% and decreases the response time and the traffic on the core network. The approach leverages a CNN for feature extraction and information retrieval; hence, the network transmission time can be reduced because the CNN can extract discriminative features from the images in the dataset and compare them with the images uploaded by the mobile device simultaneously. The model offers privacy-preserving machine learning techniques to bridge the gap between individual data preservation and dataset sharing.