1 Introduction

Machine learning (ML) and artificial intelligence (AI) have become an integral part of our daily lives and have transformed various domains, such as image processing and speech recognition [1,2,3]. A subset of ML called deep learning (DL) has been instrumental in this revolution by enabling automatic feature extraction from large datasets, particularly in domains like natural language processing, speech recognition, and computer vision [4].

The increasing global demand for healthcare diagnostic technologies and a shortage of medical personnel have led to growing interest in DL-based diagnosis and telemedicine systems. However, DL's dependence on large annotated datasets presents a significant challenge in medical image analysis. Transfer learning (TL) has emerged as a potential solution to this challenge, allowing models to leverage pre-training on reference datasets before fine-tuning on specific tasks [3].

Despite TL's success, researchers and application specialists are constrained by the need for accelerated hardware to scale DL algorithms beyond current capacities. Graphics Processing Units (GPUs) have become the dominant hardware accelerator for DL owing to their superior parallel computation capabilities [5, 6]. However, this paper contends that alternative hardware platforms deserve exploration, specifically Field-Programmable Gate Arrays (FPGAs), which offer advantages such as adaptable hardware configurations and power-efficient execution of the kernels central to DL [7].

Although FPGAs offer attractive features, their adoption can be hindered by the requirement for specialized hardware knowledge. However, recent advancements in FPGA tools, especially those based on OpenCL, have made FPGAs accessible to a broader audience, including application scientists and hardware researchers. For deep learning researchers, FPGAs are a compelling option thanks to their high parallelism, reconfigurability, and increasingly user-friendly development tools [8].

This paper argues that FPGAs are the best hardware acceleration platform for DL among the options currently available. It provides an overview of recent FPGA support for DL, highlights the associated limitations, and suggests future trends in parallel hardware computational tools. The paper focuses explicitly on meeting real-time requirements in learning strategies, with particular emphasis on Diabetic Foot Ulcer (DFU) classification.

The paper has three primary objectives. First, it aims to demonstrate that FPGAs are superior to other contemporary hardware acceleration platforms. Second, it highlights the recent support for DL on FPGAs while also identifying their limitations. Lastly, it provides insights and recommendations for future trends in parallel hardware computational tools, specifically for Convolutional Neural Networks (CNNs). The paper presents case studies of DFU classification on two high-performance computing platforms, GPUs and FPGAs, to accelerate the classification process and potentially help prevent amputations in individuals with DFU disease.

The unique contributions of this work include: the introduction of two DFU classification models (DFU_TFNet and DFU_FNet) employing a novel TL strategy; evaluation on two parallel hardware platforms (FPGA and GPU); comparison with traditional classifiers (SVM and KNN); training and testing of several CNN models (AlexNet, VGG16, and GoogleNet) on the same dataset; measurement of power consumption and processing time; and the demonstration that FPGAs are a viable choice for portable embedded devices, with the DFU_TFNet model achieving a remarkable accuracy of 99.81%, precision of 99.38%, and F1-score of 99.25%.

The rest of this paper is organized as follows: Section 2 presents preliminaries and a brief overview of related work on CNNs with FPGAs. Section 3 illustrates the methodology for the proposed models and the FPGA hardware setup for real-time requirements. Section 4 evaluates the proposed models in terms of accuracy, precision, F1-score, processing time, and power consumption. Lastly, Section 5 concludes the paper.

2 Brief overview of CNNs with FPGA

For background on the implementation of DL methods using FPGAs, refer to Appendix 1. Notably, the primary impediment to achieving the requisite hardware lies in the design size. The trade-off between density and design reconfigurability means that FPGA circuits are generally less dense than fixed hardware alternatives, so it is not always possible to implement large neural networks. Conversely, deep networks can now be deployed on single-FPGA systems, because current FPGAs incorporate hardened computational units alongside the common FPGA fabric and continue to benefit from shrinking feature sizes that enhance density. Figure 1 illustrates a summarized timeline of significant events in FPGA-based deep-learning research.

Fig. 1 Key events in the history of FPGA DL research

In the early 1990s, Cox et al. were the first group of researchers to implement neural networks using FPGAs [9]. A few years later, Cloutier et al. recorded the first implementation of a CNN using FPGAs [10]. These studies were restricted to low-precision arithmetic owing to FPGA size limitations; moreover, hardened multiply-accumulate units were not yet available in FPGAs, so arithmetic was an extremely slow and expensive resource. FPGA technology has since advanced significantly, with more hardened computation units available on-chip, driven by shrinking transistor (feature) sizes and the increasing density of the FPGA fabric. Current FPGA-based CNN implementations have benefited from these design developments.

To the best of our knowledge, a team at Microsoft recently achieved the leading FPGA implementation of CNN forward propagation. Using the ImageNet 1K dataset, Ovtcharov et al. [11] reported a throughput of 134 images/second on a Stratix V D5 at 25 W, approximately three times the throughput of their closest competitor. They projected that up to approximately 233 images/second could be achieved at the same power consumption on an Arria 10 GX1150 by utilizing the latest FPGAs. In contrast, high-performing GPU systems (Caffe + cuDNN) achieved 500–824 images/second at 235 W. This performance was attained using FPGA servers and boards designed by Microsoft as part of a research project integrating FPGAs into data-centre applications; the same project doubled the performance of large-scale search engines, demonstrating the capacity of such FPGA deployments.

Zhang et al., the closest competitor, realized another significant achievement: processing 46 images/second on a Virtex-7 485T without reducing power consumption [12]. These results surpass several notable works by their competitors [13,14,15]. These implementations share a similar architectural design: several parallel processing units on the FPGA fabric (generally employed for convolution), buffered input and output, configurable software layers, and usually off-chip memory access. However, they also differ significantly in their use of FPGAs, with various operating frequencies, look-up table types, soft cores, data-transfer mechanisms, memory subsystems, and entirely different FPGA devices. Therefore, more research is required to identify optimal architectural decisions [16].

Transfer learning is often utilized to build medical imaging models with little training data. One of the initial ideas for employing transfer learning [17] was to use pre-trained ImageNet models instead of training from scratch. Because a pre-trained CNN is computationally efficient and algorithmically straightforward, the main benefit of including FPGAs is accelerating the forward propagation of such systems and thereby improving the attained throughput. This is very significant for application engineers, who want to use viable pre-trained networks to process large volumes of data effectively and rapidly. Conversely, accelerating backward propagation is another aspect to consider in FPGA-based CNN design. Paul et al. were the first, in 2006, to exploit parallelism in the learning phase on a Virtex-E FPGA [18]; they focused on accelerating the classification process inside the CNN and boosted it by using different software and hardware platforms to take advantage of parallelism techniques.

In the realm of early detection and prognosis for diabetic foot ulcers, Thotad et al. paved the way by introducing EfficientNet, a robust deep neural network model [19]. Building upon this foundation, various end-to-end CNN-based deep learning architectures, including AlexNet, VGG16/19, GoogLeNet, ResNet50/101, MobileNet, SqueezeNet, and DenseNet, have been explored for infection and ischemia categorization using the DFU2020 benchmark dataset [20]. Applying machine learning to infrared images offers promising avenues for the early diagnosis of diabetic foot complications: researchers have investigated classical machine learning algorithms with feature engineering, convolutional neural networks (CNNs), and image enhancement techniques, aiming to pinpoint the most effective network for classifying thermograms [21]. In a different approach, [22] tackled the imbalance of the initial dataset by leveraging the synthetic minority oversampling technique. Through a univariable analysis, nine key variables (random blood glucose, years with diabetes, cardiovascular diseases, peripheral arterial diseases, DFU history, smoking history, albumin, creatinine, and C-reactive protein) were identified; risk prediction models were then independently developed using five machine learning algorithms: decision tree, random forest, logistic regression, support vector machine, and extreme gradient boosting (XGBoost). This multifaceted exploration underscores the diverse strategies employed to improve the accuracy and effectiveness of DFU prediction models. Table 1 provides an overview of the advancements, methodologies, and research gaps identified in these studies, encapsulating the current knowledge landscape and highlighting areas where further research is warranted.

Table 1 Overview of Advancements, Methodologies, and Research Gaps in Relevant Studies

3 Methodology

This section is organized into two distinct parts. The first part centres around Diabetic Foot Ulcer Classification Models, delving into the software-driven aspects of these models. The discussion in this segment revolves around the intricacies of developing and refining classification models for diabetic foot ulcers.

In the second part, the focus shifts to Hardware Implementation on GPUs and FPGAs. This section explores the practical implementation of the aforementioned software models on Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs). It elucidates the hardware-related considerations and optimizations essential for effectively deploying and executing these models on specialized computing architectures. Together, these two parts provide a comprehensive view of the software and hardware dimensions of diabetic foot ulcer classification systems.

3.1 Diabetic foot ulcer classification models

The Diabetic Foot Ulcer Classification Models section encompasses the analysis of a specific dataset, details on image pre-processing techniques, and a focused exploration of proposed models tailored for diabetic foot ulcer identification. It provides a concise yet comprehensive view of the key elements contributing to effective classification in this context.

3.1.1 Data set

Our team collected the dataset from patients in the Diabetic Center Department at Nasiriyah Hospital in Thi-Qar, Iraq; some samples are shown in Fig. 2. The dataset is now public and available at https://www.kaggle.com/laithjj/diabetic-foot-ulcer-dfu. It comprises 754 images of the feet of healthy and DFU patients. The images were taken with a Samsung Galaxy Note 8 and an iPad at different angles and under different lighting conditions to capture the feet adequately. The images are in color and were standardized for training the DFU_TFNet and DFU_FNet models and for fine-tuning the well-known pre-trained models, i.e., VGG-16, GoogleNet, and AlexNet.

Fig. 2 Samples of our datasets. The blue box samples are abnormal, and the green box samples are normal

3.1.2 Image pre-processing

Some pre-processing was needed before using the dataset for the proposed and pre-trained models. First, the images were cropped into patches of 224 × 224 pixels, so-called Regions of Interest (ROIs), each containing either an ulcer with its surrounding tissue or healthy skin. This yielded 1,609 skin patches in total: 542 healthy-skin and 1,067 DFU patches. Next, a medical expert categorized these patches into two types: healthy (normal) and DFU (abnormal). Data augmentation was applied, to the training set only, to enlarge the dataset and mitigate class imbalance; all labelled patches were used for training. A further 200 samples were collected for testing and will be made public once ethical approval is completed. Samples of the initial dataset (before cropping) are illustrated in Fig. 3.

Fig. 3 Normal vs. abnormal skin patches on a patient's foot
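To make the pre-processing concrete, the following is a minimal sketch using PyTorch/torchvision. The specific augmentation operations and the directory layout (dfu_patches/train, dfu_patches/test) are our assumptions for illustration; the paper does not list the exact transformations used.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

# Augmentation is attached to the training patches only; the test set is
# left untouched apart from resizing and tensor conversion.
train_transform = T.Compose([
    T.Resize((224, 224)),           # 224 x 224 ROI patches
    T.RandomHorizontalFlip(),       # hypothetical augmentation choices
    T.RandomRotation(degrees=15),
    T.ColorJitter(brightness=0.2),
    T.ToTensor(),
])
test_transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])

train_set = ImageFolder("dfu_patches/train", transform=train_transform)
test_set = ImageFolder("dfu_patches/test", transform=test_transform)
```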

3.1.3 DFU proposed models

Two CNN models, DFU_FNet and DFU_TFNet, are introduced. DFU_FNet, characterized by its simplicity, extracts features used to train classifiers such as SVM and KNN. DFU_TFNet, a deeper model leveraging transfer learning, allows hardware efficiency to be assessed across both shallow and deep models. Both proposed models were trained with a learning rate of 0.001, a batch size of 32, 100 epochs, and the SGD optimizer.
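A minimal sketch of this training setup (SGD, learning rate 0.001, batch size 32, 100 epochs), assuming PyTorch and the train_set from the previous sketch; the placeholder network stands in for any of the proposed models:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))  # placeholder network
loader = DataLoader(train_set, batch_size=32, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```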

DFU_FNet model

This CNN model is introduced to improve the extraction of the critical features required for DFU classification. The DFU_FNet model is based on the concept of a Directed Acyclic Graph (DAG) [23]. Two key challenges arise when using such networks. The first is enhancing model accuracy by adding convolution layers relative to a traditional network; unfortunately, adding layers can also decrease performance. The second is that discriminating between normal and abnormal DFU types requires extracting additional vital features, so a more complex structure is required. In this research, the width of the DFU_FNet model was increased, which raises the comparative computing cost.

The structure of the DFU_FNet model, shown in Fig. 4, was instrumental in accelerating learning and enhancing accuracy. It comprises eight types of layers:

Fig. 4 DFU_FNet architecture on FPGA

i. Input layer: Three channels of 224 × 224 pixels each, through which the final patches enter to train the model.

ii. Convolution layer: The output of the input layer is convolved with a set of learnable filters [24]. The filters are identified by their weights; each filter slides across the input's width and height to generate a two-dimensional activation map, and all filters have the same depth as the input. Three hyper-parameters control the output size: zero-padding (zeros padded around the input borders to preserve its size), stride (the number of pixels skipped as the filter slides across the image), and depth (the number of filters, each detecting structures such as blobs, corners, or edges in the image). This work used 17 convolution layers, all with 3 × 3 filters; each convolution layer is followed by a batch normalization layer and a rectified linear unit (a minimal sketch of this pattern follows the list).

iii. Batch normalization layer: A mini-batch was used to normalize each input channel, reducing the network's sensitivity to initialization and speeding up CNN training [4]. One such layer follows each convolutional layer, 17 in total. The layer first normalizes the activations of each channel by subtracting the mini-batch mean and dividing by the mini-batch standard deviation; it then scales by a learnable factor γ and shifts by a learnable offset β.

iv. Rectified linear unit (ReLU): Filters the data using the function max(0, x) [4], where x is the neuron input.

v. Addition layer: Adds the outputs of two or more network layers; to be used correctly, these inputs must have the same dimensions.

vi. Average pooling layer: Partitions its input into small pooling regions of various sizes (3 × 3, 2 × 2, etc.) to reduce the input size, then averages each small spatial block, which may contain both vital and less vital pixel information, to obtain normalized feature information [25]. Note that in traditional CNNs, the max-pooling layer follows the convolution layer and can lose valuable features; in this study, average pooling was applied in the last part of the network.

vii. Dropout layer: Improves model performance by preventing overfitting [26]: neurons are arbitrarily turned off and on. This model uses one dropout layer with a probability of p = 0.5 between the fully connected layers.

viii. Fully connected layer: All neurons of the previous layer are connected to this layer, mixing the features for DFU patch classification. This network uses only two fully connected layers.
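The following is a minimal PyTorch sketch of the building blocks just listed; channel counts and feature-map sizes are illustrative assumptions, not the authors' exact DFU_FNet configuration:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # Items ii-iv: 3 x 3 convolution with zero-padding to preserve spatial
    # size, followed by batch normalization and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Items vi-viii: average pooling near the end of the network, then two
# fully connected layers with a single dropout (p = 0.5) between them.
classifier = nn.Sequential(
    nn.AvgPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(64 * 112 * 112, 256),  # assumed sizes
    nn.Dropout(p=0.5),
    nn.Linear(256, 2),               # two classes: normal vs. DFU
)
```

The addition layer (item v) corresponds to a residual-style `x + shortcut` in the forward pass, which requires both operands to have identical shapes.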

The overall number of layers in the model was 58. The output layer was placed after the second fully connected layer, and the softmax function in the output layer was used for classification. The features extracted by the model were also employed to train Support Vector Machine (SVM) (DFU_FNet + SVM) and k-nearest neighbours (KNN) (DFU_FNet + KNN) classifiers. The SVM is a margin-based classifier: the algorithm determines the optimal separating line between two classes such that objects lie at the largest possible distance from that line. The SVM used kernels such as linear, polynomial, sigmoid, and radial basis functions.

In comparison, the KNN classifier labels an object according to the nearest training samples in the feature space. Each convolution layer provided several discriminative features, and skin abnormalities (ulcers) produced higher activations. A public dataset was used to train the proposed and pre-trained models for 100 epochs until learning terminated.
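A sketch of the DFU_FNet + SVM and DFU_FNet + KNN pipelines, assuming scikit-learn; in practice the feature matrix would come from DFU_FNet's penultimate fully connected layer rather than the random placeholder used here, and k = 5 is an assumed value:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Placeholder features; in practice these are extracted from DFU_FNet.
features = np.random.rand(100, 256)
labels = np.random.randint(0, 2, size=100)

svm = SVC(kernel="rbf")                    # linear/poly/sigmoid/RBF kernels
knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an assumed value
svm.fit(features, labels)                  # DFU_FNet + SVM
knn.fit(features, labels)                  # DFU_FNet + KNN
```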

As a final consideration, the pre-trained models (i.e., VGG-16 [27], AlexNet [28], and GoogleNet [29]) were trained on our dataset with the same training parameters used for DFU_FNet. They were fine-tuned for the DFU task by transferring the knowledge these models had learned from the ImageNet dataset.
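A sketch of this fine-tuning step, assuming torchvision's VGG-16; the same head-replacement pattern applies to AlexNet and GoogLeNet:

```python
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
# Replace the 1000-class ImageNet head with a two-class DFU head, then
# train with the same hyperparameters used for DFU_FNet.
vgg.classifier[6] = nn.Linear(vgg.classifier[6].in_features, 2)
```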

DFU_TFNet model

With this model, we provide a new TL technique to address the small size of the DFU dataset. Conventional transfer learning from ImageNet pre-trained models to medical imaging applications is limited because ImageNet images differ greatly from medical images, which may not help with small datasets. The proposed TL instead helps the model learn the relevant features and reduces the time spent annotating medical images. Because the amount of unannotated medical imagery has grown significantly, the recommended technique trains the DFU_TFNet model on a large number of unannotated images that look similar to DFU images; the collected images comprise different skin cancer and wound datasets, totalling 100 thousand images. The DFU_TFNet model is then fine-tuned and trained on the DFU dataset.

To improve feature extraction and address the gradient-vanishing and overfitting concerns, we designed the DFU_TFNet model with the following components, which make it robust against these issues:

1. Typical convolutional layers at the model's beginning minimize the size of the input images.

2. Parallel convolutional layers with varied filter sizes extract diverse information, ensuring the model learns both small and large features.

3. Residual and dense interconnections are used for enhanced feature extraction; these connections also alleviate the gradient-vanishing issue.

4. Batch normalization is used to speed up the training process.

5. The vanishing-gradient problem is less of an issue because the rectified linear unit (ReLU) does not compress the input value.

6. Dropout prevents overfitting.

7. Global average pooling reduces the feature maps to one dimension and helps reduce overfitting.

Figure 5 illustrates the DFU_TFNet model in detail. The model begins with two standard convolutional layers applied in succession: the first with a filter size of 3 × 3, the second with 5 × 5, each followed by BN and ReLU layers. We avoided using tiny filters, such as 1 × 1, at the start of the model so as not to lose small features, which would limit the results. The standard convolutional layers are followed by six blocks of parallel convolutional layers. Each block comprises four parallel convolutional layers with four different filter sizes (1 × 1, 3 × 3, 5 × 5, and 7 × 7), whose outputs are combined at a concatenation layer before proceeding to the next block. All convolutional layers in all six blocks are followed by BN and ReLU layers. The blocks are linked by ten connections, some short and some long, each containing a single convolutional layer. These links allow the model to retain multiple levels of features for improved feature representation; parallel convolutions and connections are beneficial because gradients can propagate through several channels. Two fully connected layers with one dropout layer between them are employed. In total, the structure contains 34 convolutional layers.

Fig. 5 The DFU_TFNet model structure
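A minimal sketch of one parallel-convolution block as described (four branches with 1 × 1, 3 × 3, 5 × 5, and 7 × 7 filters whose outputs are concatenated), assuming PyTorch; channel counts are illustrative:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        # One conv-BN-ReLU branch per filter size; padding k // 2
        # preserves the spatial dimensions so the branches align.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for k in (1, 3, 5, 7)
        ])

    def forward(self, x):
        # Concatenate the four branch outputs along the channel axis.
        return torch.cat([b(x) for b in self.branches], dim=1)

block = ParallelBlock(in_ch=64, branch_ch=32)
out = block(torch.randn(1, 64, 56, 56))   # -> (1, 128, 56, 56)
```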

The training process was carried out in three different cycles:

1. Cycle#1: training DFU_TFNet with the DFU dataset only.

2. Cycle#2: training DFU_TFNet with the DFU dataset plus augmented data.

3. Cycle#3: first training DFU_TFNet on a large number of DFU look-alike images, such as the DermNet collection [30], and then training it with the DFU dataset only (see the sketch after Fig. 6). Figure 6 depicts the general concept of the transfer learning technique.

Fig. 6 The transfer learning approach
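A sketch of the two-stage Cycle#3 procedure follows. The names DFU_TFNet, dermnet_loader, dfu_loader, and train_one_stage are hypothetical placeholders for the authors' model, the look-alike image loader, the DFU patch loader, and a helper wrapping the training loop sketched earlier:

```python
model = DFU_TFNet()                      # hypothetical constructor

# Stage 1: pre-train on the large collection of look-alike skin/wound
# images (the ~100k-image same-domain corpus described above).
train_one_stage(model, dermnet_loader, epochs=100)

# Stage 2: fine-tune the pre-trained weights on the small DFU dataset
# with the same hyperparameters (SGD, lr = 0.001, batch size 32).
train_one_stage(model, dfu_loader, epochs=100)
```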

Obtaining a large number of labeled images for some medical imaging applications, such as DFU, is challenging; to the authors' knowledge, there are only two public DFU datasets [23, 30]. The proposed TL can therefore mitigate the small-dataset issue and help the model generalize well. We applied it to DFU_TFNet because its deep architecture requires a large amount of data to perform well; moreover, the proposed TL can easily be adapted to any medical imaging application using the same same-domain transfer learning. The scarcity of annotated medical data drove the decision to use same-domain transfer learning: it enabled the model to harness features acquired during pre-training, expediting training, tailoring generic features to medical imaging tasks, and improving performance through effective generalization. Figure 7 shows some learned features from the first convolutional layer of the DFU_TFNet model.

Fig. 7 Some learned features from the first convolutional layer of the DFU_TFNet model

Another visualization tool, Grad-CAM (Gradient-weighted Class Activation Mapping), stands out as a potent method for interpreting deep learning models. It is valuable for understanding and illustrating how neural networks arrive at decisions, especially in image classification tasks. The fundamental idea is to emphasize the regions of an input image that contribute most to the model's prediction, using gradient information from the final convolutional layer. The outcome is a heatmap that visually depicts noteworthy areas, shedding light on the features and patterns the network considers during decision-making. Grad-CAM thus plays a crucial role in improving model transparency and is an indispensable tool for researchers and practitioners aiming to demystify the opaque nature of deep learning models [31].
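To make the mechanism concrete, here is a minimal Grad-CAM sketch in PyTorch under the idea just described; the tiny Demo network and its last_conv attribute are stand-ins for DFU_TFNet's final convolutional layer, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.last_conv = nn.Conv2d(3, 8, 3, padding=1)  # final conv layer
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.last_conv(x))
        return self.head(x.mean(dim=(2, 3)))  # global average pooling

model = Demo()
acts, grads = {}, {}
model.last_conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
model.last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def grad_cam(image):
    scores = model(image)                          # (1, num_classes)
    scores[0, scores.argmax()].backward()          # gradient of top class
    w = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of the gradients
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))  # weighted sum
    return F.interpolate(cam, size=image.shape[2:], mode="bilinear")

heatmap = grad_cam(torch.randn(1, 3, 224, 224))    # (1, 1, 224, 224)
```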

3.2 Hardware implementation on GPUs and FPGAs

This section is divided into two parts. The first part explores model implementation on Graphics Processing Units (GPUs), scrutinizing optimizations for this hardware. The second part focuses on Field-Programmable Gate Arrays (FPGAs), elucidating the intricacies of adapting models for efficient execution on these platforms.

3.2.1 GPU

The experimental work used a combination of an Intel i7-9750H processor, an RTX 3070 Ti GPU with 8 GB of VRAM, and 16 GB of RAM, providing a robust setup. The high clock speed of 3.3 GHz on the i7-9750H benefits CPU-intensive tasks, while the RTX 3070 Ti brings substantial parallel processing power, especially with its 8 GB of VRAM, making it well-suited for deep learning tasks.

A powerful GPU like the RTX 3070 Ti significantly accelerates computations, especially in machine learning and deep learning scenarios where parallel processing is crucial [32]. The ample 16 GB of RAM ensures the system has enough memory to handle large datasets and complex computations without bottlenecks. This hardware configuration is well-matched to the experimental work described in this paper, particularly the training and testing of various models for diabetic foot ulcer classification; combining a high-performance CPU and GPU is essential for achieving optimal results in tasks that demand significant computational power.

3.2.2 FPGA

The potential of ML in serving people is growing rapidly, and there is an increasing requirement for ML to operate in real time. An FPGA-based hardware accelerator plays a role similar to the CPU on the motherboard of a general-purpose computer. Specifically, the FPGA system (board) can be divided into three primary partitions: the FPGA (parallel processing array), the HPS (control unit), and the memory partition (software storage), as displayed in Fig. 8.

Fig. 8 Block diagram of the SoC FPGA

The HPS typically comprises a microprocessor unit (MPU) subsystem with one or two processors, synchronous DRAM (SDRAM) and flash memory controllers, support and interface peripherals, on-chip memories, debug capabilities, and phase-locked loops (PLLs). The FPGA fabric, depending on the device version, includes a control block (CB), PLLs, and high-speed serial interfaces (HSSI), and can additionally incorporate HSSI transceivers, hard PCI Express (PCIe) controllers, RAM, and multipliers [33, 34].

The HPS and FPGA portions are separate, as shown in Fig. 8. The HPS can boot from one of many sources, and the FPGA can be configured either through the HPS or by an external device, allowing control to switch between them.

Please note: FPGA refers to the whole system (the board), and FPGA Part refers to the computational partition of the board.

More specifically, the FPGA part performs all calculations and computations, similar to the CPU but in parallel, while the HPS interprets the system and user commands and the memory partition stores data, system, and user programs. The HPS also handles the commands and the remaining layer computations, including ReLU and max-pooling. Due to memory limitations on the board, inputs and filters are loaded into registers line by line in a split manner. In this research, the Altera DE1-SoC board was the FPGA system selected [35], as shown in Fig. 9.

Fig. 9 Block diagram of the DFU classification-based FPGA

Working with an FPGA first requires coding the user program in the hardware description language Verilog. The user program and its data are stored in the memory partition. The HPS interprets each line of the user program and generates suitable commands for execution. The user program comprises several functions and commands (system programs); one essential function is Send_command, which issues a command to the FPGA to set its next state and the number of packets it should receive.

Initially, the HPS imports an image and decodes it. The model weights are also loaded and made ready for processing at the FPGA input. The HPS sends a compute command to the FPGA, which performs the required computations and returns the result to the HPS. The result, a feature of the input image, is sent for display through the VGA port on the FPGA board. The input images are pre-processed using MATLAB 2021a; the pre-processing functions extract the RGB values, calculate the mean values, and subtract them from the original data. This mean subtraction centres the data, thus boosting the learning speed. Experimentally, when a 16-bit fixed-point format of 1:7:8 is adopted, the input data range is -127 to 127, and all the weights have relatively small values of between 0.03 and 0.3.
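A small sketch of the 1:7:8 fixed-point representation described (1 sign bit, 7 integer bits, 8 fractional bits), written in Python for clarity:

```python
SCALE = 1 << 8  # 8 fractional bits

def to_fixed(x: float) -> int:
    # Round to the nearest representable step, clamped to 16-bit range.
    q = round(x * SCALE)
    return max(-(1 << 15), min((1 << 15) - 1, q))

def to_float(q: int) -> float:
    return q / SCALE

# Example: a typical weight in the reported 0.03-0.3 range.
w = 0.125
print(to_fixed(w), to_float(to_fixed(w)))  # 32 -> 0.125
```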

Due to the presence of three input lines, three registers are loaded to implement the pre-trained CNN models using the function load_mem. This function saves the input-file pointer, counts the locations that must be read, and sets the padding options. Each time, it sends one line from the input file to the FPGA, two data values per packet. When padding is enabled, zeros are added at the front and end of the line; if the line is the first or the last, the whole line is sent. At the end of the function, all the data have been sent; the function waits for the ACK signal and returns with the pointer positioned to read the following line. Next, the function load_fil loads the first 16 filters into the filter register inside the FPGA. It saves the input-file pointer and sends the contents of the 16 filters, each containing nine weights, to the FPGA; at the end of the function, all the weights have been sent, and it waits for the ACK signal and returns with the pointer positioned to read the next filter. When the input is ready, the function compute (which sends a command to the FPGA to compute and waits for the ACK signal when computation finishes) performs the convolution of the first 16 filters with the first three lines of the input file. The next step reads the result of the 16 filters and saves it to local files via the function get_result. This step is repeated until all the filters have been handled. Finally, a new line is loaded and the process repeats; this iteration continues until all filters have been multiplied with all lines of the input file.
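The following high-level Python sketch mirrors this HPS-side control flow. load_mem, load_fil, compute, and get_result are the authors' function names, but the stub signatures and loop structure here are our assumptions for illustration:

```python
def load_mem(f, lines, padding):   # send input line(s) to the FPGA, wait for ACK
    pass

def load_fil(f, count):            # send `count` filters of nine weights each
    pass

def compute():                     # issue the compute command, wait for ACK
    pass

def get_result():                  # read back the 16 filter outputs
    pass

def convolve_on_fpga(input_file, filter_file, n_lines, n_filter_groups):
    load_mem(input_file, lines=3, padding=True)       # prime three input lines
    for _ in range(n_lines):
        for _ in range(n_filter_groups):
            load_fil(filter_file, count=16)           # next group of 16 filters
            compute()
            get_result()
        load_mem(input_file, lines=1, padding=True)   # slide one line down
```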

4 Results and discussion

The proposed DFU models and the pre-trained models were accelerated on the DE1-SoC FPGA using an array of 64 16-bit DSP computation units. This acceleration comprised two elements: the software used for control (the HPS) and the hardware responsible for the convolutional calculations. Only 13 convolutional layers were implemented in hardware, to increase efficiency while respecting the limits of the DE1-SoC's FPGA fabric; the remaining calculations were completed in software, where they could be performed faster than in hardware.

Each convolution layer (CONV) mainly comprises separate control logic and a parallel adder, while the multipliers, which serve as the primary computational engines, are shared across all layers, as seen in Fig. 10. The input data for the convolution are stored in on-chip buffers, and the multiplier outputs are transferred to CONV for summation and accumulation. The CONV results are routed to several different on-chip memories to be used in the next stage.

Fig. 10 Convolutional block diagram inside the FPGA

Accuracy assesses overall correctness, precision evaluates the accuracy of positive predictions, and recall measures the model's ability to capture all positive instances. Together, these metrics provide a nuanced understanding of a classification model's performance; recall (R) and precision (P) in particular are fundamental for evaluating the proposed method (see Eqs. 1–4).

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(1)
$$Precision\;(P)=\frac{TP}{TP+FP}$$
(2)
$$Recall\;(R)=\frac{TP}{TP+FN}$$
(3)
$$F1\text{-}Score=2\times\frac{P\times R}{P+R}$$
(4)

Here, TP (True Positive) denotes the number of relevant images correctly identified by the network; TN (True Negative), the number of images correctly identified as irrelevant; FP (False Positive), the number of images the network incorrectly classifies as relevant; and FN (False Negative), the number of relevant images the network fails to recognize.
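A one-function sketch computing Eqs. (1)–(4) from raw confusion-matrix counts; the example counts are illustrative, not the paper's results:

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (1)
    precision = tp / (tp + fp)                          # Eq. (2)
    recall = tp / (tp + fn)                             # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    return accuracy, precision, recall, f1

print(metrics(tp=95, tn=98, fp=2, fn=5))  # illustrative counts only
```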

The evaluation of our models involved calculating key metrics such as accuracy, precision, and F1-score, and comparing performance across different scenarios. Table 2 presents a comprehensive comparison of the classifiers, each configured with a different approach, in terms of processing time, power consumption, and performance metrics. Notably, the DFU_TFNet cycles have processing times ranging from 102 to 310 ms. While DFU_TFNet (Cycle#1) achieves the lowest processing time at a power consumption of 8.00 W, DFU_TFNet (Cycle#3) attains the highest accuracy, precision, and F1-score at 99.81%, 99.38%, and 99.25%, respectively. DFU_FNet + SoftMax, on the other hand, demonstrates lower processing time and power consumption, making it an efficient alternative. Figure 11 shows the heatmap obtained by visualizing the DFU_TFNet model with Grad-CAM.

Table 2 Comparison between DFU_FNet and DFU_TFNet
Fig. 11 Grad-CAM heatmap visualization for the DFU_TFNet (Cycle#3) model

Table 3 shows a comparative analysis of several deep learning models (AlexNet, VGG16, GoogleNet, DFU_FNet + SVM, and DFU_TFNet (Cycle#3)) across key performance metrics. Processing times vary across models, with the GPU consistently achieving lower processing times than the FPGA; however, the power consumption of the FPGA is considerably lower than that of the GPU across all models, underscoring its energy efficiency. In terms of accuracy, precision, and F1-score, DFU_TFNet (Cycle#3) emerges as the standout performer, with an impressive accuracy of 99.81%, precision of 99.38%, and F1-score of 99.25%, reflecting the model's robust performance. The trade-offs between FPGA and GPU are evident, with the FPGA offering energy efficiency at the cost of slightly longer processing times. The FPGA evaluation in Table 4 considers the essential resources of the DE1-SoC: total logic elements, block memory, and logic registers.

Table 3 Our proposal vs. pre-trained CNN models
Table 4 Summary of resources for the DE1-SoC

The processing time comparison between the GPU and FPGA revealed that while the GPU is faster, it demands significant power. Conversely, the FPGA exhibits substantially lower power consumption, making it an attractive choice for smart devices with limited battery resources. With further advances in FPGA technology, FPGA processing times could become comparable to GPUs. As summarized in Table 5, the overall results indicate the preferred platform for each of the achieved metrics and performance benchmarks.

Table 5 Recommended platforms based on the hardware analysis

Table 6 presents a comparative analysis of deep learning models focused predominantly on DFU detection. EfficientNet [19] achieved an impressive 98.97% accuracy, with high F1-score, recall, and precision, on a GPU with correspondingly high power consumption. ResNet50 [20] demonstrated a notable 99.49% accuracy for ischemia and 84.76% for infection, also on a GPU with high power consumption. DFU_QUTNet [23] and DFUNet [26], both on GPUs, exhibited an F1-score of 94.5% and an accuracy of 96.1%, respectively. The proposed model, DFU_TFNet (Cycle#3), stands out with a remarkable accuracy, precision, recall, and F1-score of 99.81%, 99.38%, 99.76%, and 99.25%, respectively. Notably, DFU_TFNet runs on both FPGA and GPU, potentially mitigating power consumption through the low-power FPGA setting. This combination of high performance and potentially lower power usage makes DFU_TFNet an intriguing prospect for real-world applications in medical imaging and diagnostics.

Table 6 Comparison between our proposed model and other studies

5 Conclusions

This research proposes new diagnostic tools with real-time processing capabilities for DFU classification, addressing a significant healthcare challenge. The key findings include the effectiveness of the two proposed CNN models, DFU_FNet and DFU_TFNet, in automatically categorizing DFU cases into normal and abnormal foot skin. These models were designed to overcome common deep learning pitfalls, utilizing techniques such as same-domain transfer learning. The results indicate that DFU_FNet and DFU_TFNet outperform classifiers such as SVM and KNN as well as pre-trained CNN models such as AlexNet, VGG16, and GoogleNet. The models were trained and tested on different HPC parallel platforms, including GPUs and FPGAs, significantly reducing power consumption and execution time. Additionally, features extracted by DFU_FNet were leveraged to train SVM and KNN classifiers, further enhancing the overall classification process. The proposed framework, particularly DFU_TFNet (Cycle#3), achieved an impressive accuracy, precision, recall, and F1-score of 99.81%, 99.38%, 99.76%, and 99.25%, respectively, surpassing current methodologies. The FPGA implementation, utilizing DE1-SoC resources, exhibited a reasonable power consumption of 9.16 W. Future directions involve transforming the model into a wearable smart application, enabling patients to monitor their condition anytime, anywhere, with prolonged battery life and swift processing. Training images will be securely stored on an online server to prioritize data privacy; during testing, on-device processing on the wearable will reduce data exposure and align with security best practices, and strong encryption will safeguard data transmission between the device and the server.