1 Introduction

Millions of embedded devices are connected to the internet today, and the number is increasing day by day; the total number of connected devices will reach nearly 5.3 billion by 2023 [1]. With so many connected devices, the network handles an enormous amount of data every day. Owing to the rising number of digitally connected devices, internet traffic will reach 20.6 zettabytes (ZB) [1]. Of the massive data generated in chunks, only a fraction is saved permanently in databases as useful information after data mining and analysis. Traditionally, computing has depended on remote cloud architecture to meet the intensive need for data processing. Major application fields are surveillance, traffic analysis, industrial monitoring, mobile data and other embedded devices referred to as the internet of things (IoT), and single-board computers. IoT devices merely provide raw data, which is further processed in a cloud computing environment to extract meaningful information and draw inferences from it [2].

Machine learning (ML) and deep learning (DL) are subsets of the broader field of artificial intelligence, in which extensive research and development have taken place during the last decade. A model can be trained on a dataset for a specific application with the help of the cloud or any computing device with GPU support to reduce the training time. Model inference can be done on an embedded platform, as it requires fewer computing resources. Deploying a pre-trained model directly on mobile devices is impractical because of its considerable storage and memory bandwidth requirements [3]. Machine learning modules can be incorporated in low-resource end devices, but deep learning requires more computational power and memory. Such applications are computationally resource hungry, so they are executed mostly on cloud servers or personal computers with reasonable hardware support. Moving resources from the core to the edge of the network is a feasible solution that trades limited computation for reduced communication latency [4]. In the fog layer [5], most IoT data is analyzed near the devices themselves; an array of edge computing devices takes over a major share of the computational tasks and relieves traffic on the core network. The potential of edge nodes is considerable: each node can think and act independently with the help of a machine learning toolkit, which enables machines to learn and act like humans [6]. Edge Machine Learning (Edge ML) [7], a derivative of edge computing, offers machine learning and deep learning algorithms with the capability of processing data locally (at the device level or on connected local servers), thereby replacing solely cloud-based applications. Notwithstanding their role as an alternative extension of the cloud, edge nodes complement it by computing near the edge of the network.

In some situations, processed data needs to be available without delay, yet availability may be hindered by internet traffic or server-related issues; this is especially true in healthcare applications, where edge and fog computing play an important role. Machine learning based applications in healthcare systems that provide clinical decision support (CDS) [8] demand reduced latency, with a growing need for smart and intelligent devices. Machine learning enables electronic devices to learn from existing data and to use this acquired knowledge to make assessments and take decisions autonomously [9].

It is now possible to perform machine learning directly on embedded devices, so that they can be used in the field for an end-to-end workflow. This has paved the way for embedded systems that make intelligent decisions from current and previous data. The Internet of Medical Things (IoMT) applies IoT to medicine and healthcare, where patient-centered data is closely monitored, collected and analyzed for research [9]. A digitized healthcare system connects the available medical resources in healthcare services and supports follow-up procedures, including health statistics. IoMT also enables remote health monitoring and emergency notification via mobile devices [10]. A sensor network can be connected to monitor the well-being of patients, ensuring treatment by collecting and analysing information about the subject without delay, even in adverse conditions. IoMT plays a crucial role in disease prevention and control by tracking patients from a remote location, capturing real-time data and analysing patient data using statistical tools. This paper proposes a deep learning approach for malignancy detection in histopathology images using the limited resources of an embedded platform. A task-specific custom model is designed to ensure the best performance with minimum execution time. The rest of the paper is organized as follows. Section 2 presents a detailed study of related works. The proposed methodology is described in Sect. 3. The implementation results and findings are reported in Sect. 4. Section 5 draws the conclusions.

2 Related Works

This section outlines a detailed study and comparison of various works related to the application. Most of the available machine learning and deep learning algorithms rely on cloud computing or other data mining tools available in repositories. Intelligent virtual assistants are software agents that have become part of our lives [11]. Convolutional neural networks (CNNs) can run on embedded systems only if the hardware can handle large volumes of data and has sufficient memory. For the same reason, the available literature mostly refers to either general-purpose or custom embedded devices with accelerating hardware support [12, 13]. Edge computing has the potential to overcome these limitations by moving computation closer to the network edge and the data sources, especially in real-time applications [14]. In [15] the authors suggest a novel tree-based algorithm, Bonsai, for efficient prediction on an IoT device, the Arduino Uno board with an ATmega328P micro-controller operating at 16 MHz with 2 KB of RAM and 32 KB of read-only flash. They show that, with this algorithm, an IoT device can maintain prediction accuracy with a minimal model size and make predictions without connecting to the cloud. They obtained prediction times in milliseconds on slow micro-controllers. In [16] the authors present a software accelerator that enhances deep learning execution on heterogeneous hardware, including mobile devices. In their work using edge mobile devices, an acquired image is preprocessed and segmented. They have also classified the image on a remote server running a pre-trained convolutional neural network (CNN). Tomasz Szydlo [17] presents the concept of executing machine learning models on embedded devices with constrained resources by generating the source code of the models.
Suresh [18] suggests a solution that performs machine learning on the embedded system using the k-nearest neighbors algorithm and low-power transmission through LoRa. They also show, in a field trial on animals, that edge processing on an energy-efficient ARM Cortex-M0+ processor extends battery life. Lane [19] implemented DeepX, a software accelerator for deep learning execution that places less demand on device resources. They suggest that efficient execution is possible on heterogeneous processors with a resource control algorithm. The aforementioned algorithm has been implemented in wearable devices [20] as well.

Dundar [21] proposes a method based on a scheduled routing topology for deep CNN acceleration hardware to make maximum use of the available computational resources. They optimized it for an embedded platform by transforming the high-level representation of a deep CNN (DCNN) into operation codes that execute in a hardware accelerator, tested in real time on a field programmable gate array (FPGA). Song Han [3] reduced the storage requirement by compressing the network with a three-stage pipeline of pruning, trained quantization and Huffman coding, without loss of accuracy. Their method reduced the storage of the popular pre-trained models AlexNet and VGG-16 by factors of 35x and 49x, respectively. This method also enables the models to be deployed on embedded platforms. Gupta et al. [22] propose a k-nearest neighbor based algorithm for accurate real-time prediction on resource-scarce devices such as the Arduino UNO with 2 kB of RAM. Content can be brought closer to end users through content delivery/distribution networks (CDNs), which are information centric. Edge computing reduces transmission time and bandwidth consumption for applications by placing computing, storage and other resources at the edge of the network [23]. The concept of extending fog computing to the edge of the network, introduced in [24], incorporates intelligence in IoT devices and uses fog networking to create the FC-IoT platform. Edge computing supports various IoT-based applications in patient monitoring where portability and system robustness are major concerns. For example, wearable devices can help elderly or disabled people round the clock and alert the response team in case of a medical emergency [25]. IoT-based healthcare systems and the challenges involved are described in [26].
The authors also highlight the need to interconnect all resources for effective and efficient healthcare service in addition to the medical assistance offered. They confirm that smart devices in healthcare can be used to monitor the overall health activity of patients in emergency situations.

Verma [27] made initial attempts to use ultra-low-energy computational platforms for seizure detection. In their system-on-chip (SoC) design, the instrumentation and computation blocks detect anomalies by sampling the EEG channel periodically and deriving features every two seconds. Lee [28] combined a flexible accelerator with a general-purpose CPU to enable feature extraction for a range of physiological signals and clinical applications. The accelerator addresses energy and memory usage, and model adaptation is implemented through hardware-supported embedded active learning. Wang [29] proposed embedded ML that learns the statistics of the computational outputs, creating an error-aware model for analysis through an embedded feature learning framework. Parker [30] suggests an artificial neural network (ANN) capable of learning on Arduino Pro Mini microprocessors; significant work has been done on hardware ANN implementations, which exploit the notable parallelism in neural networks. Liu [31] developed a recognition system for diet-related assessment with pervasive mobile devices. Their food recognition system uses edge computing rather than traditional mobile cloud computing, which causes system latency.

Alippi [32] used a Hidden Markov Model to model the relationship between two generic data streams through a sequence of linear dynamic models whose coefficients are used as features. Their model automatically detects changes in data streams obtained from multi-sensor units by comparing them with the probabilistic patterns of normal data. Badan [33] implemented a CNN in an embedded system with limited resources using a neuro-evolution method, optimizing CNNs with respect to classification error and power consumption. Venieris [34] suggests an optimized mapping of deep neural networks onto embedded platforms with CNN accelerators to achieve high throughput and low latency. In their work, cascaded classifiers with domain-specific knowledge enable the deployment of sophisticated deep learning models on edge, mobile and embedded systems. Umuroglu [35] developed FINN, a framework for building fast and flexible FPGA accelerators. On a ZC706 embedded FPGA platform drawing less than 25 W total power, they achieved 12.3 million image classifications per second on the MNIST dataset with 95.8% accuracy. Tsimpourlas [36] proposes a method for efficient deployment of CNNs on low-power processor based architectures used as edge devices in IoT networks, analyzing the utilization of hardware resources on the edge devices. It is evaluated using different CNNs on the Intel Movidius Myriad2 low-power embedded platform. Lee [37] adopted compression methods, namely pruning and knowledge distillation, to downsize CNNs for deployment on resource-constrained embedded systems. David [38] used TensorFlow Lite Micro, an open-source ML inference framework, for running deep learning models on embedded systems. FPGA-based custom device implementations offer higher performance than general-purpose computing devices [12, 13, 39].
They can be programmed using a hardware description language, in which several programmable blocks are connected to implement the logic [40, 41]. Although many FPGA accelerators have demonstrated better performance than generic processors, accelerator design is still evolving.

Table 1 Comparison of related works

Table 1 compares existing works that use Raspberry Pi and machine learning for different applications, including the proposed work. Various edge and embedded system based works using machine learning techniques in different fields are considered for comparison. The literature review makes clear that malignancy prediction in histopathology images using Raspberry Pi along with deep learning is less explored.

3 Methodology

This section describes the dataset used, configuration of the proposed model and methodology.

3.1 Dataset

The PatchCamelyon (PCam) dataset [49] used in our work contains 96 \(\times\) 96 pixel color images (patches) annotated by experts with a label indicating the presence or absence of metastatic tissue. These patches, extracted from histopathology images of lymph node sections, make up the PCam benchmark classification dataset.

Of the 40,000 images in the database, 80% are allotted for training and 20% for validation while maintaining an equal number of images from both classes, which is paramount in semi-supervised learning and is also a rule of thumb in the literature.
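The split described above can be illustrated with a minimal sketch. It assumes a perfectly balanced label list (20,000 images per class) standing in for the 40,000 patches; the function name `stratified_split`, the seed, and the label encoding are illustrative choices and are not taken from the paper.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, val_idx) so that each class keeps the same ratio in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = int(round(train_fraction * len(indices)))
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical balanced labels: 0 = benign, 1 = malignant (assumed encoding).
labels = [0] * 20000 + [1] * 20000
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Splitting per class, rather than over the whole index range, is what keeps the class balance identical in the training and validation parts.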

3.2 Configuration of Model

Table 2 summarizes the configuration and parameters of the tuned CNN model used to obtain optimized results.

Table 2 Overview of base model configuration

The base CNN model shown in Fig. 1 has three convolutional layers with pooling layers in between and two fully connected dense layers. Each convolutional layer uses valid padding, so the output feature map shrinks in proportion to the filter size used. We used the non-linear activation function ReLU [50] (Rectified Linear Unit) in all hidden layers.

Fig. 1 Block diagram of the base model

The first convolution layer uses 30 filters of kernel size 8 \(\times\) 8 to learn larger features, the next convolution layer has 100 filters of size 5 \(\times\) 5, and the subsequent convolution layer has 100 filters of size 2 \(\times\) 2. The stride is set to one for the convolution operations in all layers by default. The input to the model is 96 \(\times\) 96 images from the standard database [49]. The two fully connected layers in the model, with sigmoid activation, perform classification based on the features extracted by the convolution layers. Each convolution layer works together with max pooling, batch normalization and dropout layers, so that the extracted features are sub-sampled by the pooling layers for dimensionality reduction. The max pooling layers have a 3 \(\times\) 3 sliding window with a stride of 2 and are added after the convolution layers to down-sample the feature maps. The pooling operation thus scales down the feature map to reduce computational overhead. Batch normalization is applied at the intermediate phase of the convolution stage, but not in the fully connected layers without dropout. Normalizing layer inputs reduces internal covariate shift [51] during mini-batch training [52] and reduces over-fitting. Each scalar feature is normalized independently to zero mean and unit variance. This step reduces the dependence on random weight initialization, which can otherwise lead to vanishing or exploding gradients. Algorithm 1 gives the pseudo-code for the proposed CNN architecture.
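To make the effect of valid padding and pooling on the feature-map size concrete, the following sketch traces the spatial dimensions through the three convolution stages described above. It is only arithmetic, not the authors' implementation, and it rests on stated assumptions: three input color channels, one unpadded 3 \(\times\) 3 max pooling layer with stride 2 after every convolution layer, and a parameter count that covers convolution weights and biases only (batch normalization and dense layers are excluded).

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1

def trace_feature_maps(input_size=96, in_channels=3):
    # (filters, kernel size) of the three convolution layers in the text.
    conv_layers = [(30, 8), (100, 5), (100, 2)]
    size, channels = input_size, in_channels
    trace, conv_params = [], 0
    for filters, kernel in conv_layers:
        # Weights plus one bias per filter.
        conv_params += kernel * kernel * channels * filters + filters
        size = conv_out(size, kernel, stride=1)   # valid convolution, stride 1
        trace.append(("conv", size, filters))
        size = conv_out(size, 3, stride=2)        # 3x3 max pooling, stride 2 (assumed unpadded)
        trace.append(("pool", size, filters))
        channels = filters
    return trace, conv_params

trace, conv_params = trace_feature_maps()
for kind, size, ch in trace:
    print(f"{kind}: {size} x {size} x {ch}")
print("convolution parameters:", conv_params)
```

Under these assumptions the 96 \(\times\) 96 input shrinks to an 8 \(\times\) 8 \(\times\) 100 volume before the fully connected layers, which is consistent with a compact model.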

Algorithm 1 Pseudo-code of the proposed CNN architecture

Table 3 shows the details of the CNN models used in our experiments. The model number identifies each model considered, and the memory consumed by each model is given under model size. All the customized models in the table are trained with various cyclic learning rate [53] strategies. The weights of pre-trained models such as MobileNet [54] and MobileNetV2 [55] are also used in the experiments for performance comparison.
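As an illustration of a cyclic learning rate, the sketch below implements the triangular policy, one common cyclic schedule. Which policies and which bounds were actually used for the models in Table 3 is not stated here, so `base_lr`, `max_lr` and `step_size` are hypothetical values.

```python
import math

def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclic learning rate: rises linearly from base_lr to max_lr
    over step_size iterations, then falls back over the next step_size."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Hypothetical bounds and half-cycle length, for illustration only.
schedule = [triangular_clr(i, base_lr=1e-4, max_lr=1e-2, step_size=100)
            for i in range(0, 401, 50)]
print(schedule)
```

The rate returns to `base_lr` every `2 * step_size` iterations, so the schedule repeatedly revisits both small and large step sizes during training.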

Table 3 Model versions with size

Each model is deployed on the embedded platform, where it extracts the features and produces the prediction and binary class of the test images.

Table 4 Configuration of devices

The details of the Raspberry Pi (RPi) used for the experiment are given in Table 4, along with those of the stand-alone computer (SAC) used for performance comparison.

3.3 Embedded Webserver

Communication is established from the mobile device to the embedded web-server via TCP/IP over an ethernet connection. The query image is loaded at the user end and sent for processing at the other end. The server computes the probability using the saved model and sends the report as the response. In this way, computing at the edge device is utilized even though its resources are limited. Deploying the model directly on the RPi has drawbacks: a display device is required to view the output, and also to browse the contents of the folder and select the image. Instead, the Raspberry Pi acts as the web server and the commands are given from the mobile device. Communication with the RPi is possible through lighttpd, an open-source web server optimized for speed-critical environments. Its low memory footprint (compared to other web servers), small CPU load and speed optimizations make lighttpd suitable for servers that suffer from load problems, or for serving static media separately from dynamic content. The Python micro-framework Flask [56] is used to script the server side, since it is faster and more reliable than other general-purpose scripting languages for web development such as PHP. Flask is considered a micro-framework because it does not require particular tools or libraries.

3.4 Deploying the Model into Embedded Platform

Embedded platforms hardly support heavy computation, as they are designed for special-purpose computing applications, but they can be customized for the application at hand. The file size of pre-trained models restricts their feasibility on embedded devices designed for dedicated applications. Such portable devices can be customized for medical applications by loading only the key library functions, taking maximum advantage of the resources while minimizing the memory footprint. We used the save_model and load_model options in Keras [57] to save the trained model and its weights in .h5 format. Once the saved file (saved_model.h5) is ported to the embedded device, histopathology images from the database are also loaded into memory for prediction using the mobile device. Next, the essential software is loaded onto the RPi to support the built-in functions. We selected only the best-performing model to be ported to the embedded platform to facilitate simple and reliable histopathological image classification.

Figure 3 shows the steps involved in the deployment of the model. The sequence is initiated by executing the Python file, which launches a window prompting the user to load an image; the user can then navigate to the folder where the images are saved. The user is then prompted to load the model required for prediction, and the results are displayed in another window along with the ground truth label for comparison. The probability of association with the actual class is also indicated to confirm the predicted class. The code executes the saved model on the RPi and displays the result. CNN-based deep learning with a custom model is feasible, easy to port and computationally efficient compared with state-of-the-art pre-trained model approaches. The model with the best performance metrics is deployed on the RPi, thereby reducing false positive (FP) instances. When the user loads a new image, the model provides the prediction label along with the probability of malignancy in the image. Our proposed CNN model is relatively small (less than 5 MB), which makes it easier to load onto portable devices. The image prediction and classification tasks were performed on general-purpose embedded devices, the RPi Model 3B+ and RPi Model 4B, with 1 GB and 2 GB of RAM respectively.

3.5 Embedded Web-Server Framework

The detailed layout of the proposed method is shown in Fig. 2. The user types the web address as a URL, which is a reference to a web resource that specifies the location of a device on a computer network. The proposed system comprises a mobile device and an RPi acting as the embedded web-server, connected via WLAN, with optional intermediate fog nodes.

Fig. 2 Representation of the proposed method

A medical practitioner can enter the URL on the mobile device and use the webpage that appears on the screen to select the test image and upload it to the Raspberry Pi based web-server as a request. Through the same webpage, a suitable model can also be selected for prediction and classification of the uploaded image. The web-server performs the computation using its own resources and provides the result as the response. A single router may limit the range of communication between the web-server and the mobile device; hence we used a WiFi router to extend the range so that devices farther away can connect. The complete sequence of communication over the local network is depicted in Fig. 3. The framework can optionally be connected to a fog node with limited computation if the patient data is to be stored on a local server for later analysis.

Fig. 3 Communication between client and web-server

The client forwards the request to the server; the server processes the request and sends the response to the client. The RPi, which is the host, acts as the web-server and communicates with the client through wireless LAN (WLAN) via TCP/IP. The input image is fetched from the mobile device with any built-in browser and forwarded to the embedded server for processing. The RPi processes the image using the saved model (stored in memory), classifies the image and computes the probability of association with the respective class. The operation is complete when the web-server returns the response to the client side, which displays the prediction probability and image class on the mobile screen. The system can be enhanced so that the response from the server is saved in a cloud server or fog network for later use. The medical expert can take a picture of a histopathological image with the mobile device and send it to the embedded web-server. At the processing end, the image is resized to 96 \(\times\) 96 so that the saved model can extract its features, predict its class and compute the probability of the image being malignant. The system-on-chip single-board computer and the user-end device are on the same network, protected by a firewall, so that attacks from other networks can be prevented. Moreover, the latency in decision making can be reduced, especially in ascertaining the exact cause of disease. The compact trained model fits easily into the limited memory of the end device along with the supporting libraries. The memory footprint is also small, and plug-in features make the device easy to replace and maintain. The host requires no additional display, as the client side is a mobile device. The memory usage of the proposed device configuration is minimal, since optimization is performed at all implementation levels.
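The request and response exchange described above can be sketched as follows. This is not the authors' Flask application: it swaps in Python's standard-library `http.server` so that the example has no third-party dependencies, and the names `classify_request`, `stub_predict` and `UploadHandler`, the JSON field names, and the 0.5 decision threshold are all assumptions made for illustration. The stub predictor stands in for decoding the upload, resizing it to 96 \(\times\) 96 and running the saved CNN.

```python
import json
from http.server import BaseHTTPRequestHandler

def classify_request(image_bytes, predict_fn, threshold=0.5):
    """Build the JSON response returned for one uploaded image."""
    if not image_bytes:
        return json.dumps({"error": "empty upload"})
    probability = float(predict_fn(image_bytes))   # probability of malignancy in [0, 1]
    label = "M" if probability >= threshold else "B"
    return json.dumps({"class": label, "probability": round(probability, 4)})

def stub_predict(image_bytes):
    # Placeholder for: decode image, resize to 96 x 96, run the saved model.
    return 0.91

class UploadHandler(BaseHTTPRequestHandler):
    """Answers each POST (the uploaded image) with the predicted class and probability."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = classify_request(self.rfile.read(length), stub_predict).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

print(classify_request(b"raw image bytes", stub_predict))
```

The handler could be served with `HTTPServer(("", 8080), UploadHandler).serve_forever()` from `http.server`, where port 8080 is an arbitrary choice. Keeping the classification step in a plain function separates the model call from the transport, so the same logic could sit behind a Flask route instead.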

The application is designed to communicate with multiple embedded devices at different instances by changing the URL. It can also be configured to support multi-domain expertise by selecting the corresponding dataset for model training. In this application, we have dedicated one RPi to each medical expert. This arrangement increases throughput and reduces latency. The embedded device launches the application at start-up without initial settings, so that a medical practitioner who is not familiar with device settings or technical know-how can use it hassle-free. Another advantage is that, once the device is configured for a particular network, there is no need to reconfigure it while it is used on the same network. Devices within the range defined by the subnet can connect to the web-server using its IP address. The application needs little internet support, as it runs entirely across devices within the intranet that have access permissions. A multi-user configuration affects the performance of the device, as it requires allocating and sharing resources with limited memory. In future, a multi-user environment with separate log-ins can be implemented for simultaneous usage with multi-device support by configuring the existing web-server as an open-source HTTP web-server.

4 Experimental Results

This section presents the implementation results and main observations of our work. Embedded devices with limited computing resources can be used for prediction and classification of histopathology images by porting a trained deep CNN model to the device. The work is implemented on various CNN models with different learning strategies and numbers of neurons in each layer, as shown in Table 3. The CNN models are then deployed on the embedded RPi. The test images are uploaded from the mobile device to the local storage of the embedded device for classification and prediction. Thus the memory of the RPi is minimally utilized. The performance of two variants of Raspberry Pi with different configurations is compared and the values are tabulated. The overall execution time of the proposed custom models is considerably lower than that of the pre-trained models. This is an added advantage when embedded devices are used for field applications.

Table 5 Execution time in different devices for prediction and classification of histopathology images

Table 5 shows the total execution time taken by the different system-on-chip embedded devices to process the given set of test images. Various sets of test images from the same dataset are used to ensure consistency of results. The total execution times on the RPi web-servers are compared with those of the stand-alone computers. The results in Table 5 show that the RPi-4B takes minimal execution time for a substantial number of test images.

Fig. 4 Average prediction time per image in different computing platforms

Figure 4 shows the average prediction time per image across the different devices on test images from the dataset, for each model considered in the work. The results in the figure agree with the values in Table 5, showing that the RPi-4B, with its limited resources, takes less mean prediction time per test image. The comparison shows that the mean prediction time of the RPi-4B is comparable to that of a general-purpose stand-alone computer using the same model.
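The two timing quantities reported here, total execution time and mean prediction time per image, can be measured with a small harness such as the one below. It is a sketch rather than the measurement code used for Table 5 and Fig. 4; `mean_prediction_time` is a hypothetical helper, and the dummy predictor and zero-filled images only stand in for a real model and real patches.

```python
import time

def mean_prediction_time(predict_fn, images):
    """Return (total_seconds, mean_seconds_per_image) for one model on one device."""
    if not images:
        raise ValueError("at least one image is required")
    start = time.perf_counter()
    for image in images:
        predict_fn(image)
    total = time.perf_counter() - start
    return total, total / len(images)

# Dummy workload: 20 zero-filled 96 x 96 RGB buffers and a trivial "predictor".
images = [bytes(96 * 96 * 3)] * 20
total, mean = mean_prediction_time(lambda img: sum(img[:1000]), images)
print(f"total: {total:.6f} s, mean per image: {mean:.6f} s")
```

Running the same harness with the same model on each device gives directly comparable numbers, since the mean is simply the total divided by the number of test images.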

Fig. 5 Snapshots displaying the class and prediction results of histopathology images in mobile screen with the URL

Snapshots of the interface are shown in Fig. 5; it allows the user to load the test image, the CNN model and the ground truth label from the datasets in the front-end. The test image is then analyzed by the trained model selected by the user. The interface then displays the predicted class of the image and the probability of malignancy in the test histopathology image. Provision for remarks from experts is also incorporated in the drop-down menu where the actual label is displayed. The application works without an internet connection, which gives medical experts seamless access to the device. In case of any glitch, the device can be replaced in minimum time, as the components are completely modular and customizable.

Table 6 Prediction results of Raspberry-Pi and stand-alone computer

A set of histopathology images is selected randomly from the dataset and tested. The results obtained are given in Table 6. The ground truth labels provided with the dataset, annotated by experts, indicate the benign or malignant class of each image. Prediction results for the RPi-3B+ and RPi-4B, along with those of the stand-alone computer, are included in the table. In the prediction results, 'M' represents malignancy whereas 'B' stands for the benign nature of the test image. All correct predictions in a row indicate a perfect prediction score, which supports the quality of disease diagnosis and thereby simulates medical expertise. The prediction results of the embedded devices are compared with those of the stand-alone computer and the ground truth label. This process is equivalent to consulting two or more experts for a second opinion in diagnosing the disease. This step can also be performed as a standard procedure in a practical scenario to maximize the likelihood of true positive and true negative results, thus reducing false negative cases. The mean prediction times for both embedded devices in Fig. 4 show that the embedded device can perform well with a custom trained model for the classification of histopathology images.
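The comparison of device predictions against the ground truth can be expressed as confusion counts. The sketch below uses the 'M'/'B' coding from Table 6, but the label sequences themselves are invented for illustration and are not the paper's results; `confusion_counts` and `agreement` are hypothetical helper names.

```python
def confusion_counts(truth, predicted, positive="M"):
    """Count true/false positives and negatives, treating `positive` as malignant."""
    pairs = list(zip(truth, predicted))
    return {
        "TP": sum(t == positive and p == positive for t, p in pairs),
        "TN": sum(t != positive and p != positive for t, p in pairs),
        "FP": sum(t != positive and p == positive for t, p in pairs),
        "FN": sum(t == positive and p != positive for t, p in pairs),
    }

def agreement(first, second):
    """Fraction of images on which two predictors give the same class."""
    return sum(a == b for a, b in zip(first, second)) / len(first)

# Made-up example labels, not taken from Table 6.
truth = list("MBMBBM")
device_a = list("MBMBBM")   # hypothetical embedded-device predictions
device_b = list("MBBBBM")   # hypothetical stand-alone computer predictions

print(confusion_counts(truth, device_a))
print(confusion_counts(truth, device_b))
print(agreement(device_a, device_b))
```

A false negative (a malignant image predicted as benign) is the costly case in diagnosis, which is why cross-checking one predictor against another, as in the second-opinion analogy above, is useful.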

5 Conclusion

Bringing computing capabilities to the edge of the network provides an advantage in disease diagnosis and assists healthcare systems without dependence on cloud services. In this work, an efficient deep CNN based classifier on an embedded platform is suggested for histopathology images. The baseline model proposed in our work consists of a three-layer deep CNN trained with different fully connected layers to produce variants of the model. Cyclic learning rate strategies are adopted to improve the performance of the models, as they reduce over-fitting and converge faster with consistent separability. The Raspberry Pi is selected for the work because it is a modular and portable embedded device. We configured the device to interface with any mobile device so that the user can select and load the test pathology image as well as the model for prediction. The prediction results are then displayed on the mobile screen along with the actual label and the probability of malignancy.

The performance of the work is tested by considering class prediction probability, mean prediction time and overall execution time. We used two RPi versions (RPi-3B+ and RPi-4B) and compared their mean prediction time with that of a stand-alone computer; the mean prediction time and overall execution time are lower on the embedded devices. The predictions obtained are consistent with the ground truth labels. The application we developed is web-based, which eliminates the overhead of software installation on the host device and of additional infrastructure at the client side. The method is also found to be secure and platform independent, supporting machine learning assisted clinical decision support. Clinicians with little technical know-how can operate the device without difficulty. Computing with ubiquitous end devices benefits healthcare systems in hospitals, as the devices can be deployed anywhere to collect data on-site and interpret it locally. Based on the performance results obtained, we conclude that embedded devices with a custom trained CNN model are well suited to classifying benign and malignant histopathology images. Future research will address processing large-scale whole slide image data on end devices in a multi-user environment.