1 Introduction

The Internet of Things (IoT) is one of the major enablers driving the fourth Industrial Revolution by providing an ecosystem where processes can be remotely monitored and controlled. A typical IoT ecosystem consists of a number of edge nodes sending sensor information to a gateway device, or directly to a cloud server for analysis [1]. In low bandwidth applications, resource constrained edge devices with limited computing capabilities can be used. In these cases, all the computing is done on the cloud server or on a more powerful gateway node (fog computing [2]).

There is a vast array of applications for image analysis, in fields as broad as medical diagnostics, ecology research, fire detection systems, vehicle tracking and traffic management [3,4,5,6,7,8,9].

In situations where increasing the bandwidth capability of the IoT system is not feasible (e.g. in LoRa [10] communications infrastructures), the images cannot be sent to the cloud or gateway device for analysis. Image processing would therefore need to be performed on the edge device, either by using a device with more computing capability, leading to higher cost and power consumption, or by optimizing algorithms to use the available computational power more effectively. As the trend towards edge computing over cloud computing grows [11], it is fitting to find ways of increasing the amount of data analysis that can be done locally on existing IoT infrastructures. The edge devices targeted in this paper are low cost devices with limited processing power and on-board memory (i.e. resource constrained devices).

In microcontrollers with no dedicated hardware floating point unit (FPU), the execution time of an algorithm using floating point arithmetic is much longer than in a device with an FPU [12]. It has also been shown that in older architectures, integer arithmetic is significantly faster than floating point arithmetic [13]. In off-grid IoT edge devices where battery life needs to be conserved, increasing the speed of algorithms reduces the amount of time that the device needs to be on. In applications where image processing is required, performing local image processing on the edge device means that only the processed information needs to be sent to the cloud server or gateway device, instead of the entire image – further conserving battery power. It also reduces the need for high bandwidth (and high power) wireless communication infrastructures such as WiFi to be used in IoT ecosystems which require image analysis.

The advantage of floating point arithmetic lies in its precision. However, by using integer scaling, fractional arithmetic can be approximated whilst only using integer data types [14]. The accuracy of the result is dependent on the size of the integer used, with 8-bit integers being the least accurate. This means that integer scaling is less effective as a substitute for floating point arithmetic on 8-bit processors. In terms of storing floating point numbers, most processors use the IEEE 754 standard [15]. The memory requirements coupled with the slower computational speeds associated with floating point arithmetic therefore make it difficult to perform this arithmetic on resource constrained devices (such as low cost IoT nodes).
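As a minimal sketch of integer scaling (the function name, the coefficient 0.3 and the scale factor of 256 are illustrative assumptions, not taken from [14]), a fractional coefficient can be pre-multiplied by a power of two, applied with an integer multiplication, and the product shifted back down:

```c
#include <stdint.h>

/* Hypothetical example: multiply an 8-bit sample by 0.3 without floats.
 * 0.3 is approximated as 77/256, since 0.3 * 256 = 76.8, rounded to 77. */
#define SCALE_SHIFT 8u   /* scale factor 2^8 = 256 (assumed) */
#define COEFF_Q8    77u  /* scaled coefficient */

uint8_t scale_by_0p3(uint8_t x)
{
    /* 255 * 77 = 19635 fits in 16 bits, so the product cannot overflow. */
    uint16_t product = (uint16_t)(x * COEFF_Q8);
    return (uint8_t)(product >> SCALE_SHIFT);
}
```

A wider integer allows a larger scale factor and therefore a smaller approximation error, which is consistent with 8-bit integers being the least accurate option.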

One of the fundamental steps in a typical image processing pipeline is detecting edges in the image. This step allows the image to be reduced in complexity, with only the key features (edges) of the image being used further down the pipeline. There are a number of widely used edge detection algorithms available [16]. Two popular ones are the Sobel and Laplacian edge detectors [17]. Both of these are gradient-based edge detection methods, which require the use of floating point arithmetic for computation.

The primary motivation in this paper is to improve and enable image processing on the edge for low-power resource limited processors. This paper proposes a method of detecting edges in an image using unsigned integer arithmetic (OptInt framework), and storing the edge image using 8-bit unsigned integer memory. This OptInt implementation eliminates the need for floating point computation, while producing similar results to the popular Sobel and Laplacian gradient based edge detectors. This method of edge detection is capable of detecting strong edges in a noisy environment. This is significant when low resolution images are used (as is the case for resource constrained microcontrollers).

The proposed method is an adaptation of the Laplacian edge detection method where pixels in a rectangular window patch are subtracted from the centre pixel in the window to compute the edge gradients, in a sliding window approach [17]. However, unlike the Laplacian method which uses floating point arithmetic, the OptInt method uses unsigned integer arithmetic for computation of the edges. A Rectified Linear Unit operation (ReLu) [18] is applied at each subtraction stage, thereby eliminating the need for signed memory storage. A scaling factor (using bit shifting) was also applied after each ReLu operation, to prevent the output from saturating. This has the effect of reducing noise in the edge image.

Hence the contributions of this paper are as follows:

  • To demonstrate how edge detection algorithms can be optimised using integer only arithmetic.

  • To benchmark implementations of optimised algorithms on resource limited CPUs.

2 Related work

In [19], the authors implemented a 2D Gaussian filter using fixed point arithmetic in an FPGA. Gaussian filtering is another common image processing technique, and inherently involves convolutions using floating point arithmetic. The authors found that by using fixed point arithmetic in the FPGA, their algorithm was much faster than the conventional approach to Gaussian filtering. More recently, an integer based approach was used in FPGAs for neural networks, giving an efficiency improvement of between 1.7x and 7.3x [20]. Integer-only Gaussian filtering was also implemented in [21], as an approximation to floating point based Gaussian filtering. Fixed point arithmetic has also been used to simplify deep learning architectures. In [22], the weights of a deep learning network were constrained to the integer values +1, 0, and -1, resulting in a model which required less memory to store the weights. A similar approach was implemented in [23], where the quantization of the weights allowed for integer only arithmetic for inference.

In [13], benchmarking tests were performed to compare the computational speeds of floating point and integer arithmetic on different CPU architectures. The tests indicated that floating point arithmetic is slower than integer arithmetic for similar data sizes. Similarly, for image segmentation using deep learning, an integer based approach provides near full precision with a significant reduction in processing requirements [24]. Most modern architectures come with one or more dedicated hardware floating point units (FPUs) to speed up the floating point arithmetic to a point where there is a negligible difference in computational speed between floating point and integer arithmetic [25, 26]. However, despite the inclusion of these FPUs in some embedded devices, there is still a significant gain in computational speed associated with using integer arithmetic over floating point [13].

Two common metrics used to benchmark microprocessors are MIPS (Million Instructions per Second) and MFLOPS (Million Floating Point Operations per Second). The MIPS metric represents the number of integer instructions which can be executed by the microprocessor per second, while the MFLOPS metric represents the number of floating point operations which can be executed by the microprocessor per second [25]. The MIPS metric will be orders of magnitude higher than the MFLOPS metric for resource constrained devices which lack an FPU [13]. For benchmarking integer operations, the Dhrystone test [27] can be used, while floating point capability can be benchmarked using the Linpack test [28].

3 Unsigned Integer arithmetic

Using the unsigned 8-bit integer (uint8) data type requires only 1 byte of memory per image pixel for storage with a range of 0-255. In contrast the floating point data type can represent fractions and negative numbers, at the expense of 4 bytes per data sample. For greater precision, the 8-byte double data type can be used. This section defines the different arithmetic operations using the uint8 data type. Let a, b, and c represent three uint8 numbers:

The addition operation on two uint8 operands is defined as:

$$\begin{aligned} c = \min (255, a + b) \end{aligned}$$
(1)

As shown in (1), the result of uint8 addition between two numbers is the normal addition of those two numbers, with clipping at 255.

Algorithm 1 Unsigned integer addition (8-bit).
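The clipped addition described above could be written in C roughly as follows. This is a minimal sketch rather than a reproduction of Algorithm 1; the function name `uint8_add` is hypothetical. It detects overflow without using any variable wider than 8 bits:

```c
#include <stdint.h>

/* Saturating 8-bit addition: returns a + b, clipped at 255. */
uint8_t uint8_add(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);   /* wraps modulo 256 on overflow */
    /* A wrapped sum is always smaller than either operand. */
    return (sum < a) ? UINT8_MAX : sum;
}
```

For example, `uint8_add(200, 100)` wraps to 44 internally, which is smaller than 200, so the function returns 255.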

The subtraction operation on two uint8 operands is defined as:

$$\begin{aligned} c = \max (0, a - b) \end{aligned}$$
(2)

The subtraction operation effectively performs a ReLu operation, where negative values are clamped to 0. Algorithms 1 and 2 were implemented in C to achieve the addition and subtraction operations with 8-bit memory.

Algorithm 2 Unsigned integer subtraction (8-bit).
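A possible C sketch of the clamped subtraction is shown below. It is not a reproduction of Algorithm 2, and the function name `uint8_sub` is hypothetical:

```c
#include <stdint.h>

/* Saturating 8-bit subtraction: returns a - b, clamped at 0 (ReLu). */
uint8_t uint8_sub(uint8_t a, uint8_t b)
{
    /* Comparing first avoids ever forming a negative intermediate. */
    return (a > b) ? (uint8_t)(a - b) : 0;
}
```

Because the comparison is made before subtracting, no signed storage is required.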

The multiplication operation is denoted as and is defined as:

(3)

Algorithm 3 was implemented in C to achieve the multiplication operation with 8-bit memory.

Algorithm 3 Unsigned integer multiplication (8-bit).
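One way such a multiplication could be sketched in C is shown below. It assumes that products larger than 255 are clipped to 255, by analogy with the addition operation; this clipping rule and the function name `uint8_mul` are assumptions of the sketch, not details confirmed by the text:

```c
#include <stdint.h>

/* ASSUMPTION: the product is clipped at 255, mirroring uint8 addition. */
uint8_t uint8_mul(uint8_t a, uint8_t b)
{
    /* For b != 0, a * b > 255 exactly when a > 255 / b (integer division),
     * so overflow can be detected using 8-bit operands only. */
    if (b != 0 && a > UINT8_MAX / b) {
        return UINT8_MAX;
    }
    return (uint8_t)(a * b);
}
```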

4 Applications in edge detection

As an example of efficient image processing using unsigned integer arithmetic, OptInt is applied to find edges in an image. The OptInt implementation is similar to the Laplacian gradient based edge detection method, with the major difference being that uint8 variables are used, whereas the Laplacian method uses floating point variables. For a Grayscale image I stored in uint8 memory, let \(x_0\) be the center pixel in a square patch of size \(N \times N\). The pixels surrounding \(x_0\) within the patch can be denoted by \(x_k\), where \(k=1:N^2-1\). An example of a \(3 \times 3\) patch is shown below.

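Within each patch, the difference \(d_k\) between the centre pixel and each surrounding pixel can be computed as:

$$\begin{aligned} d_k = \left\lfloor \frac{\max (0, x_0 - x_k)}{M} \right\rfloor \qquad \qquad \qquad k=1,2,\ldots ,N^2-1 \end{aligned}$$
(4)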

where M is a scaling factor defined as:

$$\begin{aligned} M=2^n\qquad \qquad \qquad n=0,1,2..7 \end{aligned}$$
(5)

By performing uint8 subtraction, negative gradients (where \(x_k\) is greater than \(x_0\)) are neglected. However, as will be shown in the Results section, OptInt produces similar edge images to the conventional methods without the need for floating point arithmetic. The constraint of making M a power of 2 is so that the uint8 division can be performed by bit shifting right n times. All \(d_k\) within the patch can be summed (using a series of uint8 additions) to produce an edge intensity pixel \(E_p\) related to that patch, as shown below:

$$\begin{aligned} E_p = \min (255,\sum _{k=1}^{N^2-1} d_k) \end{aligned}$$
(6)

The purpose of applying a scaling factor is to prevent \(E_p\) from saturating, which also reduces noise in the image. After obtaining \(E_p\), the patch is moved one pixel to the right and (4) and (6) are applied to the pixels under the new patch. The final edge image is then made up of all edge intensity pixels \(E_p\).
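As an illustrative worked example (the pixel values are hypothetical and chosen only to show the arithmetic), consider a \(3 \times 3\) patch with centre pixel \(x_0 = 120\), surrounding pixels \(x_k = \{100, 110, 130, 90, 120, 140, 80, 125\}\) and \(n = 1\), so that \(M = 2\). Surrounding pixels that are not darker than the centre contribute zero, and the remaining differences are halved by a single right shift:

$$\begin{aligned} d_k&= \{10, 5, 0, 15, 0, 0, 20, 0\} \\ E_p&= 10 + 5 + 15 + 20 = 50 \end{aligned}$$

The sum stays below 255, so no clipping occurs. With \(n = 0\) (\(M = 1\)) the same patch gives \(20 + 10 + 30 + 40 = 100\).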

4.1 The 8-bit unsigned integer edge detection algorithm

The algorithm for creating the edge image using 8-bit unsigned integer arithmetic is shown in Algorithm 4. The inputs to the algorithm are the Grayscale image D (for which the edges are to be found), and the patch size p. For a \(3 \times 3\) patch, \(p = 1\). Similarly, \(p = 2\) for a \(5 \times 5\) patch.

Algorithm 4 8-bit unsigned integer edge detection.
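The sliding-window procedure could be sketched in C as follows. This is an illustration under stated assumptions rather than a reproduction of Algorithm 4: the image is assumed to be stored row by row, border pixels that cannot hold a full patch are set to 0, and the names `optint_edges`, `sat_add` and `sat_sub` are hypothetical. The shift count `n` corresponds to the scaling factor \(M = 2^n\).

```c
#include <stdint.h>

/* Saturating helpers: clip at 255 and clamp at 0 respectively. */
static uint8_t sat_add(uint8_t a, uint8_t b)
{
    uint8_t s = (uint8_t)(a + b);
    return (s < a) ? UINT8_MAX : s;
}

static uint8_t sat_sub(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : 0;
}

/* D: input image, E: output edge image, both width * height bytes.
 * p: patch half-width (p = 1 gives 3x3), n: right shift (M = 2^n). */
void optint_edges(const uint8_t *D, uint8_t *E,
                  int width, int height, int p, unsigned n)
{
    for (int i = 0; i < width * height; i++) {
        E[i] = 0;   /* ASSUMPTION: border pixels are left at 0 */
    }
    for (int r = p; r < height - p; r++) {
        for (int c = p; c < width - p; c++) {
            const uint8_t x0 = D[r * width + c];
            uint8_t acc = 0;
            for (int dr = -p; dr <= p; dr++) {
                for (int dc = -p; dc <= p; dc++) {
                    const uint8_t xk = D[(r + dr) * width + (c + dc)];
                    /* Clamped difference, scaled by a right shift.
                     * The centre pixel itself contributes 0. */
                    const uint8_t dk = (uint8_t)(sat_sub(x0, xk) >> n);
                    acc = sat_add(acc, dk);
                }
            }
            E[r * width + c] = acc;
        }
    }
}
```

Every intermediate value is held in a uint8 variable, so the only working memory beyond the two image buffers is a handful of bytes.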

4.2 Implementing Laplacian and Sobel edge detection

As a way of benchmarking the unsigned integer arithmetic algorithm, Laplacian and Sobel edge detectors were implemented using floating point arithmetic to find the edges of a test image [17, 29]. The Laplace kernel \(L\) used is shown in (7), and the two Sobel kernels \(S_x\) and \(S_y\) used are shown in (8) [17].

$$\begin{aligned} L = \left[ \begin{array}{ccc} -1&{}-1&{}-1 \\ -1&{}8&{}-1 \\ -1&{}-1&{}-1 \\ \end{array}\right] \end{aligned}$$
(7)
$$\begin{aligned} S_x = \left[ \begin{array}{ccc} -1&{}0&{}1 \\ -2&{}0&{}2 \\ -1&{}0&{}1 \\ \end{array}\right] \quad S_y = \left[ \begin{array}{ccc} 1&{}2&{}1 \\ 0&{}0&{}0 \\ -1&{}-2&{}-1 \\ \end{array}\right] \end{aligned}$$
(8)

The algorithm for the Laplacian edge detection is shown in Algorithm 5. The input to the algorithm is the Grayscale image D (for which the edges are to be found), and the output of the algorithm is the Edge Intensity image E. To save the Edge Intensity image as a uint8 array, the maximum absolute intensity value must first be calculated and subsequently used to scale the intensity values to within the uint8 range (0 to 255). This means that the edge intensity calculation must be computed twice. To reduce the computational time, the intensity pixels can be stored in a floating point array; however, this limits the image size in devices with low RAM.

Algorithm 5 Laplace edge detection algorithm.
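The two-pass approach described above could look roughly like the following C sketch. It is not a reproduction of Algorithm 5: the row-by-row image layout, the zeroed border and the names `abs_laplace_at` and `laplace_edges` are assumptions made for illustration.

```c
#include <stdint.h>

/* Absolute response of kernel L from (7) at interior pixel (r, c):
 * eight times the centre minus the sum of its eight neighbours. */
static float abs_laplace_at(const uint8_t *D, int width, int r, int c)
{
    float sum = 8.0f * (float)D[r * width + c];
    for (int dr = -1; dr <= 1; dr++) {
        for (int dc = -1; dc <= 1; dc++) {
            if (dr != 0 || dc != 0) {
                sum -= (float)D[(r + dr) * width + (c + dc)];
            }
        }
    }
    return (sum < 0.0f) ? -sum : sum;
}

void laplace_edges(const uint8_t *D, uint8_t *E, int width, int height)
{
    float max_abs = 0.0f;

    /* Pass 1: find the largest absolute response in the image. */
    for (int r = 1; r < height - 1; r++) {
        for (int c = 1; c < width - 1; c++) {
            const float v = abs_laplace_at(D, width, r, c);
            if (v > max_abs) {
                max_abs = v;
            }
        }
    }

    for (int i = 0; i < width * height; i++) {
        E[i] = 0;   /* ASSUMPTION: border pixels are left at 0 */
    }
    if (max_abs == 0.0f) {
        return;     /* flat image: no edges to scale */
    }

    /* Pass 2: recompute each response and scale it into 0..255. */
    for (int r = 1; r < height - 1; r++) {
        for (int c = 1; c < width - 1; c++) {
            const float v = abs_laplace_at(D, width, r, c);
            E[r * width + c] = (uint8_t)(v * 255.0f / max_abs);
        }
    }
}
```

Recomputing the response in the second pass avoids a floating point buffer of four bytes per pixel, at the cost of running the convolution twice.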

Algorithm 6 Sobel edge detection algorithm.

The algorithm for the Sobel edge detection is shown in Algorithm 6. The algorithm functions similarly to the Laplacian edge detection, with the key difference being that two kernels are used with the Sobel edge detector. This means the Sobel operation will be computationally more expensive than the Laplace or uint8 operations.
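To illustrate why two kernels roughly double the work per pixel, the sketch below evaluates both Sobel kernels from (8) at a single interior pixel. It is not a reproduction of Algorithm 6. The function name is hypothetical, and the gradient magnitude is approximated as \(|G_x| + |G_y|\), a common shortcut that is assumed here for simplicity and is not necessarily the square root approximation used in the experiments.

```c
#include <stdint.h>

/* Sobel kernels S_x and S_y from (8). */
static const float SX[3][3] = {
    {-1.0f, 0.0f, 1.0f}, {-2.0f, 0.0f, 2.0f}, {-1.0f, 0.0f, 1.0f}};
static const float SY[3][3] = {
    {1.0f, 2.0f, 1.0f}, {0.0f, 0.0f, 0.0f}, {-1.0f, -2.0f, -1.0f}};

/* Approximate gradient magnitude at interior pixel (r, c). */
float sobel_magnitude_approx(const uint8_t *D, int width, int r, int c)
{
    float gx = 0.0f;
    float gy = 0.0f;
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            const float px = (float)D[(r + i - 1) * width + (c + j - 1)];
            gx += SX[i][j] * px;   /* horizontal gradient */
            gy += SY[i][j] * px;   /* vertical gradient */
        }
    }
    if (gx < 0.0f) gx = -gx;
    if (gy < 0.0f) gy = -gy;
    /* ASSUMPTION: |Gx| + |Gy| stands in for sqrt(Gx^2 + Gy^2). */
    return gx + gy;
}
```

Each pixel requires two multiply-accumulate passes over the patch, compared with one for the Laplace kernel.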

5 Optimisation considerations

To improve the speed of the unsigned integer edge detection algorithm, three areas were explored:

  1. Unrolling the for loops.

  2. Optimizing the uint8 division process.

  3. Compiler Optimisations.

From Algorithm 4, a number of nested for loops can be observed. When \(p=1\) the algorithm works on a \(3 \times 3\) window patch. Similarly when \(p=2\) the algorithm works on a \(5 \times 5\) patch. In the case of edge detection, any larger window size will produce ringing artifacts in the Edge Intensity Image E [17]. The speed of the algorithm can be optimized by limiting p to specific values (either \(p=1\) or \(p=2\)) and unrolling the two innermost for loops. For example, when \(p=1\), Algorithm 4 can be rewritten as shown in Algorithm 7. This optimisation results in a speed increase of approximately 2.4 times.

Algorithm 7 Unsigned integer edge detection algorithm with for loops optimized.
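For \(p=1\), unrolling the two innermost loops could look like the sketch below. It is an illustration rather than a reproduction of Algorithm 7, with the same assumptions as before: row-by-row storage, border pixels set to 0, and hypothetical names (`optint_edges_3x3`, `sat_add`, `relu_diff`).

```c
#include <stdint.h>

static uint8_t sat_add(uint8_t a, uint8_t b)
{
    uint8_t s = (uint8_t)(a + b);
    return (s < a) ? UINT8_MAX : s;
}

/* Clamped difference x0 - xk, scaled by a right shift of n bits. */
static uint8_t relu_diff(uint8_t x0, uint8_t xk, unsigned n)
{
    return (x0 > xk) ? (uint8_t)((x0 - xk) >> n) : 0;
}

void optint_edges_3x3(const uint8_t *D, uint8_t *E,
                      int width, int height, unsigned n)
{
    for (int i = 0; i < width * height; i++) {
        E[i] = 0;   /* ASSUMPTION: border pixels are left at 0 */
    }
    for (int r = 1; r < height - 1; r++) {
        const uint8_t *up   = D + (r - 1) * width;
        const uint8_t *mid  = D + r * width;
        const uint8_t *down = D + (r + 1) * width;
        for (int c = 1; c < width - 1; c++) {
            const uint8_t x0 = mid[c];
            /* The eight neighbours are visited explicitly, no inner loops. */
            uint8_t acc = relu_diff(x0, up[c - 1], n);
            acc = sat_add(acc, relu_diff(x0, up[c], n));
            acc = sat_add(acc, relu_diff(x0, up[c + 1], n));
            acc = sat_add(acc, relu_diff(x0, mid[c - 1], n));
            acc = sat_add(acc, relu_diff(x0, mid[c + 1], n));
            acc = sat_add(acc, relu_diff(x0, down[c - 1], n));
            acc = sat_add(acc, relu_diff(x0, down[c], n));
            acc = sat_add(acc, relu_diff(x0, down[c + 1], n));
            E[r * width + c] = acc;
        }
    }
}
```

Removing the inner loops eliminates their counter updates and branch tests, and also skips the redundant subtraction of the centre pixel from itself.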

The uint8 division process performs an integer division on the operands, yielding only the quotient of the division. Division is computationally expensive, but when the divisor is a power of 2 it can be replaced by right bit-shifting [30]. This results in a significant increase in speed.
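A minimal sketch of this replacement, with a hypothetical function name, is:

```c
#include <stdint.h>

/* Integer division of x by M = 2^n, done as a right shift.
 * n is assumed to lie in the range 0..7. */
uint8_t uint8_div_pow2(uint8_t x, unsigned n)
{
    return (uint8_t)(x >> n);
}
```

For unsigned operands the shift gives exactly the same quotient as integer division by \(2^n\); for instance, 200 shifted right by 3 gives 25, matching 200 / 8.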

The edge detection algorithm was written in C and subsequently compiled into machine code using the GCC compiler. There are optimisations which can be specified for this compiler, with each optimisation giving various improvements in memory and speed. As will be shown in the Results and Discussion section, changing the compiler optimisation flags has a significant effect on the overall algorithm execution time. For resource constrained devices, the emphasis is on speeding up the execution time while staying within the memory constraints of the device.

6 Results and discussion

This section presents the results of tests performed with the unsigned integer arithmetic edge detection algorithm. The algorithm was first implemented in MATLAB, so as to verify its ability to accurately detect edges. This is followed by performance statistics obtained from implementing the algorithms on IoT-ready devices, namely the ESP32 [31], ESP8266 [32], and the Raspberry Pi [33]. Following this, the results of implementing modified Laplace and Sobel edge detection algorithms are presented for each of the above mentioned IoT-ready devices for comparison purposes.

6.1 Implementing the edge detection algorithms in MATLAB

To verify that unsigned integer arithmetic can be used to detect edges in a grayscale image, the OptInt Algorithms 4 and 7 were implemented in MATLAB, and their output was compared with that of the Laplace and Sobel edge detection algorithms.

Figure 1 shows the output of the Uint8 edge detector when applied to a 256x256 Grayscale image. The different figures demonstrate that a larger patch size, (p), results in more pixels being classified as edge pixels. Also, comparing Fig. 1e to d we can see that the normalization factor can be used to reduce noise in the edge image.

Fig. 1 Uint8 Edge Detection using various patch sizes p and normalization factors M

Figure 2 shows a comparison of the uint8, Sobel, and Laplace edge detectors. The uint8 edge detector produces an output comparable to the Sobel edge detector. The advantage of using the unsigned integer edge detection method is more apparent in resource constrained devices, as will be shown in the sub-sections to follow.

6.2 Implementing the edge detection algorithms in IoT-ready devices

To quantify the performance of using unsigned integer arithmetic for image processing in IoT devices, the OptInt implementation was tested on three commonly used IoT devices. These devices are shown with relevant specifications in Table 1. Due to the processor speed and memory constraints of the NodeMCU devices, limits were imposed on the size of the image from which edges were extracted, as well as on the time taken to perform the edge detection.

6.2.1 ESP8266 NodeMCU implementation

The ESP8266 uses a single 32-bit microprocessor for both networking and application specific computations. It is assumed that the IoT device will read raw pixel data from a camera which has an on-board buffer. To emulate this, MATLAB was used to extract the raw grayscale pixel values of an image at various sizes, which were then saved to an SD card. The ESP8266 NodeMCU reads these files from the SD card and passes the data to the edge detection algorithms. The output arrays from the algorithms are then saved to the SD card. MATLAB was again used to display the edge intensity image for the purposes of this paper. The following edge detection algorithms were applied to the input images:

  • unsigned integer (8-bit), using algorithm 4; \(p = 1, M = 2\)

  • unsigned integer (8-bit), using algorithm 7; \(p = 1, M = 2\)

  • unsigned integer (8-bit), using algorithm 7; \(p = 2, M = 2\)

  • Laplace, using algorithm 5

  • Sobel, using algorithm 6

  • Sobel, using algorithm 6 and an approximation to the square root operation ref.

  • unsigned integer (32-bit), using algorithm 4 adapted for 32-bit architecture; \(p = 1, M = 2\)

  • unsigned integer (32-bit), using algorithm 7 adapted for 32-bit architecture; \(p = 1, M = 2\)

  • unsigned integer (32-bit), using algorithm 7 adapted for 32-bit architecture; \(p = 2, M = 2\)

Note that adapting the uint8 edge detection algorithms to uint32 only requires the use of two 32-bit variables in which the additions, subtractions, and bit-shifting will be performed. After the arithmetic has taken place, the result is cast back to uint8. This means that a negligible amount of additional RAM is needed for uint32 arithmetic, with the benefit of reduced execution time outweighing the extra RAM requirement.
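The adaptation can be sketched for a single patch as follows. This is a hedged illustration: the function name `patch_edge_u32` and the row-by-row image layout are assumptions, and the caller is assumed to pass a pixel at least `p` pixels away from every border.

```c
#include <stdint.h>

/* Edge intensity for the patch centred on (r, c), computed with
 * 32-bit working variables and cast back to uint8 at the end. */
uint8_t patch_edge_u32(const uint8_t *D, int width,
                       int r, int c, int p, unsigned n)
{
    const uint8_t x0 = D[r * width + c];
    uint32_t acc = 0;   /* running sum of scaled differences */
    uint32_t d;         /* difference for the current neighbour */

    for (int dr = -p; dr <= p; dr++) {
        for (int dc = -p; dc <= p; dc++) {
            const uint8_t xk = D[(r + dr) * width + (c + dc)];
            d = (x0 > xk) ? (uint32_t)(x0 - xk) : 0u;   /* ReLu */
            acc += d >> n;                              /* divide by 2^n */
        }
    }
    /* One clip at the end replaces a saturation check per addition. */
    return (uint8_t)((acc > 255u) ? 255u : acc);
}
```

Because every term is non-negative, clipping once at the end gives the same result as saturating after each addition, while the 32-bit accumulator cannot overflow for the patch sizes considered (at most 24 terms of 255 each).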

Fig. 2 Comparison of Sobel, Laplace, and uint8 edge detectors

Table 1 Specifications of the IoT devices used in testing

To ensure consistent results, each edge detection algorithm was run 100 times on the input image, and the average execution time was recorded. The simulations indicate that the fastest algorithm was the unsigned integer 32-bit, with the unrolled inner for loops. Figure 3 shows the execution times for each of the above-mentioned algorithms applied to images with dimensions ranging from 32 x 32 pixels (\(\approx \)1 Kilopixel) up to 150 x 150 pixels (22.5 Kilopixels).
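The averaging procedure can be illustrated with a small timing harness. This is a generic sketch using the standard C `clock()` function; the timer actually used on the devices is not stated in the text, and the names `bench_fn`, `average_ms` and `count_call` are hypothetical.

```c
#include <time.h>

/* Signature of the routine being timed; ctx carries its arguments. */
typedef void (*bench_fn)(void *ctx);

/* Runs fn(ctx) the given number of times and returns the mean
 * processor time per run, in milliseconds. */
double average_ms(bench_fn fn, void *ctx, int runs)
{
    const clock_t start = clock();
    for (int i = 0; i < runs; i++) {
        fn(ctx);
    }
    const clock_t elapsed = clock() - start;
    return 1000.0 * (double)elapsed / (double)CLOCKS_PER_SEC / (double)runs;
}

/* Trivial stand-in workload: counts how many times it was called. */
void count_call(void *ctx)
{
    (*(int *)ctx)++;
}
```

In practice the stand-in workload would be replaced by a wrapper that calls one of the edge detection routines on a fixed input image, with `runs` set to 100.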

The maximum image size capable of being processed in the ESP8266 NodeMCU was 150 x 150 pixels. This means that QQVGA images (i.e. 160 x 120 pixel images) can be processed in this device; this resolution is highlighted as a vertical line in Fig. 3. Since this IoT-ready device does not have a built-in FPU, floating point arithmetic will take much longer than integer arithmetic. This is evident when comparing the execution times for the modified Laplace algorithm with that of the unrolled uint32 algorithm (Fig. 3b). The QQVGA image edges are computed almost 6 times faster using the unrolled uint32 algorithm.

Fig. 3 Speed comparisons on the ESP8266 NodeMCU at 160MHz

Compiler optimisations also have a significant influence on the execution times of each of the algorithms. Table 2 shows the execution times for each algorithm at various compiler optimisation levels, using a QQVGA resolution input image. An interesting observation made from this experiment was that the advantage of using uint32 over uint8 diminishes as the optimisation becomes more execution-speed oriented (O2 and O3). At all optimisation levels, unsigned integer edge detection algorithms executed faster than the modified Laplace or Sobel algorithms. The results presented in Fig. 3 were obtained using the Os optimisation level.

Table 2 ESP8266 NodeMCU edge detection execution speeds at different compiler optimisation levels for a QQVGA image

6.2.2 ESP32 NodeMCU implementation

The ESP32 has two 32-bit microprocessors at its core, with one dedicated to networking and the other used for application specific computations. The same edge detection algorithms tested in the ESP8266 NodeMCU implementation were tested in this device. Figure 4 shows the execution times for each of the above-mentioned algorithms applied to images with dimensions ranging from 32 x 32 pixels (\(\approx \)1 Kilopixel) up to 200 x 200 pixels (40 Kilopixels). Results for the unsigned integer 32-bit algorithm with the unrolled for loops are also shown in each figure so as to highlight the execution speed gain when using this algorithm compared to the others. Each algorithm was run 100 times and the average execution times were taken to ensure consistency in the results. The maximum image size capable of being processed in the ESP32 NodeMCU was 200 x 200 pixels, meaning that it is able to process images in QQVGA and QCIF (176 x 144 pixels) resolution formats.

Fig. 4 Speed comparisons on the ESP32 NodeMCU at 240MHz

Table 3 ESP32 NodeMCU edge detection execution speeds at different compiler optimisation levels for a QQVGA image

Figure 4 shows the uint32 algorithm with the unrolled for loops took the least amount of time to generate the edge intensity images. The addition of the FPU was observed to significantly reduce the time taken for the modified Sobel and Laplace edge detection algorithms (approximately 4 times faster in the case of the Laplace edge detection). Some of this execution time gain over the ESP8266 is due to the faster 240MHz CPU in the ESP32 device. However, even with the FPU present, the unrolled uint32 algorithm was almost 4 times faster for the QQVGA image. This shows that there is still an advantage gained when using unsigned integer arithmetic for image processing in microprocessors with an on-board FPU.

As with the ESP8266, varying the compiler optimisation level resulted in a reduction in the execution times for each of the algorithms (Table 3). The effect of unrolling the for loops is reduced since some compiler optimisation levels will automatically perform this step. From the execution times shown in Table 3, the uint32 algorithm is the quickest, with an execution time of 4.6ms for QQVGA.

Fig. 5 Speed comparisons on the Raspberry Pi at 1.2GHz

6.2.3 Raspberry Pi implementation

This is the most powerful IoT ready device of the three presented in this paper with processing capabilities significantly more powerful than the ESP32 and ESP8266. An operating system can be installed on this IoT device, and image processing packages such as OpenCV can run on it. Experimentation was also performed on a Raspberry Pi to show the results of OptInt algorithms in a less constrained IoT device. The nine edge detection algorithms implemented in the ESP8266 NodeMCU and ESP32 NodeMCU were also implemented in the Raspberry Pi. However, since the operating system on the Raspberry Pi was 64-bit, the uint32 algorithms were adapted to uint64 instead (by changing the variable data types from 32-bit unsigned integer to 64-bit unsigned integer for the arithmetic).

Figure 5 shows the execution times for each of the above-mentioned algorithms applied to images with dimensions ranging from 32 x 32 pixels (\(\approx \)1 Kilopixel) up to 800 x 600 pixels (480 Kilopixels). Results for the uint64 algorithm are included for comparison. Each algorithm was run 100 times and the average execution times were taken to ensure consistency in the results. The results of these tests show that the speed gain in using unsigned integer arithmetic for image processing is much less in the Raspberry Pi than in the ESP32 and ESP8266. This is because of the four cores in the Raspberry Pi processor, as well as the on-board FPU. Also, there is no advantage in manually unrolling the for loops with this device, as evident in the fact that the fastest unsigned integer algorithm was the uint64 with the nested for loops. Figure 5 shows the uint64 edge detection algorithm executes faster than the modified Sobel and Sobel approximation algorithms, but is slightly slower than the modified Laplace edge detection algorithm.

7 Conclusion

In this paper, the use of unsigned integer arithmetic for image processing computations in resource constrained devices was demonstrated. A framework of governing equations was introduced for using 8-bit unsigned integer arithmetic for addition, subtraction and multiplication, with the intention of keeping the result within the range of an 8-bit unsigned integer. A gradient-based edge detection algorithm using this arithmetic was implemented, as well as modified versions of two common edge detection algorithms which use floating point arithmetic, namely the Sobel and Laplace edge detectors. The purpose of the modification was so that the edge images could be saved using 8-bit unsigned integer memory. These algorithms were run on three IoT-ready (resource constrained) devices, namely the ESP8266 NodeMCU, the ESP32 NodeMCU and the Raspberry Pi 3. Various optimisation methods for OptInt algorithms were investigated, with the focus on reducing the overall execution time.

The results show that implementing the unsigned integer algorithms for edge detection in resource constrained devices significantly reduces the computation time, while producing an edge image of similar quality to the popular Sobel and Laplace edge detectors. This is even more apparent in devices which do not contain an FPU. The algorithm execution time is reduced further when the unsigned integer arithmetic is adjusted to the base architecture of the device (for example, using 32-bit unsigned integers for computation in a 32-bit device). Another benefit of using the OptInt implementation was its demonstrated ability to filter out noisy pixels in the edge image. This is achieved by the ReLu operation inherent in the unsigned integer addition and subtraction equations, coupled with the scaling factor used in the algorithm. Images therefore do not need to be smoothed prior to computing the edges.

Future work may include implementation and benchmarking of a range of other image processing and edge detection algorithms. Furthermore, the system could be integrated into resource limited applications over a low-bandwidth communications link to demonstrate its practical use.