1 Introduction

Artificial intelligence (AI) is one of the main areas of research and development in the world today. Many types of AI systems have been developed recently and have provided impressive solutions in applications such as voice-powered virtual assistants (e.g., Siri and Alexa) [1], autonomous vehicles (e.g., Tesla) [2], robotics (e.g., in car manufacturing) [3], and automatic translation (e.g., Google Translate) [4]. Many AI-based solutions have also been developed in the area of assistive technology, in particular for assisting visually impaired (VI) persons. Most of these systems deal with the autonomous navigation problem using wearable assistive devices such as infrared sensors, ultrasound sensors, RFID tags, BLE beacons, and cameras [5, 6]. Besides autonomous navigation, VI persons also need other types of assistive technology, such as people and object detection/recognition. Computer vision techniques combined with machine learning provide the most suitable solutions for this problem. For example, Hasanuzzaman et al. [7] present a computer vision system to recognize currency using speeded-up robust features (SURF). The system can recognize US notes with a 100% true recognition rate and a 0% false recognition rate. Other systems, proposed in Refs. [8, 9], for example, assist VI persons in doing their shopping in the supermarket by detecting and reading barcodes and giving the VI person information (extracted from the shop database) about the product via voice communication. The authors in Ref. [10] proposed a system to help VI persons detect and read any text in their view. The system does this by first detecting candidate regions that may contain text using special statistical features. Then, commercial optical character recognition (OCR) software is used to recognize the text inside the candidate regions (or to decide that the content of a region is non-text). Another interesting application is a travel assistant, presented by the authors in Ref. [11], that detects and recognizes text in public transportation domains. The system detects text written on buses and at stations and informs the VI person about station names and numbers, bus numbers and destinations, and so on. Another work by Jia et al. [12] addresses the problem of finding staircases inside buildings and informing the user when they are within five meters of any staircase. The method is based on an iterative preemptive RANSAC algorithm to detect the steps of a staircase. In contrast, the work by Yang and Tian [13] focuses on detecting doors inside buildings by detecting the most general and stable features of doors, namely edges and corners. Finally, a system for detecting restroom signage is presented in Ref. [14] based on scale-invariant feature transform (SIFT) features.

Object detection and recognition is a heavily studied problem in the computer vision field. The early object detection algorithms, such as those by Viola and Jones [15] and Dalal and Triggs [16], were built on the extraction of handcrafted features followed by a classification algorithm. After the rebirth of neural networks in 2012 and the emergence of deep learning and convolutional neural networks (CNNs) [17], more advanced object detection algorithms based on these methods have appeared, including region CNN (RCNN) [18, 19], you only look once (YOLO) [20, 21], the single shot MultiBox detector (SSD) [22], pyramid networks [23], and RetinaNet [24]. These algorithms, while quite successful in detecting objects, have high computational costs and hence are difficult to execute on portable devices, unless one focuses on a single object class such as faces. This makes them less useful for VI persons, who find portable devices more convenient and who want to detect a wider range of objects encountered in daily life. Therefore, some researchers proposed a compromise solution that can detect multiple objects in a short amount of time by approaching the problem from a multi-label classification perspective [25,26,27,28,29,30]. In this approach, the presence of multiple objects can be detected, but not their exact locations in the image. We believe it is reasonable to assume that the VI person is more interested in detecting multiple objects quickly than in knowing their exact locations.

Thus, in this work, we propose a scene description module that lists which objects are present in the scene and informs the VI user about them via voice communication. The module is part of a larger assistive technology system for the visually impaired that is designed to assist them in (1) navigating indoor environments, (2) detecting and reading text information, (3) detecting and recognizing faces, and (4) detecting and listing objects present in the scene. The system uses a portable device connected to a camera placed on the VI person's chest.

Our solution for the scene description module is based on a multi-label classification approach using deep CNN models. Usually, CNNs do not perform well on small datasets, as they are prone to overfitting. In this case, it has been shown in many studies [31,32,33,34] that it is more suitable to employ a knowledge transfer approach using pre-trained CNNs such as the VGG family [35] and GoogLeNet (the Inception family) [36]. These pre-trained CNNs have already been trained on very large image datasets and need only small modifications to adapt them to our specific dataset. One way to transfer knowledge from these pre-trained models is to use their learned feature representations as input to train an external classifier. The survey in [31] describes this option and discusses factors to consider when applying such an approach. In some cases, it is possible and more worthwhile to retrain the whole model, but not from scratch, i.e., not with random model weights. In other words, we start from the pre-trained weights and retrain the whole model on the new dataset, which is known as the fine-tuning approach.
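
As an illustration of the two knowledge-transfer routes just described, the following Keras/scikit-learn sketch is a minimal, self-contained example; the image size, the toy data, and the choice of VGG16 with a single binary label are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import applications, layers, models
from sklearn.linear_model import LogisticRegression

# Toy stand-in data (8 images, one binary label each), purely for illustration.
images = np.random.rand(8, 224, 224, 3).astype("float32")
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Route 1: use the pre-trained CNN as a fixed feature extractor and feed its
# learned representations to an external classifier.
extractor = applications.VGG16(weights="imagenet", include_top=False,
                               pooling="avg", input_shape=(224, 224, 3))
features = extractor.predict(images)                 # shape (8, 512)
external_clf = LogisticRegression(max_iter=1000).fit(features, labels)

# Route 2: fine-tuning -- start from the pre-trained weights and retrain the
# whole model on the new dataset (typically with a small learning rate).
backbone = applications.VGG16(weights="imagenet", include_top=False,
                              pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = True                            # all weights are updated
model = models.Sequential([backbone, layers.Dense(1, activation="sigmoid")])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy")
model.fit(images, labels, epochs=1, batch_size=4)
```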

Many CNN models have been proposed in the literature for image classification including VGG-VD [35], GoogLeNet [36], SqueezeNet [37], MobileNet [38], ResNet [39], and so on. These CNN models have different classification capabilities because of the architectural differences and because of the different datasets they are pre-trained on. Thus, it is wise to try to fuse them in a way that takes advantage of their respective strengths.

Fusion of ensembles of classifiers and/or multiple features is an efficient technique to achieve better results in applications such as classification and recognition [40,41,42,43,44,45,46,47]. Usually, fusion is employed at three levels: the data input level, the feature level, and the decision level [42]. In data input-level fusion, images from multiple sources are combined to create a new image with a better signal-to-noise ratio than the input signals. Feature-level fusion consists of combining the features produced by feature extraction algorithms to create a richer and more descriptive feature vector. In decision-level fusion, the data are first classified separately using different methods, and fusion then consists of merging the outputs of the classifiers. For example, the work in [48] explores the fusion of images from different datasets to enhance the performance of a multi-task deep classifier. The authors in [44] propose a novel deep CNN model based on multiscale side-output fusion in a deeply supervised network. The model fuses feature vectors from the deep CNN at different scales in order to improve salient object detection.

For decision-level fusion, the work in [45] investigated several widely used ensemble methods in the context of deep neural network classifiers, including naive unweighted averaging, majority voting, and the Bayes optimal classifier. However, these methods are vulnerable to weak learners, are sensitive to over-confident learners, and may lead to information loss [45].

Another recent work by Koh and Woo [46] presented a novel approach for fusing different multi-view classifications. The different views are obtained by applying a classification model to different batches of the data. The fusion of these views then involves computing co-occurrence matrices, weighted adjacency matrices, and Laplacian matrices, which is time-consuming.

In this work, we propose to improve the detection accuracy of CNN models by fusing their predictions using induced ordered weighted averaging (OWA) techniques. In particular, we use two base CNN models, namely the VGG16 model [35] and a lighter model called SqueezeNet [49]. We have selected these two models because they are not overly deep and they are quite diverse. Our datasets are small (fewer than 160 images in the training set), and it is known from the literature that deeper models with a large number of weights need a huge dataset for training [35]. Furthermore, it is well known that fusion methods work best when the classifiers are diverse [45, 50]. Our chosen models are quite diverse because VGG16 uses a basic convolutional architecture with a large number of weights, whereas SqueezeNet is a much lighter network that uses an advanced squeeze/expand architecture. We also increase diversity by using different training approaches: for SqueezeNet, we use a fine-tuning approach, updating all its weights during training, whereas for the VGG16 model we update only the weights of the added upper layers.

The induced OWA technique fuses the output predictions of the CNN models by computing their weighted average. However, the weights are computed after ordering the predictions based on their importance or level of confidence. As a measure of confidence for each prediction, we propose to use the residual error between the predicted output and the true output. The residual error can be computed at training time, because the true outputs are available; at test time, obviously, we do not have the true outputs. As a solution, we propose to estimate the residual errors directly from the input image by training another dedicated CNN for this purpose. In other words, for each dataset we train two CNN models: one learns the actual output, while the other learns the residual error. With this approach, for each predicted output we also have an estimate of the residual error, which we can use as a measure of confidence in the prediction. It is important to note that, unlike the regular weighted-average scheme, where the weights are the same for all input images, the OWA technique computes different weights for each input image. Thus, the "optimal" weights are used for each input image, which explains why OWA is able to improve on the accuracy of the two base models. The contributions of this paper can be summarized in the following points:

  • Proposing a deep learning solution for image multi-label classification based on the fusion of two CNN models using OWA theory.

  • Proposing the residual errors between the model predictions and the true labels as measures of confidence in the predictions, and proposing dedicated CNN models to estimate these errors from the input images.

  • Developing the mathematical model that formulates the usage of the estimated residual errors to fuse the predictions of the CNN models using the induced OWA approach.

The rest of this paper is organized as follows. In Sect. 2, we provide a description of the proposed methods based on the fusion of CNN models using OWA. The experimental results and conclusions are presented in Sects. 3 and 4, respectively.

2 Materials and Methods

In this section, we first describe the two pre-trained CNN models, VGG-16 and SqueezeNet, and the modifications made to their architectures to adapt them to our multi-label classification problem. Then, Sect. 2.2 introduces the OWA theory. Finally, Sect. 2.3 describes the proposed method, including the mathematical formulation of the proposed fusion approach.

2.1 Pre-trained CNN Models Description

As mentioned earlier, we propose to fuse the outputs of two base CNN models, namely the VGG16 model [35] and the SqueezeNet model [49], using the induced OWA approach. Figure 1 shows the architecture of the pre-trained CNN models used. For the VGG16 model, we remove the last two layers in the original model and then add an extra dense layer with LeakyReLU activation function [51] followed by a BatchNormalization layer [52].

Fig. 1

The two pre-trained CNN models used in this work. a Base model 1 based on the pre-trained VGG16 CNN, b Base model 2 based on the pre-trained SqueezeNet CNN, c Legend of layers

As for the SqueezeNet CNN, we remove the layers after the fire9 block and replace them with an extra convolutional layer with LeakyReLU activation functions and a BatchNormalization layer. We also follow this with a GlobalAvgPooling2D layer before ending the network with the output layer. The output layer of both models has \(N_{o}\) neurons with linear activation functions, whose outputs are then converted to binary values (representing the presence or absence of the particular object) using a threshold \(T_{p}\). For example, Fig. 1 shows a sample output for some input images, where the output is converted to binary values using a threshold \(T_{p} = 0.5\). A binary output of one indicates that the corresponding object is present in the image.

Another difference between the two base models lies in the training. For the VGG16-based model, we freeze the pre-trained layers because of the huge number of parameters in these layers (> 14 million), whereas for the SqueezeNet CNN we employ a fine-tuning approach, because it is a small network with fewer than one million parameters, which makes it easier to fine-tune with reasonable computational resources. Furthermore, even without fine-tuning, the VGG16-based model can achieve good results, thanks to its rich architecture.
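
A minimal Keras sketch of the two modified base models described in this subsection is shown below; the number of units in the added layers, the input size, and the stand-in for a SqueezeNet backbone (which is not bundled with Keras) are assumptions for illustration only.

```python
from tensorflow.keras import applications, layers, models

N_o = 15     # number of objects/labels (dataset dependent)
T_p = 0.5    # presence threshold of Fig. 1 (0.3 is used in the experiments)

# Base model 1: pre-trained VGG16 with its top layers removed, followed by an
# added dense layer with LeakyReLU activation and a BatchNormalization layer.
# The pre-trained layers are kept frozen during training.
vgg_base = applications.VGG16(weights="imagenet", include_top=False,
                              pooling="avg", input_shape=(480, 640, 3))
vgg_base.trainable = False
base_model_1 = models.Sequential([
    vgg_base,
    layers.Dense(512), layers.LeakyReLU(), layers.BatchNormalization(),
    layers.Dense(N_o, activation="linear"),   # one linear output neuron per object
])

# Base model 2: a SqueezeNet backbone truncated after the fire9 block
# (`fire9_backbone` stands in for a third-party SqueezeNet implementation),
# followed by a convolutional layer, LeakyReLU, BatchNormalization,
# global average pooling, and the output layer. It is fine-tuned end-to-end.
def build_base_model_2(fire9_backbone):
    fire9_backbone.trainable = True
    return models.Sequential([
        fire9_backbone,
        layers.Conv2D(256, (1, 1)), layers.LeakyReLU(), layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(N_o, activation="linear"),
    ])

# Converting the real-valued outputs to binary presence indicators:
# presence = (base_model_1.predict(batch) > T_p).astype(int)
```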

2.2 Induced OWA

Suppose we have a set of arguments \(a_{j}\), representing the outputs of multiple estimators, which we want to fuse into one value. One simple way to do this is simple weighted averaging, where the weights represent our confidence in the corresponding estimators: if we have high confidence in a particular estimator, we assign it a bigger weight, and vice versa. However, sometimes the confidence is related to the output rather than to the estimator itself. For example, we might favor outputs with large values over small values, or vice versa. For these cases, the ordered weighted averaging (OWA) operator is defined.

An OWA operator of dimension P is a mapping \(F_{W} :{\mathcal{R}}^{P} \to {\mathcal{R}}\) that has an associated weighting vector \(\varvec{w} = \left[ {w_{1} ,w_{2} , \ldots ,w_{P} } \right]^{\text{T}}\) such that \(w_{j} \in [0,1]\) and \(\mathop \sum \nolimits_{j = 1}^{P} w_{j} = 1.\) The function \(F_{W} \left( {a_{1} ,a_{2} , \ldots ,a_{P} } \right)\) determines the aggregated value of the arguments \(a_{1} ,a_{2} , \ldots ,a_{P}\) such that:

$$F_{W} \left( {a_{1} ,a_{2} , \ldots ,a_{P} } \right) = \mathop \sum \limits_{j = 1}^{P} w_{j} b_{j} ,$$
(1)

where \(b_{j}\) is the jth largest of the \(a_{j}\) and P is the number of arguments to aggregate.

It is important to observe the main difference between the OWA and a simple weighted average: the OWA involves ordering the arguments \(a_{j}\) from largest (or most confident) to smallest (or least confident). This makes the weights dependent on the position in the ordering rather than on the arguments themselves. It also makes the OWA operator a nonlinear operator that provides a very rich family of aggregation operators parameterized by the weighting vector. For example, if all the weights are equal to 1/P, then the OWA is simply the average operator. If the weight vector is [1, 0, …, 0], then the OWA becomes the maximum operator. Conversely, if the weight vector is [0, …, 0, 1], then the OWA becomes the minimum operator.
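
As a concrete illustration of Eq. (1) and the special cases just mentioned, consider the following short NumPy sketch (illustrative only, not part of the proposed system):

```python
import numpy as np

def owa(arguments, weights):
    """Ordered weighted averaging: sort the arguments in descending order
    and take the dot product with the weight vector (Eq. 1)."""
    b = np.sort(np.asarray(arguments, dtype=float))[::-1]   # b_j: j-th largest a_j
    w = np.asarray(weights, dtype=float)
    assert np.isclose(w.sum(), 1.0) and np.all((0 <= w) & (w <= 1))
    return float(np.dot(w, b))

a = [0.2, 0.9, 0.5]
print(owa(a, [1/3, 1/3, 1/3]))   # 0.533... -> plain average
print(owa(a, [1, 0, 0]))         # 0.9      -> maximum operator
print(owa(a, [0, 0, 1]))         # 0.2      -> minimum operator
```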

By definition, the OWA operator performs an ordering of the arguments to be aggregated, and the ordering is based on the values of the arguments. In other applications, the argument's value is not what makes it important; instead, we have an auxiliary value that gives us a confidence level in the argument. In this scenario, we can use the induced OWA operator, which relies on an auxiliary value, called the order-inducing variable, to order the arguments.

This scenario applies to our work, because the actual value of an estimator's output does not give any indication of its importance or confidence. (Recall that in our case the arguments represent the outputs of multiple estimators.) Thus, we need an order-inducing variable that measures the confidence in the estimator's output. We discuss this issue in Sect. 2.4.

2.3 Prioritized Aggregation Operator (PAO)

An issue of considerable interest in applications of the induced OWA operator is the determination of the weights to be used. To this end, various approaches have been suggested for obtaining these weights [53,54,55,56]. One elegant way is the prioritized aggregation operator (PAO) presented in [55]. Let the order-inducing variable be \(S_{j}\) (in our case, this will be computed from the predicted residual errors) and let \(S_{0} = 1\). The weights in the PAO approach are defined as follows. First, we define:

$$T_{1} = 1\quad T_{2} = S_{1} \quad T_{3} = S_{1} S_{2} .$$
(2)

Thus, in general we can write:

$$T_{j} = \mathop \prod \limits_{k = 1}^{j} S_{k - 1} .$$
(3)

Finally, the weights are defined as follows:

$$w_{j} = \frac{{T_{j} }}{{\mathop \sum \nolimits_{k = 1}^{P} T_{k} }} .$$
(4)

The definition in (4) guarantees that \(\mathop \sum \nolimits_{j = 1}^{P} w_{j} = 1.\) Recall here that \(P\) is the number of arguments and hence the number of estimators.
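
A small NumPy sketch of Eqs. (2)–(4) follows, assuming the order-inducing values have already been sorted from most to least confident; the numeric values are illustrative.

```python
import numpy as np

def pao_weights(S):
    """Prioritized-aggregation weights (Eqs. 2-4).

    S holds the order-inducing values, already ordered from most to least
    confident; S_0 = 1 by convention."""
    S = np.asarray(S, dtype=float)
    S_shifted = np.concatenate(([1.0], S[:-1]))   # S_0, S_1, ..., S_{P-1}
    T = np.cumprod(S_shifted)                     # T_j = prod_{k=1}^{j} S_{k-1}
    return T / T.sum()                            # w_j = T_j / sum_k T_k

# With two estimators (P = 2), as used in this work:
print(pao_weights([0.8, 0.5]))   # [0.5556 0.4444], i.e., 1/(1+S_1) and S_1/(1+S_1)
```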

2.4 Proposed Fusion Approach Using Induced OWA

Recall that in applications of the induced OWA operator, we need to define an order-inducing variable, which is a variable that can help us order the predicted outputs of the estimators from most confident to least confident. In other words, the order-inducing variable measures the confidence in the predicted output of an estimator.

Our idea is to use the residual error between the predicted and true outputs as a measure of confidence in the predicted output. We can do that by analyzing and modeling the residual errors of the estimator in the input space. Obviously, to compute the residual error in the model output, we need to know the true output. However, the true output is known during training, but not during testing. Thus, the idea is to use another dedicated CNN model to learn to predict the residual error from the input image. In other words, we have a pair of CNN models: one predicts the actual output, and the other predicts the residual error in that output. This pair of models is illustrated in Fig. 2, where Fig. 2a shows the main CNN models used for predicting object presence, while Fig. 2b shows the two models used to predict the residual errors in the outputs of the main models. Fig. 2a also illustrates the fusion operation performed at the decision level using the OWA approach.

Fig. 2

Overview of the proposed deep model with fusion of two pre-trained CNN models. a The two pre-trained models used to predict object presence, whose outputs are fused using OWA, b the two models used to predict the residual errors in the outputs of the models in part (a)

The residual error models shown in Fig. 2b can learn the residual error because the actual error can be computed during training from the predicted and true outputs. Accordingly, the proposed method involves the following steps:

  1. Train the two main models shown in Fig. 2a.

  2. Compute the actual residual errors for all training image samples.

  3. Train the two residual error models, shown in Fig. 2b, using the input images and the corresponding residual errors computed in step 2.

  4. At test time, use the main CNN models in Fig. 2a to predict the object presence from the input image.

  5. Use the models in Fig. 2b to estimate the residual errors of both main CNN models.

  6. Use the estimated residual errors to fuse the outputs of the two main CNN models based on the induced OWA approach.

Thus, given a sample input image, let \(y_{j}\) be the true label of object j (equal to one if object j is present). Furthermore, let \(\hat{y}_{j1}\) and \(\hat{y}_{j2}\) be the two predictions produced by the two main models for object j (these are real-valued numbers). Next, let \(e_{j1} = {\text{abs}}\left( {\hat{y}_{j1} - y_{j} } \right)\) and \(e_{j2} = {\text{abs}}\left( { \hat{y}_{j2} - y_{j} } \right)\) be the absolute values of the residual errors of the two main models. Obviously, a lower residual error indicates higher confidence in the prediction. But recall that we need an order-inducing variable to order the predictions from most confident to least confident. Thus, we propose the following definition for the order-inducing variable:

$$S_{jk} = \frac{1}{{1 + {\text{abs}}\left( { \hat{y}_{jk} - y_{j} } \right)}} = \frac{1}{{1 + e_{jk} }}\quad {\text{for}}\;\;k = 1,2.$$
(5)

Next, as explained earlier, the weights are computed by first ordering the predictions \(\hat{y}_{j1}\) and \(\hat{y}_{j2}\) based on their order-inducing variables \(S_{j1}\) and \(S_{j2}\) and then applying the PAO approach. Therefore, we have two cases:

Case 1 \(S_{j1} \ge S_{j2}\)

$$w_{j1} = \frac{1}{{1 + S_{j1} }}\;\;\;{\text{and}}\;\;\;w_{j2} = \frac{{S_{j1} }}{{1 + S_{j1} }}.$$
(6)

Case 2 \(S_{j2} \ge S_{j1}\)

$$w_{j1} = \frac{{ S_{j2} }}{{1 + S_{j2} }}\;\;\;{\text{and}}\;\;\;w_{j2} = \frac{ 1}{{1 + S_{j2} }}.$$
(7)

Finally, based on these weights, the final fused prediction for object j is computed as follows:

$$\hat{y}_{j} = \mathop \sum \limits_{k = 1}^{2} w_{jk} \hat{y}_{jk} .$$
(8)

It is worth observing here that the weights are not fixed for all objects; instead, they vary depending on the residual errors predicted (from the test image) by the dedicated CNN models. This is why this fusion approach is able to select the better prediction for each object and hence improve the accuracy over the whole test set.

Lastly, recall that the final fused prediction \(\hat{y}_{j}\) is compared to the presence threshold \(T_{P}\) to decide whether object j is present or not.
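
The following sketch pulls Eqs. (5)–(8) together for a single test image, assuming the estimated absolute residual errors have already been produced by the dedicated residual-error CNNs; the function and variable names are illustrative placeholders.

```python
import numpy as np

def fuse_predictions(y_hat_1, y_hat_2, e1, e2, T_p=0.3):
    """Induced-OWA fusion of two multi-label predictions for one image.

    y_hat_1, y_hat_2 : real-valued outputs of the two main CNNs, shape (N_o,)
    e1, e2           : estimated absolute residual errors, shape (N_o,)
    """
    S1 = 1.0 / (1.0 + np.abs(e1))           # order-inducing variables, Eq. (5)
    S2 = 1.0 / (1.0 + np.abs(e2))

    # PAO weights, Eqs. (6)-(7): the more confident prediction is ordered first.
    w1 = np.where(S1 >= S2, 1.0 / (1.0 + S1), S2 / (1.0 + S2))
    w2 = np.where(S1 >= S2, S1 / (1.0 + S1), 1.0 / (1.0 + S2))

    y_fused = w1 * y_hat_1 + w2 * y_hat_2   # Eq. (8)
    return (y_fused > T_p).astype(int)      # presence decision via threshold T_p
```

Because the weights are computed element-wise, the more reliable model is favored independently for every object label in the same image.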

3 Experimental Results

In the first part of this section, we describe the datasets used in this study and how they were collected. Then, we optimize the base CNN models for these datasets; in particular, we experimentally determine the best presence threshold \(T_{p}\). Finally, we present the results of the proposed solution and compare them to state-of-the-art methods.

3.1 Dataset Description

The experiments in this paper use four datasets of multi-labeled images for evaluating the efficiency of the proposed deep learning solution. The first two datasets pertain to the College of Computer and Information Sciences building at King Saud University (KSU), Saudi Arabia. They were collected by our team using a CMOS camera with the following specifications: 87.2 fps, 752 × 480 resolution (0.36 MPix), a 1/3″ ON Semiconductor sensor, a global shutter, and a USB 2.0 connection.

The second two datasets were collected by the authors of [29] in two different buildings of the University of Trento, Italy. The cameras used to capture these images are from IDS Imaging [57]. The authors used the camera model UI-1240LE-C-HQ, a CMOS-based camera with 25.8 fps, 1280 × 1024 resolution (1.31 MPix), a 1/1.8″ e2v sensor, global shutter, global start shutter, and rolling shutter modes, and USB 2.0 support. The camera is equipped with a KOWA LM4NCL 1/2″ 3.5-mm F1.4 manual iris C-mount lens from RMA Electronics Inc. [58]. The details of these four datasets are given in Table 1, which also presents the list of objects considered in every dataset.

Table 1 Details of the four datasets of indoor scenes used in the work

It is noteworthy that we have selected the objects deemed to be the most important ones in the considered indoor environments. Also note that the datasets are not randomly split into training and testing images, because a random split cannot guarantee that all objects will be represented in the training set. The split into training and test images is performed manually beforehand and is fixed for all experiments. Table 2 presents the number of occurrences of objects in the training set and the test set for each dataset, while Fig. 3 shows four sample images with the list of objects contained within.

Table 2 Number of occurrences of objects in the training set of each dataset
Fig. 3

Sample images from the datasets and the set of objects present (image multi-labels). a KSU1 dataset, b KSU2 dataset, c UTrento1 dataset, and d UTrento2 dataset

3.2 Assessment Metrics

In order to assess the proposed solution, we need to define quantitative performance metrics. In single-label classification, we use metrics such as precision, sensitivity (recall), specificity, and accuracy. These metrics can also be used in the multi-label case, and they can be computed per label/object or as an overall metric.

Let \(x_{i}\) represent a sample image in the test set where \(1 \le i \le N_{\text{test}}\), and let \(Y_{i}\) represent the set of true labels or objects associated with it. In addition, let \(P\) be a multi-label classifier that returns the set of predicted labels for \(x_{i}\).

For the label \(y_{k}\), four basic quantities characterizing the binary classification performance on this label can be defined:

$$\begin{aligned} {\text{TP}}_{k} & = \left| {\left\{ {x_{i} | y_{k} \in Y_{i} \;{\text{and}}\;y_{k} \in P\left( {x_{i} } \right), 1 \le i \le N_{\text{test}} } \right\}} \right|; \\ {\text{FP}}_{k} & = \left| {\left\{ {x_{i} | y_{k} \notin Y_{i} \;{\text{and}}\; y_{k} \in P\left( {x_{i} } \right), 1 \le i \le N_{\text{test}} } \right\}} \right|; \\ {\text{FN}}_{k} & = \left| {\left\{ {x_{i} | y_{k} \in Y_{i} \; {\text{and}}\; y_{k} \notin P\left( {x_{i} } \right), 1 \le i \le N_{\text{test}} } \right\}} \right|; \\ {\text{TN}}_{k} & = \left| {\left\{ {x_{i} | y_{k} \notin Y_{i} \;{\text{and}}\;y_{k} \notin P\left( {x_{i} } \right), 1 \le i \le N_{\text{test}} } \right\}} \right|. \\ \end{aligned}$$

These quantities represent the number of true positives, false positives, false negatives, and true negatives with respect to label \(y_{k}\), respectively. It can be easily checked that \({\text{TP}}_{k} + {\text{FP}}_{k} + {\text{FN}}_{k} + {\text{TN}}_{k} = N_{\text{test}}\). Based on these quantities, we define the precision (PRE), sensitivity or recall (SEN), specificity (SPE), and accuracy (ACC) metrics in Eqs. (9)–(12).

$${\text{PRE}}_{k} = \frac{{{\text{TP}}_{k} }}{{{\text{TP}}_{k} + {\text{FP}}_{k} }},$$
(9)
$${\text{SEN}}_{k} = \frac{{{\text{TP}}_{k} }}{{{\text{TP}}_{k} + {\text{FN}}_{k} }},$$
(10)
$${\text{SPE}}_{k} = \frac{{{\text{TN}}_{k} }}{{{\text{TN}}_{k} + {\text{FP}}_{k} }},$$
(11)
$${\text{ACC}}_{k} = \frac{{ {\text{TP}}_{k} + {\text{TN}}_{k} }}{{{\text{TP}}_{k} + {\text{FP}}_{k} + {\text{TN}}_{k} + {\text{FN}}_{k} }}.$$
(12)

To get the overall metrics, we can simply take the average of the individual per-label metrics. In the multi-label case, ACC is ambiguous [59, 60]; however, the balanced or average accuracy (AVG) can be used instead:

$${\text{AVG}}_{k} = \frac{{{\text{SEN}}_{k} + {\text{SPE}}_{k} }}{2 } .$$
(13)
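
The per-label metrics of Eqs. (9)–(13) translate directly into a few lines of NumPy, as in the following sketch (the matrix names are placeholders):

```python
import numpy as np

def per_label_metrics(Y_true, Y_pred):
    """Per-label PRE, SEN, SPE, ACC and AVG (Eqs. 9-13).

    Y_true, Y_pred : binary matrices of shape (N_test, N_labels).
    Labels that never occur (or always occur) in the test set yield NaN."""
    TP = np.sum((Y_true == 1) & (Y_pred == 1), axis=0)
    FP = np.sum((Y_true == 0) & (Y_pred == 1), axis=0)
    FN = np.sum((Y_true == 1) & (Y_pred == 0), axis=0)
    TN = np.sum((Y_true == 0) & (Y_pred == 0), axis=0)

    PRE = TP / (TP + FP)
    SEN = TP / (TP + FN)
    SPE = TN / (TN + FP)
    ACC = (TP + TN) / (TP + FP + TN + FN)
    AVG = (SEN + SPE) / 2.0          # balanced accuracy per label
    return PRE, SEN, SPE, ACC, AVG
```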

In regular classification, with a single label per image, the classification of an image is either correct or wrong. In multi-label classification, however, a prediction can be partially correct, because some labels may be detected while others are missed. Thus, there are other evaluation metrics specific to multi-label classification that take this fact into account. The Hamming loss (HL) is probably the most widely used loss function in multi-label classification; it measures the fraction of incorrectly predicted labels. First, we define it per image sample \({\text{HL}}_{i}\):

$${\text{HL}}_{i} = \frac{1}{{N_{\text{labels}} }} \left| {P\left( {x_{i} } \right) \Delta Y_{i} } \right|.$$
(14)

Here, \(\Delta\) stands for the symmetric difference between two sets and \(N_{\text{labels}}\) is the number of objects/labels. The overall HL can then be computed by taking the average over all sample images in the test set:

$${\text{HL}} = \frac{1}{{N_{\text{test}} }}\mathop \sum \limits_{i = 1}^{{N_{\text{test}} }} {\text{HL}}_{i} .$$
(15)

The mean average precision (mAP) is a ranking metric that measures the average fraction of relevant labels ranked higher than irrelevant ones. First, we compute the precision–recall curve for a particular label/object over the whole test set. Then, the average precision per object is computed as the area under the precision–recall curve:

$${\text{AP}}_{k} = \mathop \smallint \limits_{0}^{1} p(r) {\text{d}}r.$$
(16)

Obviously, the precision–recall curve is discrete, and thus AP needs to be approximated. In practice, it is approximated by computing the area under the precision–recall curve, known as the area under the curve (AUC). Finally, mAP is computed as the average of the individual \({\text{AP}}_{k}\) values:

$${\text{mAP}} = \frac{1}{{N_{\text{labels}} }}\mathop \sum \limits_{k = 1}^{{N_{\text{labels}} }} {\text{AP}}_{k} .$$
(17)

The label ranking loss (RL) is another metric; it computes the average number of label pairs that are incorrectly ordered, i.e., the fraction of reversely ordered pairs in which an irrelevant label is ranked higher than a relevant label. For more details, we refer the reader to [61].
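
For reference, these multi-label metrics are also available in scikit-learn; the sketch below computes HL, mAP (macro-averaged AP, i.e., the AUC approximation), and RL on a toy example with placeholder data.

```python
import numpy as np
from sklearn.metrics import hamming_loss, average_precision_score, label_ranking_loss

# Y_true: binary ground-truth matrix, Y_score: real-valued model outputs,
# Y_pred: binarized outputs (Y_score > T_p); shapes are (N_test, N_labels).
Y_true  = np.array([[1, 0, 1], [0, 1, 1]])
Y_score = np.array([[0.8, 0.2, 0.6], [0.1, 0.4, 0.9]])
Y_pred  = (Y_score > 0.3).astype(int)

HL  = hamming_loss(Y_true, Y_pred)                               # Eq. (15)
mAP = average_precision_score(Y_true, Y_score, average="macro")  # Eq. (17)
RL  = label_ranking_loss(Y_true, Y_score)                        # reversely ordered label pairs
print(HL, mAP, RL)
```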

3.3 Training the CNN Models

We implement the proposed deep CNN models for image multi-label classification in the Keras environment with TensorFlow as the back end. TensorFlow is an end-to-end open-source machine learning platform developed by Google that can be programmed in Python. However, due to its peculiar programming style, it is often used through a higher-level programming interface such as Keras (also written in Python).

All experiments are conducted on an HP laptop with an Intel Core i7-7700HQ CPU, an NVIDIA GeForce GTX 1060 Ti graphics card with 4 GB of dedicated memory, and 8 GB of RAM. The size of the images contained in the datasets is 640 × 480. We set up the CNN base models so that they accept the images at their original size.

Figure 4 shows the loss versus epoch number for training the different CNN models on the first dataset (KSU1) as an example. We set the batch size to 16 and the learning rate to 0.001. From these plots, we can see that the loss converges within 100 epochs for both models when training them to classify the actual images. In order to improve the stability of the models' convergence, we reduce the learning rate from 0.001 to 0.0001 after epoch 100 and train for 20 more epochs. The CNN models dedicated to learning the residual error converge much more quickly; thus, we train them for only 30 epochs each, with a learning rate of 0.001 and a batch size of 16.

Fig. 4

Loss curves for training the CNN models. a Loss for SqueezeNet main model, b loss for SqueezeNet residual-error model, c loss for VGG-16 main model, and d loss for VGG-16 residual-error model
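
A minimal sketch of this training schedule is given below; the optimizer (Adam) and the loss function (mean squared error) are assumptions not specified in the text, and the model/data names are placeholders.

```python
import tensorflow as tf

# main_model / residual_model and (X_train, Y_train, E_train) are placeholders
# for the models of Fig. 2 and the corresponding training data.
def train_main_model(main_model, X_train, Y_train):
    main_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    main_model.fit(X_train, Y_train, batch_size=16, epochs=100)
    # lower the learning rate for 20 additional epochs to stabilize convergence
    main_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    main_model.fit(X_train, Y_train, batch_size=16, epochs=20)

def train_residual_model(residual_model, X_train, E_train):
    # E_train holds the absolute residual errors of the trained main model
    residual_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    residual_model.fit(X_train, E_train, batch_size=16, epochs=30)
```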

3.4 Determining the Optimal Presence Threshold \(T_{P}\)

Recall that an important parameter in our CNN models is the presence threshold \(T_{P}\), which is used to convert the outputs into binary values. It is reasonable to assume that the best value for this parameter is 0.5, but an ablation study is needed to obtain the optimal value according to our defined metrics. Thus, in this set of experiments we examine the AVG accuracy of the base models with respect to the threshold \(T_{P}\). The results, shown in Fig. 5, clearly indicate that the best threshold value is 0.3 rather than 0.5, and this holds for all datasets. We note that we run each training experiment ten times and plot the mean of the AVG metric; the small bars at each point indicate the standard deviation over the ten runs.

Fig. 5

Sensitivity of the base models with respect to the threshold parameter \(T_{P}\). Results for a KSU1, b KSU2, c UTrento1, and d UTrento2
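
Such a threshold study can be reproduced with a simple sweep, as in the following sketch, which computes the overall AVG (balanced) accuracy for a range of candidate \(T_{P}\) values; the variable names are placeholders.

```python
import numpy as np

def sweep_presence_threshold(Y_true, Y_score, thresholds=np.arange(0.1, 0.9, 0.1)):
    """Evaluate the overall AVG (balanced) accuracy for several values of T_p."""
    results = {}
    for T_p in thresholds:
        Y_pred = (Y_score > T_p).astype(int)
        # per-label sensitivity and specificity (denominators guarded against 0)
        SEN = np.sum((Y_true == 1) & (Y_pred == 1), axis=0) / np.maximum(Y_true.sum(axis=0), 1)
        SPE = np.sum((Y_true == 0) & (Y_pred == 0), axis=0) / np.maximum((1 - Y_true).sum(axis=0), 1)
        results[round(float(T_p), 2)] = float(np.mean((SEN + SPE) / 2.0))
    return results
```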

3.5 Results for a Sample Run

The proposed fusion algorithm clearly relies on the correct estimation of the residual values between true and predicted outputs. To estimate the residual values, we can use either of the two CNN model types. Obviously, it is preferable to use the model that gives a better prediction of the residual values. To that end, we train both model types, SqueezeNet and VGG16, as residual estimators and evaluate them based on the mean squared error (MSE) between the true residuals and the predicted ones. Table 3 shows the obtained MSE results. Based on this comparison, it is clear that SqueezeNet is more accurate in predicting the residual values. Thus, we use this model type to estimate the residual values for each main model.

Table 3 MSE between true residual and estimated residual on test sets

Next, we tested the proposed method on the four datasets described in Sect. 3.1. Tables 4 and 5 present the results, showing the AVG accuracies for the two base models and the fused model. Note that here we show the results of a single sample run only. These results use the presence threshold \(T_{P} = 0.3\).

Table 4 A sample of detailed results for each class in the datasets KSU1 and KSU2 using \(T_{P} = 0.3\)
Table 5 A sample of detailed results for each class in the datasets UTrento1 and UTrento2 using \(T_{P} = 0.3\)

From Tables 4 and 5, we can see that for all datasets the proposed fused model improves the AVG accuracy compared to both base models. We can also observe that, in general, SqueezeNet outperforms VGG16 in terms of the AVG results. However, the fusion of the two models produces a good improvement on average for all datasets. This is because, even though SqueezeNet usually performs better, for some images VGG16 does a better job. For example, looking at the "people" object in Table 5 (found in the UTrento datasets), we clearly see that VGG16 outperforms SqueezeNet for this object type. Thus, the two models complement each other, as the VGG16 model in a sense corrects the SqueezeNet model for those specific object classes.

When we look at the overall metrics, we observe that the fused model outperforms the separate models for almost all metrics. There are a few exceptions (highlighted in bold font), such as the mAP metric on the KSU2 dataset, where the fused model achieved 80.69, whereas the SqueezeNet model achieved 81.43. The other exception is the UTrento1 dataset with respect to the RL metric. Here, the RL values for the two separate models are quite different, namely 0.185 and 0.125. Ideally, the RL of the fused model should be lower than that of both models, which is not the case. However, the RL value of 0.132 achieved by the fused model is significantly lower than that of the VGG16 model and close to the better value achieved by the SqueezeNet model.

3.6 Comparison to State-of-the-Art

We also present a comparison with state-of-the-art methods in Table 6. These methods include: (1) SURF matching and Gaussian process regression (SURF + GPR) [28], (2) compressive sensing and Gaussian process regression (CS + GP) [28], (3) multi-resolution random projection (MR-random-proj) [30], (4) the pre-trained GoogLeNet CNN [36], (5) the pre-trained ResNet CNN [39], (6) fine-tuning of the SqueezeNet CNN [25], and (7) a recent method based on convolutional SVM networks (convolutional SVM Net) [26].

Table 6 Comparison to state-of-the-art methods

Again, these results are obtained using \(T_{P} = 0.3\) as the presence threshold. The results in Table 6 clearly show that the proposed method produces significant improvements over the state-of-the-art methods for all datasets, except KSU1, where the improvement is marginal.

From Tables 4, 5, and 6, we notice that SqueezeNet produces better results than VGG16 on average. It is important to observe that the fusion technique computes different weights for each image; in other words, for each image a different weighting scheme is computed that favors the better model for that image. Therefore, even though SqueezeNet does better than VGG16 on average, the latter does better for certain images, and the proposed OWA fusion technique is able to discover and exploit those cases by adjusting its weights accordingly.

We also notice from Table 6 that the improvements are more significant for the UTrento1 and UTrento2 datasets. These two datasets are more challenging, as indicated by the lower AVG accuracies achieved on them. One possible reason is that they contain barrel-like distortions, as can be observed in Fig. 3. The challenging nature of these two datasets partly explains why the proposed method provides a more significant improvement for them: the two CNN models disagree more on these datasets, and the fusion technique is able to exploit this disagreement in a complementary way to enhance the detection results.

3.7 Hardware Implementation Using FPGA

This work describes a module that is part of a bigger project, called BlindSys, to help VI persons with vision tasks such as navigation, text recognition, object detection, and face recognition. The project involves a hardware part, which is illustrated in Fig. 6. The system hardware includes a wide-angle camera, an earphone, laser and inertial measurement unit (IMU) sensors, and a high-end 10-inch mobile device (tablet).

Fig. 6

BlindSys overview. a Frontal view, b side view showing the camera on the chest, the tablet in the back pocket, and the headphones

However, the tablet does not contain any graphical processing unit and thus cannot handle real-time execution of some tasks. Therefore, using dedicated hardware circuitry, such as a field-programmable gate array (FPGA), to execute the network models is an attractive solution.

Modern neural networks are computationally expensive and require specialized hardware, such as graphics processing units. The use of mobile devices without further optimization may not provide sufficient performance when high processing speed is required, as in our computer vision system for supporting VI persons. We can speed up neural networks by moving the CNN computation from software to hardware, namely an FPGA implementation, and by using fixed-point calculations instead of floating point.

The unique flexibility of the FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN calculations, the number of images that can be processed per second can be significantly increased, improving throughput and reducing power, while achieving the same accuracy as the corresponding software implementation.

For example, the authors in [62] proposed an FPGA implementation of the pre-trained VGG16 deep neural network. They used dynamic-precision quantization with 48-bit data representation and singular value decomposition to reduce the size of the fully connected layers, which led to a smaller number of weights that had to be transferred from the device to the external memory. Another work by Zhang et al. [63] analyzed the throughput and required memory bandwidth for various CNNs using optimization techniques such as loop tiling and transformation; they achieved a 17.42× speedup for the AlexNet CNN [17]. Suda et al. [64] considered a higher-level solution based on an OpenCL compiler for deep networks. Using their method, they were able to implement two large-scale CNNs, namely AlexNet and VGG16, on two Altera Stratix-V FPGA platforms (the DE5-Net and P395-D8 boards), which have different hardware resources.

Recently, the authors in [65] implemented a fully connected neural network with six layers and 64 neurons using the Cyclone IV GX FPGA on the DE2i-150 board from Altera. The digital blocks are described using fixed-point notation, which provides high speed at a low cost in hardware resources. In this design, N = 32 bits are used for the fixed-point format: the most significant bit represents the sign, three bits represent the integer part, and 28 bits represent the fractional part. More recently, Duarte et al. [66] proposed a protocol for the automatic conversion of fully connected neural network implementations from a high-level programming language to a high-level synthesis (HLS) intermediate format and then into an FPGA implementation.
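
To make the fixed-point representation mentioned above concrete, the following sketch quantizes floating-point weights to a 1-sign / 3-integer / 28-fraction format; it is a generic illustration under those assumptions, not the circuitry used in [65].

```python
import numpy as np

FRAC_BITS = 28          # fractional bits
INT_BITS = 3            # integer bits (plus one sign bit -> 32 bits total)
SCALE = 1 << FRAC_BITS

def to_fixed_point(w):
    """Quantize float weights to signed Q3.28 fixed point (stored as int32)."""
    limit = (1 << (INT_BITS + FRAC_BITS)) - 1          # largest representable magnitude
    q = np.clip(np.round(np.asarray(w) * SCALE), -limit - 1, limit)
    return q.astype(np.int32)

def from_fixed_point(q):
    """Recover the approximate float value for verification."""
    return np.asarray(q, dtype=np.float64) / SCALE

w = np.array([0.5, -2.25, 7.1])
print(from_fixed_point(to_fixed_point(w)))   # values rounded to 2**-28 resolution
```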

However, implementing even deeper networks with dozens of layers is problematic, since all the layer weights would not fit into the FPGA memory, requiring the use of external RAM, which can decrease performance. Moreover, due to the large number of layers, error accumulation increases and requires a wider bit range to store the fixed-point weight values. These considerations make our solution advantageous, because it is based on the SqueezeNet and VGG16 CNNs, which are shallow models with a low number of layers compared to more recent models in the literature. In fact, as mentioned previously, FPGA implementations of the larger of the two models (VGG16) have already been successfully demonstrated in the literature.

4 Conclusions

In this work, we present a novel computer vision method for detecting the presence of multiple objects in a scene. The method represents a module in a larger assistive technology system for the visually impaired. We propose an innovative approach that fuses two CNN models, namely VGG16 and SqueezeNet, by using dedicated CNN models to estimate the residual errors of the predicted outputs in combination with an OWA approach.

The experimental results on four image datasets of indoor environments from two separate locations show significant improvements compared to state-of-the-art methods. The proposed OWA approach, based on estimating the residuals of the CNN outputs and using them as confidence values, is able to select the better of the two CNN outputs. One way to improve the results is to add more CNN models to the ensemble; however, this would also increase the computational time. Another, more promising direction is to employ augmentation techniques to increase the dataset size, because CNN models usually require large datasets for training. Finally, an interesting idea is to divide the objects of interest among multiple CNN models. For example, we could use three CNN models of the same or different types, each responsible for detecting only five of the 15 objects. It is expected that reducing the number of objects per CNN should increase its performance.