Introduction

The availability of ever-increasing computational power and the diffusion of distributed computing make it possible to apply deep-learning paradigms to complex problems. Deep learning also lies at the core of a variety of applications supported by smart devices, but power consumption and hardware constraints tend to limit the deployment of those learning models. Convolutional neural networks (CNNs), for example, represent a key tool for image/video processing domains, but they demand a notable effort in architecture design and bring about a considerable computational cost. This makes the real-time implementation of CNNs on embedded systems a very challenging task. In practice, when addressing resource-constrained devices, a trade-off between generalization accuracy and latency is of paramount importance.

Sentiment analysis is a most interesting, yet challenging, application relying on deep learning [1, 2], since it aims to extract the emotional information conveyed by media contents. That problem calls for a multidisciplinary approach, involving cognitive models [3], computational resources [4], Natural Language Processing [5, 6], and multimodal analysis [7]. Image polarity detection, in particular, deals with the emotional information conveyed by an image, which further increases the complexity of sentiment analysis. To tackle the so-called subjective perception problem [8], i.e., the fact that different users perceive the same image in different ways, designers often envision custom solutions, involving both algorithms and hardware implementations.

Modern polarity detectors rely on CNNs, mostly because that paradigm proved very effective in extracting suitable features from images [9]. Moreover, CNNs can support saliency detection, i.e., identifying the parts of an image that draw an observer’s attention. This is crucial for polarity detection [10,11,12], since the salient parts of an image carry relevant information for sentiment analysis, too. In spite of these promising features, CNNs usually cannot fit the constraints imposed by smart devices [13].

To tackle the latter issue one might consider cloud computing, and move the computational load from a user’s device to remote data centers. That approach, however, brings about significant side effects [14]: firstly, fast connections and high-power servers should always be available; more importantly, privacy concerns might arise, when considering that a user’s emotional perception can relate to religious, sexual, and political orientations. The implementation of the inference (forward) phase on a user’s device therefore seems a viable solution. Since battery-operated systems cover the majority of smart devices, energy consumption is a crucial design constraint when targeting real-world applications.

This paper extends previous research [15], and presents a design strategy to deploy image polarity detection tools on resource-constrained embedded systems; it introduces implementation details, an extensive analysis of hardware options, and new experiments. Mobile devices are the target platforms, hosting on-line polarity detectors supported by hardware-oriented deep neural networks. Experimental results confirm that commercially available smartphones can support image polarity detection endowed with visual attention capabilities, thus opening new vistas on the development of low-cost applications in the man–machine interaction domain.

Contribution

The continuous introduction of new, more powerful mobile devices emphasizes the need to trade off generalization performances and hardware resources. The widespread application of deep-learning solutions has further widened the gap between theoretical algorithms and real-time, hardware-constrained implementations. A most frequent option for deployment is to remove the critical functional blocks of the inference systems altogether. Saliency detection, which is commonly embedded in sentiment classifiers, is no exception and, to the authors’ best knowledge, has never been implemented on resource-constrained devices. The original contribution of the paper can be outlined as follows:

  • The definition of a Saliency Detector that relies on a hardware-friendly deep neural network and is suitably designed to fit embedded, resource-constrained systems.

  • As a result, any embedded implementation of image sentiment analysis can now benefit from saliency detection. When tested on standard benchmarks, this solution outperformed existing implementations of polarity detectors for embedded systems.

  • A complete design strategy aimed at deployment on commercial smartphones. The process takes into account the definition of the algorithm, the quantization mechanisms with their associated effects, and the tools required for real-time deployment.

The paper is organized as follows. Section 2 reviews the state-of-the-art in the literature. Section 3 presents the hardware-aware deep-learning models that support the proposed solution, whereas Section 4 introduces and analyzes the novel algorithm. Section 5 describes the design aspects to ensure an effective deployment on mainstream embedded devices. Section 6 presents the experimental results obtained, and Section 7 makes some concluding remarks.

Related Works

Today’s state-of-the-art implementations of image polarity detection all rely on deep learning. An exhaustive review of the several proposed approaches can be found in [7,8,9].

The approaches to polarity detection typically rely on low-level features, which are extracted by a CNN (pre-)trained for object recognition; a fine-tuning process adjusts the network parameters for the specific application task. The various methods mostly differ in: a) the adopted CNN, b) the transfer-learning technique, and c) the training data domain. Campos et al. [16] adopted the AlexNet architecture for feature extraction, and analyzed the effects of layer ablation/addition on the eventual generalization performances of the fine-tuned network. Ontology-based representations have been widely used on top of object recognition models [17]; You et al. [18] introduced an architecture in which a pair of convolutional layers and four fully connected layers were stacked.

Recent research aimed to augment the basic, fully convolutional architectures. In some variants for polarity detection, both the image and a set of additional features formed the network inputs. The VGG_19 architecture proposed by Fan et al. [10] processed combined image and focal-channel information to model human attention. Likewise, other works studied the role played by art features [19] and contextual information [20]. Qian et al. [21] proposed a combination of CNNs and recurrent neural networks to model sequences of different sentiments in one image.

Several papers recently tackled the saliency issue: in [11], a specific detector [22] pinpointed the relevant parts of an image. Likewise, in [12], local information supplied by a Saliency Detector and global features extracted from the entire image merged in a fusion step. Recently, Wu et al. [23] proposed a multi-task learning algorithm that integrated saliency detection and attention mechanisms into a deep recurrent network architecture. Attention mechanisms, as defined in [24], were also explored in [25], where a dedicated convolutional neural network extracted context information from the original input image. Similarly, [26,27,28] applied attention mechanisms, whereas Rao et al. [29] adopted a Faster R-CNN to locate the significant parts of an image.

The literature proves that multimodal approaches can enhance the performances of sentiment classifiers [30]. Those models, however, depend critically on the performances of the feature extractors when addressing different information sources, and require an additional computational effort to combine the extracted features.

The deployment of image polarity classifiers on resource-constrained devices seems to have drawn limited attention. In [13], the authors analyzed solutions including hardware-friendly neural networks and weight truncation, but considered quite expensive hardware accelerators for deep learning. By contrast, this paper targets standard microprocessors and extends that research by including automatic saliency detection, for the purpose of further enhancing visual-polarity assessment.

Hardware-Friendly Deep Networks

Deep networks allow for different architectures, but not all of them prove suitable when targeting hardware implementations. In practice, one needs to properly balance accuracy, memory consumption, and computational cost.

The development of an image polarity detector endowed with saliency-detection capability involves two main tasks of computer vision, i.e., object classification and object detection. The following sections briefly review the existing solutions to trade off the generalization performances and the computational costs of the models.

Object Classification

From among the proposed hardware-oriented solutions for object classification, the family of MobileNets represents a most popular approach. The depth-wise separable convolution (DSC) is a key feature of MobileNetV1 [31,32,33]. In DSC, the standard convolutional operator is replaced by a pair of separate layers, which implement a factorized version of the original operation. The first layer performs a depth-wise convolution and involves one convolutional filter per input channel. The second layer is a \(1 \times 1\), point-wise convolution, which extracts a new set of features by working out linear combinations of the input channels. Such a decomposition remarkably reduces computational costs. For an input set of size \(H \times W \times M\), and a convolutional layer characterized by N kernels of size \(D_k \times D_k\), the computational cost \(C_{sc}\) of the conventional convolution is

$$\begin{aligned} C_{sc} = H \times W \times M \times N \times D_k \times D_k. \end{aligned}$$
(1)

When using the factorized version, the associated cost \(C_{DSC}\) is

$$\begin{aligned} C_{DSC} = H \times W \times M \times (D_k^2+N), \end{aligned}$$
(2)

which is significantly smaller than (1).
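
To make the saving concrete, the short sketch below evaluates Eqs. (1) and (2) for a MobileNetV1-style layer; the specific values of \(H\), \(W\), \(M\), \(N\), and \(D_k\) are illustrative assumptions, not taken from any particular model.

```python
# Cost of a standard convolution vs. its depth-wise separable factorization,
# as per Eqs. (1) and (2). The layer dimensions below are illustrative only.

def conv_cost(h, w, m, n, dk):
    """Eq. (1): multiply-accumulates of a standard convolution."""
    return h * w * m * n * dk * dk

def dsc_cost(h, w, m, n, dk):
    """Eq. (2): multiply-accumulates of a depth-wise separable convolution."""
    return h * w * m * (dk * dk + n)

# Example: a 14x14 feature map, M = 512 input channels, N = 512 kernels, 3x3.
h, w, m, n, dk = 14, 14, 512, 512, 3
c_sc, c_dsc = conv_cost(h, w, m, n, dk), dsc_cost(h, w, m, n, dk)
print(f"C_sc = {c_sc:,}  C_DSC = {c_dsc:,}  reduction = {c_sc / c_dsc:.1f}x")
```

For \(3 \times 3\) kernels, the ratio \(C_{sc}/C_{DSC} = N D_k^2/(D_k^2+N)\) approaches \(D_k^2 = 9\) as \(N\) grows, which explains the efficiency figures reported below for MobileNetV1.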

Recently, Ragusa et al. [13] showed that, in the presence of a large dataset for fine-tuning, one can implement polarity detection by means of hardware-friendly CNNs without compromising the eventual generalization performances. Those results were achieved by means of MobileNetV1, an architecture specifically designed for embedded systems.

Object Detection

Not all parts of an image are equally important, and a detection process is often useful to identify the relevant sub-sections. Box-based solutions for saliency detection can be divided into region-based and single-shot detectors.

The former approaches rely on a region-cropping network that first detects the informative parts of the image [34, 35], then evaluates the visual contents by treating each sub-region as an independent image. This model attains state-of-the-art performances in terms of accuracy, but typically cannot satisfy hardware constraints. Furthermore, the inference time grows linearly with the number of detected boxes.

Single-shot detectors (SSDs) [36, 37] merge the region-cropping and classification steps into one operation. The architecture uses a set of pre-set boxes called anchors. By using anchors, the network need not select the regions of interest a priori, but computes class probabilities and box positions for all anchors in one step. The resulting speed-up in the inference phase might be counterbalanced by a loss in accuracy, especially in the identification of small objects.

Several CNN architectures involving DSC can attain state-of-the-art performances in object detection [31,32,33, 38, 39]. Most Saliency Detectors derive their meta-architectures [22, 23, 40] from object detection. According to [41], the framework combining MobileNetV1 and the Single-Shot Detector seems the best option for implementing object detection on resource-constrained devices. An empirical analysis of latency vs. accuracy, involving more than 100 different configurations, proved that the combined setup lay on the Pareto-optimal surface [41].

The Proposed System

The paper adopts a bottom-up approach to the enhancement of image polarity detection, which splits the problem into sub-tasks [12]. Each module of the described architecture is designed by taking into account possibly hard constraints in terms of memory usage, power consumption, and the number of floating-point operations.

Figure 1 outlines the overall processing flow. The system includes four logical blocks and is organized into two branches. The upper branch implements a conventional classification scheme: the entire input image feeds a classifier, which works out the associated polarity (\(P_{tot}\)). In the lower branch, a saliency-detector block first processes the input image and yields a set of sub-regions holding the salient parts of the image itself. The sub-regions feed a Patch Classifier, which assigns a polarity estimate to each of them. A Fusion Block eventually combines the outcomes of both classifiers and works out the overall polarity label associated with the input image.

Fig. 1 Complete schema showing the classification flow for the whole system

In the following, it is assumed that the output of each polarity classifier is a scalar quantity \(\in [-1,1]\), where \(-1\) means totally negative, and \(+1\) means totally positive.

Image Classifier

In the Image Classifier, one CNN is trained to estimate image polarity; this module processes the input image as a whole, without any additional information. A pre-trained CNN architecture provides the starting point, while a fine-tuning procedure adjusts the network to the image-polarity-detection task.

The design strategy adopts MobileNetV1 as the core CNN. The architecture is characterized by 4.2M parameters, and each inference (forward) phase requires 569M operations. It is worth noting that the most popular architecture for image polarity, namely VGG_16, involves 138M parameters and requires 15,300M operations to carry out an inference step [32]. MobileNetV1 is about 32 times lighter in terms of memory usage, and performs 27 times more efficiently in terms of floating-point operations. Nonetheless, MobileNetV1 attains a comparable accuracy in polarity detection [13].
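
These ratios follow directly from the quoted parameter and operation counts, as a one-line arithmetic check shows:

```python
print(138e6 / 4.2e6)     # ~32.9: memory ratio between VGG_16 and MobileNetV1
print(15300e6 / 569e6)   # ~26.9: ratio of operations per inference step
```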

Saliency Detector

Saliency detection locates the significant parts of an image, i.e., those that are most interesting for a human viewer, and typically relies on object-detection approaches [12, 41]. Deep networks typically split the inference strategy for object detection into two phases: 1) a meta-architecture identifies a set of “boxes” delimiting the tentative sub-regions, and 2) a backbone CNN extracts features from each box. A network pre-trained for object detection again provides the starting point for training, followed by a fine-tuning procedure specifically targeted to classify the saliency of each sub-region.

In the design schema presented here, the Saliency Detector relies on an SSD-MobileNetV1 meta-architecture, which merges box identification, cropping, and classification into a single step, thereby reducing the overall computational cost. The process requires 1.2 billion operations and storage for 6.8M parameters. The equivalent process, using a Fast R-CNN as an alternative meta-architecture and VGG_16 as a feature extractor, would involve 64.3 billion operations and 138.5M parameters.

Patch Classifier

The Patch Classifier assigns a polarity to each sub-region prompted by the Saliency Detector. This module adopts the same structure as the Image Classifier, but the fine-tuning procedure covers the sub-regions extracted from image polarity datasets. Section 6.2 will provide further details of this procedure.

Fusion Block

The Fusion Block handles the outcomes of the Image Classifier and the Patch Classifier to associate a polarity estimate with the input image. This module implements a hard-coded algorithm and is not subject to training. The rationale underlying the algorithm is that bigger objects in an image often tend to draw the viewer’s attention and therefore prove more important; yet small objects such as guns, injuries, or particular symbols can sometimes be crucial. The Fusion Block balances these two aspects by clustering the sub-regions into groups based on their size; then, the overall polarity is assessed. The procedure can be outlined as follows (a code sketch is given after the list):

  • Salient sub-regions are grouped into 5 clusters \(C_i,\ i=1,\ldots ,5\), on the basis of the ratio of the area \(A_j\) of the \(j\)-th sub-region to the area \(A_{tot}\) of the entire image. The five ranges are: (0.0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1). For instance, the set \(C_1\) includes the patches such that \(0.0<A_j/A_{tot}<0.2\).

  • The cumulative polarity of each cluster is worked out by adding the polarities of all cluster elements: \(P_{C_i}= \sum _{j \in C_i} P_j\).

  • The eventual polarity is computed by picking the element featuring the maximum absolute value in the vector \(h = [P_{C_1}, \ P_{C_2},\ P_{C_3},\ P_{C_4},\ P_{C_5},\ P_{tot}]\): let \(\hat{P} = h_{j^*}\), with \(j^* =\arg \max _{j}|h_j|\); then, one assigns a positive or negative polarity according to the sign of \(\hat{P}\). For instance, if an image is characterized by \(h = [-0.1, -0.7, -0.3, +0.5, -0.2, +0.3]\), then \(\hat{P} = -0.7\) and the eventual image polarity is negative.
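
A minimal sketch of the fusion rule follows (plain Python; the function name and the example inputs are ours, for illustration only):

```python
# Sketch of the Fusion Block rule outlined above (illustrative, not the
# production code). Each patch carries a polarity in [-1, 1] and the ratio
# of its area to the area of the whole image.

def fuse_polarities(patches, p_tot):
    """patches: list of (polarity, area_ratio) pairs; p_tot: whole-image polarity."""
    upper_bounds = [0.2, 0.4, 0.6, 0.8, 1.0]
    cluster_sums = [0.0] * 5                     # cumulative polarities P_{C_i}
    for polarity, ratio in patches:
        for i, ub in enumerate(upper_bounds):    # assign patch to a size cluster
            if ratio < ub or ub == 1.0:
                cluster_sums[i] += polarity
                break
    h = cluster_sums + [p_tot]
    p_hat = max(h, key=abs)                      # element with maximum |h_j|
    return ("positive" if p_hat >= 0 else "negative"), p_hat

# Example from the text: one patch per cluster, so that
# h = [-0.1, -0.7, -0.3, +0.5, -0.2, +0.3].
patches = [(-0.1, 0.1), (-0.7, 0.3), (-0.3, 0.5), (+0.5, 0.7), (-0.2, 0.9)]
print(fuse_polarities(patches, 0.3))             # -> ('negative', -0.7)
```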

Deployment on Embedded Devices

The deployment of any deep-learning algorithm on embedded devices should take some constraints into account. First, the target device should properly balance memory size, power consumption, computational capacity, and cost. Secondly, data types (in terms of bit representation) should be carefully selected, as they impact the trade-off between accuracy and memory usage.

Devices

Conventional deep-learning algorithms require GBs of memory to be deployed. For this reason, the market offers embedded hardware accelerators with dedicated memories. Intel released the “Movidius” neural-compute stick, which includes 8 GBs of memory for parameter storage and inference operations. Nvidia offers the “Jetson” family, in which microcontrollers are embedded side by side with a dedicated GPU according to the “System-On-Module” paradigm. Google’s “Coral” line of edge devices relies on Tensor Processing Units (TPUs) for the inference phase. All those solutions leverage dedicated hardware to run deep-learning algorithms while limiting latency. Hardware accelerators also involve advanced software tools, such as sophisticated compression strategies to reduce storage requirements; a typical trick is to apply a fixed-point representation whenever required. Ragusa et al. [13] recently deployed image sentiment classifiers on Movidius and Jetson devices, and proved empirically that real-time performances could be obtained without affecting the accuracy of the eventual predictors.

In many cases, however, application-specific constraints force designers to look for different solutions. Most commercial products, including smartphones and smart devices, set tight constraints on power consumption. Although recent high-end smartphones embed Graphical Processing Units (GPUs) and Neural Processing Units (NPUs) [42], those devices are expensive and quite demanding in terms of power consumption. Moreover, even when the available memory ranges in the GBs, a single application can only use a limited portion of it.

Smart devices often rely on microcontrollers; hence, the quantity of available memory is limited to a few MBs. For instance, the ARM Cortex-M7—largely embedded in IoT applications—reserves just 2 MBs for parameter storage. External RAMs and flash memories can extend the memory space, but these add-ons must remain relatively small to limit power consumption; this makes any design strategy based on embedded accelerators impractical.

Data Representation

Low-level programming languages can improve the efficiency of deep-neural-network implementations, but at the same time their adoption can significantly increase a product’s Time-To-Market. Thus, designers often rely on high-level programming languages, which simplify optimization by means of dedicated libraries and tools, and make large use of high-precision data representations.

Post-training quantization is a common approach to deploying deep-learning models on resource-constrained devices. This solution consists in converting the representation of the network parameters from single-precision floating-point (FP32) to half-precision floating-point (FP16), or even 8-bit integer (INT8), coding. FP32 complies with the IEEE floating-point standard and does not prove critical from an accuracy point of view. Many devices, however, inherently support FP16 in their instruction set; in addition to halving memory occupation, this also yields a considerable speedup in inference operations. For example, high-end microcontrollers typically support the Single Instruction Multiple Data (SIMD) paradigm with the FP16 format.
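
As a quick illustration of the storage effect of FP16 truncation (a NumPy sketch on random weights, not the deployed pipeline):

```python
import numpy as np

# Truncating FP32 weights to FP16 halves storage at a small rounding cost.
w32 = np.random.randn(4_200_000).astype(np.float32)   # a MobileNetV1-sized tensor
w16 = w32.astype(np.float16)
print(f"{w32.nbytes / 2**20:.1f} MB -> {w16.nbytes / 2**20:.1f} MB")
print("max rounding error:", np.abs(w32 - w16.astype(np.float32)).max())
```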

The INT8 format further reduces memory occupation and makes fixed-point operations faster, to the detriment of dynamic range and resolution. To minimize the consequent loss in accuracy, a pair of factors modulates the integer post-quantization as follows:

$$\begin{aligned} real = (approx_{8bit} - bias) \times scale \end{aligned}$$
(3)

where \(bias\) is an integer quantity that sets the center value of the represented numbers, and \(scale\) is a 32-bit floating-point value that re-calibrates the dynamic range.
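
The sketch below applies Eq. (3) to a weight tensor; the per-tensor derivation of \(scale\) and \(bias\) shown here is one common choice, assumed for illustration:

```python
import numpy as np

def quantize_int8(w):
    """Per-tensor affine quantization, inverse of Eq. (3): q = w / scale + bias."""
    scale = (w.max() - w.min()) / 255.0        # spread the dynamic range over 256 levels
    bias = np.round(-128.0 - w.min() / scale)  # integer offset centering the range
    q = np.clip(np.round(w / scale + bias), -128, 127).astype(np.int8)
    return q, bias, scale

def dequantize(q, bias, scale):
    """Eq. (3): real = (approx_8bit - bias) * scale."""
    return (q.astype(np.float32) - bias) * scale

w = np.random.randn(1_000).astype(np.float32)
q, bias, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, bias, scale)).max())
```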

An INT8 representation calls for some floating-point re-calibration steps, which prevent rounding issues from causing data losses as data propagate through the architecture. Even in the presence of scaling and biasing, 8-bit quantization can affect a predictor’s accuracy significantly. At the same time, most commercial tools do not provide software libraries supporting optimized 8/16-bit arithmetic; hence, the model parameters are often converted back to a floating-point representation. As a result, the adoption of INT8 ultimately does not reduce the run-time memory occupation.

Design Choices

The deployment of the enhanced architecture for image sentiment classification with saliency detection includes a variety of techniques. DSC replaces basic convolutions, and the MobileNetV1 architecture is adopted, as a result of the analysis described previously [9]. The detection architecture embeds a Single-Shot Detector, with post-training weight quantization. To comply with hardware constraints, the quantization mechanism only affects parameter storage, whereas 32-bit floating-point precision is restored and kept during inference operations.

The set of embedded devices used to test the design criterion included five (Android) smartphones, namely, Honor 10, Huawei P10, Samsung S8, Huawei P8 Lite, and Huawei P20 Lite. This choice was justified by the fact that smartphones represent the mainstream among battery-operated devices. Although most of these devices embed GPUs and NPUs, only the microcontroller components therein were used to carry out the computations, to ensure fair comparisons; the analysis also excluded multi-threading for the same reason. Measuring performances on a single core emulated a maximally constrained scenario. The experiments were carried out by implementing a custom Android application.

The TFLite tool supported the deployment of the design strategy on the target smartphones; the network weights were only quantized on the host disk. During the inference phases, instead, the weights were cast to FP32 precision after being loaded into the RAM of each device. Thus, a model quantized in the INT8 format could require 1/4 of the disk space as compared with its RAM occupation. In the rest of the paper, quantization will only apply to disk-based storage and therefore will not affect the inference phase.
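
For reference, the post-training weight quantization used here can be sketched with the TFLite converter as follows (assuming a TensorFlow SavedModel of the classifier; the model path is illustrative). The INT8 variant is the dynamic-range mode, which stores 8-bit weights on disk while keeping floating-point arithmetic at run time, consistently with the setup described above.

```python
import tensorflow as tf

# FP16 post-training quantization: weights stored in half precision on disk.
converter = tf.lite.TFLiteConverter.from_saved_model("polarity_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
open("model_fp16.tflite", "wb").write(converter.convert())

# INT8 dynamic-range quantization: 8-bit weights, float computation at run time.
converter = tf.lite.TFLiteConverter.from_saved_model("polarity_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
open("model_int8.tflite", "wb").write(converter.convert())
```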

Experimental Results

The experimental session aimed both to evaluate the accuracy of the proposed method and to assess memory usage and inference timings.

The experimental results are organized as follows. Sections 6.1 and 6.2 present the accuracy and memory occupation of the Saliency Detector and of the pair {Image Classifier, Patch Classifier}, respectively. Section 6.3 addresses the performances of the whole framework, whereas Section 6.4 illustrates the obtained results in graphic form. The subsequent experimental campaign analyzed the impact of post-training quantization on the generalization performances. A final experimental session measured the hardware performances of the classifier when deployed on the target smartphones.

Saliency Detector

The training of the Saliency Detector adopted the procedure proposed in [12]. The training involved an SSD_MobileNet architecture trained for object detection on the COCO dataset [43]. The configuration file of the original model, as downloaded from the TensorFlow website, set all the training parameters, with the exception of the number of iterations, which was set to 50,000.

The experiments involved a pair of (publicly available) datasets, holding images with the associated information about salient contents. The starting model was first adjusted by using the ILSVRC-2014 dataset, which contained 127,030 images. Then, a fine-tuning process involved the SOS dataset [8], which included 3,951 images. The generalization performance was assessed by applying the conventional hold-out procedure: in each experiment, 90\(\%\) of the available data formed the training set, whereas the remaining 10\(\%\) provided a test set, which was never used to tune any parameter (during training) or hyper-parameter (during model selection).

Table 1 gives the results of the experimental session. The first column lists the different Saliency Detectors that were tested: “ILSVRC” marks the model trained with that dataset, while “SOS” refers to the model first trained with ILSVRC and then refined with SOS. The second column gives the number of boxes prompted by the model. The third column gives the IoU75 mAP indicator [43], used to assess the object-detection performance. The general Intersection over Union (IoU) measure takes into account the set A of pixels associated with a box proposed by the detector and the ground-truth set of pixels B. It can be formalized as:

$$\begin{aligned} IoU(A,B)=\frac{|A\cap B|}{|A\cup B|}. \end{aligned}$$
(4)

Thus, IoU75 means that a prediction is classified as correct if \(IoU(A,B)>0.75\). Informally speaking, this quantity only counts as correct those predictions that sufficiently overlap the real position of the salient region.
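
For axis-aligned boxes, Eq. (4) reduces to the simple computation below (a sketch; boxes are expressed as (x1, y1, x2, y2) corner coordinates):

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Under IoU75, a prediction counts as correct when iou(pred, truth) > 0.75.
print(iou((0, 0, 10, 10), (1, 1, 10, 10)))  # 0.81 -> correct at IoU75
```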

The rightmost column reports the precision, i.e., the ratio of the number of correct predictions to the total number of predictions.

Table 1 Saliency-detector performances tested on 13,982 true boxes

The outcomes of this experimental session confirmed that the number of predicted boxes for the test set was largely consistent with the ground truth (13,982). Moreover, the IoU measurements proved that the predicted boxes overlapped the actual positions satisfactorily: IoU exceeded the 0.75 threshold in \(77.8\%\) of the cases for SSD_MobileNet_ILSVRC, and in \(63.0\%\) of the cases for SSD_MobileNet_SOS. These results seem quite interesting when considering the severe hardware constraints on the architecture.

Image Classifier and Patch Classifier

The subsequent experimental session measured the performances of the pair of classification modules, both relying on the MobileNetV1 architecture.

Training the CNN for image polarity detection required a large dataset. Toward that end, the training sessions used a version of the adjective–noun pairs (ANP) dataset [44] (images crawled from Flickr). The experiments involved the ANP40 dataset [9], which covered the 20 most positive and the 20 most negative adjective–noun pairs. The associated set of images was split into the corresponding subsets (positive and negative samples). The eventual ANP40 set held 10,516 image samples.

The Image Classifier and the Patch Classifier shared the same architecture, i.e., the MobileNetV1 model trained on the ILSVRC-2012-CLS dataset, but were fine-tuned separately. The Image Classifier was adjusted on the whole ANP40 dataset; instead, the Patch Classifier was only fine-tuned on the sub-regions extracted from the ANP40 dataset. In the following,

  • PC_ILSVRC will refer to the Patch Classifier fine-tuned on the sub-regions extracted by SSD_MobileNet_ILSVRC

  • PC_SOS will refer to the Patch Classifier fine-tuned on the patches extracted via SSD_MobileNet_SOS (as per Sec. 6.1).

The well-known Twitter (Tw) [18] and Multi-view (MVSA_eq and MVSA_maj) [45] datasets formed the test sets to assess the classifiers’ performances. In both cases, a pool of human users had manually annotated the datasets. Neither parameter tuning nor model selection ever involved these data.

In the case of Tw, the “all-agreed” version was considered. The experimental sessions involved two versions of MVSA: MVSA_eq (i.e., the all-agreed version of MVSA), and MVSA_maj, in which human annotators assigned the eventual label on a majority basis.

To assess the classifiers’ performances, the procedure first extracted the sub-regions from each test image. Then, the sub-regions were grouped into five clusters according to their relative size (as per Sec. 4.4). Finally, each cluster was evaluated by feeding the classifiers both the sub-regions and the entire, original images.

The obtained results are reported in the form of tables. Each table gives the outcomes of the tests on a specific dataset with a specific configuration of the Patch Classifier, and compares the Image Classifier and the Patch Classifier (first column). The second column indicates the classifier inputs, i.e., either the set of sub-regions or the set of whole images. Columns [3–7] refer to the five clusters, grouped according to the coverage spanned by the sub-regions. For a specific classifier and a specific input (i.e., for a specific row of the table), the column associated with a cluster gives the average classification accuracy—in the range [0, 1]—for the patches/images belonging to that cluster.

For instance, in Table 2, \(C_1\) takes into account the sub-regions covering a percentage of the whole image in the range \((0\%, 20\%)\). The cell value 0.628 shows that \(62.8\%\) of those sub-regions were correctly classified.

Table 2 gives the outcomes of the tests involving the Tw dataset; in those experiments, the Patch Classifier was supported by PC_ILSVRC.

Table 2 Comparison of classification accuracy of Image Classifier and Patch Classifier for dataset Twitter with patch extractor trained on ILSVRC

PC_ILSVRC always outperformed the Image Classifier when small- and medium-sized patches were classified (clusters \(C_1\), \(C_2\), and \(C_3\)), independently of the classifiers’ inputs (either the sub-regions or the whole image). Conversely, the Image Classifier performed better than PC_ILSVRC on \(C_4\) and \(C_5\), for both sub-region and full-image inputs. That behavior was expected, because larger sub-regions better approximated full images in size and contents.

Table 3 gives the results for the Tw dataset when the sub-regions were extracted by the Saliency Detector trained on SOS.

Table 3 Comparison of classification accuracy of Image Classifier and Patch Classifier for dataset Twitter with patch extractor trained on SOS

When image sub-regions were the network inputs, PC_SOS scored a better accuracy in \(C_1\) and \(C_4\). In agreement with the results shown in Table 2, the Patch Classifier was the best option in the case of the smallest sub-regions (\(C_1\)). In almost all other cases, the basic Image Classifier seemed to perform better.

Table 4 reports on the results for the MVSA_eq dataset when adopting PC_ILSVRC; these differed significantly from those obtained on the Twitter dataset. The image-based classifiers always performed better, independently of the inputs provided. Such a behavior was probably due to the fact that the dataset included images with heterogeneous contents, including text, diagrams, and drawings; as a consequence, a Patch Classifier trained on saliency regions could hardly prove suitable to handle that kind of data.

Table 4 Comparison of classification accuracy of Image Classifier and Patch Classifier for dataset MVSA_eq with patch detector trained on ILSVRC

Table 5 relates to the experiments on MVSA_eq using PC_SOS. The experimental outcomes confirmed the trend shown with PC_ILSVRC. Tables 6 and 7 complete the experimental analysis.

Table 5 Comparison of classification accuracy of Image Classifier and Patch Classifier for dataset MVSA_eq with patch extractor trained on SOS
Table 6 Comparison of classification accuracy of Image Classifier and Patch Classifier for dataset MVSA_maj with patch detector trained on ILSVRC
Table 7 Comparison of classification accuracy of Image Classifier and Patch Classifier for dataset MVSA_maj with patch detector trained on SOS

The experimental verification pointed out that the Patch Classifier and the Image Classifier scored comparable accuracy values, with differences depending on the processed dataset. The combination of the two algorithms could enhance performances in most cases; the results of that combined approach are reported in the next section.

The Overall Framework

The third experimental session evaluated the performances of the combined framework (Fig. 1). The Saliency Detector, the Image Classifier, and the Patch Classifier were trained as described above. The Tw, MVSA_eq, and MVSA_maj datasets formed the test sets, and they were never used for either model selection or parameter tuning. A pair of alternative configurations provided the baselines for the experimental analysis.

The first baseline was MobileNetV1, i.e., an image polarity detector that did not include any saliency-based technique. This model was selected because 1) it was the most efficient in terms of computational cost, and 2) it achieved the best classification accuracy among a set of candidate architectures on a very similar experimental setup [13].

The second baseline, denoted in the following as FullyDD, relied on a fully data-driven strategy based on the proposed framework (Fig. 1), in which a linear regression module supported the Fusion Block. The value of the regularization parameter was set by a fivefold cross-validation procedure.

Table 8 refers to the Twitter dataset. The first column lists the various predictors, namely, MobileNetV1, FullyDD including the Patch Classifier trained on SOS, FullyDD with the Patch Classifier trained on ILSVRC, the proposed framework with the Patch Classifier trained on SOS, and the proposed framework with the Patch Classifier trained on ILSVRC. Columns [2–5] give the performances on the test set measured in terms of accuracy, precision, recall, and F1. The experimental results confirmed that both implementations of the proposed framework outperformed the baseline solutions in terms of accuracy, recall, and F1, with the only exception of the FullyDD\(_{SOS}\) implementation.

Tables 9 and 10 show the results for the MVSA_maj and MVSA_eq datasets, respectively, following the same format as Table 8. In both experiments, the proposed implementations always outperformed the baseline solutions; FullyDD\(_{SOS}\) only scored a better precision on the MVSA_maj dataset.

Table 8 Performances of the proposed model for dataset Twitter
Table 9 Performances of the proposed model for dataset MVSA_maj
Table 10 Performances of the proposed model for dataset MVSA_eq

Visual Examples

This section gives some visual examples that illustrate the behavior of the proposed framework. Figure 2 includes nine test images. Each sample is outlined in red/green when it was associated with a negative/positive polarity. In each image, a box highlights the salient part as pinpointed by the Saliency Detector, with the same color scheme.

The tests involved different sizes of the patches. In all cases, the overall polarity of the image matched the polarity assigned to the salient sub-region, with the only exception of the leftmost example in the central row. The latter example clarifies that the eventual polarity assigned to an image depends on both the relative size of the sub-region and the strength of the polarity value associated with that patch: in that specific case, the absolute value of the polarity associated with the whole image was larger than the absolute value of the polarity associated with the sub-region.

Fig. 2 Examples of images labeled using the proposed algorithm, with details about the salient parts

Robustness to Compression and Quantization

The experiments reported in this section analyzed the impact of quantization. In addition to the baseline floating-point representation (i.e., FP32), the analysis took into account the FP16 and INT8 quantization levels. The setup followed the procedure described in Section 5.2: the models were first trained by using the FP32 representation, then all weights were truncated to the target reduced-precision format. In the inference phase, the truncated weights were converted back to the FP32 representation. TFLite supported all the inference models.

Figure 3 reports on the results for the Proposal\(_{SOS}\) model on the Twitter dataset, following the training/test set strategy already described. The bar graph gives four measurements, i.e., accuracy, precision, recall, and F1. The blue, red, and green bars refer to the FP32, FP16, and INT8 representations, respectively. The results confirmed that quantization did not affect the generalization performances significantly. In fact, when considering the recall and F1 descriptors, the FP16 representation even outperformed the baseline model involving FP32.

Fig. 3 Experimental results using the Proposal\(_{SOS}\) model for inference on the Twitter dataset

Figures 4 and 5 give the results for the MVSA_maj and MVSA_eq datasets. When dealing with these datasets, too, quantization proved beneficial: FP16 quantization improved accuracy, recall, and F1. Notably, the INT8 representation also turned out to be effective. These improvements were possibly due to the fact that truncation acted as a low-pass filter on the images; hence, the eventual prediction was less influenced by details and more influenced by global features.

Fig. 4 Experimental results using the Proposal\(_{SOS}\) model for inference on the MVSA_maj dataset

Fig. 5 Experimental results using the Proposal\(_{SOS}\) model for inference on the MVSA_eq dataset

Table 11 reports on a worst/best case analysis based on the experiments shown in Figs. 3, 4, and 5. The analysis covered the MobileNetV1 and Proposal\(_{ILSVRC}\) models, and included all the combinations of 3 deep-learning architectures, 4 measures, and 3 test sets; this led to a total of 36 configurations for each quantization level. In the table, the row named “best” counts how many times the associated quantization level attained the best score. For example, models quantized with FP16 scored the best result in 24 out of 36 configurations. The row marked as “worst” counts the configurations scoring the worst results.

The major outcome of that analysis was that FP16 quantization quite often proved more robust than FP32; this made it possible to halve the memory needed to store the model parameters. These results did not contradict those reported in [13]: in that research, the entire inference process adopted a truncated algebra, bringing about severe propagation issues.

Table 11 Worst- and best-case analysis

Deployment on Smartphones

This section presents the results of the deployment of the proposed design strategy on 5 smartphones. To ensure fair comparisons, only the microcontroller module within each smartphone carried out the computations; dedicated hardware (such as GPUs and NPUs) and multi-threading were excluded. Measuring all performances on a single core made it possible to emulate a maximally constrained scenario.

A custom Android application supported all experiments, using the TFLite tool for deployment; all the quantization levels tested so far were considered, i.e., INT8, FP16, and FP32. In Table 12, each row corresponds to a target device, whereas the columns relate to the quantization levels. The table gives the average inference timings (in milliseconds) over 30 images, with the corresponding standard deviations in brackets.
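
The timing protocol can be reproduced along the following lines (a sketch based on the TFLite Python Interpreter; on the smartphones the measurements relied on the equivalent Android API, and the model path and inputs are illustrative):

```python
import time
import numpy as np
import tensorflow as tf

# Single-threaded TFLite inference, timed over 30 inputs as in Table 12.
interpreter = tf.lite.Interpreter(model_path="model_fp16.tflite", num_threads=1)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

timings_ms = []
for _ in range(30):
    x = np.random.rand(*inp["shape"]).astype(np.float32)  # stand-in for a test image
    interpreter.set_tensor(inp["index"], x)
    t0 = time.perf_counter()
    interpreter.invoke()
    timings_ms.append((time.perf_counter() - t0) * 1000.0)

print(f"mean {np.mean(timings_ms):.1f} ms (std {np.std(timings_ms):.1f})")
```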

The considerably high inference timings were a consequence of using a single core for the computations. Moreover, the smartphones behaved differently when subject to quantization effects: with the notable exception of the Samsung S8, all devices proved faster when using the INT8 representation. That discrepancy should possibly be ascribed to the different chipsets and to the memory management of the Android operating system. The results anyway confirmed that the design strategy for the inference phase could be deployed successfully on commercial smartphones, even when using a single core, and always scored latency values smaller than 1 second; this allows the model to run in parallel with other applications in real-life use cases. By comparison, existing solutions featuring saliency detection for image sentiment analysis all relied on VGG and ResNet backbones, whose memory occupation could not satisfy the hardware constraints of smartphones [13].

Table 12 Inference time measures for 30 images

Conclusions

The paper explored the use of deep neural networks, based on depth-wise separable convolutions, as the key elements for image polarity detection endowed with saliency information. The overall problem was addressed using a bottom-up approach, and the eventual system included three deep-learning stages based on MobileNetV1. The overall setup proved computationally lighter than the conventional solutions presented in the literature. Experimental results confirmed that the proposed method managed to balance generalization performances and computational costs effectively.