1 Introduction

1.1 Motivation and Purpose

Massive web documents such as micro-blogs and customer reviews are useful for public opinion sensing and trend analysis. Sentiment analysis (i.e., automatically predicting whether a review is positive or negative overall) has been commonly used in this area. Deep neural networks (DNNs) are some of the best-performing machine learning methods [1]. However, DNNs are often avoided in cases where explanations are required because these networks are generally considered black boxes. Thus, developing a neural network (NN) model with high predictability that can explain its prediction process in a human-like way is a critical problem. In developing such an NN model, we should consider how humans usually judge the positive or negative polarity of each review. As described in previous linguistic research [2,3,4], humans are known to judge the document-level polarity of a review by extracting the following four types of word-level scores in order.

1. Word-level original sentiment score: the sentiment that each word in a review originally has (e.g., scores in a word sentiment dictionary [5]).

2. Word-level sentiment shift score: whether the sentiment of each term in a review is shifted or not (e.g., “good” in “not good” and “goodness” in “decrease the goodness”).

3. Word-level global important point score: which parts of the entire review are important.

4. Word-level contextual sentiment score: the positive or negative sentiment score of each term after considering the sentiment shift and global important point.

In addition, as described in previous text visualization research [4], the following concept-level contextual sentiment score is important for readers to grasp a summary of the review content.

5. Concept-level contextual sentiment score: the concept-level positive or negative sentiment of each review, where a concept means a set of similar terms.

Therefore, neural network models that can (1) analyze document-level sentiment with high predictability and (2) explain the prediction results using the above five types of sentiment scores, as shown in Fig. 1, should be in great demand in industry.

Fig. 1 Goal: development of a neural network (NN) that can explain its prediction results using four types of sentiments

However, a method for developing such NNs is yet to be established. Many studies have addressed the black-box property of NNs [4, 6,7,8,9,10,11,12,13,14]; however, these works do not provide interpretability in a form that humans find natural and agreeable because none of them alone can describe all five types of scores above. For example, interpretable NNs with an attention mechanism [6, 7] can describe the global important point of each term in a review; however, they cannot describe the other three types of word-level scores. Interpretable NNs that include word-level original sentiment scores (i.e., original sentiment interpretable NNs) [4, 8, 9] can describe the word-level original sentiment scores; however, they cannot describe the word-level global and local contextual sentiment scores. As for other approaches, methods for interpreting NNs [10,11,12,13,14] can describe the word-level global sentiment scores; however, they cannot describe the other scores.

1.2 Approach

To solve this problem, we propose a novel NN model called contextual sentiment neural network (CSNN) and a novel learning strategy called initialization and propagation (IP) learning.

1.2.1 CSNN

CSNN has the following five interpretable layers: word-level original sentiment layer (WOSL), sentiment shift layer (SSL), global important point layer (GIL), word-level contextual sentiment layer (WCSL), and concept-level contextual sentiment layer (CCSL), as shown in Fig. 2. The WOSL and WCSL represent the word-level original and contextual sentiment of each term in a review, respectively. The SSL indicates whether the sentiment of each term in a review is shifted or not, and the GIL indicates the global important points in a review. The WOSL is represented in the manner of a word sentiment dictionary. The SSL and GIL are represented using long short-term memory (LSTM) cells [15] and an attention mechanism [16, 17], respectively. The values of the WCSL are obtained by multiplying the values of the WOSL, SSL, and GIL. The values of the CCSL are obtained from the WCSL and the K-means clustering results for the word embeddings, following the strategy in [4].

Fig. 2 Structure of CSNN

Therefore, using the WOSL, SSL, GIL, and WCSL, the CSNN can explain the process of the sentiment analysis prediction in a form that humans find natural.

1.2.2 IP Learning

In developing this CSNN, realizing interpretability for the WOSL, SSL, GIL, and WCSL is a crucial problem. Generally, sentiment analysis models are developed using the back-propagation method with the gradient of the loss between the predicted document-level sentiment and the positive or negative tag of each review; however, when such a general back-propagation method is used, each layer does not represent the corresponding sentiment. Thus, to realize the interpretability of the layers in CSNN, we propose a novel learning strategy called initialization and propagation (IP) learning.

IP learning includes two specific strategies called Init and Update. Update is a regularization strategy for the final weight matrix, which is expected to improve the interpretability of the WCSL. Init is a strategy for initializing the WOSL using a small word sentiment dictionary composed of a few hundred word-level original sentiment scores, which is expected to improve the interpretability of the WOSL and GIL. Using both Update and Init, the interpretability of the SSL is also expected to improve. IP learning requires only reviews, their sentiment tags, and a small word sentiment dictionary. It does not require any sentiment shift information or syntactic text analysis. This is a valuable point of our approach because we can develop a CSNN even for minor languages or non-grammatical documents.

We experimentally evaluated the performance of the proposed approach using real textual datasets. We first demonstrated that IP learning is useful for realizing the interpretability of each layer in the CSNN. We then demonstrated that the CSNN developed with IP learning has both high predictability and high explanation ability.

1.3 Contribution

The contributions of this paper are as follows:

  • We proposed a novel NN architecture called CSNN that can explain its sentiment analysis process in a form that humans find natural and agreeable.

  • To realize the interpretability of CSNN, we proposed a novel learning strategy called IP learning.

  • We experimentally demonstrated the high interpretability and high predictability of the proposed CSNN.

The remainder of this paper is structured as follows. In Sect. 2, the CSNN architecture and IP learning are explained in detail. Section 3 pre-experimentally evaluates the effect of the proposed IP learning. Section 4 presents the experiments and results. Section 5 presents the related works. In Sect. 6, the conclusion and directions for future work are discussed.

2 CSNN

This section introduces the proposed CSNN. A CSNN as described in Sect. 2.1 can be developed through IP learning (Sect. 2.3) using a training dataset \(\{ (\mathbf{Q}_i , d^{\mathbf{Q}_i})\}_{i = 1}^{N}\) and a small word sentiment dictionary. Note that N is the training data size, \(\mathbf{Q}_i\) is a comment, and \(d^{\mathbf{Q}_i}\) is its sentiment tag (1 is positive and 0 is negative).

2.1 Structure of CSNN

This section introduces the CSNN structure. The CSNN comprises the WOSL, SSL, GIL, WCSL, and CCSL, and outputs the document-level sentiment.

Notation. Before explaining the construction of the CSNN model, we define several symbols. Let \(\{w_i\}_{i = 1}^{v}\) represent the terms that appear in a text corpus of a dataset, and v be the vocabulary size. We define the vocabulary index of word \(w_i\) as \(I(w_i)\). Therefore, \(I(w_i) = i\). Let \({\varvec{w}}^{em}_i \in \mathbb {R}^{e}\) be an embedding representation of word \(w_i\), and the embedding matrix \({\varvec{W}}^{em} \in \mathbb {R}^{v \times e}\) be \([{{\varvec{w}}^{em}_1}^{T}, \ldots , {{\varvec{w}}^{em}_v}^T]^T\). Here, e is the dimension size of word-level embedding. Then, for each i, \(\Vert {{\varvec{w}}^{em}_i}\Vert _2 = 1\) is satisfied. \({\varvec{W}}^{em}\) is the constant value obtained using the skip-gram method [18] and the text corpus in a training dataset.

2.1.1 WOSL

Given a comment \(\mathbf{Q} = \{w_t^{\mathbf{Q}} \}_{t = 1}^{n}\), this layer converts the words \(\{w_t^{\mathbf{Q}}\}_{t = 1}^{n}\) to original word-level sentiment representations \(\{ {p}_t^{\mathbf{Q}} \}_{t = 1}^{n}\):

$$\begin{aligned} {p}_t^{\mathbf{Q}} = w^{p}_{I(w_t^{\mathbf{Q}})} \end{aligned}$$
(1)

where \({\varvec{W}}^p \in \mathbb {R}^{v}\) represents the original sentiment scores of words, and \({w}^{p}_i\) is the \(i-\)th element of \({\varvec{W}}^{p}\). The \({w}^{p}_{i}\) value corresponds to the original sentiment score of the word \(w_i\).
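As a minimal sketch of Eq. 1, the WOSL can be read as a table lookup. The toy vocabulary, the score values, and the function name `wosl` are illustrative assumptions rather than part of the original model code, and indices are 0-based here whereas \(I(w_i)\) is 1-based.

```python
# Hypothetical toy vocabulary; the mapped index plays the role of I(w_i) (0-based).
vocab = {"not": 0, "good": 1, "bad": 2, "market": 3}

# W^p: one original sentiment score per vocabulary word (illustrative values).
W_p = [0.0, 1.5, -1.2, 0.1]


def wosl(comment, vocab, W_p):
    """Return p_t = w^p_{I(w_t)} for every word in the comment (Eq. 1)."""
    return [W_p[vocab[word]] for word in comment]


p = wosl(["not", "good"], vocab, W_p)
print(p)  # [0.0, 1.5]
```

Because the layer is a plain lookup, each learned entry of \({\varvec{W}}^{p}\) can be inspected directly, in the same way as a word sentiment dictionary entry.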

2.1.2 SSL

First, this layer converts terms \(\{w_t^{\mathbf{Q}}\}_{t = 1}^{n}\) in comment Q into their word-level embeddings \(\{ {\varvec{e}}_t^{\mathbf{Q}} \}_{t = 1}^{n}\) using \({\varvec{W}}^{em}\), and converts them to context representations \(\{\overrightarrow{{\varvec{h}}}_t^{\mathbf{Q}} \}_{t = 1}^{n}\) and \(\{{\overleftarrow{{\varvec{h}}}_t^{\mathbf{Q}}} \}_{t = 1}^{n}\) using forward and backward long short-term memories, \(\overrightarrow{\mathrm{LSTM}}\) and \(\overleftarrow{\mathrm{LSTM}}\) [15]:

$$\begin{aligned} \overrightarrow{{\varvec{h}}}_t^{\mathbf{Q}} = \overrightarrow{\mathrm{LSTM}}({\varvec{e}}_t^{\mathbf{Q}}), \overleftarrow{{\varvec{h}}_t^{\mathbf{Q}}} = \overleftarrow{\mathrm{LSTM}}({\varvec{e}}_t^{\mathbf{Q}}). \end{aligned}$$
(2)

Second, it converts \(\{\overrightarrow{{\varvec{h}}}_t^{\mathbf{Q}} \}_{t = 1}^{n}\) and \(\{{\overleftarrow{{\varvec{h}}}_t^{\mathbf{Q}}} \}_{t = 1}^{n}\) to right- and left-oriented sentiment shift representations, \(\overrightarrow{s}_t^{\mathbf{Q}}\) and \(\overleftarrow{s}_t^{\mathbf{Q}}\):

$$\begin{aligned} \overleftarrow{s}_t^{\mathbf{Q}} = \tanh ({\varvec{v}}^{left} \cdot \overleftarrow{{\varvec{h}}}_t^{\mathbf{Q}}), \overrightarrow{s}_t^{\mathbf{Q}} = \tanh ({\varvec{v}}^{right} \cdot \overrightarrow{{\varvec{h}}}_t^{\mathbf{Q}}). \end{aligned}$$
(3)

Here, \({\varvec{v}}^{right}, {\varvec{v}}^{left} \in \mathbb {R}^{e}\) are parameter values. \(\overrightarrow{s}_t^{\mathbf{Q}}\) and \(\overleftarrow{s}_t^{\mathbf{Q}}\) denote whether or not the sentiment of \(w_t^{\mathbf{Q}}\) is shifted by the left-side and right-side terms of \(w_t^{\mathbf{Q}}\): \(\{w_{t'}^{\mathbf{Q}} \}_{t' = 1}^{t-1}\) and \(\{w_{t'}^{\mathbf{Q}} \}_{t' = t+1}^{n}\), respectively.

Finally, this layer converts \(\{\overrightarrow{s}_t^{\mathbf{Q}} \}_{t = 1}^{n}\) and \(\{{\overleftarrow{s}_t^{\mathbf{Q}}} \}_{t = 1}^{n}\) into word-level sentiment shift scores \(\{s_t^{\mathbf{Q}}\}_{t = 1}^{n}\):

$$\begin{aligned} s_t^{\mathbf{Q}} := \overrightarrow{s}_t^{\mathbf{Q}} \cdot \overleftarrow{s}_t^{\mathbf{Q}}. \end{aligned}$$
(4)

\(s_t^{\mathbf{Q}}\) denotes whether the sentiment of \(w_t^{\mathbf{Q}}\) is shifted (\(s_t^{\mathbf{Q}} < 0\)) or not (\(s_t^{\mathbf{Q}} \ge 0\)).
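The following sketch illustrates Eqs. 3 and 4 only. The LSTM outputs are replaced by hand-picked two-dimensional vectors, and the parameter vectors `v_right` and `v_left` are arbitrary; all of these values are assumptions chosen so that the second word comes out as shifted.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ssl_scores(h_fwd, h_bwd, v_right, v_left):
    """Eqs. 3-4: s_t = tanh(v_right . h_fwd_t) * tanh(v_left . h_bwd_t)."""
    scores = []
    for hf, hb in zip(h_fwd, h_bwd):
        s_right = math.tanh(dot(v_right, hf))  # driven by left-side context
        s_left = math.tanh(dot(v_left, hb))    # driven by right-side context
        scores.append(s_right * s_left)
    return scores


# Hand-picked stand-ins for the forward/backward LSTM outputs of "not good".
h_fwd = [[1.0, 0.0], [-1.0, 0.5]]
h_bwd = [[0.5, 0.5], [0.0, 1.0]]
v_right = [1.0, 0.0]
v_left = [0.0, 1.0]

s = ssl_scores(h_fwd, h_bwd, v_right, v_left)
print([x < 0 for x in s])  # [False, True]
```

Since each factor is a tanh output, every \(s_t^{\mathbf{Q}}\) lies in \((-1, 1)\), and a negative product means that exactly one of the two directions signals a shift.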

The overall structure of this SSL is shown in Fig. 3.

Fig. 3 SSL architecture

2.1.3 GIL

This layer represents the word-level global important point representations \(\{ {\alpha }_t^{\mathbf{Q}} \}_{t = 1}^{n}\) using a revised self-attention mechanism [16, 17] as

$$\begin{aligned} {\alpha }_{t}^{\mathbf{Q}} := \sum _{t'=1}^{n} \frac{e^{\tanh ({\overrightarrow{{\varvec{h}}}_{t}^{\mathbf{Q}}}^{T} \overrightarrow{{\varvec{h}}}_{t'}^{\mathbf{Q}} + {\overleftarrow{{\varvec{h}}}_{t}^{\mathbf{Q}}}^{T} \overleftarrow{{\varvec{h}}}_{t'}^{\mathbf{Q}})}}{\sum _{t''=1}^{n} e^{\tanh ({\overrightarrow{{\varvec{h}}}_{t''}^{\mathbf{Q}}}^{T} \overrightarrow{{\varvec{h}}}_{t'}^{\mathbf{Q}} + {\overleftarrow{{\varvec{h}}}_{t''}^{\mathbf{Q}}}^{T} \overleftarrow{{\varvec{h}}}_{t'}^{\mathbf{Q}})}}. \end{aligned}$$
(5)
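A hedged sketch of Eq. 5 follows. It assumes that the sum in the denominator runs over the first index of the score (so that the scores are normalized per column) and that both sums run over the n words of the comment; the hidden states are arbitrary stand-ins for LSTM outputs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gil_scores(h_fwd, h_bwd):
    """Eq. 5 under the stated reading: column-normalized scores summed per row."""
    n = len(h_fwd)
    # e[t][tp] = exp(tanh(h_fwd_t . h_fwd_tp + h_bwd_t . h_bwd_tp))
    e = [[math.exp(math.tanh(dot(h_fwd[t], h_fwd[tp]) + dot(h_bwd[t], h_bwd[tp])))
          for tp in range(n)] for t in range(n)]
    # Normalizer for each column tp: sum over the first index.
    col_sums = [sum(e[t][tp] for t in range(n)) for tp in range(n)]
    return [sum(e[t][tp] / col_sums[tp] for tp in range(n)) for t in range(n)]


# Arbitrary stand-ins for the LSTM outputs of a three-word comment.
h_fwd = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_bwd = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

alpha = gil_scores(h_fwd, h_bwd)
print(alpha)
```

Under this reading every \(\alpha_t^{\mathbf{Q}}\) is positive and the scores of one comment sum to n, so a word with a score above 1 is weighted more heavily than average.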

2.1.4 WCSL

Using the WOSL, SSL, and GIL, this layer represents word-level contextual sentiment representations \(\{ {g}_t^{\mathbf{Q}} \}_{t = 1}^{n}\):

$$\begin{aligned} {g}_t^{\mathbf{Q}} := {p}_t^{\mathbf{Q}} \cdot s_t^{\mathbf{Q}} \cdot \alpha ^\mathbf{Q}_{t} \end{aligned}$$
(6)
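As a worked instance of Eq. 6, with purely illustrative values that are not taken from the experiments, suppose that for “good” in “not good” the layers give \(p_t^{\mathbf{Q}} = 1.5\), \(s_t^{\mathbf{Q}} = -0.6\), and \(\alpha_t^{\mathbf{Q}} = 1.2\). Then

$$\begin{aligned} {g}_t^{\mathbf{Q}} = 1.5 \cdot (-0.6) \cdot 1.2 = -1.08, \end{aligned}$$

so the originally positive term receives a negative contextual score because its shift score is negative, and the magnitude is scaled by its global importance.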

2.1.5 CCSL

This layer converts \(\{ {g}_t^{\mathbf{Q}} \}_{t = 1}^{n}\) into the concept-level contextual sentiment representations \(\{ {\varvec{v}}_t^{\mathbf{Q}} \}_{t = 1}^{n}\):

$$\begin{aligned} {\varvec{v}}_t^{\mathbf{Q}} = {g}_t^{\mathbf{Q}} {\varvec{b}}_t^{\mathbf{Q}} \end{aligned}$$
(7)

where \({\varvec{b}}^{\mathbf{Q}}_t := \max (\mathrm{Softmax}({\varvec{W}}_c {\varvec{e}}_t^{\mathbf{Q}} - t_c), 0)\), \({\varvec{v}}_t^{\mathbf{Q}} \in \mathbb {R}^{K}\), \({\varvec{b}}^{\mathbf{Q}}_t \in \mathbb {R}^{K}\), \(t_{c} > 0\) is a hyper-parameter value, and \({\varvec{W}}_c \in \mathbb {R}^{K \times e}\) is the matrix of centroid vectors of \(\{ {\varvec{w}}_i^{em}\}_{i = 1}^{v}\) calculated using a spherical k-means method [19] with K clusters. Here, the k-th element of \({\varvec{b}}_t^{\mathbf{Q}}\) represents the cluster weight of word \(w_t^{\mathbf{Q}}\) for cluster k. Therefore, from the values in the CCSL, we can grasp the concept-level contextual sentiment scores.
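The sketch below illustrates Eq. 7. It assumes that the threshold \(t_c\) is subtracted after the softmax (subtracting a scalar inside a softmax would leave its output unchanged), so that clusters with weight below \(t_c\) are zeroed out; this reading, the centroid values, and the embedding are assumptions for illustration only.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [v / total for v in exps]


def ccsl(g_t, e_t, W_c, t_c):
    """Eq. 7: v_t = g_t * b_t, with b_t = max(softmax(W_c e_t) - t_c, 0) (assumed reading)."""
    weights = softmax([dot(centroid, e_t) for centroid in W_c])
    b_t = [max(w - t_c, 0.0) for w in weights]
    return [g_t * b for b in b_t]


# K = 3 hypothetical centroids in a 2-dimensional embedding space.
W_c = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
e_t = [1.0, 0.0]                       # embedding of the current word
v_t = ccsl(-1.08, e_t, W_c, t_c=1.0 / 3)
print(v_t)
```

With \(t_c = 1/K\), as used in the experimental settings, only clusters whose softmax weight exceeds the uniform level keep a nonzero share of the word's contextual sentiment.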

2.1.6 Output

Finally, this NN converts \(\{ {\varvec{v}}_t^{\mathbf{Q}} \}_{t = 1}^{n}\) into a predicted sentiment tag \(y^{\mathbf{Q}} \in \{0 (\mathrm{negative}), 1 (\mathrm{positive}) \}\):

$$\begin{aligned} {\varvec{a}}^{\mathbf{Q}}= \, & {} \mathrm{Softmax}\left({\varvec{W}}^{O} \tanh \left( \sum _{t = 1}^{n} {\varvec{v}}_t^{\mathbf{Q}} \right) \right),\\ y^{\mathbf{Q}}= \, & {} \mathrm{argmax} {\varvec{a}}^{\mathbf{Q}} \end{aligned}$$

where \({\varvec{W}}^{O} \in \mathbb {R}^{2 \times K}\) is the parameter value.
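A minimal sketch of the output step is given below. The weight matrix `W_O` and the concept-level vectors are hypothetical values; row 0 of `W_O` is taken to score the negative class and row 1 the positive class, matching the tag convention above.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [v / total for v in exps]


def predict(v_seq, W_O):
    """a = Softmax(W_O tanh(sum_t v_t)); y = argmax a."""
    K = len(v_seq[0])
    summed = [sum(v[k] for v in v_seq) for k in range(K)]
    hidden = [math.tanh(x) for x in summed]
    a = softmax([dot(row, hidden) for row in W_O])
    y = max(range(len(a)), key=lambda c: a[c])
    return a, y


W_O = [[-1.0, -1.0], [1.0, 1.0]]       # hypothetical 2 x K weights, K = 2
v_seq = [[-0.4, 0.0], [-0.1, 0.05]]    # concept-level scores of two words
a, y = predict(v_seq, W_O)
print(y)  # 0
```

Here the summed concept scores are dominated by a negative entry, so the predicted tag is 0 (negative).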

2.2 Key Idea in IP learning

In developing the CSNN, realizing interpretability in the WOSL and SSL is especially difficult. Through learning with \(L^\mathbf{Q}\) and Update (defined later), the WCSL learns to represent the corresponding sentiments. However, this learning strategy alone cannot realize interpretability in the WOSL and SSL: when the polarity of \(g_{t}^\mathbf{Q}\) is correctly negative, two cases are possible, namely (1) \(p_{t}^\mathbf{Q} > 0\) and \(s_{t}^\mathbf{Q} < 0\), or (2) \(p_{t}^\mathbf{Q} < 0\) and \(s_{t}^\mathbf{Q} > 0\), and general learning cannot automatically choose the correct one. We assume that this problem can be solved by initially fixing the polarity of \(p_{t}^\mathbf{Q}\) to the correct one for a few words, because this restriction leads to the correct choice between the two cases and therefore to \(s_{t}^\mathbf{Q}\) being learned within the appropriate case. At first, the restriction affects only the restricted words; however, through learning, its effect is assumed to propagate to other, non-restricted terms whose meanings are similar to any of the restricted words. To realize this idea, we utilize Init (defined later) in IP learning.
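The sign ambiguity described above can be made concrete with a small enumeration. The function name and the encoding of polarities as +1 and -1 are illustrative assumptions; the sketch only uses the fact that the contextual score is the product of the original score, the shift score, and a positive importance weight.

```python
from itertools import product


def consistent_sign_cases(g_sign, pinned_p_sign=None):
    """Sign pairs (p, s) for which p * s * alpha has sign g_sign, given alpha > 0."""
    cases = []
    for p_sign, s_sign in product((+1, -1), repeat=2):
        if p_sign * s_sign != g_sign:
            continue  # this pair would give the wrong contextual polarity
        if pinned_p_sign is not None and p_sign != pinned_p_sign:
            continue  # excluded because the original polarity is fixed in advance
        cases.append((p_sign, s_sign))
    return cases


print(consistent_sign_cases(-1))                    # [(1, -1), (-1, 1)]
print(consistent_sign_cases(-1, pinned_p_sign=+1))  # [(1, -1)]
```

Without a constraint, two sign assignments explain a negative contextual score equally well; once the original polarity of the word is fixed, only one remains, which is the effect that the initialization is meant to achieve for dictionary words.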

2.3 Initialization and Propagation (IP) Learning

This section describes the learning strategy of the CSNN. The overall process is described in Algorithm 1, where \({w}^{o}_{i, j}\) is the (i, j) element of \({\varvec{W}}^{O}\), and \(L^\mathbf{Q}\) is the cross entropy between \({\varvec{a}}^{\mathbf{Q}}\) and \(d^\mathbf{Q}\). IP learning utilizes two specific techniques called Update and Init. Update is a strategy for improving the interpretability of the WCSL. Init is a strategy for improving the interpretability of the WOSL and GIL. Using both Update and Init, the interpretability of the SSL is also expected to improve (as theoretically analyzed in Appendix A in the supplementary material).

2.3.1 Update

First, \({\varvec{W}}^{O}\) is updated according to processes 6–7 in Algorithm 1. This makes the WCSL represent the corresponding sentiment scores (Proposition A.3 in Appendix A) without violating the learning process after sufficient iterations (Proposition A.7 in Appendix A).

2.3.2 Init

Then, \({\varvec{W}}^{p}\) is initialized as process 2 in Algorithm 1, where \(PS (w_i)\) is the sentiment score for word \(w_i\) given by the word sentiment dictionary, and \(S^{d}\) is a set of words from the dictionary. Init makes WOSL and SSL represent the corresponding scores in the condition that Update is utilized.
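A minimal sketch of the dictionary-based initialization is shown below. Algorithm 1 is not reproduced in this text, so the treatment of words outside \(S^{d}\) (a constant `default` of 0.0 here) is an assumption, as are the function name `init_wosl` and the toy scores.

```python
def init_wosl(vocab, sentiment_dict, default=0.0):
    """Build an initial W^p: PS(w_i) for words in S^d, `default` otherwise (assumed)."""
    W_p = [default] * len(vocab)
    for word, index in vocab.items():
        if word in sentiment_dict:              # word belongs to S^d
            W_p[index] = sentiment_dict[word]   # PS(w_i) from the dictionary
    return W_p


vocab = {"not": 0, "good": 1, "bad": 2, "market": 3}
S_d = {"good": 1.9, "bad": -2.5}                # hypothetical small dictionary
W_p = init_wosl(vocab, S_d)
print(W_p)  # [0.0, 1.9, -2.5, 0.0]
```

Only the dictionary words start with a fixed polarity; the remaining entries are left to be learned from the review-level tags.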

Through this IP learning, for every word sufficiently similar to any of the words in \(S^d\), the WOSL, SSL, GIL, and WCSL learn to represent the corresponding scores, as theoretically analyzed in Appendix A. After the learning, the CSNN can explain its prediction result using these layers.

Algorithm 1 (IP learning)

3 Pre-experimental Evaluation for IP Learning

This section experimentally tests the explanation ability and predictability of the CSNN and investigates the effect of IP learning on the interpretability of the layers in the CSNN.

3.1 Dataset

3.1.1 Text Corpus

We used the following four textual corpora, including reviews and their sentiment tags, for this evaluation. They were used for developing CSNN.

  (a) EcoRevs I and II. These datasets are composed of comments on current (I) and future (II) economic trends and their positive or negative sentiment tags (Footnote 1).

  (b) Yahoo review. This dataset is composed of comments on stocks and their long (positive) or short (negative) attitude tags, extracted from financial micro-blogs (Footnote 2).

  (c) Sentiment 140. This dataset contains tweets and their positive or negative sentiment tags (Footnote 3).

EcoReviews and Yahoo review were Japanese datasets, and Sentiment 140 was an English dataset. We used them to verify whether the CSNN can be used irrespective of the language or domain. We divided each dataset into the training, validation, and test datasets, as presented in Table 1.

3.1.2 Annotated Dataset

For this evaluation, we prepared the Economy, Yahoo, and message annotated datasets. The Economy annotated dataset has 2200 reviews (1100 positive and 1100 negative) in the test dataset of EcoReviews I. The Yahoo annotated dataset has 1520 reviews (760 positive and 760 negative) in the test dataset of Yahoo reviews. The message annotated dataset has 10258 reviews obtained from the test datasets in SemEval tasks [20, 21]. In these datasets, part of the terms in reviews had word-level contextual sentiment tags and word-level sentiment shift tags.

Word-level contextual sentiment tags indicate whether the word-level contextual sentiments of terms are positive or negative as shown in the following examples.

  (1) In total, we are in a \(bull^{+}\) market.

  (2) This room is not \(clean^{-}\).

  (3) Products in this shop are too \(expensive^{-}\).

Word-level sentiment shift tags indicate whether the sentiments of terms were shifted (1: shifted tags) or not (0: non-shifted tags) as shown in the following examples.

  (1) In total, we are in a \(bull^{(0)}\) market.

  (2) This room is not \(clean^{(1)}\).

  (3) Products in this shop are too \(expensive^{(1)}\).

Moreover, in the message annotated dataset, some phrases in reviews have positive or negative tags for contextual sentiments (phrase-level sentiment tags), as in the following examples.

  (1) In total, we are in a \(\{bull\ market\}^{+}\).

  (2) This room is \(\{not\ clean\}^{-}\).

  (3) Products in this shop are \(\{too\ expensive\}^{-}\).

In addition, a gold global important point (0: not important or 1: important) is assigned to each term of the reviews included in the Economy and Yahoo annotated datasets. This gold global important point indicates whether each term in a review is important (1) or not (0) for deciding the overall positive or negative polarity of the review, as in the following examples.

  (1) \(We^{(0)}\,are^{(0)}\,in^{(0)}\,a^{(0)}\,bull^{(1)}\,market^{(1)}\).

  (2) \(This^{(0)}\,room^{(0)}\,is^{(0)}\,not^{(1)}\,clean^{(1)}\).

These tags were used in evaluating the explanation ability of the CSNN. We used the Economy, Yahoo, and message annotated datasets when developing CSNNs with the EcoReviews, Yahoo reviews, and Sentiment 140, respectively. We only employed tags of terms that were not used in Init and appeared in the training dataset, and only used tags of the phrases that include at least one term involved in the training dataset. Table 2 summarizes the numbers of tags used. See the supplementary material for details.

3.2 CSNN Development Setting

We developed the CSNN using each training and validation dataset in the following settings.

Setting in Init. Init used a part of a Japanese financial word sentiment dictionary (JFWS dict) developed by six financial professionals and the Vader word sentiment dictionary (Vader dict) [5]. These dictionaries contain words and their sentiment scores. After excluding the words with zero sentiment scores from the JFWS dict and those with absolute sentiment scores of less than 1.0 from the Vader dict, we extracted the 200 most frequent words in each training dataset from these dictionaries and used their sentiment scores in Init. To analyze the results when Init used fewer words, we evaluated CSNNs developed with only 50, 100, or 200 words: CSNN (50), CSNN (100), and CSNN (200).
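The word selection described above can be sketched as follows. The function name `select_init_words`, the alphabetical tie-breaking rule, and the toy data are assumptions; the threshold argument covers both cases in the text (any nonzero score for the JFWS dict, absolute score of at least 1.0 for the Vader dict).

```python
from collections import Counter


def select_init_words(dictionary, training_comments, min_abs_score, top_n):
    """Pick the top_n most frequent dictionary words that pass the score filter."""
    counts = Counter(word for comment in training_comments for word in comment)
    candidates = [
        word for word, score in dictionary.items()
        if score != 0 and abs(score) >= min_abs_score and counts[word] > 0
    ]
    # Most frequent first; ties broken alphabetically (an assumed convention).
    candidates.sort(key=lambda word: (-counts[word], word))
    return {word: dictionary[word] for word in candidates[:top_n]}


dictionary = {"good": 1.9, "bad": -2.5, "okay": 0.9, "awful": -2.0}
comments = [["good", "good", "bad"], ["okay", "good", "bad"], ["fine"]]

print(select_init_words(dictionary, comments, min_abs_score=1.0, top_n=2))
# {'good': 1.9, 'bad': -2.5}
```

In this toy run "okay" is dropped by the score threshold and "awful" is dropped because it never appears in the training comments.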

Other settings. We calculated the word embedding matrix \({\varvec{W}}^{em}\) by the skip-gram method (window size = 5) [18] based on each textual dataset. We set the dimensions of the hidden and embedding vectors to 200, epoch to 50 with early stopping, K to [100, 500, 1000], \(t_c\) to 1/K, and mini-batch size to 64. We used stratified sampling [22] to handle imbalanced data, the Adam optimizer [23], and the dropout [24] method (rate = 0.5) for the BiRNNs and CSNNs. We determined the hyper-parameters using the validation data. We used the mean score of five trials for the evaluations in this paper.

Table 1 Dataset organization for text corpus
Table 2 Dataset details for text corpus and annotated data

3.3 Evaluation Metrics in Explanation ability

Evaluation Metric. We evaluated the explanation ability of the CSNN based on the validity in WOSL, SSL, GIL, and WCSL in the following way.

3.3.1 Validity of WOSL

We evaluated the validity of the WOSL based on how accurately the polarity of \(w^{p}_{i}\) agrees with that of word \(w_{i}\), using the economic, Yahoo, and LEX word polarity lists (Footnote 4). These lists include words and their positive or negative polarities. The economic and Yahoo word-polarity lists include Japanese economic terms, and the LEX word-polarity list includes English terms. When training used EcoReview I or II, Yahoo reviews, or Sentiment 140, we utilized the economic, Yahoo, or LEX word polarity list, respectively. Moreover, we used only those terms that appeared in the training dataset but were not used in Init. Table 1 summarizes the number of words used in evaluating the CSNN developed with each dataset.

3.3.2 Validity of SSL

Using the sentiment shift tags in the annotated datasets, we evaluated the validity of the SSL based on how accurately the sentiment shift tag of \(w_t^\mathbf{Q}\) agrees with the polarity of \(s_{t}^{\mathbf{Q}}\) (shifted: \(s_{t}^{\mathbf{Q}} < 0\); non-shifted: \(s_{t}^{\mathbf{Q}} \ge 0\)), following the definition in Sect. 2.1.2.

3.3.3 Validity of GIL

Using the gold word-level global important points in the annotated datasets, we evaluated the validity of the GIL based on whether the values of GIL \(\{\alpha _t^\mathbf{Q}\}_{t=1}^{n}\) and gold word-level global important points were correlated. We used the Pearson correlation coefficient for this evaluation.
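For reference, the Pearson correlation used here can be computed as in the following sketch. The attention scores and gold tags are made-up values for a five-word review, not experimental data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


alpha = [0.6, 0.5, 0.7, 1.6, 1.6]   # hypothetical GIL scores of one review
gold = [0, 0, 0, 1, 1]              # gold global important points
r = pearson(alpha, gold)
print(round(r, 3))
```

A correlation measure suits this layer because the GIL scores are non-negative magnitudes rather than signed polarities, so there is no sign to compare against the binary gold tags.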

3.3.4 Validity of WCSL

Using the word-level or phrase-level contextual sentiment tags in the annotated datasets, we evaluated the validity of the WCSL with regard to whether the values of WCSL in CSNN could accurately assign the word or phrase-level contextual sentiments, that is, whether \({g}_{t}^{\mathbf{Q}}\) was accurately positive (negative) when the contextual word-level sentiment of \({w}_{t}^{\mathbf{Q}}\) was positive (negative) or whether the polarity of the summed scores for terms involved in each phrase accurately presented its sentiment. We used the macro average score between the macro \(F_1\) score for shifted terms and that for non-shifted terms for the evaluation basis. We used this score to test whether each method could accurately correspond to both shifted and non-shifted terms.

In the above, the values for the WOSL, SSL, and WCSL are evaluated using the \(F_1\) score because the range of values for the WOSL and WCSL is \([- \infty , \infty ]\) and the range of values for the SSL is \((-1,1)\). In contrast, the range of values for the GIL is \([0, \infty ]\). Thus, we evaluated the validity of the GIL by the Pearson correlation.
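The WCSL metric described above (the average of the macro \(F_1\) score over shifted terms and that over non-shifted terms) can be sketched as follows. Treating a score of exactly zero as negative, and giving an \(F_1\) of 0 to a class with no true positives, are assumptions of this sketch; the tags and scores are toy values.

```python
def f1(gold, pred, label):
    """F1 score of one class; returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=(+1, -1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1(gold, pred, label) for label in labels) / len(labels)


def wcsl_validity(gold, g_scores, shifted):
    """Average of the macro F1 on shifted terms and on non-shifted terms."""
    pred = [+1 if g > 0 else -1 for g in g_scores]
    per_group = []
    for flag in (True, False):
        idx = [i for i, s in enumerate(shifted) if s == flag]
        per_group.append(macro_f1([gold[i] for i in idx], [pred[i] for i in idx]))
    return sum(per_group) / 2


gold = [-1, +1, -1, +1, -1]                  # gold contextual polarities
g_scores = [-1.08, 0.4, 0.3, 0.9, -0.7]      # hypothetical WCSL values
shifted = [True, True, True, False, False]   # gold sentiment shift tags
score = wcsl_validity(gold, g_scores, shifted)
print(round(score, 4))  # 0.8333
```

Averaging the two groups separately prevents a method that handles only the more frequent non-shifted terms from obtaining a high score.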

Baselines. To evaluate the effect of IP learning, we compared the results of the CSNNs developed with IP learning and those of the following baseline models, namely, \(CSNN^{Base}\), \(CSNN^{NoInit}\), and \(CSNN^{NoUp}\). The structures of these baseline models are the same as the structure of CSNN; however, they are different in the following points:

  • \(CSNN^{Base}\) is developed using the general backpropagation and without Update or Init strategy.

  • \(CSNN^{NoInit}\) (denoted \(CSNN^{Rand}\) in Sect. 4.2.2) is developed with only the Update strategy, that is, without Init.

  • \(CSNN^{NoUp}\) is developed with only Init strategy.

Comparison Method. To evaluate the explanation ability of CSNN, we compared the evaluation result of CSNN with other comparative methods in each layer validity.

  (1) WOSL: This evaluation compared the CSNN with other word-level original sentiment assignment methods, namely, PMI [25], logistic fixed weight model (LFW) [8], sentiment-oriented NN (SONN) [9], and gradient interpretable neural network (GINN) [4].

  (2) SSL: This evaluation compared the CSNN with the baseline and NegRNN methods. In the baseline, we predicted \(w_t^\mathbf{Q}\) as “shifted” if the sentiment of \(d^\mathbf{Q}\) predicted by the RNN and the sentiment tag of \(w_t^\mathbf{Q}\) assigned by the PMI were different, and as “not shifted” otherwise. In NegRNN, we used the RNN that predicts polarity shifts [26], developed with the polarity shifting training data created by the weighed frequency odds method [27].

  (3) GIL: This evaluation compared the CSNN with other word-level important point assignment methods based on RNNs with an attention mechanism: word attention network (ATT) [28], hierarchical attention network (HN-ATT) [28], sentiment and negation neural network (SNNN) [29], and lexicon-based supervised attention (LBSA) [6]. SNNN and LBSA are set up so that the attention weights of terms with a strong word-level original sentiment are strengthened. We used the attention score of each model as the score.

  (4) WCSL: This evaluation compared the CSNN with other word-level sentiment assignment methods: PMI, LFW, SONN, GINN, Grad + a bidirectional LSTM model (RNN) [12], LRP + RNN [30], and IntGrad + RNN [11].

4 Experimental Evaluation for CSNN

4.1 Evaluation Metrics in Predictability

Evaluation Metric. We evaluated the predictability of the CSNN based on whether it can predict the sentiment tags of reviews in each test dataset. We used the macro \(F_1\) score as the evaluation basis.

Comparison Method. We compared the CSNN with the following methods: logistic regression (LR), LFW [8], SONN [9], GINN [4], a bi-LSTM based RNN (RNN), convolutional NN (CNN) [1], ATT [28], HN-ATT [28], SNNN [29], and LBSA [6].

Among the above methods, LR is a linear representation model. LFW, SONN, and GINN are original sentiment interpretable NNs. ATT, HN-ATT, SNNN, and LBSA are NNs with an attention mechanism; in particular, SNNN and LBSA are set up so that the attention weights of terms with a strong word-level original sentiment are strengthened.

4.2 Result

4.2.1 Explanation ability and Predictability

Tables 3, 4, 5 and 6 summarize the results for explanation ability, indicating that the proposed CSNN outperformed the other methods in most cases. Table 7 summarizes the results for predictability, indicating that HN-ATT had greater predictability than the proposed CSNNs in most cases; however, CSNN (200) had greater predictability than LR and some deep NNs such as CNN and SNNN, and had predictability equivalent to that of ATT or LBSA. These results demonstrate that the proposed CSNN has both high explanation ability and high predictability.

4.2.2 Effect of IP Learning

The results of the CSNNs, \(CSNN^{Base}\), \(CSNN^{NoUp}\), and \(CSNN^{Rand}\) for explainability demonstrate the effect of IP learning as follows. The \(CSNN^{Rand}\) outperformed the \(CSNN^{Base}\) in WCSL, indicating that Update promoted the validity of the WCSL, whereas the \(CSNN^{NoUp}\) outperformed the \(CSNN^{Base}\) in WOSL and GIL, indicating that Init promoted the validity of the WOSL and GIL. Consequently, the validity of all the five layers was improved by using both Update and Init, and the CSNNs outperformed the \(CSNN^{Base}\) in all the cases. This is the expected result as described in Sect. C (and Appendix A in the supplementary material).

4.3 Discussion

We now discuss the performance of the CSNN in detail.

4.3.1 Predictability

The good performance of HN-ATT in the predictability evaluation may be due to its use of sentence-level importance, which the CSNN does not consider. Therefore, the performance of the CSNN may improve if a sentence-level importance attention mechanism is added to it. Additionally, it should be noted that the CSNN performed better than the other methods on the Yahoo dataset. A possible reason is that sentiment shift representations in the Yahoo dataset are more general and complex than those in EcoReviews. The CSNN directly strengthens the word-level sentiment score and its sentiment shift. Thus, the CSNN can address the sentiment shift representations in the Yahoo dataset.

4.3.2 Effect of IP Learning

It should be noted that the interpretability for the CSNN has succeeded even when we used only fifty terms for the Init and there has been significant difference for the setting of Init. These results indicate that the number of the required minimum words for the learning was less than fifty and our algorithm was sufficiently practical.

4.3.3 Sentiment Shift Detection Performance in Yahoo Dataset

Sentiment shift representations in the Yahoo dataset are more general and complex than those in EcoReviews. We consider this to be the reason for the better performance of the CSNN. The CSNN directly strengthens the word-level sentiment score and its sentiment shift. Thus, the CSNN can address the sentiment shift representations in the Yahoo dataset.

Table 3 Evaluation for explanation ability in WOSL (Macro \(F_1\) score)
Table 4 Evaluation for explanation ability in SSL (Macro \(F_1\) score)
Table 5 Evaluation for explanation ability in GIL (Pearson correlation)
Table 6 Evaluation for explanation ability in WCSL (Macro \(F_1\) score)
Table 7 \(F_1\) score results for the predictability evaluation

4.4 Text-Visualization Example

This section introduces some examples of text-visualization produced by the CSNN. Figures 4 and 5 show text-visualization examples for a review in Sentiment 140 and a review in Yahoo review, respectively, produced using the CSNN. Users can explain the CSNN’s prediction process based on this type of text-visualization.

In addition, based on the values of the right- and left-oriented sentiment shift representations, we can interpret the sentiment shift processes in the CSNN. Figure 5 shows examples. Based on Fig. 5, we can interpret that “uru (bearish)” is shifted by its right-side terms, and term “aoru (manipulate)” caused a sentiment shift because in the right-oriented sentiment shift representations, the terms to the left side of “aoru (manipulate)” become blue. In the same manner, we can interpret that “great” is shifted by “not” (right-oriented shift layer) in Fig. 4.

Fig. 4 Text-visualization example for an English review in Sentiment 140. The color and depth of terms indicate the polarity (red: \(>0\) and blue: \(<0\)) and scale of word-level sentiments in each layer

Fig. 5 Text-visualization example for a Japanese review in Yahoo Review. The color and depth of terms indicate the polarity (red: \(>0\) and blue: \(<0\)) and scale of word-level sentiments in each layer

5 Related Work

There are many studies addressing the black-box property of deep NNs. One useful family of techniques for explaining the prediction results of NNs comprises methods for interpreting prediction models [10,11,12,13, 31, 32]. These methods calculate the gradient score of each input feature in the prediction and visualize the features that are important to the prediction. The LRP method is one of the state-of-the-art methods. Interpretable NNs [4, 6,7,8,9, 28, 29] are also useful in this respect. In this context, several methods developed a neural network including a layer that represents the word-level original score [4, 8, 9]. Other methods developed a neural network including a layer that represents the word-level global context using the attention mechanism [6, 7, 28, 29]. However, these previous methods do not satisfy our purpose because none of them alone can represent all five types of scores in the explanation, namely, the word-level original sentiment score, word-level sentiment shift score, word-level global important point score, word-level contextual sentiment score, and concept-level contextual sentiment score. In contrast, the proposed CSNN can explain the prediction results using these five types of scores.

Many existing studies explored sentiment shift detection [2, 3, 26, 33, 34]. However, because most of these methods require specific knowledge of sentiment shifts, we cannot always use them in the real world. Unlike these methods, the CSNN can detect sentiment shifts without any specific knowledge on sentiment shifts. Although a method for detecting sentiment shifts without specific knowledge was developed in a previous study [27], the CSNN was better than this method in detecting sentiment shifts. Other studies dealt with assigning original sentiment scores to words using the sentiment tags of documents [8, 9, 25, 35]. The proposed CSNN outperformed them.

6 Conclusion

We proposed a novel NN architecture called CSNN that can explain its prediction process. To realize the explainability of CSNN, we proposed a novel learning strategy called IP learning. We experimentally demonstrated the effectiveness of IP learning for improving the explainability of CSNN. Using real textual datasets, we then experimentally demonstrated that the CSNN had higher predictability than some DNNs and that the explanation provided by the CSNN was sufficiently valid. In the future, we will apply this CSNN to documents pertaining to other domains or languages. Dataset, code, and the supplementary material are available (Footnote 5).