Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform


This paper proposes a new aggregated classification scheme aimed at supporting the implementation of semantic text analysis methods in contexts characterized by the presence of rare text categories. The proposed approach starts from the aggregate supervised text classifier developed by Hopkins and King and extends it by relying on rare event sampling methods. In detail, it enables the analyst to enlarge the number of estimated sentiment categories while preserving the estimation accuracy and reducing the working time that an unconditional increase in the size of the training set would require. The approach is applied to study the daily evolution of the web reputation of one of the most recent mega-events held in Europe: Expo Milano. The corpus consists of more than one million tweets, in both Italian and English, discussing the event. The analysis portrays the evolution of the Expo stakeholders’ opinions over time and allows the identification of the main drivers of the Expo reputation. The algorithm will be implemented as a running option in the next release of the R package ReadMe.

Appendix 1

Proof of Theorem 1.


The result of Theorem 1 is proved for a general number of stems K and a general number of categories J. Let S be a multinomial variable taking the possible values \(S_1,\ldots ,S_{2^K}\) and D a multinomial variable taking the possible values \(D_1,\ldots ,D_J\). By definition, A is a \(2^K\times 2^K\) diagonal matrix and B is a \(J\times J\) diagonal matrix. In matrix terms, Eq. (8) can be rewritten as follows:

$$\begin{aligned}&\left[ \begin{array}{ccc} A_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} A_{2^K 2^K} \end{array}\right] \left[ \begin{array}{ccc} P^{RTr}(S=S_1|D=D_1) &{} \dots &{} P^{RTr}(S=S_1|D=D_J) \\ P^{RTr}(S=S_2|D=D_1) &{} \dots &{} P^{RTr}(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P^{RTr}(S=S_{2^K}|D=D_1) &{} \dots &{} P^{RTr}(S=S_{2^K}|D=D_J) \end{array}\right] \left[ \begin{array}{ccc} B_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} B_{J J} \end{array}\right] \\&\quad = \left[ \begin{array}{ccc} P(S=S_1|D=D_1) &{} \dots &{} P(S=S_1|D=D_J) \\ P(S=S_2|D=D_1) &{} \dots &{} P(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P(S=S_{2^K}|D=D_1) &{} \dots &{} P(S=S_{2^K}|D=D_J) \end{array}\right] \\ \end{aligned}$$

Because A and B are diagonal, the equality can be proved component-wise by considering the generic component (ij):

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$

For the sake of simplicity, we omit the (ij) subscripts in the rest of the proof.

Consider the left-hand side of the equality, substitute \(A_{ii}\), and apply Bayes' formula:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(S=S_i|D=D_j)B_{jj}}{P^{RTr}(S=S_i)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)P^{RTr}(S=S_i)B_{jj}}{P^{RTr}(S=S_i)P^{RTr}(D=D_j)} \end{aligned}$$

Substituting \(B_{j j}\) :

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{P^{RTr}(S=S_n|D=D_j)A_{nn}}}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{\dfrac{P^{RTr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}{P^{RTr}(D=D_j)}}} \end{aligned}$$

Using the hypothesis \(P^{RTr}(D|S)=P^{Tr}(D|S)\) and the law of total probability:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{\sum \nolimits _{n=1}^{2^K}P^{Tr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{P^{Tr}(D=D_j)}\\&\quad =P^{Tr}(S=S_i|D=D_j) \end{aligned}$$

By hypothesis:

$$\begin{aligned} P^{Tr}(S=S_i|D=D_j)=P(S=S_i|D=D_j) \end{aligned}$$

So we can write the following:

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P^{Tr}(S=S_i|D=D_j)]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$
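
The identity can also be checked numerically. The following is a minimal sketch in Python with NumPy, not the ReadMe implementation. It assumes the forms of the diagonal entries that are substituted in the proof, namely \(A_{ii}=P^{Tr}(S=S_i)/P^{RTr}(S=S_i)\) and \(B_{jj}=1/\sum _n P^{RTr}(S=S_n|D=D_j)A_{nn}\). The function names, the random test distributions, and the choice K = 3, J = 4 are illustrative assumptions, and the hypothesis \(P^{RTr}(D|S)=P^{Tr}(D|S)\) is imposed by construction.

```python
import numpy as np


def s_given_d(p_s, p_d_given_s):
    """Return the (2^K x J) matrix P(S | D) from P(S) and P(D | S) via Bayes' formula."""
    joint = p_s[:, None] * p_d_given_s               # P(S = S_i, D = D_j)
    return joint / joint.sum(axis=0, keepdims=True)  # divide each column by P(D = D_j)


def correction_matrices(p_rtr_s_given_d, p_tr_s, p_rtr_s):
    """Build the diagonal matrices A and B in the form assumed from the proof."""
    a = p_tr_s / p_rtr_s                                   # A_ii
    b = 1.0 / (a[:, None] * p_rtr_s_given_d).sum(axis=0)   # B_jj
    return np.diag(a), np.diag(b)


# Illustrative sizes (assumptions): K stems give 2^K word-stem profiles, J categories.
K, J = 3, 4
m = 2 ** K
rng = np.random.default_rng(0)

# Shared conditional P(D | S): the hypothesis P^RTr(D|S) = P^Tr(D|S) holds by construction.
p_d_given_s = rng.dirichlet(np.ones(J), size=m)  # shape (m, J), each row sums to 1

# Two different marginals over the profiles, standing in for P^Tr(S) and P^RTr(S).
p_tr_s = rng.dirichlet(np.ones(m))
p_rtr_s = rng.dirichlet(np.ones(m))

p_tr_s_given_d = s_given_d(p_tr_s, p_d_given_s)    # target: P^Tr(S | D)
p_rtr_s_given_d = s_given_d(p_rtr_s, p_d_given_s)  # available: P^RTr(S | D)

A, B = correction_matrices(p_rtr_s_given_d, p_tr_s, p_rtr_s)
recovered = A @ p_rtr_s_given_d @ B

# Each column of the recovered matrix is a distribution and matches P^Tr(S | D).
print(np.allclose(recovered, p_tr_s_given_d))
```

Because A and B are diagonal, the product `A @ p_rtr_s_given_d @ B` only rescales rows and columns, which is why the proof can proceed one component (ij) at a time.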

Appendix 2

List of the opinion and sentiment categories defined for the analysis of the web reputation of Expo Milano (Tables 1, 2).

Table 1 Sentiment analysis: description of the sentiment categories
Table 2 Opinion analysis: description of the positive (negative) opinion categories. Every positive category has a corresponding negative one. Neutral, Off-Topics, and Advertise categories are also estimated in opinion analysis

