Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform


This paper proposes a new aggregated classification scheme that supports semantic text analysis in contexts characterized by rare text categories. The approach starts from the aggregate supervised text classifier developed by Hopkins and King and extends it with rare-event sampling methods. It allows the analyst to increase the number of estimated sentiment categories while preserving estimation accuracy and reducing the working time that would otherwise be needed to enlarge the training set unconditionally. The approach is applied to study the daily evolution of the web reputation of one of the most recent mega-events held in Europe: Expo Milano. The corpus consists of more than one million tweets, in both Italian and English, discussing the event. The analysis portrays the evolution of the Expo stakeholders’ opinions over time and identifies the main drivers of the Expo reputation. The algorithm will be implemented as a running option in the next release of the R package ReadMe.
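To make the aggregate idea concrete, the following minimal sketch illustrates the identity on which the Hopkins and King classifier rests: the distribution of word-stem profiles in the unlabelled corpus is a mixture of the category-specific profile distributions, \(P(S)=\sum _j P(S|D=D_j)P(D=D_j)\), so the category shares can be estimated directly without classifying single documents. The sketch is restricted to two categories and uses plain least squares; the function name and all numbers are hypothetical, and it is neither the ReadMe implementation nor the rare-category extension proposed in the paper.

```python
def estimate_rare_share(p_s_test, p_s_given_rare, p_s_given_common):
    """Least-squares estimate of P(D = rare) from word-profile frequencies.

    Solves P(S) = p * P(S | rare) + (1 - p) * P(S | common) for p in [0, 1].
    """
    diff = [r - c for r, c in zip(p_s_given_rare, p_s_given_common)]
    resid = [s - c for s, c in zip(p_s_test, p_s_given_common)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("the two categories have identical word profiles")
    p = sum(x * d for x, d in zip(resid, diff)) / denom
    # With a single free parameter, clipping the unconstrained minimiser
    # to [0, 1] gives the constrained least-squares solution.
    return min(1.0, max(0.0, p))


# Hypothetical P(S | D) over four word-stem profiles, as they would be
# estimated on a hand-coded training set.
P_S_GIVEN_RARE = [0.50, 0.30, 0.15, 0.05]
P_S_GIVEN_COMMON = [0.10, 0.20, 0.30, 0.40]

# Hypothetical profile frequencies in the unlabelled corpus, generated
# here with a rare category that covers 3% of the documents.
TRUE_SHARE = 0.03
P_S_TEST = [TRUE_SHARE * r + (1 - TRUE_SHARE) * c
            for r, c in zip(P_S_GIVEN_RARE, P_S_GIVEN_COMMON)]

estimate = estimate_rare_share(P_S_TEST, P_S_GIVEN_RARE, P_S_GIVEN_COMMON)
print(f"estimated rare-category share: {estimate:.4f}")
```

In this toy setting the estimate depends entirely on how well \(P(S|D)\) is known for the rare category, which is why a training set containing very few rare documents is problematic.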


  1. Agosti M, Bacchin M, Ferro N, Melucci M (2002) Improving the automatic retrieval of text documents. In: Workshop of the cross-language evaluation forum for European Languages. Springer, pp 279–290

  2. Aprosio AP, Moretti G (2016) Italy goes to stanford: a collection of corenlp modules for italian. arXiv preprint arXiv:1609.06204

  3. Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Ling 22(1):39–71

  4. Blei DM (2012) Probabilistic topic models. Commun ACM 55(4):77–84

  5. Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35

  6. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022

  7. Bouchet-Valat M (2014) SnowballC: snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1

  8. Breiman L, Friedman J, Stone C, Olshen R (1984) Classification and regression trees. The Wadsworth and Brooks-Cole statistics-probability series. Chapman & Hall, New York

  9. Breslow NE (1996) Statistics in epidemiology: the case–control study. J Am Stat Assoc 91(433):14–28

  10. Ceron A, Curini L, Iacus SM (2015) Using social media to forecast electoral results: a review of state-of-the-art. Stat Appl Ital J Appl Stat 25(3):239–261

  11. Ceron A, Curini L, Iacus SM (2016) isa: a fast, scalable and accurate algorithm for sentiment analysis of social media content. Inf Sci 367:105–124

  12. Chen H, Chiang RH, Storey VC (2012) Business intelligence and analytics: from big data to big impact. MIS Q 36(4):1165–1188

  13. Choi D, Kim P (2013) Sentiment analysis for tracking breaking events: a case study on twitter. Asian conference on intelligent information and database systems. Springer, Berlin, pp 285–294

  14. Corallo A, Fortunato L, Matera M, Alessi M, Camillò A, Chetta V, Giangreco E, Storelli D (2015) Sentiment analysis for government: an optimized approach. In: Perner P (ed) Machine learning and data mining in pattern recognition. Springer, Cham, pp 98–112

  15. da Silva NF, Hruschka ER, Hruschka ER (2014) Tweet sentiment analysis with classifier ensembles. Decis Support Syst 66:170–179

  16. Das SR, Chen MY (2007) Yahoo! for Amazon: sentiment extraction from small talk on the web. Manag Sci 53(9):1375–1388

  17. Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th international conference on World Wide Web. ACM, New York, WWW ’03, pp 519–528

  18. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Science 41(6):391–407

  19. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York

  20. Erosheva E, Fienberg S, Lafferty J (2004) Mixed-membership models of scientific publications. Proc Natl Acad Sci 101(suppl 1):5220–5227

  21. ExpoMilano (2015) Expo Milano 2015: La sfida dell’italia per un’esplosione universale innovativa.

  22. Feinerer I, Hornik K (2017) tm: Text Mining Package. R package version 0.7-3

  23. Gentry J (2015) twitteR: R based Twitter Client. R package version 1.1.9

  24. Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. Nature 1(12):1–6

  25. Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21(3):267–297

  26. Hand DJ (2006) Classifier technology and the illusion of progress. Stat Sci 21(1):1–14

  27. Hopkins DJ, King G (2010) A method of automated nonparametric content analysis for social science. Am J Polit Sci 54(1):229–247

  28. Hopkins D, King G (2017) ReadMe: software for automated content analysis. R package version 0.99837

  29. Inversini A, Marchiori E, Dedekind C, Cantoni L (2010) Applying a conceptual framework to analyze online reputation of tourism destinations. In: Gretzel U, Law R, Fuchs M (eds) Information and communication technologies in tourism 2010. Springer Vienna, Vienna, pp 321–332

  30. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Nédellec C, Rouveirol C (eds) Machine learning: ECML-98. Springer, Berlin, pp 137–142

  31. King G, Zeng L (2001) Logistic regression in rare events data. Polit Anal 9(2):137–163

  32. Laver M, Benoit K, Garry J (2003) Extracting policy positions from political texts using words as data. Am Polit Sci Rev 97(2):311–331

  33. Liaw A, Wiener M (2015) Classification and regression by randomforest. R Cran Repository R package version 4.6-12

  34. Lowe W (2008) Understanding wordscores. Polit Anal 16(4):356–371

  35. Mahalakshmi S, Sivasankar E (2015) Cross domain sentiment analysis using different machine learning techniques. In: Ravi V, Panigrahi BK, Das S, Suganthan PN (eds) Proceedings of the fifth international conference on fuzzy and neuro computing. Springer, Cham, FANCCO-2015, pp 77–87

  36. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  37. Martin LW, Vanberg G (2008) A robust transformation procedure for interpreting political text. Polit Anal 16(1):93–100

  38. Monroe BL, Maeda K (2004) Talk’s cheap: text-based estimation of rhetorical ideal-points. In: 21st annual meeting of the Society for Political Methodology, pp 29–31

  39. Mudinas A, Zhang D, Levene M (2012) Combining lexicon and learning based approaches for concept-level sentiment analysis. In: Proceedings of the first international workshop on issues of sentiment discovery and opinion mining. ACM, New York, WISDOM ’12, pp 1–8

  40. Mukherjee S, Bhattacharyya P (2013) Sentiment analysis : a literature survey. arXiv preprint arXiv:1304.4520

  41. Müller M (2015) What makes an event a mega-event? Definitions and sizes. Leis Stud 34(6):627–642

  42. Nirmala CR, Roopa GM, Kumar KRN (2015) Twitter data analysis for unemployment crisis. In: 2015 international conference on applied and theoretical computing and communication technology. Davanagere, Karnataka, India. iCATccT, pp 420–423

  43. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retrieval 2(1–2):1–135

  44. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up?: Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on empirical methods in natural language processing, vol 10. Association for Computational Linguistics, Stroudsburg, EMNLP ’02, pp 79–86

  45. Ponzi LJ, Fombrun CJ, Gardberg NA (2011) Reptrak™ pulse: conceptualizing and validating a short-form measure of corporate reputation. Corp Reput Rev 14(1):15–35

  46. Rao Y, Lei J, Wenyin L, Li Q, Chen M (2014a) Building emotional dictionary for sentiment analysis of online news. World Wide Web 17(4):723–742

  47. Rao Y, Li Q, Mao X, Wenyin L (2014b) Sentiment topic models for social emotion mining. Inf Sci 266:90–100

  48. Rayner J (2004) Managing reputational risk: curbing threats, leveraging opportunities. Wiley, New York

  49. Ribeiro FN, Araújo M, Gonçalves P, André Gonçalves M, Benevenuto F (2016) Sentibench—a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci 5(1):23

  50. Roberts ME, Stewart BM, Airoldi EM (2016) A model of text for experimentation in the social sciences. J Am Stat Assoc 111(515):988–1003

  51. Salter-Townshend M, Murphy TB (2014) Mixtures of biased sentiment analysers. Adv Data Anal Classif 8(1):85–103

  52. Slapin JB, Proksch SO (2008) A scaling model for estimating time-series party positions from texts. Am J Polit Sci 52(3):705–722

  53. Solari D, Sciandra A, Rinaldo M, Redaelli M, Finos L (2016) Textwiller: collection of functions for text mining, specially devoted to the Italian language. https://github.com/livioivil/TextWiller

  54. Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21

  55. Stone PJ, Dexter CD, Smith MS, Ogilvie DM (1968) The general inquirer: a computer approach to content analysis. Am J Sociol 73(5):634–635

  56. Taboada M, Brooke J, Tofiloski M, Voll K, Stede M (2011) Lexicon-based methods for sentiment analysis. Comput Ling 37(2):267–307

  57. Tian F, Wu F, Chao KM, Zheng Q, Shah N, Lan T, Yue J (2016) A topic sentence-based instance transfer method for imbalanced sentiment classification of chinese product reviews. Electron Commerce Res Appl 16:66–76

  58. Tripathy A, Agrawal A, Rath SK (2016) Classification of sentiment reviews using n-gram machine learning approach. Expert Syst Appl 57:117–126

  59. Zhao H, Ji X, Zeng Q, Jiang S (2016) A teaching evaluation method based on sentiment classification. Int J Comput Sci Math 7(1):54–62

  60. Zhou Z, Zhang X, Sanderson M (2014) Sentiment analysis on twitter through topic-based lexicon expansion. In: Wang H, Sharaf MA (eds) Databases theory and applications. Springer, Cham, pp 98–109

Author information



Corresponding author

Correspondence to Anna Calissano.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix 1

Proof of Theorem 1.


We prove Theorem 1 for a general number of stems K and a general number of categories J. Let S be a multinomial variable taking the possible values \(S_1,\ldots ,S_{2^K}\) and D a multinomial variable taking the possible values \(D_1,\ldots ,D_J\). By definition, A is a \(2^K\times 2^K\) diagonal matrix and B is a \(J\times J\) diagonal matrix. In matrix form, Eq. (8) can be rewritten as follows:

$$\begin{aligned}&\left[ \begin{array}{ccc} A_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} A_{2^K 2^K} \end{array}\right] \left[ \begin{array}{ccc} P^{RTr}(S=S_1|D=D_1) &{} \dots &{} P^{RTr}(S=S_1|D=D_J) \\ P^{RTr}(S=S_2|D=D_1) &{} \dots &{} P^{RTr}(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P^{RTr}(S=S_{2^K}|D=D_1) &{} \dots &{} P^{RTr}(S=S_{2^K}|D=D_J) \end{array}\right] \left[ \begin{array}{ccc} B_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} B_{J J} \end{array}\right] \\&\quad = \left[ \begin{array}{ccc} P(S=S_1|D=D_1) &{} \dots &{} P(S=S_1|D=D_J) \\ P(S=S_2|D=D_1) &{} \dots &{} P(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P(S=S_{2^K}|D=D_1) &{} \dots &{} P(S=S_{2^K}|D=D_J) \end{array}\right] \\ \end{aligned}$$

Because A and B are diagonal, the equality can be proved component-wise by considering the general component (ij):

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$

For the sake of simplicity, we omit the (ij) subscripts in the rest of the proof.

Consider the left-hand side of the equality, substitute \(A_{ii}=P^{Tr}(S=S_i)/P^{RTr}(S=S_i)\), and apply Bayes’ formula:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(S=S_i|D=D_j)B_{jj}}{P^{RTr}(S=S_i)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)P^{RTr}(S=S_i)B_{jj}}{P^{RTr}(S=S_i)P^{RTr}(D=D_j)} \end{aligned}$$

Substituting \(B_{jj}=1/\sum \nolimits _{n=1}^{2^K}P^{RTr}(S=S_n|D=D_j)A_{nn}\), cancelling \(P^{RTr}(S=S_i)\), and applying Bayes’ formula to each term of the sum:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{P^{RTr}(S=S_n|D=D_j)A_{nn}}}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{\dfrac{P^{RTr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}{P^{RTr}(D=D_j)}}} \end{aligned}$$

Using the hypothesis \(P^{RTr}(D|S)=P^{Tr}(D|S)\) and the law of total probability:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{\sum \nolimits _{n=1}^{2^K}P^{Tr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{P^{Tr}(D=D_j)}\\&\quad =P^{Tr}(S=S_i|D=D_j) \end{aligned}$$

By hypothesis:

$$\begin{aligned} P^{Tr}(S=S_i|D=D_j)=P(S=S_i|D=D_j) \end{aligned}$$

So we can write the following:

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P^{Tr}(S=S_i|D=D_j)]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$
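The identity proved above can also be checked numerically. The sketch below builds a toy example with K = 2 stems (four stem profiles) and J = 3 categories, in which the ordinary training set (Tr) and the rare-event training set (RTr) share the same \(P(D|S)\) but have different marginals \(P(S)\), as the hypothesis of the proof requires. It then applies the diagonal weights \(A_{ii}=P^{Tr}(S=S_i)/P^{RTr}(S=S_i)\) and \(B_{jj}=1/\sum _n P^{RTr}(S=S_n|D=D_j)A_{nn}\), as read off from the substitutions in the proof. All numbers and function names are hypothetical and serve only as an illustration; this is not the authors' code.

```python
def conditional_s_given_d(p_s, p_d_given_s):
    """Return P(S = S_i | D = D_j) as rows i, columns j (Bayes' formula)."""
    n_s, n_d = len(p_s), len(p_d_given_s[0])
    # Law of total probability: P(D_j) = sum_i P(D_j | S_i) P(S_i)
    p_d = [sum(p_s[i] * p_d_given_s[i][j] for i in range(n_s))
           for j in range(n_d)]
    return [[p_s[i] * p_d_given_s[i][j] / p_d[j] for j in range(n_d)]
            for i in range(n_s)]


def reweight(p_s_given_d_rtr, p_s_tr, p_s_rtr):
    """Compute A * P^RTr(S | D) * B with diagonal matrices A and B."""
    n_s, n_d = len(p_s_tr), len(p_s_given_d_rtr[0])
    a = [p_s_tr[i] / p_s_rtr[i] for i in range(n_s)]
    b = [1.0 / sum(p_s_given_d_rtr[n][j] * a[n] for n in range(n_s))
         for j in range(n_d)]
    return [[a[i] * p_s_given_d_rtr[i][j] * b[j] for j in range(n_d)]
            for i in range(n_s)]


# Hypothetical P(D | S), shared by Tr and RTr (the hypothesis of the proof).
# The third category is rare under the ordinary training distribution.
P_D_GIVEN_S = [
    [0.90, 0.08, 0.02],
    [0.60, 0.30, 0.10],
    [0.30, 0.50, 0.20],
    [0.10, 0.30, 0.60],
]
P_S_TR = [0.70, 0.15, 0.10, 0.05]    # stem profiles in the ordinary sample
P_S_RTR = [0.25, 0.25, 0.25, 0.25]   # rare-event sample over-represents S_4

target = conditional_s_given_d(P_S_TR, P_D_GIVEN_S)    # P^Tr(S | D)
biased = conditional_s_given_d(P_S_RTR, P_D_GIVEN_S)   # P^RTr(S | D)
corrected = reweight(biased, P_S_TR, P_S_RTR)

max_gap = max(abs(corrected[i][j] - target[i][j])
              for i in range(len(P_S_TR)) for j in range(len(P_D_GIVEN_S[0])))
print(f"largest entry-wise gap after reweighting: {max_gap:.2e}")
```

Here the uncorrected matrix `biased` differs visibly from `target`, while the reweighted matrix agrees with it up to floating-point error, which is what the component-wise equality states.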

Appendix 2

List of the opinion and sentiment categories defined for the analysis of the web reputation of Expo Milano (Tables 1, 2).

Table 1 Sentiment analysis: description of the sentiment categories
Table 2 Opinion analysis: description of the positive (negative) opinion categories. Every positive category has a corresponding negative one. Neutral, Off-Topics, and Advertise categories are also estimated in opinion analysis

About this article

Cite this article

Calissano, A., Vantini, S. & Arena, M. Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform. Stat Methods Appl 29, 787–812 (2020).


Keywords

  • Classification
  • Sentiment analysis
  • Twitter
  • Expo
  • Web reputation
  • Mega event

Mathematics Subject Classification

  • 62H30
  • 62D99