Mixtures of biased sentiment analysers

Salter-Townshend, Michael; Murphy, Thomas Brendan

doi:10.1007/s11634-013-0150-6

Mixtures of biased sentiment analysers

Regular Article
Published: 31 August 2013

Volume 8, pages 85–103, (2014)
Cite this article

Advances in Data Analysis and Classification Aims and scope Submit manuscript

Michael Salter-Townshend¹ &
Thomas Brendan Murphy¹

370 Accesses
3 Citations
1 Altmetric
Explore all metrics

Abstract

Modelling bias is an important consideration when dealing with inexpert annotations. We are concerned with training a classifier to perform sentiment analysis on news media articles, some of which have been manually annotated by volunteers. The classifier is trained on the words in the articles and then applied to non-annotated articles. In previous work we found that a joint estimation of the annotator biases and the classifier parameters performed better than estimation of the biases followed by training of the classifier. An important question follows from this result: can the annotators be usefully clustered into either predetermined or data-driven clusters, based on their biases? If so, such a clustering could be used to select, drop or otherwise categorise the annotators in a crowdsourcing task. This paper presents work on fitting a finite mixture model to the annotators’ bias. We develop a model and an algorithm and demonstrate its properties on simulated data. We then demonstrate the clustering that exists in our motivating dataset, namely the analysis of potentially economically relevant news articles from Irish online news sources.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

These words are: “said”, “its”, “year”, “a”, “s”, “to”, “by”, “irish”, “has”, “and”, “that”, “at”, “as”, “an”, “they”, “for”, “of”, “are”, “on”, “not”, “but”, “last”, “this”, “have”, “from”, “was”, “with”, “it”, “the”, “in”, “he”, “would”, “be”, “will”, “is”, “their”, “mr”, “were”, “had”, “which”, “we”, “ireland”, “been”, “his”, “2009”, “per”, “cent”.

Abbreviations

\(K\) :: Number of annotators
\(N\) :: Number of unique word terms in the articles
\(A\) :: Number of articles
\(B\) :: Number of annotator clusters
\(J\) :: Number of article types; 3 in our examples
\({y}\) :: \(A\times J \times K\) binary matrix of annotations
\({w}\) :: \(A\times N\) binary matrix of word term presences
\({\varvec{T}}\) :: \(A\times J\) length binary indicators for the types of the articles
\({\varvec{Z}}\) :: \(K\times B\) length binary indicators for the cluster memberships of the annotators
\(\pi \) :: \(B\times J\times J\) cluster specific matrix of annotation error rates
\(\theta \) :: \(N\times J\) matrix of word term classifier probabilities

References

Aitchison J (1986) The statistical analysis of compositional data. Monographs on statistics and applied probability. Chapman and Hall, London
Book Google Scholar
Biernacki C, Celeux G, Govaert G (1999) An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognit Lett 20(3):267–272
Article MATH Google Scholar
Blitzer J, Dredze M, Pereira F (2007) Biographies, bollywood, boomboxes and blenders: domain adaptation for sentiment classification. In: Carroll JA, van den Bosch A, Zaenen A (eds) Proceedings of the 45th annual meeting of the association for computational linguistics (ACL 2007), pp 187–205
Bollen J, Mao H (2011) Twitter mood as a stock market predictor. Computer 44(10):91–94
Article Google Scholar
Brew A, Greene D, Cunningham P (2010a) The interaction between supervised learning and crowdsourcing. In: NIPS workshop on computational social science and the wisdom of crowds
Brew A, Greene D, Cunningham P (2010b) Using crowdsourcing and active learning to track sentiment in online media. In: Coelho H, Studer R, Wooldridge M (eds) Proceedings of the 19th European conference on artificial intelligence (ECAI 2010), IOS Press, pp 1–11
Celeux G, Soromenho G (1996) An entropy criterion for assessing the number of clusters in a mixture model. J Classif 13(2):195–212
Article MATH MathSciNet Google Scholar
Dawid A, Skene A (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J R Stat Soc Ser C (Appl Stat) 28(1):20–28
Google Scholar
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
MATH MathSciNet Google Scholar
Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29:103–130
Article MATH Google Scholar
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631
Article MATH MathSciNet Google Scholar
Govaert G, Nadif M (2003) Clustering with block mixture models. Pattern Recognit 36(2):463–473
Article Google Scholar
Govaert G, Nadif M (2005) An EM algorithm for the block mixture model. IEEE Trans Pattern Anal Mach Intell 27(4):643–647
Article Google Scholar
Govaert G, Nadif M (2008) Block clustering with Bernoulli mixture models: comparison of different approaches. Comput Stat Data Anal 52(6):3233–3245
Article MATH MathSciNet Google Scholar
Hand DJ, Yu K (2001) Idiot’s Bayes—not so stupid after all? Int Stat Rev 69(3):385–398
MATH Google Scholar
Hathaway R (1986) Another interpretation of the EM algorithm for mixture distributions. Stat Probab Lett 4(2):53–56
Article MATH MathSciNet Google Scholar
Kass R, Raftery AE (1995) Bayes factors and model uncertainty. J Am Stat Assoc 90:773–795
Article MATH Google Scholar
Keribin C (2000) Consistent estimation of the order of mixture models. Sankhyā Ser A 62(1):49–66
MATH MathSciNet Google Scholar
Leroux BG (1992) Consistent estimation of a mixing distribution. Ann Stat 20:1350–1360
Article MATH MathSciNet Google Scholar
Long B, Zhang Z, Yu P (2007) A probabilistic framework for relational clustering. In: Berkhin P, Caruana R, Wu X (eds) Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2007), ACM, pp 470–479
McLachlan G, Peel D (2000) Finite mixture models. Wiley-Interscience, Hoboken
McNicholas PD, Murphy TB (2008) Parsimonious Gaussian mixture models. Stat Comput 18(3):285–296
Article MathSciNet Google Scholar
Miller RG (1974) The jackknife—a review. Biometrika 61(1):1–15
MATH MathSciNet Google Scholar
Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Scott D, Daelemans W, Walker MA (eds) Proceedings of the 42nd annual meeting on association for computational linguistics (ACL 2004), pp 271–278
Quenouille M (1949) Approximate tests of correlation in time-series. J R Stat Soc Ser B (Methodol) 11(1):68–84
MATH MathSciNet Google Scholar
Raykar V, Yu S, Zhao L, Valadez G, Florin C, Bogoni L, Moy L (2010) Learning from crowds. J Mach Learn Res 11:1297–1322
MathSciNet Google Scholar
Raykar VC, Yu S (2012) Eliminating spammers and ranking annotators for crowdsourced labeling tasks. J Mach Learn Res 13:491–518
MathSciNet Google Scholar
Salter-Townshend M, Murphy T (2012) Sentiment analysis of online media. In: Lausen B, van del Poel D, Ultsch A (eds) Algorithms from and for nature and life. Studies in classification, data analysis, and knowledge organization. Springer, Berlin
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Article MATH Google Scholar
Shan H, Banerjee A (2008) Bayesian co-clustering. In: Proceedings of the 8th IEEE international conference on data mining (ICDM 2008), IEEE, pp 530–539
Thomas M, Pang B, Lee L (2006) Get out the vote: determining support or opposition from congressional floor-debate transcripts. In: Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP 2006), pp 327–335
Tukey JW (1958) Bias and confidence in not quite large samples. Ann Math Stat 29(2):614
Article Google Scholar

Download references

Author information

Authors and Affiliations

School of Mathematical Sciences and Complex and Adaptive Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland
Michael Salter-Townshend & Thomas Brendan Murphy

Authors

Michael Salter-Townshend
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Brendan Murphy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Michael Salter-Townshend.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Salter-Townshend, M., Murphy, T.B. Mixtures of biased sentiment analysers. Adv Data Anal Classif 8, 85–103 (2014). https://doi.org/10.1007/s11634-013-0150-6

Download citation

Received: 07 December 2012
Revised: 07 August 2013
Accepted: 22 August 2013
Published: 31 August 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s11634-013-0150-6

Keywords

Mathematics Subject Classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Mixtures of biased sentiment analysers

Abstract

Access this article

Similar content being viewed by others

Ballpark Learning: Estimating Labels from Rough Group Comparisons

Crowd labeling latent Dirichlet allocation

Generative Models for Sentiment Analysis and Opinion Mining

Notes

Abbreviations

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

Mixtures of biased sentiment analysers

Abstract

Access this article

Similar content being viewed by others

Ballpark Learning: Estimating Labels from Rough Group Comparisons

Crowd labeling latent Dirichlet allocation

Generative Models for Sentiment Analysis and Opinion Mining

Notes

Abbreviations

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation