Learning from noisy label proportions for classifying online social data

Original Article


Inferring latent attributes (e.g., demographics) of social media users is important for improving the accuracy and validity of social media analysis methods. While most existing approaches use either heuristics or supervised classification, recent work has shown that accurate classification models can be trained using supervision from population statistics. These learning with label proportions (LLP) models are fit on bags of instances and then applied to individual accounts. However, it is well known that the users of many social media sites, such as Twitter, are not a representative sample of the population; thus, the label proportions contain many sources of noise (e.g., sampling bias), which can in turn degrade the quality of the resulting model. In this paper, we investigate classification algorithms that use population-level statistical constraints such as demographics, names, and social network followers to fit classifiers that predict individual user attributes. We propose LLP methods that explicitly model the noise inherent in these label proportions. On several real and synthetic datasets, we find that combining these enhancements can significantly reduce averaged classification error by 7%, resulting in methods that are robust to noise in the provided label proportions.
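To make the bag-level training idea concrete, the following is a minimal sketch of plain (noise-free) LLP, not the noise-aware methods proposed in the paper. It assumes a logistic regression model, a squared-error loss between each bag's mean predicted probability and its given label proportion, and synthetic one-dimensional data; the function names `fit_llp` and `make_bag`, the loss, and all hyperparameters are illustrative assumptions rather than details taken from the article.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_llp(bags, proportions, n_features, lr=0.5, epochs=500, l2=1e-3):
    """Fit logistic regression from bag-level label proportions only.

    Assumed loss (illustrative): for each bag, the squared difference
    between the mean predicted probability and the given proportion.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for bag, target in zip(bags, proportions):
            probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
                     for x in bag]
            err = sum(probs) / len(bag) - target
            for x, p in zip(bag, probs):
                # Chain rule through the bag mean and the sigmoid.
                g = 2.0 * err * p * (1.0 - p) / len(bag)
                for j, xj in enumerate(x):
                    grad_w[j] += g * xj
                grad_b += g
        # Average the gradient over bags; small L2 penalty on the weights.
        w = [wj - lr * (gj / len(bags) + l2 * wj)
             for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(bags)
    return w, b


def make_bag(rng, size, pos_fraction):
    """Synthetic bag: positives ~ N(2, 1), negatives ~ N(-2, 1)."""
    n_pos = round(size * pos_fraction)
    labels = [1] * n_pos + [0] * (size - n_pos)
    xs = [[rng.gauss(2.0 if y == 1 else -2.0, 1.0)] for y in labels]
    return xs, labels


rng = random.Random(0)
proportions = [0.1, 0.3, 0.5, 0.7, 0.9]
# Training sees only the instances and the bag proportions, never the labels.
bags = [make_bag(rng, 200, p)[0] for p in proportions]
w, b = fit_llp(bags, proportions, n_features=1)

# Apply the bag-trained model to individual held-out instances.
test_x, test_y = make_bag(rng, 1000, 0.5)
pred = [1 if sigmoid(w[0] * x[0] + b) >= 0.5 else 0 for x in test_x]
accuracy = sum(int(p == y) for p, y in zip(pred, test_y)) / len(test_y)
print("weight:", w[0], "bias:", b, "accuracy:", accuracy)
```

In this sketch the proportions are exact; the setting studied in the paper is the harder one in which the supplied proportions are themselves noisy (for example, because of sampling bias), which a loss of this form does not account for.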


Keywords: Social networks, Text classification, Machine learning



We thank the anonymous reviewers for helpful feedback. This research was funded in part by the National Science Foundation under Grants #IIS-1526674 and #IIS-1618244. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.



Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2017

Authors and Affiliations

  1. Department of Computer Science, Illinois Institute of Technology, Chicago, USA
