Estimating Group Properties in Online Social Networks with a Classifier

  • George Berry
  • Antonio Sirianni
  • Nathan High
  • Agrippa Kellum
  • Ingmar Weber
  • Michael Macy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11185)


We consider the problem of obtaining unbiased estimates of group properties in social networks when node labels come from a classifier. Inference is complicated by two factors: the network is not known in advance and must be crawled, and even high-performance classifiers give biased estimates of group proportions. We propose and evaluate AdjustedWalk to address this problem. It is a three-step procedure: (1) walk the graph starting from an arbitrary node; (2) learn a classifier on the nodes in the walk; and (3) apply a post-hoc adjustment to the classification labels. The walk step provides the information needed to make inferences over nodes and edges, while the adjustment step corrects for classifier bias in estimating group proportions. The procedure yields de-biased estimates at the cost of additional variance. We evaluate AdjustedWalk on four tasks: estimating the proportion of nodes belonging to a minority group, the proportion of the minority group among high-degree nodes, the proportion of within-group edges, and Coleman’s homophily index. Results on simulated and empirical graphs show that the procedure performs well compared to optimal baselines in a variety of circumstances, while indicating that the variance increase can be large for low-recall classifiers.
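The walk-then-adjust idea can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a simple random walk with inverse-degree reweighting for the walk step, and the adjusted-count correction from the quantification-learning literature for the post-hoc step. It also assumes the classifier's true and false positive rates are already known (in practice they would be estimated from labeled data). All function and variable names are hypothetical.

```python
import random


def random_walk(adj, start, steps, rng):
    """Simple random walk on an undirected graph given as {node: [neighbors]}.

    Returns the list of visited nodes (repeats allowed), excluding the start.
    """
    node = start
    visited = []
    for _ in range(steps):
        node = rng.choice(adj[node])
        visited.append(node)
    return visited


def reweighted_proportion(walk, adj, predicted):
    """Inverse-degree weighted share of walk nodes predicted as minority (label 1).

    A simple random walk visits nodes in proportion to their degree, so each
    visit is down-weighted by 1 / degree (an assumption of this sketch).
    """
    weights = [1.0 / len(adj[v]) for v in walk]
    hits = sum(w for v, w in zip(walk, weights) if predicted[v] == 1)
    return hits / sum(weights)


def adjusted_count(observed, tpr, fpr):
    """Correct an observed positive rate for classifier error.

    Solves observed = p * tpr + (1 - p) * fpr for p and clips to [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("adjustment requires tpr > fpr")
    return min(1.0, max(0.0, (observed - fpr) / (tpr - fpr)))


if __name__ == "__main__":
    # Toy star graph: node 0 is the hub, nodes 1-3 are leaves.
    adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    predicted = {0: 1, 1: 0, 2: 0, 3: 0}  # hypothetical classifier output
    walk = random_walk(adj, start=0, steps=1000, rng=random.Random(0))
    raw = reweighted_proportion(walk, adj, predicted)
    # tpr and fpr below are made-up values for illustration only.
    print(adjusted_count(raw, tpr=0.9, fpr=0.05))
```

The denominator `tpr - fpr` shrinks as the classifier gets weaker, so small errors in the observed rate are magnified. That fits the abstract's remark that the variance increase can be large for low-recall classifiers.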


Keywords: Classification error · Quantification learning · Network sampling · Digital demography



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Cornell University, Ithaca, USA
  2. Qatar Computing Research Institute, Doha, Qatar
