Estimating Group Properties in Online Social Networks with a Classifier

  • George Berry
  • Antonio Sirianni
  • Nathan High
  • Agrippa Kellum
  • Ingmar Weber
  • Michael Macy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11185)

Abstract

We consider the problem of obtaining unbiased estimates of group properties in social networks when using a classifier for node labels. Inference for this problem is complicated by two factors: the network is not known and must be crawled, and even high-performance classifiers provide biased estimates of group proportions. We propose and evaluate AdjustedWalk for addressing this problem. AdjustedWalk is a three-step procedure: (1) walk the graph starting from an arbitrary node; (2) learn a classifier on the nodes in the walk; and (3) apply a post-hoc adjustment to the classification labels. The walk step provides the information necessary to make inferences over the nodes and edges, while the adjustment step corrects for classifier bias in estimating group proportions. This process provides de-biased estimates at the cost of additional variance. We evaluate AdjustedWalk on four tasks: estimating the proportion of nodes belonging to a minority group, the proportion of the minority group among high-degree nodes, the proportion of within-group edges, and Coleman’s homophily index. Results on simulated and empirical graphs show that this procedure performs well compared to optimal baselines in a variety of circumstances, while indicating that variance increases can be large for low-recall classifiers.
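
The walk and adjustment steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the exact estimators, so the sketch assumes a simple random walk with inverse-degree re-weighting and an "adjusted count" style correction that inverts the classifier's error rates. The function names (`random_walk`, `reweighted_proportion`, `adjusted_count`) and the inputs `tpr` and `fpr` (true and false positive rates, assumed to be estimated separately from labeled nodes) are hypothetical.

```python
import random


def random_walk(adj, start, steps, rng):
    """Simple random walk: record the current node, then move to a
    uniformly chosen neighbor. `adj` maps node -> list of neighbors."""
    node = start
    visited = []
    for _ in range(steps):
        visited.append(node)
        node = rng.choice(adj[node])
    return visited


def reweighted_proportion(walk, degree, predicted):
    """Share of predicted-positive nodes among the walk's visits.

    Assumption: a simple random walk visits nodes roughly in proportion
    to their degree, so each visit is weighted by 1 / degree.
    `predicted` maps node -> 0 or 1 (the classifier's label).
    """
    numerator = sum(predicted[v] / degree[v] for v in walk)
    denominator = sum(1.0 / degree[v] for v in walk)
    return numerator / denominator


def adjusted_count(observed, tpr, fpr):
    """Correct an observed positive rate for classifier error.

    Solves observed = tpr * p + fpr * (1 - p) for the true proportion p,
    then clips the result to [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("adjustment requires tpr > fpr")
    p = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    # Toy path graph 0 - 1 - 2; node 0 is labeled as the minority group.
    adj = {0: [1], 1: [0, 2], 2: [1]}
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    predicted = {0: 1, 1: 0, 2: 0}
    walk = random_walk(adj, start=0, steps=1000, rng=random.Random(0))
    observed = reweighted_proportion(walk, degree, predicted)
    print(adjusted_count(observed, tpr=0.8, fpr=0.1))
```

Under these assumptions, the correction divides by `tpr - fpr`, so a classifier with low recall yields a small denominator and amplifies sampling noise in the observed rate. This is one way to see why de-biasing can come at the cost of additional variance.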

Keywords

Classification error; Quantification learning; Network sampling; Digital demography

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Cornell University, Ithaca, USA
  2. Qatar Computing Research Institute, Doha, Qatar
