Abstract
We consider the problem of obtaining unbiased estimates of group properties in social networks when using a classifier for node labels. Inference for this problem is complicated by two factors: the network is not known and must be crawled, and even high-performance classifiers provide biased estimates of group proportions. We propose and evaluate AdjustedWalk for addressing this problem. This is a three-step procedure that entails: (1) walking the graph starting from an arbitrary node; (2) learning a classifier on the nodes in the walk; and (3) applying a post-hoc adjustment to classification labels. The walk step provides the information necessary to make inferences over the nodes and edges, while the adjustment step corrects for classifier bias in estimating group proportions. This process provides de-biased estimates at the cost of additional variance. We evaluate AdjustedWalk on four tasks: estimating the proportion of nodes belonging to a minority group, the proportion of the minority group among high-degree nodes, the proportion of within-group edges, and Coleman’s homophily index. Simulated and empirical graphs show that this procedure performs well compared to optimal baselines in a variety of circumstances, while indicating that variance increases can be large for low-recall classifiers.
The authors thank members of the Social Dynamics Laboratory and anonymous reviewers for their helpful suggestions. The authors were supported while this research was conducted by grants from the U.S. National Science Foundation (SES 1357488), the National Research Foundation of Korea (NRF-2016S1A3A2925033), the Minerva Initiative (FA9550-15-1-0162), and DARPA (NGS2). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Notes
- 1.
We also conducted simulations with minority group sizes of 0.35 and 0.5, and ingroup preferences of 0.2 (heterophily) and 0.5. The case we present is on balance the most challenging, although heterophilous graphs can present difficulties as well. We omit these additional cases for brevity, and because homophilous graphs are the case we are most often faced with empirically.
- 2.
The process for estimating the degree distribution for visibility is described in the Appendix.
References
Al Zamal, F., Liu, W., Ruths, D.: Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In: ICWSM, vol. 270 (2012)
Barberá, P.: Less is more? How demographic sample weights can improve public opinion estimates based on Twitter data. Working Paper for NYU (2016)
Ciot, M., Sonderegger, M., Ruths, D.: Gender inference of Twitter users in non-English contexts. In: EMNLP, pp. 1136–1145 (2013)
Coleman, J.S.: Relational analysis: the study of social organizations with survey methods. Hum. Organ. 17(4), 28–36 (1958). https://doi.org/10.17730/humo.17.4.q5604m676260q8n7
Culotta, A., Cutler, J.: Predicting Twitter user demographics using distant supervision from website traffic data. J. Artif. Intell. Res. 55, 389–408 (2016)
Culotta, A., Kumar, N.R., Cutler, J.: Predicting the demographics of Twitter users from website traffic data. In: AAAI, pp. 72–78 (2015)
Ding, Y., Yan, S., Zhang, Y., Dai, W., Dong, L.: Predicting the attributes of social network users using a graph-based machine learning method. Comput. Commun. 73, 3–11 (2016). https://doi.org/10.1016/j.comcom.2015.07.007. http://linkinghub.elsevier.com/retrieve/pii/S0140366415002455
Fang, Q., Sang, J., Xu, C., Hossain, M.: Relational user attribute inference in social media. IEEE Trans. Multimedia 17 (2015). https://doi.org/10.1109/TMM.2015.2430819
Forman, G.: Counting positives accurately despite inaccurate classification. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS, vol. 3720, pp. 564–575. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_55
Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Discov. 17(2), 164–206 (2008). https://doi.org/10.1007/s10618-008-0097-y
Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Soc. Netw. Anal. Min. 6(1) (2016). https://doi.org/10.1007/s13278-016-0327-z
Gile, K.J., Handcock, M.S.: Respondent-driven sampling: an assessment of current methodology. Sociol. Methodol. 40(1), 285–327 (2010). https://doi.org/10.1111/j.1467-9531.2010.01223.x
Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: A walk in Facebook: uniform sampling of users in online social networks. arXiv:0906.0060 [physics, stat], May 2009
Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: a case study of unbiased sampling of OSNs. In: Proceedings of IEEE INFOCOM (2010). https://doi.org/10.1109/INFCOM.2010.5462078
Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Practical recommendations on crawling online social networks. IEEE J. Sel. Areas Commun. 29(9), 1872–1892 (2011). https://doi.org/10.1109/JSAC.2011.111011. http://ieeexplore.ieee.org/document/6027868/
Goel, S., Salganik, M.J.: Respondent-driven sampling as Markov chain Monte Carlo. Stat. Med. 28(17), 2202–2229 (2009). https://doi.org/10.1002/sim.3613. http://www.ncbi.nlm.nih.gov/pubmed/19572381
Gong, N.Z., et al.: Joint link prediction and attribute inference using a social-attribute network. ACM Trans. Intell. Syst. Technol. 5(2), 1–20 (2014). https://doi.org/10.1145/2594455
Heckathorn, D., Jeffri, J.: Finding the beat: using respondent-driven sampling to study jazz musicians. Poetics 28, 307–329 (2001). http://www.respondentdrivensampling.org/reports/Heckathorn.pdf
Heckathorn, D.D.: Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc. Probl. 49(1), 11–34 (2002). https://doi.org/10.1525/sp.2002.49.1.11
Karimi, F., Génois, M., Wagner, C., Singer, P., Strohmaier, M.: Visibility of minorities in social networks. arXiv preprint arXiv:1702.00150 (2017)
Kurant, M., Gjoka, M., Butts, C.T., Markopoulou, A.: Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2011, pp. 281–292. ACM, New York (2011). https://doi.org/10.1145/1993744.1993773
Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection, June 2014. http://snap.stanford.edu/data
Liu, A., Ziebart, B.: Robust classification under sample selection bias. In: Advances in Neural Information Processing Systems. pp. 37–45 (2014)
Liu, W., Ruths, D.: What’s in a name? Using first names as features for gender inference in Twitter. In: AAAI Spring Symposium: Analyzing Microtext, vol. 13, p. 01 (2013)
Malmi, E., Weber, I.: You are what apps you use: demographic prediction based on user’s apps. In: ICWSM, pp. 635–638 (2016)
McAllister, M.K., Ianelli, J.N.: Bayesian stock assessment using catch-age data and the sampling-importance resampling algorithm. Can. J. Fish. Aquat. Sci. 54(2), 284–300 (1997)
Messias, J., Vikatos, P., Benevenuto, F.: White, man, and highly followed: gender and race inequalities in Twitter. arXiv preprint arXiv:1706.08619 (2017)
Mohammady, E., Culotta, A.: Using county demographics to infer attributes of Twitter users. In: ACL 2014, p. 7 (2014)
Nguyen, D.P., Gravel, R., Trieschnigg, R.B., Meder, T.: How old do you think I am? A study of language and age in Twitter (2013)
Ramirez-Valles, J., Heckathorn, D.D., Vázquez, R., Diaz, R.M., Campbell, R.T.: From networks to populations: the development and application of respondent-driven sampling among IDUs and Latino gay men. AIDS Behav. 9(4), 387–402 (2005). https://doi.org/10.1007/s10461-005-9012-3
Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes in Twitter, pp. 37–44 (2009)
Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random walks. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC 2010, pp. 390–403. ACM, New York (2010). https://doi.org/10.1145/1879141.1879192
Rocha, L.E.C., Liljeros, F., Holme, P.: Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLOS Comput. Biol. 7(3), e1001109 (2011). http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1001109.
Rubin, D.B.: The calculation of posterior distributions by data augmentation: comment: a noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc. 82(398), 543–546 (1987). https://doi.org/10.2307/2289460
Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations using respondent-driven sampling. Sociol. Methodol. 34(1), 193–240 (2004). https://doi.org/10.1017/CBO9781107415324.004
Takac, L., Zabovsky, M.: Data analysis in public social networks. Lomza, Poland (2012)
Volkova, S., Bachrach, Y., Armstrong, M., Sharma, V.: Inferring latent user properties from texts published in social media. In: AAAI, pp. 4296–4297 (2015)
Volz, E., Heckathorn, D.D.: Probability based estimation theory for respondent driven sampling. J. Off. Stat. 24(1), 79 (2008)
Wagner, C., Singer, P., Karimi, F., Pfeffer, J., Strohmaier, M.: Sampling from social networks with attributes. In: WWW, pp. 1181–1190 (2017). https://doi.org/10.1145/3038912.3052665
Wang, P., Guo, J., Lan, Y., Xu, J., Cheng, X.: Your cart tells you: inferring demographic attributes from purchase data, pp. 173–182. ACM Press (2016). https://doi.org/10.1145/2835776.2835783
Zadrozny, B.: Learning and evaluating classifiers under sample selection bias. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 114. ACM (2004)
7. Appendix
7.1 Variance of Corrected Estimates
Consider the simple case of classifying group proportions with two groups. We obtain a sample from the population with true proportions \(\hat{p}\) and estimated proportions \(\hat{m}\). Multiplying out Eq. 2 gives expressions for the estimated \(\hat{p}_a\) and \(\hat{p}_b\),
\(\hat{p}_a = \frac{c_{bb}\hat{m}_a - c_{ab}\hat{m}_b}{\det (C)}, \qquad \hat{p}_b = \frac{c_{aa}\hat{m}_b - c_{ba}\hat{m}_a}{\det (C)}.\)
The variance of the mean is given by
\({{\mathrm{Var}}}(E[\hat{p}_a]) = \frac{{{\mathrm{Var}}}(E[\hat{m}_a])}{\det ^2(C)},\)
where we use the assumption that C is constant to pull it out of the variance expression.
When there is no classification error, \(\det (C) = 1\), and when the classifier guesses randomly (.5 in every cell), \(\det (C) = 0\) and the variance is undefined. \(\det ^2(C)\) provides a clear quantification of the variance increase we expect for group proportions. For instance, if \(C = [0.8, 0.2; 0.2, 0.8]\) with \(\det (C) = 0.6\), we expect a variance increase of \(1/0.6^2 = 2.78\). If classifier performance improves to \(C = [0.9, 0.1; 0.1, 0.9]\), the variance increase is \(1/0.8^2 = 1.56\).
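The variance inflation factor can be computed directly from the confusion matrix. A minimal sketch, in which the function name and example matrices are illustrative rather than from the paper:

```python
import numpy as np

def variance_inflation(C):
    """Factor by which Var(p_hat) exceeds Var(m_hat): 1 / det(C)^2.

    C is a column-stochastic confusion matrix: C[i, j] is the probability
    that a node truly in group j is classified into group i.
    """
    d = np.linalg.det(C)
    if np.isclose(d, 0.0):
        # A random-guessing classifier (det(C) = 0) leaves the
        # corrected estimate undefined.
        raise ValueError("det(C) = 0: classifier is uninformative")
    return 1.0 / d**2

# The two confusion matrices discussed above:
C1 = np.array([[0.8, 0.2],
               [0.2, 0.8]])   # det = 0.6, inflation ~ 2.78
C2 = np.array([[0.9, 0.1],
               [0.1, 0.9]])   # det = 0.8, inflation ~ 1.56
print(variance_inflation(C1), variance_inflation(C2))
```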
\({{\mathrm{Var}}}(E[\hat{m}_a])\) comes from the random walking procedure itself and is generally not known in closed form. Two methods for closed-form variance estimation have been proposed [16, 38]. The Volz-Heckathorn [38] estimator is biased but provides reasonable estimates in practice. The Goel and Salganik [16] variance estimator relies on knowing the homophily of the network. Bootstrap resampling methods based on creating “synthetic chains” from the estimated transition matrix between groups have also often been used [19].
Simulations of various RWRW estimators [12] show that factors such as non-equilibrium seed selection, group homophily, and the number of waves from each seed affect both the bias and variance of RWRW estimates. Generally, one long chain provides the best results, rather than many shorter chains. It is easier to sample from lower-homophily networks, and equilibrium seed selection (proportional to degree) is useful if one must use relatively short chains. Otherwise, if chains may be long, a burn-in period can be used to simulate equilibrium seed selection.
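The RWRW point estimate itself is a Hansen-Hurwitz style ratio in which each sampled node is weighted by its inverse degree, undoing the walk's degree-proportional stationary distribution. A minimal sketch with hypothetical data structures (`walk` as a list of node ids, `group_of` and `degree_of` as dicts):

```python
def rwrw_proportion(walk, group_of, degree_of, target):
    """RWRW estimate of the proportion of nodes in group `target`.

    A simple random walk visits node i with probability proportional
    to its degree d_i, so weighting each visit by 1/d_i recovers an
    estimate under the uniform node distribution.
    """
    num = sum(1.0 / degree_of[i] for i in walk if group_of[i] == target)
    den = sum(1.0 / degree_of[i] for i in walk)
    return num / den

# Toy example: a walk over two nodes with degrees 2 and 4.
walk = ["x", "y", "y", "x"]
group_of = {"x": "min", "y": "maj"}
degree_of = {"x": 2, "y": 4}
print(rwrw_proportion(walk, group_of, degree_of, "min"))
```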
7.2 Correcting Visibility
While RWRW gives the mean of g over the population of nodes, the distribution of g is often an object of interest. For instance, if we wish to estimate the proportion of minority group members in the top 20% of the degree distribution, we need to estimate the joint distribution of \((g(i), d_i)\) and take nodes in the top 20% of the distribution of \(d_i\).
Fortunately, importance resampling [26, 34] based on the data obtained during an RWRW walk provides a method to do this. If we know node i with degree \(d_i\) is sampled with probability \(\pi (i)\) and we want to sample it with probability \(1{\slash }N\) (a uniform distribution over the nodes), then we construct an importance weight using the ratio of the desired over the actual distribution,
\(w_i = \frac{1{\slash }N}{\pi (i)}.\)
\(w_i\) provides a resampling weight for node i. We then normalize to \(w_i / \sum _j w_j\) and resample data \((g(i), d_i)\) according to this probability to approximate draws from the desired distribution \(1{\slash }N\).
An importance resample produces a distribution of \((d_i, g(i))\) which mirrors the distribution in the population. We then sort the resampled nodes by degree \(d_i\) and take the proportion in the top 20% of degree where \(g(i) = b\), or where i is a member of the minority group. In the case with no classification error, this procedure produces an unbiased estimate of the fraction of minority group members in the top 20% of the degree distribution.
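The resampling step above can be sketched as follows, assuming walk data stored as arrays and noting that \(\pi (i) \propto d_i\) for a simple random walk, so the weight \((1/N)/\pi (i)\) is proportional to \(1/d_i\) (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_uniform(walk_degrees, walk_groups, n_resample=10_000):
    """Importance-resample walk data toward the uniform node distribution."""
    degrees = np.asarray(walk_degrees, dtype=float)
    groups = np.asarray(walk_groups)
    w = 1.0 / degrees          # weight (1/N)/pi(i) up to a constant
    w /= w.sum()               # normalized resampling probabilities
    idx = rng.choice(len(w), size=n_resample, replace=True, p=w)
    return degrees[idx], groups[idx]

def minority_share_top20(degrees, groups, minority="b"):
    """Fraction of minority nodes among the top 20% of nodes by degree."""
    cutoff = np.quantile(degrees, 0.8)
    top = degrees >= cutoff
    return np.mean(groups[top] == minority)

# Usage: resample the walk, then measure visibility in the resample.
d, g = resample_uniform([2, 4, 4, 8], ["a", "b", "b", "a"])
print(minority_share_top20(d, g))
```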
With classification error, we need to add an additional step to correct the importance resample. Call \(\hat{m}_b^{\mathcal {I}}(20)\) the measured proportion of group b in the top \(20\%\) of the degree distribution in importance resample \(\mathcal {I}\). Likewise, there is a vector that contains measures for all groups, \(\mathbf {\hat{m}}^{\mathcal {I}}(20)\). Then we can use a procedure similar to Eq. 2 to correct the importance resample proportions:
\(\mathbf {\hat{p}}^{\mathcal {I}}(20) = C^{-1}\mathbf {\hat{m}}^{\mathcal {I}}(20).\)
To see when \(\mathbf {\hat{p}}^{\mathcal {I}}(20)\) is unbiased, repeat the reasoning for estimating the population proportion \(\mathbf {\hat{p}}\) above. This shows that \(\mathbf {\hat{p}}^{\mathcal {I}}(20)\) is unbiased when the importance resample provides an unbiased estimate of \(\mathbf {m}^{\mathcal {I}}(20)\). A similar argument applies for the variance, and the determinant of C may be used to estimate the increase in variance.
7.3 Correcting Edge Proportions
Correcting estimates of ties between groups presents a more substantial challenge than correcting group proportions. Akin to C, there is a dyadic misclassification matrix M which maps \(\mathbf {s}\) to \(\mathbf {t}\),
\(\mathbf {t} = M\mathbf {s},\)
where the entries of M give the probability of each observed dyad type conditional on the true dyad type, which implies that we can use a technique similar to Eq. 2 at the dyad level,
\(\mathbf {\hat{s}} = M^{-1}\mathbf {\hat{t}}.\)
In practice, we obtain a sample \(\mathbf {\hat{t}}\) rather than \(\mathbf {t}\) for the entire graph, which is then used to estimate the true edge proportions \(\mathbf {\hat{s}}\). \(\mathbf {\hat{s}}\) is unbiased when the sampling method employed produces unbiased estimates of \(\mathbf {t}\). If \(B = M^{-1}\), the expectation is given by
\(E[\mathbf {\hat{s}}] = B\,E[\mathbf {\hat{t}}].\)
As in the node case, we can expect the variance of \(E[\hat{s}_{aa}]\) and \(E[\hat{s}_a]\) to increase when applying classification bias correction. Simulations below indicate that variance inflation for \(E[\hat{s}_a]\) is larger than for \(E[\hat{p}_a]\). Note that \(\hat{s}_a = 2\hat{s}_{aa} / (2\hat{s}_{aa} + \hat{s}_{ab})\) is unbiased under the same conditions as \(\hat{s}_{aa}\).
If \(B = M^{-1}\) and M is treated as constant, then the variance for \(\hat{s}_{aa}\) is
\({{\mathrm{Var}}}(E[\hat{s}_{aa}]) = \sum _k \sum _l b_{aa,k}\,b_{aa,l}\,{{\mathrm{Cov}}}(E[\hat{t}_k], E[\hat{t}_l]),\)
where k and l range over dyad types.
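As a concrete sketch, assuming (as an additional modeling assumption not stated above) that the two endpoints of an edge are misclassified independently, M over the unordered dyad types (aa, ab, bb) can be built directly from the node-level confusion matrix C and inverted to correct observed dyad proportions:

```python
import numpy as np

def dyad_confusion(C):
    """3x3 dyadic misclassification matrix over dyad types (aa, ab, bb).

    Assumes the two endpoints of an edge are misclassified independently,
    so each entry M[k, l] = Pr(observed type k | true type l) is a product
    of entries of C. Columns sum to 1, mirroring C.
    """
    caa, cab = C[0, 0], C[0, 1]
    cba, cbb = C[1, 0], C[1, 1]
    return np.array([
        [caa**2,    caa * cab,             cab**2],
        [2*caa*cba, caa * cbb + cab * cba, 2*cab*cbb],
        [cba**2,    cba * cbb,             cbb**2],
    ])

C = np.array([[0.8, 0.2],
              [0.2, 0.8]])
M = dyad_confusion(C)
s_true = np.array([0.5, 0.3, 0.2])   # true dyad-type proportions
t_obs = M @ s_true                   # expected observed proportions
s_hat = np.linalg.solve(M, t_obs)    # corrected estimate recovers s_true
print(s_hat)
```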
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Berry, G., Sirianni, A., High, N., Kellum, A., Weber, I., Macy, M. (2018). Estimating Group Properties in Online Social Networks with a Classifier. In: Staab, S., Koltsova, O., Ignatov, D. (eds) Social Informatics. SocInfo 2018. Lecture Notes in Computer Science(), vol 11185. Springer, Cham. https://doi.org/10.1007/978-3-030-01129-1_5
Print ISBN: 978-3-030-01128-4
Online ISBN: 978-3-030-01129-1