Estimating Group Properties in Online Social Networks with a Classifier

  • Conference paper
  • In: Social Informatics (SocInfo 2018)

Abstract

We consider the problem of obtaining unbiased estimates of group properties in social networks when using a classifier for node labels. Inference for this problem is complicated by two factors: the network is not known and must be crawled, and even high-performance classifiers provide biased estimates of group proportions. We propose and evaluate AdjustedWalk for addressing this problem. This is a three-step procedure which entails: (1) walking the graph starting from an arbitrary node; (2) learning a classifier on the nodes in the walk; and (3) applying a post-hoc adjustment to classification labels. The walk step provides the information necessary to make inferences over the nodes and edges, while the adjustment step corrects for classifier bias in estimating group proportions. This process provides de-biased estimates at the cost of additional variance. We evaluate AdjustedWalk on four tasks: the proportion of nodes belonging to a minority group, the proportion of the minority group among high degree nodes, the proportion of within-group edges, and Coleman’s homophily index. Simulated and empirical graphs show that this procedure performs well compared to optimal baselines in a variety of circumstances, while indicating that variance increases can be large for low-recall classifiers.

The authors thank members of the Social Dynamics Laboratory and anonymous reviewers for their helpful suggestions. The authors were supported while this research was conducted by grants from the U.S. National Science Foundation (SES 1357488), the National Research Foundation of Korea (NRF-2016S1A3A2925033), the Minerva Initiative (FA9550-15-1-0162), and DARPA (NGS2). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.


Notes

  1.

    We also conducted simulations with minority group sizes of 0.35 and 0.5, and ingroup preferences of 0.2 (heterophily) and 0.5. The case we present is on balance the most challenging, although heterophilous graphs can present difficulties as well. We omit these additional cases for brevity, and because homophilous graphs are the case we are most often faced with empirically.

  2.

    The process for estimating the degree distribution for visibility is described in the Appendix.

References

  1. Al Zamal, F., Liu, W., Ruths, D.: Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In: ICWSM, vol. 270 (2012)

  2. Barberá, P.: Less is more? How demographic sample weights can improve public opinion estimates based on Twitter data. Working Paper for NYU (2016)

  3. Ciot, M., Sonderegger, M., Ruths, D.: Gender inference of Twitter users in non-English contexts. In: EMNLP, pp. 1136–1145 (2013)

  4. Coleman, J.S.: Relational analysis: the study of social organizations with survey methods. Hum. Organ. 17(4), 28–36 (1958). https://doi.org/10.17730/humo.17.4.q5604m676260q8n7

  5. Culotta, A., Cutler, J.: Predicting Twitter user demographics using distant supervision from website traffic data. J. Artif. Intell. Res. 55, 389–408 (2016)

  6. Culotta, A., Kumar, N.R., Cutler, J.: Predicting the demographics of Twitter users from website traffic data. In: AAAI, pp. 72–78 (2015)

  7. Ding, Y., Yan, S., Zhang, Y., Dai, W., Dong, L.: Predicting the attributes of social network users using a graph-based machine learning method. Comput. Commun. 73, 3–11 (2016). https://doi.org/10.1016/j.comcom.2015.07.007. http://linkinghub.elsevier.com/retrieve/pii/S0140366415002455

  8. Fang, Q., Sang, J., Xu, C., Hossain, M.: Relational user attribute inference in social media. IEEE Trans. Multimed. 17 (2015). https://doi.org/10.1109/TMM.2015.2430819
  9. Forman, G.: Counting positives accurately despite inaccurate classification. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS, vol. 3720, pp. 564–575. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_55

  10. Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Discov. 17(2), 164–206 (2008). https://doi.org/10.1007/s10618-008-0097-y

  11. Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Soc. Netw. Anal. Min. 6(1) (2016). https://doi.org/10.1007/s13278-016-0327-z

  12. Gile, K.J., Handcock, M.S.: Respondent-driven sampling: an assessment of current methodology. Sociol. Methodol. 40(1), 285–327 (2010). https://doi.org/10.1111/j.1467-9531.2010.01223.x

  13. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: A walk in Facebook: uniform sampling of users in online social networks. arXiv:0906.0060 [physics, stat], May 2009

  14. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: a case study of unbiased sampling of OSNs. In: Proceedings of IEEE INFOCOM (2010). https://doi.org/10.1109/INFCOM.2010.5462078

  15. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Practical recommendations on crawling online social networks. IEEE J. Sel. Areas Commun. 29(9), 1872–1892 (2011). https://doi.org/10.1109/JSAC.2011.111011. http://ieeexplore.ieee.org/document/6027868/

  16. Goel, S., Salganik, M.J.: Respondent-driven sampling as Markov chain Monte Carlo. Stat. Med. 28(17), 2202–2229 (2009). https://doi.org/10.1002/sim.3613. http://www.ncbi.nlm.nih.gov/pubmed/19572381

  17. Gong, N.Z., et al.: Joint link prediction and attribute inference using a social-attribute network. ACM Trans. Intell. Syst. Technol. 5(2), 1–20 (2014). https://doi.org/10.1145/2594455

  18. Heckathorn, D., Jeffri, J.: Finding the beat: using respondent-driven sampling to study jazz musicians. Poetics 28, 307–329 (2001). http://www.respondentdrivensampling.org/reports/Heckathorn.pdf

  19. Heckathorn, D.D.: Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc. Probl. 49(1), 11–34 (2002). https://doi.org/10.1525/sp.2002.49.1.11

  20. Karimi, F., Génois, M., Wagner, C., Singer, P., Strohmaier, M.: Visibility of minorities in social networks. arXiv preprint arXiv:1702.00150 (2017)

  21. Kurant, M., Gjoka, M., Butts, C.T., Markopoulou, A.: Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2011, pp. 281–292. ACM, New York (2011). https://doi.org/10.1145/1993744.1993773

  22. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection, June 2014. http://snap.stanford.edu/data

  23. Liu, A., Ziebart, B.: Robust classification under sample selection bias. In: Advances in Neural Information Processing Systems, pp. 37–45 (2014)
  24. Liu, W., Ruths, D.: What’s in a name? Using first names as features for gender inference in Twitter. In: AAAI Spring Symposium: Analyzing Microtext, vol. 13, p. 01 (2013)

  25. Malmi, E., Weber, I.: You are what apps you use: demographic prediction based on user’s apps. In: ICWSM, pp. 635–638 (2016)

  26. McAllister, M.K., Ianelli, J.N.: Bayesian stock assessment using catch-age data and the sampling-importance resampling algorithm. Can. J. Fish. Aquat. Sci. 54(2), 284–300 (1997)
  27. Messias, J., Vikatos, P., Benevenuto, F.: White, man, and highly followed: gender and race inequalities in Twitter. arXiv preprint arXiv:1706.08619 (2017)

  28. Mohammady, E., Culotta, A.: Using county demographics to infer attributes of Twitter users. In: ACL 2014, p. 7 (2014)

  29. Nguyen, D.P., Gravel, R., Trieschnigg, R.B., Meder, T.: How old do you think I am? A study of language and age in Twitter (2013)

  30. Ramirez-Valles, J., Heckathorn, D.D., Vázquez, R., Diaz, R.M., Campbell, R.T.: From networks to populations: the development and application of respondent-driven sampling among IDUs and Latino gay men. AIDS Behav. 9(4), 387–402 (2005). https://doi.org/10.1007/s10461-005-9012-3
  31. Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes in Twitter, pp. 37–44 (2009)

  32. Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random walks. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC 2010, pp. 390–403. ACM, New York (2010). https://doi.org/10.1145/1879141.1879192

  33. Rocha, L.E.C., Liljeros, F., Holme, P.: Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLOS Comput. Biol. 7(3), e1001109 (2011). https://doi.org/10.1371/journal.pcbi.1001109
  34. Rubin, D.B.: The calculation of posterior distributions by data augmentation: comment: a noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc. 82(398), 543–546 (1987). https://doi.org/10.2307/2289460

  35. Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations using respondent-driven sampling. Sociol. Methodol. 34(1), 193–240 (2004). https://doi.org/10.1017/CBO9781107415324.004

  36. Takac, L., Zabovsky, M.: Data analysis in public social networks. In: International Scientific Conference and International Workshop Present Day Trends of Innovations, Lomza, Poland (2012)
  37. Volkova, S., Bachrach, Y., Armstrong, M., Sharma, V.: Inferring latent user properties from texts published in social media. In: AAAI, pp. 4296–4297 (2015)

  38. Volz, E., Heckathorn, D.D.: Probability based estimation theory for respondent driven sampling. J. Off. Stat. 24(1), 79 (2008)

  39. Wagner, C., Singer, P., Karimi, F., Pfeffer, J., Strohmaier, M.: Sampling from social networks with attributes. In: WWW, pp. 1181–1190 (2017). https://doi.org/10.1145/3038912.3052665

  40. Wang, P., Guo, J., Lan, Y., Xu, J., Cheng, X.: Your cart tells you: inferring demographic attributes from purchase data, pp. 173–182. ACM Press (2016). https://doi.org/10.1145/2835776.2835783

  41. Zadrozny, B.: Learning and evaluating classifiers under sample selection bias. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 114. ACM (2004)

Author information

Corresponding author

Correspondence to George Berry.

7. Appendix

7.1 Variance of Corrected Estimates

Consider the simple case of estimating group proportions with two groups. We obtain a sample from a population with true proportions \(\mathbf{p}\); the classifier yields measured proportions \(\hat{\mathbf{m}}\). Multiplying out Eq. 2 gives expressions for the estimates \(\hat{p}_a\) and \(\hat{p}_b\),

$$\begin{aligned} \hat{p}_a = \frac{\hat{m}_a c_{\hat{b} \mid b} - \hat{m}_b c_{\hat{a} \mid b}}{\det (C)}, \quad \hat{p}_b = \frac{\hat{m}_b c_{\hat{a} \mid a} - \hat{m}_a c_{\hat{b} \mid a}}{\det (C)}. \end{aligned}$$
(6)

Substituting \(\hat{m}_b = 1 - \hat{m}_a\) and using the fact that the columns of C sum to one, the variance of the mean is given by,

$$\begin{aligned} {{\mathrm{Var}}}(E[\hat{p}_a])&= {{\mathrm{Var}}}(\frac{E[\hat{m}_a] - c_{\hat{a} \mid b}}{\det (C)}) \end{aligned}$$
(7)
$$\begin{aligned}&= \frac{1}{\det (C)^2}{{\mathrm{Var}}}(E[\hat{m}_a]), \end{aligned}$$
(8)

where we use the assumption that C is constant to pull it out of the variance expression.

When there is no classification error, \(\det (C) = 1\), and when the classifier guesses randomly (0.5 in every cell), \(\det (C) = 0\) and the variance is undefined. The factor \(1/\det ^2(C)\) provides a clear quantification of the variance increase we expect for group proportions. For instance, if \(C = [0.8, 0.2; 0.2, 0.8]\), then \(\det (C) = 0.6\) and we expect a variance increase of \(1/0.6^2 = 2.78\). If classifier performance improves to \(C = [0.9, 0.1; 0.1, 0.9]\), then \(\det (C) = 0.8\) and the variance increase is \(1/0.8^2 = 1.56\).
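
As a minimal numerical sketch (using NumPy; the misclassification matrix and measured proportions below are hypothetical), the adjustment and its variance cost can be computed directly:

```python
import numpy as np

# Hypothetical misclassification matrix C: column j holds P(classified i | true class j).
C = np.array([[0.8, 0.2],
              [0.2, 0.8]])

# Measured (classifier-output) group proportions m-hat from a sample.
m_hat = np.array([0.55, 0.45])

# Post-hoc adjustment in the style of Eq. 2: p-hat = C^{-1} m-hat.
p_hat = np.linalg.solve(C, m_hat)

# Expected variance inflation relative to an error-free classifier: 1 / det(C)^2.
inflation = 1.0 / np.linalg.det(C) ** 2

print(p_hat)
print(round(inflation, 2))  # → 2.78 for this C
```

Note that `np.linalg.solve` is preferred over explicitly inverting C; the result is the same de-biased proportion vector.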

\({{\mathrm{Var}}}(E[\hat{m}_a])\) comes from the random walking procedure itself and is generally not known in closed form. Two methods for closed-form variance have been proposed [16, 38]. The Volz-Heckathorn [38] estimator is biased but provides reasonable estimates in practice. The Goel and Salganik [16] variance estimator relies on knowing the homophily of the network. Bootstrap resampling methods based on creating “synthetic chains” from the estimated transition matrix between groups have also often been used [19].

Simulations of various RWRW estimators [12] show that factors such as non-equilibrium seed selection, group homophily, and the number of waves from each seed affect both the bias and variance of RWRW estimates. Generally, one long chain provides the best results, rather than many shorter chains. It is easier to sample from lower homophily networks, and equilibrium seed selection (proportional to degree) is useful if one must use relatively short chains. Otherwise, if chains may be long, a burn-in period can be used to simulate equilibrium seed selection.
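
The degree-reweighted estimate underlying RWRW can be sketched as follows (a toy graph and hypothetical labels; the helper name `rwrw_estimate` is ours, and the burn-in mimics equilibrium seed selection as discussed above):

```python
import random

random.seed(0)

# Toy undirected graph as an adjacency dict, with hypothetical node labels.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3, 5], 5: [4]}
label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "a"}

def rwrw_estimate(adj, label, group, start, steps, burn_in=100):
    """Degree-reweighted (Volz-Heckathorn-style) estimate of a group
    proportion from a single simple random walk."""
    node = start
    num = den = 0.0
    for t in range(burn_in + steps):
        if t >= burn_in:                      # discard burn-in steps
            w = 1.0 / len(adj[node])          # inverse-degree importance weight
            den += w
            if label[node] == group:
                num += w
        node = random.choice(adj[node])       # move to a uniform random neighbor
    return num / den

# True proportion of "a" is 4/6 ≈ 0.667; the estimate should be close.
print(rwrw_estimate(adj, label, "a", start=0, steps=5000))
```

A single long chain, as recommended above, lets the burn-in be amortized over many retained steps.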

7.2 Correcting Visibility

While RWRW gives the mean of g over the population of nodes, the distribution of g is often an object of interest. For instance, if we wish to estimate the proportion of minority group members in the top 20% of the degree distribution, we need to estimate the joint distribution of \((g(i), d_i)\) and take nodes in the top 20% of the distribution of \(d_i\).

Fortunately, importance resampling [26, 34] based on the data obtained during an RWRW walk provides a method to do this. A random walk samples node i with degree \(d_i\) with probability \(\pi (i) = d_i / D\), where \(D = \sum _j d_j = N \bar{d}\) is the total degree. Since we want to sample it with probability \(1{\slash }N\) (a uniform distribution over the nodes), we construct an importance weight as the ratio of the desired to the actual sampling probability:

$$\begin{aligned} \frac{1/N}{\pi (i)} = \frac{D}{N d_i} = \frac{\bar{d}}{d_i} = w_i. \end{aligned}$$
(9)

\(w_i\) provides a resampling weight for node i. We then normalize to obtain \(w_i / \sum _j w_j\) and resample the data \((g(i), d_i)\) with these probabilities to approximate draws from the desired uniform distribution \(1{\slash }N\).

An importance resample produces a distribution of \((d_i, g(i))\) which mirrors the distribution in the population. We then sort the resampled nodes by degree \(d_i\) and take the proportion in the top 20% of degree where \(g(i) = b\), or where i is a member of the minority group. In the case with no classification error, this procedure produces an unbiased estimate of the fraction of minority group members in the top 20% of the degree distribution.
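
A short sketch of this resampling step (NumPy; the degrees and labels below are synthetic stand-ins for data collected on a walk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical walk data: degrees d_i and minority-group indicators g(i).
d = rng.integers(1, 50, size=2000).astype(float)
g = rng.random(2000) < 0.3   # True marks minority-group membership (toy labels)

# Importance weights w_i proportional to d-bar / d_i (Eq. 9), normalized.
w = d.mean() / d
w /= w.sum()

# Resample with replacement according to w to approximate a uniform node sample.
idx = rng.choice(len(d), size=len(d), replace=True, p=w)
d_r, g_r = d[idx], g[idx]

# Proportion of minority nodes among the top 20% of the resampled degrees.
top = d_r >= np.quantile(d_r, 0.8)
print(g_r[top].mean())
```

Because these toy labels are independent of degree, the printed proportion should land near the overall minority share of 0.3; on real data the gap between this value and the overall share is exactly the visibility effect of interest.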

With classification error, we need to add an additional step to correct the importance resample. Call \(\hat{m}_b^{\mathcal {I}}(20)\) the measured proportion of group b in the top \(20\%\) of the degree distribution in importance resample \(\mathcal {I}\). Likewise, there is a vector that contains measures for all groups \(\mathbf {\hat{m}}^{\mathcal {I}}(20)\). Then we can use a procedure similar to Eq. 2 to correct the importance resample proportions:

$$\begin{aligned} \mathbf {\hat{p}}^{\mathcal {I}}(20) = C^{-1}\mathbf {\hat{m}}^{\mathcal {I}}(20). \end{aligned}$$
(10)

To see when \(\mathbf {\hat{p}}^{\mathcal {I}}(20)\) is unbiased, repeat the reasoning for estimating the population proportion \(\mathbf {\hat{p}}\) above. This shows that \(\mathbf {\hat{p}}^{\mathcal {I}}(20)\) is unbiased when the importance resample provides an unbiased estimate of \(\mathbf {m}^{\mathcal {I}}(20)\). A similar argument applies for the variance, and the determinant of C may be used to estimate the increase in variance.

7.3 Correcting Edge Proportions

Correcting estimates of ties between groups presents a more substantial challenge than correcting group proportions. Akin to C, there is a dyadic misclassification matrix M which maps the vector of true edge-type proportions \(\mathbf {s} = (s_{aa}, s_{ab}, s_{bb})\) to the vector of measured edge-type proportions \(\mathbf {t} = (t_{aa}, t_{ab}, t_{bb})\),

$$\begin{aligned} M \mathbf {s} = \mathbf {t}, \end{aligned}$$
(11)

where

$$\begin{aligned} M = \begin{bmatrix} c_{\hat{a} \mid a}^2 &{} c_{\hat{a} \mid a} c_{\hat{a} \mid b} &{} c_{\hat{a} \mid b}^2 \\ 2 c_{\hat{a} \mid a} c_{\hat{b} \mid a} &{} c_{\hat{a} \mid a} c_{\hat{b} \mid b} + c_{\hat{a} \mid b} c_{\hat{b} \mid a} &{} 2 c_{\hat{a} \mid b} c_{\hat{b} \mid b} \\ c_{\hat{b} \mid a}^2 &{} c_{\hat{b} \mid a} c_{\hat{b} \mid b} &{} c_{\hat{b} \mid b}^2 \end{bmatrix}, \end{aligned}$$

which implies that we can use a technique similar to Eq. 2 at the dyad level

$$\begin{aligned} \mathbf {s} = M^{-1} \mathbf {t}. \end{aligned}$$
(12)

In practice, we obtain a sample \(\mathbf {\hat{t}}\) rather than \(\mathbf {t}\) for the entire graph, which is then used to estimate true edge proportions \(\mathbf {\hat{s}}\). \(\mathbf {\hat{s}}\) is unbiased when the sampling method employed produces unbiased estimates of \(\mathbf {t}\). If \(B = M^{-1}\), the expectation for \(\hat{s}_{aa}\) is given by

$$\begin{aligned} E[\hat{s}_{aa}] = b_{00}E[\hat{t}_{aa}] + b_{01}E[\hat{t}_{ab}] + b_{02} E[\hat{t}_{bb}]. \end{aligned}$$
(13)

As in the node case, we can expect the variance of \(E[\hat{s}_{aa}]\) and \(E[\hat{s}_a]\) to increase when applying classification bias correction. Simulations below indicate that variance inflation for \(E[\hat{s}_a]\) is larger than for \(E[\hat{p}_a]\). Note that \(\hat{s}_a = 2\hat{s}_{aa} / (2\hat{s}_{aa} + \hat{s}_{ab})\) is unbiased under the same conditions as \(\hat{s}_{aa}\).

If \(B = M^{-1}\), then the variance for \(\hat{s}_{aa}\) is

$$\begin{aligned} {{\mathrm{Var}}}(E[\hat{s}_{aa}]) = b^2_{00}{{\mathrm{Var}}}(E[\hat{t}_{aa}]) + b^2_{01}{{\mathrm{Var}}}(E[\hat{t}_{ab}]) + b^2_{02} {{\mathrm{Var}}}(E[\hat{t}_{bb}]). \end{aligned}$$
(14)

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Berry, G., Sirianni, A., High, N., Kellum, A., Weber, I., Macy, M. (2018). Estimating Group Properties in Online Social Networks with a Classifier. In: Staab, S., Koltsova, O., Ignatov, D. (eds) Social Informatics. SocInfo 2018. Lecture Notes in Computer Science, vol 11185. Springer, Cham. https://doi.org/10.1007/978-3-030-01129-1_5

  • DOI: https://doi.org/10.1007/978-3-030-01129-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01128-4

  • Online ISBN: 978-3-030-01129-1
