Estimating Group Properties in Online Social Networks with a Classifier

  • Conference paper
  • In: Social Informatics (SocInfo 2018)

Abstract

We consider the problem of obtaining unbiased estimates of group properties in social networks when using a classifier for node labels. Inference for this problem is complicated by two factors: the network is not known and must be crawled, and even high-performance classifiers provide biased estimates of group proportions. We propose and evaluate AdjustedWalk for addressing this problem. This is a three-step procedure which entails: (1) walking the graph starting from an arbitrary node; (2) learning a classifier on the nodes in the walk; and (3) applying a post-hoc adjustment to classification labels. The walk step provides the information necessary to make inferences over the nodes and edges, while the adjustment step corrects for classifier bias in estimating group proportions. This process provides de-biased estimates at the cost of additional variance. We evaluate AdjustedWalk on four tasks: the proportion of nodes belonging to a minority group, the proportion of the minority group among high degree nodes, the proportion of within-group edges, and Coleman’s homophily index. Simulated and empirical graphs show that this procedure performs well compared to optimal baselines in a variety of circumstances, while indicating that variance increases can be large for low-recall classifiers.

The authors thank members of the Social Dynamics Laboratory and anonymous reviewers for their helpful suggestions. The authors were supported while this research was conducted by grants from the U.S. National Science Foundation (SES 1357488), the National Research Foundation of Korea (NRF-2016S1A3A2925033), the Minerva Initiative (FA9550-15-1-0162), and DARPA (NGS2). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.


Notes

  1.

    We also conducted simulations with minority group sizes of 0.35 and 0.5, and ingroup preferences of 0.2 (heterophily) and 0.5. The case we present is on balance the most challenging, although heterophilous graphs can present difficulties as well. We omit these additional cases for brevity, and because homophilous graphs are the case we are most often faced with empirically.

  2.

    The process for estimating the degree distribution for visibility is described in the Appendix.

References

  1. Al Zamal, F., Liu, W., Ruths, D.: Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In: ICWSM, vol. 270 (2012)

  2. Barberá, P.: Less is more? How demographic sample weights can improve public opinion estimates based on Twitter data. Working Paper for NYU (2016)

  3. Ciot, M., Sonderegger, M., Ruths, D.: Gender inference of Twitter users in non-English contexts. In: EMNLP, pp. 1136–1145 (2013)

  4. Coleman, J.S.: Relational analysis: the study of social organizations with survey methods. Hum. Organ. 17(4), 28–36 (1958). https://doi.org/10.17730/humo.17.4.q5604m676260q8n7

  5. Culotta, A., Cutler, J.: Predicting Twitter user demographics using distant supervision from website traffic data. J. Artif. Intell. Res. 55, 389–408 (2016)

  6. Culotta, A., Kumar, N.R., Cutler, J.: Predicting the demographics of Twitter users from website traffic data. In: AAAI, pp. 72–78 (2015)

  7. Ding, Y., Yan, S., Zhang, Y., Dai, W., Dong, L.: Predicting the attributes of social network users using a graph-based machine learning method. Comput. Commun. 73, 3–11 (2016). https://doi.org/10.1016/j.comcom.2015.07.007. http://linkinghub.elsevier.com/retrieve/pii/S0140366415002455

  8. Fang, Q., Sang, J., Xu, C., Hossain, M.: Relational user attribute inference in social media. IEEE Trans. Multimed. 17 (2015). https://doi.org/10.1109/TMM.2015.2430819
  9. Forman, G.: Counting positives accurately despite inaccurate classification. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS, vol. 3720, pp. 564–575. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_55

  10. Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Discov. 17(2), 164–206 (2008). https://doi.org/10.1007/s10618-008-0097-y

  11. Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Soc. Netw. Anal. Min. 6(1) (2016). https://doi.org/10.1007/s13278-016-0327-z

  12. Gile, K.J., Handcock, M.S.: Respondent-driven sampling: an assessment of current methodology. Sociol. Methodol. 40(1), 285–327 (2010). https://doi.org/10.1111/j.1467-9531.2010.01223.x

  13. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: A walk in Facebook: uniform sampling of users in online social networks. arXiv:0906.0060 [physics, stat], May 2009

  14. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: a case study of unbiased sampling of OSNs. In: Proceedings of IEEE INFOCOM (2010). https://doi.org/10.1109/INFCOM.2010.5462078

  15. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Practical recommendations on crawling online social networks. IEEE J. Sel. Areas Commun. 29(9), 1872–1892 (2011). https://doi.org/10.1109/JSAC.2011.111011. http://ieeexplore.ieee.org/document/6027868/

  16. Goel, S., Salganik, M.J.: Respondent-driven sampling as Markov chain Monte Carlo. Stat. Med. 28(17), 2202–2229 (2009). https://doi.org/10.1002/sim.3613. http://www.ncbi.nlm.nih.gov/pubmed/19572381

  17. Gong, N.Z., et al.: Joint link prediction and attribute inference using a social-attribute network. ACM Trans. Intell. Syst. Technol. 5(2), 1–20 (2014). https://doi.org/10.1145/2594455

  18. Heckathorn, D., Jeffri, J.: Finding the beat: using respondent-driven sampling to study jazz musicians. Poetics 28, 307–329 (2001). http://www.respondentdrivensampling.org/reports/Heckathorn.pdf

  19. Heckathorn, D.D.: Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc. Probl. 49(1), 11–34 (2002). https://doi.org/10.1525/sp.2002.49.1.11

  20. Karimi, F., Génois, M., Wagner, C., Singer, P., Strohmaier, M.: Visibility of minorities in social networks. arXiv preprint arXiv:1702.00150 (2017)

  21. Kurant, M., Gjoka, M., Butts, C.T., Markopoulou, A.: Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2011, pp. 281–292. ACM, New York (2011). https://doi.org/10.1145/1993744.1993773

  22. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection, June 2014. http://snap.stanford.edu/data

  23. Liu, A., Ziebart, B.: Robust classification under sample selection bias. In: Advances in Neural Information Processing Systems, pp. 37–45 (2014)
  24. Liu, W., Ruths, D.: What’s in a name? Using first names as features for gender inference in Twitter. In: AAAI Spring Symposium: Analyzing Microtext, vol. 13, p. 01 (2013)

  25. Malmi, E., Weber, I.: You are what apps you use: demographic prediction based on user’s apps. In: ICWSM, pp. 635–638 (2016)

  26. McAllister, M.K., Ianelli, J.N.: Bayesian stock assessment using catch-age data and the sampling-importance resampling algorithm. Can. J. Fish. Aquat. Sci. 54(2), 284–300 (1997)
  27. Messias, J., Vikatos, P., Benevenuto, F.: White, man, and highly followed: gender and race inequalities in Twitter. arXiv preprint arXiv:1706.08619 (2017)

  28. Mohammady, E., Culotta, A.: Using county demographics to infer attributes of Twitter users. In: ACL 2014, p. 7 (2014)

  29. Nguyen, D.P., Gravel, R., Trieschnigg, R.B., Meder, T.: How old do you think I am? A study of language and age in Twitter (2013)

  30. Ramirez-Valles, J., Heckathorn, D.D., Vázquez, R., Diaz, R.M., Campbell, R.T.: From networks to populations: the development and application of respondent-driven sampling among IDUs and Latino gay men. AIDS Behav. 9(4), 387–402 (2005). https://doi.org/10.1007/s10461-005-9012-3
  31. Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes in Twitter, pp. 37–44 (2009)

  32. Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random walks. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC 2010, pp. 390–403. ACM, New York (2010). https://doi.org/10.1145/1879141.1879192

  33. Rocha, L.E.C., Liljeros, F., Holme, P.: Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLOS Comput. Biol. 7(3), e1001109 (2011). https://doi.org/10.1371/journal.pcbi.1001109
  34. Rubin, D.B.: The calculation of posterior distributions by data augmentation: comment: a noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc. 82(398), 543–546 (1987). https://doi.org/10.2307/2289460

  35. Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations using respondent-driven sampling. Sociol. Methodol. 34(1), 193–240 (2004). https://doi.org/10.1017/CBO9781107415324.004

  36. Takac, L., Zabovsky, M.: Data analysis in public social networks. In: International Scientific Conference and International Workshop Present Day Trends of Innovations, Lomza, Poland (2012)
  37. Volkova, S., Bachrach, Y., Armstrong, M., Sharma, V.: Inferring latent user properties from texts published in social media. In: AAAI, pp. 4296–4297 (2015)

  38. Volz, E., Heckathorn, D.D.: Probability based estimation theory for respondent driven sampling. J. Off. Stat. 24(1), 79 (2008)

  39. Wagner, C., Singer, P., Karimi, F., Pfeffer, J., Strohmaier, M.: Sampling from social networks with attributes. In: WWW, pp. 1181–1190 (2017). https://doi.org/10.1145/3038912.3052665

  40. Wang, P., Guo, J., Lan, Y., Xu, J., Cheng, X.: Your cart tells you: inferring demographic attributes from purchase data, pp. 173–182. ACM Press (2016). https://doi.org/10.1145/2835776.2835783

  41. Zadrozny, B.: Learning and evaluating classifiers under sample selection bias. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 114. ACM (2004)

Author information

Corresponding author

Correspondence to George Berry.

7. Appendix

7.1 Variance of Corrected Estimates

Consider the simple case of estimating group proportions with two groups. We obtain a sample from a population with true proportions \(\mathbf{p}\); the classifier yields measured proportions \(\hat{\mathbf{m}}\). Multiplying out Eq. 2 gives expressions for the estimates \(\hat{p}_a\) and \(\hat{p}_b\),

$$\begin{aligned} \hat{p}_a = \frac{\hat{m}_a c_{\hat{b} \mid b} - \hat{m}_b c_{\hat{a} \mid b}}{\det (C)}, \quad \hat{p}_b = \frac{\hat{m}_b c_{\hat{a} \mid a} - \hat{m}_a c_{\hat{b} \mid a}}{\det (C)}. \end{aligned}$$
(6)

Substituting \(\hat{m}_b = 1 - \hat{m}_a\) and using the fact that the columns of C sum to one, the variance of the mean is given by,

$$\begin{aligned} {{\mathrm{Var}}}(E[\hat{p}_a])&= {{\mathrm{Var}}}(\frac{E[\hat{m}_a] - c_{\hat{a} \mid b}}{\det (C)}) \end{aligned}$$
(7)
$$\begin{aligned}&= \frac{1}{\det (C)^2}{{\mathrm{Var}}}(E[\hat{m}_a]), \end{aligned}$$
(8)

where we use the assumption that C is constant to pull it out of the variance expression.

When there is no classification error, \(\det (C) = 1\), and when the classifier guesses randomly (0.5 in every cell), \(\det (C) = 0\) and the variance is undefined. The factor \(1/\det ^2(C)\) provides a clear quantification of the variance increase we expect for group proportions. For instance, if \(C = [0.8, 0.2; 0.2, 0.8]\), then \(\det (C) = 0.6\) and we expect a variance increase of \(1/0.6^2 = 2.78\). If classifier performance improves to \(C = [0.9, 0.1; 0.1, 0.9]\), then \(\det (C) = 0.8\) and the variance increase is \(1/0.8^2 = 1.56\).
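
As a minimal numerical sketch (using NumPy; the misclassification matrix and measured proportions below are hypothetical), the adjustment and its variance cost can be computed directly:

```python
import numpy as np

# Hypothetical misclassification matrix C: column j holds P(classified i | true class j).
C = np.array([[0.8, 0.2],
              [0.2, 0.8]])

# Measured (classifier-output) group proportions m-hat from a sample.
m_hat = np.array([0.55, 0.45])

# Post-hoc adjustment in the style of Eq. 2: p-hat = C^{-1} m-hat.
p_hat = np.linalg.solve(C, m_hat)

# Expected variance inflation relative to an error-free classifier: 1 / det(C)^2.
inflation = 1.0 / np.linalg.det(C) ** 2

print(p_hat)
print(round(inflation, 2))  # → 2.78 for this C
```

Note that `np.linalg.solve` is preferred over explicitly inverting C; the result is the same de-biased proportion vector.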

\({{\mathrm{Var}}}(E[\hat{m}_a])\) comes from the random walking procedure itself and is generally not known in closed form. Two methods for closed-form variance have been proposed [16, 38]. The Volz-Heckathorn [38] estimator is biased but provides reasonable estimates in practice. The Goel and Salganik [16] variance estimator relies on knowing the homophily of the network. Bootstrap resampling methods based on creating “synthetic chains” from the estimated transition matrix between groups have also often been used [19].

Simulations of various RWRW estimators [12] show that factors such as non-equilibrium seed selection, group homophily, and the number of waves from each seed affect both the bias and variance of RWRW estimates. Generally, one long chain provides the best results, rather than many shorter chains. It is easier to sample from lower homophily networks, and equilibrium seed selection (proportional to degree) is useful if one must use relatively short chains. Otherwise, if chains may be long, a burn-in period can be used to simulate equilibrium seed selection.
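
The degree-reweighted estimate underlying RWRW can be sketched as follows (a toy graph and hypothetical labels; the helper name `rwrw_estimate` is ours, and the burn-in mimics equilibrium seed selection as discussed above):

```python
import random

random.seed(0)

# Toy undirected graph as an adjacency dict, with hypothetical node labels.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3, 5], 5: [4]}
label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "a"}

def rwrw_estimate(adj, label, group, start, steps, burn_in=100):
    """Degree-reweighted (Volz-Heckathorn-style) estimate of a group
    proportion from a single simple random walk."""
    node = start
    num = den = 0.0
    for t in range(burn_in + steps):
        if t >= burn_in:                      # discard burn-in steps
            w = 1.0 / len(adj[node])          # inverse-degree importance weight
            den += w
            if label[node] == group:
                num += w
        node = random.choice(adj[node])       # move to a uniform random neighbor
    return num / den

# True proportion of "a" is 4/6 ≈ 0.667; the estimate should be close.
print(rwrw_estimate(adj, label, "a", start=0, steps=5000))
```

A single long chain, as recommended above, lets the burn-in be amortized over many retained steps.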

7.2 Correcting Visibility

While RWRW gives the mean of g over the population of nodes, the distribution of g is often an object of interest. For instance, if we wish to estimate the proportion of minority group members in the top 20% of the degree distribution, we need to estimate the joint distribution of \((g(i), d_i)\) and take nodes in the top 20% of the distribution of \(d_i\).

Fortunately, importance resampling [26, 34] based on the data obtained during an RWRW walk provides a method to do this. A random walk samples node i with degree \(d_i\) with probability \(\pi (i) = d_i / D\), where \(D = \sum _j d_j = N \bar{d}\) is the total degree. Since we want to sample it with probability \(1{\slash }N\) (a uniform distribution over the nodes), we construct an importance weight as the ratio of the desired to the actual sampling probability:

$$\begin{aligned} \frac{1/N}{\pi (i)} = \frac{D}{N d_i} = \frac{\bar{d}}{d_i} = w_i. \end{aligned}$$
(9)

\(w_i\) provides a resampling weight for node i. We then normalize to obtain \(w_i / \sum _j w_j\) and resample the data \((g(i), d_i)\) with these probabilities to approximate draws from the desired uniform distribution \(1{\slash }N\).

An importance resample produces a distribution of \((d_i, g(i))\) which mirrors the distribution in the population. We then sort the resampled nodes by degree \(d_i\) and take the proportion in the top 20% of degree where \(g(i) = b\), or where i is a member of the minority group. In the case with no classification error, this procedure produces an unbiased estimate of the fraction of minority group members in the top 20% of the degree distribution.
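
A short sketch of this resampling step (NumPy; the degrees and labels below are synthetic stand-ins for data collected on a walk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical walk data: degrees d_i and minority-group indicators g(i).
d = rng.integers(1, 50, size=2000).astype(float)
g = rng.random(2000) < 0.3   # True marks minority-group membership (toy labels)

# Importance weights w_i proportional to d-bar / d_i (Eq. 9), normalized.
w = d.mean() / d
w /= w.sum()

# Resample with replacement according to w to approximate a uniform node sample.
idx = rng.choice(len(d), size=len(d), replace=True, p=w)
d_r, g_r = d[idx], g[idx]

# Proportion of minority nodes among the top 20% of the resampled degrees.
top = d_r >= np.quantile(d_r, 0.8)
print(g_r[top].mean())
```

Because these toy labels are independent of degree, the printed proportion should land near the overall minority share of 0.3; on real data the gap between this value and the overall share is exactly the visibility effect of interest.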

With classification error, we need to add an additional step to correct the importance resample. Call \(\hat{m}_b^{\mathcal {I}}(20)\) the measured proportion of group b in the top \(20\%\) of the degree distribution in importance resample \(\mathcal {I}\). Likewise, there is a vector that contains measures for all groups \(\mathbf {\hat{m}}^{\mathcal {I}}(20)\). Then we can use a procedure similar to Eq. 2 to correct the importance resample proportions:

$$\begin{aligned} \mathbf {\hat{p}}^{\mathcal {I}}(20) = C^{-1}\mathbf {\hat{m}}^{\mathcal {I}}(20). \end{aligned}$$
(10)

To see when \(\mathbf {\hat{p}}^{\mathcal {I}}(20)\) is unbiased, repeat the reasoning for estimating the population proportion \(\mathbf {\hat{p}}\) above. This shows that \(\mathbf {\hat{p}}^{\mathcal {I}}(20)\) is unbiased when the importance resample provides an unbiased estimate of \(\mathbf {m}^{\mathcal {I}}(20)\). A similar argument applies for the variance, and the determinant of C may be used to estimate the increase in variance.

7.3 Correcting Edge Proportions

Correcting estimates of ties between groups presents a more substantial challenge than correcting group proportions. Akin to C, there is a dyadic misclassification matrix M which maps the vector of true edge-type proportions \(\mathbf {s} = (s_{aa}, s_{ab}, s_{bb})\) to the vector of measured edge-type proportions \(\mathbf {t} = (t_{aa}, t_{ab}, t_{bb})\),

$$\begin{aligned} M \mathbf {s} = \mathbf {t}, \end{aligned}$$
(11)

where

$$\begin{aligned} M = \begin{bmatrix} c_{\hat{a} \mid a}^2 &{} c_{\hat{a} \mid a} c_{\hat{a} \mid b} &{} c_{\hat{a} \mid b}^2 \\ 2 c_{\hat{a} \mid a} c_{\hat{b} \mid a} &{} c_{\hat{a} \mid a} c_{\hat{b} \mid b} + c_{\hat{a} \mid b} c_{\hat{b} \mid a} &{} 2 c_{\hat{a} \mid b} c_{\hat{b} \mid b} \\ c_{\hat{b} \mid a}^2 &{} c_{\hat{b} \mid a} c_{\hat{b} \mid b} &{} c_{\hat{b} \mid b}^2 \end{bmatrix}, \end{aligned}$$

which implies that we can use a technique similar to Eq. 2 at the dyad level

$$\begin{aligned} \mathbf {s} = M^{-1} \mathbf {t}. \end{aligned}$$
(12)

In practice, we obtain a sample \(\mathbf {\hat{t}}\) rather than \(\mathbf {t}\) for the entire graph, which is then used to estimate true edge proportions \(\mathbf {\hat{s}}\). \(\mathbf {\hat{s}}\) is unbiased when the sampling method employed produces unbiased estimates of \(\mathbf {t}\). If \(B = M^{-1}\), the expectation for \(\hat{s}_{aa}\) is given by

$$\begin{aligned} E[\hat{s}_{aa}] = b_{00}E[\hat{t}_{aa}] + b_{01}E[\hat{t}_{ab}] + b_{02} E[\hat{t}_{bb}]. \end{aligned}$$
(13)

As in the node case, we can expect the variance of \(E[\hat{s}_{aa}]\) and \(E[\hat{s}_a]\) to increase when applying classification bias correction. Simulations below indicate that variance inflation for \(E[\hat{s}_a]\) is larger than for \(E[\hat{p}_a]\). Note that \(\hat{s}_a = 2\hat{s}_{aa} / (2\hat{s}_{aa} + \hat{s}_{ab})\) is unbiased under the same conditions as \(\hat{s}_{aa}\).

If \(B = M^{-1}\), then the variance for \(\hat{s}_{aa}\) is

$$\begin{aligned} {{\mathrm{Var}}}(E[\hat{s}_{aa}]) = b^2_{00}{{\mathrm{Var}}}(E[\hat{t}_{aa}]) + b^2_{01}{{\mathrm{Var}}}(E[\hat{t}_{ab}]) + b^2_{02} {{\mathrm{Var}}}(E[\hat{t}_{bb}]). \end{aligned}$$
(14)

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Berry, G., Sirianni, A., High, N., Kellum, A., Weber, I., Macy, M. (2018). Estimating Group Properties in Online Social Networks with a Classifier. In: Staab, S., Koltsova, O., Ignatov, D. (eds) Social Informatics. SocInfo 2018. Lecture Notes in Computer Science, vol 11185. Springer, Cham. https://doi.org/10.1007/978-3-030-01129-1_5

  • DOI: https://doi.org/10.1007/978-3-030-01129-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01128-4

  • Online ISBN: 978-3-030-01129-1
