Abstract
With the rise of big data, statistical network analysis is attracting growing attention. However, exact computation of many statistics of interest is prohibitively expensive on big graphs, so statistical estimators can be preferable. Model-based estimators for networks have some drawbacks. We therefore study design-based estimates that rely on sampling methods developed specifically for graph populations. In this contribution, we test sampling designs that can be described as “extension” sampling designs. Units are selected in two phases: in the first phase, a simple design such as Bernoulli sampling is used; in the second phase, some units are selected among those that are linked in some way to the units in the first-phase sample. We test these methods on Twitter data, because the size and structure of the Twitter graph are typical of the big social networks for which such methods would be very useful.
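The two-phase selection described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the chapter's actual designs: the toy graph, the function names, the first-phase probability `p`, and the choice of a Bernoulli draw with probability `q` among linked units in the second phase are all hypothetical.

```python
import random


def bernoulli_sample(units, p, rng):
    """First phase: keep each unit independently with probability p."""
    # Iterate in sorted order so that a seeded rng gives reproducible draws.
    return {u for u in sorted(units) if rng.random() < p}


def extension_sample(adjacency, first_phase, q, rng):
    """Second phase (assumed rule): among units linked to the first-phase
    sample and not already selected, keep each with probability q."""
    candidates = set()
    for u in first_phase:
        candidates.update(adjacency.get(u, ()))
    candidates -= set(first_phase)
    return {v for v in sorted(candidates) if rng.random() < q}


# Hypothetical toy graph stored as adjacency lists (node -> linked nodes).
adjacency = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["c"],
    "e": [],
}

rng = random.Random(2016)  # fixed seed, for reproducibility only
first = bernoulli_sample(adjacency, p=0.4, rng=rng)
second = extension_sample(adjacency, first, q=0.5, rng=rng)
print("first phase:", sorted(first))
print("second phase:", sorted(second))
```

Setting `q = 1` in this sketch takes every unit linked to the first-phase sample, while smaller values of `q` subsample among them. Design-based estimation would additionally require the inclusion probabilities induced by both phases, which this sketch does not compute.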
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Rebecq, A. (2018). Extension Sampling Designs for Big Networks: Application to Twitter. In: Bertail, P., Blanke, D., Cornillon, PA., Matzner-Løber, E. (eds) Nonparametric Statistics. ISNPS 2016. Springer Proceedings in Mathematics & Statistics, vol 250. Springer, Cham. https://doi.org/10.1007/978-3-319-96941-1_17
DOI: https://doi.org/10.1007/978-3-319-96941-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96940-4
Online ISBN: 978-3-319-96941-1
eBook Packages: Mathematics and Statistics (R0)