Extension Sampling Designs for Big Networks: Application to Twitter

  • Conference paper
Nonparametric Statistics (ISNPS 2016)

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 250)

Abstract

With the rise of big data, statistical network analysis is receiving increasing attention. However, exact computation of many statistics of interest is prohibitively expensive for big graphs, so statistical estimators can be preferable. Because model-based estimators for networks have some drawbacks, we study design-based estimates relying on sampling methods developed specifically for graph populations. In this contribution, we test sampling designs that can be described as “extension” sampling designs. Unit selection happens in two phases: in the first phase, a simple design such as Bernoulli sampling is used; in the second phase, some units are selected among those linked to units in the first-phase sample. We test these methods on Twitter data, because the size and structure of the Twitter graph are typical of the big social networks for which such methods would be very useful.
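The two-phase selection described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy undirected graph, and the choice of a second Bernoulli draw among linked units are assumptions made for illustration only, and the abstract does not specify how second-phase units are actually chosen.

```python
import random

def bernoulli_sample(units, p, rng):
    """Keep each unit independently with probability p (Bernoulli sampling)."""
    return {u for u in units if rng.random() < p}

def extension_sample(adjacency, p_first, p_second, seed=0):
    """Hypothetical two-phase 'extension' design on a graph.

    adjacency: dict mapping each unit to the set of units it is linked to.
    Phase 1: Bernoulli sample of all units with probability p_first.
    Phase 2 (assumed rule): Bernoulli sample, with probability p_second,
    among units linked to the phase-1 sample but not already in it.
    """
    rng = random.Random(seed)
    # Sort so that the draw order is reproducible for a given seed.
    first = bernoulli_sample(sorted(adjacency), p_first, rng)
    candidates = set()
    for u in first:
        candidates.update(adjacency[u])
    candidates -= first
    second = bernoulli_sample(sorted(candidates), p_second, rng)
    return first, second

# Toy undirected graph (illustrative data, not Twitter data).
toy_graph = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c"},
    "e": set(),
}

first, second = extension_sample(toy_graph, p_first=0.5, p_second=1.0, seed=1)
print("first phase:", sorted(first))
print("second phase:", sorted(second))
```

With `p_second=1.0` the second phase keeps every unit linked to the first-phase sample, which corresponds to a one-wave extension; smaller values subsample those linked units. A directed graph such as a follower graph would additionally need a choice of link direction, which this sketch leaves out.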




Author information

Corresponding author

Correspondence to A. Rebecq.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Rebecq, A. (2018). Extension Sampling Designs for Big Networks: Application to Twitter. In: Bertail, P., Blanke, D., Cornillon, PA., Matzner-Løber, E. (eds) Nonparametric Statistics. ISNPS 2016. Springer Proceedings in Mathematics & Statistics, vol 250. Springer, Cham. https://doi.org/10.1007/978-3-319-96941-1_17
