API-based social media collecting as a form of web archiving

  • Justin Littman
  • Daniel Chudnov
  • Daniel Kerchner
  • Christie Peterson
  • Yecheng Tan
  • Rachel Trent
  • Rajat Vij
  • Laura Wrubel

Abstract

Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting neither adequately support today’s scholars nor support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application that helps scholars collect data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM.

Keywords

Social media, Web archiving, Archives, Data collection, Twitter

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. GW Libraries, The George Washington University, Washington, USA
  2. District Data Labs, Washington, USA