Abstract
Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting do not adequately support today’s scholars or support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application assisting scholars collecting data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM.
Notes
A search of NSF awards for the term “social media” on February 4, 2016 returned 455 results. https://www.nsf.gov/awardsearch/simpleSearchResult?queryText=%22social+media%22&ActiveAwards=true.
This development was supported by a grant (#LG-46-13-0257-13) from the Institute of Museum and Library Services to GWU Libraries from 2013 to 2014.
We refer here to “wayback software”, a generic term for software that plays back WARC files, as distinguished from “The Wayback Machine”, an instance and implementation of wayback software hosted by the Internet Archive. Two examples of wayback software are the International Internet Preservation Consortium’s OpenWayback [13] and Ilya Kreymer’s pywb [14].
Although the URL “http://myspace.com” was captured from 1996 forward, MySpace was founded and launched at that URL in 2003. https://web.archive.org/web/20031004101518/http://myspace.com/.
ArchiveSocial requires social media account owners to log in and grant ArchiveSocial permission to access their social media data. One of the authors of this paper worked on the adoption of ArchiveSocial at the State Archives of North Carolina.
The same source also notes that, “The research, development, and technical experimentation necessary to advance the archiving tools on these fronts will not come from the majority of web archiving organizations with their fractional staff time commitments” [37].
Many of us remember Friendster, MySpace and other extinct social platforms. Though certainly more popular, even Twitter itself seems to be experiencing a stall in the growth of its user base [54].
The GW Libraries are collaborating with Johns Hopkins University and Georgetown University in this grant work, entitled “Blogging and Microblogging: Preserving Non-Official Voices in China’s Anti-Corruption Campaign”.
Another aspect of tweets is the metadata that accompanies them when they are harvested from the API. This metadata carries social network information, in that it contains references to (and/or retweets of) other accounts. In addition, each tweet contains complete user profile information, which often changes over time. This metadata has research potential, which is why we have also saved it.
For Twitter, this is commonly referred to as “dehydration” and is useful because it allows exchanging datasets within the constraints of Twitter’s terms of service.
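As an illustration of the “dehydration” described in the preceding note, the following is a minimal sketch in Python and not SFM's implementation. It assumes tweets are stored as line-oriented JSON (one tweet per line) and that each tweet carries its identifier in the `id_str` field, as tweets returned by Twitter's API do. The function name `dehydrate` and the sample records are hypothetical and exist only for this example.

```python
import json
from typing import Iterable, Iterator


def dehydrate(tweet_lines: Iterable[str]) -> Iterator[str]:
    """Yield only the tweet ID from each line of line-oriented tweet JSON.

    Assumption: every non-blank line is one JSON-encoded tweet with an
    "id_str" field. All other fields (text, user profile, and so on)
    are dropped, so only the IDs would be shared.
    """
    for line in tweet_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        yield tweet["id_str"]


# Made-up sample records, standing in for harvested tweets.
sample = [
    '{"id_str": "1", "text": "hello", "user": {"screen_name": "a"}}',
    "",
    '{"id_str": "2", "text": "world", "user": {"screen_name": "b"}}',
]

tweet_ids = list(dehydrate(sample))
print(tweet_ids)
```

A recipient of such an ID list would then need to retrieve the full tweets from the API again, so deleted or protected tweets would not be recoverable from the shared dataset.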
References
GW Libraries: gwu-libraries/social-feed-manager (2012). https://github.com/gwu-libraries/social-feed-manager. Accessed 10 Feb 2016
GW Libraries: Welcome to Social Feed Manager! (2015). http://social-feed-manager.readthedocs.org/en/latest/. Accessed 12 Feb 2016
Chudnov, D., Kerchner, D., Sharma, A., Wrubel, L.: Technical challenges in developing software to collect twitter data. Code4lib J. (2014) http://journal.code4lib.org/articles/10097. Accessed 10 Feb 2016
Hayes, D., Lawless, J.L.: Women on the run: gender, media, and political campaigns in a polarized era. Cambridge University Press, Cambridge (2016). http://books.google.com/books/about/Women_on_the_Run.html?hl=&id=fXNNDAAAQBAJ. Accessed 10 Feb 2016
GW Libraries: gwu-libraries/sfm-ui (2015). https://github.com/gwu-libraries/sfm-ui. Accessed 10 Feb 2016
GW Libraries: Social Feed Manager (SFM) documentation (2015). http://sfm.readthedocs.org/en/latest/. Accessed 12 Feb 2016
International Internet Preservation Consortium: About IIPC (2012). http://netpreserve.org/about-us. Accessed 10 Feb 2016
International Internet Preservation Consortium: About archiving (2012). http://netpreserve.org/web-archiving/about-archiving. Accessed 10 Feb 2016
Jack, P., Levitt, N.: Heritrix (2014). https://webarchive.jira.com/wiki/display/Heritrix. Accessed 10 Feb 2016
Kreymer, I.: Webrecorder/webrecorder (2015). https://github.com/webrecorder/webrecorder. Accessed 11 Feb 2016
Internet Archive: Internetarchive/warcprox (2012). https://github.com/internetarchive/warcprox. Accessed 11 Feb 2016
International Internet Preservation Consortium: The WARC format (2015). http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/. Accessed 10 Feb 2016
International Internet Preservation Consortium: iipc/openwayback (2013). https://github.com/iipc/openwayback. Accessed 11 Feb 2016
Kreymer, I.: ikreymer/pywb (2013). https://github.com/ikreymer/pywb. Accessed 11 Feb 2016
Thomson, S.D.: Preserving social media (2016). doi:10.7207/twr16-01. http://www.dpconline.org/component/docman/doc_download/1486-twr16-01. Accessed 10 Feb 2016
Bercovici, J.: Who coined “Social Media”? Web pioneers compete for credit. Forbes. (2010). http://www.forbes.com/sites/jeffbercovici/2010/12/09/who-coined-social-media-web-pioneers-compete-for-credit/. Accessed 10 Feb 2016
Espley, S., Carpentier, F., Pop, R., Medjkoune, L.: Collect, preserve, access: applying the governing principles of the national archives UK government web archive to social media content. Alexandria 25, 31–50 (2014). doi:10.7227/ALX.0019. http://openurl.ingenta.com/content/xref?genre=article&issn=0955-7490&volume=25&issue=1&spage=31. Accessed 10 Feb 2016
Bragg, M., Eubank, K., Ricker, J.: Preserving Web 2.0. Presented at: Best practices exchange (2009) https://webarchive.jira.com/wiki/download/attachments/5734676/BPE_web2_partner+meeting.ppt?version=1&modificationDate=1257454424180. Accessed 10 Feb 2016
Ricker, J.: A flickr of Hope: harvesting social networking sites with archive-it. Presented at: NDIIPP partners meeting (2010). http://digitalpreservation.ncdcr.gov/asgii/presentations/ndiipp2010.pdf. Accessed 10 Feb 2016
Ricker, J.: Archiving social media sites in North Carolina. Presented at: Best practices exchange (2010). http://digitalpreservation.ncdcr.gov/asgii/presentations/bpe2010.pdf. Accessed 10 Feb 2016
Trent, R., Kenney, K.: Social Media Archiving in State Government. Presented at: Tri-State archivists meeting (2013). http://digitalpreservation.ncdcr.gov/asgii/presentations/snca_2013_socialmedia.pdf. Accessed 10 Feb 2016
McNealy, J.E.: The privacy implications of digital preservation: Social media archives and the social networks theory of privacy. Elon Univ. Law Rev. 3, 133–160 (2010). http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2027036. Accessed 10 Feb 2016
Miao, T.A.: Access denied: how social media accounts fall outside the scope of intellectual property law and into the realm of the computer fraud and abuse act. Fordham Intell. Prop. Med. Ent. LJ 23, 1017 (2012). http://heinonlinebackup.com/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/frdipm23&section=32. Accessed 10 Feb 2016
Moyer, M.W.: Twitter opens its cage. Sci. Am. 310, 16 (2014). http://www.ncbi.nlm.nih.gov/pubmed/25004563. Accessed 10 Feb 2016
NDSA Content Working Group: Web Archiving Survey Report. National Digital Stewardship Alliance (2012). http://www.digitalpreservation.gov/ndsa/working_groups/documents/ndsa_web_archiving_survey_report_2012.pdf. Accessed 10 Feb 2016
Bowers, K., Dolan-Mescal, A., Donovan, L., et al.: Occupy archives panel. Presented at: Annual Society of American Archivists Meeting (2013). http://archives2013.sched.org/event/14m52JH/session-303-occupy-archives. Accessed 10 Feb 2016
King, L.: Emory digital scholars archive occupy wall street Tweets. Emory Rep. (2012). http://news.emory.edu/stories/2012/09/er_occupy_wall_street_tweets_archive/campus.html. Accessed 10 Feb 2016
Del Signore, J.: Museums Archiving Occupy Wall Street: Historical Preservation Or “Taxpayer-Funded Hoarding”? Gothamist (2011). http://gothamist.com/2011/12/26/occupy_wall_street_the_museum_exhib.php. Accessed 10 Feb 2016
Chitturi, K., Yang, S.: Real-time archiving of spontaneous events (Use-Case; Hurricane Sandy) and visualizing disaster phases appearing in Tweets. Presented at: Archive-it partner meeting at Best practices exchange. (2012). https://webarchive.jira.com/wiki/download/attachments/40075274/Real-%C2%AD%E2%80%90%26me%20Archiving%20of%20Spontaneous%20Events%20%28Use-%C2%AD%E2%80%90Case%20-%20Hurricane%20Sandy%29.pdf. Accessed 10 Feb 2016
Gueguen, G.: Capturing the Zeitgeist. (2012). http://www.slideshare.net/guegueng/capturing-the-zeitgeist. Accessed 10 Feb 2016
National Archives and Records Administration: Best practices for social media capture. National Archives and Records Administration (2013). http://www.archives.gov/records-mgmt/resources/socialmediacapture.pdf. Accessed 10 Feb 2016
Trent, R.: Social media archive BETA is live! The G.S. 132 Files (2012). https://ncrecords.wordpress.com/2012/12/04/social-media-archive-beta-is-live/. Accessed 10 Feb 2016
Emory Libraries: emory-libraries/Twap (2011). https://github.com/emory-libraries/Twap. Accessed 11 Feb 2016
North Carolina State University Libraries: NCSU-Libraries/lentil (2013). https://github.com/NCSU-Libraries/lentil. Accessed 11 Feb 2016
Thomson, S.D., Kilbride, W.: Preserving social media: the problem of access. New Rev. Inf. Netw. 20, 261–275 (2015). doi:10.1080/13614576.2015.1114842
Pennock, M.: Web-archiving (2013). doi:10.7207/twr13-01
Bailey, J., Grotke, A., Hanna, K., et al.: Web archiving in the United States: a 2013 survey. National Digital Stewardship Alliance (2014). http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_USWebArchivingSurvey_2013.pdf. Accessed 10 Feb 2016
Boyd, D., Crawford, K.: Critical questions for big data. Inf. Commun. Soc. 15, 662–679 (2012). doi:10.1080/1369118X.2012.678878
Bruns, A.: Faster than the speed of print: reconciling “big data” social media analysis and academic scholarship. First Monday (2013). doi:10.5210/fm.v18i10.4879. http://journals.uic.edu/ojs/index.php/fm/article/view/4879. Accessed 20 July 2016
Tufekci, Z.: Big questions for social media big data: representativeness, validity and other methodological pitfalls. arXiv:1403.7400
Hajtnik, T., Uglešić, K., Živkovič, A.: Acquisition and preservation of authentic information in a digital age. Publ. Relat. Rev. 41, 264–271 (2015). doi:10.1016/j.pubrev.2014.12.001. http://www.sciencedirect.com/science/article/pii/S0363811114001945. Accessed 10 Feb 2016
Eltgroth, D.R.: Best evidence and the Wayback machine: toward a workable authentication standard for archived Internet evidence. Fordham Law Rev. 78, 181 (2009). http://heinonline.org/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/flr78&section=8. Accessed 10 Feb 2016
AIIM: AIIM TR31-2004, Legal acceptance of records produced by information technology systems (2004). http://www.aiim.org/Resources/Standards/AIIM_TR_31. Accessed 20 July 2016
State Archives of North Carolina: Guidelines for managing trustworthy digital public records (2000). http://archives.ncdcr.gov/Portals/3/PDF/guidelines/guidelines_for_digital_public_records.pdf. Accessed 10 Feb 2016
Markham, A., Buchanan, E., AoIR Ethics Working Committee: Ethical decision-making and Internet research: Version 2.0. Association of Internet Researchers (2012). http://www.uwstout.edu/ethicscenter/upload/aoirethicsprintablecopy.pdf. Accessed 10 Feb 2016
Leetaru, K.: Are research ethics obsolete in the era of big data? Forbes (2016). http://www.forbes.com/sites/kalevleetaru/2016/06/17/are-research-ethics-obsolete-in-the-era-of-big-data/. Accessed 20 July 2016
Council for Big Data, Ethics, and Society (2016). http://bdes.datasociety.net/. Accessed 20 July 2016
Summers, E.: Introducing documenting the now—documenting DocNow. Medium (2016). https://news.docnow.io/introducing-documenting-the-now-416874c07e0. Accessed 25 July 2016
Townsend, L., Wallace, C.: Social media research: a guide to ethics. The University of Aberdeen. http://www.dotrural.ac.uk/socialmediaresearchethics.pdf. Accessed 10 Feb 2016
Milligan, I., Webster, P.: The Web archive bibliography. Web archives for historians (2014). https://webarchivehistorians.org/the-web-archive-bibliography/. Accessed 22 July 2016
Milligan, I.: Finding community in the Ruins of GeoCities: distantly reading a web archive. Bull. IEEE Tech. Commit. Dig. Lib. (2015). http://www.ieee-tcdl.org/Bulletin/v11n2/papers/milligan.pdf. Accessed 10 Feb 2016
Milligan, I.: Lost in the infinite archive: the promise and pitfalls of web archives. Int. J. Hum. Arts Comput. 10, 78–94 (2016). doi:10.3366/ijhac.2016.0161
Webster, P.: Why historians should care about web archiving. Webstory: Peter Webster’s blog. (2012). https://peterwebster.me/2012/10/08/why-historians-should-care-about-web-archiving/. Accessed 14 July 2016
Statista: Twitter: number of monthly active users 2015 (2016). http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/. Accessed 11 Feb 2016
Summers, E.: URLs in Tweets Mentioning Ferguson: August 10–27, 2014 (2014). https://edsu.github.io/ferguson-urls/index.html. Accessed 10 Feb 2016
Baumann, R.: Archiving Video from #Ferguson: on Archivy. Medium. (2015) https://medium.com/on-archivy/archiving-video-from-ferguson-504e95859756. Accessed 10 Feb 2016
Milligan, I., Ruest, N., Lin, J.: The Gatekeepers vs. the Masses. Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. (2016). doi:10.1145/2910896.2910913
Consultative Committee for Space Data Systems: Reference model for an Open Archival Information System (OAIS). CCSDS Secretariat, Washington, DC (2012). http://public.ccsds.org/publications/archive/650x0m2.pdf. Accessed 10 Feb 2016
Commission on Preservation and Access, Research Libraries Group, Task Force on Digital Archiving: Preserving digital information: report of the task force on archiving of digital information (1996). https://books.google.com/books?id=T9YmrgEACAAJ. Accessed 10 Feb 2016
Provenance Working Group: PROV-overview (2013). https://www.w3.org/TR/prov-overview/. Accessed 6 June 2016
Kerchner, D., Littman, J., Peterson, C. et al.: The Provenance of a Tweet (2016). https://scholarspace.library.gwu.edu/downloads/h128nd689. Accessed 20 July 2016
Internet Archive: internetarchive/brozzler. GitHub. https://github.com/internetarchive/brozzler. Accessed 14 July 2016
Littman, J.: Social media harvesting techniques. GW Libraries (2015). https://library.gwu.edu/scholarly-technology-group/posts/social-media-harvesting-techniques. Accessed 10 Feb 2016
Foo, C.: chfoo/wpull (2013). https://github.com/chfoo/wpull. Accessed 10 Feb 2016
Van de Sompel, H., Nelson, M., Sanderson, R.: HTTP framework for time-based access to resource states–Memento (2013). https://tools.ietf.org/rfc/rfc7089.txt. Accessed 10 Feb 2016
Wrubel, L.: Announcing SFM Version 1.0. Social Feed Manager (2016). http://gwu-libraries.github.io/sfm-ui/posts/2016-06-20-releasing-1-0. Accessed 15 July 2016
Twitter, Inc.: The Streaming APIs. https://dev.twitter.com/streaming/overview. Accessed 20 July 2016
Twitter, Inc.: REST APIs. Twitter Developers. https://dev.twitter.com/rest/public. Accessed 20 July 2016
Flickr: Flickr services (2005). https://www.flickr.com/services/api/. Accessed 20 July 2016
Weibo Corporation: Weibo API (2012). http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en. Accessed 20 July 2016
Summers, E.: edsu/twarc (2013). doi:10.5281/zenodo.17385. https://github.com/edsu/twarc. Accessed 10 Feb 2016
Stüvel, S.A.: sybrenstuvel/flickrapi (2013). https://github.com/sybrenstuvel/flickrapi. Accessed 18 Oct 2016
Internet Archive: internetarchive/warc. GitHub. https://github.com/internetarchive/warc. Accessed 15 July 2016
Dolan, S.: stedolan/jq (2012). https://github.com/stedolan/jq. Accessed 18 Oct 2016
Clarke, N.: JWAT-tools (2012). https://sbforge.org/display/JWAT/JWAT-Tools. Accessed 18 Oct 2016
Internet Archive: internetarchive/warctools (2010). https://github.com/internetarchive/warctools. Accessed 18 Oct 2016
Acknowledgements
This work is supported by Grant #NARDI-14-50017-14 from the National Historical Publications and Records Commission.
Additional information
Justin Littman is the lead author. All other authors contributed significantly to this work and participated in the writing of the paper. The authors are listed alphabetically. Laura Wrubel is the current principal investigator on the grant supporting this work; Daniel Chudnov held this role previously.
Cite this article
Littman, J., Chudnov, D., Kerchner, D. et al. API-based social media collecting as a form of web archiving. Int J Digit Libr 19, 21–38 (2018). https://doi.org/10.1007/s00799-016-0201-7
DOI: https://doi.org/10.1007/s00799-016-0201-7