Abstract
Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting do not adequately support today’s scholars or support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application assisting scholars collecting data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM.
Notes
- 1.
A search of NSF awards for the term “social media” on February 4, 2016 returned 455 results. https://www.nsf.gov/awardsearch/simpleSearchResult?queryText=%22social+media%22&ActiveAwards=true.
- 2.
This development was supported by a grant (#LG-46-13-0257-13) from the Institute of Museum and Library Services to GWU Libraries from 2013 to 2014.
- 3.
We refer here to “wayback software”, a generic term for software that plays back WARC files, as distinguished from “The Wayback Machine”, an instance and implementation of wayback software hosted by the Internet Archive. Two examples of wayback software are the International Internet Preservation Consortium’s OpenWayback [13] and Ilya Kreymer’s pywb [14].
- 4.
- 5.
- 6.
- 7.
Although the URL “http://myspace.com” was captured from 1996 forward, MySpace was founded and launched at that URL in 2003. https://web.archive.org/web/20031004101518/http://myspace.com/.
- 8.
- 9.
ArchiveSocial requires social media account owners to log in and grant ArchiveSocial permission to access their social media data. One of the authors of this paper worked on the adoption of ArchiveSocial at the State Archives of North Carolina.
- 10.
Noting also that, “The research, development, and technical experimentation necessary to advance the archiving tools on these fronts will not come from the majority of web archiving organizations with their fractional staff time commitments” [37].
- 11.
- 12.
- 13.
Many of us remember Friendster, MySpace and other extinct social platforms. Though certainly more popular, even Twitter itself seems to be experiencing a stall in the growth of its user base [54].
- 14.
The GW Libraries are collaborating with Johns Hopkins University and Georgetown University in this grant work, entitled “Blogging and Microblogging: Preserving Non-Official Voices in China’s Anti-Corruption Campaign”.
- 15.
Another aspect of tweets is the metadata that accompanies them when they are harvested from the API. This metadata contains social network information, in that it includes references to (and/or retweets of) other accounts. In addition, each tweet contains complete user profile information, which often changes over time. This metadata has research potential, which is why we have also saved it.
- 16.
- 17.
For Twitter, this is commonly referred to as “dehydration” and is useful because it allows exchanging datasets within the constraints of Twitter’s terms of service.
- 18.
- 19.
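As a rough illustration of the "dehydration" described in note 17, the sketch below reduces tweet JSON to a list of tweet IDs that could be shared in place of the full records. It assumes line-oriented JSON in which each tweet carries the `id_str` field returned by Twitter's API; the function name `dehydrate` and the sample records are hypothetical and are not taken from SFM.

```python
import json
from typing import Iterable, Iterator


def dehydrate(json_lines: Iterable[str]) -> Iterator[str]:
    """Yield the tweet ID from each line of line-oriented tweet JSON."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        # Prefer the string form of the ID; fall back to the integer `id`.
        tweet_id = tweet.get("id_str") or tweet.get("id")
        if tweet_id is not None:
            yield str(tweet_id)


# Hypothetical sample records standing in for harvested tweets.
sample = [
    '{"id": 1, "id_str": "1", "text": "first", "user": {"screen_name": "a"}}',
    "",
    '{"id": 2, "id_str": "2", "text": "second", "user": {"screen_name": "b"}}',
]
tweet_ids = list(dehydrate(sample))
print(tweet_ids)
```

Only the identifiers survive this step; the text, user profile, and other metadata are dropped, so a recipient would have to retrieve the full tweets from the API again.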
References
- 1.
GW Libraries: gwu-libraries/social-feed-manager (2012). https://github.com/gwu-libraries/social-feed-manager. Accessed 10 Feb 2016
- 2.
GW Libraries: Welcome to Social Feed Manager! (2015). http://social-feed-manager.readthedocs.org/en/latest/. Accessed 12 Feb 2016
- 3.
Chudnov, D., Kerchner, D., Sharma, A., Wrubel, L.: Technical challenges in developing software to collect Twitter data. Code4lib J. (2014). http://journal.code4lib.org/articles/10097. Accessed 10 Feb 2016
- 4.
Hayes, D., Lawless, J.L.: Women on the run: gender, media, and political campaigns in a polarized era. Cambridge University Press, Cambridge (2016). http://books.google.com/books/about/Women_on_the_Run.html?hl=&id=fXNNDAAAQBAJ. Accessed 10 Feb 2016
- 5.
GW Libraries: gwu-libraries/sfm-ui (2015). https://github.com/gwu-libraries/sfm-ui. Accessed 10 Feb 2016
- 6.
GW Libraries: Social Feed Manager (SFM) documentation (2015). http://sfm.readthedocs.org/en/latest/. Accessed 12 Feb 2016
- 7.
International Internet Preservation Consortium: About IIPC (2012). http://netpreserve.org/about-us. Accessed 10 Feb 2016
- 8.
International Internet Preservation Consortium: About archiving (2012). http://netpreserve.org/web-archiving/about-archiving. Accessed 10 Feb 2016
- 9.
Jack, P., Levitt, N.: Heritrix (2014). https://webarchive.jira.com/wiki/display/Heritrix. Accessed 10 Feb 2016
- 10.
Kreymer, I.: Webrecorder/webrecorder (2015). https://github.com/webrecorder/webrecorder. Accessed 11 Feb 2016
- 11.
Internet Archive: Internetarchive/warcprox (2012). https://github.com/internetarchive/warcprox. Accessed 11 Feb 2016
- 12.
International Internet Preservation Consortium: The WARC format (2015). http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/. Accessed 10 Feb 2016
- 13.
International Internet Preservation Consortium: iipc/openwayback (2013). https://github.com/iipc/openwayback. Accessed 11 Feb 2016
- 14.
Kreymer, I.: ikreymer/pywb (2013). https://github.com/ikreymer/pywb. Accessed 11 Feb 2016
- 15.
Thomson, S.D.: Preserving social media (2016). doi:10.7207/twr16-01. http://www.dpconline.org/component/docman/doc_download/1486-twr16-01. Accessed 10 Feb 2016
- 16.
Bercovici, J.: Who coined “Social Media”? Web pioneers compete for credit. Forbes. (2010). http://www.forbes.com/sites/jeffbercovici/2010/12/09/who-coined-social-media-web-pioneers-compete-for-credit/. Accessed 10 Feb 2016
- 17.
Espley, S., Carpentier, F., Pop, R., Medjkoune, L.: Collect, preserve, access: applying the governing principles of the national archives UK government web archive to social media content. Alexandria 25, 31–50 (2014). doi:10.7227/ALX.0019. http://openurl.ingenta.com/content/xref?genre=article&issn=0955-7490&volume=25&issue=1&spage=31. Accessed 10 Feb 2016
- 18.
Bragg, M., Eubank, K., Ricker, J.: Preserving Web 2.0. Presented at: Best practices exchange (2009). https://webarchive.jira.com/wiki/download/attachments/5734676/BPE_web2_partner+meeting.ppt?version=1&modificationDate=1257454424180. Accessed 10 Feb 2016
- 19.
Ricker, J.: A flickr of Hope: harvesting social networking sites with archive-it. Presented at: NDIIPP partners meeting (2010). http://digitalpreservation.ncdcr.gov/asgii/presentations/ndiipp2010.pdf. Accessed 10 Feb 2016
- 20.
Ricker, J.: Archiving social media sites in North Carolina. Presented at: Best practices exchange (2010). http://digitalpreservation.ncdcr.gov/asgii/presentations/bpe2010.pdf. Accessed 10 Feb 2016
- 21.
Trent, R., Kenney, K.: Social Media Archiving in State Government. Presented at: Tri-State archivists meeting (2013). http://digitalpreservation.ncdcr.gov/asgii/presentations/snca_2013_socialmedia.pdf. Accessed 10 Feb 2016
- 22.
McNealy, J.E.: The privacy implications of digital preservation: Social media archives and the social networks theory of privacy. Elon Univ. Law Rev. 3, 133–160 (2010). http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2027036. Accessed 10 Feb 2016
- 23.
Miao, T.A.: Access denied: how social media accounts fall outside the scope of intellectual property law and into the realm of the computer fraud and abuse act. Fordham Intell. Prop. Med. Ent. LJ 23, 1017 (2012). http://heinonlinebackup.com/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/frdipm23&section=32. Accessed 10 Feb 2016
- 24.
Moyer, M.W.: Twitter opens its cage. Sci. Am. 310, 16 (2014). http://www.ncbi.nlm.nih.gov/pubmed/25004563. Accessed 10 Feb 2016
- 25.
NDSA Content Working Group: Web Archiving Survey Report. National Digital Stewardship Alliance (2012). http://www.digitalpreservation.gov/ndsa/working_groups/documents/ndsa_web_archiving_survey_report_2012.pdf. Accessed 10 Feb 2016
- 26.
Bowers, K., Dolan-Mescal, A., Donovan, L., et al.: Occupy archives panel. Presented at: Annual Society of American Archivists Meeting (2013). http://archives2013.sched.org/event/14m52JH/session-303-occupy-archives. Accessed 10 Feb 2016
- 27.
King, L.: Emory digital scholars archive occupy wall street Tweets. Emory Rep. (2012). http://news.emory.edu/stories/2012/09/er_occupy_wall_street_tweets_archive/campus.html. Accessed 10 Feb 2016
- 28.
Del Signore, J.: Museums Archiving Occupy Wall Street: Historical Preservation Or “Taxpayer-Funded Hoarding”? Gothamist (2011). http://gothamist.com/2011/12/26/occupy_wall_street_the_museum_exhib.php. Accessed 10 Feb 2016
- 29.
Chitturi, K., Yang, S.: Real-time archiving of spontaneous events (Use-Case; Hurricane Sandy) and visualizing disaster phases appearing in Tweets. Presented at: Archive-it partner meeting at Best practices exchange. (2012). https://webarchive.jira.com/wiki/download/attachments/40075274/Real-%C2%AD%E2%80%90%26me%20Archiving%20of%20Spontaneous%20Events%20%28Use-%C2%AD%E2%80%90Case%20-%20Hurricane%20Sandy%29.pdf. Accessed 10 Feb 2016
- 30.
Gueguen, G.: Capturing the Zeitgeist. (2012). http://www.slideshare.net/guegueng/capturing-the-zeitgeist. Accessed 10 Feb 2016
- 31.
National Archives and Record Administration: Best practices for social media capture. National Archives and Record Administration (2013). http://www.archives.gov/records-mgmt/resources/socialmediacapture.pdf. Accessed 10 Feb 2016
- 32.
Trent, R.: Social media archive BETA is live! The G.S. 132 Files (2012). https://ncrecords.wordpress.com/2012/12/04/social-media-archive-beta-is-live/. Accessed 10 Feb 2016
- 33.
Emory Libraries: emory-libraries/Twap (2011). https://github.com/emory-libraries/Twap. Accessed 11 Feb 2016
- 34.
North Carolina State University Libraries: NCSU-Libraries/lentil (2013). https://github.com/NCSU-Libraries/lentil. Accessed 11 Feb 2016
- 35.
Thomson, S.D., Kilbride, W.: Preserving social media: the problem of access. New Rev. Inf. Netw. 20, 261–275 (2015). doi:10.1080/13614576.2015.1114842
- 36.
Pennock, M.: Web-archiving (2013). doi:10.7207/twr13-01
- 37.
Bailey, J., Grotke, A., Hanna, K., et al.: Web archiving in the United States: a 2013 survey. National Digital Stewardship Alliance (2014). http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_USWebArchivingSurvey_2013.pdf. Accessed 10 Feb 2016
- 38.
Boyd, D., Crawford, K.: Critical questions for big data. Inf. Commun. Soc. 15, 662–679 (2012). doi:10.1080/1369118X.2012.678878
- 39.
Bruns, A.: Faster than the speed of print: reconciling “big data” social media analysis and academic scholarship. First Monday (2013). doi:10.5210/fm.v18i10.4879. http://journals.uic.edu/ojs/index.php/fm/article/view/4879. Accessed 20 July 2016
- 40.
Tufekci, Z.: Big questions for social media big data: representativeness, validity and other methodological pitfalls. arXiv:1403.7400
- 41.
Hajtnik, T., Uglešić, K., Živkovič, A.: Acquisition and preservation of authentic information in a digital age. Publ. Relat. Rev. 41, 264–271 (2015). doi:10.1016/j.pubrev.2014.12.001. http://www.sciencedirect.com/science/article/pii/S0363811114001945. Accessed 10 Feb 2016
- 42.
Eltgroth, D.R.: Best evidence and the Wayback machine: toward a workable authentication standard for archived Internet evidence. Fordham Law Rev. 78, 181 (2009). http://heinonline.org/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/flr78&section=8. Accessed 10 Feb 2016
- 43.
AIIM: AIIM TR31-2004, Legal acceptance of records produced by information technology systems (2004). http://www.aiim.org/Resources/Standards/AIIM_TR_31. Accessed 20 July 2016
- 44.
State Archives of North Carolina: Guidelines for managing trustworthy digital public records (2000). http://archives.ncdcr.gov/Portals/3/PDF/guidelines/guidelines_for_digital_public_records.pdf. Accessed 10 Feb 2016
- 45.
Markham, A., Buchanan, E., AoIR Ethics Working Committee: Ethical decision-making and Internet research: Version 2.0. Association of Internet Researchers (2012). http://www.uwstout.edu/ethicscenter/upload/aoirethicsprintablecopy.pdf. Accessed 10 Feb 2016
- 46.
Leetaru, K.: Are research ethics obsolete in the era of big data? Forbes (2016). http://www.forbes.com/sites/kalevleetaru/2016/06/17/are-research-ethics-obsolete-in-the-era-of-big-data/. Accessed 20 July 2016
- 47.
Council for Big Data, Ethics, and Society (2016). http://bdes.datasociety.net/. Accessed 20 July 2016
- 48.
Summers, E.: Introducing documenting the now—documenting DocNow. Medium (2016). https://news.docnow.io/introducing-documenting-the-now-416874c07e0. Accessed 25 July 2016
- 49.
Townsend, L., Wallace, C.: Social media research: a guide to ethics. The University of Aberdeen. http://www.dotrural.ac.uk/socialmediaresearchethics.pdf. Accessed 10 Feb 2016
- 50.
Milligan, I., Webster, P.: The Web archive bibliography. Web archives for historians (2014). https://webarchivehistorians.org/the-web-archive-bibliography/. Accessed 22 July 2016
- 51.
Milligan, I.: Finding community in the Ruins of GeoCities: distantly reading a web archive. Bull. IEEE Tech. Commit. Dig. Lib. (2015). http://www.ieee-tcdl.org/Bulletin/v11n2/papers/milligan.pdf. Accessed 10 Feb 2016
- 52.
Milligan, I.: Lost in the infinite archive: the promise and pitfalls of web archives. Int. J. Hum. Arts Comput. 10, 78–94 (2016). doi:10.3366/ijhac.2016.0161
- 53.
Webster, P.: Why historians should care about web archiving. Webstory: Peter Webster’s blog. (2012). https://peterwebster.me/2012/10/08/why-historians-should-care-about-web-archiving/. Accessed 14 July 2016
- 54.
Statista: Twitter: number of monthly active users 2015. Statista (2016). http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/. Accessed 11 Feb 2016
- 55.
Summers, E.: URLs in Tweets Mentioning Ferguson: August 10–27, 2014 (2014). https://edsu.github.io/ferguson-urls/index.html. Accessed 10 Feb 2016
- 56.
Baumann, R.: Archiving Video from #Ferguson: on Archivy. Medium (2015). https://medium.com/on-archivy/archiving-video-from-ferguson-504e95859756. Accessed 10 Feb 2016
- 57.
Milligan, I., Ruest, N., Lin, J.: The Gatekeepers vs. the Masses. Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. (2016). doi:10.1145/2910896.2910913
- 58.
Consultative Committee for Space Data Systems: Reference model for an Open Archival Information System (OAIS). CCSDS Secretariat, Washington, DC (2012). http://public.ccsds.org/publications/archive/650x0m2.pdf. Accessed 10 Feb 2016
- 59.
Commission on Preservation and Access, Research Libraries Group, Task Force on Digital Archiving: Preserving digital information: report of the task force on archiving of digital information (1996). https://books.google.com/books?id=T9YmrgEACAAJ. Accessed 10 Feb 2016
- 60.
Provenance Working Group: PROV-overview (2013). https://www.w3.org/TR/prov-overview/. Accessed 6 June 2016
- 61.
Kerchner, D., Littman, J., Peterson, C. et al.: The Provenance of a Tweet (2016). https://scholarspace.library.gwu.edu/downloads/h128nd689. Accessed 20 July 2016
- 62.
Internet Archive: internetarchive/brozzler. GitHub. https://github.com/internetarchive/brozzler. Accessed 14 July 2016
- 63.
Littman, J.: Social media harvesting techniques. GW Libraries (2015). https://library.gwu.edu/scholarly-technology-group/posts/social-media-harvesting-techniques. Accessed 10 Feb 2016
- 64.
Foo, C.: chfoo/wpull (2013). https://github.com/chfoo/wpull. Accessed 10 Feb 2016
- 65.
Van de Sompel, H., Nelson, M., Sanderson, R.: HTTP framework for time-based access to resource states–Memento (2013). https://tools.ietf.org/rfc/rfc7089.txt. Accessed 10 Feb 2016
- 66.
Wrubel, L.: Announcing SFM Version 1.0. Social Feed Manager (2016). http://gwu-libraries.github.io/sfm-ui/posts/2016-06-20-releasing-1-0. Accessed 15 July 2016
- 67.
Twitter, Inc.: The Streaming APIs. https://dev.twitter.com/streaming/overview. Accessed 20 July 2016
- 68.
Twitter, Inc.: REST APIs. Twitter Developers. https://dev.twitter.com/rest/public. Accessed 20 July 2016
- 69.
Flickr: Flickr services (2005). https://www.flickr.com/services/api/. Accessed 20 July 2016
- 70.
Weibo Corporation: Weibo API (2012). http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en. Accessed 20 July 2016
- 71.
Summers, E.: edsu/twarc (2013). doi:10.5281/zenodo.17385. https://github.com/edsu/twarc. Accessed 10 Feb 2016
- 72.
Stüvel, S.A.: sybrenstuvel/flickrapi (2013). https://github.com/sybrenstuvel/flickrapi. Accessed 18 Oct 2016
- 73.
Internet Archive: internetarchive/warc. GitHub. https://github.com/internetarchive/warc. Accessed 15 July 2016
- 74.
Dolan, S.: stedolan/jq (2012). https://github.com/stedolan/jq. Accessed 18 Oct 2016
- 75.
Clarke, N.: JWAT-tools (2012). https://sbforge.org/display/JWAT/JWAT-Tools. Accessed 18 Oct 2016
- 76.
Internet Archive: internetarchive/warctools (2010). https://github.com/internetarchive/warctools. Accessed 18 Oct 2016
Acknowledgements
This work is supported by Grant #NARDI-14-50017-14 from the National Historical Publications and Records Commission.
Author information
Additional information
Justin Littman is the lead author. All other authors contributed significantly to this work and participated in the writing of the paper. The authors are listed alphabetically. Laura Wrubel is the current principal investigator on the grant supporting this work; Daniel Chudnov held this role previously.
About this article
Cite this article
Littman, J., Chudnov, D., Kerchner, D. et al. API-based social media collecting as a form of web archiving. Int J Digit Libr 19, 21–38 (2018). https://doi.org/10.1007/s00799-016-0201-7
Keywords
- Social media
- Web archiving
- Archives
- Data collection
- Twitter