Abstract
Historians and researchers rely on web archives to preserve social media content that no longer exists on the live web. However, what we see on the live web and how it is replayed in the archive are not always the same. In this study, we document and analyze the problems in archiving Twitter after Twitter switched to a new user interface (UI) in June 2020. Most web archives could not archive the new UI, resulting in archived Twitter pages displaying Twitter’s “Something went wrong” error. The challenges in archiving the new UI forced web archives to continue using the old UI. However, features such as Twitter labels were part of the new UI; hence, web archives that captured Twitter’s old UI would be missing these labels. To analyze the potential loss of information in web archival data due to this change, we used the personal Twitter account of the 45th President of the USA, @realDonaldTrump, which was suspended by Twitter on January 8, 2021. Trump’s account was heavily labeled by Twitter for spreading misinformation; however, we discovered that there is no evidence in web archives to prove that some of his tweets ever had a label assigned to them. We also studied the possibility of temporal violations in archived versions of the new UI, which may result in the replay of pages that never existed on the live web. Finally, we discovered that when some tweets with embedded media are replayed, portions of the rewritten t.co URL, meant to be hidden from the end-user, are exposed in the replayed page. Our goal is to educate researchers who may use web archives and caution them when drawing conclusions based on archived Twitter pages.
Notes
This paper is an extended version of a paper that originally appeared in ACM/IEEE JCDL 2021 [22].
References
Acker, A., Chaiet, M.: The weaponization of web archives: data craft and COVID-19 publics. Harvard Kennedy School (HKS) Misinformation Review (2020). https://doi.org/10.37016/mr-2020-41
Ainsworth, S.G., Nelson, M.L., Van de Sompel, H.: A framework for evaluation of composite memento temporal coherence. Tech. Rep. (2014) arXiv:1402.0928
Ainsworth, S.G., Nelson, M.L., Van de Sompel, H.: Only one out of five archived web pages existed as presented. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media, pp. 257–266 (2015)
Alam, S.: Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages (2019). https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
Alam, S., Berlin, J.A.: Reconstructive: A ServiceWorker for Client-Side Reconstruction of Composite Mementos (2017). https://oduwsdl.github.io/Reconstructive/
Alam, S., Kelly, M.: InterPlanetary Wayback: Peer-to-Peer Permanence of Web Archives (2016). https://github.com/oduwsdl/ipwb
Alam, S., Nelson, M.L.: MemGator—a portable concurrent memento aggregator: cross-platform CLI and server binaries in Go. In: JCDL ’16: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 243–244 (2016)
Alam, S., Vargas, P.: Cookies Are Why Your Archived Twitter Page Is Not in English (2018). https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
Alam, S., Kelly, M., Nelson, M.L.: InterPlanetary Wayback: the permanent web archive. In: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’16, pp. 273–274 (2016). https://doi.org/10.1145/2910896.2925467
Alam, S., Kelly, M., Weigle, M.C., et al.: Client-side reconstruction of composite mementos using service worker. In: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’17, pp. 237–240 (2017). https://doi.org/10.1109/JCDL.2017.7991579
Alam, S., Vargas, P., Weigle, M.C., et al.: Impact of HTTP cookie violations in web archives. Tech. Rep. (2019) arXiv:1906.07141
Alam, S., Weigle, M.C., Nelson, M.L., et al.: Supporting web archiving via web packaging. Tech. Rep. (2019) arXiv:1906.07104
Berlin, J.: 2017-01-20: CNN.com has been unarchivable since November 1st, 2016 (2017). https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable
Blumenthal, K.R.: The stack: high fidelity web collecting at scale with Brozzler (2020). https://archive-it.org/blog/post/the-stack-brozzler/
Bray, T.: An HTTP Status Code to Report Legal Obstacles, RFC 7725 (2016). https://datatracker.ietf.org/doc/html/rfc7725
Brunelle, J.F.: Zombies in the archives (2012). https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Fielding, R., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content, RFC 7231 (2014). https://datatracker.ietf.org/doc/html/rfc7231
Gadde, V., Beykpour, K.: Additional steps we’re taking ahead of the 2020 US Election (2020). https://blog.twitter.com/en_us/topics/company/2020/2020-election-changes.html
Garg, K., Jayanetti, H.: Twitter Added Labels on Its Old User Interface (2020). https://ws-dl.blogspot.com/2020/12/2020-12-08-twitter-added-labels-on-its.html
Garg, K., Jayanetti, H.: Twitter was Already Difficult to Archive, Now It’s Worse! (2020). https://ws-dl.blogspot.com/2020/07/2020-07-15-twitter-was-already.html
Garg, K., Jayanetti, H.: TwitterLabels (2021). https://github.com/oduwsdl/TwitterLabels
Garg, K., Jayanetti, H.R., Alam, S., et al.: Replaying archived Twitter: when your bird is broken, will it bring you down? In: Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 160–169 (2021). https://doi.org/10.1109/JCDL52503.2021.00028
Graham, M.: The Wayback Machine’s Save Page Now is New and Improved (2019). http://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/
Jayanetti, H.: How well is Instagram archived? (2020). https://ws-dl.blogspot.com/2020/11/2020-11-04-how-well-is-instagram.html
Jayanetti, H., Garg, K.: New Twitter UI: Replaying Archived Twitter Pages That Never Existed (2020). https://ws-dl.blogspot.com/2020/11/2020-11-04-new-twitter-ui-replaying.html
Jayanetti, H., Garg, K.: Twitter rewrites your URLs, but assumes you’ll never rewrite theirs: more problems replaying archived Twitter (2021). https://ws-dl.blogspot.com/2021/01/2020-01-22-twitter-rewrites-your-urls.html
Jones, S.M., Klein, M., Van de Sompel, H., et al.: Interoperability for accessing versions of web resources with the Memento protocol. In: The Past Web: Exploring Web Archives. Springer, pp. 101–126 (2021)
Kelly, M., Alam, S., Nelson, M.L., et al.: InterPlanetary wayback: peer-to-peer permanence of web archives. In: Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries, pp. 411–416 (2016). https://doi.org/10.1007/978-3-319-43997-6_35
Kreymer, I.: Pywb 2.0: technical overview and Q&A (2018). https://netpreserve.org/ga2018/workshops/pywb-2-0-technical-overview-and-qa/
Kreymer, I.: Webrecorder: Developing an Open-Source High-Fidelity Web Archiving Toolset (2019). https://2019.code4lib.org/talks/Webrecorder-Developing-an-OpenSource-HighFidelity-Web-Archiving-Toolset
Kreymer, I.: Introducing ArchiveWeb.page—local high-fidelity web archiving directly in your browser (2021). https://webrecorder.net/2021/01/18/archiveweb-page-extension.html
Kreymer, I., Berlin, J.: Wombat.js Client-Side Rewriting Library (2018). https://github.com/webrecorder/wombat
Kreymer, I., Rosenthal, D.S.H.: Guest Post: Ilya Kreymer on oldweb.today (2016). https://blog.dshr.org/2016/01/guest-post-ilya-kreymer-on-oldwebtoday.html
Kreymer, I., Rosenthal, D.S.H.: Announcing the New OldWeb.today (2020). https://webrecorder.net/2020/12/23/new-oldweb-today.html
Lerner, A., Kohno, T., Roesner, F.: Rewriting history: changing the archived web from the present. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pp. 1741–1755 (2017). https://doi.org/10.1145/3133956.3134042
MDN: XMLHttpRequest (n.d.). https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest
Nelson, M.L., Van de Sompel, H.: Adding the dimension of time to HTTP. In: SAGE Handbook of Web History. SAGE Publishing, pp. 191–214 (2019)
Nottingham, M., Fielding, R.: Additional HTTP Status Codes, RFC 6585 (2012). https://datatracker.ietf.org/doc/html/rfc6585
Ott, B.L., Dickinson, G.: The Twitter presidency: how Donald Trump’s tweets undermine democracy and threaten us all. Polit. Sci. Q. 135(4), 607–636 (2020). https://doi.org/10.1002/polq.13129
Pain, P., Masullo Chen, G.: The president is in: public opinion and the presidential use of Twitter. Soc. Media Soc. (2019). https://doi.org/10.1177/2056305119855143
Rosenthal, D.S.H.: The 47 links mystery (2019). https://blog.dshr.org/2019/03/the-47-links-mystery.html
Roth, Y., Pickles, N.: Updating our approach to misleading information (2020). https://blog.twitter.com/en_us/topics/product/2020/updating-our-approach-to-misleading-information.html
Selenium: Selenium Client Driver (2018). https://www.selenium.dev/selenium/docs/api/py/
Siddique, M.N.: Searching Web Archives for Unattributed Deleted Tweets From Politwoops (2019). https://ws-dl.blogspot.com/2019/08/2019-08-03-searching-web-archives-for.html
Siddique, M.N., Alam, S.: TweetedAt: Finding Tweet Timestamps for Pre and Post Snowflake Tweet IDs (2019). https://ws-dl.blogspot.com/2019/08/2019-08-03-tweetedat-finding-tweet.html
Starbird, K., Miller, C.: Examining Twitter’s policy against election-related misinformation in action (2020). https://www.eipartnership.net/policy-analysis/twitters-policy-election-misinfo-in-action
Summers, E.: Trump’s Tweets (2021). https://inkdroid.org/2021/01/21/trumps-tweets/
Twitter: Introducing a new Twitter.com (2019). https://blog.twitter.com/en_us/topics/product/2019/introducing-a-new-Twitter-dot-com.html
Twitter: Twitter API HTTP status codes (2020). https://developer.twitter.com/en/support/twitter-api/error-troubleshooting
Twitter: Updating our approach to misleading information (2020). https://blog.twitter.com/en_us/topics/product/2020/updating-our-approach-to-misleading-information
Twitter: Permanent suspension of @realDonaldTrump (2021). https://blog.twitter.com/en_us/topics/company/2020/suspension.html
Twitter: t.co links (n.d.). https://developer.twitter.com/en/docs/tco
Twitter Safety: An update following the riots in Washington, DC (2021). https://blog.twitter.com/en_us/topics/company/2021/protecting--the-conversation-following-the-riots-in-washington--
Van de Sompel, H., Nelson, M.L., Sanderson, R., et al.: Memento: Time Travel for the Web. Tech. Rep. (2009) arXiv:0911.1112
Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP framework for time-based access to resource states—Memento, Internet RFC 7089 (2013). http://tools.ietf.org/html/rfc7089
Watanabe, T., Shioji, E., Akiyama, M., et al.: Melting pot of origins: compromising the intermediary web services that rehost websites. In: Proceedings of the Network and Distributed System Security (NDSS) Symposium (2020). https://doi.org/10.14722/ndss.2020.24140
Wells, C., Shah, D., Lukito, J., et al.: Trump, Twitter, and news media responsiveness: a media systems approach. New Media Soc. 22(4), 659–682 (2020)
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Garg, K., Jayanetti, H.R., Alam, S. et al. Challenges in replaying archived Twitter pages. Int J Digit Libr (2023). https://doi.org/10.1007/s00799-023-00379-w
DOI: https://doi.org/10.1007/s00799-023-00379-w