Abstract

Historians and researchers rely on web archives to preserve social media content that no longer exists on the live web. However, what we see on the live web and how it is replayed in the archive are not always the same. In this study, we document and analyze the problems in archiving Twitter after it switched to a new user interface (UI) in June 2020. Most web archives could not archive the new UI, so archived Twitter pages displayed Twitter’s “Something went wrong” error. These challenges forced web archives to continue using the old UI. However, features such as Twitter labels were part of the new UI; hence, web archives capturing the old UI would be missing these labels. To analyze the potential loss of information in web archival data due to this change, we used the personal Twitter account of the 45th President of the USA, @realDonaldTrump, which Twitter suspended on January 8, 2021. Twitter heavily labeled Trump’s account for spreading misinformation; however, we found no evidence in web archives that some of his tweets ever had a label assigned to them. We also studied the possibility of temporal violations in archived versions of the new UI, which may result in the replay of pages that never existed on the live web. In addition, we discovered that when some tweets with embedded media are replayed, portions of the rewritten t.co URL, meant to be hidden from the end-user, are exposed in the replayed page. Our goal is to educate researchers who may use web archives and to caution them when drawing conclusions based on archived Twitter pages.

Notes

  1. This paper is an extended version of a paper that originally appeared in ACM/IEEE JCDL 2021 [22].

  2. https://www.thetrumparchive.com/.

  3. https://factba.se/topic/flagged-tweets.

  4. https://gist.github.com/ibnesayeed/c7e5773318d6ea041984fb2433bf1d1e.

  5. https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server/.

  6. http://bit.ly/tweetJSONtimemap.

  7. https://www.bl.uk/collection-guides/uk-web-archive.

  8. https://web.archive.org/save.

References

  1. Acker, A., Chaiet, M.: The weaponization of web archives: data craft and COVID-19 publics. Harvard Kennedy School (HKS) Misinformation Review (2020). https://doi.org/10.37016/mr-2020-41

  2. Ainsworth, S.G., Nelson, M.L., Van de Sompel, H.: A framework for evaluation of composite memento temporal coherence. Tech. Rep. (2014) arXiv:1402.0928

  3. Ainsworth, S.G., Nelson, M.L., Van de Sompel, H.: Only one out of five archived web pages existed as presented. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media, pp. 257–266 (2015)

  4. Alam, S.: Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages (2019). https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html

  5. Alam, S., Berlin, J.A.: Reconstructive: A ServiceWorker for Client-Side Reconstruction of Composite Mementos (2017). https://oduwsdl.github.io/Reconstructive/

  6. Alam, S., Kelly, M.: InterPlanetary Wayback: Peer-to-Peer Permanence of Web Archives (2016). https://github.com/oduwsdl/ipwb

  7. Alam, S., Nelson, M.L.: MemGator—a portable concurrent memento aggregator: cross-platform CLI and server binaries in Go. In: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’16, pp. 243–244 (2016)

  8. Alam, S., Vargas, P.: Cookies Are Why Your Archived Twitter Page Is Not in English (2018). https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html

  9. Alam, S., Kelly, M., Nelson, M.L.: InterPlanetary Wayback: the permanent web archive. In: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’16, pp. 273–274 (2016). https://doi.org/10.1145/2910896.2925467

  10. Alam, S., Kelly, M., Weigle, M.C., et al.: Client-side reconstruction of composite mementos using service worker. In: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’17, pp. 237–240 (2017). https://doi.org/10.1109/JCDL.2017.7991579

  11. Alam, S., Vargas, P., Weigle, M.C., et al.: Impact of HTTP cookie violations in web archives. Tech. Rep. (2019) arXiv:1906.07141

  12. Alam, S., Weigle, M.C., Nelson, M.L., et al.: Supporting web archiving via web packaging. Tech. Rep. (2019) arXiv:1906.07104

  13. Berlin, J.: 2017-01-20: CNN.com has been unarchivable since November 1st, 2016 (2017). https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable

  14. Blumenthal, K.R.: The stack: high fidelity web collecting at scale with Brozzler (2020). https://archive-it.org/blog/post/the-stack-brozzler/

  15. Bray, T.: An HTTP Status Code to Report Legal Obstacles, RFC 7725 (2016). https://datatracker.ietf.org/doc/html/rfc7725

  16. Brunelle, J.F.: Zombies in the archives (2012). https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html

  17. Fielding, R., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content, RFC 7231 (2014). https://datatracker.ietf.org/doc/html/rfc7231

  18. Gadde, V., Beykpour, K.: Additional steps we’re taking ahead of the 2020 US Election (2020). https://blog.twitter.com/en_us/topics/company/2020/2020-election-changes.html

  19. Garg, K., Jayanetti, H.: Twitter Added Labels on Its Old User Interface (2020). https://ws-dl.blogspot.com/2020/12/2020-12-08-twitter-added-labels-on-its.html

  20. Garg, K., Jayanetti, H.: Twitter was Already Difficult to Archive, Now It’s Worse! (2020). https://ws-dl.blogspot.com/2020/07/2020-07-15-twitter-was-already.html

  21. Garg, K., Jayanetti, H.: TwitterLabels (2021). https://github.com/oduwsdl/TwitterLabels

  22. Garg, K., Jayanetti, H.R., Alam, S., et al.: Replaying archived Twitter: when your bird is broken, will it bring you down? In: Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 160–169 (2021). https://doi.org/10.1109/JCDL52503.2021.00028

  23. Graham, M.: The Wayback Machine’s Save Page Now is New and Improved (2019). http://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/

  24. Jayanetti, H.: How well is Instagram archived? (2020). https://ws-dl.blogspot.com/2020/11/2020-11-04-how-well-is-instagram.html

  25. Jayanetti, H., Garg, K.: New Twitter UI: Replaying Archived Twitter Pages That Never Existed (2020). https://ws-dl.blogspot.com/2020/11/2020-11-04-new-twitter-ui-replaying.html

  26. Jayanetti, H., Garg, K.: Twitter rewrites your URLs, but assumes you’ll never rewrite theirs: more problems replaying archived Twitter (2021). https://ws-dl.blogspot.com/2021/01/2020-01-22-twitter-rewrites-your-urls.html

  27. Jones, S.M., Klein, M., Van de Sompel, H., et al.: Interoperability for accessing versions of web resources with the Memento protocol. In: The Past Web: Exploring Web Archives. Springer, pp. 101–126 (2021)

  28. Kelly, M., Alam, S., Nelson, M.L., et al.: InterPlanetary wayback: peer-to-peer permanence of web archives. In: Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries, pp. 411–416 (2016). https://doi.org/10.1007/978-3-319-43997-6_35

  29. Kreymer, I.: Pywb 2.0: technical overview and Q&A (2018). https://netpreserve.org/ga2018/workshops/pywb-2-0-technical-overview-and-qa/

  30. Kreymer, I.: Webrecorder: Developing an Open-Source High-Fidelity Web Archiving Toolset (2019). https://2019.code4lib.org/talks/Webrecorder-Developing-an-OpenSource-HighFidelity-Web-Archiving-Toolset

  31. Kreymer, I.: Introducing ArchiveWeb.page—local high-fidelity web archiving directly in your browser (2021). https://webrecorder.net/2021/01/18/archiveweb-page-extension.html

  32. Kreymer, I., Berlin, J.: Wombat.js Client-Side Rewriting Library (2018). https://github.com/webrecorder/wombat

  33. Kreymer, I., Rosenthal, D.S.H.: Guest Post: Ilya Kreymer on oldweb.today (2016). https://blog.dshr.org/2016/01/guest-post-ilya-kreymer-on-oldwebtoday.html

  34. Kreymer, I., Rosenthal, D.S.H.: Announcing the New OldWeb.today (2020). https://webrecorder.net/2020/12/23/new-oldweb-today.html

  35. Lerner, A., Kohno, T., Roesner, F.: Rewriting history: changing the archived web from the present. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pp. 1741–1755 (2017). https://doi.org/10.1145/3133956.3134042

  36. MDN: XMLHttpRequest (n.d.). https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest

  37. Nelson, M.L., Van de Sompel, H.: Adding the dimension of time to HTTP. In: SAGE Handbook of Web History. SAGE Publishing, pp. 191–214 (2019)

  38. Nottingham, M., Fielding, R.: Additional HTTP Status Codes, RFC 6585 (2012). https://datatracker.ietf.org/doc/html/rfc6585

  39. Ott, B.L., Dickinson, G.: The Twitter presidency: how Donald Trump’s tweets undermine democracy and threaten us all. Polit. Sci. Q. 135(4), 607–636 (2020). https://doi.org/10.1002/polq.13129

  40. Pain, P., Masullo Chen, G.: The president is in: public opinion and the presidential use of Twitter. Soc. Media Soc. (2019). https://doi.org/10.1177/2056305119855143

  41. Rosenthal, D.S.H.: The 47 links mystery (2019). https://blog.dshr.org/2019/03/the-47-links-mystery.html

  42. Roth, Y., Pickles, N.: Updating our approach to misleading information (2020). https://blog.twitter.com/en_us/topics/product/2020/updating-our-approach-to-misleading-information.html

  43. Selenium: Selenium Client Driver (2018). selenium.dev/selenium/docs/api/py/

  44. Siddique, M.N.: Searching Web Archives for Unattributed Deleted Tweets From Politwoops (2019). https://ws-dl.blogspot.com/2019/08/2019-08-03-searching-web-archives-for.html

  45. Siddique, M.N., Alam, S.: TweetedAt: Finding Tweet Timestamps for Pre and Post Snowflake Tweet IDs (2019). https://ws-dl.blogspot.com/2019/08/2019-08-03-tweetedat-finding-tweet.html

  46. Starbird, K., Miller, C.: Examining Twitter’s policy against election-related misinformation in action (2020). https://www.eipartnership.net/policy-analysis/twitters-policy-election-misinfo-in-action

  47. Summers, E.: Trump’s Tweets (2021). https://inkdroid.org/2021/01/21/trumps-tweets/

  48. Twitter: Introducing a new Twitter.com (2019). https://blog.twitter.com/en_us/topics/product/2019/introducing-a-new-Twitter-dot-com.html

  49. Twitter: Twitter API HTTP status codes (2020). https://developer.twitter.com/en/support/twitter-api/error-troubleshooting

  50. Twitter: Updating our approach to misleading information (2020). https://blog.twitter.com/en_us/topics/product/2020/updating-our-approach-to-misleading-information

  51. Twitter: Permanent suspension of @realDonaldTrump (2021). https://blog.twitter.com/en_us/topics/company/2020/suspension.html

  52. Twitter: t.co links (n.d.). https://developer.twitter.com/en/docs/tco

  53. Twitter Safety: An update following the riots in Washington, DC (2021). https://blog.twitter.com/en_us/topics/company/2021/protecting--the-conversation-following-the-riots-in-washington--

  54. Van de Sompel, H., Nelson, M.L., Sanderson, R., et al.: Memento: Time Travel for the Web. Tech. Rep. (2009) arXiv:0911.1112

  55. Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP framework for time-based access to resource states—Memento, Internet RFC 7089 (2013). http://tools.ietf.org/html/rfc7089

  56. Watanabe, T., Shioji, E., Akiyama, M., et al.: Melting pot of origins: compromising the intermediary web services that Rehost websites. In: Proceedings of the Network and Distributed System Security (NDSS) Symposium (2020). https://doi.org/10.14722/ndss.2020.24140

  57. Wells, C., Shah, D., Lukito, J., et al.: Trump, Twitter, and news media responsiveness: a media systems approach. New Media Soc. 22(4), 659–682 (2020)

Author information

Corresponding author

Correspondence to Kritika Garg.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Garg, K., Jayanetti, H.R., Alam, S. et al. Challenges in replaying archived Twitter pages. Int J Digit Libr (2023). https://doi.org/10.1007/s00799-023-00379-w

  • DOI: https://doi.org/10.1007/s00799-023-00379-w
