What you tweet is what we get?

About the scientific use of Twitter data

Summary

Eleven years after the first tweet was published, the microblogging service Twitter has a strong presence in society, media and science. The large number of studies using Twitter data shows that tweets are a popular data source for scientific work. This can be explained mainly by the largely free and technically accessible data and by the clear, open communication structure. Nevertheless, the service is suitable for research only to a limited extent: restricted representativeness and explanatory power, data availability that is limited in time span and volume, and low data quality reduce its scientific value. The peculiarities of Internet language and missing metrics further complicate the content analysis of the messages. In addition, the increasing spread of bots, which now generate a large share of the communication on Twitter, poses a major challenge. Using a case study, the paper assesses the scientific value of Twitter data by working out problems in data collection, analysis and interpretation. This is intended not only to contribute to a more cautious and critical scientific handling of Twitter data, but also to raise the question of how far Twitter data can be significant for science in the future at all.

Abstract

Twitter has a strong presence in modern society, media and science. The number of studies using Twitter data, not only in communication research, shows that tweets are a popular data source for science. This popularity can be explained by the mostly free and technically accessible data, as well as by the distinct and open communication structure. Even though much research is based on Twitter data, it is suitable for research only to a limited extent. For example, some studies have already revealed that Twitter data has low explanatory power when predicting election outcomes. Furthermore, the rise of automated communication by bots is an urgent problem for Twitter data analysis. Although critical aspects of Twitter data have already been discussed to some extent (mostly in the final remarks of studies), comprehensive evaluations of data quality are relatively rare.

To contribute to a deeper understanding of the problems in the scientific use of Twitter data, and thus to a more deliberate and critical handling of this data, the study examines different aspects of data quality, usability and explanatory power. Based on previous research on data quality, it takes a critical look along four dimensions: availability and completeness, quality (regarding authenticity, reliability and interpretability), language, and representativeness. Using a small case study, the paper evaluates the scientific use of Twitter data by elaborating on problems in data collection, analysis and interpretation. For this illustrative purpose, the author gathered data, as is common practice, via Twitter’s Streaming APIs (POST statuses/filter): 73,194 tweets containing the search term “#merkel”, collected between 20 and 24 February 2017 (8 pm each).

Concerning data availability and completeness, several aspects diminish data usability. Twitter provides two types of data gateways: Streaming APIs (for real-time data) and REST APIs (for historical data). The only freely available Streaming API bandwidth, Spritzer, is limited to one percent of the overall (global) tweet volume at any given time. This limit is a common problem when collecting Twitter data on major events such as elections and sports events. The REST APIs do not usually provide data older than seven days. Furthermore, Twitter gives no information about the total or search-term-related tweet volume at any time.
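The seven-day reach of the REST search can be made concrete with a small sketch. It only builds a request URL and never sends it. The endpoint path is the v1.1 search URL as documented at the time of the study and may no longer be served; `q`, `result_type`, `count` and `until` are parameters of that endpoint, while the helper name `build_search_url` and the window check are illustrative assumptions, not part of the study's tooling.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# v1.1 search endpoint as documented in 2017; treat it as historical.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
RESULT_TYPES = {"recent", "popular", "mixed"}
SEARCH_WINDOW_DAYS = 7  # the REST search usually reaches back about a week


def build_search_url(term, result_type="recent", until=None, today=None):
    """Return a search URL, refusing dates the REST search would not cover."""
    if result_type not in RESULT_TYPES:
        raise ValueError(f"unknown result_type: {result_type!r}")
    today = today or date.today()
    params = {"q": term, "result_type": result_type, "count": 100}
    if until is not None:
        # `until` asks for tweets created before the given day.
        if until < today - timedelta(days=SEARCH_WINDOW_DAYS):
            raise ValueError("requested date lies outside the seven-day window")
        params["until"] = until.isoformat()
    return f"{SEARCH_URL}?{urlencode(params)}"
```

Calling `build_search_url("#merkel", until=date(2017, 2, 24), today=date(2017, 2, 25))` yields a URL with the hashtag percent-encoded, whereas a date several weeks back raises an error instead of silently returning an empty result set.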

In addition to incomplete data, several quality-related aspects complicate data gathering and analysis, such as the lack of user-specific and verified information (age, gender, location), inconsistent hashtag usage, missing conversational context and poor data and user authenticity. Geo data on Twitter is, if available at all, rarely correct and not useful for filtering relevant tweets. Searching and filtering relevant tweets by search terms can be deceptive, because not every tweet concerning a topic contains corresponding hashtags. Furthermore, it is difficult to find a perfect search term for broader and dynamically changing topics. Besides, the missing conversational context of tweets impedes the interpretation of statements (especially with regard to irony or sarcasm). In addition, the rise of social bots diminishes dataset quality enormously. In the dataset generated for this work, only three of the top 30 accounts (by tweet count) could be directly identified as genuine. One fourth of all accounts in this dataset generated about 60% of all tweets. If the most active accounts predominantly consist of bots, the negative impact on data quality is immense.
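A concentration figure of this kind (a quarter of the accounts producing most of the tweets) can be computed from nothing more than the author of each collected tweet. The following is a minimal sketch under that assumption; the function name `top_accounts_share` and the toy data are hypothetical and do not reproduce the study's dataset or code.

```python
from collections import Counter
from math import ceil


def top_accounts_share(authors, fraction=0.25):
    """Share of all tweets written by the most active `fraction` of accounts."""
    counts = Counter(authors)  # number of tweets per account
    if not counts:
        return 0.0
    n_top = max(1, ceil(len(counts) * fraction))
    top_total = sum(n for _, n in counts.most_common(n_top))
    return top_total / sum(counts.values())


# Toy data, not the study's dataset: one author name per collected tweet.
authors = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
print(top_accounts_share(authors))  # four accounts, top quarter is "a" -> 0.6
```

A value far above the chosen fraction signals that a few accounts dominate the sample, which is the situation in which bot activity among those accounts matters most.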

Another problem of Twitter analysis is Internet language. Emojis can be misinterpreted, and abbreviations, neologisms, mixed languages and a lack of grammar impede text analysis. Beyond data quality in general, the quality of tweet content and its representativeness are crucial. This work compares user statistics with research articles indexed in SCOPUS as well as with the media coverage of two selected German quality newspapers. Compared to its user count, Twitter is enormously overrepresented in media and science. Only 16% of German adults (over 18 years of age) are monthly active users (MAUs), and merely four percent are daily active users.

Considering all the problems presented, Twitter can be a good data source for research, but only to a limited extent. Researchers must bear in mind that Twitter does not guarantee complete, reliable and representative data. Ignoring these critical points can lead data analysis astray. While Twitter data can be suitable for specific case studies, such as the usage and spread of selected hashtags or the Twitter usage of specific politicians, it cannot be used for broader, nation-level questions such as predicting elections or gauging public opinion on a specific topic. Twitter has low representativeness and is mostly an “elite medium” with an uncertain future (given the stagnating number of users and financial problems).

Notes

  1. SCOPUS query: TITLE-ABS-KEY (twitter). As of 15 October 2017.

  2. Knight and Burn (2005) give a systematic research overview of quality criteria for data.

  3. API stands for Application Programming Interface.

  4. The freely available bandwidth (Spritzer) amounts to at most one percent of the total Twitter volume. In addition, further bandwidths exist that are subject to a fee and not generally accessible: Decahose (10%) and Firehose (100%).

  5. For the REST API GET search/tweets, the search parameter result_type offers three query options: recent (returns only the most recent tweets for a search query), popular (only the most popular tweets) and mixed (popular and recent tweets combined).

  6. The program is based on the Python package Tweepy (http://tweepy.org) and collects all data with a search term filter via the Streaming APIs, specifically their endpoint POST statuses/filter. No rate limits occurred during the study period.

  7. For example, some bots use currently popular hashtags to increase the visibility of their own messages (mostly spam) (cf. Marechal 2016).

  8. An MAU is any account that logs in or connects to a Twitter service at least once a month, regardless of whether this happens deliberately or automatically.

  9. For the manual account analysis, the author compared general user metrics (such as tweet and retweet frequency) of the 30 accounts with the highest tweet count in the dataset. For example, an account was very likely a bot if it had a very high tweet volume (> 1000/day) or only shared other tweets.

  10. User names anonymized by the author.

  11. In November 2017, Twitter doubled the limit to 280 characters on a trial basis.
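The two example criteria from note 9 (a volume above 1000 tweets per day, or an account that only shares other tweets) can be written down as a simple screening rule. The note describes a manual assessment, so the sketch below only mechanizes those two criteria as a first pass; the record type `AccountStats`, its field names and both function names are hypothetical.

```python
from dataclasses import dataclass

MAX_HUMAN_TWEETS_PER_DAY = 1000  # threshold named in note 9


@dataclass
class AccountStats:
    # Hypothetical record; the field names are illustrative only.
    name: str
    tweets_per_day: float     # average posting rate of the account
    tweets_in_sample: int     # tweets by this account in the dataset
    retweets_in_sample: int   # how many of those are retweets


def likely_bot(acc):
    """Flag accounts with extreme volume or with nothing but retweets."""
    only_shares = (
        acc.tweets_in_sample > 0
        and acc.retweets_in_sample == acc.tweets_in_sample
    )
    return acc.tweets_per_day > MAX_HUMAN_TWEETS_PER_DAY or only_shares


def flag_top_accounts(accounts, top_n=30):
    """Apply the heuristic to the `top_n` most active accounts in the sample."""
    ranked = sorted(accounts, key=lambda a: a.tweets_in_sample, reverse=True)
    return {a.name: likely_bot(a) for a in ranked[:top_n]}
```

Such a rule can only pre-sort accounts for closer inspection. A flagged account is not thereby proven to be automated, and an unflagged one is not proven to be genuine.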

References

  1. Alexander, J. E., & Tate, M. A. (1999). Web wisdom; how to evaluate and create information quality on the web. Hillsdale: Lawrence Erlbaum.

  2. ARD (2017). Was ist eigentlich Teletwitter? http://www.daserste.de/community/diskutieren/foren/social-media/teletwitter-102.html. Accessed: 1 Oct. 2017.

  3. Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2, 1–8.

  4. Burnap, P., Gibson, R., Sloan, L., Southern, R., & Williams, M. (2016). 140 characters to victory? Using Twitter to predict the UK 2015 general election. Electoral Studies, 41, 230–233.

  5. Cai, L., & Zhu, Y. (2015). The challenges of data quality and data quality assessment in the big data era. Data Science Journal, 14(2), 1–10.

  6. Carter, S., Weerkamp, W., & Tsagkias, M. (2013). Microblog language identification. Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47, 195–215.

  7. Cheng, Z., Caverlee, J., & Lee, K. (2010). You are where you tweet. Proceedings of the 19th ACM International Conference on Information and Knowledge Management. (pp. 759–768). New York: ACM.

  8. Chew, C., & Eysenbach, G. (2010). Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak. PloS one, 5, e14118. https://doi.org/10.1371/journal.pone.0014118.

  9. Diakopoulos, N. A., & Shamma, D. A. (2010). Characterizing debate performance via aggregated twitter sentiment. Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems. https://doi.org/10.1145/1753326.1753504.

  10. Dusch, A., Gerbig, S., Lake, M., Lorenz, S., Pfaffenberger, F., & Schulze, U. (2015). Post, reply, retweet – Einsatz und Resonanz von Twitter im Bundestagswahlkampf 2013. In C. Holtz-Bacha (Ed.), Die Massenmedien im Wahlkampf (pp. 275–294). Wiesbaden: Springer VS.

  11. Facebook Inc. (2016). Form 10-Q, Facebook, Inc. https://s21.q4cdn.com/399680738/files/doc_financials/2016/Q3/Facebook-Q3FY16-10-Q.pdf. Accessed: 1 Dec. 2016.

  12. Ferrara, E., Varol, O., Davis, C., Menczer, F., & Flammini, A. (2016). The rise of social bots. Communications of the ACM, 59(7), 96–104.

  13. Gainous, J., & Wagner, K. M. (2014). Tweeting to power: the social media revolution in American politics. Oxford: Oxford University Press.

  14. Gaurav, M., Srivastava, A., Kumar, A., & Miller, S. (2013). Leveraging candidate popularity on Twitter to predict election outcome. Proceedings of the 7th Workshop on Social Network Mining and Analysis – SNAKDD ’13. (pp. 1–8).

  15. Gayo-Avello, D. (2012). No, you cannot predict elections with Twitter. IEEE Internet Computing, 16(6), 91–94.

  16. Gesellschaft für integrierte Kommunikationsforschung (2017). Best for planning (b4p) 2017 II. http://www.b4p.media/online-auswertung. Accessed: 1 Oct. 2017.

  17. Ginnis, S., & Miller, C. (2017). #GE2015. The general election on Twitter. In D. Wring, R. Mortimore & S. Atkinson (Eds.), Political communication in Britain: polling, campaigning and media in the 2015 general election (pp. 315–328). Cham: Springer.

  18. Graham, M., Hale, S. A., & Gaffney, D. (2014). Where in the world are you? Geolocation and language identification in Twitter. The Professional Geographer, 66, 568–578.

  19. Hasan, S., Zhan, X., & Ukkusuri, S. V. (2013). Understanding urban human activity and mobility patterns using large-scale location-based data from online social media. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2505821.2505823b.

  20. Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2014). Geo-located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science, 41, 260–271.

  21. Holtz-Bacha, C., & Zeh, R. (2017). Tweeting to the press? Effects of political Twitter activity on offline media in the 2013 German election campaign. In C. Holtz-Bacha, M. R. Just & R. Davis (Eds.), Twitter and elections around the world. Campaigning in 140 characters or less. Routledge studies in global information, politics and society: Vol. 11. (pp. 27–42). London: Routledge.

  22. Howard, P. N., & Kollanyi, B. (2016). Bots, #strongerin, and #brexit: computational propaganda during the UK-EU referendum. https://ssrn.com/abstract=2798311. Accessed: 2 Jan. 2017.

  23. Huberman, B. A., Romero, D. M., & Wu, F. (2008). Social networks that matter: Twitter under the microscope. First Monday. https://doi.org/10.5210/fm.v14i1.2317.

  24. Huberty, M. (2015). Can we vote with our tweet? On the perennial difficulty of election forecasting with social media. International Journal of Forecasting, 31, 992–1007.

  25. Immonen, A., Pääkkönen, P., & Ovaska, E. (2015). Evaluating the quality of social media data in big data architecture. IEEE Access, 3, 2028–2043.

  26. Instagram (2016). Stats, Instagram. https://www.instagram.com/press. Accessed: 1 Jan. 2016.

  27. Ji, X., Chun, S. A., & Geller, J. (2012). Epidemic outbreak and spread detection system based on Twitter data. In J. He, X. Liu, E. A. Krupinski & G. Xu (Eds.), Health Information Science: First International Conference, HIS 2012 (pp. 152–163). Berlin: Springer.

  28. Jungherr, A., Jurgens, P., & Schoen, H. (2012). Why the Pirate Party won the German Election of 2009 or the trouble with predictions: a response to Tumasjan, A., Sprenger, T. O., Sander, P. G., & Welpe, I. M. “Predicting elections with Twitter: What 140 characters reveal about political sentiment”. Social Science Computer Review, 30, 229–234.

  29. Knight, S.-A., & Burn, J. (2005). Developing a framework for assessing information quality on the World Wide Web. Informing Science, 8, 159–172. https://doi.org/10.28945/493

  30. Kruikemeier, S. (2014). How political candidates use Twitter and the impact on votes. Computers in Human Behavior, 34, 131–139.

  31. Larsson, A. O., & Moe, H. (2012). Studying political microblogging: Twitter users in the 2010 Swedish election campaign. New Media & Society, 14, 729–747.

  32. Larsson, A. O., & Moe, H. (2015). Bots or journalists? News sharing on Twitter. Communications, 40, 361–370.

  33. Lui, M., & Baldwin, T. (2014). Accurate language identification of twitter messages. Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM). (pp. 17–25). Göteborg: EACL.

  34. Marechal, N. (2016). When bots tweet: toward a normative framework for bots on social networking sites. International Journal of Communication, 10, 5022–5031.

  35. McGee, J., Caverlee, J., & Cheng, Z. (2013). Location prediction in social media based on tie strength. Proceedings of the 22nd ACM international conference on Information & Knowledge Management. (pp. 459–468). New York: ACM Press.

  36. Metaxas, P. T., Mustafaraj, E., & Gayo-Avello, D. (2011). How (not) to predict elections. Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom). (pp. 165–171). Boston, Massachusetts: IEEE Press.

  37. Miller, C., Ginnis, S., Stobart, R., Krasodomski-Jones, A., & Clemence, M. (2015). The road to representivity. A Demos and Ipsos MORI report on sociological research using Twitter. https://www.demos.co.uk/iles/Road_to_representivity_final.pdf?1441811336. Accessed: 12 Oct. 2017.

  38. Miller, H., Thebault-Spieker, J., Chang, S., Johnson, I., Terveen, L., & Hecht, B. (2016). “Blissfully happy” or “ready to fight”: varying interpretations of Emoji. Proceedings of ICWSM 2016. (pp. 259–268). Köln: AAAI Press.

  39. Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from twitter’s streaming API with twitter’s firehose. Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media. (pp. 400–408). Palo Alto: AAAI Press.

  40. Murthy, D., & Longwell, S. A. (2013). Twitter and disasters. Information, Communication & Society, 16, 837–855.

  41. Naumann, F., & Rolker, C. (2000). Assessment methods for information quality criteria. Proceedings of 5th International Conference on Information Quality. (pp. 148–162).

  42. Oelsner, K., & Heimrich, L. (2015). Social media use of German politicians. Towards dialogic voter relations? German Politics, 24, 451–468.

  43. Park, J., Barash, V., Fink, C., & Cha, M. (2013). Emoticon style: interpreting differences in emoticons across cultures. Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media. (pp. 466–475). Palo Alto: AAAI Press.

  44. Pew Research Center (2015). Social media update 2014. http://www.pewinternet.org/2015/01/09/social-media-update-2014/. Accessed: 30 Nov. 2016.

  45. Pew Research Center (2016). Social media update 2016. Facebook usage and engagement is on the rise, while adoption of other platforms holds steady. http://assets.pewresearch.org/wp-content/uploads/sites/14/2016/11/10132827/PI_2016.11.11_Social-Media-Update_FINAL.pdf. Accessed: 2 Jan. 2016.

  46. Stieglitz, S., & Dang-Xuan, L. (2012). Political communication and influence through microblogging. An empirical analysis of sentiment in Twitter messages and retweet behaviour. Proceedings of 45th Hawaii International Conference on System Sciences. (pp. 3500–3509). Hawaii: IEEE.

  47. Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Election forecasts with twitter. Social Science Computer Review, 29, 402–418.

  48. Tumblr Inc. (2016). About Tumblr, Inc. https://www.tumblr.com/about. Accessed: 1 Dec. 2016.

  49. Twitter Inc. (2014a). Insights into the #WorldCup conversation on Twitter. https://blog.twitter.com/2014/insights-into-the-worldcup-conversation-on-twitter (Created: 14 July 2014). Accessed: 1 Feb. 2017.

  50. Twitter Inc. (2014b). Open sourcing Twitter emoji for everyone. https://blog.twitter.com/2014/open-sourcing-twitter-emoji-for-everyone (Created: 6 Nov. 2014). Accessed: 1 Dec. 2016.

  51. Twitter Inc. (2016a). Never miss important Tweets from people you follow. https://blog.twitter.com/2016/never-miss-important-tweets-from-people-you-follow (Created: 10 Feb. 2016). Accessed: 1 Dec. 2016.

  52. Twitter Inc. (2016b). Coming soon: express even more in 140 characters. https://blog.twitter.com/express-even-more-in-140-characters (Created: 24 May 2016). Accessed: 1 Dec. 2016.

  53. Twitter Inc. (2016c). Form 10-Q. https://investor.twitterinc.com/secfiling.cfm?filingID=1564590-16-26749&CIK=1418091 (Created: 30 Sept. 2016). Accessed: 1 Dec. 2016.

  54. Twitter Inc. (2016d). Selected company metrics and financials. http://files.shareholder.com/downloads/AMDA-2F526X/3109864881x0x913987/910DA927-7E1D-4A16-9E40-9E89D4F4553E/Q316_Selected_Company_Metrics_and_Financials.pdf (Created: 19 Oct.). Accessed: 1 Dec. 2016.

  55. Twitter Inc. (2017a). An update on safety. https://blog.twitter.com/2017/an-update-on-safety (Created: 7 Feb. 2017). Accessed: 24 Feb. 2017.

  56. Twitter Inc. (2017b). Twitter developer documentation. Search tweets. Parameters. https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets. Accessed: 10 Oct. 2017.

  57. Twitter Inc. (2017c). Q317 – Selected company metrics and financials. http://files.shareholder.com/downloads/AMDA-2F526X/5439610324x0x961126/1C3B5760-08BC-4637-ABA1-A9423C80F1F4/Q317_Selected_Company_Metrics_and_Financials.pdf. Accessed: 24 Oct. 2017.

  58. Unicode Inc. (2016). Full Emoji list, v5.0. http://unicode.org/emoji/charts/full-emoji-list.html. Accessed: 19 Mar. 2017.

  59. Vaccari, C., Valeriani, A., Barberá, P., Bonneau, R., Jost, J. T., Nagler, J., & Tucker, J. (2013). Social media and political communication: a survey of twitter users during the 2013 Italian general election. Rivista italiana di scienza politica, 43, 381–410.

  60. Vieweg, S., Hughes, A. L., Starbird, K., & Palen, L. (2010). Microblogging during two natural hazards events. In CHI2010 (pp. 1079–1088).

  61. Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy. What data quality means to data consumers. Journal of management information systems, 12(4), 5–33.

  62. Woolley, S. (2016). Automating power: social bot interference in global politics. First Monday, 21(4) https://doi.org/10.5210/fm.v21i4.6161.

Author information

Correspondence to Fabian Pfaffenberger, M.Sc.

Cite this article

Pfaffenberger, F. What you tweet is what we get? Publizistik 63, 53–72 (2018). https://doi.org/10.1007/s11616-017-0400-2

Keywords

  • Twitter
  • Data quality
  • Social media
  • Methods
  • Analysis