Twigraph: Discovering and Visualizing Influential Words Between Twitter Profiles

  • Conference paper
  • Social Informatics (SocInfo 2017)

Abstract

Social media use keeps growing, and people are more connected with each other than ever before, yet these vast connections remain visually unexplored. We propose Twigraph, a methodology for exploring the connections between people through their Twitter profiles. First, we propose a hybrid approach for recommending social media profiles, articles, and advertisements to a user. Profiles are recommended based on the similarity score between the user profile and the profile under evaluation. The similarity between a set of profiles is investigated by finding the top influential words that cause a high similarity, using an Influence Term Metric computed for each word. Then, we group profiles from various domains such as politics, sports, and entertainment based on the similarity score through a novel clustering algorithm. The connectivity between profiles is visualized using word graphs, which help in finding the words that connect a set of profiles and the profiles that are connected to a word. Finally, we analyze the top influential words over a set of profiles through clustering on the similarity of those profiles, which makes it possible to break down a Twitter profile with many followers into fine-grained word connections using word graphs. The proposed method was implemented on datasets comprising 1.1 M tweets obtained from Twitter. Experimental results show that the resulting influential words were highly representative of the relationship between two profiles or a set of profiles.

Author information

Correspondence to Dhanasekar Sundararaman.

Appendices

Appendix 1

1.1 Document Modeling

Example of document modeling.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a way of normalization.

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms such as “is,” “of,” and “that” may appear many times but have little importance. IDF therefore weighs down terms that occur in many documents and scales up rare ones.

Document 1: data mining and social media mining

Document 2: social network analysis

Document 3: data mining

Tables 4, 5 and 6 show the normalized term frequency of each term in documents 1, 2, and 3, respectively. Table 7 shows the IDF of all terms in documents 1, 2 and 3. Table 8 shows the TF-IDF of the top words from documents 1, 2 and 3. From Table 8, it is evident that TF-IDF is high for the important words in each document and that stopwords are effectively ignored.

Table 4. Normalized TF of terms in document 1
Table 5. Normalized TF of terms in document 2
Table 6. Normalized TF of terms in document 3
Table 7. IDF of terms in corpus
Table 8. Terms with top TF-IDF scores in each document
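The computation described in this appendix can be sketched for the three example documents as follows. This is a minimal illustration and not the authors' implementation: the whitespace tokenizer, the natural-logarithm form of IDF, and the function names `normalized_tf` and `inverse_document_frequency` are assumptions made here.

```python
import math
from collections import Counter

# The three example documents from this appendix.
docs = {
    "Document 1": "data mining and social media mining",
    "Document 2": "social network analysis",
    "Document 3": "data mining",
}

def normalized_tf(text):
    """Term count divided by document length (total number of terms)."""
    tokens = text.split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def inverse_document_frequency(texts):
    """Assumed form: ln(number of documents / documents containing the term)."""
    texts = list(texts)
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.split()))
    return {term: math.log(len(texts) / df) for term, df in doc_freq.items()}

tf = {name: normalized_tf(text) for name, text in docs.items()}
idf = inverse_document_frequency(docs.values())

# TF-IDF is the product of the two quantities for each term in each document.
tfidf = {
    name: {term: weight * idf[term] for term, weight in weights.items()}
    for name, weights in tf.items()
}

for name, scores in tfidf.items():
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(name, [(term, round(score, 3)) for term, score in ranked])
```

The printed values need not match Tables 4 to 8, since the logarithm base and any stopword filtering used for those tables are not stated in this appendix.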

Appendix 2

2.1 Influential Words Based on Chronological Order

Instead of using distance as the metric for admitting a profile into the cluster, profiles enter the cluster in chronological order. Chronological entry is adopted to trace the influential words a profile possesses that attracted a potential follower, assuming that the social media platform recommends users to follow based on the influential words between profiles that we propose. The technique can be used to break down a Twitter profile with many followers into a list of words and to understand the relationship between that profile and its followers (Table 9).

Table 9. Top 3 influential words between cluster and incoming profile (chronology-based)

Used properly, this metric would enable us to decompose a complex profile with a large follower base, such as that of a celebrity, and detect its top influential words. We can then analyze in detail why people follow celebrities and which keywords make a difference. Public relations officers and campaign managers for political candidates can use this analysis to target voting blocs.
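The chronological variant can be sketched as a loop that admits followers in the order they arrived and records the top shared words at each step. This is only an illustration under stated assumptions: the Influence Term Metric is not defined in this appendix, so the score `min(cluster weight, profile weight)` is a hypothetical stand-in, the running-mean cluster update is likewise assumed, and all names and weights below are made up.

```python
def top_shared_words(cluster, profile, k=3):
    """Rank words present in both the cluster and the incoming profile.

    The score min(cluster weight, profile weight) is a hypothetical stand-in
    for the Influence Term Metric, which is not defined in this appendix.
    """
    shared = set(cluster) & set(profile)
    ranked = sorted(shared, key=lambda w: (-min(cluster[w], profile[w]), w))
    return ranked[:k]

def merge_into_cluster(cluster, size, profile):
    """Assumed update: keep a running mean of per-word weights over members."""
    words = set(cluster) | set(profile)
    return {
        w: (cluster.get(w, 0.0) * size + profile.get(w, 0.0)) / (size + 1)
        for w in words
    }

def chronological_influential_words(seed, followers_in_order, k=3):
    """Admit profiles in chronological order and record the top words per step."""
    cluster, size = dict(seed), 1
    history = []
    for name, profile in followers_in_order:
        history.append((name, top_shared_words(cluster, profile, k)))
        cluster = merge_into_cluster(cluster, size, profile)
        size += 1
    return history

# Made-up TF-IDF weights, for illustration only.
seed = {"election": 0.9, "music": 0.1, "vote": 0.6}
followers = [
    ("follower_a", {"election": 0.5, "vote": 0.7, "tax": 0.4}),
    ("follower_b", {"music": 0.8, "tour": 0.6, "vote": 0.2}),
]
history = chronological_influential_words(seed, followers, k=2)
for name, words in history:
    print(name, words)
```

Replacing the chronological list with an ordering by distance to the cluster would give the distance-based entry that this appendix contrasts against.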

Appendix 3

3.1 Hybrid Suggestion System (for Companies and Articles)

For companies or brands, personalized ad targeting based on the interests shown by users should prove efficient, as a relevant user is more likely to become a customer than an arbitrary user. Companies can use Twitter to display the most relevant ads to users based on the nature of their tweets, thereby performing personalized ad targeting. Table 10 gives the top 2 users for each of two brands, Nike and BBC. The top 2 users for Nike happen to be footballers, namely Alex Morgan and Wayne Rooney. For BBC, a British news channel, the top user is number10gov, the handle of the UK prime minister, and the second is a British referee, Graham Scott. Table 11 lists similar articles for each user profile.

Table 10. Similar users for each company/brand
Table 11. Similar articles for each user profile
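One way to produce a ranking like the one in Table 10 is to score each user against the brand by the similarity of their term-weight vectors and keep the top entries. The sketch below assumes cosine similarity over TF-IDF dictionaries, which this appendix does not confirm as the measure used; the brand and user vectors and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_users_for_brand(brand_vector, user_vectors, k=2):
    """Return the k user names most similar to the brand, best first."""
    scored = [
        (cosine_similarity(brand_vector, vector), name)
        for name, vector in user_vectors.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

# Made-up weights for a hypothetical sports brand and three hypothetical users.
brand = {"football": 0.8, "boots": 0.5, "run": 0.3}
users = {
    "striker": {"football": 0.9, "goal": 0.4},
    "runner": {"run": 0.9, "marathon": 0.5},
    "chef": {"recipe": 0.9, "dinner": 0.6},
}
print(top_users_for_brand(brand, users))
```

The same function could rank articles for a user profile by swapping the roles of the vectors, which is how a single similarity routine can serve both tables.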

Appendix 4

4.1 Profile Similarity

See Tables 12, 13 and 14.

Table 12. TF-IDF of user profile terms (HillaryClinton)
Table 13. TF-IDF of profile 1 terms (realDonaldTrump)
Table 14. TF-IDF of profile 2 terms (katyperry)

Appendix 5

5.1 User Profile Based Single Source Clustering

Figures 4, 5, and 6 illustrate the word clouds formed from the top impactful words between the closest incoming profile and the currently formed cluster, using the algorithm mentioned. The words in each cloud are weighted by their importance between the current cluster and the closest incoming profile, calculated using the ITM (described in the next section). Figure 4 gives the word cloud between the user profile (HillaryClinton) and the profile closest to it (THEHermanCain). Almost all of the words relate to politics, as both profiles belong to politicians. Figure 5 illustrates the top words between the existing cluster (around 302 profiles, including HillaryClinton and THEHermanCain) and SpeakerRyan, while Fig. 6 illustrates the influential words between JimmyFallon and the existing cluster. Note that the first two figures are almost entirely about politics, while the last is about entertainment.

Fig. 4. Top influential words between HillaryClinton and THEHermanCain

Fig. 5. Top influential words between the current cluster and SpeakerRyan

Fig. 6. Top influential words between the cluster and JimmyFallon

Appendix 6

6.1 Visualizing Word Graphs

Decreasing significance. Taking the word “parenthood” as an example, we observe that it is among the top 20 influential words shared between the user profile (HillaryClinton) and her closest profile (THEHermanCain), as the word has a high TF-IDF in both their tweets. But as other profiles enter the cluster during the clustering process, the word loses its significance and drops out of the top 20 list because of its low usage among the newly clustered profiles. The same holds for the word “democrats”. HillaryClinton and THEHermanCain use that word with very high frequency, so it makes the top 20 list of words shared between them. But due to its relatively low usage among the next incoming profiles, the word loses its significance.

Increasing significance. What happens if the influence of a particular word is very low for the first few incoming profiles but increases over several iterations? This leads to our second scenario, in which the significance of a word increases as the clustering process progresses. The word “marcorubio” illustrates this scenario well. THEHermanCain does not use the word, so it does not make the list of influential words. But there is some usage from realDonaldTrump and GovMikeHuckabee, who are the third and fourth profiles, respectively. This ensures that the word enters the list of common words between the existing cluster and the incoming profile, but it is not enough to push it into the top 20 influential words. The word finally makes the top 20 list with the entry of newtgingrich, who used it heavily in his tweets. The word “medicare” is a similar example.

Maintaining significance. This scenario is commonly observed when the usage of a word remains reasonably constant across several incoming profiles. One such example is the word “president”. Since the first four closest profiles to HillaryClinton are all politicians, the word “president” occurs commonly in their tweets. Hence, it consistently features in the top 20 list across the first four iterations. Similarly, the word “america” maintains its significance across iterations.

Oscillating significance. This is the final scenario that can be observed as the clustering process progresses. As the name suggests, it occurs when a particular word oscillates between high and low influence across several iterations. Every word will eventually follow this pattern if we extend our observations over many iterations, but we are more interested in oscillations within a short range of iterations. The word “timkaine” helps illustrate this scenario. THEHermanCain did not use this word in his tweets, resulting in its absence from the list of common words for iteration one. But realDonaldTrump has such a high TF-IDF for the word that it makes the top 20 list. With the entry of GovMikeHuckabee, the word falls out of the top 20 list again, as he has not used it in his tweets, thereby decreasing its score. The word “abolish” is another example that follows a similar pattern of oscillating significance.
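The four scenarios above can be told apart by tracking, for each iteration, whether a word is in the top-k list and then counting how often that membership flips. The sketch below is an illustrative reading of the text, not the authors' procedure: the function names, the extra label "absent", and the membership sequences are assumptions, with the sequences chosen to mirror the examples described above.

```python
def top_k_membership(scores_per_iteration, word, k=20):
    """For each iteration, report whether `word` is among the k top-scoring words."""
    membership = []
    for scores in scores_per_iteration:
        top = sorted(scores, key=lambda w: (-scores[w], w))[:k]
        membership.append(word in top)
    return membership

def classify_significance(membership):
    """Label a word by how its top-k membership changes across iterations."""
    changes = sum(1 for prev, cur in zip(membership, membership[1:]) if prev != cur)
    if changes == 0:
        # Never changes: either always in the list or never in it.
        return "maintaining" if membership and membership[0] else "absent"
    if changes == 1:
        # Exactly one flip: in then out, or out then in.
        return "decreasing" if membership[0] else "increasing"
    return "oscillating"

# Membership sequences paraphrasing the examples in the text (illustrative only).
examples = {
    "parenthood": [True, False, False],
    "marcorubio": [False, False, False, True],
    "president": [True, True, True, True],
    "timkaine": [False, True, False],
}
for word, membership in examples.items():
    print(word, classify_significance(membership))
```

Counting flips keeps the rule simple; as the text notes, almost any word would be labeled oscillating over a long enough run, so the sequence passed in should cover only a short range of iterations.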

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sundararaman, D., Srinivasan, S. (2017). Twigraph: Discovering and Visualizing Influential Words Between Twitter Profiles. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science, vol 10540. Springer, Cham. https://doi.org/10.1007/978-3-319-67256-4_26

  • DOI: https://doi.org/10.1007/978-3-319-67256-4_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67255-7

  • Online ISBN: 978-3-319-67256-4

  • eBook Packages: Computer Science (R0)
