Twigraph: Discovering and Visualizing Influential Words Between Twitter Profiles

Sundararaman, Dhanasekar; Srinivasan, Sudharshan

doi:10.1007/978-3-319-67256-4_26

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10540)

Included in the following conference series:

International Conference on Social Informatics

Abstract

Social media use keeps growing, and people are connected with each other like never before, yet these vast connections remain visually unexplored. We propose a methodology, Twigraph, to explore the connections between people using their Twitter profiles. First, we propose a hybrid approach for recommending social media profiles, articles, and advertisements to a user. Profiles are recommended based on the similarity score between the user profile and the profile under evaluation. The similarity between a set of profiles is investigated by finding the top influential words that cause the high similarity, using an Influence Term Metric computed for each word. Then, we group profiles from various domains such as politics, sports, and entertainment based on the similarity score through a novel clustering algorithm. The connectivity between profiles is visualized using word graphs, which help in finding the words that connect a set of profiles and the profiles that are connected to a word. Finally, we analyze the top influential words over a set of profiles through clustering by finding the similarity of those profiles, which makes it possible to break down a Twitter profile with many followers into fine-grained word connections using word graphs. The proposed method was implemented on datasets comprising 1.1 M tweets obtained from Twitter. Experimental results show that the resulting influential words were highly representative of the relationship between two profiles or a set of profiles.

Author information

Authors and Affiliations

SSN College of Engineering, Chennai, India
Dhanasekar Sundararaman
SRM University, Chennai, India
Sudharshan Srinivasan

Corresponding author

Correspondence to Dhanasekar Sundararaman.

Editor information

Editors and Affiliations

Indiana University, Bloomington, Indiana, USA
Giovanni Luca Ciampaglia
University of Washington, Seattle, Washington, USA
Afra Mashhadi
University of Oxford, Oxford, United Kingdom
Taha Yasseri

Appendices

Appendix 1

1.1 Document Modeling

Example of document modeling.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a way of normalization.

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms such as “is,” “of” and “that” may appear many times but have little importance. IDF therefore weighs down terms that occur in many documents and scales up the rare ones.

Document 1: data mining and social media mining

Document 2: social network analysis

Document 3: data mining

Tables 4, 5 and 6 show the normalized term frequency of each term in documents 1, 2 and 3, respectively. Table 7 shows the IDF of all terms in documents 1, 2 and 3. Table 8 shows the TF-IDF of the top words from documents 1, 2 and 3. From Table 8, it is evident that TF-IDF is high for the important words in each document and that stopwords are ignored.

Table 4. Normalized TF of terms in document 1

Table 5. Normalized TF of terms in document 2

Table 6. Normalized TF of terms in document 3

Table 7. IDF of terms in corpus

Table 8. Term with top IDF scores in each document
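The arithmetic behind the normalized TF, IDF and TF-IDF values for the three example documents can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the IDF variant log(N/df) with a natural logarithm, the whitespace tokenizer, and the one-word stop list are assumptions made for illustration, so the values it prints need not match the published tables.

```python
import math
from collections import Counter

# The three example documents from this appendix.
documents = {
    "Document 1": "data mining and social media mining",
    "Document 2": "social network analysis",
    "Document 3": "data mining",
}

# Assumption: stop words are dropped before counting; this tiny list is hypothetical.
STOP_WORDS = {"and"}


def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


def normalized_tf(tokens):
    """Term count divided by the number of tokens in the document."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """Assumed IDF variant: log(N / df), where df counts documents containing the term."""
    n_docs = len(tokenized_docs)
    document_frequency = Counter()
    for tokens in tokenized_docs.values():
        document_frequency.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in document_frequency.items()}


tokenized = {name: tokenize(text) for name, text in documents.items()}
tf = {name: normalized_tf(tokens) for name, tokens in tokenized.items()}
idf = inverse_document_frequency(tokenized)
tfidf = {
    name: {term: weight * idf[term] for term, weight in weights.items()}
    for name, weights in tf.items()
}

for name, weights in tfidf.items():
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    print(name, [(term, round(score, 3)) for term, score in ranked])
```

Under these assumptions, a term that appears in only one document (such as “media” or “network”) receives a larger IDF than a term shared by two documents (such as “data” or “social”), which is the effect the appendix describes.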

Appendix 2

2.1 Influential Words Based on Chronological Order

Instead of using distance as the metric that decides the order in which profiles enter the cluster, the chronological order of entry is used. This ordering is adopted to trace the influential words of a profile that attracted a potential follower, assuming that the social media platform recommends users to follow based on the influential words between profiles that we proposed. The technique can be used to break down a Twitter profile with a lot of followers into a list of words and to understand the relationship between that profile and its followers (Table 9).

Table 9. Top 3 influential words between cluster and incoming profile (chronology-based)

This metric, if used properly, would enable us to decompose a complex profile with a large follower base, such as that of a celebrity, and detect the top influential words. By doing so, we can perform a detailed analysis of why people follow celebrities and which keywords make a difference. Public relations officers and campaign managers for political candidates can use this analysis to target voting blocs.
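The chronological procedure described in this appendix can be sketched as a loop that admits profiles in their order of entry and records, at each step, the top 3 words shared between the current cluster and the incoming profile. The Influence Term Metric is not defined in this appendix, so the sketch substitutes a stand-in score (the product of the cluster's mean term weight and the incoming profile's term weight). That score, the function names, and the toy weights are all hypothetical.

```python
from collections import defaultdict


def top_shared_words(cluster_weights, profile_weights, k=3):
    """Rank words present on both sides by a stand-in influence score.

    Assumption: the score is the product of the two weights; the paper's
    Influence Term Metric may be defined differently.
    """
    shared = cluster_weights.keys() & profile_weights.keys()
    scores = {word: cluster_weights[word] * profile_weights[word] for word in shared}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


def trace_chronologically(seed_weights, followers, k=3):
    """followers: list of (name, term_weights) already sorted by time of entry."""
    totals = defaultdict(float, seed_weights)
    cluster_size = 1
    trace = []
    for name, weights in followers:
        # Mean term weight over the profiles currently in the cluster.
        cluster_mean = {word: total / cluster_size for word, total in totals.items()}
        trace.append((name, top_shared_words(cluster_mean, weights, k)))
        # Admit the profile into the cluster before the next one arrives.
        for word, value in weights.items():
            totals[word] += value
        cluster_size += 1
    return trace


# Hypothetical term weights (for example TF-IDF values) for a seed profile
# and two followers listed in their order of entry.
seed = {"election": 0.5, "vote": 0.4, "policy": 0.3, "music": 0.1}
followers = [
    ("follower_1", {"vote": 0.6, "policy": 0.2, "football": 0.5}),
    ("follower_2", {"music": 0.7, "election": 0.3, "vote": 0.1, "policy": 0.1}),
]

trace = trace_chronologically(seed, followers)
for name, words in trace:
    print(name, words)
```

Because the order of entry is fixed by time rather than by distance, the same followers can yield different word lists than a similarity-ordered clustering would.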

Appendix 3

3.1 Hybrid Suggestion System (for Companies and Articles)

For companies or brands, personalized ad targeting based on the interest shown by users would prove efficient, as a relevant user is more likely to turn into a potential customer than a common user. Companies can use Twitter to display the most relevant ads to the respective users based on the nature of their tweets, performing personalized ad targeting. Table 10 gives the top 2 users for each of the brands, namely Nike and BBC. The top 2 users for Nike happen to be footballers, namely Alex Morgan and Wayne Rooney. For BBC, a British news channel, the top user is number10gov, which is the handle of the UK prime minister; the second top user is a British referee, Graham Scott (Table 11).

Table 10. Similar users for each company/brand

Table 11. Similar articles for each user profile
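One way to produce a ranking like the one in Table 10 is to score every candidate profile against the brand's term vector and keep the best k. The sketch below assumes cosine similarity over term weights as the similarity score; the vectors, the user names, and the helper names are hypothetical and are not taken from the paper's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(target, candidates, k=2):
    """Return the k candidates most similar to the target, best first."""
    scored = [(name, cosine_similarity(target, vector)) for name, vector in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical term weights for a brand profile and three user profiles.
brand_vector = {"football": 0.6, "shoes": 0.5, "run": 0.4}
user_vectors = {
    "user_sport": {"football": 0.7, "goal": 0.3, "run": 0.2},
    "user_news": {"election": 0.8, "policy": 0.4},
    "user_runner": {"run": 0.9, "shoes": 0.3},
}

for name, score in top_matches(brand_vector, user_vectors, k=2):
    print(name, round(score, 3))
```

The same ranking function could be pointed at article vectors instead of user vectors to suggest articles for a profile, as in Table 11.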

Appendix 4

4.1 Profile Similarity

See Tables 12, 13 and 14.

Table 12. TF-IDF of user profile terms (HillaryClinton)

Table 13. TF-IDF of profile 1 terms (realDonaldTrump)

Table 14. TF-IDF of profile 2 terms (katyperry)

Appendix 5

5.1 User Profile Based Single Source Clustering

Figures 4, 5, and 6 illustrate the word clouds formed from the top impactful words between the closest incoming profile and the currently formed cluster, using the algorithm mentioned. The words in each cloud are chosen by their importance between the current cluster and the closest incoming profile, calculated using the ITM (described in the next section). Figure 4 gives the word cloud between the user profile (HillaryClinton) and the profile closest to it (THEHermanCain). Almost all of the words relate to politics, as both profiles belong to politicians. Figure 5 illustrates the top words between the existing cluster (around 302 profiles, including HillaryClinton and THEHermanCain) and SpeakerRyan, while Fig. 6 illustrates the influential words between JimmyFallon and the existing cluster. Note that the first two figures are almost entirely about politics, while the last is about entertainment.

Appendix 6

6.1 Visualizing Word Graphs

Decreasing significance. Taking the word “parenthood” as an example, we observe that it is among the top 20 influential words shared between the user profile (HillaryClinton) and her closest profile (THEHermanCain), as the word has a high TF-IDF in both their tweets. But as other profiles enter during the clustering process, the word loses its significance and moves out of the top 20 list because of its low usage among the newly clustered profiles. The same can be said for the word “democrats”. HillaryClinton and THEHermanCain use that word with very high frequency, and hence it makes the top 20 list of words shared between them. But due to its relatively low usage among the next incoming profiles, the word loses its significance.

Increasing significance. What happens if the influence of a particular word is very low for the first few incoming profiles but increases over several iterations? This leads to our second scenario, where the significance of a particular word increases as the clustering process progresses. The word “marcorubio” illustrates this scenario well. THEHermanCain does not use the word, and hence it does not make the list of influential words. But there is some usage from realdonaldtrump and GovMikeHuckabee, who are the third and fourth profiles, respectively. This ensures that the word enters the list of common words between the existing cluster and the incoming profile, but it is not enough to push it into the top 20 influential words list. The word finally makes the top 20 list with the entry of newtgingrich, as he used it heavily in his tweets. The word “medicare” is a similar example.

Maintaining significance. This scenario is commonly observed when the usage of a word remains reasonably constant across several incoming profiles. One such example is the word “president”. Since the first four closest profiles to HillaryClinton are all politicians, the word “president” is a common occurrence among their tweets. Hence it consistently features in the top 20 list across the first four iterations. Similarly, the word “america” follows the same scenario of maintaining its significance across iterations.

Oscillating significance. This is the final scenario that can be observed as the clustering process progresses. As the name suggests, it occurs when a particular word oscillates between high and low influence across several iterations. Every word will eventually follow this pattern if we extend our observations across many iterations, but we are more interested in oscillations within a short range of iterations. The word “timkaine” helps us understand this scenario. THEHermanCain did not use this word in his tweets, resulting in its absence from the list of common words for iteration one. But realdonaldtrump has such a high TF-IDF for the word that it makes the top 20 list. With the entry of GovMikeHuckabee, the word falls out of the top 20 list again, as he has not used it in his tweets, thereby decreasing its score. The word “abolish” is another example that follows a similar pattern of oscillating significance.
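The four scenarios above can be told apart mechanically once each word is reduced to a sequence of flags recording whether it was in the top 20 list at each iteration. The sketch below is one possible encoding, not part of the original method: the function name and the "absent" label are hypothetical, and the flag sequences are read off the descriptions above (the exact iteration at which “parenthood” drops out is not stated, so that sequence is assumed).

```python
def classify_significance(in_top_list):
    """Label a word's trajectory from per-iteration top-list membership flags."""
    if not in_top_list:
        raise ValueError("need at least one iteration")
    # Count how often membership flips between consecutive iterations.
    changes = sum(
        1 for previous, current in zip(in_top_list, in_top_list[1:]) if previous != current
    )
    if changes == 0:
        return "maintaining" if in_top_list[0] else "absent"
    if changes == 1:
        return "decreasing" if in_top_list[0] else "increasing"
    return "oscillating"


# Membership flags per iteration, following the scenarios described above.
trajectories = {
    "parenthood": [True, False, False, False],  # assumed drop-out point
    "marcorubio": [False, False, False, True],
    "president": [True, True, True, True],
    "timkaine": [False, True, False],
}

for word, flags in trajectories.items():
    print(word, classify_significance(flags))
```

A single flip is treated as a trend and two or more flips as oscillation, which matches the remark that any word would eventually oscillate if observed over enough iterations.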

About this paper

Cite this paper

Sundararaman, D., Srinivasan, S. (2017). Twigraph: Discovering and Visualizing Influential Words Between Twitter Profiles. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science, vol 10540. Springer, Cham. https://doi.org/10.1007/978-3-319-67256-4_26

DOI: https://doi.org/10.1007/978-3-319-67256-4_26
Published: 02 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67255-7
Online ISBN: 978-3-319-67256-4
eBook Packages: Computer Science, Computer Science (R0)
