Language, demographics, emotions, and the structure of online social networks
Social networks affect individuals’ economic opportunities and well-being. However, few of the factors thought to shape networks (culture, language, education, and income) have been empirically validated at scale. To fill this gap, we collected a large number of social media posts from a major US metropolitan area. By associating these posts with US Census tracts through their locations, we linked socioeconomic indicators to group-level signals extracted from social media, including emotions, language, and online social ties. Our analysis shows that tracts with higher education levels have weaker social ties, but this effect is attenuated for tracts with a high ratio of Hispanic residents. Negative emotions are associated with more frequent online interactions, or stronger social ties, while positive emotions are associated with weaker ties. These results hold for both Spanish and English tweets, suggesting that language does not affect this relationship between emotion and social ties. Our findings highlight the role of cognitive and demographic factors in online interactions and demonstrate the value of traditional social science sources, like US Census data, within social media studies.
Keywords: Social media, Social ties, Emotions, Demographics
Humans have evolved large brains, in part to handle the complex cognitive demands of social interactions. The social structures resulting from these interactions confer numerous fitness advantages. Scholars distinguish between two types of social relationships: those representing strong and weak ties. Strong ties are characterized by high frequency of interaction and emotional intimacy that can be found in relationships between family members or close friends. People connected by strong ties share mutual friends, forming cohesive social bonds that are essential for providing emotional and material support [3, 4] and creating resilient communities. In contrast, weak ties represent more casual social relationships, characterized by less frequent, less intense interactions, such as those occurring between acquaintances. By bridging otherwise unconnected communities, weak ties expose individuals to novel and diverse information that leads to new job prospects and career opportunities [7, 8]. Online social relationships provide benefits similar to those of offline relationships, including emotional support and exposure to novel and diverse information [9, 10, 11, 12].
How and why do people form different social ties, whether online or offline? Among the few studies that addressed this question, Shea and collaborators examined the relationship between emotions and cognitive social structures, i.e., the mental representations individuals form of their social contacts. In a laboratory study, they demonstrated that subjects experiencing positive affect (emotions such as happiness and joy) were able to recall larger and more diverse sets of social contacts than those experiencing negative affect, e.g., sadness. In other words, positive affect was more closely associated with weak ties and negative affect with strong ties in cognitive social structures. Strong ties cost more to maintain than weak ties, as they involve a higher frequency of interaction, but they also bring greater benefits, such as emotional support that manifests in the social sharing of negative emotional experiences [4, 15]. As a consequence, frequency of interaction along social ties is positively associated with stronger emotional intensity and negative emotional expression.
In addition to psychological factors, social structures also depend on the participants’ demographic characteristics, including socioeconomic status and culture. A study that reconstructed a national-scale social network from the phone records of people living in the UK found that people living in more prosperous regions formed more diverse social networks, linking them to others living in distinct communities. On the other hand, people living in less prosperous communities formed less diverse, more cohesive social structures. In addition, culture plays a prominent role in shaping the structure of social interaction, but only recently have a few studies focused on how culture, as well as spoken language, affects online interactions [19, 20]. The link between social structures and place has led researchers to examine the role of neighborhoods in shaping communities [21, 22, 23].
The present paper examines how group-level psychological, socioeconomic, and demographic factors, as well as language, affect the structure of online social interactions. We restrict our attention to interactions on the Twitter microblogging platform as a first approximation to measuring macroscopic signals through digital traces. We collected a large body of geo-referenced text messages, known as tweets, from a large US metropolitan area. We linked these tweets to US Census tracts through their locations. Census tracts are small regions, on a scale of city blocks, that are relatively homogeneous with respect to population characteristics, economic status, and living conditions. Some of the tweets also contained explicit references to other users through the ‘@’ mention convention, which has been widely adopted on Twitter for conversations. We used mentions to measure the strength of social ties of people tweeting from each tract. Using these data, we study group-level relationships between social ties, the demographic characteristics of the tract, and the emotions expressed by people tweeting from there. We separately measure emotions expressed in English- and Spanish-language tweets, enabling us to additionally explore the impact of language on emotions and social tie formation. In addition, people tweeting from one tract often tweeted from other tracts. Since geography is a strong organizing principle for both offline [24, 25] and online [26, 27, 28] social relationships, we measured spatial diversity to correct for its effect on social network structures in our statistical analyses.
This article illustrates how digital trace data can complement previous studies of online social networks and characterize group-level relationships between the structure of online interactions in urban places and their demographic and socioeconomic characteristics. While unfit to analyze emotions, demographics, and social structures at the individual level, our methods link general properties of the structure of online interactions in groups with their aggregated levels of positive affect. Groups of people who express happier emotions, regardless of language, interact with a more diverse set of social contacts, which puts them in a position to access, and potentially take advantage of, novel information. As our social interactions increasingly move online, understanding, and unobtrusively monitoring, online social structures at a macroscopic level is necessary for ensuring equal access to the benefits of social relationships. Although many important caveats exist about generalizing these results to offline social interactions, our work highlights the value of linking social media data to traditional data sources, such as the US Census, to drive novel analysis of online behavior and online social structures.
Of the roughly 2000 tracts in Los Angeles (LA), we collected tweets from 1700 tracts. The “Methods” section describes our approach to measuring emotional expression of these tweets. We observe systematic differences between emotions expressed in the tweets posted from different places, the language of the tweets, and the structure of online social interactions, as well as spatial mobility. By combining these data, we created one set with 688 tracts that had at least 15 tweets in both Spanish and English from which we could measure emotional expressions. However, some of these tracts did not have all demographic or socioeconomic variables available from the US Census and were excluded from further analysis. After cleaning the sample of tracts, as explained in more detail in the “Methods” section, the number of analyzed tracts is 539, comprising a total of more than 28 thousand tweets.
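The tract selection rule described above can be sketched as a simple filter. This is a minimal illustration, assuming a hypothetical data layout in which each tract maps to its per-language tweet counts; the function name, the dictionary keys, and the toy counts are our own and are not taken from the dataset.

```python
def eligible_tracts(tweet_counts, min_tweets=15):
    """Return tract ids with at least `min_tweets` tweets in both English and Spanish.

    tweet_counts maps tract id -> {"en": count, "es": count} (a hypothetical layout).
    """
    return sorted(
        tract
        for tract, counts in tweet_counts.items()
        if counts.get("en", 0) >= min_tweets and counts.get("es", 0) >= min_tweets
    )


# Toy counts for illustration only, not values from the dataset
toy_counts = {
    "tract_A": {"en": 120, "es": 15},
    "tract_B": {"en": 300, "es": 4},
    "tract_C": {"en": 14, "es": 40},
}
print(eligible_tracts(toy_counts))  # ['tract_A']
```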
Demographics and social ties
Fit results for the mixed effects models of log(tie strength) as a function of demographic variables, controlling for spatial diversity and including random intercepts per tract group
Table 1 reports the regression results for the logarithm of tie strength as a linear combination of education and the ratio of Hispanic residents, including control terms for age and spatial diversity. The three models include incremental terms that explore the role of education and Hispanic ratio, including interaction terms, as both variables are strongly collinear (Pearson’s \(\rho =-0.865\), CI \([-0.885,-0.842]\)). The analysis shows a negative association between education and tie strength. On average, tracts with a higher ratio of residents with a bachelor’s degree have weaker social ties on Twitter.
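A confidence interval for a Pearson correlation is commonly obtained with Fisher's z-transform. The sketch below is illustrative and rests on two assumptions that the text does not state: that the interval is a 95% interval of this kind, and that the sample size is the 539 analyzed tracts. Under these assumptions, both bounds around \(\rho =-0.865\) are negative.

```python
import math


def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a Pearson correlation (Fisher z-transform)."""
    z = math.atanh(r)              # map r to the approximately normal z scale
    se = 1.0 / math.sqrt(n - 3)    # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Assumption: n = 539 analyzed tracts
low, high = pearson_ci(-0.865, 539)
print(f"[{low:.3f}, {high:.3f}]")  # [-0.885, -0.842]
```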
The positive relationship between the ratio of Hispanic residents and tie strength appears in the model without interactions (Model 1), but adding interaction terms shows its dependence on education (Model 2). The positive marginal effect of the ratio of Hispanic residents is not significant, but the interaction term with education is significant and positive. This has two implications: first, that the positive association between the ratio of Hispanic residents and tie strength is only present for higher levels of education, and second, that the negative effect of education on tie strength is counterbalanced by the positive interaction with the ratio of Hispanic residents. As a result, for ratios of Hispanic residents above 0.6, we can expect a positive relationship between education and tie strength. This is further supported by the fact that the best model in terms of the Bayesian Information Criterion is the model with an interaction term but no direct effect of the ratio of Hispanic residents on tie strength (Model 3).
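The sign change can be made explicit with a short derivation. This is a sketch in notation introduced here for illustration: the symbols \(E\), \(H\), and the \(\beta\) coefficients are our labels, not values taken from Table 1. Here \(E\) is the ratio of residents with a bachelor’s degree, \(H\) is the ratio of Hispanic residents, and \(u_g\) is the random intercept of tract group \(g\).

\[
\log(\text{tie strength}) = \beta_0 + \beta_E E + \beta_H H + \beta_{EH}\, E H + \text{controls} + u_g
\]

\[
\frac{\partial \log(\text{tie strength})}{\partial E} = \beta_E + \beta_{EH} H
\]

With \(\beta_E<0\) and \(\beta_{EH}>0\), the marginal association of education with tie strength is negative for \(H < H^{*} = -\beta_E/\beta_{EH}\) and positive above it. The threshold of 0.6 quoted above corresponds to \(H^{*}\approx 0.6\).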
We repeated the fit of the full model replacing the education variable with two of its strongest correlates: the employment rate and the logarithm of the median household income. These two models, presented in detail in the Supplementary Material, do not show significant effects on the logarithm of tie strength, either directly or through interactions with the ratio of Hispanic residents. This highlights the role of education in the result, which does not appear to be confounded by the economic variables of income and employment.
Language and emotion
The average value of valence across all tracts is 5.78 for English language tweets and 5.71 for Spanish language tweets, which corresponds to slightly positive affect with respect to the neutral point of 5.0, in line with emotional expression in other media. As expected, the measurements of valence are positively correlated with the measurements of positive affect (\(\rho =0.50\) in English and \(\rho =0.30\) in Spanish) and negatively correlated with the measurements of negative affect (\(\rho =-0.55\) in English and \(\rho =-0.21\) in Spanish). This illustrates the relative dependence between the dimensional representations of valence-arousal (VA Model) and positive–negative (PN Model), which capture emotional life in different ways. In the further analysis, we apply these representations in two parallel models for each language, to have a double test of the hypothetical relation between tie strength and emotions.
Emotion and social ties
Fit results for the mixed effects models of log(tie strength) as a function of sentiment values (valence and arousal in VA Model, positive and negative in PN Model), including random intercepts per tract group
The availability of large scale, near real-time data from social media sites such as Twitter brings novel opportunities for studying online behavior and social interactions at an unprecedented spatial and temporal resolution. By combining Twitter data with US Census, we were able to study how the socioeconomic and demographic characteristics of residents of different census tracts are related to the structure of online interactions. Sentiment analysis of tweets in English and Spanish languages originating from a tract revealed a link between emotional expression and tie strength of Twitter users at the group level.
Our findings are broadly consistent with results of previous studies carried out in offline settings, and also give new insights into the structure of online social interactions. We found that, in line with previous research and theoretical arguments, Twitter users express more positive emotions in areas with weaker ties, while negative emotions are more salient where ties are stronger. We find a lack of correlation with arousal that is consistent with a general pattern in which sentiment analysis techniques do not seem to capture the subjective experience of arousal. We find that at an aggregate level, areas where Twitter users form stronger social ties have lower levels of education, but that this effect interacts with the ratio of Hispanic residents in the opposite direction. Since weak ties are believed to play an important role in delivering strategic, novel information, our work identifies education as a main correlate of the presence of weak ties and their associated novel information.
Our results highlight the social component of culture: Hispanic cultures share collectivist values and are less individualist than Anglo-Saxon cultures (for example, Mexico scores 30 and the US 91 on the individualism scale of Hofstede). This provides a possible explanation for the stronger links of tracts with a higher number of Hispanic residents (interacting with education), as their online network structures reflect their shared values [33, 34]. However, this manifestation of shared values in digital traces appears only in areas where education levels are higher, as they also have higher levels of penetration of online social media.
Some important considerations limit the interpretation of our findings. First, our methodology for identifying social interactions may not give a complete view of the social network of Twitter users. Our observations were limited to social interactions initiated by users who geo-reference their tweets. This may not be representative of all Twitter users posting messages from a given tract, if systematic biases exist in what type of people elect to geo-reference their tweets. For demographic analysis, we did not resolve the home location of Twitter users. Instead, we assumed that characteristics of an area, i.e., of residents of a tract, influence the tweets posted from that tract. Other subtle selection biases could have affected our data and the conclusions we drew. It is conceivable that Twitter users residing in more affluent areas are less likely to use the geo-referencing feature, making our sample of Twitter users different from the population of LA County residents.
Recognizing this limitation, our conclusions only apply at the group level and not at the level of individual behavior of LA residents. In the same vein as how Twitter data can be used to identify group effects on heart disease mortality, our analysis identifies relations between properties of groups of people. While recent research opens the possibility of reweighting Twitter metrics across demographic sections, evaluating the external validity of social media metrics requires an interdisciplinary effort beyond the scope of this contribution from data science. Further research can build on our work, combining social media data with standard social science resources, such as surveys, questionnaires, and the census, to achieve a more complete picture of social interaction in our current Digital Society.
Los Angeles (LA) County is the most populous county in the USA, with almost 10 million residents. It is extremely diverse both demographically and economically, making it an attractive subject for research. We collected a large body of tweets from LA County over the course of 4 months, starting in July 2014. Our data collection strategy was as follows. First, we used Twitter’s location search API to collect tweets from an area that included Los Angeles County. We then used the Twitter4J API to collect all (timeline) tweets from users who tweeted from within this area during this time period. A portion of these tweets were geo-referenced, i.e., they had geographic coordinates attached to them. In all, we collected 6M geo-tagged tweets posted by 340K distinct users.
We localized geo-tagged tweets to tracts from the 2012 US Census. A tract is a geographic region that is defined for the purpose of taking a census of a population, containing about 4000 residents on average, and is designed to be relatively homogeneous with respect to demographic characteristics of that population. We included only Los Angeles County tracts in the analysis. We used data from the US Census to obtain demographic and socioeconomic characteristics of a tract, including the mean household income, median age of residents, percentage of residents with a bachelor’s degree or above, as well as racial and ethnic composition of the tract.
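Assigning a geo-tagged tweet to a tract amounts to a point-in-polygon lookup. The sketch below uses the standard ray-casting test on toy polygons; it is an illustration only, since the text does not specify the geometric procedure or tooling that was used. The function names and the two square "tracts" are hypothetical, and real tract boundaries would come from Census boundary files.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside `polygon`,
    given as a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def locate_tract(lon, lat, tracts):
    """Return the id of the first tract polygon containing the point, or None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


# Hypothetical tracts: two adjacent unit squares, not real Census geometry
toy_tracts = {
    "tract_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "tract_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
print(locate_tract(0.5, 0.5, toy_tracts))  # tract_A
print(locate_tract(1.5, 0.5, toy_tracts))  # tract_B
```

A linear scan over all polygons is adequate for a toy example; with thousands of tracts and millions of tweets, a spatial index would be the usual choice.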
We apply sentiment analysis, i.e., methods that process text to quantify subjective states of the author of the text, to measure happiness or subjective well-being of Twitter users. Two recent independent benchmark studies evaluate a wide variety of sentiment analysis tools in various social media and Twitter datasets. Across social media, one of the best performing tools is SentiStrength, which also was shown to be the best unsupervised tool for tweets in various contexts.
English language analysis
SentiStrength quantifies emotions expressed in short informal text by matching terms from a lexicon and accounting for intensifiers, negations, misspellings, idioms, and emoticons. We applied the standard English version of SentiStrength to each tweet in our dataset, quantifying positive sentiment \(P\in [+1,+5]\) and negative sentiment \(N\in [-5,-1]\), consistent with the Positive and Negative Affect Schedule (PANAS). SentiStrength has been shown to perform very closely to human raters in validity tests and has been applied to measure emotions in product reviews, online chatrooms, Yahoo! Answers, YouTube comments, and social media discussions. In addition, SentiStrength allows our approach to be applied in the future to other languages, like Spanish [30, 48], and to include contextual factors, like sarcasm.
Beyond positivity and negativity, meanings expressed through text can be captured through the application of the semantic differential, a dimensional approach that quantifies emotional meaning in terms of valence and arousal. The dimension of valence quantifies the level of pleasure or evaluation expressed by a word, while arousal measures the level of activity induced by the emotions associated with a word. Research in psychology suggests that a multidimensional approach is necessary to capture the variance of emotional experience, motivating our use of this two-dimensional approximation, which goes beyond simple polarity. The state of the art in the quantification of these dimensions is the lexicon of Warriner, Kuperman, and Brysbaert (WKB). The WKB lexicon includes scores in three dimensions (valence, arousal, and dominance) for more than 13,000 English lemmas. We quantify valence and arousal in a tweet by first lemmatizing the words in the tweet, then matching the lemmas against the lexicon and computing the mean values of valence and arousal, following previous work. While this method is not the most accurate, it provides high coverage for Twitter data and allows a multidimensional representation of emotions that is not frequent in mainstream sentiment analysis. In our dataset, for example, the lexicon matched terms in 82.39% of the tweets.
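The lexicon-matching step can be sketched as follows. This is a minimal illustration that assumes the tweet has already been lemmatized. The function name and the lexicon entries are hypothetical: the scores are made-up numbers on a 1 to 9 scale, not actual WKB values.

```python
def tweet_affect(lemmas, lexicon):
    """Mean valence and arousal over the lemmas found in the lexicon.

    Returns None when no lemma matches, so unmatched tweets can be excluded.
    """
    matched = [lexicon[word] for word in lemmas if word in lexicon]
    if not matched:
        return None
    valence = sum(v for v, _ in matched) / len(matched)
    arousal = sum(a for _, a in matched) / len(matched)
    return valence, arousal


# Hypothetical (valence, arousal) entries on a 1-9 scale; NOT actual WKB scores
toy_lexicon = {"happy": (8.0, 6.0), "traffic": (3.0, 5.0), "rain": (5.0, 3.0)}

print(tweet_affect(["happy", "traffic", "today"], toy_lexicon))  # (5.5, 5.5)
print(tweet_affect(["today"], toy_lexicon))                      # None
```

Tract-level scores would then be averages of these per-tweet values over the tweets posted from each tract.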
Spanish language analysis
We analyzed emotions expressed in tweets written in Spanish, as determined by the Google language-detection library. After tokenizing tweets (using the Stanford CoreNLP tool) and stemming Spanish words (using the NLTK module), we used GISB, a lexicon developed by Gonzalez, Imbault, Sanchez, and Brysbaert, to measure the emotional content of Spanish language tweets. Similar to the WKB lexicon, the GISB lexicon contains a large set (14,031) of Spanish words annotated with valence and arousal values ranging from 1 to 9, with 5 as the neutral point. About 65% of the tweets recognized as Spanish language tweets contained at least one word that matches the GISB lexicon.
We used the Spanish version of SentiStrength to quantify the positive and negative sentiment expressed in tweets. Similar to its English version, the latest adaptation of SentiStrength to Spanish returns values in the range of [1, 5] for positive and [−5, −1] for negative sentiment. We ignored neutral tweets that have a combined score of zero (i.e., positive and negative scores of equal magnitude). To keep our analysis consistent across languages, we focus on the dimensions of positive and negative affect from SentiStrength and on valence and arousal from the WKB and GISB affective norms lexica.
Social tie analysis
Twitter users address others using the ‘@’ mention convention. We use the mentions as evidence of social ties, although sometimes users address public figures and celebrities also using this convention. We use mention frequency along a tie as a proxy of tie strength, drawing upon multiple studies that used frequency of interactions as a measure of tie strength [6, 57, 58]. In contrast to other measures, such as clustering coefficient, it does not require knowledge of full network structure (which we do not observe).
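One way to turn mention frequency into a tract-level measure is sketched below. The exact definition is not recoverable from the text, so this is an assumption: tie strength is taken to be the average number of mentions per distinct (author, mentioned user) pair among tweets posted from a tract. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def tract_tie_strength(mentions):
    """Average number of mentions per distinct tie, for each tract.

    `mentions` is an iterable of (tract_id, author, mentioned_user) tuples,
    one tuple per '@' mention observed in a tweet posted from that tract.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tract_id, author, target in mentions:
        counts[tract_id][(author, target)] += 1
    return {
        tract_id: sum(ties.values()) / len(ties)
        for tract_id, ties in counts.items()
    }


# Toy mention records for illustration only
toy_mentions = [
    ("tract_A", "u1", "u2"), ("tract_A", "u1", "u2"), ("tract_A", "u1", "u2"),
    ("tract_A", "u3", "u4"),
    ("tract_B", "u5", "u6"), ("tract_B", "u5", "u7"),
]
print(tract_tie_strength(toy_mentions))  # {'tract_A': 2.0, 'tract_B': 1.0}
```

In the toy data, tract_A has four mentions spread over two ties, while tract_B has two mentions over two ties, so tract_A shows the stronger average tie.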
Thus, spatial diversity is a ratio that compares the empirical entropy of data with its expected value in the uniformly distributed case. As a consequence, a high spatial diversity value for a tract suggests that people tweeting from that tract split their tweets evenly among all the tracts they are tweeting from. In contrast, a low value implies that people tweeting from that tract concentrate their tweets in few tracts.
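A normalized entropy of this kind can be sketched as follows. This is an illustration under an assumption: the empirical entropy of the tract distribution is divided by \(\log k\), the entropy of a uniform distribution over the \(k\) distinct tracts observed. The function name, the input layout, and the convention of returning 0 when only one tract is observed are our own choices.

```python
import math
from collections import Counter


def spatial_diversity(tweet_tracts):
    """Normalized entropy of where a tract's users tweet from.

    `tweet_tracts` lists one tract id per tweet posted by the users of a focal
    tract. The result is near 1 when tweets are spread evenly over the observed
    tracts and near 0 when they concentrate in a few tracts.
    """
    counts = Counter(tweet_tracts)
    k = len(counts)
    if k < 2:
        return 0.0  # convention chosen here: a single tract has no diversity
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


print(spatial_diversity(["a", "b", "c", "d"]))   # even split: close to 1
print(spatial_diversity(["a"] * 9 + ["b"]))      # concentrated: below 0.5
```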
Regression models
Our statistical analysis applies mixed-effects regression models of the logarithm of the average tie strength in each tract. To control for spatial correlations, we model a random intercept for each tract group, identified by the tract prefixes of the census. These tract groups are formed by the division of earlier tracts into subtracts, capturing spatial autocorrelations more sensitive to demographic features than pure geographic methods that ignore urban and administrative barriers. We fit models using the lmer function of the lme4 R package, specifying a model with tract groups as random intercepts.
The left panel of Fig. 1 shows the existence of a sizeable negative correlation (Pearson’s \(\rho =-0.47\)) between tie strength and spatial diversity. Tracts that bring together people who also tweet from different places have weaker ties, while tracts with more concentrated user groups have stronger ties. To exclude this pattern from our demographic and emotion analysis, we include a linear term of spatial diversity in each regression model. We further include an additional control term with the average age of residents in the census, to control for the possible age effect in the intensity of social links as manifested in Twitter.
For the demographics model, we perform three fits to survey the interaction between education levels and the ratio of Hispanic residents. For the analysis of emotions, we fit each model twice: first a mixed-effects model using tract groups as random effects and sentiment variables as fixed effects, and second a model that takes as dependent variable the residuals of the demographics model and regresses them against the emotion variables. This way, we verify that the results of the emotion model fits are robust to the role of demographic factors in tie strength.
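The second, residual-based fit can be illustrated with a deliberately simplified sketch. It replaces the mixed-effects models with ordinary least squares on a single predictor per stage, and it uses synthetic data; the variable names and numbers are hypothetical and are not estimates from this study.

```python
def ols_fit(x, y):
    """Least-squares line y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Residuals of the least-squares line of y on x."""
    intercept, slope = ols_fit(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Synthetic tract-level data, constructed so that log tie strength decreases
# with both education and valence
education = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
valence = [5.9, 5.5, 6.0, 5.6, 6.1, 5.7]
log_tie = [1.0 - 0.5 * e - 0.3 * (v - 5.8) for e, v in zip(education, valence)]

# Stage 1: demographics model. Stage 2: regress its residuals on emotion.
stage1_residuals = residuals(education, log_tie)
_, emotion_slope = ols_fit(valence, stage1_residuals)
print(round(emotion_slope, 3))  # negative: valence still relates to tie strength
```

The residual-stage slope is slightly smaller in magnitude than the value of 0.3 used to construct the data, because the part of valence that is correlated with education has already been absorbed by the first stage.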
To verify the validity of our fit results, we perform regression diagnostics that are reported in the Supplementary Material. These verify that multicollinearity is weak (moderate Variance Inflation Factors), that residuals are roughly normally distributed, that no pathological correlations exist between residuals and independent variables, and that no pattern of heteroscedasticity is present. These verifications support the validity of the inferences underlying the conclusions of this analysis.
- 7. Burt, R. (1995). Structural holes: The social structure of competition. Cambridge: Harvard University Press.
- 10. Bakshy, E., Rosenn, I., Marlow, C., & Adamic, L. (2012). The role of social networks in information diffusion. In Proceedings of the 21st International Conference on World Wide Web (pp. 519–528). ACM.
- 12. Kang, J. H., & Lerman, K. (2017). Effort mediates access to information in online social networks. ACM Transactions on the Web (TWEB), 11(1), 3:1–3:19.
- 26. Quercia, D., Capra, L., & Crowcroft, J. (2012). The social world of twitter: Topics, geography, and emotions. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), Vol. 12, pp. 298–305.
- 28. Backstrom, L., Sun, E., & Marlow, C. (2010). Find me if you can: Improving geographical prediction with social and spatial proximity. In Proceedings of the 19th International Conference on World Wide Web, pp. 61–70. ACM.
- 32. Hofstede, G. (1984). Culture’s consequences: International differences in work-related values (Vol. 5). Thousand Oaks: SAGE Publications.
- 33. Garcia-Gavilanes, R., Quercia, D., & Jaimes, A. (2013). Cultural dimensions in twitter: Time, individualism and power. In International AAAI Conference on Weblogs and Social Media.
- 34. Kayes, I., Kourtellis, N., Quercia, D., Iamnitchi, A., & Bonchi, F. (2015). Cultures in community question answering. In Proceedings of the 26th ACM Conference on Hypertext and Social Media, pp. 175–184. ACM.
- 35. Tufekci, Z. (2014). Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In International AAAI Conference on Web and Social Media. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8062
- 37. Barbera, P. (2016). Less is more? How demographic sample weights can improve public opinion estimates based on twitter data. NYU Working Paper.
- 40. Abbasi, A., Hassan, A., & Dhar, M. (2014). Benchmarking twitter sentiment analysis tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14).
- 42. Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54(6), 1063.
- 43. Garcia, D., & Schweitzer, F. (2011). Emotions in product reviews – empirics and models. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom).
- 44. Garas, A., Garcia, D., Skowron, M., & Schweitzer, F. (2012). Emotional persistence in online chatting communities. Scientific Reports, 2.
- 45. Kucuktunc, O., Cambazoglu, B. B., Weber, I., & Ferhatosmanoglu, H. (2012). A large-scale sentiment analysis for Yahoo! answers. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining.
- 46. Garcia, D., Mendez, F., Serdült, U., & Schweitzer, F. (2012). Political polarization and popularity in online participatory media: An integrated approach. In Proceedings of the First Edition Workshop on Politics, Elections and Data.
- 49. Thelwall, M., Buckley, K., Paltoglou, G., Skowron, M., Garcia, D., Gobron, S., Ahn, J., Kappas, A., Küster, D., & Holyst, J. A. (2013). Damping sentiment analysis in online communication: Discussions, monologs and dialogs. In International Conference on Intelligent Text Processing and Computational Linguistics (pp. 1–12). Springer, Berlin, Heidelberg.
- 50. Rajadesingan, A., Zafarani, R., & Liu, H. (2015). Sarcasm detection on twitter: A behavioral modeling approach. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining.
- 51. Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1964). The measurement of meaning. Champaign: University of Illinois Press.
- 58. Quercia, D., Ellis, J., Capra, L., & Crowcroft, J. (2012). Tracking gross community happiness from tweets. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, pp. 965–968. ACM.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.