Language, demographics, emotions, and the structure of online social networks

Social networks affect individuals’ economic opportunities and well-being. However, few of the factors thought to shape networks (culture, language, education, and income) have been empirically validated at scale. To fill this gap, we collected a large number of social media posts from a major US metropolitan area. By associating these posts with US Census tracts through their locations, we linked socioeconomic indicators to group-level signals extracted from social media, including emotions, language, and online social ties. Our analysis shows that tracts with higher education levels have weaker social ties, but this effect is attenuated for tracts with a high ratio of Hispanic residents. Negative emotions are associated with

living in distinct communities [18]. On the other hand, people living in less prosperous communities formed less diverse, more cohesive social structures. In addition, culture plays a prominent role in shaping the structure of social interaction, but only recently have a few studies focused on how culture, as well as spoken language, affects online interactions [19,20]. The link between social structures and place has led researchers to examine the role of neighborhoods in shaping communities [21,22,23].
The present paper examines how group-level psychological, socioeconomic, and demographic factors, as well as language, affect the structure of online social interactions. We restrict our attention to interactions on the Twitter microblogging platform as a first approximation to measuring macroscopic signals through digital traces. We collected a large body of geo-referenced text messages, known as tweets, from a large US metropolitan area. We linked these tweets to US Census tracts through their locations. Census tracts are small regions, on a scale of city blocks, that are relatively homogeneous with respect to population characteristics, economic status, and living conditions. Some of the tweets also contained explicit references to other users through the '@' mention convention, which has been widely adopted on Twitter for conversations. We used mentions to measure the strength of social ties of people tweeting from each tract. Using these data, we study group-level relationships between social ties, the demographic characteristics of the tract, and the emotions expressed by people tweeting from there. We separately measure emotions expressed in English- and Spanish-language tweets, enabling us to additionally explore the impact of language on emotions and social tie formation. In addition, people tweeting from one tract often tweeted from other tracts. Since geography is a strong organizing principle for both offline [24,25] and online [26,27,28] social relationships, we measured spatial diversity to correct for its effect on social network structures in our statistical analyses.
This article illustrates how digital trace data can complement previous studies of online social networks and characterize group-level relationships between the structure of online interactions in urban places and their demographic and socioeconomic characteristics. While unfit to analyze emotions, demographics, and social structures at the individual level, our methods link general properties of the structure of online interactions in groups with their aggregated levels of positive affect. Groups of people who express happier emotions, regardless of language, interact with a more diverse set of social contacts, which puts them in a position to access, and potentially take advantage of, novel information. As our social interactions increasingly move online, understanding and unobtrusively monitoring online social structures at a macroscopic level is necessary for ensuring equal access to the benefits of social relationships. Although many important caveats exist about generalizing these results to offline social interactions, our work highlights the value of linking social media data to traditional data sources, such as the US Census, to drive novel analysis of online behavior and online social structures.

Results
Of the roughly 2000 tracts in Los Angeles (LA), we collected tweets from 1700 tracts. The "Methods" section describes our approach to measuring emotional expression of these tweets. We observe systematic differences between emotions expressed in the tweets posted from different places, the language of the tweets, and the structure of online social interactions, as well as spatial mobility. By combining these data, we created one set with 688 tracts that had at least 15 tweets in both Spanish and English from which we could measure emotional expressions. However, some of these tracts did not have all demographic or socioeconomic variables available from the US Census and were excluded from further analysis. After cleaning the sample of tracts, as explained in more detail in the "Methods" section, the number of analyzed tracts is 539, comprising a total of more than 28 thousand tweets.
Our regression analysis approach, explained in the "Methods" section, applies a mixed-effects model with random intercepts per tract group to correct for spatial autocorrelations. Regression models include a control term for spatial diversity to correct for its negative correlation with tie strength, shown in the left panel of Fig. 1. In this section, we explore these differences and their association with the demographic and cultural characteristics of places from which people tweet.
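The tract selection step described above can be illustrated with a minimal sketch. It assumes tweets are available as (tract id, language) pairs with languages encoded as "en" and "es"; the function name and this encoding are hypothetical and not taken from the original pipeline.

```python
from collections import Counter

MIN_TWEETS = 15  # threshold stated in the text


def tracts_with_both_languages(tweets, min_tweets=MIN_TWEETS):
    """Return tract ids with at least `min_tweets` tweets in each language.

    `tweets` is an iterable of (tract_id, language) pairs, where language
    is "en" or "es" (a hypothetical encoding used only for this sketch).
    """
    counts = Counter(tweets)  # counts[(tract_id, language)]
    tracts = {tract for tract, _ in counts}
    return {
        t for t in tracts
        if counts[(t, "en")] >= min_tweets and counts[(t, "es")] >= min_tweets
    }


# Toy data: tract "A" qualifies, tract "B" has too few Spanish tweets.
sample = (
    [("A", "en")] * 20 + [("A", "es")] * 15
    + [("B", "en")] * 40 + [("B", "es")] * 3
)
selected = tracts_with_both_languages(sample)
print(selected)
```

Tracts that pass this filter would still need complete census variables before entering the regression sample.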

Demographics and social ties
We initially explored the bivariate correlations between tie strength per tract (see Methods) and the demographic variables measured in the census. The center and right panels of Fig. 1 show, respectively, a negative correlation between tie strength and education (Pearson's ρ = -0.338, CI [-0.41, -0.26]) as well as a positive correlation between tie strength and the ratio of Hispanic residents (Pearson's ρ = 0.33, CI [0.25, 0.403]). While this initially indicates that tracts with more educated but fewer Hispanic inhabitants have weaker social ties, we contrast these observations against incremental regression models to verify that our observations are not due to confounds across demographic variables. Table 1 reports the regression results for the logarithm of tie strength as a linear combination of education and the ratio of Hispanic residents, including control terms for age and spatial diversity. The three models include incremental terms that explore the role of education and Hispanic ratio including interaction terms, as both variables are strongly collinear (Pearson's ρ = -0.865, CI [-0.885, -0.842]). The analysis shows a negative association between education and tie strength. On average, tracts with a higher ratio of residents with a bachelor's degree have weaker social ties on Twitter.
Fig. 1 Tract tie strength as a function of spatial diversity and demographic factors. Black dots show the empirical points, red lines show linear fits, blue lines non-parametric local fits, and shaded regions show prediction errors of model fits. Average tie strength is negatively correlated with spatial diversity, which we include as a control variable in our analysis. Tie strength shows a negative correlation with education, measured as the ratio of residents with a bachelor's degree, and a positive correlation with the fraction of Hispanic residents.
The positive relationship between the ratio of Hispanic residents and tie strength appears in the model without interactions (Model 1), but adding interaction terms shows its dependence on education (Model 2). The positive marginal effect of the ratio of Hispanic residents is not significant, but the interaction term with education is significant and positive. This has two implications: first, that the positive association between the ratio of Hispanic residents and tie strength is only present for higher levels of education, and second, that the negative effect of education on tie strength is counterbalanced by the positive interaction with the ratio of Hispanic residents. Thus, for ratios of Hispanic residents above 0.6, we can expect a positive relationship between education and tie strength. This is further evidenced by the fact that the best model in terms of the Bayesian Information Criterion is the model with an interaction term but no direct effect of the ratio of Hispanic residents on tie strength (Model 3).
We repeated the fit of the full model replacing the education variable with two of its strongest correlates: the employment rate and the logarithm of the median household income. These two models, presented in detail in the Supplementary Material, show no significant effects on the logarithm of tie strength, either directly or through interactions with the ratio of Hispanic residents. This highlights the role of education in the result, which is not explained by confounding with the economic variables of income and employment.
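The bivariate correlations and confidence intervals reported above can be recomputed from tract-level data along the following lines. This is a minimal sketch that assumes 95% intervals obtained through the Fisher z-transformation, which the text does not state explicitly; the function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 4:
        raise ValueError("need two equally long samples with at least 4 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for a correlation r from n points.

    Uses the Fisher z-transformation; z_crit = 1.959964 gives a 95% interval.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Assumption: all 539 analyzed tracts enter the education correlation.
low, high = fisher_ci(-0.338, 539)
print(round(low, 3), round(high, 3))
```

Under the stated assumption of 539 tracts, the interval for the education correlation comes out at roughly [-0.41, -0.26], which is close to the reported values.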
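The threshold on the ratio of Hispanic residents follows from the form of the interaction model. As a schematic derivation, write $E_i$ for education and $H_i$ for the ratio of Hispanic residents in tract $i$; the coefficient symbols below are illustrative notation, not the labels used in Table 1, and control terms are abbreviated.

```latex
\log S_i = \beta_0 + \beta_E E_i + \beta_H H_i + \beta_{EH} E_i H_i + \text{(controls)}

\frac{\partial \log S_i}{\partial E_i} = \beta_E + \beta_{EH} H_i > 0
\quad \Longleftrightarrow \quad
H_i > -\frac{\beta_E}{\beta_{EH}}
\qquad (\beta_E < 0 < \beta_{EH})
```

With a negative education coefficient and a positive interaction coefficient, the sign of the education effect flips at $H_i = -\beta_E / \beta_{EH}$, which the text places at about 0.6.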

Language and emotion
As a preliminary step before the study of emotions and tie strength, we surveyed the correlations between emotion measurements across languages. Figure 2 shows the relationships between average normalized values of affect for English and Spanish language tweets. There is a significant positive correlation (Pearson's ρ = 0.27, CI [0.20, 0.35]) between the mean valence of English language tweets and the mean valence of Spanish language tweets from the same tract. Arousal values are only weakly correlated (Pearson's ρ = 0.12, CI [0.03, 0.2]). In contrast, there is no significant correlation between positive values of sentiment in English and Spanish or negative values of sentiment in the two languages. This absence of strong correlations motivates the analysis of affect in more than one language, as emotional experiences and social network structures might vary across ethnicities in Los Angeles.
The average value of valence across all tracts is 5.78 for English language tweets and 5.71 for Spanish language tweets, which corresponds to slightly positive affect [29]. As expected, the measurements of valence are positively correlated with the measurements of positive affect (ρ = 0.50 in English and ρ = 0.30 in Spanish) and negatively correlated with the measurements of negative affect (ρ = -0.55 in English and ρ = -0.21 in Spanish). This illustrates the relative dependence between the dimensional representations of valence-arousal (VA Model) and positive-negative (PN Model), which capture emotional life in different ways. In the further analysis, we apply these representations in two parallel models for each language, to have a double test of the hypothetical relation between tie strength and emotions.

Emotion and social ties
Finally, we study the relationship between emotions expressed in tweets, the language in which they are written, and strength of social interactions in the analyzed tracts. Despite the overall positive affect of tweets, there is a negative correlation between the strength of social ties and the variables of valence and positive affect. Figure 3 shows the scatter plots of tie strength versus affect variables in both languages. Three patterns can be observed: (1) valence and positive affect are negatively correlated with average tie strength in both languages, (2) arousal appears to have little to no correlation with average tie strength, and (3) negative affect appears to be positively correlated with average tie strength in English but not in Spanish.
We evaluate the above observations in two regression models for each language, one using valence and arousal (VA Model) and another using positive and negative affect (PN Model). The main results of these fits are reported in Table 2, with an additional model controlling for demographic effects reported in the Supplementary Material, accompanied by model diagnostics and robustness tests. The regression results verify that, for both languages, valence and positive affect are negatively correlated with average tie strength, while no effect can be observed for arousal in either of the languages. This is consistent with the hypothesis that positive emotions are more likely to be shared with weak contacts, while negative experiences are shared through stronger ties. This result is in line with theories of social regulation of emotions [4] and with previous results in protest movements that showed that online negative emotions were associated with stronger collective action [30]. The same pattern appears as a positive relationship between negative affect and tie strength in English, but not in Spanish, where the estimate points in the opposite direction without consistent significance.

Discussion
The availability of large-scale, near real-time data from social media sites such as Twitter brings novel opportunities for studying online behavior and social interactions at an unprecedented spatial and temporal resolution. By combining Twitter data with US Census data, we were able to study how the socioeconomic and [...] Our findings are broadly consistent with results of previous studies carried out in offline settings, and also give new insights into the structure of online social interactions. We found that, in line with previous research and theoretical arguments, Twitter users express more positive emotions in areas with weaker ties, while negative emotions are more salient where ties are stronger. We find a lack of correlation with arousal that is consistent with a general pattern in which sentiment analysis techniques do not seem to capture the subjective experience of arousal [31]. We find that at an aggregate level, areas where Twitter users form stronger social ties have lower levels of education, but that this effect interacts with the ratio of Hispanic residents in the opposite direction. Since weak ties are believed to play an important role in delivering strategic, novel information, our work identifies education as a main correlate of the presence of weak ties and their associated novel information.
Our results highlight the social component of culture: Hispanic cultures share collectivist values and are less individualist than Anglo-Saxon cultures (for example, Mexico scores 30 and the US 91 on the individualism scale of Hofstede [32]). This provides an explanation for the stronger links of tracts with a higher number of Hispanic residents (interacting with education), as their online network structures reflect their shared values [33,34]. However, this manifestation of shared values in digital traces appears only in areas where education levels are higher, as they also have higher levels of penetration of online social media.
Some important considerations limit the interpretation of our findings. First, our methodology for identifying social interactions may not give a complete view of the social network of Twitter users. Our observations were limited to social interactions initiated by users who geo-reference their tweets. This may not be representative of all Twitter users posting messages from a given tract, if systematic biases exist in what type of people elect to geo-reference their tweets. For demographic analysis, we did not resolve the home location of Twitter users. Instead, we assumed that characteristics of an area, i.e., of residents of a tract, influence the tweets posted from that tract. Other subtle selection biases could have affected our data and the conclusions we drew [35]. It is conceivable that Twitter users residing in more affluent areas are less likely to use the geo-referencing feature, making our sample of Twitter users different from the population of LA County residents. Recognizing this limitation, our conclusions only apply at the group level and not at the level of individual behavior of LA residents. In the same vein as how Twitter data can be used to identify group effects on heart disease mortality [36], our analysis identifies relations between properties of groups of people. While recent research opens the possibility of reweighting Twitter metrics across demographic sections [37], evaluating the external validity of social media metrics requires an interdisciplinary effort beyond the scope of this contribution from data science. Further research can build on our work, combining social media data with standard social science resources, such as surveys, questionnaires, and the census, achieving a more complete picture of social interaction in our current Digital Society.

Data
Los Angeles (LA) County is the most populous county in the USA, with almost 10 million residents. It is extremely diverse both demographically and economically, making it an attractive subject for research. We collected a large body of tweets from LA County over the course of 4 months, starting in July 2014. Our data collection strategy was as follows. First, we used Twitter's location search API to collect tweets from an area that included Los Angeles County. We then used the Twitter4J API to collect all (timeline) tweets from users who tweeted from within this area during this time period. A portion of these tweets were geo-referenced, i.e., they had geographic coordinates attached to them. In all, we collected 6M geotagged tweets made by 340K distinct users.
We localized geo-tagged tweets to tracts from the 2012 US Census. A tract is a geographic region that is defined for the purpose of taking a census of a population, containing about 4000 residents on average, and is designed to be relatively homogeneous with respect to demographic characteristics of that population. We included only Los Angeles County tracts in the analysis. We used data from the US Census to obtain demographic and socioeconomic characteristics of a tract, including the mean household income, median age of residents, percentage of residents with a bachelor's degree or above, as well as racial and ethnic composition of the tract.
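Localizing a geo-tagged tweet to a tract amounts to a point-in-polygon test. The sketch below uses a plain ray-casting test on planar coordinates and toy square tracts; it is an illustration under those assumptions, not the procedure used for the study, and a real pipeline would typically load tract boundaries from census shapefiles with a GIS library. All function names are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test for a point against a simple polygon.

    `polygon` is a list of (lon, lat) vertices; the closing edge from the
    last vertex back to the first is added implicitly.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_tract(lon, lat, tracts):
    """Return the id of the first tract containing the point, or None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


# Two adjacent unit squares standing in for tract boundaries.
toy_tracts = {
    "T1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "T2": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(assign_tract(0.5, 0.5, toy_tracts))
print(assign_tract(1.5, 0.5, toy_tracts))
```

Tweets that fall outside every tract polygon return `None` and would be dropped, mirroring the restriction to Los Angeles County tracts.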

Emotion analysis
We apply sentiment analysis [38], i.e., methods that process text to quantify subjective states of the author of the text, to measure happiness or subjective well-being of Twitter users. Two recent independent benchmark studies evaluate a wide variety of sentiment analysis tools in various social media [39] and Twitter datasets [40]. Across social media, one of the best performing tools is SentiStrength [41], which also was shown to be the best unsupervised tool for tweets in various contexts [40].
English language analysis
SentiStrength quantifies emotions expressed in short informal text by matching terms from a lexicon and accounting for intensifiers, negations, misspellings, idioms, and emoticons. We apply the standard English version of SentiStrength to each tweet in our dataset, quantifying positive sentiment P ∈ [+1, +5] and negative sentiment N ∈ [-1, -5], consistently with the Positive and Negative Affect Schedule (PANAS) [42]. SentiStrength has been shown to perform very closely to human raters in validity tests [41] and has been applied to measure emotions in product reviews [43], online chatrooms [44], Yahoo answers [45], YouTube comments [46], and social media discussions [47]. In addition, SentiStrength allows our approach to be applied in the future to other languages, like Spanish [30,48], and to include contextual factors [49], like sarcasm [50].
Beyond positivity and negativity, meanings expressed through text can be captured through the application of the semantic differential [51], a dimensional approach that quantifies emotional meaning in terms of valence and arousal [52]. The dimension of valence quantifies the level of pleasure or evaluation expressed by a word, while arousal measures the level of activity induced by the emotions associated with a word. Research in psychology suggests that a multidimensional approach is necessary to capture the variance of emotional experience [53], motivating our use of a two-dimensional representation that goes beyond simple polarity approximations. The state of the art in the quantification of these dimensions is the lexicon of Warriner, Kuperman, and Brysbaert (WKB) [54]. The WKB lexicon includes scores in three dimensions (valence, arousal, and dominance) for more than 13,000 English lemmas. We quantify valence and arousal in a tweet by first lemmatizing the words in the tweet, then matching them against the lexicon and computing mean values of valence and arousal as in [55]. While this method is not the most accurate, it provides high coverage for Twitter data [39] and allows a multidimensional representation of emotions that is not frequent in mainstream sentiment analysis. In our dataset, for example, the lexicon matched terms in 82.39% of the tweets.
Figure 4 shows word clouds of tweets from a tract with one of the highest average valence values and from a tract with a lower average valence. The words themselves are colored by their valence, with red corresponding to high and blue to low valence words. Despite seemingly small differences in average tract valence, the words depicted in the word clouds are remarkably different in the emotions they convey. The "happy" tract has words such as "beach", "love", "family", and "beautiful", while the "sad" tract contains many profanities (though it also contains some happy words).
Spanish language analysis
We analyzed emotions expressed in tweets written in Spanish, as determined by the Google language-detection library. After tokenizing tweets (using the Stanford CoreNLP tool) and stemming Spanish words (using the NLTK module), we used GISB, a lexicon developed by Gonzalez, Imbault, Sanchez, and Brysbaert [56], to measure the emotional content of Spanish language tweets. Similar to the WKB lexicon, the GISB lexicon contains a large set (14,031) of Spanish words annotated with valence and arousal values ranging from 1 to 9, with 5 as the neutral point. About 65% of the tweets recognized as Spanish language tweets contained at least one word that matches the GISB lexicon.
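The lexicon-matching step described above can be sketched as follows. The lexicon entries are invented values on a 1 to 9 scale used purely for illustration (they are not WKB scores), and the lemmatizer is a lowercase placeholder standing in for a real NLP tool.

```python
import re

# Hypothetical toy lexicon: lemma -> (valence, arousal). Values are made up.
TOY_LEXICON = {
    "love": (8.0, 5.4),
    "beach": (7.5, 5.0),
    "traffic": (3.0, 5.5),
}


def lemmatize(token):
    """Placeholder lemmatizer: lowercases only."""
    return token.lower()


def tweet_affect(text, lexicon=TOY_LEXICON):
    """Mean (valence, arousal) over matched lemmas, or None if nothing matches."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lemmas = [lemmatize(t) for t in tokens]
    scores = [lexicon[lemma] for lemma in lemmas if lemma in lexicon]
    if not scores:
        return None  # tweet is not covered by the lexicon
    valence = sum(v for v, _ in scores) / len(scores)
    arousal = sum(a for _, a in scores) / len(scores)
    return valence, arousal


print(tweet_affect("Love the beach"))
print(tweet_affect("zzz"))
```

Tweets that return `None` correspond to those not covered by the lexicon; the share of tweets with a non-`None` result is the coverage figure quoted in the text.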
We used the Spanish version of SentiStrength to quantify the positive and negative sentiment expressed in tweets. Similar to its English version, the latest adaptation of SentiStrength to Spanish [48] returns values in the range of [1,5] for positive and [-5, -1] for negative sentiment. We ignored neutral tweets that have the combined score of zero (i.e., the same positive and negative scores). To keep our analysis consistent across languages, we focus on the dimensions of positive and negative affect from SentiStrength and on valence and arousal from the WKB and GISB affective norms lexica.
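The neutral-tweet filter and a tract-level aggregation can be sketched as below. The sketch assumes each tweet has already been scored by SentiStrength (the scoring itself is not reimplemented here) and that tract-level affect is a simple mean over non-neutral tweets, which is an assumption rather than a detail given in the text.

```python
from collections import defaultdict


def tract_mean_sentiment(scored_tweets):
    """Mean positive and negative sentiment per tract, ignoring neutral tweets.

    `scored_tweets` is an iterable of (tract_id, positive, negative) with
    positive in 1..5 and negative in -5..-1. A tweet is neutral when the
    two scores cancel out (positive + negative == 0).
    """
    sums = defaultdict(lambda: [0, 0, 0])  # [sum_pos, sum_neg, count]
    for tract, pos, neg in scored_tweets:
        if pos + neg == 0:
            continue  # neutral tweet, dropped as described in the text
        acc = sums[tract]
        acc[0] += pos
        acc[1] += neg
        acc[2] += 1
    return {t: (p / n, q / n) for t, (p, q, n) in sums.items()}


scores = [("A", 3, -1), ("A", 2, -2), ("A", 1, -3)]
means = tract_mean_sentiment(scores)
print(means)
```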

Social tie analysis
Twitter users address others using the '@' mention convention. We use the mentions as evidence of social ties, although sometimes users address public figures and celebrities also using this convention. We use mention frequency along a tie as a proxy of tie strength, drawing upon multiple studies that used frequency of interactions as a measure of tie strength [6,57,58]. In contrast to other measures, such as clustering coefficient, it does not require knowledge of full network structure (which we do not observe).
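Extracting weighted mention edges from raw tweets can be sketched as follows. The regular expression, the lowercasing of handles, and the exclusion of self-mentions are assumptions made for this illustration and are not described in the text.

```python
import re
from collections import Counter

# '@' followed by word characters, not preceded by a word character
# (so that e-mail addresses are not counted as mentions).
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")


def mention_edges(tweets):
    """Count mentions per directed (author, mentioned user) pair.

    `tweets` is an iterable of (author, text) pairs. The returned Counter
    maps each edge to the number of times the author mentioned that user.
    """
    edges = Counter()
    for author, text in tweets:
        source = author.lower()
        for name in MENTION_RE.findall(text):
            target = name.lower()
            if target != source:  # assumption: self-mentions are ignored
                edges[(source, target)] += 1
    return edges


toy_tweets = [
    ("alice", "hi @Bob and @carol"),
    ("alice", "@bob again"),
    ("dave", "mail me at x@y.com"),
]
edges = mention_edges(toy_tweets)
print(edges)
```

The edge counts play the role of tie weights: a pair with many mentions is treated as a stronger tie than a pair with a single mention.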
Tie strength
For each tract, we create a mention graph with users as nodes and an edge from user A to user B if A mentions B in her tweets. Using this graph, the average social tie strength per tract is defined as
$$S_i = \frac{1}{k_i} \sum_j w_j,$$
where $w_j$ is the weight of the jth edge (i.e., the number of times user A mentioned user B), and $k_i$ is the total number of distinct users mentioned in tract i. We do not have complete knowledge of network structure, since we only observe the tweets of users who geo-referenced their tweets, and not necessarily the tweets of mentioned users. However, even in the absence of complete information about interactions, average tie strength captures the amount of social cohesion and diversity [59]. Figure 5 illustrates mention graphs from two tracts with very different tie strength values. High tie strength (Fig. 5a) is associated with a high degree of interaction and more clustering [2]. In contrast, low tie strength is associated with a sparse, more diverse network with few interconnections (Fig. 5b). In the analysis presented in this article, we apply a logarithmic transformation to $S_i$, reducing its skewness. After this transformation we identified three outliers that could be mapped to tracts with very few tweets, which we removed for the statistical analysis. For completeness, we repeated the analysis including these three outliers, leading to very similar results and model diagnostics that we report in the Supplementary Material.
Spatial diversity
Geography and distance are important organizing principles of social interactions, both offline [24,25] and online [27,28,58]. While most social interactions are short-range, long-distance interactions serve as evidence of social diversity [18]. In this paper, we use the movement of people across tracts as evidence of the spatial diversity of their social structures. Following Eagle et al. [18], we measure the spatial diversity of the places from which people tweeting from a given tract also tweet, using Shannon's entropy ratio,
$$D_i = \frac{-\sum_{j=1}^{n_i} p_{ij} \log p_{ij}}{\log n_i},$$
where $n_i$ is the number of tracts from which users who tweeted from tract i also tweeted, and $p_{ij}$ is the proportion of tweets posted by these users from tract j,
$$p_{ij} = \frac{T_{ij}}{\sum_{j'=1}^{n_i} T_{ij'}},$$
where $T_{ij}$ is the number of tweets that have been posted in tract j by the users who have tweeted from both tract i and j. Thus, spatial diversity is a ratio that compares the empirical entropy of the data with its expected value in the uniformly distributed case. As a consequence, a high spatial diversity value for a tract suggests that people tweeting from that tract split their tweets evenly among all the tracts they are tweeting from. In contrast, a low value implies that people tweeting from that tract concentrate their tweets in few tracts.
Regression models
Our statistical analysis applies mixed-effects regression models of the logarithm of the average tie strength in each tract. To control for spatial correlations, we model a random intercept for each tract group, where tract groups are identified by the tract prefixes of the census. These tract groups are formed by the division of earlier tracts into subtracts, capturing spatial autocorrelations in a way that is more sensitive to demographic features than pure geographic methods that ignore urban and administrative barriers. We fit models using the lmer function of the lme4 package, specifying a model with tract groups as random intercepts.
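The two tract-level measures defined above can be computed along the following lines. This is a minimal sketch under two stated assumptions: tie strength is the total mention weight divided by the number of distinct mentioned users, and spatial diversity is the Shannon entropy of the tweet shares divided by the logarithm of the number of tracts. Function names, and the convention of returning 0 for a single tract, are illustrative choices.

```python
import math


def average_tie_strength(edge_weights):
    """Average tie strength S_i for one tract.

    `edge_weights` maps directed (source, target) pairs to mention counts.
    The sum of weights is divided by the number of distinct mentioned users.
    """
    if not edge_weights:
        raise ValueError("tract has no mentions")
    mentioned = {target for _, target in edge_weights}
    return sum(edge_weights.values()) / len(mentioned)


def spatial_diversity(tweet_counts):
    """Entropy ratio for one tract.

    `tweet_counts` maps tract j to T_ij, the number of tweets posted in j
    by users who also tweeted from tract i. Returns a value in [0, 1].
    """
    counts = [c for c in tweet_counts.values() if c > 0]
    n = len(counts)
    if n < 2:
        return 0.0  # convention for this sketch: log(1) = 0 leaves the ratio undefined
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts)
    return entropy / math.log(n)


strength = average_tie_strength({("a", "b"): 3, ("c", "b"): 1, ("a", "d"): 2})
even = spatial_diversity({"x": 5, "y": 5})
skewed = spatial_diversity({"x": 99, "y": 1})
print(strength, even, skewed)
```

An even split of tweets across tracts yields a ratio of 1, while tweets concentrated in one tract push the ratio toward 0, matching the interpretation given in the text.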
The left panel of Fig. 1 shows the existence of a sizeable negative correlation (Pearson's ρ = -0.47) between tie strength and spatial diversity. Tracts that bring together people who also tweet from different places have weaker ties, while tracts with more concentrated user groups have stronger ties. To exclude this pattern from our demographic and emotion analysis, we include a linear term of spatial diversity in each regression model. We further include an additional control term with the average age of residents in the census, to control for the possible age effect in the intensity of social links as manifested in Twitter. For the demographics model, we perform three fits to survey the interaction between education levels and the ratio of Hispanic residents. For the analysis of emotions, we fit each model twice: first a mixed-effects model using tract groups as random effects and sentiment variables as fixed effects, and second a model that takes as dependent variable the residuals of the demographics model and regresses them against the emotion variables. This way, we verify that the results of the emotion model fits are robust to the role of demographic factors in tie strength.
To verify the validity of our fit results, we perform regression diagnostics that are reported in the Supplementary Material. These verify that multicollinearity is weak (moderate Variance Inflation Factors), that residuals are roughly normally distributed, that no pathological correlations exist between residuals and independent variables, and that no pattern of heteroscedasticity is present. These verifications support the validity of the inferences underlying the conclusions of this analysis.
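For reference, the full demographics model described above can be written schematically as a random-intercept model. The symbols are illustrative notation, not taken from the original tables: $E_i$ is education, $H_i$ the ratio of Hispanic residents, $A_i$ the age control, $D_i$ spatial diversity, and $g(i)$ the tract group of tract $i$.

```latex
\log S_i = \beta_0 + \beta_E E_i + \beta_H H_i + \beta_{EH} E_i H_i
         + \beta_A A_i + \beta_D D_i + u_{g(i)} + \varepsilon_i,
\qquad
u_g \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

In this notation, the model without interactions corresponds to $\beta_{EH} = 0$, and the model with an interaction but no direct effect of the ratio of Hispanic residents corresponds to $\beta_H = 0$. The emotion models would replace the demographic terms with the affect variables while keeping the same random-intercept structure.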
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.