1 Introduction

Residential segregation has historically been associated with societal issues such as economic, educational, and health inequalities [1, 2]; as a consequence, it has been a central focus in social, economic and political sciences [35]. Recent studies show that while racial segregation seems to be decreasing in the United States [6, 7], income inequality has been simultaneously rising [8, 9]. According to the Stanford Center on Poverty and Inequality, 1% of the American population held 21% of all the income in 2012, which is more than double of what they held in 1970 (8.4%). This change is coupled with a sharp increase in residential segregation by income [10]. In forty years, the number of American families living in middle-income neighborhoods went from 65% down to 43% in large metropolitan areas. Families are thus increasingly living in either extremely poor or rich neighborhoods, endangering the existence and stability of the middle classes [11].

In order to quantify residential segregation, American census reports [12] calculate twenty different indexes across five dimensions, namely: evenness, exposure, concentration, centralization and clustering [13]. These metrics are mostly based on static census data and do not reflect patterns in an activity or behavioral space. At the same time, regardless of whether they involve physical space or not, restrictions on any type of social interaction may be considered as forms of segregation [14]. These, together with the increasing availability of data sources resulting from human activities [15, 16], have led to an increasing number of studies on modern forms of segregation in spaces beyond residential neighborhoods. Most notably, recent works have shown that there exists clear separation between different ethnic or income groups in everyday activities such as visitation of urban areas [1723] or consumption of online information [2429], leading to the so-called “echo chambers” or “filter bubbles” [30].

While the literature mainly focuses on the limited exposure of certain socio-demographic and wealth groups to the others, the restriction on interactions between these groups [31] remains rather unexplored, possibly due to the lack of large-scale interaction data. The recent work of Morales et al. [23] has shown that groups of different income levels have differentiated topics of conversation, and that exposure limited by segregated interactions both offline and online is a key variable for homogeneization. Along a similar line of investigation, and following recent studies using similar data sources in analysing urban mobility and behavior [3235], we combine in the present paper large-scale credit card transaction and Twitter data sets to study income segregation in daily purchase activities and online communication, thus capturing two explicit interactions in economic and social behavior. We analyze how the patterns of segregation in both offline and online activities are intertwined, and vary with respect to both difference in socio-economic status and geographical distance. We demonstrate the consistency in these patterns by examining different cultural and political contexts, in three large metropolitan areas from Europe, Latin America, and North America.Footnote 1 Although we do not have a direct matching of individuals between the transaction and Twitter data sets, we study behaviors at the collective scale by aggregating the data by urban administrative neighborhoods, for which socio-economic status can be obtained from national census data.

The main contributions of the present paper are three-fold. First, we show that segregation in behavioral interactions is amplified with respect to the expected effect of geographic segregation, in terms of both purchase activity and online communication. Second, we analyse segregation with respect to socio-economic status and geographical distance, where we found that segregation is most pronounced between extreme income groups. Finally, we demonstrate that segregation is asymmetric for purchase activity, where the amount of interactions from poorer to wealthier neighborhoods is larger than the other way around. These findings provide a new angle to study modern forms of segregated behavior, with implications on urban planning, policy-making, and inequality reduction.

2 Results

Human patterns of exploration in urban and social spaces are linked to both individual financial wellbeing [36] and regional economic development [37, 38]. We measure exploration by means of the diversity of purchases and Twitter communication via mentions. In order to measure diversity, we first characterize each individual with a pair of vectors whose elements represent either shops (in the case of purchases) or other individuals (in the case of Twitter mentions), and count the number of times individuals purchase at each shop or communicate with other individuals, respectively. We then measure, as explained in Materials and Methods, the individual diversity as the Shannon entropy of each vector. Figure 1 shows a scatter plot with the aggregate neighborhood diversity of purchases and Twitter mentions after averaging over the individuals who live in each neighborhood of the European and Latin American cities. The average neighborhood diversity of both types of behaviors is positively correlated with each other (\(r=0.45\) in Europe and \(r=0.38\) in Latin America), as well as with the neighborhood socio-economic status.Footnote 2 This indicates that people living in poorer neighborhoods are less exploratory in their purchase and online activities, suggesting that they live in physically and virtually confined spaces.

Figure 1
figure 1

Purchase interactions among some of the neighborhoods in the European metropolitan area. Each neighborhood is shown as an area on the map (panel (A)) and as a node in a graph (panel (B)), and it is in both cases color-coded by its socio-economic status. Each curve in both panels represents an interaction between a pair of neighborhoods, and it is color-coded by the number of purchases made by residents of one neighborhood in another. The directions of the curves are represented by their convexity: curves of convex shape represent interactions from the left end point to the right end point, while those of concave shape represent interactions from right to left

The confinement of physical and virtual spaces is associated with segregation by income. We analyze this relationship by creating networks of interactions among neighborhoods based on purchases and Twitter mentions. In both networks, nodes represent neighborhoods whose socio-economic status is obtained from census data. Edges represent either the number of purchases made by customers living in neighborhood i at stores in neighborhood j, or the number of tweets directed from users living in i to users living in j. To account for potential bias in the sampling of the users in the data sets, we use population-weighted versions of the interaction networks (see Materials and Methods for data statistics and construction of population-weighted interaction networks). Figure 2 displays an illustration of the purchase network for some of the neighborhoods in the European metropolitan area. By analyzing the structure of these networks and the distribution of edges among neighborhoods, we are able to observe patterns of urban mixing or segregation.

Figure 2
figure 2

Neighborhood-level average diversity of purchases and Twitter mentions in the European and Latin American metropolitan areas. The size of the dots is proportional to the census population of the neighborhood and the color code indicates its wealth level. The correlation between both types of diversity is 0.45 and 0.38 for the European and Latin American case, respectively

In order to quantify segregation, we put the neighborhoods into ten groups according to their socio-economic status, where all groups have an equal number of neighborhoods. We then create mixing matrices whose elements show the aggregate number of interactions between the ten groups [39]. We further normalize the mixing matrix for both behaviors into a stochastic matrix whose elements show the probability of directed interaction among pairs of socio-economic groups (see Materials and Methods and Appendix for the construction and visualization of the mixing matrices). We found that most of the interactions occur within groups of the same socio-economic status. We quantify such preference by calculating the assortativity coefficient of the mixing matrices [39]. A coefficient of 1 indicates a perfectly assortative network while 0 indicates random mixing patterns.Footnote 3 The intuition is that assortative matrices are dominated by entries along and close to the matrix diagonal, indicating a stronger preference for neighborhoods to interact with similar ones (hence segregation). For example, in the European metropolitan area the assortativity coefficient of the mixing matrices for purchase and Twitter mentions is 0.42 and 0.41, respectively, indicating a certain degree of segregation (see Fig. 11, Fig. 12, and Fig. 13 for the mixing matrices in the European, Latin American, and Northern American cases, respectively).

While the assortativity coefficient shows a global description of the network, it misses heterogeneity in its structure, such as differences in the amount of segregation among certain socio-economic groups. In order to capture such heterogeneity, we analyze the segregation between the highest and lowest socio-economic strata and progressively include the remaining socio-economic groups in both directions until reaching the whole network. At each step, we consider a certain number of groups both at the top and bottom of the socio-economic distribution (hence focusing on a percentage of the neighborhoods) and measure their segregation with the assortativity coefficient (see Fig. 14, Fig. 15, and Fig. 16 in Appendix). Figure 3 (top left) shows the assortativity coefficient as a function of the percentage of neighborhoods considered at each extreme of the distribution, for both purchases and mentions’ networks, in the European city. For comparison, we also include in the figure the expected results from two artificial networks generated by (i) simulating neighborhood-to-neighborhood interactions with a gravity-based model [40] and (ii) randomly reshuffling neighborhoods’ socio-economic status in a null model (see Appendix for details on the construction of artificial interaction networks using gravity-based model and null model). Similar to the empirical interaction networks, both artificial networks have been re-weighted according to neighborhood population.

Figure 3
figure 3

(Top) The assortativity for different networks, as a function of the percentage of neighborhoods with extreme socio-economic status included in computation. (Bottom) The assortativity for different networks, as a function of the distance thresholds used for pruning edges in the interaction networks. The error bars in the green and cyan curves correspond to the standard deviations, and all other error bars correspond to the 95% confidence interval using a jackknife resampling technique as in [46]

The following observations can be made from Fig. 3 (top left). First, segregation is most pronounced between the highest and lowest socio-economic groups, which barely interact with each other, and decreases by including middle-class neighborhoods, which serve as “social bridges” between the richest and poorest parts of the society. Second, segregation in interactions (blue and orange) is stronger than the one due to geographical distance (yellow and purple), implying that it cannot be simply attributed to the segregated distribution of residential households in the city. Third, this segregation is also stronger than the one produced by the null model (green and cyan) showing that the patterns we observe are significant and not an artifact of the data. While the segregation in purchase patterns could be expected partially given the limitations that prices impose on people, the fact that they also tend to self-segregate on the Internet is interesting. Furthermore, segregation online appears to be even stronger than offline especially between the highest and lowest socio-economic groups. Similar patterns are also observed for the Latin and North American cities in Fig. 3 (top middle) and in Fig. 3 (top right).

Geography and the organization of the physical urban space has been linked with segregation and inequality [41]. In order to further analyze the role of geographical distance in the segregation in interactions, we measure the assortativity coefficient at multiple distances, by only considering subsets of neighborhood pairs that are either within a certain distance of d km or beyond this distance, representing short- and long-distance interactions, respectively (see Appendix for details on the analysis procedure). The bottom row of Fig. 3 depicts the assortativity coefficients as a function of d for the empirical networks of the three cities. It can be seen that both networks are predominately segregated due to short-distance interactions (blue and orange), with the network resulting from short-distance interactions having a consistently higher assortativity coefficient than that resulting from long-distance interactions (yellow and purple). In the case of purchases, short-distance interactions are less costly in terms of time and money, and are dominated by daily activities such as groceries or banking. Interestingly, the same pattern holds for online behaviors, which could be dominated by the interaction of local social groups and is consistent with the finding in [28]. Moreover, the positive assortativity score for long-distance interactions in the European and North American cases suggests that self-segregation might even exist between neighborhoods that reside further away.

Apart from being segregated, it is interesting to investigate whether interactions are symmetric among different socio-economic groups. To this end, we measure the excess of interactions directed from poorer to richer areas, relative to the number of interactions in the opposite direction. This is quantified as the difference between the sums of the lower and upper triangles of the mixing matrices. Figure 4 shows the poor-to-rich interaction bias as a function of the percentage of neighborhoods considered, following the same methodology presented in Fig. 3, where we initially consider only the highest and lowest extremes of the wealth distribution and progressively include neighborhoods towards the middle. The bias is positive in the case of offline purchases (blue curve), and generally increases as we include the middle-low and middle-high wealth groups. Moreover, although still presenting asymmetric interactions, the less bias from the lowest to the highest wealth group, together with the most pronounced segregation between these two groups as shown in Fig. 3, indicates a polarization of behavior in this case. On the other hand, online communication does not seem to exhibit such bias (orange curve).

Figure 4
figure 4

Asymmetry in the interaction patterns between the relatively poor and the relatively rich segments of the population. The error bars correspond to the 95% confidence interval using a jackknife resampling technique as in [46]

We further investigate the robustness of the asymmetric relationship, by repeating the analysis on the two artificial networks introduced before: (i) simulated interaction networks based on a gravity model, and (ii) the ones produced by randomly reshuffling the neighborhoods’ socio-economic status in a null model. While in the former case the asymmetry persists for offline purchases (yellow curve in Fig. 4), it disappears in the latter where the bias drops to zero (green curve). For the European metropolitan area, when all neighborhoods are considered, the observed asymmetric pattern in offline behavior is more pronounced that the one produced by the gravity-based model. This implies that the stronger tendency of interactions from the relatively poor to the rich cannot be simply attributed to a geographical factor. The same argument, however, does not hold for the case of the Latin American city, which is likely due to the observation that richer neighborhoods account for more stores in that case. Although the observed asymmetric relationship for offline purchases might be influenced by a larger number of stores (and popular ones) in richer neighborhoods, it nevertheless suggests that at macro scale there seems to exist a hierarchy that is embedded in the behavioral interactions between different segments of the society.

3 Discussion

In summary, our results suggest that segregated patterns exist in both urban and online interactions between different socio-economic groups, and they seem stronger than that expected from merely the geographic distribution of residential households. While residential neighborhoods in a region might consist of different socio-economic groups, interactions, both physically and socially, seem to take place more often between neighborhoods whose economic conditions are similar. Such emerging behavior might be expected from purchase activities partially due to the constraints imposed by prices, but less so from online communication where boundaries are more likely self-imposed.

Indeed, while purchase behaviors are constrained by mobility, time and monetary resources associated with the spatial segregation patterns observed in our data, it was expected that online behavior would mitigate those constrains by creating a virtual third-place [42] where more diverse interaction would be possible. However, our results reinforce the recent findings that “echo chambers” in virtual space recreate and amplify the observed residential segregation in physical space. Nevertheless, it is still possible that segregation can be mitigated by encouraging virtual conversations and physical interactions between different groups. The promotion of such interactions might be critical in reducing segregation and prove more effective than simply an increase in the exposure to opposing views [29].

More interestingly, we observe that the restrictions on interactions, in both urban and online space, are most pronounced among the extremes of the wealth distribution, but fuzzy for the middle classes, which might act as social bridges [34] distributing information across the social system. Interactions across different segments of the society might therefore be promoted especially through the agency of groups with middle socio-economic status given their bridging roles. Furthermore, an asymmetric pattern of interaction for offline purchases seems to suggest the existence of a hierarchy at macro scale, where richer areas attract a disproportionately large amount of capical (see “Segregation and economic inequality” in Appendix). This, in turn, is crucial for the creation of new economic opportunities. As observed in Fig. 20, for the European metropolitan area, a stronger segregation pattern in purchase interactions is linked with a higher level of inequality between neighborhoods in terms of their sales’ revenues. This observation is worth of further investigation, with the possible implication that urban planners may consider a better strategy in allocating store locations for a more even distribution of capital, which can be achieved by promoting tax segmentation.

Our analysis has limitations. Even though the number of users in the two data sets are correlated (see Fig. 5 and Fig. 6), penetration rates of credit card and Twitter usage differ in neighborhoods of different socio-economic status. Richer neighborhoods tend to account for more samples in our data sets, an observation that is most pronounced for the Latin American city (see Fig. 6). This leads to under-representation of population in neighborhoods with lower socio-economic status. In this work, we have used population-weighted interaction networks to account for such sampling bias, and further investigation would be needed to fully assess its impact. Furthermore, due to the culture and constraints of the different countries, the credit card transaction data may only represent a fraction of the daily spending as people may choose to pay by cash in certain situations. Finally, our socio-economic status data are obtained at the neighborhood level, and may not necessarily reflect the economic situation of individuals in the data sets. Nevertheless, the general consistency between the results in three cities from three different continents across a period of several months suggests the validity of our findings in the contexts examined in the present study.

Figure 5
figure 5

The relationship between the number of customers in the transaction data set, the number of users in the Twitter data set, and the neighborhood-level wealth, for the European metropolitan area

Figure 6
figure 6

The relationship between the number of customers in the transaction data set, the number of users in the Twitter data set, and the neighborhood-level wealth, for the Latin American metropolitan area

4 Materials and methods

4.1 Data sets and pro-processing

The credit card transaction data sets are provided by two major financial institutions, one in an European country and one in a Latin American country. Each record in the data set corresponds to one credit card transaction along with customer and store IDs, as well as the time (day, hour and minute) of the transaction and the spending amount in local currency. Additional information about the customers and stores are also made available, including customers’ home location as well as store location and category. The customer-level data are pseudonymized such that each customer is represented by a pseudo-unique number, in a way similar to the pseudonymization of mobile phone call detail records (CDRs). In addition, all personally identifiable data attributes were removed before the data sets were provided to us.

We focus our analysis on two large metropolitan areas of the two countries. As pre-processing steps, we first filter out foreign and online transactions to focus on local and physical activities. We then consider customers who made at least ten transactions in the data set. In the European case, this leads to a set of 2.4 million records of individual credit card purchases from April to June 2013, made by 85 thousand individuals at 54 thousand stores. In the Latin American case, this consists of a set of 3.5 million records of individual credit card purchases from April to July 2013, made by 200 thousand individuals at 55 thousand stores.

We collect geo-localized Twitter data sets using Twitter’s Streaming API [43] from August 2013 to August 2014. In the European case, it contains 76 million geo-localized tweets within the metropolitan area of interest, from 1.4 million Twitter users. In the Latin American case, it contains 10.3 million geo-localized tweets within the metropolitan area under study, from 422 thousand Twitter users. Finally, in the Northern American metropolitan area, it consists of 22.4 million geo-localized tweets, from 862 thousand Twitter users.

On Twitter, a user A can mention or reply to another user B in his post in which case the post contains B’s username. This allows us to build neighborhood-level Twitter mention networks. For this purpose, for the European case, we select a subset of 20.1 million tweets containing user mentions or replies that are posted by 1 million users, for whom we are also able to infer their home locations. For the Latin American and Northern American metropolitan areas, we collect 3.8 million tweets and 8.1 million tweets containing user mentions or replies that are posted by 260 thousand and 440 thousand users, respectively.

It is worth noting that in this paper we study interactions among urban neighborhoods of different socio-economic status. In all the three metropolitan areas we consider, neighborhoods are administrative districts of similar size used for census purposes. We have around 660 such neighborhoods in the European case, around 160 in the Latin American case, and around 190 in the Northern American case. For the European city, the neighborhood-level socio-economic status is provided by a national institute in 2011, which is a composite measure between 0 and 100 that quantifies the relative prosperity of the neighborhood based on a number of indicators such as income and education level. The higher the index, the more prosperous the neighborhood is. In the Latin American case, we use the neighborhood-level marginalization index provided by a national institute in 2012 as an approximation of the (negative) socio-economic status, namely, the higher the marginalization index, the lower the socio-economic status. Finally, for the Northern American city, we use median household income provided by a national survey for the period of 2010–2014 to approximate the socio-economic status of the neighborhoods.

Even though we do not have a matching between the individuals in the two different data sets, we are able to study both offline (purchases) and online (Twitter mentions) behavior at the level of administrative neighborhoods within the city. Specifically, for the credit card data set, we associate the customers and stores with the neighborhoods in which they reside and are located, respectively. For the Twitter data set, the procedure for assigning a home neighborhood to a Twitter user is as follows. First, we map tweets to the neighborhoods. We do this by observing in which polygons (neighborhoods) the coordinates of the tweets of the user fall into. Second, we observe the times of the day in which the user tweets from each neighborhood. Third, we select the neighborhood that is used the most during night hours, i.e., from 8pm to 6am, as the home neighborhood of the user. We then compute diversity scores and construct neighborhood-level interaction networks as described below.

4.2 Data statistics

Figure 5 (Top Left) and Fig. 6 (Top Left) illustrate the comparison between the number of credit card customers and Twitter users in the European and in the Latin American cases, respectively. We see that, for both metropolitan areas, the number of people in the two data sets are correlated (\(r=0.76\) in the European case and \(r=0.69\) in the Latin America case). The other plots in Fig. 5, Fig. 6 and Fig. 7 show the relationship between the number of credit card users, number of Twitter users, the neighborhood population, and the neighborhood-level socio-economic status, for the three metropolitan areas, respectively. It can be seen that in the European case the number of users in the credit card and Twitter data is strongly correlated with population (\(r=0.74\) and \(r=0.62\)), and not biased towards population with certain socio-economic status (\(r=0.25\) and \(r=0.24\)). In comparison, the correlation between the number of credit card and Twitter users and the population is much weaker for the Latin American (\(r=-0.01\) and \(r=-0.20\)) and Northern American (\(r=-0.01\)) cases, where the sampling of users is biased towards richer neighborhoods (\(r=-0.64\) and \(r=-0.53\) for the Latin American case and \(r=0.42\) for the Northern American case). To address the issue of bias in sampling, in particular for the Latin American and Northern American cases, we construct population-weighted interaction networks as described below.

Figure 7
figure 7

The relationship between the number of users in the Twitter data set and the neighborhood-level wealth, for the Northern American metropolitan area

Figure 8 shows the distribution of the number of credit card customers, stores, Twitter users, and that of the socio-economic status of the neighborhoods in the European metropolitan area, where Fig. 9 and Fig. 10 show the same distribution in the Latin American and Northern American metropolitan areas, respectively.

Figure 8
figure 8

Histograms of (Top Left) the number of credit card customers, (Top Right) the number of stores, (Bottom Left) the number of Twitter users, and (Bottom Right) the socio-economic status, for neighborhoods in the European metropolitan area

Figure 9
figure 9

Histograms of (Top Left) the number of credit card customers, (Top Right) the number of stores, (Bottom Left) the number of Twitter users, and (Bottom Right) the socio-economic status, for neighborhoods in the Latin American metropolitan area

Figure 10
figure 10

Histograms of (Left) the number of Twitter users, and (Right) the median household income, for neighborhoods in the Northern American metropolitan area

4.3 Computation of the diversity score

For each individual s, we define the diversity as the Shannon entropy of his/her purchase (or Twitter) activities:

$$ D(s) = -\sum_{t=1}^{N} p_{st} \operatorname{log}(p_{st}), $$
(1)

where \(p_{st}\) is the probability that an individual s (or Twitter user) visits a store t (or mentions another user t) and N is the total number of stores (or Twitter users). The average diversity score of a neighborhood is then defined as the average diversity of individuals living in that neighborhood. This approach is similar to the network-based approach of Eagle et al. [37].

4.4 Construction of the interaction networks

We construct two networks to capture interactions between different neighborhoods. In these networks, nodes represent neighborhoods and edges represent interactions whose intensities are captured by the weights of the edges. For purchase network, we define a directional edge from neighborhood i to j with weight \(w_{ij}^{(p)}\), which is the number of purchases made by customers living in i at stores in j. Similarly, for Twitter mention network, we define a directional edge from neighborhood i to j with weight \(w_{ij}^{(t)}\), which is the number of mentions made by Twitter users living in i to users in j.

4.5 Construction of the population-weighted interaction networks

To address the issue of bias in the sampling of the users in the credit card and Twitter data sets, we scale the interaction networks using a population based weighting scheme. More specifically, for the credit card data set, if a neighbourhood i has \(m_{i}\) credit card users and \(p_{i}\) population, and the number of purchases from i to j is \(w_{ij}^{(p)}\), then we define the population-weighted interaction in the credit card case as:

$$ \bar{w}_{ij}^{(p)} = \frac{w_{ij}^{(p)}}{m_{i} / p_{i}}. $$
(2)

Similarly, for the Twitter data set, if neighbourhood i has \(x_{i}\) twitter users and \(p_{i}\) population, neighbourhood j has \(m_{j}\) twitter users and \(p_{j}\) population, and the number of mentions from i to j is \(w_{ij}^{(t)}\), then we define the population-weighted interaction in the Twitter case as:

$$ \bar{w}_{ij}^{(t)} = \frac{w_{ij}^{(t)}}{(m_{i} \times m_{j}) / (p_{i} \times p_{j})}. $$
(3)

4.6 Construction of the mixing matrices

To quantify segregation, we first construct the mixing matrices. Specifically, we put the neighborhoods into ten groups according to their socio-economic status, where all groups have an equal number of neighborhoods. The groups have increasing socio-economic status from 1 to 10, i.e., the socio-economic status group for neighborhood i is \(s(i) = [1,2,\dots,10]\). We then convert the population-weighted interaction networks into the 10 by 10 mixing matrices, whose mn-th entry is defined as:

$$ \begin{aligned} &M_{mn}^{(p)} = \sum _{s(i)=m, s(j)=n} \bar{w}_{ij}^{(p)}, \\ &M_{mn}^{(t)} = \sum_{s(i)=m, s(j)=n} \bar{w}_{ij}^{(t)}. \end{aligned} $$
(4)

Finally, we normalize the mixing matrix of both networks into a stochastic matrix:

$$ \begin{aligned} S_{mn}^{(p)} = \frac{ M_{mn}^{(p)} }{ \sum_{n} M_{mn}^{(p)} }, \\ S_{mn}^{(t)} = \frac{ M_{mn}^{(t)} }{ \sum_{n} M_{mn}^{(t)} }, \end{aligned} $$
(5)

This way, \(S_{mn}^{(p)}\) and \(S_{mn}^{(t)}\) represent the probability of interaction from one socio-economic status group m to another n in terms of credit card purchases and Twitter mentions. The resulting mixing matrices for the interaction networks in the European, Latin American, and Northern American case are shown in Fig. 11, Fig. 12, and Fig. 13, respectively.

Figure 11
figure 11

Mixing matrices for (Left) the purchase network and (Right) the Twitter mention network, for ten socio-economic status groups in the European metropolitan area. Socio-economic status groups are ordered from the lowest wealth (1) to the highest wealth (10)

Figure 12
figure 12

Mixing matrices for the (Left) purchase network and (Right) mention network, for ten socio-economic status groups, in the Latin American metropolitan area. Socio-economic status groups are ordered from the lowest wealth (1) to the highest wealth (10)

Figure 13
figure 13

Mixing matrix for the mention network, for ten income groups, in the Northern American metropolitan area. Socio-economic status groups are ordered from the lowest wealth (1) to the highest wealth (10)

4.7 Computation of the assortative mixing coefficient

Assortative mixing coefficient (or assortativity) is a measure proposed by Newman et al. [39] to quantify the phenomenon of homophily in social networks, which can also be used for measuring segregation in networks [44]. In the context of the mixing matrices, it is equivalent to Cohen’s Kappa, a classical psychometric measure of agreement on nominal variables [45]. In our case, the more assortative the mixing matrices, the more segregated the behavioral interaction patterns.

The assortativity proposed in [39] is computed as follows. Given a weighted network, let \(e_{xy}\) be the fraction of weights of edges in the network that join nodes having attribute values x and y. The assortativity is then defined as:

$$ r = \frac{\sum_{xy} xy (e_{xy}-a_{x} b_{y})}{\sigma _{x} \sigma _{y}}, $$
(6)

where \(\sum_{xy}e_{xy}=1\), \(a_{x}=\sum_{y}e_{xy}\) is the fraction of weights of edges starting from nodes with attribute value x, \(b_{y}=\sum_{x}e_{xy}\) is the fraction of that connecting to nodes with attribute value y, and

$$ \sigma _{x} = \sqrt{\sum_{x} x^{2} a_{x} - \biggl(\sum_{x} x a_{x} \biggr)^{2}} $$
(7)

and

$$ \sigma _{y} = \sqrt{\sum_{y} y^{2} b_{y} - \biggl(\sum_{y} y b_{y} \biggr)^{2}} $$
(8)

are the standard deviations of distributions of \(a_{x}\) and \(b_{y}\), respectively. As we can see, the assortativity r is the Pearson correlation coefficient between the attributes of the two end nodes for all the edges.

In Appendix, we describe in detail how we compute the assortativity based on a percentage of socio-economic status groups as well as geographical distance between neighborhoods.