Introduction

As summarized by Ritchie and Roser in 2018 [66] and revised in 2019, more than half of the global population currently live in urban areas, and increasingly so in cities that are highly dense. Such rapid and pervasive urbanization is a unique feature of modern life, never having been witnessed before to this extent in human history. Urbanization has profoundly influenced human activity and behavior, including the ways in which we live, work, commute and socialize [6, 7].

At the same time, the relative rise in technological fluency and adoption of novel technologies (especially among the younger populace), driven in no small part by the ubiquity and low cost of devices such as smartphones [70], as well as network externalities from growing social media platforms and services such as Twitter, Instagram, WhatsApp and WeChat [22, 59], has led to a new era for social science. The field of computational social science [3], which has emerged slowly but surely over the previous decade due to largely uncoordinated efforts in different fields, including urban studies, economics, sociology and network science, has generally been described as the ‘computational study of social phenomena’ and is primarily focused on emergent collective phenomena through analysis of interacting individual units (including both people and organizations) [12, 45]. With the availability of both public and proprietary datasets from governments, polling organizations (such as Gallup [21]) and enterprise (and non-profit) alike [41], computational social science has made great strides, especially in urban settings [4, 25, 71] (see also “Related work)”. Foth et al. [26] describe the field of urban informatics as being ‘situated at the intersection of notions, trends and considerations for place, technology and people in urban environments’. The growth of social media, especially short, public messages from platforms such as Twitter, allows us to study all three of these features in painstaking detail using a range of visualization and machine learning technologies. While many advances have already been achieved in urban informatics, many questions also remain.

In this article, we study one such question, namely the tendency of people who are based in similar locations to discuss topically similar themes. We introduce computational spatio-textual affinity as an paradigm both for understanding urban differences in the way that people in specific urban areas (such as New York City and Los Angeles) communicate, as well as what they communicate. It is important to note that some differences can be explained due to inherent differences in the city’s geography, politics, national culture, language and even time zones. For example, it would not surprise us to know that tweets from Madrid are more likely to be in the Spanish language than tweets from the US Midwest, or that tweets from an electoral region that was known to favor a particular political candidate would express more favorable sentiment about that candidate than a region that was heavily opposedFootnote 1. However, when studying two cities over an identical period of time, and that have similar traits with respect to language, politics and national affiliation, we ask the question: is it feasible to distinguish tweets in one city from another city based on their textual characteristics? Put another way, when we cluster tweets using (subsequently described) textual similarity measures, do clear spatial divisions emerge as a result?

For some tweets and topics, the answer would (presumably) be in the affirmative. Local events, including crimes, road closures and festivals, are all good examples: a local book festival in Los Angeles would not be expected to provoke conversation in New York. Even here, however, caution is called for. A shooting in Los Angeles would likely receive national press coverage and may become a topic of discussion everywhere. In general, due to homogenization of mass media and entertainment (such as Netflix) [57], and due to the special status of urban dwellers as ‘cosmopolitan’ consumers of such services [13], we would expect a lower degree of spatio-textual affinity between cities like New York and Los Angeles. While some of our experiments do bear out this conclusion, we provide key results that illustrate differences between the two cities, providing validation for the use of the framework when studying similar cities through the lens of similar data. Additionally, by studying distributions of textual and spatial affinity in aggregate, regional evidence of spatio-textual evidence is seen to emerge both at the levels of zipcodes and counties.

This article’s contribution to spatio-textual affinity is twofold. First, we present rigorous methodologies, designed for studies of multiple granularities, for measuring and studying both spatial and textual affinities. We do not rely on a single measure, but provide at least two ways to define either affinity. Second, we conduct a detailed empirical study in the context of two international metropolises (Los Angeles County and the Boroughs of New York City). Our study uses unbiased and geotagged data collected from the Twitter decahose (an approximately random 10% of tweets streaming at any given moment that has to be proprietarily obtained from Twitter) during a full month. By not relying on the public Twitter API for selective data acquisition or using automatic geolocation detection techniques, we mitigate a potential source of noise and bias that always raise questions of validity for such heavily empirical studies. We further maintain rigor by focusing on affinity as a correlational and non-causal phenomenon. We also use a judicious mix of qualitative methods, including visualizations and specific examples, to express some of the consequences of our findings and hypotheses.

The rest of this article is structured as follows. We start with a review of some related work in “Related work”, followed by a description of data and methodology in “Data and methodology” and the experimental study itself in “Experiments”. “Discussion” provides a discussion of some core findings, as well as extended findings that expose the scope for future research. “Conclusion and future work” concludes the article.

Related work

Urban informatics and social media

Quantitative studies in an urban context have been on the rise with the release of data, with commonly studied domains including demographics, transportation, and economics [5, 8, 65]. Use of GPS and cellular data has also been the subject of various studies, especially in the last decade [56, 64, 65]. In many works, a city is compared in the context of world city networks, where pairs of cities are linked based on indicators such as whether they share headquarters of some companies [16, 18, 73], whether both are part of the same production chains [8], Internet [53], financial services [5] and existence of direct flight or ship routes [19, 20, 77]. Regional differences within cities have also been studied e.g., Grauwin [33] studied differences between several major cities in the world, along with regional studies based on patterns of mobile phone call and SMS. Yuan et al. [76] study land use and human mobility within the city using GPS trajectory of taxis. Our work complements, and adds to, this emerging body of work by studying a specific concept (spatio-textual affinity) for a linked pair of coastal metropolises (Los Angeles and New York City).

In fact, increasing use of social media (see, for example, [50]) has led to a growing body of work in urban informatics [71], and unprecedented opportunities to study urban differences based on big data with a global footprint and human-generated content. Based on social media data, previous research is often revisited and some authors have illustrated divergence from previous studies on topics such as human mobility [24, 46, 55], social media activities [2, 25], lifestyle [37], emotional ‘maps’ [4], language usage [49] and word interpretation [38] with variables including region, occupation, age and gender. Hu et al. [38] combine modern Natural Language Processing (NLP) and social media methodologies for such studies by leveraging word embeddings that are arguably better suited for capturing underlying patterns such as grammar, synonyms and morphological relations in text. Using such embeddings, differences in word interpretations on topics like fashion and sports have been found between genders and regions (NYC and the Bay Area on the US West Coast). Studies that measure some kind of ‘spatial affinity, such as the work in [75] that groups similar cities across the world based on a ‘Local-Based Social Network’ collected from Foursquare (showing spatial affinity of cities in same clusters) have also been on the rise. In a similar vein in [61], cities are compared based on spatial distribution of amenities. [40] further utilize textual data and graphical models to surface content that is characteristic for a region. However, all of these studies have independently studied spatial or textual affinity, without attempting to bring them together in a common empirical framework. In this article, we address this issue using both spatial affinity and NLP methods, including word embeddings and clustering.

Sociology

As urbanization has expanded, urban studies have also taken on a more prominent role in sociology, including the field of urban sociology [32]. Over the last few decades, there has also been an increased focus on data-driven, empirical studies. A work that is, in principle, similar to the studies in this paper is the study of demographic dynamism and metropolitan change for four different cities in the US, including New York City and Los Angeles [52]. However, since social media (as we know it today) did not yet exist, any such study could not take into account ‘grassroots’ conversations coming from people in those cities through millions of digital conversations. Therefore, to our knowledge, studies of spatio-textual affinity at scale have been non-existent. In other relevant work, urban differences have been viewed and studied through the lens of globalization and world-system theory by Knox et al. [42]. Cities in the context of globalization have been an object of increasing academic and policy-oriented scrutiny for almost four decades now [17, 28, 35, 72].

Domain-specific applications

Attempts have also been made to apply findings from urban informatics and social media analytics to domain-specific cases like policymaking, city-planning [11, 29, 54], computational epidemiology [68] and automatic geolocation detection [10, 23, 62, 74]. For example, Chenge et al. [10] and Wing et al. [74] utilize word frequency to model spatial variations and predict geolocations of tweets. Eisenstein et al. [23], Hong et al. [36] and Emre et al. [9] use generative models to identify features of geographical areas. Frias et al. [27] determines land use in urban areas by clustering regions with similar tweeting activity patterns. Based on detected land use, Preoţiuc-Pietro and Cohn [60] cluster users and reveals communities with different lifestyle and preference signatures. Using similar methodologies, Sadilek et al. [68] models the associations between lifestyle and health conditions. While we do not focus on applications of spatio-textual affinity in this article, it could serve as a valuable feature in many of the applications above. For example, the positive evidence that we find in favor of spatio-textual affinity could help improve accuracy of geolocation detection systems using some of our findings in a weak supervision setting [48, 63].

Data and methodology

To explore our hypotheses on spatio-textual affinity empirically, we used social media data from the New York and Los Angeles metropolitan areas collected from the Twitter decahose (understood to be an approximately random 10% sample from globally streaming Twitter data that must be purchased directly from Twitter) over all 31 days of the month of October, 2016. To remove any ambiguities concerning location, we filtered and used only the subset of the English tweets that had geotags embedded in the metadata (usually because the GPS is on and the tweeter has allowed the information to be embedded in the tweet metadata) and where the geotags could be triangulated to either lie in Los Angeles County or in the Boroughs of New York City (the Bronx, Brooklyn, Manhattan, Queens, and Staten Island). Footnote 2 Henceforth, we succinctly refer to these metropolitan areas as Los Angeles (LA) and New York City (NYC), respectively.

Besides their cultural significance, another good reason for choosing these two areas for exploring our hypotheses on spatio-textual affinity is that they are generally taken to be more similar than different. Politically, both are known to be liberal and lean heavily (much higher than the national average) towards the Democratic Party [51]. Both also have a healthy population of Twitter users; our data contains 247,068 tweets with geotags from LA and 197,854 tweets from NYC. Based on Fig. 1, we observe major ‘hotspots’ of twitter prevalence and density in areas such as Hollywood and downtown LA (Fig. 1a), and Manhattan and Long Island (Fig. 1b). However, there are also many smaller clusters in both plots.

Despite the similarities (primarily the fact that both cities are big and international metropolises), there are also many differences between them, not the least of which is the spatial dispersion and density of the city’s residents. In the recent era, the COVID-19 crisis showed how dense and interconnected NYC was, since the disease spread quickly through the city in the earliest days of the crisis [44]. In popular culture, the two cities also exhibit differences [14]. A range of other studies have also paired New York City and Los Angeles [1, 15, 30, 34, 47]; hence, there is precedent for selecting these two cities for exploring the concept of spatio-textual affinity.

Fig. 1
figure 1

Spatial distribution of geotagged tweets from New York City (b) and Los Angeles (a) between 10/01/2016 and 10/31/2016

To study spatio-textual affinity, we have to precisely define both how to measure textual similarity and spatial similarity. For the latter, an established measure is the Haversine distance between the geotags (a latitude–longitude coordinate) of two tweets [69]. Specifically, given two latitude–longitude coordinates \((\mathrm{lat}_1, \mathrm{long}_1)\) and \((\mathrm{lat}_2,\mathrm{long}_2)\) in radians, we denote two variables \(d_1\) and \(d_2\) as \((\mathrm{lat}_1 - \mathrm{long}_1)\) and \((\mathrm{lat}_2 - \mathrm{long}_2)\), respectively. The Earth’s radius is given by the symbol r. With these notations in place, the haversine distance HD between two latitude–longitude coordinates is defined as

$$\begin{aligned} \begin{aligned} \mathrm{HD} =&2 \times r \times \arcsin [\\&\sqrt{\sin ^2(d_1 / 2) + \cos (\mathrm{lat}_1)\cos (\mathrm{long}_1)\sin ^2(d_2 / 2)}] \end{aligned} \end{aligned}$$
(1)

In some studies, a more city-specific version of spatial affinity is also considered, since the Haversine distance can sometimes be less meaningful due to the different densities of the two cities. Specifically, we consider zipcodes and counties: two tweets are deemed to have spatial affinity of 1 if they have the same zipcode or county (depending on the measure used) and 0 otherwise. In a similar vein, the coarsest-grained measure is city membership: two tweets are assumed to have spatial affinity of 1 if both are geolocated in the same city (and 0 otherwise). Despite their coarse granularity, we show later that these measures can be useful for macroscopic analysis utilizing many tweets.

Textual similarity is a more complicated and subjective matter, especially if automated methods from natural language processing (such as word embeddings [39]) are considered. We consider two separate methodologies for measuring textual similarity. The ‘conservative’ methodology is to only consider topics that are explicitly and unambiguously declared by the tweeter through hashtags. To avoid topical overlap and ambiguity, we limit our corpus in hashtag-based experiments to tweets that have exactly one hashtag. Tweets that share a common hashtag are judged to be similar, otherwise dissimilar. Applying this single-hashtag constraint yields a filtered set of 34,047 tweets and a total of 17,243 unique hashtags. Note that this approach is not only conservative, but also binary (similar to the coarse-grained spatial affinity measures): either a pair of tweets is judged to be similar or not similar. In “Experiments”, we refer to this methodology as the hashtag-based methodology for computing textual affinity.

This conservative methodology, while likely to lead to results that are more precise and unambiguous, also discards much of the data, since many tweets may have more than one hashtag or even no hashtags. Its binary nature also precludes more interesting and nuanced studies, especially since it does not give any weight to the actual content of the tweets (the words) besides the lone hashtag. Therefore, we consider a second methodology, namely subword-based embeddings embodied by the fastText package released by Facebook Research [39]. These embeddings, which can be used to convert words and entire sentences (and hence, tweets) into dense, continuous, real-valued vectors are particularly apt for handling the kinds of spelling variations common on social media. We used the publicly available embedding described in [31] that had been previously trained on 400 million Twitter English tweets and found to yield excellent performance on multiple NLP tasks.

Using the embedding methodology, the textual affinity can be computed by computing the cosine similarity between the vector representations of the two tweets. Given two vectors \(v_1\) and \(v_2\), the cosine similarity between them is \(\frac{v_1 \cdot v_2}{\Vert v_1 \Vert \times \Vert v_2 \Vert }\). It is an established way in the NLP community for measuring textual affinity in a more nuanced way. In “Experiments”, we refer to this methodology as the embedding-based methodology for computing textual affinity.

Experiments

Measuring spatio-textual affinity: hashtag-based methodology

As a first step, we study spatio-textual affinity using the hashtag-based methodology described in the previous section. Specifically, we group our tweet collection by hashtag, Footnote 3 and then calculate, for each such cluster, the ratio \(\mathrm{ratio}_\mathrm{LA}\) of tweets that fall within the LA metropolitan area.

Fig. 2
figure 2

Cluster frequency distribution of \(\mathrm{ratio}_\mathrm{LA}\), with clusters created using shared hashtags

In total, there are 17,239 clusters, since there are 17,239 unique hashtags. In Fig. 2, we plot the cluster–frequency ratio \(\mathrm{ratio}_\mathrm{LA}\). We find a significant topical separation between LA and NYC, with about 16,089 clusters being relatively ‘local’ ( \(0.8 \le \mathrm{ratio}_\mathrm{LA}\) or \(\mathrm{ratio}_\mathrm{LA} \le 0.2\)), and with only about 1150 clusters having a ‘moderate’ mix of participants from both cities ( \(0.8 \ge \mathrm{ratio}_\mathrm{LA} \ge 0.2\)). This is perhaps the most direct evidence of clear spatio-textual affinity.

Further analysis showed that most of these hashtags have a long-tailed frequency distribution themselves, with most having been used only once in our considered time period. To better understand the spatial relationships of these hashtags, and the common trends across the two metropolitan areas, we plot the top 20 most frequently occurring hashtags in LA and NYC in Figs.  3 and 4, respectively. From these figures, we find that hashtags such as ‘job’, ‘hiring’ and ‘CareerArc’ have similarly high ranks in both cities, which is understandable, since the economy and the workplace are predominant areas of conversation in any large-scale social setting. However, some evidence of location-specificity starts to emerge even in these top 20 hashtags, since some are related to locations (e.g., ‘NYC’, ‘Torrance’ and ‘Longbeach’). Furthermore, other hashtags may not directly pertain to location, but seem to have high ranking in one metropolis but not the other. For example, ‘diabetes’ is present in NYC but not Los Angeles (although ‘Healthcare’ is) and ‘VPDebate’ is present in both top 20 lists but at relatively different ranks. To obtain an independent trend for these topics, we also plot hashtag usage (for some hashtags) using Google Trends (Fig. 5). Similarities and differences between LA and NYC are clearly manifested, e.g., ’clinton’ has wide disparity between both cities, while ‘healthcare’ is nearly coincidental and ‘job’ is highly synchronous. Therefore, the google trends support the Twitter results, permitting us to verify (at least in part) the empirical validity of our experiments.

Fig. 3
figure 3

Top 20 most frequently occurring hashtags in LA

Fig. 4
figure 4

Top 20 most frequently occurring hashtags in NYC

Fig. 5
figure 5

Daily Google Trends index for selected keywords over October, 2016

Even though the hashtag clustering is conservative, it can shed light on conversations on social media that are national in scope, spanning cities and regions. To obtain these insights, we conduct the following experiment that utilizes both hashtags and sentence embedding. Recall that there were 1150 hashtag-based clusters with a ‘moderate’ mix of participants. For each such cluster within the set of clusters \(C=\{t_1,\ldots ,t_{1150}\}\), respectively, with \(|C_k|\) tweets within the \(k\mathrm{th}\) cluster, we partition the pairwise setFootnote 4 derived from that cluster into three sub-clusters: \(C_\mathrm{LA}\), \(C_\mathrm{NY}\) and \(C_\mathrm{hetero}\). \(C_\mathrm{LA}\) and \(C_\mathrm{NY}\) each contain those pairs, where both tweets in the pair are (respectively) from LA and New York. \(C_\mathrm{hetero}\) contains the remainder, i.e., (by definition) all pairs, where one tweet each is from LA and New York region. To control for the fact that tweets may be coming from the same user, we remove all pair of tweets from the same users in these sub-clusters for this experiment.

Next, for each sub-cluster, we compute a textual affinity distribution in embedding space by computing the cosine distance between the embeddings of the tweets in the pair. Footnote 5 Since there are always three non-emptyFootnote 6 sub-clusters for each cluster (i.e., hashtag), we get three distributions that we can plot. Figure  6 provides illustrative results for four such hashtags. Intriguingly, we find inter-city differences in these distributions even though they are well-mixed. #job and #TheWalkingDead, for example, have distributions with non-trivial non-overlap, while #debate and #Trump (normally considered polarizing topics) have almost identical overlap, which may (hypothetically) be due to the similar political affiliations in both cities. Furthermore, from the trend of daily hashtag usage shown in Fig. 7, we obtain a similar conclusion as in Fig. 6 and (earlier) in Fig. 5: some hashtags have more distinctive spatial signatures (i.e., are more predictive of, and tied to, their locations) than others, despite seeming like national or pop-culture topics on the surface.

Fig. 6
figure 6

Sub-cluster distributions for four example hashtags in ‘mixed’ clusters based on sentence embeddings. The specific methodology is described in the text

Fig. 7
figure 7

Daily occurrence counts for selected hashtags over October normalized by total tweets over October, 2016 for both LA and NYC

Measuring spatio-textual affinity: embedding-based methodology

To better address the brittleness that can be caused by hashtag-only clustering, we conduct a deeper study of spatio-textual affinity of tweets using k-means clustering on the tweet embeddings. Since k-means assumes k (the number of clusters) as given, we used the established elbow method to obtain an optimal k value of 1000 [43]. Once we obtain the clusters, we compute the distribution of clusters based on \(\mathrm{ratio}_\mathrm{LA}\), analogous to the distribution in Fig. 2 for hashtag-based clustering. As shown in Fig. 8, the resulting distribution, while still having localized peaks at the extremes, now has a more robust mixture distribution compared with Fig. 2, with many more clusters (even as a ratio of the total number of clusters having moderate \(\mathrm{ratio}_\mathrm{LA}\) values).

Considering the significant change that manifested only when taking a finer-grained view of textual similarity (continuous-space embeddings versus discrete hashtags), the question arises as to whether a similar change could be observed by taking a finer-grained view of spatial similarity inside LA and NYC regions rather than just classifying tweets as being from LA or NYC. To this end, we use the Haversine distance presented in Eq. 1 for the next set of experiments.

Fig. 8
figure 8

Cluster frequency distribution of \(\mathrm{ratio}_\mathrm{LA}\), with clusters created using cosine distance-based k-means on sentence embeddings

First, for the 1000 embedding-based clusters of which the \(\mathrm{ratio}_\mathrm{LA}\) distribution is plot in Fig. 8, we took the 70 clusters with \(\mathrm{ratio}_\mathrm{LA} \ge 0.8\). The union \({\mathcal {C}}=\bigcup _{i=1}^{70} C_i\) of these clusters contains 15,320 (\(=|{\mathcal {C}}|\)) tweets in total. We start by computing the Haversine distance dist({\(t_i, t_j\)}) (using the lat-long coordinates in \(t_i\) and \(t_j\)) between the two tweets in every possible pair of tweets from LA. Next, we partition the distances into two distributions in and between, where dist({\(t_i, t_j\)}) is placed into in if the tweets \(t_i\) and \(t_j\) fall within the same cluster, and placed into between otherwise. We plot these two sets of distances as probability distributions in Fig. 9 using a Kernel Density Estimation (KDE) algorithm [58, 67]. Although visually apparent in the figure, we also used the Chi-squared test (specifically, by computing the D-statistic) to reject the null hypothesis at the 99.99% confidence level that the two distributions are the same (statistically). This difference between the plots indicates a high degree of spatio-textual affinity: simply controlling for the city (i.e., high \(\mathrm{ratio}_\mathrm{LA}\)) is not enough to yield strong affinity, since whether two tweets have the same cluster membership (in the fine-grained text embedding space) clearly matters.

Fig. 9
figure 9

Haversine distributions of tweet pairs from LA in ‘LA-dominant’ embedding-based clusters (where \(\mathrm{ratio}_\mathrm{LA} \ge 0.8\))

More evidence of spatio-textual affinity is indicated when we repeat the experiment for the ‘mixed’ clusters, i.e., those clusters with \(\mathrm{ratio}_\mathrm{LA}\) between 0.2 and 0.8, by again plotting the in and between distributions in Fig. 10. Despite the visual suggestion in the figure, we find that even in this case we can reject the null hypothesis that the two distributions are the sameFootnote 7, but the D-statistic (D = 0.0119) is very different (almost 15\(\times \) lower) from that of the experiment above (D = 0.1754). Hence, spatio-textual affinity declines considerably as we start ‘loosening’ spatial constraints (as we do here by only considering mixed-location clusters). In other words, both spatial and textual dimensions make strong positive contributions in measurements of spatio-textual affinity.

Fig. 10
figure 10

Haversine distributions of tweet pairs from LA in ‘mixed’ embedding-based clusters (where \(0.2 \le \mathrm{ratio}_\mathrm{LA} \le 0.8\))

While not shown here, the conclusion for NYC was found to be the same: for the set of clusters with low \(\mathrm{ratio}_\mathrm{LA}\) (and hence, high concentration of tweets from NY), a similar finding as in Fig.  9 was observed.

We further explore affinity observed above on a more intuitive level of geographical region rather than at the fine-grained level of latitude–longitude coordinates or at the highly coarse-grained level of the entire city. For the next set of experiments, therefore, we choose the zipcode, city and county as our spatial affinity variables, not dissimilar to using the hashtag earlier as the textual affinity variable. Namely, if two tweets have the same value for either the city, zipcode or county (depending on the experiment) they have a spatial affinity of 1, and 0 otherwise. Formally, we can use the Kronecker Delta Function (KDF) to express this notion. Given a function \({\mathcal {F}}\):

$$\begin{aligned} \delta ^{{\mathcal {F}}}_{ij} = {{\left\{ \begin{array}{ll}0&{}{\text {if }}{\mathcal {F}}(i)\ne {\mathcal {F}}(j),\\ 1&{}{\text {if }}{\mathcal {F}}(i)={\mathcal {F}}(j).\end{array}\right. }} \end{aligned}$$
(2)

Here, \(\delta ^{{\mathcal {F}}}_{ij}\) is the KDF. We compute three types of KDFs: \(\delta ^\mathrm{zipcode}_{ij}\), \(\delta ^\mathrm{city}_{ij}\) and \(\delta ^\mathrm{county}_{ij}\). For example, \(\delta ^\mathrm{zipcode}_{ij} = 1\) for precisely those (unordered) pairs of tweets \(\{t_i,t_j\}\) that have the same zipcode.

Furthermore, since these are nominal variables, χ2 test and Cramer’s V value are used to determine the significance of the difference in distributions of tweet pairs falling within ‘in’ or ‘between’ clusters in the same vein as in Figs.  9 and 10. The Cramer’s V value is defined by

$$\begin{aligned} V = \sqrt{\frac{\chi ^{2} / n}{\min (c - 1, r - 1)}} \end{aligned}$$
(3)

Here, \(\chi ^{2}\) is derived from Chi-squared test, n is the total of samples, and c and r are (respectively) the number of columns and rows (the respective value set observed for the two nominal variables). Cramer’s V ranges from 0 to 1, with 0 corresponding to no association between the variables and 1 corresponding to complete association (Cramer’s V can reach 1 only when each variable is completely determined by the other).

We compute the three KDFs (\(\delta ^\mathrm{zipcode}_{ij}\), \(\delta ^\mathrm{city}_{ij}\) and \(\delta ^\mathrm{county}_{ij}\)) for all possible pairs of tweets from LA and NYC in both local and mixed clusters, where \(\mathrm{zipcode}(i)\), \(\mathrm{city}(i)\) and \(\mathrm{county}(i)\) correspond, respectively, to the zipcode, city and county that tweet \(t_i\) belongs to. For each type of function and cluster, we partition the KDFs into two distributions in and between, depending again on the condition imposed on \(\mathrm{ratio}_\mathrm{LA}\) (analogous to the in and between distributions in Figs.  9 and 10). The counts, specifically for all LA tweets, are tabulated in Table 1. Note that computing all pairs of tweets from mixed clusters is not computationally viable due to the large number of tweets that fall in mixed clusters compared to local clusters; hence, for each mixed cluster, we randomly subsample 1000 tweets (if there are more than 1000 tweets in the cluster), to obtain approximately \(C^{1000}_2\) pairs (that could be further sub-divided into ‘in’ and ‘between’ pairs).

Table 1 Counts of pairs of tweets on each Kronecker Delta Function (KDF) setting. ‘In’ (‘between’) refers to the case when both tweets belong to the same (different) embedding-based cluster. ‘Local’ (‘mixed’) refers to clusters that have \(\mathrm{ratio}_\mathrm{LA} \ge 0.8\) (\(0.2 \le \mathrm{ratio}_\mathrm{LA} \le 0.8\)). Tweets are never paired across ‘local’ and ‘mixed’ clusters

The counts show that, in local clusters, there is higher count of tweet pairs in the condition \(\delta ^\mathrm{zipcode}_{{ij}} = 0\) (compared to condition \(\delta ^\mathrm{zipcode}_{{ij}} = 1\)), but the situation reverses in mixed clusters, showing that spatio-textual affinity strongly asserts itself at the more macroscopic level rather than at the microscopic level. However, at the county level, spatio-textual affinity is completely dominant, i.e., both for local and mixed clusters, there is much higher count of tweet pairs for the condition \(\delta ^\mathrm{zipcode}_{{ij}} = 1\).

Concerning significance (as well as analysis of the results in the context of both NY and LA clusters), we tabulate the Cramer’s V values in Table 2. One nominal variable is the KDF, while the other variable simply records whether two tweets (an observation for the purposes of this experiment is always a pair of tweets) are in an ‘in’ condition (belong to the same cluster) or the ‘between’ condition (belong to two different clusters). The meanings of ‘local’ and ‘mixed’ clusters are the same as earlier (depending on the \(\mathrm{ratio}_\mathrm{LA}\)), but we break out local clusters further based on whether the clusters are NYC or LA clusters. While all values are low in the table, Cramer’s V of local clusters are about 10x larger than that of mixed clusters, indicating stronger spatio-textual affinity in local clusters compared to mixed clusters. As further evidence, scatter-map plots in Fig. 11 and Fig. 12 also show noticeable affinity difference between local and mixed clusters.

Table 2 Two variables involved in the Cramer’s V calculations are ‘in/between’ and the KDF in each row. For an explanation of ‘In/Between’, see caption of Table 1
Fig. 11
figure 11

Spatial distribution of geotagged tweets from LA in local and mixed clusters (different color represents different clusters and each point represents a single tweet)

Fig. 12
figure 12

Spatial distribution of geotagged tweets from NYC in local and mixed clusters (different color represents different clusters and each point represents a single tweet)

Discussion

The statistical results and visualizations in the previous section show that while different levels of affinity exists in local and mixed cluster, far greater affinity exists in local clusters. Interestingly we can see that affinity is the strongest for LA clusters when measured on the city level, while affinity is strongest for NY clusters when measured on the zipcode level.

To further understand the reason why statistically significant differences in affinity exist between NY/LA-dominant clusters and mixed clusters, we looked into the content of selected tweets from these clusters. In mixed clusters (with \(\mathrm{ratio}_\mathrm{LA}\) between 0.2 and 0.8), many clusters tend to have general topics shared by both LA and NYC e.g. political tweets such as ”@realDonaldTrump @HillaryClinton HRC talk about important issues like jobs I security health care Got nothing on Trump to move the needle”, but also national sports e.g.,”@chedsy22 Cool. I will root for the Giants big time if they play the Dodgers or Nationals.”. In contrast, in local clusters (with \(\mathrm{ratio}_\mathrm{LA} \le 0.2\) or \(\mathrm{ratio}_\mathrm{LA} \ge 0.8\)) many topics are (understandably) discussing local events, e.g., ”Awesome! #RickAndMorty #NYCC @Javits Center” and ”Halloween Horror Nights 2016 #iGetScaredEasily #AlmostShittedMyself #BarelyMadeItOutAlive?”. Local clusters are also more likely to include notifications for local incidents and information, e.g. local crimes: ”Incident on #QM2Bus EB from 6th Avenue: 34th Street to 6th Avenue: 59th Street https://t.co/RmU4nfTP4E” and job opportunities: We’re #hiring! Click to apply: MCAT Instructor (Full-Time) - Los Angeles - https://t.co/AVoidXUmdE #Job #Education #LosAngeles, CA #Jobs. These qualitative results lend further support to the quantitative data discussed earlier.

In an attempt to gain further insight into how regional variables could affect spatial distribution of tweets, we looked into the relationship between crime rates and tweet prevalence rates (the number of tweets geotagged in a zipcode divided by the population of that zipcode). Fig. 13 illustrated a somewhat surprising result on a preliminary experiment conducted for NYC tweets: when zipcode populations are controlled, there is actually a positive correlation between the tweet prevalence rate and the crime rate, with weaker correlation observed for LA. Further study of this result is warranted and may involve going deeper into the tweets of high crime/capital zipcodes to understand the nature of such tweets. Since there is high spatio-textual affinity at the zipcode level for NYC (Table 2), sampling a few hundred tweets (or hashtags) for the high crime/capita zipcodes may be all that is necessary to get the ‘pulse’ of the neighborhood (at least approximately). It is important to note that while a positive correlation would have been expected if the variables had not been per capita (since the population would have been the obvious explanation for an upward contemporaneous trend in both variables), the interesting aspect about the plot on the left in Fig.  13 is the upward trend even when controlling for population. The LA plot seems more normal and expected in this regard.

Fig. 13
figure 13

Scatter plot of tweet prevalence rate vs. crime rate (one data point per zipcode). The populations of zipcodes are indicated through colors. There is a stronger positive correlation for NYC (right) than for LA (left)

Most likely there are other variables in the NYC case that could serve as explanations. For example, if there are more nightclubs in the high-density neighborhoods in NYC, then that may partially explain both high crime/capita and high tweet prevalence. The same may not be true for LA, which is much less dense and far more spread-out than NYC, as even earlier plots (such as Fig. 1) have illustrated. Another possible explanation is the unemployment rate. Earlier, we showed that there were hashtags in the dataset related to jobs (including #job itself). Since high crime and unemployment are themselves correlated, it may be that there is a positive correlation between tweets per capita and between the unemployment rate, both measured at a per-zipcode level, just like in Fig. 13. We plot this data in Fig. 14 and find that there is some evidence for this. There is a positive correlation for both cities, and the differences between the two cities seem to have been reduced. Interestingly, therefore, the tweets per capita could be used as an informal, real-time barometer for measuring changes in phenomena such as the unemployment rate without always relying on surveys, especially at the fine-grained level of zipcodes in large cities such as NYC and LA.

Fig. 14
figure 14

Scatter plot of tweet prevalence rate vs. unemployment rate (one data point per zipcode). The populations of zipcodes are indicated through colors

As another qualitative study illustrating city-specific dependencies on tweeting behavior, we also plot the number of tweets sent out each hour in local time. The results in Fig. 15 illustrate the clear lags in the orange peak showing that NYC tweeters tend to reach their peak in late evening, even though LA sent out more tweets (at least within the full month of the data). We are planning to rigorously study such differences on a larger scale in the future. For example, using our textual affinity variables, we can study not only whether, after controlling for the time zone differences, the two cities are tweeting at the same rates during the same local hours (which, according to Fig.  15, they are not) but also whether they are tweeting about the same things or topics.

Fig. 15
figure 15

Number of geotagged tweets for each hour of a day in local time (Pacific Time for Los Angeles and Eastern Time for New York) over the entire period of the dataset (October, 2016)

Conclusion and future work

In this article, we developed spatio-textual affinity as a comparative paradigm for profiling geotagged citizen social media, and understanding the differences between dense and diverse urban metropolises through such comparative profiles. To illustrate the promise of this paradigm, especially at the scale of hundreds of thousands of tweets, we collected a dataset that contains an unbiased sample of tweets across the New York and Los Angeles metropolitan areas within a reasonably compressed time period in the United States. Our dataset has a high degree of control due to low variation between the cities on several important factors that are known to heavily influence social media content (including politics, demographics and nationalities). Within our dataset, we found clear evidence of spatio-textual affinity. Our findings are reasonably robust to the specific measures of affinity adopted for both the spatial and textual dimensions.

However, we re-iterate that it is important to consider these results in their context, and not as absolute or causal truths. Like many urban computing and computational social studies, the data was collected through observation rather than intervention. However, the volume of the data, consistency among the results both when using hashtag-based and embedding-based textual similarity, and the mechanisms of control that we described in “Data and methodology”, lend some measure of credibility to the use of spatio-textual affinity as a framework for understanding urban social media differences, even between two otherwise similar cities.

Our studies reveal that there are multiple promising avenues for future research. In particular, the more qualitative results in “Discussion” can be formalized further, not dissimilar to the studies in “Experiments”. Other such studies of a similar nature can also be designed and investigated. Investigating similar questions in the context of other cities both within the US, but also in other countries, is an issue that future research could investigate. In particular, the question remains open whether similar findings will be observed when we consider cities that are big but politically disparate (such as New York and Houston), cities that are dissimilar in size, and cities that fall within different countries. Multi-lingual studies could also be considered, but may be more challenging to calibrate without adequate translation of Twitter lingo. Replicating such studies in non-Twitter social media may also yield interesting empirical findings. Finally, replicating such a study in the context of COVID-19 may yield even more useful insights, since most people are currently working from home and not commuting, potentially leading to more robust estimates of spatial affinity.