Semantic embedding for regions of interest

Available spatial data are rapidly growing and diversifying. One may obtain, in large quantities, information such as annotated points of interest (POIs), check-in comments on those POIs, geo-tagged microblog comments, and demarcated regions of interest (ROIs). These sources interplay with each other, and together they build a more complete picture of the spatial and social dynamics at play in a region. However, efforts to build a single fused representation of these data have so far been rudimentary, for example limited to spatial joins. In this paper, we extend the concept of semantic embedding for POIs and devise the first semantic embedding of ROIs, in particular one that captures both the spatial and the semantic components of an ROI. To accomplish this, we develop a multipart network model that captures the relationships between the diverse components and, through random-walk-based approaches, use it to embed the ROIs. Through extensive experiments, we demonstrate the effectiveness of this embedding at simultaneously capturing both the spatial and semantic relationships between ROIs. Applications such as region popularity prediction demonstrate the benefit of using ROI embeddings as features in comparison with baselines.


Introduction
In the last decade, location-based social networks (LBSNs) such as Facebook, Instagram, Foursquare, and Twitter have attracted billions of users, who can check in at points of interest (POIs) and promptly share their experiences in the physical world via mobile devices. It is crucial for such service providers to leverage the data they collect to make personalized recommendations that help their users explore new places and to facilitate targeted advertising for generating revenue [3,8,26]. Recent literature suggests that a distributed representation, or embedding, of POIs can further improve the results [18,43,47,56]. Note that a POI is a single point or place on the map of Earth (e.g., New York Stock Exchange, New York). Recently, interest in studying regions of interest (ROIs) [45] has been rising [39], where the social dynamics occurring at the POIs located in a particular region are considered as a whole. Picturing the semantic and spatial features of different regions, intertwined with people's activities, can yield important information such as functional behavior, distinctive features, and social effects, which can be further utilized in urban planning and region-level recommendation.
An example application is shown in Fig. 1: ROIs 02000000 (blue) and 08000005 (green) are semantically as well as spatially correlated with ROI 09000000 (yellow) in Manhattan, New York City. Semantic category information for ROI 09000000 is also presented in Fig. 1, where Outdoors and Recreation, College and Education, Nightlife and Pubs, Travel and Transport, and Professional Services are the top five categories based on the cosine similarity metric. A careful look at the map reveals that ROI 09000000 contains the Statue of Liberty, Ellis Island, Battery Park, and the World Trade Center, which have been visited by more than 3.5 million visitors on average over the last 5 years [30]; this is a major reason for Outdoors and Recreation being the topmost category. New York University, The King's College, Pace University, etc., are also located within the region, which explains why College and Education takes the second spot. The next three top categories are intuitive, since Lower Manhattan is a hub of popular old pubs, financial offices, hotels, and a well-connected subway, transport, and ferry system. Although ROI 08000005 is geospatially distant from ROI 09000000, the two are semantically similar in terms of Outdoors and Recreation, College and Education, and Travel and Transport because of Central Park; the New York University Midtown Campus and Pace University; and Grand Central Terminal and major subway connections, respectively.
An effective approach to capturing both semantic and spatial features at the same time is to embed them in a latent semantic space, as elaborated by [43,47,48] for POIs. Hence, embedding ROIs with semantic features would also be an effective method for ROI analysis. Nevertheless, existing solutions only consider the semantic embedding of POIs, not ROIs. A naive extension of POI semantic embedding to ROIs is to aggregate the POI features of all POIs inside an ROI and then treat that ROI as an aggregated POI. However, this approach is not effective in capturing spatial and semantic information simultaneously, because simple aggregation loses interesting correlations between spatial and semantic information, as also verified by our experiments. We reduce the ROI embedding problem to a tripartite graph embedding problem with entities (a) ROIs, (b) POIs, and (c) Words, whose embedding goal is to minimize the difference between the probability distribution of the embedded entities in the latent space and that of the information graph network based on edge connections. The ROI embedding model facilitates the online analysis and discovery of the (dis)similarities between any pair of ROIs from the perspective of human understanding.
To further motivate why ROI embedding is needed, we look at the advantages of using embeddings over raw information or semantic keyword-based search. Firstly, embeddings are computationally efficient, generic, and aggregate latent features that can easily be integrated into downstream tasks. Secondly, to comply with data retention policies and maintain security standards, it is essential to limit access to raw information and move toward a generic and lossy embedding. Thirdly, semantic ROI embeddings provide measurable techniques to characterize a region and can account for its change over time.
Application-wise, incorporating ROI embeddings as features for ad services can have a significant impact, as localized crowd engagement and activity in neighborhoods can promote economic growth. ROI features are also a step toward improving localized search results. Another far-reaching application of ROI features is vacation home rental recommendation based on a user's neighborhood preferences. Semantic embedding of ROIs also enables users to filter listings by neighborhood scores on categories such as Travel and Transport, Shops and Services, Arts and Entertainment, Schools, or Nightlife.
The main challenges of ROI semantic embedding, compared with POI semantic embedding, are as follows:
1. Geographic influence: Recent studies on POI embedding can effectively classify POI categories and use them as features for prediction and recommendation applications. However, evaluating the influence of POIs on their neighborhood region is challenging and has not yet been addressed in the literature.
2. Capturing social effect: Social responses from microblog sites are highly dynamic and often capture popularity information of places in a region. Discovering spatial features from social behavior is complex and involves significant challenges.
3. Semantic challenges: Leveraging the textual information associated with places and regions to obtain semantic features is a non-trivial task. We model a tripartite graph network embedding approach to learn ROI embeddings.
4. Data challenges: It is difficult to get a large and open dataset of POIs with textual information from location-based social networks. Currently available public datasets are either geographically sparse or not suitable for our problem statement. We resort to scraping and crawling to create appropriate datasets for our investigation.
We summarize the contributions of this paper.
-We formulate the region of interest (ROI) semantic embedding that simultaneously embeds into semantic space and spatial space.
The rest of the paper is organized as follows: Sect. 2 presents our problem formulation with baseline approaches inspired by state-of-the-art literature. In Sect. 3, we present our model TNE (tripartite network embedding), followed by experiments in Sect. 4 and related work in Sect. 5; Sect. 6 concludes the paper.

Preliminaries
This section introduces the problem formulation with the necessary definitions and notations used in the paper. After that, we formally present the problem statements on semantic ROI embedding and describe our information graph network. Next, we list a few baseline approaches to compare with our tripartite network embedding model, TNE.

Problem formulation
Assume we have three sets of data: points of interest (POIs), regions of interest (ROIs), and geotagged documents. We define each next.
A region of interest (ROI), denoted $r$, is an area on the map of Earth demarcated by a circle, rectangle, or polygon geometry, e.g., ROI 09000000 in Fig. 1. An ROI $r = (\mathit{id}, \mathit{geofeatures}, \mathit{name}, \mathit{properties})$ is a tuple of an identifier, polygonal geofeatures, a name, and optional properties such as state and country, respectively. ROIs are stored as GeoJSON [17]. We represent a set of ROIs as $R = \{r_1, r_2, \ldots, r_{|R|}\}$.
A point of interest (POI), denoted $p$, is defined as a specific point location on the map of Earth, e.g., Empire State Building, New York. A POI $p = (\mathit{id}, \mathit{coord}, \mathit{name}, \mathit{properties})$ is a tuple of an identifier, a latitude-longitude geo-coordinate, a name, and optional properties such as keywords, description, address, and category, respectively. It is also stored as a GeoJSON object. A set of POIs is represented as $P = \{p_1, p_2, \ldots, p_{|P|}\}$.
A geotagged document, denoted $d$, is a textual record associated with a geolocation either by origin or by reference, e.g., check-in comments, reviews, microblogs, etc. A geotagged document $d = (\mathit{id}, \mathit{text}, \mathit{coord}, \mathit{properties})$ is a tuple of an identifier, a text, a latitude-longitude geo-coordinate, and optional properties such as timestamp and user information. We are mainly interested in two types of geotagged documents: (a) microblogs and (b) social reviews. Microblog documents associated with an ROI $r$ are denoted as $D_r$, and social review documents associated with a POI $p$ are denoted as $D_p$. Documents are associated with POIs and ROIs based on geotagged locations. We define all geotagged documents as $D = \bigcup_{r \in R} D_r \cup \bigcup_{p \in P} D_p$.
We capture relations among multiple entities (i.e., POIs, ROIs, and Words) through the information graph described in Sect. 2.3. In Table 1, we summarize all the notations used in this paper.

Problem statements
Problem 1 (Semantic Embedding of ROI) Given a set of ROIs $R$, a set of POIs $P$, an associated set of geotagged documents $D$, and an embedding dimension $n$, the goal of semantic embedding of ROIs is to embed each ROI $r \in R$ as a vector $\mathbf{r} \in \mathbb{R}^n$, such that the cosine distance between $\mathbf{r}_i$ and $\mathbf{r}_j$ captures the similarity of $r_i$ and $r_j$ in both spatial and semantic aspects.
The objective of ROI semantic embedding is to capture geographic information and the crowd's semantic perspective on the region. If an ROI stands out in any semantic feature, such as recreational activities, offices and services, residential use, or any combination of activities, then the embedding must capture it. We introduce an application of ROI embedding as Problem 2, Semantic Category Annotation, for the evaluation of ROI embedding.
The aim of Problem 2 is to semantically annotate any ROI $r$ from the generated ROI semantic embedding $\mathbf{r}$. A word representation in semantic space is capable of capturing the meaning of the word, since its context words and synonyms lie in close proximity. Firstly, we propose a systematic approach to defining a semantic category, which adheres to the categories defined in Table 2. In our model, we describe a category $c$ with a set of words $\{w_1, w_2, \ldots, w_k\}$ that captures meaningful information about that category. For example, the category Travel and Transport is described with the words travel, trip, station, train, ferry, car, airport, pier, etc. We take a normalized average of these word vectors (each word is represented by a vector via a word embedding process, e.g., Word2Vec embedding) to obtain the vector for the semantic category, which we dub the semantic category vector $\mathbf{c}$. The cosine similarity score of a semantic category vector $\mathbf{c}$ with an ROI vector $\mathbf{r}$, i.e., the normalized dot product $\langle \mathbf{r}, \mathbf{c} \rangle$, determines the closeness of the ROI to the respective semantic category. The goal of this study is to find how well we can annotate an ROI with semantic categories $C = \{c_1, \ldots, c_9\}$ and whether the annotation adheres to real-world scenarios. An example of ROI semantic category annotation is given in the bottom corner of Fig. 1.
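The annotation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy three-dimensional vectors stand in for learned embeddings, "normalized average" is interpreted here as the unit-length mean of unit-length word vectors, and the helper names (`category_vector`, `annotate`) are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def category_vector(word_vectors):
    """Unit-length mean of the unit-length word vectors describing a category."""
    unit = [normalize(v) for v in word_vectors]
    dim = len(unit[0])
    mean = [sum(v[k] for v in unit) / len(unit) for k in range(dim)]
    return normalize(mean)

def cosine(a, b):
    """Cosine similarity, i.e., the normalized dot product of a and b."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def annotate(roi_vec, categories):
    """Rank category names by cosine similarity with the ROI vector."""
    scores = {name: cosine(roi_vec, c) for name, c in categories.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy vectors, purely illustrative (not learned embeddings).
categories = {
    "Travel and Transport": category_vector([[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]]),
    "Nightlife and Pubs": category_vector([[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]]),
}
ranking = annotate([0.8, 0.2, 0.1], categories)
```

Here `ranking` lists the categories from most to least similar to the toy ROI vector; with real embeddings, the top entries would play the role of the category annotation shown in Fig. 1.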

Information graph network
We define an information graph network $G = (G_{rp}, G_{rw}, G_{pw}, G_r, G_w)$, which is a combination of graphs over POI, ROI, and Word entities that captures spatial and semantic information, as illustrated in Fig. 2. Note that the vocabulary of semantic words $W = \{w : w \in d.\mathit{text}, d \in D\}$ is drawn from the geotagged documents.
In our model, the information graph $G$ is formed of multiple subgraphs of two types: heterogeneous (bipartite) subgraphs and homogeneous subgraphs. We define the three bipartite subgraphs and two homogeneous subgraphs as follows.
Definition 1 (ROI-POI Bipartite Graph: $G_{rp}$) An ROI-POI graph, denoted as $G_{rp} = (R \cup P, E_{rp})$, is a bipartite graph with edges $E_{rp}$. An edge $e = (r_i, p_j) \in E_{rp}$ exists iff $p_j$ is located within $r_i$, and the weight of the edge is $\omega(r_i, p_j) = 1$.
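To make the containment rule of the ROI-POI graph concrete, the following sketch builds unit-weight edges with a ray-casting point-in-polygon test. It assumes each ROI is a single planar polygon ring of (longitude, latitude) pairs without holes and ignores geodesic effects; the function names and the toy data are hypothetical and not part of our pipeline.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon ring [(lon, lat), ...]?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from the point cross the edge (x1, y1)-(x2, y2)?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def roi_poi_edges(rois, pois):
    """Return {(roi_id, poi_id): 1} for every POI located within an ROI."""
    return {
        (rid, pid): 1
        for rid, ring in rois.items()
        for pid, (lon, lat) in pois.items()
        if point_in_polygon(lon, lat, ring)
    }

# Toy data: one square ROI and two POIs, one inside and one outside.
rois = {"r1": [(0, 0), (2, 0), (2, 2), (0, 2)]}
pois = {"p1": (1.0, 1.0), "p2": (3.0, 1.0)}
edges = roi_poi_edges(rois, pois)
```

For large datasets, a spatial index would normally replace the nested loop; the quadratic scan is kept here only for readability.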
Definition 2 (ROI-Word Bipartite Graph: $G_{rw}$) An ROI-Word graph, denoted as $G_{rw} = (R \cup W, E_{rw})$, is a bipartite graph with edges $E_{rw}$. An edge $e = (r_i, w_j) \in E_{rw}$ exists iff $w_j$ is mentioned in any document $d_{r_i}$ associated with $r_i$, and the weight of edge $(r_i, w_j)$ is calculated with tf-idf scores.
Fig. 2 Information graph network $G$ with illustration of the ROI-POI bipartite graph $G_{rp}$, ROI-Word bipartite graph $G_{rw}$, POI-Word bipartite graph $G_{pw}$, ROI graph $G_r$, and Word graph $G_w$
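Definition 3 (POI-Word Bipartite Graph: $G_{pw}$) A POI-Word graph, denoted as $G_{pw} = (P \cup W, E_{pw})$, is a bipartite graph with edges $E_{pw}$. An edge $e = (p_i, w_j) \in E_{pw}$ exists iff $w_j$ is mentioned in any document $d_{p_i}$ associated with $p_i$, and the weight of edge $(p_i, w_j)$ is calculated with tf-idf scores.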
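The tf-idf edge weights of the ROI-Word and POI-Word graphs can be sketched as follows. The exact tf-idf variant is not fixed by the definitions, so this example assumes a relative term frequency and a smoothed inverse document frequency, treating all documents attached to one entity as a single bag of words; `tfidf_edges` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_edges(docs_by_entity):
    """Map {entity: [tokenized documents]} to {(entity, word): tf-idf weight}."""
    # Term frequency: word counts over all documents attached to an entity.
    tf = {e: Counter(w for d in docs for w in d) for e, docs in docs_by_entity.items()}
    n = len(tf)
    # Document frequency: number of entities whose documents mention the word.
    df = Counter(w for counts in tf.values() for w in counts)
    edges = {}
    for e, counts in tf.items():
        total = sum(counts.values())
        for w, c in counts.items():
            # Smoothed idf (an assumption), so words shared by all entities keep a positive weight.
            idf = math.log((1 + n) / (1 + df[w])) + 1
            edges[(e, w)] = (c / total) * idf
    return edges

# Toy review documents for two POIs.
docs = {
    "p1": [["ferry", "pier", "ferry"]],
    "p2": [["pub", "pier"]],
}
weights = tfidf_edges(docs)
```

An edge exists only for words that actually occur in an entity's documents, which matches the existence condition in the definitions; words that are frequent for one entity but rare overall receive the largest weights.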
Definition 4 (ROI Graph: $G_r$) An ROI graph, denoted as $G_r = (R, E_r)$, is a homogeneous graph network of ROIs where an edge $e \in E_r$ between two ROIs denotes that they spatially overlap or are neighboring regions.
Definition 5 (Word Graph: $G_w$) A word graph, denoted as $G_w = (W, E_w)$, is a homogeneous graph network of words where an edge $e \in E_w$ between two words signifies their co-occurrence in geotagged documents.

Baseline approaches
BiNE [16] is a method proposed for learning vertex representations in bipartite graphs. We treat this as another baseline for learning ROI embeddings from the bipartite graphs $G_{rp}$, $G_{pw}$, and $G_{rw}$. It will be interesting to see whether this baseline can capture the spatial affinity and semantic relations of ROIs. We expect BiNE to fail in capturing geospatial correlation, as the transitivity property is not incorporated in this approach. In the related work (i.e., Sect. 5), we explain the rationale for using BiNE as another state-of-the-art baseline model for comparisons.

TNE_wcr (TNE without Community Random Walk):
This version of our model TNE does not take advantage of our community-aware random walk strategy and uses the traditional random walk strategy. Including this baseline model in our experiments helps recognize the impact of incorporating the community-aware random walk in TNE.

TNE_nw (Non-weighted TNE):
This version of our model TNE does not use tf-idf weights over the $G_{pw}$ graph for measuring the popularity of POIs. This approach demonstrates the modeling advantage of these weights in comparison with Jenkins et al. [22], which does not use such weights, among other differences.

TNE: tripartite network embedding
In this section, we present our approach TNE, a tripartite graph network representation learning model that can be generalized to a multipartite network embedding model. The primary focus of TNE is learning the ROI embedding, i.e., Problem 1; Problem 2 is an application of the former. Our network embedding model TNE is (a) a microscopic structure-preserving network embedding, (b) a transitive property-preserving network embedding, and (c) a community-aware network embedding. We explain each of these features as we unravel our model step by step.

Direct relation models
A relationship among vertices that is directly visible from the edge set of the information network is known as a direct relation model. We classify direct relation models based on the types of vertices connected by the edges of the graph into (a) heterogeneous and (b) homogeneous relation models.

Direct heterogeneous relation model
The basic building block of any multipartite or tripartite network is the bipartite network, which represents relationships between two sets of non-similar entities or vertices. Considering our tripartite information network $G$, we have three bipartite networks $G_{rp}$, $G_{rw}$, and $G_{pw}$. A bipartite graph network is a heterogeneous vertex network (in our model) that represents direct or first-order relations, which we dub the direct heterogeneous relation model.
In any structure-preserving network embedding, it is desirable that the closeness between two well-connected vertices is high. Even if the connected vertices are different in nature (e.g., POI and Word in $G_{pw}$), their proximity in the network is direct relational information that must be absorbed into the embedding. For the sake of understanding, let us consider a bipartite network $G_{uv} = (U \cup V, E_{uv})$, where $U$ and $V$ are two sets of different types of vertices and $E_{uv} \subset U \times V$ is the edge set. Also consider the embedding representations of vertices $u_i$ and $v_j$ as $\mathbf{u}_i \in \mathbb{R}^n$ and $\mathbf{v}_j \in \mathbb{R}^n$, respectively. In our model, we consider the Euclidean embedding space, where we define the closeness measure between any two vertices $u_i$ and $v_j$ as the conditional probability $\Pr(v_j \mid u_i)$.
Existing literature, including the pioneering embedding work word2vec [28], shows the importance of using the inner product as a similarity measure and transforming it into probability space with the sigmoid function. The microscopic structure of network connections is captured with the conditional probability between vertices.
The objective of the model is to learn the embedding vectors by minimizing the difference between the pairwise distributions, where $D_{KL}$ is the KL divergence measure for the difference between probability distributions. The expression $-\Pr(v_j \mid u_i) \log \Pr(v_j \mid u_i)$ from Eq. 3 is the information entropy expression, which is modeled as an edge entropy function of $(u_i, v_j)$. From the final expression, we obtain all the variables of the optimization functions, i.e., the vectors $\mathbf{u}_i$, $\mathbf{v}_j$ from $\Pr(\cdot)$.
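As an illustration of this kind of objective, the sketch below scores vertex pairs with the sigmoid of the inner product and measures the KL divergence between the normalized edge weights of each vertex $u_i$ and a model distribution over $V$. It is not the exact form of Eq. 3: normalizing the sigmoid scores over all of $V$, the nested-dictionary input format, and the function names are assumptions made only for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def closeness(u, v):
    """Model closeness of two vertices: sigmoid of the inner product."""
    return sigmoid(sum(a * b for a, b in zip(u, v)))

def kl_objective(weights, emb_u, emb_v):
    """Sum over u of KL(empirical Pr(.|u) || model Pr(.|u)).

    weights: {u: {v: edge weight}}; emb_u, emb_v: {vertex: vector}.
    """
    loss = 0.0
    for u, nbrs in weights.items():
        w_sum = sum(nbrs.values())
        scores = {v: closeness(emb_u[u], vec) for v, vec in emb_v.items()}
        s_sum = sum(scores.values())
        for v, w in nbrs.items():
            p = w / w_sum            # empirical conditional probability
            q = scores[v] / s_sum    # model conditional probability (assumed normalization)
            loss += p * math.log(p / q)
    return loss

weights = {"u1": {"v1": 1.0}}
emb_u = {"u1": [2.0, 0.0]}
aligned = kl_objective(weights, emb_u, {"v1": [2.0, 0.0], "v2": [-2.0, 0.0]})
misaligned = kl_objective(weights, emb_u, {"v1": [-2.0, 0.0], "v2": [2.0, 0.0]})
```

In this toy case, the embedding that places the connected pair close together yields the smaller divergence, which is the behavior a structure-preserving objective is meant to reward.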
KL divergence is a particular case of a broader class of divergences called $f$-divergences. KL divergence is asymmetric and commonly used by embedding methods that preserve local structures and microstructures [21]. There are other types of divergences, such as the reverse KL (RKL) divergence, the Jensen-Shannon (JS) divergence, the Hellinger distance [19], and the $\chi^2$ distance. As the name suggests, optimizing with the reverse KL measure can capture the global or macro-network structures. JS divergence is symmetric in nature, and some research works suggest using the JS distance as a cost function in the empirical domain for optimization purposes [15,27]. The $\chi^2$ distance behaves similarly with respect to preserving local structure. Depending on whether the intention is to capture the micro-structure, the macro-structure, or to give equal importance to both, we can pick the right method.
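The asymmetry of KL and the symmetry of JS can be checked numerically. The following sketch uses two arbitrary discrete distributions chosen only for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q against their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two arbitrary distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
forward, reverse = kl(p, q), kl(q, p)   # KL and reverse KL generally differ
symmetric = js(p, q)                    # equals js(q, p)
```

Swapping the arguments changes the KL value but not the JS value, which is why the choice of divergence determines whether the fit emphasizes where the data has mass (forward KL) or where the model has mass (reverse KL).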
In our case, the same KL-divergence objective is applied to the three bipartite graphs $G_{rp}$, $G_{rw}$, and $G_{pw}$ of our tripartite network (Eq. 4).

Direct homogeneous relation model
In many information networks, a direct homogeneous graph is not available. For example, consider an information network built from Yahoo Answers or Quora. Users on these sites post questions, which then get answered by other users. There are direct relational graphs between users and questions, questions and answers, and answers and users, but there are no direct relations among users. There are, of course, information networks where direct homogeneous graphs are present. It is important to utilize the information from such graphs, because more information helps in learning better [10,41], as it reduces uncertainty in learning the weights within the model.
In our scenario, $G_r$ and $G_w$ are the two homogeneous graphs in $G$, i.e., their edges connect vertices of the same type. The edges in these graphs signify explicit proximity between the connected vertices. Even though these explicit relations are very informative, they are not sufficient for embedding because of their sparse nature. The embedding model can still be significantly enhanced by incorporating implicit information via indirect relation graphs, as discussed in Sect. 3.2, and then merging direct and indirect homogeneous graphs, as shown in Sect. 3.2.2.

Indirect relation models
In this section, we focus on modeling indirect and deducible relations that contribute meaningful information to the embedding. Recent work suggests that deducible information helps in improving semantic properties [16,23,51]. Heterogeneous networks consisting of bipartite graphs do not have explicit relations among vertices of the same type. To understand the importance of indirect relations, take the example of POIs in our data. The POI set $P$ does not have any explicit edges between any two POIs. However, there are POIs that are similar based on the reception they receive from people. A subset of words can form a topic and commonly describe similar POIs (which is true in real-world scenarios), and it is very likely that there will be a significant number of paths between similar POIs in the bipartite graph $G_{pw}$. Generating all the paths between all pairs of a large number of vertices is infeasible. To alleviate the issue, it is common practice to generate several random walks to mimic the representation of a corpus of vertices, with the intuition that important vertices get repeated according to their popularity.

Indirect homogeneous graphs
Random walks on bipartite graphs have periodicity issues [1]. The common strategy for addressing this problem is to construct two homogeneous graphs from a bipartite graph by utilizing the second-order proximity between vertices of the same type [11]. Accordingly, we construct $G^{v}_{u}$, a homogeneous graph with vertices $U$, by utilizing transitive relations with vertices $V$ from the bipartite graph $G_{uv}$. We define the second-order proximity between two vertices $u_i$ and $u_j$ by a weight over their transitive connections through $V$. Similarly, we construct the homogeneous graph $G^{u}_{v}$ with vertices $V$.
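One way to realize such a projection is sketched below. It assumes that the second-order weight between $u_i$ and $u_j$ is the sum, over shared neighbors $v_k$, of the products of the two edge weights, which is the same transitive weight that the negative sampling section uses for bipartite graphs; the function name `project` and the toy ROI-Word edges are hypothetical.

```python
from collections import defaultdict

def project(bipartite):
    """Project weighted bipartite edges {(u, v): w} onto the U side.

    The weight between u_i and u_j is the sum over shared neighbors v_k of
    w(u_i, v_k) * w(v_k, u_j). Self-loops are skipped.
    """
    by_v = defaultdict(dict)
    for (u, v), w in bipartite.items():
        by_v[v][u] = w
    projected = defaultdict(float)
    for nbrs in by_v.values():
        for ui, wi in nbrs.items():
            for uj, wj in nbrs.items():
                if ui != uj:
                    projected[(ui, uj)] += wi * wj
    return dict(projected)

# Toy ROI-Word edges: r1 and r2 share "ferry", r2 and r3 share "pub".
g_rw = {
    ("r1", "ferry"): 2.0, ("r2", "ferry"): 1.0,
    ("r2", "pub"): 3.0, ("r3", "pub"): 1.0,
}
g_r_via_w = project(g_rw)
```

The result is a symmetric, weighted ROI-ROI graph in which two ROIs are connected only if they share at least one word; applying the same function with the edge keys reversed would yield the Word-side projection.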

Merging homogeneous graphs
Our information network $G$ consists of three bipartite graphs $G_{rp}$, $G_{rw}$, and $G_{pw}$. We now generate the homogeneous graph $G^{p}_{r}$ on ROIs $R$ with indirect relations via POIs $P$ and the homogeneous graph $G^{w}_{r}$ on ROIs $R$ with indirect relations via Words $W$. Similarly, the homogeneous graphs $G^{r}_{p}$ and $G^{w}_{p}$ are obtained on POIs $P$ with indirect relations via ROIs $R$ and Words $W$, respectively. The homogeneous graphs $G^{r}_{w}$ and $G^{p}_{w}$ on Words $W$ are also generated with indirect relations via ROIs $R$ and POIs $P$, respectively. In total, six indirect homogeneous graphs are obtained from the three bipartite graphs.
The homogeneous graphs $G^{p}_{r}$ and $G^{w}_{r}$, both on ROIs $R$, provide implicit relations among their vertices. We use all the information from the direct and indirect homogeneous graphs by simply appending the edges from the graphs $G^{p}_{r}$, $G^{w}_{r}$, and $G_r$ to form a single graph $G_r$ for modeling random walks. However, it should first be determined whether these graphs are compatible and do not contrast with each other. Intuitively, incompatible graphs can differ sharply in their hub and authority vertices, which can lead to information dilution and loss of quality. In such cases, a wise decision is to merge only the most effective, i.e., the most compatible, set of homogeneous networks among the multiple homogeneous graphs; this decision lies with the data scientist. To measure the compatibility of graphs, we use the hub and authority matrices of both graphs. Close observation of the HITS algorithm [25] reveals that it is an iterative power method that computes the dominant eigenvectors of $M M^{\top}$ and $M^{\top} M$, where $M$ is the adjacency matrix of a graph. The hub matrix is $H = M M^{\top}$, and the authority matrix is $A = M^{\top} M$. Also, constant initialization of hub and authority scores enables us to perform power iteration on $H$ and $A$ and choose matrices from any iteration. Let $H^{p}_{r}$, $H^{w}_{r}$ and $A^{p}_{r}$, $A^{w}_{r}$ be the hub and authority score matrices of the two homogeneous graphs on ROIs $R$.
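Finding the similarity or distance between labeled graphs is an easy task, and we can leverage simple methods such as edit distances and matrix similarity, or more complex methods such as coupled vertex-edge scoring [53], MCES [37], etc. For our model, we use the Frobenius distance between the two hub matrices and between the two authority matrices; the graphs qualify for merging if the sum of these distances is less than some positive value $\varphi$, i.e., $\lVert H^{p}_{r} - H^{w}_{r} \rVert_F + \lVert A^{p}_{r} - A^{w}_{r} \rVert_F < \varphi$.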
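The compatibility test can be sketched as follows, assuming both graphs are given as adjacency matrices over the same ordered vertex set and using the unnormalized hub and authority matrices. The threshold values and the toy matrices are illustrative assumptions, not settings from our experiments.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius(a, b):
    """Frobenius distance ||A - B||_F between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def compatible(m1, m2, phi):
    """Merge test: hub distance plus authority distance must stay below phi."""
    h1, h2 = matmul(m1, transpose(m1)), matmul(m2, transpose(m2))   # H = M M^T
    a1, a2 = matmul(transpose(m1), m1), matmul(transpose(m2), m2)   # A = M^T M
    return frobenius(h1, h2) + frobenius(a1, a2) < phi

# Toy adjacency matrices over the same three ROIs.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # r1 - r2 - r3
ends = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]    # only r1 - r3
strict = compatible(path, ends, phi=1.0)
lenient = compatible(path, ends, phi=5.0)
```

For these two toy graphs the summed distance is $2\sqrt{6} \approx 4.9$, so they are rejected under the stricter threshold and accepted under the more lenient one; in practice, $\varphi$ would be chosen relative to the scale of the edge weights.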
Similar to the homogeneous graph $G_r$ of ROIs, we construct $G_p$ and $G_w$ by merging $(G^{r}_{p}, G^{w}_{p})$ and $(G^{r}_{w}, G^{p}_{w}, G_w)$, respectively.

Community-aware random walks
Homogeneous graphs constructed from bipartite networks are used to generate a corpus of random walks. DeepWalk [35] generates such random walks and utilizes them for learning embeddings. BiNE [16] addresses the issue that DeepWalk [35] does not capture the characteristics of real-world networks, because the distribution of vertices in the random walks does not match that of the graph network. One solution is to generate random walks based on the importance of vertices, measured with their hub and authority scores.
A community is defined as a subset of vertices within the graph such that connections between these vertices are denser than connections with the rest of the network [36]. If the number of connections, or the reachability between vertices within very few hops, is high, then the vertices must have a stronger bond. In real-world scenarios, we often have edges that act as bridges between communities or sub-communities. Often, sparsity and lack of information in the training data are responsible for the appearance of bridges within a community. Even if there is a moderate number of bridges, centrality-biased random walks will seldom traverse them. We propose a $\delta$-hop community-aware random walk, where a step in the random walk can mutate to a jump, with probability $\alpha$, within the $\delta$-hop connected community.
The motivation for a $\delta$-hop community is to include strongly connected bridges and avoid weakly connected community bridges. We use $M^3$, where $M$ is the adjacency matrix, because three is the least number of hops such that an internal node of a well-connected community can reach an internal node of another community via a bridge.
Hence, with a low $\delta = 3$ and a low step-jump mutation probability $\alpha = 0.1$, the jump likely remains within the community while alleviating the problem of moderately connected communities. Whereas other biased random walk models follow the "rich get richer" principle, our mutated step-jump acts as a welfare strategy in the algorithm.
Algorithm 1 summarizes the community-aware random walk that prepares the corpus $D_u$ from a homogeneous graph $G_u$. Statistics suggest that the mean length of sentences in English varies between 20 and 25 words and follows a normal distribution [52]. Sentences in technical writing are typically shorter. Taking inspiration from this, we use a normal distribution with mean $\mu = 15$ and standard deviation $\sigma = 10$ to generate the lengths of sequences in corpus $D_u$. How often a sequence starts with a given vertex depends on the vertex's popularity (centrality), but we also limit this to a maximum of 5 with the variable maxStart.
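A simplified version of such a walk is sketched below. It is not a reproduction of Algorithm 1: for brevity it picks the next vertex uniformly rather than by centrality, finds the $\delta$-hop community by breadth-first expansion instead of powers of the adjacency matrix, and omits the maxStart logic. The function names and the toy graph (two triangles joined by a bridge) are assumptions for illustration.

```python
import random

def delta_hop_neighbors(adj, start, delta):
    """Vertices reachable from `start` within `delta` hops (excluding `start`)."""
    frontier, seen = {start}, {start}
    for _ in range(delta):
        frontier = {n for v in frontier for n in adj[v]} - seen
        seen |= frontier
    return sorted(seen - {start})

def community_walk(adj, start, rng, alpha=0.1, delta=3, mu=15, sigma=10):
    """One walk: ordinary neighbor steps that mutate into a delta-hop jump with probability alpha."""
    length = max(2, round(rng.gauss(mu, sigma)))   # sequence length drawn from N(mu, sigma)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        if rng.random() < alpha:
            candidates = delta_hop_neighbors(adj, current, delta)   # jump
        else:
            candidates = sorted(adj[current])                       # regular step
        if not candidates:
            break
        walk.append(rng.choice(candidates))   # uniform choice (simplification)
    return walk

# Two triangles {a, b, c} and {d, e, f} joined by the bridge c - d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
walk = community_walk(adj, "a", random.Random(0))
```

Calling `community_walk` repeatedly from different start vertices would yield a corpus of vertex sequences analogous to $D_u$; replacing the uniform choice with a centrality-weighted one would bring the sketch closer to the biased walk described in the text.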

Corpus generation
Following the community-aware random walk on $G_r$, $G_p$, and $G_w$, we obtain the corpora $D_r$, $D_p$, and $D_w$, respectively, by using Algorithm 1.
For a sequence $S$ in corpus $D_r$, an ROI $r_i$ positioned at index $c$ in $S$ is represented as $r^{c}_{i}$. In a sequence $S$, a context of size $m$ around $c$ consists of the ROIs positioned from $c - m$ to $c + m$, i.e., $\{r^{c-m}, r^{c-m+1}, \ldots, r^{c}, r^{c+1}, \ldots, r^{c+m}\}$, where the index is in range $[1, |P|]$. We can now apply the skip-gram model on the corpora, similar to the technique used in Word2Vec [29] embedding, to optimize each embedding entity. To optimize the embeddings for ROIs $r$, POIs $p$, and Words $w$, we minimize the objective functions $O_r$, $O_p$, and $O_w$, respectively. Note that for each entity, as we create an embedding vector, we also need to assign a corresponding context vector for that entity. For ROIs, we optimize with function $O_r$, where $\boldsymbol{\phi}_c$ is the context vector for $r^{c}$.
Similarly, we optimize for POIs $p$ with function $O_p$, with a corresponding context vector for each $p^{c}$.
Finally, we optimize for Words $w$ with function $O_w$, where $\boldsymbol{\vartheta}_c$ is the context vector for $w^{c}$.
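The context window used by the skip-gram model can be illustrated with a small generator that emits (center, context) training pairs from one sequence of a corpus. The function name `context_pairs` and the toy sequence are hypothetical; the window is simply truncated at the sequence boundaries.

```python
def context_pairs(sequence, m):
    """Yield (center, context) pairs for a symmetric window of size m."""
    for c, center in enumerate(sequence):
        lo, hi = max(0, c - m), min(len(sequence), c + m + 1)
        for k in range(lo, hi):
            if k != c:
                yield center, sequence[k]

# One toy ROI sequence, as it might appear in a random-walk corpus.
pairs = list(context_pairs(["r1", "r2", "r3", "r4"], m=1))
```

Each pair contributes one positive term to the objective of the corresponding entity type, with the first element looked up in the embedding vectors and the second in the context vectors.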

Negative sampling
The conditional probabilities $\Pr(v_j \mid u_i)$ from Eqs. 3, 4 and $\Pr(u_j \mid u_i)$ from Eqs. 8, 9, 10 are computationally expensive, since they require summing over the entire set of vertices.
The state-of-the-art method to estimate them empirically is negative sampling (e.g., as specified in [29]), where the denominator is estimated by sampling random vertices. The numerator (defined by explicitly similar vertices) can be calculated directly.
In particular, negative sampling helps to learn a better embedding by selecting negative vertices that have a significant probability difference, yet are closely connected. Our negative sampling method uses a popularity-biased strategy, which not only helps in learning faster but also alleviates gradient vanishing issues [6]. We use the concept of transition probabilities in a random walk from one vertex to another; this strategy replicates the popularity/ranking-based system that we leverage for negative sampling [55]. In a random walk starting from vertex $u_i$ adjacent to vertex $u_j$, the probability of reaching $u_j$ from $u_i$ is defined as the ratio of the weight of the edge $(u_i, u_j)$ to the sum of the weights on all adjacent edges of vertex $u_i$.
$$T_{u_i,u_j} = \frac{M_{u_i,u_j}}{\sum_{u_k} M_{u_i,u_k}}$$
where $M_{u_i,u_j}$ is the weight of the edge between $u_i$ and $u_j$. Naturally, $T$ is a right stochastic matrix. We also make sure that self-loops, if they initially exist, are removed from the matrix. Based on the matrix $T$, we perform a $\delta$-hop random walk by power iteration $T^{\delta}$. For some dense graphs, the matrix can converge and reach a steady-state distribution in a few hops. For our purpose, we restrict $\delta$ to $\delta_{max} = 5$. The row for $u_i$ in $T^{\delta_{max}}$ acts as the noise distribution for selecting negative candidates for target vertex $u_i$. We define the $K$ negative samples for target $u_i$ as $N^{K}_{G_u}(u_i)$.
Following the negative sampling technique for homogeneous graphs, we extend this technique to incorporate bipartite graphs as well. Firstly, we assume that the transitive property holds for bipartite graphs in order to model hops between vertices of the same type, i.e., if $u_i$ is connected to $v_k$ and $v_k$ is connected to $u_j$, then we assume the existence of an edge between $u_i$ and $u_j$ in graph $G_{uv}$. The weight of this edge is $\omega(u_i, u_j) = \sum_{v_k \in V} \omega(u_i, v_k) \cdot \omega(v_k, u_j)$. After we have defined the edges and weights between connected vertices $u_i$ and $u_j$, it is easy to obtain $T$. Thereafter, the $\delta_{max}$-hop noise matrix $T^{\delta_{max}}$ is obtained to perform negative sampling where the seed and target vertices are of the same type; we dub this homogeneous negative sampling. For a seed vertex $u_i$ in a bipartite graph, the $K$ negative samples are denoted $N^{K}_{G_{uv}}(u_i)$.
For negative sampling on bipartite graphs where the seed vertex is of a different type from the target sample vertices, which we dub heterogeneous negative sampling, e.g., seed $u_i$ to target $v_l$, we apply the usual transition probabilities on the already obtained noise matrix $T^{\delta_{max}}$. When a vertex $u_i$ is connected to $u_j$ in $\delta_{max}$ hops, all the adjacent vertices of $u_j$, say $V = \{v_l \mid e_{u_j,v_l} \in E_{uv}\}$, are considered for heterogeneous negative sampling.
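The entry of the $(u_i, v_l)$ cell in the noise distribution matrix of bipartite graphs is calculated by combining the $\delta_{max}$-hop probability of reaching $u_j$ from $u_i$ with $\frac{M_{u_j,v_l}}{\sum_{v_m \in V} M_{u_j,v_m}}$, which is the transition probability from $u_j$ to $v_l$.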
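The homogeneous part of this procedure can be sketched as follows: row-normalize the adjacency matrix after removing self-loops, raise the result to the power $\delta_{max}$, and draw negative candidates from the row of the target vertex. Excluding the target itself from the draw, the function names, and the toy matrix are assumptions of this example rather than details fixed by the text.

```python
import random

def transition_matrix(m):
    """Row-normalize a weighted adjacency matrix after removing self-loops."""
    t = []
    for i, row in enumerate(m):
        cleaned = [0.0 if j == i else float(w) for j, w in enumerate(row)]
        total = sum(cleaned)
        t.append([w / total if total > 0 else 0.0 for w in cleaned])
    return t

def matmul(a, b):
    return [[sum(x * y for x, y in zip(r, c)) for c in zip(*b)] for r in a]

def noise_matrix(m, delta_max=5):
    """delta_max-hop transition probabilities, i.e., T raised to the power delta_max."""
    t = transition_matrix(m)
    power = t
    for _ in range(delta_max - 1):
        power = matmul(power, t)
    return power

def negative_samples(noise, i, k, rng):
    """Draw k negative candidates for target vertex i from row i of the noise matrix."""
    weights = [w if j != i else 0.0 for j, w in enumerate(noise[i])]
    return rng.choices(range(len(weights)), weights=weights, k=k)

# Toy weighted graph on four vertices (a triangle with one pendant vertex).
m = [[0, 2, 1, 0], [2, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
noise = noise_matrix(m)
samples = negative_samples(noise, 0, 5, random.Random(7))
```

Because each row of the transition matrix sums to one, every power of it is again right stochastic, so each row of `noise` can be used directly as a sampling distribution.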
For each edge $(u_i, v_j)$ in a graph with target vertex $u_i$ and $K$ negative samples, we approximate the conditional probability $\Pr(v_j \mid u_i)$ using the context vector $\varsigma_j$ of $v_j$. Similarly, $\Pr(u_j \mid u_i)$ is approximated using the context vector of $u_j$.
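The construction of the noise matrices described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the folding step assumes an undirected bipartite graph (so the weight from $v_k$ to $u_j$ equals the weight from $u_j$ to $v_k$), and the heterogeneous entry is assumed to be the $\delta_{max}$-hop probability of reaching $u_j$ multiplied by the transition probability from $u_j$ to $v_l$, summed over $u_j$.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def fold_bipartite(m_uv: Matrix) -> Matrix:
    """Transitive u-u weights: sum over v_k of M[u_i][v_k] * M[u_j][v_k].

    Assumption: the bipartite graph is undirected.
    """
    n_u, n_v = len(m_uv), len(m_uv[0])
    return [[sum(m_uv[i][k] * m_uv[j][k] for k in range(n_v))
             for j in range(n_u)] for i in range(n_u)]


def transition_matrix(weights: Matrix, drop_self_loops: bool = True) -> Matrix:
    """Row-normalize weights into a right stochastic matrix."""
    rows = []
    for i, row in enumerate(weights):
        kept = [0.0 if (drop_self_loops and i == j) else w
                for j, w in enumerate(row)]
        total = sum(kept)
        rows.append([w / total if total > 0 else 0.0 for w in kept])
    return rows


def delta_hop(t: Matrix, delta: int) -> Matrix:
    """Power iteration: T multiplied by itself delta times."""
    result = t
    for _ in range(delta - 1):
        result = matmul(result, t)
    return result


def heterogeneous_noise(t_delta: Matrix, m_uv: Matrix) -> Matrix:
    """Assumed form: entry (u_i, v_l) = sum_j T^delta[i][j] * P(u_j -> v_l)."""
    p_uv = transition_matrix(m_uv, drop_self_loops=False)
    return matmul(t_delta, p_uv)


# Toy bipartite graph: 3 vertices of type u, 2 vertices of type v.
m_uv = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
t = transition_matrix(fold_bipartite(m_uv))
homogeneous = delta_hop(t, 5)            # delta_max = 5 as in the text
heterogeneous = heterogeneous_noise(homogeneous, m_uv)
```

Each row of `homogeneous` and `heterogeneous` is a probability distribution over candidate vertices; how negatives are finally drawn from a row is left out here because the text does not specify it.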

Optimization and model update
The intuitive solution for optimization is to minimize the sum of all objective functions. A more complex multiobjective optimization could be applied; however, choosing a multiobjective optimization in an embedding scenario requires more study and could be presented as a separate research work on its own. We therefore use a non-weighted linear combination of the optimization expressions from Eqs. 4, 8, 9, and 10 to form a single global objective.
We present our tripartite joint optimization in Algorithm 2. In the preparation phase, community-aware random walks generate the corpora $D_r$, $D_p$, $D_w$, and the negative sampling module prepares the noise distribution matrices. In the joint embedding training phase, edges are sampled from each graph simultaneously, and the embedding vectors are updated along with the context vectors using stochastic gradient descent.
The complexity of the training depends on the density/sparsity of the graph network. To avoid expensive computation of centrality and the $\delta$-hop adjacency matrix, we perform walks on the graph based on degree centrality. The context size for a vertex is $b \cdot m$, where $b$ is the batch size, much less than the maximum degree of the vertex, and $m$ is the context defined in Sect. 3.2.4. Overall, the computational complexity of our algorithm is $O(|E_{rp} + E_{rw} + E_{pw}| \cdot b \cdot m \cdot (ns + 1))$, where $ns$ is the number of negative samples.
TNE supports incremental updates as we collect new datasets from social networks and create a new information graph or update the old information graph. In this case, the embeddings previously generated from TNE should be used instead of random initialization of the embedding vectors. Hyperparameters, such as the learning rate, should be tuned based on the age of the previously trained dataset and the volume of the new dataset. With the increasing volume of new data and the aging of the previous dataset, the learning rate can gradually increase for optimal performance.
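To make the joint training phase concrete, the sketch below samples one edge from each graph per step and updates the embedding and context vectors by stochastic gradient descent. It is an assumption-laden illustration: it uses the standard skip-gram negative-sampling update, which may differ from the exact objectives in Eqs. 4, 8, 9, and 10, and all names (`sgd_edge_update`, `train_jointly`, the `negatives` lookup) are hypothetical. Passing previously trained vectors in `emb` and `ctx` corresponds to the warm start used for incremental updates.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[str, str, float]  # (source vertex, target vertex, weight)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score(emb: Dict[str, Vector], ctx: Dict[str, Vector], s: str, t: str) -> float:
    """Estimated probability that t is a true neighbour of s."""
    return sigmoid(sum(a * b for a, b in zip(emb[s], ctx[t])))


def sgd_edge_update(emb: Dict[str, Vector], ctx: Dict[str, Vector],
                    source: str, target: str, negatives: List[str],
                    weight: float = 1.0, lr: float = 0.05) -> None:
    """One SGD step (assumed skip-gram negative-sampling form) on one edge."""
    v = emb[source]
    grad_v = [0.0] * len(v)
    for node, label in [(target, 1.0)] + [(n, 0.0) for n in negatives]:
        c = ctx[node]
        g = weight * lr * (label - sigmoid(sum(a * b for a, b in zip(v, c))))
        for d in range(len(v)):
            grad_v[d] += g * c[d]   # accumulate gradient for the embedding
            c[d] += g * v[d]        # update the context vector in place
    for d in range(len(v)):
        v[d] += grad_v[d]


def train_jointly(graphs: Dict[str, List[Edge]], emb: Dict[str, Vector],
                  ctx: Dict[str, Vector],
                  negatives: Dict[Tuple[str, str], List[str]],
                  steps: int = 50, lr: float = 0.05, seed: int = 0) -> None:
    """Sample one edge from every graph at each step and update jointly."""
    rng = random.Random(seed)
    for _ in range(steps):
        for name, edges in graphs.items():
            source, target, weight = rng.choice(edges)
            sgd_edge_update(emb, ctx, source, target,
                            negatives.get((name, source), []), weight, lr)


# Toy example: one ROI, two POIs, two words. Values are illustrative only.
emb = {"r1": [0.1, -0.2]}
ctx = {"p1": [0.3, 0.1], "p2": [-0.1, 0.2], "w1": [0.2, 0.2], "w2": [0.1, -0.1]}
graphs = {"rp": [("r1", "p1", 1.0)], "rw": [("r1", "w1", 1.0)]}
negatives = {("rp", "r1"): ["p2"], ("rw", "r1"): ["w2"]}
train_jointly(graphs, emb, ctx, negatives)
```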

Experiments
In this section, we first describe our real-world dataset based on New York City (NYC) used in our experiments. We then present five experiments that exhibit the multifaceted effectiveness of ROI embedding with TNE on spatial correlation, semantic association and predictive capabilities. A summary of the experiments is as follows:
1. TNE validation with POI classification: This experiment evaluates the POI embedding from our model and the baselines.
2. Geospatial affinity of ROIs: We evaluate the ROI embedding based on the geospatial affinity among ROIs.
3. Semantic category annotation of ROIs: We perform a ranking evaluation task with category annotation from ROI embeddings and crowdsourced ground truth results. We use the normalized discounted cumulative gain (NDCG) [44] metric as the measure of performance.
4. Semantic category difference from ROI embedding: This experiment is similar to the previous one, with the distinction that we evaluate the semantic difference between a pair of ROIs from their embeddings.
5. Popularity prediction of regions: We introduce a region popularity prediction experiment with the simplest of regression models to demonstrate that ROI embeddings from TNE_nw and TNE can capture features better than the extended baselines, along with temporal features. The aim is not to overcomplicate the experiment with complex models aiming at the lowest error, but to show perceptible differences even with simple feature-based models.

Dataset
The dataset imitates the information graph G we presented in Fig. 2. Also, as described in Sect. 2, our real-world dataset consists of three entities: (a) POI, (b) ROI, (c) Word. We will release the anonymized, processed version of the dataset, adhering to the copyright of the sources, to support further research in this field.

POI-Word Data.
We used the check-in dataset from [49] and the NYC government site [32] to collect the POI dataset. Our dataset comprises 38,008 POIs. Each POI is associated with a geolocation, name, category, description and comments. The words from name, description and all available comments

ROI-POI Data.
Figure 3 shows some geographical divisions of NYC, such as boroughs, city councils, election districts, fire battalions, police precincts and health districts. Each geographical division consists of several non-overlapping ROIs, and each of them is treated as a separate and unique ROI in our dataset. Overall, we have 12 different geographical divisions/districts, as stated in Table 3 along with the number of ROIs from each division. The total number of ROIs in our dataset is 456. A POI is associated with an ROI iff the geolocation of the POI is within the polygonal boundary of the multi-polygonal spatial feature. Note that for a non-overlapping set of ROIs, POIs form a many-to-one onto relation with ROIs. However, introducing overlapping ROIs makes the information graph G more interesting, because POIs shared among two or more overlapping ROIs increase the complexity of the graph. Every edge in the ROI-POI graph is assigned a weight of 1.0.
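The containment rule for ROI-POI edges can be illustrated with a standard ray-casting point-in-polygon test. This is only a sketch under simplifying assumptions: coordinates are treated as planar, polygon holes are ignored, and the function names and toy coordinates are hypothetical rather than taken from the dataset.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]    # (x, y), e.g. (longitude, latitude)
Polygon = List[Point]          # ring of vertices, closing edge implied
MultiPolygon = List[Polygon]   # an ROI may consist of several parts


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def roi_poi_edges(pois: Dict[str, Point],
                  rois: Dict[str, MultiPolygon]) -> List[Tuple[str, str, float]]:
    """Create an ROI-POI edge of weight 1.0 for every POI inside an ROI."""
    edges = []
    for roi_id, parts in rois.items():
        for poi_id, location in pois.items():
            if any(point_in_polygon(location, part) for part in parts):
                edges.append((roi_id, poi_id, 1.0))
    return edges


# Two overlapping square ROIs and three POIs (illustrative values only).
rois = {
    "roi_a": [[(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]],
    "roi_b": [[(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]],
}
pois = {"poi_1": (0.5, 0.5), "poi_2": (1.5, 1.5), "poi_3": (5.0, 5.0)}
edges = roi_poi_edges(pois, rois)
```

In this toy example, `poi_2` lies in the overlap and is therefore linked to both ROIs, which is the shared-POI situation that arises once overlapping ROIs are introduced.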

ROI-Word Data.
The relationship between ROI and Word is obtained from the geotagged tweets collected over a period of time. Similar to the technique used with the POI-Word pair, the weight of an edge between an ROI and a Word is determined from the TF-IDF score. First, we used the 1% sample tweet stream from Twitter to collect our geotagged documents for one month and prepare a corpus of documents (each document associated with an ROI). Analyzing the Twitter stream and performing TF-IDF on the corpus revealed that one month of the 1% sample was not sufficient. We compare the TF-IDF scores across ROIs (Fig. 4) for 1 month and 6 months of data in Table 4. It is also notable that the TF-IDF scores for 1 month do not suggest good spatial correlation. In contrast, the result with 6 months of data shows a significant correlation between the TF-IDF score and the true ROI location for the Empire State Building, i.e., ROI 07. Furthermore, the TF-IDF scores for the same are in accordance with the neighboring ROIs, showing strong geospatial correlation.
Another interesting trend can be seen with the Brooklyn Bridge, whose true spatial locations are ROIs 01 and 53, highlighted in bold in Table 4. For 1 month of data, though the top TF-IDF scores are in accordance with the ground truth ROI location, the scores are very close to those of ROIs that are not near the Brooklyn Bridge (i.e., ROI 53: 0.041; ROI 39: 0.035), whereas with 6 months of data there is a clear disparity between the TF-IDF scores of the ground truth ROIs (53 and 01) and those of other ROIs (00, 03 and 04). These examples explain and support our decision to use 6 months of geotagged tweets.
Region Popularity Data. Region popularity data are collected from the New York check-in dataset [49], which contains 227,428 check-ins from Foursquare for the period from April 2012 to February 2013. We score the popularity of a region by its number of check-ins.
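For the ROI-Word edge weights described earlier in this subsection, the following sketch treats all tweets of an ROI as one document and scores each word with TF-IDF. The exact TF-IDF variant is not stated in the text, so the relative term frequency and the unsmoothed logarithmic inverse document frequency used here are assumptions, and the token lists are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def roi_word_tfidf(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """TF-IDF score of every word in every ROI document.

    tf  = count of the word in the ROI document / document length
    idf = log(number of ROI documents / documents containing the word)
    """
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores: Dict[str, Dict[str, float]] = {}
    for roi_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[roi_id] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical token lists, one aggregated document per ROI.
docs = {
    "roi_07": ["empire", "state", "building", "empire"],
    "roi_01": ["brooklyn", "bridge", "building"],
    "roi_53": ["brooklyn", "bridge"],
}
weights = roi_word_tfidf(docs)
```

A word that is frequent in one ROI but rare elsewhere (here `empire` in `roi_07`) receives a higher weight than a word spread over several ROIs (`building`), which is the spatial selectivity the 1-month versus 6-month comparison relies on.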

TNE validation with POI classification
This experiment evaluates the POI embedding from our model and the baselines. The aim is to validate that our model learns POI embeddings as consistently as other state-of-the-art work. In this experiment, we expect all the methods to perform equally well.
Our POI dataset has a ground truth category for each POI, collected from the data source. Table 2 presents all nine top-level categories for our POI dataset. First, we present (i) k-nearest neighbor classification to evaluate the POI embedding of all the models. Then, we use (ii) t-SNE visualization to examine the macro- and micro-structure of the embeddings.
To boost our learning process, we initialized word embeddings with pre-trained GloVe [34] embeddings. We used the GloVe vectors of the words from the description of each POI for POI embedding initialization. However, the ROI embeddings are always initialized with random vectors. Our justification for this initialization is to utilize the full resources and information at hand, rather than spending more iterations on learning from random initialization.

k-nearest neighbor classification
We trained our k-nearest neighbor (k-NN) classifier on 70% of the embeddings and evaluated it on the rest. The embedding dimension was kept at 100, and k stands for the number of nearest neighbors considered for k-NN classification. From the results presented in Table 5, we see that GE_poi, TNE_nw, and TNE performed similarly, with 96% accuracy in determining the top category, whereas BiNE achieves more than 95% for k-NN with k ≥ 3. This verifies that TNE achieves performance comparable to the state-of-the-art GE_poi. We have not included CrossMap results in Table 5 because CrossMap does not produce POI embeddings.
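The evaluation protocol above can be sketched as follows: split the labeled embeddings 70/30, then classify each held-out embedding by majority vote among its k nearest training embeddings. The Euclidean distance, the shuffling seed, and the function names are assumptions for illustration; the text does not specify them.

```python
import math
import random
from collections import Counter
from typing import List, Sequence, Tuple

Labeled = Tuple[Sequence[float], str]  # (embedding vector, top category)


def split_train_test(items: List[Labeled], train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List[Labeled], List[Labeled]]:
    """Shuffle reproducibly and split into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train: List[Labeled], query: Sequence[float], k: int) -> str:
    """Majority label among the k nearest training embeddings."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train: List[Labeled], test: List[Labeled], k: int) -> float:
    """Fraction of test embeddings whose predicted category is correct."""
    correct = sum(1 for vec, label in test if knn_predict(train, vec, k) == label)
    return correct / len(test)


# Toy 2-dimensional embeddings (real embeddings have 100 dimensions).
train = [([0.0, 0.0], "Food"), ([0.1, 0.0], "Food"), ([0.0, 0.1], "Food"),
         ([1.0, 1.0], "Travel"), ([1.1, 1.0], "Travel")]
test = [([0.05, 0.05], "Food"), ([0.9, 1.0], "Travel")]
accuracy = knn_accuracy(train, test, k=3)
```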

t-SNE visualization
To reveal the subtleties of the POI embedding and explore its macro- and micro-structure, we perform t-SNE on the high-dimensional POI embedding. We color each POI in accordance with the top category mentioned in Table 2. Figure 5 shows how the POI embedding changes between training iterations 10 and 40 for TNE. Figure 5a shows that points of different categories lie close together and in places overlap with one another. The overlaps and the distance between clusters of dissimilar categories improve with more iterations, as shown in Fig. 5b. We also present the t-SNE of GE_poi in Fig. 5c.
Though our k-NN classification and t-SNE yield good performance for the top-level category, or macro-structure, our experiment did not perform as well with subcategories. In Fig. 6 we present a microscopic analysis of the embeddings with t-SNE visualization based on POI subcategories. For this experiment, we took all the POIs with the top category Travel and Transport and performed t-SNE on them. The colors of the POIs in Fig. 6 are based on the subcategories. Here, we provide the list of the subcategories for Travel and Transport, ordered by the color number in the t-SNE visualization: 0. Airport; 1
It is clear from Fig. 6 that the POI embeddings of subcategories overlap for both GE_poi and TNE. The close association among POIs under the same top-level category might explain this phenomenon in the semantic space. However, it might be worth looking into the features of such intra-categorical POIs in future work.

Geospatial affinity of ROIs
In this section, we evaluate the ROI embedding based on the geospatial affinity among ROIs. The intended scenario is to obtain similar embeddings for ROIs having geospatial affinity, i.e., (a) overlapping region, and (b) neighboring region.
We randomly selected 200 ROIs and compared the 4 nearest neighbors of each ROI in our embedding against crowdsourced ground truth. Human judgment is used to determine whether the nearest-neighbor ROIs predicted from the embedding have any geospatial affinity with the queried ROI. We built a website with a geographical map to facilitate this crowdsourcing process. Ideally, we would want more ROIs with 3-4 geospatially overlapped neighbors in the k-NN result from the embedding space with k = 4. It is worth mentioning that our dataset has 12 different geographical divisions, which means each ROI has many (at least 10) geospatially overlapped ROIs. In Fig. 7d, we show the number of ROI neighbors that have geospatial affinity for TNE. The last histogram bar, in black, shows that out of 200 ROIs, more than 80 have 4 neighbors with a geographically overlapping region or neighboring boundary for TNE. We performed a similar analysis on GE_poi; the share of ROIs with 4 such neighbors is comparatively low (only 10%), as shown in Fig. 7a, compared to 40% with our model in Fig. 7d. The results for CrossMap and BiNE are far worse, with almost 50% and 55% of the ROIs having zero geospatially overlapped neighbors, respectively, as shown in Fig. 7b, c. From this result, we deduce that our embedding preserves geospatial affinity, which the baseline approaches do not.

Figure 8a shows the nearest neighbors of ROI 06000000 from the embedding (05000000, 03000068, 09000009, 02000005). Similarly, Fig. 8b shows the nearest neighbors of ROI 07000004 as (10000027, 12000010, 02000025, 10000004). The interesting observation in Fig. 8c, for the nearest neighbors of ROI 11000043 in Staten Island, is that the embedding finds ROI 05000045, located in Brooklyn, to be similar. A more detailed look at both ROIs reveals that they are similarly popular for Arts and Entertainment POIs and Outdoor activities, as obtained from the cosine similarities of the embeddings.
Table 6 presents the similarity scores of the above-mentioned ROIs for some semantic categories. We discuss the technical method for obtaining them in Sect. 4.4.
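The nearest-neighbor analysis of this section can be sketched as follows: rank the other ROIs by cosine similarity of their embeddings, keep the top k, and count how many of them have geospatial affinity with the query ROI. In the experiment the affinity judgment comes from human raters; here it is replaced by a hypothetical lookup table `affine`, and all names and vectors are illustrative.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def nearest_rois(embeddings: Dict[str, Vector], query_id: str, k: int = 4) -> List[str]:
    """The k ROIs whose embeddings are most similar to the query ROI."""
    query = embeddings[query_id]
    ranked = sorted(
        ((cosine(query, vec), roi_id)
         for roi_id, vec in embeddings.items() if roi_id != query_id),
        reverse=True,
    )
    return [roi_id for _, roi_id in ranked[:k]]


def affinity_histogram(embeddings: Dict[str, Vector],
                       affine: Dict[str, Set[str]], k: int = 4) -> List[int]:
    """hist[n] = number of ROIs with exactly n affine ROIs among their k-NN."""
    hist = [0] * (k + 1)
    for roi_id in embeddings:
        hits = sum(1 for n in nearest_rois(embeddings, roi_id, k)
                   if n in affine.get(roi_id, set()))
        hist[hits] += 1
    return hist


# Toy embeddings and a hypothetical affinity ground truth.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
affine = {"a": {"b"}, "b": {"a"}, "c": {"a"}, "d": set()}
hist = affinity_histogram(embeddings, affine, k=1)
```

With k = 4, the last entry of the histogram corresponds to the rightmost bar discussed for Fig. 7.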

Semantic category annotation of ROIs
In this section, we present the analysis of ROI embedding on semantic category annotation. First, we show an example of semantic annotation in Table 7 for ROI 09000056. The geospatial location of ROI 09000056 (Greenpoint, Brooklyn, NYC) is presented in the map along with Table 4 as ROI 56. The rank of categories in Table 4 suggests that Greenpoint has a considerable number of shops and services locations, recreation parks and residential complexes. To verify our prediction, we compared the ranking with the judgments of human raters, who used Foursquare [14], the NYC government site [9], ArcGIS [2] maps, Twitter [42] and Wikipedia [46] for ground truth information. Crowdsourced ground-truth semantic categories of ROIs are ranked into three levels: (1) low relevance, (2) moderate relevance, (3) high relevance. Crowdsourced information for ROI 09000056 suggests that there are many good shops, McCarren Park for outdoor activities, and residential complexes. This information aligns with the top 3 categories of the semantic category annotation: (a) Shops and Services, (b) Outdoors and Recreation, (c) Residential.
For a comprehensive analysis, we crowdsourced ground-truth categories from human raters for 100 random ROIs with category levels 3, 2, and 1. We compare the ground truth against the semantic category annotations obtained from the embedding, which we cast as a ranking problem. In the ideal case, all categories with level 3 should rank higher than level 2, followed by level 1 categories at the bottom. We used normalized discounted cumulative gain (NDCG) [44] to measure the quality of the embedding via the ranking order. Table 8 shows NDCG scores at the top-k ranking positions; a higher score signifies a better ranking order achieved by the model. The results in Table 8, with the best NDCG scores highlighted in bold, suggest that TNE beats all baselines (GE_poi, CrossMap, BiNE, TNE_wcr, and TNE_nw) by a considerable margin. This is an important result in our experiments, as it gives insight into how ROI embeddings can capture the semantic perspective that society holds about a region. From Table 8, we see that TNE outperformed GE_poi, CrossMap and BiNE by 0.235, 0.215, and 0.194 NDCG at rank 1, respectively, which is a considerable improvement in selecting the best category candidate for an ROI. The results are similar at the other ranking positions. An average NDCG gain of more than 20% over the state-of-the-art baselines (i.e., GE_poi, CrossMap, and BiNE) is a large gain for a ranking problem and shows the efficacy of TNE. Note also that TNE_nw and TNE_wcr performed better than the other baselines but were beaten by TNE by an average score of 0.1 (or 12%). This shows the necessity of using edge weights in G and the community-aware random walk in our strategy.
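As a reference for how the NDCG scores are read, the sketch below computes NDCG at rank k from the crowdsourced relevance levels (3, 2, 1) listed in the order in which a model ranks the categories. It assumes the linear-gain form of DCG with a base-2 logarithmic discount; the text does not state which NDCG variant was used, and the function names are hypothetical.

```python
import math
from typing import List


def dcg_at_k(relevances: List[int], k: int) -> float:
    """Discounted cumulative gain of the first k items (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[int], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance levels of categories in the order predicted by a model.
perfect = ndcg_at_k([3, 2, 1], k=3)      # ideal order
top_wrong = ndcg_at_k([1, 3, 2], k=1)    # a level-1 category ranked first
```

At rank 1, placing a level-1 category ahead of a level-3 category yields one third of the ideal gain, which is why differences at rank 1 are a sensitive indicator of whether the best category candidate was selected.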

Semantic category difference from ROI Embeddings
In this section, we briefly demonstrate the capability of ROI embedding to find semantic differences between ROIs. We demonstrate the semantic category difference of 3 pairs of overlapped ROIs from lower east, west and midtown Manhattan, as shown in Fig. 9. We ranked the top three semantic category differences for each pair with the formulation mentioned before. The result is presented in the table within Fig. 9, and close observation reveals discernible facts. The major semantic category differences between the pair of ROIs (09000004,08000024) from lower east Manhattan, shown in Fig. 9a, are Arts and Entertainment and Residence; this is because ROI 09000004 has popular music and theater performance centers and a large residential community known as East Village, whereas the lower ROI 08000024, shown in orange, has many restaurants. Similarly, the pair of ROIs (03000037,09000005) from west Manhattan, shown in Fig. 9b, differs mainly in Travel and Transport and College and Education, since ROI 03000037 contains the transit hub of Manhattan (Port Authority) and universities such as The City University of New York and State University of New York, and similar places do not feature in ROI 09000005. Lastly, the midtown Manhattan pair of ROIs (09000012,08000026), shown in Fig. 9c, does not show Recreation as a major category difference as ROI 08000026

We performed an in-depth study of the semantic category difference annotation with NDCG analysis, similar to the analysis in Sect. 4.4. We chose 30 pairs of ROIs, and human raters annotated all categories on each pair of ROIs in three levels based on their differences: (1) non-significantly, (2) moderately, (3) critically different. In the ideal case, the analysis from the embedding should rank categories in the order of critically (3), moderately (2), and non-significantly (1) different. Table 9 shows the performance of each model on the NDCG analysis. We again found TNE to perform better than the baselines on NDCG scores, highlighted in bold in Table 9.

Region popularity prediction
To evaluate the effectiveness of ROI embedding in a real-world application, we performed the region popularity prediction experiment. We used an openly available check-in dataset of New York City [49] for the prediction task. The only feature used for prediction is the ROI embedding obtained from the baselines and the TNE models. We used two regression models, (a) random forest and (b) XGBoost, to predict the number of check-ins in a region. Table 10 shows the mean absolute error (MAE) and root-mean-squared error (RMSE) for both regression models, with the best results highlighted in bold. TNE performed well in comparison with the baselines in all cases except XGBoost-MAE, where TNE_nw performed best. However, the RMSE for TNE_nw is very high for both regressions. We also performed a temporal (day, night) region popularity experiment, shown in Table 11. TNE_day and TNE_night are TNE models trained with the $G_{rw}$ graph generated from geotagged tweets obtained during days and nights, respectively; they achieved the best MAEs for the corresponding time period, as highlighted in bold in Table 11.

Summary: Each experiment investigates a qualitative aspect of the embedding procedures. TNE provides a high-quality semantic embedding, as shown by the semantic category annotation experiments. The spatial affinity experiment shows that TNE preserves strong geospatial relations. Region popularity prediction with embedding features demonstrates the expressiveness of the features from the models. Taken together, these experiments support the quality of the ROI representation learned with TNE.
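The two error metrics reported in Tables 10 and 11 are defined below. The numbers in the example are made up and only illustrate why a model can have the lowest MAE and still a high RMSE, as observed for TNE_nw: RMSE penalizes a few large errors much more heavily than MAE does.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical check-in counts for three regions and two predictors.
check_ins = [100.0, 200.0, 300.0]
spread_errors = [110.0, 190.0, 330.0]   # moderate error on every region
one_outlier = [100.0, 200.0, 345.0]     # exact twice, one large miss

mae_a, rmse_a = mae(check_ins, spread_errors), rmse(check_ins, spread_errors)
mae_b, rmse_b = mae(check_ins, one_outlier), rmse(check_ins, one_outlier)
```

Here the second predictor has the lower MAE but the higher RMSE, because its single large miss dominates the squared error.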

Related works
To the best of our knowledge, only one very recent work, by Jenkins et al. [22], builds an ROI embedding jointly with rich auxiliary information (in their case POI, satellite images, and taxi flow data). While our approach also uses POI data, it makes the distinction, which we consider very important, of weighting these by popularity, and it equally incorporates semantic information from microblog text. This allows for a different and (in our view) richer set of applications, including temporal variation using the timing of microblog updates. The embedding methods are also different: while their approach uses a single auto-encoder from a convolutional network, we show how to build a tripartite network that can ensure the three components (ROIs, POIs, and semantic text) are weighted equally. Although their work is only in press and their data are private, we still attempt to compare against this method by considering similar baselines, notably including the method TNE_nw, which like Jenkins et al. [22] does not include popularity weights on POIs in the $G_{pw}$ graph.

POI Embedding. An extended literature survey suggests that research on places of interest (POI) embedding is the closest to our work. However, there are major differences between our ROI embedding and the works on POI embedding [18,39,43,47,48,50,54]. Firstly, our work treats ROIs as considerably bigger regions encircling many POIs, and simply aggregating POI embedding vectors to generate an ROI embedding does not yield the desired result, as we show in our comparative experiments. Secondly, relevant POI embedding works focus on the POI sequence recommendation task for users based on check-in activity [18,43,48,50], whereas our task of ROI embedding focuses on preserving spatial and semantic relations without involving users in the scenario. That makes our problem statement different from the others. Thirdly, the POI embedding work by Xie et al. [47] modeled a bipartite graph network embedding for learning POIs, which also includes a POI-Region bipartite graph. Though the concept of region is unclear from their paper, we assumed our definition of ROI for a comparative analysis. The major difference in our work is that we capture the social behavior within a region and also the transitive/implicit relationships in bipartite graphs. Since the POI recommendation task is extraneous to our problem statement, we cannot directly compare their task/experiment with ours. Fourthly, the work of Zhang et al. [54] aims to find correlations among hotspot locations (defined as a spatial Gaussian kernel window), words and time to search for spatio-temporal events. We find our work dissimilar from [54], as hotspot locations are very different from our geographically bounded polygonal ROIs or POIs. These two kinds of spatial entity play significantly different roles in our model.

Semantic-Visual Embedding. The idea of cross-modal embedding in one-shot supervised learning has recently garnered researchers' attention. From a bird's-eye view, our objective moderately matches semantic-visual embedding on images, where the problem is the assignment of semantic labels to sub-regions of an image [13,38]. Our semantic embedding of ROIs also uses multimodal features to find the uniqueness of a spatial region. However, there are distinctions between the two fields of work. The novelty of our work lies in the application of semantic features to real-world geospatial regions of interest (ROIs) from the perspective of social engagement, and in solving the specific problems related to it. Additionally, the former focuses on feature-based spatial search on images, whereas our work concentrates on relation-based semantic learning on graph networks. In that respect, our work is entirely original in the geospatial domain.
Graph Network Embedding. Broadly, our work is related to network embedding research. The commonly used methods for network embedding are matrix factorization, random walks, and deep neural networks. Our model is based on random walks, for which DeepWalk [35] is the pioneering work. We advance the field with structure-preserving tripartite or multipartite network embedding, following in the footsteps of groundbreaking contributions such as LINE, HINE, Metapath2vec++, PME, and BiNE [5,7,12,16,40]. The first use of three entities in graph network embedding in alignment with a random-walk strategy is from Pan et al. [33]. However, it is not a true tripartite graph network, but rather an attributed heterogeneous embedding approach involving text associated with an entity by incorporating contextual word embedding. Another, more closely related work on tripartite embedding is HGP from Kim et al. [24], which involves group→user→item and does not consider the group-item relationship, so it is not a complete tripartite network. HGP propagates relations for each edge type independently, and their approach concentrates on attention mechanisms for large-scale adaptation. Overall, the main aim of HGP [24] is to tackle the oversmoothing problem in heterogeneous graphs on a large scale, which is very different from our objective of incorporating implicit and explicit relationships in representation learning.
More recent work from Hong et al. [20] takes attributed network embedding in a different direction: each vertex in the graph network has a fixed set of features used to evaluate similarity. While these works [20,24,33] mainly concentrate on feature-attributed network embedding, our work focuses on capturing implicit structural information from transitive relations on multipartite graph networks.
Furthermore, as described in BiNE [16], the random walk generator used in the works mentioned above (inspired by [35]) is not equipped to mimic the real-world distribution of vertices in a graph. BiNE [16] surpasses them in structure-preserving embedding: it thoroughly investigates vital information on edge relationships in the graph network along with the oversmoothing problem of vertices. Hence, we also add BiNE [16] as a baseline for our experiments; in contrast to it, our model proposes a community-aware random walk, transitive-property-preserving graphs, and a heterogeneous negative sampling technique for multi-entity embedding.
We thank the reviewers of this paper for bringing to our notice a very recent work by Chen et al. [4], which explores folded bipartite network embedding using a graph convolutional network (GCN). This work advances bipartite network embedding by introducing higher-order relationships and using a self-attention technique to perform the embedding. Our work concentrates on extending bipartite to multipartite network embedding with random walk modeling and supporting our use case with a real-world application, which makes [4] partially orthogonal.
We believe our work contributes significantly toward structure-preserving network embedding and its application in semantic ROI embedding to herald a new direction in elucidating geospatial regions with semantic features.

Conclusion
In this paper, we propose TNE, a tripartite network embedding model for learning region of interest (ROI) embeddings. Our study focuses on learning ROI embeddings that simultaneously capture semantic and geospatial features. First, we formalize the problem of semantic embedding for ROIs with an information graph that captures social, semantic, and spatial attributes. Then, we show that TNE induces transitive relational features to obtain better learning performance while preserving the structure of the information graph. We performed multifaceted experiments on real-world data showing the advantages of performing ROI embedding with TNE over other baselines. We also demonstrate an interactive map to explore and discover the similarities and distinctness of regions.