Privacy preserving semantic trajectory data publishing for mobile location-based services

The development of wireless technologies and the popularity of mobile devices are responsible for generating large amounts of trajectory data for moving objects. Trajectory datasets have spatiotemporal features and are a rich information source. The mining of trajectory data can reveal interesting patterns of human activities and behaviors. However, trajectory data can also be exploited to disclose users' private information, e.g., the places where they live and work, which could be abused by a malicious party. Therefore, it is very important to protect users' privacy before publishing any trajectory data. While most previous research on this subject has only considered the privacy protection of stay points, this paper distinguishes itself by modeling and processing semantic trajectories, which not only contain spatiotemporal data but also involve POI information and the users' motion modes such as walking, running and driving. Accordingly, a semantic trajectory anonymization method based on the k-anonymity model is proposed. It first forms sensitive areas that contain k − 1 POI points similar to the sensitive points. Then, trajectory ambiguity is executed based on the motion modes, road network topologies and road weights in the sensitive area. Finally, a similarity comparison is performed to obtain the recordable and releasable anonymity trajectory sets. Experimental results show that this method performs efficiently and provides high privacy levels.


Introduction
Due to the development of mobile devices and positioning technologies, various kinds of mobile positioning devices, such as car navigation systems, GPS-enabled mobile phones, mobile wearable devices, tablet computers and position sensors, have been made available to consumers in recent years [1][2][3][4][5]. The popularity of mobile positioning devices has spawned numerous location-based services (LBSs) [6][7][8] and has generated large amounts of locational data as well [9][10][11]. According to statistics, each moving object in LBSs transmits its current location every 15 s on average, which indicates that more than 100 million pieces of location information are transmitted per second. These data are extensively applied in everyday life, thereby constantly influencing people's lifestyles, working habits and thinking modes. By observing a person's daily life, it is possible to infer where he or she lives or goes every day and thus to provide convenient location-based services. For instance, it is feasible to design ideal travel routes for a person based on the motion trajectory data available for that person [12][13][14][15][16].
Location data have created both benefits and problems, and of the problems, privacy disclosure is the most prominent issue [17][18][19][20][21][22][23]. In fact, the abuse of location data may lead to the disclosure of a user's most important personal information such as their personal interests, social relationships and living habits. For instance, potential attackers can not only identify the locations visited by a mobile user but also discover their home address and job location by analyzing their spatiotemporal trajectories. They can even derive private information such as a user's behavioral patterns from their daily motion trajectories, thereby posing a great threat to the user's safety and property. There have been cases when exposure of trajectory data caused damaging privacy disclosures and threatened a user's personal safety.
Researchers have proposed multiple solutions for solving privacy disclosure problems caused by LBSs. The existing privacy protection methods for LBSs mainly include data encryption, pseudoaddresses, space conversions and anonymity areas [24][25][26][27][28][29]. These methods are mostly focused on the location data, without delving into the relationships between the location data and the users, or the privacy implications of the location data. It is difficult to capture the significance of real-time human activities. Therefore, a growing number of researchers have studied location privacy protection based on semantics, with a view toward achieving a deeper level of protection. The protection of semantics-based moving object trajectories has also become a focus of more research [30][31][32].
With the increasing awareness of semantic information in trajectories, trajectory protection methods have gradually developed into methods based on semantics. Monreale et al. classified locations in order to generate generalized user access addresses, which enabled the creation of anonymity trajectory datasets that ensured that the probability of identifying user IDs and accessing sensitive locations was lower than a given threshold [33]. Lee et al. also imposed a threshold on the information obtainable by adversaries [34]. They suggested exploring location semantics by observing users' length of stay. Moreover, the ratio of suppressed frequent sequences is a direct indication of anonymized data quality for trajectory pattern mining [35,36]. This paper regards the length of stay as a semantic feature extractable for LBSs and considers it a location semantics metric that can protect users' privacy. It should be noted that the above methods merely involve semantic privacy protection for stay points. In fact, semantic trajectories can contain more information (i.e., motion modes of moving objects such as walking, running, cycling and driving) due to new developments in mobile technology. Therefore, it is necessary to adopt different trajectory privacy protection strategies for different motion modes. More importantly, increases in the semantic information contained in trajectories have posed greater challenges for user privacy protection. This paper presents a semantic-based trajectory anonymity protection method. The semantic trajectory is modeled based on the data it obtains including longitude, latitude, timestamp, POI yellow page information, velocity and motion mode. Subsequently, users' sensitive points (stay points) are identified and combined with the different motion modes as inputs for a pruning process. The pruning processing is carried out in the geographical space that covers the sensitive points. 
Finally, similarity comparisons are performed to obtain recordable and releasable anonymity trajectory datasets.

Semantic trajectory anonymity protection algorithm
In this section, an algorithm is proposed that consists of four main steps, as follows:
Step 1 Semantic trajectory modeling: The algorithm preprocesses the raw data and extracts spatiotemporal sequences, important spatial points (starting points, end points and stay points), velocities and motion modes. In other words, the raw data acquired are transformed into semantic trajectories as defined in Definition 1.
Step 2 Sensitive area construction: The sensitive point is processed based on the k-anonymity model, eventually forming a coverage area that contains k − 1 POI points of a similar type to the sensitive point. The coverage area is referred to as the sensitive area.
Step 3 Trajectory ambiguity: Trajectory ambiguity is performed according to the users' motion modes, the road network topologies and the road weights in the sensitive area. The targets of ambiguity mainly include the start-end points and the stay points. The ambiguity methods can be divided into two types, trajectory segment pruning and trajectory segment addition.
Step 4 K-anonymity set construction: A similarity comparison is performed to form an anonymity set that contains the other k − 1 trajectories with the highest similarity.
Step 2 can effectively prevent semantic location attacks and reduce the attack probability to 1/k. Step 3 can effectively prevent maximum velocity attacks. It prunes the existing trajectory segments or constructs new trajectory segments, thereby preventing the attackers from effectively calculating the users' range of motion. Finally, the privacy protection effects can be significantly improved by releasing the trajectory k-anonymity datasets.

Semantic trajectory modeling
Semantic information such as velocity, timestamp and motion mode is directly obtainable from the client. The sampling locations merely contain the latitude and longitude and contain no actual semantic information. The acquisition of useful semantic location information depends on the client and the GIS server. This section mainly describes how to extract the start-end and the stay points from the original location data. The start-end points are the starting and ending points of a trajectory, while the stay points are the locations visited by mobile users. Both contain important semantic information on the moving objects and are regarded as sensitive points that need special protection. Therefore, semantic trajectories can be defined as follows:
Definition 1 The semantic trajectory model is expressed as $ST = \langle (x_0, y_0, z_0, p_0, s_0, w_0), \ldots, (x_n, y_n, z_n, p_n, s_n, w_n) \rangle$, where $x_i$, $y_i$, $z_i$, $p_i$, $s_i$ and $w_i$ represent the longitude, latitude, timestamp, POI yellow page information, velocity and motion mode, respectively.
There are mainly two methods for extracting the stay points. One method is to extract stay points based on the length of stay, which is also the simplest method. It is necessary to set a time threshold, $t_{th}$, when this method is adopted. A stay occurs when the time interval between two consecutive sampling locations is greater than $t_{th}$ and the distance between the two locations is smaller than the displacement threshold $d_{th}$. Another method is to extract stay points based on the sampling density, which is essentially a supplement to the first method. Users tend to move at a low velocity when they stop at a certain outdoor location. Therefore, the actual stay points of users can be obtained by clustering low-velocity sampling points (the velocity is close to 0), as shown in Fig. 1.
In practice, the two methods are usually combined, thereby obtaining important location information.
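The length-of-stay rule can be illustrated with a minimal sketch. It assumes samples stored as (longitude, latitude, timestamp in seconds) tuples and a haversine distance in metres; the function names and the choice of reporting the first sample of each qualifying pair are illustrative assumptions, not details given in this paper.

```python
import math
from typing import List, Tuple

# Assumed sample layout: (longitude, latitude, timestamp in seconds).
Sample = Tuple[float, float, float]

def haversine_m(p: Sample, q: Sample) -> float:
    """Great-circle distance in metres between two samples."""
    r = 6371000.0  # mean Earth radius in metres
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def stay_points(track: List[Sample], t_th: float, d_th: float) -> List[Sample]:
    """Return the first sample of every consecutive pair that forms a stay:
    time gap greater than t_th and displacement smaller than d_th."""
    stays = []
    for prev, cur in zip(track, track[1:]):
        long_gap = (cur[2] - prev[2]) > t_th
        short_move = haversine_m(prev, cur) < d_th
        if long_gap and short_move:
            stays.append(prev)
    return stays
```

The density-based method would add a clustering pass over low-velocity samples, which is not modelled here.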

Sensitive area construction
The entire geographic space is divided into several grid areas before sensitive area construction and denoted as $SG_{m \times n} = \{G(i, j) \mid 1 \le i \le m,\ 1 \le j \le n\}$. Based on the actual conditions of the city where the objects are located and the roads shared by users, the unit length $\Delta l$ of each grid area $G(i, j)$ can range from 0.02 to 0.05 latitude and longitude coordinate intervals. Here, we select the intervals based on latitude and longitude coordinates mainly because the road network is generally stored in the spatial database in the latitude and longitude format.
Second, each grid area G is further divided into $k \times k$ subgrids $G_{k \times k} = \{g(i, j) \mid 1 \le i \le k,\ 1 \le j \le k\}$. The unit length of each subgrid is a 0.006 latitude and longitude coordinate interval, corresponding to an actual length of approximately 1 km. This unit of length not only achieves high computational accuracy but also reduces the computational cost.
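As an illustration of the two-level division, the following sketch maps a coordinate to its grid cell G(i, j) and subgrid g(i, j). The origin of the gridded space and the grid unit length of 0.03 (a value inside the 0.02 to 0.05 range stated above) are assumptions made for the example; only the subgrid length of 0.006 comes from the text.

```python
from typing import Tuple

def grid_index(lon: float, lat: float, origin: Tuple[float, float],
               grid_len: float = 0.03, sub_len: float = 0.006):
    """Map a coordinate to (grid cell G(i, j), subgrid g(i, j)), both 1-based.

    origin is the assumed south-west corner of the gridded space.
    """
    dx, dy = lon - origin[0], lat - origin[1]
    if dx < 0 or dy < 0:
        raise ValueError("point lies outside the gridded space")
    gi, gj = int(dx // grid_len), int(dy // grid_len)
    per_side = round(grid_len / sub_len)  # subgrids per grid side (k)
    # Offset inside the grid cell, then the subgrid index; clamped so that
    # floating-point rounding at a cell border cannot leave the valid range.
    si = min(max(int((dx - gi * grid_len) // sub_len), 0), per_side - 1)
    sj = min(max(int((dy - gj * grid_len) // sub_len), 0), per_side - 1)
    return (gi + 1, gj + 1), (si + 1, sj + 1)
```

With these assumed lengths each grid cell holds 5 × 5 subgrids.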
Sensitive area construction can be performed after dividing the areas and obtaining the semantic trajectories. This project adopts the k-anonymity model for sensitive area construction, that is, the area must contain at least k − 1 location points of a similar type.
The sensitive area constructed based on the k-anonymity model can be quickly obtained through a k-nearest neighbor query of the GIS database. This project adopts the PostGIS spatial database, in which the sensitive area can be obtained through the following query statement:
SELECT g1.gid, g2.gid FROM points AS g1, polygons AS g2 WHERE g1.gid <> g2.gid AND g1.type = g2.type ORDER BY g1.gid, ST_Distance(g1.the_geom, g2.the_geom) LIMIT k;
On the other hand, a minimum bounding box (MBB) that satisfies the k value may be too large, eventually reducing the availability of the semantic trajectory. Therefore, it is necessary to determine the maximum size of an MBB. This paper considers the subgrid area that covers the sensitive point as the largest MBB possible.
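The same idea can be sketched outside the database. The example below is a simplified in-memory stand-in for the spatial query, not the PostGIS implementation: it picks the k − 1 nearest POIs of the same type, takes their bounding box together with the sensitive point, and clips it to the covering subgrid. The tuple layouts and the planar distance in degrees are assumptions for illustration.

```python
from typing import List, Tuple

Poi = Tuple[float, float, str]            # assumed layout: (lon, lat, type)
Box = Tuple[float, float, float, float]   # (min_lon, min_lat, max_lon, max_lat)

def sensitive_area(point: Poi, pois: List[Poi], k: int, max_box: Box) -> Box:
    """Bounding box of the sensitive point and its k - 1 nearest same-type
    POIs, clipped to max_box (the subgrid covering the sensitive point)."""
    same_type = [p for p in pois if p[2] == point[2] and p[:2] != point[:2]]
    if len(same_type) < k - 1:
        raise ValueError("fewer than k - 1 POIs of the same type")
    # Squared planar distance in degrees; adequate for ranking nearby POIs.
    same_type.sort(key=lambda p: (p[0] - point[0]) ** 2 + (p[1] - point[1]) ** 2)
    chosen = [point] + same_type[:k - 1]
    lons = [p[0] for p in chosen]
    lats = [p[1] for p in chosen]
    return (max(min(lons), max_box[0]), max(min(lats), max_box[1]),
            min(max(lons), max_box[2]), min(max(lats), max_box[3]))
```

Note that clipping can leave fewer than k − 1 similar POIs inside the box; how such cases are handled is not specified in the text.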

Trajectory ambiguity
Trajectory ambiguity refers to the ambiguity processing of trajectories based on sensitive areas and other semantic information. The targets of ambiguity mainly include start-end points and stay points. The ambiguity method can be divided into two types, namely, trajectory segment pruning and trajectory segment addition. The pruning method, as the name suggests, removes trajectory segments that contain sensitive points. These trajectory segments tend to exist in the vicinity of sensitive points, and users tend to move at a low velocity in these areas. The addition method involves constructing new trajectory segments and combining them with real trajectory segments to form new trajectories. The two methods can be combined to form new trajectories, thereby achieving the goal of user privacy protection.

Start-end point ambiguity
The main steps for accomplishing start-end point ambiguity are as follows:
• The first step is to calculate the sensitive area.
• A trajectory can be directly pruned when the trajectory contains the start-end point, involves the motion mode of walking and satisfies the following two conditions:
- There are trajectory segments that contain different motion modes in the sensitive area.
For instance, the semantic trajectory in Fig. 2 can be considered to be $ST = \langle ST_{s1}, ST_{s2} \rangle$. $ST_{s1}$ mainly involves the motion mode of walking and contains semantic information of the starting point (home), while $ST_{s2}$ mainly involves the motion mode of driving. Suppose that the starting point (home) is set as a sensitive point. The first thing to do is calculate the sensitive area (red rectangle in the figure). Since the end point of $ST_{s1}$ is less than 300 m from a sensitive point (home), the trajectory cannot be simply pruned. It is necessary to recalculate the weights of the roads in the sensitive area and select the road with the lowest weight to construct a new trajectory segment. In the figure, the blue road indicates an arterial road and has the lowest weight. Therefore, the red point is selected as the new starting point and combined with the black point to form the shortest path. Consequently, a new trajectory segment set $ST = \langle ST_{s3}, ST_{s2} \rangle$ is formed.
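The construction of the new segment between the new starting point and the point where the retained trajectory begins can be sketched as a lowest-weight path search over the road network in the sensitive area. The example uses Dijkstra's algorithm; the adjacency-list representation, the node names and the weights are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# Assumed road network: node -> [(neighbouring node, road weight)].
Graph = Dict[str, List[Tuple[str, float]]]

def lowest_weight_path(graph: Graph, source: str, target: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm: total weight and node sequence from source to target."""
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        raise ValueError("target is not reachable from source")
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]
```

In this sketch arterial roads would simply carry smaller weights, so the search favours them when connecting the new starting point to the rest of the trajectory.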

Stay point ambiguity
In contrast to start-end point ambiguity, stay point ambiguity can directly prune a trajectory segment that contains a stay point. The remaining trajectory segments can be processed according to the length of stay.
• The length of stay exceeds the threshold Δt.
When the user stays at a location for a long time, simply recombining the remaining trajectory segments would allow attackers to search for abnormal semantic information and obtain private information, owing to the rich semantic information contained in the semantic trajectory. To address this problem, this paper directly splits the remaining set of trajectory segments and recombines the trajectory segments by using start-end point ambiguity.
• The length of stay does not exceed the threshold Δt.
In this case, this paper performs ambiguity processing on the semantic information of the other trajectory segments in the sensitive area to protect the sensitive point. The ambiguity method mainly includes velocity ambiguity (random average velocity) and timestamp ambiguity.
For instance, the semantic trajectory in Fig. 3 can be considered to be $ST = \langle ST_{s1}, ST_{s2}, ST_{s3} \rangle$. $ST_{s1}$ and $ST_{s3}$ mainly involve the motion mode of driving, while $ST_{s2}$ mainly involves the motion mode of walking. In addition, $ST_{s2}$ contains semantic information for the hospital. Suppose that the hospital is set as a sensitive point. The first step is to construct a sensitive area (red rectangle in the figure). Subsequently, $ST_{s2}$ can be directly pruned and the time interval between $ST_{s1}$ and $ST_{s3}$ can be evaluated.
If the time interval is less than the threshold Δt, it is necessary to perform velocity and timestamp ambiguity on $ST_{s1}$ and $ST_{s3}$ and reset the corresponding semantic information. For instance, the average velocity is set to:
If the time interval exceeds the threshold Δt, it is necessary to split the remaining trajectory segment set and recombine the trajectory segments by using start-end point ambiguity. For instance, $ST = \langle ST_{s1}, ST_{s2}, ST_{s3} \rangle$ in Fig. 3 is split into $ST_1 = \langle ST_{s1} \rangle$ and $ST_2 = \langle ST_{s3} \rangle$. Suppose the red point is selected as a new start-end point; then, the new trajectory sets are $ST_1' = \langle ST_{s1'} \rangle$ and $ST_2' = \langle ST_{s2'}, ST_{s3} \rangle$, respectively.
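The two cases can be summarized in a short sketch. It covers only the prune-then-decide logic; the actual velocity and timestamp ambiguity and the subsequent start-end point ambiguity are not modelled. The Segment class, its fields and the treatment of a gap exactly equal to the threshold (handled as a short stay) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    name: str
    start: float      # timestamp of the first sample (s)
    end: float        # timestamp of the last sample (s)
    sensitive: bool   # True if the segment contains the sensitive stay point

def stay_point_ambiguity(segments: List[Segment], delta_t: float) -> List[List[Segment]]:
    """Prune the sensitive segment, then keep one trajectory if the time gap
    it leaves does not exceed delta_t, otherwise split into two trajectories."""
    idx = next((i for i, s in enumerate(segments) if s.sensitive), None)
    if idx is None:
        return [segments]  # nothing to protect
    before, after = segments[:idx], segments[idx + 1:]
    if not before or not after:
        return [before or after]
    gap = after[0].start - before[-1].end
    if gap <= delta_t:
        # Short stay: one trajectory remains; velocity and timestamp
        # ambiguity would be applied to the neighbouring segments here.
        return [before + after]
    # Long stay: split; each part is then handled by start-end point ambiguity.
    return [before, after]
```

For the hospital example above, pruning the walking segment and finding a gap longer than the threshold yields two separate trajectories built from the two driving segments.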

Trajectory set construction based on the K-anonymity model
Trajectory sets can be constructed on the basis of the k-anonymity model after semantic trajectory ambiguity is accomplished. The construction of anonymity sets mainly depends on two factors, namely, spatiotemporal similarity and semantic similarity. Spatiotemporal similarity mainly refers to the similarity of two trajectories in geospatial and temporal dimensions, while semantic similarity mainly refers to the semantic similarity of two trajectories for stay points and motion modes.

Spatial distance measurement
In terms of spatial similarity, this paper adopts a similarity algorithm based on the Hausdorff distance (HD). The HD measures the degree of similarity between two sets of points and is defined as a distance between the two sets. The HD can effectively calculate the distance between images without establishing a correspondence between the templates and the sample pixels, and it is therefore widely used in the field of pattern recognition.

Definition 2 Hausdorff distance (HD)
Given two point sets $A = \{a_1, \ldots, a_p\}$ and $B = \{b_1, \ldots, b_q\}$, the HD between the two point sets can be calculated as follows:
$$H(A, B) = \max\big(h(A, B),\ h(B, A)\big), \qquad h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|$$
where $\|\cdot\|$ is the distance norm between the points of sets A and B.
Since the HD is highly sensitive to outliers such as noise points, even a few noise points can significantly affect the distance values. To address this problem, some scholars have proposed the modified Hausdorff distance (MHD). The MHD is defined as follows:
$$h_{MHD}(A, B) = \frac{1}{m_a} \sum_{a \in A} \min_{b \in B} \|a - b\|$$
where $m_a$ is the number of objects in point set A.
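A minimal sketch of the two distances follows. It assumes the standard directed forms, namely the maximum over a in A of the distance to the nearest b in B for the HD and the mean of those nearest distances (divided by the number of points in A) for the MHD, with a Euclidean norm on planar points. The function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

def _nearest(a: Point, pts: Sequence[Point]) -> float:
    """Euclidean distance from a to its nearest point in pts."""
    return min(math.dist(a, b) for b in pts)

def directed_hd(A: Sequence[Point], B: Sequence[Point]) -> float:
    """Directed Hausdorff distance: worst-case nearest distance from A to B."""
    return max(_nearest(a, B) for a in A)

def hausdorff(A: Sequence[Point], B: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance."""
    return max(directed_hd(A, B), directed_hd(B, A))

def directed_mhd(A: Sequence[Point], B: Sequence[Point]) -> float:
    """Directed modified Hausdorff distance: mean nearest distance from A to B."""
    return sum(_nearest(a, B) for a in A) / len(A)
```

With A = [(0, 0), (1, 0)] and B equal to A plus one noise point at (10, 0), the HD jumps to 9 while the directed MHD from B to A is only 3, which illustrates why the averaged form is less sensitive to outliers.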
The spatial distance between trajectories can be calculated by using the following equation:
where $l_A$ is the total spatial length of trajectory A.

Temporal distance measurement
The temporal attributes (i.e., the timestamps) of trajectories are generated along with the spatial sampling. It is meaningless to discuss the temporal distance of moving objects without considering the specific forms of the spatial trajectories. Therefore, the MHD can also be used to measure the temporal distance between trajectories. The definition is as follows:
Definition 5 Given two point sets $A = \{a_1, \ldots, a_p\}$ and $B = \{b_1, \ldots, b_q\}$, the temporal distance between the two point sets can be calculated as follows:
where $t_A$ is the total temporal length of trajectory A.

Spatial-temporal distance measurement
The method of measuring the spatiotemporal similarity between trajectories can be derived from the spatial and temporal distance measurement methods.

Semantic distance measurement
Cosine similarity is a measure of the similarity between two nonzero vectors of an inner product space that measures the cosine of the angle between them. Since cosine similarity can be applied to a comparison of vectors of any dimension, it is widely used in similarity measurements, especially in text similarity measurements. This paper adopts the cosine similarity method to measure the similarity between the semantic trajectories in stay points and motion modes. Cosine similarity is defined as follows:
Definition 7 Suppose that the semantic values of two location vectors are sem(A) and sem(B). The similarity between the two semantic values can be expressed as follows:
$$\cos(\theta) = \frac{\sum_i sem_i(A)\, sem_i(B)}{\sqrt{\sum_i sem_i(A)^2}\ \sqrt{\sum_i sem_i(B)^2}}$$
where i indicates the semantic contents compared.
The cosine value is limited to a range of [0,1]. The higher the semantic similarity between the two locations is, the closer the cosine value is to 1. The lower the semantic similarity is, the closer the value is to 0.
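The computation can be sketched as follows, assuming the standard cosine formula (dot product divided by the product of the vector norms). How the semantic contents of a trajectory are encoded as a numeric vector is not specified here; the example assumes non-negative component values, for which the result lies in [0, 1].

```python
import math
from typing import Sequence

def cosine_similarity(sem_a: Sequence[float], sem_b: Sequence[float]) -> float:
    """Cosine of the angle between two semantic vectors of equal length."""
    if len(sem_a) != len(sem_b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(sem_a, sem_b))
    norm_a = math.sqrt(sum(x * x for x in sem_a))
    norm_b = math.sqrt(sum(y * y for y in sem_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

For instance, with a hypothetical encoding of time shares per motion mode (walking, cycling, driving), two trajectories with identical shares score close to 1, while a pure walking and a pure driving trajectory score 0.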

Semantic trajectory similarity measurement
Definition 8 The similarity between two semantic trajectories $ST_a$ and $ST_b$ can be defined as follows:
The specific calculation steps are described below:
Step 1 Perform noise reduction on the two trajectories A and B (the trajectory segments retain only the start-end points, velocity outliers, stay points and road network nodes).
Step 2 Interpolate the various points in trajectories A and B into a third trajectory, to eliminate the impacts of different sampling sizes, reference locations and strategies.
Step 3 Calculate the spatial distance between the two trajectories by using Eq. 5.
Step 4 Calculate the temporal distance between the two trajectories by using Eq. 7.
Step 5 Calculate the spatiotemporal similarity between the two trajectories by using Eq. 9.
Step 6 Calculate the semantic similarity between the two trajectories for the start-end and stay points.
Step 7 Calculate the semantic similarity between the two trajectories for motion modes.
Step 8 Obtain the similarity between the two trajectories.
In summary, the pseudocode of the k-anonymity model-based trajectory set construction algorithm is as follows:
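The authors' listing is not available in this text. As a rough illustration of the selection step only, the following sketch assumes that a pairwise similarity function (for example, one implementing Definition 8) is supplied by the caller, and forms the anonymity set from a target trajectory and the k − 1 other trajectories most similar to it. The function name and the dictionary-based data layout are hypothetical.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")

def build_anonymity_set(target_id: str,
                        trajectories: Dict[str, T],
                        similarity: Callable[[T, T], float],
                        k: int) -> List[str]:
    """Return the ids of the target and the k - 1 trajectories most similar to it."""
    others = [tid for tid in trajectories if tid != target_id]
    if len(others) < k - 1:
        raise ValueError("not enough trajectories to form a k-anonymity set")
    target = trajectories[target_id]
    # Rank the other trajectories by decreasing similarity to the target.
    ranked = sorted(others,
                    key=lambda tid: similarity(target, trajectories[tid]),
                    reverse=True)
    return [target_id] + ranked[:k - 1]
```

Ties are broken by the original ordering of the input, since Python's sort is stable.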

Evaluation
The algorithm was written in Java and run on a DELL Optiplex host (CPU: Core 2 Duo, 2 GHz; RAM: 4096 MB). The relational database system and the GIS database were PostgreSQL 9.1 and PostGIS 1.5, respectively. All the road datasets in the experiment came from national road datasets provided by OpenStreetMap. All the trajectory information came from the MyMap App. The experiment involved a total of 352,234 road data records and 36,825 trajectory records.

Overall performance
First, the overall performance of the algorithm was evaluated. We adopted the default grid division method to divide the grids into subgrids with unit lengths of 0.006 of the latitude and longitude coordinate interval (corresponding to an actual length of about 1 km). We considered POI settings such as home, school, hospital, bank and restaurant as the sensitive points and set the k value of the sensitive area (sak) to 2, 3, 4 and 5. For the semantic trajectory anonymity protection algorithm, we measured the algorithm efficiency for k = 3, 5, 8 and 10, with dataset sizes ranging from 10 to 30 k, and took the average. The execution time of the algorithm is shown in Fig. 4.
It can be seen from the figure that the average execution time is lengthy, which indicates that the algorithm is not highly efficient. There are two reasons for this. First, the algorithm needs to calculate sensitive areas and sort the road weights in the trajectory ambiguity process. Second, Dijkstra's shortest-path algorithm is adopted to construct new trajectory segments in sensitive areas and regenerate new semantic trajectories. To reduce this cost, all the road network data are prestored in the GIS database when the grid division method is adopted for sensitive area construction, which accelerates the construction of new trajectory segments. In other words, the overall performance of the algorithm is improved by taking these optimization measures. In addition, the anonymity sets are generally released while the database is offline; therefore, the execution time has no impact on the actual users.

Information loss rate
Information loss refers to the loss of the original trajectory information caused by trajectory anonymization. It is calculated by using the following equation:
where k is the number of trajectories in the anonymity sets; $M_{poi}$ is the number of sensitive points after ambiguity processing; and $N_{poi}$ is the number of sensitive points in the original trajectory sets. Figure 5 illustrates the information loss caused by an anonymity set release. It can be seen from the figure that the information loss rate gradually increases as the k value increases. It is therefore recommended that the k values for both the sensitive area and the semantic trajectory anonymization be set to small values, so that both the privacy level and the information loss remain acceptable.

Query error rate
The error rate of the spatial range query is also an important measure of information loss. A spatial range count query returns the number of moving objects in a certain spatial area within a certain period of time. Anonymizing the semantic trajectories inevitably introduces a certain error into this count. The error is denoted by error and is given by Eq. 13.
$$error = \frac{\min(Q(D),\ Q(D^*))}{\max(Q(D),\ Q(D^*))} \qquad (13)$$
where Q(D) is the value obtained by performing a spatial range count query on the original trajectory data and Q(D*) is the value obtained by performing a spatial range count query on the data after privacy protection processing. The query error rate is shown in Fig. 6. It can be seen from the figure that the query error rate is less than 20% when k = 3 and sak is 2 or 3. In addition, the error rate increases as the k value increases. Using the k-anonymity model generally protects semantic trajectory privacy. Considering the computational efficiency and query accuracy, the k value usually ranges between 3 and 5.

Conclusion
With regard to publishing trajectory data, this paper proposes a privacy preserving method that adopts semantic trajectory anonymizing based on the k-anonymity model. In contrast to traditional trajectory data models, which only contain the spatiotemporal attributes, our semantic trajectory model incorporates semantic information from sensitive points and users' motion modes. The algorithm first preprocesses the raw data and extracts spatiotemporal sequences, important spatial points, velocities and motion modes. Sensitive points are then processed based on the k-anonymity model, eventually forming a sensitive area, that is, a coverage area that contains k − 1 POI points of a similar type to the sensitive point. Trajectory ambiguity is accomplished based on the motion modes, road network topologies and road weights in the sensitive area. Finally, a similarity comparison is performed to form an anonymity set that contains the other k − 1 trajectories with the highest similarity. The experimental results show