Introduction

Traffic flow prediction remains an indispensable part of intelligent transportation systems (ITS) [1, 2] in today’s rapidly changing society. Accurate and reliable traffic prediction helps alleviate traffic congestion, which is of great significance to traffic management and social security [3,4,5]. As a crucial part of 5G, edge computing can optimize data processing performance and reduce the delay of traffic flow prediction [6, 7]. Generally speaking, traffic prediction estimates the future traffic situation of a road network from historical traffic data [8,9,10]. Consequently, the integrity of historical traffic data is a key prerequisite for successful prediction.

Because the data used for traffic flow prediction come from different sensors (e.g., laser sensors and infrared sensors) [11,12,13], an infrared sensor platform needs to cooperate with other platforms (e.g., a laser sensor platform) and exploit the integrated traffic data to improve the prediction accuracy of missing sensor data [14]. Therefore, it is necessary to find a way to integrate traffic data [15,16,17]. However, such integration is usually not feasible in actual cooperation between manufacturers. One of the most fundamental reasons is that manufacturers rarely share their original traffic data with others owing to conflicts of interest and the sensitive user information involved [18,19,20]. Another reason is that the amount of raw data tends to grow over time, which lowers the efficiency of traffic data sharing, processing, mining and analysis [21,22,23].

Considering the above challenges, we propose a novel traffic prediction method named \(ASMVP_{distr-LSH}\), which is based on distributed locality-sensitive hashing (LSH) [24,25,26] and is designed to protect privacy and fill in missing traffic data [27]. A favorable feature of LSH is that it preserves similarity, i.e., two adjacent points are likely to be assigned the same index [28,29,30,31]. Overall, our main academic contribution is twofold.

  1. (1)

    We formulate the problem of traffic flow prediction from multi-source data across different sensors and propose a distributed LSH method for traffic flow prediction that protects user privacy, i.e., \(ASMVP_{distr-LSH}\). This method converts the traffic flow information into index information and then uses the index information for prediction, thereby achieving privacy protection.

  2. (2)

    We explicitly consider data integrity before traffic flow prediction. We use the similarity-search principle of LSH to fill in the missing data of a sensor, which helps ensure the prediction accuracy for missing traffic sensor data.

The rest of this paper is organized as follows. In “Related work”, we summarize the related work in the traffic data prediction domain. In “Problem description and formulation”, we introduce the motivation of the study through a real-world traffic scenario and formalize the problem of missing value prediction in traffic. In “A distributed LSH-based missing value prediction approach: \(ASMVP_{distr-LSH}\)”, the novel method (i.e., \(ASMVP_{distr-LSH}\)) proposed in this paper is described in detail to achieve privacy-preserving and time-efficient traffic data prediction. In “Case study”, a case study is used to introduce the concrete steps and show the effectiveness of our method \(ASMVP_{distr-LSH}\). In addition, its shortcomings are analyzed in this section. Finally, we summarize the conclusions of this paper and point out the future work in “Evaluation and further discussion”.

Related work

Next, we briefly review the research progress of traffic flow prediction from the following two aspects, i.e., missing data prediction and user privacy protection.

Missing traffic flow data prediction

The data collected by different types of sensors on the road are the key ingredient of traffic flow prediction models. However, the data collected by these sensors are occasionally missing for a variety of reasons, e.g., hardware failure or transmission error. Tian et al. [32] address the problem of long-term missing data from two different perspectives and propose two machine learning methods to update missing data without a gap-length limitation. Laña et al. [33] utilize the periodicity of traffic flow data to infer missing values and put forward a method based on the Long Short-Term Memory (LSTM) [34] model. Li et al. [35] propose a Multi-View Learning Method (MVLM) to estimate the missing values of traffic flow in the database.

Although the influence of missing values on prediction performances is often ignored in deep neural network methods, Boquet et al. [36] put forward an online unsupervised data imputation method to tackle this issue. Wang et al. [10] use the characteristic that the traffic flow in the network follows the spatial-temporal patterns to restrain the influence of missing data. Zhang et al. [37] propose a method called FNNTEL, which is a data missing estimation method of tensor heterogeneous ensemble learning based on Fuzzy Neural Network (FNN).

At present, research on missing traffic flow data mostly focuses on models based on temporal information, spatial information, spatiotemporal information, or tensor decomposition. However, these models often lack the ability to protect users’ private information contained in traffic flow data.

Privacy protection in traffic flow data

In ITS, traffic flow data are the key factor and the foundation of many prediction models and analysis methods. However, using these data risks disclosing the private user information they contain. In order to protect the release of real-time vehicle trajectory data, Ma et al. [38] propose a method named RPTR, which is an effective privacy protection mechanism based on differential privacy. In the RPTR mechanism, an ensemble Kalman filter based on the user location transition probability matrix is used to ensure data availability.

To solve the problem of privacy leakage caused by data sharing between different private operators and public institutions, He et al. [39] propose a privacy control algorithm based on k-anonymity diffusion to realize reliable data sharing. Facing the challenge of protecting the privacy of a single vehicle when collecting point-to-point data for geographical location, Zhou et al. [40] propose a privacy-preserving traffic flow measurement method by using bit array to collect data that need to be protected and maximum likelihood estimation to obtain measurement results. In order to protect the location privacy data of users in GPS, Yang et al. [41] propose a virtual travel route system with stealth technology, which promotes the design of distributed architecture considerably.

Wu et al. [42] were the first to use a homomorphic encryption function in compressed-sensing data collection, and propose an efficient data collection method with privacy protection to prevent traffic analysis and tracking in wireless sensor networks. Liu et al. [43] propose a gated recurrent unit neural network algorithm based on federated learning (FL), a privacy-preserving machine learning technology.

Qi et al. [44] put forward an LSH-based service recommendation approach \(SerRec_{distr-LSH}\) to secure the sensitive information of users hidden in historical user ratings. However, the authors do not take the time context factor into consideration and neglect the dynamic influence of time towards the unstable service quality. Therefore, the accuracy of the proposed \(SerRec_{distr-LSH}\) approach is reduced accordingly.

Meng et al. [45] propose an “optimal publishing” strategy to reveal only the optimal service quality records instead of publishing all the sensitive service quality data observed by users. This way, most private information contained in service quality data can be protected well. However, such a partial publishing way decreases the availability of historical service quality data significantly since there is a tradeoff between data availability and data privacy.

However, the above algorithms have not been applied to deal with the traffic flow prediction scenarios where the traffic flow data generated by multiple sensors are distributed in multiple platforms. Motivated by this fact, we propose a novel traffic flow prediction method based on distributed LSH, as elaborated in the following sections.

Problem description and formulation

In this section, we first describe the LSH-based traffic flow prediction problem with privacy preservation. Then, we formalize the problem. The symbols used in this paper can be found in Table 1.

Table 1 Specification of symbols in this paper
Fig. 1 Multi-platform distribution in traffic flow prediction

Problem description

We use Fig. 1 to help describe the traffic flow prediction problem with missing data. Concretely, suppose there are three kinds of sensors recording the traffic situation of a certain road. In the figure, the three sensors are represented by three different base stations. To predict the missing data of one sensor, we use LSH to find the days that are similar to the day on which the data are missing. The missing values in the traffic flow data are then predicted from these similar days.

However, there are two challenges in the process of traffic flow prediction. (1) As the above traffic data contain private user information and involve the interests of various companies, the sensor companies do not want to share their collected data with other companies. (2) With the increase of sensor types and traffic flow, the volume of traffic flow data keeps growing. As a result, the efficiency and scalability of data sharing among companies are significantly reduced, and it is difficult to meet the requirements of real-time traffic prediction.

To address these challenges, we propose a new distributed LSH-based approach named \(ASMVP_{distr-LSH}\), which has the characteristics of privacy protection and scalability. Details are given in the next section.

Problem formulation

In this paper, we focus on the problem of missing traffic flow prediction from multi-source data. For the convenience of understanding and the following discussion, we further formulate the traffic data prediction problem as a five-tuple problem \( LSH\_MVP(TFS, TP, Day, day_{target}, TD)\), where

  1. (1)

    \( TFS = \left\{ TFS_1,...,TFS_{SN} \right\} : TFS_d(1\le d\le SN)\) denotes the d-th traffic sensor, which supplies the d-th part of the traffic flow data collected every day.

  2. (2)

    \( TP = \left\{ {TP_1,...,TP_{SN}}\right\} : TP_k(1\le k\le SN)\) denotes the time period in sensor \(TFS_k\).

  3. (3)

    \( Day = \left\{ {day_1,...,day_{DN} } \right\} : day_i(1\le i\le DN)\) represents the i-th day of the month. Here, the traffic flow prediction data collected every day come from many sensors in the set TFS, so it is multi-source.

  4. (4)

    \( day_{target}: \) a target day when we need to predict the missing data of a traffic sensor in a day. Here, \( day_{target}\in Day \) holds.

  5. (5)

    TD is the length of a time period, e.g., suppose that we adopt 15 min as a time period, then each day will contain 96 time periods.
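For illustration, the five-tuple can be held in a small container. The following is a minimal sketch only: the class name `LshMvpProblem`, its field names, and the sample values are hypothetical and merely mirror the symbols TFS, TP, Day, \(day_{target}\) and TD defined above.

```python
from dataclasses import dataclass
from typing import List

MINUTES_PER_DAY = 24 * 60


@dataclass
class LshMvpProblem:
    """Hypothetical container mirroring LSH_MVP(TFS, TP, Day, day_target, TD)."""
    sensors: List[str]        # TFS: identifiers of the SN traffic sensors
    periods: List[List[int]]  # TP: time period indices recorded by each sensor
    days: List[int]           # Day: day_1 .. day_DN
    day_target: int           # the day whose missing values are to be predicted
    td_minutes: int           # TD: length of one time period, in minutes

    def periods_per_day(self) -> int:
        # Number of time periods contained in one day for the chosen TD.
        return MINUTES_PER_DAY // self.td_minutes

    def is_valid(self) -> bool:
        # day_target must belong to Day, and TP needs one entry per sensor.
        return self.day_target in self.days and len(self.periods) == len(self.sensors)


problem = LshMvpProblem(
    sensors=["laser", "infrared"],
    periods=[list(range(96)), list(range(96))],
    days=list(range(1, 31)),
    day_target=7,
    td_minutes=15,
)
print(problem.periods_per_day())  # 96, matching the 15 min example above
```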

A distributed LSH-based missing value prediction approach: \(ASMVP_{distr-LSH}\)

Our traffic flow prediction method can not only protect privacy and fill in the missing data of abnormal sensors, but also make distributed predictions across a variety of sensor platforms [46, 47]. In “LSH: locality-sensitive hashing”, we briefly introduce the locality-sensitive hashing technique. In “\(ASMVP_{distr-LSH}\): traffic flow missing value prediction based on distributed LSH”, we introduce \(ASMVP_{distr-LSH}\) in detail.

LSH: locality-sensitive hashing

LSH technology was improved and proposed by Aristides Gionis et al. to achieve high-speed information retrieval. Specifically, the algorithm lets each hash bucket store more than one point. In other words, (1) two data points that are adjacent in the original space are likely to remain neighbors after hashing; and (2) two data points that are not adjacent in the original space are unlikely to be neighbors after hashing.

The above is the main idea of the LSH algorithm. A hash function satisfying the above two conditions is called an LSH function. LSH has been shown to handle distributed applications such as distributed information retrieval effectively, e.g., the multi-source cloud service recommendation method based on distributed LSH in [20].

Formally, let f(*) be an LSH function and F(*) a family of LSH functions. Let \( x_1\) and \( x_2\) be two points in the original data space, r(\( x_1\),\( x_2\)) the distance between them, f(x) the index or hash value of x, P(Y) the probability that condition Y holds, and (\( r_x\), \( r_y\), \( p_x\), \( p_y\) ) a set of thresholds. If both of the following conditions (1) and (2) hold, then f(*) is called (\( r_x\), \( r_y\), \( p_x\), \( p_y\) )-sensitive.

$$\begin{aligned} \text{If} \ r(x_1,x_2) \le r_x, \ \text{then} \ P(f(x_1) = f(x_2)) \ge p_x, \end{aligned}$$
(1)
$$\begin{aligned} \text{If} \ r(x_1,x_2) \ge r_y, \ \text{then} \ P(f(x_1) = f(x_2)) \le p_y. \end{aligned}$$
(2)

We use an example to illustrate the general procedure of the LSH-based similar-day search. First, assume that the original data space contains m data points (\( data_1\), \( data_2\),..., \( data_m\)). They can be mapped into n containers (\( b_1\), \( b_2\),..., \( b_n\)) by an LSH function, and each container \( b_k(1\le k \le n)\) contains \( m_k(m_k \ll m \ \& \ \sum m_k = m)\) data points with similar neighbor characteristics.

As described above, to find the dates similar to a target date (i.e., X) among (\( data_1\), \( data_2\),..., \( data_m\)), we compute the corresponding hash value f(X) through the hash function f(*) and then find the corresponding container (assume \( b_k(1\le k \le n)\) here). According to the main idea of LSH, the \( m_k\) data points contained in bucket \( b_k\) are most likely the days similar to the target day X. In this case, as long as \(m_k \ll m\) holds, the size of the search shrinks from m to \( m_k\), and the search efficiency improves significantly.

It can be seen from the above example that LSH-based search has three advantages. First, this method uses only the hash value or index generated by the hash function f(*) to find the data points similar to the target data, so the raw values need not be exchanged; in this sense, LSH protects the private information in the data. Second, distributed data points (\( data_1\), \( data_2\),..., \( data_m\)) can be centralized into a hash table through LSH, and then unified calculation can be carried out. Third, LSH can establish the hash table offline and reduce the search space, which improves both search efficiency and search scalability. Therefore, an extended LSH method is applied to the missing value prediction of traffic flow to achieve privacy protection and scalable traffic flow prediction in a distributed multi-sensor environment.
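The bucket lookup described above can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: it uses sign-of-dot-product hashing with random vectors, and the helper names `make_hash` and `build_buckets` as well as the toy points are hypothetical.

```python
import random
from collections import defaultdict


def make_hash(dim, n_bits, rng):
    """Return a hash made of n_bits random vectors (one sign bit per vector)."""
    planes = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def f(x):
        return tuple(1 if sum(a * b for a, b in zip(x, p)) > 0 else 0 for p in planes)

    return f


def build_buckets(points, f):
    """Map each hash value to the ids of the points that fall into that bucket."""
    buckets = defaultdict(list)
    for pid, x in points.items():
        buckets[f(x)].append(pid)
    return buckets


rng = random.Random(0)
points = {
    "data_1": [10.0, 20.0, 30.0],
    "data_2": [20.0, 40.0, 60.0],     # same direction as data_1
    "data_3": [-10.0, -20.0, -30.0],  # opposite direction
}
f = make_hash(dim=3, n_bits=4, rng=rng)
buckets = build_buckets(points, f)

query = [5.0, 10.0, 15.0]  # the target X; only f(X) is needed for the lookup
candidates = buckets.get(f(query), [])
print(candidates)  # only the points in the bucket of X are inspected
```

In this toy setting the search touches only the bucket of X rather than all m points, which is the reduction from m to \( m_k\) discussed above.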

\(ASMVP_{distr-LSH}\): traffic flow missing value prediction based on distributed LSH

In this section, we will introduce our \(ASMVP_{distr-LSH}\) method in detail. Generally speaking, our method mainly consists of four steps, as follows:

Step 1 (Establish date sub-indices offline):

Concretely, for each sensor \(TFS_k (1\le k \le SN)\), we choose a family of LSH functions \( F_k(*)\) to create a sub-index for each \( day \in Day \) offline based on the known traffic flow data collected by the sensor. Because the Pearson Correlation Coefficient (PCC) is often used to reflect the degree of linear correlation between two variables X and Y, in this paper we use the LSH function family corresponding to PCC to build the index. In addition, the selected LSH function family \( F_k(*)\) needs to satisfy the conditions in Eqs. (1) and (2) described in the previous section.

First, for a day, all its time periods \(\left\{ TP_{k,1},...,TP_{k,TN} \right\} \) are converted into a TN-dimensional vector \( \overrightarrow{day(k)} = (TP_{k,1,TD},...,TP_{k,TN,TD})\), where TD refers to the length of a time period, and missing data of the sensor in a time period are expressed as \( TP_{k,i,TD}=0\). Then, the LSH function \( f_k(*)\) applied to this vector is given by Eq. (3).

$$\begin{aligned} f_k(day) = \left\{ \begin{array}{ll} 1 \quad if \ \overrightarrow{day(k)} \bullet \overrightarrow{p} > 0\\ 0 \quad if \ \overrightarrow{day(k)} \bullet \overrightarrow{p} \le 0. \end{array}\right. \end{aligned}$$
(3)

Here, \( \overrightarrow{p}\) is a TN-dimensional vector \( (p_1,..,p_{TN})\) whose elements are random values in the interval [-1,1]; the sign \( \bullet \) denotes the dot product. Eq. (3) can be interpreted as follows: the vector \( \overrightarrow{p}\) defines a hyperplane that cuts the data space into two halves. Suppose that there are two vectors \( \overrightarrow{x_1}\) and \(\overrightarrow{x_2}\). If they lie on the same side of this hyperplane (i.e., both \( \overrightarrow{x_1} \bullet \overrightarrow{p} > 0 \) and \( \overrightarrow{x_2} \bullet \overrightarrow{p} > 0 \) hold, or both \( \overrightarrow{x_1} \bullet \overrightarrow{p} \le 0 \) and \( \overrightarrow{x_2} \bullet \overrightarrow{p} \le 0 \) hold), then \( \overrightarrow{x_1}\) and \( \overrightarrow{x_2}\) are likely to be considered similar.

Second, since the elements in vector \( \overrightarrow{p}\) are randomly generated from the data interval [-1,1], the above hashing and mapping process can be repeated \(SF_k\) times using different vectors \( \overrightarrow{p}\). Then, the sub-index (i.e., \( F_k(day) = (f_{k}^{1}(day),..., f_{k}^{SF_k}(day))\)) of the sensor in one day can be obtained, in which \( f_{k}^{j}(day)(1 \le j \le SF_k)\) is calculated by Eq. (3). In particular, the sub-index \( F_k(day)\) here is a 0–1 vector with \(SF_k\) dimensions.

In addition, we can use the following pseudocode to represent the above process (see Algorithm 1).

figure a
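As a complement to the pseudocode, Step 1 can be sketched in Python for a single sensor. This is a minimal sketch under assumptions: the function names `make_projections` and `sub_index`, the seed, and the toy flows are hypothetical, and the vectors \( \overrightarrow{p}\) are drawn uniformly from [-1,1] as described above.

```python
import random


def make_projections(tn, sf_k, seed):
    """Draw SF_k random vectors p, each with TN entries in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(tn)] for _ in range(sf_k)]


def sub_index(day_vector, projections):
    """Apply Eq. (3) once per projection vector: one 0-1 bit per vector."""
    bits = []
    for p in projections:
        dot = sum(v * w for v, w in zip(day_vector, p))
        bits.append(1 if dot > 0 else 0)
    return tuple(bits)


# Toy flows of one sensor: 3 days x 4 time periods; 0 marks a missing reading.
sensor_days = {
    "day_1": [6.0, 3.0, 0.0, 17.0],
    "day_2": [12.0, 6.0, 0.0, 34.0],  # day_1 scaled by 2
    "day_3": [1.0, 9.0, 4.0, 0.0],
}
projections = make_projections(tn=4, sf_k=4, seed=42)
sub_indices = {day: sub_index(vec, projections) for day, vec in sensor_days.items()}
for day, bits in sub_indices.items():
    print(day, bits)
```

Because scaling a vector by a positive constant does not change the sign of any dot product, `day_1` and `day_2` always receive the same sub-index here, whatever the random vectors are.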

Step 2 (Establish date index by amalgamating sub-indices offline):

In the previous step, based on the traffic data of different sensors, we obtained the SN sub-indices \(F_1(day),...,F_{SN}(day)\). In this step, we amalgamate the SN sub-indices offline into an integrated date index \( F(day) = (F_1(day),...,F_{SN}(day))\) with dimension \(\sum _{i=1}^{SN} SF_i\). Subsequently, for each \(day \in Day\), we repeat the above process until the mapping relationships of “\(day \rightarrow F(day)\)” are established. Next, we record the mapping relationships “\(day \rightarrow F(day)\)” in a pre-defined hash table FTab.

In addition, we can use the following pseudocode to represent the above process (see Algorithm 2).

figure b
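The merging in Step 2 can be illustrated with the short sketch below. It assumes that each platform contributes only its 0-1 sub-indices (not raw flows); the function names `merge_sub_indices` and `build_ftab` and the sample sub-indices are hypothetical.

```python
from collections import defaultdict


def merge_sub_indices(sub_indices_per_sensor, day):
    """Concatenate F_1(day), ..., F_SN(day) into the integrated index F(day)."""
    merged = ()
    for sensor_index in sub_indices_per_sensor:
        merged += sensor_index[day]
    return merged


def build_ftab(sub_indices_per_sensor, days):
    """Hash table FTab: integrated index F(day) -> days that share it."""
    ftab = defaultdict(list)
    for day in days:
        ftab[merge_sub_indices(sub_indices_per_sensor, day)].append(day)
    return dict(ftab)


# Hypothetical sub-indices supplied by two platforms (bits only, no raw flows).
sensor_1 = {"day_1": (0, 0, 1), "day_2": (0, 0, 1), "day_3": (1, 0, 1)}
sensor_2 = {"day_1": (1, 0), "day_2": (1, 0), "day_3": (1, 0)}

ftab = build_ftab([sensor_1, sensor_2], ["day_1", "day_2", "day_3"])
print(ftab)
# {(0, 0, 1, 1, 0): ['day_1', 'day_2'], (1, 0, 1, 1, 0): ['day_3']}
```

The length of each key equals the sum of the sub-index lengths (3 + 2 = 5 in this toy case).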

Step 3 (Find similar days of \( day_{target}\) online):

Following the operations of selecting the hash function families \(F_m(*)(1 \le m \le SN)\) to obtain the sub-indices in Step 1 and amalgamating the sub-indices in Step 2, we can compute the index \( F(day_{target})\) of \(day_{target}\) online. Next, we look up the bucket with the value \( F(day_{target})\) in the FTab built in Step 2. If such a bucket is found, each date contained in it is regarded as a similar day of \( day_{target}\) and put into a dataset named DS-Set. If no such bucket is found, we cannot simply conclude that \( day_{target}\) has no similar days, because LSH is a probabilistic technique: it does not guarantee that all similar days are found every time, i.e., some qualified results may be missed.

Therefore, we create T hash tables \( FTab_1,...,FTab_T\) by repeating Step 1 and Step 2, which relaxes the condition for the similar-day search. If the condition in Eq. (4) holds, we regard \( day_{target}\) as having similar days: the dates whose index in some hash table equals \( {F(day_{target})}_x\) are the similar days of \( day_{target}\), and we put them into DS-Set.

$$\begin{aligned} \begin{aligned}&\exists \ day \in Day \ \text{and} \ x \in \{1,...,T\}, \\&\quad \text{such that} \ F(day)_x = {F(day_{target})}_x \ \text{in} \ FTab_x. \end{aligned} \end{aligned}$$
(4)

In addition, we can use the following pseudo code to represent the above process (see Algorithm 3).

figure c
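A minimal sketch of the multi-table lookup in Step 3 follows. It is only an illustration of Eq. (4): the function name `find_similar_days`, the two toy tables, and the target indices are hypothetical.

```python
def find_similar_days(ftabs, target_indices, day_target):
    """Eq. (4): a day is similar if it shares the target's bucket in any table."""
    ds_set = set()
    for ftab, target_index in zip(ftabs, target_indices):
        for day in ftab.get(target_index, []):
            if day != day_target:
                ds_set.add(day)
    return ds_set


# Hypothetical T = 2 hash tables, each keyed by the integrated index.
ftab_1 = {(0, 1, 1): ["day_1", "day_2"], (1, 1, 0): ["day_3", "day_4"]}
ftab_2 = {(1, 0, 0): ["day_1", "day_3"], (0, 0, 1): ["day_2"], (1, 1, 1): ["day_4"]}

# Index of the target day (here day_1) in each of the two tables.
target_indices = [(0, 1, 1), (1, 0, 0)]

ds_set = find_similar_days([ftab_1, ftab_2], target_indices, "day_1")
print(sorted(ds_set))  # ['day_2', 'day_3']
```

Using several tables enlarges DS-Set: `day_3` is missed by the first table but recovered by the second, which is the relaxation motivated above.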

Step 4 (Top-K missing value prediction):

In the previous step, we obtained a similar date set (i.e., DS-Set) for \( day_{target}\). Next, we use DS-Set to predict the missing values in \( day_{target}\) (here, we can set a threshold for |DS-Set|). Specifically, we use Eq. (5) to predict the missing values of abnormal sensors in traffic over the time period TD.

$$\begin{aligned} TP.F_{target} = \frac{1}{|DS-Set|} *\sum _{day_j\in DS-Set} TP.F_j. \end{aligned}$$
(5)

Here, \(TP.F_j\) represents the traffic flow of the corresponding time period in the sensor TFS, which is included in the days similar to \( day_{target}\) (i.e., a day with abnormal sensor values). Finally, we rank all the time periods of the sensor according to the prediction results by Eq.(5), and take the traffic flow data corresponding to the first k time periods as the final prediction results.

In addition, we can use the following pseudocode to represent the above process (see Algorithm 4).

figure d
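The averaging in Eq. (5) can be sketched as follows. This is a minimal illustration under assumptions: missing readings are marked by 0 as in Step 1, the function names `predict_period` and `fill_missing` and the toy flows are hypothetical, and the ranking of time periods is omitted.

```python
def predict_period(flows, ds_set, period):
    """Eq. (5): mean flow of the similar days in the given time period."""
    if not ds_set:
        return None  # no similar day was found, so no prediction is made
    return sum(flows[day][period] for day in ds_set) / len(ds_set)


def fill_missing(flows, day_target, ds_set, missing=0.0):
    """Replace the missing readings of day_target with the Eq. (5) prediction."""
    filled = list(flows[day_target])
    for period, value in enumerate(filled):
        if value == missing:
            predicted = predict_period(flows, ds_set, period)
            if predicted is not None:
                filled[period] = predicted
    return filled


# Toy flows: 3 days x 4 periods; day_1 lacks the reading at period index 2.
flows = {
    "day_1": [6.0, 3.0, 0.0, 17.0],
    "day_2": [5.0, 2.0, 4.0, 15.0],
    "day_3": [7.0, 4.0, 6.0, 19.0],
}
print(fill_missing(flows, "day_1", ["day_2", "day_3"]))  # [6.0, 3.0, 5.0, 17.0]
```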

After the above four steps of the \(ASMVP_{distr-LSH}\) approach, we can accurately predict the missing values of abnormal sensors while preserving privacy.

Case study

To illustrate the feasibility of our method, in this section we use a case study to demonstrate the execution process of our proposed method. Suppose there are 2 different sensors that collect traffic flow data. If we adopted 60 min as a time period, each day would contain 24 time periods. For ease of understanding and calculation, however, the traffic flow data used here cover only 5 days, each of which is divided into 10 time periods (each period is equal to 2.4 h).

Step 1 (Establish date sub-indices offline):

In this section, we use 4 hash functions to form a family of hash functions (i.e., \( F_k(day) = (f_{k}^{1}(day),..., f_{k}^{4}(day))\)) for better illustration. Concretely, the hash function family is shown in (6).

$$\begin{aligned} \begin{aligned} F_{10\times 4}= \begin{bmatrix} -0.16595599&{}\quad -0.16161097&{}\quad 0.60148914&{}\quad -0.80330633 \\ 0.44064899&{} \quad 0.370439&{}\quad 0.93652315&{}\quad -0.15778475 \\ -0.99977125&{}\quad -0.5910955&{}\quad -0.37315164&{}\quad 0.91577906 \\ -0.39533485&{}\quad 0.75623487&{}\quad 0.38464523&{}\quad 0.06633057 \\ -0.70648822&{}\quad -0.94522481&{}\quad 0.7527783&{}\quad 0.38375423 \\ -0.81532281&{}\quad 0.34093502&{}\quad 0.78921333&{}\quad -0.36896874 \\ -0.62747958&{}\quad -0.1653904&{}\quad -0.82991158&{}\quad 0.37300186 \\ -0.30887855&{}\quad 0.11737966&{}\quad -0.92189043&{}\quad 0.66925134 \\ -0.20646505&{}\quad -0.71922612&{}\quad -0.66033916&{}\quad -0.96342345 \\ 0.07763347&{} \quad -0.60379702&{} \quad 0.75628501&{}\quad 0.50028863 \\ \end{bmatrix}. \end{aligned} \end{aligned}$$
(6)

First, according to Eq. (3), a dot product is computed between a hash function and the vector corresponding to a sensor. Then, the above process is repeated four times with different hash functions to get the sub-index of the sensor. Here, the sub-index of the sensor is a 0–1 vector. To facilitate the subsequent understanding and analysis, we transform the 0–1 vector of the sensor into the corresponding decimal number. The sub-indexes of the two sensors are shown in (7). Here, \( f_{k}(day)\) represents the sub-index of the k-th traffic sensor.

$$\begin{aligned} \begin{aligned} f_{1}(day)= \begin{bmatrix} 0&\quad 0&\quad 0&\quad 3&\quad 11 \end{bmatrix}, \\ f_{2}(day)= \begin{bmatrix} 1&\quad 1&\quad 0&\quad 1&\quad 1 \end{bmatrix}. \\ \end{aligned} \end{aligned}$$
(7)
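The conversion from a 0–1 sub-index to a decimal number can be sketched as below. The bit order is an assumption (most significant bit first), since the text does not state it, and the function name `bits_to_decimal` is hypothetical. With four hash functions the resulting values lie between 0 and 15, which is consistent with the entries in (7).

```python
def bits_to_decimal(bits):
    """Read a 0-1 sub-index as a binary number, most significant bit first."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value


print(bits_to_decimal((0, 0, 0, 0)))  # 0
print(bits_to_decimal((0, 0, 1, 1)))  # 3
print(bits_to_decimal((1, 0, 1, 1)))  # 11
```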

Step 2 (Establish date index by amalgamating sub-indices offline):

In Step 1, we have obtained the sub-indexes of the two traffic sensors. Next, we separately merge the sub-indexes of the two sensors, and the final merging results are shown in Eq. (8). At the same time, the combined index is sent to each sensor platform.

$$\begin{aligned} \begin{aligned} F_{1,2}(day)= \begin{bmatrix} 0&{}\quad 0&{}\quad 0&{}\quad 3&{}\quad 11 \\ 1&{}\quad 1&{}\quad 0&{}\quad 1&{}\quad 1 \end{bmatrix}. \\ \end{aligned} \end{aligned}$$
(8)

Step 3 (Find similar days of \( day_{target}\) online):

We repeat Step 1 and Step 2 four times to get four different hash tables. Here, the sub-indexes obtained by four different groups of hash function families are presented in (9) and (10), respectively, and the combined indexes of the four groups of sub-indexes are shown in (11). Here, \( f_{k}^{T}(day)\) represents the sub-index of the k-th sensor obtained through the T-th group of hash function family.

$$\begin{aligned} f_{1}^{1}(day)= & {} \begin{bmatrix} 0&\quad 0&\quad 0&\quad 3&\quad 11 \end{bmatrix}, \ f_{1}^{2}(day)= \begin{bmatrix} 4&\quad 4&\quad 4&\quad 1&\quad 1 \end{bmatrix}, \nonumber \\ f_{1}^{3}(day)= & {} \begin{bmatrix} 2&\quad 0&\quad 0&\quad 2&\quad 2 \end{bmatrix}, \ f_{1}^{4}(day)= \begin{bmatrix} 1&\quad 3&\quad 10&\quad 5&\quad 5 \end{bmatrix}, \nonumber \\\end{aligned}$$
(9)
$$\begin{aligned} f_{2}^{1}(day)= & {} \begin{bmatrix} 1&\quad 1&\quad 0&\quad 1&\quad 1 \end{bmatrix}, \ f_{2}^{2}(day)= \begin{bmatrix} 5&\quad 5&\quad 4&\quad 1&\quad 1 \end{bmatrix}, \nonumber \\ f_{2}^{3}(day)= & {} \begin{bmatrix} 0&\quad 0&\quad 0&\quad 0&\quad 6 \end{bmatrix}, \ f_{2}^{4}(day)= \begin{bmatrix} 1&\quad 1&\quad 10&\quad 1&\quad 5 \end{bmatrix}, \nonumber \\\end{aligned}$$
(10)
$$\begin{aligned} F_{1,2}^{1}(day)= & {} \begin{bmatrix} 0&{}\quad 0&{}\quad 0&{}\quad 3&{}\quad 11 \\ 1&{}\quad 1&{}\quad 0&{}\quad 1&{}\quad 1 \end{bmatrix}, \ F_{1,2}^{2}(day)= \begin{bmatrix} 4&{}\quad 4&{}\quad 4&{}\quad 1&{} \quad 1 \\ 5&{} \quad 5&{} \quad 4&{} \quad 1&{}\quad 1 \end{bmatrix}, \nonumber \\ F_{1,2}^{3}(day)= & {} \begin{bmatrix} 2&{}\quad 0&{} \quad 0&{}\quad 2&{}\quad 2 \\ 0&{} \quad 0&{}\quad 0&{}\quad 0&{}\quad 6 \end{bmatrix}, \ F_{1,2}^{4}(day)= \begin{bmatrix} 1&{} \quad 3&{}\quad 10&{}\quad 5&{}\quad 5 \\ 1&{}\quad 1&{} \quad 10&{}\quad 1&{}\quad 5 \end{bmatrix}.\nonumber \\ \end{aligned}$$
(11)

Next, according to Eq. (4) and the above index values in (11), we can get a similar date matrix, and the results are shown in (12). In the similarity matrix in (12), the number of rows represents the number of days of the first sensor, and the number of columns represents the number of days of the second sensor.

$$\begin{aligned} \begin{aligned} sim_{5\times 5}= \begin{bmatrix} 1&{}\quad 1&{}\quad 1&{} \quad 1&{} \quad 0 \\ 1&{}\quad 1&{}\quad 1&{} \quad 1&{}\quad 0 \\ 1&{}\quad 1&{}\quad 1&{}\quad 1&{}\quad 0 \\ 0&{}\quad 0&{}\quad 0&{} \quad 1&{}\quad 1 \\ 0&{}\quad 0&{}\quad 0&{} \quad 1&{}\quad 1 \\ \end{bmatrix}. \end{aligned} \end{aligned}$$
(12)

Step 4 (Top-K missing value prediction):

According to Eq. (5) and the similarity matrix obtained in Step 3 (i.e., the matrix in (12)), we can predict the missing values of the abnormal sensors. Specifically, we assume that the original data of the two abnormal sensors are as presented in (13), where a value of 0 denotes a missing reading. It can be seen from (13) that the data in the 3rd period of the 1st day of the first sensor are missing. According to (12), the 1st day is similar to the 2nd day and the 3rd day, so the missing value in the 1st day can be predicted to be 1 according to Eq. (5). By analogy, the complete data of the two sensors after prediction are shown in (14).

$$\begin{aligned}{} & {} FirSenMiss_{5\times 10}\nonumber \\ {}{} & {} = \begin{bmatrix} 6&{}\quad 3&{}\quad 0&{}\quad 0&{} \quad 1&{}\quad 17&{}\quad 81&{}\quad 121&{} \quad 87&{} \quad 83 \\ 1&{}\quad 2&{}\quad 1&{}\quad 3&{} \quad 1&{}\quad 8&{} \quad 51&{} \quad 112&{}\quad 87&{} \quad 83 \\ 3&{}\quad 1&{}\quad 0&{} \quad 1&{} \quad 4&{}\quad 12&{}\quad 92&{} \quad 0&{}\quad 111&{} \quad 0 \\ 8&{}\quad 0&{}\quad 2&{} \quad 3&{}\quad 8&{}\quad 0&{} \quad 112&{} \quad 173&{}\quad 0&{} \quad 117 \\ 0&{}\quad 3&{}\quad 4&{}\quad 4&{} \quad 4&{} \quad 34&{}\quad 0&{}\quad 157&{} \quad 0&{} \quad 119 \\ \end{bmatrix}, \nonumber \\{} & {} SecSenMiss_{5\times 10}\nonumber \\ {}{} & {} = \begin{bmatrix} 3&{}\quad 0&{}\quad 0&{} \quad 0&{}\quad 0&{}\quad 3&{} \quad 7&{}\quad 7&{}\quad 18&{}\quad 17 \\ 0&{}\quad 2&{} \quad 0&{} \quad 0&{} \quad 2&{}\quad 2&{}\quad 7&{}\quad 9&{} \quad 19&{} \quad 10 \\ 0&{}\quad 0&{} \quad 0&{}\quad 0&{}\quad 0&{}\quad 0&{}\quad 11&{}\quad 0&{}\quad 18&{}\quad 0 \\ 0&{}\quad 0&{} \quad 1&{} \quad 2&{} \quad 0&{}\quad 0&{} \quad 8&{} \quad 11&{} \quad 0&{} \quad 33.25 \\ 0&{}\quad 0&{}\quad 0&{}\quad 0&{}\quad 0&{} \quad 0&{}\quad 0&{}\quad 4&{} \quad 0&{}\quad 38 \\ \end{bmatrix}, \end{aligned}$$
(13)
$$\begin{aligned}{} & {} FirSenFore_{5\times 10}\nonumber \\ {}{} & {} = \begin{bmatrix} 6&{} \quad 3&{}\quad 1&{} \quad 2&{} \quad 1&{} \quad 17&{}\quad 81&{} \quad 121&{} \quad 87&{}\quad 83 \\ 1&{}\quad 2&{}\quad 1&{}\quad 3&{}\quad 1&{}\quad 8&{}\quad 51&{}\quad 112&{}\quad 87&{}\quad 83 \\ 3&{}\quad 1&{} \quad 1&{} \quad 1&{} \quad 4&{}\quad 12&{} \quad 92&{} \quad 116.5&{}\quad 111&{}\quad 83 \\ 8&{}\quad 3&{}\quad 2&{} \quad 3&{}\quad 8&{}\quad 34&{}\quad 112&{}\quad 173&{}\quad 0&{}\quad 117 \\ 8&{}\quad 3&{} \quad 4&{} \quad 4&{}\quad 4&{}\quad 34&{} \quad 112&{} \quad 157&{}\quad 0&{} \quad 119 \\ \end{bmatrix}, \nonumber \\{} & {} SecSenFore_{5\times 10}\nonumber \\ {}{} & {} = \begin{bmatrix} 3&{}\quad 2&{} \quad 0&{} \quad 0&{}\quad 2&{} \quad 3&{} \quad 7&{}\quad 7&{}\quad 18&{} \quad 17 \\ 3&{}\quad 2&{} \quad 0&{} \quad 0&{} \quad 2&{} \quad 2&{} \quad 7&{}\quad 9&{}\quad 19&{} \quad 10 \\ 3&{}\quad 2&{} \quad 0&{} \quad 0&{} \quad 2&{} \quad 2.5&{}\quad 11&{}\quad 8&{}\quad 18&{}\quad 13.5 \\ 0&{}\quad 0&{} \quad 1&{}\quad 2&{} \quad 0&{} \quad 0&{} \quad 8&{} \quad 11&{}\quad 0&{} \quad 33.25 \\ 0&{} \quad 0&{} \quad 1&{} \quad 2&{} \quad 0&{} \quad 0&{} \quad 8&{}\quad 4&{}\quad 0&{}\quad 38 \\ \end{bmatrix}.\nonumber \\ \end{aligned}$$
(14)
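The first row of \(FirSenFore\) in (14) can be reproduced from \(FirSenMiss\) in (13) with the short check below. It rests on an assumption that is not stated explicitly in the text: readings equal to 0 on the similar days are themselves treated as missing and left out of the average. The helper name `mean_of_observed` is hypothetical.

```python
def mean_of_observed(values, missing=0):
    """Average the readings that are not flagged as missing (0)."""
    observed = [v for v in values if v != missing]
    return sum(observed) / len(observed) if observed else missing


# Rows of FirSenMiss in (13) for days 1-3 (10 time periods each).
first_sensor = {
    1: [6, 3, 0, 0, 1, 17, 81, 121, 87, 83],
    2: [1, 2, 1, 3, 1, 8, 51, 112, 87, 83],
    3: [3, 1, 0, 1, 4, 12, 92, 0, 111, 0],
}
similar_to_day_1 = [2, 3]  # the 1st day is similar to the 2nd and 3rd days

filled_day_1 = [
    v if v != 0 else mean_of_observed([first_sensor[d][i] for d in similar_to_day_1])
    for i, v in enumerate(first_sensor[1])
]
print(filled_day_1)  # [6, 3, 1.0, 2.0, 1, 17, 81, 121, 87, 83]
```

Under this assumption, the 3rd period of day 1 is filled with 1 (only day 2 has a reading) and the 4th period with 2 (the mean of 3 and 1), matching the first row of \(FirSenFore\).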

Evaluation and further discussion

Next, we measure the performance of our proposed \(ASMVP_{distr-LSH}\) method and compare it with two existing methods: \(SerRec_{distr-LSH}\) [44] and \(Optimal-Pub\) [45]. The dataset used is WS-DREAM [48]. Experiments are run on a laptop with a 2.50 GHz CPU and 8.0 GB RAM, and each experiment is repeated 50 times.

Profile 1: Accuracy comparison

In this profile, we test and compare the prediction accuracy (RMSE, smaller is better) of our method with that of the other methods. Parameters are as follows: SN = 300, TN is varied from 1000 to 4000, threshold of \(|DS-Set|\) = 4. Experimental results are reported in Fig. 2. As the results indicate, the accuracy of \(ASMVP_{distr-LSH}\) is the highest (i.e., its RMSE is the smallest) among the three methods because \(ASMVP_{distr-LSH}\) can find the sensors most similar to the target sensor whose data are missing, based on the time-aware LSH technique. Therefore, \(ASMVP_{distr-LSH}\) can achieve a good prediction performance.
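For reference, RMSE is the square root of the mean squared difference between predicted and actual values. The following is a minimal sketch of this standard definition; the function name `rmse` and the sample numbers are illustrative only and are not taken from the experiments.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# One prediction is off by 2, so RMSE = sqrt(4 / 3).
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```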

Profile 2: Efficiency comparison

In this profile, we compare the prediction efficiency of our method with that of the other methods. Parameters are as follows: SN = 300, TN is varied from 1000 to 4000, threshold of \(|DS-Set|\) = 4. Experimental results are reported in Fig. 3. As Fig. 3 indicates, the efficiency of \(Optimal-Pub\) is the highest because it does not need to protect all the sensitive information of users when predicting missing values. For \(ASMVP_{distr-LSH}\) and \(SerRec_{distr-LSH}\), additional time is needed to secure user privacy during prediction; therefore, their time cost increases accordingly. Between these two, \(ASMVP_{distr-LSH}\) performs better than \(SerRec_{distr-LSH}\) because we only need to consider the similar time periods in \(ASMVP_{distr-LSH}\).

Fig. 2. RMSE comparison of different methods

Fig. 3. Time cost comparison of different methods

Profile 3: Performance with respect to the threshold of \(|DS-Set|\)

The threshold of \(|DS-Set|\) affects the prediction performance of \(ASMVP_{distr-LSH}\), so we next measure how accuracy and time cost vary with it. The parameters are as follows: SN = 300, TN = 4000, and the threshold of \(|DS-Set|\) is set to 2, 4, 6 and 8. The experimental results are reported in Figs. 4 and 5. As the results show, both the RMSE and the time cost of \(ASMVP_{distr-LSH}\) decline as the threshold grows. This can be explained as follows: a larger threshold yields fewer but more similar sensors, because similarity is judged on the monitored data at more time periods. Therefore, the prediction improves in accuracy and time cost simultaneously.

Fig. 4. RMSE of \(ASMVP_{distr-LSH}\) w.r.t. threshold

Fig. 5. Time cost of \(ASMVP_{distr-LSH}\) w.r.t. threshold

Profile 4: RMSE convergence of three methods

The RMSE convergence of the different methods is presented in Fig. 6. The parameters are as follows: SN = 300, TN varies from 1000 to 4000, and the threshold of \(|DS-Set|\) is 4. The results show that repeating the experiments 50 times is reasonable, since the RMSE of all three methods becomes approximately stable. This means that the convergence of our proposal is relatively satisfactory.
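One simple way to check this kind of stability is to track the running mean of the per-run RMSE and see how little it moves over the last repetitions. The sketch below assumes that reading of "convergence"; the per-run values are simulated noise around 1.0 and are not experimental results.

```python
import random


def running_mean(values):
    """Cumulative mean after each repetition."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


# Simulated per-run RMSE values (illustrative only, not real measurements).
rng = random.Random(0)
per_run_rmse = [1.0 + rng.uniform(-0.2, 0.2) for _ in range(50)]
curve = running_mean(per_run_rmse)

# Spread of the running mean over the last 10 runs; a small value suggests
# that additional repetitions would change the reported average very little.
late_drift = max(curve[-10:]) - min(curve[-10:])
print(f"running mean after 50 runs: {curve[-1]:.3f}, late drift: {late_drift:.4f}")
```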

Fig. 6. Convergence of RMSE of different methods

Next, we briefly analyze the limitations of our proposal and point out possible directions for improvement in future studies.

  (1) In our prediction method for missing traffic flow data caused by abnormal sensors, the LSH technique is employed to achieve privacy protection. Overall, our method can secure sensitive user information while predicting missing traffic values for abnormal sensors. However, it is still difficult to measure the degree of privacy preservation that the method provides. This is because LSH is essentially a hash-based technique, and its privacy-preservation effect cannot be measured directly and quantitatively.

  (2) Traffic flow data depend heavily on time because users’ everyday driving behavior shows an obvious time-varying fluctuation. Therefore, this paper takes the time factor into consideration when predicting traffic data. However, traffic flow is also related to other factors besides time, such as location, weather and climate. It is therefore beneficial to extend the current prediction method by incorporating more influencing factors. Such an extension would help create a more comprehensive prediction framework for missing traffic flow data, especially in complex city management.

  (3) In the current time-based traffic flow prediction method, each day is divided into 96 time periods of 15 min each. This segmentation is fixed and lacks flexibility. For example, during busy hours, traffic conditions change frequently, so a finer time division better depicts the traffic flow of the city. Conversely, during off-peak hours, traffic conditions seldom change, so a coarser time period is sufficient. Therefore, a flexible setting of the time period in time-aware traffic flow prediction would benefit both prediction accuracy and efficiency.

  (4) LSH is in essence a probabilistic similarity search technique; therefore, there is some uncertainty in traffic flow data prediction. In other words, the prediction performance may not be as good as expected, especially in terms of accuracy. In view of this limitation, we need to raise the prediction accuracy by improving the traditional LSH technique. One promising way is to use multiple hash functions instead of a single one when creating the traffic sensor indexes or time slot indexes with LSH. By repeating the LSH process multiple times, we can achieve a good tradeoff between prediction accuracy and efficiency.
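The multiple-hash-function idea in the last item can be sketched as follows. This is only an illustration under stated assumptions: the hash family (signs of random projections), the function names and the sample vectors are all hypothetical, since the hash function actually used by \(ASMVP_{distr-LSH}\) is not restated in this section. Each table uses an independent hash function, and an item becomes a candidate if it shares a bucket with the query in at least one table.

```python
import random


def make_hash(dim, bits, rng):
    """One LSH function: signs of `bits` random projections (assumed family)."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def h(vector):
        return tuple(
            1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
            for plane in planes
        )

    return h


def build_tables(vectors, num_tables, bits, seed=0):
    """Index every vector under `num_tables` independent hash functions."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    tables = []
    for _ in range(num_tables):
        h = make_hash(dim, bits, rng)
        buckets = {}
        for key, vec in vectors.items():
            buckets.setdefault(h(vec), set()).add(key)
        tables.append((h, buckets))
    return tables


def candidates(tables, query, exclude=None):
    """Union of the query's buckets across all tables."""
    found = set()
    for h, buckets in tables:
        found |= buckets.get(h(query), set())
    found.discard(exclude)
    return found


# Hypothetical traffic vectors keyed by date or sensor (illustrative only).
vectors = {
    "d1": [6.0, 3.0, 1.0, 2.0],
    "d2": [12.0, 6.0, 2.0, 4.0],     # same direction as d1
    "d3": [-6.0, -3.0, -1.0, -2.0],  # opposite direction to d1
    "d4": [1.0, 2.0, 1.0, 3.0],
}
single = candidates(build_tables(vectors, 1, 4), vectors["d1"], exclude="d1")
multi = candidates(build_tables(vectors, 3, 4), vectors["d1"], exclude="d1")
print(sorted(single), sorted(multi))
```

Adding tables can only enlarge the candidate set, so truly similar items are less likely to be missed, but more hashing and more candidates must be processed; this is the accuracy and efficiency tradeoff mentioned above.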

Conclusions and future work

Missing data from abnormal sensors is common in the traffic domain and poses a big challenge for accurate traffic flow prediction and traffic routine scheduling in smart city management. Motivated by this challenge, this paper presents a distributed, privacy-preserving method for predicting the missing traffic flow values of abnormal traffic sensors, i.e., \(ASMVP_{distr-LSH}\). First, our method integrates known traffic flow data from different sensors offline and sends these data to the edge server [49, 50]. Then, similar dates with close traffic conditions are selected based on the LSH technique. Finally, the Top-k dates and their traffic flow data are used to predict the missing traffic data of abnormal sensors on a given day. To verify the feasibility of \(ASMVP_{distr-LSH}\), we provide a case study that demonstrates the concrete process of privacy-preserving missing value prediction.

In the future, we will use a set of real traffic sensor data to test the performance of our method and compare it with other related methods. Compared with cloud AI, edge AI has lower latency, which can greatly improve the performance of traffic flow prediction [51,52,53]. Therefore, we will consider applying edge AI technology in future research. For abnormal sensors, it is still challenging to consider user privacy, prediction accuracy and scalability simultaneously [54, 55]. We will consider more complex traffic flow prediction scenarios in upcoming studies [56, 57]. We will also further study how missing values in traffic flow are generated and how our method compares with contemporary methods in handling them. In addition to missing values in traffic flow, we will also consider sensor-induced anomalous data through anomaly detection.