1 Introduction

Using telemetry stations is a cost-effective way to automatically collect hydrological data for monitoring water levels across a country in real time. However, several factors, such as environmental conditions, technological issues and human activities, can disrupt the operation of stations and consequently produce anomalous or missing values in the collected water level data. Although there are many methods for discovering anomalies and missing data, as discussed by Blázquez-García et al. (2020) and Yang et al. (2017), they often remove anomalies from a data series or replace them with constants. This is problematic because missing or inaccurate data can lead to erroneous analysis results. Thus, effective approaches for predicting missing values from the available data are needed.

Replacing a missing gap with values from its most similar subsequence is extensively used in many domains. Dynamic Time Warping (DTW) is a well-established technique of this kind and has been applied to a variety of problems. For example, Tormene et al. (2009) use DTW to discover the most similar incomplete time series from the stored reference of an arm movement sensor. Another example is the research by Caillault et al. (2020), which applied the derivative dynamic time warping (DDTW) developed by Keogh and Pazzani (2001) to search for subsequences similar to the one before a missing gap, and then to repair the gap using the subsequence that follows the most similar one. The disadvantage of dynamic time warping is that it is time-consuming. This is addressed by extracting sequence features in sliding windows using a shape-feature extraction algorithm (Caillault et al. 2016), and then calculating DDTW only if the correlation between the shape-features of the window and those of the subsequence before the missing gap is very high. The results demonstrate that their method produces superior outcomes when dealing with time series with a high correlation and strong seasonality.

Although DTW can find the most similar patterns with similar dynamics, it may warp the shape by expanding or compressing it, so the position of a missing gap may not coincide with its position in the original pattern, as illustrated in Fig. 1.

Fig. 1 An example of the most similar subsequence that has the same dynamic but has a different pattern to the original time series

Another way to search for the most similar subsequence is to find the two subsequences with the lowest Euclidean distance, taken as an indication of similarity. Since the distance must be calculated for every pair of subsequences in the time series, searching for matches in a large time series dataset can take a long time, which is considered a disadvantage of this method. To address this issue, Yeh et al. (2016) developed a technique called the Matrix Profile (MP) to speed up the process. An MP gives the distances between all subsequences and their nearest neighbours and can thus be used to efficiently extract patterns that characterise a time series, such as motifs and discords. Very similar subsequences in a time series are called motifs, whereas very different subsequences are called discords.
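To make the notion of a profile of nearest-neighbour distances concrete, the following is a minimal brute-force sketch. It is not the fast algorithm of Yeh et al. (2016): it uses plain Euclidean distance (the MP is usually defined on z-normalised subsequences), and the exclusion of overlapping windows and the toy series are assumptions made here for illustration.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def naive_matrix_profile(series, m):
    """Distance from each length-m window to its nearest non-overlapping
    neighbour. Quadratic in the number of windows, for illustration only."""
    n_windows = len(series) - m + 1
    windows = [series[i:i + m] for i in range(n_windows)]
    profile = []
    for i in range(n_windows):
        best = math.inf
        for j in range(n_windows):
            if abs(i - j) < m:  # assumed exclusion zone: skip overlapping windows
                continue
            best = min(best, euclidean(windows[i], windows[j]))
        profile.append(best)
    return profile

# Toy series: a repeating 0-1-2-1 pattern with one unusual spike.
series = [0, 1, 2, 1, 0, 1, 2, 1, 0, 5, 9, 5, 0, 1, 2, 1]
profile = naive_matrix_profile(series, m=4)

# Low profile values mark motifs, high values mark discords.
motif_index = min(range(len(profile)), key=profile.__getitem__)
discord_index = max(range(len(profile)), key=profile.__getitem__)
print(motif_index, discord_index)
```

In this toy series the repeated pattern has a profile value of zero, while the windows covering the spike have the largest values, which is the motif/discord distinction described above.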

This research aims to develop new techniques for imputing missing values or anomalies more accurately and efficiently. Specifically, it focuses on imputing the missing values in time series of telemetry surface water level data. Our basic idea is inspired by the fact that water levels usually vary with similar patterns over a year, so we can exploit this phenomenon by reproducing the pattern from the most similar subsequence in the historical data. As a result, we built an effective approach for imputing missing data based on simple operations for pattern searching and matching.

The remainder of the paper is organised as follows. Section 2 describes related work. Section 3 presents the characteristics of the data. Section 4 details the proposed methods. Section 5 describes the datasets, comparative methods, experimental settings and evaluation metrics. Results and discussion are presented in Section 6. Finally, the paper is concluded in Section 7.

2 Related Work

Time series imputation methods can be classified into two categories in terms of the variables used: univariate and multivariate. The first approach uses a single variable to impute missing values. The second estimates missing data by examining the relationship between several variables.

Several fundamental methods for filling missing data in univariate series, including the mean, median, last observation carried forward (LOCF), and interpolation techniques, are frequently used (Peugh and Enders 2004; Pratama et al. 2016; Osman et al. 2018). When only one or a few consecutive data points are missing, these methods provide acceptable results. However, when the missing gaps are large, they perform poorly.
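As a small illustration of why these simple methods struggle with large gaps, the sketch below applies mean imputation and LOCF to a toy series. The function names and the data are hypothetical, and missing values are represented by `None`.

```python
def impute_mean(series):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def impute_locf(series):
    """Last observation carried forward; a leading gap stays missing."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Toy water levels rising from 1.2 to 2.0 across a three-point gap.
levels = [1.0, 1.2, None, None, None, 2.0]
print(impute_mean(levels))  # one constant across the whole gap
print(impute_locf(levels))  # a flat step at the last observed level
```

Both methods fill the entire gap with a single constant, so any rise, fall or cycle inside a long gap is lost.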

Some studies have been conducted in the recent decade with the goal of imputing missing values in meteorological and hydrological data. For example, Yang et al. (2017) used the mean of nearby points to impute the missing data in an integrated dataset of water level and atmospheric data. Lai and Kuok (2019) suggested a method known as Bayesian Principal Component Analysis (BPCA) to impute missing values in rainfall data; the findings showed that BPCA outperformed KNN when dealing with large continuous missing gaps. Gao et al. (2018) and Pratama et al. (2016) reviewed statistical and machine learning methodologies applied to imputing anomalies and missing data.

With the significant contribution of machine learning (ML) in many domains, it has also been applied to imputation tasks. Kim et al. (2015) compared ML, artificial neural network, and physically based models for recovering streamflow data. The results revealed that the ML models were generally better than the others at capturing high flows. Dwivedi (2022) used random forest to impute continuous gaps and extremes in sub-hourly groundwater data. Li et al. (2022) improved the availability of IoT multivariate data and the ability of anomaly detection by applying imputation techniques to the detected anomalous data. Other studies that use an ML model to estimate missing values average the predictions from two directions (forward and backward) to obtain the final imputed values (Akouemo and Povinelli 2014; Bokde et al. 2018; Phan 2020). Two forecasting models for estimating missing values, based on a forecaster and a backcaster of the time series, were proposed by Moahmed et al. (2014).

In the last two decades, deep learning techniques have been widely applied to various problems in time series analysis, and several studies have exploited deep learning to impute missing data. Zhang and Thorburn (2021) proposed a dual-head sequence-to-sequence imputation model (Dual-SSIM) for water quality data imputation, which averages the predictions of gated recurrent units (GRU) that use the information before and after the missing gaps. The model imputed missing data more accurately than five benchmarks. Kulanuwat et al. (2021) compared three approaches for imputing missing telemetry water level data: linear interpolation, spline interpolation, and bidirectional long short-term memory (BiLSTM). Spline interpolation performed better on non-cyclical data, while BiLSTM beat the interpolation approaches on a particular tidal data pattern. However, one common disadvantage of deep learning neural networks is that they are very time-consuming, which makes them less practical in real-time applications such as water level analysis and flood forecasting.

In this paper, we propose a novel approach for computing the missing values in incomplete subsequences, called Full Subsequence Matching (FSM). Instead of splitting the incomplete subsequence into two subsequences, we replace the missing data with temporary constant values to produce a dummy complete subsequence. We then search for the subsequence most similar to this dummy complete subsequence. The missing data is then recreated by imitating the pattern of the most similar subsequence.

3 Water Level Data

Telemetry stations have been installed in various locations to monitor the changes of river water levels in Thailand. In general, three key factors affect the water level in a river, each producing different behaviour.

  • Tidal: The stations installed near the mouth of a river connecting to the sea show a strong periodic pattern due to tidal effects.

  • Irrigation: The water level at a station installed in a canal is affected by irrigation operations, because the canal was constructed to convey water from the main canal, which may be controlled by floodgates, in order to irrigate distant areas or to prevent flooding. As a result, the water level varies depending on the event and usually shows low fluctuation, no periodicity, and few changes. However, when a gate is being operated, i.e. closed or opened, the water level at the nearby station changes rapidly. The water level measured in the canal away from the floodgate, on the other hand, changes little.

  • Rain: These stations are installed in natural rivers far from the sea. When there is rainfall in the catchment area of a river, the water level can be affected in a variety of ways, in both upward and downward patterns.

In summary, these factors (rain, irrigation and tide) have their own patterns over seasons and days, so the water levels that reflect these patterns can be explored and used to impute missing values.

4 Proposed Methods

To achieve our objective of building an effective and efficient framework for imputing missing values in water level data, we propose a novel imputation approach, called Full Subsequence Matching (FSM), which works by finding the most similar subsequence. We compare our method with the traditional idea of splitting the subsequence and searching for the most similar subsequences of each part, which is known as Partial Subsequence Matching (PSM). Each method is described in detail below.

4.1 Full Subsequence Matching (FSM)

In a time series, the data surrounding a missing gap can contain valuable information related to the gap and hence should be used for imputation of the missing points. A key question is how these pieces of useful information can be extracted and utilised in an efficient and effective manner. In this research, rather than separating the subsequence into two parts, before and after the missing gap, we replace the missing gap with temporary constant values to construct a dummy full sequence. Then, for parity when searching the historical data, we set the values in each sliding window at the positions of the gap to the same constant used in the dummy full sequence. In this way, we can match the subsequences before and after a missing gap at the same time.

Our proposed FSM method consists of four main steps. They are explained below for a given time series \(X = \{x_{1},...,x_{N}\}\), where N is the length, i.e. the number of data points in the time series.

Step one - Identifying a missing gap: Firstly, we identify the first missing point \(x_{t}\) and the last missing point \(x_{t+T}\) of a missing gap of consecutive points, \([x_t, ..., x_{t+T}]\).

Step two - Extracting an extended subsequence: We then extract a subsequence I that contains the identified missing gap sandwiched between two subsequences of m and n consecutive data points on the left side and the right side of the gap, respectively. This extended subsequence can be represented as:

$$\begin{aligned} I =\{x_{t-m}, ...x_{t-1}, [x_{t},..., x_{t+T}], x_{t+T+1}, ..., x_{t+T+n}\} \end{aligned}$$
(1)

We then assign a constant value c to every missing value in I as follows:

$$\begin{aligned} I = \{x_{t-m}, ..., x_{t-1}, [c,..., c], x_{t+T+1}, ..., x_{t+T+n}\}. \end{aligned}$$
(2)

Step three - Matching: This step searches X for the subsequence that best matches I, using a sliding window technique. Let \(W = \{ w_{1}, w_{2}, ..., w_i\}\) denote the set of sliding-window subsequences, where \(w_{i}\) is the run of z consecutive values of X starting at position i, with z equal to the length of I so that the two can be compared point by point. We then compute the Euclidean distance between I and each subsequence in W. However, because the missing values in I have been replaced with constants, before computing the distance we must replace the values of each sliding-window subsequence at the positions of the missing values in I with the same constant c. The most similar subsequence, denoted by S, is the one with the shortest distance, as shown in equation 3.

$$\begin{aligned} S = \underset{w_i \in W}{\arg\min}\ d(I, w_i) \end{aligned}$$
(3)
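Steps two and three can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes 0-based indices, a gap given by its start index and length, a separate complete `history` series to search, and the constant c = 0. All function and variable names are hypothetical.

```python
import math

def build_query(series, gap_start, gap_len, m, n, c=0.0):
    """Take m points before and n points after the gap and fill the gap
    itself with the constant c, giving the dummy full sequence of Eq. (2)."""
    left = series[gap_start - m:gap_start]
    right = series[gap_start + gap_len:gap_start + gap_len + n]
    return left + [c] * gap_len + right

def find_most_similar(query, history, m, gap_len, c=0.0):
    """Slide a window over a complete history and return (start, distance)
    of the window closest to the query after masking the gap positions."""
    z = len(query)
    best_start, best_dist = None, math.inf
    for start in range(len(history) - z + 1):
        window = list(history[start:start + z])
        window[m:m + gap_len] = [c] * gap_len  # same positions, same constant
        dist = math.sqrt(sum((q - w) ** 2 for q, w in zip(query, window)))
        if dist < best_dist:  # strict comparison keeps the earliest best match
            best_start, best_dist = start, dist
    return best_start, best_dist

history = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 2, 1, 0]
observed = [5, 1.1, 2.1, None, None, 1.1, 0.1, 7]  # two missing points
query = build_query(observed, gap_start=3, gap_len=2, m=2, n=2)
start, dist = find_most_similar(query, history, m=2, gap_len=2)
match = history[start:start + len(query)]  # the unmasked window S
print(start, round(dist, 3), match)
```

Because the query and every window hold the same constant at the gap positions, those positions contribute zero to the distance, so the choice of c does not affect which window is selected.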

Step four - Imputation: We developed two different techniques to impute missing values: difference imputation and scaling imputation, as shown in Algorithm 1. They are explained further below.

  1. Difference Imputation (\(FSM_{D}\)): If we know the difference between every two consecutive values in a sequence, we can recreate the original series even if some values are missing. We calculate the difference between each pair of consecutive values in S, starting with the first pair at the position of the missing data in I. We then add this difference to the last value before the missing gap in I to obtain the first missing value. The difference between each following pair of values in S is then computed and added to the latest imputed value in I. We repeat this until all missing values have been imputed.

  2. Scaling Imputation (\(FSM_{S}\)): The scale of the query subsequence and the scale of the most similar subsequence should be the same or almost the same. Hence, we can adjust the scale of the most similar subsequence to the scale of the query subsequence to regenerate the values in the missing gap.

figure a
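The two imputation variants can be sketched as below, under stated assumptions. The difference variant follows the textual description directly. For the scaling variant, the exact rescaling rule is not spelled out in the text, so the min-max mapping used here is only an assumed reading; the function names and toy data are hypothetical.

```python
def difference_impute(query, match, m, gap_len):
    """FSM_D as read from the text: add the match's step-to-step changes to
    the last known value before the gap (requires m >= 1)."""
    filled = list(query)
    for k in range(m, m + gap_len):
        filled[k] = filled[k - 1] + (match[k] - match[k - 1])
    return filled

def scaling_impute(query, match, m, gap_len):
    """FSM_S under an ASSUMED min-max rescaling: map the match's value range
    outside the gap onto the query's range, then copy the rescaled gap."""
    known = [k for k in range(len(query)) if not m <= k < m + gap_len]
    q_vals = [query[k] for k in known]
    s_vals = [match[k] for k in known]
    s_range = max(s_vals) - min(s_vals)
    # Fall back to a pure shift when the match is flat outside the gap.
    ratio = (max(q_vals) - min(q_vals)) / s_range if s_range else 1.0
    filled = list(query)
    for k in range(m, m + gap_len):
        filled[k] = min(q_vals) + (match[k] - min(s_vals)) * ratio
    return filled

# Query with a two-point gap (indices 2-3) holding the placeholder c = 0,
# and the most similar complete subsequence found in the history.
query = [1.1, 2.1, 0.0, 0.0, 1.1, 0.1]
match = [1, 2, 3, 2, 1, 0]
by_difference = difference_impute(query, match, m=2, gap_len=2)
by_scaling = scaling_impute(query, match, m=2, gap_len=2)
print(by_difference)
print(by_scaling)
```

In this toy case the query is the match shifted by 0.1, so both variants recover a peak of about 3.1 followed by about 2.1.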

4.2 Partial Subsequence Matching (PSM)

The basic idea behind the Partial Subsequence Matching method is that instead of using full subsequences for search and matching, only partial subsequences are used, which could speed up the process. The PSM is explained in detail as follows:

Step one - Identifying a missing gap: This step is the same as that of the FSM, i.e. finding the start and end indices of the missing gap in X.

Step two - Dividing: We then extract a subsequence of m points from the left side (L) and a subsequence of n points from the right side (R) of the missing gap in X. That is,

$$\begin{aligned} L = \{x_{t-m},..., x_{t-1}\} \end{aligned}$$
(4)

and

$$\begin{aligned} R = \{x_{t+T+1},..., x_{t+T+n}\} \end{aligned}$$
(5)

Step three - Matching: We then search for the subsequences most similar to L and R, denoted by \(S_{L}\) and \(S_{R}\), respectively. This is done by computing the Euclidean distance of L and of R to each subsequence in the sliding windows W. The most similar subsequences can be represented by

$$\begin{aligned} S_{L} = \underset{w_i \in W}{\arg\min}\ d(L, w_i) \end{aligned}$$
(6)

and

$$\begin{aligned} S_{R} = \underset{w_i \in W}{\arg\min}\ d(R, w_i) \end{aligned}$$
(7)

Step four - Imputation: We use \(S_{L}\) to generate the forward subsequence and \(S_{R}\) to generate the backward subsequence, then combine the generated subsequences to impute the missing values. Four different techniques have been developed for this purpose, as represented in Algorithm 2:

  1. Average Imputation (\(PSM_{A}\)): We extract the subsequence immediately to the right of \(S_{L}\) and the subsequence immediately to the left of \(S_{R}\), each with the same length as the missing gap. We then calculate the differences between consecutive values in each and combine them by averaging before using them to impute the missing values.

  2. Forward Imputation (\(PSM_{F}\)): Instead of averaging the differences from both sides, we use only the differences calculated from the subsequence on the right side of \(S_{L}\) to impute the missing values.

  3. Backward Imputation (\(PSM_{B}\)): We use only the differences calculated from the subsequence on the left side of \(S_{R}\) to impute the missing values.

  4. Weighted Imputation (\(PSM_{W}\)): The basic idea is that the values closest to the missing gap have more effect than the values farther away. We assign higher weights to closer points and decrease the weight as the time interval grows. The missing values are then imputed by multiplying the differences between consecutive values by their weights.

figure b
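The four PSM variants can be illustrated with the simplified sketch below. It reconstructs the gap once forward from the left and once backward from the right, then combines the two reconstructions pointwise. Combining reconstructed values rather than differences, and the linear weights, are assumptions made here for illustration; all names and data are hypothetical.

```python
def forward_fill(last_known, segment):
    """PSM_F-style reconstruction. `segment` is the final point of S_L
    followed by the gap-length values after it; its step-to-step changes
    are accumulated forward from the last value before the gap."""
    out, value = [], last_known
    for prev, cur in zip(segment, segment[1:]):
        value += cur - prev
        out.append(value)
    return out

def backward_fill(first_known, segment):
    """PSM_B-style reconstruction. `segment` is the gap-length values
    before S_R followed by the first point of S_R; changes are undone
    backward from the first value after the gap."""
    out, value = [], first_known
    for prev, cur in zip(reversed(segment[:-1]), reversed(segment[1:])):
        value -= cur - prev
        out.append(value)
    return out[::-1]  # restore chronological order

def combine(forward, backward, weighted=False):
    """Pointwise mean, or an ASSUMED linear weighting that trusts each
    reconstruction more near the side of the gap it started from."""
    size = len(forward)
    result = []
    for k, (f, b) in enumerate(zip(forward, backward)):
        w = (size - k) / (size + 1) if weighted else 0.5
        result.append(w * f + (1 - w) * b)
    return result

# Gap of three points; 2.0 is the last value before it, 3.0 the first after.
forward = forward_fill(2.0, [1.0, 1.5, 2.5, 2.0])
backward = backward_fill(3.0, [1.0, 2.0, 1.5, 2.0])
average = combine(forward, backward)
weighted = combine(forward, backward, weighted=True)
print(forward, backward, average, weighted)
```

With the weighting shown, the first imputed point leans on the forward reconstruction and the last on the backward one, which reflects the idea that values closest to each edge of the gap should have the most effect.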

5 Experimental Set-up

5.1 Dataset

HII’s telemetry sites in Thailand provided us with 10-minute time series water level data between 2012 and 2018. A full year contains 52,560 data points. Missing values and abnormalities are commonly observed in the raw data as a result of faulty sensors or unexpected events. As a result, a preliminary data pre-processing step is unavoidable prior to any further analysis.

We chose two years of data (2015 and 2016) from six representative stations (CPY012, CPY015, CPY016, CPY017, CHM003, and CHR004) to test the accuracy and generalisation of our proposed methods when dealing with different data behaviours. The CPY012 and CPY015 stations represent data with tidal effects that have strong periodic patterns, as depicted in Fig. 2. The data from CPY016 and CPY017 fluctuate with few upward and downward movements as a result of irrigation operations in the canal where the telemetry stations are installed, as shown in Fig. 3. Additionally, the data from CHM003 and CHR004 show fluctuations and many upward and downward patterns as a result of the rain effect, as shown in Fig. 4.

Fig. 2 Water level data with tidal influences

Fig. 3 Water level data with irrigation influences

Fig. 4 Water level data with rain influences

5.2 Missing Data Generation

We were unable to examine the accuracy of the imputation algorithms on genuinely missing data because the true values were not available. However, we can simulate missing data on complete data in order to evaluate the performance of the imputation approaches. To produce datasets with missing data, we delete consecutive values from the dataset under the assumption that gaps occur at random.

To simulate various missing data situations, we generated missing gaps of sizes 6, 12, 18, 36, 72, and 144 (1 hour, 2 hours, 3 hours, 6 hours, 12 hours, and 1 day), and set the length of the subsequences before and after the missing gap equal to the size of the gap. For instance, if the missing gap is six in length, the lengths of the subsequences before and after the missing gap are also six.
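The gap generation described above can be sketched as follows. This is a minimal illustration assuming a complete series held in a list, a uniformly random gap position and `None` as the missing marker; the function name and the synthetic year of data are hypothetical.

```python
import random

POINTS_PER_HOUR = 6                  # one reading every 10 minutes
GAP_SIZES = [6, 12, 18, 36, 72, 144]  # 1 hour up to 1 day

def remove_random_gap(series, gap_len, rng):
    """Blank out gap_len consecutive points at a random position that leaves
    gap_len points of context on each side.
    Returns (damaged series, gap start index, removed true values)."""
    start = rng.randrange(gap_len, len(series) - 2 * gap_len + 1)
    damaged = list(series)
    truth = damaged[start:start + gap_len]
    damaged[start:start + gap_len] = [None] * gap_len
    return damaged, start, truth

# A synthetic year of 10-minute readings: 6 * 24 * 365 = 52,560 points.
year = [float(i % 144) for i in range(POINTS_PER_HOUR * 24 * 365)]
rng = random.Random(0)  # fixed seed so the simulated gap is reproducible
damaged, start, truth = remove_random_gap(year, gap_len=36, rng=rng)
print(len(year), start, damaged.count(None))
```

Keeping the removed values in `truth` is what makes it possible to score an imputation method afterwards, since the genuine values of real gaps are unknown.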

5.3 Comparative Imputation Methods

We chose some well-known representative imputation methods to compare with our methods: linear and polynomial interpolation, k-Nearest Neighbours (k-NN), MissForest (MF), and a deep learning method, Long Short-Term Memory (LSTM).

  • Interpolation: When a time series has a gap, the available data on each side of the gap, or at a few particular locations within the gap, can be used with an interpolation method to estimate the values in the gap. Two methods were chosen to impute the gaps in water level data:

    (a) Linear Interpolation (Li-inter): It fits a straight line between the two points adjacent to a missing gap to estimate the missing values. This method is simple and fast and hence is often used as a baseline in imputation.

    (b) Polynomial Interpolation (Poly-Deg2): It is an enhanced interpolation approach that attempts to find the optimal polynomial function to match the data. It can be used to estimate data in the form of a curve, and thus should be suitable for water level data.

  • k-Nearest Neighbour (k-NN): It works on the assumption that neighbouring data points belong to the same class. In other words, a new data point is more likely to have the same class label as its k nearest neighbours than as distant data points (Peterson 2009). k-NN identifies the neighbouring points through a measure of distance, and the missing values can be estimated using the complete values of the neighbouring observations.

  • MissForest (MF): It is another machine learning-based data imputation technique, created by Stekhoven and Bühlmann (2012) and based on the Random Forest (RF) algorithm. It can be divided into three main steps. Firstly, the missing values are replaced with the mean (for continuous variables) or the most frequent class (for categorical variables). Secondly, the observed observations serve as the training set and the missing observations serve as the prediction set; both are fed into an RF model, and the model’s predictions are put in place of the prediction set, creating a transformed dataset. Finally, one imputation loop is complete when all variables with missing values have been imputed, and the imputation loop is repeated.

  • Long Short-Term Memory (LSTM): LSTM is an artificial recurrent neural network (RNN) architecture that has been utilised for a number of purposes, including the petroleum industry (Sagheer and Kotb 2019), handwriting recognition (Nogra et al. 2019), anomaly detection (Maleki et al. 2021), and data imputation (Yuan et al. 2018). A typical LSTM is made up of four units: (1) a cell that remembers values across arbitrary time intervals, (2) an input gate, (3) an output gate, and (4) a forget gate. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. There may be lags of unknown duration between important events in a time series. As a result, LSTM is well-suited to categorising, analysing, and forecasting time series data, and is considered a state-of-the-art method. We utilised two LSTM models (\(LSTM_{F}\) and \(LSTM_{B}\)) with the same architecture for forecasting and backcasting the missing gaps. For backcasting, we reverse the subsequence after the missing data and feed it into the \(LSTM_{B}\) model. Furthermore, we created two new imputations based on the results of both \(LSTM_{F}\) and \(LSTM_{B}\). The first is \(LSTM_{A}\), which takes the average of the outputs of both models as the final output. The second is \(LSTM_{W}\), which weights the outputs of \(LSTM_{F}\) and \(LSTM_{B}\) using the same notion of weighted imputation as in \(PSM_{W}\), assigning the greatest weight to the data closest to the current value.
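As a point of reference for the simplest baseline in the list above, linear interpolation across a single gap can be sketched in a few lines. This is a plain-Python illustration with hypothetical names and data, not the library routine used in the experiments.

```python
def linear_interpolate_gap(series, gap_start, gap_len):
    """Fill series[gap_start : gap_start + gap_len] with points on the
    straight line joining the value just before the gap and the value
    just after it."""
    left = series[gap_start - 1]
    right = series[gap_start + gap_len]
    step = (right - left) / (gap_len + 1)
    filled = list(series)
    for k in range(gap_len):
        filled[gap_start + k] = left + step * (k + 1)
    return filled

levels = [1.0, 2.0, None, None, None, 4.0]
print(linear_interpolate_gap(levels, gap_start=2, gap_len=3))
```

Only the two points adjacent to the gap influence the result, which is why this baseline works well for short gaps but cannot reproduce a cycle that falls inside a long one.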

5.4 Experimental Setting

The datasets are divided according to the needs of each method, and we run each method 500 times and average the results. Each dataset is divided into a training/searching part and a testing/removing part. The training data is used to fit the neural network models and to search for the most similar subsequence, while the testing data is used to generate the missing subsequences and assess model performance. We used data from 2015 for training and data from 2016 for testing.

For the interpolation techniques, we used the interpolate method of pandas.DataFrame (Footnote 1) in Python, which fills missing values using a chosen interpolation method. We specified a linear approach for linear interpolation and a polynomial method with an order of 2 for polynomial interpolation.

For the k-NN models, we used the grid search technique to find the best number k of nearest neighbours, ranging from 2 to the number of members in the query subsequence. Since MF models require multivariate data, we chose the simplest way to convert our data from univariate to multivariate, dividing the query sequence into three subsequences to use as input for the MF models.

For partial subsequence matching, we used the matrix profile Python library called STUMPY (Law 2019), which is a powerful and scalable library. To train the LSTM models, we used one LSTM layer followed by one dense hidden layer, 30 training epochs, a mini-batch size of 128, and early stopping with a patience of 5 to limit training time and over-fitting. In practice, we tried different numbers of neurons per layer (64, 128, and 256) and found that 128 neurons per layer gave the best results. The input of the LSTM is the subsequence before and after the missing values, which is used to predict the missing values.

All the experiments were coded with Python Programming Language (V3.6) and TensorFlow 2.8, and run on a personal computer with an Intel Core i5-7500 CPU @ 3.4 GHz, 32 GB RAM, 64-Bit Operating System.

5.5 Evaluation Metrics

The error or accuracy of an imputation method is measured with three metrics, root mean square error (RMSE), mean absolute error (MAE), and similarity (Sim), which are defined as follows:

  • RMSE: The Root Mean Square Error (RMSE) is the square root of the average squared difference between the imputed values \(\hat{y}\) and the respective genuine values y. This metric is very useful for determining overall correctness. The technique with the lowest RMSE is the most accurate.

    $$\begin{aligned} RMSE(\hat{y}, y) = \sqrt{\frac{1}{T}\sum _{i=1}^{T}(\hat{y_{i}} - y_{i})^2} \end{aligned}$$
    (8)

    where T is the number of missing values.

  • MAE: The Mean Absolute Error is computed as the average of the absolute differences between the imputed values \(\hat{y}\) and the actual values y:

    $$\begin{aligned} MAE(\hat{y}, y) = \frac{1}{T}\sum _{i=1}^{T}|\hat{y_{i}}-y_{i} |\end{aligned}$$
    (9)

    The method that is more effective will have a lower MAE.

  • Similarity: Sim(\(\hat{y}, y\)) measures how similar the imputed values (\(\hat{y}\)) are to the actual data (y). It is calculated by:

    $$\begin{aligned} Sim(\hat{y}, y) = \frac{1}{T}\sum _{i=1}^{T}\frac{1}{1+\frac{|\hat{y_{i}}-y_{i} |}{max(y)-min(y)}} \end{aligned}$$
    (10)

    A higher similarity indicates a more accurate imputation of missing values.
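The three metrics translate directly into code. The sketch below follows Eqs. (8) to (10) for one imputed gap; the function names and the toy values are hypothetical, and the similarity function assumes the actual values are not all identical, since the range max(y) - min(y) appears in a denominator.

```python
import math

def rmse(imputed, actual):
    """Root mean square error over the T imputed points, Eq. (8)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(imputed, actual)) / len(actual))

def mae(imputed, actual):
    """Mean absolute error over the T imputed points, Eq. (9)."""
    return sum(abs(p - a) for p, a in zip(imputed, actual)) / len(actual)

def similarity(imputed, actual):
    """Similarity in (0, 1], Eq. (10). Each error is scaled by the range
    of the actual values, which must be non-zero."""
    value_range = max(actual) - min(actual)
    terms = (1 / (1 + abs(p - a) / value_range) for p, a in zip(imputed, actual))
    return sum(terms) / len(actual)

actual = [1.0, 2.0, 3.0]
imputed = [1.0, 2.5, 2.0]
print(rmse(imputed, actual), mae(imputed, actual), similarity(imputed, actual))
```

A perfect imputation gives RMSE = 0, MAE = 0 and Sim = 1, so lower is better for the first two metrics and higher is better for the third.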

6 Results and Discussion

6.1 Results

The incomplete subsequence matching methods described in Section 4 allow our model to produce imputation results of varying lengths. In addition, the number of available data points around the missing values can be adjusted. As a result, FSM is built to deal with data gaps of arbitrary size in time series. We randomly removed subsequences from the telemetry water level data 500 times, and the results are described below.

Table 1 The average imputation performance indexes of 14 methods on telemetry water level data with tidal influence (the best score for each row in each gap size is shown in bold)

We first consider the tidal-influenced data, whose recurrent upward and downward trends are noticeable and change frequently with a similar magnitude. The average imputation performance of each method is given in Table 1. As expected, when the size of the missing gap is small, e.g., 6, linear interpolation (Li-Inter) achieves the best performance for RMSE, MAE, and Sim with 0.0180, 0.0158, and 0.8525, respectively. However, its performance degrades steadily when dealing with gaps bigger than 6. The same holds for polynomial interpolation (Poly-Deg2), which performed well when imputing small gaps but poorly as the gap increased. The MissForest (MF) technique produced the poorest results for gap sizes below 144, particularly for gap size 36, with 0.6674 (RMSE), 0.5987 (MAE), and 0.5991 (Sim), but it performed best when filling missing gaps of size 144, with 0.1253 (RMSE), 0.1014 (MAE), and 0.9331 (Sim). Our proposed solution, \(FSM_{S}\), outperforms all others when imputing missing gaps of sizes 12, 18, and 36. \(LSTM_{W}\) performed best on gap size 72, with RMSE, MAE, and Sim of 0.1949, 0.1678, and 0.8564, respectively. When the missing gaps were imputed at size 144, MF beat other models with the lowest RMSE of 0.0383, the lowest MAE of 0.0322, and the highest Sim of 0.9331.

We also plotted the average imputation performance of the 14 methods on telemetry water level data with tidal features, as illustrated in Fig. 5. As we can see, the performance of the interpolation methods, Li-Inter and Poly-Deg2, decreased as the size of the missing gap increased. When filling data gaps of sizes 12, 18, and 36, our suggested method, \(FSM_{S}\), is clearly better than all others. The performance of k-NN improves once the gap size is increased to 144. While MF performed the poorest when imputing missing data with a gap size of 72 or less, it became the best when the gap size was increased to 144. Although the PSM techniques perform poorly when the missing gap is small, their performance improves when the gap size is equal to or greater than 72. It is interesting to note that the performance of the LSTM approaches improves as the gap size, and with it the amount of input data, increases.

Fig. 5 The performance for imputing telemetry water level data with tidal influence for various missing gap size

Fig. 6 A critical difference diagram for 14 different imputation techniques on tidal influence datasets of telemetry water level data

Figure 6 shows the critical difference diagram comparing the imputation models. The number associated with each algorithm is its average rank on this type of dataset, and a solid bar groups models with no significant difference between them. For the data with tidal effects, \(FSM_{S}\) achieved the top rank, followed by \(FSM_{D}\), \(LSTM_{W}\), and Poly-Deg2, respectively. MF not only had the lowest rank but also differed significantly from the FSM-based techniques.

Table 2 shows the imputation results for the irrigation-affected data. Li-Inter is not only the best imputation model for missing gaps of size 6, but also performs well for larger missing gaps. k-NN, on the other hand, performed the poorest at every missing gap size, with a score difference from Li-Inter of roughly 0.08 in every performance metric.

Table 2 The average imputation performance indexes of 14 methods on telemetry water level data with irrigation influence (the best score for each row in each gap size is shown in bold)

As seen in Fig. 7, the performance of all approaches fell progressively as the size of the missing gap increased, with the exception of the similarity score of the LSTM models, which tended to improve as the gap size increased.

Fig. 7 The performance for imputing telemetry water level data with irrigation influence for various missing gap size

Fig. 8 A critical difference diagram for 14 different imputation techniques on irrigation influence datasets of telemetry water level data

The CD diagram in Fig. 8 shows that Li-Inter took first place, followed by our suggested technique (\(FSM_{S}\)) and MF, respectively, while the group of LSTM models performed the worst. The PSM and FSM models work well and offer a considerable improvement over the LSTM-based imputation approaches.

For the rain-influenced data, Li-Inter outperformed the other methods across all gap sizes and evaluation metrics, with the exception of gap size 6, where its Sim score was slightly lower than that of MF, by around 0.0259, as shown in Table 3. Moreover, the line charts in Fig. 9 present the performance for imputing telemetry water level data with rain influence for various missing gap sizes. As we can see, the set of LSTM models scores the lowest on all evaluation metrics. Li-Inter and Poly-Deg2 tend to perform worse as the number of missing values grows, while the PSM and FSM strategies maintain a consistent level of performance (Fig. 10).

Table 3 The average imputation performance indexes of 14 methods on telemetry water level data with rain influence (the best score for each row in each gap size is shown in bold)

Fig. 9

The performance for imputing telemetry water level data with rain influence for various missing gap sizes

Fig. 10

A critical difference diagram for 14 different imputation techniques on rain influence datasets of telemetry water level data

When imputing missing data in the rain-affected water level series, Li-Inter, FSMS, and MF maintained their top rankings, whereas k-NN dropped to the bottom of the list. LSTM-based techniques again ranked low and differed significantly from Li-Inter and FSMS.

6.2 Discussion

Li-Inter and Poly-Deg2 produced the best performance on data with non-cyclical and periodic patterns, such as data with rain and irrigation effects. Moreover, for small missing gap sizes Li-Inter outperformed the other methods across all data behaviours, which is expected.

When imputing missing values in water level data with tidal influence, our FSMS approach outperformed the others. This is mostly because some cyclical patterns of tidal influence recur in water level data over time, and our approaches are capable of finding and matching the most similar past pattern to impute the missing data more accurately. However, very short missing gaps still look approximately linear, so it is not surprising that Li-Inter performed better on them.
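To illustrate why a short gap is well served by a straight line, the following minimal sketch fills a gap by linear interpolation between the two observed neighbours. The function name `linear_fill`, its parameters, and the sample values are hypothetical and chosen only for illustration; they are not taken from our experimental code.

```python
def linear_fill(series, gap_start, gap_len):
    """Fill series[gap_start:gap_start + gap_len] on a straight line
    between the observed values just before and just after the gap."""
    left = series[gap_start - 1]          # last observed value before the gap
    right = series[gap_start + gap_len]   # first observed value after the gap
    step = (right - left) / (gap_len + 1)
    filled = list(series)
    for k in range(gap_len):
        filled[gap_start + k] = left + step * (k + 1)
    return filled


# Hypothetical water levels (metres); None marks a missing reading.
levels = [1.0, 2.0, None, None, None, 6.0]
restored = linear_fill(levels, gap_start=2, gap_len=3)
```

The estimate depends only on the two boundary values, so it cannot reproduce a peak or trough that falls inside a long gap, which is consistent with the decline of Li-Inter as the gap grows.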

The LSTM model used here was trained on the input subsequence and its reversed copy, which preserves both past and future information of a specific time frame. With this advantage, LSTM models are able to capture the context better, and thus in principle they should be suitable for imputing missing data in telemetry water level series. However, they did not produce the best performance. According to our experiments on data with different behaviours, LSTM models estimated strongly fluctuating data incorrectly, for example data with rain effects. On the other hand, they are capable of imputing missing values in data with tidal and irrigation effects, which are periodic and do not change trend frequently.

Since MissForest (MF) does not use data for training, it performed poorly when dealing with short subsequences but produced excellent outcomes for missing data with large gap sizes. In other words, MF performed better when sufficient data were available. However, MF models need multivariate data, which is not always available in this application, and transforming univariate data into multivariate data can introduce noise or misrepresentations. Splitting the sequence into many subsequences is the simplest technique to generate multivariate data, but it raises the challenge of determining the optimal number of subsequence splits.
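The splitting idea can be sketched as follows, under the assumption that each subsequence becomes one row and each time offset within a subsequence becomes one feature column. The helper `split_into_rows` and the parameter `n_splits` are hypothetical names, and dropping the incomplete tail is only one possible way to handle a length that does not divide evenly.

```python
def split_into_rows(series, n_splits):
    """Reshape a univariate series into n_splits rows of equal length.

    Trailing values that do not fill a complete row are dropped.
    """
    if n_splits <= 0:
        raise ValueError("n_splits must be positive")
    row_len = len(series) // n_splits
    if row_len == 0:
        raise ValueError("series is shorter than n_splits")
    return [series[i * row_len:(i + 1) * row_len] for i in range(n_splits)]


# Hypothetical water levels (metres); None marks a missing reading.
levels = [1.0, 1.2, 1.4, 1.1, None, 1.5, 1.0, 1.2, 1.3, 1.1]
rows = split_into_rows(levels, n_splits=3)
```

Changing `n_splits` changes both the number of rows and the number of features, so the choice directly affects how much information a multivariate imputer can draw on for each missing cell.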

7 Conclusion

This paper introduced two sequence matching methods, Full and Partial, for searching for the most similar subsequence and filling the missing values in telemetry water level data. They were tested with real-world water level data collected from six water level monitoring stations, and their results were compared with a range of existing methods, including some commonly used methods and the latest so-called state-of-the-art deep learning method, Long Short-Term Memory (LSTM). The results showed that our new methods, particularly the Full Sequence Matching with scaling imputation technique (FSMS), are better than all of them.

The FSM approach uses the Euclidean distance to search for the subsequence most similar to the query subsequence, and then calculates the missing data values based on the pattern of that subsequence. Rather than dividing the query into two subsequences, we replaced the missing data with constant values and searched with it as a single subsequence. The proposed methods were evaluated using missing data simulated on six water level time series with three distinct data behaviours. The results indicate that FSM with scaling imputation, FSMS, outperforms other imputation methods when dealing with large missing gap sizes.
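The search described above can be sketched roughly as follows. This is a simplified illustration rather than the implementation evaluated in this paper, and it rests on several assumptions: the query window consists of `ctx` observed points on each side of the gap, the same constant `fill` is placed at the gap positions of both the query and every candidate so that those positions do not affect the distance, only windows that end before the query window are searched, and the final level adjustment is a plain mean offset standing in for the scaling step. The name `fsm_impute` and all parameter names are hypothetical.

```python
import math


def fsm_impute(series, gap_start, gap_len, ctx, fill=0.0):
    """Fill series[gap_start:gap_start + gap_len] from the closest past window."""
    win_start = gap_start - ctx
    win_len = ctx + gap_len + ctx
    gap = range(ctx, ctx + gap_len)  # gap positions inside a window

    def masked(window):
        # Assumption: the same constant is used at the gap positions of any window.
        return [fill if i in gap else v for i, v in enumerate(window)]

    query = masked(series[win_start:win_start + win_len])

    best_dist, best_start = math.inf, None
    # Only candidates that end before the query window begins are considered.
    for s in range(0, win_start - win_len + 1):
        dist = math.dist(query, masked(series[s:s + win_len]))
        if dist < best_dist:
            best_dist, best_start = dist, s
    if best_start is None:
        raise ValueError("not enough history before the gap")

    match = series[best_start:best_start + win_len]
    # Assumed level adjustment: shift the match by the mean context difference.
    ctx_idx = [i for i in range(win_len) if i not in gap]
    offset = sum(query[i] - match[i] for i in ctx_idx) / len(ctx_idx)
    filled = list(series)
    for i in gap:
        filled[win_start + i] = match[i] + offset
    return filled


# Hypothetical cyclical series whose last cycle sits 0.5 higher; None is the gap.
series = [0, 1, 2, 1, 0, 1, 2, 1, 0.5, 1.5, None, 1.5]
filled = fsm_impute(series, gap_start=10, gap_len=1, ctx=1)
```

In this toy series the closest past window is the earlier rise and fall around a peak, and the offset lifts the copied peak to the level of the current cycle. The sketch assumes the history contains no other missing values.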

FSMS performs well on data that have strong periodic and cyclical patterns, such as water level data with tidal effects, whereas the linear interpolation approach works well with data that fluctuate and have a number of upward and downward trends, such as water level data with rain and irrigation effects. LSTM and MF models show increased performance as the missing gap size increases. However, for large datasets, LSTM is computationally costly and time-consuming. Although we may train the model on long periods of historical data using high performance computing (HPC) to shorten training time, we have no way of knowing when the model needs to be retrained, which is a significant downside of this approach. MF, in turn, needs univariate data to be transformed into multivariate data, and it is difficult to find an appropriate transformation technique.

For further work, it would be useful to explore reconstructing the missing data from a different station for more robust data imputation, since stations that are installed close to each other on the same canal or river are likely to show similar patterns. Another direction is to extend the approaches to deal with multidimensional information.