ABC of order dependencies

Abstract

Band order dependencies (ODs) enhance constraint-based data quality by modeling the semantics of attributes that are monotonically related with small variations without an intrinsic violation of semantics. The class of approximate band conditional ODs (abcODs) generalizes band ODs to make them more relevant to real-world applications by relaxing them to hold approximately with some exceptions (abODs) and conditionally on subsets of the data. We study the automatic discovery of abcODs to reduce the burden on human experts. First, we propose an algorithm to discover abODs that is more efficient than recent prior work; it is based on a new optimization that computes a longest monotonic band via dynamic programming and decreases the runtime from \(O(n^2)\) to \(O(n \log n)\). We then devise a dynamic programming algorithm for abcOD discovery that determines the optimal solution in polynomial time. To improve performance without losing optimality, we adapt the algorithm to cheaply identify consecutive tuples that are guaranteed to belong to the same band. For generality, we extend our algorithms to discover bidirectional abcODs. Finally, we perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets.

Notes

  1. www.discogs.com.

  2. www.classicdriver.com.

  3. https://data.world/sanfrancisco/chfu-j7tc.

  4. https://data.world/dot/airline-on-time-performance-statistics.

  5. Note that the simpler problem of band conditional OD discovery (without considering approximation) can be solved in linear time, \(O(n)\), by scanning a sequence of tuples and splitting it into contiguous segments whenever an anomalous tuple appears.

  6. We only vary the band-width here to evaluate the effect of the parameter variations on the F1-measure.

Corresponding author

Correspondence to Jaroslaw Szlichta.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix: proofs

Theorem 9

(Extending MBs) Given a sequence of tuples \(T=\{t_1, \ldots , t_n\}\), a band-width \(\varDelta \) and a list of attributes \(\mathbf{Y}\), let \(\mathsf{MB}_{k, i}\) denote a MB with best tuple \(s_{k, i}\) among all MBs of length k in a prefix T[i]. If \(s_{k, i} \preceq _{\varDelta , \mathbf{Y} \mathord {\uparrow }} t_{i+1}\), then there are two candidates for \(\mathsf{MB}_{k+1, i+1}\): \(\mathsf{MB}_{k+1, i}\) with the maximal tuple \(s_{k+1, i}\), and a new band \(\mathsf{MB}_{k, i} \cup \{t_{i+1}\}\) with the maximal tuple \(\max _{\mathbf{Y}}(s_{k, i}, t_{i+1})\).

  1. If \(s_{k+1, i}\) is not the minimal tuple among the tuples \(\{s_{k, i}, s_{k+1, i}, t_{i+1}\}\), then \(\mathsf{MB}_{k+1, i+1} = \mathsf{MB}_{k, i} \cup \{t_{i+1}\}\) and \(s_{k+1, i+1} = \max _{\mathbf{Y}}(s_{k, i}, t_{i+1})\).

  2. Else \(\mathsf{MB}_{k+1, i+1} = \mathsf{MB}_{k+1, i}\) and \(s_{k+1, i+1} = s_{k+1, i}\).

Proof

Consider first Theorem 1, Case 1. As \(d(t_{i+1}.\mathbf{Y}, s_{k, i}.\mathbf{Y}) \le \varDelta \), tuple \(\max _{\mathbf{Y}}\{s_{k, i}, t_{i+1}\}\) is the maximal tuple of a new MB with length \(k+1\): \(\mathsf{MB}_{k, i} \cup \{t_{i+1}\}\). Also, \(\min _{\mathbf{Y}}\{s_{k+1, i}, \max _{\mathbf{Y}}\{s_{k, i}, t_{i+1}\}\} = \max _{\mathbf{Y}}\{s_{k, i}, t_{i+1}\}\); thus, \(\max _{\mathbf{Y}}\{s_{k, i}, t_{i+1}\}\) is the best tuple among MBs with length \(k+1\) in \(T[i+1]\). Case 2 follows analogously. \(\square \)

Lemma 5

(Computing Best Tuples) Let \(s_{0, i} \preceq _{\varDelta , \mathbf{Y} \mathord {\uparrow }} t_j\) and \(t_j \preceq _{\varDelta , \mathbf{Y} \mathord {\uparrow }} s_{k, 0}\) for \(i \in [0, n]\) and \(j, k \in [1, n]\). The best tuple \(s_{k+1, i+1}\) of a MB with length \(k+1\) in a prefix \(T[i+1]\) satisfies the following recurrence, where \(u = \min _{\mathbf{Y}}(s_{k+1, i}, \max _{\mathbf{Y}}(t_{i+1}, s_{k, i}))\).

$$\begin{aligned} s_{k+1, i+1} = \left\{ \begin{array}{ll} u &{} \text{if } s_{k, i} \preceq _{\varDelta , \mathbf{Y} \mathord {\uparrow }} t_{i+1} \\ s_{k+1, i} &{} \text{otherwise} \end{array} \right. \end{aligned}$$
(12)

Proof

According to Theorem 1, there are two candidates for the best tuple \(s_{k+1, i+1}\) in a prefix \(T[i+1]\).

  1. If \(s_{k, i} \preceq _{\varDelta , \mathbf{Y} \mathord {\uparrow }} t_{i+1}\), then \(t_{i+1}\) can extend \(\mathsf{MB}_{k, i}\) by one tuple, and the maximal tuple in the new \(\mathsf{MB}\) is \(\max _{\mathbf{Y}}(t_{i+1}, s_{k, i})\). Given Def. 7, the best tuple \(s_{k+1, i+1}\) can be updated to \(\min _{\mathbf{Y}}(s_{k+1, i}, \max _{\mathbf{Y}}(t_{i+1}, s_{k, i}))\).

  2. Else the best tuple \(s_{k+1, i+1}\) remains the same as \(s_{k+1, i}\).
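
The recurrence can be sketched in a few lines of Python. This is only an illustration under assumptions that are not stated in this appendix: \(\mathbf{Y}\) is a single numeric attribute, and \(s \preceq _{\varDelta , \mathbf{Y} \mathord {\uparrow }} t\) is read as "the value of t is at least the value of s minus \(\varDelta \)". The function name `lmb_length` is hypothetical. The sketch is the direct quadratic evaluation of the recurrence, not the \(O(n \log n)\) Algorithm 1.

```python
import math

def lmb_length(ys, delta):
    """Length of a longest (increasing) monotonic band in ys.

    Assumption: a value t may follow a band whose maximal value is s
    when t >= s - delta (a scalar reading of the band order).
    """
    n = len(ys)
    # best[k] plays the role of s_{k,i}: the smallest maximal value over
    # all bands of length k in the prefix processed so far.
    # Boundary values: s_{0,i} = -inf, s_{k,0} = +inf for k >= 1.
    best = [-math.inf] + [math.inf] * n
    for t in ys:
        # Visit k from high to low so best[k] still refers to prefix T[i]
        # when best[k + 1] is updated for prefix T[i + 1].
        for k in range(n - 1, -1, -1):
            if best[k] - delta <= t:  # s_{k,i} precedes t_{i+1} within the band
                best[k + 1] = min(best[k + 1], max(t, best[k]))
    return max(k for k in range(n + 1) if best[k] < math.inf)

print(lmb_length([1, 2, 10, 3, 4], delta=1))
```

Keeping only one row of best tuples is enough here because each entry for prefix \(T[i+1]\) depends only on entries for prefix \(T[i]\).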

Lemma 6

(Monotonic Lengths of Best Tuples) For each \(i \in [0, n]\), the best tuples in T[i] are monotonically ordered: \(\forall _{k_1, k_2 \in [0, n], k_1 < k_2} s_{k_1, i} \preceq _\mathbf{Y \mathord {\uparrow }} s_{k_2, i}\).

Proof

In Theorem 1, Case 1: \(s_{k, i} = \max _{\mathbf{Y}}(s_{k-1, i-1}, t_{i})\); hence, \(s_{k-1, i-1} \preceq _{\mathbf{Y} \mathord {\uparrow }} s_{k, i}\). In Theorem 1, Case 2, given \(s_{k, i} = s_{k, i-1}\), it is known that \(s_{k-1, i-1} \preceq _{\mathbf{Y} \mathord {\uparrow }} s_{k, i}\); hence, \(s_{k-1, i-1} \preceq _{\mathbf{Y} \mathord {\uparrow }} s_{k, i} = s_{k, i-1}\), i.e., \(\forall _{k_1, k_2 \in [0, n], k_1 < k_2}\ s_{k_1, i} \preceq _{\mathbf{Y} \mathord {\uparrow }} s_{k_2, i}\). \(\square \)

Theorem 10

(Correctness of Calculating LMB) Algorithm 1 correctly computes a LMB in the sequence of tuples T of size n in \(O(n \log n)\) time and \(O(n)\) space.

Proof

To find a LMB in the sequence T, best tuples are used. Since tuple \(B_\mathsf{inc}[k_1]\) is updated by \(\max (s_{k_1-1}.\mathbf{Y}, t_i.\mathbf{Y})\), where \(t_i \prec _{\mathbf{Y} \mathord {\uparrow }} s_{k_1}\), as in Algorithm 1, the corresponding band \(\mathsf{MB}_{k_1, i}\) is a MB with the best tuple in prefix \(T[i]\). In addition, it has the shortest length, as \(k_1\) is the smallest index in \(B_\mathsf{inc}\). Similarly, \(\mathsf{MB}_{k_2, i}\) is an MB of the longest length among MBs with the best tuple that ends at \(t_i\) in the prefix \(T[i]\). For each tuple \(t_i \in T\), the lengths of MBs with the smallest maximal tuples that end at \(t_i\) fall into the range \([k_1, k_2]\).

Given that \(P_\mathsf{inc}\) is an array of size n that stores the lengths of the shortest and longest MBs with best tuples ending at \(t_i\) for each \(i \in \{1, \ldots , n \}\), i.e., \(P_\mathsf{inc}[i][0]=k_1\) and \(P_\mathsf{inc}[i][1] = k_2\), the length of a LMB in \(T[i]\) is the maximal value, \(P_\mathsf{inc}[i][1]=k_2\), in array \(P_\mathsf{inc}[i]\).

For each tuple in a sequence T of size n, it takes \(O(\log n)\) time to update arrays \(B_\mathsf{inc}\) and \(P_\mathsf{inc}\), since they are maintained sorted. Thus, Algorithm 1 takes \(O(n\log n)\) time to find a \(\mathsf{LMB}\) in T. For each tuple \(t_i\), two values are inserted into array \(P_\mathsf{inc}\). Thus, Algorithm 1 takes \(O(n)\) space. \(\square \)
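
To illustrate where the \(\log n\) factor per tuple comes from, the following sketch shows the classical binary-search algorithm for the special case \(\varDelta = 0\) over a single numeric attribute, that is, a longest non-decreasing subsequence. It is not Algorithm 1, which also handles \(\varDelta > 0\) and tracks the range \([k_1, k_2]\); the names `longest_nondecreasing_length` and `tails` are hypothetical. The list `tails` is a simplified stand-in for \(B_\mathsf{inc}\): entry k holds the smallest final value of any non-decreasing subsequence of length \(k+1\), and it stays sorted, as Lemma 6 states for best tuples.

```python
from bisect import bisect_right

def longest_nondecreasing_length(ys):
    """Length of a longest non-decreasing subsequence (band-width 0)."""
    tails = []  # sorted; tails[k] = smallest last value over length k + 1
    for y in ys:
        # Binary search for the first entry strictly greater than y.
        pos = bisect_right(tails, y)
        if pos == len(tails):
            tails.append(y)   # y extends the longest subsequence found so far
        else:
            tails[pos] = y    # y gives a smaller last value for length pos + 1
    return len(tails)

print(longest_nondecreasing_length([1, 2, 10, 3, 4]))
```

Each tuple costs one binary search over a sorted list of at most n entries, which gives \(O(n \log n)\) time and \(O(n)\) space overall for this special case.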

Theorem 11

(abOD Discovery) The abOD discovery problem is solvable by finding a longest monotonic band with an anomaly ratio \(e(\varphi ) = |\{s \in T : s \notin \mathsf{LMB}\}| \,/\, |\{s \in T\}|\).

Proof

Based on Definition 5 of a longest monotonic band, the minimal set of tuples that violate a band OD \(\mathbf{X} \mapsto _{\varDelta } \overline{\mathbf{Y }}\) are inconsistent tuples s, such that \(s \in T\) and \(s \notin \mathsf{LMB}\). \(\square \)
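
As a small worked illustration (the values and the scalar reading of the band condition are assumptions made here, not data from the article), suppose the tuples of T have \(\mathbf{Y}\)-values \((1, 2, 10, 3, 4)\), \(\varDelta = 1\), and a tuple may join an increasing band when its value is at least the band's current maximum minus \(\varDelta \). Then a LMB consists of the tuples with values \((1, 2, 3, 4)\), the only inconsistent tuple is the one with value 10, and

$$\begin{aligned} e(\varphi ) = \frac{|\{s \in T : s \notin \mathsf{LMB}\}|}{|\{s \in T\}|} = \frac{1}{5} = 0.2 \end{aligned}$$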

Theorem 12

(abOD Discovery Complexity) The abOD discovery problem can be solved in \(O(n \log n)\) time in the number of tuples.

Proof

According to Theorem 3, the abOD discovery problem is solvable by finding a LMB with an anomaly ratio \(|s \notin {\mathsf{LMB}}|\) / \(|s \in {\mathsf{T}}|\); therefore, its time complexity is equivalent to that of calculating LMB, which is \(O(n \log n)\) according to Theorem 2. \(\square \)

Theorem 13

(Optimal Substructure Property) Let \(\mathsf{OPT}(j)\) denote an optimal solution to the abcOD discovery problem in T[j], and let T[i, j] denote the subsequence \(\{t_i, \ldots , t_j\}\). The optimal solution \(\mathsf{OPT}(j), j \in \{ 1, \ldots , n \}\) in prefix T[j] contains optimal solutions to the subproblems in prefixes \(T[1], T[2], \ldots , T[j-1]\).

$$\begin{aligned} \mathsf{OPT}(j) = \left\{ \begin{array}{ll} 0 &{} j = 0 \\ \max \limits _{i \in \{0, \ldots , j-1 \},\ e(T[i+1, j]) < \epsilon } \{ \mathsf{OPT}(i) + g(T[i+1, j]) \} &{} j > 0 \end{array} \right. \end{aligned}$$
(13)

Proof

For a prefix T[i] with optimal solution \(\mathsf{OPT}(i)\), where \(i \in [0, j-1]\) (T[0] is empty and \(\mathsf{OPT}(0) = 0\)), consider forcing \(T[i+1, j]\) to form a single series with the gain \(g(T[i+1, j])\). Since \(\mathsf{OPT}(i)\) is the optimal solution in T[i], among all segmentations of T[j] in which \(T[i+1, j]\) forms a single segment, no segmentation has a greater gain than \(\mathsf{OPT}(i)+g(T[i+1,j])\). Hence, the optimal solution \(\mathsf{OPT}(j)\) is chosen as the maximal value among \(\mathsf{OPT}(i)+g(T[i+1,j])\), where \(i \in [0, j-1]\). \(\square \)
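
The recurrence for \(\mathsf{OPT}(j)\) can be sketched as a generic dynamic program. The gain g and the anomaly ratio e are passed in as callables because their definitions are not restated in this appendix; `segment_optimally`, `toy_gain` and `toy_error` are hypothetical names, and the toy functions are stand-ins chosen only to make the example runnable (a segment counts as error-free when it is non-decreasing, and its gain is its squared length). In the article's setting, each call would instead involve an LMB computation.

```python
import math

def segment_optimally(n, gain, error, eps):
    """Maximise total gain over segmentations of tuples 1..n.

    gain(a, b) and error(a, b) score the segment T[a, b] (1-based,
    inclusive); a segment is admissible when error(a, b) < eps.
    """
    opt = [0.0] + [-math.inf] * n   # opt[j] = OPT(j), with OPT(0) = 0
    cut = [0] * (n + 1)             # cut[j] = the i that achieves OPT(j)
    for j in range(1, n + 1):
        for i in range(j):
            if opt[i] == -math.inf or not error(i + 1, j) < eps:
                continue
            candidate = opt[i] + gain(i + 1, j)
            if candidate > opt[j]:
                opt[j], cut[j] = candidate, i
    if opt[n] == -math.inf:
        return opt[n], []           # no admissible segmentation exists
    segments, j = [], n
    while j > 0:                    # walk the stored cut points backwards
        segments.append((cut[j] + 1, j))
        j = cut[j]
    return opt[n], segments[::-1]

# Toy stand-ins for g and e (assumptions for illustration only).
ys = [1, 2, 3, 1, 2]

def toy_error(a, b):
    seg = ys[a - 1:b]
    return 0.0 if seg == sorted(seg) else 1.0

def toy_gain(a, b):
    return (b - a + 1) ** 2

best, segs = segment_optimally(len(ys), toy_gain, toy_error, eps=0.5)
print(best, segs)
```

With these toy functions the sequence is split where the values drop, giving the segments T[1, 3] and T[4, 5].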

Theorem 14

(abcOD Discovery) Algorithm 2 solves the abcOD discovery problem by finding optimal series in \(O(n^3\log n)\) time in a sequence of tuples T of size n.

Proof

Algorithm 2 applies dynamic programming to solve Equation 10, which is proved in Theorem 5. The recurrence in Equation 10 specifies that the optimal solution in prefix \(T[j]\) is selected among j alternative options: (1) a singleton segment consisting of \(t_j\), and the optimal solution in the prefix \(T[j-1]\); (2) a segment of length 2 consisting of \(\{ t_j, t_{j-1} \}\), and the optimal solution in prefix \(T[j-2]\), and so on; and finally, a segment of length j consisting of all tuples in the prefix \(T[j]\). It requires \(O(n)\) iterations to process the prefix T[j], where each iteration takes \(O(n\log n)\) time based on Theorem 2. In total, there are n tuples in the sequence T, i.e., \(j \in [1, n]\); thus, Algorithm 2 solves the abcOD discovery problem in \(O(n^3\log n)\) time. \(\square \)
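
The count in this proof can also be written as a sum. For each prefix T[j], j candidate last segments are examined, and each requires one LMB computation, which costs \(O(n \log n)\) because no segment is longer than n:

$$\begin{aligned} \sum _{j=1}^{n} j \cdot O(n \log n) = \frac{n(n+1)}{2} \cdot O(n \log n) = O(n^3 \log n) \end{aligned}$$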

Lemma 7

(Complexity of Computing Pieces) Algorithm 3 takes \(O(n)\) time to compute pieces in a sequence of tuples T of size n.

Proof

Algorithm 3 scans the tuples in the sequence T. For each tuple \(t_j, j \in [1, n]\), it checks whether \(t_j\) can extend any maximal \(\mathsf{IP}\) (\(\mathsf{DP}\)) ending at \(t_{j-1}\) in the prefix \(T[j-1]\). Since there are at most \(\varDelta \) such \(\mathsf{IP}\)s (\(\mathsf{DP}\)s) in \(T[j-1]\), it takes constant time to process each tuple. In total, Algorithm 3 takes \(O(n)\) time. \(\square \)
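
Algorithm 3 is not reproduced in this appendix. As a simpler illustration of a single linear scan of the same flavour, the sketch below follows the splitting idea described in the Notes for band conditional ODs without approximation: scan the sequence once and start a new contiguous segment whenever a tuple falls outside the current band. It assumes a single numeric attribute, the increasing direction only, and a scalar band condition (a value must be at least the running maximum minus \(\varDelta \)); the name `split_into_bands` is hypothetical.

```python
def split_into_bands(ys, delta):
    """Split ys into maximal contiguous increasing bands in one pass.

    Returns 0-based, inclusive (start, end) index pairs.
    """
    segments = []
    start = 0
    running_max = None
    for idx, y in enumerate(ys):
        if running_max is not None and y < running_max - delta:
            # y breaks the current band: close it and start a new one at y.
            segments.append((start, idx - 1))
            start = idx
            running_max = y
        else:
            running_max = y if running_max is None else max(running_max, y)
    if ys:
        segments.append((start, len(ys) - 1))
    return segments

print(split_into_bands([1, 2, 3, 1, 2, 0], delta=1))
```

Each tuple is compared against one running value, so the scan does constant work per tuple and \(O(n)\) work in total.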

Theorem 15

(Pieces Based abcOD Discovery) Algorithm 4 finds an optimal solution to the abcOD discovery problem in \(O(m^2 n \log n)\) time, where m is the number of pieces in T, and n is the number of tuples in T.

Proof

We first prove that Algorithm 4 finds the optimal solution in the prefix \(T[i]\), which ends at the piece \(P_i=\{t_{i-m+1}, \ldots , t_i\}\) of length m.

The last tuple \(t_i\) in the prefix \(T[i]\) cannot be an anomaly of a series in the optimal solution of \(T[i]\); otherwise, we can always find a better solution where \(t_i\) is a singleton series, i.e., \(\mathsf{OPT}(i-1) + 1\) according to Eq. 10. On the other hand, since every tuple in a piece \(P_i=\{t_{i-m+1}, \ldots , t_i\}\) belongs to the same sets of pre-pieces, there are no anomalies that violate LMB in \(P_i\); that is, \(g(T[i-m+1, i]) = m^2\) and \(g(T[i-k+1, i]) = k^2\), where \(0 \le k \le m-1\). Therefore, tuples in piece \(P_i=\{t_{i-m+1}, \ldots , t_i\}\) belong to the same series in the solution found by Algorithm 4.

Suppose that Algorithm 4 does not find the optimal solution in the prefix \(T[i]\), i.e., there exists a tuple \(t_{i-k} \in P_i, 0 \le k \le m-1\) in the optimal solution that splits \(P_i\) into two series: \(\{t_{i-m+1}, \ldots , t_{i-k}\}\) and \(\{t_{i-k+1}, \ldots , t_i\}\), where \(\mathsf{OPT}(i) = \mathsf{OPT}(i-k) + k^2\). We next prove that this assumption does not hold, i.e., \(\mathsf{OPT}(i) - \mathsf{OPT}(i-k) \ge k^2 \).

Consider the tuple \(t_{i-j+1}\) that is the first in the last series \(S_{i-m}\) of \(\mathsf{OPT}(i-m)\), where the length of a LMB in series \(S_{i-m}\) is l, \(j \ge m + 1, l >0\); and the maximal number of consecutive anomalies in \(T[i-m]\) is q. Based on Theorem 1, \(\{t_{i-m+1}, \ldots , t_{i}\}\) extends the length of a LMB in \(S_{i-m}\) by m without increasing q, i.e., \(\mathsf{OPT}(i) = \mathsf{OPT}(i-j) + (l+m)^2\). Similarly, \(\mathsf{OPT}(i-k) = \mathsf{OPT}(i-j) + (l+m-k)^2\). This implies \(\mathsf{OPT}(i)-\mathsf{OPT}(i-k)=(l+m)^2-(l+m-k)^2 = 2k(l+m) - k^2 > k^2\) for \(k \ge 1\), where the inequality holds because \(k \le m-1 < l+m\).

By contradiction, Algorithm 4 finds the optimal solution in the sequence T. We next prove that Algorithm 4 takes \(O(m^2n \log n)\) time. Algorithm 4 first finds all pieces in the sequence T of length n, which takes \(O(n)\) time. Assuming the number of pieces is m, Algorithm 4 then applies dynamic programming over the m pieces, analogous to Algorithm 2: it considers \(O(m^2)\) candidate segments, and computing a LMB for each takes \(O(n\log n)\) time by Theorem 2. Therefore, the overall time complexity is \(O(m^2n\log n)\). \(\square \)

Theorem 16

(Bidirectional Discovery) The extended Algorithm 2 solves the bidirectional abcOD discovery problem optimally in \(O(n^3\log n)\) time.

Proof

Since computing a longest decreasing band (LDB) is symmetrical to calculating a longest increasing band (LIB) (Definition 18), it follows directly from Theorem 6. \(\square \)
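
The symmetry between the two directions can be made concrete with a small checker. This is a sketch under the same assumptions used for illustration elsewhere (a single numeric attribute, and a scalar band condition in which a value may not fall more than \(\varDelta \) below the running maximum); the name `is_monotonic_band` is hypothetical. The decreasing case is handled by negating the values and reusing the increasing logic unchanged.

```python
def is_monotonic_band(ys, delta, increasing=True):
    """Check whether ys forms an increasing (or decreasing) band."""
    # A decreasing band over ys is an increasing band over the negated values.
    values = ys if increasing else [-y for y in ys]
    running_max = float("-inf")
    for y in values:
        if y < running_max - delta:
            return False  # y drops more than delta below the running maximum
        running_max = max(running_max, y)
    return True

print(is_monotonic_band([5, 4, 3], delta=1, increasing=False))
```

Any procedure written for the increasing direction can be reused for the decreasing direction in the same way, which is why the complexity bound carries over.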

Cite this article

Li, P., Szlichta, J., Böhlen, M. et al. ABC of order dependencies. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00696-z

Keywords

  • Data quality
  • Data profiling
  • Data discovery