Strongly Possible Keys for SQL

Missing data values are a widespread problem for both researchers and industrial developers. There are two general approaches to dealing with missing values in databases: they can be either ignored (removed) or imputed (filled in) with new values (Farhangfar et al. in IEEE Trans Syst Man Cybern-Part A: Syst Hum 37(5):692–709, 2007). For some SQL tables, it is possible that some candidate key of the table is not null-free, and this needs to be handled. Possible keys and certain keys to deal with this situation were introduced in Köhler et al. (VLDB J 25(4):571–596, 2016). In the present paper, we introduce an intermediate concept called strongly possible keys that is based on a data mining approach using only information already contained in the SQL table. A strongly possible key is a key that holds for some possible world which is obtained by replacing every occurrence of a null with some value already appearing in the corresponding attribute. Implication among strongly possible keys is characterized, and Armstrong tables are constructed. An algorithm to verify a strongly possible key is given that applies bipartite matching. A connection between the matroid intersection problem and systems of strongly possible keys is established. For the cases when no strongly possible keys hold, an approximation notion is introduced to calculate the closeness of any given set of attributes to being a strongly possible key using the $g_3$ measure, and we derive its component version $g_4$.
Analytical comparisons are given between the two measures.

Many systems today allow entering incomplete tuples into a database. For example, in the case of data warehousing if different sources of raw data are merged, some attributes may exist in some of the sources while not available in some of the others. This makes it necessary to treat keys over incomplete tables. It is common to encounter databases having up to half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data [11].
There are different reasons why incompleteness occurs in database tables. Date [3] distinguished seven types of null: value not applicable, value unknown, value does not exist, value undefined, value not valid, value not supplied, and value being the empty set. The present paper deals with data consumption with missing values in a database table, and we consider the second, third, and seventh types. For the other types of missing data, we assume that the symbol N/A belongs to each domain, and we treat it as a regular domain element in comparisons.
The missing values issue complicates data analysis. Other problems are usually associated with missing values, such as loss of data efficiency and effectiveness [10]. Although some methods of data analysis may overcome the missing value problem, many others require complete databases. There are two general approaches to dealing with missing values in databases: incomplete tuples can be either ignored (removed) or imputed (filled in) with new values [10].
In relational databases, a key over a relation is satisfied if no two distinct tuples have the same values on all the attributes of the key. Codd formulated the principle of key uniqueness and totality: for a key K of any relation schema R, in any relation with nulls over R, K must be null-free [7,16,17]. However, it is possible that a table has no null-free set of attributes that is a key, which violates Codd's condition for keys. So the occurrences of nulls in candidate key sets of attributes need to be handled. For example, in Fig. 1a, the candidate key (CourseName, Year) has a null in the last tuple.
The first approach to handling missing values in the key attributes is to ignore any tuple that has a null in any of the key attributes. This may lead to the loss of a large amount of data and may change the original data pattern and integrity if a large number of tuples need to be ignored compared to the total number of tuples. The other approach is to impute each occurrence of a null with a value from the attribute domain, as explained by Köhler et al. [20]. We investigate the situation when the attributes' domains are not known. For that, we only consider what we have in the given data and extract the values to be imputed from the data itself for each attribute, so that the resulting complete dataset after the imputation does not contain two tuples having the same value in the designated key set. Köhler et al. [20] used possible worlds obtained by replacing each occurrence of a null with a value from the corresponding attribute's (possibly infinite) domain. Each possible world is considered as a table of total tuples. They defined a possible key as a key that is satisfied by some possible world of a nontotal database table and a certain key as a key that is satisfied by every possible world of the table. For example, the table in Fig. 1a has some possible world that satisfies the possible key {CourseName}, while there is no possible world of the table that satisfies the key {Lecturer}; furthermore, every possible world of the table satisfies the certain key {CourseName, Year, Semester}.
In many cases, we have no proper reason to assume the existence of any other attribute values than the ones already existing in the table. Such examples could be types of cars, diagnoses of patients, applied medications, dates of exams, course descriptions, etc. In an SQL table, most of these attributes would have type VARCHAR(n) or DATE, so in dealing with possible worlds, one should consider any character string of length up to n, or all possible dates from '1000-01-01' to '9999-12-31'. This is certainly undesirable in the cases mentioned above.
We define a strongly possible key as a key that is satisfied by some possible world that is obtained by replacing each occurrence of a null by one of the existing values of the corresponding attribute. We call this kind of possible world a strongly possible world. This is a data mining-type approach: our idea is that we are given a raw table with nulls, and we would like to identify possible key sets based on the data only. As an example, {CourseName, Year} is a strongly possible key of the table in Fig. 1a, as the strongly possible world in Fig. 1b shows. However, clearly {CourseName, Year} is not a certain key, since we may replace the null in the first attribute of the third tuple by Mathematics. This is as if we defined each attribute type as an enumeration of the values val1, val2, val3, . . ., where val1, val2, val3, . . . are the non-NULL values that appear in the table in the given attribute's column.
There are incomplete SQL tables that do not have certain keys; for example, see Fig. 3b. In such cases, strongly possible keys make the least possible assumption about attribute domains, as they only take values that are already present, in contrast to possible keys that could take any value from a (possibly infinite) predefined domain. In this way, the results obtained by normalization, decomposition, etc. using strongly possible keys approximate the (theoretical) complete database lying behind the incomplete table. The importance of keys and the earlier investigation of the possible and certain keys motivate studying strongly possible keys and their semantic definition and properties.
We treat the implication problem for strongly possible keys and find that it behaves similarly to that of keys in complete datasets, which distinguishes it from possible keys. Furthermore, we show that strongly possible key constraints enjoy Armstrong tables if they satisfy a natural necessary condition. We also point out a connection between the matroid intersection problem and the satisfaction of a system of strongly possible keys. Furthermore, we introduce an algorithm to verify a single strongly possible key using matchings in bipartite graphs. Finally, we treat approximation measures of strongly possible keys.
The organization of this paper is as follows. In Sect. 2, the related work is reviewed, and Sect. 3 contains preliminaries and definitions. Strongly possible keys over relational data with null occurrences in the key attributes are studied in Sect. 4. In Sect. 5, the existence of a system of strongly possible keys and the use of matchings to discover strongly possible keys are studied. Algorithmic aspects of strongly possible keys are discussed in Sect. 6, and some test results on real-world datasets are also shown. In Sect. 7, we apply an approximation measure for keys to strongly possible keys, derive a new measure, and compare the two measures. Results and future research directions are summarized in Sect. 8.

Related Work
Keys are important constraints that enforce the semantics of relational database systems. A key K is satisfied by a total relation over a relation schema R if there are no two tuples in the relation that agree on all the attributes of K. Database relations that occur in real database systems usually contain occurrences of null values, and in some cases, this includes the key columns. Various studies have been done for the purpose of handling missing values.
Sree [4] shows that it is necessary to impute the missing values based on other information in the dataset to overcome the biased results that affect the accuracy of classification generated by missing values. Similarly, we use the attribute's existing values for each null in that attribute. Cheng et al. [6] utilize clustering algorithms to cluster data and calculate coefficient values between different attributes by generating the minimum average error.
Farhangfar et al. introduced a framework of imputation methods in [10] and evaluated how the choice of different imputation methods affects the performance in [11]. Experimental analyses of several algorithms for imputation of missing values were performed in [1,5,9,18]. Our imputation method adopts the concept of graph matching by assigning to each incomplete record a complete one from the set of complete records constructed by combining all attribute values of the visible domains. An approach introduced by Zhang et al. [28] discusses and compares several strategies that utilize only known values and argues that "missing is useful" for cost reduction in cost-sensitive environments.
Köhler et al. [20] introduced possible and certain keys. A set K of attributes is a possible key if there is a possible world where K is a key. On the other hand, K is a certain key if it is a key in every possible world. The main concept of the present paper is between these two, since a strongly possible world is a possible world, as well. Possible worlds may use any value from an attribute domain to replace a null. This effectively allows an infinite pool of values. Strongly possible worlds are created from finite attribute domains. Some of the results in [20] essentially use that some attribute domains are infinite. In particular, the characterization of possible keys uses symbols not occurring before, to be imputed in place of nulls. This cannot be done for strongly possible keys. In the present paper, we investigate what can be stated without assuming arbitrarily large domains.

Preliminaries
We start by summarizing some basic definitions and terminology. Let $R = \{A_1, A_2, \ldots, A_n\}$ be a relation schema. The set of all possible values for each attribute $A_i \in R$ is called the domain of $A_i$ and is denoted by $D_i = dom(A_i)$ for $i = 1, 2, \ldots, n$. For $X \subseteq R$, $D_X = \prod_{A_i \in X} D_i$.
An instance $T = (t_1, t_2, \ldots, t_s)$ over $R$ is a list of tuples such that each tuple is a function $t : R \to \bigcup_{A_i \in R} D_i$ with $t[A_i] \in D_i$ for all $A_i \in R$. For a tuple $t_r \in T$ and $X \subset R$, let $t_r[X]$ be the restriction of the $r$th tuple of $T$ to $X$. By taking a list of tuples, we use the bag semantics that allows several occurrences of the same tuple.
In practice, database tables may have missing information about the value of some $t_j[A_i]$. Codd's null marker is included in each domain to represent the missing information [12]. Morrissett classified imperfect data into four main classes [25]. These classes are: (i) imprecise and vague, which are defined as not exact and indefinite in nature, respectively; (ii) ambiguous and subjective, which represent the uncategorizable data; (iii) unclear and uncertain, which represent the not explicitly defined data; and (iv) inconsistent and incomplete, which define the values that show contradictions or are not finished. The first two classes do not define missing values, but show undetermined and unmeasurable values, while the last two classes define the missing values and can be represented using Codd's null marker $\perp$.
A tuple $t_r$ is called $V$-total for a set $V$ of attributes if $t_r[A] \neq \perp$ for all $A \in V$. A tuple $t_r$ is a total tuple if it does not contain any occurrence of $\perp$ in any attribute of the relation, i.e., if it is $R$-total. Two tuples $t_1$ and $t_2$ are weakly similar on $X \subseteq R$, denoted by $t_1[X] \sim_w t_2[X]$, as defined by Köhler et al. [20], if
$$\forall A \in X:\ t_1[A] = t_2[A] \ \text{or}\ t_1[A] = \perp \ \text{or}\ t_2[A] = \perp.$$
Furthermore, $t_1$ and $t_2$ are strongly similar on $X \subseteq R$, denoted by $t_1[X] \sim_s t_2[X]$, if
$$\forall A \in X:\ t_1[A] = t_2[A] \neq \perp.$$
For the sake of convenience, we write $t_1 \sim_w t_2$ if $t_1$ and $t_2$ are weakly similar on $R$, and the same for strong similarity. For a null-free table (a table with $R$-total tuples), a set of attributes $K \subset R$ is a key if there are no two distinct tuples in the table that share the same values in all the attributes of $K$:
$$t_a[K] = t_b[K] \Rightarrow a = b.$$
Possible and certain keys were defined by Köhler et al. [20]. Let $T' = (t'_1, t'_2, \ldots, t'_s)$ be a table that represents a total version of $T$, obtained by replacing each occurrence of $\perp$ in $t[A_i]$ with a value from the domain $D_i$ different from $\perp$, for each $i$. $T'$ is called a possible world of $T$. Equivalently, $T'$ is a possible world of $T$ if $t_i$ is weakly similar to $t'_i$ for each $i$ and $T'$ is a completely null-free table.
A possible key $K$, denoted by $p\langle K\rangle$, is a key for some possible world $T'$ of $T$. Similarly, a subset $K$ of attributes is a certain key, denoted by $c\langle K\rangle$, if it is a key for every possible world $T'$ of $T$.
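The similarity relations and the key condition for null-free tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tuples are modelled as Python tuples, `None` stands in for the null marker, attributes are addressed by column index, and the function names are ours rather than part of any library.

```python
from typing import Optional, Sequence, Tuple

Row = Tuple[Optional[str], ...]  # None plays the role of the null marker


def weakly_similar(t1: Row, t2: Row, cols: Sequence[int]) -> bool:
    """On every column in cols, the values agree or at least one is null."""
    return all(t1[c] is None or t2[c] is None or t1[c] == t2[c] for c in cols)


def strongly_similar(t1: Row, t2: Row, cols: Sequence[int]) -> bool:
    """On every column in cols, both values are non-null and equal."""
    return all(t1[c] is not None and t1[c] == t2[c] for c in cols)


def is_key(total_table: Sequence[Row], cols: Sequence[int]) -> bool:
    """Key test for a null-free table: no two rows share a projection on cols."""
    projections = [tuple(t[c] for c in cols) for t in total_table]
    return len(set(projections)) == len(projections)
```

With this encoding, a possible world of a table is any null-free table whose rows are pairwise weakly similar to the original rows, position by position.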

Strongly Possible Keys
A database attribute's domain is a predefined set of values that are allowed to be used for all the tuples in that attribute. For example, in Fig. 1a, the attribute CourseName has a predefined domain of all the computer science-related topics, but it only uses the two values Mathematics and Datamining, along with $\perp$ in the last tuple.

Definition 1 The visible domain of an attribute $A_i$, denoted by $VD_i$, is the set of all distinct values except $\perp$ that are already used by the tuples in $T$:
$$VD_i = \{t[A_i] : t \in T\} \setminus \{\perp\} \quad \text{for } A_i \in R.$$
For example, the visible domain of the CourseName attribute in Fig. 1a is {Mathematics, Datamining}. The term visible domain refers to the data that already exist in a given dataset. For example, if we have a dataset with no information about the definitions of the attributes' domains, then we use the data themselves to define their own structure and domains. This may provide more realistic results when extracting relationships between data.
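As a small illustration of the visible domain, the following sketch collects the non-null values of one column. The rows are made up for the example (they are not the contents of Fig. 1a), `None` stands in for the null marker, and the function name is an assumption of ours.

```python
from typing import Optional, Sequence, Set, Tuple

Row = Tuple[Optional[str], ...]


def visible_domain(table: Sequence[Row], col: int) -> Set[str]:
    """Distinct non-null values that already occur in the given column."""
    return {t[col] for t in table if t[col] is not None}


# Hypothetical rows over (CourseName, Year); None marks a missing value.
rows = [("Mathematics", "2019"), ("Datamining", "2018"), (None, "2019")]
print(sorted(visible_domain(rows, 0)))  # ['Datamining', 'Mathematics']
```

A column that contains only nulls has an empty visible domain, in which case no null of that column can be filled from the data.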
While a possible world is obtained by using the domain values in place of the occurrences of nulls, as defined in Sect. 3, a strongly possible world is obtained by using the visible-domain values.
That allows us to construct a possible world of a set of data with some missing values by using only the available information. We define a strongly possible key as a key for some strongly possible world of $T$.
Definition 3 A subset $K \subseteq R$ is a strongly possible key (in notation $sp\langle K\rangle$) in $T$ if there exists a strongly possible world $T'$ such that $K$ is a key in $T'$.
Recall that the instance in Fig. 1a satisfies sp⟨CourseName, Year⟩, because there is a strongly possible world, shown in Fig. 1b, where {CourseName, Year} is a key. On the other hand, the table satisfies neither sp⟨CourseName, Lecturer⟩ nor sp⟨Year, Lecturer⟩, because there are no strongly possible worlds $T'$ that have {CourseName, Lecturer} or {Year, Lecturer} as keys.
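The definition can be checked directly on small tables by enumerating every strongly possible world. The sketch below does exactly that; it is exponential in the number of nulls and is meant only to make the definition concrete, not to be an efficient procedure. The table is a made-up example, `None` stands in for the null marker, the table is assumed to be nonempty, and all names are ours.

```python
from itertools import product
from typing import Iterator, List, Optional, Sequence, Tuple

Row = Tuple[Optional[str], ...]


def strongly_possible_worlds(table: Sequence[Row]) -> Iterator[List[tuple]]:
    """Yield every table obtained by filling nulls from the visible domains."""
    n_cols = len(table[0])
    vd = [sorted({t[c] for t in table if t[c] is not None}) for c in range(n_cols)]
    per_row = []
    for t in table:
        # A non-null cell keeps its value; a null ranges over the visible domain.
        choices = [[v] if v is not None else vd[c] for c, v in enumerate(t)]
        per_row.append(list(product(*choices)))
    for world in product(*per_row):
        yield list(world)


def is_sp_key_bruteforce(table: Sequence[Row], key_cols: Sequence[int]) -> bool:
    """True if key_cols is a key in at least one strongly possible world."""
    for world in strongly_possible_worlds(table):
        projections = {tuple(t[c] for c in key_cols) for t in world}
        if len(projections) == len(world):
            return True
    return False


# Hypothetical table over (CourseName, Year).
table = [("Math", "2019"), (None, "2020"), ("DM", None)]
print(is_sp_key_bruteforce(table, [0, 1]))  # True
print(is_sp_key_bruteforce(table, [0]))     # False
```

In the example, the pair of columns is a strongly possible key, while neither single column is, because any visible-domain value used for its null already occurs in that column.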

and K is a key in T .
Note that if $t_i[K] \sim_s t_j[K]$ for some $i \neq j$, then $K$ is not a strongly possible key, but the reverse is not necessarily true. For example, take the instance $T$ of two attributes $(A_1, A_2)$.
In the relational model, any subset of attributes that is not a key is called an anti key [26]. The analogous concept can be defined in the present context.

Definition 4
We say that $K$ is a strongly possible anti key, in notation $\neg sp\langle K\rangle$, if there is no strongly possible world $T'$ of $T$ such that $K$ is a key in $T'$.

Implication Problem
Integrity constraints determine the way the elements are associated with each other in a database. The implication problem asks whether a given set of constraints entails further constraints. In other words, given an arbitrary set of constraints, the implication problem is to determine whether a single constraint is satisfied by all instances satisfying the given set of constraints. In our context, to define implication, let us consider $\Sigma$ as a set of strongly possible key constraints and $\theta$ as a single strongly possible key over a relation schema $R$.
$\Sigma$ logically implies $\theta$, denoted by $\Sigma \models \theta$, if for every instance $T$ over $R$ satisfying every strongly possible key in $\Sigma$, we have that $T$ satisfies $\theta$. The next theorem describes the implication problem for strongly possible keys.
, $\forall i \neq j$ holds, as well. ⇒: Suppose indirectly that $sp\langle Y\rangle \notin \Sigma$ for all $Y \subseteq K$. Consider the instance consisting of the two tuples shown in Table 2. Then there is only one possible $t'_2$ in $T'$, showing that $(t_1, t_2)$ satisfies every strongly possible key constraint from $\Sigma$, but does not satisfy $sp\langle K\rangle$.

Note 2 If $\Sigma \models sp\langle K\rangle$, then $\Sigma \models p\langle K\rangle$, but the reverse is not necessarily true, since $D_K \supseteq VD_K$ could be a proper containment, so $K$ could be made a key by imputing values from $D_K \setminus VD_K$. For example, in Fig. 2, it is shown that $\neg sp\langle K\rangle$ holds, but $p\langle K\rangle$ may hold in some $T'$ if there is at least one value other than zero in the domain of $K$ that can be placed instead of the nulls in the second tuple so that the tuples become distinct on $K$.

Note 3 If $\Sigma \models c\langle K\rangle$, then $\Sigma \models sp\langle K\rangle$. As certain keys hold in any possible world, they also hold if this possible world is created using the visible domains.

Note 4 For a single attribute $A$, if $sp\langle A\rangle$ holds in $T$, then no tuple of $T$ has a null in $A$. In other words, a single attribute with a null value cannot be a strongly possible key. That is because replacing an occurrence of null with a visible-domain value results in duplicated values for that attribute.

Armstrong Tables
Armstrong tables are useful tools to represent constraint sets in a user-friendly way [2,8,15,24]. For a class $\mathcal{C}$ of constraints and a set $\Sigma$ of constraints in $\mathcal{C}$, a $\mathcal{C}$-Armstrong table $T$ for $\Sigma$ satisfies $\Sigma$ and violates all the constraints in $\mathcal{C}$ not implied by $\Sigma$. Therefore, given an Armstrong table $T$ for $\Sigma$, the problem of deciding for an arbitrary constraint $\sigma$ in $\mathcal{C}$ whether $\Sigma$ implies $\sigma$ reduces to the problem of verifying whether $\sigma$ holds on $T$. The ability to compute an Armstrong table for $\Sigma$ provides us with a data sample that is a perfect semantic summary of $\Sigma$. For further details on how Armstrong tables help the communication between database engineers and domain experts, the reader is referred to Sect. 4.2 of [20].
Following [20], we introduce the concept of null-free subschema (NFS). Let $R$ be a schema; an NFS $R_S$ over $R$ is a set such that $R_S \subseteq R$. An instance $T$ satisfies NFS $R_S$ if it is $R_S$-total; that is, each tuple $t \in T$ is $R_S$-total. This corresponds to the NOT NULL constraint of SQL.

Definition 6
Let $\Sigma$ be a collection of strongly possible key constraints. An instance $T$ over $(R, R_S)$ is an Armstrong table for $(R, R_S, \Sigma)$ if for every strongly possible key $\theta$ over $R$, $\theta$ holds in $T$ iff $\Sigma \models \theta$, and for every attribute $A \in R \setminus R_S$ there exists a tuple $t \in T$ with $t[A] = \perp$.
Let us suppose that $\Sigma = \{sp\langle K\rangle : K \in \mathcal{K}\}$ is given. By Note 4, if $|K| = 1$, then $K \subseteq R_S$ must hold. If this restriction is satisfied, then $\Sigma$ enjoys an Armstrong table. We would like to point out a significant difference between strongly possible keys and possible keys. In [20], an example of a collection of possible and certain key constraints is given that does not have an Armstrong table, while the next theorem states that this does not happen for strongly possible keys.
Proof Let $\mathcal{A}$ be the collection of strongly possible anti keys; that is, $\mathcal{A} = \{A \subset R : \Sigma \not\models sp\langle A\rangle\}$. According to Theorem 2 and Note 1, $\mathcal{A}$ is a downset. Create a strongly possible world $T'$ from $T$ by replacing the null of $t_i[X]$ by 0 if $X \in A_i$ and by $i$ otherwise. We claim that no two tuples of $T'$ agree on all attributes of $K$ if $\Sigma \models sp\langle K\rangle$. Furthermore, if $0 \le i \le p$ and $j > p$, then $t'_i$ and $t'_j$ can agree in at most one attribute, but that attribute is not a singleton attribute key. On the other hand, if $\Sigma \not\models sp\langle L\rangle$, then there exists $i$ such that $L \subseteq A_i$, which implies that $t'_0[L] = t'_i[L]$; that is, $L$ is not a key in table $T'$.

Matchings, Matroids, and Strongly Possible Keys
In this section, we study the verification problem of strongly possible keys. That is, given a table instance $T = \{t_1, t_2, \ldots, t_s\}$ over schema $R = \{A_1, A_2, \ldots, A_n\}$ and a collection $\mathcal{K} = \{K_1, K_2, \ldots, K_p\}$ of attribute sets, determine whether $\Sigma = \{sp\langle K_1\rangle, sp\langle K_2\rangle, \ldots, sp\langle K_p\rangle\}$ holds in $T$. As we will see, there is a significant difference between the case of a single strongly possible key and that of a system of multiple strongly possible keys.

Checking a Single Strongly Possible Key
If we want to decide whether $sp\langle K\rangle$ holds or not, we can forget about the attributes that are not in $K$, since we only need distinct values on $K$. Thus, we may construct a table $T'$ that is formed by finding all the possible combinations of the visible domains of $T|_K$ that are weakly similar to some tuple in $T|_K$.
Finding a matching between $T$ and $T'$ that covers all the tuples in $T$ (if one exists) yields the set of tuples in $T'$ that can be used to replace the incomplete tuples in $T$ to obtain a strongly possible world for $T$, which verifies that $K$ is a strongly possible key. The algorithmic details are discussed in Sect. 6. Figure 3b shows an incomplete set of tuples where $K$ = {Lecturer, Course}. A visible domain can be identified for each attribute to construct the tuples of $T'$ by finding the combinations of all the visible-domain values, as shown in Fig. 3c (where we included all tuples for tables a and b together). Bipartite graphs are constructed between the tuples with null(s) in $T$ and the tuples in $T'$, excluding those tuples that agree on $K$ with some total tuple in $T$. Figure 4b illustrates the graph for Fig. 3b, which contains a complete matching that assigns a total tuple to each nontotal tuple in $T$, so $K$ is a key, while Figs. 3a and 4a show a case where there is no matching that covers all the tuples in $T$.
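The construction of the candidate table and of the bipartite graph can be sketched as follows. The code works on the projection to the key columns only and, for simplicity, keeps every row on the left side (a row that is total on the key simply has a single neighbour), which differs slightly from the figures, where only tuples with nulls appear. The table is a made-up example, `None` stands in for the null marker, and the names are ours.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Set, Tuple

Row = Tuple[Optional[str], ...]


def candidate_graph(table: Sequence[Row], key_cols: Sequence[int]):
    """Return (candidates, edges): total key projections and row adjacency."""
    vd = {c: sorted({t[c] for t in table if t[c] is not None}) for c in key_cols}
    edges: Dict[int, Set[tuple]] = {}
    for i, t in enumerate(table):
        # Known values are kept; each null ranges over the visible domain.
        choices = [[t[c]] if t[c] is not None else vd[c] for c in key_cols]
        edges[i] = set(product(*choices))  # candidates weakly similar to row i
    candidates = set().union(*edges.values())
    return candidates, edges


# Hypothetical table over (CourseName, Year).
table = [("Math", "2019"), (None, "2020"), ("DM", None)]
candidates, edges = candidate_graph(table, [0, 1])
print(len(candidates))  # 4
```

A matching that covers all rows of this graph picks a distinct candidate for every row, which is exactly a strongly possible world restricted to the key columns.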

System of Multiple Strongly Possible Keys
Here, we show how the existence of a system of strongly possible keys is equivalent to the existence of a common independent set of a given size in several matroids. This is a natural extension of the bipartite matching idea of the previous subsection, since transversal matroids are generalizations of bipartite graphs. For basic definitions and properties of matroids, the reader is referred to Welsh's book [27].
Let us be given a schema $R = \{A_1, A_2, \ldots, A_n\}$, and let $\mathcal{K} = \{K_1, K_2, \ldots, K_p\}$ be a collection of attribute sets and $T = \{t_1, t_2, \ldots, t_s\}$ be an instance with possible null occurrences. Our main question here is whether $\Sigma = \{sp\langle K_1\rangle, sp\langle K_2\rangle, \ldots, sp\langle K_p\rangle\}$ holds in $T$. A strongly possible world that satisfies $\Sigma$ is given by an injective mapping $f$ that assigns to each tuple of $T$ a weakly similar total tuple over the visible domains such that, for each $j$, $K_j$ is a key in $T' = f(T)$. Let $S \subseteq VD_1 \times VD_2 \times \cdots \times VD_n$ be the union $S = E_1 \cup E_2 \cup \cdots \cup E_s$, where $E_i$ is the set of total tuples over the visible domains that are weakly similar to $t_i$, and define the bipartite graph $G = (T, S; E)$ by $\{t, t'\} \in E \iff t \sim_w t'$ for $t \in T$ and $t' \in S$. Let $(S, \mathcal{M}_0)$ be the transversal matroid defined by $G$ on $S$; that is, a subset $X \subseteq S$ is independent if and only if it can be matched into $T$. Furthermore, consider the partitions of $S$ induced by $K_j$ for $j = 1, 2, \ldots, p$, whose classes $S^j_i$ are maximal sets of tuples from $S$ that agree on $K_j$. Let $(S, \mathcal{M}_j)$ be the partition matroid given by this partition. We can formulate the following theorem.
Proof An independent set $T'$ of size $|T|$ in matroid $(S, \mathcal{M}_0)$ means that the tuples in $T'$ form a strongly possible world for $T$. That they are independent in $(S, \mathcal{M}_j)$ means that $K_j$ is a key in $T'$; that is, $sp\langle K_j\rangle$ holds.

Conversely, if $\Sigma = \{sp\langle K_1\rangle, sp\langle K_2\rangle, \ldots, sp\langle K_p\rangle\}$ holds in $T$, then there exists a strongly possible world $T' = \{t'_1, t'_2, \ldots, t'_s\} \subseteq VD_1 \times VD_2 \times \cdots \times VD_n$ such that $t_i \sim_w t'_i$. This means that $T' \subseteq S$ and that $T'$ is independent in the transversal matroid $(S, \mathcal{M}_0)$. That $sp\langle K_j\rangle$ holds implies that the tuples $t'_i$ are pairwise distinct on $K_j$; that is, $T'$ is independent in the partition matroid $(S, \mathcal{M}_j)$.
Unfortunately, Theorem 4 does not give a good algorithm to decide the satisfaction of a system of strongly possible keys, because as soon as $\Sigma$ contains at least two constraints, we would have to calculate the size of the largest common independent set of at least three matroids, which is known to be an NP-complete problem in general [13].
In the case of a single strongly possible key constraint $sp\langle K\rangle$, Theorem 4 requires computing the largest common independent set of two matroids, which can be done in polynomial time [21]. However, we can solve that case by reducing it to the somewhat simpler problem of matchings in bipartite graphs.
To conclude this section, we give some necessary conditions for strongly possible key constraints to be satisfied. Note that $sp\langle K\rangle$ holds if a matching covering $T$ exists in the bipartite graph $G = (T, T'; E)$ defined as above, $\{t, t'\} \in E \iff t[K] \sim_w t'[K]$. We can apply Hall's theorem to obtain that such a matching exists if and only if $|N(X)| \ge |X|$ holds for all $X \subseteq T$, where $N(X)$ denotes the set of neighbours of $X$ in $G$.
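Hall's condition can be tested literally on small graphs by checking every subset of the left side. The sketch below does so and returns a violating set when one exists; it is exponential in the number of tuples and serves only to illustrate the condition. The adjacency dictionaries are hypothetical examples, and the function name is ours.

```python
from itertools import combinations
from typing import Dict, Hashable, Optional, Set


def hall_violation(edges: Dict[Hashable, Set[Hashable]]) -> Optional[set]:
    """Return a set X of left vertices with |N(X)| < |X|, or None if none exists."""
    left = list(edges)
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            neighbours = set().union(*(edges[v] for v in subset))
            if len(neighbours) < len(subset):
                return set(subset)
    return None


# Two tuples compete for the single candidate "a": Hall's condition fails.
bad = {"t1": {"a"}, "t2": {"a"}, "t3": {"a", "b"}}
print(hall_violation(bad))  # a set containing 't1' and 't2'

# Here every subset has enough neighbours, so a covering matching exists.
good = {"t1": {"a"}, "t2": {"a", "b"}}
print(hall_violation(good))  # None
```

A returned set is a certificate that no matching can cover all tuples, and hence that the corresponding attribute set is not a strongly possible key.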

Necessary Conditions
The conditions in Proposition 2 are implied by Hall's condition, as well. Let us assume that Hall's condition is true for a set of tuples $T$ and attribute set $K$. If $t_i, t_j$ are strongly similar on $K$, then the set $X = \{t_i, t_j\}$ has $1 = |N(X)| < |X| = 2$, which proves (1).
As defined in Sect. 3, a certain key is a key for any possible world, i.e., all the tuples are distinct after filling the nulls regardless of what values are used. We prove that one does not need to check all possible worlds in order to verify that a certain key constraint is satisfied; it is enough to consider strongly possible worlds only.

Theorem 5 Let T be a table instance over schema R such that a strongly possible world of T exists. K ⊆ R is a certain key if and only if K is a key in any strongly possible world of T .
Proof ⇒: If $c\langle K\rangle$ holds, then $K$ is a key in any possible world by definition, so in particular in any strongly possible world, as well.
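⇐: Let us assume that $K$ is a key in any strongly possible world, but there exists a possible world $T'$ and two distinct tuples $t'_1 \neq t'_2$ of $T'$ such that $t'_1[K] = t'_2[K]$. Let $A \in K$ be an attribute. There are three possibilities. If neither $t_1[A]$ nor $t_2[A]$ is null, then they are equal, and we let $t''_1[A] = t_1[A]$ and $t''_2[A] = t_2[A]$. If exactly one of them is null, then we replace the null by the value of the other tuple, which belongs to the visible domain of $A$. If both are null, then we replace both by the same value $x$ from the visible domain of $A$. Such an $x$ exists, since $T$ has a strongly possible world. For attributes not in $K$, extend $t''_1$ and $t''_2$ arbitrarily from the visible domains of the attributes. Also, fill up the nulls of the other tuples of $T$ from the visible domains to obtain a strongly possible world, where the distinct tuples $t''_1$ and $t''_2$ agree on $K$, contradicting the assumption that $K$ is a key in any strongly possible world.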
Note that a table instance $T$ over schema $R$ fails to have a strongly possible world only if there exists an attribute $A \in R$ such that $t[A] = \perp$ for every $t \in T$.
The concept of strongly possible keys lies in between the two concepts of possible and certain keys. Every certain key is a strongly possible key, and every strongly possible key is a possible key; Fig. 6 illustrates these scopes.
Fig. 6 Possible, strongly possible, and certain key scopes
We introduce Algorithm 1 to find a strongly possible world (if it exists) for a given $T$, which verifies that $sp\langle K\rangle$ holds. We start by generating $T'$ from $T$. $T'$ contains all total tuples that are weakly similar to some incomplete tuple in $T$. To reduce complexity, based on Proposition 2, we use the values of the usable visible domain (UVD) of each attribute of $K$ in place of each null. Generating $T'$ is done by taking the nontotal tuples of $T$ one by one and finding the weakly similar total tuples using the UVDs for each attribute instead of the nulls. This process may result in some duplicates in $T'$, because more than one nontotal tuple in $T$ can be weakly similar to the same total tuple. For example, in Fig. 1a, the first and the last tuples are both nontotal and are both weakly similar to the total tuple (Mathematics, 2019, Sarah, 5) generated by using the UVD for each null. Therefore, in the algorithm, we need to remove these duplicated tuples from $T'$ after generating the weakly similar tuples for all the tuples $t \in T$. Removing duplicates can be done by sorting the tuples using Radix sort.
After calculating $T'$, we find the maximum matching between $T \setminus T_{total}$ and $T' \setminus T_{total}$, where $T_{total}$ is the set of all the total tuples in $T$. Every total tuple in $T$ is matched to the corresponding total tuple in $T'$, so we only need the maximum matching between $T \setminus T_{total}$ and $T' \setminus T_{total}$, which reduces the complexity. Function BipartiteMatching uses the standard augmenting path method to find a largest matching [23]. Let us recall that an augmenting path with respect to a matching $M$ is a path from an unmatched element in $T$ to an unmatched element in $T'$ that uses nonmatching and matching edges alternatingly. We define $M \oplus P = (M \setminus P) \cup (P \setminus M)$ for an augmenting path $P$ with respect to a matching $M$. $M \oplus P$ is again a matching, and since $P$ starts and ends with nonmatching edges, $M \oplus P$ is of size one larger than $M$. If the resulting largest matching covers all the tuples in $T \setminus T_{total}$, then $T$ has a strongly possible world, formed by the matched tuples of $T'$ together with $T_{total}$, that verifies $sp\langle K\rangle$; otherwise $\neg sp\langle K\rangle$ holds.
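The augmentation step can be illustrated with plain set operations. In the sketch below, edges are pairs (tuple, candidate), a matching and a path are sets of such edges, and the symmetric difference operator of Python sets plays the role of the exclusive-or of the two edge sets. The vertex names are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (left vertex, right vertex)


def augment(matching: Set[Edge], path: Set[Edge]) -> Set[Edge]:
    """Return (M - P) | (P - M), the symmetric difference of M and P."""
    return matching ^ path


def is_matching(edges: Set[Edge]) -> bool:
    """No left vertex and no right vertex occurs in two edges."""
    lefts = [u for u, _ in edges]
    rights = [v for _, v in edges]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)


# Current matching: t1 is matched to a; t2 and b are unmatched.
M = {("t1", "a")}
# Augmenting path t2 - a - t1 - b: nonmatching, matching, nonmatching edge.
P = {("t2", "a"), ("t1", "a"), ("t1", "b")}

M2 = augment(M, P)
print(sorted(M2))  # [('t1', 'b'), ('t2', 'a')]
```

The result is again a matching and has exactly one more edge than the matching we started from.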
The running time depends on the size of $T'$, which could be exponential in the size of the input. Sorting using Radix sort takes $O(|R|(|T| + |T'|))$ time, while finding the largest matching in $G = (T \setminus T_{total}, T' \setminus T_{total}; E)$ takes $O((|T \setminus T_{total}| + |T' \setminus T_{total}|)|E|)$ time by the augmenting path method.

Algorithm 1 Discovering a Strongly Possible Key
Input: Dataset T on relation schema R with a key K of b attributes
Output: A strongly possible world if one exists
procedure spKeyDiscovery(item T, item K)
    T' = ∅
    for all t ∈ T \ T_total do
        T_temp = ∅
        add to T_temp the total tuples weakly similar to t, built from the UVDs
        T' = T' ∪ T_temp
    end for
    Sort T' using RadixSort
    Remove the duplicated tuples in T'
    Graph G = {V = (T \ T_total, T' \ T_total); E}
    Match = BipartiteMatching(G)
    if size(Match) = |T \ T_total| then
        return the strongly possible world formed by Match and T_total
    else
        return ¬sp⟨K⟩
    end if
end procedure
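The procedure can be sketched in Python along the following lines. This is a simplified illustration and not the implementation used in the experiments: it uses the plain visible domains instead of the usable visible domains, Python sets instead of Radix sort for removing duplicates, and a recursive augmenting path search. It works on the projection to the key columns, with `None` standing in for the null marker; all names and the example table are ours.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[str], ...]


def sp_key_world(table: Sequence[Row], key_cols: Sequence[int]) -> Optional[List[tuple]]:
    """Key projections of a strongly possible world in which key_cols is a key,
    listed in row order, or None if no such world exists."""
    vd = {c: sorted({t[c] for t in table if t[c] is not None}) for c in key_cols}
    world = [tuple(t[c] for c in key_cols) for t in table]
    partial = [i for i, p in enumerate(world) if None in p]
    total = [p for p in world if None not in p]
    if len(set(total)) < len(total):
        return None  # two rows that are total on the key already collide
    taken = set(total)

    # Adjacency: candidates weakly similar to each nontotal row, excluding
    # projections already used by total rows.
    adj = []
    for i in partial:
        choices = [[v] if v is not None else vd[c] for c, v in zip(key_cols, world[i])]
        adj.append([cand for cand in product(*choices) if cand not in taken])

    match = {}  # candidate -> position in `partial`

    def try_augment(pos: int, seen: set) -> bool:
        for cand in adj[pos]:
            if cand in seen:
                continue
            seen.add(cand)
            if cand not in match or try_augment(match[cand], seen):
                match[cand] = pos
                return True
        return False

    for pos in range(len(partial)):
        if not try_augment(pos, set()):
            return None  # the largest matching does not cover every nontotal row
    for cand, pos in match.items():
        world[partial[pos]] = cand
    return world


# Hypothetical table over (CourseName, Year).
table = [("Math", "2019"), (None, "2020"), ("DM", None)]
print(sp_key_world(table, [0, 1]))
print(sp_key_world(table, [0]))  # None
```

For the example table, the first call returns three pairwise distinct projections, and the second call returns `None` because the only candidates for the null in the first column are already taken by total rows.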

Application to Real-Life Datasets
The experiments were carried out on an Asus X541UJ with an Intel Core i7 CPU at 2.7 GHz, 8 GB RAM, and a 64-bit operating system. The datasets were stored and processed in MySQL version 5.7.21.
We used the following two publicly available datasets. To simulate incomplete tables, we removed some attribute values and replaced them by nulls at random in 10% and in 50% of the rows.
1. http://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity
2. https://www.kaggle.com/mirichoi0218/insurance
We looked for the smallest possible strongly possible key sets, that is, two-element sets. We found that the tables satisfy no certain key of two elements after introducing the nulls, while we could find some attribute combinations that form strongly possible keys. For the other pairs, we found that they are not strongly possible keys, because either there is no matching that covers all the rows or there exist some duplicate rows.
QSAR fish toxicity Dataset This dataset was used to develop quantitative regression QSAR models to predict acute aquatic toxicity toward the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. The table is named qsar_fish_toxicity and has 908 rows and seven attributes: six molecular descriptors and one quantitative response. Similar to the 10% null occurrence, the table satisfies sp⟨MLOGP, quantitative_response⟩ but not c⟨MLOGP, quantitative_response⟩ for the 50% null occurrence.
The insurance table satisfies $sp\langle \mathit{bmi}, \mathit{age}\rangle$ and $sp\langle \mathit{bmi}, \mathit{charges}\rangle$, with no certain key of size 2, after randomly adding nulls in 10% and 50% of the rows, respectively.

Strongly Possible Keys Approximation
The main motivation for studying strongly possible keys is the unique identification of tuples of an incomplete dataset by filling occurrences of nulls using only the already-present data, whenever this is possible. In some cases no strongly possible key exists, so a natural question is how well a table $T$ approximates $sp\langle K\rangle$. Let $T$ be the instance in Fig. 3a; then $sp\langle K\rangle$ cannot hold because Hall's condition is violated, as shown in Fig. 4. However, $sp\langle K\rangle$ holds when restricted to the last four tuples; that is, if we remove the first tuple, $sp\langle K\rangle$ is satisfied. This is what we call the Approximate Strongly Possible Key (ASP key) property.

Definition 7
An attribute set $K$ is an approximate strongly possible key of ratio $a$ in table $T$, in notation $asp_a\langle K\rangle$, if there exists a subset $S$ of the tuples of $T$ such that $T \setminus S$ satisfies $sp\langle K\rangle$ and $|S|/|T| \le a$. The variable $a$ represents the approximation, which is the ratio of the number of tuples that need to be removed to the total number of tuples so that $sp\langle K\rangle$ holds.
The variable $a$ ranges from 0 to 1. It is equal to 0 exactly when $sp\langle K\rangle$ holds in $T$, i.e. no tuples need to be removed. For our example table, the strongly possible key approximation is $asp_{0.2}\langle K\rangle$, since at most four of the five tuples can have distinct values on the key set in any strongly possible world.
Let us now consider Fig. 3d, which has $2n + 2$ tuples and, as in Fig. 3a, contains two tuples with the same values after filling the nulls using visible-domain elements. The ratio $a$ for which $asp_a\langle K\rangle$ holds is smaller for Fig. 3d, since there are $2n + 1$ tuples with distinct values after filling the nulls. Hence, the table in Fig. 3d is closer to having a strongly possible key. To measure this degree in a given dataset, we use the $g_3$ measure introduced in [19]. The $g_3$ measure is based on the idea that the degree to which an ASP key is approximated is determined by the minimum number of tuples that need to be removed from $T$ so that $K$ becomes a strongly possible key. Using the $g_3$ measure, the approximation measure of the strongly possible key is

$$g_3(K) = \frac{|T| - \nu(G)}{|T|},$$

where $\nu(G)$ denotes the maximum size of a matching in graph $G$. Note that the smaller $g_3(K)$ is, the closer $K$ is to being a strongly possible key. Let $\mathcal{M}$ be the collection of connected components of $G$ that satisfy the strongly possible key condition; i.e. there is a matching covering all tuples in that component (for every $M \in \mathcal{M}$ there is no $X \subseteq M \cap T$ such that $|X| > |N(X)|$). Let $C \subseteq G$ be defined as $C = G \setminus \bigcup_{M \in \mathcal{M}} M$, and let $\mathcal{M}'$ be the set of connected components of $C$. Let $V_M$ denote the set of vertices of $T$ in a component $M$. Then the size of a maximum matching can be written as $\sum_{M \in \mathcal{M}} |V_M| + \sum_{M' \in \mathcal{M}'} \nu(M')$. Therefore, we can rewrite the measure $g_3$ as

$$g_3(K) = \frac{|T| - \left(\sum_{M \in \mathcal{M}} |V_M| + \sum_{M' \in \mathcal{M}'} \nu(M')\right)}{|T|}.$$

The measurement of strongly possible key approximation can be made more appropriate by taking into account the effect of each connected component of the graph on the matching. More specifically, the components of $\mathcal{M}$ represent the sets of tuples that do not require any tuple to be removed to get a strongly possible key, while the components of $\mathcal{M}'$ represent the sets of tuples in which some tuples need to be removed. We let the components of $\mathcal{M}$ have their effect doubled in the approximation measure, because they represent a part of the data that is not affected by any tuple removal.
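So we propose a derived version of the $g_3$ measure, named $g_4$, that considers the effects of these components:

$$g_4(K) = \frac{|T| - \left(\sum_{M \in \mathcal{M}} |V_M| + \sum_{M' \in \mathcal{M}'} \nu(M')\right)}{|T| + \sum_{M \in \mathcal{M}} |V_M|}.$$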
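The two measures can be computed along the following lines. This is a sketch under stated assumptions rather than the authors' code: it takes $g_3(K) = (|T| - \nu(G))/|T|$ and $g_4(K)$ with the same numerator over $|T| + \sum_{M \in \mathcal{M}} |V_M|$, which agrees with the values reported for tables A and B in the analytical comparison. The input `adj` (tuple id mapped to its list of weakly similar visible-domain combinations) and all function names are hypothetical.

```python
from fractions import Fraction

def max_matching(adj):
    """Size of a maximum matching, found by augmenting paths."""
    match = {}  # candidate -> tuple id currently matched to it

    def augment(u, seen):
        for c in adj[u]:
            if c not in seen:
                seen.add(c)
                if c not in match or augment(match[c], seen):
                    match[c] = u
                    return True
        return False

    return sum(augment(u, set()) for u in adj)

def components(adj):
    """Group tuple ids by connected component of the bipartite graph."""
    parent = {u: u for u in adj}
    owner = {}  # candidate -> first tuple id seen with that candidate

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, cs in adj.items():
        for c in cs:
            if c in owner:
                parent[find(u)] = find(owner[c])
            else:
                owner[c] = u
    groups = {}
    for u in adj:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())

def g3_g4(adj):
    n = len(adj)
    nu = 0        # maximum matching size of the whole graph
    covered = 0   # tuples in components whose matching covers every tuple
    for comp in components(adj):
        size = max_matching({u: adj[u] for u in comp})
        nu += size
        if size == len(comp):
            covered += len(comp)
    return Fraction(n - nu, n), Fraction(n - nu, n + covered)

# Hypothetical instance with the counts described for table A and n = 3:
# two tuples can stay as they are, the other four collide in pairs.
adj = {0: ["a"], 1: ["b"], 2: ["c"], 3: ["c"], 4: ["d"], 5: ["d"]}
print(g3_g4(adj))
```

The maximum matching is computed per component, which is valid because a maximum matching of a graph is the union of maximum matchings of its connected components.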

Analytical Comparison
We compare the approximation measures $g_3$ and $g_4$ similarly to the investigations in [14]. Figure 7 shows seven tables that represent only the key part of the data, where each table has more than one attribute. Tables A, B, and C have $2n$ tuples, tables E and F have $n$ tuples, table D has $n + l$ tuples, and table G has $kn$ tuples. Table D includes a variable $0 \le \beta \le \frac{n}{2}$. We use these cases to illustrate the differences between the two measures and to give a bound on $g_3/g_4$, where $g_3 - g_4 \ge 0$ always holds. The graphs show the weak similarity relationship between the data tuples and the visible-domain combinations. The visible-domain combinations are shown in Fig. 8.
We intend to distinguish between an sp key that holds without any tuple removal and one that requires removing at least one tuple. A connected component that has a maximum matching covering all the tuples of $T$ in that component is stronger than one that requires removing at least one tuple. In Fig. 7, tables A and B show the case that requires removing almost half of the tuples to make the strongly possible key hold. However, in table B half of the data needs no changes, while in table A only two of the $2n$ tuples can be left as they are. In table C, a quarter of the tuples must be removed to obtain a matching that covers all the remaining tuples. In table D, $l$ tuples have a complete matching in the graph, and among the rest $\beta - 1$ tuples must be removed. For table E, only one tuple can be kept after removing the rest, and table F requires one of the first two tuples to be removed. Table G has $kn$ tuples such that in each group of $n$ tuples exactly one needs to be removed, so $k$ tuples are removed in total. Table 1 shows the measures of the seven tables using $g_3$ and $g_4$. The results show that two tables may differ in their $g_4$ values although they have the same $g_3$ value.
The $g_3$ measure of table A is $g_3(A) = \frac{n-1}{2n}$, and $g_4(A) = \frac{n-1}{2n+2}$, while $g_3(B) = \frac{n-1}{2n}$ and $g_4(B) = \frac{n-1}{3n}$. Tables A and B get a lower approximation value using $g_4$ than using $g_3$, while table E has equal values for $g_3$ and $g_4$ because, unlike tables A and B, it has no components in $\mathcal{M}$. Tables F and G share a common tuple distribution: in both, one tuple out of every $n$ tuples must be removed, so the two tables have the same measure values and the same ratio. Finally, table F shows the maximum difference between $g_3$ and $g_4$, while table E shows the minimum difference.
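As a quick numerical check of the closed forms for tables A and B, take $n = 5$ (a value chosen here purely for illustration):

$$
\begin{aligned}
g_3(A) &= \frac{4}{10} = 0.4, & g_4(A) &= \frac{4}{12} \approx 0.333, & \frac{g_3(A)}{g_4(A)} &= \frac{12}{10} = 1.2,\\
g_3(B) &= \frac{4}{10} = 0.4, & g_4(B) &= \frac{4}{15} \approx 0.267, & \frac{g_3(B)}{g_4(B)} &= \frac{15}{10} = 1.5.
\end{aligned}
$$

Both tables have the same $g_3$ value, but $g_4$ is smaller for table B, where half of the tuples need no change, and both ratios lie strictly between 1 and 2.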

Theorem 6
For any table $T$ and set of attributes $K$, we have either $g_3(K) = g_4(K)$ or $1 < g_3(K)/g_4(K) < 2$. Furthermore, for any rational number $1 \le \frac{p}{q} < 2$, there exist tables with an arbitrarily large number of tuples such that $g_3(K)/g_4(K) = \frac{p}{q}$.
Proof $g_3(K)$ and $g_4(K)$ differ only in the denominator. The number of tuples in the components of $\mathcal{M}$ cannot exceed the total number of tuples in the table, so $0 \le \sum_{M \in \mathcal{M}} |V_M| \le |T|$, and $\sum_{M \in \mathcal{M}} |V_M| = |T|$ iff every tuple is a member of a connected component in $\mathcal{M}$. In the latter case $g_3(K) = g_4(K)$, since both numerators are 0. If the sum is 0, the two denominators coincide and again $g_3(K) = g_4(K)$. Otherwise the denominator of $g_4(K)$ is greater than the denominator of $g_3(K)$ but less than twice as large, which proves the inequalities for the ratio.
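Table E proves that $g_3(K) = g_4(K)$ can hold for arbitrarily large tables. Now let $1 < \frac{p}{q} < 2$ be given with $\frac{p}{q} = 1 + \frac{p'}{q'}$. Consider Table D, where

$$\frac{g_3(K)}{g_4(K)} = \frac{n + 2l}{n + l},$$

which can simply be written as $1 + \frac{l}{n+l}$. Now, taking $n = \alpha(q' - p')$, $l = \alpha p'$ and any $\beta$ between 2 and $\frac{n}{2}$, we obtain that

$$\frac{g_3(K)}{g_4(K)} = 1 + \frac{\alpha p'}{\alpha q'} = 1 + \frac{p'}{q'} = \frac{p}{q}.$$

Note that $g_3(K)$ ranges between $1/n$ and $1/2$ depending on the choice of $\beta$.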
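The construction in the proof can be checked with exact rational arithmetic. The sketch below assumes, from the description of table D, that $g_3 = (\beta - 1)/(n + l)$ and $g_4 = (\beta - 1)/(n + 2l)$. The function names and the default value of `alpha` are hypothetical.

```python
from fractions import Fraction

def table_d_parameters(p, q, alpha=1):
    """Return (n, l) for table D so that g3/g4 equals p/q, for 1 < p/q < 2."""
    ratio = Fraction(p, q)
    if not 1 < ratio < 2:
        raise ValueError("p/q must lie strictly between 1 and 2")
    frac = ratio - 1                      # p'/q' in lowest terms
    p1, q1 = frac.numerator, frac.denominator
    return alpha * (q1 - p1), alpha * p1  # n = alpha(q' - p'), l = alpha p'

def ratio_g3_g4(n, l, beta):
    """g3/g4 for table D, under the assumed closed forms stated above."""
    g3 = Fraction(beta - 1, n + l)
    g4 = Fraction(beta - 1, n + 2 * l)
    return g3 / g4

# Target ratio 7/5; alpha = 4 makes n large enough for 2 <= beta <= n/2.
n, l = table_d_parameters(7, 5, alpha=4)
print(n, l, ratio_g3_g4(n, l, beta=3))
```

Increasing `alpha` scales the table without changing the ratio, which is how arbitrarily large tables with the same $g_3/g_4$ are obtained.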

Conclusion and Future Directions
The main contributions of this paper are as follows:
- We introduced and defined strongly possible keys over database tables that contain some occurrences of nulls.
- We provided some properties and some necessary conditions for a strongly possible key to hold in a given dataset. We showed that deciding whether a given set of attributes is a strongly possible key can be done by applying matchings in a bipartite graph, so Hall's condition applies naturally.
- We provided an algorithm to validate a strongly possible key by finding a proper strongly possible world for that key, if such a world exists. The algorithm was applied to real-world datasets to determine all two-element strongly possible key sets.
- We showed that deciding whether a given system of sets of attributes is a system of strongly possible keys for a given table can be done using matroid intersection. However, we need at least three matroids, and matroid intersection of three or more matroids is NP-complete, which suggests that our problem is also NP-complete.
- We studied systems of strongly possible keys, and we gave a characterization of the implication problem.
- We showed that systems of strongly possible key constraints enjoy Armstrong instances provided they satisfy a natural necessary condition.
- We showed that certain keys are already characterized by strongly possible worlds, and one does not need to check all possible worlds.
- We introduced an approximation concept for strongly possible keys that measures how close a set of attributes is to being a strongly possible key in a data relation, using the $g_3$ measure. We derived the measure $g_4$ from $g_3$ and gave bounds relating the two measures.
Strongly possible keys are a special case of possible keys of relational schemata in which each attribute has a finite domain. Future research is therefore needed to decide which properties of implication and axiomatization of inference remain valid in this setting. Note that the main results in [20] use that at least one attribute has an infinite domain. We plan to extend our research from keys to functional dependencies. Weak and strong functional dependencies were introduced in [22]. A wFD $X \to_w Y$ holds if there is a possible world $T$ that satisfies the FD $X \to Y$, while an sFD $X \to_s Y$ holds if every possible world satisfies the FD $X \to Y$. Our strongly possible world concept naturally induces an intermediate concept of functional dependency. Future research on possible keys over finite domains might extend our results on strongly possible keys.
Funding Open access funding provided by ELKH Alfréd Rényi Institute of Mathematics.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.