Skip to main content

Data generation for testing and grading SQL queries


Correctness of SQL queries is usually tested by executing the queries on one or more datasets. Erroneous queries are often the results of small changes or mutations of the correct query. A mutation Q\('\) of a query Q is killed by a dataset D if Q(D) \(\ne \) Q\('\)(D). Earlier work on the XData system showed how to generate datasets that kill all mutations in a class of mutations that included join type and comparison operation mutations. In this paper, we extend the XData data generation techniques to handle a wider variety of SQL queries and a much larger class of mutations. We have also built a system for grading SQL queries using the datasets generated by XData. We present a study of the effectiveness of the datasets generated by the extended XData approach, using a variety of queries including queries submitted by students as part of a database course. We show that the XData datasets outperform predefined datasets as well as manual grading done earlier by teaching assistants, while also avoiding the drudgery of manual correction. Thus, we believe that our techniques will be of great value to database course instructors and TAs, particularly to those of MOOCs. It will also be valuable to database application developers and testers for testing SQL queries.

This is a preview of subscription content, access via your institution.

Fig. 1


  1. 1.

    A constraint solver takes as input a set of constraints and produces a result that satisfies the constraints.

  2. 2.

    Although there are some limitations in CVC4 currently, in future we may use CVC4 as an integrated solver for both string constraints and other constraints.

  3. 3.

    Since we are interested in small datasets, we set MAX_TUPLES to 32 in our experiments.

  4. 4.

    In case of aggregates, we may need to generate data, ensuring that the equality or inequality of aggregates is satisfied. Since our constrained aggregation techniques do not consider comparisons between aggregates, we currently do not handle these constraints.

  5. 5.

    Query containment can be reduced to equivalence similarly since \(Q_1 \subseteq Q_2 \equiv Q_1 \cap Q_2 =Q_1\).

  6. 6.

    We do not use any automated tool to generate mutations. The mutations generated by an automated tool may or may not be equivalent to the original query. If our dataset fails to kill some of the mutations, we would not be sure if that was because of the incompleteness of our tool or because of equivalence of mutation and the original query.

  7. 7.

    If students had been told that their queries would be graded by a tool, they would have probably taken more care to avoid such errors.


  1. 1.

    Barrett, C., Conway, C.L., Deters, M., Hadarean, L., Jovanović, D., King, T., Reynolds, A., Tinelli, C.: CVC4. In: 23rd International Conference on Computer Aided Verification (CAV’11), pp. 171–177. Springer, Berlin, Heidelberg (2011)

  2. 2.

    Barrett, C., Sebastiani, R., Seshia, S.A., Tinelli, C.: Satisfiability modulo theories. In: Biere, A., van Maaren, H., Walsh, T. (eds.) Handbook of Satisfiability, vol. 4, chap. 8. IOS Press (2009)

  3. 3.

    Barrett, C., Tinelli, C.: CVC3. In: Computer Aided Verification (CAV), pp. 298–302 (2007)

  4. 4.

    Bhangadiya, A., Chandra, B., Kar, B., Radhakrishnan, B., Reddy, K.V.M., Shah, S., Sudarshan, S.: The XDa-TA system for automated grading of SQL query assignments. In: ICDE (2015)

  5. 5.

    Binnig, C., Kossmann, D., Lo., E.: Reverse query processing. In: ICDE, pp. 506–515 (2007)

  6. 6.

    Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: STOC, pp. 77–90 (1977)

  7. 7.

    Chandra, B., Chawda, B., Shah, S., Sudarshan, S., Shah, A.: Extending XData to kill SQL query mutants in the wild. In: Sixth International Workshop on Testing Database Systems, DBTest ’13, Held in Conjunction with ACM SIGMOD, pp. 2:1–2:6 (2013)

  8. 8.

    de la Riva, C., Suárez-Cabal, M.J., Tuya, J.: Constraint-based test database generation for SQL queries. In: Workshop on Automation of Software Test (AST ’10), pp. 67–74 (2010)

  9. 9.

    Emmi, M., Majumdar, R., Sen, K.: Dynamic test input generation for database applications. In: International Symposium on Software Testing and Analysis, pp. 151–162 (2007)

  10. 10.

    Fu, X., Powell, M., Bantegui, M., Li, C.C.: Simple linear string constraints. Form. Asp. Comput. 25(6), 847–891 (2013)

    MATH  MathSciNet  Article  Google Scholar 

  11. 11.

    Gupta, B.P., Vira, D., Sudarshan, S.: X-Data: Generating Test Data for Killing SQL Mutants. In: ICDE (2010)

  12. 12.

    Jackson, D.: Alloy: A new technology for software modelling. In: Tools and Algorithms for the Construction and Analysis of Systems (TACAS), LNCS, vol. 2280, pp. 20–20 (2002)

  13. 13.

    Jayram, T.S., Kolaitis, P.G., Vee, E.: The containment problem for real conjunctive queries with inequalities. In: PODS, pp. 80–89 (2006)

  14. 14.

    Jia, Y., Harman, M.: An analysis and survey of the development of mutation testing. IEEE Trans. Softw. Eng. 37(5), 649–678 (2011)

    Article  Google Scholar 

  15. 15.

    Kapfhammer, G.M., Soffa, M.L.: A family of test adequacy criteria for database-driven applications. SIGSOFT Softw. Eng. Notes 28(5), 98–107 (2003)

    Article  Google Scholar 

  16. 16.

    Kiezun, A., Ganesh, V., Guo, P.J., Hooimeijer, P., Ernst, M.D.: HAMPI: a solver for string constraints. In: International Symposium on Software Testing and Analysis, pp. 105–116. ACM, New York, NY, USA (2009)

  17. 17.

    Klug, A.: On conjunctive queries containing inequalities. J. ACM 35(1), 146–160 (1988)

    MATH  MathSciNet  Article  Google Scholar 

  18. 18.

    Liang, T., Reynolds, A., Tinelli, C., Barrett, C., Deters, M.: A DPLL (T) theory solver for a theory of strings and regular expressions. In: Computer Aided Verification, pp. 646–662. Springer, Berlin (2014)

  19. 19.

    Mller, A.: Automaton Pakage dk.brics.automaton.

  20. 20.

    Olston, C., Chopra, S., Srivastava, U.: Generating example data for dataflow programs. In: SIGMOD Conference, pp. 245–256 (2009)

  21. 21.

    Pan, K., Wu, X., Xie, T.: Database state generation via dynamic symbolic execution for coverage criteria. In: Fourth International Workshop on Testing Database Systems, DBTest ’11, Held in Conjunction with ACM SIGMOD, pp. 4:1–4:6 (2011)

  22. 22.

    Pan, K., Wu, X., Xie, T.: Automatic test generation for mutation testing on database applications. In: 8th International Workshop on Automation of Software Test (AST), pp. 111–117 (2013)

  23. 23.

    Pan, K., Wu, X., Xie, T.: Guided test generation for database applications via synthesized database interactions. ACM TOSEM (2014)

  24. 24.

    SMT-LIB—The Satisfiability Modulo Theories Library.

  25. 25.

    SQLFpc—Generation of Full Predicate Coverage Rules for SQL Database Queries (MCDC for SQL) (2014).

  26. 26.

    Saxena, P., Akhawe, D., McCamant, S., Song, D.: KALUZA.

  27. 27.

    Shah, S., Sudarshan, S., Kajbaje, S., Patidar, S., Gupta, B.P., Vira, D.: Generating test data for killing SQL mutants: a constraint-based approach. In: ICDE (2011)

  28. 28.

    Silberschatz, A., Korth, H.F., Sudarshan, S.: Database System Concepts, 6th edn. McGraw Hill, New York (2010)

    Google Scholar 

  29. 29.

    Tuya, J., Cabal, M.J.S., de la Riva, C.: Full predicate coverage for testing SQL database queries. Softw. Test. Verif. Reliab. 20(3), 237–288 (2010)

    Article  Google Scholar 

  30. 30.

    Tuya, J., Suarez-Cabal, M.J., de la Riva, C.: Mutating database queries. Inf. Softw. Technol. 49(4), 398–417 (2007)

    Article  Google Scholar 

  31. 31.

    van der Meyden, R.: The complexity of querying indefinite data about linearly ordered domains. In: ACM PODS, pp. 331–345 (1992)

  32. 32.

    Veanes, M., Grigorenko, P., de Halleux, P., Tillmann, N.: Symbolic query exploration. In: ICFEM, pp. 49–68 (2009)

  33. 33.

    Veanes, M., de Halleux, P., Tillmann, N.: Rex: Symbolic regular expression explorer. In: ICST, pp. 498–507 (2010)

  34. 34.

    Veanes, M., Tillmann, N., de Halleux, J.: Qex: Symbolic SQL query explorer. In: LPAR, pp. 425–446 (2010)

  35. 35.

    Zhang, L., Xie, T., Zhang, L., Tillmann, N., De Halleux, J., Mei, H.: Test generation via dynamic symbolic execution for mutation testing. In: IEEE International Conference on Software Maintenance (ICSM), pp. 1–10 (2010)

Download references


We would like to thank Tata Consultancy Services (TCS), India, for partially funding this project through a grant and a PhD fellowship. We would also like to thank Amol Bhangdiya, Bharath Radhakrishnan and Ankit Shah for their help in writing some parts of the code and running some experiments.

Author information



Corresponding author

Correspondence to Bikash Chandra.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 385 KB)



Appendix A: Cardinality estimation for join inputs

The tuple estimation for each relation for constrained aggregation on join result is done in three steps. First we construct a join graph. Then we infer attributes to be added to uniqueElements and singleValuedAttributes. In the third step, we assign cardinality to each relation such that the resulting number of tuples is n.

Step 1: Construct Join Graph

We construct a join graph G = (R, E), with each relation in the query as a vertex. The join conditions from one table to another are represented by a single edge between the nodes. Figure 2 shows a join graph involving relations A, B and C. There are join conditions between A and B, and between B and C. However, there are no join conditions between A and C. Inferred join equalities are also added to the graph. For example, the join conditions A.a = B.b and B.b = C.c imply that A.a = C.c is also a join condition and hence it would be added to the graph. Note that this may introduce a cycle in the graph; our algorithm can work with cyclic join graphs.

Fig. 2

Join graph

Step 2: Infer Attribute Properties

Next we apply the following sets of rules to infer properties of attributes

Rule 1::

Every group by attribute is a single-valued attribute.

Rule 2::

Every set of attributes declared as primary key or unique key is unique in the group.

Rule 3::

Every attribute which appears in conjuncts of the form A.a=constant is a single-valued attribute.

Rule 4::

If each attribute of any uniqueElements[\(R_{i}\)] is a single-valued attribute, then all attributes of that relation are single-valued attributes.

Rule 5::

If any attribute, \(R_{i}.x\), is a single-valued attribute, then every attribute of equivalence class (Sect. 2.2) in which \(R_{i}.x\) is present becomes a single-valued attribute. For example, if the join condition is A.a = B.a and A.a is single-valued, B.a also becomes single-valued.

Rule 6::

If an attribute of a unique element is single-valued, then remaining attributes of unique element become unique. We apply this rule recursively on the unique element to get a minimal unique element. We then drop all non-minimal sets from uniqueElements. For example, if (A.a, A.b, A.c) is unique and A.a is single-valued, then (A.b, A.c) is unique and is added to \(uniqueElements[R_i]\). In this case (A.a, A.b, A.c) is dropped from \(uniqueElements[R_i]\).

The rules are applied according to Algorithm 2 to infer which attributes are added to uniqueElements and which to singleValuedAttributes.

Step 3: Assign Cardinality

We define some more terms

  • \(joinAttributes[R_{i},R_{j}]\): attributes of relation \(R_{i}\) that are involved in join conditions with relation \(R_{j}\).

  • \(unique[R_{i},R_{j}]\): \(\{ S_k \mid S_k \subseteq \text{ joinAttributes }[R_i,R_j] \wedge S_k \in \text{ uniqueElements }[R_i] \}\).

  • \(n_{R_i}\): number of tuples assigned to relation \(R_i\).


In order to find the number of tuples for each relation, we use the attributes inferred using Algorithm 2 along with the following rules.

Rule 7::

If \(n_{R_{i}}\)=\(n,n>1\) and \(unique[R_{i}, R_{j}]\ne \emptyset \) then \(n_{R_{j}}\) is set to n. We also infer further unique elements as follows. For each \(S_k \in \textit{unique}[R_i, R_j]\), let \(S'_k\) be the attributes from \(R_j\) that are equated to \(S_k\). Then add \(S'_k\) to uniqueElements[\(R_j\)].

The intuition behind Rule 7 is as follows. Consider the join of two relations A and B. Let the join condition be \(A.a=B.a\) and suppose that \(\{A.a\} \in \) uniqueElements[A]. Here joinAttributes [A, B]={A.a}, joinAttributes [BA]={B.a}, unique [A, B]={A.a} and unique [B, A]=\(\emptyset \). If the cardinality of A is n, since A.a is unique, it must have n different values. The relation B has join condition with A.a which belongs to uniqueElements[A]. So B must contain n tuples with distinct values for the attribute B.a across n tuples and each value matches with the value of A.a for one of the tuples in \(R_i\). So the cardinality of B becomes n and B.a becomes a unique attribute.

Implementation Rule 1: If \(n_{R_{i}}\)=\(n,n>1\) and \(R_{i}\) has a multi-attribute unique element, mu, such that every attribute of mu participates in some join conditions but joinAttributes \([R_i,R_j]\subset \) mu for all j, then for at least one relation \(R_k\) that joins with \(R_i\) joinAttributes \([R_i,R_k]\) is unique and \(n_{R_k}=n\). One such \(R_k\) is picked, and we add joinAttributes \([R_i,R_k]\) to uniqueElements \([R_i]\) and joinAttributes \([R_k,R_i]\) to \(uniqueElemen-ts[R_k]\).

The intuition is as follows. Consider the join graph shown in Fig. 2. Let joinConds[A, B]={A.a=B.a}, joinConds[B, C]={B.b=C.b}. Let (B.a, B.b) be unique. Here, joinAttributes [A, B] = {A.a}, joinAttributes [B, A] = {B.a}, joinAttributes [B, C]={B.b} and joinAttributes [C, B]={C.b}. Further, unique [A, B]=\(\emptyset \), unique [B, A]=\(\emptyset \), unique [B, C]=\(\emptyset \), unique [C, B]=\(\emptyset \).

Suppose cardinality of B is n. Since unique[B, A] = \(\emptyset \), it is possible that \(n_{A}\) =1 such that A.a matches with all values of B.a across n tuples. Here B.a contains same value across n tuples. Similarly, we can choose \(n_{C}\) = 1, and B.b will have the same across n tuples. Now both B.a and B.b have same values across all n tuples. But (B.a, B.b) must be unique across n tuples. So the assignment of cardinalities is incorrect. Hence, at least one of B.a or B.b must be chosen to be unique, and this will cause one \(n_A\) or \(n_B\) to be n.

Note that in this example had (B.a, B.b, B.c) been unique, every attribute of mu does not participate in any of the join conditions. In this case, the rule is not applicable and both A and C may have a cardinality of 1. To generate n tuples for B such that the join results in n tuples, B.c can have n distinct values, while B.a and B.b have same values corresponding to A.a, C.b, respectively.

We differentiate this rule from others since this rule can have several possible outcomes as opposed to the other rules for which the outcome is definite and unique. One outcome is chosen. The choice of which of the joining relations is assigned cardinality as n can be made by the solver or as heuristic the choice can be made arbitrarily; we describe these below.

Cardinality Inference Algorithm

Let the aggregated attribute be R.a. For getting the cardinality of each relation, using the rules and the given join conditions of the relations we can encode the tuple assignment problem in the form of constraints in CVC3. We add the following constraints in CVC3.

  • constraints ascertaining singleValuedAttributes and uniqueElements for each relation

  • for each relation such that all attributes are single-valued (Rule 4) constraints to ensure that the number of tuples is 1

  • constraints for Rule 7 and the Implementation Rule 1 for all the relations in the query as applicable

  • constraints to ensure that the final count after joining the tables is n

  • in case n values are required for some attribute R.a to satisfy some aggregate condition, we add constraints to ensure that the relation R has n tuples. For example, consider a case where \(SUM(R.a) = 17\), where a is an integer attribute and there is a constraint \(R.a\le 5\), we need at least four tuples for the given group of R and they cannot all be the same. It is not possible to satisfy the aggregation condition if we assign a single tuple to R, the join of R with other relations produces four tuples for the group. Similar is the case with SUM DISTINCT on an integer attribute.

On solving this set of constraints, we get the number of tuples for each relation.

The constraint approach for tuple generation works well if the number of attributes is not very large. In practice, we use a simple and fast heuristic approach described as follows. If any non-empty set of attributes of a relation forms a unique element and every attribute of that unique element is a single-valued attribute, then that relation must contain a single tuple (explained in Rule 4). For such relations, the only possible choice of cardinality is 1. Of the remaining relations, the heuristic algorithm chooses one relation and assigns to it a cardinality of n, making it the root node. The count of all other nodes of the join graph, \(n_{R_{i}}\) is initialized as 1. The root node (\(R_{r}\)) is then used as a starting relation to calculate the actual cardinality for other relations using Rule 7 and Implementation Rule 1. The procedure for this is described Algorithm 4 of Appendix C (Online resource). If the heuristic fails, we use the constraint approach.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chandra, B., Chawda, B., Kar, B. et al. Data generation for testing and grading SQL queries. The VLDB Journal 24, 731–755 (2015).

Download citation


  • Mutation testing
  • Test data generation
  • SQL query grading