xDBTagger: explainable natural language interface to databases using keyword mappings and schema graph

  • Regular Paper
The VLDB Journal

Abstract

Recently, researchers have proposed numerous solutions to the natural language interface to databases (NLIDB) problem, either as conventional pipeline-based systems or as end-to-end deep-learning-based systems. Although each approach has its own advantages and drawbacks, both behave as black boxes, which makes it difficult for potential users to understand the rationale behind the decisions the system makes to produce the translated SQL. Given that NLIDBs target users with little to no technical background, interpretable and explainable solutions are crucial, yet recent studies have overlooked this need. To this end, we propose xDBTagger, an explainable hybrid translation pipeline that explains the decisions made along the way to the user both textually and visually. We also evaluate xDBTagger quantitatively on three real-world relational databases. The results indicate that, in addition to being lightweight, fast, and fully explainable, xDBTagger is competitive in translation accuracy with both pipeline-based and end-to-end deep learning approaches.

Notes

  1. Available at https://github.com/arifusta/DBTagger

Acknowledgements

This research is supported by The Scientific and Technological Research Council of Türkiye (TÜBİTAK) under grant no. 118E724.

Author information

Corresponding author

Correspondence to Arif Usta.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Usta, A., Karakayali, A. & Ulusoy, Ö. xDBTagger: explainable natural language interface to databases using keyword mappings and schema graph. The VLDB Journal 33, 301–321 (2024). https://doi.org/10.1007/s00778-023-00809-w

  • DOI: https://doi.org/10.1007/s00778-023-00809-w
