Interactively discovering and ranking desired tuples by data exploration

  • Regular Paper
The VLDB Journal

Abstract

Data exploration, the problem of extracting knowledge from a database even when we do not know exactly what we are looking for, is important for data discovery and analysis. However, precisely specifying SQL queries is not always practical for tasks such as “finding and ranking off-road cars based on a combination of Price, Make, Model, Age, Mileage, etc.”, not only because of query complexity (e.g., the queries may contain many if-then-else, and, or, and not conditions), but also because the user typically does not know all data instances (and their variants). We propose DExPlorer, a system for interactive data exploration. From the user's perspective, we provide a simple, user-friendly interface that allows the user to (1) confirm whether a tuple is desired or not, and (2) decide whether one tuple is preferred over another. Behind the scenes, we jointly use multiple ML models to learn from these two types of user feedback. Moreover, to involve the human in the loop effectively, we need to select a set of tuples to present in each user interaction so as to solicit feedback. We therefore devise question selection algorithms that consider not only the estimated benefit of each tuple, but also the possible partial orders between any two suggested tuples. Experiments on real-world datasets show that DExPlorer outperforms existing approaches in effectiveness.
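To make the two feedback types and the selection idea concrete, the following is a minimal sketch and not the authors' algorithm. The `Feedback` container, the `select_questions` function, the `benefit` scores, and the `pair_weight` parameter are all hypothetical names introduced here for illustration; the abstract does not specify how benefit is estimated or how partial orders enter the objective. The sketch simply assumes a greedy choice that rewards tuples with high benefit and tuples whose order relative to the already chosen ones is not yet implied by earlier preferences.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Hypothetical store for the two feedback types described in the abstract."""
    desired: dict = field(default_factory=dict)    # tuple id -> True/False
    preferences: set = field(default_factory=set)  # (better id, worse id)

    def confirm(self, tid, is_desired):
        """Feedback type 1: the user says whether a tuple is desired."""
        self.desired[tid] = is_desired

    def prefer(self, better, worse):
        """Feedback type 2: the user prefers one tuple over another."""
        self.preferences.add((better, worse))

    def _reaches(self, src, dst):
        """Depth-first search: is dst below src in the preference graph?"""
        stack, seen = [src], {src}
        while stack:
            cur = stack.pop()
            for better, worse in self.preferences:
                if better == cur and worse not in seen:
                    if worse == dst:
                        return True
                    seen.add(worse)
                    stack.append(worse)
        return False

    def order_known(self, a, b):
        """True if the order of a and b already follows by transitivity."""
        return self._reaches(a, b) or self._reaches(b, a)


def select_questions(benefit, feedback, k, pair_weight=0.5):
    """Greedy sketch (an assumption, not the paper's method): pick k unlabeled
    tuples, scoring each by its benefit plus a bonus for every already chosen
    tuple whose relative order with it is still unknown."""
    candidates = [t for t in benefit if t not in feedback.desired]
    chosen = []
    while candidates and len(chosen) < k:
        def gain(t):
            unknown = sum(1 for c in chosen if not feedback.order_known(t, c))
            return benefit[t] + pair_weight * unknown
        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
    return chosen


if __name__ == "__main__":
    fb = Feedback()
    fb.prefer("a", "b")  # the user already told us a is preferred over b
    scores = {"a": 0.9, "b": 0.8, "c": 0.7}  # made-up benefit estimates
    print(select_questions(scores, fb, k=2))
```

In this toy run the second pick is `c` rather than the higher-scoring `b`, because the order between `a` and `b` is already known and asking about it again would yield no new pairwise information.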

Notes

  1. The logarithm is taken to base 2, so \(u(t) \in [0,1]\) for \(e \in [0,1]\), with \(u(t)=1\) when \(e=0.5\).

  2. https://www.djangoproject.com/.

  3. http://tabulator.info/.

  4. https://www.kaggle.com/orgesleka/used-cars-database.

  5. https://www.acm.org/publications/digital-library.

  6. https://relational.fit.cvut.cz/dataset/TPCH.

  7. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

  8. https://sourceforge.net/p/lemur/wiki/RankLib/.

  9. http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.

  10. We tell the workers that we need university students to participate in a user study and ask them to provide their “.edu” email addresses. We then email a link to the user study to these addresses. Through this link, users can use DExPlorer to find their ranked desired tuples.

  11. https://appen.com.
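The properties stated in Note 1 (base-2 logarithm, range \([0,1]\), maximum of 1 at \(e=0.5\)) are exactly those of the binary Shannon entropy. The formula for \(u(t)\) is not given in this excerpt, so the sketch below assumes that form, and assumes that \(e\) is a model's estimated probability for tuple \(t\); the function name `uncertainty` is hypothetical.

```python
import math


def uncertainty(e: float) -> float:
    """Binary entropy with base-2 logarithm (assumed form of u(t)).

    e is taken to be an estimated probability in [0, 1].
    """
    if not 0.0 <= e <= 1.0:
        raise ValueError("e must lie in [0, 1]")
    if e in (0.0, 1.0):
        return 0.0  # x * log2(x) tends to 0 as x tends to 0
    return -(e * math.log2(e) + (1 - e) * math.log2(1 - e))


if __name__ == "__main__":
    for e in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(e, uncertainty(e))
```

Under this assumption, a tuple whose estimate is near 0.5 is the one the model is least sure about, which is the usual motivation for showing it to the user first.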

Acknowledgements

This work is supported by the NSF of China (61925205, 61632016, 62102215), Huawei, TAL education, the China National Postdoctoral Program for Innovative Talents (BX2021155), the China Postdoctoral Science Foundation (2021M691784), Shuimu Tsinghua Scholar, and Zhejiang Lab’s International Talent Fund for Young Professionals.

Author information

Corresponding authors

Correspondence to Chengliang Chai or Guoliang Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Qin, X., Chai, C., Luo, Y. et al. Interactively discovering and ranking desired tuples by data exploration. The VLDB Journal 31, 753–777 (2022). https://doi.org/10.1007/s00778-021-00714-0

  • DOI: https://doi.org/10.1007/s00778-021-00714-0
