Efficient and robust active learning methods for interactive database exploration

  • Special Issue Paper
The VLDB Journal

Abstract

There is an increasing gap between the fast growth of data and the limited human ability to comprehend it. Consequently, there has been a growing demand for data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we propose an interactive data exploration system as a new database service, using an approach called “explore-by-example.” Our new system is designed to assist the user in performing highly effective data exploration while reducing the human effort in the process. We cast the explore-by-example problem in a principled “active learning” framework. However, traditional active learning suffers from two fundamental limitations: slow convergence and lack of robustness under label noise. To overcome these limitations, we bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. Evaluation results using real-world datasets and user interest patterns show that our new system, both in the noise-free case and in the label noise case, significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving the desired efficiency for interactive data exploration.
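
As a rough illustration of the explore-by-example loop described above, the following toy sketch may help. It is not the algorithm proposed in this paper: it assumes a hypothetical one-attribute interest query of the form `x >= t`, a noise-free user (the `oracle` callable), and a simple "label the most uncertain tuple" rule. The function name `explore_by_example` and its parameters are illustrative assumptions only.

```python
def explore_by_example(values, oracle, max_labels=20):
    """Toy active-learning loop for a hypothetical 1-D query `x >= t`.

    Assumptions (not from the paper): labels are noise-free and the
    user interest is a single threshold on one attribute.

    The pair (lo, hi) tracks what the labels so far imply:
      lo = largest value labeled "not interesting" (or a sentinel),
      hi = smallest value labeled "interesting" (or a sentinel).
    Tuples strictly between lo and hi are still uncertain.
    """
    data = sorted(values)
    lo, hi = data[0] - 1, data[-1] + 1  # sentinels outside the data range
    labeled = 0
    while labeled < max_labels:
        # Tuples whose label cannot yet be inferred from earlier answers.
        candidates = [x for x in data if lo < x < hi]
        if not candidates:
            break  # every tuple's label is now determined
        # Ask about the tuple nearest the middle of the uncertain region.
        mid = (lo + hi) / 2
        x = min(candidates, key=lambda v: abs(v - mid))
        if oracle(x):
            hi = x  # x is interesting, so the threshold is at most x
        else:
            lo = x  # x is not interesting, so the threshold is above x
        labeled += 1
    # The learned query is `x >= hi`.
    return hi, labeled


# Usage: the simulated user is interested in tuples with value >= 637.
values = list(range(1000))
threshold, labeled = explore_by_example(values, lambda x: x >= 637)
print(threshold, labeled)  # labels requested is far below len(values)
```

In this simplified setting the loop recovers the query after labeling only a small fraction of the tuples. A single wrong label would permanently mislead it, which is one way to see why robustness to label noise matters.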

Notes

  1. In our work, we use the term user interest query to refer to the final query that represents the user interest, and user interest model to refer to an intermediate model before it converges to the true user interest.

  2. This work considers a dataset that consists of a single table. It is left to our future work to study the extension for join queries.

  3. Feature selection to filter irrelevant attributes is addressed in our previous paper [40]. Due to space constraints, in this paper we assume that feature selection is performed as in [40] and focus on other data exploration issues.

  4. Ray provides fast distributed computing at https://ray.io/.

  5. https://scikit-learn.org/stable/.

  6. The content is extracted from http://www.teoalida.com/.

References

  1. Abouzied, A., Angluin, D., et al.: Learning and verifying quantified boolean queries by example. In: PODS, pp. 49–60 (2013)

  2. Abouzied, A., Hellerstein, J.M., Silberschatz, A.: Playful query specification with dataplay. PVLDB 5(12), 1938–1941 (2012)

  3. Agarwal, A., Garg, R., Chaudhury, S.: Greedy search for active learning of OCR. In: ICDAR, pp. 837–841 (2013)

  4. Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the surprising behavior of distance metrics in high dimensional spaces. In: ICDT, pp. 420–434. Springer, Berlin (2001)

  5. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: ECML, pp. 39–50 (2004)

  6. Amer-Yahia, S., et al.: INODE: building an end-to-end data exploration system in practice. SIGMOD Rec. 50(4), 23–29 (2021)

  7. Amer-Yahia, S., Milo, T., Youngmann, B.: Exploring ratings in subjective databases. In: SIGMOD, pp. 62–75. ACM (2021)

  8. Bach, S.H., He, B., Ratner, A., Ré, C.: Learning the structure of generative models without labeled data. In: ICML, pp. 273–282 (2017)

  9. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. TOMS 22(4), 469–483 (1996)

  10. Bellare, K., Iyengar, S., Parameswaran, A., Rastogi, V.: Active sampling for entity matching with guarantees. ACM Trans. Knowl. Discov. Data 7(3), 12:1-12:24 (2013)

  11. Berthon, A., Han, B., Niu, G., Liu, T., Sugiyama, M.: Confidence scores make instance-dependent label-noise learning possible. arXiv preprint arXiv:2001.03772 (2020)

  12. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT, pp. 92–100 (1998)

  13. Bordes, A., Ertekin, S., Weston, J., et al.: Fast kernel classifiers with online and active learning. JMLR 6, 1579–1619 (2005)

  14. Bouguelia, M., Nowaczyk, S., Santosh, K.C., Verikas, A.: Agreeing to disagree: active learning with noisy labels without crowdsourcing. IJMLC 9(8), 1307–1319 (2018)

  15. Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. J. Artif. Intell. Res. 11, 131–167 (1999)

  16. Campbell, C., Cristianini, N., Smola, A.J.: Query learning with large margin classifiers. In: ICML, pp. 111–118 (2000)

  17. Cheng, J., Liu, T., et al.: Learning with bounded instance and label-dependent label noise. In: ICML, pp. 1789–1799. PMLR (2020)

  18. Cheung, A., Solar-Lezama, A.: Computer-assisted query formulation. Found. Trends Program. Lang. 3(1), 1–94 (2016)

  19. Cheung, A., Solar-Lezama, A., et al.: Using program synthesis for social recommendations. In: CIKM, pp. 1732–1736. ACM (2012)

  20. Cuendet, S., Hakkani-Tür, D., et al.: Automatic labeling inconsistencies detection and correction for sentence unit segmentation in conversational speech. In: International Workshop on MLMI, pp. 144–155 (2007)

  21. Dash, D., Rao, J., Megiddo, N., et al.: Dynamic faceted search for discovery-driven analysis. In: CIKM, pp. 3–12 (2008)

  22. Diao, Y., et al.: AIDE: an automatic user navigation system for interactive data exploration. PVLDB 8(12), 1964–1967 (2015)

  23. Dimitriadou, K., Papaemmanouil, O., Diao, Y.: Explore-by-example: an automatic query steering framework for interactive data exploration. In: SIGMOD, pp. 517–528 (2014)

  24. Dimitriadou, K., Papaemmanouil, O., Diao, Y.: Aide: an active learning-based approach for interactive data exploration. TKDE 28(11), 2842–2856 (2016)

  25. Du, J., Cai, Z.: Modelling class noise with symmetric and asymmetric distributions. In: AAAI, pp. 2589–2595 (2015)

  26. El, O.B., Milo, T., Somech, A.: Automatically generating data exploration sessions using deep reinforcement learning. In: SIGMOD, pp. 1527–1537. ACM (2020)

  27. El-Yaniv, R., Wiener, Y.: Active learning via perfect selective classification. JMLR 13(1), 255–279 (2012)

  28. Ertekin, S., Huang, J., et al.: Learning on the border: active learning in imbalanced data classification. In: CIKM, pp. 127–136 (2007)

  29. Esmailoghli, M., Quiané-Ruiz, J., et al.: COCOA: correlation coefficient-aware data augmentation. In: EDBT, pp. 331–336 (2021)

  30. Fernandez, R.C., Abedjan, Z., et al.: Aurum: a data discovery system. In: ICDE, pp. 1001–1012 (2018)

  31. Frénay, B., Verleysen, M.: Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst. 25(5), 845–869 (2014)

  32. Gamberger, D., Lavrač, N., Džeroski, S.: Noise elimination in inductive concept learning: a case study in medical diagnosis. In: International workshop on ALT, pp. 199–212. Springer (1996)

  33. Garnett, R., et al.: Bayesian optimal active search and surveying. In: ICML, pp. 843–850 (2012)

  34. Grünbaum, B.: Convex polytopes, 2nd edn. In: Convex Polytopes. Springer, New York (2003)

  35. Hanneke, S.: Rates of convergence in active learning. Ann. Stat. 39(1), 333–361 (2011)

  36. Hanneke, S.: Theory of disagreement-based active learning. Found. Trends Mach. Learn. 7(2–3), 131–309 (2014)

  37. Hanneke, S.: Refined error bounds for several learning algorithms. J. Mach. Learn. Res. 17(1), 4667–4721 (2016)

  38. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2001)

  39. Huang, E.: Active Learning Methods for Interactive Exploration on Large Databases. Theses, Institut Polytechnique de Paris (2021)

  40. Huang, E., Peng, L., et al.: Optimization for active learning-based interactive database exploration. PVLDB 12(1), 71–84 (2018)

  41. Ipeirotis, P.G., et al.: Repeated labeling using multiple noisy labelers. Data Min. Knowl. Discov. 28(2), 402–441 (2014)

  42. Jacobs, B.E., Walczak, C.A.: A generalized query-by-example data manipulation language based on database logic. IEEE Trans. Softw. Eng. 9(1), 40–57 (1983)

  43. Kahng, M., et al.: Interactive browsing and navigation in relational databases. PVLDB 9(12), 1017–1028 (2016)

  44. Kalinin, A., Cetintemel, U., Zdonik, S.: Interactive data exploration using semantic windows. In: SIGMOD, pp. 505–516 (2014)

  45. Kamat, N., Jayachandran, P., Tunga, K., Nandi, A.: Distributed and interactive cube exploration. In: ICDE, pp. 472–483 (2014)

  46. Lay, S.R.: Convex Sets and Their Applications. Dover Publications, Mineola (2007)

  47. Li, H., Chan, C.-Y., Maier, D.: Query from examples: an iterative, data-driven approach to query construction. PVLDB 8(13), 2158–2169 (2015)

  48. Lin, C.H., Mausam, Weld, D.S.: Re-active learning: active learning with relabeling. In: AAAI, pp. 1845–1852 (2016)

  49. Liu, W., Diao, Y., Liu, A.: An analysis of query-agnostic sampling for interactive data exploration. Commun. Stat. Theory Methods 47(16), 3820–3837 (2018)

  50. Ma, Y., Garnett, R., Schneider, J.G.: \(\Sigma \)-optimality for active learning on Gaussian random fields. In: NIPS, pp. 2751–2759 (2013)

  51. Menon, A.K., et al.: Learning from binary labels with instance-dependent noise. Mach. Learn. 107(8), 1561–1595 (2018)

  52. Miranda, A.L.B., et al.: Use of classification algorithms in noise detection and elimination. In: HAIS, pp. 417–424 (2009)

  53. Mottin, D., et al.: Exemplar queries: give me an example of what you need. PVLDB 7(5), 365–376 (2014)

  54. Neamtu, R., et al.: Interactive time series analytics powered by ONEX. In: SIGMOD, pp. 1595–1598 (2017)

  55. Omidvar-Tehrani, B., Personnaz, A., et al.: Guided text-based item exploration. In: CIKM, pp. 3410–3420. ACM (2022)

  56. Özsoyoglu, G., Wang, H.: Example-based graphical database query languages. Computer 26(5), 25–38 (1993)

  57. Palma, L.D.: New Algorithms and Optimizations for Human-in-the-Loop Model Development. Ph.D. thesis, Polytechnic Institute of Paris, France (2021)

  58. Qin, X., et al.: Interactively discovering and ranking desired tuples by data exploration. VLDB J. 31(4), 753–777 (2022)

  59. Ratner, A., et al.: Snorkel: rapid training data creation with weak supervision. Proc. VLDB Endow. 11(3), 269–282 (2017)

  60. Ratner, A.J., et al.: Data programming: creating large training sets, quickly. In: NeurIPS, pp. 3567–3575 (2016)

  61. Roy, S.B., et al.: Minimum-effort driven dynamic faceted search in structured databases. In: CIKM, pp. 13–22 (2008)

  62. Roy, S.B., et al.: DynaCet: building dynamic faceted search systems over databases. In: ICDE, pp. 1463–1466 (2009)

  63. Santos, A.S.R., et al.: A sketch-based index for correlated dataset search. In: ICDE, pp. 2928–2941 (2022)

  64. Scott, C., et al.: Classification with asymmetric label noise: consistency and maximal denoising. In: COLT, pp. 489–511 (2013)

  65. Seleznova, M., et al.: Guided exploration of user groups. PVLDB 13(9), 1469–1482 (2020)

  66. Settles, B.: Active Learning. Synthesis Lectures on Artificial Intelligence & Machine Learning. Morgan Claypool Publishers, New York (2012)

  67. Shen, Y., et al.: Discovering queries based on example tuples. In: SIGMOD, pp. 493–504 (2014)

  68. Sloan digital sky survey: DR8 sample SQL queries. http://skyserver.sdss.org/dr8/en/help/docs/realquery.asp

  69. Szalay, A., et al.: Designing and mining multi-terabyte astronomy archives: the Sloan digital sky survey. In: SIGMOD, pp. 451–462 (2000)

  70. Tang, B., et al.: Determining the impact regions of competing options in preference space. In: SIGMOD, pp. 805–820 (2017)

  71. Tang, F.: Bidirectional active learning with gold-instance-based human training. In: IJCAI, pp. 5989–5996 (2019)

  72. Tang, Y., et al.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B 39(1), 281–288 (2009)

  73. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. JMLR 2, 45–66 (2002)

  74. Tran, Q.T., Chan, C.-Y., Parthasarathy, S.: Query by output. In: SIGMOD, pp. 535–548 (2009)

  75. Vanchinathan, H.P., et al.: Discovering valuable items from massive data. In: SIGKDD, pp. 1195–1204 (2015)

  76. Varma, P., et al.: Inferring generative model structure with static analysis. In: NeurIPS, pp. 240–250 (2017)

  77. Varma, P., Ré, C.: Snuba: automating weak supervision to label training data. Proc. VLDB Endow. 12(3), 223–236 (2018)

  78. Ventura, F., et al.: Expand your training limits! Generating training data for ML-based data management. In: SIGMOD, pp. 1865–1878 (2021)

  79. Wallace, B.C., Dahabreh, I.J.: Class probability estimates are unreliable for imbalanced data (and how to fix them). In: ICDM, pp. 695–704. IEEE Computer Society (2012)

  80. Wu, G., Chang, E.Y.: KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Trans. Knowl. Data Eng. 17(6), 786–795 (2005)

  81. Youngmann, B., Amer-Yahia, S., Personnaz, A.: Guided exploration of data summaries. PVLDB 15(9), 1798–1807 (2022)

  82. Zhang, J., Wu, X., et al.: Active learning with imbalanced multiple noisy labeling. IEEE Trans. Cybern. 45(5), 1081–1093 (2015)

  83. Zhang, X., Wang, S., Yun, X.: Bidirectional active learning: a two-way exploration into unlabeled and labeled data set. IEEE Trans. Neural Netw. Learn. Syst. 26(12), 3034–3044 (2015)

  84. Zhao, Z., et al.: Controlling false discoveries during interactive data exploration. In: SIGMOD, pp. 527–540 (2017)

  85. Zhu, X., et al.: Budget constrained interactive search for multiple targets. PVLDB 14(6), 890–902 (2021)

Acknowledgements

This work was supported in part by the European Research Council (ERC) Horizon 2020 research and innovation programme (Grant No. 725561), Agence Nationale de la Recherche (ANR), and Université Paris-Saclay.

Author information

Corresponding author

Correspondence to Enhui Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huang, E., Diao, Y., Liu, A. et al. Efficient and robust active learning methods for interactive database exploration. The VLDB Journal (2023). https://doi.org/10.1007/s00778-023-00816-x

  • DOI: https://doi.org/10.1007/s00778-023-00816-x
