Abstract
There is an increasing gap between the fast growth of data and the limited human ability to comprehend data. Consequently, there has been a growing demand for data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we propose an interactive data exploration system as a new database service, using an approach called “explore-by-example.” Our new system is designed to assist the user in performing highly effective data exploration while reducing the human effort in the process. We cast the explore-by-example problem in a principled “active learning” framework. However, traditional active learning suffers from two fundamental limitations: slow convergence and lack of robustness under label noise. To overcome these two limitations, we bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. Evaluation results using real-world datasets and user interest patterns show that our new system, in both the noise-free case and the label noise case, significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving the desired efficiency for interactive data exploration.
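To make the explore-by-example loop concrete, the following is a minimal sketch of generic pool-based active learning. It is an illustration only and not the algorithms proposed in this work. The function name `explore_by_example`, the simulated user `ask_user`, and the restriction to a one-dimensional threshold query of the form `x <= t` are all assumptions chosen to keep the example short, and the sketch assumes noise-free labels.

```python
def explore_by_example(pool, ask_user, budget):
    """Learn a threshold query `x <= t` from user-labeled examples.

    pool     : list of numeric attribute values (the unlabeled tuples)
    ask_user : callable returning True if the user marks a tuple as relevant
               (hypothetical stand-in for the human in the loop)
    budget   : maximum number of tuples the user is asked to label
    """
    # Every tuple <= lo is treated as relevant, every tuple >= hi as irrelevant.
    # Initially both bounds lie outside the data, so nothing is decided yet.
    lo = min(pool) - 1
    hi = max(pool) + 1
    asked = 0
    while asked < budget:
        # Tuples whose label is not yet implied by the labels seen so far.
        uncertain = [x for x in pool if lo < x < hi]
        if not uncertain:
            break  # every tuple in the pool is classified
        # Query the tuple closest to the middle of the uncertain region,
        # a simple form of uncertainty sampling.
        mid = (lo + hi) / 2
        query = min(uncertain, key=lambda x: abs(x - mid))
        if ask_user(query):
            lo = query  # relevant: the threshold is at least this large
        else:
            hi = query  # irrelevant: the threshold is below this value
        asked += 1
    # The learned user interest query is `x <= lo`.
    return lo, asked


# Simulated session: the (hidden) user interest is `x <= 420`.
pool = list(range(0, 1000, 5))
threshold, asked = explore_by_example(pool, lambda x: x <= 420, budget=len(pool))
print(f"learned query: x <= {threshold} after {asked} labeled examples")
```

In this toy setting each label shrinks the region of undecided tuples, so the user labels far fewer tuples than the pool contains. Real user interests span several attributes and may carry label noise, which is where the slow convergence and robustness issues named above arise.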
Notes
In our work, we use the term user interest query to refer to the final query that represents the user interest, and user interest model to refer to an intermediate model before it converges to the true user interest.
This work considers a dataset that consists of a single table. Extending it to join queries is left to future work.
Ray provides fast distributed computing at https://ray.io/.
The content is extracted from http://www.teoalida.com/.
Acknowledgements
This work was supported in part by the European Research Council (ERC) Horizon 2020 research and innovation programme (Grant No. 725561), Agence Nationale de la Recherche (ANR), and Université Paris-Saclay.
About this article
Cite this article
Huang, E., Diao, Y., Liu, A. et al. Efficient and robust active learning methods for interactive database exploration. The VLDB Journal (2023). https://doi.org/10.1007/s00778-023-00816-x