RETRACTED ARTICLE: Impact of the learning set’s size

Korchi, Adil; Dardor, Mohamed; Mabrouk, El Houssine

doi:10.1007/s10639-020-10165-9

Published: 06 January 2021

Volume 25, pages 4637–4657, (2020)

Education and Information Technologies

This article was retracted on 05 January 2021

This article has been updated

Abstract

Learning techniques have proven their capacity to handle large amounts of data. Most statistical learning approaches use learning sets of a fixed size and produce static models. However, in some situations, such as incremental or active learning, the learning process may have only a small amount of data to work with. In such cases, algorithms capable of producing models from only a few examples become necessary. In the literature, classifiers are generally evaluated according to criteria such as their classification performance and their ability to sort data. This taxonomy of classifiers can change markedly, however, when one is interested in their capabilities in the presence of only a few examples. In our view, few studies have addressed this issue. It is in this sense that this paper studies a wider range of learning algorithms and data sets, in order to show the power of each chosen algorithm. The study also raises the problem of choosing an algorithm to process a small or a large amount of data. To address it, we show that some algorithms are able to generate models from little data; in that setting, we look for the smallest amount of data that allows the best learning to be achieved. We also show that some algorithms can make good predictions with little data, which is necessary to keep the labeling procedure as inexpensive as possible. To this end, we first discuss the learning speed and the typology of the tested algorithms, that is, the ability of a classifier to reach an “interesting” solution to a classification problem using a minimum of learning examples, and we present various families of classification models based on parameter learning. We then test the classifiers mentioned previously, including linear and non-linear classifiers, and study their behavior as a function of the learning set’s size through an experimental protocol in which various datasets from the classification field are split, manipulated, and evaluated. Finally, we discuss the results obtained in the global analysis section and conclude with recommendations.
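
The kind of protocol the abstract describes, measuring a classifier's accuracy as a function of the learning set's size, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the synthetic two-class data, the nearest-centroid classifier, the function names (`make_data`, `fit_centroids`, `learning_curve`), the 30% held-out test split, and the list of learning-set sizes.

```python
import random


def make_data(n, seed=0):
    """Synthetic 2-D data: two Gaussian classes (illustrative assumption)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 2.0 if label else -2.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Nearest-centroid 'model': the mean point of each class seen so far."""
    sums = {}
    for (x0, x1), label in train:
        acc = sums.setdefault(label, [0.0, 0.0, 0])
        acc[0] += x0
        acc[1] += x1
        acc[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda label: (point[0] - centroids[label][0]) ** 2
        + (point[1] - centroids[label][1]) ** 2,
    )


def accuracy(centroids, test):
    """Fraction of test examples classified correctly."""
    return sum(predict(centroids, x) == y for x, y in test) / len(test)


def learning_curve(data, sizes, test_fraction=0.3):
    """Train on the first m pool examples for each m and score on a fixed test set."""
    n_test = int(len(data) * test_fraction)
    test, pool = data[:n_test], data[n_test:]
    return [(m, accuracy(fit_centroids(pool[:m]), test)) for m in sizes]


if __name__ == "__main__":
    for m, acc in learning_curve(make_data(1000), [2, 8, 32, 128, 512]):
        print(f"learning set size {m:4d}: accuracy {acc:.3f}")
```

Keeping the test set fixed while only the learning set grows makes the accuracies comparable across sizes; a fuller protocol would repeat the procedure over several random splits and average the results.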

Change history

05 January 2021
An Erratum to this paper has been published: https://doi.org/10.1007/s10639-020-10422-x

Author information

Authors and Affiliations

Faculty of Sciences and Technologies, University Sidi Mohamed Ben Abdellah, Fez, Morocco
Adil Korchi
Department of Informatics, Faculty of Sciences, Dhar El Mehrez, Fez, Morocco
Mohamed Dardor
Faculty of Sciences and Technics, Moulay Ismail University, Errachidia, Morocco
El Houssine Mabrouk

Corresponding author

Correspondence to Adil Korchi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article has been retracted. Please see the retraction notice for more detail: https://doi.org/10.1007/s10639-020-10422-x

About this article

Cite this article

Korchi, A., Dardor, M. & Mabrouk, E.H. RETRACTED ARTICLE: Impact of the learning set’s size. Educ Inf Technol 25, 4637–4657 (2020). https://doi.org/10.1007/s10639-020-10165-9

Received: 04 January 2020
Accepted: 13 March 2020
Published: 06 January 2021
Issue Date: September 2020
DOI: https://doi.org/10.1007/s10639-020-10165-9
