Abstract
Natural language processing (NLP) aims to analyze a large amount of natural language data. The NLP computes textual data via a set of data processing elements which is sequentially connected to a path data pipeline. Several data pipelines exist for a given set of textual data with various degrees of model accuracy. Instead of trying all the possible paths, such as random search or grid search to find an optimal path, we utilized the Bayesian optimization to search along with the space of hyper-parameters learning. In this study, we proposed a data pipeline selection for NLP using Sequential Model-based Optimization (SMBO). We implemented the SMBO for the NLP data pipeline using Hyperparameter Optimization (Hyperopt) library with Tree of Parzen Estimators (TPE) model and Adaptive Tree of Parzen Estimators (A-TPE) model for a surface model with expected improvement (EI) acquired function.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Uysal, A.K., Gunal, S.: The impact of preprocessing on text classification. Inf. Process. Manage. 50(1), 104–112 (2014)
Naymat, G., Etaiwi, W.: The impact of applying different preprocessing steps on review spam detection. In: The 8th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (2017)
Alam, S., Yao, N.: The impact of preprocessing steps on the accuracy of machine learning algorithms in sentiment analysis. Comput. Math. Organ. Theory 25(3), 319–335 (2019)
Tripathy, A., Agrawal, A., Rath, S.K.: Classification of sentiment reviews using N-gram machine learning approach. Expert Syst. Appl. 57, 117–126 (2016)
Gaydhani, A., Doma, V., Kendre, S., Bhagwat, L.: Detecting hate speech and offensive language on twitter using machine learning: an N-gram and TFIDF based approach. arXiv preprint arXiv:1809.08651 (2018)
Quemy, A.: Data pipeline selection and optimization. Design, Optimization, Languages and Analytical Processing of Big Data (2019)
Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13(1), 281–305 (2012)
Mullapudi, R.T., Vasista, V., Bondhugula, U.: PolyMage: automatic optimization for image processing pipelines. ACM SIGARCH Comput. Archit. News 43(1), 429–443 (2015)
Moore, J.H., Olson, R.S.: TPOT: a tree-based pipeline optimization tool for automating machine learning, pp. 151–160 (2019)
Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Coello, C.A.C. (ed.) LION 2011. LNCS, vol. 6683, pp. 507–523. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25566-3_40
Chauhan, K., et al.: Automated machine learning: the new wave of machine learning, pp. 205–212 (2020)
Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems, pp. 2546–2554 (2011)
Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in Science Conference, vol. 13, p. 20. Citeseer (2013)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Arntong, P., Pongpech, W.A. (2021). Sequential Model-Based Optimization for Natural Language Processing Data Pipeline Selection and Optimization. In: Nguyen, N.T., Chittayasothorn, S., Niyato, D., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2021. Lecture Notes in Computer Science(), vol 12672. Springer, Cham. https://doi.org/10.1007/978-3-030-73280-6_24
Download citation
DOI: https://doi.org/10.1007/978-3-030-73280-6_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73279-0
Online ISBN: 978-3-030-73280-6
eBook Packages: Computer ScienceComputer Science (R0)