
Optimizing Natural Language Processing Pipelines: Opinion Mining Case Study

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11896)

Abstract

This research presents NLP-Opt, an Auto-ML technique for optimizing pipelines of machine learning algorithms that can be applied to different Natural Language Processing (NLP) tasks. The selection of algorithms and their parameters is modelled as an optimization problem, and a technique based on the metaheuristic Population-Based Incremental Learning (PBIL) is proposed to find an optimal combination. For validation, the approach is applied to a standard opinion mining problem, where NLP-Opt effectively optimizes the algorithms and parameters of the pipelines. Additionally, NLP-Opt outputs probabilistic information about the optimization process, revealing the most relevant pipeline components. The proposed technique can be applied to different NLP problems, and the information it provides can be used by researchers to gain insight into the characteristics of the best-performing pipelines. The source code is made available for other researchers. In contrast with other Auto-ML approaches, NLP-Opt offers a flexible mechanism for designing generic pipelines for NLP problems. Furthermore, its probabilistic model provides a more comprehensive approach to the Auto-ML problem, enriching researchers' understanding of the possible solutions.
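To make the idea concrete, the following minimal Python sketch illustrates how a PBIL loop can treat pipeline design as an optimization over categorical choices. It is not the authors' released NLP-Opt code: the search space, helper names, learning rate, and scikit-learn components are illustrative assumptions chosen only to show the mechanism of sampling pipelines from a probability model and shifting that model toward the best-scoring samples.

```python
# Minimal PBIL-style sketch of pipeline selection (illustrative only; not the
# authors' NLP-Opt implementation). Each pipeline decision is a categorical
# "gene"; PBIL keeps a probability vector per gene, samples candidate
# pipelines, and moves the probabilities toward the best-scoring sample.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Hypothetical search space: one categorical gene per pipeline decision.
SPACE = {
    "vectorizer": [CountVectorizer, TfidfVectorizer],
    "ngram_max":  [1, 2, 3],
    "classifier": [LogisticRegression, LinearSVC],
}

def sample(probs, rng):
    """Draw one pipeline configuration from the current probability model."""
    return {g: rng.choice(len(opts), p=probs[g]) for g, opts in SPACE.items()}

def build(cfg):
    """Instantiate a scikit-learn pipeline from a sampled configuration."""
    vec = SPACE["vectorizer"][cfg["vectorizer"]](
        ngram_range=(1, SPACE["ngram_max"][cfg["ngram_max"]]))
    clf = SPACE["classifier"][cfg["classifier"]]()
    return make_pipeline(vec, clf)

def pbil(texts, labels, generations=20, pop=10, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a uniform distribution over every categorical choice.
    probs = {g: np.full(len(opts), 1.0 / len(opts)) for g, opts in SPACE.items()}
    best_cfg, best_score = None, -np.inf
    for _ in range(generations):
        population = [sample(probs, rng) for _ in range(pop)]
        scores = [cross_val_score(build(c), texts, labels, cv=3).mean()
                  for c in population]
        elite = population[int(np.argmax(scores))]
        if max(scores) > best_score:
            best_cfg, best_score = elite, max(scores)
        # PBIL update: shift each gene's distribution toward the elite choice.
        for g in probs:
            target = np.zeros_like(probs[g])
            target[elite[g]] = 1.0
            probs[g] = (1 - lr) * probs[g] + lr * target
    # The final probability vectors play the role of the probabilistic model
    # described in the abstract: they indicate which components dominated.
    return best_cfg, best_score, probs
```

In this sketch the returned probability vectors serve the same explanatory purpose the abstract attributes to NLP-Opt's probabilistic output: components that consistently appear in high-scoring pipelines accumulate probability mass and can be inspected directly.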

Keywords

Natural Language Processing · Pipeline optimization · Metaheuristics · Opinion mining

Notes

Acknowledgments

This research has been supported by a Carolina Foundation grant in agreement with the University of Alicante and the University of Havana. This work has also been partially funded by both aforementioned universities, the Generalitat Valenciana and the Spanish Government through the projects SIIA (PROMETEU/2018/089), LIVINGLANG (RTI2018-094653-B-C22) and INTEGER (RTI2018-094649-B-I00).


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. School of Math and Computer Science, University of Havana, Havana, Cuba
  2. University Institute for Computing Research (IUII), University of Alicante, Sant Vicent del Raspeig, Spain
  3. Department of Languages and Computing Systems, University of Alicante, Sant Vicent del Raspeig, Spain
