
Classification of Poverty Condition Using Natural Language Processing


This work introduces a methodology to distinguish between poor and extremely poor people through Natural Language Processing. The approach serves as a baseline for understanding and classifying poverty from people's discourse using machine learning algorithms. Based on classical and modern word vector representations, we propose two strategies for document-level representations: (1) document-level features based on the concatenation of descriptive statistics and (2) Gaussian mixture models. Three classification methods are systematically evaluated: Support Vector Machines, Random Forest, and Extreme Gradient Boosting. The four best experiments yielded an accuracy of around 55%, while the embeddings based on GloVe word vectors yielded a sensitivity of 79.6%, which could be of great interest to public policy makers who need to accurately identify people to be prioritized in social programs.
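To make the first strategy and the reported sensitivity concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each document is already a list of word vectors, and it assumes the descriptive statistics are the per-dimension mean, standard deviation, minimum, and maximum; the abstract does not specify which statistics are used. The function names `document_features` and `sensitivity`, the toy vectors, and the choice of label `1` as the positive class are all hypothetical.

```python
from statistics import mean, pstdev


def document_features(word_vectors):
    """Concatenate per-dimension statistics of a document's word vectors.

    Assumption: the statistics are mean, population standard deviation,
    minimum, and maximum. The source abstract does not list them.
    """
    if not word_vectors:
        raise ValueError("document has no word vectors")
    # Transpose so that each tuple holds one embedding dimension across words.
    dims = list(zip(*word_vectors))
    features = []
    for stat in (mean, pstdev, min, max):
        features.extend(stat(d) for d in dims)
    return features


def sensitivity(y_true, y_pred, positive=1):
    """Sensitivity (recall) = TP / (TP + FN) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy document with two words in a 2-dimensional embedding space (made up).
toy_doc = [[1.0, 0.0], [3.0, 4.0]]
toy_features = document_features(toy_doc)

# Toy labels: 1 is the hypothetical positive class.
toy_sensitivity = sensitivity([1, 1, 0, 1], [1, 0, 0, 1])
```

With a d-dimensional embedding, this sketch produces a fixed-length vector of 4d features per document regardless of document length, which is what allows classifiers such as Support Vector Machines or Random Forest to consume documents of varying size.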




  1. Agua de panela is a very traditional beverage made out of sugarcane.

  2. “eswiki-latest-pages-articles.xml.bz2”




This study was partially funded by CODI from the University of Antioquia, grant no. PRG2020-34068.

Author information


Corresponding author

Correspondence to Guberney Muñetón-Santa.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Muñetón-Santa, G., Escobar-Grisales, D., López-Pabón, F.O. et al. Classification of Poverty Condition Using Natural Language Processing. Soc Indic Res (2022).



  • Poverty
  • Natural language processing
  • Text classification
  • Word embedding
  • Document-level embedding
  • Machine learning