Annals of Operations Research

Volume 266, Issue 1–2, pp 511–529

Loan default prediction by combining soft information extracted from descriptive text in online peer-to-peer lending

  • Cuiqing Jiang
  • Zhao Wang
  • Ruiya Wang
  • Yong Ding
Analytical Models for Financial Modeling and Risk Management


Predicting whether a borrower will default on a loan is of significant concern to platforms and investors in online peer-to-peer (P2P) lending. Because the data that online platforms use are complex and include unstructured information such as text, which is difficult to quantify and analyze, loan default prediction faces new challenges in P2P lending. To address this, we propose a default prediction method for P2P lending that incorporates soft information from textual descriptions. We introduce a topic model to extract valuable features from the descriptive text concerning loans and construct four default prediction models to demonstrate the performance of these features for default prediction. Moreover, a two-stage method is designed to select an effective feature set containing both soft and hard information. An empirical analysis using real-world data from a major P2P lending platform in China shows that the proposed method can improve loan default prediction performance compared with existing methods based only on hard information.
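As a rough illustration of the idea of turning loan descriptions into topic features and joining them with hard information, the sketch below fits a small topic model and concatenates the resulting topic proportions with numeric borrower attributes. The abstract does not name the topic model or its settings, so this assumes an LDA-style model fitted by collapsed Gibbs sampling; the function name `lda_topic_features`, the hyperparameters, the toy descriptions, and the hard features are all hypothetical. The classifiers and the two-stage feature selection are not shown.

```python
import random


def lda_topic_features(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed choice of topic model).

    docs is a list of token lists. Returns one smoothed topic-proportion
    vector per document, usable as soft-information features.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word v in topic k
    topic_total = [0] * num_topics                              # tokens in topic k
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return features


# Hypothetical loan descriptions, already tokenised.
descriptions = [
    "need loan to expand shop inventory".split(),
    "loan for shop renovation and inventory".split(),
    "tuition fees for university study".split(),
    "pay university tuition and study materials".split(),
]
# Hypothetical hard information: [debt-to-income ratio, loan term in months].
hard = [[0.35, 12], [0.40, 24], [0.20, 12], [0.25, 36]]

soft = lda_topic_features(descriptions, num_topics=2)
combined = [h + s for h, s in zip(hard, soft)]  # hard + soft feature vector per loan
```

Each row of `combined` could then be passed to any classifier. Whether the topic features help is an empirical question that this toy example does not address.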


P2P lending, Default prediction, Soft information, Topic model



The authors gratefully acknowledge the constructive comments of the anonymous referees, which considerably improved the quality and clarity of the paper. This work was funded primarily by the National Natural Science Foundation of China (Grant Nos. 71571059, 71331002 and 71731005) and the Humanities and Social Sciences Fund Projects of the Ministry of Education (Grant Nos. 13YJA630037 and 15YJA630010).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. School of Management, Hefei University of Technology, Hefei, China
