Introduction

The financial services industry has recently been forced to adopt technological changes to innovate its processes and products (Frame et al. 2018; Kou et al. 2021). As a result, a set of technology-assisted customized financial services (FinTech) has arisen in the banking market (Thakor 2020). Prominent among them are nonintermediated peer-to-peer (P2P) transactions based on digital infrastructures, such as lending and payments. Indeed, mobile P2P payments are among the business vectors with the deepest market penetration (Abdullah and Naved Khan 2021) and have experienced an extraordinary boom, particularly since the beginning of the COVID-19 pandemic (Higueras-Castillo et al. 2023). Mobile P2P payments constitute a real threat to traditional payment methods, as they were born from the need to break the dominance of cash and credit card payments for common day-to-day purchases (Belanche et al. 2022; Insider Intelligence 2022). They have emerged as a singular digital payment system: simpler, faster, more convenient, usually cost-free, and featuring a social component that other systems, digital or not, lack (Li et al. 2021; Nasir et al. 2020, 2021).

Given that mobile P2P payments are a disruptive innovation in the financial services sector, previous research has focused on identifying the factors determining their use (Leong et al. 2022). In practice, financial entities drive change by fostering digital payments among customers. Thus, they need to know the attributes that explain customary resistance to change and the barriers to using new technologies and transferring know-how (Irimia-Diéguez et al. 2023). In this vein, Liébana-Cabanillas et al. (2021) showed that the precursors and barriers to using P2P payments differ from those of mobile-based payment methods, calling for further research.

Therefore, the key research question that this paper aims to shed light on concerns the drivers and barriers that shape the adoption of mobile P2P payments among banking customers (Shaikh et al. 2023). Accordingly, the main objective of this paper is to analyze the factors that determine customers’ adoption of mobile P2P payments. Our contribution lies in identifying the key variables that allow banking customers to be classified as users (or nonusers) of mobile P2P payments. To this end, we compare traditional parametric statistical techniques with a set of nonparametric approaches based on machine learning (ML) methods oriented to classification problems. These learning algorithms are the foundations of data mining and big data, current trending topics in the financial innovation field, and are considered a crucial part of a wider research area known as Knowledge Discovery from Data, which focuses on identifying patterns in data sets (Nguyen et al. 2022).

It is worth highlighting, as one of the core strengths of the present study, the use of a unique data set with information from 701 individuals (observations) who were asked about the use of mobile P2P payments; namely, the use of Bizum, one of the leading and pioneering mobile P2P payment applications worldwide, whose success is comparable to Venmo in the USA (Acker and Murthy 2020).

This paper contributes to the FinTech and ML literature in two ways. Practically, our findings have significant implications for banks with a high interest in precisely knowing the factors that impact the intent to use mobile P2P payment services to (i) create more customized products and services to satisfy the needs of their customers to a greater extent and (ii) properly plan their business, human resource, and marketing strategies. One of the key points of this research is the sample, which is built on a survey conducted with users of the mobile P2P payment platform Bizum. We highlight that one of the main variables explaining the adoption of Bizum as a mobile P2P payment is its full connection and integration with traditional financial players. In other words, given that Bizum is a bank-based platform with a largely predefined bank–customer relationship, it has benefited from its deep market penetration into the traditional banking industry to create new business relationships and become a trustworthy and massively used mobile P2P payment platform. Indeed, this, together with the development of technology allowing the widespread use of smartphones, is a primary factor explaining the strong expansion and adoption of Bizum as a mobile P2P payment method.

Theoretically, our framework employs the most relevant models from technology acceptance theories. We use variables from the theory of reasoned action (Fishbein and Ajzen 1977), the technology acceptance model (TAM) (Davis et al. 1989), the theory of planned behavior (Ajzen 1991), the extended TAMs, namely TAM 2 (Venkatesh and Davis 2000) and TAM 3 (Venkatesh and Bala 2008), the unified theory of acceptance and use of technology (UTAUT) (Venkatesh et al. 2003), UTAUT2 (Venkatesh et al. 2012), and the mobile payment technology acceptance model (Liébana-Cabanillas et al. 2014). Empirically, we follow Witten and Frank (2005), who note that different learning schemes and search procedures serve some problems well and others badly, an added motivation for carefully constructing and comparing alternative ML techniques. In addition, a first preselection of independent variables is performed by combining the Boruta and Gini index procedures to obtain a more parsimonious model. Thus, the comparison of different ML techniques in the field of user adoption of mobile P2P payments constitutes the second contribution of this study.

The rest of this paper is structured as follows. "Theoretical background" section reviews the relevant literature, "Methodology" section describes the dataset and the machine learning models used in this research, "Results" section presents the empirical results, "Discussion" section contains the discussion, and the final section sets out the conclusions, implications, and areas for future research.

Theoretical background

Evolution of payment systems

New payment systems have emerged from advancements in information and communication technology for financial transactions between businesses and their customers. Specifically, these systems arose to address several needs: the issues associated with handling physical money (Tamayo 1999), the need to reduce the cost of money and of existing payment methods, flexibility for small purchases and instant payments, stronger security and protection against fraud and other forms of crime, and the rise of e-commerce on the Internet and online payments.

Consequently, the financial sector is undergoing a profound transformation where traditional payment systems relying on cash are being replaced by electronic payment systems (see Fig. 1). According to a recent study by the European Central Bank (2022), the total number of noncash payment transactions in the euro area, encompassing all types of payment services, increased by 12.5% compared to the previous year, reaching 114.2 billion transactions, with a total amount increase of 18.6% to 197 trillion euros. Card payments accounted for 49% of the total transactions, transfers represented 22%, and direct debits represented 20%.

Fig. 1 Classification of payment systems. Source: Own elaboration based on Huang (2021)

In addition to this trend, the extensive use of technologies such as mobile phones has also brought about significant changes in user payment behaviors (Liébana-Cabanillas et al. 2022a). Current mobile payment solutions are based on the technological development of smartphones, enabling the creation of payment applications that can be used in various ways for conducting payment transactions with a mobile device (Liébana-Cabanillas et al. 2017). Mobile payments can be classified into several categories. First, smartphones can be used at the point of sale to perform economic transactions for purchasing products or services, and can even function as a point-of-sale terminal. Second, mobile phones can serve as a standard payment platform, offering functionalities such as executing payments and sending money. Third, phones can be used as a payment channel through the user’s telecommunications operator, with whom they have a contracted phone line.

Finally, closed-loop payments refer to mobile applications specifically developed for a particular store or brand, where the mobile phone functions not only as a payment option within that store but also includes additional payment-related services such as promotional notifications, loyalty programs, and discount coupons.

Previous research on mobile payment adoption

Since the seminal work of Dahlberg et al. (2008) on mobile payment systems, various authors have analyzed the field of mobile payments up to the present day (Liébana-Cabanillas et al. 2022b; Migliore et al. 2022). Dennehy and Sammon (2015) concluded that research on mobile payments is a well-established area that will continue to receive increased attention from various disciplines in the coming years, recognizing the potential and enrichment of mobile payment services as their adoption becomes increasingly imperative. To date, customer adoption continues to be of interest to many researchers, but the focus remains on investigating adoption in specific countries separately, with less attention given to comparing survey results across multiple countries and examining their differences. More recently, authors such as Abdullah and Naved Khan (2021), Tounekti et al. (2022), and Panetta et al. (2023) have proposed bibliometric reviews that highlight the importance of this current and future research topic. Furthermore, recent studies on adoption have specifically examined technology, security, and architecture. Table 1 summarizes recent research that has analyzed the adoption of mobile payment systems.

Table 1 Recent research on mobile payment adoption

Peer-to-peer mobile payment system: Bizum

P2P payment systems are peer-to-peer applications that enable the immediate mobile transfer of money from anywhere. Furthermore, this type of payment, previously widespread in the private sphere, is also starting to extend into the commercial realm for making purchases at physical establishments. An increasing number of consumers are using P2P payment apps to pay for their purchases at retail stores, a trend driven by the growing acceptance of P2P payments by merchants (Visconti-Caparrós et al. 2022).

One pioneering P2P payment system in Europe is Bizum, which is known for its origin and comparative competitiveness. It offers its users three major advantages: (i) immediacy, as transferred funds reach recipients’ bank accounts within seconds; (ii) universality, as customers do not need to switch financial institutions, and the system is connected to all participating banks; and (iii) user-friendliness, as it allows users to make payments between individuals as well as at physical and online stores.

In addition, its operation is straightforward: to send money, the Bizum user selects a contact from their mobile phone lists and sets the desired transfer amount. The sender’s bank then sends a code to their mobile phone, which the user enters into the app, and the recipient immediately receives the money in their linked bank account.

Bizum is supported by all Spanish banks, with an option in each e-banking application, and is used by more than 21 million active users (nearly 50% of the Spanish population), with a historical track record of 1,362 million transactions and more than EUR 70.5 billion transferred since its launch in 2016 (Bizum 2022). Bizum can be considered a transversal payment method because its customer profile includes people of any age, educational level, and socioeconomic class (Belanche et al. 2022).

Considering this review of the adoption of mobile payment systems in general, and of P2P systems in particular, and in line with our objectives, the current research proposes an improvement in the analysis techniques used to determine the variables that foster the intention to use P2P payment systems, applying different statistical learning approaches and combining the Boruta and Gini index procedures to obtain a more parsimonious model.

Machine learning and mobile payments

Comparative analysis of key machine learning techniques

ML is a part of artificial intelligence that, by compiling statistical algorithms and systems, demonstrates intelligence to interpret external data correctly and subsequently make decisions (Davenport et al. 2020). In essence, ML models seek to learn relationships and patterns from a given dataset, and therefore, they can be used to solve both predictive and classification/categorization problems (Bishop 2006).

ML is emerging in parallel with the development of computational science and data-driven business management (Sheth and Kellstadt 2021). This is why, in recent years, numerous ML-based intelligent systems, often embedded in connected devices (the Internet of Things, IoT), have massively penetrated our business and personal lives (Kaplan and Haenlein 2019). Indeed, ML performs much better in high-dimensional data environments, where variable interactions and nonlinear relationships often arise and automatized recurrent decisions are required (Vanini et al. 2023). Accordingly, ML algorithms have been successfully applied in many fields, such as banking, to decide whether to approve or reject a loan application (Alonso-Robisco and Carbó-Martínez 2022), and engineering, for structural design (Thai 2022).

One of the pioneering ML models that has subsequently reached remarkable expansion and relevance is the artificial neural network (ANN). ANNs attempt to emulate human brain functioning by creating a set of interconnected nodes (artificial neurons) placed on several layers that reason in a network architecture (Selvamuthu et al. 2019). Among the neural networks most used in business research is the multilayer perceptron (MLP) (Vellido et al. 1999), whose main theoretical advantage is that it satisfies the universal approximation property (Bishop 2006). Moreover, in recent years, complex ANN architectures have emerged, namely deep ANNs (e.g., convolutional neural networks), which already exceed human performance in some environments (Madani et al. 2018).

Despite their advantages, the main limitation of ANNs is their black-box nature, which hampers the interpretation of the results and of the importance, effects, and relationships of the variables. Moreover, ANNs have a high computational cost for tuning the training parameters, which lengthens the time required to design the topology of the optimal network.

At the beginning of the current century, ensemble methods emerged, whose premise is that the nature of a phenomenon is captured to a greater extent by combining several alternative models that are subsequently synthesized into a single optimal model. That is, ensemble algorithms benefit from the strengths of different models without conducting a biased model preselection. Within the ensemble-based approach, two primary methods stand out: bagging and boosting.

First, the bagging algorithm proposed by Breiman (1996) fits the same underlying algorithm at each training step, creating a final prediction that is the average of the individual bootstrap predictions. Given a classification model, bagging draws B independent samples with replacement from the available training set (bootstrap samples), fits a model to each bootstrap sample, and finally aggregates the B models by majority voting. Since the final prediction is always a weighted aggregate of several bootstrap fits, bagging powerfully decreases the model variance, leading to a model with higher generalization ability and no overfitting problems (Schapire et al. 1998).
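The bagging procedure described above can be sketched as follows. This is a minimal illustration in Python with scikit-learn on synthetic data, not the study's actual R implementation: B = 100 bootstrap samples are drawn with replacement, the same base learner (a decision tree by default) is fitted to each, and the final class is decided by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data: 701 observations, binary outcome.
X, y = make_classification(n_samples=701, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# B = 100 bootstrap samples drawn with replacement; the default base learner
# (a decision tree) is fitted to each, and the final class is decided by
# majority vote across the B fitted models.
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
accuracy = bag.score(X_te, y_te)
```

Averaging over many bootstrap fits is what stabilizes the variance of the individual trees.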

This advantage allows bagging to be successfully applied to generate other ML models. In this vein, applying bagging to a tree-based method results in a model called random forest (RF), one of the most relevant ML techniques (Breiman 2001). RF is based on building and combining a large set of trees: each tree grows from a bootstrap sample of the training dataset, and the variables used to split the dataset and create each node are selected at random. The main strength of RF is that considering a random subset of predictors at each split increases the probability of weak predictors being selected, thereby reducing bias in the model; otherwise, the strongest predictors would be used by many trees as the first split.

The growing interest in the use of RFs is also due to their capacity to rank predictor variables according to their importance in explaining the studied phenomenon (Friedman et al. 2000). That is, unlike most ML methods that have a black-box nature, RF shows how each variable influences the understanding of the analyzed event. Indeed, this is the procedure employed in this study to select the most relevant variables (see "Feature selection results" section). Moreover, other positive aspects of this method are that it does not generally overfit and that Bayes consistency is obtained with a simple version of RF (Breiman 2001).
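The importance ranking that RF provides can be sketched as follows. This is an illustrative Python/scikit-learn sketch on synthetic data (the study itself works in R): the mean decrease in Gini impurity is extracted per predictor and the predictors are sorted by it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 701 observations, 10 candidate predictors, 4 informative.
X, y = make_classification(n_samples=701, n_features=10, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Mean decrease in Gini impurity per predictor, averaged over all trees;
# the values sum to 1, and larger values indicate more relevant variables.
importance = rf.feature_importances_
ranking = np.argsort(importance)[::-1]  # predictors, most to least important
```

The same ranking logic underlies the feature selection applied later in the paper.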

Note that RF can be considered an improved version of the classification and regression trees (CART) approach. By randomly selecting the splitting variables and growing each tree from a bootstrap sample, RF avoids CART's main disadvantage: a change in a higher-level node can, through a domino effect, lead to completely different trees. In other words, the performance of CART depends strongly on the stopping criteria implemented, because the model is developed using binary recursive partitioning, an iterative procedure of splitting the dataset until reaching the final nodes. Of course, CART also has advantages. Indeed, Breiman (2001) considers CART the easiest model to understand and interpret. Furthermore, CART accommodates nonlinear relationships between variables and higher-order interactions (Boulesteix et al. 2015).

Second, unlike bagging, boosting trains models sequentially by analyzing the prediction errors, which results in a powerful improvement of the classifiers (Freund 1995). AdaBoost is the most relevant model within this approach. AdaBoost assigns increasing weights to the observations that were incorrectly classified in the last iteration of the classifier; consequently, subsequent iterations focus on correctly classifying these observations, which ultimately minimizes the prediction errors. In this paper, we implement AdaBoost as well as Binomial Boosting and L2 Boosting. Other boosting algorithms related to additive basis expansion were developed by Friedman et al. (2000).
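The sequential reweighting scheme of AdaBoost can be sketched as follows (again a hedged Python/scikit-learn illustration on synthetic data, not the paper's own implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=701, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Weak learners are trained sequentially: after each round, the observations
# misclassified in the previous iteration receive larger weights, so the
# next learner concentrates on the hard cases.
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_tr, y_tr)
accuracy = ada.score(X_te, y_te)
```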

Finally, the support vector machine (SVM) is a powerful technique, mainly used for binary classification problems (although it can also be applied to multiclass classification), that builds a hyperplane to separate the observations of different classes. To do so, the SVM relies on support vectors, the data points falling closest to the hyperplane. Although SVM usually generates low misclassification errors and functions well in high-dimensional data environments, it has a high operational cost in terms of time consumption. Moreover, SVM sometimes converges to a nonoptimal separating function, which undermines its performance.
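An SVM classifier of the kind described above can be sketched as follows (an illustrative Python/scikit-learn example on synthetic data; the RBF kernel and C value are assumptions for the sketch, not the study's settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=701, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# The RBF-kernel SVM builds a separating hyperplane in a transformed feature
# space; only the support vectors (points closest to the margin) determine
# the decision boundary.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_tr, y_tr)
accuracy = svm.score(X_te, y_te)
n_support = svm.named_steps["svc"].support_vectors_.shape[0]
```

Inspecting `n_support` shows that only a subset of the training points defines the boundary.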

User behavior prediction

Behavior analysis was introduced in 1953 by Skinner (1953) and focused on analyzing human behavior from a psychological perspective. However, technological advancements have allowed massive data processing and the powerful development of data mining and ML algorithms that have been increasingly applied to explore human behavior, biasing behavior analysis toward the computational science area. Indeed, behavior analysis is currently called behavioral analytics (Cao et al. 2015), whose aim is to model human behavior by understanding the past to predict its future, and thus create business strategies using statistical and ML approaches (Martín et al. 2021).

From the beginning, these analyses have essentially addressed how individuals interact and the roles they play both as a group (collaboration–competition) and individually (routines–attitudes–intentions). However, the study of human behavior is not disinterested. Rather, there is a strong economic interest that companies try to exploit to increase their market share, their brand and product–service positioning and, ultimately, their profits. For this reason, this discipline is currently closely connected to economics and organizational management and is encompassed in the field called user behavior analysis, which brings together human behavior, ML techniques, and business decision-making (LeCun et al. 2015; Cui et al. 2016).

In practice, ML has been successfully employed in different domains related to disruptive innovations and marketing, such as the recommendation of products to potential customers (Hagenauer and Helbich 2017) or the estimation of consumer preferences for technology products (Guo et al. 2021).

Particularly relevant is the use of ML in P2P finance (also known as Internet or Digital Finance), which mainly operates through the Internet; therefore, a large amount of data must be processed before decision-making (Wu et al. 2018). As suggested by Gomber et al. (2017), digital finance is a new form of finance based on third-party payment, cloud computing, big data, social networks, and e-commerce platforms to obtain financing and credit as well as to make payments and other financial transactions. In this challenging environment, ML can collect new data, update the model, and provide an output, thus adapting to rapidly evolving environments, such as economic patterns and shocks.

Indeed, ML is being effectively used to explore the factors that influence users’ digital finance behavior (Xiong et al. 2022). Authentication technology, the nonrepudiation of transactions, privacy protection, data integrity, and user trust have a significant impact on users’ Internet finance behavior.

Focusing on e-payment users, Bajari et al. (2015) suggested that ML techniques outperform discrete choice models, which have been the reference statistical methods for analyzing consumers’ preferences and adoption of means of payment and other digital financial services (Hernández-Murillo et al. 2010). As pointed out by Cui et al. (2016), ML is a powerful methodological approach that promises to generate new insights into payment behavior. In this sense, Lee et al. (2020) used a two-stage analysis, employing partial least squares and subsequently an artificial neural network, to explore the antecedents that affect users’ behavioral intention to use wearable payments. Likewise, Aslam et al. (2022), using SVM, studied the behavioral factors that explain the adoption of mobile payments, finding perceived value to be the most important predictor of usage behavior. Users’ behavior with mobile payments has even been employed as a driver to forecast, through ML, stores’ total customer flows (Ma and Fildes 2020).

To the best of the authors’ knowledge, only the few research articles above analyze users’ behavior regarding digital payments; therefore, more empirical evidence is needed. This is not surprising given that mobile payment applications are not yet widely used by the population and, more importantly, very few leading e-payment applications operate massively within a single country (as Bizum does), so it is rarely possible to question users about the behavioral factors that lead them to adopt these mobile payments. This reinforces the relevance of the present study.

Methodology

In this study, we use a primary data source obtained from a survey of 701 Spanish smartphone users who are considered potential users of mobile P2P payment systems. All the participants had experience using their cell phones for commercial activities, whether shopping or payments. The respondents’ profile corresponds to that of an average Spanish citizen within the European cultural and lifestyle framework. To collect the data, nonprobability snowball sampling was employed through a mailing list and social networks. Although simple random sampling is the ideal sampling method, many empirical studies published in high-impact journals have used snowball sampling when collecting data (Belanche et al. 2022; Huang et al. 2019).

The questionnaire included items to measure the variables defined in Table 2. The items were selected through a review of the relevant literature, adapting the original scales to the nature of the research. The participants expressed their attitudes on a seven-point Likert scale (1: strongly disagree; 7: strongly agree). The questionnaire was developed using a multi-item approach, where three or more items measured each latent variable. This is a common procedure in the field of marketing research. Appendix 1 provides the questionnaire used in the study for reference.

Table 2 Variables and theoretical background

The dependent variable is a dummy variable taking the value one (1) when the respondent uses the mobile payment system and zero (0) otherwise, as follows:

$$Y_{i} = \begin{cases} 1 & \text{uses mobile payment system} \\ 0 & \text{does not use mobile payment system} \end{cases}$$

To execute this research, we classify the independent variables into two categories: a group of behavioral variables related to the main theories on technology adoption (perceived ease of use, perceived risk, trust, personal innovativeness, subjective norms, perceived enjoyment, loyalty to the banking brand, and perceived quality) and a second group of variables linked to the demographic classification of potential users of the payment system (gender and age).

Regarding the first group of variables, the classic scientific literature has developed multiple theories that analyze the behavior of individuals in the face of innovation. In recent years, some authors have applied these theories to the field of mobile and P2P payments (Upadhyay et al. 2022; Belanche et al. 2022). Table 2 describes the variables used and the sources employed for their definition.

The second block of variables refers to the gender and age of potential users of the proposed payment system. For these variables, our study uses the same categories employed by the Spanish National Employment Institute in its statistical reports to classify the population.

Results

Feature selection results

We performed two preprocessing procedures because, as supported by Chen et al. (2020), their use substantially improves the prediction results. First, all predictor variables were rescaled into the [0,1] interval to align the dimensionality of the predictors with that of the dependent (dummy) variable. Second, given that we have high-dimensional data in terms of the number of features (forty-two independent variables; see Table 3), it is necessary to apply a procedure that reduces the complexity of the model by capturing only the most relevant inputs. Including many predictors in a classification model has severe theoretical disadvantages: (i) overfitting, (ii) correlation problems, (iii) difficulty in interpreting results, and (iv) a slower training process. The idea is to reduce the noise and redundancy in the final model. Indeed, the principle of parsimony states that the best statistical model is the one with fewer parameters (variables) and lower dimensionality (Arora and Kaur 2020; Speiser et al. 2019).

Table 3 Feature selection (FS) under random forest approach

Consequently, we performed a procedure to select the most relevant predictors. This minimizes the complexity of our model and accelerates its training, and it improves the robustness of the performance measurements, in terms of higher accuracy or lower errors, by boosting the generalization capacity of the classifier. Dewi (2019) indicated that feature selection (FS) procedures reduce the original feature set of a dataset to a smaller one while preserving the relevant information and rejecting redundant information. As Chen et al. (2020) maintain, FS crucially impacts the performance of the classification model; indeed, FS is considered even more important than the design of the prediction model.

Following Chen et al. (2020), we implement the random forest (RF) algorithm as the method for selecting the most relevant features from the data. Unlike parametric techniques grounded in subset selection, such as logistic regression (LR) with forward or backward procedures, RF is a nonparametric supervised ML method that incorporates two procedures to select the most important variables: (i) the varImp() function in R, which computes the mean decrease of the Gini index, and (ii) the Boruta algorithm (Fahimifar et al. 2022).

The varImp() function in R is applied after fitting the RF model. This postestimation procedure is applied to each tree obtained and consists of calculating the prediction accuracy before and after permuting each predictor variable; the difference between the two accuracies is then averaged over all the trees and normalized by its standard error. The function provides two measures of importance for each predictor, disaggregating the results by outcome class (1 when Bizum is adopted and 0 otherwise). The first metric indicates the average decrease in accuracy when a variable is permuted. The second measure provides the reduction in Gini impurity when a variable is chosen to split a node. It should be noted that the sample used to calculate the importance of each variable is the out-of-bag data not used during the construction of each tree. The recommendation is to analyze both measures together because this enables comparing the importance rankings they produce. Their main disadvantage, however, is that they may overstate the importance of correlated variables.
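The two importance measures described above have close analogues in Python, which can serve as a sketch of the procedure (here the permutation measure is evaluated on a held-out split rather than the out-of-bag sample, an assumption of this illustration; the data are synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=701, n_features=10, n_informative=4,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)
rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X_tr, y_tr)

# Measure 1: mean decrease in accuracy when a predictor is permuted,
# evaluated here on held-out data (analogous to the out-of-bag estimate).
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=3)
mda = perm.importances_mean

# Measure 2: mean decrease in Gini impurity when a predictor splits a node.
mdg = rf.feature_importances_

# The two rankings (most-to-least important) can then be compared.
rank_mda = np.argsort(mda)[::-1]
rank_mdg = np.argsort(mdg)[::-1]
```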

To benchmark the FS, we also implemented the Boruta algorithm, which ranks the predictor variables based on their significance (default values of p value = 0.01 and maxRuns = 100). One of the most important advantages of Boruta is that it classifies the variables into three groups: (i) confirmed, for significant variables (the most relevant); (ii) tentative, for variables that may be selected but have less importance; and (iii) rejected, for variables that the method considers should not be included.
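The core idea behind Boruta, comparing each real predictor against randomized "shadow" copies, can be sketched in a single round as follows (a simplified Python illustration on synthetic data; the actual R implementation iterates this comparison up to maxRuns times with statistical testing to produce the confirmed/tentative/rejected split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=701, n_features=10, n_informative=4,
                           n_redundant=0, random_state=4)
rng = np.random.default_rng(4)

# Append "shadow" features: column-wise shuffled copies of the real
# predictors, which by construction carry no real signal.
shadows = rng.permuted(X, axis=0)
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=300, random_state=4).fit(X_aug, y)

# A real predictor is a "confirmed" candidate when its importance exceeds
# the best importance achieved by any shadow feature.
real_imp = rf.feature_importances_[:10]
shadow_max = rf.feature_importances_[10:].max()
confirmed = np.where(real_imp > shadow_max)[0]
```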

The results of the FS analysis are depicted in Table 3 (graphically also in Fig. 2). As shown here, the two FS procedures employed (Boruta and the Gini index) match most of the rankings performed, especially in the first variables, i.e., the variables with the highest classification of importance.

Fig. 2 Importance measure for each variable using Boruta

Unlike the Gini index, one of the main advantages of the Boruta procedure is that it indicates which variables must be included in the model. However, as can be observed in Table 3, the Boruta procedure considers that the entire list of variables should be introduced into the model because they all have a high importance level, which is not operational from a computational viewpoint. Consequently, to increase the selection capacity of the FS procedures, we select only those variables, among the first ten, that match both criteria (Boruta and the Gini index).

Eight of the top ten variables are the most relevant under both FS criteria (see Table 3); thus, these variables are included in our classification model. It should be noted that this procedure dramatically reduces the number of variables introduced into the model, retaining only eight of the forty-two original variables (i.e., 19.05% of the predictors in the original dataset). This selection of the data’s critical features reduces the noise and redundancy of the final model and improves its interpretability while decreasing computational costs.

Despite the advantages of the Boruta and Gini index procedures shown above, their main disadvantage is that they do not account for potential multicollinearity among the resulting explanatory variables. Indeed, multicollinearity remains understudied in the context of AI and ML algorithms, although it is one of the most important aspects to consider in an econometric model (Chan et al. 2022). However, contrary to what is often claimed, correlation does not necessarily imply multicollinearity, as they are not the same; multicollinearity should therefore be assessed not with the correlation matrix but with the Variance Inflation Factor (VIF) (Chan et al. 2022). In our data, the variable PU2 has the maximum VIF value (6.548), which indicates the absence of multicollinearity problems. Note that although there is no strict VIF threshold for confirming multicollinearity, there is wide consensus in previous research that a VIF of 10 or higher often indicates it (Weisberg 2005). Additionally, as a robustness check, we also implement forward stepwise logistic regression as a parametric alternative for selecting the most relevant variables. This yields only four variables (PU2, SN3, TR5, and PENJ3), three of which match those obtained by the Boruta and Gini index procedures. Our main result, that nonparametric ML techniques outperform classical LDA and LR, remains unaltered whether we apply Boruta, the Gini index, or forward stepwise logistic regression.
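As an illustration of how the VIF is computed, the following self-contained Python sketch regresses each predictor on the others via the normal equations and returns 1/(1 − R²). The toy data and function names are ours; in practice this would be done with a statistics package:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def vif(X, j):
    """VIF of predictor j: 1 / (1 - R^2) from regressing column j on the others."""
    n = len(X)
    y = [row[j] for row in X]
    Z = [[1.0] + [v for k, v in enumerate(row) if k != j] for row in X]
    p = len(Z[0])
    # Normal equations: (Z'Z) beta = Z'y
    ZtZ = [[sum(Z[r][a] * Z[r][b] for r in range(n)) for b in range(p)] for a in range(p)]
    Zty = [sum(Z[r][a] * y[r] for r in range(n)) for a in range(p)]
    beta = solve(ZtZ, Zty)
    yhat = [sum(bk * zk for bk, zk in zip(beta, row)) for row in Z]
    ybar = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 / (1 - (1 - ss_res / ss_tot))

# Toy data: x2 is almost an exact linear combination of x0 and x1,
# so its VIF is large; x0 and x1 alone are nearly uncorrelated.
X_collinear = [[i, i % 5, i + (i % 5) + 0.1 * (-1) ** i] for i in range(20)]
X_clean = [[i, i % 5] for i in range(20)]
print(vif(X_collinear, 2))  # far above the usual threshold of 10
print(vif(X_clean, 0))      # close to 1
```

The near-collinear column triggers a very high VIF, while the nearly orthogonal design stays close to the ideal value of 1.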

From a theoretical point of view, the FS analysis suggests that the variables corresponding to usefulness, subjective norms, trust, and perceived enjoyment have a strong influence on the intention to use mobile payment systems and media.

Specifically, our results suggest that the usefulness of mobile payment media (PU1, PU2, PU3, and PU4) is a strong explanatory factor in usage intention, advancing the previous literature (Bhattacherjee and Premkumar 2004), which posited perceived usefulness as a lens for understanding changes in beliefs and attitudes toward information technology use.

Second, moving on to personal innovativeness in the information technology domain, two subjective customer profile variables (SN3 and SN4) show high explanatory and predictive power for the intention to use mobile payment methods (Agarwal and Prasad 1998a; Taylor and Todd 1995).

Turning to the variables related to perceived trust in mobile payment systems, in line with Ba and Pavlou (2002), our results identify a strong link between bank customers’ perceived trustworthiness of the mobile payment medium (TR2) and their intention to use it.

Furthermore, our findings advance the previous literature on the perceived enjoyment of using online payment systems (Agarwal and Karahanna 2000; Rouibah et al. 2016), as our results identify a significant relationship between the perceived enjoyment of using a mobile payment medium and the intention to continue using this technology (PENJ3).

To better illustrate the discriminatory power of the RF model after applying the FS procedure, we present the area under the ROC curve (AUC) in Fig. 3. The ROC curve is obtained by plotting the true positive rate against the false positive rate at various threshold settings, and the AUC is the area beneath it. The AUC thus captures the tradeoff between sensitivity and specificity, given that an increase in sensitivity causes a reduction in specificity. The closer the curve lies to the upper left corner, the greater the model’s classification power. Similarly, Fig. 4 shows the out-of-bag (OOB) error, defined as the average prediction error computed, for each observation, from the trees whose bootstrap samples did not contain it. The OOB error is used to assess the classification power of the RF model while it is being trained. As depicted in Fig. 4, the OOB error decreases drastically (i.e., the model’s fit improves) over the first 150 trees and oscillates steadily thereafter.

Fig. 3

Area under the ROC curve (AUC) for the random forest model

Fig. 4

The out-of-bag (OOB) error for the final random forest model

Validation measures

The performance of each model is evaluated using different accuracy measurements computed on the out-of-sample results of each method. In binary classification problems, two relevant metrics arise: sensitivity and specificity. On the one hand, sensitivity measures the probability that the model classifies a real Bizum user as a Bizum user; in other words, sensitivity measures the model’s ability to detect Bizum usage when it is present. Conversely, specificity measures the probability that the model classifies a real Bizum nonuser as a nonuser; that is, specificity measures the model’s ability to rule out Bizum usage when it is absent. Sensitivity and specificity are defined as follows:

$$Sensitivity = \frac{TP}{{TP + FN}};\quad Specificity = \frac{TN}{{TN + FP}}$$

where

TP = True Positive, the number of positive cases (not adopting mobile P2P payment) correctly identified as positive,

TN = True Negative, the number of negative cases (adopting mobile P2P payment) correctly identified as negative,

FN = False Negative, the number of positive cases (not adopting mobile P2P payment) misclassified as negative (adopting mobile P2P payment),

FP = False Positive, the number of negative cases (adopting mobile P2P payment) incorrectly identified as positive (not adopting mobile P2P payment).
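The confusion-matrix counts and the two ratios above translate directly into code. The following Python sketch treats label 1 as the positive class; the labels, data, and function names are illustrative:

```python
def confusion_counts(y_true, y_pred):
    """Confusion-matrix counts, with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, tn, fn, fp

def sensitivity(tp, fn):
    """TP / (TP + FN): share of positive cases that are detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): share of negative cases that are ruled out."""
    return tn / (tn + fp)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
tp, tn, fn, fp = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn))  # 3 of 4 positives detected -> 0.75
print(specificity(tn, fp))  # 2 of 4 negatives ruled out -> 0.5
```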

Following Petropoulos et al. (2020), we built several performance measurements based on sensitivity and specificity to overcome the limitations of traditional accuracy metrics based only on the overall predictive ability. In this vein, we calculate the following measures:

  • G-mean: The geometric mean of sensitivity and specificity, i.e., the square root of their product. This metric illustrates the balance between the classification performances of the majority and minority classes.

    $$G = \sqrt {Sensitivity \cdot Specificity}$$

A poor performance in predicting positive cases will lead to a low G-mean value, even if the negative cases are correctly classified by the algorithm.

  • LR: The negative likelihood ratio is the ratio between the probability of predicting a case as negative when it is positive and the probability of predicting a case as negative when it is actually negative.

    $$LR = \frac{1 - Sensitivity}{{Specificity}}$$

A lower negative likelihood ratio signifies better performance in negative cases, which is relevant here as we model the adoption of mobile P2P payments.

  • DP: Discriminant power is a measurement that summarizes sensitivity and specificity in a single value.

    $$DP = \frac{\sqrt 3 }{\pi }\left[ {log\left( {\frac{Sensitivity}{{1 - Sensitivity}}} \right) + log\left( {\frac{Specificity}{{1 - Specificity}}} \right)} \right]$$

The algorithm distinguishes between positive and negative cases for DP values greater than 3.

  • BA: Balanced accuracy is the average of sensitivity and specificity. If the classifier performs equally well on both classes, this measure reduces to the conventional accuracy.

    $$BA = \frac{1}{2}\left( {Sensitivity + Specificity} \right)$$

In contrast, if the conventional accuracy is high simply because the classifier takes advantage of a good prediction on the majority class, the balanced accuracy will decrease, thus signaling any performance issues. That is, BA does not disregard the accuracy of the model in the minority class (i.e., adopt Bizum in our case).

  • Youden’s γ: Youden’s index is a linear transformation of the mean of sensitivity and specificity, which can make its magnitude harder to interpret.

    $$\gamma = Sensitivity - \left( {1 - Specificity} \right)$$

As a general rule, a higher value of Youden’s γ signifies a better ability of the algorithm to avoid misclassification of the population.

  • WBA1: A weighted balanced accuracy measure that weighs specificity more heavily than sensitivity (75%/25%):

    $$WBA1 = 0.75 \cdot Specificity + 0.25 \cdot Sensitivity$$

  • WBA2: A weighted balanced accuracy measure that weighs sensitivity more heavily than specificity (75%/25%):

    $$WBA2 = 0.75 \cdot Sensitivity + 0.25 \cdot Specificity$$

Alternatively, we also calculate the AUC, which can be defined as the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. The AUC of a useful classifier varies between 0.50 and 1, and it is widely accepted among researchers that a value above 0.80 denotes high performance.
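This probabilistic definition translates directly into code. The following Python sketch (illustrative names and toy scores) computes the AUC as the share of positive/negative pairs ranked correctly, counting ties as one half:

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive instance
    receives a higher score than a randomly chosen negative instance."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0: perfect separation
print(auc([0.9, 0.3, 0.8, 0.2], [1, 1, 0, 0]))  # 0.75: one pair misordered
```

A classifier that assigns every instance the same score scores exactly 0.5, matching random ranking.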

Finally, to facilitate the interpretation of the results, we build a summary metric, the Global Performance Index (GPI), defined as the arithmetic average of all the previous metrics, excluding Type I and Type II errors because they are complementary to specificity and sensitivity. Moreover, given that a model performs better with lower values of LR, this metric enters the following expression with a negative sign:

$$GPI = \frac{AUC + Accuracy\;ratio + Sensitivity + Specificity + Gmean - LR + DP + BA + Youden^{\prime}s + WBA1 + WBA2}{{11}}$$
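As a worked illustration, the composite measures and the GPI can be assembled from sensitivity and specificity as follows (a Python sketch under our own naming; the AUC and accuracy ratio are taken as given inputs, and the WBA weights follow the 75%/25% description in the text):

```python
from math import log, pi, sqrt

def composite_metrics(sens, spec):
    """The composite measures defined above, built from sensitivity and specificity."""
    g_mean = sqrt(sens * spec)
    lr_neg = (1 - sens) / spec                       # negative likelihood ratio
    dp = (sqrt(3) / pi) * (log(sens / (1 - sens)) + log(spec / (1 - spec)))
    ba = 0.5 * (sens + spec)
    youden = sens - (1 - spec)
    wba1 = 0.75 * spec + 0.25 * sens                 # weighs specificity more
    wba2 = 0.75 * sens + 0.25 * spec                 # weighs sensitivity more
    return g_mean, lr_neg, dp, ba, youden, wba1, wba2

def gpi(auc, accuracy, sens, spec):
    """Arithmetic average of the eleven terms; LR enters with a negative sign."""
    g_mean, lr_neg, dp, ba, youden, wba1, wba2 = composite_metrics(sens, spec)
    return (auc + accuracy + sens + spec + g_mean - lr_neg
            + dp + ba + youden + wba1 + wba2) / 11

# A classifier with sensitivity = specificity = 0.8:
print(composite_metrics(0.8, 0.8))
print(gpi(0.85, 0.8, 0.8, 0.8))
```

For a balanced classifier (sensitivity = specificity), G-mean, BA, WBA1, and WBA2 all collapse to that common value, which is a quick sanity check on the formulas.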

Results

The final sample, after eliminating questionnaires that were completed too quickly or exceeded the recommended time, amounted to 701 participants, of whom 46.22% were male and 53.78% were female. 42.37% were between 18 and 24 years old, 51.21% were between 25 and 44 years old, and 6.28% were over 44 years old. Of these, 4.28% had doctoral studies, 49.93% had university studies, 26.68% had secondary school studies, 15.83% had primary school studies, and the remaining 3.28% had no studies at all. The number of invalid questionnaires rejected was only 13; thus, the valid response rate was 98%.

Table 4 summarizes the results in terms of performance metrics in the test set. It shows that no single model obtains the best performance in terms of all metrics. However, our results demonstrate that nonparametric techniques based on ML often outperform classical LDA and LR. In particular, we find that binomial boosting, MLP4, and L2 boosting are the models that obtain the best performance in terms of GPI. Specifically, binomial boosting obtains the best GPI score with a value of 0.6859, followed by MLP4 and L2 boosting, which reach GPIs of 0.6613 and 0.6609, respectively. In contrast, the two models based on classification trees, CART and CTBag, obtain the worst performance in terms of GPI.

Table 4 Performance results in out-of-sample

Since the AUC rests on conceptual and methodological foundations different from those of the other metrics, which, as previously argued, are based on specificity and sensitivity (complementary measurements of Type I and Type II errors, respectively), we analyze this metric in more detail. Our findings show that the methods with the highest AUC values are the neural networks (MLP1 and MLP2), followed closely by SVM and L2 boosting. As with the GPI, CART and CTBag are the two underperforming methods in terms of AUC.

When comparing the performance of the models built with all the variables against the models that apply FS to reduce the dimensionality of the data, our results suggest that performance increases when FS is used. More importantly, we find that this performance gain from implementing FS holds for all the methods in terms of all the performance metrics.

Discussion

Theoretical implications

Our empirical research yields two relevant results. First, regarding the classification accuracy of the methods, our findings suggest that using FS analysis as a preprocessing technique substantially improves prediction performance while reducing the noise and redundancy of the resulting model and the computational costs of its implementation due to lower data dimensionality. All of this improves the theoretical interpretation of the final model and allows analysis of how each independent variable contributes to explaining and predicting the use of mobile P2P payments. We also find that no single method outperforms the rest in terms of all metrics; however, nonparametric techniques based on ML generally outperform classical LDA and LR. Specifically, binomial boosting, MLP4, and L2 boosting are the models that obtain the best performance according to the Global Performance Index (GPI).

Second, from a theoretical point of view, we document that (in this order of priority) the usefulness of mobile P2P payments, the influence of peers and other social groups such as friends, family, and colleagues on an individual’s behavior (i.e., subjective norms), and the perceived trust and enjoyment of the user experience in the digital context are the attributes that classify the (potential) users of mobile P2P payments with greater ability.

The major importance of usefulness in the intention to use this P2P payment service may be largely explained by the number of current users (approximately half of the population of Spain). This network effect is crucial to the success of the service because the application must be used by both the sender and the receiver. In addition, adequate resources and support are essential for users to perceive the usefulness of the service, and they even directly influence the intention to use it. Subjective norms, the next most significant factor in the intention to use the service, show that the information users share about their experience with the P2P payment service influences the intention of other users, owing to the social nature of these services. This fact is highly relevant for companies that provide these payment services, since their plans of action should focus on developing word-of-mouth strategies and encouraging current clients to recommend the service directly. Our results also show that perceived trust and enjoyment significantly affect the intention to use P2P payment services. This finding corroborates the need for service providers to develop P2P payment services that are easy to use, secure, and attractive to consumers.

The future landscape of the payment sector will be promising for financial entities and FinTech organizations that are open to change, innovation, and forward-thinking. These players need to rapidly accelerate their transformation efforts to address unmet customer demands and plug the gaps. In this vein, our findings are novel and useful for both traditional and new financial intermediaries, businesses, customers, and other stakeholders that are part of financial systems, such as policymakers and regulators. More importantly, our findings could be of interest to financial institutions to define ad hoc financial services customized for their target market.

From a theoretical perspective, our results support the necessity of implementing statistical procedures to reduce the complexity of the data. The Boruta and Gini index algorithms are preferable methods because both exploit the nonlinear relationships captured by Random Forest, one of the most advanced current ML methods.

Practical implications

From a managerial standpoint, our research findings provide valuable insights for service providers in the mobile P2P payments industry. To effectively promote the adoption and usage of their platforms, providers must prioritize enhancing usability and user experience. This can be achieved by streamlining the payment process, simplifying user interfaces, and ensuring smooth and intuitive navigation. By focusing on subjective norms, providers can tap into the power of social influence, leveraging the positive perceptions and recommendations of existing users to attract new users. Implementing strategies to encourage word-of-mouth marketing, such as referral programs or incentives for users who refer others to the service, can be an effective approach to expanding user adoption.

Building trust is another critical aspect of driving the adoption of mobile P2P payments. Service providers should prioritize security measures and communicate them transparently to users. Highlighting the safety of transactions, data protection protocols, and robust authentication methods can help alleviate concerns and increase users’ trust in the platform. In addition, incorporating features that enhance user enjoyment and engagement, such as personalized experiences, rewards, or gamification elements, can contribute to positive user perception and encourage continued usage.

Beyond the immediate managerial implications, our research findings have broader societal and economic implications. Promoting the adoption of mobile P2P payments can contribute to financial inclusion, particularly for marginalized populations, such as the young, the unemployed, and those with limited access to traditional banking services in rural areas. By providing these individuals with convenient and accessible payment solutions, barriers to financial participation can be reduced, enabling them to engage in economic activities, make transactions, and manage their finances more effectively. This, in turn, can lead to increased economic empowerment, poverty reduction, and overall societal development.

Furthermore, from a macroeconomic perspective, higher adoption of mobile P2P payments can lead to increased financial stability. By reducing the reliance on cash transactions and expanding digital payment options, the risks associated with handling physical currency, such as theft or counterfeiting, can be mitigated. Additionally, the digitization of payments enables better tracking and monitoring of financial flows, contributing to enhanced transparency and accountability within the financial system. This improved oversight can help prevent illicit activities, such as money laundering and tax evasion while facilitating more efficient financial regulations and policy implementations.

In conclusion, the implications of our research emphasize the importance of prioritizing usability, trust, and enjoyment in mobile P2P payment services. By addressing these factors and promoting the adoption of mobile P2P payments, service providers can not only drive their business success but also contribute to financial inclusion, economic development, and financial stability at both the individual and societal levels.

Limitations and avenues for future research

Despite the valuable insights gained from this study, it is important to acknowledge its limitations, which open up avenues for future research. First, enhancing the dataset by incorporating additional information, such as users’ training in new technologies, educational background, and risk aversion, would provide valuable control and moderating variables to deepen our understanding of the factors influencing the use of mobile P2P payments. This could shed light on how these individual characteristics interact with other factors and impact adoption.

Second, obtaining data on the average size of digital payment transactions would allow for an analysis of how users’ risk aversion influences their adoption of mobile P2P payments. Examining whether risk-averse individuals are more or less likely to engage in larger transactions through these payment methods could provide valuable insights into the relationship between risk perception and usage behavior.

Third, replicating this study by surveying individuals from different countries would enable an analysis of the effects of institutional frameworks on the adoption of mobile P2P payments. Comparing adoption patterns across countries with varying regulatory environments and financial infrastructures could reveal the influence of these contextual factors on user behavior.

Finally, it is important to address the limitations associated with the sample selection process, specifically the use of a nonprobability snowball sampling method. Future research should consider employing alternative sampling techniques, such as simple random sampling or quota sampling, to ensure a more representative and generalizable sample. This would enhance the external validity of the findings and provide a more comprehensive understanding of the factors influencing mobile P2P payment adoption across diverse populations.

By addressing these limitations and pursuing further research in these areas, we can gain a more nuanced understanding of the adoption and usage of mobile P2P payments, leading to more effective strategies for service providers and policymakers in driving the growth and acceptance of these payment methods. A further limitation concerns ML itself: it requires suitable choices among manifold implementation options, the handling of bias and drift in the data, and the mitigation of black-box properties.

Conclusion

In the current era of increasing digitalization and massive use of FinTech services, digital P2P payments are strongly expanding as the preferred payment option, mainly among the young. The rise of P2P payments has been enhanced by the explosion of secure mobile payment applications as well as by the COVID-19 pandemic, which dramatically limited cash payments to prevent transmission of the virus. Of course, the need to align individuals’ behaviors with the Sustainable Development Goals also requires boosting digital P2P payments as a way to increase the financial inclusion of the many individuals excluded from traditional banking services (Danisman and Tarazi 2020). Indeed, banks and other financial players are currently playing a relevant role in developing innovative payment services in which P2P payments are becoming widespread. Thus, it is crucial to examine the factors that determine customers’ adoption of mobile P2P payments to exploit their potential.

This study explores the drivers of mobile P2P adoption by using ML to predict usage amid the FinTech-driven disruption of financial services. Our main conclusion is that ML must be applied by banks and other financial intermediaries to predict their customers’ adoption of mobile/digital P2P payments; indeed, to the authors’ knowledge, this approach has not yet been employed in this field of research. In addition, our findings emphasize the relevance of usefulness, subjective norms, trust, and user enjoyment in classifying potential mobile P2P users.