1 Introduction

With the development of the sharing economy, bike-sharing systems have rapidly emerged in major cities all over the world. Bike-sharing can be described as a short-term bike rental service for inner-city transportation that provides bikes at unattended stations, and it has become one of the most important low-carbon travel modes. Compared with traditional rental services, bike-sharing is not limited by the docks at fixed stations. It provides convenient service but generates complicated problems. For instance, the layout of bike-sharing stations is flexible and the capacities of stations are not fixed, leading to large fluctuations in demand at each station. New characteristics thus emerge, such as user demand that is unevenly distributed in time and space.

Accurate demand prediction for bike-sharing can effectively improve the user experience and enhance brand competitiveness, as elaborated in [2]. El-Assi et al. [3] and Ermagun et al. [4] elaborated the main problems and difficulties in bike-sharing demand prediction. Solutions can be divided into two main types. The traditional type is based on statistical analysis. Yang et al. [5] proposed a semi-parametric geographically weighted regression method to estimate bike-sharing trips using location-based social network data. The authors of [6] proposed a method combining bootstrapping and subset selection that utilizes the partially useful information at each bike-sharing station; it can handle cases in which data-cleaning approaches fail due to the lack of original data.

The other type is based on artificial neural networks. Yang et al. [7] applied convolutional neural networks to predict daily bike-sharing demand at both city and station levels. Lin et al. [8] proposed a graph convolutional neural network with a data-driven graph filter and also discussed the heterogeneity of demand among different bike-sharing stations. Xu et al. [9] developed a dynamic demand prediction model based on deep learning with large-scale datasets; their comparison suggested that long short-term memory neural networks predict more accurately than statistical models and advanced machine learning methods. Chang et al. [10] developed a prediction framework integrating an artificial immune system and neural networks, and verified its performance by comparison with other models. Feng et al. [11] discussed a Markov chain population model to predict bike demand among different travel stations. Kim [12] studied the influence of weather conditions and time characteristics on bike-sharing demand. Furthermore, deep learning methods and hybrid methods with heuristic algorithms have been applied in various engineering projects [13,14,15,16], but rarely to bike-sharing demand prediction.

Those methods nevertheless have limitations. Increasing the number of hidden layers is a feasible way to reach a given prediction accuracy, but it may cause overfitting and reduce the generalization performance of the model. Likewise, accelerating gradient descent improves the convergence rate but destabilizes generalization. Considering these limitations, this paper proposes a novel neural network model, the pseudo-double hidden layer feedforward neural network, together with a learning algorithm, extreme learning machine with improved particle swarm optimization, which tunes the weights and biases of the network to improve prediction accuracy. Finally, experiments on the dataset of the Streeter Dr bike-sharing station in Chicago verify the effectiveness of the proposed model.

2 Pseudo-double hidden layer feedforward neural networks

2.1 Network structure

Pseudo-double hidden layer feedforward neural networks (PDLFNs) are biologically inspired computational models consisting of processing elements (neurons) and weighted connections between them. The structure of PDLFNs differs from that of single hidden layer feedforward neural networks (SLFNs) and double hidden layer feedforward neural networks (DLFNs). As shown in Fig. 1, a PDLFN contains one input layer, pseudo-double hidden layers, and one output layer. The hidden layers consist of layer V and layer B. In SLFNs and DLFNs, each hidden layer is a collection of neurons with activation functions, providing one or two intermediate layers between the input layer and the output layer. In PDLFNs, by contrast, layer V is a special hidden layer containing a single neuron equipped with a smooth function. PDLFNs can therefore process raw sample data directly to produce the final results.

Fig. 1

PDLFNs structure

The design of layer V is motivated by two observations. First, multiple hidden layers can reach high prediction accuracy even with few neurons in each hidden layer [17]. Second, sample data contain noise, which smoothing can reduce [18].

Without loss of generality, assume that the numbers of neurons in the input layer, layer B, and the output layer are I, J, and K, respectively. N samples, commonly in the form of multivariate time series, are fed into the PDLFN. Each sample contains I-dimensional data. Mathematically, the n-th \(\left (1\leq n\leq N\right )\) sample is represented by the vector \(X^{\left (n\right )}=\left (x_{1}^{\left (n\right )},x_{2}^{\left (n\right )},...,x_{I}^{\left (n\right )}\right )^{\mathrm {T}}\), where \(x_{i}^{\left (n\right )} \left (1\leq i\leq I\right )\) is the datum presented to the i-th neuron in the input layer. The corresponding weight vector from the input layer to layer V is denoted as \(W^{\left (\text {IV}\right )}=\left (w_{1}^{\left (\text {IV}\right )},w_{2}^{\left (\text {IV}\right )},...,w_{I}^{\left (\text {IV}\right )}\right )^{\mathrm {T}}\), where \(w_{i}^{\left (\text {IV}\right )}\) quantifies the influence of the i-th input neuron on the neuron in layer V.

Compared with DLFNs, the neuron of layer V in PDLFNs carries neither a bias nor an activation function; instead, it applies a smooth function. The smooth function, denoted as \(S_{\mathrm {V}}\left (X^{\left (n\right )},W^{\left (\text {IV}\right )}\right )\), can take different forms according to the characteristics of the sample dataset, such as a moving average function, an exponential smoothing function, an auto-regressive function, or an adaptive filtering function. Propagating a sample through layer V, layer B, and the output layer yields the overall network mapping in Eq. 1, whose components are defined below.

$$ \begin{array}{@{}rcl@{}} O_{k}\left( X^{\left( n\right)},W^{\left( \text{BO}\right)}\right) &=& A_{\mathrm{O}}\left( {\sum}_{j=1}^{J}\left( w_{jk}^{\left( \text{BO}\right)}B_{j}\left( X^{\left( n\right)},W^{\left( \text{VB}\right)}\right)\right)+b_{k}^{\left( \mathrm{O}\right)}\right)\\ &=& A_{\mathrm{O}}\left( {\sum}_{j=1}^{J}\left( w_{jk}^{\left( \text{BO}\right)}A_{\mathrm{B}}\left( w_{j}^{\left( \text{VB}\right)}S_{\mathrm{V}}\left( X^{\left( n\right)},W^{\left( \text{IV}\right)}\right) + b_{j}^{\left( \mathrm{B}\right)}\right)\right) + b_{k}^{\left( \mathrm{O}\right)}\right). \end{array} $$
(1)

The rest of a PDLFN is similar to a traditional feedforward neural network: each neuron is assigned a bias, and its output is produced by an activation function. The weighted summation in each neuron includes a bias that lowers or raises the input to the activation function, which may be, for example, a linear, sigmoid, or hard-limit function. Neurons in the same layer always share the same activation function. The weight vector between layer V and layer B is denoted as \(W^{\left (\text {VB}\right )}=(w_{1}^{\left (\text {VB}\right )},w_{2}^{\left (\text {VB}\right )},..., w_{J}^{\left (\text {VB}\right )})^{\mathrm {T}}\), where \(w_{j}^{\left (\text {VB}\right )} \left (1\leq j\leq J\right )\) is the weight of the connection between the neuron in layer V and the j-th neuron in layer B. The bias of the j-th neuron in layer B is denoted as \(b_{j}^{\left (\mathrm {B}\right )}\). The output of the j-th neuron in layer B is denoted as \(B_{j}\left (X^{\left (n\right )},W^{\left (\text {VB}\right )}\right )\) in Eq. 2, where \(A_{\mathrm {B}}\left (x\right )\) is the activation function of the neurons in layer B.

$$ B_{j}\left( X^{\left( n\right)},W^{\left( \text{VB}\right)}\right) = A_{\mathrm{B}}\left( w_{j}^{\left( \text{VB}\right)}S_{\mathrm{V}}\left( X^{\left( n\right)},W^{\left( \text{IV}\right)}\right) + b_{j}^{\left( \mathrm{B}\right)}\right). $$
(2)
$$ \begin{array}{@{}rcl@{}} O=\left( O_{1}\left( X^{\left( n\right)},W^{\left( \text{BO}\right)}\right),O_{2}\left( X^{\left( n\right)},W^{\left( \text{BO}\right)}\right),...,\right.\\ \left.O_{K}\left( X^{\left( n\right)},W^{\left( \text{BO}\right)}\right)\right)^{\mathrm{T}}. \end{array} $$
(3)

Finally, the outputs of the PDLFN are collected in the vector in Eq. 3, where \(O_{k}\left (X^{\left (n\right )},W^{\left (\text {BO}\right )}\right ) \left (1\leq k\leq K\right )\) is the output of the k-th neuron in the output layer, given by Eq. 1. Here \(W^{\left (\text {BO}\right )}\) is the weight matrix between layer B and the output layer in Eq. 4, \(w_{jk}^{\left (\text {BO}\right )}\) the weight of the connection between the j-th neuron in layer B and the k-th neuron in the output layer, \(b_{k}^{\left (\mathrm {O}\right )}\) the bias of the k-th neuron in the output layer, and \(A_{\mathrm {O}}\left (x\right )\) the activation function of the neurons in the output layer.

$$ W^{\left( \text{BO}\right)}=\left[\begin{array}{cccc} w_{11}^{\left( \text{BO}\right)} & w_{12}^{\left( \text{BO}\right)} & ... & w_{1K}^{\left( \text{BO}\right)}\\ w_{21}^{\left( \text{BO}\right)} & w_{22}^{\left( \text{BO}\right)} & ... & w_{2K}^{\left( \text{BO}\right)}\\ {\vdots} & {\vdots} & {\ddots} & \vdots\\ w_{J1}^{\left( \text{BO}\right)} & w_{J2}^{\left( \text{BO}\right)} & ... & w_{JK}^{\left( \text{BO}\right)} \end{array}\right]. $$
(4)
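To make the mapping in Eqs. 1–4 concrete, the following minimal NumPy sketch computes the PDLFN forward pass for one sample, assuming a weighted-moving-average smooth function for layer V and tanh/linear activations, the choices adopted later in Section 3.2; all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def pdlfn_forward(x, w_iv, w_vb, b_b, w_bo, b_o):
    """PDLFN forward pass for one sample x of shape (I,).

    w_iv: (I,) input-to-V weights; w_vb, b_b: (J,) V-to-B weights and biases;
    w_bo: (J, K) B-to-output weights; b_o: (K,) output biases.
    """
    s = np.dot(x, w_iv) / np.sum(w_iv)   # layer V: smooth function S_V (a WMA here)
    h = np.tanh(w_vb * s + b_b)          # layer B: Eq. 2 with tanh as A_B
    return h @ w_bo + b_o                # output layer: Eq. 1 with linear A_O
```

Note how layer V collapses the I-dimensional sample to the scalar s, which is why a single neuron suffices in that layer.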

2.2 Learning algorithm

Traditional learning algorithms for feedforward neural networks are mainly based on gradient descent. The back-propagation (BP) algorithm is the representative one: gradients are computed efficiently by propagating errors from the output back to the input. It has been one of the most successful and widely used algorithms for training neural networks in recent years. However, it has several limitations: network structures and learning rates are not easy to determine, convergence can be unstable, and training is time-consuming. To resolve these issues with gradient-based algorithms, Huang et al. [19] proposed an efficient learning algorithm called the extreme learning machine (ELM) for feedforward neural networks, especially SLFNs. ELM has several salient features that distinguish it from traditional gradient-based learning algorithms.

  1. Ease of use. No parameters need to be tuned manually during the iterative procedure except for the predefined network architecture. By default, the number of neurons in the hidden layer is equal or approximately equal to the number of samples.

  2. Fast learning speed. Most training can be completed within minutes.

  3. High generalization performance. It obtains better generalization performance than gradient-based learning algorithms in most cases.

  4. Suitability for nonlinear activation functions. Almost all piecewise continuous functions can be used as activation functions.

ELM avoids some difficulties of gradient-based learning algorithms, such as determining stopping criteria, learning rates, and the number of learning epochs, and escaping local minima. However, ELM tends to require more hidden neurons than traditional gradient-based algorithms and can suffer from ill-conditioning because the input weights and hidden biases are determined randomly. Considering these limitations, an improved particle swarm optimization (IPSO) is proposed to optimize the input weights and hidden biases.
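For reference, the core of ELM for an SLFN reduces to one random draw and one least-squares solve. The following is a minimal sketch assuming a tanh activation; the function name and shapes are illustrative, not the exact routine used later for PDLFNs.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=np.random.default_rng(0)):
    """ELM for an SLFN: random hidden parameters, analytic output weights.

    X: (N, I) training inputs; T: (N, K) training targets.
    """
    W = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                              # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                        # Moore-Penrose least-squares fit
    return W, b, beta
```

Because W and b are never revised after the random draw, an unlucky draw can make H ill-conditioned, which is precisely the weakness IPSO addresses below.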

2.2.1 Improved particle swarm optimization

Particle swarm optimization (PSO) is one of the most representative meta-heuristic optimization algorithms. It mimics the social behavior of organisms, such as birds in a flock or fish in a school, which grants them survival advantages.

Considering a swarm with M particles in a D-dimensional search space, there is a position vector \({Z_{m}^{t}}=(z_{m1}^{t},z_{m2}^{t}, ..., z_{mD}^{t})^{\mathrm {T}} \left (1\leq m\leq M\right )\) and a velocity vector \({V_{m}^{t}}=\left (v_{m1}^{t},v_{m2}^{t},...,v_{mD}^{t}\right )^{\mathrm {T}}\) for the m-th particle after the t-th iteration. \({Z_{m}^{0}}\) and \({V_{m}^{0}}\) are the initial position and velocity of the m-th particle, respectively. The best position of the m-th particle is denoted as \(PBEST_{m}=\left (pbest_{m1},pbest_{m2},...,pbest_{mD}\right )^{\mathrm {T}},\) and the best position of all particles as \(GBEST=\left (gbest_{1},gbest_{2},...,gbest_{D}\right )^{\mathrm {T}}\). In the classical version of PSO, the velocity and position vectors are updated according to Eqs. 6 and 7,

$$ v_{md}^{t+1} = wv_{md}^{t} + c_{1}{r_{1}^{t}}\left( pbest_{md} - z_{md}^{t}\right) + c_{2}{r_{2}^{t}}\left( gbest_{d} - z_{md}^{t}\right), $$
(6)
$$ z_{md}^{t+1} = z_{md}^{t}+v_{md}^{t+1}, $$
(7)

where \(1\leq d\leq D\); \(w\), \(c_{1}\), \(c_{2}\), \({r_{1}^{t}}\) and \({r_{2}^{t}}\) are, respectively, the inertia weight, two positive acceleration constants, and two random parameters uniformly distributed in \(\left [0,1\right ]\). The final solution of PSO is sensitive to these control parameters.
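For concreteness, one classical PSO step (Eqs. 6 and 7) can be sketched in NumPy as below; updating the whole swarm at once and drawing \({r_{1}^{t}}\), \({r_{2}^{t}}\) per particle and dimension are common implementation choices, and the parameter values are illustrative.

```python
import numpy as np

def pso_step(v, z, pbest, gbest, w=0.7, c1=2.0, c2=2.0,
             rng=np.random.default_rng(0)):
    """Classical PSO update (Eqs. 6 and 7). v, z, pbest: (M, D); gbest: (D,)."""
    r1 = rng.random(v.shape)   # r_1^t ~ U[0, 1]
    r2 = rng.random(v.shape)   # r_2^t ~ U[0, 1]
    v_new = w * v + c1 * r1 * (pbest - z) + c2 * r2 * (gbest - z)  # Eq. 6
    return v_new, z + v_new    # Eq. 7: position update
```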

Most improved approaches revolve around adapting the inertia weight \(w\), which is important for balancing global search, also known as exploration, against local search, known as exploitation.

In order to make the algorithm converge to the global optimum more quickly and reliably, a comprehensively improved method is proposed: an adaptive PSO combined with a compression factor. The velocity vectors are updated according to Eq. 5,

$$ v_{md}^{t+1}=\left\{\begin{array}{ll} \left( w_{\max}-\frac{t\left( w_{\max}-w_{\min}\right)}{T_{\max}}\right)v_{md}^{t}+c_{1}{r_{1}^{t}}\left( pbest_{md}-z_{md}^{t}\right)+c_{2}{r_{2}^{t}}\left( gbest_{d}-z_{md}^{t}\right), & 0\leq t<\frac{2T_{\max}}{3},\\ \lambda v_{md}^{t}+c_{1}{r_{1}^{t}}\left( pbest_{md}-z_{md}^{t}\right)+c_{2}{r_{2}^{t}}\left( gbest_{d}-z_{md}^{t}\right), & \frac{2T_{\max}}{3}\leq t\leq T_{\max}, \end{array}\right. $$
(5)

where \(\lambda =2/{\bigl |2-\beta -\sqrt {\beta \left (\beta -4\right )}\bigr |}\) is the compression factor with \(\beta =c_{1}+c_{2}\), and \(T_{\max }\), \(w_{\max }\) and \(w_{\min }\) are the maximum number of iterations, the initial inertia weight, and the final inertia weight, respectively. The adaptive inertia weight in the first branch of Eq. 5 equips the algorithm with a strong global search ability in the early iterations, while the compression factor in the second branch equips it with a refined local search ability in the later iterations.
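As a hedged sketch, the piecewise rule in Eq. 5 translates to NumPy as follows; the values \(w_{\max }=0.9\), \(w_{\min }=0.4\) and \(c_{1}=c_{2}=2.05\) are illustrative defaults rather than values from this paper (note that \(\beta =c_{1}+c_{2}\) must exceed 4 for \(\lambda \) to be real-valued).

```python
import numpy as np

def ipso_velocity(v, z, pbest, gbest, t, T_max,
                  w_max=0.9, w_min=0.4, c1=2.05, c2=2.05,
                  rng=np.random.default_rng(0)):
    """IPSO velocity update (Eq. 5). v, z, pbest: (M, D); gbest: (D,)."""
    r1, r2 = rng.random(v.shape), rng.random(v.shape)
    pull = c1 * r1 * (pbest - z) + c2 * r2 * (gbest - z)
    if t < 2 * T_max / 3:                        # early stage: adaptive inertia weight
        w = w_max - t * (w_max - w_min) / T_max
        return w * v + pull
    beta = c1 + c2                               # late stage: compression factor
    lam = 2.0 / abs(2.0 - beta - np.sqrt(beta * (beta - 4.0)))
    return lam * v + pull
```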

2.2.2 Improved extreme learning machine

In this paper, ELM combined with IPSO (IPSO-ELM) is proposed as the learning algorithm of PDLFNs, as shown in Fig. 2. IPSO is applied to optimize the input weights and hidden biases, improving the generalization ability.

Fig. 2

IPSO-ELM learning process
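One plausible reading of the loop in Fig. 2 is sketched below: each particle encodes the input weights \(W^{\left (\text {IV}\right )}\), the weights \(W^{\left (\text {VB}\right )}\) and the layer-B biases, and its fitness is the training MSE after the output weights are solved analytically as in ELM. The encoding, the fitness choice, and the helper pdlfn_hidden are assumptions for illustration rather than the authors' exact procedure; ipso_velocity is the sketch from Section 2.2.1.

```python
import numpy as np

def pdlfn_hidden(X, p, I, J):
    """Layer-B output matrix for particle p = [W_IV | W_VB | b_B] (illustrative)."""
    w_iv, w_vb, b_b = p[:I], p[I:I + J], p[I + J:]
    s = (X @ w_iv) / np.sum(w_iv)            # layer V (assumes the WMA of Eq. 8)
    return np.tanh(np.outer(s, w_vb) + b_b)  # layer B, tanh activation

def ipso_elm(X, T, I, J, M=30, T_max=100, rng=np.random.default_rng(0)):
    """Tune hidden parameters with IPSO; solve output weights with ELM."""
    D = I + 2 * J
    z, v = rng.uniform(-1, 1, (M, D)), np.zeros((M, D))
    pbest, pfit = z.copy(), np.full(M, np.inf)
    gbest, gfit = z[0].copy(), np.inf
    for t in range(T_max + 1):
        for m in range(M):
            H = pdlfn_hidden(X, z[m], I, J)
            beta = np.linalg.pinv(H) @ T          # ELM step for this particle
            fit = np.mean((H @ beta - T) ** 2)    # fitness: training MSE
            if fit < pfit[m]:
                pfit[m], pbest[m] = fit, z[m].copy()
            if fit < gfit:
                gfit, gbest = fit, z[m].copy()
        v = ipso_velocity(v, z, pbest, gbest, t, T_max)  # Eq. 5
        z = z + v                                        # Eq. 7
    return gbest, gfit
```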

3 Demand prediction model

3.1 Prediction periods

The self-regulating ability of bike-sharing systems can generally meet users' rental demands during off-peak and low-demand periods. However, this does not hold for peak periods. Meanwhile, rental demand during peak periods is one of the main factors influencing scheduling schemes. Thus, this paper only discusses the demand prediction problem during peak periods.

3.2 Demand prediction model

The demand prediction model of bike-sharing is shown in Fig. 3. In this model, the numbers of neurons in the input layer, layer B, and the output layer are 3, 29, and 1, respectively. The bike rental demands from the (n − 3)-th day to the (n − 1)-th day are fed into the input layer. The weighted moving average (WMA) function in Eq. 8 is applied as the smooth function for the neuron in layer V.

$$ S_{\mathrm{V}}\left( X^{\left( n\right)},W^{\left( \text{IV}\right)}\right)=\frac{{\sum}_{i=1}^{3}x_{i}^{\left( n\right)}w_{i}^{\left( \text{IV}\right)}}{{\sum}_{i=1}^{3}w_{i}^{\left( \text{IV}\right)}}. $$
(8)
Fig. 3

Demand prediction model of bike-sharing based on PDLFNs

The hyperbolic tangent sigmoid function in Eq. 9 and the linear function in Eq. 10 are applied as the activation functions for the neurons in layer B and the output layer, respectively.

$$ B_{j}\left( X^{\left( n\right)},W^{\left( \text{VB}\right)}\right) = \frac{1 - e^{-2\left( w_{j}^{\left( \text{VB}\right)}S_{\mathrm{V}}\left( X^{\left( n\right)},W^{\left( \text{IV}\right)}\right)+b_{j}^{\left( \mathrm{B}\right)}\right)}}{1 + e^{-2\left( w_{j}^{\left( \text{VB}\right)}S_{\mathrm{V}}\left( X^{\left( n\right)},W^{\left( \text{IV}\right)}\right)+b_{j}^{\left( \mathrm{B}\right)}\right)}}, $$
(9)
$$ O\left( X^{\left( n\right)},W^{\left( \text{BO}\right)}\right)={\sum}_{j=1}^{29}w_{j}^{\left( \text{BO}\right)}B_{j}\left( X^{\left( n\right)},W^{\left( \text{VB}\right)}\right). $$
(10)

There is only one neuron in the output layer. Thus, \(W^{\left (\text {BO}\right )}\), the weight matrix from layer B to the output layer, degenerates to a weight vector in this model. \(O\left (X^{\left (n\right )},W^{\left (\text {BO}\right )}\right )\), produced by the neuron in the output layer of the PDLFN, is the predicted demand on the n-th day.
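As a worked illustration of Eqs. 8–10 with the 3–29–1 structure, the snippet below pushes three hypothetical daily demands through the model. The weights are randomly initialized here purely for demonstration; in the paper they are tuned by IPSO-ELM.

```python
import numpy as np

rng = np.random.default_rng(0)
w_iv = rng.uniform(0.1, 1.0, 3)            # positive WMA weights (Eq. 8)
w_vb, b_b = rng.uniform(-1, 1, 29), rng.uniform(-1, 1, 29)
w_bo = rng.uniform(-1, 1, 29)              # output weight vector (Eq. 10)

x_hist = np.array([118.0, 135.0, 127.0])   # hypothetical demands, days n-3..n-1
s = np.dot(x_hist, w_iv) / np.sum(w_iv)    # Eq. 8: weighted moving average
h = np.tanh(w_vb * s + b_b)                # Eq. 9: hyperbolic tangent sigmoid
demand_pred = float(np.dot(w_bo, h))       # Eq. 10: predicted demand for day n
```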

3.3 Evaluation criteria

To verify the effectiveness of the proposal, the mean square error (MSE) and the square correlation coefficient (R2) are selected as the evaluation criteria.

$$ MSE=\frac{1}{N}{\sum}_{n=1}^{N}\left( r_{n}-p_{n}\right)^{2}, $$
$$ R^{2}=\frac{\left( {\sum}_{n=1}^{N}\left( p_{n}-\overline{p}\right)\left( r_{n}-\overline{r}\right)\right)^{2}}{{\sum}_{n=1}^{N}\left( p_{n}-\overline{p}\right)^{2}{\sum}_{n=1}^{N}(r_{n}-\overline{r})^{2}}, $$

where \(p_{n},r_{n},\overline {p}\) and \(\overline {r}\) are the predicted bike rental demand on the n-th day, the recorded demand on the n-th day, the average predicted demand over the whole period, and the average recorded demand over the whole period, respectively.
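Both criteria translate directly to NumPy; a minimal sketch:

```python
import numpy as np

def mse(r, p):
    """Mean square error between recorded demands r and predictions p."""
    return np.mean((r - p) ** 2)

def r_squared(r, p):
    """Square correlation coefficient R^2 as defined above."""
    cov = np.sum((p - p.mean()) * (r - r.mean()))
    return cov ** 2 / (np.sum((p - p.mean()) ** 2) * np.sum((r - r.mean()) ** 2))
```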

4 Demand prediction of Streeter Dr bike-sharing station

4.1 Data loading and preparation

The dataset used in this paper comes from the official website of the Streeter Dr bike-sharing station in Chicago. We split the data into two sets, one for training and the other for testing. The recorded bike rental demands over the 92 days from March 1, 2017 to May 31, 2017 form the training set, and the recorded demands over the next 30 days form the test set.
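A sketch of this split, assuming the raw records have already been aggregated into a daily peak-period demand series; the file name and column names are hypothetical.

```python
import pandas as pd

# Hypothetical CSV: one row per day with the peak-period rental count.
demand = pd.read_csv("streeter_dr_daily_demand.csv",
                     index_col="date", parse_dates=True)["rentals"]

train = demand["2017-03-01":"2017-05-31"]    # 92 days for training
test = demand["2017-06-01":"2017-06-30"]     # next 30 days for testing

# Each sample: demands of days n-3, n-2, n-1 as input, day n as target.
X = pd.concat([demand.shift(3), demand.shift(2), demand.shift(1)], axis=1)
X.columns = ["d_minus_3", "d_minus_2", "d_minus_1"]
X = X.dropna()
y = demand.loc[X.index]
```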

The regularities in the dataset show that the peak bike rental periods always fall within the same interval. Taking March as an example, the number of rental bikes is shown in Fig. 4. The peak rental periods are mainly distributed between 13:00 and 17:00.

Fig. 4

The number of rental bikes in March

4.2 Prediction results

To verify the performance of IPSO, the MSEs before and after the improvement are shown in Fig. 5. The global search ability of the particle swarm is effectively improved by optimizing the particles' velocities: the inertia weight is adapted in the early iterations and the compression factor is applied in the later iterations. IPSO-ELM is capable of jumping out of local optima and finding a solution in the final stage. The prediction results are shown in Fig. 6.

Fig. 5

Mean square errors of two learning algorithms

Fig. 6

Prediction results of PDLFNs with IPSO-ELM

4.3 Comparison and discussion

To verify the effectiveness of the prediction model in this paper, comparative experiments are conducted on 3 network structures (SLFNs, DLFNs, and PDLFNs) combined with 3 learning algorithms (ELM, PSO-ELM, and IPSO-ELM). The evaluation criteria are the MSE and R2 proposed in the preceding section. The experimental results of the 9 prediction models are shown in Table 1.

Table 1 MSE and R2

The longitudinal comparison holds the network structure fixed and varies the learning algorithm. For SLFNs, DLFNs, and PDLFNs alike, IPSO-ELM always attains the minimum MSE and the maximum R2. This suggests that IPSO-ELM is the best of the three learning algorithms: the neural networks trained with it obtain the most accurate predictions. The horizontal comparison, which holds the learning algorithm fixed and varies the network structure, shows that PDLFNs obtain the most accurate predictions. In terms of MSE and R2, PDLFNs with IPSO-ELM is the best of the 9 prediction models, and changing the network structure improves prediction accuracy more than changing the learning algorithm. The effectiveness in the remaining experimental results is shown in Figs. 7 and 8.

Fig. 7

Prediction results in different models

Fig. 8

Normalized MSE in different models

5 Conclusion

Aiming at predicting bike-sharing demand, this paper constructs the PDLFN model consisting of “input layer - V layer - B layer - output layer” and improves ELM by combining it with IPSO as the learning algorithm. Comparative experiments on predicting the demand of the Streeter Dr bike-sharing station verify two advantages of this model. First, a simple network structure achieves stable generalization and high accuracy. Second, the improved ELM, which optimizes the selection of the input weights and hidden biases, is an effective learning algorithm for feedforward neural networks beyond SLFNs. The comparative experiments verify its effectiveness in terms of prediction accuracy.

In this paper, we have predicted bike-sharing demand during peak periods, which is one of the main factors influencing scheduling schemes. Nevertheless, optimal scheduling schemes are also influenced by demand during off-peak and low-demand periods. In future work, we will consider this issue and predict 24-hour demand in the bike-sharing system.