1 Introduction

The entropy H(f) of the random variable X with distribution function F(x) and continuous density function f(x) is defined by [23] to be

$$\begin{aligned} H(f)=-\int \nolimits _{-\infty }^\infty {f(x) \log f(x) dx}. \end{aligned}$$
(1.1)

Estimation of entropy from a random sample has been considered by many authors. For discrete random variables, References [8, 15, 29] have proposed estimators of entropy, while [5, 9, 11, 14, 16, 27, 28] have proposed solutions for the problem of estimating the entropy for continuous random variables.

Vasicek [28] expressed (1.1) as

$$\begin{aligned} H(f)=\int \nolimits _{ 0}^{ 1} {\log \left\{ {\frac{d}{dp}F^{-1}(p)} \right\} dp}, \end{aligned}$$
(1.2)

and then, by replacing the distribution function F with the empirical distribution function \(F_n\) and using a difference operator in place of the differential operator, proposed an entropy estimator. He also estimated the derivative of \(F^{-1}(p)\) by a function of the order statistics. Assuming that \(X_1,\ldots ,X_n\) is a random sample, Vasicek’s estimator is

$$\begin{aligned} \textit{HV}_{mn} =\frac{1}{n}\sum \limits _{i=1}^n {\log \left\{ {\frac{n}{2m}\left( X_{(i+m)} -X_{(i-m)} \right) } \right\} }, \end{aligned}$$
(1.3)

where the window size m is a positive integer smaller than \(n/2,\, X_{(i)} =X_{(1)}\) if \(i<1,\,X_{(i)} =X_{(n)}\) if \(i>n\) and \(X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}\) are order statistics based on a random sample of size n. Vasicek proved that his estimator is consistent, i.e., \(\textit{HV}_{mn} \xrightarrow {\Pr \!.}H(f)\) as \(n\rightarrow \infty , \, m\rightarrow \infty ,\,\frac{m}{n}\rightarrow 0.\)
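For concreteness, here is a minimal Python/NumPy sketch of (1.3); the function name `vasicek_entropy` and the index clipping used to implement the boundary convention \(X_{(i)}=X_{(1)}\) for \(i<1\) and \(X_{(i)}=X_{(n)}\) for \(i>n\) are our own choices, not part of [28].

```python
import numpy as np

def vasicek_entropy(x, m):
    """Vasicek's entropy estimator HV_{mn} of Eq. (1.3)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    idx = np.arange(n)
    upper = np.minimum(idx + m, n - 1)   # X_(i) = X_(n) if i > n
    lower = np.maximum(idx - m, 0)       # X_(i) = X_(1) if i < 1
    return np.mean(np.log(n / (2.0 * m) * (x[upper] - x[lower])))
```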

Van Es [27] introduced an estimator of entropy based on spacings and proved its consistency and asymptotic normality under some conditions. Van Es’ estimator is given by

$$\begin{aligned} \textit{HVE}_{mn} =\frac{1}{n-m}\sum \limits _{i=1}^{n-m} {\log \left( {\frac{n+1}{m}\left( X_{(i+m)} -X_{(i)}\right) } \right) } +\sum \limits _{k=m}^n{\frac{1}{k}} +\log (m)-\log (n+1). \end{aligned}$$
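A corresponding sketch (again a hedged illustration; the function name `van_es_entropy` is our own, and the log of the normalized spacings follows the form written above):

```python
import numpy as np

def van_es_entropy(x, m):
    """Van Es's spacings-based entropy estimator HVE_{mn}."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    spacings = x[m:] - x[:-m]                      # X_(i+m) - X_(i), i = 1, ..., n - m
    term = np.mean(np.log((n + 1) / m * spacings))
    harmonic = np.sum(1.0 / np.arange(m, n + 1))   # sum_{k=m}^{n} 1/k
    return term + harmonic + np.log(m) - np.log(n + 1)
```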

Ebrahimi et al. [9] modified Vasicek’s estimator and proposed the following estimator:

$$\begin{aligned} \textit{HE}_{mn} =\frac{1}{n}\sum \limits _{i=1}^n {\log \left\{ {\frac{n}{c_i m}\left( X_{(i+m)} -X_{(i-m)} \right) } \right\} }, \end{aligned}$$

where

$$\begin{aligned} c_i =\left\{ {{\begin{array}{l} {1+\frac{i-1}{m}, \quad 1\le i\le m}, \\ {2, \quad m+1\le i\le n-m}, \\ {1+\frac{n-i}{m}, \quad n-m+1\le i\le n}. \\ \end{array} }} \right. \end{aligned}$$

They showed that \(\textit{HE}_{mn} \xrightarrow {{\Pr \!.}}H(f)\) as \(n\rightarrow \infty ,\,m\rightarrow \infty ,\,m/n\rightarrow 0,\) and demonstrated by simulation that their estimator has smaller bias and mean squared error than Vasicek’s estimator.
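A sketch of \(\textit{HE}_{mn}\), under the same boundary convention as in the Vasicek sketch above (the weights \(c_i\) follow the definition just given):

```python
import numpy as np

def ebrahimi_entropy(x, m):
    """Ebrahimi et al.'s modified entropy estimator HE_{mn}."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    c = np.full(n, 2.0)
    c[i <= m] = 1.0 + (i[i <= m] - 1) / m                  # 1 <= i <= m
    c[i >= n - m + 1] = 1.0 + (n - i[i >= n - m + 1]) / m  # n - m + 1 <= i <= n
    upper = np.minimum(i - 1 + m, n - 1)
    lower = np.maximum(i - 1 - m, 0)
    return np.mean(np.log(n / (c * m) * (x[upper] - x[lower])))
```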

Correa [5] proposed a modification of Vasicek’s estimator as follows.

He considered the sample information as

$$\begin{aligned} \left( F_n \left( X_{(1)} \right) ,\,X_{(1)}\right) ,\,\left( F_n \left( X_{(2)} \right) ,\,X_{(2)}\right) ,\ldots ,\left( F_n\left( X_{(n)} \right) ,\,X_{(n)}\right) , \end{aligned}$$

and rewrote Vasicek’s estimator (1.3) as

$$\begin{aligned} \textit{HV}_{mn} =-\frac{1}{n}\sum \limits _{i=1}^n {\log \left\{ {\frac{(i+m)/n-(i-m)/n}{X_{(i+m)} -X_{(i-m)} } } \right\} }. \end{aligned}$$

Then he noted that the argument of the logarithm is the slope of the straight line joining the points \((F_n (X_{(i-m)} ),\,X_{(i-m)})\) and \((F_n (X_{(i+m)} ),\,X_{(i+m)} ),\) and therefore applied a local linear model based on \(2m+1\) points to estimate the density in the interval \((X_{(i-m)},\,X_{(i+m)}),\)

$$\begin{aligned} F\left( x_{(j)}\right) =\upalpha +\upbeta x_{(j)} +\upvarepsilon \quad j=i-m,\ldots ,i+m. \end{aligned}$$

Via the least squares method, he proposed the following entropy estimator:

$$\begin{aligned} \textit{HC}_{mn} =-\frac{1}{n}\sum \limits _{i=1}^n {\log \left( {\frac{\sum \nolimits _{j=i-m}^{i+m} {(X_{(j)} -\bar{{X}}_{(i)} )(j-i)} }{n\sum \nolimits _{j=i-m}^{i+m} {(X_{(j)} -\bar{{X}}_{(i)} )^{2}} }} \right) }, \end{aligned}$$

where

$$\begin{aligned} \bar{{X}}_{(i)} =\frac{1}{2m+1}\sum \limits _{j=i-m}^{i+m} {X_{(j)}}. \end{aligned}$$

He compared his estimator with Vasicek’s and van Es’s estimators and concluded that the mean squared error (MSE) of his estimator is smaller than that of Vasicek’s estimator; moreover, for some values of m his estimator behaves better than van Es’s estimator.
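A sketch of \(\textit{HC}_{mn}\); the boundary handling (clipping the window indices at the first and last order statistics, in line with the convention stated for Vasicek’s estimator) is an assumption of this sketch:

```python
import numpy as np

def correa_entropy(x, m):
    """Correa's local linear entropy estimator HC_{mn}."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    total = 0.0
    for i in range(n):
        offsets = np.arange(-m, m + 1)               # j - i = -m, ..., m
        window = x[np.clip(i + offsets, 0, n - 1)]   # X_(j), with ends clipped
        xbar = window.mean()
        num = np.sum((window - xbar) * offsets)
        den = n * np.sum((window - xbar) ** 2)
        total += np.log(num / den)
    return -total / n
```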

Entropy estimators are used to develop entropy-based statistical procedures; see, for example, [1, 2, 10, 13, 20, 21]. Therefore, new entropy estimators can be useful in practice.

For many computational problems in applied probability and statistics, we have to compute the Riemann–Stieltjes integral of the following form

$$\begin{aligned} \int \nolimits _a^b {f(x)dg(x)}, \end{aligned}$$
(1.4)

where the function g(x) is usually a distribution function.

Direct integration of the Riemann–Stieltjes integral can be used to compute convolution integrals. This approach has been shown to be simple and accurate, with good convergence properties.

The above integral can be approximated directly using the definition of the Riemann–Stieltjes integral. The function f(x) may be approximated by a piecewise constant function \(\tilde{f}(x)\), and by transforming the integral into a summation we obtain

$$\begin{aligned} \int \nolimits _a^b {f(x)dg(x)} \approx \sum \limits _{i=1}^n {\tilde{f}\left( x_i \right) \cdot \left( g\left( x_i\right) -g\left( x_{i-1}\right) \right) }. \end{aligned}$$

Therefore, a numerical algorithm can be obtained based on the above formula.
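As an illustration, here is a minimal sketch of this definition-based approximation, with \(\tilde{f}\) taken to be the left-endpoint value of f on each subinterval (one possible piecewise constant choice); the function name and the example with the standard normal CDF are our own:

```python
import numpy as np
from scipy.stats import norm

def rs_integral_piecewise(f, g, a, b, n):
    """Approximate int_a^b f(x) dg(x) with f replaced by a piecewise
    constant function (here: its left-endpoint value on each subinterval)."""
    x = np.linspace(a, b, n + 1)
    return np.sum(f(x[:-1]) * (g(x[1:]) - g(x[:-1])))

# Example: int x^2 dPhi(x) over [-6, 6] is close to Var(N(0, 1)) = 1.
print(rs_integral_piecewise(lambda x: x ** 2, norm.cdf, -6.0, 6.0, 5000))
```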

This procedure was used by [30] for solving the renewal equation. Independently, a similar idea was used by [26]. Numerical results in these papers show that direct RS-integration is very simple and accurate compared with existing algorithms ([6, 19], for example), which are more complicated. Some applications of this approach can be found in [4, 17, 25]. Xie et al. [31] developed further methodologies of Riemann–Stieltjes integration under general conditions.

In Sect. 2, some direct integration methods and bounds on their errors are given. In Sect. 3, these numerical methods are used to estimate the entropy of continuous random variables. We show that the proposed estimators are consistent, and we establish the scale invariance of their variance and mean squared error. In the last section we present the results of a simulation study comparing the proposed estimators with competing estimators.

2 Some Direct Integration Methods

Let the interval \(I=[a,\,b]\) be partitioned into n intervals of equal length, denoted by \(I_i =[x_{i-1},\,x_i ],\,i=1,\ldots ,n,\) where \(x_0 =a\), \(x_i =x_{i-1} +h\), and \(h=(b-a)/n\) is the length of each interval. In the direct Riemann–Stieltjes integration method ([30]), we approximate the function f(x) by a function \(\tilde{f}(x)\) and then use the definition of the Riemann–Stieltjes integral. In analogy with the rectangle and trapezoidal rules for the Riemann integral, Xie et al. [31] distinguished between the midpoint and mean-value RS-integration methods, described as follows.

2.1 The Midpoint Rectangle RS-Integration

In the midpoint rectangle RS-integration method, the function f(x) is approximated by its value at the midpoint

$$\begin{aligned} x_{i-.5} =\frac{x_{i-1} +x_i }{2}, \end{aligned}$$

of each interval. Then

$$\begin{aligned} \sum \limits _{i=1}^n {\int \nolimits _{x_{i-1} }^{x_i } {f(x)dg(x)} } \approx \sum \limits _{i=1}^n {f\left( x_{i-.5}\right) \left( g\left( x_i\right) -g\left( x_{i-1} \right) \right) }. \end{aligned}$$

Numerical accuracy and convergence are demonstrated in [30].
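A sketch of the midpoint rectangle rule in the same style as above (function name ours):

```python
import numpy as np

def rs_integral_midpoint(f, g, a, b, n):
    """Midpoint rectangle RS-integration: f is evaluated at the midpoint
    of each subinterval, g at the subinterval endpoints."""
    x = np.linspace(a, b, n + 1)
    mid = 0.5 * (x[:-1] + x[1:])
    return np.sum(f(mid) * (g(x[1:]) - g(x[:-1])))
```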

2.2 The Mean-Value Rectangle RS-Integration

Another way to approximate f(x) is to use its mean value on each interval. We approximate the function f(x) on the ith interval by the mean of its endpoint values, \((f(x_{i-1} )+f(x_i ))/2.\) In this case, Eq. (1.4) is approximated by

$$\begin{aligned} Q_n= & {} \sum \limits _{i=1}^n {\int \nolimits _{x_{i-1} }^{x_i } {\left( {\frac{f(x_{i-1} )+f(x_i )}{2}} \right) dg(x)} } \nonumber \\= & {} \sum \limits _{i=1}^n {\left( {\frac{f(x_{i-1} )+f(x_i )}{2}} \right) \left( g\left( x_i\right) -g\left( x_{i-1}\right) \right) }. \end{aligned}$$
(2.1)

Since the mean value of f(x) at the points \(x_{i-1}\) and \(x_i\) is used in approximating f(x), we call this the mean-value rectangle RS-integration method or the generalized trapezoidal rule. For the Riemann integral, this method is usually called the trapezoidal rule, since it is equivalent to approximating f(x) by a piecewise linear function. This is not the case for the Riemann–Stieltjes integral unless g(x) is a linear function.

This simple method gives satisfactory results and has been used by [3, 26].
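A sketch of \(Q_n\) as defined in (2.1) (function name ours):

```python
import numpy as np

def rs_integral_meanvalue(f, g, a, b, n):
    """Mean-value rectangle (generalized trapezoidal) rule Q_n of Eq. (2.1)."""
    x = np.linspace(a, b, n + 1)
    return np.sum(0.5 * (f(x[:-1]) + f(x[1:])) * (g(x[1:]) - g(x[:-1])))
```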

2.3 The Generalized Simpson Rule

Xie et al. [31] generalized another integration method to the Riemann–Stieltjes integral (1.4); this is called the generalized Simpson rule.

Let the interval \(I=[a,\,b]\) be divided into 2n equal parts of length \(h=(b-a)/2n.\) Then Eq. (1.4) is approximated by

$$\begin{aligned} \int \nolimits _a^b {f(x)dg(x)} \approx \frac{4Q_{2n} }{3}-\frac{Q_n }{3}, \end{aligned}$$

where \(Q_n\) is the mean-value rectangle rule (2.1) with n subintervals of length \((b-a)/n\) and \(Q_{2n}\) is the rule (2.1) with 2n subintervals of length \((b-a)/2n.\) Then

$$\begin{aligned} \int _a^b {f(x)dg(x)}\approx & {} \frac{4}{3}\sum _{i=1}^{2n} {\left( {\frac{f(x_{i-1} )+f(x_i )}{2}} \right) \left( g\left( x_i\right) -g\left( x_{i-1}\right) \right) } \\&-\frac{1}{3}\sum _{i=1}^n {\left( {\frac{f(x_{2i-2} )+f(x_{2i} )}{2}}\right) \left( g\left( x_{2i}\right) -g\left( x_{2i-2}\right) \right) } \\= & {} \frac{1}{6}\sum _{i=1}^n \left\{ \left( {-3g\left( x_{2i-2}\right) +4g\left( x_{2i-1} \right) -g\left( x_{2i}\right) }\right) f\left( x_{2i-2}\right) -4\left( g\left( x_{2i-2}\right) \right. \right. \\&\left. \left. -g\left( x_{2i} \right) \right) f\left( x_{2i-1}\right) +\left( {g\left( x_{2i-2}\right) -4g\left( x_{2i-1}\right) +3g\left( x_{2i}\right) } \right) f\left( x_{2i}\right) \right\} , \end{aligned}$$

which is taken as the generalized Simpson rule.
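Reusing `rs_integral_meanvalue` from the sketch above, the generalized Simpson rule is simply the combination \(4Q_{2n}/3-Q_n/3\):

```python
def rs_integral_simpson(f, g, a, b, n):
    """Generalized Simpson rule: combine the mean-value rule on n and 2n subintervals."""
    return (4.0 * rs_integral_meanvalue(f, g, a, b, 2 * n)
            - rs_integral_meanvalue(f, g, a, b, n)) / 3.0
```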

2.4 General Bounds on the Truncation Errors

Numerical results show that all the mentioned methods can be used successfully; see, e.g., [4, 7, 26, 30]. Some convergence results are presented in [3], and [31] derived further bounds on the truncation errors of the methods under very general assumptions. They only assumed that g(x) is an increasing function, which is the usual assumption used for defining the Riemann–Stieltjes integral ([22], p. 122). The following theorems were proved by [31].

Theorem 1

For the midpoint method, the global truncation error \(| \upvarepsilon |\) may be bounded by

$$\begin{aligned} (h/2)\mathop {\sup }\limits _{x\in I} | {{f}^{\prime }(x)} |(g(b)-g(a)), \end{aligned}$$

assuming f(x) is continuously differentiable and g(x) is an increasing function.

Remark

The convergence of this method is thus of order 1. As the number of intervals n increases, the error will decrease at least as fast as 1/n.

Theorem 2

Under the same conditions as in Theorem 1, the global truncation error may be bounded by

$$\begin{aligned} h\mathop {\sup }\limits _{x\in I} | {{f}^{\prime }(x)}|(g(b)-g(a)), \end{aligned}$$

for the mean value method.

Theorem 3

Under the same conditions as in Theorem 1, the global truncation error \(| \upvarepsilon |\) for the generalized Simpson rule may be bounded by

$$\begin{aligned} 2h\mathop {\sup }\limits _{x\in I} | {{f}^{\prime }(x)}|(g(b)-g(a)). \end{aligned}$$

3 Estimation of Entropy

Suppose \(X_1,\ldots ,X_n\) is a random sample of size n from an unknown continuous distribution F with probability density function f(x), and let \(x_1 \le \cdots \le x_n\) denote the corresponding order statistics. We use the methods discussed in the previous section to estimate the entropy H(f) of the unknown continuous probability density function f. Using these methods, the following approximations of the entropy can be derived.

(1) \(H(f)=-\int {\log f(x)dF(x)} \approx -\sum \limits _{i=1}^n {\log f(x_i)\{F(x_i )-F(x_{i-1} )\}} =H_1,\)

(2) \(H(f)\approx -\sum \limits _{i=1}^n {\log f\left( \frac{x_i +x_{i-1}}{2}\right) \{F(x_i )-F(x_{i-1} )\}} =H_2,\)

(3) \(H(f)\approx -\sum \limits _{i=1}^n {\frac{\log f(x_{i-1} )+\log f(x_i )}{2}\{F(x_i )-F(x_{i-1} )\}} =H_3,\)

(4) \(H(f)\approx \frac{4Q_{2n} }{3}-\frac{Q_n }{3}=H_4.\)

Now, we replace \(F(x_i)\) and \(f(x_i)\) with i/n (the empirical distribution function) and \(\hat{{f}}(x_i )\) (the kernel density estimate), respectively. Therefore, we obtain

$$\begin{aligned} H_1= & {} -\frac{1}{n}\sum \limits _{i=1}^n {\log \hat{{f}}\left( x_i\right) },\\ H_2= & {} -\frac{1}{n}\sum \limits _{i=1}^n {\log \hat{{f}}\left( \frac{x_i +x_{i-1}}{2}\right) },\\ H_3= & {} -\frac{1}{n}\sum \limits _{i=1}^n {\frac{\log \hat{{f}}(x_{i-1} )+\log \hat{{f}}(x_i )}{2}},\\ H_4= & {} -\frac{1}{6n}\sum \limits _{i=1}^n {\left\{ {\log \hat{{f}}\left( x_{2i-2}\right) +4\log \hat{{f}}\left( x_{2i-1}\right) +\log \hat{{f}}\left( x_{2i} \right) } \right\} }. \end{aligned}$$

The kernel density function estimator is well-known and is defined by

$$\begin{aligned} \hat{{f}}\left( X_i\right) =\frac{1}{nh}\sum \limits _{j=1}^n {k\left( \frac{X_i -X_j}{h}\right) }, \end{aligned}$$

where h is a bandwidth and k is a kernel function which satisfies

$$\begin{aligned} \int \nolimits _{-\infty }^\infty {k(x)dx} =1. \end{aligned}$$

Usually, k will be a symmetric probability density function (see [24]).

Here, the kernel function is chosen to be the standard normal density and the bandwidth h is chosen according to the normal optimal smoothing formula, \(h=1.06sn^{-\frac{1}{5}},\) where s is the sample standard deviation.
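The following is a minimal sketch of \(H_1\)–\(H_3\) with this kernel and bandwidth; the function names are ours, and the sketch makes the simplifying assumption that the sums in \(H_2\) and \(H_3\) run over the \(n-1\) pairs of consecutive order statistics (still divided by n, as in the formulas above). \(H_4\) is omitted here because its indexing pairs the order statistics.

```python
import numpy as np

def gaussian_kde(points, data, h):
    """Gaussian kernel density estimate, evaluated at `points`."""
    points = np.asarray(points, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (points[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

def entropy_estimates(sample):
    """H_1, H_2 and H_3 with the normal-reference bandwidth h = 1.06 s n^{-1/5}."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    h = 1.06 * x.std(ddof=1) * n ** (-1.0 / 5.0)
    log_f = np.log(gaussian_kde(x, x, h))
    log_f_mid = np.log(gaussian_kde(0.5 * (x[:-1] + x[1:]), x, h))
    H1 = -np.mean(log_f)
    H2 = -np.sum(log_f_mid) / n                       # midpoints of consecutive order statistics
    H3 = -np.sum(0.5 * (log_f[:-1] + log_f[1:])) / n  # averages of endpoint values
    return H1, H2, H3
```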

Using Theorems 1–3, the global truncation error \(| \upvarepsilon |\) can be obtained as follows.

We have

$$\begin{aligned} f(x)=-\log (\hat{{f}}(x))\quad \mathrm{and}\quad g(x)=F_n (x), \end{aligned}$$

therefore,

$$\begin{aligned} M=\mathop {\sup }\limits _{x\in I} | {{f}^{\prime }(x)}|=\mathop {\sup }\limits _{x\in [x_1,\,x_n ]} \left| {-\frac{\hat{{{f}^{\prime }}}(x)}{\hat{{f}}(x)}} \right| =\mathop {\sup }\limits _{x\in [x_1,\,x_n ]} \left| {-\frac{1}{h}\frac{\sum \nolimits _{j=1}^n {{k}'\left( \frac{x-x_j }{h}\right) } }{\sum \nolimits _{j=1}^n {k\left( \frac{x-x_j }{h}\right) } }} \right| . \end{aligned}$$

So, for the midpoint, mean value and generalized Simpson methods, the global truncation error \(| \upvarepsilon |\) may be bounded by

$$\begin{aligned} \frac{x_n -x_1 }{2n}M\left( 1-\frac{1}{n}\right) , \quad \frac{x_n -x_1 }{n}M\left( 1-\frac{1}{n}\right) , \quad \frac{x_n -x_1 }{2n}M\left( 1-\frac{1}{n}\right) , \end{aligned}$$

respectively.
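A grid-based sketch of M and of the three bounds stated above (here \(g=F_n\), so \(g(b)-g(a)=1-1/n\); the derivative of the Gaussian kernel satisfies \(k'(u)=-u\,k(u)\), and the supremum is approximated by a maximum over a fine grid, which is our own simplification):

```python
import numpy as np

def truncation_error_bounds(sample, n_grid=2000):
    """Approximate M = sup |d/dx log f_hat(x)| on [x_(1), x_(n)] and the bounds above."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    h = 1.06 * x.std(ddof=1) * n ** (-1.0 / 5.0)
    grid = np.linspace(x[0], x[-1], n_grid)
    u = (grid[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u ** 2)        # Gaussian kernel (normalizing constant cancels in the ratio)
    kprime = -u * k                  # k'(u) = -u k(u)
    M = np.max(np.abs(kprime.sum(axis=1) / (h * k.sum(axis=1))))
    width = x[-1] - x[0]
    common = M * (1.0 - 1.0 / n)
    return (width / (2 * n) * common,   # midpoint
            width / n * common,         # mean value
            width / (2 * n) * common)   # generalized Simpson, as stated above
```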

The scale of the random variable X has no effect on the accuracy of \(H_1,\,H_2,\,H_3\) or \(H_4\) in estimating H(f). The following theorem shows this fact. Similar results have been obtained for \(\textit{HV}_{mn}\) and \(\textit{HE}_{mn}\) by [18] and [9], respectively.

Theorem 3.1

Let \(X_1,\ldots ,X_n\) be a sequence of i.i.d. random variables with entropy \(H^{X}(f)\) and let \(Y_i =cX_i,\,i=1,\ldots ,n,\) where \(c>0.\) Let \(H_1^X\) and \(H_1^Y\) be the entropy estimators of \(H^{X}(f)\) and \(H^{Y}(g),\) respectively (here g is the pdf of \(Y=cX\)). Then the following properties hold.

(i) \(E( {H_1^Y })=E( {H_1^X } )+\log c,\)

(ii) \(Var( {H_1^Y })=Var( {H_1^X }),\)

(iii) \(MSE( {H_1^Y })=MSE( {H_1^X }).\)

Proof

We have

$$\begin{aligned} \hat{{f}}\left( cX_i \right) =\frac{1}{nh_y }\sum \limits _{j=1}^n {k\left( {\frac{cX_i-cX_j }{h_y }} \right) } =\frac{1}{nch_x }\sum \limits _{j=1}^n {k\left( {\frac{cX_i -cX_j }{ch_x }} \right) } =\frac{\hat{{f}}(X_i )}{c}, \end{aligned}$$

where \(h_y =1.06s_y n^{-1/5}=1.06cs_x n^{-\frac{1}{5}}=ch_x.\)

Therefore, \(H_1^Y =H_1^X +\log c,\) and the theorem is established.

The above theorem also holds for the other proposed estimators.
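As a quick numerical check of the identity \(H_1^Y =H_1^X +\log c\) used in the proof, one can reuse the `entropy_estimates` sketch from Sect. 3 (the sample size and scale factor below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
c = 3.0
H1_x = entropy_estimates(x)[0]
H1_y = entropy_estimates(c * x)[0]
print(H1_y - H1_x, np.log(c))   # the two printed values agree up to rounding
```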

The following theorem establishes the consistency of the estimators.

Theorem 3.2

Let C be the class of continuous densities with finite entropies and let \(X_1,\ldots ,X_n\) be a random sample from \(f\in C.\) If \(n\rightarrow \infty ,\) then

$$\begin{aligned} H_i \xrightarrow {\Pr \!.}H(f) \quad i=1,\ldots ,4. \end{aligned}$$

Proof

This follows immediately from the consistency and continuity of \(\hat{{f}}(x).\)

Table 1 Root mean squared error (and standard deviation) of the estimators of the entropy H(f) for the standard normal distribution
Table 2 Root mean squared error (and standard deviation) of the estimators of the entropy H(f) for the exponential distribution with mean one
Table 3 Root mean squared error (and standard deviation) of the estimators of the entropy H(f) for the uniform distribution on (0, 1)

4 Simulation Study

A simulation study is carried out to analyze the behavior of the proposed entropy estimators \(H_i,\,i=1,\,2,\,3,\,4.\) The proposed estimators are compared with some prominent estimators, namely Vasicek’s estimator [28], van Es’s estimator [27], Ebrahimi et al.’s estimator [9] and Correa’s estimator [5]. For the comparisons, we consider the normal, exponential and uniform distributions, the same three distributions considered in [5]. For each sample size, 10,000 samples were generated and the RMSEs of the estimators were computed.

For the competing estimators, we chose the value of m using the following heuristic formula (see [12]):

$$\begin{aligned} m=[ {\sqrt{n}+0.5}]. \end{aligned}$$
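For illustration, a minimal sketch of one cell of such a table: the RMSE of \(H_1\) for the standard normal distribution, whose true entropy is \(\frac{1}{2}\log (2\pi e)\). This reuses the `entropy_estimates` sketch from Sect. 3; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
true_H = 0.5 * np.log(2.0 * np.pi * np.e)   # entropy of N(0, 1)
n, reps = 20, 10_000
est = np.array([entropy_estimates(rng.normal(size=n))[0] for _ in range(reps)])
rmse = np.sqrt(np.mean((est - true_H) ** 2))
print(f"n={n}: RMSE of H_1 = {rmse:.4f} (sd = {est.std(ddof=1):.4f})")
```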

Tables 1, 2 and 3 present the RMSE values (and standard deviations) of the eight estimators for different sample sizes.

We can see from Tables 1, 2 and 3 that the proposed estimators compare favorably with their competitors. Also, note that the proposed estimators do not depend on m. In general, the proposed estimators perform well for small sample sizes.

5 Conclusion

In this paper, we first described some direct integration methods and gave bounds on their errors. We then introduced estimators of the entropy of a continuous random variable based on these numerical methods and obtained bounds on the errors of the estimators. Finally, we compared the proposed estimators with some prominent existing estimators and observed that, for small sample sizes, the new estimators behave better than their competitors. An advantage of the proposed estimators is that, unlike the competing estimators, they do not depend on the window size m.