Nonlinear Regression Modelling: A Primer with Applications and Caveats

The use of nonlinear statistical methods and models is ubiquitous in scientific research. However, these methods may not be fully understood, and as demonstrated here, commonly reported parameter p-values and confidence intervals may be inaccurate. The gentle introduction to nonlinear regression modelling and the comprehensive illustrations given here provide applied researchers with the overview and tools needed to appreciate the nuances and breadth of these important methods. Since these methods build upon topics covered in first and second courses in applied statistics and predictive modelling, the target audience includes practitioners and students alike. To guide practitioners, we summarize, illustrate, develop, and extend nonlinear modelling methods, underscore caveats of Wald statistics using basic illustrations, and give key reasons for preferring likelihood methods. Parameter profiling in multiparameter models and exact or near-exact versus approximate likelihood methods are discussed, and curvature measures are connected with the failure of the Wald approximations regularly used in statistical software. The discussion in the main paper has been kept at an introductory level and can be covered on a first reading; additional details given in the Appendices can be worked through upon further study. The associated online Supplementary Information also provides the data and R computer code, which can be easily adapted to help researchers fit nonlinear models to their data.

Supplementary Information: The online version contains supplementary material available at 10.1007/s11538-024-01274-4.


Introduction
Facilitated by readily-available statistical software, practitioners in fields as diverse as agronomy, biochemistry, biomedicine, drug development, engineering, environmental science, neuroscience, pharmacology and toxicology fit nonlinear models to their data to help answer their research questions. In addition to providing parsimonious data fits, nonlinear models are preferred to empirical models since the associated nonlinear model parameters typically have meaningful practical interpretations. For example, two drugs or experimental conditions are often compared in terms of half maximal effective concentrations (EC$_{50}$) or median lethal doses (LD$_{50}$), which can be modelled as nonlinear dose-response model parameters. Mechanistic nonlinear models are often chosen based on underlying subject-matter knowledge such as Michaelis-Menten enzyme kinetic theory or dose-response modelling methods (Bates and Watts 2007; Finney 1978; Govindarajulu 2001; Hubert 1992; Miguez et al. 2020; Ratkowsky 1983; Seber and Wild 1989). Faced with a plethora of experimental design and modelling methods, even statistically-savvy subject-matter practitioners may be unaware of key nonlinear methods and of important requirements and cautions associated with nonlinear regression hypothesis testing and confidence interval estimation techniques.
Using straightforward nonlinear regression models and illustrations, this article overviews and illustrates useful nonlinear regression methods, underscores problems associated with commonly used Wald statistic test p-values and confidence intervals (Wald 1943), and demonstrates the preference for exact likelihood-based confidence intervals over Wald intervals. Specifically, as highlighted below, p-values provided by statistical software packages are often based on the Wald approximation and can be grossly inaccurate for small- to moderately-sized studies. For example, Wald-based confidence intervals for nonlinear model parameters with nominal labels of 95% may have actual coverage levels of 75% or even lower. Conversely, readily-available exact or near-exact likelihood-based intervals generally show good agreement between nominal and actual coverage levels.
This introductory discussion of nonlinear modelling, which builds upon readers' background in linear methods (Draper and Smith 1998; Kleinbaum et al. 2014; Mendenhall and Sincich 2020), also provides the basis for further consideration of additional topics, including dose-response modelling, high-throughput screening methods, compartmental models based on differential equations and other multivariate nonlinear models, computational algorithms and starting values, and further explorations of curvature measures. Given the evolution away from hypothesis-testing approaches toward estimation methods (Halsey 2019; Krzywinski and Altman 2013; Meeker and Escobar 1995), the focus here is largely on accurate confidence interval methods rather than p-values.
The article is structured as follows. To provide important context, Sect. 2 introduces simple motivating nonlinear model examples which highlight nonlinear modelling in practice and underscore key differences with linear models. Section 3 overviews general nonlinear regression methods, makes connections to and contrasts with linear models, and discusses parameter profiling in multiparameter models, nonlinear model selection, model-fitting algorithms, and starting-value selection. Section 4 provides additional exemplary nonlinear illustrations and extensions. In Sect. 5, we give important concluding remarks and discussion. The Appendices provide additional details and illustrations regarding the effects of curvature on nonlinear modelling, the Fieller-Creasy ratio-of-means example, comparisons of the F-based and asymptotic likelihood tests and intervals, and comments and caveats regarding overfitting. Further, the R computer code (R Core Team 2020) used in the data analyses is given in the Supplementary Information. These R programs are easily adapted to help practitioners fit meaningful nonlinear models to their data.

Motivating Illustrations
The two key motivating examples introduced and discussed here illustrate the basic use of nonlinear modelling and demonstrate the breadth of these methods' applications.
Example 1 For a single substrate, Michaelis-Menten enzyme kinetics theory (Michaelis and Menten 1913; Bates and Watts 2007) can be used to model the connection between the velocity of an enzymatic reaction (in counts per min$^2$) and the substrate concentration (in ppm). To illustrate, consider the simulated data plotted in the left panel of Fig. 1. The chosen design involves the substrate concentrations $x = 0.02, 0.04, 0.06, 0.08$ and $0.10$, each replicated three times. There is an obvious curve in the data, and a researcher may be inclined to take a linear approach by fitting a polynomial regression that includes a quadratic term, i.e., with the squared substrate concentration as an additional predictor. Such a model may fit the data adequately by accounting for the observed curvature, but polynomial regression is notoriously difficult to interpret. Instead, a nonlinear model motivated by theory, such as Michaelis-Menten enzyme kinetic theory, yields a much more useful and interpretable model.
The classical Michaelis-Menten model function is given by

$$\eta(x, \theta) = \frac{\theta_1 x}{\theta_2 + x}. \qquad (1)$$

Here, the model parameter vector is $\theta = (\theta_1, \theta_2)^T$, where $\theta_1$ is the upper asymptote (also called the ultimate velocity parameter) and $\theta_2$ is the EC$_{50}$ or IC$_{50}$ (sometimes also called the half-velocity) parameter. This follows since when $x = \theta_2$, we obtain $\eta(x, \theta) = \frac{1}{2}\theta_1$. By connecting the substrate concentrations ($x_i$) with the velocity reaction rate ($y_{ij}$) data using the additive model expression $y_{ij} = \eta(x_i, \theta) + \varepsilon_{ij}$ and least-squares estimation (see Sect. 3 and the R code in the Supplementary Information), we obtain the parameter estimates $\hat\theta_1 = 209.868$ and $\hat\theta_2 = 0.0647$. This fitted Michaelis-Menten model function is the nonlinear curve plotted in the left panel of Fig. 1. The model predicts that the enzyme velocity levels off at an ultimate velocity (upper asymptote) of almost 210 counts per min$^2$. It also predicts that for a substrate concentration of 0.0647 ppm, the predicted velocity is approximately half of 210, or about 105 counts per min$^2$. Note that the half-velocity point $(0.0647, 104.9)$ is plotted as the solid triangle and is highlighted by the dashed vertical and horizontal lines.
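The Supplementary Information provides R code (using nls) for this fit. As a rough cross-check, the same least-squares fit can be sketched in Python with scipy's curve_fit; the data below are freshly simulated under the same design, so the noise level, seed, and resulting estimates are assumptions for illustration, not the paper's values:

```python
import numpy as np
from scipy.optimize import curve_fit

# Michaelis-Menten model function: eta(x; theta1, theta2) = theta1 * x / (theta2 + x)
def michaelis_menten(x, theta1, theta2):
    return theta1 * x / (theta2 + x)

# Illustrative design: 5 substrate concentrations, each replicated 3 times
# (simulated here; not the paper's dataset)
rng = np.random.default_rng(1)
x = np.repeat([0.02, 0.04, 0.06, 0.08, 0.10], 3)
true_theta = (210.0, 0.065)
y = michaelis_menten(x, *true_theta) + rng.normal(0.0, 5.0, size=x.size)

# Least-squares (= maximum-likelihood) fit; p0 supplies starting values
theta_hat, cov = curve_fit(michaelis_menten, x, y, p0=(200.0, 0.05))
print(theta_hat)              # estimates of (theta1, theta2)
print(np.sqrt(np.diag(cov)))  # first-order (Wald) standard errors
```

As in the R workflow, the fitted upper asymptote and EC$_{50}$ are read directly off the parameter vector.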
The above simple illustration demonstrates the common application of nonlinear modelling in practice. The researcher now has easy-to-interpret parameter estimates that fit within a theoretical model motivated by expert background knowledge, rather than an arbitrary quadratic term from a polynomial regression that is difficult to interpret. The next example shows that nonlinear modelling is also encountered when fitting a linear model in which interest centers on a nonlinear function of the linear model parameters.

Example 2
The data plotted in the right panel of Fig. 1 are adapted from a regression study (Bowers et al. 1986; Samuels et al. 2016) relating laetisaric acid concentration (the independent variable) to fungal growth (the dependent variable) in the plant pathogen P. ultimum. These $n = 6$ data points are plotted along with the fitted ordinary least-squares regression line. Indeed, a simple linear regression could be used to describe the relationship between laetisaric acid concentration and fungal growth in P. ultimum. However, the main goal of this particular study, as stated by the authors, was to estimate the acid concentration which "inhibits growth of P. ultimum by 50 percent" (Bowers et al. 1986, p. 106). Nonlinear regression modelling can help meet this goal much more precisely. Rather than fitting the linear regression model $y_i = \alpha + \beta x_i + \varepsilon_i$, the researchers could fit a nonlinear model that directly estimates the half maximal inhibitory concentration (sometimes referred to as the IC$_{50}$). In the following nonlinear model, this IC$_{50}$ parameter is represented by $\theta$ and the expected zero-concentration value (i.e., the intercept) is represented by $\alpha$:

$$y_i = \alpha\left(1 - \frac{x_i}{2\theta}\right) + \varepsilon_i. \qquad (2)$$

We can fit this model using the "nls" function in R (for details, see the code in the Supplementary Information). The model fit shows that the estimate of the intercept is $\hat\alpha = 32.64$. In other words, the expected fungal growth with no applied laetisaric acid is estimated to be 32.64. The model fit also gives $\hat\theta = 22.33$; that is, the concentration of laetisaric acid estimated to inhibit P. ultimum growth by half is 22.33. Indeed, if the reader substitutes $x = 22.33$ into the fitted model, the expected value of $y$, namely $\hat y_{50} = \eta(\hat\theta) = 16.32$, is indeed half of $\hat\alpha$ (the estimated intercept).
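The Supplementary Information fits Eq. (2) with R's nls; the same idea can be sketched in Python. The six (x, y) values below are illustrative stand-ins, not the Bowers et al. (1986) data, so the estimates will differ from those quoted above:

```python
import numpy as np
from scipy.optimize import curve_fit

# IC50 reparameterization of the straight-line model: eta(x) = alpha * (1 - x/(2*theta))
def eta(x, alpha, theta):
    return alpha * (1.0 - x / (2.0 * theta))

# Illustrative data (assumed for this sketch, not the published dataset)
x = np.array([0.0, 5.0, 10.0, 20.0, 30.0, 40.0])
y = np.array([32.5, 28.1, 25.0, 17.2, 10.5, 3.0])

(alpha_hat, theta_hat), _ = curve_fit(eta, x, y, p0=(30.0, 20.0))
print(alpha_hat, theta_hat)   # intercept and IC50 estimates

# Sanity check: at x = theta_hat the fitted response is exactly alpha_hat / 2
assert abs(eta(theta_hat, alpha_hat, theta_hat) - alpha_hat / 2.0) < 1e-10
```

The final assertion mirrors the paper's point that the fitted response at the estimated IC$_{50}$ is exactly half the estimated intercept, by construction of the reparameterization.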
The observant reader may note that the fit for this model (depicted as a black line in the right panel of Fig. 1) is indeed a straight line. That is, fitting the data using the "lm" function in R would yield essentially the same line as the one seen in Fig. 1. So it is natural to ask why this model is considered nonlinear when it produces a straight-line fit. As discussed further in Sect. 3, nonlinearity refers to whether the parameters enter the model linearly or nonlinearly. In Eq. (2), note that the slope (the parameter expression which multiplies $x$) is $-\frac{\alpha}{2\theta}$, which is nonlinear in the model parameters. Therefore, the model introduced in Eq. (2) is nonlinear.
The reader may next ask what benefit is gained from fitting a nonlinear model in this scenario, since the resulting fit is the same line a linear model would produce. When fitting a linear model here, the estimated parameters are the intercept (also estimated in the Eq. (2) model) and the slope of the line. The slope estimate tells the researcher how much growth changes, on average, for each one-unit increase in laetisaric acid concentration, which may itself be of interest. But, as discussed above, the stated goal of the study was to find the acid concentration at which a 50% decline in growth is expected. This so-called IC$_{50}$ value (i.e., $\theta$) could be determined from the linear model, but the researcher would be unable to perform hypothesis tests or calculate confidence intervals for it without making often-inaccurate simplifying approximations. With the nonlinear model, the IC$_{50}$ parameter is estimated directly, and hypotheses and confidence intervals for it follow directly as well.
In Fig. 1's right panel, the point $(\hat\theta, \hat y_{50})$ (the filled triangle) as well as the corresponding vertical and horizontal (dashed) line segments are also plotted. As mentioned above, because the model provides a direct estimate of the IC$_{50}$, we can now calculate the confidence interval for this estimate that is so important for the study.
In this paper, we discuss two different approaches to constructing confidence intervals: the traditional Wald approach and the likelihood-based approach. As noted in the next section, for linear models the two approaches give the same confidence intervals. But, as we detail in Sect. 3 and elsewhere in the paper, likelihood-based confidence intervals are typically preferable when using nonlinear models.

Key Nonlinear Regression Methods and Results
In this section, we briefly introduce and develop key nonlinear regression results. Additional details, including theoretical results, are given in general nonlinear texts (Bates and Watts 2007; Ratkowsky 1983; Seber and Wild 1989) and in subject-matter works from an array of fields including agronomy (Miguez et al. 2020), animal science (Gonçalves et al. 2016), immunology (Bursa et al. 2020), and pupillometry (Bartošová et al. 2018; Rollins et al. 2014; You et al. 2021). Before addressing nonlinear models, we first illustrate various linear models.
For $i = 1, 2, \ldots, n$, the usual (homoscedastic normal) simple linear regression model is written

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$

where "iid" denotes 'independent and identically distributed'. For the model function $\eta(x_i, \theta) = \alpha + \beta x_i$ and model function parameter vector $\theta = (\alpha, \beta)^T$, the general structure of this model is

$$y_i = \eta(x_i, \theta) + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \qquad (3)$$

The multiple linear regression model function is $\eta(x_i, \theta) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$. Perhaps surprisingly, even though the quadratic regression model, $\eta(x_i, \theta) = \alpha + \beta_1 x_i + \beta_2 x_i^2$ (and other polynomial models), is sometimes used to account for curves observed in data, it is actually a linear model. In Sect. 2, we also saw a case in which a nonlinear model was used to fit a straight line to data with no observable curvature (see Example 2). So what exactly do we mean when we call a model "nonlinear", since the term does not necessarily refer to the shape seen in the data? We next define and illustrate nonlinearity in regression modelling.

What Makes a Nonlinear Model Function Nonlinear?
For the homoscedastic normal model given in Eq. (3), the model function $\eta(x_i, \theta)$ with parameters $\theta^T = (\theta_1, \theta_2, \ldots, \theta_p)$ is characterized as nonlinear if the (partial) derivative of $\eta(x_i, \theta)$ with respect to at least one of the parameters includes at least one of the model parameters. For example, for the Michaelis-Menten model function from Example 1 in Sect. 2, $\partial\eta/\partial\theta_2 = -\theta_1 x/(\theta_2 + x)^2$, which involves both model parameters, so the model function is nonlinear. Another example arises when researchers fit the quadratic regression model and wish to estimate the input value $x$ where the model function achieves its maximum or minimum. Using basic calculus, this value, denoted $\delta$, is such that $\beta_1 + 2\beta_2\delta = 0$, so that $\beta_1 = -2\beta_2\delta$. The model function can then be rewritten $\eta(x, \theta) = \alpha - 2\beta_2\delta x + \beta_2 x^2$. This new way of writing the original quadratic model function, called a model reparameterization, yields a nonlinear model. It has the clear advantage of making the parameter of interest an inherent model parameter, so that accurate point and interval estimates are more easily obtained.
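The "derivative contains a parameter" criterion is easy to check numerically. The sketch below (plain Python with central finite differences; the evaluation points are arbitrary choices for illustration) confirms that the Michaelis-Menten derivative with respect to $\theta_2$ changes when $\theta_1$ changes, so the model is nonlinear in its parameters:

```python
import numpy as np

def eta(x, theta1, theta2):
    # Michaelis-Menten model function
    return theta1 * x / (theta2 + x)

def d_eta_d_theta2(x, theta1, theta2, h=1e-6):
    # central finite-difference approximation to the partial derivative w.r.t. theta2
    return (eta(x, theta1, theta2 + h) - eta(x, theta1, theta2 - h)) / (2 * h)

# The derivative w.r.t. theta2 depends on theta1 (analytically it is
# -theta1 * x / (theta2 + x)^2), so the model is nonlinear in its parameters.
x0 = 0.05
d_at_theta1_100 = d_eta_d_theta2(x0, 100.0, 0.06)
d_at_theta1_200 = d_eta_d_theta2(x0, 200.0, 0.06)
print(d_at_theta1_100, d_at_theta1_200)  # differ, confirming nonlinearity

# Contrast: for the linear model eta = alpha + beta * x, d(eta)/d(beta) = x,
# which involves no parameters, so that model is linear.
```

Doubling $\theta_1$ exactly doubles the derivative, consistent with the analytic form $-\theta_1 x/(\theta_2+x)^2$.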

Parameter Estimation: Point Estimates and Standard Errors
Parameter estimation for the homoscedastic normal nonlinear models considered here can be achieved using maximum likelihood estimation or, equivalently, least-squares estimation. For $S(\theta)$ given below in Eq. (5), the corresponding log-likelihood is written

$$\log L(\theta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{S(\theta)}{2\sigma^2}. \qquad (4)$$

Since the model function parameters appear only in the $S(\theta)$ term, maximum likelihood estimates (denoted MLEs) can be found by minimizing the sum of squares function,

$$S(\theta) = \sum_{i=1}^{n}\left(y_i - \eta(x_i, \theta)\right)^2. \qquad (5)$$

Least-squares estimates (denoted LSEs) are those parameter values that minimize $S(\theta)$ over the $p$ model function parameters. In other words, the goal of least-squares estimation is to find parameter estimates that minimize the differences between the observed values of $y$ (denoted $y_i$) and the model-predicted values of $y$ (typically denoted $\hat y_i$). We denote the MLE/LSE parameter vector by $\hat\theta$, and, when transposed, we can write $\hat\theta^T = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_p)$. So for nonlinear model parameters, MLEs and LSEs are indeed the same.
Under standard regularity conditions, least-squares parameter estimates are obtained by differentiating $S(\theta)$ with respect to the $p$ model parameters, setting these derivatives to zero, and solving the resulting $p$ so-called normal equations. These normal equations are

$$\frac{\partial S(\theta)}{\partial\theta_1} = 0, \quad \frac{\partial S(\theta)}{\partial\theta_2} = 0, \quad \ldots, \quad \frac{\partial S(\theta)}{\partial\theta_p} = 0,$$

and they can also be written

$$\sum_{i=1}^{n} e_i \left.\frac{\partial\eta(x_i, \theta)}{\partial\theta_j}\right|_{\theta = \hat\theta} = 0, \quad j = 1, 2, \ldots, p, \qquad (6)$$

where for $i = 1, 2, \ldots, n$, the model residuals are $e_i = y_i - \eta(x_i, \hat\theta)$. Note that in general for nonlinear model functions, Eq. (6) is a nonlinear system of $p$ equations in $p$ unknowns (i.e., the model function parameters). Since the system of normal equations can be written more concisely in matrix form, we introduce the $n \times p$ so-called Jacobian matrix,

$$X = X(\theta) = \left[\frac{\partial\eta(x_i, \theta)}{\partial\theta_j}\right]_{i=1,\ldots,n;\; j=1,\ldots,p}. \qquad (7)$$

Using this notation, the system of $p$ normal equations from Eq. (6) can be rewritten

$$X^T e = 0, \qquad (8)$$

where $e^T = (e_1, e_2, \ldots, e_n)$. Readers familiar with linear models (where the model parameter vector is often denoted $\beta$) will recognize that there, $e = y - X\hat\beta$ and the normal equations are $X^T(y - X\hat\beta) = 0$, or $X^TX\hat\beta = X^Ty$. We emphasize that this latter expression holds only for linear models, whereas Eq. (8) holds for the more general nonlinear model situation.
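Eq. (8) is easy to verify numerically in the linear special case, where the Jacobian is just the design matrix: at the least-squares solution, the residual vector is orthogonal to the columns of $X$. A small numpy sketch (the simulated data are an assumption for illustration):

```python
import numpy as np

# Linear-model special case of the normal equations X^T e = 0:
# for y = X beta + error, the LSE satisfies X^T (y - X beta_hat) = 0.
rng = np.random.default_rng(0)
n = 20
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])   # Jacobian = design matrix in the linear case
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat                   # residual vector

print(X.T @ e)  # numerically zero: the normal equations hold at the LSE
```

For a nonlinear model, the only change is that $X$ holds the partial derivatives $\partial\eta/\partial\theta_j$ evaluated at $\hat\theta$, and the system must be solved iteratively.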
For the nonlinear models considered here, parameter estimates are obtained by solving the (nonlinear) normal equation system in Eq. (8), which in general involves numerical methods and computer algorithms such as the root-finding functions available in freeware packages such as R (see the Supplementary Information). The following two illustrations demonstrate the use of these nonlinear normal equations in nonlinear parameter estimation.
Example 3 Similar to the Michaelis-Menten model function in Example 1, consider the one-parameter (i.e., $p = 1$) model function

$$\eta(x, \theta) = \frac{x}{\theta + x}, \qquad (9)$$

which is nonlinear since its derivative contains the parameter $\theta$. Notice that this model function has $y = 1$ as its upper asymptote and that the model parameter $\theta$ is the model EC$_{50}$, since for $x = \theta$, $\eta(x, \theta) = \frac{1}{2}$. For this single-parameter nonlinear illustration, we use the simulated $n = 4$ data points $(x_i, y_i) = (0, 0.037), (2, 0.209), (4, 0.519)$ and $(6, 0.430)$. Since $\partial\eta/\partial\theta = -x/(\theta + x)^2$, we obtain the estimate of $\theta$ by substituting the $(x_i, y_i)$ data values into the single normal equation,

$$\sum_{i=1}^{4}\left(y_i - \frac{x_i}{\theta + x_i}\right)\frac{x_i}{(\theta + x_i)^2} = 0. \qquad (10)$$

This expression is a nonlinear equation in the model function parameter $\theta$. The "uniroot" function in R (see the Supplementary Information) is used to obtain the LSE, which here is $\hat\theta = 5.8698$.
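The uniroot step can be mirrored in Python with scipy's brentq, a bracketing root finder; the bracket endpoints below are assumptions chosen wide enough to contain the root:

```python
import numpy as np
from scipy.optimize import brentq

# Data and model from Example 3: eta(x, theta) = x / (theta + x)
x = np.array([0.0, 2.0, 4.0, 6.0])
y = np.array([0.037, 0.209, 0.519, 0.430])

def normal_equation(theta):
    # sum of e_i * x_i / (theta + x_i)^2, which must equal zero at the LSE
    e = y - x / (theta + x)
    return np.sum(e * x / (theta + x) ** 2)

# Counterpart of R's uniroot: bracket the root and solve
theta_hat = brentq(normal_equation, 1.0, 20.0)
print(theta_hat)  # approximately 5.8698, matching the text
```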
Example 2 (continued) For the two-parameter vector $(\alpha, \theta)^T$, the model function is $\eta(x_i, \alpha, \theta) = \alpha\left(1 - \frac{x_i}{2\theta}\right)$, and so the sum of squares function is

$$S(\alpha, \theta) = \sum_{i=1}^{n}\left(y_i - \alpha\left(1 - \frac{x_i}{2\theta}\right)\right)^2.$$

Differentiating with respect to the model parameters gives the expressions

$$\frac{\partial S}{\partial\alpha} = -2\sum_{i=1}^{n} e_i\left(1 - \frac{x_i}{2\theta}\right) \qquad (11)$$

and

$$\frac{\partial S}{\partial\theta} = -2\sum_{i=1}^{n} e_i\,\frac{\alpha x_i}{2\theta^2}, \qquad (12)$$

where $e_i = y_i - \alpha\left(1 - \frac{x_i}{2\theta}\right)$. When these two expressions are set equal to zero, we obtain the system of two nonlinear normal equations in the two unknowns, $\alpha$ and $\theta$. For the data used in this example, the numerical algorithm yields the LSEs $\hat\alpha = 32.639$ and $\hat\theta = 22.327$, as reported previously.
The manner of finding standard errors for model function parameters is similar to that used for linear models, but for nonlinear models it is based on the following approximation. The first-order (and also asymptotic) variance-covariance matrix associated with the LSE parameter vector $\hat\theta$ is $s^2(\hat X^T\hat X)^{-1}$, where the mean square is $s^2 = S(\hat\theta)/(n - p)$, and $\hat X = X(\hat\theta)$ is the Jacobian matrix of Eq. (7) evaluated at $\hat\theta$. The diagonal elements of $s^2(\hat X^T\hat X)^{-1}$ are the squares of the standard errors (SEs) of the parameter estimates $\hat\theta$. Note that for linear models these standard errors are exact, but for nonlinear models they are based on a first-order (or large-sample) approximation.
To illustrate these results for Example 3, here $s^2 = S(\hat\theta)/(n - p) = 0.02217/3 = 0.00739$, and $\hat X^T\hat X$ is the scalar $\sum_{i=1}^{4}\left(x_i/(\hat\theta + x_i)^2\right)^2$, so that the standard error of $\hat\theta$ is $SE = \sqrt{s^2/(\hat X^T\hat X)} = 1.2755$. Similar results are obtained for Examples 1 and 2, but in these cases, since $\hat X^T\hat X$ is of dimension $2 \times 2$, matrix inversion is used to find $(\hat X^T\hat X)^{-1}$ and thereby the corresponding standard errors associated with the LSE parameter estimates.
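The standard-error calculation for Example 3 can be sketched as follows (self-contained Python, so it re-solves the normal equation first; minor last-digit differences from the text can arise from rounding):

```python
import numpy as np
from scipy.optimize import brentq

# Example 3 data; eta(x, theta) = x / (theta + x)
x = np.array([0.0, 2.0, 4.0, 6.0])
y = np.array([0.037, 0.209, 0.519, 0.430])

def normal_equation(theta):
    e = y - x / (theta + x)
    return np.sum(e * x / (theta + x) ** 2)

theta_hat = brentq(normal_equation, 1.0, 20.0)   # LSE, about 5.87

# First-order (Wald) standard error: s^2 * (X^T X)^{-1}, with the Jacobian
# column d(eta)/d(theta) = -x / (theta + x)^2 evaluated at theta_hat
n, p = len(y), 1
S_hat = np.sum((y - x / (theta_hat + x)) ** 2)   # residual sum of squares
s2 = S_hat / (n - p)                             # mean square, about 0.0074
X = -x / (theta_hat + x) ** 2                    # n-by-1 Jacobian (as a vector)
se = np.sqrt(s2 / np.sum(X ** 2))
print(se)  # about 1.27, close to the text's 1.2755
```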

Parameter Estimation: Interval Estimates
In statistical methodology, confidence interval strategies can be obtained by "inverting" a test statistic. For example, in many single-parameter situations (such as a paired t-test or regression through the origin), the null hypothesis $H_0: \theta = \theta_0$ can be tested using the test statistic

$$T = \frac{\hat\theta - \theta_0}{SE}. \qquad (13)$$

Here, $\hat\theta$ and $SE$ are the corresponding parameter estimate and standard error. Under certain normal-theory assumptions, this so-called Wald test statistic follows a t-distribution with $n - 1$ degrees of freedom (Wald 1943). The test statistic is rearranged and solved for $\theta_0$ to produce the associated Wald $(1 - \alpha)100\%$ confidence interval for $\theta$, viz., $\hat\theta \pm t_{(\alpha/2),(n-1)} \times SE$. Here, $t_{(\alpha/2),(n-1)}$ is the t-distribution quantile with $n - 1$ degrees of freedom which puts area $\alpha/2$ in each of the lower and upper tails. Similarly, for the $p$-dimensional parameter case where $p > 1$, the corresponding Wald confidence interval (WCI) for parameter $\theta_i$, $i = 1, 2, \ldots, p$, is

$$\hat\theta_i \pm t_{(\alpha/2),(n-p)} \times SE_i. \qquad (14)$$

As noted in the previous section, $SE_i$ is the square root of the $i$th diagonal element of the variance-covariance matrix $s^2(\hat X^T\hat X)^{-1}$. The degrees of freedom of the t quantile here is $n - p$.
For this one-parameter hypothesis test $H_0: \theta = \theta_0$, a rival test statistic to the above Wald statistic is the likelihood-based F-statistic,

$$F = \frac{S(\theta_0) - S(\hat\theta)}{s^2}. \qquad (15)$$

When $H_0$ is true, this test statistic has the F distribution with 1 and $(n - 1)$ degrees of freedom. Inverting this test statistic gives the likelihood-based confidence interval (LBCI) which, since $s^2 = S(\hat\theta)/(n - 1)$ here, consists of the values of $\theta$ such that

$$S(\theta) \le S(\hat\theta)\left(1 + \frac{t^2_{(\alpha/2),(n-1)}}{n - 1}\right). \qquad (16)$$

In this expression, $S(\theta)$ is the sum of squares function introduced and discussed previously. Note that Eq. (16) also makes use of the fact that the square of a t-quantile with $k$ degrees of freedom is equal to the corresponding F-quantile with 1 and $k$ degrees of freedom. Once the data have been obtained and used to estimate the single model parameter $\theta$, and once the confidence level has been set, the right-hand side of Eq. (16) is a positive number. And, per this equation, finding the values of $\theta$ for which $S(\theta)$ equals that positive number is again a nonlinear root-finding undertaking which generally requires numerical methods.
The following example provides an illustration of these Wald and likelihood-based confidence interval methodologies.

Example 3 (continued).
As reported in Sect. 3.2, for the $n = 4$ simulated data points and the one-parameter model function given in Eq. (9), the LSE parameter estimate and standard error are $\hat\theta = 5.8698$ and $SE = 1.2755$, respectively. The 95% t-quantile, obtained using the R command qt(0.975,3), is 3.1824, and so the 95% Wald interval (WCI) is $5.8698 \pm 3.1824 \times 1.2755$, or $(1.8106, 9.9291)$. Finding the likelihood-based interval (LBCI) is a little more challenging, since it consists of the $\theta$ values between the interval endpoints for which $S(\theta) = S(\hat\theta)\left(1 + \frac{t^2_{0.025,3}}{3}\right)$. Using the "uniroot" R function employed in the Supplementary Information, the LBCI here is the set of $\theta$ values in the interval $(2.9960, 12.7122)$. This simple illustration demonstrates a pronounced difference between these two types of 95% confidence intervals for these data. For example, the hypothesis $H_0: \theta = 11$ would be retained using the likelihood method but rejected using the Wald approach; the opposite conclusions follow in testing $H_0: \theta = 2$. We emphasize that whereas these two types of intervals coincide exactly for linear models, this is clearly not the case for nonlinear models.
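Both intervals can be reproduced numerically. The self-contained Python sketch below mirrors the R computations (the root-finder bracket endpoints are assumptions, and small last-digit differences from the quoted values can arise from rounding):

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

# Example 3 data; eta(x, theta) = x / (theta + x)
x = np.array([0.0, 2.0, 4.0, 6.0])
y = np.array([0.037, 0.209, 0.519, 0.430])

def S(theta):
    # sum-of-squares function
    return np.sum((y - x / (theta + x)) ** 2)

def normal_eq(theta):
    e = y - x / (theta + x)
    return np.sum(e * x / (theta + x) ** 2)

n, p = len(y), 1
theta_hat = brentq(normal_eq, 1.0, 20.0)
s2 = S(theta_hat) / (n - p)
se = np.sqrt(s2 / np.sum((x / (theta_hat + x) ** 2) ** 2))

# 95% Wald interval: theta_hat +/- t * SE
t = stats.t.ppf(0.975, n - p)
wald = (theta_hat - t * se, theta_hat + t * se)

# 95% likelihood interval: theta where S(theta) = S(theta_hat) * (1 + F/(n-p)), Eq. (16)
cut = S(theta_hat) * (1.0 + stats.f.ppf(0.95, 1, n - p) / (n - p))
lbci = (brentq(lambda th: S(th) - cut, 0.5, theta_hat),
        brentq(lambda th: S(th) - cut, theta_hat, 60.0))

print(wald)  # near (1.81, 9.93)
print(lbci)  # near (3.00, 12.71), shifted right relative to Wald
```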
We next demonstrate how the Wald approach supplies an approximation to the likelihood approach. Applying the first-order Taylor series approximation of the one-parameter model function about $\hat\theta$,

$$\eta(x_i, \theta) \approx \eta(x_i, \hat\theta) + (\theta - \hat\theta)\left.\frac{\partial\eta(x_i, \theta)}{\partial\theta}\right|_{\hat\theta},$$

and substituting this approximation into the sum-of-squares function, we obtain

$$S(\theta) \approx S(\hat\theta) + (\theta - \hat\theta)^2\sum_{i=1}^{n}\left(\left.\frac{\partial\eta(x_i, \theta)}{\partial\theta}\right|_{\hat\theta}\right)^2. \qquad (17)$$

The last line in Eq. (17) follows by squaring, summing over $i$ from 1 to $n$, and noting that the cross-product term is zero by the normal equation. It shows that, subject to the assumed first-order approximation, $S(\theta)$ is approximately equal to the constant $S(\hat\theta)$ plus a quadratic expression in $\theta$. When combined with Eq. (16), this yields the Wald interval given in Eq. (14). This demonstrates that Wald intervals are quadratic approximations to the true sum of squares function, resulting from an initial first-order approximation. These results are illustrated as follows.
Example 3 (continued). For the given data, the sum of squares function $S(\theta)$ is plotted in the left panel of Fig. 2 using the solid curve; the filled circle is the point $(\hat\theta, S(\hat\theta))$, and the horizontal cut line is $y = 0.09702$, obtained from Eq. (16). The intersection of $S(\theta)$ and the cut line gives the endpoints of the LBCI, $(2.9960, 12.7122)$, as indicated by the filled squares in the left panel. Also plotted is the (Wald) quadratic approximation as the dashed parabola; the intersection of this quadratic approximation and the cut line gives the endpoints of the WCI, $(1.8106, 9.9291)$. Although the exact shape of the sum of squares function $S(\theta)$ is not of direct interest to the practitioner, what is important is that the Wald method is based on a linear approximation (which, when squared, gives the parabola) and that the two methods generally differ for nonlinear models. Before leaving this example, two other comments are important. First, notice that the LSE point estimate is the same whether we use the true $S(\theta)$ function or its quadratic approximation. This is to be expected since the linear approximation takes place at $\theta = \hat\theta$, where the two functions are equal. Second, using Eq. (16), the cut line for the 90% confidence intervals is easily computed to be $y = 0.06310$. At this lower height (and lower confidence level), the difference between the Wald and likelihood-based intervals is less pronounced. Thus, the higher the confidence level, the greater will generally be the divergence between the two intervals for nonlinear models.
As noted in Eq. (14), Wald confidence interval methods for the multiparameter case of $p > 1$ model parameters are straightforward. Methods for obtaining likelihood intervals involve the technique of parameter profiling, which we now discuss. First, partition the $p$-dimensional parameter vector $\theta$ as $\theta = (\theta_1^T, \theta_2^T)^T$, where $\theta_1$ contains $p_1$ model parameters and $\theta_2$ contains $p_2$ model parameters, so that $p_1 + p_2 = p$. To test the subset null hypothesis $H_0: \theta_2 = \theta_{20}$, the likelihood-based F test statistic (i.e., the counterpart of Eq. (15)) is

$$F = \frac{\left(S(\tilde\theta_1, \theta_{20}) - S(\hat\theta)\right)/p_2}{s^2}. \qquad (18)$$

Under the null hypothesis, this test statistic follows the $F_{p_2,(n-p)}$ distribution, that is, the F distribution with $p_2$ and $(n - p)$ degrees of freedom. In Eq. (18), $\tilde\theta_1$ minimizes $S(\theta)$ subject to the constraint $H_0: \theta_2 = \theta_{20}$. This technique of removing so-called nuisance parameters (i.e., $\theta_1$ here) by constrained optimization is the mentioned profiling technique. Note that the restricted (constrained) estimate $\tilde\theta_1$ is in general not equal to the unrestricted (LSE) estimate $\hat\theta_1$. Furthermore, since our interest is in obtaining a confidence interval instead of a region, let $p_2 = 1$ so that $\theta_2$ is the single parameter of interest. Then the result in Eq. (18) is

$$F = \frac{S(\tilde\theta_1, \theta_2) - S(\hat\theta)}{s^2}. \qquad (19)$$

Inverting this expression gives the profile likelihood confidence interval (PLCI) as the set of $\theta_2$ values which solve the (typically nonlinear) equation

$$S(\tilde\theta_1(\theta_2), \theta_2) \le S(\hat\theta)\left(1 + \frac{t^2_{(\alpha/2),(n-p)}}{n - p}\right). \qquad (20)$$

We underscore that $\tilde\theta_1 = \tilde\theta_1(\theta_2)$ in Eq. (20) is the value of the remaining $(p - 1)$-vector $\theta_1$ in $\theta = (\theta_1^T, \theta_2)^T$ that minimizes $S(\theta)$ subject to the constraint that $\theta_2$ is the given fixed value. In certain instances, algebraic results can be derived to obtain the $\tilde\theta_1$ vector in closed form, but in the general situation, numerical methods are required. Direct comparison of the LBCI expression in Eq. (16) with the PLCI expression in Eq. (20) highlights the fact that both approaches use root-finding methods to find the corresponding confidence intervals. But the PLCI equation also involves constrained optimization to remove the remaining (so-called 'nuisance') parameters. The following example illustrates the PLCI method in a situation where an exact algebraic result is available.
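In outline, the profiling step is a small nested optimization inside the root find. The sketch below computes a PLCI for the Michaelis-Menten EC$_{50}$ parameter $\theta_2$; the simulated dataset and bracket choices are assumptions for illustration. Conveniently, the nuisance parameter $\theta_1$ profiles out in closed form here, because the model is linear in $\theta_1$ for fixed $\theta_2$:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq, curve_fit

# Profile-likelihood CI for the Michaelis-Menten EC50 parameter theta2.
# Illustrative simulated data (assumptions, not the paper's dataset).
rng = np.random.default_rng(2)
x = np.repeat([0.02, 0.04, 0.06, 0.08, 0.10], 3)
y = 210.0 * x / (0.065 + x) + rng.normal(0.0, 5.0, size=x.size)
n, p = len(y), 2

def eta(x, t1, t2):
    return t1 * x / (t2 + x)

theta_hat, _ = curve_fit(eta, x, y, p0=(200.0, 0.05))
S_hat = np.sum((y - eta(x, *theta_hat)) ** 2)

def S_profiled(t2):
    # For fixed theta2 the model is linear in theta1, so the constrained
    # (profiled) estimate theta1~(theta2) has a closed form
    g = x / (t2 + x)
    t1 = np.sum(y * g) / np.sum(g * g)
    return np.sum((y - t1 * g) ** 2)

# 95% PLCI per Eq. (20): theta2 values where the profiled sum of squares
# equals S_hat * (1 + F_{1, n-p} / (n - p))
cut = S_hat * (1.0 + stats.f.ppf(0.95, 1, n - p) / (n - p))
lo = brentq(lambda t: S_profiled(t) - cut, 1e-4, theta_hat[1])
hi = brentq(lambda t: S_profiled(t) - cut, theta_hat[1], 1.0)
print((lo, hi))  # profile likelihood interval for theta2
```

When no closed form exists, the one-line profiling step is replaced by a numerical minimization over the nuisance parameters at each fixed value of $\theta_2$.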
Example 2 (continued). As noted above, the key parameter in this example is the IC$_{50}$ parameter $\theta$, so the intercept parameter $\alpha$ is treated as a nuisance parameter. That is, $\alpha$ is a parameter which must be estimated but which is not the main focus of the study. The intercept $\alpha$ is profiled out by fixing $\theta$ and setting to zero only the partial derivative $\partial S/\partial\alpha$ in Eq. (11). This gives the profiled (conditional) parameter estimate,

$$\tilde\alpha = \tilde\alpha(\theta) = \frac{\sum_{i=1}^{n} y_i\left(1 - \frac{x_i}{2\theta}\right)}{\sum_{i=1}^{n}\left(1 - \frac{x_i}{2\theta}\right)^2}.$$

This profiled parameter estimate is then substituted into the sum of squares function to obtain the profiled sum of squares function,

$$S(\tilde\alpha(\theta), \theta) = \sum_{i=1}^{n}\left(y_i - \tilde\alpha(\theta)\left(1 - \frac{x_i}{2\theta}\right)\right)^2.$$

For the given data, this profiled sum of squares function is plotted in the right panel of Fig. 2 using the solid curve. For ease of computation, the profiled function has been shifted down by the amount $S(\hat\theta)\left(1 + \frac{t^2_{(\alpha/2),(n-2)}}{n-2}\right)$, so that the horizontal cut line of Eq. (20) is then $y = 0$; on this shifted scale, the filled circle is the point $(22.327, -109.5497)$. The intersection of the shifted profiled sum of squares function and the horizontal cut line gives the endpoints of the PLCI, $(15.9176, 43.9640)$, as indicated by the filled squares in the figure. Also plotted is the (Wald) quadratic approximation as the dashed parabola; the intersection of this quadratic approximation and the cut line gives the endpoints of the WCI, $(12.5786, 32.0756)$.
The plots in the two panels of Fig. 2 look very similar but note that the plots in the left panel are for the sum of squares function, whereas in the right panel they correspond to the profiled sum of squares function.Regardless, in both cases, the Wald and likelihood curves and intervals are observed to differ appreciably.

Examples 1 and 2 (continued).
To summarize the previous findings for Sect. 2's motivating examples and to introduce the topic of the next section, we briefly return to these two two-parameter examples, now displayed in the panels of Fig. 3.
In both of these examples, the Wald intervals are (by construction) symmetric about the respective LSEs whereas the profile likelihood intervals are shifted to the right.
We next turn to practical reasons for preferring one of these confidence interval methods over the other. Then, in Sect. 3.5, we discuss nonlinear model selection and computational algorithms.

Deciding Which Confidence Interval Method is Preferred and Why
For nonlinear model function parameter estimation, F-statistic likelihood-based confidence intervals are generally preferred to Wald methods for several important reasons. These reasons, discussed further below, have been underscored by several notable works (Bates and Watts 2007; Clarke 1987; Cook and Witmer 1985; Donaldson and Schnabel 1987; Evans et al. 1996; Faraggi et al. 2003; Haines et al. 2004; Pawitan 2013; Peddada and Haseman 2005; Ratkowsky 1983; Seber and Wild 1989). As regards the notation used here, Wald confidence intervals (WCIs) are those given in Eq. (14), and likelihood confidence intervals are either likelihood-based confidence intervals (LBCIs) in the one-parameter setting, as in Eq. (16), or profile-likelihood confidence intervals (PLCIs) in the multiparameter setting, as in Eq. (20) and as for the two 2-parameter motivating examples displayed in Fig. 3.
There are several important reasons why likelihood confidence interval methods (LBCIs and PLCIs) are preferred over WCIs for nonlinear modelling. One reason is that likelihood intervals generally have much better agreement between nominal (i.e., assumed) confidence levels and actual confidence levels. Several works (Clarke 1987; Donaldson and Schnabel 1987; Evans et al. 1996; Faraggi et al. 2003; Ratkowsky 1983; Seber and Wild 1989) have used computer simulations to demonstrate that, provided the model/data's intrinsic curvature (discussed in Appendix A.1) is reasonably low, likelihood intervals typically show good agreement between the chosen nominal level (e.g., 95%) and the actual confidence level. Results for Wald intervals, however, can be quite disappointing. For example, simulation studies for some reported homoscedastic normal nonlinear models have found that the "observed coverage for a nominally 95% [Wald] confidence interval is as low as 75.0%, 44.0%, and 10.8%" (Donaldson and Schnabel 1987, p. 76), depending on the data and chosen model function.
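The kind of coverage simulation reported in those works is easy to replicate in outline. The Monte-Carlo sketch below estimates the actual coverage of the nominal 95% Wald interval for the Michaelis-Menten EC$_{50}$; the design, noise level, and number of replications are assumptions for illustration, not any published study's settings:

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Monte-Carlo estimate of the actual coverage of the nominal 95% Wald interval
# for the Michaelis-Menten EC50 parameter (illustrative setup).
rng = np.random.default_rng(3)
x = np.repeat([0.02, 0.04, 0.06, 0.08, 0.10], 3)
true = (210.0, 0.065)
n, p = x.size, 2
t_crit = stats.t.ppf(0.975, n - p)

def eta(x, t1, t2):
    return t1 * x / (t2 + x)

hits = 0
n_sim = 500
for _ in range(n_sim):
    y = eta(x, *true) + rng.normal(0.0, 10.0, size=n)
    try:
        est, cov = curve_fit(eta, x, y, p0=true, maxfev=2000)
    except RuntimeError:
        continue  # a failed fit counts as a miss (conservative)
    se = np.sqrt(cov[1, 1])
    if abs(est[1] - true[1]) <= t_crit * se:
        hits += 1

coverage = hits / n_sim
print(coverage)  # estimated actual coverage of the nominal 95% Wald interval
```

Swapping in the likelihood interval of Eq. (16), or Eq. (20) with profiling, in place of the Wald interval gives the corresponding likelihood-coverage estimate for comparison.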
The superior coverage of likelihood methods is not surprising and is to be expected since, for homoscedastic normal nonlinear models, the likelihood test statistics given in Eqs. (15), (18), and (19) (and the associated likelihood confidence intervals) are exact or very nearly so. Theoretical results show that the only difference between these F-statistic results and exact results (including p-values and coverage probabilities) depends upon the model's intrinsic curvature (discussed in Appendix A.1). Further, intrinsic curvature is often negligible for nonlinear models in practice (Bates and Watts 2007; Clarke 1987; Ratkowsky 1983; Seber and Wild 1989). Wald confidence intervals, on the other hand, are also affected by so-called parameter-effects curvature (also discussed in Appendix A.1), which can be appreciable in many nonlinear model-dataset situations.
Another reason likelihood methods are preferred is that they more accurately reflect the information in the data. For example, for the two models and datasets of Fig. 3, the WCIs are symmetric whereas the PLCIs are shifted to the right. Since most of the data points (e.g., five of the six data points in the right panel) lie to the left of the estimated IC50, the relative amount of 'information' in the data about the IC50 parameters is higher to the left of the IC50 estimate and lower to the right. The PLCIs are therefore more reasonable for these datasets and models: less information is contained in the data on the right side of the estimated parameters, and the PLCIs accordingly extend further on the right-hand side. More generally, since WCIs for nonlinear model parameters are always symmetric whereas PLCIs can be (and often are) asymmetric, PLCIs can more accurately reflect any information/precision imbalances in the data regarding specific parameter values.
In sum, for all practical purposes, the F-based likelihood methods used here are essentially exact (see Appendices A.1 and A.2). Wald methods for nonlinear model parameters, on the other hand, are based on the asymptotic (valid for large samples) normality of the model parameter estimate. Since this approximation breaks down for many nonlinear models with small-to-moderate datasets, Wald methods should only be used with caution or avoided altogether, and this includes the commonly reported Wald p-values given by some popular statistical software packages.
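As a rough numerical illustration of the coverage point (this is a hypothetical sketch, not one of the paper's simulations; the "true" parameter, noise level, and design points below are all made up), the following Monte Carlo check estimates the actual coverage of nominally 95% Wald intervals for a one-parameter exponential-decay model:

```python
# Hypothetical Monte Carlo sketch: estimate the actual coverage of nominal
# 95% Wald intervals for the model y = exp(-theta*x) + eps with normal errors.
# theta_true, sigma, and the design points x are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t

rng = np.random.default_rng(1)
theta_true, sigma = 1.0, 0.1
x = np.array([0.25, 0.5, 1.0, 2.0, 4.0])

def f(x, theta):
    return np.exp(-theta * x)

n_rep, hits = 500, 0
for _ in range(n_rep):
    y = f(x, theta_true) + rng.normal(0.0, sigma, size=x.size)
    est, cov = curve_fit(f, x, y, p0=[0.5], maxfev=10000)
    half = t.ppf(0.975, x.size - 1) * np.sqrt(cov[0, 0])  # Wald half-width
    hits += (est[0] - half <= theta_true <= est[0] + half)

wald_coverage = hits / n_rep
print(f"estimated Wald coverage: {wald_coverage:.3f}")
```

How far the estimate falls below 0.95 depends on the curvature induced by the chosen design and noise level; for high-curvature model/data combinations the shortfall can be severe, as in the Donaldson and Schnabel results quoted above.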

Nonlinear Model Function Selection and Computational Methods
As regards model function selection, our preference is to use mechanistic models instead of empirical models whenever possible. Mechanistic models are chosen based on subject-matter knowledge of the relevant system or phenomenon under study, whereas empirical models are often chosen based on providing a good fit to the study data. In early-stage studies of two or more quantitative factors, empirical modelling sometimes includes response surface modelling such as quadratic (or higher-order) polynomial fitting. As expert knowledge of the system grows, focus often shifts to nonlinear modelling, such as using a dose-response or similar nonlinear model.
Mechanistic nonlinear model functions are sometimes based on a system of one or more differential equations, that is, equations that model rates of change in the given system. These so-called compartmental models are popular in fields including chemical kinetics, ecology, pharmacology (including pharmacokinetics and pharmacodynamics) and toxicology (Bates and Watts 2007; Seber and Wild 1989). For example, the exponential decay model function for the population of an ecosystem at time t, P(t) = P0 e^(−rt), is the solution of the differential equation (with given initial condition)

dP(t)/dt = −r P(t), P(0) = P0 (24)

This differential equation posits that the rate of decrease in the population at time t is proportional to the size of the population at that time. For another example, if the size of a biological culture is assumed to grow rapidly at first up to a point and then its growth rate decreases (e.g., with increased competition), a commonly assumed differential equation, with 'half-life' condition, is

dP(t)/dt = r P(t)(1 − P(t)/K), P(t0) = K/2 (25)

A solution of this differential equation is the three-parameter logistic growth model function,

P(t) = K / (1 + e^(−r(t−t0))) (26)

Other models, such as the intermediate-product model in pharmacokinetics (Bates and Watts 2007), involve systems of two or more differential equations.
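These solutions can be checked numerically. The sketch below (assuming the standard logistic form P(t) = K/(1 + e^(−r(t−t0))) for the growth equation dP/dt = rP(1 − P/K); all numeric values for P0, r, K, and t0 are illustrative, not from the paper's examples) integrates each differential equation and compares against the closed-form model function:

```python
# Numerical check: P(t) = P0*exp(-r*t) solves dP/dt = -r*P with P(0) = P0,
# and the three-parameter logistic curve solves dP/dt = r*P*(1 - P/K) with
# P(t0) = K/2.  All parameter values below are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

P0, r = 100.0, 0.3
ts = np.linspace(0.0, 10.0, 50)

# Exponential decay
sol = solve_ivp(lambda t, P: -r * P, (0.0, 10.0), [P0],
                t_eval=ts, rtol=1e-10, atol=1e-12)
decay_err = np.max(np.abs(sol.y[0] - P0 * np.exp(-r * ts)))

# Logistic growth with 'half-life' condition P(t0) = K/2
K, t0 = 50.0, 4.0
logistic = lambda t: K / (1.0 + np.exp(-r * (t - t0)))
sol2 = solve_ivp(lambda t, P: r * P * (1.0 - P / K), (0.0, 10.0),
                 [logistic(0.0)], t_eval=ts, rtol=1e-10, atol=1e-12)
logistic_err = np.max(np.abs(sol2.y[0] - logistic(ts)))
print(decay_err, logistic_err)
```

Both maximum discrepancies are at the level of the integrator's tolerance, confirming that the closed-form model functions solve their respective differential equations.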
Parameter estimation for nonlinear models is generally achieved using iterative methods such as the Newton-Raphson method (Ratkowsky 1983) or some variant thereof. This method involves successively substituting linear approximations, such as those used in Eq. (17), into the sum-of-squares function and/or normal equations. This process is repeated "until convergence," meaning until the changes in the objective function between iterations fall below some chosen threshold. These computational algorithms have been implemented in the NLIN, NLP and NLMIXED procedures in SAS, the "nls", "nlmer" and "gnm" functions in R, and other software packages such as GAUSS, Minitab, PRISM and STATA. Paramount to this process is the necessity of well-chosen starting values, which is best achieved by first understanding the roles of the individual model function parameters and plotting the given data. Further details of computational aspects of nonlinear modelling can be found in nonlinear regression texts (Bates and Watts 2007; Ratkowsky 1983; Seber and Wild 1989).
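The role of starting values can be sketched with a small synthetic Michaelis-Menten fit (the paper's supplementary code is in R; this Python sketch uses `scipy.optimize.curve_fit`, which, like R's `nls`, iterates from a supplied starting point; the data and "true" parameters below are made up for illustration):

```python
# Sketch: iterative least-squares fit of the Michaelis-Menten function
# eta(x) = th1*x/(th2 + x) to synthetic, noise-free data, started from a
# crude plot-based guess: th1 near the largest response, th2 near the x
# giving half that response.  th_true and x are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def mm(x, th1, th2):
    return th1 * x / (th2 + x)

th_true = (200.0, 0.06)
x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
y = mm(x, *th_true)  # noise-free, so the iterations should recover th_true

start = (max(y), x[np.argmin(np.abs(y - max(y) / 2))])  # "eyeball" start
est, _ = curve_fit(mm, x, y, p0=start)
print(est)
```

With a poorly chosen start (e.g., a wildly wrong th2), such iterations can stall or converge to a spurious local optimum, which is why understanding the parameters' roles and plotting the data first matters.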

Additional Nonlinear Illustrations
The following examples further serve to illustrate the wide-ranging applications of nonlinear modelling and are included for readers wishing for additional examples.
Example 4 The nonlinear model discussed here is a segmented regression model (also called a broken-stick, piecewise, or change-point model). This model function is used in data science and in application fields such as agronomy, economics, engineering, environmental studies, and medicine (Seber and Wild 1989). The data examined here (Anderson and Nelson 1975), graphed in Fig. 4, relate average corn yields (the outcome variable) to the amount of nitrogen fertilizer applied (the input variable). Following the authors, the linear-plateau segmented model is fitted here, and the corresponding fitted linear-plateau curve is superimposed in the figure. The linear-plateau model used here has parameter vector θ^T = (α, β, κ) and is written

η(x, θ) = α + βx for x ≤ κ, and η(x, θ) = α + βκ for x > κ

This model function can also be written

η(x, θ) = (α + βx) I_{x≤κ} + (α + βκ) I_{x>κ}

Here I_C is an indicator function equal to one when the condition C in the subscript is true and equal to zero otherwise. In accordance with the underlying (agricultural) subject-matter reasoning used by the authors, and as observed in the data plotted in Fig. 4, this chosen model is continuous at the unknown join or transition point x = κ. This is a nonlinear model since the transition point (also called a knot), κ, is a model parameter to be estimated, and, for x > κ, the derivative ∂η/∂β = κ contains a model parameter.
Using the given R code (see the Supplementary Information), for these data the parameter estimates are α̂ = 60.90, β̂ = 0.22 and κ̂ = 101.28, so the maximum corn yield is estimated to be ŷ_MAX = α̂ + β̂κ̂ = 83.58 (computed with unrounded estimates). The point (κ̂, ŷ_MAX) is the solid square plotted in Fig. 4. The 95% profile likelihood confidence interval for the transition point κ is (78.72, 143.48).
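A minimal sketch of the linear-plateau function (written with the indicator form above, and using the rounded estimates reported in the text) confirms continuity at the join point and the plateau behaviour:

```python
# Linear-plateau model: eta = alpha + beta*x for x <= kappa, and the constant
# plateau alpha + beta*kappa for x > kappa.  Parameter values are the rounded
# estimates reported in the text, so the plateau height here reflects that
# rounding.
import numpy as np

def linear_plateau(x, alpha, beta, kappa):
    x = np.asarray(x, dtype=float)
    return np.where(x <= kappa, alpha + beta * x, alpha + beta * kappa)

alpha, beta, kappa = 60.90, 0.22, 101.28
left = float(linear_plateau(kappa, alpha, beta, kappa))        # at the join
right = float(linear_plateau(kappa + 50.0, alpha, beta, kappa))  # on the plateau
print(left, right)
```

The two values agree, which is the continuity-at-the-knot property that distinguishes this model from an unconstrained two-piece fit.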
Although sound reasons were already given in Sect. 3.4 for avoiding Wald methods, we re-emphasize the caution (Hinkley 1969; Seber and Wild 1989) against using Wald-based methods for such segmented models, since the required asymptotic normality approximation for κ̂ can often be quite poor; with the small sample size of n = 6 here, likelihood methods are instead recommended. This caution regarding WCIs should especially be borne in mind when using spline models with unknown knots, such as in fitting smoothing splines and generalized additive models popular in the domains of predictive modelling and machine learning (James et al.).
In Eq. (29), x = 1 for group 1 observations and x = 0 for group 2 observations. Thus, θ^T = (θ1, θ2), θ1 = μ1 and θ2 = μ2/μ1. It follows that θ2 is the parameter of interest since it is the ratio of the two means, and θ1 is the nuisance parameter; following Sect. 3.3, θ1 is removed by parameter profiling so as to find the PLCI for the ratio parameter, θ2.
To illuminate the use of these methods here, we use a simulated dataset wherein the n1 = 3 group 1 response values are y1j = 3, 4, 5 and the n2 = 8 group 2 response values are y2j = 6, 6, 7, 8, 8, 9, 10, 10. Clearly, θ̂1 = ȳ1 = 4 and, since ȳ2 = 8, θ̂2 = 8/4 = 2. With S(θ̂) = 20, the unbiased estimator of σ² is the mean-square error (MSE), s² = 20/(11 − 2) = 2.22. Using the results given in Appendix A.3, the (1 − α)100% Wald confidence interval (WCI) for θ2 is

θ̂2 ± t(α/2, n−2) (s/θ̂1) √(1/n2 + θ̂2²/n1)

Likewise, the Appendix A.3 results are used to show that the profile likelihood confidence interval (PLCI) for θ2 is

[θ̂2 ± (t(α/2, n−2) s/ȳ1) √(1/n2 + θ̂2²/n1 − c/n2)] / (1 − c)

In this expression, c = t²(α/2, n−2) s²/(n1 ȳ1²), and c is in the interval 0 < c < 1. Thus, the center of the PLCI, θ̂2/(1 − c), is shifted to the right of the center of the WCI, θ̂2. For the given data, the 95% WCI is (0.98, 3.02) and the 95% PLCI is (1.30, 3.94). The rightward shift of the PLCI vis-à-vis the WCI is notable here. Also, whereas the Wald approach would retain equal means (i.e., the value of one is retained for the ratio parameter, θ2), the likelihood approach clearly rejects this claim.
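These quantities can be reproduced directly with a short script (a sketch: the Wald interval below uses the delta-method/Jacobian standard error for the ratio estimator, and the profile interval is the Fieller-type interval; both reproduce the reported intervals (0.98, 3.02) and (1.30, 3.94)):

```python
# Reproduce the two-group ratio illustration: theta1 = mu1, theta2 = mu2/mu1.
import numpy as np
from scipy.stats import t

y1 = np.array([3.0, 4.0, 5.0])
y2 = np.array([6.0, 6.0, 7.0, 8.0, 8.0, 9.0, 10.0, 10.0])
n1, n2, n = y1.size, y2.size, y1.size + y2.size

th1 = y1.mean()                        # 4
th2 = y2.mean() / y1.mean()            # 2
S = np.sum((y1 - th1) ** 2) + np.sum((y2 - th1 * th2) ** 2)   # 20
s2 = S / (n - 2)                       # MSE, 20/9

tq = t.ppf(0.975, n - 2)
# Wald: delta-method standard error of the ratio estimator
se = np.sqrt(s2 * (1.0 / (n2 * th1**2) + th2**2 / (n1 * th1**2)))
wci = (th2 - tq * se, th2 + tq * se)

# Fieller-type profile-likelihood interval
c = tq**2 * s2 / (n1 * th1**2)         # 0 < c < 1 here
half = (tq * np.sqrt(s2) / th1) * np.sqrt(1.0 / n2 + th2**2 / n1 - c / n2)
plci = ((th2 - half) / (1.0 - c), (th2 + half) / (1.0 - c))
print(wci, plci)
```

Note that the PLCI's center, th2/(1 − c), exceeds th2 whenever c > 0, which is the rightward shift discussed above.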
In the following continuation of Example 1, we extend the original illustration to compare two curves and calculate a relative potency parameter based on the ratio methodology of the previous illustration.
Example 1 continued. The original Example 1 enzyme kinetic data analyzed previously and displayed in the left panel of Fig. 1 are for samples untreated with an antibiotic; the averages of the three (same-concentration) replicates of these data are plotted in Fig. 5 using the small filled triangles. In a spirit similar to other works (Bates and Watts 2007, p. 269), additional enzyme velocity measurements were made (also in triplicate) using the same substrate concentrations but for samples treated with the antibiotic. Averages of these replicates are also shown in Fig. 5 using the small filled circles. (The fitted curves in the figure are discussed below.) Using the relevant kinetics nonlinear modelling, researchers are interested in determining and quantifying the effect of the antibiotic on enzymatic activity.
To enable testing between the treated and untreated groups, the Michaelis-Menten model function of Eq. (1) is modified to fit both groups simultaneously using the model function

η(x, θ) = θ1T x/(θ2T + x) D_T + θ1U x/(θ2U + x) D_U

In this expression, D_T = 1 for samples in the treated group and D_T = 0 for samples in the untreated group, and analogously D_U = 1 − D_T. The reduced two-group Michaelis-Menten model function with common upper asymptote is given by the expression

η(x, θ) = θ1 x/(θ2T + x) D_T + θ1 x/(θ2U + x) D_U (33)

Note that this expression is equal to η(x, θ) = θ1x/(θ2T + x) for the treated group and η(x, θ) = θ1x/(θ2U + x) for the untreated group, with the commonality of the upper asymptote θ1 noted. The connection between the two EC50 parameters in the right-hand expression of Eq. (33) is given by the relation

θ2U = ρ θ2T (34)

The so-called relative potency parameter ρ in Eq. (34) is the ratio of the respective EC50 parameters, and it is in this context that this illustration mirrors Example 5; note too that by making ρ an explicit model function parameter, we can readily obtain an accurate (likelihood-based) confidence interval for it.
When the model function in Eq. (33) is fit to these data, the resulting fitted curves are shown in Fig. 5 for the treated (top) and untreated (bottom) groups. For these data, the LSE of ρ is ρ̂ = 1.8275, so the substrate is approximately 1.8 times more potent for the treated group than for the untreated group. Further, the 95% PLCI for ρ, (1.5274, 2.2366), lies entirely above one, thereby establishing that the substrate is significantly more potent for the treated group than for the untreated group.

Example 6. Examined here are dose-response data (Seefeldt et al. 1995) relating yield dry weight of biotype C wild oat Avena fatua (the response variable, in g) to herbicide dose (the explanatory variable, in kg ai/ha). These data are plotted in Fig. 6, with the raw data shown in the left panel and the log-yield data plotted in the right panel. We use here the four-parameter log-logistic (LL4) model function (Seefeldt et al. 1995),

η(x, θ) = θ2 + (θ1 − θ2) / (1 + (x/θ3)^θ4) (35)

In this model function, which is also called the Hill equation or the Morgan-Mercer-Flodin family (Seber and Wild 1989), θ3 is the ED50 (50% effective dose) parameter and θ4 is the slope parameter. For θ4 > 0, θ1 is the 'upper asymptote,' or the expected response when x = 0, and θ2 is the lower asymptote, or the expected response for very large dose (i.e., for x → ∞). To establish that θ3 is the ED50 parameter, note that when x = θ3, the expected response is indeed (θ1 + θ2)/2, the average of the two asymptotes. In view of the non-constant variance of the original data in the left panel of Fig. 6, we can fit the LL4 model function using log-yield as the response variable, and this fitted model function is superimposed as the solid curve in the right-panel plot. Alternatively, after applying the log-transformation to both sides of the equation, the log-yields could be fit using the logarithm of the LL4 model function. In this instance, the results in both cases are very similar. This practice of transforming the response variable (e.g., the log-transformation here), with or without transformation of the model function, and then fitting the additive homoskedastic normal nonlinear model of Eq. (3), is quite common in practice. But whether this is a sound practice depends on whether the selected variance-stabilizing transformation (such as logarithm, square-root, etc.) is a good choice for the given dataset and model function. As such, we next consider an alternative strategy.
Although it falls outside of the constant-variance normal additive paradigm of Eq. (3), another option is to fit the additive LL4 model function to the untransformed responses and to also model the variances using a variance function such as

var(y_ij) = σ² [η(x_i, θ)]^ρ (36)

In Eq. (36), in addition to the variance parameter σ², an additional parameter, ρ, has been included as the power of the mean model function, η(x, θ). If ρ = 0, then Eq. (36) reduces to the usual homoskedastic case, var(y_ij) = σ², of Eq. (3). Whenever ρ > 0, this variance function holds that the variance (i.e., the spread of the response values) increases with the mean, and this behavior is indeed observed in the left panel of Fig. 6, since the variance of the responses is higher when the average yield is higher and lower when the average yield is lower. For the data plotted in the left panel of Fig. 6, the maximum-likelihood estimate of ρ is ρ̂ = 1.4707, and the test of H0: ρ = 0 is rejected (p < 0.0001). Using results in Seber and Wild (1989), the estimate ρ̂ ≈ 1.5 suggests that the fourth-root transformation (y^(1/4)) may have been a better choice for these data than the log-transformation used above. For these data, however, since the results are very similar, the homoskedastic normal nonlinear fit shown in the right panel of Fig. 6 (for the log-transformed data) is deemed sufficient.
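The ED50 property of the LL4 function noted above is easy to verify numerically. The sketch below assumes the standard LL4 form, η = θ2 + (θ1 − θ2)/(1 + (x/θ3)^θ4), and uses arbitrary illustrative parameter values:

```python
# Check that, for the standard four-parameter log-logistic (LL4) function,
# the response at x = theta3 is the midpoint of the two asymptotes, so that
# theta3 is the ED50.  Parameter values are arbitrary illustrative choices.
import numpy as np

def ll4(x, th1, th2, th3, th4):
    return th2 + (th1 - th2) / (1.0 + (x / th3) ** th4)

th1, th2, th3, th4 = 7.0, 0.5, 0.4, 1.8   # upper, lower, ED50, slope

mid = ll4(th3, th1, th2, th3, th4)
near_zero = ll4(1e-9, th1, th2, th3, th4)  # x -> 0: upper asymptote th1
very_large = ll4(1e9, th1, th2, th3, th4)  # x -> inf: lower asymptote th2
print(mid, near_zero, very_large)
```

The three printed values sit at (θ1 + θ2)/2, θ1, and θ2 respectively, matching the asymptote and ED50 interpretations given in the text.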

Discussion and Final Thoughts
Before the advent of sufficient computing power and model-fitting methods, nonlinear models (often derived from sound expert knowledge and theory) were historically fit using linearization methods. This technique ignores the overall additive model structure given in Eq. (3) and the underlying model assumptions. For example, for the Michaelis-Menten model function, y = θ1x/(θ2 + x) + ε, if this expression is replaced with the approximation y ≈ θ1x/(θ2 + x), algebraic manipulation leads to the expression x/y ≈ (θ2/θ1) + (1/θ1)x. With some further substitutions, the right-hand side of this expression is of the form α + βx, and so linear models were then fit. Often, the resulting transformation introduced additional problems such as non-constant variance, lack of fit, and challenges in obtaining confidence intervals for the original model parameters. Although several authors (Currie 1982; Seber and Wild 1989) clearly warn against using such linearization methods, without introductory guides such as the current work, practitioners may not be aware of these problems.
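The algebra behind the linearization can be checked numerically: with exact (error-free) Michaelis-Menten responses, regressing x/y on x recovers intercept θ2/θ1 and slope 1/θ1 (the parameter and design values below are made up for illustration). With real, noisy data the same transformation distorts the error structure, which is the warning above:

```python
# Check the linearization algebra: if y = th1*x/(th2 + x) exactly, then
# x/y = th2/th1 + (1/th1)*x, so a straight-line fit of x/y on x recovers
# intercept th2/th1 and slope 1/th1.  th1, th2, x are illustrative values.
import numpy as np

th1, th2 = 200.0, 0.06
x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
y = th1 * x / (th2 + x)

slope, intercept = np.polyfit(x, x / y, 1)
print(slope, intercept)   # 1/th1 and th2/th1
```

The recovery is exact only because the responses here contain no error; adding the additive noise of Eq. (3) before transforming is precisely what produces the heteroskedasticity and bias that the cited authors warn about.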
In addition to the nonlinear regression methods and examples provided here, interested readers may wish to more fully explore topics such as further heteroskedastic (variance function) modelling, bioassay and synergy modelling (Lee et al. 2007; Lynch et al. 2016; Sims and O'Brien 2011; Straetemans et al. 2005; Tallarida 2000; Wheeler et al. 2006; White et al. 2019), multivariate, compartmental, and generalized nonlinear models, related experimental design considerations (Kim et al. 2021; O'Brien et al. 2010; O'Brien and Silcox 2021), and additional curvature examples (Seber and Wild 1989). Other notable recent application fields include the use of high-throughput dose-response methods to evaluate compounds as potential antiviral drugs to treat COVID-19 patients (Chen et al. 2022) and modelling to assess enzymatic activity in viral proteins comparing SARS-CoV with SARS-CoV-2 (O'Brien et al. 2021).

negligible." So, in practice, the main reason likelihood and Wald intervals differ is due to PE curvature, and this in turn is related to the manner in which distances are calculated when finding confidence intervals. This distance metric is straightforward to assess for linear models, where both the IN and PE measures are zero, but is more challenging for nonlinear models. In light of the linear approximation used in Wald methods, the metric used for Wald distance and interval calculations is often inaccurate.
The following one-parameter illustration develops and illustrates the intrinsic (IN) and parameter-effects (PE) curvature measures. It also underscores the manner in which Wald confidence intervals approximate likelihood intervals and highlights how and when these intervals can diverge.
Example 7. The homoscedastic normal one-parameter simple exponential model function used here, which also arises in drug studies involving pharmacokinetic one-compartment modelling (Bailer and Portier 1990), is given by the expression

η(x, θ) = e^(−θx) (37)

This one-parameter model function is fitted here to the simulated n = 2-point dataset, (x1, y1) = (0.50, 0.93) and (x2, y2) = (4.0, 0.025). (Although it would be ill-advised to use such a small sample size in practice, choosing only two data points facilitates viewing the so-called one-dimensional expectation surface in this two-dimensional sample space.) These data and model yield the LSE/MLE θ̂ = 0.50, the 80% likelihood-based confidence interval (LBCI), (0.12, 2.25), and the 80% Wald confidence interval (WCI), (−0.37, 1.37).
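The reported estimate and Wald interval can be reproduced with a short script (a sketch: the LSE is found by direct minimization of the sum of squares, and the Wald interval uses the tangent-line standard error with the t quantile on n − p = 1 degree of freedom):

```python
# Reproduce Example 7: fit eta(x, theta) = exp(-theta*x) to the two points
# (0.50, 0.93) and (4.0, 0.025), then form the 80% Wald interval from the
# tangent-line (linearization) standard error.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import t

x = np.array([0.50, 4.0])
y = np.array([0.93, 0.025])

ss = lambda th: np.sum((y - np.exp(-th * x)) ** 2)  # sum of squares S(theta)
th_hat = minimize_scalar(ss, bounds=(0.0, 5.0), method="bounded").x

s2 = ss(th_hat) / (x.size - 1)                 # n - p = 1
grad = -x * np.exp(-th_hat * x)                # d(eta)/d(theta) at the LSE
se = np.sqrt(s2 / np.sum(grad ** 2))
tq = t.ppf(0.90, 1)                            # 80% two-sided quantile
wci = (th_hat - tq * se, th_hat + tq * se)
print(th_hat, wci)
```

This recovers θ̂ ≈ 0.50 and a Wald interval near (−0.37, 1.37); the negative lower endpoint, for a parameter that must be positive for decay, is itself a symptom of the linearization's inadequacy here.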
To appreciate how and why likelihood and Wald intervals differ, the left panel of Fig. 7 plots the one-dimensional expectation surface (E) in the two-dimensional sample space for this model function and the given data (see also Bates and Watts 2007). Since here η1 = e^(−0.5θ) and η2 = e^(−4θ), the expectation surface is given by the equation η2 = η1^8. The expectation surface E is mapped out as θ ranges from 0 to ∞, as indicated by the plotted θ values. Intrinsic curvature (IN) assesses the degree to which E deviates from a straight line here, and from a plane or hyper-plane in the general p > 1 case. Given the pronounced bending of E in Fig. 7, it is not surprising that the realized value of IN, calculated here to be IN = 1.44, exceeds the suggested 0.30 threshold value (Bates and Watts 2007). (Details of these calculations are omitted here and can be found in the references.) Thus, IN is deemed to be "significant" in this instance. We again underscore that for linear models IN = 0, since then the expectation surface is planar (or a straight line when p = 1).
The expectation surface given in the right panel of Fig. 7 is also for these data but corresponds to use of the reciprocal model function parameterization, η2(x, γ) = e^(−x/γ), instead of the original model function parameterization, η(x, θ) = e^(−θx). For η2(x, γ), the intrinsic curvature is again equal to IN = 1.44; this follows from the invariance of IN to the one-to-one reparameterization γ = 1/θ.
The data point (y1, y2) = (0.93, 0.025) is also displayed in both panels of Fig. 7. The least-squares estimate points, θ̂ = 0.5013 in the left panel and γ̂ = 1.9948 in the right panel, correspond to the point on E nearest to the data point. The second curvature measure, called parameter-effects (PE) curvature, assesses the extent to which the spacing of the parameter points on E near the LSE point is "regular" (i.e., equi-spaced in this one-parameter case). The spacing in the left panel is distorted: although the numerical distance between 1 and 0.5013 is about the same as that between 0.5013 and 0 (both are about 0.5), the distances between the respective labelled points in the left panel of Fig. 7 are quite different. That is, the arc length of the segment of E from θ = 1 to θ = 0.50 is much less than the arc length of the segment of E from θ = 0.50 to θ = 0. Since the PE measure evaluates aspects of spacing, it is not surprising, then, that the left panel's PE value, PE = 2.43, greatly exceeds the suggested threshold cut-off value of 0.30. The PE curvature situation for the right panel of Fig. 7 in the neighborhood of γ̂ = 1.99 is not as problematic since the spacing near γ̂ appears more regular. It is not surprising, then, that with the value PE = 1.03, although still above the 0.30 cut-off, the right panel's parameter-effects curvature measure is less than half that of the left panel. Clearly, for this model and data, both the IN and PE curvature measures are high, and so the Wald approximation will be poor, as was noted above. In practice, these curvature values can be reduced by increasing the sample size. We next delve deeper into understanding likelihood and Wald confidence interval methods and any discrepancies between the two.
Using the given data and the original model function, η(x, θ) = e^(−θx), Fig. 8 enables the visualization of discrepancies between Wald and likelihood confidence intervals. The 80% confidence circle centered at the data point (labelled point Y) is obtained from Eq. (5). Instead of plotting this equation in the one-dimensional parameter space, it is viewed here in the n = 2-dimensional sample space, and so it is the (circular) set

{η : ‖y − η‖² ≤ S(θ̂) + s² F(α; 1, n − p)} (38)

As noted above, the intrinsic curvature measure (IN) assesses the discrepancy between the actual expectation surface E and the tangent-line approximation. In this case, this discrepancy is pronounced and is reflected in the fact that the observed IN = 1.44 exceeds the 0.30 cut-off. The parameter-effects curvature measure (PE), on the other hand, measures the difference between the actual spacing of the θ values on E (see the left panel of Fig. 7) and the filled-square, regularly-spaced θ values superimposed on the tangent line in Fig. 8. This difference is also pronounced and is reflected in the calculated value PE = 2.43, which also exceeds 0.30. (In models with more than one model function parameter, PE assesses the degree to which θ values on E near θ̂ behave like a regular grid.) Figure 8 shows that the intersection of the 80% confidence circle given in Eq. (38) with the expectation surface E occurs at point A (for which η1 = 0.94 and η2 = 0.63) and point B (for which η1 = 0.32 and η2 = 0.0001). Using the model function η = e^(−θx) and solving for the corresponding values of θ, these points give the 80% LBCI, (0.12, 2.25). The intersection of the 80% confidence circle with the tangent-line approximation occurs at point C (for which η1 = 1.12 and η2 = 0.60) and point D (for which η1 = 0.44 and η2 = −0.33). Using the linear approximation to the model function, η(x, θ) ≈ e^(−θ̂x) − xe^(−θ̂x)(θ − θ̂), and solving for θ, these points give the 80% WCI, (−0.37, 1.37). This demonstrates how Wald and likelihood confidence intervals are obtained for nonlinear models and how and why they differ.
This simple one-parameter, small-sample (n = 2) illustration demonstrates that Wald confidence intervals rest on two approximations, which may or may not be adequate for a given nonlinear model and dataset. The first approximation replaces the actual expectation surface with its tangent-line (or tangent plane or hyperplane) approximation. The second approximation replaces the actual spacing of the parameter values on the expectation surface near the parameter estimate with a regular grid, and uses this approximate regular grid to measure distances. For the reciprocal parameterization used above for this model (with expectation surface in the right panel of Fig. 7), the spacing of the parameter values on the expectation surface in the vicinity of γ̂ is more regular, and so the PE curvature is lower. Since commonly-used statistical software typically does not indicate when these approximations hold or fail, practitioners are wise to bear them in mind when using (approximate) Wald confidence intervals and p-values.
For the laetisaric illustration in Example 2 (Sect. 2) and the Fieller-Creasy illustration in Example 5 (Sect. 4), the intrinsic curvature measures are exactly zero. This follows since the chosen model functions are transformations of linear models, the IN measure is invariant to reparameterization of the model function, and IN is zero for linear models. For example, Example 2's model function, η(x, θ) = α(1 − x/(2θ)), is a one-to-one transformation of the usual linear regression model function, η(x, θ) = α + βx, using the non-singular transformation from (α, β) to (α, θ) with θ = −α/(2β). This means that differences between PLCIs and WCIs for these examples result only from parameter-effects curvature, and that PLCIs are exact for these illustrations.
In addition to the two overarching reasons given in Sect. 3.4 for preferring likelihood intervals over Wald intervals, a third reason relates to model reparameterization: likelihood intervals are invariant to reparameterizations (even nonlinear ones), but Wald intervals are not. To illustrate, recall that the model function used in Example 3 is η1(x, θ) = x/(θ + x), where the model parameter θ is the EC50. The LSE is θ̂ = 5.8698, and the likelihood and Wald confidence intervals for θ are (2.9960, 12.7122) and (1.8106, 9.9291), respectively. To assess the overlap of these intervals, note that their intersection is the interval (2.9960, 9.9291) and their union is the interval (1.8106, 12.7122); the ratio of the length of the intersection to the length of the union gives an overlap assessment of 63.60%. For a reparameterized version of this model function, the corresponding overlap assessment is 74.50%; since this value exceeds the 63.60% overlap assessment for the original parameterization, there is more agreement between the LBCI and WCI under the reparameterization. Thus, in general and also in multidimensional (p > 1) situations, likelihood intervals are invariant to one-to-one parameter transformations, and some reparameterizations yield closer agreement between likelihood and Wald intervals than others.
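The overlap measure used here (length of the intersection divided by the length of the union) is straightforward to compute for the Example 3 intervals:

```python
# Overlap between the Example 3 likelihood interval and Wald interval,
# measured as (length of intersection) / (length of union).
lbci = (2.9960, 12.7122)
wci = (1.8106, 9.9291)

inter = max(0.0, min(lbci[1], wci[1]) - max(lbci[0], wci[0]))
union = max(lbci[1], wci[1]) - min(lbci[0], wci[0])
overlap = inter / union
print(f"{overlap:.2%}")   # about 63.6%
```

Because likelihood intervals transform exactly under one-to-one reparameterizations while Wald intervals do not, this overlap statistic changes with the parameterization even though the likelihood interval's content does not.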

Appendix A.2. Distinguishing Two Likelihood Approaches and Penalizing for Overfitting
An internet search of the term "profile likelihood confidence interval" yields several references to asymptotic likelihood tests and intervals (Royston 2007; Stryhn and Christensen 2003; Venzon and Moolgavkar 1988). The intentional focus of our work has been solely on model function parameters in homoscedastic normal nonlinear models, which use the F-based (exact or nearly-exact) likelihood expressions. Along these lines, we next distinguish the two different likelihood approaches and discuss penalizing for overfitting related to profiled-out model function parameters.
In contrast with the F-statistic likelihood approach used in this paper, the approximate (or asymptotic) likelihood-based test of H0: θ = θ0 is based on twice the change in the log-likelihood (denoted LL),

2[LL(θ̂_MLE) − LL(θ0)] (39)

(Recall that the log-likelihood for the normal distribution is given in Eq. (4).) In Eq. (39), θ̂_MLE maximizes LL(θ). In very general situations, this asymptotic (i.e., valid for large sample sizes) test statistic approximately follows the chi-square distribution with p degrees of freedom. In a manner similar to Eq. (16), this test statistic can be "inverted" to obtain approximate likelihood intervals (see Eq. (40) below). The asymptotic likelihood approach is commonly used in generalized linear, survival, and longitudinal modelling. For the homoscedastic normal cases considered here, these large-sample likelihood intervals and tests generally differ from the F-statistic likelihood methods, and the F-statistic likelihood methods are preferred for the reasons given later in this section (and also due to increased power).
When comparing the F-based likelihood approach used in this work with the approximate likelihood approach of Eq. (39), in addition to better coverage, another important reason for preferring the F-based approach is that it levies a penalty for overfitting, which we demonstrate as follows. Using model parameter profiling and inverting Eq. (39), the approximate profile likelihood confidence interval (APLCI) for the key parameter θ2 is the set of θ2 values for which

2[LL(θ̂) − LL(θ̃(θ2))] ≤ χ²(α, 1) (40)

Here, χ²(α, 1) is the (1 − α)100% quantile of the chi-square distribution with one degree of freedom and is such that P(χ² > χ²(α, 1)) = α. On the other hand, as stated in the main text and repeated here, the (exact or nearly exact) F-statistic-based profile likelihood confidence interval (PLCI) is the set of θ2 values which solve the equation

S(θ̃(θ2)) = S(θ̂) + s² t²(α/2, n−p) (41)

Since the APLCI approach is based on the chi-square quantile with one degree of freedom, it treats the single-parameter situation and the multi-parameter situation the same. That is, in the multiparameter case, the approximate approach profiles out the nuisance parameter(s) and treats the resulting profile equation as a one-parameter likelihood equation, so it levies no penalty for estimating the profiled parameters. The preferred PLCI approach of Eq. (41), on the other hand, is based on the t distribution with (n − p) degrees of freedom, which penalizes for the estimation of the (p − 1) other (i.e., nuisance) parameters. Indeed, this PLCI statistic calibrates the result since, for fixed n and α, the term s² t²(α/2, n−p) increases with the number of parameters, p. This means that, with a larger total number of nonlinear model parameters to estimate, the cut-line of the profile function will be higher, and the resulting profile confidence interval will be wider (i.e., more conservative), thereby reflecting the penalty for estimating a larger number of parameters.
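The overfitting penalty can be seen by comparing the cut-off values directly: the chi-square cut-off used by the APLCI ignores p entirely, while the t-based term t²(α/2, n−p) grows as p grows for fixed n (n = 11 below matches the Fieller-Creasy example; the range of p is illustrative):

```python
# Compare cut-offs: the APLCI uses chi2(alpha, 1) regardless of p, while the
# t-based term in the F/PLCI cut-line grows with the number of estimated
# parameters p (the df = n - p shrinks).  n = 11 matches the Fieller-Creasy
# example; the range of p is illustrative.
from scipy.stats import chi2, t

alpha, n = 0.05, 11
chisq_cut = chi2.ppf(1 - alpha, df=1)      # about 3.84, constant in p
t_cuts = [t.ppf(1 - alpha / 2, df=n - p) ** 2 for p in range(1, 6)]
print(chisq_cut, t_cuts)
```

Each additional profiled-out parameter raises the t-based term, widening the PLCI, whereas the chi-square cut stays fixed; this is the calibration-for-overfitting point made above.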
The interested reader can visualize these results by examining Fig. 2. Other works (Fears et al. 1996; Pawitan 2000) have highlighted Wald confidence interval anomalies; however, their examples have been less clear-cut and are based on the approximate likelihood test. Focusing on variance component estimation in the one-way random-effects analysis of variance (ANOVA) model, these works underscore the inadequacy of the Wald method using both simulation and this approximate or large-sample likelihood approach. Estimation of the variance component in this one-way random-effects ANOVA case is problematic since the null hypothesis value (i.e., the zero in the expression H0: σ0² = 0) is on the boundary of the parameter space, and so the likelihood ratio statistic has a mixture distribution (Chernoff 1954; Molenberghs and Verbeke 2007). As such, although the random-effects ANOVA illustration does underscore Wald inadequacies, it can easily confuse practitioners, who may confound boundary issues with Wald statistic caveats and so may not realize the far-reaching nature of these Wald inadequacies. Our chosen nonlinear regression examples, on the other hand, underscore some nuances associated with nonlinear modelling and clearly show the extent of the preference for F-based likelihood methods over Wald methods.
In sum, with strong evidence in favor of likelihood methods over Wald methods (and of the F-statistic-based likelihood methods considered here over the APLCI likelihood approach of Eq. (40)), it is surprising that some software packages still report Wald (and sometimes approximate likelihood) p-values and intervals for homoskedastic normal nonlinear models.

For these data, the 95% likelihood confidence region (solid region) and the 95% Wald confidence region (dashed ellipse) are plotted in the left panel of Fig. 9. The plotted central point in the figure is the least-squares estimator, θ̂^T = (4, 2). These regions are 'joint' or 'simultaneous' confidence regions for the parameter vector θ^T = (θ1, θ2), and, in a manner similar to the confidence interval case, simulation results (Donaldson and Schnabel 1987; Seber and Wild 1989) highlight the superiority of likelihood regions over Wald regions in the sense of better agreement between nominal and actual coverage probabilities (e.g., 95%).
Another way to obtain the Wald approximate region in Eq. (46) is to observe that, for this model function, the first-order (planar) Taylor approximation of η(x, θ) about the point (θ̂1, θ̂2) is η(x, θ) ≈ η(x, θ̂) + v1(x)(θ1 − θ̂1) + v2(x)(θ2 − θ̂2), where vj(x) denotes the partial derivative ∂η(x, θ)/∂θj evaluated at θ̂. Substituting this approximation into the sum of squares expression in Eqs. (5) and (45) gives the Wald confidence region expression given in Eq. (46).
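Written out, and assuming the standard forms of the sum-of-squares and region expressions (Eqs. (5), (45), and (46) are not reproduced in this excerpt), the substitution step is:

```latex
S(\theta)=\sum_{i=1}^{n}\bigl(y_i-\eta(x_i,\theta)\bigr)^2
\approx S(\hat\theta)+(\theta-\hat\theta)^{T}V^{T}V(\theta-\hat\theta),
```

where V is the n × 2 matrix with entries vj(xi); the cross terms vanish because the residuals are orthogonal to the columns of V at the least-squares estimate (the normal equations). Setting this quadratic approximation equal to the F-based cutoff yields the elliptical region {θ : (θ − θ̂)ᵀVᵀV(θ − θ̂) ≤ 2s²F(α; 2, n − 2)}, which is the usual Wald (tangent-plane) region.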
To obtain the PLCI for θ2, we profile out θ1 by setting the corresponding normal equation to zero with θ2 taken as fixed; when this expression is solved for θ1, we obtain the constrained optimum value. This subtle but crucial change to Eq. (50), whereby θ2² in the denominator is replaced by the estimated value θ̂2², is the cause of the difference between WCIs and PLCIs for this model. These details supply the justification for the WCI and PLCI expressions given in Eqs. (30) and (31).
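The profiling step can be checked numerically. The sketch below assumes the standard Fieller-Creasy parameterization in which group 1 has mean θ1θ2 and group 2 has mean θ1 (consistent with the θ2² denominator noted above); the data are hypothetical, not the paper's:

```python
# Sketch of profiling out theta_1 in the Fieller-Creasy model:
# group 1 has mean theta_1*theta_2, group 2 has mean theta_1.
# Hypothetical data for illustration only.
import numpy as np
from scipy.optimize import minimize_scalar

y1 = np.array([7.0, 9.0])   # group 1 (mean theta_1 * theta_2)
y2 = np.array([3.0, 5.0])   # group 2 (mean theta_1)

def sse(theta1, theta2):
    return np.sum((y1 - theta1 * theta2) ** 2) + np.sum((y2 - theta1) ** 2)

def theta1_profiled(theta2):
    # Closed form from setting dSSE/dtheta_1 = 0 with theta_2 held fixed;
    # note that theta_2**2 (not its estimate) appears in the denominator.
    n1, n2 = y1.size, y2.size
    return (theta2 * y1.sum() + y2.sum()) / (n1 * theta2 ** 2 + n2)

theta2_fixed = 2.0
numeric = minimize_scalar(lambda t1: sse(t1, theta2_fixed)).x
print(theta1_profiled(theta2_fixed), numeric)  # both equal 4.0 here
```

Replacing θ2² in that denominator by the fixed estimate θ̂2² is exactly the simplification that turns the profile interval into the Wald interval.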
For the given data, these shifted profile expressions are plotted in the right-hand panel of Fig. 9. The profile log-likelihood is the solid curve and the Wald approximation is the dashed parabola. The minimum-value point occurs at (θ̂2, 0). The cut-lines in Fig. 9's right-hand panel occur at s²t²(α/2, n − 2) = (20/9)t²(α/2, 9), which equals 11.37 for 95% and 23.47 for 99%. For the given data, the 95% PLCI, (1.30, 3.94), is spanned by points A95 and B95 in the right panel of Fig. 9, and the 95% WCI, (0.98, 3.02), by points C95 and D95. Thus, for these data, the Wald interval contains the value one and so would retain the hypothesis of equal means (H0: θ2 = 1). But since the PLCI does not contain one, the equal-means hypothesis would be rejected. Further, the 99% PLCI, (1.11, 6.71) (points A99 to B99 in the graph), represents a large rightward shift from the 99% WCI, (0.54, 3.46) (points C99 to D99 in the graph). Shifts in the centers of PLCIs (versus WCIs), as well as lengthening or shortening of these intervals, are related to skewness and curvature measures (Clarke 1987; Haines et al. 2004; Ratkowsky 1983).
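The quoted cut-line values are easy to reproduce. A short check, using the s² = 20/9 and n − 2 = 9 values given above:

```python
# Reproduce the cut-line values quoted in the text:
# s^2 * t^2(alpha/2, n-2) with s^2 = 20/9 and 9 degrees of freedom.
from scipy.stats import t

s2, df = 20 / 9, 9
cut95 = s2 * t.ppf(0.975, df) ** 2   # alpha = 0.05
cut99 = s2 * t.ppf(0.995, df) ** 2   # alpha = 0.01
print(round(cut95, 2), round(cut99, 2))  # 11.37 23.47
```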
It is worth underscoring the following regarding model reparameterization here. Again with x = 1 for group 1 observations and x = 0 for group 2 observations, we now rewrite (or reparameterize) the Fieller-Creasy model function in Eq. (29) as

Fig. 1 Two motivating example plots. Left panel: plot of simulated data (filled circles), fitted two-parameter Michaelis-Menten model function and estimated EC50 point (large, filled triangle). Right panel: plot of fungal growth data (filled circles), fitted line and estimated IC50 point (large, filled triangle)

Fig. 3 Two motivating example plots with confidence intervals. Left panel: plot of simulated data (small, filled circles), fitted two-parameter Michaelis-Menten model function, estimated EC50 point (large, filled triangle), 95% Wald confidence interval (WCI) (short-dashed line segment between large, filled squares) and 95% profile likelihood confidence interval (PLCI) (long-dashed line segment between large, filled circles). Right panel: plot of fungal growth data (small, filled circles), fitted line, estimated IC50 point (large, filled triangle), 95% Wald confidence interval (WCI) (short-dashed line segment between large, filled squares) and 95% profile likelihood confidence interval (PLCI) (long-dashed line segment between large, filled circles). The confidence intervals are spanned by the horizontal lines at the bottom of the panels

Fig. 4 Simple Spline Fit. Corn yield versus nitrogen fertilizer data (six filled circles), fitted linear-plateau segmented curve, and estimated knot or join-point (filled square)

Fig. 5 Treated and Untreated Enzyme Kinetic Model Fits. Average enzyme velocity versus concentration for antibiotic-treated data (small, filled circles) with fitted common-upper-asymptote Michaelis-Menten curve (dashed curve) and untreated data (small, filled triangles) with fitted common-upper-asymptote Michaelis-Menten curve (solid curve). Also shown are estimated EC50 points: treated (larger, filled circle) and untreated (larger, filled triangle)

Fig. 6 Wild Oat Dry Weight Dose Response Fits. Left panel: original dry weight yield data plotted versus herbicide dose with heteroskedastic (variance-function-modelled) LL4 model fit (solid curve). Right panel: (natural) log-transformed dry weight yield data plotted versus herbicide dose with homoskedastic LL4 model fit (solid curve)

Fig. 7 Illustration of Intrinsic (IN) and Parameter Effects (PE) Curvatures for Simple Exponential Example. Plots of the expectation surfaces (E) for the one-parameter simple exponential model function/design for the original (θ) parameterization (left panel) and the reciprocal (γ = 1/θ) parameterization (right panel). The data point is labelled "y (data)". The selected points on each expectation surface are indexed by the indicated parameter values

Fig. 8 Distinguishing Wald and Likelihood Confidence Intervals. Plots of the expectation surface for the (original parameterization) simple exponential model and chosen design (curved surface E, the same as given in the left panel of Fig. 7), data point (labelled Y), and point on E corresponding to the least-squares estimate θ̂ = 0.5013 (labelled E). Also plotted are the tangent line approximation to E at E (the dot-dashed line segment) connecting points C and D, and the 80% confidence circle centered at Y. The points on E labelled A and B correspond to the likelihood confidence interval endpoints. The points on the tangent line approximation corresponding to the Wald confidence interval endpoints are labelled C and D. A regular-grid approximation on the tangent line approximation is indicated using filled squares

Fig. 9 Confidence Regions and Intervals for the Fieller-Creasy Ratio of Means. Left panel: 95% joint confidence regions for Fieller-Creasy data: likelihood-based region (solid region) and Wald approximate region (dashed ellipse). Right panel: profile likelihood sum-of-squares (SSE) curve for the ratio-of-means parameter (solid curve), Wald approximation SSE curve (dashed parabola), 95% and 99% horizontal labelled cut-lines, and confidence intervals: the 95% likelihood interval is (A95, B95) and the 95% Wald interval is (C95, D95), corresponding to the lower circles; the 99% likelihood interval is (A99, B99) and the 99% Wald interval is (C99, D99), corresponding to the upper circles

Note that the partial derivatives involve the indicator variable DU, with DU = 1 for samples in the untreated group and DU = 0 for samples in the treated group. With θᵀ = (θ1T, θ1U, θ2T, θ2U), this model function expression equals η(x, θ) = θ1T x/(θ2T + x) for samples in the treated group and η(x, θ) = θ1U x/(θ2U + x) for samples in the untreated group, and so, as before, the θ1 and θ2 parameters are the respective upper-asymptote and EC50 parameters. The claim of equal EC50 parameters, H0: θ2T = θ2U, is soundly rejected with F = 25.76, p < 0.0001, but the claim of equal upper asymptotes, H0: θ1T = θ1U, p = 0.89, is retained. (These results can be verified by running the R code in the Supplementary Information.) These latter intervals have respective lengths 6.9331 and 10.9016, and an assessment of overlap of the LBCI and WCI (Haines et al. 2004) is 6.9331/10.9016 = 63.60%. This shows a fair amount of difference between these two intervals. A simple modification of the original model function is η2(x, φ) = φx/(1 + φx), so that this new model function's parameter (φ) is the reciprocal EC50, since when φ = 1/θ is substituted here we obtain the original model function. When this new modified model function is fit to the data, we obtain φ̂ = 0.1704 (which equals 1/θ̂), and the respective likelihood and Wald confidence intervals for φ are (0.07867, 0.3338) and (0.05255, 0.2882). Notice first that (except for roundoff error) the reciprocal of the η2(x, φ) LBCI endpoints,