Automatic robust Box–Cox and extended Yeo–Johnson transformations in regression

The paper introduces an automatic procedure for the parametric transformation of the response in regression models to approximate normality. We consider the Box–Cox transformation and its generalization to the extended Yeo–Johnson transformation which allows for both positive and negative responses. A simulation study illuminates the superior comparative properties of our automatic procedure for the Box–Cox transformation. The usefulness of our procedure is demonstrated on four sets of data, two including negative observations. An important theoretical development is an extension of the Bayesian Information Criterion (BIC) to the comparison of models following the deletion of observations, the number deleted here depending on the transformation parameter.


Introduction
In the final sentence of Hinkley (1975), David Hinkley wrote 'Data transformation in the presence of outliers is a risky business'. The risk arises particularly in robust fitting of transformations: if a symmetrical model is fitted to a skewed distribution, observations in the longer tail may falsely appear to be outliers. We compare our automatic procedure with that of Marazzi et al. (2009). For simple regression, the two procedures provide estimates with similar bias, but the computation time for that of Marazzi et al. increases rapidly with the sample size. For multiple regression our procedure has the lower bias.
Section 7 introduces the extended Yeo-Johnson transformation of Atkinson et al. (2020) which develops the transformation of Yeo and Johnson (2000) to allow differing values of the transformation parameter for positive and negative observations. The subject of Sect. 8 is the adaptation of the automatic procedure to the extended Yeo-Johnson transformation. Because there are two score statistics, for the transformation parameters for positive and negative observations, a more complicated form of the agreement index is required. Examples of the use of this automatic procedure are in Sect. 9. In the larger example we analyse 1405 observations on the profitability of firms, 407 of which make a loss. In the concluding Sect. 10 we discuss potential further extensions to the Yeo-Johnson transformation and relate our results to the remark of Hinkley quoted at the beginning of this paper.
The Appendix contains algebraic details of the normalized version of the extended Yeo-Johnson transformation and of the constructed variables used in our automatic procedure. Links to our programs for automatically determining transformations are at the end of Sect. 10.

Aggregate statistics
The Box-Cox transformation for non-negative responses is a function of the parameter $\lambda$. The transformed response is
$$y(\lambda) = \begin{cases} (y^\lambda - 1)/\lambda & (\lambda \neq 0)\\ \log y & (\lambda = 0), \end{cases} \qquad (1)$$
where $y$ is $n \times 1$. In (1) $\lambda = 1$ corresponds to no transformation, $\lambda = 1/2$ to the square root transformation, $\lambda = 0$ to the logarithmic transformation and $\lambda = -1$ to the reciprocal transformation; the subtraction of one and division by $\lambda$ avoid a discontinuity at $\lambda = 0$. The development in Box and Cox (1964) is for the normal theory linear model
$$y(\lambda) = X\beta + \epsilon, \qquad (2)$$
where $X$ is $n \times p$, $\beta$ is a $p \times 1$ vector of unknown parameters and the variance of the independent errors $\epsilon_i\ (i = 1, \dots, n)$ is $\sigma^2$. For given $\lambda$ the parameters can be estimated by least squares. To estimate $\lambda$ it is necessary to allow for the change of scale of $y(\lambda)$ with $\lambda$. The likelihood of the transformed observations relative to the original observations $y$ is
$$(2\pi\sigma^2)^{-n/2} \exp\{-S(\lambda)/2\sigma^2\}\, J^n_{BC}, \qquad (3)$$
where $S(\lambda) = \{y(\lambda) - X\beta\}^T\{y(\lambda) - X\beta\}$ and the Jacobian is $J^n_{BC} = \prod_{i=1}^n \partial y_i(\lambda)/\partial y_i$. For the power transformation (1), $\partial y_i(\lambda)/\partial y_i = y_i^{\lambda-1}$, so that
$$\log J^n_{BC} = (\lambda - 1)\sum \log y_i = n(\lambda - 1) \log \dot{y}, \qquad (4)$$
where $\dot{y}$ is the geometric mean of the observations. Box and Cox (1964) show that a simple form for the likelihood is found by working with the normalized transformation
$$z(\lambda) = y(\lambda)/\dot{y}^{\lambda - 1}. \qquad (5)$$
If an additive constant is ignored, the profile loglikelihood, maximized over $\beta$ and $\sigma^2$, is
$$L_{\max}(\lambda) = -(n/2)\log\{R(\lambda)/n\}, \qquad (6)$$
with $R(\lambda)$ the residual sum of squares of $z(\lambda)$. Thus $\hat{\lambda}$ minimizes $R(\lambda)$.
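As a concrete sketch of these formulas, the transformation (1), the normalized form (5) and the residual sum of squares $R(\lambda)$ can be coded as follows in Python (the paper's own software is in Matlab; the function names here are ours):

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform (1): (y**lam - 1)/lam for lam != 0, log y for lam == 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def boxcox_z(y, lam):
    """Normalized transform (5): z(lam) = y(lam) / ydot**(lam - 1),
    where ydot is the geometric mean of y, so that residual sums of
    squares are comparable across values of lam."""
    y = np.asarray(y, dtype=float)
    ydot = np.exp(np.mean(np.log(y)))          # geometric mean
    return boxcox(y, lam) / ydot**(lam - 1.0)

def rss(y, X, lam):
    """Residual sum of squares R(lam) of z(lam) regressed on X."""
    z = boxcox_z(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    r = z - X @ beta
    return float(r @ r)
```

The maximum likelihood estimate $\hat{\lambda}$ then minimizes `rss` over a grid of values of $\lambda$, since by (6) the profile loglikelihood is a decreasing function of $R(\lambda)$.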
For inference about plausible values of the transformation parameter $\lambda$, Box and Cox suggest the likelihood ratio test of $\lambda = \lambda_0$ based on (6), that is the statistic
$$T_{LR} = 2\{L_{\max}(\hat{\lambda}) - L_{\max}(\lambda_0)\} = n \log\{R(\lambda_0)/R(\hat{\lambda})\}. \qquad (7)$$
Although Box and Cox (1964) find the estimate $\hat{\lambda}$ that maximizes the profile loglikelihood, they select the estimate from those values of $\lambda$ in the grid $G$ that lie within the confidence interval generated by (7). In their examples, where $n = 27$ and 48, $G = [-1, -0.5, 0, 0.5, 1]$. Carroll (1982) argues that the grid needs to become denser as $n$ increases; an example is in Sect. 5.2. This formulation has led to some controversy in the statistical literature. Bickel and Doksum (1981) and Chen et al. (2002) ignore the suggested procedure. The variability of the estimated parameters in the linear model can be greatly increased when $\lambda$ is estimated by $\hat{\lambda}$, particularly for regression models with response $y(\lambda)$. McCullagh (2002) is very clear that the Box-Cox procedure does not lead to $\hat{\lambda}$. Box and Cox (1982), Hinkley and Runger (1984), Cox and Reid (1987) and Proietti and Riani (2009) provide further discussion.
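The grid-based likelihood ratio test (7) is equally short to sketch; here $\hat{\lambda}$ is taken as the minimizer of $R(\lambda)$ over the grid $G$, and the helper functions repeat the Box-Cox formulas above (Python, our own function names):

```python
import numpy as np

def boxcox_z(y, lam):
    """Normalized Box-Cox transform z(lam) of Eq. (5)."""
    y = np.asarray(y, dtype=float)
    ydot = np.exp(np.mean(np.log(y)))
    bc = np.log(y) if lam == 0 else (y**lam - 1.0) / lam
    return bc / ydot**(lam - 1.0)

def rss(y, X, lam):
    """Residual sum of squares R(lam) of z(lam) regressed on X."""
    z = boxcox_z(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    r = z - X @ beta
    return float(r @ r)

def lr_test(y, X, lam0, grid=(-1, -0.5, 0, 0.5, 1)):
    """Likelihood ratio statistic (7) for H0: lam = lam0, with lam_hat
    approximated by the minimizer of R(lam) over the grid G."""
    n = len(y)
    r_hat = min(rss(y, X, lam) for lam in grid)
    return n * np.log(rss(y, X, lam0) / r_hat)
```

Values of $\lambda_0 \in G$ whose statistic falls below the $\chi^2_1$ critical value form the confidence set from which the interpretable estimate $\hat{\lambda}_G$ is chosen.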
The practical procedure indicated by Box and Cox is analysis in terms of $z(\lambda)$, leading to $\hat{\lambda}$ with an associated confidence interval and hence to a, hopefully, physically interpretable estimate $\hat{\lambda}_G$. To avoid dependence on $\dot{y}$ in comparisons across sets of data, parameter estimates need to be calculated using $y(\hat{\lambda}_G)$ rather than $z(\hat{\lambda}_G)$.
In the later sections of this paper we are concerned with power transformations of responses that may be both positive and negative. One procedure, analysed by Box and Cox, is the shifted power transformation, in which the transformation is of $(y + \mu)^\lambda$, with $\mu$ such that $y + \mu > 0$. If $\mu$ is a known offset, such as 273.15 in converting temperatures from Celsius to Kelvin, the transformation is unproblematic. But if $\mu$ is a parameter to be estimated from the data, inferential difficulties arise because the range of the observations depends on $\mu$ (Atkinson et al. 1991). In addition, the shifted power transformation is not sufficiently flexible to model data such as the examples presented in Sect. 9.

Constructed variables
The robust transformation of regression data is complicated by the inter-relationship of outliers and the value of $\lambda$. Examples are in Atkinson and Riani (2000, Chapter 4). To determine the effect of individual observations on the estimate of $\lambda$ we use the approximate score statistic $T_A(\lambda)$ (Atkinson 1973), derived by Taylor series expansion of $z(\lambda)$ (5) about $\lambda_0$. For a general one-parameter transformation this leads to the approximate regression model
$$z(\lambda_0) = X\beta + \gamma\, w(\lambda_0) + \epsilon,$$
where $\gamma = -(\lambda - \lambda_0)$ and the constructed variable $w(\lambda_0) = \partial z(\lambda)/\partial\lambda \,|_{\lambda = \lambda_0}$, which only requires calculations at the hypothesized value $\lambda_0$. The approximate score statistic for testing the transformation is the $t$ statistic for regression on $-w(\lambda_0)$, that is the test of $\gamma = 0$ in the presence of all components of $\beta$. Because $T_A(\lambda_0)$ is the $t$ test for regression on $-w(\lambda_0)$, large positive values of the statistic mean that $\lambda_0$ is too low and that a higher value should be considered.
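A minimal numerical version of this score test can be sketched as follows; the constructed variable is obtained here by a central finite difference rather than the analytic derivative used in the paper, and the function names are ours:

```python
import numpy as np

def boxcox_z(y, lam):
    """Normalized Box-Cox transform z(lam) of Eq. (5)."""
    y = np.asarray(y, dtype=float)
    ydot = np.exp(np.mean(np.log(y)))
    bc = np.log(y) if lam == 0 else (y**lam - 1.0) / lam
    return bc / ydot**(lam - 1.0)

def constructed_variable(y, lam0, eps=1e-5):
    """w(lam0) = dz(lam)/dlam at lam0, approximated numerically."""
    return (boxcox_z(y, lam0 + eps) - boxcox_z(y, lam0 - eps)) / (2 * eps)

def score_test(y, X, lam0):
    """Approximate score statistic T_A(lam0): the t statistic for the
    coefficient of -w(lam0) added to the regression of z(lam0) on X."""
    z = boxcox_z(y, lam0)
    Xa = np.column_stack([X, -constructed_variable(y, lam0)])
    beta, *_ = np.linalg.lstsq(Xa, z, rcond=None)
    resid = z - Xa @ beta
    n, p = Xa.shape
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa)[-1, -1])
    return beta[-1] / se
```

For data that are well described after the log transformation, the statistic at $\lambda_0 = 0$ should be small in absolute value, while that at $\lambda_0 = 1$ should be large.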

The fan plot
We use the forward search to provide a robust plot of the approximate score statistic $T_A(\lambda)$ over a grid $G$ of values of $\lambda$. For each $\lambda$ we start with a fit to $m_0 = p + 1$ observations, robustly chosen as the subset providing the Least Median of Squares estimates of the parameters of the linear regression model, allowing for the additional presence of the parameter $\lambda$ (Atkinson and Riani 2000, p. 31). We then successively fit to larger subsets. For the subset of size $m$ we order all observations by closeness to the fitted model, with closeness determined by the residuals. The subset size is increased by one, to consist of the observations with the $m + 1$ smallest squared residuals, and the model is refitted to this new subset. Observations may both leave and join the subset in going from size $m$ to size $m + 1$. The process continues with increasing subset sizes until, finally, all the data are fitted; it thus moves from a very robust fit to non-robust least squares. Any outliers will enter the subset towards the end of the search. We obtain a series of fits of the model to subsets of the data of size $m$, $m_0 \le m \le n$, for each of which we refit the model and calculate the value of the score statistic for selected values of $\lambda_0 \in G$. These are then plotted against the number of observations $m$ used for estimation to give the ''fan plot''. The ordering of the observations in a fan plot, which reflects the presence of outliers, may depend on the value of $\lambda_0$. Because we are calculating score statistics we avoid the estimation of $\lambda$ and of its confidence interval, which is required in the original Box-Cox procedure.
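The forward-search loop itself is simple to sketch. The version below starts from the observations best fitted by an ordinary least squares fit rather than from the Least Median of Squares subset used in the paper, so it illustrates only the ordering mechanism, not the robust initialization:

```python
import numpy as np

def forward_search(y, X, m0=None):
    """Minimal forward-search sketch.  At each step the model is refitted
    to the current subset, and the subset grows to the m+1 observations
    with smallest squared residuals from that fit (units may both leave
    and join).  Returns the subsets of sizes m0, ..., n."""
    n, p = X.shape
    m0 = p + 1 if m0 is None else m0
    # Non-robust stand-in for the LMS start: smallest full-fit residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    subset = np.argsort((y - X @ beta) ** 2)[:m0]
    subsets = []
    for m in range(m0, n + 1):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        subsets.append(subset)
        if m < n:
            subset = np.argsort((y - X @ beta) ** 2)[: m + 1]
    return subsets
```

Plotting a score statistic computed at each step of this loop, for each $\lambda_0 \in G$, gives the trajectories of the fan plot; a gross outlier enters the subset only at the final steps.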

Hospital data 1
Neter et al. (1996, pp. 334, 438) analyse 108 observations on the times of survival of patients who had a particular kind of liver surgery. There are four explanatory variables, with the response taken as the logarithm of survival time. Checking whether this is the best transformation in the Box-Cox family requires regression including the constructed variable, which for the null value $\lambda_0$ of $\lambda$ gives the statistic $T_A(\lambda_0)$. The fan plot of Fig. 1 for the five most commonly used values of $\lambda_0$ $(-1, -0.5, 0, 0.5, 1)$ confirms that the logarithmic transformation is suitable for these survival data, as it is for the survival times in the poison data analysed by Box and Cox (1964). The trajectory of the score statistic for $\lambda_0 = 0$ lies within the central pointwise 99% band throughout, whereas those for the other four values of $\lambda_0$ are outside the bands by the end of the search.
Since the constructed variables are functions of the response, the statistics cannot exactly follow the $t$ distribution. Atkinson and Riani (2002) provide some numerical results on the distribution in the fan plot of the score statistic for the Box-Cox transformation.

Inference and outlier deletion
This section introduces the novel extension of BIC for the choice of transformation parameters and supplements it with two further diagnostic tools, the agreement index AGI and an extended form of $R^2$. Section 4 describes the combination of these tools to provide the automatic procedure for the Box-Cox transformation.

BIC and the mean shift outlier model
As our main tool in assessing numerical information from score statistics calculated during the FS (forward search), we use an extended Bayesian Information Criterion (BIC). The original version (Schwarz 1978) can be used to choose the value of $\lambda$ for complete data with $n$ observations. Then, from (6),
$$\mathrm{BIC}(\lambda) = -n \log\{S(\lambda)/n\} + 2 \log J^n_{BC} - (p + n_\lambda)\log(n), \qquad (10)$$
where there are $p$ parameters in the linear model and, for the Box-Cox transformation, the single transformation parameter $\lambda$, so that $n_\lambda = 1$. Written in this form, large values of the index are to be preferred, corresponding to small values of $R(\lambda)$, that is to maximizing the likelihood. Use of the forward search to provide robustness against outliers and incorrect transformations leads to the comparison of fitted models with different numbers of observations. We render outlier deletion compatible with BIC through use of the mean shift outlier model, in which each deleted observation is fitted with an individual parameter and so has a zero residual.
Let the forward search terminate with $h$ observations, so that $n - h$ observations have been deleted. This can be expressed by writing the regression model as
$$y(\lambda) = X\beta + D\phi + \epsilon, \qquad (11)$$
where $D$ is an $n \times (n - h)$ matrix with a single one in each of its columns and in each of $n - h$ of its rows, all other entries being zero. These entries specify the observations that are to have individual parameters or, equivalently, are to be deleted (Cook and Weisberg 1982, p. 21).
To incorporate deletion of observations in BIC (10) for values of $\lambda \in G$, let the value of $S(\lambda)$ when $n - h(\lambda)$ observations are deleted be $S(\lambda, h)$. Then $\mathrm{BIC}(\lambda)$ is replaced by
$$\mathrm{BIC}(\lambda, h) = -n \log\{S(\lambda, h)/h\} + 2 \log J^n_{BC} - \{p + n_\lambda + n - h(\lambda)\}\log(n), \qquad (12)$$
in which $S(\lambda, h)$ is divided by $h$. A full treatment is in Riani et al. (2022).
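A sketch of this criterion: since fitting the mean shift outlier model is algebraically equivalent to deleting the flagged rows, $\mathrm{BIC}(\lambda, h)$ can be computed from a least squares fit to the retained observations. The function name is ours; `z` is the normalized transformed response and `log_jacobian` is $\log J^n_{BC}$:

```python
import numpy as np

def bic_deleted(z, X, deleted, log_jacobian=0.0, n_lam=1):
    """BIC(lam, h) of Eq. (12): S(lam, h) is divided by h, and each of
    the n - h deleted observations carries a mean-shift parameter that
    adds one unit to the penalty term."""
    n, p = X.shape
    keep = np.setdiff1d(np.arange(n), np.asarray(deleted, dtype=int))
    h = len(keep)
    # Equivalent to fitting y = X beta + D phi + eps and zeroing the
    # residuals of the deleted observations:
    beta, *_ = np.linalg.lstsq(X[keep], z[keep], rcond=None)
    r = z[keep] - X[keep] @ beta
    S = float(r @ r)
    return -n * np.log(S / h) + 2 * log_jacobian - (p + n_lam + n - h) * np.log(n)
```

Deleting a gross outlier raises the criterion despite the extra parameter, while deleting a clean observation lowers it, which is the trade-off the penalty is designed to express.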

The Tallis correction
To find the $n - h$ outlying observations we have used the FS, which orders the observations from those closest to the fitted model to those most remote. If there are no outliers and the value of $\lambda$ is correct, the value of $S(\lambda)$ at the end of the search leads to an unbiased estimate of $\sigma^2$. Deletion of the $n - h$ most remote observations yields the residual sum of squares $S(\lambda, h)$ and so to a too-small estimate of $\sigma^2$ based on the central $h$ observations. The variance of the truncated normal distribution containing the central $h/n$ portion of the full distribution is
$$\sigma_T^2(h) = 1 - \frac{2n}{h}\, \Phi^{-1}\Big(\frac{n + h}{2n}\Big)\, \phi\Big\{\Phi^{-1}\Big(\frac{n + h}{2n}\Big)\Big\}, \qquad (13)$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ are respectively the standard normal density and c.d.f. See, for example, Johnson et al. (1994, pp. 156-162). We scale up the value of $S(\lambda, h)$ in (12) to obtain the extended BIC
$$\mathrm{BIC}_{EXT}(\lambda, h) = -n \log[S(\lambda, h)/\{h\,\sigma_T^2(h)\}] + 2 \log J^n_{BC} - \{p + n_\lambda + n - h(\lambda)\}\log(n). \qquad (14)$$
This consistency correction is standard in robust regression (Rousseeuw and Leroy 1987, p. 130). The correction $\sigma_T^2(h)$ in (13) is the one-dimensional case of the general result of Tallis (1963) on elliptical truncation in the multivariate normal distribution. This result is useful in determining correction factors for variance estimation in robust methods for multivariate normal data; an example is Riani and Atkinson (2007). Neykov et al. (2007) use a related form of extended BIC for model choice in clustering when estimation is by trimmed likelihood. Their procedure differs from ours in two important ways:

(i) Because they do not have a regression model, they do not use the mean shift outlier model to add parameters as observations are trimmed. In consequence, their procedure differs from ours in the penalty applied to the trimmed likelihood. In our notation their penalty term is $-p \log h$ whereas, if we ignore the effect of varying $\lambda$, our penalty term is $-(p + n - h)\log n$, a stronger penalty;

(ii) We do not directly use the likelihood estimate of $\sigma^2$ based on the trimmed residual sum of squares $S(\lambda)$, but apply the consistency factor (13) in the estimation.
Similar comments apply to the extended BIC used by Greco and Agostinelli (2020) for weighted likelihood estimation, again in clustering.
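The one-dimensional Tallis correction (13) is easy to compute with the standard normal quantile function; writing $\alpha = h/n$ and $c = \Phi^{-1}\{(1+\alpha)/2\}$, it is $\sigma_T^2 = 1 - 2c\,\phi(c)/\alpha$:

```python
import math
from statistics import NormalDist

def tallis_variance(h, n):
    """sigma_T^2(h) of Eq. (13): the variance of the standard normal
    distribution truncated to its central h/n portion."""
    alpha = h / n
    c = NormalDist().inv_cdf((1 + alpha) / 2)       # Phi^{-1}((n+h)/(2n))
    phi_c = math.exp(-0.5 * c * c) / math.sqrt(2 * math.pi)
    return 1.0 - 2.0 * c * phi_c / alpha
```

The consistency-corrected scale estimate is then $S(\lambda, h)/\{h\,\sigma_T^2(h)\}$; since $\sigma_T^2(h) \to 1$ as $h/n \to 1$, the correction vanishes when no observations are deleted.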

The agreement index AGI
The value of $\mathrm{BIC}_{EXT}(\lambda, h)$ presents the information on the transformation for the subset of size $h(\lambda)$. It can be helpful also to consider the history of the evidence for the transformation as a function of the subset size $m$. A correct transformation leads to identification of the outliers and to a normalizing transformation of the genuine data. The most satisfactory transformation will be one which is stable over the range of $m$ for which no outliers are detected and will lead to small values of $T_A(\lambda_0, m)$, indicating that the value of $\hat{\lambda}$ is consistent with $\lambda_0$ over a set of subset sizes. The trajectory of $T_A(0, m)$ in Fig. 1 is an example. We introduce an empirical diagnostic, the agreement index AGI, which leads to a graphical representation of this idea. The forward-search procedure for testing the transformation monitors absolute values of the statistic from $M$ to $m = h(\lambda)$; the default is to take $M$ as the integer part of $0.6n$. For ease of comparison with $\mathrm{BIC}_{EXT}(\lambda, h)$ we take the reciprocal of the mean of the absolute values of $T_A(\lambda_0, m)$, so that large values are again desired. In order to give more weight to searches with a larger value of $h(\lambda)$, we rescale the value of the index by the variance of the truncated normal distribution. Let $\mathcal{M} = \{m : M \le m \le h(\lambda)\}$. Then the agreement index
$$\mathrm{AGI}(\lambda) = \sigma_T^2\{h(\lambda)\} \Big/ \Big\{ \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} |T_A(\lambda, m)| \Big\} \qquad (15)$$
is calculated for $\lambda \in G$.
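Under our reading of the description above (the exact normalization in the paper's Eq. (15) may differ), the index can be sketched as:

```python
import math
from statistics import NormalDist

def tallis_variance(h, n):
    """Variance of the standard normal truncated to its central h/n portion."""
    alpha = h / n
    c = NormalDist().inv_cdf((1 + alpha) / 2)
    phi_c = math.exp(-0.5 * c * c) / math.sqrt(2 * math.pi)
    return 1.0 - 2.0 * c * phi_c / alpha

def agreement_index(abs_t, h, n):
    """Sketch of AGI: the reciprocal of the mean of |T_A(lam, m)| over
    m in [M, h(lam)], rescaled by the truncated-normal variance so that
    searches retaining more observations receive more weight.
    `abs_t` holds the |T_A| values for m = M, ..., h."""
    return tallis_variance(h, n) / (sum(abs_t) / len(abs_t))
```

Smaller score statistics along the search, or a larger retained subset, both increase the index, matching the two desiderata described in the text.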

The coefficient of determination R 2
The main tools for determining the correct transformation are plots of $h(\lambda)$ and $\mathrm{BIC}_{EXT}$, with the agreement index as a further diagnostic. In addition, we calculate values of a form of $R^2$, extended as $R^2_{EXT}$ in (16) to allow for the size $h(\lambda)$ of the final cleaned sample.

The automatic procedure for the Box-Cox transformation

The automatic procedure for robust selection of the transformation parameter and the identification of outliers requires the use of two functions. Section 5 provides data analyses for the Box-Cox transformation computed with these functions. Links to the functions and to full documentation are at the end of Sect. 10.
1. Fan plot. Function FSRfan.

(a) This Forward Search Regression function takes as input the data and the grid $G$ of $\lambda$ values to be evaluated as transformation parameters.

(b) Each value of $\lambda$ has a separate forward search. The output structure outFSR contains the numerical information needed to build fan plots such as Fig. 1. Optionally, the fan plot can be produced.
2. Estimation of $h(\lambda)$ and $\hat{\lambda}$. Function fanBIC.

(a) Outlier detection. The forward search uses residuals to order the observations in such a way that outliers, if any, enter the subset used in fitting towards the end of the search. In fan plots, such as that of Fig. 1, we use this ordering, but monitor the value of score statistics. In the automatic procedure, for each $\lambda \in G$, the values during the forward search of the score statistic $T_A(\lambda)$ of Sect. 2.2 are assessed against the standard normal distribution to estimate the maximum number $m^*(\lambda)$ of observations in agreement with that transformation. This procedure for testing the transformation monitors absolute values of the statistic from $M$ to $m^*(\lambda)$. Simulations not reported here show that the size of this simultaneous procedure when testing at 1% is close to the nominal value. For a regression model with two variables the size is 0.75% when $n = 100$, 0.95% when $n = 1000$ and 1.11% when $n = 10{,}000$, a sample size well in excess of that of the largest example we analyse.

(b) The presence of any remaining outliers in the $m^*(\lambda)$ transformed observations is checked using the forward search procedure, now monitoring deletion residuals. If any outliers are found in the transformed data the sample size of the cleaned data is $h(\lambda) < m^*(\lambda)$; otherwise $h(\lambda) = m^*(\lambda)$.

(c) The extended BIC (14) allows comparisons of results from models with differing numbers of non-deleted observations $h(\lambda)$. The maximal value determines the preferred parameter value $\hat{\lambda}$. The agreement index (15) is also calculated over $G$.

Hospital data 2

The conclusion from the analysis of the hospital data following from the fan plot of Fig. 1 was that the logarithmic transformation was appropriate; for other values of $\lambda$ several observations appeared outlying. The results from the automatic method of Sect. 4 for finding the cleaned sample size $h(\lambda)$, when $\lambda = -1, -0.5, 0, 0.5$ and 1, are given in Table 1. There are no deletions of outliers in the searches on residuals, so $h(\lambda) = m^*(\lambda)$ for all $\lambda$.
No observations are deleted when $\lambda = 0$ so, in line with the earlier results, $h(0) = n$.
The results from the automatic analysis are summarised in the single three-panel plot of Fig. 2. The automatic analysis has faithfully extracted all the features of Fig. 1, in which the trajectory of $T_A(0)$ oscillates around zero, without large divergences, and remains inside the 99% bounds over the whole search. This figure is a typical example of the results of a seemingly straightforward analysis. The upper left panel of the figure shows the plot of extended BIC for the five values of $\lambda$. There is a clear peak at $\lambda = 0$ and the log transformation is indicated. The upper right-hand panel shows the plot of the agreement index, again with a peak at zero, and the lower panel shows the proportion of observations included in the final transformed analyses.

Loyalty cards data
We now analyse data from Atkinson and Riani (2006) on the behaviour of customers with loyalty cards from a supermarket chain in Northern Italy. The data are a random sample from a larger database. There are 509 observations: the response is the amount, in euros, spent at the shop over 6 months and the explanatory variables are: the number of visits to the supermarket in the 6 month period; the age of the customer and the number of members of the customer's family.
The top panel of Fig. 3 is a fan plot for 11 values of $\lambda$ $(0, 0.1, \dots, 1)$, with $\lambda = 1$ giving the bottom trajectory and $\lambda = 0$ the top one. It is clear that the trajectories for two of the five standard values of $\lambda$ (0 and 0.5) steadily leave the central band in opposite directions. For the best of the five values, $\lambda = 0.5$, the automatic procedure (middle and bottom panels of Fig. 3) deletes 37 observations as not agreeing with the transformation. The indication is that a finer grid of values of $\lambda$ is required, in line with the argument of Carroll (1982).
We now consider analysis with the finer grid $G = [-1, -0.9, \dots, 1]$. Trajectories of the values of $T_A(\lambda)$ for positive values of $\lambda$ are given in the top panel of Fig. 3. The upper left-hand panel of Fig. 4 shows the extended BIC plot, which is complemented in the lower panel by the plot of the proportion of clean observations $h(\lambda)/n$. For $\lambda$ in the range $-1$ to 0.2, $h(\lambda) = 0.6n$, the point at which monitoring starts. The plot of the extended BIC shows that $\hat{\lambda}_G = 0.4$, with 0.3 and 0.5 giving slightly smaller values. An important feature of the lower panel is that, when $\lambda = 0.4$, checking residuals for the presence of outliers leads to the deletion of 16 observations from the $m^*(\lambda)$ found by checking the score statistic. These deleted observations are shown as the unfilled part of the bar in the plot. The plot of the agreement index in the right-hand panel also indicates a value of 0.4 for the transformation, a value which Perrotta et al. (2009) show provides a significantly better fit to the data than the value of 1/3 suggested by Atkinson and Riani (2006). Figure 6 of Perrotta et al. (2009) reveals that the outlying observations form a group lying on a distinct regression line, with consumers of varying age and family size spending much less than would be expected given their high number of visits to the supermarket.

Comparison with the procedure of Yohai and Marazzi
The purpose of our paper is to provide a practical method of automatic transformation of data in the presence of outliers; robust methods are necessary. It is instructive to compare our approach with that of Marazzi et al. (2009), who have the same goal. They, however, chose MM estimation (Yohai 1987) and considered only the Box-Cox transformation. Different values of $\lambda$ are compared using a robust prediction error.
We start our comparisons with simulations. To enhance comparability, we based the structure of the simulations on those in §5 of Marazzi et al., initially for simple regression. The results in Table 2 suggest that there is little difference between the bias of the two methods; our method had the smaller bias for four of the seven error distributions. However, the method of our paper performed better for six out of the seven error distributions when the comparison was based on the estimated standard deviation of $\hat{\lambda}$. Although some of the differences in performance are not large, that for uniform errors is appreciable. In interpreting these results we recall the reference to McCullagh (2002): the purpose of the transformation method is to find a transformation which, hopefully, has a physical interpretation, rather than to establish a precise value of $\hat{\lambda}$ for use in data analysis. Here $\lambda = 0.5$ is the often easily interpreted square root transformation.
We now extend our comparison to multiple regression with outliers at leverage points. We remain with homoscedastic regression, but now $p = 5$. The 100 values of the four explanatory variables were drawn independently from uniform distributions on [0.2, 20], one set of values being used for all simulations. We considered only two error distributions, Gauss and CntG; in this we again follow Marazzi et al. The parameters of the linear model were (20, 1, 1, 1, 1). We performed two sets of simulations. In the first, ''without leverage points'', all observations were drawn from this model. In the second, ''with leverage points'', 90 of the simulated observations followed the preceding model; there were, in addition, 10 outliers at $x_i^T = (1, 40, 40, 40, 40)$ with 150 subtracted from each response value. The results are in Table 3. They show that, for all combinations studied, our automatic procedure produces estimates with lower biases than those of Marazzi et al.; the difference is largest for Gauss without leverage points. (The data, simulated in Matlab, were passed to the R package for estimation of $\lambda$, the resulting estimate being returned to Matlab for the analyses of bias.)
We now compare timings of the two methods, calculated just for the time each program took to find the best estimate of $\lambda$; thus the relevant part of our Matlab procedure is compared directly with the times for the R code. The results in Table 4 are for homoscedastic simple regression with uniform and normal error distributions as the number of observations $n$ increases from 100 to 1000. The results are very clear. For a sample size of 100, MY is faster than RAC, although only statistically significantly so for the normal distribution. But for $n = 200$ or more RAC is faster, becoming increasingly so as $n$ increases. For MY the timings as $n$ increases seem not to depend on the error distribution, whereas for RAC the times for the normal distribution are close to twice those for the uniform distribution. When $n = 1000$ MY takes almost 6 min for either error distribution, whereas RAC takes 6 or 12 s. This difference would become of greater practical significance if model building required the fitting of several models, or if simulation were needed to determine the properties of the procedures for a specific application.

We now turn to the comparative analysis of data. We first analysed the data on 78 patients given by Marazzi et al. and recovered their solution with our version of MY. We then used their algorithm to analyse our two examples of the Box-Cox transformation. For the hospital data we obtained a value of 0.134 for $\hat{\lambda}$, which is in agreement with the log transformation indicated by our robust analysis; however, the hospital data do not contain any outliers and the sample is small. We next analysed the 509 observations on loyalty cards of Sect. 5.2. The R code returned three possible values for the estimate of $\lambda$: 0.534, 0.699 and 0.777. None is close to the value of 0.4 that we recommend. However, the fan plot in the top panel of Fig. 3 gives some explanation: particularly for the trajectory for $\lambda = 0.7$, there is a local maximum in the trajectory around $m = 450$.
This suggests that a local minimum of the robust prediction error used in estimation of k has been identified. The trajectory for k ¼ 0:4, on the other hand, has the desired shape of being reasonably stable until the end of the search when outliers enter and a different transformation starts to be indicated.
The conclusions from these comparisons are that our automatic procedure is to be preferred, both in terms of performance and, especially, of time. We also stress that the procedure of Marazzi et al. has so far been implemented only for the Box-Cox transformation, whereas our automatic procedure is also available for the transformation of responses that can be both positive and negative, the topic covered in the remainder of our paper.

Yeo and Johnson (2000) generalised the Box-Cox transformation to observations that can be positive or negative. Their extension used the same value of the transformation parameter $\lambda$ for positive and negative responses. Examples in Atkinson et al. (2020) show that the two classes of response may require transformation with different values of $\lambda$. The Box-Cox transformation (1) has two regimes, one for $\lambda \neq 0$ and the other for $\lambda = 0$. Both the Yeo-Johnson transformation and its extended version require four regimes. The two unnormalised transformations for these regions are given in Table 5.

The extended Yeo-Johnson transformation
For $y \ge 0$ the Yeo-Johnson transformation is the generalized Box-Cox transformation of $y + 1$. For negative $y$ the transformation is minus the Box-Cox transformation of $-y + 1$, with parameter $2 - \lambda$.
For the extended Yeo-Johnson transformation the transformation parameter for positive $y$ is $\lambda_P$, while negative $y$ are transformed with parameter $2 - \lambda_N$.
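The four regimes of Table 5 can be written out explicitly. The sketch below follows the verbal description above; setting `lam_p` equal to `lam_n` recovers the ordinary Yeo-Johnson transformation:

```python
import numpy as np

def ext_yeo_johnson(y, lam_p, lam_n):
    """Unnormalized extended Yeo-Johnson transform: positive y get the
    Box-Cox transform of y + 1 with parameter lam_p; negative y get
    minus the Box-Cox transform of -y + 1 with parameter 2 - lam_n."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    if lam_p != 0:
        out[pos] = ((y[pos] + 1.0) ** lam_p - 1.0) / lam_p
    else:
        out[pos] = np.log1p(y[pos])
    two_l = 2.0 - lam_n
    if two_l != 0:
        out[~pos] = -(((-y[~pos] + 1.0) ** two_l - 1.0) / two_l)
    else:
        out[~pos] = -np.log1p(-y[~pos])
    return out
```

With $\lambda_P = \lambda_N = 1$ the transformation is the identity on both sides of zero, which is the natural analogue of the no-transformation case $\lambda = 1$ in (1).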
Inference about the values of the transformation parameters needs to take account of the Jacobians of these transformations. As in Sect. 2.1, the maximum likelihood estimates can be found either through minimization of the Jacobian-adjusted sum of squares $S(\lambda)$ or of the sum of squares $R(\lambda)$ of the normalized transformed variables $z$. Score tests for the parameters use constructed variables that are derivatives of $z$. We denote the score test for the overall parameter $\lambda$ in the Yeo-Johnson transformation by $T_O$. For the extended Yeo-Johnson transformation there are in addition approximate score tests $T_P$ for $\lambda_P$ and $T_N$ for $\lambda_N$. These include separate Jacobians for the positive and negative observations. Full details of the constructed variables for these models are in the Appendix.

The automatic procedure for the extended Yeo-Johnson transformation
The procedure introduced by Atkinson et al. (2020) for the analysis of data with the extended Yeo-Johnson transformation again depends heavily on the numerical identification of patterns in a series of trajectories of score tests. The analysis starts with the Yeo-Johnson transformation of Sect. 7, in which negative and positive observations are subject to transformation with the same value of $\lambda$. The next stage is to determine whether this transformation parameter should be used for both positive and negative observations. The general procedure is to test whether $\lambda_0 = (\lambda_{P0}, \lambda_{N0})$ is the appropriate transformation by transforming the data using $\lambda_0$ and then checking an ''extended'' fan plot to test the hypotheses, for the transformed data, that no further transformation is required. The three score tests in the extended fan plot are listed in (25) in the Appendix. They check the individual values of $\lambda_{P0}$ and $\lambda_{N0}$ as well as the overall transformation $\lambda_0$. This information is incorporated in the extended agreement index defined in Point 3(d) below.
The automatic numerical procedure can be divided into three parts. The first provides information from the fan plots for the one-parameter Yeo-Johnson transformation, which is used by the BIC in the second part to determine $\hat{\lambda}$, the best overall parameter for this transformation. The third uses the value of $\hat{\lambda}$ to calculate an appropriate grid of parameter values for the extended transformation, which is used, in conjunction with the BIC, to find the best parameters of the extended transformation. Developments in output conclude the section.
1. Fan plot. Function FSRfan('family','YJ').

(a) Since $\lambda$ is a scalar, this problem has the same structure as that of the Box-Cox transformation in Sect. 4. Choice of the family YJ fits the one-parameter Yeo-Johnson transformation of Table 5 using the FS over a grid of values of $\lambda$.

(b) The output structure outFSRfan contains the numerical information needed to select the best transformation value. Optionally the fan plot can be generated.

2. Estimation of the parameter $\hat{\lambda}$ for the YJ transformation. Function fanBIC('family','YJ').

(a) Function fanBIC inputs outFSRfan to calculate provisional clean subsets $m^*(\lambda)$, followed by a forward search on residuals to detect outliers. The final outlier-free subsets are of size $h(\lambda)$.

(b) Once $h(\lambda)$ is established for $\lambda \in G$, the BIC (14) is used to select $\hat{\lambda}$, the estimate of the best transformation in the Yeo-Johnson family.

(c) Optionally a plot like Fig. 4 can be produced.

3. The extended Yeo-Johnson transformation. A call is made to function FSRfan('family','YJpn'), followed by one to function fanBICpn, first to establish the grid $G$ of pairs $(\lambda_P, \lambda_N)$ to be searched and then to choose the best pair.
(a) FSRfan calculates the three score statistics of the extended fan plot at $\lambda$. The output is outFSRfanpn.

(d) The BIC, now with $n_\lambda = 2$, is calculated for the cleaned samples of observations for each combination of values of $\lambda_P$ and $\lambda_N$. This information is supplemented by the diagnostic use of an extended agreement index, which measures agreement between the values of $T_P$ and $T_N$ and checks that no further overall transformation is needed. Agreement between the values of $T_P$ and $T_N$ is measured by the absolute value of their difference; we also require a small value of $T_O$. The index therefore depends on the sums of these quantities over the search and, as in (15), the sums are adjusted to give more weight to searches with a larger value of $h$. As before, larger values are desired.

4. Output.
(a) The extended BIC (23) in the Appendix, calculated using the Jacobian J_n^{EYJ}, the agreement index AGI and the values of h(λ_P, λ_N) are presented as heat maps over the grid of values of λ_P and λ_N. These three plots summarize the properties of the best transformation and those close to it. (b) We also produce a heat map of the extended coefficient of determination, R²_EXT, defined in (16), where the extension allows for the number of observations h(λ_P, λ_N).
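To make the object of this search concrete, the sketch below applies the extended Yeo-Johnson transformation in its unnormalised form, with λ_P acting on non-negative observations and λ_N on negative ones. This is an illustrative re-implementation in Python, not the FSDA code called above; the function name `ext_yeo_johnson` and the form of the negative branch (the standard Yeo-Johnson branch with λ replaced by λ_N) are our assumptions, consistent with the property that the transformation reduces to the one-parameter case when λ_P = λ_N.

```python
import numpy as np

def ext_yeo_johnson(y, lam_p, lam_n):
    """Unnormalised extended Yeo-Johnson transform.

    lam_p acts on y >= 0 and lam_n on y < 0; when lam_p == lam_n
    this is the standard Yeo-Johnson transformation."""
    y = np.asarray(y, dtype=float)
    z = np.empty_like(y)
    pos = y >= 0
    # non-negative branch: ((1+y)^lam_p - 1)/lam_p, log(1+y) when lam_p = 0
    if lam_p != 0:
        z[pos] = ((1.0 + y[pos]) ** lam_p - 1.0) / lam_p
    else:
        z[pos] = np.log1p(y[pos])
    # negative branch: -(((1-y)^(2-lam_n) - 1)/(2-lam_n)), -log(1-y) when lam_n = 2
    if lam_n != 2:
        z[~pos] = -(((1.0 - y[~pos]) ** (2.0 - lam_n) - 1.0) / (2.0 - lam_n))
    else:
        z[~pos] = -np.log1p(-y[~pos])
    return z
```

With λ_P = λ_N = 1 the transformation is the identity, which provides a quick sanity check on any implementation.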

Two examples of the automatic extended Yeo-Johnson transformation
This section presents the results of the automatic analysis of two sets of data, both of which include negative responses, so that the extended Yeo-Johnson transformation, Sect. 7, may be appropriate. We follow the procedure of Sect. 8.

Investment funds
We start with a straightforward example, without outliers once the correct transformation has been determined. The regression data concern the relationship of the medium-term performance of 309 investment funds to two indicators. The data come from the Italian financial newspaper Il Sole 24 Ore. An analysis of the data based on conclusions from fan plots is presented by Atkinson et al. (2020).
Of the funds, 99 have negative performance. Scatterplots of y against the two explanatory variables show a strong, roughly linear, relationship between the response and both explanatory variables. It is also clear that the negative responses have a different behaviour from the positive ones: the variance is less and the slope of the relationship with both explanatory variables appears to be smaller. The purpose of using the extended Yeo-Johnson transformation is to find a transformation in which the transformed response, as for the Box-Cox transformation, satisfies the three requirements of homogeneity, additivity and approximate normality of errors.
The automatic analysis of Sect. 8 starts with the Yeo-Johnson transformation, in which positive and negative observations are subject to transformation with the same value of λ. Use of the standard five values of λ leads to searches all of which immediately terminate at m*(λ) = 0.6n. A series of searches over a finer grid of values suggests λ̂ = 0.75. The three-panel plot summarising the results is shown as Fig. 5. The values of λ belong to the large grid G = [−1, −0.9, …, 1]. The lower plot of values of h(λ) shows that when λ is less than 0.2, m*(λ) = 0.6n and that h(λ) is mostly appreciably less. The interpretation is that inappropriate values of the transformation parameter are indicating a large number of spurious outliers; the values of statistics in the upper plots are then based on an unrepresentative subset of the data, so that we exclude them from consideration. The two upper plots show that both the extended BIC and the agreement index indicate a value of 0.8 for λ̂. In calculating the extended BIC we have replaced J_n^{BC} by the Jacobian J_n^{YJ} given in (20). The next stage is to determine whether this transformation parameter should be used for both positive and negative observations. The general procedure is to test whether λ_0 = (λ_{P0}, λ_{N0}) is the appropriate transformation by transforming the data using λ_0 and then testing the hypothesis, for the transformed data, that no further transformation is required.
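The core of this first stage, selecting λ by an information criterion that includes the Jacobian of the transformation, can be sketched numerically. The sketch below is a non-robust simplification: the paper's procedure additionally trims h(λ) observations via the forward search before computing the BIC, which is omitted here, and all function names are ours, not FSDA's.

```python
import numpy as np

def yeo_johnson(y, lam):
    """Standard one-parameter Yeo-Johnson transform (unnormalised)."""
    y = np.asarray(y, dtype=float)
    z = np.empty_like(y)
    pos = y >= 0
    z[pos] = np.log1p(y[pos]) if lam == 0 else ((1 + y[pos]) ** lam - 1) / lam
    z[~pos] = (-np.log1p(-y[~pos]) if lam == 2
               else -(((1 - y[~pos]) ** (2 - lam) - 1) / (2 - lam)))
    return z

def yj_bic(y, X, lam):
    """BIC from the profile log-likelihood of the regression of the
    transformed response on X, including the Yeo-Johnson log-Jacobian."""
    n = len(y)
    z = yeo_johnson(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = float(np.sum((z - X @ beta) ** 2))
    log_jac = (lam - 1) * np.sum(np.sign(y) * np.log1p(np.abs(y)))
    loglik = -0.5 * n * np.log(rss / n) + log_jac
    return 2 * loglik - (X.shape[1] + 2) * np.log(n)  # +2: sigma^2 and lam

# simulate regression data whose correct transformation is lam = 0.5
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
z_true = X @ np.array([5.0, 1.0]) + 0.3 * rng.normal(size=200)
y = (1 + 0.5 * z_true) ** 2 - 1      # inverse YJ, positive branch, lam = 0.5

grid = np.round(np.arange(-1.0, 2.01, 0.25), 2)
lam_hat = max(grid, key=lambda l: yj_bic(y, X, l))
```

On outlier-free data such as these, maximising this BIC over the grid recovers a value close to the generating λ = 0.5; the robust machinery of the paper is needed precisely when contamination would distort this simple calculation.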
The left-hand panel of Fig. 6 shows the heat map for the values of h(λ). For λ_P = 1 and λ_N = 0 this is 309, that is, all the data are in agreement with this transformation model. For any other pair of parameter values, at least two observations are deleted and, in some cases, many more. The right-hand panel of the figure shows the heat map for the extended BIC, which has a sharp maximum at the same parameter values. In both panels the performance of λ̂ is so poor that the values for (0.75, 0.75) are not plotted.
Two further heat maps are in Fig. 7. That in the left-hand panel, for AGI, has a much sharper peak than that for R²_EXT in the right-hand panel, but both again support the transformation indicated in Fig. 6. The automatic method for this example has used numerical information to provide a firm decision on the best transformation within the extended Yeo-Johnson family.

Balance sheet data
The data come from balance sheet information on limited liability companies. The response is profitability of individual firms in Italy. There are 998 observations with positive response and 407 with negative response, making 1,405 observations in all. Details of the response and the five explanatory variables are in Atkinson et al. (2020, §9), together with the results of a series of data analyses involving visual inspection of fan plots.
The automatic procedure starts with the Yeo-Johnson transformation of the data, leading to the parameter estimate λ̂ (Point 2 of Sect. 8). The three resulting plots summarise this stage of the analysis. Figure 10 shows further support for this transformation from the plots of Point 4 of Sect. 8; the left-hand panel is of the agreement index AGI (19) and the right-hand panel is of the values of R²_EXT (16). In ambiguous cases the heat maps could be extended so that the highest values of the properties are not at an edge of the display.
The automatic procedure leads to the same result as the analysis based on the adjustment of fan plots presented by Atkinson et al. (2020, §9), who also discuss the economic interpretation of the regression model fitted to the transformed responses. The purpose of our paper is to provide a practical method of automatic transformation of data in the presence of outliers, for which robust methods are necessary.
Here we consider a few further points.
The assumption in the Yeo-Johnson transformation and its extension is that there is something special about zero. In some examples it may however be that the change from one transformation to the other occurs at some other threshold, which may perhaps need to be estimated. Atkinson et al. (2020) and Atkinson et al. (2021) compare the extended Yeo-Johnson transformation with the nonparametric methods ACE (Breiman and Friedman 1985) and AVAS (Tibshirani 1988) which do not divide the transformation into regions. The results from ACE for the investment funds data show a slight increase in the value of R² compared with the extended Yeo-Johnson transformation and a change in the transformation at y = 4 rather than zero. Since the non-parametric transformations are not robust, our recommendation is to use them on cleaned data to check on the assumptions in the parametric transformation.
The largest timings of Sect. 6 for the Box and Cox method of Marazzi et al. are for simple regression and 1000 observations. We have not explored timings for multiple regression with samples of this size. However, it is clear that the use of simulation to obtain results about the behaviour of this method will be problematic. The results of Table 2 were for 1000 replications of the simulation. For simple regression with n = 1000, the results of Table 4 suggest a time of about 100 h.
Our data analytical results show that we have developed a useful new tool for the automatic determination of power transformations of data contaminated by outliers. We have based this on the development of an extended BIC, the theory behind which incorporates inferences about the effect of the trimming of observations into the more customary use of BIC for model choice for data with a fixed number of observations. The calculations in this paper used Matlab routines from the FSDA toolbox, available as a Matlab add-on from the Mathworks file exchange https://www.mathworks.com/matlabcentral/fileexchange/ or from GitHub at the web address https://github.com/UniprJRC/FSDA.
The data, the code used to reproduce all results including plots, and links to FSDA routines are available at http://www.riani.it/RAC2021.

Appendix: Constructed variables and the extended Yeo-Johnson transformation
The normalized form of the Yeo-Johnson transformation is given in Table 5. The Jacobian for this transformation (Yeo and Johnson 2000) is

log J_n^{YJ} = (λ − 1) Σ_i sgn(y_i) log(|y_i| + 1).

The four unnormalised forms of the extended Yeo-Johnson transformation, corresponding to different regions of values of λ_P and λ_N, are also given in Table 5. Two Jacobians are required for the normalised transformation; one for the positive observations and the other for the negative ones. For y ≥ 0 let v_P = y + 1, with v_N = −y + 1 when y < 0. For the negative observations S_N = −Σ_{y_i<0} log v_{i,N} and ẏ_N = exp(S_N/n). Division is by n, not n_N (the number of negative y_i), as the Jacobian is spread over all observations. Similarly, for the non-negative observations S_P = Σ_{y_i≥0} log v_{i,P} and ẏ_P = exp(S_P/n). The Jacobian of the sample is then

log J_n^{EYJ} = (λ_N − 1)S_N + (λ_P − 1)S_P = n{(λ_N − 1) log ẏ_N + (λ_P − 1) log ẏ_P}.

The normalised form of the extended Yeo-Johnson transformation follows on division of the unnormalised transformation by (J_n^{EYJ})^{1/n} = ẏ_N^{λ_N−1} ẏ_P^{λ_P−1}. This extended transformation reduces to the standard Yeo-Johnson transformation when λ_N = λ_P. The three score statistics test three different departures from λ = (λ_P, λ_N). The alternatives are T_O: (λ_P + a, λ_N + a); T_P: (λ_P + a, λ_N); T_N: (λ_P, λ_N + a), tested as a → 0. General expressions for the constructed variables used to calculate these test statistics are in Atkinson et al. (2020).
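The identities above can be checked numerically. In the sketch below (our notation; S_N carries the negative sign, which is what makes the formula reduce to the one-parameter Yeo-Johnson Jacobian when λ_N = λ_P), log J_n^{EYJ} is computed both from S_P and S_N and from the equivalent form in ẏ_P and ẏ_N.

```python
import numpy as np

def log_jacobian_eyj(y, lam_p, lam_n):
    """log J_n^{EYJ} = (lam_N - 1) S_N + (lam_P - 1) S_P, where
    S_P = sum of log v_P over y >= 0 and S_N = -(sum of log v_N over y < 0),
    with v_P = y + 1 and v_N = -y + 1."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    pos = y >= 0
    S_P = float(np.sum(np.log1p(y[pos])))
    S_N = -float(np.sum(np.log1p(-y[~pos])))
    # division by n, not n_N: each Jacobian is spread over all observations
    ydot_P, ydot_N = np.exp(S_P / n), np.exp(S_N / n)
    log_J = (lam_n - 1) * S_N + (lam_p - 1) * S_P
    # equivalent form n{(lam_N - 1) log ydot_N + (lam_P - 1) log ydot_P}
    alt = n * ((lam_n - 1) * np.log(ydot_N) + (lam_p - 1) * np.log(ydot_P))
    assert np.isclose(log_J, alt)
    return log_J
```

Setting λ_P = λ_N = λ gives (λ − 1)(S_P + S_N) = (λ − 1) Σ sgn(y_i) log(|y_i| + 1), the Yeo-Johnson Jacobian quoted above.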
In the automatic procedure of this paper we first transform the data using the null values of λ_P and λ_N and then test the hypothesis that no further transformation is needed, that is, that for the transformed data one or both of λ_P and λ_N equal 1. There is then a simplification of the constructed variables.
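This transform-then-retest logic can be illustrated numerically. The sketch below uses a grid profile log-likelihood for a mean-only model in place of the constructed-variable score tests derived below, and is an illustrative simplification under our own naming, not FSDA code: after transforming with the correct null value, the re-estimated parameter should be close to 1.

```python
import numpy as np

def yeo_johnson(y, lam):
    """Standard one-parameter Yeo-Johnson transform (unnormalised)."""
    y = np.asarray(y, dtype=float)
    z = np.empty_like(y)
    pos = y >= 0
    z[pos] = np.log1p(y[pos]) if lam == 0 else ((1 + y[pos]) ** lam - 1) / lam
    z[~pos] = (-np.log1p(-y[~pos]) if lam == 2
               else -(((1 - y[~pos]) ** (2 - lam) - 1) / (2 - lam)))
    return z

def profile_lam(y, grid):
    """lam maximising the profile log-likelihood (mean-only normal model),
    including the Yeo-Johnson log-Jacobian."""
    def ll(lam):
        z = yeo_johnson(y, lam)
        return (-0.5 * len(y) * np.log(np.var(z))
                + (lam - 1) * np.sum(np.sign(y) * np.log1p(np.abs(y))))
    return max(grid, key=ll)

rng = np.random.default_rng(2)
z = rng.normal(8.0, 2.0, size=500)     # normal on the transformed scale
y = (1 + 0.5 * z) ** 2 - 1             # inverse YJ, positive branch, lam0 = 0.5

grid = np.round(np.arange(-1.0, 2.01, 0.1), 2)
lam1 = profile_lam(y, grid)                      # estimate on raw data: near 0.5
lam2 = profile_lam(yeo_johnson(y, 0.5), grid)    # after null transform: near 1
```

An estimate far from 1 at the second step would indicate that a further overall transformation is needed, which is what the statistic T_O checks.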
For the overall test T_O the constructed variable w_O is given in (26), in which λ*_P = 1 + log ẏ_O and λ*_N = log ẏ_O − 1, with ẏ_O obtained from (21). For T_P, the test of the value of λ_P, the constructed variable has a structure similar to that of w_O in (26), where now λ*_P = 1 + log ẏ_P. The result for y < 0 arises because the transformation for y < 0 depends on λ_P only through the Jacobian.
Similarly, for T_N, the test of the value of λ_N, the constructed variable has λ*_N = log ẏ_N − 1.