Modelling Large-Scale Scientific Data Transfers

This work studies a recently published dataset (Bogado et al. in ATLAS Rucio transfers dataset. Zenodo, 2020.) whose data allow us to reconstruct the lifetime of file transfers in the context of the Worldwide LHC Computing Grid (WLCG). Several models for Rule Time To Complete (TTC) prediction are presented and evaluated. The dataset source is Rucio, an open-source software framework that provides scientific collaborations with the functionality to organize, manage, and access their data at scale. The wealth of data gathered about the transfers and rules presents a unique opportunity to better understand the complex mechanisms involved in file transfers across the WLCG.


Introduction
There have been efforts to model data transfers since the commissioning of Rucio at the end of 2014 [1]. The work cited in [2] focuses on Transfer TTC predictions. The work cited in [3] focuses on the prediction of the network throughput. The work cited in [4] focuses on the prediction of the length of the queues of the system, with emphasis on the importance of network throughput. However, prior studies have not addressed the modelling of Rucio's replication rules or the prediction of Rule TTC, nor have they assessed how well a model that predicts the Network Time of a transfer can also predict the Transfer or Rule TTC.
Rucio handles transfers between sites, and it also handles the deletion requests needed to comply with the data retention policy. Both transfers and deletion requests are stored in the REQUESTS and REQUESTS_HISTORY tables: the REQUESTS table stores the current requests with no final state, while the REQUESTS_HISTORY table works as an archive of requests that will not be updated anymore, that is, requests with a final state. Deletion requests do not affect RSE transfer performance, and so can be ignored. Only transfer requests were taken into account for this work.
There are several instances of the FTS transfer tool on the grid. These instances work at the WLCG level and serve transfers from several VOs, not only ATLAS-specific transfers. The Rucio database does not contain information about FTS other than the instance that is used for each transfer request. Information about the state of the FTS queues, scheduling and retries, number of nodes, and configuration is hidden from Rucio. Transfers that other VOs place in an FTS server are also hidden from Rucio.
The main hypothesis is that the load in FTS queues has a noticeable impact on the difference of the submission time and starting time of a transfer. The more transfers are queued at FTS the more time will elapse between the submission time and starting time of a transfer. Ergo, a model that can predict the Network Time of a transfer with 100% accuracy will not necessarily predict accurately the Transfer TTC, as the Network Time represents a small fraction of the total time.
The studied dataset has been made publicly available [1].

Metric Selection
Standard metrics for regression problems include root mean squared error (RMSE) and mean squared error (MSE) [5], mean absolute error (MAE) and median absolute error (MedAE), mean squared logarithmic error (MSLE) [5] and root mean squared logarithmic error (RMSLE), explained variance score, R² score, mean Tweedie deviance, mean absolute percentage error (MAPE) [5], relative error (RE) and related metrics, and the Fraction of Good Predictions (FoGP) [6].
The mean squared error measures the mean of the squared differences between the vectors y and ŷ, each with n elements, according to Eq. 1:
$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{1}$$
As the differences are squared, this metric penalizes large differences more heavily and, being a mean value, it is sensitive to outliers. The RMSE is the square root of the MSE, which makes its units comparable with the units of y and ŷ: if y and ŷ are in seconds, the RMSE can be interpreted in seconds too. When several models are to be compared or when the values of y have great variance, MSE and RMSE are not particularly useful. Two models will be considered to have comparable performance if their RMSE values are of the same order of magnitude, but the model with the smaller RMSE will always be preferred.
The Mean Absolute Error and the Median Absolute Error are the mean and median of the absolute value of the difference between y and ŷ, respectively. The MAE is calculated using Eq. 2, whereas MedAE is calculated using Eq. 3:
$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{2}$$
$$\mathrm{MedAE}(y, \hat{y}) = \mathrm{median}\left(|y_1 - \hat{y}_1|, \ldots, |y_n - \hat{y}_n|\right) \tag{3}$$
MAE and MedAE are easier to interpret than MSE and RMSE, and MedAE is preferred for its robustness to outliers in y and ŷ. However, all four metrics are sensitive to the scale of y. This means that when the same model is evaluated using two different sets of observations y and y′, the metric will differ and will depend on the distributions of y and y′. If y presents outliers and y′ does not, the metric for the same model will be worse on y regardless of the performance of the model. This is not a problem when two models are compared against the same y, but it gives no idea of the goodness of the models in general.
The mean squared logarithmic error is a metric robust against outliers and not sensitive to the scale of y. It is defined by Eq. 4 as the mean of the squared differences between the natural logarithm of 1 + y and the natural logarithm of 1 + ŷ. The root mean squared logarithmic error is the square root of the MSLE:
$$\mathrm{MSLE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left(\ln(1 + y_i) - \ln(1 + \hat{y}_i)\right)^2 \tag{4}$$
Hidden in the definition lies the problem that this metric tends to penalize negative errors more than positive ones, and thus will favor a model that overestimates the predictions over one that underestimates them. Moreover, the metric is difficult to interpret and the results do not give a good idea of the goodness of a model.
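The asymmetry of the MSLE can be checked numerically. The following minimal sketch, in which the function name `msle` and the sample values are illustrative and not taken from the original work, compares an underestimate and an overestimate of the same absolute size.

```python
import math

def msle(y_true, y_pred):
    """Mean squared logarithmic error, following the definition of Eq. 4."""
    return sum((math.log1p(y) - math.log1p(p)) ** 2
               for y, p in zip(y_true, y_pred)) / len(y_true)

y = [100.0]                # illustrative target, e.g. a TTC of 100 s
under = msle(y, [50.0])    # prediction 50 s below the target
over = msle(y, [150.0])    # prediction 50 s above the target
print(under > over)        # the underestimate receives the larger penalty
```

Both predictions have the same absolute error, yet the logarithm compresses large values, so the underestimate is penalized more.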
The explained variance score and the R² score are two metrics related to each other. The differences are subtle but important. The explained variance is defined in Eq. 5. It can be interpreted as the proportion of the variance of y that is explained by the model through the predictions ŷ:
$$\mathrm{EV}(y, \hat{y}) = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)} \tag{5}$$
The R² score, also known as the coefficient of determination, is defined as in Eq. 6, where ȳ is the mean of y:
$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{6}$$
In Eqs. 5 and 6 it is possible to spot the difference at a glance. Both results are equal if the mean of the residuals y_i − ŷ_i is zero. Otherwise, the variance in Eq. 5 subtracts the mean residual, so the explained variance does not account for a systematic bias of the model, whereas the R² score does. This also makes the R² slightly more sensitive to the scale of y.
The interpretation of both metrics is not clear at a glance, but it follows directly from the equations. In both cases, if the prediction of the model is perfect, then y − ŷ = 0, and both scores are equal to 1. This is the best score a model can achieve. By definition, a naive model that always predicts the mean of y has a score of 0, so any model with a score greater than 0 is better than the naive model. The predictions can, however, be arbitrarily far away from the observed values, in which case both scores are negative.
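As an illustration, the sketch below computes both scores with the Python standard library and shows how each reacts to a constant offset in the predictions and to the naive mean predictor. The function names and the sample values are illustrative assumptions, not part of the original study.

```python
from statistics import mean, pvariance

def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y), using population variances."""
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)

def r2_score(y_true, y_pred):
    """1 - sum of squared residuals / total sum of squares."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y = [10.0, 20.0, 30.0, 40.0]       # illustrative targets in seconds
biased = [v + 5.0 for v in y]      # every prediction is 5 s too high
naive = [mean(y)] * len(y)         # always predict the mean of y

print(explained_variance(y, biased), r2_score(y, biased))  # 1.0 and about 0.8
print(explained_variance(y, naive), r2_score(y, naive))    # 0.0 and 0.0
```

The constant offset leaves the explained variance at 1 but lowers the R² score, while the naive predictor scores 0 on both.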
Mean Absolute Percentage Error has the advantage of being easy to interpret. MAPE is based on the Relative Error, that is, the absolute value of the difference between the target and the prediction relative to the target, as shown in Eq. 7, and thus is the error in the prediction relative to the observed value:
$$\mathrm{RE}(y_i, \hat{y}_i) = \frac{|y_i - \hat{y}_i|}{|y_i|} \tag{7}$$
The MAPE is the mean of the RE expressed as a percentage, as in Eq. 8:
$$\mathrm{MAPE}(y, \hat{y}) = \frac{100}{n} \sum_{i=1}^{n} \mathrm{RE}(y_i, \hat{y}_i) \tag{8}$$
As a mean value, however, it is sensitive to outliers in the relative errors. It is possible to overcome this by taking the median instead of the mean in Eq. 8. Mean Absolute Percentage Error and Median Absolute Percentage Error both diverge when y values are very close to zero. MAPE also penalizes positive forecast errors more than negative ones [7]. sMAPE, or Symmetric Mean Absolute Percentage Error, and sMedAPE try to address this issue but can still suffer from the divergence problem when the sum y_i + ŷ_i is small. In [5], the MASE, or Mean Absolute Scaled Error, is introduced to circumvent the mentioned problems. These derived metrics obscure the straightforward interpretation of MAPE. Moreover, the models evaluated in this work make predictions over positive integer targets, as both the Rule TTC and the Transfer TTC are measured in seconds.
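The sensitivity of the mean-based metric to a single target close to zero can be seen in a short sketch. The function names `mape` and `medape` and the sample values are illustrative assumptions.

```python
from statistics import median

def relative_errors(y_true, y_pred):
    """Relative error of each prediction with respect to its target."""
    return [abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)]

def mape(y_true, y_pred):
    """Mean of the relative errors, as a percentage."""
    errors = relative_errors(y_true, y_pred)
    return 100.0 * sum(errors) / len(errors)

def medape(y_true, y_pred):
    """Median of the relative errors, as a percentage."""
    return 100.0 * median(relative_errors(y_true, y_pred))

# Three ordinary targets and one target of 1 s with a 10 s absolute error.
y = [100.0, 200.0, 300.0, 1.0]
y_hat = [110.0, 190.0, 330.0, 11.0]

print(mape(y, y_hat))    # dominated by the 1000% error on the smallest target
print(medape(y, y_hat))  # stays close to the typical 10% error
```

A single small target inflates the MAPE by more than an order of magnitude, while the median-based variant is unaffected.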
The metric selected to compare models in this work is described in [6] (p.16) as percentage of predictions with less than X percent RE. We call this metric Fraction of Good Predictions (FoGP), expressed as a number between 0 and 1, in which X is the threshold of relative error below which a prediction is considered good.
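Formally, with the trivial function g defined as in Eq. 9 for a threshold τ, FoGP is defined in Eq. 10:
$$g(y_i, \hat{y}_i, \tau) = \begin{cases} 1 & \text{if } \mathrm{RE}(y_i, \hat{y}_i) < \tau \\ 0 & \text{otherwise} \end{cases} \tag{9}$$
$$\mathrm{FoGP}(y, \hat{y}, \tau) = \frac{1}{n} \sum_{i=1}^{n} g(y_i, \hat{y}_i, \tau) \tag{10}$$
As an example, assume that a certain model made the predictions ŷ for y. We calculate the FoGP with threshold 0.05 and obtain 0.5 as the result. Formally, this can be expressed as FoGP(y, ŷ, 0.05) = 0.5. This means that 50% of the predictions in ŷ are less than 5% away from their real values.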
This metric is easy to interpret, robust to outliers both in y and ŷ , and can be easily and efficiently implemented. Thus, the models studied in this work are evaluated using this metric.
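A minimal sketch of the metric follows. The function names and the sample values are illustrative, and targets are assumed to be non-zero, which holds for TTCs measured in positive seconds.

```python
def relative_error(y, y_hat):
    """Absolute error of one prediction relative to its (non-zero) target."""
    return abs(y - y_hat) / abs(y)

def fogp(y_true, y_pred, threshold):
    """Fraction of predictions whose relative error is below the threshold."""
    good = sum(1 for y, p in zip(y_true, y_pred)
               if relative_error(y, p) < threshold)
    return good / len(y_true)

y = [100, 200, 400, 1000]       # illustrative observed TTCs in seconds
y_hat = [104, 230, 390, 5000]   # illustrative predictions

print(fogp(y, y_hat, 0.05))  # 0.5: two of four predictions are within 5%
```

The wildly wrong last prediction counts as one bad prediction and nothing more, which is why the metric is robust to outliers.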

Models α and α′
The number of transfers per rule varies from rule to rule, but there are notable peaks in some numbers, most notably, rules with exactly 20 transfers. These rules are generated by an automated replication process by the experiment (Fig. 1).
The set of transfers going through the FTS BNL instance contains over 1.8M different rules. The mean rule TTC is 3.1 h but the mean varies within two orders of magnitude depending on the number of transfers per rule, as shown in Fig. 2.
A baseline model was developed using the dataset described in the previous section. The prediction for the Rule TTC is based on the created timestamp of the first created transfer, the ended timestamp of the first ended transfer and the total number of transfers to be created and transferred to fulfill the rule.
Formally, the method consists of a regression analysis using ordinary least squares to fit a first-degree polynomial, where the independent variable is the completion percentage of the rule, and the dependent variable is the time at which that percentage of completion is reached. Figure 3 illustrates the application of the method to an observed rule. The rule has 20 transfers and the prediction is made after the first transfer ends, hence using only two points to fit the polynomial. More than two points can be used in the regression analysis. However, the more points are used, the more transfers need to finish before the prediction can be made, the further the rule needs to progress, and the less useful the prediction becomes, rendering this method useless for scheduling purposes. These models can still be applied, for example, to give feedback to users about the time at which their transfers will finish.
We define the family of models α_k as in Eq. 12, where ax + b is the polynomial that results from fitting the predictor vector X_k using the Ordinary Least Squares method. The predictor X_k is the vector of points ((x_1j, x_2j)), where x_1j is the component that represents the percentage of the rule that is completed, and x_2j is the time elapsed in seconds until x_1j is reached. The sub-index k is the number of points used to make the fit, so its range is k = 2, 3, 4, …, n, where n is the number of transfers in the rule. Thus, the sub-index j ranges over 1, …, k (Fig. 4):
$$\alpha_k(X_k) = a x + b \tag{12}$$
A more detailed analysis of the progress of the rules over time determined that only a fraction of the Rule TTCs are collinear with the first points of the progress of the rule, including the origin point.
From this observation, model α′_k was created and tested. Every member of the α′_k family is equal to its relative α_{k+1}, except that the origin point, where the rule is created, is removed from the regression. Thus α′_2 corresponds to α_3 without the origin: the ending times of the first two transfers are used to make the linear regression, but not the origin point. An important consequence is that the models α′_k and α_{k+1} can make a prediction at the same stage of the progress of the rule, that is, when transfer k has finished (Table 1).
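The two model families can be sketched with a closed-form least squares fit of a straight line. In this sketch the completion of the rule is expressed as a fraction between 0 and 1 rather than a percentage, times are measured in seconds from rule creation, and the function names and sample end times are illustrative assumptions.

```python
def fit_line(points):
    """Ordinary least squares fit of t = a * x + b over (x, t) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in points)
    a = sxt / sxx
    return a, mean_t - a * mean_x

def predict_rule_ttc(end_times, n_transfers, k, include_origin=True):
    """Fit k points and extrapolate the line to full completion (x = 1).

    With the origin, the k points are (0, 0) plus the first k - 1 transfer
    end times; without it, they are the first k transfer end times.
    """
    n_ended = k - 1 if include_origin else k
    ended = sorted(end_times)[:n_ended]
    points = [(0.0, 0.0)] if include_origin else []
    points += [((i + 1) / n_transfers, t) for i, t in enumerate(ended)]
    a, b = fit_line(points)
    return a * 1.0 + b

# A rule with 20 transfers whose first two transfers end after 60 s and 90 s.
alpha_2 = predict_rule_ttc([60.0], 20, k=2)
alpha_3 = predict_rule_ttc([60.0, 90.0], 20, k=3)
alpha_prime_2 = predict_rule_ttc([60.0, 90.0], 20, k=2, include_origin=False)
print(alpha_2, alpha_3, alpha_prime_2)
```

Both `alpha_3` and `alpha_prime_2` become available when the second transfer finishes, but dropping the origin changes the fitted slope and therefore the extrapolated TTC.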

Models β and γ
One important caveat with the models α and α′ is that they will not be able to make any prediction for the Rule TTC before at least two transfers of the rule have finished. One approach is to use metrics such as the mean or the median of the Rule TTC of already finished rules as the predictor of the Rule TTC of newly created rules. Figure 5 shows an example of how this method could work. Assume we want to predict the Rule TTC of the rule R_6 created at time t_0. A look-back window ending at t_0 can be defined. Let us consider a window of 30 s and the mean of the Rule TTCs of those rules created between t_0 − 30 s and t_0. Rules R_1 to R_5 were created over the last 30 s, and their mean TTC is 3.8 min, so that will be the prediction for the Rule TTC of R_6. We call this model β. The best window length was found through an optimization process described later.
Figure caption: Results for the experiment showing the distribution of the FoGP(y, ŷ_αk, 0.1) for the first 9 members of the α_k family. The bigger the k, the more points are included in the regression, the more accurate the results, and the later the prediction. For α_2, at least one transfer needs to finish before a prediction can be made, while for α_10, at least 9 transfers need to finish to make a prediction.
However, some of the rules R_1 to R_5 may not be completed at t_0, so their TTCs will be unknown and unavailable for prediction, and thus the model cannot be used in real time. The idea of this work is to use time series analysis techniques to build a model that can predict the mean and use it as the predictor for the rules created at t_0. We call this model γ.
Formally, the β_f(t_0, ω) model family is defined as in Eq. 13. Here, y_Ri are the real Rule TTCs of those rules created in the left-closed interval [t_0 − ω, t_0), that is, the rules in the dataset with min_created < t_0 and min_created ≥ (t_0 − ω). The parameter ω can be interpreted as the size of the rolling window, usually measured in seconds. The aggregation function f is a function that returns a number summarizing the information contained in the {y_Ri} set. The functions tested are min(), which returns the minimum element of the set, max(), which returns the maximum element of the set, median(), which calculates the median, and mean(), which returns the mean of the values of the set:
$$\beta_f(t_0, \omega) = f\left(\{\, y_{R_i} : t_0 - \omega \le \mathrm{min\_created}_i < t_0 \,\}\right) \tag{13}$$
Using this notation, the example in Fig. 5 can be annotated as β_mean(ω = 30).
Figure 6 summarizes the result of the experiment. The most accurate result was obtained by the β_median model with ω between 20 and 30 s, with an FoGP(y, ŷ_β(ω), 0.1) = 0.22, meaning around 22% of the predictions will present less than 10% relative error. Other values of ω present a lower FoGP, and thus lower predictive power. A finer scan of ω in the interval [20, 30] shows no significant improvement in the FoGP. This also shows that the values of the aggregation function have some correlation with time and that these values could hold important information about the Rule TTC of near-future rules.
If the model is implemented using only the data available in real time at t_0, that is, at the time a prediction is made for a rule created at t_0, then the only TTCs available will be those of the rules both created and finished in the semi-open interval [t_0 − ω, t_0), that is, the rules with min_created ≥ (t_0 − ω) and max_ended < t_0. Figure 7 summarizes the results of the experiment of measuring the FoGP over 300 Rule TTC predictions, repeated 100 times. The functions were calculated using real time data.
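The rolling-window predictor described above can be sketched as follows. Rules are represented as hypothetical (min_created, max_ended) pairs of timestamps in seconds, the Rule TTC is taken as max_ended minus min_created, and an empty window yields 0, which is an assumption of this sketch rather than a documented choice of the study.

```python
from statistics import median

def beta_predict(rules, t0, window, agg=median, real_time=False):
    """Aggregate the TTC of the rules created in [t0 - window, t0).

    With real_time=True only rules that also finished before t0 are used,
    which is the information actually available when the prediction is made.
    """
    ttcs = [ended - created for created, ended in rules
            if t0 - window <= created < t0
            and (not real_time or ended < t0)]
    return agg(ttcs) if ttcs else 0.0   # assumption: empty window predicts 0

# Illustrative rules as (min_created, max_ended) timestamps in seconds.
rules = [(960, 1100), (975, 990), (980, 1200), (990, 1300), (995, 998)]

print(beta_predict(rules, t0=1000, window=30))                  # 117.5
print(beta_predict(rules, t0=1000, window=30, real_time=True))  # 9.0
```

In this toy example the real-time variant sees only the two short rules that have already finished, so its prediction is strongly biased towards small TTCs.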

The γ Model
Model β is implemented using real time data. It has a low FoGP, because the aggregation function depends on really observed Rule TTCs, and the real time aggregation is not representative of future Rule TTCs. However, as was demonstrated in the previous section, there is a time dependency in the aggregated values of the Rule TTCs. Model γ, presented here, uses a forecast of the time series of the aggregation function to predict the Rule TTC of new rules. Special attention was paid to the median() and mean() aggregation functions, both because of their hypothetical predictive power and their good statistical properties. Formally, the γ model is defined as in Eq. 14. This equation is very similar to the equation of the β model, but in the γ model the aggregation function is estimated using an auto-regressive model whose lags have a fixed size in seconds. One parameter represents the look back, or how many lags are used to fit the model. Another parameter represents the look ahead, or how many lags in the future the model will predict. In Eq. 14, f̂ is the estimation of the aggregation function f(y_Ri) through the use of an auto-regressive model. As in the β model, the set y_Ri contains the Rule TTC of those rules created in a left-closed interval ending at t_0, but in this case only those that have also finished before t_0.
The algorithm proceeds as follows. First, all the rules that have been created between t_0 − T and t_0, where T is the length of the selection window in seconds, and that have finished before t_0 are selected. That is, all the rules in the Rules Dataset that satisfy the conditions t_0 − T ≤ min_created < t_0 and max_ended < t_0. The Rules Dataset contains the min_created and max_created time stamps, which represent the times when the first and the last transfer of a rule were created. Thus the min_created time stamp is equal to the creation time of the rule, and the minimum min_created is the creation time of the first rule in the dataset. Likewise, max_ended is the time stamp of the last transfer to finish, and thus the finishing time of the rule. The aggregation function f is calculated over bins of length ω seconds: the value of the first bin is the f of the Rule TTC of the rules satisfying the condition t_0 − T ≤ min_created < t_0 − T + ω, the value of the second bin is the f of the rules satisfying the condition t_0 − T + ω ≤ min_created < t_0 − T + 2ω, and so on. This generates a time series f̂ with one sample every ω seconds and a total of T∕ω samples. Notice that this time series differs from the real time series in the lags closer to t_0, because of the selection filter described before. Once the time series is obtained, a standard auto-regressive model AR(p) [8] is fitted using all the samples except the last ones, whose number equals the look ahead. The order p of the auto-regressive model equals the look back, meaning the model will need that many samples to make a prediction. Model training and prediction are implemented using AutoReg from the Python statsmodels v0.11.1 package [9]. Once the model is fitted, a forecast is made from the last fitted samples to predict the following lags, as many as the look ahead. The f̂, i.e., the prediction of the γ model, will be the last value of the returned forecast.
Figure caption: Rule TTC prediction using the β model for one of the min(), median(), mean(), or max() functions. The window size was varied to explore its space and test the predictive power of each model. The FoGP metric is calculated for the same 300 random rules for every window size. The experiment is repeated 100 times. The red lines are the median FoGP of the experiment and the green lines are the mean FoGP, for every window size. Notice that the real min/median/mean/max Rule TTC was used to make a prediction and that this value usually is not available in real time, i.e., the rules created during the previous seconds usually take several minutes to complete.
The median of the Rule TTC was calculated to get the time series using bins of 30 s. The observed f̂ time series is shown in orange, while the real time series, that is, the median of the Rule TTC including those rules that end after t_0, is plotted in blue. These time series are very similar except in the last minutes before t_0. The main reason for this discrepancy is that rules created some minutes before t_0 only finish after t_0 and are excluded, because they do not satisfy the condition max_ended < t_0. This will also happen in a hypothetical implementation of this method, where everything after t_0 is unknown because it is in the future.
The look-ahead parameter for the experiments was set to 8 min, a value estimated from the median Rule TTC of the Rules Dataset. This means that most of the rules created up to 8 min before t_0 will be finished at t_0, and thus the difference between the real and the observed time series is mostly confined to the last 8 min. All the models tested fix this parameter to represent 8 min, but as it is expressed in lags, it depends on the bin length ω and is different for every model. The look ahead is calculated as 8 × 60∕ω. That is, for the models with ω = 30 s the look ahead is 16 lags, and for the models with ω = 60 s it is 8 lags. Figure 9 summarizes the results of the experiment for FoGP(y, ŷ_γ, 0.1) for a particular choice of the parameters, that is, how good the γ models are at predicting the Rule TTC of the rules, for f being the minimum, median, mean and maximum functions. This figure can be read as follows. In the lower left plot, for the mean model with parameter values (30, 45, 240, 16), on average, a bit over 10% of the TTC predictions made with the model will have less than 10% relative error. The parameters were selected using a grid search to maximize the FoGP at a threshold of 0.1 for the median function and can be sub-optimal for other functions.
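The procedure described above (binning, fitting an auto-regressive model on all but the last look-ahead samples, and forecasting up to t_0) can be sketched as follows. This is not the implementation used in the study: the AR(p) model is fitted here with a plain NumPy least squares solve as a stand-in for statsmodels' AutoReg, all function names are hypothetical, and empty bins are filled with 0 as an assumption of the sketch.

```python
import numpy as np

def bin_series(rules, t0, window, bin_len, agg=np.median):
    """Aggregate the TTC of rules created in [t0 - window, t0) and finished
    before t0 into consecutive bins of bin_len seconds."""
    bins = [[] for _ in range(window // bin_len)]
    start = t0 - window
    for created, ended in rules:
        if start <= created < t0 and ended < t0:
            bins[int((created - start) // bin_len)].append(ended - created)
    return np.array([agg(b) if b else 0.0 for b in bins])  # assumption: empty bin -> 0

def fit_ar(series, p):
    """Least squares fit of x_t = c + phi_1 * x_{t-1} + ... + phi_p * x_{t-p}."""
    series = np.asarray(series, dtype=float)
    rows = [np.concatenate(([1.0], series[t - p:t][::-1]))
            for t in range(p, len(series))]
    coef, *_ = np.linalg.lstsq(np.array(rows), series[p:], rcond=None)
    return coef

def forecast_ar(series, coef, steps):
    """Iterate the fitted recursion 'steps' lags beyond the end of the series."""
    p = len(coef) - 1
    history = [float(v) for v in series]
    for _ in range(steps):
        lags = history[-1:-p - 1:-1]          # most recent value first
        history.append(float(coef[0] + np.dot(coef[1:], lags)))
    return history[len(series):]

def gamma_predict(series, look_back, look_ahead):
    """Fit on all but the last look_ahead samples, then forecast up to t0."""
    series = np.asarray(series, dtype=float)
    train = series[:len(series) - look_ahead]
    coef = fit_ar(train, look_back)
    return forecast_ar(train, coef, look_ahead)[-1]

# Look ahead of 8 min expressed in lags of 30 s, as in the text.
look_ahead = 8 * 60 // 30
print(look_ahead)  # 16
```

Discarding the last look-ahead samples removes the part of the observed series that is distorted by unfinished rules, at the cost of forecasting further ahead.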

The Model δ n
In his book "Deep Learning with Python" [10], Francois Chollet introduces a DNN architecture able to predict the temperatures for the Jena Dataset [11] slightly better than a naive model. We tried a similar approach to predict the Rule TTC, based on the time series of a number of observables that we suspect could determine the TTC of a rule at its creation time. This study should be considered neither final nor exhaustive, but a preliminary study of the use of Deep Neural Networks to predict the Rule TTC based on the data available at the time.
Figure caption: Rule TTC prediction using the β model for one of the min(), median(), mean(), or max() functions of the real time data available at t_0. The function is calculated using the TTC of those rules that started within the window ending at t_0 and also finished before t_0. This simulates the real time data available in the system to make the prediction, as times beyond t_0 are in the future and TTC data for rules finishing after t_0 is usually not available. The FoGP metric is calculated for the same 300 random rules for every window size. The experiment is repeated 100 times. The red lines are the median FoGP of the experiment and the green lines are the mean FoGP, for every window size. Notice that if the window is small, the prediction is usually zero. That is because the number of rules created and finished within the window is zero.
From the Rules Dataset it is possible to create a set of time series from observables that can influence the Rule TTC. The minimum, median, mean, and maximum Rule TTC of previous transfers demonstrate at least some predictive power and have been used in previous models with limited success. Other variables that could influence the Rule TTC of future rules are the number of transfers pending and the number of bytes pending. A way to calculate these is by extending the routines that calculate the time series for the minimum, median, mean, and maximum functions.
The bins are filled with the sum of the bytes or the sum of the transfers of each rule for both time series, the observed and the real one. The difference between the two is the time series of unfinished transfers and unfinished bytes. These values are known at rule creation time or can be approximated. As the majority of rules are over closed datasets, that is, datasets to which new files cannot be added, the number of files to be transferred and the size of each file are mostly known at rule creation time.
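A minimal sketch of this difference for the number of transfers follows. Rules are represented as hypothetical (min_created, max_ended, n_transfers) tuples with integer timestamps in seconds. The same routine applies to bytes by summing sizes instead of transfer counts.

```python
def transfers_per_bin(rules, t0, window, bin_len, finished_only):
    """Sum the transfers of the rules created in each bin of [t0 - window, t0).

    finished_only=True keeps only rules that ended before t0 (the observed
    series), finished_only=False keeps all of them (the real series).
    """
    series = [0] * (window // bin_len)
    start = t0 - window
    for created, ended, n_transfers in rules:
        if start <= created < t0 and (not finished_only or ended < t0):
            series[(created - start) // bin_len] += n_transfers
    return series

# Illustrative rules: (min_created, max_ended, number of transfers).
rules = [(0, 20, 5), (10, 200, 20), (40, 50, 3), (45, 300, 7)]

real = transfers_per_bin(rules, t0=60, window=60, bin_len=30, finished_only=False)
observed = transfers_per_bin(rules, t0=60, window=60, bin_len=30, finished_only=True)
pending = [r - o for r, o in zip(real, observed)]
print(pending)  # [20, 7]
```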
We develop a model with a structure very similar to that proposed in Chapter 6 of [10]. We call it δ_n, where n is the number of convolutional filters or the number of Long Short-Term Memory (LSTM) neurons. The main difference is the replacement of the Gated Recurrent Unit (GRU) layer of the original Chollet model with an LSTM layer, an improvement suggested in the book. The input of the model consists of 10 channels, each of which represents the time series of some attribute calculated between t_0 − 120 min and t_0, the rule creation time, in bins of 30 s. The attributes used to build these time series were the minimum, median, mean, and maximum Rule TTC of each bin, plus the sum of transfers and bytes of finished, created, and pending rules. Each model was implemented using the Keras/TensorFlow Python API and trained for 120 epochs using the RMSProp optimizer to minimize the Mean Absolute Error loss function. Figure 10 shows the data splitting for training, validation, and testing. Training and validation data were selected based on the distribution of the Rule TTCs at creation time. The training data comprises all the rules created between June 8th 2019 and July 3rd 2019. The validation set includes the rules created between July 4th 2019 and July 10th 2019. Finally, the testing set includes the rules created between July 11th 2019 and July 29th 2019.
Figure caption: The real median Rule TTC time series is plotted in blue. This represents the real median Rule TTC of 30 s bins of those rules created before t_0. The orange line is the observed median Rule TTC, or the TTC of those rules that are finished before t_0. The agreement between the blue and orange lines is good except in the minutes just before t_0. This results in ineffective models when using real time data. The green line in the plot is the data used to fit the median model, that is, the observed median Rule TTC until 8 min before t_0. The red line corresponds to the prediction made by the model for 16 lags ahead of t_0 − 8 min.
Fig. 9 FoGP comparison of different models

The δνν n Model
The δ_n model family does not take into account information about the rule for which we want to predict the TTC, that is, the model does not include information about the target rule. In this section, we present a model that includes the number of transfers the target rule consists of, the sum of bytes of all the transfers, and the links these transfers will affect, that is, the list of sources and destinations for all the transfers. Unlike the time series information fed to the previous model, the data about the target rule is point-wise: it is not data about the past state of the system, but about the present, or t_0, time.
The model has 3 inputs: the several time series representing the past of the system, the sum of bytes and the number of transfers of the rule, and the list of links affected by those transfers. The only output of the system is the Rule TTC. This kind of model cannot be implemented using the Keras Sequential Model. Instead, the Keras Functional API was used to conceive a family of models capable of handling the different types of inputs. We call this family the δνν_n model family, where the n parameter is the number of convolutional filters or the number of LSTM neurons of the model. Figure 11 shows the architecture of the δνν_32 model. Data flows from top to bottom. The left branch is the Chollet-Jena (δ_32) model in charge of digesting the time series data. The center branch input is the number of transfers and the sum of bytes of the target rule. The right branch input is a list of integers, each of which represents a link that will be affected by one of the transfers. This list is truncated to 50, so only the first 50 links are taken into account. If fewer than 50 links are used, the sequence is padded with zeros. The (source, destination) pairs need to be converted into unique numbers to feed the emb_input layer. This is done in a preprocessing stage using the Keras Tokenizer tool. The alphabet of links is 8762 words of the form SRCSITE__DSTSITE. The LSTM layer after the embedding processes the links in order (Fig. 12). Even though link order should not matter, that is, the order in which the links appear in the embedding should not determine or affect the Rule TTC, the usage of this layer has proven important, because the prediction rate over the testing set is 1-2% better for the model family that uses the LSTM layer, as shown in Fig. 13.
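The link preprocessing can be illustrated without Keras. The sketch below is a plain-Python stand-in for the Tokenizer step, not the code used in the study: the site names are hypothetical, index 0 is reserved for padding, unknown links are mapped to 0, and zeros are appended at the end of the sequence, all of which are assumptions of this sketch.

```python
MAX_LINKS = 50  # sequences are truncated or padded to this length

def build_vocabulary(rules_links):
    """Assign a unique positive integer to every SRCSITE__DSTSITE word."""
    vocab = {}
    for links in rules_links:
        for src, dst in links:
            vocab.setdefault(f"{src}__{dst}", len(vocab) + 1)  # 0 is padding
    return vocab

def encode_links(links, vocab, max_len=MAX_LINKS):
    """Map (source, destination) pairs to integers, truncate and zero-pad."""
    ids = [vocab.get(f"{src}__{dst}", 0) for src, dst in links[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Hypothetical site names, for illustration only.
rule_a = [("SITE_A", "SITE_B"), ("SITE_A", "SITE_C")]
rule_b = [("SITE_A", "SITE_B")] * 60

vocab = build_vocabulary([rule_a, rule_b])
encoded_a = encode_links(rule_a, vocab)
encoded_b = encode_links(rule_b, vocab)
print(encoded_a[:3], len(encoded_a), len(encoded_b))
```

Every encoded rule has the same length, which is what a fixed-size embedding input requires.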
Normalization of all the numerical data was done using Eq. 15. This prevents the model from giving more importance to some observables than to others because of their scale. Typical normalization, where the mean is subtracted from the values and the result is divided by the standard deviation, is not enough in this case, due to the very long tail of the distribution of the values. Figure 14 shows the histograms of the normalized vs. not normalized Rule TTC.
Both models were trained using the EarlyStopping callback, which monitors the progress of the validation loss. The callback stops training if there is no improvement after a fixed number of epochs and rolls back the weights to those of the best model so far. For the δ_n models, the patience of the callback was set to 10 epochs. Figure 15 shows that the δ_n model family stops after 10 or 11 epochs, meaning the best model is obtained after only 1 or 2 training epochs. The δ_n models are not able to generalize and, if trained for more epochs, their predictions converge to values around 480.
For the δνν_n models, EarlyStopping patience was set to 5. Figure 12 shows that this model family learns from the training data until epoch 12 in the best case, that is, for model δνν_32. After that, there is no improvement in validation loss. Naturally, the larger the model, that is, the larger n, the faster the model overfits.

Evaluation of Model Performance
It is instructive to compare the models for several values of the threshold, especially in the range (0.01, 0.25), to see how many of the predictions of each model fall below a relative error of between 1% and 25%. For the comparison with the model κ to be fair, the best constant κ for each threshold must be selected in order to maximize the FoGP. Using the same training data used to fit models δ and δνν, the space of constants and thresholds was scanned, calculating the FoGP for κ in the range (0, 2000) in steps of 1 and for the threshold in the range (0.01, 2.0) in steps of 0.01. This procedure defines a surface in ℝ³ with a local FoGP maximum for each threshold. We assume this is the optimal constant to predict the target with a given FoGP. Figure 17 shows this local maximum, that is, the constant that predicts the training set with the highest FoGP. Several things arise from this plot. First, there is a peak at κ = 567, which corresponds neither to the mean Rule TTC of the training set, 1962.1 s, nor to the median of 439 s. This means that predictions using either the mean or the median are sub-optimal in terms of FoGP. Second, when κ = 0 the FoGP reaches 1.0 as soon as the threshold exceeds 1.0. The explanation for this effect is straightforward. If the prediction for any value x is 0, then the relative error is |x − 0|∕x = 1, meaning the error is 100%. As the FoGP measures how many predictions have a relative error below the threshold, when the threshold is greater than 1.0 and the prediction is 0, all the predictions are counted as good. Third, the FoGP values for thresholds in the range (0.01, 0.25) fall between 0.027 and 0.251, meaning that, as the threshold goes from 1% to 25% relative error, between 2.7% and 25.1% of the predictions fall below it.
Caption: FoGP(y, ŷ_δ32, 0.1) and FoGP(y, ŷ_δνν32, 0.1) over 1000 repetitions of the experiment to make a prediction for 300 samples. Numbers show that, on average, 9.9% of the predictions made with model δ_32, also known as Chollet-Jena, have less than 10% relative error. Meanwhile, 13.0% of the predictions made with model δνν_32, also known as FunnelNet, have less than 10% relative error. For comparison, the results of model κ = 562 are shown. This is the model that makes a constant prediction for the Rule TTC of 562 s. For this model 12.2% have less than 10% relative error.
Model κ outperforms model δνν when the threshold is in the range (0.01, 0.04). In the range (0.04, 0.65) the δνν model is better. Both the δ and δνν models return continuous values, and hence it does not make sense to measure their FoGP when the threshold is 0, as the probability of predicting the exact value of the Rule TTC is almost zero. It does make sense to measure it for the κ model, as the constant value is an integer. This explains the better performance of the κ model for low values of the threshold. However, there is no noticeable change in the FoGP when the predictions of the δ and δνν models are rounded.
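The scan for the best constant can be sketched in a few lines. The training targets below are invented for illustration and the function names are hypothetical. The sketch only shows why the constant that maximizes the FoGP need not coincide with the mean or the median, and why a zero prediction scores 1.0 once the threshold exceeds 1.0.

```python
from statistics import mean, median

def fogp(y_true, y_pred, threshold):
    """Fraction of predictions with relative error below the threshold."""
    good = sum(1 for y, p in zip(y_true, y_pred) if abs(y - p) / y < threshold)
    return good / len(y_true)

def constant_fogp(y_true, constant, threshold):
    """FoGP of a model that always predicts the same constant."""
    return fogp(y_true, [constant] * len(y_true), threshold)

def best_constant(y_true, threshold, candidates):
    """Candidate constant with the highest FoGP (first one in case of ties)."""
    return max(candidates, key=lambda c: constant_fogp(y_true, c, threshold))

# Invented training targets in seconds: a cluster of short rules and a long tail.
y_train = [400, 430, 450, 460, 5000, 12000]

best = best_constant(y_train, 0.1, range(0, 2001))
print(best, constant_fogp(y_train, best, 0.1))
print(constant_fogp(y_train, median(y_train), 0.1))  # median as the constant
print(constant_fogp(y_train, mean(y_train), 0.1))    # mean as the constant
print(constant_fogp(y_train, 0, 1.0), constant_fogp(y_train, 0, 1.01))
```

In this toy data the long tail drags the mean far from every target, and the median sits at the edge of the cluster, so a constant chosen by the scan covers more targets than either.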
Among the possible real-world uses of these models, we count improved scheduling of rule and transfer requests and the ability to give users feedback about the time to complete of their transfers. Before any model can be used to improve the scheduling of transfers or rules, two conditions must be satisfied. First, the model must be able to make a prediction at the time the rule or transfer is created, that is, at time t_0. Second, the accuracy of the model must be high enough to actually improve the schedule. From talks with the experts, we expect a useful figure to be a FoGP of around 0.95 at a threshold of 0.1. All presented models except model α_k can make predictions at rule creation time, although the accuracy of the models presented here is below 0.2. The models that can make predictions at t_0 can also be used to give feedback to users about the TTC of their transfers. However, other models that include information from times after t_0 have better accuracy in general and can be used too, depending on whether users need the feedback early in the lifetime of the rule or transfer, or late, in which case the prediction will be more accurate (Fig. 18).

Model κ
Model κ, which always predicts a constant value, allows us to put a lower bound on the performance of the models over a range of interesting threshold values. Optimizing the constant κ to maximize the FoGP results in a model that is surprisingly difficult to improve upon, at both high and low thresholds. Because of its simplicity, and because its performance is comparable to that of more sophisticated models, it should be the preferred model to implement, for example to give feedback to users about the TTC of their transfers. In that case, the upper bound of a confidence interval could be of interest to users.

Model α_k
Model α_k is the only one of the studied models that is not directly comparable with the others, because it cannot make predictions at rule creation time. It needs at least two transfers within the rule to finish before it can be fitted and forecast when the remaining transfers will probably finish. This makes the model suitable for giving feedback to users, but not for improving the scheduler, since the decision about where to send the transfers must be made at rule creation time, before any transfer is submitted or finished. The model shows the non-linearity of the progression of the transfers, giving insight into the nature of the rules and their behavior. The time between transfer submissions for the transfers of a rule is not constant. Rucio's Conveyor daemon may consider that FTS already has a high enough number of transfers and decide not to submit more until some of those active transfers finish, increasing the Rucio Queue Time for part of the transfers of the rule. This directly impacts the Rule TTC, and the model is not able to forecast such future delays.
[Figure caption fragment] … FoGP for a given threshold. The colored points represent the actual FoGP value: the bluer the worse, the redder the better. The red line also represents the achieved FoGP, with the y-axis on the right.
Fig. 18 FoGP comparison over a threshold range from 0.01 to 2 for the best models known, which were presented in this work. Predictions for all the models were made for all the rules created between 2019-07-11 and 2019-07-29. Model β_median(ρ = 30) outperforms all the models. However, the real median of previous Rule TTC needs to be known for the model to work, and this information is not available at t_0. Model δνν is the best model by the FoGP criterion for thresholds between 0.22 and 0.70 and is the model with the greatest potential to be extended. The performance of all the models is comparable with that of model κ, which, for its simplicity, should be the preferred model.

Models β_μ(t_0, ρ) and β*_μ(t_0, ρ)
Models β_μ(t_0, ρ) and β*_μ(t_0, ρ) make a prediction by calculating a function μ over the Rule TTC of the rules created in the last ρ s. The difference between β and β* is that β* excludes the rules that end after t_0. The β model cannot be implemented with real-time data, as it calculates the function μ over the Rule TTC of all the rules that started at some point in the past, including those that have not finished yet. This information from the future makes the two models radically different. One could assume that, if the function μ could be predicted with 100% accuracy, the FoGP of model β would represent the theoretical limit of the FoGP of model β*, as the first includes more information than the second. Yet this statement does not hold in general: for the function that takes the maximum, for example, including more information does not make the model more accurate. The β_max model makes a prediction by calculating the maximum Rule TTC of all the rules created between t_0 − ρ and t_0. The larger ρ is, the greater the chance that the window contains a very slow rule. But β*_max filters out the rules that finish after t_0, so every remaining rule was both created and completed within the last ρ s, and the predicted Rule TTC is capped at ρ. For this reason, the FoGP of β*_max at a threshold of 0.1 presents a peak when ρ is near 600. This value is close to the best constant for model κ at a threshold of 0.1, which is 562. β* models with other functions μ present lower FoGP values than β*_max at a threshold of 0.1, and are thus considered inferior models. Figure 19 shows that β*_max(ρ = 600) outperforms model κ for thresholds between 0.04 and 0.22. This is the best model known to date in that range. It is not possible to implement the β_median model without knowing the Rule TTC of rules that have not finished yet. If a model that perfectly predicts the median of the Rule TTC existed, then β_median(ρ = 30) would show the best performance across a wide range of thresholds.
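The difference between the β and β* windows can be illustrated with a minimal sketch. It assumes that a rule ends at its creation time plus its Rule TTC; the function name `beta_predict`, its parameters, and the toy data are hypothetical and only meant to make the windowing concrete.

```python
import numpy as np


def beta_predict(created, ttc, t0, rho, mu=np.max, only_finished=False):
    """Apply mu to the Rule TTC of rules created in the window [t0 - rho, t0).

    only_finished=False mimics the beta model, which also uses rules that
    end after t0 and therefore needs information from the future.
    only_finished=True mimics the beta* model, which keeps only rules that
    have already ended at t0 (assumed here to be created + ttc <= t0).
    """
    created = np.asarray(created, dtype=float)
    ttc = np.asarray(ttc, dtype=float)
    window = (created >= t0 - rho) & (created < t0)
    if only_finished:
        window &= (created + ttc) <= t0
    if not window.any():
        return None  # no rule in the window, so no prediction
    return float(mu(ttc[window]))


# Toy data in seconds (illustrative only): the second rule is still running at t0.
created = [0.0, 50.0, 90.0]
ttc = [30.0, 500.0, 5.0]
print(beta_predict(created, ttc, t0=100.0, rho=100.0))
print(beta_predict(created, ttc, t0=100.0, rho=100.0, only_finished=True))
```

With `only_finished=True`, every rule kept was created and completed inside the window, so the result can never exceed `rho`, which is the capping effect described above for the maximum.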

Model γ_μ(t_0, ρ, λ, ψ, ω)
The γ model family is the first approach to solving the problem using time series analysis. γ_μ is an auto-regressive (AR) model whose input is the time series of the Rule TTC. The function μ is calculated in bins of ρ s. The input for the AR model consists of λ lags. The model is fitted using ψ lags, and the look-ahead of the model is ω lags. The best model was obtained by scanning the parameter space and maximizing the FoGP, as detailed in Sect. 3.3. Model γ_median(t_0, ρ = 30, λ = 45, ψ = 240, ω = 16) achieved the best FoGP at a threshold of 0.1. If this model predicted the median with 100% accuracy, its results would be comparable with those obtained with β_median. The results show that the model is not as good, especially at low thresholds. The γ_median model is better than the median model of the β family only for thresholds above 0.91. This model does not seem accurate enough, and other, more complex models are worth trying. Integrated models were discarded after verifying that the time series show no trend, so there is no need for differencing. Moving Average models require the time series to be stationary, which is not the case for long runs of Rule TTC time series. A straightforward check showed that the standard deviation from the mean changes over time, and thus a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model is more appropriate.
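As an illustration of the auto-regressive idea only, the following sketch fits an AR model with an intercept by ordinary least squares and forecasts recursively. It assumes the input is an already binned series, for instance the median Rule TTC per bin. The function names, the least-squares fitting method, and the toy series are assumptions and do not reproduce the authors' implementation.

```python
import numpy as np


def fit_ar(series, n_lags):
    """Least-squares fit of x[t] = c + a_1 * x[t-1] + ... + a_p * x[t-p]."""
    x = np.asarray(series, dtype=float)
    # Each row holds the intercept term followed by the lags, most recent first.
    rows = [np.r_[1.0, x[t - n_lags:t][::-1]] for t in range(n_lags, len(x))]
    coef, *_ = np.linalg.lstsq(np.array(rows), x[n_lags:], rcond=None)
    return coef  # coef[0] is the intercept c, coef[1:] are a_1 .. a_p


def forecast_ar(series, coef, horizon):
    """Forecast `horizon` steps ahead, feeding each forecast back as a lag."""
    n_lags = len(coef) - 1
    history = list(np.asarray(series, dtype=float)[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        lags = history[-n_lags:][::-1]  # most recent value first
        nxt = float(coef[0] + np.dot(coef[1:], lags))
        forecasts.append(nxt)
        history.append(nxt)
    return np.array(forecasts)


# Toy series generated by x[t] = 10 + 0.5 * x[t-1] (illustrative only).
x = [0.0]
for _ in range(30):
    x.append(10.0 + 0.5 * x[-1])
coef = fit_ar(x, n_lags=1)
print(coef, forecast_ar(x, coef, horizon=3))
```

In the terms of the text, `n_lags` plays the role of the number of input lags and `horizon` the look-ahead; restricting `series` to its most recent values before calling `fit_ar` would correspond to fitting on a limited number of lags.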

Models δ and δνν
Model δ is the first attempt to solve the forecasting problem with neural networks, using a modified version of a model proposed by F. Chollet. This approach was shown to be ineffective, although its accuracy is higher than that of model γ_median. Model δ includes information about the past state of the system in the form of time series, but it does not include information about the present. Information about the rule that is known at creation time, such as the number of transfers or the total number of bytes the system must process to complete the rule, is not included in model δ. This observation leads to the δνν model, a deep neural network with multiple inputs that includes the time series from model δ, but also the number of bytes, the number of transfers, and the links affected by the rule.
The δνν model is the best practical model for thresholds in the range from 0.25 to 0.70 but, more importantly, it is the easiest model to extend. We expect that this model would benefit enormously if information about failed transfers per link, the history of transfers submitted to FTS, and the history of the link rate were available and could be added as inputs.