Identifying Early Help Referrals For Local Authorities With Machine Learning And Bias Analysis

Local authorities in England, such as Leicestershire County Council (LCC), provide Early Help services that can be offered at any point in a young person's life when they experience difficulties that cannot be supported by universal services alone, such as schools. This paper investigates the use of machine learning (ML) to assist experts in identifying families that may need to be referred for Early Help assessment and support. LCC provided an anonymised dataset comprising 14 360 records of young people under the age of 18. The dataset was pre-processed, machine learning models were built, and experiments were conducted to validate and test the performance of the models. Bias mitigation techniques were applied to improve the fairness of these models. During testing, while the models demonstrated the capability to identify young people requiring intervention or early help, they also produced a significant number of false positives, especially when constructed with imbalanced data, incorrectly identifying individuals who most likely did not need an Early Help referral. This paper empirically explores the suitability of data-driven ML models for identifying young people who may require Early Help services and discusses their appropriateness and limitations for this task.


Introduction
Local authorities in England have statutory responsibility for protecting the welfare of children and delivering children's social care. The COVID-19 pandemic has put pressure on the children's social care sector and exacerbated existing challenges in risk and other assessments [1,2]. Early Help is a service provided by local authorities that offers social care and support to families, including early intervention services for children, young people, or families facing challenges beyond the scope of universal services like schools or general practitioners. Early Help provides services that meet the needs of families who have lower-level support needs (e.g. not child protection) to prevent problems from escalating and entering the social care system (e.g. child protection). Furthermore, Early Help can offer children the support needed to reach their full potential; improve the quality of a child's home and family life; enable them to perform better at school and support their mental health; and support a child to develop strengths and skills that can prepare them for adult life.

Increase of children in need. Data collected from the most recent Child in Need Census 2022-2023 revealed that there are 404 310 children in need, up 4.1 % from 2021 and up 3.9 % from 2020. This is the highest number of children in need since 2018. Furthermore, there were 650 270 referrals, up 8.8 % from 2021 and up 1.1 % from 2020. This is the highest number of referrals since 2019. In 2022, compared with 2021, when restrictions on school attendance were in place for parts of the year due to COVID-19, referrals from schools increased, in turn driving the overall rise in referrals. The Department for Education (DfE) is collecting data on early help provision as part of the Child in Need Census 2022-2023 [3] and the 2023-2024 Census [4]. This census data is submitted by local authorities to the DfE between early April and the end of July each year. Information about early help enables the DfE to understand more about the contact that children in need have with the Early Help services that local authorities provide.

Need for data-driven tools and machine learning. The increasing numbers of children in need and referrals have highlighted the need for data-driven tools that can analyse large datasets to aid local authorities in making informed decisions for individuals at risk, alleviating pressure on a chronically overstretched service. Machine learning (ML), an application of artificial intelligence (AI), can efficiently analyse vast amounts of data from diverse sources. In the context of children's services, this capability allows for the identification of risk factors of which social workers may not otherwise have been aware, such as a family falling behind on rent payments. Such data can be combined with other relevant information, such as school attendance records, to provide a more comprehensive view of the situation. The effectiveness of ML is currently limited by the lack of transparency [5] in the decision-making process of ML models [6] and the data they use [7,8]. Therefore, it is crucial to increase transparency in ML decision-making processes to ensure fair and equitable outcomes for individuals and communities. An ML model may be biased if it systematically performs better on certain socio-demographic groups [9,10]. This can occur when the model has been developed on unrepresentative, incomplete, faulty, or prejudicial data [11-13]. Given the potential impact of bias on individuals and society, there is growing interest among businesses, governments, and organizations in tools and techniques for detecting, visualising, and mitigating bias in ML. These tools have gained popularity in addressing bias-related issues and are increasingly recognized as important solutions for promoting fairness [14-21]. The use of ML in social care involves both technical and ethical considerations [22]. For example, if used responsibly and fairly, these models have the potential to assist in protecting young people [23], particularly when combined with successful early intervention programs such as the Early Start program developed in New Zealand [24]. Responsible use of ML models has the potential to enhance the usefulness of risk assessment tools in child welfare [25]. This paper evaluates the suitability of ML models for identifying young people who may require Early Help Services (EHS) and applies methods for identifying and mitigating bias. For the purposes of this work, Leicestershire County Council (LCC)'s locality triage is categorised as: (1) EH SUPPORT: Early Help support, the most intense type of intervention; (2) SOME ACTION: referral to less intensive services, such as group activities or schemes that run during the school holidays, or to external services; (3) NO ACTION: additional support is not currently required. Specifically, the contributions of this paper are: (a) ML models were implemented and their performance was evaluated across different validation and test sets; and (b) bias analysis was conducted and mitigation algorithms were applied to reduce bias in the ML models. This study revealed that certain educational indicators, such as fixed-term exclusion and free school meals, may predict the need for EHS.

Dataset
The dataset contains records of young people under 18 years of age and was provided by LCC. The data relate to families and individuals assessed for Early Help support between April 2019 and August 2022. The time period of data included within the features varies depending on the age of the young person, with older individuals having data across a longer time frame. The initial dataset contained 15 976 records and 149 features. The total percentage of missing values was 5.41 %, and the total percentage of NA values was 20.33 %.
To pre-process the dataset, missing values were replaced with 0, and records with more than 30 % of missing values were removed (10 % of the data). For cells that contained NA values, each relevant feature was paired with another feature called FEATURENAME NA that received the value of 1 if the original feature was not relevant to the record. For example, the feature Not in Education, Employment or Training (NEET) is not applicable to those under 16 years, resulting in the presence of NA values. Supplementary Table S11 contains the statistics of the features before one-hot encoding.
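As a minimal sketch of this encoding step (with a hypothetical NEET column and toy values, not the actual LCC schema):

```python
import numpy as np
import pandas as pd

# Toy frame with a hypothetical NEET feature: NaN marks "not applicable"
# (e.g. the record concerns a young person under 16).
df = pd.DataFrame({
    "AGE": [14, 17, 16, 10],
    "NEET": [np.nan, 1.0, 0.0, np.nan],
})

# Pair the NA-prone feature with a companion NEET_NA indicator column:
# 1 when the feature was not applicable to the record, 0 otherwise.
df["NEET_NA"] = df["NEET"].isna().astype(int)

# Replace the remaining missing values with 0, as in the pre-processing step.
df["NEET"] = df["NEET"].fillna(0)

print(df)
```

This keeps the "not applicable" signal available to the models instead of conflating it with a genuine zero.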
After pre-processing, the dataset contained 14 360 records and 149 features. The number of features with less than 5 % of missing values is 64, while the number of features with less than 20 % of NA cells is 91. After applying one-hot encoding, the number of features in the pre-processed dataset was 363. The feature LOCALITY DECISION represents the target variable with three categories: SOME ACTION (56.59 %), EH SUPPORT (33.10 %), and NO ACTION (10.31 %). Those who received EHS belong to the EH SUPPORT category. Any other type of service provided by LCC to a child, or signposting to an external organisation, is considered SOME ACTION, and young people who did not receive any action are labeled in the NO ACTION category. The remaining input features represent educational indicators and are grouped into topics such as Absence, Exclusion, School Transfer, Free School Meal (FSM), Special Educational Needs and Disabilities (SEND), Pupil Referral Unit (PRU), Home Education, Missing, Not in Education, Employment or Training (NEET), Early Years Funding (EYF) and the Income Deprivation Affecting Children Index (IDACI). A description of these features can be found in Supplementary Table S5.

Machine learning model evaluations
The aim is to identify the best-performing ML model for predicting three LOCALITY DECISION outcomes: EH SUPPORT, SOME ACTION and NO ACTION. The dataset was divided into two sets: a training/validation set with 10 052 records (70 %) and a test set with 4 308 records (30 %). The following ML techniques were then evaluated using stratified 10-fold cross-validation (CV): Ridge Classification, Logistic Regression, Support Vector Classification (Linear and Kernel), K-Nearest Neighbors (KNN) Classifier, Gaussian Naive Bayes, Decision Tree, Random Forest Classifier, Gradient Boosting Classifier, Extreme Gradient Boosting, Ensemble Methods (AdaBoost, CatBoost) and Discriminant Analysis (Linear and Quadratic). Supplementary Tables S1-S3 show the results of evaluating the above-mentioned models for each LOCALITY DECISION outcome.
The hyperparameter settings of each model are shown in Supplementary Table S4 for reproducibility purposes. The best models were chosen based on their area under the curve (AUC), recall, and precision scores on the validation sets. To ensure that the best models did not suffer from high variance, the performance of each best model was evaluated using the 10-fold cross-validation approach, repeated 30 times on the training set. A random seed generator was applied to create a different sequence of values each time the k-fold cross-validation was run to ensure randomness. The test set remained the same across the 30 iterations. The average and standard deviation values across the iterations were recorded for the validation and test sets.
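The repeated cross-validation protocol can be sketched as follows; synthetic data and a plain logistic regression stand in for the real dataset and the tuned models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the pre-processed LCC data (the real features
# and labels are not reproduced here); roughly 33% positives.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.67, 0.33], random_state=0)

model = LogisticRegression(max_iter=1000)

# Repeat stratified 10-fold CV 30 times, re-seeding the fold generator on
# each repetition so the splits differ, and record the spread of AUC scores.
aucs = []
for seed in range(30):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    aucs.append(scores.mean())

print(f"AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

scikit-learn's `RepeatedStratifiedKFold` wraps the same loop in a single object; the explicit loop is shown here to make the per-repetition re-seeding visible.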
Multi-class models were implemented, but these did not perform to a satisfactory standard (see Supplementary Tables S6-S7). As a solution, separate binary models were implemented for predicting each outcome (i.e. one model per outcome), and these achieved better results. Hence this paper presents the analysis of the separate models; the results of the multi-class models can be found in the supplementary materials.

Evaluation metrics
The AUC, recall and precision evaluation metrics were utilised to evaluate and compare the predictive performance of the ML models. Recall (also known as sensitivity or true positive rate) measures the proportion of actual positive cases that a model correctly identifies. Precision (also known as positive predictive value) measures the proportion of positive predictions made by the model that are accurate. Recall and precision are calculated as Recall = TP/(TP + FN) and Precision = TP/(TP + FP), where, for a given outcome (EH SUPPORT, SOME ACTION or NO ACTION):
• true positive (TP) refers to a young person who required the outcome and was predicted by the model as such;
• true negative (TN) refers to a young person who did not require the outcome and was predicted as such;
• false negative (FN) refers to a young person who required the outcome but was predicted as not requiring it;
• false positive (FP) refers to a young person who did not require the outcome but was predicted as requiring it.
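These definitions can be checked with a small helper; the counts below are illustrative, chosen to mirror the recall/precision profile reported later in the Results section:

```python
def recall_precision(tp, fp, fn):
    """Recall = TP/(TP+FN); Precision = TP/(TP+FP)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision

# Illustrative counts: of 100 young people who required a service, the
# model flags 83 (TP) and misses 17 (FN), while also flagging 135 who
# did not require it (FP).
r, p = recall_precision(tp=83, fp=135, fn=17)
print(f"recall={r:.2f}, precision={p:.2f}")  # recall=0.83, precision=0.38
```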
The receiver operating characteristic (ROC) curve illustrates the trade-off between the true positive rate (TPR) and the false positive rate (FPR), and the AUC represents an aggregate metric that evaluates the classification performance of a model. The closer the AUC is to 1, the better the performance of the classifier. The distribution of the classes EH SUPPORT, SOME ACTION, and NO ACTION in the target variable LOCALITY DECISION is imbalanced. In such cases, adjusting the decision threshold can be a useful technique for improving the performance of ML models and reducing the occurrence of FNs, which are often the most costly errors in imbalanced classification problems. By default, the threshold is set to 0.5. However, an appropriate threshold was chosen based on the trade-off between the costs associated with FPs and FNs. This was achieved by calculating the precision-recall curve and selecting the threshold that maximizes recall. The process of threshold selection is described in the Threshold adjustment section.

Bias mitigation in ML models
The threshold optimizer and exponentiated gradient algorithms were applied as techniques to mitigate bias in the ML models. The bias evaluation considered the following sensitive features: GENDER, AGE AT LOCALITY DECISION, ATTENDANCE, and IDACI. The false negative rate (FNR) was used as the metric for bias mitigation, since it represents those who would benefit from EH SUPPORT (or SOME ACTION or NO ACTION) but were not predicted as such. The two-sample Z-test for proportions was applied to evaluate whether there is a significant difference between the FNRs of the categories of a given sensitive feature (e.g., GENDER, AGE AT LOCALITY DECISION, ATTENDANCE, and IDACI). The null hypothesis is that there is no significant difference between the two proportions (i.e. FNRs), while the alternative hypothesis is that there is a significant difference between them. Under the null hypothesis, the test statistic follows a standard normal distribution, and the p-value can be calculated using this distribution. If the p-value is less than the chosen significance level (α = 0.05), the null hypothesis is rejected and it can be concluded that there is a significant difference between the two FNR values; the ML model may then be biased with respect to the sensitive feature under scrutiny. In this case, bias mitigation algorithms are considered to reduce the bias for the sensitive feature.
Otherwise, if the p-value is greater than the significance level (α = 0.05), the null hypothesis is not rejected, and it can be concluded that there is not enough evidence to suggest a significant difference between the two FNRs. In this case, the ML model is considered not to present bias for the sensitive feature under scrutiny.
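A sketch of the test under these hypotheses, using hypothetical group counts whose FNRs are close to the FEMALE/MALE values reported in the Results:

```python
from math import sqrt

from scipy.stats import norm

def two_sample_z_test(fn1, n1, fn2, n2):
    """Two-sided two-sample Z-test for a difference in FNR proportions.

    fn1/n1 and fn2/n2 are false-negative counts over actual positives
    in each group (e.g. FEMALE vs MALE)."""
    p1, p2 = fn1 / n1, fn2 / n2
    pooled = (fn1 + fn2) / (n1 + n2)          # pooled proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))             # two-sided p-value
    return z, p_value

# Hypothetical counts giving FNRs of 0.19 and 0.16.
z, p = two_sample_z_test(fn1=95, n1=500, fn2=80, n2=500)
print(f"z={z:.2f}, p={p:.3f}")  # p > 0.05: fail to reject H0
```

With these illustrative counts the difference is not significant at α = 0.05, matching the paper's conclusion for GENDER.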

Threshold adjustment
Threshold adjustment is applied to find the threshold that maximizes the performance of the model in terms of precision and recall. The process of adjusting the threshold for the EH SUPPORT model is described as follows. After the model has been trained, it returns a probability score reflecting the confidence of the model's prediction. The threshold value is then used to decide whether the prediction should be classified as EH SUPPORT (class 1) or not (class 0). For example, without adjustment, when the model predicts a probability greater than 0.5, the record is labeled as class 1; otherwise it is labeled as class 0. With adjustment, the threshold value was adjusted based on the threshold analysis illustrated in Figure 1. This analysis revealed that the optimal cutoff for classification purposes (i.e., the point with the best balance between precision and recall) occurs at a threshold of 0.27 for the gradient-boosting classifier (GBC) model and at a threshold of 0.25 for the logistic regression (LR) model. The F1 curve represents the harmonic average between the precision and recall rates. The performance of these models (with and without the threshold) is then evaluated using a stratified 10-fold CV over 30 iterations. Table 1 presents the average and standard deviation of the recall and precision for both the GBC and LR models (with and without the threshold).

Classifier              Recall            Precision
GBC                     0.1069 ± 0.0120   0.5296 ± 0.0370
GBC (with threshold)    0.8133 ± 0.0180   0.3839 ± 0.0038
LR                      0.1604 ± 0.0124   0.5047 ± 0.0232
LR (with threshold)     0.8242 ± 0.0125   0.3781 ± 0.0035

Table 1: Threshold adjustment for the EH SUPPORT model: predictive performance for the GBC and LR models (with and without the threshold). Average and standard deviation for recall and precision using stratified 10-fold CV over 30 iterations.

The same process was followed for the SOME ACTION model, for which the threshold value was set to 0.5. For the NO ACTION model, the optimal threshold was 0.46. Supplementary Fig. S4 illustrates these results.
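The threshold-selection step can be sketched as follows; synthetic imbalanced data and an off-the-shelf `GradientBoostingClassifier` stand in for the real dataset and the tuned GBC, so the resulting cutoff is illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in: roughly 33% positives, as for EH SUPPORT.
X, y = make_classification(n_samples=2000, weights=[0.67, 0.33], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds and pick the one maximizing F1, the harmonic
# mean of precision and recall (the "best balance" point on the F1 curve).
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the last precision/recall pair has no threshold
print(f"best threshold={thresholds[best]:.2f}, F1={f1[best]:.2f}")

# Classify with the adjusted threshold instead of the default 0.5.
preds = (probs >= thresholds[best]).astype(int)
```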

Results
This section describes the performance of ML models for predicting whether a young person requires EHS. Specifically, it evaluates binary classification models for predicting each of the following outcomes: (1) EH SUPPORT; (2) SOME ACTION; (3) NO ACTION. The local interpretable model-agnostic explanations (LIME) [26] method is then applied to explain the model predictions by identifying the features that were most important for correct or incorrect classifications. Supplementary Table S5 gives a description of all features used in this work. Finally, the threshold optimizer [27] and exponentiated gradient reductions [28] techniques are applied to mitigate bias, and their suitability for this task is assessed.
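LIME explains an individual prediction by fitting a simple surrogate model on perturbed inputs weighted by their proximity to the instance. The sketch below illustrates that idea with a ridge surrogate on synthetic data; it is a simplified stand-in for, not a reproduction of, the lime package used in the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge

# Black-box model on synthetic data standing in for an EHS classifier.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

def lime_style_explanation(instance, n_samples=2000, width=1.0, seed=0):
    """Fit a locally weighted linear surrogate around one instance and
    return per-feature coefficients (positive values push towards class 1)."""
    rng = np.random.default_rng(seed)
    # Perturb the instance with Gaussian noise and query the black box.
    Z = instance + rng.normal(scale=0.5, size=(n_samples, instance.size))
    probs = clf.predict_proba(Z)[:, 1]
    # Weight perturbations by proximity to the instance (RBF kernel).
    d2 = ((Z - instance) ** 2).sum(axis=1)
    w = np.exp(-d2 / width**2)
    surrogate = Ridge(alpha=1.0).fit(Z, probs, sample_weight=w)
    return surrogate.coef_

coefs = lime_style_explanation(X[0])
print("local feature contributions:", np.round(coefs, 3))
```

The signs of the coefficients play the role of the positive and negative LIME values discussed below for the TP and TN groups.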

Can we predict whether a young person needs Early Help support? Model for EH SUPPORT
Model performance. Supplementary Table S8 lists the validation performance of the GBC and LR models across 30 iterations. The LR model outperformed the GBC on the validation sets (i.e. across 30 iterations) with an average AUC of 0.62 (standard deviation σ = 0.01), an average recall of 0.82 (σ = 0.01) and an average precision of 0.38 (σ = 0.00). On the test set, LR reached an AUC of 0.63, a recall of 0.83 and a precision of 0.38. Figure 2 illustrates the ROC curve for the LR classifier, and Supplementary Table S10 presents the test performance metrics and the optimal ROC point.

Interpretation of results. The LR model demonstrates a moderate ability to differentiate between young people who require early help and those who do not, as indicated by the AUC of 0.63. However, it has a relatively high recall of 0.83, meaning that it captures a good proportion of the young people who require early help. On the other hand, the precision of 0.38 suggests that the model generates a considerable number of FPs, incorrectly identifying some young people as needing early help.
Factor analysis. The findings from the LIME analysis revealed the factors related to young people correctly classified by the EH SUPPORT model. Those who did not require EH SUPPORT, i.e. the true negative (TN) group, had a median age of 14 years, compared to a median age of 8 years for those who did require EH SUPPORT, i.e. the true positive (TP) group. A higher proportion of young people who required EH SUPPORT (91 %) attended a pupil referral unit (PRU) and received special educational needs and disabilities education, health and care (SEND EHC) support, compared to those who did not require EH SUPPORT (77 %). Those for whom the special educational needs and disabilities (SEND NEEDS) or special educational needs and disabilities referral (SEND REFERRAL) services were not applicable (NA) were less likely to require EH SUPPORT. Supplementary Fig. S1 illustrates these findings. For those belonging to the TP group, i.e. those who required EH SUPPORT and were correctly classified by the model, the most relevant features were PERMANENT EXCLUSION, Not in Education, Employment or Training (NEET), School Transfer Phased (TRANSFER PHASED), SEND REFERRAL and PRU.
Figure 3 shows all important features for the young people who were correctly classified by the EH SUPPORT model. The negative LIME values in Figure 3a and the positive LIME values in Figure 3b show the most important features of the TN and TP groups, respectively. The LIME analysis also revealed that the model prioritized features with lower percentages of missing data. For example, the feature NEET (YEAR 4) has 0.1 % missing values, whereas NEET (PREV 3TERMS) and NEET (YEAR 1) have 3.9 % and 2.3 % missing values, respectively. Another example is PRU (YEAR 5) with 1.6 % missing values and PRU (YEAR 1) with 6.3 % missing values (see Supplementary Table S11).

Bias analysis. Figure 4 shows the application of the bias mitigation algorithms. The threshold optimizer (labelled 'Post-processing') and the exponentiated gradient reductions algorithm (labelled 'Reductions') reduced the difference in FNRs across categories, but at the cost of an overall increase in the FNR. A two-sample Z-test for proportions compared the differences in the FNR values within the categories of the sensitive features GENDER and IDACI. The test was not performed for the category OTHER because it comprises only 0.4 % of the data. According to the Z-test, there were no significant differences between the FNR values of FEMALE (0.19) and MALE (0.16), nor between most of the categories of IDACI, where the FNR oscillated between 0.13 and 0.22. Supplementary Table S12 details these results and concludes that the LR model does not present bias in the sensitive features GENDER and IDACI.
Can we predict whether Some Action is needed? Model for SOME ACTION

Model performance. The GBC reached an average AUC of 0.60 (σ = 0.01), an average recall of 0.79 (σ = 0.02) and an average precision of 0.61 (σ = 0.01) across 30 iterations. On the test set, the model had an AUC of 0.60, a recall of 0.81 and a precision of 0.61.

Interpretation of results. The GBC model demonstrates a modest ability to differentiate between young people who require some action and those who do not, as indicated by the AUC of 0.60. It has a relatively high recall of 0.79, indicating its capability to capture a good proportion of young people who require action. However, the precision of 0.61 suggests that the model still generates a considerable number of FPs, incorrectly identifying some young people as requiring action when they do not actually need it.

Factor analysis. The LIME analysis revealed that the median age of those who did not receive SOME ACTION was 6 years, versus 12 years for those who did. In the former group, a higher proportion received early years funding (EYF) and had free school meals (FSM). The average number of FIXED-TERM EXCLUSION sessions was higher (0.184 on average) among those who received SOME ACTION, compared to those who did not (0.065 on average). Young people who received SOME ACTION had a higher average AGE AT LOCALITY DECISION (9.3 years) compared to those who did not receive SOME ACTION (8.5 years). The ATTENDANCE (year 4) feature (i.e. the percentage of total school attendance sessions in the academic year four years previous) also presented a difference between those who received SOME ACTION (61.5 %) and those who did not (56 %). For the other features, a small difference between the groups was observed (less than 2 %). Since AGE AT LOCALITY DECISION and ATTENDANCE (year 4) are numerical features, it was necessary to create new categorical features (CLASS AGE and ATTENDANCE BIN, respectively) for evaluating the presence of bias. Supplementary Fig. S5 represents histograms for both features. The feature CLASS AGE was categorised into three groups: below 7.5 years (group A), 7.5-12.5 years (group B) and above 12.5 years (group C). There was a significant difference between the FNR values for all the groups. Similarly, for the ATTENDANCE BIN feature, the FNR was 0.12 in the group > 0.5 and 0.36 in the group ≤ 0.5. The test concluded that there was a significant difference between the FNR values of the groups. Supplementary Table S11 details these findings. Hence, these results reveal the presence of bias in the features CLASS AGE and ATTENDANCE BIN.

Bias analysis. The use of the bias mitigation post-processing algorithm (threshold optimizer) reduced the gap between the categories, but it resulted in an increase in the FNR of more than 0.30 in all groups. On the other hand, the exponentiated gradient reductions algorithm decreased the FNR in most groups and produced the closest FNRs between the groups, giving a better outcome. Figure 7 illustrates these findings. The GBC with the exponentiated gradient reductions algorithm had a precision of 0.59 and a recall of 0.85 on the test set, whereas the unmitigated GBC model had a precision of 0.61 and a recall of 0.80. From these results, it can be observed that the GBC model with the exponentiated gradient reductions algorithm outperforms the unmitigated GBC model in terms of recall (0.85 > 0.80). However, the unmitigated GBC model has a slightly higher precision (0.61 > 0.59). Moreover, the use of the exponentiated gradient algorithm reduced the overall FNR from 0.21 to 0.15.
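The threshold-optimizer style of post-processing learns group-specific decision thresholds so that an error rate is equalised across groups. The following self-contained sketch illustrates that underlying idea with a hypothetical binary group variable and an assumed target FNR of 0.15; both are illustrative choices, not the paper's actual configuration or library calls:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with a binary "sensitive" group derived from one feature
# (a hypothetical stand-in for a feature such as CLASS AGE).
X, y = make_classification(n_samples=2000, weights=[0.6, 0.4], random_state=1)
group = (X[:, 0] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]

def fnr(y_true, y_pred):
    """False negative rate: missed positives over actual positives."""
    pos = y_true == 1
    return float(np.mean(y_pred[pos] == 0))

def per_group_thresholds(probs, y, group, target=0.15,
                         grid=np.linspace(0.05, 0.95, 19)):
    """Pick, per group, the threshold whose FNR is closest to a common
    target, so the groups' FNRs end up as close to each other as possible."""
    thresholds = {}
    for g in np.unique(group):
        mask = group == g
        errs = [abs(fnr(y[mask], (probs[mask] >= t).astype(int)) - target)
                for t in grid]
        thresholds[g] = grid[int(np.argmin(errs))]
    return thresholds

th = per_group_thresholds(probs, y, group)
preds = np.array([int(p >= th[g]) for p, g in zip(probs, group)])
for g in (0, 1):
    m = group == g
    print(f"group {g}: threshold={th[g]:.2f}, FNR={fnr(y[m], preds[m]):.2f}")
```

Equalising an error rate by moving per-group thresholds is exactly the trade-off observed above: the gap between groups shrinks, but individual groups can end up with a higher overall FNR than under a single shared threshold.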

Can we predict whether NO ACTION is needed? Model for NO ACTION
Model performance. Supplementary Table S9 provides the validation performance of the ML models over 30 iterations.
The validation results of the LR model reached an average AUC of 0.56 (σ = 0.01), an average recall of 0.60 (σ = 0.03), and an average precision of 0.11 (σ = 0.01). On the test set, the model had an AUC of 0.56, a recall of 0.63 and a precision of 0.12. None of the models performed well on this task; the strong class imbalance is reflected in the low recall and precision values. Figure 8 illustrates the ROC curve, and Supplementary Table S10 presents the predictive performance metrics and the optimal ROC point for the LR classifier on the test set.

Factor analysis. A high proportion of those who had NA values in the SEND NEED feature did not belong to the NO ACTION category (hence they had received EH SUPPORT or SOME ACTION). This result aligns with the finding obtained by the EH SUPPORT model. Young people in the TN group had a median age of 9 years, compared to a median of 12 years in the TP group; however, the box plot suggests little difference between the age distributions of the TN and TP groups (see Supplementary Fig. S3).

Bias analysis. The LR model for NO ACTION had an FNR of 0.36 for the category FEMALE and 0.39 for the category MALE. The FNR values between the categories of IDACI oscillated between 0.30 and 0.41. According to the two-sample Z-test, there was no significant difference between the FNR values of FEMALE and MALE, nor between the categories of IDACI. This suggests that the LR model is not biased with respect to these features. Regarding the feature CLASS AGE, the LR model had an FNR of 0.36 in age group A (below 7.5 years), 0.56 in age group B (7.5-12.5 years) and 0.23 in age group C (above 12.5 years). The two-sample Z-test concluded that there was a significant difference between the FNR values of all groups, suggesting the presence of bias in the feature CLASS AGE. Supplementary Table S12 details these results.
The use of the bias mitigation algorithms, threshold optimizer (post-processing) and exponentiated gradient (reductions), decreased the difference between the categories but resulted in an increase in the FNR for all groups, which is not ideal for classification purposes. The model correctly identified 61 % of all young people within the NO ACTION category on the test set. However, the bias mitigation algorithms did not reduce the FNR, and the presence of a strong imbalance in the target variable affected the predictive performance of the model.

Discussion
Overview. For the social care task described in this paper, models were developed to determine whether machine learning can assist human decision-makers in identifying families whose young people may require EHS. Young people who could benefit from EHS but are not identified or offered such services can be disadvantaged by the social care system. Therefore, it was important to identify the characteristics of those who could be disproportionately negatively impacted by the ML models, as a strategy for understanding and communicating the limitations of these models. Since the dataset was sparse and noisy, adequate data treatment was required prior to the use of the ML models. Imputation techniques were considered, as well as the use of one-hot encoded features, allowing the ML algorithms to distinguish between NA, missing, and filled-out cells. The pre-processed dataset was thereafter utilised for data analysis and ML tasks.
Model testing. During testing, while the models show some capability in capturing young people who require intervention or early help, they also generate a significant number of FPs, incorrectly identifying individuals who do not actually need an early help referral. This indicates room for improvement in terms of precision and overall performance in accurately identifying those in need of action or help.
Bias analysis. The bias analysis revealed that the sensitive features GENDER and IDACI did not bias the models with regard to predicting LOCALITY DECISION (i.e. EH SUPPORT, SOME ACTION, NO ACTION). However, the analysis identified AGE AT LOCALITY DECISION and ATTENDANCE (year 4) as sensitive features for which the difference between the FNR values of some groups was statistically significant: for example, between young people aged below 7.5 years (FNR = 0.38) and those aged 7.5-12.5 years (FNR = 0.03). The use of bias mitigation algorithms reduced the FNR in these groups and improved the predictive performance of the EH SUPPORT and SOME ACTION models. The data imbalance in the NO ACTION category affected that model's predictive performance. Those correctly classified by the EH SUPPORT model as not requiring EH SUPPORT had a median age of 14 years, compared to a median age of eight years for those who did require EH SUPPORT. A higher proportion of young people who required EH SUPPORT attended PRU services or received SEND support. In contrast, the median age of those who did not receive SOME ACTION was 6 years, while those who received SOME ACTION had a median age of 12 years. Furthermore, in the group of those who did not receive SOME ACTION, a higher proportion received benefits such as EYF or FSM. Although a variety of ML algorithms and bias mitigation techniques were considered, fairness is a socio-technical challenge; mitigations are therefore not all technical and need to be supported by processes and practices. The use of sensitive features, including demographic information, during the analysis of the ML results can enhance the understanding of model behaviour and aid the identification of groups that could be subject to bias. It is important to assess ML performance differences across groups and the likelihood of bias.

Conclusion.
The findings from our study demonstrate that ML has the potential to support decision-making in social care, and the results highlight that further research is needed to develop methods that work on such complex datasets. In particular, further research is needed on methods and strategies for dealing with missing, uncertain, and sparse data, and on ML models that can provide clear explanations for their predictions. Research is also required on how best to visualise and communicate the outputs of ML models to end-users in a way that supports decision-making.

Limitations
The limitations of the study are as follows.
• A limitation of this study is that a decision was taken early on to restrict the analysis to young people aged 18 and under at the point of assessment. However, this leaves a blind spot: young people (n = 23 420) aged 18 and under who were never referred for assessment (but possibly should have been) are not studied.
• The original dataset contained information about FIRST LANGUAGE. However, this feature was excluded from the study due to a high proportion (29%) of missing values.
• The dataset also contained information on ETHNICITY. Ethnicity was categorised as either WHITE or NON-WHITE due to the lower frequency of other ethnic categories (see Supplementary Table S13). Analysis of the ML results revealed that performance was similar across these groups and models (see Supplementary Table S14). The team is currently pursuing a follow-up collaboration with LCC to expand this study. Future work involves analysing the data of primary and secondary school young people who require EH SUPPORT and incorporating new demographic features to uncover new insights and findings.
• Imputation methods were explored in our previous study and are not reported in this paper. None of the imputation methods explored were suitable for imputing missing values in this dataset due to their sensitivity, and it was therefore considered ethical not to impute the missing values. Instead, values that were missing not at random were treated using one-hot encoding. Further study is needed to develop algorithms suitable for imputing values that are missing at random.
• Class imbalance appears to contribute to the models' low precision values. The SOME ACTION model, which was trained on a near-balanced dataset, achieved the highest precision (61%), compared to the models that were not trained on balanced datasets (EH SUPPORT: 38% precision; NO ACTION: 12% precision).
• This paper focuses on the needs and characteristics of individuals and models their service requirements accordingly. This approach does not take into consideration the broader family group and how its complex interrelationships may affect the requirements of each individual within it. With EHS being a whole-family intervention service, future work is needed to understand those interrelationships and identify requirements at the family level.
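The one-hot treatment of missing-not-at-random values mentioned above can be sketched with pandas; the feature names and bins below are illustrative, not the study's exact encoding.

```python
import numpy as np
import pandas as pd

# Illustrative records; NaN marks a value that is missing not at random
df = pd.DataFrame({
    "ATTENDANCE_YEAR_4": [0.92, np.nan, 0.45, np.nan],
    "GENDER": ["FEMALE", "MALE", np.nan, "FEMALE"],
})

# Bin the numeric feature, then one-hot encode with an explicit NA column,
# so "missing" becomes its own category instead of being imputed
att_bin = pd.cut(df["ATTENDANCE_YEAR_4"], bins=[0, 0.5, 1],
                 labels=["LOW", "HIGH"])
encoded = pd.get_dummies(
    pd.DataFrame({"ATTENDANCE": att_bin, "GENDER": df["GENDER"]}),
    dummy_na=True,
)
# encoded now contains ATTENDANCE_nan / GENDER_nan indicator columns,
# preserving the missingness signal for the downstream model
```

Because missingness itself may carry information in administrative data, keeping it as an explicit category avoids fabricating values for sensitive features.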

Impact on social care
The findings of this paper provide an entry point for local authorities into using AI to support the optimal provision of EHS. Whilst acknowledging the limitations and the need to approach implementation very carefully, this is a positive step on the long road to incorporating AI into the decision-making process within EHS, and potentially the broader remit of children's social care. At this early stage, a suitable use-case for the model would be to provide additional, data-driven support to the triage process, placing the AI outputs alongside the descriptive referral case notes and information collected by front-line workers. With the focus throughout this paper being on providing explainable AI models, a softer benefit would be to build practitioners' confidence in, and understanding of, AI and the benefits it could bring to their daily decision-making.
In addition to providing a more complete understanding of the needs of those referred to EHS, the model also has the potential to help identify those who may need support but have not been referred. With the focus of EHS being to provide support and intervention before issues escalate, identifying this group and acting accordingly would be expected to reduce the need for higher-intensity support later on. Given the current limitations of the model, such an approach would need careful consideration of how it fits into existing referral processes. It is not considered justifiable, certainly at this stage, for referrals and allocation of provision to be driven by AI.

Declarations
1. Competing interests. The authors declare no competing interests.
2. Data availability. The data cannot be made publicly available because the demographic features, in combination with other features, may reveal a child's identity [29]. Statistical information about the data has been provided in the Supplementary material.
Table S3. NO ACTION: Validation performance of the ML models. Single run of stratified 10-fold CV. Predominant need (SEND NEED) is included independently of the level of support provided.
Features description. Features marked with an asterisk (*) are computed for the term before the locality decision (PREV TERM), the previous three terms (PREV 3 TERMS), and the academic year previous (YEAR 1), two years previous (YEAR 2), three years previous (YEAR 3), four years previous (YEAR 4) and five years previous (YEAR 5).

Figure 1 :
Figure 1: Threshold analysis for the GBC and LR EH SUPPORT models.

Figure 2 :
Figure 2: EH SUPPORT Model: ROC curve, AUC and optimal ROC point for the LR classifier on the test set.
(a) Most important features of the TN group. (b) Most important features of the TP group.

Figure 3 :
Figure 3: LIME analysis of young people correctly classified by the EH SUPPORT model (LR classifier) on the test set.

Figure 4 :
Figure 4: LR model for EH SUPPORT. FNR of the unmitigated and mitigated models (on the test set) for features GENDER and IDACI. Note that 'Reductions' refers to the exponentiated gradient reductions algorithm.

Figure 5 :
Figure 5: SOME ACTION Model: ROC curve, AUC and optimal ROC point for the GBC on the test set.
Fig. S2 illustrates these results. Moreover, amongst those who received SOME ACTION and were correctly classified by the model (TP group), the most relevant features were PERMANENT EXCLUSIONS, FSM, SEND REFERRAL, AGE AT LOCALITY DECISION (11-13 years) and MISSING EDUCATION. The negative LIME values in Figure 6a represent the most important features of the TN group, and the positive LIME values in Figure 6b show the most important features of the TP group.
(a) Most important features of the TN group. (b) Most important features of the TP group.

Figure 6 :
Figure 6: LIME analysis for those correctly classified (test set) by the SOME ACTION model (GBC).
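The LIME analyses above can be illustrated with a minimal LIME-style local surrogate: perturb samples around one instance, weight them by proximity, and fit a weighted linear model whose coefficients act as local feature importances. This is a sketch of the technique on synthetic data, not the exact lime-package pipeline used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(42)

# Stand-in "black box": a classifier trained on synthetic tabular data,
# where the label depends mostly on feature 0 (and weakly on feature 1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
black_box = LogisticRegression().fit(X, y)

def lime_style_explain(model, x, n_samples=2000, kernel_width=1.0):
    """Fit a locally weighted linear surrogate around instance x."""
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))  # perturbations
    probs = model.predict_proba(Z)[:, 1]                     # black-box output
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, probs, sample_weight=w)
    return surrogate.coef_                                   # local importances

coef = lime_style_explain(black_box, X[0])
# Feature 0 should dominate the local explanation, matching the data rule
```

As in Figures 3, 6 and 9, the sign of each coefficient indicates whether a feature pushes the local prediction towards the positive (TP) or negative (TN) class.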

The CLASS AGE feature considers three categories: below 7.5 years (group A), 7.5-12.5 years (group B), and above 12.5 years (group C). Additionally, the new binary feature ATTENDANCE BIN considers two categories: ≤ 0.5 and > 0.5. The GBC model for SOME ACTION had an FNR of 0.22 for the category FEMALE and 0.20 for the category MALE; category OTHER had the lowest FNR (0.08). The FNR values across the categories of IDACI ranged between 0.18 and 0.25. Two-sample Z-tests were carried out to compare differences between categories in order to identify potential biases within the GBC model. According to the Z-tests, there was no significant difference between the FNR values of MALE and FEMALE, nor between the FNR values for most of the categories of IDACI. With regard to the CLASS AGE feature, the GBC model had an FNR of 0.03 in group C (above 12.5 years), 0.21 in group B (7.5-12.5 years), and 0.38 in group A (below 7.5 years).
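The two-sample Z-test for proportions used to compare FNR values between groups can be reproduced with the Python standard library; the counts below are illustrative, not the study's actual sample sizes.

```python
from math import sqrt
from statistics import NormalDist

def ztest_proportions(x1, n1, x2, n2):
    """Two-sample Z-test for proportions; returns (Z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                    # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative: 38 false negatives out of 100 positives (FNR = 0.38)
# versus 3 out of 100 (FNR = 0.03) in another group
z, p = ztest_proportions(38, 100, 3, 100)
# A difference this large is significant at alpha = 0.05
```

The null hypothesis is that the two groups share the same FNR; a p-value below α = 0.05 rejects it, as in Supplementary Table S12.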

Figure 7 :
Figure 7: GBC model for SOME ACTION. FNR of unmitigated and mitigated models (on the test set) for features CLASS AGE and ATTENDANCE BIN. On the x-axis, each bracket contains two values: the first refers to CLASS AGE and the second to ATTENDANCE BIN.

Figure 8 :
Figure 8: NO ACTION Model: ROC curve, AUC and optimal ROC point for the LR classifier on the test set.
(a) Important features of the TN group. (b) Important features of the TP group.

Figure 9 :
Figure 9: LIME analysis for young people correctly classified by the NO ACTION model (LR classifier) on the test set.
Figure 10: LR model for NO ACTION. FNR of unmitigated and mitigated models (on the test set) for CLASS AGE.
Features description (continued).
• Locality decision: type of early help service received by a young person (EH SUPPORT, SOME ACTION, NO ACTION).
• IDACI: proportion of people living in income-deprived families in the home postcode of the individual.
• Lunchtime exclusions *: number of lunchtime school exclusion sessions (CNT) and lunchtime school exclusion sessions as a percentage of total school attendance sessions (PCT).
• Fixed term exclusions * (EXC FIXED, Numeric): number of fixed term school exclusion sessions (CNT) and fixed term school exclusion sessions as a percentage of total school attendance sessions (PCT).
• School transfer phased * (TRANSFER PHASED, Numeric): number of phased school transfers; phased meaning the young person naturally progressed from primary school to secondary school.

Figure S1 .
Figure S1. EH SUPPORT: Profile of young people correctly classified as TN and TP. Features AGE AT LOCALITY DECISION, PRU (YEAR 2), SEND NEED ASD (PREV TERM IS NA) and SEND EHC (PREV 3 TERMS).

Figure S2 .
Figure S2. SOME ACTION: Profile of young people correctly classified as TN and TP. Features AGE AT LOCALITY DECISION, FSM (YEAR 1), EXC FIXED (PREV 3 TERMS) and EYF (PREV TERM).

Figure S3 .
Figure S3. NO ACTION: Profile of young people correctly classified as TN and TP. Features AGE AT LOCALITY DECISION and SEND NEED (PREV TERM IS NA).

Figure S4 .
Figure S4. Threshold analysis for the GBC (SOME ACTION model) and the LR classifier (NO ACTION model).

Figure S5 .
Figure S5. Histogram for the features AGE AT LOCALITY DECISION and ATTENDANCE (YEAR 4).

SEND: Descriptive statistics: mean, standard deviation (std), percentage of missing values and percentage of not applicable (NA) cells by feature.
The code and a sample set of records are available in the GitHub repository https://github.com/gcosma/ThemisAIPapers-Code

Table S6 .
Multi-class models: Validation performance of the ML models. Single run of stratified 10-fold CV.

Table S7 .
Multi-class model: Validation performance of the ML models.Average and standard deviation of the metrics using stratified 10-fold CV across 30 iterations.

Table S8 .
EH SUPPORT model: Validation performance for the best models.Average and standard deviation of the performance metrics in the validation sets (stratified 10-fold CV) across 30 iterations.

Table S9 .
NO ACTION model: Validation performance of the ML models.Average and standard deviation of the metrics using stratified 10-fold CV across 30 iterations.

Table S10 .
Test performance of the best models.

Table S11 .
Descriptive statistics: mean, standard deviation (std), percentage of missing values and percentage of not applicable (NA) cells by feature.

Table S12 .
Two-sample Z-test for proportions. Comparison of false negative rates (p) across sensitive feature groups. Sample size (n), test statistic Z and p-value. Values in bold reject the null hypothesis at significance level α = 0.05.

Table S13 .
Frequency distribution of ETHNICITY in the entire dataset and in the test set.

Table S14 .
Percentage of young people correctly classified and misclassified by the model in the test set. Category NOT WHITE refers to young people who are not in the category WHITE in Supplementary Table S13.