An Improved AdaBoost for Prosecutorial Case-Workload Estimation via Case Grouping

Case-workload estimation is a complex process that plays a vital role in prosecutorial work. Despite the growing use of rule-based techniques, artificial intelligence and machine learning have rarely been applied to case-workload estimation, leaving many cases processed without quantitative estimation. This paper develops a new case-workload estimation method that combines artificial intelligence methods with practical needs, and applies it to the case assignment system of the prosecutor's office. We propose a feature learning model, the improved AdaBoost model, to capture the features of cases and group them for workload estimation. We first learn from the case text using a dictionary of judicial proper nouns, extract case labels from the case information with the AdaBoost learner, and group and encode each case by fuzzy matching. Then, we estimate case workload from the extracted key information, based on the length of case processing time and the number of suspects, respectively. We conducted extensive experiments on a real prosecution case dataset to compare the proposed method with eight baseline methods, including the traditional AdaBoost classifier. The experimental results demonstrate the superiority of the proposed workload estimation model.


Introduction
Over the past few decades, the role of information technology in the judicial field has become increasingly prominent [1,2]. In China, the procuratorial authorities rely on the Unified Business Application System (UBAS) to receive cases, and the subsequent "case assignment" process follows the principle of "random case assignments, supplemented by assigned case assignments". Technically, this process lacks a means of quantitatively estimating the difficulty of a case. Therefore, given the wide application of artificial intelligence and machine learning in various fields [3], research on quantifying case complexity can help ensure the quality of case handling in real time and assist with case processing.
Data analysis in the judicial field is a timely and challenging issue [4][5][6]. Some critical information about a case, such as the length of case processing time and the suspects, directly affects case handling efficiency [7]. In prosecutorial cases, the length of case processing time is the period between the acceptance of a case by the prosecutor and its completion, and the suspects are those suspected of committing a crime in the case. Manual annotation can only handle structured data; it is time-consuming and labor-intensive, and it cannot effectively distinguish between cases of the same type. Case data also contain a large amount of unstructured data, which holds valuable key information and directly affects case processing [8].
In case processing, case-workload estimation is directly related to the rationality of case allocation, the performance evaluation of prosecutors, and the improvement of the incentive system. Since the judicial reform, many scholars have studied case workload, but current research remains at the stage of qualitative analysis: it lists the specific factors that make up case workload, but gives no method for measuring it statistically [9]. To address these problems, this paper analyzes historical case processing data. Specifically, we first segment the case data based on a dictionary of judicial nouns. Then, the cases are grouped and encoded with the improved AdaBoost and fuzzy matching. Finally, we estimate the case workload from the assigned case labels and critical information (the length of case processing time and the suspects). Applying this grouping and estimation method to a large number of historical cases makes it possible to classify cases accurately and estimate the workload of each case quantitatively. This paper provides several ways of reporting the estimates, by prosecutor, by prosecutor office, and by case type, which can improve the accuracy of case-workload estimation, ensure the quality of case processing, and improve case processing efficiency.
In summary, this paper has the following contributions.
• To get better case grouping and encoding results, we propose the improved AdaBoost model and combine it with fuzzy matching.
• To improve the accuracy of the case workload, we propose two effective workload estimation methods for the first time: the case-workload estimation methods based on the length of case processing time and on suspects.
• Experimental results on actual data show that our final model is practical for case-workload estimation and intelligent case assignment.
The rest of the paper is organized as follows. Section 2 briefly reviews related work. Section 3 presents the case grouping and case-workload estimation methods. Section 4 details the experimental results and analyses. Section 5 discusses further analysis of the results and potential limitations. Finally, Sect. 6 concludes the paper.

Related Work
In the process of case processing, the issue of case-workload estimation is very significant. However, it is impossible to estimate case workload intuitively by dealing with case data directly. Therefore, it is necessary to extract critical information about the case.
Extracting critical information about a case amounts to classifying its essential details, and can be seen as a multi-classification problem. Octavia-Maria et al. use the SVM algorithm to classify legal cases [10] and propose a method combining statistics and machine learning to predict the outcome of French Supreme Court decisions [11]. For case text classification, [12] compared seven standard classification methods and concluded that the basic Bayesian and KNN methods are significantly better than the other five. Building on this previous work, we choose ensemble learning for case grouping.
With the rapid development of artificial intelligence technology, machine learning, and especially deep learning, has become a new trend in intelligent judicial technology. Robert et al. use convolutional neural networks (CNNs) to extract semantic features from case text [13], supporting the deep mining of case data. Although CNNs perform well in many tasks, CNN models cannot model case summary information accurately and therefore fail to achieve the case grouping and encoding required in this paper. The text modeling approach in the study mentioned above is based on feature labels, and labeling features requires a lot of manual work and expert knowledge, so the case classification or conviction algorithms built on this basis are not scalable. In 2016, Liu et al. [14] proposed three information-sharing mechanisms based on the recurrent neural network (RNN) for multi-class text classification and achieved good results on four benchmark text classification tasks. Still, these results were unremarkable for the problem studied in this paper.
We focus on constructing a case-workload estimation method to express the complexity of a case. The case workload can be output for different prosecutors, prosecution offices, and case types.

Methods
This section presents the technical details of case grouping and encoding and of case-workload estimation. As shown in Fig. 1, we first group and encode the cases based on their case summaries and assign case labels. Then, we estimate the workload of each case from the encoding and case processing information, based on the length of case processing time and the number of suspects, respectively. Table 1 lists the notations and key concepts used in this paper. Formally, the task of case-workload estimation is to achieve three general objectives.

Notations and Problem Formulation
• Case grouping and coding: This is the essential requirement of case-workload estimation. It demands that case labels are assigned based on the case classification.
• Workload for length of case processing: We estimate case workload for the length of case processing time based on case labels.
• Workload for number of suspects: We estimate case workload for the number of suspects based on case labels.

Case Grouping and Encoding
This section presents a case grouping and encoding algorithm that assigns a case label to each case. Six months of case summary data from a municipal prosecutor's office are used as input. The method consists of three main steps. First, we pre-process the case data. Then, we learn case labels with the improved AdaBoost and assign case categories by fuzzy matching. Finally, the cases are grouped and encoded.

Data Pre-processing
We collected six months of case data from the prosecution system of a municipality. The data contain only criminal cases and include both structured and unstructured data. Each case record consists of five main attributes. As shown in Table 2, these attributes include the case serial number, the case undertaker, the case summary, the start of processing time (St), and the end of processing time (Et), all of which are crucial for subsequent steps. A case summary mainly contains the following information: the time of the crime, the crime, the people involved, the means of the crime, the tools of the crime, the goods lost, the amount of loss, and so on. For example, "On January 23, 2006, Zhang San came to the sub-bureau to report that a moped parked at his home in the evening of the 22nd was stolen, valued at more than RMB 1000 yuan, in the village of xx, xx town of this district".
Case summaries are mainly recorded by specialists and are usually kept in textual form. These short texts contain a large number of judicial terms, which makes it difficult to learn case information accurately.
The preprocessing of case summary text in this paper mainly consists of word segmentation, which is based on the bidirectional maximum matching method [15]. Both the forward and the reverse maximal matching methods follow the maximal matching principle, which is to take the longest string that matches a word in the dictionary. The algorithm compares the segmentation obtained by forward maximal matching with the one obtained by reverse maximal matching and selects the result with the fewest word segments. The segmentation dictionary is a dictionary of judicial proper nouns.
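The bidirectional matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy dictionary, and the tie-breaking rule (prefer the backward result when both segmentations have the same length) are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback segment.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_match(text, dictionary):
    """Keep the segmentation with fewer segments.

    Assumption: ties go to the backward result; the source does not
    state a tie-breaking rule.
    """
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    return fwd if len(fwd) < len(bwd) else bwd


# Toy dictionary standing in for the judicial proper-noun dictionary.
toy_dictionary = {"abc", "ab", "cde"}
segments = bidirectional_match("abcde", toy_dictionary)
```

Here forward matching yields three segments (`abc`, `d`, `e`) while backward matching yields two (`ab`, `cde`), so the backward result is kept. In practice the dictionary would hold judicial terms and the text would be a case summary.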

Case Labels Learning
Accurately learning case labels from detailed case summary data is one of the main challenges in case-workload computation. In contrast to traditional classification methods, we use an ensemble learning method, AdaBoost, to learn case labels from case data. AdaBoost (Adaptive Boosting) was proposed by Yoav Freund and Robert Schapire in 1995 [16]. It is adaptive in the following sense: the weights of the samples misclassified by the previous base classifier are increased, and the re-weighted samples are then used to train the next base classifier. A new weak classifier is added in each iteration until a predetermined, sufficiently small error rate or a prespecified maximum number of iterations is reached [17][18][19].
Compared to the conventional AdaBoost, the improved AdaBoost introduces a regularization factor when updating the weight distribution of the training set. The regularization factor prevents the weights from becoming extreme during updates when learning from case summaries. It is worth mentioning that we treat the cases processed by one prosecutor as one set of data, which more closely matches the needs of the subsequent workload estimation and allows for easy comparison of each prosecutor's case processing capacity.
First, we ask judicial professionals to label some cases in advance. Then, we use the labeled cases as the training set of the learner and the remaining unlabeled cases as the test set.
We input a dataset with m samples, where X is the feature vector matrix and Y denotes the set of case labels, and we want to output the weights of the sample set at the k-th iteration. The notation is summarized in Table 1.

Table 1 Notations and critical concepts
Symbol        Description
$w_{ij}$      The weight of the jth sample at the ith iteration
$W$           The weight set of the sample set
$X$           The feature vector matrix of the dataset
$Y$           The set of all case labels in the dataset
$h_i$         The ith weak learner
$a_i$         The parameters of the learner $h_i$
$\alpha_i$    The weight of the ith weak learner
$P$           The combination of $a_i$ and $\alpha_i$
$z_i$         The normalization factor
$g$           The workload estimation weight

The main steps are as follows:
1. Initialize the distribution of weights for the training samples, with equal weights for each training sample:
$$W_1 = (w_{11}, \dots, w_{1m}), \qquad w_{1j} = \frac{1}{m}, \quad j = 1, \dots, m. \tag{1}$$
2. At the ith iteration, train the weak learner $h_i$ on the weighted samples and compute its error on the training set, where the function $I(\cdot)$ is the indicator function, equal to 1 when its argument holds and 0 otherwise. Estimate the weight $\alpha_i$ of the weak learner $h_i$. Then update the weight distribution of the sample dataset for the next round of i + 1 iterations, where $z_i$ is the normalization factor that makes the updated weights sum to one.
3. Integrate the N weak classifiers into a final strong learner, where X denotes the set of feature vectors x, $\alpha_i$ denotes the weight of the ith weak learner among the N weak learners, and P denotes the combination of $\alpha_i$ and $a_i$, $P = (a_i, \alpha_i)$.
The test set feature vector $X_{test}$ is fed into the strong learner for case label prediction. The classification process is shown in Fig. 2.
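To make the weight-update loop concrete, the following is a minimal sketch of AdaBoost on a toy one-dimensional, two-class problem. Several things here are assumptions rather than the paper's method: the weak learners are simple threshold stumps, the labels are binary (+1/-1) although the paper's task is multi-class, the learner weight uses the textbook formula $\alpha = \frac{1}{2}\ln\frac{1-e}{e}$, and the `shrink` parameter is a hypothetical stand-in for the regularization factor, whose exact form is not given in the text above. It only illustrates how damping the update keeps sample weights from becoming extreme.

```python
import math


def stump_predict(x, thr, pol):
    """Threshold stump: predicts pol when x >= thr, otherwise -pol."""
    return pol if x >= thr else -pol


def train_stump(xs, ys, w):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(wj for wj, x, y in zip(w, xs, ys)
                      if stump_predict(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds=10, shrink=1.0):
    """Textbook binary AdaBoost; shrink < 1 is a hypothetical regularizer."""
    m = len(xs)
    w = [1.0 / m] * m                                # equal initial weights
    learners = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)        # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # weak learner weight
        learners.append((alpha, thr, pol))
        # Misclassified samples gain weight, correct ones lose weight.
        w = [wj * math.exp(-shrink * alpha * y * stump_predict(x, thr, pol))
             for wj, x, y in zip(w, xs, ys)]
        z = sum(w)                                   # normalization factor
        w = [wj / z for wj in w]
    return learners, w


def predict(learners, x):
    """Strong learner: sign of the weighted vote of all stumps."""
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in learners)
    return 1 if score >= 0 else -1


xs, ys = [1, 2, 3, 4], [-1, -1, 1, -1]
_, w_plain = adaboost(xs, ys, rounds=1)              # standard update
_, w_damped = adaboost(xs, ys, rounds=1, shrink=0.5) # damped update
```

After one round with the standard update, the single misclassified sample carries half of the total weight; with the damped update it carries less, which is the kind of effect a regularization factor is meant to have.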

Fuzzy Matching
Because we need to satisfy multiple types of case-workload computation requirements, we assign the most relevant case category to each case label. With the development of machine learning, fuzzy matching and fuzzy consistency studies [20][21][22][23] play an essential role in classification tasks. In this paper, we adopt FuzzyWuzzy's fuzzy string matching method to assign case categories based on the table of criminal law assignment rules [24]. Its main idea is to measure the difference between two strings using the edit distance (Levenshtein distance). The edit distance of two strings a and b is denoted $\operatorname{lev}_{a,b}(|a|, |b|)$, where $|a|$ and $|b|$ denote the lengths of a and b, respectively, and i and j index the ith character of a and the jth character of b, respectively:
$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$
For example, a case labeled "drug offense" is fuzzily matched to a specific case type "drug offense".
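As an illustration of edit-distance matching, the sketch below computes the Levenshtein distance with dynamic programming and picks the closest category for a label. It is written in plain Python so that it is self-contained; the `similarity` score (one minus the distance divided by the longer length, scaled to 0-100) is a simplified assumption and is not FuzzyWuzzy's own ratio, and the category list is made up for the example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute if different
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Simplified score in [0, 100]; 100 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def best_category(label, categories):
    """Return the category with the highest similarity to the label."""
    return max(categories, key=lambda c: similarity(label, c))


# Hypothetical category table; the real one comes from the assignment rules.
categories = ["drug offense", "theft", "fraud"]
matched = best_category("drug offence", categories)
```

With this toy table, the misspelled label `drug offence` is one substitution away from `drug offense` and is therefore matched to it.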

Grouping and Encoding
To meet the needs of subsequent case-workload estimation, we finally perform case grouping and encoding (CGE). Each CGE code consists of four characters, with the following encoding rules.
• The first character is a capital letter, with A-Y denoting the 25 first-degree counts of the case, respectively.
• The second character is a lowercase letter, with a-z denoting the second-degree counts of the case, respectively.
• The third character is an Arabic numeral indicating the order of the second-degree offense within the first-degree offense.
• The fourth character is an Arabic numeral indicating whether there is a combination of crimes and the proportion of suspects:
1. "3" indicates a combination of a lesser offense and a suspect.
2. "4" indicates a combination of lesser offenses and not less than one suspect.
3. "5" indicates a combination of an aggravating circumstance and a person suspected of committing the offense.
4. "6" indicates a combination of aggravating circumstances and not less than one person suspected of the offense.
5. "0" indicates a case where no express indication was given.
For example, the CGE code Ka13 shown in Fig. 3 is read as follows: the case is a crime against the personal and democratic rights of citizens in the first degree (K), a crime of intentional homicide in the second degree (a), the second-degree offense is ranked first (1) within the first-degree offense, and the case has a combination of lesser circumstances and one suspect (3).
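The encoding rules above can be checked mechanically. The following sketch parses a four-character CGE code into its parts; the class name, field names, and error messages are hypothetical, and the mapping of letters to specific offenses is deliberately left out because only the example code Ka13 is spelled out in the text.

```python
from dataclasses import dataclass

# Meanings of the fourth character, taken from the encoding rules.
FOURTH_CHARACTER = {
    "0": "no express indication",
    "3": "lesser offense, one suspect",
    "4": "lesser offenses, not less than one suspect",
    "5": "aggravating circumstance, one suspect",
    "6": "aggravating circumstances, not less than one suspect",
}


@dataclass
class CGECode:
    first_degree: str      # capital letter A-Y
    second_degree: str     # lowercase letter a-z
    order: int             # rank of the second-degree offense
    circumstance: str      # decoded meaning of the fourth character


def parse_cge(code):
    """Validate and split a CGE code such as 'Ka13'."""
    if len(code) != 4:
        raise ValueError("a CGE code has exactly four characters")
    first, second, order, last = code
    if not "A" <= first <= "Y":
        raise ValueError("first character must be a capital letter A-Y")
    if not "a" <= second <= "z":
        raise ValueError("second character must be a lowercase letter a-z")
    if order not in "0123456789":
        raise ValueError("third character must be an Arabic numeral")
    if last not in FOURTH_CHARACTER:
        raise ValueError("fourth character must be one of 0, 3, 4, 5, 6")
    return CGECode(first, second, int(order), FOURTH_CHARACTER[last])


example = parse_cge("Ka13")
```

For `Ka13` this yields first degree `K`, second degree `a`, order `1`, and the circumstance "lesser offense, one suspect", matching the reading given for Fig. 3.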

Case-Workload Estimation
In the previous section, each case was assigned a CGE code. This section estimates the case workload based on the CGE codes and case processing information. Existing case-workload research remains at the qualitative analysis stage and gives no way to count and measure case workload. By analyzing the factors that influence a case's workload, workload statistics and estimates for different cases based on historical case data can effectively predict the workload of unprocessed cases and provide a reference for prosecutors' workload estimation. The length of case processing time and the number of suspects in a case have been identified as the main factors influencing its workload. To improve estimation performance, we also introduce the EM algorithm and CRF-based named entity recognition to enhance the accuracy of the estimates based on these two factors. In the following, we describe these steps in detail.

Case-Workload Estimation Based on the Length of Case Processing Time
Let n be the number of cases handled by a given prosecutor over a period of time, and let $St_i$ and $Et_i$, $i = 1, 2, \dots, n$, be the start and end times of case i. The start and end times are sorted and stored in the array $t[j]$, $j = 1, 2, \dots, 2n$, which divides the time axis into $2n - 1$ time intervals. To improve the accuracy of case-workload estimation based on the length of case processing time, we propose an EM algorithm-based method that solves for this workload iteratively. To illustrate the algorithm, the case-specific workload is estimated for three cases whose processing periods overlap in time, as shown in Fig. 4. The main steps are as follows.
I. Estimation of initial weights: $g_1^0 = g_2^0 = g_3^0 = 1$.
II. Estimation of initial workload.
We set the number of iterations N = 10, substitute the initial workload, and start the iterative estimation. The weight and workload of each case are then updated at the z-th iteration, beginning with the weight of case 1.

Fig. 4 Case timeline
Based on the above method, the workload of each case can be estimated from the length of case processing time as $f(x) = T_x^n$.
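The interval-based idea can be illustrated with a small sketch. It covers only the starting point of the procedure: the sorted start and end times cut the timeline into intervals, and the length of each interval is shared among the cases that are open during it in proportion to their weights (all equal to 1 initially). How the weights are re-estimated in later iterations is not reproduced here, and the proportional sharing rule itself is an assumption about how the weights are used; function and variable names are hypothetical.

```python
def interval_workload(cases, weights=None):
    """Share each time interval among the cases that are open during it.

    cases   : list of (start, end) pairs in any numeric time unit (e.g. days)
    weights : per-case weights; defaults to 1 for every case
    Returns one workload value per case.
    """
    n = len(cases)
    if weights is None:
        weights = [1.0] * n
    # Sorted, de-duplicated start/end times mark the interval boundaries.
    points = sorted({t for case in cases for t in case})
    load = [0.0] * n
    for left, right in zip(points, points[1:]):
        active = [i for i, (st, et) in enumerate(cases)
                  if st <= left and right <= et]
        total = sum(weights[i] for i in active)
        for i in active:
            # Each open case receives a weight-proportional share.
            load[i] += (right - left) * weights[i] / total
    return load


# Three overlapping cases, in days since the first case was accepted.
timeline = [(0, 10), (5, 15), (8, 20)]
initial_load = interval_workload(timeline)
```

With equal weights, days on which only one case is open count fully toward that case, while days shared by two or three cases are split evenly, so the loads add up to the 20 days covered by the timeline.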

Case-Workload Estimation Based on Suspects
After obtaining the case workload based on the length of processing time, we estimate the case workload based on criminal suspects. In this paper, CRF-based named entity recognition is used to extract the criminal suspects. A conditional random field (CRF) is a model of the conditional probability distribution of a set of output sequences given a set of input sequences, and it is widely used in named entity recognition for natural language [25].
The overall flow is shown in Fig. 5. With the above method, we first identify and extract the suspects of each case in the dataset, then remove repeated suspects within each case, and finally count the number of suspects, giving $h(x) = N_x$, where x is the case ordinal number.

Case-Workload Estimation
Case-workload estimation consists of two parts: estimation based on the length of case processing time and estimation based on criminal suspects. After both workloads are estimated, they are combined in a weighted sum to give the case workload F(x):
$$F(x) = u_1 f(x) + u_2 h(x)$$
where $u_1$ and $u_2$ are the weights of the two workloads. Based on experimental analysis of previous data, they are set to $u_1 = 0.7$ and $u_2 = 0.3$.
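A short sketch of the final combination step follows. It assumes that the suspect names have already been extracted (for example by a CRF tagger, which is not reproduced here) and that the time-based workload f(x) is given as a number. The function names are hypothetical, and whether the two components are rescaled to a common range before being combined is not stated in the text, so no rescaling is applied.

```python
def count_suspects(names):
    """h(x): number of distinct suspects after removing repeats."""
    return len({name.strip() for name in names if name.strip()})


def case_workload(time_load, suspect_names, u1=0.7, u2=0.3):
    """F(x) = u1 * f(x) + u2 * h(x), with the weights given in the text."""
    return u1 * time_load + u2 * count_suspects(suspect_names)


# A case with a time-based workload of 10 and one repeated suspect name.
suspects = ["Zhang San", "Li Si", "Zhang San "]
total = case_workload(10.0, suspects)
```

In this example the repeated name is counted once, so h(x) = 2 and the combined workload is 0.7 * 10 + 0.3 * 2.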

Analysis of Experimental Results
In this part, experiments verify the effectiveness of the proposed method in case-workload estimation. We first describe the data set and then validate our approach. Finally, we will analyze the validity of our case-workload estimation method.

Dataset
We adopted a real dataset extracted from a city prosecutor's office, covering many common case types. The dataset is a collection of criminal cases in the municipality for 2018-2019. A criminal case is a case in which a suspect is accused of violating a social relationship protected by the Criminal Law; the state investigates, tries, and imposes criminal sanctions to hold the suspect or defendant criminally responsible for the crime. Detailed statistics are summarized in Table 3. The statistics on the length of case processing time (LOC) and the number of suspects (NOS) allow us to infer that the workload of different criminal cases varies. After data preparation, the dataset contains approximately 1000 criminal cases.

Comparison of Case Grouping Results
To evaluate the performance of the improved AdaBoost on our dataset, we implemented seven classic classification models and compared their grouping accuracy. The accuracy is computed as
$$\text{Accuracy} = \frac{\#h}{|X_{test}|}$$
where #h denotes the number of test samples that are correctly classified and $|X_{test}|$ denotes the total number of test samples. Precision and recall are computed as
$$\text{Precision} = \frac{1}{K}\sum_{i=1}^{K}\frac{TP_i}{TP_i + FP_i}, \qquad \text{Recall} = \frac{1}{K}\sum_{i=1}^{K}\frac{TP_i}{TP_i + FN_i}$$
where K is the number of categories, TP, FP, TN, and FN indicate the number of true positives, false positives, true negatives, and false negatives, respectively, and the subscript i indicates the category.
Although precision and recall are both useful evaluation metrics, they usually conflict with each other, and improving one usually comes at the expense of the other. Therefore, we combine precision and recall into the F1 value, which is computed as
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The experimental setup and evaluation methods are the same for all compared methods, and we focus on the improvements made in this paper. The experimental results are shown in Table 4. The bolded values in Table 4 are the results of our method, and the same is true for Table 5.
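For reference, the metrics described above can be computed as in the sketch below. It assumes macro-averaging, that is, precision and recall are computed per category and then averaged over the K categories, and it derives F1 from the averaged precision and recall; both are common conventions but are assumptions about the exact variant used here. The labels in the example are made up.

```python
def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)   # true positives
        fp = sum(t != c and p == c for t, p in pairs)   # false positives
        fn = sum(t == c and p != c for t, p in pairs)   # false negatives
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for four test cases.
truth = ["theft", "theft", "fraud", "drug"]
guess = ["theft", "fraud", "fraud", "drug"]
scores = evaluate(truth, guess)
```

In this toy example one of four cases is misclassified, so the accuracy is 0.75, while the macro-averaged precision and recall are both 2.5/3 because the error lowers precision for one category and recall for another.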
As can be seen from Table 4, under a consistent experimental setup, the method proposed in this paper is clearly superior to the traditional classification methods. Traditional methods mostly use a single learner, whereas AdaBoost is an ensemble learning method that can combine the advantages of multiple learners. Due to the limited amount of prosecutorial case data, it is difficult to train an accurate deep learning model such as a CNN.
Moreover, compared to the best baseline method, AdaBoost, our method shows better performance on the evaluation metrics. This indicates that our improvement of AdaBoost with regularization factors is effective. Furthermore, we also compared the case grouping and encoding results. As shown in Table 5, we first asked professionals to manually group and encode the 1032 cases. We then grouped and encoded the same 1032 cases based on the classification results of the baseline methods and of the proposed method. The proposed model exhibited better performance and improved the accuracy of case grouping and encoding.

Case-Workload Analysis and Comparison
The case workload was analyzed for multiple factors (different prosecutors, different prosecution offices, and different case types), and the results are shown in Figs. 6 and 7.

Discussion
It is feasible to group and encode cases from the case data, assign a case label to each case, and finally estimate the case workload from the length of case processing time and the number of suspects. Compared with the traditional method, our approach has the following characteristics.
First, the cases are grouped and encoded by an improved AdaBoost classifier, whose grouping performance is better than that of a single classifier.
Second, since the length of case processing time and the suspects are the most important factors affecting the complexity of cases, it is reasonable to use them to estimate the case workload. This also provides a new idea for the quantitative evaluation of complex cases in the judicial field. Finally, the case assignment system based on case workload is more efficient than the traditional random assignment strategy, and satisfaction with case assignments is also greatly improved.
However, during the development of this method, a few potential limitations need to be considered.
1. … have the possibility of overfitting.
2. We have mainly exploited the length of case processing time and the number of suspects to estimate the case workload. However, in practical application, prosecution case workload is influenced by several other case features, such as the number of case files involved and the case ratio. Our future research will attempt to leverage multiple case features to improve the accuracy of case-workload estimation.

Conclusion
This paper developed a case-workload estimation technique for judicial research, which provides an effective tool for prosecutors' offices to assign cases rationally. Specifically, we first adopted an improved AdaBoost for grouping and encoding prosecutorial cases to determine the category labels of cases. Then, we estimated the case workload from the length of case processing time and the number of suspects in the case. We conducted extensive experiments to evaluate the performance of the proposed model on an actual legal case dataset. The experimental results demonstrated the superiority of the proposed model. The judicial scenario is an essential practical application scenario, and our research sheds new light on intelligent justice. In future work, we will focus on multi-feature case-workload estimation, intelligent assignment of prosecutorial cases, and related problems.
The list of abbreviations is shown in Table 6.

Table 6 List of abbreviations
Abbreviation    Full name
CRF             The conditional random fields
LOC             The length of case processing time
NOS             The number of suspects
CNNs            The convolutional neural networks
SVM             The support vector machine
KNN             The k-nearest neighbor