1 Introduction

Over the years, the digitization of government services and operations has increased public interest and participation in government affairs. People are increasingly vocal and active in opining on state and federal issues, and the judiciary is one branch of government that has experienced a particularly high level of public interest. Increased digitization in the judiciary has led to a growing number of people evaluating court rulings to understand the law better. At the same time, a large pool of personnel is required to coordinate the facts and supporting legislation for specific cases [1].

Many legal practitioners have begun adopting data mining and Artificial Intelligence (AI) systems as resources for evaluating cases [2]. Through machine learning (ML), practitioners can quickly analyze data on past decisions in cases similar to those they currently handle [3, 4]. Analyzing data within the judiciary has enabled better collaboration among court staff and improved access to information that facilitates prompt legal decision-making, thus reducing the time taken to analyze and classify cases [5]. Legal firms have also started employing deep learning to evaluate their cases. Deep learning enables the extraction of archival data and the prediction of case outcomes based on the facts presented and data derived from court databases. Through ML and deep learning approaches, legal professionals can use computational techniques to collect and analyze large amounts of data to guide maritime decision-making [6, 7].

One of the main problems that the Canadian judicial system faces is court delays. Court delays occur when courts take longer than expected to resolve cases, creating backlogs when incoming cases exceed decided ones. The issue of delays is most prevalent in the Federal Court and primarily concerns matters relating to maritime law. Disputes are common in maritime transactions; they generally arise when a party fails to meet its contractual obligations during shipping, cargo clearing, purchase of goods, or cargo supply at the agreed time [8, 9]. These delays arise in part because Federal Court judges handling maritime cases face significant challenges identifying the issues and laws that can be referenced when resolving maritime disputes. ML models can analyze past cases and draw insights to guide future decision-making [10, 11]. The Court must therefore employ digital systems effectively to expedite decision-making and ensure the public is better informed of maritime law.

A systematic review of deep learning research in the legal arena has been conducted. The International Conference on Artificial Intelligence and Law, the IEEE Conference on Knowledge and Systems Engineering, the ACM Conference on Knowledge Discovery and Data Mining, and the Machine Learning journal (Springer) are among the most influential publication venues.

We studied sentiment analysis using a support vector machine (SVM), Naïve Bayes, and logistic regression, training models on the dataset with each of the three algorithms. AI algorithms were used to support automated processing and retrieval of court cases, thereby reducing workforce needs and related costs. At the same time, AI algorithms were applied to maritime law to help uncover patterns in sentiment-bearing texts related to Canadian maritime court cases. The resulting sentiment analysis of maritime court cases should assist decision-making based on the available data.

The development of this study's machine-learning-based sentiment analysis system is organized into the following sections. Section 1 contextualizes the development of the sentiment analysis system. Section 2 covers the data-handling operations conducted to develop the ML model. Section 3 discusses the model development activities. The results and analysis of the experiment are detailed in Sect. 4, followed by the conclusion and future work of this study.

1.1 Sentiment analysis

Sentiment analysis is a data analysis technique that involves extracting information from an entity and identifying its subjectivities, such as whether the text expresses positive, negative, or neutral perspectives regarding a topic [12]. Sentiment classification can be performed at the aspect, sentence, and document levels. The three commonly used sentiment analysis approaches are lexicon-based, ML-based, and hybrid. The lexicon-based approach is subdivided into dictionary-based and corpus-based methods [13]. A dictionary-based approach to sentiment analysis categorizes sentiments using a dictionary of terms with known polarities.
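As a minimal illustration of the dictionary-based approach, the sketch below scores a text against a tiny polarity dictionary; the word list and scoring rule are illustrative assumptions introduced here, not the lexicon used in this study.

```python
# Minimal sketch of dictionary-based sentiment scoring.
# The tiny polarity dictionary below is a hypothetical example,
# not the lexicon used in this study.
POLARITY = {
    "affirmed": 1, "granted": 1, "allowed": 1,
    "dismissed": -1, "reversed": -1, "denied": -1,
}

def dictionary_sentiment(text: str) -> str:
    """Classify text as positive, negative, or neutral by summing word polarities."""
    score = sum(POLARITY.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(dictionary_sentiment("The appeal was dismissed and the motion denied"))  # negative
```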

ML methods include traditional and deep learning models. Examples of traditional models include the Bayes classifier and the maximum entropy classifier [14]. They use input features such as lexical features, lexicon-based words, adjectives, and other parts of speech. Compared to traditional models, deep learning generally provides better results, and deep learning models can analyze sentiments at the document, sentence, and aspect levels. The corpus-based approach, in contrast, involves statistical analysis of the contents of selected documents using techniques such as k-nearest neighbors (k-NN), hidden Markov models (HMMs), and conditional random fields (CRFs).

1.2 Related works of sentiment analysis

In the legal area, AI enables the rapid evaluation of documents required for due diligence. It facilitates the retrieval of specific clauses and the detection of irregularities or alterations in legal documents [15]. Recent improvements have enabled AI to design legally binding contracts [12], provide predictive insights, and assist in legal decision-making processes such as bail determinations, increasing attorney productivity and reducing errors [16]. In deep learning, sentiment analysis mainly employs architectures originally developed for image recognition, such as Convolutional Neural Networks (CNNs). Another significant category is recurrent neural networks (RNNs), whose connections between nodes form a temporal sequence graph. With its deep hidden layers, the Deep Neural Network (DNN) considerably improves machine learning by enhancing accuracy and lowering loss [17].

Pillai and Chandran [18] highlighted the pivotal contribution of Convolutional Neural Networks (CNNs), a subset of Deep Neural Networks (DNNs), to Indian judicial law cases, showcasing 85% accuracy in predicting charges, discerning the nature of offenses, and providing approximate judicial decisions based on the Indian Penal Code (IPC) (Figs. 1, 2). Singh and Thanaya [19] utilized Long Short-Term Memory (LSTM) networks, a subset of RNNs that specialize in learning long-term dependencies, making them particularly useful for sequential prediction tasks. Alghazzawi et al. [20] employed an LSTM + CNN model to predict court judgments, achieving a remarkable accuracy of 92.05%. Earlier research conducted in 2015 also investigated sentiment analysis extensively [21].

Fig. 1 Architecture of sentiment analysis. Source: Stanford.edu

Fig. 2 A brief overview of recurrent neural networks (RNNs). Source: Analytics Vidhya, Debasish Kalita, updated November 7, 2023

For each timestep \(t\), the activation \(a^{\left\langle t \right\rangle }\) and the output \(y^{\left\langle t \right\rangle }\) are expressed as follows:

$$a^{\left\langle t \right\rangle } = g_{1} \left( {w_{aa} a^{{\left\langle {t - 1} \right\rangle }} + w_{ax} x^{\left\langle t \right\rangle } + b_{a} } \right)\quad {\text{and}}\quad y^{\left\langle t \right\rangle } = g_{2} \left( {w_{ya} a^{\left\langle t \right\rangle } + b_{y} } \right)$$

where \(w_{ax}\), \(w_{aa}\), \(w_{ya}\), \(b_{a}\), and \(b_{y}\) are coefficients shared across timesteps, and \(g_{1}\) and \(g_{2}\) are activation functions.
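To make the recurrence concrete, the following sketch implements a single vanilla RNN timestep in NumPy, taking \(g_{1}\) as tanh and \(g_{2}\) as a sigmoid; the dimensions and random initialization are illustrative assumptions rather than values from this study.

```python
import numpy as np

# One vanilla RNN timestep, following the recurrence above.
# Dimensions and random initialization are illustrative assumptions.
rng = np.random.default_rng(0)
hidden_dim, input_dim, output_dim = 4, 3, 1

W_aa = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_ax = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_ya = rng.normal(size=(output_dim, hidden_dim))  # hidden-to-output weights
b_a = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_step(a_prev, x_t):
    """Compute a<t> = g1(Waa a<t-1> + Wax x<t> + ba) and y<t> = g2(Wya a<t> + by)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)      # g1 = tanh
    y_t = 1.0 / (1.0 + np.exp(-(W_ya @ a_t + b_y)))      # g2 = sigmoid
    return a_t, y_t

a_prev, x_t = np.zeros(hidden_dim), rng.normal(size=input_dim)
a_t, y_t = rnn_step(a_prev, x_t)
```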

Similarly, Lam et al. [22] conducted sentiment analysis by embedding and categorizing sentiments, similar to discussions on the importance of Machine Learning (ML) in sentiment analysis [23]. Parts of Speech (POS) were used as text features in these experiments to weigh words. Deep learning has been used in various fields, including banking, weather forecasting, travel warnings, and movie reviews [23]. Text extraction from diverse sources and analytical techniques such as Word2vec aid in insightful text classification [21]. In another study, Ghorbani et al. [12] utilized sentiment analysis to categorize customer-generated data on social media. This analysis serves to track [24] shifts in customer preferences within rapidly evolving business landscapes [25]. Multi-layered architectures have been deployed to conduct deep learning sentiment analysis on tweets across various languages [3], focusing on textual sentiment polarity and customer review sentiment ratings. Moreover, extensive sentiment analysis has been applied to recommender systems, employing demographic-based, content-based, hybrid, or Collaborative Filtering (CF) approaches [26]. Content-based social media sentiment analysis utilizes subscriber profiles, including gender, age, nationality, and occupation.

2 Data

This section introduces the critical aspects of the dataset and delves into the sentiment classification process. Data are crucial in machine learning as the foundation for model training and evaluation. The selection and preparation of the dataset, considering factors such as size and relevance, are discussed here, along with the essentials of sentiment classification and the methods used to analyze and categorize sentiments in textual data. The aim is to provide a concise yet comprehensive overview that allows readers to navigate the sentiment analysis effectively.

2.1 Data collecting, processing, analysis of relevant legislation data, and analysis of court judgments

Data collection is an essential aspect of any research project, and the collected data must be relevant to the type of research being conducted. Data from 2000 maritime court cases obtained from the Federal Court website were used to analyze Canadian maritime law [27]. The data collection process was performed manually because every case had to be read to identify the majority court opinion. All court cases used in the research are part of the public record; hence, no permission was required to evaluate and collect the data used in the study. Data anonymization, a critical data management practice, was not required for the present work because no personal information was collected, meaning the collected data cannot be used to personally identify any of the parties involved in a case. The terms plaintiff and defendant were used to represent the people involved in each case. The features identified during data collection are listed in Table 1.

Table 1 Features identified in the data [28]

Court decision legislation data include the cited regulations and past rulings that judges use to make decisions. The relevant legislation was found using the Federal Court website's filter option during data collection. We used the search filter to narrow our search to rulings on maritime law, ensuring that only relevant cases were chosen.

2.2 Dynamics of court judgment

Judgment dynamics are determined primarily by the nature of the Court and the case under consideration. The Federal Court in Canada hears many maritime law cases in which the judge can agree with either the defendant or the plaintiff, and individuals dissatisfied with the verdict sometimes file an appeal [29]. In this research, the dynamics of a court judgment were classified as either affirmative or negative. A judgment is classified as affirmed when a higher court agrees with the lower court's decision or when the judge agrees with the plaintiff's claims. The judgment status changes to reversed when a higher court overturns a lower court's decision or the judge agrees with the defense.

2.3 Analysis of differences between court judgments

The legal linguistics literature contains few reviews of separate opinions, and little attention has been paid to the linguistic and communicative means by which judges express their disagreement. This subsection considers the entity of votum separatum, or separate opinion, from a comparative, cross-language, linguistic perspective. The evidence reveals a clear similarity in how separate opinions are integrated into the macrostructure of US Supreme Court opinions and Polish Constitutional Tribunal verdicts. It also shows that judges typically rely on highly formulaic expressions when voicing disagreement, despite the absence of prescribed frameworks for doing so. An examination of recurrent phraseology shows that declaring a votum separatum and justifying it are distinct linguistic and legal practices, particularly in terms of formulaicity. American and Polish justifications differ in the regular phraseology of judicial verdicts and present noteworthy issues for examination.

Disparities in court judgments were evaluated according to the facts of the various case decisions. Such an investigation typically draws on current literature from peer-reviewed articles on judicial decisions; the chosen articles had to have been published within the past 10 years to ensure relevance. Across the reviews evaluated, the common theme is that legislation is crucial to reaching a verdict: judges must analyze the applicable legislation and decide whether the claim, given the facts of the case, violates it. Sentiment analysis showed that variations in case facts led to differences in verdicts, since decisions are founded primarily on the attributes defining the claims and the legislation applicable to the case.

2.4 Analysis of court decision data and machine learning methods

After evaluating different ML models, a decision was made to use an LSTM + CNN model. The model combines LSTM and CNN functionalities to evaluate the data and predict potential rulings based on the facts of the case; it can evaluate the judge's opinion of a case and predict the likely ruling. The first component is a CNN, enhanced with a word-embedding architecture, that classifies texts during input data pre-processing. The second component is an event detection model that employs an RNN to learn the time series of data feature occurrences by identifying temporal information.

2.5 Model hyperparameters

Experimenting with 70–30%, 80–20%, and 90–10% splits for training and testing, we systematically varied the dataset distribution in our CNN and LSTM model analysis. The CNN architecture has three convolutional layers with 3 × 3 filters, 2 × 2 pooling layers, and ReLU activation functions, with the number of filters increasing sequentially from 32 to 128. CNN training used a learning rate of 0.001, a batch size of 64, categorical cross-entropy loss, and Adam optimization. The LSTM models were configured with two 64-unit LSTM layers and a 0.2 dropout rate for sequential data processing, and were trained with Adam optimization, a batch size of 32, 50 epochs, a learning rate of 0.001, and binary cross-entropy loss for binary classification. These hyperparameter combinations were identified through iterative experimentation to maximize model performance for our specific tasks and dataset.
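For illustration, the following hedged Keras sketch encodes the CNN and LSTM configurations described above. The vocabulary size and embedding dimension are assumptions introduced here, and the 3 × 3 filter and 2 × 2 pooling settings are interpreted as kernel size 3 and pool size 2 for 1D convolutions over token embeddings; this is a sketch of the stated hyperparameters, not the exact implementation used in the study.

```python
# Hedged sketch of the CNN and LSTM configurations described above (Keras).
# VOCAB_SIZE and EMBED_DIM are assumed values; 3x3 / 2x2 settings are
# interpreted as kernel size 3 and pool size 2 for 1D convolutions over text.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM = 20000, 100  # assumed values

def build_cnn():
    model = keras.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.Conv1D(32, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def build_lstm():
    model = keras.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.LSTM(64, return_sequences=True, dropout=0.2),
        layers.LSTM(64, dropout=0.2),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Per the text: the CNN is trained with batch size 64, the LSTM with batch
# size 32 for 50 epochs, e.g.:
# build_cnn().fit(X_train, y_train_onehot, batch_size=64, validation_split=0.2)
# build_lstm().fit(X_train, y_train, batch_size=32, epochs=50, validation_split=0.2)
```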

3 Experiments

In sentiment analysis, the current landscape is characterized by the widespread utilization of ML-based models. As indicated in the previous section, the study employed an LSTM + CNN model to evaluate the data. The model's effectiveness was evaluated by comparing it with several other models.

3.1 Model selection and experiment setup

The proposed model was compared with logistic regression, Multinomial Naïve Bayes, and Linear SVM models. This comparison helped gauge the success of the proposed model in predicting judgments relative to other models, and the insight drawn from it is essential because it enhances the credibility of the proposed model. The ML model used in this study was developed in Python, and the experiment was conducted in a Jupyter Notebook, which provides an interactive coding platform. The experiment consisted of five stages:

  • Data loading: This stage covers loading the test data into the model. Data were loaded into the model as a data frame.

  • Data preparation: This stage involved eliminating null and empty values in the dataset.

  • Exploratory data analysis: This stage covered the activities conducted to obtain more insight into the nature of the data. The activities included counting the total number of entries left in the dataset after the data preparation stage.

  • ML-based predictive modeling: This stage involved developing the ML model.

  • Deep learning-based predictive modeling: This stage incorporated deep learning functionalities into the ML model.

The first step in developing the model was to load all the Python libraries used in the experiment. Loading libraries is crucial to ensure the efficient operation of the experiment. The full functionality of the model is explained using comments in the provided code.
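A condensed, hedged sketch of the first three stages is shown below. The file name ("maritime_cases.csv") and column names ("opinion_text", "judgment") are assumptions introduced for illustration; the actual file layout and library list used in the study are not reproduced here.

```python
# Sketch of the data loading, preparation, and exploratory stages.
# The file name and column names ("opinion_text", "judgment") are assumptions.
import pandas as pd

# Stage 1: load the court-case data into a DataFrame.
df = pd.read_csv("maritime_cases.csv")

# Stage 2: drop rows with null or empty values.
df = df.dropna()
df = df[df["opinion_text"].str.strip() != ""]

# Stage 3: exploratory analysis - count the remaining entries and label balance.
print("Entries after cleaning:", len(df))
print(df["judgment"].value_counts(normalize=True))
```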

4 Results and analysis

We followed a step-by-step process to achieve the best results, starting with data loading and feature extraction and then analyzing word distributions and judgments. After comparing various machine learning models, we identified the most effective one. This thorough approach ensured that the choice of model was well informed and yielded optimal outcomes.

4.1 Data loading and feature extraction

The exploratory analysis section of the experiment included feature extraction, and the data were analyzed to determine the number of affirmations and reversals in the dataset. The positive and negative sentiments were collected from the text data using the script shown in Fig. 3. Affirmations were represented as positive, whereas reversals were represented as negative.
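A hedged sketch of such a labelling step is shown below, continuing the assumed column names from the earlier sketch; the judgment labels "affirmed" and "reversed" are likewise assumptions about how the outcomes are encoded.

```python
# Sketch of mapping judgment outcomes to sentiment labels, as described above.
# Column names and label values are assumptions carried over from earlier sketches.
df["sentiment"] = df["judgment"].map({"affirmed": "positive", "reversed": "negative"})

share = df["sentiment"].value_counts(normalize=True) * 100
print(share)  # share of positive (affirmed) vs negative (reversed) cases
```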

Fig. 3
figure 3

Positive and negative reviews

The dataset contained 41% positive and 59% negative reviews, with no neutral reviews. The next step was to select data for validation and training. The top and bottom rows of the dataset are shown in Fig. 4. We also evaluated the distribution of words used in judgments to understand the key features of the decisions. By examining this distribution, we can identify keywords judges use when affirming a decision and keywords they use when reversing one. Some of these keywords were identified using a Word Cloud, which provides insight into the words judges use most frequently when writing their opinions. The next stage involved separating positive and negative words; samples of words appearing in cases with affirmations and reversals are shown in Fig. 5a and b.
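A hedged sketch of generating such word clouds is shown below; it relies on the third-party wordcloud package and reuses the assumed column names from the earlier sketches, so it illustrates the idea rather than reproducing the study's exact script.

```python
# Sketch of generating word clouds for affirmations (positive) and reversals (negative).
# Requires the third-party "wordcloud" package; column names are assumptions.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

for label in ("positive", "negative"):
    text = " ".join(df.loc[df["sentiment"] == label, "opinion_text"])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Words in {label} reviews")
plt.show()
```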

Fig. 4
figure 4

Top five and last five rows

Fig. 5
figure 5

a Words in positive reviews. b Words in negative review

4.2 Word distribution and judgments

The word distributions of the judgments show that many words were shared between the negative and positive reviews. Reversals represent negative reviews or sentiments, whereas affirmations represent positive reviews. A comparative analysis of affirmations and reversals is presented in Fig. 6. Texts from cases with reversals contained more characters than those with affirmations.

Fig. 6
figure 6

Comparison of count of words in affirmations and reversals

4.3 Comparative analysis of different machine learning models

When employing the ML approaches of logistic regression, Multinomial Naïve Bayes Classifier, and Linear Support Vector Classifier, the accuracy scores obtained were 50% for all tests. Figure 7a–c provide evidence of this.
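For reference, the following hedged sketch shows one way the three baseline classifiers can be evaluated with scikit-learn. The TF-IDF vectorizer settings, the train/test split, and the reuse of the assumed DataFrame columns from earlier sketches are illustrative assumptions, not the study's exact configuration.

```python
# Hedged sketch of the three baseline classifiers evaluated with TF-IDF features.
# Vectorizer settings, split ratio, and column names are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    df["opinion_text"], df["sentiment"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

for name, clf in [("Logistic regression", LogisticRegression(max_iter=1000)),
                  ("Multinomial Naive Bayes", MultinomialNB()),
                  ("Linear SVM", LinearSVC())]:
    clf.fit(X_train_tfidf, y_train)
    preds = clf.predict(X_test_tfidf)
    print(name, accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
```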

Fig. 7
figure 7

Confusion Matrix of a logistic regression. b Multinomial Naïve Bayes classifier. c Linear support vector classifier

The confusion matrix results for the logistic regression, Multinomial Naïve Bayes, and Linear Support Vector classifiers each showed a classification accuracy of 99%.

To create the deep learning LSTM + CNN model, we used a tokenizer to establish the vocabulary of the dataset. The model was then trained using sigmoid and ReLU activations with the Adam optimizer. At a minimum of 20 epochs, the deep learning model had a training score of 49.56% and a validation score of 50.51%, indicating that the model correctly predicted about 50% of the test cases. The model reported an accuracy level of 1.00, indicating that those predictions were 100% accurate. The effectiveness of the model is shown in Fig. 8.
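A hedged sketch of the tokenization and training step is shown below, using the classic Keras preprocessing API. The vocabulary size, sequence length, column names, and the build_lstm_cnn() helper are assumptions introduced for illustration, not the study's exact code.

```python
# Hedged sketch of tokenizing the corpus and training the LSTM + CNN model.
# Vocabulary size, sequence length, column names, and build_lstm_cnn() are assumptions.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(df["opinion_text"])
sequences = tokenizer.texts_to_sequences(df["opinion_text"])
X = pad_sequences(sequences, maxlen=300)
y = (df["sentiment"] == "positive").astype(int).to_numpy()

# model = build_lstm_cnn()  # e.g. LSTM layers stacked on the Conv1D blocks sketched earlier
# history = model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2)
```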

Fig. 8
figure 8

Execution of the training and validation

According to the findings, the proposed model could predict judgment almost 50% of the time based on the judge’s opinion. The learning curve in Fig. 9a shows that the model's accuracy increased as the number of epochs and data size increased.

Fig. 9
figure 9

a Validation and training accuracy. b Validation and training loss

The accuracy analysis of validation and training (Fig. 9a) showed that the validation accuracy remained constant at 92.5%, while the training accuracy increased rapidly from 91 to 93.5% and then remained consistent across the remaining epochs.

The loss analysis of validation and training (Fig. 9b) showed that the validation loss decreased from 34.5 to 30% and then remained constant, while the training loss decreased rapidly from 52.5 to 27% and remained consistent across the remaining epochs.
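Curves of this kind can be produced from the history object returned by Keras model.fit, as in the hedged sketch below; the "history" variable is assumed to come from the training sketch above.

```python
# Sketch of plotting training/validation accuracy and loss from a Keras History
# object (the "history" variable from the earlier training sketch is assumed).
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["accuracy"], label="training accuracy")
ax1.plot(history.history["val_accuracy"], label="validation accuracy")
ax1.set_xlabel("epoch")
ax1.legend()

ax2.plot(history.history["loss"], label="training loss")
ax2.plot(history.history["val_loss"], label="validation loss")
ax2.set_xlabel("epoch")
ax2.legend()
plt.show()
```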

5 Conclusion

This study highlighted the use of ML-based sentiment analysis as a resource for enhancing the efficiency of the Canadian judicial system and presented an ML module built on LSTM and CNN algorithms. The module was created using judgments from 100 court cases derived from the Canadian Maritime Court. Deep learning models extracted court records containing the sentiments or statements relevant to making maritime court judgments. The model can predict decisions based on judicial opinions: it evaluates judicial opinions, classifies them as positive or negative, and determines how that sentiment influences the overall judgment. Sentiment analysis of historical judicial decisions provides a mechanism to expedite the analysis of judicial jurisprudence, enhancing the capacity of judges to oversee maritime cases, hear them expeditiously and fairly, and render judgments. The improved ability to analyze judicial records will help improve the efficiency of the Canadian Maritime Court system.

6 Future work

Future versions of this study will need to examine the application of the CNN + LSTM model in other legal contexts, include other feature selection methods, and explore the use of pre-trained embedding models such as GloVe, fastText, and Word2Vec. Future studies will also compare deep learning methods for recommending court rulings with the traditional approach to identify the more advantageous techniques for accurate judgment. The LSTM + CNN model is a tool for forecasting judgments [21]; however, its efficacy varies over time. Furthermore, the relevance of the input data may change over time, and the model may recommend incorrect judgments if the data are not kept up to date. Other factors that could influence the model's outcomes include public opinion, disagreements among branches of law, changes in views of justice, and changes in the standards used to prosecute Canadian maritime cases.