1 Introduction

Microscopic urine sediment analysis is a routine laboratory test [1]. Identifiable urine sediments include erythrocytes, leukocytes, crystals, casts, epithelial cells, sperm, bacteria, and mycetes (fungi) [2, 3], which, either singly or in combination, can indicate the presence of different clinical conditions [4]. For example, the presence of erythrocyte urinary sediments above a specified threshold indicates bleeding into the urine, which may signify diverse pathologies (stones, infection, cancer, etc.) affecting anatomical structures along the urinary tract, e.g., the kidney glomerulus, kidney tubules, ureter, bladder, prostate, and urethra. Here, the shape of urinary erythrocytes depends on their origin, i.e., dysmorphic versus non-dysmorphic in glomerular and non-glomerular hematuria, respectively [2]. Urine sediment microscopic examination can be conducted manually, which is labor-intensive and subject to human bias [5], or using automated devices. The latter enhance operational efficiency, help reduce laboratory workload, and are an invaluable diagnostic screening tool in high-volume clinical laboratories [6]. Indeed, image-based intelligent analytic systems can offer accurate and robust results at high throughput [7, 8] for the diagnosis, surveillance, and therapeutic monitoring of various kidney and urinary tract diseases [9].

Automated image recognition is integral to urinalysis automation and generally comprises segmentation [10], feature selection [11], optimization [12, 13], and classification steps [14, 15]. Wide variations in sediment shapes, small cell sizes, and occasional clumping of cells pose challenges to the application of machine learning for urinary sediment classification [7, 14]. Table 1 summarizes the state-of-the-art image-based urine sediment analysis, which comprises deep learning models exclusively. Of note, large numbers of images (ranging from a few thousand to 300 thousand) spanning 3–10 distinct sediment classes have been studied, collectively encompassing the sediment types commonly encountered in the laboratory: red blood cells, white blood cells, epithelial cells, hyaline casts, mucus strands, crystals (e.g., calcium oxalate), spermatozoa, as well as exogenous infectious agents like bacteria and yeast cells.

Table 1 Recent image-based models developed for automated urinary sediment analysis

From Table 1, the following gaps can be identified in the literature on urinary sediment classification models.

  • To achieve high classification accuracies, deep learning models have typically been employed, despite their high computational complexity. However, there is a need for lightweight models that can run on simpler configurations such as a laptop.

  • Most existing datasets contain a limited number of categories or categories that are visually well separated (e.g., RBCs versus sperm cells); more challenging datasets may require alternative approaches to achieve high accuracy.

  • All models listed in Table 1 are deep learning models, highlighting the need for competitive feature engineering models as an alternative to deep learning.

The main objective of this study is to address these feature engineering gaps by proposing a new architecture, which we term Swin-LBP. This architecture represents a new-generation feature engineering approach that we believe will lead to improved performance in a variety of image analysis tasks. Furthermore, we have developed a novel dataset of urinary cell images covering seven urinary cell categories and containing more than 10,000 images. This dataset is an important contribution to the field, as it will allow researchers to benchmark the performance of their algorithms on a standardized dataset.

1.1 Motivation and our method

We were motivated to develop an accurate handcrafted urine sediment classification method based on computer vision of urine cell images. The challenge was posed as an image classification problem, to which deep learning networks have been widely applied [20,21,22]. Computer vision, which uses convolutional neural networks (CNN) and transformer-based models [23, 24], has emerged as an important tool for image-based classification. Transformer-based models employ attention-based architectures [25]. The popular vision transformer [26] and Swin transformer [27] rely on patch-based classification: the former uses fixed-sized patches to extract deep features, and the latter a four-stage hierarchy of shifted-window patch divisions. In this work, we proposed a hand-modeled image classification method based on the Swin architecture in combination with a local binary pattern (LBP) [28] feature extraction function, which we named Swin-LBP. The model comprises five phases—(i) preprocessing of input images using shifted windows-based patch division; (ii) LBP-based feature extraction [28]; (iii) neighborhood component analysis (NCA) [29]-based feature selection; (iv) support vector machine (SVM) [30]-based classification; and (v) majority voting—and was trained and tested on a new reconstructed 7-class urine sediment image dataset.

1.2 Contributions

The contributions of the proposed model are given below.

  • A novel handcrafted, Swin transformer-inspired feature engineering model is proposed.

  • We have built an effective and efficient computer vision model by combining:

    • Shifted windows-based patch division of images.

    • Computationally lightweight handcrafted feature extraction and selection functions.

    • Standard shallow classifier.

    • A simple majority voting algorithm to obtain the overall classification results.

  • Trained and tested on a 7-class urine sediment image dataset, the Swin-LBP model attained a 7-class classification accuracy of 92.60%.

These contributions show that, to the best of our knowledge, this is the first use of the Swin architecture with handcrafted features, offering a new way to obtain high classification performance from shallow methods.

2 Dataset

Utilizing a published dataset [14, 31, 32], we conducted segmentation, extraction, and cropping of individual urine sediment images. This process yielded a collection of 12,330 images, which we subsequently grouped into seven distinct classes: (i) cast (inclusive of all types of casts); (ii) crystal; (iii) epithelia; (iv) epithelial nuclei; (v) erythrocyte; (vi) leukocyte; and (vii) mycete. The class distribution of this dataset, which contains more than 10,000 images, is tabulated in Table 2.

Table 2 Attributes of the collected dataset

We randomly selected 2000 images from each category; however, the crystal and epithelial nuclei categories contain fewer than 2000 observations, so all images from these categories were included in the final urine cell image dataset. Sample images from this dataset are shown in Fig. 1.

Fig. 1

Sample urine cell images of the used dataset: a cast, b crystal, c epithelia, d epithelial nuclei, e erythrocyte, f leukocyte, g mycete

3 Swin-LBP model

Our contribution to the field of computer vision is a new model that we call Swin-LBP. The main objective of this work is to significantly improve the classification capability of shallow models. As depicted in Fig. 2, the proposed model comprises five phases that are designed to work together.

Fig. 2

Graphical depiction of the proposed model. P: patches; LBP: local binary pattern; f: the generated individual feature vectors; F: merged feature vectors; s: selected feature vectors; p: predicted labels; v: voted vectors

The first phase involves preprocessing the data, whereby each urine sediment image is resized to 240 × 240 pixels and then divided into non-overlapping patches at six scales (30 × 30, 40 × 40, 48 × 48, 60 × 60, 80 × 80, and 120 × 120 pixels). This process generates 64, 36, 25, 16, 9, and 4 patches per resized image, respectively. Next, in the second phase, we use LBP [28] to extract 59 features (as described in Sect. 3.2) from each of the patches and from the undivided sediment image. At every extraction layer, this yields one feature vector more than the number of patches (the extra vector comes from the undivided image). We then merge the generated feature vectors to create six merged feature vectors for each input sediment image.

In the third phase, we employ the neighborhood component analysis (NCA) [29] feature selection function to select the 295 most informative features from each feature vector, thereby balancing the lengths of the six feature vectors. In the fourth phase, we feed the six selected vectors, each containing the top discriminative features, to a shallow support vector machine (SVM) [30] classifier to obtain six predicted vectors using a tenfold cross-validation strategy.

Finally, in the fifth and last phase, we apply a majority voting algorithm to the six predicted vectors to obtain four voted vectors. From the six predicted vectors and four voted vectors obtained in the fourth and fifth phases, respectively, we select the one with the most accurate result as the final output. We provide technical details of each phase in the following sections. Moreover, an expanded block diagram of the proposed Swin-LBP is illustrated in Fig. 3.

Fig. 3

Block diagram of the Swin-LBP model (see text for detailed description). f, extracted feature vector; P, patch; SVM, support vector machine; LBP, local binary pattern; NCA, neighborhood component analysis

3.1 Preprocessing

In this first phase, Swin architecture-inspired shifted windows-based patch division is performed as follows:

Step 0: Read urine sediment images from the collected dataset.

Step 1: Resize each image to a 240 × 240 sized image.

Step 2: Apply six types of patch division to create six layers. This process is defined below.

$$\begin{gathered} p_{t}^{k} = Im\left( {i:i + s_{k} - 1,\;j:j + s_{k} - 1} \right),\quad s_{k} \in \left\{ {30,40,48,60,80,120} \right\}, \hfill \\ k \in \left\{ {1,2, \ldots ,6} \right\},\quad t \in \left\{ {1,2, \ldots ,\left( {\frac{240}{{s_{k} }}} \right)^{2} } \right\},\quad i,j \in \left\{ {1,\,s_{k} + 1,\,2s_{k} + 1, \ldots ,240 - s_{k} + 1} \right\} \hfill \\ \end{gathered}$$
(1)

where \(p\) represents a patch; \(Im\), the input image; \(k\), the patch-division layer; \(s_{k}\), the patch size of layer \(k\); and \(t\), the patch index within layer \(k\).
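
To make this patch-division step concrete, a minimal Python/NumPy sketch is given below (an illustration only: the original pipeline was implemented in MATLAB, and the function name `divide_into_patches` is ours).

```python
import numpy as np

PATCH_SIZES = (30, 40, 48, 60, 80, 120)  # layer-specific patch sizes s_k

def divide_into_patches(image, patch_size):
    """Split a 240 x 240 image into non-overlapping patch_size x patch_size patches (Eq. 1)."""
    h, w = image.shape[:2]
    return [image[i:i + patch_size, j:j + patch_size]
            for i in range(0, h, patch_size)
            for j in range(0, w, patch_size)]

# Example: a synthetic grayscale image already resized to 240 x 240
image = np.random.randint(0, 256, (240, 240), dtype=np.uint8)
for k, s in enumerate(PATCH_SIZES, start=1):
    patches = divide_into_patches(image, s)
    print(f"layer {k}: {len(patches)} patches of {s} x {s}")  # 64, 36, 25, 16, 9, 4 patches
```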

3.2 Feature extraction

LBP is a histogram-based feature extraction function deployed in the model to extract global and local textural features from the undivided resized sediment image and its corresponding patches, respectively, using neighborhood relations constrained within microstructural image units of 3 × 3 overlapping windows (Fig. 4).

Fig. 4

Block diagram of the LBP feature extraction function

For each resized input sediment image, MATLAB's extractLBPFeatures function is used at each of the six patch-division layers to extract 59 features from the undivided image and from each of its derived patches.

Step 3: Extract features from the resized image and from the generated patches. This process is defined below.

$$\begin{aligned} f_{1}^{k} = & bp\left( {Im} \right) \\ f_{t + 1}^{k} = & bp\left( {p_{t}^{k} } \right) \\ \end{aligned}$$
(2)

where \(f\) represents the generated feature vector with a length of 59; and \(bp(.)\), LBP function.

Step 4: Merge the generated feature vectors in every layer to create six merged feature vectors per input image.

$${F}^{k}\left(j+59\times \left(h-1\right)\right)={f}_{h}^{k}\left(j\right),\quad j\in \left\{1,2,\dots ,59\right\},\ h\in \left\{1,2,\dots ,{n}_{k}+1\right\}$$
(3)

where \({F}^{k}\) represents the kth merged feature vector and \({n}_{k}\), the number of patches in layer \(k\); the lengths of \({F}^{1},{F}^{2},{F}^{3},{F}^{4},{F}^{5},\) and \({F}^{6}\) are 3835 (= 65 × 59), 2183 (= 37 × 59), 1534 (= 26 × 59), 1003 (= 17 × 59), 590 (= 10 × 59), and 295 (= 5 × 59), respectively.
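
The feature extraction and merging steps (Eqs. 2–3) can be sketched as follows. The paper uses MATLAB's extractLBPFeatures; the Python snippet below instead uses scikit-image's uniform LBP (8 neighbors, radius 1, 'nri_uniform'), which also yields 59 bins, although the histogram normalization may not match MATLAB's exactly, and the helper names are ours.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_59(gray):
    """59-bin uniform LBP histogram (8 neighbours, radius 1), analogous to Eq. (2)."""
    codes = local_binary_pattern(gray, P=8, R=1, method='nri_uniform')
    hist, _ = np.histogram(codes, bins=59, range=(0, 59))
    return hist.astype(float) / max(hist.sum(), 1)   # normalised histogram

def layer_feature_vector(image, patch_size):
    """Concatenate the LBP features of the whole image and of its patches (Eq. 3)."""
    vectors = [lbp_59(image)]                         # f_1: undivided image
    for i in range(0, image.shape[0], patch_size):    # f_2 ... f_{n_k+1}: patches
        for j in range(0, image.shape[1], patch_size):
            vectors.append(lbp_59(image[i:i + patch_size, j:j + patch_size]))
    return np.concatenate(vectors)                    # merged vector F^k

image = np.random.randint(0, 256, (240, 240), dtype=np.uint8)
print(layer_feature_vector(image, 60).shape)          # (1003,) for layer 4: 17 x 59
```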

3.3 Feature selection

The NCA function, a simple and effective L1-norm distance-based feature selector [29], is deployed to select the 295 most discriminative features in each of the six merged feature vectors, which are of unequal lengths (see Sect. 3.2), generated per input urine sediment image. In so doing, two important aims are achieved: (i) reduction in data dimensionality; and (ii) balancing/equalizing the lengths of the resultant NCA-selected feature vectors to 295.

Step 5: Generate the qualified (ranked) feature indexes using the NCA feature selection function.

$${ind}^{k}=\xi ({F}^{k},y)$$
(4)

where \(\xi (.,.)\) represents the NCA feature selection function; \({ind}^{k}\) implies the qualified indexes of the features; and \(y\) defines actual labels.

Step 6: Choose the most informative 295 features from the extracted feature vectors.

$${s}^{k}\left(d,i\right)={F}^{k}\left(d,{ind}^{k}\left(i\right)\right), d\in \left\{\mathrm{1,2},\dots ,NoI\right\}, i\in \left\{\mathrm{1,2},\dots ,295\right\}$$
(5)

where \({s}^{k}\) represents the kth feature vector with a length of 295 and \(NoI\), the number of images.
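
Given per-feature importance weights from NCA (the paper uses MATLAB's fscnca; scikit-learn has no drop-in equivalent, so the weights below are a stand-in), the index-selection step of Eqs. (4)–(5) reduces to a top-k slice, sketched here:

```python
import numpy as np

def select_top_features(F, weights, n_selected=295):
    """Keep the n_selected highest-weighted features of a merged feature matrix.

    F       : (NoI, d) merged feature matrix of one layer
    weights : (d,) per-feature scores, e.g. NCA weights from MATLAB's fscnca
    """
    ind = np.argsort(weights)[::-1][:n_selected]      # qualified indexes, Eq. (4)
    return F[:, ind], ind                             # selected features s^k, Eq. (5)

rng = np.random.default_rng(0)
F = rng.random((100, 1003))        # toy data: 100 images, layer-4 feature length
weights = rng.random(1003)         # stand-in for learned NCA weights
s, ind = select_top_features(F, weights)
print(s.shape)                     # (100, 295)
```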

3.4 Classification

A shallow classifier, the SVM, is used in this model; SVM is one of the most widely used classifiers in the literature. Its hyperparameters are set as follows: polynomial kernel of order three, box constraint of 1, one-vs-all coding, and tenfold cross-validation. The classification process is defined below.

Step 7: Apply SVM-based classification.

$${p}^{k}=\kappa \left({s}^{k},y\right)$$
(6)

where \(p\) represents the predicted vector; and \(\kappa ()\), the SVM classifier function.
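
A minimal scikit-learn sketch of this classification step is shown below (the authors used MATLAB; polynomial-kernel defaults such as gamma and coef0 may differ between toolboxes, so exact accuracies are not expected to match).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def predict_layer_labels(s, y):
    """Tenfold cross-validated SVM predictions for one layer's selected features (Eq. 6)."""
    svm = OneVsRestClassifier(SVC(kernel='poly', degree=3, C=1.0))  # order 3, box constraint 1, one-vs-all
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_predict(svm, s, y, cv=cv)

# Toy usage: 140 samples, 295 selected features, 7 balanced classes
rng = np.random.default_rng(0)
s = rng.random((140, 295))
y = np.repeat(np.arange(7), 20)
p = predict_layer_labels(s, y)     # predicted vector p^k
```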

3.5 Majority voting

Mode function-based weightless/hard majority voting is implemented to augment the classification performance of the Swin-LBP model. This process is defined below.

Step 8: Calculate the accuracies of the predicted vectors.

Step 9: Sort predicted vectors in descending order of accuracy rates.

$$id=sort\left(acc\left({p}^{k}\right)\right)$$
(7)

where \(id\) represents the indexes of the predicted vectors sorted in descending order of accuracy, and \(acc(\cdot)\) denotes the accuracy of a predicted vector.

Step 10: Generate four voted predicted vectors.

$${v}^{1}=\omega \left({p}^{id\left(1\right)},{p}^{id\left(2\right)},{p}^{id\left(3\right)}\right)$$
(8)
$${v}^{2}=\omega \left({p}^{id\left(1\right)},{p}^{id\left(2\right)},{p}^{id\left(3\right)},{p}^{id\left(4\right)}\right)$$
(9)
$${v}^{3}=\omega \left({p}^{id\left(1\right)},{p}^{id\left(2\right)},{p}^{id\left(3\right)},{p}^{id\left(4\right)},{p}^{id\left(5\right)}\right)$$
(10)
$${v}^{4}=\omega \left({p}^{id\left(1\right)},{p}^{id\left(2\right)},{p}^{id\left(3\right)},{p}^{id\left(4\right)},{p}^{id\left(5\right)},{p}^{id\left(6\right)}\right)$$
(11)

where \(v\) represents the voted vector, and \(\omega ()\), the mode function.

Step 11: Calculate accuracies of the four voted predicted vectors and select the most accurate voted one.
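
Steps 8–11 can be summarized by the following sketch (class labels are assumed to be encoded as non-negative integers; the helper names are ours):

```python
import numpy as np

def mode_vote(pred_list):
    """Element-wise mode of several predicted-label vectors (hard majority voting)."""
    stacked = np.stack(pred_list)                               # (n_vectors, NoI)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, stacked)

def swin_lbp_voting(preds, y):
    """Sort the six layer-wise predictions by accuracy, build v^1..v^4, pick the best result."""
    accs = [np.mean(p == y) for p in preds]                     # Step 8
    order = np.argsort(accs)[::-1]                              # Step 9: descending accuracy
    voted = [mode_vote([preds[i] for i in order[:m]]) for m in (3, 4, 5, 6)]  # Step 10, Eqs. 8-11
    best = max(list(preds) + voted, key=lambda p: np.mean(p == y))            # Step 11 / final output
    return voted, best

# Toy usage: six random prediction vectors over 7 classes
rng = np.random.default_rng(0)
y = np.repeat(np.arange(7), 20)
preds = [rng.integers(0, 7, size=y.size) for _ in range(6)]
voted, best = swin_lbp_voting(preds, y)
```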

4 Results

Two performance metrics were used to evaluate the model: accuracy and F1-score [33, 34]. The corresponding equations are given below.

$$acc=\frac{tp+tn}{tp+tn+fp+fn}$$
(12)
$$f1=\frac{2tp}{2tp+fp+fn}$$
(13)

where \(acc\) represents accuracy; \(f1\), F1-score; and \(tp\), \(tn\), \(fp\), and \(fn,\) the number of true positives, true negatives, false positives, and false negatives, respectively. The performances of the proposed model have been presented using a tenfold cross-validation strategy.
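
For completeness, both metrics can be computed with standard library routines; the sketch below is illustrative, and the macro averaging used for the "overall" F1 is our assumption rather than a detail stated in the paper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0, 1, 2, 2, 1, 0, 2])   # toy labels
y_pred = np.array([0, 1, 2, 1, 1, 0, 2])

acc = accuracy_score(y_true, y_pred)                     # Eq. (12)
f1_per_class = f1_score(y_true, y_pred, average=None)    # Eq. (13), one value per class
f1_overall = f1_score(y_true, y_pred, average='macro')   # one possible "overall" F1
```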

4.1 Results of each layer

The Swin-LBP model extracts features from six layers, each with a defined patch size, which are input to the downstream NCA feature selector and SVM classifier. Table 3 summarizes the layer-wise classification performance obtained by the SVM classifier.

Table 3 7-class support vector machine-based classification performance by feature extraction layer

4.2 Voted results

A mode function-based majority voting algorithm was applied to the layer-wise predicted vectors. Table 4 summarizes the voted results obtained by hard voting. The Swin-LBP model attained the highest classification accuracy of 92.60% with the third voted vector, which is higher than the best layer-wise accuracy of 90.69% (Layer 4) (Table 3) attained before majority voting. Our proposed Swin-LBP was designed as a self-organized image classification architecture. Therefore, the most accurate result among the ten (six layer-wise plus four voted results) was selected as the final result, i.e., the Swin-LBP model attained 92.60% classification accuracy (and a 91.19% overall F1-score) on the urine sediment image dataset.

Table 4 7-class model classification performance obtained by majority voting on the predicted feature vectors

4.3 Class-wise results

Figure 5 and Table 5 depict the model's confusion matrix and class-wise performance, respectively, based on the final best results, which were determined by the highest accuracy scores among the six layer-wise and four voted results calculated by the SVM classifier and majority voting, respectively. The best and worst classification performances were obtained for the "erythrocyte" (96.15% accuracy; 95.93% F1-score) and "epithelial nuclei" (69.26% accuracy; 76.96% F1-score) urine sediment classes, respectively.

Fig. 5

Confusion matrix obtained by applying the model on the urine sediment image dataset. Classes 1 to 7 correspond to the urine sediment classes “Cast,” “Crystal,” “Epithelia,” “Epithelial nuclei,” “Erythrocyte,” “Leukocyte,” and “Mycete,” respectively

Table 5 Model classification performance for each urine sediment class

4.4 Time complexity analysis

Swin-LBP is a lightweight feed-forward image classification model based on handcrafted feature extraction. Using big O notation, the model complexity is shown to be linear (Table 6).

Table 6 Computational complexity of the Swin-LBP model

5 Discussion

In this paper, a new urinalysis classification model was proposed that was trained and tested on a urine sediment image dataset comprising 12,330 urine sediment images distributed across seven distinct classes. The novel Swin-LBP model employed a new learning architecture inspired by the Swin transformer in combination with a handcrafted LBP-based feature extractor. Despite its linear time complexity, our Swin-LBP model attained 92.60% classification accuracy for the 7-class classification problem, which is commensurate with the classification performance of more computationally demanding CNN-based deep learning models that had been developed on the same dataset from which our study dataset was derived (Table 7).

Table 7 Comparison of automated image-based models for urine sediment analysis using the same dataset or its derivative*

Table 7 shows that our proposed model reached high classification performance and is a competitive feature engineering alternative to deep learning models.

5.1 Ablations

We present ablation studies to demonstrate the high classification performance of the Swin-LBP model. First, we compared the proposed model against models based on standalone feature extractors; we then evaluated different shallow classifiers and, finally, compared cross-validation strategies. These items are defined below.

  • Item 1: We used LBP, histogram of oriented gradients (HOG), and local phase quantization (LPQ) feature extractors to generate feature vectors.

  • Item 2: Neural network (NN), k-nearest neighbors (k-NN), and linear discriminant (LD) classifiers were utilized as alternatives.

  • Item 3: A tenfold cross-validation strategy was used to obtain the main results; threefold and fivefold cross-validation were also applied, and the results calculated using these validations are presented for comparison.

The results of this ablation study are illustrated in Fig. 6.

Fig. 6

Classification accuracies obtained for the ablation study

According to Fig. 6, the proposed Swin-LBP achieved a classification accuracy of 92.60% on the studied dataset. We used LBP as the primary feature extraction function of the introduced Swin-LBP architecture, and the standalone LBP-based urinary image classification model attained 80.84% classification accuracy. Therefore, our proposed Swin-LBP outperforms LBP by 11.76 percentage points on this dataset. In addition, HOG and LPQ feature extraction-based models achieved 85.77% and 86% accuracies, respectively. These results suggest that Swin-LBP is the best handcrafted model among all considered models.

We used NN, k-NN, LD, and SVM classifiers to obtain benchmark results. To compute classification accuracies, we used the feature vectors of layer 4 (60 × 60 sized patches); the accuracies of these classifiers are shown in Fig. 7.

Fig. 7

Accuracy obtained using various classifiers

In Fig. 7, we observe that the SVM classifier achieves the highest accuracy of 90.63%, surpassing the NN classifier's performance of 87.20%.
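
The classifier comparison in this item can be reproduced along the following lines; this is a scikit-learn sketch under our own assumptions for the NN and k-NN hyperparameters, which the paper does not specify, and it can be called with the (s, y) arrays from the earlier sketches.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

def compare_classifiers(s, y):
    """Tenfold CV accuracy of four shallow classifiers on one layer's selected features."""
    classifiers = {
        'SVM': SVC(kernel='poly', degree=3, C=1.0),
        'k-NN': KNeighborsClassifier(n_neighbors=1),   # neighbour count assumed
        'NN': MLPClassifier(max_iter=500),             # network size/iterations assumed
        'LD': LinearDiscriminantAnalysis(),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return {name: cross_val_score(clf, s, y, cv=cv).mean() for name, clf in classifiers.items()}
```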

For the defined Item 3, the results obtained using three cross-validation techniques (threefold CV, fivefold CV, and tenfold CV) for the generated six feature vectors are shown in Fig. 8.

Fig. 8

Classification accuracies obtained for various feature vectors and SVM classifier with three validations (threefold, fivefold and tenfold CVs)

Figure 8 demonstrates that the tenfold cross-validation (CV) technique produces the highest accuracy among the validation methods used, although we also evaluated the proposed Swin-LBP approach with the two additional validation techniques. Furthermore, we calculated final results using information fusion (iterative majority voting) of the k-fold CV-based models; these classification results are depicted in Fig. 9.

Fig. 9

Summary of accuracies obtained for three cross-validation techniques

Figure 9 shows that tenfold CV is the best validation technique.

Our experimental results for ablations, as shown above, validate the effectiveness of the proposed Swin-LBP approach in achieving the optimal combination of parameters.

5.2 Highlights of the study

We present the highlights of this study under three headings: (1) findings, (2) advantages, and (3) limitations. These important points are listed below.

5.2.1 Findings

  • Presents an automated urine sediment analysis system.

  • Swin-LBP model uses machine learning algorithms to classify urine sediment images.

  • It is based on the Swin transformer architecture and the LBP feature extraction technique and has five phases. Moreover, the introduced Swin-LBP has six feature extraction layers, each using a different fixed patch size.

  • The phases are preprocessing, feature extraction, feature selection, support vector machine-based classification, and majority voting.

  • The model was trained and tested on a 7-class urine sediment image dataset containing 12,330 images. This model achieved an accuracy of 92.60% and an average precision of 92.05%.

  • According to the class-wise results, the most accurately classified cell type is erythrocyte and the least accurately classified is epithelial nuclei. There are only 687 epithelial nuclei cell images in this dataset, and these cells are visually similar to epithelial cells.

  • The best feature extraction layer is the 4th layer, which uses 60 × 60 sized patches. The second-best features were generated by the 5th feature extraction layer (80 × 80 sized patches). The classification accuracies of the 4th and 5th feature extraction layers are 90.69% and 90.67%, respectively.

  • The worst is the 1st feature extraction layer (30 × 30 sized patches), whose features yielded 88.87% classification accuracy.

  • We compared commonly used shallow classifiers, and SVM yielded the best results; therefore, we used this classifier.

  • Three cross-validation strategies (threefold, fivefold, and tenfold) were used to obtain the classification results.

5.2.2 Advantages

  • Our team has found that Swin-LBP improves LBP's classification performance by 11.76 percentage points.

  • We tested the recommended Swin-LBP approach on a large dataset of 12,330 urine cell images, achieving a classification accuracy of 92.60%. This result highlights the potential of handcrafted models to achieve outstanding performance on large image datasets.

  • Our study also revealed that Swin-LBP outperforms published deep models developed on the same urine sediment image dataset.

  • The Swin-LBP model proposed in our study is simple and can be easily implemented by researchers to address image classification tasks.

5.2.3 Limitations

  • Although fine-tuning operations can achieve higher classification performance, we require a fast-responding model. As such, we opted not to use any optimization techniques.

  • To further validate the proposed Swin-LBP model's classification performance, additional urine image datasets could be used. Such datasets could provide more comprehensive insights into the model's generalization capabilities and its potential to address real-world image classification problems.

6 Conclusions

In this work, we proposed a computationally lightweight yet accurate model for automated analysis of urine sediments based on a Swin transformer-inspired, shifted windows-based patch division. Our approach enabled global and local textural feature extraction, selection, and classification using LBP, NCA, and SVM, respectively. The model achieved excellent results on a derived 7-class study dataset comprising 12,330 urine images, with a classification accuracy of 92.60%. Our model has low time complexity and is simple yet accurate, making it suitable for real-world urine sediment analysis.

Our study also highlights the versatility and utility of shifted windows-based patch division for general image classification problems, enabling multilevel downstream feature extraction with handcrafted feature engineering. Specifically, our results confirm the feasibility of the Swin-LBP approach in biomedical image analysis applications.

In future work, we plan to develop automated urine cell counting and classification applications in which the Swin architecture can be combined with other image classification options such as transfer learning, further extending the utility of our approach.