1 Introduction

Artificial Intelligence (AI) based systems have transformed efficiency and automation across numerous industrial fields. One vital aspect within this field is Object Detection (OD), which plays a key role in streamlining activities and decision-making in real-time industrial scenarios. AI and machine learning systems are progressively adopted to automate OD, enabling precise control and monitoring of real-time activities. This improves productivity, guarantees safety and time efficiency, and boosts economic growth (Jan et al. 2022).

Deep Learning (DL) has attained remarkable success in Computer Vision (CV), especially in OD and image classification tasks, automating them in real time using Convolutional Neural Networks (CNN). OD is a core CV task that recognizes and localizes objects of interest in images or videos. It is accomplished by categorizing the objects in the scene and predicting accurate bounding box coordinates. OD has a wide range of applications in robotics, remote sensing (Wang et al. 2022; Gallo et al. 2024), healthcare (Salman et al. 2022), agriculture (Gallo et al. 2023), image enhancement (Landro et al. 2024), and the food industry (Rehman et al. 2023), allowing AI systems to comprehend and communicate with the visual environment. Modern OD algorithms are mostly trained on domain-specific tasks and perform exceptionally well when the training and test images share the same distribution (Zou et al. 2023). However, they perform poorly when assessed on visual representations that differ from those seen during training, owing to the Domain Shift (DS) problem (Kondrateva et al. 2021).

DS occurs when the data distributions of the target and source domains differ, raising a compelling challenge for OD tasks. The degradation in performance due to DS is problematic for real-time critical systems like autonomous driving, video surveillance, and digit recognition, where it is unavoidable. For example, consider an OD system that uses Closed-Circuit Television (CCTV) images/videos from numerous camera sensors to identify objects on the road. If the training images from the camera do not capture fluctuations in noise characteristics, varying weather conditions, and image resolutions, the OD system's efficiency could suffer in challenging situations. Similarly, if a model is trained for digit recognition on data consisting of synthetic images (MNIST (LeCun et al. 1998)), printed images (SVHN (Netzer et al. 2011)), scanned images (USPS (Hull 1994)), and handwritten digits, and then tested on real-world images such as digits on weight scale machines, its performance is likely to suffer due to variations in image properties across conditions, like color variation, image distortion (e.g., blurriness) (Michaelis et al. 2019), and different camera angles and backgrounds. As a result, algorithms that can handle label scarcity and DS issues must be developed.

To avoid the cost of building a larger, accurately labeled dataset for OD, this study constructed a small digit recognition dataset named the Planeat dataset. This study investigates the real-time food packaging process of the Planeat industry, which motivated the construction of this new dataset. We visited the food packaging plant to automate its food package weight detection process and captured images of weight scale machines to generate a digit dataset (see Sect. 4.1.2). It consists of 231 images with correct OD annotations. The purpose of the Planeat dataset is to recognize digits using a DA technique that does not require a larger dataset. We consider the Planeat dataset as the target (T) dataset and employ the publicly available labeled digit Mixture dataset (MixtureDataset 2022) as the source (S) dataset.

Like the genetic algorithm (Grefenstette 1993) and the neural network (Nielsen 2015), which draw on natural processes, this work proposes a Cross-Pollination of Knowledge (CPK) strategy at the input level for source and target dataset images. CPK, inspired by cross-pollination in botany (Wertheim 1995), involves the exchange of concepts or features across different entities, enhancing genetic diversity and adaptability (Willard et al. 2022). This metaphorical idea has yielded remarkable outcomes in various fields (Kusters et al. 2020; Cranenburgh et al. 2022). In the context of domain adaptation for OD, this study utilizes CPK to enhance the adaptability of DL models trained on one domain (source) to another (target). The approach blends samples from the target dataset into the source dataset, facilitating a more robust and accurate OD framework. This fusion of different images improves the model's functionality across diverse detection scenarios.

CPK in DA strategically integrates a subset of target domain images into the source domain during training. Unlike typical data augmentation, which modifies existing images within a single domain, CPK introduces unaltered target domain data into the source training set. Existing mixup augmentation (Zhang et al. 2017) creates blended images through linear combinations of pairs of images and their labels; our method instead transfers unmodified target domain data into the source domain training set. This is not merely a different setting of mixup (Zhang et al. 2017) but an entirely different approach aimed at bridging the gap between two distinct domains. This is a fundamental difference, as our method does not create synthetic variations of existing data but strategically utilizes actual data from the target domain.
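To make the distinction concrete, the following minimal sketch contrasts mixup's pixel-level blending with CPK's sample-level transfer. It is an illustrative reconstruction in Python with NumPy, not the original implementation; the function and variable names are our own.

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.2):
    # mixup (Zhang et al. 2017): synthesizes a blended image and label
    # as a convex combination of two samples from the SAME domain.
    lam = np.random.beta(alpha, alpha)
    img = lam * img_a + (1.0 - lam) * img_b
    label = lam * label_a + (1.0 - lam) * label_b
    return img, label

def cpk_fuse(source_set, target_set, n):
    # CPK: no pixels are altered; n unmodified target samples
    # (image plus annotation) are appended to the source training set.
    chosen = np.random.choice(len(target_set), size=n, replace=False)
    return source_set + [target_set[i] for i in chosen]
```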

With the CPK strategy, this work mixes N (N = 1, 2, 3, 4, \(\ldots\)) target dataset images into the source dataset, merging random and unique image samples from the target dataset into the training set of the source dataset. This work performed supervised digit recognition experiments using the proposed CPK approach and obtained impressive results compared to Unsupervised Domain Adaptation for digit recognition (UDA-digit) experiments. This study conducts detailed experiments with CPK for DA on different datasets, including benchmarks such as Mixture to Planeat (MtoP), Sim10k to Cityscape (SMtoC) (Johnson-Roberson et al. 2016; Cordts et al. 2016), and KITTI to Cityscape (KtoC) (Geiger et al. 2012), and the digit and car detection experiments show efficient performance. For a fair comparison, this study also employs a self-supervised DA framework for digit recognition (UDA-digit) (Mekhalfi et al. 2023) to automate the food industrial process by solving the DS problem of digit detection. The detector trains on the Mixture dataset, uses pre-trained weights to obtain pseudo labels from the Planeat dataset, crops and augments high-confidence regions, and produces one composite image at the input level of the detector. This image trains the detector in a self-supervised manner for digit recognition from the Mixture to the Planeat (MtoP) dataset.

The summary of the main contributions of this paper is given below.

  • Developed a new real-time digit recognition dataset for supervised domain adaptation to automate the industrial process.

  • Introduced a Cross-Pollination of Knowledge (CPK) strategy for object detection tasks in domain adaptation.

  • Conducted empirical analysis with the CPK approach, outperforming the existing DA methods.

The rest of the paper is organized as follows: related work and methodology are described in Sects. 2 and 3, respectively; experiments and dataset construction are in Sect. 4; and results, discussion, and conclusion are given in Sects. 5 and 6.

2 Related work

Most recent OD-based UDA methods handle the DS problem for DA either by producing accurate pseudo labels (Yu et al. 2022; Zhao et al. 2020; VS et al. 2023; Oza et al. 2023) or by adopting data augmentation techniques (Mekhalfi et al. 2023; Hendrycks et al. 2019, 2021; Michaelis et al. 2019).

UDA for Object Detection: Given an unlabeled target and a labeled source dataset, UDA focuses on employing the available source samples to generate a model capable of generalizing and performing well on the target dataset. A conventional method is to lessen the domain gap by reducing the distance between the feature distributions of the source and target domains using a discrepancy loss (Long et al. 2015; Sun and Saenko 2016). However, recent UDA studies demonstrated the advantage of employing pseudo labels to maximally exploit target domain information with self-training approaches (Liang et al. 2021; Li et al. 2021, 2021). To tackle the performance degradation caused by overfitting to noisy pseudo labels, Yu et al. (2022) proposed an uncertainty control method that fuses pseudo label sets produced via stochastic inference. Li et al. (2021) introduce a self-entropy metric to set a suitable confidence threshold for unbiased pseudo labels. Another study (Mekhalfi et al. 2023) efficiently extracts pseudo labels from the target dataset using a model trained on the source dataset, applies data augmentation to high-confidence regions, mixes the various augmented outputs, and performs UDA over the augmented regions to detect target domain features for car detection. Mattolin et al. (2023) presented a radically different approach with mix-based data augmentation on the local regions of the target data that correspond to the most confident pseudo detections of the source data.
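To illustrate the discrepancy-loss idea mentioned above, the sketch below gives a CORAL-style covariance alignment loss in PyTorch, following Sun and Saenko (2016); it is a generic reconstruction for exposition, not code from the cited papers, and the tensor shapes are assumptions.

```python
import torch

def coral_loss(source_feats, target_feats):
    """CORAL (Sun and Saenko 2016): squared Frobenius distance between
    the covariance matrices of source and target feature batches,
    each of assumed shape (batch, dim)."""
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.size(0) - 1)

    d = source_feats.size(1)
    diff = covariance(source_feats) - covariance(target_feats)
    return (diff * diff).sum() / (4.0 * d * d)
```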

Data augmentation for Object Detection in UDA: Data augmentation techniques (Mekhalfi et al. 2023; Hendrycks et al. 2019, 2021; Michaelis et al. 2019) have also been adopted to handle DS for DA. Augmentation for DA generates distinctive data samples to improve detector adaptability across different data domains. It includes synthetic data construction, feature-space transformation, and domain-specific augmentation, helping the model generalize efficiently across domains. In UDA, recent studies (DeVries and Taylor 2017; Wu et al. 2020; Yun et al. 2019; Zhang et al. 2017; Hendrycks et al. 2019) have demonstrated that data augmentation can produce feasible improvements by generating mixed samples from the training and target data (Mekhalfi et al. 2023; Mattolin et al. 2023) in segmentation and classification tasks. Augmentation-based methods alleviate DS issues at the input level without changing the detector architecture. In the recent DA literature, authors mixed local regions of target samples that lead to high-confidence pseudo detections with source samples (Mattolin et al. 2023). Another study (Mekhalfi et al. 2023) detects highly confident regions, applies data augmentation techniques to these regions, and composes an image of each region with its augmented versions before performing the OD task.

The authors in Kusters et al. (2020) described the concept of interdisciplinary research combining AI with other domains. In addition, Willard et al. (2022) discussed integrating scientific and natural knowledge with modern AI methods and properly selecting models and approaches from different domains. Bataduwaarachchi et al. (2023) applied a CNN to a new tomato image dataset to predict tomato cross-pollination. In contrast to recent UDA, Supervised DA (SDA), and existing data augmentation methods, this study introduces an integration of a concept from the natural botany domain with AI for object detection. It presents a mixup strategy based on the Cross-Pollination of Knowledge (CPK) at the input level of training between the source and target datasets for digit and car recognition in domain adaptation.

Fig. 1: Working pipeline of UDA-digit and the CPK strategy. The methodology comprises two phases, and the first is subdivided into two steps. In the first step, a detector is trained on the source dataset. In the second step, the pre-trained weights of the first step help the detector extract pseudo labels from the target dataset and perform detection, cropping, and augmentation for unsupervised domain adaptation training. In the second phase, the proposed cross-pollination of knowledge strategy is applied to the source and target datasets by mixing unique and random images to reduce domain shift for digit recognition. In the end, a comparison is performed between the proposed CPK and UDA methods

3 Methodology

The proposed methodology is illustrated in Fig. 1, describing the digit recognition approach using domain adaptation with multiple datasets: Mixture, Planeat, KITTI, Cityscape, and Sim10k. A new digit recognition dataset, termed Planeat, was developed specifically for this methodology. The approach comprises two phases: the initial phase utilizes the UDA-digit approach, while the subsequent phase incorporates the proposed Cross-Pollination of Knowledge (CPK) approach. The first phase is further subdivided into two steps. In the first step, the YOLOv5 detector is trained with labeled source data to generate pre-trained weights. In the second step, these pre-trained weights are applied to target data, enabling the detector to identify object regions. These regions are then cropped and augmented, producing a composite image that combines the detected and augmented regions for self-supervised domain adaptation. In the second phase, the proposed CPK approach integrates images from the target into the source dataset to enhance digit recognition. This phase focuses on the fusion of multiple unique and random images to improve the robustness of the digit recognition model using YOLOv5. Additionally, the effectiveness of unsupervised domain adaptation is compared with the proposed CPK approach. The subsequent sections provide an in-depth discussion of UDA, YOLOv5, and the proposed CPK approach, elaborating on their application to the challenge of digit detection.

3.1 Unsupervised domain adaptation - UDA

Domain Adaptation (DA) is a machine learning method that focuses on applying detectors trained on a source dataset to a different target dataset. DA can be used in various domains like object detection, text classification, speech recognition, etc. Several scientific papers explore domain adaptation scenarios where the target domain is labeled and the adaptation process exploits these target labels. This scenario is often referred to as "Supervised Domain Adaptation (SDA)" because it uses labeled data from both the source and target domains (Ben-David et al. 2010; Yao and Doretto 2010).

Detectors typically perform poorly on the target dataset when the data distributions between the source and target domains diverge. Feature alignment is a familiar technique in DA that focuses on making the representation of features of the source and target dataset more similar, which can enhance detector efficiency on the target dataset.

We denote an RGB image from the data distribution of the labeled source Mixture dataset as \(I_S \in \mathbb{R}^{H \times W \times 3}\), and each \(I_S\) comes with its ground truth \(I_G\), where

\(I_G = \{(B_i, l_i): i = 1, 2, 3, \ldots, N_S\}\)

consists of \(N_S\) ground truth annotations with bounding boxes; \(B_i \in \mathbb{R}^{4}\) indicates the two opposite corners of the \(i\)-th bounding box, and \(l_i \in [1, \ldots, L]\) is the label depicting the category of the object in \(I_S\).

We define an RGB image sampled from the data distribution of the target Planeat dataset as \(I_T \in \mathbb{R}^{H \times W \times 3}\). We consider \(I_S\) and \(I_T\) to be similar in size, and a DL network conducts OD using a function

\(\Phi _\nabla : {\mathbb {R}}^{H \times W \times 3} \rightarrow {\mathbb {R}}^{M \times 4} \times [1,..., L]^M\)

with parameters \(\nabla\) to generate a list of \(M\) detections of objects of interest from an RGB image \(I\), i.e., \(\Phi_\nabla(I) = D\).

To obtain pseudo labels \(S_L\) for training the detector, the DL network \(\Phi_\nabla\), trained on the source data, takes a target image \(I_T\) and generates

\(S_L \in \mathbb{R}^{M_T \times 4} \times [1, \ldots, L]^{M_T}\).

From \(S_L\), we must retain the reliable detections and eliminate those with low confidence that may lead to false positives. To achieve this, the input target image \(I_T\) is divided into rows and columns to create a grid of \(S_{row} \times S_{col}\) cells. The most confident region is then assigned to each cell, derived by aggregating the confidence values of the bounding boxes whose centers fall inside it. From \(I_T\), the highly confident region \(\tilde{I}_T\) is cropped, and the detections outside the confident portion are trimmed to obtain \(\tilde{S}_L\).

Given that the most suitable detections \(\tilde{S}_L\) occupy only a small region of \(I_T\), the following steps are performed: (a) fill the remaining cells of the grid with pertinent data to prevent computational waste; (b) use samples from the target images to balance the source and target data during DA; and (c) most importantly, apply augmentation methods to \(\tilde{I}_T\) to fill the remaining cells of the grid while maintaining the reliability of the pseudo labels, since \(\tilde{S}_L\) is converted according to the selected augmentation methods rather than approximated over heavily distorted variants of \(\tilde{I}_T\). During the experiments, randomly chosen augmentations (contrast, brightness perturbation, color jittering, downscaling, horizontal flip, blurring, ping, and cropping) were applied to \(\tilde{I}_T\) and \(\tilde{S}_L\) to generate a composed image of the same size as \(I_T\). The key purpose of the augmented composite image is to employ \(\tilde{S}_L\) to self-train \(\Phi_\nabla\) with pseudo detections: \(\tilde{I}_T\) is fed to \(\Phi_\nabla\) to obtain the detections \(\tilde{D}_T\), and consistency between \(\tilde{D}_T\) and \(\tilde{S}_L\) is enforced by minimizing the loss.
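A simplified sketch of this composition step is given below. It assumes OpenCV and NumPy; the grid size, the particular augmentations, and the helper names are illustrative assumptions rather than the exact UDA-digit implementation (which also transforms the pseudo boxes \(\tilde{S}_L\) consistently with each augmentation).

```python
import random
import cv2
import numpy as np

def flip(p): return cv2.flip(p, 1)                               # horizontal flip
def blur(p): return cv2.GaussianBlur(p, (5, 5), 0)               # blurring
def brighten(p): return cv2.convertScaleAbs(p, alpha=1.0, beta=40)  # brightness

AUGMENTATIONS = [flip, blur, brighten]  # subset of those listed above

def compose_pseudo_label_image(target_img, confident_crop, s_row=2, s_col=2):
    """Tile a canvas the size of the target image: keep the most confident
    cropped region in one cell and fill the remaining cells with augmented
    copies, so the pseudo boxes transform in a known, reversible way."""
    h, w = target_img.shape[:2]
    cell_h, cell_w = h // s_row, w // s_col
    canvas = np.zeros_like(target_img)
    for r in range(s_row):
        for c in range(s_col):
            patch = confident_crop if (r, c) == (0, 0) \
                else random.choice(AUGMENTATIONS)(confident_crop)
            patch = cv2.resize(patch, (cell_w, cell_h))
            canvas[r*cell_h:(r+1)*cell_h, c*cell_w:(c+1)*cell_w] = patch
    return canvas
```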

3.2 You only look once - YOLO

This study used the fifth version of the 'You Only Look Once' (YOLO) model (Github 2022; Redmon et al. 2016; Redmon and Farhadi 2017, 2018; Bochkovskiy et al. 2020; Gallo et al. 2023), a Single-Stage (SS) State-Of-The-Art (SOTA) detector used extensively for OD. SS detectors are more feasible for real-time applications than two-stage detectors, as they are faster by conducting localization and classification simultaneously in a single pass through the neural network. The YOLO detector comprises three main modules: the backbone, neck, and head. The backbone performs feature extraction, progressively extracting hierarchical features that are important for identifying objects. Darknet (Redmon and Farhadi 2018), MobileNet (Howard et al. 2017), and ResNet (He et al. 2016) are common backbones for OD detectors. The neck accomplishes feature fusion and transition while connecting the backbone to the head. It frequently combines features from various scales to supply the head with broad information. Feature pyramids (Lin et al. 2017) are a frequent strategy for managing objects of various sizes in the neck. The head manages the predictions in YOLO: it uses the fused features provided by the neck and generates bounding boxes and class scores for the image.

YOLOv5 comprises five variants: nano, small, medium, large, and extra large. The variants differ only in their number of parameters and layers. The Cross Stage Partial (CSP) network CSP-Darknet53 (Bochkovskiy et al. 2020) is the backbone of YOLOv5. The YOLO deep network employs residual and dense blocks to allow information to travel to the deepest layers and circumvent the vanishing gradient issue. However, one drawback of employing dense and residual blocks is the issue of redundant gradients. CSP networks tackle this issue by truncating the gradient flow: in YOLOv5, the feature map of the base layer is divided into two parts that are joined through a cross-stage hierarchy. This approach brings significant benefits to YOLOv5, as it reduces the number of parameters and requires less processing (fewer FLOPS), which increases the inference speed, a key factor in real-time OD models. YOLOv5 introduced two significant modifications in its neck: first, the Path Aggregation Network (PANet) (Liu et al. 2018) with the incorporation of BottleNeckCSP, and second, the Spatial Pyramid Pooling (SPP) block (He et al. 2015). PANet, a feature pyramid network, was adopted in YOLOv4 to enhance the flow of information and aid in accurately locating pixels for mask prediction; it was modified in YOLOv5 by adding the CSPNet strategy. The SPP block aggregates the data it receives from its inputs and outputs a fixed-length result. As a result, it significantly expands the receptive field and extracts the most important context features without affecting the network's speed. YOLOv5 has the same head as YOLOv4, comprising three convolutional layers to predict the output class scores and locate the bounding boxes (x, y, width, height).
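As a usage reference, YOLOv5 can be loaded and run through PyTorch Hub as sketched below; this uses the public ultralytics/yolov5 interface, and the image path is a placeholder.

```python
import torch

# Load the small variant (YOLOv5s) with COCO pre-trained weights
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference; the image path is illustrative
results = model('weight_scale_sample.jpg')

# Each detection row: x1, y1, x2, y2, confidence, class index
print(results.xyxy[0])
results.print()  # per-class summary of detections
```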

3.3 Cross-pollination of knowledge strategy

Cross-pollination of knowledge originates from the field of botany, where pollen moves from one flower to another via insects for reproduction (Wertheim 1995). With new interdisciplinary research in engineering and environmental sciences, various studies are adapting natural science methodologies to solve complex AI scenarios with better accuracy (Willard et al. 2022). DL models are interconnected across domains and adapt concepts from physics, natural science, business strategy, etc., to improve efficiency. This study likewise adopts the well-known concept of cross-pollination to fuse samples of one domain into another (Willard et al. 2022; Kusters et al. 2020; Cranenburgh et al. 2022). In OD, the idea of CPK offers a compelling strategy. It entails the deliberate mingling of datasets, particularly the fusion of target domain samples into the source domain. This knowledge-sharing strategy improves detection accuracy by enabling the detector to transfer insights from one domain to another, mainly when datasets share similar object categories. The CPK method proves constructive in adapting detectors to new domains, giving more accurate and robust OD results. For a new domain of interest, where a larger dataset is scarce and not easy to construct from scratch for CV tasks, especially OD, DA usually works efficiently to overcome the scarcity of samples by using the source dataset, which typically comprises a larger annotated dataset. The target datasets, in contrast, are usually very small and may or may not be labeled.

In this study, we adopted the concept of cross-pollination of knowledge (CPK), in the spirit of genetic algorithms and neural networks. The CPK strategy introduces the fusion of target images \(I_t\) into the labeled source images \(I_s\). We feed the labeled source dataset images \(I_s\) to the detector \(\Phi_\nabla\) to obtain weights for performing detection on \(I_t\). Algorithm 1 describes the working of the proposed CPK concept for the random and unique mixup of target images into the source dataset, using the YOLOv5 detector as the neural network. To improve detection over the \(I_T\) images, we proceed as follows:

1: Source Domain: \(S\) (Mixture dataset),

Training source samples:

\(T_{ss} = \{I_{is}, l_{is}\}\) for \(i = 1\) to \(T_{ss}\) (with \(T_{ss}\) samples in the source domain),

Labels: \(l_{is}\) represents the labels for source domain samples, and Model: \(\Phi _{\nabla _s}\), a neural network detector trained on the source domain

2: Target Domain: \(T\) (Planeat dataset),

Test target samples:

\(T_{ts} = \{I_{jt}, l_{jt}\}\) for \(j = 1\) to \(T_{ts}\) (with \(T_{ts}\) samples in the target domain),

Labels: \(l_{jt}\) represents the labels for target domain samples.

Case 1: No mixup (baseline). Training and testing use the original source and target datasets:

\(T_{ss} \longrightarrow \Phi_{\nabla_s} \longrightarrow T_{ts} \longrightarrow \text{Detection Accuracy}_1\)


Case 2: Mixup with 5 unique images from P (Planeat dataset). Five unique images from the target dataset are added to the source dataset:

\(T'_{ss} = \{I_{is}, l_{is}\} \cup \{I_{jt}, l_{jt}\} \text { for } j = 1 \text { to } 5 \rightarrow \Phi '_{\nabla _s}\)

Training and testing on the mixed dataset:

\(T'_{ss} \longrightarrow \Phi'_{\nabla_s} \longrightarrow T_{ts} \longrightarrow \text{Detection Accuracy}_2\)


Case 3: Mixup with random images from P (Planeat dataset). A random number of images (e.g., 1, 2, 3, 4, 5, 10, 15, 20) from the target dataset is added to the source dataset:

\(T''_{ss} = \{I_{is}, l_{is}\} \cup \{I_{jt}, l_{jt}\} \text { for random } j \text { images} \rightarrow \Phi ''_{\nabla _s}\)

Training and testing on the mixed dataset:

\(T''_{ss} \longrightarrow \Phi''_{\nabla_s} \longrightarrow T_{ts} \longrightarrow \text{Detection Accuracy}_3\)

These cases represent different scenarios for DA. The improvement in detection accuracy can be quantified as:

\(\text{Improvement}_2 = \text{Detection Accuracy}_2 - \text{Detection Accuracy}_1\)

\(\text{Improvement}_3 = \text{Detection Accuracy}_3 - \text{Detection Accuracy}_1\)

For Case II in Algorithm 1, we automatically select 5 unique images from the target dataset and fuse them into the respective source dataset. These images were specifically chosen because they collectively cover all classes present in the Planeat dataset. This selection aimed to ensure that the detector is exposed to a representative sample of the entire class spectrum of the target dataset. In Case III, we randomly choose 5, 10, and 15 images from the target dataset to further enhance the model's adaptation. This random selection allows us to test the robustness of the model against varying degrees of DS with different numbers of images, simulating scenarios where limited target data is available. The selection of unique and random images was performed only once, before training on the respective datasets. This work employs the CPK strategy on the Mixture, Planeat, Cityscape, KITTI, and Sim10k datasets for domain adaptation, and the results are given in the following sections.
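A minimal sketch of this fusion step is given below, assuming both datasets are stored on disk in YOLO format; the directory layout, file extensions, and the example indices are illustrative assumptions, and the selection is performed once before training, as described above.

```python
import random
import shutil
from pathlib import Path

def cpk_fuse_datasets(source_dir, target_dir, n, unique_ids=None, seed=0):
    """Copy n target images (and their YOLO .txt labels) into the source
    training set. Pass unique_ids for Case 2 (hand-picked class-covering
    images); otherwise sample randomly (Case 3)."""
    src_img = Path(source_dir) / 'train' / 'images'
    src_lbl = Path(source_dir) / 'train' / 'labels'
    tgt_imgs = sorted((Path(target_dir) / 'images').glob('*.jpg'))
    random.seed(seed)  # selection happens once, before training
    chosen = (unique_ids if unique_ids is not None
              else random.sample(range(len(tgt_imgs)), n))
    for i in chosen:
        img = tgt_imgs[i]
        lbl = img.parent.parent / 'labels' / (img.stem + '.txt')
        shutil.copy(img, src_img / img.name)
        shutil.copy(lbl, src_lbl / lbl.name)

# Case 2: five unique Planeat images covering all 12 classes (indices invented)
# cpk_fuse_datasets('mixture', 'planeat', n=5, unique_ids=[3, 17, 42, 88, 130])
# Case 3: ten random Planeat images
# cpk_fuse_datasets('mixture', 'planeat', n=10)
```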

4 Experimental analysis

In this section, the UDA-digit approach and the CPK strategy are thoroughly evaluated on five DA datasets. First, we describe the selection of the Mixture (M) dataset and the construction of the Planeat (P) dataset. Second, this study performs the experiments using the proposed CPK for digit recognition and compares its results with recent UDA techniques. Lastly, the overall performance of the presented approach with different parameter settings is described.

4.1 Dataset description

Fig. 2: Different sample images from the Mixture and Planeat datasets. The Mixture dataset is publicly available, and the Planeat dataset is newly constructed in this study for digit recognition

Table 1 Mixture and Planeat dataset descriptions and statistics
Table 2 MtoP performance comparison for CPK strategy experiments. The training and testing of the Planeat (P) dataset uses only its 10 images, a mixup of 5 unique Planeat images, and a mixup of random 5, 10, 15, and 20 Planeat images in the Mixture dataset. The unseen images of the Planeat target set (TS) were used for testing purposes in all experiments. The fusion of MtoP+5images-Unique and random MtoP+10images outperformed the rest in all evaluation metrics for digit recognition
Table 3 Performance comparison between MtoP-UDA and CPK-based MtoP with random 1,2,3, and 4 image mixup. The Mixture (M) and Planeat (P) are source and target datasets. Only a mixup of random 2, 3, and 4 images of the Planeat target dataset in the source Mixture dataset obtains the highest evaluation scores compared to UDA methods for digit recognition

4.1.1 Mixture dataset

This study used the annotated, publicly available digit recognition dataset (MixtureDataset 2022) from the Roboflow (Robolfow 2020) platform; it is a mixture of digit recognition samples from various other datasets, hence named the Mixture (MD) dataset. Roboflow is an online platform with freely available annotated datasets for CV tasks. The MD consists of 2390 images in total. Images of digital weight scales, clocks, and single-digit synthetic images with different variations, including background colors (orange, green, sky blue, and red) and dull, bright, and transparent scenarios, are part of the MD. It contains 12 classes, namely “., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -”, and this study split the MD into 75%, 15%, and 10% for the train, validation, and test sets, respectively. Its statistical distribution is given in Table 1, and samples are illustrated in Fig. 2.
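As a sketch of the split described above (the exact splitting procedure is not specified, so the implementation below is only an assumed random split):

```python
import random

def split_mixture(image_ids, seed=0):
    # Assumed random 75/15/10 train-validation-test split of the
    # 2390 Mixture dataset images.
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.75 * len(ids))
    n_val = int(0.15 * len(ids))
    return (ids[:n_train],                 # train (75%)
            ids[n_train:n_train + n_val],  # validation (15%)
            ids[n_train + n_val:])         # test (10%)
```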

Fig. 3: Results obtained from the proposed CPK approach. The left side of the figure shows the output of the Mixture to Planeat (MtoP) domain adaptation. The right side shows the enhanced output when 5 random Planeat dataset samples are fused into the Mixture dataset (MtoP+5images). Mixture and Planeat are the source and target datasets, respectively

Algorithm 1: Domain adaptation with the cross-pollination of knowledge for object detection

Fig. 4: Output of the UDA-digit approach after recognizing digit instances using the Mixture to Planeat (MtoP) datasets

4.1.2 Planeat dataset

In this study, we also propose a new dataset named the Planeat dataset. Its primary purpose is to be well-suited for studying DA for digit recognition problems. For its construction, we visited the Planeat food packaging plant in Italy. The Planeat industry prepares food packages according to customer requirements, with accurate weights for every food item in the package. It calculates the weights of the packages manually with the help of weight machines. We installed Raspberry Pi cameras over five workstations to automate weight calculation with detection methods. These cameras automatically capture images of the food weight scales every second; the captured images are stored automatically on an Amazon Web Services (AWS) server and saved to a local machine using a real-time streaming protocol. We selected 231 appropriate Red, Green, and Blue (RGB) images and uploaded them to the Roboflow platform for annotation. Roboflow provides the facilities to prepare datasets for different detectors, as object detectors require a specific structure (labels and annotations) for the dataset images before any computation. Annotation and labeling are crucial steps in OD dataset construction: the first refers to manually drawing boxes (bounding boxes) around each object (digit) present in the images, and the second refers to assigning a category to that object (digit); each bounding box carries the X and Y coordinates, width, height, and class label. This manual annotation process took considerable time to produce an accurate dataset, which was then exported in YOLOv5 format. The Planeat dataset differs from the Mixture dataset in color variation, digit shape, and environment, and it comprises the same 12 classes: “., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -”.
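For reference, each image exported in YOLOv5 format is paired with a plain-text label file, one line per object; the snippet below parses such a line (the numeric values are invented for illustration).

```python
# YOLOv5 label line: <class_id> <x_center> <y_center> <width> <height>,
# with all coordinates normalized by the image width and height.
line = "4 0.512 0.367 0.081 0.143"  # illustrative annotation of a digit '4'

cls_id, xc, yc, w, h = line.split()
cls_id = int(cls_id)
xc, yc, w, h = map(float, (xc, yc, w, h))

# Convert to pixel-space corner coordinates for an assumed 640x480 image
img_w, img_h = 640, 480
x1 = (xc - w / 2) * img_w
y1 = (yc - h / 2) * img_h
x2 = (xc + w / 2) * img_w
y2 = (yc + h / 2) * img_h
print(cls_id, round(x1), round(y1), round(x2), round(y2))
```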

One of the key design choices was ensuring that the dataset reflects real-world scenarios. The images are manually captured from weight scales, representing a real-time application scenario. The dataset has several characteristics specifically chosen to emphasize the domain gap: the practical images vary significantly in quality, lighting, and background. This is important for testing various DA methods because it provides a challenging yet realistic scenario for digit recognition in domain adaptation, in contrast to many datasets composed of images captured in controlled or laboratory settings. The Planeat dataset was split into two versions as follows: 1) Target Set (TS): 100 labeled images for testing purposes; 2) Planeat UDA training Set (UDA-TS): the remaining 131 unlabeled images of the Planeat dataset, used to train the UDA detector. The usage of the different variants is described in Sect. 4. The statistical distribution of the Planeat dataset is given in Table 1, and samples are shown in Fig. 2.

Table 4 Performance comparison for CPK experiments: training on KITTI (K), testing on unseen Cityscape (C) images, and mixing 5 and 10 random Cityscape (target) images into the KITTI (source) dataset. The detectors are tested on the test set of the Cityscape dataset; KITTI and Cityscape are the source and target datasets, respectively. The output for a mixup of 5 and 10 random images is higher than the source-only KtoC case for car detection. The mixup of only 5 random images (second row) in the proposed CPK outperforms the source KtoC, which shows the CPK approach's efficiency
Table 5 Performance comparison for CPK experiments: training on Sim10k (SM), testing on unseen Cityscape (C) images, and randomly mixing 5 and 10 C dataset images into the Sim10k (SM) dataset. The detectors are tested on the test set of the C dataset. The output for a mixup of 5 and 10 random images outperformed the source-only (SMtoC) case for car detection. The mixup of only 5 random images (second row) in the proposed CPK outperforms the source SMtoC, which shows the CPK approach's efficiency
Table 6 Performance comparison of Unsupervised Domain Adaptation (UDA-digit (Mekhalfi et al. 2023)) experiments for MtoP, KtoC, and SMtoC. The Mixture (M), KITTI (K), and Sim10k (SM) are source datasets, whereas Planeat (P) and Cityscape (C) are target datasets. All three columns are the output of the UDA-digit framework for the three dataset pairs
Table 7 Performance comparison between Unsupervised Domain Adaptation (UDA-digit (Mekhalfi et al. 2023)) and the CPK strategy for MtoP, KtoC, and SMtoC. The Mixture (M), KITTI (K), and Sim10k (SM) are source datasets, whereas Planeat and Cityscape are target datasets. In CPK, the fusion of 5 unique and 10 random images in MtoP outperforms MtoP-UDA. The random mixup of 10 images in SMtoC+10images and KtoC+10images achieves higher mAP than the respective UDA methods

4.1.3 Cityscape dataset

The Cityscape (C) dataset (Cordts et al. 2016) is a well-known benchmark for CV tasks such as OD and segmentation; it contains high-resolution images of urban environments with detailed annotations for various tasks. The images were captured from video recorded in 50 European cities. It consists of 5000 images with 30 classes, grouped into 8 categories of city life, such as roads, persons, cars, pedestrians, sidewalks, and traffic signs.

Fig. 5: The random mixup of MtoP+2 and MtoP+10 target images in the source data outperforms MtoP-UDA and MtoP. This shows the effectiveness of the CPK approach for a new domain dataset when only a small target dataset is available. MtoTS refers to testing a Mixture-trained detector on the Planeat dataset

4.1.4 Sim10k dataset

For DA and simulation-to-real transfer learning in autonomous driving, the Sim10k (SM) dataset (Johnson-Roberson et al. 2016) is an important resource for the CV and OD fields. Sim10k comprises 10,000 synthetic images taken from the video game Grand Theft Auto V (GTAV). It is a well-known dataset used to tackle DS issues in DA, with annotations for different urban scenarios that help address the complexities of real-world city driving.

4.1.5 KITTI dataset

The Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) (K) dataset (Geiger et al. 2012) is a benchmark for driverless cars and mobile robotics. It contains recorded traffic videos using grayscale stereo and high-resolution RGB cameras and a 3D laser scanner. Despite its widespread use, the dataset does not provide ground truth for segmentation, so many researchers have manually annotated portions of it to suit their needs. Alvarez et al. (2012) constructed ground truth for 323 images from the road detection challenge using three classes (road, vertical, and sky). Similarly, Zhang et al. (2015) annotated 252 images with 10 categories for the tracking challenge, and Ros et al. (2015) annotated 116 images with 11 classes for the visual odometry challenge. However, this study only used the car class for its experiments and used 7481 RGB images for the DA task.

For UDA experiments with the Sim10k, Cityscape, and KITTI datasets, the proposed work considers only the 'car' class, uses the existing annotation files for these datasets (Mekhalfi et al. 2023), and follows the settings given in that paper (Mekhalfi et al. 2023).

4.2 Implementation details

In this study, the experimental protocols for both the UDA-digit and the proposed CPK methodologies were carried out using the PyTorch deep learning library on hardware featuring a Quadro RTX 5000 GPU and 16 GB of RAM. The UDA-digit used the YOLOv5 model with CSP-Darknet53 as the backbone, as it enhances the flow of information and the representation of features for State-of-the-Art (SOTA) OD performance while managing computational efficiency, making the model well-suited for real-time scenarios. Further, existing techniques also currently use YOLOv5 with CSP-Darknet53 as the backbone (Mattolin et al. 2023; Mekhalfi et al. 2023), so using YOLOv5 makes it possible to fairly compare the presented UDA-digit approach with other SOTA algorithms. Among the YOLOv5 variants, the proposed work selected the small one (YOLOv5s) because it is the fastest and consumes the least GPU memory. We performed various experiments in this study; for the CPK strategy cases, we automatically selected the unique and random images from the target domains and fused them into the training sets of the corresponding datasets. The details are given in the next section.

4.2.1 Cross-pollination of knowledge (CPK) strategy experiments

Mixture to Planeat (MtoP) Dataset: In this section, the proposed work uses the Mixture dataset for training and the Planeat Target Set (TS) for testing. In the first experiment (MtoP), YOLOv5 was trained on the Mixture dataset and its weights were used to test the unseen TS images. In the second (P-10images), the YOLOv5 detector was trained with only 10 Planeat dataset images and tested on the TS, showing poor performance. In the third (MtoP+5images-Unique), we automatically selected and mixed 5 unique images from the Planeat dataset into the training set of the Mixture dataset (these five images contain all the classes of the Planeat dataset), trained YOLOv5 on the mixed set, and used its weights to test the unseen TS images. This experiment showed improved results over the first two. Table 2 reports the results of all CPK experiments for MtoP. In the fourth (MtoP+10images), fifth (MtoP+15images), and sixth (MtoP+20images) experiments, we randomly selected and mixed 10, 15, and 20 Planeat images, respectively, into the training set of M, trained YOLOv5, and tested it on the unseen TS images; Table 2 gives the results of these three experiments. In the seventh, the proposed work also trained YOLOv5 using the Mixture dataset fused with 1, 2, 3, and 4 randomly chosen Planeat images. The test results of this experiment and its comparison to MtoP-UDA are given in Table 3. In the eighth experiment, we tested the effect of mixing a single random image: first, 1 random image is mixed into the Mixture dataset and the model is tested on the TS; the experiment is then repeated with another random image, and so on, ten times in total. The results of these 10 experiments are given in Table 8. This study used 100 epochs, an image size of 640, a batch size of 32, a learning rate of 0.01, the stochastic gradient descent (SGD) optimizer, and COCO pre-trained weights for YOLOv5 training in all CPK-based experiments.
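The hyperparameter settings above translate into a YOLOv5 run along the following lines; this assumes a checkout of the ultralytics/yolov5 repository (whose train.py and val.py expose run() entry points) and illustrative dataset configuration filenames. The 0.01 learning rate and SGD optimizer correspond to YOLOv5's default hyperparameter file.

```python
import train  # train.py from the ultralytics/yolov5 repository root
import val    # val.py from the same repository

# Train on the CPK-fused Mixture dataset, starting from COCO
# pre-trained YOLOv5s weights (the data YAML names are illustrative)
train.run(data='mixture_cpk.yaml', weights='yolov5s.pt',
          imgsz=640, batch_size=32, epochs=100)

# Evaluate on the unseen Planeat Target Set (TS)
val.run(data='planeat_ts.yaml',
        weights='runs/train/exp/weights/best.pt', imgsz=640)
```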

KITTI to Cityscape (KtoC): In this section, the proposed work employed the supervised versions of the KITTI and Cityscape datasets, trained the detector on the KITTI dataset, and tested it on the unseen images of the Cityscape dataset (KtoC). We also mixed 5 (KtoC+5images) and 10 (KtoC+10images) randomly selected Cityscape images with their annotations into the training set of the KITTI dataset, trained the YOLOv5 detector, and tested it on the unseen Cityscape images. The results of KtoC can be seen in Table 4.

Fig. 6: CPK performance comparison for KtoC and KtoC+5images. KtoC+5images recognizes the car more efficiently than KtoC and KtoC-UDA. KITTI and Cityscape are the source and target datasets, respectively

Sim10k to Cityscape (SMtoC): The proposed work used the supervised form of the Sim10k and Cityscape datasets and repeated the same CPK experiments as for KtoC, adding 5 and 10 random annotated images. The YOLOv5-based test results are given in Table 5; both the KtoC and SMtoC experiments used 100 epochs.

4.2.2 Unsupervised domain adaptation experiments

In this section, all unsupervised domain adaptation (UDA) experiments are conducted using Mixture, KITTI, and Sim10k as source datasets and Planeat and Cityscape as target datasets. For MtoP UDA, this work used the labeled Mixture training set, the unlabeled Planeat UDA training Set (UDA-TS), and the Planeat Target Set (TS). The UDA-digit detector is first trained on the Mixture training set to produce pre-trained weights. These weights help the detector extract pseudo labels from UDA-TS, apply different augmentation methods over the highly confident cropped regions, and start the UDA training for DA between the Mixture and Planeat datasets; the UDA-digit weights are then tested on the Planeat TS, showing impressive performance, as given in Table 6. Similarly, this work adopted the same procedure for KtoC and SMtoC for feature alignment between the source and target datasets. The DA test results for KtoC and SMtoC are also given in Table 6. The comparison of UDA-digit and the CPK approach for digit recognition is shown in Table 7. For these experiments, this study used a batch size of 2, with one source and one target image, during all unsupervised DA experiments.

5 Results and discussion

This study evaluated the UDA-digit and CPK approaches on the Planeat and Cityscape target datasets in numerous ways. Overall, the CPK strategy showed efficient results on the test set for OD in terms of evaluation metrics like mean Average Precision (mAP), Recall (R), and Precision (P).
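For reference, these metrics follow the standard detection definitions (not specific to this paper): with TP, FP, and FN counted at a given IoU threshold,

\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_0^1 p(r)\,dr, \qquad
mAP = \frac{1}{L}\sum_{c=1}^{L} AP_c
\]

where \(p(r)\) is the precision-recall curve of one class; mAP0.5 uses an IoU threshold of 0.5, and mAP0.5:0.95 averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.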

In Table 2, the first row (MtoP) shows the CPK baseline testing results, with 83.2% Precision and 82.5% mAP0.5 when tested on the unseen Planeat TS. In the second row (P-10images), the detector showed poor performance when using only 10 Planeat dataset images for training and testing on the TS images, in contrast to MtoP. However, merely mixing 5 unique Planeat images (MtoP+5images-Unique) into the Mixture dataset yields better results than P-10images and outperforms MtoP (1st row) by margins of 10.1% Precision and 7.1% mAP0.5. Table 2 also makes clear that the mixup of 5, 10, 15, and 20 random Planeat images (4th-6th rows) into the Mixture dataset enhanced the results; among these, the mixup of 10 random images (MtoP+10images) outperformed the rest in Precision, Recall, mAP0.5, and mAP0.5:0.9, with values of 93.0%, 88.0%, 92.1%, and 69.8%, respectively, except for MtoP+5images-Unique. The performance comparison of MtoP and MtoP+5images-Unique on TS images is shown in Fig. 3, where MtoP+5images-Unique detects the digits accurately.

Further, the comparison of MtoP-UDA and MtoP with the mixup of 1, 2, 3, and 4 random images is given in Table 3. Table 3 shows a change of pattern in the results with the mixup of only 2 random images (MtoP+2images) from the Planeat dataset, which achieved better mAP0.5 and mAP0.5:0.95 (88.4% and 61.2%, respectively) than MtoP-UDA. Similarly, the mixup of 3 and 4 random images (MtoP+3images and MtoP+4images) showed enhanced results of 84.3% Recall and 90.6% Precision, respectively, compared to MtoP-UDA.

Similarly, Table 4 and Fig. 6 make clear that merely mixing in 5 random images (KtoC+5images) and testing on the unseen Cityscape images improves the metrics on these two benchmark datasets for car detection. It showed 77.3% precision, 45.4% recall, 50.6% mAP0.5, and 29.9% mAP0.5:0.95 compared to KtoC, and KtoC+5images efficiently detected the car object.

For SMtoC, in Table 5 and Fig. 7, the testing results of the Sim10k-trained detector on the unseen test set of the Cityscape dataset were 73% P, 44.5% R, 51.4% mAP0.5, and 30.4% mAP0.5:0.95. For these two benchmark datasets, the mixup of 5 random images (SMtoC+5images) raised the precision (73% to 77.9%), recall (44.5% to 46.0%), mAP0.5 (51.4% to 54.2%), and mAP0.5:0.95 (30.4% to 33.2%) for car detection compared to SMtoC; SMtoC+5images thus improved the output for car detection.

The UDA-digit results for MtoP, KtoC, and SMtoC are given in Table 6, and the comparison of UDA-digit with the CPK strategy of random images is given in Table 7. MtoP-UDA (1st row) achieved 84.3% precision and 82.7% recall, while the mixup MtoP+5images-Unique (images which contain all 12 classes, 2nd row) outperformed it with 93.3% precision and 90.1% recall, enhancing mAP0.5 from 85.4% to 89.6% and mAP0.5:0.95 from 59.3% to 64.6% compared to MtoP-UDA; the MtoP-UDA detector output for digit detection is given in Fig. 4. The random mixup of 10 Planeat images (MtoP+10images, 3rd row) also improved on MtoP-UDA in precision (84.3% to 92.3%), recall (82.7% to 89%), mAP0.5 (85.4% to 91.4%), and mAP0.5:0.95 (59.3% to 69.8%).

For KtoC+10images (7th row), the mixup of 10 random Cityscape images outperformed KtoC-UDA by margins of 5%, 5.1%, 3.1%, and 5.2% for precision, recall, mAP0.5, and mAP0.5:0.95, respectively. SMtoC+10images (5th row) only achieved improved results for precision (77.9%) and mAP0.5:0.95 (35%) in comparison to SMtoC-UDA, whereas SMtoC-UDA showed higher Recall (54%) and mAP0.5 (58.3%) than SMtoC+10images.

The various experiments with 1 random image also show similar results in Table 8; the highest results among these experiments are 88.4% precision for MtoP+1d, 85.1% recall for MtoP+1g, 85.4% mAP0.5 for MtoP+1a, and 60.7% mAP0.5:0.95 for the MtoP+1h and MtoP+1i random images. The proposed work also obtained Standard Deviation (SD) values of 2.8, 2.08, 1.33, and 0.69 to analyze the variation in the distribution across these experiments (a-j).

The comparison of the CPK strategy's experiments with the existing UDA results in the literature for SMtoC and KtoC car detection is given in Table 9, where KtoC+10images outperformed the existing methods for car detection. Table 9 contains the numeric results of all the relevant UDA OD methods. The last two rows of this table correspond to two recent studies (Mekhalfi et al. 2023; Mattolin et al. 2023) that report UDA results for car detection and compare them with the rest of the studies in the table. We compared our CPK approach against the car detection results for the Sim10k and Cityscape datasets reported in Mekhalfi et al. (2023).

5.1 Discussion

DL plays a crucial role in modern OD tasks in computer vision. DL works efficiently when a large, diverse, annotated, and challenging dataset is available for OD. However, for a new domain where data samples are unavailable, constructing a new dataset is difficult, as it is time-consuming, labor-intensive, and expensive. DA plays an important role in avoiding these construction issues. DA is a subcategory of transfer learning and helps when the data samples of new domains are scarce. DA needs source and target datasets and aligns the features between them to reduce the domain shift problem.

Fig. 7: CPK performance comparison for SMtoC and SMtoC+5images. The random mixup of 5 images (SMtoC+5images) recognizes the car object more efficiently than SMtoC and SMtoC-UDA. Sim10k and Cityscape are the source and target datasets, respectively

This study is inspired by concepts from the botany domain and uses them with AI to enhance detection accuracy in real-time systems. This paper investigated the digit recognition problem in a supervised (CPK) and unsupervised (UDA) manner. To automate the real-time industrial scenario of a food packaging company without generating a more extensive dataset, this work introduced the newly constructed Planeat (P) dataset, which consists of 231 real-time industrial images with valid annotations. This study also used a freely available dataset named the Mixture (M) dataset, composed of a mixture of 2261 images of digits.

The CPK strategy experiments showed impressive outcomes. In Table 2, the testing of MtoP (1st row) showed 82.5% mAP0.5 on the unseen images of the Planeat dataset. The training and testing of the Planeat dataset with its 10 images (P-10images) yielded poor results because of the few samples, which motivated us to mix 5, 10, 15, and 20 random images into the Mixture dataset. The mixup of 10 random images (MtoP+10images) provided better precision, recall, mAP0.5, and mAP0.5:0.95 than the other cases in Table 2, except MtoP+5images-Unique. It proved more effective because it exposes the detector during training to more of the otherwise unseen feature distribution of the Planeat dataset. Overall, among the mixups of 5, 10, 15, and 20 images, MtoP+10images achieved the highest evaluation metrics in Table 2, such as 92.1% mAP0.5. The primary objective of CPK lies in strategically integrating a limited subset of target dataset images into the source dataset, aiming to surpass the performance metrics achieved by self-supervised DA (UDA) and supervised DA (SDA) models. Consequently, using the CPK methodology, the mixup of 10 images yielded superior results compared to prevailing UDA and SDA models, and we decided not to add more images beyond this point.

Furthermore, with the mixup of only 5 unique images into the Mixture dataset (MtoP+5images-Unique), the detector achieves the highest precision (93.3%) and recall (90.1%) of all scenarios. It is clear from Fig. 3 that MtoP+5images-Unique detected the digits more efficiently than MtoP and handled the wrong detection of the "-" sign. This also shows that mixing in a few unique images containing all object classes significantly enhances the OD results.

Table 8 Performance comparison of MtoP with the mixup of 1 random image. The same experiment was repeated 10 times with different random images; the letters (a-j) denote experiments (1-10). The Mixture (M) and Planeat (P) are the source and target datasets. This table shows the Mean and Standard Deviation (SD) of all evaluation metrics over the a-j cases of mixing 1 random image; mAP0.5:0.95 achieved the lowest SD

Similarly, the testing of KtoC demonstrated 45.4% mAP0.5; KtoC+5images improved this mAP0.5 by up to 5.2% in Table 4, and KtoC+5images identifies the car more efficiently than KtoC in Fig. 6. The mAP0.5 also improved after the mixup of 5 images in SMtoC, increasing from 51.4% to 54.2% in Table 5. Mixing 5 unique and 10 random images improved OD results on the KITTI, Sim10k, and Cityscape benchmark datasets.

Table 9 Performance comparison of car detection Average Precision (AP) on the KtoC and SMtoC domain adaptation benchmarks with the CPK approach. K: KITTI, SM: Sim10k, C: Cityscape. This table represents the output of existing DA studies for car detection. Our CPK approach outperforms the recent ConfMix and DACA methods in the KtoC car detection scenario. UDA: unsupervised domain adaptation, SSDA: self-supervised domain adaptation, SDA: supervised domain adaptation

In the case of UDA, Table 6 shows that UDA-digit efficiently recognized the digits using the labeled source Mixture dataset and the unlabeled target Planeat dataset with the strategy of detecting and augmenting regions before the unsupervised training. In addition, KtoC-UDA and SMtoC-UDA followed the UDA-digit approach for car and digit object detection and efficiently detected the car object by adapting features from the source to the target dataset. However, in the comparison between the CPK experiments and UDA-digit in Table 7, MtoP+5images-Unique and MtoP+10images (with random images) outperformed MtoP-UDA in mAP0.5 by margins of 4.2% and 6%, respectively. The efficiency of UDA-digit in recognizing different digits on the newly constructed Planeat dataset is shown in Fig. 4. On the other hand, the mixup of only 2 and 10 random target images (MtoP+2images and MtoP+10images) into the source dataset significantly outperforms the MtoP-UDA results in Fig. 5, where the graph describes the efficiency of the CPK strategy.

Furthermore, KtoC-UDA gained a lower mAP0.5 of 53%, and KtoC+10images enhanced the result by up to 3.1% mAP0.5, while KtoC+5images locates the car object more efficiently than KtoC-UDA in Fig. 6. SMtoC with 10 random images also provided higher precision and mAP0.5:0.95 than SMtoC-UDA, whereas SMtoC-UDA gained higher recall and mAP0.5 than SMtoC+10images in Table 7. In Fig. 7, SMtoC+5images accurately detected the car and performed better than SMtoC and SMtoC-UDA. Table 3 shows the effectiveness of the CPK approach, where the mixup of only 2 random images (MtoP+2images) from Planeat into the Mixture dataset yielded higher mAP0.5 (+3%) and mAP0.5:0.95 (+1.9%) than MtoP-UDA (1st row).

We also analyzed the impact of mixing in 1 random image: the MtoP+1a to MtoP+1j experiments with different single random images showed a lowest mAP0.5 of 81.4% and a highest mAP0.5 of 83.5% in Table 8. The standard deviation in Table 8 also illustrates that the MtoP+1image experiments attained an SD of 0.69 for mAP0.5:0.95, indicating low variation across the mixed data distributions. The low SD scores indicate the usability of the random-image mixup strategy for OD, as it shows little variability among the random images.

We compare the results of this study (CPK) with self-supervised DA (UDA) and supervised DA (SDA) methods; UDA methods also require only a small number of target dataset samples for domain adaptation. However, the CPK strategy is simple, time- and resource-efficient, in contrast to the unsupervised strategy, which usually requires substantial computational resources and time during experiments. To prove the efficiency of CPK, this study applied CPK to benchmark datasets for car detection and compared the results with those of recent UDA studies. The CPK strategy tackles the alignment of the source and target domains, reduces the domain gap, and shares knowledge between classes in both domains, making the model effective on the target dataset. With the CPK approach, the fusion of only a few unique and random target domain images into the source domain can enhance DA accuracy on unseen target data. In CPK, we choose unique and random target domain images once, at the input level, and then fuse these images into the source domain images, which can produce accurate results in DA with SOTA models. CPK improves the detector's generalization and diversity by introducing target domain samples into the source domain, helping the detector learn additional patterns that are not part of the source dataset. It makes the detector capable of handling complexity and variation, showing robust output across various scenarios and dataset splits.

6 Conclusions

This research introduces a novel Cross-Pollination of Knowledge (CPK) strategy for domain adaptation in object detection, aimed at overcoming the challenges posed by domain shifts in real-time industrial applications. By merging unique and random target domain samples into the source dataset during training, the CPK approach significantly enhances the model's generalization capabilities. Experimental results validate the effectiveness of this strategy, showing substantial improvements in detection accuracy across various challenging scenarios. Specifically, the CPK method achieved a 10.9% improvement in mAP, outperforming both self-supervised and supervised domain adaptation techniques. The successful implementation of the CPK method underscores its potential to advance the field of domain adaptation in object detection, providing a robust solution for industrial automation and other applications where domain shifts are prevalent. In the future, we will extend our strategy to cross-domain detection using a multi-model process, including other input modalities such as LiDAR and 3D imagery, for different domains such as medical images, anomaly detection, and geospatial object detection.