Introduction

The rapid advancement of computer technology has transformed information accessibility, producing an exponential surge of digital content. Textual information plays a pivotal role in comprehending visual data, which makes text detection in images a critical problem in computer vision with a wide array of applications, spanning text recognition, license plate identification, autonomous driving, digital archiving, and intelligent healthcare. Recent years have witnessed significant progress driven by the rapid evolution of Convolutional Neural Networks (CNNs) [5, 12, 43]. The precise and efficient extraction of textual information from medical images is particularly significant and carries profound practical implications. Text detection, which targets the localization of text instances within images, is an indispensable component of text reading. Despite substantial progress in recent years, it remains challenging due to the varied scales, irregular shapes, and extreme aspect ratios of text instances.

The field of text detection has witnessed notable progress due to recent advancements in CNN-based object detection [7, 9, 11] and segmentation techniques [8, 58]. These advancements have significantly enhanced the detection of scene text [27, 28, 32, 41, 45, 46] in general applications. However, specific text environments present unique challenges that require further attention. For instance, images with dense text distribution, diverse content, and irregular shapes often make it difficult for existing algorithms to accurately locate and segment all text instances. Medical images encompass a variety of healthcare records, including outpatient invoices, hospital invoices, drug tax receipts, and discharge summaries. Detecting text within them poses distinct challenges that require tailored approaches: skewed text caused by non-flat surfaces during scanning or capture, dense text distribution, incomplete boundary detection due to the presence of different languages and symbols, and irrelevant noise introduced by external environmental factors must all be addressed.

To address these challenges, there has been extensive research into the joint learning of visual and textual representations within numerous Vision-Language Pre-training (VLP) techniques. This development significantly advances various Vision-Language (VL) tasks, including Visual Question Answering (VQA) and Image–Text Retrieval. Given its linguistic nature, Optical Character Recognition (OCR) inherently stands to benefit from these VLP techniques [2, 18, 37]. The OCR Contrastive Language-Image Pre-training (oCLIP) [49] technique leverages textual information to learn effective visual text representations, aiming to enhance scene text detection and spotting.

While conducting this study, it is pertinent to note that our previous work, "Multi-level Feature Enhancement Method For Medical Text Detection", has been accepted but is not yet formally published. The work presented in this paper extends and further explores the methodologies introduced in that forthcoming paper. Specifically, we extend the earlier research from two perspectives. First, we enhance the FPN by introducing residual structures and channel attention mechanisms, further augmenting the representational capability of the feature maps. Second, we conduct a more comprehensive experimental analysis of the proposed text detection network, evaluating the impact of the number of cascaded EFEM modules on the detection results.

In summary, our contributions are threefold:

  • We have introduced two highly efficient modules, the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module with RSEConv (MSFM-RSE), based on residual structures and spatial channel attention mechanisms. These efficient modules significantly bolster the network’s feature representation.

  • By leveraging vision-language pre-training models obtained through large-scale visual language understanding tasks, our system enhances the representation capabilities, leading to improved accuracy and robustness in text detection.

  • Using the proposed method, we achieved competitive results in terms of efficiency, accuracy, F-score, and robustness across five publicly available scene text detection datasets. Particularly, our method demonstrated superior performance on medical image text datasets, outperforming others in this domain.

The rest of the paper is organized as follows: Sect. 2 reviews the relevant text detection methods; Sect. 3 describes our proposed approach; Sect. 4 discusses and analyzes the experiments; and Sect. 5 summarizes the conclusions.

Fig. 1 Overall Architecture of our method. The 1/4, 1/8, 1/16, and 1/32 indicate the scale ratio compared to the input image

Related works

Text detection

In recent years, scene text detection has witnessed significant advancements due to the rapid progress in deep learning techniques. The field mainly explores two prominent approaches: segmentation-based methods and regression-based methods, extensively researched for their capacity to represent text instances with diverse shapes.

Regression-based text detectors

Regression-based methods for text detection are often influenced by general object detection frameworks such as Faster R-CNN [38] and SSD [29]. TextBoxes [22] directly adjusts the anchor scales and convolution kernel shapes of SSD to manage text with extreme aspect ratios. TextBoxes++ [23] refines this further by regressing quadrangles instead of horizontal bounding boxes to detect multi-oriented text. EAST [59] and Deep-Reg [16] are anchor-free methods that employ pixel-level regression to detect multi-oriented text instances. RRPN [36] builds on Faster R-CNN and introduces rotation proposals in the RPN stage to detect arbitrarily oriented text. RRD [25] extracts feature maps for text classification and regression using two separate branches to enhance long text detection. These methods have shown promising results in various text detection tasks. However, accurately representing irregular shapes, such as curved text, with precise bounding boxes remains a challenge for regression-based methods.

Segmentation-based text detectors

Segmentation-based methods draw significant inspiration from fully convolutional networks (FCN) [31]. Zhang et al. [57] were the first to employ FCN for extracting text blocks and detecting character candidates through MSER. Mask TextSpotter [34] detects arbitrary-shaped text instances following an instance segmentation approach derived from Mask R-CNN [13]. PSENet [45] introduced a progressive scale expansion technique by segmenting text instances using kernels of varying scales. This approach allows for better handling of text instances with varying sizes and aspect ratios. Tian et al. [42] proposed pixel embedding, which groups pixels based on the segmentation results, enabling more accurate text localization. Post-processing algorithms, such as geometric verification and text-line construction, are often applied to refine the segmentation results. While these methods achieve high accuracy, they may suffer from reduced inference speed due to the complexity of the post-processing steps.

Vision-language pre-training

In recent years, the integration of vision-language pre-training models [49] has shown great promise in various computer vision tasks, including text detection. Vision-language pre-training involves training models on large-scale visual language understanding tasks, enabling them to acquire rich visual and semantic representations. ViLBERT [33] and LXMERT [40] introduce a two-stream framework incorporating a vision-language co-attention module for cross-modal feature fusion. Conversely, VisualBERT [21], Unicoder-VL [20], and VL-BERT [39] adopt a single-stream framework. These pre-training models capture the contextual information and semantic understanding necessary for accurate text detection. By leveraging the power of vision-language pre-training, text detection systems can benefit from enhanced representation capabilities, leading to improved accuracy and robustness.

Our proposed approach exhibits superior performance in both speed and the detection of medical text and irregular shapes compared to existing rapid scene text detectors. It integrates an efficient segmentation module that merges the strengths of regression- and segmentation-based methods, and it incorporates a learnable post-processing technique that unites the advantages of traditional post-processing with the efficiency of deep learning. Additionally, through the utilization of vision-language pre-training models, our approach attains heightened accuracy and robustness in text detection. To assess our approach, we conducted comprehensive experiments on diverse benchmark datasets, including challenging medical text datasets and datasets featuring arbitrary-shaped text. The results show that our method outperforms existing state-of-the-art text detectors in accuracy, speed, and robustness. The proposed approach displays significant potential for practical applications in medical image analysis, document processing, and intelligent information retrieval.

Methodology

Overall network architecture

The architecture of our proposed model, depicted in Fig. 1, integrates a vision-language pre-training model alongside four primary components: a feature extraction backbone, a feature enhancement module, a multi-scale feature fusion module, and a post-processing procedure. In the initial stage, the input image is processed by the vision-language pre-training backbone, previously trained on large-scale visual language understanding tasks. We first pre-train the vision-language model on SynthText with full annotations and then transfer the backbone weights to fine-tune our proposed model on real datasets. This backbone acquires rich visual and semantic representations, serving as a robust basis for the subsequent stages (refer to Fig. 1a, b). The resulting features are then fed into an FPN-RSEConv structure to acquire multi-level features while simultaneously adjusting the channel number of the feature maps (refer to Fig. 1c, d). Next, the pyramid features are up-sampled to a uniform scale and fed into the Efficient Feature Enhancement Module (EFEM). The EFEM, structured in a cascaded manner, offers the benefit of low computational cost; positioned behind the backbone network, it seamlessly integrates to augment and enhance the expressive capability of features across various scales (as illustrated in Fig. 1e, f). Following this, the MSFM-RSE efficiently fuses the features produced by the EFEMs at different depths into a holistic final feature representation for segmentation (depicted in Fig. 1g). The extracted feature F is then employed to predict both the probability map (P) and the threshold map (T), and the approximate binary map (\({{\hat{B}}}\)) is computed from the probability map and feature F (as depicted in Fig. 1h–k). Throughout the training phase, supervision is applied to the probability map, the threshold map, and the approximate binary map; the probability map and the approximate binary map receive identical supervision. During inference, the bounding boxes can be obtained from either the approximate binary map or the probability map through a dedicated box formulation module.
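For concreteness, the following PyTorch-style sketch illustrates the data flow described above. The class and argument names are illustrative placeholders rather than our exact implementation, and the internals of each component are deferred to the following subsections.

```python
import torch.nn as nn


class TextDetector(nn.Module):
    """Illustrative forward pass of the proposed architecture (a sketch, not the exact implementation)."""

    def __init__(self, backbone, fpn_rseconv, efem_blocks, msfm_rse,
                 prob_head, thresh_head, binarize_head):
        super().__init__()
        self.backbone = backbone              # vision-language pre-trained backbone (weights from SynthText pre-training)
        self.fpn = fpn_rseconv                # FPN-RSEConv: multi-level features at 1/4, 1/8, 1/16, 1/32 scales
        self.efems = nn.ModuleList(efem_blocks)  # n cascaded EFEMs (n = 2 by default)
        self.msfm = msfm_rse                  # fuses the EFEM outputs into one feature F
        self.prob_head = prob_head            # predicts probability map P
        self.thresh_head = thresh_head        # predicts threshold map T (training only)
        self.binarize_head = binarize_head    # produces approximate binary map B_hat from P and F

    def forward(self, image):
        c2, c3, c4, c5 = self.backbone(image)      # features at 1/4, 1/8, 1/16, 1/32 of the input size
        feats = self.fpn([c2, c3, c4, c5])         # unify channels and build the pyramid
        for efem in self.efems:                    # cascaded feature enhancement
            feats = efem(feats)
        F = self.msfm(feats)                       # fused multi-scale feature
        P = self.prob_head(F)
        if self.training:
            T = self.thresh_head(F)
            B_hat = self.binarize_head(P, F)
            return P, T, B_hat                     # all three maps are supervised
        return P                                   # at inference, boxes are formed from P alone
```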

Efficient feature enhancement module

The EFEM, illustrated as a U-shaped module (see Fig. 2), is instrumental in augmenting features of varied scales within our proposed text detection system. It comprises three crucial phases working in unison to achieve optimal performance.

In the up-scale enhancement phase, the feature maps at strides of 32, 16, 8, and 4 undergo iterative refinement. This iterative process gradually enriches each scale's features with more intricate information, enabling the capture of fine-grained details and thereby enhancing their discriminative capability.

A fundamental component within the EFEM is the Pyramid Squeeze Attention (PSA) module [54], functioning as an efficient attention block. This module adeptly extracts multi-scale spatial information, enabling a nuanced understanding of the intricate relationships between various feature map regions. By establishing extensive channel dependencies, it captures both global context and local details, significantly boosting the features’ discriminative strength and overall text detection accuracy.

In the down-scale phase, the EFEM leverages the feature pyramid generated during up-scale enhancement. Starting from a 4-stride, this iterative process continues until it reaches a 32-stride. This down-scale enhancement further refines the features by amalgamating multi-level details from lower to higher scales. By amalgamating low-level specifics with high-level semantic data, the EFEM ensures that the final feature representation is comprehensive, informative, and proficient in capturing diverse text instance characteristics.

Compared to existing methods, the EFEM offers two key advantages. First, its ability to be cascaded multiple times for feature fusion and expanded receptive fields allows for effective handling of text instances of various sizes and aspect ratios, enhancing adaptability to diverse text scenarios. Second, its computational efficiency makes it suitable for real-world applications with limited resources. This efficiency ensures real-time functionality, facilitating practical deployment in various scenarios.

Fig. 2 The details of EFEM. “+”, “2\(\times \)”, “DWConv”, “Conv”, and “BN” represent element-wise addition, 2\(\times \) linear upsampling, depthwise convolution, regular convolution, and Batch Normalization, respectively
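As a rough illustration of the three phases above, the sketch below shows one possible realization of a single EFEM, assuming the pyramid features at strides 4–32 share a common channel count. The composition of the DWConv/Conv/BN units, the use of max pooling for downsampling, and the placement of the PSA attention (here a plain placeholder) follow Fig. 2 only loosely and should be read as assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


def sep_conv_bn(channels):
    """DWConv -> 1x1 Conv -> BN (the basic unit of Fig. 2), followed by ReLU."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),  # depthwise conv
        nn.Conv2d(channels, channels, 1, bias=False),                              # pointwise conv
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )


class EFEM(nn.Module):
    """A minimal sketch of the U-shaped EFEM: up-scale enhancement (stride 32 -> 4)
    followed by down-scale enhancement (stride 4 -> 32). `attention` stands in for
    the PSA block [54]; nn.Identity() is used as a placeholder here."""

    def __init__(self, channels, attention=None):
        super().__init__()
        self.up_units = nn.ModuleList(sep_conv_bn(channels) for _ in range(3))
        self.down_units = nn.ModuleList(sep_conv_bn(channels) for _ in range(3))
        self.attn = attention or nn.Identity()

    def forward(self, feats):
        f4, f8, f16, f32 = feats  # features at strides 4, 8, 16, 32
        # up-scale enhancement: propagate coarse semantics toward finer levels
        u32 = f32
        u16 = self.up_units[0](f16 + F.interpolate(u32, scale_factor=2, mode="bilinear", align_corners=False))
        u8 = self.up_units[1](f8 + F.interpolate(u16, scale_factor=2, mode="bilinear", align_corners=False))
        u4 = self.up_units[2](f4 + F.interpolate(u8, scale_factor=2, mode="bilinear", align_corners=False))
        # down-scale enhancement: fold fine details back into coarser levels
        d4 = self.attn(u4)
        d8 = self.down_units[0](u8 + F.max_pool2d(d4, kernel_size=2))
        d16 = self.down_units[1](u16 + F.max_pool2d(d8, kernel_size=2))
        d32 = self.down_units[2](u32 + F.max_pool2d(d16, kernel_size=2))
        return d4, d8, d16, d32
```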

Multi-scale feature fusion with RSEConv

The Multi-Scale Feature Fusion Module with RSEConv (MSFM-RSE) serves as a pivotal module in our proposed text detection system. It is designed to amalgamate features extracted from diverse scales. By employing RSEConv, a specialized convolutional layer incorporating channel attention mechanisms and a residual architecture, this module adeptly harmonizes the high-level, multi-scale feature representations derived from the Efficient Feature Enhancement Module (EFEM). The MSFM-RSE efficiently merges and refines these features, creating a comprehensive multi-scale feature representation ideal for text detection. This significantly augments the system's capability to identify diverse text characteristics across different complexities and scales. Figure 3 illustrates the schematic diagram of RSEConv.

Fig. 3 Illustration of the RSEConv module

Through the integration of RSEConv layers, the MSFM-RSE skillfully merges high-level features produced by the EFEM, ensuring a cohesive and informative feature set that substantially strengthens the system’s capability to discern complex text instances across various scales. This fusion, facilitated by the residual structures and channel attention mechanisms, optimizes feature refinement for accurate text detection across different complexities and scales, showcasing the system’s adaptability in addressing multifaceted challenges encountered in text detection across diverse contexts.
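A plausible minimal realization of RSEConv, assuming a squeeze-and-excitation style channel attention with a reduction ratio of 4 wrapped around a 3\(\times \)3 convolution and a residual shortcut, is sketched below; the exact layout is defined by Fig. 3, so the details here are assumptions.

```python
import torch.nn as nn


class RSEConv(nn.Module):
    """A sketch of RSEConv: a 3x3 convolution with squeeze-and-excitation channel
    attention and a residual connection. The reduction ratio and the layer order
    are assumptions; Fig. 3 defines the actual layout."""

    def __init__(self, in_channels, out_channels, reduction=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # squeeze-and-excitation: global pooling -> bottleneck MLP -> per-channel gates
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, 1),
            nn.Sigmoid(),
        )
        # 1x1 projection so the residual path matches when channel counts differ
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1, bias=False))

    def forward(self, x):
        y = self.conv(x)
        y = y * self.se(y)            # channel attention re-weights the feature map
        return y + self.shortcut(x)   # residual connection
```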

Figure 4 illustrates the schematic diagram of this module. Specifically, the process begins with an element-wise addition that merges the feature maps from corresponding scales. This step combines local details and global context, effectively capturing both fine-grained information and overall scene understanding. The resulting output is then processed through RSEConv, which simultaneously adjusts the channel number of the feature maps. Finally, the resulting feature maps are upsampled and concatenated, producing the comprehensive final feature map.

Fig. 4 Illustration of the MSFM-RSE module. “+” is element-wise addition
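Building on the RSEConv sketch above, the following illustrates how the MSFM-RSE could combine the outputs of the cascaded EFEMs according to this description; the channel sizes and the choice of the 1/4-scale output resolution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSFMRSE(nn.Module):
    """A sketch of MSFM-RSE: features from corresponding scales of the cascaded EFEMs
    are summed element-wise, passed through RSEConv (as sketched above), upsampled to
    the 1/4 scale, and concatenated into the final feature map."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # one RSEConv per pyramid level (strides 4, 8, 16, 32)
        self.rse_convs = nn.ModuleList(RSEConv(in_channels, out_channels) for _ in range(4))

    def forward(self, efem_outputs):
        # efem_outputs: list over cascade depth, each element a tuple (f4, f8, f16, f32)
        fused = [sum(level) for level in zip(*efem_outputs)]          # element-wise addition per scale
        fused = [conv(f) for conv, f in zip(self.rse_convs, fused)]   # refine and adjust channels
        target = fused[0].shape[-2:]                                  # 1/4-scale spatial size
        fused = [fused[0]] + [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                              for f in fused[1:]]
        return torch.cat(fused, dim=1)                                # final feature F for the heads
```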

Deformable convolution and label generation

In our approach, we exploit the advantages of deformable convolutions [61] to improve the model’s receptive field, particularly beneficial for text instances with extreme aspect ratios. Following the methodology detailed in [62], we integrate modulated deformable convolutions into all the 3\(\times \)3 convolutional layers across the conv3, conv4, and conv5 stages of the ResNet-50 backbone.

Deformable convolution offers an adaptable and adjustable receptive field, enabling the model to effectively detect text instances of diverse shapes and sizes. By integrating deformable convolutions into the network architecture, the model can dynamically modify its receptive field, focusing on the most pertinent regions within the input image. This adaptable mechanism is particularly beneficial for text detection, as it assists the model in handling the complexities presented by text instances with extreme aspect ratios. The incorporation of modulated deformable convolutions within the ResNet-50 backbone strengthens the model’s capacity to capture intricate details, efficiently representing the intricate structures of text instances. This inclusion further enhances the overall performance and resilience of our text detection system, enabling accurate detection of text instances with varying aspect ratios and challenging visual attributes.
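The replacement of a regular 3\(\times \)3 convolution with a modulated deformable convolution can be sketched as follows using torchvision's DeformConv2d; the offset/mask prediction head and its zero initialization are conventional DCNv2-style choices rather than details reported here.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d


class ModulatedDeformBlock(nn.Module):
    """A sketch of swapping a 3x3 convolution for a modulated deformable convolution,
    as applied to the conv3-conv5 stages of the ResNet-50 backbone. Offsets and
    modulation masks are predicted from the input by a plain 3x3 convolution."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        kh = kw = 3
        # 2 offsets (dx, dy) + 1 modulation scalar per kernel location: 3 * 3 * 3 = 27 channels
        self.offset_mask = nn.Conv2d(in_channels, 3 * kh * kw, 3, stride=stride, padding=1)
        nn.init.zeros_(self.offset_mask.weight)   # start as a regular convolution
        nn.init.zeros_(self.offset_mask.bias)
        self.deform = DeformConv2d(in_channels, out_channels, 3, stride=stride, padding=1)

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :18], om[:, 18:].sigmoid()   # split offsets and modulation mask
        return self.deform(x, offset, mask)
```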

The label generation for the probability map is influenced by PSENet [45]. Typically, post-processing algorithms represent the segmentation results as a polygon defined by a collection of vertices

$$\begin{aligned} G = \{ S_k \}_{k = 1}^{n}. \end{aligned}$$
(1)

Here, n denotes the number of vertices, which varies with the labeling rules of different datasets, and \(S_k\) denotes the segmentation result for each image; G is the set of segments describing the text regions in an image. Mathematically, the shrink offset D is calculated as follows:

$$\begin{aligned} D = \frac{\mathrm{Area}(P) \times (1 - r^{2})}{\mathrm{Perimeter}(P)}. \end{aligned}$$
(2)

In this context, Area(\(\cdot \)) denotes the area of the polygon, Perimeter(\(\cdot \)) denotes its perimeter, and P refers to the original polygonal text region. The shrink ratio r is empirically set to 0.4. Using graphics-related operations, the shrunken polygons are derived from the original ground truth and serve as the fundamental building block of each text region. During inference, either the probability map or the approximate binary map can be used to generate text bounding boxes; they yield nearly identical results. For efficiency, we use the probability map, which allows the threshold branch to be removed. Forming the bounding boxes involves three steps: (1) the probability map (or the approximate binary map) is binarized with a fixed threshold (0.2) to produce the binary map; (2) connected regions, representing the shrunken text regions, are identified from the binary map; (3) the shrunken regions are dilated with an offset \(D'\), calculated as follows:

$$\begin{aligned} D' = \frac{A' \times r'}{L'}, \end{aligned}$$
(3)

where \(A'\) is the area of the shrunk polygon; \(L'\) is the perimeter of the shrunk polygon; \(r'\) is set to 1.5 empirically.
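For illustration, the shrinking of Eq. (2) and the dilation of Eq. (3) can be realized with standard polygon clipping, for example via shapely and pyclipper as sketched below; this follows the usual Vatti-clipping recipe and is not necessarily our exact implementation.

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon


def _to_path(points):
    """Convert an (N, 2) array of vertices to the integer tuple path pyclipper expects."""
    return [(int(round(x)), int(round(y))) for x, y in np.asarray(points, dtype=float)]


def shrink_polygon(points, r=0.4):
    """Shrink a ground-truth text polygon by the offset D of Eq. (2) to build the
    probability-map label; returns a list of shrunken polygons (possibly empty)."""
    poly = Polygon(points)
    d = poly.area * (1 - r ** 2) / poly.length        # Eq. (2): D = Area(P)(1 - r^2) / Perimeter(P)
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(_to_path(points), pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    return [np.array(s) for s in pco.Execute(-d)]     # negative offset shrinks the polygon


def unclip_region(points, r_prime=1.5):
    """Dilate a detected shrunken region by D' of Eq. (3) during box formation."""
    poly = Polygon(points)
    d_prime = poly.area * r_prime / poly.length       # Eq. (3): D' = A' * r' / L'
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(_to_path(points), pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    return [np.array(s) for s in pco.Execute(d_prime)]
```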

Table 1 Detection results with different settings of Deformable Convolution, EFEM, and MSFM

Experimental results

Datasets

In addition to the medical text dataset, our experiments encompass several prevalent public scene text detection datasets, including SynthText, Total-Text, CTW1500, ICDAR 2015, and MSRA-TD500. These datasets serve to evaluate the robustness and generalization abilities of our proposed model. Through assessment across diverse datasets, our goal is to validate the model's performance in various text detection scenarios, showcasing its effectiveness in handling an array of text types and shapes, extending beyond medical text.

MEBI-2000 dataset is derived from the public dataset of Ali Tianchi competition [4]. Specifically created for medical ticket text detection, it includes discharge summaries, prescription invoices, outpatient invoices, and inpatient invoices. Representing scenarios encountered in medical insurance, this dataset includes challenges like mixed text and images, camera shake, skewed images, dense text, and table data. Privacy is maintained through anonymization. MEBI-2000 serves as a practical dataset for medical insurance image analysis.

SynthText dataset [10] is a synthetic dataset comprising over 800k synthetic images. It is exclusively utilized for pre-training our model.

Total-Text dataset [3] comprises 1255 training images and 300 testing images, emphasizing curved text. This dataset covers various text types, including horizontal, multi-oriented, and curved text, and includes word-level annotations, enabling precise evaluation and analysis.

CTW1500 dataset [52] includes 1000 training images and 500 testing images, focusing on curved text. The dataset contains both English and Chinese texts, annotated at the text-line level using polygonal shapes. It serves as a valuable resource for developing and evaluating algorithms designed for curved text detection and recognition tasks.

ICDAR 2015 dataset [19] is a frequently used dataset for text detection, containing 1000 training images and 500 testing images captured using Google Glass. It includes text instances that are often heavily distorted or blurred, annotated by their quadrangle’s four vertices.

MSRA-TD500 dataset [50] includes 300 training images and 200 test images with text-line level annotations. It is a dataset with multi-lingual, arbitrary-oriented and long text lines. Because the training set is rather small, we follow the previous works [32, 35, 60] to include the 400 images from HUST-TR400 [51] as training data.

Fig. 5 Some visualization results on text instances of MEBI

Implementation details

We initially pre-train the vision-language model on SynthText, incorporating full annotations, and then transfer the backbone weights to fine-tune our proposed model using real datasets (MEBI, Total-Text, CTW1500, ICDAR2015, and MSRA-TD500). We proceed with fine-tuning the models for 1200 epochs on the corresponding real-world datasets. During training, our primary data augmentation techniques encompass random rotation, random cropping, and random horizontal and vertical flipping. Additionally, we resize all images to 640 \(\times \) 640 to enhance training efficiency. For all datasets, the training batch size is set to 16, and we adhere to a "poly" learning rate policy to facilitate gradual decay of the learning rate. Initially, the learning rate is set to 0.007, accompanied by an attenuation coefficient of 0.9. Our framework employs stochastic gradient descent (SGD) as the optimization algorithm, with weight decay and momentum values set to 0.0001 and 0.9, respectively.
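For reference, the "poly" policy with the stated initial learning rate of 0.007 and power 0.9 corresponds to the schedule sketched below; whether the decay is applied per epoch or per iteration is an assumption in this sketch.

```python
def poly_lr(initial_lr=0.007, power=0.9, step=0, max_step=1200):
    """Poly learning-rate policy: decay from the initial value as (1 - progress)^power.
    `step`/`max_step` may count epochs or iterations depending on the training loop."""
    return initial_lr * (1 - step / max_step) ** power


# e.g. halfway through training: 0.007 * (1 - 0.5) ** 0.9 ≈ 0.00375
```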

Ablation study

To illustrate the effectiveness of pivotal modules, specifically the deformable convolution, Efficient Feature Enhancement Module, and Multi-Scale Feature Fusion Module with RSEConv, we conducted an ablation study using the MEBI dataset and the ICDAR2015 dataset. Detailed insights into each module’s performance are presented in Table 1. This evaluation allows an assessment of the individual modules’ contributions to the entire text detection system, validating their role in enhancing accuracy and robustness.

Table 2 The influence of the number of cascaded EFEMs

The efficacy of Deformable Convolution is evident from the outcomes displayed in Table 1. For the ICDAR2015 dataset, integrating deformable convolution yields a noteworthy 3.4% increase in the F-measure. Correspondingly, in the case of the MEBI dataset, deformable convolution contributes to an improvement of 1.8%. These results underscore how deformable convolution endows the backbone network with an adaptable receptive field, facilitating the effective capture of spatial relationships and enhancing the performance of the text detection system. Significantly, the incurred additional computational cost is relatively marginal, rendering it a practical and efficient enhancement for the model.

The effectiveness of the EFEM is prominently displayed in Table 1. In the case of the ICDAR2015 dataset, the EFEM substantially boosts the F-measure by 5.7%. Furthermore, when utilized alongside deformable convolution (EFEM+DConv), the F-measure enhancement reaches 7.8%. Similarly, for the MEBI dataset, the EFEM demonstrates a 6.6% increase in the F-measure, while the combined EFEM+DConv indicates an even higher improvement of 9.5%. Although this combination might experience a slight reduction in inference speed, the notable enhancements in the F-measure unequivocally portray the EFEM’s improved accuracy in detecting text instances.

Fig. 6 Some visualization results on text instances of MEBI

Fig. 7 The visualized detection results obtained by our method, as well as DBNet++ and PSE, on selected challenging samples from the MEBI test set

The effectiveness of MSFM-RSE is evident from the outcomes in Table 1. For the ICDAR2015 dataset, the MSFM-RSE contributes to a 2.8% enhancement in the F-measure. When utilized alongside deformable convolution (MSFM-RSE+DConv), this improvement increases to 5.4%. Similarly, in the MEBI dataset, the MSFM-RSE demonstrates its effectiveness by raising the F-measure by 5%. Moreover, the combined MSFM-RSE+DConv approach achieves an even more notable increase of 7.9% in the F-measure. While there might be a slight reduction in inference speed, the remarkable improvements in the F-measure highlight the MSFM-RSE’s superior performance in accurately detecting text instances.

The influence of the number of cascaded EFEMs We analyzed the impact of the number of cascaded EFEMs by varying n from 0 to 4 in Table 2. The F-measures on the test sets consistently increased with n and began to plateau once n reached 2 or higher. However, a larger n could slow down the model, despite the low computational cost of each EFEM. To maintain an optimal balance between performance and speed, n is set to 2 as the default value for subsequent experiments.

Table 3 Detection results on the MEBI dataset
Fig. 8 Some visualization results on text instances of various shapes, including curved text, multi-oriented text, vertical text

Comparisons with previous methods

We assess the performance of our proposed method by conducting a comparative analysis against previous approaches on five established benchmarks. These benchmarks encompass various text scenarios, including medical text, curved text, multi-oriented text, and long text lines in multiple languages. For a comprehensive evaluation, we employ both quantitative and qualitative methods: quantitative results are presented in terms of benchmark scores, while qualitative findings are visualized in Figs. 5, 6, and 7. This analysis showcases the superior performance of our proposed method and its efficacy in handling diverse text scenarios.

Medical text detection The MEBI dataset is intricately designed for medical text, encompassing a diverse array of textual instances with varying scales, irregular shapes, and extreme aspect ratios. In Table 3, we offer a comprehensive comparison of our model against previous methodologies, showcasing its leading-edge performance in terms of accuracy, F-measure, recall, and frames per second (FPS). Surpassing existing approaches, our model further establishes its superiority in dealing with the intricacies of medical text detection. Figures 5 and 6 visually demonstrate our model's performance, featuring visualizations of medical text instances extracted from the MEBI dataset. These visuals exhibit the model's adeptness in accurately detecting and segmenting text within challenging medical images, underscoring its robustness and efficacy.

To verify the robustness and adaptability of our model, we conducted experiments on publicly available scene text datasets. Figure 8 displays qualitative results from these supplementary experiments, depicting our model’s proficiency in addressing various text scenarios beyond the medical domain. These outcomes further emphasize the flexibility and efficacy of our model across diverse text detection tasks.

Table 4 Detection results on the Total-Text dataset
Table 5 Detection results on the CTW1500 dataset
Table 6 Detection results on the ICDAR 2015 dataset

Curved text detection The performance of our approach in scene text detection is outlined in Tables 4 and 5, focusing on two curved text datasets: Total-Text and CTW1500. These results underscore our method's outstanding performance, particularly in terms of the F-measure metric, which reflects the overall accuracy of text detection. In comparison to the DBNet++ method, recognized for its strong performance on these datasets, our method attains similar performance levels while markedly reducing the inference time. Notably, our approach achieves equivalent or even superior F-measure scores while requiring only about 80% of the time taken by DBNet++. This comparison accentuates the efficiency and effectiveness of our proposed approach in addressing curved text instances. Our method not only achieves precise text detection but also exhibits superior efficiency, rendering it a highly competitive solution for curved text detection tasks.

Table 7 Detection results on the MSRA-TD500 dataset

Multi-oriented and multi-language text detection It is essential to highlight the challenges posed by the ICDAR 2015 dataset, characterized by obstacles such as multi-oriented text, small text instances, and low-resolution images. Despite these difficulties, our method displayed remarkable performance, surpassing the established DBNet++ as detailed in Table 6. Demonstrating a substantial 2.6% improvement in F-measure compared to DBNet++, our method confirms its efficiency in addressing the complexities of the dataset. Similarly, Table 7 shows our method's comparable or even superior F-measure when compared to DBNet++ on the MSRA-TD500 dataset. These results underscore the favorable balance our method strikes between detection performance and inference speed, rendering it an attractive option for practical applications. In summary, our method excels in tackling the challenges of the ICDAR 2015 dataset, showcasing enhanced performance in comparison to DBNet++. Furthermore, our method exhibits competitive performance on the MSRA-TD500 dataset, coupled with faster inference speed. These findings affirm the potential of our approach in real-world applications, providing an effective and efficient solution for scene text detection tasks.

Conclusion

In summary, we have introduced an efficient framework for real-time detection of medical and arbitrary-shaped text, capitalizing on the capabilities of a vision-language pre-training model. Our approach integrates the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module with RSEConv (MSFM-RSE) to improve feature extraction without incurring significant computational overhead. Through extensive experiments on multiple datasets, our method has showcased significant improvements in both speed and accuracy compared to prior state-of-the-art text detectors.

Future endeavors could concentrate on further refining the computational efficiency of our framework to enable real-time text detection in resource-constrained environments. Exploring advanced vision-language pre-training models and leveraging more extensive training data may enhance the model’s performance and generalization capabilities. Additionally, expanding the framework to address more challenging text detection scenarios, such as scene text in complex backgrounds or multi-lingual text, presents an intriguing avenue for future research.