1 Introduction

Measuring textual similarity is a challenging problem because it demands a deep semantic understanding of the input sentences. Earlier neural-network models relied on coarse-grained sentence modeling and therefore struggled to capture the fine-grained, word-level information needed for semantic text comparison. The search for efficient methods of semantic text-similarity detection has consequently attracted considerable attention in natural language processing (NLP) [4]. Text-similarity detection plays a significant role in tasks such as text summarization, reducing redundancy among duplicate documents or question pairs on portals, question generation, topic tracking, machine translation, document clustering, and essay scoring. Existing work on text similarity is commonly partitioned into three types of approaches: knowledge-based, corpus-based, and string-based similarity [6]. A neural-network model based on Bi-LSTM and a ConvNet has been presented for measuring semantic text similarity; in this model, a similarity-focused layer and a pair-wise word-interaction model efficiently capture fine-grained semantic information [9].

For the input data in text-similarity analysis, a pre-processing algorithm first chains co-referential entities together and then performs word segmentation so that the meanings of phrasal verbs and idioms are retained. Similarity analysis is also applied to short texts [24]. In one classification study, the algorithm represents a short text as two dense vectors: the first is constructed from word-to-word similarity based on pre-trained word vectors, and the second uses the same word-to-word similarity computed from external knowledge sources [26]. Beyond sentence pairs, question pairs, and document comparison, paraphrase identification has been presented as a further application [10]; that approach analyzes the similarity between a sentence pair at the semantic and lexical levels by combining keyword usage with neural networks [39]. Focusing on semantic text-similarity identification, this study employs a hybrid approach: Weighted Fine-Tuned BERT feature extraction followed by a Siamese Bi-LSTM classification network, evaluated on the Quora dataset. The BERT extraction process extracts features from the sentences in each question pair. These embedded vectors are then trained with a Siamese Bi-LSTM network: each BERT embedding is connected to an input vector that traverses the multi-layer perceptron of the Bi-LSTM model, producing output labels for the words. The trained model's output vectors predict the text similarity of documents or question pairs in terms of similarity scores.

The proposed method offers several advantages: the Bi-LSTM learns better features, stores them in memory, and produces outputs conditioned on the stored inputs; the Siamese architecture makes the learning process effective; and the BERT feature extraction can map a large number of sentences to accurate vectors, whose output weights are then passed to the Siamese input.

The major contributions of the study are listed below:

  • To generate feature vectors from the preprocessed data using the Weighted Fine-Tuned BERT extraction process.

  • To train the embedded feature vectors with a Deep Siamese Bi-LSTM network through a multi-layer perceptron, producing trained vectors with output labels.

  • To predict, from the trained model output, the text similarity between different question pairs of the Quora dataset, thereby determining similarity scores for various sentence pairs.

  • To evaluate the similarity predictions with respect to accuracy, recall, precision, and F1-score, and to validate the approach on comparisons between different documents for text similarity.

1.1 Paper-organization

Section I introduces the basic concepts of text similarity and how it is determined with various NLP approaches to reduce duplication and related context issues. Section II reviews existing text-similarity detection models and systems built on different datasets. Section III presents the proposed hybrid framework for efficient text-similarity detection using BERT extraction with a Siamese Bi-LSTM model. Section IV illustrates the outcomes obtained with the proposed framework. Finally, Section V states the overall conclusions and inferences from the results.

2 Review of existing work

Modeling the semantic similarity of a text pair is a critical natural language processing task, arising in applications from question answering to plagiarism detection. Numerous models have been proposed for this purpose, ranging from traditional feature-engineering techniques to deep learning models.

Semantic text-similarity detection is one of the primary tasks in natural language processing. To this end, a topic-informed BERT-based architecture was established for pair-wise text-similarity detection [28]. The BERT-based model supports the feature-selection process and improves performance over strong neural baselines on English-language datasets; qualitative analysis shows that the improvements are largely attained on domain-specific words.

A major advantage of pre-trained language models is their ability to efficiently absorb the context of words within sentences. To investigate this, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was fine-tuned on two question-answering (QA) datasets and three community question-answering (CQA) datasets for the answer-selection task [19]. This approach attained maximum improvements of 13.10% on the QA datasets and 18.70% on the CQA datasets compared with other conventional state-of-the-art models. Several NLP applications, including information-retrieval engines and dialog-based medical diagnosis systems for COVID-19, rely on the capability to measure semantic textual similarity (STS).

Existing STS datasets and models, however, sometimes fail to perform well in the domain-specific environment of COVID-19. To address this gap, the CORD-19STS dataset was introduced; it comprises 13,710 sentence pairs gathered from the COVID-19 Open Research Dataset. Around one million sentence pairs were first generated using various sampling strategies [7], and a BERT language model was used to compute similarity scores across different similarity levels among the sentence pairs, yielding about 32,000 annotated pairs. Widely used social media platforms flood their users with news items describing the same societal and governmental events, so detecting paraphrases and semantic similarity has become necessary to avoid delivering the same news item several times. One study proposes a state-of-the-art approach for paraphrase identification (PI) and STS analysis of Arabic news tweets [25]; it employs a set of features based on lexical, syntactic, and semantic computation and uses word-alignment features to detect the level of similarity between tweet pairs. Likewise, another model uses a knowledge-distillation strategy to train lightweight, deployment-friendly student models with proper weight initialization and layer pruning. The student model is fully independent of the teacher model, and the results generalize from a BERT-like teacher [20]: the student runs twice as fast and retains 96% of the fine-tuned BERT model's detection accuracy, and it can be further improved with data augmentation and unlabeled data. BERT and topic models have also been integrated into a flexible framework for fake-news detection, a critical task that needs to be automated. To address this, one framework combines the strengths of LSTM and CNN models with two distinct dimensionality-reduction techniques, Chi-Square and Principal Component Analysis (PCA), which reduce the dimensionality of the feature vectors before they are passed to the classifiers [36]. Qualitative analysis shows that the improvements are achieved mainly on instances involving domain-specific words.

A Deep Siamese Bi-LSTM model has been implemented to consume the embedded vectors produced by BERT and to predict the similarity of text pairs [33]. To demonstrate this, a Siamese network architecture was applied to large-scale stylistic author attribution. The system provides a general notion of authorship, outperforms a key-similarity-based technique on 1-shot N-way evaluation, and also performs well in the known-author setting. The study showed that convolutional neural networks are suitable as the Siamese sub-network for large numbers of authors, for which training an LSTM model is infeasible. Unsupervised cross-lingual STS based on contextual embeddings obtained from BERT has also been presented; the objective of cross-lingual STS is to measure the degree to which two text segments in different languages have the same meaning. The results show that unsupervised cross-lingual STS metrics built on BERT embeddings, without fine-tuning, achieve strong outcomes [22], approaching the performance of supervised and weakly supervised cross-lingual STS approaches. Large pre-trained language models such as BERT have brought significant improvements on NLP tasks, but BERT is trained to predict missing words behind masks or in the next sentence without explicit semantic, lexical, or syntactic knowledge, relying only on unsupervised pre-training [27]. A new method was therefore presented for explicitly injecting linguistic knowledge, in the form of word embeddings, into the pre-trained BERT layers. Performance improvements on multiple semantic-similarity datasets imply that this information is useful and was missing from the original model, whether counter-fitted or dependency-based embeddings are injected. Another work analyzes the similarity between Korean sentences by combining deep learning with an approach based on lexical relationships [43]. The deep-learning component uses five neural networks drawn from RNN, BERT, and CNN architectures, while the lexical approach applies cosine similarity to embedding vectors produced by a word-representation model. Text generation is also relevant to establishing text similarity; for its evaluation, the automatic metric BERTSCORE was proposed. BERTSCORE computes a similarity score between every token in the candidate sentence and every token in the reference sentence [44]; it correlates better with human judgments and yields stronger model-selection performance than related metrics. In addition, language-representation models have been introduced to promote better language understanding [46]. One such model, SemBERT (semantics-aware BERT), explicitly incorporates contextual semantics obtained from pre-trained semantic-role labeling on top of BERT. Some BERT models use Siamese network structures: one study presents SBERT (Sentence-BERT), a modified pre-trained BERT network employed to derive semantically meaningful sentence embeddings that can be compared with cosine similarity [32]. SBERT reduces the effort of finding the most similar sentence pair from about sixty-five hours with BERT or RoBERTa to about five seconds, while retaining BERT-level accuracy. Most STS systems model sentence pairs with either one-hot or distributed representations and treat the task as a regression problem. A novel framework was therefore implemented that integrates one-hot and distributed representations for clinical STS systems using a gated network [42]; experiments on a benchmark dataset show that fusing both representations is effective. Similarly, a hybrid Siamese-based framework combining Bi-GRU and a gated CNN (G-CNN) has been proposed to learn sentence representations and compute sentence similarity [21]; by considering both the global and local features within sentences, it produces higher-quality sentence representations and better similarity estimates.

The central task of automatic short-answer grading (ASAG) systems is to assign appropriate ordinal scores to students' answers. For this, a new neural-network architecture was implemented that integrates a Siamese Bi-LSTM with a pooling layer [14]; its similarity mechanism is based on the Sinkhorn distance between the LSTM state sequences and between the output layers of the support vectors. Similarity learning between Chinese sentences has been performed with a Siamese CNN architecture using two similarity metrics, Manhattan similarity and cosine similarity [34]; the experiments show that the Siamese CNN achieves higher accuracy than other models. Similarly, a knowledge-enhanced multi-granular semantic-embedding (MSE) model has been established [30] that addresses both sentence similarity and semantic matching between long texts. Accurate document representation through categorizing the texts in documents has been accomplished with a Siamese LSTM framework [35], which leverages Siamese LSTM sub-networks to measure the semantic distance between two documents. A deep-learning method integrating a Bi-LSTM architecture, built on a Siamese network structure, attains high accuracy in semantic-similarity detection for the question pairs of the CCKS2018-QIM task [47]. Table similarity has also been analyzed with the TabSim method, which uses a Siamese neural network and achieves high recall, precision, F1-score, and accuracy on three corpora of table pairs [8]. Similarly, matched and non-matched sentence pairs have been extracted with the Hungarian algorithm [41]: a Hungarian layer is designed over the embedded vectors, a dynamic computation graph is described for optimizing this layer, and other algorithms can likewise be embedded into the neural network. Semantic textual similarity has also been addressed by the LIPN team with a framework that uses support-vector regression to combine distinct text-similarity measures [2]. Such semantic textual-similarity techniques have been applied to paraphrase identification in Arabic based on various combinations of NLP techniques; one of them improves word identification with TF-IDF and uses the word2vec algorithm, which reduces computational complexity and optimizes word-prediction probability within context [23], together with distributed vector representations. Likewise, UNITOR is a semantic text-similarity model developed for the SemEval evaluation [3]; the task is cast as an SV-regression problem in which the scoring function for the similarity of text pairs is learned from the provided examples.

Text similarity implemented on a cloud platform requires a dynamic task-scheduling algorithm to process STS applications; such an algorithm minimizes makespan by increasing cloud-resource utilization and satisfying the usage parameters [29]. A resource-allocation model for application processing has been accomplished with a Particle Swarm Optimization (PSO) algorithm, PSO-COGENT, which optimizes the execution cost and computation time of text-similarity detection [15, 16]. The scheduling algorithm should optimize performance indicators such as makespan, energy consumption, resource utilization, and reliability [17]. Fog computing, another platform used for real-time application processing close to the data sources, improves scalability and latency and helps resist congestion from the data load [18].

2.1 Research gaps

STS is also used in the clinical domain for handling clinical records, where it is employed to eliminate redundant data and improve clinical decision making. However, the clinical semantic text-similarity resources are relatively small, since the data were developed from clinical notes obtained from a single institution [40]. The annotation schema follows conventional STS guidelines and captures only limited clinical properties. Another limitation is that the datasets were manually annotated by clinical experts; in other settings, the SemEval STS shared tasks rely on crowd-sourcing through Amazon Mechanical Turk, which is not suitable for such sensitive patient data. The extracted features of the input text may also be subject to plagiarism, which adds further complexity in some cases. Regarding unrealistic plagiarism datasets, the PAN corpus has its own drawbacks, in that it produces a large number of artificial plagiarism cases; the evaluation results reported on the PAN plagiarism corpus [37] therefore call for test beds with more manual and realistic plagiarism cases. The BERT network involved in feature extraction has a major limitation: it does not compute independent sentence embeddings, which makes it difficult to derive sentence embeddings from a BERT model. Researchers must pass single sentences through the BERT framework and derive a fixed-size embedding by averaging the outputs [12]. Long-term dependency issues and information loss also affect existing research models, especially when the input data are large, and such models require more training time and more training data than the baselines [32].

3 Proposed framework

The proposed framework performs efficient semantic text-similarity identification using a Weighted Fine-Tuned BERT model with a Deep Siamese Bi-LSTM model. The data are obtained from the Quora dataset, which consists of question-pair sentences. The sentences from these question pairs are taken as the input data and loaded. In the pre-processing phase, the question sentences are cleaned by removing non-ASCII characters, punctuation, and special characters using Python.
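
The cleaning step can be sketched in a few lines of Python; the helper name and the exact expressions below are illustrative rather than the exact implementation used in the study.

```python
import re
import string

def clean_question(text):
    """Remove non-ASCII characters, punctuation and special characters,
    as described for the pre-processing phase."""
    text = text.encode("ascii", errors="ignore").decode("ascii")       # drop non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    return re.sub(r"\s+", " ", text).strip().lower()                   # normalise spacing

print(clean_question("What's the best way to learn NLP — fast?"))
# -> "whats the best way to learn nlp fast"
```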

In the tokenization step, each token is assigned to a sentence feature; the token serves as the unit for aggregating features from a single sentence or from a sentence pair. After pre-processing, GloVe-based word embeddings convert the input words to vectors; a vocabulary pre-processor from TensorFlow is employed for this, and TensorFlow embedding-lookup functions yield the word embeddings, as sketched below. Vectorization is performed after tokenization: phrases or words are converted to their corresponding vector form, consisting of real numbers, and the vectors are mapped to the corresponding words in the pre-trained BERT model. These vectors make it possible to predict word semantics in the question pairs. The method is also applied to semantic text-similarity detection between different PDF documents. The embedded vectors are trained with a Siamese Bi-LSTM network. Calculating the semantic text similarity of a question pair, i.e., deciding whether the pair belongs to label 0 or label 1, is a classification problem, so the hybrid Siamese Bi-LSTM approach is adopted. The overall proposed framework for determining text similarity is shown in Fig. 1. A Siamese network is a neural network that processes two similar question pairs or documents and evaluates their text similarity through weight-sharing sub-networks. The GloVe embedding comprises 100-dimensional Wikipedia vectors. The Bi-LSTM layer extracts non-contiguous and long-distance dependencies between the words of a sentence; its two layers encode the embedded input vectors of the words. The attention layer computes attention weights, and the key parts of the first layer's output are taken as inputs to the next layer. The embedding-layer outputs are represented as a weighted sum of vectors, the WFT-BERT weights, for every sentence, and are then fed into the non-linear layers. The distance and angle information of the aligned outputs of a question pair is obtained with the Deep Siamese Bi-LSTM layer, and a similarity score is calculated in each layer for every sentence using the WFT-BERT weight outputs. Through this similarity score, the text similarity for the input query is evaluated: the trained model's output vectors predict the text similarity of the documents or question pairs.
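
As a minimal sketch of the GloVe lookup described above, the snippet below builds an embedding matrix for a toy vocabulary and retrieves 100-dimensional vectors with TensorFlow's embedding lookup; the GloVe file name and the vocabulary are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 100  # 100-d GloVe vectors, as used in the framework

def load_glove(path, vocab):
    """Build an embedding matrix for the known vocabulary from a GloVe file
    (path and vocabulary are assumed to be provided by the caller)."""
    matrix = np.random.normal(scale=0.1, size=(len(vocab), EMBED_DIM)).astype("float32")
    with open(path, encoding="utf8") as fh:
        for line in fh:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                matrix[vocab[word]] = np.asarray(values, dtype="float32")
    return matrix

vocab = {"<pad>": 0, "what": 1, "is": 2, "nlp": 3}           # toy vocabulary
emb = tf.constant(load_glove("glove.6B.100d.txt", vocab))    # hypothetical file name
ids = tf.constant([[1, 2, 3, 0]])                            # one padded question
vectors = tf.nn.embedding_lookup(emb, ids)                   # shape (1, 4, 100)
```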

Fig. 1

Overall flow of proposed framework

3.1 Weighted Fine-Tuned BERT model for feature extraction

Figure 2 shows the framework of the Weighted Fine-Tuned BERT extraction process used in the study. The text sequence is denoted \(\text{A} = \left\{{\text{a}}_{1}, \dots, {\text{a}}_{\text{L}}\right\}\), where \({\text{a}}_{\text{l}}\ \left(1 \le \text{l} \le \text{L}\right)\) is a sentence and \(\text{L}\) is the length of the text sequence. The bidirectional pre-trained BERT model encodes the text sequence \(\text{A} = \{{\text{a}}_{1}, \dots, {\text{a}}_{\text{L}}\}\) into fixed-length sentence vectors, which serve as the input source elements. The sentence vectors are obtained as follows:

$${\text{s}}_{\text{l}} = \text{BERTsent}\left({\text{a}}_{\text{l}}\right)$$
(1)
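
Equation (1) can be realised with any pre-trained BERT encoder. The sketch below uses the Hugging Face transformers package and takes the [CLS] hidden state as the sentence vector; this library choice and pooling strategy are assumptions, since the paper does not fix a particular implementation.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

def bert_sent(sentence):
    """s_l = BERTsent(a_l): encode one sentence into a fixed-length vector;
    here the [CLS] hidden state is taken as the sentence vector."""
    inputs = tokenizer(sentence, return_tensors="tf", truncation=True)
    outputs = bert(inputs)
    return outputs.last_hidden_state[:, 0, :]   # shape (1, 768)

s1 = bert_sent("How do I learn machine learning?")
```
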
Fig. 2

Weighted Fine-Tuned BERT Extraction Model

Here \(\text{S} = \{{\text{s}}_{1}, \dots, {\text{s}}_{\text{L}}\}\) are the source elements depicted in the figure, and \(\text{BERTsent}(\cdot)\) denotes the encoding of a sentence into its sentence vector. The hidden vector representation \({\text{u}}_{\text{l}}\) of the sentence vector \({\text{s}}_{\text{l}}\) is obtained with a multi-layer perceptron (MLP):

$${\text{u}}_{\text{l}} = \text{tanh}({\text{W}}_{1}{\text{s}}_{\text{l}} + {\text{b}}_{1})$$
(2)

The hidden vector representation is stated in Eq. (2), where \({\text{b}}_{1}\) and \({\text{W}}_{1}\) denote the bias and weight parameters. Commonly used text-representation techniques neglect the interaction information among the sentences of a text, which causes a partial loss of semantics. In this study, all source elements are treated as context information, so the resulting text representation carries more semantics. Specifically, the interaction information \({\text{i}}_{\text{k}}\) between all the source elements \(\{{\text{s}}_{1}, {\text{s}}_{2}, \dots, {\text{s}}_{\text{L}}\}\) and one source element \({\text{s}}_{\text{k}}\) is captured. The semantic weight of \({\text{a}}_{\text{l}}\) allocated by a source element \({\text{a}}_{\text{k}}\) is denoted \({\upalpha}_{\text{kl}}\):

$${\upalpha}_{\text{kl}} =\frac{\text{exp}\left({\text{u}}_{\text{l}}^{\text{T}}{\text{u}}_{\text{k}}\right)}{\sum _{\text{l}=1}^{\text{L}}\text{exp}\left({\text{u}}_{\text{l}}^{\text{T}}{\text{u}}_{\text{k}}\right)}$$
(3)

The interaction \({\text{i}}_{\text{k}}\) is then formulated as

$${\text{i}}_{\text{k}} = \sum _{\text{l}=1}^{\text{L}}{\upalpha}_{\text{kl}}\,{\text{s}}_{\text{l}}$$
(4)

In the same way, every source element interacts with all the source elements, yielding the interactions between each single source element and the full set, as shown in Fig. 2:

$$\text{I} = \left({\text{i}}_{1} , {\text{i}}_{2} , \dots , {\text{i}}_{\text{L}}\right)$$
(5)

The interactions contribute unequally to the final text representation, so an attention layer is added to obtain the most informative interactions, similarly to the classification path in Fig. 2. Here s denotes the compatibility score used to weight I, the interaction representation; it is generated from the joint embedding of words and labels over the whole text. The final text representation is then:

$$\text{T} = \text{sI}$$
(6)
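
A small NumPy sketch of Eqs. (2)-(6) follows: it computes the hidden representations, the pairwise attention weights, the interaction vectors, and the weighted text representation. The dimensions, random weights, and compatibility scores s are placeholders for the learned quantities.

```python
import numpy as np

L, d, h = 4, 8, 6                          # sentences, sentence dim, hidden dim
S = np.random.randn(L, d)                  # sentence vectors s_1..s_L from BERTsent
W1, b1 = np.random.randn(h, d), np.zeros(h)

U = np.tanh(S @ W1.T + b1)                 # Eq. (2): u_l = tanh(W1 s_l + b1)
logits = U @ U.T                           # logits[l, k] = u_l^T u_k
alpha = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # Eq. (3)
I = alpha.T @ S                            # Eqs. (4)-(5): i_k = sum_l alpha_kl s_l
s = np.full((L, 1), 1.0 / L)               # compatibility scores from Eqs. (9)-(13)
T = (s * I).sum(axis=0)                    # Eq. (6)/(14): weighted text representation
```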

Every sentence is represented as \({\text{a}}_{\text{l}} = \{{\text{wrd}}_{1}, {\text{wrd}}_{2}, \dots, {\text{wrd}}_{\text{n}}\}\), where \({\text{wrd}}_{\text{i}}\) denotes a word within the sentence. The pre-trained BERT model encodes every sentence \({\text{a}}_{\text{l}}\) into its word-embedding form \({\text{E}}_{\text{l}} = \{{\text{e}}_{1}, \dots, {\text{e}}_{\text{n}}\}\):

$${\text{E}}_{\text{l}} = \text{BERTtoken}\left({\text{a}}_{\text{l}}\right)$$
(7)

Here, \(\text{BERTtoken}\left({\text{a}}_{\text{l}}\right)\) denotes the encoding of words into word vectors. The word-embedding representation of the whole text sequence \(\text{A} = \{{\text{a}}_{1}, \dots, {\text{a}}_{\text{L}}\}\) is \(\text{E} = \{{\text{E}}_{1}, {\text{E}}_{2}, \dots, {\text{E}}_{\text{L}}\} = \{\{{\text{e}}_{1}, \dots, {\text{e}}_{\text{n}}\}, \{{\text{e}}_{1}, \dots, {\text{e}}_{\text{n}}\}, \dots, \{{\text{e}}_{1}, \dots, {\text{e}}_{\text{n}}\}\}\), where n is the total number of words in a sentence. In addition, \(\text{b}\) denotes the label corresponding to the text sequence \(\text{A}\); it is encoded into its label-embedding form \(\text{F} = \{{\text{f}}_{1}, {\text{f}}_{2}, \dots, {\text{f}}_{\text{K}}\}\) through BERT, where \(\text{K}\) is the number of classes:

$${\text{f}}_{\text{k}} = \text{BERT}\text{token}\left(\text{b}\right)$$
(8)

The words and labels are thus embedded into one joint space, in which learning directly would be tedious. A simple approach for calculating the compatibility between label-word pairs is cosine similarity. In the notation below, G denotes the compatibility between the pairs of labels and words:

$$\text{G} = \left({\text{F}}^{\text{T}}\text{E}\right)\oslash \widehat{\text{G}}$$
(9)

The operator \(\oslash\) denotes element-wise division of vectors or matrices. \(\widehat{\text{G}}\) is the normalization matrix of size \(\text{K} \times \text{L}\); each of its elements is \(\widehat{\text{g}}_{\text{kl}} = \parallel {\text{f}}_{\text{k}} \parallel\, \parallel {\text{e}}_{\text{l}} \parallel\), where \(\parallel \cdot \parallel\) denotes the 2-norm and \({\text{e}}_{\text{l}}\) and \({\text{f}}_{\text{k}}\) are the \({\text{l}}^{\text{th}}\) word embedding and \({\text{k}}^{\text{th}}\) label embedding. The relative spatial information between consecutive words is captured with a non-linear function when computing the label-word compatibility: for a text span centered at position q with length \(2\text{i}+1\), the label-to-token compatibility of the span is denoted \({\text{G}}_{\text{q}-\text{i}:\text{q}+\text{i}}\).

In the equation below, \({\text{e}}_{\text{q}}\) is the high-level compatibility representation between all labels and the \({\text{q}}^{\text{th}}\) phrase. It is evaluated as follows.

$${\text{e}}_{\text{q}} = \text{ReLU}\left({\text{G}}_{\text{q}-\text{i}:\text{q}+\text{i}} {\text{WRD}}_{2} + {\text{b}}_{2}\right)$$
(10)

Here \({\text{b}}_{2}\) denotes the bias and \({\text{WRD}}_{2}\) the weight. A max-pooling operation then takes the highest compatibility value of the \({\text{q}}^{\text{th}}\) phrase over all labels. The operation is represented as below.

$${\text{m}}_{\text{q}} = \text{max}\left({\text{e}}_{\text{q}}\right)$$
(11)

The compatibility scores of the entire text sequence are then computed as:

$$\text{s} = \text{SoftMax}\left(\text{m}\right)$$
(12)

In this notation, m is the vector of pooled values and L its length; \({\text{S}}_{\text{q}}\) is the \({\text{q}}^{\text{th}}\) element of the softmax:

$${\text{S}}_{\text{q}} =\frac{\text{exp}\left({\text{m}}_{\text{q}}\right)}{\sum _{{\text{q}}^{\prime}=1}^{\text{L}}\text{exp}\left({\text{m}}_{{\text{q}}^{\prime}}\right)}$$
(13)

Figure 2 illustrates the dual label-embedding process. The compatibility score s of the whole text sequence is evaluated from the learned word embeddings and label embeddings. It is used to capture highly interactive information and to weight the final interactive text representation \(\text{I} = \left({\text{i}}_{1}, {\text{i}}_{2}, \dots, {\text{i}}_{\text{L}}\right)\). Furthermore, because the labels learn more of the textual content, the classifier can effectively leverage the weighted labels for text classification. The compatibility score is also employed to weight the final label vectors:

$$\text{T} = \sum _{\text{q}}{\text{s}}_{\text{q}}\,{\text{i}}_{\text{q}}$$
(14)
$$\acute{\text{F}} = \sum _{\text{q}}{\text{s}}_{\text{q}}\,{\text{f}}_{\text{q}}$$
(15)

T denotes the final text representation and \(\acute{\text{F}}\) the final label representation (Table 1).
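
The label-word compatibility of Eqs. (9)-(15) can be sketched as follows; the embedding sizes, the window radius i, and the weights WRD2 and b2 are illustrative stand-ins for the learned parameters.

```python
import numpy as np

K, Lw, d = 2, 10, 8                       # classes, words, embedding dim
E = np.random.randn(d, Lw)                # word embeddings e_1..e_L (columns)
F = np.random.randn(d, K)                 # label embeddings f_1..f_K (columns)

norms = np.linalg.norm(F, axis=0)[:, None] * np.linalg.norm(E, axis=0)[None, :]
G = (F.T @ E) / norms                     # Eq. (9): cosine compatibility, K x L

i = 2                                     # half window, phrase length 2i+1
WRD2 = np.random.randn(K * (2 * i + 1), K)
b2 = np.zeros(K)

m = np.zeros(Lw)
Gp = np.pad(G, ((0, 0), (i, i)))          # pad so every position has a full window
for q in range(Lw):
    window = Gp[:, q:q + 2 * i + 1].reshape(-1)          # G_{q-i:q+i}
    e_q = np.maximum(window @ WRD2 + b2, 0.0)            # Eq. (10): ReLU
    m[q] = e_q.max()                                     # Eq. (11): max-pooling

s = np.exp(m) / np.exp(m).sum()           # Eqs. (12)-(13): softmax over positions
I = np.random.randn(Lw, d)                # interaction representation from Eq. (5)
T = (s[:, None] * I).sum(axis=0)          # Eq. (14): weighted text representation
```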

Table 1 Weighted Fine-Tuned BERT feature extraction

Algorithm 1 summarizes the feature-extraction stage. The feature set of a question from the Quora website is extracted from the question section or input statement, and the similarity scores of the different statements are likewise extracted during analysis. The proposed framework extracts a feature subset from the questions. The given dataset is \((\text{S}, \text{Y}) = \left\{\left({\text{S}}_{1}, {\text{y}}_{1}\right), \dots, \left({\text{S}}_{\text{m}}, {\text{y}}_{\text{m}}\right)\right\}\), and the trained classification model is \(\text{C}: \text{S} \rightarrow \text{Y}\). A soft-label setting is assumed, in which the classifier can be queried to obtain output probabilities for a given input, while the model parameters and training data are not accessible. A weighted example is denoted \(\text{S}\_\text{weight}\); for a given input pair, \(\text{S}\_\text{weight}\) must be generated so that \(\text{C}({\text{S}}_{\text{weight}}) \ne \text{y}\), while remaining grammatically correct and semantically similar to S. To generate \(\text{S}\_\text{weight}\), two categories of token-level perturbations are introduced: (i) replacement of a token a ∈ S with another token, and (ii) insertion of a new token \({\text{a}}^{\prime}\) into S. A few tokens of the input contribute more to the final prediction of C than the others, so replacing or inserting such tokens has a stronger impact on the classifier's prediction. The token importance \({\text{I}}_{\text{i}}\) is estimated for every token a by deleting a from S and computing the decrease in the predicted probability of the correct label y. The insertion and replacement operations are then performed on a token a by determining similarity, so that a similar token is inserted; the pre-trained BERT model is used to predict similar tokens, which fit the context and grammar of the text. If several tokens cause C to misclassify S during replacement, the token that makes \(\text{S}\_\text{weight}\) most similar to the original S, according to the similarity scores, is chosen; if no misclassification occurs, the token that most reduces the prediction probability is chosen instead. The token perturbations are applied iteratively until either \(\text{C}({\text{S}}_{\text{weight}}) \ne \text{y}\) or all tokens of S have been perturbed.
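
A minimal sketch of the token-importance step of Algorithm 1 is given below: each token's importance is the drop in the correct-label probability when that token is removed. The `classifier` argument is a hypothetical soft-label model returning class probabilities, and the toy classifier only illustrates the interface; the masked-token replacement with BERT is indicated in the comments.

```python
from typing import Callable, List

def token_importance(tokens: List[str], label: int,
                     classifier: Callable[[str], List[float]]) -> List[float]:
    """I_i = P(y | S) - P(y | S without token i), as in Algorithm 1."""
    base = classifier(" ".join(tokens))[label]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(base - classifier(" ".join(reduced))[label])
    return scores

def toy_classifier(text: str) -> List[float]:
    """Stand-in for the trained classifier C (illustrative only)."""
    p = min(0.9, 0.3 + 0.1 * text.count("similar"))
    return [1.0 - p, p]

imp = token_importance("are these questions similar".split(), label=1,
                       classifier=toy_classifier)
# The highest-importance tokens are then replaced or extended with candidates
# suggested by a masked pre-trained BERT model, as described above.
```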

3.2 Siamese Bi-LSTM-model for prediction

3.2.1 Long Short-Term Memory (LSTM)

The Siamese LSTM model described in this study is explored for determining the semantic relatedness between documents. A document pair, \({\text{Dc}}_{\text{i}}\) and \({\text{Dc}}_{\text{j}}\), is the input of the model. The variables \({\text{wt}}_{\text{t}}^{\text{i}}\) (\({\text{wt}}_{\text{t}}^{\text{j}}\)) and \({\text{em}}_{\text{t}}^{\text{i}}\) (\({\text{em}}_{\text{t}}^{\text{j}}\)) denote the t-th word of document \({\text{Dc}}_{\text{i}}\) (\({\text{Dc}}_{\text{j}}\)) and its word embedding. The LSTM model is used as the encoder. The words within a document are represented as (\({\text{wt}}_{1}^{\text{i}}, \dots, {\text{wt}}_{\text{t}}^{\text{i}}, \dots, {\text{wt}}_{\text{Ti}}^{\text{i}}\)), where the subscript T denotes the total number of words in the document; a document may contain an arbitrary number of words. At every step, the LSTM generates a hidden state \({\text{hi}}_{\text{t}}^{\text{i}}\), which is interpreted as the representation of the sequence (\({\text{wt}}_{1}^{\text{i}}, \dots, {\text{wt}}_{\text{t}}^{\text{i}}\)). The LSTM encoder thus transforms all the words of a document into a distributed vector. Document encoding follows Eqs. (16)-(21), where the input and forget gates are determined through the sigmoid functions of Eqs. (16) and (17).

$${\text{i}}_{\text{t}}=\text{sigmoid}\ ({\text{wt}}_{\text{i}}{{\text{em}}_{\text{t}}}+{{\text{U}}_{\text{i}}}{\text{hi}}_{\text{t}-1}{+ \text{c}}_{\text{i}})$$
(16)
$${\text{fr}}_{\text{t}}=\text{sigmoid}\ ({\text{wt}}_{\text{f}}{{\text{em}}_{\text{t}}}+{{\text{U}}_{\text{f}}}{\text{hi}}_{\text{t}-1}{+ \text{c}}_{\text{f}})$$
(17)
$${\widetilde{\mathrm{cs}}}_{\mathrm t}=\text{tanh}({{\text{wt}}_\text{cs}}{\text{em}}_\text{t}+{{\text{U}}_\text{cs}}{\text{hi}}_{\text{t}-1}+{\text{c}}_\text{c})$$
(18)
$${\text{cs}}_{\text{t}}={\text{i}}_{\text{t}}\odot{\widetilde{\text{cs}}}_{\text{t}}+{\text{fr}}_{\text{t}}\odot{\text{cs}}_{\text{t}-1}$$
(19)
$${\text{ou}}_{\text{t}}=\text{sigmoid}\ ({{\text{wt}}_{\text{o}}}{{\text{em}}_{\text{t}}}+{{\text{U}}_{\text{o}}}{\text{hi}}_{\text{t}-1}+ {\text{c}}_{\text{o}})$$
(20)
$${\text{hi}}_{\text{t}}={\text{ou}}_{\text{t}}\odot \text{tanh}\left({\text{cs}}_{\text{t}}\right)$$
(21)

In these equations, \({\text{cs}}_{\text{t}}\) is the memory state, \({\text{fr}}_{\text{t}}\) the forget gate, \({\text{ou}}_{\text{t}}\) the output gate, and \({\text{i}}_{\text{t}}\) the input gate [11]. ʘ denotes the component-wise product, and U and wt are the weight matrices applied to the hidden state \({\text{hi}}_{\text{t}-1}\) and the embeddings \({\text{em}}_{\text{t}}\) in Eqs. (16)-(20).
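
A NumPy sketch of one encoder step, following Eqs. (16)-(21), is given below; the weight matrices wt and U and the biases c are random stand-ins for the learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, h = 8, 16                                    # embedding and hidden sizes
rng = np.random.default_rng(0)
wt = {g: rng.normal(size=(h, d)) * 0.1 for g in "ifco"}   # input, forget, cell, output
U  = {g: rng.normal(size=(h, h)) * 0.1 for g in "ifco"}
c  = {g: np.zeros(h) for g in "ifco"}

def lstm_step(em_t, hi_prev, cs_prev):
    i_t  = sigmoid(wt["i"] @ em_t + U["i"] @ hi_prev + c["i"])      # Eq. (16)
    fr_t = sigmoid(wt["f"] @ em_t + U["f"] @ hi_prev + c["f"])      # Eq. (17)
    cs_tilde = np.tanh(wt["c"] @ em_t + U["c"] @ hi_prev + c["c"])  # Eq. (18)
    cs_t = i_t * cs_tilde + fr_t * cs_prev                          # Eq. (19)
    ou_t = sigmoid(wt["o"] @ em_t + U["o"] @ hi_prev + c["o"])      # Eq. (20)
    hi_t = ou_t * np.tanh(cs_t)                                     # Eq. (21)
    return hi_t, cs_t

hi, cs = np.zeros(h), np.zeros(h)
for em_t in rng.normal(size=(5, d)):            # encode a 5-word document
    hi, cs = lstm_step(em_t, hi, cs)
```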

3.2.2 Siamese-Networks

The hidden vector of the LSTM network, computed through Eqs. (16)-(21), serves as the vector representation of each document. The weights of the two Siamese networks are shared. The Euclidean distance, used as a negative similarity function, expresses the degree of relatedness between the document pair at the top layer of the network; the Euclidean distance between the document vectors is taken as the final score of semantic-textual distance. Let Z be the binary label assigned to a pair of documents: Z = 0 means the documents are deemed similar, while Z = 1 indicates a negative pair, i.e., the documents are deemed dissimilar. The Siamese LSTM model thus estimates the semantic relatedness of document pairs with the Euclidean distance measure. The model is trained with the back-propagation-through-time (BPTT) algorithm using a contrastive loss function. This architecture compels the LSTM sub-networks to capture the differences of textual semantics during training. After training, one of the two trained LSTM sub-networks is used to map a document to a semantic vector of fixed size, and the resulting semantic vectors are fed into a three-layer deep neural network (DNN), which also estimates the probability distribution across the classes.
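
The contrastive loss over the Euclidean distance described above can be written as the following TensorFlow sketch; the margin value is an illustrative choice.

```python
import tensorflow as tf

def contrastive_loss(vec_a, vec_b, z, margin=1.0):
    """vec_a, vec_b: document vectors from the shared LSTM encoder; z: 0/1 labels.
    Pairs with Z = 0 (similar) are pulled together; Z = 1 pairs are pushed apart."""
    dist = tf.norm(vec_a - vec_b, axis=1)                             # Euclidean distance
    similar_term = (1.0 - z) * tf.square(dist)                        # Z = 0: similar pair
    dissimilar_term = z * tf.square(tf.maximum(margin - dist, 0.0))   # Z = 1: dissimilar
    return tf.reduce_mean(0.5 * (similar_term + dissimilar_term))

a = tf.random.normal((4, 32))
b = tf.random.normal((4, 32))
z = tf.constant([0.0, 1.0, 0.0, 1.0])
loss = contrastive_loss(a, b, z)
```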

3.2.3 Architecture of Siamese network based Bi-LSTM model

The LSTM model has proven effective at sequence modelling, particularly for long sequences with inherent sequential non-linear patterns.

A typical LSTM cell is formulated as follows.

$${\text{f}\text{g}}_{\text{t}}= {\upsigma } ({\text{W}}_{\text{f}} . \left[{\text{h}}_{\text{t}-1},{\text{x}}_{\text{t}} \right]+{\text{b}}_{\text{f}\text{g}})$$
(22)
$${\text{a}}_{\text{t}}= {\upsigma } ({\text{W}}_{\text{a}} . \left[{\text{h}}_{\text{t}-1},{\text{x}}_{\text{t}} \right]+{\text{b}}_{\text{a}})$$
(23)
$${\text{S}}_{\text{t}}^\prime=\text{tanh}\left({\text{W}}_{\text{s}} . \left[{\text{h}}_{\text{t}-1},{\text{x}}_{\text{t}} \right]+{\text{b}}_{\text{s}}\right)$$
(24)
$${\text{S}}_{\text{t}}={\text{f}\text{g}}_{\text{t}}\ast{\text{S}}_{\text{t}-1}+ {\text{a}}_{\text{t}}\ast{\text{S}}_{\text{t}}^\prime$$
(25)
$${\text{b}}_{\text{t}}={\upsigma }\left({\text{W}}_{\text{b}}\left[{\text{h}}_{\text{t}-1},{\text{x}}_{\text{t}}\right]+{\text{b}}_{\text{b}}\right)$$
(26)
$${\text{h}}_{\text{t}}={\text{b}}_{\text{t}}\ast\text{t}\text{a}\text{n}\text{h}\left({\text{S}}_{\text{t}}\right)$$
(27)

Here \({\text{b}}_{\text{t}}\), \({\text{fg}}_{\text{t}}\), and \({\text{a}}_{\text{t}}\) are the output, forget-gate, and input activation vectors at time step t; \({\text{h}}_{\text{t}}\) and \({\text{x}}_{\text{t}}\) denote the output and input vectors of the LSTM unit; and the LSTM cell state is stored in \({\text{S}}_{\text{t}}\). W and b subscripted with fg, a, s, and b are the parameter matrices learned during training. The dot-product operation is denoted by \(\cdot\) and element-wise multiplication by \(*\).

Although the LSTM model has a memory mechanism with gating functions to protect long-term information, its recurrent structure still faces difficulties in learning long dependencies, because the forward and backward signals must traverse a long path through the network. In particular, the information at the head of a trajectory must traverse the full length of the recurrent network to reach the output embedding vector. A bidirectional structure is therefore applied to shorten the path the information must traverse (Fig. 3).

Fig. 3

Siamese network with Bi-LSTM based trajectory encoder

$${\text{t}}^\prime\text{r}=\text{f}\left(\text{t}\text{r}\right)$$
(28)
$$\overrightarrow{\text{h}}=\overrightarrow{\text{L}\text{S}\text{T}\text{M}}\left({\text{t}}^\prime\text{r}\right)$$
(29)
$$\overleftarrow{\text{h}}=\overleftarrow{\text{L}\text{S}\text{T}\text{M}}\left({\text{t}}^\prime\text{r}\right)$$
(30)
$$\text{w}=\text{g}\left(\left[{\overrightarrow{\text{h}}}_{\text{T}},{\overleftarrow{\text{h}}}_{1}\right]\right)$$
(31)

In the above equations, tr denotes the document (or query) trajectory, and f is a linear time-distributed transformation with a tanh activation that encodes the geo-coordinates; the encoded trajectory is denoted \({\text{t}}^\prime\text{r}\). g is the linear transformation that produces the embedding vector w. \(\overrightarrow{\text{LSTM}}\) and \(\overleftarrow{\text{LSTM}}\) are the two directions of the Bi-LSTM, the backward direction operating on the reversed input trajectory; their outputs are \(\overrightarrow{\text{h}}\) and \(\overleftarrow{\text{h}}\) (Table 2).
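
A Keras sketch of the encoder of Eqs. (28)-(31) is shown below: a time-distributed tanh transform f, the two LSTM directions, and a linear map g over the concatenated final forward and first backward states. All layer sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

seq_len, in_dim, hid, emb = 20, 100, 64, 32

inputs = tf.keras.Input(shape=(seq_len, in_dim))
tr_enc = layers.TimeDistributed(layers.Dense(hid, activation="tanh"))(inputs)  # Eq. (28)
fwd = layers.LSTM(hid)(tr_enc)                         # Eq. (29): forward final state h_T
bwd = layers.LSTM(hid, go_backwards=True)(tr_enc)      # Eq. (30): backward state over step 1
w = layers.Dense(emb)(layers.Concatenate()([fwd, bwd]))                         # Eq. (31)
encoder = tf.keras.Model(inputs, w)

vec = encoder(tf.random.normal((2, seq_len, in_dim)))  # two encoded sequences
```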

Table 2 Siamese Bi-LSTM model

In the proposed framework, the Siamese approach is integrated with a Bi-LSTM model to produce the trained model for similarity evaluation (Algorithm 2). The Siamese network is first trained on the embedded vectors of the labeled instances using a triplet loss function; this network is then used to predict labels for the sentence pairs of the unlabeled instances. Next, the embedding vectors of the labeled and unlabeled examples are passed to LLGC (Local Learning with Global Consistency). After a number of iterations, an appropriate percentage of unlabeled examples is selected on the basis of the LLGC scores and added to the labeled instances for the next iteration. The model is used to classify the question-pair sentences obtained from the Quora dataset. The labels are first trained with the Siamese Bi-LSTM model, and the Siamese training-phase algorithm is executed for twenty-five iterations. For the integrated Siamese Bi-LSTM Algorithm 2, a small subset of labeled instances is selected in accordance with semi-supervised learning practice, with a balanced number of examples for every class, and the remaining labels are treated as unlabeled. The outcomes of Algorithm 2 are determined over three random executions, using random initialization of the Siamese-network parameters (\({\text{A}1}_{1}, {\text{A}1}_{2}, {\text{A}2}_{1}, {\text{A}2}_{2}, {\text{A}3}_{1}, {\text{A}3}_{2}\)) in every run, with the initially labeled examples also selected at random. The parameters of the trained model are concatenated with the provided parameters in the final step.
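
The Siamese Bi-LSTM classifier used with Algorithm 2 can be sketched in Keras as below. A shared Bi-LSTM encoder is applied to both questions, and the absolute difference and element-wise product of the two encodings (the distance and angle cues) feed a small perceptron head. The vocabulary size follows the dataset description; the layer sizes and the binary cross-entropy head are assumptions made for illustration, and a triplet loss can be substituted as described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, VOCAB, EMB, HID = 40, 85_000, 100, 64

def build_encoder():
    return tf.keras.Sequential([
        layers.Embedding(VOCAB, EMB),
        layers.Bidirectional(layers.LSTM(HID)),
    ])

encoder = build_encoder()                          # shared weights (Siamese)
q1 = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
q2 = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
v1, v2 = encoder(q1), encoder(q2)

diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([v1, v2])   # distance cue
prod = layers.Multiply()([v1, v2])                              # angle cue
merged = layers.Concatenate()([diff, prod])
hidden = layers.Dense(64, activation="relu")(merged)            # MLP head
score = layers.Dense(1, activation="sigmoid")(hidden)           # duplicate / not

model = tf.keras.Model([q1, q2], score)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```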

4 Results and discussions

4.1 Dataset description

The Quora Question pair’s dataset has been used for analysis phase. This dataset nearly consists of 400 thousand of question-pairs that are labeled as non-duplicates and duplicates labels. In the dataset, the vocabulary consists of around 85,000 of words that mapped to unique identification numbers and 100d glove-embedding were utilized for vector conversion of those sentence features. The experimental analysis is employed utilizing the Quora-dataset for thirty epochs and utilizes the batch-size of sixteen. The training-set consists of around 361,745 question-pairs and the testing phase question-pairs consist of around 36,174 count. The accuracy rate of the model is enhanced by increasing the training steps count. The gain in the performance is attained across the simpler multi-layer perceptron method by using this Siamese Network-model architecture. The enhanced increase to 14% performance gain is achieved over shingling-technique. Further more than 80% rate of accuracy is attained by using this Siamese CNN-architecture model. For implementing the non-linearity, the sigmoid or ReLU activation-function is utilized and the loss in the features can be evaluated by Adam-optimizer. The functions are implemented by using Tensorflow-GPU. The testing phase is also carried out by using PDF files as inputs and calculations of similarity scores for those pdf documents. The single pdf files is made in comparison with other files at each iteration.

4.2 Performance-evaluation results

The performance of the proposed framework is assessed by comparing several performance parameters, namely accuracy, precision in detecting the text-similarity scores, recall, and F1-score. These metrics are determined for the existing deep-learning methods, the shingling method, a CNN technique, and an MLP technique, as well as for the proposed framework.
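
The four evaluation metrics can be computed directly from the model's binarised predictions, for example with scikit-learn; the arrays below are placeholders for the gold labels and thresholded similarity scores.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]            # gold duplicate labels (placeholder)
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]            # predictions at a 0.5 score threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```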

Figure 4 compares the performance of the different text-similarity detection techniques. The graph shows that the proposed framework attains higher accuracy, recall, F1-score, and precision than the other existing techniques for duplicate detection of text similarity between documents or question pairs.

Fig. 4

Performance Comparison evaluation

Table 3 reports the performance of the various semantic text-similarity detection methods. The proposed framework achieves an accuracy of 0.90, a precision of 0.90, a recall of 0.90, and an F1-score of 0.92 [1]; these values are higher than the corresponding metrics of the shingling technique [13], the CNN technique [45], and the MLP method [31]. Table 3 thus shows that the proposed framework, combining the Siamese Bi-LSTM network with BERT extraction, predicts text similarity more effectively than the other existing methods.

Table 3 Analysis of proposed and existing system with respect to various parameters

Similarly, Fig. 5 compares the recall, F1-score, and precision of the proposed framework with other algorithms: logistic regression, decision tree, KNN, Random Forest (RF), and SVM. The validation shows that the framework outperforms these algorithms in identifying the text similarity between sentences and in evaluating the similarity scores [38]; the precision, recall, and F1-score values in Fig. 5 are all higher for the proposed framework than for the other algorithms.

Fig. 5

Comparison evaluation of proposed framework with existing algorithms

Table 4 demonstrates that the proposed Deep Siamese Bi-LSTM framework achieves a precision of 0.90, a recall of 0.910, an accuracy of 90%, and an F1-score of 0.92 for text-similarity and duplicate detection with BERT extraction and the Deep Siamese Bi-LSTM network. These values are higher than those of the other algorithms, indicating that the proposed framework identifies text similarity and predicts similarity scores more efficiently.

Table 4 Comparison analysis of proposed framework with different algorithms of various parameters
Fig. 6

Performance analysis of proposed-framework in terms of accuracy parameter

Figure 6 gives a graphical comparison of the accuracy of the proposed framework against the other algorithms. The framework attains a higher detection accuracy in determining the semantic text similarity of question pairs drawn from the Quora dataset and of different PDF files; the Siamese network model processes each sentence and achieves this accuracy for sentence similarity.

The accuracy values are collected in Table 5 and compared with the other existing algorithms for duplicate detection of sentences within question pairs, i.e., sentences stating the same question. Among them, the Siamese Bi-LSTM model outperforms the other algorithms, attaining an efficient detection rate of 90%, and it retrieves the duplicate sentence pairs among the different questions in the large set of Quora question-pair samples.

Table 5 Comparison analysis of proposed system with respect to accuracy rate

The proposed framework is applied to the Quora dataset, whose portal contains various sets of question-pair sentences; the similarity between these semantic texts is analyzed and the similarity scores are predicted, which reflects the performance level of the model. Figure 7 shows the model applied to another dataset, the Microsoft Research Paraphrase Corpus (MRPC) [26], and its performance on that dataset. The MRPC dataset comprises sentence pairs extracted automatically from online news sources, with human annotations indicating whether the sentences are semantically equivalent.

Fig. 7

Accuracy rate evaluation of proposed framework upon MRPC dataset [26]

The accuracy of the existing algorithms is compared with the similarity accuracy of the proposed framework on the Quora dataset. Figure 8 clearly shows that the Weighted Fine-Tuned BERT Bi-LSTM model exhibits the higher accuracy of 90%.

Fig. 8

Comparison of proposed framework with existing algorithms [29]

Table 6 reports the accuracy and F1-score of the proposed model in comparison with other methods: multi-perspective LSTM, Siamese LSTM, L.D.C., ESIM, DINN, Enhanced RCNN, Siamese CNN, and multi-perspective CNN. The proposed text-similarity model, using Weighted Fine-Tuned BERT extraction with the Bi-LSTM method, determines text similarity with 90% accuracy and a 92% F1-score [29] (Fig. 9).

Table 6 Performance evaluation of proposed framework
Fig. 9

Comparison evaluation of accuracy rate of proposed model algorithm [29]

Similarly, the accuracy of the proposed Bi-LSTM model is compared with other methods on the Quora question-pairs dataset. The accuracy of the proposed framework is higher than that of the other methods, such as BERT-Base and the Enhanced RCNN model.

Table 7 lists the accuracy and F1-score of various methods [29] compared with the proposed model. The proposed Weighted Fine-Tuned BERT extraction with Bi-LSTM model exhibits the higher accuracy of 90.89% and an F1-score of 92.64%.

Table 7 Performance evaluation with existing method

The F1-score of the proposed model is analyzed against the F1-scores of the existing methods. The outcomes in Fig. 10 show that the F1-score of the proposed framework is higher than that of the other algorithms, confirming the efficiency of the proposed text-similarity method [29] (Table 8).

Fig. 10

F1-Score evaluation of proposed framework

Table 8 Performance evaluation of proposed-system with model upon different dataset

Table 8 reports the accuracy of the model on the MRPC dataset alongside the accuracy of the proposed framework on the Quora dataset. The analysis shows that the proposed framework achieves a higher degree of accurate prediction in determining semantic text similarity using the embedded vector representations: the model reaches an accuracy of 0.91 on the Quora dataset, which is higher than its accuracy on the MRPC dataset; the MRPC accuracy of 0.84 is lower than that of the Siamese Bi-LSTM network model (Table 8).

In Fig. 11, the proposed framework on the Quora dataset is compared with the accuracy of the pre-trained unsupervised BERT model of deep bidirectional transformers. The comparison shows that the proposed model achieves higher accuracy in determining text similarity and performing the NLP tasks than the pre-trained BERT model [5] (Table 9).

Fig. 11

Accuracy rate analysis of proposed-framework with other unsupervised BERT model in different dataset

Table 9 Performance evaluation of proposed-framework with different dataset model

The features contributing to semantic text similarity are extracted with a BERT model on the MRPC dataset, and the resulting similarity-prediction accuracy is 89%, as shown in Table 9 [5]. That analysis uses the deep bidirectional transformer BERT, a pre-trained design, to understand language similarity. The proposed framework, by contrast, provides 91% accuracy for similarity prediction on the Quora dataset. These accuracy figures show that the proposed technique precisely and efficiently determines the semantic text similarity between the question pairs of the Quora dataset, as well as the similarity scores between different semantic-text documents, using embedded vector representations for both short and long texts.

Another parameter for assessing the efficiency of the proposed framework in semantic text-similarity detection is execution time. Execution times for various file sizes were measured with the framework; the values for the different file sizes are listed in the table and plotted in Fig. 12. The figure shows that the execution time increases with file size, but the semantic text-similarity detection still completes within a few seconds, which indicates the efficient performance of the system.

Fig. 12

Execution time performance analysis

Table 10 reports the execution times for detecting the semantic text-similarity scores with the proposed framework on the Quora dataset. The file sizes considered are 5 KB, 10 KB, 15 KB, and 20 KB, and the table lists how many seconds the detection takes for each: 1.23 s, 2.5 s, 3.15 s, and 4.56 s, respectively, when executing the proposed algorithms. Table 10 shows that the proposed framework determines the text similarity for every file size within a few seconds, which demonstrates the framework's efficiency in terms of execution time. The major limitation of the study, however, is the increased computational time.

Table 10 Performance evaluation of proposed-model (Execution-Time)

5 Conclusion

Deep-learning techniques surpass the baselines for text similarity in eliminating duplicated data in sentences, document-wise comparisons, and question-pair analysis. The present study implements a hybrid approach combining a Weighted Fine-Tuned BERT extraction process with a Deep Siamese Bi-LSTM model on the Quora question-pairs dataset. In pre-processing, special characters are removed and a vectorization method converts the words to appropriate vector representations. The BERT extraction process extracts the features of the sentences in the question pairs, and the embedded vectors are trained with the Siamese Bi-LSTM network; the Bi-LSTM layers encode the feature vectors. Each embedded vector is connected to an input vector that traverses the multi-layer perceptron of the Bi-LSTM model, with the addition of the shared Weighted Fine-Tuned weight values. The trained model's embedded vectors predict the text similarity of documents or question pairs. The study is also validated on comparisons between different PDF documents for text-similarity detection. The proposed framework shows better efficiency in determining text similarity and evaluating the prediction scores than the other existing algorithms, exhibiting a higher accuracy of 91% in text-similarity identification compared with state-of-the-art approaches.