1 Introduction

Despite the importance of Internet use, the massive availability, accessibility, and rapid growth of information on the Internet have brought both benefits and drawbacks. Information on the Internet is increasingly exploited and unethically attributed to other authors as a source of development and knowledge. This act is considered a crime and a direct threat to the research community and to education. It is called plagiarism, which the Council of Writing Program Administrators (WPA) defines as follows: “In an instructional setting, plagiarism occurs when a writer deliberately uses someone else’s language, ideas, or other original (not common-knowledge) material without acknowledging its source” [1].

Text plagiarism can be classified into direct plagiarism or various degrees of obfuscation. Copy–paste is a direct type of plagiarism, in which text is copied from an original document and pasted into a suspicious document. The other degrees of obfuscation vary according to the type of text variation, such as changing the structure of the text, replacing words with their synonyms, or combining both. These variations can be made manually or automatically using paraphrasing software tools.

To counter text plagiarism, researchers have developed several approaches that rely on linguistic analysis of the text, using its semantic features to calculate similarity. Some approaches are based on n-gram features [2, 3], while others convert the texts to vectors using Vector Space Model (VSM) representation [4, 5], WordNet-based knowledge [6, 7], or pre-trained methods such as Word2Vec [8], InferSent [9], and GloVe [10]. These vectors are used as input for cosine, Dice, Jaccard, fuzzy, match, or Euclidean similarity measures. The main weakness of these approaches is that many experiments are required to find the best threshold values for classifying cases. In addition, these approaches fail to detect the cases with the highest degrees of obfuscation. To solve these problems, intelligent systems have been developed based on statistical [11] and deep learning [12] approaches.

Machine learning algorithms are now used in scientific computing as well as in data and text mining [13,14,15,16,17]. Recent research found that involving neural network models in a system can speed it up and improve its performance. The key advantage of deep learning over traditional machine learning is that it learns decision boundaries and features automatically, avoiding manual feature engineering. Therefore, our proposed system relies on two deep learning architectures, the densely connected convolutional network and long short-term memory, to determine the model that best fits the training data.

The key contributions of this research are summed up as follows:

  • Creating a new text similarity feature (TSF) database to train deep learning models for detecting text plagiarism. This database was created by considering all the similarity features that reflect different lexical, syntactic, and semantic text aspects. Each row of this database contains one similarity case, represented as a vector of 42 word and sentence similarity criterion values. In sum, this database can be used for lexical, syntactic, and semantic text similarity research purposes.

  • Preprocessing the created database to make it compatible with different deep learning models: each one-dimensional (1D) vector of the database was converted to a three-dimensional (3D) image for the convolutional neural network models and to a signal for the recurrent neural network models.

  • Proposing two classifiers for detecting different types of text plagiarism. The first classifier is based on densely connected convolutional network (DenseNet), while the second is based on long short-term memory (LSTM).

  • Conducting a comparative study to find the model that best fits the training data. This study compares the proposed classifiers, based on different deep learning models, with traditional machine learning models.

The proposed system comprises three steps: preprocessing, detailed analysis, and post-processing. First, using basic natural language preprocessing, the suspicious and original documents are prepared at the sentence and passage levels. Second, detailed analysis extracts plagiarized cases by applying techniques such as common n-grams, meteor scores, and intelligent deep learning classification. Finally, post-processing finds the largest valid plagiarized segments by solving the overlapping issue, merging adjacent cases, and removing small cases.

A new database of lexical, syntactic, and semantic text similarity is created for the deep learning approaches, with 42 features for each similarity case. The feature values are computed from word and sentence similarity metrics over two benchmark datasets, PAN 2013 [18] and PAN 2014 [19]. The constructed database trains the proposed system, which is evaluated using the recall, precision, F-measure, granularity, and Plagdet measures. The proposed system based on LSTM ranks first on PAN 2013 and PAN 2014 compared to state-of-the-art systems.

The rest of the paper is organized as follows. Section 2 describes recent works on plagiarism detection. Section 3 explains in detail the proposed system. Section 4 presents the experimental setup and the obtained results. Finally, Sect. 5 presents the conclusions and some future work directions.

2 Related work

The spread of plagiarism, mainly via the Internet, has attracted many researchers to develop efficient detection systems. The main challenge, however, is the diversity of plagiarism complexity, whether at the lexical, syntactic, or semantic level. In this section, we review recent plagiarism detection techniques. Many researchers have used different statistical techniques and methodologies to detect plagiarism. Sánchez Vega et al. [3] introduced an automatic paraphrase plagiarism identification approach based on character-level features. Their key contribution was exploring text by content and style features derived from n-grams to detect similar fragments. The main drawback of this method is that it neglects the semantic aspect.

Sanchez-Perez et al. [4] proposed a system for detecting text alignment between suspicious and source documents. Their core contributions are the use of the tf-isf (term frequency–inverse sentence frequency) method between sentences to extract plagiarized cases, and a recursive algorithm that combines adjacent cases into the largest plagiarized fragments while solving overlap issues. Additionally, adaptive behavior was developed to adjust the parameters depending on the type of plagiarism. This system lacks the capability to integrate the linguistic aspect into the analysis; in addition, it had no method to determine the type of plagiarism being dealt with.

Meysam et al. [5] suggested a two-level matching strategy to properly align similar fragments from the source and suspicious documents, taking both syntactic and semantic components into consideration. The first level applies a vector space model with a local weighting approach and a dictionary-based multilingual word embedding to select the smallest collection of highly similar candidate fragment pairs. At the second level, graph-of-words representations are used for pairwise comparisons. The primary limitation of this strategy is the recognition of short cases.

Vishal et al. [7] introduced an approach based on extracting syntactic and semantic knowledge from text documents. This knowledge was calculated from linguistic aspects, path similarity and depth estimation, with different weights. In addition, the syntactic and Dice measures were utilized as similarity measures for determining syntactic and semantic knowledge between sentences. This approach did not use machine learning techniques and fails to detect complex cases such as manually paraphrased, translated, and summarized text. Gharavi et al. [11] developed a scalable and language-independent plagiarism detection system based on text embedding that compares syntactic and semantic information. The comparison is made at the sentence level to extract pairs of suspicious and source sentences with the highest similarity. This system proposed three methods for parameter tuning. The first, offline threshold tuning, involves several tests on a training dataset with varying parameter values. The second, offline threshold tuning with obfuscation, treats the obfuscation type as a special case in the analysis. In the third, online threshold tuning, the parameters are changed from one type to another according to the standard deviation or median absolute deviation of all Jaccard similarity values between each detected pair. The results show that the obfuscation type still needs further investigation.

Altheneyan et al. [20] presented an automatic plagiarism detection system based on a support vector machine (SVM) classifier with lexical, syntactic, and semantic features. According to the kernel used, there were two prototypes: a linear kernel (PlagLinSVM) and a radial basis function kernel (PlagRbfSVM). The implementation occurs in two stages: paragraph and sentence. The paragraph stage identifies pairs of similar paragraphs between the suspicious and source documents using a comparison based on common unigrams and bigrams. Afterward, the sentence stage detects pairs of similar sentences based on common unigrams and meteor scores with a pre-defined condition; if the condition is not satisfied, the SVM classifier determines whether the sentences are similar. Lastly, adjacent cases are extended to form large plagiarism passages between the suspicious and source documents. The drawback of this system is that it did not investigate several merging heuristics or deep learning techniques.

Nguyen et al. [12] designed a two-phase plagiarism detection approach using an LSTM network model and a feature extraction approach comprising a passage phase and a word phase. The passage phase extracts plagiarism passages according to maximum passage similarity, intersection, and important features. The word phase identifies the exact plagiarized strings relying on word and average-word similarities and sentence-based similarity features. This system has some shortcomings in finding optimal parameters and in handling sentential redundancy, missing words, and word redundancy. Table 1 summarizes recent systems used for plagiarism detection.

Table 1 An overview of the recent systems used for plagiarism detection

3 Proposed system

In this manuscript, we present a system that identifies text plagiarism by extracting different linguistic characteristics of texts using the WordNet lexical database [21]. The target of the proposed system is to find text plagiarism with the best possible accuracy. These characteristics capture various aspects of plagiarism, such as changing the structure of the text, replacing words with their synonyms, and combining both. The proposed system is divided into three steps: preprocessing, detailed analysis, and post-processing; Fig. 1 depicts these steps. The proposed system depends on creating a database to train the deep learning models for detecting text plagiarism. It was constructed by considering all the similarity features that reflect different lexical, syntactic, and semantic text aspects. Each row of this database contains one similarity case, represented as a vector of 42 word and sentence similarity criterion values.

Fig. 1
figure 1

Structure of the proposed system

3.1 Preprocessing

Document text is one of the most unstructured forms of data; it contains noise in various forms, such as special characters, punctuation, and inconsistent letter case. Machine learning approaches also cannot understand raw text directly. Therefore, a preprocessing step is applied to clean the text and convert it into a more analyzable form so that the machine learning algorithms can better classify it.

In this step, the input documents are prepared at two levels: the sentence and passage levels. The sentence level is based on seven natural language preprocessing techniques: sentence segmentation, tokenization, lowercasing, stop word removal, special character removal, removal of irrelevant parts of speech, and lemmatization. The first technique, sentence segmentation, splits documents into meaningful sentences. Second, tokenization converts each sentence into small units called unigrams. The third technique, lowercasing, reformats each token into lowercase. The fourth technique removes stop words such as my, is, are, and the. The fifth technique removes special characters and numerals such as $, @, &, and !. The sixth technique prunes irrelevant parts of speech (conjunctions, prepositions, articles, pronouns, and determiners). Lastly, lemmatization obtains the base form of each word from its context, performing morphological analysis of the words according to their parts of speech.

The passage level is based on converting documents into meaningful passages using the sentence segmentation technique and then merging adjacent sentences until the passage length reaches at most 500 characters. After forming the passages, each passage is tokenized into unigram and bigram units; then lowercasing and the removal of stop words, special characters, and numbers are applied. A minimal sketch of the sentence-level pipeline follows.
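The sketch below illustrates the sentence-level pipeline using NLTK. Using NLTK itself is our assumption (the paper does not name its implementation libraries), and the pruned part-of-speech tag set is an illustrative mapping of the categories listed above.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Assumed NLTK resources: 'punkt', 'stopwords', 'averaged_perceptron_tagger', 'wordnet'
STOP = set(stopwords.words('english'))
# Conjunctions, prepositions, determiners/articles, and pronouns (Penn Treebank tags)
PRUNED_TAGS = {'CC', 'IN', 'DT', 'PRP', 'PRP$', 'WDT'}
lemmatizer = WordNetLemmatizer()

def preprocess_sentences(document: str):
    """Sentence-level preprocessing: segmentation, tokenization, lowercasing,
    stop word/special character/irrelevant-POS removal, and lemmatization."""
    processed = []
    for sentence in nltk.sent_tokenize(document):            # 1: segmentation
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # 2: tokenization
        kept = [word.lower() for word, tag in tagged         # 3: lowercasing
                if word.lower() not in STOP                  # 4: stop words
                and re.fullmatch(r'[A-Za-z]+', word)         # 5: special chars/numerals
                and tag not in PRUNED_TAGS]                  # 6: irrelevant POS
        processed.append([lemmatizer.lemmatize(w) for w in kept])  # 7: lemmatization
    return processed
```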

3.2 Detailed analysis

This step discovers plagiarized cases between the documents; it depends on three phases: passage, sentence, and intelligent classification. The first phase obtains the pairs of suspicious and source passages with the highest probability of plagiarism, which reduces the search area for plagiarized cases from the whole documents. The second phase uses n-grams and meteor score techniques to detect plagiarized cases. The third phase discovers complex plagiarized cases with a high level of obfuscation using a deep learning approach.

Passage phase In this phase, the goal is to obtain the pairs of suspicious and source passages with the highest probability of plagiarism. This is achieved by examining each suspicious and source passage according to the maximum number of common unigrams and bigrams [22]. If a pair appears plagiarized, the passages immediately preceding and following the source passage are also retrieved. Duplicate pairs are eliminated once the passage pairs have been extracted.

Sentence phase After reducing the search area for plagiarized cases from whole documents to the passage level, the pairs of passages are analyzed for plagiarized sentences. This comparison is based on the maximum number of common unigrams and the meteor score [20] between the sentences. Two conditions are applied with two thresholds (thupper, thlow); the meteor score is compared against them to determine the type of case and whether it requires deeper analysis. For the first condition, if the meteor score is greater than or equal to thupper, the pair is considered a plagiarized case. For the second condition, if the meteor score is less than thlow, the case is discarded. Finally, if neither condition is fulfilled, the case is classified using the constructed deep learning model, as sketched below.
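As an illustration, the two-threshold decision could be implemented as follows; meteor_score, extract_tsf_features, and deep_classifier are hypothetical names standing in for the paper's components.

```python
def classify_sentence_pair(susp, src, th_upper, th_low,
                           meteor_score, extract_tsf_features, deep_classifier):
    """Two-threshold sentence-phase decision: returns True for a plagiarized
    case, False for a discarded one; borderline cases go to the deep model."""
    score = meteor_score(susp, src)
    if score >= th_upper:      # condition 1: clearly plagiarized
        return True
    if score < th_low:         # condition 2: clearly unrelated
        return False
    features = extract_tsf_features(susp, src)        # 42 TSF values
    return deep_classifier.predict(features) >= 0.5   # sigmoid output
```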

Intelligent classification phase Complex cases with a high level of obfuscation can pass undetected. The role of the intelligent classification phase is to train a deep learning model on the lexical, syntactic, and semantic features of each case. The deep learning models need a supervised database for the training process and for constructing the deep classifier. This phase is divided into two steps: TSF database creation and binary deep classifier construction.

3.2.1 TSF Database creation

The benchmark datasets used in text similarity research consist of sets of sentences without word and sentence similarity features. Therefore, in this paper, the TSF database was constructed to contain features able to discriminate suspicious cases and differentiate variations in text similarity. It was created by considering all the similarity features that reflect lexical, syntactic, and semantic text similarity types. Each row of this database consists of one plagiarized case, represented as a vector of 42 word and sentence similarity criterion values. This database is beneficial for training intelligent classification models to detect text similarities with lexical, syntactic, and semantic features.

Positive and negative case extraction The supervised database must contain both positive and negative cases. Therefore, the pairs of plagiarized passages were first extracted and formatted according to the sentence-level preprocessing described in Sect. 3.1. Then, each suspicious sentence was scanned against all the source sentences using the meteor score and common unigrams; the pair with the maximum meteor score and common unigrams was considered a positive case. To extract the negative cases, pairs of documents that contained no plagiarism were preprocessed at the sentence level as in Sect. 3.1. After that, each suspicious sentence was checked against all source sentences based on common unigrams; if this value equals 1, the pair is considered a negative case.

Features computation The characteristics that can classify the cases were computed in this step, mainly by exploring the alternative meanings and the order of words between the sentences of each case. Each row of the TSF database contains one similarity case, represented as a vector of 42 sentence similarity criterion values. The similarity features of the training (TSF) database were based on fuzzy and gross criteria of sentence similarity. Additionally, they contained five different sentence similarity criteria, each computed with eight different word similarity criteria [6, 7, 23,24,25,26,27]. The computed word and sentence similarity criteria are explained as follows:

Path similarity criteria [6]: It returns a value indicating the degree of similarity between two word senses.

$${Crit}_{PA}\left({m}_{l},{m}_{r}\right)=\frac{1}{d+1}$$
(1)

where \(m_l\) and \(m_r\) are word senses and \(d\) is the shortest path distance between them.

Depth estimation similarity criteria [6]: It uses the difference between the summed depths of the two senses and the depth of their least common subsumer.

$${Crit}_{DE}\left({m}_{l},{m}_{r}\right)={e}^{-(H\left({m}_{l}\right)+H\left({m}_{r}\right)-2*H\left(L\left({m}_{l},{m}_{r}\right)\right))}$$
(2)

where \(H(m)\) indicates the maximum depth of word sense \(m\), and \(L(m_l, m_r)\) is the least common subsumer of \(m_l\) and \(m_r\).

Hybrid path and depth estimation similarity criteria [7]: It integrates both the path and depth estimation similarity criteria with a specific weight δ.

$${Crit}_{H}\left({m}_{l},{m}_{r}\right)=\delta *{Crit}_{PA}\left({m}_{l},{m}_{r}\right)+\left(1-\delta \right)*{Crit}_{DE}\left({m}_{l},{m}_{r}\right)$$
(3)

where δ is equal to 0.5.

Leacock–Chodorow similarity criteria [23]: It is based on the shortest path that connects the senses and the maximum depth of the taxonomy in which the senses occur.

$${Crit}_{Lch}\left({m}_{l},{m}_{r}\right)=-\mathrm{log}\frac{d+1}{2*H({m}_{l},{m}_{r})}$$
(4)

where the maximum depth of the noun taxonomy is 19 and that of the verb taxonomy is 13.

Depth similarity criteria [24]: It deals with the depth only, for the word senses and their least common subsumer.

$${Crit}_{D}\left({m}_{l},{m}_{r}\right)=\frac{2*H\left(L\left({m}_{l},{m}_{r}\right)\right)}{H\left({m}_{l}\right)+H\left({m}_{r}\right)}$$
(5)

Resnik, Lin, and Jiang–Conrath similarity criteria [25,26,27]: These criteria are based on information content (INF), which indicates the quantity of information conveyed by a specific word in a particular corpus; its value therefore varies from one corpus to another according to the corpus size and the number of occurrences and meanings. The proposed system uses the Brown Corpus to calculate information content. The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources categorized by genre, such as news and editorial, and supports access as a list of words or sentences.

$${Crit}_{Res}\left({m}_{l},{m}_{r}\right)=INF(L\left({m}_{l},{m}_{r}\right))$$
(6)
$${Crit}_{Lin}\left({m}_{l},{m}_{r}\right)=\frac{2*INF\left(L\left({m}_{l},{m}_{r}\right)\right)}{INF\left({m}_{l}\right)+INF\left({m}_{r}\right)}$$
(7)
$${Crit}_{Jcn}\left({m}_{l},{m}_{r}\right)=\frac{1}{INF\left({m}_{l}\right)+INF\left({m}_{r}\right)-2*INF\left(L\left({m}_{l},{m}_{r}\right)\right)}$$
(8)
$$INF\left(m\right)=\mathrm{log}\frac{1}{P(m)}$$
(9)
$$P\left(m\right)=\frac{\sum_{w\epsilon W(m)}appearances(w)}{C}$$
(10)

where \(INF(m)\) is the information content of word sense \(m\), \(P(m)\) is the probability of word sense \(m\) in the Brown Corpus, \(W(m)\) is the set of words in the corpus whose senses are subsumed by \(m\), and \(C\) is the total number of corpus words. The characteristics can be divided into two main types: fuzzy characteristics, and semantic, syntactic, and gross characteristics.
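All eight word similarity criteria above are available through NLTK's WordNet interface, as the following sketch shows for a pair of senses from the same taxonomy. The NLTK method names are real, but using NLTK is our assumption; Eq. (2) is implemented directly since it has no built-in counterpart.

```python
import math
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Precomputed information content over the Brown Corpus (Eqs. 9-10)
brown_ic = wordnet_ic.ic('ic-brown.dat')

def depth_estimation(sense_l, sense_r):
    """Eq. (2): exp(-(H(m_l) + H(m_r) - 2*H(L(m_l, m_r))))."""
    lcs = sense_l.lowest_common_hypernyms(sense_r)[0]
    return math.exp(-(sense_l.max_depth() + sense_r.max_depth()
                      - 2 * lcs.max_depth()))

def word_similarity_criteria(sense_l, sense_r, delta=0.5):
    """The eight word similarity criteria of Eqs. (1)-(8)."""
    path = sense_l.path_similarity(sense_r)                    # Eq. (1)
    de = depth_estimation(sense_l, sense_r)                    # Eq. (2)
    return {
        'path': path,
        'depth_estimation': de,
        'hybrid': delta * path + (1 - delta) * de,             # Eq. (3)
        'lch': sense_l.lch_similarity(sense_r),                # Eq. (4)
        'depth': sense_l.wup_similarity(sense_r),              # Eq. (5)
        'resnik': sense_l.res_similarity(sense_r, brown_ic),   # Eq. (6)
        'lin': sense_l.lin_similarity(sense_r, brown_ic),      # Eq. (7)
        'jcn': sense_l.jcn_similarity(sense_r, brown_ic),      # Eq. (8)
    }

# Example: word_similarity_criteria(wn.synset('car.n.01'), wn.synset('vehicle.n.01'))
```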

3.2.2 Fuzzy characteristic

This characteristic is a special type of semantic similarity. It can discover summary cases because it depends mainly on the number and meanings of the words of the suspicious sentence when compared with the source sentence without a mediator, as shown in Eq. (11): each word in the preprocessed suspicious sentence is compared with all the words in the preprocessed source sentence under the eight criteria [6, 7, 23,24,25,26,27]. In WordNet, each word has a list of synonyms; therefore, in the comparison between two words, each suspicious synonym is compared with all source synonyms, as shown in Eqs. (13) and (14).

$$Sim\left(E_{l},E_{r}\right)=\left(\rho_{1,r}+\rho_{2,r}+\cdots+\rho_{l,r}+\cdots+\rho_{n,r}\right)/k$$
(11)
$$\rho_{l,r}=1-\prod_{W_{r}\in E_{r}}\left(1-F_{lr}\right)$$
(12)
$$F_{lr}=\max_{m_{l}\in W_{l},\,m_{r}\in W_{r}}{Crit}_{ws}\left(m_{l},m_{r}\right)$$
(13)
$$OF_{lr}=\begin{cases}1.0 & \text{if } o=1.0\\0.7 & \text{if } o\in[0.7,1.0)\\0.5 & \text{if } o\in[0.5,0.7)\\0.3 & \text{if } o\in[0.3,0.5)\\0.2 & \text{if } o\in(0.0,0.3)\\0.0 & \text{if } o=0.0\end{cases},\quad o=\max_{m_{l}\in W_{l},\,m_{r}\in W_{r}}{Crit}_{D}\left(m_{l},m_{r}\right)$$
(14)

where \(E_l\) and \(E_r\) represent the suspicious sentence and the source sentence, respectively; \(\rho_{l,r}\) is the word-to-sentence correlation coefficient between each word \(W_l\) in \(E_l\) and \(E_r\); \(k\) is the total number of words in \(E_l\); and \(F_{lr}={Crit}_{ws}(m_l,m_r)\) is the fuzzy-semantic similarity between \(W_l\) and \(W_r\), where \(m_l\) and \(m_r\) are their synonyms/meanings and \(ws\) denotes the word similarity criterion. \(OF_{lr}\) is the optimized semantic function; it optimizes the semantic output value of WordNet using heuristic boundary conditions, and \(o\) is the output of the optimized fuzzy semantic similarity.
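A minimal sketch of Eqs. (11)–(13), assuming tokenized sentences and a word_sim callable standing in for any of the eight word similarity criteria (synonym expansion over WordNet is omitted for brevity):

```python
def fuzzy_similarity(suspicious, source, word_sim):
    """Eqs. (11)-(13): average fuzzy word-to-sentence correlation.

    suspicious, source: lists of preprocessed tokens.
    word_sim(wl, wr): a word similarity criterion returning a value in [0, 1].
    """
    correlations = []
    for wl in suspicious:
        prod = 1.0
        for wr in source:                  # Eq. (12): 1 - prod(1 - F_lr)
            prod *= 1.0 - word_sim(wl, wr)
        correlations.append(1.0 - prod)
    return sum(correlations) / len(suspicious)   # Eq. (11), k = |E_l|
```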

Semantic, syntactic, and gross characteristics

The calculation of these characteristics depends on creating semantic and syntactic lists for both the suspicious and source sentences, as shown in Figs. 2 and 3. Estimating these lists requires a mutual list, constructed from the distinct words of both the suspicious and source sentences; its length determines the dimensions of both the semantic and syntactic lists. The following steps describe in detail how to calculate these lists using the eight criteria [6, 7, 23,24,25,26,27] and the desired characteristics (a code sketch follows Step 9):

Fig. 2
figure 2

Workflow for semantic, syntactic, and gross characteristics calculation

Fig. 3
figure 3

Flowchart for calculating semantic and syntactic similarity

Step 1: The proposed system examines whether the word from the mutual list appears in the suspicious sentence.

Step 2: If this condition is satisfied, the corresponding dimension in the semantic list is set to the value 1, and the corresponding dimension in the syntactic list is set to the word’s order in the suspicious sentence.

Step 3: Otherwise, the similarity between the word and all words in the suspicious sentence is computed; the maximum degree of similarity between the pairs of words is recorded in the semantic list, and the order of the most similar word is stored in the syntactic list.

Step 4: If there is no degree of similarity, the value 0 is assigned to the dimension corresponding to that word in the semantic list.

Step 5: The above steps are repeated for all words in the mutual list.

Step 6: In the same manner, the semantic and syntactic lists for the source sentence are created.

Step 7: Afterward, the semantic similarity is calculated using Jaccard similarity, cosine similarity, and dice similarity as follows:

$$\mathrm{Jaccard }\left({E}_{l},{E}_{r}\right)=\frac{{\sum }_{i=1}^{v}{T}_{li}* {T}_{ri}}{{\sum }_{i=1}^{v}{T}_{li}^{2}+ {\sum }_{i=1}^{v}{T}_{ri}^{2}-{\sum }_{i=1}^{v}{T}_{li }* {T}_{ri}}$$
(15)
$$\mathrm{cosine }({E}_{l},{E}_{r})=\frac{{\sum }_{i=1}^{v}{T}_{li}* {T}_{ri}}{\sqrt{{\sum }_{i=1}^{v}{T}_{li}^{2}}* \sqrt{{\sum }_{i=1}^{v}{T}_{ri}^{2}}}$$
(16)
$$dice ({E}_{l},{E}_{r})=\frac{2*{\sum }_{i=1}^{v}{T}_{li}* {T}_{ri}}{{\sum }_{i=1}^{v}{T}_{li}^{2}+ {\sum }_{i=1}^{v}{T}_{ri}^{2}}$$
(17)

where \(E_l\) and \(E_r\) are the semantic lists derived from the suspicious and source sentences, respectively; \(T_{li}\) and \(T_{ri}\) are the values of the \(i^{th}\) dimension in the corresponding semantic list; and \(v\) is the number of words in the mutual list.

Step 8: The syntactic similarity is calculated as follows:

$$\mathrm{Syntactic }({O}_{l},{O}_{r})=1-\frac{||{O}_{l}-{O}_{r}||}{||{ O}_{l}+{O}_{r}||}$$
(18)
$$||{O}_{l}-{O}_{r}||=\sqrt{{({O}_{l1}-{O}_{r1})}^{2}+{({O}_{l2}-{O}_{r2})}^{2}+\dots +{({O}_{lv}-{O}_{rv})}^{2}}$$
(19)
$$||{O}_{l}+{O}_{r}||=\sqrt{{({O}_{l1}+{O}_{r1})}^{2}+{({O}_{l2}+{O}_{r2})}^{2}+\dots +{({O}_{lv}+{O}_{rv})}^{2}}$$
(20)

where \(O_l\) and \(O_r\) are the syntactic lists derived from the suspicious and source sentences, respectively; \(O_{li}\) and \(O_{ri}\) are the values of the \(i^{th}\) dimension in the corresponding syntactic list; and \(v\) is the number of words in the mutual list.

Step 9: The gross similarity is calculated as a combination of the dice semantic similarity and the syntactic similarity, based on the hybrid word similarity, as follows:

$$\mathrm{Gross }\left({E}_{l},{E}_{r}\right)=\gamma * dice \left({E}_{l},{E}_{r}\right)+\left(1-\gamma \right)*\mathrm{ syntactic }({O}_{l},{O}_{r})$$
(21)

Here, γ = 0.8 represents the significance coefficient for semantic and syntactic similarity, where dice semantic similarity has the most significant effect on gross similarity.
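The following is a compact sketch of Steps 1–9 under the same assumptions as before (tokenized sentences and a word_sim callable for the chosen word similarity criterion); it is an illustration, not the authors' exact implementation.

```python
import numpy as np

def build_lists(sentence, mutual, word_sim):
    """Steps 1-6: semantic and syntactic lists for one sentence."""
    sem, syn = [], []
    for word in mutual:
        if word in sentence:                      # Steps 1-2: exact match
            sem.append(1.0)
            syn.append(sentence.index(word) + 1)  # 1-based word order
        else:                                     # Steps 3-4: best similar word
            best, order = max(
                (word_sim(word, w), i + 1) for i, w in enumerate(sentence))
            sem.append(best)
            syn.append(order if best > 0 else 0)
    return np.array(sem), np.array(syn, dtype=float)

def gross_similarity(susp, src, word_sim, gamma=0.8):
    """Steps 7-9: dice semantic similarity (Eq. 17) combined with
    syntactic similarity (Eq. 18) into the gross similarity (Eq. 21)."""
    mutual = list(dict.fromkeys(susp + src))      # distinct words of both sentences
    el, ol = build_lists(susp, mutual, word_sim)
    er, osr = build_lists(src, mutual, word_sim)
    dice = 2 * (el @ er) / (el @ el + er @ er)
    syntactic = 1 - np.linalg.norm(ol - osr) / np.linalg.norm(ol + osr)
    return gamma * dice + (1 - gamma) * syntactic
```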

3.3 Intelligent classification construction

This phase detects complex plagiarized cases with a high level of obfuscation when the sentence phase is unable to discover the text similarity. The proposed system relies on two deep learning architectures, the densely connected convolutional network (DenseNet) [28] and LSTM [29], to determine the model that best fits the training data. The input data for each technique were normalized using the Z-score method, which rescales the dataset values to a common scale without distorting the differences in the ranges of the values [30].
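For instance, Z-score normalization could be applied with scikit-learn's StandardScaler (the library choice is ours; the paper names only the method):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                  # z = (x - mean) / std, per feature
X_train = scaler.fit_transform(X_train)    # fit on the 42-feature training vectors
X_test = scaler.transform(X_test)          # reuse training statistics at test time
```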

3.3.1 LSTM

LSTM is an artificial recurrent neural network (RNN) architecture used in deep learning and well suited to classification and prediction tasks. Here it is used as a classifier that requires three-dimensional input: seq_len, time_steps, and batch_size. Therefore, the input was reshaped accordingly and fed into the classifier, with seq_len equal to the number of characteristics (seq_len = 42), time_steps equal to 1, and batch_size equal to 1 (stochastic mode). With the sigmoid activation function, the output lies in the range (0, 1), so cases with values greater than or equal to 0.5 were considered positive. Figure 4 shows the architecture of the LSTM model: it consists of four LSTM layers, two fully connected layers, and an output layer. The cells of the first, second, third, and fourth LSTM layers contain 256, 128, 64, and 32 units, respectively, and the first and second fully connected layers contain 42 and 20 units, respectively.

Fig. 4
figure 4

Structure of the LSTM model
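A minimal Keras sketch consistent with this description; the hidden-layer activations, optimizer, and loss are our assumptions, as the paper specifies only the layer sizes and the sigmoid output:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(256, return_sequences=True, input_shape=(1, 42)),  # time_steps=1, seq_len=42
    LSTM(128, return_sequences=True),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dense(42, activation='relu'),
    Dense(20, activation='relu'),
    Dense(1, activation='sigmoid'),   # positive case if output >= 0.5
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```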

3.3.2 DenseNet

DenseNet is a type of convolutional neural network (CNN) that uses dense connections between layers through dense blocks, in which all layers are directly connected with each other, as shown in Fig. 5. It requires an input shape of at least n × 32 × 32 × 3 (number of samples × width × height × channels), so each sample of 42 characteristics must be adapted to this input. Each case is reshaped to a 6 × 7 (2D) grid and zero-padded with 26 rows and 25 columns to form a 32 × 32 (2D) array, which is then repeated three times to give a 32 × 32 × 3 (3D) image; Fig. 6 describes this method. The prediction range 0–1 is used to classify the cases: if the output is greater than or equal to 0.5, the case is considered positive.

Fig. 5
figure 5

Structure of DenseNet model

Fig. 6:
figure 6

1D vector conversion to 3D image
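A NumPy sketch of this 1D-to-3D conversion; placing the zero padding after the 6 × 7 grid is our assumption, since the paper does not state where the zeros go:

```python
import numpy as np

def vector_to_image(features: np.ndarray) -> np.ndarray:
    """Convert a 42-value TSF vector into a 32 x 32 x 3 pseudo-image for DenseNet."""
    grid = features.reshape(6, 7)                        # 1D -> 2D
    grid = np.pad(grid, ((0, 26), (0, 25)))              # zero-pad to 32 x 32
    return np.repeat(grid[:, :, np.newaxis], 3, axis=2)  # replicate to 3 channels
```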

3.4 Post-processing

In this step, the main objectives are to solve the overlapping problem, merge adjacent detected cases, and discard short cases, thereby extracting the best plagiarized segments between the suspicious and source documents. These objectives are achieved through three operations: Filter I, adaptive behavior, and Filter II.

3.4.1 Filter I

The second phase of the proposed system may discover cases in which one suspicious sentence is similar to multiple source sentences. To solve this overlapping problem, the proposed system assumes that one suspicious sentence is plagiarized from only one source sentence, but not vice versa. This operation allows only the pair of suspicious and corresponding source sentences with the maximum meteor score to pass to the next steps.

3.4.2 Adaptive behavior

This behavior supports the proposed system in dealing with the various types of obfuscation in the PAN Workshop series [18, 19]: copy–paste, random obfuscation, translation obfuscation, and summary obfuscation. Each type needs a special method with specific parameters to adapt to it and obtain the best performance; however, the obfuscation type of a given case is unknown in advance. Therefore, the proposed system adapts to the different types using the methods illustrated in Fig. 7.

Fig. 7
figure 7

Workflow of adaptive behavior

Adaptive extension The input cases C passed from Filter I are defined as the pairs (x, y) of detected plagiarized sentences. These sentences need a technique to merge them with their adjacent sentences into the largest text segments that are similar across the two documents. This technique is called extension, and it is divided into two sub-processes [4]: clustering and validation.

Clustering process Cases that are not separated by more than a given gap of sentences are grouped; a sketch of this grouping follows the adaptive extension description below. This process sorts and clusters the set of cases by x (the left, or suspicious, document) such that \(x_{n+1} - x_n \le\) maxleft-gap. Then the same sorting and clustering is applied to each resulting cluster based on y through a maxright-gap threshold (the right, or source, document).

Validation process The proposed system uses the parameters maxleft-gap and maxright-gap to cluster the cases into the largest text segments; however, some sentences in these segments may have no similarity to any sentence in the corresponding segment. Therefore, to avoid adding noise in the clustering step, we validate the similarity between the text segments of the remaining clusters using a threshold. If the similarity is less than the given threshold (Sim1), the extension stage is applied again with maxleft-gap − 1 and maxright-gap − 1 for this particular cluster, reducing the gaps down to the minleft-gap and minright-gap values, respectively. If the minimum values are reached and the validation condition is still not met, the cluster is discarded. Given a cluster composed of cases of the form (x, y), the text segment in the suspicious document, segleft, collects all the sentences from the smallest to the largest x in the cluster; similarly, the corresponding text segment in the source document, segright, collects all the sentences from the smallest to the largest y. The similarity between segments is measured using the gross similarity of Eq. (21). The extension technique is used twice with different left and right gap parameters for the adaptive extension.

The adaptive extension runs the extension technique along two tracks with different maxleft-gap and maxright-gap values to achieve the best result. The first is a regular track, where the maximum left and right gaps equal 4 sentences, which was found to give the best performance on the copy–paste, random, and translation subsets. The second is a summary track, where the maximum left and right gaps equal 24 sentences, which is best for the summary obfuscation subset.
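A minimal sketch of the gap-based clustering sub-process, applied once per side as described above (it assumes a non-empty, sorted list of sentence offsets):

```python
def cluster_by_gap(values, max_gap):
    """Group sorted sentence offsets whose consecutive gaps are <= max_gap."""
    clusters, current = [], [values[0]]
    for prev, cur in zip(values, values[1:]):
        if cur - prev <= max_gap:
            current.append(cur)
        else:
            clusters.append(current)
            current = [cur]
    clusters.append(current)
    return clusters

# Two-pass use, mirroring the extension technique: cluster the cases by x with
# max_left_gap, then re-cluster each resulting cluster by y with max_right_gap.
```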

Copy–paste detector This detector obtains copy–paste segments from the regular cases. The longest common substring algorithm is used, and a match is accepted when its length is greater than or equal to a certain threshold (thcopy) measured in characters.

Output selector After applying the adaptive extension and copy–paste detector methods, three outputs are extracted: regular plagiarism cases (Cases G), summary plagiarism cases (Cases U), and copy–paste plagiarism cases (Cases P). The output selector gives priority to copy–paste cases. Then, if the length of the source segment is greater than or equal to three times the length of the corresponding suspicious segment, the case is labeled a summary case; otherwise it is a regular case.

3.4.3 Filter II

Small pairs of plagiarized segments are removed in this step. A pair of segments is considered small if the suspicious segment’s length is less than Lenleft or the source segment’s length is less than Lenright.

4 Experimental setup

The proposed system was developed in the Python programming language and run on a computer with an Intel® Core™ i5-4210U CPU @ 2.40 GHz and 8.00 GB RAM. Table 2 lists the parameter settings used in the proposed system.

Table 2 Setting parameters

4.1 Datasets

The proposed system was evaluated using two benchmark datasets, that is, PAN 2013 and PAN 2014. These datasets were provided for text-alignment sub-tasks in the plagiarism detection depending on various types of obfuscation such as:

  • No-obfuscation: This type is known as copy–paste, where the source text and the reused text are identical.

  • Random obfuscation: The source text is altered using various plagiarism techniques that change the sentences’ composition, their semantics, or both. These techniques are applied manually or using automatic software tools.

  • Cyclic translation: Machine translators convert the source text into other languages and then back into the original language.

  • Summary: This is considered the most complex among the other types because it depends on a good understanding of the subject or text to be plagiarized manually.

PAN 2014 is a supplement to PAN 2013 and consists of only two forms of obfuscation: no-obfuscation and random obfuscation. The number of documents for each obfuscation type in the training and testing corpora of PAN 2013 and PAN 2014 is presented in Table 3.

Table 3 The description of TSF training and testing subsets datasets

4.2 TSF database creation

The proposed system depends on building a supervised training database to train the intelligent classification models. Therefore, the TSF database was created from the available benchmark datasets PAN 2013 and PAN 2014. Forty-two sentence similarity feature values are computed and recorded, together with the class label, for each extracted case to build the TSF database. The positive and negative cases of the TSF database were extracted from the benchmark datasets as explained in Sect. 3.2. The training cases of the TSF database were extracted from the training corpus of PAN 2013, and the test cases were selected from the testing corpora of PAN 2013 and PAN 2014. Table 3 shows the number of samples in the TSF training and testing databases.

The training database was used to train the intelligent classification models, and the testing database was used to evaluate the constructed intelligent classifiers. The order of the TSF features affects the construction of the DenseNet and LSTM classifiers; therefore, information gain [31] was applied to rank the features. Table 4 describes the ranked features of the TSF database.

Table 4 Description of TSF database features

A statistical analysis of the TSF database was carried out, shown in Table 5, to explain the importance of each created feature. The values of all sentence similarity features of the TSF database lie between 0 and 1: a feature value closer to 1 indicates that the two sentences are similar, and a value closer to 0 indicates that they are dissimilar. A discriminative sentence similarity feature, one able to differentiate positive and negative cases with high accuracy, has high intra-class similarity and low inter-class similarity of the class labels. Therefore, the mean, standard deviation, 95% confidence limits, and P-value were calculated for each sentence similarity feature.

Table 5 Statistical analysis of TSF database features

4.3 Evaluation metrics

Potthast et al. [32] introduced a score called Plagdet to assess the effectiveness of plagiarism detection techniques. This score combines three measures: recall, precision, and granularity, which are computed as follows:

$$Rec(A,B)=\frac{1}{|A|}\times \sum_{a\in A}\frac{|{\bigcup }_{b\in B}(a\sqcap b)|}{|a|}$$
(22)
$$Prec(A,B)=\frac{1}{|B|}\times \sum_{b\in B}\frac{|{\bigcup }_{a\in A}(a\sqcap b)|}{|b|}$$
(23)
$$\mathrm{where}\; a\sqcap b=\left\{\begin{array}{cc}a\cap b& \text{if } b \text{ detects } a\ (\text{number of overlapping characters})\\ \varnothing & \text{otherwise}\end{array}\right.$$

where A is the actual set of all plagiarism cases, a is a single plagiarism case, B is the set of all detections, and b is a single detection reported by the proposed system. Granularity, defined in Eq. (24), handles overlapping or multiple detections of one plagiarism case; it assesses the ability of the detection system to detect each plagiarism case as a single piece.

$$\mathrm{Gran}(A,B)=\frac{1}{|{A}_{B}|}\times \sum_{a\in {A}_{B}}\left|{B}_{a}\right|$$
(24)

where \(A_B \subseteq A\) is the set of cases detected by at least one detection in B, and \(B_a \subseteq B\) is the set of detections of case \(a\).

The Plagdet score is calculated as:

$$\mathrm{Plagdet}(A,B)=\frac{F-\mathrm{measure}}{{\mathrm{log}}_{2}(1+\mathrm{Gran}(A,B))}$$
(25)

where the F-measure is the harmonic mean of recall and precision and is estimated as follows:

$$F-\mathrm{measure}=\frac{2\times \mathrm{Rec}\left(A,B\right)\times \mathrm{Prec}\left(A,B\right)}{\mathrm{Rec}\left(A,B\right)+\mathrm{Prec}\left(A,B\right)}$$
(26)
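For illustration, Eqs. (25) and (26) transcribe directly into Python; this is a sketch, and the official PAN evaluation script should be used for reported numbers:

```python
import math

def plagdet(rec: float, prec: float, gran: float) -> float:
    """Combine recall, precision, and granularity into the Plagdet score."""
    if rec + prec == 0:
        return 0.0
    f_measure = 2 * rec * prec / (rec + prec)   # Eq. (26)
    return f_measure / math.log2(1 + gran)      # Eq. (25)

# Example: plagdet(0.90, 0.88, 1.0) equals the F-measure, since log2(2) = 1.
```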

4.4 Results and discussions

The proposed system relies on two paths to detect text plagiarism. The first path is based on traditional paragraph-level comparison, and the second is based on a deep learning approach; the second path is used when the first is unable to discover the text similarity, and it constructs a deep learning classifier able to detect all the confusable types of lexical, syntactic, and semantic plagiarism cases. As shown in Table 5, all sentence similarity features have close positive and negative mean values, the 95% confidence limits of the positive cases are far from 1 and close to the positive and negative means, and the P-values exceed 0.05; this indicates that relying on any single feature would confuse the decision. For this reason, previous studies [3,4,5, 7, 11, 12, 20] depended on 2, 3, or 4 sentence similarity features instead of one to enhance text plagiarism detection. Yet depending on 2, 3, or 4 sentence similarity features remains challenging, because these features are not discriminative, as shown in Table 5. Therefore, the proposed system takes all the different possible sentence similarity features into consideration by creating the TSF database to train intelligent classification algorithms.

A comparative study was conducted to determine the model that best fits the TSF training dataset and detects text plagiarism in the TSF testing dataset with the highest accuracy. The study covered three intelligent learning models: SVM, DenseNet, and LSTM. In the experiments, the LSTM model achieved the highest accuracy across the different test datasets; these results are presented in Fig. 8 and Tables 6, 7, and 8. The LSTM can weigh the fluctuations among the values of the 42 features, which helped it detect lexical, syntactic, and semantic similarity cases with higher accuracy than the SVM and DenseNet classifiers.

Fig. 8
figure 8

Comparison between SVM, DenseNet, and LSTM models on the TSF testing dataset

Table 6 Performance of SVM classifier on the TSF testing dataset
Table 7 Performance of DenseNet classifier on the TSF testing dataset
Table 8 Performance of LSTM classifier on the TSF testing dataset

Based on the previous results, the proposed system uses the constructed LSTM classifier to detect text plagiarism. The system depends on n-gram and meteor score techniques to detect the simple cases and on the constructed LSTM for the complex cases. As explained in Sect. 3.2, the meteor score is computed for each pair of sentences: if it is greater than or equal to thupper, the pair is considered a plagiarized case; if it is less than thlow, the case is discarded; and if neither condition is fulfilled, the case is analyzed using the constructed LSTM.

After the classification process, the post-processing phase of the proposed system was applied, as explained in Sect. 3.4, to find the largest valid plagiarized segments by solving the overlapping issues, merging adjacent cases, and removing small cases. To study the contribution of the post-processing phase to the proposed system’s performance, the system was also run without it on all the obfuscation types in the testing corpora of PAN 2013 and PAN 2014. The results demonstrate the effectiveness of the post-processing phase in improving the proposed system’s performance, as shown in Figs. 9 and 10.

Fig. 9
figure 9

Effectiveness of the post-processing step on the proposed system’s performance in detecting all obfuscation types of PAN 2013

Fig. 10
figure 10

Effectiveness of the post-processing step on the proposed system’s performance in detecting all obfuscation types of PAN 2014

The proposed system was evaluated on the PAN 2013 and PAN 2014 datasets and compared with recent systems. From Tables 9, 10, and 11, it is evident that the proposed system is efficient and accurate for all forms of plagiarism. It surpasses the competition on the PAN 2013 random obfuscation subset, the entire PAN 2013 dataset, and the entire PAN 2014 dataset, achieving the highest accuracy compared to the up-to-date ranked research in this field.

Table 9 The proposed system’s performance compared to the prior systems on PAN 2013 random obfuscation subset
Table 10 The proposed system’s performance compared to the prior systems on the complete PAN 2013 dataset
Table 11 The proposed system’s performance compared to the prior systems on the complete PAN 2014 dataset

As the previous results show, the proposed system balances the evaluation criteria of precision and recall while reducing granularity; all steps of the proposed system contribute to making it reliable and robust:

  • Preprocessing: Contributed to reducing the incidence of false-positive cases and the size of the compared texts.

  • Detailed analysis: Extracted the most accurate plagiarism cases between the suspicious and original documents.

  • TSF database creation: Included linguistic analysis of the texts and extraction of linguistic features that cover all the aspects that could be applied to change the structure of plagiarized text and that can distinguish the classified cases.

  • LSTM classifier construction: Relies on deep learning techniques that weigh the fluctuations among the linguistic features, instead of the experimental tuning of similarity criterion parameters used in previous research.

  • Post-processing: Improved the precision and reduced the granularity without significant impact on the recall.

5 Conclusion

This paper constructed a new database for intelligent learning purposes to detect text plagiarism, with 42 features for each similarity case. The database values were computed using word and sentence similarity metrics that reflect all the different lexical, syntactic, and semantic text plagiarism types. Each case of the created database was converted to a 3D image and to a signal to suit the inputs of the different deep learning architectures. The proposed intelligent system preprocesses the documents, selects the possible plagiarized cases, and computes the lexical, syntactic, and semantic similarity criterion values for each extracted case to create the training database. It also constructs the deep learning classifier and extracts the best segments of plagiarized text by filtering and merging adjacent detected seeds of similar sentences. A comparative study was conducted on the two benchmark datasets, PAN 2013 and PAN 2014, to determine the deep learning architecture that detects text plagiarism with the highest accuracy by fitting the values of the created training database. Our findings show that the proposed system based on the LSTM architecture, which can weigh the fluctuations among the values of the lexical, syntactic, and semantic similarity criteria of the suspicious cases, achieved the highest accuracy compared to state-of-the-art text plagiarism systems.