Science and Engineering Ethics, Volume 18, Issue 2, pp 223–239

Prevalence of Plagiarism in Recent Submissions to the Croatian Medical Journal

Authors

  • Ksenija Baždarić
    • Department of Medical Informatics, Rijeka University School of Medicine
  • Lidija Bilić-Zulle
    • Department of Medical Informatics, Rijeka University School of Medicine
    • Clinical Department of Laboratory Diagnostics, Clinical Hospital Centre Rijeka
  • Gordana Brumini
    • Department of Medical Informatics, Rijeka University School of Medicine
  • Mladen Petrovečki
    • Department of Medical Informatics, Rijeka University School of Medicine
    • Department of Clinical Laboratory Diagnostics, Dubrava Clinical Hospital
Original Paper

DOI: 10.1007/s11948-011-9347-2

Cite this article as:
Baždarić, K., Bilić-Zulle, L., Brumini, G. et al. Sci Eng Ethics (2012) 18: 223. doi:10.1007/s11948-011-9347-2

Abstract

The aim of this study was to assess the prevalence of plagiarism in manuscripts submitted for publication to the Croatian Medical Journal (CMJ). All manuscripts submitted in 2009–2010 were analyzed using plagiarism detection software: eTBLAST, CrossCheck, and WCopyfind. Plagiarism was suspected in manuscripts with more than 10% of the text derived from other sources. These manuscripts were checked against the Déjà vu database and manually verified by the investigators. Of 754 submitted manuscripts, 105 (14%) were identified by the software as suspected of plagiarism. Manual verification confirmed that 85 (11%) manuscripts were plagiarized: 63 (8%) contained true plagiarism and 22 (3%) self-plagiarism. Plagiarized manuscripts were mostly submitted from China (21%), Turkey (19%), and Croatia (14%). There was no significant difference in text similarity rate between true plagiarized and self-plagiarized manuscripts (25% [95% CI 22–27%] vs. 28% [95% CI 20–33%]; U = 645.50; P = 0.634). Text similarity rates differed between sections of self-plagiarized manuscripts (H = 12.65; P = 0.013): the rate in the Materials and Methods section (61% [95% CI 41–68%]) was higher than in the Results (23% [95% CI 17–36%]; U = 33.50; P = 0.009) or Discussion (25.5% [95% CI 15–35%]; U = 57.50; P = 0.005) sections. Three authors were identified in the Déjà vu database. Plagiarism detection software combined with manual verification may be used to detect plagiarized manuscripts and prevent their publication. The prevalence of plagiarized manuscripts submitted to the CMJ, a journal dedicated to promoting research integrity, was 11% in the 2-year period 2009–2010.

Keywords

Research · Ethics · Peer review · Plagiarism · Self-plagiarism · Scientific misconduct · Research integrity · Software

Abbreviations

CMJ: Croatian Medical Journal
COPE: Committee on Publication Ethics

Introduction

Plagiarism is considered a form of scientific misconduct and a serious breach of publication ethics (Bilic-Zulle 2010; Mason 2009). It is defined as “appropriation of another person’s ideas, processes, results or words without giving appropriate credit to the source or author” (ORI 2000). While plagiarism indicates intellectual theft from another author, self-plagiarism occurs when “one’s previously published idea, text or data is being reused and presented as original work” (Roig 2010). Understanding the causes and knowing the likely prevalence of plagiarism may help prevent this unethical practice.

Some of the most commonly identified causes of plagiarism include pressure to publish (Roig 2010; Bilic-Zulle 2010), limited English and writing proficiency (Roig 2010; Mason 2009), attitudes towards plagiarism (Mavrinac et al. 2010; Pupovac et al. 2010; Segal et al. 2010) and influence of cultural values (Bilic-Zulle et al. 2008; Hayes and Introna 2005; Segal et al. 2010). Authors are under pressure to publish because their scientific productivity is often measured by the number of papers published (“publish or perish”). Also, in order to be recognized in a broad scientific community, they have to publish in English (Bilic-Zulle 2010). This, however, may be difficult for scientists who are non-native speakers of English and especially for those whose native languages use different alphabets, such as Chinese (Roig 2010).

Several studies investigating attitudes towards plagiarism have revealed that students from Eastern and post-communist countries are more tolerant toward cheating and plagiarism than their Western counterparts (Hayes and Introna 2005; Hrabak et al. 2004; Magnus et al. 2002). A recent study performed among medical students in Croatia, a post-communist transition country, showed that students have a relatively permissive attitude towards plagiarism (Pupovac et al. 2010). Cultural values differ across countries. Since Eastern cultures are more collectivistic and less likely to consider plagiarism unethical, authors from these cultures may be more likely to commit plagiarism (Hayes and Introna 2005; Segal et al. 2010).

Development of information and communication technologies is thought to have increased plagiarism, but it has also provided effective means for detecting plagiarism and measuring its prevalence (Bilic-Zulle 2010; Garner 2011). Most types of plagiarism detection software use algorithms to compare texts, highlight similar parts, and calculate similarity rate between texts (Bilić-Zulle et al. 2005). In previous studies of plagiarism prevalence in the academic setting, investigators mostly used EVE (Essay Verification Engine, http://www.canexus.com/; Braumoeller and Gaines 2001), WCopyfind (http://plagiarism.bloomfieldmedia.com/z-wordpress/; Bilić-Zulle et al. 2005, 2008) and Turnitin (https://turnitin.com; Segal et al. 2010).

Plagiarism detection software able to compare a text against published scientific material (rather than only material freely available on the internet) has only recently been introduced, with eTBLAST (http://etest.vbi.vt.edu/etblast3/) and CrossCheck (http://www.crossref.org/crosscheck.html) being the most frequently used (Garner 2011). CrossCheck is an especially interesting and promising tool for plagiarism detection, as it allows comparison of a text against a growing database of published academic writing, including many journals that are not free, i.e., are available only to subscribers. To date, little has been published on the use of CrossCheck in detecting plagiarism in scientific manuscripts. A recent 6-month pilot study of three Taylor and Francis journals using CrossCheck revealed that editors had to reject 6–23% of manuscripts otherwise accepted for publication because of suspected plagiarism (Butler 2010). Similarly, an investigation of plagiarism in the Chinese journal Journal of Zhejiang University-Science revealed that 29% of submitted manuscripts analyzed with CrossCheck contained plagiarized parts (Zhang 2010).

To determine the prevalence of plagiarism in manuscripts submitted for publication to the CMJ, an international, peer-reviewed, general medical journal with an impact factor of 1.46 in 2011, we carried out a study using eTBLAST and CrossCheck. We investigated the characteristics of plagiarized manuscripts according to the type of plagiarism, the manuscript section where the plagiarism occurred, the type of manuscript, and the first author’s workplace. We also compared the two software systems’ ability to detect plagiarism in submitted manuscripts.

Materials and Methods

Materials

Manuscripts

We assessed all original research manuscripts, review manuscripts, and case reports submitted for publication to the Croatian Medical Journal (CMJ, www.cmj.hr) between January 1, 2009 and December 31, 2010. Manuscripts were downloaded from the CMJ editors' portal with authorized access (http://dora.zesoi.fer.hr/cmj/person_auth.php) granted by the Editor-in-Chief (Ana Marusic). The first submission of each manuscript was included in the analysis.

Of 754 submitted manuscripts, 247 (33%) were sent for external peer review and 502 (67%) were rejected by the editors after initial assessment of their quality and suitability for the journal (Table 1; data were collected up to July 1, 2011).
Table 1

Manuscripts submitted to the Croatian Medical Journal and action taken

                            Peer review, n (%)
Manuscript outcome          Yes         No          Total
Accepted for publication    129 (17)    –           129 (17)
Revision                    7 (1)       –           7 (1)
Rejected                    96 (13)     502 (67)    598 (80)
Withdrawn                   15 (2)      5 (0)       20 (2)
Total                       247 (33)    507 (67)    754 (100)

Manuscripts were submitted from 64 countries: 177 (23%) from Croatia, 107 (14%) from Turkey, 75 (10%) from China, and 395 (53%) from other countries. Of 754 submitted manuscripts, 129 (17%) were accepted for publication: 57 (44%) from Croatia and 72 (56%) from other countries (relative frequencies were calculated only for countries with 10% or more submitted manuscripts).

Software

Abstracts and full texts of all submitted manuscripts were compared against available, previously published scholarly literature using plagiarism detection software: eTBLAST (eTBLAST 2.03.0, Virginia Bioinformatics Institute, Virginia, USA), CrossCheck (iParadigms, LLC, Oakland, CA, USA), and WCopyfind (WCopyfind v. 2.6, Charlottesville, Virginia, USA).

The Déjà vu database (Déjà vu, Virginia Bioinformatics Institute, Virginia, USA) was used to search for potentially similar publications using authors’ names.

eTBLAST (http://etest.vbi.vt.edu/etblast3/) is free software that searches databases for similar text (Garner 2011; Sun et al. 2010). We searched the Medline database using eTBLAST and for each manuscript obtained a list of published abstracts ranked by the similarity score (Errami et al. 2007).

CrossCheck (http://www.crossref.org/crosscheck.html) is available to members of CrossRef, a non-profit collaborative organization of publishers and journals (Butler 2010; Garner 2011; Wager 2011). The software uses the iThenticate algorithm to check the similarity between a given text and all papers contained in the CrossCheck database and material freely available on the Internet (CrossCheck, http://www.crossref.org/crosscheck.html) and produces a similarity report listing possible sources and text similarity rates (in percentages). In our study, we used the default viewing mode, i.e. the similarity report.

WCopyfind (http://plagiarism.bloomfieldmedia.com/z-wordpress/) is off-line software that compares two or more documents and produces a report on similarity rate. Text similarity rate is calculated as the ratio of matching words to the total number of words in the manuscript and expressed as a percentage.

For the purpose of our study, the shortest phrase to match in WCopyfind was set to six words. All parameters were used as recommended by the author of the software, Lou Bloomfield (http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind). Only the body text of each manuscript was checked; titles, authors' affiliations, acknowledgments, and references were excluded (Bilić-Zulle et al. 2005).
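The matching rule used here (phrases of at least six consecutive identical words, with the similarity rate defined as the fraction of matched words) can be sketched as follows. This is an illustrative Python reimplementation, not WCopyfind's actual code; the function and variable names are our own.

```python
import re

def matching_word_fraction(text_a, text_b, phrase_len=6):
    """Illustrative text similarity rate: the percentage of words in
    text_a that fall inside runs of >= phrase_len consecutive words
    also present in text_b. A simplified sketch, not WCopyfind itself."""
    words_a = re.findall(r"[a-z']+", text_a.lower())
    words_b = re.findall(r"[a-z']+", text_b.lower())
    # Index every phrase of length phrase_len occurring in text_b.
    phrases_b = {tuple(words_b[i:i + phrase_len])
                 for i in range(len(words_b) - phrase_len + 1)}
    matched = [False] * len(words_a)
    for i in range(len(words_a) - phrase_len + 1):
        if tuple(words_a[i:i + phrase_len]) in phrases_b:
            for j in range(i, i + phrase_len):
                matched[j] = True
    # Ratio of matching words to total words, expressed as a percentage.
    return 100.0 * sum(matched) / max(len(words_a), 1)
```

A manuscript sharing one nine-word run with a published text would, under this sketch, have nine of its words marked as matched, and the resulting percentage would then be compared against the 10% threshold described below.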

Déjà vu is a database (not a type of text similarity detection software) containing approximately 80,000 potentially similar publications that were previously identified with eTBLAST (Errami et al. 2009). We searched the database for authors’ names (http://dejavu.vbi.vt.edu/dejavu/duplicate/search/).

Methods

Criteria for Plagiarism

The first criterion (A) for detecting plagiarism in abstracts was the identification of six or more words in a row (counted manually) that were identical to those found in already published abstracts (Bloomfield 2004; Sorokina et al. 2006).

The second criterion (B) was more than 10% text similarity rate, as suggested by the British Medical Journal (British Medical Journal, http://resources.bmj.com/bmj/authors/article-submission/) and Segal et al. (2010). Abstracts and manuscripts with text similarity rates above that value were suspected of being plagiarized. Each full-text manuscript with more than 10% text similarity rate was analyzed by sections; a section with more than 10% text similarity rate was considered plagiarized.

Manuscripts suspected of being plagiarized according to criterion B in the full text were manually verified. The following were not considered plagiarism: follow-up investigations; research grant applications by the same author(s); invited letters to the editor in which authors restate, for educational purposes, opinions similar to their previously published text; and manuscripts whose only overlap was with previously published congress abstracts of up to 400 words by the same author(s). These criteria are based on the guidelines of the Committee on Publication Ethics (COPE 2006) and the CMJ “Guidelines for Authors: Manuscript Preparation and Submission” (Croatian Medical Journal, http://www.cmj.hr/2011/52/1/CMJ_52(1)_GUIDELINES.pdf).
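Criterion B amounts to a simple thresholding rule, which can be summarized as a small decision function. The threshold (more than 10% text similarity) comes from the text above; the function name, input format, and return values are assumptions for illustration only.

```python
def apply_criterion_b(overall_rate, section_rates, threshold=10.0):
    """Sketch of criterion B: a manuscript with a text similarity rate
    above the threshold is suspected of plagiarism, and each of its
    sections above the same threshold is considered plagiarized.

    overall_rate  -- full-text similarity rate in percent
    section_rates -- dict mapping section name to its similarity rate
    """
    suspected = overall_rate > threshold
    flagged_sections = (
        [name for name, rate in section_rates.items() if rate > threshold]
        if suspected else []
    )
    return suspected, flagged_sections
```

For example, a manuscript with 25% overall similarity whose Abstract shows 5% and Materials and Methods 61% would be flagged as suspected, with only the Materials and Methods section marked as plagiarized.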

Procedure

Submitted manuscripts were first analyzed with plagiarism detection software and then manually verified by the investigators.

Abstracts of all submitted manuscripts (Fig. 1, left arm) were compared against previously published abstracts in the Medline database using eTBLAST. The top five abstracts listed in the eTBLAST report were presumed to be the originals. If the submitted abstract had a string of six or more words (criterion A, counted manually by KB) identical to a word string in the top five abstracts, it was suspected of being plagiarized and further examined with WCopyfind (Bloomfield 2004; Sorokina et al. 2006).
Fig. 1

The study procedure. Criterion A: six or more similar words in a row; Criterion B: more than 10% text similarity rate

Then, full texts of all submitted manuscripts (Fig. 1, right arm) were analyzed using CrossCheck. Manuscripts with more than 10% text similarity rate according to the similarity report were suspected of being plagiarized and further examined with WCopyfind (Segal et al. 2010).

For each manuscript suspected of being plagiarized, we recorded whether the text similarity was found with eTBLAST, CrossCheck or both in order to compare their search results.

If an abstract or full-text manuscript suspected of being plagiarized had more than 10% text similarity with a previously published abstract or full-text paper (criterion B), the full text of that manuscript was compared with the full text of the paper(s) found with eTBLAST and CrossCheck. Comparison of abstracts and full-text manuscripts was done with WCopyfind. If the software found more than one published paper, the presumed originals were compared with one another using WCopyfind in order to exclude overlapping text similarities so that the text similarity rates could be summed.

Only full-text manuscripts with more than 10% text similarity rate to another source were suspected of being plagiarized, further analyzed by sections, checked against the Déjà vu database, and manually verified (Fig. 1, middle arm).

Analysis of manuscript sections was performed with WCopyfind (Fig. 1, middle arm). Each section of a manuscript suspected of being plagiarized (Abstract, Introduction, Materials and Methods, Results, and Discussion) was compared to the corresponding section in the presumed original. The authors of manuscripts suspected of being plagiarized were checked against the Déjà vu database. If their names were found in the author byline of highly similar publication(s) in the database, their record in the database was examined according to the type of possible plagiarism (true plagiarism/self-plagiarism) and manual verification (verified/unverified).

Manual verification consisted of reading the whole manuscript suspected of being plagiarized and comparing it with the presumed originals (KB). Manuscripts with different authors were suspected of true plagiarism, whereas manuscripts that shared authors were suspected of self-plagiarism. Manuscripts with more than 10% text similarity rate to two or more presumed originals were considered patchwork plagiarism. Manuscripts with more than 10% text similarity rate in each section were considered to be section plagiarism. Those with more than 10% text similarity rate only in the Materials and Methods section were considered technical plagiarism and excluded from further analysis (COPE 2006). Tables and figures of manuscripts with section plagiarism in the Results section were manually compared with originals to identify possible data plagiarism.

Failure to cite the original paper was considered an indication of intention to plagiarize.

On the basis of the above considerations, a final decision was made and each manuscript was marked as either plagiarized or non-plagiarized. True plagiarism was distinguished from self-plagiarism and manuscripts were grouped into three plagiarism categories according to the extent of plagiarism (text similarity rate): minor (11–24%), moderate (25–49%), and major (50% and more).

A report on manuscripts containing plagiarism was sent to the CMJ editorial board with a recommendation to the editors. If a manuscript suspected of plagiarism was in the process of peer review, the process was terminated immediately and the manuscript was either rejected or the authors were required to rewrite the plagiarized parts.

Statistical Analysis

Categorical variables are presented with absolute and relative frequencies; distributions were compared with test of proportions with power estimation. Continuous variables are expressed as median with 95% confidence interval (CI), because data did not follow the normal distribution (Kolmogorov–Smirnov test). Between-group differences were evaluated using Mann–Whitney and Kruskal–Wallis tests. Post hoc comparisons were made using the Mann–Whitney test adjusted for multiple comparisons. The degree of association between variables was measured with Spearman’s correlation coefficient.

A P value of less than 0.05 was considered significant. Data were analyzed using the MedCalc statistical software, version 11.2.0.0 (MedCalc Inc., Mariakerke, Belgium). Power estimation for the test of proportions was calculated using Russell Lenth's power and sample size webpage (http://www.stat.uiowa.edu/~rlenth/Power).
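Of the statistics above, Spearman's coefficient is the simplest to illustrate: it is the Pearson correlation computed on rank-transformed data, with tied values sharing their average rank. The sketch below is a self-contained pure-Python version (our own helper names), not the MedCalc implementation used in the study.

```python
def _ranks(values):
    """Return 1-based average ranks; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the block of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Any strictly increasing relationship yields rho = 1 and any strictly decreasing one yields rho = -1, regardless of how nonlinear the raw values are, which is why this statistic suits the non-normally distributed similarity rates analyzed here.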

Ethics

The study was approved by the Ethics Committee of the Rijeka University School of Medicine and the CMJ.

The CMJ’s policy regarding plagiarism is clearly expressed in the journal’s Guidelines for Authors: Editorial Policy. The journal is a member of CrossCheck and displays the Crosscheck logo on its web site to indicate to authors that their manuscript might be checked for plagiarism.

Results

Analysis of Manuscripts

According to the eTBLAST reports, the abstracts of 228 submitted manuscripts met criterion A for text similarity (i.e., contained a string of six or more identical words) (Fig. 1, left arm). Of these 228 abstracts, 57 also met criterion B (more than 10% text similarity rate) and were further analyzed with WCopyfind. Of these 57, WCopyfind found 33 full-text manuscripts to contain text similar to other published papers (33/228 or 14%). These manuscripts were further analyzed.

According to criterion B, 151 of the 754 submitted manuscripts were found to contain more than 10% text similarity when analyzed with CrossCheck (Fig. 1, right arm). Of these 151 manuscripts, 102 (68%) were confirmed by WCopyfind to have more than 10% text similarity and were further analyzed.

One hundred and two manuscripts suspected of being plagiarized were identified with CrossCheck and 33 with eTBLAST. Matches between submitted manuscripts and published papers were detected by both eTBLAST and CrossCheck in 30 cases. Three manuscripts that had not been detected by CrossCheck were found with eTBLAST. Overall, 105 manuscripts suspected of being plagiarized were manually verified (Fig. 1, middle arm). After manual verification, 20 manuscripts were excluded from further analysis and 85 manuscripts were confirmed to contain plagiarism. Half of the excluded manuscripts suspected of being plagiarized contained technical plagiarism (six with the same authors, four with different authors), five manuscripts were reports of follow-up studies by the same authors, two manuscripts contained previously published congress abstracts of up to 400 words by the same author, two manuscripts were applications for research grant by the same author, and one manuscript was an invited letter by the same author. In half of the excluded manuscripts the presumed original was properly cited.

Additionally, comparison of CrossCheck and WCopyfind text similarity rates (Fig. 2) showed a moderate correlation (N = 151; rs = 0.57; P < 0.001).
Fig. 2

Correlation of CrossCheck and WCopyfind text similarity rates (N = 151; rs = 0.57; P < 0.001)

Characteristics of Plagiarized Manuscripts

Among 754 submitted manuscripts, 85 (11%) were identified as being plagiarized. Of them, 63 (8%) contained true plagiarism and 22 (3%) contained self-plagiarism (Table 2). Of the 63 manuscripts that contained true plagiarism, 18 contained patchwork plagiarism from two sources and one manuscript contained plagiarism from three sources. Nine manuscripts contained major plagiarism, with eight of them containing true plagiarism (three patchwork plagiarism) and one self-plagiarism. We found no difference between the proportion of true plagiarism and self-plagiarism in three categories of manuscripts (Table 2).
Table 2

Plagiarized manuscripts according to the extent of plagiarism

Extent of plagiarism        Total      True plagiarism   Self-plagiarism   χ2*     P        Power
(text similarity rate)      n (%)      n                 n                                  estimation
Minor (11–24%)              39 (46)    31                8                 0.65    0.421    0.12
Moderate (25–49%)           37 (43)    24                13                2.13    0.144    0.31
Major (≥50%)                9 (11)     8                 1                 0.41    0.523    <0.03
Total                       85 (100)   63                22                27.58   <0.001   0.98

* df = 1 for all comparisons

Fifteen (18%) of 85 plagiarized manuscripts were sent for external peer review. Plagiarized manuscripts were mostly submitted from China (21%), Turkey (19%), and Croatia (14%), while authors from other countries accounted for the remaining 46% of plagiarized manuscripts (Table 3). We found a higher proportion of plagiarized than non-plagiarized manuscripts from China (21% vs. 8%, respectively; χ2 = 13.39; df = 1; P < 0.001) and a lower proportion of plagiarized than non-plagiarized manuscripts from Croatia (14% vs. 25%, respectively; χ2 = 4.45; df = 1; P = 0.035).
Table 3

Distribution of submitted manuscripts according to country of first author's affiliation

Country of first      TP   SP   χ2*    P       Power        Plagiarized   Non-plagiarized   χ2*     P        Power
author's affiliation                           estimation   manuscripts   manuscripts                        estimation
                                                            n (%)         n (%)
China                 15   3    0.47   0.495   0.08         18 (21)       57 (8)            13.39   <0.001   0.90
Croatia               8    4    0.05   0.823   0.07         12 (14)       165 (25)          4.45    0.035    0.57
Turkey                12   4    0.05   0.831   0.05         16 (19)       91 (14)           1.14    0.286    0.21
Other countries       28   11   0.06   0.812   0.06         39 (46)       356 (53)          1.21    0.271    0.19
Total                 63   22                               85 (100)      669 (100)

TP true plagiarism; SP self-plagiarism; * df = 1 for all comparisons

Analysis of text similarity rates in plagiarized manuscripts is presented in Table 4. No difference in text similarity rate was found between manuscripts containing true plagiarism and those containing self-plagiarism (U = 645.50; P = 0.634). However, a higher text similarity rate was found in true plagiarized manuscripts where the original source was not cited than in manuscripts in which the source was cited (U = 308.00; P = 0.011), but no such difference was found for self-plagiarized manuscripts (U = 56.50; P = 0.793).
Table 4

Text similarity rate of true plagiarism and self-plagiarism in plagiarized manuscripts according to source citation, manuscript section, type of manuscript, and country of first author's affiliation

                                True plagiarism            Self-plagiarism            Statistics
Variable                        n    TSR (%),              n    TSR (%),              U        P
                                     median (95% CI)            median (95% CI)
Total                           63   25 (22–27)            22   28 (20–33)            645.50   0.634
Original manuscript cited
  No                            29   36 (23–39)            11   28 (19–37)            127.50   0.338
  Yes                           34   23 (17–25)            11   27 (19–35)            140.50   0.223
  Statistics: U                      308.00                     56.50
              P                      0.011                      0.793
Manuscript section (a)
  Abstract                      25   28 (27–36)            8    28.5 (15–41)          94.00    0.801
  Introduction                  34   28 (21–37)            17   26 (19–40)            253.50   0.478
  Materials and Methods         33   32 (26–41)            17   61 (41–68)            197.50   0.089
  Results                       19   31 (23–46)            10   23 (17–36)            73.00    0.313
  Discussion                    45   27 (22–37)            16   25.5 (15–35)          293.50   0.268
  Statistics: H (df)                 2.75 (4)                   12.65 (4)
              P                      0.600                      0.013 (§)
Country of first author's affiliation
  China                         15   27 (16–37)            3    37 (b)
  Croatia                       8    24.5 (13–39)          4    30.5 (b)
  Turkey                        12   22.5 (17–30)          4    20.5 (b)
  Others                        28   26 (20–38)            11   27 (20–37)            (b)
  Statistics: H (df)                 1.42 (3)                   5.39 (3)
              P                      0.700                      0.144
Type of manuscript
  Original research             37   24 (18–27)            20   28.5 (21–34)          317.00   0.375
  Case report                   15   23 (16–36)            2    22 (b)
  Review                        11   26 (21–41)            0    –
  Statistics: H (df)                 2.99 (2)                   (b)
              P                      0.223

TSR text similarity rate

§ In self-plagiarized manuscripts, the text similarity rate differed significantly between the Materials and Methods and Results sections (U = 33.50; P = 0.009) and between the Materials and Methods and Discussion sections (U = 57.50; P = 0.005). Assessed with the Mann–Whitney test adjusted for multiple comparisons (P < 0.01)

a Sections are not mutually exclusive categories and cannot be summed to a total of 85. b Data were not analyzed due to small sample size

We found no differences between true plagiarism and self-plagiarism text similarity rates in the various sections of the manuscript. Sections of manuscripts containing true plagiarism also did not differ in text similarity rates (H = 2.75; df = 4; P = 0.600). However, sections of self-plagiarized manuscripts did differ (H = 12.65; df = 4; P = 0.013); as expected, the text similarity rate was highest in the Materials and Methods section, significantly higher than in the Results (U = 33.50; P = 0.009) or Discussion (U = 57.50; P = 0.005) sections.

Text similarity rate was similar in the true plagiarized (H = 1.42; P = 0.700) and self-plagiarized manuscripts (H = 5.39; P = 0.144) submitted from different countries and for different types of plagiarized manuscripts (H = 2.99; P = 0.223).

There was only a weak negative correlation between the number of authors and the text similarity rate (rs = −0.22; P = 0.040).

Analysis of section plagiarism by country (using the first author’s affiliation) showed that the most frequently plagiarized section was the Discussion for true plagiarism and the Introduction and Materials and Methods for self-plagiarism (Table 5). Data in Table 5 did not meet the criteria for a Chi-square test.
Table 5

Frequency of section plagiarism according to the country of first author's affiliation

                        China (n = 18)   Croatia (n = 12)   Turkey (n = 16)   Others (n = 39)   Total
Manuscript section      TP    SP         TP    SP           TP    SP          TP    SP          TP    SP
Abstract                10    1          5     2            1     1           9     4           25    8
Introduction            7     3          3     2            9     3           15    9           34    17
Materials and Methods   12    3          2     4            9     2           10    8           33    17
Results                 7     2          1     3            4     2           4     6           19    10
Discussion              13    3          6     3            10    2           16    8           45    16

TP true plagiarism; SP self-plagiarism

The names of three authors in our sample were found in the Déjà vu database, in the unverified category. Two papers were suspected of true plagiarism and one of self-plagiarism.

Manual comparisons revealed four cases of data similarities along with text similarities in the Results section: two authors submitted their own previously published demographic data, one author submitted his previously published case report, and one author submitted his data already published in two previous papers. Similarities in figures were not found. Authors did not cite previously published papers.

Reports on 10 manuscripts were sent to the CMJ editorial board. Seven of these manuscripts were rejected. The remaining three, follow-up studies still in the peer-review process, were accepted for publication after the editors wrote to the authors to resolve the discovered similarities (the authors' institutions were not contacted) and the manuscripts were revised.

Discussion

Our study showed that more manuscripts contained true plagiarism (8%) than self-plagiarism (3%), a finding contrary to two recent studies. Zhang (2010) used CrossCheck and found that 6% of submitted manuscripts contained true plagiarism and 23% contained self-plagiarism. Using eTBLAST, Sun et al. (2010) concluded that a manuscript was more likely to be self-plagiarized than plagiarized (458/276; OR 1.66). The reason our results did not confirm these findings may be methodological differences. In Zhang's study, the CrossCheck threshold was 40% overall similarity, while our threshold was 10% similarity to a single source. Also, full-text similarity in Zhang's study was determined using CrossCheck only, whereas we used eTBLAST and WCopyfind as additional tools. The threshold used by Sun et al. (2010) also differed from ours: they used a 0.5 similarity ratio (eTBLAST), while we used six or more identical words in a row and then more than 10% text similarity rate in abstracts.

Nearly three-quarters of the plagiarized manuscripts submitted for publication to the CMJ were rejected by the editors immediately after submission, mainly because the manuscripts were not suitable for the journal and their quality was lower than the journal expected. However, these authors could submit their plagiarized manuscripts to other science journals and perhaps have them published. Submitted manuscripts mostly originated from Croatia, Turkey, and China, and most plagiarized manuscripts came from the same countries. On the one hand, authors from China were over-represented among plagiarized (21%) compared with non-plagiarized (8%) manuscripts. In Eastern cultures, such as the Chinese and Turkish, copying from others is not perceived negatively (Mason 2009; Segal et al. 2010), and the attitude to plagiarism in Croatia is not clear (Mavrinac et al. 2010; Pupovac et al. 2010). On the other hand, authors from Croatia were under-represented (14% plagiarized vs. 25% non-plagiarized). Authors from Croatia, Turkey, and China may also find it difficult to write in English, which could be why some of them took “shortcuts” and plagiarized (Bilic-Zulle 2010; Mason 2009; Roig 2010). The temptation to copy sentences or even paragraphs from well-written papers published in high-impact journals is understandable, but it is not good scientific practice (Bilic-Zulle 2010; Mason 2009; Roig 2010; Wager 2011). To avoid plagiarism, authors should write the text themselves and then ask a language professional for help (Kerans and Jager 2010).

We did not expect such high text similarity rates in either true plagiarized (25%) or self-plagiarized (28%) manuscripts. In a study by Bilić-Zulle et al. (2005) on the prevalence of plagiarism among medical students, the average text similarity rate was 19%. We expected that manuscripts written by authors who are older and more experienced writers than students would have a lower text similarity rate. Unfortunately, we cannot compare our results with other studies of plagiarism prevalence because they did not report text similarity rates. Based on previous studies, we also expected higher text similarity rates in self-plagiarized manuscripts than in those containing true plagiarism (Sun et al. 2010; Zhang 2010), but no differences were found.

The intention to deceive is usually very difficult to prove (Wager 2011). In our study, we tried to connect the intention to plagiarize, indirectly, with the failure to cite the previously published paper from which text had been derived. The original paper was cited in only half of the true plagiarized and self-plagiarized manuscripts. We believe that in true plagiarized manuscripts where the original was cited, authors copied “perfect” sentences without any real intention to deceive, because of poor academic writing skills combined with limited proficiency in English (Bilic-Zulle 2010; Roig 2009). Failure to cite the original source could be interpreted as intention to deceive and, if found, could be considered proof of plagiarism, but it could also result from ignorance, academic laziness, or lack of time (Fisher and Zigmond 2011; Kerans and Jager 2010; Roig 2009). The original source was not cited in 29 (4%) manuscripts containing true plagiarism out of the 754 submitted manuscripts; these were the manuscripts with the highest text similarity rate (36%), and this prevalence (4%) could be considered the true prevalence of plagiarism in manuscripts submitted to the CMJ.

Text similarity in plagiarized sections was mostly moderate (23–31%), except for the Materials and Methods section of self-plagiarized manuscripts (61%). This result may reflect the belief that reusing one’s own words is not plagiarism (Wager 2011). In addition, technical descriptions are difficult to rewrite once they have been written (Wager 2011; Kerans and Jager 2010). If an author uses his or her own or others’ text to describe a widely used technique or instrument, such as a validated questionnaire, the author is not considered to have committed a serious offence, and editors regard text similarity in the Materials and Methods section differently from text similarity in other parts of a manuscript (Wager 2011). However, if a manuscript contains plagiarism in other sections, or even in all sections, it may raise serious doubts about the reliability of the study, because plagiarism of text suggests plagiarism of data (Mason 2009), which is considered major plagiarism and warrants immediate rejection (Wager 2011).

The Discussion was the most frequently plagiarized section in manuscripts containing true plagiarism, because it is the section that most requires writing and language skills in order to explain the unique results of the study (Wager 2011). Although we think that the main reason for plagiarizing this section is limited proficiency in English, it could also be fear of failing to publish the paper. However, because the Discussion section should provide original commentary on the results, hypotheses, and comparisons with previously published work, we did not exempt this section from the plagiarism analysis.

A small number (33 of 85) of plagiarized manuscripts were detected with eTBLAST. Although Sun et al. (2010) claimed that abstract similarity is a good predictor of full-text similarity, we did not observe this in our study. A search of the Déjà vu database did not produce the expected results, possibly because of the small number of manually verified manuscripts in that database (only 6% by July 2011).

CrossCheck is powerful software, but its database is not all-encompassing. Not all publishers and journals are CrossRef members, so their content is not available in the CrossCheck database. Therefore, additional use of eTBLAST or similar software, rather than CrossCheck alone, is recommended. In our study, we found 49 manuscripts with a false positive overall CrossCheck text similarity rate of 40–80%: further examination showed that the similarity to any single source was below 10%. One reason for these false positives lies in the source texts in the CrossCheck database, which include affiliations, titles, and references; we compared submitted manuscripts with presumed original sources using WCopyfind only after this additional content had been removed from the manuscripts. Another reason could be that the sources found with CrossCheck shared text among themselves; we compared them with one another using WCopyfind in order to exclude such overlaps and allow text similarity rates to be summed. Finally, the overall text similarity rate in CrossCheck is the sum of the similarities to all possible sources from which text was misappropriated, even when the similarity to each particular source was below the threshold. Therefore, instead of automatic rejection of manuscripts above a certain threshold (e.g., more than 40% overall text similarity in CrossCheck, as done by Zhang (2010)), manual verification of manuscripts is recommended.
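The arithmetic behind these false positives can be shown with a toy calculation (the per-source figures below are hypothetical, not data from this study): an overall similarity summed across many sources can cross a rejection threshold even though no single source exceeds the 10% per-source criterion.

```python
# Hypothetical illustration: summed similarity across many sources can
# exceed an overall threshold while every individual source stays below
# the 10% per-source threshold used in this study.

PER_SOURCE_THRESHOLD = 0.10  # 10% similarity to any one source
OVERALL_THRESHOLD = 0.40     # e.g. the 40% cut-off used by Zhang (2010)

# Assumed non-overlapping matched text per candidate source, so the
# fractions can be summed (illustrative values only).
per_source_similarity = {
    "source_A": 0.08, "source_B": 0.07, "source_C": 0.09,
    "source_D": 0.06, "source_E": 0.09, "source_F": 0.05,
}

overall = sum(per_source_similarity.values())
flagged_overall = overall >= OVERALL_THRESHOLD
flagged_any_source = any(v >= PER_SOURCE_THRESHOLD
                         for v in per_source_similarity.values())

print(f"overall similarity: {overall:.0%}")                  # → 44%
print(f"rejected by overall threshold: {flagged_overall}")   # → True
print(f"any single source above 10%: {flagged_any_source}")  # → False
```

Such a manuscript would be rejected outright under a purely automatic threshold, which is why manual inspection of the per-source matches matters.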

Manual verification is the key element in plagiarism detection (Garner 2011; Segal et al. 2010; Sun et al. 2010). Plagiarism detection software searches for similar text, but not necessarily for plagiarized content (Segal et al. 2010). Fully automatic handling of plagiarism flagged by software should be avoided, especially if the rejection of a manuscript depends exclusively on the similarity report. Editors are responsible for accusations of plagiarism and, therefore, should not draw conclusions about the extent of plagiarism in a manuscript solely on the basis of automated plagiarism detection results (Kerans and Jager 2010). Each case of suspected plagiarism is unique and should be handled independently (Roig 2009). Only after analyzing both the manuscript suspected of plagiarism and the previously published paper should editors make a final decision on the extent of plagiarism. If the manuscript is judged not to be plagiarized, editors should give the authors a chance to rewrite the overlapping text so that it can be published (Garner 2011; Kerans and Jager 2010; Segal et al. 2010).

In our study, we decided to examine manually each manuscript suspected of plagiarism according to the chosen criteria. Although we used the CMJ’s guidelines for authors (Croatian Medical Journal) and COPE flowcharts (COPE), they were sometimes not enough to help us decide whether a manuscript was plagiarized, and we had to develop our own model. We defined criteria for considering a manuscript plagiarized similar to those proposed in the recent COPE discussion paper (2011). In that sense, we agree with Elizabeth Wager (2011) that editors “need to decide how to interpret and to respond to findings of text similarity”. Editors need to do a better job of educating authors in order to prevent plagiarized content from being published (Marusic 2010). Using plagiarism detection software, together with guidelines and detailed instructions to authors, including a reminder that all manuscripts will be checked for plagiarism (Kerans and Jager 2010), may help editors avoid publishing plagiarized content, improve the quality of their journals, and become “active gatekeepers of research integrity” (Marusic 2010).

The limitations of our study include the well-known restrictions of plagiarism detection software. First, no currently available software is exhaustive or able to search all possible sources. A large amount of scientific material is still not included in these software databases, and authors could have taken text from sources or hard copies that are not on the Internet. If so, the prevalence of plagiarism in manuscripts submitted to the CMJ was underestimated. Plagiarism of ideas, figures, and data, and plagiarism by translation, would also have escaped detection in our study. Another limitation could be the threshold of 10% text similarity used in our study: there is still no “gold standard” for the degree of text misappropriation that constitutes plagiarism or for the degree of acceptable overlap. Finally, we did not include data, such as age, sex, years of experience, and number of published papers, that could have been compared and associated with true plagiarism and self-plagiarism.

The results of our study are particularly disturbing because the CMJ has made great efforts to promote scientific integrity and discourage plagiarism. The CMJ was the first medical journal to have a research integrity editor (Petrovečki and Scheetz 2001). It has always promoted high standards of research integrity through its instructions to authors and papers on the subject. Scientific misconduct, and plagiarism in particular, are topics that attract much interest, but Wager et al. (2009) revealed that science journal editors were generally not concerned about them. Of 231 science journal editors (response rate 44%), 19% reported that plagiarism never occurred in their journal. What is worrisome about Wager et al.’s results is that the majority of surveyed editors were not familiar with widely known guidelines, such as those from COPE, that could help them deal with ethical issues. Consequently, one has to wonder how much content published in medical journals such as the CMJ is plagiarized, especially if their editors have not been active in promoting research integrity.

Footnotes

1. The CrossCheck database contains over 25.5 million papers from over 48,000 journals and books from 83 publishers, but most of this content is accessible only through subscription (Butler 2010; Garner 2011).

2. These counts were not double-checked by another investigator.

Acknowledgments

The authors wish to thank Ana Marušić and Ivan Damjanov (editors-in-chief during 2009–2011), Matko Marušić (founder and editor emeritus), Vedran Katavić, Dario Sambunjak, and Vesna Kušec (editorial board members during 2009–2011) of the Croatian Medical Journal, as well as Aleksandra Mišak (translator) and Elizabeth Wager (COPE).

Funding

The study is part of the scientific project “Prevalence and attitude towards plagiarism” (No. 062-0000000-3552) supported by the Croatian Ministry of Science, Education, and Sports and project “Prevalence and attitudes towards plagiarism in biomedical publishing” supported by the Committee on Publication Ethics (COPE).

Conflict of interest

The authors declare that they have no conflict of interest.

Copyright information

© Springer Science+Business Media B.V. 2011