Semantic Duplicate Identification with Parsing and Machine Learning

  • Sven Hartrumpf
  • Tim vor der Brück
  • Christian Eichhorn
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6231)


Identifying duplicate texts is important in many areas, such as plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and treat each text as a mere token list. In this work, however, we describe a deep, semantically oriented method based on semantic networks, which are derived by a syntactico-semantic parser. Semantically identical or similar semantic networks for each sentence of a given base text are retrieved efficiently by using a specialized index. In order to detect many kinds of paraphrases, the semantic networks of a candidate text are varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Important phenomena occurring in difficult duplicates are discussed. The deep approach profits from background knowledge, whose acquisition from corpora is explained briefly. The deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee high recall for texts that are not fully parsable. The evaluation shows that the combined approach preserves recall and increases precision considerably compared to traditional shallow methods.
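The indexed-retrieval step described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the paper derives full semantic networks with a syntactico-semantic parser, whereas here a sentence's "network" is reduced to a set of relation triples, and a small synonym table stands in for the lexico-semantic inferences. All names (`SentenceIndex`, `normalize`, `SYNONYMS`, the example triples) are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for lexico-semantic relations: map words to a
# canonical form so that paraphrases normalize to the same triples.
SYNONYMS = {"buy": "purchase", "car": "automobile"}

def normalize(triples):
    """Apply synonym substitutions to a set of (arg1, relation, arg2) triples."""
    return frozenset(
        (SYNONYMS.get(a, a), rel, SYNONYMS.get(b, b)) for a, rel, b in triples
    )

class SentenceIndex:
    """Index sentence networks by their triples for efficient retrieval."""

    def __init__(self):
        self.postings = defaultdict(set)  # triple -> ids of sentences containing it
        self.networks = {}                # sentence id -> normalized network

    def add(self, sent_id, triples):
        net = normalize(triples)
        self.networks[sent_id] = net
        for t in net:
            self.postings[t].add(sent_id)

    def query(self, triples, min_overlap=0.5):
        """Return ids of indexed sentences whose networks overlap sufficiently."""
        net = normalize(triples)
        counts = defaultdict(int)
        for t in net:
            for sid in self.postings.get(t, ()):
                counts[sid] += 1
        return [sid for sid, c in counts.items()
                if c / max(len(net), len(self.networks[sid])) >= min_overlap]

# Example: a paraphrase of an indexed sentence is retrieved.
idx = SentenceIndex()
idx.add("s1", {("john", "AGT", "buy"), ("buy", "OBJ", "car")})
print(idx.query({("john", "AGT", "purchase"), ("purchase", "OBJ", "automobile")}))  # → ['s1']
```

The inverted-index structure means only sentences sharing at least one normalized triple with the query are ever scored, which is what makes retrieval over a large base text efficient.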


Keywords: Semantic Network, Deep Approach, Question Answering, Text Summarization, Plagiarism Detection




Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Sven Hartrumpf (1)
  • Tim vor der Brück (1)
  • Christian Eichhorn (2)
  1. Intelligent Information and Communication Systems (IICS), FernUniversität in Hagen, Hagen, Germany
  2. Lehrstuhl Informatik 1, Technische Universität Dortmund, Dortmund, Germany
