Semantic Duplicate Identification with Parsing and Machine Learning

Hartrumpf, Sven; vor der Brück, Tim; Eichhorn, Christian

doi:10.1007/978-3-642-15760-8_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6231)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Identifying duplicate texts is important in many areas, such as plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and treat each text merely as a token list. In this work, however, we describe a deep, semantically oriented method based on semantic networks that are derived by a syntactico-semantic parser. For each sentence of a given base text, semantically identical or similar semantic networks are retrieved efficiently using a specialized index. In order to detect many kinds of paraphrases, the semantic networks of a candidate text are varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Important phenomena occurring in difficult duplicates are discussed. The deep approach profits from background knowledge, whose acquisition from corpora is explained briefly. The deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee high recall for texts that are not fully parsable. The evaluation shows that the combined approach preserves recall and increases precision considerably in comparison to traditional shallow methods.
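
The idea of backing a deep recognizer with a shallow one can be illustrated with a minimal sketch. It is not the authors' implementation: the word n-gram Jaccard measure stands in for a shallow recognizer, the `deep` argument is a hypothetical score that a semantic-network comparison might produce (or `None` when a sentence could not be parsed), and the fixed `weight` is an assumed simplification, whereas the paper's title indicates that machine learning is used for the combination. All function names are illustrative.

```python
from typing import Optional


def shingles(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts yield a single shingle holding all of their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shallow_score(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams (assumed stand-in for a shallow recognizer)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def combined_score(a: str, b: str, deep: Optional[float],
                   weight: float = 0.7) -> float:
    """Blend a hypothetical deep score with the shallow score.

    `deep` is None when no semantic network could be built for the text;
    in that case only the shallow score is used, so such texts are still covered.
    """
    shallow = shallow_score(a, b)
    if deep is None:
        return shallow
    return weight * deep + (1.0 - weight) * shallow
```

The fallback branch reflects the motivation stated in the abstract: texts that are not fully parsable still receive a score from the shallow side, so recall does not depend on parser coverage.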

Author information

Authors and Affiliations

Intelligent Information and Communication Systems (IICS), FernUniversität in Hagen, 58084, Hagen, Germany
Sven Hartrumpf & Tim vor der Brück
Lehrstuhl Informatik 1, Technische Universität Dortmund, 44227, Dortmund, Germany
Christian Eichhorn

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Botanická 68a, CZ-602 00, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Karel Pala

Cite this paper

Hartrumpf, S., vor der Brück, T., Eichhorn, C. (2010). Semantic Duplicate Identification with Parsing and Machine Learning. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2010. Lecture Notes in Computer Science, vol 6231. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15760-8_12

DOI: https://doi.org/10.1007/978-3-642-15760-8_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15759-2
Online ISBN: 978-3-642-15760-8
eBook Packages: Computer Science, Computer Science (R0)
