A Method for Fine-Grained Document Alignment Using Structural Information

Tsujio, Naoki; Shimizu, Toshiyuki; Yoshikawa, Masatoshi

doi:10.1007/978-3-319-11116-2_18


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8709)

Included in the following conference series: Asia-Pacific Web Conference


Abstract

It is useful to understand how the parts of related documents correspond to each other, for example between a conference paper and its revised version published as a journal paper, or between different versions of a document. However, corresponding parts that have been heavily modified are hard to associate using content similarity alone. We propose a method of aligning documents that considers not only content information but also structural information in the documents. Our method consists of three steps: baseline alignment considering document order, merging, and swapping. In our evaluation experiments, we used papers that had been presented at a domestic conference and an international conference, and obtained their alignments using several methods. The results showed that using document structure is effective.
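The abstract does not give the details of the three steps, so the following is only a minimal sketch of what an order-preserving baseline alignment could look like, not the authors' algorithm. It assumes each document is a flat list of section texts, uses token-set Jaccard similarity as a stand-in for the content similarity, and applies a dynamic program that keeps matched pairs in document order. The names `jaccard`, `align_in_order`, and the `threshold` cutoff are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in for content similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_in_order(src: List[str], dst: List[str],
                   threshold: float = 0.2) -> List[Tuple[int, int]]:
    """One-to-one alignment that preserves document order.

    Maximizes the summed similarity of matched pairs; pairs scoring below
    `threshold` (a hypothetical cutoff) are never matched.
    """
    n, m = len(src), len(dst)
    sim = [[jaccard(s, d) for d in dst] for s in src]

    # best[i][j] = best total score when aligning src[:i] with dst[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            skip = max(best[i - 1][j], best[i][j - 1])
            s = sim[i - 1][j - 1]
            match = best[i - 1][j - 1] + s if s >= threshold else -1.0
            best[i][j] = max(skip, match)

    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= threshold and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    conference = [
        "introduction to document alignment",
        "proposed method uses structure",
        "experiments and results",
    ]
    journal = [
        "introduction to document alignment problem",
        "related work on similarity",
        "proposed method uses document structure",
        "experiments and results discussion",
    ]
    print(align_in_order(conference, journal))
```

An order-preserving pass of this kind cannot keep both pairs when two matching sections appear in opposite order in the two documents, and it matches sections only one-to-one. Those are the sorts of cases that separate merging and swapping steps would need to handle; how the paper actually handles them is not stated in the abstract.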



Author information

Authors and Affiliations

Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Naoki Tsujio, Toshiyuki Shimizu & Masatoshi Yoshikawa


Editor information

Editors and Affiliations

Beijing Institute of Spacecraft System Engineering, Beijing, China
Lei Chen
School of Computer Science, National University of Defense Technology, 410073, Changsha, Hunan, China
Yan Jia
RMIT University, Melbourne, Australia
Timos Sellis
School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
Guanfeng Liu


About this paper

Cite this paper

Tsujio, N., Shimizu, T., Yoshikawa, M. (2014). A Method for Fine-Grained Document Alignment Using Structural Information. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_18


DOI: https://doi.org/10.1007/978-3-319-11116-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11115-5
Online ISBN: 978-3-319-11116-2
eBook Packages: Computer Science, Computer Science (R0)
