A hybrid system for German encyclopedia alignment

International Journal on Digital Libraries

Abstract

Collaboratively created online encyclopedias have become increasingly popular. In terms of completeness in particular, they have begun to surpass their printed counterparts. Two German publishers of traditional encyclopedias have reacted to this challenge and started an initiative to merge their corpora into a single, more complete encyclopedia. The crucial step in this merging process is the alignment of articles. We have developed a two-step hybrid system that provides highly accurate alignments with low manual effort. First, we apply an automatic alignment algorithm based on information retrieval. Second, the articles with a low confidence score are revised using a manual alignment scheme carefully designed for quality assurance. Our evaluation shows that a combination of weighting and ranking techniques that utilize different facets of the encyclopedia articles effectively reduces the number of necessary manual alignments. Further, the setup of the manual alignment turned out to be robust against inter-indexer inconsistencies. As a result, the developed system enabled us to align four encyclopedias with high accuracy and low effort.
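The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the weighting or ranking scheme, so plain TF-IDF with cosine similarity is used here as a stand-in, and the function names, the whitespace tokenizer, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumed stand-in for real German text analysis)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency so that no weight is zero or negative.
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def align(source, target, threshold=0.3):
    """Step 1: choose the best-scoring target article for each source article.
    Step 2: route every source article whose best score is below `threshold`
    (a hypothetical confidence cut-off) to a manual review queue.

    Returns (accepted, manual): accepted maps source id -> (target id, score),
    manual is the list of source ids that need human alignment.
    """
    src_ids, tgt_ids = list(source), list(target)
    vecs = tfidf_vectors(
        [tokenize(source[i]) for i in src_ids] + [tokenize(target[i]) for i in tgt_ids]
    )
    src_vecs, tgt_vecs = vecs[: len(src_ids)], vecs[len(src_ids):]

    accepted, manual = {}, []
    for sid, sv in zip(src_ids, src_vecs):
        scored = [(cosine(sv, tv), tid) for tid, tv in zip(tgt_ids, tgt_vecs)]
        if not scored:
            manual.append(sid)  # no candidate at all, so a human has to decide
            continue
        best_score, best_id = max(scored)
        if best_score >= threshold:
            accepted[sid] = (best_id, best_score)
        else:
            manual.append(sid)
    return accepted, manual


# Toy articles, invented for this example only.
source = {
    "a1": "Wien Hauptstadt von Österreich an der Donau",
    "a2": "Apfel Frucht des Apfelbaums",
}
target = {
    "b1": "Wien ist die Hauptstadt von Österreich",
    "b2": "Quantenmechanik physikalische Theorie",
}
accepted, manual = align(source, target)
print(accepted)
print(manual)
```

In this toy run the first article shares several terms with one candidate and is aligned automatically, while the second shares no terms with any candidate and falls into the manual queue. Raising the threshold trades more manual work for fewer automatic errors, which is the balance a hybrid setup of this kind has to tune.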

Author information

Corresponding author

Correspondence to Roman Kern.

Additional information

This article is a substantially revised and extended version of an article titled “German Encyclopedia Alignment Based on Information Retrieval Techniques” that originally appeared in the Proceedings of the 14th European Conference on Digital Libraries (ECDL 2010).

About this article

Cite this article

Kern, R., Seifert, C. & Granitzer, M. A hybrid system for German encyclopedia alignment. Int J Digit Libr 11, 75–89 (2010). https://doi.org/10.1007/s00799-011-0069-5

  • DOI: https://doi.org/10.1007/s00799-011-0069-5
