A hybrid system for German encyclopedia alignment

  • Roman Kern
  • Christin Seifert
  • Michael Granitzer


Collaboratively created online encyclopedias have become increasingly popular. In terms of completeness in particular, they have begun to surpass their printed counterparts. Two German publishers of traditional encyclopedias have reacted to this challenge and started an initiative to merge their corpora into a single, more complete encyclopedia. The crucial step in this merging process is the alignment of articles. We have developed a two-step hybrid system that provides highly accurate alignments with low manual effort. First, we apply an automatic alignment algorithm based on information retrieval. Second, the articles with a low confidence score are revised using a manual alignment scheme carefully designed for quality assurance. Our evaluation shows that a combination of weighting and ranking techniques that exploit different facets of the encyclopedia articles effectively reduces the number of manual alignments required. Further, the setup of the manual alignment turned out to be robust against inter-indexer inconsistencies. As a result, the developed system enabled us to align four encyclopedias with high accuracy and low effort.
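The two-step routing described above can be illustrated with a minimal sketch. It is not the authors' algorithm: plain Jaccard token overlap stands in for the information retrieval based scoring, and the names `align`, `Alignment`, and the threshold value of 0.5 are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Alignment:
    source_id: str
    target_id: Optional[str]  # None if no target article shares any token
    confidence: float


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for a real IR-based similarity score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align(
    source: Dict[str, str],
    target: Dict[str, str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Alignment], List[Alignment]]:
    """Step 1: score every source article against all target articles.
    Step 2: route low-confidence candidates to a manual review queue."""
    accepted, manual = [], []
    for sid, stext in source.items():
        best_id, best_score = None, 0.0
        for tid, ttext in target.items():
            score = jaccard(stext, ttext)
            if score > best_score:
                best_id, best_score = tid, score
        candidate = Alignment(sid, best_id, best_score)
        if best_score >= threshold:
            accepted.append(candidate)
        else:
            manual.append(candidate)
    return accepted, manual
```

In this sketch, raising the threshold sends more candidates to manual review, while lowering it accepts more automatic alignments at the risk of more errors.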


Keywords: Encyclopedia alignment, Semantic similarity, Hybrid alignment system





Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Roman Kern (1)
  • Christin Seifert (1)
  • Michael Granitzer (1, 2)

  1. Graz University of Technology, Knowledge Management Institute, Graz, Austria
  2. Know-Center GmbH and Graz University of Technology, Knowledge Management Institute, Graz, Austria
