Gazetteer Compression Technique Based on Substructure Recognition

  • Jan Daciuk
  • Jakub Piskorski
Part of the Advances in Soft Computing book series (AINSC, volume 35)

Abstract

Finite-state automata are the state-of-the-art representation of dictionaries in natural language processing. We present a novel compression technique that is especially useful for gazetteers, a particular kind of dictionary. We replace common substructures in the automaton with unique copies. To find them, we treat the transition vector as a string and apply a Ziv-Lempel-style text compression technique that uses a suffix tree to find repetitions in linear time. Empirical evaluation on real-world data reveals space savings of up to 18.6%, which makes this method highly attractive.
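
The idea of treating a transition vector as a string and replacing repeated runs with references can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `(label, target)` encoding of a transition, the `("lit", ...)` and `("ref", ...)` token format, and the function names are all assumptions made for illustration, and the longest-match search is a naive quadratic scan rather than the linear-time suffix-tree search described in the abstract.

```python
from typing import List, Tuple

# Hypothetical encoding: one transition = (label, target state number).
Transition = Tuple[str, int]


def compress(vec: List[Transition], min_len: int = 2) -> list:
    """Greedy Ziv-Lempel-style factorization of a transition vector.

    Emits ("lit", transition) for a literal and ("ref", start, length)
    for a run that already occurs earlier in the vector.
    NOTE: naive O(n^2) matching, used here instead of a suffix tree.
    """
    out = []
    i, n = 0, len(vec)
    while i < n:
        best_len, best_start = 0, -1
        for j in range(i):
            k = 0
            # The source run must lie entirely before position i.
            while i + k < n and j + k < i and vec[j + k] == vec[i + k]:
                k += 1
            if k > best_len:
                best_len, best_start = k, j
        if best_len >= min_len:
            out.append(("ref", best_start, best_len))
            i += best_len
        else:
            out.append(("lit", vec[i]))
            i += 1
    return out


def decompress(tokens: list) -> List[Transition]:
    """Rebuild the original transition vector from the token list."""
    vec: List[Transition] = []
    for tok in tokens:
        if tok[0] == "lit":
            vec.append(tok[1])
        else:
            _, start, length = tok
            vec.extend(vec[start:start + length])
    return vec


# Two states that share the same outgoing run (a, b, c).
example = [("a", 1), ("b", 2), ("c", 3),
           ("a", 1), ("b", 2), ("c", 3), ("d", 4)]
tokens = compress(example)
```

In this toy input the second occurrence of the run `a, b, c` is stored as a single reference to the first, so seven transitions become five tokens. Actual savings depend on the automaton and on how references are encoded.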

Copyright information

© Springer 2006

Authors and Affiliations

  • Jan Daciuk (1)
  • Jakub Piskorski (2)
  1. Technical University of Gdańsk, Gdańsk, Poland
  2. DFKI GmbH, Saarbrücken, Germany
