Gazetteer Compression Technique Based on Substructure Recognition
Finite-state automata are state-of-the-art representation of dictionaries in natural language processing. We present a novel compression technique that is especially useful for gazetteers – a particular sort of dictionaries. We replace common substructures in the automaton by unique copies. To find them, we treat a transition vector as a string, and we apply a Ziv-Lempel-style text compression technique that uses suffix tree to find repetitions in lineaqr time. Empirical evaluation on real-world data reveals space savings of up to 18,6%, which makes this method highly attractive.
Unable to display preview. Download preview PDF.