Beyond Morphology: Pattern Matching with FST
fst stands for Finite-State Toolkit. It is an enhanced version of the xfst tool described in the 2003 Beesley and Karttunen book Finite State Morphology. Like xfst, fst serves two purposes: it is a development tool for compiling finite-state networks and a runtime tool that applies networks to input strings or files. Whereas xfst is limited to morphological analysis and generation, fst can also be used for other applications. This paper describes the new features of the fst regular expression formalism and illustrates their use for named-entity recognition, relation extraction, tokenization and parsing. The fst pattern matching algorithm (pmatch) operates on a single pattern network, but that network can be the union of any number of distinct pattern definitions, so many patterns can be matched simultaneously in one pass over a text. This gives fst a distinct advantage over the pattern matching facilities of languages such as Perl and Python.
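The idea of matching a union of pattern definitions in one left-to-right scan can be illustrated with a minimal Python sketch. This is not fst syntax and not the pmatch algorithm itself: it only handles literal strings, and the names `build_trie`, `match_all`, and the `PERSON`/`LANG` labels are hypothetical, chosen for illustration. All literals from all named patterns are merged into a single trie, which plays the role of the single pattern network, and the scanner takes the longest match at each position.

```python
from typing import Dict, List, Tuple

LABEL = "#label"  # multi-character key, so it cannot collide with a single character


def build_trie(patterns: Dict[str, List[str]]) -> dict:
    """Merge every literal of every named pattern into one trie (the 'union')."""
    root: dict = {}
    for label, literals in patterns.items():
        for literal in literals:
            node = root
            for ch in literal:
                node = node.setdefault(ch, {})
            node[LABEL] = label  # accepting state, tagged with its pattern name
    return root


def match_all(text: str, trie: dict) -> List[Tuple[int, int, str]]:
    """Scan left to right, keeping the longest match that starts at each position."""
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(text):
        node, best, j = trie, None, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if LABEL in node:
                best = (i, j, node[LABEL])  # remember the longest match so far
        if best is not None:
            matches.append(best)
            i = best[1]  # resume after the matched span
        else:
            i += 1
    return matches
```

Calling `match_all(text, build_trie({"PERSON": [...], "LANG": [...]}))` returns `(start, end, label)` triples for every pattern in a single scan, rather than running one search per pattern. A real finite-state network generalizes this from literal strings to arbitrary regular expressions.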
Keywords: finite-state automata, tokenization, pattern matching