Abstract
We consider conditional random fields (CRFs) with pattern-based potentials defined on a chain. In this model, the energy of a string (labeling) \(x_1\ldots x_n\) is the sum of terms over intervals [i, j], where each term is non-zero only if the substring \(x_i\ldots x_j\) equals a prespecified pattern w. Such CRFs can be naturally applied to many sequence tagging problems. We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) marginals, and (iii) the MAP. Their complexities are respectively \(O(\textit{nL})\), \(O(\textit{nL} \ell _{\max })\) and \(O(\textit{nL} \min \{|D|,\log (\ell _{\max }\!+\!1)\})\), where L is the combined length of input patterns, \(\ell _{\max }\) is the maximum length of a pattern, and D is the input alphabet. This improves on the previous algorithms of Ye et al. (NIPS, 2009), whose complexities are respectively \(O(\textit{nL} |D|)\), \(O\left( n |\varGamma | L^2 \ell _{\max }^2\right) \) and \(O(\textit{nL} |D|)\), where \(|\varGamma |\) is the number of input patterns. In addition, we give an efficient algorithm for sampling, and revisit the case of MAP with non-positive weights.
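To make the model concrete, the following is a minimal brute-force sketch of the energy and the partition function. It is not one of the algorithms announced above: it enumerates all \(|D|^n\) labelings and is only useful as a reference on tiny inputs. It assumes, for illustration only, that each pattern carries a single position-independent cost and that probabilities follow the convention \(p(x)\propto \exp(-E(x))\); the names `energy`, `brute_force_partition` and `costs` are hypothetical.

```python
from itertools import product
from math import exp


def energy(x, costs):
    """Energy of labeling x (a string).

    Assumption: costs maps each pattern w to one position-independent
    cost, added once for every interval [i, j] with x[i..j] == w.
    """
    total = 0.0
    for w, c in costs.items():
        m = len(w)
        for i in range(len(x) - m + 1):
            if x[i:i + m] == w:
                total += c
    return total


def brute_force_partition(n, alphabet, costs):
    """Sum of exp(-E(x)) over all labelings of length n.

    Exponential in n; a definitional reference, not an efficient method.
    """
    return sum(
        exp(-energy("".join(s), costs)) for s in product(alphabet, repeat=n)
    )
```

For example, with `costs = {"ab": -1.0, "b": 0.5}` the labeling `"abab"` contains the pattern `"ab"` on two intervals and `"b"` on two intervals, so the terms are summed over all four matches.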
Notes
Note that we still claim complexity \(O(n{P})\), where \({P}\) is the number of distinct non-empty prefixes of words in the original set \(\varGamma \). Indeed, we can assume w.l.o.g. that each letter in D occurs in at least one word \(w\!\in \!\varGamma \). (If not, then we can “merge” non-occurring letters into a single letter and add this letter to \(\varGamma \); clearly, any instance over the original pair \((D,\varGamma )\) can be equivalently formulated as an instance over the new pair. The transformation increases \({P}\) only by 1.) The assumption implies that \(|D|\le {P}\). Adding D to \(\varGamma \) increases \({P}\) by at most \(|D|\le {P}\), and thus does not affect the bound \(O(n{P})\).
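The counting argument in this note can be checked on a small example. The sketch below computes the set of distinct non-empty prefixes of a word set and shows that adding the single-letter words of D at most doubles its size when every letter occurs in some word. The function name `distinct_prefixes` and the example words are illustrative assumptions, not taken from the article.

```python
def distinct_prefixes(words):
    """Set of distinct non-empty prefixes of the given words."""
    return {w[:k] for w in words for k in range(1, len(w) + 1)}


# Illustrative pattern set; every letter of D occurs in some word.
gamma = ["ab", "abc", "b"]
alphabet = {"a", "b", "c"}

P = len(distinct_prefixes(gamma))
# Add every single letter of D as a word of its own.
P_extended = len(distinct_prefixes(gamma + sorted(alphabet)))
```

Here the prefixes of the original set are `a`, `ab`, `abc` and `b`, so `P` is 4, and adding the letters of D contributes only the new prefix `c`.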
The assumption \(|I(\varGamma )|\sim k\) will hold if e.g. we have \(|\varGamma _\delta |\ll k\). The assumption \(|\widehat{\varGamma }|\sim k\overline{\ell }\) means, roughly speaking, that words \(w_1,\ldots ,w_k\) rarely have common prefixes. It will hold, for example, if all words \(sw_is\) have the same length \(\overline{\ell }\) and their prefixes of length \(\overline{\ell }/2\) are all unique.
References
Berkman, O., Vishkin, U.: Recursive star-tree parallel data structure. SIAM J. Comput. 22(2), 221–242 (1993)
Bystroff, C., Thorsson, V., Baker, D.: HMMSTR: a hidden Markov model for local sequence-structure correlation in proteins. J. Mol. Biol. 301, 173–190 (2000)
Komodakis, N., Paragios, N.: Beyond pairwise energies: efficient optimization for higher-order MRFs. In: CVPR (2009)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML (2001)
Nguyen, V.C., Ye, N., Lee, W.S., Chieu, H.L.: Semi-Markov conditional random field with high-order features. In: ICML 2011 Structured Sparsity: Learning and Inference Workshop (2011)
Qian, X., Jiang, X., Zhang, Q., Huang, X., Wu, L.: Sparse higher order conditional random fields for improved sequence labeling. In: ICML (2009)
Rother, C., Kohli, P., Feng, W., Jia, J.: Minimizing sparse higher order energy functions of discrete variables. In: CVPR (2009)
Sarawagi, S., Cohen, W.: Semi-Markov conditional random fields for information extraction. In: NIPS (2004)
Takhanov, R., Kolmogorov, V.: Inference algorithms for pattern-based CRFs on sequence data. In: ICML (2013)
Vose, M.D.: A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Softw. Eng. 17(9), 972–975 (1991)
Ye, N., Lee, W.S., Chieu, H.L., Wu, D.: Conditional random fields with high-order features for sequence labeling. In: NIPS (2009)
Acknowledgments
The authors thank Herbert Edelsbrunner for helpful discussions. This work has been partially supported by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 616160.
Cite this article
Kolmogorov, V., Takhanov, R. Inference Algorithms for Pattern-Based CRFs on Sequence Data. Algorithmica 76, 17–46 (2016). https://doi.org/10.1007/s00453-015-0017-7
DOI: https://doi.org/10.1007/s00453-015-0017-7