Inference Algorithms for Pattern-Based CRFs on Sequence Data

Kolmogorov, Vladimir; Takhanov, Rustem

doi:10.1007/s00453-015-0017-7

Published: 20 June 2015

Volume 76, pages 17–46 (2016)

Algorithmica

Abstract

We consider conditional random fields (CRFs) with pattern-based potentials defined on a chain. In this model, the energy of a string (labeling) \(x_1\ldots x_n\) is the sum of terms over intervals [i, j], where each term is non-zero only if the substring \(x_i\ldots x_j\) equals a prespecified pattern w. Such CRFs can be naturally applied to many sequence tagging problems. We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) the marginals, and (iii) the MAP. Their complexities are respectively \(O(\textit{nL})\), \(O(\textit{nL} \ell _{\max })\) and \(O(\textit{nL} \min \{|D|,\log (\ell _{\max }\!+\!1)\})\), where L is the combined length of the input patterns, \(\ell _{\max }\) is the maximum length of a pattern, and D is the input alphabet. This improves on the previous algorithms of Ye et al. (NIPS, 2009), whose complexities are respectively \(O(\textit{nL} |D|)\), \(O\left( n |\varGamma | L^2 \ell _{\max }^2\right) \) and \(O(\textit{nL} |D|)\), where \(|\varGamma |\) is the number of input patterns. In addition, we give an efficient algorithm for sampling, and revisit the case of MAP with non-positive weights.
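To make the model in the abstract concrete, the following is a minimal brute-force sketch, not the paper's algorithms. It assumes that each pattern carries a single position-independent weight and that the distribution is \(p(x)\propto \exp (-E(x))\), so that the MAP is the minimum-energy string; neither convention is stated in the abstract. The names `energy` and `brute_force` are hypothetical. The enumeration takes \(|D|^n\) steps and is only useful as a reference for checking faster implementations on tiny inputs.

```python
import math
from itertools import product


def energy(x, weights):
    """Sum weights[w] over every interval [i, j] of x whose substring equals w.

    Assumption: each pattern has one weight that does not depend on position.
    """
    total = 0.0
    for w, cost in weights.items():
        for i in range(len(x) - len(w) + 1):
            if x[i:i + len(w)] == w:
                total += cost
    return total


def brute_force(n, alphabet, weights):
    """Enumerate all |D|^n strings and return (partition function, MAP string).

    Assumption: p(x) is proportional to exp(-energy(x)), so the MAP string
    is the one with the smallest energy.
    """
    z = 0.0
    best, best_energy = None, math.inf
    for letters in product(alphabet, repeat=n):
        x = "".join(letters)
        e = energy(x, weights)
        z += math.exp(-e)
        if e < best_energy:
            best, best_energy = x, e
    return z, best


if __name__ == "__main__":
    # Two patterns over the alphabet D = {a, b}; the weights are made up.
    z, x_map = brute_force(3, "ab", {"ab": -1.0, "bb": 0.5})
    print(z, x_map)
```

Marginals can be obtained from the same enumeration by accumulating \(\exp (-E(x))\) separately for each event of interest and dividing by the partition function.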

Notes

Some of the bounds stated in [11] are actually weaker. However, it is not difficult to show that their algorithms can be implemented in the times stated above, using our Lemma 1.
Note that we still claim complexity \(O(n{P})\) where \({P}\) is the number of distinct non-empty prefixes of words in the original set \(\varGamma \). Indeed, we can assume w.l.o.g. that each letter in D occurs in at least one word \(w\!\in \!\varGamma \). (If not, we can “merge” the non-occurring letters into a single letter and add this letter to \(\varGamma \); clearly, any instance over the original pair \((D,\varGamma )\) can be equivalently formulated as an instance over the new pair. The transformation increases \({P}\) by only 1.) The assumption implies that \(|D|\le {P}\). Adding D to \(\varGamma \) increases \({P}\) by at most \({P}\), and thus does not affect the bound \(O(n{P})\).
The assumption \(|I(\varGamma )|\sim k\) will hold if e.g. we have \(|\varGamma _\delta |\ll k\). The assumption \(|\widehat{\varGamma }|\sim k\overline{\ell }\) means, roughly speaking, that words \(w_1,\ldots ,w_k\) rarely have common prefixes. It will hold, for example, if all words \(sw_is\) have the same length \(\overline{\ell }\) and their prefixes of length \(\overline{\ell }/2\) are all unique.
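The quantity \({P}\) in the second note, the number of distinct non-empty prefixes of words in \(\varGamma \), can be computed directly. The sketch below is an illustration only: the function name `count_prefixes` and the example words are hypothetical. It checks the note's counting argument on a small instance, namely that adding the single letters of D to \(\varGamma \) creates at most \(|D|\) new prefixes.

```python
def count_prefixes(patterns):
    """Return the number of distinct non-empty prefixes of the given words."""
    prefixes = set()
    for w in patterns:
        for k in range(1, len(w) + 1):
            prefixes.add(w[:k])
    return len(prefixes)


if __name__ == "__main__":
    gamma = ["ab", "abc", "b"]       # hypothetical pattern set
    alphabet = ["a", "b", "c"]       # every letter occurs in some word
    p = count_prefixes(gamma)        # prefixes: a, ab, abc, b
    p_extended = count_prefixes(gamma + alphabet)
    # Each letter of D contributes at most one new prefix (itself).
    assert len(alphabet) <= p
    assert p_extended <= p + len(alphabet) <= 2 * p
    print(p, p_extended)
```

Equivalently, \({P}\) is the number of non-root nodes in a trie built from \(\varGamma \), which is why it is at most the combined pattern length L.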

References

1. Berkman, O., Vishkin, U.: Recursive star-tree parallel data structure. SIAM J. Comput. 22(2), 221–242 (1993)
2. Bystroff, C., Thorsson, V., Baker, D.: HMMSTR: a hidden Markov model for local sequence-structure correlation in proteins. J. Mol. Biol. 301, 173–190 (2000)
3. Komodakis, N., Paragios, N.: Beyond pairwise energies: efficient optimization for higher-order MRFs. In: CVPR (2009)
4. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML (2001)
5. Nguyen, V.C., Ye, N., Lee, W.S., Chieu, H.L.: Semi-Markov conditional random field with high-order features. In: ICML 2011 Structured Sparsity: Learning and Inference Workshop (2011)
6. Qian, X., Jiang, X., Zhang, Q., Huang, X., Wu, L.: Sparse higher order conditional random fields for improved sequence labeling. In: ICML (2009)
7. Rother, C., Kohli, P., Feng, W., Jia, J.: Minimizing sparse higher order energy functions of discrete variables. In: CVPR (2009)
8. Sarawagi, S., Cohen, W.: Semi-Markov conditional random fields for information extraction. In: NIPS (2004)
9. Takhanov, R., Kolmogorov, V.: Inference algorithms for pattern-based CRFs on sequence data. In: ICML (2013)
10. Vose, M.D.: A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Softw. Eng. 17(9), 972–975 (1991)
11. Ye, N., Lee, W.S., Chieu, H.L., Wu, D.: Conditional random fields with high-order features for sequence labeling. In: NIPS (2009)

Acknowledgments

The authors thank Herbert Edelsbrunner for helpful discussions. This work has been partially supported by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 616160.

Author information

Authors and Affiliations

IST Austria, Am Campus 1, 3400, Klosterneuburg, Austria
Vladimir Kolmogorov
Mathematics Department, Nazarbayev University, 53 Qabanbay Batyr Ave, 010000, Astana, Kazakhstan
Rustem Takhanov

Corresponding author

Correspondence to Vladimir Kolmogorov.

Additional information

A preliminary version of this paper appeared in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013 [9]. This expanded version contains proofs that were missing in [9], and also revisits the case of MAP with non-positive weights (see Sect. 8).

About this article

Cite this article

Kolmogorov, V., Takhanov, R. Inference Algorithms for Pattern-Based CRFs on Sequence Data. Algorithmica 76, 17–46 (2016). https://doi.org/10.1007/s00453-015-0017-7

Received: 31 October 2013
Accepted: 10 June 2015
Published: 20 June 2015
Issue Date: September 2016
DOI: https://doi.org/10.1007/s00453-015-0017-7
