Abstract
Many applications in natural language, speech processing, and data integration require model-based segmentation of sequences. Semi-Markov conditional random fields (semi-CRFs) generalize CRFs and provide a full conditional distribution over all possible segmentations of a sequence. Semi-CRFs are particularly suitable for tasks that entail segment-level features, such as a match with an existing dictionary of segments. Empirical results on real-life NER tasks show that they yield higher accuracy than CRFs, but the straightforward forward–backward inference algorithm requires 3–10 times the computation cost of CRFs. This running time can be reduced significantly by exploiting features that overlap across segments. We present a succinct representation of overlapping features and an efficient training algorithm that can sum over all possible segmentations of the input in time that is sub-quadratic in the input length, even while imposing no bound on the maximum segment length. Consequently, the running time becomes comparable to that of CRFs, even with the addition of useful entity-level features on large input segments.
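To make the "sum over all possible segmentations" concrete, the following is a minimal sketch of the straightforward semi-Markov forward recursion, not the sub-quadratic algorithm the article presents. The function names, the `score(start, end, label, prev_label)` signature, the toy dictionary, and the weights are all assumptions made for illustration. With no bound on segment length, this baseline takes time quadratic in the input length, which is the cost the abstract says can be reduced.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def semi_markov_log_partition(n, labels, score, max_len=None):
    """Log-sum of exp(total score) over all labelled segmentations of n >= 1 tokens.

    score(start, end, label, prev_label) is a hypothetical segment scorer for
    tokens start..end-1; prev_label is None for the first segment.
    """
    if max_len is None:
        max_len = n  # unbounded segment length: O(n^2 * |labels|^2) time
    # alpha[i][y]: log-sum over segmentations of the first i tokens
    # whose last segment carries label y.
    alpha = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for y in labels:
            terms = []
            for start in range(max(0, end - max_len), end):
                if start == 0:
                    terms.append(score(0, end, y, None))
                else:
                    for y_prev in labels:
                        terms.append(alpha[start][y_prev] + score(start, end, y, y_prev))
            alpha[end][y] = logsumexp(terms)
    return logsumexp([alpha[n][y] for y in labels])


def brute_force_log_partition(n, labels, score):
    """Enumerate every labelled segmentation explicitly (exponential; for checking only)."""
    totals = []
    for cuts in product([False, True], repeat=n - 1):
        bounds = [0] + [i + 1 for i, cut in enumerate(cuts) if cut] + [n]
        segments = list(zip(bounds[:-1], bounds[1:]))
        for ys in product(labels, repeat=len(segments)):
            total, prev = 0.0, None
            for (start, end), y in zip(segments, ys):
                total += score(start, end, y, prev)
                prev = y
            totals.append(total)
    return logsumexp(totals)


# Toy segment-level feature: does the whole segment match a dictionary entry?
TOKENS = ["new", "york", "city", "hall"]
DICTIONARY = {"new york", "new york city", "city hall"}  # assumed example entries
LABELS = ["ENT", "O"]


def toy_score(start, end, label, prev_label):
    """Hypothetical weights: reward dictionary matches, penalise long non-entity segments."""
    text = " ".join(TOKENS[start:end])
    s = 0.0
    if label == "ENT":
        s += 2.0 if text in DICTIONARY else -1.0
    else:
        s -= 0.5 * (end - start - 1)
    if prev_label == "ENT" and label == "ENT":
        s -= 0.3  # mild penalty for adjacent entity segments
    return s


def zero_score(start, end, label, prev_label):
    """All-zero scores, so exp(log Z) simply counts labelled segmentations."""
    return 0.0
```

Calling `semi_markov_log_partition(len(TOKENS), LABELS, toy_score)` returns the log normaliser for the toy model. The dictionary-match feature depends on the whole segment, which is exactly the kind of feature a token-level CRF cannot express directly.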
Notes
Be careful about using the following code—I’ve only proven that it works, I haven’t tested it. Donald Knuth
Additional information
This article belongs to the Special Issue: Recent Advances in Machine Learning.
Cite this article
Sarawagi, S. Sequence Segmentation Using Semi-Markov Conditional Random Fields. J Indian Inst Sci 99, 215–224 (2019). https://doi.org/10.1007/s41745-019-0100-1
DOI: https://doi.org/10.1007/s41745-019-0100-1