Sequence Segmentation Using Semi-Markov Conditional Random Fields

  • Review Article
Journal of the Indian Institute of Science

Abstract

Many applications in natural language processing, speech processing, and data integration require model-based segmentation of sequences. Semi-Markov conditional random fields (semi-CRFs) generalize CRFs and provide a full conditional distribution over all possible segmentations of a sequence. Semi-CRFs are particularly suitable for tasks that entail segment-level features, such as a match with an existing dictionary of segments. Empirical results on real-life NER tasks show that they yield higher accuracy than CRFs, but the straightforward forward–backward inference algorithm requires 3–10 times the computation cost of CRFs. This running time can be reduced significantly by exploiting overlapping features across segments. We present a succinct representation of overlapping features and an efficient training algorithm that can sum over all possible input segmentations in time that is sub-quadratic in the input length, even while imposing no bound on the maximum segment length. Consequently, the running time becomes comparable to that of CRFs even with the addition of useful entity-level features on large input segments.
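To make the cost of the straightforward inference concrete, here is a minimal Python sketch of the forward pass of a semi-Markov model, which sums the scores of all labeled segmentations of a sequence. It is an illustration under stated assumptions, not the article's method: the names `semi_crf_log_partition`, `segment_score`, `max_len`, and the toy tokens, labels, and dictionary are all hypothetical, and the sub-quadratic algorithm described in the abstract is not implemented here. A brute-force enumeration is included only as a sanity check.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def semi_crf_log_partition(n, labels, segment_score, max_len=None):
    """Log of the summed exponentiated scores of all labeled segmentations.

    segment_score(start, end, label, prev_label) is a hypothetical scoring
    function for the segment covering tokens start..end-1; prev_label is
    None for the first segment.
    """
    if max_len is None:
        max_len = n  # no bound on segment length
    # alpha[i][y]: log-sum over segmentations of the first i tokens
    # whose last segment carries label y.
    alpha = [{y: -math.inf for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            terms = []
            # The loop over segment lengths d is the extra work
            # relative to a token-level CRF forward pass.
            for d in range(1, min(max_len, i) + 1):
                start = i - d
                if start == 0:
                    terms.append(segment_score(0, i, y, None))
                else:
                    for y_prev in labels:
                        terms.append(alpha[start][y_prev]
                                     + segment_score(start, i, y, y_prev))
            alpha[i][y] = logsumexp(terms)
    return logsumexp([alpha[n][y] for y in labels])

def brute_force_log_partition(n, labels, segment_score, max_len=None):
    """Enumerate every labeled segmentation explicitly (exponential time)."""
    if max_len is None:
        max_len = n
    totals = []

    def extend(start, prev_label, score):
        if start == n:
            totals.append(score)
            return
        for end in range(start + 1, min(n, start + max_len) + 1):
            for y in labels:
                extend(end, y, score + segment_score(start, end, y, prev_label))

    extend(0, None, 0.0)
    return logsumexp(totals)

# Toy data (assumed for illustration): a segment-level dictionary feature.
TOKENS = ["new", "york", "is", "big"]
DICTIONARY = {("new", "york")}
LABELS = ["ENT", "O"]

def toy_score(start, end, label, prev_label):
    segment = tuple(TOKENS[start:end])
    if label == "ENT":
        score = 2.0 if segment in DICTIONARY else -1.0
    else:
        score = 0.5 if end - start == 1 else -1.0
    if prev_label == label:
        score -= 0.3  # mild penalty for adjacent segments with the same label
    return score

if __name__ == "__main__":
    fast = semi_crf_log_partition(len(TOKENS), LABELS, toy_score)
    slow = brute_force_log_partition(len(TOKENS), LABELS, toy_score)
    print(fast, slow)
```

With `max_len` left unbounded, this recursion does work proportional to the square of the input length (times the square of the number of labels). Capping `max_len` at a small constant reduces this to a constant-factor overhead over a CRF, at the price of disallowing longer segments.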

Notes

  1. Be careful about using the following code—I’ve only proven that it works, I haven’t tested it. Donald Knuth

Author information

Corresponding author

Correspondence to Sunita Sarawagi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Special issue-Recent Advances in Machine Learning.

About this article

Cite this article

Sarawagi, S. Sequence Segmentation Using Semi-Markov Conditional Random Fields. J Indian Inst Sci 99, 215–224 (2019). https://doi.org/10.1007/s41745-019-0100-1

  • DOI: https://doi.org/10.1007/s41745-019-0100-1