Error Detection of CRF-Based Bibliography Extraction from Reference Strings

  • Conference paper
The Outreach of Digital Libraries: A Globalized Resource Network (ICADL 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7634)

Abstract

We have proposed a method for parsing reference strings, which are usually listed at the end of research papers, to extract bibliographic elements such as the title. The method uses a conditional random field (CRF) to estimate the correct bibliographic label for each token in the token sequence generated from a reference string. Although it achieved reasonable parsing accuracy on a Japanese academic journal, errors are inevitable. This paper therefore proposes confidence measures for CRF-based bibliography parsing that can be used to detect such parsing errors. It also reports an empirical evaluation of the parsing method in terms of both its accuracy and how easily its errors can be detected. The experiments showed that the proposed measures reasonably indicated parsing errors and could be used to improve the quality of extracted bibliographies at a moderate manual post-editing cost.
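
The abstract does not say which measures the paper uses, so the following is only a minimal sketch of the general idea of flagging likely parsing errors from a sequence labeler's output. It assumes that a CRF toolkit has already produced, for each token, a marginal probability for every label; the names `min_marginal_confidence` and `flag_for_review`, the example labels, the probabilities, and the threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Assumed input format (hypothetical): one dict per token, mapping each
# bibliographic label to its marginal probability under the CRF.
Marginals = List[Dict[str, float]]


def min_marginal_confidence(marginals: Marginals) -> float:
    """Score a parsed reference by its least certain token.

    For each token, take the probability of its most likely label;
    the score of the whole reference is the minimum over tokens.
    """
    return min(max(dist.values()) for dist in marginals)


def flag_for_review(parsed: Dict[str, Marginals], threshold: float) -> List[str]:
    """Return the ids of references whose score is below the threshold."""
    return [
        ref_id
        for ref_id, marginals in parsed.items()
        if min_marginal_confidence(marginals) < threshold
    ]


# Illustrative, made-up marginals for two two-token reference strings.
parsed = {
    "ref-1": [{"author": 0.98, "title": 0.02}, {"author": 0.05, "title": 0.95}],
    "ref-2": [{"author": 0.55, "title": 0.45}, {"author": 0.10, "title": 0.90}],
}

flagged = flag_for_review(parsed, threshold=0.8)
print(flagged)  # ['ref-2']
```

Only the flagged references would be passed to a human for post-editing, so the threshold trades manual cost against the number of errors that slip through.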




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ohta, M., Arauchi, D., Takasu, A., Adachi, J. (2012). Error Detection of CRF-Based Bibliography Extraction from Reference Strings. In: Chen, HH., Chowdhury, G. (eds) The Outreach of Digital Libraries: A Globalized Resource Network. ICADL 2012. Lecture Notes in Computer Science, vol 7634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34752-8_29

  • DOI: https://doi.org/10.1007/978-3-642-34752-8_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34751-1

  • Online ISBN: 978-3-642-34752-8

  • eBook Packages: Computer Science, Computer Science (R0)
