Automatically Identify and Label Sections in Scientific Journals Using Conditional Random Fields

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 641)

Abstract

In this paper, we describe a pipeline that automatically converts a journal article in PDF format into XML that conforms to the NLM JATS DTD. First, the text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the DTD tags at that depth and feeds the resulting tag into the next-level model as a feature. After identifying tags up to level three, we use separate supervised models to parse authors, affiliations, references and citations. We employ heuristic-based methods to match affiliations to authors and citations to references. The JATS XML thus generated is converted into an RDF document, and SPARQL queries are run on the RDF to answer the queries of Task 2 of the Semantic Publishing Challenge.
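The trickle-down idea in the abstract can be illustrated with a minimal sketch. It shows only the cascading mechanism, where each level's predicted tag is added to the line's features before the next level runs. The three level functions are hypothetical rule-based stand-ins, not trained CRF models, and the feature names (`page`, `font_size`, `bold`), the thresholds, and the choice of JATS tags per level are assumptions made for illustration rather than the authors' configuration.

```python
from typing import Callable, Dict, List

Features = Dict[str, object]
# A level model maps the feature dicts of all lines to one tag per line.
LevelModel = Callable[[List[Features]], List[str]]


def run_cascade(lines: List[Features], models: List[LevelModel]) -> List[List[str]]:
    """Apply the level models in order, feeding each level's tag to the next."""
    enriched = [dict(f) for f in lines]  # copy so the caller's dicts stay untouched
    paths: List[List[str]] = [[] for _ in lines]
    for depth, model in enumerate(models, start=1):
        tags = model(enriched)
        for feats, path, tag in zip(enriched, paths, tags):
            feats[f"level{depth}_tag"] = tag  # becomes a feature at depth + 1
            path.append(tag)
    return paths


# Hypothetical stand-ins for the pre-trained per-level models (simple rules).
def level1(lines: List[Features]) -> List[str]:
    return ["front" if f["page"] == 1 and f["font_size"] >= 12 else "body"
            for f in lines]


def level2(lines: List[Features]) -> List[str]:
    # Uses the level-1 tag that run_cascade added as a feature.
    return ["article-meta" if f["level1_tag"] == "front" else "sec" for f in lines]


def level3(lines: List[Features]) -> List[str]:
    tags = []
    for f in lines:
        if f["level2_tag"] == "article-meta":
            tags.append("title-group" if f["font_size"] >= 16 else "contrib-group")
        else:
            tags.append("title" if f["bold"] else "p")
    return tags


lines = [
    {"text": "Paper title", "page": 1, "font_size": 18.0, "bold": True},
    {"text": "Author names", "page": 1, "font_size": 12.0, "bold": False},
    {"text": "1 Introduction", "page": 1, "font_size": 11.0, "bold": True},
    {"text": "Body text", "page": 1, "font_size": 10.0, "bold": False},
]
paths = run_cascade(lines, [level1, level2, level3])
for line, path in zip(lines, paths):
    print(line["text"], "->", "/".join(path))
```

In a real system each level function would presumably wrap a trained sequence model that labels all lines of a document jointly; the cascade loop itself would stay the same.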

Notes

  1. Semantic Publishing Challenge 2016 - https://github.com/ceurws/lod/wiki/SemPub2016.

  2. Typographical features include information about typefaces, point size and line length.

  3. NLM JATS DTD - http://dtd.nlm.nih.gov/archiving/tag-library/3.0/index.html.

  4. Resource Description Framework (RDF) - http://www.w3.org/RDF/.

  5. Apache PDFBox - https://pdfbox.apache.org/.

  6. For a binary feature, "set" means assigning the value 1. For a multi-categorical feature, the values are discrete integers ranging from 0 to (number of buckets - 1).

  7. Stanford NER Tagger - http://nlp.stanford.edu/software/CRF-NER.shtml.

  8. CRF++ - https://taku910.github.io/crfpp/.

  9. CoNLL - http://www.cnts.ua.ac.be/conll2000/chunking/.

  10. A subset of scientific journals published at CEUR-WS.org - https://github.com/ceurws/lod/wiki/SemPub16_Task2#training-dataset-td2.

  11. Stanford Log-linear Part-Of-Speech Tagger - http://nlp.stanford.edu/software/tagger.shtml.

  12. Maxmind Free World Cities Database - https://www.maxmind.com/en/free-world-cities-database.

  13. Symbols like *, †, ‡ and §, or numbers 0–9.

  14. Vancouver System of Referencing - https://en.wikipedia.org/wiki/Vancouver_system.

  15. Harvard Referencing - https://en.wikipedia.org/wiki/Parenthetical_referencing.

  16. SPAR, the Semantic Publishing and Referencing Ontologies, is an integrated ecosystem of ontologies such as DoCO and CiTO.

  17. Document Components Ontology (DoCO) - http://purl.org/spar/doco.

  18. https://github.com/ceurws/lod/wiki/SemPub15_Task2#training-dataset-td2.

  19. https://github.com/ceurws/lod/wiki/SemPub16_Task2#training-dataset-td2.

  20. http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.

  21. https://github.com/angelobo/SemPubEvaluator.

  22. https://github.com/ceurws/lod/wiki/SemPub2016#winners.
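The feature encoding described in the notes (binary features set to 1, multi-categorical features as bucket indices from 0 to number of buckets - 1) can be sketched as follows. The bucket edges, the choice of features, and the helper names `bucket_index` and `encode_line` are assumptions for illustration; the paper's actual feature set is not given here. The output row uses whitespace-separated columns with the label last, the layout used by CRF++ and the CoNLL chunking data.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical bucket edges; n edges define n + 1 buckets.
FONT_SIZE_EDGES = (9.0, 12.0, 16.0)  # point size, 4 buckets
LINE_LENGTH_EDGES = (20, 60)         # characters per line, 3 buckets


def bucket_index(value: float, edges: Sequence[float]) -> int:
    """Map a numeric value to a bucket index in 0 .. len(edges)."""
    return bisect_right(edges, value)


def encode_line(text: str, font_size: float, is_bold: bool, label: str) -> str:
    """Build one tab-separated training row: features first, label last."""
    # Columns are whitespace-separated, so only the first token of the line is kept.
    first_token = (text.split() or ["<empty>"])[0]
    columns = [
        first_token,
        str(int(is_bold)),                                   # binary: 1 when set
        str(bucket_index(font_size, FONT_SIZE_EDGES)),       # multi-categorical
        str(bucket_index(len(text), LINE_LENGTH_EDGES)),     # multi-categorical
        label,
    ]
    return "\t".join(columns)


print(encode_line("1 Introduction", 11.0, True, "title"))
```

Bucketing keeps continuous typographical measurements such as point size usable by a model that expects discrete feature values.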

Author information

Corresponding author

Correspondence to Sree Harsha Ramesh.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ramesh, S.H. et al. (2016). Automatically Identify and Label Sections in Scientific Journals Using Conditional Random Fields. In: Sack, H., Dietze, S., Tordai, A., Lange, C. (eds) Semantic Web Challenges. SemWebEval 2016. Communications in Computer and Information Science, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-319-46565-4_21

  • DOI: https://doi.org/10.1007/978-3-319-46565-4_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46564-7

  • Online ISBN: 978-3-319-46565-4

  • eBook Packages: Computer Science, Computer Science (R0)
