Automatically Identify and Label Sections in Scientific Journals Using Conditional Random Fields

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 641)


In this paper, we describe a pipeline that automatically converts a journal article in PDF format into XML that conforms to the NLM JATS DTD. First, the text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the DTD tags at a particular depth and feeds the resulting tag into the next-level model as a feature. After identifying tags up to level three, we use separate supervised models to parse authors, affiliations, references and citations. We employ heuristic-based methods to match affiliations to authors and citations to references. The JATS XML thus generated is converted into an RDF document, and SPARQL queries are run on the RDF to answer the queries of Task 2 of the Semantic Publishing Challenge.
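The trickle-down idea, in which each level's predicted tag becomes an input feature for the next level, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the feature names, the three example lines, and the rule-based stand-ins `level1`, `level2` and `level3` are all hypothetical. In a real pipeline each stand-in would be replaced by a pre-trained CRF model that labels the whole sequence of lines.

```python
from typing import Callable, Dict, List

Features = Dict[str, object]
# A level model maps a sequence of per-line feature dicts to one tag per line.
LevelModel = Callable[[List[Features]], List[str]]


def line_features(line: dict) -> Features:
    """Build illustrative textual and typographical features for one line."""
    text = line["text"]
    return {
        "font_size": line["font_size"],
        "is_bold": line["is_bold"],
        "is_upper": text.isupper(),
        "n_tokens": len(text.split()),
    }


def trickle_down(lines: List[dict], models: List[LevelModel]) -> List[List[str]]:
    """Run the level models in order, feeding each level's tag to the next."""
    feats = [line_features(line) for line in lines]
    paths: List[List[str]] = [[] for _ in lines]
    for depth, model in enumerate(models, start=1):
        tags = model(feats)
        for f, path, tag in zip(feats, paths, tags):
            path.append(tag)
            # The predicted tag becomes a feature for the next-level model.
            f[f"level{depth}_tag"] = tag
    return paths


# Hypothetical rule-based stand-ins for the pre-trained per-level CRF models.
def level1(feats: List[Features]) -> List[str]:
    return ["front" if f["font_size"] >= 14 else "body" for f in feats]


def level2(feats: List[Features]) -> List[str]:
    return ["article-meta" if f["level1_tag"] == "front" else "sec" for f in feats]


def level3(feats: List[Features]) -> List[str]:
    tags = []
    for f in feats:
        if f["level1_tag"] == "front":
            tags.append("title-group")
        else:
            tags.append("title" if f["is_bold"] else "p")
    return tags


# Made-up input lines, as they might come out of the PDF extraction step.
lines = [
    {"text": "A Sample Article Title", "font_size": 16, "is_bold": True},
    {"text": "1 Introduction", "font_size": 10, "is_bold": True},
    {"text": "Body text of the first paragraph.", "font_size": 10, "is_bold": False},
]

paths = trickle_down(lines, [level1, level2, level3])
for line, path in zip(lines, paths):
    print("/".join(path), "<-", line["text"])
```

Each line ends up with a path of tags, one per depth, which is the information needed to place that line in the nested JATS structure. Because the stand-ins only read features set by earlier levels, the order of the models in the list matters.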


Keywords: Multi-level CRF; BIO encoding; NLM JATS; JATS2RDF



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Surukam Analytics, Chennai, India
  2. Newgen KnowledgeWorks, Chennai, India
