From raw publications to Linked Data

  • Regular Paper
  • Knowledge and Information Systems

Abstract

The continued growth of the Linked Data Web depends on advances in the underlying extraction mechanisms. This is of particular interest in the scientific publishing domain, where most data sets are currently created manually. In this article, we present a Machine Learning pipeline that automatically extracts heading metadata (e.g., title, authors) from scientific publications. The experimental evaluation shows that our solution handles any type of publication format very well and improves on the average extraction performance of the state of the art by around 4%, while also being more versatile. Finally, we propose a flexible Linked Data-driven mechanism for both refining and linking the automatically extracted metadata.

Author information

Corresponding author

Correspondence to Tudor Groza.

About this article

Cite this article

Groza, T., Grimnes, G.A., Handschuh, S. et al. From raw publications to Linked Data. Knowl Inf Syst 34, 1–21 (2013). https://doi.org/10.1007/s10115-011-0473-6

  • DOI: https://doi.org/10.1007/s10115-011-0473-6
