Knowledge and Information Systems, Volume 34, Issue 1, pp 1–21

From raw publications to Linked Data

  • Tudor Groza
  • Gunnar AAstrand Grimnes
  • Siegfried Handschuh
  • Stefan Decker
Regular Paper


The continuous development of the Linked Data Web depends on the advancement of the underlying extraction mechanisms. This is of particular interest for the scientific publishing domain, where most data sets are currently created manually. In this article, we present a Machine Learning pipeline that enables the automatic extraction of heading metadata (i.e., title, authors, etc.) from scientific publications. The experimental evaluation shows that our solution handles any type of publication format very well and improves on the average extraction performance of the state of the art by around 4%, while also showing increased versatility. Finally, we propose a flexible Linked Data-driven mechanism to be used both for refining and for linking the automatically extracted metadata.


  • Metadata extraction
  • Support vector machines
  • Conditional random fields
  • Linked data





Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Tudor Groza (1, 2, 3)
  • Gunnar AAstrand Grimnes (4)
  • Siegfried Handschuh (1, 2)
  • Stefan Decker (1, 2)

  1. DERI, National University of Ireland, Galway, Ireland
  2. IDA Business Park, Lower Dangan, Galway, Ireland
  3. School of ITEE, The University of Queensland, Queensland, Australia
  4. DFKI GmbH, Kaiserslautern, Germany
