From raw publications to Linked Data

Groza, Tudor; Grimnes, Gunnar AAstrand; Handschuh, Siegfried; Decker, Stefan

doi:10.1007/s10115-011-0473-6

Regular Paper
Published: 29 December 2011

Volume 34, pages 1–21, (2013)

Knowledge and Information Systems

Abstract

The continuous development of the Linked Data Web depends on the advancement of the underlying extraction mechanisms. This is of particular interest for the scientific publishing domain, where most data sets are currently created manually. In this article, we present a Machine Learning pipeline that enables the automatic extraction of heading metadata (i.e., title, authors, etc.) from scientific publications. The experimental evaluation shows that our solution handles any type of publication format very well and improves the average extraction performance of the state of the art by around 4%, in addition to showing increased versatility. Finally, we propose a flexible Linked Data-driven mechanism to be used both for refining and linking the automatically extracted metadata.

Author information

Tudor Groza
Present address: School of ITEE, The University of Queensland, R. 78-709, Level 7, GP South (#78), St. Lucia campus, 4072, Queensland, Australia

Authors and Affiliations

DERI, National University of Ireland, Galway, Ireland
Tudor Groza, Siegfried Handschuh & Stefan Decker
IDA Business Park, Lower Dangan, Galway, Ireland
Tudor Groza, Siegfried Handschuh & Stefan Decker
DFKI GmbH, Trippstadter Strasse 122, 67663, Kaiserslautern, Germany
Gunnar AAstrand Grimnes

Corresponding author

Correspondence to Tudor Groza.

About this article

Cite this article

Groza, T., Grimnes, G.A., Handschuh, S. et al. From raw publications to Linked Data. Knowl Inf Syst 34, 1–21 (2013). https://doi.org/10.1007/s10115-011-0473-6

Received: 07 September 2010
Revised: 10 June 2011
Accepted: 13 December 2011
Published: 29 December 2011
Issue Date: January 2013
DOI: https://doi.org/10.1007/s10115-011-0473-6
