Identifying References to Datasets in Publications

  • Katarina Boland
  • Dominique Ritze
  • Kai Eckert
  • Brigitte Mathiak
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7489)


Research data and publications are usually stored in separate and structurally distinct information systems. Often, links between these resources are not explicitly available, which complicates the search for previous research. In this paper, we propose a pattern induction method for the detection of study references in full texts. Since these references are not specified in a standardized way and may occur in a variety of contexts – i.e., captions, footnotes, or continuous text – our algorithm is required to induce very flexible patterns. To overcome the sparse distribution of training instances, we induce patterns iteratively using a bootstrapping approach. We show that our method achieves promising results for the automatic identification of data references and is a first step towards building an integrated information system.
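The general idea of iterative pattern induction by bootstrapping can be illustrated with a minimal sketch. This is not the algorithm evaluated in the paper; it is a toy version under stated assumptions: patterns are built only from the two words to the left of a known study name, a study name is assumed to be a single capitalized token, and the example sentences are invented for illustration. The function names (`induce_patterns`, `apply_patterns`, `bootstrap`) are hypothetical.

```python
import re

# Assumption: a study name is one token starting with a capital letter.
NAME_SLOT = r"([A-Z][\w-]*)"


def induce_patterns(texts, names, window=2):
    """Build regex patterns from the words to the left of each known name."""
    patterns = set()
    for text in texts:
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                left_words = text[:match.start()].split()[-window:]
                if not left_words:
                    continue  # name at the start of the text: no context
                context = r"\s+".join(re.escape(w) for w in left_words)
                patterns.add(r"(?<!\w)" + context + r"\s+" + NAME_SLOT)
    return patterns


def apply_patterns(texts, patterns):
    """Return every token captured by any pattern in any text."""
    found = set()
    for text in texts:
        for pattern in patterns:
            found.update(re.findall(pattern, text))
    return found


def bootstrap(texts, seeds, max_iterations=5):
    """Alternate between inducing patterns and harvesting new names."""
    names = set(seeds)
    for _ in range(max_iterations):
        patterns = induce_patterns(texts, names)
        new_names = apply_patterns(texts, patterns) - names
        if not new_names:
            break  # no pattern found anything new: stop iterating
        names |= new_names
    return names


# Invented example sentences; only the seed "ALLBUS" is given up front.
texts = [
    "Our analysis uses data from the ALLBUS survey.",
    "We use data from the Eurobarometer series.",
    "Respondents in the Eurobarometer were interviewed face to face.",
    "Respondents in the ISSP were interviewed by mail.",
]

print(sorted(bootstrap(texts, {"ALLBUS"})))
```

In this toy corpus the seed yields the context "from the", which finds a second name; that name then occurs in a new context, "in the", which finds a third. A realistic system would additionally need to score and filter the induced patterns, since contexts this short also match many tokens that are not study names.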


Keywords: Digital Libraries, Information Extraction, Recognition of Dataset References, Iterative Pattern Induction, Bootstrapping





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Katarina Boland (1)
  • Dominique Ritze (2)
  • Kai Eckert (2)
  • Brigitte Mathiak (1)

  1. GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
  2. Mannheim University Library, Mannheim, Germany
