Improved Bibliographic Reference Parsing Based on Repeated Patterns

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7489)


Parsing details like author names and titles out of bibliographic references of scientific publications is an important issue. However, most existing techniques are tailored to the highly standardized reference styles used in the last two to three decades. Their performance tends to degrade when faced with the wider variety of reference styles used in older, historic publications. Thus, existing techniques are of limited use when creating comprehensive bibliographies covering both historic and contemporary scientific publications. This paper presents RefParse, a generic approach to bibliographic reference parsing that is independent of any specific reference style. Its core feature is an inference mechanism that exploits the regularities inherent in any list of references to deduce its format. Our evaluation shows that RefParse outperforms existing parsers both for contemporary and for historic reference lists.
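The idea of exploiting list-level regularities can be illustrated with a small sketch. This is not RefParse's actual algorithm, which the abstract does not spell out. It only shows the general principle that a format property deduced from the whole reference list can then be applied to each entry. The candidate delimiter set and the function names `first_delimiter`, `infer_author_delimiter`, and `split_authors` are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical set of strings that might separate the author block
# from the rest of a reference; a real parser would need far more.
CANDIDATE_DELIMITERS = [": ", ". ", "; "]


def first_delimiter(reference):
    """Return the candidate delimiter that occurs earliest, or None."""
    positions = [
        (reference.find(d), d) for d in CANDIDATE_DELIMITERS if d in reference
    ]
    return min(positions)[1] if positions else None


def infer_author_delimiter(references):
    """Let each reference vote for its earliest delimiter; majority wins."""
    votes = Counter()
    for reference in references:
        delimiter = first_delimiter(reference)
        if delimiter is not None:
            votes[delimiter] += 1
    return votes.most_common(1)[0][0] if votes else None


def split_authors(reference, delimiter):
    """Split a reference into (author block, remainder) at the delimiter."""
    head, sep, tail = reference.partition(delimiter)
    return (head, tail) if sep else (None, reference)


# Three entries taken from this paper's own reference list.
references = [
    "Hetzner, E.: A simple method for citation metadata extraction using "
    "hidden markov models. In: Proceedings of JCDL 2008, Pittsburgh, PA, USA (2008)",
    "Takasu, A.: Bibliographic Attribute Extraction from Erroneous References "
    "Based on a Statistical Model. In: Proc. JCDL 2003, Houston, TX, USA (2003)",
    "Patashnik, O.: BibTeXing - the original manual. Proceedings of the IEEE, 77 (1988)",
]

delimiter = infer_author_delimiter(references)
for reference in references:
    authors, rest = split_authors(reference, delimiter)
    print(authors, "|", rest)
```

Because the delimiter is chosen by majority vote over the whole list, a single entry whose own earliest delimiter is misleading would still be split the way its neighbours are, which is the kind of regularity the abstract refers to.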


Parsing Bibliography Data Algorithms 





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Computer Science Department, Karlsruhe Institute of Technology, Karlsruhe, Germany
