Improved Bibliographic Reference Parsing Based on Repeated Patterns
Abstract
Parsing details like author names and titles out of bibliographic references of scientific publications is an important issue. However, most existing techniques are tailored to the highly standardized reference styles used in the last two to three decades. Their performance tends to degrade when faced with the wider variety of reference styles used in older, historic publications. Thus, existing techniques are of limited use when creating comprehensive bibliographies covering both historic and contemporary scientific publications. This paper presents RefParse, a generic approach to bibliographic reference parsing that is independent of any specific reference style. Its core feature is an inference mechanism that exploits the regularities inherent in any list of references to deduce its format. Our evaluation shows that RefParse outperforms existing parsers both for contemporary and for historic reference lists.
Keywords
Parsing Bibliography Data Algorithms