Locating and parsing bibliographic references in HTML medical articles

Zou, Jie; Le, Daniel; Thoma, George R.

doi:10.1007/s10032-009-0105-9

Original Paper
Published: 16 January 2010

Volume 13, pages 107–119, (2010)

International Journal on Document Analysis and Recognition (IJDAR)

Abstract

The set of references that typically appear toward the end of journal articles is sometimes, though not always, a field in bibliographic (citation) databases. But even if references do not constitute such a field, they can be useful as a preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automation in data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process using statistical machine learning algorithms, to first locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, followed by a search algorithm that systematically corrects low confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.
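The difference between the word-level and chunk-level accuracies quoted above can be illustrated with a minimal sketch. The abstract does not define either metric, so the definitions here are assumptions: a chunk is taken to be a maximal run of consecutive words sharing one label, and a chunk counts as correct only if the predicted sequence contains a chunk with the same label and the same boundaries. The function names and the label set (AUTHOR, TITLE, JOURNAL, YEAR) are hypothetical and are not taken from the paper.

```python
from itertools import groupby


def chunks(labels):
    """Split a label sequence into (label, start, end) runs of identical labels."""
    spans = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((label, start, start + length))
        start += length
    return spans


def word_accuracy(gold, pred):
    """Fraction of words whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chunk_accuracy(gold, pred):
    """Fraction of gold chunks reproduced with identical label and boundaries.

    This is an assumed definition for illustration only.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    gold_chunks = chunks(gold)
    pred_chunks = set(chunks(pred))
    return sum(c in pred_chunks for c in gold_chunks) / len(gold_chunks)


# Hypothetical labels for a seven-word reference string.
gold = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE", "JOURNAL", "YEAR"]
pred = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "JOURNAL", "JOURNAL", "YEAR"]

print(word_accuracy(gold, pred))   # one wrong word out of seven
print(chunk_accuracy(gold, pred))  # the TITLE and JOURNAL chunks both have wrong boundaries
```

Under these assumed definitions, a single mislabeled word at a field boundary damages two chunks at once, which is one plausible reason a chunk-level figure would sit below the word-level figure.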

Author information

Authors and Affiliations

Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
Jie Zou, Daniel Le & George R. Thoma

Corresponding author

Correspondence to Jie Zou.

About this article

Cite this article

Zou, J., Le, D. & Thoma, G.R. Locating and parsing bibliographic references in HTML medical articles. IJDAR 13, 107–119 (2010). https://doi.org/10.1007/s10032-009-0105-9

Received: 01 April 2009
Revised: 20 October 2009
Accepted: 01 December 2009
Published: 16 January 2010
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10032-009-0105-9
