A Frame-Based Approach for Reference Metadata Extraction
In this paper, we propose a novel frame-based approach (FBA) and use reference metadata extraction as a case study to demonstrate its advantages. The main contributions of this research are threefold. First, the new frame matching algorithm, based on sequence alignment, compensates for the shortcomings of the traditional rule-based approach, in which rule matching lacks flexibility and generality. Second, approximate matching is adopted to capture reasonable abbreviations or errors in the input reference string, further increasing the coverage of the frames. Third, experiments conducted on extensive datasets show that the same knowledge framework performs equally well on various untrained domains. Compared to a widely used machine learning method, Conditional Random Fields (CRFs), the FBA reduces the average field error rate across all four independent test sets by about 70% (2.24% vs. 7.54%).
Keywords: Reference metadata extraction; Knowledge representation; Frame-based approach
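The abstract names two ingredients, frame matching by sequence alignment and approximate matching of abbreviated or misspelled tokens, without giving their details. The sketch below illustrates the general idea only and is not the paper's algorithm: it uses a standard Needleman-Wunsch global alignment, and every name in it (`align`, `approx_match`, the example frames, the scores and the 0.8 threshold) is an assumption introduced for illustration.

```python
from difflib import SequenceMatcher


def approx_match(expected, token, threshold=0.8):
    """Hypothetical approximate matcher: accepts prefix abbreviations
    such as "Proc." for "Proceedings", and near-identical spellings."""
    exp = expected.lower()
    core = token.lower().rstrip(".")
    if len(core) >= 3 and exp.startswith(core):
        return True
    return SequenceMatcher(None, exp, token.lower()).ratio() >= threshold


def align(frame, observed, equal=lambda a, b: a == b,
          match=2, mismatch=-1, gap=-1):
    """Globally align a frame (expected sequence) with an observed
    sequence. Returns (score, pairs); None in a pair marks a gap.
    The scoring values are illustrative assumptions."""
    n, m = len(frame), len(observed)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if equal(frame[i - 1], observed[j - 1]) else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if equal(frame[i - 1], observed[j - 1]) else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                pairs.append((frame[i - 1], observed[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((frame[i - 1], None))   # frame slot left unfilled
            i -= 1
        else:
            pairs.append((None, observed[j - 1]))  # extra observed item
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    # A reference with no page range still aligns with the full frame.
    frame = ["AUTHOR", "TITLE", "VENUE", "PAGES", "YEAR"]
    print(align(frame, ["AUTHOR", "TITLE", "VENUE", "YEAR"]))

    # Abbreviated and misspelled tokens align via approximate matching.
    print(align(["Proceedings", "of", "the", "Conference"],
                ["Proc.", "of", "Confernce"], equal=approx_match))
```

In this sketch a rigid rule requiring all five slots in order would reject the first example, whereas the alignment simply leaves the PAGES slot as a gap. Passing `approx_match` as the equality test shows how tolerance for abbreviations and typos could widen the set of strings a single frame covers.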