Header Metadata Extraction from Semi-structured Documents Using Template Matching

  • Zewu Huang
  • Hai Jin
  • Pingpeng Yuan
  • Zongfen Han
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4278)


With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for extracting header metadata from semi-structured documents stored in PDF. In our approach, templates are defined, and each document is treated as a sequence of formatted strings. The templates guide a finite state automaton (FSA) that extracts the header metadata of papers. Test results indicate that our approach extracts metadata effectively, requires no training, and remains applicable in some special situations. The approach can assist automatic index creation in fields such as digital libraries, information retrieval, and data mining.
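The idea of a template guiding an FSA over formatted strings can be illustrated with a minimal sketch. The abstract does not specify the template format, so everything below is an assumption made for illustration: the `Line` record, the font-size thresholds, the `PAPER_TEMPLATE` rules, and the `extract_header` function are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Line:
    """One text line plus layout information (hypothetical representation)."""
    text: str
    font_size: float


# A template is an ordered list of (field name, predicate) pairs.
# Each pair plays the role of one FSA state; the predicate decides
# whether a formatted line is accepted by that state.
Template = List[Tuple[str, Callable[[Line], bool]]]

# Assumed rules for a paper header: large-font title lines, then
# medium-font author lines, then a line starting with "Abstract".
PAPER_TEMPLATE: Template = [
    ("title", lambda ln: ln.font_size >= 14),
    ("authors", lambda ln: 10 <= ln.font_size < 14
        and not ln.text.lower().startswith("abstract")),
    ("abstract", lambda ln: ln.text.lower().startswith("abstract")),
]


def extract_header(lines: List[Line], template: Template) -> Dict[str, List[str]]:
    """Run the template as a forward-only automaton over the lines."""
    fields: Dict[str, List[str]] = {name: [] for name, _ in template}
    state = 0
    for ln in lines:
        # Stay in the current state if it accepts the line; otherwise
        # look for the nearest later state that accepts it.
        nxt = state
        while nxt < len(template) and not template[nxt][1](ln):
            nxt += 1
        if nxt == len(template):
            break  # no remaining state accepts this line: stop extracting
        state = nxt
        fields[template[state][0]].append(ln.text)
    return fields


# Sample input with made-up font sizes.
sample = [
    Line("Header Metadata Extraction from Semi-structured", 16),
    Line("Documents Using Template Matching", 16),
    Line("Zewu Huang, Hai Jin, Pingpeng Yuan, and Zongfen Han", 11),
    Line("Abstract. With the recent proliferation of documents", 9),
    Line("1 Introduction", 12),
]
result = extract_header(sample, PAPER_TEMPLATE)
title = " ".join(result["title"])
```

Because the rules live in the template rather than in a learned model, a sketch like this needs no training data; supporting a different layout would mean writing a different template.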


Keywords: Data Stream, Digital Library, Template Match, Finite State Automaton, Layout Information
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Zewu Huang (1)
  • Hai Jin (1)
  • Pingpeng Yuan (1)
  • Zongfen Han (1)
  1. Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, China