Semi-supervised Bibliographic Element Segmentation with Latent Permutations

  • Tomonari Masada
  • Atsuhiro Takasu
  • Yuichiro Shibata
  • Kiyoshi Oguri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7008)


This paper proposes a semi-supervised method for bibliographic element segmentation. Our input is a large-scale set of bibliographic references, each given as an unsegmented sequence of word tokens. The problem is to segment each reference into bibliographic elements, e.g., authors, title, journal, and pages. We solve this problem with an LDA-like topic model that assigns each word token to a topic so that the word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy a contiguity constraint, i.e., the word tokens assigned to the same topic should be contiguous. To this end, in our preceding work [8] we proposed a topic model based on the one devised by Chen et al. [3]. Our model extends LDA and realizes unsupervised topic assignments that satisfy the contiguity constraint. The main contribution of this paper is a semi-supervised learning method for that model. We assume that at most one third of the word tokens are already labeled and that a few percent of these labels may be incorrect. In our experiment, semi-supervised learning improved on unsupervised learning by a large margin and achieved a segmentation accuracy of over 90%.
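
The contiguity constraint described in the abstract can be illustrated with a minimal Python sketch. It only checks whether a given topic assignment is contiguous and reads the segments off it; it does not implement the proposed topic model or its inference. The function names, the example tokens, and the topic ids (0 = authors, 1 = title, 2 = journal, 3 = pages) are hypothetical and chosen purely for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def is_contiguous(topics: List[int]) -> bool:
    """Return True if every topic occupies a single unbroken run of tokens."""
    # Collapse consecutive duplicates; a topic that appears in two separate
    # runs shows up twice in run_labels.
    run_labels = [topic for topic, _ in groupby(topics)]
    return len(run_labels) == len(set(run_labels))


def segments(tokens: List[str], topics: List[int]) -> List[Tuple[int, List[str]]]:
    """Group tokens into (topic, tokens) segments following the topic runs."""
    if len(tokens) != len(topics):
        raise ValueError("tokens and topics must have the same length")
    result = []
    position = 0
    for topic, run in groupby(topics):
        length = len(list(run))
        result.append((topic, tokens[position:position + length]))
        position += length
    return result


# Hypothetical reference and topic assignments (illustration only).
tokens = ["Blei", "Ng", "Jordan", "Latent", "Dirichlet", "Allocation",
          "JMLR", "993", "1022"]
valid = [0, 0, 0, 1, 1, 1, 2, 3, 3]
invalid = [0, 0, 1, 0, 1, 1, 2, 3, 3]  # topic 0 resumes after topic 1

print(is_contiguous(valid))
print(is_contiguous(invalid))
print(segments(tokens, valid))
```

Under this view, a segmentation of a reference is fully determined by the order in which topics appear and by the length of each topic's run, which is why a contiguous assignment can be read directly as a sequence of bibliographic elements.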


Hidden Markov Model, Topic Model, Segmentation Accuracy, Word Token, Topic Assignment
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
  2. Connan, J., Omlin, C.W.: Bibliography Extraction with Hidden Markov Models. Technical Report US-CS-TR-00-6, University of Stellenbosch (2000)
  3. Chen, H., Branavan, S.R.K., Barzilay, R., Karger, D.R.: Global Models of Document Structure Using Latent Permutations. In: Proc. of North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT) 2009 Conference, pp. 371–379 (2009)
  4. Fligner, M.A., Verducci, J.S.: Distance Based Ranking Models. Journal of the Royal Statistical Society B 48(3), 359–369 (1986)
  5. Hetzner, E.: A Simple Method for Citation Metadata Extraction Using Hidden Markov Models. In: Proc. of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 280–284 (2008)
  6. Kramer, M., Kaprykowsky, H., Keysers, D., Breuel, T.M.: Bibliographic Meta-Data Extraction Using Probabilistic Finite State Transducers. In: Proc. of the 9th International Conference on Document Analysis and Recognition, pp. 609–613 (2007)
  7. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proc. of the Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)
  8. Masada, T., Shibata, Y., Oguri, K.: Unsupervised Segmentation of Bibliographic Elements with Latent Permutations. International Journal of Organizational and Collective Intelligence 2(2), 49–62 (2011)
  9. Sharifi, M.: Semi-supervised Extraction of Entity Attributes Using Topic Models. Master’s Thesis, Carnegie Mellon University (2009)
  10. Takasu, A.: Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model. In: Proc. of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 49–60 (2003)
  11. Yin, P., Zhang, M., Deng, Z.-H., Yang, D.-Q.: Metadata Extraction from Bibliographies Using Bigram HMM. In: Proc. of the 7th International Conference on Asian Digital Libraries, pp. 1–14 (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Tomonari Masada (1)
  • Atsuhiro Takasu (2)
  • Yuichiro Shibata (1)
  • Kiyoshi Oguri (1)

  1. Nagasaki University, Nagasaki-shi, Japan
  2. National Institute of Informatics, Chiyoda-ku, Japan
