Unsupervised Segmentation of Bibliographic Elements with Latent Permutations

  • Tomonari Masada
  • Yuichiro Shibata
  • Kiyoshi Oguri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6724)


This paper introduces a novel approach to large-scale unsupervised segmentation of bibliographic elements. The problem is to segment a word token sequence representing a citation into subsequences, each corresponding to a different bibliographic element, e.g. authors, paper title, journal name, or publication year. Each bibliographic element should be represented by contiguous word tokens; we call this the contiguity constraint. We therefore need to infer a sequence of assignments of word tokens to bibliographic elements that satisfies this constraint. Many HMM-based methods solve this problem by prescribing fixed transition patterns among bibliographic elements. In this paper, we use generalized Mallows models (GMM) in a Bayesian multi-topic model, an approach effectively applied to document structure learning by Chen et al. [4], and infer a permutation of latent topics, each of which can be interpreted as one of the bibliographic elements. According to the inferred permutation, we arrange the order of the draws from a multinomial distribution defined over topics. In this manner, we obtain an ordered sequence of topic assignments satisfying the contiguity constraint. We do not need to prescribe any transition patterns among bibliographic elements; we only need to specify the number of bibliographic elements. However, the method proposed by Chen et al. works for our problem only after modification. The main contribution of this paper is a set of strategies that make their method work for our problem as well.
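
The ordering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and covers only the generative side, not inference. It assumes the standard inversion-count parameterization of the GMM, in which each count v_j is drawn with probability proportional to exp(-rho_j * v_j). The function names and the parameters `theta` (multinomial over topics) and `rho` (GMM dispersion parameters) are hypothetical and chosen for illustration only.

```python
import math
import random


def sample_inversions(rho, rng):
    """Draw an inversion vector v for K = len(rho) + 1 topics.

    Assumption: v[j] takes values 0..K-1-j with P(v[j] = r) proportional
    to exp(-rho[j] * r), the usual GMM inversion-count form.
    """
    k = len(rho) + 1
    v = []
    for j, rho_j in enumerate(rho):
        support = list(range(k - j))
        weights = [math.exp(-rho_j * r) for r in support]
        v.append(rng.choices(support, weights=weights)[0])
    return v


def inversions_to_permutation(v):
    """Build the permutation in which topic j is preceded by exactly
    v[j] topics with a larger index."""
    k = len(v) + 1
    perm = [k - 1]
    # Insert topics from largest to smallest; the list holds only larger
    # topics at insertion time, so position v[j] fixes the inversion count.
    for j in range(k - 2, -1, -1):
        perm.insert(v[j], j)
    return perm


def contiguous_assignments(topic_draws, perm):
    """Reorder unordered topic draws so that topics follow the order given
    by perm; each topic then occupies one contiguous run of tokens."""
    rank = {topic: pos for pos, topic in enumerate(perm)}
    return sorted(topic_draws, key=rank.__getitem__)


def generate_topic_sequence(theta, rho, n_tokens, rng):
    """Sample a topic permutation and a contiguous topic assignment."""
    perm = inversions_to_permutation(sample_inversions(rho, rng))
    draws = rng.choices(range(len(theta)), weights=theta, k=n_tokens)
    return perm, contiguous_assignments(draws, perm)


if __name__ == "__main__":
    rng = random.Random(0)
    # Four hypothetical elements, e.g. authors, title, journal, year.
    perm, z = generate_topic_sequence([0.4, 0.3, 0.2, 0.1], [1.0, 1.0, 1.0], 12, rng)
    print("topic order:", perm)
    print("assignments:", z)
```

Because the draws are sorted by their rank in the permutation, every topic appears in a single run, which is how the contiguity constraint is met without prescribing transition patterns. Larger values of `rho` concentrate the permutation around the identity ordering.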




  1. Andrzejewski, D., Zhu, X., Craven, M.: Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors. In: Proc. of ICML (2009)
  2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. JMLR 3, 993–1022 (2003)
  3. Connan, J., Omlin, C.W.: Bibliography Extraction with Hidden Markov Models. Tech. Rep. US-CS-TR-00-6, Univ. of Stellenbosch (2000)
  4. Chen, H., Branavan, S.R.K., Barzilay, R., Karger, D.R.: Global Models of Document Structure Using Latent Permutations. In: Proc. of ACL, pp. 371–379 (2009)
  5. Eisenstein, J., Barzilay, R.: Bayesian Unsupervised Topic Segmentation. In: Proc. of EMNLP, pp. 334–343 (2008)
  6. Fligner, M.A., Verducci, J.S.: Distance Based Ranking Models. J. R. Statist. Soc. B 48(3), 359–369 (1986)
  7. Griffiths, T.L., Steyvers, M.: Finding Scientific Topics. Proc. of Natl. Acad. Sci. 101(suppl. 1), 5228–5235 (2004)
  8. Hetzner, E.: A Simple Method for Citation Metadata Extraction Using Hidden Markov Models. In: Proc. of JCDL, pp. 280–284 (2008)
  9. Minka, T.: Estimating a Dirichlet Distribution (2000),
  10. Takasu, A.: Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model. In: Proc. of JCDL, pp. 49–60 (2003)
  11. Yin, P., Zhang, M., Deng, Z.-H., Yang, D.-q.: Metadata extraction from bibliographies using bigram HMM. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, E.-p. (eds.) ICADL 2004. LNCS, vol. 3334, pp. 310–319. Springer, Heidelberg (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Tomonari Masada (1)
  • Yuichiro Shibata (1)
  • Kiyoshi Oguri (1)
  1. Nagasaki University, Nagasaki, Japan
