Semi-supervised Bibliographic Element Segmentation with Latent Permutations

Masada, Tomonari; Takasu, Atsuhiro; Shibata, Yuichiro; Oguri, Kiyoshi

doi:10.1007/978-3-642-24826-9_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7008)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

This paper proposes a semi-supervised method for bibliographic element segmentation. Our input is a large-scale set of bibliographic references, each given as an unsegmented sequence of word tokens. The problem is to segment each reference into bibliographic elements, e.g., authors, title, journal, and pages. We solve this problem with an LDA-like topic model by assigning each word token to a topic so that word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy a contiguity constraint, i.e., the word tokens assigned to the same topic must be contiguous. To this end, we proposed a topic model in our preceding work [8], based on the topic model devised by Chen et al. [3]. That model extends LDA and realizes unsupervised topic assignments that satisfy the contiguity constraint. The main contribution of this paper is a semi-supervised learning method for that model. We assume that at most one third of the word tokens are already labeled and that a few percent of those labels may be incorrect. In our experiment, semi-supervised learning improved on unsupervised learning by a large margin and achieved a segmentation accuracy of over 90%.
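The contiguity constraint described in the abstract can be illustrated with a small sketch. This is not the authors' model or inference procedure; it only checks whether a given per-token topic assignment is contiguous and, if so, reads off the resulting segments. The function names, the topic labels, and the sample reference are hypothetical and chosen for illustration.

```python
from itertools import groupby


def is_contiguous(topics):
    """Return True if every topic label occupies exactly one unbroken run."""
    # Collapse consecutive repeats; a label that reappears later breaks contiguity.
    runs = [label for label, _ in groupby(topics)]
    return len(runs) == len(set(runs))


def segments(tokens, topics):
    """Group tokens into (topic, tokens) pairs along the run boundaries."""
    out = []
    start = 0
    for label, run in groupby(topics):
        length = len(list(run))
        out.append((label, tokens[start:start + length]))
        start += length
    return out


# Hypothetical reference, tokenized by whitespace, with hypothetical topic labels.
tokens = ["Blei", "Ng", "Jordan", "Latent", "Dirichlet", "Allocation", "2003"]
topics = ["author", "author", "author", "title", "title", "title", "year"]

if is_contiguous(topics):
    for label, words in segments(tokens, topics):
        print(label, words)
```

An assignment such as author, title, author would fail the check, because the author topic is split into two runs; a model that enforces the constraint never produces such an assignment.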

References

[1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
[2] Connan, J., Omlin, C.W.: Bibliography Extraction with Hidden Markov Models. Technical Report US-CS-TR-00-6, University of Stellenbosch (2000)
[3] Chen, H., Branavan, S.R.K., Barzilay, R., Karger, D.R.: Global Models of Document Structure Using Latent Permutations. In: Proc. of North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT) 2009 Conference, pp. 371–379 (2009)
[4] Fligner, M.A., Verducci, J.S.: Distance Based Ranking Models. Journal of the Royal Statistical Society B 48(3), 359–369 (1986)
[5] Hetzner, E.: A Simple Method for Citation Metadata Extraction Using Hidden Markov Models. In: Proc. of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 280–284 (2008)
[6] Kramer, M., Kaprykowsky, H., Keysers, D., Breuel, T.M.: Bibliographic Meta-Data Extraction Using Probabilistic Finite State Transducers. In: Proc. of the 9th International Conference on Document Analysis and Recognition, pp. 609–613 (2007)
[7] Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proc. of the Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)
[8] Masada, T., Shibata, Y., Oguri, K.: Unsupervised Segmentation of Bibliographic Elements with Latent Permutations. International Journal of Organizational and Collective Intelligence 2(2), 49–62 (2011)
[9] Sharifi, M.: Semi-supervised Extraction of Entity Attributes Using Topic Models. Master’s Thesis, Carnegie Mellon University (2009)
[10] Takasu, A.: Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model. In: Proc. of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 49–60 (2003)
[11] Yin, P., Zhang, M., Deng, Z.-H., Yang, D.-Q.: Metadata Extraction from Bibliographies Using Bigram HMM. In: Proc. of the 7th International Conference on Asian Digital Libraries, pp. 1–14 (2004)

Author information

Authors and Affiliations

Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki, Japan
Tomonari Masada, Yuichiro Shibata & Kiyoshi Oguri
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
Atsuhiro Takasu

Editor information

Editors and Affiliations

Information Science and Technology Building, Tsinghua University, 100084, Beijing, P.R. China
Chunxiao Xing
Faculty of Informatics, University of Lugano, 6900, Lugano, Switzerland
Fabio Crestani
Institute of Software Technology and Interactive Systems, Vienna University of Technology, 1040, Vienna, Austria
Andreas Rauber

About this paper

Cite this paper

Masada, T., Takasu, A., Shibata, Y., Oguri, K. (2011). Semi-supervised Bibliographic Element Segmentation with Latent Permutations. In: Xing, C., Crestani, F., Rauber, A. (eds) Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation. ICADL 2011. Lecture Notes in Computer Science, vol 7008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24826-9_11

DOI: https://doi.org/10.1007/978-3-642-24826-9_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24825-2
Online ISBN: 978-3-642-24826-9
eBook Packages: Computer Science, Computer Science (R0)
