Empirical study of constructing a knowledge organization system of patent documents using topic modeling

Hu, Zhengyin; Fang, Shu; Liang, Tian

doi:10.1007/s11192-014-1328-1

Published: 04 June 2014

Volume 100, pages 787–799, (2014)

Scientometrics

Zhengyin Hu¹,²,
Shu Fang¹ &
Tian Liang¹

1459 Accesses
36 Citations

Abstract

A knowledge organization system (KOS) can reveal the deep knowledge structure of a set of patent documents. Compared with classification code systems, a personalized KOS made up of topics can represent technology information in a more agile and detailed manner. This paper presents an approach for automatically constructing a KOS of patent documents based on term clumping, the Latent Dirichlet Allocation (LDA) model, K-Means clustering, and Principal Components Analysis (PCA). Term clumping is used to generate a better bag-of-words for topic modeling, and the LDA model is applied to generate raw topics. Then, by iteratively applying K-Means clustering and PCA to the document set and the topics matrix, we generate new upper topics and compute the relationships between topics to construct the KOS. Finally, documents are mapped to the KOS. The nodes of the KOS are topics, each represented by terms and their weights, and the leaves are patent documents. We evaluated the approach in an empirical study on a set of Large Aperture Optical Elements (LAOE) patent documents and constructed an LAOE KOS. The method discovered deep semantic relationships between the topics and helped better describe the technology themes of LAOE. Based on the KOS, two types of applications were implemented: automatic classification of patent documents and categorical refinement of search results.
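
The step in which raw topics are repeatedly clustered into upper topics can be illustrated with a small sketch. This is not the authors' implementation: it assumes the raw topics are already available as term-weight vectors (for example, from an LDA run), it leaves out term clumping, LDA, and PCA, and the vocabulary, weights, and function names (`kmeans`, `build_levels`) are invented for illustration only.

```python
from typing import List, Tuple

Vector = List[float]


def euclid(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans(vectors: List[Vector], k: int, iters: int = 50) -> Tuple[List[int], List[Vector]]:
    """Lloyd's K-Means with a deterministic farthest-point initialisation."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next seed: the vector farthest from its nearest existing center.
        seed = max(vectors, key=lambda v: min(euclid(v, c) for c in centers))
        centers.append(list(seed))
    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: euclid(v, centers[j])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def build_levels(topic_vectors: List[Vector], ks: List[int]) -> Tuple[List[List[int]], List[Vector]]:
    """Cluster topics into upper topics, then cluster those, one level per k in ks."""
    levels = []
    current = topic_vectors
    for k in ks:
        labels, centers = kmeans(current, k)
        levels.append(labels)  # labels[i] is the parent index of node i at this level
        current = centers      # upper topics (centroids) feed the next round
    return levels, current


# Hypothetical toy data: four raw topics over a six-term vocabulary.
vocab = ["lens", "mirror", "polish", "coating", "laser", "damage"]
raw_topics = [
    [0.5, 0.4, 0.1, 0.0, 0.0, 0.0],
    [0.4, 0.5, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.4, 0.4],
    [0.0, 0.0, 0.1, 0.1, 0.4, 0.4],
]

levels, upper = build_levels(raw_topics, ks=[2])
print(levels)
for topic in upper:
    # Show each upper topic as its terms sorted by averaged weight.
    print(sorted(zip(vocab, topic), key=lambda tw: -tw[1])[:3])
```

Each upper topic here is simply the centroid of its member topics, so it is again a vector of term weights and can be clustered in a further round by passing more values in `ks`. How the paper actually combines K-Means with PCA and derives topic relationships is not specified in the abstract, so this sketch should not be read as reproducing it.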

Abbreviations

DII: Derwent Innovations Index
KOS: Knowledge Organization System
LAOE: Large Aperture Optical Elements
LDA: Latent Dirichlet Allocation
MALLET: MAchine Learning for LanguagE Toolkit (a machine learning toolkit developed by Andrew McCallum et al. at the University of Massachusetts Amherst)
NLP: Natural Language Processing
PCA: Principal Components Analysis

Author information

Authors and Affiliations

Chengdu Document and Information Center, Chinese Academy of Sciences, No.16 South Sec.2 Yihuan Rd., Chengdu, 610041, China
Zhengyin Hu, Shu Fang & Tian Liang
University of Chinese Academy of Sciences, No.19A Yuquan Rd., Beijing, 100049, China
Zhengyin Hu

Corresponding author

Correspondence to Zhengyin Hu.

About this article

Cite this article

Hu, Z., Fang, S. & Liang, T. Empirical study of constructing a knowledge organization system of patent documents using topic modeling. Scientometrics 100, 787–799 (2014). https://doi.org/10.1007/s11192-014-1328-1

Received: 22 April 2014
Published: 04 June 2014
Issue Date: September 2014
DOI: https://doi.org/10.1007/s11192-014-1328-1
