Empirical study of constructing a knowledge organization system of patent documents using topic modeling


A knowledge organization system (KOS) can help reveal the deep knowledge structure of a patent document set. Compared to classification code systems, a personalized KOS made up of topics can represent technology information in a more agile and detailed manner. This paper presents an approach to automatically construct a KOS of patent documents based on term clumping, the Latent Dirichlet Allocation (LDA) model, K-Means clustering and Principal Components Analysis (PCA). Term clumping is adopted to generate a better bag-of-words for topic modeling, and the LDA model is applied to generate raw topics. Then, by iteratively applying K-Means clustering and PCA to the document set and the topics matrix, we generate new upper topics and compute the relationships between topics to construct a KOS. Finally, documents are mapped to the KOS. The nodes of the KOS are topics, each represented by terms and their weights, and the leaves are patent documents. We evaluated the approach in an empirical study on a set of Large Aperture Optical Elements (LAOE) patent documents and constructed the LAOE KOS. The method discovered the deep semantic relationships between the topics and helped to better describe the technology themes of LAOE. Based on the KOS, two types of applications were implemented: the automatic classification of patent documents and the categorical refinement of search results.
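The KOS structure described above (upper topics as nodes, raw topics beneath them, patent documents as leaves) can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary, the topic-term and document-topic weights, and the names `kmeans` and `build_kos` are hypothetical stand-ins for LDA output, the PCA step is omitted, and each upper topic is simply taken as the centroid of a K-Means cluster of raw topics.

```python
from math import dist

# Hypothetical toy data standing in for LDA output (not from the paper).
VOCAB = ["lens", "mirror", "polish", "grind", "coating", "film"]
TOPIC_TERM = {  # raw topic -> weight of each vocabulary term
    "t0": [0.50, 0.40, 0.05, 0.03, 0.01, 0.01],
    "t1": [0.45, 0.45, 0.04, 0.04, 0.01, 0.01],
    "t2": [0.02, 0.03, 0.50, 0.43, 0.01, 0.01],
    "t3": [0.01, 0.02, 0.05, 0.02, 0.50, 0.40],
    "t4": [0.02, 0.01, 0.03, 0.04, 0.42, 0.48],
}
DOC_TOPIC = {  # document -> weight of each raw topic
    "patent_A": {"t0": 0.7, "t2": 0.3},
    "patent_B": {"t3": 0.6, "t4": 0.4},
    "patent_C": {"t2": 0.9, "t1": 0.1},
}


def mean(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(points, k, iters=20):
    """Plain K-Means with deterministic farthest-point initialisation."""
    names = list(points)
    centroids = [points[names[0]]]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        far = max(names, key=lambda n: min(dist(points[n], c) for c in centroids))
        centroids.append(points[far])
    assign = {}
    for _ in range(iters):
        new_assign = {
            n: min(range(k), key=lambda j: dist(points[n], centroids[j]))
            for n in names
        }
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for j in range(k):
            members = [points[n] for n in names if assign[n] == j]
            if members:
                centroids[j] = mean(members)
    return assign, centroids


def build_kos(topic_term, doc_topic, k, top_n=2):
    """Group raw topics into k upper topics and attach documents as leaves."""
    assign, centroids = kmeans(topic_term, k)
    kos = {}
    for j, centroid in enumerate(centroids):
        # An upper topic is described by its highest-weighted terms.
        ranked = sorted(zip(VOCAB, centroid), key=lambda tw: -tw[1])[:top_n]
        kos[f"upper_{j}"] = {
            "terms": [(term, round(w, 3)) for term, w in ranked],
            "topics": {t: [] for t, c in assign.items() if c == j},
        }
    for doc, weights in doc_topic.items():
        leaf = max(weights, key=weights.get)  # dominant raw topic of the document
        kos[f"upper_{assign[leaf]}"]["topics"][leaf].append(doc)
    return kos


kos = build_kos(TOPIC_TERM, DOC_TOPIC, k=3)
for name, node in kos.items():
    print(name, node["terms"], node["topics"])
```

Assigning each document to its single dominant topic is a simplification chosen for brevity; a document could equally be attached to every topic whose weight exceeds a threshold.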



DII: Derwent Innovations Index
KOS: Knowledge Organization System
LAOE: Large Aperture Optical Elements
LDA: Latent Dirichlet Allocation
MALLET: MAchine Learning for LanguagE Toolkit (a toolkit for machine learning developed by Andrew McCallum et al. at the University of Massachusetts Amherst)
NLP: Natural Language Processing
PCA: Principal Components Analysis


  • Almeida, J., Barbosa, L., Pais, A., & Formosinho, S. (2007). Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering. Chemometrics and Intelligent Laboratory Systems, 87, 208–217.

  • Blei, D. M. (2011). Probabilistic Topic Models. Resource document. Department of Computer Science of Princeton University. Accessed 6 March 2012.

  • Blei, D. M., Griffiths, T. L., & Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2). doi:10.1145/1667053.1667056.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Dietz, L., & Stewart, A. (2006). Utilize Probabilistic Topic Models to Enrich Knowledge Bases. Resource document. Fraunhofer Integrated Publication and Information Systems Institute (IPSI). Accessed 6 March 2012.

  • Griffiths, T. L., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101(suppl.1), 5228–5235.

  • Hodge, G. (2000). Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. Resource document. The Digital Library Federation of Council on Library and Information Resources. Accessed 10 November 2012.

  • Davidson, I., & Ravi, S. S. (2005). Towards Efficient and Improved Hierarchical Clustering with Instance and Cluster Level Constraints. The CiteSeerX Resources. Accessed 22 May 2013.

  • Kleinsorge, R., Willis, J., & Emrick, S. (2007). AMIA 2007 Tutorial T12 UMLS® Overview. Resource document. National Library of Medicine in National Institutes of Health. Accessed 8 March 2012.

  • Punera, K., Rajan, S., & Ghosh, J. (2006). Automatic Construction of N-ary Tree Based Taxonomies. Data Mining Workshops, 2006. Hong Kong, pp. 75–79.

  • Kvarv, G. S. (2007). Ontology Learning: Suggesting Associations from Text. Master Dissertation. Norwegian University of Science and Technology, pp.87.

  • McCallum, A. K. (2002). MALLET: A Machine Learning for Language Toolkit. Open Source Software. University of Massachusetts Amherst. Accessed 2 December 2011.

  • Mimno, D. (2011). Machine Learning with MALLET. Resource document. Information Extraction and Synthesis Laboratory, Department of CS UMass, Amherst. Accessed 2 March 2012.

  • Liu, X., Song, Y., Liu, S., & Wang, H. (2012). Automatic taxonomy construction from keywords. KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 1433–1441.

  • Zhang, Y., Porter, A. L., & Hu, Z. Y. (2012). An Inductive Method for “Term Clumping”: A Case Study on Dye-Sensitized Solar Cells, the International Conference on Innovative Methods for Innovation Management and Policy, Beijing, P.R.China, May 21–25.

Author information


Corresponding author

Correspondence to Zhengyin Hu.

About this article

Cite this article

Hu, Z., Fang, S. & Liang, T. Empirical study of constructing a knowledge organization system of patent documents using topic modeling. Scientometrics 100, 787–799 (2014).

  • Topic model
  • Term clumping
  • Knowledge organization system
  • Text clustering
  • Principal Component Analysis