
Patent document clustering with deep embeddings


The analysis of scientific and technical documents is crucial to establishing science and technology strategies. One popular approach is for field experts to manually classify each document into one of several predefined technical categories. However, manual classification is error-prone and expensive, and it demands sustained effort to keep up with frequent data updates. In contrast, machine learning and text mining techniques enable cheaper and faster operation and can reduce the burden on human resources. In this paper, we propose a method that extracts embedded feature vectors by applying a neural embedding approach to the text features of patent documents, and then automatically clusters those embeddings using a deep embedding clustering method.
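The clustering step can be illustrated with a minimal sketch. It is not the implementation from the paper: it assumes the document embeddings are already available (random Gaussian blobs stand in for them here), and it shows only the Student's t soft assignment and the sharpened target distribution commonly used in deep embedding clustering, without the encoder network or any training loop. The names `soft_assign`, `target_distribution`, and `kl_divergence`, the toy data, and the choice of initial centroids are all hypothetical.

```python
import numpy as np


def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of each embedding to each centroid."""
    # Squared Euclidean distance between every embedding and every centroid.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalise so that each row is a probability distribution over clusters.
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened targets p_ij: square q, divide by the soft cluster frequency."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_divergence(p, q):
    """KL(P || Q) summed over all documents and clusters."""
    return float(np.sum(p * np.log(p / q)))


# Hypothetical stand-in for document embeddings: two Gaussian blobs in 8-D.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(20, 8)),
    rng.normal(loc=2.0, scale=0.5, size=(20, 8)),
])

# Illustrative initial centroids (assumption: taken from the known blobs;
# k-means on the embeddings is a common choice in practice).
centroids = np.stack([embeddings[:20].mean(axis=0),
                      embeddings[20:].mean(axis=0)])

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
labels = q.argmax(axis=1)    # hard cluster label per document
loss = kl_divergence(p, q)   # the quantity a training loop would minimise
```

In a full pipeline, the gradient of `loss` would update both the encoder and the centroids, and `p` would be recomputed periodically. The sketch stops before that step.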











This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (Nos. NRF-2015R1C1A1A01056185 and 2018R1D1A1B07045825).

Author information



Corresponding authors

Correspondence to Eunjeong Park or Sungchul Choi.

Additional information

Eunjeong Park and Sungchul Choi are co-corresponding authors.


About this article


Cite this article

Kim, J., Yoon, J., Park, E. et al. Patent document clustering with deep embeddings. Scientometrics 123, 563–577 (2020).



Keywords

  • Information embedding
  • Patent clustering
  • Deep learning
  • Text mining

Mathematics Subject Classification

  • 68U15