The quantitative analysis of digitized historical documents has begun in earnest in recent years. Text classification is particularly important for quantitative historical analysis because it makes literature searches more efficient and helps identify the important subjects of a particular era. Although numerous historians have worked together to classify large-scale historical documents, consistent classification across individual researchers has not been achieved. In this study, we present a classification method for large-scale historical data that uses a recently developed supervised learning algorithm called the Hierarchical Attention Network (HAN). By applying various classification methods to the Annals of the Joseon Dynasty (AJD), we show that HAN is more accurate than conventional techniques based on word-frequency features. HAN quantifies the extent to which a particular sentence or word contributes to the classification through a value called "attention". Using the attention mechanism, we extract representative keywords from various categories and show how these keywords evolve over the 472-year span of the AJD. Our results reveal that the event categories in the AJD fall broadly into two groups. In one group, the representative keywords of the categories remained stable over long periods, whereas in the other group the keywords varied rapidly, indicating that the character of those categories changed repeatedly. Observing such macroscopic changes in representative words may provide insight into how a particular topic changes over a historical period.
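To make the role of "attention" more concrete, the following is a minimal sketch of word-level attention pooling in the general style of a Hierarchical Attention Network. It is not the authors' implementation: the function names (`attention_pool`, `top_keywords`), the toy hidden states, the projection `W`, the bias `b`, and the context vector are all hypothetical values chosen for illustration. In a trained model these quantities would be learned, and the hidden states would come from a recurrent encoder.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, W, b, context):
    """Return (attention weights, attention-weighted sum of hidden states).

    hidden_states: one vector per word (assumed encoder outputs)
    W, b:          hypothetical projection matrix and bias
    context:       hypothetical context vector used to score each word
    """
    scores = []
    for h in hidden_states:
        # u = tanh(W h + b): projected representation of the word
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        # score = u . context: similarity to the context vector
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, context)))
    weights = softmax(scores)  # attention weights, non-negative, sum to 1
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_states))
              for k in range(dim)]
    return weights, pooled

def top_keywords(words, weights, k=2):
    """Rank words by attention weight and return the k highest."""
    ranked = sorted(zip(words, weights), key=lambda p: p[1], reverse=True)
    return [w for w, _ in ranked[:k]]

# Toy example with made-up numbers (not data from the AJD).
words = ["the", "troops", "marched"]
hidden = [[0.1, 0.3], [2.0, -0.5], [0.5, 0.2]]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for readability
b = [0.0, 0.0]
context = [1.0, 0.0]

weights, pooled = attention_pool(hidden, W, b, context)
print(top_keywords(words, weights, k=2))
```

Ranking words by their attention weight, as `top_keywords` does here, is one simple way to read off candidate keywords; how the weights are aggregated across documents and time periods is a separate design choice that this sketch does not address.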
The ratio of people’s names to verbs and nouns in each category is as follows: Royal 0.11, Military 0.13, Diplomacy 0.18, Finance 0.10, Agriculture 0.10, Science 0.01, Politics 0.58, Administration 0.30, Personnel 0.60, Jurisdiction 0.51, Rebellion 0.60, Philosophy 0.25 and History 0.42.
This work was supported by the National Research Foundation of Korea (Grant No. 2017R1A2B3006930).
Kim, DK., Lee, B., Kim, D. et al. Multi-Label Classification of Historical Documents by Using Hierarchical Attention Networks. J. Korean Phys. Soc. 76, 368–377 (2020). https://doi.org/10.3938/jkps.76.368
Keywords: Deep learning, Recurrent neural network, Text analysis, Big data