Abstract
The recent informatization of Chinese courts has produced a large body of digitally stored cases and judgment documents, which provides a good foundation for research on judicial big data and machine learning. Several court tasks could be automated or improved with machine learning, such as recommending similar documents, evaluating workload based on the similarity of judgment documents, and predicting possibly relevant statutes. To support these tasks, and taking into account the characteristics of Chinese judgment documents, we propose a topic-model-based approach to measuring the text similarity of Chinese judgment documents. The approach builds on TF-IDF, Latent Dirichlet Allocation (LDA), Labeled Latent Dirichlet Allocation (LLDA), and other treatments. We focus on the specific steps of the approach, the preprocessing of the corpus, the choice of training parameters, and the evaluation of the similarity measure. We then apply the approach to the prediction of possibly relevant statutes and, using prediction accuracy as the evaluation metric, design experiments to demonstrate that the design decisions are reasonable and that the approach performs well on text similarity measurement. The experiments also reveal limitations of our approach that need to be addressed in future work.
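To make the evaluation idea concrete, the following is a minimal sketch of measuring document similarity and using it to suggest statutes. It is not the authors' pipeline: it covers only a TF-IDF stage with cosine similarity and omits the LDA and LLDA stages entirely. The nearest-neighbour voting in `predict_statutes`, the function names, the English placeholder tokens, and the statute labels are all assumptions made for illustration. Real Chinese judgment documents would first need word segmentation, which is assumed to have been done already.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per doc
    # Smoothed IDF, so a term that occurs in every document keeps a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_statutes(query_index, vectors, statutes, k=2):
    """Hypothetical predictor: rank statutes cited by the k most similar documents,
    weighting each statute by the similarity of the document that cites it."""
    sims = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    sims.sort(reverse=True)
    scores = Counter()
    for sim, i in sims[:k]:
        for statute in statutes[i]:
            scores[statute] += sim
    return [statute for statute, _ in scores.most_common()]


# Toy, pre-segmented corpus with made-up statute labels (illustrative only).
docs = [
    ["theft", "property", "defendant", "sentence"],
    ["theft", "property", "stolen", "defendant"],
    ["contract", "breach", "damages", "plaintiff"],
    ["contract", "loan", "plaintiff", "interest"],
]
statutes = [["Art. A"], ["Art. A"], ["Art. B"], ["Art. B", "Art. C"]]

vectors = tfidf_vectors(docs)
predicted = predict_statutes(0, vectors, statutes, k=1)
```

Comparing `predicted` with the statutes actually cited by the held-out document gives a prediction accuracy, which is the kind of indirect metric the abstract uses to judge a similarity measure. A topic model would replace the TF-IDF vectors with per-document topic distributions while leaving the comparison step structurally the same.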
Acknowledgement
This work was supported by the Key Program of Research and Development of China (2016YFC0800803).
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Y. et al. (2017). Topic Model Based Text Similarity Measure for Chinese Judgment Document. In: Zou, B., Han, Q., Sun, G., Jing, W., Peng, X., Lu, Z. (eds) Data Science. ICPCSEE 2017. Communications in Computer and Information Science, vol 728. Springer, Singapore. https://doi.org/10.1007/978-981-10-6388-6_4
DOI: https://doi.org/10.1007/978-981-10-6388-6_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6387-9
Online ISBN: 978-981-10-6388-6
eBook Packages: Computer Science, Computer Science (R0)