Topic Model Based Text Similarity Measure for Chinese Judgment Document

Wang, Yue; Ge, Jidong; Zhou, Yemao; Feng, Yi; Li, Chuanyi; Li, Zhongjin; Zhou, Xiaoyu; Luo, Bin

doi:10.1007/978-981-10-6388-6_4


Part of the book series: Communications in Computer and Information Science (CCIS, volume 728)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators


Abstract

The recent informatization of Chinese courts has produced a large volume of digitally stored cases and judgment documents, which provides a good foundation for research on judicial big data and machine learning. Several tasks in Chinese courts could be automated or improved through machine learning, such as recommending similar documents, evaluating workload based on the similarity of judgment documents, and predicting possibly relevant statutes. To support these tasks, and taking the characteristics of Chinese judgment documents into account, we propose a topic model based approach to measuring the text similarity of Chinese judgment documents. The approach builds on TF-IDF, Latent Dirichlet Allocation (LDA), Labeled Latent Dirichlet Allocation (LLDA), and other treatments. We focus on the specific steps of the approach, the preprocessing of the corpus, the choice of training parameters, and the evaluation of the similarity measure. We apply the approach to the prediction of possibly relevant statutes and use prediction accuracy as the evaluation metric; the experiments we designed demonstrate that the design decisions are reasonable and that the approach performs well on text similarity measurement. The experiments also reveal limitations of the approach, which need to be addressed in future work.
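
The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea under stated assumptions: documents are already segmented into tokens, similarity is plain cosine similarity over TF-IDF vectors, and statutes are predicted by a similarity-weighted vote among the most similar labeled documents. The LDA and LLDA stages named in the abstract are not implemented here; in a fuller version the TF-IDF vectors would presumably be replaced by topic distributions. The function names, the toy corpus, and the labels `statute_A` and `statute_B` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF (assumption) so terms present in every document keep a nonzero weight.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_statutes(query_vec, corpus_vecs, corpus_labels, k=2):
    """Rank statutes by a similarity-weighted vote among the k most similar documents."""
    ranked = sorted(
        ((cosine(query_vec, vec), labels) for vec, labels in zip(corpus_vecs, corpus_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = Counter()
    for similarity, labels in ranked[:k]:
        if similarity > 0.0:  # ignore neighbours that share no terms with the query
            for label in labels:
                scores[label] += similarity
    return [label for label, _ in scores.most_common()]


# Hypothetical toy corpus: English tokens stand in for segmented Chinese words.
docs = [
    ["theft", "property", "night", "burglary"],
    ["theft", "property", "shop", "stolen"],
    ["loan", "contract", "interest", "repayment"],
    ["loan", "contract", "default", "guarantee"],
]
labels = [["statute_A"], ["statute_A"], ["statute_B"], ["statute_B"]]
query = ["stolen", "property", "night"]

# Vectorize the query together with the corpus so all vectors share one IDF table.
vectors = tfidf_vectors(docs + [query])
predicted = predict_statutes(vectors[-1], vectors[:-1], labels, k=2)
print(predicted)
```

Prediction accuracy, the evaluation metric named in the abstract, could then be estimated by comparing such predicted labels against the statutes actually cited in held-out documents.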



Acknowledgement

This work was supported by the Key Program of Research and Development of China (2016YFC0800803).

Author information

Authors and Affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China
Yue Wang, Jidong Ge, Yemao Zhou, Yi Feng, Chuanyi Li, Zhongjin Li, Xiaoyu Zhou & Bin Luo
Software Institute, Nanjing University, Nanjing, 210093, China
Yue Wang, Jidong Ge, Yemao Zhou, Yi Feng, Chuanyi Li, Zhongjin Li, Xiaoyu Zhou & Bin Luo


Corresponding author

Correspondence to Jidong Ge.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Beiji Zou
Harbin Engineering University, Harbin, China
Qilong Han
Harbin University of Science and Technology, Harbin, China
Guanglu Sun
Northeast Forestry University, Harbin, China
Weipeng Jing
Huaihua University, Huaihua, Hunan, China
Xiaoning Peng
Sciences of Country Tripod Institute of Data Science, Harbin, China
Zeguang Lu


About this paper

Cite this paper

Wang, Y. et al. (2017). Topic Model Based Text Similarity Measure for Chinese Judgment Document. In: Zou, B., Han, Q., Sun, G., Jing, W., Peng, X., Lu, Z. (eds) Data Science. ICPCSEE 2017. Communications in Computer and Information Science, vol 728. Springer, Singapore. https://doi.org/10.1007/978-981-10-6388-6_4


DOI: https://doi.org/10.1007/978-981-10-6388-6_4
Published: 16 September 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6387-9
Online ISBN: 978-981-10-6388-6
eBook Packages: Computer Science, Computer Science (R0)
