Abstract
With the development of information technology and the growing volume of information on the network, people can search for and obtain the information they need more and more easily. However, quickly locating the required information within this massive volume remains an important problem. Information retrieval technology emerged in response, and keyword extraction is one of its important supporting technologies. Currently, the most widely used keyword extraction technique is the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. Its basic principle is to weight each word by how often it occurs in a document and by how few documents in the collection contain it, then rank the words by this weight and select the top few as keywords. The TF-IDF algorithm is simple and highly reliable, but it also has deficiencies. This paper analyzes the shortcomings of an improved TF-IDF algorithm and optimizes it from the point of view of information theory. It uses information entropy and relative entropy as calculation factors, adds them to the improved TF-IDF algorithm to optimize its performance, and verifies the result through simulation experiments.
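The ranking idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the function names (`tf_idf`, `top_keywords`, `entropy`, `relative_entropy`) are hypothetical, the IDF form log(N / df) is one common variant assumed here, and the abstract does not say how the paper combines the entropy terms with the TF-IDF weight, so the last two functions only give the standard textbook definitions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes the plain variant: tf = count / doc length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(weights, k=3):
    """Pick the k highest-weighted terms; ties are broken alphabetically."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def entropy(probs):
    """Shannon information entropy in bits: H(p) = -sum p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_entropy(p, q):
    """Relative entropy (KL divergence) in bits: sum p_i * log2(p_i / q_i).

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

With this variant, a word that appears in every document receives weight zero regardless of how often it occurs, which is one of the behaviours an entropy-based factor could refine.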
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cheng, L., Yang, Y., Zhao, K., Gao, Z. (2020). Research and Improvement of TF-IDF Algorithm Based on Information Theory. In: Liu, Q., Mısır, M., Wang, X., Liu, W. (eds) The 8th International Conference on Computer Engineering and Networks (CENet2018). CENet2018 2018. Advances in Intelligent Systems and Computing, vol 905. Springer, Cham. https://doi.org/10.1007/978-3-030-14680-1_67
DOI: https://doi.org/10.1007/978-3-030-14680-1_67
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14679-5
Online ISBN: 978-3-030-14680-1
eBook Packages: Intelligent Technologies and Robotics (R0)