A Technique to Find Out Low Frequency Rare Words in Medical Cancer Text Document Classification

Patel, Falguni N.; Shah, Hitesh B.; Shah, Shishir

doi:10.1007/978-981-16-8403-6_11


Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 106)

Abstract

The volume of digital medical documents is increasing day by day, creating a need for automatic text document classification. Medical researchers, doctors, and the wider medical community search for and classify the documents relevant to them. These documents can be medical research papers, articles, reports, surveys, etc. In this paper, we investigate how traditional classification methods, when applied to medical data, remove rare low-frequency words, and how that removal degrades classifier performance. We find that rare words are important in the medical domain and study existing methods for finding them. The available methods use a fixed threshold value, derived from a statistical calculation, for every dataset or sample collection. We therefore propose a method for finding rare words that uses a dynamic threshold calculation based on term frequency and inverse document frequency, combined with matching against medical dictionary words. We took two real medical text datasets and applied three text classifiers: kNN, NB, and SVM. The results show that our method finds the right rare words. Using only the rare words gives the same or nearly the same classification accuracy as using all features. The results also show that removing rare words degrades classifier performance in most cases, specifically in the medical domain.
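
The abstract does not give the threshold formula, so the following is only a minimal sketch of the general idea, under stated assumptions: the thresholds are taken to be the corpus means of term frequency and inverse document frequency (recomputed for each corpus rather than fixed in advance), and the medical dictionary is a plain set of words. The function name `find_rare_words`, the mean-based thresholds, and the toy data are all hypothetical and are not taken from the chapter.

```python
import math
from collections import Counter


def find_rare_words(docs, medical_terms):
    """Return terms that are infrequent in this corpus and appear in a medical word list.

    The mean-based thresholds are illustrative assumptions, not the chapter's formula.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Corpus-wide term frequency and per-term document frequency.
    tf = Counter(tok for toks in tokenized for tok in toks)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    if not tf:
        return []
    n_docs = len(tokenized)
    idf = {term: math.log(n_docs / df[term]) for term in df}

    # "Dynamic" thresholds: derived from the corpus at hand (assumed: simple means).
    tf_threshold = sum(tf.values()) / len(tf)
    idf_threshold = sum(idf.values()) / len(idf)

    # Keep low-frequency, high-IDF terms that also match the medical dictionary.
    return sorted(
        term for term in tf
        if tf[term] <= tf_threshold
        and idf[term] >= idf_threshold
        and term in medical_terms
    )


# Toy corpus and word list, invented for illustration only.
docs = [
    "the patient has a tumor in the lung",
    "the patient reports a cough",
    "the biopsy confirmed carcinoma in the patient",
]
medical_terms = {"tumor", "carcinoma", "biopsy", "cough", "lung"}
rare = find_rare_words(docs, medical_terms)
print(rare)
```

On this toy corpus, common words such as "the" and "patient" fall outside the thresholds, and low-frequency non-medical words such as "confirmed" are filtered out by the dictionary match. A real pipeline would need proper tokenization and a curated medical word list.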



Author information

Authors and Affiliations

GTU, Ahmedabad, Gujarat, India
Falguni N. Patel
Department of EC, GCET, Ahmedabad, Gujarat, India
Hitesh B. Shah
University of Houston, Houston, USA
Shishir Shah


Editor information

Editors and Affiliations

Department of Electronics and Communication Engineering, National Institute of Technology, Kurukshetra, Kurukshetra, India
Pankaj Verma
Department of Electronics and Communication Engineering, National Institute of Technology, Kurukshetra, Kurukshetra, India
Chhagan Charan
Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada
Xavier Fernando
Department of Electrical and Computer Engineering, Oakland University, Rochester, MI, USA
Subramaniam Ganesan


About this paper

Cite this paper

Patel, F.N., Shah, H.B., Shah, S. (2022). A Technique to Find Out Low Frequency Rare Words in Medical Cancer Text Document Classification. In: Verma, P., Charan, C., Fernando, X., Ganesan, S. (eds) Advances in Data Computing, Communication and Security. Lecture Notes on Data Engineering and Communications Technologies, vol 106. Springer, Singapore. https://doi.org/10.1007/978-981-16-8403-6_11


DOI: https://doi.org/10.1007/978-981-16-8403-6_11
Published: 29 March 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-8402-9
Online ISBN: 978-981-16-8403-6
eBook Packages: Intelligent Technologies and Robotics (R0)
